我问了一些关于如何处理 csv 文件的问题,结果是很多过程。
每个月我都会在 Google Drive 目录中收到几个 TXT 文件。我需要处理、合并这些信息并将其加载到 Postgres 库中。
收到的文件具有以下布局和结构:
#A1401099999999 022021I
00999999999 000000000099999999+000000000000000000-000000000000000000-
00999999999 000000000099999999-000000000000000000-000000000000000000-
00999999999 000000000099999999-000000000000000000-000000000000000000-
@036
我使用 AWK 将文件头分成 3 列:
$ cat tst.awk
BEGIN { OFS="," }
NR==1 {
pfx = substr($0,8,7) OFS substr($0,30,6) OFS substr($0,36,1)
next
}
{
gsub(/[+-]/,"& ")
$1 = pfx OFS $1
print
}
结果
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
@036
在下一步中,我将所有文件的并集更改为以下模式:
cat -T *.txt > final.csv
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
@036
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
@042
我拆下底板:
sed -i '/@/d' final.csv
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,I,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
我将字段 2 的格式编辑为日期。是“MM/YYYY”。现在是“yyyy-mm-01”。我将 01 作为 dd 的默认值。
awk -F, '{OFS=",";a=substr($2,1,2);b=substr($2,3,4);$2=b"-"a"-01";print $0}' final.csv
结果:
10013534,2021-01-01,I,0090922002,000000000009102629+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091000002,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091100005,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091110002,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0099999995,000000008017897139+,000000000000000000-,000000000000000000-,
现在为了能够使用第 5 列中的函数,当值为负数时,我需要将位置从“-”更改为列的开头
awk -F "," '{sign=substr($5,length($5),1);$5=substr($5,0,length($5)-1); if(sign =="-"){$5="-"$5}; print}' ./mycsv
10013534,2021-01-01,I,0090922002,000000000009102629,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091000002,-000000000063288833,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091100005,-000000000063288833,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091110002,000000000063288833,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0099999995,-000000008017897139,000000000000000000-,000000000000000000-,
经过所有这些处理后,我想将数据导入 Postgres
import pandas, csv
from io import StringIO
from sqlalchemy import create_engine
def psql_insert_copy(table, conn, keys, data_iter):
dbapi_conn = conn.connection
with dbapi_conn.cursor() as cur:
s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)
columns = ', '.join('"{}"'.format(k) for k in keys)
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name, columns)
else:
table_name = table.name
sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)
engine = create_engine('postgresql://xxxxx:xxx@xxxx:xxx/xxxx')
df = pandas.read_csv("final.csv")
df.to_sql('xxxxxxx', engine, schema='xxxxxxx', method=psql_insert_copy)