Keeping every step in a single execution block when processing CSV files

I have asked a few questions about how to process these CSV files, and the answers left me with a long chain of separate steps.

Every month I receive several TXT files in a Google Drive directory. I need to process and merge this information and load it into a Postgres database.

The files arrive with the following layout and structure:

#A1401099999999              022021I                                   
00999999999   000000000099999999+000000000000000000-000000000000000000-   
00999999999   000000000099999999-000000000000000000-000000000000000000-  
00999999999   000000000099999999-000000000000000000-000000000000000000-
@036

I use AWK to split the file header into 3 columns:

$ cat tst.awk
BEGIN { OFS="," }
NR==1 {
    pfx = substr($0,8,7) OFS substr($0,30,6) OFS substr($0,36,1)
    next
}
{
    gsub(/[+-]/,"& ")
    $1 = pfx OFS $1
    print
}

Result:

9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
@036
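For reference, the AWK logic above can be sketched in Python as well (a sketch only; the slice offsets are just the 1-based AWK `substr` positions converted to 0-based indexing):

```python
def split_file(lines):
    """Mimic the AWK script above: take three fixed-width slices from the
    header line as a prefix, then prepend that prefix to every data line,
    splitting the packed numbers on their trailing +/- signs."""
    header = lines[0]
    # AWK substr($0,8,7) / substr($0,30,6) / substr($0,36,1) are 1-based
    pfx = ",".join([header[7:14], header[29:35], header[35:36]])
    out = []
    for line in lines[1:]:
        # gsub(/[+-]/, "& "): put a space after each sign so split() works
        fields = line.replace("+", "+ ").replace("-", "- ").split()
        out.append(pfx + "," + ",".join(fields))
    return out
```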

In the next step, I concatenate all the files into a single file, which gives the following pattern:

cat *.txt > final.csv


9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
@036
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-    
@042

Then I strip the trailer lines (the ones starting with @):

sed -i '/@/d' final.csv

9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,012021,I,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,012021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999+,000000000000000000-,000000000000000000-
9999999,022021,I,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000-
9999999,022021,S,0099999999,000000000099999999-,000000000000000000-,000000000000000000- 

Next I reformat field 2 as a date: it arrives as "MMYYYY" and I convert it to "yyyy-mm-01", using 01 as the default value for dd.

awk 'BEGIN { FS=OFS="," } { $2 = substr($2,3,4) "-" substr($2,1,2) "-01"; print }' final.csv

Result:

10013534,2021-01-01,I,0090922002,000000000009102629+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091000002,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091100005,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0091110002,000000000063288833+,000000000000000000-,000000000000000000-,
10013534,2021-01-01,I,0099999995,000000008017897139+,000000000000000000-,000000000000000000-,
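The date conversion on its own can be sketched in Python like this (field layout assumed from the output above):

```python
def mm_yyyy_to_iso(field):
    # "012021" (MMYYYY) -> "2021-01-01"; 01 is the default day
    mm, yyyy = field[:2], field[2:6]
    return f"{yyyy}-{mm}-01"
```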

Now, so that I can use numeric functions on column 5, I need to move the "-" from the end of the field to the beginning when the value is negative:

awk 'BEGIN { FS=OFS="," } { sign = substr($5, length($5), 1); $5 = substr($5, 1, length($5)-1); if (sign == "-") $5 = "-" $5; print }' ./mycsv

    10013534,2021-01-01,I,0090922002,000000000009102629,000000000000000000-,000000000000000000-,
    10013534,2021-01-01,I,0091000002,-000000000063288833,000000000000000000-,000000000000000000-,
    10013534,2021-01-01,I,0091100005,-000000000063288833,000000000000000000-,000000000000000000-,
    10013534,2021-01-01,I,0091110002,000000000063288833,000000000000000000-,000000000000000000-,
    10013534,2021-01-01,I,0099999995,-000000008017897139,000000000000000000-,000000000000000000-,
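The sign move on its own is a small function in Python (a sketch; it mirrors the AWK step above, where a trailing "+" is simply dropped):

```python
def move_sign(field):
    """Move a trailing '-' to the front of the field; drop a trailing '+'."""
    if field.endswith("-"):
        return "-" + field[:-1]
    if field.endswith("+"):
        return field[:-1]
    return field
```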

After all this processing, I want to load the data into Postgres:

import csv

import pandas
from io import StringIO
from sqlalchemy import create_engine

def psql_insert_copy(table, conn, keys, data_iter):
    # Write each batch into an in-memory CSV buffer and stream it with COPY
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name
        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

engine = create_engine('postgresql://xxxxx:xxx@xxxx:xxx/xxxx')

# final.csv has no header row, so pass header=None; index=False keeps the
# pandas index from being inserted as an extra column
df = pandas.read_csv("final.csv", header=None)
df.to_sql('xxxxxxx', engine, schema='xxxxxxx', method=psql_insert_copy, index=False)
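One way to get everything into a single execution block is to do the whole transformation in Python before the load, instead of chaining cat, sed, and two AWK passes. A sketch of that idea, assuming the same fixed-width offsets as the AWK steps above (file handling is illustrative only):

```python
def transform(raw_lines):
    """One pass over one raw TXT file: build a prefix from the header,
    skip @-trailer lines, convert MMYYYY -> yyyy-mm-01, and move a
    trailing '-' on column 5 to the front."""
    rows = []
    prefix = None
    for line in raw_lines:
        line = line.rstrip()
        if prefix is None:                       # first line is the header
            prefix = [line[7:14], line[29:35], line[35:36]]
            continue
        if line.startswith("@"):                 # trailer line
            continue
        # space out the +/- signs, then split into the four packed fields
        fields = line.replace("+", "+ ").replace("-", "- ").split()
        mm, yyyy = prefix[1][:2], prefix[1][2:6]
        row = [prefix[0], f"{yyyy}-{mm}-01", prefix[2]] + fields
        v = row[4]                               # 5th column carries the sign
        row[4] = "-" + v[:-1] if v.endswith("-") else v.rstrip("+")
        rows.append(row)
    return rows
```

Each file's rows could then be collected into one list, written through `csv.writer` into a `StringIO`, and handed to `psql_insert_copy`, so the concatenation, the sed pass, and both AWK passes collapse into one block of code.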
