Fixing a malformed CSV with incorrect newline characters using sed or perl only

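I have a comma-separated CSV file, but for some reason our system inserts newline characters at random positions in it, which corrupts the whole file. I can find out the number of columns in the file.

How can I fix it with sed and/or perl in a one-line command? I know it can be solved with awk, but this is for learning purposes. If perl is used, I don't want to use its built-in CSV functions. Can it be solved? I have been working on this for days and can't seem to find a solution :(

Example of malformed input (lots of randomly inserted \n)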

policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residen
tial Lot”,3
206893,FL,CLAY COUNTY,-81.7
00455,“Residen
tial Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,
3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,
“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
4335
12,FL,CLAY COUNTY,-81.704613,
“Residential Lot”,1

Desired output

policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residential Lot”,3
206893,FL,CLAY COUNTY,-81.700455,“Residential Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
433512,FL,CLAY COUNTY,-81.704613,“Residential Lot”,1

Answer 1

$ awk -F, '{ while (NF < 6 || $NF == "") { brokenline=$0; getline; $0 = brokenline $0}; print }' file.csv
policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residential Lot”,3
206893,FL,CLAY COUNTY,-81.700455,“Residential Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
433512,FL,CLAY COUNTY,-81.704613,“Residential Lot”,1

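The awk code appends the next line of input to the current line for as long as the current line has fewer than six fields, or its last field is empty (meaning the line was broken right after the last field separator).

Perl, working in a similar way: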

perl -ne 'chomp;while (tr/,/,/ < 5 || /,$/) { $_ .= readline; chomp } print "$_\n"' file.csv

Answer 2

As Kusalananda said, every line has 6 fields, so you can try this GNU sed command.

sed -E ':A;h;s/^/,/;s/((,[^,]+){6})(.*)/\3/;/./{g;N;s/\n//;bA};g' infile
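
For readers new to sed, here is the same script spread over several lines with a comment per command, as a sketch of how it appears to work. It assumes GNU sed; the sample rows and the variable names `sample` and `joined` are made up for illustration. Note that `[^,]+` requires every field to be non-empty, so a record with a legitimately empty field would never count as complete under this rule.

```shell
# Small made-up sample with lines broken at different places.
sample=$(mktemp)
printf '%s\n' \
  'id,state,county,lon,kind,gran' \
  '448094,FL,CLAY COUNTY,-81.707664,"Residen' \
  'tial Lot",3' \
  '333743,FL,CLAY COUNTY,-81.707703,"Residential Lot",' \
  '3' \
  '4335' \
  '12,FL,CLAY COUNTY,-81.704613,' \
  '"Residential Lot",1' > "$sample"

joined=$(sed -E '
  :A
  # Save the current (possibly partial) record in the hold space.
  h
  # Prepend a comma so every field, including the first, looks like ",value".
  s/^/,/
  # If six non-empty fields are present, keep only what follows them
  # (nothing at all for a complete record).
  s/((,[^,]+){6})(.*)/\3/
  # Anything left over means the record is incomplete: restore it,
  # append the next input line, drop the newline, and try again.
  /./{
    g
    N
    s/\n//
    bA
  }
  # Complete record: restore the saved copy so that it gets printed.
  g
' "$sample")

printf '%s\n' "$joined"
rm -f "$sample"
```

The work happens on a scratch copy in the pattern space, while the hold space keeps the real record, which is why each branch ends by restoring it with `g`.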
