Remove newline if pattern matches

I have a CSV file like this:

1st,2nd,3rd,4th,5th,6th,7th
"first-line
",2,3,4,5,6,7
"second-line
",2,3,4,5,6,7
"third-line
",2,3,4,5,6,7
"normal-line",2,3,4,5,6,7
"forth-line
",2,3,4,5,6,7
"fifth-line
",2,3,4,5,6,7

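It looks like the lines were broken by a newline inserted just before the closing quote of the first column. I want to remove that newline.

I used the solution from this answer, but it gets confused when the text also contains correct lines (such as the header and the "normal-line" row).

Is there a way to do this that also works when some lines are not broken?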

Answer 1

New answer (October 2022), using Miller to strip trailing whitespace from the first column:

$ mlr --csv put '$["1st"] = rstrip($["1st"])' file
1st,2nd,3rd,4th,5th,6th,7th
first-line,2,3,4,5,6,7
second-line,2,3,4,5,6,7
third-line,2,3,4,5,6,7
normal-line,2,3,4,5,6,7
forth-line,2,3,4,5,6,7
fifth-line,2,3,4,5,6,7

Preserving the original quoting:

$ mlr --csv --quote-original put '$["1st"] = rstrip($["1st"])' file
1st,2nd,3rd,4th,5th,6th,7th
"first-line",2,3,4,5,6,7
"second-line",2,3,4,5,6,7
"third-line",2,3,4,5,6,7
"normal-line",2,3,4,5,6,7
"forth-line",2,3,4,5,6,7
"fifth-line",2,3,4,5,6,7

Note that we refer to the column by name rather than by position.


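Old answer (June 2022):

Assuming that none of your data contains the character @ (if it does, change it to some other unused character) and that you want to remove all embedded newlines: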

$ csvformat -M @ file.csv | tr -d '\n' | tr '@' '\n'
1st,2nd,3rd,4th,5th,6th,7th
first-line,2,3,4,5,6,7
second-line,2,3,4,5,6,7
third-line,2,3,4,5,6,7
normal-line,2,3,4,5,6,7
forth-line,2,3,4,5,6,7
fifth-line,2,3,4,5,6,7

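This uses csvformat from csvkit to reformat the CSV file into a data stream that uses @ as the record terminator in place of newline. Any newlines still present in the data after this conversion are embedded in fields, and the subsequent tr command deletes them.

The second call to tr then changes the temporary record terminator back to a newline.

Fields that need quoting will still be quoted in the output.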
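
To see what the two tr calls do on their own, here is a small sketch that skips csvformat and feeds in a hand-written stream that already uses @ as the record terminator. The sample data and the helper name join_records are made up for illustration; the stream is only a stand-in for csvformat's output, not a capture of it.

```shell
# Hypothetical helper: delete every real newline, then turn the
# temporary terminator (@) back into a newline.
join_records() {
  tr -d '\n' | tr '@' '\n'
}

# Hand-written stand-in for a stream with @ as record terminator;
# the second record still carries an embedded newline.
printf '1st,2nd@first-line\n,2@normal-line,2@' | join_records
```

This prints three lines: 1st,2nd then first-line,2 then normal-line,2. The order of the two tr calls matters: the embedded newlines have to be gone before @ is turned back into a newline, otherwise the two kinds of newline can no longer be told apart.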

Answer 2

This assumes that the quoted text does not contain 6 commas.

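awk -F, '
  # Remember how many fields the header has
  NR == 1 {num_fields = NF}
  # A short line is the first half of a broken record: join it with the next line
  NF < num_fields {first = $0; getline; $0 = first $0}
  {print}
' file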

Shorter still, with perl: slurp the whole file, then remove every newline that comes right before a quote followed by a comma.

perl -0777 -pe 's/\n(?=",)//g' file

Answer 3

Try this awk approach:

awk '{while (gsub("\"","&")%2) {getline T; $0 = $0 T}}  1' file
1st,2nd,3rd,4th,5th,6th,7th
"first-line",2,3,4,5,6,7
"second-line",2,3,4,5,6,7
"third-line",2,3,4,5,6,7
"normal-line",2,3,4,5,6,7
"forth-line",2,3,4,5,6,7
"fifth-line",2,3,4,5,6,7

It keeps appending the next line until the number of double-quote characters is even.
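
The same quote-parity idea, spelled out as a commented sketch. The function name join_quoted and the sample input are hypothetical, and the end-of-input guard is an addition that the one-liner above does not have: it stops the loop if the file ends while a quote is still open.

```shell
# Hypothetical helper wrapping the quote-parity idea.
join_quoted() {
  awk '{
    # gsub("\"", "&") replaces each quote with itself and returns the count
    while (gsub("\"", "&") % 2) {
      # stop if there is no further line to append (guard is an addition)
      if ((getline rest) <= 0) break
      $0 = $0 rest
    }
    print
  }'
}

printf '%s\n' '1st,2nd' '"first-line' '",2' '"normal-line",2' | join_quoted
```

Lines with an even number of quotes (including zero, like the header) never enter the loop, so correct lines pass through untouched.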

Answer 4

With Miller this is really simple:

mlr --csv clean-whitespace input.csv

which gives

1st,2nd,3rd,4th,5th,6th,7th
first-line,2,3,4,5,6,7
second-line,2,3,4,5,6,7
third-line,2,3,4,5,6,7
normal-line,2,3,4,5,6,7
forth-line,2,3,4,5,6,7
fifth-line,2,3,4,5,6,7
