Filter a .CSV file on the value of its 5th column and print the matching records to a new file

Question 1

awk -F '","'  'BEGIN {OFS=","} { if (toupper($5) == "STRING 1")  print }' file1.csv > file2.csv

Output:

"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""

I think this is what you want.
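As a rough sketch of how the one-liner above could be reused, the snippet below wraps it in a shell function so that the value to match and the input file become parameters. The function name `filter_col5` and the sample rows are made up for illustration. It assumes, like the original command, that every field is double-quoted, since that is the only case in which `","` separates fields reliably.

```shell
# Hypothetical helper: print records whose 5th field equals $1 (case-insensitive).
# Assumption: every field is double-quoted, so '","' reliably separates fields.
filter_col5() {
    awk -F '","' -v want="$1" 'toupper($5) == toupper(want)' "$2"
}

# Made-up sample data, written to a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '"1","x","a , b","c","string 1","USD",""' \
    '"2","y","d","e","string 2","USD",""' \
    '"3","z","f","g","String 1","EUR",""' > "$tmpdir/file1.csv"

# Write the matching records to a new file and show them.
filter_col5 'string 1' "$tmpdir/file1.csv" > "$tmpdir/file2.csv"
cat "$tmpdir/file2.csv"
```

Records are printed unchanged, so the quoting of the input is preserved in the output file. If the 5th column were the first or last column of a record, the field would still carry a stray leading or trailing quote and the comparison would need adjusting.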

Question 2

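The problem with CSV is that there is no standard. If you need to process CSV-formatted data often, you may want a more robust approach than simply using "," as the field separator. In this case, Perl's Text::CSV family of CPAN modules is well suited to the job (the example uses Text::CSV_XS):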

$ perl -mText::CSV_XS -WlanE '
    BEGIN {our $csv = Text::CSV_XS->new;} 
    $csv->parse($_); 
    my @fields = $csv->fields(); 
    print if $fields[4] =~ /string 1/i;
' file1.csv
"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""

Question 3

csvgrep from csvkit

For awk, the most robust approach is to use FPAT, as described at: https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk/45420607#45420607 Unfortunately, even FPAT cannot handle literal newlines inside quotes.
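For reference, a minimal FPAT sketch for this question's filter might look like the following. It is not the code from the linked answer: the helper name `filter_col5_fpat` and the sample rows are invented here, and the pattern is the simple one from the gawk manual, which does not cope with escaped `""` quotes or with newlines inside a field. FPAT is a GNU awk extension, so the snippet only runs where `gawk` is installed.

```shell
# Sketch only: requires GNU awk (gawk), because FPAT is a gawk extension.
# filter_col5_fpat is a hypothetical helper name.
filter_col5_fpat() {
    gawk -v want="$1" '
        # A field is either a run of non-commas or a double-quoted string.
        BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
        {
            field = $5
            # FPAT keeps the surrounding quotes, so strip them before comparing.
            if (field ~ /^".*"$/) field = substr(field, 2, length(field) - 2)
            if (toupper(field) == toupper(want)) print
        }
    ' "$2"
}

# Made-up sample data mixing quoted and unquoted fields.
fpat_dir=$(mktemp -d)
printf '%s\n' \
    '"1","x","a , b","c","string 1","USD"' \
    '2,y,"d, e",f,string 2,USD' \
    '3,z,"p, q",r,String 1,EUR' > "$fpat_dir/mixed.csv"

# Only run the demo when gawk is actually available.
if command -v gawk >/dev/null 2>&1; then
    filter_col5_fpat 'string 1' "$fpat_dir/mixed.csv"
fi
```

Unlike splitting on `","`, this variant does not require every field to be quoted, which is why the comparison strips the quotes only when they are present.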

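Instead, if you want to stay sane, there are several CSV CLI tools you can use. One that is very easy to install via pip (though not necessarily the fastest, since it is Python-based) is csvgrep from csvkit: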

pip install csvkit

Then we can get the rows whose fifth column matches with:

csvgrep -H -c5 -r '^string 1$' mytest.csv

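Explanation of the options:

-H: the first line is not a header row
-i: invert the match (not used in the command above)
-c5: operate on the fifth column
-r: match the following regular expression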

Concrete example:

printf '00,01,02,03,string 1,"04,\n""05"\n10,11,12,13,string 2,"14,\n""15"\n' > nohead.csv
printf 'col1,col2,col3,col4,col5,col6\n00,01,02,03,string 1,"04,\n""05"\n10,11,12,13,string 2,"14,\n""15"\n' > head.csv

Then:

csvgrep -H -c5 -r '^string 1$' nohead.csv | tail -n+2

Output:

00,01,02,03,string 1,"04,
""05"

We pipe into tail because -H makes csvgrep add an annoying dummy header:

a,b,c,d,e,f
00,01,02,03,string 1,"04,
""05"

With -i we can invert the match:

csvgrep -H -i -c5 -r '^string 1$' nohead.csv | tail -n+2

Output:

10,11,12,13,string 2,"14,
""15"

When we have a header, we can use column names:

csvgrep -c col5 -r '^string 1$' head.csv

Output:

col1,col2,col3,col4,col5,col6
00,01,02,03,string 1,"04,
""05"

Tested on csvkit 1.0.7, Ubuntu 23.04.
