I have a .CSV file with the following format:
"column 1","column 2","column 3","column 4","column 5","column 6","column 7","column 8","column 9","column 10
"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23455","12312255564","string, with, multiple, commas","string with or, without commas","string 2","USD","433","70%","07/15/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""
"46476","15467534544","lengthy string, with commas, multiple: colans","string with or, without commas","string 2","CAND","388","70%","09/21/2013",""
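The fifth column of the file contains different strings. I need to filter the file based on the value in the fifth column. That is, I need a new file, derived from the current one, that contains only the records whose fifth field has the value "string 1".
For this I tried the following command: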
awk -F"," ' { if toupper($5) == "STRING 1") PRINT }' file1.csv > file2.csv
But it threw an error, as follows:
awk: { if toupper($5) == "STRING 1") PRINT }
awk: ^ syntax error
awk: { if toupper($5) == "STRING 1") PRINT }
awk: ^ syntax error
Then I used the following command, which gave me a weird output.
awk -F"," '$5="string 1" {print}' file1.csv > file2.csv
Output:
"column 1" "column 2" "column 3" "column 4" string 1 "column 6" "column 7" "column 8" "column 9" "column 10
"12310" "42324564756" "a simple string with a comma" string 1 without commas" "string 1" "USD" "12" "70%" "08/01/2013" ""
"23455" "12312255564" "string with string 1 commas" "string with or without commas" "string 2" "USD" "433" "70%" "07/15/2013" ""
"23525" "74535243123" "string with commas string 1 "string with or without commas" "string 1" "CAND" "744" "70%" "05/06/2013" ""
"46476" "15467534544" "lengthy string with commas string 1 "string with or without commas" "string 2" "CAND" "388" "70%" "09/21/2013" ""
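PS: I used toupper to be on the safe side, as I am not sure whether the string is in lower or upper case. I need to know what is wrong with my code, and whether the space in the string matters when searching for a pattern with AWK.
Answer 1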
awk -F '","' 'BEGIN {OFS=","} { if (toupper($5) == "STRING 1") print }' file1.csv > file2.csv
Output
"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""
I think this is what you want.
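As a minimal sketch of why this works where the attempts in the question did not: a single `=` assigns to $5 (the condition is then always true, and awk rebuilds each record using the default space as output separator), while `==` compares. Splitting on the three-character sequence `","` instead of a bare comma keeps commas inside the quoted text from being treated as separators. The file name demo.csv and its rows are made up for illustration.

```shell
# Build a small sample in the question's format (demo.csv is a hypothetical name).
cat > demo.csv <<'EOF'
"c1","c2","c3","c4","c5","c6"
"1","x","a, b","p, q","string 1","USD"
"2","y","c, d","r, s","String 2","CAD"
"3","z","e","t","STRING 1","USD"
EOF

# Use the sequence "," as the separator so embedded commas stay inside their field.
# == compares; a single = would assign to $5 and select every line.
awk -F '","' 'toupper($5) == "STRING 1"' demo.csv
```

The space in "string 1" is part of the value being compared, so it does matter: "STRING1" would not match. Also note that with this separator the first field keeps its leading quote and the last field keeps its trailing quote, so a comparison against the first or last column would have to account for that.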
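Answer 2
The problem with CSV is that there is no standard. If you need to process CSV-formatted data often, you may want to look for a more robust method than simply using "," as the field separator. In this case, Perl's Text::CSV CPAN modules are a great fit for the job: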
$ perl -mText::CSV_XS -WlanE '
BEGIN {our $csv = Text::CSV_XS->new;}
$csv->parse($_);
my @fields = $csv->fields();
print if $fields[4] =~ /string 1/i;
' file1.csv
"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""
Answer 3
csvgrep from csvkit
With awk, the most robust approach is to use FPAT, as described here: https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk/45420607#45420607
Unfortunately, even FPAT cannot handle literal newlines inside quotes.
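For reference, a minimal sketch of the FPAT idea, assuming GNU awk (gawk) is installed, since FPAT is a gawk extension and is not available in mawk or other awks. FPAT describes what a field looks like rather than what separates fields. The file names fpat_demo.csv and fpat_out.csv and the sample rows are hypothetical, and the pattern is a commonly used one rather than a quote from the linked answer.

```shell
# FPAT is a GNU awk extension, so skip when gawk is not available.
if command -v gawk >/dev/null 2>&1; then
  # fpat_demo.csv is a hypothetical sample in the question's format.
  cat > fpat_demo.csv <<'EOF'
"c1","c2","c3","c4","c5","c6"
"1","x","a, b","p, q","string 1",""
"2","y","c, d","r, s","string 2",""
EOF

  # A field is either a double-quoted string or a run of non-comma characters.
  # Fields keep their surrounding quotes, so strip them from a copy before comparing.
  gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" }
        { f = $5; gsub(/^"|"$/, "", f) }
        toupper(f) == "STRING 1"' fpat_demo.csv > fpat_out.csv
  cat fpat_out.csv
else
  echo "gawk not found; FPAT needs GNU awk" >&2
fi
```

Because awk still reads the input one line at a time, a quoted field that contains a newline is split across two records, which is the limitation mentioned above.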
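Instead, if you want to stay sane, there are several CSV CLI tools. One that is very easy to install via pip (though not necessarily the fastest, since it is Python-based) is csvgrep from csvkit: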
pip install csvkit
Then we can get the rows whose fifth column matches with:
csvgrep -H -c5 -r '^string 1$' mytest.csv
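Explanation of the options:
-H: the first line is not a header row
-i: invert the match (used in the example further down)
-c5: operate on the fifth column
-r: match the following regular expression
Concrete example: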
printf '00,01,02,03,string 1,"04,\n""05"\n10,11,12,13,string 2,"14,\n""15"\n' > nohead.csv
printf 'col1,col2,col3,col4,col5,col6\n00,01,02,03,string 1,"04,\n""05"\n10,11,12,13,string 2,"14,\n""15"\n' > head.csv
Then:
csvgrep -H -c5 -r '^string 1$' nohead.csv | tail -n+2
Output:
00,01,02,03,string 1,"04,
""05"
We pipe into tail because -H adds an annoying dummy header:
a,b,c,d,e,f
00,01,02,03,string 1,"04,
""05"
With -i we can invert the match:
csvgrep -H -i -c5 -r '^string 1$' nohead.csv | tail -n+2
Output:
10,11,12,13,string 2,"14,
""15"
When we have a header, we can use the column name:
csvgrep -c col5 -r '^string 1$' head.csv
Output:
col1,col2,col3,col4,col5,col6
00,01,02,03,string 1,"04,
""05"
Tested on csvkit 1.0.7, Ubuntu 23.04.
Answer 4
awk 'BEGIN { FS = "\",\"" } { if (toupper($5) == "STRING 1") print }' file1.csv > file2.csv