I have a .csv file (file.csv) in which every value is enclosed in double quotes. A sample of the file:
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"
The 9th field is a date in the format "DD-MMM-YY". I have to convert it to the format YYYY/MM/DD. I tried the code below, but it did not work.
awk -F, '
BEGIN {
split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
for (i=1; i<=12; i++) mdigit[month[i]]=i
}
{ m=substr($9,4,3)
$9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
print
}' OFS="," file.csv > temp_file.csv
After running the code above, temp_file.csv contains the following:
column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"
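As far as I understand, the problem is the commas inside the double quotes, because my code treats them as field separators too. Please advise on the following:
1) Does double-quoting every value in every field make any difference? If it does, how can I remove the quotes from all values except the strings that contain commas?
2) What should I change in my code so that it converts the 9th field from "DD-MMM-YY" to YYYY/MM/DD?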
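Answer 1
You are splitting on commas, but some of your strings contain commas, so $9 is not the date you think it is. Inserting a print of m after that line shows this: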
m=substr($9,4,3)
print "MY M: " m
Example
MY M: lum
column1,column2,column3,column4,column5,column6, column7, Column8,00/00/2009, Column10
MY M: me"
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
MY M: tho
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
MY M: me"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"
I think you need to rethink your approach, or else escape any commas contained in the strings.
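To see why field 9 is not the date, it can help to compare a naive comma split with a real CSV parse. The following is a minimal sketch in Python (not part of the original answer; the variable names are illustrative), using the first data row from the question:

```python
import csv

# First data row from the question.
line = ('"12","B000QRIGJ4","4432",'
        '"string with quotes, and with a comma, and colon: in between",'
        '"4432","author1, name","890","88","11-OCT-11","12"')

# Naive split on every comma, which is what awk -F, does.
naive = line.split(",")

# A CSV parser honours the quotes and keeps embedded commas inside a field.
parsed = next(csv.reader([line]))

print(len(naive), repr(naive[8]))    # 13 fields; the 9th is ' name"'
print(len(parsed), repr(parsed[8]))  # 10 fields; the 9th is '11-OCT-11'
```

Characters 4 to 6 of ' name"' are me", which is exactly what the debug print shows for that row.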
The fix
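awk has the odd but useful ability to split on groups of characters. One approach is to split on "," (quote, comma, quote) rather than on a bare comma.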
Example (improvement #1)
$ awk -F'","' '
BEGIN {
split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
for (i=1; i<=12; i++) mdigit[month[i]]=i
}
{
if(NR==1){print}
else{ m=substr($9,4,3); print "MY M: " m;
$9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
print
} }' OFS="," file.csv
Output
MY M:
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10,,,,,,,,00/00/2000
MY M: OCT
"12,B000QRIGJ4,4432,string with quotes, and with a comma, and colon: in between,4432,author1, name,890,88,10/11/2011,12"
MY M: OCT
"4432,B000QRIGJ4,890,another, string with quotes, and with more than, two commas: in between,455,author2, name,12,455,10/12/2011,55"
MY M: OCT
"11,B000QRIGJ4,77,string with, commas and (paranthesis) and : colans, in between,12,author3, name,333,22,10/13/2011,232"
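Even this is not quite right. It needs some extra cleanup to put the quotes back between fields and then to remove the doubled quotes at the start and end of each line.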
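The cleanup described above can be sketched in Python as follows. This is only an illustration of the split-on-"," idea, not the answer's awk code; the function name convert_line is made up here, and the sketch assumes every field is quoted and no field contains the sequence "," itself.

```python
MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
MDIGIT = {name: i for i, name in enumerate(MONTHS, start=1)}

def convert_line(line):
    """Split on '","', rewrite field 9, and restore the quoting."""
    # Drop the outermost quotes first so no field keeps a stray one.
    fields = line.strip()[1:-1].split('","')
    day, mon, yy = fields[8].split("-")
    # MM/DD/20YY, the same order as the sprintf in the awk code.
    fields[8] = "%02d/%02d/20%02d" % (MDIGIT[mon], int(day), int(yy))
    # Re-insert the separator between fields and the outer quotes.
    return '"' + '","'.join(fields) + '"'

row = ('"11","B000QRIGJ4","77",'
       '"string with, commas and (paranthesis) and : colans, in between",'
       '"12","author3, name","333","22","13-OCT-11","232"')
print(convert_line(row))
```

Stripping the outer quotes before splitting and adding them back after joining avoids the doubled quotes at the line ends.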
Example (improvement #2)
$ awk -F'","' '
BEGIN {
split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
for (i=1; i<=12; i++) mdigit[month[i]]=i
}
{ m=substr($9,4,3); print "MY M: " m;
$9 = sprintf("\"%02d/%02d/20%02d\"",mdigit[m],substr($9,1,2),substr($9,8,20))
for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
/\"\"/ }' OFS="," file.csv
Output
MY M:
"column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10","","","","","","","",""00/00/2000"","",
MY M: OCT
""12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88",""10/11/2011"","12"",
MY M: OCT
""4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455",""10/12/2011"","55"",
MY M: OCT
""11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22",""10/13/2011"","232"",
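I am not going to take this approach any further. I hope you can see that it is not a good way to solve the problem: it is hard to maintain and very brittle if any of the input changes over time.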
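As a small illustration of that brittleness, here is a Python sketch with a hypothetical row (not taken from the question) in which one field is left unquoted. Splitting on "," no longer finds any separator, while a CSV parser still sees three fields:

```python
import csv

# Hypothetical row in which the middle field is unquoted.
line = '"12",4432,"author1, name"'

# Splitting on the three-character separator '","' finds nothing to split on.
by_separator = line.strip('"').split('","')

# A CSV parser copes with a mix of quoted and unquoted fields.
by_parser = next(csv.reader([line]))

print(len(by_separator), by_separator)
print(len(by_parser), by_parser)
```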
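Example (improvement #3)
OK, I couldn't just leave it at that, so here is a working example. It keeps the MM/DD/YYYY order used by the sprintf in the question.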
awk -F'","' '
BEGIN {
split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
for (i=1; i<=12; i++) mdigit[month[i]]=i
}
{ if (NR==1){print; next} }
{ m=substr($9,4,3)
$9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
}' OFS="," file.csv | sed -e 's/""/"/g' -e 's/,$//'
Output
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","10/11/2011","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","10/12/2011","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","10/13/2011","232"
Answer 2
Use a tool that has a proper CSV parser. For example, with Ruby:
ruby -rcsv -pe '
if $. > 1
row = CSV.parse_line($_)
row[8] = Date.parse(row[8]).strftime("%Y/%m/%d")
$_ = row.to_csv(:force_quotes=>true)
end
' file.csv
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"
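The same parser-based idea can be sketched with Python's standard csv module. This is an assumed equivalent rather than a translation of the Ruby one-liner: the function name convert is illustrative, the header line is passed through untouched, and %b relies on English month abbreviations (the C or an English locale).

```python
import csv
import io
from datetime import datetime

def convert(text):
    """Rewrite column 9 of quoted CSV text from DD-MMM-YY to YYYY/MM/DD."""
    header, _, body = text.partition("\n")
    out = io.StringIO()
    out.write(header + "\n")  # header is copied verbatim
    # QUOTE_ALL plays the role of force_quotes in the Ruby version.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(body)):
        if not row:  # skip blank lines
            continue
        parsed = datetime.strptime(row[8], "%d-%b-%y")
        row[8] = parsed.strftime("%Y/%m/%d")
        writer.writerow(row)
    return out.getvalue()

sample = (
    "column1,column2,column3,column4,column5,column6,"
    " column7, Column8, Column9, Column10\n"
    '"12","B000QRIGJ4","4432",'
    '"string with quotes, and with a comma, and colon: in between",'
    '"4432","author1, name","890","88","11-OCT-11","12"\n'
)
print(convert(sample), end="")
```

Because the parser handles the quoting, embedded commas need no special treatment.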
Answer 3
The simple way
Change every occurrence of DD-MMM-YY to YYYY/MM/DD, wherever it is found:
$ perl -pe 'BEGIN{ @month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
# Perl arrays are zero-based, so month number = index + 1
for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
}
s#(\d{1,2})-(\w{3})-(\d{2,4})#sprintf("20%s/%02d/%02d",$3,$mdigit{$2},$1)#e;' file.csv
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"
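The precise way
Change the format only in the 9th field. Using perl's -a flag to split each line into fields (like awk, except that the fields are $F[0], $F[1] ... $F[N-1]), combined with -F to set the field separator to ",", you can do the following: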
perl -F'\",\"' -lane 'BEGIN{
@month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
# Perl arrays are zero-based, so month number = index + 1
for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
}
$F[8]=~s#(\d{1,2})-(\w{3})-(\d{2,4})#sprintf("20%s/%02d/%02d",$3,$mdigit{$2},$1)#e if $.>1;
print join("\",\"",@F)' file.csv
This prints YYYY/MM/DD and assumes (as you did in the question) that all years begin with 20.
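For comparison, the substitute-everywhere idea can be sketched in Python with re.sub and a replacement function. This is an illustration rather than the answer's code: the names fix_dates and DATE_RE are made up, the pattern is restricted to two-digit years, and, like the answer, it assumes every year belongs to the 2000s.

```python
import re

MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
# enumerate(..., start=1) gives JAN -> 1 ... DEC -> 12.
MDIGIT = {name: i for i, name in enumerate(MONTHS, start=1)}

# DD-MMM-YY with an upper-case month abbreviation.
DATE_RE = re.compile(r"\b(\d{1,2})-([A-Z]{3})-(\d{2})\b")

def fix_dates(line):
    """Rewrite every DD-MMM-YY in the line as 20YY/MM/DD."""
    def repl(match):
        day, mon, yy = match.groups()
        if mon not in MDIGIT:
            return match.group(0)  # not a month name, leave it alone
        return "20%s/%02d/%02d" % (yy, MDIGIT[mon], int(day))
    return DATE_RE.sub(repl, line)

print(fix_dates('"88","11-OCT-11","12"'))  # "88","2011/10/11","12"
```

Unknown three-letter words are left untouched rather than turned into month 00.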
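Answer 4
Use Miller (mlr) to clean up whitespace (some of the headers appear to contain spaces), and then convert the date in the Column9 field to the correct format. The date conversion is done by first converting the given date to Unix time with strptime() and then immediately reformatting it to the desired format with strftime().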
mlr --csv \
clean-whitespace then \
put '$Column9 = strftime(strptime($Column9, "%d-%b-%y"), "%Y/%m/%d")' file
Given the data in the question, this results in:
column1,column2,column3,column4,column5,column6,column7,Column8,Column9,Column10
12,B000QRIGJ4,4432,"string with quotes, and with a comma, and colon: in between",4432,"author1, name",890,88,2011/10/11,12
4432,B000QRIGJ4,890,"another, string with quotes, and with more than, two commas: in between",455,"author2, name",12,455,2011/10/12,55
11,B000QRIGJ4,77,"string with, commas and (paranthesis) and : colans, in between",12,"author3, name",333,22,2011/10/13,232
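If you want every field quoted, add the --quote-all option right after --csv on the command line. By default, Miller quotes only the fields that actually need quoting.
Pretty-printed: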
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| column1 | column2 | column3 | column4 | column5 | column6 | column7 | Column8 | Column9 | Column10 |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| 12 | B000QRIGJ4 | 4432 | string with quotes, and with a comma, and colon: in between | 4432 | author1, name | 890 | 88 | 2011/10/11 | 12 |
| 4432 | B000QRIGJ4 | 890 | another, string with quotes, and with more than, two commas: in between | 455 | author2, name | 12 | 455 | 2011/10/12 | 55 |
| 11 | B000QRIGJ4 | 77 | string with, commas and (paranthesis) and : colans, in between | 12 | author3, name | 333 | 22 | 2011/10/13 | 232 |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
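On the first part of the question (whether quoting every value makes a difference): to a CSV parser it does not, as long as the fields that contain commas stay quoted. The sketch below uses Python's csv module, not Miller, to show the two quoting policies side by side; the helper name to_csv is made up for the example.

```python
import csv
import io

def to_csv(row, quoting):
    """Format one row as a CSV line using the given quoting policy."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")

row = ["12", "author1, name", "2011/10/11"]

minimal = to_csv(row, csv.QUOTE_MINIMAL)  # quote only where required
quoted = to_csv(row, csv.QUOTE_ALL)       # quote every field

print(minimal)  # 12,"author1, name",2011/10/11
print(quoted)   # "12","author1, name","2011/10/11"

# Both lines parse back to the same three values.
assert next(csv.reader([minimal])) == next(csv.reader([quoted])) == row
```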