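I have a file like the one shown below, in which the fields are separated by commas. I need to change the 5th field ("txt4 "(tst)"") and replace every occurrence of " in that field with chr(34), except for the two outer quotes. That is, the last field should be converted to "txt4 chr(34)(tst)chr(34)". Note that my real data can contain more fields than shown here, so listing specific fields in the solution is impractical.
I need to use awk to produce the output shown below.
Sample csv file: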
"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)""
Desired output:
"this is txt1","this is txt2",3,"this txt3","txt4 chr(34)(tst)chr(34)"
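Answer 1
You don't really say much about where the data comes from or what format it is expected to have. If the exercise can be rephrased as 'replace "( with chr(34)( and )" with )chr(34)', or as 'replace "(tst)" with chr(34)(tst)chr(34)', then the following two sed commands do that: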
$ sed -e 's/"(/chr(34)(/' -e 's/)"/)chr(34)/' file
"this is txt1","this is txt2",3,"this txt3","txt4 chr(34)(tst)chr(34)"
$ sed 's/"\((tst)\)"/chr(34)\1chr(34)/' file
"this is txt1","this is txt2",3,"this txt3","txt4 chr(34)(tst)chr(34)"
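The text cannot be parsed as CSV records, because the last field is not validly formatted. A correctly quoted version of that field would look like "txt4 ""(tst)""".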
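Both sed commands above replace only the first match on each line. As a minimal sketch, assuming a line may contain several parenthesised groups and that no field legitimately ends with ) directly before its closing quote, the g flag makes the first variant apply to every match. The function name replace_paren_quotes is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for this sketch only).
# Same two substitutions as the first sed command, but with the g flag so
# every "( and every )" on a line is rewritten, not just the first one.
# Assumption: no field ends with ")" right before its closing quote,
# otherwise that closing quote would be rewritten as well.
replace_paren_quotes() {
    sed -e 's/"(/chr(34)(/g' -e 's/)"/)chr(34)/g'
}

# Usage: a line with two parenthesised groups in the last field.
printf '%s\n' '"a","x "(one)" y "(two)""' | replace_paren_quotes
# prints: "a","x chr(34)(one)chr(34) y chr(34)(two)chr(34)"
```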
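Answer 2
The observation here is that a valid CSV field quote sits at the start of the line, at the end of the line, or next to a comma. Therefore: search for every quote together with the character on each side of it. If neither of those characters is a comma, double the quote.
This is not absolutely right: a comma can appear inside the quotes of a valid CSV field, next to an escaped quote, for example "one field,""here". But it works for your data.
Test: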
Paul--) ./awkFixCsv
"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)"" <<< Input
"this is txt1","this is txt2",3,"this txt3","txt4 ""(tst)""" <<< Output
"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)"",""","""","done" <<< Input
"this is txt1","this is txt2",3,"this txt3","txt4 ""(tst)""","""","""""","done" <<< Output
One,Two,"3","Four","Five "and" Six",Seven and Eight,"Nine" <<< Input
One,Two,"3","Four","Five ""and"" Six",Seven and Eight,"Nine" <<< Output
Paul--)
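The code, with the test data as a here-document and Fix as a function. If you don't know how to incorporate it into your script, please leave a comment.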
#! /bin/bash
AWK='
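function Fix (s, Local, t, u, x) {
    # Find each double quote (octal 042) that has a character on both sides.
    while (match (s, ".\042.")) {
        u = substr (s, RSTART, RLENGTH);
        # x is 0 when either neighbour is a comma (a field delimiter), else 1.
        x = (u ~ /..,/ || u ~ /,../) ? 0 : 1;
        # Copy the text up to the quote; with x == 1 the quote is copied too,
        # and since s restarts at that same quote it ends up doubled.
        t = t substr (s, 1, RSTART + x);
        s = substr (s, RSTART + 1);
    }
    return (t s);
}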
{ print "\n" $0 " <<< Input"; }
{ $0 = Fix( $0); }
{ print $0 " <<< Output"; }
'
awk "${AWK}" <<[][]
"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)""
"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)"",""","""","done"
One,Two,"3","Four","Five "and" Six",Seven and Eight,"Nine"
[][]
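The script above doubles the embedded quotes, which yields valid CSV. The question itself asks for the literal text chr(34) instead. The following is a minimal sketch of that variant, using the same neighbouring-comma heuristic and therefore the same limitation (a quote next to a comma inside a quoted field is left alone). It is not the answer author's code, and the function name quotes_to_chr34 is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for this sketch only).
# Replace every embedded double quote with the literal text chr(34).
# A quote counts as embedded when it is neither the first nor the last
# character of the line and neither neighbour is a comma.
quotes_to_chr34() {
    awk '
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"" && i > 1 && i < n) {
                prev = substr($0, i - 1, 1)   # character before the quote
                nxt  = substr($0, i + 1, 1)   # character after the quote
                if (prev != "," && nxt != ",")
                    c = "chr(34)"
            }
            out = out c
        }
        print out
    }'
}

# Usage with the sample line from the question.
printf '%s\n' '"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)""' | quotes_to_chr34
# prints: "this is txt1","this is txt2",3,"this txt3","txt4 chr(34)(tst)chr(34)"
```

Because the decision is made per quote from its neighbours, no field numbers are listed, so the sketch does not depend on how many fields a line has.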
Answer 3
Perl's Text::CSV module is very good at handling malformed CSV like this. In particular:
If the CSV data is really bad, such as
1,"foo "bar" baz",42 or 1,""foo bar baz"",42
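there is a way to get this data line parsed and leave the quotes inside the quoted field as-is. This can be achieved by setting allow_loose_quotes and making sure that escape_char is not equal to quote_char.
For example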
$ echo '"this is txt1","this is txt2",3,"this txt3","txt4 "(tst)""' | perl -MText::CSV -lne '
BEGIN{$p = Text::CSV->new({escape_char => "", allow_loose_quotes => 1, quote_space => 1})}
@row = $p->fields() if $p->parse($_);
$p->escape_char("\""); $p->print(*STDOUT,\@row);
'
"this is txt1","this is txt2",3,"this txt3","txt4 ""(tst)"""