awk 从 CSV 文件中删除双引号,用逗号分隔的引号字符串

awk 从 CSV 文件中删除双引号,用逗号分隔的引号字符串

我有一个 csv 文件,我想删除其中的任何红色",。问题是我的所有字段看起来都像这样,"***********","*********",其中*= 可以是任何字符,包括,"。我需要删除分隔符之间可能出现的任何,或,以便我的字符串保留在双引号中。"","

"This, has a comma","This has "quotes"",",but what if this happens "too""

理想情况下输出应该是

"This has a comma","This has quotes","but what if this happens too"

我试过这个,但它删除了我所有的双引号

awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) {gsub("\"","", $i)}}1' input.csv > output.csv

编辑:在此示例输入行上尝试了 aborruso 解决方案。

"33333","23","01-07-31-0-000-026.001","   ","","55540","COUNTY ROAD","","","","COUNTY ROAD","","Y","","","VILLE","YY","33367","","","4.246","-22.331","3","X","6000      ","","134","","000003"," "," ","New New New 
 New","000000370","0000004111","0000002111","94111"," ","JOHN SMITH","BILL","E BILLY",""," ",", JANE DOE","DOE","BOB","","BILL E BILLY BOB",", JANE DOE JOHN DOE BABY DOE"," ","","313 MAIN BLVD","313","","","MAIN","BLVD","","","","NEW YORK","NY","48555","2335","Z005"," ","1234","0","0","12345","12345","0","12345","1234","    ","1234","","01","",""," "," "," "," "," "," "," "," "," "," "," "," ","0","","0","","","","","","","","0","0","0","0","0","","","","","","","","","1","","","","","","0","","","","","","","","","","","0","0","","","  ","       ","          ","       ","       ","01","      ","DISTRICT 9 THE MOVE","","       ","          ","DISNEY NETFLIX","","COME ADD 1/4 OF A CUP 1/4 OF A SPOON 13-T00Z-B32C; THE 524.01'(X)350'(Y)TO W R/W OF SO SI'LLY 987.7';WOULDN'T YOU SAY","24313317","Z",""

我得到的输出是这样的

"(error)","(error)","01-07-31-0-000-026.001","   ","","(error)","COUNTY ROAD","","","","COUNTY ROAD","","Y","","","VILLE","YY","(error)

这只是一个片段,但你应该明白我的意思了。

答案1

您可以使用这里(GNU sed)4.7:

sed 's/","/\x0/g;s/^"\|"$/\x1/g;s/[",]//g;s/\x0/","/g;s/\x1/"/g' input.csv
  • ","用 NUL 字符替换字段分隔符( \x0
  • "将开头和结尾的部分替换为 SOH char ( \x1)
  • 删除"所有,
  • 恢复字段分隔符和"

答案2

这不是 awk,但我认为它可能对你有用

如果你使用磨坊主然后运行

echo '"This, has a comma","This has ""quotes""",",but what if this happens ""too"""'| \
mlr --csv -N --quote-all cat then put -S 'for (k in $*){$[k]=gsub($[k],"(,|\")","")}'

你有

"This has a comma","This has quotes","but what if this happens too"

编辑

您已插入输入的文本。这很奇怪,它类似于

1 2 3 4 5 6 7 8 9 10 11 12 十三 14 15 16 17 18 19 20 21 22 23 24 二十五 二十六 二十七 二十八 二十九 三十 31 三十二 33 三十四 三十五 三十六 三十七 三十八 三十九 40 41 四十二 43 四十四 四十五 四十六 四十七 四十八 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
33333 23 01-07-31-0-000-026.001 55540 县道 县道 维尔 耶耶 33367 4.246 -22.331 3 X 6000 134 000003 新品 新品 新品
新的 000000370 0000004111 0000002111 94111 约翰·史密斯 账单 比利 无名氏 美国能源部 鲍勃 比利·鲍勃 , 珍妮·多伊 约翰·多伊 宝贝·多伊 313 主大道 313 主要的 大道 纽约 纽约 48555 2335 Z005 1234 0 0 12345 12345 0 12345 1234 1234 01 0 0 0 0 0 0 0 1 0 0 0 01 第九区 搬家 迪士尼 Netflix 来添加 1/4 杯 1/4 勺 13-T00Z-B32C;524.01'(X)350'(Y) 到 WR/W 如此愚蠢的 987.7';你不觉得吗 24313317

运行 mlr 命令,您将获得此 CSV 输出

"33333","23","01-07-31-0-000-026.001","   ","","55540","COUNTY ROAD","","","","COUNTY ROAD","","Y","","","VILLE","YY","33367","","","4.246","-22.331","3","X","6000      ","","134","","000003"," "," ","New New New 
 New","000000370","0000004111","0000002111","94111"," ","JOHN SMITH","BILL","E BILLY",""," "," JANE DOE","DOE","BOB","","BILL E BILLY BOB"," JANE DOE JOHN DOE BABY DOE"," ","","313 MAIN BLVD","313","","","MAIN","BLVD","","","","NEW YORK","NY","48555","2335","Z005"," ","1234","0","0","12345","12345","0","12345","1234","    ","1234","","01","",""," "," "," "," "," "," "," "," "," "," "," "," ","0","","0","","","","","","","","0","0","0","0","0","","","","","","","","","1","","","","","","0","","","","","","","","","","","0","0","","","  ","       ","          ","       ","       ","01","      ","DISTRICT 9 THE MOVE","","       ","          ","DISNEY NETFLIX","","COME ADD 1/4 OF A CUP 1/4 OF A SPOON 13-T00Z-B32C; THE 524.01'(X)350'(Y)TO W R/W OF SO SI'LLY 987.7';WOULDN'T YOU SAY","24313317","Z",""

答案3

在 GNU awk 中可以使用FPAT 分隔符规范,它可以正确处理引号内的分隔符,但当然会以牺牲一些性能为代价:

awk 'BEGIN{ FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," } { for (i=1; i<=NF; i++) {gsub("\"|,","", $i)}; print $0; }' input.csv > output.csv
awk 'BEGIN{ FPAT="([^,]*)|(\"[^\"]*\")"; OFS="|" } { $1=$1; print $0; }' input.csv > output2.csv

其中 FPAT 正则表达式的意思是

  • ([^,]*)- 任何不包含,(分隔符)或
  • (\"[^\"]*\")- 引号之间的任何字符串,中间\"不能有其他引号[^\"]
  • for (i=1; i<=NF; i++) {gsub("\"|,","", $i)};- 替换",输入字段值并使用 OFS 重建记录
  • 或者,使用$1=$1使用当前 OFS 重建记录使用分隔符|,然后将文件作为管道分隔符进行处理。

PS 您可以,;或 替换分隔符*,以+避免空字段值。

相关内容