Need to format a CSV


I have a CSV file containing values like the example below:

"Basic","""21,21""","[""21"",""21""]","","","","",""

I need to remove the extra double quotes in certain columns, for example columns 2 and 3.

The expected output is as follows:

"Basic","21,21","[21,21]","","","","",""

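How can I achieve this using awk, sed, or any other Linux tool?

Some more sample lines from the file are shown below. The values in that column are always inside [], and the quotes inside the [] must be removed.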

"Basic","""40""","[""40""]","""13F""","[""13F""]","","" 
"Basic","""0""","[""0""]","","","""MCOMB""","[""MCOMB""]"

Answer 1

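Use a csv parser that understands CSV, including quotes and commas embedded in quoted fields and so on, which can be a bit more complicated than simple comma-separated fields.

Miller is a good command-line tool for this, as is csvkit.

Or use a csv parsing library for a language such as perl or python - for example Text::CSV for perl, or a CSV parsing module for python.

If you are on Linux, all of these are probably available as packages for whatever distribution you use.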
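
As a rough illustration of the parsing-library approach, here is a minimal sketch using the csv module from Python's standard library. The function name strip_inner_quotes is made up for this example, and it assumes that every double quote inside a field's data should be dropped and that every output field should be quoted, as in the expected output from the question.

```python
import csv
import io


def strip_inner_quotes(text: str) -> str:
    """Drop double quotes that are part of the field data; keep CSV quoting.

    Hypothetical helper for illustration, not taken from any answer.
    """
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    # QUOTE_ALL quotes every field, including the empty ones,
    # which matches the expected output shown in the question.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # After parsing, each field holds the unescaped value,
        # so any remaining '"' belongs to the data itself.
        writer.writerow([field.replace('"', "") for field in row])
    return out.getvalue()


sample = '"Basic","""21,21""","[""21"",""21""]","","","","",""\n'
print(strip_inner_quotes(sample), end="")
```

Because the parser resolves the quoting before the replacement runs, the comma inside "21,21" stays inside its field instead of being treated as a separator.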

Answer 2

Answer 3

I have a sed solution:

sed -e 's/,"""/,"/g' -e 's/""",/",/g' -e 's/\([^,]\)""/\1/g' -e 's/""\([^,]\)/\1/' 

This gives:

"Basic","40","[40]","13F","[13F]","",""
"Basic","0","[0]","","","MCOMB","[MCOMB]"
"Basic","21,21","[21,21]","","","","",""

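The sed commands are fairly simple:

  • 's/,"""/,"/g' replaces every occurrence (the g flag) of ,""" with ,"
  • 's/\([^,]\)""/\1/g' finds any non-comma character [^,] followed by two ", remembers that character with \( \), and replaces the match with the remembered character \1

Note that a trailing space at the end of a line will cause the last "" to be removed, because the final expression matches "" followed by any non-comma character, including a space.

As @cas pointed out, in the long run you are better off using a csv tool.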
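
To make the effect of each substitution easier to follow, the four sed expressions can be mirrored step by step with Python's re module. This is only a sketch for experimenting with the patterns, not a replacement for the sed command; the function name sed_like is invented for this example.

```python
import re


def sed_like(line: str) -> str:
    """Apply the same four substitutions as the sed command, in order.

    Hypothetical helper for illustration only.
    """
    line = line.replace(',"""', ',"')        # s/,"""/,"/g
    line = line.replace('""",', '",')        # s/""",/",/g
    line = re.sub(r'([^,])""', r"\1", line)  # s/\([^,]\)""/\1/g
    # The last sed expression has no g flag, so only the first match is replaced.
    line = re.sub(r'""([^,])', r"\1", line, count=1)  # s/""\([^,]\)/\1/
    return line


clean = '"Basic","""21,21""","[""21"",""21""]","","","","",""'
trailing = '"Basic","""40""","[""40""]","""13F""","[""13F""]","","" '

print(sed_like(clean))
print(repr(sed_like(trailing)))  # repr() makes the trailing space visible
```

Running this on the line that ends with a space shows the caveat from the note above: the final "" is followed by a space, which is a non-comma character, so the last expression removes it and the line ends in "", followed by a space.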

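Answer 4

I am assuming that you want to remove all double quotes that are part of the data, that is, not the double quotes that belong to the CSV format and are needed to quote embedded quotes, commas, and newlines.

Use csvformat from csvkit together with tr to remove the quotes inside each field: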

$ cat file
"Basic","""40""","[""40""]","""13F""","[""13F""]","",""
"Basic","""0""","[""0""]","","","""MCOMB""","[""MCOMB""]"
"Basic","""21,21""","[""21"",""21""]","","","","",""
$ csvformat -Q "'" file | tr -d '"' | csvformat -q "'"
Basic,40,[40],13F,[13F],,
Basic,0,[0],,,MCOMB,[MCOMB]
Basic,"21,21","[21,21]",,,,,

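The pipeline above first changes the quote character used in the CSV file from double quotes to single quotes. The tr command then deletes all remaining double quotes (which are part of the data). The final csvformat command converts the data back to using double quotes for quoting.

If you need every field quoted, even the empty ones, add -U 1 to the second invocation of csvformat. By default, the csvkit utilities only quote the fields that need it.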
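
The three stages of the pipeline can also be sketched with Python's standard csv module, which may help when csvkit is not installed. This does not call csvkit; the helpers requote and strip_data_quotes and the quote_all parameter are names made up for this example, with quote_all standing in for the -U 1 option.

```python
import csv
import io


def requote(text, in_quote, out_quote, quoting=csv.QUOTE_MINIMAL):
    """Re-emit CSV text, reading with in_quote and writing with out_quote."""
    out = io.StringIO()
    writer = csv.writer(out, quotechar=out_quote, quoting=quoting,
                        lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(text), quotechar=in_quote))
    return out.getvalue()


def strip_data_quotes(text, quote_all=False):
    """Hypothetical three-stage equivalent of the shell pipeline."""
    stage1 = requote(text, '"', "'")   # roughly: csvformat -Q "'"
    stage2 = stage1.replace('"', "")   # roughly: tr -d '"'
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    return requote(stage2, "'", '"', quoting)  # roughly: csvformat -q "'"


sample = '"Basic","""21,21""","[""21"",""21""]","","","","",""\n'
print(strip_data_quotes(sample), end="")
print(strip_data_quotes(sample, quote_all=True), end="")
```

Like the shell pipeline, this relies on the data containing no single quotes; a field with an embedded ' would need a different temporary quote character.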
