Awk: If a column is quoted, join the columns and remove the commas

Answer 1

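Use csvformat -S (or csvformat --skipinitialspace) from csvkit to remove the initial space characters after each comma, which turns the data into properly quoted CSV records. Then Miller (mlr) iterates over every field of each record and removes all embedded commas.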

$ csvformat -S file | mlr --csv put 'for (k,v in $*) { $[k] = gsub(v,",","") }'
COL1,COL2,COL3
a,b,c
d,efg,h

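Both tools are CSV-aware and know how to read CSV records with quoted fields, embedded commas, embedded newlines, and so on. Both the csvkit tools and Miller output quoted fields whenever a field needs quoting.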

For example, add a record to the data in which one field contains a newline and another contains quotes:

$ cat file
COL1, COL2, COL3
a, b, c
d, "e,f,g", h
My data, "Line 1,
Line 2", "This is a quote: ""The, quote"""

$ csvformat -S file | mlr --csv put 'for (k,v in $*) { $[k] = gsub(v,",","") }'
COL1,COL2,COL3
a,b,c
d,efg,h
My data,"Line 1
Line 2","This is a quote: ""The quote"""

Answer 2

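With any awk, if your input really looks like what you show, with one space after each comma outside of quotes, no double quotes or newlines inside quoted fields, and no space after the commas inside quoted fields: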

$ awk 'BEGIN{FS=OFS=", "} {for (i=1; i<=NF; i++) gsub(/[",]/,"",$i)} 1' file
COL1, COL2, COL3
a, b, c
d, efg, h
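
To see why those assumptions matter, here is a small sketch. The function name strip_simple and the two sample lines are made up for illustration; the awk program is the one shown above. The first line meets the assumptions. The second has a space after a comma inside the quotes, so the quoted field is split at that inner ", " and comes out as two columns.

```shell
# strip_simple is a hypothetical wrapper around the awk command above
strip_simple() {
    awk 'BEGIN{FS=OFS=", "} {for (i=1; i<=NF; i++) gsub(/[",]/,"",$i)} 1'
}

# Meets the assumptions: the quoted field stays one column
printf '%s\n' 'd, "e,f,g", h' | strip_simple    # prints: d, efg, h

# Space after a comma inside the quotes: the field is split in two
printf '%s\n' 'd, "e, f", h' | strip_simple     # prints: d, e, f, h
```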

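Alternatively, using GNU awk for FPAT, if your input may have leading white space in each field, has no double quotes or newlines inside quoted fields, and may have white space after the commas inside quoted fields: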

$ awk -v FPAT='([^,]*)|( *"[^"]+")' -v OFS=',' '
    { for (i=1; i<=NF; i++) gsub(/[",]/,"",$i) }
1' file
COL1, COL2, COL3
a, b, c
d, efg, h

See "What's the most robust way to efficiently parse CSV using awk?" for more information on parsing CSV with awk.

Answer 3

I think I have now found a suitable solution:

'{ for (i=1; i<=NF; i+=1)
    { gsub(/^"|",*$|,/,"",$i);
      # use an explicit format so a "%" in the data is not read as a format specifier
      printf "%s%s", $i, ((i != NF) ? ", " : "\n")
    }
 }'

...but this does not work if there are spaces in the fields. This works:

# delimit by comma
awk -F"," '{
    # m non-zero tells us that we are inside a quoted section
    m = 0
    # iterate over every field
    for (i = 1; i <= NF; i++) {
        if (match($i, /^ *"/)) {
            # the field starts with optional white space followed by a quote:
            # if we are not already in a quoted section, remove the quote; then bump m
            if (!m) sub(/^ *"/, "", $i)
            m++
        } else if (match($i, /"/)) {
            # the field contains a quote elsewhere:
            # set m to the next lowest level of quoting
            m--
            # and if we are now outside of the quoted field, remove the quote
            if (!m) sub(/"/, "", $i)
        }
        # print a comma delimiter unless we are at the last field,
        # in which case print a newline; print no delimiter inside a quoted section
        printf "%s%s", $i, (i == NF ? "\n" : (m ? "" : ", "))
    }
}' file

I would love to see a more compact solution!

Answer 4

This is slightly more compact and takes a different approach. It handles the provided test data correctly:

BEGIN { FS="\"" }

{
    separator = ""
    for (i = 1; i <= NF; i++) {
        if (i % 2) {
            # Odd numbered field, handle as CSV
            n = split($i, parts, ",")
            for (j = 1; j <= n; j++) {
                printf "%s%s", separator, parts[j];
                separator = ","
            }
        }
        else {
            # Even numbered field, handle as quoted text
            gsub(",", "", $i)
            printf "%s", $i;
            separator = ""
        }
    }
    print "";
}

I tested it with the following input:

COL1, COL2, COL3
a, b, c
d, "e,f,g" , h
"i,j,k"
"l,m",n,o
p,"q"
s, t,u, "w,,z"

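The code above treats the double quote as the primary separator. It assumes that quotes come in pairs, in which case the even-numbered fields ($2, $4, $6, ...) are the quoted text and the odd-numbered fields ($1, $3, $5, ...) are the text outside the quotes. The two kinds of field (even/quoted and odd/unquoted) are handled differently.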
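
As a quick illustration of that odd/even split, the following sketch prints every field that awk sees when the double quote is the field separator, labelled by parity. The function name show_quote_fields is hypothetical, and the sample line is taken from the test data above.

```shell
# show_quote_fields is a hypothetical helper: it splits each line on double
# quotes and labels each resulting field as quoted (even) or unquoted (odd)
show_quote_fields() {
    awk 'BEGIN { FS = "\"" }
    {
        for (i = 1; i <= NF; i++)
            printf "%d %s [%s]\n", i, (i % 2 ? "unquoted" : "quoted"), $i
    }'
}

printf '%s\n' 'd, "e,f,g" , h' | show_quote_fields
# prints:
# 1 unquoted [d, ]
# 2 quoted [e,f,g]
# 3 unquoted [ , h]
```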

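If necessary, a regular expression could be used as the field separator (FS) to handle escaped quotes. I was not sure whether all the spaces should be removed as well; that could be added.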
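
One possible way to add the space removal is sketched below. It rests on assumptions the question does not confirm: that white space around the commas outside quotes should be dropped, and that spaces inside quoted text should be kept. The function name strip_commas_and_spaces is hypothetical. Like the code above, it assumes quotes come in pairs and does not handle escaped quotes or embedded newlines.

```shell
# strip_commas_and_spaces is a hypothetical variant of the script above
strip_commas_and_spaces() {
    awk 'BEGIN { FS = "\"" }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            if (i % 2) {
                # odd field, outside quotes: drop white space around each comma
                gsub(/[ \t]*,[ \t]*/, ",", $i)
            } else {
                # even field, inside quotes: delete the commas, keep any spaces
                gsub(/,/, "", $i)
            }
            out = out $i
        }
        # trim white space left at either end of the line
        gsub(/^[ \t]+|[ \t]+$/, "", out)
        print out
    }'
}

printf '%s\n' 'd, "e,f,g" , h' 's, t,u, "w,,z"' | strip_commas_and_spaces
# prints:
# d,efg,h
# s,t,u,wz
```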
