Escaping nested double quotes in a broken CSV file

Question

I created a sample input file with 10 fields per line, where fields 4 and 9 may be quoted:

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

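I then wrote this script (using GNU awk for the 3rd argument to match()) to identify the type of each input line and then modify the quoted fields accordingly: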

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    #    1,2,3,4,5,6,7,8,9,10           - type A
    #    1,2,3,"4",5,6,7,8,9,10         - type B
    #    1,2,3,4,5,6,7,8,"9",10         - type C
    #    1,2,3,"4",5,6,7,8,"9",10       - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
        #print i, "<" f[i] ">"
    #}

    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    $0 = ""
    for (i in f) {
        $i = f[i]
    }
    print
}


$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25

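The output always quotes the 2 fields that could be quoted in the input; if you don't like that, changing it is a trivial tweak left as an exercise. I also used the more traditional way of "escaping" double quotes in CSV, which is to double them. If you prefer \" over "", that is again a trivial change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on using awk with CSVs and on the CSV "standards".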
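As a rough sketch of the two tweaks mentioned above, the field-quoting step can be pulled out and varied on its own. The helper names quote_backslash and quote_if_needed are hypothetical (they are not part of tst.awk), and "quote only when needed" is one assumed reading of the first tweak: a field is quoted only if it contains a comma or a double quote. Each helper takes a single field value rather than a whole record:

```shell
# Hypothetical helper: escape embedded quotes as \" instead of doubling them.
quote_backslash() {
    printf '%s\n' "$1" | awk '{
        gsub(/^"|"$/, "")      # strip the outer quotes, if present
        gsub(/"/, "\\\"")      # replace each remaining quote with \"
        print "\"" $0 "\""     # wrap the field in quotes again
    }'
}

# Hypothetical helper: quote the field only if it contains a comma or a quote
# (an assumed rule), doubling embedded quotes as tst.awk does.
quote_if_needed() {
    printf '%s\n' "$1" | awk '{
        gsub(/^"|"$/, "")          # strip the outer quotes, if present
        if ($0 ~ /[",]/) {
            gsub(/"/, "\"\"")      # double each embedded quote
            $0 = "\"" $0 "\""      # wrap the field in quotes
        }
        print
    }'
}

quote_backslash '"here is a,",string"'    # prints "here is a,\",string"
quote_if_needed 'here is an unquoted string'
```

In tst.awk itself, the same gsub() calls would be applied to f[4] and f[9] in place of the existing three-line quoting steps.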
