Escaping nested double quotes in a broken CSV file

I have a large, broken "CSV" file that contains many nested double quotes. For example:

123,"I wonder how to escape "these" quotes with backslashes.",123,456
456,"I wonder how to escape "these" quotes with backslashes.",456,789

Any idea how to fix this?

Update: here is a real example:

n9sih438,4994fa72322,PMC,Rapid Identification of Malaria Vaccine Candidates Based on alpha-Helical Coiled Coil Protein Motif,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"To identify malaria antigens for vaccine development, we selected alpha-helical coiled coil domains of proteins predicted to be present in the parasite erythrocytic stage. The corresponding synthetic peptides are expected to mimic structurally "native" epitopes. Indeed the 95 chemically synthesized peptides were all specifically recognized by human immune sera, though at various prevalence. Peptide specific antibodies were obtained both by affinity-purification from malaria immune sera and by immunization of mice. These antibodies did not show significant cross reactions, i.e., they were specific for the original peptide, reacted with native parasite proteins in infected erythrocytes and several were active in inhibiting in vitro parasite growth. Circular dichroism studies indicated that the selected peptides assumed partial or high alpha-helical content. Thus, we demonstrate that the bioinformatics/chemical synthesis approach described here can lead to the rapid identification of molecules which target biologically active antibodies, thus identifying suitable vaccine candidates. This strategy can be, in principle, extended to vaccine discovery in a wide range of other pathogens.",2007-07-25

Nested double quotes can appear in the "title" field (the 4th field) and in the "abstract" field (the 9th field).

Answer 1

I created a sample input file with 10 fields per line, where fields 4 and 9 may be quoted:

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

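Then I wrote this script (using GNU awk for the 3rd argument to match()) to identify the type of each input line and then modify the quoted fields accordingly: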

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    #    1,2,3,4,5,6,7,8,9,10           - type A
    #    1,2,3,"4",5,6,7,8,9,10         - type B
    #    1,2,3,4,5,6,7,8,"9",10         - type C
    #    1,2,3,"4",5,6,7,8,"9",10       - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
        #print i, "<" f[i] ">"
    #}

    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    $0 = ""
    for (i in f) {
        $i = f[i]
    }
    print
}

$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
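One way to sanity-check output like the above is to run it through an independent CSV parser. The following is only a minimal sketch using Python's standard csv module, which is not part of the original awk solution. It takes the last repaired line, parses it with strict=True (so malformed quoting raises csv.Error instead of being guessed at), and prints the field count together with fields 4 and 9.

```python
import csv
import io

# Last repaired line from the awk output above (fields 4 and 9 both quoted).
repaired = (
    'n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",'
    '10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,'
    '"here is a,"",string,"",""within,"", quotes.",2007-07-25\n'
)

# strict=True makes the parser raise csv.Error on malformed quoting
# instead of silently accepting it.
rows = list(csv.reader(io.StringIO(repaired), strict=True))

for row in rows:
    # Field count, then the decoded 4th and 9th fields.
    print(len(row), row[3], row[8], sep=" | ")
```

If the repair worked, each row should have exactly 10 fields, and the doubled quotes in fields 4 and 9 should come back as single embedded quotes.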

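The output always quotes the 2 fields that may be quoted in the input; if you don't like that, it is a simple tweak that is left as an exercise. I also used the more traditional way of "escaping" double quotes in CSV, which is to double them. If you prefer \" to "", that is again a trivial change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on using awk with CSVs and on the CSV "standard".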
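For readers without GNU awk, the two tweaks mentioned above (quoting only when needed, and backslash-escaping instead of doubling) can be sketched in Python. This is a rough port under the same assumptions as the awk script, not the script itself: exactly 10 fields per line, only fields 4 and 9 may be quoted, and no other field contains a quote or a comma. The names `requote`, `repair_line`, `escape` and `always_quote` are made up for this illustration.

```python
import re

PLAIN = r'[^",]+'             # a field containing no quotes or commas
FLEX = r'(?:".*"|[^",]+)'     # field 4 or 9: quoted (possibly broken) or plain

# Layout: 3 plain fields, field 4, 4 plain fields, field 9, 1 plain field.
LINE_RE = re.compile(
    rf'^((?:{PLAIN},){{3}})({FLEX}),((?:{PLAIN},){{4}})({FLEX}),({PLAIN})$'
)


def requote(field, escape='""', always_quote=True):
    """Strip the outer quotes, escape the inner quotes, and re-quote."""
    inner = re.sub(r'^"|"$', '', field)
    if not always_quote and not re.search(r'[",]', inner):
        return inner          # nothing in this field requires quoting
    return '"' + inner.replace('"', escape) + '"'


def repair_line(line, escape='""', always_quote=True):
    """Repair one line (no trailing newline); None if the layout does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    head, f4, middle, f9, last = m.groups()
    # head and middle keep their trailing commas from the regex groups.
    return (head + requote(f4, escape, always_quote) + ',' + middle
            + requote(f9, escape, always_quote) + ',' + last)
```

Calling `repair_line(line)` doubles the inner quotes as the awk script does, `repair_line(line, escape='\\"')` emits backslash-escaped quotes instead, and `always_quote=False` leaves fields unquoted when they contain neither a quote nor a comma. Like the awk version, this relies on the greedy `".*"` match and can misclassify lines whose quoted text itself looks like the surrounding field layout.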
