Join/merge CSV files that do not share all headers/columns

Answer 1

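If you are open to an alternative tool that is clean and simple, Miller (https://github.com/johnkerl/miller) handles this directly. From the folder containing the input CSV files, run: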

mlr --csv unsparsify *.csv >out.csv

You will get:

A,B,C,D,F,E
10,20,10,20,5,
,20,10,20,10,5
,,,10,20,30
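
For readers who do not have Miller installed, the effect of unsparsify can be approximated with a two-pass POSIX awk sketch. This is not how Miller implements it. The function name unsparsify_csv is hypothetical, and the sketch assumes plain CSV with no quoted fields, no embedded commas or newlines, and no file names containing "=".

```shell
# Hypothetical helper: approximate "mlr --csv unsparsify" with POSIX awk.
# Assumes plain CSV: no quoted fields, no embedded commas or newlines.
unsparsify_csv() {
    # The file list is passed twice so that awk makes two passes over it;
    # the pass=N assignments are applied when awk reaches them.
    awk -F, -v OFS=, '
        # Pass 1: collect every header name in first-seen order.
        pass == 1 {
            if (FNR == 1)
                for (i = 1; i <= NF; i++)
                    if (!($i in seen)) { seen[$i] = 1; cols[++n] = $i }
            next
        }
        # Pass 2, header line of each file.
        pass == 2 && FNR == 1 {
            # Print the combined header once, before the first data row.
            if (!printed++)
                for (i = 1; i <= n; i++)
                    printf "%s%s", cols[i], (i < n ? OFS : ORS)
            # Remember where each column sits in the current file.
            split("", pos)   # portable way to clear an array
            for (i = 1; i <= NF; i++) pos[$i] = i
            next
        }
        # Pass 2, data rows: emit every known column, empty when absent.
        pass == 2 {
            for (i = 1; i <= n; i++)
                printf "%s%s", ((cols[i] in pos) ? $(pos[cols[i]]) : ""), (i < n ? OFS : ORS)
        }
    ' pass=1 "$@" pass=2 "$@"
}
```

It would be called as unsparsify_csv *.csv >out.csv. Columns come out in first-seen order, which is the same order as in the Miller output above.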

If you want F at the end, the command is

mlr --csv unsparsify then reorder -e -f F *.csv

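If you have a lot of files, you can do it in two steps (the intermediate tmp.txt is written in Miller's default DKVP format, which the second command reads back):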

mlr --icsv cat *.csv >tmp.txt
mlr --ocsv unsparsify tmp.txt >out.csv

Answer 2

BEGIN {
        OFS = FS = ","

        # Parse given column headers and remember their order.

        # nf will be the number of fields we'd want in the output.
        nf = split(pick, header)
        for (i = 1; i <= nf; ++i)
                order[header[i]] = i

        # Output headers.
        print pick
}

FNR == 1 {
        # Parse column headers from input file.

        delete reorder

        for (i = 1; i <= NF; ++i)
                # If the current header is one that we'd like to pick...
                if ($i in order)
                        # ... record what column it is located in.
                        reorder[order[$i]] = i

        next
}

{
        # Process data fields from input file.

        # We build a new output record, so explicitly split the current record
        # and save it in the field array, then empty the record and rebuild.
        split($0, field)
        $0 = ""

        for (i = 1; i <= nf; ++i)
                # If reorder[i] is zero, it's a column that is not available in the
                # current file.
                $i = (reorder[i] == 0 ? "" : field[reorder[i]])

        print
}

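The awk script above takes the columns you want to extract (in the order you want them) as the value of the pick variable, and extracts those columns from each input file. Columns that a file does not have are left empty.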

Using the example data shown in the question:

$ awk -v pick='A,B,C,D,E,F' -f script.awk file*.csv
A,B,C,D,E,F
10,20,10,20,,5
,20,10,20,5,10
,,,10,30,20

$ awk -v pick='F,B,A' -f script.awk file*.csv
F,B,A
5,20,10
10,20,
20,,
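
If typing the column list by hand is inconvenient, the value for pick can be derived from the files themselves. The following is a minimal sketch; the function name header_union is hypothetical, and it assumes plain CSV headers with no quoted fields or embedded commas. The resulting list is sorted alphabetically rather than kept in first-seen order.

```shell
# Hypothetical helper: print the union of all header names as one
# comma-separated line, sorted alphabetically.
# Assumes plain CSV headers: no quoted fields, no embedded commas.
header_union() {
    awk 'FNR == 1' "$@" |    # first line of every input file
        tr ',' '\n' |        # one column name per line
        LC_ALL=C sort -u |   # de-duplicate and sort
        paste -s -d, -       # join back into a single comma-separated line
}

# Usage with the script above:
#   awk -v pick="$(header_union file*.csv)" -f script.awk file*.csv
```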

Answer 3

Assuming there are no real blank lines between your data lines, and using GNU awk for its PROCINFO["sorted_in"] ordering:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    delete f
    for (i=1; i<=NF; i++) {
        f[$i] = i
        flds[$i]
    }
    numFiles++
    next
}
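{
    # Store each value under (file number, column name). Only the last
    # data row of a file is kept, so this assumes one data row per file.
    for (tag in f) {
        val[numFiles,tag] = $(f[tag])
    }
}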
END {
    # Visit the column names (the array indices) in ascending string order.
    PROCINFO["sorted_in"] = "@ind_str_asc"
    sep = ""
    for (tag in flds) {
        printf "%s%s", sep, tag
        sep = OFS
    }
    print ""
    for (fileNr=1; fileNr<=numFiles; fileNr++) {
        sep = ""
        for (tag in flds) {
            printf "%s%s", sep, val[fileNr,tag]
            sep = OFS
        }
        print ""
    }
}

$ awk -f tst.awk file{1..3}
A,B,C,D,E,F
10,20,10,20,,5
,20,10,20,5,10
,,,10,30,20
