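I have a directory of 1000 files from a data-logging system, and each file can have 40,000 rows or more. The challenge is that data from one or more sensors is sometimes not recorded and is therefore missing, e.g.
File 1: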
A,B,C,D,F
10,20,10,20,5
File 2:
B,C,D,E,F
20,10,20,5,10
File 3:
D,E,F
10,30,20
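The desired result is all files merged/concatenated under a single header. If an input file is missing a column (because a sensor was broken), that field is filled with an empty value: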
A,B,C,D,E,F
10,20,10,20,,5
,20,10,20,5,10
,,,10,30,20
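The last column, F, is always present because it is the date/time stamp.
I found this answer, but it assumes that the headers/columns are the same in all files.
I also found this question, "Merge multiple CSV files with matching and non-matching columns", but the answer was not complete enough for me to use.
Thanks
Answer 1
If you want to try an alternative, very clean and simple tool, Miller (https://github.com/johnkerl/miller), run this command from the folder that contains the input CSV files: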
mlr --csv unsparsify *.csv >out.csv
You will get
A,B,C,D,F,E
10,20,10,20,5,
,20,10,20,10,5
,,,10,20,30
If you want F at the end, the command is
mlr --csv unsparsify then reorder -e -f F *.csv
If you have a lot of files, you can do it in two steps:
mlr --icsv cat *.csv >tmp.txt
mlr --ocsv unsparsify tmp.txt >out.csv
Answer 2
BEGIN {
    OFS = FS = ","

    # Parse given column headers and remember their order.
    # nf will be the number of fields we'd want in the output.
    nf = split(pick, header)
    for (i = 1; i <= nf; ++i)
        order[header[i]] = i

    # Output headers.
    print pick
}

FNR == 1 {
    # Parse column headers from input file.
    delete reorder
    for (i = 1; i <= NF; ++i)
        # If the current header is one that we'd like to pick...
        if ($i in order)
            # ... record what column it is located in.
            reorder[order[$i]] = i

    next
}

{
    # Process data fields from input file.

    # We build a new output record, so explicitly split the current record
    # and save it in the field array, then empty the record and rebuild.
    split($0, field)
    $0 = ""

    for (i = 1; i <= nf; ++i)
        # If reorder[i] is zero, it's a column that is not available in the
        # current file.
        $i = (reorder[i] == 0 ? "" : field[reorder[i]])

    print
}
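The awk script above takes the columns that you want to extract (in some particular order) as an argument, and extracts those columns from each input file.
With the example data shown in your question: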
$ awk -v pick='A,B,C,D,E,F' -f script.awk file*.csv
A,B,C,D,E,F
10,20,10,20,,5
,20,10,20,5,10
,,,10,30,20
$ awk -v pick='F,B,A' -f script.awk file*.csv
F,B,A
5,20,10
10,20,
20,,
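With a thousand files it may not be obvious which column names occur at all, so the pick list could be derived from the files themselves. The following is only a sketch under stated assumptions: collect_headers is a hypothetical helper that is not part of the script above, the timestamp column is assumed to be named F as in the question, and the CSV is assumed to be simple (no quoted commas in the header).

```shell
# Hypothetical helper (not part of script.awk): print the union of the
# header names found on line 1 of each given CSV file, in first-seen
# order, with the timestamp column moved to the end.
# Assumption: the timestamp column is named F, as in the question.
collect_headers() {
    for f in "$@"; do
        head -n 1 "$f"
    done |
    awk -F, -v last=F '
        {
            # Append every header name that has not been seen before,
            # except the timestamp column, which is added in END.
            for (i = 1; i <= NF; ++i)
                if ($i != last && !($i in seen)) {
                    seen[$i] = 1
                    out = out $i ","
                }
        }
        END { print out last }
    '
}

# Possible usage together with the script above:
#   pick=$(collect_headers file*.csv)
#   awk -v pick="$pick" -f script.awk file*.csv
```

Only the first line of each file is read here, so the extra pass should stay cheap even for large files.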
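Answer 3
Assuming there are no truly blank lines between your lines of data, and using GNU awk for sorted traversal (PROCINFO["sorted_in"]):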
$ cat tst.awk
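BEGIN { FS=OFS="," }
FNR==1 {
    # Header line: map each column name to its position in this file
    # and record the name in the set of all column names seen so far.
    delete f
    for (i=1; i<=NF; i++) {
        f[$i] = i
        flds[$i]
    }
    next
}
{
    # Data line: store every value under (row number, column name) so that
    # files with more than one data row keep all of their rows.
    numRows++
    for (tag in f) {
        val[numRows,tag] = $(f[tag])
    }
}
END {
    # Traverse flds in ascending order of its indices (the column names).
    PROCINFO["sorted_in"] = "@ind_str_asc"
    sep = ""
    for (tag in flds) {
        printf "%s%s", sep, tag
        sep = OFS
    }
    print ""
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        sep = ""
        for (tag in flds) {
            printf "%s%s", sep, val[rowNr,tag]
            sep = OFS
        }
        print ""
    }
}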
$ awk -f tst.awk file{1..3}
A,B,C,D,E,F
10,20,10,20,,5
,20,10,20,5,10
,,,10,30,20
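The script above stores values in memory and prints them in END. For a thousand files of tens of thousands of rows each, a two-pass variant that reads the file list twice and keeps only the column names in memory might be easier on RAM. This is a rough sketch and not the answer's method: merge_two_pass is a hypothetical name, columns come out in first-seen order rather than sorted, the timestamp column is assumed to be named F, and the CSV is assumed to contain no quoted commas. It uses only POSIX awk features.

```shell
# Sketch: merge CSV files with differing headers in two passes over the
# same file list, so that only the header names are kept in memory.
# Assumptions: simple CSV (no quoted commas); the timestamp column is
# named F and should come last, as in the question.
merge_two_pass() {
    awk -F, -v OFS=, -v last=F '
        # Pass 1: collect column names in first-seen order, except "last".
        pass == 1 {
            if (FNR == 1)
                for (i = 1; i <= NF; ++i)
                    if ($i != last && !($i in seen)) {
                        seen[$i] = 1
                        name[++n] = $i
                    }
            next
        }
        # Pass 2, very first record: complete the list and print the header.
        !printed {
            name[++n] = last
            for (i = 1; i <= n; ++i)
                printf "%s%s", name[i], (i < n ? OFS : ORS)
            printed = 1
        }
        # Pass 2, header line: map column name -> position in this file.
        FNR == 1 {
            split("", pos)
            for (i = 1; i <= NF; ++i)
                pos[$i] = i
            next
        }
        # Pass 2, data line: emit the fields in the merged column order,
        # leaving columns that this file lacks empty.
        {
            for (i = 1; i <= n; ++i)
                printf "%s%s", ((name[i] in pos) ? $(pos[name[i]]) : ""), (i < n ? OFS : ORS)
        }
    ' pass=1 "$@" pass=2 "$@"
}

# Possible usage:
#   merge_two_pass file*.csv > out.csv
```

The trade-off is that every file is read twice, in exchange for memory use that does not grow with the number of rows.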