I have a text file with the following content:
Notes1,Notes2,Id3,Id4
I'd like to play tennis with you some day with everyone,Mary enjoys cooking,id1234,5678
Some of my friends can speak English well and turkish well,She likes bananas,id3456,9898
The final output should look something like this:
word, iterationcount,id3,id4,columnname
I'd , 1, id1234,5678,Notes1
like, 1, id1234,5678,Notes1
with, 2, id1234,5678,Notes1 ....
some, 1, id3456,9898,Notes2
well, 2, id3456,9898,Notes2
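That is, for each word in column 1 (and likewise for column 2), output the word's count, grouped by Id3 and Id4.
I have tried several approaches, shown below: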
awk -F, '{for (i=1; i<=NF-1; i++) words[$1","$2][$i]+=1} END {for (i in words) {for (word in words[i]) {print word "," words[i][word]}}} ' file.csv
awk -F, '{count[$2","$3]+=(NF-1); for (i=1; i<=NF-1; i++) words[$2","$3][$i]+=1} END {for (i in count) {for (word in words[i]) {print i, word, words[i][word]}}} ' file.csv | sort
Something is missing. Can anyone suggest a fix?
Answer 1
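Using GNU awk (which you're evidently already using, given the code you posted in your question, and which is usually the awk variant on Linux) for FPAT and arrays of arrays: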
$ cat tst.awk
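BEGIN {
    OFS = ","
    # A field is either a run of non-commas or a double-quoted string
    # in which embedded double quotes are doubled ("").
    FPAT = "([^" OFS "]*)|(\"([^\"]|\"\")*\")"
}

# Header line: remember each column name by position.
NR == 1 {
    for ( i=1; i<=NF; i++ ) {
        fldName[i] = $i
    }
    next
}

{
    analyze(1)
    analyze(2)
}

# Count the words in field fldNr of the current line and write the counts
# to the file "out<fldNr>". The names after fldNr are local variables.
function analyze(fldNr,    words,i,word,cnt,key,out) {
    out = "out" fldNr
    split($fldNr,words)
    for ( i in words ) {
        word = words[i]
        cnt[$3 OFS $4 OFS fldName[fldNr]][word]++
    }
    # Print the header once per output file.
    if ( !doneHdr[fldNr]++ ) {
        print "word", "iterationcount", "id3", "id4", "columnname" > out
    }
    for ( key in cnt ) {
        for ( word in cnt[key] ) {
            print word, cnt[key][word], key > out
        }
    }
}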
$ awk -f tst.awk file.csv
$ head -100 out?
==> out1 <==
word,iterationcount,id3,id4,columnname
some,1,id1234,5678,Notes1
you,1,id1234,5678,Notes1
with,2,id1234,5678,Notes1
day,1,id1234,5678,Notes1
everyone,1,id1234,5678,Notes1
tennis,1,id1234,5678,Notes1
to,1,id1234,5678,Notes1
play,1,id1234,5678,Notes1
I'd,1,id1234,5678,Notes1
like,1,id1234,5678,Notes1
can,1,id3456,9898,Notes1
friends,1,id3456,9898,Notes1
well,2,id3456,9898,Notes1
Some,1,id3456,9898,Notes1
of,1,id3456,9898,Notes1
and,1,id3456,9898,Notes1
speak,1,id3456,9898,Notes1
my,1,id3456,9898,Notes1
turkish,1,id3456,9898,Notes1
English,1,id3456,9898,Notes1
==> out2 <==
word,iterationcount,id3,id4,columnname
cooking,1,id1234,5678,Notes2
Mary,1,id1234,5678,Notes2
enjoys,1,id1234,5678,Notes2
likes,1,id3456,9898,Notes2
bananas,1,id3456,9898,Notes2
She,1,id3456,9898,Notes2
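The above assumes that if any of your fields contain commas then they will be enclosed in double quotes, and that if any of your quoted fields contain double quotes then those will be escaped by doubling them, per RFC 4180.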
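The script also assumes that none of your fields can contain newlines. If any of them can, see "What's the most robust way to efficiently parse CSV using awk?" to learn what you need to do to handle them.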
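If your input is as simple as the sample (no quoted fields, no embedded commas or newlines), a portable variant is possible too. The following is only a sketch under that assumption, not the script above: it swaps FPAT for a plain -F, and replaces arrays of arrays with a one-dimensional array that is emptied after each field, so it should run in any POSIX awk. The function name count_words is hypothetical. It writes one combined table to stdout, as in the question's desired output, and prints words in the order they first appear, whereas the traversal order of "for (key in array)" in the script above is unspecified.

```shell
# Hypothetical helper: per-line word counts for columns 1 and 2,
# keyed by the ids in columns 3 and 4. Assumes no quoted CSV fields.
count_words() {
    awk -F',' '
    NR == 1 {
        # Header line: remember the column names and print the output header.
        for (i = 1; i <= NF; i++) name[i] = $i
        print "word,iterationcount,id3,id4,columnname"
        next
    }
    {
        for (f = 1; f <= 2; f++) {
            n = split($f, w, " ")      # split the field on whitespace
            nseen = 0
            for (i = 1; i <= n; i++) {
                if (!(w[i] in cnt)) order[++nseen] = w[i]   # first-seen order
                cnt[w[i]]++
            }
            for (i = 1; i <= nseen; i++) {
                print order[i] "," cnt[order[i]] "," $3 "," $4 "," name[f]
                delete cnt[order[i]]   # reset for the next field or line
            }
        }
    }' "$@"
}

# Demo on the sample data from the question.
count_words <<'EOF'
Notes1,Notes2,Id3,Id4
I'd like to play tennis with you some day with everyone,Mary enjoys cooking,id1234,5678
Some of my friends can speak English well and turkish well,She likes bananas,id3456,9898
EOF
```

With a real file you would call it as count_words file.csv. Counting is case-sensitive, so "Some" and "some" are treated as different words, just as in the gawk script.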