Is there a way to get unique word counts from a CSV file?

I have a text file with the following contents:

Notes1,Notes2,Id3,Id4
I'd like to play tennis with you some day with everyone,Mary enjoys cooking,id1234,5678
Some of my friends can speak English well and turkish well,She likes bananas,id3456,9898

The final output should look something like this:

word, iterationcount,id3,id4,columnname
I'd , 1, id1234,5678,Notes1
like, 1, id1234,5678,Notes1
with, 2, id1234,5678,Notes1 .... 

some, 1, id3456,9898,Notes2
well, 2, id3456,9898,Notes2

I want one row for every word in column 1 (Notes1), and the same output for column 2 (Notes2), with each word's count grouped by Id3 and Id4.

I have tried several approaches, shown below:

awk -F, '{for (i=1; i<=NF-1; i++) words[$1","$2][$i]+=1} END {for (i in words) {for (word in words[i]) {print word "," words[i][word]}}} ' file.csv


awk -F, '{count[$2","$3]+=(NF-1); for (i=1; i<=NF-1; i++) words[$2","$3][$i]+=1} END {for (i in count) {for (word in words[i]) {print i, word, words[i][word]}}} ' file.csv | sort

Something is still missing. Can anyone suggest a fix?

Answer 1

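Using GNU awk (which, given the code you posted in your question, you are apparently already using, and which is usually the awk variant on Linux) for FPAT and arrays of arrays: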

$ cat tst.awk
BEGIN {
    OFS = ","
    FPAT = "([^" OFS "]*)|(\"([^\"]|\"\")*\")"
}
NR == 1 {
    for ( i=1; i<=NF; i++ ) {
        fldName[i] = $i
    }
    next
}
{
    analyze(1)
    analyze(2)
}

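# Count the words in field fldNr of the current row and append the
# counts to the file "out<fldNr>". The names after fldNr are locals.
function analyze(fldNr,     words,i,word,cnt,key,out) {
    out = "out" fldNr
    split($fldNr,words)                 # split the field on whitespace
    for ( i in words ) {
        word = words[i]
        cnt[$3 OFS $4 OFS fldName[fldNr]][word]++
    }
    if ( !doneHdr[fldNr]++ ) {          # print the header once per output file
        print "word", "iterationcount", "id3", "id4", "columnname" > out
    }
    for ( key in cnt ) {
        for ( word in cnt[key] ) {
            print word, cnt[key][word], key > out
        }
    }
}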

$ awk -f tst.awk file.csv

$ head -100 out?
==> out1 <==
word,iterationcount,id3,id4,columnname
some,1,id1234,5678,Notes1
you,1,id1234,5678,Notes1
with,2,id1234,5678,Notes1
day,1,id1234,5678,Notes1
everyone,1,id1234,5678,Notes1
tennis,1,id1234,5678,Notes1
to,1,id1234,5678,Notes1
play,1,id1234,5678,Notes1
I'd,1,id1234,5678,Notes1
like,1,id1234,5678,Notes1
can,1,id3456,9898,Notes1
friends,1,id3456,9898,Notes1
well,2,id3456,9898,Notes1
Some,1,id3456,9898,Notes1
of,1,id3456,9898,Notes1
and,1,id3456,9898,Notes1
speak,1,id3456,9898,Notes1
my,1,id3456,9898,Notes1
turkish,1,id3456,9898,Notes1
English,1,id3456,9898,Notes1

==> out2 <==
word,iterationcount,id3,id4,columnname
cooking,1,id1234,5678,Notes2
Mary,1,id1234,5678,Notes2
enjoys,1,id1234,5678,Notes2
likes,1,id3456,9898,Notes2
bananas,1,id3456,9898,Notes2
She,1,id3456,9898,Notes2

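The tst.awk script above assumes that any field containing a comma is enclosed in double quotes, and that any double quote inside a quoted field is escaped by doubling it, per RFC 4180.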

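It also assumes that none of your fields can contain newlines. If any of them can, see "What's the most robust way to efficiently parse CSV using awk?" to learn what you need to do to handle them.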
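
As a rough sketch of one way to cope with embedded newlines (not the code from the linked question), physical lines can be joined until the number of double quotes is even, since an odd count means a quoted field is still open. The `join_records` name and the sample input are assumptions, and so is the choice to replace each embedded newline with a space. That choice should be harmless here only because the words are split on whitespace anyway. This part uses no gawk-specific features.

```shell
# join_records: read CSV on stdin and emit one logical record per line,
# replacing newlines inside quoted fields with a single space.
join_records() {
    awk '
    {
        # Append this physical line to the record being built
        rec = (inRec ? rec " " : "") $0
        # An odd number of double quotes means a quoted field is still open
        if ( gsub(/"/, "&", rec) % 2 ) { inRec = 1; next }
        inRec = 0
        print rec
    }'
}

# Hypothetical input: the Notes1 field of the data row spans two lines.
printf '%s\n' \
    'Notes1,Notes2,Id3,Id4' \
    '"play tennis,' \
    'some day",Mary enjoys cooking,id1234,5678' |
join_records
```

The output could then be piped into the script, for example `join_records < file.csv | awk -f tst.awk`.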
