How do I find keywords that frequently appear together across multiple files?

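I want to find keywords that are frequently associated with each other.

Example

A directory contains markdown files, and the last line of each file holds some keywords: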

$ tail -n 1 file1.md
#doctor #donkey #plants

$ tail -n 1 file2.md
#doctor #firework #university

$ tail -n 1 file3.md
#doctor #donkey #linux #plants

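Pseudo output

  • 100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
  • 50% of the files containing the keyword "#plants" also contain the keyword "#linux".
  • …

A shell script, an awk script, or just an explanation of how to accomplish this would be enough!

Any help would be greatly appreciated. Many thanks.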

Answer 1

Using GNU awk for arrays of arrays:

If the keywords were on the first line of each file, you could also use GNU awk's nextfile for efficiency:

$ cat tst.awk
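# The keywords are on the first line of each file.
FNR == 1 {
    for ( i=1; i<=NF; i++ ) {
        words[$i]++                 # number of files containing this keyword
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++         # count each co-occurrence in both directions
            pairs[$j][$i]++
        }
    }
    nextfile                        # skip the rest of the current file
}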
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}

$ awk -f tst.awk file*.md
100% of the files containing the keyword "#university" also contain the keyword "#doctor".
100% of the files containing the keyword "#university" also contain the keyword "#firework".
100% of the files containing the keyword "#plants" also contain the keyword "#donkey".
50% of the files containing the keyword "#plants" also contain the keyword "#linux".
100% of the files containing the keyword "#plants" also contain the keyword "#doctor".
100% of the files containing the keyword "#donkey" also contain the keyword "#plants".
50% of the files containing the keyword "#donkey" also contain the keyword "#linux".
100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
100% of the files containing the keyword "#linux" also contain the keyword "#plants".
100% of the files containing the keyword "#linux" also contain the keyword "#donkey".
100% of the files containing the keyword "#linux" also contain the keyword "#doctor".
33% of the files containing the keyword "#doctor" also contain the keyword "#university".
66% of the files containing the keyword "#doctor" also contain the keyword "#plants".
66% of the files containing the keyword "#doctor" also contain the keyword "#donkey".
33% of the files containing the keyword "#doctor" also contain the keyword "#linux".
33% of the files containing the keyword "#doctor" also contain the keyword "#firework".
100% of the files containing the keyword "#firework" also contain the keyword "#university".
100% of the files containing the keyword "#firework" also contain the keyword "#doctor".

Or, with the keywords on the last line, again relying on gawk, this time for ENDFILE:

$ cat tst.awk
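# gawk runs ENDFILE after the last record of each file, while $0 still holds that file's last line.
ENDFILE {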
    for ( i=1; i<=NF; i++ ) {
        words[$i]++
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++
            pairs[$j][$i]++
        }
    }
}
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}

$ awk -f tst.awk file*.md

Or, still with the keywords on the last line, but more efficiently using tail + gawk:

$ cat tst.awk
# Every input line is the last line of one file, as supplied by tail.
{
    for ( i=1; i<=NF; i++ ) {
        words[$i]++
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++
            pairs[$j][$i]++
        }
    }
}
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}

$ tail -qn1 file*.md | awk -f tst.awk
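
If GNU awk is not available, the same counting can probably be done with plain POSIX awk. The following is only a sketch under a few assumptions: the function name cooccur is made up for illustration, the keywords sit on the last line of each file and never contain awk's SUBSEP character, and a keyword is not repeated within one line. It replaces the arrays of arrays with a single array indexed by the pair (word1, word2), and calls tail once per file instead of relying on the -q option.

```shell
# cooccur is a hypothetical helper name, not part of the answer above.
# Usage: cooccur file1.md file2.md ...
cooccur() {
    # Feed only the last line of each file to awk.
    for f in "$@"; do
        tail -n 1 "$f"
    done |
    awk '
    {
        for ( i=1; i<=NF; i++ ) {
            words[$i]++                  # number of files containing this keyword
            for ( j=i+1; j<=NF; j++ ) {
                pairs[$i,$j]++           # POSIX awk joins the two subscripts with SUBSEP
                pairs[$j,$i]++
            }
        }
    }
    END {
        for ( key in pairs ) {
            split(key, w, SUBSEP)        # w[1] is word1, w[2] is word2
            pct = pairs[key] * 100 / words[w[1]]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, w[1], w[2]
        }
    }'
}
```

Calling `cooccur file*.md` should then print the same set of lines as the gawk versions, in an unspecified order.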
