I want to find keywords that frequently occur together.
Example
A directory contains markdown files, and the last line of each file holds some keywords:
$ tail -n 1 file1.md
#doctor #donkey #plants
$ tail -n 1 file2.md
#doctor #firework #university
$ tail -n 1 file3.md
#doctor #donkey #linux #plants
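Pseudo output
- 100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
- 50% of the files containing the keyword "#plants" also contain the keyword "#linux".
- ...
A shell script, an awk script, or just an explanation of how to achieve this would be enough!
Any help would be greatly appreciated. Thank you very much.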
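Answer 1
Using GNU awk for arrays of arrays (a gawk extension). If the keywords were on the first line of each file, you could also use GNU awk's nextfile for efficiency, so that the rest of each file is never read: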
$ cat tst.awk
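FNR == 1 {                              # the first line of each file holds the keywords
    for ( i=1; i<=NF; i++ ) {
        words[$i]++                     # number of files containing this keyword
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++             # count the co-occurrence in both directions
            pairs[$j][$i]++
        }
    }
    nextfile                            # skip the rest of this file
}
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            # share of the files containing word1 that also contain word2
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}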
$ awk -f tst.awk file*.md
100% of the files containing the keyword "#university" also contain the keyword "#doctor".
100% of the files containing the keyword "#university" also contain the keyword "#firework".
100% of the files containing the keyword "#plants" also contain the keyword "#donkey".
50% of the files containing the keyword "#plants" also contain the keyword "#linux".
100% of the files containing the keyword "#plants" also contain the keyword "#doctor".
100% of the files containing the keyword "#donkey" also contain the keyword "#plants".
50% of the files containing the keyword "#donkey" also contain the keyword "#linux".
100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
100% of the files containing the keyword "#linux" also contain the keyword "#plants".
100% of the files containing the keyword "#linux" also contain the keyword "#donkey".
100% of the files containing the keyword "#linux" also contain the keyword "#doctor".
33% of the files containing the keyword "#doctor" also contain the keyword "#university".
66% of the files containing the keyword "#doctor" also contain the keyword "#plants".
66% of the files containing the keyword "#doctor" also contain the keyword "#donkey".
33% of the files containing the keyword "#doctor" also contain the keyword "#linux".
33% of the files containing the keyword "#doctor" also contain the keyword "#firework".
100% of the files containing the keyword "#firework" also contain the keyword "#university".
100% of the files containing the keyword "#firework" also contain the keyword "#doctor".
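Since the goal is to find keywords that are frequently associated, it may help to rank this report and drop the weak associations. The following is only a minimal sketch, assuming the report format shown above, where every line starts with the percentage. The function name top_pairs and its threshold argument are hypothetical and not part of the original answer.

```shell
# top_pairs MIN: read report lines on stdin, keep those whose leading
# percentage is at least MIN, and print them from highest to lowest.
# Hypothetical helper; it only assumes each line starts with "<number>%".
top_pairs() {
    # "100%" + 0 evaluates to 100 in awk, because only the numeric prefix is used
    awk -v min="$1" '($1 + 0) >= (min + 0)' | sort -rn
}
```

It could then be used as awk -f tst.awk file*.md | top_pairs 50 to list only the associations that hold in at least half of the files. Note that the order of lines sharing the same percentage depends on sort and the current locale.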
Or, with the keywords on the last line, again relying on gawk, this time for ENDFILE:
$ cat tst.awk
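ENDFILE {                               # runs after each file; $0 is still that file's last line
    for ( i=1; i<=NF; i++ ) {
        words[$i]++                     # number of files containing this keyword
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++             # count the co-occurrence in both directions
            pairs[$j][$i]++
        }
    }
}
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            # share of the files containing word1 that also contain word2
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}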
$ awk -f tst.awk file*.md
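Or, still with the keywords on the last line, but more efficiently using tail + gawk, so that awk only ever sees the last line of each file: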
$ cat tst.awk
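{                                       # every input line is the last line of one file
    for ( i=1; i<=NF; i++ ) {
        words[$i]++                     # number of files containing this keyword
        for ( j=i+1; j<=NF; j++ ) {
            pairs[$i][$j]++             # count the co-occurrence in both directions
            pairs[$j][$i]++
        }
    }
}
END {
    for ( word1 in pairs ) {
        for ( word2 in pairs[word1] ) {
            # share of the files containing word1 that also contain word2
            pct = pairs[word1][word2] * 100 / words[word1]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
        }
    }
}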
$ tail -qn1 file*.md | awk -f tst.awk
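If gawk is not available, the same counting can probably be done in any POSIX awk by replacing the arrays of arrays with a single array indexed by a composite key. The sketch below is an adaptation under that assumption, not the answer's original script. The function name cooccur is hypothetical, and it expects one keyword line per file on stdin.

```shell
# cooccur: read one keyword line per file on stdin and print the
# co-occurrence report. Hypothetical portable variant: pairs[a, b] joins
# the two keywords with SUBSEP instead of using gawk's pairs[a][b].
cooccur() {
    awk '
    {
        for ( i=1; i<=NF; i++ ) {
            words[$i]++                 # number of lines containing this keyword
            for ( j=i+1; j<=NF; j++ ) {
                pairs[$i, $j]++         # count the co-occurrence in both directions
                pairs[$j, $i]++
            }
        }
    }
    END {
        for ( key in pairs ) {
            split(key, w, SUBSEP)       # w[1] and w[2] are the two keywords
            pct = pairs[key] * 100 / words[w[1]]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, w[1], w[2]
        }
    }'
}
```

It would be called in the same way as the last variant, for example tail -qn1 file*.md | cooccur. As with the gawk versions, the order of the output lines is unspecified, and a keyword repeated on the same line would be counted more than once.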