How do I count the occurrences of each word from a file in all "n" files passed as arguments?

Question 1

I would do:

#! /bin/sh -
# usage: wordcount <file-with-words-to-search-for> [<file>...]
# Split the word list into one word per line and drop empty lines.
words=$(tr -s '[:space:]' '[\n*]' < "${1?No word list provided}" | grep .)
[ -n "$words" ] || exit

shift
for file do
  printf 'File: %s\n' "$file"
  # One word per line, keep only exact matches from the list, then count them.
  tr -s '[:space:]' '[\n*]' < "$file" | grep -Fxe "$words" | sort | uniq -c | sort -rn
done

(This only gives counts for the words that are found at least once in each file.)
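If words with a count of zero should be listed as well, one possible variation is to let awk keep the counters. This is only a sketch, not part of the answer above: `count_words` is a hypothetical helper name, it splits on awk's default field separators rather than on `[:space:]`, and it assumes the word list is not empty and that neither file name contains an `=` sign.

```shell
# count_words is a hypothetical helper (not from the original answer).
# $1: file containing the words to search for; $2: file to scan.
count_words() {
  awk '
    # While reading the first file, register every listed word with count 0.
    NR == FNR { for (i = 1; i <= NF; i++) count[$i] += 0; next }
    # While reading the second file, count only words that are in the list.
    { for (i = 1; i <= NF; i++) if ($i in count) count[$i]++ }
    # Print "count word" for every listed word, including zero counts.
    END { for (w in count) printf "%d %s\n", count[w], w }
  ' "$1" "$2" | sort -k1,1rn -k2,2
}
```

It could be called inside the loop as `count_words "$wordfile" "$file"`, where `wordfile` would have to be saved from `$1` before the `shift`.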

Question 2

You can iterate over the list of files supplied on the command line like this:

for file in "$@"
do
    echo "Considering file ==> $file <=="
done

Your approach to matching words should work well. You can also search for occurrences of a word with grep -o:

echo 'I can cry cryogenic tears when I scry my hands. Can you cry too?' |
    grep -o '\bcry\b'    # \b marks a word boundary

Pipe its output into wc -l to get the number of occurrences in the input stream.
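As a small sketch of that pipeline: `count_word` below is a hypothetical helper name, and both `-o` and `\b` are extensions found in GNU grep rather than POSIX features. The word is used as a regular expression, so this assumes it contains no special characters.

```shell
# count_word is a hypothetical helper: reads standard input and prints
# how many times the word given as $1 occurs as a whole word.
count_word() {
    # grep -o prints each match on its own line; wc -l counts those lines.
    grep -o "\b$1\b" | wc -l
}

echo 'I can cry cryogenic tears when I scry my hands. Can you cry too?' |
    count_word cry
```

Only the two standalone instances of "cry" are counted; "cryogenic" and "scry" do not match because of the word boundaries.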

Using $( ... ) lets the output of one command be inserted into the text used by another command. For example:

echo "The date and time right now is $(date)"

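We need a little extra work to avoid searching the first file and to use it as the word list instead. Putting it all together, you might end up with something like this: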

wordfile="$1"
wordlist=($(cat "$wordfile"))   # bash array, filled by word-splitting the file
shift

for file in "$@"
do
    for word in "${wordlist[@]}"
    do
        # echo "$file: $word:" $(grep -o "\b${word}\b" "$file" | wc -l)  # My way
        echo "$file: $word:" $(tr ' ' '\n' <"$file" | grep -c "$word")   # Your way
    done
done

It is not very efficient, because for N words it searches each file N times. You may find grep -f helpful.
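A minimal sketch of a single-pass version with grep -f follows. It is one possible reading of that hint, not a definitive implementation: `count_from_list` is a hypothetical helper name, the word list is assumed to hold one word per line with no blank lines, and `-o` and `-w` are extensions (available in GNU and BSD grep) rather than POSIX options.

```shell
# count_from_list is a hypothetical helper.
# $1: word list file (one word per line); $2: file to scan.
count_from_list() {
    # -f FILE: read the patterns from the word list
    # -F: treat them as fixed strings, -w: match whole words only
    # -o: print every match on its own line so that uniq -c can count them
    grep -oFwf "$1" "$2" | sort | uniq -c | sort -rn
}
```

Inside the loop it would be called as `count_from_list "$wordfile" "$file"`, so each file is read once regardless of how many words are in the list. Like the first script, it only reports words that occur at least once.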
