How can I compare the same words across multiple files?

I want to count the words that several files have in common, and then show which files they appear in.

File 1:

This is so beautiful

File 2:

There are so beautiful

File 3:

so beautiful

Expected output 1:

so:3
beautiful:3

Expected output 2:

so:
file1:1
file2:1
file3:1

beautiful:
file1:1
file2:1
file3:1

Answer 1

Try this:

# Declare the files you want to include
files=( file* )

# Function to find common words in any number of files
wcomm() {
    # If no files provided, exit the function.
    [ $# -lt 1 ] && return 1
    # Extract words from first file, lowercased before sorting
    # because comm needs its inputs to be sorted
    local common_words=$(grep -o "\w*" "$1" | tr '[:upper:]' '[:lower:]' | sort -u)
    while [ $# -gt 1 ]; do
        # shift $1 to next file
        shift
        # Extract words from next file
        local next_words=$(grep -o "\w*" "$1" | tr '[:upper:]' '[:lower:]' | sort -u)
        # Get only words in common from $common_words and $next_words
        common_words=$(comm -12 <(echo "$common_words") <(echo "$next_words"))
    done
    # Output the words common to all input files
    echo "$common_words"
}

# Output number of matches for each of the common words in total and per file
for w in $(wcomm "${files[@]}"); do
    echo $w:$(grep -oiw "$w" "${files[@]}" | wc -l);
    for f in "${files[@]}"; do
        echo $f:$(grep -oiw "$w" "$f" | wc -l);
    done;
    echo;
done

Output:

beautiful:3
file1:1
file2:1
file3:1

so:3
file1:1
file2:1
file3:1

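Explanation:

Included as comments in the script.

Features:

  • Handles as many files as your ARG_MAX allows
  • Finds all words separated by anything grep treats as a word separator
  • Ignores case, so "beautiful" and "Beautiful" count as the same word.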
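
As an alternative to intersecting the files pair by pair with comm, the same set of common words can be found by counting how many files each word appears in. The following is only a minimal sketch under some assumptions: `words_in_all` is a hypothetical helper name, GNU grep is assumed for `\w`, and each file is assumed to be passed only once.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lowercase words that occur in every given file.
words_in_all() {
    local n=$# f
    for f in "$@"; do
        # Emit each distinct word of this file once, lowercased
        grep -o '\w\+' "$f" | tr '[:upper:]' '[:lower:]' | sort -u
    done |
    # A word that is listed n times was found in all n files
    sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}
```

It could be called as `words_in_all file*`, and its output could replace `wcomm` in the loop above.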

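Answer 2

Try this code and adjust it if needed. Note that it reports every word that occurs more than once in total, which is not necessarily a word present in every file.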

bash-4.1$ cat test.sh
#!/bin/bash

OUTPUT_FILE=/tmp/output.txt

awk '{
for(i=1;i<=NF;i++)
{
        Arr[$i]++
}
}
END{
for (i in Arr){
        if(Arr[i]>1)
        {
                print i":"Arr[i]
        }
}
}' file* > ${OUTPUT_FILE}

cat ${OUTPUT_FILE}
echo ""

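# Set IFS only for read, so the rest of the script is unaffected
while IFS=":" read -r WORD TOTAL_COUNT
do
        echo "${WORD}:"
        for FILE_NAME in file*
        do
                # -x matches whole lines and -F takes the word literally,
                # so "so" is not also counted inside "some"
                COUNT=$(tr ' ' '\n' < "${FILE_NAME}" | grep -cxF -- "${WORD}")
                if [ "${COUNT}" -gt 0 ]
                then
                        echo "${FILE_NAME}:${COUNT}"
                fi
        done
done < "${OUTPUT_FILE}"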


bash-4.1$ bash test.sh
beautiful:3
so:3

beautiful:
file1:1
file2:1
file3:1
so:
file1:1
file2:1
file3:1
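
The script above reads every file once in awk and then again for each word. A single awk pass can collect the totals and the per-file counts together, and can also restrict the report to words found in every file. This is a rough sketch rather than a drop-in replacement: `count_common` is a hypothetical name, words are split on whitespace only (so punctuation stays attached), and empty files are not counted as files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: totals and per-file counts for words present in every file.
count_common() {
    awk '
        FNR == 1 { files[++nfiles] = FILENAME }   # remember each file in order
        {
            for (i = 1; i <= NF; i++) {
                w = tolower($i)
                # First time this word is seen in the current file
                if (!((w, FILENAME) in per_file)) in_files[w]++
                per_file[w, FILENAME]++
                total[w]++
            }
        }
        END {
            for (w in total) {
                if (in_files[w] != nfiles) continue   # missing from some file
                print w ":" total[w]
                for (j = 1; j <= nfiles; j++)
                    print files[j] ":" per_file[w, files[j]]
                print ""
            }
        }
    ' "$@"
}
```

It could be run as `count_common file*`. The order in which awk visits the words is unspecified, so pipe the word list through sort first if a stable order matters.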

Answer 3

Use grep to get the words together with their file names, then awk to reformat the output into the desired result:

grep -Ho '\w\+' file* |
awk -F':' '{ words[$1 FS $2]++; seen[$2]++ }
END{ for (x in seen) {
         print x":" seen[x];
         for (y in words) {
            if (y ~ "\\<" x "\\>")print substr(y, 1, length(y)-length(x)), words[y]
         }
     }
}'

This gives you nicely formatted output like the following (both desired outputs in one go):

so:3
file1: 1
file2: 1
file3: 1
This:1
file1: 1
beautiful:3
file3: 1
file1: 1
file2: 1
There:1
file2: 1
are:1
file2: 1
is:1
file1: 1

Answer 4

If you don't want to write a script and just need a quick way to get the result, you can use this command:

cat list_of_words | while read -r line; do echo "$line"; grep -riE -c "$line" where_to_look_or_folder; done

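-r: search directories recursively
-i: ignore case
-E: use extended regular expressions, in case you want to search for something more complicated
-c: print the number of matching lines per file instead of the lines themselves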

Output:

word1
path/filename:count

Example:

cat text | while read -r line; do echo "$line"; grep -riE -c "$line" somewhere/nowhere; done
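
One thing to keep in mind is that grep -c counts matching lines, not occurrences, so a word that appears twice on one line is counted once. If the number of occurrences is what you need, something along these lines may work. It is only a sketch: `count_occurrences` is a hypothetical name, GNU grep is assumed, each line of the list is treated as a literal whole word, and the list file should live outside the directory being searched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count occurrences of each listed word under a directory.
count_occurrences() {
    local list=$1 where=$2 word
    while IFS= read -r word; do
        [ -n "$word" ] || continue   # skip blank lines in the list
        # -o prints one line per match, so wc -l counts occurrences;
        # -h drops file names, -w matches whole words, -F takes the word literally
        printf '%s:%s\n' "$word" \
            "$(grep -rhoiwF -- "$word" "$where" | wc -l | tr -d ' ')"
    done < "$list"
}
```

It could be called as `count_occurrences list_of_words where_to_look_or_folder`, and prints one `word:count` line per entry in the list.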
