How do I compare the same word across multiple files?

Answer 1

Try this:

# Declare the files you want to include
files=( file* )

# Function to find common words in any number of files
wcomm() {
    # If no files provided, exit the function.
    [ $# -lt 1 ] && return 1
    local common_words next_words
    # Extract words from first file; lowercase them before sorting so that
    # comm receives sorted, duplicate-free input
    common_words=$(grep -o '\w\+' "$1" | tr '[:upper:]' '[:lower:]' | sort -u)
    while [ $# -gt 1 ]; do
        # shift $1 to next file
        shift
        # Extract words from next file
        next_words=$(grep -o '\w\+' "$1" | tr '[:upper:]' '[:lower:]' | sort -u)
        # Get only words in common from $common_words and $next_words
        common_words=$(comm -12 <(echo "$common_words") <(echo "$next_words"))
    done
    # Output the words common to all input files
    echo "$common_words"
}

# Output number of matches for each of the common words in total and per file
for w in $(wcomm "${files[@]}"); do
    echo "$w:$(grep -oiw "$w" "${files[@]}" | wc -l)"
    for f in "${files[@]}"; do
        echo "$f:$(grep -oiw "$w" "$f" | wc -l)"
    done
    echo
done

Output:

beautiful:3
file1:1
file2:1
file3:1

so:3
file1:1
file2:1
file3:1

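Explanation:

Included as comments in the script.

Features:

Handles as many files as your ARG_MAX allows.
Finds every word delimited by anything grep treats as a word separator.
Ignores case, so "beautiful" and "Beautiful" count as the same word.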
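
The intersection step that the script relies on can be tried in isolation. The following is only a minimal sketch: the helper name words_of, the temporary file names, and the sample sentences are assumptions made up for illustration, and it uses plain files instead of process substitution so it also runs in a POSIX shell.

```shell
# Hypothetical helper: print the lowercased, sorted, unique words of one file.
words_of() {
    grep -o '[[:alnum:]_]\{1,\}' "$1" | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

# Made-up sample input, not the original poster's files.
tmp=$(mktemp -d)
printf 'This is so Beautiful\n' > "$tmp/a.txt"
printf 'There are so many beautiful things\n' > "$tmp/b.txt"

words_of "$tmp/a.txt" > "$tmp/a.words"
words_of "$tmp/b.txt" > "$tmp/b.words"

# comm -12 suppresses columns 1 and 2, leaving only lines present in both inputs.
# Both inputs must be sorted in the same collation order, hence LC_ALL=C throughout.
common=$(LC_ALL=C comm -12 "$tmp/a.words" "$tmp/b.words")
echo "$common"

rm -rf "$tmp"
```

Lowercasing before sort -u matters here: comm expects sorted input, and changing case after sorting can both reorder the lines and introduce duplicates.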

Answer 2

Try this code. Adjust it if needed.

bash-4.1$ cat test.sh
#!/bin/bash

OUTPUT_FILE=/tmp/output.txt

awk '{
for(i=1;i<=NF;i++)
{
        Arr[$i]++
}
}
END{
for (i in Arr){
        if(Arr[i]>1)
        {
                print i":"Arr[i]
        }
}
}' file* > ${OUTPUT_FILE}

cat ${OUTPUT_FILE}
echo ""

while IFS=":" read -r WORD TOTAL_COUNT
do
        echo "${WORD}:"
        for FILE_NAME in file*
        do
                # Split on any whitespace, then count exact whole-word matches only
                COUNT=$(tr -s '[:space:]' '\n' < "${FILE_NAME}" | grep -cxF -- "${WORD}")
                if [ "${COUNT}" -gt 0 ]
                then
                        echo "${FILE_NAME}:${COUNT}"
                fi
        done
done < "${OUTPUT_FILE}"


bash-4.1$ bash test.sh
beautiful:3
so:3

beautiful:
file1:1
file2:1
file3:1
so:
file1:1
file2:1
file3:1
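
Note that the awk step in this script keeps any word whose total count is greater than 1, so a word repeated inside a single file also passes. If the goal is words shared by every file, one possible variation is to count each word once per file. This is a sketch under that assumption; the function name words_in_all_files and the sample files are hypothetical.

```shell
# Hypothetical helper: list words that occur in every file given as an argument.
words_in_all_files() {
    awk -v nfiles="$#" '
    {
        for (i = 1; i <= NF; i++) {
            # Count each (word, file) pair only once.
            if (!seen[$i, FILENAME]++) files_with[$i]++
        }
    }
    END {
        # Keep only the words that were seen in all of the files.
        for (w in files_with)
            if (files_with[w] == nfiles) print w
    }' "$@" | LC_ALL=C sort
}

# Made-up sample input for illustration.
tmp=$(mktemp -d)
printf 'so so beautiful\n' > "$tmp/file1"
printf 'so beautiful today\n' > "$tmp/file2"
printf 'so so so\n' > "$tmp/file3"

shared=$(words_in_all_files "$tmp"/file*)
echo "$shared"

rm -rf "$tmp"
```

Here "beautiful" appears in only two of the three files, so only "so" is reported. The sketch assumes each file is passed exactly once, since it identifies files by FILENAME.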

Answer 3

Use grep to supply the words and file names, then awk to reformat the output into the desired result:

grep -Ho '\w\+' file* |
awk -F':' '{ words[$1 FS $2]++; seen[$2]++ }
END{ for (x in seen) {
         print x":" seen[x];
         for (y in words) {
            if (substr(y, length(y) - length(x)) == (FS x)) print substr(y, 1, length(y)-length(x)), words[y]
         }
     }
}'

This gives you nicely formatted output like the following, producing the desired result in a single pass:

so:3
file1: 1
file2: 1
file3: 1
This:1
file1: 1
beautiful:3
file3: 1
file1: 1
file2: 1
There:1
file2: 1
are:1
file2: 1
is:1
file1: 1
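
To see what the awk stage receives, it can help to run the grep stage on its own. The sketch below assumes two small sample files whose contents are invented to be consistent with the output above; the real input files are not shown in the original. It uses a bracket expression in place of \w so that it does not depend on GNU extensions to the pattern syntax.

```shell
# Made-up sample input, chosen to match the words shown in the output above.
tmp=$(mktemp -d)
printf 'This is so beautiful\n' > "$tmp/file1"
printf 'There are so beautiful\n' > "$tmp/file2"

# -H prefixes every match with its file name and -o prints one match per line,
# so each output line has the form "filename:word".
pairs=$(cd "$tmp" && grep -Ho '[[:alnum:]_]\{1,\}' file*)
echo "$pairs"

rm -rf "$tmp"
```

Each line splits cleanly on ":" into a file name and a word, which is why the awk script uses -F':'. File names that themselves contain a colon would break that assumption.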

Answer 4

If you don't want to write code and just want a quick way to see the result, you can use this command:

while read -r line; do echo "$line"; grep -riEc -- "$line" where_to_look_or_folder; done < list_of_words

-r: recurse into directories
-i: ignore case
-E: use extended regular expressions, in case you want to search for something more complicated
-c: print a count of matching lines per file

Output:

word1
path/filename:count

Example:

while read -r line; do echo "$line"; grep -riEc -- "$line" somewhere/nowhere; done < text
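
One caveat about the -c option used in the command above: it counts matching lines, not individual matches, so a word that appears several times on one line is counted once. A small sketch of the difference, using an invented sample file:

```shell
# Made-up sample input for illustration.
tmp=$(mktemp -d)
printf 'so so so\nnot here\nSo what\n' > "$tmp/sample.txt"

# -c counts matching lines: only the first and third lines contain the word.
line_count=$(grep -ciw 'so' "$tmp/sample.txt")

# -o prints each match on its own line; counting those lines gives occurrences.
# tr strips the padding that some wc implementations add.
match_count=$(grep -oiw 'so' "$tmp/sample.txt" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"

rm -rf "$tmp"
```

If per-occurrence totals are what you need, piping grep -o into wc -l is the safer choice.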
