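I have a file containing multiple lines. For each word that appears anywhere in the file, I want to know how many lines contain that word. For example: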
0 hello world the man is world
1 this is the world
2 a different man is the possible one
The result I expect is:
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
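Note that the count for "world" is 2, not 3, because the word appears on 2 lines. For that reason, simply converting spaces to newlines and counting is not an exact solution.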
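To illustrate the distinction, here is a minimal Python sketch using the sample input from the question. The names `total_occurrences` and `lines_containing` are illustrative only and do not come from the original post.

```python
from collections import Counter

# The three sample lines from the question.
lines = [
    "0 hello world the man is world",
    "1 this is the world",
    "2 a different man is the possible one",
]

# Counting every occurrence: this is what you get after turning
# spaces into newlines and counting the resulting words.
total_occurrences = Counter(word for line in lines for word in line.split())

# Counting each word at most once per line: the result asked for.
lines_containing = Counter(word for line in lines for word in set(line.split()))

print(total_occurrences["world"])  # 3
print(lines_containing["world"])   # 2
```

The only difference between the two counters is the `set(...)` call, which removes duplicates within a line before counting.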
Answer 1
Another Perl variant, using List::Util:
$ perl -MList::Util=uniq -alne '
map { $h{$_}++ } uniq @F }{ for $k (sort keys %h) {print "$k: $h{$k}"}
' file
0: 1
1: 1
2: 1
a: 1
different: 1
hello: 1
is: 3
man: 2
one: 1
possible: 1
the: 3
this: 1
world: 2
Answer 2
Straightforward in bash:
declare -A wordcount
while read -ra words; do
    # unique words on this line
    declare -A uniq
    for word in "${words[@]}"; do
        uniq[$word]=1
    done
    # accumulate the words
    for word in "${!uniq[@]}"; do
        ((wordcount[$word]++))
    done
    unset uniq
done < file
Looking at the data:
$ declare -p wordcount
declare -A wordcount='([possible]="1" [one]="1" [different]="1" [this]="1" [a]="1" [hello]="1" [world]="2" [man]="2" [0]="1" [1]="1" [2]="1" [is]="3" [the]="3" )'
And formatting it as needed:
$ printf "%s\n" "${!wordcount[@]}" | sort | while read key; do echo "$key:${wordcount[$key]}"; done
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
Answer 3
Here is a very simple Perl script:
#!/usr/bin/perl -w
use strict;
my %words = ();
while (<>) {
    chomp;
    my %linewords = ();
    map { $linewords{$_}=1 } split / /;
    foreach my $word (keys %linewords) {
        $words{$word}++;
    }
}
foreach my $word (sort keys %words) {
    print "$word:$words{$word}\n";
}
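The basic idea is to loop over the input. For each line, split it into words and store those words in a hash (associative array) to remove any duplicates, then loop over the keys of that hash and add one to the overall counter for each word. Finally, report the words and their counts.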
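Answer 4
Another simple alternative is to use Python (3.6 or later, because of the f-strings). This solution has the same problem that @Larry mentioned in his comment.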
from collections import Counter
with open("words.txt") as f:
    c = Counter(word for line in [line.strip().split() for line in f] for word in set(line))
for word, occurrence in sorted(c.items()):
    print(f'{word}:{occurrence}')
    # for Python 2.7.x compatibility you can replace the above line with
    # the following one:
    # print('{}:{}'.format(word, occurrence))
A more explicit version of the above:
from collections import Counter

FILENAME = "words.txt"

def find_unique_words():
    with open(FILENAME) as f:
        lines = [line.strip().split() for line in f]
    unique_words = Counter(word for line in lines for word in set(line))
    return sorted(unique_words.items())

def print_unique_words():
    unique_words = find_unique_words()
    for word, occurrence in unique_words:
        print(f'{word}:{occurrence}')

def main():
    print_unique_words()

if __name__ == '__main__':
    main()
Output:
0:1
1:1
2:1
a:1
different:1
hello:1
is:3
man:2
one:1
possible:1
the:3
this:1
world:2
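The above also assumes that words.txt is in the same directory as script.py. Note that this is not very different from the other solutions provided here, but perhaps someone will find it useful.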
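One caveat about that assumption: `open("words.txt")` resolves the name against the current working directory, so it only finds the file when the script is run from its own directory. Below is a hedged sketch of one way around this. The helper names `count_lines_per_word` and `default_input` are hypothetical and not part of the answer above.

```python
from collections import Counter
from pathlib import Path


def count_lines_per_word(path):
    """Return a Counter mapping each word to the number of lines containing it."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            # set() drops repeats within the line, so each line adds at most 1 per word.
            counts.update(set(line.split()))
    return counts


def default_input(script_file, name="words.txt"):
    """Return the path of `name` located next to `script_file` (hypothetical helper)."""
    return Path(script_file).resolve().parent / name
```

Inside script.py this could be called as `count_lines_per_word(default_input(__file__))`, and the sorted items printed exactly as before. Reading the file line by line also avoids holding every split line in memory at once.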