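Find the most frequent letter/character combinations in a file
Rather than just looking for repeated words (à la "find the n most common words in a file"), I need to list every letter-combination string that recurs...
I want to record the most frequent letter/character combinations of any and all lengths in a file.
Example list: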
Stack
Exchange
Internet
Web
Question
Find
Frequent
Words
Combination
Letters
....
Resulting repeated letter combinations:
[a,b,c,d,e,f,g,i,k,l,m,n,o,q,r,s,t,u,w,x]
in
ue
st
tion
ion
on
ti
et
te
ter
...
Being able to list the results by number of occurrences = bonus :)
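Answer 1
"I need to list every letter-combination string that recurs..."
...so I had the script look at every possible length, from 1 letter up to the full line length (which is the word length, since the sample data has one word per line)...
File ssf.mawk: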
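#!/usr/bin/mawk -f
BEGIN {
    FS=""    # empty field separator: each character is a field, so NF is the line length
}
{
    _=tolower($0)    # ignore case
    # i is the start position and j the end position of each substring
    for(i=1;i<=NF;i++)
        for(j=i;j<=NF;j++)
            print substr(_,i,j-i+1) | "sort|uniq -c|sort -n"
}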
Abbreviated output from a run with the sample input:
$ printf '%s\n' Stack Exchange Internet Web Question Find Frequent Words Combination Letters .... | ./ssf.mawk
1 ....
1 ac
1 ack
1 an
1 ang
(((many lines omitted here)))
4 s
5 i
8 n
8 t
10 e
I tested this on Debian 8 with mawk-1.3.3 and gawk-4.1.1.
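The script above pipes every substring to sort from inside awk and relies on FS="" splitting a line into single characters, which mawk and gawk support but POSIX leaves unspecified. As a rough sketch of an alternative, assuming only a POSIX awk, the counts can be kept in an awk array and the line length taken from length(). The function name count_substrings is made up for this example and is not part of the answer:

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): count every substring of every
# input line in an awk array, then sort by count, lowest first.
count_substrings() {
    awk '
    {
        line = tolower($0)             # ignore case
        n = length(line)               # does not rely on FS=""
        for (i = 1; i <= n; i++)                    # start position
            for (len = 1; len <= n - i + 1; len++)  # substring length
                count[substr(line, i, len)]++
    }
    END {
        for (s in count)
            print count[s], s
    }' "$@" | sort -k1,1n -k2
}

# Example run on a few of the sample words.
printf '%s\n' Stack Exchange Internet Web | count_substrings | tail -n 3
```

The output has the same "count string" shape as the run shown above, only without the padding that uniq -c adds.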
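Answer 2
Any combination of at least two characters (change {2,$l} to {N,$l} for a minimum of N), ignoring case and taken line by line, can be found with something like: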
% < examplelist
Stack
Exchange
Internet
Web
Question
Find
Frequent
Words
Combination
Letters
% < examplelist perl -nlE '$_=lc; $l=length; next if $l < 2; m/(.{2,$l})(?{ $freq{$1}++ })^/; END { say "$freq{$_} $_" for keys %freq }' | sort -rg | head -4
3 in
2 ue
2 tion
2 tio
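The question asks only for combinations that recur, while a full frequency list also contains every substring seen once. A minimal sketch of a filter for "count string" lines like the ones above, assuming the count is always the first whitespace-separated field; only_repeated is a hypothetical name, not something the answer defines:

```shell
#!/bin/sh
# Hypothetical filter (not part of the answer): keep only "count string"
# lines whose count is at least 2, i.e. combinations that actually repeat.
only_repeated() {
    awk '$1 >= 2'
}

# Example: the line with count 1 is dropped.
printf '%s\n' '3 in' '2 ue' '1 ack' | only_repeated
```

This prints "3 in" and "2 ue". In the pipeline above it could sit between the perl command and sort -rg.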
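Answer 3
Here is a Perl script that sorts the output by number of occurrences. The minimum string length is configurable, and it includes a debug option to show what is going on.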
#!/usr/bin/perl
# Usage: perl script_file input_file
use strict;
my $min_str_len = 2;
my $debug = 0;
my %uniq_substrings;
while(<>)
{
    chomp;
    my $s = lc $_; # assign to $s for clarity
    printf STDERR qq|#- String: [%s]\n|, $s if $debug;
    my $line_len = length($s);
    for my $len ($min_str_len .. $line_len)
    {
        printf STDERR qq|# Length: %u\n|, $len if $debug;
        # break string into characters
        my @c = split(//,$s);
        # iterate over list while large enough to provide strings of $len characters
        while(@c>=$len)
        {
            my $substring = join('', @c[0..$len-1]);
            my $curr_count = ++$uniq_substrings{$substring};
            printf STDERR qq|%s (%s)\n|, $substring, $curr_count if $debug;
            shift @c;
        }
    }
}
sub mysort
{
    # sort by count, subsort by alphabetic
    my $retval =
        ($uniq_substrings{$b} <=> $uniq_substrings{$a})
        || ($a cmp $b);
    return $retval;
}
for my $str (sort(mysort keys %uniq_substrings))
{
    printf qq|%s = %u\n|, $str, $uniq_substrings{$str};
}
Answer 4
Script:
MIN=2
MAX=5
while read A; do
    [ ${MAX} -lt ${#A} ] && max=${MAX} || max=${#A}
    for LEN in $(seq ${MIN} ${max}); do
        for k in $(seq 0 $((${#A}-${LEN}))); do
            echo "${A:$k:${LEN}}"
        done
    done
done <<< "$(cat file1|tr 'A-Z' 'a-z')" |sort|uniq -c|sort -k1,7rn -k9
The same script with some explanation:
# minimum length of a letter combination
MIN=2
# maximum length of a letter combination
MAX=5
# process the input line by line
while read A; do
    # determine the maximum combination length for this line:
    # it is the line length when the line is shorter than MAX
    [ ${MAX} -lt ${#A} ] && max=${MAX} || max=${#A}
    # loop over every possible combination length for this line
    for LEN in $(seq ${MIN} ${max}); do
        # loop over every start offset that leaves room for LEN characters
        for k in $(seq 0 $((${#A}-${LEN}))); do
            # print one letter combination
            echo "${A:$k:${LEN}}"
        done
    done
done <<< "$(cat file1|tr 'A-Z' 'a-z')" |sort|uniq -c|sort -k1,7rn -k9
# the data is read from the file "file1" and converted to lowercase,
# the combinations are sorted and identical lines are counted,
# then the results are sorted numerically by count, highest first,
# and combinations with the same count are sorted alphabetically
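The ${A:$k:${LEN}} substring expansion in the inner loop is a bash extension, so the script needs bash rather than a plain POSIX sh. As a hedged sketch of the same sliding window for shells without it, cut -c can take the slice instead. The helper name substrings_of is invented for this example, and it assumes single-byte (ASCII) input such as the sample list:

```shell
#!/bin/sh
# Hypothetical helper (not from the answer): print every substring of
# length $2 taken from word $1, using only POSIX shell features and cut.
substrings_of() {
    word=$1
    len=$2
    start=1
    last=$(( ${#word} - len + 1 ))    # last valid 1-based start position
    while [ "$start" -le "$last" ]; do
        # cut -c takes an inclusive 1-based character range
        printf '%s\n' "$word" | cut -c "$start-$(( start + len - 1 ))"
        start=$(( start + 1 ))
    done
}

# Example: the three windows of length 3 in "stack".
substrings_of stack 3
```

For "stack" and length 3 this prints sta, tac and ack, one per line, which are the values the inner loop of the script yields for k = 0, 1, 2. Spawning cut once per substring is slow on large inputs, so this is only meant to show the windowing.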
With MIN=2 and MAX=5, the first 30 lines of output (the full output has 152 lines):
3 in
2 er
2 et
2 io
2 ion
2 nt
2 on
2 qu
2 que
2 st
2 te
2 ter
2 ti
2 tio
2 tion
2 ue
1 ac
1 ack
1 an
1 ang
1 ange
1 at
1 ati
1 atio
1 ation
1 bi
1 bin
1 bina
1 binat
1 ch
...
With MIN=1 and MAX=3, the first 20 lines of output (the full output has 109 lines):
10 e
8 n
8 t
5 i
4 o
4 r
4 s
3 a
3 c
3 in
2 b
2 d
2 er
2 et
2 f
2 io
2 ion
2 nt
2 on
2 q
...