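Given a small set of words (say 3 to be concrete, but more generally n), I want to search a file for places where two of those words occur close to each other. By close, I mean the two words are at most k characters apart, where k is some constant.
Rationale: I am searching my inbox (/var/spool/mail/username) for a particular email that contains certain keywords. I am not sure exactly how those keywords appear. Any one of the words on its own is fairly common, but two of them close together is much rarer.
A concrete motivating example: "aluminium", "luggage", "storage". In this case I am looking for an email about luggage.
A solution that scales well in terms of n and k would be best.
Some pointers on how to apply this to multiple files would be helpful.
I don't care what language the solution uses.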
Answer 1
You might consider:
1) glark, which has an option:
( expr1 --and=NUM expr2 )
Match both of the two expressions, within NUM lines of each other.
2) bool, with expressions like:
bool -O0 -C0 -D5 -b "two near three"
3) peg, which accepts options like:
peg "/x/ and near(sub { /y/ or /Y/ }, 5)"
The code for glark is at https://github.com/jpace/glark and it may also be packaged in some repositories.
Some details for bool and peg:
bool print context matching a boolean expression (man)
Path : ~/executable/bool
Version : 0.2.1
Type : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help : probably available with -h,--help
Home : https://www.gnu.org/software/bool/ (doc)
peg Perl version of grep, q.v. (what)
Path : ~/bin/peg
Version : 3.10
Length : 4749 lines
Type : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Repo : Debian 8.9 (jessie)
Home : http://piumarta.com/software/peg/ (pm)
Home : http://www.cpan.org/authors/id/A/AD/ADAVIES/peg-3.10 (doc)
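Best wishes ... cheers, drl
Answer 2
Start with a stemming tool such as https://linux.die.net/man/1/hunspell, then match with a regular expression using https://linux.die.net/man/1/grep, then use wc, sort and uniq to rank the matches by how close the words are.
Pseudo-bash: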
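WORDS=$1                      # space-separated keywords, e.g. "luggage storage"
HAYSTACK=/var/mail
# Assumption: hunspell -s reads words on stdin and prints "word stem" lines; keep the stem
STEMS=$(printf '%s\n' $WORDS | hunspell -s | awk 'NF { print $NF }')
REGEX=$(echo $STEMS | perl -pe 's/ /.*/g')
while IFS= read -r MATCH ; do
    FILE=$(echo "$MATCH" | cut -d: -f1)
    # Length of the matched span, from the first stem to the last
    COUNT=$(echo "$MATCH" | cut -d: -f2- | perl -pe 's/.*('"$REGEX"').*/$1/' | wc -c)
    printf '%s\t%s\n' "$COUNT" "$FILE"
done < <(grep -rP "$REGEX" "$HAYSTACK") |
sort -n                       # shortest span (closest words) first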
If you want it to be faster, you could use https://linux.die.net/man/1/locate. To limit the space between the words, use a regular expression such as:
a.{1,50}b
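As a minimal sketch of that bounded-gap idea, the helpers below build the pattern for two words in either order and hand it to grep. The function names near_regex and near_grep are made up for illustration, the words are assumed to contain no regex metacharacters, and the match is limited to a single line, so words separated by a line break are not found.

```shell
# near_regex WORD1 WORD2 K  (hypothetical helper)
# Print an extended regex matching both words, in either order,
# with at most K characters between them on one line.
near_regex() {
    printf '%s.{0,%d}%s|%s.{0,%d}%s\n' "$1" "$3" "$2" "$2" "$3" "$1"
}

# near_grep WORD1 WORD2 K [GREP-ARGS...]  (hypothetical helper)
# Run a case-insensitive grep with that regex on the given files, or on stdin.
near_grep() {
    local re
    re=$(near_regex "$1" "$2" "$3")
    shift 3
    grep -E -i -e "$re" "$@"
}
```

For example, `near_grep luggage storage 50 /var/spool/mail/username` prints the matching lines. Passing several files makes grep prefix each line with its file name, which is one way to cover multiple mailboxes.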
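Answer 3
I like the idea of grepmail (at our shop we wrote a utility called rapgrep, which requires all patterns to match, for the general case).
This snippet demonstrates an answer that is more specific about character distance, looking for the words country, men, time: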
# Utility functions: print-as-echo, print-line-with-visual-space.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
FILE=${1-data1}   # input file; the default data1 is an assumption, the snippet never sets FILE
pl " Input data file $FILE:"
head "$FILE"
pl " Results, egrep:"
egrep 'time|men|country' $FILE
pl " Results, egrep, with byte offset:"
egrep -b 'time|men|country' $FILE
pl " Results, egrep, with byte offset, matches only:"
egrep -o -b 'time|men|country' $FILE |
tee t1
pl " Looking for minimum distance between all pairs:"
awk -F":" '
{ a[$2] = $1 # Compare every item to the new item
for ( b in a ) {
for ( c in a ) {
# print " Working on b = ",b," c = ",c
if ( b != c ) {
v0 = a[c]-a[b]
v1 = v0 < 0 ? -v0 : v0 # convert to > 0
v2 = (b < c) ? b " " c : c " " b # trivial sort of names
print v1, v2
}
}
}
}
' t1 |
datamash -t" " -s --group 2,3 min 1
producing:
-----
Input data file data1:
Now is the time
for all good men
to come to the aid
of their country.
-----
Results, egrep:
Now is the time
for all good men
of their country.
-----
Results, egrep, with byte offset:
0:Now is the time
16:for all good men
52:of their country.
-----
Results, egrep, with byte offset, matches only:
11:time
29:men
61:country
-----
Looking for minimum distance between all pairs:
country men 32
country time 50
men time 18
And for a slightly more complex file, in which some of the words occur several times:
-----
Input data file data2:
Now is the time men
for all good men
to come to the aid
of their men country.
-----
Results, egrep:
Now is the time men
for all good men
of their men country.
-----
Results, egrep, with byte offset:
0:Now is the time men
20:for all good men
56:of their men country.
-----
Results, egrep, with byte offset, matches only:
11:time
16:men
33:men
65:men
69:country
-----
Looking for minimum distance between all pairs:
country men 4
country time 58
men time 5
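This uses the byte-offset option (-b) of GNU grep. The awk program computes the distances between pairs of words, and datamash then sorts, groups, and selects the minimum distance for each pair.
This could fairly easily be parameterized to take the words, as well as the allowed distance, on the command line. See the file t1 for the form of the data fed to the awk program, whose output is in turn the input to datamash.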
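One way that parameterization might look is sketched below. The function name near_pairs and its argument order are made up for illustration, and the minimum is taken inside awk rather than with datamash so there is one dependency fewer. As in the snippet above, the distance is measured between the byte offsets at which the two words start, and the words are assumed to contain no colons or regex metacharacters.

```shell
# near_pairs K FILE WORD...  (hypothetical helper)
# Print "distance word1 word2" for every pair of distinct words whose
# closest occurrences start at most K bytes apart in FILE.
near_pairs() {
    local k=$1 file=$2 pattern
    shift 2
    pattern=$(IFS='|'; printf '%s' "$*")    # join the words into one alternation
    grep -E -o -b -e "$pattern" "$file" |
    awk -F: -v k="$k" '
        {
            last[$2] = $1                   # latest byte offset seen for this word
            for (w in last) {
                if (w == $2) continue
                d = $1 - last[w]            # offsets only increase, so d >= 0
                key = (w < $2) ? w " " $2 : $2 " " w
                if (!(key in best) || d < best[key]) best[key] = d
            }
        }
        END {
            for (key in best)
                if (best[key] <= k) print best[key], key
        }
    ' | sort -n
}
```

Called as `near_pairs 10 data2 time men country` on the second sample file, this should report only the two pairs that are within 10 bytes of each other. To cover several files, the call can be wrapped in a loop over the file names.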
Run on a system like:
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.9 (jessie)
bash GNU bash 4.3.30
grep (GNU grep) 2.20
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
datamash (GNU datamash) 1.2
Best wishes ... cheers, drl