我想我可能只是搜索错误,但我没有找到任何答案。如果有重复的,请告诉我,我可以把它删除。
问题背景
我在用着ack
(关联),其底层有 Perl 5,用于获取 n 元语法 - 特别是高阶 n 元语法。我可以使用我知道的语法(基本上最多$9
)获得最多 9 克,但我无法获得 10 克。使用$10
只会给我$1
一个0
后缀。类似的事情$(10)
并${10}
没有解决问题。我是不是对使用语言建模工具包的解决方案感兴趣,我想使用ack
.
我使用的一个数据集是马克·吐温的全集
( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt
)。
我已经把事情解析干净了(参见解析注释在帖子的末尾)并将解析结果保存为TWAIN_Mark_complete_parsed.txt
.
我从 2-grams 中得到的很好,代码和部分结果是
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) +(?=(\S+) +)' \
--output '$1 $2' | \
sort | uniq -c | \
sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
18176 of the
13288 in the
一直到 9 克,
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
17 to mrs jane clemens and mrs moffett in st
17 mrs jane clemens and mrs moffett in st louis
(注意,我对命令进行元编程ack
,而不是仅仅键入每一个命令。)
问题/我尝试过的
我第一次尝试 10 克以及结果是
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
sort | uniq -c | sort -rn > Twain_10grams.txt
$ head -n 2 Twain_10grams.txt
17 to mrs jane clemens and mrs moffett in st to0
17 mrs jane clemens and mrs moffett in st louis mrs0
为了更好地了解正在发生的事情,
参见这个答案(和这条评论)了解如何通过逐字差异突出显示来获取彩色差异的详细信息。基本上apt
还是yum
为colordiff
,然后pip
为diff-highlight
。
使用$(10)
代替$10
给出前两行输出
17 to mrs jane clemens and mrs moffett in st $(10)
17 mrs jane clemens and mrs moffett in st louis $(10)
(两分钟后)。
使用${10}
代替$10
给出前两行输出
17 to mrs jane clemens and mrs moffett in st ${10}
17 mrs jane clemens and mrs moffett in st louis ${10}
我的想法就到此为止了。
预期/期望输出
请注意,有是统计(非常实际输出与此处显示的输出不同的可能性(非零且有限)。 9-grams 的前两个结果并不是不同的单词序列。更常见的 10 克的其他可能部分可以通过查看前 10 个最常见的 9 克来找到 - 使用head
代替head -n 2
。即便如此,我相当确定即使这样也不能保证我们拥有两个最常见的 10 克。然而,我希望我能够清楚地表达出我想要实现的目标。
17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu
编辑我已经找到了另一组将预期输出更改为(可能不是实际输出,而是将其从我之前使用的简单模型更改为的输出。)
17 to mrs jane clemens and mrs moffett in st louis
7 happiness in his home had been wounded and bruised almost
这将是head -n 2
我一直用来展示我得到的结果的。
我不想通过我将在这里使用的相同过程来获取它。
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
17 to mrs jane clemens and mrs moffett in st louis
$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
3 mrs jane clemens and mrs moffett in st louis honolulu
2 mrs jane clemens and mrs moffett in st louis san
2 mrs jane clemens and mrs moffett in st louis no
2 mrs jane clemens and mrs moffett in st louis 224
1 mrs jane clemens and mrs moffett in st louis wash
1 mrs jane clemens and mrs moffett in st louis wailuku
1 mrs jane clemens and mrs moffett in st louis virginia
1 mrs jane clemens and mrs moffett in st louis the
1 mrs jane clemens and mrs moffett in st louis sept
1 mrs jane clemens and mrs moffett in st louis on
1 mrs jane clemens and mrs moffett in st louis hartford
1 mrs jane clemens and mrs moffett in st louis carson
编辑用于查找较新的第二名频率的代码是
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
1 home had been wounded and bruised almost to death thirdly
1 home had been wounded and bruised almost to death secondly
1 home had been wounded and bruised almost to death it
1 home had been wounded and bruised almost to death fourthly
1 home had been wounded and bruised almost to death first
1 home had been wounded and bruised almost to death fifthly
1 home had been wounded and bruised almost to death and
从评论中编辑
@Inian 做得很棒评论:
这记录在发行说明中 -github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md-现在,您只能使用以下变量:$1 到 $9、$、$.、$&、$`、$' 和 $+_
为了未来的人,我要放一个版本,今天存档, 的RELEASE-NOTES
man
的页面确实有ack
以下几行
$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".
但我希望有办法获得更高的数字。根据来自 的信息RELEASE-NOTES
,这种希望似乎基本破灭了。
然而,我仍然想知道是否有人有解决方法或 hack,无论是使用ack
还是任何更“标准”的 *NIX 类型终端工具。我的偏好,按顺序是perl
,,,,。如果有类似的东西(即只是命令行解析,grep
awk
sed
ack
不是基于 NLP 工具包的解决方案),我对此也很感兴趣。
我认为最好将其作为一个新问题提出。如果你在这里回答,那就太好了。如果我最终发布新问题,我会将链接放在这里:目前,这只是同一问题的链接。
解析注释
为了让我的语料库准备好进行 n 元语法分析,这是我的解析。
tr [:upper:] [:lower:] < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:
是的,这一切都可以在一行上(并且没有尾随&& :
),但这可以更轻松地阅读并解释为什么我正在做我正在做的事情。
系统详情
$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name: Microsoft Windows 10 Enterprise
Version: 10.0.17134 N/A Build 17134
Manufacturer: Microsoft Corporation
Configuration: Member Workstation
Build Type: Multiprocessor Free
答案1
尽管我不是 Perl 专家,但这里有一个可能的 hack。看着全合一的源文件,似乎ack
只处理$
输出字符串中的单个字符。更改此设置以接受多个字符无疑是可行的,但为了保持简单,您可以0..9
使用abc...
.例如,我进行了这些更改以接受$a
和$b
作为$10
和$11
(显示为diff -u
)
@@ -188,7 +188,7 @@
$opt_output =~ s/\\r/\r/g;
$opt_output =~ s/\\t/\t/g;
- my @supported_special_variables = ( 1..9, qw( _ . ` & ' + f ) );
+ my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' + f ) );
@special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;
# If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
# on them not changing in the process of doing the s///.
my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output;
+ $keep{a} = $10;
+ $keep{b} = $11;
$keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
$keep{f} = $filename if exists $keep{f};
my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );
但是,如果您只想进行第 10 场比赛,则可以使用$+
如下所示最后一个成功搜索模式的最后一个括号匹配的文本。
答案2
三种替代解决方案:
确认版本2
看来在 ack 版本 2 中变量$10
$11
等是有效的:
$ echo 'abcdefghijklmn' |
ack '(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)' \
--output '$1 $2 $3 $11'
a b c k
$ ack --version
ack 2.24
Running under Perl 5.28.1 at /usr/bin/perl
其中,要得到重叠字符串将是:
echo 'abcdefghijklmn' |
ack '(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))' \
--output '$1 $2 $3 $11'
a b c k
b c d l
c d e m
d e f n
Perl5
但是,可以通过以下方式直接在 Perl 中完成相同的操作:
echo 'abcdefghijklmn' |
perl -ne 'while($_ =~ /(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))/g ){
print $1," ",$2," ",$11," ","\n" }'
a b k
b c l
c d m
d e n
因此,要查找并打印单词(由一个或多个空格分隔):
echo "word1 word2 word3 word4 word5 word6" |
perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+))/g ){$,=" ";print $1,$2,$3,$4,"\n" }'
word1 word2 word3 word4
word2 word3 word4 word5
word3 word4 word5 word6
打印的行有一个尾随空格(希望您不介意)。
Perl6
:ov
或者你可以尝试使用(overlap) 修饰符的Perl6 (Raku) :
echo "one two three four five" |
perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**2 <|w> /, :ov); say @var.join("\n") ;'
one two three
two three four
three four five
通过更改单个数字,将匹配其他计数:
echo "one two three four five" |
perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**3 <|w> /, :ov); say @var.join("\n") ;'
one two three four
two three four five
结果
使用 perl5 结果将是:
perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))/g ){
$,=" ";
print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,"\n"
}' TWAIN_Mark_complete_parsed.txt |
sort |
uniq -c |
sort -rn >Twain_10grams5.txt
请注意,Perl6 无法完成(内存太多)如此大的测试文本(Perl6 仍然太新)。使用 ack 比 perl5 慢得多,但文件是相同的。
head -n 10 Twain_10grams5.txt
17 to mrs jane clemens and mrs moffett in st louis
8 ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
7 in his home had been wounded and bruised almost to
7 his home had been wounded and bruised almost to death
7 happiness in his home had been wounded and bruised almost
6 shelley's happiness in his home had been wounded and bruised
5 was by the social fireside in the time of the
5 thing indeed if you would like to listen to it
5 laughable thing indeed if you would like to listen to
5 it was in this way that he found out that