确认版本2

确认版本2

我想我可能只是搜索错误,但我没有找到任何答案。如果有重复的,请告诉我,我可以把它删除。

问题背景

我在用着ack关联),其底层有 Perl 5,用于获取 n 元语法 - 特别是高阶 n 元语法。我可以使用我知道的语法(基本上最多$9)获得最多 9 克,但我无法获得 10 克。使用$10只会给我$1一个0后缀。类似的事情$(10)${10}没有解决问题。我是不是对使用语言建模工具包的解决方案感兴趣,我想使用ack.

我使用的一个数据集是马克·吐温的全集

( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt)。

我已经把事情解析干净了(参见解析注释在帖子的末尾)并将解析结果保存为TWAIN_Mark_complete_parsed.txt.

我从 2-grams 中得到的很好,代码和部分结果是

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) +(?=(\S+) +)' \
    --output '$1 $2' | \
    sort | uniq -c | \
    sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
  18176 of the
  13288 in the

一直到 9 克,

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
    sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
     17 to mrs jane clemens and mrs moffett in st
     17 mrs jane clemens and mrs moffett in st louis

(注意,我对命令进行元编程ack,而不是仅仅键入每一个命令。)

问题/我尝试过的

我第一次尝试 10 克以及结果是

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
    sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
     17 to mrs jane clemens and mrs moffett in st to0
     17 mrs jane clemens and mrs moffett in st louis mrs0

为了更好地了解正在发生的事情,

diff -u <(head -n 2 Twain_10grams.txt) <(head -n 2 Twain_9grams.txt) |色差|差异突出显示

参见这个答案(和这条评论)了解如何通过逐字差异突出显示来获取彩色差异的详细信息。基本上apt还是yumcolordiff,然后pipdiff-highlight

使用$(10)代替$10给出前两行输出

     17 to mrs jane clemens and mrs moffett in st $(10)
     17 mrs jane clemens and mrs moffett in st louis $(10)

(两分钟后)。

使用${10}代替$10给出前两行输出

     17 to mrs jane clemens and mrs moffett in st ${10}
     17 mrs jane clemens and mrs moffett in st louis ${10}

我的想法就到此为止了。

预期/期望输出

请注意,有统计(非常实际输出与此处显示的输出不同的可能性(非零且有限)。 9-grams 的前两个结果并不是不同的单词序列。更常见的 10 克的其他可能部分可以通过查看前 10 个最常见的 9 克来找到 - 使用head代替head -n 2。即便如此,我相当确定即使这样也不能保证我们拥有两个最常见的 10 克。然而,我希望我能够清楚地表达出我想要实现的目标。

17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu

编辑我已经找到了另一组将预期输出更改为(可能不是实际输出,而是将其从我之前使用的简单模型更改为的输出。)

     17 to mrs jane clemens and mrs moffett in st louis
      7 happiness in his home had been wounded and bruised almost

这将是head -n 2我一直用来展示我得到的结果的。

我不想通过我将在这里使用的相同过程来获取它。

$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      3 mrs jane clemens and mrs moffett in st louis honolulu
      2 mrs jane clemens and mrs moffett in st louis san
      2 mrs jane clemens and mrs moffett in st louis no
      2 mrs jane clemens and mrs moffett in st louis 224
      1 mrs jane clemens and mrs moffett in st louis wash
      1 mrs jane clemens and mrs moffett in st louis wailuku
      1 mrs jane clemens and mrs moffett in st louis virginia
      1 mrs jane clemens and mrs moffett in st louis the
      1 mrs jane clemens and mrs moffett in st louis sept
      1 mrs jane clemens and mrs moffett in st louis on
      1 mrs jane clemens and mrs moffett in st louis hartford
      1 mrs jane clemens and mrs moffett in st louis carson

编辑用于查找较新的第二名频率的代码是

$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
      1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      1 home had been wounded and bruised almost to death thirdly
      1 home had been wounded and bruised almost to death secondly
      1 home had been wounded and bruised almost to death it
      1 home had been wounded and bruised almost to death fourthly
      1 home had been wounded and bruised almost to death first
      1 home had been wounded and bruised almost to death fifthly
      1 home had been wounded and bruised almost to death and

从评论中编辑

@Inian 做得很棒评论

这记录在发行说明中 -github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md-现在,您只能使用以下变量:$1 到 $9、$、$.、$&、$`、$' 和 $+_

为了未来的人,我要放一个版本,今天存档, 的RELEASE-NOTES

man的页面确实有ack以下几行

$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".

但我希望有办法获得更高的数字。根据来自 的信息RELEASE-NOTES,这种希望似乎基本破灭了。

然而,我仍然想知道是否有人有解决方法或 hack,无论是使用ack还是任何更“标准”的 *NIX 类型终端工具。我的偏好,按顺序是perl,,,,。如果有类似的东西(即只是命令行解析,grepawksedack不是基于 NLP 工具包的解决方案),我对此也很感兴趣。

我认为最好将其作为一个新问题提出。如果你在这里回答,那就太好了。如果我最终发布新问题,我会将链接放在这里:目前,这只是同一问题的链接


解析注释

为了让我的语料库准备好进行 n 元语法分析,这是我的解析。

tr [:upper:] [:lower:] < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
 tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
 sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
 awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:

是的,这一切都可以在一行上(并且没有尾随&& :),但这可以更轻松地阅读并解释为什么我正在做我正在做的事情。


系统详情

$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free

答案1

尽管我不是 Perl 专家,但这里有一个可能的 hack。看着全合一的源文件,似乎ack只处理$输出字符串中的单个字符。更改此设置以接受多个字符无疑是可行的,但为了保持简单,您可以0..9使用abc....例如,我进行了这些更改以接受$a$b作为$10$11(显示为diff -u

@@ -188,7 +188,7 @@
         $opt_output =~ s/\\r/\r/g;
         $opt_output =~ s/\\t/\t/g;
 
-        my @supported_special_variables = ( 1..9, qw( _ . ` & ' +  f ) );
+        my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' +  f ) );
         @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;
 
         # If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
                 # on them not changing in the process of doing the s///.
 
                 my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output;
+                $keep{a} = $10;
+                $keep{b} = $11;
                 $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
                 $keep{f} = $filename if exists $keep{f};
                 my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );

但是,如果您只想进行第 10 场比赛,则可以使用$+如下所示最后一个成功搜索模式的最后一个括号匹配的文本

答案2

三种替代解决方案:

确认版本2

看来在 ack 版本 2 中变量$10 $11等是有效的:

$ echo 'abcdefghijklmn' | 
  ack '(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)' \
  --output '$1 $2 $3 $11'

a b c k

$ ack --version
ack 2.24
Running under Perl 5.28.1 at /usr/bin/perl

其中,要得到重叠字符串将是:

echo 'abcdefghijklmn' |
    ack '(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))' \
    --output '$1 $2 $3 $11'
a b c k
b c d l
c d e m
d e f n

Perl5

但是,可以通过以下方式直接在 Perl 中完成相同的操作:

echo 'abcdefghijklmn' | 
    perl -ne 'while($_ =~ /(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))/g ){
        print $1," ",$2," ",$11," ","\n" }'
a b k
b c l
c d m
d e n

因此,要查找并打印单词(由一个或多个空格分隔):

echo "word1 word2 word3 word4 word5 word6" |
    perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+))/g ){$,=" ";print $1,$2,$3,$4,"\n" }'

word1 word2 word3 word4 
word2 word3 word4 word5 
word3 word4 word5 word6

打印的行有一个尾随空格(希望您不介意)。

Perl6

:ov或者你可以尝试使用(overlap) 修饰符的Perl6 (Raku) :

echo "one two three four five" | 
    perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**2 <|w> /, :ov); say @var.join("\n") ;'

one two three
two three four
three four five

通过更改单个数字,将匹配其他计数:

echo "one two three four five" | 
perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**3 <|w> /, :ov); say @var.join("\n") ;'

one two three four
two three four five

结果

使用 perl5 结果将是:

perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))/g ){
 $,=" ";
 print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,"\n" 
}' TWAIN_Mark_complete_parsed.txt | 
    sort | 
    uniq -c | 
    sort -rn >Twain_10grams5.txt

请注意,Perl6 无法完成(内存太多)如此大的测试文本(Perl6 仍然太新)。使用 ack 比 perl5 慢得多,但文件是相同的。

head -n 10 Twain_10grams5.txt
     17 to mrs jane clemens and mrs moffett in st louis 
      8 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 
      7 in his home had been wounded and bruised almost to 
      7 his home had been wounded and bruised almost to death 
      7 happiness in his home had been wounded and bruised almost 
      6 shelley's happiness in his home had been wounded and bruised 
      5 was by the social fireside in the time of the 
      5 thing indeed if you would like to listen to it 
      5 laughable thing indeed if you would like to listen to 
      5 it was in this way that he found out that

相关内容