Punctuation problem when using grep to get n words around a token

Question

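You can't get GNU grep -o to output the same text (like your "meaning n words before the" or "and n words after the") twice. You can, however, do that with pcregrep by using -o<n>, where n is the n-th capture group, and capturing what is matched inside a look-ahead operator (which doesn't advance the cursor for the next match):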

$ pcregrep -o0 -o2  '(\w+\W+){0,5}token(?=((\W+\w+){0,5}))' file
This is a token, but when any punctuation is
n words around a specific token, meaning n words before the
meaning n words before the token and n words after the
and n words after the token. There is no fix pattern

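-o0 is the whole text matched, and -o2 is what is matched by the capture group inside the look-ahead operator, that is, the (here) part of (....)(?=(here)).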

Note that on an input like:

6 5 4 3 2 1 token token 1 2 3 4 5 6

it gives:

5 4 3 2 1 token token 1 2 3 4
token 1 2 3 4 5

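That is because it starts looking for the second match after the end of the first match (which ends at the first token), so it finds 0 words before the second token.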
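
The same non-overlapping behaviour can be reproduced outside pcregrep. Below is a minimal sketch in Python, assuming its re module treats the look-ahead capture the same way for this pattern; the name contexts is made up for illustration, and the repeated groups are written as non-capturing, so the look-ahead capture is group 1 here rather than group 2.

```python
import re

# Same idea as the pcregrep command: the words after the token are captured
# inside a look-ahead, so they are not consumed by the match.
# The repeats use non-capturing groups, so the look-ahead capture is group 1.
PATTERN = re.compile(r'(?:\w+\W+){0,5}token(?=((?:\W+\w+){0,5}))')

def contexts(text):
    """Return the matched text plus the look-ahead capture for each match."""
    return [m.group(0) + m.group(1) for m in PATTERN.finditer(text)]

for line in contexts("6 5 4 3 2 1 token token 1 2 3 4 5 6"):
    print(line)
```

The second match can only start after the first token, which is why no words are reported in front of the second one.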

Wrapping the whole expression in a look-ahead instead, so that matches can overlap:

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 |
   pcregrep -o1  '(?=((\w+\W+){0,5}token(\W+\w+){0,5}))\w*'
5 4 3 2 1 token token 1 2 3 4
4 3 2 1 token token 1 2 3 4 5
3 2 1 token token 1 2 3 4 5
2 1 token token 1 2 3 4 5
1 token token 1 2 3 4 5
token token 1 2 3 4 5
token 1 2 3 4 5

This is probably not what you want either (even though each "token" is preceded and followed by up to 5 words).

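To generate one line for each occurrence of "token" with up to 5 words on either side, I don't think it can easily be done with pcregrep alone.

You'd need to record the position of each "token" word, and then match up-to-5-words<that-position>"token"up-to-5-words for each of those positions.

Something like: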

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 | perl -lne '
    my @positions; push @positions, $-[0] while /\btoken\b/g;
    for $o (@positions) {
      print $& if /(\w+\W+){0,5}(?<=^.{$o})token(\W+\w+){0,5}/
    }'
5 4 3 2 1 token token 1 2 3 4
4 3 2 1 token token 1 2 3 4 5

Or, to clarify which token is matched in each case:

$ echo 6 5 4 3 2 1 token token 1 2 3 4 5 6 | perl -lne '
    my @positions; push @positions, $-[0] while /\btoken\b/g;
    for $o (@positions) {
      print "$1<token>$3" if /((\w+\W+){0,5})(?<=^.{$o})token((\W+\w+){0,5})/
    }'
5 4 3 2 1 <token> token 1 2 3 4
4 3 2 1 token <token> 1 2 3 4 5

(I expect it can be simplified/optimised.)
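
For comparison, the position-based approach can also be sketched in Python. This is only an illustration under assumptions, not a translation of the perl code: token_contexts and its parameters are hypothetical names, and it slices the text around each occurrence instead of using a look-behind on the recorded offset.

```python
import re

def token_contexts(text, token="token", n=5):
    """Return one (before, after) pair per whole-word occurrence of token."""
    # Up to n words (each followed by non-word characters) ending right
    # where the token starts.
    before_re = re.compile(r'(?:\w+\W+){0,%d}\Z' % n)
    # Up to n words (each preceded by non-word characters) starting right
    # after the token.
    after_re = re.compile(r'(?:\W+\w+){0,%d}' % n)
    pairs = []
    for m in re.finditer(r'\b' + re.escape(token) + r'\b', text):
        before = before_re.search(text[:m.start()]).group(0)
        after = after_re.match(text[m.end():]).group(0)
        pairs.append((before, after))
    return pairs

for before, after in token_contexts("6 5 4 3 2 1 token token 1 2 3 4 5 6"):
    print(f"{before}<token>{after}")
```

Each occurrence is handled independently, so neighbouring tokens can appear in each other's context. Searching the whole prefix for every occurrence is simple but not efficient on long lines.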

Answer 1