How to make regular expressions less greedy in AWK?

Answer 1

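If you want to select @ and everything up to the first , after it, you need to specify it as @[^,]*,

That is: @ followed by any number (*) of non-commas ([^,]), followed by a comma (,).

This approach works as the equivalent of @.*?, but not for things like @.*?string, where what follows is more than a single character. Negating a character is easy, but negating a string in a regexp is much harder.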
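As a quick check of the difference, the sketch below runs both patterns on a made-up line. The sample text and the two function names are illustrative assumptions, not part of the original question.

```shell
# Greedy: .* runs to the LAST comma on the line.
strip_greedy() { awk '{ sub(/@.*,/, ""); print }'; }

# Negated class: [^,]* cannot cross a comma, so the match ends at the FIRST one.
strip_to_first_comma() { awk '{ sub(/@[^,]*,/, ""); print }'; }

printf 'a@b,c,d\n' | strip_greedy           # prints: ad
printf 'a@b,c,d\n' | strip_to_first_comma   # prints: ac,d
```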

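A different approach is to pre-process your input so that string is replaced by, or prefixed with, a character that otherwise does not occur in your input: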

gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
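The three calls above can be wrapped into a runnable filter, sketched here under the assumption that the byte \001 never occurs in the input. The function name strip_until_string and the sample line are hypothetical; the marker is kept in a variable and the pattern is built as a dynamic regexp rather than written with \1 inside a regexp literal.

```shell
# Remove everything from "@" through the first following "string".
# Assumes the byte \001 never occurs in the input.
strip_until_string() {
    awk '
    BEGIN { m = "\001" }                         # marker character
    {
        gsub(/string/, m "&")                    # pre-process: mark each "string"
        gsub("@[^" m "]*" m "string", "")        # [^m]* cannot cross a marker
        gsub(m, "")                              # revert the pre-processing
        print
    }'
}

printf 'a@b string c string d\n' | strip_until_string   # prints: a c string d
```

Because the negated class stops at the first marker, the match ends at the first occurrence of string, which is what @.*?string would do in a regexp flavour with lazy quantifiers.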

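If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism: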

gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
                   # in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
                   # as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences

gsub(/@[^\2]*\2string/, "")

# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")
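Put together as a filter, the escaping variant might look like the sketch below. The function and variable names are hypothetical, and it assumes the input contains no NUL bytes; \001 and \002 may appear in the data and are restored afterwards.

```shell
# Same removal as before, but safe when \001 or \002 already occur in the input.
strip_until_string_safe() {
    awk '
    BEGIN { esc = "\001"; mark = "\002"; e3 = "\003"; e4 = "\004" }
    {
        gsub(esc,  esc e3)                        # escape the escape character itself
        gsub(mark, esc e4)                        # escape any pre-existing marker
        gsub(/string/, mark "&")                  # mark each occurrence of "string"
        gsub("@[^" mark "]*" mark "string", "")   # remove up to the first mark
        gsub(mark, "")                            # drop the remaining marks
        gsub(esc e4, mark)                        # restore original \002 characters
        gsub(esc e3, esc)                         # restore original \001 characters
        print
    }'
}

printf 'x\001@y\002 string z\n' | strip_until_string_safe | od -c
```

In this sample the \001 before the @ survives untouched, and the \002 that sits inside the removed span does not cut the match short, because it is escaped before the marks are inserted.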

That works for fixed strings, but not for arbitrary regexps such as the equivalent of @.*?foo.bar.

Answer 2

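There are already several good answers providing work-arounds for awk's inability to do non-greedy matches, so this answer covers an alternative: Perl Compatible Regular Expressions (PCRE). Note that most simple "match and print" awk scripts can easily be re-implemented in perl using the -n option, and more complex scripts can be converted with the a2p Awk-to-Perl translator.

Perl has a non-greedy operator, which can be used in Perl scripts and in anything that uses PCRE. For example, it is also implemented in GNU grep's -P option.

PCRE is not identical to Perl's regular expressions, but it is very close. It is a popular choice of regular expression library for many programs because it is very fast, and because the Perl enhancements to extended regular expressions are very useful.

From the perlre(1) man page: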

   By default, a quantified subpattern is "greedy", that is, it will match
   as many times as possible (given a particular starting location) while
   still allowing the rest of the pattern to match.  If you want it to
   match the minimum number of times possible, follow the quantifier with
   a "?".  Note that the meanings don't change, just the "greediness":

       *?        Match 0 or more times, not greedily
       +?        Match 1 or more times, not greedily
       ??        Match 0 or 1 time, not greedily
       {n}?      Match exactly n times, not greedily (redundant)
       {n,}?     Match at least n times, not greedily
       {n,m}?    Match at least n but not more than m times, not greedily

Answer 3

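This is an old post, but the following information may be useful to others.

There is a way, admittedly crude, to perform non-greedy RE matching in awk. The basic idea is to use the match(string, RE) function and progressively reduce the size of the string until the match fails, something like (untested):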

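if (match(string, RE)) {
    rstart = RSTART
    # stop once the truncated string no longer matches at rstart
    for (i=RLENGTH; i>=1; i--)
        if (!match(substr(string,1,rstart+i-1), RE) || RSTART != rstart) break;
    # At this point, the non-greedy match will start at rstart
    #  for a length of i+1 (assuming RE cannot match the empty string)
}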
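The shrinking idea can be turned into a small filter, sketched below under a few assumptions: the regexp never matches the empty string, the names lazy_sub, lazy_match, LSTART and LLENGTH are hypothetical, and "non-greedy" here means the shortest match that begins where the leftmost greedy match begins.

```shell
# Delete the shortest match of the regexp given as $1 from each input line.
# Note: awk processes backslash escapes in values assigned with -v.
lazy_sub() {
    awk -v re="$1" '
    # Sets LSTART/LLENGTH to the shortest match starting at the leftmost
    # match position; returns LSTART, or 0 when there is no match.
    # Assumes re cannot match the empty string.
    function lazy_match(s, re,    start, len, i) {
        LSTART = 0; LLENGTH = -1
        if (!match(s, re)) return 0
        start = RSTART; len = RLENGTH
        # Try shorter and shorter prefixes until the match at start is lost.
        for (i = len - 1; i >= 1; i--)
            if (!match(substr(s, 1, start + i - 1), re) || RSTART != start) break
        LSTART = start; LLENGTH = i + 1
        return LSTART
    }
    {
        if (lazy_match($0, re))
            $0 = substr($0, 1, LSTART - 1) substr($0, LSTART + LLENGTH)
        print
    }'
}

printf 'a@b string c string d\n' | lazy_sub '@.*string'   # prints: a c string d
```

Each line costs up to one match() call per character of the greedy match, so this is only reasonable for short lines or occasional use.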

Answer 4

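There is no way to do non-greedy matching in awk. You may be able to get the desired output anyway, though. sch's suggestion will work for that line. If you can't rely on a comma, but "Author" is always the start of what you want, you could do this: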

awk '{ sub(/@.*Author/,"Author"); print }'

If the number of characters preceding Author is always the same, you could do this:

awk '{ sub(/@.{21}/,""); print }'

You just need to know what your data looks like across the whole data set.
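To illustrate the first command, here it is applied to a made-up line. The original data set is not shown, so the sample text is purely an assumption.

```shell
# Hypothetical input line; the real data may look different.
printf 'Title @id=42 rev=7 Author: Jane\n' |
    awk '{ sub(/@.*Author/,"Author"); print }'    # prints: Title Author: Jane
```

Since .* is greedy, a line containing "Author" more than once would be cut back to the last occurrence, which is one reason to check the data first.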
