Counting occurrences of a parenthesized regular expression

2024-5-26

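I am trying to count the occurrences of a regular expression that contains recursive parenthesized expressions. In my particular case, I want to count, per line or per file, the occurrences of (NP *) (VP *) (NP *). My example file contains the following (line 4 has a recursive case):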

$ more mini.example 
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (XP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement) (NP (NN opposition)) (VP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) (NP gouvernement (NP (NN opposition)) (VP et) (NP gouvernement))  </parse>
    <parse> (NP (NN opposition)) (VP et) (FP gouvernement) (NP (NN opposition)) (RP et) (NP gouvernement) </parse>
    <parse> (NP (NN opposition)) (VP et) </parse>
    <parse> (VP et) (NP gouvernement) </parse>

I would like output like this:

I tried this:

$ grep -Pon '(?<=\(NP ).*(?=\).*(?<=\(VP ).*(?=\).*(?<=\(NP ).*(?=\))))' mini.example | cut -d : -f 1 | uniq -c | sort -k 1

But the output is:

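This is not what I want. It only counts the first part of the pattern, even when the whole pattern does not match, and it does not check the recursion. Thanks for any help.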

Answer 1

Maybe something like this:

grep -nPo '(?=(\((?:[^()]++|(?1))*\)) (?=\(VP)(?1) (?=\(NP)(?1))\(NP' mini.example |
 cut -d: -f1 | uniq -c

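That is, it matches a (NP provided that it is the start of a (NP *) (VP *) (NP *) sequence, where PCRE recursive matching handles each (...) part (the (\((?:[^()]++|(?1))*\)) part comes straight from the pcrepattern man page). Because the whole sequence is checked inside a lookahead and only (NP is consumed, occurrences nested inside another group, such as the one on line 4 of the example, are counted as well.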
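For readers without a grep built with PCRE support, the same counting logic can be sketched in Python. This is not the grep approach itself: the standard re module has no recursion, so the sketch swaps in a plain depth counter to find the end of each balanced (...) group. The function names balanced_end, starts_sequence, count_np_vp_np and count_per_line are made up for this illustration, and like the regex it assumes the groups are separated by exactly one space.

```python
def balanced_end(s, i):
    """Return the index just past the balanced group that opens at s[i],
    or -1 if s[i] is not '(' or the group is never closed."""
    if i >= len(s) or s[i] != "(":
        return -1
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "(":
            depth += 1
        elif s[j] == ")":
            depth -= 1
            if depth == 0:
                return j + 1
    return -1


def starts_sequence(line, start, labels=("(NP", "(VP", "(NP")):
    """True if balanced groups opening with the given labels follow one
    another from `start`, separated by a single space (an assumption
    mirroring the literal spaces in the regex)."""
    pos = start
    for k, label in enumerate(labels):
        if k > 0:
            if not line.startswith(" ", pos):
                return False
            pos += 1
        if not line.startswith(label, pos):
            return False
        pos = balanced_end(line, pos)
        if pos == -1:
            return False
    return True


def count_np_vp_np(line):
    """Count every '(NP' that begins a (NP *) (VP *) (NP *) sequence.
    Nested and overlapping starts are all tested, as with the lookahead."""
    count = 0
    start = line.find("(NP")
    while start != -1:
        if starts_sequence(line, start):
            count += 1
        start = line.find("(NP", start + 1)
    return count


def count_per_line(lines):
    """Return (line number, count) pairs, skipping lines with no match."""
    result = []
    for number, line in enumerate(lines, 1):
        count = count_np_vp_np(line)
        if count:
            result.append((number, count))
    return result
```

Calling count_per_line(open("mini.example")) on the example file from the question should give 3 for line 1 and 2 for each of lines 2 to 4, with lines 5 to 7 left out because they contain no complete sequence.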
