How can I find lines in which a specific word is repeated N times?

Question 1

In perl, replace "this" with itself case-insensitively, and count the number of replacements:

$ perl -ne 's/(this)/$1/ig == 3 && print' <<EOF
How to get This line that this word repeated 3 times in THIS line?
But not this line which is THIS word repeated 2 times.
And I will get This line with this here and This one
A test line with four this and This another THIS and last this
EOF
How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

Using a count of matches instead:

perl -ne 'my $c = () = /this/ig; $c == 3 && print'

If you have GNU awk, there is a very simple way:

gawk -F'this' -v IGNORECASE=1 'NF == 4'

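The number of fields will be one more than the number of separators, so a line with three occurrences of "this" splits into four fields.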
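
The same field-counting idea can be sketched in Python. This is only an illustration: the helper name `count_by_split` is hypothetical and not part of the answer above. Splitting a line on the word yields one more piece than there are occurrences, so the count is the number of pieces minus one.

```python
import re

def count_by_split(line, word="this"):
    """Count occurrences of `word` by splitting on it, ignoring case.

    Hypothetical helper; it mirrors the gawk -F'this' approach.
    """
    # re.escape guards against regex metacharacters in the word
    fields = re.split(re.escape(word), line, flags=re.IGNORECASE)
    # n separators produce n + 1 fields
    return len(fields) - 1

lines = [
    "How to get This line that this word repeated 3 times in THIS line?",
    "But not this line which is THIS word repeated 2 times.",
]
for line in lines:
    # keep only lines that split into exactly 4 fields (3 occurrences)
    if count_by_split(line) == 3:
        print(line)
```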

Question 2

Assuming your source file is tmp.txt,

grep -iv '.*this.*this.*this.*this' tmp.txt | grep -i '.*this.*this.*this.*'

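The grep on the left outputs all lines in tmp.txt that do not contain 4 or more case-insensitive occurrences of "this".

The result is piped to the grep on the right, which outputs all lines from the left grep's result that contain 3 or more occurrences. Together, the two filters leave only the lines with exactly 3.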

Update: thanks to @Muru, here is a better version of this solution,

grep -Eiv '(.*this){4,}' tmp.txt | grep -Ei '(.*this){3}'

Replace 4 with n+1 and 3 with n.
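
To make the n and n+1 substitution concrete, here is a minimal Python sketch of the same two-stage filter. The function name `exactly_n` and its parameters are assumptions made for illustration. It builds both patterns from n and keeps a line only if it passes the "at least n" test and fails the "n+1 or more" test.

```python
import re

def exactly_n(lines, word="this", n=3):
    """Yield lines with at least n but fewer than n + 1 occurrences of word.

    Hypothetical helper; it mirrors
    grep -Eiv '(.*this){n+1,}' | grep -Ei '(.*this){n}'
    """
    w = re.escape(word)
    # doubled braces in an f-string produce literal { and }
    at_least_n = re.compile(f"(.*{w}){{{n}}}", re.IGNORECASE)
    more_than_n = re.compile(f"(.*{w}){{{n + 1},}}", re.IGNORECASE)
    for line in lines:
        if not more_than_n.search(line) and at_least_n.search(line):
            yield line

sample = [
    "How to get This line that this word repeated 3 times in THIS line?",
    "But not this line which is THIS word repeated 2 times.",
    "And I will get This line with this here and This one",
    "A test line with four this and This another THIS and last this",
]
for line in exactly_n(sample, n=3):
    print(line)
```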

Question 3

In Python, this does the job:

#!/usr/bin/env python3

s = """How to get This line that this word repeated 3 times in THIS line?
But not this line which is THIS word repeated 2 times.
And I will get This line with this here and This one
A test line with four this and This another THIS and last this"""

for line in s.splitlines():
    if line.lower().count("this") == 3:
        print(line)

Output:

How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

Or, to read from a file, passing the file as an argument:

#!/usr/bin/env python3
import sys

file = sys.argv[1]

with open(file) as src:
    lines = [line.strip() for line in src.readlines()]

for line in lines:
    if line.lower().count("this") == 3:
        print(line)

Paste the script into an empty file, save it as find_3.py, and run it with the command:
```
python3 /path/to/find_3.py <file_withlines>
```

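Of course the word "this" can be replaced by any other word (or any other string or part of a line), and the number of occurrences per line can be set to any other value, in the line: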

    if line.lower().count("this") == 3:
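
Since the word and the count are the only things that change, they can be pulled out into parameters. The following is a sketch only: the function name `lines_with_count` is hypothetical, and the `whole_word` option is an addition that is not in the original script. It uses `\b` so that "this" inside a longer word such as "thistle" is not counted, whereas the plain `count("this")` above counts substrings.

```python
import re

def lines_with_count(lines, word="this", n=3, whole_word=False):
    """Return the lines containing `word` exactly `n` times, ignoring case.

    Hypothetical helper; whole_word=True is an assumption beyond the
    original script, which counts plain substrings.
    """
    pattern = re.escape(word)
    if whole_word:
        # \b matches only at a boundary between word and non-word characters
        pattern = rf"\b{pattern}\b"
    regex = re.compile(pattern, re.IGNORECASE)
    # findall returns every non-overlapping match in the line
    return [line for line in lines if len(regex.findall(line)) == n]

print(lines_with_count(["this This THIS", "only this and THIS"], n=3))
```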

Edit

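If the file is large (hundreds of thousands or millions of lines), the code below will be faster; it reads the file line by line instead of loading the whole file at once: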

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    for line in src:
        if line.lower().count("this") == 3:
            print(line.strip())

Answer