Searching a file for a substring that occurs more than once

Answer 1

These offer possible solutions to several different variations of your question:

<your-file grep '^possible[[:digit:]]' | sort | uniq -d

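will give you a lexically sorted list of the duplicated lines that start with possible<digit>.

grep selects the lines that start with possible followed by at least one digit.
sort sorts the result so that duplicates are adjacent (uniq requires this).
uniq -d reports the duplicates.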
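
As a minimal sketch of how the three stages fit together, the snippet below runs the same pipeline on a small made-up sample. The file contents and the variable names sample and result are assumptions for illustration, not taken from the question.

```shell
# Hypothetical sample input; only the possible5678 line occurs twice.
sample=$(mktemp)
cat > "$sample" <<'EOF'
possible1234 solution!!!
possible5678 solution!!!
impossible0000 other solution!
possible5678 solution!!!
possible0000 solution!
EOF

# grep keeps the lines that start with possible<digit>, sort groups identical
# lines together, and uniq -d prints one copy of each line seen more than once.
result=$(grep '^possible[[:digit:]]' < "$sample" | sort | uniq -d)
printf '%s\n' "$result"
rm -f "$sample"
```

On this sample it prints the single line possible5678 solution!!!, because the impossible0000 line is rejected by the ^ anchor and every other line is unique.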

To match any occurrence of possible<digits> anywhere in the input, assuming the GNU implementation of grep or a compatible one:

<your-file grep -Po 'possible\d+' | sort | uniq -d

On an input like:

possible123 I am xx possible123yy

that would give:

possible123

For all the unique lines that contain a duplicated possible<digits>:

perl -lne 'if (/possible(\d+)/) {
             $count{$1}++;
             $lines{$1}->{$_}++;
           }
           END{
             for $k (grep {$count{$_} > 1} keys %count) {
               print for keys %{$lines{$k}}
             }
           }' < your-file

On an input like:

this is the best possible solution
possible1234 solution!!!
possible5678 solution!!!
possible5678 solution!!!
possible0000 solution!
impossible0000 other solution!

it gives:

possible0000 solution!
impossible0000 other solution!
possible5678 solution!!!

(in no defined order, except that the lines for a given possibleXXXX are next to each other).

Answer 2

For your modified example, an awk-based solution would work:

awk '/possible[[:digit:]]+/ {count[$0]++;} END{for (line in count) {if (count[line]>1) print line}}' input.txt

or, if your awk does not understand POSIX character classes:

awk '/possible[0-9]+/ {count[$0]++;} END{for (line in count) {if (count[line]>1) print line}}' input.txt

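This checks whether each line contains the pattern possible followed by at least one digit. If it does, it increments an occurrence counter for the whole line. At the end, it prints only those lines whose occurrence counter is greater than 1.

Note that this only works if your actual input looks like the example shown. If the same possibleNNNN pattern can appear on lines that otherwise differ, it will fail!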
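
To illustrate that caveat, here is a small sketch on invented input in which possible42 appears on two lines with different wording. Counting whole lines misses the duplicate, while counting the matched substring (taken with the POSIX match() function and its RSTART and RLENGTH variables) catches it. The sample data and the names by_line, by_key and seen are assumptions for illustration, and the second command is one possible workaround rather than part of the answer above.

```shell
# Hypothetical input: possible42 appears twice, but on lines that differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
possible42 first wording
possible42 second wording
possible7 only once
EOF

# Keyed on the whole line: no line repeats exactly, so nothing is printed.
by_line=$(awk '/possible[0-9]+/ {count[$0]++} END {for (line in count) if (count[line] > 1) print line}' "$sample")

# Keyed on the matched substring: prints a line when its key was seen before.
by_key=$(awk 'match($0, /possible[0-9]+/) && seen[substr($0, RSTART, RLENGTH)]++' "$sample")

printf 'by line: [%s]\nby key:  [%s]\n' "$by_line" "$by_key"
rm -f "$sample"
```

Here by_line is empty and by_key holds possible42 second wording. The pattern is unanchored, so a word such as impossible42 would be counted under the same key as possible42.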

Answer 3

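We can use the following approach to preserve the order of the duplicated lines.

Store the regex in an environment variable "re", then use the match function to find the lines that match it. On those lines, we use the gsub function to update the count of how many times the matched string has been seen so far.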

$ re='\<possible[0-9]+\>' \
awk  'BEGIN { r = ENVIRON["re"] }
match($0, r) && 
(a[substr($0,RSTART,RLENGTH)] += gsub(r, "&")) == 2'  logfile

A one-liner in Perl would be:

perl -lne 'print if /\bpossible(\d+)\b/ && 2 == ($h{$1} +=()= //g)' logfile

Answer 4

Using GNU awk for the \< word boundary, the \w shorthand, and the third argument to match():

$ awk 'match($0,/\<possible\w*/,a) && ++cnt[a[0]]==2' file
possible5678 solution!!!
possible0000 solution!
There should be some "possible7777" solution!

You didn't say which of the lines sharing a duplicate key you want output, so the above outputs the second one.
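
If every line sharing a duplicated key is wanted instead, one possible variation is to read the file twice. The sketch below sticks to POSIX awk features, so it uses an unanchored possible[0-9]+ pattern rather than the GNU word-boundary syntax above. The sample data and the names all_dups, key and cnt are assumptions for illustration.

```shell
# Hypothetical input: possible5678 is the only key that occurs twice.
sample=$(mktemp)
cat > "$sample" <<'EOF'
possible1234 solution!!!
possible5678 solution!!!
possible5678 another solution
possible0000 solution!
EOF

# Pass 1 (NR == FNR) counts each possible<digits> key; pass 2 prints every
# line whose key was counted more than once, in input order.
all_dups=$(awk '
  match($0, /possible[0-9]+/) {
    key = substr($0, RSTART, RLENGTH)
    if (NR == FNR) cnt[key]++
    else if (cnt[key] > 1) print
  }' "$sample" "$sample")
printf '%s\n' "$all_dups"
rm -f "$sample"
```

On this sample it prints both possible5678 lines in their original order. Reading the file twice only works for a regular file, not for input arriving on a pipe.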
