Finding the lines common to 6 files

Answer 1

With awk you can do the following:

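# Skip lines already seen in the current file. index() compares FILENAME
# literally, so "file1" is not mistaken for a match inside "file10".
{ if (index(seenin[$0] " ", " " FILENAME " ")) { next } }
# Append the file name to the list of files containing this line; count the file
{ seenin[$0] = seenin[$0] " " FILENAME; nseen[$0]++ }

# Print every line that occurs in more than one file
END { for (line in nseen) { if (nseen[line] > 1) {
   printf "%s \"%s\" %s %d %s %s\n",
     "line", line, "seen in", nseen[line], "files:", seenin[line] } } }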

Limitation: memory, because every distinct line is kept in RAM.

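If you want the output sorted by number of occurrences, you have to adapt the print step accordingly, e.g. sort by the values of nseen. With gawk this is easy: add the following before the for loop in the END block: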

PROCINFO["sorted_in"]="@val_num_desc"
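PROCINFO["sorted_in"] is a gawk extension. For other awks, one alternative (not part of the original answer) is to print the count first and let sort(1) do the ordering. The following is a minimal sketch under that assumption; the helper name common_sorted and the dup array are made up for illustration, and the output format is simplified to "COUNT LINE".

```shell
# Hypothetical helper: print "COUNT LINE" for every line found in more than
# one of the given files, highest count first.
common_sorted() {
    awk '
        # count each line at most once per file
        !(($0, FILENAME) in dup) { dup[$0, FILENAME]; nseen[$0]++ }
        END { for (line in nseen) if (nseen[line] > 1) print nseen[line], line }
    ' "$@" | LC_ALL=C sort -k1,1nr -k2
}

# Demo on the three sample files from this answer
dir=$(mktemp -d)
printf '%s\n' a a b b c d e   > "$dir/file1"
printf '%s\n' c c x z e y z f > "$dir/file2"
printf '%s\n' f i a c z i k   > "$dir/file3"
( cd "$dir" && common_sorted file1 file2 file3 )
```

The sort key -k1,1nr orders by count, descending; -k2 breaks ties by the line text, so the output is stable across runs.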

Input files:

$ cat file1
a
a
b
b
c
d
e

$ cat file2
c
c
x
z
e
y
z
f

$ cat file3
f
i
a
c
z
i
k

Output (using the gawk array traversal feature PROCINFO):

$ awk -f compare_lines_multifiles.awk file1 file2 file3
line "c" seen in 3 files:  file1 file2 file3
line "z" seen in 2 files:  file2 file3
line "a" seen in 2 files:  file1 file3
line "e" seen in 2 files:  file1 file2
line "f" seen in 2 files:  file2 file3

Edit:

The files you provided are in MS-DOS format (CRLF line endings). Either convert them with

 dos2unix file1.txt file2.txt ....

or adjust the record separator in awk. Add the following as the first entry in the code:

 BEGIN { RS="\r\n" }
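A multi-character RS works in gawk, but POSIX leaves its behavior unspecified. If the awk at hand does not support it, one alternative (not from the original answer) is to leave RS alone and strip a trailing carriage return from each record instead. A minimal sketch; the helper name strip_cr is hypothetical:

```shell
# Hypothetical helper: remove a trailing carriage return from every line,
# so CRLF (MS-DOS) and LF (Unix) files compare equal.
strip_cr() {
    awk '{ sub(/\r$/, ""); print }' "$@"
}

# Demo: mixed CRLF and LF input comes out with plain LF endings
printf 'a\r\nb\n' | strip_cr
```

Inside the script above, the same effect comes from putting { sub(/\r$/, "") } as the first rule; it modifies $0 before the other rules see the record.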

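Edit 2: Your files have irregular delimiters. The problem is that a<tab>b and a<tab>b<tab> are treated as different lines, whereas you probably consider them the same.

For the special case of two fields of interest per file, you would rather compare the contents of those two fields than the whole line. Taking the MS-DOS format into account as well: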

BEGIN { RS="\r\n" }
# Skip records whose first two fields were already seen in the current file.
# index() compares FILENAME literally rather than as a regular expression.
{ if (index(seenin[$1"\t"$2] " ", " " FILENAME " ")) { next } }
# Append the file name to the list of files containing this key; count the file
{ seenin[$1"\t"$2] = seenin[$1"\t"$2] " " FILENAME; nseen[$1"\t"$2]++ }

# Print every key that occurs in more than one file
END { for (line in nseen) { if (nseen[line] > 1) {
   printf "%s \"%s\" %s %d %s %s\n",
     "line", line, "seen in", nseen[line], "files:", seenin[line] } } }

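With this change, considerably more overlap shows up across all six files. The script compares only the first two fields, joins them with a tab, and prints that key as the "line" in the output.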
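To see why the two-field key helps, the sketch below builds the same key, $1"\t"$2, for records that differ only in a trailing tab. The file names one.txt and two.txt and the helper keys are made up for illustration, and sub() is used to strip the carriage return so the sketch does not depend on a multi-character RS.

```shell
# Two hypothetical CRLF input files; some records carry a trailing tab
dir=$(mktemp -d)
printf 'a\tb\r\nc\td\t\r\n' > "$dir/one.txt"
printf 'a\tb\t\r\nx\ty\r\n' > "$dir/two.txt"

# Hypothetical helper: print the key used by the script above,
# i.e. the first two whitespace-separated fields joined by a tab.
keys() {
    awk '{ sub(/\r$/, ""); print $1 "\t" $2 }' "$@"
}

# a<tab>b and a<tab>b<tab> now collapse into one key that is counted twice
( cd "$dir" && keys one.txt two.txt | sort | uniq -c )
```

With the default field separator, awk ignores trailing whitespace when splitting a record, which is what makes the two spellings of the record equal.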

Answer 2

I would suggest a different approach. Just pass them all through sort and uniq -c to count how many times each line was seen:

sort 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt | uniq -c

This prints each line once, together with the number of times it was seen. For example, if I have these three files:

$ cat file1 
dog
cat
bird

$ cat file2
fly
bird
moose

$ cat file3
bird
dog
flea

I would get this output:

$ sort file1 file2 file3 | uniq -c
      3 bird
      1 cat
      2 dog
      1 flea
      1 fly
      1 moose

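So, if you want to separate the lines by how many times they were found, you can do the following to see only the lines present in all 3 (or 6, in your case) files: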

$ sort file1 file2 file3 | uniq -c | awk '$1==3'
      3 bird
$ sort file1 file2 file3 | uniq -c | awk '$1==2'
      2 dog
$ sort file1 file2 file3 | uniq -c | awk '$1==1'
      1 cat
      1 flea
      1 fly
      1 moose
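One caveat: uniq -c counts occurrences, not files, so a line repeated inside a single file inflates its count. A sketch of one way around this, assuming a duplicate within a file should count only once, is to de-duplicate each file with sort -u before merging. The helper name count_files and the files f1 and f2 are hypothetical.

```shell
# Hypothetical helper: for each distinct line, print the number of files
# that contain it (rather than the total number of occurrences).
count_files() {
    for f in "$@"; do
        sort -u "$f"      # each line at most once per file
    done | sort | uniq -c
}

# Demo: "dog" appears twice in f1 but is still counted as 2 files, not 3
dir=$(mktemp -d)
printf '%s\n' dog dog cat > "$dir/f1"
printf '%s\n' dog bird    > "$dir/f2"
count_files "$dir/f1" "$dir/f2"
```

The awk '$1==N' filters shown above can be applied to the output of count_files unchanged.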

Answer 3

Your first attempt was the right approach:

comm -12 2.txt 3.txt | comm -12 - 4.txt | comm -12 - 5.txt | comm -12 - 6.txt | comm -12 - 7.txt

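The stages of this pipeline run in parallel, streaming their results from one to the next. In principle, you can process files with millions of lines this way.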

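The problems you ran into with comm(1) seem to be caused by the input, namely whitespace and line endings. If you clean those up first, you will probably find that your original approach is both fast and convenient.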

Here is an example to demonstrate. Find the numbers divisible by a set of primes:

$ for D in 2 3 5 7 11 13 
> do seq 1 1000 | 
> awk -v D=$D '$0 % D == 0 { print $0 }' | 
> sort > $D
> done

$ comm -12 2 3 | comm -12 - 5 | comm -12 - 7 
210
420
630
840

As it turns out, no number between 1 and 1000 is divisible by all of 2, 3, 5, 7, and 11, since their product is 2310.
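The cleanup step mentioned earlier can be sketched as follows, assuming the trouble comes from carriage returns and trailing whitespace: normalize each file, sort it (comm requires sorted input), and then intersect. The helper name clean, the .clean suffix, and the sample files are hypothetical.

```shell
# Hypothetical helper: strip trailing whitespace (including the CR of CRLF
# line endings) from each file and sort it, writing the result to FILE.clean.
clean() {
    for f in "$@"; do
        # [[:space:]] also matches the carriage return
        sed 's/[[:space:]]*$//' "$f" | LC_ALL=C sort > "$f.clean"
    done
}

dir=$(mktemp -d)
printf 'b\r\na\r\nc\r\n' > "$dir/x.txt"   # CRLF line endings, unsorted
printf 'c \nb\nz\n'      > "$dir/y.txt"   # trailing space on one line
clean "$dir/x.txt" "$dir/y.txt"

# Same locale for sort and comm, so both agree on the ordering
LC_ALL=C comm -12 "$dir/x.txt.clean" "$dir/y.txt.clean"
```

For more than two files, the cleaned copies can be chained with comm -12 - FILE exactly as in the pipeline at the top of this answer.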
