Compare 60 large files and output only the lines common to all files

Answer 1

Try this:

awk '
    BEGINFILE{fnum++; delete f;}
    !f[$0]++{s[$0]++;}
    END {for (l in s){if (s[l] == fnum) print l}}
' files*

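Explanation:

BEGINFILE { ... } runs at the start of each file (BEGINFILE is a GNU awk extension)
- fnum++ increments the file counter
- delete f clears the array f, which is used to filter out duplicate lines within each file (see the link for a POSIX-compliant solution).
!f[$0]++ { ... } runs only the first time a line appears in a file (when f[$0] is 0, i.e. false)
- s[$0]++ increments the counter for that line.
END { ... } runs once at the end
- for (l in s){if (s[l] == fnum) print l} loops over the lines and prints each one whose count equals the number of files.

600,000 lines should fit in memory without trouble. If not, you could delete every entry of s whose count is below fnum in the BEGINFILE{...} block (before fnum is incremented).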
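
The pruning idea above could look roughly like the sketch below. It is an illustration, not the answer's original code. The function name common_lines is made up. FNR == 1 stands in for BEGINFILE and split("", f) stands in for delete f, so that it also runs in non-GNU awks. The price of that substitution is that an empty input file goes unnoticed, whereas BEGINFILE would count it and produce no output.

```shell
#!/bin/sh
# common_lines FILE... (hypothetical helper): print the lines present in
# every file, dropping candidates as soon as they miss a file.
common_lines() {
  awk '
    FNR == 1 {
      # A new file starts: forget lines that were absent from an earlier file.
      for (l in s) if (s[l] < fnum) delete s[l]
      fnum++
      split("", f)          # portable way to empty f
    }
    # First occurrence of the line in this file; after the first file,
    # only lines that are still candidates are counted.
    !f[$0]++ { if (fnum == 1 || ($0 in s)) s[$0]++ }
    END { for (l in s) if (s[l] == fnum) print l }
  ' "$@"
}

# Demo on three small files holding 1-6, 3-8 and 5-9.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 > "$dir/a"
printf '%s\n' 3 4 5 6 7 8 > "$dir/b"
printf '%s\n' 5 6 7 8 9   > "$dir/c"
common_lines "$dir/a" "$dir/b" "$dir/c" | sort   # awk's for-in order is unspecified
rm -rf "$dir"
```

After the first file, s can only shrink, so its size is bounded by the number of distinct lines in the first file. The per-file array f still holds every distinct line of the file currently being read.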

Answer 2

A parallel version in bash. It should work for files larger than memory.

export LC_ALL=C
comm -12 \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 1) <(sort 2);) <(comm -12  <(sort 3) <(sort 4););) \
        <(comm -12  <(comm -12  <(sort 5) <(sort 6);) <(comm -12  <(sort 7) <(sort 8);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 9) <(sort 10);) <(comm -12  <(sort 11) <(sort 12););) \
        <(comm -12  <(comm -12  <(sort 13) <(sort 14);) <(comm -12  <(sort 15) <(sort 16););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 17) <(sort 18);) <(comm -12  <(sort 19) <(sort 20););) \
        <(comm -12  <(comm -12  <(sort 21) <(sort 22);) <(comm -12  <(sort 23) <(sort 24);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 25) <(sort 26);) <(comm -12  <(sort 27) <(sort 28););) \
        <(comm -12  <(comm -12  <(sort 29) <(sort 30);) <(comm -12  <(sort 31) <(sort 32);););););) \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 33) <(sort 34);) <(comm -12  <(sort 35) <(sort 36););) \
        <(comm -12  <(comm -12  <(sort 37) <(sort 38);) <(comm -12  <(sort 39) <(sort 40);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 41) <(sort 42);) <(comm -12  <(sort 43) <(sort 44););) \
        <(comm -12  <(comm -12  <(sort 45) <(sort 46);) <(comm -12  <(sort 47) <(sort 48););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 49) <(sort 50);) <(comm -12  <(sort 51) <(sort 52););) \
        <(comm -12  <(comm -12  <(sort 53) <(sort 54);) <(comm -12  <(sort 55) <(sort 56);););) \
      <(cat  <(comm -12  <(comm -12  <(sort 57) <(sort 58);) <(comm -12  <(sort 59) <(sort 60););) ;);););

If the files are already sorted, replace sort with cat.
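
The same tree of comm -12 calls can probably be built for any number of files with a small recursive bash function, rather than written out by hand. This is only a sketch: the name common_tree is invented here, and it assumes bash (process substitution and ${@:offset:length} slicing are not POSIX sh).

```shell
#!/usr/bin/env bash
export LC_ALL=C

# common_tree FILE... (hypothetical helper): print the lines shared by all
# files. The list is halved at each level; both halves run concurrently
# because each process substitution is its own process.
common_tree() {
  if (( $# == 0 )); then
    echo "usage: common_tree FILE..." >&2
    return 1
  elif (( $# == 1 )); then
    sort -- "$1"            # leaf: use cat instead if the file is pre-sorted
  else
    local half=$(( $# / 2 ))
    comm -12 <(common_tree "${@:1:half}") <(common_tree "${@:half+1}")
  fi
}

# Demo on three small files holding 1-6, 3-8 and 5-9.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 > "$dir/1"
printf '%s\n' 3 4 5 6 7 8 > "$dir/2"
printf '%s\n' 5 6 7 8 9   > "$dir/3"
common_tree "$dir"/*
rm -rf "$dir"
```

As in the hand-written version, every sort and comm starts at once, so with many files the number of simultaneous processes grows with the file count.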

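Answer 3

With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and the f parameter expansion flag to split on newlines):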

#! /bin/zsh -
typeset -U all list
all=(${(f)"$(<${1?})"}); shift
for file do
  list=(${(f)"$(<$file)"})
  all=(${all:*list})
done
print -rC1 -- $all

(The script takes the list of files as arguments; empty lines are ignored.)

Answer 4

With join:

cp a jnd
for f in a b c; do join jnd $f >j__; cp j__ jnd; done

I have only numbers in the three files a, b and c (1-6, 3-8, 5-9). These are the two lines (numbers, strings) common to all three.

]# cat jnd
5
6

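It is neither elegant nor efficient, especially the cp in between. But it is easy to make it work in parallel. Pick a subgroup of files (for f in a*), give the intermediate files unique names, and you can run several subgroups at once. You still have to join those results afterwards: with 64 files you would have 8 threads each joining 8 files, and the 8 resulting files can then be split again across 4 threads.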
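
A rough sketch of that subgroup scheme follows. The helper name fold_join and all file names are hypothetical, and the parallelism comes from shell background processes rather than threads. It assumes every input is already sorted and holds a single field per line, since join matches on the first field only.

```shell
#!/bin/sh
export LC_ALL=C

# fold_join OUT FILE... (hypothetical helper): join the sorted files one
# after another and leave the lines common to all of them in OUT.
fold_join() {
  out=$1; shift
  cp "$1" "$out"; shift
  for f do
    join "$out" "$f" > "$out.tmp" && mv "$out.tmp" "$out"
  done
}

# Demo: two subgroups of two files each, processed concurrently.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 > "$dir/a1"
printf '%s\n' 3 4 5 6 7 8 > "$dir/a2"
printf '%s\n' 5 6 7 8 9   > "$dir/b1"
printf '%s\n' 4 5 6 7     > "$dir/b2"

fold_join "$dir/res_a" "$dir/a1" "$dir/a2" &   # subgroup a
fold_join "$dir/res_b" "$dir/b1" "$dir/b2" &   # subgroup b
wait                                           # both partial results are ready

fold_join "$dir/final" "$dir/res_a" "$dir/res_b"
cat "$dir/final"
rm -rf "$dir"
```

Each subgroup writes to its own result file, so the background jobs never touch the same intermediate file.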
