Fastest way to determine whether two sorted lists contain unique elements

Question 1

Using comm, head, and a FIFO, you can write a script that returns at the first match:

#!/bin/bash -e 

[ -p tmpfifo ] || mkfifo tmpfifo
comm -12 A B | head -n1 >tmpfifo &

# If this wc is zero, no matches.  Otherwise, a match was found. 
# You can use this directly in the script, echo it, 
# change the script exit value, or however else you need to use it.
wc -l tmpfifo

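As written, this leaves comm running in the background, and I'm having trouble finding the right PID to kill ($! gives the PID of head rather than comm). If you are sure this is the only comm running, you can use killall, but that could be dangerous if other comm processes are running.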
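One way to sidestep the background-process problem is to do the sorted comparison in a single process that stops at the first match. The following is only a minimal sketch of that idea in Python, not a replacement for the script above. The function name first_common_line is hypothetical, and the sketch assumes both files are sorted in plain code point order (as with LC_ALL=C sort), which may differ from the locale order comm expects.

```python
def first_common_line(path_a, path_b):
    """Return the first line present in both sorted files, or None.

    Assumes both files are sorted in code point order, for example
    with `LC_ALL=C sort`. Stops reading as soon as a match is found.
    """
    with open(path_a) as fa, open(path_b) as fb:
        line_a = fa.readline()
        line_b = fb.readline()
        # readline() returns "" only at end of file; a blank line is "\n".
        while line_a and line_b:
            a = line_a.rstrip("\n")
            b = line_b.rstrip("\n")
            if a == b:
                return a
            # Advance whichever file currently holds the smaller line.
            if a < b:
                line_a = fa.readline()
            else:
                line_b = fb.readline()
    return None
```

Because the function returns as soon as the two cursors agree, there is no leftover process to kill, and at most one pass over each file is needed.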

Question 2

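You could try using AWK to parse the files. At first I thought of breaking up the larger file, or of storing A in memory and running through B, comparing each line against the copy of A in memory. However, I think AWK may be what you are looking for.

http://www.linuxjournal.com/article/8913 is a primer.

http://forums.devshed.com/unix-help-35/compare-two-files-using-awk-or-sed-425150.html discusses file comparison. I'm not on Linux right now, or I would try testing it.

Gawk manual: http://www.gnu.org/software/gawk/manual/html_node/index.html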
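The "store A in memory and run B against it" idea mentioned above can be sketched in a few lines. This is an untested-on-real-data illustration written in Python rather than AWK, and the function name has_common_line is hypothetical. It assumes the first file fits in memory, and it does not rely on either file being sorted.

```python
def has_common_line(path_a, path_b):
    """Return True if any line of path_b also appears in path_a.

    Loads every line of path_a into a set, then streams path_b and
    stops at the first line that is found in the set.
    """
    with open(path_a) as fa:
        seen = {line.rstrip("\n") for line in fa}
    with open(path_b) as fb:
        # any() short-circuits, so reading stops at the first match.
        return any(line.rstrip("\n") in seen for line in fb)
```

Passing the smaller of the two files as the first argument keeps the in-memory set as small as possible.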

Question 3

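If the files are sorted, you may be able to use join(1) or merge(1) to do the work fairly efficiently. Piping the output to head -1 stops at the first line; when head exits, the rest of the pipeline is terminated by SIGPIPE.

You can also reduce the size of the problem by running uniq(1) on the larger file A. This boils it down to a set of distinct lines, which can then be compared against the list of B files.

Another possibility is to adapt your Python script to do something like the following.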

For each B file:
    Read in each line
    Add the file name to a list of files keyed on a hash of the line 

Loop through the A file:
    Look up each line in the dictionary
    Output the file name when a match is found.
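The pseudocode above might translate into Python roughly as follows. This is a sketch under assumptions: the function name files_sharing_lines and its parameters are hypothetical, a built-in dict does the hashing of each line, and all distinct lines of the B files are held in memory.

```python
from collections import defaultdict


def files_sharing_lines(path_a, b_paths):
    """Return the set of B file names that share at least one line with A."""
    # Map each distinct line to the set of B files that contain it.
    index = defaultdict(set)
    for b_path in b_paths:
        with open(b_path) as fb:
            for line in fb:
                index[line.rstrip("\n")].add(b_path)

    # Stream A and collect every B file whose lines it hits.
    matched = set()
    with open(path_a) as fa:
        for line in fa:
            # .get() avoids creating empty entries for lines only in A.
            matched |= index.get(line.rstrip("\n"), set())
    return matched
```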

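If the number of distinct lines in the B files is large, this will use a lot of memory, so it may or may not be practical. If you don't mind a post-processing step to eliminate false positives, you can reduce the memory consumption of this stage by storing only the hashes.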
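A hash-only variant could look something like the sketch below. The names candidate_files and line_digest are hypothetical, and the choice of an 8-byte BLAKE2b digest is an assumption made for illustration, not something the answer specifies. Because two different lines can in principle share a digest, the result is a set of candidates that still needs to be verified against the real files.

```python
import hashlib
from collections import defaultdict


def line_digest(line):
    """Return a short fixed-size digest of a line (8 bytes, assumed size)."""
    return hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()


def candidate_files(path_a, b_paths):
    """Return B files that probably share a line with A.

    Only digests are stored, never the lines themselves, so a hash
    collision can produce a false positive that must be checked later.
    """
    index = defaultdict(set)
    for b_path in b_paths:
        with open(b_path) as fb:
            for line in fb:
                index[line_digest(line.rstrip("\n"))].add(b_path)

    candidates = set()
    with open(path_a) as fa:
        for line in fa:
            candidates |= index.get(line_digest(line.rstrip("\n")), set())
    return candidates
```

The saving matters mainly when lines are much longer than the digest; for very short lines, storing the lines directly may use about as much memory.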

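A third approach is to load all of the data into a database and do a join, although this incurs the overhead of importing the data, which may be too high. With appropriate indexes, the actual matching query will be very fast and can check all of the B files at once, i.e.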

Create table A (
       TextLine varchar (100) -- or whatever length you need
);

Create table B (
       TextLine varchar (100) not null
      ,FileName varchar (20) not null
);

Alter table B
  add constraint PK_B
      primary key (TextLine, FileName);

select distinct B.FileName
  from A
  join B
    on A.TextLine = B.TextLine;
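To try the join without setting up a database server, the same schema can be exercised from Python. The sketch below assumes SQLite through the standard sqlite3 module, which the answer does not name, and the function matching_b_files and its parameters are hypothetical. The primary key is declared inline because SQLite does not support adding it with ALTER TABLE.

```python
import sqlite3


def matching_b_files(a_lines, b_rows):
    """Return sorted B file names that share at least one line with A.

    a_lines: iterable of text lines from file A.
    b_rows:  iterable of (text_line, file_name) pairs from the B files.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (TextLine TEXT)")
    conn.execute(
        "CREATE TABLE B ("
        " TextLine TEXT NOT NULL,"
        " FileName TEXT NOT NULL,"
        " PRIMARY KEY (TextLine, FileName))"
    )
    conn.executemany(
        "INSERT INTO A (TextLine) VALUES (?)",
        ((line,) for line in a_lines),
    )
    # OR IGNORE skips pairs that repeat within the same B file.
    conn.executemany(
        "INSERT OR IGNORE INTO B (TextLine, FileName) VALUES (?, ?)",
        b_rows,
    )
    rows = conn.execute(
        "SELECT DISTINCT B.FileName"
        " FROM A JOIN B ON A.TextLine = B.TextLine"
        " ORDER BY B.FileName"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

The primary key on (TextLine, FileName) gives the join an index to probe for each line of A, which is the "appropriate index" the approach depends on.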

Answer