匹配具有重复 ID 的两个文件

Question 1

Python3

#!/usr/bin/env python3

import sys

def make_id_dict(f):
  d = {}
  for line in open(f):
    k, v = line.split()
    if k in d:
      d[k] += [ v ]
    else:
      d[k] = [ v ]
  return d


filename1 = sys.argv[1]
filename2 = sys.argv[2]

dict1 = make_id_dict(filename1)
dict2 = make_id_dict(filename2)

for key in sorted(dict1):
  for i, value1 in enumerate(dict1[key]):
    value2 = dict2[key][i]
    if value1 == value2:
      result = 'match'
    else:
      result = 'mismatch'
    print(key, value1, value2, result)

保存为脚本match-files-by-ids.py然后调用：

$ python3 match-files-by-ids.py File1 File2
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

Answer

Python3

#!/usr/bin/env python3

import sys

def make_id_dict(f):
  d = {}
  for line in open(f):
    k, v = line.split()
    if k in d:
      d[k] += [ v ]
    else:
      d[k] = [ v ]
  return d


filename1 = sys.argv[1]
filename2 = sys.argv[2]

dict1 = make_id_dict(filename1)
dict2 = make_id_dict(filename2)

for key in sorted(dict1):
  for i, value1 in enumerate(dict1[key]):
    value2 = dict2[key][i]
    if value1 == value2:
      result = 'match'
    else:
      result = 'mismatch'
    print(key, value1, value2, result)

保存为脚本match-files-by-ids.py然后调用：

$ python3 match-files-by-ids.py File1 File2
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

Question 2

如果文件：

都已排序；并且
每个都恰好包含零个或两个给定索引的出现

然后如果我们join对它们进行匹配，我们将得到每个可配对 ID 的四个“匹配”，即

1st IDn from File1 against 1st IDn from File2
1st IDn from File1 against 2nd IDn from File2
2nd IDn from File1 against 1st IDn from File2
2nd IDn from File1 against 2nd IDn from File2

等等。无法配对的行将被忽略（除非我们使用该-a选项）。然后我们可以简单地从每个匹配组中挑选出第一行和第四行，并比较它们各自的第二个字段：

$ join File1 File2 | 
  awk '!((NR-1)%4 && NR%4) {$4 = $3==$2 ? "match" : "mismatch"; print}'
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

其中，(NR-1)%4对于 1、5、9 等行，为零；NR%4对于 4、8 等行，为零。因此，对于 1、4；5、9...，与运算和反相运算成立。

Answer

如果文件：

都已排序；并且
每个都恰好包含零个或两个给定索引的出现

然后如果我们join对它们进行匹配，我们将得到每个可配对 ID 的四个“匹配”，即

1st IDn from File1 against 1st IDn from File2
1st IDn from File1 against 2nd IDn from File2
2nd IDn from File1 against 1st IDn from File2
2nd IDn from File1 against 2nd IDn from File2

等等。无法配对的行将被忽略（除非我们使用该-a选项）。然后我们可以简单地从每个匹配组中挑选出第一行和第四行，并比较它们各自的第二个字段：

$ join File1 File2 | 
  awk '!((NR-1)%4 && NR%4) {$4 = $3==$2 ? "match" : "mismatch"; print}'
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

其中，(NR-1)%4对于 1、5、9 等行，为零；NR%4对于 4、8 等行，为零。因此，对于 1、4；5、9...，与运算和反相运算成立。

Question 3

这是一个awk解决方案：

awk '
  NR==FNR {
    if ($1 in a) 
      b[$1]=$2; 
    else 
      a[$1]=$2; 
    next;
  } 
  ($1 in a) {
    print $1, a[$1], $2, $2 == a[$1] ? "match" : "mismatch";
    delete a[$1];
    next;
  }
  ($1 in b) {
    print $1, b[$1], $2, $2 == b[$1] ? "match" : "mismatch";
  }' File1 File2

测试：

$ awk '
  NR==FNR {if ($1 in a) b[$1]=$2; else a[$1]=$2; next} 
  ($1 in a) {print $1, a[$1], $2, $2 == a[$1] ? "match" : "mismatch"; delete a[$1]; next; }
  ($1 in b) {print $1, b[$1], $2, $2 == b[$1] ? "match" : "mismatch";}
' File1 File2
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

Answer

这是一个awk解决方案：

awk '
  NR==FNR {
    if ($1 in a) 
      b[$1]=$2; 
    else 
      a[$1]=$2; 
    next;
  } 
  ($1 in a) {
    print $1, a[$1], $2, $2 == a[$1] ? "match" : "mismatch";
    delete a[$1];
    next;
  }
  ($1 in b) {
    print $1, b[$1], $2, $2 == b[$1] ? "match" : "mismatch";
  }' File1 File2

测试：

$ awk '
  NR==FNR {if ($1 in a) b[$1]=$2; else a[$1]=$2; next} 
  ($1 in a) {print $1, a[$1], $2, $2 == a[$1] ? "match" : "mismatch"; delete a[$1]; next; }
  ($1 in b) {print $1, b[$1], $2, $2 == b[$1] ? "match" : "mismatch";}
' File1 File2
1 12 13 mismatch
1 13 13 match
2 15 15 match
2 16 17 mismatch
4 15 15 match
4 18 18 match

匹配具有重复 ID 的两个文件

答案1

Python3

答案2

答案3

相关内容