I have two files, as shown below:
file1 (unique IDs):
C84610112
C96209347
C84774620
C84774691
C85594749
C89372772
C89651687
C89845500
C89914896
C91269765
C91526663
C92210411
C92254517
C93709504
C94303303
C95100561
C95100609
C95417520
C95696352
C96045246
C96045496
C96060727
C96076986
and file2:
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
2 C98230482 score: -57.431 nathvy = 47 nconfs = 575
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
4 C36510773 score: -56.502 nathvy = 38 nconfs = 7595
5 C04355288 score: -56.400 nathvy = 41 nconfs = 50502
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
8 C96209347 score: -53.901 nathvy = 24 nconfs = 159
9 C06169346 score: -53.438 nathvy = 22 nconfs = 105
10 C95696352 score: -52.848 nathvy = 38 nconfs = 878
11 C98216318 score: -52.061 nathvy = 52 nconfs = 1092
12 C04285713 score: -52.009 nathvy = 38 nconfs = 1355
13 C96209347 score: -51.477 nathvy = 24 nconfs = 1375
14 C98222837 score: -50.730 nathvy = 34 nconfs = 588
15 C98216318 score: -50.694 nathvy = 52 nconfs = 1136
16 C32832068 score: -50.546 nathvy = 22 nconfs = 548
17 C95696352 score: -50.475 nathvy = 38 nconfs = 3220
18 C32832068 score: -50.457 nathvy = 22 nconfs = 16235
19 C95696352 score: -50.234 nathvy = 38 nconfs = 3048
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
21 C72332782 score: -49.676 nathvy = 41 nconfs = 3942
22 C97970648 score: -49.616 nathvy = 45 nconfs = 17640
23 C04285713 score: -49.594 nathvy = 38 nconfs = 14038
24 C98043133 score: -49.370 nathvy = 43 nconfs = 1236
25 C89372772 score: -49.308 nathvy = 22 nconfs = 471
26 C97970648 score: -49.297 nathvy = 45 nconfs = 17850
27 C85594749 score: -49.122 nathvy = 44 nconfs = 4158
28 C70006381 score: -49.092 nathvy = 24 nconfs = 880
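I want to match the IDs in file1 against the IDs in file2 (second column) and print the matching IDs. In addition, some IDs in file2 are repeated, for example C96209347 (although the full lines are not identical). I want to grep only the line where each ID first occurs and skip the others. So in this particular example, for C96209347 only the third line of file2 should be printed. Can anyone help?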
Answer 1
Try this:
grep -f file1 file2 | awk '!_[$2]++'
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
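Explanation
grep -f file1 file2: searches file2 for lines matching the patterns read from file1.
awk '!_[$2]++': prints nothing (skips the line) if field $2 has been seen before.
_ is the array name (it can be anything, e.g. "seen").
_[$2]++ creates an array entry whose key is the contents of field $2 and adds 1 to it. Because this is a post-increment, the expression yields the old value: 0 the first time an ID appears, non-zero afterwards.
If _[$2] is not (!) set yet, the line is printed. print is the default action awk performs when a condition matches.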
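One caveat: grep -f treats every line of file1 as a regular expression and matches it anywhere in the line, not only in the second column. With IDs like the ones above that is probably harmless, but a stricter variant may be safer. The following is a minimal sketch, assuming a grep that supports -w (GNU and BSD grep do); the sample data is shortened from the question, and the names tmp and result are purely illustrative.

```shell
# Hypothetical, shortened sample data based on the question.
tmp=$(mktemp -d)
printf '%s\n' C96209347 C89372772 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
9 C06169346 score: -53.438 nathvy = 22 nconfs = 105
EOF

# -F: treat each ID as a fixed string rather than a regex.
# -w: match whole words only, so an ID cannot match inside a longer token.
# awk then keeps only the first line seen for each value of field 2.
result=$(grep -Fwf "$tmp/file1" "$tmp/file2" | awk '!seen[$2]++')
printf '%s\n' "$result"

rm -rf "$tmp"
```

Here line 7 passes grep but is dropped by awk because C96209347 was already seen on line 3, and line 9 never passes grep at all.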
Answer 2
Using awk only:
$ awk 'NR==FNR {a[$1]=1; next} $2 in a {print; delete a[$2]}' file1 file2
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
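For readers unfamiliar with the two-file awk idiom, the same one-liner can be spelled out with comments. This is only a sketch of how the command above appears to work; the array is renamed to ids for readability, and the sample data (shortened from the question) and the names tmp and out are illustrative.

```shell
# Hypothetical, shortened sample data based on the question.
tmp=$(mktemp -d)
printf '%s\n' C95696352 C85594749 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
2 C98230482 score: -57.431 nathvy = 47 nconfs = 575
10 C95696352 score: -52.848 nathvy = 38 nconfs = 878
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
EOF

out=$(awk '
  # While reading the first file, NR (global line count) equals FNR
  # (per-file line count): store each ID as an array key and move on.
  NR == FNR { ids[$1] = 1; next }
  # In the second file: if field 2 is a stored ID, print the line and
  # delete the ID so that later duplicates no longer match.
  $2 in ids { print; delete ids[$2] }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Unlike the grep pipeline, this compares the second column exactly against the IDs, so an ID appearing elsewhere in a line cannot cause a match.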