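I have a species list and a master record from a database. I want to search the third column of the master record file for a species match and print the whole line.
Species list: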
Methanocaldococcus jannaschii
Methanosarcina mazei
Methanosarcina acetivorans
Archaeoglobus fulgidus
Pyrococcus furiosus
Sulfolobus solfataricus
Aeropyrum pernix
Halobacterium sp.
Sulfolobus tokodaii
Nanoarchaeum equitans
Methanothermobacter thermautotrophicus
Pirellula sp.
Borrelia burgdorferi
Here, in the file species_list, the first column is the genus and the second column is the species.
Master record:
taxon_id STRING_type STRING_name_compact official_name_NCBI
243232 core Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM2661
573063 periphery Methanocaldococcus infernus Methanocaldococcus infernus ME
573064 core Methanocaldococcus fervens Methanocaldococcus fervens AG86
579137 periphery Methanocaldococcus vulcanius Methanocaldococcus vulcanius M7
644281 periphery Methanocaldococcus sp. FS40622 Methanocaldococcus sp. FS406-22
243232 core Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM2661
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
269797 core Methanosarcina barkeri Methanosarcina barkeri str. Fusaro
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
269797 core Methanosarcina barkeri Methanosarcina barkeri str. Fusaro
565033 core Geoglobus acetivorans Geoglobus acetivorans
694431 core Desulfurella acetivorans Desulfurella acetivorans A63
1123296 core Stenoxybacter acetivorans Stenoxybacter acetivorans DSM19021
224325 core Archaeoglobus fulgidus Archaeoglobus fulgidus DSM4304
Desired output:
243232 core Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM2661
243232 core Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM2661
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
224325 core Archaeoglobus fulgidus Archaeoglobus fulgidus DSM4304
I tried grep in a for loop:
for i in $(cat species_list); do grep -w "$i" master_record; done
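But all I get is lines that match either the genus or the species, not both together. Besides, it does not restrict the search to the third column.
I also tried awk: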
awk 'NR=FNR{a[$0]; next}{if ($3 in a){print $0}}' species_list master_record
But it gives no output.
PS: I am a beginner at scripting. I would appreciate any help. Thanks!
Answer 1
You can use awk or grep (and without a for loop):
grep -f species_list master_record
The -f option reads a list of regular expressions from a file, one pattern per line.
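Because -f treats each line as a regular expression, the dot in an entry such as "Halobacterium sp." matches any character. A minimal sketch, assuming the list holds one "Genus species" pair per line: adding -F makes grep read the patterns as fixed strings. The temporary directory and the two small sample files are made up for illustration from the question's data.

```shell
# Illustrative sample files in a temporary directory (subset of the question's data).
dir=$(mktemp -d)
cat > "$dir/species_list" <<'EOF'
Methanosarcina mazei
Halobacterium sp.
EOF
cat > "$dir/master_record" <<'EOF'
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
269797 core Methanosarcina barkeri Methanosarcina barkeri str. Fusaro
EOF

# -F: patterns are fixed strings, so "sp." only matches a literal dot.
# -f: read the patterns from species_list, one per line.
grep_out=$(grep -F -f "$dir/species_list" "$dir/master_record")
printf '%s\n' "$grep_out"
```

Note that grep still matches the pattern anywhere on the line, not only in columns 3 and 4, so it is the looser of the two approaches.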
Or:
awk 'NR==FNR{a[$0];next}(($3 " " $4) in a)' file1 file2
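Apart from the key that is looked up in the array (columns 3 and 4 joined by a space, rather than column 3 alone), this is almost the same as your command. Note also that it uses NR==FNR, a comparison. Your version has NR=FNR, which is an assignment that is true for every line, so every line is stored and skipped and nothing is ever printed.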
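The awk approach can be tried end to end on a small sample. This is only a sketch: the temporary directory, the shortened sample files, and the array name wanted are chosen for illustration, and it assumes each line of species_list is exactly "Genus species" with a single space and no trailing whitespace.

```shell
# Illustrative sample files in a temporary directory (subset of the question's data).
dir=$(mktemp -d)
cat > "$dir/species_list" <<'EOF'
Methanocaldococcus jannaschii
Methanosarcina mazei
Methanosarcina acetivorans
EOF
cat > "$dir/master_record" <<'EOF'
taxon_id STRING_type STRING_name_compact official_name_NCBI
243232 core Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM2661
573063 periphery Methanocaldococcus infernus Methanocaldococcus infernus ME
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
565033 core Geoglobus acetivorans Geoglobus acetivorans
EOF

awk_out=$(awk '
    NR == FNR { wanted[$0]; next }   # first file: remember each "Genus species" line
    ($3 " " $4) in wanted            # second file: print rows whose columns 3 and 4 are listed
' "$dir/species_list" "$dir/master_record")
printf '%s\n' "$awk_out"
```

With this sample, only the jannaschii and mazei rows are printed. The "Geoglobus acetivorans" row is skipped because genus and species must match together.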
Answer 2
Using Miller (https://github.com/johnkerl/miller) you can do a join:
mlr --nidx --fs " " --repifs join -j 1,2 -l 3,4 -r 1,2 -f master_record.csv species_list.csv
which gives you:
243232 core Methanocaldococcus jannaschii DSM2661
243232 core Methanocaldococcus jannaschii DSM2661
192952 periphery Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Go1
224325 core Archaeoglobus fulgidus DSM4304
In your desired output you have "Methanosarcina mazei" 5 times. Why?
In master_record it appears only 3 times.
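One way to check such counts yourself is to tally the matches per species. A rough sketch with awk, under the same assumption that each line of species_list is exactly "Genus species"; the temporary directory, the sample files, and the array names are illustrative only.

```shell
# Illustrative sample files in a temporary directory (subset of the question's data).
dir=$(mktemp -d)
printf '%s\n' 'Methanosarcina mazei' 'Archaeoglobus fulgidus' > "$dir/species_list"
cat > "$dir/master_record" <<'EOF'
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
269797 core Methanosarcina barkeri Methanosarcina barkeri str. Fusaro
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Methanosarcina mazei Go1
224325 core Archaeoglobus fulgidus Archaeoglobus fulgidus DSM4304
EOF

counts=$(awk '
    NR == FNR { wanted[$0]; next }                 # first file: listed species
    ($3 " " $4) in wanted { n[$3 " " $4]++ }       # second file: count matching rows
    END { for (name in n) print n[name], name }    # print "count Genus species"
' "$dir/species_list" "$dir/master_record" | sort -k2)
printf '%s\n' "$counts"
```

The sort is there because awk does not guarantee the order in which the array is traversed.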