Extract lines from a file that match two strings at once

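I have a species list and a master record from a database. I want to search the third column of the master record file for matches to the species and print the whole line.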

Species list

Methanocaldococcus jannaschii
Methanosarcina mazei
Methanosarcina acetivorans
Archaeoglobus fulgidus
Pyrococcus furiosus
Sulfolobus solfataricus
Aeropyrum pernix
Halobacterium sp.
Sulfolobus tokodaii
Nanoarchaeum equitans
Methanothermobacter thermautotrophicus
Pirellula sp.
Borrelia burgdorferi

In the file species_list, the first column is the genus and the second column is the species.

Master record

taxon_id     STRING_type     STRING_name_compact     official_name_NCBI
243232  core    Methanocaldococcus jannaschii   Methanocaldococcus jannaschii DSM2661
573063  periphery       Methanocaldococcus infernus     Methanocaldococcus infernus ME
573064  core    Methanocaldococcus fervens      Methanocaldococcus fervens AG86
579137  periphery       Methanocaldococcus vulcanius    Methanocaldococcus vulcanius M7
644281  periphery       Methanocaldococcus sp. FS40622  Methanocaldococcus sp. FS406-22
243232  core    Methanocaldococcus jannaschii   Methanocaldococcus jannaschii DSM2661
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
269797  core    Methanosarcina barkeri  Methanosarcina barkeri str. Fusaro
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
269797  core    Methanosarcina barkeri  Methanosarcina barkeri str. Fusaro
565033  core    Geoglobus acetivorans   Geoglobus acetivorans
694431  core    Desulfurella acetivorans        Desulfurella acetivorans A63
1123296 core    Stenoxybacter acetivorans       Stenoxybacter acetivorans DSM19021
224325  core    Archaeoglobus fulgidus  Archaeoglobus fulgidus DSM4304

Expected output:

243232  core    Methanocaldococcus jannaschii   Methanocaldococcus jannaschii DSM2661
243232  core    Methanocaldococcus jannaschii   Methanocaldococcus jannaschii DSM2661
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1
224325  core    Archaeoglobus fulgidus  Archaeoglobus fulgidus DSM4304

I tried using grep in a for loop:

for i in $(cat species_list); do grep -w "$i" master_record; done

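But all I get are lines that match either the genus or the species, rather than both together. Also, it does not restrict the search to the third column.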

I also tried awk:

awk 'NR=FNR{a[$0]; next}{if ($3 in a){print $0}}' species_list master_record

But it gave no results.

PS: I am a beginner at scripting. I would appreciate any help. Thanks!

Answer 1

You can use awk or grep (and without a for loop):

grep -f species_list master_record

-f lets you supply a file containing a list of regular expressions, one per line.
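
As a small self-contained check of the grep approach, the sketch below builds two throwaway files in a temporary directory (the paths are made up for the demo) from a few lines of the question. It also adds -F, which is my addition and not part of the command above, so that names such as "Halobacterium sp." are treated as literal strings rather than regular expressions. Keep in mind that grep matches anywhere on the line, not only in the third column.

```shell
#!/bin/sh
# Demo only: temporary copies of a few lines from the question.
tmp=$(mktemp -d)
printf '%s\n' 'Methanosarcina mazei' 'Archaeoglobus fulgidus' > "$tmp/species_list"
printf '%s\n' \
  '192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1' \
  '269797  core    Methanosarcina barkeri  Methanosarcina barkeri str. Fusaro' \
  '224325  core    Archaeoglobus fulgidus  Archaeoglobus fulgidus DSM4304' \
  > "$tmp/master_record"

# -F: treat each pattern as a fixed string; -f: read one pattern per line.
grep_result=$(grep -F -f "$tmp/species_list" "$tmp/master_record")
printf '%s\n' "$grep_result"
rm -rf "$tmp"
```

With this sample, the mazei and fulgidus rows should be printed and the barkeri row skipped.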


Or

awk 'NR==FNR{a[$0];next}(($3 " " $4) in a)' file1 file2

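This is almost the same as your command; the difference is the array element being matched. Because awk splits fields on whitespace by default, the genus ends up in $3 and the species in $4, so the lookup key has to be rebuilt as $3 " " $4. Here file1 is the species list and file2 is the master record.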
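
To see the two-field key in action, here is a minimal sketch that runs the same awk logic on a few lines copied from the question. The temporary file paths and the array name `seen` are made up for the demo. Note that the condition is the comparison `NR==FNR`; a single `=` is an assignment in awk.

```shell
#!/bin/sh
# Demo only: throwaway files holding a few lines from the question.
tmp=$(mktemp -d)
printf '%s\n' \
  'Methanocaldococcus jannaschii' \
  'Methanosarcina mazei' \
  'Methanosarcina acetivorans' \
  > "$tmp/species_list"
printf '%s\n' \
  '243232  core    Methanocaldococcus jannaschii   Methanocaldococcus jannaschii DSM2661' \
  '573063  periphery       Methanocaldococcus infernus     Methanocaldococcus infernus ME' \
  '192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1' \
  '565033  core    Geoglobus acetivorans   Geoglobus acetivorans' \
  > "$tmp/master_record"

# First file: remember every "Genus species" line as an array key.
# Second file: print lines whose fields 3 and 4, joined by one space, are a key.
awk_result=$(awk 'NR==FNR { seen[$0]; next } (($3 " " $4) in seen)' \
  "$tmp/species_list" "$tmp/master_record")
printf '%s\n' "$awk_result"
rm -rf "$tmp"
```

With these four sample rows, only the jannaschii and mazei rows should be printed. "Geoglobus acetivorans" is skipped because the genus differs, which is the both-words-together behaviour the question asks for.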

Answer 2

Using Miller (https://github.com/johnkerl/miller) you can do a join:

mlr --nidx --fs " " --repifs join -j 1,2 -l 3,4 -r 1,2 -f master_record.csv species_list.csv

which gives you

243232 core Methanocaldococcus jannaschii DSM2661
243232 core Methanocaldococcus jannaschii DSM2661
192952 periphery Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Go1
192952 periphery Methanosarcina mazei Go1
224325 core Archaeoglobus fulgidus DSM4304

In your desired output you have "Methanosarcina mazei" 5 times. Why?

In master_record it appears only 3 times.
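
One way to double-check that count is sketched below. It copies the Methanosarcina rows from the question into a throwaway file (the path is made up for the demo) and counts the rows whose third and fourth whitespace-separated fields are exactly the genus and species.

```shell
#!/bin/sh
# Demo only: the Methanosarcina rows of master_record, as shown in the question.
tmp=$(mktemp -d)
printf '%s\n' \
  '192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1' \
  '269797  core    Methanosarcina barkeri  Methanosarcina barkeri str. Fusaro' \
  '192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1' \
  '192952  periphery       Methanosarcina mazei    Methanosarcina mazei Go1' \
  '269797  core    Methanosarcina barkeri  Methanosarcina barkeri str. Fusaro' \
  > "$tmp/master_record"

# Count rows where field 3 is the genus and field 4 is the species.
# "n + 0" makes awk print 0 instead of an empty string when nothing matches.
mazei_count=$(awk '($3 " " $4) == "Methanosarcina mazei" { n++ } END { print n + 0 }' \
  "$tmp/master_record")
printf '%s\n' "$mazei_count"
rm -rf "$tmp"
```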
