How to extract the first line for each entry in the first column?

I have a file with several hundred lines:

Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:23001231-23011701 HanXRQChr01g0004391 5938    6078    141 7.25e-55    220 95.035  locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
Chr01:38759426-38779934 HanXRQChr15g0474141 7163    7204    42  1.96e-08    67.6    95.238  locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29
Chr01:38759426-38779934 HanXRQChr15g0474141 7003    7043    41  7.05e-08    65.8    95.122  locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29

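Some of these lines are unique with respect to the first column, for example the first line, Chr01:19967945-19972643, while for other first-column values there are several lines, for example Chr01:23001231-23011701.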

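For each value in the first column I want to keep only the first line, because the first line holds the best values of some other parameters in columns 6, 7 and 8.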

My desired output is:

Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein

Answer 1

You can use awk to keep track of the first-field values you have already seen:

awk '!seen[$1]++' infile

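This uses a hash, seen, keyed on the first field ($1). We test whether the post-incremented value of seen[$1] is false: when a value is encountered for the first time, seen[$1]++ returns 0, so !seen[$1]++ is true; if the value has been seen before, seen[$1]++ returns a number greater than 0, so !seen[$1]++ is false.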

When the condition is true, the default action is to print the whole line ({ print $0 }), which is exactly what we want, so we don't have to spell it out.

The following does the same thing as the one-liner, in a more verbose but easier-to-follow way:

awk 'seen[$1] == 0 {
         ++seen[$1]
         print $0
     }' infile
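As a quick check that the short and the verbose form behave the same, here is a minimal sketch on a made-up three-line sample. The keys keyA and keyB and the variable names are purely illustrative and are not taken from the question's data.

```shell
# Hypothetical sample: keyA appears twice, keyB once.
sample=$(printf '%s\n' 'keyA first' 'keyB first' 'keyA second')

# Short form: a line is printed only the first time its first field is seen.
short=$(printf '%s\n' "$sample" | awk '!seen[$1]++')

# Verbose form: same test, with the increment and the print written out.
verbose=$(printf '%s\n' "$sample" | awk 'seen[$1] == 0 { ++seen[$1]; print $0 }')

printf '%s\n' "$short"
# prints:
# keyA first
# keyB first
```

Both variables end up holding the same two lines; the second keyA line is dropped because seen["keyA"] is already 1 when it is read.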

Answer 2

$ sort -u -s -k1,1 file
Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein

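With -k1,1 the sort command treats only the first whitespace-delimited field as the sort key, and with -u it returns the sorted data with duplicate keys removed (the first line found for each unique key is returned). The -s option tells sort to use a "stable" sorting algorithm, i.e. one that does not change the relative order of records with equal keys (I'm not 100% sure this is needed, but it seems reasonable to use it).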
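One practical difference between the two answers is the order of the output: awk keeps lines in input order, while sort orders them by key. The sketch below illustrates this on a made-up sample whose keys are deliberately unsorted. The sample data, the variable names and the use of LC_ALL=C (to get plain byte-wise collation) are assumptions added for this illustration.

```shell
# Hypothetical sample with keys in the order b, a, b.
sample=$(printf '%s\n' 'b 1' 'a 1' 'b 2')

# awk: first line per key, in the order the keys first appear in the input.
awk_out=$(printf '%s\n' "$sample" | awk '!seen[$1]++')

# sort: one line per key, ordered by the key itself.
sort_out=$(printf '%s\n' "$sample" | LC_ALL=C sort -u -s -k1,1)

printf 'awk:\n%s\nsort:\n%s\n' "$awk_out" "$sort_out"
```

Here awk outputs the b line before the a line, whereas sort outputs key a before key b. For the file in the question the two approaches give the same result only because the input is already ordered by the first column.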
