I have a file with several hundred lines:
Chr01:19967945-19972643 HanXRQChr01g0004001 1 4698 4698 0.0 8676 100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1 10470 10470 0.0 19335 100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:23001231-23011701 HanXRQChr01g0004391 5938 6078 141 7.25e-55 220 95.035 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1 20472 20472 0.0 37805 100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
Chr01:38759426-38779934 HanXRQChr15g0474141 7163 7204 42 1.96e-08 67.6 95.238 locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29
Chr01:38759426-38779934 HanXRQChr15g0474141 7003 7043 41 7.05e-08 65.8 95.122 locus_tag=HanXRQChr15g0474141 gn=IQD29 begin=37205639 end=37211555 len=5917 chr=HanXRQChr15 strand=-1 sp=Helianthus annuus def=Probable IQ-domain 29
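Some lines are unique with respect to the first column, for example the first line, Chr01:19967945-19972643, while for other values of the first column there are several lines, for example Chr01:23001231-23011701.
For each value in the first column I want to keep only the first line, because the first line holds the best values for the other parameters in columns 6, 7 and 8.
The output I want is: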
Chr01:19967945-19972643 HanXRQChr01g0004001 1 4698 4698 0.0 8676 100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1 10470 10470 0.0 19335 100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1 20472 20472 0.0 37805 100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
Answer 1
You can use awk to keep track of the first fields you have already seen:
awk '!seen[$1]++' infile
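This uses a hash, seen, keyed by the first field ($1). The pattern tests the value of seen[$1] before the post-increment is applied: when a new value is encountered, seen[$1]++ returns 0 and !seen[$1]++ is true; if the value has been seen before, seen[$1]++ returns a value greater than 0 and !seen[$1]++ is false.
When the condition is true, the default action is to print the whole line ({ print $0 }), which is exactly what we want, so we don't have to spell it out.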
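As a small illustration of how the counter drives the filter, the sketch below prints the value of seen[$1] before each increment together with the resulting decision. The helper name trace_seen and the three-line sample are made up for this example and are not part of the original data.

```shell
# trace_seen is a hypothetical helper name used only for this illustration.
trace_seen() {
    awk '{
        before = seen[$1]++    # value of the counter before the increment
        print $1, before, (before == 0 ? "keep" : "drop")
    }' "$@"
}

# Hypothetical sample: keys "a" and "b" stand in for the first column.
printf '%s\n' 'a 1' 'a 2' 'b 3' | trace_seen
```

On this sample it prints a 0 keep, a 1 drop and b 0 keep, one per line: only lines whose counter was still 0 pass the filter.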
This does the same thing in a more verbose but easier-to-follow way:
awk 'seen[$1] == 0 {
++seen[$1]
print $0
}' infile
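One way to sanity-check the result is to count how many first-column values still occur more than once after filtering. This is only a sketch: count_dup_keys is a hypothetical helper name, and the sample lines are invented stand-ins for the real file.

```shell
# count_dup_keys (hypothetical helper): prints how many distinct
# first-column values occur on more than one line of its input.
count_dup_keys() {
    awk '{ n[$1]++ }
         END { d = 0; for (k in n) if (n[k] > 1) d++; print d }' "$@"
}

# Hypothetical sample: key "a" appears twice, key "b" once.
printf '%s\n' 'a 1' 'a 2' 'b 3' | count_dup_keys                      # before filtering
printf '%s\n' 'a 1' 'a 2' 'b 3' | awk '!seen[$1]++' | count_dup_keys  # after filtering
```

The first call should report 1 (the key a is repeated) and the second 0. On the real data, piping the filtered output through the same helper should likewise give 0.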
Answer 2
$ sort -u -s -k1,1 file
Chr01:19967945-19972643 HanXRQChr01g0004001 1 4698 4698 0.0 8676 100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1 10470 10470 0.0 19335 100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1 20472 20472 0.0 37805 100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein
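The sort command treats only the first whitespace-delimited field as the sort key (-k1,1) and, because of -u, returns the sorted data with duplicate keys removed, keeping the first line found for each key. -s tells sort to use a "stable" sorting algorithm, i.e. one that does not change the relative order of records with equal keys (I'm not 100% sure it is needed, but it seems reasonable to use it).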
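One difference from the awk approach is worth seeing on a toy input: sort returns the surviving lines ordered by key, whereas awk leaves them in input order. The sketch below assumes a sort that supports -s (GNU sort does; the option is not in POSIX), and sample is a hypothetical helper that prints four made-up lines.

```shell
# sample (hypothetical helper): made-up input in which key "b" comes before key "a".
sample() { printf '%s\n' 'b 1' 'a 2' 'b 3' 'a 4'; }

sample | sort -u -s -k1,1    # one line per key, ordered by key
sample | awk '!seen[$1]++'   # one line per key, in input order
```

With GNU sort I would expect the first command to print a 2 and then b 1, while the awk command prints b 1 and then a 2. If the original order of the file matters, the awk version is the safer choice.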