How do I extract the first line for each entry in the first column?

Answer 1

You can use awk to keep track of the first-field values you have already seen:

awk '!seen[$1]++' infile

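This uses a hash, seen, keyed on the first field ($1). The condition tests the post-incremented value of seen[$1]: the first time a value appears, seen[$1]++ returns 0, so !seen[$1]++ is true; if the value has already been seen, seen[$1]++ returns a number greater than 0, so !seen[$1]++ is false.

When the condition is true, the default action is to print the whole line ({ print $0 }), which is exactly what we want, so we don't have to spell it out.

This does the same thing in a more verbose but easier-to-understand way: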
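To make the evaluation order concrete, here is a small sketch that prints the value returned by seen[$1]++ for each line, together with the resulting truth value of the negation. The helper name trace_seen and the sample lines are made up for illustration; they are not part of the original answer.

```shell
# trace_seen is a hypothetical helper; it reads lines on stdin.
trace_seen() {
  awk '{
    old = seen[$1]++          # value BEFORE the increment
    printf "%s old=%d keep=%d\n", $1, old, !old
  }'
}

printf '%s\n' 'b 1' 'a 2' 'b 3' | trace_seen
# prints:
# b old=0 keep=1
# a old=0 keep=1
# b old=1 keep=0
```

Only the lines with keep=1 would be printed by the one-liner.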

awk 'seen[$1] == 0 {
         ++seen[$1]
         print $0
     }' infile
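As a quick check that the two forms agree, the sketch below runs both on the same made-up input. The function names short_form and long_form and the sample data are assumptions for illustration only; both read from stdin rather than from infile.

```shell
# Hypothetical wrappers around the two awk programs shown above.
short_form() {
  awk '!seen[$1]++'
}

long_form() {
  awk 'seen[$1] == 0 {
         ++seen[$1]
         print $0
       }'
}

# Made-up sample: keys b and a each appear twice.
data=$(printf '%s\n' 'b 1' 'a 2' 'b 3' 'c 4' 'a 5')

printf '%s\n' "$data" | short_form
printf '%s\n' "$data" | long_form
# each command prints:
# b 1
# a 2
# c 4
```

Note that the input order is preserved: the first line for each key is printed in the position where it first occurs.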

Answer 2

$ sort -u -s -k1,1 file
Chr01:19967945-19972643 HanXRQChr01g0004001 1   4698    4698    0.0 8676    100.000 locus_tag=HanXRQChr01g0004001 gn=HanXRQChr01g0004001 begin=19967815 end=19972682 len=4868 chr=HanXRQChr01 strand=-1 sp=Helianthus annuus def=Probable protein kinase superfamily protein
Chr01:23001231-23011701 HanXRQChr01g0004391 1   10470   10470   0.0 19335   100.000 locus_tag=HanXRQChr01g0004391 gn=HanXRQChr01g0004391 begin=22999643 end=23012645 len=13003 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Putative squalene cyclase; Squalene cyclase, C-terminal; Squalene cyclase, N-terminal
Chr01:38759426-38779934 HanXRQChr01g0005671 1   20472   20472   0.0 37805   100.000 locus_tag=HanXRQChr01g0005671 gn=SPI begin=38759245 end=38779898 len=20654 chr=HanXRQChr01 strand=1 sp=Helianthus annuus def=Probable beige/BEACH domain ;WD domain, G-beta repeat protein

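The sort command treats only the first whitespace-delimited field as the sort key and returns the sorted data with duplicate keys removed (the first line found for each unique key is returned). The -s option tells sort to use a "stable" sorting algorithm, i.e. one that does not change the order of records that have the same key (I'm not 100% sure this is needed, but it seems reasonable to use it).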
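Here is a minimal sketch of the same command on a small made-up input, assuming GNU sort, which keeps the first line of each run of equal keys when -u is given. The helper name first_by_key_sorted and the sample lines are hypothetical.

```shell
# Hypothetical wrapper; reads lines on stdin.
first_by_key_sorted() {
  sort -u -s -k1,1
}

printf '%s\n' 'b 1' 'a 2' 'b 3' 'c 4' 'a 5' | first_by_key_sorted
# prints (with GNU sort):
# a 2
# b 1
# c 4
```

Unlike the awk approach, the output is ordered by key rather than by where each key first appeared in the input.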
