I have a file that looks like this:
7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251
8 C00000002 score: -39.520 nathvy = 49 nconfs = 3129
9 C00000004 score: -38.928 nathvy = 24 nconfs = 150
10 C00000002 score: -38.454 nathvy = 49 nconfs = 9473
11 C00000004 score: -37.704 nathvy = 24 nconfs = 156
12 C00000001 score: -37.558 nathvy = 41 nconfs = 51
2 C00000002 score: -48.649 nathvy = 49 nconfs = 3878
3 C00000001 score: -44.988 nathvy = 41 nconfs = 1988
4 C00000002 score: -42.674 nathvy = 49 nconfs = 6740
5 C00000002 score: -42.453 nathvy = 49 nconfs = 4553
6 C00000002 score: -41.829 nathvy = 49 nconfs = 7559
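The second column contains IDs that are not sorted here, and some of them are repeated (for example, C00000001). Each line also has a different number after score: (the number usually starts with -).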
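What I want to do is:
1) Read the second column (the unsorted IDs) and always select the first occurrence of each ID. So for C00000001, select the line with score: -37.558.
2) Once I have the unique IDs, sort them by the number after score:, so that the most negative number comes first and the most positive number comes last.
I want the output printed in the same way as the input file (same structure).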
Answer 1
$ sort -k2,2 -u < filename | sort -k4,4n
7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251
9 C00000004 score: -38.928 nathvy = 24 nconfs = 150
12 C00000001 score: -37.558 nathvy = 41 nconfs = 51
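Explanation:
sort -k2,2 -u: sorts the lines by the second column only and keeps one line per ID. GNU sort keeps the first line of each run of equal keys and does not reorder lines whose keys compare equal, so the line that survives is the first occurrence in the file. POSIX does not specify which of the equal lines is kept, so other sort implementations may behave differently.
sort -k4,4n: sorts by the score numerically in ascending order, so the most negative score comes first (no -r needed to reverse).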
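If you would rather not depend on which duplicate sort -u keeps, a minimal sketch is to do the de-duplication in awk and leave only the ordering to sort. The function name first_by_score and the file name sample_scores.txt are made up for this illustration; the sample is a shortened copy of the data in the question.

```shell
# Hypothetical helper: keep the first line seen for each ID (column 2),
# then sort the surviving lines numerically by score (column 4),
# most negative first.
first_by_score() {
    awk '!seen[$2]++' "$1" | sort -k4,4n
}

# Shortened sample in the same layout as the question (file name is an assumption).
cat > sample_scores.txt <<'EOF'
7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251
9 C00000004 score: -38.928 nathvy = 24 nconfs = 150
12 C00000001 score: -37.558 nathvy = 41 nconfs = 51
2 C00000002 score: -48.649 nathvy = 49 nconfs = 3878
3 C00000001 score: -44.988 nathvy = 41 nconfs = 1988
EOF

first_by_score sample_scores.txt
```

The awk condition is true only the first time an ID appears, so input order decides which line survives, independent of the sort implementation.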
Answer 2
Using GNU awk 4.0 or later:
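$ gawk '
# Remember the first line and its score for each ID. Sorting whole lines
# numerically would order them by the leading number in column 1 instead.
!($2 in score) {score[$2] = $4; line[$2] = $0}
END {PROCINFO["sorted_in"] = "@val_num_asc"; for (id in score) print line[id]}
' file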
7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251
9 C00000004 score: -38.928 nathvy = 24 nconfs = 150
12 C00000001 score: -37.558 nathvy = 41 nconfs = 51
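Where gawk is not available, the same idea (remember the first line per ID in an array, print everything at the end) can be sketched with any POSIX awk by handing the ordering to sort. This is an illustrative variant, not part of the answer above; the function name first_by_score_portable is an assumption, and it assumes the lines themselves contain no tab before the score prefix is stripped.

```shell
# Hypothetical helper: decorate each kept line with its score, sort on that
# score, then strip the decoration again.
first_by_score_portable() {
    awk '
    # Store the first line and its score for each ID in column 2.
    !($2 in line) { line[$2] = $0; score[$2] = $4 }
    # Emit "score<TAB>line"; array order is unspecified, sort fixes it.
    END { for (id in line) print score[id] "\t" line[id] }
    ' "$1" | sort -n | cut -f2-
}

# Usage with a three-line inline sample (file name is an assumption).
printf '%s\n' \
    '12 C00000001 score: -37.558 nathvy = 41 nconfs = 51' \
    '7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251' \
    '3 C00000001 score: -44.988 nathvy = 41 nconfs = 1988' > portable_sample.txt
first_by_score_portable portable_sample.txt
```

Keeping the score in its own array makes it explicit that the ordering key is column 4 rather than the numeric prefix of the whole line.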
Answer 3
Contributing an additional one-liner that is easy to configure:
for id in $(awk '{print $2}' tmp | sort -u); do grep -F -- "$id" tmp | head -n 1; done | sort -k4,4n
7 C00000002 score: -41.156 nathvy = 49 nconfs = 2251
9 C00000004 score: -38.928 nathvy = 24 nconfs = 150
12 C00000001 score: -37.558 nathvy = 41 nconfs = 51
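To make the "easy to configure" part concrete, here is a rough sketch that takes the ID column and the score column as arguments instead of hard-coding them. The function name first_unique_sorted and its argument order are assumptions made for this example; it replaces the loop with a single awk pass, so it is a different technique from the one-liner above.

```shell
# Hypothetical helper.
# Usage: first_unique_sorted FILE [ID_COLUMN] [SCORE_COLUMN]
# Defaults match the layout in the question: ID in column 2, score in column 4.
first_unique_sorted() {
    file=$1
    id_col=${2:-2}
    score_col=${3:-4}
    # Keep the first line per ID, then sort numerically on the score column only.
    awk -v c="$id_col" '!seen[$c]++' "$file" |
        sort -k "${score_col},${score_col}n"
}

# Example with a different layout: ID in column 1, score in column 2.
printf '%s\n' 'B -1.5' 'A -9.25' 'B -20.0' > two_columns.txt
first_unique_sorted two_columns.txt 1 2
```

Restricting the sort key to a single column with a numeric flag avoids the lexical comparison that a bare --key=4 would perform over the rest of the line.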