I have a file:
Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL020629 AAEL020629-RA GO:0000786
AAEL020629 AAEL020629-RA GO:0046982
AAEL011255 AAEL011255-RA GO:0005525
AAEL000004 AAEL000004-RA GO:0016021
AAEL000004 AAEL000004-RA GO:0016757
AAEL000004 AAEL000004-RA GO:0005789
AAEL000004 AAEL000004-RA GO:0006506
AAEL000004 AAEL000004-RA GO:0000030
AAEL003589 AAEL003589-RA NA
AAEL026354 AAEL026354-RA NA
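Some genes have multiple GO_IDs (for example AAEL020629 and AAEL000004 above). For each gene that has more than one GO_ID, I want to combine them all on a single line, separated by a comma and a space.
Here is the output I want: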
Gene GO_ID
AAEL025769 GO:0005525
AAEL020629 GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255 GO:0005525
AAEL000004 GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589 NA
AAEL026354 NA
Any idea how I can do this? Thanks.
Answer 1
With Miller:
$ mlr --pprint nest --implode --values --across-records --nested-fs ', ' -f GO_ID then cut -x -f stable file
Gene GO_ID
AAEL025769 GO:0005525
AAEL020629 GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255 GO:0005525
AAEL000004 GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL003589 NA
AAEL026354 NA
Or (slightly simpler, but with less control over the output) with GNU datamash:
$ datamash -HW groupby Gene collapse GO_ID < file
GroupBy(Gene) collapse(GO_ID)
AAEL025769 GO:0005525
AAEL020629 GO:0003677,GO:0005634,GO:0000786,GO:0046982
AAEL011255 GO:0005525
AAEL000004 GO:0016021,GO:0016757,GO:0005789,GO:0006506,GO:0000030
AAEL003589 NA
AAEL026354 NA
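If the datamash header and the bare-comma separator are the only things in the way, a small post-processing filter may be enough. The sketch below is an assumption on my part rather than a datamash feature: `tidy_collapsed` is a hypothetical helper name, and it relies on GO IDs never containing a comma themselves.

```shell
#!/bin/sh
# tidy_collapsed: hypothetical helper (not part of datamash).
# Reads datamash "groupby ... collapse ..." output on stdin, renames the
# generated header fields and puts a space after every comma.
tidy_collapsed() {
    awk '
        NR == 1 {                           # rewrite the generated header
            sub(/GroupBy\(Gene\)/, "Gene")
            sub(/collapse\(GO_ID\)/, "GO_ID")
            print; next
        }
        { gsub(/,/, ", "); print }          # assumes no commas inside a GO ID
    '
}

# Intended use (needs GNU datamash and the input file from the question):
#   datamash -HW groupby Gene collapse GO_ID < file | tidy_collapsed

# Demo on two lines copied from the datamash output above
printf '%s\n' 'GroupBy(Gene) collapse(GO_ID)' \
    'AAEL020629 GO:0003677,GO:0005634' | tidy_collapsed
```

The filter leaves whatever field separator datamash emitted untouched, so it should not matter whether the columns are split by tabs or spaces.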
Answer 2
awk can help:
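$ awk 'NR == 1 { print $1, $3; next }              # header: keep only Gene and GO_ID
  { a[$1] = ($1 in a) ? a[$1] ", " $3 : $3 }        # append each GO_ID to its gene
  END { for (i in a) print i, a[i] }                # for-in order is unspecified
' file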
Gene GO_ID
AAEL003589 NA
AAEL025769 GO:0005525
AAEL026354 NA
AAEL000004 GO:0016021, GO:0016757, GO:0005789, GO:0006506, GO:0000030
AAEL020629 GO:0003677, GO:0005634, GO:0000786, GO:0046982
AAEL011255 GO:0005525
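The genes above come out in arbitrary order because awk does not define the order of a `for (i in a)` loop. If the original order matters, one option is to record each gene the first time it appears. This is a minimal sketch, assuming whitespace-separated input with a header line and the three columns shown in the question; `collapse_go`, `go` and `order` are names I made up for illustration.

```shell
#!/bin/sh
# collapse_go: merge the GO_ID values of each gene onto one line,
# keeping genes in the order they first appear in the input.
# Assumes columns: Gene, stable, GO_ID, with one header line.
collapse_go() {
    awk '
        NR == 1 { print $1, $3; next }      # header: keep Gene and GO_ID
        !($1 in go) {                       # first time this gene is seen
            order[++n] = $1
            go[$1] = $3
            next
        }
        { go[$1] = go[$1] ", " $3 }         # later rows: append to the list
        END {
            for (k = 1; k <= n; k++)        # replay genes in first-seen order
                print order[k], go[order[k]]
        }
    ' "$@"
}

# Demo on a few rows taken from the question
collapse_go <<'EOF'
Gene stable GO_ID
AAEL025769 AAEL025769-RA GO:0005525
AAEL020629 AAEL020629-RA GO:0003677
AAEL020629 AAEL020629-RA GO:0005634
AAEL003589 AAEL003589-RA NA
EOF
```

Called as `collapse_go file`, it reads the named file instead of standard input. Rows for the same gene do not have to be adjacent, at the cost of holding every gene in memory until the end.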