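I am trying to get the pairwise combinations of the strings available within each group of data.
My input file contains two columns: col1 is the gene name and col2 is the name of a stressor.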
gene1 FishKairomones
gene1 Microcystin
gene1 Calcium
gene2 Cadmium
gene2 Microcystis
gene2 FishKairomones
gene2 Phosphorous
gene3 FishKairomones
gene3 Microcystin
gene3 Phosphorous
gene3 Cadmium
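So from the table you can see that gene1 responds to 3 stressors: FishKairomones, Microcystin and Calcium.
I would like to get a pairwise table like this: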
gene1 FishKairomones gene1 Microcystin
gene1 FishKairomones gene1 Calcium
gene1 Microcystin gene1 Calcium
gene2 Cadmium gene2 Microcystis
gene2 Cadmium gene2 FishKairomones
gene2 Cadmium gene2 Phosphorous
gene2 Microcystis gene2 FishKairomones
gene2 Microcystis gene2 Phosphorous
gene2 FishKairomones gene2 Phosphorous
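As you can see, gene1 FishKairomones is paired with gene1 Microcystin, gene1 FishKairomones is also paired with gene1 Calcium, and gene1 Microcystin is paired with gene1 Calcium. I would like to do the same for every gene.
A gene may have 3 stressors, or 4, and so on.
I tried the code from here: Command line tool to "cat" pairwise expansion of all rows in a file
That creates all pairwise combinations across the whole file, which is not what I want.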
Answer 1
AWK solution (works even if the input lines are not grouped by gene):
awk '{ a[$1]=($1 in a? a[$1]",":"")$2 }   # grouping `stressors` by `gene` names
     END {
         for (k in a) {                     # for each `gene`
             len=split(a[k], b, ",");       # split `stressors` string into array b
             for (i=1;i<len;i++)            # construct pairwise combinations
                 for (j=i+1;j<=len;j++)     # between `stressors`
                     print k,b[i],k,b[j]
         }
     }' file
Output:
gene1 FishKairomones gene1 Microcystin
gene1 FishKairomones gene1 Calcium
gene1 Microcystin gene1 Calcium
gene2 Cadmium gene2 Microcystis
gene2 Cadmium gene2 FishKairomones
gene2 Cadmium gene2 Phosphorous
gene2 Microcystis gene2 FishKairomones
gene2 Microcystis gene2 Phosphorous
gene2 FishKairomones gene2 Phosphorous
gene3 FishKairomones gene3 Microcystin
gene3 FishKairomones gene3 Phosphorous
gene3 FishKairomones gene3 Cadmium
gene3 Microcystin gene3 Phosphorous
gene3 Microcystin gene3 Cadmium
gene3 Phosphorous gene3 Cadmium
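One caveat worth noting: POSIX awk does not specify the order in which `for (k in a)` visits keys, so the genes may not always be printed in the order they appear in the input. The following is a minimal sketch of a variant that records each gene the first time it is seen and iterates over that list instead. The function name `pairwise` and the array name `order` are introduced here purely for illustration and are not part of the original answer. It still assumes that stressor names contain no commas.

```shell
#!/bin/sh
# pairwise: hypothetical helper; reads "gene stressor" lines on stdin and
# prints every within-gene pair, with genes in first-seen input order.
pairwise() {
    awk '
    NF >= 2 {
        if ($1 in a) {
            a[$1] = a[$1] "," $2      # append stressor to an existing gene
        } else {
            order[++n] = $1           # remember the gene in first-seen order
            a[$1] = $2
        }
    }
    END {
        for (g = 1; g <= n; g++) {    # walk genes in recorded order
            k = order[g]
            len = split(a[k], b, ",")
            for (i = 1; i < len; i++)
                for (j = i + 1; j <= len; j++)
                    print k, b[i], k, b[j]
        }
    }'
}

# Example call with interleaved (ungrouped) input lines
pairwise <<'EOF'
gene2 Cadmium
gene1 FishKairomones
gene2 Microcystis
gene1 Calcium
gene2 Phosphorous
EOF
```

With a real input file the call would be `pairwise < file`. A gene with a single stressor produces no output lines, since there is nothing to pair it with.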