Pairwise combinations of strings based on two columns

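I am trying to get the pairwise combinations of the strings available within each group of rows.

The input file has two columns: col1 is the gene name and col2 is the name of a stressor.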

        gene1   FishKairomones
        gene1   Microcystin
        gene1   Calcium
        gene2   Cadmium
        gene2   Microcystis
        gene2   FishKairomones
        gene2   Phosphorous
        gene3   FishKairomones
        gene3   Microcystin
        gene3   Phosphorous
        gene3   Cadmium

So, as the table shows, gene1 responds to 3 stressors: FishKairomones, Microcystin and Calcium.

I would like to get a pairwise table like this:

    gene1   FishKairomones  gene1   Microcystin
    gene1   FishKairomones  gene1   Calcium
    gene1   Microcystin gene1   Calcium
    gene2   Cadmium gene2   Microcystis
    gene2   Cadmium gene2   FishKairomones
    gene2   Cadmium gene2   Phosphorous
    gene2   Microcystis gene2   FishKairomones
    gene2   Microcystis gene2   Phosphorous
    gene2   FishKairomones  gene2   Phosphorous

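As you can see, gene1 FishKairomones is paired with gene1 Microcystin, gene1 FishKairomones is also paired with gene1 Calcium, and gene1 Microcystin is paired with gene1 Calcium. I want to do the same for every gene.

A gene may have 3 stressors, or 4, and so on.

I tried the code from this question: "Command line tool to 'cat' pairwise expansion of all rows in a file"

That creates all pairwise combinations across the whole file, which is not what I want.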

Answer 1

AWK solution (works even if the input lines are not sorted by gene):

awk '{ a[$1]=($1 in a? a[$1]",":"")$2 }   # grouping `stressors` by `gene` names
     END { 
         for (k in a) {                   # for each `gene`
             len=split(a[k], b, ",");     # split `stressors` string into array b
             for (i=1;i<len;i++)          # construct pairwise combinations
                 for (j=i+1;j<=len;j++)   # between `stressors` 
                     print k,b[i],k,b[j] 
         } 
     }' file

Output:

gene1 FishKairomones gene1 Microcystin
gene1 FishKairomones gene1 Calcium
gene1 Microcystin gene1 Calcium
gene2 Cadmium gene2 Microcystis
gene2 Cadmium gene2 FishKairomones
gene2 Cadmium gene2 Phosphorous
gene2 Microcystis gene2 FishKairomones
gene2 Microcystis gene2 Phosphorous
gene2 FishKairomones gene2 Phosphorous
gene3 FishKairomones gene3 Microcystin
gene3 FishKairomones gene3 Phosphorous
gene3 FishKairomones gene3 Cadmium
gene3 Microcystin gene3 Phosphorous
gene3 Microcystin gene3 Cadmium
gene3 Phosphorous gene3 Cadmium
