Concatenate columns horizontally when the preceding fields match (multiple columns to concatenate)

I have a file of intersection hits in which each row has the following fields (chr, start, end, chr, start, end, number of overlapping base pairs):

    chr1    69744110    69793325    .   -1  -1  0
    chr1    82791976    82831348    chr1    82792114    82792615    501
    chr1    82791976    82831348    chr1    82816285    82817077    792
    chr1    82791976    82831348    chr1    82828015    82829891    1876
    chr1    88599340    88658398    .   -1  -1  0
    chr1    137772945   137830035   .   -1  -1  0
    chr1    137875312   137920590   .   -1  -1  0
    chr1    193433080   193446861   .   -1  -1  0
    chr10   26483800    26501370    chr10   26484794    26485295    501
    chr10   68069913    68089436    .   -1  -1  0
    chr10   95098349    95113967    .   -1  -1  0
    chr10   97310211    97335589    .   -1  -1  0
    chr10   111083097   111118237   chr10   111088928   111090274   1346
    chr10   117904141   117947090   chr10   117905334   117906320   986
    chr10   117904141   117947090   chr10   117918966   117919852   886
    chr10   117904141   117947090   chr10   117926867   117927368   501
    chr11   11521339    11587607    chr11   11523970    11524747    777
    chr11   11521339    11587607    chr11   11555497    11559868    4371
    chr11   11521339    11587607    chr11   11560639    11562128    1489
    chr11   11521339    11587607    chr11   11564617    11565370    753

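What I need is to concatenate the values in column 5 (column5/column5...), column 6 (column6/column6...) and column 7 (column7/column7...) whenever the first 3 columns match. I would also like to keep column 4, but it is fine if I lose it.

The output should look like this: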

    chr1    69744110    69793325    .   -1  -1  0
    chr1    82791976    82831348    chr1    82792114/82816285/82828015    82792615/82817077/82829891    501/792/1876
    chr1    88599340    88658398    .   -1  -1  0
    chr1    137772945   137830035   .   -1  -1  0
    chr1    137875312   137920590   .   -1  -1  0
    chr1    193433080   193446861   .   -1  -1  0
    chr10   26483800    26501370    chr10   26484794    26485295    501
    (...)
    chr10   117904141   117947090   chr10 117905334/117918966/117926867 117906320/117919852/117927368   986/886/501
    (...)

I have made several attempts, and the best I could come up with is:

    awk '{ k=$1 FS $2 FS $3;  a[k]=(k in a)? a[k]"/"$5 : $5 }
    END{ for(i in a) {
              split(i,b,FS); b[5]=a[i]"\t"b[5]; r="";
              for(j=1;j<=NF;j++) {
                  r=(r!="")? r"\t"b[j] : b[j]
              }
              print r}
    }' input.bed > output.bed

But this way I lose values, and I cannot concatenate more than one column.

Can you help me?

Edit:

New attempt:

    awk -F'\t' -v OFS='\t' '{
        if ($2 in a) {
            a[$2] = a[$2]";"$5;
            b[$2] = b[$2]";"$6;
        } else {
            a[$2] = $5;
            b[$2] = $6;
        }
    }
    END { for (i in a) print i, a[i], b[i] }' input.bed > output.bed

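But I keep losing the fields that are not being aggregated.

Answer 1

With awk. Unfortunately, awk has no built-in function for joining arrays, but the gawk online manual has an example of how to write one.

If this is in a file aggregate.awk (I am assuming the input file is tab-separated):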

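    BEGIN {
        FS = OFS = "\t"
    }

    # ref https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html#Join-Function
    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
            sep = " "
        else if (sep == SUBSEP) # magic value
            sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }

    # Print the buffered group, with columns 5-7 replaced by "/"-joined lists.
    function print_record() {
        last_line[5] = join(col5, 1, n, "/")
        last_line[6] = join(col6, 1, n, "/")
        last_line[7] = join(col7, 1, n, "/")
        print join(last_line, 1, nf, OFS)
    }

    # The first three columns identify a group.
    {
        key = $1 OFS $2 OFS $3
    }

    # A new key starts: flush the previous group and reset the buffers.
    key != prev_key {
        if (n > 0) {
            print_record()
        }
        delete col5
        delete col6
        delete col7
        n = 0
    }

    # Buffer columns 5-7 of every row; keep the latest row as the template.
    {
        n++
        col5[n] = $5
        col6[n] = $6
        col7[n] = $7
        prev_key = key
        nf = split($0, last_line)   # field count of the buffered row
    }

    # Flush the final group (nothing to print for empty input).
    END { if (n > 0) print_record() }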

Then we have:

    $ awk -f aggregate.awk input.bed
    chr1    69744110        69793325        .       -1      -1      0
    chr1    82791976        82831348        chr1    82792114/82816285/82828015      82792615/82817077/82829891      501/792/1876
    chr1    88599340        88658398        .       -1      -1      0
    chr1    137772945       137830035       .       -1      -1      0
    chr1    137875312       137920590       .       -1      -1      0
    chr1    193433080       193446861       .       -1      -1      0
    chr10   26483800        26501370        chr10   26484794        26485295        501
    chr10   68069913        68089436        .       -1      -1      0
    chr10   95098349        95113967        .       -1      -1      0
    chr10   97310211        97335589        .       -1      -1      0
    chr10   111083097       111118237       chr10   111088928       111090274       1346
    chr10   117904141       117947090       chr10   117905334/117918966/117926867   117906320/117919852/117927368   986/886/501
    chr11   11521339        11587607        chr11   11523970/11555497/11560639/11564617     11524747/11559868/11562128/11565370     777/4371/1489/753
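
Note that this approach compares each row only with the previous one, so it relies on rows with the same first three columns being adjacent, as they are in the sample. If that cannot be guaranteed, one possible variant is to collect the groups in associative arrays and print them in first-seen order. The following is only a minimal sketch under stated assumptions: the function name `merge_hits_unsorted` is hypothetical, the input is tab-separated with seven columns, column 4 is taken from the first row of each group, and all groups fit in memory.

```shell
# Hypothetical helper: reads tab-separated rows on stdin, writes merged rows to stdout.
merge_hits_unsorted() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (!(key in c5)) {           # first row seen for this interval
            order[++count] = key      # remember the order in which keys appear
            c4[key] = $4
            c5[key] = $5; c6[key] = $6; c7[key] = $7
        } else {                      # later hit: append with "/"
            c5[key] = c5[key] "/" $5
            c6[key] = c6[key] "/" $6
            c7[key] = c7[key] "/" $7
        }
    }
    END {
        # key already holds columns 1-3 joined by tabs
        for (i = 1; i <= count; i++) {
            key = order[i]
            print key, c4[key], c5[key], c6[key], c7[key]
        }
    }'
}
```

It could be used as `merge_hits_unsorted < input.bed > output.bed`. The trade-off is memory: the streaming script above holds one group at a time, while this sketch holds every group until the end of the input.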

Answer 2

Problem:

Concatenate the values in columns 5, 6 and 7 when the first 3 columns match.

Answer:

    perl -lane 'if($.==1){@a=@F;next} if($F[0] eq $a[0] && $F[1] eq $a[1] && $F[2] eq $a[2]){$a[$_].="/$F[$_]" for 4..6}else{print join("\t",@a);@a=@F} END{print join("\t",@a) if @a}' input.bed

Output:

    chr1    69744110        69793325        .       -1      -1      0
    chr1    82791976        82831348        chr1    82792114/82816285/82828015      82792615/82817077/82829891      501/792/1876
    chr1    88599340        88658398        .       -1      -1      0
    chr1    137772945       137830035       .       -1      -1      0
    chr1    137875312       137920590       .       -1      -1      0
    chr1    193433080       193446861       .       -1      -1      0
    chr10   26483800        26501370        chr10   26484794        26485295        501
    chr10   68069913        68089436        .       -1      -1      0
    chr10   95098349        95113967        .       -1      -1      0
    chr10   97310211        97335589        .       -1      -1      0
    chr10   111083097       111118237       chr10   111088928       111090274       1346
    chr10   117904141       117947090       chr10   117905334/117918966/117926867   117906320/117919852/117927368   986/886/501
    chr11   11521339        11587607        chr11   11523970/11555497/11560639/11564617     11524747/11559868/11562128/11565370     777/4371/1489/753

Note:

There may be a shorter or more elegant solution.
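
One shorter variant, offered only as a sketch, keeps the pending group in plain awk variables instead of arrays. It assumes tab-separated input with seven columns and that rows sharing the first three columns are adjacent; the function name `merge_hits` is hypothetical.

```shell
# Hypothetical helper: streams stdin, merging adjacent rows with equal columns 1-3.
merge_hits() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (NR > 1 && key == prev) {     # same interval as the pending group
            c5 = c5 "/" $5; c6 = c6 "/" $6; c7 = c7 "/" $7
        } else {
            if (NR > 1) print prev, c4, c5, c6, c7   # flush the finished group
            prev = key; c4 = $4; c5 = $5; c6 = $6; c7 = $7
        }
    }
    END { if (NR > 0) print prev, c4, c5, c6, c7 }'
}
```

For example: `merge_hits < input.bed > output.bed`.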
