I have a file of intersect hits in which each feature is recorded with the following fields (chr, start, end, chr, start, end, number of overlapping base pairs):
chr1 69744110 69793325 . -1 -1 0
chr1 82791976 82831348 chr1 82792114 82792615 501
chr1 82791976 82831348 chr1 82816285 82817077 792
chr1 82791976 82831348 chr1 82828015 82829891 1876
chr1 88599340 88658398 . -1 -1 0
chr1 137772945 137830035 . -1 -1 0
chr1 137875312 137920590 . -1 -1 0
chr1 193433080 193446861 . -1 -1 0
chr10 26483800 26501370 chr10 26484794 26485295 501
chr10 68069913 68089436 . -1 -1 0
chr10 95098349 95113967 . -1 -1 0
chr10 97310211 97335589 . -1 -1 0
chr10 111083097 111118237 chr10 111088928 111090274 1346
chr10 117904141 117947090 chr10 117905334 117906320 986
chr10 117904141 117947090 chr10 117918966 117919852 886
chr10 117904141 117947090 chr10 117926867 117927368 501
chr11 11521339 11587607 chr11 11523970 11524747 777
chr11 11521339 11587607 chr11 11555497 11559868 4371
chr11 11521339 11587607 chr11 11560639 11562128 1489
chr11 11521339 11587607 chr11 11564617 11565370 753
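What I need is to concatenate the values in column 5 (column5/column5...), column 6 (column6/column6...) and column 7 (column7/column7...) whenever the first 3 columns match. I would also like to keep column 4, but it is fine if I lose it.
The output should look like this: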
chr1 69744110 69793325 . -1 -1 0
chr1 82791976 82831348 chr1 82792114/82816285/82828015 82792615/82817077/82829891 501/792/1876
chr1 88599340 88658398 . -1 -1 0
chr1 137772945 137830035 . -1 -1 0
chr1 137875312 137920590 . -1 -1 0
chr1 193433080 193446861 . -1 -1 0
chr10 26483800 26501370 chr10 26484794 26485295 501 (...)
chr10 117904141 117947090 chr10 117905334/117918966/117926867 117906320/117919852/117927368 986/886/501
(...)
I have made several attempts, and the best I could come up with is:
awk '{ k=$1 FS $2 FS $3; a[k]=(k in a)? a[k]"/"$5 : $5 }
END{ for(i in a) {
split(i,b,FS); b[5]=a[i]"\t"b[5]; r="";
for(j=1;j<=NF;j++) {
r=(r!="")? r"\t"b[j] : b[j]
}
print r}
}' input.bed > output.bed
But this way I lose values, and I cannot concatenate more than one column.
Can you help me?
Edit:
New attempt:
awk -F'\t' -v OFS='\t' '{
if ($2 in a) {
a[$2] = a[$2]";"$5;
b[$2] = b[$2]";"$6;
} else {
a[$2] = $5;
b[$2] = $6;
}
}
END { for (i in a) print i, a[i], b[i] }' input.bed > output.bed
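But I still lose the fields that are not being aggregated.
Answer 1
With awk. Unfortunately, awk has no built-in function for joining an array, but the gawk online manual has an example of how to write one.
If this is in the file aggregate.awk
(I assume the input file is tab-separated, and that rows sharing the first 3 columns are adjacent)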
BEGIN {
FS = OFS = "\t"
}
# ref https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html#Join-Function
function join(array, start, end, sep, result, i)
{
if (sep == "")
sep = " "
else if (sep == SUBSEP) # magic value
sep = ""
result = array[start]
for (i = start + 1; i <= end; i++)
result = result sep array[i]
return result
}
function print_record() {
last_line[5] = join(col5, 1, n, "/")
last_line[6] = join(col6, 1, n, "/")
last_line[7] = join(col7, 1, n, "/")
print join(last_line, 1, NF, OFS)
}
{
key = $1 OFS $2 OFS $3
}
key != prev_key {
if (n > 0) {
print_record()
}
delete col5
delete col6
delete col7
n = 0
}
{
n++
col5[n] = $5
col6[n] = $6
col7[n] = $7
prev_key = key
split($0, last_line)
}
# Flush the last group; skip this when the input was empty.
END { if (n > 0) print_record() }
Then we get:
$ awk -f aggregate.awk input.bed
chr1 69744110 69793325 . -1 -1 0
chr1 82791976 82831348 chr1 82792114/82816285/82828015 82792615/82817077/82829891 501/792/1876
chr1 88599340 88658398 . -1 -1 0
chr1 137772945 137830035 . -1 -1 0
chr1 137875312 137920590 . -1 -1 0
chr1 193433080 193446861 . -1 -1 0
chr10 26483800 26501370 chr10 26484794 26485295 501
chr10 68069913 68089436 . -1 -1 0
chr10 95098349 95113967 . -1 -1 0
chr10 97310211 97335589 . -1 -1 0
chr10 111083097 111118237 chr10 111088928 111090274 1346
chr10 117904141 117947090 chr10 117905334/117918966/117926867 117906320/117919852/117927368 986/886/501
chr11 11521339 11587607 chr11 11523970/11555497/11560639/11564617 11524747/11559868/11562128/11565370 777/4371/1489/753
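The script above compares each row with the previous one, so it relies on rows with the same first 3 columns being adjacent, as they are in the sample. If that might not hold for your file, the following is a minimal sketch of one alternative: it keeps one slash-separated string per key in associative arrays and remembers the order in which keys first appear. The function name merge_hits_unsorted is made up for illustration, and the sketch assumes tab-separated input with exactly 7 columns.

```shell
# Hypothetical helper: merge rows that share columns 1-3, even when they
# are not adjacent. Assumes tab-separated input with 7 columns.
merge_hits_unsorted() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (key in c5) {
            # Key seen before: append to the slash-separated lists.
            c5[key] = c5[key] "/" $5
            c6[key] = c6[key] "/" $6
            c7[key] = c7[key] "/" $7
        } else {
            # First occurrence: remember the output order and column 4.
            order[++n] = key
            c4[key] = $4
            c5[key] = $5
            c6[key] = $6
            c7[key] = $7
        }
    }
    END {
        # Print the groups in the order their keys first appeared.
        for (i = 1; i <= n; i++) {
            key = order[i]
            print key, c4[key], c5[key], c6[key], c7[key]
        }
    }' "$@"
}

# Example: two rows for the same feature, separated by an unrelated row.
printf 'chr1\t10\t20\tchr1\t11\t12\t1\nchr2\t5\t9\t.\t-1\t-1\t0\nchr1\t10\t20\tchr1\t15\t18\t3\n' | merge_hits_unsorted
```

In the example, the two chr1 rows collapse into a single row with 11/15, 12/18 and 1/3 in columns 5 to 7, followed by the untouched chr2 row. The trade-off is memory: every group is held until the end of the input, whereas the streaming script above only holds the current group.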
Answer 2
Question:
Concatenate the values in columns 5, 6 and 7 if the first 3 columns match
Answer:
perl -lane 'if($.==1){@a=@F;next} if($F[0]eq$a[0]&&$F[1]eq$a[1]&&$F[2]eq$a[2]){$a[4].="/$F[4]";$a[5].="/$F[5]";$a[6].="/$F[6]"}else{print join("\t",@a);@a=@F} END{print join("\t",@a) if @a}' input.bed
Output:
chr1 69744110 69793325 . -1 -1 0
chr1 82791976 82831348 chr1 82792114/82816285/82828015 82792615/82817077/82829891 501/792/1876
chr1 88599340 88658398 . -1 -1 0
chr1 137772945 137830035 . -1 -1 0
chr1 137875312 137920590 . -1 -1 0
chr1 193433080 193446861 . -1 -1 0
chr10 26483800 26501370 chr10 26484794 26485295 501
chr10 68069913 68089436 . -1 -1 0
chr10 95098349 95113967 . -1 -1 0
chr10 97310211 97335589 . -1 -1 0
chr10 111083097 111118237 chr10 111088928 111090274 1346
chr10 117904141 117947090 chr10 117905334/117918966/117926867 117906320/117919852/117927368 986/886/501
chr11 11521339 11587607 chr11 11523970/11555497/11560639/11564617 11524747/11559868/11562128/11565370 777/4371/1489/753
Note:
There may be a shorter or more elegant solution.
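As one possible shorter variant, here is a sketch in awk that buffers a single row and extends its columns 5 to 7 while the key stays the same. It is not taken from either answer; the function name merge_hits is hypothetical, and the sketch assumes tab-separated input with 7 columns in which rows with the same first 3 columns are adjacent.

```shell
# Hypothetical helper: streaming merge of adjacent rows that share columns 1-3.
# Usage: merge_hits input.bed > output.bed
merge_hits() {
    awk '
    BEGIN { FS = OFS = "\t" }
    { key = $1 FS $2 FS $3 }
    # Same key as the buffered row: extend columns 5-7 and move on.
    NR > 1 && key == prev {
        c5 = c5 "/" $5; c6 = c6 "/" $6; c7 = c7 "/" $7
        next
    }
    # New key: flush the buffered row (if any), then buffer this one.
    {
        if (NR > 1) print prev, c4, c5, c6, c7
        prev = key; c4 = $4; c5 = $5; c6 = $6; c7 = $7
    }
    # Flush the final buffered row unless the input was empty.
    END { if (NR > 0) print prev, c4, c5, c6, c7 }
    ' "$@"
}
```

Because only one row is buffered at a time, this works on large files without holding them in memory, but rows with the same key that are not adjacent end up as separate output rows.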