The file:
chr1_156186369 chr1_156186369_A_C,T A C,T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G,T A G,T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A,G T A,G 248.29 1/2:1,11,5:17:99:359,107,113
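I just need to remove the commas from columns 2 and 4 and print the whole line once for each of the comma-separated values.
The expected output is: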
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
I tried awk but did not get any results. I have also read a similar question here: How to extract lines from a file under specific conditions
Answer 1
Using awk:
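awk '{
  split($2, w2, ",")        # w2[1] is column 2 up to the first comma
  n = split($4, w4, ",")    # w4[1..n] are the alternatives in column 4
  # substr() positions start at 1; drop the trailing alternative from w2[1]
  prefix = substr(w2[1], 1, length(w2[1]) - length(w4[1]))
  for (i = 1; i <= n; i++)  # an indexed loop keeps the output in input order
    print $1, prefix w4[i], $3, w4[i], $5, $6
}' file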
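Note that there is no error handling for the case where the values after the commas in columns 2 and 4 do not match.
Answer 2
With sed, assuming the comma-separated values are single characters and that the same pair (for example C,T) appears twice on the line: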
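The missing check could be sketched roughly as follows. This is only one possible approach, assuming column 2 is always some prefix followed by the exact contents of column 4; the function name split_alleles is made up for illustration, and lines that fail the check are reported on stderr and skipped.

```shell
# split_alleles is a hypothetical helper name; it reads the table on stdin.
split_alleles() {
  awk '{
    n = split($4, alt, ",")                # alternatives listed in column 4
    # Assumption: column 2 is <prefix><column 4>; find where that suffix starts.
    start = length($2) - length($4) + 1
    if (start < 1 || substr($2, start) != $4) {
      print "line " NR ": column 2 does not end with column 4, skipped" > "/dev/stderr"
      next
    }
    prefix = substr($2, 1, start - 1)
    for (i = 1; i <= n; i++)               # one output line per alternative
      print $1, prefix alt[i], $3, alt[i], $5, $6
  }'
}

printf '%s\n' 'chr1_156186369 chr1_156186369_A_C,T A C,T 33150.29 1/2:0,4,6:10:88:272' | split_alleles
```

Lines without a comma pass the same check, since a single value in column 4 is still the suffix of column 2.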
$ sed -E 's/^(.*)([A-Z]),([A-Z])(.*)\2,\3(.*)/\1\2\4\2\5\n\1\3\4\3\5/' ip.txt
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
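^(.*) captures the leading text
([A-Z]),([A-Z]) captures the two comma-separated single characters
(.*) captures the text between the two occurrences
\2,\3 matches the same comma-separated characters again
(.*) captures the rest of the line
\1\2\4\2\5\n\1\3\4\3\5 is the required output format
Please note that the spacing does not exactly match the expected output.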
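To see what each group captures, the same pattern can be run on one sample line with a replacement that just labels the groups. This is a small illustration only; the function name show_groups is invented here, and it assumes a sed that accepts back-references with -E, as GNU sed does.

```shell
# show_groups is a hypothetical helper; it prints the five captured groups.
show_groups() {
  sed -E 's/^(.*)([A-Z]),([A-Z])(.*)\2,\3(.*)/1=[\1] 2=[\2] 3=[\3] 4=[\4] 5=[\5]/'
}

printf '%s\n' 'chr1_156186369 chr1_156186369_A_C,T A C,T 33150.29 1/2:0,4,6:10:88:272' | show_groups
# prints: 1=[chr1_156186369 chr1_156186369_A_] 2=[C] 3=[T] 4=[ A ] 5=[ 33150.29 1/2:0,4,6:10:88:272]
```

Because the first (.*) is greedy but the pair has to occur again later on the line, group 1 stops just before the first C,T rather than the second.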
Answer 3
I don't know how to do this with a single command, but it works with the following bash loop:
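while IFS= read -r line
do
  # Lines containing a letter,letter pair are expanded; all others pass through
  if echo "${line}" | grep -q '[[:alpha:]],[[:alpha:]]'
  then
    # Take the first pair on the line, e.g. C,T
    letters=$(echo "${line}" | grep -o '[[:alpha:]],[[:alpha:]]' | head -n 1)
    # Print the line once per letter, replacing every occurrence of the pair
    for letter in $(echo "${letters}" | sed 's/,/ /g')
    do
      echo "${line}" | sed "s/${letters}/${letter}/g"
    done
  else
    echo "${line}"
  fi
done < data.dat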
Answer 4
Split the fourth field on commas and, for each resulting slice, replace the trailing _X,Y of the second field with _slice, if there is one:
awk '{
n=split($4,slices,",")
for(i=1;i<=n;i++) {
res=$2
sub(/.,.*/,slices[i],res)
print $1, res, $3, slices[i], $5, $6
}
}' file
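I am not too happy with the way the fields are printed, since I list them explicitly from the 1st to the 6th, so hopefully the number of fields is fixed.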
$ awk '{n=split($4,slices,","); for(i=1;i<=n;i++) {res=$2; sub(/.,.*/,slices[i],res); print $1, res, $3, slices[i], $5, $6}}' a
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
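If the number of fields is not fixed, one possible variation is to overwrite $2 and $4 in place and print the whole record, so that any extra columns are carried along. This is a sketch rather than a tested replacement; the function name split_any_width is made up, and it relies on awk rebuilding $0 with single spaces (the default OFS) when a field is assigned.

```shell
# split_any_width is a hypothetical helper name; it reads the table on stdin.
split_any_width() {
  awk '{
    id = $2; alts = $4                 # keep the originals; both fields are overwritten below
    n = split(alts, slices, ",")
    for (i = 1; i <= n; i++) {
      res = id
      sub(/.,.*/, slices[i], res)      # same substitution as in the answer above
      $2 = res; $4 = slices[i]         # assigning a field rebuilds $0 using OFS
      print                            # prints every field, however many there are
    }
  }'
}

printf '%s\n' 'chrM_2619 chrM_2619_A_G,T A G,T 33023.29 1/2:0,5,5:10:99:293,144,129 extra' | split_any_width
```

One side effect is that runs of spaces or tabs between fields are normalised to single spaces in the output.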