A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG
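The awk script splits the values in columns 4 and 5 into single characters and then tests them pairwise. Where the two arrays differ, it prints the string from column 1, joined by an underscore to the corresponding value from column 2 or column 3. If both nucleotides differ, two lines of output are produced. In addition, the differing values from columns 4 and 5 should be printed for each id.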
awk '{ split($4, a1, ""); split($5, a2, ""); for (i in a1) { if (a1[i] != a2[i]) print $1 "_" $(i+1) }}' input > out
This does the first part.
The desired output is:
A01_11814111 G A
A01_11485519 G T
Answer 1
Contents of tmp.txt
A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG
Contents of tmp.awk
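{
    # First nucleotide differs: report the position in column 2.
    if (substr($4,1,1) != substr($5,1,1)) {
        print $1 "_" $2 " " substr($4,1,1) " " substr($5,1,1);
    }
    # Second nucleotide differs: report the position in column 3.
    if (substr($4,2,1) != substr($5,2,1)) {
        print $1 "_" $3 " " substr($4,2,1) " " substr($5,2,1);
    }
}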
Sample output
[user@server ~]$ awk -f tmp.awk tmp.txt
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
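The two `substr` tests above can also be written as a loop over the two positions. The following is only a sketch under the same assumptions as the script above (five whitespace-separated columns, two-letter genotypes in columns 4 and 5). The function name `diff_pairs` and the second input row are made up for illustration and are not part of the original data.

```shell
# Hypothetical helper: reads rows on stdin and prints one line
# per position at which columns 4 and 5 differ.
diff_pairs() {
  awk '{
    # Position 1 maps to column 2, position 2 maps to column 3.
    for (i = 1; i <= 2; i++) {
      a = substr($4, i, 1)
      b = substr($5, i, 1)
      if (a != b) print $1 "_" $(i + 1), a, b
    }
  }'
}

# First row is from tmp.txt; the second is an invented row
# in which both nucleotides differ.
printf 'A01 11814111 11814112 GA AA\nA01 100 200 AC GT\n' | diff_pairs
```

For the invented second row this prints two lines, one for each differing position, which matches the behaviour the question asks for.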
Bonus: the same in bash
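#!/bin/bash
# Read the five whitespace-separated columns of each row.
while read -r id pos1 pos2 gt1 gt2
do
    # Compare the first nucleotide of each genotype.
    if [ "${gt1:0:1}" != "${gt2:0:1}" ]
    then printf '%s_%s %s %s\n' "$id" "$pos1" "${gt1:0:1}" "${gt2:0:1}"
    fi
    # Compare the second nucleotide of each genotype.
    if [ "${gt1:1:1}" != "${gt2:1:1}" ]
    then printf '%s_%s %s %s\n' "$id" "$pos2" "${gt1:1:1}" "${gt2:1:1}"
    fi
done < tmp.txt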
Sample output
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
Answer 2
awk solution:
awk '{
split($4$5, arr, "");
if(arr[1] == arr[3])
print $1 "_" $3, arr[2], arr[4];
else
print $1 "_" $2, arr[1], arr[3];
}' input.txt
sed solution:
sed -r '
{
s@(\w*) *(\w*) *(\w*) *(\w)(\w) *\4(\w)$@\1_\3 \5 \6@
s@(\w*) *(\w*) *(\w*) *(\w)(\w) *(\w)\5$@\1_\2 \4 \6@
}' input.txt
Output (same for both)
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
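Both solutions in this answer appear to rely on exactly one of the two nucleotides differing in every row, which holds for the sample input. In the awk version, for instance, a row where both positions differ takes the `else` branch and produces only one line. A minimal sketch of a pre-check that lists rows breaking that assumption follows. The function name `odd_rows` and the last two input rows are invented for illustration.

```shell
# Hypothetical check: print rows whose genotypes differ at zero
# or at both positions, prefixed with the input line number.
odd_rows() {
  awk '{
    # Each comparison yields 1 (true) or 0 (false) in awk.
    d = (substr($4, 1, 1) != substr($5, 1, 1)) + (substr($4, 2, 1) != substr($5, 2, 1))
    if (d != 1) print NR ": " $0 " (" d " differences)"
  }'
}

# First row is from the sample input; the other two are invented.
printf 'A01 11814111 11814112 GA AA\nA01 100 200 AC GT\nA01 5 6 AA AA\n' | odd_rows
```

If this prints nothing for a given input file, every row has exactly one differing position.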