比较两个 DNA 序列

比较两个 DNA 序列

我有两个 DNAATGCATGC序列TACGTTGC

我想编写一个程序,如果比较A对齐则给出“+” TG否则C打印“-”

喜欢

ATGCATGC
TACGTTGC
+++++---

谁能帮我吗?

上面给出了我期望的输出。

我尝试过的如下:

#!/bin/bash


declare -a seq1=()
declare -a seq2=()


read -p 'Enter the ncleotide seq (charactor by charactor followe\d by space0) ' -a seq1

read -n1 seq1

read -p 'Enter the ncleotide seq (charactor by charactor followe\d by space0) ' -a seq2

read -n1 seq2

for a in ${seq1[*]} ; do
for b in ${seq2[*]} ; do
if [ $a == A ] || [ $b == T ] ; then
echo -n "+"
elif [ $a == A ] || [ $b == C ] ; then
echo -n " -"
elif [ $a == A ] || [ $b == G ] ; then
echo -n "-"
elif [ $a == T ] || [ $b == A ] ; then
echo -n "+"
elif [ $a == T ] || [ $b == C ] ; then
echo -n "-"
elif [ $a == T ] || [ $b == G ] ; then
echo -n "-"
elif [ $a == C ] || [ $b == G ] ; then
echo -n "+"
elif [ $a == C ] || [ $b == A ] ; then
echo -n "-"
elif [ $a == C ] || [ $b == T ] ; then
echo -n "-"
elif [ $a == G ] || [ $b == C ] ; then
echo -n "+"
elif [ $a == G ] || [ $b == A ] ; then
echo -n "-"
elif [ $a == G ] || [ $b == T ] ; then
echo -n "-"
else  
echo $a $b
fi
done
done

答案1

使用awk

awk -v seq1='CATGCATGCTCAT' -v seq2='ATACGTTGCGTTA' '
function sign(s) { cmp=(cmp==""?"":cmp) s }
BEGIN{
    split(seq1, tmp1, ""); split(seq2, tmp2, "");
    for(i in tmp1) {
        if( tmp1[i] == tmp2[i] ){ sign("-"); continue }
        if( (tmp1[i] ~/[AT]/ && tmp2[i] ~/[AT]/) || 
            (tmp1[i] ~/[GC]/ && tmp2[i] ~/[GC]/) ) { sign("+"); continue }
        sign("-")
    }
    print seq1 ORS seq2 ORS cmp
}'

输出:

CATGCATGCTCAT
ATACGTTGCGTTA
-+++++-----++

带有注释的相同代码:

awk -v seq1='CATGCATGCTCAT' -v seq2='ATACGTTGCGTTA' '
# set sequences one in seq1 another in seq2

function sign(s) { cmp=(cmp==""?"":cmp) s }
# function to join the the changes on +/- for each pair of chars

BEGIN{
    split(seq1, tmp1, ""); split(seq2, tmp2, "");
    # split each sequences characters into individual arrays

    for(i in tmp1) {
    # loop over keys on one of the arrays (assuming length of both seq will be same)

        if( tmp1[i] == tmp2[i] ){ sign("-"); continue }
        # if both chars were same AA, TT, CC, GG, ..., sign should be "-"

        if( (tmp1[i] ~/[AT]/ && tmp2[i] ~/[AT]/) || 
            (tmp1[i] ~/[GC]/ && tmp2[i] ~/[GC]/) ) { sign("+"); continue }
        # if one was "A" and another was "T" or vice-versa as well as
        # if one was "G" and another was "C" or vice-versa, sign should be "+"

        sign("-")
        # otherwise "-"

    }
    print seq1 ORS seq2 ORS cmp
    # print the last result, first printing the sequences then 
    # the comparison result in 'cmp'
}'

答案2

使用bashtr

#!/bin/bash

printf '%s\n%s\n' "$1" "$2"
tr 37124568 '++[-*]' <<<$(( $(tr ATCG 1234 <<<"$1 + $2") ))

测试:

$ bash script ACCTACCATAG AGTACCCGATC
ACCTACCATAG
AGTACCCGATC
-+-+----+++

该脚本将命令行中给出的序列中的字符替换为数字,这样当您将它们加在一起时,总和中的数字 3 和 7 意味着+,所有其他数字意味着-

字符按以下方式更改:

A --> 1
T --> 2
C --> 3
G --> 4

这意味着A+T与 1+2 相同,即 3。同样,对于C和,7 来自 3+4 G

外部tr调用将所有 3 和 7 转换为 ,+并将所有其他可能的数字转换为-。总和是通过算术展开式 计算的$(( ... )),该展开式将内部调用产生的两个数字相加,tr该内部调用只是将两个序列中的字母更改为数字。

不对输入进行验证。不会检查算术错误,例如溢出(对于长于少数碱基对的序列,可能会发生溢出,除非您小心地将它们分解,例如按字符)。

如果您想从标准输入读取序列,请使用

#!/bin/bash

read s1
read s2

printf '%s\n%s\n' "$s1" "$s2"
tr 37124568 '++[-*]' <<<$(( $(tr ATCG 1234 <<<"$s1 + $s2") ))

答案3

read seq1
read seq2
printf '%s\n' "$seq1" "$seq2" |
sed -Ee '
  N;H;p;z;x

  :zip
    #   1  2     3
    s/\n(.)(.*\n)(.)/\1\3\n\2/
    s/\n\n//; # zipping is complete when the two markers collide
  t zip

  :cmp
  s/^([-+]*)(AT|TA|CG|GC)/\1+/;t cmp
  s/[ATCG]{2}/-/;t cmp
'

输出:

ATGCATGCTATC
TACGTTGCATTG
+++++---++-+

这次使用另一种方法awk实用程序,我们在关联数组的帮助下逐个字符地压缩两个序列,h该数组预先填充了适当的键以给出 +,而其余的则得到 -

awk '
BEGIN {
  OFS = ORS

  split("AT TA CG GC", a)
  for (var in a) h[a[var]]

  seq1 = ARGV[1]; seq2 = ARGV[2]
  len = length(seq1)

  while ( ++pos <= len ) {
    s = substr(seq1, pos, 1) \
        substr(seq2, pos, 1) ;
    res = res ((s in h) ? "+" : "-")
  }

  print seq1, seq2, res
}
' "$seq1" "$seq2"

我们用蟒蛇3执行存储在字符串和压缩中的序列比较

python3 -c 'import sys
s,t = sys.argv[1::]
dna = "ATCG"
d = {i+j: "+" for i,j in zip(dna[0::2],dna[1::2])}
print(*[d.get(x+y,d.get(y+x,"-")) for x,y in zip(s,t)],sep="")
' "$seq1" "$seq2"

珀尔上面的Python示例的改编:

printf '%s\n' "$seq1" "$seq2" |
perl -F// -pale '
  my @h = qw(A T C G);
  my %h = (@h, reverse @h);
  my $zip = join q(), map { $F[$a++], $_ } <> =~ /./g;
  $_ = $zip ​=~ s/(.)(.)/qw(- +)[$2 eq $h{$1}]/reg;
'

相关内容