我有两个基因组序列,我想逐个字母地进行比较。它们具有几乎相同的模式,但发生了一些核苷酸转变。
例如
序列1
ATGGCGATGAGCAGCGGCGGCAGTGGTGGCGGCGTCCCGGAGCAGGAGGATTCCGTGCTGTTCCGGCGCGGCACAGGCCAGAGCGATGATTCTGACATTTGGGATGATACAGCACTGATAAAAGCATATGATAAAGCTGTGGCTTCATTTAAGCATGCTCTAAAGAATGGTGACATTTGTGAAACTTCGGGTAAACCAAAAACCACACCTAAAAGAAAACCTGCTAAGAAGAATAAAAGCCAAAAGAAGAATACTGCAGCTTCCTTACAACAGTGGAAAGTTGGGGACAAATGTTCTGCCATTTGGTCAGAAGACGGTTGCATTTACCCAGCTACCATTGCTTCAATTGATTTTAAGAGAGAAACCTGTGTTGTGGTTTACACTGGATATGGAAATAGAGAGGAGCAAAATCTGTCCGATCTACTTTCCCCAATCTGTGAAGTAGCTAATAATATAGAACAAAATGCTCAAGAGAATGAAAATGAAAGCCAAGTTTCAACAGATGAAAGTGAGAACTCCAGGTCTCCTGGAAATAAATCAGATAACATCAAGCCCAAATCTGCTCCATGGAACTCTTTTCTCCCTCCACCACCCCCCATGCCAGGGCCAAGACTGGGACCAGGAAAGCCAGGTCTAAAATTCAATGGCCCACCACCGCCACCGCCACCACCACCACCCCACTTACTATCATGCTGGCTGCCTCCATTTCCTTCTGGACCACCAATAATTCCCCCACCACCTCCCATATGTCCAGATTCTCTTGATGATGCTGATGCTTTGGGAAGTATGTTAATTTCATGGTACATGAGTGGCTATCATACTGGCTATTATATGGGTTTCAGACAAAATCAAAAAGAAGGAAGGTGCTCACATTCCTTAAATTAA
序列2
ATGGCGATGAGCAGCGGCGGCAGTGGTGGCGGCGTCCCGGAGCAGGAGGATTCCGTGCTGTTCCGGCGCGGCACAGGCCAGAGCGATGATTCTGACATTTGGGATGATACAGCACTGATAAAAGCATATGATAAAGCTGTGGCTTCATTTAAGCATGCTCTAAAGAATGGTGACATTTGTGAAACTTCGGGTAAACCAAAAACCACACCTAAAAGAAAACCTGCTAAGAAGAATAAAAGCCAAAAGAAGAATACTGCAGCTTCCTTACAACAGTGGAAAGTTGGGGACAAATGTTCTGCCATTTGGTCAGAAGACGGTTGCATTTACCCAGCTACCATTGCTTCAATTGATTTTAAGAGAGAAACCTGTGTTGTGGTTTACACTGGATATGGAAATAGAGAGGAGCAAAATCTGTCCGATCTACTTTCCCCAATCTGTGAAGTAGCTAATAATATAGAACAGAATGCTCAAGAGAATGAAAATGAAAGCCAAGTTTCAACAGATGAAAGTGAGAACTCCAGGTCTCCTGGAAATAAATCAGATAACATCAAGCCCAAATCTGCTCCATGGAACTCTTTTCTCCCTCCACCACCCCCCATGCCAGGGCCAAGACTGGGACCAGGAAAGCCAGGTCTAAAATTCAATGGCCCACCACCGCCACCGCCACCACCACCACCCCACTTACTATCATGCTGGCTGCCTCCATTTCCTTCTGGACCACCAATAATTCCCCCACCACCTCCCATATGTCCAGATTCTCTTGATGATGCTGATGCTTTGGGAAGTATGTTAATTTCATGGTACATGAGTGGCTATCATACTGGCTATTATATGGAAATGCTGGCATAG
有没有一种 bash 方法可以比较序列并着色或打印差异?
编辑:我想通过Linux工具来比较基因序列。
我需要逐行比较字母并突出显示差异。我认为diff
不擅长比较由数百个字母组成的单个单词。
答案1
您可以将每一行保存为单独的文件,例如
f1 - 第一个基因组序列 f2 - 第二个基因组序列
现在,您需要转换线条(水平到垂直):
awk '{gsub(".","&\n");printf "%s",$0}' < f1 >f1a
awk '{gsub(".","&\n");printf "%s",$0}' < f2 >f2a
这会将新格式保存在 2 个新文件( f1a 和 f2a )中,现在将这 2 个文件与diff
diff -y f1a f2a #will output both lines and show differences
diff -c f1a f2a #will only output the differences and tell you from which line to which line
更多关于差异的信息:http://man7.org/linux/man-pages/man1/diff.1.html
上面的内容也可以制作成一个小脚本,您可以在其中将 2 个基因组序列作为变量传递。
如果您想要彩色输出,请尝试使用colordiff
(在 Ubuntu 中sudo apt-get install colordiff
:)并在下面的脚本中替换diff
为。colordiff
对于并排输出,请使用选项-y
:
#!/bin/bash
# This script will compare the 2 genomic sequences
echo "1st genomic sequence, followed by [ENTER]:"
read gena
echo "2nd genomic sequence, followed by [ENTER]:"
read genb
echo $gena | awk '{gsub(".","&\n");printf "%s",$0}' > /tmp/fa
echo $genb | awk '{gsub(".","&\n");printf "%s",$0}' > /tmp/fb
echo "Insert the diff argument you wish to use (e.g. -y or -c). Please refer to man diff for information. Hit [ENTER]:"
read $arg
diff $arg /tmp/fa /tmp/fb
exit