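I'm working with short strings of about 30 characters (they are DNA sequences). For my purposes, every 5th position needs to be replaced with each of the 4 DNA bases (A, C, T, G). For example, given the input AAAAAAAAAAAAAA,
the output would be the following list: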
AAAAAAAAAAAAAA
AAAACAAAAAAAAA
AAAATAAAAAAAAA
AAAAGAAAAAAAAA
AAAACAAAACAAAA
AAAACAAAATAAAA
....
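That is, every 5th position is swapped for A, C, T, or G in turn, producing an array of all possible sequences in which each 5th position takes every possible DNA base.
I've been trying to use for loops and can edit each 5th position, but not in a combinatorial way.
For example: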
echo "AAAAAAAAAAAAAAA" > one.spacer
for i in $(seq 1 3)
do
for base in {a,c,t,g}
do
awk -v b=$base -v x=$i '{print substr ($0,1,5*x-1) b substr ($0,5*x+1,100)}' one.spacer
done
done
gives the output:
AAAAaAAAAAAAAAA
AAAAcAAAAAAAAAA
AAAAtAAAAAAAAAA
AAAAgAAAAAAAAAA
AAAAAAAAAaAAAAA
AAAAAAAAAcAAAAA
AAAAAAAAAtAAAAA
AAAAAAAAAgAAAAA
AAAAAAAAAAAAAAa
AAAAAAAAAAAAAAc
AAAAAAAAAAAAAAt
AAAAAAAAAAAAAAg
But hopefully you can see that it only edits each 5th position individually. I need the list of sequences to include, for example,
AAAAgAAAAgAAAAg
AAAAcAAAAtAAAAa
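and all the other combinations. I hope this is a bit clearer.
Answer 1
This will run in under a second even for real 30-character-wide input, using any awk in any shell on every Unix box: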
$ cat tst.awk
function mutate(old,lgth, new,i,j) {
for (i=5; i<=lgth; i+=5) {
for (j=1; j<=4; j++) {
new = substr(old,1,i-1) substr("ACTG",j,1) substr(old,i+1)
if ( !seen[new]++ ) {
print new
mutate(new,lgth)
}
}
}
}
{ mutate($0,length($0)) }
$ echo 'AAAAAAAAAAAAAAA' | awk -f tst.awk
AAAAAAAAAAAAAAA
AAAACAAAAAAAAAA
AAAATAAAAAAAAAA
AAAAGAAAAAAAAAA
AAAAGAAAACAAAAA
AAAAAAAAACAAAAA
AAAACAAAACAAAAA
AAAATAAAACAAAAA
AAAATAAAATAAAAA
AAAAAAAAATAAAAA
AAAACAAAATAAAAA
AAAAGAAAATAAAAA
AAAAGAAAAGAAAAA
AAAAAAAAAGAAAAA
AAAACAAAAGAAAAA
AAAATAAAAGAAAAA
AAAATAAAAGAAAAC
AAAAAAAAAGAAAAC
AAAACAAAAGAAAAC
AAAAGAAAAGAAAAC
AAAAGAAAAAAAAAC
AAAAAAAAAAAAAAC
AAAACAAAAAAAAAC
AAAATAAAAAAAAAC
AAAATAAAACAAAAC
AAAAAAAAACAAAAC
AAAACAAAACAAAAC
AAAAGAAAACAAAAC
AAAAGAAAATAAAAC
AAAAAAAAATAAAAC
AAAACAAAATAAAAC
AAAATAAAATAAAAC
AAAATAAAATAAAAT
AAAAAAAAATAAAAT
AAAACAAAATAAAAT
AAAAGAAAATAAAAT
AAAAGAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAACAAAAAAAAAT
AAAATAAAAAAAAAT
AAAATAAAACAAAAT
AAAAAAAAACAAAAT
AAAACAAAACAAAAT
AAAAGAAAACAAAAT
AAAAGAAAAGAAAAT
AAAAAAAAAGAAAAT
AAAACAAAAGAAAAT
AAAATAAAAGAAAAT
AAAATAAAAGAAAAG
AAAAAAAAAGAAAAG
AAAACAAAAGAAAAG
AAAAGAAAAGAAAAG
AAAAGAAAAAAAAAG
AAAAAAAAAAAAAAG
AAAACAAAAAAAAAG
AAAATAAAAAAAAAG
AAAATAAAACAAAAG
AAAAAAAAACAAAAG
AAAACAAAACAAAAG
AAAAGAAAACAAAAG
AAAAGAAAATAAAAG
AAAAAAAAATAAAAG
AAAACAAAATAAAAG
AAAATAAAATAAAAG
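For readers more comfortable with Python, the depth-first search in tst.awk can be sketched as below. This is an illustrative port, not part of the original answer: the `step` and `bases` parameters, the `neighbours` helper, and the returned list are assumptions added here, and awk's recursion is swapped for an explicit stack of generators because the search can nest deeply and Python's default recursion limit is modest. It also makes the output size easy to check: a string with k editable positions yields 4^k sequences, so 15 characters give 4^3 = 64 lines and 30 characters give 4^6 = 4096.

```python
def mutate(seq, step=5, bases="ACTG"):
    """Return every variant of seq in which each step-th character
    (1-based positions step, 2*step, ...) is replaced by a base."""
    seen = set()
    out = []

    def neighbours(s):
        # All strings that differ from s by one edit at an editable position.
        for i in range(step - 1, len(s), step):
            for b in bases:
                yield s[:i] + b + s[i + 1:]

    # Explicit stack of generators instead of recursive calls.
    stack = [neighbours(seq)]
    while stack:
        for new in stack[-1]:
            if new not in seen:          # same role as !seen[new]++ in awk
                seen.add(new)
                out.append(new)
                stack.append(neighbours(new))
                break                    # descend into the new sequence first
        else:
            stack.pop()                  # this level is exhausted, go back up
    return out


variants = mutate("AAAAAAAAAAAAAAA")
print(len(variants))   # 64
print(variants[:4])
```

With the 15-character input above, the first four entries are the same first four lines that the awk script prints.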
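Answer 2
This departs a long way from what one would consider good shell coding practice, is probably inefficient, and won't scale well to large inputs, but for the sake of brevity, using the ksh93 shell and assuming the default value of $IFS, you can do: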
words=($(<your-file))
printf '%s\n' ${words[@]//{4}(?)?/\1{A,C,T,G}}
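With ${var//pattern/replacement}, we replace each sequence of 4 characters + 1 with those 4 characters followed by {A,C,T,G}, which in ksh ends up being expanded, csh brace-expansion style, when the unquoted parameter expansion is evaluated.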
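To make the substitute-then-expand idea easier to follow outside ksh93, here is a minimal Python sketch of the same two steps. It is an analogy, not a translation of the one-liner: the function name `expand` and its `bases` parameter are hypothetical, and `itertools.product` stands in for the shell's brace expansion.

```python
import itertools
import re


def expand(word, bases="ACTG"):
    """Keep each group of 4 characters, replace the character after it
    with every base, and yield all resulting combinations."""
    # Every "4 chars + 1 char" group contributes its 4-char prefix.
    prefixes = re.findall(r"(.{4}).", word)
    # Characters left over after the last full group are kept verbatim.
    tail = word[5 * len(prefixes):]
    # One base per group, in every possible combination.
    for combo in itertools.product(bases, repeat=len(prefixes)):
        yield "".join(p + b for p, b in zip(prefixes, combo)) + tail


for s in expand("AAAAAAAAAA"):
    print(s)
```

For the 10-character input this prints 16 lines, one per pair of bases at positions 5 and 10.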
Answer 3
Python's itertools module has many functions for handling this kind of combinatorial problem.
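python3 - <<\eof
import itertools as it
dna = 'atcg'
step = 5
with open('yourfile') as f:
    for _ in f:
        l = _.rstrip('\n')
        w = len(l)
        # 0-based indices of every 5th character: 4, 9, 14, ...
        I = [i for i in range(step-1,w,step)]
        # one tuple per combination, holding one base per index in I
        for t1 in it.product(dna,repeat=int(w/step)):
            t = list(t1)[::-1]
            # take the next base at each index in I, keep other characters
            print(*[
                t.pop(0) if idx in I else e
                for idx,e in enumerate(l)],sep="")
eof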
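- From the itertools module, the product function generates the Cartesian product of its input iterables, in our case the string of DNA bases repeated once for each position to be replaced.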
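- The product iterator is created afresh for each line read from the input file, so every line gets its own complete set of combinations; within each combination, the bases are consumed one at a time at the indices listed in I.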
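As a small illustration of what product yields and how each tuple is slotted into the line, consider the sketch below. The function name `variants` and its parameters are hypothetical, and a two-letter alphabet is used only to keep the printed output short. It produces the same set of lines as the approach described above for a given input, though not necessarily in the same order.

```python
import itertools


def variants(line, step=5, bases="atcg"):
    """Yield line with every step-th character replaced by each base,
    in all combinations."""
    # 0-based indices of the characters to replace: step-1, 2*step-1, ...
    positions = range(step - 1, len(line), step)
    for combo in itertools.product(bases, repeat=len(positions)):
        chars = list(line)
        for pos, base in zip(positions, combo):
            chars[pos] = base
        yield "".join(chars)


print(list(itertools.product("ac", repeat=2)))
# [('a', 'a'), ('a', 'c'), ('c', 'a'), ('c', 'c')]
print(list(variants("AAAAAAAAAA", bases="ac")))
# ['AAAAaAAAAa', 'AAAAaAAAAc', 'AAAAcAAAAa', 'AAAAcAAAAc']
```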