I have a file with the following structure:
>Cluster 0
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:1428:2080/1... *
1 51aa, >MG00HS05:520:C8M1TACXX:3:1101:1658:2480/1... at 3:51:1:49/96.08%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:15131:2756/1... at 1:51:1:51/100.00%
[thousands of similarly looking lines]
>Cluster 1
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:3733:2088/1... *
1 50aa, >MG00HS05:520:C8M1TACXX:3:1101:6962:2026/1... at 2:50:1:49/98.00%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:14617:2071/1... at 2:51:1:50/96.08%
[thousands of similarly looking lines]
>Cluster 2
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:5164:2153/1... *
1 51aa, >MG00HS05:520:C8M1TACXX:3:1101:15660:20057/1... at 1:51:1:51/98.04%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:8563:35493/1... at 1:50:1:51/96.08%
[thousands of similarly looking lines]
The number of lines starting with > is about two million.
I want to extract each line starting with > together with the lines that follow it, up to but not including the next line starting with >, and put each such block into its own file. Something like this:
File one:
>Cluster 0
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:1428:2080/1... *
1 51aa, >MG00HS05:520:C8M1TACXX:3:1101:1658:2480/1... at 3:51:1:49/96.08%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:15131:2756/1... at 1:51:1:51/100.00%
[thousands of similarly looking lines]
File two:
>Cluster 1
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:3733:2088/1... *
1 50aa, >MG00HS05:520:C8M1TACXX:3:1101:6962:2026/1... at 2:50:1:49/98.00%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:14617:2071/1... at 2:51:1:50/96.08%
[thousands of similarly looking lines]
File three:
>Cluster 2
0 51aa, >MG00HS05:520:C8M1TACXX:3:1101:5164:2153/1... *
1 51aa, >MG00HS05:520:C8M1TACXX:3:1101:15660:20057/1... at 1:51:1:51/98.04%
2 51aa, >MG00HS05:520:C8M1TACXX:3:1101:8563:35493/1... at 1:50:1:51/96.08%
[thousands of similarly looking lines]
I have written a script that is supposed to do this in bash, but it doesn't work. I am not an expert at bash scripting.
mkdir FemaleMito1_clusters
while read i
do $i > FemaleMito1_clusters/FemaleMito1_${i#>}
n=1
while [ `grep -A $n $i FemaleMito1_cdhit2 | tail -n1 | grep -c "^>"` -eq 0 ]
do grep -A"$n" $i FemaleMito1_cdhit2 | tail -n1 >> FemaleMito1_clusters/FemaleMito1_"${i#>}"
((n++))
done
done < FemaleMito1_cdhit2_list #this is a file containing just the lines starting with >
How can I do this? Feel free to skip my script entirely; there may well be a one-liner that does what I want.
I also have to filter the resulting files and keep only those with more than a certain number of lines. I thought of doing that with a simple wc -l after the files are created, but it would be even better if there were a way to include the filtering in the command itself, without creating useless files.
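To be concrete, the post-hoc filtering I had in mind would be something like the sketch below (the 100-line cutoff is just an example, and FemaleMito1_clusters is the directory my script creates):
# remove any generated cluster file that has fewer than 100 lines
for f in FemaleMito1_clusters/*
do [ "$(wc -l < "$f")" -lt 100 ] && rm -- "$f"
done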
Answer 1
You can do this easily in awk:
awk '{ if(/^>/){name=$0; sub(/^>/,"", name);}{print >> name".fa"}}' file.fa
This iterates over all the lines of the input file. If the first character of a line is a >, the line is saved as name, and the > is then removed from the contents of name, since you don't want it in the file name. Finally, every line is appended to a file called name.fa, where name is whatever the name of the current sequence is.
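One caveat, given that roughly two million lines start with > and each one becomes its own output file: appending without ever closing the files can run into the per-process limit on open files in some awk implementations (GNU awk works around it, at a speed cost). A variant that closes each cluster's file once the next header is reached could look like this untested sketch, using the same logic as above:
awk '
  /^>/ { if (out != "") close(out)     # new header: close the previous cluster file
         out = substr($0, 2) ".fa" }   # file name is the header line without the leading ">"
       { print >> out }                # append every line, header included, to the current file
' file.fa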
If you only want to print those sequences with more than N lines, you can use:
awk -v min=4 '{
if(/^>/){
if(num >= min){
print seq >> name".fa"
}
name=$0;
sub(/^>/,"", name);
seq=$0;
num=0
}
else{
seq = seq"\n"$0;
num++
}
}
END{
if(num >= min){
print seq >> name".fa"
}
}' file.fa
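For instance, to keep only the clusters with at least 1000 member lines (an arbitrary cutoff), only the value of min changes; here is the same logic squeezed onto one line, for reference:
awk -v min=1000 '/^>/ { if (num >= min) print seq >> (name ".fa"); name = substr($0, 2); seq = $0; num = 0; next } { seq = seq "\n" $0; num++ } END { if (num >= min) print seq >> (name ".fa") }' file.fa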
As a general rule, don't use shell loops for text processing. They are slow, clunky, and error-prone.
Answer 2
Although (as you suggest in the comments) there may be bioinformatics tools better suited to your application, this can be done with csplit:
csplit -sz file '/^>/' '{*}'
which gives:
$ head xx*
==> xx00 <==
>Number_one
[some thousands lines]
==> xx01 <==
>Number_two
[some other thousands lines, less than the latter]
==> xx02 <==
>Number_three
[Some other hundreds lines]
See the man page (man csplit) for options controlling the numbering and format of the output file names.
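For example, with GNU csplit you can choose a prefix and a numeric suffix format so the pieces get names other than the default xx00, xx01, ... (the names below are just placeholders):
csplit -sz -f cluster_ -b '%05d.txt' file '/^>/' '{*}'
# should produce cluster_00000.txt, cluster_00001.txt, ... one piece per ">" header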