我有一个集群 fasta 文件(称为 file),如下所示:
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6HG3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
about 文件有 4 组:1AB2, 2AC6, 7KZL, and 6GH3
.第一个>1AB2
和第一个期间的内容>2AC6
属于该簇1AB2
。第一个>2AC6
和第一个期间的内容>7KZL
属于该簇2AC6
。
我想在第二个文件分成4个文件>XXXX
。每个文件应如下所示:
文件_1
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
文件_2
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
文件_3
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
文件_4
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
答案1
awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file
根据您的测试输入,您要更改输出文件的标头定义为:以一个字段开头>
且只有一个字段的行。使用next
我们对此行不打印任何内容,但设置输出文件名。此外,close()
调用还可确保我们不会打开太多文件,否则awk
可能会引发错误。
输出:
$ head file_*
==> file_1 <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
==> file_2 <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
==> file_3 <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
==> file_4 <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
thanasis@basis:~/Documents/development/temp>
```
答案2
您可以使用csplit
:
csplit --prefix file_ --elide-empty-files --suppress-matched file '/^>....$/' '{*}'
它创建 4 个文件,file_00
以_03
您需要的内容命名。
答案3
使用awk+sed
组合:
awk -v f="wfile_" '
/^>/ && length==5 {
if (a++) print p, ",", NR-1, f a-1
p=NR+1
}
END {print p, ",$" f a}' < file |
split -l 10
for f in x*; do
sed -nf "$f" file
done
我们使用 awk 来确定块启动器的行号/^>.{4}$/
,然后构建适当的 sed 代码