awk/sed: splitting a cluster file into multiple files

I have a clustered FASTA file (called file) that looks like this:

>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6HG3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

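The file above has 4 clusters: 1AB2, 2AC6, 7KZL, and 6GH3. Everything between the first >1AB2 and the first >2AC6 belongs to cluster 1AB2. Everything between the first >2AC6 and the first >7KZL belongs to cluster 2AC6.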

I want to split the file into 4 files, cutting at the second >XXXX line of each cluster. Each file should look like this:

file_1

>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN

file_2

>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO

file_3

>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

file_4

>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

Answer 1

awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file

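Going by your test input, the header that marks a change of output file is defined as a line that starts with > and has only one field. For such a line, next means we print nothing; we only set the output file name. The close() call also makes sure we don't have too many files open at once, which could otherwise cause awk to raise an error.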


Output:

$ head file_*
==> file_1 <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA

==> file_2 <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

==> file_3 <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

==> file_4 <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
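
Note that head prints only the first 10 lines of each file by default, so file_1 and file_2 look cut off in the listing above. A quick way to confirm the split is to count the header lines in each output file. The following is a minimal sketch on a shortened sample; sample.fa and its contents are made up for illustration, and the awk command is the one from this answer:

```shell
# Work in a scratch directory so the generated file_N outputs do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample in the same layout as the question (hypothetical data).
cat > sample.fa <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIB
>1AB3 AA
WNIOREHUEBRG
>2AC6
>2AC6 AA
NFIGEURHGEIR
EOF

# Same command as in the answer: a one-field ">" line starts a new output file.
awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' sample.fa

# Count the sequence headers that ended up in each output file.
grep -c '^>' file_*
```

For this sample, grep should report 2 headers in file_1 and 1 header in file_2.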

Answer 2

You can use csplit:

csplit --prefix file_ --elide-empty-files --suppress-matched file '/^>....$/' '{*}'

This creates 4 files, named file_00 through file_03, with the content you need.
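
To make the role of each option visible, here is a small sketch on a shortened sample. It assumes GNU csplit, since --suppress-matched is a GNU coreutils extension; sample.fa and its contents are made up for illustration, and --quiet is added only to silence the per-file byte counts:

```shell
# Work in a scratch directory so the generated pieces do not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample in the same layout as the question (hypothetical data).
cat > sample.fa <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIB
>1AB3 AA
WNIOREHUEBRG
>2AC6
>2AC6 AA
NFIGEURHGEIR
EOF

# '/^>....$/'          split before every line that is ">" plus exactly four characters
# '{*}'                repeat that pattern as many times as possible
# --suppress-matched   drop the matched cluster-marker line itself
# --elide-empty-files  do not keep the empty piece in front of the first marker
# --quiet              do not print the size of each piece
csplit --quiet --prefix file_ --elide-empty-files --suppress-matched \
  sample.fa '/^>....$/' '{*}'

# List the pieces that were written.
ls file_*
```

With this sample, two pieces are expected: file_00 holding the 1AB2 cluster and file_01 holding the 2AC6 cluster.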

Answer 3

Using a combination of awk and sed:

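# awk emits one sed "w" (write to file) command per cluster.
awk -v f="wfile_" '
/^>/ && length==5 {                    # cluster marker: ">" plus 4 characters
  if (a++) print p, ",", NR-1, f a-1   # range of the previous cluster
  p=NR+1                               # next cluster starts on the following line
}
END {print p, ",$" f a}' < file |      # the last cluster runs to end of file
split -l 10                            # at most 10 commands per script (xaa, xab, ...)
for f in x*; do
sed -nf "$f" file                      # run each generated script against the input
done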

We use awk to find the line numbers of the block starters (/^>.{4}$/) and then build the appropriate sed code.
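
To see what "the appropriate sed code" looks like, the awk stage can be run on its own and its output inspected. This is a sketch on a shortened sample; sample.fa, its contents, and the script.sed file name are made up for illustration, and the awk program is the one from this answer:

```shell
# Work in a scratch directory so nothing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample in the same layout as the question (hypothetical data).
cat > sample.fa <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIB
>1AB3 AA
WNIOREHUEBRG
>2AC6
>2AC6 AA
NFIGEURHGEIR
EOF

# Same awk stage as in the answer, saved to a file instead of piped to split.
awk -v f="wfile_" '
/^>/ && length==5 {
  if (a++) print p, ",", NR-1, f a-1
  p=NR+1
}
END {print p, ",$" f a}' < sample.fa > script.sed

# Show the generated sed commands.
cat script.sed
# 2 , 5 wfile_1
# 7 ,$wfile_2
```

Each generated line is a line-number range followed by sed's w command, so running it with sed -n writes lines 2 to 5 to one file and line 7 to the end to another. The target names come out as file_1 and file_2, assuming the sed in use accepts a file name directly after w and spaces around the comma, as GNU sed does.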
