awk 将一个文件分割成多个文件,并在另一个索引文件中指定名称

awk 将一个文件分割成多个文件,并在另一个索引文件中指定名称

我有一个集群 fasta 文件(称为 file),如下所示:

>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6GH3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

该文件有 4 组:1AB2, 2AC6, 7KZL, and 6GH3.第一个>1AB2和第一个期间的内容>2AC6属于该簇1AB2。第一个>2AC6和第一个期间的内容>7KZL属于该簇2AC6

我想在第二个文件中将文件分成 4 个文件,>XXXX并在此索引文件(ind.txt)中使用特定名称:

HG001 1AB2
HG010 2AC6
HG023 7KZL
HG004 6GH3

结果文件应如下所示:

HG001.fa

>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN

HG010.fa

>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO

HG023.fa

>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

HG004.fa

>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

我尝试使用

awk '/^>/ && NF==1; NR==FNR{a[$2]=$1} (substr($1,2) in a) {close(out); out="cluster/"a[substr($1,2)]".fa"}  {print > out}' ind.txt file

但它不起作用,并且我找不到该错误的解决方案。

答案1

mkdir -p cluster &&
awk 'NR==FNR  {map[">"$2]="cluster/"$1".fa"; next}
     /^>/ && NF==1 {close(out); out=map[$0]; next} 
     out != "" {print > out}
' ind.txt file

第一个条件操作 ( NR==FNR) 正在解析索引文件,创建文件名并将它们存储到数组中,其中第二个文件的标头是哈希值。

当找到标头 ( /^>/ && NF==1) 时,我们定义要使用的输出文件名。

对于任何其他行,我们打印到选定的文件名。此外,我还添加了一个条件"cluster/.fa",如果没有此标头的映射,则不打印到文件。

使用示例输入进行测试创建了这些文件:

$ head cluster/*.fa
==> cluster/HG001.fa <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA

==> cluster/HG004.fa <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

==> cluster/HG010.fa <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

==> cluster/HG023.fa <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

相关内容