我有一个集群 fasta 文件(称为 file),如下所示:
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6GH3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
该文件有 4 组:1AB2, 2AC6, 7KZL, and 6GH3
.第一个>1AB2
和第一个期间的内容>2AC6
属于该簇1AB2
。第一个>2AC6
和第一个期间的内容>7KZL
属于该簇2AC6
。
我想在第二个文件中将文件分成 4 个文件,>XXXX
并在此索引文件(ind.txt)中使用特定名称:
HG001 1AB2
HG010 2AC6
HG023 7KZL
HG004 6GH3
结果文件应如下所示:
HG001.fa
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
HG010.fa
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
HG023.fa
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
HG004.fa
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
我尝试使用
awk '/^>/ && NF==1; NR==FNR{a[$2]=$1} (substr($1,2) in a) {close(out); out="cluster/"a[substr($1,2)]".fa"} {print > out}' ind.txt file
但它不起作用,并且我找不到该错误的解决方案。
答案1
mkdir -p cluster &&
awk 'NR==FNR {map[">"$2]="cluster/"$1".fa"; next}
/^>/ && NF==1 {close(out); out=map[$0]; next}
out != "" {print > out}
' ind.txt file
第一个条件操作 ( NR==FNR
) 正在解析索引文件,创建文件名并将它们存储到数组中,其中第二个文件的标头是哈希值。
当找到标头 ( /^>/ && NF==1
) 时,我们定义要使用的输出文件名。
对于任何其他行,我们打印到选定的文件名。此外,我还添加了一个条件"cluster/.fa"
,如果没有此标头的映射,则不打印到文件。
使用示例输入进行测试创建了这些文件:
$ head cluster/*.fa
==> cluster/HG001.fa <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
==> cluster/HG004.fa <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG
==> cluster/HG010.fa <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
==> cluster/HG023.fa <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR