输入文件:131751_pphA.fasta
>ID:NDNDCOEC_02118 |[Genus species]|strain|PANS_1_2_annot.gbk|pphA|855|NODE_3_length_422941_cov_112.146787422941(422941):170566-171420:1 ^^ Genus species strain strain.|neighbours:ID:NDNDCOEC_02117(1),ID:NDNDCOEC_02119(1)|neighbour_genes:hypothetical protein,ntaA| | aligned:1-284 (284)
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
>ID:KJDCINFB_03194 |[Genus species]|strain|PNA_1_5_annot.gbk|pphA|855|NODE_5_length_527105_cov_93.286545527105(527105):274765-275619:1 ^^ Genus species strain strain.|neighbours:ID:KJDCINFB_03193(1),ID:KJDCINFB_03195(1)|neighbour_genes:hypothetical protein,ntaA| | aligned:1-284 (284)
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
>ID:LBFHNJKP_02554 |[Genus species]|strain|PANS_1_6_annot.gbk|pphA|855|NODE_4_length_527158_cov_95.108790527158(527158):251540-252394:-1 ^^ Genus species strain strain.|neighbours:ID:LBFHNJKP_02553(-1),ID:LBFHNJKP_02555(-1)|neighbour_genes:ntaA,hypothetical protein| | aligned:1-284 (284)
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
>ID:GPMHBDBL_03046 |[Genus species]|strain|PNA_200_2_annot.gbk|pphA_2|855|NODE_4_length_530984_cov_86.347264530984(530984):275036-275890:1 ^^ Genus species strain strain.|neighbours:ID:GPMHBDBL_03045(1),ID:GPMHBDBL_03047(1)|neighbour_genes:hypothetical protein,ntaA| | aligned:1-284 (284)
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
所需的输出:四个单独的输出文件:
PANS_1_2_pphA.fasta
>PANS_1_2_pphA
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
PNA_1_5_pphA.fasta
>PNA_1_5_pphA
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
PANS_1_6_pphA.fasta
>PANS_1_6_pphA
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
PNA_200_2_pphA_2.fasta
>PNA_200_2_pphA_2
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
multifasta 输入文件 ( 131751_pphA.fasta
) 包含四个带标题的 fasta 序列。我想要四个输出文件,它们是单独的 fasta 序列,其名称和标头根据上述菌株命名。例如,输入 fasta 中的标头之一包含应变信息,如|strain|PANS_1_2_annot.gbk|pphA|
。输出文件的名称应为
PANS_1_2_pphA.fasta
,标头应为>PANS_1_2_pphA
。
类似地,其他输出文件与
PNA_1_5_pphA.fasta
header >PNA_1_5_pphA
PANS_1_6_pphA.fasta
with header >PANS_1_6_pphA
PNA_200_2_pphA_2.fasta
with header>PNA_200_2_pphA_2
尝试了以下代码:
awk -F "|" '/^>/ {close(F); ID=$1; gsub("^>", "", ID); F=ID".fasta"} {print >> F}' 123764_pphA.fasta
产生具有以下名称的 fasta 输出文件:
ID:BKKCPFME_02840 .fasta ID:EKPOMJAO_03222 .fasta ID:HEIIBHGJ_01315 .fasta ID:KBMOKBJB_03162 .fasta ID:LECGKDGM_03166 .fasta
答案1
awk -F'|' '
NR%2{ close(fileName); hdr=$4 $5; sub("annot.gbk", "", hdr); fileName=hdr".fasta";
print ">"hdr >fileName; next; };
{ print >fileName; }' infile
答案2
一个非awk
解决方案,假设在给定的示例中,文件必须每两行分割:
# split the file every 2 lines and save in files prefixed FOO.
split -l2 131751_pphA.fasta FOO.
# loop over the files
for f in FOO.*; do
# `awk` and `sed` to get the pattern to use as file name and first line, i.e "PANS_1_2_pphA"
n=$(awk -F'|' '{ print $4$5 }' "$f" | \
sed 's/annot\.gbk//')
# copy the 2nd line of the file into a new file named as pattern+.fasta
sed -n 2p "$f" > "$n.fasta"
# add the pattern in the created file
sed -i "1i>\\$n" "$n.fasta"
# remove the splited files
rm "$f"
done
$ cat PANS_1_2_pphA.fasta
PANS_1_2_pphA
MIKKLIAEKGTLIFIEAHNPLSALIASKAEQTNSEGRIVKFDGIWSSSLTDSASRGIPDNETLALSSRLENIADIRNVTDMPIIMDADTGGKPEHFSYYVKRMINNGVNGVIIEDKTGLKKNSLFGTEVEQTLADINDFSEKIKRGKSAVYIDDFMIIARLESLIAGFDVEHALERADAYVEAGADGIMIHSCKKTPDEVFLFSTKFRKKYPSVPLICVPTTYSATSNRELSEAGFNVIIYANHMLRAAYKAMENVSKEILRYGRTAEIEKSCMSVKEIISLIP
答案3
awk -F '|' '
/^>/ {
close(out)
head = $4
sub("_[^_]*$","_" $5, head)
$0 = ">" head
out = head ".fasta"
} { print >out }' 131751_pphA.fasta
这是相似的到αГsнιn 的回答但允许每个序列多于一行(这在通用 FastA 文件中是可能的)。
我还以稍微不同的方式修剪标题名称,修剪掉FastA 标题行第四个分隔字段中的文件名最后一个 (?)_
之后的所有内容,并添加第五个字段。.gbk
|