Remove multiple patterns from FASTA header lines

I need to change headers that follow this pattern:

>UniRef90_Q57KY8 Total protein n=182 Tax=GammaproteobacteriaTaxID=1236 RepID=Q57KY8_SALCH
MKKQLIRTLTASILLMSTSVLAQEAPSRTECIAPAKPGGGFDLTYKLIQVSLLETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPGEPGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKTLKDLMTAMEKDPNSVVIGAGASI
GSQDWMKSALLAQKANVDPHKMRYVAFEGGGEPVTALMGNHVQVVSGDLSEMVPYLGGDKIRVLAVFSENRLPGQLANVP
TAKEQGYDLVWPIIRGFYVGPKVSDADYQWWVDTFKKLQQTDEFKKQRDLRGLFEFDMTGQQLDDYVKKQVTDYREQAKAFGLAK
>UniRef90_G8LKQ2 UPF5341 protein yflP n=80 Tax=Bacteria TaxID=2 RepID=G8LKQ2_ENTCL
MKKQLLSTLAASVLMISASVVQAQDAPSRTECIAPAKPGGGFDLTCKLIQVSMLETGAIAKPMRVTYMPGGVGAVAYNAI
VAQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLATVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGAS
IGSQDWMKAALLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLNGDKIRVLAVFSENRLPGQLANV
PTAKEQGYDLVWPIIRGFFVGPKVTDAEYQWWVETFNKLQQTEAFKKQRDLRGLFEFNLSGKPLDEYVKKQVNDYREQAKAFGLAK
>UniRef90_E3GB58 Uncharacterized protein n=1 Tax=Enterobacter lignolyticus (strain SCF1) TaxID=701347 RepID=E3GB58_ENTLS
MKKTLLQTVIATALLMSTAAFAVEAPGRTECIAPAKPGGGFDLTCKLIQVSLQETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGASI
GSQDWMKAAKLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLQGDKIRVLAVFAENRLPGQLANVP
TAKEQGYDLVWPIIRGFYLGPKVSDDEYNWWVETFQKLQQTDEFKKQRELRGLFEFNMNGKALDEYVKKQVTDYREQAKSFGLAK

into something like this:

>Q57KY8_Gammaproteobacteria
MKKQLIRTLTASILLMSTSVLAQEAPSRTECIAPAKPGGGFDLTYKLIQVSLLETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPGEPGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKTLKDLMTAMEKDPNSVVIGAGASI
GSQDWMKSALLAQKANVDPHKMRYVAFEGGGEPVTALMGNHVQVVSGDLSEMVPYLGGDKIRVLAVFSENRLPGQLANVP
TAKEQGYDLVWPIIRGFYVGPKVSDADYQWWVDTFKKLQQTDEFKKQRDLRGLFEFDMTGQQLDDYVKKQVTDYREQAKAFGLAK
>G8LKQ2_Bacteria
MKKQLLSTLAASVLMISASVVQAQDAPSRTECIAPAKPGGGFDLTCKLIQVSMLETGAIAKPMRVTYMPGGVGAVAYNAI
VAQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLATVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGAS
IGSQDWMKAALLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLNGDKIRVLAVFSENRLPGQLANV
PTAKEQGYDLVWPIIRGFFVGPKVTDAEYQWWVETFNKLQQTEAFKKQRDLRGLFEFNLSGKPLDEYVKKQVNDYREQAKAFGLAK
>E3GB58_Enterobacter lignolyticus (strain SCF1) 
MKKTLLQTVIATALLMSTAAFAVEAPGRTECIAPAKPGGGFDLTCKLIQVSLQETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGASI
GSQDWMKAAKLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLQGDKIRVLAVFAENRLPGQLANVP
TAKEQGYDLVWPIIRGFYLGPKVSDDEYNWWVETFQKLQQTDEFKKQRELRGLFEFNMNGKALDEYVKKQVTDYREQAKSFGLAK

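That is: delete the database name at the start, keep the accession code that follows it, and then append an underscore followed by the taxon name (the value of Tax=).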

Answer 1

You can use the following Perl one-liner:

perl -ne 'if($_=~/^>/){($id,$tax)=$_=~/UniRef90_(\S+).*Tax=(.*)TaxID/; print ">",$id,"_",$tax,"\n";}else{print $_;}' input.fa > output.fa

This reads input.fa, rewrites the FASTA headers, and writes the result to output.fa.


Explanation of the command:

perl -ne '                                          #call perl and read the file line-wise
  if($_=~/^>/){                                     #check if the line is a header
    ($id,$tax)=$_=~/UniRef90_(\S+).*Tax=(.*)TaxID/; #extract the ID and the tax string
    print ">",$id,"_",$tax,"\n";}                   #print the new header 
  else{                                             #print the sequence (not a header line)
    print $_;}
' input.fa > output.fa

Answer 2

You could use:

$ sed -r '/^>/ s/^>[^_]+_([^ ]+) .* Tax=(.*)TaxID=.*/>\1_\2/' file
>Q57KY8_Gammaproteobacteria
MKKQLIRTLTASILLMSTSVLAQEAPSRTECIAPAKPGGGFDLTYKLIQVSLLETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPGEPGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKTLKDLMTAMEKDPNSVVIGAGASI
GSQDWMKSALLAQKANVDPHKMRYVAFEGGGEPVTALMGNHVQVVSGDLSEMVPYLGGDKIRVLAVFSENRLPGQLANVP
TAKEQGYDLVWPIIRGFYVGPKVSDADYQWWVDTFKKLQQTDEFKKQRDLRGLFEFDMTGQQLDDYVKKQVTDYREQAKAFGLAK
>G8LKQ2_Bacteria 
MKKQLLSTLAASVLMISASVVQAQDAPSRTECIAPAKPGGGFDLTCKLIQVSMLETGAIAKPMRVTYMPGGVGAVAYNAI
VAQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLATVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGAS
IGSQDWMKAALLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLNGDKIRVLAVFSENRLPGQLANV
PTAKEQGYDLVWPIIRGFFVGPKVTDAEYQWWVETFNKLQQTEAFKKQRDLRGLFEFNLSGKPLDEYVKKQVNDYREQAKAFGLAK
>E3GB58_Enterobacter lignolyticus (strain SCF1) 
MKKTLLQTVIATALLMSTAAFAVEAPGRTECIAPAKPGGGFDLTCKLIQVSLQETGAIEKPMRVTYMPGGVGAVAYNAIV
AQRPAEAGTVVAFSGGSLLNLSQGKFGRYGVDDVRWLASVGTDYGMIAVRADSPWKSLKDLLTAMEKDPNSVVIGAGASI
GSQDWMKAAKLAQQAKVDPHKMRYVAFEGGGEPVTALMGNHVQAVSGDLSEMVPYLQGDKIRVLAVFAENRLPGQLANVP
TAKEQGYDLVWPIIRGFYLGPKVSDDEYNWWVETFQKLQQTDEFKKQRELRGLFEFNMNGKALDEYVKKQVTDYREQAKSFGLAK

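This relies on the first piece of text you want being the first thing after the first underscore (_). The output may also keep a trailing space after the Tax name: your file is inconsistent about whether there is a space before TaxID, so it is hard to avoid capturing it. If that matters, we can add an extra s command at the end to remove it, s/(.*)\s+/\1/, which makes the complete command: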

sed -r '/^>/ s/^>[^_]+_([^ ]+) .* Tax=(.*)TaxID=.*/>\1_\2/;s/(.*)\s+/\1/' file
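
One caveat: s/(.*)\s+/\1/ is not anchored to the end of the line. It works on the sample because every header that contains spaces also ends in one, but on a header with internal spaces and no trailing space it would delete the last internal space instead. The sketch below anchors the cleanup with $ and restricts it to header lines. It assumes GNU sed, and the wrapper name rename_headers_sed is hypothetical.

```shell
# Hypothetical wrapper around the sed command, reading FASTA from stdin.
# Assumes GNU sed (-r); on BSD/macOS sed, -E is the equivalent flag.
rename_headers_sed() {
  sed -r '/^>/ {
    s/^>[^_]+_([^ ]+) .* Tax=(.*)TaxID=.*/>\1_\2/
    s/[[:space:]]+$//
  }'
}

# Try it on a header whose taxon name contains spaces.
printf '%s\n' \
  '>UniRef90_E3GB58 Uncharacterized protein n=1 Tax=Enterobacter lignolyticus (strain SCF1) TaxID=701347 RepID=E3GB58_ENTLS' \
  'MKKTLLQTVIATALL' | rename_headers_sed
```

Both substitutions sit inside the /^>/ { ... } block, so sequence lines are never touched.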

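Notes

  • -r use ERE (extended regular expressions)
  • /^>/ find lines starting with >
  • s/old/new/ on those lines, replace old with new
  • [^_]+ one or more characters that are not _
  • (some chars) save some chars so they can be referenced later as \1, \2, etc.
  • .* any number of any characters
  • ; separates commands, as in the shell
  • \s+ one or more whitespace characters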
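
To see what the two groups actually capture before rewriting a whole file, a preview along these lines may help. It is only a sketch, assuming GNU sed; the helper name preview_groups and the bracketed output format are made up for illustration. The brackets make a captured trailing space visible.

```shell
# Hypothetical helper: show what \1 and \2 capture for each header.
# -n suppresses normal output; the trailing p prints only lines where
# the substitution succeeded, so sequence lines are skipped.
preview_groups() {
  sed -rn '/^>/ s/^>[^_]+_([^ ]+) .* Tax=(.*)TaxID=.*/id=[\1] tax=[\2]/p'
}

printf '%s\n' \
  '>UniRef90_Q57KY8 Total protein n=182 Tax=GammaproteobacteriaTaxID=1236 RepID=Q57KY8_SALCH' \
  '>UniRef90_G8LKQ2 UPF5341 protein yflP n=80 Tax=Bacteria TaxID=2 RepID=G8LKQ2_ENTCL' \
  'MKKQLLSTLAAS' | preview_groups
# id=[Q57KY8] tax=[Gammaproteobacteria]
# id=[G8LKQ2] tax=[Bacteria ]
```

The second line shows the space captured before TaxID, which is the trailing space discussed above.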
