我有一个FASTA 格式文件的名称位于“>”符号之后,后跟带有序列的第二行。我想从名称(第一行)中过滤 refseqID,并删除浮动值。完整文件有许多名称/序列。
我可以使用以下命令从标签中删除“hg38_ncbiRefSeq_”:
sed 's/hg38_ncbiRefSeq_//g' file
原始格式
>hg38_ncbiRefSeq_NM_001276352.2 range=chr1:67093580-67127240 5'pad=0 3'pad=0 strand=- repeatMasking=none
MAEKILEKLDVLDKQAEIILARRTKINRLQSEGRKTTMAIPLTFDFQLEFEEALATSASKAISKIKEDKSCSITKSKMHVSFKCEPEPRKSNFEKSNLRPFFIQTNVKNKESESTEPVEEHLKSRSIRPYLYLKDTTEMENAGPLNVLYSQHRQACRRSLGSTDFSPMFNIQSNAHKKEKDSTLFTAQIEKKPRKPLDSVGLLEGDRNKRNKRTQIP
>hg38_ncbiRefSeq_NM_001276351.2 range=chr1:67093005-67127240 5'pad=0 3'pad=0 strand=- repeatMasking=none
MAEKILEKLDVLDKQAEIILARRTKINRLQSEGRKTTMAIPLTFDFQLEFEEALATSASKAISKIKEDKSCSITKSKMHVSFKCEPEPRKSNFEKSNLRPFFIQTNVKNKESESTAQIEKKPRKPLDSVGLLEGDRNKRKKSPQMNDFNIKENKSVRNYQLSKYRSVRKKSLLPLCFEDELKNPHAKIVNVSPTKTVTSHMEQKDTNPIIFHDTEYVRMLLLTKNRFSSHPLENENIYPHKRTNFILERNCEILKSIIGNQSISLFKPQKTMPTVQRKDIQIPMSFKAGHTTVDDKLKKKTNKQTLENRSWNTLYNFSQNFSSLTKQFVGYLDKAVIHEMSAQTGKFERMFSAGKPTSIPTSSALPVKCYSKPFKYIYELNNVTPLDNLLNLSNEILNAS
最终格式
>NM_001276352
MAEKILEKLDVLDKQAEIILARRTKINRLQSEGRKTTMAIPLTFDFQLEFEEALATSASKAISKIKEDKSCSITKSKMHVSFKCEPEPRKSNFEKSNLRPFFIQTNVKNKESESTEPVEEHLKSRSIRPYLYLKDTTEMENAGPLNVLYSQHRQACRRSLGSTDFSPMFNIQSNAHKKEKDSTLFTAQIEKKPRKPLDSVGLLEGDRNKRNKRTQIP
>NM_001276351
MAEKILEKLDVLDKQAEIILARRTKINRLQSEGRKTTMAIPLTFDFQLEFEEALATSASKAISKIKEDKSCSITKSKMHVSFKCEPEPRKSNFEKSNLRPFFIQTNVKNKESESTAQIEKKPRKPLDSVGLLEGDRNKRKKSPQMNDFNIKENKSVRNYQLSKYRSVRKKSLLPLCFEDELKNPHAKIVNVSPTKTVTSHMEQKDTNPIIFHDTEYVRMLLLTKNRFSSHPLENENIYPHKRTNFILERNCEILKSIIGNQSISLFKPQKTMPTVQRKDIQIPMSFKAGHTTVDDKLKKKTNKQTLENRSWNTLYNFSQNFSSLTKQFVGYLDKAVIHEMSAQTGKFERMFSAGKPTSIPTSSALPVKCYSKPFKYIYELNNVTPLDNLLNLSNEILNAS
答案1
您可以尝试类似以下操作awk
:
awk -F'[ _.]' '{if ($0~"^>") print ">"$3"_"$4; else print $0}' input_file
您可以按照之前评论中的建议使用较短的形式埃德·莫顿:
awk -F'[ _.]' '{print (/^>/ ? ">"$3"_"$4 : $0)}' input_file