I have a file, 1.file, that looks like this:
>YP_008856774.1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP_008856775.1
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP_008856776.1
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP_008856777.1
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR
I want to rename each header (for example >YP_008856776.1) so the file looks like this:
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR
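First, I used sed -i "s/\_//g" 1.file to delete the _. Or should I delete the last five characters of the header and then add _ and an "order number"? In short, I want to rename the label after each >. The first step is to remove the _; then delete the last five characters of each label (774.1 in the example), then append _ to each label, and finally append a sequence number (for example >YP_008856774.1 to >YP008856774.1 to >YP008856 to >YP008856_ to >YP008856_1). This is beyond my current abilities. Could you help me with this? Thank you.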
Answer 1
Using any awk in any shell on every Unix box:
$ awk '/>/{$0=substr($0,1,3) substr($0,5,6) "_" (++c)} 1' file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR
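To see what the one-liner does step by step, the same logic can be spread over several lines with comments. This is only a sketch: the shortened sample records and the variable names sample, renamed, prefix, digits and n are made up for illustration, and the pattern is anchored as ^> so that only lines starting with > are rewritten.

```shell
# Shortened sample records (hypothetical data, same header layout as the question).
sample='>YP_008856774.1
MHGTRTSAGW
>YP_008856775.1
MGGRGGGGGP'

renamed=$(printf '%s\n' "$sample" | awk '
/^>/ {
    prefix = substr($0, 1, 3)     # ">YP": the marker plus the two letters
    digits = substr($0, 5, 6)     # "008856": six characters after the underscore
    $0 = prefix digits "_" (++n)  # append "_" and a running header count
}
{ print }                         # print every line, rewritten or not
')

printf '%s\n' "$renamed"
```

With this sample the headers become >YP008856_1 and >YP008856_2, and the sequence lines pass through unchanged.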
Answer 2
$ awk '/^>/ { tag = substr($0,1,3) substr($0,5,6); $0 = sprintf("%s_%d", tag, ++count[tag]) }; 1' file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR
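The awk command above rewrites each header line, using specific parts of the original header (characters 1 to 3 and characters 5 to 10, skipping the _ at position 4) as the tag. A separate counter is maintained for each unique tag.
This assumes that the original identifier always has the form XX_NNNNNN, followed by any other text (which is ignored).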
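The per-tag counter only shows its effect when a file mixes several prefixes. The sketch below uses made-up identifiers (the NP_ record is hypothetical) and compares a single global counter with the count[tag] array; the variable names sample, global and per_tag are chosen for illustration only.

```shell
# Two different prefixes, interleaved (hypothetical identifiers).
sample='>YP_008856774.1
AAAA
>NP_123456001.1
CCCC
>YP_008856775.1
GGGG'

# One counter shared by all headers.
global=$(printf '%s\n' "$sample" |
    awk '/^>/ { $0 = substr($0,1,3) substr($0,5,6) "_" (++c) } 1')

# One counter per distinct tag.
per_tag=$(printf '%s\n' "$sample" | awk '
/^>/ {
    tag = substr($0, 1, 3) substr($0, 5, 6)
    $0 = sprintf("%s_%d", tag, ++count[tag])  # numbering restarts for each tag
}
1')

printf 'global counter:\n%s\n\nper-tag counter:\n%s\n' "$global" "$per_tag"
```

With the global counter the headers come out as >YP008856_1, >NP123456_2, >YP008856_3; with the per-tag counter they are >YP008856_1, >NP123456_1, >YP008856_2.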
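You could also use
awk '/^>/ { sub(/_/, ""); sub(/...\..*/, ""); tag = $0; $0 = sprintf("%s_%d", tag, ++count[tag]) }; 1' file
This is slightly more dynamic, as it builds the tag from whatever remains of the original identifier after removing the underscore and then everything from (and including) the group of three characters followed by a dot to the end of the line.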
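As a rough illustration of what "more dynamic" buys, the sketch below feeds the sub()-based variant an identifier with a longer number and a trailing description. The WP_ record and its description are invented for the example, as are the variable names sample and dynamic.

```shell
# Identifiers of different lengths; the second header also carries a description.
sample='>YP_008856774.1
AAAA
>WP_012345678901.2 hypothetical protein
CCCC
>YP_008856775.1
GGGG'

dynamic=$(printf '%s\n' "$sample" | awk '
/^>/ {
    sub(/_/, "")        # drop the first underscore
    sub(/...\..*/, "")  # drop three characters, the dot, and everything after
    tag = $0
    $0 = sprintf("%s_%d", tag, ++count[tag])
}
1')

printf '%s\n' "$dynamic"
```

Here the headers become >YP008856_1, >WP012345678_1 and >YP008856_2: the tag length follows the identifier instead of being fixed at six digits, and the description is discarded.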
Answer 3
Using GNU awk:
$ awk -F_ 'BEGIN {c=1} /^>/{match($2,/(.{6}).*/,a); $2=a[1] FS c++}1' OFS="" input_file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR
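The three-argument form of match() used above is a GNU awk extension. If gawk is not available, the same field-splitting idea can probably be expressed with substr(), which every POSIX awk provides. This is a sketch, not a drop-in guarantee: it assumes each header contains exactly one underscore, and the names sample and portable are made up for the example.

```shell
sample='>YP_008856774.1
MHGTRTSAGW
>YP_008856775.1
MGGRGGGGGP'

# -F_ splits ">YP_008856774.1" into $1=">YP" and $2="008856774.1".
# An empty OFS glues the fields back together without the underscore.
portable=$(printf '%s\n' "$sample" | awk -F_ -v OFS= '
/^>/ {
    $2 = substr($2, 1, 6) FS (++c)  # first six characters, "_", running number
}
1')

printf '%s\n' "$portable"
```

The headers come out as >YP008856_1 and >YP008856_2. Sequence lines are never assigned to, so awk prints them exactly as read.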
Answer 4
I finally figured it out, but it requires the seqkit tool.
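# id is a text file listing one FASTA file name per line
while IFS= read -r i; do
    # Tag: the file name minus its last 8 characters, with underscores removed
    j=$(printf '%s\n' "${i%????????}" | sed 's/_//g')
    echo "$j"
    # Replace each whole header with <tag>_<record number>; the braces around j
    # keep the shell from reading "$j_" as a different (empty) variable
    seqkit replace -p ".+" -r "${j}_{nr}" "$i" > "$i.fa"
done < id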