The following file is tab-delimited. I am trying to remove the number after the space in the first column (NbLab330C00 64506568) so that only NbLab330C00 remains.
$ head LAB330_TE_annotation.gff3
##gff-version 3
##date Sun Feb 14 08:41:36 UTC 2021
##Identity: Sequence identity (0-1) between the library sequence and the target region.
##ltr_identity: Sequence identity (0-1) between the left and right LTR regions.
##tsd: target site duplication.
##seqid source sequence_ontology start end score strand phase attributes
NbLab330C00 64506568 EDTA Gypsy_LTR_retrotransposon 2 3364 20798 - . ID=TE_homo_0;Name=TE_00007365_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.868;Method=homology
NbLab330C00 64506568 EDTA Gypsy_LTR_retrotransposon 3367 4198 3385 - . ID=TE_homo_1;Name=TE_00008087_LTR;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.865;Method=homology
NbLab330C00 64506568 EDTA hAT_TIR_transposon 4424 4715 1278 + . ID=TE_homo_2;Name=TE_00003964;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.834;Method=homology
NbLab330C00 64506568 EDTA hAT_TIR_transposon 5236 5453 835 + . ID=TE_homo_3;Name=TE_00001425;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.828;Method=homology
I tried the following awk command, but it also cuts off the last column.
$ awk -v OFS='\t' '{print $1,$3,$4,$5,$7,$8,$9}' LAB330_TE_annotation.gff3 > LAB330_TE_annotation.fix.gff3
(base) ubuntu@ip-10-23-2-113:/efs/apollo/LAB330$ head LAB330_TE_annotation.fix.gff3
##gff-version
##date Feb 14 08:41:36 2021
##Identity: identity (0-1) between library sequence and
##ltr_identity: identity (0-1) between left and right
##tsd: site duplication.
##seqid sequence_ontology start end strand phase attributes
NbLab330C00 EDTA Gypsy_LTR_retrotransposon 2 20798 - .
NbLab330C00 EDTA Gypsy_LTR_retrotransposon 3367 3385 - .
NbLab330C00 EDTA hAT_TIR_transposon 4424 1278 + .
NbLab330C00 EDTA hAT_TIR_transposon 5236 835 + .
How can I fix the above command?
Thank you in advance.
Answer 1
awk 'BEGIN{ OFS=FS="\t" }
!/^#/{ sub(/ [0-9]+$/, "", $1) }
1
' LAB330_TE_annotation.gff3 > LAB330_TE_annotation.fix.gff3
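This leaves the header lines, which start with #, unmodified. On every other line it replaces a space followed by at least one digit at the end of the first field with the empty string.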
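The command in the question fails because awk, by default, splits on any run of whitespace. The space inside the first column therefore creates an extra field, and the print list names only seven fields, so the end and attributes columns are dropped. A minimal sketch of the fix on a two-line sample follows; the sample layout (space inside field 1, tabs between fields) is an assumption based on the question's description, and the attributes are shortened for readability.

```shell
# Build a small sample: one header line plus one tab-delimited record
# whose first field is "ID<space>number" (assumed layout, shortened attributes).
sample=$(printf '%s\n' '##gff-version 3'; printf 'NbLab330C00 64506568\tEDTA\thAT_TIR_transposon\t4424\t4715\t1278\t+\t.\tID=TE_homo_2;Name=TE_00003964\n')

# Same awk program as in the answer, reading the sample from stdin.
# FS and OFS are both a tab, so the space stays inside $1 until sub() removes it.
result=$(printf '%s\n' "$sample" | awk 'BEGIN{ OFS=FS="\t" }
!/^#/{ sub(/ [0-9]+$/, "", $1) }
1')

printf '%s\n' "$result"
```

The header line passes through untouched, and the record keeps all nine tab-separated columns with the number stripped from the first one.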
Answer 2
You can use cut to remove the second column. The default delimiter is a tab, so you do not need to specify the -d switch.
$ cut -f 1,3- LAB330_TE_annotation.gff3
##gff-version 3
##date Sun Feb 14 08:41:36 UTC 2021
##Identity: Sequence identity (0-1) between the library sequence and the target region.
##ltr_identity: Sequence identity (0-1) between the left and right LTR regions.
##tsd: target site duplication.
##seqid source sequence_ontology start end score strand phase attributes
NbLab330C00 EDTA Gypsy_LTR_retrotransposon 2 3364 20798 - . ID=TE_homo_0;Name=TE_00007365_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.868;Method=homology
NbLab330C00 EDTA Gypsy_LTR_retrotransposon 3367 4198 3385 - . ID=TE_homo_1;Name=TE_00008087_LTR;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.865;Method=homology
NbLab330C00 EDTA hAT_TIR_transposon 4424 4715 1278 + . ID=TE_homo_2;Name=TE_00003964;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.834;Method=homology
NbLab330C00 EDTA hAT_TIR_transposon 5236 5453 835 + . ID=TE_homo_3;Name=TE_00001425;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.828;Method=homology
Alternatively (with GNU cut):
$ cut -f 2 --complement LAB330_TE_annotation.gff3
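The cut approach only applies if the number is a tab-delimited column of its own, whereas the question describes it as following a space inside the first column. One way to check, sketched below, is to count the tab-separated fields on the first non-comment line. GFF3 records have nine columns, so a count of 9 suggests the number is embedded in field 1 (the awk approach fits), while 10 suggests it is a separate column (cut works). The helper name count_fields and the one-line samples are hypothetical, used only for illustration.

```shell
# Print the number of tab-separated fields on the first non-comment line.
# count_fields is a hypothetical helper name used only in this sketch;
# with no file argument it reads standard input.
count_fields() {
    awk -F'\t' '!/^#/ { print NF; exit }' "$@"
}

# Usage on the real file would be:
#   count_fields LAB330_TE_annotation.gff3

# Demonstration on two assumed one-line layouts:
space_case=$(printf 'NbLab330C00 64506568\tEDTA\tx\t1\t2\t3\t+\t.\tID=a\n' | count_fields)
tab_case=$(printf 'NbLab330C00\t64506568\tEDTA\tx\t1\t2\t3\t+\t.\tID=a\n' | count_fields)

echo "space inside field 1: $space_case fields"
echo "separate column:      $tab_case fields"
```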