I have the following file:
awk -F'\t' '$3=="mRNA"' GCF_000390325.2_Ntom_v01_genomic.gff | head
NW_008828495.1 Gnomon mRNA 35293 38211 . + . ID=rna-XM_009608413.3;Parent=gene-LOC104084433;Dbxref=GeneID:104084433,Genbank:XM_009608413.3;Name=XM_009608413.3;gbkey=mRNA;gene=LOC104084433;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 76 samples with support for all annotated introns;product=cytochrome P450 CYP82D47-like;transcript_id=XM_009608413.3
NW_008828515.1 Gnomon mRNA 6799 11530 . + . ID=rna-XM_009591409.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009591409.3;Name=XM_009591409.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X2;transcript_id=XM_009591409.3
NW_008828515.1 Gnomon mRNA 6799 11530 . + . ID=rna-XM_009630598.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009630598.3;Name=XM_009630598.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 34 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X1;transcript_id=XM_009630598.3
NW_008828528.1 Gnomon mRNA 2303 14453 . + . ID=rna-XM_033657931.1;Parent=gene-LOC117278374;Dbxref=GeneID:117278374,Genbank:XM_033657931.1;Name=XM_033657931.1;gbkey=mRNA;gene=LOC117278374;model_evidence=Supporting evidence includes similarity to: 72%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC117278374;transcript_id=XM_033657931.1
NW_008828528.1 Gnomon mRNA 5510 7652 . - . ID=rna-XM_033657569.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657569.1;Name=XM_033657569.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1;transcript_id=XM_033657569.1
NW_008828528.1 Gnomon mRNA 5873 8848 . - . ID=rna-XM_033657711.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657711.1;Name=XM_033657711.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2;transcript_id=XM_033657711.1
NW_008828570.1 Gnomon mRNA 5 6611 . - . ID=rna-XM_009610342.3;Parent=gene-LOC104102329;Dbxref=GeneID:104102329,Genbank:XM_009610342.3;Name=XM_009610342.3;gbkey=mRNA;gene=LOC104102329;model_evidence=Supporting evidence includes similarity to: 27 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 56 samples with support for all annotated introns;partial=true;product=TATA-box-binding protein-like;start_range=.,5;transcript_id=XM_009610342.3
NW_008828592.1 Gnomon mRNA 9998 13370 . + . ID=rna-XM_033658453.1;Parent=gene-LOC104103684;Dbxref=GeneID:104103684,Genbank:XM_033658453.1;Name=XM_033658453.1;gbkey=mRNA;gene=LOC104103684;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 10 samples with support for all annotated introns;product=pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic;transcript_id=XM_033658453.1
NW_008828592.1 Gnomon mRNA 13457 18285 . - . ID=rna-XM_009612846.3;Parent=gene-LOC104104451;Dbxref=GeneID:104104451,Genbank:XM_009612846.3;Name=XM_009612846.3;gbkey=mRNA;gene=LOC104104451;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=uncharacterized LOC104104451;transcript_id=XM_009612846.3
NW_008828641.1 Gnomon mRNA 4417 7406 . + . ID=rna-XM_009613787.3;Parent=gene-LOC104105226;Dbxref=GeneID:104105226,Genbank:XM_009613787.3;Name=XM_009613787.3;gbkey=mRNA;gene=LOC104105226;model_evidence=Supporting evidence includes similarity to: 8 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 75 samples with support for all annotated introns;product=heat shock factor protein HSF30%2C transcript variant X1;transcript_id=XM_009613787.3
I used the command below to extract the ID and product values, but all I get is ".mrna1,":
awk -F'\t' '$3=="mRNA"' GCF_000390325.2_Ntom_v01_genomic.gff | perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*product="([^"]+)/; print "$1.mrna1,$2"}' > GCF_000390325.2_Ntom_v01_genomic.gff.csv
The output I would like is:
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
...
What am I missing?
Thanks in advance,
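Answer 1
Whenever you use the capture variables $1, $2, etc., you must first make sure the match actually succeeded.
In this case $1 and $2 are empty, and because you did not turn on warnings, you were not told about it.
Note that your regular expression expects a quotation mark after product=, but there are no quotation marks in your data. I suggest invoking perl with the -w option.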
perl -w -F'\t' -lane 'if(($F[2] eq "mRNA")&&/ID=([^\;]+).*product=([^;]+)/){print "$1,$2"}'
Answer 2
There is no need for a pipe or for perl. It can all be done with awk.
$ awk -F'[\t;]' '{for(i=11; i <= NF;i++) if($i ~ /^product=/) { sub(/ID=/,"",$9); sub(/^product=/, "", $i); print $9","$i }}' infile
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
rna-XM_033657711.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2
rna-XM_009610342.3,TATA-box-binding protein-like
rna-XM_033658453.1,pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic
rna-XM_009612846.3,uncharacterized LOC104104451
rna-XM_009613787.3,heat shock factor protein HSF30%2C transcript variant X1
$
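A possible variant, offered as a sketch rather than a replacement for the command above: it splits on tabs only, filters on the mRNA type itself (so no pre-filtered infile is needed), and looks up ID= and product= by name instead of by position. The file sample.gff and its two records are made up for illustration, and the variable names are hypothetical.

```shell
# Build a tiny tab-separated sample in a temporary directory (made-up records).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.gff"
printf 'ctg1\tGnomon\tgene\t1\t900\t.\t+\t.\tID=gene-G1;Name=G1\n' > "$sample"
printf 'ctg1\tGnomon\tmRNA\t1\t900\t.\t+\t.\tID=rna-X1;Parent=gene-G1;product=some protein%%2C variant X1;transcript_id=X1\n' >> "$sample"

result=$(awk -F'\t' -v OFS=',' '
  $3 == "mRNA" {
    id = ""; product = ""
    n = split($9, attr, ";")          # column 9 holds the key=value attributes
    for (i = 1; i <= n; i++) {
      if (attr[i] ~ /^ID=/)      id      = substr(attr[i], 4)   # drop "ID="
      if (attr[i] ~ /^product=/) product = substr(attr[i], 9)   # drop "product="
    }
    print id, product
  }' "$sample")

echo "$result"
rm -r "$tmpdir"
```

Because the attributes are matched by key, the order of the entries in column 9 does not matter, and a product= entry in the last position is still found.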