Extracting strings with a regex yields empty strings

I have the following file:

awk -F'\t' '$3=="mRNA"'  GCF_000390325.2_Ntom_v01_genomic.gff | head
NW_008828495.1  Gnomon  mRNA    35293   38211   .   +   .   ID=rna-XM_009608413.3;Parent=gene-LOC104084433;Dbxref=GeneID:104084433,Genbank:XM_009608413.3;Name=XM_009608413.3;gbkey=mRNA;gene=LOC104084433;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 76 samples with support for all annotated introns;product=cytochrome P450 CYP82D47-like;transcript_id=XM_009608413.3
NW_008828515.1  Gnomon  mRNA    6799    11530   .   +   .   ID=rna-XM_009591409.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009591409.3;Name=XM_009591409.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 22 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X2;transcript_id=XM_009591409.3
NW_008828515.1  Gnomon  mRNA    6799    11530   .   +   .   ID=rna-XM_009630598.3;Parent=gene-LOC104116524;Dbxref=GeneID:104116524,Genbank:XM_009630598.3;Name=XM_009630598.3;gbkey=mRNA;gene=LOC104116524;model_evidence=Supporting evidence includes similarity to: 2 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 34 samples with support for all annotated introns;product=protein JASON-like%2C transcript variant X1;transcript_id=XM_009630598.3
NW_008828528.1  Gnomon  mRNA    2303    14453   .   +   .   ID=rna-XM_033657931.1;Parent=gene-LOC117278374;Dbxref=GeneID:117278374,Genbank:XM_033657931.1;Name=XM_033657931.1;gbkey=mRNA;gene=LOC117278374;model_evidence=Supporting evidence includes similarity to: 72%25 coverage of the annotated genomic feature by RNAseq alignments;product=uncharacterized LOC117278374;transcript_id=XM_033657931.1
NW_008828528.1  Gnomon  mRNA    5510    7652    .   -   .   ID=rna-XM_033657569.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657569.1;Name=XM_033657569.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1;transcript_id=XM_033657569.1
NW_008828528.1  Gnomon  mRNA    5873    8848    .   -   .   ID=rna-XM_033657711.1;Parent=gene-LOC117278090;Dbxref=GeneID:117278090,Genbank:XM_033657711.1;Name=XM_033657711.1;gbkey=mRNA;gene=LOC117278090;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2;transcript_id=XM_033657711.1
NW_008828570.1  Gnomon  mRNA    5   6611    .   -   .   ID=rna-XM_009610342.3;Parent=gene-LOC104102329;Dbxref=GeneID:104102329,Genbank:XM_009610342.3;Name=XM_009610342.3;gbkey=mRNA;gene=LOC104102329;model_evidence=Supporting evidence includes similarity to: 27 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 56 samples with support for all annotated introns;partial=true;product=TATA-box-binding protein-like;start_range=.,5;transcript_id=XM_009610342.3
NW_008828592.1  Gnomon  mRNA    9998    13370   .   +   .   ID=rna-XM_033658453.1;Parent=gene-LOC104103684;Dbxref=GeneID:104103684,Genbank:XM_033658453.1;Name=XM_033658453.1;gbkey=mRNA;gene=LOC104103684;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 10 samples with support for all annotated introns;product=pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic;transcript_id=XM_033658453.1
NW_008828592.1  Gnomon  mRNA    13457   18285   .   -   .   ID=rna-XM_009612846.3;Parent=gene-LOC104104451;Dbxref=GeneID:104104451,Genbank:XM_009612846.3;Name=XM_009612846.3;gbkey=mRNA;gene=LOC104104451;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 23 samples with support for all annotated introns;product=uncharacterized LOC104104451;transcript_id=XM_009612846.3
NW_008828641.1  Gnomon  mRNA    4417    7406    .   +   .   ID=rna-XM_009613787.3;Parent=gene-LOC104105226;Dbxref=GeneID:104105226,Genbank:XM_009613787.3;Name=XM_009613787.3;gbkey=mRNA;gene=LOC104105226;model_evidence=Supporting evidence includes similarity to: 8 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 75 samples with support for all annotated introns;product=heat shock factor protein HSF30%2C transcript variant X1;transcript_id=XM_009613787.3

I used the command below to extract the ID and product values, but all I get is ".mrna1,":

awk -F'\t' '$3=="mRNA"'  GCF_000390325.2_Ntom_v01_genomic.gff | perl -F'\t' -lane 'if($F[2] eq "mRNA"){/ID=([^\;]+).*product="([^"]+)/; print "$1.mrna1,$2"}' > GCF_000390325.2_Ntom_v01_genomic.gff.csv

As output I would like to get:

rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
...

What am I missing?

Thanks in advance,

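Answer 1

Whenever you use the capture variables $1, $2, etc., you must first make sure the match actually succeeded.

In this case the match fails, so $1 and $2 are undefined and print as empty strings. Because you have not enabled warnings, Perl does not tell you about it.

Note that your regex expects a double quote after product=, but there are no quotes in your data. I suggest invoking perl with the -w option and printing only when the match succeeds: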

perl  -w -F'\t' -lane 'if(($F[2] eq "mRNA")&&/ID=([^\;]+).*product=([^;]+)/){print "$1.mrna1,$2"}'

Answer 2

There is no need to pipe the data into perl. All of this can be done with awk alone:

$ awk  -F'[\t;]' '{for(i=11; i < NF;i++) if($i ~ /^product=/) { sub(/ID=/,"",$9); sub(/^product=/, "", $i); print $9","$i }}' infile
rna-XM_009608413.3,cytochrome P450 CYP82D47-like
rna-XM_009591409.3,protein JASON-like%2C transcript variant X2
rna-XM_009630598.3,protein JASON-like%2C transcript variant X1
rna-XM_033657931.1,uncharacterized LOC117278374
rna-XM_033657569.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X1
rna-XM_033657711.1,uncharacterized mitochondrial protein AtMg00810-like%2C transcript variant X2
rna-XM_009610342.3,TATA-box-binding protein-like
rna-XM_033658453.1,pentatricopeptide repeat-containing protein At1g15510%2C chloroplastic
rna-XM_009612846.3,uncharacterized LOC104104451
rna-XM_009613787.3,heat shock factor protein HSF30%2C transcript variant X1
$ 
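The awk command above relies on the ID being the ninth field and on product appearing from field 11 onwards, and it does not itself test the feature type. As a hedged alternative, here is a sketch that keeps the mRNA filter and looks up both attributes by name. It assumes tab-separated GFF3 input with the attributes in column 9; the function name gff_id_product and the demo line are hypothetical.

```shell
# Hypothetical helper: reads GFF3 on stdin, prints "ID,product" for mRNA rows only
gff_id_product() {
  awk -F'\t' '$3 == "mRNA" {
    id = ""; product = ""
    n = split($9, attr, ";")            # column 9 holds the key=value pairs
    for (i = 1; i <= n; i++) {
      if (attr[i] ~ /^ID=/)      id = substr(attr[i], 4)        # drop "ID="
      if (attr[i] ~ /^product=/) product = substr(attr[i], 9)   # drop "product="
    }
    if (id != "" && product != "") print id "," product
  }'
}

# Made-up demo line (tab-separated), not taken from the real file
printf 'chr1\tGnomon\tmRNA\t1\t100\t.\t+\t.\tID=rna-X1;Parent=gene-G1;product=demo protein;transcript_id=X1\n' | gff_id_product
```

On the real data it would be called as gff_id_product < GCF_000390325.2_Ntom_v01_genomic.gff, with the output redirected to a CSV file. Rows that lack either attribute are skipped rather than printed with an empty field.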
