The first few lines of my data look like this:
scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . g1
scaffold10x_1 AUGUSTUS transcript 3591 3908 0.61 - . g1.t1
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3908 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 3906 3908 . - 0 transcript_id "g1.t1"; gene_id "g1";
I need to add "; to the lines whose last column is missing it. I have used
grep -v transcript_id canada.gtf | grep -v "^#"
to identify the lines that lack it. Can I do this with a Linux command?
Answer 1
sed approach:
sed 's/[^[:space:]]\+[^;[:space:]]$/"&";/' file
Output:
scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . "g1";
scaffold10x_1 AUGUSTUS transcript 3591 3908 0.61 - . "g1.t1";
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3908 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 3906 3908 . - 0 transcript_id "g1.t1"; gene_id "g1";
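One way to check the result is to run the same expression on a small sample and then count the lines that still do not end in ";. The sketch below is only an illustration: the temporary directory and the file name sample.gtf are assumptions, and the \+ operator relies on GNU sed.

```shell
# Build a two-line sample in a temporary directory (hypothetical file name).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . g1' \
  'scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";' \
  > "$tmpdir/sample.gtf"

# Same expression as in the answer; \+ is a GNU sed extension.
result=$(sed 's/[^[:space:]]\+[^;[:space:]]$/"&";/' "$tmpdir/sample.gtf")
printf '%s\n' "$result"

# Count lines that still do not end in ";  (grep exits 1 when the count is 0).
remaining=$(printf '%s\n' "$result" | grep -vc '";$' || true)
echo "lines still missing the terminator: $remaining"

rm -r "$tmpdir"
```

Note that the expression needs at least two characters in the last field, so a one-character value would be left unchanged.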
Answer 2
This sed command ensures that every line ends with a semicolon and that the last word on every line is quoted:
sed -e 's/"\?\([a-z0-9.]\+\)"\?;*$/"\1";/' canada.gtf
Here is the output of that command:
scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . "g1";
scaffold10x_1 AUGUSTUS transcript 3591 3908 0.61 - . "g1.t1";
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3908 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 3906 3908 . - 0 transcript_id "g1.t1"; gene_id "g1";
If you want to modify the file in place, you can use the -i flag:
sed -i -e 's/"\?\([a-z0-9.]\+\)"\?;*$/"\1";/' canada.gtf
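Because -i overwrites the file, it may be worth keeping a copy of the original. A minimal sketch, assuming a sed that accepts a backup suffix attached to -i (GNU sed does); the temporary directory and the one-line sample file are made up for the demonstration:

```shell
# Create a throwaway copy of one input line (hypothetical sample data).
tmpdir=$(mktemp -d)
printf '%s\n' 'scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . g1' > "$tmpdir/canada.gtf"

# -i.bak edits canada.gtf in place and saves the original as canada.gtf.bak.
sed -i.bak -e 's/"\?\([a-z0-9.]\+\)"\?;*$/"\1";/' "$tmpdir/canada.gtf"

edited=$(cat "$tmpdir/canada.gtf")
backup=$(cat "$tmpdir/canada.gtf.bak")
echo "edited: $edited"
echo "backup: $backup"

rm -r "$tmpdir"
```

If the edit turns out to be wrong, the .bak file can simply be moved back over the edited one.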
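If you only want to make sure that every line ends with "; (and you do not want a matching " added at the start of the last word on the line), you can use the following command: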
sed -e 's/"\?;\?$/";/' canada.gtf
Here is the output of that command:
scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . g1";
scaffold10x_1 AUGUSTUS transcript 3591 3908 0.61 - . g1.t1";
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3908 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 3906 3908 . - 0 transcript_id "g1.t1"; gene_id "g1";
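Since the pattern also matches an existing "; at the end of a line, running the command a second time should leave the data unchanged. The following sketch checks that on a single sample line; the variable names are illustrative only, and \? relies on GNU sed.

```shell
# One sample line whose last column lacks the terminator.
line='scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . g1'

# Apply the same substitution once, then again to its own output.
once=$(printf '%s\n' "$line" | sed -e 's/"\?;\?$/";/')
twice=$(printf '%s\n' "$once" | sed -e 's/"\?;\?$/";/')

echo "$once"
echo "$twice"

# Report whether the second pass altered anything.
if [ "$once" = "$twice" ]; then
  echo "second pass changed nothing"
fi
```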
Answer 3
@Kay NewEdge 达拉莫拉
I got the result using the one-liner below.
Code:
sed 's/[a-z][0-9]$/&";/' example.txt | sed 's/\([[:space:]]\)\([a-z][0-9][^[:space:]]*";\)$/\1"\2/'
Output:
scaffold10x_1 AUGUSTUS gene 3591 3908 0.61 - . "g1";
scaffold10x_1 AUGUSTUS transcript 3591 3908 0.61 - . "g1.t1";
scaffold10x_1 AUGUSTUS stop_codon 3591 3593 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS CDS 3591 3908 0.61 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS exon 3591 3908 . - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1 AUGUSTUS start_codon 3906 3908 . - 0 transcript_id "g1.t1"; gene_id "g1";