(已编辑)我有一个包含很多问题的文件。例如如下:
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; exon_number "1";
根据 ,您可以清楚地看到示例中有两个基因gene_name
,但软件以某种方式将这两个基因合并为一个基因 ( gene_id
)。我想通过使用gene_name
替换gene_id
for 行来纠正这些问题transcript_name
。另外,对于第三个基因(gene_id "XLOC_000004"
),我想将其名称替换为后面oId
不带部分的内容.t[0-9]
输出会是这样的
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; exon_number "1";
我猜逻辑是 grep the gene_name
by transcript_id
,然后根据数字替换gene_id
每一行中的the 。gene_name
transcript_id
到目前为止,我已经创建了一个列表transcript_id
和gene_name
TCONS_00000070 AT1G02100
TCONS_00000071 AT1G02110
TCONS_00000004 g15
然后我需要根据相关内容替换每gene_id
行中的。但我不知道该怎么做。我可以用吗?gene_name
transcript_id
sed
最后但并非最不重要的一点是,我应该提到我的真实文件很大,其中包含 35000 个不同的trnascript_id
.
提前非常感谢!
答案1
我无法判断这是否会涵盖所有边缘情况,但对于您的示例,它可以满足您的要求:
sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file
它的作用是提取gene_name
如果存在的内容,将其存储在保留空间中,并将其用作所有后续gene_id
内容的替换,直到gene_name
出现新的内容。
更详细:
/.*gene_name/
是一个地址,所以里面的所有内容都{}
只会应用于具有该模式的线条- 在我们搞乱一切之前,我们将原始行存储到
h
旧空间 s///
删除之前的模式(直到 的所有内容gene_name
);s/;.*//
删除从分号开始的所有内容。所以剩下的就是空格和双引号中的字符串。x
交换两个空间,现在我们在保留空间中有替换,在模式空间中有原始行- 从现在开始的所有内容都应用于所有行:
G
将保留空间附加到每行,因此我们有了该行、换行符和替换字符 s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' is easier to write than to read:
[^;]matches everything between
基因 IDand the
;, thus the part to be replaced. The
(.)替换中的parts cover the text before and after the embedded newline, so we can refer to them as
\1 \2` 。and
gene_name
为了说明最后一步,请查看在使用with<CR>
作为嵌入换行符附加保留空间后缓冲区的外观:
Chr1_RagTag_p ………; gene_id "XLOC_000060"; exon_number "1";<CR> "AT1G02100"
\______v_____/\________v_______/ \____v____/
gene_id [^;]* \(.*\) \n \(.*\)
-E
也许使用扩展正则表达式(选项)更容易阅读:
sed -E '/.*gene_name/{h;s///;s/;.*//;x;};G;s/(gene_id)[^;]*(.*)\n(.*)/\1\3\2/' file
更新考虑到gene_name
更新问题中的无情况
oId
我只是简单地添加了与提取类似的提取gene_name
,但在它之前。所以如果后面有一个gene_name
,它就会覆盖oId
.这次是为了更好的可读性而分开的行:
sed '
/.*oId/{
h
s///
s/\..*/"/
x
}
/.*gene_name/{
h
s///
s/;.*//
x
}
G
s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file
答案2
在每个 Unix 机器上的任何 shell 中使用任何 awk:
$ awk '
match($0,/gene_name "[^"]+"/) {
name = substr($0,RSTART+10,RLENGTH-10)
}
match($0,/gene_id "[^"]+"/) {
$0 = substr($0,1,RSTART+7) name substr($0,RSTART+RLENGTH)
}
{ print }
' file
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
答案3
或者,再次使用gawk
(GNU Awk),更简单的处理方法是:
$ awk '/gene_name/ {gene_id=$14; $12=$14; print; next;} {$12=gene_id; print;}' file
或者,更简单,如 @EdMorton 所建议:
$ awk '/gene_name/ {gene_id=$14} {$12=gene_id; print}' file
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
注意:
我认为该解决方案不会变得更简单,尽管以目前的形式,它排除了任何当前字段包含一个或多个空格的可能性。为此,需要@EdMorton 的解决方案。
答案4
非常感谢大家为我编写脚本!昨晚我思考了一下,想到了下面的方法。
我们将包含上述信息的输入文件命名为 sample.gtf
我创建了一个包含transcript_id、gene_id 和gene_name 的文件:
$ grep "oId" sample.gtf | awk '{print $10,$12,$14}' | sed 's/\"//g' | sed 's/\;//g' | sed '
s/\.[^\.]*$/ /g' > gene_id_replace_list.txt
# here I use `oId` because I found that some of the lines has no `gene_name`. In a case of no `gene_name`, I would like to grab `oId` as the a `gene_name` for replacement of `gene_id`.
$ head gene_id_replace_list.txt
TCONS_00000070 XLOC_000060 AT1G02100
TCONS_00000071 XLOC_000060 AT1G02110
然后我逐行向 sed 编写了一个循环,该sed
命令或多或少类似于 Phillipos 上面发布的命令。
while IFS= read -r line; do
TransId=$(echo $line | awk '{print $1}')
GeneId=$(echo $line | awk '{print $2}')
TairId=$(echo $line | awk '{print $3}')
sed -i "/${TransId}/ s/${GeneId}/${TairId}/g" sample.gtf
done < gene_id_replace_list.txt
但是,我对这段代码不太满意,因为我的整个文件有 35000 个转录本,到目前为止,这个脚本每个转录本运行近 1 秒。那么有人有更好的建议吗?
我想调整上面 Philippos 的命令来应对没有gene_name的情况,并且还可以更轻松地打印出每个替换的transcript_id(我必须看看代码最后替换了什么,抱歉)。也许与 Philippos 提出的类似的一行sed
命令要快得多?