我有一个包含 7 列的文件,一个具有染色体区域的 GFF 文件。我想将 REGION ="exon" 的行折叠到文件中的一行。必须根据与每个区域重叠的区域折叠该行其他。
REGION START END SCORE STRAND FRAME ATTRIBUTE
exon 26453 26644 . + . Transcript "XM_092971"; Name "XM_092971"
exon 26842 27020 . + . Transcript "XM_092971"; Name "XM_092971"
exon 30355 30899 . - . Transcript "XM_104663"; Name "XM_104663"
GS_TRAN 30355 34083 . - . GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
snp 30847 30847 . + . SNP "rs2971719"; Name "rs2971719"
exon 31012 31409 . - . Transcript "XM_104663"; Name "XM_104663"
exon 34013 34083 . - . Transcript "XM_104663"; Name "XM_104663"
exon 40932 41071 . + . Transcript "XM_092971"; Name "XM_092971"
snp 44269 44269 . + . SNP "rs2873227"; Name "rs2873227"
snp 45723 45723 . + . SNP "rs2227095"; Name "rs2227095"
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134457 . - . Transcript "XM_086914"; Name "XM_086914"
查看上面的示例数据,只有最后两行可以合并为一行。因此,新行将成为。
exon 134031 134495 . - . Transcript "XM_086913"; Name "XM_086913"
如果另一行的末尾大于其前一行,那么在这种情况下,这将是 END 区域。基本上,如果有任何重叠,则取较早开始的区域和较晚结束的区域。
这样的实例可以有多行,这里只有最后 2 行。有一件事是 ATRRIBUTE 列肯定会显示这些行的不同 Transcript 名称,在其他情况下大多是相同的。
有关如何继续的任何建议。
更新的示例:如果最后两行是这样的
exon 134031 134457 . - . Transcript "XM_086913"; Name "XM_086913"
exon 134034 134495 . - . Transcript "XM_086914"; Name "XM_086914"
那么输出应该是:
exon 134031 134495 . - . Transcript "XM_086913"; Transcript "XM_086914"
基本上是从第一行开始,从第二行结束。因为我只想覆盖一行中的重叠,而不是 2 行或 3 行或更多行。这里重叠是在 2 行之间,但也可以在多于 2 行之间。
更新的示例(2014 年 3 月 24 日)
chr1 HAVANA stop_codon 1120520 1120522 . + 0 gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115077 1115233 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1115414 1115433 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA UTR 1120520 1121244 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1 HAVANA transcript 1115864 1119307 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1115864 1116240 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117121 1117195* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA *exon 1117150 1117826* . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA exon 1118256 1118427 . + . gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1 HAVANA transcript 1190648 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1209046 1209229 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1203113 1203372 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA CDS 1203241 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA start_codon 1203370 1203372 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA stop_codon 1203238 1203240 . - 0 gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1198726 1198766 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192588 1192690 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA exon 1192372 1192510 . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1191425 1191505* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1 HAVANA *exon 1190648 1191470* . - . gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
上半部分显示“+”链的重叠,下半部分显示“-”链的重叠。“-”链具有递减区域,因此重叠将如最后两行所示。两者都是不同的基因所以重叠必须是每个基因的,因为有时不同的基因也有重叠的外显子,但正如我在一些帖子中读到的那样,这种情况非常罕见。可以从最后一列中提取基因信息,表示为“gene_name”。
gene_name=TTLL10 中的两行具有重叠的外显子,因此它们将在最终输出中合并。
chr1 HAVANA *exon 1117121 1117195* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1117150 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
gene_name= UBE2J2 的两行具有重叠的外显子。
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
样本输出
其余行保持不变,并且上面的行针对每个基因进行合并。
chr1 HAVANA *exon 1117121 1117826* . + . transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
如果transcript_id不同,尽管gene_name将保持不变,但两个转录本id都会被打印。例如,如果对于基因,转录本id会有所不同,如下所示:
chr1 HAVANA *exon 1191425 1191505* . - . transcript_id "ENST00000473215.1"; gene_name "UBE2J2";
chr1 HAVANA *exon 1190648 1191470* . - . transcript_id "ENST00000473215.2"; gene_name "UBE2J2";
这将合并到上面,但应该有两个成绩单名称。因为,我相信它可能是需要的,并且稍后对于保留成绩单信息很重要。
chr1 HAVANA *exon 1190648 1191505* . - . transcript_id "ENST00000473215.1"; "ENST00000473215.2" gene_name "UBE2J2";
答案1
一种“awk”方法,
awk '
$1!="exon" { # If the first died is unequal to "exon"
if(previous)print previous # If there is a previous line then print it
print # Print the current line
previous=start=end=exon_seq="" # Set all variable to an empty string
next # Move on to the next line in the input file
}
{
if(exon_seq) { # if there is a sequence of lines with "exon in field 1
if(start<=$2 && end>=$3) # if the start value (field 2) of the previous line
# is less or equal to the current line and the end
# value of the previous line is greater than or
# equal to field 3 of the current line
next # then do nothing and read the next line
else # if there is no overlap,
print previous # then print the previous line
}
else { # if we are not already in the a sequence of
# "exon" lines, then this is the first one
exon_seq=1 # so exon_seq should become 1
}
previous=$0; start=$2; end=$3 # `start` become field2, `end` becomes field 3 and
# `previous` becomes the current record (line)
}
END{ # After all lines are processed
if(previous) print previous # If there still is a previous line, then print it
}
' file
答案2
我会使用 Perl 来解决如此复杂的任务。这是部分解决方案,您可能需要对其进行调整以更好地适合您:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ max };
sub output {
my $previous = shift;
print join ' ', 'exon',
@{$previous}{qw(start end score strand frame attribute)};
}
$\ = "\n";
my %previous;
while (<>) {
chomp;
my ($region, $start, $end, $score, $strand, $frame, $attribute)
= split ' ', $_, 7;
if ($. == 1) {
print;
} elsif ('exon' eq $region) {
if (keys %previous
and $start < $previous{end} # Overlap.
) {
if ($end > $previous{end}) { # Not contained.
$previous{attribute} =~ s/; Name .*//;
$previous{attribute} .= '; '
. ($attribute =~ /(Transcript ".*?")/)[0];
$previous{end} = $end;
}
} else {
if (keys %previous) {
output(\%previous);
}
%previous = ( start => $start,
end => $end,
score => $score,
strand => $strand,
frame => $frame,
attribute => $attribute,
);
}
} else {
output(\%previous) if keys %previous;
%previous = ();
}
}
output(\%previous) if keys %previous;