合并文件中的字段

合并文件中的字段

我有一个包含 7 列的文件,一个具有染色体区域的 GFF 文件。我想将 REGION ="exon" 的行折叠到文件中的一行。必须根据与每个区域重叠的区域折叠该行其他。

REGION  START   END  SCORE STRAND FRAME     ATTRIBUTE
 exon   26453   26644   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   26842   27020   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   30355   30899   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 GS_TRAN    30355   34083   .   -   .   GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
 snp    30847   30847   .   +   .   SNP "rs2971719"; Name "rs2971719"
 exon   31012   31409   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   34013   34083   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   40932   41071   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 snp    44269   44269   .   +   .   SNP "rs2873227"; Name "rs2873227"
 snp    45723   45723   .   +   .   SNP "rs2227095"; Name "rs2227095"
 exon   134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
 exon   134034  134457  .   -   .   Transcript "XM_086914"; Name "XM_086914"            

查看上面的示例数据,只有最后两行可以合并为一行。因此,新行将成为。

exon    134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            

如果另一行的末尾大于其前一行,那么在这种情况下,这将是 END 区域。基本上,如果有任何重叠,则取较早开始的区域和较晚结束的区域。

这样的实例可以有多行,这里只有最后 2 行。有一件事是 ATRRIBUTE 列肯定会显示这些行的不同 Transcript 名称,在其他情况下大多是相同的。

有关如何继续的任何建议。

更新的示例:如果最后两行是这样的

  exon  134031  134457  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
  exon  134034  134495  .   -   .   Transcript "XM_086914"; Name "XM_086914"

那么输出应该是:

exon    134031  134495  .   -   .   Transcript "XM_086913"; Transcript "XM_086914"

基本上是从第一行开始,从第二行结束。因为我只想覆盖一行中的重叠,而不是 2 行或 3 行或更多行。这里重叠是在 2 行之间,但也可以在多于 2 行之间。

更新的示例(2014 年 3 月 24 日)

chr1    HAVANA  stop_codon  1120520 1120522 .   +   0   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115077 1115233 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115414 1115433 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1120520 1121244 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  transcript  1115864 1119307 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1115864 1116240 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117121 1117195*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1118256 1118427 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";

chr1    HAVANA  transcript  1190648 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1209046 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1203113 1203372 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  CDS 1203241 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  start_codon 1203370 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  stop_codon  1203238 1203240 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1198726 1198766 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192588 1192690 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192372 1192510 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1191425 1191505*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1190648 1191470*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";

上半部分显示“+”链的重叠,下半部分显示“-”链的重叠。“-”链具有递减区域,因此重叠将如最后两行所示。两者都是不同的基因所以重叠必须是每个基因的,因为有时不同的基因也有重叠的外显子,但正如我在一些帖子中读到的那样,这种情况非常罕见。可以从最后一列中提取基因信息,表示为“gene_name”。

gene_name=TTLL10 中的两行具有重叠的外显子,因此它们将在最终输出中合并。

chr1    HAVANA  *exon   1117121 1117195*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 

gene_name= UBE2J2 的两行具有重叠的外显子。

 chr1   HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

样本输出

其余行保持不变,并且上面的行针对每个基因进行合并。

chr1    HAVANA  *exon   1117121 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1    HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

如果transcript_id不同,尽管gene_name将保持不变,但两个转录本id都会被打印。例如,如果对于基因,转录本id会有所不同,如下所示:

  chr1  HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.2";  gene_name "UBE2J2"; 

这将合并到上面,但应该有两个成绩单名称。因为,我相信它可能是需要的,并且稍后对于保留成绩单信息很重要。

  chr1  HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1"; "ENST00000473215.2"  gene_name "UBE2J2"; 

答案1

一种“awk”方法,

awk '
  $1!="exon" {                       # If the first died is unequal to "exon"
    if(previous)print previous       # If there is a previous line then print it
    print                            # Print the current line
    previous=start=end=exon_seq=""   # Set all variable to an empty string
    next                             # Move on to the next line in the input file
  }
  {
    if(exon_seq) {                   # if there is a sequence of lines with "exon in field 1
      if(start<=$2 && end>=$3)       # if the start value (field 2) of the previous line 
                                     # is less or equal to the current line and the end
                                     # value of the previous line is greater than or
                                     # equal to field 3 of the current line
        next                         # then do nothing and read the next line
      else                           # if there is no overlap,
        print previous               # then print the previous line
    }
    else {                           # if we are not already in the a sequence of 
                                     # "exon" lines, then this is the first one
      exon_seq=1                     # so exon_seq should become 1
    }
    previous=$0; start=$2; end=$3    # `start` become field2, `end` becomes field 3 and
                                     # `previous` becomes the current record (line)
  }
  END{                               # After all lines are processed
    if(previous) print previous      # If there still is a previous line, then print it
  }
' file

答案2

我会使用 Perl 来解决如此复杂的任务。这是部分解决方案,您可能需要对其进行调整以更好地适合您:

#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ max };

sub output {
    my $previous = shift;
    print join ' ', 'exon',
               @{$previous}{qw(start end score strand frame attribute)};
}

$\ = "\n";
my %previous;
while (<>) {
    chomp;
    my ($region, $start, $end, $score, $strand, $frame, $attribute)
        = split ' ', $_, 7;

    if ($. == 1) {
        print;

    } elsif ('exon' eq $region) {
        if (keys %previous
            and $start < $previous{end}                # Overlap.
           ) {

            if ($end > $previous{end}) {               # Not contained.
                $previous{attribute} =~ s/; Name .*//;
                $previous{attribute} .= '; ' 
                                     . ($attribute =~ /(Transcript ".*?")/)[0];
                $previous{end} = $end;
            }

        } else {
            if (keys %previous) {
                output(\%previous);
            }

            %previous = ( start     => $start,
                          end       => $end,
                          score     => $score,
                          strand    => $strand,
                          frame     => $frame,
                          attribute => $attribute,
                        );
        }

    } else {
        output(\%previous) if keys %previous;
        %previous = ();
    }
}

output(\%previous) if keys %previous;

相关内容