Nesting awk in a while loop to parse two files line by line and compare column values

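I need some help combining awk with a while loop. I have two simple column files (the real ones are very large). One holds simple intervals for ID=10 (coding regions, i.e. exons, here on chromosome 10):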

#exons.bed
10  60005   60100   
10  61007   61130   
10  61200   61300   
10  61500   61650   
10  61680   61850   

The other holds sequencing reads (again intervals, but smaller), with an additional value in the last column that I will need later:

#reads.bed
10  60005   60010    34 
10  61010   61020    40
10  61030   61040    22
10  61065   61070    35 
10  61100   61105    41

So I would like a fast, efficient way to find which read intervals (which lines of the file), and how many of them, fall inside each coding region:

exon 1(first interval of table 1) contains reads of line 1,2,3, etc. 
of   reads.file(2nd table)

That way I can later retrieve the fourth-column values of those lines for each exon.

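I have written some code that probably needs corrections to the while loop, since I can't get it to pass the reads lines one by one to each awk call. Here it is: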

while read chr a b cov; do  #for the 4-column file

#if <a..b> interval of read falls inside exon interval:
awk '($2<=$a && $b <= $3) {print NR}' exons.bed >> out_lines.bed

done < reads.bed

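At the moment I can get the awk line to run when I supply a and b by hand, but I would like it to run automatically for each pair a, b taken from the file.

Any suggestions for changing the syntax or the approach would be highly appreciated!

Follow-up

In the end I solved it with this code: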

    awk 'NR==FNR{
        a[NR]=$2; 
        b[NR]=$3;
        next; }
    {  # second file (reads.bed)
    for (i in a){
       if($2>=a[i] && $3<=b[i]){         # 2,3: cols of second file here
          k[i]+=1                      # count reads found in exon i
          out[i]=out[i]" "FNR          # keep line number of the read
          rc[i]=rc[i]" "FNR"|"$4       # keep line number and coverage ($4)
          s[i]= s[i]+$4                # sum of coverages for each exon
          m[i]= s[i]/k[i]              # mean (k[i] is the number of
                                       # reads found in the i-th exon)
     }}
    }
    END{
       for (i in out){
          print "Exon", i,": Reads with their COV:",rc[i],\
          "Sum=",s[i],"Mean=",m[i] >> "MeanCalc.txt"

    }}' exons.bed  reads.bed

Output:

   Exon 2 : Reads with their COV:  2|40 3|22 4|35 5|41 Sum= 138  Mean= 34.5
   etc.

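Answer 1

The first problem is that you can't use bash variables inside awk like that. Within awk, $a evaluates to the field whose number is a, but a is empty since it is defined in bash, not in awk (so $a ends up meaning $0, the whole line). One way around this is to use awk's -v option to define the variable: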

-v var=val
--assign var=val
   Assign the value val to the variable var,  before  execution  of
   the  program  begins.  Such variable values are available to the
   BEGIN rule of an AWK program.
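
To make the difference concrete, here is a minimal sketch on a single throwaway line of input. The variable names line, col, unset_case and set_case are made up for this illustration and do not come from the question.

```shell
line='10 60005 60100'

# Without -v, awk has no value for a; it is empty, so $a behaves like $0
# and the whole line is printed.
unset_case=$(printf '%s\n' "$line" | awk '{ print $a }')

# With -v, the shell value is copied into awk's variable a.
# Inside awk, a is the value itself and $a is the field with that number.
col=2
set_case=$(printf '%s\n' "$line" | awk -v a="$col" '{ print a, $a }')

printf '%s\n' "$unset_case" "$set_case"
```

Note that inside the awk program the variable is written as a, without a dollar sign, when its value is what you want to compare.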

So, you could do:

while read chr a b cov; do 
  awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b 
done < reads.bed

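Be careful with the direction of the comparison, though. For a read to fall within an exon, the read's start position must be greater than or equal to the exon's start position, and its end position must be less than or equal to the exon's end. In the loop above, awk is reading exons.bed, so $2 and $3 are the exon's boundaries while a and b are the read's, and $2<=a && b <= $3 expresses exactly that. If the roles are swapped, so that awk is reading reads.bed and a and b hold the exon's boundaries, the test becomes $2>=a && $3<=b.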

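In any case, running this kind of thing in a bash loop is very inefficient, since it has to read the input file once for each pair of a and b. Why not do the whole thing in awk?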

awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
        for (i in a){
           if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR 
        }}}
        END{for (i in out){
                   print "Exon",i,"contains reads of line(s)"out[i],\
                   "of reads file" 
        }}' exons.bed reads.bed

When run on the example files, the script above produces the following output:

Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file
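
One caveat: awk does not specify the order in which for (i in out) visits the keys, so on larger inputs the exons may not be printed in numeric order. The following is a minimal sketch of a variant that remembers the number of exons and loops over 1..n instead. It recreates the sample data in a temporary directory so that it can be run as is; the variable names tmp, n and result are assumptions made for this illustration.

```shell
# Recreate the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '10 60005 60100' '10 61007 61130' '10 61200 61300' \
              '10 61500 61650' '10 61680 61850' > "$tmp/exons.bed"
printf '%s\n' '10 60005 60010 34' '10 61010 61020 40' '10 61030 61040 22' \
              '10 61065 61070 35' '10 61100 61105 41' > "$tmp/reads.bed"

# Same containment test, but remember the number of exons (n)
# and loop over 1..n so that the output order is deterministic.
result=$(awk 'NR==FNR { a[NR]=$2; b[NR]=$3; n=NR; next }
     { for (i=1; i<=n; i++)
           if ($2>=a[i] && $3<=b[i]) out[i]=out[i]" "FNR }
     END { for (i=1; i<=n; i++)
               if (i in out)
                   print "Exon", i, "contains reads of line(s)" out[i], "of reads file" }' \
     "$tmp/exons.bed" "$tmp/reads.bed")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Exons that contain no reads are skipped by the i in out test, as in the original script.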

For clarity, here is the same thing in a less condensed form:

#!/usr/bin/awk -f

## While we're reading the 1st file, exons.bed
NR==FNR{
    ## Save the start position in array a and the end 
    ## in array b. The keys of the arrays are the line numbers.
    a[NR]=$2;
    b[NR]=$3; 
    ## Move to the next line, without continuing
    ## the script.
    next;
}
 ## Once we move on to the 2nd file, reads.bed
 {
     ## For each set of start and end positions
     for (i in a){
         ## If the current line's 2nd field (read start) is at least
         ## this start position and its 3rd field (read end) is at most
         ## this end position, add this line number (FNR is the current
         ## file's line number) to the list of reads for this value of i.
         if($2>=a[i] && $3<=b[i]){
             out[i]=out[i]" "FNR 
         }
     }
 }
 ## After both files have been processed
 END{
     ## For each exon in the out array
     for (i in out){
         ## Print the exon name and the reads it contains
         print "Exon",i,"contains reads of line(s)"out[i],
             "of reads file" 
     }
 }

Answer 2

I know this isn't quite what you asked for, but personally I don't get on with awk, so I would suggest giving Perl a try.

Something like this:

#!/usr/bin/perl

#REALLY GOOD IDEA at the start of any perl code
use strict;
use warnings;

#open some files for input
open( my $exons, "<", 'exons.bed' ) or die $!;

#record where our exons start and finish. 
my %start_of;
my %end_of;

#read line by line our exons file. 
#extract the 3 fields and save 'start' and 'end' in a hash table. 
while (<$exons>) {
    my ( $something, $start, $end ) = split;

    my $exon_id = $.;    #line number;
    $start_of{$exon_id} = $start;
    $end_of{$exon_id}   = $end;
}
close ( $exons );

my %exons;
#run through 'reads' line by line, extracting the fields. 

open( my $reads, "<", 'reads.bed' ) or die $!;
while (<$reads>) {
    my ( $thing, $read_start, $read_end, $value ) = split;

    #cycle through each exon. 
    foreach my $exon_id ( keys %start_of ) {

        #check if _this_ 'read' is within the start and end ranges. 
        if (    $read_start >= $start_of{$exon_id}
            and $read_end <= $end_of{$exon_id} )
        {
            #store the line number in our hash %exons. 
            push( @{ $exons{$exon_id} }, $. );
        }
    }
}
close ( $reads ); 

#cycle through %exons - in numeric 'id' order. 
foreach my $exon_id ( sort { $a <=> $b } keys %exons ) {
    #print any matches. 
    print "exon ",$exon_id, " (", $start_of{$exon_id}, " - ", $end_of{$exon_id},
        ") contains reads of line:", join( ",", @{ $exons{$exon_id} } ), "\n";
}

Given your sample data, this gives:

exon 1 (60005 - 60100) contains reads of line:1
exon 2 (61007 - 61130) contains reads of line:2,3,4,5

You should be able to extend it easily to do more complex range checking or validation!
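
As one possible extension, here is a minimal sketch that reports reads which merely overlap an exon instead of lying entirely inside it. It is written in awk rather than Perl, to match the first answer. The overlap condition assumes closed intervals, and the read 60090-60110 is made up for this illustration; it is not part of the question's data.

```shell
# Hypothetical inputs: two exons from the question and one made-up read
# that straddles the end of the first exon.
tmp=$(mktemp -d)
printf '%s\n' '10 60005 60100' '10 61007 61130' > "$tmp/exons.bed"
printf '%s\n' '10 60090 60110 30' '10 61010 61020 40' > "$tmp/reads.bed"

# A read overlaps an exon when it starts at or before the exon's end
# and ends at or after the exon's start (closed intervals assumed).
overlaps=$(awk 'NR==FNR { a[NR]=$2; b[NR]=$3; n=NR; next }
     { for (i=1; i<=n; i++)
           if ($2<=b[i] && $3>=a[i]) print "read", FNR, "overlaps exon", i }' \
     "$tmp/exons.bed" "$tmp/reads.bed")
printf '%s\n' "$overlaps"
rm -rf "$tmp"
```

The same condition could be dropped into the if statement of the Perl script in place of the containment test.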
