Nesting "awk" in a "while" loop to parse two files line by line and compare column values

Answer 1

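The first issue is that you can't use bash variables inside awk like that. Inside awk, $a evaluates to field number a, but a is empty because it is defined in bash, not in awk. One way around this is to use awk's -v option to define the variable: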

-v var=val
--assign var=val
   Assign the value val to the variable var,  before  execution  of
   the  program  begins.  Such variable values are available to the
   BEGIN rule of an AWK program.
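
A minimal sketch of the difference, assuming a made-up one-line input (the variable names line, without and with are only for illustration):

```shell
a=3
line='x y z'

# Without -v, awk has its own, uninitialized variable a, so $a is $0
# (the whole line); the shell's a is never seen.
without=$(printf '%s\n' "$line" | awk '{ print $a }')

# With -v, awk's a is set to 3 before the program runs, so $a is $3.
with=$(printf '%s\n' "$line" | awk -v a="$a" '{ print $a }')

echo "without -v: $without"   # prints: without -v: x y z
echo "with -v:    $with"      # prints: with -v:    z
```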

So, you could do:

while read chr a b cov; do 
  awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b 
done < reads.bed

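You have another error there though. For a read to fall within an exon, the read's start position must be greater than or equal to the exon's start position, and its end position must be less than or equal to the exon's end position. You are using $2<=a && b <= $3, which will select reads whose start lies outside the exon's boundaries. What you want is $2>=a && $3<=b.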
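
The containment test can also be spelled out with named values instead of positional fields, which makes it harder to mix up which side belongs to the read and which to the exon. This is only a sketch: the function read_in_exon and the variable names rs, re, es and ee are hypothetical, not from the original post.

```shell
# Hypothetical helper: succeeds (exit status 0) when the read lies
# entirely inside the exon, boundaries included.
# Arguments: read start, read end, exon start, exon end.
read_in_exon() {
    awk -v rs="$1" -v re="$2" -v es="$3" -v ee="$4" \
        'BEGIN { exit !(rs >= es && re <= ee) }'
}

read_in_exon 60010 60050 60005 60100 && echo "inside"
read_in_exon 60000 60050 60005 60100 || echo "not inside"
```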

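In any case, running this kind of thing in a bash loop is very inefficient, since it needs to read the input file once for each pair of a and b. Why not do the whole thing in awk?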

awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
        for (i in a){
           if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR 
        }}}
        END{for (i in out){
                   print "Exon",i,"contains reads of line(s)"out[i],\
                   "of reads file" 
        }}' exons.bed reads.bed

Run on your example files, the script above produces the following output:

Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file

For clarity, here is the same thing in a less condensed form:

#!/usr/bin/awk -f

## While we're reading the 1st file, exons.bed
NR==FNR{
    ## Save the start position in array a and the end 
    ## in array b. The keys of the arrays are the line numbers.
    a[NR]=$2;
    b[NR]=$3; 
    ## Move to the next line, without continuing
    ## the script.
    next;
}
## Once we move on to the 2nd file, reads.bed
{
    ## For each set of start and end positions
    for (i in a){
        ## If the current line's 2nd field (read start) is greater than
        ## or equal to this start position and its 3rd field (read end)
        ## is smaller than or equal to this end position, add this line
        ## number (FNR is the current file's line number) to the list
        ## of reads for the current value of i.
        if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR
        }
    }
}
## After both files have been processed
END{
    ## For each exon in the out array
    for (i in out){
        ## Print the exon number and the reads it contains
        print "Exon",i,"contains reads of line(s)"out[i],
            "of reads file"
    }
}
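
To try the awk approach end to end, here is a sketch that builds two small input files and runs the same logic. The exon coordinates are taken from the output quoted in this thread; the read coordinates, the coverage values and the temporary directory are made up for illustration. The output is piped through sort because awk does not guarantee any particular order for a for (i in out) loop.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf 'chr1\t60005\t60100\nchr1\t61007\t61130\n' > "$tmp/exons.bed"
printf 'chr1\t60010\t60050\t3\nchr1\t61010\t61060\t5\nchr1\t61050\t61100\t2\nchr1\t70000\t70050\t1\n' > "$tmp/reads.bed"

# First file: remember each exon's start and end, keyed by line number.
# Second file: record the line number of every read inside each exon.
result=$(awk 'NR==FNR { a[NR]=$2; b[NR]=$3; next }
{
    for (i in a)
        if ($2 >= a[i] && $3 <= b[i])
            out[i] = out[i] " " FNR
}
END {
    for (i in out)
        print "Exon", i, "contains reads of line(s)" out[i], "of reads file"
}' "$tmp/exons.bed" "$tmp/reads.bed" | sort)

printf '%s\n' "$result"
rm -r "$tmp"
```

With these made-up reads, line 4 falls outside both exons, so the script reports read 1 for exon 1 and reads 2 and 3 for exon 2.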

Answer 2

I know this isn't quite what you want, but personally, I don't get on with awk, so I'd suggest trying Perl instead.

Something like this:

#!/usr/bin/perl

#REALLY GOOD IDEA at the start of any perl code
use strict;
use warnings;

#open some files for input
open( my $exons, "<", 'exons.bed' ) or die $!;

#record where our exons start and finish. 
my %start_of;
my %end_of;

#read line by line our exons file. 
#extract the 3 fields and save 'start' and 'end' in a hash table. 
while (<$exons>) {
    my ( $something, $start, $end ) = split;

    my $exon_id = $.;    #line number;
    $start_of{$exon_id} = $start;
    $end_of{$exon_id}   = $end;
}
close ( $exons );

my %exons;
#run through 'reads' line by line, extracting the fields. 

open( my $reads, "<", 'reads.bed' ) or die $!;
while (<$reads>) {
    my ( $thing, $read_start, $read_end, $value ) = split;

    #cycle through each exon. 
    foreach my $exon_id ( keys %start_of ) {

        #check if _this_ 'read' is within the start and end ranges. 
        if (    $read_start >= $start_of{$exon_id}
            and $read_end <= $end_of{$exon_id} )
        {
            #store the line number in our hash %exons. 
            push( @{ $exons{$exon_id} }, $. );
        }
    }
}
close ( $reads ); 

#cycle through %exons - in numeric 'id' order (a plain sort would be lexical).
foreach my $exon_id ( sort { $a <=> $b } keys %exons ) {
    #print any matches. 
    print "exon ",$exon_id, " (", $start_of{$exon_id}, " - ", $end_of{$exon_id},
        ") contains reads of line:", join( ",", @{ $exons{$exon_id} } ), "\n";
}

Given your sample data, this gives:

exon 1 (60005 - 60100) contains reads of line:1
exon 2 (61007 - 61130) contains reads of line:2,3,4,5

You should be able to extend it to easily do some more complex range checking/validation!

