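I need some help with an awk & while loop combination.
I have two simple files with columns (the real files are very large). One represents simple intervals for ID=10 (coding regions (exons), here on chromosome 10):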
#exons.bed
10 60005 60100
10 61007 61130
10 61200 61300
10 61500 61650
10 61680 61850
The other represents sequencing reads (again intervals, but smaller ones), with an extra value in the last column that I will need later:
#reads.bed
10 60005 60010 34
10 61010 61020 40
10 61030 61040 22
10 61065 61070 35
10 61100 61105 41
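So I would like to search, in a fast and efficient way, for which read intervals (which lines of the file), and how many of them, fall inside each coding region: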
exon 1(first interval of table 1) contains reads of line 1,2,3, etc.
of reads.file(2nd table)
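That way I can later fetch the values of the fourth column of those lines for each exon.
I have written some code, but the while loop probably needs correcting, because I can't get it to feed the read lines to awk one by one. Here it is: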
while read chr a b cov; do #for the 4-column file
#if <a..b> interval of read falls inside exon interval:
awk '($2<=$a && $b <= $3) {print NR}' exons.bed >> out_lines.bed
done < reads.bed
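At the moment I can get the awk line to run when I supply a and b by hand, but I would like it to run automatically for every pair a,b in the file.
Any suggestions on changing the syntax or the approach would be highly appreciated!
Follow-up
In the end I solved the problem with this code: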
awk 'NR==FNR{
        a[NR]=$2;
        b[NR]=$3;
        next; }
     { #second file
       for (i in a){
         if($2>=a[i] && $3<=b[i]){   # 2,3: cols of second file here
           k[i]+=1                   # count the reads found in the i-th exon
           out[i]=out[i]" "FNR       # keep the line number of the read
           rc[i]=rc[i]" "FNR"|"$4    # keep the line number and the coverage ($4)
           s[i]=s[i]+$4              # sum of the coverages for each exon
           m[i]=s[i]/k[i]            # running mean (k is the number of
                                     # reads found in the i-th exon)
       }}
     }
     END{
       for (i in out){
         print "Exon", i,": Reads with their COV:",rc[i],\
               "Sum=",s[i],"Mean=",m[i] >> "MeanCalc.txt"
       }}' exons.bed reads.bed
Output:
Exon 2 : Reads with their COV: 2|40 3|22 4|35 5|41 Sum= 138 Mean= 34.5
etc.
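Answer 1
The first problem is that you can't use bash variables inside awk like that. Inside awk, $a is evaluated as field number a, but a is empty because it is defined in bash, not in awk, so $a ends up meaning $0, the whole line. One way around this is to use awk's -v option to define variables: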
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of
the program begins. Such variable values are available to the
BEGIN rule of an AWK program.
So, you can do this:
while read chr a b cov; do
awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b
done < reads.bed
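To see the difference concretely, here is a small sketch (the sample line, the value of a and the names without_v and with_v are made up for illustration). Inside single quotes the shell does not expand $a, and awk's own a is unset, so $a is read as $0; with -v the shell value becomes a real awk variable:

```shell
#!/bin/sh
# Hypothetical sample value, for illustration only.
a=61010

# The shell variable is invisible inside the single-quoted program:
# awk's own "a" is unset, so $a means $0 and the whole line is printed.
without_v=$(echo "10 61007 61130" | awk '{print $a}')

# With -v, awk gets a variable named a holding the shell value,
# so it can be compared against a field.
with_v=$(echo "10 61007 61130" | awk -v a="$a" '$2 <= a {print "start", $2, "<=", a}')

echo "$without_v"
echo "$with_v"
```

The first command echoes the input line unchanged, while the second only prints when the comparison against the passed-in value holds.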
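Be careful with the direction of the comparison, though. For a read to fall inside an exon, the read's start must be greater than or equal to the exon's start, and its end less than or equal to the exon's end. In the while loop, a and b hold the read's boundaries and $2 and $3 the exon's, so $2<=a && b<=$3 expresses exactly that. If the roles are swapped, so that a and b hold the exon's boundaries and $2 and $3 the read's, the test you want is $2>=a && $3<=b.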
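In any case, running this kind of thing in a bash loop is very inefficient, since it has to read the input file once for every pair of a and b. Why not do the whole thing in awk?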
awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
for (i in a){
if($2>=a[i] && $3<=b[i]){
out[i]=out[i]" "FNR
}}}
END{for (i in out){
print "Exon",i,"contains reads of line(s)"out[i],\
"of reads file"
}}' exons.bed reads.bed
Run on the example files, the script above produces the following output:
Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file
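One caveat: awk does not define the order in which for (i in array) visits the keys, so on a larger file the exons may not be reported in order. A minimal sketch of one way around this, assuming exon numbers are simply the line numbers of exons.bed as above (the variables n and count, the changed output wording and the scratch directory are additions for illustration, not part of the original script):

```shell
#!/bin/sh
# Recreate the sample data from the question in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/exons.bed" <<'EOF'
10 60005 60100
10 61007 61130
10 61200 61300
10 61500 61650
10 61680 61850
EOF
cat > "$dir/reads.bed" <<'EOF'
10 60005 60010 34
10 61010 61020 40
10 61030 61040 22
10 61065 61070 35
10 61100 61105 41
EOF

result=$(awk '
    # First file: remember each exon start and end, and how many there are.
    NR==FNR { a[NR]=$2; b[NR]=$3; n=NR; next }
    # Second file: test the read against every exon, in file order.
    {
        for (i=1; i<=n; i++)
            if ($2>=a[i] && $3<=b[i]) { out[i]=out[i]" "FNR; count[i]++ }
    }
    # Walk the exons by number so the report is always in file order.
    END {
        for (i=1; i<=n; i++)
            if (i in out)
                print "Exon", i, "contains", count[i], "read(s), line(s)" out[i]
    }' "$dir/exons.bed" "$dir/reads.bed")

echo "$result"
rm -r "$dir"
```

Looping over a numeric index instead of the array keys costs nothing extra here and also gives the per-exon read count the question asked for.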
For clarity, here is the same thing in a less condensed form:
#!/usr/bin/awk -f
## While we're reading the 1st file, exons.bed
NR==FNR{
## Save the start position in array a and the end
## in array b. The keys of the arrays are the line numbers.
a[NR]=$2;
b[NR]=$3;
## Move to the next line, without continuing
## the script.
next;
}
## Once we move on to the 2nd file, reads.bed
{
## For each set of start and end positions
for (i in a){
## If the current line's 2nd field (the read's start) is greater than
## or equal to this start position and its 3rd field (the read's end)
## is smaller than or equal to this end position, add this line number
## (FNR is the current file's line number) to the list of reads for
## the current value of i.
if($2>=a[i] && $3<=b[i]){
out[i]=out[i]" "FNR
}
}
}
## After both files have been processed
END{
## For each exon in the out array
for (i in out){
## Print the exon name and the reads it contains
print "Exon",i,"contains reads of line(s)"out[i],
"of reads file"
}
}
Answer 2
I know this isn't quite what you asked for, but personally I don't get along with awk, so I'd suggest trying Perl instead.
Something like this:
#!/usr/bin/perl
#REALLY GOOD IDEA at the start of any perl code
use strict;
use warnings;
#open some files for input
open( my $exons, "<", 'exons.bed' ) or die $!;
#record where our exons start and finish.
my %start_of;
my %end_of;
#read line by line our exons file.
#extract the 3 fields and save 'start' and 'end' in a hash table.
while (<$exons>) {
my ( $something, $start, $end ) = split;
my $exon_id = $.; #line number;
$start_of{$exon_id} = $start;
$end_of{$exon_id} = $end;
}
close ( $exons );
my %exons;
#run through 'reads' line by line, extracting the fields.
open( my $reads, "<", 'reads.bed' ) or die $!;
while (<$reads>) {
my ( $thing, $read_start, $read_end, $value ) = split;
#cycle through each exon.
foreach my $exon_id ( keys %start_of ) {
#check if _this_ 'read' is within the start and end ranges.
if ( $read_start >= $start_of{$exon_id}
and $read_end <= $end_of{$exon_id} )
{
#store the line number in our hash %exons.
push( @{ $exons{$exon_id} }, $. );
}
}
}
close ( $reads );
#cycle through %exons - in numeric 'id' order.
foreach my $exon_id ( sort { $a <=> $b } keys %exons ) {
#print any matches.
print "exon ",$exon_id, " (", $start_of{$exon_id}, " - ", $end_of{$exon_id},
") contains reads of line:", join( ",", @{ $exons{$exon_id} } ), "\n";
}
Given your sample data, this gives:
exon 1 (60005 - 60100) contains reads of line:1
exon 2 (61007 - 61130) contains reads of line:2,3,4,5
You should be able to extend this fairly easily to do some more complex range checking/validation!