I have some files like this:
- file_1
chrV	20924149
chrX	17718866
chrIV	17493793
chrII	15279345
chrI	15072423
chrIII	13783700
chrM	13794
- file_2
chrI	230218
chrII	813184
chrIII	316620
chrIV	1531933
chrIX	439888
chrM	85779
chrV	576874
chrVI	270161
chrVII	1090940
chrVIII	562643
chrX	745751
chrXI	666816
chrXII	1078177
chrXIII	924431
chrXIV	784333
chrXV	1091291
chrXVI	948066
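I need the mean and the total of column 2, as well as the maximum and minimum values in each file. I picked up some ideas from Stack Overflow and put together this ugly bash script: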
#!/usr/bin/env bash
for VARIABLE in Data/*.sizes
do
    echo "$VARIABLE"
    echo 'Genome length:'
    awk -F '\t' '{ sum += $2 } END { print sum }' "$VARIABLE"
    echo 'Chr number:'
    # NR holds the number of lines read once the END block is reached
    awk 'END { print NR }' "$VARIABLE"
    echo 'Chr mean length:'
    awk -F '\t' '{ total += $2 } END { print total/NR }' "$VARIABLE"
    echo 'Longest Chr:'
    awk -v max=0 '{if($2>max){want=$1" "$2; max=$2}}END{print want}' "$VARIABLE"
    echo 'Smallest Chr:'
    awk 'NR == 1 || $2 < min {line = $1; min = $2}END{print line " " min}' "$VARIABLE"
    echo " "
done
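It works, but if you have any better ideas I would like to hear them. Perhaps there is a way to make it more generic, since I sometimes have to do this for several similar files.
I would appreciate any comments, as I do not usually use awk and bash.
It prints this: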
Data/file_1
Genome length:
100286070
Chr number:
7
Chr mean length:
1.43266e+07
Longest Chr:
chrV 20924149
Smallest Chr:
chrM 13794
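Answer 1
The following awk script will do the job. Because of its length, I wrote it as an explicit program file. Most of that length comes from printing the analysis results; the actual computation is quite short.
If you have GNU awk, with its ENDFILE block:
Program file (let's call it analyze_genome_g.awk):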
#!/usr/bin/gawk -f
# Begin of file, characterized by FNR, the per-file line-counter, being 1.
# Initialize statistics: set sum, min, and max to first chromosome length
# and name of longest/shortest ('long'/'short') to first chromosome name.
FNR==1{s=min=max=$2; short=long=$1}
# All other lines: Update sum, min, and max lengths
FNR>1{s=s+$2;if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}}
# End-of-file (GNU awk feature!): Print statistics
ENDFILE{
    printf("%s\n",FILENAME);
    printf("- Genome length : %d\n",s);
    printf("- Nr. of chromosomes : %d\n",FNR);
    printf("- Mean chomosome length : %.1f\n",s/FNR);
    printf("- Shortest chromosome : %s (length=%d)\n",short,min);
    printf("- Longest chromosome : %s (length=%d)\n",long,max);
    printf("\n");
}
You can call it as
gawk -f analyze_genome_g.awk file_1 file_2 ...
Output:
file_1
- Genome length : 100286070
- Nr. of chromosomes : 7
- Mean chomosome length : 14326581.4
- Shortest chromosome : chrM (length=13794)
- Longest chromosome : chrV (length=20924149)
file_2
- Genome length : 12157105
- Nr. of chromosomes : 17
- Mean chomosome length : 715123.8
- Shortest chromosome : chrM (length=85779)
- Longest chromosome : chrIV (length=1531933)
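A side note on the mean: it prints here as 14326581.4, whereas the script in the question printed 1.43266e+07. That difference comes from how awk formats numbers. A plain print renders non-integer values using OFMT (default "%.6g"), while printf with an explicit format such as "%.1f" does not. A minimal sketch, where mean_formats is a hypothetical helper name used only for this illustration:

```shell
# mean_formats is a hypothetical helper name, not part of the answer above.
mean_formats() {
    awk 'BEGIN {
        mean = 100286070 / 7      # genome length / chromosome count of file_1
        print mean                # print formats non-integers with OFMT ("%.6g")
        printf("%.1f\n", mean)    # explicit fixed-point format
    }'
}
mean_formats
```

With the default OFMT, the first line should read 1.43266e+07 and the second 14326581.4.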
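Other awk variants:
If your awk does not know the ENDFILE condition, you need a workaround: basically, save the file properties in temporary variables and print the statistics either at the beginning of a new file (for the previous file) or, for the last file processed, in the END block.
To make this more convenient, we define a function printstats() that performs the output.
Program file (analyze_genome.awk):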
#!/usr/bin/awk -f
function printstats()
{
    printf("%s\n",last_fn);
    printf("- Genome length : %d\n",s);
    printf("- Nr. of chromosomes : %d\n",last_fnr);
    printf("- Mean chomosome length : %.1f\n",s/last_fnr);
    printf("- Shortest chromosome : %s (length=%d)\n",short,min);
    printf("- Longest chromosome : %s (length=%d)\n",long,max);
    printf("\n");
}
# Begin of file
# FNR==1 always works, but now we have to save file properties, too.
# If it is _not_ the first file (NR, the global line counter, is larger than
# FNR, the per-file line-counter), print statistics (of the previous file).
FNR==1{
    if (NR>1) printstats();
    s=min=max=$2; short=long=$1;
    last_fn=FILENAME; last_fnr=1;
}
FNR>1{
    s=s+$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1};
    last_fnr++;
}
END{printstats()}
You can call it in a similar way:
awk -f analyze_genome.awk file_1 file_2 ...
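As a general remark, using shell loops to process text files is discouraged because it is rather inefficient; awk and similar tools can perform nearly all text-processing tasks, and many statistical computations, much faster.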
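Since the question also asks for something more generic, here is a rough sketch of how the same single-awk approach could take the column number as a parameter. The function name column_stats, the col variable, and the shortened labels are assumptions made for illustration; they are not part of the scripts above. It avoids ENDFILE, so it should run under any POSIX awk, and it assumes non-empty, whitespace-separated input files.

```shell
#!/usr/bin/env bash
# column_stats: hypothetical helper (name and labels are illustrative only).
# Usage: column_stats COL FILE...   where COL is the 1-based numeric column.
column_stats() {
    local col=$1; shift
    awk -v col="$col" '
        # Print the statistics collected for the file that just ended.
        function report() {
            printf("%s\n", fn)
            printf("- Sum: %d\n", sum)
            printf("- Count: %d\n", n)
            printf("- Mean: %.1f\n", sum / n)
            printf("- Min: %s (%d)\n", minname, min)
            printf("- Max: %s (%d)\n", maxname, max)
            printf("\n")
        }
        FNR == 1 {                      # first line of each input file
            if (NR > 1) report()        # flush the previous file first
            fn = FILENAME; n = 0; sum = 0
            min = max = $col + 0; minname = maxname = $1
        }
        {                               # every line, including the first
            v = $col + 0; n++; sum += v
            if (v < min) { min = v; minname = $1 }
            if (v > max) { max = v; maxname = $1 }
        }
        END { if (NR > 0) report() }    # statistics for the last file
    ' "$@"
}
```

It could then be called as column_stats 2 Data/*.sizes, so all files are handled by one awk process instead of five per file.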
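Answer 2
You can use the GNU version of the desk calculator (dc) to generate the statistics report in a YAML-like style. Below is a heavily commented version of the dc code.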
#!/usr/bin/env bash
for VARIABLE in Data/*.sizes
do
printf '%s:\n' "$VARIABLE"
< "$VARIABLE" awk '{$1="["$1"]";sub(/^-/,"_",$2)}1' \
| dc -e "
[32adnn]si # two-spaces indent in reporting
[
lix[Genome length:] n32an lsp
lix[Chr number:] n32an lkp
lix[Chr mean length:] n32an /1.0*p
lix[Longest Chr:] n32an lM n32an lmp
lix[Smallest Chr:] n32an lN n32an lnp
q
]sR
[dsmrdsMr]s+
[dsnrdsNr]s-
[
?z0=R # report stats @ eof
lk1+sk # increment line kounter
dls+ss # update running sum
dlm<+ # update max
dln>- # update min
cz0=? # call myself recursively to read next line
]s?
[
? # read the first line
1skdss # initialize knt, sum
dsmrdsM # initialize max
sNsn # initialize min
cl?x # read next line
]sI
lIx # set the ball rolling, kinda like main()
"
done
Result:
Data/file_1.sizes:
  Genome length: 100286070
  Chr number: 7
  Chr mean length: 14326581.0
  Longest Chr: chrV 20924149
  Smallest Chr: chrM 13794
Data/file_2.sizes:
  Genome length: 12157105
  Chr number: 17
  Chr mean length: 715123.0
  Longest Chr: chrIV 1531933
  Smallest Chr: chrM 85779
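To see what dc actually receives, it can help to run the awk stage of the pipeline on its own. dc treats text in square brackets as a string and writes negative numbers with a leading underscore, which is why the chromosome name is wrapped in [ ] and a leading minus sign would be replaced by _. A small sketch, where to_dc_input is a hypothetical name for that awk stage and chrFake with its negative value is made-up input used only to show the sign rewrite:

```shell
# to_dc_input: hypothetical wrapper around the awk stage of the pipeline above.
to_dc_input() {
    awk '{ $1 = "[" $1 "]"; sub(/^-/, "_", $2) } 1'
}
# chrM is taken from file_1; chrFake is invented to demonstrate the "_" rewrite.
printf 'chrM\t13794\nchrFake\t-5\n' | to_dc_input
```

This should print [chrM] 13794 and [chrFake] _5, one per line; assigning to $1 makes awk rebuild the line with single spaces between fields.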
Answer 3
awk 'BEGIN{sum=0}{sum=sum+$2}END{print sum/NR}' filename            # mean (NR is the final line count)
awk 'BEGIN{sum=0}($2 > sum){sum=$2}END{print sum}' filename         # max
awk 'NR==1{sum=$2}($2 < sum){sum=$2}END{print sum}' filename        # min