Improving a bash script that computes some basic statistics on files

I have some files like this:

  • file_1
    chrV    20924149
    chrX    17718866
    chrIV   17493793
    chrII   15279345
    chrI    15072423
    chrIII  13783700
    chrM    13794
    
  • file_2
    chrI    230218
    chrII   813184
    chrIII  316620
    chrIV   1531933
    chrIX   439888
    chrM    85779
    chrV    576874
    chrVI   270161
    chrVII  1090940
    chrVIII 562643
    chrX    745751
    chrXI   666816
    chrXII  1078177
    chrXIII 924431
    chrXIV  784333
    chrXV   1091291
    chrXVI  948066
    

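I need to get the mean and the total of column 2, as well as the maximum and minimum values in each file. I picked up some ideas from Stack Overflow and put together this ugly bash script.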

#!/usr/bin/env bash


for VARIABLE in Data/*.sizes
do
    echo $VARIABLE
    echo  'Genome length:'
    awk -F '\t' '{ sum += $2 } END { print sum }' $VARIABLE
    echo 'Chr number:'
    awk -F '\t' '{ NR $1 } END { print NR }' $VARIABLE
    echo 'Chr mean length:'
    awk -F '\t' '{ total += $2 } END { print total/NR }' $VARIABLE
    echo 'Longest Chr:'
    awk -v max=0 '{if($2>max){want=$1" "$2; max=$2}}END{print want}' $VARIABLE
    echo 'Smallest Chr:'
    awk 'NR == 1 || $2 < min {line = $1; min = $2}END{print line " " min}' $VARIABLE
    echo " "
done

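It works, but I would welcome any better ideas; perhaps there is a way to make it more general, since this sometimes has to be done for several similar files.

I would appreciate any comments, as I do not usually use awk and bash.

This is what I get printed: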

Data/file_1
Genome length:
100286070
Chr number:
7
Chr mean length:
1.43266e+07
Longest Chr:
chrV 20924149
Smallest Chr:
chrM 13794

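Answer 1

The following awk scripts will do the task. I have written them as explicit awk program files because of their length, which is mostly due to the function that prints the analysis results; the actual calculation is rather short.

If you have GNU awk, which supports the ENDFILE condition:

Program file (let's call it analyze_genome_g.awk):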

#!/usr/bin/gawk -f

# Begin of file, characterized by FNR, the per-file line-counter, being 1.
# Initialize statistics: set sum, min, and max to first chromosome length
# and name of longest/shortest ('long'/'short') to first chromosome name.
FNR==1{s=min=max=$2; short=long=$1}

# All other lines: Update sum, min, and max lengths
FNR>1{s=s+$2;if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1}}

# End-of-file (GNU awk feature!): Print statistics
ENDFILE{
    printf("%s\n",FILENAME);
    printf("- Genome length          : %d\n",s);
    printf("- Nr. of chromosomes     : %d\n",FNR);
    printf("- Mean chromosome length : %.1f\n",s/FNR);
    printf("- Shortest chromosome    : %s (length=%d)\n",short,min);
    printf("- Longest chromosome     : %s (length=%d)\n",long,max);
    printf("\n");
}

You can call it as

gawk -f analyze_genome_g.awk file_1 file_2 ...

Output:

file_1
- Genome length          : 100286070
- Nr. of chromosomes     : 7
- Mean chromosome length : 14326581.4
- Shortest chromosome    : chrM (length=13794)
- Longest chromosome     : chrV (length=20924149)

file_2
- Genome length          : 12157105
- Nr. of chromosomes     : 17
- Mean chromosome length : 715123.8
- Shortest chromosome    : chrM (length=85779)
- Longest chromosome     : chrIV (length=1531933)
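
A side note on the mean: the script in the question printed it as 1.43266e+07 because awk's print statement converts non-integer numbers using OFMT, which defaults to "%.6g". The printf call with "%.1f" in the program above avoids that. A minimal sketch of the difference, using the file_1 totals shown above (the shell variable names are chosen only for illustration):

```shell
# Genome length and chromosome count of file_1, taken from the output above.
total=100286070
count=7

# print converts non-integer values with OFMT (default "%.6g").
mean_default=$(awk -v s="$total" -v n="$count" 'BEGIN { print s / n }')

# printf with an explicit format keeps one decimal place.
mean_fixed=$(awk -v s="$total" -v n="$count" 'BEGIN { printf "%.1f\n", s / n }')

echo "print : $mean_default"
echo "printf: $mean_fixed"
```

The first value should come out in scientific notation, as in the question, and the second as a plain decimal number.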

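Other awk variants:

If your awk does not know the ENDFILE condition, you need a workaround: essentially, save the file properties in temporary variables and print the statistics either at the start of a new file (for the previous file) or in the END block once the last file has been processed.

To make this more convenient, we define a function printstats() that performs the output.

Program file (analyze_genome.awk):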

#!/usr/bin/awk -f
function printstats()
{
    printf("%s\n",last_fn);
    printf("- Genome length          : %d\n",s);
    printf("- Nr. of chromosomes     : %d\n",last_fnr);
    printf("- Mean chromosome length : %.1f\n",s/last_fnr);
    printf("- Shortest chromosome    : %s (length=%d)\n",short,min);
    printf("- Longest chromosome     : %s (length=%d)\n",long,max);
    printf("\n");
}

# Begin of file
# FNR==1 always works, but now we have to save file properties, too.
# If it is _not_ the first file (NR, the global line counter, is larger than
# FNR, the per-file line-counter), print statistics (of the previous file).
FNR==1{
    if (NR>1) printstats();
    s=min=max=$2; short=long=$1;
    last_fn=FILENAME; last_fnr=1;
}


FNR>1{
    s=s+$2; if (min>$2) {min=$2; short=$1}; if (max<$2) {max=$2; long=$1};
    last_fnr++;
}

END{printstats()}

You can call it in a similar way:

awk -f analyze_genome.awk file_1 file_2 ...

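As a general note, using shell loops to process text files is discouraged because it is rather inefficient; awk and similar tools can perform nearly all text-processing tasks and many statistical computations much faster.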
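
As a rough sketch of that point, the loop from the question could be replaced by a single awk call that receives all files at once. The sample data, the temporary directory, and the col variable (which selects the column to summarise, making the script a little more general) are assumptions for illustration and not part of the original answer:

```shell
# Create two small sample files in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf 'chrA\t10\nchrB\t30\n' > "$tmpdir/a.sizes"
printf 'chrC\t5\nchrD\t7\nchrE\t9\n' > "$tmpdir/b.sizes"

# One awk process handles every file matched by the glob.
# col is a hypothetical variable naming the column to summarise.
report=$(awk -F '\t' -v col=2 '
    FNR == 1 && NR > 1 { printf "%s\t%d\t%.1f\n", fn, sum, sum / n }  # flush previous file
    FNR == 1           { fn = FILENAME; sum = 0; n = 0 }              # start a new file
                       { sum += $col; n++ }                           # accumulate every line
    END                { if (n) printf "%s\t%d\t%.1f\n", fn, sum, sum / n }
' "$tmpdir"/*.sizes)

echo "$report"
rm -rf "$tmpdir"
```

Each output line holds the file name, the column total, and the column mean, separated by tabs. Passing a different value with -v col=... would summarise another column without editing the program.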

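Answer 2

You can use dc, the desk calculator (GNU version), to generate the statistics report in a YAML-like layout. Below is a heavily commented version of the dc code.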

#!/usr/bin/env bash
for VARIABLE in Data/*.sizes
do
    printf '%s:\n' "$VARIABLE" 
< "$VARIABLE" awk '{$1="["$1"]";sub(/^-/,"_",$2)}1' \
| dc -e "
[32adnn]si  # two-spaces indent in reporting
[
lix[Genome length:]   n32an lsp
lix[Chr number:]      n32an lkp
lix[Chr mean length:] n32an /1.0*p
lix[Longest Chr:]     n32an lM     n32an lmp
lix[Smallest Chr:]    n32an lN     n32an lnp
q
]sR
[dsmrdsMr]s+
[dsnrdsNr]s-
[
?z0=R  # report stats @ eof
lk1+sk # increment line kounter
dls+ss # update running sum
dlm<+  # update max
dln>-  # update min
cz0=?  # call myself recursively to read next line 
]s?
[
?       # read the first line
1skdss  # initialize knt, sum
dsmrdsM # initialize max
sNsn    # initialize min
cl?x    # read next line
]sI
lIx     # set the ball rolling, kinda like main()
"
done

Result:

Data/file_1.sizes:
  Genome length: 100286070
  Chr number: 7
  Chr mean length: 14326581.0
  Longest Chr: chrV 20924149
  Smallest Chr: chrM 13794

Data/file_2.sizes:
  Genome length: 12157105
  Chr number: 17
  Chr mean length: 715123.0
  Longest Chr: chrIV 1531933
  Smallest Chr: chrM 85779

Answer 3

Mean:

awk 'BEGIN{sum=0}{sum=sum+$2}END{print sum/NR}' filename

Max:

awk 'BEGIN{sum=0}($2 > sum){sum=$2}END{print sum}' filename

Min:

awk 'NR==1{sum=$2}($2 < sum){sum=$2}END{print sum}' filename