Speeding up a Bash script that uses grep inside a while loop

Question 1

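You have (to put it mildly) reached the limits of what can reasonably be done in the shell; you should rewrite your script in a language such as AWK, Perl, or Python. A higher-level language like these avoids running multiple processes for every bit of text processing, because you can use built-in functions instead.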

Question 2

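The percentage calculation can be reduced to a single operation like this:

 echo "${even}" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub replaces every match of a pattern and returns the number of replacements it made, so it can be used to compute the percentage quickly.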
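To make the return value of gsub concrete, here is a minimal sketch that counts a single base in a sequence. The helper name count_base and the sample sequence are made up for illustration; they are not part of the original script.

```shell
# count_base BASE SEQUENCE
# Prints how many times BASE occurs in SEQUENCE.
# (count_base is a hypothetical helper, not part of the original script.)
count_base() {
    printf '%s\n' "$2" | awk -v base="$1" '{
        # gsub() edits $0 in place and returns the number of replacements,
        # so the return value is exactly the count we are after.
        n = gsub(base, "")
        print n
    }'
}

count_base A GATTACA    # prints 3
```

The replacement text is empty, so the line itself is destroyed in the process; that is fine here because only the count is needed.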

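You can also handle both the odd and the even lines in awk. It is not clear what you are doing with the odd lines, but your complete functionality can be put into a single awk: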

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    y=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
    }
 }' large_file

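Edit: changed the if block for odd lines as per the OP's request. gsub will remove the "cov." from the number. With the shell variable $Y passed to awk, we can now divide and print in the desired format.

Using a single awk script instead of multiple operations will speed things up significantly.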
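For trying the single-awk approach without the real data, the following sketch wraps it in a function and adds a guard for empty sequence lines. The function name summarize_pairs, the header layout (six "_"-separated fields with the coverage in the sixth), and the two-line sample are all assumptions made for illustration; Y is assumed to be non-zero.

```shell
# summarize_pairs Y
# Reads header/sequence line pairs on stdin and prints coverage and GC content.
# Assumption: headers look like NODE_1_length_8_cov_12.5, i.e. six
# "_"-separated fields with the coverage value in the sixth one.
summarize_pairs() {
    awk -F '_' -v Y="$1" '
    NR % 2 == 1 {
        # Odd line: header. Print the first five fields, then coverage / Y.
        printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n", $1, $2, $3, $4, $5, ($6 / Y)
        next
    }
    {
        # Even line: sequence. gsub() returns how many characters it removed.
        at = gsub(/[AT]/, "")
        gc = gsub(/[GC]/, "")
        if (at + gc > 0)
            printf "GC_CONT : %.2f%%\n", (gc * 100) / (at + gc)
        else
            print "GC_CONT : n/a"    # avoid dividing by zero on empty lines
    }'
}

# Made-up sample, only to show the shape of the output.
printf '%s\n' 'NODE_1_length_8_cov_12.5' 'GGCCATAT' | summarize_pairs 5
```

On a real file the call would be something like summarize_pairs "$Y" < large_file, which still starts only one awk process for the whole input.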

Question 3

If your program isn't parallelized (much), the number of cores doesn't matter.

You could use wc and tr instead of sed and grep, which might speed things up:

ACOUNT=$(echo "${even}" | tr -cd 'A' | wc -m)
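The same tr and wc idea extends to the whole GC calculation. This is a rough sketch, assuming the sequence only needs an integer percentage; gc_percent_tr is a hypothetical helper name, not something from the original script.

```shell
# gc_percent_tr SEQUENCE
# Prints the integer GC percentage of SEQUENCE using only tr and wc.
# (gc_percent_tr is a hypothetical helper name.)
gc_percent_tr() {
    # tr -c complements the set and -d deletes, so only the listed
    # characters survive; wc -c then counts them. The outer $(( )) strips
    # any padding some wc implementations add around the number.
    gc=$(( $(printf '%s' "$1" | tr -cd 'GC' | wc -c) ))
    total=$(( $(printf '%s' "$1" | tr -cd 'ACGT' | wc -c) ))
    if [ "$total" -gt 0 ]; then
        echo $(( 100 * gc / total ))    # integer division, truncates
    else
        echo 0
    fi
}

gc_percent_tr GGCCATAT    # prints 50
```

This still starts several processes per sequence, so on a large file it only reduces the per-line cost; it does not remove it.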

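But really, I think the main problem is that the shell, while easy to program for quick-and-dirty jobs, is not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language such as Perl or Python, which also have threading capabilities (and so allow you to use all your cores).

You could do it in Perl somewhat like this: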

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...;                              # calculate your Y value here
while(my $odd = <ARGV>) {                 # Read a line from the file(s) passed
                                          # on the command line
    chomp $odd;                           # lose the newline
    my @split = split /_/, $odd;          # split the read line on a "_" boundary
                                          # into an array
    print join("_", @split[0..3]) . "\n"; # print the first four elements of the
                                          # array, separated by "_"
    print $split[$#split] / $y . "\n";    # Treat the final element of the
                                          # @split array as a number, divide it
                                          # by $y, and output the result
    my %charcount = (                     # Initialize a hash table
        A => 0,
        G => 0,
        C => 0,
        T => 0
    );
    my $even = <ARGV>;                    # read the even line
    chomp $even;
    foreach my $char(split //,$even) {    # split the string into separate
                                          # characters, and loop over them
        $charcount{$char}++;              # Count the correct character
    }
    my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
    my $gc = $charcount{G} + $charcount{C};
    my $perc = $gc / $total;
    print "GC_CONT: $perc\n";             # Do our final calculations and
                                          # output the result
}

Note: untested (beyond "does perl accept this code").

If you want to learn more about Perl, run perldoc perlintro and get started ;-)

Question 4

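You are reading a long file line by line and running several commands on every iteration. The main problems you face are the latency of running those computations and of reading the file in very small chunks at a time.

Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache the file contents and run string operations more efficiently.

If you want to rule out storage and filesystem performance, you can load the file from RAM with the following commands: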

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /mnt/tmpfs
# <script> /mnt/tmpfs/<input_file>

This should make the process faster to the extent that it is limited by I/O, but it will never be as fast as it could be if rewritten in C, Ruby, or Python.
