Question 1

I'd suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To stop reading after the last requested line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
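The hard-coded line numbers can also be passed in as awk variables. The following is only a minimal sketch and not part of the original answer; the function name print_lines_awk and the variable names first and last are made up for illustration.

```shell
# Hypothetical helper: print lines FIRST..LAST (inclusive, 1-based) of FILE.
# Usage: print_lines_awk FIRST LAST FILE
print_lines_awk() {
    awk -v first="$1" -v last="$2" '
        NR < first { next }     # skip everything before the range
        { print }               # inside the range: print the line
        NR == last { exit }     # stop reading once the last line is printed
    ' "$3"
}
```

Calling print_lines_awk 57890000 57890010 /path/to/file should behave like the second awk command above.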

Speed test (here on macOS, YMMV on other systems):

100,000,000-line file generated by seq 100000000 > test.in
Reading lines 50,000,000-50,000,010
Tests in no particular order
real time as reported by bash's builtin time

 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

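These are by no means precise benchmarks, but the difference is clear and repeatable* enough to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.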

Question 2

If you want lines X to Y inclusive (starting the numbering at 1), use

tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"

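tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.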
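As a rough sketch of how this pipeline could be wrapped for reuse (the function name lines_between and its argument order are assumptions, not something from the answer):

```shell
# Hypothetical helper: print lines X..Y (inclusive, 1-based) of FILE.
# Usage: lines_between X Y FILE
lines_between() {
    # tail skips the first X-1 lines; head keeps Y-X+1 lines and then exits,
    # after which tail is terminated by SIGPIPE on its next write.
    tail -n "+$1" "$3" | head -n "$(( $2 - $1 + 1 ))"
}
```

For example, lines_between 57890000 57890010 /path/to/file would print 11 lines.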

Alternatively, as gorkypl suggested, use sed:

sed -n -e "$X,$Y p" -e "$Y q" /path/to/file

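The sed solution is significantly slower, though (at least for GNU utilities and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux; the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs, and the machine is otherwise idle and not swapping.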

real  user  sys    command
 0.47  0.32  0.12  </tmp/a tail -n +50000001 | head -n 10 #GNU
 0.86  0.64  0.21  </tmp/a tail -n +50000001 | head -n 10 #BusyBox
 3.57  3.41  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
 1.04  0.60  0.46  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
 7.12  6.58  0.55  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
 9.95  9.54  0.28  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13  0.31  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox

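If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x (inclusive) to y (exclusive), counting from 0, with a block size of b:

dd bs="$b" skip="$x" count="$((y-x))" </path/to/file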
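A small sketch of the byte-range extraction as a function, under the assumption that the input is a regular file; the name extract_blocks is hypothetical. It uses dd's skip= operand, which offsets into the input (seek= offsets into the output instead).

```shell
# Hypothetical helper: print blocks X (inclusive) to Y (exclusive) of FILE,
# counting blocks of B bytes from 0.
# Usage: extract_blocks B X Y FILE
extract_blocks() {
    # dd reports its transfer statistics on stderr, which is discarded here.
    dd bs="$1" skip="$2" count="$(( $3 - $2 ))" < "$4" 2>/dev/null
}
```

With b=1 this addresses individual bytes, so extract_blocks 1 10 20 /path/to/file would print bytes 10 through 19.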

Question 3

The head | tail approach is one of the best and most "idiomatic" ways to do this:

X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"

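As pointed out by Gilles in the comments, a faster way is

< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"

The reason this is faster is that the first X-1 lines don't need to go through the pipe, unlike with the head | tail approach.

Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings about this approach.

You say you have to calculate A, B, C, D, but as you can see, the line count of the file is not needed and at most 1 calculation is necessary, which the shell can do for you anyway.
You worry that piping will read more lines than necessary. In fact this is not true: tail | head is about as efficient as you can get in terms of file I/O. First, consider the minimum amount of work necessary: to find the X'th line in a file, the only general way to do it is to read every byte and stop when you have counted X newline characters, as there is no way to predict the file offset of the X'th line. Once you reach the X'th line, you have to read all the lines in order to print them, stopping at the Y'th line. Thus no approach can get away with reading fewer than Y lines. Now, head -n $Y reads no more than Y lines (rounded up to the nearest buffer unit, but buffers improve performance when used correctly, so there is no need to worry about that overhead). In addition, tail will not read any more than head, so we have shown that head | tail reads the fewest lines possible (again, plus some negligible buffering which we are ignoring). The only efficiency advantage of a single-tool approach that does not use pipes is fewer processes (and thus less overhead).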
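To convince yourself that the two pipelines select the same lines and differ only in how much data crosses the pipe, a quick check along these lines may help. This is a sketch; same_range is a made-up name, and the inclusive range is taken to be Y - X + 1 lines long.

```shell
# Hypothetical check: succeed if both pipelines print the same lines X..Y of FILE.
# Usage: same_range X Y FILE
same_range() {
    a=$(head -n "$2" "$3" | tail -n "+$1")                  # head | tail
    b=$(tail -n "+$1" "$3" | head -n "$(( $2 - $1 + 1 ))")  # tail | head
    [ "$a" = "$b" ]
}
```

For instance, same_range 57890000 57890010 infile.txt && echo same would be expected to print "same".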

Question 4

If we know the range to select, from the first line, lStart, to the last line, lEnd, we can calculate:

lCount="$((lEnd-lStart+1))"

If we know the total number of lines, lAll, we can also calculate the distance to the end of the file:

toEnd="$((lAll-lStart+1))"

Then we will know both:

"how far from the start"            ($lStart) and
"how far from the end of the file"  ($toEnd).

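Choosing the smaller of the two as tailnumber, like this:

tailnumber="$toEnd"; (( toEnd > lStart )) && tailnumber="+$lStart"

allows us to use the consistently fastest command:

tail -n"${tailnumber}" "${thefile}" | head -n"${lCount}"

Please note the additional plus ("+") sign when $lStart is selected.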

The only caveat is that we need the total line count, which may take some additional time to find.
As usual, with:

lAll="$(wc -l < "$thefile" )"
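Putting the pieces together, the selection could be sketched as a single function. This is an illustration only: the name fast_range is an assumption, the comparison is written with test instead of (( )) so it does not depend on bash, and wc -l is assumed to count every line (that is, the file ends with a newline).

```shell
# Hypothetical helper: print lines lStart..lEnd (inclusive, 1-based) of FILE,
# counting from whichever end of the file is closer.
# Usage: fast_range LSTART LEND FILE
fast_range() (
    lStart=$1 lEnd=$2 thefile=$3
    lCount=$(( lEnd - lStart + 1 ))
    lAll=$(( $(wc -l < "$thefile") ))     # total line count; costs one full read
    toEnd=$(( lAll - lStart + 1 ))        # lines from lStart to the end of the file
    if [ "$toEnd" -gt "$lStart" ]; then
        tailnumber="+$lStart"             # closer to the start: count from the top
    else
        tailnumber="$toEnd"               # closer to the end: count from the bottom
    fi
    tail -n "$tailnumber" "$thefile" | head -n "$lCount"
)
```

The body runs in a subshell, so the variables do not leak into the calling shell.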

Some measured times:

lStart |500| lEnd |500| lCount |1|
real   user   sys    frac
0.002  0.000  0.000  0.00  | command == tail -n"+500" test.in | head -n1
0.002  0.000  0.000  0.00  | command == tail -n+500 test.in | head -n1
3.230  2.520  0.700  99.68 | command == tail -n99999501 test.in | head -n1
0.001  0.000  0.000  0.00  | command == head -n500 test.in | tail -n1
0.001  0.000  0.000  0.00  | command == sed -n -e "500,500p;500q" test.in
0.002  0.000  0.000  0.00  | command == awk 'NR<'500'{next}1;NR=='500'{exit}' test.in


lStart |50000000| lEnd |50000010| lCount |11|
real    user    sys    frac
0.977   0.644   0.328  99.50  | command == tail -n"+50000000" test.in | head -n11
1.069   0.756   0.308  99.58  | command == tail -n+50000000 test.in | head -n11
1.823   1.512   0.308  99.85  | command == tail -n50000001 test.in | head -n11
1.950   2.396   1.284  188.77 | command == head -n50000010 test.in | tail -n11
5.477   5.116   0.348  99.76  | command == sed -n -e "50000000,50000010p;50000010q" test.in
10.124  9.669   0.448  99.92  | command == awk 'NR<'50000000'{next}1;NR=='50000010'{exit}' test.in


lStart |99999000| lEnd |99999010| lCount |11|
real    user    sys    frac
0.001   0.000   0.000  0.00   | command == tail -n"1001" test.in | head -n11
1.960   1.292   0.660  99.61  | command == tail -n+99999000 test.in | head -n11
0.001   0.000   0.000  0.00   | command == tail -n1001 test.in | head -n11
4.043   4.704   2.704  183.25 | command == head -n99999010 test.in | tail -n11
10.346  9.641   0.692  99.88  | command == sed -n -e "99999000,99999010p;99999010q" test.in
21.653  20.873  0.744  99.83  | command == awk 'NR<'99999000'{next}1;NR=='99999010'{exit}' test.in

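Note that the times change drastically depending on whether the selected lines are near the start or near the end of the file. A command that appears to work nicely at one end of the file may be extremely slow at the other.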
