How can I break up a file (like split, but to stdout) to pipe to a command?

Question 1

I think the simplest way is:

while IFS= read -r line; do
  { printf '%s\n' "$line"; head -n 99; } |
  other_commands
done <database_file

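You need to use read for the first line of each section, because there appears to be no other way to stop when the end of the file is reached. For more information, see: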
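
As a rough sketch of the loop above, the following wraps it in a function and feeds each chunk to wc -l. The name chunk_pipe and the temporary file are made up for illustration. It assumes the input is a regular (seekable) file and that head leaves the file offset just past the last line it printed, as GNU head does; on a pipe, head may read ahead and swallow lines that belong to later chunks.

```shell
# Hypothetical helper: run a command once per chunk of N lines.
# usage: chunk_pipe N FILE COMMAND [ARGS...]
chunk_pipe() {
    chunk_size=$1 chunk_file=$2
    shift 2
    # read takes the first line of each chunk and ends the loop at EOF;
    # head then passes along the remaining N-1 lines of that chunk.
    while IFS= read -r first; do
        { printf '%s\n' "$first"; head -n "$((chunk_size - 1))"; } | "$@"
    done <"$chunk_file"
}

# Demo: 10 lines in chunks of 4, each chunk counted by wc -l.
tmp=$(mktemp)
seq 10 >"$tmp"
chunk_pipe 4 "$tmp" wc -l
rm -f "$tmp"
```

With GNU tools this should report 4, 4 and 2 lines, one count per chunk.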

Question 2

Basically, I am looking for a split that outputs to stdout instead of files.

If you have access to GNU split, the --filter option does exactly that:

‘--filter=command’

    With this option, rather than simply writing to each output file, write
    through a pipe to the specified shell command for each output file.

So, in your case, you could use those commands together with --filter, e.g.

split -l 100 --filter='{ cat Header.sql; cat; } | sqlcmd; printf %s\\n DONE' infile

or write a script, e.g. myscript:

#!/bin/sh

{ cat Header.sql; cat; } | sqlcmd
printf %s\\n '--- PROCESSED ---'

and then simply run

split -l 100 --filter=./myscript infile
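
To try the --filter approach without a database at hand, here is a small sketch. The function name split_with_header and the HEADER variable are invented for this example, and a plain cat stands in for sqlcmd. It relies on GNU split running the filter through the shell with the caller's environment, so the exported HEADER is visible inside the filter.

```shell
# Hypothetical wrapper around GNU split --filter.
# usage: split_with_header HEADER_FILE LINES_PER_CHUNK INPUT_FILE
split_with_header() {
    header=$1 lines=$2 input=$3
    # Each chunk arrives on the filter's stdin; the header is prepended and
    # the result is piped to `cat`, a stand-in for sqlcmd.
    HEADER=$header split -l "$lines" \
        --filter='{ cat "$HEADER"; cat; } | cat; printf "%s\n" DONE' "$input"
}

# Demo: five input lines, two per chunk.
workdir=$(mktemp -d)
printf '%s\n' '-- header' >"$workdir/header.txt"
seq 5 >"$workdir/infile"
split_with_header "$workdir/header.txt" 2 "$workdir/infile"
rm -rf "$workdir"
```

Each chunk should come out as the header line, the chunk's own lines, and then DONE.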

Question 3

_linc() ( ${sh-da}sh ${dbg+-vx} 4<&0 <&3 ) 3<<-ARGS 3<<\CMD
        set -- $( [ $((i=${1%%*[!0-9]*}-1)) -gt 1 ] && {
                shift && echo "\${inc=$i}" ; }
        unset cmd ; [ $# -gt 0 ] || cmd='echo incr "#$((i=i+1))" ; cat'
        printf '%s ' 'me=$$ ;' \
        '_cmd() {' '${dbg+set -vx ;}' "$@" "$cmd" '
        }' )
        ARGS
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
CMD

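The function above uses sed to apply its argument list as a command string at an arbitrary line increment. The commands you specify on the command line are sourced into a temporary shell function, which is fed a here-document on stdin consisting of each increment's worth of lines.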

You use it like this:

time printf 'this is line #%d\n' `seq 1000` |
_linc 193 sed -e \$= -e r \- \| tail -n2
    #output
193
this is line #193
193
this is line #386
193
this is line #579
193
this is line #772
193
this is line #965
35
this is line #1000
printf 'this is line #%d\n' `seq 1000`  0.00s user 0.00s system 0% cpu 0.004 total

The mechanism here is pretty simple:

i_cmd <<"${s:=${me}SPLIT${me}}"
${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
a$s

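That is the sed script. Basically we just printf $increment * n;. So if you set the increment to 100, printf writes a sed script made of lines that say only $!n (99 of them, since the function subtracts one from the increment and the line that ends each sed cycle is printed automatically), plus one insert line for the top of the here-document and one append line for its bottom line, and that's it. Most of the rest just handles options.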

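The next command, n, tells sed to print the current line, delete it, and pull in the next one. The $! address means it should attempt this on any line except the last.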
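
To make the generated script easier to see, here is a simplified sketch that builds it with a plain loop instead of the printf trick. The names make_wrap_script and EOF_CHUNK are made up for illustration (the original uses a PID-based delimiter), and the one-line i and a forms assume GNU sed.

```shell
# Hypothetical generator for a sed script that wraps every N input lines
# in a here-document addressed to a function named _cmd.
# usage: make_wrap_script N DELIMITER
make_wrap_script() {
    wrap_lines=$1 wrap_delim=$2
    # Insert the here-document opener before the first line of each chunk.
    printf 'i_cmd <<"%s"\n' "$wrap_delim"
    # One `$!n` per remaining line: print the current line and load the
    # next one, unless the current line is the last line of input.
    wrap_i=1
    while [ "$wrap_i" -lt "$wrap_lines" ]; do
        printf '%s\n' '$!n'
        wrap_i=$((wrap_i + 1))
    done
    # Append the closing delimiter after the chunk's final line.
    printf 'a%s\n' "$wrap_delim"
}

# Show the script for chunks of 3, then what sed emits when it is applied.
make_wrap_script 3 EOF_CHUNK
seq 7 | sed "$(make_wrap_script 3 EOF_CHUNK)"
```

The second command only prints the wrapped text; nothing is executed until that text is handed to a shell.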

Given only an incrementer, it will:

printf 'this is line #%d\n' `seq 10` |
_linc 3
    #output
incr #1
this is line #1
this is line #2
this is line #3
incr #2
this is line #4
this is line #5
this is line #6
incr #3
this is line #7
this is line #8
this is line #9
incr #4
this is line #10

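So what happens behind the scenes is that, when no command string is given, the function is set to echo a counter and cat its input. If you saw it on the command line, it would look like: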

{ echo "incr #$((i=i+1))" ; cat ; } <<HEREDOC
this is line #7
this is line #8
this is line #9
HEREDOC

It executes one of these per increment. Look:

printf 'this is line #%d\n' `seq 10` |
dbg= _linc 3
    #output
set -- ${inc=2}
+ set -- 2
me=$$ ; _cmd() { ${dbg+set -vx ;} echo incr "#$((i=i+1))" ; cat
}
+ me=19396
        s= ; sed -f - <<-INC /dev/fd/4 | . /dev/stdin
                i_cmd <<"${s:=${me}SPLIT${me}}"
                ${inc:+$(printf '$!n\n%.0b' `seq $inc`)}
                a$s
        INC
+ s=
+ . /dev/stdin
+ seq 2
+ printf $!n\n%.0b 1 2
+ sed -f - /dev/fd/4
_cmd <<"19396SPLIT19396"
this is line #1
this is line #2
this is line #3
19396SPLIT19396
+ _cmd
+ set -vx ; echo incr #1
+ cat
this is line #1
this is line #2
this is line #3
_cmd <<"19396SPLIT19396"
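
The trace is dense, so here is a stripped-down sketch of the same idea with the chunk size fixed at 3. It is not the function above: run_chunks and END_OF_CHUNK are invented names, the option handling is gone, and GNU sed's one-line i and a forms are assumed. A data line identical to the delimiter would end a here-document early, which is presumably why the original builds its delimiter from the shell's PID.

```shell
# Hypothetical, simplified take on the technique: define a function, then
# let sed wrap every 3 input lines in a here-document that calls it.
# usage: ... | run_chunks 'SHELL CODE FOR THE FUNCTION BODY'
run_chunks() {
    {
        # First emit the function definition built from the argument.
        printf '_cmd() { %s\n}\n' "$1"
        # Then emit one `_cmd <<"END_OF_CHUNK" ... END_OF_CHUNK` per chunk.
        sed -e 'i_cmd <<"END_OF_CHUNK"' -e '$!n' -e '$!n' -e 'aEND_OF_CHUNK'
    } | sh   # the generated text is executed as a shell script
}

# Demo: the default behaviour described above (a counter, then the chunk).
seq 7 | run_chunks 'echo "incr #$((i=i+1))"; cat'
```

Because _cmd runs in the same shell for every chunk, the counter i carries over from one call to the next.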

It's really fast:

time yes | sed = | sed -n 'p;n' |
_linc 4000 'printf "current line and char count\n"
    sed "1w /dev/fd/2" | wc -c
    [ $((i=i+1)) -ge 5000 ] && kill "$me" || echo "$i"'

    #OUTPUT

current line and char count
19992001
36000
4999
current line and char count
19996001
36000
current line and char count
[2]    17113 terminated  yes |
       17114 terminated  sed = |
       17115 terminated  sed -n 'p;n'
yes  0.86s user 0.06s system 5% cpu 16.994 total
sed =  9.06s user 0.30s system 55% cpu 16.993 total
sed -n 'p;n'  7.68s user 0.38s system 47% cpu 16.992 total

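Above I tell it to increment every 4000 lines. 17 seconds later I have processed 20 million lines. Of course the logic there isn't serious: we only read each line twice and count all of its characters, but the possibilities are pretty open. Also, if you look closely, you might notice that it is seemingly the filters providing the input that take most of the time anyway.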
