Split a large plain-text file into smaller files at the last space before every 2000th character

Answer 1

Something like this, maybe:

process() {
  do-what-you-have-to-do-with-the-chunk $1
}
chunk_size=2000
file=file1.txt

set -o extendedglob # needed for (#cmin,max), ## and (#b)

# contents of the file trimmed of leading and trailing whitespace
text=${${$(<$file)%%[[:space:]]##}##[[:space:]]##}

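while (( $#text > chunk_size )); do
  # (#b) stores the parenthesised groups in $match:
  #   match[1] = at most chunk_size characters ending in a non-whitespace character
  #   match[2] = the rest of the text after the run of whitespace that follows
  if [[ $text = (#b)(?(#c0,$((chunk_size - 1)))[^[:space:]])[[:space:]]##(*) ]]; then
    process $match[1]
    text=$match[2]
  else
    # no whitespace within the first chunk_size + 1 characters
    print -ru2 Text cannot be split
    exit 1
  fi
done
if [[ -n $text ]]; then
  # last chunk
  process $text
fi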

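Note that the length is counted in characters, not bytes. You can run set +o multibyte to count in bytes instead, but that also means multibyte whitespace characters would no longer be recognised. In my British locale, most whitespace characters are encoded on more than one byte, but those are not the most commonly used ones. They are: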

           09 U+0009 CHARACTER TABULATION
           0A U+000A LINE FEED
           0B U+000B LINE TABULATION
           0C U+000C FORM FEED
           0D U+000D CARRIAGE RETURN
           20 U+0020 SPACE
     E1 9A 80 U+1680 OGHAM SPACE MARK
     E2 80 80 U+2000 EN QUAD
     E2 80 81 U+2001 EM QUAD
     E2 80 82 U+2002 EN SPACE
     E2 80 83 U+2003 EM SPACE
     E2 80 84 U+2004 THREE-PER-EM SPACE
     E2 80 85 U+2005 FOUR-PER-EM SPACE
     E2 80 86 U+2006 SIX-PER-EM SPACE
     E2 80 88 U+2008 PUNCTUATION SPACE
     E2 80 89 U+2009 THIN SPACE
     E2 80 8A U+200A HAIR SPACE
     E2 80 A8 U+2028 LINE SEPARATOR
     E2 80 A9 U+2029 PARAGRAPH SEPARATOR
     E2 81 9F U+205F MEDIUM MATHEMATICAL SPACE
     E3 80 80 U+3000 IDEOGRAPHIC SPACE
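
To make the character/byte distinction concrete, here is a minimal sketch that measures one string both ways. The sample string is made up for illustration: it holds an EM SPACE (U+2003) written as the three bytes listed in the table. The snippet is written so that it should run in zsh as well as bash. In a UTF-8 locale with zsh's default multibyte option the character count should come out as 3, while the byte count is 5 regardless of locale.

```shell
# Hypothetical sample: "a", EM SPACE (U+2003) as its UTF-8 bytes E2 80 83, "b".
s=$'a\xe2\x80\x83b'

# Character count: depends on the locale and, in zsh, on the multibyte option.
char_count=${#s}

# Byte count: wc -c counts bytes whatever the locale is.
# The arithmetic expansion strips any padding some wc implementations add.
byte_count=$(( $(printf '%s' "$s" | wc -c) ))

printf 'characters: %d, bytes: %d\n' "$char_count" "$byte_count"
```

If the two numbers differ for your input, a 2000-character chunk can be larger than 2000 bytes on disk.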

Answer 2

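Shell languages are quite good at splitting things into words, as long as you don't need to preserve the exact whitespace of the input (for example, runs of spaces may be collapsed into a single space).

Looking ahead makes the processing a bit easier: for each word, if it still fits in the current file, add it; otherwise move on to the next file. Line wrapping works the same way: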

#!/usr/bin/env zsh

inFile=$1
fileAsWords=($(<$inFile))

outfileNum=0
outputText=
sep=
lineLen=0
eol=$'\n'
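for word in $fileAsWords; do
    # Flush the current chunk if this word, plus the separator and the final
    # newline added by print, would take the file past 2000 characters.
    if (( ${#outputText} + ${#sep} + ${#word} + ${#eol} > 2000 )); then
        # printf evaluates its %d argument arithmetically, so the counter advances here
        printf -v outfileName out-%04d.txt outfileNum++
        print -r -- $outputText > $outfileName
        outputText=
        sep=
        lineLen=0
    fi
    # Start a new line if this word would take the current line past 80 characters.
    if (( lineLen > 0 && lineLen + ${#sep} + ${#word} > 80 )); then
        sep=${eol}
        lineLen=0
    fi
    outputText+=${sep}${word}
    (( lineLen += ${#sep} + ${#word} ))
    sep=' '
done
# Write whatever is left as the last file.
if (( ${#outputText} > 0 )); then
    printf -v outfileName out-%04d.txt outfileNum
    print -r -- $outputText > $outfileName
fi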

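This may still split some items, such as URLs, across files if those items can contain embedded whitespace. The set of characters used for splitting can be changed by setting IFS (the internal field separator) before creating the array of words.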
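
As a rough illustration of the IFS remark, the sketch below splits the same text twice: once with the current IFS (assumed to be the default of space, tab and newline) and once with IFS set to a newline only. The sample text and the variable names are made up for the example, and printf stands in for reading a real file with $(<$inFile). It is written so that it should run in zsh as well as bash.

```shell
# Hypothetical sample: three lines, the middle one containing a space.
sample=$'alpha\nhttp://example.com/a b\nomega'

# Default IFS: the middle line is split into two words.
default_words=( $(printf '%s\n' "$sample") )

# Newline-only IFS: each input line stays a single array element.
old_ifs=$IFS
IFS=$'\n'
line_words=( $(printf '%s\n' "$sample") )
IFS=$old_ifs

printf 'default IFS: %d items\n' "${#default_words[@]}"   # 4
printf 'newline IFS: %d items\n' "${#line_words[@]}"      # 3
```

With a newline-only IFS the script above would treat whole lines as its unit of packing, so a line longer than the chunk size could no longer be broken up.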
