Script

Question 1

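I wrote this bash script to do it. It builds an array holding the names of the files that go into each tar, then runs tar on all of them in parallel. It may not be the most efficient approach, but it gets the job done. I do expect it to consume a lot of memory, though.

You will need to adjust the options at the top of the script. You may also want to change the tar options cjvf on the last line (for example, drop the verbose flag v for performance, or switch the compression from j to z).

Script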

#!/bin/bash

# User configuration
#===================
files=(*.log)              # File pattern to archive, e.g. (*.txt) or (*)
num_files_per_tar=5        # Number of files per tar
num_procs=4                # Number of tar processes to run at once
tar_file_dir='/tmp'        # Directory for the tar files
tar_file_name_prefix='tar' # Prefix for tar file names
tar_file_name="$tar_file_dir/$tar_file_name_prefix"

# Main algorithm
#===============
# Number of tar files to create, rounded up so that leftover files
# still get a final, smaller tar
num_tars=$(( (${#files[@]} + num_files_per_tar - 1) / num_files_per_tar ))
tar_files=()  # each element: a tar file name followed by the files that go into it

tar_start=0   # index in files where the current tar starts
# Loop over the files, adding their names to the tar they belong to
for ((i = 0; i < num_tars; i++))
do
  tar_files[i]="$tar_file_name$i.tar.bz2 ${files[*]:tar_start:num_files_per_tar}"
  tar_start=$((tar_start + num_files_per_tar))
done

# Start tar in parallel for each of the strings we just constructed.
# xargs splits its input on whitespace, so file names must not contain spaces or newlines.
printf '%s\n' "${tar_files[@]}" | xargs -n "$((num_files_per_tar + 1))" -P "$num_procs" tar cjvf

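Explanation

First, all file names that match the selected pattern are stored in the files array. Next, the for loop slices this array and forms a string from each slice. The number of slices equals the number of tarballs wanted. The resulting strings are stored in the tar_files array. The for loop also puts the name of the resulting tarball at the beginning of each string. The elements of tar_files take the following form (assuming 5 files per tarball):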

tar_files[0]="tar0.tar.bz2  file1 file2 file3 file4 file5"
tar_files[1]="tar1.tar.bz2  file6 file7 file8 file9 file10"
...

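On the last line of the script, xargs starts multiple tar processes (up to the specified maximum), and each process handles one element of the tar_files array in parallel with the others.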
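The slicing step can be tried on its own. The following is a minimal sketch, not part of the original script: the build_groups function, the tar_groups array, and the seven sample names are made up for illustration, and it only prints the tar command lines instead of running them.

```shell
#!/bin/bash
# Hypothetical helper: split a list of names into groups of per_tar names and
# prefix each group with the tarball it would go into.
build_groups() {
  local per_tar=$1; shift
  local -a names=("$@")
  local count=$(( (${#names[@]} + per_tar - 1) / per_tar ))  # round up
  local i
  tar_groups=()                            # global result array
  for ((i = 0; i < count; i++)); do
    # ${names[*]:offset:length} takes a slice and joins it with spaces
    tar_groups[i]="tar$i.tar.bz2 ${names[*]:i*per_tar:per_tar}"
  done
}

build_groups 3 a b c d e f g
printf 'tar cjf %s\n' "${tar_groups[@]}"
```

With seven names and three per tarball, the last group holds the single leftover name, which is the case the rounded-up division is there to cover.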

Test

List of files:

$ ls

a      c      e      g      i      k      m      n      p      r      t
b      d      f      h      j      l      o      q      s

Generated tarballs:

$ ls /tmp/tar*
tar0.tar.bz2 tar1.tar.bz2 tar2.tar.bz2 tar3.tar.bz2

Question 2

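Here's another script. You can choose whether you want exactly one million files per segment, or exactly 30 segments. I've gone with the former in this script, but the split command allows either choice.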

#!/bin/bash
#
DIR="$1"        # The source of the millions of files
TARDEST="$2"    # Where the tarballs should be placed

# Create the million-file segments (one file name per line, so names
# containing newlines are not supported)
rm -f /tmp/chunk.*
find "$DIR" -type f | split -l 1000000 - /tmp/chunk.

# Create corresponding tarballs
for LIST in /tmp/chunk.*
do
    test -f "$LIST" || continue
    CHUNK="${LIST##*/}"     # list name without the directory, e.g. chunk.aa

    echo "Creating tarball for chunk '$CHUNK'" >&2
    tar -cf "$TARDEST/$CHUNK.tar" -T "$LIST"
    rm -f "$LIST"
done
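Both ways of segmenting map onto split options. This is a small sketch assuming GNU split; the list produced by seq and the scratch directory are stand-ins for the real file list. Splitting into a fixed number of pieces with -n l/N needs to know the input size, so the list is written to a file first rather than piped in.

```shell
#!/bin/bash
workdir=$(mktemp -d)                 # scratch directory for this demo
seq 1 10 > "$workdir/list"           # stand-in for the output of find

# Fixed number of lines per segment: 4 lines each, so 10 lines give 3 segments
split -l 4 "$workdir/list" "$workdir/bylines."

# Fixed number of segments: 5 pieces, split on line boundaries
split -n l/5 "$workdir/list" "$workdir/bycount."

by_lines=("$workdir"/bylines.*)
by_count=("$workdir"/bycount.*)
echo "by lines: ${#by_lines[@]} segments, by count: ${#by_count[@]} segments"
```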

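There are a number of niceties that could be applied to this script. The use of /tmp/chunk. as the file list prefix should probably be pushed out into a constant declaration, and the code shouldn't really assume it can delete anything matching /tmp/chunk.*, but I've left it this way as a proof of concept rather than a polished utility. If I were using this, I would use mktemp to create a temporary directory for holding the file lists.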
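As a rough illustration of the mktemp idea, here is a sketch that keeps the file lists in a private temporary directory and removes that directory when it is done. The function name make_tarballs and its third parameter (files per tarball) are hypothetical additions, and the demo at the end runs on a throwaway directory with seven empty files. It assumes GNU tar and split.

```shell
#!/bin/bash
# Hypothetical wrapper around the same find | split | tar pipeline.
# Usage: make_tarballs SOURCE_DIR DEST_DIR [FILES_PER_TARBALL]
make_tarballs() {
  local dir=$1 dest=$2 per_tar=${3:-1000000}
  local lists list chunk
  lists=$(mktemp -d) || return 1          # private directory for the lists

  find "$dir" -type f | split -l "$per_tar" - "$lists/chunk."

  for list in "$lists"/chunk.*; do
    test -f "$list" || continue           # no files found: nothing to do
    chunk=${list##*/}
    echo "Creating tarball for chunk '$chunk'" >&2
    tar -cf "$dest/$chunk.tar" -T "$list"
  done
  rm -rf "$lists"                         # only ever deletes our own directory
}

# Demo on scratch data: 7 files, 3 per tarball
src=$(mktemp -d); out=$(mktemp -d)
touch "$src"/file{1..7}
make_tarballs "$src" "$out" 3
ls "$out"
```

Because the lists live in a directory that only this run knows about, there is no need for the up-front rm -f, and two runs at the same time cannot pick up each other's lists.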

Question 3

This does exactly what was requested:

#!/bin/bash
ctr=0
# Read up to 1M lines, strip newline chars, put the results into an array named "asdf".
# readarray succeeds even at end of input, so also stop once the array comes back empty.
while readarray -n 1000000 -t asdf && (( ${#asdf[@]} > 0 )); do
  ctr=$((ctr + 1))
# "${asdf[@]}" expands each entry in the array such that any special characters in
# the filename won't cause problems
  tar czf "/destination/path/asdf.${ctr}.tgz" "${asdf[@]}"
# If you don't want compression, use this instead:
  #tar cf "/destination/path/asdf.${ctr}.tar" "${asdf[@]}"
# process substitution is the canonical way to generate output
# for consumption by read/readarray in bash
done < <(find /source/path -not -type d)
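One caveat with putting a million names on a command line: the operating system limits the total size of the arguments passed to an external command, so tar may fail with "Argument list too long". A possible workaround, sketched here under the assumption that GNU tar is available, is to feed each batch to tar on standard input with --null -T -. The function name tar_in_batches, its batch size parameter, and the scratch directories are illustrative only.

```shell
#!/bin/bash
# Hypothetical variant: archive files from SOURCE in batches, passing the
# names to tar via stdin instead of the command line.
# Usage: tar_in_batches SOURCE DEST_DIR BATCH_SIZE
tar_in_batches() {
  local src=$1 dest=$2 size=$3 ctr=0
  local -a batch
  while readarray -n "$size" -t batch && (( ${#batch[@]} > 0 )); do
    ctr=$((ctr + 1))
    # printf is a shell builtin, so it is not subject to the argument limit.
    # Use -czf and a .tgz name instead if gzip compression is wanted.
    printf '%s\0' "${batch[@]}" | tar --null -T - -cf "$dest/asdf.$ctr.tar"
  done < <(find "$src" -not -type d)
}

# Demo on scratch data: 5 files in batches of 2 give 3 archives
src=$(mktemp -d); out=$(mktemp -d)
touch "$src"/{a,b,c,d,e}
tar_in_batches "$src" "$out" 2
ls "$out"
```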

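readarray (in bash) can also run a callback function, so this could potentially be rewritten along the following lines. The callback is invoked every -c lines (5000 if -c is not given) and receives the index and the text of the line about to be stored:

function something() {...}
find /source/path -not -type d \
  | readarray -t -c 1000000 -C something asdf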
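To see how the callback is driven, here is a small self-contained sketch. The function name record_call and the six sample lines are made up for the demonstration; it relies only on the documented behaviour that bash evaluates the callback each time -c lines have been read, passing it the index of the element about to be assigned.

```shell
#!/bin/bash
calls=()                      # records the index passed to each callback call

# Hypothetical callback: $1 is the index of the element about to be stored
record_call() {
  calls+=("$1")
}

# Process substitution keeps readarray in the current shell, so both the
# array and the calls list are still visible afterwards.
readarray -t -c 2 -C record_call lines < <(printf '%s\n' a b c d e f)

echo "read ${#lines[@]} lines, callback ran ${#calls[@]} times"
```

By contrast, in a pipeline such as find ... | readarray ..., readarray normally runs in a subshell, so the array is gone once the pipeline finishes. That is one reason the while loop with process substitution is the easier form to work with.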

GNU parallel could be used to do something similar (untested; I don't have parallel installed where I am, so I'm winging it):

find /source/path -not -type d -print0 \
  | parallel -j4 -d '\0' -N1000000 tar czf '/destination/path/thing_backup.{#}.tgz'

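Since that's untested, you could add the --dry-run argument to see what it would actually do. I like this one best, but not everyone has parallel installed. -j4 makes it run 4 jobs at a time, and -d '\0' combined with find's -print0 makes it safe against special characters in file names (whitespace and so on). The rest should be self-explanatory.

Something similar could be done with parallel's --files option, but I don't like it because it generates random file names: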

find /source/path -not -type d -print0 \
  | parallel -j4 -d '\0' -N1000000 --tmpdir /destination/path --files tar cz

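I don't yet know of a way to make it generate sequential file names.

xargs could also be used, but unlike parallel it has no straightforward way to generate the output file name, so you end up doing something silly/hacky like this:

# The trailing _ becomes $0 of the inline script, so every file name ends up in "$@"
find /source/path -not -type d -print0 \
  | xargs -P 4 -0 -n 1000000 bash -euc 'tar czf "$(mktemp --suffix=".tgz" /destination/path/backup_XXX)" "$@"' _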
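One detail worth checking with the bash -c pattern: the first extra argument after the inline script becomes $0 rather than part of "$@", so a placeholder is needed in that slot or the first file name is lost. The following sketch shows the difference using three dummy names instead of real files.

```shell
#!/bin/bash
# With a placeholder (_) in the $0 slot, all three names reach "$@"
with_placeholder=$(printf '%s\0' a b c | xargs -0 -n 3 bash -c 'echo "$@"' _)

# Without it, the first name is consumed as $0 and silently dropped
without_placeholder=$(printf '%s\0' a b c | xargs -0 -n 3 bash -c 'echo "$@"')

echo "with:    $with_placeholder"
echo "without: $without_placeholder"
```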

The OP said they didn't want to use split ... which seemed odd to me, since cat re-joins the pieces just fine. This produces a tar and splits it into 3 GB chunks:

tar c /source/path | split -b $((3*1024*1024*1024)) - /destination/path/thing.tar.

... and this un-tars them into the current directory (the glob expands in sorted order, so the pieces are joined in sequence):

cat /destination/path/thing.tar.* | tar x
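The split-and-rejoin round trip is easy to verify on a small scale. This sketch uses scratch directories and a 20000-byte chunk size instead of 3 GB; all paths and sizes are placeholders chosen for the demo, and -f - is spelled out so that tar reads and writes the pipe regardless of its built-in default archive.

```shell
#!/bin/bash
src=$(mktemp -d); parts=$(mktemp -d); restored=$(mktemp -d)
head -c 50000 /dev/urandom > "$src/data.bin"     # sample payload

# Create the archive and cut it into 20000-byte pieces
tar -C "$src" -cf - . | split -b 20000 - "$parts/thing.tar."

# Re-join the pieces in name order and unpack them elsewhere
cat "$parts"/thing.tar.* | tar -C "$restored" -xf -

cmp "$src/data.bin" "$restored/data.bin" && echo "round trip OK"
```

Note that none of the pieces is a usable archive on its own; all of them are needed, in order, to extract anything past the first chunk.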

Answer