Merging a large number of files

Answer 1

If you have root permissions on that machine, you can temporarily increase the "maximum number of open file descriptors" limit:

ulimit -Hn 10240 # The hard limit
ulimit -Sn 10240 # The soft limit

And then run

paste res.* >final.res

After that you can set the limits back to their original values.
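
As a sketch of the same idea that avoids restoring the old value by hand, the soft limit can be raised inside a subshell, where the change disappears as soon as the subshell exits. This assumes the existing hard limit is already high enough, so no root access is involved; the scratch directory and the three tiny res.* files are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory (not from the original answer).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '1\n2\n' > res.1
printf 'a\nb\n' > res.2
printf 'x\ny\n' > res.3

before=$(ulimit -Sn)

# Raise the soft limit up to the hard limit, but only inside the subshell.
(
  ulimit -Sn "$(ulimit -Hn)" 2>/dev/null ||
    echo "could not raise the soft limit; keeping $(ulimit -Sn)" >&2
  paste res.* > final.res
)

after=$(ulimit -Sn)   # same as $before: the parent shell was never changed
cat final.res
```

Raising the hard limit itself still needs root, as described above; the subshell only helps when the soft limit is the one in the way.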

A second solution, if you cannot change the limit:

for f in res.*; do cat final.res | paste - $f >temp; cp temp final.res; done; rm temp

It calls paste once per file, and at the end there is one huge file with all the columns (it takes a minute or so).

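Edit: Useless use of cat... Not!

As mentioned in the comments, the use of cat here (cat final.res | paste - $f >temp) is not useless. The first time the loop runs, the file final.res does not exist yet. A plain paste final.res $f would then fail, and the file would never be filled or even created. With my solution only cat fails on the first pass, with No such file or directory; paste reads an empty stream from standard input and carries on. The error can be ignored.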
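
One side effect worth knowing: on the first pass paste reads an empty standard input, but it still emits the delimiter for that empty column, so every line of the result starts with a tab. A minimal variant, assuming that is unwanted, copies the first file and pastes only the remaining ones. This is a sketch rather than the answer's own code, and the sample res.* files are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '1\n2\n' > res.1
printf 'a\nb\n' > res.2
printf 'x\ny\n' > res.3

rm -f final.res
for f in res.*; do
  if [ -e final.res ]; then
    # Later passes: append one more column to what has been merged so far.
    paste final.res "$f" > temp && mv temp final.res
  else
    # First pass: nothing to merge with yet, so start from a plain copy.
    cp "$f" final.res
  fi
done
cat final.res
```

The trade-off is the same as in the loop above: paste runs once per file, so only two inputs are ever open at a time, at the cost of rewriting final.res on every pass.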

Answer 2

If chaos' answer isn't applicable (because you don't have the required permissions), you can batch up the paste calls as follows:

ls -1 res.* | split -l 1000 -d - lists
for list in lists*; do paste $(cat $list) > merge${list##lists}; done
paste merge* > final.res

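This lists the files 1000 at a time in files named lists00, lists01 etc., then pastes the corresponding res. files into files named merge00, merge01 etc., and finally merges all the resulting partially merged files.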
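
The three steps can be tried on a small scale first. The sketch below uses five made-up files and a batch size of 2 in place of 1000, and it omits -d so that split falls back to its default alphabetic suffixes; it assumes the file names contain no whitespace.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
for i in 1 2 3 4 5; do
  printf '%s-a\n%s-b\n' "$i" "$i" > "res.$i"
done

batch=2   # stands in for the 1000 used in the answer

# Step 1: split the file list into chunks named listsaa, listsab, ...
ls -1 res.* | split -l "$batch" - lists

# Step 2: paste the files named in each chunk into a partial result.
for list in lists*; do
  paste $(cat "$list") > "merge${list##lists}"
done

# Step 3: paste the partial results side by side.
paste merge* > final.res
cat final.res
```

Because the list files and the merge files share the same suffixes, the glob in step 3 sees the partial results in the same order as the original file list.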

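As mentioned by chaos, you can increase the number of files used at once; the limit is the value given by ulimit -n minus however many files you already have open, so you'd say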

ls -1 res.* | split -l $(($(ulimit -n)-10)) -d - lists

to use the limit minus ten.
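
The arithmetic assumes that ulimit -n prints a number. Some systems report unlimited, in which case the expansion fails. A hedged sketch that falls back to a fixed batch size could look like this; the fallback of 1000 is borrowed from the earlier command, and the variable names are made up.

```shell
#!/bin/sh
limit=$(ulimit -n)
case $limit in
  ''|*[!0-9]*) batch=1000 ;;             # e.g. "unlimited": use a fixed size
  *)           batch=$((limit - 10)) ;;  # leave ten descriptors spare
esac

# With fewer than two inputs per paste call the merge would make no progress.
[ "$batch" -ge 2 ] || batch=2

echo "batch size: $batch"
```

The resulting value can then be passed to split -l in place of the inline expression.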

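If your version of split doesn't support -d, you can remove it: all it does is tell split to use numeric suffixes. By default the suffixes will be aa, ab etc. instead of 00, 01 etc.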

If there are so many files that ls -1 res.* fails ("argument list too long"), you can replace it with find, which avoids that error:

find . -maxdepth 1 -type f -name res.\* | split -l 1000 -d - lists

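(As pointed out by don_crissti, -1 shouldn't be necessary when piping ls's output; but I'm leaving it in to handle cases where ls is aliased with -C.)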

Answer 3

Try to execute it this way:

ls res.*|xargs paste >final.res
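
A caveat, shown here as a small experiment rather than as part of the answer: xargs starts a new paste whenever the argument list grows too long, and the outputs of separate paste runs are stacked one after another instead of side by side. The sketch forces that split with -n 2 on four made-up files. When everything does fit into a single command line, paste still opens all the files at once, so the descriptor limit applies as before.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
for i in 1 2 3 4; do
  printf '%s-a\n%s-b\n' "$i" "$i" > "res.$i"
done

# -n 2 makes xargs run paste twice: once for res.1 res.2, once for res.3 res.4.
ls res.* | xargs -n 2 paste > stacked.res

# The result has four lines of two columns, not two lines of four columns.
cat stacked.res
```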

You can also split the batch into parts and try something like:

paste `echo res.{1..100}` >final.100
paste `echo res.{101..200}` >final.200
...

and at the end combine the final files:

paste final.* >final.res
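
The manual batches can also be generated by a loop. The following is a sketch under the assumption that the files are named res.1 through res.N with no gaps. It uses total=5 and batch=2 in place of 10000 and 100, and it writes zero-padded part.* names (a made-up naming scheme) so that the final glob sorts in numeric order and cannot match final.res itself.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
total=5
batch=2
i=1
while [ "$i" -le "$total" ]; do
  printf '%s-a\n%s-b\n' "$i" "$i" > "res.$i"
  i=$((i + 1))
done

start=1
while [ "$start" -le "$total" ]; do
  end=$((start + batch - 1))
  [ "$end" -le "$total" ] || end=$total   # the last batch may be shorter

  # Build the list res.<start> ... res.<end> for this batch.
  names=
  i=$start
  while [ "$i" -le "$end" ]; do
    names="$names res.$i"
    i=$((i + 1))
  done

  paste $names > "$(printf 'part.%06d' "$end")"
  start=$((end + 1))
done

# Zero padding keeps part.000002, part.000004, part.000005 in numeric order.
paste part.* > final.res
cat final.res
```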

Answer 4

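Given the number of files, the line sizes, etc. involved, I think it will exceed the default sizes of the tools (awk, sed, paste, *, etc).

I would create a small program for this. It would neither have 10,000 files open, nor hold a line hundreds of thousands of characters long (10,000 files of 10 characters each, the maximum line size in the example). It only needs an array of about 10,000 integers to store the number of bytes that have been read from each file. The disadvantage is that it has only one file descriptor, which is reused for each file and for each line, and this could be slow.

The definitions of FILES and ROWS should be changed to the actual exact values. The output is sent to standard output.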
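
One way to obtain those two values is sketched below, under the assumption that the inputs are named res.1 to res.N (as the program expects) and that every file has the same number of lines. The sample files are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory: three files of four lines.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
for i in 1 2 3; do
  printf 'x\ny\nz\nw\n' > "res.$i"
done

# Count the entries named res.<number>; $(( )) strips any padding in the count.
files=$(($(ls | grep -c '^res\.[0-9][0-9]*$')))

# Take the row count from the first file.
rows=$(($(wc -l < res.1)))

echo "#define FILES $files"
echo "#define ROWS $rows"
```

The two printed lines can be copied over the corresponding definitions in the program.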

#define _POSIX_C_SOURCE 200809L /* needed for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define FILES 10000 /* number of files */
#define ROWS 500    /* number of rows  */

int main(void) {
   static long positions[FILES + 1]; /* static storage: all positions start at zero */
   FILE *file;
   int r, f;
   char filename[100];
   size_t linesize = 100;
   char *line = malloc(linesize);
   ssize_t len;

   if (line == NULL) return 1;

   for (r = 1; r <= ROWS; ++r) {
      for (f = 1; f <= FILES; ++f) {
         snprintf(filename, sizeof filename, "res.%d", f); /* creates the name of the current file */
         file = fopen(filename, "r");                      /* opens the current file */
         if (file == NULL) {                               /* stops if the file cannot be opened */
            perror(filename);
            free(line);
            return 1;
         }
         fseek(file, positions[f], SEEK_SET);              /* set position from the saved one */
         len = getline(&line, &linesize, file);            /* reads the next line */
         fclose(file);                                     /* closes the current file */
         if (len < 0) {
            line[0] = 0;                                   /* no more lines: prints an empty field */
         } else {
            positions[f] += len;                           /* saves the new position */
            if (len > 0 && line[len - 1] == '\n')
               line[len - 1] = 0;                          /* removes the newline, if there is one */
         }
         printf("%s ", line);                              /* prints to standard output, followed by a single space */
      }
      printf("\n");  /* after getting the line from each file, prints a new line to standard output */
   }
   free(line);
   return 0;
}
