Splitting a large file into chunks without splitting entries

Answer 1

I'd suggest using csplit:

Splitting based on line numbers

$ csplit file.txt <num lines> "{repetitions}"

Example

Say I have a file with 1000 lines.

$ seq 1000 > file.txt

$ csplit file.txt 100 "{8}"
288
400
400
400
400
400
400
400
400
405

This results in files like this:

$ wc -l xx*
  99 xx00
 100 xx01
 100 xx02
 100 xx03
 100 xx04
 100 xx05
 100 xx06
 100 xx07
 100 xx08
 101 xx09
1000 total

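You can work around the static limitation of having to specify the number of repetitions by computing it ahead of time from the number of lines in your particular file.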

$ lines=100
$ echo $lines 
100

$ rep=$(( $(wc -l < file.txt) / $lines - 2 ))
$ echo $rep
8

$ csplit file.txt 100 "{$rep}"
288
400
400
400
400
400
400
400
400
405
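
The formula above goes negative when the file holds fewer than two chunks' worth of lines, and csplit rejects a negative repeat count. Below is a minimal sketch of a guard around the same formula. The function name split_every and the scratch directory are made up for illustration, and it assumes a csplit that supports the standard -s (quiet) option.

```shell
#!/bin/sh
# split_every FILE N: hypothetical wrapper around the rep formula above.
split_every() {
    file=$1 lines=$2
    total=$(wc -l < "$file")
    rep=$(( total / lines - 2 ))
    if [ "$rep" -ge 0 ]; then
        csplit -s "$file" "$lines" "{$rep}"
    elif [ "$total" -ge "$lines" ]; then
        csplit -s "$file" "$lines"     # room for a single split point only
    else
        cp "$file" xx00                # shorter than one chunk: keep it whole
    fi
}

# Demonstration in a scratch directory.
cd "$(mktemp -d)" || exit 1
seq 1000 > file.txt
split_every file.txt 100
wc -l xx*
```

As with the plain command, the first piece is one line short of the chunk size because csplit cuts just before the given line number, and the last piece absorbs the remainder.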

Splitting based on blank lines

On the other hand, if you simply want to split a file on the blank lines it contains, you can use this form of csplit:

$ csplit file2.txt '/^$/' "{*}"

Example

Say I've added 4 blank lines to the file.txt above and saved the result as file2.txt. You can see where they were added by hand:

$ grep -A1 -B1 "^$" file2.txt
20

21
--
72

73
--
112

113
--
178

179

The above shows that I've added them between the corresponding numbers in my sample file. Now when I run the csplit command:

$ csplit file2.txt '/^$/' "{*}"
51
157
134
265
3290

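You can see that I now have 5 files (xx00 through xx04), split at the blank lines. Each piece after the first starts with the blank line it was split on: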

$ grep -A1 -B1 '^$' xx0*
xx01:
xx01-21
--
xx02:
xx02-73
--
xx03:
xx03-113
--
xx04:
xx04-179
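
The blank lines in file2.txt were added by hand, so here is one way to rebuild an equivalent file and check the byte counts reported by csplit. This is only a sketch: using sed's G command (append a newline plus the empty hold space) to insert the blank lines is my assumption, and the "{*}" repeat count is a GNU csplit extension.

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Rebuild file2.txt: G appends an empty line after lines 20, 72, 112 and 178.
seq 1000 | sed -e '20G' -e '72G' -e '112G' -e '178G' > file2.txt

# Split before every blank line; -s silences the byte counts.
csplit -s file2.txt '/^$/' '{*}'

# Byte count of each piece, for comparison with the csplit output above.
wc -c xx*
```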

Answer 2

If you don't care about the order of the records, you can do:

gawk -vRS= '{printf "%s", $0 RT > "file.out." (NR-1)%15}' file.in

Otherwise, you first need to get the number of records, so you know how many to put in each output file:

gawk -vRS= -v "n=$(gawk -vRS= 'END {print NR}' file.in)" '
  {printf "%s", $0 RT > "file.out." int((NR-1)*15/n)}' file.in
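
To see how int((NR-1)*15/n) keeps the records in order, here is a small self-contained sketch with 7 records and 3 output files instead of 15. It is a variant, not the command above: it uses plain awk in paragraph mode with ORS set to a blank line in place of gawk's RT, so a run of several blank lines between records would be normalized to one. The record contents and the chunks variable are invented for the demonstration.

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Seven two-line records, each followed by a blank line.
for i in 1 2 3 4 5 6 7; do
    printf 'entry %s\nbody\n\n' "$i"
done > file.in

chunks=3
# First pass: count the records.
n=$(awk -v RS= 'END {print NR}' file.in)

# Second pass: record NR goes to file int((NR-1)*chunks/n),
# so records 1-3 land in file.out.0, 4-5 in file.out.1, 6-7 in file.out.2.
awk -v RS= -v ORS='\n\n' -v n="$n" -v chunks="$chunks" '
  { out = "file.out." int((NR-1)*chunks/n); print > out }' file.in

# Concatenating the pieces in order reproduces the input.
cat file.out.0 file.out.1 file.out.2 | cmp -s - file.in && echo "pieces match input"
```

Because the index grows monotonically with NR, each output file holds a contiguous run of records, which the modulo version does not guarantee.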

Answer 3

Here's a solution that could work:

seq 1 $(((lines=$(wc -l </tmp/file))/16+1)) $lines |
sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D' |
sed -ne :nl -ne '/\n$/!{N;bnl}' -nf - /tmp/file

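It works by letting the first sed write the second sed's script. The second sed first gathers all input lines until it encounters a blank line. It then writes all output lines to a file. The first sed writes out a script for the second that tells it where to write its output. In my test case that script looked like this: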

1d;1,377w /tmp/uptoline377
377d;377,753w /tmp/uptoline753
753d;753,1129w /tmp/uptoline1129
1129d;1129,1505w /tmp/uptoline1505
1505d;1505,1881w /tmp/uptoline1881
1881d;1881,2257w /tmp/uptoline2257
2257d;2257,2633w /tmp/uptoline2633
2633d;2633,3009w /tmp/uptoline3009
3009d;3009,3385w /tmp/uptoline3385
3385d;3385,3761w /tmp/uptoline3761
3761d;3761,4137w /tmp/uptoline4137
4137d;4137,4513w /tmp/uptoline4513
4513d;4513,4889w /tmp/uptoline4889
4889d;4889,5265w /tmp/uptoline5265
5265d;5265,5641w /tmp/uptoline5641
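
The first half of the pipeline can be run on its own to inspect the generated script. A sketch, assuming a 6000-line input so the line count can be hard-coded instead of read with wc: 6000/16+1 gives a step of 376, seq emits 1, 377, 753 and so on up to 5641, and the sed pairs each boundary with its successor.

```shell
#!/bin/sh
lines=6000                        # hard-coded; the original reads it with wc -l
step=$(( lines / 16 + 1 ))        # 376

# N joins two boundaries, s rewrites them into "Ad;A,Bw file", P prints that
# command, and D keeps B in the pattern space as the start of the next pair.
script=$(seq 1 "$step" "$lines" |
    sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D')

printf '%s\n' "$script"
```

Sixteen boundaries produce fifteen write commands, one per output file.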

I tested it like this:

printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 1000) >/tmp/file

That gave me a 6000-line file, which looked like this:

<iteration#>
and
more
lines
here
#blank

...repeated 1000 times.
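
The one-line generator works because printf reuses its format string until all arguments are consumed, so each number from seq yields one six-line record. A scaled-down sketch with 3 records instead of 1000 (the smaller count and the temporary file are only for readability):

```shell
#!/bin/sh
tmp=$(mktemp) || exit 1

# Each argument fills the single %s once; the whole format repeats per argument.
printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 3) > "$tmp"

cat "$tmp"
wc -l < "$tmp"      # 3 records of 6 lines each
```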

After running the script above:

set -- /tmp/uptoline*
echo "$# total splitfiles"
for splitfile do
    echo "$splitfile"
    wc -l <"$splitfile"
    tail -n6 "$splitfile"
done

Output

15 total splitfiles
/tmp/uptoline1129
378
188
and
more
lines
here

/tmp/uptoline1505
372
250
and
more
lines
here

/tmp/uptoline1881
378
313
and
more
lines
here

/tmp/uptoline2257
378
376
and
more
lines
here

/tmp/uptoline2633
372
438
and
more
lines
here

/tmp/uptoline3009
378
501
and
more
lines
here

/tmp/uptoline3385
378
564
and
more
lines
here

/tmp/uptoline3761
372
626
and
more
lines
here

/tmp/uptoline377
372
62
and
more
lines
here

/tmp/uptoline4137
378
689
and
more
lines
here

/tmp/uptoline4513
378
752
and
more
lines
here

/tmp/uptoline4889
372
814
and
more
lines
here

/tmp/uptoline5265
378
877
and
more
lines
here

/tmp/uptoline5641
378
940
and
more
lines
here

/tmp/uptoline753
378
125
and
more
lines
here

Answer 4

Try awk:

awk 'BEGIN{RS="\n\n"}{print $0 > FILENAME"."FNR}' big_db.msg
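
The command above writes one file per record, and a multi-character RS is an extension whose behavior POSIX leaves unspecified. If the goal is several records per output file, a variant along these lines may help. It is only a sketch: the per variable, the sample contents of big_db.msg, and the numbering of the output files are assumptions for illustration, and it uses paragraph mode (empty RS) instead of RS="\n\n".

```shell
#!/bin/sh
cd "$(mktemp -d)" || exit 1

# Sample input: five records separated by blank lines.
for i in 1 2 3 4 5; do
    printf 'From: user%s\nbody %s\n\n' "$i" "$i"
done > big_db.msg

# Start a new output file every "per" records; close the old one first so
# only one output file is open at a time.
awk -v RS= -v ORS='\n\n' -v per=2 '
  (NR - 1) % per == 0 { if (out != "") close(out); out = FILENAME "." (++part) }
  { print > out }' big_db.msg

ls big_db.msg.*
```

Closing each finished file matters on large inputs, where leaving thousands of outputs open can run into the per-process limit on open files.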
