Split a 10GB text file so that 1) each output file is at least 40MB and 2) it is cut after a specific string ()

Answer

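The script below cuts a (large) file into slices. I did not use the split command, because the slices have to be "rounded" to whole records. You can set the slice size in the head section of the script.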

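The procedure

Difficulties
Because the script should be able to deal with huge files, Python's read() or readlines() cannot be used; either would try to load the whole file into memory at once, which would certainly choke your system. At the same time, the cuts have to be made so that every slice is "rounded" to whole records. The script therefore has to be able to identify or "read" the file's content in some way.

What seems to be the only option is to use: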

with open(file) as src:
    for line in src:

which reads the file line by line.
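To make the memory argument concrete, here is a minimal sketch of line-by-line reading. Only the `for line in src` pattern comes from the answer; the function name `count_lines_and_records` and the sample data are made up for illustration.

```python
import os
import tempfile


def count_lines_and_records(path, marker="</record>"):
    """Count all lines and the lines containing `marker`, one line at a time."""
    lines = 0
    records = 0
    with open(path) as src:
        for line in src:  # the file object yields one line per iteration
            lines += 1
            if marker in line:
                records += 1
    return lines, records


# Demonstration on a small throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<record>\n<a>1</a>\n</record>\n<record>\n<a>2</a>\n</record>\n")
    sample_path = tmp.name

print(count_lines_and_records(sample_path))  # (6, 2)
os.remove(sample_path)
```

Only the current line is held in memory at any time, so memory use does not grow with the size of the file, unlike read() or readlines().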

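Approach
In the script I chose a two-step approach:

Analyze the file (size, number of slices, number of lines, number of records, records per section), then create a list of sections or "markers" (by line index).
Read the file again, but now allocate the lines to separate files.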
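The two steps can be sketched on a small in-memory example. This is a simplified illustration of the idea, not a line-for-line copy of the script: the helper names are hypothetical, the lines are kept in a list rather than read from disk, and leftover records are always merged into the last slice so that no slice comes out undersized.

```python
def find_record_ends(lines, marker="</record>"):
    """Step 1a: indexes of the lines that close a record."""
    return [i for i, line in enumerate(lines) if marker in line]


def compute_line_markers(record_ends, last_line, records_per_slice):
    """Step 1b: for each slice, the index of the last line it contains."""
    # The closing line of every n-th record is a candidate cut point.
    cuts = record_ends[records_per_slice - 1::records_per_slice]
    # Drop the last candidate and let the final slice run to the last line,
    # so leftover records are absorbed instead of forming a tiny slice.
    return cuts[:-1] + [last_line]


def assign_lines(lines, line_markers):
    """Step 2: walk the lines again and hand each one to its slice."""
    slices = [[]]
    for i, line in enumerate(lines):
        if i > line_markers[len(slices) - 1]:
            slices.append([])  # the current marker is passed: start a new slice
        slices[-1].append(line)
    return slices


# Five two-line records (hypothetical sample data), two records per slice.
lines = []
for n in range(5):
    lines += ["<record>%d\n" % n, "</record>\n"]

markers = compute_line_markers(find_record_ends(lines), len(lines) - 1, 2)
print(markers)                                         # [3, 9]
print([len(s) for s in assign_lines(lines, markers)])  # [4, 6]
```

With five records and two records per slice, the fifth record ends up in the second slice, which therefore holds three records.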

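Appending the lines to the separate slices (files) one by one seems inefficient, but of everything I tried it turned out to be the most efficient, the fastest and the least resource-consuming option.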
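For comparison, one variation is to open each slice file once and keep it open until the slice is complete, instead of reopening it for every line. This is only a sketch and has not been timed against the author's approach; `write_slices` and its parameters are hypothetical names, and `line_markers` is assumed to hold, for each slice, the index of its last line, with the final entry equal to the last line index of the file.

```python
import os
import tempfile


def write_slices(src_path, out_dir, line_markers):
    """Copy src_path into slice_N.txt files, opening each output file once."""
    paths = []
    out = None
    try:
        with open(src_path) as src:
            for line_number, line in enumerate(src):
                # Start a slice at the first line, or once the current marker is passed.
                if out is None or line_number > line_markers[len(paths) - 1]:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "slice_" + str(len(paths) + 1) + ".txt")
                    out = open(path, "w")
                    paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


# Demonstration in a throwaway directory (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "source.txt")
    with open(source, "w") as f:
        f.write("".join("line%d\n" % i for i in range(6)))
    written = write_slices(source, tmp_dir, [1, 5])
    print([os.path.basename(p) for p in written])  # ['slice_1.txt', 'slice_2.txt']
```

Because the files are opened in "w" mode, running this twice overwrites the old slices, whereas a script that opens them in append mode would add to whatever is already there.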

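How I tested
I created an xml file of a little over 10GB, filled with records like the ones in your example, and set the slice size to 45 mb. On my not-so-recent system (Pentium Dual-Core CPU E6700 @ 3.20GHz × 2), the script's analysis produced the following: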

analyzing file...

checking file size...
file size: 10767 mb
calculating number of slices...
239 slices of 45 mb
checking number of lines...
number of lines: 246236399
checking number of records...
number of records: 22386000
calculating number records per section ...
records per section: 93665

Then it started creating slices of 45 mb, taking approximately 25-27 seconds per slice.

creating slice 1
creating slice 2
creating slice 3
creating slice 4
creating slice 5

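and so on...

During the process the processor was occupied for 45-50%, and the script used about 850-880 mb of memory (of 4GB). The computer remained reasonably usable while it ran.

The whole procedure took one and a half hours. On a more recent system it should take substantially less time.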
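The slice and record counts in the analysis output can be checked by hand. A small sketch, using the same truncating int(a/b) arithmetic as the script; the function name `plan_slices` is hypothetical:

```python
def plan_slices(file_size_mb, slice_size_mb, n_records):
    """Return (number of slices, records per slice), both rounded down."""
    sections = int(file_size_mb / slice_size_mb)
    records_per_section = int(n_records / sections)
    return sections, records_per_section


# Figures taken from the analysis output above.
print(plan_slices(10767, 45, 22386000))  # (239, 93665)
```

Because both divisions round down, a few records are always left over; they have to be added to one of the slices rather than form a small slice of their own.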

The script

#!/usr/bin/env python3

import os

#---
file = "/path/to/big/file.xml" 
out_dir = "/path/to/save/slices"
size_ofslices = 45 # in mb
identifying_string = "</record>"
#---

line_number = -1
records = [0]

# analyzing file -------------------------------------------

print("analyzing file...\n")
# size in mb
print("checking file size...")
size = int(os.stat(file).st_size/1000000)
print("file size:", size, "mb")
# number of sections
print("calculating number of slices...")
sections = int(size/size_ofslices)
print(sections, "slices of", size_ofslices, "mb")
# misc. data
print("checking number of lines...")
with open(file) as src:
    for line in src:
        line_number = line_number+1
        if identifying_string in line:
            records.append(line_number)
# last index (number of lines -1)
ns_oflines = line_number
print("number of lines:", ns_oflines)
# number of records
print("checking number of records...")
ns_records = len(records)-1
print("number of records:", ns_records)
# records per section
print("calculating number records per section ...")
ns_recpersection = int(ns_records/sections)
print("records per section:", ns_recpersection)

# preparing data -------------------------------------------

# record indexes at which a new slice starts, plus the index of the last record
rec_markers = [i for i in range(ns_records) if i % ns_recpersection == 0] + [ns_records]
# the corresponding line indexes
line_markers = [records[i] for i in rec_markers]
# let the last slice run until the last line of the file,
# and drop the marker before it so the leftover records are absorbed
line_markers[-1] = ns_oflines
line_markers.pop(-2)

# creating sections ----------------------------------------

sl = 1
line_number = 0

curr_marker = line_markers[sl]
outfile = out_dir+"/"+"slice_"+str(sl)+".txt"

def writeline(outfile, line):
    with open(outfile, "a") as out:
        out.write(line)

with open(file) as src:
    print("creating slice", sl)
    for line in src:
        if line_number <= curr_marker:
            writeline(outfile, line)
        else:
            sl = sl+1
            curr_marker = line_markers[sl]
            outfile = out_dir+"/"+"slice_"+str(sl)+".txt"
            print("creating slice", sl)
            writeline(outfile, line)       
        line_number = line_number+1

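How to use

Copy the script into an empty file, then set the path to your "big file", the path to the directory in which to save the slices, and the slice size. Save it as slice.py and run it with the command: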

python3 /path/to/slice.py

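Notes

The size of the big file should exceed the slice size by at least a few times. The bigger the difference, the more reliable the size of the (output) slices will be.
The script assumes that the average size of the records (seen in the bigger picture) is about the same throughout the file. Given the huge amount of data here, one would expect that to be an acceptable assumption, but you will have to check it (by looking for large differences in the sizes of the slices).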
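One way to do that check is to list the sizes of the generated files and compare the smallest with the largest. A minimal sketch, assuming the slices are named slice_N.txt as in the script; the function names are hypothetical:

```python
import os
import tempfile


def slice_sizes(out_dir, prefix="slice_"):
    """Return {file name: size in bytes} for the slice files in out_dir."""
    sizes = {}
    for name in sorted(os.listdir(out_dir)):
        path = os.path.join(out_dir, name)
        if name.startswith(prefix) and os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


def size_spread(sizes):
    """Return (smallest, largest, largest/smallest ratio) of the sizes."""
    smallest = min(sizes.values())
    largest = max(sizes.values())
    ratio = largest / smallest if smallest else float("inf")
    return smallest, largest, ratio


# Demonstration with two dummy slices (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, n_bytes in (("slice_1.txt", 100), ("slice_2.txt", 150)):
        with open(os.path.join(tmp_dir, name), "wb") as f:
            f.write(b"x" * n_bytes)
    print(size_spread(slice_sizes(tmp_dir)))  # (100, 150, 1.5)
```

A ratio close to 1 suggests the records are evenly sized; a much larger ratio means the "same average record size" assumption does not hold for the file.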
