Quickly concatenating a large number of small PDFs

Answer 1

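If you want to use Python, several Python scripts are discussed in an earlier post: https://stackoverflow.com/questions/3444645/merge-pdf-files

Because of the way the Python PDF library works, all input files are opened first, and their contents are only read when the output file is written. You should therefore expect high memory consumption. A workaround is to split the files into several folders.

You can easily extend the script below, for example to merge all PDFs in a directory tree, including every subfolder.

The program supports optional flags for verbose output and for skipping the last page of each input file. The input file pattern may contain wildcards.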

from argparse import ArgumentParser
from glob import glob
# PdfFileReader/PdfFileWriter are the legacy PyPDF2 names (PyPDF2 < 3.0);
# later releases renamed them to PdfReader/PdfWriter.
from PyPDF2 import PdfFileReader, PdfFileWriter


def PDF_cat(files, output_filename, skiplastpage, verbose):
    # First open all the files, then produce the output file, and
    # finally close the input files. This is necessary because
    # the data isn't read from the input files until the write
    # operation. Thanks to
    # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733

    writer = PdfFileWriter()
    skip = 1 if skiplastpage else 0

    # collect and open input files, excluding the output file itself
    inp = [open(f, 'rb') for f in glob(files) if f != output_filename]
    n = len(inp)
    print('merging %d files' % n)
    for i, fh in enumerate(inp, 1):
        reader = PdfFileReader(fh)
        # optionally drop the last page of each input
        for pg in range(reader.getNumPages() - skip):
            writer.addPage(reader.getPage(pg))
        if verbose:
            print('%d/%d %s' % (i, n, fh.name))

    print('writing output file...')
    with open(output_filename, 'wb') as fout:
        writer.write(fout)
    # finally, close the inputs now that their data has been written
    for fh in inp:
        fh.close()


if __name__ == '__main__':
    parser = ArgumentParser()

    # add more options if you like
    parser.add_argument('-o', '--output',
                        dest='output_filename',
                        help='write merged PDF files to FILE',
                        metavar='FILE')
    # nargs='?' makes the positional optional so the default below applies
    parser.add_argument(dest='files',
                        nargs='?',
                        help='PDF files to merge (quote the pattern)')
    parser.add_argument('-s', '--skiplastpage',
                        dest='skiplastpage',
                        action='store_true',
                        help='skip last page of each merged PDF')
    parser.add_argument('-v', '--verbose',
                        dest='verbose',
                        action='store_true',
                        help='show progress')
    # '*.pdf' replaces the original '.\*.pdf', whose backslash is an
    # invalid escape sequence in Python 3 and is Windows-only
    parser.set_defaults(output_filename='mergedPDFs.pdf', files='*.pdf',
                        skiplastpage=False, verbose=False)

    args = parser.parse_args()
    PDF_cat(args.files, args.output_filename, args.skiplastpage, args.verbose)

Quick test: merging 501 identical PDFs (91 KB each) took 61 seconds on my notebook, versus 83 seconds with PDFtk.exe. The output files differ in size but display identically.
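
The two suggestions above (splitting the inputs into smaller groups to limit memory use, and picking up PDFs from subfolders) could be sketched with the standard library alone. This is only a sketch under assumptions: the helper names `collect_pdfs` and `batched` are hypothetical and not part of the original script, and each batch would still have to be merged separately (for instance with a variant of PDF_cat that accepts a list of paths), followed by a final pass over the intermediate outputs.

```python
import os


def collect_pdfs(root):
    """Return a sorted list of all *.pdf paths under root, including subfolders."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # case-insensitive match so that .PDF files are included too
            if name.lower().endswith('.pdf'):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def batched(paths, size):
    """Yield consecutive slices of at most `size` paths."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(paths), size):
        yield paths[start:start + size]
```

A caller might loop over `batched(collect_pdfs('.'), 100)` and merge one group at a time, so that only about 100 input files are open at once; the batch size of 100 is an arbitrary illustration, not a measured optimum.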

Answer 2

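You can also try other Acrobat alternatives. These tools may help you.

1. PDFsam

Merge and split PDF files by given page numbers, by bookmark level, or into files of a given size
Extract pages from PDFs
Rotate every page of a PDF file or only selected pages
Mix PDF files, taking pages alternately from one file and the other

2. PDFMerge

Secure file merging and handling
Offers an online platform for merging PDFs
A desktop version is also available

3. PDFtk

Simple but powerful toolkit
Comes with a command-line tool that makes it easy to work with multiple PDFs from the command line

For now, I recommend pdftk, because its command-line tool is very powerful and saves a great deal of time and effort.

Feel free to edit this list to add other tools.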
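
For readers who want to drive pdftk from a script, the recommendation above could be sketched as follows. It assumes a `pdftk` executable is on the PATH; the function names `pdftk_cat_command` and `merge_with_pdftk` are hypothetical helpers, not part of pdftk. The command it builds is pdftk's `cat` operation, `pdftk in1.pdf in2.pdf cat output out.pdf`.

```python
import subprocess


def pdftk_cat_command(inputs, output):
    """Build the argument list for `pdftk <inputs...> cat output <output>`."""
    if not inputs:
        raise ValueError('need at least one input PDF')
    return ['pdftk'] + list(inputs) + ['cat', 'output', output]


def merge_with_pdftk(inputs, output):
    """Run pdftk; raises CalledProcessError if pdftk exits with an error."""
    # Passing a list (no shell) avoids quoting problems with odd file names.
    subprocess.run(pdftk_cat_command(inputs, output), check=True)
```

With a very large number of inputs the command line may exceed the operating system's length limit, so merging in batches may still be necessary.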

Answer 3

I use Ghostscript for this. I have merged 4000 PDF files with it, and I find it less likely than PyPDF to corrupt the PDF content. Merging the 4000 PDFs took only a few minutes.
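
The answer does not give the exact command used. A common way to merge PDFs with Ghostscript is its `pdfwrite` device, and the sketch below builds such a command from Python. The function name `gs_merge_command` is a hypothetical helper, and the binary name is an assumption: `gs` is typical on Linux and macOS, while Windows installs usually ship `gswin64c`.

```python
import subprocess


def gs_merge_command(inputs, output, gs_binary='gs'):
    """Build a Ghostscript command that concatenates inputs into output."""
    if not inputs:
        raise ValueError('need at least one input PDF')
    return [
        gs_binary,
        '-dBATCH',            # exit after processing the last file
        '-dNOPAUSE',          # do not prompt between pages
        '-q',                 # suppress startup messages
        '-sDEVICE=pdfwrite',  # write PDF output
        '-sOutputFile=' + output,
    ] + list(inputs)


def merge_with_ghostscript(inputs, output, gs_binary='gs'):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(gs_merge_command(inputs, output, gs_binary), check=True)
```

For thousands of inputs the argument list may become too long for the operating system; Ghostscript's documentation describes an `@filename` argument-file syntax that may help in that case, which is worth checking against the installed version.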
