Batch OCR for multiple PDFs

Question 1

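I also looked for a way to batch-OCR many PDFs automatically, without much luck. In the end I came up with a workable solution similar to yours, using Acrobat and a script that does the following:

Copy all relevant PDFs to a specific directory.
Remove PDFs that already contain text (on the assumption that they have already been OCR'd or are already text; I know this is not ideal, but it is good enough for now).
Use AutoHotkey to run Acrobat automatically, select the specific directory, and OCR all documents, appending "-ocr" to their file names.
Move the OCR'd PDFs back to their original locations, using the presence of a "-ocr.pdf" file to determine whether OCR succeeded.

A bit Heath Robinson, but it actually works pretty well.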
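The last step of that workflow (moving results back and using the "-ocr.pdf" file as the success signal) could be sketched roughly as follows. This is not the original script, which was not posted; the function names and the `original_dirs` manifest (staged file name mapped to the directory it came from) are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def ocr_output_for(staged_pdf):
    """Path Acrobat is assumed to write: 'name.pdf' -> 'name-ocr.pdf'."""
    return staged_pdf.with_name(staged_pdf.stem + "-ocr.pdf")


def move_results_back(staging_dir, original_dirs):
    """
    original_dirs is a hypothetical manifest recorded during the copy step:
    staged file name -> directory the file was copied from.
    Returns (succeeded, failed) lists of staged file names.
    """
    succeeded, failed = [], []
    for name, home in sorted(original_dirs.items()):
        result = ocr_output_for(Path(staging_dir) / name)
        if result.is_file():
            # The "-ocr.pdf" file exists, so treat OCR as successful.
            shutil.move(str(result), str(Path(home) / result.name))
            succeeded.append(name)
        else:
            # No "-ocr.pdf" file: OCR did not produce output for this input.
            failed.append(name)
    return succeeded, failed
```

The `failed` list gives a starting point for re-running or inspecting the documents Acrobat did not finish.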

Question 2

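I believe you need to realize that ABBYY FineReader is an end-user solution designed to provide fast and accurate OCR out of the box.

In my experience, OCR projects differ significantly in their details every time, and there is no way to create an out-of-the-box solution for every unique case. But I can suggest more professional tools that can do the job for you:

Take a look at ABBYY Recognition Server, a professional product for OCR automation.
When it comes to Linux, take a look at http://ocr4linux.com, a command-line utility that may suit you as well.
For more complex tasks, ABBYY has very flexible SDKs such as ABBYY FineReader Engine (hosted in-house) or ABBYY Cloud OCR SDK (based on the Microsoft Azure cloud), which let you design the OCR processing the way you want.

I am part of the front-end development team of the cloud service mentioned above and can provide more information if necessary.

As for finding a text layer in a PDF, I can't give any advice, because that task is somewhat outside my specialty, OCR, so I find your approach of using an external script very reasonable. Perhaps you will find this discussion helpful: http://forum.ocrsdk.com/questions/108/check-if-pdf-is-scanned-image-or-contains-text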
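For the "external script" that decides whether a PDF already has a text layer, one very crude approach is to look for font resources in the raw file. The sketch below is an assumption-laden heuristic, not something taken from the linked discussion: it treats any occurrence of the `/Font` name as a sign of text, so it can miss fonts stored in compressed object streams and cannot tell a partial text layer from a complete one. The function name is hypothetical.

```python
def looks_like_it_has_text(pdf_path, chunk_size=1 << 20):
    """Crude heuristic: True if the raw bytes contain a '/Font' name."""
    marker = b"/Font"
    tail = b""
    with open(pdf_path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # End of file reached without seeing the marker.
                return False
            window = tail + chunk
            if marker in window:
                return True
            # Keep the last few bytes so a marker split across chunks is found.
            tail = window[-(len(marker) - 1):]
```

A PDF library that actually extracts page text would be more reliable; this version only needs the standard library and reads the file in chunks, so it stays cheap on large scans.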

Question 3

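In early 2015 I achieved fully automated batch OCR on Windows using Nuance OmniPage Ultimate. It is not free; the list price is $500. Use the included batch program, "DocuDirect". It has an option "Run job without any prompts", which seems to be the direct answer to your original question.

I used DocuDirect to output one searchable PDF file for each input image (i.e., non-searchable) PDF file; it can be instructed to replicate the input directory tree in the output folder as well as the original input file names (almost; see below). It also uses multiple cores. Accuracy was the best of the packages I evaluated. Password-protected documents are skipped (without stopping the job and without showing a dialog).

Caveat 1: almost the original file names. The suffix ".PDF" becomes ".pdf" (i.e., upper case becomes lower case) because hey, on Windows they are all the same. (Ugh.)

Caveat 2: there is no log file, so diagnosing which files failed during recognition (and they definitely do fail) is back on you. DocuDirect will happily produce garbage output, such as whole pages simply missing. I wrote a Python script using the PyPDF2 module to implement a crude validation: test that the output page count matches the input page count. See below.

Caveat 3: a fuzzy, indistinct input image file causes OmniPage to hang forever, using no CPU; it never recovers. This really derails batch processing, and I found no workaround. I reported this to Nuance as well, but got nowhere.

@Joe is right that the software is poorly programmed and poorly documented. I note that the core of OmniPage has amazing character-recognition magic, but the outer shell (GUI and batch processing) is enough to make you pull your hair out.

I second the suggestions from @Joe and @Kiwi to screen files with a script so that only unprotected image documents are presented to the OCR package.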
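The screening step for protected documents might look something like the sketch below. It relies on the fact that an encrypted PDF carries an `/Encrypt` entry in its trailer and simply searches the raw bytes for that name, so a file that merely mentions the string would be a false positive. Both function names are hypothetical and not taken from any of the scripts mentioned in this thread.

```python
def is_probably_encrypted(pdf_path):
    """Heuristic: encrypted PDFs carry an /Encrypt entry in the trailer."""
    with open(pdf_path, "rb") as handle:
        return b"/Encrypt" in handle.read()


def screen_for_ocr(pdf_paths):
    """Split paths into (send_to_ocr, skipped_as_protected)."""
    send, skipped = [], []
    for path in pdf_paths:
        if is_probably_encrypted(path):
            skipped.append(path)
        else:
            send.append(path)
    return send, skipped
```

Only the `send` list would then be copied into the folder the OCR package watches.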

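My only affiliation with Nuance is as a not-exactly-satisfied customer; I have a bunch of unresolved support tickets to prove it :)

@Joe: a late answer, but it may still be relevant. @SuperUser community: I hope you find this on topic.

** UPDATE ** The successor package is Nuance PowerPDF Advanced, with a list price of just $150. I got even better results with it; it is just as accurate but much more stable.

Here is the Python script for pre-/post-OCR tree validation.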

'''
Script to validate OCR outputs against inputs.
Both input and output are PDF documents in a directory tree.
For each input document, checks for the corresponding output
document and its page count.

Requires PyPDF2 from https://pypi.python.org/pypi/PyPDF2
(a release before 3.0, which still provides PdfFileReader and getNumPages)
'''

from __future__ import print_function
from PyPDF2 import PdfFileReader
import getopt
import os
import stat
import sys

def get_pdf_page_count(filename):
    '''
    Gets number of pages in the named PDF file.
    Fails on an encrypted or invalid file, returns None.
    '''
    with open(filename, "rb") as pdf_file:
        page_count = None
        err = None
        try:
            # slurp the file
            pdf_obj = PdfFileReader(pdf_file)
            # extract properties
            page_count = pdf_obj.getNumPages()
            err = ""
        except Exception:
            # Invalid PDF.
            # Limit exception so we don't catch KeyboardInterrupt etc.
            err = str(sys.exc_info())
            # This should be rare
            print("Warning: failed on file %s: %s" % (filename, err), file=sys.stderr)
            return None

    return page_count

def validate_pdf_pair(verbose, img_file, txt_file):
    '''
    Compares the page count of the target PDF file with
    the page count of the source PDF file.
    Returns True on match, False on mismatch, and None if
    either page count could not be read.
    '''
    #if verbose: 
    #    print("Image PDF is %s" % img_file)
    #    print("Text PDF is %s" % txt_file)

    # Get source and target page counts
    img_pages = get_pdf_page_count(img_file)
    txt_pages = get_pdf_page_count(txt_file)
    if img_pages is None:
        # Bogus PDF, skip.
        print("Warning: failed to get page count for %s" % img_file, file=sys.stderr)
        return None
    if txt_pages is None:
        # Bogus PDF, skip.
        print("Warning: failed to get page count for %s" % txt_file, file=sys.stderr)
        return None

    retval = True
    if img_pages != txt_pages:
        retval = False
        print("Mismatch page count: %d in source %s, %d in target %s" % (img_pages, img_file, txt_pages, txt_file), file=sys.stderr)

    return retval


def validate_ocr_output(verbose, process_count, total_count, img_dir, txt_dir):
    '''
    Walks a tree of files to compare against output tree, calling self recursively.
    Returns a tuple with PDF file counts (matched, non-matched).
    '''
    # Iterate over this directory
    match = 0
    nonmatch = 0
    for dirent in os.listdir(img_dir):
        src_path = os.path.join(img_dir, dirent)
        tgt_path = os.path.join(txt_dir, dirent)
        if os.path.isdir(src_path):
            if verbose: print("Found source dir %s" % src_path)
            # check target
            if os.path.isdir(tgt_path):
                # Ok to process
                (sub_match, sub_nonmatch) = validate_ocr_output(verbose, process_count + match + nonmatch, total_count, 
                                         src_path, tgt_path)
                match += sub_match
                nonmatch += sub_nonmatch
            else:
                # Target is missing!?
                print("Fatal: target dir not found: %s" % tgt_path, file=sys.stderr)

        elif os.path.isfile(src_path):
            # it's a plain file
            if src_path.lower().endswith(".pdf"):
                # check target
                # HACK: OmniPage changes upper-case PDF suffix to pdf;
                # of course not visible in Windohs with the case-insensitive 
                # file system, but it's a problem on linux.
                if not os.path.isfile(tgt_path):
                    # Flip lower to upper and VV
                    if tgt_path.endswith(".PDF"):
                        # use a slice
                        tgt_path = tgt_path[:-4] + ".pdf"
                    elif tgt_path.endswith(".pdf"):
                        tgt_path = tgt_path[:-4] + ".PDF"
                # hopefully it will be found now!
                if os.path.isfile(tgt_path):
                    # Ok to process
                    sub_match = validate_pdf_pair(verbose, src_path, tgt_path)
                    if sub_match:
                        match += 1
                    else:
                        nonmatch += 1
                    if verbose: print("File %d vs %d matches: %s" % (process_count + match + nonmatch, total_count, sub_match))

                else:
                    # Target is missing!?
                    print("Fatal: target file not found: %s" % tgt_path, file=sys.stderr)
                    nonmatch += 1

        else:
            # This should never happen
            print("Warning: not a directory nor file: %s" % src_path, file=sys.stderr)
    return (match, nonmatch)

def count_pdfs_listdir(verbose, src_dir):
    '''
    Counts PDF files in a tree using os.listdir, os.stat and recursion.
    Not nearly as elegant as os.walk, but hopefully very fast on
    large trees; I don't need the whole list in memory.
    '''
    count = 0
    for dirent in os.listdir(src_dir):
        src_path = os.path.join(src_dir, dirent)
        # stat the entry just once
        mode = os.stat(src_path)[stat.ST_MODE]
        if stat.S_ISDIR(mode):
            # It's a directory, recurse into it
            count += count_pdfs_listdir(verbose, src_path)
        elif stat.S_ISREG(mode):
            # It's a file, count it
            if src_path.lower().endswith('.pdf'):
                count += 1
        else:
            # Unknown entry, print an error
            print("Warning: not a directory nor file: %s" % src_path, file=sys.stderr)
    return count

def main(args):
    '''
    Parses command-line arguments and processes the named dirs.
    '''
    try:
        opts, args = getopt.getopt(args, "vi:o:")
    except getopt.GetoptError:
        usage()
    # default values
    verbose = False
    in_dir = None
    out_dir = None
    for opt, optarg in opts:
        # Compare with ==; ("-i") is a plain string, not a tuple
        if opt == "-i":
            in_dir = optarg
        elif opt == "-o":
            out_dir = optarg
        elif opt == "-v":
            verbose = True
        else:
            usage()
    # validate args
    if in_dir is None or out_dir is None: usage()
    if not os.path.isdir(in_dir):
        print("Not found or not a directory: %s" % in_dir, file=sys.stderr)
        usage()
    if not os.path.isdir(out_dir):
        print("Not found or not a directory: %s" % out_dir, file=sys.stderr)
        usage()
    if verbose: 
        print("Validating input %s -> output %s" % (in_dir, out_dir))
    # get to work
    print("Counting files in %s" % in_dir)
    count = count_pdfs_listdir(verbose, in_dir)
    print("PDF input file count is %d" % count)
    (match,nomatch) = validate_ocr_output(verbose=verbose, process_count=0, total_count=count, img_dir=in_dir, txt_dir=out_dir) 
    print("Results are: %d matches, %d mismatches" % (match, nomatch))

def usage():
    print('Usage: validate_ocr_output.py [options] -i input-dir -o output-dir')
    print('    Compares pre-OCR and post-OCR directory trees')
    print('    Options: -v = be verbose')
    sys.exit()

# Pass all params after program name to our main
if __name__ == "__main__":
    main(sys.argv[1:])
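The docstring of count_pdfs_listdir mentions os.walk as the more elegant alternative. For comparison, a minimal sketch of that variant is shown here; the name count_pdfs_walk is hypothetical and not part of the script above. One behavioral difference to be aware of: os.walk does not descend into symlinked directories by default, whereas the os.stat-based version follows them.

```python
import os


def count_pdfs_walk(src_dir):
    """Count files ending in .pdf (any case) anywhere under src_dir."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(src_dir):
        # filenames holds only the non-directory entries of this directory
        count += sum(1 for name in filenames if name.lower().endswith(".pdf"))
    return count
```

Like the original, it never builds a list of all files in the tree; it only holds one directory's entries at a time.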

Question 4

On Mac or Linux:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

From here.
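Where GNU parallel is not available, roughly the same thing can be approximated from Python. The sketch below is an assumption-based equivalent, not part of the original answer: it builds one `ocrmypdf <input> <output>` command per file and runs at most two at a time, like `-j 2`. It does not reproduce the output tagging that `--tag` provides, and the helper names build_jobs and run_jobs are hypothetical.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_jobs(pdf_names, out_dir="output"):
    """One argv list per PDF: ocrmypdf <input> <out_dir>/<input>."""
    return [["ocrmypdf", name, os.path.join(out_dir, name)] for name in pdf_names]


def run_jobs(jobs, workers=2):
    """Run the commands, at most `workers` at once; return exit codes in order."""
    def run(argv):
        # Threads are enough here: each one just waits on a child process.
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, jobs))
```

A call such as `run_jobs(build_jobs(sorted(glob.glob("*.pdf"))))` would mirror the one-liner, assuming `glob` is imported and the output directory already exists; a non-zero exit code in the returned list marks a file that needs a second look.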
