Copy a PDF text layer to another PDF

Answer 1

Here is a simple shell script that does this on the command line:

Save it as ~/pdf-merge-text.sh (and chmod +x it):

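#!/usr/bin/env bash

set -eu

# Usage: pdf-merge-text.sh TEXT_PDF IMAGE_PDF [OUTPUT_PDF]
# Combines the text layer of TEXT_PDF with the page images of IMAGE_PDF.
# "-" stands for stdin/stdout; OUTPUT_PDF defaults to "-".
pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        # Temporary text-only PDF in the current directory (--suffix needs GNU mktemp).
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        # Remove the temporary file when this subshell exits; the single quotes
        # defer expansion, so no manual quote escaping is needed.
        trap 'rm -f -- "${txtonlypdf}"' EXIT
        # Drop all images from the OCR'd PDF, keeping only its text.
        # gs messages go to stderr so they cannot mix with PDF output on stdout.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}" 1>&2
        # Stamp each image page on top of the corresponding text-only page.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"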

Now just call it:

~/pdf-merge-text.sh txt.pdf img.pdf out.pdf

The idea is to strip the images from the OCR'd PDF and then apply the pdftk multistamp technique described in another answer.
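The script relies on gs, pdftk and mktemp being installed. A minimal sketch of a pre-flight check, assuming a hypothetical helper named missing_tools (it is not part of the script above), could look like this:

```shell
# Print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper name, not part of the original script.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling missing_tools gs pdftk mktemp prints nothing when all three commands are available, and otherwise lists the missing ones, one per line.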

Answer 2

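This answer (pdftotext -bbox) on Stack Overflow has a solution. You can extract the text with coordinates from pdf-2 using the Python package PDFMiner, write that text as hidden text into a new PDF using the Python package ReportLab, and then merge the hidden-text PDF with pdf-1 using PDFtk (the web page offers a GUI for Windows; the command-line tool for Unix is now called PDFtk Server).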
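As a rough illustration of the "text with coordinates" step, here is a small sketch that turns the word elements printed by pdftotext -bbox into tab-separated rows. The function name bbox_words_to_tsv is an assumption, and the sketch expects one word element per line with the xMin, yMin, xMax, yMax attributes in that order, which is how pdftotext writes them:

```shell
# Read `pdftotext -bbox` XHTML on stdin and print one row per word:
# xMin, yMin, xMax, yMax, text (tab-separated). The text stays XML-escaped
# (for example &amp;), exactly as pdftotext emitted it.
# bbox_words_to_tsv is a hypothetical helper name.
bbox_words_to_tsv() {
    awk -F'"' '/<word / {
        text = $9
        sub(/^>/, "", text)            # strip the end of the opening tag
        sub(/<\/word>.*$/, "", text)   # strip the closing tag
        printf "%s\t%s\t%s\t%s\t%s\n", $2, $4, $6, $8, text
    }'
}
```

A typical call would be pdftotext -bbox pdf-2.pdf - | bbox_words_to_tsv, where pdf-2.pdf stands for the OCR'd file.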

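Alternatively, you can try merging pdf-1 and pdf-2 directly with PDFtk. Run pdftk pdf-2 multistamp pdf-1 output out.pdf. This places each page of pdf-1 in front of the corresponding page of pdf-2, so you see only the images from pdf-1 (assuming they are scans without a transparent background), while the hidden text from pdf-2 is still included. The drawback is that the result can be very large, because it contains two copies of every page image. I have verified that this works, and that the size of the output PDF is the sum of the input sizes.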
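Because multistamp pairs pages one to one, it can help to confirm that both files have the same number of pages first. The sketch below reads the NumberOfPages line from pdftk dump_data output; the function names pages_from_dump and same_page_count are assumptions made for illustration:

```shell
# Read `pdftk FILE dump_data` output on stdin and print the page count.
# pages_from_dump and same_page_count are hypothetical helper names.
pages_from_dump() {
    awk '$1 == "NumberOfPages:" { print $2 }'
}

# Succeed only when both PDFs report the same page count (needs pdftk).
same_page_count() {
    local a b
    a="$(pdftk "$1" dump_data | pages_from_dump)"
    b="$(pdftk "$2" dump_data | pages_from_dump)"
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, same_page_count pdf-1 pdf-2 && pdftk pdf-2 multistamp pdf-1 output out.pdf runs the merge only when the counts match.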

Answer 3

Based on the script in this answer, you can strip the images from the input_ocr.pdf file using ghostscript:

gs -o "input_ocr_textonly.pdf" -sDEVICE=pdfwrite -dFILTERIMAGE "input_ocr.pdf"

and merge the result with the input_image.pdf file using pdftk:

pdftk "input_ocr_textonly.pdf" multistamp "input_image.pdf" output "output.pdf"

Or, using qpdf:

qpdf --empty --pages "input_image.pdf" -- --underlay "input_ocr_textonly.pdf" -- "output.pdf"
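The two steps can be wrapped in a single function. This is only a sketch: the names run and overlay_text_layer, the DRY_RUN switch, and the "_textonly.pdf" naming of the intermediate file are assumptions, while the gs and qpdf invocations are the ones shown above.

```shell
# Run a command, or only print it when DRY_RUN=1.
# DRY_RUN is a hypothetical switch; the printed form is not re-quoted.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# overlay_text_layer OCR_PDF IMAGE_PDF OUTPUT_PDF
overlay_text_layer() {
    local ocr_pdf image_pdf out_pdf text_only
    ocr_pdf="$1"; image_pdf="$2"; out_pdf="$3"
    text_only="${ocr_pdf%.pdf}_textonly.pdf"
    # Step 1: drop the images, keeping only the OCR text.
    run gs -o "$text_only" -sDEVICE=pdfwrite -dFILTERIMAGE "$ocr_pdf" || return 1
    # Step 2: place the text-only pages underneath the image pages.
    run qpdf --empty --pages "$image_pdf" -- --underlay "$text_only" -- "$out_pdf"
}
```

With DRY_RUN set to 1, overlay_text_layer input_ocr.pdf input_image.pdf output.pdf only prints the two commands, which makes it easy to check them before touching any files.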

Answer 4

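If you have to do this, LibreOffice + GIMP should do the job. First, use LibreOffice Draw to extract the high-quality scans. Then edit them in GIMP to remove the scanned text. Finally, add the images to the OCR'd file on a lower layer.

But if you are doing this as part of some routine job, there is probably something wrong with your workflow.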
