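I have a Microsoft Word document. Someone printed it, signed it (yes, with a pen; how nostalgic), and then scanned it. Obviously, this turned the document into an image, which makes it hard to search or to copy and paste text from it.
I tried an OCR tool. After running it, the document looks exactly like the scan, and I can search and copy-paste text. But checking for OCR errors is tedious, and I don't even know how to correct the errors I find. It also seems entirely unnecessary, since I still have the original Word document.
How can I embed or otherwise merge the scanned document and the original Word document, so that what you see is the scan, but text selection and search behave as they do in the original Word document?
Preferably a solution based on open-source software (pdftk, qpdf, etc.) that runs offline on Linux.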
Answer 1
You can use pdftk and its multistamp command to achieve the desired result.
First export the M$-Word document to a PDF file, document.pdf, and save the signed scan as document_signed.pdf. Then merge the two documents as follows:
pdftk document.pdf multistamp document_signed.pdf output document_signed_searchable.pdf
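This creates document_signed_searchable.pdf, which has the properties you want: the scan is what you see, and the text underneath comes from the original document.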
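The merge step above can be wrapped in a small shell function. This is only a sketch: the function name merge_signed and the DRY_RUN switch are hypothetical helpers introduced here for illustration, not part of pdftk. The pdftk invocation itself is the one shown above.

```shell
# merge_signed TEXT_PDF SCAN_PDF OUT_PDF
# Overlays each page of the scan (the stamp) on the matching page of the
# text PDF exported from Word (the input), so the text layer stays underneath.
# merge_signed and DRY_RUN are hypothetical names, not pdftk features.
merge_signed() {
    text_pdf=$1
    scan_pdf=$2
    out_pdf=$3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show the command that would run (for display; paths are not quoted).
        echo "pdftk $text_pdf multistamp $scan_pdf output $out_pdf"
    else
        pdftk "$text_pdf" multistamp "$scan_pdf" output "$out_pdf"
    fi
}
```

For example, merge_signed document.pdf document_signed.pdf document_signed_searchable.pdf runs the same command as above. If LibreOffice is installed, soffice --headless --convert-to pdf document.docx is one offline way to do the export (document.docx is an assumed file name), and if poppler-utils is available, pdftotext document_signed_searchable.pdf - prints the text layer so you can check that it is searchable.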
Here is the relevant excerpt from the manual:
background <background PDF filename | - | PROMPT>
Applies a PDF watermark to the background of a single input PDF. Pass the background PDF's filename after background like so:
pdftk in.pdf background back.pdf output out.pdf
Pdftk uses only the first page from the background PDF and applies it to every page of the input PDF. This page is scaled and rotated as needed to fit the input page. You can use - to pass a background PDF into pdftk via stdin.
If the input PDF does not have a transparent background (such as a PDF created from page scans) then the resulting background won't be visible -- use the stamp operation instead.
multibackground <background PDF filename | - | PROMPT>
Same as the background operation, but applies each page of the background PDF to the corresponding page of the input PDF. If the input PDF has more pages than the stamp PDF, then the final stamp page is repeated across these remaining pages in the input PDF.
stamp <stamp PDF filename | - | PROMPT>
This behaves just like the background operation except it overlays the stamp PDF page on top of the input PDF document's pages. This works best if the stamp PDF page has a transparent background.
multistamp <stamp PDF filename | - | PROMPT>
Same as the stamp operation, but applies each page of the background PDF to the corresponding page of the input PDF. If the input PDF has more pages than the stamp PDF, then the final stamp page is repeated across these remaining pages in the input PDF.
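Because multistamp repeats the final stamp page when the input has more pages than the stamp PDF, it may be worth checking that both files have the same number of pages before merging. The following is a minimal sketch, assuming the page count appears on a NumberOfPages: line of pdftk's dump_data output. The function names are hypothetical helpers, not pdftk commands.

```shell
# parse_page_count: read `pdftk FILE dump_data` output on stdin
# and print the value of the NumberOfPages line.
parse_page_count() {
    awk '/^NumberOfPages:/ { print $2; exit }'
}

# page_count PDF: print the number of pages in PDF.
page_count() {
    pdftk "$1" dump_data | parse_page_count
}

# check_same_pages TEXT_PDF SCAN_PDF
# Succeeds only if both PDFs report the same page count;
# otherwise prints a message to stderr and returns 1.
check_same_pages() {
    a=$(page_count "$1")
    b=$(page_count "$2")
    if [ "$a" != "$b" ]; then
        echo "page count mismatch: $1 has $a, $2 has $b" >&2
        return 1
    fi
}
```

A possible use is check_same_pages document.pdf document_signed.pdf && pdftk document.pdf multistamp document_signed.pdf output document_signed_searchable.pdf, so the merge only runs when the scan covers every page.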