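On Linux, how can I extract the text from a .pdf file where the text really is text, and not a scanned image? I want something I can use from the command line or in a script, not something interactive. (I don't want to convert to .tif and run OCR: the .pdf file already contains text, so why introduce the inaccuracies of imperfect OCR?)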
Answer 1
The pdftotext tool that ships with poppler will attempt to extract any text found in the PDF.
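
As a minimal sketch of scripted use: pdftotext takes the input PDF followed by an output file name, and "-" as the output name sends the text to standard output. The function names pdf_text and pdf_text_all below are hypothetical, and -layout (try to keep the physical page layout) is optional and shown only as an example.

```shell
# Hypothetical helper: print the text layer of one PDF to standard output.
pdf_text() {
    # "-" as the output file makes pdftotext write to stdout;
    # -layout asks it to preserve the physical layout where it can.
    pdftotext -layout "$1" -
}

# Hypothetical helper: convert every PDF in the current directory
# to a .txt file with the same base name.
pdf_text_all() {
    for f in ./*.pdf; do
        [ -e "$f" ] || continue      # skip when the glob matched nothing
        pdf_text "$f" > "${f%.pdf}.txt"
    done
}
```

For example, pdf_text report.pdf | grep -i total searches the extracted text without writing an intermediate file.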
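Answer 2
Ignacio's answer is good; in fact, it would be the first thing on my list. Well, perhaps alongside a suggestion to use the pdftohtml tool that also ships with poppler, combined with pdfreflow if you want to try to reassemble the text into paragraphs and so on. (That gives you HTML output, of course, but converting HTML to plain text can be done in many ways.)
Here are some other options.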
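ebook-convert, a command-line tool from calibre, which can convert .PDF to plain text (or RTF, or a number of ebook formats such as ePub)
pdftxtextract from PoDoFo (the PoDoFo tools ship this binary under the name podofotxtextract)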
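AbiWord can be invoked from the command line to convert between any of the formats it can import and export; with the appropriate import plugin, that includes PDF:
abiword --to=txt file.pdf
(To be fair, I think both AbiWord and calibre use the poppler library, but I'm not certain.)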
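
For scripting, the calibre route can be sketched as follows. ebook-convert takes an input file and an output file, and picks the output format from the output file's extension. The wrapper name pdf_to_txt is hypothetical and not part of calibre.

```shell
# Hypothetical wrapper: convert a PDF to plain text with calibre's ebook-convert.
pdf_to_txt() {
    src=$1
    dst=${2:-${src%.pdf}.txt}    # default output: same name with a .txt extension
    # ebook-convert infers the target format from the output extension
    ebook-convert "$src" "$dst"
}
```

With this, pdf_to_txt book.pdf would write book.txt next to the input, and pdf_to_txt book.pdf notes.txt would write to notes.txt instead.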
Answer 3
Besides pdftotext there is pdftohtml, which you can also combine with pandoc to convert a PDF to other formats, for example markdown:
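# Convert a PDF to Markdown via HTML; writes <name>.md next to the input.
mdpdf () {
    # -s: one HTML document for all pages; -stdout: write the HTML to standard output
    pdftohtml -s -stdout "$1" |
        pandoc -f html -t markdown -o "${1%.pdf}.md"
}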
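Answer 4
Comparison of how the methods handle paragraphs
One problem with pdftotext from poppler-utils 22.12.0, which Ignacio mentioned, is that it adds line breaks inside a paragraph whenever the paragraph is wider than the PDF page, e.g.: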
1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.
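These extra line breaks make the txt file very unpleasant to read on devices such as a Kindle.
However, I found that ebook-convert, mentioned by frabjous, overcomes this problem nicely and produces something like this: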
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light from the darkness.
It keeps each paragraph on a single line no matter how long it is, puts a double newline between paragraphs, and reads much better on a Kindle.
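
Where calibre is not available, a similar shape can be approximated by post-processing the pdftotext output with awk. This is only a sketch of a different technique, not something either tool provides: it assumes that blank lines separate paragraphs in the extracted text, which will not hold for every PDF (a file whose paragraphs come out with no blank lines between them would be joined into one long line and needs a different rule). The function name reflow_paragraphs is hypothetical.

```shell
# Hypothetical filter: join the wrapped lines of each paragraph into one line.
# Assumes paragraphs are separated by one or more blank lines.
reflow_paragraphs() {
    awk '
        # A blank (or whitespace-only) line ends the current paragraph.
        /^[[:space:]]*$/ {
            if (buf != "") {
                print buf
                print ""
                buf = ""
            }
            next
        }
        # Any other line is appended to the paragraph with a single space.
        { buf = (buf == "" ? $0 : buf " " $0) }
        # Flush the last paragraph at end of input.
        END { if (buf != "") print buf }
    ' "$@"
}
```

It reads standard input when given no file arguments, so it can sit at the end of a pipeline such as pdftotext file.pdf - | reflow_paragraphs > file.txt.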
I'll test the methods mentioned in the other answers on this test PDF, generated from this LibreOffice .odt file:
pdftotext output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very
very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
ebook-convert output:
Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1
H1 1
H2 1 1
H2 1 2
First very important paragraph.
And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.
Reference to H1 1 on page: 1
https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg
H1 2
H2 2 1
H2 2 2
Document Outline
H1 1 H2 1 1
H2 1 2
H1 2 H2 2 1
H2 2 2
The line-break aspect has also been asked about more specifically: https://unix.stackexchange.com/questions/691579/how-to-convert-pdf-file-to-text-without-breaking-lines
Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.