Listing named destinations in a PDF

Question 1

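Poppler's pdfinfo command-line utility will give you the page number, position, and name of every named destination in a PDF. You need at least version 0.58 of Poppler.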

$ pdfinfo -dests input.pdf
Page  Destination                 Name
   1 [ XYZ null null null      ] "F1"
   1 [ XYZ  122  458 null      ] "G1.1500945"
   1 [ XYZ   79  107 null      ] "G1.1500953"
   1 [ XYZ   79   81 null      ] "G1.1500954"
   1 [ XYZ null null null      ] "P.1"
   2 [ XYZ null null null      ] "L1"
   2 [ XYZ null null null      ] "P.2"
(...)
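
If the table needs to be consumed by a script, it can be parsed line by line. The following is a minimal sketch, not part of Poppler: the names `parse_dests` and `list_dests` are hypothetical, and the regular expression assumes the column layout shown in the sample output above (page number, bracketed destination, quoted name).

```python
import re
import subprocess

# Matches rows such as:   1 [ XYZ  122  458 null      ] "G1.1500945"
ROW = re.compile(r'^\s*(\d+)\s+\[\s*(.*?)\s*\]\s+"(.*)"\s*$')


def parse_dests(text):
    """Return (page, destination_fields, name) tuples from `pdfinfo -dests` output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # the header and any other non-matching lines are skipped
            page, dest, name = match.groups()
            rows.append((int(page), dest.split(), name))
    return rows


def list_dests(pdf_path):
    """Run pdfinfo (assumed to be on PATH) and parse its destination table."""
    result = subprocess.run(
        ["pdfinfo", "-dests", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return parse_dests(result.stdout)
```

Calling `list_dests("input.pdf")` would then return one tuple per destination, in the order pdfinfo prints them.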

Question 2

The pyPdf library can list the anchors:

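#!/usr/bin/env python
# Written for pyPdf 1.13, which targets Python 2.
import sys
from pyPdf import PdfFileReader

def pdf_list_anchors(fh):
    """Print the name of every named destination in the PDF."""
    reader = PdfFileReader(fh)
    # getNamedDestinations() returns a dict keyed by destination name.
    destinations = reader.getNamedDestinations()
    for name in destinations:
        print(name)

# PDFs are binary files, so open in binary mode.
with open(sys.argv[1], "rb") as fh:
    pdf_list_anchors(fh)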

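This is good enough for my use case, but the anchors are listed in arbitrary order. Using only the stable interface of pyPdf 1.13, I couldn't find a way to list the anchors in document order. I haven't tried PyPDF2 yet.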
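
If a repeatable order is enough and true document order is not required, one workaround is to sort the names before printing. This is only a sketch: `natural_key` is a hypothetical helper, not part of pyPdf, and it orders names by their text and embedded numbers, not by where the anchors appear in the PDF.

```python
import re


def natural_key(name):
    """Split a name into text and integer parts so that 'P.2' sorts before 'P.10'."""
    # re.split with a capture group alternates text, digits, text, ...
    # so parts at the same position always have the same type.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def sorted_anchor_names(names):
    """Return the anchor names in a stable, human-friendly order."""
    return sorted(names, key=natural_key)
```

In the script above, the loop would then iterate over `sorted_anchor_names(destinations)` in place of `destinations`.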

Question 3

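(Also answered here: Viewing the anchors in a PDF document)

I had the same question, and eventually found a good answer via "How can I visually inspect the structure of a PDF to reverse-engineer it?"

The answer is to use the Python package pdfminer.six. It's even one of the examples in the documentation! Cut and paste this code into your terminal: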

pip install pdfminer.six
cat >extract.py <<EOF
import sys
import pdfminer.pdfparser, pdfminer.pdfdocument
with open(sys.argv[1], "rb") as f:
  parser = pdfminer.pdfparser.PDFParser(f)
  document = pdfminer.pdfdocument.PDFDocument(parser)
  for (level, title, dest, a, se) in document.get_outlines():
    print('  ' * level, title, dest or a, se)
EOF
python extract.py myInputFile.pdf

On my particular PDF, the output looks like this:

$ python extract.py ~/Desktop/p2786r3.pdf | head
   Abstract {'S': /'GoTo', 'D': b'section.1'} None
   Revision History {'S': /'GoTo', 'D': b'section.2'} None
     R3: October 2023 (midterm mailing)r3-october-2023-midterm-mailing {'S': /'GoTo', 'D': b'section*.2'} None
     R2: June 2023 (Varna meeting)r2-june-2023-varna-meeting {'S': /'GoTo', 'D': b'section*.3'} None
     R1: May 2023 (pre-Varna mailing)r1-may-2023-pre-varna-mailing {'S': /'GoTo', 'D': b'section*.4'} None
     R0: Issaquah 2023r0-issaquah-2023 {'S': /'GoTo', 'D': b'section*.5'} None
   Introduction {'S': /'GoTo', 'D': b'section.3'} None
   Motivating Use Cases {'S': /'GoTo', 'D': b'section.4'} None
     Efficient vector growth {'S': /'GoTo', 'D': b'subsection.4.1'} None
     Moving types without empty states {'S': /'GoTo', 'D': b'subsection.4.2'} None

Indeed, navigating to p2786r3.pdf#subsection.4.2 in my browser opens the PDF at that specific section.
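
To turn each outline entry into such a link, the destination name can be pulled out of the `dest` or `a` value and appended as a URL fragment. This is a rough sketch under a few assumptions: `outline_fragment_url` is a hypothetical helper, the action is assumed to arrive as a plain dict with the name under `'D'` (as in the output above), and names containing unusual characters may need percent-encoding that is not handled here.

```python
def outline_fragment_url(pdf_url, dest, action):
    """Build 'pdf_url#name' for an outline entry, or return None if it has no name."""
    target = dest
    if target is None and isinstance(action, dict):
        target = action.get("D")  # GoTo actions keep their target under 'D'
    if isinstance(target, bytes):
        target = target.decode("utf-8", errors="replace")
    if not isinstance(target, str):
        return None  # e.g. explicit (array) destinations carry no name to link to
    return f"{pdf_url}#{target}"
```

Inside the loop of extract.py, `outline_fragment_url("p2786r3.pdf", dest, a)` would then give a link for each entry that has a named destination.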

Answer