How to detect pages in a PDF file that are actually two pages?

Question

After playing with several utilities from the poppler-utils package, I ended up with a solution that is acceptable but not optimal.

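It turns out that detecting double pages in a PDF file is quite tricky. I could not find any library that does it easily. So in the end I decided to use pdftohtml, which comes from the poppler-utils package, to convert each page to HTML, and then use regular expressions to pick out the pages that are not double pages. Interestingly, I could get most cases right using only one or two lines of the HTML file. It does not work in every case, because some double pages get flagged as single pages, but there seem to be no single pages flagged as double pages, so there is no risk of damaging the original file.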
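The per-page conversion step can be sketched as follows. This is a minimal sketch rather than the exact code used: the helper names `pdftohtml_command` and `output_files` are hypothetical, the flags `-i`, `-f` and `-l` are the ones used in the full script further down, and the three output file names are assumed from the files that script cleans up after each call.

```python
import os


def pdftohtml_command(pdf_path, page):
    """Build the pdftohtml call that converts one page (1-based) to HTML."""
    # -i: ignore images; -f / -l: first and last page to convert
    return ["pdftohtml", "-i", pdf_path, "-f", str(page), "-l", str(page)]


def output_files(pdf_path):
    """Files assumed to be written next to the PDF for each conversion."""
    base = os.path.splitext(pdf_path)[0]
    return [base + ".html", base + "_ind.html", base + "s.html"]


print(pdftohtml_command("Lenin06.pdf", 5))
print(output_files("Lenin06.pdf"))
```

The command list can be passed to `subprocess.call`, which avoids building a shell string by hand; the file ending in `s.html` is the one whose lines are inspected.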

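Here is what I did: I rely mainly on detecting the header number, which in most cases is the first line of the HTML file (after, of course, the few lines that are the same on every page).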

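I took advantage of the fact that, in the introduction of the file, the header numbers are written in Roman numerals, so I used the corresponding regular expressions: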

if re.findall('<a name=[0-9]*></a>[XIVLCDM]*<br/>', line) or \
        re.findall('<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>', line) or \
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>', line) or \
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>', line):
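The four alternatives differ only in whether a character entity appears before or after the numeral, so they can be folded into one pattern with two optional groups. The following is a sketch of that idea; the name `has_roman_header` and the sample lines are made up for illustration and are not taken from the actual file.

```python
import re

# Optional entity, a (possibly empty) run of Roman-numeral letters,
# optional entity, then the line break tag.
ROMAN_HEADER = re.compile(
    r'<a name=[0-9]*></a>(?:&#[0-9]*;)?[XIVLCDM]*(?:&#[0-9]*;)?<br/>')


def has_roman_header(line):
    """True if the line is an anchor followed only by a Roman-numeral header."""
    return ROMAN_HEADER.search(line) is not None


samples = [
    '<a name=7></a>XII<br/>',          # bare numeral
    '<a name=7></a>XII&#160;<br/>',    # entity after the numeral
    '<a name=7></a>&#160;XII<br/>',    # entity before the numeral
    '<a name=7></a>Chapter one<br/>',  # ordinary text, not a header number
]
print([has_roman_header(s) for s in samples])  # [True, True, True, False]
```

Because `[XIVLCDM]*` may match nothing, a line holding only the anchor (with or without an entity) also matches, exactly as with the four separate expressions.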

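Another thing I noticed is that if that line (actually line 31, because the first 30 lines are the same on every page) contains an image link, then the page probably does not need to be cut in half. There are cases where the left page is blank and the right page contains an image, but they are rare, so I just go through each page in the result and remove the ones that are double pages. I simply search for the string "img".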
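Reading line 31 and testing it for an image link can be sketched like this. The function names and the fake HTML content are assumptions made for the example; only the count of 30 common lines and the "img" test come from the description above.

```python
import io

COMMON_LINES = 30  # lines that are identical on every page


def first_page_line(html_lines, skip=COMMON_LINES):
    """Return the stripped line that follows the `skip` common lines."""
    for index, line in enumerate(html_lines):
        if index == skip:
            return line.strip()
    return ""  # the file is shorter than expected


def looks_like_image_page(line):
    """A plain substring test is enough; no regular expression is needed."""
    return "img" in line


# Stand-in for a converted page: 30 identical lines, then the page content.
fake_html = io.StringIO(
    u"boilerplate\n" * 30 + u'<a name=9></a><img src="x.png"/><br/>\n')
line = first_page_line(fake_html)
print(looks_like_image_page(line))  # True
```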

I also found that double pages contain the header number right at the beginning, so I simply use:

if re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>', line) or \
        re.findall('<a name=[0-9]*></a>[0-9]*<br/>', line) or \
        re.findall('<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>', line) or \
        re.findall('<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>', line) or \
        re.findall('<a name=[0-9]*></a>V. &#160;I. &#160;L ª - n i n &#[0-9]*;<br/>', line):

(The last line is there because some special pages need special handling.)
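One detail about that special-case pattern: the dots in "V. " and "I. " are not escaped, so each of them matches any character. That is harmless here, but if a strictly literal match is wanted, `re.escape` can be applied to the fixed part of the title. The sketch below compares the two; the sample lines and the names `LOOSE`, `STRICT`, `exact` and `near` are invented for the illustration.

```python
import re

ANCHOR = '<a name=[0-9]*></a>'
TAIL = '&#[0-9]*;<br/>'
TITLE = 'V. &#160;I. &#160;L ª - n i n '

LOOSE = re.compile(ANCHOR + TITLE + TAIL)              # "." matches anything
STRICT = re.compile(ANCHOR + re.escape(TITLE) + TAIL)  # "." matches only "."

exact = '<a name=3></a>V. &#160;I. &#160;L ª - n i n &#160;<br/>'
near = '<a name=3></a>Vx &#160;Iy &#160;L ª - n i n &#160;<br/>'

print(bool(LOOSE.search(exact)), bool(LOOSE.search(near)))    # True True
print(bool(STRICT.search(exact)), bool(STRICT.search(near)))  # True False
```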

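In the end, the detection is not exact (some double pages still end up in the list of single pages), but the upside is that it does not mistake any single page for a double page. So if the result is, say, [1, 5, 100], I can simply go through the list and check each case by eye. It is still not fully automated, but it is much better than checking every single page.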
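Once the list has been checked by eye, the pages to cut in half are simply all the remaining ones. A small sketch of that last step, assuming the list holds page numbers as strings or integers; `pages_to_split` is a hypothetical helper and is not part of the script below.

```python
def pages_to_split(num_pages, single_pages):
    """Pages treated as double: every page not listed as single."""
    singles = set(int(p) for p in single_pages)
    return [p for p in range(1, num_pages + 1) if p not in singles]


print(pages_to_split(6, ['1', '5']))  # [2, 3, 4, 6]
```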

For those who are interested, here is my code (in Python 2.7):

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Find pages that are not double pages.
# OS: Ubuntu
# Requirements: Python 2.7, pdftohtml (from poppler-utils)

import errno
import os
import re
import subprocess

NUM_OF_PAGES = 395
PDF_FILE = "Lenin06.pdf"
BASE = PDF_FILE[:-4]
# Files written by pdftohtml for every converted page
HTML_FILES = [BASE + ".html", BASE + "_ind.html", BASE + "s.html"]

# The first page line holds only a character entity: kept as a single page
ENTITY_ONLY = '<a name=[0-9]*></a>&#[0-9]*;<br/>'

# Loi tua (Introduction): header number in Roman numerals
ROMAN_HEADER = [
    '<a name=[0-9]*></a>[XIVLCDM]*<br/>',
    '<a name=[0-9]*></a>[XIVLCDM]*&#[0-9]*;<br/>',
    '<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*&#[0-9]*;<br/>',
    '<a name=[0-9]*></a>&#[0-9]*;[XIVLCDM]*<br/>',
]

# Trang doi (Double page): header number in Arabic numerals, plus one special title
ARABIC_HEADER = [
    '<a name=[0-9]*></a>&#[0-9]*;[0-9]*<br/>',
    '<a name=[0-9]*></a>[0-9]*<br/>',
    '<a name=[0-9]*></a>[0-9]*&#[0-9]*;<br/>',
    '<a name=[0-9]*></a>&#[0-9]*;[0-9]*&#[0-9]*;<br/>',
    '<a name=[0-9]*></a>V. &#160;I. &#160;L ª - n i n &#[0-9]*;<br/>',
]

# 1 so truong hop trang trai trong, trang phai co chu
# (Some cases where the left page is blank while the right page contains text)
BLANK_LEFT_FIRST = '<a name=[0-9]*></a>[^0-9&#;]*&#160;<br/>'
BLANK_LEFT_SECOND = '^[0-9]*&#[0-9]*;<br/>$'


def silentremove(filename):
    try:
        os.remove(filename)
    except OSError as e:  # this would be "except OSError, e:" before Python 2.6
        if e.errno != errno.ENOENT:  # errno.ENOENT = no such file or directory
            raise  # re-raise exception if a different error occurred


def matches_any(patterns, line):
    return any(re.search(pattern, line) for pattern in patterns)


def is_single_page(page):
    """Convert one page with pdftohtml and classify it from line 31 of the HTML."""
    subprocess.call(["pdftohtml", "-i", PDF_FILE, "-f", str(page), "-l", str(page)])
    try:
        with open(BASE + "s.html", 'rt') as html_fid:
            for _ in range(30):  # the first 30 lines are the same on every page
                html_fid.readline()
            line = html_fid.readline().strip()

            if "img" in line or re.search(ENTITY_ONLY, line):
                return True
            if matches_any(ROMAN_HEADER, line) or matches_any(ARABIC_HEADER, line):
                return False
            # The second pattern is tested against the line that follows
            if re.search(BLANK_LEFT_FIRST, line) and \
                    re.search(BLANK_LEFT_SECOND, html_fid.readline().strip()):
                return False
            return True
    finally:
        for name in HTML_FILES:
            silentremove(name)


# Pages 1 and 2 are hard-coded as single pages; pages 3 and 4 are skipped.
excps = ['1', '2']
for page in range(5, NUM_OF_PAGES + 1):
    if is_single_page(page):
        excps.append(str(page))

# Remove any images left behind in the working directory
for name in os.listdir("./"):
    if name.endswith((".png", ".jpg", ".jpeg")):
        silentremove(name)

# Pages detected as single; check these by eye
print(excps)

Here is the file: https://drive.google.com/open?id=1vjnebt3xEuY8odhZHPwL8pf26l8ySdnE (this is just one example; there are many more that need to be converted to single pages).

