pdfpages inclusion breaks PDF/A compliance

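I have run into a problem: including an ordinary PDF makes my document fail PDF/A validation because of the fonts it contains. My template uses pdfx for PDF/A compliance, I include the PDF with pdfpages, and I build with LuaLaTeX.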

\documentclass[a4paper]{report}
\usepackage{pdfpages}
\begin{document}
\tableofcontents
\includepdf{test.pdf}
\end{document}

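Unfortunately, I cannot provide test.pdf itself because it contains confidential information. The PDF has all of its fonts embedded, but it is not itself PDF/A compliant. When I validate the PDF (the whole document, including test.pdf) with veraPDF, I get: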

<?xml version="1.0" encoding="utf-8"?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.18.11" buildDate="2021-04-19T10:21:00+02:00"></releaseDetails>
    <releaseDetails id="validation-model" version="1.18.8" buildDate="2021-04-19T10:35:00+02:00"></releaseDetails>
    <releaseDetails id="gui" version="1.18.6" buildDate="2021-04-27T08:53:00+02:00"></releaseDetails>
  </buildInformation>
  <jobs>
    <job>
      <item size="718642">
        <name>/path/to/file.pdf</name>
      </item>
      <validationReport profileName="PDF/A-2B validation profile" statement="PDF file is not compliant with Validation Profile requirements." isCompliant="false">
        <details passedRules="121" failedRules="1" passedChecks="58863" failedChecks="2">
          <rule specification="ISO 19005-2:2011" clause="6.2.11.4" testNumber="4" status="failed" passedChecks="0" failedChecks="2">
            <description>If the FontDescriptor dictionary of an embedded CID font contains a CIDSet stream, then it shall identify all CIDs which are present in the font program,
            regardless of whether a CID in the font is referenced or used by the PDF or not.</description>
            <object>PDCIDFont</object>
            <test>fontFile_size == 0 || fontName.search(/[A-Z]{6}\+/) != 0 || CIDSet_size == 0 || cidSetListsAllGlyphs == true</test>
            <check status="failed">
              <context>root/document[0]/pages[1](26 0 obj PDPage)/contentStream[0](27 0 obj PDContentStream)/operators[7]/xObject[0]/contentStream[0](24 0 obj PDContentStream)/operators[72]/font[0](KYIDZT+SymbolMT)/DescendantFonts[0](KYIDZT+SymbolMT)</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[1](26 0 obj PDPage)/contentStream[0](27 0 obj PDContentStream)/operators[7]/xObject[0]/contentStream[0](24 0 obj PDContentStream)/operators[234]/font[0](MAYIVC+FrontPagePro-Medium)/DescendantFonts[0](MAYIVC+FrontPagePro-Medium)</context>
            </check>
          </rule>
        </details>
      </validationReport>
      <duration start="1675365559348" finish="1675365560315">00:00:00.967</duration>
    </job>
  </jobs>
  <batchSummary totalJobs="1" failedToParse="0" encrypted="0">
    <validationReports compliant="0" nonCompliant="1" failedJobs="0">1</validationReports>
    <featureReports failedJobs="0">0</featureReports>
    <repairReports failedJobs="0">0</repairReports>
    <duration start="1675365559279" finish="1675365560330">00:00:01.051</duration>
  </batchSummary>
</report>

Using pdffonts, I can see that all fonts are embedded in the PDF:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TTJGTY+Roboto-Regular                CID Type 0C       Identity-H       yes yes yes     13  0
QMJPLU+Roboto-Bold                   CID Type 0C       Identity-H       yes yes yes     15  0
MAYIVC+FrontPagePro-Medium           CID TrueType      Identity-H       yes yes yes     35  0
KYIDZT+SymbolMT                      CID TrueType      Identity-H       yes yes yes     36  0
PGCUGW+Charter-Bold                  TrueType          WinAnsi          yes yes no      37  0
DAESFN+FrontPageMedium               TrueType          WinAnsi          yes yes no      38  0
KUYJMB+FrontPagePro-Medium           TrueType          WinAnsi          yes yes no      39  0
IJGRYF+Charter                       TrueType          WinAnsi          yes yes no      40  0
BHNTXR+Stafford                      TrueType          WinAnsi          yes yes no      41  0
GUTOKG+ArialMT                       Type 1C           WinAnsi          yes yes no      42  0
KKKIUW+XCharter-Roman                CID Type 0C       Identity-H       yes yes yes     72  0
AIWZML+XCharter-Bold                 CID Type 0C       Identity-H       yes yes yes    117  0
FMADMD+XCharter-Italic               CID Type 0C       Identity-H       yes yes yes    158  0
KEZTQT+RobotoMono-Regular            CID Type 0C       Identity-H       yes yes yes    204  0

To break this down further, I ran pdffonts on the included document alone:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
MAYIVC+FrontPagePro-Medium           CID TrueType      Identity-H       yes yes yes     25  0
KYIDZT+SymbolMT                      CID TrueType      Identity-H       yes yes yes     15  0
PGCUGW+Charter-Bold                  TrueType          WinAnsi          yes yes no      20  0
DAESFN+FrontPageMedium               TrueType          WinAnsi          yes yes no       8  0
KUYJMB+FrontPagePro-Medium           TrueType          WinAnsi          yes yes no      10  0
IJGRYF+Charter                       TrueType          WinAnsi          yes yes no      12  0
BHNTXR+Stafford                      TrueType          WinAnsi          yes yes no      22  0
GUTOKG+ArialMT                       Type 1C           WinAnsi          yes yes no      18  0

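So I can see that all fonts are embedded, but the way they are encoded or packaged means my document is no longer PDF/A compliant. To verify this, I turned off the inclusion and checked the document without the included PDF; it passed PDF/A validation.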

If I run veraPDF on test.pdf (the included PDF), I get:

<?xml version="1.0" encoding="utf-8"?>
<report>
  <buildInformation>
    <releaseDetails id="core" version="1.18.11" buildDate="2021-04-19T10:21:00+02:00"></releaseDetails>
    <releaseDetails id="validation-model" version="1.18.8" buildDate="2021-04-19T10:35:00+02:00"></releaseDetails>
    <releaseDetails id="gui" version="1.18.6" buildDate="2021-04-27T08:53:00+02:00"></releaseDetails>
  </buildInformation>
  <jobs>
    <job>
      <item size="103546">
        <name>/path/to/test.pdf</name>
      </item>
      <validationReport profileName="PDF/A-1B validation profile" statement="PDF file is not compliant with Validation Profile requirements." isCompliant="false">
        <details passedRules="94" failedRules="7" passedChecks="9353" failedChecks="100">
          <rule specification="ISO 19005-1:2005" clause="6.3.7" testNumber="3" status="failed" passedChecks="0" failedChecks="5">
            <description>Font programs' "cmap" tables for all symbolic TrueType fonts shall contain exactly one encoding</description>
            <object>TrueTypeFontProgram</object>
            <test>isSymbolic == false || nrCmaps == 1</test>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[8]/font[0](DAESFN+FrontPageMedium)/fontFile[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[12]/font[0](KUYJMB+FrontPagePro-Medium)/fontFile[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[23]/font[0](IJGRYF+Charter)/fontFile[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[161]/font[0](PGCUGW+Charter-Bold)/fontFile[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[188]/font[0](BHNTXR+Stafford)/fontFile[0]</context>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.2.3" testNumber="2" status="failed" passedChecks="0" failedChecks="88">
            <description>DeviceRGB may be used only if the file has a PDF/A-1 OutputIntent that uses an RGB colour space</description>
            <object>PDDeviceRGB</object>
            <test>gOutputCS != null &amp;&amp; gOutputCS == "RGB "</test>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[2]/colorSpace[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[8]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[12]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[14]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[18]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[20]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[23]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[25]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[27]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[29]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[31]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[33]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[35]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[37]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[39]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[41]/fillCS[0]</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[44]/fillCS[0]</context>
            </check>

          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.1.7" testNumber="2" status="failed" passedChecks="0" failedChecks="1">
            <description>The stream keyword shall be followed either by a CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) character sequence
            or by a single LINE FEED character. The endstream keyword shall be preceded by an EOL marker</description>
            <object>CosStream</object>
            <test>streamKeywordCRLFCompliant == true &amp;&amp; endstreamKeywordEOLCompliant == true</test>
            <check status="failed">
              <context>root/indirectObjects[40](6 0)/directObject[0]</context>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.1.8" testNumber="1" status="failed" passedChecks="0" failedChecks="3">
            <description>The object number and generation number shall be separated by a single white-space character. The generation number and obj keyword 
    shall be separated by a single white-space character. The object number and endobj keyword shall each be preceded by an EOL marker. The obj and endobj
    keywords shall each be followed by an EOL marker.</description>
            <object>CosIndirect</object>
            <test>spacingCompliesPDFA</test>
            <check status="failed">
              <context>root/indirectObjects[1](29 0)</context>
            </check>
            <check status="failed">
              <context>root/indirectObjects[18](5 0)</context>
            </check>
            <check status="failed">
              <context>root/indirectObjects[44](2 0)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.7.11" testNumber="1" status="failed" passedChecks="0" failedChecks="1">
            <description>The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema.</description>
            <object>MainXMPPackage</object>
            <test>Identification_size == 1</test>
            <check status="failed">
              <context>root/document[0]/metadata[0](46 0 obj PDMetadata)/XMPPackage[0]</context>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.4" testNumber="2" status="failed" passedChecks="0" failedChecks="1">
            <description>An XObject dictionary shall not contain the SMask key</description>
            <object>PDXObject</object>
            <test>containsSMask == false</test>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/contentStream[0](6 0 obj PDContentStream)/operators[291]/xObject[0](31 0 obj PDXImage)</context>
            </check>
          </rule>
          <rule specification="ISO 19005-1:2005" clause="6.4" testNumber="3" status="failed" passedChecks="0" failedChecks="1">
            <description>A Group object with an S key with a value of Transparency shall not be included in a form XObject. 
            A Group object with an S key with a value of Transparency shall not be included in a page dictionary</description>
            <object>PDGroup</object>
            <test>S != "Transparency"</test>
            <check status="failed">
              <context>root/document[0]/pages[0](4 0 obj PDPage)/Group[0](5 0 obj PDGroup)</context>
            </check>
          </rule>
        </details>
      </validationReport>
      <duration start="1675365944670" finish="1675365945226">00:00:00.556</duration>
    </job>
  </jobs>
  <batchSummary totalJobs="1" failedToParse="0" encrypted="0">
    <validationReports compliant="0" nonCompliant="1" failedJobs="0">1</validationReports>
    <featureReports failedJobs="0">0</featureReports>
    <repairReports failedJobs="0">0</repairReports>
    <duration start="1675365944608" finish="1675365945246">00:00:00.638</duration>
  </batchSummary>
</report>

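I am trying to work out whether the cmap errors that veraPDF reports for test.pdf are related to the CID problem it flags in the main document. Essentially, my question is: what do I need to do to this PDF so that pdfx can make the whole document PDF/A compliant?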
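
One way to compare the two reports is to list, for each failed rule, the fonts named in its check contexts. The following is a minimal Python sketch, not part of veraPDF: the function name `failed_rules_by_font` and the trimmed `SAMPLE_REPORT` are illustrative assumptions, while the element and attribute names (`rule`, `status`, `clause`, `context`) are taken from the reports above.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed stand-in for a veraPDF XML report (contexts shortened for brevity).
# A real report could be loaded with ET.parse("report.xml").getroot() instead;
# "report.xml" is a hypothetical file name.
SAMPLE_REPORT = """<report>
  <jobs>
    <job>
      <validationReport profileName="PDF/A-2B validation profile" isCompliant="false">
        <details passedRules="121" failedRules="1">
          <rule specification="ISO 19005-2:2011" clause="6.2.11.4" testNumber="4" status="failed">
            <check status="failed">
              <context>root/document[0]/pages[1]/font[0](KYIDZT+SymbolMT)/DescendantFonts[0](KYIDZT+SymbolMT)</context>
            </check>
            <check status="failed">
              <context>root/document[0]/pages[1]/font[0](MAYIVC+FrontPagePro-Medium)/DescendantFonts[0](MAYIVC+FrontPagePro-Medium)</context>
            </check>
          </rule>
        </details>
      </validationReport>
    </job>
  </jobs>
</report>"""

# Matches the font name in a context segment such as "font[0](KYIDZT+SymbolMT)".
FONT_RE = re.compile(r"font\[\d+\]\(([^)]+)\)")


def failed_rules_by_font(root):
    """Map each failed (specification, clause, testNumber) to the sorted
    list of font names that appear in its check contexts."""
    fonts = defaultdict(set)
    for rule in root.iter("rule"):
        if rule.get("status") != "failed":
            continue
        key = (rule.get("specification"), rule.get("clause"), rule.get("testNumber"))
        fonts[key]  # create the entry even if no context names a font
        for context in rule.iter("context"):
            match = FONT_RE.search(context.text or "")
            if match:
                fonts[key].add(match.group(1))
    return {key: sorted(names) for key, names in fonts.items()}


if __name__ == "__main__":
    summary = failed_rules_by_font(ET.fromstring(SAMPLE_REPORT))
    for (spec, clause, test_number), names in summary.items():
        print(spec, clause, test_number, names)
```

Run against both reports, this would show which subset-tagged fonts each failed clause refers to. Note that the two reports above use different profiles (PDF/A-2B for the full document, PDF/A-1B for test.pdf), so their clause numbers are not directly comparable.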

Answer 1

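This is not a new problem. I do not know whether this suggestion works, but here is a link to a PDF file (dated 2022) that contains information about your problem. See Section 4.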

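Although the fonts are all embedded, they are not embedded "in the same way" as far as PDF/A is concerned. I do not think any modification to the pdfx package code could fix this. So the trick is to take the original PDF apart and then rebuild it with PDF software (not LaTeX). The second build then embeds all fonts "in the same way". That is what the linked file says.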

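Edit: The Ghostscript code previously posted here had problems. Further investigation showed that using Ghostscript may fix one problem but cause others (such as erasing metadata or changing font names). As the OP found, the best approach is to use external software.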
