Verifying the integrity of PDF files

Question 1

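Checking whether PDF files are valid is easy with PDFtk. A free GUI for PDFtk is available from PDF Labs. When you run the tool, you can load any number of PDFs from multiple directories (using the Add Files button), and it quickly starts accessing the pages in those PDF files.

If any of the selected files is not a valid PDF, the utility shows a message about the error and automatically removes the file from the selection window.

Checking files this way with PDFtk saves a lot of time. If you have a multi-core CPU, you can also run several instances of the utility and load hundreds of PDFs into each one.

I have been using this software since last year, and it is the most convenient PDF tool I have used.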

Question 2

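I used "pdfinfo.exe" from the xpdfbin-win package together with cpdf.exe to check PDF files for corruption, but I didn't want to involve binaries unless it was necessary.

I read that newer PDF formats have a readable XML data catalog at the end, so I opened a PDF in plain Windows NOTEPAD.exe, scrolled down past the unreadable data to the end, and saw several readable keys. I only needed one key, but chose to use both CreationDate and ModDate.

The following PowerShell (PS) script checks all PDF files in the current directory and writes the status of each one to a text file (!RESULTS.log). Running it against 35,000 PDF files took about 2 minutes. I tried to add comments for those new to PS. Hopefully this saves someone some time. There may be better ways to do this, but it works flawlessly for my purposes and handles errors silently. If you see errors on screen, you may need to define the following at the beginning: $ErrorActionPreference = "SilentlyContinue".

Copy the following into a text file and name it appropriately (e.g. CheckPDF.ps1), or open PS, browse to the directory containing the PDF files to check, and paste it into the console.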

#
# PowerShell v4.0
#
# Get all PDF files in current directory
#
$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}

$logFile = "!RESULTS.log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
    #
    # Suppress error messages
    #
    trap { Write-Output "Error trapped"; continue; }

    #
    # Read raw PDF data
    #
    $pdfText = Get-Content -LiteralPath $item.FullName -Raw

    #
    # Find string (near end of PDF file); IndexOf returns -1 when the key is missing
    #
    $ptr1 = $pdfText.IndexOf("CreationDate")
    $ptr2 = $pdfText.IndexOf("ModDate")

    #
    # Grab raw dates from file - Substring throws if ptr is -1 or too few characters remain
    #
    try { $cDate = $pdfText.SubString($ptr1, 37); $mDate = $pdfText.SubString($ptr2, 31); }

    #
    # Append filename and bad status to logfile and increment a counter
    # catch block is also where you would rename, move, or delete bad files.
    #
    catch { "*** $item is Broken ***" >> $logFile; $badCounter += 1; continue; }

    #
    # Append filename and good status to logfile
    #
    Write-Output "$item - OK" -EA "Stop" >> $logFile

    #
    # Increment a counter
    #
    $goodCounter += 1
}
#
# Calculate total
#
$totalCounter = $badCounter + $goodCounter

#
# Append 3 blank lines to end of logfile
#
1..3 | %{ Write-Output "" >> $logFile }

#
# Append statistics to end of logfile
#
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
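
For readers on Linux or macOS, the same key-based heuristic can be sketched in bash with grep. This is only an approximation of the script above, not a port: the function names looks_intact and check_dir are hypothetical, and the sketch only tests that both keys are present, without extracting the date strings.

```shell
#!/usr/bin/env bash
# Sketch: flag a PDF as intact when both date keys can be found in its bytes.
# looks_intact and check_dir are hypothetical helper names.
looks_intact() {
    local file=$1
    # -a treats binary data as text; -q only sets the exit status.
    grep -a -q 'CreationDate' "$file" && grep -a -q 'ModDate' "$file"
}

# Print one status line per PDF in a directory (default: current directory).
check_dir() {
    local dir=${1:-.} file
    for file in "$dir"/*.pdf; do
        [ -e "$file" ] || continue   # no matches: the glob stays literal
        if looks_intact "$file"; then
            echo "$file - OK"
        else
            echo "*** $file is Broken ***"
        fi
    done
}
```

Calling `check_dir . >> '!RESULTS.log'` would append the status lines to the same log file name used above. As with the PowerShell version, a PDF that simply has no date keys is reported as broken.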

Question 3

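Following in the footsteps of @n0nuf, I wrote a batch script that uses pdfinfo to check all PDFs in a specific folder and, if one is found to be broken, pushes it through cpdf to try to fix it: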

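@ECHO OFF
REM Make sure the backup folder exists before anything is moved into it
IF NOT EXIST bak MKDIR bak
FOR %%f in (*.PDF) DO (
    echo %%f
    REM findstr sets errorlevel 0 when pdfinfo printed "error"
    pdfinfo "%%f" 2>&1 | findstr /I "error"  >nul 2>&1
    if not errorlevel 1 (
        echo "bad -> try to fix"
        cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        move "%%f" ".\bak\%%f" >NUL
    ) else (
       REM echo good
    )
)
@ECHO ON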

Or the same as a bash script:

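# Read NUL-separated names so paths with spaces survive;
# skip *_.pdf so files recreated by this loop are not picked up again
find . -iname "*.pdf" ! -iname "*_.pdf" -print0 | while IFS= read -r -d '' file
do
    echo "$file"
    # grep -q exits 0 when pdfinfo printed "error"
    if pdfinfo "$file" 2>&1 | grep -qi 'error'; then
       echo "broken -> try to fix"
       cpdf -i "$file" -o "$file"_.pdf
    fi
done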

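Broken PDFs are moved to the subfolder \bak, and the recreated PDFs get the suffix _.pdf (not perfect, but good enough for me). Note: the recreated PDFs contain fewer errors and should be viewable with a regular PDF viewer. That does not mean everything is recovered, though: content that cannot be recovered results in blank pages.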
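
The bash variant does not include the move into a backup folder. A minimal sketch of that step, assuming the backup folder should be a bak directory next to the damaged file (backup_original is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Sketch: move a damaged PDF into a bak/ folder beside it.
# backup_original is a hypothetical helper name.
backup_original() {
    local file=$1 dir
    dir=$(dirname -- "$file")
    mkdir -p -- "$dir/bak"        # create the folder on first use
    mv -- "$file" "$dir/bak/"     # keeps the original file name
}
```

It would be called after the cpdf line, for example `backup_original "$file"`, so that the repaired copy is written before the original is moved away.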

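I also tried the same thing with JHOVE (an open source file format identification, validation and characterization tool), as suggested by @kraftydevil in "Check if PDF files are corrupted using command line on Linux", and can now confirm that this is a valid approach as well. (At first I wasn't very successful, but then I noticed that I wasn't handling JHOVE's output correctly.)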

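To test both approaches, I removed and altered random parts of a PDF with a text editor (deleted streams so that pages would not render in my PDF viewer, changed PDF tags, and shifted some bits). The result: both pdfinfo and JHOVE correctly identified the broken files (JHOVE was even more sensitive in some cases).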
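
A quick way to produce a damaged test file without a text editor is to cut a copy short. This is a different kind of damage from the manual edits described above (truncation rather than altered streams or tags), and make_truncated_copy is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: write a copy of a file that keeps only the first half of its bytes.
# make_truncated_copy is a hypothetical helper name.
make_truncated_copy() {
    local src=$1 dst=$2 size
    size=$(wc -c < "$src")                  # size of the source in bytes
    head -c "$((size / 2))" "$src" > "$dst" # keep the first half only
}
```

For example, `make_truncated_copy sample.pdf sample_cut.pdf` leaves sample.pdf untouched and writes the shortened copy, which can then be fed to either checker.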

Here is the equivalent script for JHOVE:

@ECHO OFF
FOR %%f in (*.PDF) DO (
    echo %%f
    "C:\Program Files (x86)\JHOVE\jhove.bat" -m pdf-hul "%%f" | findstr /C:"Well-Formed and valid" >nul 2>&1
    if not errorlevel 1 (
        echo good
    ) else (
        echo "bad -> try to fix"
        cpdf -i "%%f" -o "%%f_.pdf" 2>NUL
        REM move "%%f" ".\bak\%%f"
    )
    )
)
@ECHO ON
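
Since handling JHOVE's output is the part that is easy to get wrong, here is a small bash sketch of just that check. It relies only on the phrase "Well-Formed and valid" used in the script above; jhove_says_valid is a hypothetical helper name, and the `jhove` launcher name in the usage note is an assumption about the local install.

```shell
#!/usr/bin/env bash
# Sketch: read a JHOVE report on stdin and succeed only if it contains
# the exact phrase "Well-Formed and valid".
# jhove_says_valid is a hypothetical helper name.
jhove_says_valid() {
    # -F matches the phrase literally; -q only sets the exit status.
    grep -q -F 'Well-Formed and valid'
}
```

It could be used as `jhove -m pdf-hul "$file" | jhove_says_valid || echo "bad -> try to fix"`. Matching the whole phrase, rather than a single word such as "valid", means any report that lacks it is treated as a failure.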

Answer