How can I search for all red text in a PDF file?

Question 1

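The answer was to use MS Word to find where the red text sits in the source file, because the only text in that PDF that is actually stored as red is "A carrier is requi". All the other text that appears red is black TrueType outline text of two kinds (inferred as a black outer stroke and a black inner fill), whose color can then be changed to something else during rendering.

So what you want are the objects in the file (which may include black text) whose stroke or fill uses a color with a high red content, i.e. you are probably looking for 1 0 0 or something similar, depending on the source.

One simple way is to export the PDF to HTML, where black text that is rendered red is easier to detect with a plain text search.

color:#ff0000">BermudAir  (0710/2030)

For that particular PDF, here is the first example from the decoded PDF: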

/Span <</MCID 6/Lang(en-US)>> BDC
.00000912 0 612 792 re
W*
n
1 0 0 rg
1 0 0 RG
BT
/F1 12 Tf
1 0 0 1 54.025 643.95 Tm
[(Be)-5.0000007(rmu)-6.0000007(d)-6.0000007(Ai)-6.0000007(r )7.0000007( )7.0000007(\(0)-6.0000007(7)14.000001(1)-6.0000007(0)-6.0000007(/)7.0000007(2)-6.0000007(0)-6.0000007(3)14.000001(0)-6.0000007(\))] TJ
ET

We can clearly see the red color being set

1 0 0 rg
1 0 0 RG

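immediately followed by the characters between begin text (BT) and end text (ET), which contain enough letters to decode and render as BermudAir.

However, that part of page 6 is heavily encoded and has to be decoded before it can be read in Notepad.

With PDFtk, qpdf, or many other PDF tools, decompressing the PDF takes a single line (cross-platform). On Linux it is then easy to find the stretch of text running from 1 0 0 rg or 1 0 0 RG to ET, but on Windows it is not so simple.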
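As a rough, platform-neutral illustration of that search, here is a minimal Python sketch that walks an already decompressed content stream and collects the text of every BT ... ET block drawn while the fill color is 1 0 0 rg. It is only a sketch under several assumptions: one operator per line (as in the excerpt above), a simple font whose strings are readable literals, and no graphics-state save/restore between the color change and the text. The names FILL_RE, STRING_RE, unescape and red_text_blocks are made up for this example.

```python
import re

# Sample copied from the decoded stream shown above.
STREAM = r"""
1 0 0 rg
1 0 0 RG
BT
/F1 12 Tf
1 0 0 1 54.025 643.95 Tm
[(Be)-5.0000007(rmu)-6.0000007(d)-6.0000007(Ai)-6.0000007(r )7.0000007( )7.0000007(\(0)-6.0000007(7)14.000001(1)-6.0000007(0)-6.0000007(/)7.0000007(2)-6.0000007(0)-6.0000007(3)14.000001(0)-6.0000007(\))] TJ
ET
"""

# "r g b rg" sets the non-stroking (fill) color; "RG" (stroke) is ignored here.
FILL_RE = re.compile(r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+rg\s*$")
# A PDF literal string: parentheses with backslash escapes, no nesting.
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)")


def unescape(literal):
    """Strip the outer parentheses and undo \\( \\) \\\\ escapes only."""
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


def red_text_blocks(stream):
    """Return the text of each BT..ET block drawn with fill color 1 0 0."""
    fill = (0.0, 0.0, 0.0)
    in_text = False
    parts = []
    found = []
    for line in stream.splitlines():
        match = FILL_RE.match(line)
        if match:
            fill = tuple(float(v) for v in match.groups())
        elif line.strip() == "BT":
            in_text, parts = True, []
        elif line.strip() == "ET":
            if in_text and fill == (1.0, 0.0, 0.0) and parts:
                found.append("".join(parts))
            in_text = False
        elif in_text and line.rstrip().endswith(("TJ", "Tj")):
            # Keep the string pieces, drop the kerning numbers between them.
            parts.extend(unescape(s) for s in STRING_RE.findall(line))
    return found


if __name__ == "__main__":
    print(red_text_blocks(STREAM))
```

Fonts with custom encodings or hex strings would need a real PDF library; this only shows why the uncompressed stream is searchable at all.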

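Answer

Simpler still on Windows is the PDF-to-HTML approach: converting and then running findstr on #ff0000 can all go on one line. In a batch file we can then strip everything before the core text, up to and including color:#ff0000">, and optionally the </span></p> that follows it.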

mutool convert -F html -o text.htm "Signatory VWP Carriers 2023-08.pdf"

C:\mupdf\1.20.0>findstr /i "color:#ff0000" text.htm

<p style="top:138.4pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">BermudAir  (0710/2030) </span></p>
<p style="top:497.3pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Challenge Air Cargo, Ltd  (07/10/2030) </span></p>
<p style="top:469.8pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">ES Windows Aviation  (07/10/2030) </span></p>
<p style="top:276.5pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Kelowna Flightcraft Air Charter dba KF Aeroflyer  (07/10/2030) </span></p>
<p style="top:387.0pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Normandie Administradora de Bens E Partipacoes LTDA  (07/05/2030)  </span></p>
<p style="top:138.4pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Royal Company  (07/10/2030)  </span></p>
<p style="top:359.3pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">SLVR Air LLC  (07/07/2030) </span></p>
<p style="top:179.9pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Stebbins Aviation Inc.  (07/11/2030) </span></p>
<p style="top:138.4pt;left:54.0pt;line-height:12.0pt"><span style="font-family:Arial,sans-serif;font-size:12.0pt;color:#ff0000">Volair Lineas Aereas del Caribe SA LLC  (07/11/2030)  </span></p>

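@echo off & Title "Searching for red text"
REM Require a .pdf argument (command line or drag and drop); /i ignores case
if /i not "%~x1" == ".pdf" echo I need pdf to re[a]d & pause & exit /b
REM Adjust this path to wherever mutool.exe lives on your machine
set "mutool=C:\Users\lez\Downloads\Apps\PDF\mupdf\1.20.0\mutool.exe"

REM Convert to HTML, keep only the red spans, then print the text between the tags
"%mutool%" convert -F html -o "%tmp%\_txt.htm" "%~1"
findstr /i "color:#ff0000" "%tmp%\_txt.htm">"%tmp%\_txt.txt"
for /f "usebackq tokens=4* delims=#<>" %%a in ("%tmp%\_txt.txt") do echo %%a

REM DECIDE WHAT YOU WANT WITH THE ABOVE OUTPUT
pause>nul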
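For comparison, the same filtering step can be sketched in Python, which avoids depending on token positions in the HTML line. This is a minimal sketch, assuming the HTML has the shape of the mutool output shown above (one non-nested span per red line, with the color given exactly as color:#ff0000 in the style attribute). RedSpanCollector, is_red and red_lines are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


def is_red(style):
    """True if the style attribute declares the text color as #ff0000."""
    for declaration in style.split(";"):
        name, _, value = declaration.partition(":")
        # Compare the property name exactly so background-color is not matched.
        if name.strip().lower() == "color" and value.strip().lower() == "#ff0000":
            return True
    return False


class RedSpanCollector(HTMLParser):
    """Collects the text of every <span> whose inline style is red."""

    def __init__(self):
        super().__init__()
        self.in_red = False
        self.buffer = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self.in_red = is_red(style)
            self.buffer = []

    def handle_data(self, data):
        if self.in_red:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_red:
            self.found.append("".join(self.buffer).strip())
            self.in_red = False


def red_lines(html):
    """Return the stripped text of each red span, in document order."""
    parser = RedSpanCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

With the converted file on disk, something like red_lines(open("text.htm", encoding="utf-8").read()) would return the carrier names as a list, assuming the file is UTF-8.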

Question 2

On page 22 of the sample in your question, a red line produces the following string once it has been converted to HTML:

<div class="... baseline;color:rgba(255,0,0,1);">Royal Company (07/10/2030)</span>...;

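Since the string ..rgba(255,0,0,1... appears on the same line as the text to be extracted, simply convert your PDF to HTML and use a double For /F loop to list the "Page" files and, in order of appearance (first line to last), pull the lines in the resulting files that contain the red color string:

1. Use pdftohtml.exe to convert Input.pdf to \HTML_DIR\OutPut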

2>nul "C:\Full\Path\To\bin\pdftohtml.exe" "C:\Full\Path\To\File\Signatory VWP Carriers 2023-08.pdf" "%temp%\Signatory VWP Carriers 2023-08"

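Note: the 2>nul only hides the font-not-found messages (shown when the font is not installed on the system), which clearly does not interfere with getting the lines of interest.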

Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'

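2. Get the full path of each .html file in one loop (page by page), and in a second loop extract the red lines (already filtered down to the lines of interest by the find command):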

for /f ^usebackq^delims^= %i in (`dir /b /o:d /a:-d "%temp%\Signatory VWP Carriers 2023-08\Page*.html"`)do @for /f ^eol^=^)^usebackq^tokens^=^3^delims^=^<^> %G in (`@find /i "rgba(255" ^< "%temp%\Signatory VWP Carriers 2023-08\%~nxi"`)do @echo/%~G

3. Remove the folder created by the PDF-to-HTML conversion:

2>nul RMDir "%temp%\Signatory VWP Carriers 2023-08" /s /q

4. Remove the extra space at the end of each line of output with the eol option:

for /f "eol=) usebackq ...

In a bat file this becomes:

@echo off 

set "_pdftohtml=C:\Full\Path\To\bin\pdftohtml.exe"
2>nul RMDir "%temp%\Signatory VWP Carriers 2023-08" /s /q

2>nul "%_pdftohtml%" "C:\Full\Path\To\Signatory VWP Carriers 2023-08.pdf" "%temp%\Signatory VWP Carriers 2023-08"

for /f ^usebackq^delims^= %%i in (`dir /b /o:d /a:-d "%temp%\Signatory VWP Carriers 2023-08\Page*.html"
   `)do for /f ^eol^=^)^usebackq^tokens^=^3^delims^=^<^> %%G in (`find /i "rgba(255" ^< "%temp%\Signatory VWP Carriers 2023-08\%%~nxi"
        `)do echo/%%~G

2>nul RMDir "%temp%\Signatory VWP Carriers 2023-08" /s /q
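The find /i "rgba(255" filter only catches colors whose red channel is exactly 255. If a document uses a slightly different red, the per-line test could be loosened along these lines. This is a minimal Python sketch, not part of the batch solution: the thresholds min_red and max_other are arbitrary assumptions, the names RGBA_RE, is_mostly_red and red_text are invented for the example, and it assumes the text sits directly after the tag that carries the color, as in the line quoted from page 22.

```python
import re

# Match "color:rgba(r,g,b" or "color:rgb(r,g,b" but not "background-color:...".
RGBA_RE = re.compile(
    r"(?<![-\w])color:\s*rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)"
)


def is_mostly_red(r, g, b, min_red=200, max_other=60):
    """Heuristic for 'high red content'; both thresholds are assumptions."""
    return r >= min_red and g <= max_other and b <= max_other


def red_text(html_line):
    """Return the text that follows a mostly-red color declaration, or None."""
    match = RGBA_RE.search(html_line)
    if not match or not is_mostly_red(*map(int, match.groups())):
        return None
    # The text runs from the '>' that closes the tag up to the next '<'.
    text = re.search(r">([^<]*)<", html_line[match.end():])
    return text.group(1).strip() if text else None
```

Applied to each line of every Page*.html file, red_text would return the carrier name for red lines and None for everything else.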

The same bat, but taking the PDF file as an argument or via drag and drop:

@echo off

if not "%~x1" == ".pdf" goto :eOf
set "_pdftohtml=C:\Full\Path\To\bin\pdftohtml.exe"

echo/Stay there!.. I'm working on the file "%~f1"
2>nul =;( RMDir "%temp%\%~n1" /s /q & "%_pdftohtml%" "%~f1" "%temp%\%~n1" );=

for /f ^usebackq^delims^= %%i in =;(`dir /b /o:d /a:-d "%temp%\%~n1"`
    );= do for /f ^eol^=^)^usebackq^ ^tokens^=^3^ ^delims^=^<^> %%G in =;(`
         find /i "rgba(255" ^< "%temp%\%~n1\%%~nxi"`)do echo/%%~G

2>nul RMDir "%temp%\%~n1" /s /q

Same as above, but nothing is printed; the lines are copied to your clipboard, ready to paste wherever you need them:

@echo off

if not "%~x1" == ".pdf" goto :eOf
set "_pdftohtml=C:\Full\Path\To\bin\pdftohtml.exe"

echo/Stay there!.. I'm working on the file "%~f1"
2>nul =;( RMDir "%temp%\%~n1" /s /q & "%_pdftohtml%" "%~f1" "%temp%\%~n1" );=

=;(@for /f "usebackq delims=" %%i in =;(`
     @dir /b /o:d /a:-d "%temp%\%~n1"
        `);= do @for /f "eol=) usebackq tokens=3 delims=<>" %%G in =;(`
             @find /i "rgba(255" ^^^< "%temp%\%~n1\%%~nxi"
               `);= do @echo/%%~G);=|clip & 2>nul RMDir "%temp%\%~n1" /s /q

To get the lines in a text file, OutPutFile.txt:

@echo off

if not "%~x1" == ".pdf" goto :eOf
set "_pdftohtml=C:\Full\Path\To\bin\pdftohtml.exe"

echo/Stay there!.. I'm working on the file "%~f1"
2>nul =;( RMDir "%temp%\%~n1" /s /q & "%_pdftohtml%" "%~f1" "%temp%\%~n1" );=

=;(@for /f "usebackq delims=" %%i in =;(`
     @dir /b /o:d /a:-d "%temp%\%~n1"
        `);= do @for /f "eol=) usebackq tokens=3 delims=<>" %%G in =;(`
             @find /i "rgba(255" ^< "%temp%\%~n1\%%~nxi"
               `);= do @echo/%%~G);= > "C:\Full\Path\To\OutPutFile.txt"

2>nul RMDir "%temp%\%~n1" /s /q

Additional resources:

cd /d
if /?
For /?
For /F /?
Conditional Execution
- || and &&
Command Redirection
- |, <, >, 2>, etc.

pdftohtml version 4.04

How does the Windows command interpreter (cmd.exe) parse scripts?

Answer