Search PDF content with PowerShell and output a list of files

Answer 1

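PowerShell with iTextSharp. The script below evaluates the text on every page of each PDF for keywords, then exports any matches to a CSV. If a match is found, you could use it to rename the files, move them into category folders, and so on.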

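Edit: The GitHub page for iTextSharp indicates that it has been discontinued and links to iText 7, https://github.com/itext/itext7-dotnet (dual-licensed as AGPL/commercial software; it appears to be free for non-commercial use).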

Add-Type -Path "C:\path_to_dll\itextsharp.dll"
$pdfs = Get-ChildItem -Path "C:\path_to_pdfs" -Filter *.pdf
$export = "C:\path_to_export\export.csv"
$results = @()
$keywords = @('Keyword1','Keyword2','Keyword3')

foreach($pdf in $pdfs) {

    Write-Host "processing -" $pdf.FullName

    # prepare the pdf
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName

    # for each page
    for($page = 1; $page -le $reader.NumberOfPages; $page++) {
    
        # set the page text
        $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)

        # if the page text contains any of the keywords we're evaluating
        foreach($keyword in $keywords) {
            if($pageText -match $keyword) {
                # [pscustomobject] keeps the property order, so the CSV columns
                # come out as keyword, page, file
                $results += [pscustomobject]@{
                    keyword = $keyword
                    page    = $page
                    file    = $pdf.FullName
                }
            }
        }
    }
    $reader.Close()
}

Write-Host ""
Write-Host "done"

$results | Export-Csv -Path $export -NoTypeInformation

Console output:

processing - C:\path_to_pdfs\1.pdf
processing - C:\path_to_pdfs\2.pdf
processing - C:\path_to_pdfs\3.pdf
processing - C:\path_to_pdfs\4.pdf
processing - C:\path_to_pdfs\5.pdf

done
PS C:\>

CSV output:

keyword    page    file
Keyword2   14      C:\path_to_pdfs\3.pdf
Keyword3   22      C:\path_to_pdfs\3.pdf
Keyword1   6       C:\path_to_pdfs\5.pdf
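
As a follow-up to the remark about moving matched files into category folders, here is a minimal sketch, assuming the CSV has the keyword and file columns shown above. The function name Move-PdfByKeyword and the -DestinationRoot parameter are hypothetical names introduced for illustration. It also assumes each keyword is a valid folder name, and a file that matches several keywords is moved according to its first row only.

```powershell
# Hypothetical helper: moves each matched PDF into a folder named after its keyword.
function Move-PdfByKeyword {
    param(
        [Parameter(Mandatory)] [string] $CsvPath,          # CSV written by the search script
        [Parameter(Mandatory)] [string] $DestinationRoot   # assumed root for the category folders
    )

    # A file can appear on several rows (one per keyword/page hit);
    # keep only the first row for each file.
    $firstHits = Import-Csv -Path $CsvPath |
        Group-Object -Property file |
        ForEach-Object { $_.Group[0] }

    foreach ($hit in $firstHits) {
        $target = Join-Path -Path $DestinationRoot -ChildPath $hit.keyword

        # -Force creates missing parent folders and does not fail if the folder exists.
        New-Item -ItemType Directory -Path $target -Force | Out-Null

        Move-Item -LiteralPath $hit.file -Destination $target
    }
}
```

It could be called as Move-PdfByKeyword -CsvPath "C:\path_to_export\export.csv" -DestinationRoot "C:\path_to_pdfs\sorted", where the "sorted" folder is only an example location.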

Answer 2

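If the contents of your PDF files are indexed by Windows Search, you can query the system index directly. You may need to install an iFilter to make sure Windows indexes PDFs. This approach works for PDFs, text files, xlsx files, and other indexed file types.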

$searchString = "foo"
$searchPath = "C:\Users\Uzer\Searchfolder"
$sql = "SELECT System.ItemPathDisplay, System.DateModified, " +
       "System.Size, System.FileExtension FROM SYSTEMINDEX " +
       "WHERE SCOPE = '$searchPath' AND FREETEXT('$searchstring')"
$provider = "provider=search.collatordso;extended properties='application=windows';"
$connector = new-object system.data.oledb.oledbdataadapter -argument $sql, $provider 
$dataset = new-object system.data.dataset 
if ($connector.fill($dataset)) { $dataset.tables[0] }
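
Because the query above is assembled by string interpolation, a search term that contains a single quote would break it. The following is a minimal sketch of a query builder, assuming single quotes are escaped by doubling them as in standard SQL. New-SearchIndexQuery and its parameters are hypothetical names, and the optional filter on System.FileExtension (assumed to hold the extension with its leading dot) is an addition that is not part of the original query.

```powershell
# Hypothetical helper: builds the Windows Search SQL text used above.
function New-SearchIndexQuery {
    param(
        [Parameter(Mandatory)] [string] $SearchString,
        [Parameter(Mandatory)] [string] $SearchPath,
        [string] $Extension   # optional, e.g. '.pdf' (assumed to include the leading dot)
    )

    # Assumption: a single quote inside a literal is escaped by doubling it.
    $term  = $SearchString.Replace("'", "''")
    $scope = $SearchPath.Replace("'", "''")

    $sql = "SELECT System.ItemPathDisplay, System.DateModified, " +
           "System.Size, System.FileExtension FROM SYSTEMINDEX " +
           "WHERE SCOPE = '$scope' AND FREETEXT('$term')"

    # Optionally restrict the results to one file type.
    if ($Extension) {
        $sql += " AND System.FileExtension = '" + $Extension.Replace("'", "''") + "'"
    }

    $sql
}
```

The returned string can be passed to the OleDbDataAdapter in place of $sql, for example: $sql = New-SearchIndexQuery -SearchString "foo" -SearchPath "C:\Users\Uzer\Searchfolder" -Extension ".pdf"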

Answer 3

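You can use Get-Content to find specific content in a file. It reads the file as text, so it suits plain-text files; the text inside a PDF is usually compressed or encoded and generally will not match this way.

Example: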

$searchstring = "foo"
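# -File skips directories, which Get-Content cannot read
$directory = Get-ChildItem -Path C:\temp\ -Recurse -File

foreach ($obj in $directory) {
    Get-Content $obj.FullName | Where-Object { $_.Contains($searchstring) } | ForEach-Object {
        # do something with each matching line...
    }
}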

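The $searchstring variable holds the word to search for in the files. The $directory variable holds the files, found under the given directory, that will be searched for that string.

More information about the Get-Content cmdlet is available in the PowerShell documentation.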
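
Since the goal is a list of files rather than the matching lines, a related option is Select-String, which records the file each match came from. The following is a minimal sketch, assuming plain-text files; Get-FilesContaining is a hypothetical helper name. -SimpleMatch treats the search string literally and -CaseSensitive mirrors the case-sensitive behavior of .Contains, while -List stops after the first match in each file.

```powershell
# Hypothetical helper: returns the full paths of files that contain the search string.
function Get-FilesContaining {
    param(
        [Parameter(Mandatory)] [string] $Directory,
        [Parameter(Mandatory)] [string] $SearchString
    )

    Get-ChildItem -Path $Directory -Recurse -File |
        # -List reports only the first match per file, so each path appears once.
        Select-String -Pattern $SearchString -SimpleMatch -CaseSensitive -List |
        Select-Object -ExpandProperty Path
}
```

For example, Get-FilesContaining -Directory C:\temp\ -SearchString "foo" would emit one path per matching file, which can then be piped to Out-File or Export-Csv.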
