Can we search a PDF file for pages that contain several words, in no particular order?

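I want to search a PDF file for all pages that each contain several given words, in no particular order. For example, I want to find all pages that contain both "hello" and "world" (in any order).

I am not sure whether pdfgrep can do this.

I am trying to do something similar to searching for multiple words in a book on Google Books.

Thanks.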

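Answer 1

Yes. If you use the -P option (which makes pdfgrep use the PCRE engine and perl-like regular expressions), you can do this with zero-width lookahead assertions.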

$ pdfgrep -Pn '(?=.*process)(?=.*preparation)' ~/Str-Cmp.pdf
8:•     If a preparation process is used, the method used shall be declared.
10:Standard, preparation may be an important part of the ordering process. See Annex C for some examples of
38:padding. The preparation processing could move the original numerals (in order of occurrence) to the very

The above only works when both words are on the same line. If the words may appear on different lines of the same page, use the following instead:

$ pdfgrep -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
9:                                                                                                  ISO/IEC 14651:2007(E)
10:ISO/IEC 14651:2007(E)
12:ISO/IEC 14651:2007(E)
...

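The s flag in (?s: means that . will also match newline characters. Note that this only prints the first line of each matching page; you can adjust that with the -A option: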

$ pdfgrep -A4 -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
8-•     Any specific internal format for intermediate keys used when comparing, nor for the table used. The use of
8-      numeric keys is not mandated either.
8-•     A context-dependent ordering.
8-•     Any particular preparation of character strings prior to comparison.
--
9:                                                                                                  ISO/IEC 14651:2007(E)
...
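
For the "hello" and "world" case from the question, the page-level pattern can be assembled from a list of words. The following is only a minimal sketch: page_pattern is a hypothetical helper name, not part of pdfgrep, and it assumes the words contain no regex metacharacters, because nothing is escaped.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdfgrep): build a page-level PCRE that
# requires every given word to occur somewhere on the page, in any order.
# Assumption: the words contain no regex metacharacters; nothing is escaped.
page_pattern() {
    lookaheads=
    for word in "$@"; do
        # One zero-width lookahead per word.
        lookaheads="$lookaheads(?=.*$word)"
    done
    # ^ anchors the match; (?s: ... ) lets . match newline characters.
    printf '^(?s:%s)\n' "$lookaheads"
}

page_pattern hello world
# prints: ^(?s:(?=.*hello)(?=.*world))
```

The result could then be passed on, for example as pdfgrep -Pn "$(page_pattern hello world)" file.pdf, where file.pdf is a placeholder name.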

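A crude wrapper script that prints the lines matching any of the patterns, taken from the pages that contain all of the patterns in any order: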

usage: pdfgrepa [options] files ... -- patterns ...

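#! /bin/sh
# r1 collects one lookahead per pattern (for pdfgrep); r2 collects the
# alternation of all patterns (for the final grep).
r1= r2=
for a; do
        if [ "$r2" ]; then
                # After "--": every remaining argument is a pattern.
                r1="$r1(?=.*$a)"; r2="$r2|$a"
        else
                case $a in
                # "--" ends the options and files; seed r2 so that the "--"
                # separator lines printed by -A also pass the final grep.
                --)     r2='(?=^--$)';;
                # Options and file names are rotated to the end of "$@".
                *)      set -- "$@" "$a";;
                esac
        fi
        shift
done
# Select the pages containing all patterns, then keep the lines matching any.
pdfgrep -A10000 -Pn "(?s:$r1)" "$@" | grep -P --color "$r2"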

$ pdfgrepa ~/Str-Cmp.pdf -i -- obtains process preparation
37- the strings after preparation are identical, and the end result (as the user would normally see it) could be
37- collation process applying the same rules. This kind of indeterminacy is undesirable.
37-one obtains after this preparation the following strings:
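
To see what the wrapper hands to pdfgrep and grep, its argument loop can be run with the two commands replaced by printf. This is only an illustrative sketch: show_patterns is a hypothetical name, the loop body mirrors the script above, and nothing here touches a PDF.

```shell
#!/bin/sh
# Sketch: same argument handling as the wrapper, but the generated regexes
# are printed instead of being passed to pdfgrep and grep.
# show_patterns is a hypothetical name used only for this illustration.
show_patterns() {
    r1= r2=
    for a; do
        if [ "$r2" ]; then
            # Patterns (after "--") extend both regexes.
            r1="$r1(?=.*$a)"; r2="$r2|$a"
        else
            case $a in
            --) r2='(?=^--$)';;
            # Options and file names move to the end of the argument list.
            *)  set -- "$@" "$a";;
            esac
        fi
        shift
    done
    printf 'pdfgrep pattern: (?s:%s)\n' "$r1"
    printf 'grep pattern:    %s\n' "$r2"
    printf 'remaining args:  %s\n' "$*"
}

show_patterns Str-Cmp.pdf -i -- obtains process preparation
# prints:
# pdfgrep pattern: (?s:(?=.*obtains)(?=.*process)(?=.*preparation))
# grep pattern:    (?=^--$)|obtains|process|preparation
# remaining args:  Str-Cmp.pdf -i
```

Note that the patterns are interpolated into the regexes unescaped, so each one is itself treated as a regular expression.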

Answer 2

pdfgrep -nP 'hello.{1,99}world|world.{1,99}hello' a.pdf

https://pdfgrep.org/doc.html
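
The alternation above is also a valid POSIX extended regular expression, so its behaviour can be tried on plain text with grep -E. The sketch below uses invented sample lines and a hypothetical function name, near_each_other. It illustrates that the two words must sit on the same line with 1 to 99 characters between them, so this approach is narrower than the page-level lookahead in Answer 1.

```shell
#!/bin/sh
# Sketch only: the sample lines are invented, and grep -E stands in for
# pdfgrep so the pattern can be tried without a PDF.
near_each_other() {
    grep -nE 'hello.{1,99}world|world.{1,99}hello'
}

printf '%s\n' \
    'hello, world' \
    'world then hello' \
    'helloworld' \
    'hello' \
    'world' | near_each_other
# prints:
# 1:hello, world
# 2:world then hello
```

Extending this to more than two words needs one alternative per ordering, which grows quickly as words are added.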
