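I want to find the string
Time series prediction with ensemble models
in a PDF file using a shell script. I am using pdftotext "$file" - | grep "$string", where $file is the PDF file name and $string is the string above. If the whole string sits on a single line, this finds that line. But it does not find the string when it is split across lines like this: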
Time series prediction with
ensemble models
How can I solve this? I am new to Linux, so a detailed explanation would be much appreciated. Thanks in advance.
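Answer 1
One possible approach is to replace grep with pcregrep (available from the "universe" repository), which supports multi-line matching, and then, instead of searching for the literal string
Time series prediction with ensemble models
search for the regular expression
Time\s+series\s+prediction\s+with\s+ensemble\s+models
where \s+ stands for one or more whitespace characters (including newlines). The conversion from the literal string to the regular expression can be done with the bash shell's built-in string substitution: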
pdftotext "$file" - | pcregrep -M "${string// /\\s+}"
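For readers who prefer Python over shell substitution, the same idea can be sketched as follows. This is a minimal sketch and not part of the original answer: the helper name phrase_to_pattern is hypothetical, and it assumes the text has already been extracted from the PDF (for example with pdftotext).

```python
import re

def phrase_to_pattern(phrase):
    """Build a regex that matches the phrase with any whitespace between words."""
    # Hypothetical helper: re.escape protects words that contain regex metacharacters
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\s+".join(words))

# Sample text with the phrase broken across two lines
text = "A paper on Time series prediction with\nensemble models, 2015"
pattern = phrase_to_pattern("Time series prediction with ensemble models")
found = pattern.search(text)
print(found.group(0) if found else "no match")
```

Escaping each word separately is a design choice that keeps the sketch safe for search strings containing characters such as . or (, which the plain bash substitution would pass to the regex engine unescaped.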
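If you can't use pcregrep, you may be able to get the desired output with plain grep and its -z switch: this tells grep to treat input "lines" as delimited by NUL characters rather than newlines, which in this case effectively makes it treat the whole input as a single line. For example, if you only want to print the matches (with no context):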
pdftotext "$file" - | grep -zPo "${string// /\\s+}"
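Answer 2
With Python, a lot can be done...
I might do some optimization if I look at it again later, but in my tests the script below does the job.
Tested on a file containing: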
Monkey eats banana since he ran out of peanuts
Monkey
eats banana since he ran
out of peanuts
really, Monkey eats banana since
he ran out of peanuts
A lot of useless text here…
Have to add some lines for the sake of the test.
Monkey eats banana since he ran out of peanuts
Looking for the string "Monkey eats banana since he ran out of peanuts", it outputs:
Found matches
--------------------
[line 1]
Monkey eats banana since he ran out of peanuts
[line 2]
Monkey
eats banana since he ran
out of peanuts
[line 5]
Monkey eats banana since
he ran out of peanuts
[line 9]
Monkey eats banana since he ran out of peanuts
The script
#!/usr/bin/env python3
import subprocess
import sys

f = sys.argv[1]
string = sys.argv[2]

# convert the pdf to .txt with pdftotext, as suggested in the question
subprocess.call(["pdftotext", f])
# read the converted file
with open(f.replace(".pdf", ".txt")) as src:
    text = src.read()
# replace newlines by spaces for searching / get the length of the searched string
subtext = text.replace("\n", " ")
size = len(string)
# in a while loop, find each match and resume the search just after the last found index
matches = []
start = 0
while True:
    match = subtext.find(string, start)
    if match == -1:
        break
    matches.append(match)
    start = match + 1
print("Found matches\n" + 20 * "-")
for m in matches:
    # print the matches from the original text, so the original \n are shown
    print("[line " + str(text[:m].count("\n") + 1) + "]\n" + text[m:m + size].strip())
How to use:
- Copy the script into an empty file and save it as
search_pdf.py
- Run it with the command:
python3 /path/to/search_pdf.py /path/to/file.pdf string_to_look_for
Needless to say, if the paths or the searched string contain spaces, you need to use quotes:
python3 '/path to/search_pdf.py' '/path to/file.pdf' 'string to look for'
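As a possible variation on the script above, the text can be read directly from pdftotext's standard output (the "-" argument already used in the question), which avoids leaving a .txt file next to the PDF. This is only a sketch under that assumption: the function names pdf_text and find_matches are hypothetical, and pdftotext must be installed for pdf_text to work.

```python
import subprocess

def pdf_text(path):
    """Return the text of a PDF by reading pdftotext's standard output."""
    # "-" as the output file makes pdftotext write to stdout, as in the question
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def find_matches(text, string):
    """Return (line_number, matched_text) pairs, allowing line breaks in a match."""
    # Same search as the script above: newlines are treated as spaces
    subtext = text.replace("\n", " ")
    found, start = [], 0
    while True:
        m = subtext.find(string, start)
        if m == -1:
            return found
        # Slice the original text so the reported match keeps its line breaks
        found.append((text[:m].count("\n") + 1, text[m:m + len(string)]))
        start = m + 1

# Usage (requires pdftotext to be installed):
# for line, match in find_matches(pdf_text("/path/to/file.pdf"), "string to look for"):
#     print("[line " + str(line) + "]\n" + match)
```

Separating the extraction from the search also makes the search part easy to try on plain strings without a PDF at hand.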
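Answer 3
Another approach, suggested by steeldriver in the comments, is to replace all newlines with spaces, turning the output of pdftotext into one long line of text, and then search that: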
string="Time series prediction with ensemble models"
pdftotext "$file" - | tr '\n' ' ' | grep -o "$string"
I added -o so that grep prints only the matching part of the line. Without it, you would print the entire contents of the file.
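The newline-to-space idea can also be sketched in Python. This is an illustrative sketch rather than part of the original answer, and the function name contains_phrase is hypothetical. Unlike tr '\n' ' ', it collapses every run of whitespace into a single space, so it also tolerates repeated spaces in the extracted text.

```python
def contains_phrase(text, phrase):
    """Return True if phrase occurs in text once all whitespace runs are collapsed."""
    # str.split() with no arguments splits on any run of whitespace, newlines included
    normalized = " ".join(text.split())
    return " ".join(phrase.split()) in normalized

# Sample text with a line break and extra spaces inside the phrase
sample = "Time series prediction with\nensemble   models"
print(contains_phrase(sample, "Time series prediction with ensemble models"))
```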
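Another approach is to use grep's -z switch, which tells it to use \0 instead of \n to delimit lines. The whole input is then treated as a single "line", and you can match against it with Perl-compatible or extended regular expressions: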
$ printf 'foo\nbar\nbaz\n' | grep -oPz 'foo\nbar'
foo
bar
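However, with a pattern that spells out the newline explicitly, as above, this only helps if you know in advance how the string is split across lines.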