How do I find a multi-line string in a shell script?

Answer 1

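One possible approach is to replace grep with pcregrep (available from the "universe" repository), which supports multi-line matching, and then, instead of searching for the literal string

Time series prediction with ensemble models

search for the Perl-compatible regular expression (PCRE)

Time\s+series\s+prediction\s+with\s+ensemble\s+models

where \s+ stands for one or more whitespace characters (including newlines). The substitution of spaces by \s+ can be done with the bash shell's built-in string replacement: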

pdftotext "$file" - | pcregrep -M "${string// /\\s+}"
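As a rough illustration of the same substitution outside the shell, here is a minimal Python sketch. The helper name build_pattern and the sample text are made up for this example. Unlike the bash one-liner, it also escapes regex metacharacters in the search phrase, which is an addition of this sketch rather than something the command above does.

```python
import re

def build_pattern(phrase):
    """Compile a regex matching `phrase` with any whitespace run between its words."""
    # re.escape keeps characters such as "." or "(" literal.
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\s+".join(words))

pattern = build_pattern("Time series prediction with ensemble models")

# Hypothetical extracted text in which the phrase is broken over three lines.
sample = "Time series\nprediction with\nensemble   models"

print(pattern.pattern)
print(bool(pattern.search(sample)))
```

Splitting on whitespace and re-joining with \s+ means the phrase may be typed with any spacing and is still found when the line breaks fall elsewhere.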

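If you cannot use pcregrep, you may be able to get the desired output using plain grep with the -z switch: this tells grep to treat input "lines" as delimited by NUL characters rather than newlines, which in this case effectively makes it treat the whole input as a single line. For example, if you only want to print the matches (without context):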

pdftotext "$file" - | grep -zPo "${string// /\\s+}"
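To see why treating the whole input as one "line" matters, the following Python sketch compares a line-by-line search with a search over the entire text. The sample text is invented for the illustration, and the comparison with grep is only approximate.

```python
import re

# Hypothetical extracted text; the phrase of interest is split across lines 2 and 3.
text = "Intro line\nTime series prediction\nwith ensemble models\nOutro line\n"
pattern = re.compile(r"prediction\s+with")

# Line-by-line search, similar to grep's default behaviour: no single line matches.
per_line_hits = [line for line in text.splitlines() if pattern.search(line)]

# Whole-input search, roughly what -z achieves when the input contains no NUL bytes.
# Collecting only the matched text mirrors the -o switch.
whole_hits = [m.group(0) for m in pattern.finditer(text)]

print(per_line_hits)
print(whole_hits)
```

The matched text keeps its original newline, just as grep -zPo prints the match with its line break intact.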

Answer 2

With Python, a lot can be done...

If I look at it again later I will probably make some optimizations, but in my tests the script below does the job.

Tested on a file:

Monkey eats banana since he ran out of peanuts 
Monkey
eats banana since he ran 
out of peanuts 
really, Monkey eats banana since 
he ran out of peanuts 
A lot of useless text here…
Have to add some lines for the sake of the test.
Monkey eats banana since he ran out of peanuts

Looking for the string "Monkey eats banana since he ran out of peanuts", it outputs:

Found matches
--------------------
[line 1]
Monkey eats banana since he ran out of peanuts
[line 2]
Monkey
eats banana since he ran
out of peanuts
[line 5]
Monkey eats banana since
he ran out of peanuts
[line 9]
Monkey eats banana since he ran out of peanuts

The script

#!/usr/bin/env python3
import subprocess
import sys

pdf_path = sys.argv[1]
string = sys.argv[2]

# convert to .txt with pdftotext, as you suggested (writes file.txt next to file.pdf)
subprocess.call(["pdftotext", pdf_path])
# read the converted file
with open(pdf_path.replace(".pdf", ".txt")) as txt_file:
    text = txt_file.read()
# replace newlines by spaces so that a match can span several lines
subtext = text.replace("\n", " ")
# length of the searched string, used to slice the match out of the original text
size = len(string)
# in a while loop, find the matching string and restart the search just after the last found index
matches = []
start = 0
while True:
    match = subtext.find(string, start)
    if match == -1:
        break
    matches.append(match)
    start = match + 1

print("Found matches\n" + 20 * "-")
for m in matches:
    # print each match from the original text, so inserted spaces show up as the original \n
    line_number = text[:m].count("\n") + 1
    print("[line " + str(line_number) + "]\n" + text[m:m + size].strip())
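The script relies on each line break mapping to exactly one space, so a trailing blank before a newline would prevent a match. A possible variant, sketched below, uses a regular expression instead of str.find. This is a swapped-in technique, not the script's method. The function name find_phrase is hypothetical, and the pdftotext step is left out so that the sample text can be given inline.

```python
import re

def find_phrase(text, phrase):
    """Return (line number, matched text) for every occurrence of `phrase` in `text`."""
    # Words may be separated by any run of whitespace, including newlines.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in phrase.split()))
    results = []
    for m in pattern.finditer(text):
        # Line number of the first character of the match (1-based).
        line_no = text.count("\n", 0, m.start()) + 1
        results.append((line_no, m.group(0)))
    return results

# The test file from above, including its trailing spaces.
sample = (
    "Monkey eats banana since he ran out of peanuts \n"
    "Monkey\n"
    "eats banana since he ran \n"
    "out of peanuts \n"
    "really, Monkey eats banana since \n"
    "he ran out of peanuts \n"
    "A lot of useless text here…\n"
    "Have to add some lines for the sake of the test.\n"
    "Monkey eats banana since he ran out of peanuts\n"
)
phrase = "Monkey eats banana since he ran out of peanuts"

for line_no, matched in find_phrase(sample, phrase):
    print("[line " + str(line_no) + "]")
    print(matched)
```

On this sample it reports matches starting on lines 1, 2, 5 and 9, the same lines as the output shown above.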

How to use:

Copy the script into an empty file and save it as search_pdf.py

Run it with the command:

python3 /path/to/search_pdf.py /path/to/file.pdf string_to_look_for

Needless to say, if the paths or the search string contain spaces, you need to use quotes:

python3 '/path to/search_pdf.py' '/path to/file.pdf' 'string to look for'
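The effect of the quotes can be checked with Python's shlex module, which splits a command line following POSIX shell quoting rules. This is only an illustration of how the arguments are grouped; the second command line is a made-up counterexample.

```python
import shlex

quoted = "python3 '/path to/search_pdf.py' '/path to/file.pdf' 'string to look for'"
args = shlex.split(quoted)
print(args)
# Inside the script, sys.argv[1] would be args[2] and sys.argv[2] would be args[3].

# Without quotes, every space starts a new argument.
unquoted = shlex.split("python3 search_pdf.py file.pdf string to look for")
print(len(unquoted))
```

With the quotes the command splits into four arguments. Without them the search string falls apart into separate words, and the script would only see the first one as sys.argv[2].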

Answer 3

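Another approach, suggested by steeldriver in the comments, is to replace all newlines with spaces, turning the output of pdftotext into one long line of text, and then search that: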

string="Time series prediction with ensemble models"
pdftotext "$file" - | tr '\n' ' ' | grep -o "$string"

I added -o so that grep prints only the matching part of the line. Without it, you would print the entire contents of the file.
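The same idea can be sketched in Python. The sample text is invented, and the second step (collapsing every whitespace run) goes beyond what the tr command does; it is shown only because extracted text may carry trailing blanks.

```python
# Hypothetical extracted text: the first occurrence has a trailing space before the newline.
text = (
    "Time series \nprediction with ensemble models\n"
    "Time series prediction\nwith ensemble models\n"
)
string = "Time series prediction with ensemble models"

# Equivalent of tr '\n' ' ': every newline becomes exactly one space.
flattened = text.replace("\n", " ")
print(flattened.count(string))

# Collapsing each whitespace run to a single space also copes with trailing blanks.
normalized = " ".join(text.split())
print(normalized.count(string))
```

Here the plain newline replacement finds one occurrence, because the first one ends up with two consecutive spaces, while the normalized text yields both.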

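Another approach is to use grep's -z switch, which tells it to use \0 instead of \n to define lines. This means the whole input is treated as a single "line", and you can use Perl-compatible or extended regular expressions to match across it: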

$ printf 'foo\nbar\nbaz\n' | grep -oPz 'foo\nbar'
foo
bar

However, this does not help unless you know in advance how the string is split across lines.
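A small Python sketch of that limitation, using the foo/bar input from the example plus one invented variant with a different line break:

```python
import re

# The same words, broken differently in each input.
inputs = ["foo\nbar\nbaz\n", "foo bar\nbaz\n"]

fixed = re.compile(r"foo\nbar")      # assumes the break falls between foo and bar
flexible = re.compile(r"foo\s+bar")  # accepts any whitespace between the two words

print([bool(fixed.search(s)) for s in inputs])
print([bool(flexible.search(s)) for s in inputs])
```

The fixed pattern only matches the first input, whereas the \s+ form matches both, which is why a whitespace-tolerant pattern is the safer choice when the position of the breaks is unknown.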
