SED: complex text deletion and pattern matching

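I have searched a lot about sed, and I am a beginner. I managed to put together a command that deletes the block of text between PATTERN-1 and PATTERN-2 (patterns included) in a large (250 MB+) text file.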

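Now I have a more complex task. I need to find a pattern in the text file and delete all text from an earlier line, before that pattern, through a later line that matches another pattern. Here is an example: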

PATTERN-1 = '<connection'
PATTERN-2 = state="wreck"
PATTERN-3 = '</connection>'

I need to search for PATTERN-2, for example state="wreck". When I find PATTERN-2, I need to find the preceding PATTERN-1. Then I need to delete all text between PATTERN-1 and PATTERN-3 (which includes deleting PATTERN-2).

So if my text is:

<connection ...
... state="wreck" ...
</connection>

I would find every instance of state="wreck" and then delete everything between <connection and </connection> (including the <connection and </connection> text itself).

Thanks. I hope the question is clear.

Answer 1

If you can use perl, here is one way to delete all <connection ... </connection> blocks that contain state="wreck":

cat file.txt
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah

perl -0 -pe 's#<connection(?:(?!</connection>).)*state="wreck"(?:(?!</connection>).)*</connection>##gs' file.txt
blah blah

blah blah

blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah

blah blah

Explanation:

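-0      # set the input record separator to the NUL character, so a text file with no NUL bytes is read as a single record (-0777 is true slurp mode)
-pe     # -p: loop over the input and print each record after running the code; -e: the code to run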

Regular expression:

s#                      : substitute, regex delimiter
<connection             : literally
(?:                     : start non capture group
    (?!</connection>)   : negative lookahead, make sure we don't find </connection>
    .                   : any character, including newline because of the s flag
)*                      : group may appear 0 or more times
state="wreck"           : literally
(?:                     : start non capture group
    (?!</connection>)   : negative lookahead, make sure we don't find </connection>
    .                   : any character, including newline because of the s flag
)*                      : group may appear 0 or more times
</connection>           : literally
##gs                    : replace with empty string, global, dot match newline
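
Because the perl command reads the whole file as one record, it holds all of it in memory, which may matter for a 250 MB+ file. A line-by-line alternative can be sketched with awk (a different technique from the regex above, not sed). It assumes each <connection and </connection> tag sits on its own line, as in the sample; the function name filter_wreck is made up for this illustration.

```shell
# Hypothetical helper: drop every <connection ... </connection> block that
# contains state="wreck", buffering only one block at a time.
# Assumption: opening and closing tags are on separate lines.
filter_wreck() {
  awk '
    # Start buffering at an opening tag (ignored if already inside a block).
    !inblock && /<connection/ { inblock = 1; buf = ""; bad = 0 }

    inblock {
      buf = buf $0 "\n"                            # collect the block
      if (index($0, "state=\"wreck\"")) bad = 1    # mark block for deletion
      if (index($0, "</connection>")) {            # block is complete
        if (!bad) printf "%s", buf                 # keep blocks without the marker
        inblock = 0
      }
      next
    }

    { print }                                      # lines outside any block

    # An unterminated block at end of input is kept unchanged.
    END { if (inblock) printf "%s", buf }
  ' "$@"
}
```

It could be used as filter_wreck file.txt > cleaned.txt, where cleaned.txt is a placeholder name. Unlike the perl substitution, this removes the block's lines entirely, so no blank lines are left behind where a block was deleted.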
