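I have searched a lot about sed, but I am a beginner. I managed to put together a command that deletes the blocks of text between PATTERN-1 and PATTERN-2 (patterns included) in a large (250 MB+) text file.
Now I have a more complex task. I need to find a pattern in the text file and delete everything from a line before that pattern through a later line that matches another pattern. Here is an example: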
PATTERN-1 = '<connection'
PATTERN-2 = state="wreck"
PATTERN-3 = '</connection>'
I need to search for PATTERN-2, e.g. state="wreck". When I find PATTERN-2, I need to find the preceding PATTERN-1. Then I need to delete all text between PATTERN-1 and PATTERN-3 (which also deletes PATTERN-2).
So if my text is:
<connection ...
... state="wreck" ...
</connection>
I want to find every instance of state="wreck" and then delete everything from <connection through </connection> (including the <connection and </connection> text themselves).
Thanks. I hope the question is clear.
Answer 1
If you can use perl, here is one way to delete every <connection ... </connection> block that contains state="wreck":
cat file.txt
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
<connection ...
... state="wreck" ...
</connection>
blah blah
perl -0 -pe 's#<connection(?:(?!</connection>).)*state="wreck"(?:(?!</connection>).)*</connection>##gs' file.txt
blah blah
blah blah
blah blah
blah blah
<connection ...
... state="another" ...
</connection>
blah blah
blah blah
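Explanation:
-0 # set the input record separator to the NUL character; a text file normally contains none, so the whole file is read as a single record (-0777 is the explicit slurp mode)
-pe # -p: loop over the input and print each record after the code has run; -e: execute the following code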
Regex:
s# : substitute, regex delimiter
<connection : literally
(?: : start non capture group
(?!</connection>) : negative lookahead, make sure we don't find </connection>
. : any character, including newline because of the s flag
)* : group may appear 0 or more times
state="wreck" : literally
(?: : start non capture group
(?!</connection>) : negative lookahead, make sure we don't find </connection>
. : any character, including newline because of the s flag
)* : group may appear 0 or more times
</connection> : literally
##gs : replace with empty string, global, dot match newline