Regex: remove all HTML tags inside an HTML tag, except certain other HTML tags

I need to delete all HTML tags, such as <p style="text-align: center;">, that appear inside the <p class="glovo"></p> tag, except for <em> and </em>.

Example:

<p class="glovo">In these <p style="text-align: center;"> situations we may be forgetting to really <em>bend</em> at our practice and <em>sweat</em> at it.</p>

It must become:

<p class="glovo">In these situations we may be forgetting to really <em>bend</em> at our practice and <em>sweat</em> at it.</p>

I use this generic formula:

REGION-START(?=(?:(?!REGION-FINAL).)*?FIND REGEX)(?=(?:(?!REGION-FINAL).)).+?REGION-FINAL\R?

REGION-START = <p class="glovo">
REGION-FINAL = </p>
FIND REGEX = <(?!/)[^>]*[^/]>(?!<em>|</em>)

So my final regex becomes:

FIND:

<p class="glovo">(?=(?:(?!</p>).)*?<(?!/)[^>]*[^/]>(?!<em>|</em>))(?=(?:(?!</p>).)).+?</p>\R?

REPLACE BY: (LEAVE EMPTY)

The problem is that my regex selects the entire <p class="glovo">...</p> element, not just the tags inside it. Can anyone help me?

Answer 1

  • Ctrl+H
  • Find what: (?:<p class="glovo">|\G).*?\K<(?!/?em>).*?>(?=.*</p>)
  • Replace with: LEAVE EMPTY
  • Check Wrap around
  • Select Regular expression
  • Replace all

Explanation:

(?:                     # non capture group
    <p class="glovo">       # literally
  |                       # OR
    \G                      # restart from last match position
)                       # end group
.*?                     # 0 or more any character, not greedy
\K                      # forget all we have seen until this position
<                       # literally <
    (?!/?em>)               # not followed by em or /em
    .*?                     # 0 or more any character, not greedy
    >
(?=.*</p>)              # positive lookahead, make sure we have </p> somewhere after
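
For anyone who needs the same result outside the editor, here is a minimal Python sketch. It is not the regex above: Python's re module supports neither \K nor \G, so this swaps in a two-pass approach (find each region, then strip tags inside it). The names REGION, INNER_TAG, and strip_inner_tags are made up for illustration, and it assumes a region ends at the first </p> after <p class="glovo">.

```python
import re

# Pass 1: a region runs from <p class="glovo"> to the first </p> after it.
# (Assumption: glovo paragraphs contain no nested closing </p>.)
REGION = re.compile(r'(<p class="glovo">)(.*?)(</p>)', re.DOTALL)

# Pass 2: any tag that is not exactly <em> or </em>,
# using the same negative lookahead as the answer's regex.
INNER_TAG = re.compile(r'<(?!/?em>)[^>]*>')


def strip_inner_tags(html: str) -> str:
    """Remove every tag except <em>/</em> inside each glovo region."""
    def clean(match: re.Match) -> str:
        opening, body, closing = match.groups()
        # Only the body is cleaned; the region's own tags are kept as-is.
        return opening + INNER_TAG.sub('', body) + closing

    return REGION.sub(clean, html)


sample = (
    '<p class="glovo">In these <p style="text-align: center;"> situations '
    'we may be forgetting to really <em>bend</em> at our practice and '
    '<em>sweat</em> at it.</p>'
)
result = strip_inner_tags(sample)
print(result)
```

Note that removing the tag leaves the spaces on both sides of it in place, so the text reads "In these  situations" with two spaces. Text outside a <p class="glovo"> region is left untouched.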
