正则表达式:查找包含这些词(字符串)的特定 HTML 标签

正则表达式:查找包含这些词(字符串)的特定 HTML 标签

我有这个标签:

<span class="text_obisnuit2">* Not&#259;:</span>John Wells - <em>My Dreams</em>, Albatros Books, 1986.</p>

和这个:

<span class="text_obisnuit1">* Not&#259;:</span>Mariah Carey - <em>Lovers on the road</em>, BackStreet Books, 1965.</p>

所以,我想找到那些特定的 html 标签<span class="text_obisnuit2">包含以下单词(字符串): Albatros<em></em> (第一行)

答案1

这是一个很简单的方法,但它需要在标签后面加上“Albatros” <em>(演示):

(<span class="text_obisnuit2">).*<em>.*<\/em>.*Albatros.*

接下来的那个不关心它们的顺序(演示):

(<span class="text_obisnuit2">).*(<em>.*<\/em>.*Albatros.*|Albatros.*<em>.*<\/em>.*)

这是另一种变体,其中后面的数字无关紧要text_obisnuit,整个span标签被捕获为第一组(演示):

(<span class="text_obisnuit\d+">.*<\/span>).*(<em>.*<\/em>.*Albatros.*|Albatros.*<em>.*<\/em>.*)

所有正则表达式都假设条目在文件中各自占一行。也许检测<p></p>作为边界更有意义,但为此我们需要从输入文件中摘录更大的内容。

答案2

  • Ctrl+F
  • 找什么:<span class="text_obisnuit2">(?=.*?<em>.*?</em>)(?=.*?\bAlbatros\b).*$
  • 查看 环绕
  • 查看 正则表达式
  • 取消选中 . matches newline
  • Find All in Current Document

解释:

<span class="text_obisnuit2">   # literally
(?=                             # positive lookahead, make sure we have after:
.*?                           # 0 or more any character but newline, not greedy
<em>                          # literally open em tag
.*?                           # 0 or more any character but newline, not greedy
</em>                         # literally close em tag
)                               # end lookahead
(?=                             # positive lookahead, make sure we have after:
.*?                           # 0 or more any character but newline, not greedy
\bAlbatros\b                  # Albatros with word boundaries
)                               # end lookahead
.*                              # 0 or more any character but newline
$                               # end of line

截屏:

在此处输入图片描述

相关内容