Regex: find HTML tags that contain at least 3 keywords

I have these words, at least three of which are likely to appear in any English sentence:

was, where, were, some, then, than, that, can, by, the, and, with, over, there, is, as, also, through, from, while, just, like, for, such, if, else, still, again, want, will, wish, make, made, well, have, had, has, it, be, do, say, others, go, know, see, think, look, give, use, find, tell, ask, work, seem, feel, try, leave, call, get, take, too, in, addition, to, could, who, he, she, because, of, your, yours, their, doesn't, are, an, these, this, those, but, at, whom, or, out, how, when, between, his, her, they, them, my, without, maybe, even, show, can't, must, couldn't, now, i'm, many, come, own, self, seen, it’s, we, any, other, coming, so, found, more, much, all, very, same, did, which, does, on

I also have these two HTML tags, but only the first one has English content:

<meta name="description" content="Simply Red are a British soul and pop band which formed in Manchester in 1985. The lead vocalist of the band is singer and songwriter Mick Hucknall by">

And one tag in Russian:

<meta name="description" content="Simply Red - британская соул- и поп-группа, образованная в Манчестере в 1985 году. Ведущим вокалистом группы является певец и автор песен Мик Хакнелл.">

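So I want to check all HTML files for tags whose content is written in English. To do that, I have to find the HTML tags that contain at least 3 of the keywords listed above.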

My regex, with only a few of the words (short version), looks like this:

Search: (?-s)<meta name="description".+?(?:(was|is|as|on|and|in)).+>

The full version would be:

(?-s)<meta name="description".*?(was|where|were|some|then|than|that|can|by|the|and|with|over|there|is|as|also|through|from|while|just|like|for|such|if|else|still|again|want|will|wish|make|made|well|have|had|has|it|be|do|say|others|go|know|see|think|look|give|use|find|tell|ask|work|seem|feel|try|leave|call|get|take|too|in|addition|to|could|who|he|she|because|of|your|yours|their|doesn't|are|an|these|this|those|but|at|whom|or|out|how|when|between|his|her|they|them|my|without|maybe|even|show|can't|must|couldn't|now|i'm|many|come|own|self|seen|it’s|we|any|other|coming|so|found|more|much|all|very|same|did|which|does|on).+>

The problem is that my regex also finds the second tag, whose content is written in Russian. I need it to find only the first (English) one.
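To see why the short pattern also hits the Russian tag, here is a minimal sketch in Python. It is an illustration only: the helper name `first_keyword` is hypothetical, and the leading `(?-s)` is dropped because in Python's `re` module `.` already excludes newlines unless `DOTALL` is set. The pattern asks for just one keyword and has no word boundaries, so the `on` inside the attribute name `content` is enough to satisfy it.

```python
import re

# Short pattern from the question, without the leading (?-s).
SHORT = re.compile(r'<meta name="description".+?(?:(was|is|as|on|and|in)).+>')

ENGLISH_TAG = ('<meta name="description" content="Simply Red are a British soul '
               'and pop band which formed in Manchester in 1985. The lead vocalist '
               'of the band is singer and songwriter Mick Hucknall by">')
RUSSIAN_TAG = ('<meta name="description" content="Simply Red - британская соул- и '
               'поп-группа, образованная в Манчестере в 1985 году. Ведущим '
               'вокалистом группы является певец и автор песен Мик Хакнелл.">')

def first_keyword(tag):
    """Return the keyword the pattern captured, or None if there is no match."""
    match = SHORT.search(tag)
    return match.group(1) if match else None

# Both tags match, and in both cases the captured "keyword" is the
# "on" inside the attribute name "content", not a word from the text.
print(first_keyword(ENGLISH_TAG))  # on
print(first_keyword(RUSSIAN_TAG))  # on
```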

Answer 1

Your list is too large, so to demonstrate the technique, here is an example with a small list of four words: one two three four

Here is an explanation of the search string: (one|two|three|four).*(?-1).*(?-1)

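  • (one|two|three|four): a capture group that matches one of the words
  • .*: matches any number of characters
  • (?-1): applies the pattern of the immediately preceding group again (a recursive subpattern); unlike a backreference, it may match a different word from the group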
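The same idea can be sketched in Python. Python's standard `re` module does not support subroutine calls such as `(?-1)`, so this sketch writes the alternation out three times instead, which should behave the same way for this purpose. The function name `at_least_three` and the `whole_words` option are assumptions added for illustration and are not part of the answer.

```python
import re

def at_least_three(words, whole_words=False):
    """Compile a pattern that needs three keyword hits on one line."""
    alt = "(?:" + "|".join(re.escape(w) for w in words) + ")"
    if whole_words:
        # Assumption, not in the answer: \b stops "on" from
        # matching inside a longer word such as "content".
        alt = r"\b" + alt + r"\b"
    # Spelled-out equivalent of (group).*(?-1).*(?-1)
    return re.compile(alt + ".*" + alt + ".*" + alt)

demo = at_least_three(["one", "two", "three", "four"])
print(bool(demo.search("one then two then four")))  # True: three hits
print(bool(demo.search("one and two only")))        # False: only two hits
print(bool(demo.search("one one one")))             # True: the same word may repeat
```

With `whole_words=True` and the question's short list (was, is, as, on, and, in), this sketch should accept the English tag and reject the Russian one, because the Russian tag contains none of those words as whole words.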
