Regex: Select and delete everything inside the HTML comment tags

Question 1

Using Notepad++:

Ctrl+H
Find what: (?:\h*\R|\G)\K(?:(<p class=.*?</p>\R?)|(?:(?!<p class=.*?</p>)[\s\S])+)(?=[\s\S]+)
Replace with: $1
Check Wrap around
Check Regular expression
Uncheck . matches newline
Replace all

Demo and explanation


Question 2

This works in PowerShell:

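$sourcedir  = "C:\Folder1\"
$resultsdir = "C:\Folder2\"
Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
    $output  = @()
    # @() keeps $content an array even when the file has a single line
    $content = @(Get-Content -Path $_.FullName)
    # First line containing each marker
    $start = $content | Where-Object { $_ -match '<!-- ARTICOL START -->' } | Select-Object -First 1
    $final = $content | Where-Object { $_ -match '<!-- ARTICOL FINAL -->' } | Select-Object -First 1
    # Look the marker positions up once instead of on every iteration
    $startIndex = [array]::IndexOf($content, $start)
    $finalIndex = [array]::IndexOf($content, $final)
    for ($i = 0; $i -lt $content.Count; $i++) {
        # Between the two markers, skip every line that does not contain '<p class='
        if (($i -gt $startIndex) -and ($i -lt $finalIndex) -and ($content[$i] -notmatch '<p class=')) {
            continue
        }
        $output += $content[$i]
    }
    # Write the filtered copy to the results folder under the same file name
    $output | Out-File -FilePath (Join-Path $resultsdir $_.Name)
}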

Thanks to 薛建军-MSFT, whose answer here helped me.

Question 3

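OK, here is the regex approach to processing an HTML file. Parsing HTML with regular expressions is actually a very bad idea, and I could do this more correctly with (more) complex PowerShell or Python 3 code. But you asked for a regex approach, so I will give you only what you asked for, since your HTML is not that complicated.

I copied your code into Notepad++ and saved it as a text file with the .html extension, at D:\test.html: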

  <!-- ARTICOL START -->

<div align="justify">
        <table width="682" border="0">
          <tr>
            <td><h1 class="den_articol" itemprop="sfe">My text here</h1></td>
          </tr>
          <tr>
            <td class="text_dreapta">On Ianuarie 14, 2014, in <a href="https://neculaifantanaru.com/en/qualities-of-a-leader.html" title="See al articles from  Qualities of a leader" class="external" rel="category tag">Qualities of a leader</a>, by Author</td>
          </tr>
        </table>
      <h2 class="text_obisnuit2"><img src="index_files/sfa.jpg" width="718" height="605" id="sfs" usemap="#m_dgrnt" alt="hip" /><map name="tfAbonament" id="m_34">
<area shape="rect" coords="259,545,457,582" href="#plata" alt="" />
</map></h2>
        <p class="den_articol">Why this text text?</p>
<p class="text_obisnuit">test text text</p>
        <p class="text_obisnuit">test text text</p>
  <p class="text_obisnuit2">test text text</p>
    </div>
    <p align="justify" class="text_obisnuit style3">&nbsp;</p>
   
       <!-- ARTICOL FINAL -->

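The next step is to read the content of the file, which in PowerShell is usually done with Get-Content. Then pipe the result of that cmdlet to the Where-Object cmdlet to filter it with a regex match: a line is included if the condition is true and dropped otherwise. Note that where is an alias of Where-Object.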

get-content D:\test.html | where {$_ -match "ARTICOL|<P class=(.*)</p>"}

The output is:

  <!-- ARTICOL START -->
            <td><h1 class="den_articol" itemprop="sfe">My text here</h1></td>
        <p class="den_articol">Why this text text?</p>
<p class="text_obisnuit">test text text</p>
        <p class="text_obisnuit">test text text</p>
  <p class="text_obisnuit2">test text text</p>
       <!-- ARTICOL FINAL -->

I admit this is not exactly what you wanted, but it is close.
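
The extra h1 line in that output gets through because -match is case-insensitive, so the bare word ARTICOL also matches class="den_articol". A minimal sketch of a tighter filter, assuming the markers always appear exactly as <!-- ARTICOL START --> and <!-- ARTICOL FINAL -->; the $sample, $pattern and $kept names are made up for this illustration:

```powershell
# Hypothetical sample: a few lines copied from the test file above
$sample = @(
    '  <!-- ARTICOL START -->',
    '            <td><h1 class="den_articol" itemprop="sfe">My text here</h1></td>',
    '        <p class="den_articol">Why this text text?</p>',
    '<p class="text_obisnuit">test text text</p>',
    '    <p align="justify" class="text_obisnuit style3">&nbsp;</p>',
    '       <!-- ARTICOL FINAL -->'
)

# Match either of the full comment markers, or a <p class=...>...</p> line.
# Anchoring on the whole marker avoids the accidental hit on "den_articol".
$pattern = '<!-- ARTICOL (START|FINAL) -->|<p class=.*</p>'

# Keep only the lines that match the tightened pattern
$kept = $sample | Where-Object { $_ -match $pattern }
$kept
```

With this pattern the h1 line is dropped, while the two markers and the lines that start with <p class= are kept. The <p align=...> line is still excluded, as it was before.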

The same result as the Where-Object pipeline can also be achieved with a for loop combined with an if statement:

$html = get-content D:\test.html
for ($i = 0; $i -lt $html.count; $i++) {
    if ($html[$i] -match "ARTICOL|<P class=(.*)</p>") { $html[$i] }
}

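The first line reads the file. By default Get-Content returns the content line by line, so the result is an array, which we store in a variable. We then loop over the array by index. In PowerShell the first element of an array has index 0, so the last element has an index equal to the number of elements minus 1. We go through the array one element at a time and check whether it matches the regex; if it does, it is printed to the screen.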

Update: to do this for a bunch of files, just use this code (you must replace the placeholder path before using it):

$files = (Get-ChildItem -Path "path\to\folder" -Force -Recurse -filter *.html).FullName
foreach ($file in $files) {
    $content = Get-Content -Path $file
    $content = $content | where {$_ -match "ARTICOL|(<P class=(.*)</p>)"}
    Set-Content -Path $file -Value $content
}
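
Since Set-Content overwrites each file in place, it may be worth previewing the effect before running the loop on real files. A rough sketch under stated assumptions: Get-FilteredLines is a hypothetical helper (not a built-in cmdlet), and the demo file is a throwaway created in the temp folder rather than one of your real files:

```powershell
# Hypothetical helper: returns the lines that would be kept, without modifying the file
function Get-FilteredLines {
    param([string]$Path)
    Get-Content -Path $Path | Where-Object { $_ -match "ARTICOL|(<P class=(.*)</p>)" }
}

# Create a small throwaway file to try the helper on
$demo = Join-Path ([System.IO.Path]::GetTempPath()) 'articol-demo.html'
Set-Content -Path $demo -Value @(
    '<!-- ARTICOL START -->',
    '<div align="justify">',
    '<p class="text_obisnuit">test text text</p>',
    '</div>',
    '<!-- ARTICOL FINAL -->'
)

# Compare line counts before and after filtering; the file itself is left unchanged
$before = @(Get-Content -Path $demo)
$after  = @(Get-FilteredLines -Path $demo)
"{0}: {1} of {2} lines would be kept" -f $demo, $after.Count, $before.Count

# Clean up the throwaway file
Remove-Item -Path $demo
```

Once the counts look right, the same helper could be called inside the foreach loop in place of the inline where filter.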

Answer