Removing blocks that match a string from a huge HTML file

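I'm on a Mac and I want to delete multiple <div> blocks that match a certain string from an HTML file. I tried to do this with sed as follows, but failed:

  1. First, I escaped every symbol in my STRING that has a special meaning in regular expressions, producing ESCAPEDSTRING.

  2. Now I'm struggling to find a tool that works across multiple lines and deletes the corresponding lines using a regular expression. I suspect sed won't do.

In the example below, I want to delete any <div> block that contains the string GET /thestring//index.php, while everything else (i.e. the second-to-last block, which contains GET /thisisatotallydifferentstring) stays part of the HTML file. The sample foo.html looks like this: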

<div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=3214235353.32&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:18 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=3214235353.32&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:28 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=3214123353.99 HTTP/1.1" 200 278 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:18 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523141718.15&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:19 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=1523141718.15&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:55:29 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=1523141729.64 HTTP/1.1" 200 278 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523142027.44&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:28 +0200] "GET /thestring//index.php?fnc=OSCConfirmCatalog&amp;session_id=1523142027.44&amp;date= HTTP/1.1" 200 249 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:00:38 +0200] "GET /thestring//index.php?fnc=OSCExportOrder&amp;session_id=1523142038.38 HTTP/1.1" 200 278 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects setter usage and property overloading</span><br>
 <span class="line"><b>Log line: </b>222.333.444.555 - - [03/Jan/2013:01:03:42 +0200] "GET /thisisatotallydifferentstring.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:01:05:27 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=1523142327.08&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
 </div>

I want to delete every <div></div> block that contains "thestring".

My regular expression looks like this:

\<div class\="block highlight"\>\n  Reason\: \<span class\="reason"\>Detects JavaScript location/document property access and window access obfuscation\</span\>\<br\>\n \<span class\="line"\>\<b\>Log line\: .* \- \- \[08/Apr/2018\:.*\] "GET /pixi//index\.php.* HTTP/1\.1" 200 .* "\-" "\-"\n\</span\>\<br\>\n \</div\>\n

Any suggestions?

Answer 1

Use the xsltproc program that ships with OS X; see man xsltproc.

For example:

$ cat remdivs.xslt 
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">

<xsl:output method="html" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*" />
<xsl:preserve-space elements="html body div" />

<xsl:template match="@* | node()">
    <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
 </xsl:template>

 <xsl:template match="div[@class='block highlight']"/>

 </xsl:stylesheet>



$ cat input.xml 
<html>
<div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2000976405029%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e21%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e000035010005%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 339 "-" "-"
</span><br>
 </div>
 <div class="no highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
 </div>
 <div class="block highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:50:17 +0200] "GET /thestring//index.php?fnc=OSCExportCatalog&amp;session_id=3214235353.32&amp;onlynew=y HTTP/1.1" 200 676 "-" "-"
</span><br>
 </div>
</html>

$ xsltproc --html remdivs.xslt input.xml
<html>
<body>
 <div class="no highlight">
  Reason: <span class="reason">Detects JavaScript location/document property access and window access obfuscation</span><br>
 <span class="line"><b>Log line: </b>111.222.333.444 - - [03/Jan/2013:00:45:40 +0200] "GET /thestring//index.php?fnc=OSCImportStock&amp;session_id=1523141136.42&amp;data=%3cARTICLE_ITEM%3e%3cARTICLE_ITEM_ID%3e2001021500003%3c%2fARTICLE_ITEM_ID%3e%3cQUANTITY%3e1%3c%2fQUANTITY%3e%3cDELIVERY_DATE%2f%3e%3cMIN_STOCK_QTY%3e0%3c%2fMIN_STOCK_QTY%3e%3cACTIVE%3eTrue%3c%2fACTIVE%3e%3cEAN%3e501302462%3c%2fEAN%3e%3cOPENSUPPLORDERS%3e0%3c%2fOPENSUPPLORDERS%3e%3c%2fARTICLE_ITEM%3e HTTP/1.1" 200 349 "-" "-"
</span><br>
 </div>

</body>
</html>

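Edit following further clarification of the question.

Further explanation.

xsltproc transforms the input document according to a template (remdivs.xslt). I use the --html option so that the input is parsed as HTML rather than as strict XML, because your input document contains empty <br> elements (as opposed to <br/>).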
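To see why the unclosed <br> matters, here is a small Python sketch (not part of the xsltproc solution; the snippet and the TagCollector class are made up for illustration). It feeds the same fragment to a strict XML parser and to a lenient HTML parser from the standard library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment in the style of the question's input: <br> is never closed.
snippet = '<div class="block highlight">Reason: <span>x</span><br></div>'

# A strict XML parser rejects the fragment, because </div> arrives
# while <br> is still open.
try:
    ET.fromstring(snippet)
    strict_ok = True
except ET.ParseError:
    strict_ok = False


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The HTML parser accepts the same fragment without complaint.
collector = TagCollector()
collector.feed(snippet)
collector.close()

print("strict XML parse succeeded:", strict_ok)
print("tags seen by the HTML parser:", collector.tags)
```

The same distinction is why the example above passes --html to xsltproc instead of relying on its default XML parsing.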

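The processor first reads the input document and builds a document model in memory. It then walks the nodes in that model, applying the templates it finds in the .xslt document.

Looking at the .xslt, it contains the preamble declarations, followed by some general processing rules that help define the kind of output required.

There are only 2 templates. The first is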

<xsl:template match="@* | node()">
    <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
 </xsl:template>

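This template has a match attribute, so it is applied only to those nodes of the input document that match the match expression. In this case the expression is "@* | node()", which matches any attribute or any node in the document: the whole thing! What it does with those nodes is listed inside the template: it copies the current node and fills the copy with the output of applying templates, where the nodes selected for template application are again every attribute and every child node. The result is that, if only this template were present, the output would be a copy of the original input document with the output processing rules applied.

The second template does the rejecting.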

 <xsl:template match="div[@class='block highlight']"/>

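Here it specifically matches <div> elements that have an attribute named class with the value 'block highlight'. The <div>...</div> blocks that match are therefore replaced with the output generated by this template; since the template is empty (it ends with the terminating /), no output is generated.

On the other hand, this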

 <xsl:template match="div[@class='block highlight']">
    suppressed div output<br/>
 </xsl:template>

would output some text in place of the suppressed div block.
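For readers who find the two templates easier to follow as ordinary code, here is a rough Python analogue. It is only a sketch of the idea, not what xsltproc does internally: copy_tree and is_block are hypothetical names, the input is a tiny well-formed document invented for the example, and unlike XSLT the sketch also discards any text that directly follows a dropped element.

```python
import xml.etree.ElementTree as ET


def copy_tree(elem, suppress):
    """Copy elem recursively, leaving out children for which suppress() is true."""
    # Counterpart of the identity template: copy the tag, attributes and text.
    clone = ET.Element(elem.tag, dict(elem.attrib))
    clone.text = elem.text
    clone.tail = elem.tail
    for child in elem:
        if suppress(child):
            # Counterpart of the empty template: emit nothing for this subtree.
            # (Simplification: the child's trailing text is dropped as well.)
            continue
        clone.append(copy_tree(child, suppress))
    return clone


def is_block(elem):
    """Same condition as match="div[@class='block highlight']"."""
    return elem.tag == "div" and elem.get("class") == "block highlight"


# Made-up, well-formed input, so the strict XML parser can read it.
doc = ET.fromstring(
    "<html>"
    '<div class="block highlight">drop me</div>'
    '<div class="no highlight">keep me</div>'
    "</html>"
)

result = copy_tree(doc, is_block)
print(ET.tostring(result, encoding="unicode"))
```

Passing a predicate that always returns False turns copy_tree into a plain identity copy, which corresponds to a stylesheet containing only the first template.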

Here is a different suppression template, based on your revised question.

<xsl:template match="div">
   <xsl:choose>
       <xsl:when test="not(contains(span[@class='line'],'GET /thestring'))">
           <xsl:copy>
               <xsl:apply-templates select="@* | node()"/>
           </xsl:copy>
       </xsl:when>
       <xsl:otherwise><!-- Just do nothing to suppress output -->
       </xsl:otherwise>
   </xsl:choose>
</xsl:template>

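This template applies to all div elements. It tests whether the text content of the child span element whose class attribute has the value line does not contain the string 'GET /thestring'. (If a div had several such spans, XPath 1.0 would use only the first one when converting to a string.)

When it does not contain the string, we perform the same copy as in the first template; otherwise we do nothing, which suppresses the output of that div block.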
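If you would rather stay close to the regular-expression approach from the question, the same keep-or-drop decision can be sketched in Python. This is an alternative technique, not an equivalent of the XSLT: it assumes the <div> blocks are never nested (true for the sample input), it searches the whole block rather than only the span with class line, and drop_blocks and the sample text are made up for illustration.

```python
import re

# Assumption: <div> blocks are flat (never nested), as in the sample input.
# DOTALL lets ".*?" run across line breaks, which is what line-oriented sed lacks.
DIV_BLOCK = re.compile(r"<div\b[^>]*>.*?</div>[ \t]*\n?", re.DOTALL)


def drop_blocks(html, needle):
    """Remove every <div>...</div> block whose text contains needle."""

    def keep_or_drop(match):
        block = match.group(0)
        # Plain substring test, so the needle needs no regex escaping.
        return "" if needle in block else block

    return DIV_BLOCK.sub(keep_or_drop, html)


# Shortened, made-up sample in the style of foo.html.
sample = (
    '<div class="block highlight">\n'
    '  <span class="line">"GET /thestring//index.php HTTP/1.1"</span><br>\n'
    "</div>\n"
    '<div class="block highlight">\n'
    '  <span class="line">"GET /thisisatotallydifferentstring.html HTTP/1.1"</span><br>\n'
    "</div>\n"
)

cleaned = drop_blocks(sample, "GET /thestring")
print(cleaned)
```

Because the needle is compared as a plain substring, the escaping step from the question is not needed here. For real-world HTML with nested divs, a proper parser such as the xsltproc route above is the safer choice.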

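For further reading, look at XPath, which defines how to address the elements and attributes of a document, and at XSLT for writing processing templates. These examples should help a beginner understand both more clearly.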
