Parsing HTML with sed

2024-5-24

I need to parse HTML and change the angle brackets that surround plain text (not HTML code) back to &lt; or &gt;.

Here is an example of the HTML code I have to process:

<content:encoded><![CDATA[<div class="pre_headline">some text</div> <p>…. More text . </p><p></p><h2> More text </h2><p> More text < text between angle brackets > … more text
… </content:encoded>

Desired output:

<content:encoded><![CDATA[<div class="pre_headline">some text</div> <p>…. More text . </p><p></p><h2> More text </h2><p> More text &lt; text between angle brackets &gt; … more text
… </content:encoded>

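All of the text is on a single line. So far I have done every replacement with sed or awk, but I cannot find a way to replace the brackets in the text without also changing all the HTML tags.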

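My idea is to assume that an HTML bracket is never followed by a space, whereas a bracket in inline text usually is. That could be one way to select which brackets I have to replace. Maybe there is a better rule, because this method will not replace bracketed text that contains no spaces :(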
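The whitespace rule described above can be sketched in sed as follows. This is only an illustration of that heuristic, not a general solution: the function name `escape_spaced_brackets` is hypothetical, and it assumes that a literal `<` is always followed by a space, a literal `>` is always preceded by one, and no real tag is written with such spacing (for example `<a href="x" >` would be damaged).

```shell
# Hypothetical helper implementing the whitespace heuristic:
# "< " and " >" are assumed to be literal text, anything else is left alone.
escape_spaced_brackets() {
  sed -e 's/< /\&lt; /g' -e 's/ >/ \&gt;/g'
}

# Example on a fragment of the sample input
printf '%s\n' '<p> More text < text between angle brackets > more text</p>' |
  escape_spaced_brackets
# prints: <p> More text &lt; text between angle brackets &gt; more text</p>
```

As noted in the question, bracketed text without spaces such as `<foo>` is not touched by this rule.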

The following sed commands replace all brackets, including those of the HTML tags.

sed -e 's/>/\&gt;/g' -e 's/</\&lt;/g'

Answer 1

This is possible with sed, but it is harder than with any XML parser.

sed '
:2
#puts open and close tag in one pattern
/\s*<\([^>]*>\).*<\/\1\s*$/!{
    N
    b2
}
#mark pairable tags by `#` symbol
:1
s/\(.*<\)\(\([^#> ]*\).*<\)\/\3/\1#\2#\/\3/
#other variant
#s/\(.*<\)\(\([^><]*\)[^>]*>.*<\/\3\)>/\1#\2#>/
t1
#change non-marked text
s/<\([^#]*\)>/\&lt;\1\&gt;/g
#remove marks
s/#//g
' file.html
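For comparison, a simpler variant (not the script above) escapes every bracket first and then restores only tags whose names are on a fixed allowlist. This is a sketch under stated assumptions: the function name `escape_non_tags` is hypothetical, the tag list (`div`, `p`, `h2`, `content:encoded`) is taken from the sample in the question and would need extending, attribute values are assumed to contain no `&`, and the input is assumed to contain no pre-existing `&lt;tag&gt;` entities, since those would be turned into tags.

```shell
# Hypothetical helper: escape all angle brackets, then restore known tags.
# The allowlist of tag names below is an assumption based on the sample input.
escape_non_tags() {
  sed -E \
    -e 's/</\&lt;/g' \
    -e 's/>/\&gt;/g' \
    -e 's/&lt;(\/?(div|p|h2|content:encoded)([[:space:]][^&]*)?)&gt;/<\1>/g' \
    -e 's/&lt;(!\[CDATA\[)/<\1/g' \
    -e 's/(\]\])&gt;/\1>/g'
}

# Example: "< c d >" and "<foo>" are not on the allowlist, so they stay escaped
printf '%s\n' '<![CDATA[<div class="x">a</div> <p> b < c d > e <foo> f</p>]]>' |
  escape_non_tags
```

Unlike the whitespace rule, this also catches bracketed text without spaces, at the cost of having to maintain the list of tag names.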
