Remove HTML tags using sed/grep/awk


Given the following content, how can I remove all the tags?

Study eases concerns about taking antidepressants during pregnancy and autism risk <a href="https://t.co/Cs0mdeYEBo" rel="nofollow noopener" dir="ltr" data-expanded-url="http://cbsn.ws/2oTosqU" class="twitter-timeline-link" target="_blank" title="http://cbsn.ws/2oTosqU" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">cbsn.ws/2oTosqU</span><span class="invisible"></span><span class="tco-ellipsis"><span class="invisible">&nbsp;</span></span></a><a href="https://t.co/rs5813GdLG" class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" >pic.twitter.com/rs5813GdLG</a>

After running the command, the result should look like this:

Study eases concerns about taking antidepressants during pregnancy and autism risk

When I use:

sed -e 's/<[^>]*>//g'

or

sed 's/<[^>]\+>//g'

I get:

Study eases concerns about taking antidepressants during pregnancy and autism risk http://cbsn.ws/2oTosqU&nbsp;pic.twitter.com/rs5813GdLG

This is not what I want. I need to do this using only sed, awk, or grep.

Answer 1

The command works correctly; the problem is the format of your file. You can see this with grep --color=yes '<[^>]*>' file, or by adding a newline after each >:

$ sed -e 's/>/>\n/g' file 
Study eases concerns about taking antidepressants during pregnancy and autism risk <a href="https://t.co/Cs0mdeYEBo" rel="nofollow noopener" dir="ltr" data-expanded-url="http://cbsn.ws/2oTosqU" class="twitter-timeline-link" target="_blank" title="http://cbsn.ws/2oTosqU" >
<span class="tco-ellipsis">
</span>
<span class="invisible">
http://</span>
<span class="js-display-url">
cbsn.ws/2oTosqU</span>
<span class="invisible">
</span>
<span class="tco-ellipsis">
<span class="invisible">
&nbsp;</span>
</span>
</a>
<a href="https://t.co/rs5813GdLG" class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" >
pic.twitter.com/rs5813GdLG</a>

Note that http://, cbsn.ws/2oTosqU, &nbsp; and pic.twitter.com/rs5813GdLG are not inside HTML tags. They are text between tags, so they are left as they are, which is correct.
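To make the point concrete, here is a minimal sketch using a shortened fragment of the input. The function name strip_tags is only an illustrative wrapper around the sed command from the question; it is not something the original post defines.

```shell
# strip_tags: delete every <...> tag, leaving the text between tags untouched.
# (Hypothetical helper name; the sed expression is the one from the question.)
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

printf '%s\n' '<span class="invisible">http://</span><span class="js-display-url">cbsn.ws/2oTosqU</span>' | strip_tags
# prints: http://cbsn.ws/2oTosqU
```

Both span tags disappear, but the text they enclosed survives, which is exactly what happens with the full line.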

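So what you want is not to remove HTML tags, but to remove HTML tags plus some other things, and I cannot tell which parts you want to keep and which you do not.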
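One possible reading, and it is only a guess at the asker's intent, is that every <a ...>...</a> element should be dropped together with its contents, and any remaining tags stripped afterwards. Under that assumption, a rough awk sketch could look like this. The function name strip_links_and_tags is hypothetical, and the code assumes anchors are not nested and that each one opens and closes on the same line.

```shell
# strip_links_and_tags: hypothetical helper, assuming the goal is to drop
# whole <a>...</a> elements (with their contents) and then all other tags.
strip_links_and_tags() {
  awk '{
    # Drop each <a ...>...</a> element together with its contents.
    while ((s = match($0, /<a[ >]/)) > 0) {
      rest = substr($0, s)
      e = index(rest, "</a>")
      if (e == 0) break            # unbalanced anchor: leave the rest alone
      $0 = substr($0, 1, s - 1) substr(rest, e + 4)
    }
    gsub(/<[^>]*>/, "")            # remove any remaining tags
    sub(/[ \t]+$/, "")             # trim trailing whitespace left behind
    print
  }'
}
```

Running strip_links_and_tags < file on the line from the question should leave only the sentence before the first link. A regex-based approach like this is fragile on arbitrary HTML; it fits this input only because the structure is simple and known.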
