Search pattern between tags in HTML

I need to extract the text content of tags that have a specific title attribute.

I have this command:

sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html

This is part of index.html; from it I need "Everything in life is luck."

<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
<a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="b-qt qt_106578 oncl_q" title="view quote">Everything in life is luck.</a>
<a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="bq-aut qa_106578 oncl_a" title="view author">Donald Trump</a>
</div>
<div class="qbn-box">
<div class="sh-cont">
<a href="https://www.brainyquote.com/share/fb/106578" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/106578?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/106578?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a>
</div>
</div>
<div class="qll-dsk-kw-box">
<div class="kw-box">
<a href="https://www.brainyquote.com/topics/life-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="0">Life</a>
<a href="https://www.brainyquote.com/topics/luck-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="1">Luck</a>
<a href="https://www.brainyquote.com/topics/everything-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="2">Everything</a>
</div>
</div>
</div>
<div id="qpos_1_2" class="m-brick grid-item boxy bqQt r-width" style="position: absolute; left: 623px; top: 2px;">
<div class="clearfix">
<div class="qti-listm">
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_119339" class="oncl_q">
<img id="qimage_119339" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1(1).jpg" class="bqphtgrid" alt="The first thing the secretary types is the boss. - Donald Trump">
</a>
</div>
<a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="b-qt qt_119339 oncl_q" title="view quote">The first thing the secretary types is the boss.</a>
<a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="bq-aut qa_119339 oncl_a" title="view author">Donald Trump</a>
</div>
<div class="qbn-box">
<div class="sh-cont">
<a href="https://www.brainyquote.com/share/fb/119339" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/119339?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/119339?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a>
</div>
</div>

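I need all of these values to populate an array in bash. The expected output here is ['Everything in life is luck.', 'The first thing the secretary types is the boss.']. But I need every quote in index.html, so I need a selector that collects all of the quotes into the array.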

Answer 1

Even though this is HTML rather than well-formed XML, you can actually use xmlstarlet.

Let's call your file index.html. The command:

xmlstarlet fo -H index.html 2>/dev/null |
    xmlstarlet sel -t -v '//a[@title="view quote" and string-length(text()) > 1]' -n 2>/dev/null

Output:

Everything in life is luck.
The first thing the secretary types is the boss.
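
Since the question asks for a bash array, here is a minimal sketch of one way to load one-quote-per-line output into an array. The emit_quotes function is a hypothetical stand-in that prints the two lines shown above, so the sketch runs without index.html or xmlstarlet; the array name quotes is likewise just an illustration. It assumes bash 4 or later for mapfile.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the xmlstarlet pipeline: prints one quote per line.
emit_quotes() {
    printf '%s\n' \
        'Everything in life is luck.' \
        'The first thing the secretary types is the boss.'
}

# mapfile (bash 4+) reads stdin line by line; -t strips each trailing newline.
# Process substitution keeps the array in the current shell, which a pipe
# into mapfile would not do.
mapfile -t quotes < <(emit_quotes)

printf 'count: %d\n' "${#quotes[@]}"
printf 'first: %s\n' "${quotes[0]}"
```

To use it on the real file, replace the call to emit_quotes inside `<( ... )` with the xmlstarlet pipeline from this answer.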

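You may not have come across xmlstarlet before. It is an amazing tool that lets you format, edit, and parse XML. Today I discovered that it can also reformat badly formed HTML. If you don't have it, install it. (If you don't have permission to install it, ask.) It understands XML in a way that sed and awk can't begin to handle. Reformat the XML? sed and awk will probably break, but xmlstarlet won't see any significant difference.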
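
To illustrate the point about reformatting, the sketch below runs a line-oriented sed pattern against the same element laid out two ways. The pattern and the sample markup are made up for the demonstration (they are not the asker's exact command or file); the only assumption is a POSIX-style sed and bash for the `$'...'` quoting.

```bash
#!/usr/bin/env bash
# Illustrative pattern: capture the text of an anchor whose opening tag
# ends with title="view quote". sed applies it to one line at a time.
pattern='s/.*title="view quote">\([^<]*\)<\/a>.*/\1/p'

# The same element in two layouts (made-up sample markup).
one_line='<a class="b-qt" title="view quote">Everything in life is luck.</a>'
wrapped=$'<a class="b-qt" title="view quote">Everything in life\nis luck.</a>'

on_one_line=$(printf '%s\n' "$one_line" | sed -n "$pattern")
on_two_lines=$(printf '%s\n' "$wrapped" | sed -n "$pattern")

printf 'single line: [%s]\n' "$on_one_line"
printf 'wrapped:     [%s]\n' "$on_two_lines"
```

The single-line layout yields the quote, while the wrapped layout yields nothing, because neither of its lines contains both the opening tag and the closing `</a>`. An XML-aware tool works on the parsed element, so the line break does not hide it.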
