Given this:
<p>Currencies fluctuate every day. The rate shown is effective for transactions submitted to Visa on <strong>February 5, 2017</strong>, with a bank foreign transaction fee of <st <span><strong>1</strong> Euro = <strong>1.079992</strong> United States Dolla <p>The 'currency calculator' below gives you an indication of the cost of purchas <p>February 5, 2017</p><div class="clear-both"></div> <!-- removed clearboth- <p><strong>1 EUR = 1.079992 USD</strong></p> <div class="clear-both"></di <table width="290" border="0" cellspacing="0" cellpadding="3"> <a href="/content/VISA/US/en_us/home/support/consumer/travel-support/exchange e-calculator.html"> <button class="btn btn-default btn-xs"><span class="retur <p><p>This converter uses a single rate per day with respect to any two currencies. Rates displayed may not precisely reflect actual rate applied to transaction amount due to rounding differences, Rates apply to the date the transaction was processed by Visa; this may differ from the actual date of the transaction. Banks may or may not assess foreign transaction fees on cross-border transactions. Fees are applied at banks’ discretion. Please contact your bank for more information.</p>
I need to extract 1.079992.
I'm using:
sed -E 's:.*(1\.[0-9\.]+).*:\1:g'
...which works, but is there a more elegant way?
Or is there a way to get that value directly from curl?
(My full command is: curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017' | grep '<p><strong>1' | sed -E 's:.*(1\.[0-9\\.]+).*:\1:g')
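Answer 1
Use curl to fetch, lynx to parse, and awk to extract
Please don't parse XML/HTML with sed, grep, etc. HTML is context-free, but sed and friends are only regular (see footnote 1).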
url='https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017'
user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'
curl -sA "${user_agent}" "${url}" \
| lynx -stdin -dump \
| awk '/1 EUR/{ print $4 }'
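You need some kind of HTML parser to extract content reliably. Here I use lynx (a text-based web browser), but lighter alternatives exist.
Here, curl retrieves the page, then lynx parses it and dumps a textual representation. The /1 EUR/ makes awk search for the string 1 EUR, which matches only the line:
1 EUR = 1.079992 USD
Then { print $4 } makes it print the fourth column, 1.079992.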
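The same parse-then-pick-a-field idea can be sketched in Python using only the standard library. This is an illustration, not part of the original answer: the names TextDumper and extract_rate are made up for this sketch, and the sample string is a shortened fragment of the markup shown in the question rather than a live download.

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect the visible text of a page, one entry per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_rate(html, marker="1 EUR"):
    """Mimic awk '/1 EUR/{ print $4 }': 4th field of the first matching chunk."""
    dumper = TextDumper()
    dumper.feed(html)
    dumper.close()
    for chunk in dumper.chunks:
        fields = chunk.split()
        if marker in chunk and len(fields) >= 4:
            return fields[3]
    return None  # marker not found


# Shortened fragment of the markup from the question (not fetched live).
sample = (
    '<p>February 5, 2017</p><div class="clear-both"></div>'
    "<p><strong>1 EUR = 1.079992 USD</strong></p>"
)
print(extract_rate(sample))
```

Unlike lynx, this does not lay the text out in lines; it treats each text node as one "line", which happens to be enough for this page fragment.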
Alternative solution without curl
Since my HTML parser of choice is lynx, curl is not necessary:
url='https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017'
user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'
lynx -useragent="${user_agent}" -dump "${url}" \
| awk '/1 EUR/{ print $4 }'
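1. A PCRE (grep -P in some implementations) can describe some context-free or even context-sensitive sets of strings, but not all of them.
Edited 2017-12-23 to add a user-agent string (pretending to be Firefox), because the site currently blocks curl and lynx.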
Answer 2
Another solution: html2text
curl -s 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=2/12/2017' \
| html2text \
| grep '1 Euro' \
| awk '{ print $4 }'
Answer 3
Suggestion: use an XML/HTML-aware tool:
xmllint
curl "$url" | xmllint -html -xpath '//span/strong[2]/text()' -
xidel
curl "$url" | xidel -s -e "//span/strong[2]" -
or even
xidel -e "//span/strong[2]" "$url"
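To make clear what the expression //span/strong[2] selects (the second strong element that is a direct child of a span), here is a rough standard-library Python sketch of the same selection. It is not a real XPath engine: the class name SecondStrongInSpan is invented for this illustration, and the sample is a simplified, hypothetical fragment modelled on the markup in the question.

```python
from html.parser import HTMLParser


class SecondStrongInSpan(HTMLParser):
    """Rough stand-in for //span/strong[2]: text of the 2nd <strong> child of a <span>."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []        # entries: [tag, count of <strong> children seen]
        self.capturing = None  # stack depth at which the current capture began
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if tag == "strong" and self.stack and self.stack[-1][0] == "span":
            self.stack[-1][1] += 1
            if self.stack[-1][1] == 2 and self.capturing is None:
                self.capturing = len(self.stack)
                self.matches.append("")
        self.stack.append([tag, 0])

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if self.capturing is not None and len(self.stack) <= self.capturing:
            self.capturing = None

    def handle_data(self, data):
        if self.capturing is not None:
            self.matches[-1] += data


# Simplified, hypothetical fragment modelled on the question's markup.
sample = "<span><strong>1</strong> Euro = <strong>1.079992</strong> USD</span>"
parser = SecondStrongInSpan()
parser.feed(sample)
parser.close()
print(parser.matches)
```

For real pages, xmllint or xidel as shown above remain the simpler choice; the sketch only spells out the structure the XPath relies on.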
Answer 4
I would use pandoc to convert to json, and then python to extract the data. It will be more robust than grep.
Like this; it takes its input on stdin:
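pandoc -f html -t json | python3 -c '
import json
import sys

output = []
data = json.load(sys.stdin)

# data[1] is the list of blocks in the older pandoc JSON layout;
# newer pandoc versions put the blocks under data["blocks"] instead.
for i in data[1][0]["c"]:
    if i["t"] == "Strong":
        output.append(i["c"])

print(output[2][0]["c"])
'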
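The hard-coded indices above depend on where the element sits in the document and on the pandoc JSON layout. A less position-dependent variant is to walk the whole tree and collect every Strong node. The sketch below is an assumption-laden illustration: the helper names (iter_nodes, inline_text, strong_texts) are invented here, and the sample AST is hand-written in the newer {"blocks": [...]} shape rather than produced by running pandoc on the Visa page.

```python
import json


def iter_nodes(node):
    """Yield every dict found anywhere in a nested JSON structure, depth first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def inline_text(inlines):
    """Join the Str and Space nodes below `inlines` into plain text."""
    parts = []
    for n in iter_nodes(inlines):
        if n.get("t") == "Str":
            parts.append(n["c"])
        elif n.get("t") == "Space":
            parts.append(" ")
    return "".join(parts)


def strong_texts(doc):
    """Return the text of every Strong node, in document order."""
    return [inline_text(n["c"]) for n in iter_nodes(doc) if n.get("t") == "Strong"]


# Hand-written sample AST (assumed shape, not real pandoc output for the page).
sample = json.loads("""
{"pandoc-api-version": [1, 17], "meta": {}, "blocks": [
  {"t": "Para", "c": [
    {"t": "Strong", "c": [{"t": "Str", "c": "1"}]},
    {"t": "Space"},
    {"t": "Str", "c": "Euro"},
    {"t": "Space"},
    {"t": "Str", "c": "="},
    {"t": "Space"},
    {"t": "Strong", "c": [{"t": "Str", "c": "1.079992"}]}
  ]}
]}
""")
print(strong_texts(sample))
```

In a pipeline, the same functions could read the document with json.load(sys.stdin) after pandoc -f html -t json; which Strong entry holds the rate still has to be checked against the actual page.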