Fetch with curl, parse with lynx, extract with awk

Given this:

<p>Currencies fluctuate every day. The rate shown is effective for transactions submitted to Visa on <strong>February 5, 2017</strong>, with a bank foreign transaction fee of <st
<span><strong>1</strong> Euro = <strong>1.079992</strong> United States Dolla
<p>The 'currency calculator' below gives you an indication of the cost of purchas
<p>February 5, 2017</p><div class="clear-both"></div> <!-- removed clearboth-
<p><strong>1 EUR = 1.079992 USD</strong></p> <div class="clear-both"></di
<table width="290" border="0" cellspacing="0" cellpadding="3">
<a href="/content/VISA/US/en_us/home/support/consumer/travel-support/exchange e-calculator.html"> <button class="btn btn-default btn-xs"><span class="retur
<p><p>This converter uses a single rate per day with respect to any two currencies. Rates displayed may not precisely reflect actual rate applied to transaction amount due to rounding differences, Rates apply to the date the transaction was processed by Visa; this may differ from the actual date of the transaction. Banks may or may not assess foreign transaction fees on cross-border transactions. Fees are applied at banks’ discretion. Please contact your bank for more information.</p>

I need to extract 1.079992.

I'm using:

sed -E 's:.*(1\.[0-9\.]+).*:\1:g'

...this works... but is there a more elegant way?

Or: is there a way to get that value directly with curl?

(My full command is: curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017' | grep '<p><strong>1' | sed -E 's:.*(1\.[0-9\\.]+).*:\1:g')
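The sed stage can be checked offline against the line that grep would select (a minimal sketch; the literal line below stands in for the live page output):

```shell
# Offline check of the sed stage on the relevant line
printf '<p><strong>1 EUR = 1.079992 USD</strong></p>\n' \
| sed -E 's:.*(1\.[0-9.]+).*:\1:g'
# prints 1.079992
```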

Answer 1

Fetch with curl, parse with lynx, extract with awk

Please don't parse XML/HTML with sed, grep, etc. HTML is context-free, but sed and friends only handle regular languages.1

url='https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017'
user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'

curl -sA "${user_agent}" "${url}"  \
| lynx -stdin -dump                \
| awk '/1 EUR/{ print $4 }'

You need some kind of HTML parser to extract content reliably. Here I use lynx (a text-based web browser), but lighter alternatives exist.

Here, curl retrieves the page, then lynx parses it and dumps a text representation. The /1 EUR/ pattern makes awk search for the string 1 EUR, so it finds only the line:

   1 EUR = 1.079992 USD

Then { print $4 } makes it print the fourth column: 1.079992.
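The awk stage can be exercised offline; the literal line below stands in for the lynx dump:

```shell
# Same awk stage as in the pipeline, fed a canned sample line
printf '   1 EUR = 1.079992 USD\n' \
| awk '/1 EUR/{ print $4 }'
# prints 1.079992
```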

Alternative solution without curl

Since my HTML parser of choice is lynx, curl is not actually needed:

url='https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=02/05/2017'
user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'

lynx -useragent="${user_agent}" -dump "${url}"  \
| awk '/1 EUR/{ print $4 }'

1 A PCRE (grep -P in some implementations) can describe some context-free or even context-sensitive sets of strings, but not all of them.
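A toy illustration of the footnote: a regular expression cannot track nesting, so the best it can do on nested tags is a "no '<' inside" character-class hack, which only ever reaches the innermost element:

```shell
# Nested <div>s: the regex matches only the innermost element,
# since a regular expression cannot count matching open/close tags
printf '<div>a<div>b</div>c</div>\n' \
| grep -oE '<div>[^<]*</div>'
# prints <div>b</div>
```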


Edited 2017-12-23 to add a user-agent string (pretending to be Firefox), because the site now blocks curl and lynx.

Answer 2

Another solution: html2text

curl -s 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html/?fromCurr=USD&toCurr=EUR&fee=0&exchangedate=2/12/2017' \
| html2text \
| grep '1 Euro' \
| awk '{ print $4 }'
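The grep stage is redundant here, since awk can do the filtering itself. Assuming html2text renders the line as "1 Euro = 1.079992 United States Dollar" (a canned sample below, not live output), the tail of the pipeline reduces to a single awk:

```shell
# Filter and extract in one awk invocation
printf '1 Euro = 1.079992 United States Dollar\n' \
| awk '/1 Euro/{ print $4 }'
# prints 1.079992
```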

Answer 3

Suggestion: use XML/HTML-aware tools:

xmllint

curl "$url" | xmllint -html -xpath '//span/strong[2]/text()' - 

xidel

curl "$url" | xidel -s -e "//span/strong[2]" -

or even

xidel -e "//span/strong[2]" "$url"
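The XPath itself can be sanity-checked offline without xmllint or xidel: Python's xml.etree supports the same positional predicate. The snippet below is an assumed, simplified version of Visa's markup, not the real page:

```shell
# Verify that //span/strong[2] selects the rate, using a stdlib XML parser
printf '%s' '<span><strong>1</strong> Euro = <strong>1.079992</strong> USD</span>' \
| python3 -c '
import sys
import xml.etree.ElementTree as ET

root = ET.fromstring(sys.stdin.read())   # root element is <span>
print(root.find("strong[2]").text)       # second <strong> child
'
# prints 1.079992
```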

Answer 4

I would use pandoc to convert the page to JSON, then python to extract the data. It will be more robust than grep.

Like this; it takes its input on stdin:

pandoc  -f html -t json | python3 -c '
import json
import sys

output=[]
data = json.load(sys.stdin)

for i in data[1][0]["c"]:
    if i["t"]=="Strong":
        output.append((i["c"]))

print(output[2][0]["c"])
'
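For comparison, a pandoc-free sketch using only Python's stdlib html.parser, assuming the rate still appears in a <p><strong>1 EUR = ... USD</strong></p> line (sample input inlined below instead of a live curl):

```shell
printf '%s' '<p><strong>1 EUR = 1.079992 USD</strong></p>' \
| python3 -c '
import sys
from html.parser import HTMLParser

class RateParser(HTMLParser):
    # Track <strong> elements and print the rate field of "1 EUR = ..."
    def __init__(self):
        super().__init__()
        self.in_strong = False
    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
    def handle_data(self, data):
        if self.in_strong and data.startswith("1 EUR"):
            print(data.split()[3])

RateParser().feed(sys.stdin.read())
'
# prints 1.079992
```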
