从 CLI 进行基本网络抓取

Question 1

和

curl "http://clojurescript.net/" | scrape -be '//body/script' | xml2json | jq '.html.body.script[].src

你有

"http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-web.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-main.js"

这些工具是：

伟大的jqhttps://stedolan.github.io/jq/;
刮https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/scrape;
xml2jsonhttps://github.com/Inist-CNRS/node-xml2json-command。

或者与：

curl "http://clojurescript.net/" | hxnormalize -x | hxselect -i 'body > script' |  grep -oP '(http:.*?)(")' | sed 's/"//g'

你有：

http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js
http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js
http://kanaka.github.io/cljs-bootstrap/web/repl-web.js
http://kanaka.github.io/cljs-bootstrap/web/repl-main.js

Answer

和

curl "http://clojurescript.net/" | scrape -be '//body/script' | xml2json | jq '.html.body.script[].src

你有

"http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-web.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-main.js"

这些工具是：

伟大的jqhttps://stedolan.github.io/jq/;
刮https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/scrape;
xml2jsonhttps://github.com/Inist-CNRS/node-xml2json-command。

或者与：

curl "http://clojurescript.net/" | hxnormalize -x | hxselect -i 'body > script' |  grep -oP '(http:.*?)(")' | sed 's/"//g'

你有：

http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js
http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js
http://kanaka.github.io/cljs-bootstrap/web/repl-web.js
http://kanaka.github.io/cljs-bootstrap/web/repl-main.js

Question 2

我不知道有什么独立的实用程序可以解析 HTML。有一些用于 XML 的实用程序，但我认为它们都不易于使用。

许多编程语言都有一个解析 HTML 的库。大多数 Unix 系统都有 Perl 或 Python。我推荐使用Python美丽汤或者 Perl 的HTML::树构建器。如果您愿意，当然可以使用另一种语言（诺科吉里在 Ruby 等中）

这是一个将下载和解析结合在一起的 Python 单行代码：

python2 -c 'import codecs, sys, urllib, BeautifulSoup; html = BeautifulSoup.BeautifulSoup(urllib.urlopen(sys.argv[1])); sys.stdout.writelines([e["src"] + "\n" for e in html.findAll("script")])' http://clojurescript.net/

或者作为更具可读性的几行代码：

python2 -c '
import codecs, sys, urllib, BeautifulSoup;
html = BeautifulSoup.BeautifulSoup(urllib.urlopen(sys.argv[1]));
scripts = html.findAll("script");
for e in scripts: print(e["src"])
' http://clojurescript.net/

Answer