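What I want wget to do is recursively crawl an entire site under a given directory and download all files of one type, for example png files.
I'll use Wikipedia as an example. Here is the command: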
wget -r -p -e robots=off -H -D en.wikipedia.org --no-parent -A png http://en.wikipedia.org/wiki/Main_Page
Here is what I get:
URL transformed to HTTPS due to an HSTS policy
--2016-07-20 11:02:51-- https://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org (en.wikipedia.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to en.wikipedia.org (en.wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘en.wikipedia.org/wiki/Main_Page’
en.wikipedia.org/wi [ <=> ] 64.72K 298KB/s in 0.2s
2016-07-20 11:02:51 (298 KB/s) - ‘en.wikipedia.org/wiki/Main_Page’ saved [66278]
Removing en.wikipedia.org/wiki/Main_Page since it should be rejected.
URL transformed to HTTPS due to an HSTS policy
--2016-07-20 11:02:51-- https://en.wikipedia.org/static/images/wikimedia-button.png
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 2426 (2.4K) [image/png]
Saving to: ‘en.wikipedia.org/static/images/wikimedia-button.png’
en.wikipedia.org/st 100%[===================>] 2.37K --.-KB/s in 0s
2016-07-20 11:02:51 (147 MB/s) - ‘en.wikipedia.org/static/images/wikimedia-button.png’ saved [2426/2426]
URL transformed to HTTPS due to an HSTS policy
--2016-07-20 11:02:51-- https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 1585 (1.5K) [image/png]
Saving to: ‘en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png’
en.wikipedia.org/st 100%[===================>] 1.55K --.-KB/s in 0s
2016-07-20 11:02:51 (102 MB/s) - ‘en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png’ saved [1585/1585]
FINISHED --2016-07-20 11:02:51--
Total wall clock time: 1.0s
Downloaded: 3 files, 69K in 0.2s (316 KB/s)
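The same thing happens even if I add -l inf.
When I run the same command but remove -A png, wget keeps downloading with no end in sight, as it should.
So what is the problem? How do I get it to crawl the entire site but download only certain file types?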
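Answer 1
The output stanny got is surprising but true.
I got the same result, but I also got a successful result from an ordinary Wikipedia page using the following command: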
wget --no-check-certificate --span-hosts -e robots=off -p -A png https://en.wikipedia.org/wiki/Antimatter
I am using wget 1.16, running on a Windows PC with Windows 7 64-bit.
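For anyone less familiar with the short flags, here is a rough sketch that prints the same command with long option names where wget provides them (--page-requisites for -p, --accept for -A). The helper name png_cmd is made up for illustration and is not part of wget; it only prints the command and downloads nothing. Results may differ on other wget versions or platforms.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): print the wget command for one page
# so the options can be reviewed before anything is downloaded.
#   --no-check-certificate : skip TLS certificate verification
#   --span-hosts           : allow following links to other hosts (-H)
#   -e robots=off          : ignore robots.txt
#   --page-requisites      : fetch the images etc. needed to display the page (-p)
#   --accept png           : keep only files matching the accept list "png" (-A png)
png_cmd() {
    # $1: URL of the page whose PNG requisites should be fetched
    printf 'wget --no-check-certificate --span-hosts -e robots=off --page-requisites --accept png %s\n' "$1"
}

# Print the command for the page used above.
png_cmd https://en.wikipedia.org/wiki/Antimatter
```

Note that, unlike the command in the question, this one has no -r, -D or --no-parent, so it fetches the requisites of a single page rather than crawling the site.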