Why won't wget download recursively?

Question 1

I tested this and found the problem:

wget respects robots.txt unless it is explicitly told not to.

wget -r http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
--2015-12-31 12:29:52--  http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html
Resolving www.comp.brad.ac.uk (www.comp.brad.ac.uk)... 143.53.133.30
Connecting to www.comp.brad.ac.uk (www.comp.brad.ac.uk)|143.53.133.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 878 [text/html]
Saving to: ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’

www.comp.brad.ac.uk/research/GI 100%[======================================================>]     878  --.-KB/s   in 0s     

2015-12-31 12:29:53 (31.9 MB/s) - ‘www.comp.brad.ac.uk/research/GIP/tutorials/index.html’ saved [878/878]

Loading robots.txt; please ignore errors.
--2015-12-31 12:29:53--  http://www.comp.brad.ac.uk/robots.txt
Reusing existing connection to www.comp.brad.ac.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘www.comp.brad.ac.uk/robots.txt’

www.comp.brad.ac.uk/robots.txt  100%[======================================================>]      26  --.-KB/s   in 0s     

2015-12-31 12:29:53 (1.02 MB/s) - ‘www.comp.brad.ac.uk/robots.txt’ saved [26/26]

FINISHED --2015-12-31 12:29:53--

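As you can see, wget did exactly what you asked: it fetched the start page, then fetched robots.txt, and stopped.

So what does robots.txt say in this case?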

cat robots.txt
User-agent: *
Disallow: /

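So this site does not want robots downloading anything, or at least not robots that read and honor robots.txt. Usually that means the site does not want to be indexed by search engines.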
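
To make the rule concrete, here is a minimal sketch of a check for a blanket "Disallow: /" under "User-agent: *". The function name has_blanket_disallow is hypothetical, and the parsing is deliberately simplified (it assumes a space after each colon and ignores Allow rules, comments, and groups with several User-agent lines). It is not how wget itself parses robots.txt.

```shell
# has_blanket_disallow FILE   (hypothetical helper, simplified parsing)
# Exit status 0 if FILE has "Disallow: /" inside a "User-agent: *" group.
has_blanket_disallow() {
  awk '
    { sub(/\r$/, "") }                                   # tolerate CRLF line endings
    tolower($1) == "user-agent:" { star = ($2 == "*") }  # track the current group
    star && tolower($1) == "disallow:" && $2 == "/" { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Demo with the two-line robots.txt shown above
tmp=$(mktemp)
printf 'User-agent: *\nDisallow: /\n' > "$tmp"
if has_blanket_disallow "$tmp"; then
  echo "robots.txt blocks all compliant crawlers"
fi
rm -f "$tmp"
```

With a file like this one, any crawler that honors robots.txt has nothing left to follow after the first page, which matches the output above.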

To make wget ignore robots.txt, turn it off with -e robots=off:

wget -r -erobots=off http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html

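Now, if wget is too powerful for you to learn, that's fine, but don't make the mistake of thinking the flaw is in wget.

That said, recursively downloading a website carries risks, so it is sometimes best to set limits to avoid grabbing the whole site: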

wget -r -erobots=off -l2 -np  http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html

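-l2 means go at most 2 levels deep. -l stands for: level.
-np means do not move up in the directory tree, only descend from the starting page. -np stands for: no parent.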
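
If you run this kind of bounded download often, it can be wrapped in a small shell function so the depth is adjustable. This is only a sketch: the name fetch_limited and its argument order are made up for illustration, while the wget options are the ones described above.

```shell
# fetch_limited URL [DEPTH]   (hypothetical helper)
# Recursive download that ignores robots.txt, goes at most DEPTH levels
# deep (default 2), and never ascends above the starting directory.
fetch_limited() {
  url=$1
  depth=${2:-2}
  wget -r -e robots=off -l "$depth" -np "$url"
}

# Example call (needs network access, so it is left commented out):
# fetch_limited http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html 2
```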

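It all depends on the target page, and sometimes you want to specify exactly what to fetch and what to skip. For example, in this case you only get the default .html/.htm extensions, not the graphics, PDFs, or music/video files. The -A option takes a comma-separated list of file name suffixes (or patterns) to accept, which lets you add the extension types you want to grab.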
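
As an illustration of -A, here is a sketch that joins a list of suffixes with commas and passes them to wget. The function name fetch_types and the suffixes in the example call are assumptions chosen for illustration; the fixed depth of 2 and -np simply reuse the limits from the command above.

```shell
# fetch_types URL SUFFIX...   (hypothetical helper)
# Bounded recursive download that keeps only files with the given suffixes.
fetch_types() {
  if [ "$#" -lt 2 ]; then
    echo "usage: fetch_types URL SUFFIX..." >&2
    return 2
  fi
  url=$1
  shift
  # Join the remaining arguments with commas: "pdf png" becomes "pdf,png"
  accept=$(IFS=,; printf '%s' "$*")
  wget -r -e robots=off -l 2 -np -A "$accept" "$url"
}

# Example call (needs network access, so it is left commented out):
# fetch_types http://www.comp.brad.ac.uk/research/GIP/tutorials/index.html html htm pdf png
```

As far as I know, wget still has to download HTML pages in order to follow their links, and afterwards deletes the ones that do not match the accept list, so keep html and htm in the list if you want the pages themselves.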

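By the way, I checked, and my wget is version 1.17, from 2015. I'm not sure which version you are using. Incidentally, I believe Python was also created in the 90s, so by your reasoning Python is 90s garbage too.

I admit that the wget --help output is very dense and packed with features, as is the wget man page, so it is understandable that someone would not want to read it. But there are plenty of online tutorials that show how to do the most common wget operations.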

Answer