I can't get wget to recurse successfully when mirroring a blog

Question

Running with -d shows what is happening:

Location: http://blogs.gamefilia.com/lord-areg [following]
    ....
Deciding whether to enqueue "http://blogs.gamefilia.com/lord-areg".
Going to "" would escape "lord-areg" with no_parent on.
Decided NOT to load it.
Redirection "http://blogs.gamefilia.com/lord-areg" failed the test.

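The page you are redirected to lies outside the area you specified, so although it is retrieved, its links are not followed during recursion.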

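Removing the final / means there is no redirect, but, as you found, it also means wget does not treat lord-areg as a directory. It uses the previous / instead, so the entire site matches: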

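Note that, for HTTP (and HTTPS), the trailing slash is very important to '--no-parent'. HTTP has no concept of a "directory": Wget relies on you to indicate what's a directory and what isn't. In 'http://foo/bar/', Wget will consider 'bar' to be a directory, while in 'http://foo/bar' (no trailing slash), 'bar' will be considered a filename (so '--no-parent' would be meaningless, as its parent is '/').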

(4.3 Directory-Based Limits)
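To make the trailing-slash rule concrete, here is a rough Python model of it. This is only a sketch of the idea described in the manual, not wget's actual code: the helper names `directory_of` and `stays_below` are made up for illustration, and the check ignores the host entirely.

```python
from urllib.parse import urlsplit


def directory_of(url):
    """Return the 'directory' of a URL path: everything up to and
    including the last '/'. (Hypothetical helper, not part of wget.)"""
    path = urlsplit(url).path or "/"
    return path[: path.rfind("/") + 1]


def stays_below(start_url, candidate_url):
    """Simplified no-parent test: the candidate's directory must lie
    inside the start URL's directory. Hosts are not compared."""
    return directory_of(candidate_url).startswith(directory_of(start_url))


# The two URLs from the manual excerpt.
for url in ("http://foo/bar/", "http://foo/bar"):
    print(url, "->", directory_of(url))

# With the trailing slash, the redirect target falls outside the start directory.
print(stays_below("http://blogs.gamefilia.com/lord-areg/",
                  "http://blogs.gamefilia.com/lord-areg"))

# Without it, the start directory is '/', so any path on the site passes.
print(stays_below("http://blogs.gamefilia.com/lord-areg",
                  "http://blogs.gamefilia.com/some-other-blog/post"))
```

Under this model the start URL with the slash has directory `/lord-areg/`, while the redirect target has directory `/`, which is why the redirect fails the test in the debug output. The path `some-other-blog/post` is an invented example.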

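So you need to restrict the results some other way. -I lord-areg almost works, but it skips pages of the form /lord-areg?page=1. To match those as well, describe the URLs you want more precisely: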

--accept-regex '^http:\/\/blogs\.gamefilia\.com\/lord-areg[?/]'
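If you want to see what that pattern lets through before starting a long crawl, you can try it on a few URLs. The sketch below uses Python's `re` module as a stand-in; wget uses its own regex engine, but for a pattern this simple I would expect the two to agree. The last two sample URLs are invented for illustration.

```python
import re

# Same pattern as passed to --accept-regex above.
ACCEPT = re.compile(r'^http:\/\/blogs\.gamefilia\.com\/lord-areg[?/]')


def accepted(url):
    """True if the pattern matches at the start of the URL."""
    return ACCEPT.search(url) is not None


samples = [
    "http://blogs.gamefilia.com/lord-areg/",          # blog root, with slash
    "http://blogs.gamefilia.com/lord-areg?page=1",    # paginated listing
    "http://blogs.gamefilia.com/lord-areg",           # nothing after the name
    "http://blogs.gamefilia.com/lord-areg-fan/",      # hypothetical look-alike
    "http://blogs.gamefilia.com/other-blog/",         # hypothetical other blog
]

for url in samples:
    print("accept" if accepted(url) else "reject", url)
```

The `[?/]` at the end is what does the work: it requires either a `?` or a `/` right after `lord-areg`, so the paginated pages are kept while other blogs whose names merely begin with `lord-areg` are not.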

Answer 1