wget not converting links


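I'm trying to mirror a fairly large website (over 20,000 pages) before it gets an overhaul. Basically, I need a backup before we switch to the new site, in case we forget something we need (we'll have about 1,000 pages at launch). The site runs on a CMS that I can't easily extract usable data from, so I'm trying to make a copy with wget.

My problem is that wget doesn't seem to actually convert the links, even though --convert-links or -k is in the command. I've tried several different combinations of flags, but I can't get the output I need. The most recent failed attempt was: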

nohup wget --mirror -k -l10 -PafscSnapshot --html-extension -R *calendar* -o wget.log http://www.example.org &

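I've also tried adding --backup-converted and spelling out --convert-links instead of -k (it made no difference). I've run it with and without -P and -l; again, neither made a difference.

The resulting files still contain links like: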

http://www.example.org/ht/d/sp/i/17770

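Answer 1

This is an old post, but I'm putting the answer here for anyone who finds it by searching later.

The --convert-links conversion only happens after the download of the site has completed. My guess is that, with a site this large, you stopped the download after a few pages had come down, so the conversion step had not started yet.

See also https://stackoverflow.com/questions/6348289/download-a-working-local-copy-of-a-webpage

From the wget documentation: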
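One way to check whether wget ever reached the conversion step is to inspect the log written by `-o wget.log`. The sketch below assumes that your wget version prints a summary line beginning with `Converted` once the conversion pass finishes; the exact wording differs between versions, so treat the pattern as an assumption. `conversion_ran` is a hypothetical helper name, not part of wget.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the wget log contains a conversion summary.
# Assumption: wget writes a line starting with "Converted" (for example
# "Converted links in N files in S seconds.") after -k has done its work.
conversion_ran() {
    grep -q '^Converted ' "$1"
}

# Only inspect the log if it exists in the current directory.
if [ -f wget.log ]; then
    if conversion_ran wget.log; then
        echo "link conversion ran"
    else
        echo "no conversion summary found; wget may still be running or was interrupted"
    fi
fi
```

If no summary line shows up after wget has exited, the conversion pass most likely never ran.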

‘-k’
‘--convert-links’
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-html content, etc.

Each link will be changed in one of the two ways:

    The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary combinations of directories.
    The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif. 

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by ‘-k’ will be performed at the end of all the downloads. 
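To see how much of a finished mirror was actually converted, you can list the files that still contain absolute links to the original host. This is a minimal sketch: `unconverted_files` is a hypothetical helper, the directory and URL are taken from the question, and it relies on `grep -r`, which GNU and BSD grep provide but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: list files under a mirror directory that still
# contain the given site URL as a literal string.
#   $1 = mirror directory, $2 = site URL prefix
unconverted_files() {
    grep -rlF -- "$2" "$1"
}

# Count the affected files in the question's mirror, if it is present.
if [ -d afscSnapshot ]; then
    unconverted_files afscSnapshot 'http://www.example.org' | wc -l
fi
```

A non-zero count is not proof of failure on its own: per the documentation quoted above, links to pages that were not downloaded (for example those excluded by -R) are deliberately left absolute, and the .orig files kept by --backup-converted will match as well.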

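Answer 2

I had the same problem trying to back up a 6 GB website. After several days, wget finished with no error messages and an exit status of 0, but the links had not been converted. Smaller retrievals with the same options worked fine. It's as if the internal table of what had been downloaded was cleared or corrupted before wget finished.

I'm going to try re-fetching the site with -nc (it shouldn't re-fetch anything, since everything has already been downloaded, and it should complete the link conversion; see "Make wget convert HTML links to relative after download if -k wasn't specified").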

Answer 3

Perhaps you've run into "wget -k converts files differently on Windows and Linux", caused by OS filename restrictions?

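Answer 4

According to the manual, -k is ignored if -O is also given (note that this is the uppercase -O output-document option, not the lowercase -o log-file option used in the question):

Note that a combination with ‘-k’ is only permitted when downloading a single document, as in that case it will just convert all relative URIs to external ones; ‘-k’ makes no sense for multiple URIs when they’re all being downloaded to a single file; ‘-k’ can be used only when the output is a regular file.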
