Mirroring a website with wget, but only URLs matching a pattern

Question 1

Yes, the solution is -I:

  -I list
   --include-directories=list
       Specify a comma-separated list of directories you wish to follow
       when downloading.  Elements of list may contain wildcards.

For example:

wget http://abc.com/A/G/4/ --no-parent -I /A/G/4
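
Note that -I belongs to wget's recursive accept/reject options, so it only has an effect when recursion is switched on with -r (or -m). A minimal sketch of how the pieces fit together; the helper name fetch_dir is made up for illustration, and the host and path are the placeholders from the example above:

```sh
# Hypothetical helper: recursively fetch a single directory tree.
#   $1 = start URL
#   $2 = comma-separated directory list passed to -I
fetch_dir() {
    # -r           follow links recursively
    # --no-parent  never ascend above the start directory
    # -I           only follow links into the listed directories
    wget -r --no-parent -I "$2" "$1"
}

# Usage (performs network requests):
#   fetch_dir 'http://abc.com/A/G/4/' '/A/G/4'
```

Quote the directory list if it contains wildcards, so that the shell passes them to wget unexpanded.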

Question 2

There are several relevant flags:

-A acclist --accept acclist

(comma-separated list of glob-style patterns for file names)

-I list
--include-directories=list

(comma-separated list of glob-style patterns for directories)

--accept-regex urlregex

(a regular expression matched against the complete URL)
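
As a rough illustration of the third flag: because --accept-regex is tested against the whole URL, one pattern can constrain the directory and the file name at once. The sketch below rests on assumptions: the helper name fetch_matching, the example.com URL, and the regex are all hypothetical and not taken from any real site.

```sh
# Hypothetical helper: recursive download limited to URLs matching a regex.
#   $1 = start URL
#   $2 = regular expression tested against each complete URL
fetch_matching() {
    wget -r --no-parent --accept-regex "$2" "$1"
}

# Usage (performs network requests; URL and pattern are placeholders):
#   fetch_matching 'http://example.com/docs/' '/docs/.*\.html$'
```

Single-quote the regex so that the shell does not interpret characters such as *, $, or \.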

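You would typically also pass -r for recursion, plus -l inf, because otherwise the maximum recursion depth is 5. If you want to be able to stop and restart the download, -nc ("no clobber") avoids re-downloading existing files. For that purpose -E (--adjust-extension) is also useful: it adds a .html extension to HTML pages that lack one, and when the extension is present and -nc is specified, wget will still read URLs from the on-disk copy of the file.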

Here is an example that downloads the word-by-word translation of the Quran:

wget -E -nc -l inf -nd -r --no-parent 'http://corpus.quran.com/wordbyword.jsp?chapter=1&verse=1' -A '*wordbyword*'

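It starts at the first verse, and since every page links to the next verse, it eventually downloads all of them. The -A option restricts the download to the pages we are interested in.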
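
Since -nc makes the command safe to rerun, the same flags can be wrapped in a small function and simply invoked again after an interruption. This is only a sketch: the function name crawl_accept is invented for illustration, while the flags are exactly those of the command above.

```sh
# Hypothetical wrapper around the command above.
#   $1 = start URL
#   $2 = glob pattern(s) for -A
crawl_accept() {
    # -E       add .html to HTML pages that lack the extension
    # -nc      skip files that already exist on disk
    # -l inf   no limit on recursion depth
    # -nd      do not recreate the remote directory hierarchy
    wget -E -nc -l inf -nd -r --no-parent "$1" -A "$2"
}

# Usage (performs network requests); rerun the same call to resume:
#   crawl_accept 'http://corpus.quran.com/wordbyword.jsp?chapter=1&verse=1' '*wordbyword*'
```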

I think more examples are needed, so feel free to suggest some and I will try to update this answer.
