Tell wget not to crawl URLs matching a pattern?

Question 1

After some trial and error, I realized the solution is simply to use --reject-regex, like this:

wget -r --reject-regex page --spider --no-check-certificate -w 1 http://mysite.com/

The urlregex is a regular expression, not a shell glob, so it must not contain wildcards: *page* is invalid, but page works.
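Because the regex is not anchored, page rejects every URL that contains that substring anywhere. The sketch below is one way to try a pattern offline before handing it to wget. It uses grep -E as a stand-in for wget's matcher, which is an assumption (the two engines may differ in details), and the sample URLs and the narrower pattern [?&]page= are hypothetical.

```shell
# Hypothetical sample URLs; the paths under mysite.com are made up for illustration.
urls='http://mysite.com/index.html
http://mysite.com/list?page=2
http://mysite.com/homepage.html
http://mysite.com/list?sort=asc&page=3'

# An unanchored regex matches anywhere in the URL, so "page" also catches homepage.html.
broad=$(printf '%s\n' "$urls" | grep -E 'page')

# A narrower regex that only matches a query parameter named "page".
narrow=$(printf '%s\n' "$urls" | grep -E '[?&]page=')

printf 'rejected by "page":\n%s\n\n' "$broad"
printf 'rejected by "[?&]page=":\n%s\n' "$narrow"

# Once the pattern behaves as intended, it could be passed to wget, for example:
# wget -r --reject-regex '[?&]page=' --spider -w 1 http://mysite.com/
```

If only the request parameter should be excluded, the narrower pattern avoids rejecting unrelated pages whose names merely contain "page".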

Question 2

From man wget:

-R rejlist --reject rejlist
           Specify comma-separated lists of file name suffixes or patterns to
           accept or reject.

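This option only rejects files whose names match the pattern.

Strictly speaking, the page in your URL is a request parameter, not the last part of the path (e.g. the file name).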
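To make the distinction concrete, here is a small sketch that splits a URL into the file-name part that a suffix or pattern list is compared against and the query string that it does not see. The URL is hypothetical, and the splitting uses plain shell parameter expansion to illustrate the idea. It is not how wget itself parses URLs.

```shell
# Hypothetical URL, used only for illustration.
url='http://mysite.com/list.php?page=2'

# Drop everything from the first "?" onward.
no_query=${url%%\?*}

# The last path component is what a file-name pattern would be matched against.
file_name=${no_query##*/}

# Whatever followed the "?" is the query string (empty if there was none).
query=${url#"$no_query"}
query=${query#\?}

printf 'file name: %s\nquery:     %s\n' "$file_name" "$query"
```

Here page appears only in the query part, which is why a file-name pattern never matches it.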

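You may want to dump all the URLs wget finds (e.g. grep the log for every downloaded URL), remove the ones that do not meet your requirements (e.g. with grep -v), and finally have wget retrieve the remaining URLs. For example: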

# dump the whole website
wget ... -P dump -o wget.log  ...

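# extract URLs from the log file
grep http wget.log | tr -s " " "\012" | grep http > urls

# exclude URLs with the word "page" anywhere in them
# (write to a new file: redirecting back into urls would truncate it before grep reads it)
grep -v page urls > urls.filtered

# delete the previous dump, since it probably contains unwanted files
rm -rf dump

# fetch the remaining URLs, recreating the directory structure
wget -x -i urls.filtered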

You may need to add other wget options as required (e.g. --no-check-certificate).

Answer