wget - How do I download web pages based on a URL pattern?

2024-5-28

Consider a website, www.music.com, with the following directory structure:

/piano
   /covers
     /Chopin 
        apple.html
        bannan.js
        balloon.html
        index.html
     /Franz Liszt
        index.html
        roses.js
        Love Dream.html

     /Frodo
        index.html
        linkenso.html

/violin
   /covers
      /David
         Viva.html
      /Ross
         index.html

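I only want to fetch the index.html files from the nested directories under music.com/piano/covers whose subdirectory names start with "Fr". In the example above, that means downloading just 2 files:

www.music.com/piano/covers/Franz Liszt/index.html
www.music.com/piano/covers/Frodo/index.html

With wget, I came up with the following: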

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.music.com/piano/covers \
   --accept-regex=/piano/covers/Fr.*/index.html \
        http://www.music.com

I ran the same thing against my own website, but only got a single, incorrect file:

www.music.com/index.html

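Why did I use the options above?

Using --recursive or -r is not the issue, as the error still persists. Also, --page-requisites seemed the better choice, because I don't need every piece of information the server has to offer.
--domains: to make sure nothing outside the specified URL is downloaded. That should be the case here, since I don't need any resources outside the piano/covers folder.
--header: I wanted to override Accept: */* so that my requests don't ask for everything.
--html-extension: to download only HTML files.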
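
A side note on the --domains reasoning above: the wget manual describes --domains as taking a comma-separated list of domains, so a value that also contains a path may not act as a path filter. The sketch below only uses shell parameter expansion to show how such a value splits into a host part and a path part. The helper names host_part and path_part are hypothetical, and nothing here calls wget.

```shell
# Hypothetical helpers (not part of wget): split a "host/path" value
# to see which part could be a domain name at all.
host_part() {
    # Everything before the first slash.
    printf '%s\n' "${1%%/*}"
}

path_part() {
    # Everything from the first slash onward; empty if there is no slash.
    case "$1" in
        */*) printf '/%s\n' "${1#*/}" ;;
        *)   printf '\n' ;;
    esac
}

# The value passed to --domains in the command above.
value="www.music.com/piano/covers"
echo "host: $(host_part "$value")"
echo "path: $(path_part "$value")"
```

If the goal is to restrict the crawl by path, wget documents --include-directories (-I) and --no-parent for that purpose. Whether they fit this particular site has not been tested here.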

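It seems the --accept-regex part is not even being considered for some reason. However, it was recommended over the -A option because the files I need are spread across different directories. Any ideas on how to get the two files specified by the --accept-regex argument?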
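
One way to sanity-check the pattern itself, independent of wget, is to run it over a list of candidate URLs with grep -E. This is a minimal sketch that assumes wget's default POSIX regex type treats this pattern the same way grep -E does, and that the pattern is tested against the complete URL as the wget manual states. The URL list is built by hand from the example tree, and filter_urls is a hypothetical helper.

```shell
# The pattern from the question, unchanged.
pattern='/piano/covers/Fr.*/index.html'

# Hypothetical helper: read URLs on stdin (one per line) and print only
# those the pattern matches, using POSIX extended regular expressions.
filter_urls() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep -E -- "$1" || true
}

# Candidate URLs taken from the example directory tree, plus the start URL
# and the intermediate directory pages a crawler would pass through.
urls='http://www.music.com/
http://www.music.com/piano/
http://www.music.com/piano/covers/
http://www.music.com/piano/covers/Chopin/index.html
http://www.music.com/piano/covers/Franz Liszt/index.html
http://www.music.com/piano/covers/Frodo/index.html
http://www.music.com/violin/covers/Ross/index.html'

printf '%s\n' "$urls" | filter_urls "$pattern"
```

Only the two Fr... URLs pass. The start URL and the intermediate /piano/ and /piano/covers/ pages do not, which may matter if wget applies the same filter to every link it considers following during recursion. That is a possible explanation, not something confirmed here.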

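Edit 1:

Visiting the URLs used in the example above gives a 404 error, so here is the context for my own website, where I am actually trying to do this.

From www.ajayhalthor.com, the directory structure is: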

/piano
   /nightwish-sahara
   /nightwish-amaranth
   /skillet-hero
   /skillet-the-last-night
   /breaking-benjamin-diary-of-jane
   /skillet-comatose
   /one-republic-counting-stars
   /skillet-falling-inside-the-black
   /63/index.html
   /a/few/more/links/index.html

/about
   /other/links/index.html
/Home
   /main/links/index.html

From this structure, I want to retrieve the files under www.ajayhalthor.com/piano whose names start with "sk". I expect to retrieve the following:

www.ajayhalthor.com/piano/skillet-hero
www.ajayhalthor.com/piano/skillet-the-last-night
www.ajayhalthor.com/piano/skillet-comatose
www.ajayhalthor.com/piano/skillet-falling-inside-the-black

Running the following command:

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.ajayhalthor.com/piano \
   --accept-regex="piano/sk.*" \
        http://www.ajayhalthor.com

I get the following output:

Resolving www.ajayhalthor.com (www.ajayhalthor.com)... 23.229.213.7
Connecting to www.ajayhalthor.com (www.ajayhalthor.com)|23.229.213.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.ajayhalthor.com/index.html’

www.ajayhalthor.com/index.html              [ <=>                                                                          ]  24.28K  --.-KB/s    in 0.1s    

Last-modified header missing -- time-stamps turned off.
2017-01-19 01:56:11 (245 KB/s) - ‘www.ajayhalthor.com/index.html’ saved [24862]

FINISHED --2017-01-19 01:56:11--
Total wall clock time: 1.4s
Downloaded: 1 files, 24K in 0.1s (245 KB/s)
Converting links in www.ajayhalthor.com/index.html... 11-1
Converted links in 1 files in 0.003 seconds.

Only 1 file, www.ajayhalthor.com/index.html, was downloaded. Am I using --accept-regex correctly?
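
To see what the pattern piano/sk.* accepts on its own, the sketch below labels a handful of URLs from the directory listing as accept or reject using grep -E. It assumes grep -E and wget's default POSIX regex type agree on this pattern; classify is a hypothetical helper, and the URL list is a hand-picked sample rather than the site's real link set.

```shell
# The pattern from the command above, unchanged.
pattern='piano/sk.*'

# Hypothetical helper: label one URL as accept or reject under the pattern.
classify() {
    if printf '%s\n' "$1" | grep -E -q -- "$pattern"; then
        echo "accept $1"
    else
        echo "reject $1"
    fi
}

# Sample URLs: the start URL, the /piano/ page, and a few entries below it.
for url in \
    http://www.ajayhalthor.com/ \
    http://www.ajayhalthor.com/piano/ \
    http://www.ajayhalthor.com/piano/nightwish-sahara \
    http://www.ajayhalthor.com/piano/skillet-hero \
    http://www.ajayhalthor.com/piano/skillet-comatose \
    http://www.ajayhalthor.com/about/
do
    classify "$url"
done
```

The skillet-... URLs are accepted, while the start URL and the /piano/ page are rejected. That is consistent with the log above, where only the start page was saved, though it does not by itself prove the cause.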
