When I save a web page with Firefox, I get the following directory structure:
.
├── Some Page/
└── Some Page.html
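So there is an .html file and a folder containing the images, JavaScript, CSS, and so on.
Can I get the same result (the HTML file plus a single folder) with wget or any other command-line tool?
Edit: I need this because I download many web pages, and it sometimes gets confusing to check where each page was saved.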
Answer 1
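I may not fully understand the question, but a simple approach is to use the -r flag:
wget -r www.site.com
This crawls recursively to a depth of up to 5 levels by default (you can change that with -l N, where N is the maximum depth) and stores the results in ./www.site.com/, essentially recreating the folder structure of the crawled URLs inside that folder. So you end up with: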
.
└── www.site.com/
    ├── pics/
    │   ├── image1.jpg
    │   └── image2.jpg
    ├── index.html
    └── links.html
However, this does not keep index.html in the current folder; it is placed inside the site's folder.
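For a single page rather than a whole site, a closer approximation of the Firefox layout may be to fetch only the page and its requisites into one named folder. The following is a minimal sketch, not a tested recipe: save_page and the DRY_RUN variable are hypothetical names introduced here, the flag combination is modeled on the page-requisites example in the wget man page, and unlike Firefox it places the HTML file inside the folder together with the assets.

```shell
#!/bin/sh
# save_page URL DIR: fetch one page plus the images, CSS and scripts it needs
# into DIR. Hypothetical helper; set DRY_RUN=1 to print the command instead
# of running it.
save_page() {
    url=$1
    dir=$2
    # -p  (--page-requisites)   also download images, CSS and scripts
    # -k  (--convert-links)     rewrite links to point at the local copies
    # -E  (--adjust-extension)  add an .html suffix where it is missing
    # -H  (--span-hosts)        allow requisites hosted on other domains
    # -nd (--no-directories)    do not recreate the remote hierarchy
    # -P  (--directory-prefix)  save everything under DIR
    set -- wget -p -k -E -H -nd -P "$dir" "$url"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling save_page http://www.site.com/ "Some Page_files" would then leave everything for that page in one folder whose name you chose, which should make it easier to tell where each download went.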
If you want to experiment with the directory structure, the man page has some information on how to shorten the paths:
Directory Options
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames
will get extensions .n).
-x
--force-directories
The opposite of -nd---create a hierarchy of directories, even if one would not have been created otherwise. E.g. wget -x http://fly.srk.fer.hr/robots.txt will save the downloaded file to fly.srk.fer.hr/robots.txt.
-nH
--no-host-directories
Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
--cut-dirs=number
Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.
Take, for example, the directory at ftp://ftp.xemacs.org/pub/xemacs/. If you retrieve it with -r, it will be saved locally under ftp.xemacs.org/pub/xemacs/. While the -nH option can remove the ftp.xemacs.org/ part,
you are still stuck with pub/xemacs. This is where --cut-dirs comes in handy; it makes Wget not "see" number remote directory components. Here are several examples of how --cut-dirs option works.
No options -> ftp.xemacs.org/pub/xemacs/
-nH -> pub/xemacs/
-nH --cut-dirs=1 -> xemacs/
-nH --cut-dirs=2 -> .
--cut-dirs=1 -> ftp.xemacs.org/xemacs/
...
If you just want to get rid of the directory structure, this option is similar to a combination of -nd and -P. However, unlike -nd, --cut-dirs does not lose with subdirectories---for instance, with -nH --cut-dirs=1, a
beta/ subdirectory will be placed to xemacs/beta, as one would expect.
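To make the -nH and --cut-dirs table above easier to reason about, here is a small sketch that reproduces the documented mapping with plain string handling. It does not call wget at all; local_dir and its argument order are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# local_dir HOST REMOTE_PATH NO_HOST CUT
# Print the directory in which a recursive retrieval would be rooted, following
# the man page table. NO_HOST is 1 for -nH and 0 otherwise; CUT is the
# --cut-dirs value. This only mimics the documented mapping.
local_dir() {
    host=$1
    path=${2#/}      # strip a leading slash
    path=${path%/}   # strip a trailing slash
    no_host=$3
    cut=$4
    # Drop CUT leading components from the remote path.
    while [ "$cut" -gt 0 ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   path= ;;
        esac
        cut=$((cut - 1))
    done
    if [ "$no_host" = 1 ]; then
        result=$path
    else
        result=$host${path:+/$path}
    fi
    # An empty result means the current directory.
    if [ -n "$result" ]; then
        printf '%s/\n' "$result"
    else
        printf '.\n'
    fi
}

local_dir ftp.xemacs.org pub/xemacs 0 0   # ftp.xemacs.org/pub/xemacs/
local_dir ftp.xemacs.org pub/xemacs 1 0   # pub/xemacs/
local_dir ftp.xemacs.org pub/xemacs 1 1   # xemacs/
local_dir ftp.xemacs.org pub/xemacs 1 2   # .
local_dir ftp.xemacs.org pub/xemacs 0 1   # ftp.xemacs.org/xemacs/
```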