wget - How to download recursively and only specific MIME types/extensions (i.e. text only)

Answer 1

You can specify a list of allowed or disallowed filename patterns:

Allowed:

-A LIST
--accept LIST

Disallowed:

-R LIST
--reject LIST

LIST is a comma-separated list of filename patterns/extensions.

You can use the following reserved characters to specify patterns:

*
?
[
]
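
To get a feel for these wildcards, here is a minimal sketch that uses the shell's own `case` matching, which understands the same `*`, `?` and `[ ]` characters. It only illustrates the pattern syntax: wget has its own matcher, and it treats a list entry without any wildcard as a plain suffix. The function name `glob_match` and the file names are made up for the example.

```shell
# Return success if NAME ($1) matches the glob PATTERN ($2).
# Shell "case" patterns use the same *, ? and [ ] wildcards.
glob_match() {
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical file names and patterns, for illustration only.
for pattern in '*.txt' 'notes?.txt' '[0-9]*.txt'; do
    for name in notes.txt notes1.txt 2024-log.txt; do
        if glob_match "$name" "$pattern"; then
            echo "$pattern matches $name"
        fi
    done
done
```

Note that `notes?.txt` matches `notes1.txt` but not `notes.txt`, because `?` stands for exactly one character.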

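Examples:

Download only PNG files: -A png
Do not download CSS files: -R css
Do not download PNG files that start with "avatar": -R "avatar*.png" (quote the pattern so the shell does not expand it)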
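
Putting this together for the "text only" case from the question, a recursive download restricted by extension might look like the sketch below. The extension list and the URL are placeholders, and `fetch_text_only` and the `WGET` variable are hypothetical helpers that exist only so the command can be printed as a dry run instead of hitting the network.

```shell
# Extensions treated as "text" here; adjust to taste (illustrative choice).
ACCEPT='txt,html,htm'

fetch_text_only() {
    # -r:  recurse into links
    # -np: never ascend to the parent directory
    # -A:  accept only names matching the comma-separated list
    # ${WGET:-wget} lets a dry run substitute "echo wget".
    ${WGET:-wget} -r -np -A "$ACCEPT" "$1"
}

# Dry run: print the command instead of contacting the network.
WGET='echo wget'
fetch_text_only 'https://example.com/docs/'
```

Unset `WGET` to run the real command. This filters on file names only, so it cannot help with files that lack a usable extension.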

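If the files have no extension, or the file names follow no pattern you can make use of, I guess you would need MIME type parsing (see Lars Kotthoff's answer).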

Answer 2

You could try patching wget with this (also here) to filter by MIME type. The patch is quite old by now, though, so it may no longer work.

Answer 3

The new Wget (Wget2) already has this feature:

--filter-mime-type    Specify a list of mime types to be saved or ignored

### `--filter-mime-type=list`

Specify a comma-separated list of MIME types that will be downloaded.  Elements of list may contain wildcards.
If a MIME type starts with the character '!' it won't be downloaded, this is useful when trying to download
something with exceptions. For example, download everything except images:

  wget2 -r https://<site>/<document> --filter-mime-type=*,\!image/*

It is also useful to download files that are compatible with an application of your system. For instance,
download every file that is compatible with LibreOffice Writer from a website using the recursive mode:

  wget2 -r https://<site>/<document> --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
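
To see what the `sed` program in that command produces, here is a small sketch run against an inline stand-in for a `.desktop` file. The entry below is made up (a real file lists many more types), and `-E` is used instead of `-r` only for portability; the program uses no extended-regex features, so either flag should behave the same.

```shell
# Hypothetical stand-in for a .desktop file; not the real LibreOffice entry.
desktop_entry='[Desktop Entry]
Name=Example Writer
MimeType=application/vnd.oasis.opendocument.text;application/msword;text/rtf;'

# Same sed program as in the manual excerpt:
#   /^MimeType=/!d   drop every line that does not start with "MimeType="
#   s/^MimeType=//   strip the key, leaving only the value
#   s/;/,/g          turn the ";" separators into commas
mime_list=$(printf '%s\n' "$desktop_entry" |
    sed -E '/^MimeType=/!d;s/^MimeType=//;s/;/,/g')

echo "$mime_list"
```

The result is the comma-separated list that gets substituted into `--filter-mime-type=`. A trailing `;` in the file becomes a trailing comma in the list.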

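As of today, Wget2 has not been released yet, but it will be soon. Debian unstable already ships an alpha version.

Look at https://gitlab.com/gnuwget/wget2 for more info. You can post questions/comments directly to [email protected].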

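Answer 4

A completely different approach I tried is using Scrapy, but it has the same problem! I solved it as described in this SO question: "Python Scrapy - mimetype based filter to avoid non-text file downloads?"

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.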
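
Setting the variable might look like the sketch below. The port 8080 is the one the proxy listing in this answer listens on; running the proxy on localhost is an assumption, and the spider name is left as a placeholder.

```shell
# Port 8080 matches the proxy listing in this answer; localhost is assumed.
http_proxy='http://localhost:8080'
export http_proxy

# Start the crawl from this same shell so it inherits the variable, e.g.:
#   scrapy crawl <spider>
echo "http_proxy=$http_proxy"
```

This variable covers plain HTTP requests only, which fits the proxy in this answer since it forwards plain HTTP on port 80.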

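What the proxy should do is:

Take HTTP requests from Scrapy and send them to the server being crawled. It then hands the response back to Scrapy, i.e. it intercepts all HTTP traffic.

For binary files (based on a heuristic you implement) it sends a 403 Forbidden error to Scrapy and immediately closes the request/response. This helps to save time and traffic, and Scrapy won't crash.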
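
The heuristic itself can be very small. The sketch below mirrors the one used in this answer's proxy (anything whose Content-Type does not start with `text/` counts as binary) as a standalone shell function, so the decision can be tried out on sample header values. The function name `proxy_decision` is made up for the example.

```shell
# Decide what the proxy would do for a given Content-Type header value.
# Heuristic: only text/* is forwarded; everything else, including a
# missing Content-Type, is refused.
proxy_decision() {
    case "$1" in
        text/*) echo 'forward' ;;
        *)      echo '403 Forbidden' ;;
    esac
}

proxy_decision 'text/html; charset=utf-8'
proxy_decision 'image/png'
proxy_decision ''
```

A stricter or looser rule (for example, also allowing `application/json`) would just mean adding more patterns to the first branch.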

Sample proxy code that actually works!

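var http = require('http');

http.createServer(function(clientReq, clientRes) {
    // Forward the request to the host named in the Host header (plain HTTP, port 80 only).
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Heuristic: anything that is not text/* counts as binary.
            // Drop the upstream response and answer 403 instead.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            return; // do not fall through and send headers a second time
        }

        // Text response: relay status, headers and body unchanged.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxyReq: ' + e.message);
        // Tell the client the upstream request failed instead of leaving it hanging.
        if (!clientRes.headersSent) {
            clientRes.writeHead(502);
        }
        clientRes.end();
    });

    proxyReq.end();

}).listen(8080);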
