Using wget, what is the right command to get the gzipped version instead of the actual HTML?

Answer 1

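If you request gzipped content (with the Accept-Encoding: gzip header, which is the correct way to ask for it), then my understanding is that wget cannot read that content. You end up with a single gzipped file on disk for the first page you hit, and nothing else.

In other words, you cannot use wget to request gzipped content and recurse through an entire site at the same time.

I think there is a patch that lets wget support this, but it is not in the default distribution.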
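For a single page, one possible workaround is to save the compressed response as-is and decode it in a separate step. The following is only a sketch: the function names is_gzip and fetch_gzipped and the ".raw" suffix are hypothetical, not part of wget. Because a server may ignore the header, the sketch checks for the gzip magic bytes (1f 8b) before decompressing.

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
  [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# fetch_gzipped URL OUT (hypothetical helper):
# request gzip encoding, keep the raw body in OUT.raw,
# then write the decoded page to OUT.
fetch_gzipped() {
  wget -q -O "$2.raw" --header="Accept-Encoding: gzip" "$1" || return 1
  if is_gzip "$2.raw"; then
    gunzip -c "$2.raw" > "$2"
  else
    # The server ignored the header and sent uncompressed content.
    cp "$2.raw" "$2"
  fi
}
```

If all you want is the compressed file itself, the OUT.raw file is the gzipped version whenever is_gzip succeeds on it.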

If you include the -S flag, you can tell whether the web server is responding with the right kind of content. For example,

wget -S --header="accept-encoding: gzip" wordpress.com
--2011-06-17 16:06:46--  http://wordpress.com/
Resolving wordpress.com (wordpress.com)... 72.233.104.124, 74.200.247.60, 76.74.254.126
Connecting to wordpress.com (wordpress.com)|72.233.104.124|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx
  Date: Fri, 17 Jun 2011 15:06:47 GMT
  Content-Type: text/html; charset=UTF-8
  Connection: close
  Vary: Accept-Encoding
  Last-Modified: Fri, 17 Jun 2011 15:04:57 +0000
  Cache-Control: max-age=190, must-revalidate
  Vary: Cookie
  X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
  X-Pingback: http://wordpress.com/xmlrpc.php
  Link: <http://wp.me/1>; rel=shortlink
  X-nananana: Batcache
  Content-Encoding: gzip
Length: unspecified [text/html]

Here the Content-Encoding header clearly says gzip, whereas for linux.about.com (at the time of writing),

wget -S --header="accept-encoding: gzip" linux.about.com
--2011-06-17 16:12:55--  http://linux.about.com/
Resolving linux.about.com (linux.about.com)... 207.241.148.80
Connecting to linux.about.com (linux.about.com)|207.241.148.80|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 17 Jun 2011 15:12:56 GMT
  Server: Apache
  Set-Cookie: TMog=B6HFCs2H20kA1I4N; domain=.about.com; path=/; expires=Sat, 22-Sep-12 14:19:35 GMT
  Set-Cookie: Mint=B6HFCs2H20kA1I4N; domain=.about.com; path=/
  Set-Cookie: zBT=1; domain=.about.com; path=/
  Vary: *
  PRAGMA: no-cache
  P3P: CP="IDC DSP COR DEVa TAIa OUR BUS UNI"
  Cache-Control: max-age=-3600
  Expires: Fri, 17 Jun 2011 14:12:56 GMT
  Connection: close
  Content-Type: text/html
Length: unspecified [text/html]

there is no Content-Encoding header at all: the server returns plain, uncompressed text/html.

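Because some older browsers still have problems with gzip-encoded content, many sites enable it only on the basis of browser detection. They often leave it off by default and turn it on only when they know the browser can handle it, and they usually do not include wget in that list. This means you may find that wget never gets gzipped content back, even though the site appears to send it to your browser.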
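The header check above can be automated. This is a rough sketch, not a definitive tool: has_gzip_encoding and check_gzip are hypothetical names, and the optional second argument is whatever User-Agent string you want to try in order to see whether the site behaves differently for a browser-like client. It relies on wget -S printing the response headers to stderr.

```shell
# has_gzip_encoding: read response headers on stdin and succeed
# if one of them is "Content-Encoding: gzip" (case-insensitive).
has_gzip_encoding() {
  grep -qiE '^[[:space:]]*Content-Encoding:[[:space:]]*gzip'
}

# check_gzip URL [USER_AGENT] (hypothetical helper):
# fetch URL, discard the body, and inspect the headers wget -S prints.
check_gzip() {
  if [ -n "$2" ]; then
    wget -S -O /dev/null --header="Accept-Encoding: gzip" \
      --user-agent="$2" "$1" 2>&1 | has_gzip_encoding
  else
    wget -S -O /dev/null --header="Accept-Encoding: gzip" \
      "$1" 2>&1 | has_gzip_encoding
  fi
}
```

For example, running check_gzip against the same URL once without a second argument and once with a browser-like User-Agent string would show whether the site only compresses for clients it recognises.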

Answer 2

A simple command to fetch an HTML page, or any other file, and compress it:

$ wget -qO - <url> | gzip -c > file_name.gz

For more information about these options, see the man pages (man wget, man gzip).
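One caveat with the pipeline: its exit status is gzip's, so a failed download still leaves a valid but empty archive behind. A minimal sketch of a more defensive variant follows; fetch_and_gzip is a hypothetical name, and it downloads to a temporary file first so that gzip only runs when wget succeeded.

```shell
# fetch_and_gzip URL OUT.gz (hypothetical wrapper):
# download to a temp file, compress only if the download succeeded.
fetch_and_gzip() {
  tmp=$(mktemp) || return 1
  if wget -q -O "$tmp" "$1"; then
    gzip -c "$tmp" > "$2"
    rc=$?
  else
    rc=1            # download failed: do not create OUT.gz
  fi
  rm -f "$tmp"
  return "$rc"
}
```

Afterwards, gzip -t on the output file is a quick way to confirm that the archive is intact.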

Answer 3

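According to mikeserv et al. in the responses above, the bash developers (around version 4.3) adopted the IEEE specification for how LINENO is maintained, so that the value is always set to 1 while the argument of the EXIT trap is being evaluated. (Technically it is the current line, which is also the first line in that execution context.)

Several workarounds have already been listed. Since I find mine very simple compared with the others, here is my proof of concept for this problem: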

#!/bin/bash
trap 'catch EXIT $? $debug_line_old' EXIT
trap 'debug_line_old=$debug_line;debug_line=$LINENO' DEBUG # note: debug is invoked before line gets executed!
catch() {
  echo "event=$1, rc=$2, line=$3, file=$0"
}
exit 1

When you run it, you should see this:

event=EXIT, rc=1, line=7, file=./trap_exit_get_lineno.bash

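By the way, it pays not to cram every kind of trap event into a single line: keeping a separate line for each signal makes the code much easier to work with. Also, the DEBUG trap seems to be quite limited in what else it can call.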
