I'm looking for a way to show all of the URLs in a redirect chain, preferably from the shell. I've found a way to almost do it with curl, but it only shows the first and the last URL. I'd like to see all of them.
There must be a way to do this simply, but I can't for the life of me find what it is.
Edit: Since submitting this I've found out how to do it with Chrome (CTRL+SHIFT+I -> Network tab). But I'd still like to know how it can be done from the Linux command line.
Answer 1
How about simply using wget?
$ wget http://picasaweb.google.com 2>&1 | grep Location:
Location: /home [following]
Location: https://www.google.com/accounts/ServiceLogin?hl=en_US&continue=https%3A%2F%2Fpicasaweb.google.com%2Flh%2Flogin%3Fcontinue%3Dhttps%253A%252F%252Fpicasaweb.google.com%252Fhome&service=lh2&ltmpl=gp&passive=true [following]
Location: https://accounts.google.com/ServiceLogin?hl=en_US&continue=https%3A%2F%2Fpicasaweb.google.com%2Flh%2Flogin%3Fcontinue%3Dhttps%3A%2F%2Fpicasaweb.google.com%2Fhome&service=lh2&ltmpl=gp&passive=true [following]
curl -v also shows some information, but it doesn't look as useful as wget's:
$ curl -v -L http://picasaweb.google.com 2>&1 | egrep "^> (Host:|GET)"
> GET / HTTP/1.1
> Host: picasaweb.google.com
> GET /home HTTP/1.1
> Host: picasaweb.google.com
> GET /accounts/ServiceLogin?hl=en_US&continue=https%3A%2F%2Fpicasaweb.google.com%2Flh%2Flogin%3Fcontinue%3Dhttps%253A%252F%252Fpicasaweb.google.com%252Fhome&service=lh2&ltmpl=gp&passive=true HTTP/1.1
> Host: www.google.com
> GET /ServiceLogin?hl=en_US&continue=https%3A%2F%2Fpicasaweb.google.com%2Flh%2Flogin%3Fcontinue%3Dhttps%253A%252F%252Fpicasaweb.google.com%252Fhome&service=lh2&ltmpl=gp&passive=true HTTP/1.1
> Host: accounts.google.com
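If you'd rather see each hop as a single host-plus-path string, the same curl -v output can be condensed with awk (a sketch; the scheme is not present in these request lines, so only host and path are printed):
$ curl -v -L http://picasaweb.google.com 2>&1 \
  | awk '/^> GET /{path=$3} /^> Host:/{print $3 path}'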
Answer 2
Correct curl-based solution
url=https://rb.gy/x7cg8r
while redirect_url=$(
    curl -I -s -S -f -w "%{redirect_url}\n" -o /dev/null "$url"
); do
    echo "$url"
    url=$redirect_url
    [[ -z "$url" ]] && break
done
Result:
https://rb.gy/x7cg8r
https://t.co/BAvVoPyqNr
https://unix.stackexchange.com/
It is 12% faster than my wget-based solution.
Benchmark details
cd "$(mktemp -d)"
cat <<'EOF' >curl-based-solution
#!/bin/bash
url=https://rb.gy/x7cg8r
while redirect_url=$(
    curl -I -s -S -f -w "%{redirect_url}\n" -o /dev/null "$url"
); do
    echo "$url"
    url=$redirect_url
    [[ -z "$url" ]] && break
done
EOF
chmod +x curl-based-solution
cat <<'EOF' >wget-based-solution
#!/bin/bash
url=https://rb.gy/x7cg8r
wget -S --spider "$url" 2>&1 \
| grep -oP '^--[[:digit:]: -]{19}-- \K.*'
EOF
chmod +x wget-based-solution
hyperfine --warmup 5 ./wget-based-solution ./curl-based-solution
$ hyperfine --warmup 5 ./wget-based-solution ./curl-based-solution
Benchmark #1: ./wget-based-solution
Time (mean ± σ): 1.397 s ± 0.025 s [User: 90.3 ms, System: 19.7 ms]
Range (min … max): 1.365 s … 1.456 s 10 runs
Benchmark #2: ./curl-based-solution
Time (mean ± σ): 1.250 s ± 0.015 s [User: 72.4 ms, System: 23.4 ms]
Range (min … max): 1.229 s … 1.277 s 10 runs
Summary
'./curl-based-solution' ran
1.12 ± 0.02 times faster than './wget-based-solution'
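If you plan to reuse the loop above, it can be wrapped in a small function with a hop limit, so that a misbehaving redirect loop cannot run forever (a sketch; the function name and the limit of 20 hops are my own additions, not part of the solution above):
show_redirect_chain() {
    # Same loop as above, but give up after at most 20 hops (assumed safety limit).
    local url=$1 hops=0
    while redirect_url=$(
        curl -I -s -S -f -w "%{redirect_url}\n" -o /dev/null "$url"
    ); do
        echo "$url"
        url=$redirect_url
        [[ -z "$url" ]] && break
        (( ++hops >= 20 )) && break
    done
}
show_redirect_chain https://rb.gy/x7cg8r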
Answer 3
To show all of the URLs in the redirect chain, including the first one:
wget -S --spider https://rb.gy/x7cg8r 2>&1 \
| grep -oP '^--[[:digit:]: -]{19}-- \K.*'
Result (tested on Fedora Linux):
https://rb.gy/x7cg8r
https://t.co/BAvVoPyqNr
https://unix.stackexchange.com/
wget options used:
-S
--server-response
Print the headers sent by HTTP servers and responses sent by FTP servers.
--spider
When invoked with this option, Wget will behave as a Web spider, which
means that it will not download the pages, just check that they are there
...
Source: https://www.mankier.com/1/wget
This combination of -S and --spider causes wget to issue HEAD requests instead of GET requests.
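One way to check this locally is wget's debug output, which includes the request line of every request it sends (a sketch; -d/--debug output is verbose and its exact format may vary between builds):
$ wget -d --spider https://rb.gy/x7cg8r 2>&1 | grep -E '^(HEAD|GET) '
If the combination behaves as described, the matching lines should all begin with HEAD rather than GET.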
GNU grep options used:
-o
--only-matching
Print only the matched (non-empty) parts of a matching line, with each such
part on a separate output line.
-P
--perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
Source: https://www.mankier.com/1/grep
The lines we are interested in look like this:
--2021-12-07 12:29:25-- https://rb.gy/x7cg8r
You can see that the timestamp consists of 19 characters comprising digits, hyphens, colons and spaces. It is therefore matched by [[:digit:]: -]{19}, where we use the fixed quantifier {19}.
The \K resets the start of the reported match, so the timestamp prefix is discarded from the output.
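You can check the pattern in isolation by feeding it the sample line from above:
$ echo '--2021-12-07 12:29:25-- https://rb.gy/x7cg8r' | grep -oP '^--[[:digit:]: -]{19}-- \K.*'
https://rb.gy/x7cg8r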
Replacing grep with sed
If you prefer, the grep stage of the pipeline can be replaced with sed:
wget -S --spider https://rb.gy/x7cg8r 2>&1 \
| sed -En 's/^--[[:digit:]: -]{19}-- (.*)/\1/p'
Comparison with the curl-based solution
The curl-based solution omits the first URL of the redirect chain:
$ curl -v -L https://rb.gy/x7cg8r 2>&1 | grep -i "^< location:"
< Location: https://t.co/BAvVoPyqNr
< location: https://unix.stackexchange.com/
In addition, the number of bytes sent to the second stage of the pipeline is 4354.99% greater:
$ wget -S --spider https://rb.gy/x7cg8r 2>&1 | wc -c
2728
$ curl -v -L https://rb.gy/x7cg8r 2>&1 | wc -c
121532
$ awk 'BEGIN {printf "%.2f\n", (121532-2728)/2728*100}'
4354.99
In my benchmarking, the wget solution was slightly (4%) faster than the curl-based solution.
Update: See my curl-based answer for the fastest solution.
Answer 4
curl -v can show all of the URLs in an HTTP redirect chain:
$ curl -v -L https://go.usa.gov/W3H 2>&1 | grep -i "^< location:"
< location: http://hurricanes.gov/nhc_storms.shtml
< Location: https://www.hurricanes.gov/nhc_storms.shtml
< location: https://www.nhc.noaa.gov:443/nhc_storms.shtml
< location: http://www.nhc.noaa.gov/cyclones
< Location: https://www.nhc.noaa.gov/cyclones
< location: http://www.nhc.noaa.gov/cyclones/
< Location: https://www.nhc.noaa.gov/cyclones/
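If you want to avoid downloading the response bodies along the way, -I makes curl send HEAD requests instead (a sketch; some servers answer HEAD differently than GET, so the chain can occasionally differ):
$ curl -s -I -L https://go.usa.gov/W3H | grep -i '^location:'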