How to get all the paths of a website using cURL

Question 1

I used this approach and it somehow worked:

$ wget --spider --recursive https://www.inlanefreight.com

This displays:

Found 10 broken links.

https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.svg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/testimonial-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/css/grabbing.png
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff2
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/subscriber-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot?
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/fun-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.ttf

FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)

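That appears at the bottom of the output. Assuming the 23 downloaded files and the 10 broken links together make up the unique paths, I get 33, which is the correct answer.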
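
The arithmetic behind that count can be checked with a small shell sketch. It assumes the two summary lines from the wget output above have been pasted into a variable (`log` is just an illustrative name) and that the broken links and the downloaded files do not overlap, which the answer itself only assumes.

```shell
# Summary lines copied from the wget --spider output shown above.
log='Found 10 broken links.
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)'

# Pull the number that follows each label.
broken=$(printf '%s\n' "$log" | grep -oE 'Found [0-9]+' | cut -d' ' -f2)
downloaded=$(printf '%s\n' "$log" | grep -oE 'Downloaded: [0-9]+' | cut -d' ' -f2)

# Assumption: broken links and downloaded files are distinct paths.
total=$((broken + downloaded))
echo "$total"
```

If the two sets did overlap, the sum would overcount, so this is a plausibility check rather than a proof.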

Question 2

This is what I came up with:

 curl -s https://www.inlanefreight.com/ | grep -Po 'https://www\.inlanefreight\.com/\K[^"\x27]+' | sort -u | wc -l

I don't know whether it was meant to be solved with a regular expression.
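
To see what that pipeline does without touching the network, here is a minimal sketch run against a made-up HTML snippet. The sample markup and the variable names are assumptions for illustration only. It also swaps `grep -P` with `\K` for `grep -oE` followed by `cut`, in case the local grep was built without PCRE support. The idea is the same: match every quoted URL on the site, drop the scheme and host, then count the distinct remainders.

```shell
# Hypothetical sample HTML standing in for the real page.
html=$(cat <<'EOF'
<link rel="stylesheet" href="https://www.inlanefreight.com/wp-content/style.css">
<a href="https://www.inlanefreight.com/index.php/about/">About</a>
<a href='https://www.inlanefreight.com/index.php/about/'>About again</a>
<img src="https://www.inlanefreight.com/wp-content/logo.png">
EOF
)

# Match each URL up to the closing quote, keep only the path
# (field 4 onward when splitting on "/"), then count unique paths.
count=$(printf '%s\n' "$html" \
  | grep -oE "https://www\.inlanefreight\.com/[^\"']+" \
  | cut -d/ -f4- \
  | sort -u \
  | wc -l \
  | tr -d ' ')
echo "$count"
```

The `about/` link appears twice in the sample, once with each quote style, and is counted once.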

Question 3

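Using only cURL and these filter tools: grep, tr, sort, cut, and wc, plus one additional tool, uniq. My result was incorrect (34); 33 is correct. I was still not sure which path was the duplicate. :(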

curl https://www.inlanefreight.com --insecure > ilf

cat ilf | grep "https://www.inlanefreight.com" > ilf.1

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | uniq -c > ilf.2

cat ilf.2 | wc -l

$> 34

I suspect this is the source of the duplicate (these lines are from cat ilf.2):

<snip>
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&#038;format=xml
<snip>

Fixing this by cutting at the "?":

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | cut -d"?" -f1 | uniq -c | wc -l
$> 33

The correct answer is 33.
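
The effect of cutting at the "?" can be reproduced offline with a short sketch. The two oEmbed URLs are the ones quoted above; the third URL and the variable names are hypothetical. Unlike the pipeline above, this sketch cuts before sorting: `uniq` only merges adjacent lines, so truncating after the sort could in principle leave equal paths separated.

```shell
# Two oEmbed links from the output above plus one hypothetical extra URL.
urls=$(cat <<'EOF'
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&#038;format=xml
https://www.inlanefreight.com/index.php/contact/
EOF
)

# Counting full URLs treats the two oEmbed links as different.
with_query=$(printf '%s\n' "$urls" | sort | uniq | wc -l | tr -d ' ')

# Dropping everything from "?" onward leaves one shared path.
without_query=$(printf '%s\n' "$urls" | cut -d'?' -f1 | sort | uniq | wc -l | tr -d ' ')

echo "$with_query $without_query"
```

On this sample the count drops by one once the query strings are removed, the same off-by-one seen between 34 and 33.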

Question 4

TL;DR: You can't.

From the wget man page:

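"-p  This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets."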

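This is a feature of wget. curl is a piece of software (and a library) that, put simply, executes a single HTTP request. wget has some features, such as downloading an entire website, that require it to interpret the content it fetches. That worked in the Web 1.0 era, but the feature is no longer very useful, because websites now load additional files via JavaScript, and wget will not even pick those up. The site https://www.inlanefreight.com is a WordPress site with a theme from https://themeansar.com/, so you could buy the theme there, work out how it behaves, write a script, and hope you got it right.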

But come on, https://www.inlanefreight.com has 6 pages and one PDF file; you can count them by clicking through, which is faster than the time I needed to find out that it runs WordPress.
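
The JavaScript limitation can be illustrated with a small offline sketch. The markup below is entirely made up, including the `loadScript` helper and the file name: a URL that a script assembles at run time never appears as a complete string in the HTML, so a static text filter cannot find it.

```shell
# Hypothetical page: one plain link and one URL built in JavaScript.
page=$(cat <<'EOF'
<a href="https://www.inlanefreight.com/index.php/contact/">Contact</a>
<script>
  var base = "https://www.inlanefreight.com";
  loadScript(base + "/wp-content/late-loaded.js");
</script>
EOF
)

# Static extraction only sees URLs written out in full in the markup.
static_links=$(printf '%s\n' "$page" | grep -oE "https://www\.inlanefreight\.com/[^\"']+")
static_count=$(printf '%s\n' "$static_links" | wc -l | tr -d ' ')
echo "$static_links"
```

Only the contact link is reported. The script file would be requested by a browser but is invisible to this kind of filtering.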
