How to get all the paths of a website using cURL

curl //website// gives me the source code, but from there, how would I filter out each unique path and count them?

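The question:

Use cURL from your machine to obtain the source code of the "https://www.inlanefreight.com" website and filter all unique paths of that domain. Submit the number of these paths as the answer.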

From the question, I can't tell what "UNIQUE PATHS" means, but I think it means something like what you get from running $ wget -p


I used this method and it somehow worked:

wget --spider --recursive https://www.inlanefreight.com

This prints the following at the bottom of its output:

Found 10 broken links.

https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.svg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/testimonial-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/css/grabbing.png
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff2
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/subscriber-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot?
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/fun-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.ttf

FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)

Assuming the 23 downloaded files and the 10 broken links together make up the unique paths, I get 33, which is the correct answer.

Answer 1

I used this method and it somehow worked:

$ wget --spider --recursive https://www.inlanefreight.com

This prints the following at the bottom of its output:

Found 10 broken links.

https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.svg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/testimonial-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/css/grabbing.png
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff2
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/subscriber-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot?
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/fun-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.ttf

FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)

Now, assuming the 23 downloaded files and the 10 broken links together make up the unique paths, I get 33, which is the correct answer.
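The addition can be checked against the quoted summary itself. This is only a sketch: the `wget_summary` helper is a hypothetical stand-in that replays the lines shown above so the example runs offline. On a real run wget writes this log to stderr, so it would have to be captured first (for example with `2>&1`).

```shell
#!/bin/sh
# Hypothetical helper: replays the summary lines quoted above (no network access).
wget_summary() {
cat <<'EOF'
Found 10 broken links.
FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)
EOF
}

# Pull the two counters out of the summary text.
broken=$(wget_summary | sed -n 's/^Found \([0-9][0-9]*\) broken links*\.$/\1/p')
downloaded=$(wget_summary | sed -n 's/^Downloaded: \([0-9][0-9]*\) files*,.*/\1/p')

# Sum them, following the assumption made in this answer.
echo "$((broken + downloaded))"
```

Whether "downloaded plus broken" really equals the number of unique paths is this answer's assumption, not something wget reports directly.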

Answer 2

This is what I came up with:

curl https://www.inlanefreight.com/ | grep -Po 'https://www\.inlanefreight\.com/\K[^"\x27]+' | sort -u | wc -l

I don't know whether it was meant to be solved with a regular expression.
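To see what each stage of that pipeline does without touching the network, here is a minimal sketch that applies the same idea to a small made-up HTML fragment. The `sample_html` and `unique_paths` names and the markup are assumptions for illustration, not the real page source. It swaps `grep -P` with `\K` for `grep -oE` plus `sed`, so it also runs where grep has no PCRE support.

```shell
#!/bin/sh
# Made-up sample markup (assumption, not the real page source).
sample_html() {
cat <<'EOF'
<link href="https://www.inlanefreight.com/wp-content/a.css">
<a href="https://www.inlanefreight.com/index.php/contact">Contact</a>
<script src='https://www.inlanefreight.com/wp-content/a.css'></script>
<a href="https://example.org/elsewhere">other</a>
EOF
}

# Hypothetical helper: read HTML on stdin, print each unique path once.
unique_paths() {
    # 1. Keep only the URLs on the target domain (a quote character ends a match).
    # 2. Strip the scheme and host so only the path remains.
    # 3. Sort and drop duplicates.
    grep -oE "https://www\.inlanefreight\.com/[^\"']+" \
        | sed 's|^https://www\.inlanefreight\.com/||' \
        | sort -u
}

sample_html | unique_paths                         # the unique paths
sample_html | unique_paths | wc -l | tr -d ' '     # how many there are
```

On the real site, the `sample_html` call would be replaced by the `curl` invocation from the answer.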

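Answer 3

Using only cURL and these filtering tools: grep, tr, sort, cut, and wc, plus one additional tool, uniq. My result was incorrect (34); 33 is correct. I was still not sure which path was the duplicate. :(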

curl https://www.inlanefreight.com --insecure > ilf

cat ilf | grep "https://www.inlanefreight.com" > ilf.1

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | uniq -c > ilf.2

cat ilf.2 | wc -l

$> 34

I suspect this is where the duplicate comes from (these lines from cat ilf.2):

<snip>
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&#038;format=xml
<snip>

Fix this by cutting at the "?":

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | cut -d"?" -f1 | uniq -c | wc -l
$> 33

The correct answer is 33.
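The effect of cutting at "?" can be checked on a small list. In this sketch the first two URLs are the ones quoted above, the third is a made-up extra entry, and the `urls` helper is hypothetical. It uses `sort -u` after the `cut`, so entries that become identical are merged even if they are not adjacent.

```shell
#!/bin/sh
# Hypothetical helper: two URLs from the output above plus one made-up entry.
urls() {
cat <<'EOF'
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&#038;format=xml
https://www.inlanefreight.com/index.php/contact
EOF
}

# Counting whole URLs treats the two embed links as different entries.
with_query=$(urls | sort -u | wc -l | tr -d ' ')

# Dropping everything from the first "?" makes them identical, so sort -u merges them.
without_query=$(urls | cut -d'?' -f1 | sort -u | wc -l | tr -d ' ')

echo "with query strings: $with_query"
echo "without query strings: $without_query"
```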

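Answer 4

TL;DR: you can't.

From the wget man page:

"-p This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets."

This is a feature of wget. curl is a piece of software (and a library) that, put simply, executes a single HTTP request. wget has some extra features, such as downloading a whole website, that require it to interpret the content it fetches. While this worked in the Web 1.0 era, the feature is no longer very useful, because websites load additional files through JavaScript, and wget will not even find those. The website https://www.inlanefreight.com is a WordPress site with a theme from https://themeansar.com/, so you could buy the theme there, work out how it behaves, write a script, and hope you got it right.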

But come on, https://www.inlanefreight.com has 6 pages and one PDF file; you can count them by clicking through, which is faster than the time I needed to find out that it runs WordPress.
