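curl //website//
will give me the source code, but from there how would I filter out every unique path and obtain the number of them?
Question:
Use cURL from your machine to obtain the source code of the "https://www.inlanefreight.com" website and filter all unique paths of that domain. Submit the number of these paths as the answer.
From the question, I don't know what "UNIQUE PATHS" means, but I think it means something similar to what you get from running
$ wget -p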
Answer 1
I used this method and it somehow worked:
$ wget --spider --recursive https://www.inlanefreight.com
This prints the following at the bottom of its output:
Found 10 broken links.
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.svg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/testimonial-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/css/grabbing.png
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff2
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/subscriber-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot?
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/fun-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.ttf
FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)
Assuming that the 23 downloaded files and the 10 broken links together make up the unique paths, I get 33, which is the correct answer.
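The arithmetic above can be sketched as a small script. This is only an illustration: it reads the two summary lines quoted in this answer from a variable (named `log` here, an assumption) instead of running wget, and adding the two numbers is this answer's guess at what "unique paths" means, not something wget itself reports.

```shell
#!/usr/bin/env bash
# The two summary lines from the wget --spider output quoted above.
log='Found 10 broken links.
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)'

# Pull the integer out of each summary line.
broken=$(printf '%s\n' "$log" | sed -n 's/^Found \([0-9][0-9]*\) broken links\.$/\1/p')
downloaded=$(printf '%s\n' "$log" | sed -n 's/^Downloaded: \([0-9][0-9]*\) files.*/\1/p')

# Assumption from this answer: broken links + downloaded files = unique paths.
total=$((broken + downloaded))
echo "$total"
```

Against a live run, the same two `sed` expressions could be applied to wget's saved output, but the counts may differ from the 2020 run shown here.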
Answer 2
This is what I came up with:
curl https://www.inlanefreight.com/ | grep -Po 'https://www.inlanefreight.com/\K[^"\x27]+' | sort -u | wc -l
I don't know whether it was intended to be solved with a regular expression.
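To see what that pipeline does without touching the network, here is a minimal sketch run against a made-up HTML snippet. The snippet and the function name `count_unique_paths` are assumptions for illustration, and `grep -oE` plus `sed` is swapped in for `grep -Po ... \K` so it also works where grep has no PCRE support.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the page source; not the real page.
html=$(cat <<'EOF'
<link rel="stylesheet" href="https://www.inlanefreight.com/wp-content/style.css">
<a href="https://www.inlanefreight.com/index.php/contact/">Contact</a>
<a href='https://www.inlanefreight.com/index.php/contact/'>Contact again</a>
<script src="https://www.inlanefreight.com/wp-includes/js/app.js"></script>
<a href="https://example.org/elsewhere">External</a>
EOF
)

# Read HTML on stdin, print the number of distinct paths under the domain.
count_unique_paths() {
  grep -oE "https://www\.inlanefreight\.com/[^\"']+" \
    | sed 's|^https://www\.inlanefreight\.com/||' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}

count=$(printf '%s\n' "$html" | count_unique_paths)
echo "$count"
```

The contact link appears twice (once in double quotes, once in single quotes) but is counted once, and the external link is ignored. As in the original one-liner, the bare site root is not counted because at least one character must follow the trailing slash.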
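Answer 3
Using only cURL and these filtering tools: grep, tr, sort, cut, and wc, plus one additional tool, uniq. My first result was wrong (34; 33 is correct), and at first I was not sure which path was duplicated. :(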
curl https://www.inlanefreight.com --insecure > ilf
cat ilf | grep "https://www.inlanefreight.com" > ilf.1
cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | uniq -c > ilf.2
cat ilf.2 | wc -l
$> 34
I suspect this is the source of the duplicate (these lines from cat ilf.2):
<snip>
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&format=xml
<snip>
Fixing this by cutting at the "?":
cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | cut -d"?" -f1 | uniq -c | wc -l
$> 33
The correct answer is 33.
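The effect of cutting at the "?" can be shown in isolation. This sketch uses the two oembed URLs from the output above plus one extra URL that is purely hypothetical, and it deduplicates with `sort -u` after the cut instead of `uniq -c`, since `uniq` only merges adjacent lines and the cut can in principle change which lines end up adjacent.

```shell
#!/usr/bin/env bash
# First two URLs are from the ilf.2 excerpt above; the third is a hypothetical extra.
urls=$(cat <<'EOF'
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&format=xml
https://www.inlanefreight.com/index.php/wp-json/
EOF
)

# Counting whole URLs treats the two query-string variants as different.
before=$(printf '%s\n' "$urls" | sort -u | wc -l | tr -d ' ')

# Dropping everything from "?" onward leaves one path for both variants.
after=$(printf '%s\n' "$urls" | cut -d'?' -f1 | sort -u | wc -l | tr -d ' ')

echo "before: $before, after: $after"
```

Lines without a "?" pass through `cut` unchanged, so only the query-string variants collapse.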
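Answer 4
TL;DR: You can't.
From the wget man page:
"-p This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets."
This is a feature of wget. curl is a piece of software (and a library) that, simplified, executes a single HTTP request. wget has some extra features, such as downloading an entire website along with the resources it needs by interpreting the content.
While this worked in the Web 1.0 era, the feature is no longer very useful, because websites load additional files through JavaScript, and those files are not even processed by wget.
The website https://www.inlanefreight.com is a WordPress site with a theme from https://themeansar.com/, so you could buy the theme there, interpret it, write a script, and hope you got it right.
But come on, https://www.inlanefreight.com has 6 pages and one PDF file; you can count them by clicking through, which is faster than the time I needed to find out that it is WordPress.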