Is there a way to grep multiple HTML pages on a domain without downloading the files?

Question

Note that although curl does download each page, it does not write it to a file; it writes it to stdout.

Method 1

curl supports fetching a numeric sequence of URLs:

curl -s 'https://exampleblog.com/posts/[1-50]' | grep 'searchterm'

Alternatively, you can use a for loop:

for i in {1..50}
do
    # -s hides the progress meter that curl prints when its output is piped
    curl -s https://exampleblog.com/posts/"$i" | grep 'searchterm'
done
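
One limitation of the loop above is that the output does not say which page a match came from. The following is a minimal sketch of one way to add that, not a definitive implementation. The function names fetch_page and grep_pages are made up for this illustration, and the URL layout (a base URL followed by a page number) is assumed from the example above.

```shell
# fetch_page URL: print the page body on stdout; nothing is written to disk.
# (hypothetical helper name)
fetch_page() {
    curl --silent --show-error --fail --location "$1"
}

# grep_pages PATTERN BASE_URL FIRST LAST (hypothetical helper name)
# The body runs in a subshell "( ... )" so its variables stay local.
grep_pages() (
    pattern=$1
    base_url=$2
    i=$3
    last=$4
    while [ "$i" -le "$last" ]; do
        url="$base_url/$i"
        # Prefix every matching line with the URL it came from.
        fetch_page "$url" | grep -- "$pattern" | while IFS= read -r line; do
            printf '%s: %s\n' "$url" "$line"
        done
        i=$((i + 1))
    done
)

# Usage, with the placeholder site from the example above:
# grep_pages 'searchterm' 'https://exampleblog.com/posts' 1 50
```

Keeping the download in its own function means the search logic can be tried out against a stub before pointing it at a real site.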

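If the URLs do not contain a sequence number, wget can crawl recursively instead. It parses each downloaded page for URLs and follows the links it finds. The --no-parent option ensures that it only downloads pages deeper in the hierarchy under the same subdirectory, in this case questions.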

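Note that wget only loads a page if one of the pages it has already downloaded links to it. A page that is only referenced from elsewhere on the site will be missed.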

wget --recursive --no-parent https://superuser.com/questions/1750443 -O ./test.out
grep 'searchterm' test.out
rm test.out
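
The wget commands above put everything into one file, so a match cannot be traced back to its page. As a rough sketch of an alternative, the pages can be mirrored into a temporary directory and searched with grep -r, which prints the file name next to each match. Like test.out, this does write to disk temporarily. The function names mirror_site and grep_site are invented for this example, and the sketch assumes mktemp -d and a grep that supports -r are available.

```shell
# mirror_site URL DIR: recursively download pages below URL into DIR.
# (hypothetical helper name)
mirror_site() {
    wget --quiet --recursive --no-parent --directory-prefix="$2" "$1"
}

# grep_site PATTERN URL (hypothetical helper name)
# Mirror into a temporary directory, search it, then clean up.
# The body runs in a subshell "( ... )" so its variables stay local.
grep_site() (
    pattern=$1
    url=$2
    tmpdir=$(mktemp -d) || exit 1
    # wget exits non-zero if any single link fails, so its status is ignored.
    mirror_site "$url" "$tmpdir" || true
    status=0
    # Search from inside the directory so paths are printed relative to it.
    (cd "$tmpdir" && grep -r -- "$pattern" .) || status=$?
    rm -rf "$tmpdir"
    exit "$status"
)

# Usage, with the URL from the example above:
# grep_site 'searchterm' 'https://superuser.com/questions/1750443'
```

The exit status follows grep: 0 if something matched, 1 if nothing did.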
