Getting links from an HTML page

Question 1

Install lynx, then:

lynx -listonly -nonumbers -dump input.html > links.txt

Make sure your input file has an .html extension.

For example:

$ cat test.html
<a href="http://superuser.com">test</a>
http://google.com
$ lynx -listonly -nonumbers -dump test.html
http://superuser.com/

If you have a text file that lists the HTML files you need to extract links from, one path per line, you can iterate over it:

while read -r file; do
  lynx -listonly -nonumbers -dump "$file" > "${file%.*}.txt"
done < input.txt

This reads each line of the text file, uses lynx to extract the links, and writes them to a .txt file with the same base name as the HTML file that line points to.
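The output name in the loop comes from the shell parameter expansion "${file%.*}", which strips the shortest trailing ".something" from the value. A minimal sketch of that step alone, where the helper name links_file_for is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: derive the output name the same way the loop does.
# "%.*" removes the shortest suffix matching ".*", i.e. the last extension.
links_file_for() {
  printf '%s\n' "${1%.*}.txt"
}

links_file_for "pages/index.html"    # prints pages/index.txt
links_file_for "archive.tar.html"    # prints archive.tar.txt
```

Note that a path whose only dot is in a directory name (for example my.dir/file) would be cut at that dot, so this works best when every listed file really has an extension.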

Question 2

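Split the problem into two parts.

Assume the target pages do not require a login or credentials.

On a Linux or Unix machine, or under Cygwin on Windows, run this in a terminal session: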

wget -i your.txt

Then, for each downloaded file, run:

cat FILE | \
sed 's/href=/\nhref=/g' | \
grep href=\" | \
sed 's/.*href="//g;s/".*//g' >> out.txt
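To apply this to every downloaded file in one go, something like the following sketch might do. It swaps the newline-inserting sed step for grep -o, which prints each match on its own line, so it does not depend on sed interpreting \n in the replacement text. The function names and the assumption that the downloads end in .html are illustrative, and like the pipeline above it only catches lowercase, double-quoted href attributes; it is a regular-expression scrape, not an HTML parser.

```shell
#!/bin/sh
# Print every double-quoted href value found in one file, one per line.
extract_hrefs() {
  grep -o 'href="[^"]*"' "$1" | sed -e 's/^href="//' -e 's/"$//'
}

# Run extract_hrefs over every .html file in the given directory.
extract_all() {
  for f in "$1"/*.html; do
    [ -f "$f" ] || continue   # skip the literal pattern when nothing matches
    extract_hrefs "$f"
  done
}

# Example call (paths are placeholders):
#   extract_all . > out.txt
```

Redirecting once at the call site overwrites out.txt with a fresh list on each run, rather than appending to whatever an earlier run left behind.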

If any of these tools are missing, run

sudo apt-get install coreutils wget grep sed

on Debian Linux, although most systems ship with them by default.

If you choose to do this in a Cygwin session, remember to select Core Utilities, Wget, grep, and sed during installation.

Answer