How can I extract all external links from a web page and save them to a file?

Question 1

You need two tools, lynx and awk. Try this:

$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' > links.txt

If you need numbered lines, use the nl command. Try this:

$ lynx -dump http://www.google.com.br | awk '/http/{print $2}' | nl > links.txt

Question 2

This is an improvement on lelton's answer: you don't need awk at all, because lynx has some useful options.

lynx -listonly -nonumbers -dump http://www.google.com.br

If you want the links numbered:

lynx -listonly -dump http://www.google.com.br

Question 3

As discussed in the other answers, Lynx is a good option, but there are many alternatives in nearly every programming language and environment.

Another option is xmllint. Example usage:

$ curl -sS "https://superuser.com" \
| xmllint --html --xpath '//a[starts-with(@href, "http")]/@href' 2>/dev/null - \
| sed 's/^ href="\|"$//g' \
| tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing

Additionally, Perl offers HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;
use LWP::Simple;

sub start {
    my $href = shift->{href};
    print "$href\n" if $href && $href =~ /^https?:\/\//;
}

my $url = shift @ARGV or die "No argument URL provided";
my $parser = HTML::Parser->new(api_version => 3, start_h => [\&start, "attr"]);
$parser->report_tags(["a"]);
$parser->parse(get($url) or die "Failed to GET $url");

Example usage (including writing to a file, as the OP requested; usage is the same for any script here that has a shebang):

$ ./scrape_links https://superuser.com > links.txt \
&& cat links.txt | tail -3
https://linkedin.com/company/stack-overflow
https://www.instagram.com/thestackoverflow
https://stackoverflow.com/help/licensing

Ruby has the Nokogiri gem:

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://superuser.com'))

doc.xpath('//a[starts-with(@href, "http")]/@href').each do |link|
  puts link.content
end

NodeJS has Cheerio:

const axios = require("axios");
const cheerio = require("cheerio");

(async () => {
  const $ = cheerio.load((await axios.get("https://superuser.com")).data);
  // Select only anchors whose href starts with "http", as the other examples do
  $("a[href^='http']").each((i, e) => console.log($(e).attr("href")));
})();

Python's BeautifulSoup has not yet appeared in this thread:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://superuser.com").text, "lxml")

for x in soup.find_all("a", href=True):
    if x["href"].startswith("http"):
        print(x["href"])
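
The examples above treat any href that starts with "http" as external, which also lets through absolute links back to the same site. If "external" should mean "points to a different host", a minimal sketch using only the Python standard library might look like the following. The names `LinkCollector`, `external_links` and `save_external_links` are made up for illustration, and comparing hostnames exactly is an assumption: under it, a subdomain such as meta.superuser.com counts as external to superuser.com.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/questions" into absolute URLs
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return http(s) links whose host differs from the host of base_url."""
    page_host = urlsplit(base_url).hostname
    collector = LinkCollector(base_url)
    collector.feed(html)
    result = []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme in ("http", "https") and parts.hostname != page_host:
            result.append(link)
    return result


def save_external_links(url, path="links.txt"):
    """Fetch url and write its external links to path, one per line."""
    with urlopen(url) as response:
        page = response.read().decode("utf-8", errors="replace")
    with open(path, "w") as out:
        for link in external_links(page, url):
            out.write(link + "\n")
```

Calling `save_external_links("https://superuser.com")` would then write the result to links.txt. Some sites reject the default urllib user agent, in which case the fetch step would need a custom header.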

Answer