Extracting the text inside single quotes from an HTML file

Answer 1

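Utilities such as sed and awk are not well suited to parsing structured data such as HTML. A more robust solution is to do the same job in Python.

First, make sure Beautiful Soup is installed: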

sudo apt-get install python3 python3-bs4

Now create a new file (for example test.py) and paste in this short script, which I wrote for the purpose:

#!/usr/bin/env python3
import sys
from bs4 import BeautifulSoup

DOMAIN = 'z.z.com/'

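# Require one argument that names an .html file
if len(sys.argv) < 2 or not sys.argv[1].endswith('.html'):
    print("Argument not provided or not .html file", file=sys.stderr)
    sys.exit(1)

# Read the whole page, decoding it as latin-1
with open(sys.argv[1], 'r', encoding='latin-1') as f:
    webpage = f.read()

# Print every link, turning the relative "../" prefix into an absolute URL
soup = BeautifulSoup(webpage, "lxml")
for a in soup.find_all('a', href=True):
    print(a['href'].replace("../", "http://" + DOMAIN))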

As requested, here is a Python 2 version:

#!/usr/bin/env python2
import sys
from bs4 import BeautifulSoup

DOMAIN = 'z.z.com/'

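# Require one argument that names an .html file
if len(sys.argv) < 2 or not sys.argv[1].endswith('.html'):
    print >> sys.stderr, "Argument not provided or not .html file"
    sys.exit(1)

# Read the raw bytes and decode them as latin-1
with open(sys.argv[1], 'rb') as f:
    webpage = f.read().decode("latin-1")

# Print every link, turning the relative "../" prefix into an absolute URL
soup = BeautifulSoup(webpage, "html.parser")
for a in soup.find_all('a', href=True):
    print(a['href'].replace("../", "http://" + DOMAIN))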

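Change the DOMAIN variable to match your actual domain, save the script in the current directory, make it executable (chmod +x test.py), and run it as follows: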

./test.py yourfile.html > outputfile

For reference, this is the output the script produces when run against the sample provided in the question:

http://z.z.com/path/path/path/path/path.html
http://z.z.com/path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf
http://z.z.com/path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc
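The string replacement in the scripts above assumes that every link of interest starts with ../ relative to the site root. As a rough alternative sketch, the standard library's urllib.parse.urljoin can resolve relative links against a base URL instead. BASE_URL, LinkCollector, and extract_links are names made up for this illustration, and html.parser stands in for Beautiful Soup only to keep the example free of third-party packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://z.z.com/"  # assumption: the same placeholder domain as above


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves "../" segments instead of blindly replacing them
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url=BASE_URL):
    """Return all anchor targets in html_text as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


if __name__ == "__main__":
    sample = "<a href='../path/path.html'>one</a> <a href='docs/a.pdf'>two</a>"
    for link in extract_links(sample):
        print(link)
```

Unlike the replace() call, this also handles links that do not start with ../, which may or may not be what you want for your file.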

Answer 2

Here is another solution that uses a proper HTML parser, this time in Perl (saved as, for example, get-links.pl):

#!/usr/bin/env perl

use strict;
use warnings;
use File::Spec;
use WWW::Mechanize;

my $filename = shift or die "Must supply a *.html file\n";
my $absolute_filename = File::Spec->rel2abs($filename);

my $mech = WWW::Mechanize->new();
$mech->get( "file://$absolute_filename" );
my @links = $mech->links();
foreach my $link ( @links ) {
    my $new_link = $link->url;

    if ( $new_link =~ s(^\.\./)(http://z.z.com/) ) {
        print "$new_link\n";
    }
}

You may need to install the WWW::Mechanize module first, because it is not a core module (meaning it is not installed with Perl by default). To do so, run:

sudo apt install libwww-mechanize-perl

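The script takes the given file name and converts it to an absolute path (because we want to build a proper URI such as file:///path/to/source.html).

After extracting the links (my @links = $mech->links();), it checks each link's URL and, if it starts with ../, replaces that part with http://z.z.com/ and prints the result.

Usage:

./get-links.pl source.html

Output: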

http://z.z.com/path/path/path/path/path.html
http://z.z.com/path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf
http://z.z.com/path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc

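As @Amith KK already said in his answer: HTML (or XML) is best parsed with a proper parser, because sed and similar tools can fail when the source contains other elements that look like a link but are not.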
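To make that failure mode concrete, here is a small sketch comparing a regular expression with a parser on markup that contains a commented-out link. The sample markup and the helper names (HrefCollector, hrefs_by_regex, hrefs_by_parser) are invented for this illustration; the parser is Python's standard html.parser.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one real link plus a comment that merely looks like one.
SAMPLE = """
<!-- old link: <a href='../obsolete/page.html'>gone</a> -->
<a href='../path/real.html'>real</a>
"""


class HrefCollector(HTMLParser):
    """Record the href attribute of every real <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def hrefs_by_regex(text):
    """Text-based matching: picks up anything shaped like href='...'."""
    return re.findall(r"href='([^']*)'", text)


def hrefs_by_parser(text):
    """Parser-based matching: comments are not reported as tags."""
    collector = HrefCollector()
    collector.feed(text)
    return collector.hrefs


if __name__ == "__main__":
    print("regex: ", hrefs_by_regex(SAMPLE))
    print("parser:", hrefs_by_parser(SAMPLE))
```

The regular expression reports the commented-out link as well, while the parser returns only the link that is actually part of the page.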

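Answer 3

To extract the data between single quotes from test.html, replace the two leading dots (..) with http:/ so that ../ becomes http://, and save the extracted data to newfile.txt, run the following: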

cat test.html | sed -ne 's/^.*'\''\([^'\'']*\)'\''.*$/\1/p' | sed -e 's/\.\./http:\//g' > newfile.txt

Or try it without sed:

cat test.html | grep -Eo "'[^'() ]+'" | tr -d \'\" | perl -pe 's/^\.\./http:\//' > newfile.txt

This one works for the sample file the author added to the question:

cat test.html | grep -Eo "'[^|'() ]+'" | grep -wE "('..)" | tr -d \'\" | perl -pe 's/^\.\./http:\/\/mysite.mydomain.com/' > newfile.txt
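For comparison, roughly the same text-based extraction can be sketched in Python. This is not a parser: like the pipelines above, it just looks for single-quoted strings that begin with two dots. PREFIX reuses the placeholder domain from the last command, and quoted_links is a name chosen for this example.

```python
import re

PREFIX = "http://mysite.mydomain.com"  # placeholder domain from the command above

# A single-quoted string that starts with ".." and contains no quote,
# whitespace, parenthesis or pipe character (mirrors the grep patterns above).
QUOTED_RELATIVE = re.compile(r"'(\.\.[^'\s()|]*)'")


def quoted_links(text, prefix=PREFIX):
    """Return every quoted '../...' string with the leading '..' replaced by prefix."""
    return [prefix + match[2:] for match in QUOTED_RELATIVE.findall(text)]


if __name__ == "__main__":
    sample = "<a href='../a/b.html'>it's here</a> <span class='x'>"
    for link in quoted_links(sample):
        print(link)
```

Because it matches text rather than structure, this shares the limitation of the sed and grep versions: anything in the file that happens to look like a quoted relative path will be reported.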

Answer 4

Converting HTML to text

As mentioned in the comments, you need to convert the HTML to text. For that, here is a one-liner that should cover all the bases:

sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'

If you need to convert hundreds of thousands of lines, bash builtins are many times faster:

#-------------------------------------------------------------------------------
LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
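If Python is available, the same decoding can be sketched with the standard library's html.unescape, which knows all named and numeric character references rather than a hand-picked list. The function name html_to_text is made up for this example, and turning the non-breaking space into a plain space is a choice made here to match the sed and bash versions above.

```python
import html


def html_to_text(line):
    """Decode HTML character references in one line of text.

    html.unescape turns &nbsp; into U+00A0, so that character is mapped
    to a plain space afterwards, as the sed and bash versions do.
    """
    return html.unescape(line).replace("\u00a0", " ")


if __name__ == "__main__":
    print(html_to_text("Tom&nbsp;&amp;&nbsp;Jerry &lt;b&gt; &quot;hi&quot; &#39;x&#39;"))
```

One difference to be aware of: &ldquo; and &rdquo; are decoded to the real curly quotation marks here, whereas the sed and bash versions substitute straight double quotes.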

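Checking whether a file exists

To test whether a file exists, use this function:

function validate_url(){
  # Print "true" when the server answers the spider request with 200 OK
  if wget -S --spider "$1" 2>&1 | grep -q 'HTTP/1.1 200 OK'; then echo "true"; fi
}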
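A comparable check can be sketched in Python with urllib.request. The function name url_exists and the 10-second default timeout are assumptions made for this example. It sends a HEAD request and treats any 2xx final status as success, which is slightly broader than grepping for the exact string HTTP/1.1 200 OK.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to url ends with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (ValueError, OSError):
        # ValueError: malformed URL. OSError covers URLError, HTTPError
        # (for example 404) and timeouts.
        return False


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        print(candidate, url_exists(candidate))
```

Some servers reject HEAD requests even for files that exist, so a False result from this sketch is a hint rather than proof that the file is missing.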

Putting it all together

The final script still needs to be written, based on sample data from a valid web page with valid file names.
