Extracting HTML data with a Perl script

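This is my code for extracting some data under the header Item Drop%. I want to extract the 90.5% value under that header, but I can only extract the whole column rather than just that one value. Any ideas?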

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'data.html';
unless ( -e $file ) {
    my $rc = getstore(
        'proj/Desktop/folder1/data.html',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}



my $te = HTML::TableExtract->new( headers => [qw(Item Drop%)] );

$te->parse_file($file);

my ($table) = $te->tables;

foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), ");\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

My data.html is:


 ..
 ..
 ..
<table align = "center" class="" style= .......>
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 90.5%</td>
</tr>
..
..
..
..
<tr>
<th rowspan="2">EM</th>
<th colspan="2"><a href= "proj/Desktop/folder1/data.html" class = ..../th>
<td> 40%</td>
</tr>

</table>

Answer 1

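My take is that XPath is, in most cases, the better approach to HTML scraping in any language, and it is not limited to tables. Perl's HTML::TreeBuilder::XPath is a must-have and makes it easy to get your value. For example: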

#!/usr/bin/env perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("./data.html");
print [$tree->findvalues('//table//td[contains(text(), "%")]')]->[0];

Output

90.5%
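
As a variation on the snippet above, here is a minimal sketch that keeps only the first matching cell and trims the whitespace around it. The inline HTML is a hypothetical, simplified stand-in for data.html (the real markup is elided in the question), and the variable names are illustrative only. It assumes HTML::TreeBuilder::XPath is installed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical, simplified stand-in for data.html; the real file is elided.
my $html = <<'HTML';
<table>
<tr><th>EM</th><td> 90.5%</td></tr>
<tr><th>EM</th><td> 40%</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# In list context findvalues returns every match; the list assignment
# keeps only the first one and discards the rest.
my ($first) = $tree->findvalues('//table//td[contains(text(), "%")]');
die "No matching cell found\n" unless defined $first;

# Trim any whitespace around the cell text.
$first =~ s/^\s+|\s+$//g;
print "$first\n";

# Release the parse tree when done.
$tree->delete;
```

Assigning to a one-element list is just an alternative to indexing an anonymous array with ->[0]; the explicit check also gives a clear error when no cell matches instead of printing an undefined value.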
