Need help extracting information from multiple HTML files

Question 1

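If your content is going to contain a <div>, then your script/code has to be smart enough to identify the matching closing </div>.
I found a PHP-based solution that can do this: PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm)

You can combine it with DirectoryIterator and file_put_contents to loop over the files and write out the extracted content.
If you want to insert the content into an HTML template, save the template with some distinctive placeholder text where the actual content should go, like this: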

<div class="new_data">
replace_me_discernible_text_not_appearing_anywhere_else_in_file
</div>

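You can then replace this text with your own content. Here is a complete script that does this (partial credit for the script goes to the original author; I am pasting it here for future reference):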

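<?php
// Requires simple_html_dom.php (PHP Simple HTML DOM Parser) next to this script
include('simple_html_dom.php');

$srcdir = "content_html";
$destdir = "extracted_html";
$oldMessage = "replace_me_discernible_text_not_appearing_anywhere_else_in_file";

// Read the template once instead of once per match
$template = file_get_contents('template.html');

foreach (new DirectoryIterator($srcdir) as $fileinfo) {
    // Skip ".", "..", subdirectories and anything else that is not a regular file
    if (!$fileinfo->isFile()) {
        continue;
    }
    $file_name = $fileinfo->getFilename();
    $html = file_get_html("$srcdir/$file_name");
    if (!$html) {
        echo $file_name . " <b>could not be parsed</b><br>";
        continue;
    }
    // Collect every div.heading so that later matches do not overwrite earlier ones
    $content = '';
    foreach ($html->find('div.heading') as $e) {
        $content .= $e->outertext;
    }
    // As in the original loop, write nothing when the file has no match
    if ($content === '') {
        continue;
    }
    file_put_contents("$destdir/$file_name", str_replace($oldMessage, $content, $template));
    echo $file_name . " <b>Done!</b><br>";
}
?>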

Hope this helps.
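
For anyone without PHP at hand, the placeholder substitution can also be sketched in plain shell with awk. This is only a rough illustration: fill_template is a hypothetical helper name, it assumes the placeholder sits on a line of its own (as in the template snippet above), and it replaces that whole line rather than just the placeholder substring.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PHP script:
#   fill_template TEMPLATE FRAGMENT
# Prints TEMPLATE with the placeholder line replaced by the contents of FRAGMENT.
fill_template() {
    awk -v marker="replace_me_discernible_text_not_appearing_anywhere_else_in_file" \
        -v frag="$2" '
        index($0, marker) {
            # Emit the fragment in place of the whole placeholder line
            while ((getline line < frag) > 0) print line
            close(frag)
            next
        }
        { print }
    ' "$1"
}
```

It could be called as fill_template template.html fragment.html > extracted_html/page.html, where fragment.html holds the markup that was extracted beforehand (both file names are made up for the example).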

Question 2

This is easy to do with the Windows port of PCREGREP and the following command:

for %%i in (*.html) do (
  pcregrep -N CRLF -M -o "<div class="""heading_container""">(.+?)</div>" "%%i" ^
  > "%%~ni.cpp"
)

If there are extra divs in between, you can use this line in the for loop instead to extract everything up to the clear div:

  …
  pcregrep -N CRLF -M -o "<div class="""heading_container""">(.+?)<div class="""clear""">" "%%i" ^
  …

Figure 1: Detection results

Question 3

OK, here is a simple version of what you want to do:

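#!/bin/sh
# Run from the directory that contains old/ and new/
for X in $(find ./old -name "*.html")
    do
    FN=$(echo "$X" | cut -d '/' -f 3)
    # Print only the lines from the opening div through the clear div
    awk '/^<div class="heading_container">/,/<div class="clear"><\/div>/  { print }' "$X" > "./new/$FN"
    done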

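This assumes that all your files are in a subdirectory named old under a directory named files. Run the script from the files directory and it will extract the information you want and dump it into a file of the same name in ./files/new. The new directory must already exist.

This script is fairly unsafe, and it will not work if the old directory contains subdirectories.

I may look into improving it, and will update if I can do better.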
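
To see what the awk range pattern in the script selects, here is a small sketch that can be run on its own. The helper name extract_block and the sample input are invented for illustration, and the pattern assumes the markers appear in the HTML exactly as <div class="heading_container"> and <div class="clear"></div>.

```shell
#!/bin/sh
# Hypothetical helper: reads HTML on stdin and prints only the lines from the
# opening heading_container div through the clear div, both inclusive.
extract_block() {
    awk '/<div class="heading_container">/,/<div class="clear"><\/div>/'
}

# Sample input made up for this example
printf '%s\n' \
    '<body>' \
    '<div class="heading_container">' \
    '<h1>Title</h1>' \
    '<div class="clear"></div>' \
    '</div>' \
    '<p>footer</p>' | extract_block
```

On this input it prints the three lines from the opening div to the clear div. The range is line-based and stops at the clear div, so the closing </div> of the container itself is not part of the output.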

Update

Although I have been told that the target is Windows, here is a more complete bash script that may help someone in the future.

#!/bin/sh

cd old || exit 1

# Create the directory structure in the 'new' directory

for Z in $(find ./ -type d)
        do
        # Strip only the leading '.', so directory names that contain dots survive
        mkdir "../new/${Z#.}"
        done
cd ..
# Find all relevant files, snip the interesting bit and copy it to the same path in ./new

for X in $(find ./old -name "*.html")
        do
        FN=$(echo "$X" | cut -d '/' -f 3-)
        awk '/^<div class="heading_container">/,/<div class="clear"><\/div>/  { print }' "$X" > "./new/$FN"
        done

The main caveat is to remove the 'new' directory (back it up, then delete it) before running.
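
The loops above split find's output on whitespace, so file or directory names containing spaces will break them. One possible way around that, sketched here under assumptions, is to let find hand the names to a small inner script. extract_tree is a hypothetical helper name; it assumes SRC is given without a trailing slash, that DEST lies outside SRC, and that the markers appear exactly as <div class="heading_container"> and <div class="clear"></div>.

```shell
#!/bin/sh
# Hypothetical helper, not from the original answer:
#   extract_tree SRC DEST
# Mirrors every *.html file under SRC into DEST, keeping only the lines from
# the heading_container div through the clear div.
extract_tree() {
    find "$1" -type f -name '*.html' -exec sh -c '
        src=$1
        dest=$2
        shift 2
        for f in "$@"; do
            rel=${f#"$src"/}                      # path relative to SRC
            mkdir -p "$dest/$(dirname "$rel")"    # create subdirectories on demand
            awk "/<div class=\"heading_container\">/,/<div class=\"clear\"><\/div>/" \
                "$f" > "$dest/$rel"
        done
    ' sh "$1" "$2" {} +
}
```

With the layout described earlier it would be called as extract_tree old new from the files directory. Because mkdir -p creates the target directories as needed, this variant does not need a separate pass to build the directory tree.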

Answer