Converting HTML to text format

I want to extract the page content from this HTML file:

<BR />
<TABLE style=border-color:#32506d border=1 cellspacing=0>
    <caption class=header style=background-color:#32506d><b>Additional M2Ms & Standalone DataMasking List for 09 10 2020
            PST</b></caption>
    <tr style=background-color:#32506d class=header>
        <td class=CR>Start Time</td>
        <td class=CR>FM CR</td>
        <td class=CR>CR Type</td>
        <td class=CR>Customer Name</td>
        <td class=CR>Source Pod</td>
        <td class=CR>Target Pod</td>
        <td class=CR>DM Flag</td>
        <td class=CR>Release</td>
        <td class=CR>Data Center</td>
        <td class=CR>CDB Sync</td>
        <td class=CR>FreeSpace Check</td>
        <td class=CR>TDE/DV Check</td>
        <td class=CR>M2M Optin</td>
        <td class=CR>M2M Type</td>
        <td class=CR colspan=2>Database Reorg Details</td>
        <td>Operations Team</td>
    </tr>
    <tr>
        <td>09/10/2020-19:00</td>
        <td class=CR><a href=http://fleetmanager.oraclecloud.com/change/faces/registerChangeRequest?CRID=11124482
                target=_blank>11124482</td>
        <td>M2M</td>
        <td>TCS</td>
        <td>KCLB-CDB</td>
        <td>EGLG-TEST</td>
        <td class=CR>N</td>
        <td>Revision 13.20.07</td>
        <td>ks8-US-OCC</td>
        <td class=CR>
            <font color=#34A853>Yes</font>
        </td>
        <td class=CR>
            <font color=#34A853>Passed</font>
        </td>
        <td class=CR>
            <font color=#34A853>Passed</font>
        </td>
        <td class=CR>Y</td>
        <td class=CR>
            <font color=#34A853>sDC</font>
        </td>
        <td>
            <font color=#db3236>Reclaimable Space: 3532 GB</font>
        </td>
        <td>
            <font color=#db3236>Reorg Required</font>
        </td>
        <td>
            <center>
                <font color=#0000FF>RAMU</font>
            </center>
        </td>
    </tr>
    <tr>
        <td>09/10/2020-19:00</td>
        <td class=CR><a href=http://fleetmanager.oraclecloud.com/change/faces/registerChangeRequest?CRID=11170981
                target=_blank>11170981</td>
        <td>
            <font color=green>Standalone Data Masking</font>
        </td>
        <td>Wipro, Inc.</td>
        <td></td>
        <td>LMNO-TEST</td>
        <td class=CR></td>
        <td>Revision 13.20.07</td>
        <td>ns2-US</td>
        <td class=CR>NA</td>
        <td class=CR>NA</td>
        <td class=CR>NA</td>
        <td class=CR>NA</td>
        <td class=CR>NA</td>
        <td>
            <center>NA</center>
        </td>
        <td>
            <center>NA</center>
        </td>
        <td>DataMasking</td>
    </tr>
</TABLE><br /><span>Thanks,</span><br /><span>M2M Ops</span><br /><br /><span>Note: This is a system generated email,
    still you can reply with queries/suggestions.</span>

</HTML>

So far, I have tried the following with sed:

sed -n '/^$/!{s/<[^>]*>//g;p;}' file.html

I get the following output:

Start TimeFM CRCR TypeCustomer NameSource PodTarget PodDM FlagReleaseData CenterCDB SyncFreeSpace CheckTDE/DV CheckM2M OptinM2M TypeDatabase Reorg DetailsOperations Team
09/10/2020-19:0011124482
M2MTCSKCLB-CDBEGLG-TESTNRevision 13.20.07ks8-US-OCCYes
PassedPassedYsDCReclaimable Space: 3532 GBReorg RequiredRAMU
09/10/2020-19:0011170981
Standalone Data MaskingWipro Inc.LMNO-TESTRevision 13.20.07ns2-USNA
NANANANANANADataMasking
Thanks,M2M OpsNote: This is a system generated email, still you can reply with queries/suggestions.

But it differs from the desired output:

StartTime           FMCR      CRType                   CustomerName         SourcePod  TargetPod DMFlag Release               DataCenter       CDBSync  FreeSpaceCheck TDE/DVCheck M2MOptin  M2MType DatabaseReorgDetails                             OperationsTeam
09/10/2020-19:00    11124482    M2M                        TCS               KCLB-CDB  KCLB-TEST  N     Revision 13.20.07      ks8-US-OCC       YES     Passed          Passed      Y         sDC     Reclaimable Space: 3532 GB   Reorg Required     RAMU
09/10/2020-19:00    11170981 Standalone Data Masking     Wipro, Inc              LMNO-TEST              Revision 13.20.07      ns2-US           NA      NA               NA          NA         NA      NA                           NA              DataMasking

Answer 1

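The output of sed and awk depends entirely on how the HTML file is formatted. For example, revision #3 would produce a different result than revision #4.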
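To illustrate that dependence, here is a minimal sketch that runs the sed command from the question over the same two cells laid out in two different ways. The function name `strip_tags` and the sample variables are made up for this illustration.

```shell
# strip_tags (hypothetical name): the per-line tag removal from the question.
strip_tags() {
    sed -n '/^$/!{s/<[^>]*>//g;p;}'
}

# The same table row, formatted on one line and on several lines.
one_line='<tr><td>A</td><td>B</td></tr>'
multi_line='<tr>
<td>A</td>
<td>B</td>
</tr>'

# One input line yields one output line, with the cell texts glued together.
printf '%s\n' "$one_line" | strip_tags

# Four input lines yield four output lines; the lines that held only
# <tr> or </tr> are printed as empty lines, because the emptiness check
# runs before the tags are stripped.
printf '%s\n' "$multi_line" | strip_tags
```

Because sed works line by line, the line breaks of the HTML source decide where the text is split, not the table structure.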

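Alternatively, you can use a dedicated tool such as html2text. html2text renders the HTML page as plain-text characters. You can, of course, process the output further with other command-line tools (e.g. sed and awk).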

To install html2text, simply run:

sudo apt install html2text

To get started, simply run:

html2text file.html

By default, html2text formats the HTML document for a screen width of 79 characters. The result will therefore look like this:


 ___________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST____________________________________
|Start|FM CR   |CR Type   |Customer|Source|Target|DM  |Release |Data  |CDB |FreeSpace|TDE/DV|M2M  |M2M |Database Reorg      |Operations |
|Time_|________|__________|Name____|Pod___|Pod___|Flag|________|Center|Sync|Check____|Check_|Optin|Type|Details_____________|Team_______|
|09/  |        |          |        |      |      |    |        |      |    |         |      |     |    |Reclaimable|        |           |
|10/  |11124482|M2M       |TCS     |KCLB- |EGLG- |N   |Revision|ks8-  |Yes |Passed   |Passed|Y    |sDC |Space: 3532|Reorg   |   RAMU    |
|2020-|        |          |        |CDB   |TEST  |    |13.20.07|US-OCC|    |         |      |     |    |GB         |Required|           |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|
|09/  |        |Standalone|        |      |      |    |        |      |    |         |      |     |    |           |        |           |
|10/  |11170981|Data      |Wipro,  |      |LMNO- |    |Revision|ns2-US|NA  |NA       |NA    |NA   |NA  |    NA     |   NA   |DataMasking|
|2020-|        |Masking   |Inc.    |      |TEST  |    |13.20.07|      |    |         |      |     |    |           |        |           |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|

Thanks,
M2M Ops

Note: This is a system generated email, still you can reply with queries/
suggestions.

However, you can change the width to the desired number of characters. For example, in your question the width is 261 characters, so you can also use

html2text -width 261 file.html

which results in:


 _________________________________________________________________________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST_________________________________________________________________________________________________
|Start_Time______|FM_CR___|CR_Type________________|Customer_Name|Source_Pod|Target_Pod|DM_Flag|Release__________|Data_Center|CDB_Sync|FreeSpace_Check|TDE/DV_Check|M2M_Optin|M2M_Type|Database_Reorg_Details____________________________________|Operations_Team___|
|09/10/2020-19:00|11124482|M2M____________________|TCS__________|KCLB-CDB__|EGLG-TEST_|N______|Revision_13.20.07|ks8-US-OCC_|Yes_____|Passed_________|Passed______|Y________|sDC_____|Reclaimable_Space:_3532_GB|Reorg_Required_________________|_______RAMU_______|
|09/10/2020-19:00|11170981|Standalone_Data_Masking|Wipro,_Inc.__|__________|LMNO-TEST_|_______|Revision_13.20.07|ns2-US_____|NA______|NA_____________|NA__________|NA_______|NA______|____________NA____________|______________NA_______________|DataMasking_______|

Thanks,
M2M Ops

Note: This is a system generated email, still you can reply with queries/suggestions.

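Now, to post-process this, e.g. to remove the glyphs (|), underscores (_), blank lines, the first line, and the last 3 lines, you can use whichever command-line tools you like. An ugly approach would look like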

html2text -width 200 file.html | sed 's/|/\ /g;s/\_/\ /g;/^$/d'| head -n -3 | tail -n +2

which produces

 Start Time     FM CR    CR Type      Customer Name Source Pod Target Pod DM Flag Release  Data Center CDB Sync FreeSpace TDE/DV Check M2M Optin M2M Type Database Reorg Details        Operations Team 
                                                                                                                Check                                                                                   
 09/10/2020-19: 11124482 M2M          TCS           KCLB-CDB   EGLG-TEST  N       Revision ks8-US-OCC  Yes      Passed    Passed       Y         sDC      Reclaimable    Reorg Required      RAMU       
 00                                                                               13.20.07                                                                Space: 3532 GB                                
 09/10/2020-19: 11170981 Standalone   Wipro, Inc.              LMNO-TEST          Revision ns2-US      NA       NA        NA           NA        NA             NA             NA       DataMasking     
 00                      Data Masking                                             13.20.07                                                                                                              
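As a variation on the same idea, the sketch below uses awk to split each table line on `|`, replace the underscore padding with spaces, and print the cells separated by tabs. The function name `table_rows` is hypothetical. The sketch assumes the width is large enough that every row fits on one line (as with `-width 261` above), and, like the sed version, it also replaces underscores that belong to the data.

```shell
# table_rows (hypothetical name): turn html2text table lines such as
#   |Start_Time______|FM_CR___|
# into tab-separated cells. Lines that do not start with "|" (caption,
# blank lines, signature) are skipped.
table_rows() {
    awk -F'|' '
    /^[|]/ {
        line = ""
        # Field 1 is the empty text before the first "|" and field NF the
        # text after the last "|", so the cells are fields 2 .. NF-1.
        for (i = 2; i < NF; i++) {
            cell = $i
            gsub(/_/, " ", cell)          # underscore padding -> spaces
            gsub(/^ +| +$/, "", cell)     # trim both ends
            line = line (i > 2 ? "\t" : "") cell
        }
        print line
    }'
}

# Intended usage (requires html2text to be installed):
#   html2text -width 261 file.html | table_rows
```

Tab-separated output keeps empty cells distinguishable, which is harder once the separators have been replaced by plain spaces.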

Answer 2

I prefer to use "vim" with "sed-style statements" in a command file, and then run it with "vim -s file_commands".

$ cat file_commands
:%join
:%s//&\r/gi
:%s//\t/gi
:%s/<[^>]*>//g
:w %.txt
:q!


$ vim -s file_commands example.html

$ cat example.html.txt
 Additional M2Ms & Standalone DataMasking List for 09 10 2020 PST
 Start Time FM CR CR Type Customer Name Source Pod Target Pod DM Flag Release Data Center CDB Sync FreeSpace Check TDE/DV Check M2M Optin M2M Type Database Reorg Details Operations Team
 09/10/2020-19:00 11124482 M2M TCS KCLB-CDB EGLG-TEST N Revision 13.20.07 ks8-US-OCC Yes Passed Passed Y sDC Reclaimable Space: 3532 GB Reorg Required RAMU
 09/10/2020-19:00 11170981 Standalone Data Masking Wipro, Inc. LMNO-TEST Revision 13.20.07 ns2-US NA NA NA NA NA NA NA DataMasking
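The same sequence of steps (join all lines, break after each row, put a tab between cells, strip the remaining tags) can also be sketched without vim. This is only an illustration, not the answer's exact commands: it assumes the row and cell boundaries are the closing `</tr>` and `</td>`/`</th>` tags, the function name `html_table_to_tsv` is made up, and HTML entities are not decoded. Unlike the output above, it drops the caption and any text outside table rows.

```shell
# html_table_to_tsv (hypothetical name): print one line per table row,
# with the cells separated by tabs.
html_table_to_tsv() {
    # Join everything into a single line; source tabs become spaces so
    # that tabs can serve as the cell separator later on.
    tr '\n\t' '  ' | awk '
    {
        gsub(/<\/[Tt][Rr]>/, "\n")                 # row end -> row separator
        nrows = split($0, rows, "\n")
        for (r = 1; r <= nrows; r++) {
            row = rows[r]
            # Drop everything up to the opening <tr ...>; chunks without
            # one (e.g. the text after the table) are skipped.
            if (!sub(/^.*<[Tt][Rr][^>]*>/, "", row)) continue
            gsub(/<\/[Tt][DdHh]>/, "\t", row)      # cell end -> tab
            gsub(/<[^>]*>/, "", row)               # strip remaining tags
            ncells = split(row, cells, "\t")
            if (ncells < 2) continue               # row without cells
            line = ""
            # The last piece is whatever follows the final cell; skip it.
            for (c = 1; c < ncells; c++) {
                gsub(/^ +| +$/, "", cells[c])      # trim
                gsub(/ +/, " ", cells[c])          # collapse inner spaces
                line = line (c > 1 ? "\t" : "") cells[c]
            }
            print line
        }
    }'
}

# Intended usage:
#   html_table_to_tsv < example.html > example.tsv
```

The tab-separated result can then be aligned with a tool such as `column -t`, though implementations differ in how they treat empty cells.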

