I want to extract the page content from this HTML file:
<BR />
<TABLE style=border-color:#32506d border=1 cellspacing=0>
<caption class=header style=background-color:#32506d><b>Additional M2Ms & Standalone DataMasking List for 09 10 2020
PST</b></caption>
<tr style=background-color:#32506d class=header>
<td class=CR>Start Time</td>
<td class=CR>FM CR</td>
<td class=CR>CR Type</td>
<td class=CR>Customer Name</td>
<td class=CR>Source Pod</td>
<td class=CR>Target Pod</td>
<td class=CR>DM Flag</td>
<td class=CR>Release</td>
<td class=CR>Data Center</td>
<td class=CR>CDB Sync</td>
<td class=CR>FreeSpace Check</td>
<td class=CR>TDE/DV Check</td>
<td class=CR>M2M Optin</td>
<td class=CR>M2M Type</td>
<td class=CR colspan=2>Database Reorg Details</td>
<td>Operations Team</td>
</tr>
<tr>
<td>09/10/2020-19:00</td>
<td class=CR><a href=http://fleetmanager.oraclecloud.com/change/faces/registerChangeRequest?CRID=11124482
target=_blank>11124482</td>
<td>M2M</td>
<td>TCS</td>
<td>KCLB-CDB</td>
<td>EGLG-TEST</td>
<td class=CR>N</td>
<td>Revision 13.20.07</td>
<td>ks8-US-OCC</td>
<td class=CR>
<font color=#34A853>Yes</font>
</td>
<td class=CR>
<font color=#34A853>Passed</font>
</td>
<td class=CR>
<font color=#34A853>Passed</font>
</td>
<td class=CR>Y</td>
<td class=CR>
<font color=#34A853>sDC</font>
</td>
<td>
<font color=#db3236>Reclaimable Space: 3532 GB</font>
</td>
<td>
<font color=#db3236>Reorg Required</font>
</td>
<td>
<center>
<font color=#0000FF>RAMU</font>
</center>
</td>
</tr>
<tr>
<td>09/10/2020-19:00</td>
<td class=CR><a href=http://fleetmanager.oraclecloud.com/change/faces/registerChangeRequest?CRID=11170981
target=_blank>11170981</td>
<td>
<font color=green>Standalone Data Masking</font>
</td>
<td>Wipro, Inc.</td>
<td></td>
<td>LMNO-TEST</td>
<td class=CR></td>
<td>Revision 13.20.07</td>
<td>ns2-US</td>
<td class=CR>NA</td>
<td class=CR>NA</td>
<td class=CR>NA</td>
<td class=CR>NA</td>
<td class=CR>NA</td>
<td>
<center>NA</center>
</td>
<td>
<center>NA</center>
</td>
<td>DataMasking</td>
</tr>
</TABLE><br /><span>Thanks,</span><br /><span>M2M Ops</span><br /><br /><span>Note: This is a system generated email,
still you can reply with queries/suggestions.</span>
</HTML>
So far, I have tried the following sed command:
sed -n '/^$/!{s/<[^>]*>//g;p;}' file.html
I get the following output:
Start TimeFM CRCR TypeCustomer NameSource PodTarget PodDM FlagReleaseData CenterCDB SyncFreeSpace CheckTDE/DV CheckM2M OptinM2M TypeDatabase Reorg DetailsOperations Team
09/10/2020-19:0011124482
M2MTCSKCLB-CDBEGLG-TESTNRevision 13.20.07ks8-US-OCCYes
PassedPassedYsDCReclaimable Space: 3532 GBReorg RequiredRAMU
09/10/2020-19:0011170981
Standalone Data MaskingWipro Inc.LMNO-TESTRevision 13.20.07ns2-USNA
NANANANANANADataMasking
Thanks,M2M OpsNote: This is a system generated email, still you can reply with queries/suggestions.
But it differs from the desired output:
StartTime FMCR CRType CustomerName SourcePod TargetPod DMFlag Release DataCenter CDBSync FreeSpaceCheck TDE/DVCheck M2MOptin M2MType DatabaseReorgDetails OperationsTeam
09/10/2020-19:00 11124482 M2M TCS KCLB-CDB KCLB-TEST N Revision 13.20.07 ks8-US-OCC YES Passed Passed Y sDC Reclaimable Space: 3532 GB Reorg Required RAMU
09/10/2020-19:00 11170981 Standalone Data Masking Wipro, Inc LMNO-TEST Revision 13.20.07 ns2-US NA NA NA NA NA NA NA DataMasking
Answer 1
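The output of sed or awk depends entirely on how the HTML file is formatted (line breaks, attribute layout, and so on). For example, revision #3 would produce different results than revision #4.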
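One way to reduce that dependence on line layout is to join the file into a single line before touching any tags, and only then split it again on row boundaries. The following is a minimal sketch of that idea, not a general HTML parser. It assumes lowercase td, tr and caption closing tags, as in the file from the question, and the function name html_table_to_tsv is made up for illustration.

```shell
#!/bin/sh
# html_table_to_tsv FILE (hypothetical helper name)
# Prints one line per table row, with cells separated by tabs.
# Assumes lowercase </td>, </tr> and </caption>, as in the question's file.
html_table_to_tsv() {
  # Join everything into one line so tags split across lines stay intact.
  tr '\r\n\t' '   ' < "$1" |
  awk '{
    gsub(/<\/td>/, "\t")                  # end of cell  -> tab
    gsub(/<\/(tr|caption)>/, "\n")        # end of row or caption -> newline
    gsub(/<[^>]*>/, "")                   # drop every remaining tag
    gsub(/ *\t */, "\t")                  # trim spaces around cell separators
    gsub(/\t? *\n */, "\n")               # drop the tab left by the last cell of each row
    sub(/^ +/, "")                        # trim the start of the output
    sub(/\n$/, "")                        # avoid a trailing blank line
    print
  }'
}
```

Calling html_table_to_tsv file.html would then print the caption on the first line and each table row as one tab-separated line. Any text after the closing table tag (the email signature here) ends up on the last line and still needs to be cut off.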
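Alternatively, you can use a dedicated tool such as html2text. html2text renders an HTML page as plain text. Of course, you can process its output further with other command-line tools (such as sed and awk).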
To install html2text, just run:
sudo apt install html2text
To get started, just run:
html2text file.html
By default, html2text formats the HTML document for a screen width of 79 characters. The result therefore looks like this:
___________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST____________________________________
|Start|FM CR |CR Type |Customer|Source|Target|DM |Release |Data |CDB |FreeSpace|TDE/DV|M2M |M2M |Database Reorg |Operations |
|Time_|________|__________|Name____|Pod___|Pod___|Flag|________|Center|Sync|Check____|Check_|Optin|Type|Details_____________|Team_______|
|09/ | | | | | | | | | | | | | |Reclaimable| | |
|10/ |11124482|M2M |TCS |KCLB- |EGLG- |N |Revision|ks8- |Yes |Passed |Passed|Y |sDC |Space: 3532|Reorg | RAMU |
|2020-| | | |CDB |TEST | |13.20.07|US-OCC| | | | | |GB |Required| |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|
|09/ | |Standalone| | | | | | | | | | | | | | |
|10/ |11170981|Data |Wipro, | |LMNO- | |Revision|ns2-US|NA |NA |NA |NA |NA | NA | NA |DataMasking|
|2020-| |Masking |Inc. | |TEST | |13.20.07| | | | | | | | | |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|
Thanks,
M2M Ops
Note: This is a system generated email, still you can reply with queries/
suggestions.
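However, you can change the width to any number of characters you like. For example, in your question the width is 261 characters, so you can also use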
html2text -width 261 file.html
which results in:
_________________________________________________________________________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST_________________________________________________________________________________________________
|Start_Time______|FM_CR___|CR_Type________________|Customer_Name|Source_Pod|Target_Pod|DM_Flag|Release__________|Data_Center|CDB_Sync|FreeSpace_Check|TDE/DV_Check|M2M_Optin|M2M_Type|Database_Reorg_Details____________________________________|Operations_Team___|
|09/10/2020-19:00|11124482|M2M____________________|TCS__________|KCLB-CDB__|EGLG-TEST_|N______|Revision_13.20.07|ks8-US-OCC_|Yes_____|Passed_________|Passed______|Y________|sDC_____|Reclaimable_Space:_3532_GB|Reorg_Required_________________|_______RAMU_______|
|09/10/2020-19:00|11170981|Standalone_Data_Masking|Wipro,_Inc.__|__________|LMNO-TEST_|_______|Revision_13.20.07|ns2-US_____|NA______|NA_____________|NA__________|NA_______|NA______|____________NA____________|______________NA_______________|DataMasking_______|
Thanks,
M2M Ops
Note: This is a system generated email, still you can reply with queries/suggestions.
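Now, to clean this up, for example to remove the pipe glyphs (|), the underscores (_), the empty lines, the first line and the last 3 lines, you can use whatever command-line tools you prefer. An ugly way would look like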
html2text -width 200 file.html | sed 's/|/\ /g;s/\_/\ /g;/^$/d'| head -n -3 | tail -n +2
This will produce
Start Time FM CR CR Type Customer Name Source Pod Target Pod DM Flag Release Data Center CDB Sync FreeSpace TDE/DV Check M2M Optin M2M Type Database Reorg Details Operations Team
Check
09/10/2020-19: 11124482 M2M TCS KCLB-CDB EGLG-TEST N Revision ks8-US-OCC Yes Passed Passed Y sDC Reclaimable Reorg Required RAMU
00 13.20.07 Space: 3532 GB
09/10/2020-19: 11170981 Standalone Wipro, Inc. LMNO-TEST Revision ns2-US NA NA NA NA NA NA NA DataMasking
00 Data Masking 13.20.07
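If the goal is one line per table row, a variant is to keep the cell boundaries instead of blanking them out. The sketch below is only an illustration under two assumptions: html2text was run with a width large enough that every row fits on a single line (for example -width 261, as above), and no cell contains a literal underscore or pipe. The function name html2text_rows_to_tsv is hypothetical.

```shell
#!/bin/sh
# html2text_rows_to_tsv (hypothetical helper name)
# Reads html2text table output on stdin and writes tab-separated rows.
# Assumes each table row sits on one line and starts with a pipe.
html2text_rows_to_tsv() {
  awk -F'|' '
    /^\|/ {                          # table rows start with a pipe; title and footer lines do not
      line = ""
      for (i = 2; i < NF; i++) {     # fields 1 and NF lie outside the outer pipes
        cell = $i
        gsub(/_/, " ", cell)         # html2text pads cells with underscores
        gsub(/^ +| +$/, "", cell)    # trim the padding on both sides
        line = line (i > 2 ? "\t" : "") cell
      }
      print line
    }'
}
```

It would be used as html2text -width 261 file.html | html2text_rows_to_tsv. Empty cells stay in place as empty fields, so the columns remain aligned for later processing with cut or awk.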
Answer 2
I prefer to use vim with sed-style substitution commands stored in a command file, and then execute it with "vim -s file-commands".
$ cat file-commands
:%join
:%s//&\r/gi
:%s//\t/gi
:%s/]*>//g
:w %.txt
:q!
$ vim -s file-commands example.html
$ cat example.html.txt
Additional M2Ms & Standalone DataMasking List for 09 10 2020 PST
Start Time FM CR CR Type Customer Name Source Pod Target Pod DM Flag Release Data Center CDB Sync FreeSpace Check TDE/DV Check M2M Optin M2M Type Database Reorg Details Operations Team
09/10/2020-19:00 11124482 M2M TCS KCLB-CDB EGLG-TEST N Revision 13.20.07 ks8-US-OCC Yes Passed Passed Y sDC Reclaimable Space: 3532 GB Reorg Required RAMU
09/10/2020-19:00 11170981 Standalone Data Masking Wipro, Inc. LMNO-TEST Revision 13.20.07 ns2-US NA NA NA NA NA NA NA DataMasking