Converting a multi-line list to TSV

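I have a list of items in which each item spans multiple lines. The tag that separates items is unique (one HTML <li> per item), and every piece of text I have seen is contained in a single tagged paragraph (HTML <p>). I would like to generate a TSV from this, with the fields of each item in the following order: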

  1. Date
  2. Name
  3. URL
  4. Summary

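In all the items I have looked at, the URL and the name each appear twice (within each item), so I chose the first URL and the second name because that seemed easiest to me. The summary may contain presentational tags (e.g. <strong>), so I use a negative lookahead for it, whereas the date should have no inner tags, so I use a negated character class there instead.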

The first two items are:

    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
    Contact: Race and America's long war </a>
    </p>
    <p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    America's inner and outer wars and its nexus with capitalism and
    empire with Professor of Social and Cultural Analysis and History at
    New York University Nikhil Pal Singh. The internal violence in the
    United... 
    </p>
    <p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
    Contact: George Washington and the legacy of white supremacy </a></strong>
    </p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    George Washington, the fallible human being and one of the principal
    architects of the United States, with author Nathaniel Philbrick. As
    America fractures into ideologically hostile camps, it colors how
    we... 
    </p>
    <p style="margin-bottom: 0in">Feb 25, 2022 09:09 
    </p>
    <li>[...]

The regex I tried is <li>.*<a href="([^"]+)".*alt="On Contact: ([^"]+)".*<p[^>]*>((?:.(?!<\/p>))+)<\/p><p[^>]*>([^<]+)< and, if it worked, each match would be replaced with $4\t$2\t$1\t$3. I would like the regex to work in Notepad++.

Thank you very much for your help.

Update 1

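The test string I used later has more list items and adds presentational tags (e.g. <strong>) inside the summaries, although this is inconsistent with the titles. I had to remove the tab characters because they interfere with creating the TSV, and I figured I might as well remove the line breaks along the way (deleting [\t\r\n]). The result is: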

<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">OnContact: Race and America's long war </a></p><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in">On the show, Chris Hedges discussesAmerica's inner and outer wars and its nexus with capitalism and <strong>empire</strong> with Professor of Social and Cultural Analysis and History atNew York University Nikhil Pal Singh. The internal violence in theUnited... </p><p style="margin-bottom: 0in">Feb 27, 2022 10:36</p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">OnContact: George Washington and the legacy of white supremacy </a></strong></p><p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span> discusses George Washington, the fallible human being and one of the principalarchitects of the United States, with author Nathaniel Philbrick. AsAmerica fractures into ideologically hostile camps, it colors howwe... 
</p><p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_e46c470920b1171d.jpg" name="Image4" alt="On Contact: Oppenheimer & the bomb culture" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">OnContact: Oppenheimer &amp; the bomb culture </a></strong></p><p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer and the making of the bomb with author <span class="author">Kai Bird.J. Robert Oppenheimer</span>, &ldquo;the father of the atomic bomb,&rdquo;was by the end of World War II one of the most celebrated men inAmerica.... </p><p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_15449064d00f77f3.jpg" name="Image149" alt="On Contact – War with Iran? Stephen Kinzer" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">OnContact &ndash; War with Iran? Stephen Kinzer </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to journalistand author, Stephen Kinzer, on efforts by Saudi Arabia and Washington to cripple Iran&rsquo;s economy, inevitably putting Saudi Arabia, its Gulf allies and Washington on a collision course with the <em>Islamic</em>... 
</p><p style="margin-bottom: 0in">Sep 29, 2019 07:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_b82502a96022a758.png" name="Image150" alt="The future of the Amazon rain forest – Sonia Bone Guajajara" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">Thefuture of the Amazon rain forest &ndash; Sonia Bone Guajajara </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to Sonia BoneGuajajara, leader of 300 indigenous ethnic groups in Brazil, aboutthe future of the Amazon rain forest, its people, climate change,and the competing goals of agrobusiness, multinational corporations,and the... </p><p style="margin-bottom: 0in">Sep 22, 2019 07:15 </p></ul>
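
The flattening step described above (deleting every match of [\t\r\n]) can be sketched in Python. This is only an illustration of that one substitution; the function name `flatten` and the short sample string are assumptions made for the example, not part of the original workflow.

```python
import re

def flatten(html: str) -> str:
    # Delete tabs, carriage returns and line feeds, i.e. every match of [\t\r\n].
    return re.sub(r"[\t\r\n]", "", html)

# Hypothetical two-paragraph sample with Windows line endings and a tab.
sample = "<p>On\r\n\tContact: Race</p>\n<p>Feb 27, 2022 10:36</p>\n"
print(flatten(sample))  # <p>OnContact: Race</p><p>Feb 27, 2022 10:36</p>
```

Deleting line breaks outright also joins words that were split across lines, which is why the flattened test string contains "OnContact"; replacing each match with a single space instead would avoid that.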

Answer 1

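I like to break the problem down, and I try to optimize away any .* or .*? that I find. Note that if the HTML structure changes, the chances of something going wrong increase greatly.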

I am also a fan of regex flavors that support the /x flag, so that I can add whitespace and comments to help everything fit in my brain.

Here is what I came up with, with some comments to help explain what each part does:

<li>
(?>[<](?!a\b)[^<>]*[>]|[^<>]+)*
<a\shref="(?<url>[^"]+)"[^>]*>

# Match until we reach '<img'
(?>[<](?!img\b)[^<>]*[>]|[^<>]+)*
<img

# Match until we reach 'alt=' within '<img...>'
(?>[^<>=]*+(?<!alt)=|"[^<>"=]*"\s)*
alt="(?:On\sContact[\s–:\-–]*)?(?<on_contact>[^"]+)"[^<>]*>

# Match until it reaches a '<p...>' that does not contain some other opening '<' tag element.
(?>[<](?!p\b)[^<>]*[>]|[^<>]+|<p[^>]*>\s*<(?!\/?p\b)[^<>]*>)*
<p[^>]*>

# Match 'stuff stuff ... stuff stuff' without including trailing whitespace.
(?<desc>[^<>\s]+(?>\s+[^<>\s]+)*
  # Handle <strong>...</strong> nested tags
  (?>\s*[<](?!\/p)[^<>]*[>]|\s*[^<>\s]+(?>\s+[^<>\s]+)*)*
)

\s*<\/p>

# Match until we reach another '<p...>'
(?>[<](?!p\b)[^<>]*[>]|[^<>]+)*
<p[^>]*>

# Capture the date
(?<date>[^<]+)

# Match until we reach a '<li>' (or end of string)
(?>[<](?!li\b)[^<>]*[>]|[^<>]+)*
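
The recurring "match until we reach tag X" building block above can be tried in isolation. The following is a minimal Python sketch of just the first step (skip to the first <a href="...">), not the full pattern. Two changes are assumptions made so it runs with Python's `re` module: the atomic group (?>...) is swapped for a plain non-capturing group (?:...), since atomic groups only arrived in Python 3.11, and the named group is written (?P<url>...) because `re` does not accept (?<url>...). The names `SKIP_TO_A` and `FIRST_URL` are hypothetical.

```python
import re

# Consume either a whole tag that is not <a ...>, or a run of plain text,
# repeatedly, so the pattern stops right in front of the first <a ...> tag.
SKIP_TO_A = r'(?:<(?!a\b)[^<>]*>|[^<>]+)*'
FIRST_URL = re.compile(r'<li>' + SKIP_TO_A + r'<a\shref="(?P<url>[^"]+)"[^>]*>')

item = (
    '<li><p style="margin-bottom: 0in"><strong>'
    '<a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">'
    'On Contact: George Washington and the legacy of white supremacy </a></strong></p>'
)

m = FIRST_URL.match(item)
print(m.group("url"))
```

Without the atomic group, the nested quantifiers can backtrack heavily when a match fails on long input, which is presumably one reason the pattern above uses (?>...).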

You can see this in action on your original text here.

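The same regex, with the comment lines and whitespace removed, can be found here. In that form it should drop straight into Notepad++ or any PCRE2-compatible tool you have.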

Answer 2

Your regex contains a few errors that prevent it from matching the text.

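  • Remove the useless (in Notepad++) escaping of the slash character: \/ ==> /
  • Replace every greedy .* with the non-greedy .*?
  • Your tempered greedy token is in the wrong order: (?:.(?!</p>))+ should be (?:(?!</p>).)+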
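
The effect of the token order in the last point can be checked with a small Python sketch. The sample string is a hypothetical, shortened paragraph pair; only the two token variants come from the list above.

```python
import re

text = "<p>one <strong>two</strong> three</p><p>Feb 27, 2022 10:36</p>"

# Lookahead AFTER the dot: a character is consumed only if it is not followed
# by </p>, so the last character before </p> can never be consumed and the
# literal </p> that follows the group can never line up.
wrong = re.compile(r"<p>((?:.(?!</p>))+)</p>")

# Lookahead BEFORE the dot: a character is consumed only if </p> does not
# start at that position, so the group stops exactly in front of </p>.
right = re.compile(r"<p>((?:(?!</p>).)+)</p>")

print(wrong.search(text))           # None
print(right.search(text).group(1))  # one <strong>two</strong> three
```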

Also, the two <li> items in your sample text do not have the same structure:

  • the former has the image in its second <p>
  • the latter has the image in its first <p>

So the capture groups will not capture the same data.


You can view the regex here.


I changed your regex slightly, assuming the wanted paragraph does not contain any tags; it works for your example:

<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>
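
For readers who want to try the pattern outside Notepad++, here is a hedged Python sketch that applies the same regex and emits date, name, URL and summary separated by tabs. The names `ITEM`, `to_tsv` and `sample` are hypothetical, the sample is an abbreviated copy of the question's first two items (style and image attributes shortened), and collapsing whitespace in the summary is an added assumption so that each record stays on one line.

```python
import re

ITEM = re.compile(
    r'<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)"'
    r'.*?<p[^>]*>((?:(?![<>]).)+?)</p>'
    r'.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>',
    re.DOTALL,  # let "." match line breaks, since items span several lines
)

def to_tsv(html: str) -> str:
    rows = []
    for url, name, summary, date in ITEM.findall(html):
        # Assumption: collapse line breaks and repeated spaces in the summary.
        summary = " ".join(summary.split())
        rows.append("\t".join([date, name, url, summary]))
    return "\n".join(rows)

sample = """<li><p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<img src="a.jpg" alt="On Contact: Race and America's long war" width="280"/>
</a>
</p>
<p>On the show, Chris Hedges discusses
America's inner and outer wars...
</p>
<p>Feb 27, 2022 10:36</p>
<li><p><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
<img src="b.png" alt="On Contact: George Washington and the legacy of white supremacy" width="280"/>
</a>
</p>
<p><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
Contact: George Washington and the legacy of white supremacy </a></strong>
</p>
<p>On the show, Chris Hedges discusses
George Washington...
</p>
<p>Feb 25, 2022 09:09
</p>
"""

print(to_tsv(sample))
```

Because the pattern hard-codes the prefix `On Contact: ` and requires a tag-free summary, items whose alt text starts differently or whose summary contains tags such as <strong> will not be matched correctly by this sketch.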

Demo and explanation


Running it in Notepad++:

  • Ctrl+H
  • Find what: <li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>
  • Replace with: $4\n$2\n$1\n$3\n\n
  • CHECK Wrap around
  • CHECK Regular expression
  • CHECK . matches newline
  • Replace all
