Using etree, how can I get all the text of an element and its children in an HTML file, along with their tags?

2024-6-10

I am trying to extract text data from an HTML file and reformat it according to its tags. The HTML file looks like this:

<!doctype html>

(some HTML code)

<section id="description">

    descrption_text_0

    <span class="ac"><span class="ac">1a.</span></span>

    descrption_text_1

    <i>italic_0</i>

    <span class="ac"><span class="ac">1b.</span></span>

    descrption_text_2

    <i>italic_1</i>

</section>

(some other HTML code)

</html>

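My goal is to build a list containing all the text inside the "section" tag, each piece paired with its tag, in the same order as it appears in the original HTML, like this: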

[('section', 'descrption_text_0'),
 ('ac', '1a.'),
 ('section', 'descrption_text_1'),
 ('i', 'italic_0'),
 ('ac', '1b.'),
 ('section', 'descrption_text_2'),
 ('i', 'italic_1')]

Here is what I have tried so far:

from lxml import etree

# my_html_file holds the HTML source as a string
html = etree.HTML(my_html_file)

# this returns a list of all text without their tags
html.xpath('//section[@id="description"]//text()')

# this only gets the texts inside the children of <section>, without "descrption_text_0", "descrption_text_1", etc
for el in html.iter():
    # el.get('id') returns None instead of raising KeyError when there is no id
    if el.tag == 'section' and el.get('id') == 'description':
        print(el.tag)
        print(el.text)
        for sub_el in el.iter():
            print(sub_el.tag)
            print(sub_el.text)

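The last approach I can think of is to use etree.tostring() to convert it back to raw HTML and then write another function to process it... but is there a way to achieve this with etree's built-in functions? Or is there another module that can do this? Thanks in advance.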
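For reference, one possible direction is sketched below. It is an illustration under stated assumptions, not a confirmed answer. In the ElementTree data model (shared by lxml and the standard library), text that follows a child's closing tag is stored in that child's `.tail`, not in the parent's `.text`. This is probably why "descrption_text_1" and "descrption_text_2" never show up when only `.text` is printed. The helper names `label` and `labelled_text` are hypothetical, and so is the rule of using the `class` attribute as the label when one is present (inferred from the `'ac'` entries in the desired output). For brevity the sketch parses only a well-formed copy of the `<section>` fragment; in a full document the element could presumably be located first with `html.xpath('//section[@id="description"]')[0]`.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback: it exposes the same .text / .tail model.
    import xml.etree.ElementTree as etree

# Assumption: a well-formed copy of the <section> fragment from the question.
FRAGMENT = """<section id="description">
    descrption_text_0
    <span class="ac"><span class="ac">1a.</span></span>
    descrption_text_1
    <i>italic_0</i>
    <span class="ac"><span class="ac">1b.</span></span>
    descrption_text_2
    <i>italic_1</i>
</section>"""


def label(el):
    # Hypothetical labelling rule: use the class attribute if present,
    # otherwise fall back to the tag name.
    return el.get("class") or el.tag


def labelled_text(el):
    """Yield (label, text) pairs for el and its descendants in document order."""
    if el.text and el.text.strip():
        yield label(el), el.text.strip()
    for child in el:
        # Comments and processing instructions have a non-string tag; skip
        # their content but still keep the tail text that follows them.
        if isinstance(child.tag, str):
            yield from labelled_text(child)
        # The tail is the text after the child's closing tag; it belongs
        # to the enclosing element, so it is labelled with el.
        if child.tail and child.tail.strip():
            yield label(el), child.tail.strip()


section = etree.fromstring(FRAGMENT)
pairs = list(labelled_text(section))
for pair in pairs:
    print(pair)
```

Whitespace-only text and tails are skipped, so the indentation between tags does not produce empty entries. With lxml specifically, the strings returned by an XPath `text()` query also carry `getparent()` and `is_tail`, which might allow a non-recursive variant of the same idea.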
