The right tool for reversing the sort order of thousands of elements in an HTML file

I have an HTML file that contains thousands of <div class='date'></div><ul>...</ul> blocks, like this:

<!DOCTYPE html>
<html>

    <head>
    </head>

    <body>

        <div class="date">Wed May 23 2018</div>
        <ul>
            <li>
                Do laundry
                <ul>
                    <li>
                        Get coins
                    </li>
                </ul>
            </li>
            <li>
                Wash the dishes
            </li>
        </ul>

        <div class='date'>Thu May 24 2018</div>
        <ul>
            <li>
                Solve the world's hunger problem
                <ul>
                    <li>
                        Don't tell anyone
                    </li>
                </ul>
            </li>
            <li>
                Get something to wear
            </li>
        </ul>

        <div class='date'>Fri May 25 2018</div>
        <ul>
            <li>
                Modify the website according to GDPR
            </li>
            <li>
                Watch YouTube
            </li>
        </ul>

    </body>

</html>

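Each <div> element and its corresponding <ul> element belong to one specific date. The <div class='date'></div><ul>...</ul> blocks are sorted in ascending order, i.e. the newer dates are at the bottom of the file. I want to arrange them in descending order, so that the newer dates are at the top of the file, like this: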

<!DOCTYPE html>
<html>

    <head>
    </head>

    <body>

        <div class='date'>Fri May 25 2018</div>
        <ul>
            <li>
                Modify the website according to GDPR
            </li>
            <li>
                Watch YouTube
            </li>
        </ul>

        <div class='date'>Thu May 24 2018</div>
        <ul>
            <li>
                Solve the world's hunger problem
                <ul>
                    <li>
                        Don't tell anyone
                    </li>
                </ul>
            </li>
            <li>
                Get something to wear
            </li>
        </ul>

        <div class="date">Wed May 23 2018</div>
        <ul>
            <li>
                Do laundry
                <ul>
                    <li>
                        Get coins
                    </li>
                </ul>
            </li>
            <li>
                Wash the dishes
            </li>
        </ul>

    </body>

</html> 

I'm not sure what the right tool is. Is it a shell script? Is it awk? Is it Python? Is there anything else that might be faster and more convenient?

Answer 1

An extended Python solution:

The sort_html_by_date.py script:

from bs4 import BeautifulSoup
from datetime import datetime

with open('input.html') as html_doc:    # replace with your actual html file name
    soup = BeautifulSoup(html_doc, 'lxml')
    divs = {}
    for div in soup.find_all('div', 'date'):
        # %b matches abbreviated month names ("May", "Jun"); %B would require full names
        divs[datetime.strptime(div.string.strip(), '%a %b %d %Y')] = \
            str(div) + '\n' + div.find_next_sibling('ul').prettify()

    soup.body.clear()
    for el in sorted(divs, reverse=True):
        soup.body.append(divs[el])

    print(soup.prettify(formatter=None))

Usage:

python sort_html_by_date.py

Output:

 <!DOCTYPE html>
<html>
 <head>
 </head>
 <body>
  <div class="date">Fri May 25 2018</div>
<ul>
 <li>
  Modify the website according to GDPR
 </li>
 <li>
  Watch YouTube
 </li>
</ul>
  <div class="date">Thu May 24 2018</div>
<ul>
 <li>
  Solve the world's hunger problem
  <ul>
   <li>
    Don't tell anyone
   </li>
  </ul>
 </li>
 <li>
  Get something to wear
 </li>
</ul>
  <div class="date">Wed May 23 2018</div>
<ul>
 <li>
  Do laundry
  <ul>
   <li>
    Get coins
   </li>
  </ul>
 </li>
 <li>
  Wash the dishes
 </li>
</ul>
 </body>
</html>
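
If installing BeautifulSoup and lxml is not an option, a standard-library sketch along the following lines may be enough. It is not a real HTML parser: it assumes the file is as regular as the sample, i.e. every <div class='date'> sits on its own line, everything up to the next such line belongs to that block, and </body> closes the last block. The function name sort_blocks_newest_first is made up for this illustration, and the date parsing assumes an English (or C) locale.

```python
import re
from datetime import datetime

# Assumption: each date div is on one line, with single or double quotes around "date".
DATE_DIV = re.compile(r'<div class=["\']date["\']>(.*?)</div>')


def sort_blocks_newest_first(html):
    """Return html with the date blocks ordered newest first (hypothetical helper)."""
    head, blocks, tail = [], [], []
    for line in html.splitlines(keepends=True):
        match = DATE_DIV.search(line)
        if tail or '</body>' in line:
            # Everything from </body> onward is kept untouched at the end.
            tail.append(line)
        elif match:
            # A date div opens a new block; keep its parsed date as the sort key.
            date = datetime.strptime(match.group(1).strip(), '%a %b %d %Y')
            blocks.append((date, [line]))
        elif blocks:
            # Any other line belongs to the block opened most recently.
            blocks[-1][1].append(line)
        else:
            # Lines before the first date div (doctype, head, <body>) stay in place.
            head.append(line)
    blocks.sort(key=lambda block: block[0], reverse=True)
    body = ''.join(''.join(lines) for _, lines in blocks)
    return ''.join(head) + body + ''.join(tail)
```

To use it, read the file into a string, pass it to sort_blocks_newest_first, and write the result back out. Because the lines are moved verbatim, the original indentation and quoting are preserved, unlike the prettified output above. For input that is less regular than the sample, a real parser remains the safer choice.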

Modules used:

BeautifulSoup - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
datetime - https://docs.python.org/3.3/library/datetime.html#module-datetime
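
A note on the datetime format string, since it is easy to get wrong: the labels in the question ("Wed May 23 2018") use abbreviated month names, which the %b directive matches, while %B expects the full month name. "May" happens to be both, so the difference only shows up with other months. The short sketch below illustrates this; parse_date_label is a name invented for the example, and it assumes an English (or C) locale.

```python
from datetime import datetime

ABBREVIATED = '%a %b %d %Y'   # %b: abbreviated month name ("May", "Jun")
FULL = '%a %B %d %Y'          # %B: full month name ("May", "June")


def parse_date_label(text, fmt=ABBREVIATED):
    """Parse a label such as 'Wed May 23 2018' into a datetime (hypothetical helper)."""
    return datetime.strptime(text.strip(), fmt)


# Both labels parse with the abbreviated-month format.
print(parse_date_label('Wed May 23 2018'))
print(parse_date_label('Fri Jun 01 2018'))

# With %B, "Jun" is not a full month name, so strptime raises ValueError.
try:
    parse_date_label('Fri Jun 01 2018', FULL)
except ValueError as exc:
    print('rejected:', exc)
```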
