Reader mode like a browser's, but with text-only output

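Background: Reader mode, as seen in Safari and other browsers, uses sophisticated heuristics to extract the main content of an article-style web page and displays it in a very readable font.

All navigation, headers, footers and other such content are removed. The mode only applies to "articles", i.e. pages that have a "main content", such as news articles, scientific papers, etc.

Question: Is there an open-source implementation of this for the terminal (i.e. plain text)? Or is there another way to accomplish the same thing?

Example: this New York Times article should produce output like the following: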

$ utility --reader-mode https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html

SEND US YOUR IDEAS FOR WHAT TO DO DURING THE POLAR VORTEX. WE
WANT TO HEAR FROM YOU.

It’s so cold in much of the Midwest today that you could get
frostbite within five minutes once you step outside. If you’re
living through it indoors, give us your tips.

A commuter during an extremely light morning rush hour in Chicago
on Wednesday. Businesses and schools have closed as the city
copes with record low temperatures.

Across the Midwest, where wind chills were minus 51 in
Minneapolis and minus 45 in Chicago, the risks of going outside
on Wednesday were dire. So, many people simply didn’t bother,
while others took a chance to briefly experience the coldest
weather in a generation.

Whether you’re an adventurer or a hibernator, tell us your
recommendations for staying warm and busy. What are you cooking
or binge-watching? What board games are you playing? If you’re
venturing outside, what are you doing to stay safe? (Experts warn
that even a short time in the extreme cold can be very
dangerous.) How many layers of clothing are you wearing, and
which special hats and gloves are necessary? Send us your photos
and your stories.

Answer 1

I have been experimenting with readability-cli (https://gitlab.com/gardenappl/readability-cli) combined with pandoc (https://pandoc.org/). For example:

% readable https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html | pandoc -f html - -t plain
Send Us Your Ideas for What to Do During the Polar Vortex. We Want to Hear From You.

It’s so cold in much of the Midwest today that you could get frostbite
within five minutes once you step outside. If you’re living through it
indoors, give us your tips.

[Credit...Scott Olson/Getty Images]

Across the Midwest, where wind chills were minus 51 in Minneapolis and
minus 45 in Chicago, the risks of going outside on Wednesday were dire.
So, many people simply didn’t bother, while others took a chance to

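And so on. It is a Node project, so you may wonder about vulnerabilities in its dependencies; use your own judgment. (Ironically, it does not work well on stackexchange.com links like this page :-)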
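For scripting, the same two-step pipeline can be driven from Python. This is only a minimal sketch: it assumes the `readable` and `pandoc` executables are installed and on the PATH, it reuses exactly the arguments shown in the command above, and the function names `reader_mode_commands` and `reader_mode_text` are made up for illustration.

```python
import subprocess


def reader_mode_commands(url):
    """Return the two command lines of the pipeline shown above."""
    extract_cmd = ["readable", url]                           # article HTML on stdout
    convert_cmd = ["pandoc", "-f", "html", "-", "-t", "plain"]  # HTML on stdin, plain text out
    return extract_cmd, convert_cmd


def reader_mode_text(url):
    """Run `readable URL | pandoc -f html - -t plain` and return the text.

    Assumes both tools are installed; raises CalledProcessError if either fails.
    """
    extract_cmd, convert_cmd = reader_mode_commands(url)
    html = subprocess.run(extract_cmd, check=True, capture_output=True).stdout
    plain = subprocess.run(convert_cmd, input=html, check=True, capture_output=True).stdout
    return plain.decode("utf-8")
```

Calling `reader_mode_text(url)` with the article URL should give the same text as the shell pipeline, since it only feeds the output of the first command into the second.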

Answer 2

The comment about "navigation content" is addressed by the -nolist option, for example

lynx -nolist -dump www.google.com > file.txt

which shows no links, etc.:

$ lynx -nolist -dump www.google.com > file.txt
$ cat file.txt 

   Search Images Maps Play YouTube News Gmail Drive More »
   Web History | Settings | Sign in

   Google

     _______________________________________________________
     Google Search  I'm Feeling Lucky                          Advanced search
                                                               Language tools

   Advertising Programs       Business  Solutions       +Google     About
   Google

                         © 2019 - Privacy - Terms

w3m gives something similar, without needing an option:

$ w3m -dump https://www.google.com
Search Images Maps Play YouTube News Gmail Drive More >>
Web History | Settings | Sign in

                                    Google

           [                                                         ] Advanced
                                                                       searchLanguage
                       [Google Search][I'm Feeling Lucky]              tools

           Advertising ProgramsBusiness Solutions+GoogleAbout Google

                          (C) 2019 - Privacy - Terms

The links2 output looks much like w3m's (note the missing space before "About"):

$ links2 -dump www.google.com                                          
   Search Images Maps Play YouTube News Gmail Drive More >>========(97,1) 31% ==
   Web History | Settings | Sign in                                             
                                     Google

    __________________________________________________________    Advanced       
              [ Google Search ] [ I'm Feeling Lucky ]             searchLanguage 
                                                                  tools          

           Advertising ProgramsBusiness Solutions+GoogleAbout Google

                           (c) 2019 - Privacy - Terms

$ links2 -dump www.google.com >file.txt 
$ cat file.txt 
   Search Images Maps Play YouTube News Gmail Drive More >>
   Web History | Settings | Sign in
                                     Google

    __________________________________________________________    Advanced       
              [ Google Search ] [ I'm Feeling Lucky ]             searchLanguage 
                                                                  tools          

           Advertising ProgramsBusiness Solutions+GoogleAbout Google

                           (c) 2019 - Privacy - Terms

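(Oddly, when the dump goes directly to the terminal, it also prints progress, which is not a nice feature.) elinks apparently only dumps in a format that includes the "navigation content" (YMMV).

From further comments, it turns out that the OP is interested in something that can render a given division (<div>) of the page. Comparing the sizes of the source and the dump of that page gives some clues: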

         Size Buffer name          Contents
      ------- -------------------- --------
   0#  267624 [!lynx -source ht-1] !lynx -source https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
   1     5475 [!lynx -dump -nolis] !lynx -dump -nolist https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
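The two sizes in the listing can be compared with a one-line calculation. The sketch below is purely illustrative; the helper name `dump_to_source_percent` is made up, and the numbers are the ones from the listing.

```python
def dump_to_source_percent(source_size, dump_size):
    """Return the dump size as a percentage of the source size."""
    return 100.0 * dump_size / source_size


# Sizes taken from the buffer listing above.
percent = dump_to_source_percent(267624, 5475)
print(f"{percent:.2f}%")
```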

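This shows that the dump is about 2% of the size of the source. Most of the page is non-informational, and the text browsers show the informational part. But the division requested is a two-line chunk that looks like this (only the beginning is shown: the first line actually has 62265 characters):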

<div id="app"><div class="css-v89234 e3w10z60"><div><div><div class="css-13lpfd6 e1nre7570"><header class="css-1bymuyk e1>
<script>window.__preloadedData = {"initialState":{"Article:QXJ0aWNsZTpueXQ6Ly9hcnRpY2xlLzBhODc0MTcxLWM0MjEtNWRjOS1hN2IzLW>

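The first line contains the article text (plus a lot of markup). From a casual look at the second line, it is probably the script that a GUI browser detects and uses to display the article. None of the text browsers above has a feature for displaying only a given <div>...</div>, or for interpreting the script in that way. Articles on the subject also mention that several GUI browsers lack a standard URI for reader mode.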
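Showing only one <div> can be approximated outside the browser. The following is a rough sketch using only Python's standard library `html.parser`; it is not a feature of lynx, w3m or links2, and it is not a reader-mode heuristic. It simply collects the text inside the first <div> with a given id and skips <script> and <style> bodies. The names `DivTextExtractor` and `extract_div_text` are made up for illustration, and the id "app" is taken from the snippet above.

```python
from html.parser import HTMLParser

# Tags whose bodies are code or styling, not readable text.
SKIPPED_TAGS = {"script", "style"}


class DivTextExtractor(HTMLParser):
    """Collect text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.div_depth = 0   # <div> nesting inside the target; 0 means outside
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.done = False    # True once the target <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.div_depth == 0:
            # Still looking for the target division.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.div_depth = 1
            return
        if tag == "div":
            self.div_depth += 1
        elif tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.div_depth == 0:
            return
        if tag == "div":
            self.div_depth -= 1
            if self.div_depth == 0:
                self.done = True
        elif tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.div_depth > 0 and self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_div_text(html, target_id):
    """Return the text of the target <div>, one text node per line."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Feeding it the output of `lynx -source URL` with `target_id="app"` would be one way to try it on the page above. Since it relies on the page keeping its article inside that one <div>, it is fragile compared with a real readability heuristic.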

Answer 3

Does this meet your requirements? (From https://stackoverflow.com/questions/12422289/bash-command-to-convert-html-page-to-a-text-file)

lynx --dump www.google.com > file.txt
