How can I extract sentences from any search engine?

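For anyone learning a foreign language, it is very helpful to see how a particular word is used in sentences, for example to see the different conjugations of a verb. I used to look up word usage in the quotes search section of IMDb, which stores the scripts of almost every Hollywood movie. I would like to build a command-line tool that searches a search engine for any word and displays the results as an ordered list of sentences. I have found some Perl scripts that split text into sentences. How can I extract sentences from any search engine and list them sentence by sentence, the way the bilingual sentence search on jukuu.com does?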

Answer 1

For example, take IMDb and the film Prisoners.

Command:

~/tmp$ wget 'http://www.imdb.com/title/tt1392214/?ref_=hm_cht_t1'

This will print something like...

--14:17:11-- http://www.imdb.com/title/tt1392214/?ref_=hm_cht_t1
           => `index.html?ref_=hm_cht_t1'
Resolving www.imdb.com... 72.21.215.52
Connecting to www.imdb.com|72.21.215.52|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ ] 186,103 389.18K/s

14:17:12 (388.45 KB/s) - `index.html?ref_=hm_cht_t1' saved [186103]

Result:

~/tmp$ ls
index.html?ref_=hm_cht_t1

Now you can scan the file...

grep Directed\ by index.html\?ref_\=hm_cht_t1
<meta name="description" content="Directed by Denis Villeneuve.  With Hugh Jackman, Jake Gyllenhaal, Viola Davis, Melissa Leo. When Keller Dover's daughter and her friend go missing, he takes matters into his own hands as the police pursue multiple leads and the pressure mounts. But just how far will this desperate father go to protect his family?" />
<meta property="og:description" content="Directed by Denis Villeneuve.  With Hugh Jackman, Jake Gyllenhaal, Viola Davis, Melissa Leo. When Keller Dover's daughter and her friend go missing, he takes matters into his own hands as the police pursue multiple leads and the pressure mounts. But just how far will this desperate father go to protect his family?" />

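The example above is the core of what you want to do, and you can build it out in more detail: let the user enter what they want to search for, then use wget to run a Google search for that term. Scan the results for URLs, wget those URLs, extract the content from those pages, and present it to the user.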
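The scanning and extraction steps could be sketched roughly as follows. This is only a minimal sketch, not a finished tool: the function names `extract_urls` and `extract_sentences` are made up for illustration, tags are stripped with a crude regular expression rather than a real HTML parser, and a sentence is assumed to end at `.`, `!` or `?` followed by a space.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each distinct http(s) URL found in the HTML
# read from stdin, in order of first appearance.
extract_urls() {
    grep -oE "https?://[^\"'<> ]+" | awk '!seen[$0]++'
}

# Hypothetical helper: print every sentence of the HTML read from stdin
# that contains the word given as $1 (case-insensitive, whole word only).
extract_sentences() {
    local word="$1"
    tr '\n' ' ' |                                   # join everything into one line
    sed -e 's/<[^>]*>/ /g' |                        # replace tags with spaces (crude, assumption)
    tr -s ' ' |                                     # squeeze repeated spaces
    awk '{ gsub(/[.!?] +/, "&\n"); print }' |       # break the line after each sentence end
    sed -e 's/^ *//' -e 's/ *$//' |                 # trim leading and trailing spaces
    grep -i -w -- "$word"                           # keep only sentences containing the word
}
```

With a page already saved by wget, as in the example above, this could be used as `extract_sentences father < 'index.html?ref_=hm_cht_t1'`, or directly on a download with `wget -q -O - "$url" | extract_sentences father`, where `$url` stands for one of the addresses printed by `extract_urls`. Text that only appears inside tag attributes, such as the meta description found by grep above, is discarded by the tag stripping, and script or style contents are not filtered out.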
