Downloading web page content

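I want to write a Python program that downloads the content of a web page and then downloads the content of the pages that the first page links to.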

For example, this is the main page: http://www.adobe.com/support/security/, and these are the pages I want to download: http://www.adobe.com/support/security/bulletins/apsb13-23.html and http://www.adobe.com/support/security/bulletins/apsb13-22.html

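There is one condition I want to satisfy: it should download only the pages under bulletins, not the pages under advisories (for example http://www.adobe.com/support/security/advisories/apsa13-02.html).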

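#!/usr/bin/env python
# Python 2 script (urllib.urlopen does not exist in Python 3)
import urllib
import re
import sys
import os  # required by the os.system call at the end
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
# Collect (href, link text) pairs for every anchor on the page
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    # Redirect print to the file so each href lands on its own line
    sys.stdout = fileHandle
    print ('%s' % (link[0]))
sys.stdout = sys.__stdout__
fileHandle.close()
# Keep only the bulletin links
os.system("grep -i '\/support\/security\/bulletins\/' content >> content1")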

I have already extracted the bulletin links into content1, but I don't know how to download the content of those pages using content1 as input.

The content1 file looks like this:

/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-07.html

Answer 1

If I have understood your question correctly, the following script should do what you want:

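#!/usr/bin/env python
# Python 2 script (urllib.urlopen does not exist in Python 3)

import urllib
import re
import sys
import os
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
# Collect (href, link text) pairs for every anchor on the page
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    # Redirect print to the file so each href lands on its own line
    sys.stdout = fileHandle
    print ('%s' % (link[0]))
sys.stdout = sys.__stdout__
fileHandle.close() 
# Keep the bulletin links, take the first 3 lines, drop adjacent duplicates,
# and prefix the host so that each line is an absolute URL
os.system("grep -i '\/support\/security\/bulletins\/' content 2>/dev/null | head -n 3 | uniq | sed -e 's/^/http:\/\/www.adobe.com/g' > content1")
# Download every URL listed in content1
os.system("wget -i content1")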
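
For readers on Python 3, where `urllib.urlopen` is no longer available, the same idea can be sketched without `grep`, `sed`, or `wget`. This is only an illustration under stated assumptions, not the script above: the names `extract_bulletin_urls`, `download_all`, and `main` are made up for this example, the page is assumed to use double-quoted `href` attributes, and each bulletin is saved under the last component of its URL.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Index page taken from the question.
BASE_URL = "http://www.adobe.com/support/security/"


def extract_bulletin_urls(html, base_url=BASE_URL):
    """Return absolute bulletin URLs found in html, in order, without duplicates."""
    # Assumption: href values are double-quoted, as in the original regex.
    hrefs = re.findall(r'<a\s[^>]*?href="([^"]*)"', html, flags=re.IGNORECASE)
    seen = set()
    urls = []
    for href in hrefs:
        # Same filter as the grep -i call: bulletins only, advisories are skipped.
        if "/support/security/bulletins/" not in href.lower():
            continue
        # Turn a relative link such as /support/... into an absolute URL.
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def download_all(urls):
    """Save each URL to a local file named after the last path component."""
    for url in urls:
        filename = url.rsplit("/", 1)[-1] or "index.html"
        with urlopen(url) as response, open(filename, "wb") as out:
            out.write(response.read())


def main():
    """Fetch the index page, then download every bulletin it links to."""
    with urlopen(BASE_URL) as response:
        index_html = response.read().decode("utf-8", errors="replace")
    download_all(extract_bulletin_urls(index_html))
```

Calling `main()` would fetch the index page and write one file per bulletin into the current directory. Unlike the `head -n 3 | uniq` pipeline, this sketch keeps every bulletin link and removes duplicates even when they are not adjacent, which may or may not be what you want.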

Answer 2

This question might be a better fit for Stack Overflow!

In any case, you could take a look at HTTrack for this: it does this kind of thing, and it is open source.