I want to write a Python program that downloads the content of a web page and then downloads the content of the pages that the first page links to.
For example, this is the main page http://www.adobe.com/support/security/, and these are the pages I want to download: http://www.adobe.com/support/security/bulletins/apsb13-23.html and http://www.adobe.com/support/security/bulletins/apsb13-22.html
There is one specific condition I want to satisfy: it should only download the pages under bulletins, not the ones under advisories (http://www.adobe.com/support/security/advisories/apsa13-02.html).
#!/usr/bin/env python
import urllib
import re
import sys
import os
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print ('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()
os.system("grep -i '\/support\/security\/bulletins\/' content >> content1")
I have already extracted the bulletin links into content1, but I don't know how to download the content of those pages by feeding content1 as input.
The content1 file looks like this:
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-07.html
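For reference, a minimal sketch of the missing step (untested; it assumes content1 holds one relative path per line and reuses the same Python 2 urllib module as above):
import os
import urllib
import urlparse

base = "http://www.adobe.com/support/security/"

# read the relative paths collected in content1, skipping duplicates
seen = set()
paths = []
for line in open('content1'):
    path = line.strip()
    if path and path not in seen:
        seen.add(path)
        paths.append(path)

# turn each path into an absolute URL and save the page under its own file name
for path in paths:
    url = urlparse.urljoin(base, path)  # e.g. http://www.adobe.com/support/security/bulletins/apsb13-23.html
    urllib.urlretrieve(url, os.path.basename(path))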
Answer 1
If I have understood your question correctly, the following script should do what you want:
#!/usr/bin/env python
import urllib
import re
import sys
import os
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print ('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()
os.system("grep -i '\/support\/security\/bulletins\/' content 2>/dev/null | head -n 3 | uniq | sed -e 's/^/http:\/\/www.adobe.com/g' > content1")
os.system("wget -i content1")
Answer 2
This question probably belongs on Stack Overflow!
But in any case, you could take a look at HTTrack for this; it does something similar, and it is open source.