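I sometimes come across websites that publish content (files) as JavaScript links. If the links were published as conventional <a href="..."> elements, it would be easy to parse the HTML, find the links, and download the content. Even an application like Acrobat can handle that case and generate a PDF of the relevant part of the site.
JavaScript links are a different matter.
Here is an example of a site whose content is public (no login or password required) but which uses JavaScript links.
How can I download the PDF files here programmatically?
http://www.oml.ago.state.ma.us/
There is a tab for each year. Take the 2013 tab as an example:
http://www.oml.ago.state.ma.us/Default.aspx?sectionYear=1&year=2013
There are several hundred links here, but short of clicking each one, I cannot find any way to discover the targets and download them.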
Answer 1
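Two options come to mind (neither of them uses Java):
Write a JavaScript bookmarklet that you click in the browser to scrape the DOM elements you want once the page has loaded and its JS has run. This works, but it does not scale to a large number of pages.
Use a headless browser such as http://casperjs.org/, http://phantomjs.org/ or http://slimerjs.org/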
Answer 2
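You can find it by watching the network traffic in the browser's developer console.
The URL is http://www.oml.ago.state.ma.us/default.aspx, requested via POST with the following headers and form parameters: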
Host: www.oml.ago.state.ma.us
User-Agent: [...]
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
DNT: 1
Referer: http://www.oml.ago.state.ma.us/
Cookie: [...]
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 5713
__EVENTTARGET=ctl00%24ContentPlaceHolder1%24grdOML%24ctl02%24lnkOpenFile&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTI3MjY2NDEzNg9kFgJmD2QWAgIDD2QWAgIBD2QWCAIBD2QWAmYPFgIeBFRleHQFgQM8dGFibGUgd2lkdGg9JzcwJScgY2VsbHBhZGRpbmc9JzInIGNlbGxzcGFjaW5nPScyJyBib3JkZXI9JzAnPjx0cj48dGQgYmdjb2xvcj0nI2RjZGNkMCdhbGlnbj0nbGVmdCcgdmFsaWduPSdtaWRkbGUnY2xhc3M9J25hdmlnYXRpb25UZXh0J3dpZHRoPSc0MCUnPjxiPjxhIGhyZWY9J0RlZmF1bHQuYXNweD9zZWN0aW9uPTAnPkJyb3dzZSBPTUwgRGV0ZXJtaW5hdGlvbnM8L2E%2BPC9iPjwvdGQ%2BPHRkIGJnY29sb3I9JyNmMGYwZTgnYWxpZ249J2xlZnQnIHZhbGlnbj0nbWlkZGxlJ2NsYXNzPSduYXZpZ2F0aW9uVGV4dCd3aWR0aD0nMzUlJz48YSBocmVmPSdTZWFyY2guYXNweD9zZWN0aW9uPTEnPlNlYXJjaCBPTUwgRGV0ZXJtaW5hdGlvbnM8L2E%2BPC90ZD48L3RyPjwvdGFibGU%2BZAIDD2QWAmYPFgIfAAWbBjx0YWJsZSB3aWR0aD0nMTAwJScgY2VsbHBhZGRpbmc9JzInIGNlbGxzcGFjaW5nPScyJyBib3JkZXI9JzAnPjx0cj48dGQgYmdjb2xvcj0nI2RjZGNkMCdhbGlnbj0nbGVmdCcgdmFsaWduPSd0b3AnY2xhc3M9J25hdmlnYXRpb25UZXh0J3dpZHRoPScyMiUnPjxiPjxhIGhyZWY9J0RlZmF1bHQuYXNweD9zZWN0aW9uWWVhcj0wJnllYXI9MjAxNCc%2BMjAxNDwvYT48L2I%2BPC90ZD48dGQgYmdjb2xvcj0nI2YwZjBlOCdhbGlnbj0nbGVmdCcgdmFsaWduPSd0b3AnY2xhc3M9J25hdmlnYXRpb25UZXh0J3dpZHRoPScxOS41JSc%2BPGEgaHJlZj0nRGVmYXVsdC5hc3B4P3NlY3Rpb25ZZWFyPTEmeWVhcj0yMDEzJz4yMDEzPC9hPjwvdGQ%2BPHRkIGJnY29sb3I9JyNmMGYwZTgnYWxpZ249J2xlZnQnIHZhbGlnbj0ndG9wJ2NsYXNzPSduYXZpZ2F0aW9uVGV4dCd3aWR0aD0nMTkuNSUnPjxhIGhyZWY9J0RlZmF1bHQuYXNweD9zZWN0aW9uWWVhcj0yJnllYXI9MjAxMic%2BMjAxMjwvYT48L3RkPjx0ZCBiZ2NvbG9yPScjZjBmMGU4J2FsaWduPSdsZWZ0JyB2YWxpZ249J3RvcCdjbGFzcz0nbmF2aWdhdGlvblRleHQnd2lkdGg9JzE5LjUlJz48YSBocmVmPSdEZWZhdWx0LmFzcHg%2Fc2VjdGlvblllYXI9MyZ5ZWFyPTIwMTEnPjIwMTE8L2E%2BPC90ZD48dGQgYmdjb2xvcj0nI2YwZjBlOCdhbGlnbj0nbGVmdCcgdmFsaWduPSd0b3AnY2xhc3M9J25hdmlnYXRpb25UZXh0J3dpZHRoPScxOS41JSc%2BPGEgaHJlZj0nRGVmYXVsdC5hc3B4P3NlY3Rpb25ZZWFyPTQmeWVhcj0yMDEwJz4yMDEwPC9hPjwvdGQ%2BPC90cj48L3RhYmxlPmQCBQ8QZA8WAWYWARAFDy0tUHJpb3IgWWVhcnMtLQUPLS1QcmlvciBZZWFycy0tZxYBZmQCBw88KwANAQAPFgQeC18hRGF0YUJvdW5kZx4LXyFJdGVtQ291bnQCDWQWAmYPZBYcAgEPZBYKZg9kFgICAQ8PFgQfAAUKMDEvMzEvMjAxNB4PQ29tbWFuZEFyZ3VtZW50BV
5PTUwtMjAxNC03LVNlZWtvbmstQW5pbWFsLVNoZWx0ZXItQnVpbGRpbmctQ29tbWl0dGVlLWFuZC1TZWVrb25rLUJvYXJkLW9mLVNlbGVjdG1lbi5wZGY7Mzg0NzM3ZGQCAQ8PFgIfAAUKT01MIDIwMTQtN2RkAgIPDxYCHwAFKVNlZWtvbmsgQW5pbWFsIFNoZWx0ZXIgQnVpbGRpbmcgQ29tbWl0dGVlZGQCAw8PFgIfAAUFTG9jYWxkZAIEDw8WAh8ABQYmbmJzcDtkZAICD2QWCmYPZBYCAgEPDxYEHwAFCjAxLzI3LzIwMTQfAwUwT01MLTIwMTQtNi1Ib2xsYW5kLUJvYXJkLW9mLVNlbGVjdG1lbi5wZGY7Mzg2Njg2ZGQCAQ8PFgIfAAUKT01MIDIwMTQtNmRkAgIPDxYCHwAFGkhvbGxhbmQgQm9hcmQgb2YgU2VsZWN0bWVuZGQCAw8PFgIfAAUFTG9jYWxkZAIEDw8WAh8ABQdIb2xsYW5kZGQCAw9kFgpmD2QWAgIBDw8WBB8ABQowMS8yNy8yMDE0HwMFLU9NTC0yMDE0LTUtTG9uZ21lYWRvdy1TZWxlY3QtQm9hcmQucGRmOzM4MDc4OGRkAgEPDxYCHwAFCk9NTCAyMDE0LTVkZAICDw8WAh8ABRdMb25nbWVhZG93IFNlbGVjdCBCb2FyZGRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUKTG9uZ21lYWRvd2RkAgQPZBYKZg9kFgICAQ8PFgQfAAUKMDEvMjcvMjAxNB8DBTQxLTI3LTE0LUVzc2V4LUJvYXJkLW9mLVNlbGVjdG1lbl9SZWRhY3RlZC5wZGY7MzkxMTg4ZGQCAQ8PFgIfAAUHMS0yNy0xNGRkAgIPDxYCHwAFGEVzc2V4IEJvYXJkIG9mIFNlbGVjdG1lbmRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUFRXNzZXhkZAIFD2QWCmYPZBYCAgEPDxYEHwAFCjAxLzI3LzIwMTQfAwU%2BMS0yNy0xNC1TdHVyYnJpZGdlLUNvbnNlcnZhdGlvbi1Db21taXNzaW9uX1JlZGFjdGVkLnBkZjszODk1MzdkZAIBDw8WAh8ABQcxLTI3LTE0ZGQCAg8PFgIfAAUiU3R1cmJyaWRnZSBDb25zZXJ2YXRpb24gQ29tbWlzc2lvbmRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUKU3R1cmJyaWRnZWRkAgYPZBYKZg9kFgICAQ8PFgQfAAUKMDEvMjEvMjAxNB8DBTlPTUwtMjAxNC00LU1hc3NhY2h1c2V0dHMtQm9hcmQtb2YtQm9pbGVyLVJ1bGVzLnBkZjszODA4MTVkZAIBDw8WAh8ABQpPTUwgMjAxNC00ZGQCAg8PFgIfAAUVQm9hcmQgb2YgQm9pbGVyIFJ1bGVzZGQCAw8PFgIfAAUFU3RhdGVkZAIEDw8WAh8ABQZCb3N0b25kZAIHD2QWCmYPZBYCAgEPDxYEHwAFCjAxLzIxLzIwMTQfAwUyMS0yMS0xNC1DYW1icmlkZ2UtQ2l0eS1Db3VuY2lsX1JlZGFjdGVkLnBkZjszODY4MjhkZAIBDw8WAh8ABQcxLTIxLTE0ZGQCAg8PFgIfAAUWQ2FtYnJpZGdlIENpdHkgQ291bmNpbGRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUJQ2FtYnJpZGdlZGQCCA9kFgpmD2QWAgIBDw8WBB8ABQowMS8yMS8yMDE0HwMFOTEtMjEtMTQtU3R1cmJyaWRnZS1Cb2FyZC1vZi1TZWxlY3RtZW5fUmVkYWN0ZWQucGRmOzM5NDIzOWRkAgEPDxYCHwAFBzEtMjEtMTRkZAICDw8WAh8ABR1TdHVyYnJpZGdlIEJvYXJkIG9mIFNlbGVjdG1lbmRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUKU3R1cmJyaWRn
ZWRkAgkPZBYKZg9kFgICAQ8PFgQfAAUKMDEvMjEvMjAxNB8DBUcxLTIxLTE0LVByb3ZpbmNldG93bi1IaXN0b3JpY2FsLURpc3RyaWN0LUNvbW1pc3Npb25fUmVkYWN0ZWQucGRmOzM3NTgxNGRkAgEPDxYCHwAFBzEtMjEtMTRkZAICDw8WAh8ABSlQcm92aW5jZXRvd24gSGlzdG9yaWMgRGlzdHJpY3QgQ29tbWlzc2lvbmRkAgMPDxYCHwAFBUxvY2FsZGQCBA8PFgIfAAUMUHJvdmluY2V0b3duZGQCCg9kFgpmD2QWAgIBDw8WBB8ABQowMS8xMy8yMDE0HwMFMU9NTC0yMDE0LTMtRWdyZW1vbnQtQm9hcmQtb2YtU2VsZWN0bWVuLnBkZjszNzgyMTdkZAIBDw8WAh8ABQpPTUwgMjAxNC0zZGQCAg8PFgIfAAUbRWdyZW1vbnQgQm9hcmQgb2YgU2VsZWN0bWVuZGQCAw8PFgIfAAUFTG9jYWxkZAIEDw8WAh8ABQhFZ3JlbW9udGRkAgsPZBYKZg9kFgICAQ8PFgQfAAUKMDEvMTMvMjAxNB8DBUxPTUwtMjAxNC0yLU1pbnV0ZW1hbi1SZWdpb25hbC1UZWNobmljYWwtU2Nob29sLURpc3RyaWN0LUNvbW1pdHRlZS5wZGY7MzcwMzcxZGQCAQ8PFgIfAAUKT01MIDIwMTQtMmRkAgIPDxYCHwAFOE1pbnV0ZW1hbiBSZWdpb25hbCBWb2NhdGlvbmFsIFRlY2huaWNhbCBTY2hvb2wgQ29tbWl0dGVlZGQCAw8PFgIfAAURUmVnaW9uYWwvRGlzdHJpY3RkZAIEDw8WAh8ABQYmbmJzcDtkZAIMD2QWCmYPZBYCAgEPDxYEHwAFCjAxLzEzLzIwMTQfAwU3MS0xMy0xNC1Bc2hmaWVsZC1Cb2FyZC1vZi1TZWxlY3RtZW5fUmVkYWN0ZWQucGRmOzM3MDI2NmRkAgEPDxYCHwAFBzEtMTMtMTRkZAICDw8WAh8ABRVBc2hmaWVsZCBTZWxlY3QgQm9hcmRkZAIDDw8WAh8ABQVMb2NhbGRkAgQPDxYCHwAFCEFzaGZpZWxkZGQCDQ9kFgpmD2QWAgIBDw8WBB8ABQowMS8wMi8yMDE0HwMFNU9NTC0yMDE0LTEtQm94Zm9yZC1ab25pbmctQm9hcmQtb2YtQXBwZWFscy5wZGY7MzY1NTAzZGQCAQ8PFgIfAAUKT01MIDIwMTQtMWRkAgIPDxYCHwAFH0JveGZvcmQgWm9uaW5nIEJvYXJkIG9mIEFwcGVhbHNkZAIDDw8WAh8ABQVMb2NhbGRkAgQPDxYCHwAFB0JveGZvcmRkZAIODw8WAh4HVmlzaWJsZWhkZBgBBSBjdGwwMCRDb250ZW50UGxhY2VIb2xkZXIxJGdyZE9NTA88KwAKAQgCAWQqRlzk94heDgb756WGG3iXbo2UvA%3D%3D&__EVENTVALIDATION=%2FwEWFAKH5NrcBAKbtOHzBQKO7pzyCQKY2J3zAwLlxIrvAwK9oZrLDQKN6YqwCgLFgqFvAsSCpY4JAq2B6YkBAsSC7agLAsuCkcwDAsqClesJArOBmZ0MAq6BncQIAuXatqUOAoDbuswKAoDbnm8C59qijgkCxNrmiQFU8mZCmbVka60Kj%2BqgzpL%2Fbfuz8A%3D%3D
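Trying to hide the URL of a public document is always stupid and pointless. It also breaks navigation (for example, you cannot open the link in a new tab...).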
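The captured request suggests a way to avoid clicking each link: every link appears to differ only in its `__EVENTTARGET` value, while `__VIEWSTATE` and `__EVENTVALIDATION` are hidden form fields copied from the page. Below is a minimal Python sketch of that idea. It assumes the links are rendered as `javascript:__doPostBack('target','')` hrefs, which is the usual ASP.NET WebForms pattern but is not confirmed for this site. The names `PostbackPageParser` and `build_post_bodies` are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode


class PostbackPageParser(HTMLParser):
    """Collect hidden form fields and __doPostBack targets from a page."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}  # e.g. __VIEWSTATE, __EVENTVALIDATION
        self.targets = []        # e.g. ctl00$...$lnkOpenFile

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.hidden_fields[name] = attrs.get("value") or ""
        elif tag == "a":
            # Assumption: links look like javascript:__doPostBack('target','')
            match = re.search(r"__doPostBack\('([^']*)'", attrs.get("href") or "")
            if match:
                self.targets.append(match.group(1))


def build_post_bodies(html):
    """Return one URL-encoded POST body per JavaScript link found in html."""
    parser = PostbackPageParser()
    parser.feed(html)
    bodies = []
    for target in parser.targets:
        fields = dict(parser.hidden_fields)
        fields["__EVENTTARGET"] = target
        fields["__EVENTARGUMENT"] = ""
        bodies.append(urlencode(fields))
    return bodies
```

Each returned string has the same shape as the request body shown above and could be sent to default.aspx in turn. Whether the server also requires the session cookie from the original page load is not something the capture settles.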
Answer 3
Thanks to @sebcap26 for pointing me in the right direction.
I guess the solution is:
wget http://www.oml.ago.state.ma.us/default.aspx --post-data="parameters"
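For scripting many downloads, the same POST can be issued from Python's standard library instead of wget. This is only a sketch: `build_request` and `download` are hypothetical helper names, `fields` stands for the form parameters captured in the developer console, and it is an assumption that the server answers the POST with the PDF bytes directly.

```python
import urllib.request
from urllib.parse import urlencode

URL = "http://www.oml.ago.state.ma.us/default.aspx"


def build_request(fields, url=URL):
    """Build a form-encoded POST, like wget --post-data."""
    data = urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def download(fields, out_path, url=URL):
    """Send the POST and write the response body to out_path."""
    request = build_request(fields, url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call would look like `download({"__EVENTTARGET": "...", "__EVENTARGUMENT": "", "__VIEWSTATE": "...", "__EVENTVALIDATION": "..."}, "file.pdf")`, with the elided values taken from the page being scraped.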