How do I download this web page correctly?

Answer 1

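wget alone will not work here because of the way the page's JavaScript handles the URLs. You have to parse the page with xmllint and then rewrite the URLs into a form that wget can handle.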

First, extract the URLs that are wrapped in JavaScript, clean them up, and write them to urls.txt:

wget -O - 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
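To make the sed step easier to follow, here is a minimal sketch that runs the same three expressions on a single attribute line. The sample line is hypothetical: its shape (a javascript: wrapper, &amp; entities, and a trailing newwindow parameter) is assumed from what the sed expressions remove, and the chapterId parameter is invented for illustration.

```shell
#!/bin/sh
# Same three sed expressions as in the pipeline, wrapped in a function.
clean_hrefs() {
    sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' \
        -e 's/amp;//g' \
        -e 's/&newwindow.*$//'
}

# Hypothetical attribute line (assumed shape, not copied from the real page).
sample=" href=\"javascript:newWindow('/he-bcs/Books?action=chapter&amp;bcsId=10685&amp;itemId=1119299160&amp;chapterId=1&amp;newwindow=true')\""

# Prints: https://bcs.wiley.com/he-bcs/Books?action=chapter&bcsId=10685&itemId=1119299160&chapterId=1
printf '%s\n' "$sample" | clean_hrefs
```

The first expression replaces everything up to "Books" with the absolute site prefix, the second turns each &amp; back into a plain &, and the third cuts off the newwindow parameter together with the closing quote and parenthesis.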

Now open each URL in urls.txt and download the PDF files it links to:

wget -O - -i urls.txt | grep -o 'https.*pdf' | wget -i -
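One thing to watch in the grep step: 'https.*pdf' is greedy, so if a fetched page puts two PDF links on the same line, the match runs from the first "https" to the last "pdf". The sketch below shows this on a hypothetical HTML line and compares it with a tighter pattern that stops at the closing quote. The tighter pattern is an assumption of mine, not part of the original answer, and the example.com URLs are placeholders.

```shell
#!/bin/sh
# The pattern from the answer: greedy, spans from the first "https" to the last "pdf" on a line.
extract_greedy() { grep -o 'https.*pdf'; }

# A tighter variant (an assumption): a URL cannot contain a double quote,
# so each match ends inside its own href attribute.
extract_tight() { grep -Eo 'https://[^"]+\.pdf'; }

# Hypothetical HTML with two links on one line.
html='<a href="https://example.com/a.pdf">A</a> <a href="https://example.com/b.pdf">B</a>'

printf '%s\n' "$html" | extract_greedy   # one long match covering both links
printf '%s\n' "$html" | extract_tight    # one URL per output line
```

If the real pages keep each link on its own line, both patterns give the same result and the original command is fine as written.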

curl alternative:

curl 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt

curl -s $(cat urls.txt) | grep -o 'https.*pdf' | xargs -n 1 curl -O
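In the last command, xargs starts one curl -O per URL it reads from standard input. Before downloading anything, it can help to preview those invocations by putting echo in front of curl. The sketch below does that with two placeholder URLs standing in for the grep output; the function name preview_downloads is hypothetical.

```shell
#!/bin/sh
# Dry run: print the curl command that would be started for each URL on stdin.
preview_downloads() {
    xargs -n 1 echo curl -O
}

# Placeholder URLs standing in for the output of the grep step.
printf '%s\n' 'https://example.com/a.pdf' 'https://example.com/b.pdf' | preview_downloads
# Prints:
#   curl -O https://example.com/a.pdf
#   curl -O https://example.com/b.pdf
```

Dropping the echo turns the preview into the real download. The -n 1 option passes one argument per invocation, which is equivalent to one line per invocation here because URLs contain no whitespace.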
