Answer 1
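Some sites/assets, or parts of them, do not allow automation or actively block it, and there is nothing you can do about that.
As an aside, you do not need a browser to download website data. This is web scraping, and in PowerShell it is done with the web cmdlets, specifically Invoke-WebRequest: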
# Get specifics for a module, cmdlet, or function
(Get-Command -Name Invoke-WebRequest).Parameters
(Get-Command -Name Invoke-WebRequest).Parameters.Keys
<#
# Results
UseBasicParsing
Uri
WebSession
SessionVariable
Credential
UseDefaultCredentials
CertificateThumbprint
Certificate
UserAgent
DisableKeepAlive
TimeoutSec
Headers
MaximumRedirection
Method
Proxy
ProxyCredential
ProxyUseDefaultCredentials
Body
ContentType
TransferEncoding
InFile
OutFile
PassThru
Verbose
Debug
ErrorAction
WarningAction
InformationAction
ErrorVariable
WarningVariable
InformationVariable
OutVariable
OutBuffer
PipelineVariable
#>
Get-Help -Name Invoke-WebRequest -Examples
<#
# Results
$R = Invoke-WebRequest -URI
$R.AllElements | where {$_.innerhtml -like "*=*"} | Sort {
values. Sorting by the shortest HTML value often helps you find the
$R=Invoke-WebRequest http://www.facebook.com/login.php
$FB
$Form = $R.Forms[0]
$Form | Format-List
$Form.fields
$Form.Fields["email"]="User01@Fabrikam.com"
$R=Invoke-WebRequest -Uri ("https://www.facebook.com" +
# Sends a sign-in request by running the Invoke-WebRequest
$R.StatusDescription
(Invoke-WebRequest -Uri "http://msdn.microsoft.com/en-us/library
#>
Get-Help -Name Invoke-WebRequest -Full
Get-Help -Name Invoke-WebRequest -Online
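To show how a few of the parameters listed above fit together, here is a minimal sketch that saves a page to disk and reports whether the request succeeded. The function name Save-WebPage, its parameter names, and the 30 second default are illustrative assumptions, not part of the cmdlet. A site that blocks automation would typically land in the catch branch.

```powershell
# Hypothetical wrapper; the name Save-WebPage and its defaults are illustrative only.
function Save-WebPage {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        [Parameter(Mandatory)] [string] $Path,
        [int] $TimeoutSec = 30
    )

    # Splat a handful of the Invoke-WebRequest parameters listed above.
    $request = @{
        Uri             = $Uri
        OutFile         = $Path        # write the response body to this file
        TimeoutSec      = $TimeoutSec
        UseBasicParsing = $true        # do not rely on the Internet Explorer DOM parser
        ErrorAction     = 'Stop'       # make failures catchable
    }

    try {
        # With -OutFile and no -PassThru, the cmdlet writes nothing to the pipeline.
        Invoke-WebRequest @request
        return $true
    }
    catch {
        Write-Warning "Request to $Uri failed: $($_.Exception.Message)"
        return $false
    }
}

# Example call (commented out because it needs network access):
# Save-WebPage -Uri 'https://www.instacart.com' -Path "$env:TEMP\page.html"
```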
So, for the URL you said you want to hit, note that you get the following results...
# Download website main page (note: this is instantcart.com, a different domain from instacart.com used further down)
($InstacartHomeData = Invoke-WebRequest -Uri 'https://www.instantcart.com')
<#
# Results
StatusCode : 200
StatusDescription : OK
Content : <!DOCTYPE html><html lang="en" class="no-js"><head><link rel="alternate"
href="https://www.instantcart.com/" hreflang="en-gb" /><link rel="alternate"
href="https://www.instantcart.com/" hreflang="en" ...
RawContent : HTTP/1.1 200 OK
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
Content-Type: text/...
Forms : {}
Headers : {[Pragma, no-cache], [Vary, Accept-Encoding], [Connection, close], [Transfer-Encoding, chunked],
[Cache-Control, private, no-cache, no-store, proxy-revalidate, no-transform], [Content-Type,
text/html], [Date, Thu, 28 May 2020 04:23:28 GMT], [Expires, Thu, 19 Nov 1981 08:52:00 GMT],
[Set-Cookie, sid=b806f71e100b9f2d4d1037561b53ff65; path=/; domain=www.instantcart.com], [Server,
Apache], [X-Powered-By, PHP/5.5.38]}
Images : {@{innerHTML=; innerText=; outerHTML=<img width="160" class="img-responsive"
...
#>
# Get only images data
$InstacartHomeData.Images | Select-Object alt, src
<#
# Results
alt src
--- ---
/pics/logo.png
Abode Home Products /images/home/clients/abode-home-products.png
Avanta UK /images/home/clients/avanta-uk.png
Q-Park /images/home/clients/qpark.png
...
#>
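The src values in this result are site-relative paths such as /pics/logo.png, so they need to be combined with the site address before they can be downloaded. A minimal sketch of that step follows. The helper name Resolve-ImageUrl is an assumption made for illustration; it relies only on the .NET System.Uri constructor that takes a base URI and a relative string.

```powershell
# Hypothetical helper: turn an image src (relative or absolute) into an absolute URL.
function Resolve-ImageUrl {
    param(
        [Parameter(Mandatory)] [string] $BaseUrl,
        [Parameter(Mandatory)] [string] $Src
    )

    # System.Uri(baseUri, relativeString) resolves relative references against
    # the base and leaves an already absolute $Src unchanged.
    $base = [System.Uri] $BaseUrl
    $resolved = New-Object -TypeName System.Uri -ArgumentList $base, $Src
    return $resolved.AbsoluteUri
}

Resolve-ImageUrl -BaseUrl 'https://www.instantcart.com' -Src '/pics/logo.png'
# → https://www.instantcart.com/pics/logo.png

# Applied to the page object from above (commented out because it needs the live response):
# $InstacartHomeData.Images | ForEach-Object { Resolve-ImageUrl -BaseUrl 'https://www.instantcart.com' -Src $_.src }
```

Each resolved URL could then be passed to Invoke-WebRequest with -OutFile to save the image itself.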
Now, try the same thing against your target page.
# Download the specific product page
($InstacartProductPageData = Invoke-WebRequest -Uri 'https://www.instacart.com/products/98954-poland-spring-natural-spring-water-2-5-gal')
<#
# Results
# Cookies are used to get this
StatusCode : 200
StatusDescription : OK
Content : <!DOCTYPE html>
<html lang='en'>
<head>
<title>
Poland Spring Natural Spring Water (2.5 gal) - Instacart
</title>
<meta content='Buy Poland Spring Natural Spring Water (2.5 gal) online and have it de...
RawContent : HTTP/1.1 200 OK
Transfer-Encoding: chunked
Connection: keep-alive
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Permit...
Forms : {}
Headers : {[Transfer-Encoding, chunked], [Connection, keep-alive], [X-Frame-Options, SAMEORIGIN],
[X-XSS-Protection, 1; mode=block]...}
Images : {@{innerHTML=; innerText=; outerHTML=<img class="rmq-569a8dd6" style="background: rgb(255, 255,
...
Poland Spring 100% Natural Spring Water
2.5 gal; outerHTML=<a style="text-decoration: none;"
href="/products/16965376-poland-spring-100-natural-spring-water-2-5-gal" data-radium="true"><div
class="rmq-cd8b1370 rmq-5e34cd3" style="padding: 0px 16px; width: 208px; height: 100%; text-align:
left; line-height: 1.29; font-size: 14px; display: flex; position: relative; opacity: 1;
flex-direction: column;" data-radium="true"><div class="rmq-24058c4e" style="width: 176px; height:
176px;" data-radium="true"><img style="width: 100%; display: block;" alt="" src="https://d2d8wwwkmh
fcva.cloudfront.net/352x/d1s8987jlndkbs.cloudfront.net/assets/missing-item-4bbe82b8555e4d1c12626fd4
82cb2409713e8e30835645ff3650ef66a725d03c.png" data-radium="true"></div><div style="padding-bottom:
8px; margin-top: auto;" data-radium="true"><div class="rmq-50e196af" style="color: rgb(66, 66,
66); overflow: hidden; margin-top: 20px; -ms-text-overflow: ellipsis; max-height: 55px;"
data-radium="true">Poland Spring 100% Natural Spring Water</div><div style="color: rgb(117, 117,
117);" data-radium="true"><span>2.5 gal</span></div></div></div></a>; outerText=
...
#>
# Get only images data
$InstacartProductPageData.Images | Select-Object alt, src
<#
# Results
alt src
--- ---
Instacart logo https://d2guulkeunn7d8.cloudfront.net/assets/beetstrap/brand/carrotlogo-p...
Poland Spring Natural Spring Water https://d2lnr5mha7bycj.cloudfront.net/product-image/file/large_f44f2f09-b...
Gala Fresh logo https://d2lnr5mha7bycj.cloudfront.net/warehouse/logo/162/0f5c96be-4126-45...
...
#>
Answer 2
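See the code below, which uses Internet Explorer to render the page and then reads the image locations from the document's images property.
Adjust the output file path and the website as needed.
I have not tested whether this gives the same list as the one Firefox shows, but it most likely will.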
$OutputFile = "c:\test\images.txt" # change this to the output file path, ensure it ends with .txt
$WebPage = "https://www.somewebsite.com" # change this to the webpage you want
# Start a hidden Internet Explorer instance through COM
$ieObject = New-Object -ComObject 'InternetExplorer.Application'
$ieObject.Visible = $false
$ieObject.Navigate($WebPage)
# Wait until the page has finished loading (ReadyState 4 = complete)
while ($ieObject.Busy -or $ieObject.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
# Collect the src of every image in the rendered document
$images = $ieObject.Document.images | ForEach-Object { $_.src }
$images | Out-File $OutputFile
$ieObject.Quit()