I received a list of URLs that Google reports as generating 404 errors.
I can test a single URL with curl (from the command line) like this:
curl -k --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)" https://MYURLHERE
It works exactly as I expect. I want to put this in a script so that I can run through a list of them, and this is what I have.
#!/usr/bin/bash
url=$1
curlcmd="curl -k --user-agent \"Googlebot/2.1 (+http://www.google.com/bot.html)\""
$curlcmd $url
But it doesn't work. I keep getting
curl: (1) Protocol "(+http" not supported or disabled in libcurl
I can't figure out how to escape this and make it work. Any suggestions?
Answer 1
Wrap the variable $1 in quotes, or you can use something like the following:
$ touch $$
$ echo 'http://www.google.com' >> $$
$ echo 'http://www.yahoo.com' >> $$
$ for url in $(cat $$); do curl -I $url ; done
HTTP/1.1 200 OK
Date: Wed, 22 Nov 2017 15:57:19 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2017-11-22-15; expires=Fri, 22-Dec-2017 15:57:19 GMT; path=/; domain=.google.com
Set-Cookie: NID=117=CaOUCOyr9TPjs64tqyz1MuqHsASzL_3eO5n-NE4ubqAikITGbs7QY0aegNByOWX1Vaf9SsUVQDJ1wdaIOZwXoiqfVZ9ISLtta7tvcDH6LFM52OGFKRH4J5Clde2EX8oG; expires=Thu, 24-May-2018 15:57:19 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Age: 0
Transfer-Encoding: chunked
Via: 1.1 localhost.localdomain
HTTP/1.1 200 OK
Date: Wed, 22 Nov 2017 15:57:19 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2017-11-22-15; expires=Fri, 22-Dec-2017 15:57:19 GMT; path=/; domain=.google.com
Set-Cookie: NID=117=VRrA0-bCESlSCoerEK0n1hxXfldwpQI4cisiKrEgnKVph9HkfQJu-tbur3ZBiLh3-RFKZ0kbWUWsBwJKzsi_aPUuJzztM1rCuDfljZLxqjaHanZxiCx7qch4P2WCoDDC; expires=Thu, 24-May-2018 15:57:19 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding
Age: 0
Transfer-Encoding: chunked
Via: 1.1 localhost.localdomain
HTTP/1.1 200 OK
Date: Wed, 22 Nov 2017 15:57:19 GMT
Via: http/1.1 media-router-fp56.prod.media.ne1.yahoo.com (ApacheTrafficServer [c s f ]), 1.1 localhost.localdomain
Server: ATS
Cache-Control: no-store, no-cache, max-age=0, private
Content-Type: text/html
Content-Language: en
Expires: -1
X-Frame-Options: SAMEORIGIN
Content-Length: 12
Age: 0
$
Answer 2
You can modify it like this:
#!/usr/bin/bash
url="$1"
curlcmd='curl -k --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)"'
$curlcmd "$url"
The message you received indicates that http (the default) is not supported. Use https instead:
./test.sh https://www.somepage.com
Answer 3
To query a list of URLs given as command-line arguments:
#!/bin/sh
USER_AGENT="Googlebot/2.1 (+http://www.google.com/bot.html)"
curl_with_ua(){
curl -k --user-agent "$USER_AGENT" "$1"
}
for url in "$@"; do
curl_with_ua "$url"
done
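Since the original goal was to check a list of URLs that report 404, a variation on the loop above could read the URLs from a file and print only the HTTP status code for each one. This is a rough sketch rather than part of the answer: the function name check_urls and the file name urls.txt are assumptions made for illustration. It relies on curl's -s, -o and -w '%{http_code}' options.

```bash
#!/bin/sh
USER_AGENT="Googlebot/2.1 (+http://www.google.com/bot.html)"

# check_urls (hypothetical name): reads one URL per line from stdin and
# prints "<status> <url>" for every non-empty line.
check_urls() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        # -s: no progress output, -o /dev/null: discard the body,
        # -w '%{http_code}': print only the HTTP status code.
        # </dev/null keeps curl from consuming the URL list on stdin.
        status=$(curl -k -s -o /dev/null -w '%{http_code}' \
            --user-agent "$USER_AGENT" "$url" </dev/null)
        printf '%s %s\n' "$status" "$url"
    done
}

# Usage (urls.txt is an assumed file name, one URL per line):
#   check_urls < urls.txt
```

Reading with while/read instead of for url in $(cat file) avoids word splitting and globbing on the URLs themselves.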
Answer 4
I can't reproduce your problem... I just get this meaningless-sounding warning, and otherwise it works:
curl: (3) URL rejected: Port number was not a decimal number between 0 and 65535
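Although that error is hard to understand, this really is the wrong way to do things. Running a command from an unquoted variable is almost always a bad idea. And simply adding quotes fails too: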
url=https://startpage.com
curlcmd="curl -k --user-agent \"Googlebot/2.1 (+http://www.google.com/bot.html)\""
"$curlcmd" "$url"
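Because the quotes are misused above, the error looks like this: the whole command string, spaces included, is treated as the name of a single command with no arguments: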
bash: curl -k --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)": No such file or directory
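The likely cause of the problem is the space between "2.1" and "(". With quotes around the variable, all the spaces are kept and everything is merged into one big meaningless argument. Without quotes, the words are split apart, but the space after "2.1" also splits the user agent into two arguments, because the quote characters stored inside the variable are treated as literal text and not as shell quoting.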
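To see the splitting for yourself, here is a small sketch that prints each argument a command would receive, without running curl at all. The helper name show_args is made up for this illustration.

```bash
#!/usr/bin/bash
# show_args (hypothetical helper): prints how many arguments it received,
# then each argument on its own line wrapped in <>.
show_args() {
    printf '%d args\n' "$#"
    printf '<%s>\n' "$@"
}

curlcmd="curl -k --user-agent \"Googlebot/2.1 (+http://www.google.com/bot.html)\""

# Unquoted: the shell splits on whitespace; the embedded quotes stay literal.
show_args $curlcmd

# Quoted: the whole string becomes one single argument.
show_args "$curlcmd"
```

The unquoted call reports 5 arguments, ending with <"Googlebot/2.1> and <(+http://www.google.com/bot.html)">. curl therefore takes "Googlebot/2.1 as the user agent and the last word as a URL, which matches the Protocol "(+http" error in the question. The quoted call reports 1 argument.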
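Alternatively, you could use eval, which would make some form of escaping for that space actually work as you'd logically expect... but I don't recommend using eval.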
I like to use a function for this kind of thing. (This probably works in sh too.)
#!/usr/bin/bash
url="$1"
mycurl() {
curl -k --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)" "$@"
}
mycurl "$url"
An array works too (but not in sh... don't be fooled by systems where sh is really bash --posix):
#!/usr/bin/bash
url="$1"
curlcmd=(curl -k --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)")
# an example of the main reason I generally use an array... you can modify the command dynamically
if [ "$DEBUG" = 1 ]; then
curlcmd+=(-v)
fi
# and this runs it... fully protected variables in quotes, but also easy, unlike all the other escaping methods
"${curlcmd[@]}" "$url"