Blocking the yandex.ru bot

Question 1

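Don't believe what you read on forums about this! Believe what your server logs tell you. If Yandex obeyed robots.txt, you would see the evidence in the logs. I have seen with my own eyes that the Yandex robots do not even read the robots.txt file!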
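One way to check this claim against your own logs is sketched below. It assumes Apache's combined log format; the function name `robots_txt_report`, the sample lines, and the `ExampleYandexAgent` user-agent string are made up for illustration and are not real crawler data.

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_report(lines, needle="yandex"):
    """Return (total requests, /robots.txt requests) for agents containing `needle`."""
    total = robots = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or needle not in m.group("agent").lower():
            continue
        total += 1
        # Ignore any query string when comparing the path.
        if m.group("path").split("?")[0] == "/robots.txt":
            robots += 1
    return total, robots

# Hypothetical log lines (documentation IP range, invented user agents).
sample = [
    '192.0.2.10 - - [31/Mar/2010:10:00:00 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleYandexAgent/1.0"',
    '192.0.2.10 - - [31/Mar/2010:14:00:00 -0500] "GET /docs/ HTTP/1.1" 200 5120 "-" "ExampleYandexAgent/1.0"',
    '192.0.2.99 - - [31/Mar/2010:14:05:00 -0500] "GET /docs/ HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

total, robots = robots_txt_report(sample)
print(f"{total} matching requests, {robots} for /robots.txt")
```

If the second number stays at zero over a long stretch of real logs, the bot is not fetching robots.txt at all.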

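Quit wasting your time on long IP lists, which only serve to slow your site down drastically.

Enter the following lines in .htaccess (in the root folder of each of your sites):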

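# Flag requests whose User-Agent starts with "Yandex" (case-insensitive); remove the ^ to match it anywhere.
SetEnvIfNoCase User-Agent "^Yandex" bad_bot
# Apache 2.2 access syntax; on Apache 2.4 these two directives need mod_access_compat.
Order Deny,Allow
Deny from env=bad_bot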

I did, and all Yandex gets now are 403 Access Denied errors.

Goodbye, Yandex!

Question 2

I'm too young (in reputation) to post all the URLs I need as hyperlinks, so please pardon my parenthesized URLs.

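The forum link from Dan Andreatta, and this other one, have some but not all of what you need. You'll want to use their method of finding the IP numbers, and script something to keep your lists fresh. Then you want something like this, which shows you some known values, including the subdomain naming schemes they've been using. Keep a crontabbed eye on their IP ranges, and maybe automate something to estimate a reasonable CIDR (I didn't find any mention of their actual allocation; it could just be a Google fail on my part).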
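As a rough sketch of the "estimate a reasonable CIDR" idea, the snippet below widens a network until it covers every address you have observed. The helper name `covering_network` is hypothetical and the addresses are documentation-only examples, not Yandex's. A single covering block can be far larger than the real allocation, so treat the result as a starting point to check against whois rather than as the answer.

```python
import ipaddress

def covering_network(addresses):
    """Return the smallest single IPv4 network that contains every given address."""
    ips = sorted(ipaddress.IPv4Address(a) for a in addresses)
    # Start from the lowest address as a /32 and widen one bit at a time
    # until the highest address also falls inside the network.
    net = ipaddress.ip_network(f"{ips[0]}/32")
    while ips[-1] not in net:
        net = net.supernet()
    return net

# Hypothetical addresses pulled from a log (RFC 5737 documentation range).
observed = ["198.51.100.17", "198.51.100.42", "198.51.100.200"]
print(covering_network(observed))
```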

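Find their IP range(s) as accurately as possible, so you don't have to waste time doing a reverse DNS lookup while users are waiting for (http://yourdomain/notpornipromise), and instead are only doing a comparison match or something similar. Google just showed me grepcidr, which looks highly relevant. From the linked page: "grepcidr can be used to filter a list of IP addresses against one or more Classless Inter-Domain Routing (CIDR) specifications, or arbitrary networks specified by an address range." I suppose it's nice that it's purpose-built code with known I/O, but you know you can reproduce that function a billion different ways.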
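One of those billion ways, as a minimal sketch: Python's standard `ipaddress` module can do the comparison match with no DNS lookup involved. The `BLOCKED` list and the `is_blocked` helper are assumptions for illustration, and the ranges are documentation-only placeholders, not real crawler allocations.

```python
import ipaddress

# Hypothetical blocklist; replace with the ranges you have actually verified.
BLOCKED = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.64/26")]

def is_blocked(addr, networks=BLOCKED):
    """True if `addr` falls inside any of the given CIDR networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)

for candidate in ("192.0.2.55", "198.51.100.70", "198.51.100.10"):
    print(candidate, is_blocked(candidate))
```

With a handful of ranges a linear scan like this is cheap; a long list is where the cost the first answer complains about starts to show.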

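The most "general solution" I can think of, and actually wish to share (speaking things into existence and all that), is for you to start building a database of such offenders at your location(s), and to spend some off-hours thinking about and researching ways to defend against and counter the behavior. This takes you deeper into intrusion detection, pattern analysis, and honeynets than the scope of this specific question truly warrants. However, within the scope of that research are countless answers to the question you've asked.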

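I found this because of Yandex's interesting behavior on one of my own sites. I wouldn't call what I see in my own log abusive, but spider50.yandex.ru accounted for 2% of my visit count and 1% of my bandwidth... I can see where the bot would be truly abusive to large files, forums and the like, neither of which is available for abuse on the server I'm looking at today. What was interesting enough to warrant investigation was the bot looking at /robots.txt, then waiting 4 to 9 hours and asking for a /directory/ not listed in it, then waiting another 4 to 9 hours and asking for /another_directory/, then maybe a few more, and then /robots.txt again, repeating indefinitely. So far as frequency goes, I suppose they're well behaved enough, and the spider50.yandex.ru machine appeared to respect /robots.txt.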
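To measure that kind of pacing yourself, a small sketch like the following turns one host's log timestamps into the gaps between consecutive requests. It assumes Apache-style timestamps and an English locale for month names; `request_gaps` and the sample times are hypothetical.

```python
from datetime import datetime

def request_gaps(timestamps):
    """Hours between consecutive requests, given Apache-style timestamps."""
    times = sorted(datetime.strptime(t, "%d/%b/%Y:%H:%M:%S %z") for t in timestamps)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]

# Hypothetical timestamps for a single crawler host.
seen = [
    "31/Mar/2010:01:00:00 -0500",
    "31/Mar/2010:06:30:00 -0500",
    "31/Mar/2010:14:30:00 -0500",
]
print(request_gaps(seen))
```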

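I'm not planning to block them from this server today, but I would if I shared Ross' experience.

For reference, here are the tiny numbers we're dealing with in my server's case today: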

Top 10 of 1315 Total Sites By KBytes
 #   Hits          Files         KBytes         Visits        Hostname
 1   247  1.20%    247  1.26%    1990  1.64%     4  0.19%     ip98-169-142-12.dc.dc.cox.net
 2   141  0.69%    140  0.72%    1873  1.54%     1  0.05%     178.160.129.173
 3   142  0.69%    140  0.72%    1352  1.11%     1  0.05%     162.136.192.1
 4    85  0.41%     59  0.30%    1145  0.94%    46  2.19%     spider50.yandex.ru
 5   231  1.12%    192  0.98%    1105  0.91%     4  0.19%     cpe-69-135-214-191.woh.res.rr.com
 6    16  0.08%     16  0.08%    1066  0.88%    11  0.52%     rate-limited-proxy-72-14-199-198.google.com
 7    63  0.31%     50  0.26%    1017  0.84%    25  1.19%     b3090791.crawl.yahoo.net
 8   144  0.70%    143  0.73%     941  0.77%     1  0.05%     user10.hcc-care.com
 9    70  0.34%     70  0.36%     938  0.77%     1  0.05%     cpe-075-177-135-148.nc.res.rr.com
10   205  1.00%    203  1.04%     920  0.76%     3  0.14%     92.red-83-54-7.dynamicip.rima-tde.net

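That's on a shared host that doesn't even bother capping bandwidth anymore, and if the crawl took some DDoS-like form, they would probably notice and block it before I would. So I'm not angry about that. In fact, I much prefer having the data they write into my logs to play with.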

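Ross, if you really are angry about the 2GB/day you're losing to Yandex, you might spam-poison them. That's what it's there for! Reroute them away from whatever you don't want them downloading, either with an HTTP 301 straight to a spam-poison subdomain, or roll your own so you can control the logic and have more fun with it. That sort of solution gives you a tool you can reuse later, when it's even more necessary.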
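If you go the roll-your-own route, the logic can be as small as the WSGI wrapper sketched here. This is an illustration under assumptions, not a drop-in: `redirect_bots`, the `site` app, the `call` helper, and the `trap.example.com` target are all hypothetical names, and matching on a user-agent substring is only as reliable as the agent string itself.

```python
def redirect_bots(app, needle="yandex", target="http://trap.example.com/"):
    """Wrap a WSGI app: answer matching user agents with a 301 to `target`."""
    def wrapped(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if needle in agent:
            start_response("301 Moved Permanently",
                           [("Location", target), ("Content-Length", "0")])
            return [b""]
        # Everyone else gets the real site.
        return app(environ, start_response)
    return wrapped

def site(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

def call(app, agent):
    """Invoke a WSGI app with a minimal environ; return (status, headers, body)."""
    captured = {}
    def start_response(status, headers):
        captured["status"], captured["headers"] = status, dict(headers)
    body = b"".join(app({"HTTP_USER_AGENT": agent}, start_response))
    return captured["status"], captured["headers"], body

guarded = redirect_bots(site)
print(call(guarded, "ExampleYandexAgent/1.0")[0])
print(call(guarded, "ExampleBrowser/1.0")[0])
```

Because the decision lives in one function, you can later swap the substring test for an IP-range check or add logging without touching the site itself.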

Then start looking deeper in your logs for funny ones like this:

217.41.13.233 - - [31/Mar/2010:23:33:52 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:54 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:33:58 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:00 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:01 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:03 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:04 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:05 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:06 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:09 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:14 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:16 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:17 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:18 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:21 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:23 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:24 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:26 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:27 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"
217.41.13.233 - - [31/Mar/2010:23:34:28 -0500] "GET /user/ HTTP/1.1" 404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)"

Hint: no /user/ directory exists on the server, nor does any hyperlink to one.
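Bursts like the one above are easy to surface automatically. The sketch below counts 404 responses per (host, path) pair and reports the pairs that cross a threshold. The function name `repeated_404s` and the threshold of 5 are arbitrary choices for illustration; the sample lines are modeled on the excerpt above with a shortened user-agent string, plus one invented line from a documentation IP.

```python
import re
from collections import Counter

# Only the fields needed here: host, request path, and status code.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def repeated_404s(lines, threshold=5):
    """Map (host, path) to its 404 count, keeping pairs seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("status") == "404":
            counts[(m.group("host"), m.group("path"))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}

sample = [
    f'217.41.13.233 - - [31/Mar/2010:23:34:{s:02d} -0500] "GET /user/ HTTP/1.1" '
    f'404 15088 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0)"'
    for s in (0, 1, 3, 4, 5, 6)
] + [
    '192.0.2.7 - - [31/Mar/2010:23:35:00 -0500] "GET /missing HTTP/1.1" 404 512 "-" "ExampleBrowser/1.0"',
]

print(repeated_404s(sample))
```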

Answer