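Answer 1
How do paywalled sites get their pages into Google?
First, Googlebot indexes the whole web. Google wants to index every site, paywalled sites included. My completely insignificant personal site gets indexed by Google all the time.
Google can only index what a site allows it to see. It does not try to bypass security or access files that were not offered voluntarily.
If a site serves Google a paywall, Google indexes the paywall and stops there, because that is all that is available. There are various HTML tags that indicate whether content should be cached, and Google may respect them.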
https://stackoverflow.com/questions/1341089/using-meta-tags-to-turn-off-caching-in-all-browsers
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
Every bot (Google included) downloads robots.txt from each site for further instructions on how to behave.
Let's look at nytimes: robots.txt
User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /reuters/
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
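To see how a crawler might apply rules like these, here is a minimal sketch using Python's standard urllib.robotparser. It parses a short excerpt of the file above and asks whether a bot identifying itself as Googlebot may fetch a few paths. The example paths and the helper name googlebot_may_fetch are made up for illustration, and this is only an approximation of what Googlebot itself does.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules quoted above. Lines containing * wildcards are left
# out because urllib.robotparser compares path prefixes literally.
NYTIMES_RULES = """\
User-agent: *
Allow: /ads/public/
Disallow: /ads/
Disallow: /archives/
Disallow: /paidcontent/
Disallow: /register
"""

nyt_parser = RobotFileParser()
nyt_parser.parse(NYTIMES_RULES.splitlines())


def googlebot_may_fetch(path):
    """Return True if the excerpted rules let Googlebot crawl this path."""
    return nyt_parser.can_fetch("Googlebot", "http://www.nytimes.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
for path in ("/ads/public/banner.html", "/ads/tracker.js",
             "/paidcontent/story.html", "/2016/01/01/story.html"):
    print(path, googlebot_may_fetch(path))
```

Paths that match no rule are allowed by default, which is why an ordinary article URL stays crawlable while /paidcontent/ does not.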
Now let's look at tnooz: robots.txt
User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10
Their file contains no Disallow rules, only a crawl delay for the four bots it names.
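A small sketch of how the tnooz file reads to a parser, again with Python's standard urllib.robotparser (Python 3.6 or later for crawl_delay). The variable names and the example.com URL are placeholders, not anything taken from tnooz.

```python
from urllib.robotparser import RobotFileParser

# The tnooz rules exactly as quoted above.
TNOOZ_RULES = """\
User-agent: msnbot
User-agent: AhrefsBot
User-agent: bingbot
User-agent: YandexBot
Crawl-delay: 10
"""

tnooz_parser = RobotFileParser()
tnooz_parser.parse(TNOOZ_RULES.splitlines())

# The delay applies only to the bots named in the group.
bing_delay = tnooz_parser.crawl_delay("bingbot")
google_delay = tnooz_parser.crawl_delay("Googlebot")

# With no Disallow lines, any path is fetchable (placeholder URL).
google_allowed = tnooz_parser.can_fetch("Googlebot", "http://example.com/any/page")

print(bing_delay, google_delay, google_allowed)
```

Under this parser, the named bots get a delay of 10 seconds, Googlebot gets no delay entry at all, and nothing is blocked for anyone.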
qz.com has only a few restrictions:
# If you are regularly crawling WordPress.com sites, please use our firehose to receive real-time push updates instead.
# Please see https://developer.wordpress.com/docs/firehose/ for more details.
Sitemap: https://qz.com/news-sitemap.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Sitemap archive
Sitemap: https://qz.com/sitemap.xml
Disallow: /wp-login.php
Disallow: /activate/ # har har
Disallow: /cgi-bin/ # MT refugees
Disallow: /mshots/v1/
Disallow: /next/
Disallow: /public.api/
User-agent: IRLbot
Crawl-delay: 3600
Some sites serve Googlebot a sample or partial article, and Google caches the portion it was given.
Source (for the quote below): https://yoast.com/ultimate-guide-robots-txt/
If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to index that page and find the noindex tag, so the page should not be blocked by robots.txt.
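As a rough illustration of the noindex tag mentioned in the quote, the sketch below uses Python's standard html.parser to check whether a page carries a meta robots tag with a noindex directive. The class name, the helper has_noindex, and both sample pages are invented for this example. A real crawler would also need to fetch the page first, which is why the page must not be blocked in robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    """Return True if the HTML asks search engines not to index the page."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Made-up sample pages.
PAGE_HIDDEN = ('<html><head><meta name="robots" content="noindex, follow">'
               '</head><body>Subscribers only</body></html>')
PAGE_OPEN = ('<html><head><meta http-equiv="pragma" content="no-cache" />'
             '</head><body>Free article</body></html>')

print(has_noindex(PAGE_HIDDEN), has_noindex(PAGE_OPEN))
```

Note that the cache-related http-equiv tags shown earlier do not count as a noindex instruction in this check; only a meta tag named robots does.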