AI crawlers now represent a meaningful share of bot traffic on the open web. If you manage a content-focused site, you face a practical question: how do you control which AI systems can access your content for training or retrieval, without accidentally blocking the search engine crawlers that make your content discoverable?
The answer requires understanding how different crawlers identify themselves, what robots.txt directives actually control, and where enforcement needs to go beyond polite signalling.
The crawler landscape in 2026
AI-related crawlers fall into three groups that require different handling:
Search engine crawlers (preserve access)
These crawlers index your content for search results:
- Googlebot — Google Search
- Bingbot — Microsoft Bing
- YandexBot — Yandex Search
- Applebot — Apple Search / Siri
Blocking these directly reduces your search visibility. Keep them allowed.
AI training crawlers (block selectively)
These crawlers fetch content for language model training:
- GPTBot — OpenAI (separate from ChatGPT browsing)
- Google-Extended — Google AI training (separate from Googlebot)
- ClaudeBot — Anthropic training
- Bytespider — ByteDance / TikTok
- CCBot — Common Crawl (used by many AI labs)
Blocking these does not affect your search rankings. They are distinct user agents from the corresponding search crawlers.
AI-powered search and retrieval
These crawlers fetch content to generate AI search results or answer queries:
- ChatGPT-User — ChatGPT's browsing feature
- PerplexityBot — Perplexity AI search
- Amazonbot — Amazon Alexa
Blocking these is a business decision: they may send referral traffic, but they also use your content to generate answers that may reduce direct visits.
robots.txt configuration
Block AI training, keep search
# Standard search engines — allowed
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Applebot
Allow: /
# AI training crawlers — blocked
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
# Default — allow
User-agent: *
Allow: /
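Before deploying, a policy like the one above can be sanity-checked offline with Python's standard-library robots.txt parser. The string below mirrors a slice of the configuration above; example.com and the path are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Mirror of the policy above: search crawlers allowed,
# AI training crawlers blocked, everyone else allowed by default.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/some-post"
print(parser.can_fetch("Googlebot", url))  # True: search stays allowed
print(parser.can_fetch("GPTBot", url))     # False: training crawler blocked
```

This catches the classic mistake of a `User-agent: *` group accidentally shadowing a specific crawler's rules before the file goes live.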
Key points
- Google-Extended controls Google's AI training crawler separately from Googlebot. Blocking Google-Extended does not affect Google Search indexing.
- GPTBot is separate from ChatGPT-User. You can block training while allowing ChatGPT browsing, or block both.
- robots.txt is advisory — well-behaved crawlers respect it, but it provides no enforcement mechanism.
The X-Robots-Tag alternative
For more granular control, use HTTP headers:
X-Robots-Tag: googlebot: index, follow
X-Robots-Tag: gptbot: noindex, nofollow
This works per-page or per-content-type and can be set in CDN headers, server configuration, or _headers files.
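To make the header format above concrete, here is a hypothetical parser (not part of any library) that splits the optional user-agent prefix from the directive list, the way a crawler reading the header would:

```python
def parse_x_robots_tag(value: str) -> tuple[str, list[str]]:
    """Split an X-Robots-Tag value into (user_agent, directives).

    A leading token before ':' that is not itself a directive is treated
    as a user-agent prefix; '*' means the header applies to all bots.
    """
    DIRECTIVES = {"all", "noindex", "nofollow", "none", "noarchive",
                  "nosnippet", "notranslate", "noimageindex", "index", "follow"}
    head, _, tail = value.partition(":")
    if tail and head.strip().lower() not in DIRECTIVES:
        agent, rest = head.strip(), tail
    else:
        agent, rest = "*", value
    return agent, [d.strip() for d in rest.split(",")]

print(parse_x_robots_tag("gptbot: noindex, nofollow"))
# ('gptbot', ['noindex', 'nofollow'])
print(parse_x_robots_tag("noindex, noarchive"))
# ('*', ['noindex', 'noarchive'])
```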
Identification and verification
Verifying crawler identity
Any bot can claim to be Googlebot in its user-agent string. Verify legitimate crawlers:
Google: Reverse DNS lookup — legitimate Googlebot resolves to *.googlebot.com or *.google.com
host 66.249.66.1
# Should return: 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
Bing: Resolves to *.search.msn.com
OpenAI/GPTBot: Published IP ranges in OpenAI's documentation
Anthropic/ClaudeBot: Published CIDR ranges
Requests that claim a known user agent but don't originate from the published IP ranges should be treated as suspicious.
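The full check is forward-confirmed reverse DNS: a PTR lookup, a domain suffix check, then a forward lookup to confirm the hostname maps back to the same IP. A minimal sketch, assuming the vendor-documented domain suffixes listed in the dictionary:

```python
import socket

# Verified-crawler hostname suffixes, per each vendor's documentation.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def hostname_matches(hostname: str, crawler: str) -> bool:
    """Pure check: does a PTR hostname end in one of the crawler's domains?"""
    return hostname.lower().rstrip(".").endswith(CRAWLER_DOMAINS[crawler])

def verify_crawler_ip(ip: str, crawler: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check,
    then forward lookup to confirm the hostname resolves back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
        if not hostname_matches(hostname, crawler):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward confirm
    except OSError:  # NXDOMAIN, timeouts, unreachable resolver
        return False

# Requires network access, e.g.:
# verify_crawler_ip("66.249.66.1", "googlebot")
```

The forward confirmation matters: without it, an attacker who controls reverse DNS for their own IP block could set a PTR record ending in googlebot.com.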
Detecting undeclared AI crawlers
Some AI-related crawlers use generic user agents or browser-like strings. Indicators:
- Systematic crawl patterns (following sitemap URLs in order)
- No JavaScript execution
- Consistent request intervals without human-like variation
- Missing browser fingerprint signals (WebGL, canvas, font enumeration)
- High request volume from cloud provider IP ranges (AWS, GCP, Azure)
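One of the cheapest signals above to compute from access logs is timing regularity. This sketch scores inter-request intervals per client with the coefficient of variation; the timestamps and the idea that near-zero variation suggests a scripted fetcher are illustrative assumptions, not a library API:

```python
from statistics import mean, stdev

def interval_regularity(timestamps: list[float]) -> float:
    """Coefficient of variation of inter-request gaps.
    Human browsing is bursty (high value); a crawler polling at a
    fixed rate produces values near zero."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return float("inf")  # too few requests to judge
    return stdev(gaps) / mean(gaps)

bot = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]      # one request every 2 s
human = [0.0, 0.4, 9.1, 9.6, 31.0, 33.2]   # bursts and pauses
print(interval_regularity(bot))    # 0.0
print(interval_regularity(human))  # > 1: irregular, human-like
```

No single signal is conclusive; combine a low score here with volume and the other indicators before blocking.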
Enforcement beyond robots.txt
Since robots.txt is advisory, actual enforcement requires additional layers:
Rate limiting by user agent
At the CDN or WAF level, apply strict rate limits to identified AI crawler user agents:
# Cloudflare WAF rule example
If User-Agent contains "GPTBot" → Rate limit to 2 req/min
If User-Agent contains "ClaudeBot" → Rate limit to 2 req/min
IP-based blocking
For crawlers that ignore robots.txt and don't identify themselves:
- Identify suspicious IPs from access logs (high volume, no JS, systematic patterns)
- Check IP ownership (WHOIS, ASN lookup)
- Block or challenge the IP range at the firewall/WAF level
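The step of checking a claimed crawler against its published ranges is a straightforward CIDR membership test with the standard-library ipaddress module. The CIDRs below are illustrative placeholders, not the vendors' authoritative lists:

```python
import ipaddress

# Illustrative ranges only; fetch the real lists from each
# vendor's published documentation and refresh them regularly.
PUBLISHED_RANGES = {
    "GPTBot": ["20.171.206.0/24", "52.230.152.0/24"],
}

def ip_in_published_ranges(ip: str, crawler: str) -> bool:
    """True if the source IP falls inside the crawler's published CIDRs."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr)
               for cidr in PUBLISHED_RANGES.get(crawler, []))

# A request claiming "GPTBot" from outside these ranges is suspicious:
print(ip_in_published_ranges("20.171.206.17", "GPTBot"))  # True
print(ip_in_published_ranges("203.0.113.9", "GPTBot"))    # False
```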
Challenge pages
Serve a JavaScript challenge (e.g., Cloudflare's managed challenge) to suspected bots. Legitimate browsers pass automatically; simple HTTP fetchers fail and receive a block.
This is effective but adds latency for the first request from new visitors. Use it selectively for paths that are heavily crawled.
Common mistakes
Blocking Googlebot when you mean to block Google-Extended. These are different user agents. Blocking Googlebot removes your site from Google Search.
Assuming robots.txt is enough. It works for well-behaved crawlers but provides zero enforcement against aggressive or undeclared bots.
Blocking all bots aggressively. Some AI-powered search tools (Perplexity, ChatGPT browsing) send traffic to your site. Blocking them means users asking AI about your topics will get answers sourced from your competitors instead.
Not monitoring bot traffic. If you don't know what's crawling your site, you can't make informed decisions about what to block. Review your access logs regularly.
Forgetting about RSS/Atom feeds. AI systems may consume your RSS feed rather than crawling HTML pages. If you want to block AI training access, consider your feed strategy too.
Verification
- Validate robots.txt syntax with Google's robots.txt tester
- Check that Googlebot can still access your pages: use Google Search Console's URL Inspection tool
- Verify AI crawler blocking: curl -A "GPTBot/1.0" -s -o /dev/null -w "%{http_code}" https://yourdomain.com/ — should see a rate-limited or blocked response
- Monitor search indexing in Google Search Console for any unexpected drops
- Review access logs for new AI crawler user agents monthly
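The monthly log review in the last step is easy to script. This sketch tallies user-agent strings from combined-format access log lines; the sample lines and bot list are assumptions for illustration:

```python
import re
from collections import Counter

# Combined log format puts the user agent in the last quoted field.
UA_RE = re.compile(r'"([^"]*)"$')
KNOWN_AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")

log_lines = [
    '1.2.3.4 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '5.6.7.8 - - [01/Jan/2026:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '9.9.9.9 - - [01/Jan/2026:00:00:03 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = Counter()
for line in log_lines:
    m = UA_RE.search(line)
    if m:
        ua = m.group(1)
        label = next((bot for bot in KNOWN_AI_BOTS
                      if bot.lower() in ua.lower()), "other")
        counts[label] += 1

print(counts.most_common())  # [('GPTBot', 2), ('other', 1)]
```

Spikes in the "other" bucket from cloud-provider IPs are where the undeclared-crawler indicators above come into play.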
Related reading on wplus.net
- AI & Bot Traffic Hardening for 2026 — rate limiting and cache-key separation for bot traffic
- Security hub — headers, TLS, and hardening
- Hosting hub — hosting architecture and CDN configuration