AI crawlers now represent a meaningful share of bot traffic on the open web. If you manage a content-focused site, you face a practical question: how do you control which AI systems can access your content for training or retrieval, without accidentally blocking the search engine crawlers that make your content discoverable?

The answer requires understanding how different crawlers identify themselves, what robots.txt directives actually control, and where enforcement needs to go beyond polite signalling.

The crawler landscape in 2026

AI-related crawlers fall into three groups that require different handling:

Search engine crawlers (preserve access)

These crawlers index your content for search results:

  • Googlebot — Google Search
  • Bingbot — Microsoft Bing
  • YandexBot — Yandex Search
  • Applebot — Apple Search / Siri

Blocking these directly reduces your search visibility. Keep them allowed.

AI training crawlers (block selectively)

These crawlers fetch content for language model training:

  • GPTBot — OpenAI (separate from ChatGPT browsing)
  • Google-Extended — Google AI training (separate from Googlebot)
  • ClaudeBot — Anthropic training
  • Bytespider — ByteDance / TikTok
  • CCBot — Common Crawl (used by many AI labs)

Blocking these does not affect your search rankings. They are distinct user agents from the corresponding search crawlers.

AI-powered search and retrieval

These crawlers fetch content to generate AI search results or answer queries:

  • ChatGPT-User — ChatGPT's browsing feature
  • PerplexityBot — Perplexity AI search
  • Amazonbot — Amazon Alexa

Blocking these is a business decision: they may send referral traffic, but they also use your content to generate answers that may reduce direct visits.

robots.txt configuration

Block AI training, keep search

# Standard search engines — allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

# AI training crawlers — blocked
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# Default — allow
User-agent: *
Allow: /

Key points

  • Google-Extended controls Google's AI training crawler separately from Googlebot. Blocking Google-Extended does not affect Google Search indexing.
  • GPTBot is separate from ChatGPT-User. You can block training while allowing ChatGPT browsing, or block both.
  • robots.txt is advisory — well-behaved crawlers respect it, but it provides no enforcement mechanism.
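The GPTBot / ChatGPT-User distinction maps directly onto robots.txt. As a sketch, a site that opts out of training but stays reachable by ChatGPT's browsing feature would combine:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```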

The X-Robots-Tag alternative

For more granular control, use HTTP headers:

X-Robots-Tag: googlebot: index, follow
X-Robots-Tag: gptbot: noindex, nofollow

This works per-page or per-content-type and can be set in CDN headers, server configuration, or _headers files.
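As a sketch of the server-configuration route, the header can be attached in nginx. The domain and policy are illustrative; the user-agent-prefixed value follows Google's documented X-Robots-Tag syntax, and other crawlers' support for that prefix form varies:

```nginx
# Attach a crawler-specific X-Robots-Tag to every response
server {
    server_name example.com;   # illustrative domain

    # advisory signal aimed at OpenAI's training crawler
    add_header X-Robots-Tag "gptbot: noindex, nofollow" always;
}
```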

Identification and verification

Verifying crawler identity

Any bot can claim to be Googlebot in its user-agent string. Verify legitimate crawlers:

Google: Reverse DNS lookup — legitimate Googlebot resolves to *.googlebot.com or *.google.com. Forward-confirm the result, since a PTR record alone can be forged: resolve the returned hostname and check that it points back to the original IP.

host 66.249.66.1
# Should return: 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
host crawl-66-249-66-1.googlebot.com
# Should return the original IP: 66.249.66.1

Bing: Resolves to *.search.msn.com

OpenAI/GPTBot: Published IP ranges in OpenAI's documentation

Anthropic/ClaudeBot: Published CIDR ranges

Requests that claim a known user agent but don't originate from the published IP ranges should be treated as suspicious.
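The checks above can be sketched as a small forward-confirmed reverse DNS routine. This is an illustration, not vendor code: the suffix table is an assumption to replace with each operator's documented domains.

```python
# Sketch: forward-confirmed reverse DNS for crawler verification.
# The suffix table below is an ASSUMPTION for illustration; use the
# domains each vendor actually documents.
import socket

ALLOWED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def hostname_matches(hostname: str, bot: str) -> bool:
    """Pure check: does the PTR hostname end in an allowed suffix?"""
    return hostname.rstrip(".").endswith(ALLOWED_SUFFIXES[bot])

def verify_crawler(ip: str, bot: str) -> bool:
    """Reverse lookup, suffix check, then forward-confirm that the
    hostname resolves back to the original IP (a PTR record alone
    can be forged by whoever controls the reverse zone)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_matches(hostname, bot):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips
```

Note that verify_crawler performs two live DNS lookups, so cache its results rather than resolving on every request.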

Detecting undeclared AI crawlers

Some AI-related crawlers use generic user agents or browser-like strings. Indicators:

  • Systematic crawl patterns (following sitemap URLs in order)
  • No JavaScript execution
  • Consistent request intervals without human-like variation
  • Missing browser fingerprint signals (WebGL, canvas, font enumeration)
  • High request volume from cloud provider IP ranges (AWS, GCP, Azure)
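The timing indicator lends itself to a simple log-analysis sketch. Assumptions: access-log lines have already been parsed into (ip, unix_timestamp) pairs, and the thresholds are starting points to tune against your own traffic.

```python
# Sketch: flag IPs whose request timing is suspiciously regular.
from collections import defaultdict
from statistics import pstdev

def suspicious_ips(events, min_requests=100, max_jitter=0.5):
    """Return IPs with high volume and near-constant inter-request
    gaps. max_jitter is the largest allowed standard deviation of the
    gaps (seconds); human browsing varies far more than this."""
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)
    flagged = []
    for ip, times in by_ip.items():
        if len(times) < min_requests:
            continue
        times.sort()
        gaps = [b - a for a, b in zip(times, times[1:])]
        if pstdev(gaps) <= max_jitter:
            flagged.append(ip)
    return flagged
```

Combine this with the other signals before blocking — an uptime monitor also requests at fixed intervals.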

Enforcement beyond robots.txt

Since robots.txt is advisory, actual enforcement requires additional layers:

Rate limiting by user agent

At the CDN or WAF level, apply strict rate limits to identified AI crawler user agents:

# Cloudflare WAF rule example
If User-Agent contains "GPTBot" → Rate limit to 2 req/min
If User-Agent contains "ClaudeBot" → Rate limit to 2 req/min
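Outside Cloudflare, the same policy can be sketched in plain nginx. The zone size, rate, and matched names are illustrative:

```nginx
# Requests whose user agent matches get a per-IP key; everything else
# maps to "" and is exempt from the limit (empty keys are not counted).
map $http_user_agent $ai_crawler {
    default       "";
    ~*GPTBot      $binary_remote_addr;
    ~*ClaudeBot   $binary_remote_addr;
}

limit_req_zone $ai_crawler zone=ai_bots:10m rate=2r/m;

server {
    location / {
        limit_req zone=ai_bots burst=5 nodelay;
    }
}
```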

IP-based blocking

For crawlers that ignore robots.txt and don't identify themselves:

  1. Identify suspicious IPs from access logs (high volume, no JS, systematic patterns)
  2. Check IP ownership (WHOIS, ASN lookup)
  3. Block or challenge the IP range at the firewall/WAF level
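Cross-checking a claimed crawler against published ranges (the flip side of step 2) takes a few lines of standard library. The CIDR blocks below are placeholders drawn from the TEST-NET documentation ranges, not any vendor's real list:

```python
# Sketch: is a client IP inside a crawler's published CIDR ranges?
# The ranges below are PLACEHOLDERS; fetch the real lists from each
# vendor's documentation and refresh them periodically.
from ipaddress import ip_address, ip_network

PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/24"],  # placeholder values
}

def in_published_range(ip: str, bot: str) -> bool:
    addr = ip_address(ip)
    return any(addr in ip_network(cidr) for cidr in PUBLISHED_RANGES[bot])
```

A request that advertises GPTBot in its user agent but fails this check is exactly the suspicious case described earlier, and a safe candidate for step 3.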

Challenge pages

Serve a JavaScript challenge (e.g., Cloudflare's managed challenge) to suspected bots. Legitimate browsers pass automatically; simple HTTP fetchers fail and receive a block.

This is effective but adds latency for the first request from new visitors. Use it selectively for paths that are heavily crawled.

Common mistakes

Blocking Googlebot when you mean to block Google-Extended. These are different user agents. Blocking Googlebot removes your site from Google Search.

Assuming robots.txt is enough. It works for well-behaved crawlers but provides zero enforcement against aggressive or undeclared bots.

Blocking all bots aggressively. Some AI-powered search tools (Perplexity, ChatGPT browsing) send traffic to your site. Blocking them means users asking AI about your topics will get answers sourced from your competitors instead.

Not monitoring bot traffic. If you don't know what's crawling your site, you can't make informed decisions about what to block. Review your access logs regularly.

Forgetting about RSS/Atom feeds. AI systems may consume your RSS feed rather than crawling HTML pages. If you want to block AI training access, consider your feed strategy too.

Verification

  1. Validate robots.txt syntax with Google Search Console's robots.txt report (the old standalone robots.txt Tester has been retired)
  2. Check that Googlebot can still access your pages: use Google Search Console's URL Inspection tool
  3. Verify AI crawler blocking: curl -A "GPTBot/1.0" -s -o /dev/null -w "%{http_code}" https://yourdomain.com/ — expect 403, 429, or a challenge status rather than 200 (any rule matching the "GPTBot" substring will catch this request)
  4. Monitor search indexing in Google Search Console for any unexpected drops
  5. Review access logs for new AI crawler user agents monthly
