AI crawler traffic has become a first-order concern for hosting operators. Through 2025 and into 2026, Cloudflare Radar data shows AI-associated bot traffic climbing steadily as a proportion of total requests, with some sites reporting 30–60% of bandwidth consumed by large-language-model training crawlers, retrieval-augmented-generation fetchers, and AI-powered search previews.

For small and mid-size hosting operations, this traffic creates real problems: inflated bandwidth bills, CDN cache pollution, rate-limit exhaustion, and degraded response times for human visitors. This guide covers practical approaches to identifying, managing, and rate-limiting AI bot traffic without breaking legitimate access or search engine visibility.

The traffic landscape in 2026

AI bot traffic falls into several categories, each with different behavioural patterns:

Training crawlers

Large-scale crawlers that fetch content for model training datasets. Characteristics:

  • High request volume, often from a small number of IP ranges
  • Tend to follow sitemaps and crawl exhaustively
  • Many now identify themselves via User-Agent (GPTBot, ClaudeBot, Google-Extended, etc.)
  • Some do not respect robots.txt or use misleading user agents

RAG fetchers

Retrieval-augmented generation systems that fetch pages on demand in response to user queries. Characteristics:

  • Lower per-session volume but spread across many concurrent sessions
  • Request patterns resemble search-engine fetches but may not cache results
  • Timing is bursty — sudden spikes when a topic trends
  • Often use generic HTTP client user agents

AI search previews

Search engines with AI-generated summaries that pre-fetch and process pages. Characteristics:

  • Similar to traditional search crawlers but with higher processing overhead per page
  • Generally well-identified via user agent
  • Respect robots.txt more reliably than independent crawlers

Undeclared bots

The hardest category: bots that use browser-like user agents, rotate IPs, and attempt to appear as human traffic. Distinguishing these from real users requires behavioural analysis rather than user-agent matching.

Identification strategies

User-agent filtering

The first line of defence. Known AI crawlers and their user-agent strings as of early 2026:

Bot               User-Agent contains   Purpose
GPTBot            GPTBot                OpenAI training/browsing
ChatGPT-User      ChatGPT-User          ChatGPT browsing feature
ClaudeBot         ClaudeBot             Anthropic training
Google-Extended   (robots.txt token)    Google AI training opt-out; appears in robots.txt, not in request headers
Bytespider        Bytespider            ByteDance training
CCBot             CCBot                 Common Crawl
PerplexityBot     PerplexityBot         Perplexity AI search
Amazonbot         Amazonbot             Amazon Alexa AI

This list will evolve. Maintain it as a living configuration rather than a static hardcoded list.
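One way to keep the list living is to drive user-agent matching from a config file instead of hardcoded strings. A minimal sketch, assuming a JSON file (the filename `ai_bots.json` and the category labels are illustrative):

```python
# Sketch: user-agent matching driven by an external config file.
# The signature dictionary maps a UA substring to a bot category.
import json

def load_bot_signatures(path="ai_bots.json"):
    # ai_bots.json would hold e.g. {"GPTBot": "training", "CCBot": "training"}
    with open(path) as f:
        return json.load(f)

def classify_user_agent(ua, signatures):
    """Return (bot_name, category) if the UA contains a known signature, else None."""
    for name, category in signatures.items():
        if name.lower() in ua.lower():
            return name, category
    return None

sigs = {"GPTBot": "training", "ClaudeBot": "training", "PerplexityBot": "search"}
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.2)", sigs))
# → ('GPTBot', 'training')
```

Reloading the signature file on a timer or on SIGHUP keeps the deployed list in sync without a redeploy.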

IP range verification

Legitimate major crawlers publish their IP ranges:

  • Google: machine-readable JSON lists of crawler IP ranges in its Search documentation; reverse-DNS verification against googlebot.com/google.com also works
  • OpenAI: published in their documentation
  • Anthropic: published CIDR ranges

Verify that requests claiming to be from a known bot actually originate from that bot's published IP ranges. Spoofed user agents from random IPs are common.
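The check itself is straightforward with the standard library. A sketch using Python's ipaddress module; the CIDR ranges below are documentation placeholders, not any vendor's real ranges:

```python
# Sketch: verify a claimed bot's source IP against published CIDR ranges.
# These ranges are RFC 5737 placeholders; substitute each vendor's real lists.
import ipaddress

PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],        # placeholder, not OpenAI's actual ranges
    "ClaudeBot": ["198.51.100.0/24"],  # placeholder
}

def verify_bot_ip(bot_name, ip):
    """True only if the IP falls inside the bot's published ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net)
               for net in PUBLISHED_RANGES.get(bot_name, []))

print(verify_bot_ip("GPTBot", "192.0.2.10"))   # inside the published range
print(verify_bot_ip("GPTBot", "203.0.113.5"))  # spoofed UA from elsewhere
```

A request whose user agent claims GPTBot but fails this check should fall into a stricter tier, not be trusted as a verified crawler.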

Behavioural signals

For undeclared bots, look for:

  • Request patterns that systematically spider sitemap URLs in order
  • No JavaScript execution (check via challenge pages or analytics)
  • Consistent sub-second request intervals
  • No cookie/session state across requests
  • Missing or unusual Accept-Language, Accept-Encoding headers
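These signals combine naturally into a simple additive score. A sketch with illustrative, untuned weights and thresholds; the request-metadata field names are assumptions about what your log pipeline extracts:

```python
# Sketch: additive suspicion score over the behavioural signals above.
# Weights and thresholds are illustrative, not tuned values.
def bot_suspicion_score(req):
    """req: dict of per-client request metadata. Higher score = more bot-like."""
    score = 0
    if req.get("avg_interval_ms", 10_000) < 1000:   # consistent sub-second pacing
        score += 2
    if not req.get("has_cookies", False):           # no session state across requests
        score += 1
    if not req.get("accept_language"):              # missing Accept-Language header
        score += 1
    if req.get("sitemap_order_ratio", 0) > 0.8:     # spidering sitemap URLs in order
        score += 2
    return score

print(bot_suspicion_score({"avg_interval_ms": 400, "has_cookies": False,
                           "accept_language": None, "sitemap_order_ratio": 0.9}))
# → 6
```

A threshold on this score (say, 3 or higher) then feeds the "suspected bots" tier described below, while low scores leave the client on the human path.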

Rate-limiting approaches

Per-bot rate limits

Apply specific rate limits to identified bot user agents:

# Example: Cloudflare rate-limiting rule
# If User-Agent contains "GPTBot" → limit to 10 requests/minute
# If User-Agent contains "ClaudeBot" → limit to 10 requests/minute

This is the simplest approach and handles well-behaved bots that identify themselves.
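If you run your own origin or middleware rather than a CDN rule engine, the same 10 requests/minute limit is a token bucket. A self-contained sketch (the burst size of 5 is an arbitrary choice):

```python
# Sketch: per-bot token bucket, the same shape the CDN rule above implements.
import time

class TokenBucket:
    def __init__(self, rate_per_min, burst):
        self.rate = rate_per_min / 60.0   # tokens replenished per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429 Too Many Requests

bucket = TokenBucket(rate_per_min=10, burst=5)
results = [bucket.allow() for _ in range(8)]
print(results)  # first 5 pass on the initial burst, the remaining 3 are throttled
```

In practice you would keep one bucket per bot identity (or per verified IP range) in a shared store such as Redis; the in-memory version above only illustrates the mechanics.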

Tiered rate limiting

Implement multiple tiers:

  1. Verified crawlers (IP + user-agent match): 30 req/min
  2. Declared but unverified bots (user-agent match, unknown IP): 5 req/min
  3. Suspected bots (behavioural signals): 2 req/min
  4. Human visitors: no artificial rate limit (rely on standard DDoS protection)
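The four tiers above reduce to a small classification function. A sketch, assuming the user-agent match, IP verification, and behavioural flags are computed upstream (the helper inputs and bot list are illustrative):

```python
# Sketch: map a request to one of the four tiers above.
# KNOWN_BOT_UAS and the input flags are assumed upstream products.
KNOWN_BOT_UAS = ("GPTBot", "ClaudeBot", "PerplexityBot")

def rate_limit_tier(ua, ip_verified, behavioural_flags):
    """Return (tier_name, requests_per_minute); None means no artificial limit."""
    declared = any(bot in ua for bot in KNOWN_BOT_UAS)
    if declared and ip_verified:
        return ("verified-crawler", 30)       # tier 1
    if declared:
        return ("declared-unverified", 5)     # tier 2
    if behavioural_flags:
        return ("suspected-bot", 2)           # tier 3
    return ("human", None)                    # tier 4: standard DDoS protection only

print(rate_limit_tier("Mozilla/5.0 (compatible; GPTBot/1.2)", True, []))
# → ('verified-crawler', 30)
```

Note the ordering: a declared bot that fails IP verification lands in the stricter tier, which is exactly the penalty you want for spoofed user agents.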

Cache-key separation

Bot traffic can pollute your CDN cache in unexpected ways:

  • Bots requesting with different Accept-Encoding values create separate cache entries
  • Bots with unusual query parameters can fragment the cache
  • High-volume bot requests can evict human-visitor cache entries on capacity-limited caches

Consider separating cache keys for bot traffic:

  • Route identified bot traffic to a separate cache zone or origin
  • Use a Vary header strategy that doesn't fragment the human-visitor cache
  • Deprioritise cache warming from bot requests
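One concrete way to apply the first two points is to normalise the cache key: collapse bot traffic into its own zone and drop query parameters that don't change the response. A sketch, assuming bot classification happens upstream and that the allow-listed parameters are site-specific:

```python
# Sketch: cache key that isolates bot traffic in its own zone and strips
# unrecognised query params so odd bot queries cannot mint new entries.
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKED_PARAMS = ("page", "q")  # illustrative: params that really change content

def cache_key(url, is_bot):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in TRACKED_PARAMS]
    zone = "bot" if is_bot else "human"   # separate zone keeps human cache warm
    return f"{zone}:{parts.path}?{urlencode(sorted(kept))}"

print(cache_key("https://example.com/docs?utm_source=x&page=2", is_bot=False))
# → human:/docs?page=2
```

With keys shaped like this, a burst of bot requests carrying random tracking parameters all hit one bot-zone entry instead of evicting human-visitor entries.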

"Human path" optimisation

The core idea: optimise your infrastructure for human visitors first, and handle bot traffic as a secondary concern.

Separate serving paths

If bot traffic is significant (>20% of requests):

  1. Identify bot requests at the edge (CDN worker, WAF rule)
  2. Route them to a bot-specific backend or cache tier
  3. Serve human visitors from the primary, warm cache
  4. Apply stricter rate limits and lower priority to the bot path
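The four steps above amount to one routing decision at the edge. A sketch of that decision; the backend hostnames, the suspicion-score threshold, and the crude UA check are all illustrative assumptions, not a real CDN worker API:

```python
# Sketch of the edge decision in steps 1-4: classify the request, then
# choose a backend, a priority, and a rate limit. Hostnames are made up.
def route_request(ua, ip_verified, suspicion_score):
    is_bot = "bot" in ua.lower() or suspicion_score >= 3  # step 1: identify
    if is_bot:
        # Steps 2 and 4: bot tier gets its own cache and stricter limits.
        return {"backend": "bot-cache.internal", "priority": "low",
                "rate_limit_rpm": 30 if ip_verified else 5}
    # Step 3: humans hit the primary, warm cache with no artificial limit.
    return {"backend": "primary-cache.internal", "priority": "high",
            "rate_limit_rpm": None}

print(route_request("Mozilla/5.0 (compatible; ClaudeBot/1.0)", True, 0))
```

In a real deployment this logic would live in a CDN worker or WAF expression, but the shape of the decision is the same.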

Progressive rendering for humans

Since bots typically don't execute JavaScript:

  • Serve critical content in the initial HTML response
  • Use progressive enhancement for interactive features
  • Bots get the content they need; humans get the full experience
  • This is not "cloaking" — the base content is identical

Connection priority

At the server level, if you control the origin:

  • Prioritise connections from known human-traffic IP ranges
  • Apply TCP/QUIC congestion control fairly but with human-traffic priority
  • Use HTTP/3 prioritisation hints for human-initiated requests

Common mistakes

Blocking all bots aggressively. Some AI crawlers also power search and answer features that send you referral traffic. Blocking everything may reduce your visibility in AI-powered search results while doing little to cut costs, since the undeclared bots that cause the most trouble ignore the block anyway.

Relying solely on user-agent strings. Undeclared bots and spoofed user agents bypass UA-only filtering trivially. Layer behavioural analysis and IP verification on top.

Ignoring robots.txt for AI control. Many legitimate AI crawlers do respect robots.txt. Use it as a first signal:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

But do not rely on it as enforcement — treat it as a signal, with rate limiting as the actual control.

Not monitoring bot traffic separately. If you don't split bot vs. human metrics, you can't tell whether your site is slow for humans or just overloaded by bots. Separate your analytics by traffic type.

Over-engineering the solution. For a small static site behind a CDN, simple rate-limiting rules and robots.txt may be all you need. The complexity of behavioural analysis is justified only when bot traffic is actually causing problems.

Verification

  1. Check your CDN analytics for user-agent distribution — identify what percentage of traffic is from known bots
  2. Verify robots.txt is served correctly and includes AI crawler directives
  3. Test rate-limiting rules against a known bot user agent: curl -A "GPTBot/1.0" -s -o /dev/null -w "%{http_code}" https://yourdomain.com/
  4. Monitor 429 (Too Many Requests) response rates to ensure rate limits are triggering appropriately
  5. Compare page-load metrics for human traffic before and after implementing bot controls

Related reading on wplus.net