AI crawler traffic has become a first-order concern for hosting operators. Through 2025 and into 2026, Cloudflare Radar data shows AI-associated bot traffic climbing steadily as a proportion of total requests, with some sites reporting 30–60% of bandwidth consumed by large-language-model training crawlers, retrieval-augmented-generation fetchers, and AI-powered search previews.
For small and mid-size hosting operations, this traffic creates real problems: inflated bandwidth bills, CDN cache pollution, rate-limit exhaustion, and degraded response times for human visitors. This guide covers practical approaches to identifying, managing, and rate-limiting AI bot traffic without breaking legitimate access or search engine visibility.
## The traffic landscape in 2026
AI bot traffic falls into several categories, each with different behavioural patterns:
### Training crawlers
Large-scale crawlers that fetch content for model training datasets. Characteristics:
- High request volume, often from a small number of IP ranges
- Tend to follow sitemaps and crawl exhaustively
- Many now identify themselves via `User-Agent` (GPTBot, ClaudeBot, Google-Extended, etc.)
- Some do not respect `robots.txt` or use misleading user agents
### RAG fetchers
Retrieval-augmented generation systems that fetch pages on demand in response to user queries. Characteristics:
- Lower per-session volume but spread across many concurrent sessions
- Request patterns resemble search-engine fetches but may not cache results
- Timing is bursty — sudden spikes when a topic trends
- Often use generic HTTP client user agents
### AI search previews
Search engines with AI-generated summaries that pre-fetch and process pages. Characteristics:
- Similar to traditional search crawlers but with higher processing overhead per page
- Generally well-identified via user agent
- Respect `robots.txt` more reliably than independent crawlers
### Undeclared bots
The hardest category: bots that use browser-like user agents, rotate IPs, and attempt to appear as human traffic. Distinguishing these from real users requires behavioural analysis rather than user-agent matching.
## Identification strategies

### User-agent filtering
The first line of defence. Known AI crawlers and their user-agent strings as of early 2026:
| Bot | User-Agent contains | Purpose |
|---|---|---|
| GPTBot | `GPTBot` | OpenAI training/browsing |
| ChatGPT-User | `ChatGPT-User` | ChatGPT browsing feature |
| ClaudeBot | `ClaudeBot` | Anthropic training |
| Google-Extended | `Google-Extended` | Google AI training |
| Bytespider | `Bytespider` | ByteDance training |
| CCBot | `CCBot` | Common Crawl |
| PerplexityBot | `PerplexityBot` | Perplexity AI search |
| Amazonbot | `Amazonbot` | Amazon Alexa AI |
This list will evolve. Maintain it as a living configuration rather than a static hardcoded list.
### IP range verification
Legitimate major crawlers publish their IP ranges:
- Google: `dig TXT _netblocks.google.com`
- OpenAI: published in their documentation
- Anthropic: published CIDR ranges
Verify that requests claiming to be from a known bot actually originate from that bot's published IP ranges. Spoofed user agents from random IPs are common.
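A hedged sketch of that verification step using Python's standard `ipaddress` module. The CIDR ranges below are documentation-only placeholders (RFC 5737), not the providers' real published blocks — substitute the ranges each vendor actually publishes:

```python
import ipaddress

# Placeholder CIDRs (RFC 5737 documentation ranges) -- replace with the
# ranges each provider actually publishes.
VERIFIED_BOT_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}

def is_verified_bot(claimed_bot: str, client_ip: str) -> bool:
    """True only if the claimed bot's published ranges contain client_ip."""
    ranges = VERIFIED_BOT_RANGES.get(claimed_bot, [])
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)
```

A request claiming `GPTBot` from an IP outside the published ranges fails this check and can be dropped into a stricter tier.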
### Behavioural signals
For undeclared bots, look for:
- Request patterns that systematically spider sitemap URLs in order
- No JavaScript execution (check via challenge pages or analytics)
- Consistent sub-second request intervals
- No cookie/session state across requests
- Missing or unusual `Accept-Language` and `Accept-Encoding` headers
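One way to combine these signals is a simple additive score. This is a sketch only: the key names, weights, and the suggested threshold are illustrative assumptions, not tuned values:

```python
def bot_suspicion_score(request_meta: dict) -> int:
    """Sum illustrative weights for each bot-like signal present.

    request_meta keys (hypothetical names for this sketch):
      executed_js, has_session_cookie: bool
      avg_interval_ms: float -- mean gap between requests
      headers: dict of HTTP request headers
    """
    score = 0
    if not request_meta.get("executed_js", False):
        score += 2                       # no JavaScript execution
    if not request_meta.get("has_session_cookie", False):
        score += 1                       # no cookie/session state
    if request_meta.get("avg_interval_ms", 10_000) < 1000:
        score += 2                       # consistent sub-second intervals
    headers = request_meta.get("headers", {})
    if "Accept-Language" not in headers or "Accept-Encoding" not in headers:
        score += 1                       # missing common browser headers
    return score                         # e.g. treat >= 4 as "suspected bot"
```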
## Rate-limiting approaches

### Per-bot rate limits
Apply specific rate limits to identified bot user agents:
```
# Example: Cloudflare rate-limiting rule
# If User-Agent contains "GPTBot" → limit to 10 requests/minute
# If User-Agent contains "ClaudeBot" → limit to 10 requests/minute
```
This is the simplest approach and handles well-behaved bots that identify themselves.
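Under the hood, a per-bot limit like this is typically a token bucket per user-agent class. The sketch below assumes an in-process counter for illustration; a real deployment would rely on the CDN's rule engine or a shared store, and the 10 req/min figure mirrors the example above:

```python
import time

class TokenBucket:
    """Allow `rate` requests per `per_seconds`, refilled continuously."""

    def __init__(self, rate: int, per_seconds: float = 60.0):
        self.capacity = float(rate)
        self.tokens = float(rate)
        self.refill_per_sec = rate / per_seconds
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should respond 429

# One bucket per identified bot, e.g. 10 requests/minute each.
buckets = {"GPTBot": TokenBucket(10), "ClaudeBot": TokenBucket(10)}
```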
### Tiered rate limiting
Implement multiple tiers:
- Verified crawlers (IP + user-agent match): 30 req/min
- Declared but unverified bots (user-agent match, unknown IP): 5 req/min
- Suspected bots (behavioural signals): 2 req/min
- Human visitors: no artificial rate limit (rely on standard DDoS protection)
### Cache-key separation
Bot traffic can pollute your CDN cache in unexpected ways:
- Bots requesting with different `Accept-Encoding` values create separate cache entries
- Bots with unusual query parameters can fragment the cache
- High-volume bot requests can evict human-visitor cache entries on capacity-limited caches
Consider separating cache keys for bot traffic:
- Route identified bot traffic to a separate cache zone or origin
- Use a `Vary` header strategy that doesn't fragment the human-visitor cache
- Deprioritise cache warming from bot requests
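A sketch of how a cache key might be computed so bot traffic lands in its own zone and stray query parameters can't fragment the human cache. The parameter allow-list and zone names are assumptions of this sketch:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Only these query parameters affect page content in this sketch.
ALLOWED_PARAMS = {"page", "lang"}

def cache_key(url: str, is_bot: bool) -> str:
    """Build a cache key: bot traffic gets its own zone, and unknown
    query parameters are dropped so they can't fragment the cache."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS]
    zone = "bot" if is_bot else "human"
    return f"{zone}:{parts.path}?{urlencode(sorted(kept))}"
```

With this scheme, `?utm_source=...` variants collapse into one entry, and a bot crawl can only evict entries in its own zone.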
"Human path" optimisation
The core idea: optimise your infrastructure for human visitors first, and handle bot traffic as a secondary concern.
### Separate serving paths
If bot traffic is significant (>20% of requests):
- Identify bot requests at the edge (CDN worker, WAF rule)
- Route them to a bot-specific backend or cache tier
- Serve human visitors from the primary, warm cache
- Apply stricter rate limits and lower priority to the bot path
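The steps above boil down to a routing decision at the edge. A minimal sketch, where the backend labels ("bot-tier", "primary") and the UA list are hypothetical:

```python
# Illustrative subset of known AI crawler UA substrings.
AI_BOT_SIGNATURES = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot")

def route_request(user_agent: str) -> str:
    """Pick a serving path: identified bots go to a lower-priority
    bot tier; everyone else hits the warm primary cache."""
    if any(sig.lower() in user_agent.lower() for sig in AI_BOT_SIGNATURES):
        return "bot-tier"     # separate cache zone, stricter limits
    return "primary"          # warm cache serving human visitors
```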
### Progressive rendering for humans
Since bots typically don't execute JavaScript:
- Serve critical content in the initial HTML response
- Use progressive enhancement for interactive features
- Bots get the content they need; humans get the full experience
- This is not "cloaking" — the base content is identical
### Connection priority
At the server level, if you control the origin:
- Prioritise connections from known human-traffic IP ranges
- Apply TCP/QUIC congestion control fairly but with human-traffic priority
- Use HTTP/3 prioritisation hints for human-initiated requests
## Common mistakes
**Blocking all bots aggressively.** Some AI crawlers also power search features that send you traffic. Blocking everything may reduce your visibility in AI-powered search results without meaningfully reducing your costs.
**Relying solely on user-agent strings.** Undeclared bots and spoofed user agents bypass UA-only filtering trivially. Layer behavioural analysis and IP verification on top.
**Ignoring robots.txt for AI control.** Many legitimate AI crawlers do respect `robots.txt`. Use it as a first signal:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
But do not rely on it as enforcement — treat it as a signal, with rate limiting as the actual control.
**Not monitoring bot traffic separately.** If you don't split bot vs. human metrics, you can't tell whether your site is slow for humans or just overloaded by bots. Separate your analytics by traffic type.
**Over-engineering the solution.** For a small static site behind a CDN, simple rate-limiting rules and `robots.txt` may be all you need. The complexity of behavioural analysis is justified only when bot traffic is actually causing problems.
## Verification
- Check your CDN analytics for user-agent distribution — identify what percentage of traffic is from known bots
- Verify `robots.txt` is served correctly and includes AI crawler directives
- Test rate-limiting rules against a known bot user agent: `curl -A "GPTBot/1.0" -s -o /dev/null -w "%{http_code}" https://yourdomain.com/`
- Monitor 429 (Too Many Requests) response rates to ensure rate limits are triggering appropriately
- Compare page-load metrics for human traffic before and after implementing bot controls
## Related reading on wplus.net
- Immutable Caching for Download Archives — caching strategies that reduce origin load
- HTTP/3 Hosting Checklist for 2026 — protocol-level optimisation
- Hosting hub — hosting architecture overview
- Security hub — headers, TLS, and configuration hardening