AI crawler traffic has become a first-order concern for hosting operators. Through 2025 and into 2026, Cloudflare Radar data shows AI-associated bot traffic climbing steadily as a proportion of total requests, with some sites reporting 30–60% of bandwidth consumed by large-language-model training crawlers, retrieval-augmented-generation fetchers, and AI-powered search previews.

For small and mid-size hosting operations, this traffic creates real problems: inflated bandwidth bills, CDN cache pollution, rate-limit exhaustion, and degraded response times for human visitors. This guide covers practical approaches to identifying, managing, and rate-limiting AI bot traffic without breaking legitimate access or search engine visibility.

The traffic landscape in 2026

AI bot traffic falls into several categories, each with different behavioural patterns:

Training crawlers

Large-scale crawlers that fetch content for model training datasets. Characteristics:

  • High request volume, often from a small number of IP ranges
  • Tend to follow sitemaps and crawl exhaustively
  • Many now identify themselves via User-Agent (GPTBot, ClaudeBot, Google-Extended, etc.)
  • Some do not respect robots.txt or use misleading user agents

RAG fetchers

Retrieval-augmented generation systems that fetch pages on demand in response to user queries. Characteristics:

  • Lower per-session volume but spread across many concurrent sessions
  • Request patterns resemble search-engine fetches but may not cache results
  • Timing is bursty — sudden spikes when a topic trends
  • Often use generic HTTP client user agents

AI search previews

Search engines with AI-generated summaries that pre-fetch and process pages. Characteristics:

  • Similar to traditional search crawlers but with higher processing overhead per page
  • Generally well-identified via user agent
  • Respect robots.txt more reliably than independent crawlers

Undeclared bots

The hardest category: bots that use browser-like user agents, rotate IPs, and attempt to appear as human traffic. Distinguishing these from real users requires behavioural analysis rather than user-agent matching.

Identification strategies

User-agent filtering

The first line of defence. Known AI crawlers and their user-agent strings as of early 2026:

Bot               User-Agent contains   Purpose
GPTBot            GPTBot                OpenAI training/browsing
ChatGPT-User      ChatGPT-User          ChatGPT browsing feature
ClaudeBot         ClaudeBot             Anthropic training
Google-Extended   (robots.txt token)    Google AI training opt-out; appears in robots.txt, not in request headers
Bytespider        Bytespider            ByteDance training
CCBot             CCBot                 Common Crawl
PerplexityBot     PerplexityBot         Perplexity AI search
Amazonbot         Amazonbot             Amazon Alexa AI

This list will evolve. Maintain it as a living configuration rather than a static hardcoded list.
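One way to keep the list living is to drive user-agent matching from a config file instead of hardcoded strings. A minimal sketch, assuming a JSON file (the filename `ai_bots.json` and the category labels are illustrative):

```python
# Sketch: user-agent matching driven by an external config file.
# The signature dictionary maps a UA substring to a bot category.
import json

def load_bot_signatures(path="ai_bots.json"):
    # ai_bots.json would hold e.g. {"GPTBot": "training", "CCBot": "training"}
    with open(path) as f:
        return json.load(f)

def classify_user_agent(ua, signatures):
    """Return (bot_name, category) if the UA contains a known signature, else None."""
    for name, category in signatures.items():
        if name.lower() in ua.lower():
            return name, category
    return None

sigs = {"GPTBot": "training", "ClaudeBot": "training", "PerplexityBot": "search"}
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.2)", sigs))
# → ('GPTBot', 'training')
```

Reloading the signature file on a timer or on SIGHUP keeps the deployed list in sync without a redeploy.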

IP range verification

Legitimate major crawlers publish their IP ranges:

  • Google: machine-readable JSON lists of crawler IP ranges in its Search documentation; reverse-DNS verification against googlebot.com/google.com also works
  • OpenAI: published in their documentation
  • Anthropic: published CIDR ranges

Verify that requests claiming to be from a known bot actually originate from that bot's published IP ranges. Spoofed user agents from random IPs are common.
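The check itself is straightforward with the standard library. A sketch using Python's ipaddress module; the CIDR ranges below are documentation placeholders, not any vendor's real ranges:

```python
# Sketch: verify a claimed bot's source IP against published CIDR ranges.
# These ranges are RFC 5737 placeholders; substitute each vendor's real lists.
import ipaddress

PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],        # placeholder, not OpenAI's actual ranges
    "ClaudeBot": ["198.51.100.0/24"],  # placeholder
}

def verify_bot_ip(bot_name, ip):
    """True only if the IP falls inside the bot's published ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net)
               for net in PUBLISHED_RANGES.get(bot_name, []))

print(verify_bot_ip("GPTBot", "192.0.2.10"))   # inside the published range
print(verify_bot_ip("GPTBot", "203.0.113.5"))  # spoofed UA from elsewhere
```

A request whose user agent claims GPTBot but fails this check should fall into a stricter tier, not be trusted as a verified crawler.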

Behavioural signals

For undeclared bots, look for:

  • Request patterns that systematically spider sitemap URLs in order
  • No JavaScript execution (check via challenge pages or analytics)
  • Consistent sub-second request intervals
  • No cookie/session state across requests
  • Missing or unusual Accept-Language, Accept-Encoding headers
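These signals combine naturally into a simple additive score. A sketch with illustrative, untuned weights and thresholds; the request-metadata field names are assumptions about what your log pipeline extracts:

```python
# Sketch: additive suspicion score over the behavioural signals above.
# Weights and thresholds are illustrative, not tuned values.
def bot_suspicion_score(req):
    """req: dict of per-client request metadata. Higher score = more bot-like."""
    score = 0
    if req.get("avg_interval_ms", 10_000) < 1000:   # consistent sub-second pacing
        score += 2
    if not req.get("has_cookies", False):           # no session state across requests
        score += 1
    if not req.get("accept_language"):              # missing Accept-Language header
        score += 1
    if req.get("sitemap_order_ratio", 0) > 0.8:     # spidering sitemap URLs in order
        score += 2
    return score

print(bot_suspicion_score({"avg_interval_ms": 400, "has_cookies": False,
                           "accept_language": None, "sitemap_order_ratio": 0.9}))
# → 6
```

A threshold on this score (say, 3 or higher) then feeds the "suspected bots" tier described below, while low scores leave the client on the human path.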

Rate-limiting approaches

Per-bot rate limits

Apply specific rate limits to identified bot user agents:

# Example: Cloudflare rate-limiting rule
# If User-Agent contains "GPTBot" → limit to 10 requests/minute
# If User-Agent contains "ClaudeBot" → limit to 10 requests/minute

This is the simplest approach and handles well-behaved bots that identify themselves.
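If you run your own origin or middleware rather than a CDN rule engine, the same 10 requests/minute limit is a token bucket. A self-contained sketch (the burst size of 5 is an arbitrary choice):

```python
# Sketch: per-bot token bucket, the same shape the CDN rule above implements.
import time

class TokenBucket:
    def __init__(self, rate_per_min, burst):
        self.rate = rate_per_min / 60.0   # tokens replenished per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429 Too Many Requests

bucket = TokenBucket(rate_per_min=10, burst=5)
results = [bucket.allow() for _ in range(8)]
print(results)  # first 5 pass on the initial burst, the remaining 3 are throttled
```

In practice you would keep one bucket per bot identity (or per verified IP range) in a shared store such as Redis; the in-memory version above only illustrates the mechanics.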

Tiered rate limiting

Implement multiple tiers:

  1. Verified crawlers (IP + user-agent match): 30 req/min
  2. Declared but unverified bots (user-agent match, unknown IP): 5 req/min
  3. Suspected bots (behavioural signals): 2 req/min
  4. Human visitors: no artificial rate limit (rely on standard DDoS protection)
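The four tiers above reduce to a small classification function. A sketch, assuming the user-agent match, IP verification, and behavioural flags are computed upstream (the helper inputs and bot list are illustrative):

```python
# Sketch: map a request to one of the four tiers above.
# KNOWN_BOT_UAS and the input flags are assumed upstream products.
KNOWN_BOT_UAS = ("GPTBot", "ClaudeBot", "PerplexityBot")

def rate_limit_tier(ua, ip_verified, behavioural_flags):
    """Return (tier_name, requests_per_minute); None means no artificial limit."""
    declared = any(bot in ua for bot in KNOWN_BOT_UAS)
    if declared and ip_verified:
        return ("verified-crawler", 30)       # tier 1
    if declared:
        return ("declared-unverified", 5)     # tier 2
    if behavioural_flags:
        return ("suspected-bot", 2)           # tier 3
    return ("human", None)                    # tier 4: standard DDoS protection only

print(rate_limit_tier("Mozilla/5.0 (compatible; GPTBot/1.2)", True, []))
# → ('verified-crawler', 30)
```

Note the ordering: a declared bot that fails IP verification lands in the stricter tier, which is exactly the penalty you want for spoofed user agents.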

Cache-key separation

Bot traffic can pollute your CDN cache in unexpected ways:

  • Bots requesting with different Accept-Encoding values create separate cache entries
  • Bots with unusual query parameters can fragment the cache
  • High-volume bot requests can evict human-visitor cache entries on capacity-limited caches

Consider separating cache keys for bot traffic:

  • Route identified bot traffic to a separate cache zone or origin
  • Use a Vary header strategy that doesn't fragment the human-visitor cache
  • Deprioritise cache warming from bot requests
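One concrete way to apply the first two points is to normalise the cache key: collapse bot traffic into its own zone and drop query parameters that don't change the response. A sketch, assuming bot classification happens upstream and that the allow-listed parameters are site-specific:

```python
# Sketch: cache key that isolates bot traffic in its own zone and strips
# unrecognised query params so odd bot queries cannot mint new entries.
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKED_PARAMS = ("page", "q")  # illustrative: params that really change content

def cache_key(url, is_bot):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in TRACKED_PARAMS]
    zone = "bot" if is_bot else "human"   # separate zone keeps human cache warm
    return f"{zone}:{parts.path}?{urlencode(sorted(kept))}"

print(cache_key("https://example.com/docs?utm_source=x&page=2", is_bot=False))
# → human:/docs?page=2
```

With keys shaped like this, a burst of bot requests carrying random tracking parameters all hit one bot-zone entry instead of evicting human-visitor entries.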

"Human path" optimisation

The core idea: optimise your infrastructure for human visitors first, and handle bot traffic as a secondary concern.

Separate serving paths

If bot traffic is significant (>20% of requests):

  1. Identify bot requests at the edge (CDN worker, WAF rule)
  2. Route them to a bot-specific backend or cache tier
  3. Serve human visitors from the primary, warm cache
  4. Apply stricter rate limits and lower priority to the bot path
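The four steps above amount to one routing decision at the edge. A sketch of that decision; the backend hostnames, the suspicion-score threshold, and the crude UA check are all illustrative assumptions, not a real CDN worker API:

```python
# Sketch of the edge decision in steps 1-4: classify the request, then
# choose a backend, a priority, and a rate limit. Hostnames are made up.
def route_request(ua, ip_verified, suspicion_score):
    is_bot = "bot" in ua.lower() or suspicion_score >= 3  # step 1: identify
    if is_bot:
        # Steps 2 and 4: bot tier gets its own cache and stricter limits.
        return {"backend": "bot-cache.internal", "priority": "low",
                "rate_limit_rpm": 30 if ip_verified else 5}
    # Step 3: humans hit the primary, warm cache with no artificial limit.
    return {"backend": "primary-cache.internal", "priority": "high",
            "rate_limit_rpm": None}

print(route_request("Mozilla/5.0 (compatible; ClaudeBot/1.0)", True, 0))
```

In a real deployment this logic would live in a CDN worker or WAF expression, but the shape of the decision is the same.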

Progressive rendering for humans

Since bots typically don't execute JavaScript:

  • Serve critical content in the initial HTML response
  • Use progressive enhancement for interactive features
  • Bots get the content they need; humans get the full experience
  • This is not "cloaking" — the base content is identical

Connection priority

At the server level, if you control the origin:

  • Prioritise connections from known human-traffic IP ranges
  • Apply TCP/QUIC congestion control fairly but with human-traffic priority
  • Use HTTP/3 prioritisation hints for human-initiated requests

Common mistakes

Blocking all bots aggressively. Some AI crawlers also power search and answer features that send you referral traffic. Blocking everything may reduce your visibility in AI-powered search results while doing little to cut costs, since the undeclared bots that cause the most trouble ignore the block anyway.

Relying solely on user-agent strings. Undeclared bots and spoofed user agents bypass UA-only filtering trivially. Layer behavioural analysis and IP verification on top.

Ignoring robots.txt for AI control. Many legitimate AI crawlers do respect robots.txt. Use it as a first signal:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

But do not rely on it as enforcement — treat it as a signal, with rate limiting as the actual control.

Not monitoring bot traffic separately. If you don't split bot vs. human metrics, you can't tell whether your site is slow for humans or just overloaded by bots. Separate your analytics by traffic type.

Over-engineering the solution. For a small static site behind a CDN, simple rate-limiting rules and robots.txt may be all you need. The complexity of behavioural analysis is justified only when bot traffic is actually causing problems.

Verification

  1. Check your CDN analytics for user-agent distribution — identify what percentage of traffic is from known bots
  2. Verify robots.txt is served correctly and includes AI crawler directives
  3. Test rate-limiting rules against a known bot user agent: curl -A "GPTBot/1.0" -s -o /dev/null -w "%{http_code}" https://yourdomain.com/
  4. Monitor 429 (Too Many Requests) response rates to ensure rate limits are triggering appropriately
  5. Compare page-load metrics for human traffic before and after implementing bot controls

Related reading on wplus.net