The relationship between content publishers and AI companies has shifted from implicit to explicit. Through 2024 and 2025, lawsuits, licensing deals, and regulatory action established that content used for AI training is a rights question, not a free-access assumption. By 2026, publishers of all sizes need a clear position on AI training access to their content — and the mechanisms to express and enforce that position.
This page covers the current state of machine-readable licensing signals, the emerging pay-per-crawl models, and practical policy templates you can adapt for your own site.
The current landscape
What changed
- Copyright lawsuits (NYT v. OpenAI, Getty v. Stability AI, and others) established that large-scale content ingestion for AI training is not clearly protected by fair use / fair dealing
- Licensing deals between publishers and AI companies (AP, Reddit, Axel Springer, etc.) created a market signal that content for training has economic value
- EU AI Act and proposed regulations require transparency about training data sources
- robots.txt AI directives became the de facto first signal for crawler control, with most major AI companies now recognising specific user-agent tokens
What hasn't changed
- There is no universally adopted standard for machine-readable AI licensing
- robots.txt remains advisory with no legal enforcement mechanism in most jurisdictions
- Small publishers have limited leverage in direct licensing negotiations
- The legal landscape varies significantly by jurisdiction
Machine-readable licensing signals
robots.txt (current standard)
The most widely supported signal. See the AI crawler control guide for detailed configuration. Key directives:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Limitation: robots.txt expresses "don't crawl" but not "you may crawl under these terms." It's binary, not conditional.
TDM (Text and Data Mining) reservation
The EU DSM Directive (Article 4) allows rights holders to reserve text-and-data-mining rights via machine-readable means. This is legally binding in EU member states.
Express TDM reservation in HTTP headers:
X-Robots-Tag: noai
X-Robots-Tag: noimageai
Or in HTML meta tags:
<meta name="robots" content="noai, noimageai">
These signals are not yet universally respected by all crawlers, but they establish your legal position under EU law.
W3C proposal: TDM Reservation Protocol
The W3C has been developing a more formal protocol for expressing TDM rights. The draft includes:
- A /.well-known/tdmrep.json file expressing site-wide TDM policy
- Per-page HTTP headers or meta tags
- Machine-readable license conditions
As of early 2026, this is still in draft form, but adding a tdmrep.json file is a low-cost signal that aligns with the likely standard.
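A minimal /.well-known/tdmrep.json reserving TDM rights site-wide might look like the following sketch. Field names follow the current draft; the policy URL is a placeholder you would point at your own terms page:

```json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/ai-terms"
  }
]
```

A `tdm-reservation` of 1 asserts the reservation; the optional `tdm-policy` URL tells crawlers where to find the conditions under which mining could be permitted.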
RSL (Really Simple Licensing) and similar signals
Several industry proposals extend robots.txt with licensing metadata. The general pattern:
# robots.txt with licensing extension
User-agent: GPTBot
Disallow: /
License: https://example.com/ai-terms
User-agent: *
Allow: /
This approach is not standardised yet, but the concept — linking robots.txt directives to license terms — is gaining traction.
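Because the extension is not standardised, any consumer has to parse it leniently. A hypothetical Python sketch that pairs each User-agent group with its License: URL, assuming the layout shown above:

```python
def parse_license_extensions(robots_txt: str) -> dict:
    """Map each User-agent token to the License: URL declared in its
    group. The License: directive is a non-standard extension; this
    parser is illustrative, not a general robots.txt implementation."""
    licenses = {}
    agents = []
    in_agent_run = False  # True while reading consecutive User-agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not in_agent_run:
                agents = []  # a User-agent after other directives starts a new group
            agents.append(value)
            in_agent_run = True
        else:
            in_agent_run = False
            if field == "license":
                for agent in agents:
                    licenses[agent] = value
    return licenses
```

Run against the example above, this returns `{"GPTBot": "https://example.com/ai-terms"}` — the wildcard group declares no license, so it is omitted.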
Pay-per-crawl models
Several companies now offer intermediary services for content licensing:
How pay-per-crawl works
- Publisher registers content with a licensing intermediary
- Intermediary provides a token or API endpoint for authorised access
- AI companies negotiate bulk or per-query access through the intermediary
- Publisher receives payment based on crawl volume or content usage
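The steps above can be sketched from the publisher's side. A minimal Python sketch, assuming a hypothetical HMAC-signed bearer token issued by the intermediary; the secret, token format, and per-page billing model are all illustrative, not any particular platform's API:

```python
import hmac
import hashlib
from collections import defaultdict

# Hypothetical shared secret provisioned by the licensing intermediary
SHARED_SECRET = b"intermediary-shared-secret"

# Per-licensee page counts: the raw input to volume-based billing
crawl_counts = defaultdict(int)

def issue_token(licensee_id: str) -> str:
    """Hypothetical token format: '<licensee>.<hmac-sha256 hex>'."""
    sig = hmac.new(SHARED_SECRET, licensee_id.encode(), hashlib.sha256).hexdigest()
    return f"{licensee_id}.{sig}"

def authorise_crawl(token: str) -> bool:
    """Validate the token; if valid, count the fetch toward the licensee's bill."""
    licensee_id, _, sig = token.rpartition(".")
    expected = hmac.new(SHARED_SECRET, licensee_id.encode(), hashlib.sha256).hexdigest()
    if not licensee_id or not hmac.compare_digest(sig, expected):
        return False
    crawl_counts[licensee_id] += 1
    return True
```

In practice the intermediary, not the publisher, would usually hold the signing key and expose a verification endpoint; the point is that authorised access is both gated and metered.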
Current market dynamics
- Rates vary enormously: from fractions of a cent per page for bulk training data to significant per-article fees for premium content
- Most deals are bilateral between large publishers and AI companies
- Intermediary platforms are emerging but none has dominant market share yet
- Small publishers typically lack the scale for direct deals
Practical approach for small publishers
For sites with limited negotiating power:
- Express your position clearly: robots.txt + TDM reservation + terms-of-service page
- Monitor usage: track AI crawler access in your logs
- Document infringement: if AI crawlers ignore your stated policy, document the access
- Join collective licensing efforts: industry groups are forming to negotiate on behalf of smaller publishers
- Consider selective access: allow some AI systems that provide attribution/traffic in exchange for access
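The monitoring and documentation steps can start with a simple log scan. A Python sketch assuming combined-format access logs; the crawler list is illustrative and should track whatever your robots.txt actually names:

```python
import re
from collections import Counter

# User-agent substrings for AI training crawlers (illustrative; extend as new ones appear)
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot", "Google-Extended"]

# In combined log format the user agent is the final quoted field
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_ai_crawler_hits(log_lines):
    """Tally requests per AI crawler from access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = UA_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_CRAWLERS:
            if bot in user_agent:
                hits[bot] += 1
    return hits
```

Run this over a day's logs and you have both a usage baseline and, if a blocked crawler keeps appearing, timestamped evidence that your stated policy was ignored.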
Policy templates
Terms of use — AI training restriction
Add this clause to your terms of use page:
Automated content collection for AI training. Content on this site may not be used for training, fine-tuning, or evaluating machine learning models, large language models, or artificial intelligence systems without prior written permission. Automated crawling or scraping for these purposes is prohibited regardless of the method used to access the content. This restriction applies to all content including text, images, code, and data published on this domain and its subdomains.
robots.txt — comprehensive AI blocking
# AI training crawlers — access prohibited
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Omgilibot
Disallow: /
# Search engines — access permitted
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Applebot
Allow: /
User-agent: *
Allow: /
HTTP headers — TDM reservation
In your _headers file or server configuration:
/*
  X-Robots-Tag: noai, noimageai
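If you serve the site through nginx rather than a static host that reads a _headers file, a sketch of the equivalent directive (placed in the relevant server or location block):

```nginx
# Send the TDM reservation on every response, including error pages
add_header X-Robots-Tag "noai, noimageai" always;
```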
Common mistakes
Having no policy at all. Silence may be interpreted as implicit permission in some jurisdictions. Express your position explicitly.
Blocking all automated access. This breaks search engines, monitoring tools, accessibility checkers, and other legitimate bots. Be specific about what you're blocking and why.
Relying solely on legal terms without technical controls. Terms of use are enforceable after the fact, but they don't prevent crawling. Combine legal terms with technical controls (robots.txt + rate limiting).
Ignoring the economic trade-off. Some AI systems drive traffic to your site. Blocking everything may reduce your visibility without generating licensing revenue. Evaluate each AI system's value to your audience.
Verification
- Test robots.txt with a validator to ensure syntax is correct
- Verify AI crawler user agents are correctly blocked: check access logs for GPTBot, ClaudeBot, etc.
- Confirm search engine access is preserved: verify Googlebot can access your content via Search Console
- Check X-Robots-Tag headers are returned correctly: curl -I https://yourdomain.com/ | grep -i x-robots
- Review and update your AI crawler list quarterly — new crawlers appear regularly
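The robots.txt checks in this list can be automated against a fetched copy of the file using Python's standard-library parser. A sketch; the bot lists are illustrative and should mirror your actual policy:

```python
from urllib.robotparser import RobotFileParser

def check_policy(robots_txt: str) -> dict:
    """Verify that AI crawlers are blocked and search engines are not,
    given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = ["GPTBot", "ClaudeBot", "CCBot"]  # should be denied (illustrative list)
    allowed = ["Googlebot", "Bingbot"]          # should be permitted
    return {
        "blocked_ok": all(not parser.can_fetch(ua, "/") for ua in blocked),
        "search_ok": all(parser.can_fetch(ua, "/") for ua in allowed),
    }
```

Both values should come back True for the comprehensive robots.txt template above; a False `search_ok` is the signal that a blocking rule has overreached into search traffic.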
Related reading on wplus.net
- AI Crawler Control Without Losing SEO — technical implementation of crawler controls
- AI & Bot Traffic Hardening for 2026 — rate limiting and traffic management
- Legal hub — privacy, terms, and policies