The relationship between content publishers and AI companies has shifted from implicit to explicit. Through 2024 and 2025, lawsuits, licensing deals, and regulatory action established that content used for AI training is a rights question, not a free-access assumption. By 2026, publishers of all sizes need a clear position on AI training access to their content — and the mechanisms to express and enforce that position.

This page covers the current state of machine-readable licensing signals, the emerging pay-per-crawl models, and practical policy templates you can adapt for your own site.

The current landscape

What changed

  • Copyright lawsuits (NYT v. OpenAI, Getty Images v. Stability AI, and others) signalled that large-scale content ingestion for AI training cannot be assumed to fall under fair use / fair dealing
  • Licensing deals between publishers and AI companies (AP, Reddit, Axel Springer, etc.) created a market signal that content for training has economic value
  • EU AI Act and proposed regulations require transparency about training data sources
  • robots.txt AI directives became the de facto first signal for crawler control, with most major AI companies now recognising specific user-agent tokens

What hasn't changed

  • There is no universally adopted standard for machine-readable AI licensing
  • robots.txt remains advisory with no legal enforcement mechanism in most jurisdictions
  • Small publishers have limited leverage in direct licensing negotiations
  • The legal landscape varies significantly by jurisdiction

Machine-readable licensing signals

robots.txt (current standard)

The most widely supported signal. See the AI crawler control guide for detailed configuration. Key directives:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Limitation: robots.txt expresses "don't crawl" but not "you may crawl under these terms." It's binary, not conditional.

TDM (Text and Data Mining) reservation

The EU DSM Directive (Article 4) allows rights holders to reserve text-and-data-mining rights via machine-readable means. This is legally binding in EU member states.

Express TDM reservation in HTTP headers:

X-Robots-Tag: noai
X-Robots-Tag: noimageai

Or in HTML meta tags:

<meta name="robots" content="noai, noimageai">

These signals are not yet universally respected by all crawlers, but they establish your legal position under EU law.

W3C proposal: TDM Reservation Protocol

A W3C community group has been developing the TDM Reservation Protocol (TDMRep), a more formal way of expressing TDM rights. The draft includes:

  • A /.well-known/tdmrep.json file expressing site-wide TDM policy
  • Per-page HTTP headers or meta tags
  • Machine-readable license conditions

As of early 2026, this is still in draft form, but adding a tdmrep.json file is a low-cost signal that aligns with the likely standard.
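Based on the current draft, a minimal site-wide reservation file looks roughly like the sketch below. The `location` path pattern and the policy URL are illustrative; check the draft specification for the exact field names before deploying.

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/ai-terms.json"
  }
]
```

Serve this as /.well-known/tdmrep.json with a JSON content type.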

RSL (Really Simple Licensing) signals

Several industry proposals extend robots.txt with licensing metadata. The general pattern:

# robots.txt with licensing extension
User-agent: GPTBot
Disallow: /
License: https://example.com/ai-terms

User-agent: *
Allow: /

This approach is not standardised yet, but the concept — linking robots.txt directives to license terms — is gaining traction.

Pay-per-crawl models

Several companies now offer intermediary services for content licensing.

How pay-per-crawl works

  1. Publisher registers content with a licensing intermediary
  2. Intermediary provides a token or API endpoint for authorised access
  3. AI companies negotiate bulk or per-query access through the intermediary
  4. Publisher receives payment based on crawl volume or content usage
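The four steps above can be sketched in code. Everything here is hypothetical — the class and method names (`LicensingIntermediary`, `register_crawler`, `authorize_fetch`, `payout`) and the rate are illustrative, not any real platform's API.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class LicensingIntermediary:
    """Hypothetical sketch of the pay-per-crawl flow described above."""
    rate_per_page: float                               # negotiated rate per crawled page
    tokens: dict = field(default_factory=dict)         # token -> crawler name
    crawl_counts: dict = field(default_factory=dict)   # crawler -> pages fetched

    def register_crawler(self, crawler: str) -> str:
        # Steps 2-3: issue an access token to an AI company that has
        # negotiated access through the intermediary.
        token = secrets.token_hex(8)
        self.tokens[token] = crawler
        self.crawl_counts[crawler] = 0
        return token

    def authorize_fetch(self, token: str) -> bool:
        # Called on each request: only crawls presenting a valid token
        # are allowed, and each allowed fetch is metered.
        crawler = self.tokens.get(token)
        if crawler is None:
            return False
        self.crawl_counts[crawler] += 1
        return True

    def payout(self, crawler: str) -> float:
        # Step 4: the publisher is paid by crawl volume.
        return self.crawl_counts[crawler] * self.rate_per_page
```

The key design point is that access and metering happen at the same checkpoint, so the payment basis and the access log cannot drift apart.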

Current market dynamics

  • Rates vary enormously: from fractions of a cent per page for bulk training data to significant per-article fees for premium content
  • Most deals are bilateral between large publishers and AI companies
  • Intermediary platforms are emerging but none has dominant market share yet
  • Small publishers typically lack the scale for direct deals

Practical approach for small publishers

For sites with limited negotiating power:

  1. Express your position clearly: robots.txt + TDM reservation + terms-of-service page
  2. Monitor usage: track AI crawler access in your logs
  3. Document infringement: if AI crawlers ignore your stated policy, document the access
  4. Join collective licensing efforts: industry groups are forming to negotiate on behalf of smaller publishers
  5. Consider selective access: allow some AI systems that provide attribution/traffic in exchange for access
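Steps 2 and 3 above — monitoring and documenting crawler access — can be automated against standard access logs. A minimal sketch, assuming combined log format where the user agent is the last quoted field:

```python
import re
from collections import Counter

# User-agent substrings for known AI training crawlers.
# Extend this list as new crawlers appear.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
               "PerplexityBot", "Google-Extended", "Amazonbot"]

# In combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines):
    """Return a Counter of hits per AI crawler found in access logs."""
    hits = Counter()
    for line in log_lines:
        m = UA_PATTERN.search(line)
        if not m:
            continue
        ua = m.group(1)
        for crawler in AI_CRAWLERS:
            if crawler in ua:
                hits[crawler] += 1
    return hits
```

Run this over a day's log and archive the output: a dated record of crawler hits after you published your policy is exactly the documentation step 3 calls for.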

Policy templates

Terms of use — AI training restriction

Add this clause to your terms of use page:

Automated content collection for AI training. Content on this site may not be used for training, fine-tuning, or evaluating machine learning models, large language models, or artificial intelligence systems without prior written permission. Automated crawling or scraping for these purposes is prohibited regardless of the method used to access the content. This restriction applies to all content including text, images, code, and data published on this domain and its subdomains.

robots.txt — comprehensive AI blocking

# AI training crawlers — access prohibited
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Omgilibot
Disallow: /

# Search engines — access permitted
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

User-agent: *
Allow: /

HTTP headers — TDM reservation

In your _headers file or server configuration:

/*
  X-Robots-Tag: noai, noimageai
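If you serve the site with nginx rather than a platform _headers file, the equivalent is a single add_header directive in the server or location block (the `always` flag makes it apply to error responses too):

add_header X-Robots-Tag "noai, noimageai" always;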

Common mistakes

Having no policy at all. Silence may be interpreted as implicit permission in some jurisdictions. Express your position explicitly.

Blocking all automated access. This breaks search engines, monitoring tools, accessibility checkers, and other legitimate bots. Be specific about what you're blocking and why.

Relying solely on legal terms without technical controls. Terms of use are enforceable after the fact, but they don't prevent crawling. Combine legal terms with technical controls (robots.txt + rate limiting).

Ignoring the economic trade-off. Some AI systems drive traffic to your site. Blocking everything may reduce your visibility without generating licensing revenue. Evaluate each AI system's value to your audience.

Verification

  1. Test robots.txt with a validator to ensure syntax is correct
  2. Verify AI crawler user agents are correctly blocked: check access logs for GPTBot, ClaudeBot, etc.
  3. Confirm search engine access is preserved: verify Googlebot can access your content via Search Console
  4. Check X-Robots-Tag headers are returned correctly: curl -I https://yourdomain.com/ | grep -i x-robots
  5. Review and update your AI crawler list quarterly — new crawlers appear regularly
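Steps 1 and 2 above can be checked before deploying, using Python's standard-library robots.txt parser. The excerpt below mirrors the template on this page (shortened to two agents for brevity):

```python
from urllib import robotparser

# Excerpt of the robots.txt template above; parse() accepts the
# file's lines directly, so no HTTP fetch is needed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# AI training crawler blocked, search engine permitted.
assert rp.can_fetch("GPTBot", "https://example.com/article") is False
assert rp.can_fetch("Googlebot", "https://example.com/article") is True
```

Substitute your full robots.txt and loop over your crawler list to turn this into a pre-deployment check.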

Related reading on wplus.net