Ask "should I let AI crawlers read my site?" and you'll get confident answers in both directions. They're all incomplete, because the honest answer is that every option costs you something. There is no setting that gives you visibility, control, revenue and clean analytics at once. Governing crawler access is the art of choosing which thing to give up — and re-choosing as the math changes.

Four choices, four renunciations

Allow everything — you renounce control and resources

The open posture maximizes your eligibility to be retrieved and cited. It also means your content feeds model training, gets cloned by scrapers, and is hammered by hundreds of bots — inflating bandwidth and poisoning your analytics. You trade control and resources for reach.

Block AI crawlers — you renounce reach

Blocking keeps compliant crawlers from training on your content and makes theft harder, though a robots.txt rule alone won't stop a scraper that ignores it. But you cannot be the answer in an engine you forbade from reading you. As AI search absorbs an estimated 15–20% of informational queries, the cost of invisibility there rises every quarter. You trade reach for protection.

Rate-limit — you renounce precision

Throttling feels like the reasonable middle. But a rate limit is a blunt instrument: it can't easily tell a user's agent completing a purchase from a scraper cloning your catalog. Set it tight and you block real demand; set it loose and you barely slow the abuse. You trade precision for simplicity.
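To make the bluntness concrete, here is a minimal token-bucket sketch. The class, the function name and the numbers (one request per second, bursts of three) are illustrative assumptions, not a recommendation or any vendor's implementation. The point is that the limiter sees only request timing, never intent.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(bucket: TokenBucket, timestamps: list[float]) -> int:
    """Count how many requests at the given times (in seconds) get through."""
    return sum(bucket.allow(t) for t in timestamps)


if __name__ == "__main__":
    # Hypothetical traffic: an agent's short checkout burst vs. a steady scraper.
    agent_burst = [0.0, 0.2, 0.4, 0.6, 0.8]
    scraper_minute = [i * 0.1 for i in range(600)]
    print(count_allowed(TokenBucket(1.0, 3.0), agent_burst), "of", len(agent_burst))
    print(count_allowed(TokenBucket(1.0, 3.0), scraper_minute), "of", len(scraper_minute))
```

Under these assumed settings the agent's five-request burst is cut off partway, while the scraper still collects roughly a page per second for as long as it likes. Loosen the limit to save the agent and the scraper speeds up with it.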

Charge — you renounce certainty of being chosen

In 2025 Cloudflare added a third option to the allow-or-block binary: Pay Per Crawl, a marketplace where publishers set a per-crawl price and each AI company decides whether to pay it. It's compelling, but only large publishers have the leverage to make a price stick, and any price is a filter: name a number an AI company won't pay, and you've blocked yourself from that engine's answers. You trade guaranteed inclusion for the chance at compensation.
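The "price is a filter" point can be sketched as a toy model. This is not Cloudflare's API. The function names, the engine names and every price below are hypothetical, and the only real-world detail borrowed is the HTTP 402 Payment Required status code.

```python
from http import HTTPStatus
from typing import Optional


def crawl_status(publisher_price: float,
                 crawler_max_price: Optional[float]) -> HTTPStatus:
    """Serve the page only if the crawler will pay at least the asking price.

    crawler_max_price is None for a crawler that has not agreed to pay at all.
    """
    if crawler_max_price is not None and crawler_max_price >= publisher_price:
        return HTTPStatus.OK
    return HTTPStatus.PAYMENT_REQUIRED  # 402: effectively a block


def engines_reached(publisher_price: float,
                    budgets: dict[str, Optional[float]]) -> list[str]:
    """List the engines that would still be able to read the page."""
    return sorted(
        name for name, budget in budgets.items()
        if crawl_status(publisher_price, budget) == HTTPStatus.OK
    )


if __name__ == "__main__":
    # Hypothetical per-crawl budgets, in dollars.
    budgets = {"engine_a": 0.01, "engine_b": 0.002, "engine_c": None}
    for price in (0.001, 0.005, 0.05):
        print(price, engines_reached(price, budgets))
```

Each step up in price removes engines from the list, and for those engines a price behaves exactly like a block.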

There is no "correct" crawler policy. There is only the trade you're willing to make this quarter — and the discipline to revisit it next quarter.

The balance keeps moving

Worse, the cost-benefit isn't fixed. The same policy can be right today and wrong in six months because the variables underneath it move:

  • As AI search grows, the cost of blocking rises — you forfeit more visibility each quarter.
  • As scraping and bot volume grow (bots are now ~53% of web traffic), the cost of allowing rises — more resource drain, more analytics noise.
  • As pay-per-crawl matures and licensing deals proliferate (see the web economy in the age of agents), charging becomes a realistic fourth option, which changes the relative cost of the other three.

A crawler policy is not a checkbox you tick once. It's a position you hold and re-evaluate, like an allocation.

The deeper problem: do you even hold the dial?

All of the above assumes you can act on your decision. Often you can't — and three gates decide whether you really control your own access policy.

1. Your stack

Can you actually edit robots.txt, set response headers, and write server- or edge-level rules? On a flexible stack (a self-hosted site, WordPress with the right access) you can. On a locked website builder or rigid SaaS, the controls may simply not be exposed — your policy is whatever the vendor shipped.

2. Your awareness

Most site owners don't know this is a decision at all. And a decision you don't know you're making is made for you — by the default. The single biggest determinant of a site's crawler policy in 2026 is whether anyone on the team realized it mattered.

3. Your intermediaries

This is the gate that has moved the most. On July 1, 2025, Cloudflare became the first major infrastructure provider to block AI crawlers by default for new domains, flipping the default for those sites from opt-out to opt-in. Your CDN, host, or WAF increasingly decides crawler policy upstream, sometimes before you've configured anything. The upside: providers now let crawlers declare their purpose (training, inference, or indexing for search), enabling far more granular control than a lone robots.txt ever could. The catch: the dial may be on someone else's dashboard.
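One way to see what is being said on your behalf is to audit the robots.txt your domain actually serves, whoever wrote it. The sketch below uses Python's standard urllib.robotparser. The list of crawler tokens is an illustrative assumption and should be checked against each vendor's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumed, non-exhaustive list of AI-related robots.txt tokens.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "PerplexityBot"]


def audit_robots(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Report, per crawler token, whether robots.txt permits fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in AI_CRAWLER_TOKENS}


if __name__ == "__main__":
    # Sample file for illustration; paste in the one your own domain serves.
    sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
    for token, allowed in audit_robots(sample).items():
        print(f"{token:16} {'allowed' if allowed else 'blocked'}")
```

This only covers what robots.txt declares. A CDN or WAF can still block or challenge a crawler at the network level without any trace in this file, so the provider's dashboard needs the same check.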

How to decide deliberately

  1. Find your dial first. Before tuning anything, learn who currently controls access — your code, your CDN, your host — and what their default already does. You may discover the decision was made for you.
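  2. Reject all-or-nothing. Use granular signals. A Content-Signal line in robots.txt separates search, ai-input (retrieval) and ai-train (training), so you can ask to be found and cited without feeding training. It is a declared preference rather than an enforcement mechanism, so pair it with real controls for crawlers that ignore it.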
  3. Set a default posture, then justify exceptions. For most sites that want visibility: allow search and retrieval, restrict training, and rate-limit or block only identified abusers. Distinguish crawler purpose wherever your tooling exposes it.
  4. Protect agents, throttle scrapers. A user's agent completing a task is a customer; a catalog-cloning scraper is not. Target the second with auth and limits without blocking the first.
  5. Revisit on a cadence. Put a quarterly review on the calendar. The right trade in Q1 can be a mistake by Q3 because the underlying costs moved.
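The default posture from steps 2 and 3 can be written down as a small robots.txt generator. This is a sketch under assumptions: the Content-Signal line follows the search / ai-input / ai-train syntax described above, the function name is hypothetical, and "ExampleScraperBot" is a made-up user agent standing in for an abuser you have identified.

```python
def build_robots_txt(search: bool, ai_input: bool, ai_train: bool,
                     blocked_agents: tuple[str, ...] = ()) -> str:
    """Render a robots.txt with one Content-Signal line and optional blocks."""

    def yes_no(flag: bool) -> str:
        return "yes" if flag else "no"

    lines = [
        "User-Agent: *",
        "Content-Signal: "
        f"search={yes_no(search)}, ai-input={yes_no(ai_input)}, ai-train={yes_no(ai_train)}",
        "Allow: /",
    ]
    # Exceptions: fully disallow specific, identified abusers.
    for agent in blocked_agents:
        lines += ["", f"User-Agent: {agent}", "Disallow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Visibility-first default: be found and cited, opt out of training.
    print(build_robots_txt(search=True, ai_input=True, ai_train=False,
                           blocked_agents=("ExampleScraperBot",)))
```

Keeping the policy in a function like this, rather than a hand-edited file, also makes the quarterly review easier: the change is one argument, and the diff shows exactly which trade you made.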

The uncomfortable truth is that "set it and forget it" is itself a choice — usually the choice to let your CDN's default and your own inattention decide how the AI era sees you. The teams that win don't find a perfect policy. They stay awake to a moving trade-off, and they keep their hand on the dial.

Frequently asked questions

Should I block AI crawlers from my website?

It depends on your goal, and it's a trade-off. Blocking protects content from training and scraping but forfeits AI visibility: an engine can't cite content you blocked it from reading. The common middle path is to allow search and retrieval while restricting training, using granular signals rather than an all-or-nothing block.

What is Cloudflare Pay Per Crawl?

A Cloudflare marketplace, launched in 2025, that adds a third option beyond allow and block: charging AI companies each time they crawl your pages. Publishers set rates; AI companies choose whether to accept. It arrived alongside Cloudflare's July 2025 move to block AI crawlers by default for new domains, which changed the default from opt-out to opt-in.

Do I even control which crawlers can access my site?

Not always. Real control depends on whether your stack lets you edit robots.txt, headers and server rules; whether you're aware the decision exists; and what your infrastructure provider does by default. With Cloudflare blocking AI crawlers by default for new domains, the policy may be set upstream — so first find out who holds your dial.