We almost let our own CDN block the robots we invited
We wrote JSON-LD, an llms.txt and citations for one reason: so ChatGPT, Claude and
Perplexity would read this site and quote it. Then we found the setting that could quietly
undo all of it — at our own front door.
The thing nobody warns you about
Your robots.txt can roll out the red carpet for every AI crawler, and it changes nothing if
the CDN decides otherwise. Since mid-2025, Cloudflare has blocked AI crawlers by default on
new zones, and its "Block AI bots" toggle enforces that at the wire: a 403 no matter what
your file says.
robots.txt is a request; the edge is the gate
This is the distinction that took us a while to internalise.
What robots.txt does
- A polite, voluntary request
- A well-behaved crawler reads it and complies
- Edit it freely — it signals intent
What the edge does
- Enforces allow or block at the wire
- Returns 403 regardless of robots.txt
- This is what actually decides reachability
The flow a crawler actually hits
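One way to read that flow is as two checks in a fixed order: the edge decides first, and only a request that gets past it ever reaches the robots.txt stage. The sketch below is a toy model under that assumption, not Cloudflare's actual logic. The function name `crawl_outcome`, the flag `edge_blocks_ai_bots` and the example URL are hypothetical; the robots.txt parsing uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def crawl_outcome(edge_blocks_ai_bots: bool, robots_txt: str,
                  user_agent: str, url: str) -> str:
    """Model one crawler request: the edge decides first, robots.txt second."""
    # Step 1: the edge answers before the origin or robots.txt matter at all.
    if edge_blocks_ai_bots:
        return "403 at the edge"

    # Step 2: a well-behaved crawler reads robots.txt and complies voluntarily.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return "crawler skips the page (robots.txt disallow)"

    return "200, page fetched"


# A "red carpet" robots.txt for one crawler (illustrative content).
FRIENDLY_ROBOTS = """\
User-agent: GPTBot
Allow: /
"""

# Same friendly file, two different edge settings.
print(crawl_outcome(True, FRIENDLY_ROBOTS, "GPTBot", "https://example.com/post"))
# → 403 at the edge
print(crawl_outcome(False, FRIENDLY_ROBOTS, "GPTBot", "https://example.com/post"))
# → 200, page fetched
```

The friendly file is identical in both calls; only the edge setting changes the result, which is the whole point of the distinction.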
What we'd tell ourselves
Do
- Choose "allow AI crawlers" during Cloudflare onboarding
- Set managed robots.txt to "disable" so your own file is served
- Verify with `curl -A GPTBot -I` against your own URL and expect a 200
Don't
- Assume a friendly robots.txt is enough on its own
- Enable "Block AI bots" or the managed "don't scrape" robots.txt
- Trust the user-agent alone: Cloudflare de-listed even Perplexity as a verified bot over stealth crawling
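The curl check can be repeated for several crawlers at once. Below is a minimal sketch using only Python's standard library. It assumes that a HEAD request with a crawler's user-agent token is a fair stand-in for the real bot, which is only an approximation, since an edge can also look at signals other than the user-agent. The token list beyond GPTBot and the helper names (`head_status`, `verdict`, `check_site`) are assumptions for illustration.

```python
import urllib.error
import urllib.request

# Assumption: a small sample of crawler tokens. Only GPTBot comes from the
# text above; check each vendor's documentation for the current names.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        # urllib follows redirects, so this is the status of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; the code is what we want.
        return err.code


def verdict(status: int) -> str:
    """Translate a status code into the outcome this post cares about."""
    if status == 200:
        return "reachable"
    if status == 403:
        return "blocked (likely at the edge)"
    return f"unexpected status {status}"


def check_site(url: str, fetch=head_status) -> dict:
    """Return a verdict per crawler token; `fetch` is injectable for offline tests."""
    return {agent: verdict(fetch(url, agent)) for agent in AI_USER_AGENTS}
```

Calling `check_site("https://your-site.example/")` with your own URL returns one verdict per token. Any "blocked" entry points at an edge setting rather than at robots.txt.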
We almost shipped a site that argued for being cited, then refused the visitors who'd do the citing. The fix was a single toggle and a curl command — finding it was the whole lesson.