We almost let our own CDN block the robots we invited

We wrote JSON-LD, an llms.txt and citations for one reason: so ChatGPT, Claude and Perplexity would read this site and quote it. Then we found the setting that could quietly undo all of it — at our own front door.

The thing nobody warns you about

Your robots.txt can roll out the red carpet for every AI crawler, and it changes nothing if the CDN decides otherwise. Since mid-2025, Cloudflare has blocked AI crawlers by default on new zones, and a "Block AI bots" toggle enforces that at the wire: a 403, no matter what your file says.

robots.txt is a request; the edge is the gate

This is the distinction that took us a while to internalise.

What robots.txt does

  • A polite, voluntary request
  • A well-behaved crawler reads it and complies
  • Edit it freely — it signals intent

What the edge does

  • Enforces allow or block at the wire
  • Returns 403 regardless of robots.txt
  • This is what actually decides reachability
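
To make the split concrete, here is a minimal sketch in Python. It parses a hypothetical robots.txt with the standard library's urllib.robotparser, then combines that answer with whatever status the edge returned. The file contents, the example.com URL and the helper names robots_allows and reachable are assumptions for illustration, not taken from this site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that explicitly welcomes GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what the file *asks* of a crawler that chooses to honour it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def reachable(robots_txt: str, user_agent: str, url: str, edge_status: int) -> bool:
    """A page is reachable only if the edge answered 200 AND the file permits it."""
    # The edge status is checked first: a 403 ends the story before
    # robots.txt is ever consulted by a compliant crawler.
    return edge_status == 200 and robots_allows(robots_txt, user_agent, url)


url = "https://example.com/"
print(robots_allows(ROBOTS_TXT, "GPTBot", url))   # the file says yes
print(reachable(ROBOTS_TXT, "GPTBot", url, 403))  # the edge says no
```

The same welcoming file yields opposite outcomes depending only on the edge status, which is the whole point of the two lists above.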

The flow a crawler actually hits

GPTBot request → Cloudflare edge → Bot rules + managed robots.txt → 200 or 403
The repo robots.txt only matters if the edge lets the request through first.
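
The flow above can be sketched as a toy model in Python. This is not Cloudflare's actual logic: the EdgeConfig fields, the crawler token list, the placeholder managed directives and the way they are combined with the repo file are all assumptions, chosen only to show the order in which decisions are made.

```python
from dataclasses import dataclass

# Illustrative user-agent tokens; a real edge uses far richer signals.
AI_CRAWLERS = ("gptbot", "claudebot", "perplexitybot")

# Placeholder for whatever a managed "don't scrape" robots.txt would contain.
MANAGED_DIRECTIVES = "User-agent: GPTBot\nDisallow: /\n"


@dataclass
class EdgeConfig:
    block_ai_bots: bool        # hypothetical stand-in for the "Block AI bots" toggle
    managed_robots_txt: bool   # hypothetical stand-in for the managed robots.txt setting


def handle(path: str, user_agent: str, edge: EdgeConfig, repo_robots_txt: str):
    """Return (status, body) for one request, deciding at the edge first."""
    is_ai = any(token in user_agent.lower() for token in AI_CRAWLERS)

    # Step 1: bot rules. A blocked crawler never reaches the origin.
    if edge.block_ai_bots and is_ai:
        return 403, ""

    # Step 2: robots.txt. With the managed option on, the crawler no longer
    # sees the repo file as written (combining by prepending is an assumption).
    if path == "/robots.txt":
        body = repo_robots_txt
        if edge.managed_robots_txt:
            body = MANAGED_DIRECTIVES + body
        return 200, body

    # Step 3: everything else is served by the origin.
    return 200, "<html>page</html>"


friendly = "User-agent: *\nAllow: /\n"
print(handle("/", "GPTBot", EdgeConfig(True, False), friendly))
print(handle("/robots.txt", "GPTBot", EdgeConfig(False, False), friendly))
```

Only the configuration with both settings off hands the crawler the repo file unchanged.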

What we'd tell ourselves

Do

  • Choose "allow AI crawlers" during Cloudflare onboarding
  • Set managed robots.txt to "disable" so your own file is served
  • Verify with curl -A GPTBot -I against your own URL and expect a 200
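
The curl check in the last item can also be scripted. The sketch below uses Python's standard urllib to send a HEAD request with a crawler-style user-agent and then classifies the status. The function names head_status and verdict are hypothetical, and a request from your own machine only borrows the crawler's name, so treat the result as a hint rather than proof of what the real crawler sees.

```python
import urllib.error
import urllib.request


def head_status(url: str, user_agent: str = "GPTBot", timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want to report.
        return err.code


def verdict(status: int) -> str:
    """Translate a status code into the outcome this article cares about."""
    if status == 200:
        return "reachable"
    if status == 403:
        return "blocked at the edge"
    return f"unexpected status {status}"


# Usage (performs a real network request, so it is left commented out):
# print(verdict(head_status("https://your-site.example/")))
```

Running it after each change to the edge settings gives a quick before-and-after comparison.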

Don't

  • Assume a friendly robots.txt is enough on its own
  • Enable "Block AI bots" or the managed "don't scrape" robots.txt
  • Trust the user-agent alone: Cloudflare de-listed even Perplexity as a verified bot over stealth crawling
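
One way to avoid trusting the user-agent alone is to check the source address as well. The sketch below, in Python, accepts a crawler claim only when the IP falls inside a range listed for that crawler. The ranges here are placeholders from the reserved documentation blocks, not real crawler ranges, and the names PUBLISHED_RANGES and is_verified are hypothetical; in practice you would load whatever ranges the crawler's operator publishes.

```python
import ipaddress

# Placeholder CIDRs from reserved documentation ranges (RFC 5737).
# These are NOT real crawler ranges; substitute the operator's published list.
PUBLISHED_RANGES = {
    "gptbot": ["192.0.2.0/24"],
    "claudebot": ["198.51.100.0/24"],
}


def is_verified(user_agent: str, source_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True only if the UA names a known crawler AND the IP is in its ranges."""
    address = ipaddress.ip_address(source_ip)
    ua = user_agent.lower()
    for token, cidrs in ranges.items():
        if token in ua:
            # The name matched; now the address has to back it up.
            return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)
    # Unknown user-agents are never treated as verified crawlers.
    return False


print(is_verified("GPTBot/1.0", "192.0.2.10"))   # name and address agree
print(is_verified("GPTBot/1.0", "203.0.113.5"))  # name claimed from elsewhere
```

A request that carries the right name from the wrong address is exactly the case a user-agent check on its own would miss.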

We almost shipped a site that argued for being cited, then refused the visitors who'd do the citing. The fix was a single toggle and a curl command — finding it was the whole lesson.

Sources

  1. pangaea.id — the repository