Milestone·GEO·infra·crawlers

We almost let our own CDN block the robots we invited

Entry 004·Day 0·By Harry Osmar Sitohang·Jun 9, 2026

We wrote JSON-LD, an llms.txt and citations for one reason: so ChatGPT, Claude and Perplexity would read this site and quote it. Then we found the setting that could quietly undo all of it — at our own front door.

The thing nobody warns you about

Your robots.txt can roll out the red carpet for every AI crawler, and it changes nothing if the CDN decides otherwise. Since mid-2025, Cloudflare has blocked AI crawlers by default on new zones, and a "Block AI bots" toggle enforces that block at the wire: the crawler gets a 403 no matter what your file says.

robots.txt is a request; the edge is the gate

This is the distinction that took us a while to internalise.

What robots.txt does

A polite, voluntary request
A well-behaved crawler reads it and complies
Edit it freely — it signals intent

What the edge does

Enforces allow or block at the wire
Returns 403 regardless of robots.txt
This is what actually decides reachability
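To make the "polite request" half concrete, here is a minimal sketch of what a well-behaved crawler does with the file, using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs and the helper name crawler_may_fetch are our own illustrations, not the site's actual file. Nothing in this check touches the network or the edge, which is the point: it only describes what the crawler chooses to do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that welcomes GPTBot; not the site's real file.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a compliant crawler would conclude from robots.txt alone."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The invited crawler is allowed by the file.
    print(crawler_may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/entry-004"))
    # Any other bot is asked to stay out of /private/.
    print(crawler_may_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

A True here says the crawler is willing to come in. It says nothing about whether the request ever reaches the origin.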

The flow a crawler actually hits

GPTBot request → Cloudflare edge → Bot rules + managed robots.txt → 200 or 403

The repo robots.txt only matters if the edge lets the request through first.
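The order of decisions can be sketched as a tiny function. This is a simplified model under our own assumptions (two booleans, one hypothetical function name outcome), not a description of how Cloudflare implements its rules. It only shows that the edge answers first and robots.txt is consulted second.

```python
def outcome(edge_allows: bool, robots_allows: bool) -> str:
    """Model one crawler request: the edge decides before robots.txt matters."""
    if not edge_allows:
        # Blocked at the edge: the crawler never gets to act on robots.txt.
        return "403 at the edge"
    if robots_allows:
        return "200, page crawled"
    # Reachable, but a compliant crawler declines to request the page.
    return "reachable, but skipped by a polite crawler"


if __name__ == "__main__":
    # Walk all four combinations of edge setting and robots.txt intent.
    for edge_allows in (True, False):
        for robots_allows in (True, False):
            print(f"edge={edge_allows!s:5} robots={robots_allows!s:5} -> "
                  f"{outcome(edge_allows, robots_allows)}")
```

Both rows with the edge set to block end in a 403, whatever the file says.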

What we'd tell ourselves

Do

Choose "allow AI crawlers" during Cloudflare onboarding
Set managed robots.txt to "disable" so your own file is served
Verify with curl -A GPTBot -I against your own URL and expect a 200
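The curl check in the last step can be repeated for several crawlers at once. Below is a minimal sketch using only the Python standard library. GPTBot is the token from the step above; ClaudeBot and PerplexityBot are assumptions on our part, so confirm the current tokens in each vendor's documentation. The names head_status, verdict and check_site are hypothetical helpers. The script sends a HEAD request, as curl -I does, and an edge may treat a self-declared user-agent differently from a real crawler's traffic, so treat the result as a first signal rather than proof.

```python
import sys
import urllib.error
import urllib.request

# GPTBot comes from the checklist; the other two tokens are assumptions.
CRAWLER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return error.code


def verdict(status: int) -> str:
    """Translate a status code into the checklist's terms."""
    if status == 200:
        return "reachable"
    if status == 403:
        return "blocked (check the edge bot settings)"
    return f"unexpected status {status}"


def check_site(url: str) -> dict[str, str]:
    """Probe the URL once per crawler user-agent."""
    return {agent: verdict(head_status(url, agent)) for agent in CRAWLER_AGENTS}


if __name__ == "__main__":
    # Usage: python check_crawlers.py https://your-site.example/
    if len(sys.argv) == 2:
        for agent, result in check_site(sys.argv[1]).items():
            print(f"{agent}: {result}")
```

Run it once after onboarding and again after any change to the bot settings, since the edge configuration can change without the repository changing.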

Don't

Assume a friendly robots.txt is enough on its own
Enable "Block AI bots" or the managed "don't scrape" robots.txt
Trust the user-agent alone: it is self-declared, and even Perplexity was de-listed as a Cloudflare verified bot for stealth crawling

We almost shipped a site that argued for being cited, then refused the visitors who'd do the citing. The fix was a single toggle and a curl command — finding it was the whole lesson.

Sources

pangaea.id — the repository