A practical guide to managing AI crawlers on your site: when to block, when to allow, robots.txt patterns, CDN bot rules, and how to measure the impact.
Before you paste anything into robots.txt, decide what you're trying to achieve. There are really only four positions a site can take on AI crawlers:
Most teams default to position 1 without thinking about it. The interesting question is whether your position should be position 2, 3, or 4 — and you can't answer that without data on what AI crawlers are actually doing on your site.
For each AI bot, ask three questions:
A "yes" to question 1 or 2, with a "no" to question 3, generally means allow. A "yes" to question 3 alone means block. Mixed answers mean monitor for 30-60 days first.
robots.txt is the gentleman's agreement of the web. Well-behaved bots honor it; bad actors ignore it. All the major AI companies (OpenAI, Anthropic, Google, Meta, Apple) honor robots.txt for their named bots — they have legal teams who care.
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: claude-web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: GrokBot
Disallow: /
User-agent: cohere-ai
Disallow: /
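This blocks the major LLM crawlers but leaves traditional search bots (Googlebot, Bingbot) untouched, so you don't lose SEO. Note: Google-Extended is a robots.txt control token rather than a separate crawler. Disallowing it opts your content out of Gemini training and grounding without removing you from Google Search. It does not control AI Overviews, which are a Search feature governed by Googlebot and the usual snippet controls such as nosnippet.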
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Allow live-fetch agents — these fire when a user asks the AI about your page
# User-agent: ChatGPT-User (not listed = allowed)
# User-agent: Perplexity-User (not listed = allowed)
# User-agent: claude-web (not listed = allowed)
This is a defensible middle ground: your content isn't used to train new models, but a user asking ChatGPT "what does example.com say about X?" still gets a fresh fetch of your page.
User-agent: GPTBot
Disallow: /pricing
Disallow: /customers
Disallow: /case-studies
User-agent: ClaudeBot
Disallow: /pricing
Disallow: /customers
Disallow: /case-studies
Useful when you want AI assistants to recommend your product (so allow the homepage, features, docs) but you don't want them quoting your exact pricing or customer logos out of context. AI assistants regularly misquote prices because they were trained on an outdated snapshot of the page; blocking /pricing from the training crawlers forces the model to either skip pricing or fetch it live.
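The three robots.txt patterns above differ only in which bots are listed and which paths are disallowed, so they can be generated from one list instead of being maintained by hand. The following is a minimal sketch, not a prescribed tool: `RobotsGroup` and `buildRobotsTxt` are hypothetical names introduced here for illustration, and the bot and path lists are copied from the examples above.

```typescript
// Hypothetical helper types and names, introduced only for this sketch.
type RobotsGroup = { userAgent: string; disallow: string[] };

// Renders one "User-agent" group per bot, separated by blank lines.
function buildRobotsTxt(groups: RobotsGroup[]): string {
  return groups
    .map((g) => {
      const rules = g.disallow.map((path) => "Disallow: " + path);
      return ["User-agent: " + g.userAgent].concat(rules).join("\n");
    })
    .join("\n\n") + "\n";
}

// Training crawlers from the "block training, allow live-fetch" pattern above.
const trainingBots = [
  "GPTBot",
  "ClaudeBot",
  "Google-Extended",
  "Bytespider",
  "CCBot",
  "Applebot-Extended",
];

// Pattern: block the training crawlers site-wide.
const blockTraining = buildRobotsTxt(
  trainingBots.map((bot) => ({ userAgent: bot, disallow: ["/"] }))
);

// Pattern: keep two crawlers off specific paths only.
const sensitivePaths = ["/pricing", "/customers", "/case-studies"];
const blockPaths = buildRobotsTxt(
  ["GPTBot", "ClaudeBot"].map((bot) => ({ userAgent: bot, disallow: sensitivePaths }))
);
```

Live-fetch agents are simply left out of the list, which matches the "not listed = allowed" comments in the pattern above.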
robots.txt is a request. CDN bot rules are enforcement. If you have Cloudflare, Fastly, or Vercel in front of your site, you can return 403/429 to specific bot fingerprints and they have no choice.
Cloudflare lets you write custom rules in the Web Application Firewall that match on bot signals and the User-Agent header. A typical block expression looks like:
(cf.client.bot) and (http.user_agent contains "GPTBot")
Set the action to Block (or Challenge if you want to be less aggressive). Cloudflare also ships a free "AI Scrapers and Crawlers" managed rule you can toggle in one click, which covers most of the bots in this list.
Cloudflare's recently shipped Content Signals mechanism is a more nuanced version of this: you declare whether each path may be used for training, search, or inference, and compliant crawlers are expected to honor the declaration on their own. Worth enabling alongside hard blocks.
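// middleware.ts
export function middleware(req: Request) {
  const ua = req.headers.get('user-agent') || '';
  // Case-insensitive match on the crawler tokens to hard-block
  const aiBots = /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot/i;
  if (aiBots.test(ua)) {
    return new Response('Forbidden', { status: 403 });
  }
  // Returning nothing lets every other request continue as normal
}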
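# Goes in the http {} context: flag requests whose User-Agent matches an AI crawler
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*PerplexityBot 1;
    ~*Bytespider 1;
    ~*CCBot 1;
}
server {
    # Refuse flagged requests before any location is processed
    if ($is_ai_bot) {
        return 403;
    }
}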
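The middleware and nginx rules above block every bot they list. If you want the edge to enforce the "block training, allow live-fetch" position instead, the decision can be split into two patterns. This is a rough sketch under stated assumptions: `policyFor` and `statusFor` are hypothetical names, the bot lists are copied from the robots.txt examples earlier in this guide, and the sample User-Agent strings in the usage note are illustrative rather than exact.

```typescript
type BotPolicy = "block" | "allow";

// Google-Extended and Applebot-Extended are left out on purpose: they are
// robots.txt control tokens and do not appear in request User-Agent headers.
const TRAINING_BOTS = /GPTBot|ClaudeBot|Bytespider|CCBot/i;
const LIVE_FETCH_AGENTS = /ChatGPT-User|Perplexity-User|claude-web/i;

// Decides what to do with a request based only on its User-Agent header.
function policyFor(userAgent: string | null): BotPolicy {
  const ua = userAgent || "";
  // Check live-fetch agents first so a broader pattern never catches them.
  if (LIVE_FETCH_AGENTS.test(ua)) return "allow";
  if (TRAINING_BOTS.test(ua)) return "block";
  // Ordinary browsers and traditional search bots pass through.
  return "allow";
}

// Maps the policy to the status code the edge would return.
function statusFor(userAgent: string | null): number {
  return policyFor(userAgent) === "block" ? 403 : 200;
}
```

For example, `statusFor("Mozilla/5.0 (compatible; GPTBot/1.1)")` yields 403, while `statusFor("Mozilla/5.0 (compatible; ChatGPT-User/1.0)")` yields 200. User-Agent matching alone is easy to spoof, so treat this as a policy layer, not as proof of who is calling.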
An increasingly popular third path: don't block AI bots, but serve them clean markdown instead of your full HTML. The benefits:
Two ways to do this:
- Static files: /llms.txt, /llms-full.txt, and /md/<path> markdown files at predictable URLs. AI bots that know the convention fetch them directly. This is what Crawlytics generates for you.
- Content negotiation: when a request sends Accept: text/markdown, return markdown instead of HTML for the canonical URL. This is what Cloudflare's Markdown for Agents ships.
Both approaches work; the first reaches more bots (most clients don't send Accept: text/markdown yet), the second is lower-latency. Full comparison here.
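The Accept-header approach comes down to one check per request. Below is a simplified sketch of that check, assuming a hypothetical `wantsMarkdown` helper. It only looks for an explicit text/markdown entry with a non-zero q-value, and it deliberately ignores wildcards such as */* and does not rank markdown against other listed types.

```typescript
// Returns true when the Accept header explicitly lists text/markdown
// with a q-value greater than zero (a missing q-value counts as 1).
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  for (const part of accept.split(",")) {
    const fields = part.trim().split(";");
    const mediaType = fields[0].trim().toLowerCase();
    if (mediaType !== "text/markdown") continue;
    let q = 1;
    for (const field of fields.slice(1)) {
      const match = /^\s*q\s*=\s*([0-9.]+)\s*$/i.exec(field);
      if (match) q = parseFloat(match[1]);
    }
    return q > 0;
  }
  return false;
}

// Picks the representation to serve for the canonical URL.
function chooseRepresentation(accept: string | null): "markdown" | "html" {
  return wantsMarkdown(accept) ? "markdown" : "html";
}
```

If you serve two representations from one URL this way, the response should also carry `Vary: Accept` so that caches keep the HTML and markdown versions apart.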
Three things to check:
Run this from a test environment:
curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot" https://yoursite.com/
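If your block rule fires you should get a 403. If you get a 200, nothing at the CDN or server level is enforcing the block and you are relying on robots.txt alone, which is fine if that's intentional and a problem if you meant to hard-block. Keep in mind that a Cloudflare rule gated on cf.client.bot only matches verified bots, so a spoofed curl request will not trigger it.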
Grep your server logs for the User-Agents you allowed:
grep -iE 'PerplexityBot|ChatGPT-User|claude-web' /var/log/nginx/access.log | tail -50
If the count is climbing over time, your allow list is working as intended. If it dropped to zero after a config change, you accidentally blocked something.
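To see whether the count is actually climbing, it helps to bucket the matches by day, not just tail the last 50 lines. The sketch below assumes log lines in nginx's default "combined" format, where the timestamp looks like `[01/Mar/2025:10:00:00 +0000]`. `countBotHits` is a hypothetical helper and the sample lines are made up for illustration.

```typescript
// Per-day, per-bot request counts, keyed by the date string from the log.
type DailyCounts = { [day: string]: { [bot: string]: number } };

const WATCHED_BOTS = ["PerplexityBot", "ChatGPT-User", "claude-web"];

function countBotHits(lines: string[], bots: string[]): DailyCounts {
  const counts: DailyCounts = {};
  for (const line of lines) {
    // Keep only the date part of the bracketed timestamp.
    const dateMatch = /\[(\d{2}\/[A-Za-z]{3}\/\d{4}):/.exec(line);
    if (!dateMatch) continue;
    const day = dateMatch[1];
    const lower = line.toLowerCase();
    for (const bot of bots) {
      if (lower.indexOf(bot.toLowerCase()) === -1) continue;
      if (!counts[day]) counts[day] = {};
      counts[day][bot] = (counts[day][bot] || 0) + 1;
      break; // count each request at most once
    }
  }
  return counts;
}

// Made-up sample lines, for illustration only.
const sampleLines = [
  '203.0.113.5 - - [01/Mar/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
  '203.0.113.6 - - [01/Mar/2025:10:05:00 +0000] "GET /docs HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
  '203.0.113.7 - - [02/Mar/2025:09:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
  '198.51.100.9 - - [02/Mar/2025:09:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
];
const hits = countBotHits(sampleLines, WATCHED_BOTS);
```

A day that is missing from the result, or a bot that disappears from one day to the next right after a config change, is the "dropped to zero" signal described above.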
This is the bottom-line question. Blocking and allowing only matter if they translate to (or away from) human visits. Two ways to measure:
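- Referrer data: look for visits referred by chat.openai.com, chatgpt.com, perplexity.ai, claude.ai. You'll miss most in-app browser sessions (they strip Referer) but the desktop traffic shows up.
- UTM parameters: ChatGPT tags outbound links with a utm_source value, so analytics shows chatgpt as the source even when Referer is stripped.
Bots that ignore robots.txt exist: scrapers masquerading as legitimate AI bots, abandoned crawlers running on autopilot, and a handful of named bots from less-reputable companies. For these, robots.txt is not enough; you need CDN bot rules, rate limiting, or honeypot pages that ban any IP that fetches them.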
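One of the options this guide names for bots that ignore robots.txt is a honeypot page that bans any IP fetching it. The sketch below shows the idea in its simplest form, under loud assumptions: `HONEYPOT_PATH` is a hypothetical trap path that you would disallow in robots.txt and never link visibly, and the ban list is an in-memory array that a real deployment would replace with shared storage and an expiry.

```typescript
// Hypothetical trap path: disallowed in robots.txt and never linked visibly,
// so only a client that ignores robots.txt should ever request it.
const HONEYPOT_PATH = "/trap-do-not-crawl/";

// In-memory ban list, for illustration only (no persistence, no expiry).
const bannedIps: string[] = [];

// Returns the HTTP status to send for a request from `ip` to `pathname`.
function handleRequest(ip: string, pathname: string): number {
  if (bannedIps.indexOf(ip) !== -1) return 403;
  if (pathname.indexOf(HONEYPOT_PATH) === 0) {
    bannedIps.push(ip); // ban the caller for all later requests
    return 403;
  }
  return 200;
}
```

A client that requests the trap path once gets 403 for that request and for everything afterwards, while other IPs are unaffected. Banning by IP can catch shared proxies and NAT gateways, so a short ban duration is the safer default.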
If you want a "good enough" starting position without overthinking it:
- Publish /llms.txt to make the allowed bots' job easier and get better citations
This is more work than "block everything" but it's also the only way to make a decision that aligns with your actual business outcomes instead of a vibes-based reaction.
Add the following to your robots.txt: User-agent: GPTBot then Disallow: /. Repeat for ChatGPT-User and OAI-SearchBot if you want to block live-fetch and search-index bots too. For hard enforcement (not just a polite request), add a CDN bot rule in Cloudflare, Fastly, or Vercel that returns 403 to that User-Agent.
List each major bot explicitly in robots.txt: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, claude-web, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Bytespider, CCBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended, GrokBot, cohere-ai. Note that Google-Extended is Google's AI opt-out token: blocking it removes you from Gemini training without affecting your Google Search ranking. It does not opt you out of AI Overviews, which are part of Search.
Allow them if you want to be cited by AI search and AI assistants because blocking the training crawler means your content is absent from the model's knowledge. Block them if your content is paywalled or proprietary. A common middle ground: block pure training crawlers (CCBot, Bytespider, Applebot-Extended, Google-Extended), allow live-fetch agents (ChatGPT-User, Perplexity-User, claude-web) so user-initiated questions about your site still get fresh content.
No. Google-Extended is a separate token Google introduced specifically as an AI opt-out signal. Blocking Google-Extended in robots.txt opts you out of Gemini training, but Googlebot and Googlebot-News still crawl normally and your traditional Google Search ranking is unaffected. AI Overviews are a Search feature, so blocking Google-Extended does not remove you from them.
The major ones do. OpenAI, Anthropic, Google, Meta, Apple, and Perplexity all honor robots.txt for their named bots because they have legal teams that care. A handful of scrapers and less reputable bots ignore robots.txt entirely. For those, you need CDN bot rules, rate limiting, or honeypot pages that ban any IP that fetches them.
This page is part of Crawlytics.app.