How to Manage AI Crawlers (Allow, Block, Monitor) — 2026 Guide

Summary

A practical guide to managing AI crawlers on your site: when to block, when to allow, robots.txt patterns, CDN bot rules, and how to measure the impact.

Start with the framework, not the config

Before you paste anything into robots.txt, decide what you're trying to achieve. There are really only four positions a site can take on AI crawlers:

  1. Allow everything. You want to be cited by every AI assistant. Default for most marketing sites, SaaS, content sites, ecommerce.
  2. Allow but track. You allow AI traffic but want to know who's reading what so you can optimize. Most sites belong here once they get curious.
  3. Allow some, block others. Allow the ones that drive measurable referral traffic (Perplexity, ChatGPT search), block the ones that just train models without sending visitors (CCBot, anthropic-ai). Selective.
  4. Block everything. You're behind a paywall, your content is proprietary, or you're philosophically opposed to AI training on your work. Rare in commercial contexts; common for publishers fighting copyright issues.

Most teams default to position 1 without thinking about it. The interesting question is whether your position should be position 2, 3, or 4 — and you can't answer that without data on what AI crawlers are actually doing on your site.

The "should I allow this bot?" decision

For each AI bot, ask three questions:

  1. Does it drive referral traffic? Perplexity and ChatGPT search produce real human visits to cited sites. Pure training crawlers (CCBot, Bytespider for ByteDance's internal use, Applebot-Extended) don't drive direct traffic — they feed a model whose output may or may not include your site.
  2. Does it serve your customers? If your audience uses Claude or Gemini, having those models trained on your content means your customers get accurate answers about your product. Blocking means accuracy drops.
  3. Is it scraping you in a way you'd consider harmful? Some publishers care about copyright; some don't. Some care about competitive intelligence (e.g., pricing pages being scraped by competitors masquerading as AI bots); some don't.

A "yes" to both questions 1 and 2 generally means allow. A "yes" to question 3 with "no" to the other two means block. Mixed answers mean monitor for 30-60 days first.
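The three questions can be written down as a small decision function. This is only a sketch of one reading of the rule: the names BotAnswers and decide are hypothetical, and it treats every combination other than "clear benefit, no harm" or "harm, no benefit" as a mixed answer.

```typescript
// Hypothetical names; this is an illustration of the three-question rule, not a library API.
interface BotAnswers {
  drivesReferrals: boolean; // question 1
  servesCustomers: boolean; // question 2
  harmfulScraping: boolean; // question 3
}

type Decision = "allow" | "block" | "monitor";

function decide(a: BotAnswers): Decision {
  const benefits = Number(a.drivesReferrals) + Number(a.servesCustomers);
  // Both benefits and no harm: allow.
  if (benefits === 2 && !a.harmfulScraping) return "allow";
  // Harm and no benefit at all: block.
  if (benefits === 0 && a.harmfulScraping) return "block";
  // Everything else is a mixed picture: watch the logs for 30-60 days first.
  return "monitor";
}
```

For example, a bot that sends referrals but also scrapes pricing pages comes out as "monitor", which matches the advice to collect data before deciding.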

robots.txt — the polite signal

robots.txt is the gentleman's agreement of the web. Well-behaved bots honor it; bad actors ignore it. The major AI companies (OpenAI, Anthropic, Google, Meta, Apple) state that their named bots honor robots.txt; they have legal teams who care.

Block all AI bots

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: claude-web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: GrokBot
Disallow: /

User-agent: cohere-ai
Disallow: /

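This blocks the major LLM crawlers but leaves traditional search bots (Googlebot, Bingbot) untouched, so you don't lose SEO. Note: Google-Extended is a robots.txt control token, not a separate crawler. Blocking it opts your content out of Gemini training and grounding without removing you from Google Search. It does not take you out of AI Overviews, which are a Search feature governed by Googlebot and the usual snippet controls.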
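Maintaining sixteen near-identical stanzas by hand invites typos, so some teams generate the file. A minimal sketch, assuming you keep the bot list in code; buildRobotsTxt is a hypothetical helper, and the bot names are copied from the list above.

```typescript
// Bot names copied from the block-all example; buildRobotsTxt is a hypothetical helper.
const AI_BOTS: string[] = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "claude-web",
  "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
  "Bytespider", "CCBot", "Meta-ExternalAgent", "Amazonbot",
  "Applebot-Extended", "GrokBot", "cohere-ai",
];

// Emits one "User-agent" stanza per bot, each with the same Disallow paths.
function buildRobotsTxt(bots: string[], disallow: string[] = ["/"]): string {
  const stanzas = bots.map((bot) =>
    [`User-agent: ${bot}`, ...disallow.map((path) => `Disallow: ${path}`)].join("\n")
  );
  return stanzas.join("\n\n") + "\n";
}
```

Calling buildRobotsTxt(AI_BOTS) reproduces the block-all file; passing a shorter list or specific paths covers the selective variants.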

Block training but allow live-fetch (let users get fresh answers about your site)

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow live-fetch agents — these fire when a user asks the AI about your page
# User-agent: ChatGPT-User       (not listed = allowed)
# User-agent: Perplexity-User    (not listed = allowed)
# User-agent: claude-web         (not listed = allowed)

This is a defensible middle ground: your content isn't used to train new models, but a user asking ChatGPT "what does example.com say about X?" still gets a fresh fetch of your page.

Block specific paths only

User-agent: GPTBot
Disallow: /pricing
Disallow: /customers
Disallow: /case-studies

User-agent: ClaudeBot
Disallow: /pricing
Disallow: /customers
Disallow: /case-studies

Useful when you want AI assistants to recommend your product (so allow the homepage, features, docs) but you don't want them quoting your exact pricing or customer logos out of context. AI assistants regularly misquote prices because they were trained on an outdated copy of the page; blocking /pricing from the training crawlers forces the model to either skip pricing or fetch it live.
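Before deploying path rules like these, it can help to check what a given bot would be allowed to fetch. The following is a deliberately simplified matcher, not a full RFC 9309 implementation: it handles only User-agent, Allow, and Disallow lines with plain prefix paths (no wildcards), and parseRobots and isAllowed are hypothetical names.

```typescript
interface Rule { allow: boolean; path: string }
interface Group { agents: string[]; rules: Rule[] }

// Splits robots.txt text into groups of user agents and their rules.
function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === "allow" || key === "disallow") && current !== null) {
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (value !== "") current.rules.push({ allow: key === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// True when the named bot may fetch the path; falls back to the "*" group.
function isAllowed(text: string, bot: string, path: string): boolean {
  const groups = parseRobots(text);
  const name = bot.toLowerCase();
  let chosen = groups.filter((g) => g.agents.indexOf(name) !== -1);
  if (chosen.length === 0) chosen = groups.filter((g) => g.agents.indexOf("*") !== -1);
  let best: Rule | null = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.path)) continue;
      // Longest matching path wins; Allow wins a tie.
      if (best === null || rule.path.length > best.path.length ||
          (rule.path.length === best.path.length && rule.allow)) {
        best = rule;
      }
    }
  }
  return best === null ? true : best.allow;
}
```

With the GPTBot stanza above, isAllowed(text, "GPTBot", "/pricing/enterprise") is false while "/docs" is true, and a bot with no stanza of its own is allowed everywhere.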

CDN bot rules — the enforced signal

robots.txt is a request. CDN bot rules are enforcement. If you have Cloudflare, Fastly, or Vercel in front of your site, you can return 403/429 to specific bot fingerprints and they have no choice.

Cloudflare

Cloudflare's Bot Management tier lets you write rules in the Web Application Firewall. A typical block looks like:

(cf.client.bot) and (http.user_agent contains "GPTBot")

Set the action to Block (or Challenge if you want to be less aggressive). Cloudflare also ships a free "AI Scrapers and Crawlers" managed rule you can toggle in one click, which covers most of the bots in this list.

Cloudflare's recently-shipped Content Signals mechanism is a more nuanced version of this — you declare whether each path may be used for training, search, or inference, and crawlers self-comply. Worth enabling alongside hard blocks.

Vercel Bot Management (Edge Network)

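// middleware.ts (assumes a Next.js project deployed on Vercel)
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

// User-Agent substrings of the crawlers to hard-block.
const aiBots = /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot/i;

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || '';
  if (aiBots.test(ua)) {
    return new Response('Forbidden', { status: 403 });
  }
  return NextResponse.next(); // everyone else passes through
}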

nginx

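# In the http {} context: flag requests whose User-Agent matches an AI crawler.
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*PerplexityBot 1;
    ~*Bytespider 1;
    ~*CCBot 1;
}

server {
    # Reject flagged requests before any location block runs.
    if ($is_ai_bot) {
        return 403;
    }
}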

Allow + serve markdown (the "agent-friendly" approach)

An increasingly popular third path: don't block AI bots, but serve them clean markdown instead of your full HTML.

Two ways to do this:

  1. Stable URLs: publish /llms.txt, /llms-full.txt, and /md/<path> markdown files at predictable URLs. AI bots that know the convention fetch them directly. This is what Crawlytics generates for you.
  2. Content negotiation: when an AI bot sends Accept: text/markdown, return markdown instead of HTML for the canonical URL. This is what Cloudflare's Markdown for Agents ships.

Both approaches work. The first reaches more bots (most clients don't send Accept: text/markdown yet); the second is lower-latency.
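The two approaches can share one piece of routing logic. The sketch below shows the decision only, not a full server: wantsMarkdown and markdownPathFor are hypothetical helpers, and the .md suffix and index.md name are assumptions layered on the /md/<path> convention described above.

```typescript
// True when the Accept header lists text/markdown with a non-zero q-value.
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept.split(",").some((part) => {
    const pieces = part.split(";").map((s) => s.trim().toLowerCase());
    if (pieces[0] !== "text/markdown") return false;
    const q = pieces.slice(1).find((p) => p.startsWith("q="));
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}

// Maps a canonical path to its markdown twin under /md/ (suffix and index name are assumed).
function markdownPathFor(pathname: string): string {
  const trimmed = pathname.replace(/^\/+|\/+$/g, "");
  return trimmed === "" ? "/md/index.md" : `/md/${trimmed}.md`;
}
```

A handler could call wantsMarkdown on the incoming Accept header and, when it returns true, serve the file at markdownPathFor(pathname) instead of the HTML page, so the stable URLs and the negotiated response stay in sync.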

Measuring whether your config is working

Three things to check:

1. Are blocked bots actually blocked?

Run this from a test environment:

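curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot" https://yoursite.com/

This prints only the HTTP status code. If your block rule fires you should see 403. If you see 200, nothing at the edge is enforcing the block. robots.txt plays no part in this test, because curl never reads it; honoring it is up to each crawler. A 200 is fine if a polite request is all you intended, and a problem if you meant to hard-block. Note that a rule which also requires a verified-bot match (such as cf.client.bot) is unlikely to fire for a spoofed User-Agent sent from your own machine.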

2. Are allowed bots still visiting?

Grep your server logs for the User-Agents you allowed:

grep -iE 'PerplexityBot|ChatGPT-User|claude-web' /var/log/nginx/access.log | tail -50

If these bots keep appearing with recent timestamps, your allow list is working as intended. If the hits stop after a config change, you accidentally blocked something.
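To see a trend rather than the last 50 lines, the same log can be bucketed per day and per bot. A rough sketch, assuming nginx's default combined log format (timestamp in square brackets); countBotHits is a hypothetical helper, and like the grep above it matches the bot name anywhere in the line rather than parsing the User-Agent field.

```typescript
const ALLOWED_BOTS: string[] = ["PerplexityBot", "ChatGPT-User", "claude-web"];

// Returns hit counts keyed by "<day> <bot>", e.g. "10/Oct/2025 PerplexityBot".
function countBotHits(lines: string[], bots: string[] = ALLOWED_BOTS): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    // Combined log format puts the timestamp in brackets: [10/Oct/2025:13:55:36 +0000]
    const day = /\[(\d{2}\/[A-Za-z]{3}\/\d{4})/.exec(line);
    if (!day) continue;
    const lower = line.toLowerCase();
    for (const bot of bots) {
      if (lower.indexOf(bot.toLowerCase()) !== -1) {
        const key = `${day[1]} ${bot}`;
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}
```

Feeding it the lines of access.log gives one number per bot per day, which makes a drop to zero after a config change easy to spot.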

3. Are you actually getting referral traffic from AI assistants?

This is the bottom-line question. Blocking and allowing only matter if they translate to (or away from) human visits.
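One way to approximate this is to classify the Referer of each human visit. The sketch below is illustrative only: the hostname list is an assumption that you should check against what your own analytics records, and aiReferrer is a hypothetical helper.

```typescript
// Assumed referrer hostnames; verify them against your own analytics data.
const AI_REFERRERS: { domain: string; assistant: string }[] = [
  { domain: "chatgpt.com", assistant: "ChatGPT" },
  { domain: "perplexity.ai", assistant: "Perplexity" },
  { domain: "claude.ai", assistant: "Claude" },
  { domain: "gemini.google.com", assistant: "Gemini" },
];

// Returns the assistant name for a Referer URL, or null if it is not a known AI source.
function aiReferrer(referrer: string): string | null {
  let host: string;
  try {
    host = new URL(referrer).hostname.toLowerCase();
  } catch {
    return null; // empty or malformed Referer header
  }
  for (const entry of AI_REFERRERS) {
    // Match the domain itself or any subdomain of it.
    if (host === entry.domain || host.endsWith("." + entry.domain)) {
      return entry.assistant;
    }
  }
  return null;
}
```

Counting non-null results per assistant over time gives a rough referral figure to set against each crawler's request volume. Visits that arrive without a Referer header will be missed, so treat the number as a lower bound.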

What about bots that ignore robots.txt?

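They exist: scrapers masquerading as legitimate AI bots, abandoned crawlers running on autopilot, and a handful of named bots from less-reputable companies. For these, robots.txt does nothing; you need CDN bot rules, rate limiting, or honeypot pages that ban any IP that fetches them.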
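A honeypot can be very small. The sketch below assumes a trap URL that is disallowed in robots.txt and never linked visibly, so only clients that ignore robots.txt request it; TRAP_PATH and HoneypotGuard are hypothetical names, and the ban list lives in memory with no expiry.

```typescript
// Hypothetical trap path: disallow it in robots.txt and never link it visibly.
const TRAP_PATH = "/trap-do-not-follow";

class HoneypotGuard {
  private banned = new Set<string>();

  // Returns the HTTP status to send for this request.
  check(ip: string, path: string): number {
    // Fetching the trap bans the IP for the lifetime of this process.
    if (path === TRAP_PATH) this.banned.add(ip);
    return this.banned.has(ip) ? 403 : 200;
  }
}
```

In production the ban list would more likely live in the CDN or a shared store, with an expiry so that recycled IP addresses are not blocked forever.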

A reasonable default for most sites

If you want a "good enough" starting position without overthinking it:

  1. Allow all AI bots in robots.txt (don't block anything for the first 30 days)
  2. Install bot tracking — Crawlytics or grep your own logs
  3. Observe for 30 days: which bots are visiting, how much volume, which pages they prefer, whether the cited-by-AI traffic appears in your analytics
  4. Make blocking decisions based on the data — block the bots that consume bandwidth without sending visits, allow the ones that drive measurable referrals
  5. Generate /llms.txt to make the allowed bots' job easier and get better citations

This is more work than "block everything" but it's also the only way to make a decision that aligns with your actual business outcomes instead of a vibes-based reaction.
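Step 4 can be made mechanical once the 30 days of data exist. This is a sketch under stated assumptions: BotStats and recommend are hypothetical names, the 1000-request threshold is an arbitrary placeholder, and attributing referrals to a specific crawler assumes you have mapped each crawler to its assistant.

```typescript
interface BotStats {
  bot: string;
  requests: number;  // crawler hits over the observation window
  referrals: number; // human visits attributed to that assistant
}

// minRequests is an arbitrary illustrative threshold, not a recommendation.
function recommend(stats: BotStats[], minRequests = 1000): Record<string, "allow" | "block"> {
  const out: Record<string, "allow" | "block"> = {};
  for (const s of stats) {
    // Block only bots that consume real bandwidth and send no visits back.
    const heavyAndSilent = s.requests >= minRequests && s.referrals === 0;
    out[s.bot] = heavyAndSilent ? "block" : "allow";
  }
  return out;
}
```

A low-volume bot with no referrals stays allowed here, on the view that it costs little and the data is too thin to judge it.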

Frequently Asked Questions

How do I block GPTBot from crawling my website?

Add the following to your robots.txt: User-agent: GPTBot, then Disallow: / on the next line. Repeat for ChatGPT-User and OAI-SearchBot if you want to block live-fetch and search-index bots too. For hard enforcement (not just a polite request), add a CDN bot rule in Cloudflare, Fastly, or Vercel that returns 403 to that User-Agent.

How do I block all AI crawlers at once?

List each major bot explicitly in robots.txt: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, claude-web, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Bytespider, CCBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended, GrokBot, cohere-ai. Note that Google-Extended is Google's AI opt-out token: blocking it opts you out of Gemini training and grounding without affecting your Google Search ranking. It does not remove you from AI Overviews, which are part of Search.

Should I block AI crawlers or allow them?

Allow them if you want to be cited by AI search and AI assistants, because blocking the training crawler means your content is absent from the model's knowledge. Block them if your content is paywalled or proprietary. A common middle ground: block pure training crawlers (CCBot, Bytespider, Applebot-Extended, Google-Extended) and allow live-fetch agents (ChatGPT-User, Perplexity-User, claude-web), so user-initiated questions about your site still get fresh content.

Does Google-Extended affect my Google Search ranking?

No. Google-Extended is a separate token Google introduced specifically as an AI opt-out signal. Blocking Google-Extended in robots.txt opts you out of Gemini training and grounding, but Googlebot and Googlebot-News still crawl normally and your traditional Google Search ranking is unaffected. AI Overviews are a Search feature, so they follow Googlebot and snippet controls rather than Google-Extended.

Do AI crawlers honor robots.txt?

The major ones say they do. OpenAI, Anthropic, Google, Meta, Apple, and Perplexity all state that their named bots honor robots.txt; they have legal teams that care. A handful of scrapers and less reputable bots ignore robots.txt entirely. For those you need CDN bot rules, rate limiting, or honeypot pages that ban any IP that fetches them.

This page is part of Crawlytics.app. View all pages: llms.txt · llms-full.txt
