Blocking GPTBot opts you out of training — but not ChatGPT-User, the real-time fetch bot. The 2026 framework for which AI bots to allow, block, or rate-limit based on what you sell.
About 40% of the sites I audit have a robots.txt block on GPTBot and have no idea what it actually does. They added it in 2024 because a Slack thread said to. Eighteen months later they're wondering why ChatGPT never cites them, and the answer is sometimes, but not always, sitting in that one line.
The reason this is confusing is that "block GPTBot" sounds like one decision and is actually two, made through the same syntax, with very different consequences. Get the distinction right and you can opt out of model training while staying fully visible in live AI answers. Get it wrong and you're either leaking IP to a free training corpus or invisible to the next generation of search. This is the 2026 framework for figuring out which side of that line you're on.
OpenAI runs at least three crawlers, and they do different things:
| User-Agent | What it does | If you block it |
|---|---|---|
| GPTBot | Crawls pages for inclusion in future training datasets. | Your content will not be used to train the next ChatGPT model. Has zero effect on current ChatGPT answers. |
| ChatGPT-User | Fetches a URL in real time when a ChatGPT user asks a question that needs your page. | You disappear from live ChatGPT citations. The model has to guess about you from memory. |
| OAI-SearchBot | Builds the in-product web index ChatGPT Search ranks from. | You're not in the ChatGPT Search index. Live answers that pull from the index skip you. |
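Anthropic, Google, and Perplexity follow roughly the same split: a training crawler, a live-fetch crawler, often an index crawler. The naming convention is a useful first-pass grep rather than a rule. User-suffixed names (ChatGPT-User, Claude-User, Perplexity-User) are live fetchers, SearchBot names (OAI-SearchBot, Claude-SearchBot) are index crawlers, and the remaining Bot-suffixed names such as GPTBot and ClaudeBot are usually training crawlers. Tokens like Google-Extended don't fit the pattern at all, so check the vendor's documentation before relying on the suffix.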
This is the entire game. If you only want to opt out of training, block GPTBot and leave ChatGPT-User and OAI-SearchBot alone. If you want to disappear from ChatGPT entirely, block all three. Most people who think they want the second thing actually want the first.
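The three-way split is easy to encode when you are reading server logs. The following is a minimal sketch, assuming a simple substring match on the User-Agent header; the `Role` enum, the `KNOWN_BOTS` table, and the `classify` helper are illustrative names, and the roles assigned are the ones this article gives, not an official vendor list.

```python
from enum import Enum


class Role(Enum):
    TRAINING = "training"
    LIVE = "live"
    INDEX = "index"
    UNKNOWN = "unknown"


# Tokens and roles as described in this article (assumption: substring
# matching on the lowercased header is good enough for a first pass).
KNOWN_BOTS = {
    "gptbot": Role.TRAINING,
    "chatgpt-user": Role.LIVE,
    "oai-searchbot": Role.INDEX,
    "claudebot": Role.TRAINING,
    "claude-user": Role.LIVE,
    "claude-searchbot": Role.INDEX,
    "perplexity-user": Role.LIVE,
}


def classify(user_agent: str) -> Role:
    """Return the crawler role for a raw User-Agent header value."""
    ua = user_agent.lower()
    for token, role in KNOWN_BOTS.items():
        if token in ua:
            return role
    return Role.UNKNOWN
```

Anything that comes back as `Role.UNKNOWN` is either a human browser or a bot this table does not know about, so treat it as "not classified" rather than "safe".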
Blocking GPTBot via robots.txt does exactly one thing: it tells OpenAI's training crawler not to fetch your URLs. OpenAI honors it — they've been clean about that since launch. The Common Crawl corpus they sometimes also train on is a separate question (more below).
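What it does not do:

- It does not stop ChatGPT-User from fetching your pages in real time, so live citations continue.
- It does not remove you from the ChatGPT Search index; that is OAI-SearchBot's job.
- It does not pull content out of a model that has already been trained on it.
- It does not touch your Google or Bing rankings.
- It does not keep your pages out of Common Crawl, which crawls as CCBot.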
The training-vs-live split also matters for one practical reason: live fetches are the ones that drive traffic. When ChatGPT-User pulls your page in response to a user prompt, the citation includes your URL. Users click. You get the visit (currently miscredited as "(direct)" in Google Analytics — see the attribution fix). Block ChatGPT-User and you turn off that referral stream.
There are really only four sensible robots.txt stances on AI crawlers. Pick one based on what your business sells.
Stance 1: Allow All. You want the broadest possible AI visibility. You let training crawlers index your content for future models, and you let live-fetch bots cite you today. This is the right default for SaaS marketing sites, B2B documentation, ecommerce, local services, and anything where being found is the business model.
Robots.txt: do nothing. The absence of a Disallow is an Allow.
Stance 2: Allow Live, Block Training. You don't want your content training future models (maybe it's proprietary methodology, maybe you're philosophically opposed, maybe you want to negotiate a licensing deal later), but you do want to be cited in live AI answers today. This is the underused middle option.
This is the right stance for original journalism that still wants the referral traffic, original research firms, agencies that publish methodology pages, and anyone whose content is the product but whose distribution is the marketing channel.
Stance 3: Block All. You don't want AI assistants reading your content at all. This fits paid research, paywalled journalism, members-only content, legal/medical sites with liability concerns, and original IP that you actively license. You're trading off discoverability for control.
Realistic warning: blocking everything means you'll keep losing share-of-voice to competitors who allowed everything. For most categories that's a losing trade past 12-18 months. Consider Stance 2 instead.
Stance 4: Rate-Limit. You're fine with the bots, but the volume is costing you money or degrading performance. High-bandwidth media sites (video, large image libraries, complex SPAs that render expensively) hit this first. The answer is usually not a hard block but a rate limit on the training crawlers (which fetch broadly and repeatedly) while leaving live-fetch bots untouched (they fetch one URL per user prompt, so volume is low).
This is also the right stance for any site where your bot-to-human ratio has crossed about 1:5. The citation tracking guide walks through how to measure that.
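As a rough illustration of the 1:5 check, here is a minimal sketch that counts AI-crawler requests against everything else. It assumes you have already pulled the User-Agent string out of each access-log line; `AI_BOT_TOKENS`, `bot_to_human_ratio`, and `over_threshold` are hypothetical names, and "human" here simply means "anything not matched as an AI bot", which will overcount humans if other crawlers hit your site.

```python
from collections import Counter

# Substrings that mark a request as an AI crawler (names taken from this article).
AI_BOT_TOKENS = (
    "gptbot", "chatgpt-user", "oai-searchbot",
    "claudebot", "claude-user", "claude-searchbot",
    "perplexitybot", "perplexity-user",
    "bytespider", "amazonbot", "ccbot",
)


def bot_to_human_ratio(user_agents):
    """Return AI-bot requests divided by all other requests."""
    counts = Counter(
        "bot" if any(token in ua.lower() for token in AI_BOT_TOKENS) else "human"
        for ua in user_agents
    )
    if counts["human"] == 0:
        # No non-bot traffic at all: infinite ratio if any bots were seen.
        return float("inf") if counts["bot"] else 0.0
    return counts["bot"] / counts["human"]


def over_threshold(user_agents, threshold=1 / 5):
    """True once the ratio has crossed the 1:5 mark used in this article."""
    return bot_to_human_ratio(user_agents) > threshold
```

Run it over a day or a week of logs rather than a single hour, since training crawlers tend to arrive in bursts.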
Quick reference. Find your row and the answer falls out:
| Site type | Recommended stance | Why |
|---|---|---|
| SaaS marketing + docs | Allow All | Docs in training data = better LLM understanding of your product. Citations drive demos. |
| B2B services / agency | Allow All (or Allow Live Only) | Allow Live Only if your methodology is the differentiator and you don't want it baked into models. |
| Ecommerce | Allow Live + consider blocking training | Product catalogs in training data = competitors' AI-generated descriptions look like yours. Block training, keep live so agents can buy. |
| Original journalism | Allow Live, Block Training (or Block All if paywalled) | You want the referral traffic from citations but you're getting trained on without compensation. |
| Paid research / paywalled | Block All | Your content is the product. Citation = leak. |
| Local service business | Allow All | You want every "best plumber in Dallas" prompt to surface you. Training also helps the model recommend you. |
| High-bandwidth media (video/images) | Rate-Limit training, Allow Live | Bandwidth cost. Live fetches are sparse; training fetches are dense. |
| Proprietary methodology / consulting | Allow Live, Block Training | You want to be cited. You don't want models to "absorb" your method into the base layer. |
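Save these and adapt the comments. Under RFC 9309 a crawler obeys the group whose User-agent line matches it most specifically, wherever that group sits in the file. Some parsers (Python's urllib.robotparser among them) instead take the first group that matches, so putting the specific groups first is still a safe habit.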
Stance 1 (Allow All):

```
# Default: no AI-specific rules needed. All bots welcome.
User-agent: *
Allow: /
```
Stance 2 (Allow Live, Block Training):

```
# Block AI training crawlers, allow live-fetch and search index bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Live-fetch and search bots remain allowed (no Disallow for them)
# ChatGPT-User, Claude-User, Perplexity-User, OAI-SearchBot, Claude-SearchBot
```
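Before deploying the training-block rules, it is worth confirming they do what you think. The sketch below uses Python's standard-library `urllib.robotparser` to check which bots may fetch a page; the `allowed` helper and the `example.com` URL are illustrative. Python's parser does simple substring matching on the user-agent name, so treat the result as a sanity check and not as proof of how any particular vendor's crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# A copy of the training-block rules, kept inline so the check is self-contained.
STANCE_2 = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/pricing") -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Googlebot"):
        print(f"{bot}: {'allowed' if allowed(STANCE_2, bot) else 'blocked'}")
```

The same helper works against your production file: read it from disk or fetch it, pass the text in, and assert the bots you care about land on the right side.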
Stance 3 (Block All):

```
# Block all known AI crawlers: training, live, and index
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /
```
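If you maintain several sites, generating the file from one bot list keeps the stances consistent. This is a minimal sketch, assuming the bot names listed in this article; the stance keys (`allow_all`, `block_training`, `block_all`) and the `robots_txt` function are made-up names for illustration, and the lists will need updating as vendors add or rename crawlers.

```python
# Bot names as grouped in this article; not an exhaustive or official list.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "anthropic-ai"]
LIVE_AND_INDEX = [
    "ChatGPT-User", "OAI-SearchBot",
    "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
]
OTHER = ["Bytespider", "Amazonbot"]


def robots_txt(stance: str) -> str:
    """Build robots.txt text for a hypothetical stance key."""
    if stance == "allow_all":
        return "User-agent: *\nAllow: /\n"
    if stance == "block_training":
        blocked = TRAINING
    elif stance == "block_all":
        blocked = TRAINING + LIVE_AND_INDEX + OTHER
    else:
        raise ValueError(f"unknown stance: {stance!r}")
    # One group per bot, separated by blank lines for readability.
    return "\n".join(f"User-agent: {bot}\nDisallow: /\n" for bot in blocked)


if __name__ == "__main__":
    print(robots_txt("block_training"))
```

Rate limiting is deliberately absent here because it cannot be expressed portably in robots.txt.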
Stance 4 (Rate-Limit): robots.txt has no portable rate-limit directive (Crawl-delay is honored inconsistently). The real answer is at the server or CDN layer: nginx limit_req_zone, a Cloudflare WAF rate-limiting rule keyed on User-Agent, or Vercel middleware. Set a request budget per bot. Common starting point: 1 request per second for training crawlers, no limit for the User-suffixed variants.
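To make the budget idea concrete, here is a minimal token-bucket sketch in application code. It is an illustration of the technique, not a replacement for nginx or CDN rules: `TokenBucket`, `CrawlerLimiter`, and the `TRAINING_TOKENS` list are hypothetical, the burst size of 5 is an arbitrary assumption, and state lives in one process's memory, so it will not coordinate across multiple servers.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at rate_per_sec, holds at most burst tokens."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Assumption: these substrings identify the training crawlers to throttle.
TRAINING_TOKENS = ("gptbot", "claudebot", "ccbot")


class CrawlerLimiter:
    """One bucket per training crawler; everything else passes untouched."""

    def __init__(self, rate_per_sec=1.0, burst=5, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, user_agent: str) -> bool:
        ua = user_agent.lower()
        token = next((t for t in TRAINING_TOKENS if t in ua), None)
        if token is None:
            return True  # humans and live-fetch bots are not limited
        if token not in self.buckets:
            self.buckets[token] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[token].allow()
```

When `allow` returns False, responding with HTTP 429 and a Retry-After header is the conventional way to tell a well-behaved crawler to slow down.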
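Google-Extended is Google's separate robots.txt token for Gemini training. It is a control token, not a crawler with its own User-Agent string: the fetching is done by Google's existing crawlers, and the token only governs whether what they fetch may be used for Gemini. That means you will not see "Google-Extended" in your access logs and cannot rate-limit it by User-Agent. Google introduced it in 2023 specifically so sites could opt out of AI training without losing Search rankings. Disallowing Google-Extended does not affect your appearance in Google Search results.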
For almost every site, the right move on Google-Extended matches what you did on GPTBot. If you're Stance 2 (Allow Live, Block Training), block Google-Extended too — Gemini's live-fetch behavior runs under different UA strings. If you're Stance 1 (Allow All), leave it open. The asymmetry would be: block GPTBot but allow Google-Extended because you think Gemini is the future. That's a real bet some sites are making, just be deliberate about it.
You can flip your robots.txt any time. Two lag effects to know:
From blocked to allowed: the next time the training crawler revisits robots.txt (usually within 24 hours), it'll start fetching your content. Inclusion in the next trained model depends on whatever the training cycle is — for GPT-5-class models that's roughly a quarterly window. For live-fetch bots like ChatGPT-User, the change is effectively immediate; the next user prompt that needs your URL will fetch it cleanly.
From allowed to blocked: training crawlers stop fetching within a day or two. But content already in a trained model stays there — there's no retroactive removal. So if a model trained on your pre-block content, that knowledge sticks until the model is retrained from scratch. Plan for a 3-9 month tail.
Blog post — Why publishers are switching to default-deny — Reuters and People Inc. flipped robots.txt from blocklist to allowlist. The trend, the data, and what smaller sites should copy.
Resource — Complete list of AI crawler bots — Every AI crawler UA, what it does, and the robots.txt directive for each.
Resource — Manage AI crawlers — the how-to — Once you have a stance, this guide ships the robots.txt, CDN, and dashboard config to enforce it.
Blog post — How to track AI citations — Detect which AI bots are hitting your site and whether citations are happening in answers.
Blog post — ChatGPT Agent Can't Reach Your Website? 6 Blockers — The inverse problem: if you blocked too much by accident, here is how to confirm it and let the right agents back in.
Feature — Crawlytics bot tracking dashboard — Per-bot, per-page fetch counts so you can make allow/block calls with data, not guesswork.
Written by Crawlytics Team.
No, blocking GPTBot does not hurt your Google rankings. GPTBot is OpenAI's training crawler. It has no relationship with Google, Bing, or any traditional search engine. Blocking it affects what shows up in ChatGPT (and only the training-derived parts, not live citations). Google Search ranking is untouched.
Mostly no: blocking GPTBot alone does not stop ChatGPT from citing you. ChatGPT cites live URLs via ChatGPT-User, which is a different bot. To stop ChatGPT citations entirely you need to block ChatGPT-User and OAI-SearchBot as well. Many sites think they blocked ChatGPT and only blocked the training half.
Should a small site block AI crawlers at all? If you're under 10k monthly pageviews and not publishing original journalism or proprietary research, the answer is almost certainly no. Allow All. The visibility upside outweighs the (very theoretical) training downside for sites your size. Revisit the decision when you cross 100k or when your content becomes the moat.
Cloudflare launched a "Pay-Per-Crawl" private beta in 2025 that lets sites set a per-fetch price for specific AI bots and have Cloudflare collect on their behalf. Adoption is light but growing. For most sites it's not yet worth the operational overhead; for large publishers it's a real revenue line worth piloting. The publishers seeing the best economics are the ones with high-quality news, financial data, or technical documentation: content that's expensive to recreate.
Common Crawl runs the CCBot User-Agent and publishes a public web archive that downstream AI labs train on (OpenAI included, historically). If you're going Stance 3 (Block All), add CCBot to the list. If you're going Stance 2 (Allow Live, Block Training), it's worth adding — Common Crawl is the back-door training pipeline that lets a model learn your content even after you've blocked GPTBot directly. The honest caveat: blocking CCBot also opts you out of academic research datasets and a lot of legitimate non-training uses, so weigh it.