ChatGPT Agent Blocked From Your Site? 6 Causes and Fixes

Summary

A ChatGPT agent blocked from your website is usually accidental. This post covers the 6 common causes, from Cloudflare bot rules to JS-only rendering, and how to confirm and fix each.

You asked ChatGPT about your own site and it could not see it. Or a customer told you the agent skipped right past your product. Before you start tearing apart configs, know that this is almost always accidental. Some default protection, a forgotten line in robots.txt, or a rendering choice made years ago is quietly turning the agent away. This post is for the site owner being blocked by their own stack, not the owner who chose to block. If you actually want to keep certain bots out, that is a different decision, and here is the decision guide for it.

First, confirm it's actually blocked (the 30-second test)

Do not fix anything until you have proof. "ChatGPT didn't mention my site" is not proof of a block. It might mean your content was reachable but not relevant, or that the model never tried to fetch you at all. Get a real signal first.

The fastest signal lives in your server logs. Grep for the AI user agents and look at the status codes they got back:

grep -iE 'ChatGPT-User|OAI-SearchBot|GPTBot|PerplexityBot|ClaudeBot' access.log | tail -50

If those requests return 200, the agent reached you. If you see 403 (forbidden), 429 (rate limited), or 503 with a challenge body, something in your stack is turning them away. A wall of 403s under OAI-SearchBot is your answer.
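If you would rather see totals than scroll through raw lines, the same check can be sketched in Python. This is a minimal sketch that assumes your server writes the common "combined" log format; the function name `tally_ai_statuses` and the regular expression are illustrative, not part of any tool named here.

```python
import re
from collections import Counter, defaultdict

# User-agent tokens taken from the grep pattern above.
AI_AGENTS = ("ChatGPT-User", "OAI-SearchBot", "GPTBot", "PerplexityBot", "ClaudeBot")

# Assumption: "combined" log format, i.e.
#   ... "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def tally_ai_statuses(lines):
    """Count HTTP status codes per AI user-agent token."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ua = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in ua:
                counts[agent][int(match.group("status"))] += 1
                break
    return {agent: dict(statuses) for agent, statuses in counts.items()}
```

Pass it any iterable of log lines, for example an open file handle on access.log. A result dominated by 403 under OAI-SearchBot is the same "wall of 403s" signal described above.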

No log access, or you want to reproduce it live? Fetch your own URL while pretending to be ChatGPT's fetcher:

curl -A "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)" -i https://yoursite.com/your-page

Read the response. A 200 with your actual page HTML means you are reachable. A 403, a 429, or any response whose body is a Cloudflare "checking your browser" interstitial or a CAPTCHA page means an agent gets that same wall instead of your content. Challenge pages often arrive with a 403 or 503, but some come back as a 200, so read the body rather than trusting the status code. That interstitial is the tell: even when the request technically succeeded, the payload is a challenge, not your page.
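The "read the response" step can be written down as a small decision function. This is a rough sketch, not a detector you should trust blindly: the marker strings in `CHALLENGE_MARKERS` are assumptions based on the interstitial text mentioned in this post, and a page that legitimately embeds a CAPTCHA in a form would be misclassified.

```python
# Assumed marker strings; extend them with text from your own challenge pages.
CHALLENGE_MARKERS = ("just a moment", "checking your browser", "captcha")


def classify_response(status, body):
    """Return 'challenge', 'blocked', 'reachable', or 'unknown' for one fetch."""
    text = body.lower()
    # Check the body first: a challenge page can hide behind any status code.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge"
    if status in (403, 429, 503):
        return "blocked"
    if status == 200:
        return "reachable"
    return "unknown"  # redirects, 404s, and anything else need a human look
```

Feed it the status code and body from your curl run. Checking the body before the status is deliberate, because the interstitial is the stronger signal.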

One caveat worth stating plainly. Spoofing a user agent only tests user-agent-based rules. If a service verifies the request's source IP against published bot IP ranges, your curl from a random IP will not perfectly mimic the real bot. It is still the right first test, because the most common accidental blocks key off the user agent string or off "is this a browser," and curl catches both.
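To make the caveat concrete, here is roughly what an IP-based verification looks like from the service's side. The CIDR blocks below are placeholders from the reserved documentation ranges, not real bot ranges, and `is_verified_source` is a hypothetical helper name.

```python
import ipaddress

# Placeholder networks from the reserved documentation ranges.
# A real check would load the ranges the bot operator actually publishes.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_verified_source(ip, ranges=PUBLISHED_RANGES):
    """Return True if the source IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(addr in network for network in ranges)
```

Your curl request carries the right user agent but fails this kind of check, which is why a spoofed test cannot tell you how IP-verifying rules treat the real bot.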

Blocker 1 — Cloudflare Bot Fight Mode and managed bot rules

Cloudflare's bot-fighting features are the single most common accidental cause, because they can be on without you ever choosing them. Bot Fight Mode (and Super Bot Fight Mode on paid plans) is designed to challenge or block automated traffic. AI crawlers are automated traffic. The feature does not always know the difference between a scraper you hate and the search fetcher you want.

If your curl test returned a "Just a moment..." interstitial or a JavaScript challenge, this is your prime suspect. Check it in the Cloudflare dashboard under Security, then Bots. Super Bot Fight Mode has a specific toggle for "AI bots" / "AI scrapers" that, when set to block, takes out crawlers using a managed signature list. Setting that to "Allow" (or carving out the specific verified bots) is usually the whole fix.

Be precise about what these tools claim. Cloudflare's bot features can challenge or block AI traffic when configured to, and recent defaults have leaned toward blocking AI scrapers on new zones, but the exact behavior depends on your plan and settings. Do not assume. Check the live toggle, then re-run the curl test to confirm you now get HTML.

Blocker 2 — robots.txt rules you forgot about

Open https://yoursite.com/robots.txt and read every line, because a single broad rule can wipe out a whole class of agents. The two patterns that bite people:

A blanket disallow that you, a plugin, or a staging config left behind:

User-agent: *
Disallow: /

That tells every well-behaved crawler to stay out of everything. Or a targeted block someone added during an "AI scraping" panic:

User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /

Here is where the GPTBot confusion costs sites real visibility. Blocking GPTBot opts you out of OpenAI's training crawl, which many sites want. But people copy a "block OpenAI" snippet that also disallows OAI-SearchBot and ChatGPT-User, the agents that fetch your page when a user asks ChatGPT about you in real time. You meant to skip training. You accidentally went dark in ChatGPT search. Keep those rules separate and intentional.

Compliant agents read robots.txt and honor it, so a fix here is just an edit and a re-fetch. Remove or scope the offending lines, confirm the file is served fresh (caching plugins love to serve a stale robots.txt), and you are done.
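If you want to check a robots.txt before you deploy it, Python's standard library can parse the file and answer "may this agent fetch this URL?" The sketch below uses `urllib.robotparser`; the helper name `agent_access` and the `SCOPED_ROBOTS` example are illustrative. Note that this is Python's interpretation of the rules, and individual crawlers may match slightly differently.

```python
from urllib.robotparser import RobotFileParser


def agent_access(robots_txt, agents, url="https://yoursite.com/"):
    """Map each user-agent token to True (allowed) or False (disallowed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Example of separate, intentional rules: decline the training crawler,
# leave everything else (including the search and fetch agents) allowed.
SCOPED_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]
    for agent, allowed in agent_access(SCOPED_ROBOTS, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Paste your live robots.txt in place of `SCOPED_ROBOTS` to see which of the three OpenAI agents it actually turns away.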

Blocker 3 — WAF and rate-limit rules

Your WAF and rate limiter can block agents that robots.txt would happily allow, and they do it without leaving a polite note. Two failure modes show up most:

Aggressive rate limits. An agent fetching a few pages in quick succession to answer one question can trip a limiter tuned for human pacing, and it gets a 429. Real-time fetchers like ChatGPT-User usually pull a small number of pages, so a sane per-IP allowance rarely hurts them. A limiter set very tight, or one that counts all bot-shaped traffic against one bucket, will. Look for 429s under AI user agents in your logs, then loosen the rule or exempt verified bots rather than guessing at a magic number.

WAF rules that block "non-browser" traffic. Some managed rule sets challenge or drop requests that lack browser-like headers or carry a known bot user agent. That is great against junk traffic and terrible when it catches the fetcher you want citing you. If your curl test returned a 403 with a WAF block page (not a Cloudflare JS challenge, an actual "request blocked" response), inspect your WAF event log, find the rule ID that fired, and add an exception for the verified AI user agents you want to allow.
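The rate-limit failure mode is easier to reason about with the logic written out. This is a minimal in-memory sketch of a per-IP sliding-window limiter with a verified-bot exemption, assuming some other layer has already decided whether the request is a verified bot. The class name and parameters are hypothetical; your CDN or WAF implements its own version of this.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now, verified_bot=False):
        """Return True to pass the request, False to answer with a 429."""
        if verified_bot:
            return True  # exempt verified agents instead of tuning a magic number
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

With a tight setting such as 2 requests per 10 seconds, an agent fetching three pages in a row is rejected on the third unless it takes the exempt path.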

Blocker 4 — JavaScript-only content

This one is sneaky because nothing is "blocking" anything. The request returns 200, the agent is technically welcome, and it still sees almost nothing. If your page ships an empty <div id="root"></div> and paints all the real content with client-side JavaScript, a crawler-style agent that does not execute JS gets the empty shell. From the outside it looks like a block. It is really a rendering gap.

Test it the way the agent experiences it. Re-run the curl from the first section and actually read the body. If your headline, prices, and copy are missing from the raw HTML and you only see a script bundle, that is your problem. Some agents drive a full browser and will render JS; the live fetch and crawler-style agents frequently do not, and you cannot control which one arrives.

The fix is to put the content that matters into the server-rendered HTML: server-side rendering, static generation, or at minimum prerendering for bot user agents. Get your product name, price, and core copy into the initial response. An agent will not run your app to find your price. It reads what you hand it.
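You can automate the "read the body" check by extracting the visible text from the raw HTML and looking for the phrases that matter. The sketch below uses only the standard library; `missing_from_raw_html` is a hypothetical helper, and the product name and price in the usage note are made-up examples.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def missing_from_raw_html(html, required_phrases):
    """Return the phrases a non-JS agent would not find in the raw HTML."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in text]
```

Call it with the body from your curl test and a list such as `["Acme Widget", "$49"]`. An empty list means the content is in the initial response; anything returned is content that only exists after JavaScript runs.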

Blockers 5 and 6 — CDN geo rules and login/paywall gates

Two more accidental blocks worth a quick check.

Blocker 5, CDN geo and country rules. If you block or challenge traffic from certain countries, or allowlist only a handful, you may be blocking the regions agent infrastructure fetches from. The compute behind a live fetch does not necessarily sit in your customer's country. A geo rule written for fraud prevention can silently exclude the data-center ranges agents originate from. Check your CDN's country/ASN rules and confirm you are not challenging the networks these fetchers use.

Blocker 6, login and paywall gates. If the content lives behind authentication, a soft paywall, or an interstitial "accept cookies to continue" wall, the agent hits the gate, not the goods. This is sometimes intentional, and that is fine. Just be honest that gated content is not agent-readable content. If you want an agent to summarize or cite a page, the substance has to be reachable without a login. A common middle ground is exposing a meaningful preview or a clean summary in the server HTML while keeping the full asset gated.

Allowing agents without opening the floodgates

The goal is not "let everything in." It is to let the agents that drive visibility reach you while still keeping abusive scrapers out. Three moves do most of the work.

Allowlist the specific user agents you want. For live ChatGPT visibility that means OAI-SearchBot and ChatGPT-User; add PerplexityBot and ClaudeBot for those assistants. Allowing these is a deliberate, narrow decision, not a blanket "allow all bots." You can permit the search and fetch agents while still declining the training crawler if that is your preference.

Keep rate limits sane rather than absent. Real-time fetchers pull few pages, so a reasonable per-IP allowance and a verified-bot exemption let them through without inviting a scraper to hammer you. You do not have to choose between blocked and defenseless.

Keep blocking the bots you actually want gone. Allowlisting the good agents does not mean disarming. Abusive scrapers and unknown bot traffic can still be challenged or dropped. If you want to go further and flip to an allowlist-first posture, that is a defensible strategy some publishers are adopting; read the default-deny breakdown before you do. And for copy-paste robots.txt, Cloudflare, Vercel, and nginx rules, you can manage AI crawlers with real configs.
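Taken together, the three moves amount to a small policy table. Here is one way to sketch it, assuming user-agent matching only (which is spoofable, so treat it as a first pass rather than verification). The function name `decide`, the action labels, and the generic bot hints are all illustrative choices.

```python
# Agents named in this post as driving live visibility.
VISIBILITY_AGENTS = ("oai-searchbot", "chatgpt-user", "perplexitybot", "claudebot")
# OpenAI's training crawler, which you may choose to decline.
TRAINING_AGENTS = ("gptbot",)
# Crude hints that a request is some other automated client (an assumption).
GENERIC_BOT_HINTS = ("bot", "crawler", "spider", "scraper")


def decide(user_agent, allow_training=False):
    """Return 'allow', 'block', 'challenge', or 'default' for a user agent."""
    ua = user_agent.lower()
    if any(token in ua for token in VISIBILITY_AGENTS):
        return "allow"
    if any(token in ua for token in TRAINING_AGENTS):
        return "allow" if allow_training else "block"
    if any(hint in ua for hint in GENERIC_BOT_HINTS):
        return "challenge"  # unknown automation still meets your defenses
    return "default"  # ordinary traffic falls through to your normal rules
```

The order of the checks matters: the narrow allowlist is consulted first, so a wanted agent is never caught by the generic bot hints further down.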

Whatever you change, finish the way you started. Re-run the curl test, check your logs for a clean run of 200s under the AI user agents, and confirm the body is your real page. A fix you did not verify is a guess. The whole point of the 30-second test is that it works in both directions: it tells you when you are blocked, and it tells you when you are finally not.

Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.

Frequently Asked Questions

How do I test if ChatGPT can see my website?

Fetch your own URL with a ChatGPT-style user agent and read the response. Run curl -A "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)" -i https://yoursite.com/page and check two things: the status code and the body. A 200 with your real HTML means you are reachable. A 403, a 429, or a 200 whose body is a "checking your browser" challenge or CAPTCHA means an agent gets that wall instead of your content. For an even better signal, grep your server logs for AI user agents and look at the status codes they actually received. The free Crawlytics Agent-Ready Grader runs this fetch-as-an-agent test for you in about a minute.

Does Cloudflare block ChatGPT by default?

Cloudflare can challenge or block AI crawlers depending on your plan and settings, and recent defaults on new zones have leaned toward blocking AI scrapers. That makes it a common accidental cause, but not a universal one. Bot Fight Mode and Super Bot Fight Mode are designed to challenge automated traffic, and AI agents are automated traffic. Check the live setting in your dashboard under Security, then Bots, rather than assuming. Look for the AI bots / AI scrapers toggle and set it to allow, or carve out the verified bots you want. Then re-run the curl test to confirm you now get real HTML instead of an interstitial.

Which user agents do I need to allow for ChatGPT?

Allow OAI-SearchBot and ChatGPT-User for live ChatGPT visibility. Those are the agents that fetch your page when a user asks ChatGPT about you in real time, so they are the ones that determine whether you show up. GPTBot is a separate thing: it is OpenAI's training crawler, and blocking it does not affect live search visibility. If you also want Perplexity and Claude to reach you, add PerplexityBot and ClaudeBot. The common mistake is copying a "block OpenAI" snippet that disallows the search and fetch agents alongside the training crawler.

Will allowing AI agents increase my server costs?

Usually only marginally, because live fetch agents pull a small number of pages per task. A real-time fetcher like ChatGPT-User typically requests the handful of pages it needs to answer one question, not your whole site. Training crawlers like GPTBot can be more aggressive over time, which is one reason some sites allow the search agents but decline the training crawler. If you are worried, keep a reasonable rate limit in place with a verified-bot exemption rather than blocking outright. That caps worst-case load without cutting off the agents that drive visibility.

Can I allow ChatGPT but block AI training crawlers?

Yes, and many sites do exactly this. Allow OAI-SearchBot and ChatGPT-User so the live search and fetch agents can reach you, while disallowing GPTBot in robots.txt to opt out of training. They are distinct user agents with distinct jobs, so you can treat them differently. This is the configuration for owners who want to appear in ChatGPT answers without contributing their content to model training. Just keep the rules explicit and separate so a future copy-paste edit does not accidentally collapse the two.

This page is part of Crawlytics.app. View all pages: llms.txt · llms-full.txt
