Complete List of AI Crawler Bots: User-Agents + robots.txt (2026)
Summary
Every major AI crawler: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, and 20 more. User-Agent patterns, purpose, robots.txt directives.
If you've been getting unexpected traffic from User-Agents you don't recognize, this is probably the table you wanted. We maintain this list by watching real ingest traffic across the Crawlytics customer base — when a new AI crawler shows up in the wild, we add the signature here and in the production classifier.
25 bots across 19 companies as of June 2026.
Why these bots exist
AI crawlers fall into three categories:
- Training crawlers — fetch your content to use in model training. These visit periodically (weekly to monthly), don't fire JavaScript, and won't show up in your front-end analytics. Examples: GPTBot, ClaudeBot, Bytespider, Applebot-Extended.
- Live-fetch agents — fire when a user asks the AI a question that requires fetching a specific URL right now. Lower volume but real-time. Examples: ChatGPT-User, Perplexity-User, claude-web.
- Search-index crawlers — feed AI search products (SearchGPT, You.com, DuckAssist, Kagi). Behave more like traditional search crawlers — frequent, broad, indexed for retrieval. Examples: OAI-SearchBot, PerplexityBot, YouBot.
Most production AI assistants use multiple bots from this list in combination — training plus live-fetch plus index. Blocking one but not the others usually doesn't get you the result you want.
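The three categories above can be expressed as a small lookup. This is a minimal sketch that uses only the example bots named in the list; the function name `classify_user_agent` and the sample header strings are hypothetical, and it is not the Crawlytics production classifier.

```python
# Category lookup built only from the example bots named in the list above.
BOT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "Applebot-Extended": "training",
    "ChatGPT-User": "live-fetch",
    "Perplexity-User": "live-fetch",
    "claude-web": "live-fetch",
    "OAI-SearchBot": "search-index",
    "PerplexityBot": "search-index",
    "YouBot": "search-index",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the category of the first known bot token found in the header."""
    lowered = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        # Case-insensitive substring match, similar in spirit to grep -i.
        if token.lower() in lowered:
            return category
    return "unknown"


# Illustrative header strings, not copied from any vendor's documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Grouping traffic this way makes it easier to see whether a site is being hit mostly by scheduled training crawls or by user-initiated fetches before deciding what to block.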
Should you block AI crawlers?
Depends on your business. The short version:
- If you want to be cited by ChatGPT, Claude, Perplexity, and AI search: allow them. Blocking the training crawler means your content is missing from the model's knowledge; blocking the live-fetch agent means user-initiated queries about your page can't pull fresh content.
- If your content is paywalled or proprietary and being scraped without compensation: block them. Use robots.txt for compliant bots and CDN bot rules (Cloudflare, Fastly, Vercel Bot Manager) for the rest.
- If you're not sure: install Crawlytics first to see what they're actually doing on your site. Then decide based on data instead of vibes.
For a deeper walkthrough of allow/block strategy, see how to manage AI crawlers.
The full bot table
OpenAI
| Bot name | Purpose | robots.txt |
|---|---|---|
| GPTBot (official docs) | Training crawl | User-agent: GPTBot Disallow: / |
| ChatGPT-User (official docs) | Live user-initiated fetch | User-agent: ChatGPT-User Disallow: / |
| OAI-SearchBot (official docs) | SearchGPT index | User-agent: OAI-SearchBot Disallow: / |
Anthropic
| Bot name | Purpose | robots.txt |
|---|---|---|
| ClaudeBot (official docs) | Training crawl | User-agent: ClaudeBot Disallow: / |
| claude-web (official docs) | Live user-initiated fetch | User-agent: claude-web Disallow: / |
| anthropic-ai (official docs) | Legacy / general | User-agent: anthropic-ai Disallow: / |
Perplexity
| Bot name | Purpose | robots.txt |
|---|---|---|
| PerplexityBot (official docs) | Index for Perplexity answers | User-agent: PerplexityBot Disallow: / |
| Perplexity-User (official docs) | Live user-initiated fetch | User-agent: Perplexity-User Disallow: / |
Google
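| Bot name | Purpose | robots.txt |
|---|---|---|
| Google-Extended (official docs) | Gemini training and grounding opt-out token (robots.txt control only; it has no separate User-Agent string and does not affect inclusion in Google Search) | User-agent: Google-Extended Disallow: / |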
ByteDance
| Bot name | Purpose | robots.txt |
|---|---|---|
| Bytespider | Training crawl for Doubao + TikTok AI | User-agent: Bytespider Disallow: / |
Common Crawl
| Bot name | Purpose | robots.txt |
|---|---|---|
| CCBot (official docs) | Open crawl corpus used by many AI labs | User-agent: CCBot Disallow: / |
Meta
| Bot name | Purpose | robots.txt |
|---|---|---|
| Meta-ExternalAgent | Meta AI / Llama training | User-agent: Meta-ExternalAgent Disallow: / |
| FacebookBot (official docs) | Public sharing previews; overlaps AI use | User-agent: FacebookBot Disallow: / |
Amazon
| Bot name | Purpose | robots.txt |
|---|---|---|
| Amazonbot (official docs) | Alexa training + Amazon AI | User-agent: Amazonbot Disallow: / |
Apple
| Bot name | Purpose | robots.txt |
|---|---|---|
| Applebot-Extended (official docs) | Apple Intelligence training | User-agent: Applebot-Extended Disallow: / |
Microsoft
| Bot name | Purpose | robots.txt |
|---|---|---|
| CopilotBot | Microsoft 365 Copilot crawl | User-agent: CopilotBot Disallow: / |
xAI
| Bot name | Purpose | robots.txt |
|---|---|---|
| GrokBot | Grok training | User-agent: GrokBot Disallow: / |
Mistral
| Bot name | Purpose | robots.txt |
|---|---|---|
| MistralAI-User | Le Chat live fetch | User-agent: MistralAI-User Disallow: / |
Cohere
| Bot name | Purpose | robots.txt |
|---|---|---|
| cohere-ai | Cohere training | User-agent: cohere-ai Disallow: / |
You.com
| Bot name | Purpose | robots.txt |
|---|---|---|
| YouBot | You.com search + AI | User-agent: YouBot Disallow: / |
Phind
| Bot name | Purpose | robots.txt |
|---|---|---|
| PhindBot | Phind developer search | User-agent: PhindBot Disallow: / |
DuckDuckGo
| Bot name | Purpose | robots.txt |
|---|---|---|
| DuckAssistBot | DuckAssist (AI answers) | User-agent: DuckAssistBot Disallow: / |
Kagi
| Bot name | Purpose | robots.txt |
|---|---|---|
| KagiBot | Kagi search + AI features | User-agent: KagiBot Disallow: / |
Diffbot
| Bot name | Purpose | robots.txt |
|---|---|---|
| Diffbot (official docs) | Knowledge graph extraction | User-agent: Diffbot Disallow: / |
AI2
| Bot name | Purpose | robots.txt |
|---|---|---|
| ai2bot | Allen Institute for AI research | User-agent: ai2bot Disallow: / |
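Across all the tables above, the robots.txt column shows each directive pair on a single line. In an actual robots.txt file, User-agent and Disallow each go on their own line, and groups are conventionally separated by a blank line. The sketch below expands a chosen set of bot names into that layout; `build_robots_txt` is a hypothetical helper name, and which bots to pass in is up to you.

```python
def build_robots_txt(blocked_bots):
    """Emit one User-agent/Disallow group per bot, separated by blank lines."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked_bots]
    return "\n\n".join(groups) + "\n"


# Example: block OpenAI's and Anthropic's training crawlers only.
print(build_robots_txt(["GPTBot", "ClaudeBot"]))
```

Bots that are not listed get no group at all, which robots.txt treats as unrestricted unless a wildcard group says otherwise.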
What's not in this list
We've intentionally left out:
- Googlebot, Bingbot, traditional search crawlers. They predate the AI category and are well-documented elsewhere. Blocking them is almost always a bad idea regardless of your AI stance.
- Generic scrapers with no clear AI affiliation (e.g., random Python requests User-Agents). We classify those as "unknown" traffic, not AI.
- Image-only crawlers (ImageSift, etc.) unless they participate in AI training, which most don't currently.
- RSS/feed readers and uptime monitors that some sites mistake for AI traffic.
Detecting these bots in your own logs
If you have raw access logs (nginx, Apache, Vercel, Cloudflare), this command will surface AI bot traffic for the bots in the list above:
grep -iE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|claude-web|anthropic-ai|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|CCBot|Meta-ExternalAgent|FacebookBot|Amazonbot|Applebot-Extended|CopilotBot|GrokBot|MistralAI-User|cohere-ai|YouBot|PhindBot|DuckAssistBot|KagiBot|Diffbot|ai2bot' /var/log/nginx/access.log | wc -l
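That gives you a count. Drop the | wc -l for the full list of requests. Expect two of the names to match nothing: per Google's and Apple's crawler documentation, Google-Extended and Applebot-Extended are robots.txt control tokens rather than User-Agent strings, and the fetching itself is done by those companies' regular crawlers. For an actual dashboard with per-bot per-page breakdowns and historical trends, install Crawlytics, which does this in real time across 19 providers.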
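The grep command reports one total. For a per-bot breakdown, the same token list can be counted in a few lines of Python. This is a sketch under assumptions: the function name `count_bot_hits` and the sample log lines are hypothetical, and, like grep, it matches a token anywhere on the line rather than parsing the User-Agent field specifically.

```python
import re
from collections import Counter

# Tokens copied from the grep pattern above.
BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "claude-web",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Bytespider", "CCBot", "Meta-ExternalAgent", "FacebookBot", "Amazonbot",
    "Applebot-Extended", "CopilotBot", "GrokBot", "MistralAI-User",
    "cohere-ai", "YouBot", "PhindBot", "DuckAssistBot", "KagiBot", "Diffbot",
    "ai2bot",
]
BOT_PATTERN = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)
# Map any casing seen in the logs back to the spelling used in the list.
CANONICAL = {t.lower(): t for t in BOT_TOKENS}


def count_bot_hits(log_lines):
    """Count log lines per bot, using the first matching token on each line."""
    counts = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            counts[CANONICAL[match.group(0).lower()]] += 1
    return counts


# Made-up log lines for illustration only.
sample = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:00:02 +0000] "GET /blog HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/120.0"',
]
for bot, hits in count_bot_hits(sample).most_common():
    print(bot, hits)
```

Because the function only iterates over lines, an open file handle on the same access log used in the grep command can be passed in place of the sample list.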
This list will get out of date
New AI bots show up roughly monthly. We update this page on a similar cadence — the "Last updated" date at the top is the source of truth. If you're looking at this 6+ months past that date, expect there to be additions we haven't shipped yet.
If you spot an AI crawler in your logs that's not on this list, email us — we add new bot patterns within a few days of seeing them in the wild.
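One rough way to spot candidates that are not on the list is to look for User-Agent strings that identify themselves as a bot but match none of the known tokens. The sketch below assumes that crawlers include "bot", "crawler", or "spider" in the header, which is a heuristic rather than a standard; `unlisted_bot_agents` and the sample strings are hypothetical. It will also flag traditional crawlers such as Googlebot unless their names are added to the known list.

```python
import re

# Heuristic (an assumption, not a standard): many crawlers self-identify with
# one of these words somewhere in the User-Agent header.
BOT_HINT = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def unlisted_bot_agents(user_agents, known_tokens):
    """Return sorted, de-duplicated bot-like User-Agents matching no known token."""
    known = [token.lower() for token in known_tokens]
    unlisted = set()
    for user_agent in user_agents:
        lowered = user_agent.lower()
        looks_like_bot = BOT_HINT.search(user_agent) is not None
        is_known = any(token in lowered for token in known)
        if looks_like_bot and not is_known:
            unlisted.add(user_agent)
    return sorted(unlisted)


# Made-up headers for illustration; "ExampleNewBot" is not a real crawler.
seen = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ExampleNewBot/0.1",
    "Mozilla/5.0 Firefox/120.0",
]
print(unlisted_bot_agents(seen, ["GPTBot", "ClaudeBot"]))
```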
Frequently Asked Questions
What is GPTBot?
GPTBot is OpenAI's training crawler. It visits public websites a few times per week to collect content for training future versions of ChatGPT. It does not execute JavaScript, does not show up in Google Analytics, and respects robots.txt. To block it, add two lines to your robots.txt: User-agent: GPTBot, followed by Disallow: / on the next line.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training crawler that runs on a schedule. ChatGPT-User is the live-fetch agent that fires only when a real user asks ChatGPT to read a specific page right now. OAI-SearchBot is a third bot, OpenAI's SearchGPT index crawler. Each can be allowed or blocked independently in robots.txt.
How do I see which AI bots are crawling my site?
Three options: (1) grep your raw server access logs for known User-Agent patterns (GPTBot, ClaudeBot, PerplexityBot, Bytespider, CCBot, etc.); (2) check your CDN dashboard if you use Cloudflare or Fastly; (3) install a dedicated tracker like Crawlytics, which classifies 25+ AI crawlers in real time and shows per-page per-bot crawl frequency.
Should I block AI bots from crawling my site?
Depends on your goal. Block them if your content is paywalled, proprietary, or being scraped without compensation. Allow them if you want to be cited by ChatGPT, Claude, Perplexity, and AI search results, because blocking the training crawler means your content is absent from the model's knowledge. A common middle ground: block pure training crawlers like CCBot and Bytespider, allow live-fetch agents like ChatGPT-User and Perplexity-User.
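The middle ground described above can be checked before deploying it. This sketch writes that policy as robots.txt text and evaluates it with Python's standard-library `urllib.robotparser`. The helper name `allowed_bots` and the example.com URL are hypothetical, and the result reflects how Python's parser reads the file, which may differ in edge cases from how a given crawler does.

```python
from urllib.robotparser import RobotFileParser

# Middle-ground policy: block two training crawlers; live-fetch agents are
# simply not listed, so no rule restricts them.
MIDDLE_GROUND_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def allowed_bots(robots_txt, bots, url):
    """Map each bot name to whether the given robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(
    MIDDLE_GROUND_ROBOTS_TXT,
    ["CCBot", "Bytespider", "ChatGPT-User", "Perplexity-User"],
    "https://example.com/article",
)
for bot, allowed in result.items():
    print(bot, "allowed" if allowed else "blocked")
```

robots.txt only constrains crawlers that choose to honor it, so a check like this confirms the file says what you intended, not that every bot will comply.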
How often do AI crawlers visit a website?
Varies widely. Training crawlers like GPTBot and ClaudeBot typically hit a site a few times per week per page. Live-fetch agents like ChatGPT-User and Perplexity-User only fire when a real user asks a question that requires reading that specific URL. High-traffic pages or pages with frequent updates get crawled more often.
This page is part of Crawlytics.app.