Complete List of AI Crawler Bots: User-Agents + robots.txt (2026)

Summary

Every major AI crawler: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, and 20 more. User-Agent patterns, purpose, robots.txt directives.

If you've been getting unexpected traffic from User-Agents you don't recognize, this is probably the table you wanted. We maintain this list by watching real ingest traffic across the Crawlytics customer base — when a new AI crawler shows up in the wild, we add the signature here and in the production classifier.

25 bots across 19 companies as of June 2026.

Why these bots exist

AI crawlers fall into three categories:

  1. Training crawlers: fetch your content to use in model training. These visit periodically (weekly to monthly), don't execute JavaScript, and won't show up in your front-end analytics. Examples: GPTBot, ClaudeBot, Bytespider, Applebot-Extended.
  2. Live-fetch agents: fire when a user asks the AI a question that requires fetching a specific URL right now. Lower volume, but real-time. Examples: ChatGPT-User, Perplexity-User, claude-web.
  3. Search-index crawlers: feed AI search products (SearchGPT, You.com, DuckAssist, Kagi). They behave like traditional search crawlers: frequent, broad, and indexed for retrieval. Examples: OAI-SearchBot, PerplexityBot, YouBot.

Most production AI assistants use multiple bots from this list in combination — training plus live-fetch plus index. Blocking one but not the others usually doesn't get you the result you want.
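The three categories can be captured in a small lookup table. The sketch below is one possible way to tag a raw User-Agent string with a category. It covers only the example bots named in the list above; the `classify` helper and the sample User-Agent strings are hypothetical, and case-insensitive substring matching is an assumption rather than a documented rule.

```python
# Category lookup built only from the example bots named in the list above.
BOT_CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Bytespider", "Applebot-Extended"],
    "live-fetch": ["ChatGPT-User", "Perplexity-User", "claude-web"],
    "search-index": ["OAI-SearchBot", "PerplexityBot", "YouBot"],
}


def classify(user_agent):
    """Return the category of the first known bot token found, else None.

    Matching is a case-insensitive substring test (an assumption): real
    User-Agent strings usually embed the token next to a version number.
    """
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua:
                return category
    return None


# Hypothetical User-Agent strings, for illustration only.
print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)"))           # training
print(classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))     # live-fetch
print(classify("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))  # None
```

Keeping the categories in data rather than in branching code makes it easy to extend the lookup as new bots appear.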

Should you block AI crawlers?

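It depends on your business. The short version: block them if your content is paywalled, proprietary, or being scraped without compensation; allow them if you want to be cited by AI assistants and AI search results. A common middle ground is to block pure training crawlers and allow live-fetch agents.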

For a deeper walkthrough of allow/block strategy, see how to manage AI crawlers.

The full bot table

OpenAI

Bot name                       Purpose                    robots.txt
GPTBot (official docs)         Training crawl             User-agent: GPTBot
                                                          Disallow: /
ChatGPT-User (official docs)   Live user-initiated fetch  User-agent: ChatGPT-User
                                                          Disallow: /
OAI-SearchBot (official docs)  SearchGPT index            User-agent: OAI-SearchBot
                                                          Disallow: /

Anthropic

Bot name                      Purpose                    robots.txt
ClaudeBot (official docs)     Training crawl             User-agent: ClaudeBot
                                                         Disallow: /
claude-web (official docs)    Live user-initiated fetch  User-agent: claude-web
                                                         Disallow: /
anthropic-ai (official docs)  Legacy / general           User-agent: anthropic-ai
                                                         Disallow: /

Perplexity

Bot name                         Purpose                       robots.txt
PerplexityBot (official docs)    Index for Perplexity answers  User-agent: PerplexityBot
                                                               Disallow: /
Perplexity-User (official docs)  Live user-initiated fetch     User-agent: Perplexity-User
                                                               Disallow: /

Google

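Bot name                         Purpose                         robots.txt
Google-Extended (official docs)  Gemini training opt-out signal  User-agent: Google-Extended
                                                                 Disallow: /

Note: according to Google's documentation, Google-Extended is a robots.txt control token rather than a separate crawler. Pages are fetched with Google's existing User-Agents, so this token does not appear in access logs, and it does not affect inclusion in Google Search.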

ByteDance

Bot name    Purpose                                robots.txt
Bytespider  Training crawl for Doubao + TikTok AI  User-agent: Bytespider
                                                   Disallow: /

Common Crawl

Bot name               Purpose                                 robots.txt
CCBot (official docs)  Open crawl corpus used by many AI labs  User-agent: CCBot
                                                               Disallow: /

Meta

Bot name                     Purpose                                    robots.txt
Meta-ExternalAgent           Meta AI / Llama training                   User-agent: Meta-ExternalAgent
                                                                        Disallow: /
FacebookBot (official docs)  Public sharing previews (overlaps AI use)  User-agent: FacebookBot
                                                                        Disallow: /

Amazon

Bot name                   Purpose                     robots.txt
Amazonbot (official docs)  Alexa training + Amazon AI  User-agent: Amazonbot
                                                       Disallow: /

Apple

Bot name                           Purpose                      robots.txt
Applebot-Extended (official docs)  Apple Intelligence training  User-agent: Applebot-Extended
                                                                Disallow: /

Microsoft

Bot name    Purpose                      robots.txt
CopilotBot  Microsoft 365 Copilot crawl  User-agent: CopilotBot
                                         Disallow: /

xAI

Bot name  Purpose        robots.txt
GrokBot   Grok training  User-agent: GrokBot
                         Disallow: /

Mistral

Bot name        Purpose             robots.txt
MistralAI-User  Le Chat live fetch  User-agent: MistralAI-User
                                    Disallow: /

Cohere

Bot name   Purpose          robots.txt
cohere-ai  Cohere training  User-agent: cohere-ai
                            Disallow: /

You.com

Bot name  Purpose              robots.txt
YouBot    You.com search + AI  User-agent: YouBot
                               Disallow: /

Phind

Bot name  Purpose                 robots.txt
PhindBot  Phind developer search  User-agent: PhindBot
                                  Disallow: /

DuckDuckGo

Bot name       Purpose                  robots.txt
DuckAssistBot  DuckAssist (AI answers)  User-agent: DuckAssistBot
                                        Disallow: /

Kagi

Bot name  Purpose                    robots.txt
KagiBot   Kagi search + AI features  User-agent: KagiBot
                                     Disallow: /

Diffbot

Bot name                 Purpose                     robots.txt
Diffbot (official docs)  Knowledge graph extraction  User-agent: Diffbot
                                                     Disallow: /

AI2

Bot name  Purpose                          robots.txt
ai2bot    Allen Institute for AI research  User-agent: ai2bot
                                           Disallow: /
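
Each robots.txt entry in the tables above is a two-line group: a User-agent line followed by a Disallow line, each on its own line. As a minimal sketch, the groups for whichever bots you choose could be assembled like this. The `build_robots_txt` helper is hypothetical, and the three tokens passed to it are just a sample from the tables.

```python
def build_robots_txt(bot_tokens):
    """Build one 'User-agent / Disallow: /' group per bot token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in bot_tokens]
    # Blank line between groups, single trailing newline at end of file.
    return "\n\n".join(groups) + "\n"


# Sample selection; substitute the tokens you actually want to block.
print(build_robots_txt(["GPTBot", "ClaudeBot", "CCBot"]))
```

The output can be pasted into an existing robots.txt. Groups for different bots are independent, so adding or removing one does not affect the others.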

What's not in this list

We've intentionally left out:

Detecting these bots in your own logs

If you have raw access logs (nginx, Apache, Vercel, Cloudflare), this command will surface AI bot traffic for the bots in the list above:

grep -iE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|claude-web|anthropic-ai|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|CCBot|Meta-ExternalAgent|FacebookBot|Amazonbot|Applebot-Extended|CopilotBot|GrokBot|MistralAI-User|cohere-ai|YouBot|PhindBot|DuckAssistBot|KagiBot|Diffbot|ai2bot' /var/log/nginx/access.log | wc -l

That gives you a count. Drop the | wc -l for the full list of requests. For an actual dashboard with per-bot per-page breakdowns and historical trends, install Crawlytics — it does this in real time across 19 providers.
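
If you want a per-bot breakdown rather than a single total, the same pattern can be applied in a few lines of Python. This is a rough sketch, not the Crawlytics classifier: the `count_bot_hits` helper and the sample log lines are hypothetical, and the token list simply mirrors the grep pattern above.

```python
import re
from collections import Counter

# Same tokens as the grep pattern above.
BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "claude-web",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Bytespider", "CCBot", "Meta-ExternalAgent", "FacebookBot", "Amazonbot",
    "Applebot-Extended", "CopilotBot", "GrokBot", "MistralAI-User",
    "cohere-ai", "YouBot", "PhindBot", "DuckAssistBot", "KagiBot",
    "Diffbot", "ai2bot",
]
BOT_RE = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)
# Map lowercase matches back to the canonical spelling used in the tables.
CANONICAL = {t.lower(): t for t in BOT_TOKENS}


def count_bot_hits(lines):
    """Count log lines per bot token (case-insensitive, first match per line)."""
    counts = Counter()
    for line in lines:
        match = BOT_RE.search(line)
        if match:
            counts[CANONICAL[match.group(0).lower()]] += 1
    return counts


# Hypothetical access-log lines, for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [01/Jun/2026:10:00:01 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:00:02 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
    '203.0.113.5 - - [01/Jun/2026:10:00:03 +0000] "GET /pricing HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; gptbot/1.0)"',
]
print(count_bot_hits(SAMPLE_LINES).most_common())
# [('GPTBot', 2), ('ClaudeBot', 1)]

# To run against a real log, pass the open file instead of the sample list:
# with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
#     print(count_bot_hits(f).most_common())
```

Like the grep command, this matches the token anywhere in the line, so a request path or referrer that happens to contain a bot name would also be counted.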

This list will get out of date

New AI bots show up roughly monthly. We update this page on a similar cadence — the "Last updated" date at the top is the source of truth. If you're looking at this 6+ months past that date, expect there to be additions we haven't shipped yet.

If you spot an AI crawler in your logs that's not on this list, email us — we add new bot patterns within a few days of seeing them in the wild.

Frequently Asked Questions

What is GPTBot?

GPTBot is OpenAI's training crawler. It revisits public websites periodically to collect content for training future versions of ChatGPT. It does not execute JavaScript, does not show up in Google Analytics, and respects robots.txt. To block it, add two lines to your robots.txt: User-agent: GPTBot, followed on the next line by Disallow: /.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler that runs on a schedule. ChatGPT-User is the live-fetch agent that fires only when a real user asks ChatGPT to read a specific page right now. OAI-SearchBot is a third bot, OpenAI's SearchGPT index crawler. Each can be allowed or blocked independently in robots.txt.

How do I see which AI bots are crawling my site?

Three options: (1) grep your raw server access logs for known User-Agent patterns (GPTBot, ClaudeBot, PerplexityBot, Bytespider, CCBot, etc.); (2) check your CDN dashboard if you use Cloudflare or Fastly; (3) install a dedicated tracker like Crawlytics, which classifies 25+ AI crawlers in real time and shows per-page per-bot crawl frequency.

Should I block AI bots from crawling my site?

Depends on your goal. Block them if your content is paywalled, proprietary, or being scraped without compensation. Allow them if you want to be cited by ChatGPT, Claude, Perplexity, and AI search results, because blocking the training crawler means your content is absent from the model's knowledge. A common middle ground: block pure training crawlers like CCBot and Bytespider, allow live-fetch agents like ChatGPT-User and Perplexity-User.
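
One way to sanity-check a middle-ground policy before deploying it is to parse the draft file with Python's standard `urllib.robotparser` and ask which bots it would admit. The sketch below assumes the two training crawlers named above are blocked and everything else is left unlisted. The `allowed` helper and the example.com URL are hypothetical, and Python's parser is only one interpretation of robots.txt, so individual crawlers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Middle-ground policy from the answer above: block pure training crawlers.
# Live-fetch agents have no group, so nothing is disallowed for them.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def allowed(robots_txt, bot, url="https://example.com/article"):
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


for bot in ["CCBot", "Bytespider", "ChatGPT-User", "Perplexity-User"]:
    print(bot, "allowed" if allowed(ROBOTS_TXT, bot) else "blocked")
```

Running the check against the file you intend to ship catches typos in bot tokens, which otherwise fail silently because an unmatched group simply never applies.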

How often do AI crawlers visit a website?

Varies widely. Training crawlers like GPTBot and ClaudeBot typically revisit a given page anywhere from a few times per week to about once a month. Live-fetch agents like ChatGPT-User and Perplexity-User only fire when a real user asks a question that requires reading that specific URL. High-traffic pages and pages with frequent updates get crawled more often.

This page is part of Crawlytics.app.