Complete List of AI Crawler Bots: User-Agents + robots.txt (2026)

Summary

Every major AI crawler: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, and 20 more. User-Agent patterns, purpose, robots.txt directives.

Why these bots exist
Should you block AI crawlers?
The full bot table
- OpenAI
- Anthropic
- Perplexity
- Google
- ByteDance
- Common Crawl
- Meta
- Amazon
- Apple
- Microsoft
- xAI
- Mistral
- Cohere
- You.com
- Phind
- DuckDuckGo
- Kagi
- Diffbot
- AI2
What's not in this list
Detecting these bots in your own logs
This list will get out of date
Related

If you've been getting unexpected traffic from User-Agents you don't recognize, this is probably the table you wanted. We maintain this list by watching real ingest traffic across the Crawlytics customer base — when a new AI crawler shows up in the wild, we add the signature here and in the production classifier.

This list covers 25 bots across 19 companies as of June 2026.

Why these bots exist

AI crawlers fall into three categories:

- Training crawlers: fetch your content for use in model training. They visit periodically (weekly to monthly), don't execute JavaScript, and won't show up in your front-end analytics. Examples: GPTBot, ClaudeBot, Bytespider, Applebot-Extended.
- Live-fetch agents: fire when a user asks the AI a question that requires fetching a specific URL right now. Volume is lower, but requests happen in real time. Examples: ChatGPT-User, Perplexity-User, claude-web.
- Search-index crawlers: feed AI search products (SearchGPT, You.com, DuckAssist, Kagi). They behave more like traditional search crawlers: frequent, broad, and indexed for retrieval. Examples: OAI-SearchBot, PerplexityBot, YouBot.

Most production AI assistants use multiple bots from this list in combination — training plus live-fetch plus index. Blocking one but not the others usually doesn't get you the result you want.
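As a rough illustration of the three categories, the sketch below maps a User-Agent string to a category using only the example bots named above. The function name ai_bot_category is hypothetical, and any bot not named in those examples falls through to "unknown" rather than being guessed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a User-Agent string to one of the three categories.
# Only the example bots named in this section are covered.
ai_bot_category() {
  local ua
  # Lowercase the input so matching is case-insensitive.
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *gptbot*|*claudebot*|*bytespider*|*applebot-extended*)
      echo "training" ;;
    *chatgpt-user*|*perplexity-user*|*claude-web*)
      echo "live-fetch" ;;
    *oai-searchbot*|*perplexitybot*|*youbot*)
      echo "search-index" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Calling ai_bot_category with the User-Agent field from a log line prints one of training, live-fetch, search-index, or unknown.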

Should you block AI crawlers?

Depends on your business. The short version:

- If you want to be cited by ChatGPT, Claude, Perplexity, and AI search: allow them. Blocking the training crawler means your content is missing from the model's knowledge; blocking the live-fetch agent means user-initiated queries about your page can't pull fresh content.
- If your content is paywalled or proprietary and being scraped without compensation: block them. Use robots.txt for compliant bots and CDN bot rules (Cloudflare, Fastly, Vercel Bot Manager) for the rest.
- If you're not sure: install Crawlytics first to see what they're actually doing on your site. Then decide based on data instead of vibes.
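For the blocking case, the robots.txt rules for compliant bots can be generated rather than typed by hand. This is a minimal sketch: the function name disallow_rules is hypothetical, and it simply prints one group per bot name with each directive on its own line. It does nothing for bots that ignore robots.txt, which still need CDN rules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one robots.txt group per bot name given as an argument.
# Each directive goes on its own line, and groups are separated by a blank line.
disallow_rules() {
  local bot
  for bot in "$@"; do
    printf 'User-agent: %s\nDisallow: /\n\n' "$bot"
  done
}
```

For example, disallow_rules GPTBot ClaudeBot Bytespider >> robots.txt appends three groups to an existing file.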

For a deeper walkthrough of allow/block strategy, see how to manage AI crawlers.

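The full bot table

Each robots.txt cell below shows two directives side by side. In an actual robots.txt file, put User-agent and Disallow on separate lines.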

OpenAI

Bot name	Purpose	robots.txt
GPTBot (official docs)	Training crawl	User-agent: GPTBot Disallow: /
ChatGPT-User (official docs)	Live user-initiated fetch	User-agent: ChatGPT-User Disallow: /
OAI-SearchBot (official docs)	SearchGPT index	User-agent: OAI-SearchBot Disallow: /

Anthropic

Bot name	Purpose	robots.txt
ClaudeBot (official docs)	Training crawl	User-agent: ClaudeBot Disallow: /
claude-web (official docs)	Live user-initiated fetch	User-agent: claude-web Disallow: /
anthropic-ai (official docs)	Legacy / general	User-agent: anthropic-ai Disallow: /

Perplexity

Bot name	Purpose	robots.txt
PerplexityBot (official docs)	Index for Perplexity answers	User-agent: PerplexityBot Disallow: /
Perplexity-User (official docs)	Live user-initiated fetch	User-agent: Perplexity-User Disallow: /

Google

Bot name	Purpose	robots.txt
Google-Extended (official docs)	Gemini training opt-out token; a robots.txt control only, with no separate User-Agent in logs	User-agent: Google-Extended Disallow: /

ByteDance

Bot name	Purpose	robots.txt
Bytespider	Training crawl for Doubao + TikTok AI	User-agent: Bytespider Disallow: /

Common Crawl

Bot name	Purpose	robots.txt
CCBot (official docs)	Open crawl corpus used by many AI labs	User-agent: CCBot Disallow: /

Meta

Bot name	Purpose	robots.txt
Meta-ExternalAgent	Meta AI / Llama training	User-agent: Meta-ExternalAgent Disallow: /
FacebookBot (official docs)	Public sharing previews; overlaps with AI use	User-agent: FacebookBot Disallow: /

Amazon

Bot name	Purpose	robots.txt
Amazonbot (official docs)	Alexa training + Amazon AI	User-agent: Amazonbot Disallow: /

Apple

Bot name	Purpose	robots.txt
Applebot-Extended (official docs)	Apple Intelligence training opt-out token; the crawling itself is done by Applebot	User-agent: Applebot-Extended Disallow: /

Microsoft

Bot name	Purpose	robots.txt
CopilotBot	Microsoft 365 Copilot crawl	User-agent: CopilotBot Disallow: /

xAI

Bot name	Purpose	robots.txt
GrokBot	Grok training	User-agent: GrokBot Disallow: /

Mistral

Bot name	Purpose	robots.txt
MistralAI-User	Le Chat live fetch	User-agent: MistralAI-User Disallow: /

Cohere

Bot name	Purpose	robots.txt
cohere-ai	Cohere training	User-agent: cohere-ai Disallow: /

You.com

Bot name	Purpose	robots.txt
YouBot	You.com search + AI	User-agent: YouBot Disallow: /

Phind

Bot name	Purpose	robots.txt
PhindBot	Phind developer search	User-agent: PhindBot Disallow: /

DuckDuckGo

Bot name	Purpose	robots.txt
DuckAssistBot	DuckAssist (AI answers)	User-agent: DuckAssistBot Disallow: /

Kagi

Bot name	Purpose	robots.txt
KagiBot	Kagi search + AI features	User-agent: KagiBot Disallow: /

Diffbot

Bot name	Purpose	robots.txt
Diffbot (official docs)	Knowledge graph extraction	User-agent: Diffbot Disallow: /

AI2

Bot name	Purpose	robots.txt
ai2bot	Allen Institute for AI research	User-agent: ai2bot Disallow: /
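Once rules from the tables are in a robots.txt file, it can help to confirm that a given bot is actually covered. The sketch below is a deliberately simplified check, not a full robots.txt parser: the function name robots_blocks_bot is hypothetical, and it only looks for a group that names the bot exactly and contains Disallow: /. It ignores wildcard (*) groups, Allow overrides, and path-specific rules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if a local robots.txt file has a group
# naming the bot with "Disallow: /". Simplified: no wildcard or Allow handling.
# $1 = bot name from the tables, $2 = path to a local robots.txt file
robots_blocks_bot() {
  awk -v bot="$1" '
    BEGIN { bot = tolower(bot) }
    {
      sub(/#.*/, "")                    # strip comments
      colon = index($0, ":")
      if (colon == 0) next              # skip blank or malformed lines
      field = tolower(substr($0, 1, colon - 1))
      value = substr($0, colon + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", field)
      gsub(/^[ \t]+|[ \t\r]+$/, "", value)
      if (field == "user-agent") {
        # A User-agent line after any rule starts a new group.
        if (seen_rule) { in_group = 0; seen_rule = 0 }
        if (tolower(value) == bot) in_group = 1
      } else if (field == "allow" || field == "disallow") {
        seen_rule = 1
        if (field == "disallow" && value == "/" && in_group) blocked = 1
      }
    }
    END { exit (blocked ? 0 : 1) }
  ' "$2"
}
```

For example, robots_blocks_bot GPTBot ./robots.txt && echo blocked prints "blocked" only when such a group is present.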

What's not in this list

We've intentionally left out:

- Googlebot, Bingbot, and other traditional search crawlers. They predate the AI category and are well-documented elsewhere. Blocking them is almost always a bad idea regardless of your AI stance.
- Generic scrapers with no clear AI affiliation (e.g., scripts sending default Python requests User-Agents). We classify those as "unknown" traffic, not AI.
- Image-only crawlers (ImageSift, etc.) unless they participate in AI training, which most currently don't.
- RSS/feed readers and uptime monitors that some sites mistake for AI traffic.

Detecting these bots in your own logs

If you have raw access logs (nginx, Apache, Vercel, Cloudflare), this command will surface AI bot traffic for the bots in the list above:

grep -iE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|claude-web|anthropic-ai|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|CCBot|Meta-ExternalAgent|FacebookBot|Amazonbot|Applebot-Extended|CopilotBot|GrokBot|MistralAI-User|cohere-ai|YouBot|PhindBot|DuckAssistBot|KagiBot|Diffbot|ai2bot' /var/log/nginx/access.log | wc -l

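That gives you a count. Drop the | wc -l to see the full list of matching requests. Google-Extended and Applebot-Extended are robots.txt control tokens rather than crawler User-Agents, so expect no matches for those two. For an actual dashboard with per-bot, per-page breakdowns and historical trends, install Crawlytics, which does this in real time across 19 providers.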
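A total is often less useful than a per-bot breakdown. The sketch below reuses the bot names from the grep command above and counts each request once, keyed by the first bot name found on the line. The function name ai_bot_breakdown is hypothetical, and like the grep command it matches anywhere on the line, so a URL path that happens to contain a bot name would also be counted.

```shell
#!/usr/bin/env bash
# Same bot names as the grep command above.
AI_BOTS='GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|claude-web|anthropic-ai|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|CCBot|Meta-ExternalAgent|FacebookBot|Amazonbot|Applebot-Extended|CopilotBot|GrokBot|MistralAI-User|cohere-ai|YouBot|PhindBot|DuckAssistBot|KagiBot|Diffbot|ai2bot'

# Hypothetical helper: print "<count> <bot>" per bot, busiest first.
# $1 = path to an access log.
ai_bot_breakdown() {
  local pattern
  # Lowercase the pattern; each log line is lowercased before matching.
  pattern=$(printf '%s' "$AI_BOTS" | tr '[:upper:]' '[:lower:]')
  awk -v pat="$pattern" '
    {
      line = tolower($0)
      # Count each request once, keyed by the first bot name found on the line.
      if (match(line, pat)) hits[substr(line, RSTART, RLENGTH)]++
    }
    END { for (bot in hits) print hits[bot], bot }
  ' "$1" | sort -rn
}
```

For example, ai_bot_breakdown /var/log/nginx/access.log prints one line per bot seen, with bot names in lowercase.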

This list will get out of date

New AI bots show up roughly monthly. We update this page on a similar cadence — the "Last updated" date at the top is the source of truth. If you're looking at this 6+ months past that date, expect there to be additions we haven't shipped yet.

If you spot an AI crawler in your logs that's not on this list, email us — we add new bot patterns within a few days of seeing them in the wild.
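To look for candidates yourself, one approach is to list bot-like User-Agents that are not already on this page. The sketch below is only a starting point under stated assumptions: the function name unlisted_bot_agents is hypothetical, it assumes the nginx/Apache "combined" log format, and the bot|crawler|spider filter is a heuristic of ours. Traditional crawlers such as Googlebot will show up in the output too, so the result still needs manual review.

```shell
#!/usr/bin/env bash
# Bot names already covered on this page.
KNOWN_AI_BOTS='GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|claude-web|anthropic-ai|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|CCBot|Meta-ExternalAgent|FacebookBot|Amazonbot|Applebot-Extended|CopilotBot|GrokBot|MistralAI-User|cohere-ai|YouBot|PhindBot|DuckAssistBot|KagiBot|Diffbot|ai2bot'

# Hypothetical helper: list bot-like User-Agents that are not in KNOWN_AI_BOTS,
# with a request count for each, busiest first.
# Assumes the "combined" log format, where the User-Agent is the sixth field
# when a line is split on double quotes. $1 = path to an access log.
unlisted_bot_agents() {
  awk -F'"' '{ print $6 }' "$1" |
    grep -iE 'bot|crawler|spider' |     # heuristic: keep bot-like agents
    grep -viE "$KNOWN_AI_BOTS" |        # drop agents already on the list
    sort | uniq -c | sort -rn
}
```

For example, unlisted_bot_agents /var/log/nginx/access.log | head -20 shows the twenty busiest unlisted agents.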

Frequently Asked Questions

What is GPTBot?

GPTBot is OpenAI's training crawler. It visits public websites a few times per week to collect content for training future versions of ChatGPT. It does not execute JavaScript, does not show up in Google Analytics, and respects robots.txt. To block it, add User-agent: GPTBot then Disallow: / to your robots.txt.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler that runs on a schedule. ChatGPT-User is the live-fetch agent that fires only when a real user asks ChatGPT to read a specific page right now. OAI-SearchBot is a third bot, OpenAI's SearchGPT index crawler. Each can be allowed or blocked independently in robots.txt.

How do I see which AI bots are crawling my site?

Three options: (1) grep your raw server access logs for known User-Agent patterns (GPTBot, ClaudeBot, PerplexityBot, Bytespider, CCBot, etc.); (2) check your CDN dashboard if you use Cloudflare or Fastly; (3) install a dedicated tracker like Crawlytics, which classifies 25+ AI crawlers in real time and shows per-page per-bot crawl frequency.

Should I block AI bots from crawling my site?

Depends on your goal. Block them if your content is paywalled, proprietary, or being scraped without compensation. Allow them if you want to be cited by ChatGPT, Claude, Perplexity, and AI search results, because blocking the training crawler means your content is absent from the model's knowledge. A common middle ground: block pure training crawlers like CCBot and Bytespider, allow live-fetch agents like ChatGPT-User and Perplexity-User.
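The middle ground described above might look like the following robots.txt fragment, wrapped in a shell function so it can be redirected into a file. This is a sketch, not a recommended policy: the function name middle_ground_robots is hypothetical, and the explicit Allow groups are there mainly to document intent, since a bot with no matching group falls back to the rest of the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a robots.txt fragment that blocks the two training
# crawlers named above and explicitly allows the two live-fetch agents.
middle_ground_robots() {
  cat <<'EOF'
# Block pure training crawlers
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow live-fetch agents explicitly
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /
EOF
}
```

For example, middle_ground_robots >> robots.txt appends the fragment to an existing file.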

How often do AI crawlers visit a website?

Varies widely. Training crawlers like GPTBot and ClaudeBot typically hit a site a few times per week per page. Live-fetch agents like ChatGPT-User and Perplexity-User only fire when a real user asks a question that requires reading that specific URL. High-traffic pages or pages with frequent updates get crawled more often.
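To measure the frequency on your own site rather than rely on typical figures, you can count one bot's requests per day. The sketch below assumes timestamps in the common log format, such as [03/Jun/2026:10:00:00 +0000]; the function name bot_hits_per_day is hypothetical. The output is sorted as text, so days are not in calendar order across months.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<date> <hits>" for one bot, one line per day.
# $1 = bot name (for example GPTBot), $2 = path to an access log.
bot_hits_per_day() {
  grep -iF -- "$1" "$2" |
    awk '
      {
        # Grab the text between the opening bracket and the first colon,
        # which is the date part of the timestamp.
        if (match($0, /\[[^:]*:/)) days[substr($0, RSTART + 1, RLENGTH - 2)]++
      }
      END { for (d in days) print d, days[d] }
    ' |
    sort
}
```

For example, bot_hits_per_day GPTBot /var/log/nginx/access.log prints one line per day on which GPTBot made a request.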

Cite this page

Title: Complete List of AI Crawler Bots: User-Agents + robots.txt (2026)
Author: Crawlytics
Publisher: Crawlytics
Published: 2026-06-03
Updated: 2026-06-03
URL: https://crawlytics.app/resources/ai-bots-list?utm_source=claude&utm_medium=ai_referral&utm_campaign=crawlytics
