How to Track Which AI Bots Crawl Your Site (2026)

Summary

How to track AI bots crawling your site: capture requests from nginx, Vercel, Cloudflare or WordPress logs, identify crawlers by user-agent, tier by intent.

AI bots are already crawling your site, and the record of it is sitting in a file you can read right now. Every time GPTBot, ClaudeBot, or PerplexityBot fetches a page, your server writes a log line with the bot's user-agent in it. Tracking AI bots is not a matter of installing a sensor or waiting for data to accumulate. The data exists. The work is reading it, knowing which names to look for, and sorting the noise from the signal.

This is the step-by-step how-to. If you want the comparison of which products do this for you, I wrote a separate roundup of the best AI bot tracking tools — this post is the process behind those tools, so you understand exactly what they are doing and can do it yourself if you prefer. Four steps: capture the requests, identify the bots, tier them by intent, then track continuously.

Where AI bots show up: logs vs edge

AI bots appear in exactly two places, and which one you read depends on your stack.

Server access logs. If a request reaches your origin server, it gets logged. That is your nginx access.log, Apache's access_log, or your application framework's request log. Each line records the timestamp, the path requested, the status code, and the user-agent. This is the most complete view, because it captures everything your server actually served, and it is first-party data you already own.

Edge request logs. If you run a CDN or a serverless platform, a lot of traffic is answered at the edge before it ever touches your origin. A bot fetching a cached page might never appear in your origin log at all. Cloudflare, Vercel, Fastly, and similar platforms keep their own request logs, and those are where you catch edge-served bot traffic. On a heavily cached site, the edge log is the more honest count.

The practical rule: if your site is a plain server with no CDN, read the access log. If you sit behind Cloudflare or deploy on Vercel, read the edge log, or read both and reconcile. Missing the edge layer is the most common reason a site under-counts its AI-bot traffic — the requests are real, they just never reached the box you were watching.

Step 1: Capture the requests (per stack)

Capturing means getting access to the log stream where bot requests land. Here is where that stream lives on the four stacks I get asked about most.

nginx (or Apache) on your own server. The log is already being written. By default nginx writes to /var/log/nginx/access.log; Apache writes to /var/log/apache2/access.log on Debian and Ubuntu, or /var/log/httpd/access_log on RHEL-family systems. Confirm the log format includes the user-agent; the standard combined format does. You can read it live, rotate it with logrotate, or ship it to a tool. There is nothing to install to start capturing, because the file is already there.

Vercel / Next.js. Vercel does not give you a raw access log by default, so you capture bot requests one of two ways. Read the user-agent in middleware (middleware.ts) from request.headers.get('user-agent') and log or forward it, which lets you see every request including edge-served ones. Or configure a Vercel log drain to pipe request logs to a destination you control. The middleware path is the one that catches bots even when a page is statically served.

Cloudflare. Bot requests answered at Cloudflare's edge live in Cloudflare's logs, not your origin. On paid plans, Logpush exports request logs (including the user-agent field) to storage or a log service. Cloudflare also surfaces verified-bot and AI-crawler data in its own analytics. If you are Cloudflare-only, that edge log is your source of truth, because a cached hit may never reach your server.

WordPress. WordPress runs on top of Apache or nginx, so the simplest capture is the same server access log described above — every request to wp-content and your posts is recorded there. If you do not have shell access to the log, a bot-logging plugin records AI-crawler hits inside the WordPress dashboard. Either way the source is the web server underneath WordPress, not WordPress itself.

Across all four, you are not generating new data. You are getting a handle on the request stream that already contains the bots.
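For the nginx or Apache case, here is a minimal Python sketch of pulling the user-agent out of a log line. It assumes the standard combined format; the function name parse_line, the field names, and the sample IP (from a reserved documentation range) are illustrative choices, not anything your server defines. A custom log_format would need a different pattern.

```python
import re

# Combined format: ip, identity, user, [time], "request", status, bytes,
# "referer", "user-agent". This pattern assumes that exact layout.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request field looks like 'GET /pricing HTTP/1.1'; keep just the path.
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    return fields


# Illustrative line; the IP is a placeholder from a documentation range.
sample = (
    '203.0.113.7 - - [19/Jun/2026:08:14:02 +0000] "GET /pricing HTTP/1.1" 200 8421 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"'
)
record = parse_line(sample)
print(record["path"], record["status"], record["user_agent"])
```

One known limitation of this sketch: a user-agent containing an escaped double quote will not parse cleanly, so treat unmatched lines as something to inspect rather than discard.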

Step 2: Identify the bots by user-agent

You identify an AI bot by matching its user-agent string against known names. The user-agent is a text field the bot sends with every request, and the major AI operators publish theirs. Here is a single log line from nginx, with the user-agent at the end:

52.x.x.x - - [19/Jun/2026:08:14:02 +0000] "GET /pricing HTTP/1.1" 200 8421
  "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"

That GPTBot/1.2 token is the whole tell. To pull every AI-bot hit out of an nginx log in one pass, match the names you care about (Google-Extended is left out because it is a robots.txt control token, not a user-agent that appears in requests):

grep -E 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot' \
  /var/log/nginx/access.log
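If you would rather have a count per bot than a wall of matching lines, the same match can be sketched in Python. The function name count_ai_bot_hits and the sample lines are made up for illustration; the bot names are the ones from the grep that show up as user-agent tokens.

```python
import re
from collections import Counter

# Names that appear as user-agent tokens. Google-Extended is omitted because
# it is a robots.txt token and does not show up in request user-agents.
AI_BOT_NAMES = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "PerplexityBot", "CCBot",
]
AI_BOT_PATTERN = re.compile("|".join(re.escape(name) for name in AI_BOT_NAMES))


def count_ai_bot_hits(lines):
    """Count log lines per AI-bot name, using the first known name found in each line."""
    counts = Counter()
    for line in lines:
        match = AI_BOT_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Made-up sample lines; in practice, pass an open file handle instead:
#   with open("/var/log/nginx/access.log", errors="replace") as log:
#       print(count_ai_bot_hits(log).most_common())
sample_lines = [
    '203.0.113.7 - - [19/Jun/2026:08:14:02 +0000] "GET /pricing HTTP/1.1" 200 8421 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.8 - - [19/Jun/2026:08:15:10 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "ExampleAgent (compatible; ChatGPT-User/1.0)"',
    '203.0.113.9 - - [19/Jun/2026:08:16:45 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
hit_counts = count_ai_bot_hits(sample_lines)
print(hit_counts.most_common())
```

Like the grep, this matches anywhere in the line, so a URL path that happens to contain a bot name would also count. Matching against the parsed user-agent field is stricter if that matters to you.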

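The names worth knowing, and who is behind each:

GPTBot: OpenAI's training crawler
ChatGPT-User: OpenAI, a live fetch made on behalf of a ChatGPT user
OAI-SearchBot: OpenAI's search-index crawler
ClaudeBot: Anthropic
PerplexityBot: Perplexity's search-index crawler
CCBot: Common Crawl
Google-Extended: Google's robots.txt token for AI training use (a control token, not a user-agent you will see in requests)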

For the full, current list of AI-bot user-agents and the operators behind them, keep our AI bots list open while you read your logs — it is the lookup table for this step. One caveat: user-agents can be spoofed, so for a request that matters you can confirm authenticity against each operator's published IP ranges or reverse-DNS. The user-agent is where you start, not always where you stop.
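The IP-range check can be sketched with Python's standard ipaddress module. The CIDR blocks below are placeholders from reserved documentation ranges, not any operator's real ranges, and ip_in_ranges is a hypothetical helper name; you would load the actual list from whatever the operator publishes.

```python
import ipaddress


def ip_in_ranges(ip, cidr_ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Placeholder ranges from reserved documentation blocks, NOT real operator
# ranges. Replace them with the list the operator publishes.
EXAMPLE_PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]

claimed_bot_ip = "192.0.2.44"    # an address inside the placeholder range
unknown_ip = "203.0.113.9"       # an address outside it

print(ip_in_ranges(claimed_bot_ip, EXAMPLE_PUBLISHED_RANGES))
print(ip_in_ranges(unknown_ip, EXAMPLE_PUBLISHED_RANGES))
```

A request that claims a bot's user-agent but comes from outside the operator's ranges is the spoofing case worth flagging.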

Step 3: Tier them by intent

A raw count of "AI bot hits" is almost useless, because the bots want different things. Sorting them into three intent tiers is what turns the log into a decision. The same names from Step 2 fall into these buckets:

Tier 1: training / data collection. GPTBot and CCBot pull your content to train or feed models. Google-Extended belongs to the same policy bucket, but it is a robots.txt token that controls whether Google may use your content for AI training, so it is something you set rather than a user-agent you will see in the log. These hits do not bring a visitor today. Whether you allow them is a strategic and licensing question, not a traffic question. This is the tier most people mean when they ask about "blocking AI."

Tier 2: search / index. OAI-SearchBot and PerplexityBot crawl to build the indexes their assistants cite from. Letting these in is how you become eligible to be surfaced in AI search answers. Blocking them is closer to blocking Googlebot than to blocking a scraper.

Tier 3: live-user fetches. ChatGPT-User is the standout: it fires because a real person asked ChatGPT something and the assistant is fetching your page to answer them. These are the highest-intent bot hits you get, because there is a human attached. A spike in live-user fetches on a product page is a buying signal, not background noise.

Why the tiering matters: a thousand Tier 1 training hits and ten Tier 3 live-user fetches are wildly different events, and a tool that reports "1,010 AI bot hits" flattens them into a number that tells you nothing. When you separate them, the questions get sharp. Are live-user fetches landing on pages that render properly for a bot? Is one training crawler hammering you far past what its value justifies? You cannot ask either question from an undifferentiated total.
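The tier split is small enough to sketch in a few lines of Python. The mapping follows the three buckets above; the tier labels, the names TIER_BY_BOT, tier_for, and count_by_tier, and the sample user-agent strings are illustrative assumptions.

```python
from collections import Counter

# Tier assignments follow the three buckets described above. Google-Extended
# is left out because it does not appear in request user-agents.
TIER_BY_BOT = {
    "GPTBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "live-user",
}


def tier_for(user_agent):
    """Return the intent tier for a user-agent string, or None if no known bot matches."""
    for name, tier in TIER_BY_BOT.items():
        if name in user_agent:
            return tier
    return None


def count_by_tier(user_agents):
    """Count hits per intent tier, ignoring user-agents that match no known bot."""
    counts = Counter()
    for user_agent in user_agents:
        tier = tier_for(user_agent)
        if tier is not None:
            counts[tier] += 1
    return counts


# The 1,010-hit example from the text: 1,000 training hits, 10 live-user fetches.
hits = (
    ["Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"] * 1000
    + ["ExampleAgent (compatible; ChatGPT-User/1.0)"] * 10
)
tiers = count_by_tier(hits)
print(tiers["training"], tiers["live-user"])  # 1000 10
```

Keeping the mapping in one dictionary means adding a crawler to a tier is a one-line change.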

Step 4: Track continuously — DIY vs automated

A one-time grep tells you what happened in one log file. To see trends — which bots are increasing, which pages they favor, when a new crawler shows up — you need continuous tracking. Two ways to get there.

The DIY log pipeline. You can build this. Ship your access and edge logs into a store, run a scheduled job that matches user-agents against your bot list, classify each hit by intent tier, and chart the result. It is real work, and it never stops being work: the bot list changes as operators launch and rename crawlers, spoofing means you eventually want IP verification, and reconciling edge logs against origin logs is fiddly. If you have a log stack and an engineer who enjoys this, it is a legitimate path and you own every byte.
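To give a sense of the scheduled-job part, here is a rough Python sketch that rolls log lines up into per-day, per-bot counts. It assumes the bracketed timestamp format shown in the nginx example earlier; daily_bot_counts and the sample lines are hypothetical, and a real pipeline would also handle log rotation, edge logs, and storage.

```python
import re
from collections import Counter
from datetime import datetime

TIMESTAMP = re.compile(r"\[([^\]]+)\]")
BOT = re.compile(r"GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot")


def daily_bot_counts(lines):
    """Return a Counter keyed by (ISO date, bot name) for lines that match a known bot."""
    counts = Counter()
    for line in lines:
        stamp = TIMESTAMP.search(line)
        bot = BOT.search(line)
        if not stamp or not bot:
            continue
        try:
            # Example timestamp: 19/Jun/2026:08:14:02 +0000
            when = datetime.strptime(stamp.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines whose timestamp is not in the expected format
        counts[(when.date().isoformat(), bot.group(0))] += 1
    return counts


# Made-up sample lines with placeholder IPs and shortened user-agents.
sample = [
    '203.0.113.7 - - [19/Jun/2026:08:14:02 +0000] "GET /pricing HTTP/1.1" 200 8421 "-" "GPTBot/1.2"',
    '203.0.113.7 - - [19/Jun/2026:09:30:00 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "GPTBot/1.2"',
    '203.0.113.8 - - [20/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 8421 "-" "ChatGPT-User/1.0"',
]
trend = daily_bot_counts(sample)
for (day, bot), hits in sorted(trend.items()):
    print(day, bot, hits)
```

Run on a schedule and appended to a table, counts like these are what a trend chart is drawn from.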

An automated tool. The alternative is to point a purpose-built tool at the same logs and let it handle the parsing, the signature list, the tiering, and the dashboard. This is what Crawlytics does for bot tracking: it classifies traffic against 25+ bot signatures spanning 19 providers, matches each hit with sub-millisecond regex classification, sorts everything into the three intent tiers automatically, and works on any stack rather than locking you to one CDN. The signature list is maintained for you, so a new crawler is recognized without you editing a regex. It is the same data the DIY pipeline produces, without you maintaining the pipeline.

The honest trade-off: DIY costs engineering time and ongoing maintenance but gives you total control; an automated tool costs a subscription but gives you the result today and keeps the bot list current. Most teams I talk to start with a grep to confirm the bots are there, then move to a tool the moment they want to watch trends instead of taking snapshots. If you are weighing specific products for that step, the bot-tracking tools roundup compares them, and the Crawlytics vs Google Analytics piece explains why GA4 cannot do this job (it relies on JavaScript that most bots never run).

What to do once you can see them

Seeing the bots is the setup. The payoff is acting on each tier, and the decisions follow from the intent split in Step 3.

Decide which crawlers to allow. Once you can see who is pulling what, the block-or-allow question stops being abstract. Our guide to the GPTBot block decision walks the actual trade-offs (training opt-out versus search visibility), and the manage AI crawlers resource has the robots.txt patterns to implement whatever you choose. Tracking first, then deciding — blocking a bot you have not measured is guessing.
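Before you publish a robots.txt change, you can sanity-check a draft with Python's standard urllib.robotparser. The robots.txt text below is a made-up example of one possible policy (block the training crawler, allow everything else), not a recommendation, and the helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up draft policy: block GPTBot, allow all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, path):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed(ROBOTS_TXT, "GPTBot", "/pricing"))         # False
print(allowed(ROBOTS_TXT, "OAI-SearchBot", "/pricing"))  # True
```

This shows how Python's parser reads the file. Individual crawlers may interpret edge cases such as overlapping Allow and Disallow rules differently, so keep checking the logs after the change goes live.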

Weigh the cost of heavy crawlers. Some crawlers pull a large volume and return nothing of value. Our piece on AI bot traffic cost reframes the log from "interesting" to "is this crawler earning its keep" using relative cost versus value, so a parasite crawler that hammers expensive pages stands out from a live-user fetch that might convert. (These are relative tiers, not dollar figures. Anyone quoting you an exact per-bot dollar cost is making it up.)

Make your pages readable to the bots that matter. If live-user fetches are landing on pages whose content renders client-side, the bot may get an empty shell. The fix is serving bots clean, readable content. That is the next layer of being agent-ready, and it is where bot tracking stops being a report and starts being a lever you can pull.
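One rough way to spot an empty shell is to measure how much visible text the raw HTML contains before any JavaScript runs. The Python sketch below does that with the standard html.parser module. The helper names, the sample pages, and the 200-character default threshold are arbitrary assumptions, and fetching the HTML is left to you.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a client that does not run JavaScript would find in the HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_empty_shell(html, min_chars=200):
    """Flag pages with very little visible text; min_chars is an arbitrary threshold."""
    return len(visible_text(html)) < min_chars


# Made-up pages: a client-rendered shell and a server-rendered page.
shell = (
    '<html><head><title>Pricing</title><script>var a = "lots of text";</script>'
    '</head><body><div id="root"></div></body></html>'
)
rendered = "<html><body><h1>Pricing</h1><p>Plans start at the free tier.</p></body></html>"

print(visible_text(shell))
print(visible_text(rendered))
```

A page that trips the check is a candidate for server-side rendering or a prerendered variant, at least on the URLs where live-user fetches land.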

Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.

Frequently Asked Questions

How do I see which AI bots crawl my site?

Read your server access logs or your CDN request logs and match the user-agent field against known AI-bot names like GPTBot, ChatGPT-User, ClaudeBot, and PerplexityBot. A one-line grep over your nginx or Apache log shows them immediately. To watch trends over time instead of a single snapshot, point a bot-tracking tool at the same log stream so the classification and tiering happen automatically.

How do I identify AI crawlers in my server logs?

Match the user-agent string. AI crawlers announce themselves with names such as GPTBot (OpenAI training), ChatGPT-User (a live ChatGPT fetch), OAI-SearchBot (OpenAI search), ClaudeBot (Anthropic), PerplexityBot, and CCBot (Common Crawl). Filter your access log for those substrings. Google-Extended is the exception: it is a robots.txt control token, not a user-agent that appears in requests. Because user-agents can be spoofed, you can confirm a bot is genuine against the operator's published IP ranges or with reverse DNS, but the user-agent is where you start.

Can I track AI bots without code?

Yes. Put a tool in front of your logs or your CDN and let it do the parsing. Crawlytics ingests your traffic, matches it against 25+ bot signatures, tiers each hit by intent, and shows the result in a dashboard, so you never touch a regex. The free agent-readiness grader at /agent-ready is a code-free way to check, in about a minute, whether AI bots can even reach your pages in the first place.

How do I track AI bots on Vercel or WordPress?

On Vercel, read the user-agent header in middleware or an edge function and log or forward it, or configure a Vercel log drain. On WordPress, either parse the server access log directly or use a plugin that logs AI-bot hits, since WordPress runs on top of the same Apache or nginx that records every request. The bot names are identical across stacks; only the place you read them from changes.

What do I do once I can see the AI bots?

Decide per intent tier. Live-user fetches (ChatGPT-User) are the highest-value visitors, so make sure those pages render cleanly for bots. Training crawlers (GPTBot, CCBot) are a policy choice you control in robots.txt. Watch for heavy crawlers that pull a lot and convert nothing. Our guides on AI-bot traffic cost and the GPTBot block decision help turn raw log data into those calls.

This page is part of Crawlytics.app.