Server logs show which AI bots fetched your pages; prompt-testing shows which answers cite you. Practical playbook with bot UAs, grep commands, and benchmarks.
The most common question I get from marketing teams in 2026 is some version of: "Is ChatGPT citing my site?" The answer is usually disappointing — not because the data isn't there, but because most teams are looking in the wrong place. Google Analytics will not tell you. Your CDN dashboard probably won't either. The data exists in your raw server logs and in the AI tools themselves, but you have to know what to look for.
This is the practical playbook. Four detection steps, the User-Agent strings to grep for, a way to prompt-test your own brand, and the benchmarks that tell you whether what you're seeing is good, average, or a warning sign.
First, the uncomfortable part. When someone discovers you through ChatGPT and visits your site later, the visit almost never carries an AI fingerprint. It lands in GA as (direct) / (none), or as a branded Google search three days later when the buyer types your name to find pricing. The discovery happened inside an AI answer; the analytics record says otherwise.
Zero-click answers are worse. A buyer asks Perplexity to compare five tools in your category, reads the synthesis, shortlists you, and never clicks anything. You influenced a deal and generated zero rows in any analytics table. Sales teams call this the dark funnel, and AI assistants are pumping more of the buying journey into it every quarter.
So when teams try to track AI search visibility through referral reports and conclude nothing is happening, they're usually wrong. The influence is there. It's wearing a disguise — direct visits, branded searches, "a colleague mentioned you" — and the job is to find signals that don't depend on a referrer header.
Casey Nifong made this case well in a June 2026 Search Engine Land piece on tracking AI search visibility when attribution falls short. Her argument: no single metric explains AI-driven influence, so you triangulate across four signals: assisted conversions, branded search growth, direct traffic trends, and brand visibility inside the AI systems themselves. All four are sound, and the first three are exactly where AI's invisible influence leaks back into measurable data. (The fourth is usually measured by prompt sampling, which has real limits; more on that in our piece on why AI share of voice is a made-up number.)
What the article doesn't cover is the one dataset that records AI activity directly instead of by proxy: your server logs. Every ChatGPT-User or Perplexity-User hit is a timestamped, page-level record of an assistant pulling your content for a live answer. The four proxy signals tell you something is happening. The logs tell you which assistant, which page, and when.
That pairing is the practical move. Treat AI bot crawl spikes as your leading indicator and branded search lift as the lagging one. If Claude-User fetches of your comparison page triple in March and branded search impressions climb in April, you've connected crawl data to business impact without a referrer header in sight. Watch the two lines together for a quarter and the lag between them becomes your attribution model.
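One rough way to put a number on that lag is to shift the branded-search series against the crawl series and see which shift correlates best. The sketch below assumes you already have two equal-length weekly series; the function names and the sample numbers are hypothetical, and a high correlation on a handful of weeks is a hint, not proof of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(crawl_counts, branded_impressions, max_lag=8):
    """Return (lag, r): the shift in periods at which the two series correlate best."""
    if len(crawl_counts) != len(branded_impressions):
        raise ValueError("series must cover the same periods")
    best = (0, pearson(crawl_counts, branded_impressions))
    for lag in range(1, max_lag + 1):
        if len(crawl_counts) - lag < 3:  # too few overlapping points to mean anything
            break
        # Compare crawl activity in period t with impressions in period t + lag.
        r = pearson(crawl_counts[:-lag], branded_impressions[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical weekly numbers: impressions echo crawl spikes two weeks later.
weekly_fetches = [10, 12, 30, 11, 10, 13, 40, 12, 11, 10]
weekly_impressions = [50, 55, 50, 60, 150, 55, 50, 65, 200, 60]
lag, r = best_lag(weekly_fetches, weekly_impressions, max_lag=4)
print(f"best lag: {lag} weeks (r={r:.2f})")
```

With a quarter of real data the series will be noisier than this, so treat the best lag as a starting hypothesis and re-check it as more weeks accumulate.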
Before any tooling, separate the two questions, because they have different answers and different fixes.
Question 1: Is AI fetching my pages? This is a server-side question. AI assistants have crawlers that visit your URLs, parse the content, and return it (or a summary) to whoever asked. Your access logs show every fetch. If you don't see fetches from named AI bots, the agent doesn't have your content — full stop.
Question 2: Is AI citing me in answers? This is a client-side question. Even if AI is fetching you, the model may or may not surface your URL when answering a user's question. Citation is a separate event from retrieval. You measure it by asking the AI a question and seeing if you show up.
The two failure modes are different. If you're not being fetched, the fix is technical — robots.txt, llms.txt, agent-accessibility. If you're being fetched but not cited, the fix is editorial — your content isn't answering the question well enough, or competitors are answering it more clearly.
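For the technical failure mode, a quick first check is whether your own robots.txt is turning AI bots away. This is a minimal sketch using Python's standard `urllib.robotparser`; the `blocked_bots` helper and the sample robots.txt are illustrative, and the standard-library parser's matching rules may differ in edge cases from each vendor's own parser.

```python
from urllib.robotparser import RobotFileParser

# The bot tokens this playbook tracks.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Bytespider", "Amazonbot",
]


def blocked_bots(robots_txt, url, bots=AI_BOTS):
    """Return the bots that the given robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# Hypothetical robots.txt: opts out of one training crawler, allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_bots(sample, "https://example.com/pricing"))
```

To run it against a live site, fetch your own `/robots.txt` and pass the text in; an empty list means none of the listed bots is blocked for that URL.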
Pull the last 30 days of access logs. The format varies (Apache, nginx, Cloudflare, Vercel, Netlify) but every one of them records the User-Agent. Here's the bot taxonomy you should be grepping for:
| UA pattern | Who | What it means |
|---|---|---|
| GPTBot | OpenAI | Training crawler. Fetches pages to potentially include in model training. Not a real-time answer signal. |
| ChatGPT-User | OpenAI | Live fetch. Triggered when a ChatGPT user asks a question and the model decides to browse your URL. |
| OAI-SearchBot | OpenAI | ChatGPT Search index crawler. Real-time-ish — populates the in-product web index. |
| ClaudeBot | Anthropic | Training crawler. |
| Claude-User | Anthropic | Live fetch. Claude is browsing your URL on behalf of a user prompt. |
| Claude-SearchBot | Anthropic | Claude Search index crawler. |
| PerplexityBot | Perplexity | Index crawler. |
| Perplexity-User | Perplexity | Live fetch on behalf of a user. |
| Google-Extended | Google | Gemini training control. Separate token from Googlebot so you can opt out of AI training without losing Search. It is a robots.txt token rather than a request UA, so expect no hits for it in your logs. |
| Bytespider | ByteDance | Doubao / Chinese-market crawlers. Often mistaken for malicious traffic. |
| Amazonbot | Amazon | Alexa+ / Rufus crawler. |
An nginx one-liner to count AI-bot hits by UA across the current and rotated logs (it assumes the default combined log format, and gzipped rotations need zgrep instead of grep):
grep -E 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|Amazonbot' /var/log/nginx/access.log* \
| awk -F'"' '{print $6}' \
| sed -E 's/.*(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|Amazonbot).*/\1/' \
| sort | uniq -c | sort -rn
What you want to see: a healthy mix, with the User-suffixed bots (ChatGPT-User, Claude-User, Perplexity-User) showing up at all. Those are the ones tied to real user prompts in real time. If you only see training crawlers (GPTBot, ClaudeBot) but never the User variants, you're indexed but not being browsed.
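If you want that check as a repeatable script, the counts from the one-liner can be grouped by role. This is a sketch only: the role labels follow the taxonomy table, Bytespider and Amazonbot fall into "other" because the table does not assign them a role, and the function names and sample counts are made up for illustration.

```python
from collections import Counter

# Roles as described in the taxonomy table.
ROLES = {
    "GPTBot": "training", "ClaudeBot": "training", "Google-Extended": "training",
    "OAI-SearchBot": "index", "Claude-SearchBot": "index", "PerplexityBot": "index",
    "ChatGPT-User": "live", "Claude-User": "live", "Perplexity-User": "live",
}


def parse_uniq_output(text):
    """Parse `uniq -c` style lines ('  1523 GPTBot') into {bot: hits}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts


def diagnose(counts):
    """Summarize hits by role and flag the 'indexed but not browsed' case."""
    totals = Counter()
    for bot, hits in counts.items():
        totals[ROLES.get(bot, "other")] += hits
    if totals["live"] > 0:
        return "browsed: live fetches present"
    if totals["training"] + totals["index"] > 0:
        return "indexed but not browsed: crawlers only"
    return "no AI bot traffic found"


# Hypothetical output pasted from the one-liner.
sample_output = """\
   1523 GPTBot
    410 ClaudeBot
     37 Bytespider
"""
print(diagnose(parse_uniq_output(sample_output)))
```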
Counts by bot are useful, but the more actionable view is bot-by-page. Which of your pages are AI assistants actually fetching? Sort the request paths by bot fetch count and look at the top 20. Common patterns:
If the top pages don't match the pages you want surfaced in AI answers, that's a content gap, not a tracking gap. Write the page that the agent is asking for.
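The bot-by-page view can be sketched in a few lines of Python. This assumes the combined log format (request, status, size, referer, and UA in that order); the regex, the `top_pages` helper, and the sample log lines are illustrative rather than taken from any real log. User-Agent strings can also be spoofed, so treat the counts as indicative.

```python
import re
from collections import Counter

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Bytespider", "Amazonbot",
]
BOT_RE = re.compile("|".join(re.escape(bot) for bot in AI_BOTS))

# Request line, status, size, referer, then the quoted User-Agent.
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def top_pages(lines, n=20):
    """Count (bot, path) pairs and return the n most-fetched, highest first."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a combined-format line
        bot = BOT_RE.search(match.group("ua"))
        if bot:
            path = match.group("path").split("?", 1)[0]  # ignore query strings
            hits[(bot.group(0), path)] += 1
    return hits.most_common(n)


# Made-up log lines for illustration.
sample = [
    '203.0.113.7 - - [12/May/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"',
    '203.0.113.8 - - [12/May/2026:10:05:00 +0000] "GET /pricing?ref=x HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"',
    '198.51.100.2 - - [12/May/2026:10:06:00 +0000] "GET /blog/compare HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [12/May/2026:10:07:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"',
]
for (bot, path), count in top_pages(sample):
    print(f"{count:6d}  {bot:16s} {path}")
```

For real logs, pass an open file handle instead of the sample list, since `top_pages` only needs an iterable of lines.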
Logs tell you what's being fetched. They don't tell you whether you're being cited. For that, you have to be the user.
Open ChatGPT, Claude, and Perplexity (signed-out incognito sessions for each — your account history biases results). Run a battery of prompts a buyer in your category would actually type. Record which sources are cited in the response, and where you appear.
A working test set has three tiers:
llms.txt, robots.txt blocking AI bots, or fresh-domain trust issues.
Score each prompt: cited (your URL appears in the source list), mentioned (your brand name appears in the answer but no URL), or absent. Track this monthly. Three months of data shows whether your AI visibility is trending up or down.
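The cited / mentioned / absent scoring is simple enough to encode so every month is scored the same way. A minimal sketch, assuming you paste in the answer text and the list of source URLs by hand; `score_response`, the brand "Acme", and the `acme.example` domain are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def score_response(answer_text, cited_urls, domain, brand):
    """Classify one AI answer as 'cited', 'mentioned', or 'absent'."""
    domain = domain.lower()
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        # Match the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return "cited"
    if brand.lower() in answer_text.lower():
        return "mentioned"
    return "absent"


# Hypothetical results from one monthly run of three prompts.
runs = [
    ("Top picks are Acme and Foo.", ["https://www.acme.example/pricing"]),
    ("Many teams like Acme for this.", ["https://reviews.example/best-tools"]),
    ("Consider Foo or Bar.", ["https://foo.example/"]),
]
tally = Counter(
    score_response(text, urls, "acme.example", "Acme") for text, urls in runs
)
print(dict(tally))
```

Plain substring matching on the brand name will misfire for brands that are also common words, so tighten that check if yours is one.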
You can do this by hand for 20-30 prompts in an afternoon. Past that, automate it — there are tools (Profound, Otterly, AI Brand Rank) that run scheduled prompts against each model and chart your appearance over time.
Once you have the data, the next question is whether what you're seeing is good. Here are the benchmarks I see across Crawlytics customers in mid-2026, split by site size:
| Site size (monthly human pageviews) | Healthy AI-bot fetches/month | Healthy bot-to-human ratio |
|---|---|---|
| Under 10k | 200 - 1,000 | 1:30 to 1:50 |
| 10k - 100k | 2,000 - 15,000 | 1:25 to 1:50 |
| 100k - 1M | 15,000 - 200,000 | 1:10 to 1:30 |
| 1M+ | 200,000+ | 1:5 to 1:20 |
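
To compare your own numbers against the table, the bands can be written down as data. This sketch copies the fetch ranges from the table as-is; the function names are hypothetical, and the ratio is read as one bot fetch per N human pageviews.

```python
# (pageview ceiling, low fetches, high fetches), copied from the benchmark table.
BANDS = [
    (10_000, 200, 1_000),
    (100_000, 2_000, 15_000),
    (1_000_000, 15_000, 200_000),
    (float("inf"), 200_000, None),  # the 1M+ row has no upper bound
]


def assess_fetch_volume(human_pageviews, bot_fetches):
    """Return 'below', 'within', or 'above' for the band matching the site size."""
    for ceiling, low, high in BANDS:
        if human_pageviews < ceiling:
            if bot_fetches < low:
                return "below"
            if high is not None and bot_fetches > high:
                return "above"
            return "within"


def bot_to_human_ratio(human_pageviews, bot_fetches):
    """Express the ratio as '1:N' (one bot fetch per N human pageviews)."""
    if bot_fetches == 0:
        return "no bot fetches"
    return f"1:{round(human_pageviews / bot_fetches)}"


# Hypothetical site: 50k monthly pageviews, 2,000 AI-bot fetches.
print(assess_fetch_volume(50_000, 2_000), bot_to_human_ratio(50_000, 2_000))
```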
Two failure modes to watch for:
llms.txt exists, check whether your sitemap.xml is up to date and discoverable.
Bots fetching your page isn't the end of the funnel. Some percentage of the users who saw your citation in ChatGPT will click through and visit your site. That visit is where revenue happens, and it's where most analytics stacks go blind.
The reason: ChatGPT, Claude, and Perplexity all open citations in an in-app browser that strips the Referer header. Google Analytics sees the visit as (direct) / (none) and you have no idea it came from AI. We covered the full mechanics and the fix in our piece on why ChatGPT traffic shows as direct in Google Analytics — the short version is you have to inject UTM parameters into the URLs that AI assistants fetch, before they fetch them.
Detection and attribution work together. Detection tells you whether you're showing up. Attribution tells you whether the visits convert. Without both, you're flying half-blind.
Three real-shape benchmarks from Crawlytics customer cohorts (anonymized):
None of these are "industry averages" — your mileage will vary. They're shapes to compare against. If yours is dramatically lower at a given site size, dig in.
The whole playbook above works without any paid tool for a site under ~10k monthly pageviews. Past that scale, three things break:
At that point you want a dashboard that does both halves — per-bot fetch counts AND scheduled prompt-tests with citation tracking — and ties them to the same per-page rollup. That's what Crawlytics does. The Visibility tier ($29.99/mo) covers fetch detection plus llms.txt generation; citation tracking is on the roadmap for the next tier.
If you're not at that scale yet, the grep-and-prompt-test loop is more than enough. Run it once a month. Track the trend. The day you can't keep up by hand is the day to graduate to a tool.
Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.