AI Share of Voice Is a Made-Up Number — Measure This Instead

Summary

AI share-of-voice scores divide citations by a denominator nobody can audit. Server-side bot-log data offers an auditable alternative: three AI visibility metrics you can actually defend to leadership.


AI share of voice is a percentage built on a denominator nobody can audit. A vendor runs a few hundred prompts of its own choosing against ChatGPT, Perplexity, and Gemini, counts how often you appear, and hands you a score. The score looks like the share-of-voice metrics search marketers have trusted for twenty years. It is not the same kind of number, and treating it like one is how teams end up reporting noise to their board.

The number your vendor gave you

Here is how this usually starts. Leadership reads that AI assistants are eating search referrals and asks the obvious question: how visible are we in AI search? Someone signs up for a visibility tool, and within a week there is a number. "We have 14% AI share of voice in our category, up 3 points this quarter." It goes in the deck. Everyone relaxes.

Then the next model version ships and the number drops 6 points. Nobody changed the website. Nobody can explain the drop, because nobody can see inside the number. The prompts behind it are the vendor's, the weighting is the vendor's, and the universe being sampled belongs to no one at all.

If you have lived some version of this, the discomfort you felt is correct. The metric has a structural problem, not a tuning problem.

The infinite-denominator problem

Dan Taylor, head of technical SEO at SALT.agency, laid the problem out cleanly in a Search Engine Land piece on AI share of voice, and his framing is worth quoting because it names what most marketers only half-sense.

Classic share of voice worked because it had a fixed, transparent denominator: a keyword list. If a tool said you held 22% visibility, you could open the list, check the rankings, and audit every input behind the percentage. The math was boring, which is exactly what you want from a metric.

AI platforms break that model. As Taylor puts it, the universe of possible user prompts is effectively infinite. People do not query an LLM with "best crm small business." They write three paragraphs about their team size, their budget, their Salesforce trauma, and the integration they cannot live without. No two of those prompts are alike, and the response to each one is generated fresh.

So vendors do the only thing they can: pick a small, static subset of prompts and sample it. The sampling itself is reasonable. Presenting the result as your share of an open market is not, because a few hundred curated prompts standing in for an infinite space is a contrived environment dressed up as a census. Change the sample and the score changes. You never chose the sample, and in most tools you cannot fully inspect it.
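Even before the denominator question, a small sample carries ordinary sampling noise. The sketch below is a rough illustration, not anything a vendor publishes: it assumes a hypothetical 300-prompt set and treats each prompt as an independent random draw, which a curated prompt list is not, so the real uncertainty is likely larger. The function name `sampling_margin` and both input numbers are assumptions for illustration.

```python
import math

def sampling_margin(share: float, n_prompts: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a share observed on n independent prompts."""
    return z * math.sqrt(share * (1 - share) / n_prompts)

# Hypothetical inputs: a 14% score measured on a 300-prompt sample.
margin = sampling_margin(0.14, 300)
print(f"14% +/- {100 * margin:.1f} points")
```

Under those assumptions the margin comes out near 4 points, which means a 3-point quarterly move of the kind that ends up in a deck can sit entirely inside the noise.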

The black-box problem: scores that break on model updates

The denominator problem would be survivable if the scores were at least stable. They are not, and the late-2025 model cycle proved it. Taylor points to what happened when OpenAI rolled out its next-generation ChatGPT model: the platform-wide volume of outbound citations and source links dropped, and visibility scores fell across entire categories at once. Brands that changed nothing watched their numbers slide because the model became stingier with links.

That is the audit failure in miniature. When a score moves, you need to answer one question: did we change, or did the instrument change? With prompt-sampled share of voice you usually cannot tell. The model updated, the vendor's prompt set rotated, the sampling temperature varied, or your content actually got worse. All four produce the same downward line on the chart, and the chart will not tell you which.

A metric you cannot decompose is a metric you cannot defend. The first skeptical question from a CFO ("why did this drop?") has no checkable answer, and metrics that survive in board decks are the ones with checkable answers.

To be fair to the tools: the better ones, Profound among them, are upfront that they measure a curated prompt set, and at enterprise scale a consistently run sample has real directional value. The dishonesty is rarely the vendor's. It creeps in at the slide where a sampled estimate becomes "our share of the AI market," stated as fact.

The honest signal: what AI crawlers actually did

Taylor's proposed fix is to replace share of voice with sharper prompt-side questions: share of mentions, share of recommendations, share of narrative. Those are better questions, and worth asking. But they still live inside the sampled-prompt world, so they inherit the same denominator. I want to push one step further, to the only dataset in this whole category that is not synthetic.

Your server logs.

Every time GPTBot, ClaudeBot, PerplexityBot, or OAI-SearchBot fetches one of your pages, your server records it: which bot, which URL, what time, what response code. Nothing is sampled and nothing is simulated. The denominator is your own page inventory, a finite list you control completely. That single property fixes both problems at once. The infinite prompt universe is gone because you are no longer measuring prompts, and the black box is gone because every row in the dataset can be pulled up and inspected.
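To make the "every row can be inspected" point concrete, here is a minimal sketch of pulling AI-bot fetches out of an access log. It assumes the common "combined" log layout used by nginx and Apache; your server's format may differ, and the names `BotFetch` and `parse_bot_fetch` are my own for illustration. It matches on user-agent tokens only, which does not by itself verify that a request really came from the named operator.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# User-agent tokens named in this article; extend the tuple as needed.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")

# Assumption: "combined" access-log layout (nginx / Apache default style).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

class BotFetch(NamedTuple):
    bot: str
    path: str
    time: datetime
    status: int

def parse_bot_fetch(line: str) -> Optional[BotFetch]:
    """Return a BotFetch if the line is a request from a listed AI bot, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    agent = match["agent"].lower()
    bot = next((name for name in AI_BOTS if name.lower() in agent), None)
    if bot is None:
        return None  # a human browser or an unlisted bot
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return BotFetch(bot, match["path"], when, int(match["status"]))

# Hypothetical log line for illustration.
sample = (
    '203.0.113.7 - - [05/Mar/2025:10:15:32 +0000] "GET /pricing HTTP/1.1" '
    '200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"'
)
print(parse_bot_fetch(sample))
```

Each parsed record carries the four fields the paragraph above lists: which bot, which URL, what time, what response code.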

Crawl data answers a different question than share of voice, and it is worth being precise about that. It does not tell you what fraction of AI answers mention you. It tells you whether the models' retrieval infrastructure is actively reading you: which pages, which bots, how often, and trending which way. That is an upstream signal. An assistant cannot cite from a live fetch of a page its crawler never visits, so coverage is the precondition for everything the prompt-side tools are trying to estimate. Upstream and auditable beats downstream and synthetic when the audience is a skeptical executive.

Three metrics you can audit

Here is the reporting framework I would put in front of leadership instead. Three numbers, each with a fixed denominator and a paper trail.

1. AI bot crawl coverage

Of your highest-value pages (pricing, product, top category and conversion pages), what percentage was fetched by at least one major AI crawler in the last 30 days? Start with a defined list of 20 to 50 URLs so the denominator is explicit, then break the result out per bot: GPTBot and OAI-SearchBot for the OpenAI ecosystem, ClaudeBot for Anthropic, PerplexityBot for Perplexity. "GPTBot covered 38 of our 50 money pages last month, up from 29" is a sentence every word of which can be verified against a log line. No share-of-voice score can make that claim.
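A coverage figure like that can be computed in a few lines once the log has been reduced to bot fetches. This is a sketch under stated assumptions: the fetches are already parsed into `(bot, path, time)` tuples, the tracked list and all sample data are hypothetical, and `crawl_coverage` is an illustrative name rather than any tool's API.

```python
from datetime import datetime, timedelta, timezone

def crawl_coverage(fetches, tracked_pages, now, window_days=30):
    """Per bot, return the set of tracked pages fetched within the window.

    fetches: iterable of (bot, path, time) tuples with timezone-aware times.
    """
    cutoff = now - timedelta(days=window_days)
    tracked = set(tracked_pages)
    covered = {}
    for bot, path, when in fetches:
        if cutoff <= when <= now and path in tracked:
            covered.setdefault(bot, set()).add(path)
    return covered

# Hypothetical inputs for illustration only.
utc = timezone.utc
now = datetime(2025, 3, 31, tzinfo=utc)
tracked_pages = ["/pricing", "/product", "/compare", "/signup"]
fetches = [
    ("GPTBot", "/pricing", datetime(2025, 3, 20, tzinfo=utc)),
    ("GPTBot", "/product", datetime(2025, 3, 25, tzinfo=utc)),
    ("GPTBot", "/blog/old-post", datetime(2025, 3, 26, tzinfo=utc)),  # not tracked
    ("ClaudeBot", "/pricing", datetime(2025, 1, 2, tzinfo=utc)),  # outside window
]

covered = crawl_coverage(fetches, tracked_pages, now)
for bot in ("GPTBot", "ClaudeBot"):
    pages = covered.get(bot, set())
    print(f"{bot}: {len(pages)} of {len(tracked_pages)} tracked pages")

# Pages reached by at least one bot, as a share of the fixed list.
any_bot = set().union(*covered.values())
print(f"Any bot: {100 * len(any_bot) / len(tracked_pages):.0f}%")
```

The denominator is the explicit `tracked_pages` list, so anyone can check the percentage by hand. A reasonable refinement is to count only fetches that returned a 2xx status, since a crawler hitting a 404 has not really read the page.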

2. Crawl velocity after content changes

When you publish or substantially update a page, how quickly do AI crawlers re-fetch it, and does fetch frequency rise afterward? This is the closest log-side analog to rank tracking. If you rewrote your flagship comparison page on the 3rd and ClaudeBot, after months of silence, re-fetched it on the 5th and then weekly, your content investment registered with the systems that feed AI answers. If nothing re-crawls for six weeks, you have learned something a prompt sample would never show you: the models are answering questions about you from stale data.
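Velocity reduces to two measurements per page and bot: the delay until the first re-fetch, and the fetch count before versus after the update. The sketch below assumes you already have one bot's fetch timestamps for one page; `crawl_velocity`, the 28-day comparison window, and the dates are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

def crawl_velocity(fetch_times, updated_at, window_days=28):
    """Summarize one bot's fetches of one page around a content update.

    Returns (delay until the first re-fetch or None, fetch count in the
    window before the update, fetch count in the window after it).
    """
    window = timedelta(days=window_days)
    after = sorted(t for t in fetch_times if t >= updated_at)
    first_refetch = after[0] - updated_at if after else None
    before_count = sum(1 for t in fetch_times if updated_at - window <= t < updated_at)
    after_count = sum(1 for t in after if t < updated_at + window)
    return first_refetch, before_count, after_count

# Hypothetical data mirroring the example above: page updated on the 3rd,
# re-fetched on the 5th, then weekly, with no fetches beforehand.
updated_at = datetime(2025, 3, 3)
fetch_times = [datetime(2025, 3, day) for day in (5, 12, 19, 26)]

delay, before, after = crawl_velocity(fetch_times, updated_at)
print(f"First re-fetch after {delay.days} days; {before} fetches before, {after} after")
```

A `None` delay is itself the finding: nothing has re-crawled the page since the update.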

3. llms.txt discoverability of your highest-value pages

Two checks. First, does your llms.txt exist and list the pages you actually want AI systems to read, rather than whatever your sitemap happened to contain? Second, do the logs show bots fetching the file and then reaching the pages it points to? An llms.txt that lists your top 20 pages while crawlers only ever touch your blog archive is a routing problem you can see and fix. The standard analytics stack misses all of this, because Google Analytics (GA) records a hit only when a browser executes its JavaScript tag, and crawler fetches generally do not, so they never appear there.
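Both checks can be run as a set comparison. This sketch assumes the llms.txt follows the proposed Markdown layout with `[title](url)` links, that the file lives at `/llms.txt`, and that you have a list of paths AI bots fetched; the file contents, paths, and the `discoverability` helper are hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches Markdown links of the form [title](url).
LINK = re.compile(r"\[[^\]]*\]\((?P<url>[^)\s]+)\)")

def listed_paths(llms_txt: str) -> set:
    """Extract URL paths from the Markdown links in an llms.txt body."""
    return {urlparse(m["url"]).path or "/" for m in LINK.finditer(llms_txt)}

def discoverability(llms_txt, fetched_paths, priority_pages):
    """Compare what llms.txt lists, what bots fetched, and what you care about."""
    fetched = set(fetched_paths)
    listed = listed_paths(llms_txt)
    return {
        "file_fetched": "/llms.txt" in fetched,
        "priority_not_listed": sorted(set(priority_pages) - listed),
        "listed_never_fetched": sorted(listed - fetched),
    }

# Hypothetical llms.txt and log-derived paths, for illustration only.
llms_txt = """# Example Store

> Hypothetical llms.txt used only for this sketch.

## Key pages

- [Pricing](https://example.com/pricing): plans and tiers
- [Product](https://example.com/product): feature overview
"""
fetched_paths = ["/llms.txt", "/pricing", "/blog/archive"]
priority_pages = ["/pricing", "/product", "/compare"]

report = discoverability(llms_txt, fetched_paths, priority_pages)
print(report)
```

Pages under `priority_not_listed` are a content fix in the file; pages under `listed_never_fetched` are the routing problem described above.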

Report all three monthly. Coverage tells you breadth, velocity tells you responsiveness, discoverability tells you whether your routing layer works. Together they form a visibility report where every figure survives the "show me" question. If you also want the downstream confirmation that fetches turn into citations, citation tracking layers on cleanly. And if you want a free baseline before building any of this, the Agent-Ready Grader scores your llms.txt, robots, and agent-readiness in about a minute.

Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.

Frequently Asked Questions

Is prompt sampling completely useless?

No. A consistently run prompt set has real directional value, especially for qualitative questions log data cannot touch: how models describe your brand, which competitors get co-mentioned, whether you appear in recommendation-style answers. Enterprise tools that run hundreds of curated prompts daily produce a usable trend line as long as the methodology stays constant. The failure mode is treating the output as a market-share fact rather than an estimate from a sample you cannot audit. Use prompt data as a supplement and a source of language insights. Do not make it the headline number you stake credibility on, because you cannot explain its movements.

How is crawl data a proxy for AI visibility?

Crawl data measures the precondition for visibility rather than visibility itself. AI assistants that browse or retrieve at answer time can only cite from pages their crawlers fetch, so coverage and crawl frequency tell you whether you are inside the retrieval loop at all. If no AI bot has touched a page in 90 days, any answer about that page is coming from training-era memory, if the page figures in answers at all. The honest caveat: a fetch does not guarantee a citation, and models also carry knowledge from training data your logs never see. That is why crawl metrics pair well with periodic citation spot-checks. Logs tell you whether you are being read; spot-checks confirm whether being read turns into being cited.

Can I defend log-based metrics to leadership?

Yes, and more easily than any prompt-sampled score, because every number decomposes into verifiable records. If an executive asks why coverage dropped from 76% to 60%, you can name the eight pages that lost crawls and investigate each one. Frame it in three sentences: prompt-based share of voice samples an infinite space with a denominator we cannot inspect, and it swings on every model update. Crawl coverage measures a fixed list of our own pages against logged requests from real AI systems. One of these survives an audit. Finance teams in particular take to this quickly, since it mirrors how they already think about measurement: observed transactions over modeled estimates.
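The "name the eight pages" step is a set difference between two reporting periods. The sketch below assumes a 50-page tracked list, which is what makes 76% and 60% correspond to 38 and 30 pages; the page paths and the `coverage_change` helper are hypothetical.

```python
def coverage_change(previous, current, tracked_pages):
    """Decompose a coverage change into the specific pages gained and lost."""
    tracked = set(tracked_pages)
    prev = set(previous) & tracked
    curr = set(current) & tracked
    return {
        "previous_pct": 100 * len(prev) / len(tracked),
        "current_pct": 100 * len(curr) / len(tracked),
        "lost": sorted(prev - curr),
        "gained": sorted(curr - prev),
    }

# Hypothetical 50-page list: 38 pages covered last month, 30 this month.
tracked_pages = [f"/page-{i:02d}" for i in range(50)]
previous = tracked_pages[:38]
current = tracked_pages[:30]

change = coverage_change(previous, current, tracked_pages)
print(f"{change['previous_pct']:.0f}% -> {change['current_pct']:.0f}%")
print("Lost crawls:", change["lost"])
```

The `lost` list is the answer to the executive's question: a set of URLs, each of which can be checked against the log.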


This page is part of Crawlytics.app.