A step-by-step AI search visibility audit: check robots.txt, server HTML, AI bot access, schema, and attribution by hand, then score all five in 30 seconds.
Quick answer
To audit your AI search visibility, check five dimensions in order. (1) Discoverability: open /robots.txt, /sitemap.xml, and /llms.txt in a browser and confirm they exist and load. (2) Content accessibility: use View Source (not Inspect) and search for a sentence from your page; if it is missing, your content renders only in JavaScript and most AI fetchers cannot see it. (3) Bot access control: read your robots.txt and note which AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot) are allowed or blocked. (4) Protocol discovery: check for JSON-LD schema and a llms.txt file. (5) Attribution: confirm you can tell when AI assistants send you traffic. You can do all of this by hand in 15 minutes, or run the free Agent-Ready grader at /agent-ready for a ~30-second scan that scores the same five categories and sorts the fixes by severity.
Most "is my site AI-ready" advice skips the part where you actually look. So here is the look: a five-dimension AI search visibility audit you can run by hand with a browser and three URLs, plus the exact thing to check inside each one. I will walk through each dimension, show you the DIY method, then point you at the fast way at the end.
The reason this matters more than a generic SEO checklist: AI assistants fetch your pages differently than Googlebot does, they obey different rules in your robots.txt, and they fail in ways that are invisible from a normal browser. A page can rank fine in Google and be completely unreadable to ChatGPT. The audit below is built to catch exactly those gaps.
An AI search visibility audit checks five dimensions, in roughly the order an AI crawler experiences them. These are the same five categories the free Crawlytics grader scores, and they map to the lifecycle of a bot visiting your site: can it find you, can it read you, are you letting it in, do you speak its protocol, and can you measure the result.
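(1) Discoverability: robots.txt, your XML sitemap, and your llms.txt, the files a well-behaved bot checks first to learn what exists and where.
(2) Content accessibility: is your content in the server HTML, readable without running JavaScript?
(3) Bot access control: which AI crawlers does your robots.txt allow or block?
(4) Protocol discovery: do you publish structured data and the llms.txt standard that tells agents what your site is and where the important content lives?
(5) Attribution: can you tell when AI assistants send you traffic?
Score each one and you have a complete picture. Below, each dimension gets a hands-on method. None of it requires a tool to start.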
Start by confirming that the three files AI crawlers look for first actually exist and load. Open each of these in a browser, replacing the domain with yours:
https://yourdomain.com/robots.txt
https://yourdomain.com/sitemap.xml
https://yourdomain.com/llms.txt
robots.txt should load as plain text and contain a line pointing to your sitemap (Sitemap: https://yourdomain.com/sitemap.xml). If it 404s, bots fall back to crawling links and you lose control over what they fetch. If it exists but blocks everything with Disallow: / under a wildcard User-agent: *, you have a much bigger problem, covered in Step 3.
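The two robots.txt checks in this step (is there a Sitemap line, and does the wildcard group block everything) can also be scripted. This is a minimal sketch using Python's standard urllib.robotparser; the function name summarize_robots and the sample file are illustrative assumptions, and the parser's matching is an approximation of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser


def summarize_robots(robots_txt: str) -> dict:
    """Report declared sitemaps and whether the wildcard group blocks the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # site_maps() returns None when there is no Sitemap: line (Python 3.8+).
        "sitemaps": parser.site_maps() or [],
        # Asking as "*" falls through to the User-agent: * group, if one exists.
        "wildcard_blocks_root": not parser.can_fetch("*", "/"),
    }


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""

print(summarize_robots(SAMPLE))
```

Paste the text of your own /robots.txt in place of SAMPLE. An empty sitemaps list or wildcard_blocks_root set to True are the two results worth writing down.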
sitemap.xml should load as XML and list your real, canonical URLs. Spot-check that the URLs are live (not redirects or 404s) and that your important pages are present. A sitemap that lists 12 URLs when you have 400 pages is a discoverability gap; a sitemap full of dead links wastes crawl budget and trains bots to ignore it. If your sitemap is an index file pointing to sub-sitemaps, follow one or two down and confirm they resolve.
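If the sitemap is too long to eyeball, a short script can tell you whether it is an index or a plain URL list and how many locations it declares. A minimal sketch with the standard library; read_sitemap and the sample XML are illustrative assumptions, and checking that each URL is actually live is left to you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def read_sitemap(xml_text: str) -> dict:
    """Return the sitemap kind ('index' or 'urlset') and every <loc> it lists."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return {"kind": kind, "locs": locs}


# Hypothetical sitemap, for illustration only.
URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/</loc></url>
  <url><loc>https://yourdomain.com/pricing</loc></url>
</urlset>"""

summary = read_sitemap(URLSET)
print(summary["kind"], len(summary["locs"]))
```

Compare the count against the number of pages you know you have. When kind is "index", each location is itself a sub-sitemap to run through the same function.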
llms.txt is the newest of the three and the one most sites are missing entirely. It is a Markdown file at the root of your domain that gives AI agents a curated map of your site: what it is, and links to your most important pages in a clean, token-efficient form. If it 404s, you score zero on this part, and it is one of the easiest high-impact gaps to close because almost no one has it yet. We have a full guide to what llms.txt is and how to write one, and a deeper reference in the llms.txt resource hub.
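For a rough sanity check of an existing llms.txt, the sketch below looks for the shape the llms.txt proposal describes as I understand it: a top-level title, then sections containing Markdown links. The function inspect_llms_txt and the sample file are illustrative assumptions, and this is a loose structural check, not a validator.

```python
import re

# Matches Markdown links of the form [label](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def inspect_llms_txt(text: str) -> dict:
    """Loosely check an llms.txt body: title line, section headings, and links."""
    lines = [line.rstrip() for line in text.splitlines()]
    non_empty = [line for line in lines if line.strip()]
    return {
        # The first non-empty line is expected to be a "# Title" heading.
        "has_title": bool(non_empty) and non_empty[0].startswith("# "),
        "sections": [line[3:].strip() for line in lines if line.startswith("## ")],
        "links": [url for _label, url in LINK.findall(text)],
    }


# Hypothetical llms.txt content, for illustration only.
SAMPLE_LLMS = """\
# Example Site

> One-sentence summary of what the site offers.

## Docs

- [Getting started](https://yourdomain.com/docs/start): Setup guide
- [Pricing](https://yourdomain.com/pricing)
"""

print(inspect_llms_txt(SAMPLE_LLMS))
```

A file with no title, no sections, or no links is probably not doing the job of a curated map, even if it loads.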
The discoverability check takes about three minutes and tells you whether AI crawlers can even begin. If all three files exist and resolve cleanly, you pass this dimension. If any are missing, note it and move on — you now have your first fix.
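If you would rather script the three-file check than open each URL, something like the following works as a starting point. It is a sketch under assumptions: check_discovery, http_status, and the User-Agent value are names I made up for illustration, and a 200 status only tells you the file loads, not that its contents are sensible.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.request import Request, urlopen

DISCOVERY_PATHS = ("/robots.txt", "/sitemap.xml", "/llms.txt")


def http_status(url: str) -> int:
    """Return the HTTP status for url, or 0 if the request fails outright."""
    request = Request(url, headers={"User-Agent": "visibility-audit-sketch"})
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 when the file is missing
    except OSError:
        return 0  # DNS failure, timeout, refused connection


def check_discovery(base_url: str, status_of: Callable[[str], int] = http_status) -> dict:
    """Map each discovery path to True when it answers with HTTP 200."""
    base = base_url.rstrip("/")
    return {path: status_of(base + path) == 200 for path in DISCOVERY_PATHS}
```

Calling check_discovery("https://yourdomain.com") performs three live requests. The status_of parameter exists so the function can be exercised with a stand-in instead of the network.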
The most important and most commonly failed check in the entire audit is whether your content exists in the server HTML. Here is why it matters: when ChatGPT, Perplexity, or Claude fetch a page, they request the raw HTML the way curl would. Most AI fetchers do not run JavaScript. If your page is a single-page app or relies on client-side rendering to paint the text, the bot receives a near-empty shell, and your carefully written content is invisible to it.
To check this by hand, do not use "Inspect" — Inspect shows you the live DOM after JavaScript has run, which hides the problem. Instead:
(1) Open the page in your browser and view its source (Ctrl/Cmd + U). This shows the raw HTML the server sent, before any JavaScript executes.
(2) Press Ctrl/Cmd + F to search that source for a distinctive sentence from your page body.
If you find the sentence, your content is server-rendered and AI fetchers can read it. If the source is mostly <script> tags and your text is nowhere to be found, your content is client-rendered and you have a serious accessibility gap. For a harder confirmation, run curl -s https://yourdomain.com/your-page | grep "your sentence" from a terminal. If grep prints nothing, the sentence is not in the server HTML, and a fetcher that does not run JavaScript will not see it either.
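The same check is easy to repeat across many pages in code. Here is a minimal sketch of the comparison itself, assuming you already have the raw response body as a string; sentence_in_html and both sample documents are illustrative. It will miss sentences that are split by inline tags or written with HTML entities, so pick a plain sentence.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping does not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def sentence_in_html(raw_html: str, sentence: str) -> bool:
    """True if the sentence appears in the HTML exactly as the server sent it."""
    return normalize(sentence) in normalize(raw_html)


# Hypothetical responses: one server-rendered page, one client-rendered shell.
SERVER_RENDERED = "<html><body><article><p>Our audit checks\n five dimensions.</p></article></body></html>"
SPA_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(sentence_in_html(SERVER_RENDERED, "Our audit checks five dimensions."))
print(sentence_in_html(SPA_SHELL, "Our audit checks five dimensions."))
```

The raw_html argument has to be the unrendered response, for example the output of curl or urllib.request, never the DOM copied out of Inspect.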
The second half of this dimension is whether a clean, machine-readable version of each page is available. Real article pages are wrapped in navigation, cookie banners, ads, and sidebar clutter that dilute the signal an AI extracts. Serving bots a stripped Markdown or clean-HTML version of the page raises the odds your actual content gets parsed and cited. This is exactly what the Crawlytics serving layer does — it returns AI-optimized HTML and per-page Markdown to AI fetchers while leaving the human-facing site untouched. The audit step is simpler: just confirm your real content is reachable without JavaScript first. For the citation-side of this, see how to get cited by ChatGPT.
Go back to your robots.txt and read it carefully, because this is where good intentions go wrong. The question to answer: which AI crawlers can reach your site, and is that what you actually want?
The major AI user-agents to look for, by name, include GPTBot, ClaudeBot, PerplexityBot, and OAI-SearchBot.
Read each User-agent block and its Disallow rules. A common, costly mistake is copying a robots.txt that blocks GPTBot and friends "to protect content," and thereby blocking the exact crawlers that feed AI citations you want. The opposite mistake is allowing everyone and never checking who actually shows up. There is a real strategic decision here between training crawlers, which use your content to train models with no direct traffic back, and search or user crawlers, which fetch your page to answer a live question and can cite you. Our decision guide on blocking GPTBot walks through that trade-off, and the AI bots reference list catalogs every major crawler with its purpose and user-agent string so you can read your robots.txt against the real names.
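Reading the blocks by eye gets error-prone once a robots.txt has more than a few groups. As a cross-check, this sketch asks Python's standard urllib.robotparser whether each named crawler may fetch a given path. The function ai_bot_access and the sample file are illustrative assumptions, and the standard-library parser approximates the rules; individual crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The four crawlers named in this audit; extend the tuple as needed.
AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")


def ai_bot_access(robots_txt: str, url: str = "/") -> dict:
    """Map each AI user-agent to True (allowed) or False (blocked) for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}


# Hypothetical robots.txt that blocks one crawler and allows the rest.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

for agent, allowed in ai_bot_access(SAMPLE_ROBOTS).items():
    print(agent, "allowed" if allowed else "blocked")
```

Run it against your own file and compare the result with what you intended to declare. Any surprise in that table is the finding.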
The honest limit of the DIY method: robots.txt tells you what you declared, not who actually crawled you. To know which bots really hit your server and how often, you need server-log analysis — which is the gap between auditing your config and measuring your reality. More on that under the fast way below.
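If you do have access to raw access logs, a first pass at "who actually showed up" can be as simple as counting user-agent tokens. This is a rough sketch: count_ai_hits is an illustrative name, the sample lines are made up in a generic access-log shape, and user-agent strings can be spoofed, so treat the counts as a starting point.

```python
from collections import Counter

AI_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")


def count_ai_hits(log_lines) -> Counter:
    """Count log lines whose text mentions a known AI crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to at most one crawler
    return hits


# Hypothetical log lines with simplified user-agent strings, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; GPTBot)"',
    '203.0.113.5 - - [01/Jan/2026:00:00:02 +0000] "GET /docs HTTP/1.1" 200 900 "-" "ExampleAgent (compatible; GPTBot)"',
    '198.51.100.7 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 300 "-" "ExampleAgent (compatible; ClaudeBot)"',
    '192.0.2.9 - - [01/Jan/2026:00:00:04 +0000] "GET / HTTP/1.1" 200 300 "-" "Mozilla/5.0 (ordinary browser)"',
]

print(count_ai_hits(SAMPLE_LOG))
```

Feeding it an open file object works too, since it only iterates over lines.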
Protocol Discovery asks whether you publish the machine-readable signals AI systems look for. Two are easy to audit by hand today; a third is emerging.
Structured data (JSON-LD schema). View your page source and search for application/ld+json. You are looking for schema that names your entity, marks up articles, products, or FAQs, and gives AI systems unambiguous facts to extract instead of inferring them from prose. The fastest external check is Google's Rich Results Test, which parses and validates the JSON-LD. Missing or broken schema is a soft failure — your content can still be read — but present, valid schema raises the odds of accurate citation. See schema markup for AI search for which types matter most.
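To go one step past searching for application/ld+json, you can pull each block out of the page source and confirm it parses as JSON. A minimal sketch with the standard library; JsonLdExtractor, extract_json_ld, and the sample page are illustrative assumptions, and parsing cleanly says nothing about whether the schema types are the right ones.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []  # successfully parsed JSON-LD values
        self.errors = []  # raw text of blocks that are not valid JSON

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            raw = "".join(self._buffer)
            try:
                self.blocks.append(json.loads(raw))
            except json.JSONDecodeError:
                self.errors.append(raw)


def extract_json_ld(html: str):
    """Return (parsed_blocks, unparseable_blocks) found in the HTML source."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks, parser.errors


# Hypothetical page source, for illustration only.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
</head><body></body></html>"""

blocks, errors = extract_json_ld(SAMPLE_PAGE)
print([block.get("@type") for block in blocks], len(errors))
```

An empty blocks list means no schema reached the server HTML. Anything in errors is markup that is present but broken.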
llms.txt protocol file. This overlaps with Step 1 but counts here too: publishing a valid llms.txt is itself a protocol signal that you are speaking the agentic web's emerging language. It is the clearest single signal that a site has thought about AI agents at all.
WebMCP (emerging, not yet a hand-check for most sites). WebMCP is a draft standard that lets in-browser AI agents call structured actions on your site — think "let an agent add to cart and check out" rather than just read your page. Be clear-eyed about its status: it is a draft spec, no browser ships it enabled by default as of June 2026 (Chromium offers it behind a flag and origin trial; WebKit/Safari opposes it), and the real set of agents that can invoke it today is Perplexity's Comet browser plus some extensions and custom agents. WebMCP is a forward-looking line item on an audit, not a pass/fail you will hit on most sites yet. If you sell online and want to be ready for agent-driven transactions, it is worth understanding — see WebMCP explained — but do not let it distract from getting Steps 1 through 3 right first.
Attribution. The final dimension: when an AI assistant cites you and someone clicks through, can you tell? Open Google Analytics and look at your "(direct) / (none)" traffic. A chunk of it is almost certainly AI referrals — assistants and in-app browsers frequently strip the referrer, so a real click from a ChatGPT citation lands as "direct." Tagging the links AI fetchers see with UTM parameters can recover some of this signal, though it is best-effort: some assistants strip query parameters too, so treat recovered AI attribution as a strong directional signal rather than exact truth. The point of the audit step is simply to confirm whether you have any way to separate AI-driven visits from genuinely direct ones today.
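The tagging itself is mechanical. Here is a minimal sketch of adding UTM parameters to a link and reading them back on arrival; add_utm, utm_source_of, and the parameter values "chatgpt" and "ai-assistant" are placeholders I chose for illustration, not a convention any assistant or analytics tool requires.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url: str, source: str, medium: str = "ai-assistant") -> str:
    """Append utm_source and utm_medium unless the URL already carries them."""
    parts = urlsplit(url)
    # Note: converting to a dict keeps only the last value of a repeated key.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.setdefault("utm_source", source)
    query.setdefault("utm_medium", medium)
    return urlunsplit(parts._replace(query=urlencode(query)))


def utm_source_of(url: str) -> Optional[str]:
    """Return the utm_source value of an incoming URL, or None if absent."""
    return dict(parse_qsl(urlsplit(url).query)).get("utm_source")


tagged = add_utm("https://yourdomain.com/pricing?plan=pro", "chatgpt")
print(tagged)
print(utm_source_of(tagged))
```

A landing URL where utm_source_of returns None is indistinguishable from genuinely direct traffic, which is exactly the best-effort limit described above.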
Once you understand the five dimensions, you do not need to re-run the manual checks every time. The free Agent-Ready grader does the whole audit as a real live scan in about 30 seconds. It fetches your robots.txt, llms.txt, llms-full.txt, sitemap.xml, and homepage in parallel, runs roughly 25 checks across the exact five categories above, and returns an overall 0-100 score, a breakdown for each of the five categories, and the top findings sorted by severity.
The public result is ungated: you get the score, the five category breakdowns, and the top findings without signing up. There is also a per-domain URL at /agent-ready/<your-domain> and an embeddable badge if you want to show the score on your own site.
How to read your score. Treat the composite as triage, not a grade. Below 60 almost always means a structural problem — blocked AI crawlers or content that only renders in JavaScript — and those are the things to fix first because they make every other effort moot. The 60s and 70s usually mean fixable gaps like a missing llms.txt or thin schema. 80 and above is genuinely strong. Read the category scores before the overall number: a single zero in Protocol Discovery (no llms.txt) tanks the average but is a fast fix, whereas a low Content Accessibility score signals a rendering change that takes real engineering.
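The reading rules above translate into a few lines of code if you track scores over time. This sketch encodes only the bands stated in this article (below 60, the 60s and 70s, 80 and above); triage_band, weakest_first, and the sample scores are illustrative assumptions and say nothing about how the grader computes its composite.

```python
def triage_band(score: float) -> str:
    """Map a 0-100 composite score to the triage bands described above."""
    if score >= 80:
        return "strong"
    if score >= 60:
        return "fixable gaps"
    return "likely structural problem"


def weakest_first(category_scores: dict) -> list:
    """Order categories from lowest to highest score, so you read them before the total."""
    return sorted(category_scores.items(), key=lambda item: item[1])


# Hypothetical category scores, for illustration only.
SAMPLE_SCORES = {
    "Discoverability": 70,
    "Content Accessibility": 90,
    "Bot Access Control": 85,
    "Protocol Discovery": 0,
    "Attribution": 60,
}

print(weakest_first(SAMPLE_SCORES)[0])
print(triage_band(72))
```

In this made-up example the weakest category is Protocol Discovery at zero, the kind of result that drags a composite down while being a fast fix.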
The grader audits your configuration from the outside. The thing it cannot do from a single external scan is tell you which AI bots are actually crawling you and how often — that requires reading your server logs over time, which is the paid Crawlytics product. If you want the tool-by-tool view of what's out there for this job, including the honest trade-offs, see our roundup of the best agent-readiness tools. (Disclosure: Crawlytics publishes this blog; the grader is our free tool. It is a configuration scan, not a log analyzer, and it does not currently email a report or show your live bot traffic — for that you would move to the paid product or read your logs directly.)
Fix in severity order, and the order is almost always the same. First, anything that blocks access: an AI crawler disallowed in robots.txt, or content that only exists after JavaScript runs. These are binary — if a bot can't reach or read the page, nothing downstream matters. Second, discoverability: a missing or broken sitemap, and a missing llms.txt. Third, protocol signals: add or repair JSON-LD schema. Last, attribution: set up UTM tagging so you can measure AI referrals going forward. Attribution matters, but it measures results — it does not create them, so it comes after the things that make you visible.
Re-audit after any deploy that changes rendering, touches robots.txt, or migrates your CMS or framework. Those three changes are exactly the ones that silently break AI accessibility — a framework upgrade that flips a page from server-rendered to client-rendered can make your content vanish from AI fetchers overnight while looking identical in a browser. Beyond that, a quarterly audit is reasonable for a stable site, with an eye on the shifting crawler landscape since new AI bots launch and user-agent strings change. For the bigger-picture strategy around all of this, the AI search optimization hub ties the audit into an ongoing program.
One related signal worth knowing: if your site is on Cloudflare, there is now a separate Cloudflare agent-readiness score in the mix too — useful context for how the infrastructure layer is starting to grade sites on the same question.
Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.
Check five things by hand. First, confirm AI crawlers can discover your site: open yourdomain.com/robots.txt, yourdomain.com/sitemap.xml, and yourdomain.com/llms.txt in a browser. Second, confirm your content is in the raw server HTML by using View Source (not Inspect) and searching for a sentence from your page. Third, check which AI bots your robots.txt allows or blocks. Fourth, look for schema markup and a llms.txt protocol file. Fifth, confirm you can attribute AI referral traffic. Or run a free scan at crawlytics.app/agent-ready that checks the same five dimensions automatically.
Yes. Crawlytics runs a free Agent-Ready grader at crawlytics.app/agent-ready that performs a real live scan of your site in about 30 seconds. It fetches your robots.txt, llms.txt, sitemap.xml, and homepage, then scores five categories — Discoverability, Content Accessibility, Bot Access Control, Protocol Discovery, and Attribution — and returns an overall 0-100 score with the top findings sorted by severity. No signup is required to see the score and category breakdown.
On a 0-100 composite scale, 80 and above is genuinely strong, the 60s and 70s mean you have fixable gaps, and below 60 usually signals a structural problem like blocked AI crawlers or content that only renders in JavaScript. Treat the composite as a triage tool: a single low category (for example Protocol Discovery because you have no llms.txt) drags the average down but is a quick fix, whereas a low Content Accessibility score often means a rendering change.
An SEO audit optimizes for Google ranking: keywords, backlinks, Core Web Vitals, title tags, and indexability for Googlebot. An AI visibility audit optimizes for being read and cited by AI assistants like ChatGPT, Perplexity, and Claude. The dimensions differ — AI fetchers care whether they are allowed in your robots.txt, whether your content exists in server HTML without JavaScript, whether you publish a llms.txt file, and whether you serve clean machine-readable content. There is overlap (crawlable, fast, well-structured pages help both) but the failure modes are different.
Re-audit after any deploy that changes how pages render, touches robots.txt, or migrates your CMS or framework, because those are exactly the changes that silently break AI accessibility. Beyond that, a quarterly check is reasonable for a stable site. The AI crawler landscape also shifts — new bots launch and user-agent strings change — so a periodic look at which crawlers are hitting your server is worth it even when your own site has not changed.