llms.txt Generator + Per-Page Markdown for AI Bots

Generate llms.txt and per-page markdown from your sitemap.

Crawlytics fetches every URL in your sitemap, extracts the main content as clean markdown, scores and categorizes the pages, and exposes llms.txt, llms-full.txt, and per-page .md at stable URLs. Built for AI bots that don't run JavaScript.

What gets served

Three public URLs per site, ready to hand to an AI assistant, plus an authenticated crawl-status endpoint.

Curated index

/api/sites//llms.txt

Top-scored pages grouped by category. One-line descriptions, links to the live HTML.

Full bundle

/api/sites//llms-full.txt

Top 50 pages' full markdown concatenated into a single document an AI can ingest in one fetch.

Per-page markdown

/api/sites//md/

Every crawled page addressable as clean markdown. Drop /api/sites/3/md/pricing into ChatGPT, get the page back as markdown.

Crawl status

/api/sites//crawl

Discovered / processed / failed counts, latest job state, top-scored pages. Auth required.
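
As a small illustration of how the three public URLs fit together, the sketch below builds them for one site. The host (example.com) is a placeholder because this page does not state one, the helper name served_urls is hypothetical, and the numeric site ID in the llms.txt and llms-full.txt paths is assumed to follow the same pattern as the /api/sites/3/md/pricing example.

```python
from urllib.parse import quote

# Placeholder host: the real API host is not stated on this page.
BASE_URL = "https://example.com"


def served_urls(site_id: int, page_slug: str, base_url: str = BASE_URL) -> dict:
    """Build the public URLs for one site (hypothetical helper)."""
    root = f"{base_url.rstrip('/')}/api/sites/{site_id}"
    return {
        "index": f"{root}/llms.txt",
        "full": f"{root}/llms-full.txt",
        # quote() leaves "/" alone, so nested slugs like "docs/setup" survive.
        "page": f"{root}/md/{quote(page_slug.strip('/'))}",
    }


urls = served_urls(3, "pricing")
print(urls["page"])  # https://example.com/api/sites/3/md/pricing
```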

The pipeline

Six steps from sitemap URL to public output.
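
As a rough illustration of the first stage only, reading URLs and their metadata out of a sitemap could look like the sketch below. It uses Python's standard xml.etree.ElementTree; the function name parse_sitemap and the sample XML are hypothetical and are not taken from Crawlytics itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list:
    """Return one dict per <url> entry with loc, lastmod, and priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = (url.findtext("sm:loc", default="", namespaces=SITEMAP_NS) or "").strip()
        if not loc:
            continue  # an entry without a location is unusable
        priority = url.findtext("sm:priority", default=None, namespaces=SITEMAP_NS)
        entries.append({
            "loc": loc,
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
            # The sitemap protocol defines 0.5 as the default priority.
            "priority": float(priority) if priority else 0.5,
        })
    return entries


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><priority>1.0</priority></url>
  <url><loc>https://example.com/pricing</loc><lastmod>2024-01-15</lastmod></url>
</urlset>"""

for entry in parse_sitemap(SAMPLE):
    print(entry["loc"], entry["priority"], entry["lastmod"])
```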

The Coverage Gap cross-check

Because Crawlytics generates your llms.txt and reads your bot logs from the same database, it can answer a question a standalone generator can't: which pages does the file declare that no AI bot has actually fetched? That list is the llms.txt Coverage Gap on your dashboard — declared but undiscovered.

A generator that only writes the file never finds out whether bots used it. The closed loop is the point.
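
At its core, the cross-check is a set difference between the URLs the file declares and the URLs that appear in bot logs. A minimal sketch, assuming both sides are already available as sets of URLs (the function names and the trailing-slash normalization are illustrative choices, not the product's actual logic):

```python
def normalize(url: str) -> str:
    """Treat /pricing and /pricing/ as the same page (an assumed convention)."""
    return url.rstrip("/") or "/"


def coverage_gap(declared: set, fetched_by_bots: set) -> list:
    """Pages listed in llms.txt that no AI bot has fetched."""
    fetched = {normalize(u) for u in fetched_by_bots}
    return sorted(u for u in {normalize(d) for d in declared} if u not in fetched)


declared = {"/pricing", "/docs/setup", "/about/"}
fetched = {"/pricing/", "/blog/launch"}
print(coverage_gap(declared, fetched))  # ['/about', '/docs/setup']
```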

Bots don't want your JavaScript. Serve them clean markdown built from your sitemap, automatically.

Frequently Asked Questions

What is llms.txt?

llms.txt is a proposed open standard (llmstxt.org): a plain-text Markdown file at /llms.txt that tells AI systems what a website contains. The format is an H1 with your site name, a one-line blockquote summary, then sections of grouped links with descriptions. AI bots can fetch this instead of trying to parse JavaScript-heavy HTML. llms-full.txt is the companion file that concatenates the full markdown of your top-scored pages into one fetchable document.
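
To make the format concrete, here is a minimal sketch that renders a file in that shape: an H1, a one-line summary as a blockquote, then H2 sections of links with descriptions. The function name render_llms_txt and the sample site data are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render llms.txt text; sections maps a heading to (title, url, description) tuples."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


print(render_llms_txt(
    "Example Co",
    "Invoicing software for small teams.",
    {
        "Product": [("Overview", "https://example.com/product", "What the product does.")],
        "Pricing": [("Plans", "https://example.com/pricing", "Plan tiers and limits.")],
    },
))
```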

How do I add llms.txt to my website?

Three options: (1) hand-write it — fine for small sites, becomes a maintenance burden past ~20 pages; (2) generate it once and host the static file — drifts the moment you add a page; (3) auto-generate it from your sitemap so it stays current automatically. Crawlytics does option 3 — paste a snippet, point it at your sitemap, and we serve /llms.txt, /llms-full.txt, and per-page /md/ URLs that update on a daily re-crawl.

What is the difference between llms.txt and llms-full.txt?

llms.txt is the curated index, and it is short. It lists your top pages grouped by category (Product, Pricing, Blog, etc.) with one-line descriptions, so an AI bot can fetch it to learn how your site is organized. llms-full.txt is the full bundle: the actual markdown of your top-scored pages concatenated into a single document an AI can ingest in one fetch.

Does Google penalize llms.txt?

No. llms.txt is a static text file at /llms.txt that Google's crawlers ignore entirely (Googlebot reads HTML, not llms.txt). It doesn't affect SEO rankings either positively or negatively in Google. Its purpose is to help AI bots (GPTBot, ClaudeBot, PerplexityBot) understand and cite your content — which is increasingly relevant for AI search traffic.

How does scoring work?

Each page gets a 0–50 score from six signals: sitemap priority, URL depth (shallower wins), category (homepage / about / pricing / product / tools / docs / blog), word count, recency (from sitemap lastmod), and whether it has a meta description. The top-scored pages flow to the top of llms.txt and into llms-full.txt.
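
The exact weights are not published here, so the sketch below is only an assumption about how six signals could combine into a 0–50 score. Every weight, category value, and threshold in it (for example, 1,000 words counting as "full length" and recency decaying over a year) is invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical weights, chosen only so that they sum to 50.
WEIGHTS = {"priority": 10, "depth": 10, "category": 10, "words": 8, "recency": 7, "meta": 5}
# Hypothetical per-category values between 0 and 1.
CATEGORY_VALUE = {"homepage": 1.0, "pricing": 0.9, "product": 0.9, "about": 0.7,
                  "tools": 0.7, "docs": 0.6, "blog": 0.4}


@dataclass
class Page:
    path: str
    category: str
    sitemap_priority: float
    word_count: int
    lastmod: Optional[date]
    has_meta_description: bool


def score(page: Page, today: date) -> float:
    """Combine six signals, each scaled to 0..1, into a 0-50 score."""
    depth = len([seg for seg in page.path.split("/") if seg])
    if page.lastmod is None:
        recency = 0.0
    else:
        age_days = (today - page.lastmod).days
        recency = min(1.0, max(0.0, 1 - age_days / 365))
    signals = {
        "priority": min(1.0, max(0.0, page.sitemap_priority)),
        "depth": 1 / (1 + depth),  # shallower wins
        "category": CATEGORY_VALUE.get(page.category, 0.3),
        "words": min(page.word_count / 1000, 1.0),
        "recency": recency,
        "meta": 1.0 if page.has_meta_description else 0.0,
    }
    return round(sum(WEIGHTS[name] * signals[name] for name in WEIGHTS), 1)


today = date(2024, 6, 1)
home = Page("/", "homepage", 1.0, 1200, today, True)
post = Page("/blog/2021/old-post", "blog", 0.3, 400, date(2021, 1, 1), False)
print(score(home, today))  # 50.0
print(score(home, today) > score(post, today))  # True
```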

How many pages can it crawl?

Currently 100 pages per crawl invocation (bounded by serverless request budget). Larger sites are handled by re-clicking Start crawl — the worker skips pages already in the database and processes the next 100. A 480-page site finishes in 5 clicks. Auto-continue is on the roadmap.
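
The arithmetic behind "5 clicks" is a ceiling division: 480 / 100 rounds up to 5. The sketch below simulates the skip-and-continue behavior under that assumption; next_batch and invocations_needed are hypothetical names, and the in-memory set stands in for the database.

```python
import math

BATCH_SIZE = 100  # per-invocation limit quoted above


def next_batch(sitemap_urls: list, already_crawled: set, batch_size: int = BATCH_SIZE) -> list:
    """Skip pages already stored and return the next slice to process."""
    pending = [url for url in sitemap_urls if url not in already_crawled]
    return pending[:batch_size]


def invocations_needed(total_pages: int, batch_size: int = BATCH_SIZE) -> int:
    return math.ceil(total_pages / batch_size)


urls = [f"https://example.com/page-{i}" for i in range(480)]
crawled = set()
clicks = 0
while (batch := next_batch(urls, crawled)):
    crawled.update(batch)  # stands in for writing results to the database
    clicks += 1

print(clicks, invocations_needed(480))  # 5 5
```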

How does it handle JavaScript-rendered content?

Crawlytics fetches the server-rendered HTML response. For SPAs that require JS to render, ensure server-side rendering is enabled or use a prerender service in front of your origin.
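
One way to check whether a page will survive a no-JavaScript fetch is to count the visible words in the raw HTML response. The sketch below does that with Python's standard html.parser; it is a rough heuristic written for this page, not the extractor Crawlytics uses, and the sample HTML strings are made up.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Count words outside script, style, noscript, and template elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def visible_word_count(html: str) -> int:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.words


spa_shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
server_rendered = "<html><body><h1>Pricing</h1><p>Three plans, billed monthly.</p></body></html>"
print(visible_word_count(spa_shell))        # 0
print(visible_word_count(server_rendered))  # 5
```

A count near zero suggests the page depends on client-side rendering and would need server-side rendering or a prerender service in front of it.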

This page is part of Crawlytics.app. View all pages: llms.txt · llms-full.txt