Automatic Content Generation

Coming soon

From HTML to AI-ready markdown in minutes

Crawlytics will crawl your website via sitemap.xml, convert every page to clean markdown, and generate llms.txt index files following the open standard. Smart content scoring will surface your most important pages first — even if you have thousands. This feature is on the roadmap and not yet shipped; everything below describes the planned behavior.

Sitemap-based crawling

Reads your sitemap.xml (including sitemap indexes) to discover all pages automatically.
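As a rough illustration of sitemap discovery, the sketch below parses a sitemap with Python's standard library and follows sitemap indexes recursively. It is not Crawlytics code: the function names `parse_sitemap` and `discover_pages` and the `fetch` callable (anything that returns the XML text for a URL) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections.abc import Callable

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> tuple[str, list[dict]]:
    """Return ("index", entries) for a sitemap index, else ("urlset", entries)."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    item_tag = "sm:sitemap" if kind == "index" else "sm:url"
    entries = []
    for item in root.findall(item_tag, NS):
        entries.append({
            "loc": item.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": item.findtext("sm:lastmod", default=None, namespaces=NS),
            "priority": item.findtext("sm:priority", default=None, namespaces=NS),
        })
    return kind, entries


def discover_pages(sitemap_url: str, fetch: Callable[[str], str]) -> list[dict]:
    """Collect page entries, descending into child sitemaps of an index."""
    kind, entries = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return entries
    pages = []
    for child in entries:
        pages.extend(discover_pages(child["loc"], fetch))
    return pages
```

Passing `fetch` in as a parameter keeps the parsing logic independent of any HTTP client, so the same function can be exercised against in-memory XML.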

Smart content scoring

Pages are scored by URL depth, sitemap priority, internal link count, word count, recency, content type, and whether a meta description is present.

Intelligent categorization

Auto-classifies pages as core, product, docs, blog, legal, or archive for organized llms.txt output.

Incremental updates

Re-processes only the pages whose lastmod date has changed. A 5,000-page site with 10 changes updates in seconds.
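One way to picture the incremental check is a comparison between the lastmod values in the current sitemap and those recorded on the previous run. The sketch below is illustrative only: the function name `pages_to_reprocess`, the dictionary shapes, and the choice to always re-process pages with no lastmod are assumptions, not confirmed Crawlytics behavior.

```python
def pages_to_reprocess(
    sitemap_entries: list[dict],
    previous_lastmod: dict[str, str],
) -> list[str]:
    """Return URLs that are new or whose lastmod differs from the stored value."""
    changed = []
    for entry in sitemap_entries:
        url = entry["loc"]
        lastmod = entry.get("lastmod")
        # Assumption: a page without lastmod cannot be proven unchanged,
        # so it is re-processed every time.
        if lastmod is None or previous_lastmod.get(url) != lastmod:
            changed.append(url)
    return changed
```

With this shape, the work per run scales with the number of changed entries rather than the size of the site, which is what makes a small change set on a large site cheap to update.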

Configurable content extraction

Strip navbars, footers, sidebars, ads, and other boilerplate. Target specific content selectors.

Three output files

llms.txt (curated index), llms-full.txt (top content combined), and individual .md files for every page.
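For a sense of what the curated index might look like, the sketch below renders an llms.txt file in the layout the llms.txt proposal describes: an H1 title, a blockquote summary, then H2 sections containing link lists. The function name `render_llms_txt`, the page dictionary keys, and the category order are assumptions for illustration.

```python
# Assumed display order for the categories named on this page.
CATEGORY_ORDER = ["core", "product", "docs", "blog", "legal", "archive"]


def render_llms_txt(site_name: str, summary: str, pages: list[dict]) -> str:
    """Render an llms.txt index with pages grouped by category, best score first."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    by_category: dict[str, list[dict]] = {}
    for page in sorted(pages, key=lambda p: p["score"], reverse=True):
        by_category.setdefault(page["category"], []).append(page)
    for category in CATEGORY_ORDER:
        items = by_category.get(category)
        if not items:
            continue  # skip empty sections
        lines.append(f"## {category.capitalize()}")
        lines.append("")
        for page in items:
            lines.append(f"- [{page['title']}]({page['md_url']}): {page['description']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Each link points at the page's individual .md file, so the index stays short while the full content remains one hop away.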

How content scoring works

Each page receives a score (0-50+) based on seven signals: sitemap priority tag (0-10 pts), URL depth (0-8 pts, shallower = more important), content type (0-10 pts, core pages score highest), internal link count (0-10 pts), word count (penalizes thin content), recency (0-3 pts), and whether the page has a meta description (0-2 pts). Pages are sorted by score, and the top content flows into llms.txt and llms-full.txt.
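To make the scoring model concrete, here is a minimal sketch that combines the seven signals within the point ranges listed above. How points are distributed inside each range is an assumption: the per-category points, the depth formula, the 100-word thin-content threshold and its 5-point penalty, the recency cutoffs, and the floor at zero are all invented for illustration, as are the names `PageSignals` and `score_page`.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    sitemap_priority: float   # <priority> value from the sitemap, 0.0 to 1.0
    url_depth: int            # number of path segments; 0 for the home page
    category: str             # core, product, docs, blog, legal, or archive
    internal_links: int       # inbound internal links
    word_count: int
    days_since_modified: int
    has_meta_description: bool


# Assumed split of the 0-10 content-type range; only "core scores highest" is given.
CATEGORY_POINTS = {"core": 10, "product": 8, "docs": 7, "blog": 5, "legal": 2, "archive": 1}


def score_page(p: PageSignals) -> float:
    score = 10 * p.sitemap_priority               # 0-10 pts
    score += max(0, 8 - 2 * p.url_depth)          # 0-8 pts, shallower is better
    score += CATEGORY_POINTS.get(p.category, 0)   # 0-10 pts
    score += min(10, p.internal_links)            # 0-10 pts, capped
    if p.word_count < 100:                        # assumed thin-content penalty
        score -= 5
    if p.days_since_modified <= 30:               # 0-3 pts, assumed cutoffs
        score += 3
    elif p.days_since_modified <= 365:
        score += 1
    if p.has_meta_description:                    # 0-2 pts
        score += 2
    return max(0.0, score)                        # assumed floor at zero
```

Under these assumptions, a well-linked, recently updated home page lands near the top of the range, while a deep, thin archive page ends up near zero.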

Handling large sites

For sites with thousands of pages, Crawlytics dynamically adjusts what goes into llms.txt. A 50-page site lists every page. A 2,000-page site lists about 100 curated links organized by category. A 10,000-page site lists only the top 70, but all pages still get individual .md files that AI can access via the links in the index.
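The sketch below shows one possible tiering that reproduces the three examples above (50, 2,000, and 10,000 pages). The breakpoints at 100 and 5,000 pages are assumptions chosen to fit those examples, and the function names are hypothetical; the planned feature may size the index differently.

```python
def index_link_budget(total_pages: int) -> int:
    """How many links to place in llms.txt for a site of the given size."""
    if total_pages <= 100:      # assumed breakpoint: small sites list everything
        return total_pages
    if total_pages <= 5000:     # assumed breakpoint: mid-size sites get about 100 links
        return 100
    return 70                   # very large sites keep only the top 70


def select_for_index(scored_pages: list[tuple[str, float]]) -> list[str]:
    """Pick the highest-scoring URLs that fit within the link budget."""
    budget = index_link_budget(len(scored_pages))
    ranked = sorted(scored_pages, key=lambda item: item[1], reverse=True)
    return [url for url, _score in ranked[:budget]]
```

Only the index is trimmed here; the selection has no effect on which pages receive their own .md file.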

Clean markdown conversion

HTML is processed through a pipeline: unwanted elements are stripped (nav, header, footer, scripts, ads), the main content area is identified, and the result is converted to clean markdown with proper headings, links (made absolute), images with alt text, and code blocks. The output is optimized for LLM consumption — minimal noise, maximum signal.
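A heavily simplified version of such a pipeline can be sketched with Python's built-in `html.parser`. This example strips boilerplate elements and converts headings, paragraphs, links (made absolute), images, and preformatted code; it skips the main-content detection step and ignores lists, tables, and nested formatting. The class name `MarkdownConverter` and helper `to_markdown` are assumptions, not part of Crawlytics.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP = {"nav", "header", "footer", "script", "style", "aside"}
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}
FENCE = "`" * 3  # markdown code fence


class MarkdownConverter(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out: list[str] = []   # finished markdown blocks
        self.buf: list[str] = []   # text of the block being built
        self.skip_depth = 0        # > 0 while inside a stripped element
        self.href = ""

    def _flush(self, prefix: str = "") -> None:
        # Collapse whitespace and emit the buffered text as one block.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            self.out.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        attrs = dict(attrs)
        if tag in HEADINGS or tag in ("p", "pre"):
            self._flush()
        elif tag == "a":
            self.href = urljoin(self.base_url, attrs.get("href") or "")
            self.buf.append("[")
        elif tag == "img":
            src = urljoin(self.base_url, attrs.get("src") or "")
            self.buf.append(f"![{attrs.get('alt') or ''}]({src})")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self._flush(HEADINGS[tag] + " ")
        elif tag == "p":
            self._flush()
        elif tag == "pre":
            code = "".join(self.buf).strip("\n")  # keep inner whitespace intact
            self.buf = []
            self.out.append(f"{FENCE}\n{code}\n{FENCE}")
        elif tag == "a":
            self.buf.append(f"]({self.href})")

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def to_markdown(html: str, base_url: str) -> str:
    conv = MarkdownConverter(base_url)
    conv.feed(html)
    conv.close()
    conv._flush()  # emit any trailing loose text
    return "\n\n".join(conv.out) + "\n"
```

A production converter would more likely build on a full HTML parser and a tested markdown library; the point here is only the order of steps: strip, walk the remaining content, emit markdown.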

FAQ

Frequently asked questions

What is llms.txt?

llms.txt is an open standard for providing website content in a format optimized for Large Language Models. Similar to how robots.txt tells crawlers what to access, llms.txt tells AI what your site contains and provides it in clean, structured markdown.

Does Crawlytics work with JavaScript-rendered sites?

Crawlytics fetches the server-rendered HTML response. For sites that require JavaScript to render content (SPAs), you may need to ensure server-side rendering is enabled, or use a prerendering service.

Can I control what gets included?

Yes. You can configure which post types to include, exclude specific URLs, customize the content selector for extraction, and add extra CSS selectors to strip. The scoring algorithm handles prioritization automatically, but you control the inputs.

Want early access?

This feature isn't built yet. Sign up for the dashboard now and you'll be first to try it when it ships.

Notify me when it ships