Automatic Content Generation
Coming soon
From HTML to AI-ready markdown in minutes
Crawlytics will crawl your website via sitemap.xml, convert every page to clean markdown, and generate llms.txt index files following the open standard. Smart content scoring will surface your most important pages first — even if you have thousands. This feature is on the roadmap and not yet shipped; everything below describes the planned behavior.
Sitemap-based crawling
Reads your sitemap.xml (including sitemap indexes) to discover all pages automatically.
Smart content scoring
Pages are scored by URL depth, sitemap priority, internal link count, word count, recency, content type, and whether a meta description is present.
Intelligent categorization
Auto-classifies pages as core, product, docs, blog, legal, or archive for organized llms.txt output.
Incremental updates
Only re-processes pages whose lastmod dates have changed. A 5,000-page site with 10 changed pages updates in seconds.
Configurable content extraction
Strip navbars, footers, sidebars, ads, and other boilerplate. Target specific content selectors.
Three output files
llms.txt (curated index), llms-full.txt (top content combined), and individual .md files for every page.
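As a rough illustration of the sitemap-based crawling and incremental updates described above, here is a minimal Python sketch. It is not the Crawlytics implementation: the function names `parse_sitemap` and `changed_pages` are hypothetical, sitemap indexes and HTTP fetching are left out, and re-processing pages that have no lastmod value is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> dict[str, str]:
    """Return {url: lastmod} for every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    pages = {}
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:
            pages[loc] = lastmod
    return pages


def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """List URLs that are new or whose lastmod differs from the last crawl.

    Assumption: a page with no lastmod is always re-processed, because there
    is no way to tell from the sitemap whether it changed.
    """
    return sorted(
        url
        for url, lastmod in current.items()
        if not lastmod or previous.get(url) != lastmod
    )
```

With a stored `{url: lastmod}` map from the previous crawl, only the URLs returned by `changed_pages` would need to be fetched and converted again, which is why a large site with a handful of edits can update quickly.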
How content scoring works
Each page receives a score (0-50+) based on seven signals: sitemap priority tag (0-10 pts), URL depth (0-8 pts, shallower = more important), content type (0-10 pts, core pages score highest), internal link count (0-10 pts), word count (penalizes thin content), recency (0-3 pts), and whether the page has a meta description (0-2 pts). Pages are sorted by score, and the top content flows into llms.txt and llms-full.txt.
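To make the seven signals concrete, here is a hedged Python sketch of a scoring function. Only the point ranges come from the description above. The `Page` fields, the per-category point table, the depth and recency formulas, and the thin-content threshold and penalty are all assumptions chosen for illustration, not the planned Crawlytics algorithm.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical point table: only "core pages score highest" is from the source.
CONTENT_TYPE_POINTS = {"core": 10, "product": 8, "docs": 7, "blog": 5, "legal": 2, "archive": 1}


@dataclass
class Page:
    url: str
    sitemap_priority: float   # <priority> value from the sitemap, 0.0 to 1.0
    content_type: str
    internal_links: int       # number of internal links pointing at the page
    word_count: int
    days_since_modified: int
    has_meta_description: bool


def score_page(page: Page) -> float:
    depth = len([seg for seg in urlparse(page.url).path.split("/") if seg])
    score = 10 * min(max(page.sitemap_priority, 0.0), 1.0)   # sitemap priority: 0-10
    score += max(0, 8 - 2 * depth)                           # URL depth: 0-8, shallower is better
    score += CONTENT_TYPE_POINTS.get(page.content_type, 0)   # content type: 0-10
    score += min(page.internal_links, 10)                    # internal links: 0-10
    if page.word_count < 150:                                # assumed thin-content penalty
        score -= 5
    if page.days_since_modified <= 30:                       # recency: 0-3
        score += 3
    elif page.days_since_modified <= 180:
        score += 2
    elif page.days_since_modified <= 365:
        score += 1
    if page.has_meta_description:                            # meta description: 0-2
        score += 2
    return max(score, 0.0)


home = Page("https://example.com/", 1.0, "core", 25, 800, 10, True)
old_post = Page("https://example.com/blog/2019/old-post", 0.5, "blog", 1, 80, 2000, False)
ranked = sorted([old_post, home], key=score_page, reverse=True)
print([(p.url, score_page(p)) for p in ranked])
# home scores 43.0, old_post scores 8.0
```

Keeping each signal as a separate, capped term makes it easy to see why a page ranked where it did and to tune one signal without affecting the others.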
Handling large sites
For sites with thousands of pages, Crawlytics dynamically adjusts what goes into llms.txt. A 50-page site shows everything. A 2,000-page site shows about 100 curated links organized by category. A 10,000-page site shows only the top 70 — but all pages still get individual .md files that AI can access via the links in the index.
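One way to picture this size-dependent behavior is a link budget that shrinks the index as the site grows. The sketch below is an assumption-heavy illustration: the three budgets (everything, about 100, top 70) come from the examples above, but the cut-off points of 100 and 5,000 pages and the names `link_budget` and `build_index` are invented for this example.

```python
from collections import defaultdict


def link_budget(total_pages: int) -> int:
    """Return how many links the llms.txt index should contain.

    The cut-offs (100 and 5,000 pages) are hypothetical; only the resulting
    budgets mirror the examples in the text.
    """
    if total_pages <= 100:
        return total_pages   # small sites: list every page
    if total_pages <= 5000:
        return 100           # mid-size sites: about 100 curated links
    return 70                # very large sites: only the top 70


def build_index(pages: list[tuple[str, str, float]]) -> dict[str, list[str]]:
    """Group the highest-scoring pages by category.

    Each page is a (url, category, score) tuple. Pages outside the budget are
    left out of the index but would still get their own .md file.
    """
    budget = link_budget(len(pages))
    top = sorted(pages, key=lambda page: page[2], reverse=True)[:budget]
    grouped: dict[str, list[str]] = defaultdict(list)
    for url, category, _score in top:
        grouped[category].append(url)
    return dict(grouped)
```

Because the budget only limits the index, every page remains reachable through its individual markdown file.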
Clean markdown conversion
HTML is processed through a pipeline: unwanted elements are stripped (nav, header, footer, scripts, ads), the main content area is identified, and the result is converted to clean markdown with proper headings, links (made absolute), images with alt text, and code blocks. The output is optimized for LLM consumption — minimal noise, maximum signal.
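The strip-then-convert idea can be sketched with nothing but Python's standard library. This is a deliberately small illustration, not the planned pipeline: it handles headings, paragraphs, links, and images only, skips main-content detection and code blocks, and the names `MarkdownConverter`, `html_to_markdown`, and the `SKIP_TAGS` set are assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_TAGS = {"nav", "header", "footer", "script", "style", "aside"}
HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}


class MarkdownConverter(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.skip_depth = 0    # greater than 0 while inside a stripped element
        self.parts = []        # output fragments; "\n\n" marks a block break
        self.href = None       # absolute URL of the link being read, if any
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        attrs = dict(attrs)
        if tag in HEADINGS:
            self.parts.append("\n\n" + HEADINGS[tag] + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a" and attrs.get("href"):
            self.href = urljoin(self.base_url, attrs["href"])  # make link absolute
            self.link_text = []
        elif tag == "img":
            src = urljoin(self.base_url, attrs.get("src") or "")
            self.parts.append(f"![{attrs.get('alt') or ''}]({src})")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a" and self.href is not None:
            text = "".join(self.link_text).strip()
            self.parts.append(f"[{text}]({self.href})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        data = re.sub(r"\s+", " ", data)  # collapse source whitespace
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)


def html_to_markdown(html: str, base_url: str) -> str:
    converter = MarkdownConverter(base_url)
    converter.feed(html)
    converter.close()
    blocks = (block.strip() for block in "".join(converter.parts).split("\n\n"))
    return "\n\n".join(block for block in blocks if block)


page = '<nav><a href="/x">Menu</a></nav><h1>Title</h1>\n<p>See the <a href="/docs">docs</a>.</p><footer>(c)</footer>'
print(html_to_markdown(page, "https://example.com/page"))
# # Title
#
# See the [docs](https://example.com/docs).
```

A production converter would more likely build on a full HTML parser with CSS selector support so that configurable content and strip selectors can be applied before conversion.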
FAQ
Frequently asked questions
What is llms.txt?
llms.txt is an open standard for providing website content in a format optimized for Large Language Models. Similar to how robots.txt tells crawlers what to access, llms.txt tells AI what your site contains and provides it in clean, structured markdown.
Does it work with JavaScript-rendered sites?
Crawlytics fetches the server-rendered HTML response. For sites that require JavaScript to render content (SPAs), you may need to ensure server-side rendering is enabled, or use a prerendering service.
Can I customize what gets included?
Yes. You can configure which post types to include, exclude specific URLs, customize the content selector for extraction, and add extra CSS selectors to strip. The scoring algorithm handles prioritization automatically, but you control the inputs.
Want early access?
This feature isn't built yet. Sign up for the dashboard now and you'll be first to try it when it ships.
Notify me when it ships