AI Bot Traffic Cost: Which Crawlers Are Worth It?

Summary

AI bot traffic cost is a real line item: crawlers hammer carts, checkout, and search pages that never cache. This guide covers which crawlers to allow, rate-limit, or monetize.

The first time most site owners notice AI bots, it is not in their analytics. It is in a hosting bill that jumped 30% with no change in human traffic, or a database that started timing out at 3am on a page nobody promotes. You go digging in the logs and find a crawler you have never heard of fetching your faceted search results a few thousand times an hour.

For two years the conversation about AI crawlers was about copyright and scraping. That framing is now out of date. The bots that hurt are not stealing your best article. They are running up compute on the parts of your site that cost the most to serve and return the least, and the fix is not a robots.txt block. It is deciding, per bot, who has earned access to what.

AI bot traffic stopped being free

Two things changed at once. The volume went vertical, and the mix got worse. Reporting that Search Engine Journal rounds up puts AI crawler traffic up sharply year over year, with roughly 1 in 31 site visits now coming from a bot rather than a person. On a site doing a million monthly pageviews, that works out to roughly 32,000 fetches a month that you are paying to serve to software.

The mix is the part people miss. Not all AI bots are the same animal. A useful way to split them:

Bot type | What it wants | What you get back
Training crawlers (GPTBot, ClaudeBot, CCBot, Bytespider) | Pages to fold into the next model's training set. | Nothing direct. Maybe a sliver of influence on a future model's general knowledge.
Search/index crawlers (OAI-SearchBot, PerplexityBot, GoogleOther) | To index you for an AI answer engine. | Citations in AI search results, which drive referral clicks.
Live-user fetchers (ChatGPT-User, Perplexity-User) | Your page, right now, because a human just asked about it. | The highest-intent traffic there is. A real person is on the other end.

By most counts the large majority of all this crawling, on the order of 80%, is the first row: training. That is the category with no return path. When a training crawler fetches your checkout page forty times, you have paid for forty expensive database hits to help train a model that will not send you a single visitor.
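
The three-way split can be sketched as a lookup on the User-Agent header. The bot tokens below come from the table; the function name classify_bot and the category labels are illustrative assumptions. A User-Agent string can be spoofed, so treat the result as a first-pass label, not as verification.

```python
# Illustrative sketch: label a request by the bot tokens named in the table.
# The category names and classify_bot are hypothetical, not a standard API.
BOT_CATEGORIES = {
    "training": ("GPTBot", "ClaudeBot", "CCBot", "Bytespider"),
    "search": ("OAI-SearchBot", "PerplexityBot", "GoogleOther"),
    "live_user": ("ChatGPT-User", "Perplexity-User"),
}


def classify_bot(user_agent: str) -> str:
    """Return the category for a User-Agent string, or 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"
```

Anything that comes back as "unknown" is either a human browser or a crawler you have not catalogued yet, which is the group worth a closer look in the logs.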

The banality of bots

The damage usually is not malice. It is sloppiness at scale. A lot of today's crawlers were assembled quickly, sometimes vibe-coded overnight against a half-read spec, and they behave like it. They ignore Crawl-delay. They skim or skip robots.txt. And they have no loop detection, which is where the real cost lives.

Picture a crawler that finds your product listing at /shop?category=shoes. A well-built one fetches it once. A careless one discovers that ?category=shoes&sort=price, &page=2, &color=red, and every combination of those produces a slightly different page, and it walks the entire combinatorial tree. There are thousands of those URLs. The bot does not know it is going in circles, so it keeps fetching variants for days. Each one is a fresh, uncacheable render.
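
To put a number on that combinatorial tree, here is a rough sketch. The facet values are made up for illustration, and canonical_key is a hypothetical helper showing one way to notice that many distinct URLs are really the same listing.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical facet values for one listing page; real catalogs often have more.
FACETS = {
    "sort": ["price", "name", "newest"],
    "page": [str(n) for n in range(1, 21)],
    "color": ["red", "blue", "black", "white", "green"],
}


def count_variants(facets: dict) -> int:
    """Count distinct URLs when each facet is either absent or set to one value."""
    total = 1
    for values in facets.values():
        total *= len(values) + 1  # +1 for "parameter not present"
    return total


def canonical_key(url: str, keep: frozenset = frozenset({"category"})) -> str:
    """Collapse faceted variants to one key by dropping non-essential parameters."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in keep)
    return f"{parts.path}?{urlencode(kept)}" if kept else parts.path
```

With these assumed facets, count_variants(FACETS) is 504 URLs for a single category, so a handful of categories is enough to reach the thousands. If one bot requests hundreds of URLs that share a canonical_key, it is very likely looping.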

I have seen a single unknown crawler generate more dynamic-page load in a week than a site's actual customers did in a month. It bought nothing, cited nothing, and signed up for nothing. It just looped.

Why your cart and search pages are the expensive part

Here is the distinction that makes AI bot traffic cost a real budget item rather than a rounding error: not all pages cost the same to serve.

A blog post or a marketing page is usually static or cached. The first visitor pays the rendering cost, a CDN stores the result, and the next ten thousand visitors (human or bot) get a cheap copy from the edge. A bot hammering those pages is mostly noise.

Dynamic endpoints are the opposite: cart, checkout, internal search, and filtered listing URLs. They are built to never cache, because their whole job is to be personal or live.

This is why two sites with identical bot counts can have wildly different bot costs. The one whose crawlers stay on static content barely notices. The ecommerce site whose crawlers wandered into faceted search is paying for thousands of database queries a day that no customer ever asked for. When you decide what to protect first, protect the endpoints that do work on every hit, not the ones the CDN already handles for free.
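
A back-of-the-envelope model shows why the page mix matters more than the hit count. The two unit costs below are placeholders, not measurements; substitute figures from your own hosting bill.

```python
# Placeholder unit costs in dollars per request; substitute your own measurements.
COST_CACHED = 0.000001   # served from the CDN edge
COST_DYNAMIC = 0.0005    # database query plus a fresh render


def monthly_bot_cost(hits: int, dynamic_share: float) -> float:
    """Estimate serving cost for `hits` requests, `dynamic_share` of them uncacheable."""
    if not 0.0 <= dynamic_share <= 1.0:
        raise ValueError("dynamic_share must be between 0 and 1")
    dynamic_hits = hits * dynamic_share
    cached_hits = hits - dynamic_hits
    return dynamic_hits * COST_DYNAMIC + cached_hits * COST_CACHED
```

Under these assumed prices, 100,000 bot hits that all land on cached pages cost about ten cents, while the same 100,000 hits with 80% on dynamic endpoints cost about forty dollars. The bot count is identical; only the endpoints differ.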

The visibility trap

The obvious response is to block all of it. Add the bots to robots.txt, turn on a firewall rule, move on. That reflex is what gets sites in trouble, because the bot categories are tangled together and the cheap fix has an expensive side effect.

Block GPTBot and you opt out of training, which is fine. But sites routinely overshoot and also block ChatGPT-User and OAI-SearchBot, the bots that actually fetch you when a person asks ChatGPT a question and when the answer engine builds its index. Now you have saved a little compute and made yourself invisible in AI search at the same time. For a business that depends on being found, that trade is backwards. I wrote a full framework for that specific decision in the GPTBot allow/block guide, and the short version is: the bot you most want to keep and the one you most want to drop often share a vendor and look almost identical in a log.
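
One way to catch an overshoot before it ships is to test your robots.txt against the bots you care about. The sketch below uses Python's standard urllib.robotparser; the policy text, the example.com host, and the paths are assumptions for illustration. It only tells you what a compliant bot would do, since robots.txt remains advisory.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of training, keep search and live-user fetchers,
# and ask every other bot to stay out of the expensive paths. Paths are hypothetical.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, path: str) -> bool:
    """True if the parsed robots.txt permits `bot` to fetch `path`."""
    return parser.can_fetch(bot, f"https://example.com{path}")
```

Here allowed("GPTBot", "/blog/post") is False while allowed("ChatGPT-User", "/blog/post") and allowed("OAI-SearchBot", "/blog/post") are True, which is the outcome you want: training opted out, search and live-user fetchers still in.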

So the goal is not less bot traffic. It is the right bot traffic. You want the search and live-user fetchers in, you want the training crawlers gated or rate-limited on your expensive pages, and you cannot tell which is which without identifying them first.

Tiered access: which bots earn which parts of your site

Replace the on/off switch with a tier list. Most sites land on three tiers, ordered by what the bot gives back:

  1. Search and index crawlers — allow, everywhere. These are the bots that turn into citations and referral clicks. Treat them like Googlebot. The compute they use is a marketing cost with a return.
  2. Known AI agents with business value — allow, and consider charging. An agent that completes a purchase or a booking through your site is a customer, not a cost. If a specific AI agent is making thousands of calls to your product catalog, that is a signal to give it a real, metered way in rather than absorbing the load for free.
  3. Unknown scrapers and training crawlers on high-cost endpoints — rate-limit first. This is where the savings are. A training bot has no reason to be inside your checkout flow or looping your search. Throttling it there costs you nothing in visibility, because training crawlers do not produce citations.
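
The tier list can be sketched as a single decision function. The category labels ("search", "live_user", "agent", "training", "unknown"), the path prefixes, and the action names are all assumptions chosen for this example; adapt them to however you label bots and routes.

```python
# Hypothetical path prefixes for endpoints that run a query on every hit.
HIGH_COST_PREFIXES = ("/cart", "/checkout", "/search")


def access_decision(category: str, path: str) -> str:
    """Map a bot category and request path to 'allow', 'meter', or 'rate_limit'."""
    if category in ("search", "live_user"):
        return "allow"          # tier 1: citations and referral clicks
    if category == "agent":
        return "meter"          # tier 2: known agents with business value
    if path.startswith(HIGH_COST_PREFIXES):
        return "rate_limit"     # tier 3: training or unknown bots on costly paths
    return "allow"              # cheap, cached pages are mostly noise
```

Note that a training crawler is still allowed on cached content in this sketch. The throttle applies only where a hit costs real compute.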

Two levers enforce a tier list, and they are not equally strong. Be honest about which is which.

llms.txt is the polite signal. It is a file that tells AI systems what your site is, which content matters, and increasingly which paths they should leave alone. Well-behaved bots read it. It is advisory, though, the same way robots.txt is. A scraper that already ignores robots.txt will ignore llms.txt too, so treat it as a way to guide good actors, not a wall against bad ones.

WebMCP is the enforceable lever. Instead of letting an agent brute-force your dynamic pages, you expose a small set of real, callable tools (search the catalog, check availability, place an order) behind a snippet you control. Now a high-value agent gets a clean, cheap, structured path in, and you can require a key, rate-limit it, or charge for it. The agent stops scraping your expensive HTML because you gave it something better, and you stop paying for the brute-force version of a request you were happy to serve anyway.
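
The "require a key, rate-limit it" part can be sketched generically. This is not the WebMCP API; it is a plain token-bucket gate of the kind you could put in front of any callable tool, and the key name and limits are made up for illustration.

```python
import time


class TokenBucket:
    """Allow up to `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per API key; the key and limits here are made up for illustration.
BUCKETS = {"agent-key-123": TokenBucket(rate=5.0, capacity=10)}


def handle_tool_call(api_key: str) -> int:
    """Return an HTTP-style status: 401 unknown key, 429 over limit, 200 served."""
    bucket = BUCKETS.get(api_key)
    if bucket is None:
        return 401
    return 200 if bucket.allow() else 429
```

Counting the 200 responses per key is also the natural place to hang metering if you decide to charge for access.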

How to decide with data, not vibes

None of this works if you cannot see which bot is doing what. Raw analytics will not tell you. Google Analytics filters most bots out by design, so the traffic that is costing you money is the traffic your dashboard pretends does not exist. You end up with polluted human numbers and zero visibility into the bots themselves.

This is the gap Crawlytics is built to close, and it is the reason we just shipped a view called Bot ROI. The workflow is three steps:

The Bot ROI view crosses two numbers most tools keep separate: what a bot costs you to serve (its share of hits on your expensive endpoints) and what it returns (the conversions it drives). That cross is the whole decision in one screen. A training crawler at 80% high-cost hits and zero conversions is an easy throttle. A search crawler with the same load is one you leave alone, because it pays you back in citations.
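
The cross itself is simple enough to sketch by hand. The code below is not how Crawlytics computes Bot ROI; it is a minimal illustration of the idea, and the BotStats fields and the 0.5 threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BotStats:
    name: str
    total_hits: int
    high_cost_hits: int   # hits on uncacheable endpoints
    conversions: int      # sales, sign-ups, or citations attributed to the bot


def high_cost_share(stats: BotStats) -> float:
    """Fraction of a bot's hits that landed on expensive endpoints."""
    return stats.high_cost_hits / stats.total_hits if stats.total_hits else 0.0


def recommend(stats: BotStats, share_threshold: float = 0.5) -> str:
    """Throttle bots that are costly and return nothing; leave the rest alone."""
    if high_cost_share(stats) >= share_threshold and stats.conversions == 0:
        return "throttle"
    return "keep"
```

A bot with 800 of 1,000 hits on high-cost endpoints and zero conversions comes back as "throttle"; the same load with any attributed return comes back as "keep".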

Where to start this week

You do not need to solve this all at once. Start by looking, then act on the clearest case.

First, get a bot-filtered view of your traffic so you know the real split between humans and crawlers, and which crawlers they are. Second, find the single worst offender: the bot with the most hits on your dynamic, non-cacheable pages and nothing to show for it. That one bot is usually a meaningful chunk of the whole problem. Rate-limit it on those paths and watch your database load, not your visibility, drop. Third, for the bots that do earn their keep, make their lives easy with a clean llms.txt and, if any of them are transacting, a real tool to call.
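
If you only have raw access logs, the second step can be approximated with a short script. This sketch assumes the common "combined" log format and hypothetical path prefixes for your dynamic pages. It groups by exact User-Agent string, so human browsers appear too; a single crawler with far more hits than any browser string usually stands out anyway.

```python
import re
from collections import Counter

# Matches the request path and User-Agent in a combined-log-format line.
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical prefixes for dynamic, non-cacheable pages.
DYNAMIC_PREFIXES = ("/cart", "/checkout", "/search")


def worst_offender(lines):
    """Return (user_agent, hits) for the agent with the most dynamic-page hits."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match["path"].startswith(DYNAMIC_PREFIXES):
            hits[match["ua"]] += 1
    return hits.most_common(1)[0] if hits else None
```

Run it over a week of logs with something like worst_offender(open("access.log")), where the file name is whatever your server writes, and start with the agent it returns.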

The era of treating every bot the same is over. Block everything and you go dark in AI search. Allow everything and you fund other companies' models with your hosting budget. The sites that come out ahead are the ones that stop guessing and start measuring which crawlers are worth it, one endpoint at a time.

Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.

Frequently Asked Questions

How much does AI bot traffic actually cost?

It depends almost entirely on which pages the bots hit. Crawlers that stay on static, cached content cost close to nothing, because a CDN serves them a cheap copy. Crawlers that hit dynamic endpoints (cart, checkout, internal search, filtered URLs) cost real money, because each fetch triggers a database query and a fresh render that cannot be cached. The same number of bot visits can be a rounding error on one site and a 30% hosting increase on an ecommerce site whose crawlers wandered into faceted search.

Should I just block all AI bots to save money?

No. Blocking everything saves a small amount of compute and costs you AI search visibility, because some of those bots are the ones that cite you in ChatGPT, Perplexity, and Google's AI answers. The better approach is tiered: allow search and live-user crawlers, rate-limit unknown training crawlers on your expensive pages, and monetize high-value agents. Block only the bots that produce pure cost with no return.

Which AI bots are the most expensive to allow?

The training and data crawlers that ignore crawl limits and loop on URL variants, such as poorly behaved scrapers hitting ?page= and ?filter= combinations across your search and listing pages. They generate the most dynamic-page load and return nothing, because training crawlers do not drive citations or sales. Identify them by name and by the share of their hits that land on non-cacheable endpoints.

Does llms.txt stop bots from crawling expensive pages?

Only the well-behaved ones. llms.txt is an advisory signal, like robots.txt, that tells AI systems what to crawl and what to skip. Cooperative bots respect it; scrapers that already ignore robots.txt will ignore llms.txt too. Use it to guide good actors toward your best content and away from costly paths, but pair it with rate limiting or a WebMCP gate when you need to actually enforce a limit.

How do I tell which AI bots are driving value versus just costing me?

Cross two numbers: each bot's share of hits on your high-cost endpoints, and the conversions or citations it drives. A bot with high cost and zero return is a throttle candidate; a bot with the same load that produces citations or sales earns its access. Standard analytics will not show this because they filter bots out, so you need a tool that tracks bot identity and behavior per page. Crawlytics built its Bot ROI view specifically to put that cost-versus-value comparison in one place.
