Default-Deny AI Crawlers: Why Reuters and Publishers Are Switching

Summary

Reuters, Time, and People Inc. are switching robots.txt from a blocklist to an allowlist. This piece covers what default-deny means, why blocklists are failing, and what to do if you're not a publisher.

At a late-May IAB Tech Lab event, Lindsay Van Kirk, SVP of Innovation at People Inc., gave a number that reframes the entire AI-crawler debate. When her team switched from blocking bots by name to allowing only a short approved list, the count of blocked user agents went from about 2,100 to more than 30,000. Nothing about the open web changed that week. The 28,000-bot gap had been crawling People's titles all along, and the old blocklist simply never knew their names.

That gap is why Reuters, Time, People Inc., and a lengthening list of publishers are rewriting the most boring file on their servers. They're moving robots.txt from "block what you recognize" to "allow only what you approve." If you manage a site and you've been maintaining a list of bad bots to block, this is the shift you need to understand, because the list approach you're using is the one these publishers just abandoned.

What "default-deny" actually means

A blocklist robots.txt names the crawlers you want to keep out and lets everything else through. An allowlist robots.txt inverts that: it names the handful of crawlers you permit and refuses everyone else by default. Default-deny is the security term for that posture. You don't enumerate threats; you enumerate the exceptions and treat the rest of the world as untrusted until proven otherwise.

For two decades, the blocklist model worked because the bots that mattered were a known, slow-moving set: Googlebot, Bingbot, a few SEO crawlers. You could name the bad actors because there weren't many. The AI boom broke that assumption. There are now dozens of training crawlers, live-fetch agents, and search indexers across providers, and a new one can show up under a brand-new user-agent string any week. A blocklist is only as good as your last update, and nobody updates robots.txt weekly.

Reuters' live robots.txt is the cleanest example of the inverted model. It explicitly allows crawlers from Amazon, Google, Bing/Microsoft, Yahoo, and OpenAI, then disallows other bots across most of the site. Five names in, everyone else out. That file doesn't need to know that 30,000 other agents exist. It refuses them by structure, not by enumeration.
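The difference between the two models can be checked with Python's standard-library urllib.robotparser. This is a minimal sketch: both robots.txt bodies below are illustrative assumptions (not Reuters' or People Inc.'s actual files), and BrandNewBot is a hypothetical crawler standing in for any agent you have never heard of.

```python
from urllib.robotparser import RobotFileParser

# Illustrative blocklist: name the bots to keep out, let everyone else in.
BLOCKLIST = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative allowlist (default-deny): name the bots to admit, refuse the rest.
ALLOWLIST = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/story") -> bool:
    """Return what a rule-following crawler would conclude from this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, body in (("blocklist", BLOCKLIST), ("allowlist", ALLOWLIST)):
        for agent in ("Googlebot", "GPTBot", "BrandNewBot"):
            print(name, agent, allowed(body, agent))
```

Under the blocklist, the unnamed BrandNewBot is let through because nothing names it. Under the allowlist, the same agent is refused without the file ever mentioning it. The check models only what a compliant crawler would do; it enforces nothing.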

The number that should scare you: 2,100 to 30,000+

People Inc.'s jump from roughly 2,100 to over 30,000 blocked agents isn't a story about new bots appearing overnight. It's a story about how much the blocklist was missing the whole time. The company didn't suddenly attract 28,000 new crawlers. Those crawlers were already fetching People.com, Travel + Leisure, Food & Wine, and the rest of the portfolio. Switching to an allowlist just made the invisible visible.

This is the part that should land for any site owner. Your blocklist isn't a measure of crawler traffic. It's a measure of the crawlers you happened to hear about. The ones you don't name aren't absent; they're unmeasured, and unmeasured bot traffic is exactly the kind that scrapes content, drives up bandwidth bills, and gives nothing back. The People Inc. number shows what the gap between the agents you block and the agents that actually visit looks like at scale: 2,100 out of more than 30,000 means the old blocklist named at most about 7% of the agents the allowlist now refuses.

Why robots.txt was never built for this

Robots.txt was published as a convention in 1994. It was designed for a cooperative web where a handful of search engines wanted to be polite about which directories they indexed. It has no authentication, no enforcement, and no way to verify that the bot reading it is the bot it claims to be. Compliance is entirely voluntary.

That voluntary model is now the core problem. A Tollbit report found that 30% of total AI bot scrapes didn't comply with the explicit permissions in robots.txt. Nearly a third of AI crawler activity simply ignores the file. Some of that is bad actors spoofing user agents; some is crawlers that read robots.txt and fetch anyway. Either way, a robots.txt rule is a request, not a wall, and a meaningful share of AI traffic treats it as optional.

Publishers know this, which is why robots.txt is becoming the policy layer rather than the enforcement layer. The enforcement happens at the CDN or WAF, where you can actually drop a request. The allowlist in robots.txt states the intent clearly enough that downstream tools, licensing negotiations, and legal positions have a documented baseline. The industry is also organizing around it: the publisher-backed SPUR Coalition grew to 36 organizations after adding 30 members in May, aiming to set shared standards for how content gets licensed and used. Regulators are moving too. A new UK conduct requirement forces Google to let sites opt out of AI search features, a sign that opt-out is becoming a right rather than a favor.

Reuters' "fair value exchange" test

Allowlisting raises an obvious question: how do you decide who gets in? Reuters built an explicit rubric. Josh London, head of Reuters Professional, told Digiday that a bot earns access only if it offers a "fair value exchange" across four dimensions: licensing, referral traffic, uptime, and monetization.

Run the major crawlers through that filter and Reuters' five-name allowlist makes sense. Google and Bing send search traffic and underpin discovery. Amazon and Yahoo fit existing commercial relationships. OpenAI has been signing licensing deals with publishers, which buys it a seat. A training crawler that pays nothing, sends nothing, and respects nothing fails all four tests, so it doesn't make the list. The framework turns an emotional "block the AI" reaction into a business decision you can defend line by line.

What smaller sites can't copy from Reuters

Here's the catch nobody at the IAB event had to say out loud. Reuters can demand a fair value exchange because Reuters has bargaining power. Its archive is worth licensing, so AI companies negotiate. When you run a 40-page SaaS site, a regional services business, or a personal blog, no one is lining up to license your content, and a hard default-deny can quietly cost you the visibility you actually want.

The asymmetry is real. Anthropic's crawler documentation now warns publishers about the visibility trade-off of blocking its search bot: refuse the crawler that feeds AI answers and you opt out of being cited in those answers. For a publisher with a paywall and a licensing team, that trade can be worth it. For a business whose growth depends on being found, blocking the bots that surface you in ChatGPT or Claude is a way to make yourself invisible to the fastest-growing discovery channel on the web. Copying Reuters' robots.txt without Reuters' business model can backfire.

The distinction that matters is the one between training crawlers and live-fetch or search crawlers. The training kind takes your content to improve a model and usually sends nothing back. The live-fetch and search kind pulls your page in response to a real user question and cites you, which sends traffic. A smart allowlist isn't "block AI." It's "permit what sends readers, scrutinize what only takes." We walk through that split crawler by crawler in the GPTBot decision guide, and the AI bots list maps every major user agent to what it actually does.

A default-deny playbook for sites without a licensing team

You don't need Reuters' clout to borrow Reuters' discipline. The order of operations is what matters, and most sites get it backwards by editing robots.txt first and measuring never. Flip that.

Measure before you block. Pull your server or CDN logs and find out which bots actually hit your site, how often, and which pages they hammer. The People Inc. lesson is that the bots you don't track can vastly outnumber the ones you do. You can't make a value-exchange call on a crawler you didn't know was there.
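As a rough illustration of the measuring step, the sketch below tallies user agents from access-log lines. It assumes the common "combined" log format, where the user agent is the last double-quoted field; the sample lines, IP addresses, and the count_user_agents helper are made up for the example, and your CDN's export format may differ.

```python
import re
from collections import Counter

# In the combined log format the user agent is the final quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines using documentation-reserved IP ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [01/Jun/2025:10:00:02 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [01/Jun/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Googlebot/2.1"',
]


def count_user_agents(log_lines):
    """Count requests per user-agent string, skipping lines that do not parse."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for agent, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(f"{hits:6d}  {agent}")
```

Sorting by hit count puts the heaviest crawlers at the top, which is usually where the first value-exchange decisions need to be made.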

Sort by what each bot gives back. Group the crawlers you find into three buckets: send-me-traffic (search and live-fetch bots like Googlebot, Bingbot, ChatGPT-User, OAI-SearchBot, Claude-User), take-only (training crawlers and scrapers that never refer a visitor), and unknown. Allow the first bucket without hesitation. Scrutinize the second. Investigate the third before it grows.
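The three buckets can be expressed as a small classifier. This is a sketch under assumptions: the send-me-traffic tokens are the bots named above, the take-only set contains only GPTBot as a placeholder for whatever your own logs turn up, and matching is a plain case-insensitive substring test on the user-agent header (which a spoofing client can defeat).

```python
# Tokens taken from the bots named in this article.
TRAFFIC_SENDERS = ("googlebot", "bingbot", "chatgpt-user", "oai-searchbot", "claude-user")

# Placeholder set: extend it with the take-only crawlers found in your own logs.
TAKE_ONLY = ("gptbot",)


def bucket(user_agent: str) -> str:
    """Place a user-agent string into one of the three review buckets."""
    ua = user_agent.lower()
    if any(token in ua for token in TRAFFIC_SENDERS):
        return "send-me-traffic"
    if any(token in ua for token in TAKE_ONLY):
        return "take-only"
    return "unknown"


if __name__ == "__main__":
    for agent in ("Googlebot/2.1", "GPTBot/1.0", "MysteryScraper/0.1"):
        print(agent, "->", bucket(agent))
```

Anything that lands in "unknown" is the list to investigate before it grows.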

Start with a soft allowlist, not a hard one. You don't have to go full default-deny on day one. Begin by allowing your known-good search and AI-answer bots explicitly, then disallow the specific take-only crawlers you've identified. That captures most of the upside with far less risk of accidentally blocking a bot that was sending you readers. The manage AI crawlers guide has the ready-to-paste robots.txt, Cloudflare, and nginx configs for each posture.
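One way to picture the soft and hard postures is as the same generator with a single switch. The sketch below is an assumption-laden illustration, not the configs from the guide mentioned above: render_robots is a hypothetical helper, the bot lists are examples, and the result is verified with Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def render_robots(allowed_bots, blocked_bots, default_deny=False):
    """Build a robots.txt body; default_deny=True adds a catch-all refusal."""
    lines = []
    if allowed_bots:
        lines += [f"User-agent: {bot}" for bot in allowed_bots]
        lines += ["Allow: /", ""]
    if blocked_bots:
        lines += [f"User-agent: {bot}" for bot in blocked_bots]
        lines += ["Disallow: /", ""]
    if default_deny:
        lines += ["User-agent: *", "Disallow: /", ""]
    return "\n".join(lines)


def can_fetch(robots_txt, user_agent, url="https://example.com/page"):
    """Return what a rule-following crawler would conclude from this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOWED = ["Googlebot", "Bingbot", "ChatGPT-User", "OAI-SearchBot", "Claude-User"]
BLOCKED = ["GPTBot"]  # example take-only crawler; use what your logs show

SOFT = render_robots(ALLOWED, BLOCKED)
HARD = render_robots(ALLOWED, BLOCKED, default_deny=True)

if __name__ == "__main__":
    print(SOFT)
    print(can_fetch(SOFT, "UnknownBot"), can_fetch(HARD, "UnknownBot"))
```

In the soft version an unrecognized agent is still permitted, so a bot you forgot about keeps sending readers. Flipping default_deny to True is the step to take only once your measurements show nothing valuable sits in the unknown bucket.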

Enforce where it counts. Remember the Tollbit 30%. Robots.txt states intent, but the bots that ignore it stop only at the CDN or WAF. If a specific scraper is costing you real bandwidth and ignoring the file, rate-limit or block it at Cloudflare or nginx by user agent, where the request can actually be dropped.
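To show what "dropping the request" means in practice, here is a minimal stand-in written as Python WSGI middleware. It is not a Cloudflare or nginx config; it is an application-level sketch of the same idea, with a hypothetical block_scrapers wrapper, an example blocked token, and a toy hello_app. Because it matches on the user-agent header, it stops only scrapers that identify themselves honestly.

```python
# Example token only; in practice use the scrapers your logs show ignoring robots.txt.
BLOCKED_TOKENS = ("gptbot",)


def block_scrapers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the user agent matches a blocked token."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Toy application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded = block_scrapers(hello_app)
```

Unlike a robots.txt rule, the refusal here does not depend on the crawler's cooperation: the blocked request never reaches the application.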

Re-measure on a schedule. New crawlers launch constantly. The whole reason blocklists fail is that they go stale, and an allowlist goes stale the same way if you never check what's hitting the gate. A monthly look at your bot traffic is enough to catch a new entrant before it becomes a 28,000-agent surprise.
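The monthly check can be as simple as a set difference between the agents seen in this month's logs and the agents you have already reviewed. The helper name new_agents and the sample agent names are assumptions for illustration.

```python
def new_agents(seen_this_month, already_reviewed):
    """Return user agents seen in the logs that have never been reviewed."""
    reviewed = {ua.lower() for ua in already_reviewed}
    return sorted(ua for ua in set(seen_this_month) if ua.lower() not in reviewed)


if __name__ == "__main__":
    seen = ["Googlebot", "GPTBot", "FreshCrawler", "googlebot"]
    reviewed = ["Googlebot", "GPTBot"]
    print(new_agents(seen, reviewed))
```

Each agent the function returns goes into the "unknown" bucket until you decide whether it sends readers or only takes.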

The bottom line

Default-deny is the right instinct because the math of the open web changed. When new bots outpace any blocklist and nearly a third of AI bot scrapes ignore the rules anyway, "allow only what you approve" is the only posture that scales. The publishers flipping their robots.txt aren't being paranoid; they're being realistic about a file that was never designed for this.

The honest caveat for everyone who isn't Reuters: an allowlist is a tool, not a reflex. Block the wrong bots and you lock yourself out of AI search at the exact moment it's becoming how people find things. Start with measurement, allow the crawlers that bring readers, and refuse the ones that only take. That's the version of default-deny that works whether you have a licensing team or just a robots.txt file and a bandwidth bill.

Written by Crawlytics Team. Crawlytics tracks AI bots, generates llms.txt, and powers WebMCP commerce, all from one snippet on any stack.

Frequently Asked Questions

What is the difference between an allowlist and a blocklist for AI crawlers?

A blocklist names the specific bots you want to keep out and allows everyone else by default. An allowlist (the default-deny model) names the few bots you permit and refuses everyone else by default. The practical difference is coverage: a blocklist only stops crawlers you've heard of, while an allowlist stops every bot you haven't explicitly approved. People Inc. found that switching from a blocklist to an allowlist raised its blocked-agent count from about 2,100 to more than 30,000, because the blocklist had been missing tens of thousands of crawlers it never knew to name.

Does a default-deny robots.txt actually stop AI bots?

Not on its own. Robots.txt is a voluntary convention with no enforcement, and a Tollbit report found that about 30% of AI bot scrapes ignore the permissions in the file entirely. A default-deny robots.txt clearly states your intent and gives well-behaved crawlers a rule to follow, but the bots that ignore it stop only at the CDN or WAF layer, where you can rate-limit or hard-block by user agent. Treat robots.txt as the policy layer and your CDN as the enforcement layer.

Should a small website use a default-deny robots.txt?

Usually not as an aggressive first step. Smaller sites rarely have the licensing clout that makes a hard allowlist pay off, and blocking the wrong bots can remove you from AI answers that drive discovery. A better approach is a soft allowlist: explicitly allow the search and live-fetch bots that send you traffic, then disallow the specific take-only crawlers you've identified in your logs. Measure your real bot traffic first, then tighten from there.

Which AI crawlers should I allow if I switch to an allowlist?

Allow the bots that send readers back to your site. That generally means search and live-fetch crawlers like Googlebot, Bingbot, ChatGPT-User, OAI-SearchBot, and Claude-User, which fetch a page in response to a real user query and cite you. Reuters allowlists Amazon, Google, Bing/Microsoft, Yahoo, and OpenAI based on a "fair value exchange" test of licensing, referral traffic, uptime, and monetization. Scrutinize training-only crawlers that take content without sending visitors, and investigate any user agent you don't recognize before allowing it.

Will blocking AI crawlers hurt my search visibility?

It can, depending on which crawlers you block. Blocking a training-only bot like GPTBot has no effect on traditional search rankings. But refusing live-fetch and AI-search crawlers removes you from the answers those assistants generate. Anthropic's own documentation now warns publishers about the visibility cost of blocking its search bot. If being found is part of your business model, allow the crawlers that cite you in AI answers and reserve blocking for the ones that only take content without referring traffic.

This page is part of Crawlytics.app. View all pages: llms.txt · llms-full.txt
