Tutorial

robots.txt for AI bots: block the trainers, allow the citers

The robots.txt for AI bots setup that blocks training crawlers like GPTBot and ClaudeBot while allowing the search crawlers that get you cited in AI answers.

By Mitrasish, Co-founder · Jun 30, 2026 · 12 min read
There are two kinds of AI crawler, and the difference decides whether you disappear from AI answers or just opt out of model training. Training crawlers scrape your pages to build the next model. Search crawlers index your pages so an assistant can cite you in an answer right now. Block the first set, allow the second. Most "block AI" guides treat every bot as one enemy, ship a blanket disallow, and quietly delist the site from ChatGPT, Claude, and Perplexity results in the same move.

This is the robots.txt for AI bots setup that keeps those two goals separate: the 2026 user-agent list, the exact file to commit to your repo, and why your CDN may be overriding the file before a crawler ever reads it. robots.txt is access control, not a content map. If you also want to hand models a curated reading list, that is a different file, and our llms.txt guide covers it. The point of all of this is to stay citable, which is the whole game behind answer engine optimization.

Training crawlers vs search crawlers: the split that decides if you get cited

Block the bots that scrape for model training. Allow the bots that index for citations. That one sentence is the entire strategy, and getting it wrong is how careful people accidentally remove themselves from AI search.

Both kinds of crawler come from the same vendors, hit the same URLs, and identify with similar-looking user-agent strings. But they do opposite jobs. A training crawler feeds a future model and gives you nothing back today. A search crawler is the reason an assistant can find, read, and quote your page when someone asks a question. If your robots.txt cannot tell them apart, neither can you.

Why blocking GPTBot does not remove you from ChatGPT

GPTBot is the training crawler. ChatGPT's citations come from a different system. OpenAI documents several crawlers, and the three that matter for a content publisher are GPTBot, which crawls content that may be used to train its models; OAI-SearchBot, which surfaces sites in ChatGPT's search results; and ChatGPT-User, which fetches a page only when a person asks ChatGPT to go look at it. Each is controllable on its own in robots.txt.

So blocking GPTBot does exactly one thing: it signals that your content should not train future models. It does not touch your eligibility to be cited in ChatGPT, because that runs through OAI-SearchBot. The mistake is assuming "block ChatGPT" means one rule. Block GPTBot and you have opted out of training. Block OAI-SearchBot and you have opted out of being the answer.

The trap: ClaudeBot trains, Claude-SearchBot cites

Anthropic splits its crawlers the same way, and the naming is the trap. Anthropic runs three agents: ClaudeBot collects web content that may contribute to training, Claude-User fetches pages when a Claude user asks a question, and Claude-SearchBot navigates the web to improve search result quality. They are independent user agents, each controllable in robots.txt.

Here is the part even careful guides get backwards. ClaudeBot is the training crawler, so allowing it opts you into training, not into Claude's answers. The agent that earns you citations is Claude-SearchBot. Anthropic warns directly that blocking Claude-SearchBot "prevents our system from indexing your content for search optimization, which may reduce your site's visibility and accuracy in user search results," as Search Engine Journal documented. If your instinct is "block ClaudeBot, that is the Anthropic one," you have blocked training (fine) and done nothing for citations. If you go further and block Claude-SearchBot too, you have switched off the part you wanted.

The 2026 user-agent table: what to block, what to allow

Sort every AI crawler into three buckets: training (block to opt out), search (allow to stay citable), and user-triggered (leave alone). Here is the current list across the four vendors that matter (OpenAI, Anthropic, Google, and Perplexity), plus Common Crawl.

| Crawler | Operator | What it does | Your move |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Trains models | Block to opt out of training |
| ClaudeBot | Anthropic | Trains models | Block to opt out of training |
| CCBot | Common Crawl | Builds an archive used to train most LLMs | Block to opt out of training |
| Google-Extended | Google | Trains Gemini | Block to opt out (no Search impact) |
| OAI-SearchBot | OpenAI | ChatGPT search citations | Allow to stay citable |
| Claude-SearchBot | Anthropic | Claude search indexing | Allow to stay citable |
| PerplexityBot | Perplexity | Perplexity search citations | Allow to stay citable |
| ChatGPT-User | OpenAI | Fetches a page a user asked for | Leave alone |
| Claude-User | Anthropic | Fetches a page a user asked for | Leave alone |
| Perplexity-User | Perplexity | Fetches a page a user asked for | Leave alone |

A note on Perplexity, because it is the cleanest of the bunch. Perplexity documents PerplexityBot as the agent that surfaces and links sites in its search results, says plainly it is "not used to crawl content for AI foundation models," and recommends you allow it. Perplexity-User is the user-initiated fetch and generally ignores robots.txt because a person triggered it. CCBot, by contrast, belongs in the block column: Common Crawl's freely released archive has been used to train most major language models, so disallowing CCBot (it identifies as CCBot/2.0 and honors robots.txt) is part of opting out of the upstream training pipeline.
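
If you would rather keep the table as data and render the file from it, here is a minimal Python sketch. The `CRAWLERS` dict and the `build_robots_txt` helper are illustrative names, not part of any vendor's tooling, and the bucket contents are copied straight from the table above. Googlebot and Bingbot are not listed by name here because the wildcard group already allows them.

```python
# Buckets copied from the table in this post. The dict layout and the helper
# name are assumptions made for illustration only.
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user_triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_robots_txt(sitemap_url: str) -> str:
    """Render one robots.txt group per crawler, based on its bucket."""
    groups = []
    # Training crawlers: disallow the whole site.
    for agent in CRAWLERS["training"]:
        groups.append(f"User-agent: {agent}\nDisallow: /")
    # Search crawlers: an empty Disallow means nothing is disallowed.
    for agent in CRAWLERS["search"]:
        groups.append(f"User-agent: {agent}\nDisallow:")
    # User-triggered fetchers get no group of their own; they fall through
    # to the wildcard group, which allows everything.
    groups.append("User-agent: *\nDisallow:")
    groups.append(f"Sitemap: {sitemap_url}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("https://yourdomain.com/sitemap.xml"))
```

Keeping the buckets in one place means a new crawler is a one-line change in a pull request, and the rendered file cannot drift from the list you reviewed.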

Delete the dead strings: Claude-Web and anthropic-ai

If your robots.txt blocks Claude-Web or anthropic-ai, it is blocking nothing. Both are deprecated. Anthropic's current training crawler is ClaudeBot, and the older tokens no longer match an active agent, as the Search Engine Journal breakdown notes. A file that still disallows only the old strings reads like an opt-out and behaves like an opt-in. Replace them with ClaudeBot. This is the most common stale rule in the wild, copied from 2023-era templates that never got updated.
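
A stale token is easy to miss in review, so a small check helps. The sketch below is one way to flag it, assuming the two deprecated tokens named above and ClaudeBot as the replacement; `STALE_TOKENS` and `find_stale_agents` are hypothetical names for illustration.

```python
import re

# Tokens this post describes as deprecated, mapped to the current replacement.
STALE_TOKENS = {"claude-web": "ClaudeBot", "anthropic-ai": "ClaudeBot"}


def find_stale_agents(robots_txt: str) -> list:
    """Return (line_number, stale_token, suggested_replacement) for each hit."""
    hits = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Match "User-agent: <token>" case-insensitively, ignoring comments.
        match = re.match(r"\s*user-agent\s*:\s*([^#\s]+)", line, re.IGNORECASE)
        if match and match.group(1).lower() in STALE_TOKENS:
            token = match.group(1)
            hits.append((number, token, STALE_TOKENS[token.lower()]))
    return hits


if __name__ == "__main__":
    sample = "User-agent: Claude-Web\nDisallow: /\n"
    for number, token, replacement in find_stale_agents(sample):
        print(f"line {number}: {token} is deprecated, use {replacement}")
```

Run it against the file in your repo as a pre-merge check, so a copied 2023-era template fails review instead of shipping.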

Google-Extended and AI Overviews: what you can and cannot control

Google-Extended controls one thing: whether your content trains Gemini. Google's crawler docs state that Google-Extended "does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search." So block it to stay out of Gemini training, and your organic rankings are untouched.

What you cannot do is robots.txt your way out of AI Overviews. They ride Googlebot. Google says AI is "built into Search and integral to how Search functions," with no separate crawler and no dedicated opt-out token. The only levers are the standard Search controls (nosnippet, data-nosnippet, max-snippet, noindex), and blocking Googlebot to escape Overviews would also delete you from regular search results. We walk through the trade in how to show up in Google AI Overviews. For everything else, allowing the search crawlers is upstream of getting cited at all, which is what ranking in ChatGPT is actually about.

User-triggered fetchers: leave them alone

ChatGPT-User, Claude-User, and Perplexity-User fetch a page because a human asked the assistant to look at it. Blocking them backfires: you are refusing a visit your own potential reader requested. And robots.txt often does not apply to them anyway. OpenAI notes that for ChatGPT-User, robots.txt may not apply because the action is user-triggered rather than automated crawling, and Perplexity says the same about Perplexity-User. There is no upside to disallowing them and a real downside, so leave them out of your block list.

The copy-paste robots.txt for a repo blog

Here is the file. Disallow the training set, leave the search set allowed, keep the classic engines, and commit it to your repo as public/robots.txt so it ships in a pull request and diffs over time.

code
# robots.txt - committed at public/robots.txt
# Goal: opt out of AI training, stay eligible for AI search citations.

# --- Block: training crawlers ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# --- Allow: AI search / citation crawlers ---
User-agent: OAI-SearchBot
Disallow:

User-agent: Claude-SearchBot
Disallow:

User-agent: PerplexityBot
Disallow:

# --- Keep classic search ---
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# --- Everyone else (allow all) ---
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml

Two things about the syntax. An empty Disallow: means "nothing is disallowed," which is how you allow a crawler; it reads more honestly than a redundant Allow: /. And a crawler obeys the most specific user-agent group that matches it, so GPTBot follows its own Disallow: / and ignores the User-agent: * block entirely. That is exactly what you want: each named bot follows its own group, so the training bots stay blocked even though the wildcard allows everyone else, and the search bots would stay allowed even if you later tightened the wildcard. Swap in your real sitemap URL and you are done.
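
Both syntax rules can be checked before you commit. This sketch uses Python's standard-library `urllib.robotparser` on a shortened copy of the file. Treat it as a stand-in: each vendor's crawler runs its own parser, so a passing check here shows the file reads the way you intend, not how any specific bot will behave. The `allowed` helper and the sample URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the file above: two training groups, two search groups,
# and the wildcard.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Disallow:

User-agent: Claude-SearchBot
Disallow:

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, agent: str,
            url: str = "https://yourdomain.com/blog/post") -> bool:
    """Parse the text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot"):
        print(agent, "allowed" if allowed(ROBOTS_TXT, agent) else "blocked")
```

The named training groups come back blocked and the named search groups come back allowed, which confirms that the empty Disallow: lines and the per-agent groups do what the comments in the file claim.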

Committing it to the repo is the point, not an afterthought. A robots.txt edited live in a hosting dashboard has no history, no reviewer, and no diff when it silently changes. In version control it goes through the same pull request as everything else, which is the whole argument for keeping a Git-based blog: the file that controls who can read your content should be as reviewable as the content itself.

Commit it, then check your CDN is not overriding it

A committed robots.txt is necessary and not sufficient, because your CDN can serve a different file than the one in your repo. This is the silent failure mode, and Cloudflare is where most teams hit it.

On July 1, 2025, Cloudflare made AI crawler blocking the default for every new domain, describing itself as the first internet infrastructure provider to block AI crawlers by default and eliminating the need for site owners to manually opt out. If your domain was onboarded after that date, the search crawlers you carefully allowed in your repo file may be denied at the edge because Cloudflare's default block applies before your rules are even read. To check or adjust the setting, go to your Cloudflare dashboard, navigate to Security > Bots, and look for AI Crawl Control, where you can configure per-crawler allow or deny rules independently.

It gets subtler. Cloudflare's managed robots.txt does not overwrite your file: it "will prepend our managed robots.txt before your existing robots.txt, combining both into a single response." So your committed file is still there, with Cloudflare's AI-blocking directives stacked in front of it. And because robots.txt is only a preference, real enforcement happens lower down: AI Crawl Control blocks at the network layer regardless of what the file says, and the Pay Per Crawl feature can gate crawlers behind an HTTP 402 Payment Required response at the edge, independent of robots.txt. The result is a robots.txt that says "allow OAI-SearchBot" while the edge quietly returns a block. That is your AI citations leaking with no error in your logs.

How to verify the live file matches the repo

Trust the deployed file, not the repo copy. Curl the live URL and read what actually ships:

bash
curl -sS https://yourdomain.com/robots.txt
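
Reading the curl output by eye works once. To make the comparison repeatable, here is a small Python sketch that takes the live response and the repo copy as strings and reports anything served ahead of your file, which is the shape a prepended managed block would take. `split_prepended` is a hypothetical helper, and the managed block in the example is made up for illustration, not Cloudflare's actual output.

```python
def split_prepended(live: str, repo: str):
    """Compare the served robots.txt with the committed one.

    Returns (prefix, intact): the text served ahead of the repo copy, and
    whether the repo copy appears unchanged at the end of the live response.
    """
    # Normalize line endings and outer whitespace before comparing.
    live_norm = live.replace("\r\n", "\n").strip()
    repo_norm = repo.replace("\r\n", "\n").strip()
    if live_norm == repo_norm:
        return "", True
    if live_norm.endswith(repo_norm):
        prefix = live_norm[: len(live_norm) - len(repo_norm)].strip()
        return prefix, True
    # The repo copy is not what ships: return the whole live body for review.
    return live_norm, False


if __name__ == "__main__":
    repo = "User-agent: OAI-SearchBot\nDisallow:\n"
    live = "# managed block (made up)\nUser-agent: GPTBot\nDisallow: /\n\n" + repo
    prefix, intact = split_prepended(live, repo)
    print("repo copy intact:", intact)
    print("served ahead of it:\n" + prefix)
```

Feed it the saved curl output and the contents of public/robots.txt. A non-empty prefix is not automatically a problem, but it is exactly the part nobody reviewed in a pull request.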

Confirm the search bots are not getting a 403 or a 402 at the edge by requesting a page as each one and checking the status code:

bash
curl -sI -A "OAI-SearchBot" https://yourdomain.com/ | head -n 1
curl -sI -A "Claude-SearchBot" https://yourdomain.com/ | head -n 1
curl -sI -A "PerplexityBot" https://yourdomain.com/ | head -n 1

A 200 is what you want. A 403 or 402 means something at the edge is overriding your intent. For verifying that real crawler traffic is genuine rather than spoofed, every vendor publishes IP range files in their bot documentation, so you can validate hits against the published ranges and reverse DNS. Allow about 24 hours for robots.txt changes to take effect, which is the window OpenAI gives for its systems to pick up an update. Once the bots can reach you, the next question is whether the citations actually show up, and most of that traffic hides in Direct, which is the subject of AI citation tracking.
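
Validating a hit against published ranges comes down to a CIDR membership test. The sketch below shows the idea with Python's `ipaddress` module. The ranges in `PUBLISHED_RANGES` are placeholders taken from the documentation-only address blocks, not real crawler IPs; in practice you would load the current ranges from each vendor's published file, whose format is not assumed here.

```python
import ipaddress

# PLACEHOLDER ranges from documentation-only address space. These are NOT
# real crawler IPs; replace them with the ranges each vendor publishes.
PUBLISHED_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_from_published_range(agent: str, remote_ip: str) -> bool:
    """True if `remote_ip` falls inside a range listed for `agent`."""
    ip = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES.get(agent, []):
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if ip.version == network.version and ip in network:
            return True
    return False


if __name__ == "__main__":
    # A request claiming to be OAI-SearchBot from outside the listed ranges.
    print(is_from_published_range("OAI-SearchBot", "203.0.113.5"))
```

A user-agent string is just a header anyone can send, so a request that claims to be a search bot but arrives from outside the listed ranges is a candidate for a network-layer block rather than a robots.txt rule.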

robots.txt is a preference, not a wall

robots.txt expresses what you want; it does not enforce it. Compliance is voluntary, and not every crawler complies. Cloudflare documented in August 2025 that Perplexity used undeclared stealth crawlers (a generic Chrome user agent on rotating IPs and ASNs outside its published ranges) that generated 3 to 6 million requests a day and reached content on domains that explicitly disallowed it. In Cloudflare's same test, ChatGPT-User fetched the robots file and stopped when it was disallowed.

That sounds like an argument against bothering. It is not, because of which bots you are allowing. The search crawlers in your allow list, OAI-SearchBot and Claude-SearchBot among them, are the well-behaved ones that read robots.txt and honor it. Your allow-list does its job for the outcome you actually want: staying citable. If you also want to hard-stop the crawlers that ignore the file, that is a network-layer job (a WAF rule or a CDN block), not a robots.txt line. Keep the two tools in their lanes.

Where this sits next to llms.txt and your AEO stack

robots.txt is the access-control layer; it decides who may read you. llms.txt is the answer-feed layer; it hands models a curated map of your best pages. Different files, different intent, and they do not substitute for each other, which is why the llms.txt guide treats them as separate steps. robots.txt without llms.txt still lets the search crawlers in. llms.txt without a correct robots.txt is a reading list for crawlers you accidentally blocked.

The throughline is that all of it should be version-controlled and reviewed, not edited live in a dashboard where a stray toggle costs you citations with no diff to catch it. A robots.txt in your repo ships in a pull request, gets a second set of eyes, and leaves a history when it changes. That is the same discipline behind running your whole blog as code: the writing, the facts, the links, and the crawler rules all reviewed before they go live. It is the gap we built Lyra to close, by writing posts that AI answer engines cite and opening each as a PR you merge, so request early access if your blog lives in a repo and you want it run this way.

Allowing the right crawlers is upstream of every AI citation, and Lyra writes the posts those crawlers cite, then ships each one as a pull request you review and merge.


Step by step

The short version

  1. Split the crawlers into training vs search

     Sort every AI user agent into two buckets: training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) and search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot). The split decides whether you opt out of training or out of citations.

  2. Write the file: disallow training, allow search

     Disallow the training set, give the search set an empty Disallow: (or an explicit Allow: /), and keep Googlebot and Bingbot. Leave the user-triggered fetchers alone.

  3. Commit it to the repo as public/robots.txt

     Put the file in version control so it ships in a reviewable pull request and diffs over time, instead of being edited live in a dashboard nobody audits.

  4. Verify the live file and check your CDN

     Curl the deployed /robots.txt, confirm the search bots are not getting a 403 or 402 at the edge, and allow about 24 hours for changes to take effect.

FAQ

Frequently asked

Does blocking GPTBot remove my site from ChatGPT?

No. GPTBot is OpenAI's training crawler, so blocking it only signals that your content should not be used to train models. ChatGPT's search citations come from a separate crawler, OAI-SearchBot, which is not used for training. If you want to stay citable in ChatGPT, block GPTBot but allow OAI-SearchBot.

What is the difference between ClaudeBot and Claude-SearchBot?

ClaudeBot collects web content that may train Anthropic's models. Claude-SearchBot indexes pages to improve Claude's search answers. They are separate user agents you control independently in robots.txt. Allowing ClaudeBot opts you into training, not citations, so the agent you actually want to allow for AI answers is Claude-SearchBot.

Should I block AI crawlers in robots.txt?

Block the training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) if you want to opt out of model training, and allow the search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) so you stay eligible to be cited in AI answers. A blanket block of every AI user agent quietly deletes you from AI search results.

Do Claude-Web and anthropic-ai still work in robots.txt?

No. Both are deprecated. Anthropic's current training crawler is ClaudeBot, so a robots.txt that disallows only Claude-Web and anthropic-ai is blocking nothing. Update those rules to ClaudeBot, or you remain opted in to training while believing you opted out.

Why is my robots.txt not being respected?

Two reasons. First, robots.txt is voluntary, so a crawler can ignore it. Second, your CDN may be overriding it: since July 2025, Cloudflare blocks AI crawlers by default for every new domain, so your committed allow rules may be moot unless you have explicitly disabled that default in the Cloudflare dashboard under Security > Bots > AI Crawl Control, where you can set per-crawler rules. Cloudflare's managed robots.txt also prepends its own rules ahead of yours, and AI Crawl Control can enforce blocks at the network layer regardless of what your file says. Curl the live file and check the edge before trusting the repo copy.

