Should you block GPTBot, ClaudeBot, and PerplexityBot? A 2026 decision matrix

9 min read · updated 2026-07-05

The short answer

Allow the answer-time AI fetchers that cite you and send referral traffic (ChatGPT-User, OAI-SearchBot, PerplexityBot, and Claude-Web), and decide whether to block the training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) based on whether you want your content used to train models. Blocking answer-time bots costs you citations; blocking training bots only opts you out of training.

Six months ago this was a hot take. Today it's table stakes: every site owner has to decide which AI crawlers to allow and which to block. The hard part is that not all AI bots do the same thing — and the right answer is almost never "block them all" or "allow them all."

The split that matters

AI crawlers fall into two buckets. Training crawlers collect pages in bulk for datasets used to train future models, and they send no traffic back. Answer-time fetchers pull a URL when a user's question in a chat calls for it, and they usually cite you, sending real referral traffic.

The 2026 default

Block training crawlers, allow answer-time fetchers. That keeps your work out of corpora you weren't paid for, while preserving the citation traffic AI assistants now send.
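One way to keep that default maintainable is to hold the two buckets as data and render robots.txt from them. The sketch below is illustrative only: the bucket lists are copied from the short answer above, `build_robots_txt` is a hypothetical helper name, and the sanity check uses Python's standard `urllib.robotparser`, whose matching rules may differ from any given crawler's.

```python
from urllib.robotparser import RobotFileParser

# Bucket membership copied from the article's short answer; adjust to taste.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
ANSWER_TIME_FETCHERS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "Claude-Web"]


def build_robots_txt(block: list[str], allow: list[str]) -> str:
    """Render one robots.txt group per user-agent, then a catch-all group."""
    groups = [f"User-agent: {ua}\nDisallow: /" for ua in block]
    groups += [f"User-agent: {ua}\nAllow: /" for ua in allow]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(TRAINING_CRAWLERS, ANSWER_TIME_FETCHERS)

# Sanity-check the rendered file with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for ua in TRAINING_CRAWLERS + ANSWER_TIME_FETCHERS:
    verdict = "allowed" if parser.can_fetch(ua, "https://example.com/") else "blocked"
    print(f"{ua}: {verdict}")
```

Keeping the lists in one place means a new crawler is a one-line change, and the generated file can be re-checked every time it is rebuilt.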

OpenAI: GPTBot vs ChatGPT-User vs OAI-SearchBot

GPTBot — training crawler. Block to opt out of future GPT training.
ChatGPT-User — fetches your page when a ChatGPT user asks about it. Allow if you want attribution traffic.
OAI-SearchBot — indexes pages for ChatGPT Search (the search-product feature). Allow if you want to appear in those results.

Anthropic: ClaudeBot vs Claude-Web vs anthropic-ai

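ClaudeBot — Anthropic's primary crawler, used to gather training data. Block to opt out of future Claude training.
Claude-Web — on-demand fetcher UA. Allow to keep citation traffic from claude.ai. Anthropic's published crawler list also names Claude-User for user-initiated fetches, so check your server logs and allow whichever UA actually appears.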
anthropic-ai — legacy Anthropic UA. Block this too if you're opting out of training.

Google: Googlebot vs Google-Extended vs GoogleOther

Googlebot — feeds Google Search. Almost never block unless you're trying to deindex.
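Google-Extended — a robots.txt control token rather than a separate crawler. It governs whether content Google has already crawled may be used for Gemini training and grounding. Per Google's crawler documentation it does not affect inclusion or ranking in Search, and AI Overviews are a Search feature governed by Googlebot and snippet controls such as nosnippet. Block this to opt out of Gemini training.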
GoogleOther — experimental / internal research crawlers, separate from Search and Google-Extended.

The Google-Extended trap

A common mistake is blocking Googlebot when you meant to block Google-Extended. Read every User-agent line in your robots.txt carefully: a typo here can stop Google from crawling your site and drop you out of Search entirely.
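A quick way to catch this before deploying is to parse the file and assert the outcome you intended. This is a minimal sketch using Python's standard `urllib.robotparser`; `check_google_rules` and the two sample files are hypothetical names made up for illustration, and Google's own parser may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def check_google_rules(robots_txt: str) -> dict[str, bool]:
    """Return whether each Google token may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    url = "https://example.com/"
    return {ua: parser.can_fetch(ua, url) for ua in ("Googlebot", "Google-Extended")}


# What the site owner meant to write.
INTENDED = """User-agent: Google-Extended
Disallow: /
"""

# The typo: the wrong token shuts out Search crawling instead.
TYPO = """User-agent: Googlebot
Disallow: /
"""

print(check_google_rules(INTENDED))
print(check_google_rules(TYPO))
```

In the intended file Googlebot stays allowed and Google-Extended is blocked; in the typo file the result is reversed, which is exactly the failure to guard against.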

Perplexity: PerplexityBot

PerplexityBot is one user-agent that does both training and answer-time fetching. Perplexity sends meaningful referral traffic with citations — almost always worth allowing. If you must block training, the Perplexity team recommends contacting them directly rather than blocking the UA, since blocking removes you from answers too.

Bytespider, CCBot, Diffbot

Bytespider — ByteDance / TikTok. Historically aggressive on crawl volume. Most sites block.
CCBot — Common Crawl. Its public dataset is a standard ingredient in LLM training corpora, including many open-weight models. Blocking is the cleanest way to stay out of future Common Crawl snapshots; it does not remove pages from snapshots already published.
Diffbot — research and knowledge-graph data broker. Block if you don't want your content sold downstream.

Apple: Applebot vs Applebot-Extended

Applebot — Spotlight, Siri search, Safari suggestions. Allow.
Applebot-Extended — Apple Intelligence training. Block if you're opting out of training.

A starter robots.txt snippet for 2026

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow answer-time fetchers (citation traffic)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: DuckAssistBot
Allow: /

# Everything else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Audit before you ship

Even hand-written robots.txt files have typos and conflicting rules. Run your domain through our free AI Crawler Access Checker before you push — we'll show you exactly which AI crawlers will be allowed and which blocked, by user-agent.
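For a rough local check that is independent of the checker, something like the following can report per-agent access and flag user-agents that appear in more than one group, a common source of conflicting rules. It is only a sketch: `audit` and `SAMPLE` are hypothetical names, and Python's `urllib.robotparser` uses its own matching rules, so treat the result as a hint rather than a guarantee of how each crawler will behave.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser


def audit(robots_txt: str, agents: list[str], url: str = "https://example.com/"):
    """Return (access per agent, user-agent names declared more than once)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    access = {ua: parser.can_fetch(ua, url) for ua in agents}

    # Collect every declared user-agent, case-insensitively, ignoring comments.
    declared = [
        line.split(":", 1)[1].split("#", 1)[0].strip().lower()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("user-agent:")
    ]
    duplicates = sorted(name for name, count in Counter(declared).items() if count > 1)
    return access, duplicates


# A deliberately inconsistent file: GPTBot is declared twice with opposite rules.
SAMPLE = """User-agent: GPTBot
Disallow: /

User-agent: gptbot
Allow: /

User-agent: *
Allow: /
"""

access, duplicates = audit(SAMPLE, ["GPTBot", "ChatGPT-User"])
print(access)
print(duplicates)
```

When a name shows up in `duplicates`, merge its groups by hand; different parsers resolve repeated groups differently, so the safest file has each user-agent exactly once.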

What this won't do

Robots.txt is a polite request. Major commercial AI crawlers publicly honor it. Anonymous scrapers, residential-proxy bots, and sketchy data brokers ignore it. If you have content you genuinely need to keep out of any AI corpus, robots.txt is the floor — not the ceiling. Authentication walls, rate limiting, and legal action are the rest of the stack.
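As an illustration of the rate-limiting layer, here is a minimal in-memory sliding-window limiter. Everything in it is an assumption made for the sketch: the class name, the limits, and keying by client IP are hypothetical, and a production setup would more likely use a CDN, WAF, or reverse-proxy feature than application code.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller would answer with HTTP 429
        hits.append(now)
        return True


# Example: 2 requests per 10 seconds, driven by a fake clock.
fake_now = [0.0]
limiter = SlidingWindowLimiter(2, 10.0, clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(3)]
print(results)
```

Unlike robots.txt, a limiter like this applies to every client regardless of the user-agent it claims, which is why it helps against scrapers that ignore the polite request. Scrapers that rotate residential proxies will still spread their load across many IPs.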
