IndexerNow

Should you block GPTBot, ClaudeBot, and PerplexityBot? A 2026 decision matrix

9 min read · updated 2026-05-18

Six months ago this was a hot take. Today it's table stakes: every site owner has to decide which AI crawlers to allow and which to block. The hard part is that not all AI bots do the same thing — and the right answer is almost never "block them all" or "allow them all."

The split that matters

AI crawlers fall into two buckets. Training crawlers harvest pages in bulk to build datasets for future models, and you get no traffic back. Answer-time fetchers pull a URL when a user explicitly asks about it in a chat, and they usually cite you, which sends real referral traffic.
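To make the split concrete, here is a minimal Python sketch that sorts a request's User-Agent header into one of the two buckets. The token lists use a few of the names covered in this article and are not exhaustive. The function name `classify_crawler` and the sample header strings are assumptions made up for the example, not an official registry.

```python
# Tokens named in this article, grouped by what the crawler is used for.
# Both lists are illustrative and will drift as vendors add or rename bots.
TRAINING_TOKENS = ("GPTBot", "CCBot", "Bytespider")
ANSWER_TIME_TOKENS = ("ChatGPT-User", "OAI-SearchBot", "PerplexityBot")


def classify_crawler(user_agent: str) -> str:
    """Return 'training', 'answer-time', or 'other' for a User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in TRAINING_TOKENS):
        return "training"
    if any(token.lower() in ua for token in ANSWER_TIME_TOKENS):
        return "answer-time"
    return "other"


# Sample headers are shortened, made-up strings; real ones are longer.
print(classify_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)"))         # training
print(classify_crawler("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))   # answer-time
print(classify_crawler("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))  # other
```

Running something like this over your access logs is a quick way to see which bucket is actually hitting your site before you write any rules.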

The 2026 default

Block training crawlers, allow answer-time fetchers. That keeps your work out of corpora you weren't paid for, while preserving the citation traffic AI assistants now send.
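If you manage more than one site, it can help to generate the file from two lists instead of editing it by hand. The sketch below is one possible way to do that; `build_robots_txt` is a hypothetical helper, and the token lists and sitemap URL passed to it are placeholders you would replace with your own policy.

```python
def build_robots_txt(blocked, allowed, sitemap_url=None):
    """Render one robots.txt group per user-agent token."""
    lines = []
    for token in blocked:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allowed:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Crawlers without a named group fall through to the wildcard group.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(
    blocked=["GPTBot", "CCBot"],
    allowed=["ChatGPT-User"],
    sitemap_url="https://example.com/sitemap.xml",
))
```

Keeping the policy in two lists means a change of mind is a one-line edit, and the generated file always has one group per token.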

OpenAI: GPTBot vs ChatGPT-User vs OAI-SearchBot

  • GPTBot — training crawler. Block to opt out of future GPT training.
  • ChatGPT-User — fetches your page when a ChatGPT user asks about it. Allow if you want attribution traffic.
  • OAI-SearchBot — indexes pages for ChatGPT Search (the search-product feature). Allow if you want to appear in those results.

Anthropic: ClaudeBot vs Claude-Web vs anthropic-ai

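  • ClaudeBot — Anthropic's primary crawler. It collects web content that may be used for model training. Block to opt out of training.
  • Claude-Web — an older Anthropic user-agent token. Anthropic's crawler documentation lists Claude-User (fetches a page when a Claude user asks about it) and Claude-SearchBot (search indexing) as the on-demand agents, so name those tokens in your rules if you want explicit control over citation traffic from claude.ai.
  • anthropic-ai — legacy Anthropic UA. Block this too if you're opting out of training.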

Google: Googlebot vs Google-Extended vs GoogleOther

  • Googlebot — feeds Google Search. Almost never block unless you're trying to deindex.
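  • Google-Extended — a robots.txt control token, not a separate crawler. It governs whether content Google crawls may be used for Gemini training and grounding. It does not affect Google Search, and AI Overviews are a Search feature governed by your Googlebot rules and snippet controls, so blocking Google-Extended will not remove you from AI Overviews.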
  • GoogleOther — experimental / internal research crawlers, separate from Search and Google-Extended.

The Google-Extended trap

Many sites have accidentally blocked Googlebot when they meant to block Google-Extended. Check the User-agent line of that group carefully, because a typo here can deindex you from Search entirely.
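A quick way to catch this before deploying is to parse the file and ask what each token is allowed to do. The sketch below uses Python's standard-library `urllib.robotparser`. Its matching rules only approximate how Google evaluates robots.txt, so treat it as a sanity check, not a guarantee. The helper name `google_access` and the two sample files are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def google_access(robots_txt: str) -> dict:
    """Report whether each Google token may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        token: parser.can_fetch(token, "/")
        for token in ("Googlebot", "Google-Extended")
    }


intended = "User-agent: Google-Extended\nDisallow: /\n"
mistaken = "User-agent: Googlebot\nDisallow: /\n"

print(google_access(intended))  # {'Googlebot': True, 'Google-Extended': False}
print(google_access(mistaken))  # {'Googlebot': False, 'Google-Extended': True}
```

If the first value ever comes back `False` and you did not intend to deindex, stop and fix the file before it goes live.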

Perplexity: PerplexityBot

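PerplexityBot indexes pages so Perplexity can surface and cite them in answers; Perplexity's documentation says it is not used to crawl content for foundation-model training. A separate agent, Perplexity-User, fetches a page when a user's question calls for it. Perplexity sends meaningful referral traffic with citations, so it is almost always worth allowing. If training use still worries you, the Perplexity team recommends contacting them directly rather than blocking the UA, since blocking removes you from answers too.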

Bytespider, CCBot, Diffbot

  • Bytespider — ByteDance / TikTok. Historically aggressive on crawl volume. Most sites block.
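  • CCBot — Common Crawl. Its public dataset feeds many LLM training corpora, including those behind open-weight models such as Llama. Blocking CCBot is the cleanest way to keep future crawls of your site out of those corpora; it does not remove pages that are already archived.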
  • Diffbot — research and knowledge-graph data broker. Block if you don't want your content sold downstream.

Apple: Applebot vs Applebot-Extended

  • Applebot — Spotlight, Siri search, Safari suggestions. Allow.
  • Applebot-Extended — Apple Intelligence training. Block if you're opting out of training.

A starter robots.txt snippet for 2026

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

# Allow answer-time fetchers (citation traffic)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: DuckAssistBot
Allow: /

# Everything else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Audit before you ship

Even hand-written robots.txt files have typos and conflicting rules. Run your robots.txt through our free AI bot auditor before you push — we'll show you exactly which AI crawlers will be allowed and which blocked, by user-agent.
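For a rough local check of the two failure modes named above, typos and conflicting rules, a few lines of Python go a long way. This is a minimal sketch, not a full validator: `lint_robots_txt` is a hypothetical helper, and the set of directives it accepts is an assumption (`Crawl-delay`, for example, is widely used but not part of the core standard).

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> list[str]:
    """Return human-readable warnings for likely mistakes in a robots.txt."""
    warnings = []
    seen_agents = {}   # lowercased token -> line number of its first group
    in_group = False   # True once any User-agent line has been seen
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing ':' in {line!r}")
            continue
        directive, value = (part.strip() for part in line.split(":", 1))
        key = directive.lower()
        if key not in KNOWN_DIRECTIVES:
            warnings.append(f"line {number}: unknown directive {directive!r}")
        elif key == "user-agent":
            in_group = True
            token = value.lower()
            if token in seen_agents:
                warnings.append(
                    f"line {number}: {value!r} already has a group "
                    f"at line {seen_agents[token]}"
                )
            else:
                seen_agents[token] = number
        elif key in ("allow", "disallow") and not in_group:
            warnings.append(f"line {number}: rule appears before any User-agent line")
    return warnings


sample = """\
User-agent: GPTBot
Disalow: /

User-agent: GPTBot
Allow: /
"""
for warning in lint_robots_txt(sample):
    print(warning)
# line 2: unknown directive 'Disalow'
# line 4: 'GPTBot' already has a group at line 1
```

The misspelled `Disalow` is the dangerous kind of typo: parsers skip lines they do not recognize, so the block you thought you wrote never takes effect.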

What this won't do

Robots.txt is a polite request. Major commercial AI crawlers publicly honor it. Anonymous scrapers, residential-proxy bots, and sketchy data brokers ignore it. If you have content you genuinely need to keep out of any AI corpus, robots.txt is the floor — not the ceiling. Authentication walls, rate limiting, and legal action are the rest of the stack.
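One step up from a polite request is refusing the request at the server. The sketch below wraps a WSGI application and answers 403 to any request whose User-Agent contains a listed token. The names `block_listed_crawlers`, `BLOCKED_TOKENS`, and `demo_app` are hypothetical, and the deny list is a placeholder. Because the header is trivial to spoof, this only stops crawlers that identify themselves honestly; it complements authentication walls and rate limiting and does not replace them.

```python
# Placeholder deny list, lowercased; swap in the tokens from your own policy.
BLOCKED_TOKENS = ("gptbot", "ccbot", "bytespider")


def block_listed_crawlers(app):
    """Wrap a WSGI app and return 403 for user agents on the deny list."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def status_for(user_agent: str) -> str:
    """Call the wrapped demo app once and return the status line it sent."""
    captured = []
    wrapped = block_listed_crawlers(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: captured.append(status))
    return captured[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # 403 Forbidden
print(status_for("Mozilla/5.0 Firefox/126.0"))             # 200 OK
```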
