AI bot crawlers — the complete reference

The single most common cause of low AI search visibility has a one-line fix: a Disallow: / rule under User-agent: Google-Extended, inherited from a starter template. Crawler access caps the ceiling of every downstream optimization. Here's what each AI crawler does, the copy-pasteable robots.txt, how to verify it works, and the gotchas that produce zero traffic despite a clean allow list.

Updated May 2026

TL;DR

  1. AI crawlers split into two classes by purpose: training-data (GPTBot, ClaudeBot, Google-Extended; 6-12 month horizon) and live-search (Bingbot, PerplexityBot, Googlebot; days-to-weeks horizon). Allow both classes.
  2. Six crawlers are mandatory for full coverage: Googlebot, Google-Extended, Bingbot, GPTBot, ClaudeBot, PerplexityBot. Eight more are recommended at zero cost (OAI-SearchBot, Anthropic-AI, Applebot-Extended, CCBot, Cohere-AI, Mistral, Bytespider, Diffbot).
  3. PerplexityBot is structurally unique: it is both a training input and a live-search input. It is the highest-impact single crawler to allow for B2B brands.
  4. robots.txt is a polite request; CDN/WAF rules are what actually enforce access. Cloudflare Bot Fight Mode is the #1 silent killer of AI crawler traffic, so check yours.

Why crawler access is the prerequisite for everything

AI search visibility breaks down into three layers — indexed, cited, routable. Crawler access is the input to layer one. If GPTBot hasn't crawled your site, your content cannot enter ChatGPT's training corpus. If PerplexityBot is blocked, Perplexity cannot cite you. If Google-Extended is disallowed, Google AI Overview drops you from its citation candidate set.

Crawler access caps the ceiling of every downstream optimization. You can have the best structured data in your category, the freshest content, and the strongest entity graph — and still be invisible on AI search if your robots.txt blocks the crawlers that feed those platforms.

Fix this first. Then build upward.

The two-axis taxonomy

Every AI crawler can be classified along two axes: who runs it (operator) and what it's for (purpose). The purpose axis matters most for optimization strategy.

Training-data crawlers

Gather text for future model training. Effect on visibility shows up at the next training cycle (6-12 month typical lag).

Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot

Live-search crawlers

Build search indexes that AI products query in real time. Effect on visibility shows up within days to weeks of crawl.

Examples: Bingbot, PerplexityBot, Googlebot, OAI-SearchBot

Some crawlers serve both purposes. The most notable: PerplexityBot, which builds Perplexity's live search index AND informs future model training. A single crawl event has both immediate and long-term effects.

The crawlers, what each one powers

Googlebot

Google · Live search

Priority: Mandatory
Powers: Google organic search · AI Overview · Gemini (indirect)
Refresh cadence: Hours to days

Google's primary search crawler. Powers organic rank. Indirectly powers AIO and Gemini sourcing because both products read from Google's main index.

Google-Extended

Google · AI surface eligibility

Priority: Mandatory
Powers: AI Overview citation eligibility · Gemini training data
Refresh cadence: Days to weeks

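Google's AI-specific robots.txt token. Distinct from Googlebot: blocking it preserves organic Google rank while disqualifying you from AIO citation entirely. The single most common AIO failure cause in audits. Note that Google-Extended is a control token only: it has no user-agent string of its own, the fetching is done by Google's existing crawlers, and it therefore never appears in access logs.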

Bingbot

Microsoft · Live search

Priority: Mandatory
Powers: Bing search · ChatGPT web search · Microsoft Copilot
Refresh cadence: Days to weeks

Microsoft's search crawler. Critical for ChatGPT visibility — ChatGPT web search grounds against Bing's index, not Google's. A brand well-indexed on Google but thin on Bing is invisible to ChatGPT regardless of Google rank.

GPTBot

OpenAI · Training (long-term)

Priority: Recommended
Powers: Future ChatGPT trained-knowledge
Refresh cadence: Periodic

OpenAI's training crawler. Affects ChatGPT's knowledge of your brand at the next training cycle (6-12 month lag). Does NOT power ChatGPT web search — that's Bingbot. Most common misconception in this space.

OAI-SearchBot

OpenAI · Live search (rolling out)

Priority: Recommended (forward-compat)
Powers: OpenAI's own emerging live-search index
Refresh cadence: Periodic

OpenAI's newer live-search crawler. Suggests OpenAI is building toward decoupling ChatGPT web search from Bing. Appearing in logs at increasing rates; allow now for forward-compatibility at zero cost.

ClaudeBot

Anthropic · Training (long-term)

Priority: Recommended
Powers: Future Claude trained-knowledge
Refresh cadence: Periodic

Anthropic's training crawler — the equivalent of GPTBot for Claude. Claude's user base is meaningful in technical, AI-adjacent, and analytical audiences.

Anthropic-AI / Claude-User / Claude-SearchBot

Anthropic · Product features + agentic + live search

Priority: Recommended
Powers: Claude web fetches · Claude agentic browsing · Claude live search
Refresh cadence: On-demand + periodic

Anthropic ships multiple bots for different functions. Anthropic-AI handles user-initiated URL fetches inside Claude; Claude-User is agentic browsing; Claude-SearchBot is the emerging live-search index. For full Claude visibility, allow all three plus ClaudeBot.

PerplexityBot

Perplexity · Live search + training (both)

Priority: High (B2B)
Powers: Perplexity index (powers live answers AND informs model training)
Refresh cadence: Weekly to monthly

Structurally different — BOTH training-input and live-search-input. A single crawl event has immediate AND long-term visibility effects. The most strategically important AI crawler for B2B brands targeting technical/professional audiences.

Applebot-Extended

Apple · Training

Priority: Recommended
Powers: Apple Intelligence · Siri · Spotlight
Refresh cadence: Periodic

Apple's AI training crawler, mirroring Google's Googlebot/Google-Extended split. Growing share especially on iOS as Apple Intelligence rolls out.

CCBot

Common Crawl · Open-data training

Priority: Recommended
Powers: Many downstream AI training pipelines
Refresh cadence: Periodic

Common Crawl's bot. Its open dataset is used by many AI training pipelines downstream. Allowing CCBot expands your reach into AI products built on Common Crawl data — broader effect than any single vendor.

Bytespider

ByteDance · Live search + training

Priority: If China audience
Powers: Doubao + other ByteDance AI products
Refresh cadence: Periodic

ByteDance's crawler. Critical for brands with China audience reach; can be skipped otherwise.

Cohere-AI

Cohere · Training

Priority: If enterprise audience
Powers: Cohere Command + Embed models · enterprise AI deployments
Refresh cadence: Periodic

Cohere's training crawler. Smaller consumer user base but disproportionate weight in enterprise AI deployments — Cohere models are embedded in many enterprise RAG pipelines, customer support copilots, and B2B AI products. Allow if you sell to enterprise.

Mistral

Mistral AI · Training

Priority: If European audience
Powers: Mistral Large + Le Chat · open-weight model ecosystem
Refresh cadence: Periodic

Mistral's training crawler. Smaller global market share but growing share in European enterprise AI and any product built on Mistral's open-weight models. Allow if you have meaningful European business or sell into the open-weight ecosystem.

Diffbot

Diffbot · Structured-data extraction + training

Priority: Optional
Powers: Diffbot Knowledge Graph · downstream AI training datasets
Refresh cadence: Periodic

Used by some AI training datasets and structured-data extraction services. Selective allow depending on your stance on third-party knowledge-graph aggregation. Most brands include it as default-permissive.

Copy-pasteable robots.txt allow list

Drop this at site root. Replace the sitemap URL with yours. Named-bot allow rules at the top; catch-all default at the bottom.

# ── AI search crawlers ─────────────────────────────────────────────
# Google
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

# Microsoft (powers ChatGPT web search)
User-agent: Bingbot
Allow: /

# OpenAI (ChatGPT training + emerging live search)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic (Claude training + product features + live search)
User-agent: ClaudeBot
Allow: /

User-agent: Anthropic-AI
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Perplexity (live search + training, dual-purpose)
User-agent: PerplexityBot
Allow: /

# Apple (Apple Intelligence)
User-agent: Applebot-Extended
Allow: /

# Common Crawl (used by many downstream training pipelines)
User-agent: CCBot
Allow: /

# Cohere (enterprise AI deployments)
User-agent: Cohere-AI
Allow: /

# Mistral (European enterprise + open-weight ecosystem)
User-agent: Mistral
Allow: /

# Diffbot (structured-data extraction + knowledge graph)
User-agent: Diffbot
Allow: /

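# ── Default policy for everything else ─────────────────────────────
# Note: a crawler that matches a named group above follows only that group,
# so the Disallow lines below do not apply to it. Repeat them inside the
# named groups if those paths must stay uncrawled by the named bots too.
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /_next/
Disallow: /private/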

# ── Sitemap ────────────────────────────────────────────────────────
Sitemap: https://yoursite.com/sitemap.xml

Don't Disallow: / in the catch-all unless you mean it.

That single line blocks every unnamed bot, including future AI crawlers that haven't shipped yet. Use the named-allow plus permissive-default pattern instead.
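To see which rules a given crawler will actually follow, it can help to extract the group that applies to it. The following is a minimal sketch in shell and awk, assuming exact, case-insensitive matching of the user-agent token and the group semantics of RFC 9309, where a named group replaces the * group rather than adding to it. The function names rules_for_bot and is_fully_blocked are illustrative and not part of any standard tool.

```shell
#!/usr/bin/env bash
# rules_for_bot BOT < robots.txt
# Prints the allow/disallow lines of the group that applies to BOT:
# the group naming BOT if one exists, otherwise the "*" group.
rules_for_bot() {
  awk -v bot="$1" '
    BEGIN { bot = tolower(bot) }
    { sub(/#.*/, ""); gsub(/^[ \t]+|[ \t\r]+$/, "") }   # strip comments and padding
    $0 == "" { next }
    {
      sep = index($0, ":")
      if (sep == 0) next
      key = tolower(substr($0, 1, sep - 1))
      val = substr($0, sep + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
    }
    key == "user-agent" {
      if (!in_agents) { named = 0; star = 0 }   # first User-agent line of a new group
      in_agents = 1
      if (tolower(val) == bot) { named = 1; found = 1 }
      if (val == "*") star = 1
      next
    }
    key == "allow" || key == "disallow" {
      in_agents = 0
      if (named) specific = specific key ": " val "\n"
      if (star)  fallback = fallback key ": " val "\n"
    }
    END { printf "%s", (found ? specific : fallback) }
  '
}

# is_fully_blocked BOT < robots.txt
# Succeeds when the applicable group has "Disallow: /" and no Allow line.
is_fully_blocked() {
  local rules
  rules=$(rules_for_bot "$1")
  printf '%s\n' "$rules" | grep -qx 'disallow: /' &&
    ! printf '%s\n' "$rules" | grep -q '^allow:'
}

# Usage (network call, run manually):
#   curl -s https://yoursite.com/robots.txt | rules_for_bot GPTBot
```

Run against the allow list above, a named bot such as GPTBot returns only "allow: /", which makes it visible that the catch-all Disallow lines are not inherited by named groups.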

How to verify crawlers are actually hitting your site

A robots.txt allow rule does not guarantee a crawler will visit. You verify by checking server logs.

nginx / Apache access logs

# Linux: AI bot hits across the current and rotated logs (adjust path to your
# access log; the time window covered depends on your log rotation settings).
# grep -i because several bots send lowercase tokens (e.g. bingbot/2.0).
# Google-Extended is omitted: it is a robots.txt token, not a user-agent string.
zcat -f /var/log/nginx/access.log* | \
  grep -iE "GPTBot|ClaudeBot|PerplexityBot|Bingbot|Googlebot|OAI-SearchBot|Cohere-AI|Mistral" | \
  awk '{print $1, $4, $7, $9}' | \
  sort | uniq -c | sort -rn | head -50

# macOS: same query, gzcat instead of zcat
gzcat -f /var/log/nginx/access.log* | \
  grep -iE "GPTBot|ClaudeBot|PerplexityBot|Bingbot|Googlebot|OAI-SearchBot|Cohere-AI|Mistral" | \
  awk '{print $1, $4, $7, $9}' | \
  sort | uniq -c | sort -rn | head -50

# Per-bot summary count (Linux; swap zcat for gzcat on macOS)
for bot in GPTBot ClaudeBot PerplexityBot Bingbot Googlebot OAI-SearchBot Cohere-AI Mistral; do
  count=$(zcat -f /var/log/nginx/access.log* | grep -ci "$bot")
  echo "$bot: $count"
done

If you see zero hits for an allowed bot in 7 days, something between robots.txt and your origin is dropping requests — usually a CDN or WAF rule. Check next.
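A hit count alone does not show whether the crawler received content or an error. The following is a rough sketch that tallies HTTP status codes per bot in a single pass. It assumes the default nginx/Apache "combined" log format, where the user agent is the sixth quote-delimited field, and the function name bot_status_counts is hypothetical.

```shell
#!/usr/bin/env bash
# bot_status_counts < access.log
# Prints "BOT STATUS COUNT" rows, e.g. how many 200s and 403s each bot received.
bot_status_counts() {
  awk -F'"' '
    BEGIN { n = split("GPTBot ClaudeBot PerplexityBot Bingbot Googlebot OAI-SearchBot", bots, " ") }
    {
      ua = tolower($6)                 # user agent (combined log format assumed)
      split($3, st, " ")               # $3 looks like " 200 5120 "
      for (i = 1; i <= n; i++)
        if (index(ua, tolower(bots[i]))) count[bots[i] " " st[1]]++
    }
    END { for (k in count) print k, count[k] }
  ' | LC_ALL=C sort
}

# Usage (run manually; swap zcat for gzcat on macOS):
#   zcat -f /var/log/nginx/access.log* | bot_status_counts
```

A bot whose rows are mostly 403 or 429 is reaching your origin and being refused there. A bot with no rows at all is more likely being stopped upstream, at the CDN or WAF.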

Or just paste your URL into our free tool

citare.ai/tools/ai-robots-checker fetches your robots.txt and tells you which AI crawlers can access you. Zero signup. Faster than grepping logs.

Beyond robots.txt — CDN and firewall layers

robots.txt is a polite request. CDN-level rules are enforcement.

Cloudflare

Bot Fight Mode blocks legitimate AI crawlers with CAPTCHAs they cannot solve. The #1 silent killer in our audits. Switch it off OR add explicit allow rules for AI crawler user agents + IP ranges under Security → Bots.

Vercel

Default firewall doesn't block AI crawlers. If you've added custom WAF rules with broad bot-blocking, audit for AI crawler exclusions.

AWS WAF / CloudFront

AWS managed rule groups (AWSManagedRulesBotControlRuleSet) can block AI crawlers by default depending on configuration. Audit WAF rules for blanket bot blocks.

Imperva / Akamai / others

Enterprise WAF/CDN products often have aggressive bot defaults. Audit each one for explicit AI crawler whitelisting.
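One quick way to spot an edge-level block is to request a page with a crawler-like User-Agent header and look at the status code. The sketch below uses standard curl options (-s, -o, -w, -A). The helper names and the status-code groupings are assumptions chosen for illustration. Treat the result as a hint only: CDNs often verify real crawlers by IP range, so a spoofed request from your own machine may be treated differently from the real bot.

```shell
#!/usr/bin/env bash
# classify_status CODE
# Rough interpretation of the HTTP status returned to a bot-like request.
classify_status() {
  case "$1" in
    2??|3??)         echo "reachable" ;;
    401|403|429|503) echo "likely blocked or challenged" ;;
    *)               echo "inconclusive" ;;
  esac
}

# probe_as_bot URL USER_AGENT
# Sends one request with the given User-Agent and prints the verdict.
probe_as_bot() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' -A "$2" "$1")
  echo "$2 -> $code ($(classify_status "$code"))"
}

# Usage (network call, run manually):
#   probe_as_bot https://yoursite.com/ "GPTBot"
#   probe_as_bot https://yoursite.com/ "PerplexityBot"
```

If a normal browser request returns 200 while the bot-like request returns 403 or a challenge page, a bot rule at the CDN or WAF layer is the first place to look.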

Six gotchas that produce zero crawler traffic despite a clean allow list

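1. Inherited Disallow: / from a starter template

Some Next.js, Astro, and WordPress templates ship with overly restrictive defaults intended for staging environments that nobody removed at production launch. View https://yoursite.com/robots.txt directly in a browser and look for a Disallow: / under User-agent: * or under any named AI bot. Rule order does not rescue you: a compliant crawler follows the group that names it most specifically, wherever that group sits in the file.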

2. Conflicting rules: User-agent: * vs named-bot

A User-agent: * block with Disallow: / will conflict with named-bot Allows for some crawlers. The standard (RFC 9309) says the group that names the crawler most specifically wins and the * group is then ignored for that crawler, but implementations vary. Put named-bot allows at the TOP, catch-all at the BOTTOM, and never Disallow: / in the catch-all unless you mean it.
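The same most-specific principle applies to paths inside a single group: under RFC 9309 the longest matching rule wins, and Allow wins a tie. The following is a minimal sketch of that precedence, assuming plain prefix rules with no * or $ wildcards. The function name path_verdict is illustrative.

```shell
#!/usr/bin/env bash
# path_verdict PATH < rules
# Reads "allow: /x" / "disallow: /y" lines on stdin and prints the verdict
# for PATH using longest-match-wins, with Allow winning ties.
path_verdict() {
  awk -v path="$1" '
    {
      kind = tolower($1); rule = $2
      if (rule != "" && index(path, rule) == 1) {      # rule is a prefix of path
        len = length(rule)
        if (len > best || (len == best && kind == "allow:")) {
          best = len; verdict = kind
        }
      }
    }
    END {
      sub(/:$/, "", verdict)
      print (verdict == "" ? "allow" : verdict)        # no matching rule means allowed
    }
  '
}

# Usage:
#   printf 'Allow: /\nDisallow: /admin/\n' | path_verdict /admin/users
```

With the catch-all group from the allow list, /admin/users resolves to disallow because /admin/ is a longer match than /, while any path outside the listed directories resolves to allow.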

3. Wildcard quirks

Disallow: /*? looks like it blocks query strings but behaves differently across crawlers. Disallow: /*.pdf$ works in some, not others. Avoid wildcards unless necessary; prefer explicit prefix paths.

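4. Trailing slash inconsistencies

Disallow: /admin and Disallow: /admin/ mean different things. Rules are prefix matches, so /admin blocks every path that begins with those characters, including /administrator and /admin.html, while /admin/ blocks only paths inside the directory. If you intend to block a directory, use the trailing slash explicitly.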
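The difference is easy to demonstrate with a plain prefix test. This is a small sketch, assuming wildcard-free rules. The function name prefix_blocks and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# prefix_blocks RULE PATH
# Succeeds when PATH starts with RULE (literal prefix match, no wildcards).
prefix_blocks() {
  case "$2" in
    "$1"*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Compare how the two rule spellings treat the same paths.
for path in /admin/users /administrator-guide /admin; do
  for rule in /admin /admin/; do
    if prefix_blocks "$rule" "$path"; then
      echo "Disallow: $rule blocks $path"
    else
      echo "Disallow: $rule does not block $path"
    fi
  done
done
```

The rule without the slash also catches /administrator-guide, which is rarely what was intended. The rule with the slash leaves the bare /admin URL itself crawlable.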

5. Cached robots.txt

Google caches robots.txt for up to 24 hours; others vary. Expect 24-72 hour delay before crawlers respect rule changes. Patience required.

6. CMS plugins that regenerate robots.txt

Some plugins regenerate robots.txt on save and silently revert manual changes. Audit your robots.txt monthly to catch drift.
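The monthly audit can be partly automated by comparing the live file against a saved known-good copy. The following is a minimal sketch using cmp and diff. The function name compare_robots and the file names in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# compare_robots BASELINE CURRENT
# Prints "unchanged" when the two local copies are identical; otherwise
# prints "drifted" followed by a diff, and returns a non-zero status.
compare_robots() {
  if cmp -s "$1" "$2"; then
    echo "unchanged"
  else
    echo "drifted"
    diff "$1" "$2"
    return 1
  fi
}

# Usage (network call, run manually or from cron):
#   curl -fsS https://yoursite.com/robots.txt -o /tmp/robots.current
#   compare_robots robots.baseline /tmp/robots.current
```

Keeping the baseline in version control gives you a record of when a plugin or deploy changed the file and what it changed.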

Frequently asked questions

What's the single most important AI crawler to allow?

Google-Extended. It controls AI Overview citation eligibility separately from organic Google rank. Many older robots.txt files inherit a Disallow: / rule under User-agent: Google-Extended from starter templates and accidentally disqualify the entire site from AIO citation. One-line fix; 4-8 weeks to effect.

Is GPTBot the same as Bingbot?

No. GPTBot is OpenAI's training crawler — affects ChatGPT's future trained-knowledge of your brand (6-12 month horizon). Bingbot is Microsoft's search crawler that powers Bing AND ChatGPT's live web search. ChatGPT web-search visibility depends on Bingbot + Bing index health, not GPTBot. Allow both for full ChatGPT coverage.

If I block GPTBot, am I still visible on ChatGPT?

Partially. Blocking GPTBot prevents OpenAI from training on your content for future ChatGPT models, so your trained-knowledge presence stops growing. But ChatGPT's web search uses Bing, which is a separate crawler — if Bingbot can index you and your Bing coverage is healthy, ChatGPT can still cite you via web search. The two horizons are decoupled.

Why is PerplexityBot strategically different from GPTBot or ClaudeBot?

PerplexityBot is BOTH the training-input AND the live-search-input for Perplexity. GPTBot is training-only; Bingbot is live-search-only. A single PerplexityBot crawl has immediate AND long-term effects on Perplexity visibility. Within weeks of allowing PerplexityBot and publishing source-quality content, brands typically see Perplexity citation surface rate improve.

How do I verify crawlers are actually hitting my site?

Check server logs. Grep nginx or Apache access logs, case-insensitively, for the bot user-agent strings (GPTBot, ClaudeBot, PerplexityBot, Bingbot, Googlebot). Google-Extended will not appear there because it is a robots.txt token rather than a user-agent string. Cloudflare and Vercel dashboards expose the same data. If you see zero hits for an allowed bot in 7 days, something between robots.txt and your origin is dropping the requests, usually a CDN or WAF rule.

Does Cloudflare's Bot Fight Mode block AI crawlers?

Yes. Bot Fight Mode challenges most non-Googlebot bots with CAPTCHAs they cannot solve, including legitimate AI crawlers like GPTBot and PerplexityBot. They appear as zero traffic in your origin logs despite robots.txt allowing them. Switch off Bot Fight Mode OR add explicit allow rules for AI crawler user agents and IP ranges in Cloudflare's Bot dashboard.

Should I allow ALL AI crawlers, or be selective?

For most brands, allow the primary set (Googlebot, Google-Extended, Bingbot, GPTBot, OAI-SearchBot, ClaudeBot, Anthropic-AI, PerplexityBot). Cost is essentially zero; reach is broad. Be selective only if you have specific concerns: copyrighted content you don't want trained on (block training crawlers only), or bandwidth costs from aggressive bots (rate-limit rather than block; Crawl-delay is a nonstandard directive that Googlebot ignores, so use it only for crawlers that honor it).

How long after updating robots.txt do crawlers respect the new rules?

24-72 hours. Google caches robots.txt for up to 24 hours; other crawlers vary. After that, expect 4-8 weeks for AI surface rate changes to register, gated by each platform's next index refresh cycle.

Check your crawler access in 30 seconds

Paste your URL into our free AI robots.txt checker. We fetch your robots.txt and tell you which AI crawlers can access you, which are blocked, and what to fix.
