AI bot crawlers — the complete reference
The single most common cause of low AI search visibility has a one-line fix: a Disallow: / rule under User-agent: Google-Extended, inherited from a starter template. Crawler access caps the ceiling of every downstream optimization. Here's what each AI crawler does, the copy-pasteable robots.txt, how to verify it works, and the gotchas that produce zero traffic despite a clean allow list.
Updated May 2026
TL;DR
1. AI crawlers fall into two purpose classes: training-data (GPTBot, ClaudeBot, Google-Extended; 6-12 month horizon) and live-search (Bingbot, PerplexityBot, Googlebot; days-to-weeks horizon). Allow both classes.
2. Six crawlers are mandatory for full coverage: Googlebot, Google-Extended, Bingbot, GPTBot, ClaudeBot, PerplexityBot. Eight more are recommended at zero cost (OAI-SearchBot, Anthropic-AI, Applebot-Extended, CCBot, Cohere-AI, Mistral, Bytespider, Diffbot).
3. PerplexityBot is structurally unique: it is BOTH training-input AND live-search-input. It is the highest-impact single crawler to allow for B2B brands.
4. robots.txt is a polite request. CDN/WAF rules enforce. Cloudflare Bot Fight Mode is the #1 silent killer of AI crawler traffic, so check yours.
Why crawler access is the prerequisite for everything
AI search visibility breaks down into three layers — indexed, cited, routable. Crawler access is the input to layer one. If GPTBot hasn't crawled your site, your content cannot enter ChatGPT's training corpus. If PerplexityBot is blocked, Perplexity cannot cite you. If Google-Extended is disallowed, Google AI Overview drops you from its citation candidate set.
Crawler access caps the ceiling of every downstream optimization. You can have the best structured data in your category, the freshest content, and the strongest entity graph — and still be invisible on AI search if your robots.txt blocks the crawlers that feed those platforms.
Fix this first. Then build upward.
The two-axis taxonomy
Every AI crawler can be classified along two axes: who runs it (operator) and what it's for (purpose). The purpose axis matters most for optimization strategy.
Some crawlers serve both purposes. The most notable: PerplexityBot, which builds Perplexity's live search index AND informs future model training. A single crawl event has both immediate and long-term effects.
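One way to keep the taxonomy straight is to hold it as data. The sketch below is only an illustration: the `Crawler` class and helper names are hypothetical, and the operator and purpose labels are the ones this article assigns, not an official registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crawler:
    token: str            # robots.txt user-agent token
    operator: str         # who runs it
    purposes: frozenset   # "training", "live-search", or both


# Labels follow this article's classification (an assumption, not a registry).
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", frozenset({"training"})),
    Crawler("ClaudeBot", "Anthropic", frozenset({"training"})),
    Crawler("Google-Extended", "Google", frozenset({"training"})),
    Crawler("Googlebot", "Google", frozenset({"live-search"})),
    Crawler("Bingbot", "Microsoft", frozenset({"live-search"})),
    Crawler("PerplexityBot", "Perplexity", frozenset({"training", "live-search"})),
]


def by_purpose(purpose):
    """Return the tokens of every crawler that serves the given purpose."""
    return [c.token for c in CRAWLERS if purpose in c.purposes]


def dual_purpose():
    """Return the tokens of crawlers that serve more than one purpose."""
    return [c.token for c in CRAWLERS if len(c.purposes) > 1]


print("training:", by_purpose("training"))
print("live-search:", by_purpose("live-search"))
print("dual-purpose:", dual_purpose())
```

Filtering on the purpose axis makes the dual role of PerplexityBot explicit: it is the only entry that shows up in both lists.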
The crawlers, what each one powers
Copy-pasteable robots.txt allow list
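Drop this at site root. Replace the sitemap URL with yours. Named-bot allow rules go at the top; the catch-all default goes at the bottom. A crawler that matches a named group follows only that group, so the Disallow lines in the catch-all do not apply to the named bots; repeat them under a named group if that bot must also stay out of those paths.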
# ── AI search crawlers ─────────────────────────────────────────────
# Google
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Allow: /
# Microsoft (powers ChatGPT web search)
User-agent: Bingbot
Allow: /
# OpenAI (ChatGPT training + emerging live search)
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic (Claude training + product features + live search)
User-agent: ClaudeBot
Allow: /
User-agent: Anthropic-AI
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
# Perplexity (live search + training, dual-purpose)
User-agent: PerplexityBot
Allow: /
# Apple (Apple Intelligence)
User-agent: Applebot-Extended
Allow: /
# Common Crawl (used by many downstream training pipelines)
User-agent: CCBot
Allow: /
# Cohere (enterprise AI deployments)
User-agent: Cohere-AI
Allow: /
# Mistral (European enterprise + open-weight ecosystem)
User-agent: Mistral
Allow: /
# Diffbot (structured-data extraction + knowledge graph)
User-agent: Diffbot
Allow: /
# ── Default policy for everything else ─────────────────────────────
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /_next/
Disallow: /private/
# ── Sitemap ────────────────────────────────────────────────────────
Sitemap: https://yoursite.com/sitemap.xml
Don't Disallow: / in the catch-all unless you mean it.
That single line blocks every unnamed bot, including future AI crawlers that haven't shipped yet. Use the named-allow plus permissive-default pattern instead.
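Before deploying a robots.txt, it can help to check it offline. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `access_report` helper and the `SAMPLE` file are hypothetical, and the sample deliberately reproduces the inherited Google-Extended block described above.

```python
from urllib.robotparser import RobotFileParser

# The six crawlers this article treats as mandatory.
AI_AGENTS = ["Googlebot", "Google-Extended", "Bingbot",
             "GPTBot", "ClaudeBot", "PerplexityBot"]

# Hypothetical robots.txt with the inherited template problem.
SAMPLE = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/
"""


def access_report(robots_txt, agents=AI_AGENTS, path="/"):
    """Map each user-agent token to True if it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


report = access_report(SAMPLE)
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

One caveat: Python's parser applies the rules of a group in file order, while Google documents longest-match precedence, so results for paths covered by both an Allow and a Disallow line can differ from what a real crawler does. For the site root, which is what this check looks at, the two agree.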
How to verify crawlers are actually hitting your site
A robots.txt allow rule does not guarantee a crawler will visit. You verify by checking server logs.
nginx / Apache access logs
# Linux: AI bot hits across the current and rotated access logs (adjust the path;
# the window depends on your log rotation, so make sure it keeps at least 7 days)
# Google-Extended is left out: it is a robots.txt control token with no user-agent
# string of its own, so it never appears in logs (Google crawls with its existing agents).
# Output fields: client IP, path, status. The timestamp is dropped so repeated requests aggregate.
zcat -f /var/log/nginx/access.log* | \
grep -iE "GPTBot|ClaudeBot|PerplexityBot|Bingbot|Googlebot|OAI-SearchBot|Cohere-AI|Mistral" | \
awk '{print $1, $7, $9}' | \
sort | uniq -c | sort -rn | head -50
# macOS: same query, gzcat instead of zcat
gzcat -f /var/log/nginx/access.log* | \
grep -iE "GPTBot|ClaudeBot|PerplexityBot|Bingbot|Googlebot|OAI-SearchBot|Cohere-AI|Mistral" | \
awk '{print $1, $7, $9}' | \
sort | uniq -c | sort -rn | head -50
# Per-bot summary count (Linux; swap zcat for gzcat on macOS)
for bot in GPTBot ClaudeBot PerplexityBot Bingbot Googlebot OAI-SearchBot Cohere-AI Mistral; do
  count=$(zcat -f /var/log/nginx/access.log* | grep -ci "$bot")
  echo "$bot: $count"
done
If you see zero hits for an allowed bot in 7 days, something between robots.txt and your origin is dropping requests — usually a CDN or WAF rule. Check next.
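The "zero hits for an allowed bot" check can also be scripted. This is a rough sketch, assuming access-log lines that contain the user-agent string somewhere in the line; the helper names are hypothetical and the sample lines are made up for illustration, with abbreviated user-agent strings.

```python
from collections import Counter

# Tokens that appear in real request user agents. Google-Extended is omitted
# because it is a robots.txt control token and never appears in logs.
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
        "Bingbot", "Googlebot", "OAI-SearchBot"]

# Made-up log lines for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/May/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '198.51.100.7 - - [01/May/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Macintosh) Safari/605.1.15"',
]


def count_bot_hits(log_lines, bots=BOTS):
    """Count lines mentioning each bot token, case-insensitively."""
    counts = Counter({bot: 0 for bot in bots})
    for line in log_lines:
        lowered = line.lower()
        for bot in bots:
            if bot.lower() in lowered:
                counts[bot] += 1
    return counts


def silent_bots(counts):
    """Bots with zero hits: candidates for a CDN or WAF block."""
    return sorted(bot for bot, n in counts.items() if n == 0)


counts = count_bot_hits(SAMPLE_LINES)
print(dict(counts))
print("no hits:", silent_bots(counts))
```

To run it against real logs, pass an open file object (or `gzip.open(path, "rt")` for rotated files) in place of `SAMPLE_LINES`. Substring matching on the user agent can be spoofed, so treat the counts as a first signal rather than proof of a genuine crawler visit.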
Or just paste your URL into our free tool
citare.ai/tools/ai-robots-checker fetches your robots.txt and tells you which AI crawlers can access you. Zero signup. Faster than grepping logs.
Beyond robots.txt — CDN and firewall layers
robots.txt is a polite request. CDN-level rules are enforcement.
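A quick way to probe the enforcement layer is to request the same URL with a browser-like and a bot-like user agent and compare the status codes. The sketch below is a heuristic under stated assumptions: the function names are hypothetical, the user-agent strings are abbreviated placeholders, and the list of "blocked" status codes is a guess at common CDN behavior, not a documented contract.

```python
import urllib.error
import urllib.request

# Abbreviated placeholder user agents (assumptions, not the full real strings).
BROWSER_UA = "Mozilla/5.0 (Macintosh) Safari/605.1.15"
BOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for a GET sent with the given user agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def diagnose(browser_status, bot_status):
    """Heuristic reading of a browser-vs-bot status comparison."""
    if bot_status == browser_status:
        return "same response for both user agents"
    if bot_status in (401, 403, 429, 503):
        return "bot user agent is challenged or blocked upstream"
    return "responses differ; inspect manually"


# Usage (performs network requests; replace the URL with your own):
# print(diagnose(fetch_status("https://yoursite.com/", BROWSER_UA),
#                fetch_status("https://yoursite.com/", BOT_UA)))
```

A spoofed user agent sent from your own IP is not the same as the real crawler, since some CDNs verify crawler IP ranges, so a clean result here does not rule out a block. A 403 for the bot user agent alongside a 200 for the browser one is still a strong hint that a CDN or WAF rule is involved.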
Six gotchas that produce zero crawler traffic despite a clean allow list
Frequently asked questions
What's the single most important AI crawler to allow?
Google-Extended. It controls AI Overview citation eligibility separately from organic Google rank. Many older robots.txt files inherit a Disallow: / rule under User-agent: Google-Extended from starter templates and accidentally disqualify the entire site from AIO citation. It is a one-line fix that takes 4-8 weeks to show an effect.
Is GPTBot the same as Bingbot?
No. GPTBot is OpenAI's training crawler — affects ChatGPT's future trained-knowledge of your brand (6-12 month horizon). Bingbot is Microsoft's search crawler that powers Bing AND ChatGPT's live web search. ChatGPT web-search visibility depends on Bingbot + Bing index health, not GPTBot. Allow both for full ChatGPT coverage.
If I block GPTBot, am I still visible on ChatGPT?
Partially. Blocking GPTBot prevents OpenAI from training on your content for future ChatGPT models, so your trained-knowledge presence stops growing. But ChatGPT's web search uses Bing, which is a separate crawler — if Bingbot can index you and your Bing coverage is healthy, ChatGPT can still cite you via web search. The two horizons are decoupled.
Why is PerplexityBot strategically different from GPTBot or ClaudeBot?
PerplexityBot is BOTH the training-input AND the live-search-input for Perplexity. GPTBot is training-only; Bingbot is live-search-only. A single PerplexityBot crawl has immediate AND long-term effects on Perplexity visibility. Within weeks of allowing PerplexityBot and publishing source-quality content, brands typically see Perplexity citation surface rate improve.
How do I verify crawlers are actually hitting my site?
Check server logs. Grep nginx or Apache access logs for the bot user-agent strings (GPTBot, ClaudeBot, PerplexityBot, Bingbot, Googlebot). Google-Extended is a robots.txt control token with no user-agent string of its own, so it never shows up in logs. Cloudflare and Vercel dashboards expose the same data. If you see zero hits for an allowed bot in 7 days, something between robots.txt and your origin is dropping the requests, usually a CDN or WAF rule.
Does Cloudflare's Bot Fight Mode block AI crawlers?
Yes. Bot Fight Mode challenges most non-Googlebot bots with CAPTCHAs they cannot solve, including legitimate AI crawlers like GPTBot and PerplexityBot. They appear as zero traffic in your origin logs despite robots.txt allowing them. Switch off Bot Fight Mode OR add explicit allow rules for AI crawler user agents and IP ranges in Cloudflare's Bot dashboard.
Should I allow ALL AI crawlers, or be selective?
For most brands, allow the primary set (Googlebot, Google-Extended, Bingbot, GPTBot, OAI-SearchBot, ClaudeBot, Anthropic-AI, PerplexityBot). Cost is essentially zero; reach is broad. Be selective only if you have specific concerns: copyrighted content you don't want trained on (block training crawlers only), or bandwidth costs from aggressive bots (rate-limit rather than block; Crawl-delay works only for crawlers that honor it, and Googlebot ignores it).
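The "block training crawlers only" option can be sketched as a small generator. The function name and the default block list are assumptions for illustration: the list uses the crawlers this article labels training-only, and PerplexityBot is left allowed because it also feeds live search.

```python
# Crawlers this article classifies as training-only (an assumption to adjust).
TRAINING_ONLY = ["GPTBot", "ClaudeBot", "Google-Extended"]


def selective_robots(blocked=TRAINING_ONLY, sitemap=None):
    """Build a robots.txt that disallows the listed tokens and allows the rest."""
    lines = []
    for token in blocked:
        # One group per blocked crawler, separated by a blank line.
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    # Permissive default for every other crawler, including live-search bots.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap:
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


print(selective_robots(sitemap="https://yoursite.com/sitemap.xml"))
```

Note the trade-off with the rest of this guide: by the article's own account, disallowing Google-Extended also gives up AI Overview citation eligibility, so drop it from the block list if that surface matters more to you than training opt-out.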
How long after updating robots.txt do crawlers respect the new rules?
24-72 hours. Google caches robots.txt for up to 24 hours; other crawlers vary. After that, expect 4-8 weeks for AI surface rate changes to register, gated by each platform's next index refresh cycle.
Check your crawler access in 30 seconds
Paste your URL into our free AI robots.txt checker. We fetch your robots.txt and tell you which AI crawlers can access you, which are blocked, and what to fix.