Every major AI company now operates web crawlers that visit your site, read your content, and use it to train models or generate real-time answers. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot (Perplexity AI) are the three most significant, but the list is growing. Your robots.txt file is the primary mechanism for controlling what these crawlers can access, and the decision about whether to block or allow them has real commercial consequences.

This guide covers the major AI crawlers, what they do, and how to make an informed decision for your UK business.

The Major AI Crawlers You Need to Know

Each AI crawler serves a different purpose, and understanding those differences matters for your robots.txt decisions.

GPTBot is OpenAI’s web crawler for gathering training data for future model versions. The user agent token is GPTBot. Blocking GPTBot tells OpenAI not to use your content in its training data. OpenAI also operates ChatGPT-User, a separate agent that fetches pages in real time when a ChatGPT conversation calls for browsing, and OAI-SearchBot, which supports ChatGPT’s search features. Blocking those agents, rather than GPTBot, is what reduces your visibility in ChatGPT’s real-time responses.

ClaudeBot is Anthropic’s web crawler, used primarily for training data collection. The user agent token is ClaudeBot; older robots.txt files may also reference anthropic-ai. Anthropic has used separate agents, such as Claude-Web, for real-time web access during conversations. Its agent names have changed over time, so check Anthropic’s current documentation before finalising your rules. Blocking ClaudeBot reduces your presence in Claude’s training data and potentially its future citation behaviour.

PerplexityBot is Perplexity AI’s crawler. It reads and indexes web pages so that Perplexity can retrieve them and generate cited answers when users ask questions. The user agent token is PerplexityBot. Blocking this crawler directly limits your content’s ability to appear in Perplexity’s answers.

Google-Extended is Google’s robots.txt control token for AI training use, separate from Googlebot. It is not a distinct crawler with its own user agent string: Google’s existing crawlers fetch the pages, and the Google-Extended rule tells Google whether that content may be used to train its Gemini models. Critically, blocking Google-Extended does not affect your regular Google search rankings. It also does not remove you from AI Overviews, which are part of Google Search and follow your Googlebot rules.

Bytespider is ByteDance’s crawler, used for training data for their AI models. CCBot is the Common Crawl bot, whose open dataset is used by many AI companies for model training. Amazonbot is used by Amazon for Alexa and other AI services.

The Case for Allowing AI Crawlers

For most UK businesses focused on growth and visibility, allowing AI crawlers is the commercially sensible default.

AI platforms are becoming primary discovery channels. ChatGPT, Claude, and Perplexity collectively handle hundreds of millions of queries monthly. Blocking their crawlers makes it much harder for your brand to appear in these answers. For B2B businesses especially, this is increasingly where buyers begin their research.

Citation drives qualified traffic. When an AI platform cites your content, it typically includes a link that drives referral traffic. Early data suggests that AI-referred traffic often has higher engagement metrics than traditional organic search traffic, because users have already received context about your relevance before clicking through.

Training data inclusion builds long-term authority. When your content is included in AI model training data, your brand and expertise become part of the model’s foundational knowledge. This creates a durable citation advantage that persists across countless future conversations.

Blocking offers diminishing leverage. Some publishers initially blocked AI crawlers as a negotiating tactic around licensing and compensation. Most businesses that are not media publishers never had this leverage, and for them the cost of invisibility far outweighs any theoretical benefit.

The Case for Blocking (or Selectively Restricting)

There are legitimate reasons to restrict AI crawler access, though they apply to fewer businesses than many assume.

Proprietary content protection. If your website contains genuinely proprietary research, methodologies, or data that you sell as a product, you may want to prevent AI models from absorbing and redistributing this content. A consultancy whose primary product is a proprietary framework, for example, has a legitimate interest in controlling how that framework is distributed.

Competitive intelligence concerns. In some sectors, detailed content about your processes, pricing structures, or strategic approach could be used by competitors querying AI models. This is a real but often overstated concern.

Data licensing and compensation. Major publishers like The New York Times and The Guardian have taken positions on AI training data compensation. If your business model depends on content licensing revenue, blocking crawlers while negotiating terms may make sense.

Regulatory caution in sensitive sectors. Some regulated UK businesses may prefer to restrict AI access to certain content while they assess the regulatory implications. An FCA-regulated firm might allow general marketing content to be crawled while restricting detailed product information until they are confident in how AI models represent it.

Practical robots.txt Configuration

Here is how to implement your chosen approach in robots.txt.

Allow all AI crawlers (recommended for most businesses):

# AI Crawlers - Allowed
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Selective access (allow browsing, restrict training):

# Allow real-time browsing (citations)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training data collection
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Block all AI crawlers (not recommended for most businesses):

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Partial access (allow blog, restrict product pages):

User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/

User-agent: ClaudeBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
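
Before deploying any of these configurations, it can help to check that the rules behave as you expect. The sketch below runs the partial-access rules through Python's standard-library urllib.robotparser. The example.com URLs and the check_access helper are placeholders for illustration. This parser applies the first matching rule in file order, which may differ from how an individual crawler resolves conflicts, so treat it as a sanity check rather than a guarantee of crawler behaviour.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from above, held in a string for testing.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/

User-agent: ClaudeBot
Allow: /blog/
Disallow: /products/
Disallow: /pricing/
"""


def check_access(robots_txt, user_agent, urls):
    """Return {url: allowed} for one user agent (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    # Placeholder URLs; swap in pages from your own site.
    sample_urls = [
        "https://example.com/blog/ai-crawlers",
        "https://example.com/products/widget",
        "https://example.com/pricing/",
    ]
    for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
        print(agent, check_access(ROBOTS_TXT, agent, sample_urls))
```

PerplexityBot has no matching group in these rules, so the parser treats every path as allowed for it. If that is not what you intend, add an explicit group for that agent.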

Monitoring AI Crawler Activity

Once you have made your robots.txt decision, you should actively monitor what AI crawlers are doing on your site.

Check your server access logs. Look for requests from GPTBot, ClaudeBot, PerplexityBot, and other AI user agents. This tells you which crawlers are visiting, how frequently, and which pages they access most.
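
As a rough illustration of that log check, the sketch below counts requests per AI crawler and path. It assumes logs in the common "combined" format used by Apache and Nginx. The sample lines, IP addresses, and user agent strings are made up for the example, and count_ai_crawler_hits is a hypothetical helper. Google-Extended is left out because it is a robots.txt control token and does not appear as a separate user agent in logs.

```python
import re
from collections import Counter

# User agent tokens to look for (case-insensitive substring match).
AI_AGENT_TOKENS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Bytespider", "CCBot", "Amazonbot",
)

# Combined log format:
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative log lines only; real user agent strings may differ.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [10/Jan/2025:10:05:00 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Jan/2025:10:06:00 +0000] "GET /pricing/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [10/Jan/2025:10:07:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def count_ai_crawler_hits(lines):
    """Count hits keyed by (crawler token, requested path)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


if __name__ == "__main__":
    for (token, path), count in count_ai_crawler_hits(SAMPLE_LINES).most_common():
        print(f"{token:15} {path:20} {count}")
```

User agent strings can be spoofed, so counts like these show claimed crawler activity. Confirming that a request really came from a given operator requires checking it against the IP ranges that operator publishes.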

Use Bing Webmaster Tools and Google Search Console. Both platforms provide crawl data that can help you understand how search crawlers are accessing your content. Google-Extended has no separate user agent string, so it will not show up as a distinct crawler in Search Console or in your server logs.

Monitor your site’s performance. AI crawlers can be aggressive. If you notice increased server load or bandwidth usage correlated with AI crawler activity, you may need to implement crawl rate limits rather than outright blocks.
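
A crawl rate limit can take many forms, and in practice it is usually configured at the web server, CDN, or firewall. Purely to illustrate the idea, here is a minimal application-level sketch of a sliding-window limiter keyed by crawler name. The CrawlerRateLimiter class and the thresholds are hypothetical. A request that is refused would typically receive an HTTP 429 response.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class CrawlerRateLimiter:
    """Allow at most max_requests per crawler within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # crawler name -> recent timestamps

    def allow(self, crawler: str, now: Optional[float] = None) -> bool:
        # Accept an explicit timestamp so the limiter is easy to test.
        now = time.monotonic() if now is None else now
        hits = self._hits[crawler]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: caller should respond with 429
        hits.append(now)
        return True


if __name__ == "__main__":
    # Hypothetical threshold: 2 requests per 60 seconds per crawler.
    limiter = CrawlerRateLimiter(max_requests=2, window_seconds=60)
    for t in (0.0, 1.0, 2.0, 60.5):
        print(t, limiter.allow("GPTBot", now=t))
```

Each crawler gets its own window, so throttling one aggressive bot does not affect the others or your human visitors.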

Review your analytics for AI referral traffic. Track referral traffic from chatgpt.com (formerly chat.openai.com), claude.ai, perplexity.ai, and other AI platforms. This gives you direct visibility into which AI platforms are driving traffic and helps you assess the commercial value of allowing crawler access.
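
If your analytics tool lets you export raw referrer URLs, a small script can group them by AI platform. The sketch below is one way to do that. The host list is an assumption based on the domains named above, plus chatgpt.com, which ChatGPT now uses. classify_referrer is a hypothetical helper.

```python
from urllib.parse import urlparse

# Referrer hosts mapped to a platform label (an assumed, non-exhaustive list).
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
}


def classify_referrer(referrer):
    """Return the AI platform label for a referrer URL, or None."""
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.perplexity.ai like perplexity.ai
    return AI_REFERRER_HOSTS.get(host)


if __name__ == "__main__":
    # Made-up referrer values for illustration.
    for ref in (
        "https://chatgpt.com/",
        "https://www.perplexity.ai/search?q=uk+accountants",
        "https://www.google.com/",
        "",
    ):
        print(repr(ref), "->", classify_referrer(ref))
```

Some AI platforms open links without passing a referrer, so figures gathered this way are likely to understate the true volume of AI-driven visits.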

The robots.txt decisions you make today directly shape your AI visibility for the months and years ahead. For most UK businesses, the right answer is to allow access and actively optimise for citation, rather than closing the door on what is rapidly becoming the next major discovery channel.

