Why Is Your robots.txt Blocking AI Crawlers Without You Realising?

The silent visibility killer most site owners never check

Most people set up their robots.txt file years ago, or let a plugin handle it, and never look at it again. That was fine when robots.txt was mainly about telling Google which pages to ignore. But the AI search era has introduced a whole new set of crawlers, and a lot of sites are accidentally blocking every single one of them.

OpenAI's GPTBot crawls for ChatGPT's training data. Perplexity uses PerplexityBot. Google's AI Overviews still lean on Googlebot, while Google-Extended is a separate robots.txt token that controls whether your content is used for Gemini. Anthropic's Claude uses ClaudeBot. Each is matched by its own name in robots.txt, and unless your file gives them rules of their own, a single wildcard disallow rule can lock them all out without you ever knowing.

The problem is silent. There's no error in Google Search Console. No alert fires. You simply stop being cited in AI answers, and you never connect the dots.

How the wildcard rule accidentally blocks everything

The most common culprit is a rule that looks completely harmless:

User-agent: *
Disallow: /

This tells every crawler that isn't explicitly listed elsewhere in the file to go away entirely. It's often added during a site migration to prevent staging environments from being indexed, or by a developer who wanted to block a specific scraper. Then it gets left in, forgotten, while the site goes live.

Even a less aggressive version can cause damage:

User-agent: *
Disallow: /blog/
Disallow: /products/

If GPTBot and PerplexityBot aren't listed as separate user agents with their own rules, this wildcard applies to them too. Your blog posts and product pages, which are exactly the content AI engines want to cite, are now invisible to those crawlers.
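
You can see this for yourself with Python's standard-library urllib.robotparser, fed the two rule sets above. Treat it as a minimal sketch: the example.com URLs are placeholders, and Python's parser may differ in edge cases from how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

PARTIAL_BLOCK = """\
User-agent: *
Disallow: /blog/
Disallow: /products/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain used only for illustration.
for agent in ("GPTBot", "PerplexityBot"):
    print(agent, "home, blanket rule:", is_allowed(BLANKET_BLOCK, agent, "https://example.com/"))
    print(agent, "blog, partial rule:", is_allowed(PARTIAL_BLOCK, agent, "https://example.com/blog/post"))
    print(agent, "about, partial rule:", is_allowed(PARTIAL_BLOCK, agent, "https://example.com/about"))
```

Neither crawler has a group of its own here, so both fall back to the wildcard group: the blanket rule blocks everything, and the partial rule blocks the blog post while leaving the about page open.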

WordPress security plugins are another common source of accidental blocks. Wordfence, iThemes Security, and similar tools sometimes add aggressive robots.txt rules to reduce bot traffic and server load. The intention is sensible. The side effect is that legitimate AI crawlers get caught in the same net as the bad actors.

Which AI crawlers you need to think about

Here's a quick reference of the main AI crawler user agents you need to be aware of:

  • GPTBot - OpenAI's crawler for collecting training data for its models. User agent string: GPTBot. OpenAI documents a separate crawler, OAI-SearchBot, for surfacing sites in ChatGPT's search features.
  • ChatGPT-User - Used when ChatGPT browses the web in real time during a conversation. Separate from GPTBot.
  • PerplexityBot - Perplexity's indexing crawler. User agent string: PerplexityBot
  • Google-Extended - Google's opt-out token for Gemini training data and Vertex AI. It is a control token rather than a separate crawler: Googlebot still does the fetching. If you block this, you may affect how well Gemini understands your content.
  • ClaudeBot - Anthropic's crawler. User agent string: ClaudeBot
  • Bytespider - ByteDance's crawler, used for various AI and recommendation systems.
  • cohere-ai - Cohere's training crawler.

None of these will be listed in your robots.txt unless you, your developer, or a plugin has added them, which means the wildcard rule governs them all.
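
To check which of these names your file actually mentions, a short script can compare the list above against the User-agent lines in a robots.txt. This is a rough sketch: the function names and the sample file are invented for illustration, and it only looks at which names appear, not at what their rules say.

```python
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "PerplexityBot", "Google-Extended",
    "ClaudeBot", "Bytespider", "cohere-ai",
]


def named_user_agents(robots_txt: str) -> set[str]:
    """Collect the lowercased value of every User-agent line."""
    agents = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def unlisted_ai_crawlers(robots_txt: str) -> list[str]:
    """Return the AI crawler tokens that have no group of their own."""
    named = named_user_agents(robots_txt)
    return [token for token in AI_CRAWLER_TOKENS if token.lower() not in named]


# Invented sample file: only GPTBot has its own group.
sample = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""
print(unlisted_ai_crawlers(sample))
```

Every token the script prints is one that falls through to your wildcard group.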

For a deeper look at how to identify which of these are actually visiting your site, see our post on how to check which AI crawlers are visiting your site.

Three ways to audit your robots.txt right now

1. Read it directly

Go to yourdomain.com/robots.txt in your browser. Every site has one, or should. Look for any User-agent: * blocks and read every Disallow line carefully. If you see a blanket Disallow: / with a wildcard, that's a red flag. If key directories like /blog/ or /products/ are disallowed under the wildcard, AI crawlers are blocked from exactly the pages you want them reading.
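
If you'd rather not read the file by eye, the same check can be scripted. The sketch below assumes you have pasted the contents of your robots.txt into a string. The function names and the default list of key directories are illustrative, and it only does simple prefix matching, so wildcard patterns inside paths are not interpreted.

```python
def wildcard_disallows(robots_txt: str) -> list[str]:
    """Return the Disallow paths that belong to a `User-agent: *` group."""
    paths = []
    current_agents = []    # agents named by the group being read
    reading_rules = False  # True once the current group has seen a rule
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                current_agents, reading_rules = [], False
            current_agents.append(value)
        elif key in ("allow", "disallow"):
            reading_rules = True
            if key == "disallow" and value and "*" in current_agents:
                paths.append(value)
    return paths


def red_flags(robots_txt: str, key_dirs=("/blog/", "/products/")) -> list[str]:
    """Describe wildcard rules that block everything or a key directory."""
    flags = []
    for path in wildcard_disallows(robots_txt):
        if path == "/":
            flags.append("blanket Disallow: / under the wildcard")
        elif any(directory.startswith(path) for directory in key_dirs):
            flags.append(f"key directory blocked under the wildcard: {path}")
    return flags


print(red_flags("User-agent: *\nDisallow: /blog/\nDisallow: /wp-admin/\n"))
```

An empty list means the wildcard group has no blanket block and leaves the directories you listed alone.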

2. Use Google Search Console's robots.txt report

Google retired the legacy robots.txt tester and replaced it with a robots.txt report, found under Settings in Search Console. The report shows which robots.txt files Google found for your site, when each was last fetched, and any warnings or errors in parsing them. It doesn't test GPTBot or PerplexityBot, so treat it as a sanity check that the file Google sees is the one you expect, not as an AI crawler test.

3. Use a dedicated robots.txt checker

Tools like Ahrefs, Screaming Frog, and Zenstark's robots.txt analyser let you test specific user agents against specific URLs. Run GPTBot and PerplexityBot against your most important pages. If they come back blocked, you have your answer.

What a well-structured robots.txt looks like for AI visibility

The goal is to be explicit rather than relying on wildcard catch-alls. Here's a sensible structure:

User-agent: Googlebot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /account/

By giving the major crawlers their own groups, you give them clear permission regardless of what the wildcard says. Position in the file is not what matters here: a crawler obeys the most specific group that matches its name and ignores the wildcard group entirely. The wildcard then only catches unlisted bots, and you can restrict the paths that should stay private, like admin areas and checkout flows.
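
A quick check of that structure with Python's standard-library urllib.robotparser shows how the groups interact. This is a trimmed-down sketch of the file above: "SomeOtherBot" is a made-up name standing in for any crawler without its own group, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "/blog/post"),
    ("SomeOtherBot", "/blog/post"),
    ("SomeOtherBot", "/checkout/"),
    ("GPTBot", "/checkout/"),
]
for agent, path in checks:
    allowed = parser.can_fetch(agent, "https://example.com" + path)
    print(f"{agent:13} {path:12} allowed={allowed}")
```

The last check is the one to notice: a crawler with its own group follows only that group, so the wildcard's private-path rules do not apply to it. If you want /checkout/ or /account/ off-limits to the named crawlers as well, repeat those Disallow lines inside their groups.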

This approach also makes your intentions transparent. If an AI company reviews your robots.txt to understand your crawl preferences, the explicit allows signal that you want to be part of AI search, not opting out of it.

The GPTBot situation deserves special attention

OpenAI published guidance on GPTBot and made it easy for site owners to allow or block it. They also distinguished between GPTBot (training data) and ChatGPT-User (live browsing). These are different user agents and they behave differently.

If you block GPTBot but allow ChatGPT-User, ChatGPT can still browse your pages in real time during a conversation, but your content won't feed into OpenAI's training data or improve how ChatGPT understands your subject area over time. For most businesses, allowing both makes sense. The training data point matters because it affects how confidently ChatGPT talks about your brand, products, and expertise without needing to browse your site at all.
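
For illustration, here is what the "block training, keep live browsing" split could look like, checked with Python's standard-library urllib.robotparser. The policy and the example.com URL are hypothetical, and the result shows how Python's parser reads the file, which may not match OpenAI's own handling in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training crawls, keep live browsing.
TRAINING_OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(TRAINING_OPT_OUT.splitlines())

page = "https://example.com/guides/pricing"  # placeholder URL
print("GPTBot:", parser.can_fetch("GPTBot", page))
print("ChatGPT-User:", parser.can_fetch("ChatGPT-User", page))
```

Because the two names are matched separately, the same page is closed to one and open to the other.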

Our post on what GPTBot is and how to let it crawl your site covers this in more detail if you want the full picture.

Common scenarios where blocking happens accidentally

Plugin-generated robots.txt files

Yoast SEO, Rank Math, and All in One SEO all let you manage robots.txt through the WordPress dashboard. When no physical robots.txt exists on the server, WordPress generates one on the fly, and these plugins can change what gets served from their settings screens. If a previous developer or site owner added disallow rules there, those will still be live. Most site owners have no idea this is separate from a physical robots.txt file on the server.

Staging environments copied to production

Staging sites almost always have a Disallow: / rule to prevent indexing. When a developer deploys to production by copying the staging environment, that robots.txt comes along for the ride. It happens more often than developers would like to admit.

Cloudflare or CDN-level rules

Some CDN configurations can serve their own robots.txt or block requests before they ever reach your server. If you're using Cloudflare with certain firewall or bot-management rules, you might be blocking AI crawlers by user agent or IP range without realising it. This is less common but worth investigating if your robots.txt looks clean but you're still not seeing AI crawler traffic in your server logs.

Security plugins with bot blocking

As mentioned above, security plugins that aggressively block "bad bots" sometimes classify AI crawlers as threats due to their high crawl rates. Check your plugin's whitelist settings and make sure GPTBot, PerplexityBot, and ClaudeBot are excluded from any blanket bot blocks.

Fixing the problem and checking your work

Once you've updated your robots.txt, don't just assume it's working. Here's a simple verification checklist:

  1. Visit your live robots.txt and read through it again to confirm the changes saved correctly.
  2. Use Google Search Console to fetch the robots.txt and confirm it matches what you expect.
  3. Check your server access logs or a tool like Cloudflare Analytics to see if GPTBot and PerplexityBot start appearing within a few weeks.
  4. If you use a CDN or WAF, verify that none of the AI crawler IP ranges are being blocked at the network level.
  5. Submit your sitemap to any tools that support it. Submission options differ between AI providers and change over time, so check each one's current publisher documentation.
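
The log check in step 3 can be scripted along these lines. This is a minimal sketch: the sample lines are invented in a combined-log style, the user agent strings are simplified stand-ins rather than the crawlers' real ones, and the function name is illustrative. User agents can also be spoofed, so a match is a hint rather than proof.

```python
from collections import Counter

AI_CRAWLER_TOKENS = ("GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot")


def count_ai_crawler_hits(log_lines):
    """Count log lines whose text mentions each AI crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


# Invented sample lines; real access logs will look different.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:05:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:06:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
for token, count in count_ai_crawler_hits(sample_log).items():
    print(token, count)
```

To run it against a real log, pass an open file handle in place of sample_log, since the function only needs an iterable of lines.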

Getting your robots.txt right is just one layer of AI visibility. The next step is making sure your content is structured in a way that AI engines can actually understand and cite. That's where schema markup comes in. If you want to understand where your site currently stands, the free AI visibility audit at FlinnSchema is a good place to start.

And if you're thinking about what else contributes to how AI engines read and reference your site, it's worth knowing that robots.txt is closely related to other machine-readable files. The emerging llms.txt standard is one example - a file specifically designed to help large language models understand the structure of your site. It's different from robots.txt but serves a complementary purpose.

Frequently Asked Questions

Does blocking GPTBot affect my Google rankings?

No. GPTBot is OpenAI's crawler and has no connection to Google's ranking systems. Blocking GPTBot will not affect your position in Google Search results. It will, however, affect whether your content feeds into OpenAI's training data. Live browsing during a conversation is governed separately by ChatGPT-User. Google's own AI features are controlled through Googlebot and the Google-Extended token, which are separate from both.

How do I edit robots.txt in WordPress?

If you're using Yoast SEO, go to Yoast's settings, then navigate to "Tools" and then "File Editor." You'll find a virtual robots.txt editor there. In Rank Math, it's under General Settings, then Edit robots.txt. If a physical robots.txt file exists in your root directory, it will take precedence over Yoast's virtual one, so check both. For WordPress users without a plugin managing the file, you can edit it directly via FTP or your hosting file manager.

Will AI crawlers respect my robots.txt rules?

The major AI companies, OpenAI, Anthropic, Google, and Perplexity, have publicly committed to respecting robots.txt directives. OpenAI was among the first to publish this commitment alongside its GPTBot documentation. Smaller or less reputable AI scrapers may not honour the file, but the ones that matter most for AI search visibility do. That said, robots.txt is not enforced technically. It's a convention, not a security mechanism.

Should I block AI crawlers if I'm worried about my content being used for training?

That's a legitimate business decision and the choice is yours. Some publishers, particularly in media and publishing, have chosen to block GPTBot and Google-Extended to avoid their content being used in training data without compensation. The trade-off is reduced visibility in AI-generated answers. If your goal is to be discovered and cited by AI search engines like Perplexity and ChatGPT Search, blocking their crawlers works directly against that. Most e-commerce and service businesses benefit more from allowing access than restricting it. For a more detailed view of the strategic options, see how FlinnSchema approaches AI visibility for its clients.
