Should I Block AI Crawlers or Let Them Index My Site?

Tags: AI crawlers, AI visibility, robots.txt, GPTBot, ClaudeBot, LLM SEO, schema markup, AI search

Blocking AI crawlers feels like a reasonable instinct. Your content took time to produce, large language models are scraping it without paying you a penny, and you have no guarantee it will benefit your business in return. It is completely understandable to want to put a wall up.

But before you add a few lines to your robots.txt and call it done, it is worth thinking carefully about what you are actually giving up, and whether the trade-off makes sense for your specific situation. For most e-commerce brands and service businesses, blocking AI crawlers quietly costs them visibility they do not even know they are losing.

What AI crawlers actually do with your content

There are two distinct ways AI systems interact with your site, and it is important not to conflate them.

The first is training data collection. This is what sparked the original backlash. Companies like OpenAI, Anthropic, and Google crawl the web to gather text used to train their models. Once a model is trained, that data is baked in. Blocking crawlers at this stage does not remove content already absorbed into a model, but it does stop future training runs from using new pages you publish.

The second is real-time retrieval. This is where things get commercially interesting. When someone asks ChatGPT a question with browsing enabled, or uses Perplexity to research a product, those tools fetch and cite web pages in real time. If you block the crawlers responsible for this, your site simply does not appear in those answers. You are invisible at the exact moment a potential customer is making a decision.

These two functions are often handled by different bots with different user agents. GPTBot is primarily associated with training, while ChatGPT-User is used for real-time browsing. Blocking them both with a blanket rule is where businesses tend to make a costly mistake.
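To make the training versus retrieval split concrete, here is a minimal Python sketch that labels a User-Agent header by role. The role mapping follows the description in this article, and the sample header strings are made up for illustration; real strings vary by vendor and version, so treat the substring matching as an assumption rather than an official detection method.

```python
from collections import Counter

# Roles as described in the article; this mapping is illustrative, not exhaustive.
BOT_ROLES = {
    "GPTBot": "training",
    "CCBot": "training",
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval",
}

def classify_user_agent(user_agent):
    """Return (bot_token, role) for a User-Agent header, or (None, "other")."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in lowered:
            return token, role
    return None, "other"

# Made-up header values for demonstration only.
SAMPLE_HEADERS = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "CCBot/2.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

# Count how many sample requests fall into each role.
role_counts = Counter(classify_user_agent(h)[1] for h in SAMPLE_HEADERS)
print(dict(role_counts))
```

Run against your own access logs, a tally like this shows whether the AI traffic you receive is mostly training collection or live retrieval, which is the distinction that matters before you write any blocking rule.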

The case for blocking: when it genuinely makes sense

There are real scenarios where restricting AI crawlers is the right call. It is not a universally bad idea; it depends on your content model.

You monetise content directly

If your business is built on subscription content, paywalled articles, or proprietary research, then allowing AI crawlers to index and summarise that content essentially lets users get your value for free via an AI answer. Publishers, data providers, and niche membership sites have a legitimate reason to block training crawlers specifically.

You have sensitive or legally complex content

Healthcare providers, legal firms, and financial services businesses sometimes have regulatory reasons to restrict how their content is distributed and interpreted. If your content could cause harm when taken out of context by an AI, or if your compliance team has flagged it, blocking makes sense.

You are a very small site with thin content

If you only have a handful of pages and they are not particularly well-structured, being cited by an AI tool could actually work against you. AI systems citing thin or inaccurate content can reinforce the wrong impression of your brand. In this case, fixing the content first is the smarter move before you invite AI attention.

The case for letting AI crawlers in: what you stand to gain

For the vast majority of e-commerce brands, service providers, and content-led businesses, blocking AI crawlers is leaving real opportunity on the table.

AI search is already driving purchase decisions

People are using ChatGPT, Perplexity, and Gemini to research products, compare services, and find local businesses. These are not passive information seekers. They are buyers. When someone asks Perplexity "what is the best protein powder for endurance athletes" or "which Shopify agency specialises in DTC brands", the sites that get cited are the ones that get the traffic and the enquiry.

If your site is blocked, you are not in that conversation at all.

Citations compound over time

AI systems appear to build up a picture of which sources are trustworthy and authoritative. Every time your content is cited, your brand becomes more likely to be cited again. It is loosely similar to domain authority in traditional SEO, but faster and more directly tied to how well your content is structured and how clearly it answers specific questions.

Blocking means you never start building that citation history.

Structured data works in your favour

If you have invested in schema markup, JSON-LD, and proper content structure, AI crawlers can read and use that information far more accurately than they could by inferring it from unstructured page text. Your product details, reviews, FAQs, and business information become machine-readable assets. Blocking crawlers means that investment goes unrewarded by the fastest-growing discovery channel right now.

At FlinnSchema, we see this regularly. Clients who open up to AI crawlers after implementing proper structured data start appearing in AI-generated answers within weeks, not months. The structured signal is that clear.

How to make a granular decision rather than a blanket one

The good news is that you do not have to choose between fully open and fully closed. You can be surgical about it.

Separate training bots from retrieval bots

Your robots.txt file lets you target specific user agents. Here is a practical approach for most businesses:

  • Allow ChatGPT-User (real-time browsing by ChatGPT users)
  • Allow PerplexityBot (Perplexity's live retrieval crawler)
  • Allow Google-Extended (Google's control token for Gemini training and grounding; AI Overviews are part of Search and follow your Googlebot rules)
  • Consider blocking GPTBot if you are specifically concerned about training data
  • Consider blocking CCBot (Common Crawl, heavily used for training datasets)

This way you retain the commercial benefit of being discoverable in live AI search results, while limiting your contribution to model training if that is a concern.
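As a rough illustration of that split, the Python sketch below holds a sample robots.txt in a string and uses the standard library's urllib.robotparser to confirm which bots it lets through. The rules and the example.com URL are assumptions for demonstration, not a recommended configuration for every site.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block training-oriented bots, allow retrieval bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

def allowed_agents(robots_txt, agents, url="https://example.com/blog/post"):
    """Map each user agent to True if the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

report = allowed_agents(
    ROBOTS_TXT, ["GPTBot", "CCBot", "ChatGPT-User", "PerplexityBot"]
)
for agent, ok in report.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Testing the rules this way before you publish them catches the common mistake of a rule group that blocks more bots than you intended.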

Use page-level decisions, not site-wide ones

You do not have to apply the same rules to your entire site. Block AI crawlers from your members area, your checkout flow, or your internal documentation. Open them up to your blog, your product pages, your FAQs, and your service pages. These are the pages where being cited by an AI assistant has direct commercial value.
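A path-level version of the same idea might look like the sketch below. The directory names (/members/, /checkout/, /internal-docs/) are hypothetical and should be swapped for your own URL structure; the check again relies on Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths: private areas are closed, everything else stays open.
PAGE_LEVEL_ROBOTS = """\
User-agent: ChatGPT-User
User-agent: PerplexityBot
Disallow: /members/
Disallow: /checkout/
Disallow: /internal-docs/
"""

def is_path_open(robots_txt, agent, path):
    """Return True if the given agent may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

for path in ["/blog/ai-crawlers", "/products/widget", "/checkout/cart", "/members/profile"]:
    status = "open" if is_path_open(PAGE_LEVEL_ROBOTS, "PerplexityBot", path) else "closed"
    print(f"{path}: {status}")
```

Paths that match no Disallow rule remain crawlable by default, so you only need to list the areas you want closed.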

Check your current status

A lot of businesses have AI crawlers blocked and do not even know it. Some WordPress plugins and Shopify apps add blanket bot restrictions as a "security" measure. It is worth auditing your robots.txt right now to see what you are currently allowing and blocking. You might find you have been invisible to AI search engines for months without realising it.
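One way to run that audit is sketched below in Python. The agent list and the sample robots.txt are assumptions for illustration, and fetch_robots_txt is a hypothetical helper that makes a network request, so the demo runs against a local sample string instead.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Illustrative list of AI user agents to check; extend it as needed.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "PerplexityBot", "CCBot"]

def fetch_robots_txt(domain, timeout=10):
    """Download robots.txt for a domain (network call; not run in the demo)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def blocked_ai_agents(robots_txt, path="/"):
    """List the AI agents that the rules stop from fetching the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, path)]

# Hypothetical file of the kind a blanket "security" setting might produce.
SAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

print(blocked_ai_agents(SAMPLE))
```

To check a live site, pass the result of fetch_robots_txt("yourdomain.com") to blocked_ai_agents. This only reads robots.txt, so blocks applied at the server or CDN level will not show up here.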

If you want a proper assessment of how visible your site currently is to AI search tools, our free AI visibility audit will show you exactly where you stand.

The robots.txt is not the only lever

Even if you leave all AI crawlers open, there are other ways to influence how your content is used.

The noai and noimageai meta tags are proposed, non-standard directives that only some AI companies say they respect, so adoption is far from universal. Adding these to specific pages signals that you do not want that content used for training, without necessarily blocking retrieval access.
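For a sense of what this looks like in practice, the Python sketch below reads a page's robots meta tag and lists its directives. It assumes the commonly used form, a meta tag named "robots" with a comma-separated content value; because these directives are not a formal standard, individual crawlers may expect something different.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Assumed tag form for a page that opts out of AI training use.
PAGE = '<html><head><meta name="robots" content="noai, noimageai"></head><body></body></html>'
print(sorted(robots_directives(PAGE)))
```

A check like this is useful for confirming that a template or plugin actually emits the tag on the pages where you want it, and nowhere else.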

Structured data also plays a role here. AI systems are more likely to cite content they can clearly parse and attribute. Implementing author schema and organisation markup does not just help with attribution. It builds a verifiable identity that AI tools treat as more credible. Unstructured content from an anonymous-looking page is far less likely to be cited, regardless of whether the crawler can access it.
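As a minimal sketch of author and organisation markup, the Python below builds a schema.org Article object and wraps it in a JSON-LD script tag. The headline, names, and URL are placeholders, and the property set is deliberately small; a real page would usually carry more fields.

```python
import json

def article_json_ld(headline, author_name, org_name, org_url):
    """Build a JSON-LD script tag with author and publisher markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": org_name, "url": org_url},
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Placeholder values for illustration only.
snippet = article_json_ld(
    headline="Should I Block AI Crawlers or Let Them Index My Site?",
    author_name="Jane Example",
    org_name="Example Agency",
    org_url="https://example.com",
)
print(snippet)
```

The resulting tag goes in the page's HTML, where crawlers that parse JSON-LD can tie the content to a named author and organisation.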

It is also worth understanding how AI crawlers actually discover your site in the first place. The discovery process is different from traditional search engine crawling, and knowing how it works helps you structure your site to be found more effectively.

The honest bottom line

If you are running a content business built around paywalled information, blocking training crawlers is reasonable. If you are a typical e-commerce brand, a service business, or a content-led site trying to grow, blocking AI crawlers is almost certainly hurting you more than it is protecting you.

The question is not really "should I block them or not". The better question is "which crawlers should I allow, on which pages, and have I structured my content so that when they do visit, they actually understand what I do and who I serve?"

That second question is where the real work is. Getting your robots.txt right takes about ten minutes. Getting your structured data, content architecture, and entity signals right is an ongoing effort that compounds in your favour the longer you do it.

If you are not sure where to start, understanding how Perplexity decides which sources to cite is a good place to get your bearings. The principles apply broadly across AI search tools and will help you prioritise what to fix first.

And if you want someone to take a proper look at your site's AI visibility from the ground up, book a free audit and we will tell you exactly what is standing between you and being cited by AI search engines.

Frequently Asked Questions

Will blocking AI crawlers protect my content from being used in AI training?

Partially. Blocking crawlers like GPTBot and CCBot can prevent your new content from being included in future training runs. However, if your content was already publicly available before you added the block, it may already have been collected. Blocking also does nothing about content already baked into existing models. It is a forward-looking measure, not a retroactive one.

If I block AI crawlers, will my Google rankings be affected?

Traditional Google search rankings should not be directly affected by blocking AI-specific crawlers like GPTBot or PerplexityBot, as these are separate from Googlebot. Google's own AI features work differently: AI Overviews are part of Search and follow your Googlebot rules, while the Google-Extended token controls whether your content is used for Gemini training and grounding. Blocking Google-Extended does not affect rankings, but it could limit how Gemini uses your content. Be specific about which bots you block.

How do I know if AI crawlers are currently blocked on my site?

Check your robots.txt file directly by visiting yourdomain.com/robots.txt. Look for any Disallow: / rules under user agents like GPTBot, PerplexityBot, ChatGPT-User, or a catch-all User-agent: * that might be blocking everything. Some security plugins and CDN configurations also block bots at a server level, so check those settings too.

Does letting AI crawlers in mean I will definitely appear in AI search answers?

Not automatically. Allowing access is the first step, but AI systems also assess content quality, structure, authority, and relevance before deciding what to cite. You need well-structured pages, clear entity signals, schema markup, and genuinely useful content. Access is the minimum requirement. Everything else determines whether you get cited or ignored.
