If you cannot measure something, you cannot improve it. That applies to AI visibility just as it does to any other marketing channel, except that AI visibility is uniquely hard to measure: there is no single ranking system, no central tracker, and no public dashboard for who gets cited by ChatGPT or Perplexity. Every business that takes AI visibility seriously runs into the same question: how do you actually measure progress?
This post walks through the answer in detail. It covers how we measure AI visibility at FlinnSchema, how other tools and academic researchers approach the same problem, and what you should actually track if you want a credible read on where your business stands. If you are completely new to the topic, our explainer on what AI visibility is and why it matters sets the scene before you dig into the measurement side here.
Why AI Visibility Is Hard to Measure
Traditional SEO has decades of measurement infrastructure. You can plug your domain into any of a dozen rank tracking tools and see exactly where you sit for specific keywords on Google's results page. The data is granular, reproducible, and well-understood. AI visibility has none of that yet, and several structural reasons make it harder.
First, there is no single source of truth. ChatGPT, Perplexity, Gemini, and Grok all generate their own answers from their own retrieval pipelines. A business cited by ChatGPT might be invisible to Grok, and vice versa. So measuring "AI visibility" requires measuring across multiple engines, which multiplies the data collection effort.
Second, AI answers vary by query. The same business might be cited for "best recruitment agency in Kent" but ignored for "tech recruitment specialists UK". To get a credible score, you need to test multiple prompt variations rather than relying on a single query.
Third, AI responses are non-deterministic. The same prompt sent to ChatGPT twice can produce different answers, especially when the model uses web search and ingests slightly different sources each time. Single tests are unreliable. Statistical sampling across multiple runs is the only way to get a stable measurement; a short sketch below makes the sample-size arithmetic concrete.
Fourth, the structural factors that drive AI visibility (schema markup, crawler access, trust signals) are technical inputs rather than direct outputs. Measuring them tells you whether your site is ready to be cited, but not whether it actually is being cited. You need both kinds of measurement to have a complete picture.
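To make the sampling point concrete, here is a minimal sketch in Python, assuming you already have several responses to the same prompt from whatever client you test with. The substring match is deliberately naive (it has exactly the false-positive problem discussed further down) and the interval is a standard normal approximation; `mention_rate` and its inputs are illustrative, not part of any tool's API.

```python
import math

def mention_rate(responses: list[str], brand: str) -> tuple[float, float]:
    """Estimate a stable mention rate from repeated runs of one prompt,
    with a rough 95% confidence interval (normal approximation)."""
    n = len(responses)
    hits = sum(brand.lower() in r.lower() for r in responses)  # naive match, illustration only
    p = hits / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # shrinks as runs accumulate
    return p, margin

# With 5 runs, a 3/5 result carries a margin of roughly +/- 43 points;
# the same 60 percent rate over 50 runs narrows to about +/- 14.
```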
Structural Measurement vs Behavioural Measurement
The clean way to think about AI visibility measurement is to separate it into two categories that complement each other.
Structural measurement evaluates whether your site is technically optimised for AI engines. It is deterministic, reproducible, and fast. You can run it on any URL and get a score immediately. Examples include: does your site have complete Organisation schema, does your robots.txt allow GPTBot and ClaudeBot, do you have a valid llms.txt file, do you have FAQPage schema on key pages, and so on. Structural measurement tells you whether your foundations are in place.
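For a flavour of what structural checks look like in practice, here is a minimal sketch of two of them using only Python's standard library. The bot names are the publicly documented AI crawler user agents; everything else (function names, the presence-only llms.txt check) is illustrative, and a real audit validates contents rather than mere presence.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

def crawler_access(domain: str) -> dict[str, bool]:
    """Does robots.txt permit the major AI crawlers to fetch the homepage?
    A missing robots.txt counts as allowed, per the robots exclusion standard."""
    rp = RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()
    bots = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
    return {bot: rp.can_fetch(bot, f"https://{domain}/") for bot in bots}

def has_llms_txt(domain: str) -> bool:
    """Presence check only; a real audit would also validate the contents."""
    try:
        with urlopen(f"https://{domain}/llms.txt", timeout=10) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False
```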
Behavioural measurement evaluates whether AI engines actually cite your business when prompted. It is non-deterministic, requires multiple runs across multiple engines, and is more expensive to gather but more directly tied to commercial outcomes. Examples include: send 20 prompts about your industry to ChatGPT and count how many mention your business, repeat across Perplexity, Gemini, and Grok, then compute a citation rate.
Each kind of measurement has tradeoffs. Structural metrics are cheap, fast, and reproducible, but they only tell you whether the conditions for citation exist. Behavioural metrics are expensive, slow, and noisy, but they tell you what AI engines actually do with the conditions you have set up. The two together form a complete picture.
How We Measure It at FlinnSchema
Our methodology combines both kinds of measurement into a single score capped at 90 percent. The structural side is a 26-factor weighted scoring system, and the behavioural side is automated testing across four major AI engines. Both update independently and combine into a per-domain dashboard.
For the structural score, we evaluate 26 factors grouped by impact tier. The high-impact factors include Schema Markup (2.2x weight), E-E-A-T Signals (2.1x weight), Schema Completeness (2.0x weight), Schema Types diversity (1.9x weight), LLM Readability (1.8x weight), Reviews and Trust (1.7x weight), AI Crawler Access (1.6x weight), FAQ Schema (1.6x weight), Conversational Content (1.5x weight), and the llms.txt file (1.5x weight). Standard-impact factors include Reddit Presence, Content Freshness, Social Profiles, Content Depth, Robots.txt, and Sitemap.xml. Lower-impact factors cover Internal Links, Image SEO, Semantic HTML, Meta Tags, Open Graph, Heading Structure, Page Performance, HTML Quality, HTTPS, and Mobile Viewport.
Each factor is checked against your live site. The score is the weighted percentage of factors passed, capped at 90 percent because no site is ever truly perfect and we want the scoring to reflect that honestly. The full breakdown of how the score is calculated is in what the AI visibility score actually means, and for a deeper look at what each factor actually checks, see inside the AI visibility audit.
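The arithmetic is a weighted pass rate with a hard cap. The sketch below uses the published high-impact weights from the list above, but the 1.0 weights for the other factors and the pass/fail results are assumptions for illustration.

```python
# High-impact weights are from the tier list above; the 1.0 weights for
# standard-tier factors are an assumption for illustration.
WEIGHTS = {
    "schema_markup": 2.2, "eeat_signals": 2.1, "schema_completeness": 2.0,
    "ai_crawler_access": 1.6, "llms_txt": 1.5,
    "robots_txt": 1.0, "sitemap_xml": 1.0, "https": 1.0,
}

def structural_score(passed: dict[str, bool]) -> float:
    """Weighted percentage of factors passed, capped at 90."""
    earned = sum(w for f, w in WEIGHTS.items() if passed.get(f, False))
    return min(90.0, 100.0 * earned / sum(WEIGHTS.values()))

# Example: failing schema completeness and llms.txt drops this subset
# from the 90 cap to roughly 72.
results = {f: f not in {"schema_completeness", "llms_txt"} for f in WEIGHTS}
print(structural_score(results))
```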
How We Measure the Behavioural Side
The behavioural side is the part most AI visibility tools either skip or shortcut. We do it as follows.
For each tracked domain, we generate a set of 20 prompts using a combination of AI-generated suggestions (based on the business context, products or services, industry, and location) and template-based fallbacks. The prompts cover five categories: brand-direct queries, service or product queries, problem-solving queries, local queries (where applicable), and comparison queries. The goal is to test whether the business shows up for the actual kinds of questions real customers ask.
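Here is a minimal sketch of the template-fallback half of that, with hypothetical phrasings and business fields; the real generator blends these with AI-suggested prompts tuned to the specific business.

```python
# Hypothetical template fallbacks for the five prompt categories.
# Brand-direct prompts contain the business name, which is why "echoed"
# mentions need filtering in the classification step later.
TEMPLATES = {
    "brand_direct": ["What do you know about {name}?"],
    "service":      ["Who are the best {service} providers in {location}?"],
    "problem":      ["How do I choose a reliable {service} company?"],
    "local":        ["Recommend a {service} business near {location}."],
    "comparison":   ["How does {name} compare to other {service} options?"],
}

def fallback_prompts(name: str, service: str, location: str) -> list[str]:
    return [t.format(name=name, service=service, location=location)
            for templates in TEMPLATES.values() for t in templates]

prompts = fallback_prompts("Highland Adventures", "hiking tour", "Scotland")
```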
We then send those 20 prompts to four engines: ChatGPT (OpenAI's Responses API with the web_search_preview tool), Perplexity (Sonar with month-level search recency), Gemini (gemini-2.5-flash with google_search grounding), and Grok (grok-3 with search_mode on). All four are queried with live web search enabled, which is critical because API-only queries skip the real retrieval pipeline that real users experience. This is one of the things we cover in detail on our methodology page.
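As a sketch of one of the four legs, here is roughly what a single ChatGPT test looks like via OpenAI's Responses API with the web_search_preview tool. The model name is illustrative, and retries, timeouts, and the other three engines' clients are omitted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_chatgpt(prompt: str) -> str:
    """One behavioural test with live web search enabled. Without the tool,
    the model answers from its weights alone and skips the retrieval
    pipeline real users actually go through."""
    response = client.responses.create(
        model="gpt-4o",  # illustrative; pin the model you actually test
        tools=[{"type": "web_search_preview"}],
        input=prompt,
    )
    return response.output_text
```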
Each response is then classified by Claude Haiku as a genuine mention, a confused mention (the business mentioned but described incorrectly), an echoed mention (the business mentioned only because we put its name in the prompt), or not found. We use the classifier rather than simple string matching because string matching produces too many false positives. A client whose business name is partly generic (think "Highland Adventures" for a Scottish hiking guide) would otherwise appear in every answer about Scottish hiking, even when the AI did not actually mean their business.
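A simplified sketch of that classification step, using the Anthropic SDK: the label set mirrors the four categories above, but the prompt wording and post-processing here are placeholders for the real classifier.

```python
import anthropic

LABELS = {"genuine", "confused", "echoed", "not_found"}

def classify_mention(business: str, prompt: str, answer: str) -> str:
    """Use a small model to judge whether the answer genuinely refers to the
    business, instead of trusting a substring match."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    message = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Business: {business}\nUser prompt: {prompt}\nAI answer: {answer}\n\n"
                "Reply with exactly one word: genuine (clearly refers to this "
                "specific business), confused (names it but describes it "
                "incorrectly), echoed (only repeats the name from the prompt), "
                "or not_found."
            ),
        }],
    )
    label = message.content[0].text.strip().lower()
    return label if label in LABELS else "not_found"
```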
From the classified results we compute a citation rate per engine and an overall behavioural score, which we surface alongside the structural score. Over time, repeated runs (we test daily for premium clients) build a time series, so you can see whether interventions on the structural side actually translate into citation gains.
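From those labels, the roll-up arithmetic is short. In this sketch only genuine mentions count towards the rate and the overall score is a simple mean across engines; treat both choices as assumptions rather than the exact production formula.

```python
def citation_rate(labels: list[str]) -> float:
    """Genuine mentions over prompts tested, for one engine."""
    return labels.count("genuine") / len(labels)

runs = {  # one label per prompt, per engine (made-up results)
    "chatgpt":    ["genuine"] * 4 + ["echoed"] + ["not_found"] * 15,
    "perplexity": ["genuine"] * 2 + ["not_found"] * 18,
    "gemini":     ["not_found"] * 20,
    "grok":       ["confused"] + ["not_found"] * 19,
}
rates = {engine: citation_rate(labels) for engine, labels in runs.items()}
overall = sum(rates.values()) / len(rates)  # 0.075 here; a simple mean (assumed)
```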
Case Study: What Real Measurement Looks Like in Practice
One of our long-running clients is a recruitment agency in Kent. When we first measured them, their structural score was 18 out of 100 and their behavioural score showed 0 out of 40 citations across the four engines (20 prompts in two rounds per engine). After eight weeks of structured implementation work covering schema, crawler access, llms.txt, content restructuring, and review-signal consolidation, their structural score had risen to 62 and the behavioural side showed 23 out of 40 citations across the engines.
The interesting part of the measurement was that the structural and behavioural scores moved at different paces. The structural score jumped quickly within the first two weeks of implementation as schema and crawler access changes took effect. The behavioural score lagged by three to four weeks because AI engines needed time to re-crawl, re-index, and adjust their retrieval patterns. Without measuring both, we would not have been able to tell whether the work was paying off in real time. Measuring only one side would have been misleading either way. You can see more before-and-after measurement breakdowns on the FlinnSchema results page.
How Other Tools and Researchers Measure It
FlinnSchema's approach is one of several emerging methodologies. The field is young and the consensus is still forming. Here is how some of the other approaches work, with the tradeoffs of each.
Princeton's GEO benchmark is the academic foundation for this whole field. The 2023 paper "GEO: Generative Engine Optimization" by Aggarwal, Murahari, Rajpurohit, Kalyan, Narasimhan, and Deshpande introduced a systematic methodology for testing how content changes affect AI citation rates. They tested specific content interventions (adding citations, adding statistics, adding quotes, adjusting fluency) against a benchmark of generative engines and measured the resulting change in visibility metrics. The full paper is available at arxiv.org/abs/2311.09735 for the technical specifics. It is the closest thing to an academic baseline for this kind of measurement and worth reading in full if you want to understand the underlying research.
HubSpot's free AI Search Grader takes a similar behavioural approach, but as a single snapshot. You enter your domain and HubSpot tests a small set of prompts against ChatGPT and Perplexity, reporting whether your brand is mentioned. It is free, fast, and useful as a sanity check, but the prompt set is small and there is no structural side, so it is closer to a one-off diagnostic than an ongoing measurement programme.
Profound focuses specifically on tracking branded mentions across AI engines over time, with dashboards aimed at marketing teams. The methodology centres on prompt-based testing similar to ours, with less emphasis on the structural foundations that determine whether your site can be cited in the first place. It is well-suited for established brands wanting share-of-voice tracking but less actionable for businesses trying to identify what to fix.
Ahrefs Brand Radar tracks brand mentions in Google's AI Overviews, which is a hybrid of traditional SEO and AI visibility. Because AI Overviews appear within Google search results, the measurement uses Ahrefs' existing crawl infrastructure but extracts AI-generated content for analysis. It is a useful complement for businesses already using Ahrefs but it covers only one channel rather than the full AI engine landscape.
Semrush and other established SEO tools have added AI tracking features that monitor mentions in AI Overviews and provide some prompt-testing capability. The depth varies, and most still treat AI visibility as a feature alongside traditional SEO rather than a distinct measurement discipline.
Gartner's forecasts are not a measurement tool but a framing for why this matters. Their 2024 prediction that traditional search volume would drop 25 percent by 2026 has been one of the more widely cited industry signals for the shift toward AI-driven discovery. The forecast itself is not measurement, but it explains why credible measurement infrastructure is becoming a priority.
Schema.org and Google Search Central documentation are not measurement tools either, but they are the canonical references for the structural inputs that any credible measurement system has to evaluate. The Schema.org specification defines what valid schema looks like, and any measurement that purports to evaluate structured data quality should be grounded in those definitions.
The Honest Tradeoffs in Each Approach
No single methodology is complete. Each has its strengths and weaknesses.
Pure behavioural measurement (citation tracking only) tells you what AI engines do but not why. If your citation rate is low, you have no clear path to improving it because the measurement does not surface the underlying factors that need to change. You see the symptom without seeing the cause.
Pure structural measurement (factor scoring only) tells you whether your foundations are right but not whether the foundations are translating into real citations. You can have a perfect structural score and still be invisible if the citation behaviour of the engines does not follow the patterns your structural model assumes. You see the cause without confirming the effect.
Single-engine measurement (only ChatGPT, only Perplexity) gives you a partial view. The engines weight signals differently, and a citation strategy that works for Perplexity might leave you invisible to Grok or vice versa. The engines also evolve at different paces, so a single-engine snapshot can mislead you within months as the engine in question changes its retrieval logic.
Snapshot measurement (one-time tests) misses the non-determinism of AI engines. The same prompt produces different responses across runs, so a single test is unreliable. Time-series testing across multiple runs is the only way to detect real changes in behaviour versus random variation.
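One standard way to draw the line between a real behavioural shift and run-to-run noise is a two-proportion z-test on citation counts from two testing windows. This is a generic statistical sketch, not a description of any particular tool's internals.

```python
import math

def is_real_change(hits_before: int, n_before: int,
                   hits_after: int, n_after: int) -> bool:
    """Two-proportion z-test: did the citation rate genuinely move between
    two testing windows, or is the difference within sampling noise?"""
    p1, p2 = hits_before / n_before, hits_after / n_after
    pooled = (hits_before + hits_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p2 - p1) / se if se else 0.0
    return abs(z) > 1.96  # ~95% confidence

# 4/20 -> 7/20 looks like progress but is not statistically distinguishable;
# 4/40 -> 16/40 over more runs is.
print(is_real_change(4, 20, 7, 20), is_real_change(4, 40, 16, 40))
```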
The combination of structural and behavioural across multiple engines and over time is what we have settled on at FlinnSchema. It is more expensive than any single approach, but it produces the only measurement we trust to actually reflect commercial reality. For a deeper look at the difference between AI visibility and traditional SEO measurement, see how AI visibility is different from SEO.
What You Should Actually Track
If you are starting from zero, the practical answer is to track these in order of importance.
Your AI visibility score as a single composite number is the lead indicator. It rolls structural and behavioural into one read of where you stand and which direction you are moving. This is the number our free audit returns in about 60 seconds, and it is the simplest thing to monitor over time.
Schema completeness as a percentage of the recommended schemas for your business type. This is the most actionable structural metric because every missing schema is a specific implementation task. Our piece on what types of schema your business needs covers the decision tree.
AI crawler access status across the major bot user agents. This is binary (allowed or blocked) but high-impact because a blocked crawler is a closed channel regardless of how good the rest of your work is.
Citation rate per engine if you run premium-tier testing. Track separately for ChatGPT, Perplexity, Gemini, and Grok because the engines move independently. A citation rate that goes from 0 out of 20 to 10 out of 20 on ChatGPT but stays at 0 out of 20 on Grok tells you something specific about where to invest next.
Time-series trend rather than absolute snapshots. The direction matters more than the level when you are starting out. A score of 35 trending upward at 3 points a month is healthier than a score of 60 sitting flat for six months. Stable upward trends compound; flat scores tend to drift down as competitors invest.
How to Get a Baseline Today
The fastest practical step is to run our free 26-factor audit. It returns a composite AI visibility score in about 60 seconds with no credit card and no sales call required. The result includes a per-factor breakdown so you know exactly which structural elements are missing and which are already in place. From the score, our roadmap on how to increase your AI visibility score covers the priority order for the highest-impact fixes.
If you want the behavioural side as well, that is what our Premium plan covers: daily LLM testing across all four major AI engines with verified classifications, time-series tracking, and a prioritised roadmap that updates as your structural work changes the citation behaviour. For prospects who want to see the full picture before committing, book a free 15-minute walkthrough and we will run a live audit on your domain and walk through what we are measuring and why.
For context on why measurement matters in the first place, our pieces on whether customers actually use ChatGPT to find businesses and how to get cited by ChatGPT cover the commercial side. Measurement is the link between the technical work you do and the commercial outcomes you care about. Without a credible read on where you stand, you are flying blind on a channel that increasingly drives real revenue.
