Most GEO Tools Are Just Dashboards — The Real Problem in AI Search Optimisation
The GEO/AIO/AEO tool market has grown quickly, and the dashboards are genuinely useful. Share-of-AI-voice charts, prompt-tracking grids, citation-frequency scores, competitor-benchmarking views – the category has produced a coherent visual language for AI visibility in a short time. But underneath the polished interfaces lies a fundamental problem: most tools are telling you what happened, not why – and they are often measuring the wrong thing.
Understanding why requires some honesty about the constraints these tools operate under. LLMs are black boxes: you cannot directly inspect their weights, their internal associations, or the reasoning behind any individual output. All you can observe is the output itself. This is not a failure of ambition on the part of GEO platforms – it reflects a genuine technical constraint. But it means there is an important distinction between what these tools can reliably measure and what businesses actually need to know.
The Rank Tracker Fallacy
The dominant model in current GEO tooling is essentially a rank tracker for AI. You enter a set of prompts – 'best CRM for startups,' 'top project management tools,' 'accounting software for freelancers' – and the tool runs those prompts through AI systems and reports whether your brand appeared.
Some tools go further, assigning a 'position' within the AI response (first brand mentioned, second, third). This looks familiar and actionable. It mirrors the keyword ranking reports that SEO teams have used for decades. But there is a critical problem: AI search does not have rankings in the way that term implies.
Research by Rand Fishkin and SparkToro found that asking an AI for a product recommendation 100 times produces 100 different answers. The brand composition, order, and framing all vary. A single 'position' measurement captures one draw from a probability distribution – and presents it as a fact.
The real unit of measure is the probability of inclusion – the percentage of times a brand appears across hundreds of iterations. Most current tools ignore this variability, providing brands with a skewed and unstable view of their visibility.
What AI Search Actually Produces: Distributions, Not Ranks
This distinction – between a single ranked output and a probability distribution over possible outputs – is not academic. It has direct implications for how brands should measure and optimise their AI visibility.
The correct question is not 'what position are we in?' but 'what is our probability of inclusion for this prompt across this population of users?' That probability varies by model, by user context, by the recency of training data, and by dozens of other factors. Measuring it accurately requires running hundreds of iterations of a prompt, varying context, and analysing the distribution of outcomes – not checking a single result.
Very few current GEO tools do this. Those that do often present it in ways that obscure the underlying variability, smoothing it into a single score that feels more like a traditional ranking than the probabilistic signal it actually is.
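To make the measurement approach concrete, here is a minimal sketch of what 'probability of inclusion' looks like in practice. The `ask_model` function is a hypothetical stand-in for a real LLM call, simulated here with random draws to mimic output variability; the point is the sampling-and-interval logic, not the stub. Reporting a confidence interval alongside the rate is what distinguishes a probabilistic measure from a single-draw 'position'.

```python
import math
import random

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - margin), min(1.0, centre + margin))

def ask_model(prompt: str) -> list[str]:
    """Hypothetical stand-in for an LLM call: returns the brands mentioned
    in one answer. Simulated as a random draw to mimic output variability."""
    pool = ["BrandA", "BrandB", "BrandC", "BrandD"]
    return random.sample(pool, k=random.randint(1, 3))

def inclusion_probability(prompt: str, brand: str, iterations: int = 300):
    """Run the same prompt many times and estimate P(brand appears)."""
    hits = sum(brand in ask_model(prompt) for _ in range(iterations))
    low, high = wilson_interval(hits, iterations)
    return hits / iterations, (low, high)

random.seed(0)
p, (low, high) = inclusion_probability("best CRM for startups", "BrandA")
print(f"P(inclusion) = {p:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval matters: with a few hundred iterations the estimate is stable enough to compare week over week, whereas a single draw from the same distribution can swing between 'mentioned first' and 'not mentioned at all'.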
The Causality Gap
The deeper problem is causality. Even if a tool accurately measures your probability of appearing in an AI answer, it cannot tell you why that probability is what it is – or what you would need to change to improve it.
Consider what actually drives AI brand recommendations. It is a complex interplay of signals, none of which current tools can directly observe:
- Training data prevalence – how often your brand appeared in the model's training corpus. Training data is not disclosed; models are retrained periodically with no public changelog.
- Entity association strength – how consistently high-authority sources link your brand to specific use cases. Observable at the surface (citations) but not at the weight level inside the model.
- Web search/retrieval relevance – how well your content is retrieved when relevant queries are processed. Retrieval indices and scoring are proprietary to each platform.
- Co-mention patterns – which other brands and concepts appear alongside yours in authoritative sources. Requires corpus-level analysis, not output monitoring.
- Corroboration thresholds – whether enough independent sources agree on the same claims about your brand. Requires cross-source semantic analysis beyond what output dashboards provide.
Current GEO tools surface the outputs of this process (your citations, your share of voice) but cannot decompose the inputs. They can tell you that your visibility dropped; they cannot tell you whether it is because a key authoritative source stopped mentioning you, because a competitor achieved a corroboration threshold in a new category, or because a model update shifted the weighting of training-data signals.
Without causal understanding, optimisation becomes expensive guesswork. Teams invest in content that may not address the actual gap. They build citations in sources that carry little weight with the specific models they care about. They optimise for prompts that are not representative of real user behaviour.
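To illustrate the gap between monitoring and modelling, here is a deliberately toy sketch of what a causal layer would need to do. Every signal name and weight below is invented for illustration – none of these values can currently be observed or measured – but the structure shows the difference: a model of inclusion lets you attribute a drop to a specific input, rather than merely record that the output changed.

```python
import math

# Hypothetical signal weights -- invented for illustration, not measured values.
WEIGHTS = {
    "training_prevalence": 1.2,
    "entity_association": 0.9,
    "retrieval_relevance": 1.5,
    "co_mention_strength": 0.6,
    "corroboration": 1.1,
}
BIAS = -3.0

def inclusion_logit(signals: dict[str, float]) -> float:
    return BIAS + sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

def inclusion_prob(signals: dict[str, float]) -> float:
    """Logistic combination of signals into an inclusion probability."""
    return 1 / (1 + math.exp(-inclusion_logit(signals)))

def attribute_drop(before: dict, after: dict) -> dict[str, float]:
    """Decompose a visibility change into per-signal contributions by
    toggling one signal at a time (one-at-a-time attribution)."""
    return {
        k: inclusion_prob({**before, k: after.get(k, 0.0)}) - inclusion_prob(before)
        for k in WEIGHTS
    }

before = {"training_prevalence": 1.0, "entity_association": 1.0,
          "retrieval_relevance": 1.0, "co_mention_strength": 1.0,
          "corroboration": 1.0}
after = dict(before, corroboration=0.2)   # a key source stopped corroborating
deltas = attribute_drop(before, after)
worst = min(deltas, key=deltas.get)
print(worst, round(deltas[worst], 3))
```

A dashboard reports only `inclusion_prob(after)`; the attribution step is what current tools cannot perform, because the inputs on the left-hand side are not observable from outputs alone.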
The Simulated Prompt Problem
There is a further, underappreciated issue with the dominant GEO methodology: the prompts being tested are not the prompts real users are typing.
Most GEO tools operate on a library of curated, 'clean' prompts – short, intent-clear queries run in fresh, context-free sessions. This is useful for establishing a baseline, but it misses the majority of how AI discovery actually works. Real users arrive at AI recommendations through multi-turn conversations, with context from previous queries and implicit signals about their industry, role, and situation baked into the conversation history.
Testing a brand's visibility with a set of generic prompts in incognito sessions is like testing a recommendation engine by asking it to recommend a film with no viewing history – technically valid, but not representative of how the system performs for actual users.
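The same idea can be sketched in code: measure inclusion for the identical question asked under different user contexts, not just in a clean session. The personas and the `ask_model` stub below are invented for illustration – a real test would call an actual model with conversation history – but the structure shows why context-free testing understates the variance.

```python
import random

PERSONAS = [
    "",  # clean, context-free session
    "I run a 5-person bootstrapped SaaS startup. ",
    "I'm an enterprise IT director evaluating vendor compliance. ",
    "I'm a freelance designer who hates complex software. ",
]

def ask_model(prompt: str) -> list[str]:
    """Hypothetical LLM stand-in: user context shifts the recommendation
    distribution (simulated -- a real test would call an actual model)."""
    base = 0.7 if "startup" in prompt else 0.4
    return ["BrandA"] if random.random() < base else ["BrandB"]

def inclusion_by_context(question: str, brand: str, n: int = 200) -> dict[str, float]:
    """Inclusion rate for the same question asked under different user contexts."""
    rates = {}
    for persona in PERSONAS:
        hits = sum(brand in ask_model(persona + question) for _ in range(n))
        rates[persona or "(no context)"] = hits / n
    return rates

random.seed(1)
for ctx, rate in inclusion_by_context("What's the best CRM?", "BrandA").items():
    print(f"{rate:.2f}  {ctx}")
```

A tool that tests only the `(no context)` row measures one cell of this table and reports it as the brand's visibility.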
What the Next Generation Needs to Do
The honest assessment is that first-generation GEO tools have established something valuable: AI visibility is measurable, and it correlates with business outcomes. But the category needs to evolve from monitoring to modelling – from reporting what happened to explaining why, and predicting what will happen for different types of users across different contexts.
That evolution requires a different conceptual foundation: one that treats AI recommendations as the output of a complex, persona-sensitive system, and that models the signals driving inclusion or exclusion at the level of specific user segments. This is precisely the missing layer that persona intelligence addresses – moving beyond dashboards to model the causal relationship between brand signals and AI recommendation probability.
Written by
ZIO Team
Research Team
The ZIO research and product team, dedicated to advancing persona intelligence.