AI Product Quality Isn’t a Metric, It’s a System
By Ayşe Naz Erkan | Published June 2025

How can AI products deliver on their promise? The answer lies not in the tech, but in how we measure their behavior. We look to timeless systems-thinking principles for a better approach.
“All the uncertainties we have raised must confront and correct each other, there must be dialogue.” — Edgar Morin
In our previous post, we unpacked why AI product development (probabilistic, context-bound, and ever-evolving) demands a different playbook. In this article, we explore why AI quality has been elusive for developers and how we can think differently about it. Rather than just asking what to measure, we ask how measurement itself must evolve to reflect the complexity, subjectivity, and dynamic nature of real-world AI behavior.
Product development has long been guided by a simple principle: you can’t improve what you can’t measure. While that principle remains true for AI, our methods of measurement are obsolete. We are still trying to measure a fluid, dynamic technology with the tools of a static world, treating metrics as isolated dashboards rather than what they need to be: adaptive levers to steer a living system.
The measurement gap becomes obvious the moment we work with large language models. LLMs moved us away from deterministic, procedural systems and toward probabilistic, interpretive ones. But the real challenge isn’t just the nature of the LLM itself. In our view, an AI product is a system that connects users, experts, models, and infrastructure. And AI product quality now depends on the whole system.

The Domain Expert as the Architect of Meaning
To make quality a first-class design principle, the domain expert’s role transforms from a downstream inspector of outputs into the upstream architect of the system’s values. Reverse engineering their intuitive judgment from outputs is limited and fragile; explicit expert principles must be the blueprint. By moving experts into the role of co-designer[1], we empower them to define the core principles of “good” from the outset: their expertise becomes the objective function around which the entire system is built.
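To make that concrete, here is a minimal sketch of what an expert-authored blueprint might look like in code. It is illustrative only: the `Principle` structure, the example rubric, and the `judge` callable are our assumptions, standing in for however a team encodes expert criteria and grades outputs against them.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Principle:
    """One expert-authored criterion, written upstream as part of the blueprint."""
    name: str
    description: str   # the expert's definition of "good", in their own words
    weight: float      # how much this principle counts toward overall quality

# Hypothetical rubric: in practice, domain experts author and iterate on these.
RUBRIC = [
    Principle("faithfulness", "Claims are grounded in the provided sources.", 0.5),
    Principle("tone", "Reads as a calm, professional advisor would write.", 0.3),
    Principle("actionability", "The user knows what to do next.", 0.2),
]

def score_output(output: str, judge: Callable[[str, Principle], float]) -> float:
    """Weighted score of an output against the expert rubric.

    `judge` is a placeholder for whatever grades one output against one
    principle: an LLM judge, a heuristic, or the expert themselves.
    """
    total_weight = sum(p.weight for p in RUBRIC)
    return sum(p.weight * judge(output, p) for p in RUBRIC) / total_weight
```

The design choice worth noticing is where authority lives: the expert edits the rubric itself, and every downstream evaluation inherits that change, rather than the expert inspecting outputs one by one after the fact.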
Why Numbers No Longer Tell the Whole Story
AI product teams are building systems that generate human language, yet still rely on quantitative tools to measure them. Numbers can track accuracy, but they rarely capture context, tone, or trust. Well-chosen proxies[2], like sentiment scores or user-reported trust, help, but only when paired with qualitative review loops. This reflects a deeper, philosophical tension: true quality lies in the qualitative, and building for it demands a new culture of measurement. And because each function owns a different slice of the numbers, no one sees the full picture. This creates an organizational void, leaving teams asking a critical question: who, exactly, owns AI product quality?
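Before turning to that question, here is a minimal sketch of what pairing a numeric proxy with a qualitative review loop might look like. The `sentiment_proxy` heuristic, the thresholds, and the sampling rate below are all illustrative assumptions, not a prescribed design.

```python
import random

def sentiment_proxy(response: str) -> float:
    # Stand-in for a real proxy (a sentiment model, a user-reported trust score, etc.).
    negative_markers = ("sorry", "cannot", "unfortunately")
    hits = sum(marker in response.lower() for marker in negative_markers)
    return max(0.0, 1.0 - 0.3 * hits)

def route_for_review(responses, proxy=sentiment_proxy,
                     low_threshold=0.4, sample_rate=0.05):
    """Pair the quantitative proxy with a qualitative loop: low-scoring outputs
    go to a human reviewer, plus a random sample of "fine" ones to catch drift."""
    review_queue, passed = [], []
    for response in responses:
        score = proxy(response)
        if score < low_threshold or random.random() < sample_rate:
            review_queue.append((response, score))  # an expert reads and annotates these
        else:
            passed.append((response, score))
    return review_queue, passed
```

The random sample matters as much as the threshold: without periodically reading outputs the proxy calls good, a team never learns where the number and the experience diverge.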

The Ownership Vacuum
When product executives are unsure whether AI quality belongs to product managers, designers, engineers, or AI research teams, that is not a failure. It is a signal that the old, siloed roles no longer match today’s cross-functional reality. We see this vacuum in many of the AI startups and teams we talk to. The root issue is that a systemic property can’t be assigned to a single function or role. Resolving it requires creating a system where ownership is shared and feedback loops are transparently understood.
Metrics as a Guidance System, Not Just a Ruler
A systems-theory lens provides the solution: it lets us stop looking at isolated metrics and instead see the entire network of incentives we are creating. AI product metrics are the signals that guide the behavior of that system. By carefully designing these incentives, we can create a system that naturally evolves toward strategic goals.

The Product is the Connector
Within this systems-centric framework, the AI product links the key participants:
- the expert, who provides the objective,
- the user, who provides the real-world feedback,
- and the product design and development teams, who build and maintain the structure that enables this information flow.
The product is the conduit through which these interactions occur; quality isn’t the output of any single component, but an emergent property of the overall system. That’s why AI quality remains elusive: it can’t be captured by measuring individual parts in isolation. A good AI product isn’t just accurate, it’s coherent. Coherence[3], the true quality benchmark, is the state where all the parts of a system fit together to form a united whole. It lives in the dialogue between user needs, domain expert intent, and system behavior.
We’ll explore the practical frameworks for building such a system, with concrete use cases, in our upcoming posts.
Coherence is achieved when shared ownership across an AI organization creates a loop of shared meaning-making. This loop connects the domain expert’s mental map to the real-world needs of the user, replacing the old world of linear, disconnected tools. Instead of relying on observability dashboards that aggregate mechanistic LLM-judge evaluations (like asking, “Is this AI model answer complete?”), we move toward a holistic and adaptive process. This is how we move from simply building AI to truly guiding it with intent.
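As a closing illustration, here is a minimal sketch of that shift, under stated assumptions: the per-interaction records, the rubric dimensions, and the `gap_threshold` are hypothetical, and the judge scores stand in for whatever automated evaluation a team already runs.

```python
from statistics import mean

# Hypothetical records pairing LLM-judge rubric scores with real user feedback.
interactions = [
    {"judge": {"faithfulness": 0.9, "tone": 0.8, "actionability": 0.4}, "user_rating": 0.3},
    {"judge": {"faithfulness": 0.8, "tone": 0.9, "actionability": 0.9}, "user_rating": 0.9},
]

def coherence_report(records, gap_threshold=0.3):
    """Instead of one aggregate "completeness" number, surface where automated
    judgment and user experience diverge; those cases go back to the domain
    expert, who revises the rubric -- the adaptive loop described above."""
    report = []
    for record in records:
        judge_avg = mean(record["judge"].values())
        gap = abs(judge_avg - record["user_rating"])
        report.append({
            "judge_avg": round(judge_avg, 2),
            "user_rating": record["user_rating"],
            "needs_expert_review": gap > gap_threshold,
        })
    return report

print(coherence_report(interactions))
```

In the first record, the judge’s average looks passable, but the user disagrees; flagging that gap rather than averaging it away is what turns a dashboard into a feedback loop.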
[1] Hamel Husain, “A Field Guide to Rapidly Improving AI Products,” March 24, 2025. Husain notes that successful teams “flip this model by giving domain experts tools to write and iterate on prompts directly.”
[2] Thomas and Uminsky argue that metrics are often only a proxy and that “we often can’t measure what matters most.” “Reliance on Metrics Is a Fundamental Challenge for AI,” arXiv:2002.08512 (2020).
[3] Edgar Morin, “Coherence and Epistemological Opening,” in On Complexity, Hampton Press, 2008.