Lumiflow AI Blog

AI Product Quality Isn’t a Metric, It’s a System

How can AI products deliver on their promise? The answer lies not in the tech, but in how we measure their behavior. We look to timeless systems-thinking principles for a better approach.


“All the uncertainties we have raised must confront and correct each other, there must be dialogue.” — Edgar Morin


In our previous post, we unpacked why AI product development — probabilistic, context-bound, and ever-evolving — demands a different playbook. In this article, we explore why AI quality has been elusive for developers and how we can think differently about it. Rather than just asking what to measure, we ask how measurement itself must evolve to reflect the complexity, subjectivity, and dynamic nature of real-world AI behavior.

Product development has long been guided by a simple principle: you can’t improve what you can’t measure. While that principle remains true for AI, our methods of measurement are obsolete. We are still trying to measure a technology that is fluid and dynamic with the tools of a static world—treating metrics as isolated dashboards rather than what they need to be: adaptive levers to steer a living system.

The measurement gap becomes obvious the moment we work with large language models. LLMs moved us away from deterministic, procedural systems and toward probabilistic, interpretive ones. But the real challenge isn’t just the nature of the LLM itself. In our view, an AI product is a system that connects users, experts, models, and infrastructure. And AI product quality now depends on the whole system.

The Domain Expert as the Architect of Meaning

To make quality a first-class design principle, the domain expert’s role transforms from a downstream inspector of outputs into the upstream architect of the system’s values. Reverse engineering their intuitive judgment from outputs is limited and fragile; explicit expert principles must be the blueprint. By moving experts into the role of co-designer¹, we empower them to define the core principles of “good” from the outset — their expertise becomes the objective function around which the entire system is built.

Why Numbers No Longer Tell the Whole Story

AI product teams are building systems that generate human language, yet they still rely on quantitative tools to measure them. Numbers can track accuracy, but they rarely capture context, tone, or trust. Well-chosen proxies², like sentiment scores or user-reported trust, help, but only when paired with qualitative review loops. This reflects a deeper, philosophical tension: true quality lies in the qualitative, and building for it demands a new culture of measurement. Because each function owns a different slice of the numbers, no one sees the full picture. This creates an organizational void, leaving teams asking a critical question: who, exactly, owns AI product quality?
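To make that pairing concrete, here is a minimal Python sketch under assumptions of our own: the proxy_trust_score heuristic, the 0.6 threshold, the sampling rate, and the review queue are hypothetical stand-ins, not a recommended implementation. The point is only the shape of the loop: every output gets a number, and the number alone never decides quality, because low scores and a routine sample both land in front of a human reviewer.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    """Outputs routed to domain experts for qualitative review."""
    items: list = field(default_factory=list)

    def add(self, output: str, reason: str) -> None:
        self.items.append({"output": output, "reason": reason})


def proxy_trust_score(output: str) -> float:
    """A deliberately naive proxy: hedging language lowers the score.
    A real product would swap in sentiment or user-reported trust."""
    hedges = ("might", "unclear", "cannot determine", "not sure")
    penalty = sum(output.lower().count(h) for h in hedges)
    return max(0.0, 1.0 - 0.25 * penalty)


def route(output: str, queue: ReviewQueue, sample_rate: float = 0.1) -> float:
    """Score every output, but never let the number stand alone:
    low scores and a routine random sample both go to human review."""
    score = proxy_trust_score(output)
    if score < 0.6:
        queue.add(output, reason=f"low proxy score {score:.2f}")
    elif random.random() < sample_rate:
        queue.add(output, reason="routine qualitative sample")
    return score


queue = ReviewQueue()
for answer in ["The contract might be unclear on termination.",
               "Payment is due within 30 days of invoicing."]:
    route(answer, queue)
print(f"{len(queue.items)} output(s) queued for expert review")
```

Even with a loop like this in place, the numbers remain split across functions, which brings us back to the question of who owns them.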

The Ownership Vacuum

When product executives are unsure whether product managers, designers, engineers, or AI research teams are responsible for AI quality, that is not a failure. Instead, it signals that the old, siloed roles no longer match today’s cross-functional reality. This vacuum exists in many of the AI startups and teams we talk to. The issue is that a systemic property can’t be assigned to a single function or role. Resolving it requires creating a system where ownership is shared and feedback loops are transparently understood.

Metrics as a Guidance System, Not Just a Ruler

A systems theory lens provides the solution: it allows us to stop looking at isolated metrics and instead see the entire network of incentives we are creating. AI product metrics are the signals that guide the behavior of that system. By carefully designing these incentives, we can create a system that naturally evolves toward strategic goals.

The Product is the Connector

Within this systems-centric framework, the AI product links the key participants: users, domain experts, models, and infrastructure.

The product is the conduit through which these interactions occur, and quality isn’t the output of any single component, but an emergent property of the overall system. That’s why AI quality remains elusive: it can’t be captured by measuring individual parts in isolation. A good AI product isn’t just accurate — it’s coherent. Coherence³, the true quality benchmark, is the state in which all the parts of a system fit together to form a unified whole. It lives in the dialogue between user needs, domain expert intent, and system behavior.

We’ll explore the practical frameworks for building such a system, with concrete use cases, in our upcoming posts.

Coherence is achieved when shared ownership across an AI organization creates a loop of shared meaning-making. This connects the domain expert’s mental map to the real-world needs of the user, replacing the old world of linear, disconnected tools. Instead of relying on observability dashboards that aggregate mechanistic LLM judge evaluations (like asking, “Is this AI model answer complete?”), we move towards a holistic and adaptive process. It is how we move from simply building AI to truly guiding it with intent.


  1. As Hamel Husain notes in “A Field Guide to Rapidly Improving AI Products” (March 24, 2025), successful teams “flip this model by giving domain experts tools to write and iterate on prompts directly.”

  2. Thomas & Uminsky argue metrics are often a proxy and that “we often can’t measure what matters most.” Reliance on Metrics Is a Fundamental Challenge for AI, arXiv 2002.08512 (2020). ↩︎

  3. Edgar Morin, “Coherence and Epistemological Opening,” in On Complexity, Hampton Press, 2008.

Decoding AI Product Development: Why Traditional Playbooks Fall Short

Building AI products for specialized tasks like medical reports, legal contracts, or accounting workflows differs fundamentally from general-purpose LLM interfaces. It requires defining precise behavior, handling edge cases, and measuring quality in domain-specific terms, anchored to a product goal.

Over the past three months, we spoke with more than 50 AI startups, from early-stage builders to scaled teams shipping products with traction. We observed recurring pain points that helped us decode the core challenges of AI product development and solidify our vision. The industry needs tools that go beyond mechanistic oversight toward holistic semantic instrumentation that aligns AI behavior with product intent.

AI product development teams across industries are realizing that once large language models (LLMs) are embedded into real-world, high-stakes products, things break.

The question isn’t whether the model performs well—it’s whether the AI behaves as intended in the specific context of the product. This gap breeds confusion, cycles of rework, and mounting frustration for developers. The consequence? Engineering time gets burned, increasing product risk and cost. One product lead at a growth-stage startup shared: “We lack tools to verify if the AI is working. We just eyeball it.” Another engineering lead described their AI quality process as “extremely manual” and “very anecdotal.”

AI Benchmarks ≠ Product Quality

As Ben Recht recently wrote, most model evaluation benchmarks serve marketing rather than offering practical guidance. They might tell you which LLM scores best on a leaderboard, but they say nothing about how a model performs in your product, primarily because benchmarks lack product context. As one founder put it from firsthand experience: “These [model] benchmarks look impressive, but I still have to read every AI answer to see if it’s usable for our customers.”

When models are studied in isolation, outside the context of their domain-specific applications, this real-world complexity is missed. Unlike research benchmarks, product environments are dynamic, messy, and context-sensitive. One healthcare team said: “We can’t evaluate AI using academic tests. We need to know if it captures the patient’s pain location and severity, because if it misses that, we’re at risk.” A legal AI company echoed the challenge: “Academic evals don’t measure whether the AI includes the citations our clients expect.”
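As an illustration of what such product-anchored expectations can look like, here is a deliberately simplistic Python sketch. Both checks are hypothetical and far cruder than anything a real clinical or legal product would ship, but they show the shift: the test is written in the product’s own terms, not a benchmark’s.

```python
import re


def captures_pain_details(intake_summary: str) -> bool:
    """Crude stand-in for a clinical expectation: the summary must
    say where the pain is and how severe it is."""
    has_location = re.search(r"\b(chest|back|abdomen|head|leg|arm)\b",
                             intake_summary, re.IGNORECASE)
    has_severity = re.search(r"\b(mild|moderate|severe)\b|\d+\s*/\s*10",
                             intake_summary, re.IGNORECASE)
    return bool(has_location and has_severity)


def includes_citations(legal_answer: str) -> bool:
    """Crude stand-in for a legal expectation: the answer must cite
    a section, statute, or case."""
    return bool(re.search(r"§\s*\d+|\bSection\s+\d+|\bv\.\s+[A-Z]", legal_answer))


print(captures_pain_details("Patient reports severe pain in the lower back, 8/10."))  # True
print(includes_citations("Termination is governed by Section 12 of the agreement."))  # True
```

Even checks this simple only exist once a team can articulate what they are checking for, and that is exactly where many teams get stuck.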

Most Teams Have Difficulty Defining “Good”

Defining AI product quality is critical as applications reshape sectors like medicine, law, and finance—core pillars of public infrastructure. Teams need their own quality indicators tied to their goals. But defining and operationalizing those product-specific measures introduces a new layer of complexity, and most teams are unprepared for it.

Ask a PM or engineer, “What does good AI output look like for your product?” and many will hesitate. We’ve seen teams building applications for regulated domains like finance or law struggle to explain how they’d evaluate an AI-generated report. One team asked if we could just tell them what metrics to track, telling us: “We’re not even sure what to look for.”

Too Much Signal, Too Little Insight

But clarity alone isn’t enough. Even with defined goals, teams are overwhelmed by the flood of unstructured data: LLM outputs, edits, feedback, logs… One team described their system as a “nasty bucket” of raw signal with no way to extract meaning. Another team commented, “We have [tried] 17 models. We can’t tell which one is getting closer to what we want.”

Observability tools tell you what happened, but what teams need is help understanding why—and whether it’s acceptable. One founder said: “We collect user reactions but can’t categorize or act on them—it’s just noise.” Another added: “We log every input and output, but we still don’t know what’s working.” More data doesn’t mean more insight. It often means more confusion.

Logs are flat data. They don’t capture meaning. Without semantic connections, teams are flying blind, and the risks can be severe. As one finance team noted: “We’re seeing a 5% error rate, and that’s enough to lose customer trust. But we have no fast way to catch those errors before they go live.”

What’s Missing: Tools That Help Make Meaning

The software product development paradigm has shifted with generative AI. What’s needed is a rethinking of how teams define, track, and evolve AI behavior, one that starts not from code or logs, but from the product objectives. There’s a critical need for mechanisms that translate product intent into automatically verifiable signals. This is essential for evaluating AI products and for moving beyond subjective vibe checks. We call this the semantic instrumentation layer, a concept that guides our journey.
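To make the idea tangible, and without describing our actual implementation, here is one hypothetical shape such a layer could take, sketched in Python: product intent is written down as named expectations, each mapped to an automatically verifiable check, so every output yields signals stated in the product’s own vocabulary rather than raw logs. Every name and structure below is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Expectation:
    """One piece of product intent, stated in the product's own terms."""
    name: str                      # e.g. "mentions_severity"
    description: str               # the intent, as a domain expert would phrase it
    check: Callable[[str], bool]   # an automatically verifiable signal


def instrument(output: str, expectations: List[Expectation]) -> Dict[str, bool]:
    """Turn one model output into named, product-level signals."""
    return {e.name: e.check(output) for e in expectations}


# Hypothetical expectations for a clinical intake product.
expectations = [
    Expectation(
        name="mentions_severity",
        description="The summary states how severe the symptom is.",
        check=lambda text: any(w in text.lower() for w in ("mild", "moderate", "severe")),
    ),
    Expectation(
        name="avoids_diagnosis",
        description="The summary records symptoms without offering a diagnosis.",
        check=lambda text: "diagnos" not in text.lower(),
    ),
]

signals = instrument("Patient reports moderate chest pain since Tuesday.", expectations)
print(signals)  # {'mentions_severity': True, 'avoids_diagnosis': True}
```

The value of a structure like this is that the same named signals can feed dashboards, regression tests, and expert review, keeping intent, measurement, and behavior connected instead of scattered across logs.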

Generative AI has added a new layer of complexity to software. To build the next generation of systems with clarity and purpose, we need tools that transform this complexity into meaning.

That’s the vision we’re building at Lumiflow AI, and we look forward to sharing more. If you’re facing similar challenges, we’d love to hear from you.

Ayşe Naz Erkan
Co-founder & CEO at Lumiflow AI

Follow us on LinkedIn