Decoding AI Product Development: Why Traditional Playbooks Fall Short
By Ayşe Naz Erkan | Published May 2025

Building AI products for specialized tasks like medical reports, legal contracts, or accounting workflows differs fundamentally from building general-purpose LLM interfaces. It requires defining precise behavior, handling edge cases, and measuring quality in domain-specific terms, anchored to a product goal.
Over the past three months, we spoke with more than 50 AI startups, from early-stage builders to scaled teams shipping products with traction. We observed recurring pain points that helped us decode the core challenges of AI product development and solidify our vision. The industry needs tools that go beyond mechanistic oversight toward holistic semantic instrumentation that aligns AI behavior with product intent.
AI product development teams across industries are realizing that once large language models (LLMs) are embedded into real-world, high-stakes products, things break.
The question isn’t whether the model performs well; it’s whether the AI behaves as intended in the specific context of the product. This gap breeds confusion, cycles of rework, and mounting frustration for developers. The consequence? Engineering time gets burned, increasing product risk and cost. One product lead at a growth-stage startup shared: “We lack tools to verify if the AI is working. We just eyeball it.” Another engineering lead described their AI quality process as “extremely manual” and “very anecdotal.”

AI Benchmarks ≠ Product Quality
As Ben Recht recently wrote, most model evaluation benchmarks serve marketing more than they offer practical guidance. They might tell you which LLM scores best on a leaderboard, but they say nothing about how a model performs in your product, primarily because benchmarks lack product context. As one founder put it from firsthand experience: “These [model] benchmarks look impressive, but I still have to read every AI answer to see if it’s usable for our customers.”
Unlike research benchmarks, product environments are dynamic, messy, and context-sensitive; when models are studied in isolation, outside the context of their domain-specific applications, that real-world complexity is missed. One healthcare team said: “We can’t evaluate AI using academic tests. We need to know if it captures the patient’s pain location and severity, because if it misses that, we’re at risk.” A legal AI company echoed the challenge: “Academic evals don’t measure whether the AI includes the citations our clients expect.”
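To make the gap concrete, here is a minimal sketch in Python of the kind of product-specific check these teams describe. The keyword lists, regex, and function names are entirely hypothetical, not taken from any team quoted above: a benchmark score cannot tell you whether a clinical summary captures pain location and severity, or whether a legal answer carries citations, but a small domain-aware check can.

```python
import re

# Hypothetical vocabularies for a clinical-notes product; a real team would
# derive these from their own domain requirements.
PAIN_LOCATIONS = {"head", "chest", "abdomen", "back", "knee", "shoulder"}
SEVERITY_TERMS = {"mild", "moderate", "severe", "7/10", "8/10", "9/10", "10/10"}

def captures_pain_details(summary: str) -> bool:
    """Check that an AI-generated clinical summary mentions both a pain
    location and a severity descriptor."""
    text = summary.lower()
    has_location = any(loc in text for loc in PAIN_LOCATIONS)
    has_severity = any(sev in text for sev in SEVERITY_TERMS)
    return has_location and has_severity

def includes_citations(answer: str) -> bool:
    """Check that a legal answer contains at least one bracketed citation
    with a year, e.g. [Smith v. Jones, 2021]."""
    return bool(re.search(r"\[[^\]]+\d{4}[^\]]*\]", answer))

if __name__ == "__main__":
    summary = "Patient reports severe pain in the lower back since Tuesday."
    print(captures_pain_details(summary))  # True
    answer = "The clause is unenforceable [Smith v. Jones, 2021]."
    print(includes_citations(answer))      # True
```

Checks like these are crude, but they are anchored to the product’s definition of risk rather than to a leaderboard.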

Most Teams Have Difficulty Defining “Good”
Defining AI product quality is critical as applications reshape core pillars of public infrastructure: medicine, law, and finance. Teams need their own quality indicators tied to their goals. But defining and operationalizing those product-specific measures introduces a new layer of complexity, and most teams are unprepared for it.
Ask a PM or engineer, “What does good AI output look like for your product?” and many will hesitate. We’ve seen teams building applications for regulated domains like finance or law struggle to explain how they’d evaluate an AI-generated report. One team asked if we could simply tell them what metrics to track, admitting: “We’re not even sure what to look for.”
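One way past “we’re not even sure what to look for” is to write the definition of good down as explicit, checkable criteria rather than leaving it in people’s heads. Below is a minimal sketch, assuming a hypothetical AI-generated financial report and made-up criterion names and predicates; the point is the structure, not the specific rules.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QualityCriterion:
    """One product-specific statement of what 'good' means, with a
    machine-checkable predicate and a severity level for triage."""
    name: str
    description: str
    check: Callable[[str], bool]  # returns True if the output satisfies the criterion
    severity: str                 # "blocker", "major", or "minor"

# Hypothetical criteria for an AI-generated financial report.
CRITERIA = [
    QualityCriterion(
        name="no_unsourced_figures",
        description="Every numeric figure must be traceable to an input document.",
        check=lambda output: "[source:" in output,  # placeholder predicate
        severity="blocker",
    ),
    QualityCriterion(
        name="includes_disclaimer",
        description="The report must end with the firm's standard disclaimer.",
        check=lambda output: output.rstrip().endswith("This is not investment advice."),
        severity="major",
    ),
]

def evaluate(output: str) -> dict[str, bool]:
    """Run every criterion against one AI output and report pass/fail."""
    return {c.name: c.check(output) for c in CRITERIA}
```

Even a rough list like this turns “what does good AI output look like?” into something a PM and an engineer can debate line by line, and it gives later tooling something concrete to measure against.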

Too Much Signal, Too Little Insight
But clarity alone isn’t enough. Even with defined goals, teams are overwhelmed by the flood of unstructured data: LLM outputs, edits, feedback, logs… One team described their system as a “nasty bucket” of raw signal with no way to extract meaning. Another team commented, “We have [tried] 17 models. We can’t tell which one is getting closer to what we want.”
Observability tools tell you what happened, but what teams need is help understanding why—and whether it’s acceptable. One founder said: “We collect user reactions but can’t categorize or act on them—it’s just noise.” Another added: “We log every input and output, but we still don’t know what’s working.” More data doesn’t mean more insight. It often means more confusion.
Logs are flat data. They don’t capture meaning. Without semantic connections, teams are flying blind, and the risks can be severe. As one finance team noted: “We’re seeing a 5% error rate, and that’s enough to lose customer trust. But we have no fast way to catch those errors before they go live.”
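One pattern for catching such errors before they go live is to attach semantic labels to outputs at generation time instead of mining flat logs after the fact. The sketch below reuses the hypothetical CRITERIA and evaluate() from the earlier example to gate individual outputs and to compare models by failure rate; it is illustrative only and does not describe any specific team’s pipeline.

```python
from collections import Counter

# Assumes CRITERIA and evaluate() from the quality-criteria sketch above.

def gate_output(output: str) -> tuple[bool, list[str]]:
    """Run the product's quality criteria before an output reaches a user.
    Returns (ship_it, failed_criteria); anything failing a 'blocker'
    criterion is held back for review instead of going live."""
    results = evaluate(output)
    failed = [name for name, passed in results.items() if not passed]
    blockers = {c.name for c in CRITERIA if c.severity == "blocker"}
    ship_it = not (set(failed) & blockers)
    return ship_it, failed

def failure_rates(outputs_by_model: dict[str, list[str]]) -> dict[str, Counter]:
    """Aggregate criterion failures per model, so 'which of our 17 models is
    closer to what we want' becomes a question with an answer."""
    rates: dict[str, Counter] = {}
    for model, outputs in outputs_by_model.items():
        counts: Counter = Counter()
        for out in outputs:
            _, failed = gate_output(out)
            counts.update(failed)
        rates[model] = counts
    return rates
```

With per-criterion failure counts attached to every output, a pile of logs stops being a “nasty bucket” and becomes a comparison on the dimensions the product actually cares about.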

What’s Missing: Tools That Help Make Meaning
The software product development paradigm has shifted with generative AI. What’s needed is a rethinking of how teams define, track, and evolve AI behavior, one that starts not from code or logs but from the product’s objectives. There’s a critical need for mechanisms that translate product intent into automatically verifiable signals; that is essential for evaluating AI products and for moving beyond subjective vibe checks. We call this the semantic instrumentation layer, a concept that guides our journey.
Generative AI has added a new layer of complexity to software. To build the next generation of systems with clarity and purpose, we need tools that transform this complexity into meaning.
That’s the vision we’re building at Lumiflow AI, and we look forward to sharing more. If you’re facing similar challenges, we’d love to hear from you.