# Why We Created Agent Analytics, and Why Every Team Building AI Agents Needs It

The moment our product became an AI agent, our entire observability stack became irrelevant, which is not something you want as an analytics company. Here's what we did.

Source: https://amplitude.com/en-us/blog/agent-analytics

---

[Vinay Goel](/blog/author/vinay-goel)

[Staff AI Engineer, Amplitude](/blog/author/vinay-goel)


*This blog post was co-authored by Jacob Newman, Principal AI PM, and Nikhil Gangaraju, PMM lead at Amplitude.*

Amplitude is a product analytics company. We've spent more than a decade helping product teams understand what their users are doing and why. Funnels, retention curves and behavioral cohorts. We know this stuff. And yet, [when we started building our own AI agent last year](https://amplitude.com/blog/the-last-bottleneck), we found ourselves in a weird spot:

**We had no idea if our agent was actually *good*.**

Not "good" the way we'd normally measure a product. We had engagement metrics, activation rates, NPS scores and offline evals. None of them told us the thing we needed to know: When a user asked our agent a question, did it give them a useful answer? When it failed, why? Was it a bad prompt? Missing context? A broken tool call? If users had a bad experience, did they churn? If they had a great one, did they retain better or upgrade?

We'd built our careers on the idea that great products come from understanding user behavior. But the moment our “product” became a non-deterministic AI agent, our entire observability stack became irrelevant. So we built Agent Analytics to fix that. [It's now in closed beta](https://amplitude.com/agent-analytics) with a set of design partners.

With Agent Analytics we can finally answer the question we couldn't at the start. The rest of this post is how.

### Why traditional product analytics failed us

Our existing analytics still worked for what they were designed to do. We could see how many people opened the agent. We could track session length, feature adoption, retention. The dashboards were green. The charts went up and to the right.

But a user can open an agent, have a two-minute conversation, and leave furious. That shows up in traditional analytics as an engaged session. Two minutes! Multiple events fired! The funnel says “activated.” In reality, the user asked a question, the agent hallucinated, and the user decided our product was broken.

Traditional product analytics was built for a world where users click buttons and navigate pages, where the product’s behavior is deterministic and you can infer intent from actions. Agents flip that.

With an agent, the user states their intent explicitly (they literally type what they want), but the product's behavior is unpredictable. The agent might call the right tools in the right order and nail the answer. Or it might misinterpret the request, call the wrong tool, hallucinate a data point, and confidently present garbage as insight. Both paths generate events. Both look identical in a traditional analytics dashboard.

We started to realize our blind spots were everywhere.

- **We couldn't measure quality.** Was the agent's output accurate? Did it answer what the user asked? Traditional analytics tracks that something happened, not whether the experience was good. For non-deterministic products that generate free-form responses, that distinction was a big shift.
- **We couldn't debug failures either.** When something went wrong (and with agents, things go wrong in creative ways), we had no systematic way to understand the chain of reasoning that led there. Was the system prompt too vague? Did retrieval return irrelevant context? Did the model hallucinate despite having the right information? Each failure mode requires a completely different fix.
- **Experimentation was a mess.** In traditional product development, you A/B test a feature release or a flow change and measure the outcome. With agents, the variables are prompt wording, context window composition, tool selection logic, model temperature. The outcomes are qualitative: Did the response make sense? Was it helpful? Did the user trust it? Our experimentation framework wasn't built for any of this.
- **And we couldn't measure ROI.** The team was asking the right questions: “Do great agent interactions encourage free users to upgrade?” “When our agents hallucinate, what does it cost us?” But with no quality signals beyond shallow engagement data, nobody could answer them.

The irony was hard to miss. A decade of telling customers you can’t build great products without great analytics, and now we were building an agent with essentially no signal into whether it was working.

We talked to other teams building agents. The pattern was universal. Everyone was cobbling together logging frameworks, LLM observability tools, and some vibes. The observability tools gave us traces, so we could see what the model did step by step, but those traces operated at the infrastructure level. They told us how the model was running, not whether the product was succeeding. On the flip side, traditional analytics tools told us about user behavior around the agent but couldn't see inside the conversation.

Nobody had the full picture. So we built Agent Analytics.

### What is Agent Analytics?

Agent Analytics sits between product analytics and LLM observability. It’s not a logging tool. It’s also not a dashboarding layer on top of LLM traces. It’s a system for understanding how your AI agent performs as a product, from the user’s perspective, with enough depth to actually diagnose and improve it.

When building Agent Analytics, we didn’t set out to replace the deep technical traces engineers use to monitor infrastructure. Instead, it acts as a bridge: it takes those technical signals and translates them into the language of the product, such as user retention, conversion, and intent.

A trace is the full record of a conversation between a user and your agent: a multi-turn interaction with intent, reasoning, tool usage, and an outcome that can be evaluated. Unlike observability tools that stop at the trace, Agent Analytics decomposes these conversations into events, directly queryable in the same funnels, cohorts, and retention analyses you already use for everything else.
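To make the idea of decomposing a trace into events concrete, here is a minimal sketch. It is not Amplitude's schema or SDK: the `Trace` and `Turn` classes, the event names, and the property names are all hypothetical, chosen only to show how one conversation could become several events that carry the same `user_id` as the rest of your product data.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Turn:
    role: str                        # "user", "agent", or "tool"
    text: str = ""                   # message content (may be empty for tool calls)
    tool_name: Optional[str] = None  # set only when role == "tool"


@dataclass
class Trace:
    trace_id: str
    user_id: str
    turns: list[Turn] = field(default_factory=list)
    outcome: str = "unknown"         # e.g. "success", "partial", "failure"


# Hypothetical event names, one per turn role.
EVENT_TYPES = {
    "user": "agent_user_message",
    "agent": "agent_response",
    "tool": "agent_tool_call",
}


def trace_to_events(trace: Trace) -> list[dict[str, Any]]:
    """Flatten one conversation into per-turn events plus one summary event."""
    events: list[dict[str, Any]] = []
    for index, turn in enumerate(trace.turns):
        events.append({
            "event_type": EVENT_TYPES[turn.role],
            "user_id": trace.user_id,    # same identity as other product events
            "trace_id": trace.trace_id,  # lets you regroup events into the trace
            "turn_index": index,
            "tool_name": turn.tool_name,
        })
    # One summary event per conversation carries the evaluated outcome.
    events.append({
        "event_type": "agent_trace_completed",
        "user_id": trace.user_id,
        "trace_id": trace.trace_id,
        "turn_count": len(trace.turns),
        "outcome": trace.outcome,
    })
    return events
```

Because every event keeps both `trace_id` and `user_id`, the same rows can be read as one conversation or as steps in a funnel alongside ordinary product events.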

When you treat traces as the primary object for your analytics, a few things start to open up:

**You can see what your users are actually trying to do.** Agent Analytics clusters user queries by intent, surfaces the most common requests, and shows you where your agent consistently delivers versus where it falls apart. This is the product-market fit signal that traditional analytics can't give you. You're no longer guessing from behavior patterns. Instead, users inside your agent are telling you.

**You can trace failures to the root cause.** When the agent fails, you drill into the full trace: what the user asked, how the agent interpreted it, which tools it called, what context was retrieved, where things went sideways. Prompt issue? Tool issue? Context issue? This is what makes the Agent Analytics system actionable.

**You can measure quality at scale.** Every trace gets evaluated automatically via configurable evals to determine whether the agent succeeded, partially succeeded, or failed. You can track quality over time, across user segments, across query types. When you deploy a new prompt or add a new tool, you see whether quality went up or down in production, continuously.
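As a rough illustration of what a configurable eval might look like, the sketch below runs a set of named checks over a trace and collapses them into the three verdicts mentioned above. Everything here is an assumption made for illustration: the trace fields, the check names, and the rule that "all checks pass" means success while "no checks pass" means failure. The post does not describe how Agent Analytics scores traces, and a production eval would more likely use an LLM judge than simple rules.

```python
from typing import Callable

Check = Callable[[dict], bool]


def evaluate_trace(trace: dict, checks: dict[str, Check]) -> dict:
    """Run every configured check and collapse the results into one verdict."""
    if not checks:
        raise ValueError("at least one check is required")
    results = {name: bool(check(trace)) for name, check in checks.items()}
    passed = sum(results.values())
    if passed == len(results):
        verdict = "success"
    elif passed == 0:
        verdict = "failure"
    else:
        verdict = "partial"
    return {"verdict": verdict, "checks": results}


# Hypothetical rule-based checks; a real setup might call an LLM judge here.
CHECKS: dict[str, Check] = {
    "has_answer": lambda t: bool(t.get("final_answer", "").strip()),
    "no_tool_errors": lambda t: all(c["ok"] for c in t.get("tool_calls", [])),
}
```

Keeping the per-check results next to the verdict means a failed trace already says which check it failed, which is the starting point for the root-cause drill-down described earlier.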

### Agent Analytics informs your evals with product signal

When we [shipped Global Agent with a 76% pass rate](https://amplitude.com/blog/ai-analytics-agents-task-based-evaluation) on our offline eval set, we thought the hard part was behind us. That eval set had been essential for getting the agent to a credible baseline, but it could not capture the diversity of how customers actually used the agent once it was live. Our carefully curated set of "realistic" questions turned out to look very little like the questions real users typed.

We have invested heavily in [building offline evals](https://amplitude.com/blog/eval-driven-development) as part of our AI development lifecycle. However, offline evals will never represent the full problem space as you ship to production. The closest thing to the truth is the online signal: evaluations that run on real conversations, with real users, at the scale and variety production actually produces. The real unlock, however, came after combining our online evals with product data.

While plenty of tools can now score a trace in production, without product data they can't connect that trace to what the user did next: whether they upgraded the following week or churned the following month, or how your most engaged users work the agent differently from everyone else. By capturing the trace alongside everything the same user did before and after it, under one identity, we could evaluate the experience, not just the transcript. Product data informed us whether the experience was good. Online evals are simply better sitting on top of it.
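The join described above can be sketched in a few lines. This is an illustrative example on made-up data, not how Agent Analytics is implemented: the field names, the `plan_upgraded` event name, and the seven-day window are all assumptions. It asks, for each eval verdict, what share of traces were followed by an upgrade from the same user.

```python
from datetime import datetime, timedelta


def upgrade_rate_by_verdict(traces, product_events, window_days=7):
    """Share of traces per verdict followed by an upgrade by the same user."""
    window = timedelta(days=window_days)
    totals: dict[str, int] = {}
    upgraded: dict[str, int] = {}
    for trace in traces:
        verdict = trace["verdict"]
        totals[verdict] = totals.get(verdict, 0) + 1
        deadline = trace["ended_at"] + window
        followed_by_upgrade = any(
            event["user_id"] == trace["user_id"]        # shared identity
            and event["event_type"] == "plan_upgraded"  # hypothetical event name
            and trace["ended_at"] < event["timestamp"] <= deadline
            for event in product_events
        )
        if followed_by_upgrade:
            upgraded[verdict] = upgraded.get(verdict, 0) + 1
    return {v: upgraded.get(v, 0) / count for v, count in totals.items()}


# Made-up sample data, for illustration only.
traces = [
    {"user_id": "u1", "verdict": "success", "ended_at": datetime(2025, 1, 1, 12)},
    {"user_id": "u2", "verdict": "success", "ended_at": datetime(2025, 1, 2, 9)},
    {"user_id": "u3", "verdict": "failure", "ended_at": datetime(2025, 1, 3, 15)},
]
product_events = [
    {"user_id": "u1", "event_type": "plan_upgraded", "timestamp": datetime(2025, 1, 4, 10)},
    {"user_id": "u3", "event_type": "plan_upgraded", "timestamp": datetime(2025, 2, 20, 10)},
]
rates = upgrade_rate_by_verdict(traces, product_events)
```

A comparison like this shows correlation, not causation, so it is a prompt for an experiment rather than a conclusion on its own.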

### Understand the full picture behind Agent conversations

Another consideration is understanding the real picture behind what people are accomplishing with the agent. Knowing that "Payment Scheduling" fails 31% of the time while "General Questions" fails only 3% is something you can act on if you can connect those numbers to real product outcomes.
Agent Analytics uses the embeddings it already generates to cluster conversations into themes automatically, then overlays per-topic quality: task completion, friction, negative feedback. You plug in and see what your agent is being asked about and where it's struggling, by topic, without anyone building a dashboard first. This is also the most direct route from production back to a better offline set: the topics that fail most are exactly the slices your reference dataset should cover and probably doesn't.
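Once each conversation has a topic label and an eval verdict, the per-topic overlay is a small aggregation. The sketch below assumes the clustering has already happened upstream and that each trace is a dictionary with hypothetical `topic` and `verdict` fields; it only shows the ranking step, not the embedding or clustering itself.

```python
from collections import Counter


def failure_rate_by_topic(traces):
    """Return (topic, failure_rate) pairs, worst topic first."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for trace in traces:
        totals[trace["topic"]] += 1
        if trace["verdict"] == "failure":
            failures[trace["topic"]] += 1
    rates = {topic: failures[topic] / totals[topic] for topic in totals}
    # Highest failure rate first, so the top rows are the slices to fix
    # and to add to the offline reference dataset.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

In practice you would likely also keep the trace count per topic next to the rate, since a high failure rate on a handful of conversations means less than a moderate one on thousands.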

It's worth being clear-eyed here too. Real-time topic clustering is becoming table stakes; we're not the only ones who can group conversations. The difference is what the clusters connect to. When a topic cluster shares a user identity with the rest of your product data, "Payment Scheduling fails 31% of the time" stops being a quality stat and becomes a business question: which of those failures cost us a renewal, and what is the most expensive topic to get wrong?

### Experiment on what drives agent behavior

Once agent quality is measurable, experimentation changes shape. Instead of A/B testing a surface-level UI change, you experiment on the inputs that actually drive agent behavior: prompt variations, tool configurations, context strategies, model parameters.

We're already seeing autonomous experiment loops where agents run hundreds of prompt variations overnight and keep the best performers. Shopify's Sidekick team [built an LLM-powered simulator](https://shopify.engineering/building-production-ready-agentic-systems) to run candidate systems overnight and select a winner. But systems like that optimize for model-quality metrics such as validation loss. They have no way to know whether the winning variant improved retention or conversion for real users.

With Agent Analytics, you can measure the variant's effect on interaction quality and on the downstream outcomes that pay for the agent.

### How Agent Analytics connects to business outcomes

The question every team building agents eventually asks: Which interactions drive our key metrics? Is the agent improving long-term retention? What about revenue? Which behaviors cause churn?

Answering this question can actually shift AI from cost center to revenue line item. Neither model providers (they don’t have downstream user data) nor observability tools (they don’t have product analytics) can get here on their own. In some ways, it looks like attribution modeling for agents.

Existing observability tools focus on the trace itself. But what was the user doing five minutes before they opened your agent? What did they do two minutes after? Two days after? Did a successful interaction lead to feature adoption, an upgrade, a referral? Did a failure correlate with churn 30 days later?

Inside Agent Analytics, AI session data and product event data share the same user identity. These questions stop being data engineering projects. For example, we found internally that Amplitude users who had high-quality agent sessions retained at 2.3x the rate of users who hit task failures.

Let’s take a hypothetical example of an agent that handles both product recommendations and returns. Recommendation sessions convert to purchase at 35%. Return sessions, however, convert at 2% and cost 3.5x more to run, because the agent keeps asking for the same information, triggering retries, and dragging out conversations until the user quits. In this scenario, you’re spending the most on the thing that’s failing the most.

Agent Analytics shows you both gaps (quality and cost) in one place. So instead of “our AI costs X per month,” you’re asking, “Why are we spending 3.5x more per session on the topic with the worst outcomes?”
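The hypothetical numbers above can be combined into a single cost-per-conversion figure, which makes the gap easier to see. The sketch below uses only the rates from the example (35% and 2% conversion, 3.5x session cost); the baseline session cost of 1.0 is an arbitrary unit assumed for illustration.

```python
def cost_per_conversion(cost_per_session: float, conversion_rate: float) -> float:
    """Average spend needed to produce one converted session."""
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return cost_per_session / conversion_rate


BASELINE_COST = 1.0  # assumed unit cost of one recommendation session

recommendations = cost_per_conversion(BASELINE_COST, 0.35)
returns = cost_per_conversion(3.5 * BASELINE_COST, 0.02)
ratio = returns / recommendations
```

In baseline units, a converted recommendation session costs about 1 / 0.35 ≈ 2.86 and a converted return session costs 3.5 / 0.02 = 175, so each successful return session costs roughly 61 times as much as a successful recommendation session. The 3.5x per-session figure understates the gap because it ignores how rarely return sessions succeed.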

### Content-optional analytics

The number one concern we heard from privacy-sensitive customers early on: “We can't send you prompt content.”

You don't need content to get value. But you need it for the full value.

We’ve built privacy tiers that work as an adoption onramp. Even at the metadata-only tier, you get cost analytics, retention segmentation, and behavioral signals like regeneration rate and abandonment.

What you lose without content is the enrichment layer: the automatic “your returns agent is failing on refund requests for enterprise users” that surfaces without anyone building a dashboard.
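To show why metadata alone can still be useful, here is a small sketch that computes two behavioral signals from session records containing no prompt or response text. The record fields (`responses`, `regenerations`, `abandoned`) and the exact definitions of the two rates are assumptions made for this example; the post does not define how Agent Analytics calculates them.

```python
def behavioral_signals(sessions):
    """Compute content-free quality proxies from per-session metadata."""
    if not sessions:
        raise ValueError("at least one session is required")
    total_responses = sum(s["responses"] for s in sessions)
    total_regenerations = sum(s["regenerations"] for s in sessions)
    abandoned_sessions = sum(1 for s in sessions if s["abandoned"])
    return {
        # Assumed definition: regenerated responses / all agent responses.
        "regeneration_rate": (
            total_regenerations / total_responses if total_responses else 0.0
        ),
        # Assumed definition: sessions flagged as abandoned / all sessions.
        "abandonment_rate": abandoned_sessions / len(sessions),
    }
```

Neither signal says what went wrong, only that something did, which is the trade-off the metadata-only tier makes.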

### In closing

We built Agent Analytics because we had to. We were building an agent and we couldn’t see whether it was working. The existing tools gave us pieces of the picture, infrastructure metrics over here, engagement numbers over there, but nobody was stitching together the full story from the user’s perspective.

What we found most valuable was going from a failure to acting on it. When we saw a specific group of users hitting a “hallucination loop,” we didn't just log it; we used that data to immediately test a fix or guide those specific users toward a better outcome.

If you’re building an agent right now, you’re probably feeling the same thing. You ship a new version, update the system prompt, add new tools, and you don’t really know if it’s better. Your team might be reviewing traces by hand because there’s no automated way to measure quality. When something goes wrong, you’re digging through scattered logs trying to reconstruct what happened. You want to experiment with prompts or tools, but you don’t have a rigorous way to measure the impact.

###### Early Access

If you're working on an agent or any of this sounds interesting to you, we'd love to get your feedback on Agent Analytics.

[Click Here ](https://amplitude.com/agent-analytics#contact)

About the author


Vinay is a Staff AI Engineer at Amplitude. He builds the foundational AI platforms that empower internal innovation and help define the future of AI-driven analytics at scale.

