Making Diagnostic Analytics Trustworthy
Customers won’t take your word for it. Diagnostic AI needs to prove its accuracy.
Descriptive analytics shows you what happened (e.g., conversion dropped 15%). Diagnostic analytics explains why (e.g., because Safari mobile users hit a form bug introduced in a recent release).
Most analytics tools today are descriptive. We built Amplitude's automated insights to be diagnostic because teams need a clearer understanding of cause and effect. The system reads experiments, releases, annotations, and segments, and uses all that context to form hypotheses about the most likely explanations.
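To make this concrete, here is a minimal, self-contained sketch of what that context-gathering step could look like. The data sources, field names, and `Hypothesis` shape are illustrative assumptions, not Amplitude's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    explanation: str
    evidence: list[str] = field(default_factory=list)

# Hypothetical in-memory stand-ins for real data sources.
RELEASES = [{"version": "4.2.0", "date": "2024-03-01", "notes": "form refactor"}]
SEGMENTS = {"Safari mobile": -0.31, "Chrome desktop": -0.02}  # conversion deltas

def diagnose(metric: str) -> list[Hypothesis]:
    """Collect the context an analyst would check first, then turn
    outliers into candidate explanations."""
    hypotheses = []
    # A segment that moved far more than the others is a strong lead.
    worst_segment, delta = min(SEGMENTS.items(), key=lambda kv: kv[1])
    if delta < -0.10:
        hypotheses.append(Hypothesis(
            explanation=f"{metric} drop concentrated in {worst_segment}",
            evidence=[f"{worst_segment} moved {delta:+.0%}"],
        ))
    # A release landing just before the change is another lead.
    for release in RELEASES:
        hypotheses.append(Hypothesis(
            explanation=f"Release {release['version']} ({release['notes']}) shipped just before the change",
            evidence=[f"deployed {release['date']}"],
        ))
    return hypotheses

for h in diagnose("conversion"):
    print(h.explanation, h.evidence)
```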
Diagnostic analytics will be a huge step forward for customers who want more value out of their data analysis. But we can't simply tell our customers that we trained AI to uncover root causes and expect them to believe us. We know we have to repeatedly show them our AI is accurate to earn their trust.
Why trust matters
We all know trust matters. If Amplitude's output is incorrect, teams end up making the wrong decisions. It's lose-lose: our customers take a step back, and we lose their trust. The only way to give customers value (and earn their trust) with automated insights was to make it accurate. Before we shipped anything, we knew we had to define accuracy and measure it in a way that credibly mapped to how people use insights every day.
We quickly found that diagnostic AI doesn’t need to be perfect to provide value. It simply needs to be consistently helpful. That finding became our guiding principle.
How we measure insight accuracy
We evaluate our automated insights capability against two components:
- Real cases: historical examples that human analysts have already investigated, so the correct root cause is known in advance.
- A separate judging model: an independent model, distinct from the system being evaluated, that grades whether the AI discovered the correct explanation. Its judgments track closely with those of human reviewers.
Then we tracked recall, precision, and insightfulness (i.e., whether the system produced at least one correct insight). Interestingly, insightfulness turned out to be the most meaningful measure.
Analysts rarely need a perfect, fully polished narrative. With a clear starting point, they can get an answer fast. Once the system could produce a correct insight 80% of the time, we knew it could dramatically reduce the time analysts spent investigating issues. That level of accuracy was enough for us to move forward.
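As a sketch of what this scoring could look like (the string-matching `judge` below is a toy stand-in for the independent judging model, and the eval cases are invented):

```python
def judge(insight: str, truth: set[str]) -> bool:
    # Toy judge: the real one is a separate model, not string equality.
    return insight in truth

def score(cases: list[dict]) -> dict:
    tp = fp = fn = 0
    insightful = 0
    for case in cases:
        correct = {i for i in case["insights"] if judge(i, case["truth"])}
        tp += len(correct)
        fp += len(case["insights"]) - len(correct)
        fn += len(case["truth"] - correct)
        insightful += bool(correct)  # at least one correct insight
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "insightfulness": insightful / len(cases),
    }

cases = [
    {"insights": ["form bug", "bot spike"], "truth": {"form bug"}},
    {"insights": ["seasonality"], "truth": {"pricing change"}},
]
print(score(cases))  # insightfulness = 0.5: one of two cases had a hit
```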
Using confidence levels for partial insights
Some problems are more nuanced and don’t have a single definitive explanation. Bot activity is a great example of this. You can often identify bot-like patterns, but quantifying their exact impact is nearly impossible.
Instead of pretending to know more than it does, we designed our AI to report levels of confidence. It might flag bot traffic as a likely factor without overstating precision. Customers consistently tell us that even partial insights help them work faster. A hint that points in the right direction often unlocks the next step. Even disproving a hypothesis is valuable because it narrows the investigation.
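One plausible way to represent a partial insight, assuming a simple confidence field rather than Amplitude's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Insight:
    explanation: str
    confidence: str  # "high" | "medium" | "low"
    caveat: str = ""

# A partial insight: the pattern is visible, but its exact impact is not.
bot_insight = Insight(
    explanation="Traffic spike matches bot-like patterns (uniform timing, no scroll events)",
    confidence="medium",
    caveat="Exact share of bot traffic cannot be quantified from events alone",
)
```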
Transparency about uncertainty turns our automated insights capability into a collaborator rather than a black box for teams.
Transparency builds trust
Analysts trust insights more when they can see the underlying logic. Our AI exposes live reasoning so users can watch the system work in real time, including which tools it calls and what information it checks. It also surfaces inline citations, linking all of its assumptions and findings directly to the sources it used to arrive at that conclusion.
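A toy illustration of the idea (the tool names and citation paths are invented): each reasoning step the system streams carries the tool it called and a citation back to its source.

```python
from typing import Iterator

def reasoning_stream(question: str) -> Iterator[dict]:
    # Hypothetical event stream a UI could render in real time.
    yield {"step": "Checking recent releases", "tool": "release_log",
           "citation": "releases/4.2.0"}
    yield {"step": "Breaking conversion down by browser", "tool": "segmentation",
           "citation": "charts/conversion-by-browser"}
    yield {"step": "Form errors spiked on Safari mobile after 4.2.0",
           "tool": "event_search", "citation": "events/form_error"}

for event in reasoning_stream("Why did conversion drop?"):
    print(f"- {event['step']} [source: {event['citation']}]")
```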
We have found that most people check everything closely the first few times. Once the results hold up, they become less skeptical.
What we learned from failure
Because we built evals into our development loop, we could clearly see recurring areas for improvement: missing tools, missing context, misordered steps, too much data in the context window, prompts that lacked sufficient guidance, and so on.
Each issue pointed directly to the fix. Missing bot detection? We needed to build a tool for it. Missing release context? We needed to pull it into the workflow. Funnel root cause hidden between steps? We needed to create micro-funnel analysis.
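As a rough sketch, simply tagging eval failures by category and counting them is enough to prioritize fixes; the labels below are invented for illustration:

```python
from collections import Counter

# Hypothetical failure labels assigned while reviewing eval transcripts.
FAILURES = [
    "missing_tool:bot_detection",
    "missing_context:release_notes",
    "missing_tool:bot_detection",
    "context_overflow",
    "weak_prompt_guidance",
]

# Counting categories shows which fix pays off first.
by_category = Counter(label.split(":")[0] for label in FAILURES)
for category, count in by_category.most_common():
    print(f"{category}: {count}")
# missing_tool appears most often -> build the bot-detection tool first.
```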
Evals and this tight feedback loop let us continually improve the system in ways that aligned more closely with analyst workflows, rather than with guesses or hypothetical methods.
When there is more than one valid explanation
Real-world data is often ambiguous. We wanted our model to account for that. As a result, instead of only offering a single answer, our automated insights capability can present multiple plausible explanations and let the analyst decide which is best.
This creates a collaborative partnership between the analyst and the AI: together, they decide which hypotheses to explore. It also makes the system more realistic because, in practice, teams often weigh several hypotheses before landing on the right one.
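A minimal illustration of this, with invented candidates and confidence scores:

```python
# When evidence is ambiguous, surface several ranked candidates
# instead of forcing a single verdict. Values here are made up.
candidates = [
    ("Form bug introduced in release 4.2.0", 0.7),
    ("Bot traffic inflating the denominator", 0.4),
    ("Seasonal dip in mobile traffic", 0.2),
]
for explanation, confidence in sorted(candidates, key=lambda c: -c[1]):
    print(f"{confidence:.0%}  {explanation}")
# The analyst decides which hypothesis to investigate first.
```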
Descriptive → diagnostic → predictive
The evolution of Amplitude's AI mirrors how LLMs have changed. We started with descriptive analytics: Ask Amplitude translates natural language into charts. Our automated insights capability adds diagnostic analytics, helping users quickly understand why metrics change. The next natural frontier is predictive analytics, which will help everyone understand what is likely to happen next.
Predictive analytics requires strong diagnostic tools. It’s impossible to forecast the future effectively if you do not understand the forces behind past changes. We feel confident that the diagnostic foundation we are building today will power the predictive tools that come next.
It starts and ends with trust
Making diagnostic analytics trustworthy is not about making AI sound smarter. It’s about giving people insights they can rely on. Our AI will earn trust by showing its work, expressing uncertainty honestly, learning from failures, and anchoring its reasoning in patterns that mirror how analysts think.
These same principles apply to anyone evaluating or building AI systems designed to explain, recommend, or diagnose. Trust is not something you bolt on at the end. It’s something you earn through design choices that prioritize clarity and transparency.
Henry Arbolaez
Senior Software Engineer
Henry Arbolaez is a Senior Software Engineer at Amplitude working on AI-powered products. He enjoys building from zero to one, loves good coffee, and is always looking for the next place to travel.