# Putting A Number On AI Quality

The Economist used Amplitude Agent Analytics to move from sampled reviews to measurable AI quality, reaching 96.9% task success and cutting failures by 84%.

Source: https://amplitude.com/en-us/blog/putting-a-number-on-ai-quality

---

[Fionn O'Raghallaigh](/blog/author/fionn-oraghallaigh)

[AI Group Product Manager, The Economist](/blog/author/fionn-oraghallaigh)


The Economist Group is best known for its newspaper, published since 1843. In 1946, the Economist Intelligence Unit (EIU) was set up to answer the questions Economist readers were asking. Today EIU helps businesses, financial firms and governments understand how the world is changing and how that creates opportunities to be seized and risks to be managed. EIU’s team of analysts covers nearly every country in the world with analysis, forecasts and indicators to back decisions. They create a lot of content. And data. It can be hard to navigate.

Enter Lens. In March, we added the multi-turn AI research assistant Lens to Viewpoint, EIU’s website. It allows our analyst, strategist and risk customers to ask about the fiscal outlook or compare political risk across the five biggest economies in South America and get an answer sourced entirely from EIU content. Getting Lens to a good enough quality was hard work. We worked with our analysts on evaluations. We went back and forth until we were happy. By the time we launched we knew Lens was good. But a feature in the wild is a different beast and pinpointing the areas to improve was the next challenge.

## Reading the room

Product analytics showed us what users did. It couldn't tell us whether the AI's answer was any good. A user who asked three questions in a session might have been getting excellent answers and going deeper. Or they might have been rephrasing the same question because the first two answers were wrong. The data looked the same either way. We opened sessions and read them, but not fast enough to act on what we found. First impressions matter.

## From samples to signal

Lens sessions landed in Amplitude with signals attached out of the box: intent classification, outcome tracking, quality scores. Getting there took one engineer and some back and forth, but once it was running, reading a trace stopped being archaeology. I could open a session, see each turn, see what the agent retrieved and how it responded, and see how it scored. Before that, our data insights lead, our platform engineers and I were drawing conclusions from handfuls of sessions. A few examples, a hunch. Now we're looking at the same evidence across all of them.

## Filtering by behavior

The capability I lean on most is addictive. Pull every session matching a behavioral signal and drill in.

Early on, the request complexity classification surfaced a concern about how Lens handled ambiguous questions. I filtered to those sessions, found two areas that worried me, and we now track both programmatically. That whole loop, from hunch to confirmed pattern to standing metric, took an afternoon.

It also catches things that look like problems and are not. Around half of our session outcomes were labelled "Clarification Requested," which initially read as a defect. Drilling in showed the agent was ending answers with a follow-up question, a deliberate design choice. But when we looked at what users did next, the pattern was clear enough. They weren't going deeper, they were going elsewhere. We tweaked the prompt. Without the ability to interrogate the label, we might have spent a sprint fixing behavior that was working as intended.
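As a rough illustration of the filter-and-drill-in loop described in this section, here is a minimal Python sketch. It does not use Amplitude's API. The `Session` fields, the label values and the sample records are all hypothetical, invented only to show the shape of the idea: select every session carrying a given signal, then compute a standing metric over that slice.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # All fields and label values are hypothetical, for illustration only.
    session_id: str
    complexity: str  # e.g. "simple" or "ambiguous"
    outcome: str     # e.g. "success" or "failure"
    quality: float   # assumed 0-1 quality score


def filter_sessions(sessions, **signals):
    """Return the sessions whose attributes equal every given signal value."""
    return [
        s for s in sessions
        if all(getattr(s, name) == value for name, value in signals.items())
    ]


def failure_rate(sessions):
    """Share of sessions labelled as failures (0.0 for an empty slice)."""
    if not sessions:
        return 0.0
    return sum(s.outcome == "failure" for s in sessions) / len(sessions)


# Made-up sample data, not real Lens sessions.
SESSIONS = [
    Session("s1", "ambiguous", "failure", 0.40),
    Session("s2", "simple", "success", 0.90),
    Session("s3", "ambiguous", "success", 0.80),
    Session("s4", "simple", "success", 0.95),
]

ambiguous = filter_sessions(SESSIONS, complexity="ambiguous")
print([s.session_id for s in ambiguous])  # the slice to read by hand
print(failure_rate(ambiguous))            # the number to track over time
```

Once a slice like this proves interesting, the same filter can run on a schedule, which is what turns a hunch into a standing metric.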

## A scorecard instead of a vibe

We all know AI outputs are inherently non-deterministic, which is a long way of saying that viewing a handful of sessions and calling it quality assurance doesn't scale. We wanted a number.

Amplitude's standard signals gave us a floor on day one. Custom evals let us go further, scoring sessions against our own definition of good, including rubrics built from ground-truth Q&A pairs our analysts wrote.

By Amplitude's measure, Lens now holds a 96.9% task success rate, and weekly task failures fell 84% as we worked through the issues the data surfaced. Those gains came from ordinary engineering. The measurement is what tells us where to point it.
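For readers who want the arithmetic behind figures like these, the sketch below shows one plausible way a task success rate and a week-over-week drop in failures could be computed. The function names and the counts are assumptions chosen for illustration. They are not The Economist's underlying data, and the source does not say exactly how Amplitude derives its numbers.

```python
def task_success_rate(successes: int, total: int) -> float:
    """Percentage of tasks that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from a baseline count to a later count."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (before - after) / before


# Hypothetical counts, picked only so the arithmetic is easy to follow.
print(round(task_success_rate(successes=969, total=1000), 1))
print(round(percent_reduction(before=50, after=8), 1))
```

The two numbers answer different questions. The success rate is a snapshot of how often a task goes right, while the reduction compares failure counts between two periods, so it depends on which baseline week is chosen.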

Evals alone tell you whether an answer was correct. Quality signals connected to real product usage tell you whether the product is getting better at its job.

## What comes next

Agent interactions in Amplitude arrive as decomposed events, the same shape as any other product event. That opens the door we care most about: connecting Lens quality to engagement across Viewpoint as a whole. Does a strong Lens experience deepen how teams use the platform? As we add structured data and chart visualisation to Lens, and as other parts of The Economist Group explore whether the architecture fits their own products, that connected view is how we will judge the work.
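To make the "same shape as any other product event" point concrete, here is a minimal sketch of how agent events and ordinary product events could be analysed side by side. The event names (`lens_turn`, `report_view`), the `quality` property and the sample rows are hypothetical. They are not Amplitude's actual schema, only an assumption about what a flat event stream might look like.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flat event stream: agent turns and product events share one shape.
EVENTS = [
    {"user_id": "u1", "event_type": "lens_turn", "quality": 0.9},
    {"user_id": "u1", "event_type": "report_view"},
    {"user_id": "u1", "event_type": "report_view"},
    {"user_id": "u2", "event_type": "lens_turn", "quality": 0.4},
    {"user_id": "u2", "event_type": "report_view"},
]


def quality_vs_engagement(events):
    """Map each Lens user to (mean Lens quality, count of other product events)."""
    scores = defaultdict(list)
    other_events = defaultdict(int)
    for event in events:
        if event["event_type"] == "lens_turn":
            scores[event["user_id"]].append(event["quality"])
        else:
            other_events[event["user_id"]] += 1
    return {user: (mean(q), other_events[user]) for user, q in scores.items()}


print(quality_vs_engagement(EVENTS))
```

Because both kinds of event sit in one stream, the join is just a group-by on the user, and that is what makes the quality-to-engagement question cheap to ask. A pairing like this shows correlation only and would need a proper comparison before anyone read it as cause and effect.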

None of this is about watching individual users. It is about holding our AI products to the same standard of evidence we hold our analysis to. Our clients pay us for rigor. The tools we use to improve their experience should have some too.

About the author

Fionn O'Raghallaigh

AI Group Product Manager, The Economist

Fionn leads AI product at the Economist Intelligence Unit. Before moving into product, he was a journalist. Knowing what makes an answer trustworthy, useful and worth someone's time turns out to be good preparation for building AI products.
