
How to Connect AI Evals to Your Product Retention Metrics

We used Agent Analytics to understand whether eval signals actually predict which users come back. The result surprised us.
Insights

May 7, 2026

6 min read

Vinay Goel

Staff AI Engineer, Amplitude

New power users who have a strictly positive first experience and, critically, save an AI artifact consistently retain 3x better over time.

This blog was co-authored by Sandhya Hegde, Cofounder at Calibre, an applied AI research firm.

If you’ve shipped an agentic feature to your product in the last year, you’ve probably asked some version of this: Are agents making my product stickier, or just cannibalizing the parts that already work? Do users with high eval scores actually retain better? How do I understand the impact of agent improvements at the user and business level, not just inside a session?

We had those questions too, so we ran an analysis of Amplitude’s Global Agent covering 20K+ users who were already Amplitude veterans (effectively power users) before we launched it. We wanted to understand how positive agent sessions (based on eval scores) affected their relationship with Amplitude, and to find predictors of better long-term retention.

The core insight: Scoring vs retention

For established users of an AI product, eval scores and retention are basically orthogonal. Bad sessions don’t push them out. Good sessions don’t keep them in.

For new users of an AI product, that relationship is the opposite: Eval scores are highly predictive of adoption and future retention. The single sharpest value signal is whether the user had a positive experience and saved the agent’s output in the first session. Below are the three patterns that stood out in our data:

1. Our most retained users are also having the most negative agent experiences.

Pure correlation, not causation. Our power users push the agent harder than anyone else, hit dead ends constantly, and come back the next day with new queries anyway. They’re not retained because the agent fails them. They’re retained despite it because the agent is part of how they work, and friction is the cost of living on the jagged frontier.

2. A positive first-week experience was worth a 3x retention multiplier.

This was the sharpest signal. Users whose first week with Global Agent was strictly positive retained 3x better over the long term than users whose first week tripped even a single failure flag, even when both groups were equally active everywhere else in Amplitude. Same product, same user quality, different first impression, and completely different downstream behavior.

3. Early positive agent experiences pulled overall Amplitude usage up too.

The agent isn’t a sandbox (though it technically runs in one). A great first session with Global Agent improved weekly retention to the rest of Amplitude’s platform, not just to the agent itself. Bad first sessions, on the other hand, didn’t push users off Amplitude. They just pushed them away from the agent specifically.

The implication: If your eval cohorts are all showing strong retention and you’re feeling smug that low eval scores aren’t hurting retention, don’t. They probably are. You’re just not looking at the full picture.

The first clue: Tangled eval cohort retention curves

When we set out to correlate eval scores and retention, we started simple. We plotted weekly retention by eval cohort across all users based on session modes (refer to our agent analytics modes post for definitions of Clean Success, Graceful Recovery, Silent Fail, and Dead End). To our surprise, we got nothing. All four eval cohorts were within 5% of each other.

Chart 1: Established users in any eval cohort retain at roughly the same rate.

All four eval cohorts sit within ±5% of each other
Weekly Global Agent retention by eval cohort, shown as deviation from the cross-cohort average.
W1–W4 deviation from the cross-cohort average:
  • Clean Success: +3.8%, +5.0%, +3.5%, +3.0%
  • Graceful Recovery: +0.4%, +0.2%, +0.8%, +0.8%
  • Silent Fail: -2.8%, -2.0%, -1.2%, -1.2%
  • Dead End: -1.2%, -3.6%, -3.4%, -3.6%
If session quality predicted user retention, these curves would separate. They cluster.

At first glance, there are two ways we can interpret this:

  • Bad sessions are a great demand signal. Users kept trying because they really needed an answer (not because they were masochists... or maybe they are).
  • The eval categorization and scoring don’t matter. It’s either incorrect or irrelevant.

It seemed dangerous to embrace either. Both would easily become great excuses not to care about the quality of our agent’s output in each and every session.

So we went back to the drawing board of user psychology. What makes users love or hate an AI product? Change their old workflows and become sticky with a new surface? Give it a chance to work? Was it the agent’s sheer power to handle complex tasks, or the speed with which it automated simple ones?

We wanted to find the strongest signal that made a difference.

The data immediately revealed a story when we split users into two cohorts (established users and new users of an AI product) and asked the question separately for each. The tangled chart becomes clear, and the answers are nearly opposite.

The signal: First-time UX eval score predicts long-term retention

At this point, we started focusing on first-time Global Agent users only. Note: These were all already power users of Amplitude.

We took Q1 2026 first-time Global Agent users and split them by first-session outcome:

  • positive (clean success flags) vs.
  • negative (any failure flag tripped, including thumbs-down feedback)

The retention curves separate immediately and stay separated week over week.

Chart 2: For first-time users, first-session eval outcome predicts retention

Positive first sessions stick. Negative ones don't.
Weekly retention to Global Agent by first-session outcome, shown as deviation from the two-cohort mean.
W1–W4 deviation from the two-cohort mean:
  • Positive: +40%, +44%, +45%, +45%
  • Negative: -40%, -43%, -44%, -45%
(~90 percentage-point spread at W4)
First-session outcome is the entire signal for new-user agent retention.

We then layered in the strictest “user got real value” signal in the data: whether they saved the agent’s output (a chart, cohort, dashboard, etc.) and committed it to their workspace.

Chart 3: Users who saved an AI artifact in their first (positive) session retain at 3x the rate of users whose first session hit any failure flag.

Saving an AI artifact in the first session is worth 3x retention
Weekly Global Agent retention by first-session cohort. Indexed to Strict Positive W0 = 1.0.
Relative retention, indexed to Strict Positive W0 = 1.0:
  • Strict Positive (clean session + saved AI artifact): 1.00 at W0, then 0.48, 0.44, 0.41, 0.40 across W1–W4
  • Middle (clean session, no save): 0.17, 0.22, 0.15, 0.13 across W1–W4
  • Broad Negative (any failure flag): 0.05, 0.06, 0.06, 0.06 across W1–W4
First-session save is the sharpest predictor of agent adoption in the data.

That’s a 3.57x gradient at Week 1, holding at 3x by Week 4. Monotonic, persistent, and specific to Global Agent.

What makes this possible: Correlating evals and usage metrics

If you’re working on an agentic product, measure first-session eval scores and value extraction. Tie them to user engagement and retention so you can show the value of new AI features to your overall business.

To do this, your eval data and your product analytics need to share a schema.

Most teams can’t run this analysis. Their eval data lives in one warehouse, their product analytics live in another, and any cross-cutting question requires a JOIN, a schema reconciliation, and many meetings.

Amplitude’s agent analytics fires both into the same event stream:

  1. Eval scores. Structured session-evaluation events recorded automatically for every Global Chat session, with boolean flags for things like Has Task Failure, Has Negative Feedback, Has Technical Failure.
  2. Product analytics. Cross-product event taxonomy. When a user saves a chart, builds a cohort, or modifies a dashboard, those product actions fire into the same stream as the agent’s session events, with consistent property naming.

That means cohort definitions like “users who had a clean agent session AND saved an artifact AND came back to send a message in week 2” are direct boolean conditions on a single user’s history. No JOIN. No schema reconciliation. One query.
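To make that concrete, here is a minimal sketch of that cohort check expressed as boolean conditions over a single user’s event history. The event names and flag properties come from the taxonomy described in this post; the `UserEvent` shape, the `Chart Saved` event name, and the `Authoring Agent ID` property are illustrative stand-ins for however your own stream names them.

```typescript
// Sketch: evaluate "clean agent session AND saved an artifact AND
// returned with a message in week 2" against one user's event history.
// Event/property names follow this post; everything else is illustrative.

interface UserEvent {
  eventType: string;                    // e.g. "[Agent] Session Evaluation"
  time: number;                         // epoch ms
  properties: Record<string, unknown>;  // eval flags, attribution, etc.
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function isInCohort(events: UserEvent[], firstSessionTime: number): boolean {
  // Week index relative to the user's first agent session (0 = first week).
  const week = (e: UserEvent) => Math.floor((e.time - firstSessionTime) / WEEK_MS);

  const hadCleanSession = events.some(
    (e) =>
      e.eventType === '[Agent] Session Evaluation' &&
      week(e) === 0 &&
      e.properties['Has Task Failure'] === false &&
      e.properties['Has Negative Feedback'] === false &&
      e.properties['Has Technical Failure'] === false
  );

  const savedArtifact = events.some(
    (e) =>
      e.eventType === 'Chart Saved' &&            // any product "save" event
      e.properties['Authoring Agent ID'] != null  // attributed back to the agent
  );

  const returnedWeekTwo = events.some(
    (e) => e.eventType === '[Agent] User Message' && week(e) === 1 // second week
  );

  // One stream, one user, three boolean conditions -- no JOINs required.
  return hadCleanSession && savedArtifact && returnedWeekTwo;
}
```

The helper itself isn’t the point; the point is that every condition is a filter on the same stream, scoped to the same user.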

What this means for your agents

The first session is the whole adoption decision. New users meeting your agent are making a binary call: Is this useful enough to come back to? Value moments like save decisions happen within an hour of the first session, and what happens in that hour determines the next four weeks.

Eval and analytics need to share a schema. None of this analysis would have been possible without session-level eval flags fired into the same event stream as product actions, with consistent property naming. Cross-cutting queries like “users with a clean session AND a save AND a return message in week 2” only work if all three events live in the same place, scoped to the same user.

If you're working on an agent, we'd love to get your feedback on Agent Analytics. Sign up here to join our Partner Design Program for early access.

Methodology: Defining events, eval flags, and cohorts

Below is our complete methodology for running correlation analysis between session-level eval scores and long-term product usage metrics.


Three event categories, all in the same event stream (a minimal instrumentation sketch follows the list):

1. Session-level eval events with structured outcome flags. Fire [Agent] Session Evaluation once per session with properties for flags like Has Task Failure, Has Negative Feedback, Has Technical Failure, Agent ID, Turn Count, Session Cost USD, and Request Complexity. Flags must be booleans (or string-enums you can filter on), not free-text descriptions. If your eval pipeline produces narrative summaries, derive boolean flags from them and fire those alongside.

2. Per-message agent telemetry. [Agent] User Message for user messages, [Agent] Score (source="user") for thumbs up/down feedback. These are your retention measurement targets and your fine-grained feedback signals. Each event should carry an Agent ID property so you can scope analyses to one agent at a time.

3. Product action events with attribution back to the agent. When the agent “saves” or commits work on the user’s behalf, the resulting product event should be logged with a property attributing the action to the authoring agent. These are your value signals.
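For reference, here is a minimal sketch of what firing all three categories into one stream can look like with Amplitude’s Browser SDK (`@amplitude/analytics-browser`) and its standard `track(eventType, eventProperties)` call. The event names and flag properties follow the taxonomy above; the agent ID, the numeric values, and the `Chart Saved` / `Authoring Agent ID` names are illustrative.

```typescript
// Sketch: the three event categories fired into the same Amplitude stream.
// Event names and flags follow the taxonomy above; other values are illustrative.
import * as amplitude from '@amplitude/analytics-browser';

amplitude.init('YOUR_API_KEY');

// 1. Session-level eval event with structured outcome flags (booleans, not prose).
amplitude.track('[Agent] Session Evaluation', {
  'Agent ID': 'global-agent',
  'Has Task Failure': false,
  'Has Negative Feedback': false,
  'Has Technical Failure': false,
  'Turn Count': 7,
  'Session Cost USD': 0.42,
  'Request Complexity': 'multi_step',
});

// 2. Per-message agent telemetry: user messages and explicit feedback.
amplitude.track('[Agent] User Message', { 'Agent ID': 'global-agent' });
amplitude.track('[Agent] Score', { 'Agent ID': 'global-agent', source: 'user', score: 1 });

// 3. Product action attributed back to the agent -- the value signal.
amplitude.track('Chart Saved', {
  'Authoring Agent ID': 'global-agent', // attribution property; name is illustrative
});
```

A server-side agent would make the equivalent track calls from Amplitude’s server SDKs, passing the user ID explicitly so the events land on the same user profile as their product actions.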

Cohort definition: Strict Positive: clean win + extracted value.

  • ≥1 session in W with all failure flags = false
  • 0 sessions in W with Has Negative Feedback = true
  • 0 sessions in W with Has Task Failure = true
  • ≥1 product save event (or equivalent value action) with source = <your agent attribution>

Cohort definition: Broad Negative: every session had something wrong.

  • ≥1 session in W
  • 0 sessions in W with all failure flags = false

Cohort definition: Middle: clean experience, no value extracted.

  • ≥1 session in W with all failure flags = false
  • 0 product save events with source = <your agent attribution>

With this taxonomy in one event stream, the analyses above are direct boolean queries on a single user’s history. No JOINs, no schema reconciliation between data models.
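If it helps to see the cohort rules as code, the sketch below classifies one user’s week-W events into the three cohorts using the definitions above. It assumes the same illustrative `UserEvent` shape and agent-attributed save event as the earlier sketch, and that `events` is already scoped to the window W; adapt the names to your own taxonomy.

```typescript
// Sketch: classify a user's sessions within window W into the cohorts above.
interface UserEvent {
  eventType: string;
  time: number;
  properties: Record<string, unknown>;
}

type Cohort = 'Strict Positive' | 'Middle' | 'Broad Negative' | 'Unclassified';

function classifyWindow(events: UserEvent[]): Cohort {
  const evals = events.filter((e) => e.eventType === '[Agent] Session Evaluation');
  if (evals.length === 0) return 'Unclassified'; // no agent sessions in W

  const isClean = (e: UserEvent) =>
    e.properties['Has Task Failure'] === false &&
    e.properties['Has Negative Feedback'] === false &&
    e.properties['Has Technical Failure'] === false;

  const cleanSessions = evals.filter(isClean).length;
  const anyNegativeFeedback = evals.some((e) => e.properties['Has Negative Feedback'] === true);
  const anyTaskFailure = evals.some((e) => e.properties['Has Task Failure'] === true);

  const savedValue = events.some(
    (e) => e.eventType === 'Chart Saved' && e.properties['Authoring Agent ID'] != null
  );

  // Broad Negative: >=1 session in W, none of them clean.
  if (cleanSessions === 0) return 'Broad Negative';

  // Strict Positive: >=1 clean session, no negative feedback or task failure, plus a save.
  if (!anyNegativeFeedback && !anyTaskFailure && savedValue) return 'Strict Positive';

  // Middle: at least one clean session but no value extracted.
  if (!savedValue) return 'Middle';

  // Clean session + save, but some session tripped a failure flag: outside the three cohorts.
  return 'Unclassified';
}
```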

About the author

Vinay Goel

Staff AI Engineer, Amplitude

Vinay is a Staff AI Engineer at Amplitude. He builds the foundational AI platforms that empower internal innovation and help define the future of AI-driven analytics at scale.
Topics: AI, Agents, Amplitude Agent Analytics, Engineering, Product Analytics
