5 Overlooked Steps in A/B Testing

Explore five crucial (often overlooked) steps in A/B testing for more accurate and effective results.

Perspectives
March 4, 2024
Ken Kutyn
Senior Solutions Consultant, Amplitude

When it comes to A/B testing, details matter.

Each phase of A/B testing helps align your product with what your users need and want. Skipping key steps can lead to misinterpreted results, causing you to spend time and resources on features that don’t meet your users’ needs.

Paying attention to these five commonly overlooked steps in experimentation ensures you’re not just chasing clicks but making strategic, impactful enhancements. This meticulous approach paves the way for more accurate, reliable results, enabling your teams to make informed decisions that contribute to your product’s long-term success and innovation.

1. Checking the confidence intervals

You likely know to look for statistical significance, but don't forget to check confidence intervals.

Statistical significance (stat sig) means an experiment’s results are likely not due to random chance, indicating the feature you’re testing really impacted your chosen metrics, such as engagement or conversions.

Confidence intervals take it a step further by providing a range of plausible effect sizes, giving you an idea of how big or small the impact of your change might be. Checking confidence intervals helps you make informed choices about using the changes you tested.

Let’s say you added a new feature to your app, and the test shows that more people are now using it. That sounds great, right? But what if the confidence interval shows that, in the best case, user engagement could increase by 10%, while in the worst case it might only increase by 1%? This extra information can help you decide whether the new feature is worth it.
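To make that concrete, here’s a minimal sketch (in Python, with made-up numbers) of how you might compute a confidence interval for the lift in an engagement rate, using a normal approximation for two proportions. The counts and variable names are purely illustrative, not output from a real test.

```python
# Minimal sketch: a 95% confidence interval for the lift in engagement rate
# between control and treatment, using a normal approximation for two
# proportions. All counts below are made up for illustration.
from math import sqrt

control_users, control_engaged = 10_000, 2_000      # 20.0% baseline engagement
treatment_users, treatment_engaged = 10_000, 2_150  # 21.5% with the new feature

p_c = control_engaged / control_users
p_t = treatment_engaged / treatment_users
diff = p_t - p_c

# Standard error of the difference between two independent proportions
se = sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)
z = 1.96  # ~95% confidence

low, high = diff - z * se, diff + z * se
print(f"Observed lift: {diff:.1%}, 95% CI: [{low:.1%}, {high:.1%}]")
# Prints roughly: Observed lift: 1.5%, 95% CI: [0.4%, 2.6%]
# The result is "significant", but the worst plausible case is only +0.4%,
# which may not justify the cost of building and maintaining the feature.
```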

Checking confidence intervals enables you to weigh the potential benefits, costs, and effort required for implementation. This way, you’ll be sure you’re investing in changes that will make a difference.

2. Segmenting your A/B test results

Segmenting your A/B test results ensures you understand how changes affect specific types of users rather than just looking at the average effect across all users.

Many teams find segmenting A/B test results daunting and time-consuming, especially with limited resources. Additionally, there’s a tendency to focus on broader test outcomes, seeking general insights applicable to the entire user base.

This approach overlooks the nuanced test responses by different user segments. Different segments may react differently to the same change: What works for one group may not work for another.

Imagine you’re on a product team at an online streaming service conducting an A/B test on a new feature that recommends movies based on viewing history. By segmenting the data, your team finds that although the feature is popular among young adults, it’s less effective for users over 50. Should you simply roll the feature out to young adults?

Possibly, but you may need more data to know for sure. To avoid false positives, you should rerun the experiment targeted at young adults to verify that you found a positive impact on your metrics and ensure you’re not “p-hacking” your results.

After rerunning the test, the verified results suggest that while the feature may be beneficial to roll out, it needs refinement for older users. Without segmenting and verifying, your team wouldn’t have that extra context.

Personalization is critical in today’s market. Segmenting results helps you tailor your product experience for diverse user needs and preferences. This is one reason customers turn to Amplitude’s Digital Analytics Platform—you can easily build cohorts based on existing user properties and behaviors in one place. Targeting and focusing your experiments on creating better experiences helps enhance satisfaction and engagement for every user.

If full-fledged segmentation isn’t feasible, at a minimum, compare your results against your most important user segment. Ask yourself: Do the results of this test accurately represent your most important users? Doing so provides insights into how well your test aligns with the needs of your core audience.
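If you export raw results, a per-segment breakdown can be as simple as a group-by. Here’s a minimal, hypothetical sketch using pandas; the column names, segment labels, and toy data are assumptions for illustration only.

```python
# Minimal sketch: breaking A/B results down by a user segment (age group)
# instead of only looking at the overall average. Column names, segment
# labels, and the toy data are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "variant":   ["control", "treatment"] * 4,
    "age_group": ["18-29", "18-29", "30-49", "30-49", "50+", "50+", "18-29", "50+"],
    "engaged":   [0, 1, 1, 1, 1, 0, 1, 0],
})

# Engagement rate per (segment, variant); in practice you would also check
# sample sizes and statistical significance within each segment before acting.
by_segment = (
    events.groupby(["age_group", "variant"])["engaged"]
          .mean()
          .unstack("variant")
)
by_segment["lift"] = by_segment["treatment"] - by_segment["control"]
print(by_segment)
```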

3. Applying Bonferroni corrections

Some teams overlook Bonferroni corrections, often leading them to incorrectly identify a variant as significant—generating a false positive.

When you test multiple metrics and variants, the risk of getting a false positive increases. For example, if you’re testing four variants against three metrics, that’s twelve separate comparisons, each carrying a roughly 5% chance of a false positive at a standard significance threshold; across all twelve, the chance of at least one false positive climbs to roughly 46%. The Bonferroni correction accounts for this by dividing the significance threshold by the number of comparisons, which in practice means stronger evidence (and often a larger sample) is required before a result is declared statistically significant.

By setting stricter criteria for statistical significance, Bonferroni corrections ensure your test results are reliable and not just a statistical fluke.
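As a rough sketch of what the correction does under the hood, here’s how you might adjust the significance threshold by hand for the twelve comparisons in the example above. The p-values are invented for illustration.

```python
# Minimal sketch: applying a Bonferroni correction by hand for the
# 4 variants x 3 metrics example above. The p-values are invented.
alpha = 0.05
p_values = [0.004, 0.021, 0.048, 0.310, 0.007, 0.120,
            0.052, 0.430, 0.015, 0.200, 0.049, 0.610]

n_comparisons = len(p_values)            # 12 comparisons
corrected_alpha = alpha / n_comparisons  # 0.05 / 12 ≈ 0.0042

# Chance of at least one false positive if you apply NO correction
family_wise_error = 1 - (1 - alpha) ** n_comparisons  # ≈ 46%

significant = [p for p in p_values if p < corrected_alpha]
print(f"Corrected alpha: {corrected_alpha:.4f}")
print(f"Uncorrected chance of at least one false positive: {family_wise_error:.0%}")
print(f"p-values still significant after correction: {significant}")
```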

Leading platforms like Amplitude automatically apply Bonferroni corrections. As a test’s complexity increases with more variants and metrics, these platforms adjust the requirements for statistical significance. This automatic adjustment maintains the integrity of the test results and enables better scalability in testing.

4. Examining beyond primary metrics

It’s essential to look beyond your primary metrics to understand the broader impact of a change on your user journey, including aspects like revenue, retention, and user stickiness (long-term engagement).

A/B testing often focuses on the primary metrics directly related to the tested change. For example, if you’re testing a call-to-action (CTA) button, your primary metric is likely the click-through rate (CTR).

However, secondary metrics provide a deeper understanding of user behavior beyond the immediate impact of the tested change. While primary metrics, such as click-through rate, offer direct insights into user response, secondary metrics reveal the broader implications for user engagement, retention, and overall satisfaction.

Let’s say an ecommerce site tests a new, more prominent “Add to Cart” button to increase purchases. While the primary metric (CTR) improves significantly, secondary metrics reveal a different story: The site experiences a higher cart abandonment rate and decreased average session time.

The results suggest that although more users click the button, fewer complete their purchases, and overall engagement on the site decreases. These secondary metrics highlight the need to balance immediate actions with the overall user experience.
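One lightweight way to formalize this is to define guardrail thresholds for your secondary metrics and check them alongside the primary metric. The sketch below is hypothetical: the metric names, observed changes, and tolerances are assumptions, not a prescribed methodology.

```python
# Minimal sketch: checking guardrail (secondary) metrics alongside the
# primary metric before calling a test a win. All numbers are hypothetical.
results = {
    "cta_click_through_rate": +0.12,   # primary metric: +12% relative lift
    "cart_abandonment_rate":  +0.06,   # guardrail: higher is worse
    "avg_session_minutes":    -0.04,   # guardrail: lower is worse
}

# Guardrails: metric name -> worst acceptable relative change
guardrails = {
    "cart_abandonment_rate": +0.02,   # tolerate at most a 2% increase
    "avg_session_minutes":   -0.02,   # tolerate at most a 2% decrease
}

violations = []
for metric, limit in guardrails.items():
    change = results[metric]
    # A guardrail is violated when the change moves past its limit
    # in the "bad" direction (the sign of the limit tells us the direction).
    if (limit > 0 and change > limit) or (limit < 0 and change < limit):
        violations.append(metric)

if violations:
    print("Primary metric improved, but guardrails were violated:", violations)
else:
    print("Primary metric improved and all guardrails held.")
```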

Modern digital analytics platforms, like Amplitude, make it straightforward to analyze secondary metrics, so teams can track and monitor retention, LTV, stickiness, and more. With Amplitude, teams can also set up secondary metrics or guardrails when configuring an experiment.

5. Documenting A/B test findings

Each A/B test offers value not only as an optimization to your product but also as a source of learnings. Document your A/B test findings so you can always revisit the hypotheses, variants, cohorts, outcomes, and more.

Effective documentation creates a knowledge base that your teams can learn from. That way, teams can understand what worked, what didn’t, and why, facilitating continuous improvement in testing and product development.

Suppose your team is testing a new feature in your app. You document the A/B test thoroughly in a shared Google Doc, including:

  • The initial hypothesis: why you think the feature will improve user engagement
  • Detailed descriptions of the feature variants tested
  • The duration of the test
  • The size and characteristics of the user segments involved
  • Test results in terms of user engagement metrics

Six months later, another team plans to iterate on the feature. They revisit the document and quickly understand the previous test’s context, learnings, and why your team made certain decisions. This historical insight enables them to design a more informed follow-up test, avoiding past mistakes and building on previous successes.
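If you want the same information in a more structured, searchable form, a simple record like the hypothetical sketch below can complement the written doc; every field name and value here is made up for illustration.

```python
# Minimal sketch: capturing an experiment write-up as a structured record
# so future teams can find and reuse it. Every field value is hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    variants: list[str]
    start: date
    end: date
    segments: list[str]
    primary_metric: str
    result_summary: str
    decision: str

record = ExperimentRecord(
    name="recommendations-v2",
    hypothesis="History-based recommendations will lift weekly engagement.",
    variants=["control", "history-based recommendations"],
    start=date(2024, 1, 8),
    end=date(2024, 1, 29),
    segments=["18-29", "30-49", "50+"],
    primary_metric="weekly_active_sessions",
    result_summary="+6% overall; strong for 18-29, flat for 50+.",
    decision="Roll out to under-50s; iterate on recommendations for 50+.",
)
print(record)
```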

Some tools have built-in documentation features. For example, Amplitude’s Notebooks enable you to document the full context of A/B tests so you can easily refer to them later.

Take a comprehensive approach to A/B testing

Skipping steps in A/B testing can lead to misaligned decisions, wasted resources, and missed opportunities for meaningful improvement. When you take a more thorough approach, you get more reliable, user-focused outcomes.

Amplitude plays a pivotal role in this process, offering tools and insights that enhance the quality and effectiveness of A/B testing. Amplitude Experiment builds these often overlooked steps directly into the testing process.

By leveraging Amplitude’s capabilities, teams can more accurately interpret data, make informed decisions, and drive impactful changes.

Explore Experiment and its features for a deeper dive into how Amplitude can revolutionize your A/B testing process.

About the Author
Ken Kutyn
Senior Solutions Consultant, Amplitude
Ken has 8 years of experience in the analytics, experimentation, and personalization space. Originally from Vancouver, Canada, he has lived and worked in London, Amsterdam, and San Francisco and is now based in Singapore. Ken has a passion for experimentation and data-driven decision-making and has spoken at several product development conferences. In his free time, he likes to travel around Southeast Asia with his family, bake bread, and explore the Singapore food scene.