Analyze your experiment data with the T-test

A T-test compares the means of two data populations to decide whether the difference between them is statistically significant. Amplitude uses Welch's T-test, which makes a few assumptions about your dataset:

The Central Limit Theorem applies to the metric.
The two populations don't need to share the same variance.
You don't run the T-test until you reach the sample size specified by the duration estimator.
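As a rough illustration of what Welch's T-test computes, the sketch below derives the test statistic and the Welch-Satterthwaite degrees of freedom from two samples. The function name `welch_t` and the sample values are hypothetical, and this isn't Amplitude's implementation. It only shows the standard formula, which uses each group's own variance instead of a pooled one.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Hypothetical helper for illustration; not Amplitude's code.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error, no pooling
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Made-up metric values for two variants.
control = [10.0, 12.0, 11.0, 13.0, 9.0]
treatment = [14.0, 15.0, 13.0, 16.0, 12.0]
t_stat, dof = welch_t(treatment, control)
print(t_stat, dof)
```

To get a p-value, you would compare the statistic against a Student's t distribution with the computed degrees of freedom.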

Looking for a Z-test?

T-test supports many of the same options as a Z-test.

Conduct a T-test as either:

Two-sided: looks for any change in the metric, in either direction.
One-sided: looks for an increase or a decrease, but not both.

T-tests work for both binary and continuous metrics.

A two-sided test reports whether the metric changed without committing to a direction in advance, while a one-sided test checks only for the direction you select. In a one-sided test, if you select Increase, the upper confidence interval bound is positive infinity. For Decrease, the lower confidence interval bound is negative infinity.
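The sketch below shows how the confidence interval bounds differ between the test types. It assumes a large-sample normal approximation for the critical value, which is a simplification. The function name `confidence_interval` and its parameters are hypothetical and don't correspond to any Amplitude API.

```python
from math import inf
from statistics import NormalDist


def confidence_interval(diff, std_err, direction="two-sided", alpha=0.05):
    """Return (lower, upper) bounds for the difference in means.

    Assumption: a normal critical value stands in for the t critical value.
    """
    if direction == "two-sided":
        z = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split across both tails
        return diff - z * std_err, diff + z * std_err
    z = NormalDist().inv_cdf(1 - alpha)          # all of alpha in one tail
    if direction == "increase":
        return diff - z * std_err, inf           # upper bound is unbounded
    if direction == "decrease":
        return -inf, diff + z * std_err          # lower bound is unbounded
    raise ValueError(f"unknown direction: {direction}")


two_sided = confidence_interval(0.02, 0.01)
increase = confidence_interval(0.02, 0.01, "increase")
decrease = confidence_interval(0.02, 0.01, "decrease")
print(two_sided, increase, decrease)
```

Because a one-sided test puts the full false positive rate in one tail, its finite bound sits closer to the observed difference than the matching bound of a two-sided test.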

If you haven't run your experiment, or your sample size is large enough, use sequential testing instead of a T-test. Read more about the difference in testing options in this blog.

Configure T-test settings

Access the T-test settings from the Settings tab. The required settings depend on the T-test type you want to run and the direction you want the metric to move. To configure your T-test:

Edit the Goals panel, then select Increase or Decrease for your metric.
Open the Analysis Settings panel. Go to Stats Preferences > Advanced. Select the T-test stats method. Choose 1-sided or 2-sided based on the T-test type you want to run. For example, to run a one-sided T-test looking for an increase, select Increase in the primary metric and 1-sided T-test in statistical settings.
Enter the number of users needed under Samples Per Variant Needed. If you don't know which sample size to enter in Samples Per Variant Needed, use Amplitude's duration estimator. To learn more, refer to the Help Center article on planning experiments with the duration estimator.

The T-test first computes the sample size you need to control for a specific false positive and false negative rate. Analyzing your data before reaching the sample size threshold increases your error rates. Review this article for more explanation on how peeking can interrupt your experiment process.

Select Save to change the statistical settings to T-test.
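To show how a sample size relates to the false positive and false negative rates mentioned above, here is a minimal sketch of the textbook formula for a two-sided comparison of two means. It isn't the duration estimator's actual calculation. The function name and the inputs (`sigma` for the metric's standard deviation, `mde` for the absolute minimum detectable effect) are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(sigma, mde, alpha=0.05, power=0.8):
    """Textbook per-variant sample size for a two-sided test of two means.

    Assumptions: equal variance and equal group sizes, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls false positives
    z_beta = NormalDist().inv_cdf(power)           # controls false negatives
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return ceil(n)


needed = samples_per_variant(sigma=1.0, mde=0.5)
print(needed)
```

A smaller minimum detectable effect or a lower tolerated error rate raises the required sample size, which is why the threshold is fixed before you analyze the data.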

Manage sample size needed for the T-test

You must reach a minimum sample size before you run a T-test. Experiment warns you if your dataset is too small.

The Cumulative Exposure graph and its table show your sample size requirements. The graph shows a constant, dotted line named Sample Size Target, which represents the total number of users needed for each variant. The table next to the graph highlights the Exposure Remaining, which is the number of users each variant still needs. This information confirms the number of users needed before running the T-test, and provides an estimate of the time the experiment needs before you use a T-test to interpret your results.
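The arithmetic behind Exposure Remaining can be sketched as follows. The function names, variant names, and the constant daily exposure rate are all hypothetical; Amplitude's own time estimate may be computed differently.

```python
from math import ceil


def exposure_remaining(target, exposed):
    """Users each variant still needs to reach the Sample Size Target."""
    return {variant: max(target - count, 0) for variant, count in exposed.items()}


def days_remaining(remaining, daily_exposures_per_variant):
    """Rough time estimate, assuming a constant daily exposure rate."""
    slowest = max(remaining.values())  # the variant furthest from the target
    return ceil(slowest / daily_exposures_per_variant)


remaining = exposure_remaining(1000, {"control": 800, "treatment": 1050})
days = days_remaining(remaining, daily_exposures_per_variant=60)
print(remaining, days)
```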

Reaching the needed sample size doesn't guarantee statistically significant results. For example, if your lift is smaller than the MDE, your results often aren't statistically significant.

Common questions

This section applies to A/B tests in a Funnel Analysis chart. It doesn't apply to the Experiment Results chart or to end-to-end experimentation in Amplitude Experiment.

How does Amplitude calculate improvement over baseline?

Improvement over baseline is the ratio of the mean of the variant (A) over the mean of the baseline (B): mean(A) / mean(B).

For each group, Amplitude calculates the mean as k / n, where k is the number of conversions and n is the sample size.
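The two definitions above translate into a short calculation. The function names and conversion counts below are made up for illustration.

```python
def conversion_mean(conversions, sample_size):
    """Mean for one group: k / n."""
    return conversions / sample_size


def improvement_over_baseline(variant_mean, baseline_mean):
    """Ratio of the variant mean over the baseline mean, as described above."""
    return variant_mean / baseline_mean


mean_a = conversion_mean(120, 1000)  # variant (A)
mean_b = conversion_mean(100, 1000)  # baseline (B)
ratio = improvement_over_baseline(mean_a, mean_b)
print(ratio)
```

In this example, a ratio of about 1.2 means the variant's conversion rate is about 1.2 times the baseline's.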

Why do calculations use unique conversions instead of totals?

Amplitude uses unique conversions instead of totals when checking for statistical significance. Totals make false assumptions about a user's behavior in the funnel. The aggregate sum assumes that each time a user enters the funnel is independent of the previous time. That assumption isn't valid when calculating statistical significance, although totals can still help with other analyses in the Experiment Results chart or end-to-end Amplitude Experiment.

How does Amplitude calculate statistical significance?

Amplitude uses standardized statistical methods to calculate statistical significance. The method varies by feature: sequential testing or a two-tailed T-test. By default, Amplitude Experiment and the Experiment Results chart use sequential testing, while the Funnel Analysis chart uses the two-tailed T-test. When you compare analyses, p-values may not match across charts that use different testing methods.

For both methods, Amplitude uses a 5% false-positive rate by default. The threshold for significance is (1 - p_value) > 95%. You can change the false-positive rate in Amplitude Experiment. You can't change it in the Funnel Analysis chart.

To help reduce false positives, Amplitude requires a minimum sample size before declaring significance: 30 samples, five conversions, and five non-conversions for each variant. Amplitude automatically treats tests that don't meet these minimums as not statistically significant.
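The decision rule described in this section, combining the significance threshold with the minimum sample requirements, can be sketched as below. The function name and the `(sample_size, conversions)` tuple layout are hypothetical, and the p-value is assumed to come from whichever test the chart uses.

```python
def is_significant(p_value, variants, alpha=0.05):
    """Apply the minimums, then the (1 - p_value) > 95% threshold.

    `variants` is a list of (sample_size, conversions) pairs, one per variant.
    """
    for sample_size, conversions in variants:
        non_conversions = sample_size - conversions
        if sample_size < 30 or conversions < 5 or non_conversions < 5:
            return False  # below the minimums: treated as not significant
    return (1 - p_value) > (1 - alpha)


print(is_significant(0.01, [(100, 20), (100, 30)]))  # meets minimums and threshold
print(is_significant(0.01, [(100, 3), (100, 30)]))   # too few conversions
print(is_significant(0.20, [(100, 20), (100, 30)]))  # p-value too high
```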
