Test Kitchen/Incident reports/2025-12-18 Incorrectly calculated frequentist statistics
| Status | Closed |
| Severity | Medium |
| Incident coordinators | Mikhail Popov |
| Incident response team | Experiment Platform |
| Date detected | 2025-12-10 |
| Date resolved | 2025-12-18 |
| Phabricator | T412450 |
Summary
- On Wednesday, December 10th, we discovered an issue with the frequentist statistics in Test Kitchen's automated analysis of experiments that made experiment results appear less statistically significant than they actually were (e.g., confidence intervals were wider and p-values larger than they should have been).
- Bayesian results (chance to win, credible interval, risk) were unaffected; only frequentist results (p-value, confidence interval) were affected.
- The automated analysis system was fixed on Friday, December 12th.
- Most, but not all, experiments' results were updated with corrected statistics on Thursday, December 18th.
Background
On December 10th, 2025, while comparing the results of an experiment analyzed with Test Kitchen's automated analysis system and with GrowthBook (an open source experimentation platform we are currently evaluating as a potential replacement), we noticed a discrepancy in the confidence interval and p-value between the two sets of results. Both systems implement the same analysis methodology, use equivalent metric definitions, and perform the analysis on the same data. (The experiment was conducted October 29 – November 6, 2025, so the underlying data had not yet started to be deleted under the 90-day data retention policy.)
The frequentist statistical analysis uses a t-distribution to calculate confidence intervals and p-values. The shape of this distribution is governed by its "degrees of freedom" parameter, which is approximated using the Welch–Satterthwaite equation:
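For reference, the standard two-sample form of the Welch–Satterthwaite equation, with sample variances $s_1^2, s_2^2$ and sample sizes $n_1, n_2$, is:

```latex
\nu \approx
\frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
     {\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1}
      + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}}
```

Expanding the denominator terms gives $s_1^4 / \bigl(n_1^2 (n_1 - 1)\bigr) + s_2^4 / \bigl(n_2^2 (n_2 - 1)\bigr)$, i.e., each sample size appears squared in the denominator of its term.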
Root cause
During the development of the automated analysis system in May 2025, we incorrectly copied the formula into the prototype notebook. Specifically, we omitted the squares on the sample sizes in the denominators of the two terms that make up the equation's denominator:
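Based on the description above, the transcribed denominator presumably took the form

```latex
\frac{s_1^4}{n_1 (n_1 - 1)} + \frac{s_2^4}{n_2 (n_2 - 1)}
\quad \text{instead of} \quad
\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}.
```

Each denominator term is then inflated by a factor of roughly $n_i$, which shrinks the estimated degrees of freedom, thickens the tails of the t-distribution, and therefore widens confidence intervals and inflates p-values, matching the observed symptom.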
Thus, when we implemented the formula in code, the implementation was also incorrect. During code review we compared the code against the (incorrect) formula in the notebook and did not check the formula itself against the original source.
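A minimal sketch of the two calculations, assuming hypothetical function names (the actual pipeline code differs), illustrates how the transcription error changes the result:

```python
def welch_satterthwaite(var1: float, n1: int, var2: float, n2: int) -> float:
    """Correct Welch-Satterthwaite degrees of freedom for two samples."""
    num = (var1 / n1 + var2 / n2) ** 2
    # Each denominator term is (s_i^2 / n_i)^2 / (n_i - 1),
    # i.e., the sample size appears *squared*.
    den = (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
    return num / den


def welch_satterthwaite_buggy(var1: float, n1: int, var2: float, n2: int) -> float:
    """Hypothetical reconstruction of the buggy version: the squares on the
    sample sizes in the denominator terms are missing."""
    num = (var1 / n1 + var2 / n2) ** 2
    den = var1 ** 2 / (n1 * (n1 - 1)) + var2 ** 2 / (n2 * (n2 - 1))
    return num / den


# With equal variances and n1 = n2 = 100, the correct formula gives
# df = 198, while the buggy one gives a drastically smaller value --
# a smaller df means heavier tails, hence wider confidence intervals
# and larger p-values, as observed in the incident.
print(welch_satterthwaite(1.0, 100, 1.0, 100))        # 198.0
print(welch_satterthwaite_buggy(1.0, 100, 1.0, 100))  # 1.98
```

Because the buggy denominator is larger by roughly a factor of the sample size, the error grows worse for larger experiments, which is why every affected experiment's results looked less significant than they really were.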
Resolution
On December 12th, 2025, we deployed the fix to the experiment analysis pipeline so that currently active (and soon-to-be-active) experiments would be analyzed correctly.
On December 18th, 2025, we completed a manual re-analysis of previous experiments and updated the experiments' results in the database. Not all experiments could be re-analyzed due to the 90-day retention policy: some experiments no longer had any raw data available, and others had only partial data. For more details, refer to our feasibility analysis report; for the re-analysis methodology, refer to our codebase. The following experiments' results were manually corrected:
- Donation Link (Phase II)
  - All metrics
- FY2025-26 WE3.1 Image Browsing A/B Test
  - All metrics
- Logged-in Synthetic A/A Test (PHP SDK)
  - All metrics
- FY25-26 WE 3.1.5 MinT for Readers AA test
  - All metrics
- Leveling up new notifications
  - All metrics
- fy24-25-we-1-7-rc-grouping-toggle
  - Only "RC Grouping Feature Try-out Rate"
The following experiments' results could not be corrected:
- FY25-25 WE3.6.1 Retention E2E AA Test
  - All metrics
- Logged-in Synthetic A/A Test (JS SDK)
  - All metrics
- SDS 2.4.11 Synthetic A/A Test
  - All metrics
- fy24-25-we-1-7-rc-grouping-toggle
  - All metrics except "RC Grouping Feature Try-out Rate"