A/B testing

On the Wikimedia platform, the impact of many new features is tested using controlled experiments. Many more should be. This page collects thoughts, questions, and links relevant to the topic.

Infrastructure

Phab tasks
- T208089: Infrastructure for interventions impacting editing metrics [declined]
- T135762: A/B Testing solid framework [declined]
- T76919: Implement reusable framework for A/B testing product features [declined]
- T76917: Investigate using Optimizely for UI A/B testing [declined]
- T213315: [Better Use of Data] Output 3.2: Controlled experiment (A/B test) capabilities [resolved]
The "bucket" field is the standard for recording buckets in EventLogging data. It could perhaps be made a default part of every schema, with general tools provided so that a bucketed user's bucket is recorded with all of their events for the duration of the test.

Bucketing

The most common way to bucket editors for A/B tests has been by user ID. If the test spans multiple wikis, it would be better to bucket by user name, because the name is consistent across wikis, which ensures an editor is placed in the same bucket on every wiki. This requires hashing the user name to get an even distribution across buckets.
- It would also be possible to use the global user ID, which would remove the need for hashing, although as of April 2019 it is not available in JavaScript.
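
A minimal sketch of hash-based bucketing by user name, written in Python for illustration. The function name `assign_bucket`, the experiment-name salt, and the choice of SHA-256 are all assumptions made for this example and do not describe any existing Wikimedia tool:

```python
import hashlib


def assign_bucket(user_name, experiment, buckets=("control", "treatment")):
    """Deterministically map a user name to one of the given buckets.

    Hypothetical helper: the name, the salt and the hash choice are
    illustrative assumptions, not an existing Wikimedia API.
    """
    # Salting with the experiment name (an assumption) keeps a user's
    # bucket in one test independent of their bucket in another test.
    key = f"{experiment}:{user_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    value = int.from_bytes(digest[:8], "big")
    return buckets[value % len(buckets)]


# The same user name always yields the same bucket, on any wiki.
bucket = assign_bucket("Example user", "hypothetical-test-2019")
```

Because the result depends only on the user name and the experiment name, any wiki that runs the same function places the editor in the same bucket without sharing state.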

Sample size / power analysis

Evan's Awesome A/B Tools
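
For a rough idea of what such a calculation involves, here is a sketch of the standard normal-approximation sample size formula for comparing two proportions in a two-sided test. It is not necessarily the formula the tool above uses; the function name and the default `alpha` and `power` values are assumptions chosen for illustration:

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per bucket to detect p_control vs p_treatment.

    Hypothetical helper using the normal approximation for a two-sided
    test of two proportions; defaults are illustrative assumptions.
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control)
                        + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Example: baseline rate of 10%, hoping to detect a rise to 12%.
n = sample_size_per_group(0.10, 0.12)
```

Halving the difference to be detected roughly quadruples the required sample, which is why small expected effects need long-running tests.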

General advice

Emily Robinson, Guidelines for A/B Testing
Privacy-conscious AB testing at Wikimedia Foundation

A/B testing on wiki

This is from an email from Aaron Halfaker.

This generally applies to any study that might affect our users: an experiment, a survey, a large-scale interview study, etc.

Create a description of the study on Meta in the Research namespace
- https://meta.wikimedia.org/wiki/Research:New_project
- Make sure to clearly describe the goals of the study and any disruption it might cause.
Post on a community forum where the active users are likely to take note. E.g. the Village Pump on English Wikipedia.
Engage in the discussion there.
- Make sure to link to your Meta page.
- It's common for no one to respond. You should wait at least a few days. Consider making a follow-up post reminding people that you'd like to start the study soon (bonus points for a target deployment date).
- Sometimes there will be a negative response. Try your best to address concerns and make modifications to the study design. If the negative response persists, consider rescheduling or fundamentally redesigning the study.
Assuming the discussion went well, do the study. Update the Meta page with results and discussion.
Post the results of the study in the same community forum and consider bringing it to the Wikimedia Research Showcase.

Software

Some Wikimedia A/B tests are hard-coded, such as the A/B testing used for the Vector 2022 skin. As of 2023, an example file used for this can be found in the Vector 2022 repo, named A/B.js.