Test Kitchen/Experiment Notes

From Wikitech

Experiment Platform

Logged-in Synthetic A/A (JS SDK)

The https://mpic.wikimedia.org/experiment/synth-aa-test-mw-js experiment started on 2025-08-07 and will run for one week.

Questions/Notes:

Logged-in Synthetic A/A (PHP SDK)

The https://mpic.wikimedia.org/experiment/synth-aa-test-mw-php experiment started on 2025-08-28 and will run for two weeks.

Questions/Notes:

  • Run a logged-in synthetic A/A test using the PHP SDK (T397143)
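For either synthetic A/A test, one basic sanity check is that enrollment counts in the two arms are consistent with the configured 50/50 split. The sketch below is illustrative only (the counts are placeholders, not real experiment data); a real check would read counts from the experiment's enrollment data.

```python
# Sanity check for an A/A test: with a 50/50 split, enrollment counts in the
# two arms should differ only by chance. A one-degree-of-freedom chi-square
# test against equal expected counts flags suspicious imbalance.
# The counts used below are placeholders, not real experiment data.

def aa_balance_check(count_a: int, count_b: int, critical: float = 3.841) -> bool:
    """Return True if the split looks balanced at roughly alpha = 0.05."""
    total = count_a + count_b
    expected = total / 2
    chi_sq = ((count_a - expected) ** 2 / expected
              + (count_b - expected) ** 2 / expected)
    return chi_sq < critical  # 3.841 is the chi-square critical value for df=1

print(aa_balance_check(50_210, 49_790))  # small imbalance: looks fine (True)
print(aa_balance_check(52_000, 48_000))  # large imbalance: investigate (False)
```

A balanced A/A split does not by itself validate the instrumentation, but an unbalanced one is a strong signal that assignment or event collection is broken.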

Moderator Tools

Watchlist Group by Page Toggle

The https://mpic.wikimedia.org/experiment/fy24-25-we-1-7-rc-grouping-toggle experiment began on 2025-08-11.

Questions/Notes:

  • Q 2025-08-22: If we're concerned that our sampling percentages aren't adequate to make a decision, is it possible to update them without deleting the experiment and restarting? If not, will deleting and recreating the same experiment cause any issues with the automated analytics we already have?
    • We are prioritizing T396650 (xLab: enable changes in traffic allocation during active experiments).
  • Question about providing a better mapping between database names and formal wiki names (e.g. 'elwiki' is Greek Wikipedia), raised in the FY 25/26 SDS2 KR2.1 Steering Committee meeting on 2025-08-11.
  • Question about how to update wikis after activation: the most expedient method at this time is for the experiment owner to delete the experiment and copy its details into a new experiment with the same machine-readable name (2025-08-07).
  • Experiment Platform had to roll out asynchronous instrument config fetching, which pushed back the start date by a week (2025-08-01 to 2025-08-07).
  • Some confusion about when to turn on an experiment 24 hours ahead of the start date (2025-07-31).
  • Slack thread reference for some of the bullets below - https://wikimedia.slack.com/archives/C01DFMX6QLB/p1753980769576159
  • Automated Analytics Feedback Notes

Reader Experience

Reading List

Not yet started

Questions/Notes:

  • There is a possibility we will want to leave the experiment on for an as-yet-undetermined amount of time after it concludes. We would no longer care about data collection (although we would still want dashboards for the data collected over the course of the experiment to remain available!); we would simply want the treatment to stay on for the users who were placed in it. How difficult would this be to set up? Is it easier if we know this going in? (See Slack thread.)
  • Initial question about combining JS + PHP SDKs: for our Reading Lists experiment (for logged-in users) we're considering building out instrumentation using both the PHP SDK and the JS SDK. We want to check whether this is supported, has any precedent, or has any limitations we should be aware of, e.g. whether certain contextual attributes are unavailable server-side (phab:T402314) (Slack thread).

Donate A/B Test

The https://mpic.wikimedia.org/experiment/we-3-2-3-donate-ab-test-1 experiment will start on testwiki on 2025-08-12.

Questions/Notes:

  • Max limits: Do the 10% / 0.1% limits apply to logged-in users as well, or only to logged-out users? Is there a different upper bound for logged-in users?
  • Successive rollouts: any thoughts on how we could support debug testing in production for edge-unique experiments? E.g. testing the treatment path on enwiki in production before fully deploying the experiment there.
  • Automated Analytics: notes and feedback on Analytics, decision making, helpful context.
  • Key lesson for the product team: subtle changes to the donate link can potentially move donations in a positive direction.
  • Key lesson for teams running experiments: make sure you fully understand your metrics and success criteria before launching your experiment. Given how we worded the hypothesis, we were likely more interested in a relative difference than an absolute one, or we should have set our targets differently given the baseline.
  • Key lesson for the experiment platform team: for metrics such as donate-link CTR that are comparatively low, the 0.1% sampling restriction on English Wikipedia makes testing difficult. We saw hundreds of thousands of impressions but only a couple hundred total clicks on the link, so official guidance on how to structure these types of experiments would be helpful.
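The low-CTR concern in the last bullet can be made concrete with a standard two-proportion sample-size calculation. The sketch below uses illustrative numbers (a 0.1% baseline CTR and a 20% relative lift are assumptions, not the experiment's actual figures) to show why sampling limits matter for rare-event metrics.

```python
import math

def samples_per_arm(p_base: float, rel_lift: float,
                    z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate impressions needed per arm to detect a relative lift in a
    proportion at alpha = 0.05 (two-sided) with ~80% power."""
    p_treat = p_base * (1 + rel_lift)
    p_bar = (p_base + p_treat) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_base * (1 - p_base)
                                      + p_treat * (1 - p_treat))) ** 2
    return math.ceil(numerator / (p_base - p_treat) ** 2)

# Illustrative: a 0.1% baseline CTR with a 20% relative lift needs on the
# order of 400k+ impressions per arm, so a tight sampling cap can leave an
# experiment underpowered even with hundreds of thousands of impressions.
print(samples_per_arm(0.001, 0.20))
```

Running this kind of calculation before launch makes it clear whether the allowed sampling rate can deliver a decisive result for the chosen metric and target lift.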

Reader Growth

Retention Baseline

The https://mpic.wikimedia.org/experiment/we-3-6-1-retention-aa-test2 experiment started on 2025-07-30.

Questions/Notes:

Growth

Leveling up notifications

  • Retro
  • Testing a full experiment setup can be tricky; a number of indirections make testing difficult.
    • The Beta Cluster does not behave the same as the production cluster with respect to data collection, which makes it hard to test the automatic inclusion of contextual attributes.
    • Users with overridden assignments do not produce data, which makes it hard for a QA engineer to assert what a user will see and what data the instrumentation will produce.
    • Overriding is not suitable for testing the first steps after account creation.
  • Writing instrumentation is time-consuming. While it has become much leaner with the tooling provided by Test Kitchen (well-thought-out base schemas and pre-configured metrics clients), it is still a development-intensive task. Ideally we would reduce the time spent by creating standard instrumentation that is reusable across experiments, rather than writing and deleting it in short cycles.