Metrics Platform/Decision Records/Single Table Per Base Schema
Status: ACCEPTED
Author: Clare Ming
Deciders: Sam Smith, Andrew Otto, Gabriele Modena, Mikhail Popov, Andreas Hoelzl, Luke Bowmaker, Virginia Poundstone, Will Doran
Consulted: Megan Neisler
Informed: Product Analytics
Date authored: 2024-07-23
Date decided: 2024-08-08
Keywords: dynamic stream config, dynamic event stream declaration, datasets config, metrics platform instrument configurator, dynamic EventStreamConfig, mpic
Context and Problem Statement
In service of SDS 2, SDS 2.1.6 involved building MPIC (Metrics Platform Instrument Configurator), an application that, in conjunction with the Metrics Platform MediaWiki extension, enables non-technical users to manage instrument configuration derived from Metrics Platform base schemas. MPIC was initially specified to include dynamic stream configuration creation as part of its feature set. During discussions with Data Engineering, it became apparent that Data Products' objective of allowing dynamic stream config declaration within MPIC would be problematic due to increased complexity downstream (latency, provenance, resource allocation - see the concerns about MPIC dynamic stream creation in this comment).
Considered Options
- Allow for dynamic event stream configuration wherein every new instrument creates a new stream to which events are sent.
- Keep static event stream configuration with pre-declared streams; new experiments are identified by a field in the event, and analysis for each experiment filters on that field.
Decision Outcome
Option 2 - Maintain static event stream configuration, create a stream (or multiple streams, if applicable) per Metrics Platform base schema, and submit all base-schema-conforming events to those streams.
Per notes from a DPE Sync discussion and T361853: Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator, the ultimate decision came down to Solution 2 of Andrew Otto's comment:
"Static ESC with pre-declared streams: new experiments are ID-ed by a field in the event, and analysis for that experiment filters on that field. ...Solution 2 is better because it is less complex...[and] because experiments from the same instrumentation can be compared using filters, rather than joins between different streams & tables."
General agreement was reached between Data Engineering and Data Products (see Sam's comment in https://phabricator.wikimedia.org/T361853#9824725):
"From the perspective of an instrument creator, I can either reuse an existing schema (and therefore use a stream that's already statically-configured) or add a new entry to the static configuration for the schema. Provided Data Products do a good job of providing useful schemas and, in future, standard instruments, then we can remove the need to update configuration for a lot of cases. Cool!"
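The filter-based analysis pattern behind this decision can be sketched as follows. This is an illustrative sketch only: the field name `instrument_name`, the event shapes, and the function are assumptions for illustration, not the actual base schema or client library API.

```python
# Illustrative sketch: with a single table per base schema, events from every
# instrument land in one shared table, and analysis filters on an instrument
# identifier field rather than joining per-instrument tables.
# NOTE: `instrument_name` and the event shapes below are hypothetical.

# Events as they might land in the shared base-schema table.
events = [
    {"instrument_name": "search-preview", "action": "click"},
    {"instrument_name": "donate-banner", "action": "impression"},
    {"instrument_name": "search-preview", "action": "impression"},
]

def events_for_instrument(events, instrument_name):
    """Filter the shared table down to one instrument's events,
    analogous to `WHERE instrument_name = :name` in SQL."""
    return [e for e in events if e["instrument_name"] == instrument_name]

search_events = events_for_instrument(events, "search-preview")
print(len(search_events))  # 2
```

Comparing two experiments built on the same base schema then becomes two filters over one table, rather than a join between separately declared streams and tables.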
Positive Consequences
- The preference/requirement from Data Engineering to keep static configuration is met.
- Product Analytics will be able to query tables using filters rather than joins (performance impacts are being explored in a spike).
- Product teams will not have to create new schemas if they use Metrics Platform base schemas for their instruments.
Negative Consequences
- Product teams will have to continue to declare/deploy stream configurations to match them to Metrics Platform base schemas.
- Metrics Platform client libraries will need to be updated to include the unique instrument identifier field that Product Analysts will filter on in their queries.
- The vast majority of interaction data would go into one massive table, which will create significant limitations on how much data we can query with Presto – potentially only an hour at a time, as opposed to the multiple days, weeks, or even months that are possible now with the smaller, per-instrument tables. Depending on how powerful our Presto cluster is, we would likely have to switch to working with interaction data exclusively outside of Superset's SQL Lab, since Presto and Spark SQL differ substantially and require a high degree of effort to translate queries between the two SQL dialects.
- This will also negatively affect our ability to create Superset dashboards with Presto based on the un-aggregated interaction data, which has become a common practice among Product Analysts. We accept this consequence because the metrics we measure and make available in those dashboards and other reports should be pre-computed with data pipelines (that have access to more powerful and robust Spark SQL) rather than calculated on-the-fly with Presto. We can still use Presto but mainly for working with pre-computed measurements of interaction metrics rather than with raw interaction data.
- Event sanitization: we can currently configure sanitization/retention policies on a per-instrument basis because instruments have separate streams/tables, but with a monostream/monotable we would lose that flexibility. Without changing how the current sanitization pipeline works, we would have a single entry in the allowlist for the monotable. We would have to reconsider how we evaluate risk when it comes to retaining sanitized data longer than 90 days. (https://phabricator.wikimedia.org/T367057#10032856)
- We may need to re-evaluate whether the event sanitization system is a legacy artifact that has outlived its usefulness and can be decommissioned at some point.
- DE hopes to one day (after Refined event tables are on Iceberg, and maybe after Datasets Config?) revisit sanitization and retention and refactor them, possibly using in-place updates and deletes via Iceberg (https://phabricator.wikimedia.org/T367057#10033075).
Links
Background tickets for reference:
https://phabricator.wikimedia.org/T360647
https://phabricator.wikimedia.org/T360738
https://phabricator.wikimedia.org/T361853
Additional Comments
Next steps outlined in https://phabricator.wikimedia.org/T361853#9824725