SRE/Data Persistence/Design Review/Global Editor Metrics 2025
This page attempts to follow the structure in SRE/Data_Persistence/Design_Review, and adds to it.
Global Editor Metrics Data Persistence Design Review
The data
What is the “business case”? Any user stories? Who are the data product owners?
These are global editor metrics, e.g. edit_counts per editor.
FY25-26 Year in Review and Impact Module - Notes & Product Requirements (see Product needs tab)
T388455 [Spike] Full-year editing stats for Year in Review
T403660 Global Editor Metrics for YiR, Apps Activity Tab, and Growth Impact Module
Data product owners
Every data product should have at least 1 owning team and 2 people listed. At least one of the people listed should be an accountable manager.
- Mobile Apps Teams - Haley Nordeen, Jazmin Tanner, Joseph Seddon
- Growth Team - Kirsten Stoller
Requirements
Global editor metrics are desired for 2025 Year in Review, Mobile app Activity Tab, and Growth Impact Module. The requested metrics are mostly the same, but the time ranges the products want to sum over are different. In order to reduce the number of bespoke metrics, we will try to generate metrics that will satisfy requirements for all of these features. To do this, the metrics will ideally be
- Additive, i.e. they can be summed. The smallest granularity asked for is daily. If we generate daily-granularity metrics, they can be summed up for coarser-grained use cases like YiR.
- Immutable. Immutability makes the data lifecycle simpler. Backfills are easier and data outputs are (mostly) deterministic.
We can then provide endpoints that sum over the metrics for time ranges as needed.
Depending on the need, storing fine-grained (daily) metrics in an additive way can take up a lot of space. If this ends up being the case for any of our data, we will consider relaxing this property.
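To make the additivity idea concrete, here is a minimal sketch (illustrative only; data and function names are hypothetical, not part of any WMF API) of how daily-granularity counts can serve arbitrary date ranges by simple summation:

```python
from datetime import date, timedelta

# Hypothetical daily edit counts for one editor, keyed by date.
# In production these would come from the daily-granularity metric store.
daily_edit_counts = {
    date(2025, 1, 1): 4,
    date(2025, 1, 2): 0,
    date(2025, 1, 3): 7,
    date(2025, 1, 4): 2,
}

def edits_in_range(counts, start, end):
    """Sum additive daily metrics over an arbitrary [start, end] range."""
    total = 0
    day = start
    while day <= end:
        total += counts.get(day, 0)  # missing days contribute 0
        day += timedelta(days=1)
    return total

# A coarser-grained use case (e.g. YiR) just sums more days.
print(edits_in_range(daily_edit_counts, date(2025, 1, 1), date(2025, 1, 2)))  # 4
print(edits_in_range(daily_edit_counts, date(2025, 1, 1), date(2025, 1, 4)))  # 13
```

Because the daily records are immutable, a 30-day and a 365-day endpoint can both be served from the same table with no bespoke precomputation.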
The metrics requested are:
- For each global user editor account:
- Total edit count within a date range.
- In the last year, user A edited 1000 times.
- In the last 30 days, user A edited 100 times.
- Total number of days edited within a date range.
- In the last year, user A edited on 150 days.
- In the last 30 days, user A edited on 5 days.
- Longest daily consecutive edit streak within a date range.
- In the last year, user A's longest daily edit streak was 25 days.
- In the last 30 days, user A's longest daily edit streak was 5 days.
- List of edited pages within a date range.
- In the last year, user A edited pages P1, P2, P3, ...
- In the last 30 days, user A edited pages P3, P4, ...
- NOTE: This metric requirement is being dropped (for now), as Year in Review will not use it this year. Not needing to store per-page information greatly simplifies the queries, especially in worst cases for prolific editors.
- Total number of pageviews within a date range on pages edited by this user.
- In the last year, pages ever edited by user A received 50000 pageviews.
- In the last 30 days, pages ever edited by user A received 10000 pageviews.
- Top K viewed pages within a date range on pages edited by this user.
- In the last 30 days, the top K viewed pages edited by this user were [{P1, 5000 views}, {P2, 1000 views}, ...].
- NOTE: Top K pages within a date range requires precomputing and storing per-user, per-page pageview counts (in Cassandra), so we have relaxed this requirement to the following:
- Top K viewed pages each month on pages edited by this user.
- In 2025-07, the top K viewed pages edited by this user were [{P1, 5000 views}, {P2, 1000 views}, ...].
- This metric will not be additive. You cannot reliably sum monthly results, since the list of top pages may change each month.
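Of the metrics above, the longest-streak requirement is the only one that is not a plain sum, but it too can be derived from per-day activity alone. A minimal sketch (illustrative, not the production query):

```python
from datetime import date, timedelta

def longest_streak(edit_days):
    """Longest run of consecutive calendar days in a collection of edit dates."""
    days = sorted(set(edit_days))
    best = current = 1 if days else 0
    for prev, cur in zip(days, days[1:]):
        if cur - prev == timedelta(days=1):
            current += 1          # streak continues
        else:
            current = 1           # gap: streak restarts
        best = max(best, current)
    return best

# Edited on Mar 1-3, then Mar 7-8: longest streak is 3 days.
days = [date(2025, 3, 1), date(2025, 3, 2), date(2025, 3, 3),
        date(2025, 3, 7), date(2025, 3, 8)]
print(longest_streak(days))  # 3
```

This is why daily granularity is sufficient: the streak for any requested date range can be computed client-side (or in-query) from the list of days with nonzero edit counts.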
Primary or secondary data source
Will this be a canonical data source (primary data) or derived (secondary) data?
Secondary.
If it is secondary data, what are the primary sources of data?
What is the nature of the derivation? For example: Is it an alternative or generated form of the canonical content? An aggregation of some sort?
These are daily aggregations of MediaWiki edit data (not content). One of the metrics joins this edit data with pageview counts.
Sources of data are in the WMF Analytics Data Lake as Hive/Iceberg tables.
- Hive pageviews_hourly table.
- Pageview counts are ultimately sourced from webrequest logs.
- Edit data – there are 3 data source options:
- mediawiki_history Data Lake table
- available monthly
- Accuracy is good – fully consistent snapshot
- Hive event.mediawiki_page_change_v1 Data Lake table
- available hourly (2-3 hour delay)
- Accuracy is best effort. Expect very minimal regular event loss, and potential for high loss in the very rare case of outages. (No outages were encountered in the last several years.)
- Iceberg mediawiki_content_history_v1 Data Lake table
- Available daily
- Accuracy is good – eventually consistent.
- Would require significant engineering effort to include a central user id in this dataset.
For the work in 2025, edit data will be sourced from Hive event.mediawiki_page_change_v1 table. In the future, if we want more accuracy, we will make it possible to source from Hive mediawiki_content_history_v1 table and use that instead.
Data privacy risks
Is there any PII present in the dataset? See also: Data Retention Guidelines and Data Publication Guidelines.
Generally, no. Edit data is public, and aggregated pageview counts are also public.
However, one of the metrics is about pages per editor. We'd prefer that stored metrics be immutable, but if they are immutable, then there is a potential privacy issue:
If a page deletion is suppressed, its title is no longer public information. We will not have an easy way to update the list of page titles edited in the past.
Note: this is a current issue for the AQS Pageviews and Commons Impact Metrics APIs too.
An LCS3 review for this issue has been completed. Summary:
Data architecture
Structured or unstructured data?
For example, is it structured as objects with named attributes, arrays, etc., or is it unstructured and opaque? PDFs and images are examples of unstructured data; a JSON object serialized as bytes is not.
Structured.
Data model (if structured)
For this we’re looking for documentation that fully describes the model, including types, constraints, and relationships.
These metrics are about global editor contribution stats. Each dataset is keyed by a MediaWiki central user id (from CentralAuth). The finest granularity needed is daily metrics.
For historical and also for data size reasons, edit metrics will be stored in Druid and pageview metrics will be stored in Cassandra.
Dataset: mediawiki_history_reduced
Edit related metrics are already stored in a serving layer storage system: mediawiki_history_reduced in Druid.
In order to serve the metric requirements, mediawiki_history_reduced will need the following updates:
- Add `user_central_id` field.
- Add `page_id` field.
- Ingest data daily.
This work will be tracked in T406069 Global Editor Metrics - Druid mediawiki_history_reduced changes
From this dataset, we can serve the following metrics:
- Total edit count within a date range.
- Client can compute sum from returned daily results, or overall sum can be computed in query.
- Total number of days edited within a date range.
- Client can compute from daily query results, or count can be computed in query
- Longest daily consecutive edit streak within a date range.
- Client can compute from daily query results.
Learn more about the decision to use Druid mediawiki_history_reduced at https://phabricator.wikimedia.org/T401260#11230961.
Dataset: pageviews_per_editor
| Field | Type | Description |
|---|---|---|
| `user_central_id` | bigint | MediaWiki user central id. At Wikimedia, this is the CentralAuth `globaluser` `gu_id` field. |
| `granularity` | string | "daily" or "monthly" |
| `view_count` | bigint | Count of pageviews on this day on all pages edited by this user before or on this day. |
| `dt` | timestamp | The date for which we aggregate the data. It should point to the beginning of the granularity period (e.g. for daily: YYYY-MM-DDT00:00:00.000Z). |
Key setup: PRIMARY KEY ((user_central_id, granularity), dt).
(user_central_id, granularity) is the composite partition key, and dt is the clustering key. This is useful, since queries will include date ranges, and results are expected to be read over a date range.
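This key design means every read is a single-partition range scan over the clustering key. A sketch of the query shape it supports (illustrative only; the table name matches this document, but a real client would use the Data Gateway or prepared statements with bind parameters rather than string formatting):

```python
# Build the shape of a CQL range query supported by
# PRIMARY KEY ((user_central_id, granularity), dt).
def pageviews_range_query(user_central_id: int, granularity: str,
                          start_dt: str, end_dt: str) -> str:
    # Full partition key in the WHERE clause, plus a range on the
    # clustering key dt: an efficient single-partition slice.
    return (
        "SELECT dt, view_count FROM pageviews_per_editor "
        f"WHERE user_central_id = {user_central_id} "
        f"AND granularity = '{granularity}' "
        f"AND dt >= '{start_dt}' AND dt <= '{end_dt}'"
    )

q = pageviews_range_query(12345, "daily", "2025-01-01", "2025-12-31")
print(q)
```

Because the partition key must appear in full, every query is pinned to one (user, granularity) partition, which keeps reads cheap and cache-friendly.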
This table is similar to the one used for other timeseries pageview endpoints, e.g. pageviews per project.
From this table, we can support the following metrics:
- Total number of pageviews within a date range on pages edited by this user.
- Client can sum timeseries results.
Dataset: pageviews_top_pages_per_editor
| Field | Type | Description |
|---|---|---|
| `user_central_id` | bigint | MediaWiki user central id. At Wikimedia, this is the CentralAuth `globaluser` `gu_id` field. |
| `granularity` | string | "monthly" |
| `wiki_id` | string | Wiki id, sometimes also called just 'wiki' or 'database name'. |
| `page_id` | bigint | MediaWiki page id |
| `rank` | int | Pageview rank in this user's edited pages monthly pageviews. E.g. the most viewed page is rank 1. |
| `top_k` | int | Limit of top viewed pages used when calculating this month's data. When k = 10, there will be at most 10 page records per user per month. |
| `view_count` | bigint | Count of pageviews to this page in this month. |
| `dt` | timestamp | The date for which we aggregate the data. It should point to the beginning of the granularity period (e.g. for monthly: YYYY-MM-01T00:00:00.000Z). |
Key setup: PRIMARY KEY ((user_central_id, granularity), dt, wiki_id, page_id).
(user_central_id, granularity) is the partition key, and dt, wiki_id, and page_id are clustering keys. This is useful, since we will always query by user_central_id and month. The ordering of the remaining clustering keys supports a one-to-many relationship between a timestamp (dt) and wikis (wiki_id), and a one-to-many relationship between each wiki and its corresponding pages (page_id).
From this table, we can support the following metrics:
- Top K viewed pages each month on pages edited by this user.
Note that this metric is not additive! Each month's list of top pages cannot always be combined to provide an accurate 'yearly' top pages. The months can be combined if and only if the number of pages ever edited by a user is less than the value of top_k.
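The non-additivity is easy to demonstrate with a toy example (illustrative numbers only). With K = 1, a page that is consistently second every month never appears in any monthly top-K list, so no merge of those lists can recover it:

```python
# Toy illustration (K = 1) of why monthly top-K lists are not additive.
jan_top1 = {"P1": 100}   # hidden: P2 had 99 views in January
feb_top1 = {"P3": 100}   # hidden: P2 had 99 views in February

# Naive merge of the monthly top-K lists:
merged = {}
for month in (jan_top1, feb_top1):
    for page, views in month.items():
        merged[page] = merged.get(page, 0) + views
yearly_top = max(merged, key=merged.get)

# True two-month totals: P1 = 100, P2 = 198, P3 = 100.
# The real top page, P2, was truncated out of both monthly lists,
# so the merged result can never recover it.
print(yearly_top)  # "P1" or "P3", but never the true top page "P2"
```

This is the sense in which the months can only be combined safely when a user's full set of edited pages fits within top_k.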
Notes
- `dt` could be a date instead of a timestamp, but for consistency with other data models, and for future flexibility, we decided to store this as a timestamp whose values start at the beginning of the stored granularity period (day or month).
- We had considered storing pageviews per page per user daily in one table, to serve both the aggregate pageviews and the top K viewed pages metrics. After initial data sizing estimations, we decided that this data would be too large, and not worth storing in full just to serve the top K use case. See https://phabricator.wikimedia.org/T401260#11230961 for more details. After doing some real data sizing, storing this may be possible after all, but it would require more platform support (e.g. aggregations in Data Gateway). It was decided that the product didn't really need this flexibility, so we are going with monthly precomputations.
Query model
How is the data accessed, how is it written/read?
- Written:
- Daily batch updates from the Data Lake.
- Read:
- Needs to be queryable by mobile apps, as well as by MediaWiki JS and PHP.
- Edit data can be filtered by `user_central_id` and several parameters, and aggregated over the date range in Druid.
- Pageview data will be filtered by `user_central_id` and date range.
What is the rate of change / churn rate / cache friendliness, and cardinality?
Edit events will be pushed daily to Druid mediawiki_history_reduced.
Pageview counts will be computed daily and pushed to Cassandra daily.
Top viewed pages per user will be computed monthly and pushed to Cassandra monthly.
The metrics will be immutable, so there is no rate of change to the records themselves. Since they are immutable, they can be cached indefinitely. An exception may be in the case of backfills in the case of bugs or outages.
Cardinality:
- `user_central_id`: approx 80 million distinct keys in CentralAuth as of 2025-09, but only about 22 million have ever made an edit.
- Druid `mediawiki_history_reduced` will keep all edit events, and provide timeseries aggregations on them.
- `pageviews_per_editor` will have daily and monthly aggregations. See data sizing below for estimates of the number of records.
- `pageviews_top_pages_per_editor` has `wiki_id`, `page_id` values, which can have high cardinality. For each `user_central_id`, the values will not vary much over time.
What sort of semantics, properties, or guarantees are required? Consistency, availability, isolation, atomicity, ...?
Consistency:
Pageview metrics: since these are large aggregate metrics, consistency here relates to accuracy requirements. For the initial product, best effort is fine.
Edit metrics: mediawiki_history_reduced will be updated daily, and reconciled monthly in the style of a lambda architecture. If daily events are missed, they will eventually be inserted as part of the monthly full history regeneration and update. If we base daily updates on an eventually consistent source, like mediawiki_content_history_v1, we should have daily consistency in line with mediawiki_content_history_v1's SLO.
Consistency of the serving storage (cassandra) results can be best effort.
Availability: best effort.
Isolation & Atomicity: Records should be immutable and distinct, so there is no need for write isolation or other transactional concerns.
Data flow diagrams
Edit count metrics:
Pageview metrics:

Operational considerations
Latency expectations
TBD.
Expected throughput
Writes: < 10MB per day. See estimation below.
Reads: TBD.
Request rates
Year in Review is the most bursty feature. Their estimate is:
- 100-999k bursts expected at the beginning of the calendar year, spread over a 24-hour period, dependent on timezone.
Storage volume
current & projected
pageviews_per_editor
| Field | Byte size |
|---|---|
| `user_central_id` | 8 |
| `granularity` | ~7 |
| `view_count` | 8 |
| `dt` | 8 |
About 28,000,000 users have actually edited a page. Only a fraction of these will receive pageviews on pages they have edited on any given day. Even so, for a worst-case estimate, let's assume they all will. That is 28,000,000 new records every day.
Record size: 8 + 7 + 8 + 8 = 31 bytes
Average daily size: 28,000,000 new records * 31 bytes = 868,000,000 bytes = ~0.9GB per day.
We will also be precomputing at monthly granularity, so that is an additional 12 records per user per year. This is negligible enough to not factor into this rough estimate.
This table will grow by a maximum of ~320 GB per year.
A more realistic guess is much much lower. Estimating this well is quite difficult, because computing these metrics is what will tell us how big they are.
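The worst-case arithmetic above can be written out directly from the per-field byte sizes in the table (a back-of-envelope sketch; the byte sizes are the same estimates used in this section, not measured Cassandra on-disk sizes):

```python
# Back-of-envelope sizing for pageviews_per_editor, using the
# per-field byte-size estimates from the table above.
field_bytes = {"user_central_id": 8, "granularity": 7, "view_count": 8, "dt": 8}
record_bytes = sum(field_bytes.values())          # 31 bytes per record

editors = 28_000_000          # worst case: every editor gets a record daily
daily_bytes = editors * record_bytes
yearly_gb = daily_bytes * 365 / 1e9

print(record_bytes)                   # 31
print(round(daily_bytes / 1e9, 2))    # ~0.87 GB per day
print(round(yearly_gb))               # ~317 GB per year
```

The same approach applies to pageviews_top_pages_per_editor below, substituting its record size and a per-user record count of at most K per month.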
pageviews_top_pages_per_editor
| Field | Byte size |
|---|---|
| `user_central_id` | 8 |
| `granularity` | ~7 |
| `wiki_id` | ~10 |
| `page_id` | 8 |
| `rank` | 4 |
| `top_k` | 4 |
| `view_count` | 8 |
| `dt` | 8 |
Record size: 8+7+10+8+4+4+8+8=57 bytes
The size of pageviews_top_pages_per_editor will vary based on the size of K. Let's assume K = 10:
- About 28,000,000 users have actually edited a page.
- The median number of distinct pages ever edited per user is 2. More stats about # of edited pages distribution can be found at T401260#11206915.
- K=10 means that each user will have a max of 10 records per month.
For worst case assumption, let's assume that all editors have edited at least K pages, and that all 10 of those pages receive pageviews.
28,000,000 users * 57 bytes * 10 pages edited = 15,960,000,000 bytes = ~16GB / month
This table will grow by a max of ~190GB per year.
A more realistic guess is much, much lower. Using the median of 2 edited pages per user, this changes to:
28,000,000 users * 57 bytes * 2 pages edited = 3,192,000,000 bytes = ~3.2GB / month == ~38GB / year.
But even this is an overestimate. Most edited pages won't receive pageviews every month.
See https://phabricator.wikimedia.org/T401260#11341557 for a real data size estimate.
Data Lifecycle
think: retention policies, archival, deletion, etc.
If possible, data can and should be retained indefinitely.
However, there are no current product asks for global editor metrics beyond the previous 12 months. If we need to set TTLs to manage data size growth, we can set a TTL of a few years.
Data product expiration date
After the expiration date, platform maintainers can justify decommissioning this data product and its supporting data pipelines. Anyone can update this expiration date at any time with no questions asked. If this expiration date passes, and there are no official product owners, the data product may be deleted.
2029-04-01
Timeline
How long after completion of the review do you anticipate needing your storage resources?
- An endpoint with dummy data by early November 2025.
- Productionization is requested by mid-November 2025.