User:GGoncalves-WMF/Sandbox:Design Review

From Wikitech

Introduction

The internal infrastructure for transforming and serving wiki data is extremely powerful and flexible, but also complex. In the spirit of our culture of collaboration and review, the Data Platform Engineering and Data Persistence SRE teams jointly offer a Design Review to help engineering teams:

  • Design data processing and serving architectures, anticipating corner cases and informing decisions that are hard to reverse.
  • Deploy and onboard onto those architectures.
  • Surface missing capabilities to inform future platform work.

We are intentionally casting a wide net as we experiment with the most valuable format for this process. If you are...

  • A Product team with wireframes, initial requirements, and a rough design for a new data-driven feature.
  • A Product team facing limitations of your existing data architecture (e.g. maintenance scripts, job queue and MariaDB tables).
  • An engineering team helping a Product team with a specific aspect of their implementation (e.g. Machine Learning models).

We want to help you before implementation begins!

Preparing

We prefer that teams approach us with a draft design, along with whatever artifacts illustrate the underlying business requirements.

However, we are also happy to help you start your design if you are not there yet: the last thing anyone wants is for you to spend significant time on a design, only to find out later that you have gone down the wrong path. You can reach out early and often!

The following questions, grouped by area, illustrate some of the things we'll consider during the review, and expect you to document in the final design.

Product

  • What is the business case for this work? Is there a PRD or user stories we can reference?
  • How is the data going to be used by end users? Are there wireframes we can reference?
  • What is your timeline? How long after completion of the review do you anticipate needing your storage and serving resources?
  • What is the priority of this work? Is it critical to the team's goals, or a nice-to-have? Is it tracked under the APP?
  • What does the roadmap look like, beyond this review? Do we have further capabilities planned, and for when?
  • Who owns the output data and its business requirements (ideally: a team, engineering and manager points of contact)?

Data

  • In simple terms, how would you describe the data being created? What do we intend to measure or represent?
  • Is there a data flow diagram with an overview of the data transformations?
  • What primary data sources (e.g., MediaWiki tables, instrumented events) are we trying to transform and serve?
  • What kinds of transformations are we applying (e.g., an aggregation of some sort)?
  • How often does the data change?
  • Is there any Personal Information in the data?
  • Who should have access to the data? Is it public?
  • Are we producing structured data, like key-value pairs or JSON (even if serialized), or unstructured, opaque data (e.g., images)?
  • If the data is structured, what is the data model? What types, constraints, relationships and cardinality are expected?
  • Do we require specific semantic properties or guarantees (e.g., ACID transactions, eventual consistency)?

Serving

  • What expectations do we have about the API for accessing this data?
  • What kinds of read patterns are expected and from where? Are lookups random or sequential? Do we need search semantics?
  • What kinds of write patterns are expected and from where? Are writes frequent or rare? Individual or in bulk?

Operations

  • What request rates do we expect? Do we expect spikes?
  • What throughput do we expect in processing? Do we expect bursts?
  • How much data are we producing? How do we expect it to grow?
  • What expectations do we have on the data (e.g., freshness or availability)? What happens if they're not met?
  • What expectations do we have on serving the data (e.g., availability or query latency)? What happens if they're not met?
  • What systems do we depend on? What other systems depend on this one?
  • What retention is needed for this data? Is it versioned, and if so, when do old versions expire?

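
If you are unsure how to answer the sizing questions above, a back-of-envelope estimate is usually enough to start the conversation. The sketch below is only an illustration: the `WorkloadEstimate` class and every number passed to it are hypothetical placeholders, not measurements or platform limits, and the storage figure ignores replication, indexes and compression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadEstimate:
    """Rough sizing inputs; every value passed in is a placeholder assumption."""

    rows_per_day: int          # new records produced per day
    bytes_per_row: int         # average serialized size of one record
    retention_days: int        # how long records are kept before they expire
    reads_per_day: int         # expected read requests per day
    peak_to_mean_ratio: float  # how far a spike exceeds the average read rate

    def steady_state_bytes(self) -> int:
        """Raw storage once retention applies (no replication or compression)."""
        return self.rows_per_day * self.bytes_per_row * self.retention_days

    def mean_reads_per_second(self) -> float:
        """Average read rate, spreading the daily total over 86,400 seconds."""
        return self.reads_per_day / 86_400

    def peak_reads_per_second(self) -> float:
        """Read rate during a spike, scaled from the mean."""
        return self.mean_reads_per_second() * self.peak_to_mean_ratio


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
estimate = WorkloadEstimate(
    rows_per_day=2_000_000,
    bytes_per_row=500,
    retention_days=90,
    reads_per_day=8_640_000,
    peak_to_mean_ratio=5.0,
)

print(f"storage at steady state: {estimate.steady_state_bytes() / 1e9:.1f} GB")
print(f"mean reads/s: {estimate.mean_reads_per_second():.0f}")
print(f"peak reads/s: {estimate.peak_reads_per_second():.0f}")
```

With these placeholder inputs the sketch reports 90.0 GB of raw storage, 100 mean reads per second and 500 reads per second at peak. Rough figures like these, along with the assumptions behind them, are what we would hope to see in a draft design.
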
Work from first principles!

We are all naturally drawn to repeat prior art, biased for or against certain approaches, and conscious of our limited resources and the compromises they typically require.

As you work on your design, we encourage you to think deeply about the problems you are trying to solve and, leaving assumptions aside, imagine what a solution would look like without time, staffing or technology constraints.

If and when compromises are made, this will help us capture them and their rationale in a decision record, which can be used to inform changes in our infrastructure and offerings.


Starting a review

We track reviews through Phabricator. Click here to start a new design review ticket.

Frequently (Asked|Anticipated) Questions

What format do you expect for a data model?

You can document your data model in whatever format you feel best allows you to express your intent. An ERD is of course fine, but JSON/YAML works too. At this stage, we're most interested in seeing an abstract representation (as opposed to a working database schema), but if writing SQL/CQL DDL helps you visualize your data model, that is fine as well.
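
As one illustration of an abstract representation, the sketch below describes a single entity with Python dataclasses and prints it as JSON. Everything named here is hypothetical (the `page_edit_counts` entity, its attributes, and the `Attribute` and `Entity` helper classes), and this is not a required or preferred format, just one way to write types, keys and optionality down without committing to a storage engine.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Attribute:
    """One attribute of an entity, described abstractly (no storage engine implied)."""

    name: str
    type: str
    required: bool = True
    description: str = ""


@dataclass
class Entity:
    """A named set of attributes plus the attributes that identify one record."""

    name: str
    primary_key: list[str]
    attributes: list[Attribute] = field(default_factory=list)

    def missing_key_attributes(self) -> list[str]:
        """Return primary-key names that are not declared as attributes."""
        declared = {a.name for a in self.attributes}
        return [k for k in self.primary_key if k not in declared]


# Hypothetical example entity: daily edit counts per page.
page_edit_counts = Entity(
    name="page_edit_counts",
    primary_key=["wiki_id", "page_id", "day"],
    attributes=[
        Attribute("wiki_id", "string", description="Wiki the page belongs to"),
        Attribute("page_id", "integer", description="Page identifier"),
        Attribute("day", "date", description="Day being aggregated"),
        Attribute("edit_count", "integer", description="Edits made that day"),
        Attribute("last_updated", "timestamp", required=False,
                  description="When this row was last recomputed"),
    ],
)

# Basic consistency check: every key column must be a declared attribute.
assert page_edit_counts.missing_key_attributes() == []

# Emit the model as JSON so it can be pasted into a design document.
print(json.dumps(asdict(page_edit_counts), indent=2))
```

The same information could just as well be drawn as an ERD or written as YAML; what matters for the review is that types, keys, optional attributes and relationships are explicit.
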

Further reading