
User:EEvans (WMF)/Scratch/Design Review

From Wikitech
This draft document has since been published as SRE/Data_Persistence/Design_Review

Data Persistence Design Review

Ours is a culture of collaboration and review; no changeset makes it to production without review from one or more of our peers. What is true for code is also true for software and system design, and it is routine for SRE to review designs before they proceed to production. That said, timely review of data architectures is perhaps even more critical, because once implementation begins these decisions can quickly become entrenched and difficult or expensive to revisit. The Data Persistence Design Review should happen once the design is finalized and before work begins on an implementation.

Preparing

Preparation for a design review ought to be as simple as creating (and of course, documenting) the design. It is not our intention to burden teams with any unnecessary additional work. However, when working on a design it may be helpful to keep a few things in mind:

First Principles

As you work on your design, think deeply and abstractly about the problem(s) you are trying to solve and, wherever possible, start with First Principles. Be objective, and imagine what a solution would look like without time, staffing, or technological constraints. This can be a really difficult thing to do. We are all keenly aware of our limited resources and the compromises that are typically necessary. We are naturally drawn to prior art (our own, or existing systems in the organization), are biased toward or against certain approaches, and can believe there to be only one logical outcome based on perceived constraints. And you might be right! It's entirely possible (likely?) that some compromise will be necessary. It is important, however, that we capture these compromises in a decision of record, which can be used to inform changes in our infrastructure and offerings.

You don't have to go it alone!

The last thing anyone wants is for you to spend time on a design, only to find out later that you'd gone down the wrong path. Please reach out to Data Persistence early and often. Whether you just need a question answered or want more direct collaboration, we are here to help!

Asks

It's beyond the scope of a document like this to cover everything Data Persistence might inquire about during review, but the following should give you some idea of what to expect:

The data

What is the data? What is the “business case”? Any user stories?

Will this be a canonical data source (primary data), or data that is somehow generated or derived from a canonical source (aka secondary data)?

If it is secondary data, what is the canonical source (or sources) of data? What is the nature of the derivation? For example: Is it an alternative or generated form of the canonical content? An aggregation of some sort?

Is there any PII present in the dataset?

Data architecture

Is the data structured (for example, as objects with named attributes, arrays, etc.) or is it unstructured and opaque? Note: PDFs and images are examples of unstructured data; a JSON object serialized as bytes is not.

If structured, what is the data model? For this, we’re looking for documentation that fully describes the model, including types, constraints, and relationships.

What is the query model? How is the data accessed, how is it written/read?

What are the rate of change (churn rate), cache friendliness, and cardinality[1] of the data?

What sort of semantics, properties, or guarantees are required? Consistency, availability, isolation, atomicity, ...?

Do you have a data flow diagram?
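
To illustrate the structured/unstructured distinction drawn in the first question of this section, here is a minimal sketch in Python. The record, its field names, and the helper is_json_structured are made up for illustration; they are not part of any review tooling. It shows that a JSON object serialized as bytes still yields named attributes once decoded, whereas the leading bytes of an image do not.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"page_id": 42, "title": "Example", "tags": ["a", "b"]}

# Serialized as bytes, the JSON object is still structured: decoding it
# recovers the named attributes and their types.
serialized = json.dumps(record).encode("utf-8")
decoded = json.loads(serialized.decode("utf-8"))

# The 8-byte signature that begins every PNG file. Without an image
# decoder there are no attributes to query; to a datastore it is opaque.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def is_json_structured(payload: bytes) -> bool:
    """Return True if payload decodes to a JSON object or array."""
    try:
        value = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return isinstance(value, (dict, list))


print(is_json_structured(serialized))     # the JSON bytes
print(is_json_structured(png_signature))  # the image bytes
```

The check is deliberately narrow: it only recognizes JSON, and a real assessment would consider whatever encoding the data actually uses.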

Operational considerations

What SLO do you require?

Latency expectations?

Expected throughput?

Request rates?

Storage volume (current & projected)?

What systems will this one be dependent on? What will depend on this?

Lifecycle (think: retention policies, archival, deletion)?

What is your timeline? How long after completion of the review do you anticipate needing your storage resources?
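
For the storage volume question above, a back-of-the-envelope projection is often enough to start the conversation. The sketch below assumes a steady write rate, a fixed average record size, no deletions, and no compression; the function name, its parameters, and the figures in the example are all hypothetical.

```python
SECONDS_PER_DAY = 86_400


def projected_storage_bytes(
    current_bytes: int,
    writes_per_second: int,
    avg_record_bytes: int,
    days: int,
    replication_factor: int = 1,
) -> int:
    """Estimate raw storage after `days` of steady growth.

    Assumes nothing is deleted or compressed, and that every replica
    holds a full copy of the data.
    """
    new_bytes = writes_per_second * avg_record_bytes * days * SECONDS_PER_DAY
    return (current_bytes + new_bytes) * replication_factor


# Hypothetical workload: 50 GB today, 20 writes/s of ~2 kB records,
# projected one year out with three replicas.
estimate = projected_storage_bytes(
    current_bytes=50 * 10**9,
    writes_per_second=20,
    avg_record_bytes=2_000,
    days=365,
    replication_factor=3,
)
print(f"{estimate / 10**12:.2f} TB")
```

Retention policies and archival (the lifecycle question) would lower this figure, which is one reason the two questions are worth answering together.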

Starting a review

We use Phabricator to create the decision of record. Use <LINK> to start a new design review.

Frequently (Asked|Anticipated) Questions

Q: What format do you expect for a data model document?

You can document your data model in whatever format you feel best allows you to express your intent. ERD is of course fine, but JSON/YAML works too. At this stage, we're most interested in seeing an abstract representation (as opposed to a working database schema), but if writing SQL/CQL DDL helps you visualize your data model, that is fine as well.
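
As one more option alongside those formats, an abstract model can also be sketched in code. The example below uses Python dataclasses to express types, constraints, and a relationship; the Owner and Bookmark entities and all of their fields are hypothetical, chosen only to show the level of detail that is useful at this stage. It is not a required or preferred format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Owner:
    owner_id: int  # unique identifier; constraint: positive
    name: str      # constraint: 1 to 255 characters

    def __post_init__(self) -> None:
        if self.owner_id <= 0:
            raise ValueError("owner_id must be positive")
        if not 1 <= len(self.name) <= 255:
            raise ValueError("name must be 1 to 255 characters")


@dataclass(frozen=True)
class Bookmark:
    bookmark_id: int            # unique identifier
    owner_id: int               # relationship: many Bookmarks belong to one Owner
    url: str                    # constraint: non-empty
    tags: Tuple[str, ...] = ()  # constraint: no duplicate tags

    def __post_init__(self) -> None:
        if not self.url:
            raise ValueError("url must not be empty")
        if len(set(self.tags)) != len(self.tags):
            raise ValueError("tags must be unique")


owner = Owner(owner_id=1, name="example")
bookmark = Bookmark(
    bookmark_id=10,
    owner_id=owner.owner_id,
    url="https://example.org",
    tags=("reference", "storage"),
)
```

Whatever the format, the point is the same: each attribute has a stated type, each constraint is written down, and each relationship between entities is explicit.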

Q: Is all of this really necessary? The outcome seems so obvious!

Is it obvious, though? The best way to demonstrate the right (or obvious) solution to a problem is to "show your work". This is important not only for keeping your own presumptions in check, but also for making the answer obvious to others. With that said, if the solution is simple, straightforward, or obvious, demonstrating as much should be easy as well.

Further reading

References

  1. Also: Cardinality (data modeling)