User:Giuseppe Lavagetto/MicroServices


Disclaimer: I don't think we need to do all of what follows on day one. I'm just setting out the vision so that we can keep a straight line, and ideally get there iteratively and incrementally. Also, these are completely my own opinions and should be taken as such.

The long boring list of things we would need

  • Architectural principles:
    • Each microservice should fulfill one specific public-facing functionality
    • Such functionality should be as segregated as possible - so no common “horizontal” layer that other services have to call to access their own resources.
    • Each API should be exposable (if not exposed) to the public
    • Interaction should happen via a versioned API, which should be exposed via a discovery service.
  • Development
    • Services should be automatically deployed in a stable version to a reference environment that developers and QA can connect to. Teams are responsible for the stability of their services in said environment
    • A single service cannot depend on more than two other external services, and it should be possible to deploy those automatically on your local vagrant instance(s)
  • Monitoring and alerting:
    • Detailed and standardized performance monitoring, which requires zero or minimal setup for any new service
    • Continuous and clear ownership (and paging!) of individual services by dev teams
    • Tracking of each request across the cluster with a single “transaction ID” should be possible
    • Logging should have a standard format, include the transaction ID, and be collected centrally
  • Stability
    • Each microservice must be independent and mostly isolated
    • Failure should cause a graceful degradation of the user experience.
    • A strict SLA should be defined for every service, and failure to meet it is to be considered a service outage.
    • Each service should include accounting of resource usage from its clients, and have throttling mechanisms
  • Nice to haves
    • Automated one-button deploys for all services (yes, this means cluster-wide coordination).
    • At most three runtimes will be used in production to run our microservices. A personal preference would be NOT to have the JVM in the list of such runtimes
    • Good documentation of the flow of requests in the system, kept up to date at all times, and simple CLI client libraries that allow consuming the APIs

I will now go into more detail on some (or most) of these bullet points. This is going to be rather long :)

Architectural principles

Microservices make sense if they produce some output that could be directly useful to any generic (internal or external) user who wants to build an application on top of our data. This may seem like strong advice, but this principle has proven incredibly valuable to me in the past to avoid the risk of “service balkanization”, where you end up with tens of services that each do one specific thing and need to be chain-called to build anything vaguely interesting to the public. If one service, for instance, is able to expose a parsed version of the wikitext of a page, it can be useful to anyone building an application on top of Wikipedia. Ditto for image scaling, or for a PDF-generation service.

The advice to avoid building horizontal layers as much as possible is there so that we don't bloat the internal traffic (which at our scale may be scarily high) and message passing, which in general add the kind of latencies our users hate so much and that we fight every day. One notable exception to this is caching services, like RESTbase is going to be for parsoid - in that case RESTbase is used to speed up content consumption rather than creating an additional layer of indirection. If we think of it as a general-purpose proxy to a storage engine, I'd honestly be more cautious.

A corollary to all this is that every API should be exposed (or exposable, depending on security considerations) to the public directly, and be consumable by others as well. Welcome to the Wikipedia platform :)

As a final point, we want services to cooperate easily, and this means that they should interact via public interfaces and that they need a discovery/publication service in order to be able to locate each other dynamically in the cluster. Think of this as a "DNS for services", sort of. Distributing files around the cluster at every deploy is not that.
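
To make the "DNS for services" idea a bit more concrete, here is a minimal sketch of what a client-side lookup against such a discovery service could look like. Everything in it - the endpoint, the service name and the JSON shape of the answer - is a made-up assumption for illustration, not a description of anything we run today:

    import json
    import urllib.request

    # Hypothetical discovery endpoint; the real registry (etcd, a DNS-based
    # system, or something else entirely) would look different.
    DISCOVERY_URL = "http://discovery.example.internal/v1/services/{name}"

    def locate(service_name, timeout=2):
        """Return (host, port) for a service, as advertised by the registry."""
        url = DISCOVERY_URL.format(name=service_name)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            record = json.load(resp)
        # Assume the registry answers with something like:
        # {"host": "mathoid1001.example.internal", "port": 10042, "version": "v1"}
        return record["host"], record["port"]

    # A caller then builds its requests against the versioned public API:
    # host, port = locate("mathoid")
    # api_root = "http://%s:%d/v1/" % (host, port)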

Development

I have worked in environments with a myriad of services that were produced seamlessly, but where no team was in charge of ensuring that its services were going to be easily deployable and configurable by anyone else. The net result was that ops had to set up a centralized “reference” dev environment that everyone connected to for any service not actively developed by their own team at the moment. It was a pain for both developers and ops: some service was almost always unstable, down or misconfigured, and the resulting time losses were distressing for everybody.

Given our development model, with developers distributed across different places and timezones, and with volunteers doing some non-trivial work, such a situation would be catastrophic.

A good rule would thus be that deployment to a reference infrastructure (what currently and infamously is beta) should be guaranteed by each deploying team individually, and that connecting to services in such a cluster should be easy to configure (again, using a registration/discovery service). Also, since we don't want developers to waste their time and interrupt their workflow every time this reference environment is unavailable or unstable, all “upstream” services needed to run a specific service should be deployable automatically on the developer's workstation, preferably via vagrant (as we do currently with mediawiki-vagrant). The number of “upstream” services thus cannot be big, or we'd be asking developers to use supercomputers - this also works well with the suggestion of avoiding horizontal layers.

Let's say that ideally a service doesn't need any other service to work in a dev environment, and that no more than two should ever be needed. Also, remember that any service that works on persistent data will need a sample dataset as well, or at least provide some mock service.
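
To illustrate what “provide some mock service” could mean in practice, here is a toy sketch of a mock backed by canned sample data, assuming a trivial HTTP/JSON interface (the paths and data are invented for the example):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Canned sample data standing in for the real persistent store.
    SAMPLE_PAGES = {
        "/pages/Main_Page": {"title": "Main Page", "length": 1234},
    }

    class MockPageService(BaseHTTPRequestHandler):
        """Serves a handful of canned responses so that a dependent service
        can be developed and tested without the real backend."""

        def do_GET(self):
            page = SAMPLE_PAGES.get(self.path)
            body = json.dumps(page if page else {"error": "not found"}).encode()
            self.send_response(200 if page else 404)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), MockPageService).serve_forever()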

Monitoring

When building microservices, monitoring becomes even more crucial than it is in general for any web site. Here is why:

when the number of services increases, so does the complexity of the whole architecture, and debugging an issue becomes factorially more complex. You can't rely anymore on the ability of some super smart veteran engineer to debug the application stack, or on seasoned ops engineers' ability to find the root cause. A failure may present itself in functionality A because of a subtle failure in a seemingly loosely related system, and performance degradation can be difficult or even impossible to track down without spending hours on it. So you need a rock-solid, standardized monitoring infrastructure that will allow teams AND ops to track down problems to the single service quickly.

This means that ANY functionality should be monitored, and it should be possible to automate the process of configuring, collecting and alerting on those metrics. If this reminds you of the “your monitoring and QA become indistinguishable” quote from Steve Yegge's famous rant on SOA at Amazon[1], it's because I learned this lesson the hard way.
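
As an illustration of what “zero or minimal setup” could look like, here is a sketch that bakes metric emission into a decorator any service handler can use. It assumes a statsd-compatible collector listening on UDP; the host, port and metric names are placeholders:

    import functools
    import socket
    import time

    # Assumed statsd-compatible collector; host and port are placeholders.
    STATSD_ADDR = ("statsd.example.internal", 8125)
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _emit(metric):
        """Send a single metric line in the statsd plaintext format."""
        _sock.sendto(metric.encode(), STATSD_ADDR)

    def monitored(name):
        """Decorator: every call to the wrapped handler produces a timing
        metric and, on exception, an error counter - no per-service setup."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.time()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    _emit("%s.errors:1|c" % name)
                    raise
                finally:
                    _emit("%s.latency:%d|ms" % (name, (time.time() - start) * 1000))
            return wrapper
        return decorator

    @monitored("mathoid.render")   # hypothetical metric name
    def render(request):
        ...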

Also, a global “transaction ID” should be attached to any request from a user, propagated along in the cluster, and logged. This is going to be pivotal in chasing down a failure across hundreds of servers and even more application containers. We should also decide on, and stick to, a logging model and format that is as global as practically possible - at the very least it should include the transaction ID, and ideally carry the same semantic information everywhere. Finally, we need that information to be globally searchable, which means we should both improve our logging (and weed out any non-useful message) and improve our logstash setup.
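
To make this concrete, here is a rough sketch of how an ID could be minted at the edge, logged in a uniform machine-parsable format and forwarded to downstream services. The X-Transaction-Id header name, the service name and the JSON log format are assumptions for the sake of the example, not an agreed standard:

    import json
    import logging
    import sys
    import uuid

    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("myservice")  # hypothetical service name

    def get_transaction_id(headers):
        """Reuse the transaction ID set at the edge, or mint one if we are
        the first hop (the header name is an assumption)."""
        return headers.get("X-Transaction-Id") or str(uuid.uuid4())

    def log_event(txid, message, **fields):
        """Emit a log line in a single standard format that always carries
        the transaction ID, so logstash can tie the hops together."""
        fields.update({"txid": txid, "msg": message, "service": "myservice"})
        log.info(json.dumps(fields))

    def call_downstream(txid, make_request):
        """Whatever HTTP client is used, the transaction ID must be forwarded."""
        return make_request(headers={"X-Transaction-Id": txid})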

As part of this effort, teams must own their services from the top down, and should be the first level of on-call paging for their own services - this is, by the way, already happening at least for parsoid, so we're on a not-so-bad track here.

Stability

Being an opsen, of course I left stability last to make it stand out!

Microservices can be either very good for the stability of a web site, or a trainwreck. I don't think there is any possibility in the middle. If we manage to make services vertical, so that each serves one specific purpose and one service being unavailable doesn't compromise the overall usability of the site, this is probably good for the overall stability of our sites.

Of course this means that services should be able to operate when another service is unresponsive, by setting strict timeouts and handling failures sensibly every time there is any sort of dependency. This is not easy, but the alternative is an ops nightmare in which the total uptime of the system is obtained by subtracting the individual downtimes of each service.
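
As a sketch of what “strict timeouts and sensible handling” can look like in code (the URL, timeout and fallback value are obviously placeholders):

    import urllib.error
    import urllib.request

    def fetch_with_fallback(url, fallback, timeout=0.5):
        """Call a dependency with a strict timeout; if it misbehaves, degrade
        gracefully instead of dragging the whole request down with it."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            # The dependency is slow or down: serve a degraded result
            # (cached copy, placeholder, feature switched off) and move on.
            return fallback

    # e.g. render the page without the "related articles" box rather than
    # making the whole page wait for an unresponsive recommendations service:
    # related = fetch_with_fallback(
    #     "http://recommend.example.internal/v1/related?page=Foo", b"[]")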

When I speak of downtime, I don't mean only “the user received an error page”, but also “the page took 15 seconds to load instead of the usual 0.5 s”. In this regard, it's very important that everyone owns the importance of the performance of their services, and that a service level agreement is defined and properly monitored. Failure to meet that standard (e.g. “the 95th percentile of the rendering time for math formulas should be below 10 ms”) should be treated as an outage for that service - this is the only way to make performance equally important for everyone and to keep it under control. Otherwise, there is the concrete possibility that services in “maintenance mode” will lie around forever in a state of dodgy performance, because we all love to work on new things, and the overall performance of the site will suffer for it.
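
For illustration, checking such an SLA boils down to something like the following, run against the latency samples the monitoring pipeline already collects (the threshold is just the example from above, and a real check would page the owning team when it fails):

    def percentile(samples, pct):
        """Nearest-rank percentile of a list of latency samples."""
        ordered = sorted(samples)
        index = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
        return ordered[index]

    def check_sla(latencies_ms, threshold_ms=10, pct=95):
        """True if the service is within its SLA, e.g. "the 95th percentile
        of math rendering time must stay below 10 ms"."""
        return percentile(latencies_ms, pct) <= threshold_ms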

Lastly, every service will need to be able to throttle its consumers, to prevent situations where one service consumes all the resources of another. This means that both clients and servers should be able to handle throttling.
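
One common way to implement such throttling is a per-consumer token bucket; here is a minimal sketch, with invented rates and without the persistence or distribution a real deployment would need:

    import time

    class TokenBucket:
        """Classic token bucket: each client gets `rate` requests per second,
        with bursts of up to `capacity`. Callers that run dry get throttled
        (e.g. answered with HTTP 429) instead of eating the whole service."""

        def __init__(self, rate, capacity):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.updated = time.time()

        def allow(self):
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # One bucket per consumer (keyed on client identity), checked on every request:
    # buckets.setdefault(client_id, TokenBucket(rate=50, capacity=100)).allow()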

References

  1. ↑ Here is his rant - my favourite quote "There was lots of existing documentation and lore about SOAs, but at Amazon's vast scale it was about as useful as telling Indiana Jones to look both ways before crossing the street"