SessionService/Draft

Goals

  • Support authenticated reads in all active datacenters, so that we can improve performance by distributing load & handling client requests in the geographically closest datacenter.
  • Provide the highest possible level of availability that satisfies the consistency requirements.

Requirements

Availability / Fault-tolerance

The SessionService storage system must be resilient to wholesale machine outages without resulting in user-facing impact, or requiring immediate operator intervention. These properties must hold true both within a given data-center, and across all active sites. Active in this context means all datacenters that are receiving client requests. In the case of a permanent DC outage or maintenance work, requests can be drained from the DC, and the DC is no longer considered active for the purpose of this discussion.

Consistency

Session creation (logins)

After a successful login (a POST to the primary DC), the session must be available in all active DCs for subsequent GETs (read-your-write consistency). At the very least, the session needs to be available in the DC the client will send subsequent GETs to.

Session auto-creations

When a user has a valid "remember me" cookie token but the session has expired, the session is automatically recreated. This can largely be handled with eventual consistency: if the user ends up on a different DC before replication completes, the same logic will trigger and recreate the session there, and the last write wins. This might force the user to resubmit a form they were on, though, because the session token changes once the session is replaced by the newer write. To avoid this, read-your-writes consistency should be attempted on a best-effort basis.
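
A minimal sketch of this auto-creation path, assuming a hypothetical key-value store and "remember me" token store interface (illustrative names only, not MediaWiki's actual session code):

    import time
    import secrets

    SESSION_TTL = 3600  # seconds; illustrative value only

    def get_or_recreate_session(store, token_store, session_id, remember_me_token):
        session = store.get(session_id)
        if session is not None:
            return session
        # Session expired, or not yet replicated to this DC: recreate it from
        # the "remember me" token. If another DC does the same before
        # replication, last-write-wins applies.
        user = token_store.validate(remember_me_token)
        if user is None:
            return None
        session = {"user": user, "created": time.time(), "token": secrets.token_hex(16)}
        store.set(session_id, session, ttl=SESSION_TTL)
        return session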

Session reads

Reads must be available in all active DCs. Latency and availability take priority over strict cross-DC consistency for reads; consistency guarantees will be enforced during writes/replication as needed.

Session timestamp updates

These updates are asynchronous, and are only performed once more than half of the session lifetime has elapsed. A failure to update the session in one or more places should not be user-visible. Inconsistency caused by outages should be resolved quickly and reliably. Session deletions should eventually overrule these session renewals, even if the renewal happens in one DC, the delete in another, and the network is partitioned.
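
As a sketch of the renewal rule, assuming the same hypothetical store interface and an illustrative TTL (not actual MediaWiki code):

    import time

    SESSION_TTL = 3600  # seconds; illustrative value only

    def maybe_renew(store, session_id, session):
        # Only refresh the expiry once more than half of the lifetime has
        # elapsed; in practice this runs asynchronously, off the request path.
        if time.time() - session["created"] > SESSION_TTL / 2:
            session["created"] = time.time()
            store.set(session_id, session, ttl=SESSION_TTL)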

Session deletion (logouts)

When deleting sessions on logout, read-your-write consistency is a requirement across active data-centers, as well as within them[1][2]. If this requirement cannot be guaranteed (for example, if the remote DC is unreachable, or there are not enough replicas available to achieve quorum), then the operation must be exposed to the client as a failure, though the storage system should still relay the delete in an eventually consistent manner. We can notify users that they should clear their browser cookies on such failures.
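
A storage-agnostic sketch of this logout behavior, using a hypothetical storage interface and retry queue (placeholder names, not an agreed API): attempt a strongly consistent cross-DC delete, surface any failure to the client, but still hand the delete to an eventually consistent path.

    class QuorumNotReached(Exception):
        """Raised by the hypothetical storage layer when quorum cannot be met."""

    def logout(storage, retry_queue, session_key):
        try:
            storage.delete(session_key, consistency="each-quorum")
            return True   # the client sees a successful logout
        except QuorumNotReached:
            # Still relay the delete eventually, but report the failure so the
            # client can be told to clear its cookies manually.
            retry_queue.put(("delete", session_key))
            return False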

Performance

Storage for the SessionService must be capable of delivering a throughput of 9000 reads/second and 100 writes/second. Average latency must be no more than 25 ms, and p99 latency should be no greater than 250 ms[3].

Operations

Staffing/expertise

Reliability

Total cost of ownership

Candidates

MySQL

Description

  1. Each data-center would contain N MySQL servers, each configured as a master that in turn acts as a slave to all of the others (multisync)
  2. Several HAProxy instances in a redundant setup would be used for load-balancing and fail-over of client connections to the pool of DB servers
    1. For each backend, HAProxy monitors replication lag from local DC replicas, and disables servers that are lagging too severely. Replication lag from remote replicas could be ignored (depending on the model chosen, see below). If all servers lag, we can choose to ignore the check and continue providing service in a degraded, split-brain mode
  3. As there is no sharding involved, this model only stops providing service if all nodes crash, so its redundancy is equal to the number of nodes deployed
  4. Within each data-center, each server would do semisync replication to some number of its slaves (blocking successful writes unless/until that number of slaves has acknowledged the write):
    1. Can be none: for performance over consistency/security, but then the application has to handle the inconsistency
    2. Can be N within the datacenter: to at least avoid data loss, without waiting on a cross-datacenter round-trip
    3. Can be "number of servers within a datacenter + 1": to avoid data loss on datacenter loss
    4. Can be all: for full consistency among all servers, with a higher write penalty

See also: https://github.com/jynus/pysessions
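
As an illustration of item 4 of the description above, a sketch of how the semisync acknowledgement count could be set on a master, assuming the semisync plugin is installed and using MySQL 5.7 variable names (host, credentials and counts are placeholders):

    import pymysql

    conn = pymysql.connect(host="db-master.example", user="admin",
                           password="secret", autocommit=True)
    with conn.cursor() as cur:
        # Block COMMIT until 2 local slaves have acknowledged the write,
        # falling back to asynchronous replication after a 1 second timeout.
        cur.execute("SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2")
        cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = 1000")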

Questions

  • How many replicas would exist in each data-center, and what slave count would be used for the semisync replication?
    • Whatever we decide works best for us on the performance <-> consistency scale, which depends on clarifying the needs first. I am all for making total nodes == semisync nodes.
  • How is conflict resolution managed on reads? What prevents stale results from being returned from asynchronously replicated slaves? For example, consider a configuration with 3 replicas (A, B, and C) where we perform sync replication to one host, and async replication to the third. If a write lands on A and is synchronously replicated to B, what prevents a racing client from reading the stale value at C before replication completes?
    • Writes do not fail, so last-write-wins is resolved on reads. If perfect consistency is needed, given the low resources it takes for a single node to handle all the load, we can redirect all writes to a single node through the proxy (or, if we add sharding, to a single set of nodes), and only change that on failover. Timeouts on failover will prevent reading stale data (again, all of that is configurable on the performance <--> consistency scale. Should a hardware failure happen, should the service time out, or should it continue serving potentially stale data?).
  • In the prototype, the dataset is sharded across 256 tables; why so?
    • First: practicality. For example, 256 tables of 1 GB each are easier to handle than one 256 GB table; as an op, I am always thinking of backups, cloning, etc.
    • Second: proof of concept for sharding. It takes one extra line of code to maintain 4 open connections and, based on the key hash, send the keys to 4 separate servers (see the sketch after this list). Even if it is not needed now, we do not want to have to reshard in the future; adding that functionality now keeps it possible later.
    • Third: I merely emulated the setup on parsercache/external storage to show how it is done there.
    • Fourth: having multiple tables reduces query contention in the latest versions of MySQL, which have partitioned buffer pools, partitioned dirty-page lists, etc. This only shows up at throughputs of around 1 million queries/s and really only needs 4 tables, so it may be overhead when the bottleneck in the example is the HTTP handling, but I wanted to add it in case we can get rid of the HTTP overhead with better coding in the future.
  • Can we support session availability during DC outages, but enforce strong logout consistency across DCs?
    • Yes, modify the code to switch between sync/async mode as needed; if we could not do that for any reason, the application can connect to each individual server and clear its data synchronously. If we do not want to handle complex patterns in the application, we can substitute HAProxy with a more intelligent proxy that would do it automatically.
  • Is HAProxy replication lag monitoring sufficient for providing read-your-writes consistency in the local DC?
    • It is as sufficient as you want to make it; anything from a simple "lag > 1 second" check to a thorough check of the GTID positions of the slaves. If that is not enough, divide your writes into shards (the number of shards can be 1), and only write to and read from the "master" of each shard. The master would be controlled by a separate "haproxy" service and fail over automatically if it fails. The premise is false here, though: why would you want consistency only for the local DC, but inconsistent reads in the remote DC? Shouldn't it be both?
      • To try to summarize my understanding of your proposals, here is a list of options that seem to be closest to the requirements:
        • Strictly sync (no transparent fallback to async) replication between all nodes in all DCs; reads from any node. Issue: Availability.
          • Possible alternative to fix availability: Require global quorum & read with global quorum. Problem: Expensive cross-DC reads.
        • Async (or semi-sync) local replication, all reads in main DC from write master (per shard); sync remote replication, reads from any replica node per shard in remote DC.
          • Issue: Cross-DC availability & performance.
          • Uses a single master for all reads / writes in primary DC.
        • Add replication offsets to session cookies, and use that information to guarantee the use of up-to-date replicas in HAProxy (similar to ChronologyProtector).
          • Only works for primary domain, as session cookies are not updated for other domains.
          • Requires changes to our authentication system.
        • Implementation of quorum writes / reads at the app level, directly talking to individual MySQL replicas.
          • Non-trivial.
        • Semi-sync replication locally / remote, plus replication lag (time) based depooling in HAProxy
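
The sketch below illustrates the key-hash sharding discussed above (256 tables, optionally spread over several servers), in the spirit of the pysessions prototype; the table naming and server count are illustrative, not the prototype's actual code.

    import hashlib

    NUM_SERVERS = 4  # hypothetical; a single server (or set) works the same way

    def shard_for(session_key):
        digest = hashlib.md5(session_key.encode("utf-8")).digest()
        table = "sessions_%02x" % digest[0]   # first byte picks one of 256 tables
        server = digest[1] % NUM_SERVERS      # second byte picks a connection pool
        return server, table

    # Example: route a read for a given session key.
    server_index, table_name = shard_for("some-session-key")
    query = "SELECT value FROM %s WHERE key_name = %%s" % table_name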

Cassandra

Description

  1. Each data-center would contain 3 Cassandra nodes, and each data-center would be configured for a replication count of 3[4]
  2. Cassandra drivers (by convention) perform auto-discovery of nodes, distribute requests across available nodes, and de-pool in the event of failure. Any node can serve any request (coordinating if necessary).
  3. Cassandra employs tunable consistency to determine how (a)synchronous replication is, or how many copies to consider for conflict resolution on a read. If we assume read-your-write guarantees, a per-site replication factor of 3 permits both low-latency quorum reads/writes within a data-center, and either global or "each" quorums when cross-site consistency is required. In detail, we would use:
    • Session creation (login) and deletion (logout): EACH_QUORUM (quorum in each DC) if both DCs are up, LOCAL_QUORUM if the secondary DC is down for an extended period. The switch could potentially be automated by polling DC status information from etcd.
    • Session reads & updates: LOCAL_QUORUM; could consider relaxing this further for writes.
    • Conflict resolution / last-write wins is handled automatically by Cassandra, including update vs. delete.
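
A minimal sketch of these per-operation consistency levels using the DataStax Python driver; the contact point, keyspace and table schema are illustrative assumptions, not an agreed design.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["cassandra-seed.example"]).connect("sessions")  # hypothetical node/keyspace

    # Login/logout: require a quorum in each DC while both DCs are healthy.
    write = SimpleStatement(
        "INSERT INTO session (key, value) VALUES (%s, %s) USING TTL 3600",
        consistency_level=ConsistencyLevel.EACH_QUORUM)
    session.execute(write, ("session-key", "serialized-session-blob"))

    # Reads and timestamp updates: stay within the local DC.
    read = SimpleStatement(
        "SELECT value FROM session WHERE key = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    row = session.execute(read, ("session-key",)).one()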

Questions

  • How do you handle consistency of reads? You say you can tune your writes to have a redundancy of 3 so they are not lost (the same model as semisync replication), but would you tune your reads to 3 too? That would be terribly slow (a write will take ages to be available for reads on a remote node), and so would a local quorum. I would like to see a performance comparison with 3 nodes with both read and write consistency set to 3, where the hosts are on separate racks/rows. The next question clarifies why I am worried about performance.
    • A (Gabriel): Quorum writes/reads guarantee read-your-writes, while semi-sync replication does not when it falls back to asynchronous replication, unless slow nodes are immediately depooled (before returning success).
  • What is your take on the latency spikes that RESTBase/Cassandra show on the REST API? I assume that uses a lower consistency level (I hope so; as a cache system, it really does not need a high level of consistency), please correct me if I am wrong. I also assume that higher consistency carries a higher latency penalty (which is, of course, understandable). However, even at p50 (which is a bad measure of overall latency, but anyway), there are frequent spikes of 1.3 seconds, which means that for some time 50% of the POST requests took more than 1.3 seconds, and that is without taking into account the client overhead for the extra HTTP handling. If we switch to the 95th percentile, some requests (the 5%) take more than 25 seconds to complete! Are 5% of the users going to wait 25 seconds plus HTTP overhead for their session to start? The main cluster does not have those issues; and in particular MySQL doesn't, because as a mature technology it will be faster or slower, but InnoDB provides predictable latency. It seems to me as if Cassandra were a good solution for a large volume of (write?) requests, but not for low latency, which, together with consistency, is key for session storage. Please comment on this.
    • POST requests are primarily served by backend services like Parsoid. To gauge storage performance, look at storage metrics like these.
  • How do you handle consistency between datacenters? I ask again because, from your previous answer, it seems those cases will need application logic (I have nothing against that; in fact, I think it is a good thing). However, the main point of using a cluster is to let it do things transparently without exposing its internals to the users (programmers). Once we have automatic failover and a conflict-free write-anywhere solution with no data loss, if consistency still has to be handled by the application, what is the real advantage of one solution over the other?
    • Cassandra lets you select local or cross-DC consistency levels per query. For example, you can require a quorum in each DC. In either case, writes go to all replicas in parallel; the consistency level determines when a query is considered successful. If the requested consistency level is not reached within a timeout, the query can either be retried (this behavior can be parametrized with a RetryStrategy in the driver; see the sketch after this list), or an error is returned.
  • Who in ops is going to be in charge of maintaining the Cassandra nodes? If this were a MySQL-based solution, we have 3 people (2 DBAs and a former DBA) to attend to your needs. An HAProxy-based solution (not necessarily HAProxy, but you get the idea) is already working and probably standardized for our setups, combined with the already-standard LVS. Please do not say "it is not my problem"; standardizing and keeping a uniform setup should be everybody's goal. We "abuse" technologies all the time to keep costs (in the TCO sense) low. Also, not having operations involved usually means unpuppetized messes (I am not referring to Services or Security in particular here).
    • The maintenance responsibility would be the same as for the RESTBase cluster, with Filippo & Eric being the primary maintainers.
  • How many machines/what specs are you aiming for? My 4000 HTTP requests/s were on a literally 5-year-old server with HD disks scheduled for decommission (as most things, except writes, will be happening in memory). For MySQL we could do with only 2 machines per DC, in an active-passive or an active-active configuration. Proxies would live with the application (very low footprint) or with the LVSs, if we create an entry point there. How is the application itself going to be load-balanced/made redundant?
    • 3 smallish boxes. Normal LVS load balancing for the service.
  • What is the real reason you want to change technology, instead of setting up the service with the separation needed (according to the ticket) on top of the existing Redis and MySQL (I am not talking about the MW LoadBalancer, I am talking about the storage itself)? Even if that were needed, why not start by setting up the separate service first and then migrating the storage? What is the migration plan? What is the plan, in case the service doesn't work, for moving existing data back to Redis or any other storage without affecting the service itself? Would you say that rewriting everything from scratch and substituting the technology all at once is an advantage?
    • I imagine we would send the same operations to both Redis and whatever the new service is, then switch reads at some point (and switch back if it did not go well). Unless the new service is also Redis, I don't think we have any other option. If something goes really bad, discarding all session data would log all users out; that's bad but not tragic. --tgr (talk) 08:08, 13 September 2016 (UTC)
  • What is your take on the complexity of MediaWiki's authentication systems, and what has Services'/Security's involvement been in helping with recent MediaWiki development "issues", such as the CentralAuth issues, that gives you the experience to do better on your own with those needs (authentication/sessions)? Since it <<makes sense to handle [password authentication, sessions, checkuser information] in a single, firewalled "UserInfo" service>>[5], would you say that after migrating sessions to Cassandra it is clear those services are bound to be migrated there too?
  • Do you think you are ready to implement transactions on top of a distributed, session-less REST API?[6]
    • If the requirement is to not allow any access to password hashes even if a MediaWiki box is breached, I don't see an alternative. (It wouldn't exactly be transactions, just a data schema where data remains in an invalid state throughout intermediary steps. It is still a lot less nice than having transactions handled by the DB engine, but you need some intermediary application logic which can do hash checks but is unaffected by MW box breaches, and that means not relying on DB-level transactions. Anyway, this is not really relevant to the session service.) --tgr (talk) 08:08, 13 September 2016 (UTC)
  • What is your plan to mitigate the extra overhead of having internal REST APIs instead of binary ones, which AFAIK is the #1 reason for this migration? How do you plan to integrate those APIs with MediaWiki, given that MediaWiki lacks proper TLS support as a client on WMF infrastructure (e.g. lack of a proper CA, etc.)?
  • How much user service interruption/data loss (e.g. "all sessions lost, users have to log in again", "users could not log in for X minutes", "users could not log out") does your team plan to accept at a maximum before abandoning the migration? (Please, I do not care about bugs, those happen all the time on all projects, even on the current setup; I only care about actual user impact caused by setting up a new service.) Please provide a concrete number for the expected and the maximum. Please be very conservative, I won't block if the number is high (underpromise, overdeliver).
  • What is your answer to the "administration complexity" complaints about Cassandra, as raised by neutral departments, versus a simpler proxying + replication MySQL setup? Is the administration overhead of a cluster taken into account in the cost of the migration? What about issues happening due to memory pressure/incidents due to memory garbage collection? How will those be avoided, compared to already existing setups that suffer from them? How does Cassandra fit in a world where every large player has created its own layer on top of simpler engines to have proper control of the consistency model?
  • Can you prepare a proof-of-concept setup in a more production-like environment, rather than a single node on a laptop[7]? Maybe you could replace the low-impact parser cache on its current hardware first, rather than a critical system, to prove your point? (This was suggested at some point by Performance.)
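
Regarding the retry/fallback question above, here is a sketch of a manual fallback from EACH_QUORUM to LOCAL_QUORUM on the logout path, again assuming the DataStax Python driver and a hypothetical schema; a driver-level retry policy could be used instead of the explicit exception handling shown here.

    from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["cassandra-seed.example"]).connect("sessions")  # hypothetical node/keyspace

    def delete_session(key):
        stmt = SimpleStatement("DELETE FROM session WHERE key = %s",
                               consistency_level=ConsistencyLevel.EACH_QUORUM)
        try:
            session.execute(stmt, (key,))
            return True
        except (Unavailable, WriteTimeout):
            # Remote DC unreachable or quorum not met: report the failure to the
            # client, but still issue a local delete that replicates eventually.
            stmt.consistency_level = ConsistencyLevel.LOCAL_QUORUM
            session.execute(stmt, (key,))
            return False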

Footnotes

  1. "For now, reliable server-side session invalidation is a hard requirement, in both single master and multi-DC operation. Were we using something like JWT for the session cookie value, then that requirement could be much more relaxed, but I think we're a number of circuitous discussions away from being able to implement signed client-side tokens." -- Darian Patrick
  2. We set cookies across domains, and only clear one of them on logout. Logouts are POSTs, so will hit the primary DC. In an active/active read setup, subsequent reads have a good chance of hitting a non-primary DC. In the case of a network partition or primary DC failure at an inopportune time, this means that users who logged out would in fact not be logged out until the primary DC recovers / the network partition is resolved. This would be a security regression over the guarantees we provide in the current primary-only operation.
  3. Single-DC operations
  4. Cassandra's rack-awareness ensures that replicas are distributed across racks (or rows, as the case may be). This isn't terribly useful when both node and replica count are 3 as specified here, but it will matter if it ever becomes necessary to expand the cluster.
  5. https://phabricator.wikimedia.org/T140813
  6. https://phabricator.wikimedia.org/T140813#2555945
  7. http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra

Glossary

Read-your-write consistency

For the purposes of this document, read-your-write consistency is defined to mean that from the perspective of a discrete client, a read which follows a write is guaranteed to return a value at least as current as the one written.

Stale reads

The assumed data model is that of a versioned key-value store, where versions are monotonically increasing (timestamps in all likelihood). For purposes of this document, a stale (or old, outdated, etc) read is one that has a version strictly less than current, from the perspective of a discrete client.