Machine Learning/LiftWing/Cassandra

From Wikitech

Summary

This page is a guide to setting up a Cassandra instance and connecting to it from a LiftWing service. It is based on experience developing the Revise Tone Task Generator.

Creating Cassandra resources

Cassandra at the WMF

Our Cassandra clusters within WMF are owned and managed by the Data Persistence Team.

Although there has been past work on hosting Cassandra clusters ourselves within the ML team, it was never fully developed or deployed to production.

To avoid additional development and maintenance costs, we've decided that collaboration with the Data Persistence Team is the best way to work with Cassandra. It also allows us to:

  1. Use the expertise and knowledge of the Data Persistence Team in discussing the solution and designing the Cassandra table(s)
  2. Focus on ML development and take advantage of the managed Cassandra clusters

Design Review Process

The Data Persistence Team requires a design review before anything is deployed on the managed Cassandra clusters. You can see examples of such design reviews here:

Once the design is approved, you should expect a few things to be deployed:

Connecting from LiftWing

Inference-Services

Connecting to Cassandra from Python

We use the cassandra-python-driver as the Cassandra driver in Python.

In the inference-services repository, you can find a BaseCassandraCache class implemented within shared Python code. This class parses the environment variables and uses them to set up a connection to the Cassandra cluster.

To use Cassandra within your InferenceService, implement a class that inherits from BaseCassandraCache and implements the from_cache, to_cache and remove_from_cache methods. See the Revise Tone implementation for an example.
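The shape of such a subclass can be sketched as follows. This is only an illustration: BaseCacheStandIn is a hypothetical stand-in for BaseCassandraCache, the method signatures are assumptions, and a dict replaces the Cassandra table so the sketch runs without a cluster. Check the shared code in inference-services for the real signatures.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class BaseCacheStandIn(ABC):
    """Hypothetical stand-in for BaseCassandraCache (real signatures may differ)."""

    @abstractmethod
    def from_cache(self, key: str) -> Optional[dict]:
        """Return the cached value for key, or None if there is no entry."""

    @abstractmethod
    def to_cache(self, key: str, value: dict) -> None:
        """Store value under key."""

    @abstractmethod
    def remove_from_cache(self, key: str) -> None:
        """Delete the entry for key, if any."""


class InMemoryExampleCache(BaseCacheStandIn):
    """Implements the three methods against a dict instead of a Cassandra table."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def from_cache(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def to_cache(self, key: str, value: dict) -> None:
        self._rows[key] = value

    def remove_from_cache(self, key: str) -> None:
        # Removing a missing key is treated as a no-op.
        self._rows.pop(key, None)


if __name__ == "__main__":
    cache = InMemoryExampleCache()
    cache.to_cache("rev-1", {"score": 0.9})
    print(cache.from_cache("rev-1"))
    cache.remove_from_cache("rev-1")
    print(cache.from_cache("rev-1"))
```

In a real service the three methods would issue reads, writes and deletes against the table agreed in the design review.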

When implementing writes, you should make sure that:

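  • You use ConsistencyLevel.LOCAL_QUORUM, so that each write is acknowledged by a quorum of replicas in the local datacenter. This gives higher consistency without waiting on cross-datacenter communication.
  • You use BatchQuery when writing multiple rows, instead of issuing a separate write for each row in a loop.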
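The two points above can be sketched with the driver's cqlengine API. This is a minimal sketch, not the Revise Tone implementation: write_rows is a hypothetical helper, and model is assumed to be a cqlengine Model subclass whose table already exists and for which a connection has been set up.

```python
from typing import Iterable, Mapping


def write_rows(model, rows: Iterable[Mapping[str, object]]) -> int:
    """Write all rows in one batch at LOCAL_QUORUM and return the row count.

    `model` is assumed to be a cassandra.cqlengine Model subclass
    (hypothetical here); each row maps column names to values.
    """
    # Imported inside the function so this module can still be imported
    # where the driver is not installed, for example in unit tests.
    from cassandra import ConsistencyLevel
    from cassandra.cqlengine.query import BatchQuery

    count = 0
    # Statements are collected in the batch and sent together when the
    # `with` block exits without an exception.
    with BatchQuery(consistency=ConsistencyLevel.LOCAL_QUORUM) as batch:
        for row in rows:
            model.batch(batch).create(**row)
            count += 1
    return count
```

A call might look like write_rows(MyModel, [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}]), where MyModel and its columns are placeholders for your own table definition.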

How to test the service with Cassandra locally

To test your service locally, create a docker-compose file that deploys your service alongside Cassandra. You can learn more about this setup from the Revise Tone Task Generator README.md and docker-compose.yml files in Gerrit.

Deployment Charts

Deploying Secrets with Cassandra credentials

...

Configuring environment variables for InferenceService

The credentials CASSANDRA_PASSWORD and CASSANDRA_USER should be loaded from the Secrets deployed previously with the help of an SRE. See the revise-tone-task-generator deployment for an example of loading environment variables from Secrets.

In the production environment, make sure that CASSANDRA_DATACENTER matches the cluster you're deploying on. This ensures that your service communicates with Cassandra servers within its own datacenter, which avoids cross-datacenter latency. You therefore need separate values-ml-serve-eqiad.yaml and values-ml-serve-codfw.yaml files, each setting the appropriate datacenter.

For CASSANDRA_TTL, you can set 0 to make the entries non-expiring.
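As an illustration of how a service might read these variables, here is a minimal sketch using only the standard library. It is not the BaseCassandraCache code: CassandraSettings and load_settings are hypothetical names, defaulting CASSANDRA_TTL to 0 when it is unset is an assumption, and the datacenter value in the usage example is a placeholder.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("CASSANDRA_USER", "CASSANDRA_PASSWORD", "CASSANDRA_DATACENTER")


@dataclass(frozen=True)
class CassandraSettings:
    """Hypothetical container for the values read from the environment."""

    user: str
    password: str
    datacenter: str
    ttl_seconds: int  # 0 means entries never expire

    @property
    def entries_expire(self) -> bool:
        return self.ttl_seconds > 0


def load_settings(env: Optional[Mapping[str, str]] = None) -> CassandraSettings:
    """Read the Cassandra settings, failing early if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Assumption: an unset CASSANDRA_TTL is treated like 0 (non-expiring).
    ttl = int(env.get("CASSANDRA_TTL", "0"))
    if ttl < 0:
        raise ValueError("CASSANDRA_TTL must be 0 or a positive number of seconds")
    return CassandraSettings(
        user=env["CASSANDRA_USER"],
        password=env["CASSANDRA_PASSWORD"],
        datacenter=env["CASSANDRA_DATACENTER"],
        ttl_seconds=ttl,
    )


if __name__ == "__main__":
    example_env = {
        "CASSANDRA_USER": "example-user",
        "CASSANDRA_PASSWORD": "example-password",
        "CASSANDRA_DATACENTER": "eqiad",  # placeholder value
        "CASSANDRA_TTL": "0",
    }
    print(load_settings(example_env))
```

Failing at startup when a variable is missing makes a misconfigured deployment visible immediately, rather than at the first cache read or write.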

Adding Cassandra to external_services

This ensures that a NetworkPolicy is added, allowing us to connect to the Cassandra cluster from our namespace. See an example of this change.