Cassandra


Puppetization

See current work in progress.

Cassandra provides deb packages, which we have imported into apt.wikimedia.org. We target Cassandra 2.1.3 for now.

Parameterization

For general docs, see [1].

/etc/cassandra/cassandra.yaml

See the docs.

Minimal things to change:

listen_address, rpc_address
    set to the external IP of this node
seed_provider / seeds
    set to the list of the other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"
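
For illustration, a minimal sketch of the corresponding cassandra.yaml stanza (the address shown is one of the example nodes above; each node uses its own external IP):

 # illustrative values; adjust per node
 listen_address: 10.64.16.147
 rpc_address: 10.64.16.147
 seed_provider:
     - class_name: org.apache.cassandra.locator.SimpleSeedProvider
       parameters:
           - seeds: "10.64.16.147,10.64.16.149,10.64.0.200"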

Other things we'll want to change from the defaults are:

/etc/cassandra/cassandra-env.sh

  • MAX_HEAP_SIZE
  • HEAP_NEWSIZE
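
A hedged example of those settings in cassandra-env.sh; the values are illustrative and should be sized to the node's RAM and workload:

 # illustrative values only
 MAX_HEAP_SIZE="8G"
 HEAP_NEWSIZE="800M"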

Common admin tasks

Multi-instance nodes: be aware

Keep in mind that nodetool commands on multi-instance nodes (the new cluster) have to use the per-instance wrapper nodetool-<id>; for example, nodetool-a runs the following command against instance "a":

nodetool-a tablestats -- local_group_default_T_pageviews_per_article_flat.data 

Rolling restart

Cassandra will take a while (minutes) to start up when the dataset per instance is large. During this time, it is not part of the cluster and does not respond to client requests. It is therefore important to wait for each node to come fully up before proceeding with another node.

To automate this, we have created a restart playbook:

 # Make sure to have a checkout of the ansible-deploy repo and a local ansible install
 git clone https://github.com/wikimedia/ansible-deploy.git
 cd ansible-deploy
 
 # Perform the rolling restart
 ansible-playbook -i production roles/cassandra/restart.yml

Restart & check status

(Re)start cassandra: systemctl restart cassandra. The command

nodetool status

should return information and show your node (and the other nodes) as being up. Example output:

root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  10.64.16.149  91.4 KB    256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1
UN  10.64.0.200   30.94 KB   256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1
UN  10.64.16.147  58.75 KB   256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1

The nodetool utility provides many more interesting tools, among them:

Stats about a keyspace (~db)
    nodetool cfstats org_wikipedia_en_T_parsoid_html | less
Histograms about a table
    nodetool cfhistograms org_wikipedia_en_T_parsoid_html data
Take an immutable snapshot of a specific table or all tables
    nodetool snapshot [keyspace] [table]
Clear snapshot(s) to recover disk space
    nodetool clearsnapshot [keyspace] [table]

See more options with nodetool help

Adding a new (empty) node

Background: docs

Stop the node and wipe the default data (if it started up) with rm -r /var/lib/cassandra/*.

Temporarily remove the node from its own seeds list in /etc/cassandra/cassandra.yaml. This makes sure that the node goes through a full bootstrap process before fully joining the cluster and accepting requests (see this ticket for background).

Now, start the node. If everything goes well the cassandra node should automatically join the cluster after going through a bootstrap ('joining') phase. Verify with nodetool status as above.
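
A sketch of that sequence on the new node (single-instance layout; on multi-instance hosts adapt the service name and data paths):

 systemctl stop cassandra
 rm -r /var/lib/cassandra/*
 # edit /etc/cassandra/cassandra.yaml and temporarily drop this node's own IP from the seeds list
 systemctl start cassandra
 nodetool status   # the node should move from UJ (joining) to UN (up/normal)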

On Bootstrapping

During the bootstrap, data will stream in such a way as to avoid consistency violations. Simply put, the data will be streamed from the node that is giving up the range (primary, or otherwise), and other replicas are not considered. This can result in a lack of streaming concurrency, and a slower bootstrap. For example, consider the case where replication factor is 3, and there are 5 nodes in the cluster, with 2 nodes each in rack A and B, and one in rack C. If a new node is bootstrapped into rack C, all of the data loaded onto the new node will stream from the lone node in rack C.

This behavior can be overridden (if, for example, the single target node were down) by adding the -Dcassandra.consistent.rangemovement=false JVM property during startup of the joining node. Tread carefully though: data loss is possible!

See CASSANDRA-2434.
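
One way to pass the property, as a sketch (assuming the stock cassandra-env.sh, which accumulates startup options in JVM_OPTS):

 # on the joining node, before starting it; remove the line again once the bootstrap completes
 echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"' >> /etc/cassandra/cassandra-env.sh
 systemctl start cassandra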

Re-syncing a node

To fully bring a node up to date after crashes or other outages, run nodetool repair -par -inc on the node, ideally in a screen session (this can take a while if the delta is large). This will pull the node's share of data from the other nodes in the cluster in the background.

This command is also run from cron every 24 hours as a kind of background scrubbing process.
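
For example (on multi-instance hosts use the nodetool-<id> wrapper instead of plain nodetool):

 screen -S repair
 nodetool repair -par -inc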

Getting a CQL shell

Our cassandra nodes only listen on the public IPs, so you need to use:

cqlsh -u <user> -p <password> <public ip or host name>

See the docs for next steps with CQL.
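
A couple of basic commands to get oriented once connected (illustrative only):

cassandra@cqlsh> DESCRIBE KEYSPACES;
cassandra@cqlsh> DESCRIBE TABLES;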

Installing and generating certificates

Cassandra traffic between datacenters travels encrypted as per https://phabricator.wikimedia.org/T108953. The per-host keypairs live in private.git (Feb 2016 on palladium) under /srv/private/modules/secret/secrets/cassandra.

To (re)generate the keypairs, use the cassandra-ca-manager script. The script is driven by YAML configuration files, one per Cassandra cluster; each configuration contains one entry per host plus information for the CA itself, and it is designed to be idempotent. If a certain host in the configuration file is missing its keypair, it will be (re)generated when invoking the script, for example:

 cd /srv/private/modules/secret/secrets/cassandra/
 /var/lib/git/operations/puppet/modules/cassandra/files/cassandra-ca-manager services-test.yaml

Once generated and committed, the new private material will be deployed at the next puppet run. Also, when adding a new host, don't forget to update labs/private.git too and (re)generate the fake private material.

Rollover / expiration

Certificates are generated with a one-year expiration. Around that time (see also https://phabricator.wikimedia.org/T120662 for monitoring), it is sufficient to delete the old keypairs and generate new ones. Once puppet has deployed the new keypair, it should be sufficient to roll-restart Cassandra.

All keypairs are signed by a CA which is part of a truststore; adding a new CA to the truststore will effectively trust both. Another way to achieve rollover is to issue another certificate for the CA with an extended expiration without changing the private key; in practice this means deleting rootCa.crt and invoking cassandra-ca-manager again.

Monitoring

  • Logs: /var/log/cassandra/system.log has very detailed information on what's going on.
    • Grep for 'GCInspector' to search for slow GCs
  • Graphite: 'cassandra.$host.*'

RESTBase (major Cassandra user)

Ongoing cassandra bootstrap/decommission

While performing a node bootstrap or decommission, you can monitor its progress with something like the following (replace nodetool-a with the wrapper for the right instance):

 watch -d -n300 "nodetool-a netstats -H | grep total"

Tools

cassandra-tools-wmf

c-cqlsh

Synopsis

c-cqlsh <id>

Description

Given an instance ID, executes cqlsh on that instance. Uses the cqlshrc file located in the instance's configuration directory.

Example

$ c-cqlsh a
Connected to services-test at 10.64.0.202:9042.
[cqlsh 5.0.1 | Cassandra 2.2.6 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>

c-any-nt

Synopsis

c-any-nt <arg> [arg ...]

Description

In a WMF multi-instance environment, executes nodetool on a randomly chosen instance (any instance).

Example

$ c-any-nt status -r
...

c-foreach-nt

Synopsis

c-foreach-nt <arg> [arg ...]

Description

In a WMF multi-instance environment, iteratively execute nodetool on all local Cassandra instances.

Example

$ c-foreach-nt version
a: ReleaseVersion: 2.2.6
b: ReleaseVersion: 2.2.6

c-foreach-restart

Synopsis

usage: c-foreach-restart [-h] [-a ATTEMPTS] [-r RETRY]
                          [--execute-post-shutdown CMD] [-d DELAY]
                          [--logmsgbot LOGMSGBOT] [--tcpircbot-host HOST]
                          [--tcpircbot-port PORT] [--phabricator-issue ISSUE]
 
 Cassandra instance restarter
 
 optional arguments:
   -h, --help            show this help message and exit
   -a ATTEMPTS, --attempts ATTEMPTS
                         Maximum number of times to check if service is up
                         after restarting.
   -r RETRY, --retry RETRY
                         Number seconds between connection attempts, in
                         seconds.
   --execute-post-shutdown CMD
                         Command to execute after Cassandra has been shutdown,
                         and before it is started back up.
   -d DELAY, --delay DELAY
                         Delay between instance restarts (defaults to no
                         delay).
   --logmsgbot
                         Log restarts to SAL (via logmsgbot and #wikimedia-
                         operations).
   --tcpircbot-host HOST
                         tcpircbot hostname. Only valid when --logmsgbot is
                         used. Default: einsteinium.wikimedia.org
   --tcpircbot-port PORT
                         tcpircbot port number. Only valid when --logmsgbot is
                         used. Default: 9200
   --phabricator-issue ISSUE
                         Phabricator issue to associate these restarts with.
                         This currently only makes sense in combination with
                         --logmsgbot where it is included in the formatted log
                         message.

Description

In a WMF multi-instance environment, iteratively restart instances.

Each instance is drained prior to being restarted. After each restart, the process blocks until the CQL port is listening.

Commands specified by --execute-post-shutdown can include the {id} template which will be substituted with the ID of the instance being restarted.

Example

$ sudo c-foreach-restart --execute-post-shutdown="echo '{id} is a teapot, short and stout'"
...

c-ls

Synopsis

c-ls

Description

List the IDs of the locally configured instances.

Example

$ c-ls
a
b
c

streams

Synopsis

streams [-h] [-nt CMD]
 
 Cassandra streams monitor
 
 optional arguments:
   -h, --help            show this help message and exit
   -nt CMD, --nodetool CMD
                         Nodetool command to use

Description

Realtime console monitoring of streaming operations.


Example

$ streams -nt nodetool-b

uyaml

Synopsis

uyaml [-h] yaml path
 
 YAML parser
 
 positional arguments:
   yaml        YAML file to parse
   path        Path to parse from file
 
 optional arguments:
   -h, --help  show this help message and exit

Description

Query attributes from a YAML file.

Example

$ uyaml /etc/cassandra-instances.d/restbase1007-a.yaml /jmx_port
7189
$ for i in `c-ls`; do uyaml /etc/cassandra-instances.d/restbase1007-$i.yaml /jmx_port; done
7189
7190
7191

cassandra-ca-manager

cassandra-ca-manager: manage Java keystores with a self-signed certificate authority

cassandra-metrics-collector

cassandra-metrics-collector: JMX metrics collector

cdsh

cdsh: a Cassandra cluster wrapper for dsh

Uncommon admin tasks

Monitor compactions

nuria@aqs1004:~$ nodetool-a compactionstats -H
pending tasks: 998
                                     id   compaction type                                           keyspace   table   completed       total    unit   progress
   88386550-6f46-11e6-908d-b320c64aa5c2        Compaction   local_group_default_T_pageviews_per_article_flat    data   333.74 GB    339.6 GB   bytes     98.27%
   0ee1fe70-6f99-11e6-908d-b320c64aa5c2        Compaction   local_group_default_T_pageviews_per_article_flat    data   156.05 GB   336.07 GB   bytes     46.43%
   e9461700-6ecb-11e6-908d-b320c64aa5c2        Compaction   local_group_default_T_pageviews_per_article_flat    data    710.8 GB   896.46 GB   bytes     79.29%
Active compaction remaining time :   0h24m46s
 

Bootstrap a brand new cluster

When bootstrapping a new cluster, at least one of the seeds will need to have auto_bootstrap: false in /etc/cassandra/cassandra.yaml to prevent it from trying (and failing) to bootstrap from the other seeds.

Chances are you are bootstrapping the new cluster from freshly installed machines, so Cassandra will already have run at least once and its data files need to be removed.

NOTE this procedure is meant to be run only on a brand new cluster and NOT ON EXISTING CLUSTERS

 service cassandra stop
 rm -rf /var/lib/cassandra/*
 # this is needed only on one of the seeds
 grep -q 'auto_bootstrap: false' /etc/cassandra/cassandra.yaml || echo 'auto_bootstrap: false' >> /etc/cassandra/cassandra.yaml
 service cassandra start

Once the node has started, you can stop/wipe/start the other seeds.

Switch to multiple cassandra instances per hardware node

As per the work in https://phabricator.wikimedia.org/T95253 and related, we have the capability of running multiple (up to four at the moment) cassandra instances per hardware node. A single node can be converted in place from a single instance to multiple instances by first decommissioning the node in Cassandra, then reimaging it and assigning new IPs/names for the instances in DNS and puppet.

  1. Allocate IP addresses in DNS for instance(s) (e.g. https://gerrit.wikimedia.org/r/#/c/249152/)
  2. Generate instance(s) TLS certificates (see palladium:/srv/private/modules/secret/secrets/cassandra)
    1. Also generate new material in labs/private.git for puppet compiler's benefit
  3. Stop cassandra from being restarted with systemctl mask cassandra
  4. Run nodetool decommission on the node to be converted (see the sketch after this list)
  5. (In parallel too) Add the new instance(s) to the list of known seeds (e.g. https://gerrit.wikimedia.org/r/#/c/250985/) so they are known to cassandra and allowed in ferm
  6. Reimage the machine
  7. Add one instance at a time to the reimaged machine (e.g. https://gerrit.wikimedia.org/r/#/c/250975/); once puppet runs, it will provision the additional instance
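
A sketch of steps 3 and 4 as run on the node being converted:

 systemctl mask cassandra     # keep the service from being started again until the reimage
 nodetool decommission        # streams this node's ranges to the remaining nodes; can take a long time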

Handling JBOD disk failures

Modern versions of Cassandra (those which include CASSANDRA-6696) are capable of using JBOD. A JBOD configuration is accomplished by simply supplying a list of paths as the data_file_directories directive.

The Wikimedia convention is to mount the respective devices as /srv/{device} (e.g. /srv/sda4, /srv/sdb4, etc), with data_file_directories in the form of /srv/{device}/cassandra-{id}/data. Common data should be placed somewhere with single device redundancy (e.g. RAID-10), mounted as /srv/cassandra/instance-data
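
A sketch of the resulting cassandra.yaml directive for an instance "a" on a host with two data devices (device names are illustrative):

 data_file_directories:
     - /srv/sda4/cassandra-a/data
     - /srv/sdb4/cassandra-a/data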

Cluster Setup

Authentication

Authentication is enabled in cassandra.yaml by setting the authenticator and authorizer directives to PasswordAuthenticator and CassandraAuthorizer respectively. When so enabled, Cassandra will automatically create a system_auth keyspace at first start, and a default (super-)user named cassandra, with a password of cassandra. Replication should be set accordingly on this new keyspace, and a new super-user with a strong password created.
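
The corresponding cassandra.yaml excerpt, using the directive names and values given above:

 authenticator: PasswordAuthenticator
 authorizer: CassandraAuthorizer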

First, connect to the cluster using the default credentials and alter the keyspace:

$ cqlsh -u cassandra -p cassandra
cassandra@cqlsh> ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };

Next, create a new super-user:

cassandra@cqlsh> CREATE USER IF NOT EXISTS admin WITH PASSWORD 'sup3rs3kr3t' SUPERUSER;

Users are not permitted to delete themselves, so exit the shell and reconnect as the new super-user:

cassandra@cqlsh> exit
$ cqlsh -u admin -p sup3rs3kr3t

Delete the default super-user:

admin@cqlsh> DROP USER IF EXISTS cassandra;

For authenticated application access, create a normal user:

admin@cqlsh> CREATE USER IF NOT EXISTS restbase WITH PASSWORD 'moars3kr3t' NOSUPERUSER;

Replicating system_auth

Authentication and authorization information is maintained in the system_auth keyspace, which by default uses SimpleStrategy and a replication factor of 1. This is definitely not what you want; a single node failure could prevent you from accessing your database! Best practice is to configure a replication factor of 3-5 per datacenter.

Please check Incident documentation/20170223-AQS before proceeding: increasing the replication of system_auth on a live cluster may lead to outages.
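
With that caution in mind, a hedged sketch of a per-datacenter setting (the datacenter name matches the example nodetool output above; use your cluster's actual datacenter names, and follow up with a repair of system_auth so the new replicas are populated):

admin@cqlsh> ALTER KEYSPACE system_auth WITH replication = { 'class': 'NetworkTopologyStrategy', 'datacenter1': 3 };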

Time-window compaction strategy

See Cassandra/TimeWindowCompactionStrategy.


Prometheus JMX Exporter

See Cassandra/PrometheusJmxExporter.

See also