User:Razzi/2021-05-5

Procedure to fold 2 partitions into one: https://phabricator.wikimedia.org/T278424#7020076

mkdir /srv/sqldata
mv /var/lib/mysql/* /srv/sqldata
umount /var/lib/mysql
umount /srv
lvremove /dev/an-coord1001-vg/mysql
lvextend -l +100%FREE /dev/an-coord1001-vg/srv
resize2fs /dev/an-coord1001-vg/srv
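
A minimal sketch of the same sequence wrapped in a dry-run helper, so the order of steps can be reviewed before anything is unmounted or removed. The function name `fold_plan` and its volume-group argument are hypothetical. It only prints the commands listed above and runs none of them.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands that fold the
# mysql logical volume into the srv logical volume of one volume group.
fold_plan() {
    vg="$1"
    # Unquoted heredoc: ${vg} is expanded, the glob and %FREE are printed as-is.
    cat <<EOF
mkdir /srv/sqldata
mv /var/lib/mysql/* /srv/sqldata
umount /var/lib/mysql
umount /srv
lvremove /dev/${vg}/mysql
lvextend -l +100%FREE /dev/${vg}/srv
resize2fs /dev/${vg}/srv
EOF
}

fold_plan an-coord1001-vg
```

The sketch assumes the database service is already stopped before the files are moved. The notes above do not show that step.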

Also had to change the MySQL data directory in Puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/681358/2/hieradata/role/common/analytics_cluster/coordinator.yaml

profile::analytics::database::meta::datadir: '/srv/sqldata'

Got a ping on https://phabricator.wikimedia.org/T280367 - Mysql partition on an-coord1001 sudden change in growth rate since Apr 14th

The issue was resolved, which is visible on: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops&from=1618272000000&to=1619913599000

Priority for today:

I had originally thought we could do the upgrade with everything online, rather than scheduling a maintenance window with read-only safe mode etc. I can see the benefit of safe mode for protecting against data loss, and there is always the chance a reimage goes horribly wrong, but since all of this is on a standby, we shouldn't have to take writes offline.
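
For reference, a rough sketch of what the safe-mode variant of a maintenance window could look like, using the standard `hdfs dfsadmin` subcommands. The helper names `run_step` and `safe_mode_window` and the `DRY_RUN` variable are hypothetical, and any Kerberos or sudo wrapping needed on the real cluster is left out. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: run_step echoes each command unless DRY_RUN=0 is set.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical outline of a read-only maintenance window.
safe_mode_window() {
    run_step hdfs dfsadmin -safemode enter   # NameNode stops accepting writes
    run_step hdfs dfsadmin -saveNamespace    # checkpoint the namespace while read-only
    run_step hdfs dfsadmin -safemode get     # report whether safe mode is on
    # the maintenance itself would happen here
    run_step hdfs dfsadmin -safemode leave   # resume writes
}

safe_mode_window
```

Keeping the default as a dry run means the outline can be reviewed without touching the NameNode. Setting `DRY_RUN=0` would execute the commands for real.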

What would happen if we took a snapshot, data kept being written, and then we had to restore the snapshot? There would be some unreferenceable data on the workers, but what data would actually be lost?

In safe mode, what would

Data builds up on Kafka

Need to understand all the data that flows into HDFS

How to drain the cluster?

...

Created a Kerberos principal for a user. It was as easy as running the create command and adding krb: present to data.yaml: https://phabricator.wikimedia.org/T281809