Appserver-buster-upgrade-2021

Observations of differences between MediaWiki servers on stretch vs buster, before and after task T245757, January 2021

TCP errors

Are there more TCP errors?

At first glance it appears as if TCP errors were reduced on these two example hosts. The data gap in the middle is when the reimaging to buster happened.

mw1263 - TCP errors before and after the upgrade from stretch to buster on 2021-01-27
mw1268 - TCP errors before and after the upgrade from stretch to buster on 2021-01-27
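
As a sanity check independent of the Grafana panels, one can also pull the raw counters straight from the Prometheus HTTP API and average them over a "before" and an "after" window. This is only a sketch: the Prometheus URL is a placeholder, and the :9100 node_exporter port and the choice of node_netstat_Tcp_RetransSegs as the error counter are assumptions; the dashboard panel may be built on a different query.

import requests

# Placeholder Prometheus endpoint; the real URL depends on the local setup.
PROM = "http://prometheus.example.org/api/v1/query_range"

def avg_rate(host, start, end, metric="node_netstat_Tcp_RetransSegs"):
    # Average per-second rate of a node_exporter counter over [start, end]
    # (unix timestamps). The :9100 port and the metric name are assumptions.
    query = f'rate({metric}{{instance="{host}:9100"}}[5m])'
    resp = requests.get(PROM, params={
        "query": query, "start": start, "end": end, "step": "60s",
    })
    resp.raise_for_status()
    values = resp.json()["data"]["result"][0]["values"]
    return sum(float(v) for _, v in values) / len(values)

# Compare a day before the 2021-01-27 reimage with a day after it.
before = avg_rate("mw1263", 1611619200, 1611705600)  # 2021-01-26 .. 2021-01-27 UTC
after = avg_rate("mw1263", 1611792000, 1611878400)   # 2021-01-28 .. 2021-01-29 UTC
print(f"mw1263 TCP retransmits/s  before: {before:.3f}  after: {after:.3f}")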

No, not really

But once you zoom out and look at an entire week before the upgrade, it turns out this isn't actually a pattern.

mw1268 - TCP errors - over an entire week before the upgrade

disk utilization

Did disk utilization go up?

Similarly, at first it looks as if disk utilization went through the roof after the upgrade:

mw1268 - disk utilization before and after the upgrade from stretch to buster on 2021-01-27
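
The same approach works for the disk utilization panel. node_disk_io_time_seconds_total is the node_exporter counter for time a device spent doing I/O, so its rate is roughly the fraction of time the disk was busy. The device label ("sda") and the endpoint below are assumptions, and the dashboard may use a slightly different query.

import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL

# Approximate disk utilization (% of time busy) for mw1268 around the upgrade.
# The device label is an assumption about the host's disk layout.
query = ('rate(node_disk_io_time_seconds_total{instance="mw1268:9100",'
         'device="sda"}[5m]) * 100')
resp = requests.get(PROM, params={
    "query": query,
    "start": 1611619200,  # 2021-01-26 00:00 UTC
    "end": 1611878400,    # 2021-01-29 00:00 UTC
    "step": "300s",
})
resp.raise_for_status()
values = resp.json()["data"]["result"][0]["values"]
print(f"peak disk utilization: {max(float(v) for _, v in values):.1f}%")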

No, seems like spikes are unrelated

But once you zoom out, you see that these spikes also occur independently of the upgrade event:

mw1268 - disk utilization over a week leading up to the buster upgrade day


performance (avg response time)

Is it actually getting slower?

Looking at average response time, it can appear as if a buster server is actually slower, for example when comparing mw1268 (stretch) vs mw1267 (buster) over a 6-hour span:

mw1268 (stretch) vs mw1267 (buster) - avg response time - over 6 hours on 2021-01-27

Similarly if we compare these hosts over a week:

mw1268 (stretch) vs mw1267 (buster) - avg response time - over a week in Jan 2021

Or over a full 30 days (mw1267 was reimaged on Jan 8):

mw1268 (stretch) vs mw1267 (buster) - avg response time - over 30 days
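
For what it's worth, an "avg response time" series like this is usually the rate of a latency histogram's _sum divided by the rate of its _count. A minimal sketch of that calculation for the two hosts follows; http_request_duration_seconds is a placeholder metric name standing in for whatever the dashboard actually queries, and the Prometheus URL is again a placeholder.

import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL

def avg_latency_ms(host, start, end):
    # Mean latency = rate(sum of observed seconds) / rate(number of requests).
    # http_request_duration_seconds is a PLACEHOLDER metric name.
    query = (
        f'sum(rate(http_request_duration_seconds_sum{{instance="{host}"}}[5m])) / '
        f'sum(rate(http_request_duration_seconds_count{{instance="{host}"}}[5m]))'
    )
    resp = requests.get(PROM, params={
        "query": query, "start": start, "end": end, "step": "300s",
    })
    resp.raise_for_status()
    values = resp.json()["data"]["result"][0]["values"]
    return 1000 * sum(float(v) for _, v in values) / len(values)

# A 6-hour window on 2021-01-27 UTC, matching the first comparison above.
for host in ("mw1268", "mw1267"):
    print(host, f"{avg_latency_ms(host, 1611705600, 1611727200):.1f} ms")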

Compare another set of servers

App

But mw1267/mw1268 are really old hardware and will soon be replaced anyway. So let's take another pair of appservers on more modern hardware: one that stays unchanged on stretch (mw1403) and one that gets reimaged to buster (mw1405), and compare them before and after to see whether we can confirm this.

mw1403 (stretch) vs mw1405 (buster) - response time over 6 hours on 2021-01-27 - can't see a difference here

API

Let's do the same for API servers.

mw1404 (buster) vs mw1406 (stretch->buster) - avg response time over 12 hours on 2021-01-27 - no obvious difference

Let's check the same hardware before and after

To be really sure, let's check the same machines before and after reimaging.

Before

First, let's take a screenshot of the baseline before touching them. mw1402 and mw1404 are API servers, both on stretch. mw1268 and mw1269 are app servers, both on stretch in this image. We are looking at 24 hours and 7 days, and also at the 95th percentile now:
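
As an aside, a 95th percentile panel like this is typically computed with histogram_quantile over the bucket rates. A short sketch, again with a placeholder metric name and endpoint:

import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL

# 95th percentile latency estimated from histogram bucket rates; the metric
# name is a PLACEHOLDER for whatever the dashboard actually uses.
p95 = ('histogram_quantile(0.95, sum by (le) '
       '(rate(http_request_duration_seconds_bucket{instance="mw1402"}[5m])))')
resp = requests.get(PROM, params={
    "query": p95, "start": 1611705600, "end": 1611792000, "step": "300s",
})
resp.raise_for_status()
values = resp.json()["data"]["result"][0]["values"]
print(f"max p95 over the day: {1000 * max(float(v) for _, v in values):.0f} ms")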

After

Now, after reimaging one of the two servers: stretch on the left, buster on the right. Here we are seeing a slower average response time on buster again.

But... the percentage of responses under 250ms is actually getting better.
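
That "percentage of responses under 250ms" figure is the kind of number that comes straight out of histogram buckets: the rate of the le="0.25" bucket divided by the rate of the total count. A sketch of how such a panel is typically computed, again with a placeholder metric name and endpoint:

import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder URL

# Fraction of requests answered in under 250ms = rate of the le="0.25" bucket
# over the rate of all requests. Metric name is a PLACEHOLDER.
under_250ms = (
    'sum(rate(http_request_duration_seconds_bucket'
    '{instance="mw1404",le="0.25"}[5m])) / '
    'sum(rate(http_request_duration_seconds_count{instance="mw1404"}[5m]))'
)
resp = requests.get(PROM, params={
    "query": under_250ms, "start": 1611705600, "end": 1611792000, "step": "300s",
})
resp.raise_for_status()
values = resp.json()["data"]["result"][0]["values"]
print(f"mean fraction under 250ms: {sum(float(v) for _, v in values) / len(values):.3f}")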

The same comparison for appservers instead of API servers. Here stretch and buster are reversed: buster on the left, stretch on the right.

Results still inconsistent? Is there a pattern?

sources

Example Grafana links and dashboards used: host-overview, application-servers-red-dashboard-wkandek