Appserver-buster-upgrade-2021
Observations of differences between MediaWiki servers on stretch vs buster, before and after task T245757, January 2021
TCP errors
Are there more TCP errors?
At first glance, TCP errors appear to have decreased on these two example hosts. The data gap in the middle marks when the reimaging to buster happened.
No, not really
But once you zoom out and look at the entire week before the reimage, the apparent reduction turns out not to be a consistent pattern.
Disk utilization
Did disk utilization go up?
Similarly, at first it looks as if disk utilization went through the roof after the upgrade:
No, the spikes seem to be unrelated
But once you zoom out, you see that these spikes also occur independently of the upgrade event:
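The "zoom out" check used for both TCP errors and disk utilization can be sketched roughly as follows: flag spikes over a longer window and see whether they also occur before the reimage. This is only an illustration. The sample values, the `reimage_index` position and the threshold factor `k` are made up and not taken from the Grafana dashboards.

```python
from statistics import mean, pstdev


def spike_indices(samples, k=3.0):
    """Return indices of samples more than k standard deviations above the mean."""
    mu = mean(samples)
    sigma = pstdev(samples)
    return [i for i, value in enumerate(samples) if value > mu + k * sigma]


def spikes_before_and_after(samples, event_index, k=3.0):
    """Split the spike indices into those before and those at or after an event."""
    spikes = spike_indices(samples, k)
    before = [i for i in spikes if i < event_index]
    after = [i for i in spikes if i >= event_index]
    return before, after


# Hypothetical disk utilization samples in percent (made-up values).
utilization = [5, 6, 5, 90, 5, 6, 5, 5, 6, 5, 5, 6, 5, 5, 88, 6, 5, 5, 6, 5]
reimage_index = 10  # assumed position of the reimage in the series

before, after = spikes_before_and_after(utilization, reimage_index, k=2.0)
print("spikes before reimage:", before)
print("spikes after reimage:", after)
```

If spikes show up on both sides of the event, as in this made-up series, the spike seen right after the upgrade is weak evidence that the upgrade caused it.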
Performance (avg response time)
Is it actually getting slower?
Looking at average response time, a buster server can appear to be slower when we compare mw1268 (stretch) with mw1267 (buster) over a 6-hour span:
Similarly, if we compare these hosts over a week:
Or over a full 30 days (mw1267 was reimaged on Jan 8):
Compare another set of servers
App
But mw1267 and mw1268 are really old hardware and will soon be replaced anyway. Let's take another pair of appservers on more modern hardware, mw1403 and mw1405: one stays on stretch, unchanged, and the other is reimaged, so it can be observed before and after, to see whether we can confirm this.
mw1403 (stretch) vs mw1405 (buster), response time over 6 hours on 2021-01-27: no visible difference here.
API
Let's do the same for API servers: mw1404 (buster) vs mw1406 (stretch->buster), average response time over 12 hours on 2021-01-27. No obvious difference.
Let's check the same hardware before and after
To make really sure, let's check the same machines before and after reimaging.
Before
First, let's take a screenshot of the baseline before touching them. mw1402 and mw1404 are API servers, both on stretch. mw1268 and mw1269 are app servers, both on stretch in this image. We are now looking at 24 hours and 7 days, and also at the 95th percentile:
After
Now, after reimaging one of the two servers: stretch on the left, buster on the right. Here we again see a slower average response time on buster.
But the percentage of responses under 250 ms is actually getting better.
Same comparison for appservers instead of API servers. Here stretch and buster are reversed: buster on the left, stretch on the right.
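A slower average together with a better share of responses under 250 ms is not a contradiction: a few very slow requests can pull the mean up while most requests get faster. A minimal sketch with made-up latencies (not measurements from these hosts) shows the effect. The `summarize` helper is hypothetical and not part of any dashboard.

```python
from statistics import mean


def summarize(latencies_ms, threshold_ms=250):
    """Return the mean latency and the share of responses faster than the threshold."""
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return {
        "mean_ms": mean(latencies_ms),
        "share_under_threshold": fast / len(latencies_ms),
    }


# Made-up response times in milliseconds, for illustration only.
stretch = [200] * 80 + [300] * 20   # steady, with some requests just over 250 ms
buster = [150] * 90 + [2000] * 10   # mostly faster, but with a few slow outliers

for name, latencies in (("stretch", stretch), ("buster", buster)):
    print(name, summarize(latencies))
```

In this example the second series has the higher mean but also the higher share of fast responses, which is why it helps to look at percentiles and threshold shares next to the average.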
Results still inconsistent? Is there a pattern?
Sources
Example Grafana links; dashboards used: host-overview, application-servers-red-dashboard-wkandek