Incidents/20140113-Poolcounter

Summary

We had a partial "outage" of (some?) pages on at least jawiki and ruwiki, causing at least the Main Page not to be displayed, and anon users getting PoolCounter error messages instead.

Timeline

At 12:42 Max asked on #wikimedia-operations if the poolcounter services could be restarted, after reports of these errors at least since 12:23 in bug https://bugzilla.wikimedia.org/show_bug.cgi?id=59993. Alex promptly responded and restarted the poolcounter daemons. However there were no signs that the poolcounter daemons weren't working correctly. Antoine noted correctly that the services weren't being monitored in Icinga, and submitted a patchset to correct this.

The problem in the mean time was not resolved.

The poolcounter.log on fluorine was being spammed with lots of Queue full messages. (Allegedly, because the messages were in Japanese.) The archives showed that these messages are not unusual under normal circumstances, but were highly elevated now. Max considered a high purge count and/or parser cache failures.

At 13:12 Daniel pinged me and asked me to look into it. I checked the PoolCounter services and statistics and saw no problems with poolcounterd itself. In the mean time, Max had prepared a patchset to temp disable the poolcounter on at least jawiki. I feared however that PoolCounter was doing exactly what it's supposed to do, i.e. preventing the actual problem from getting worse due to the contention on page parses/views. I raised the poolcounter queue size by 50% for the 'ArticleView' pool, to see if it had just tipped over the edge, but it made no difference. (I'll revert that change now.)

Around 14:10, Faidon joined the investigation. According to various different graphs and statistics, the problem seemed to had started yesterday, in the morning hours of UTC. Faidon also found a correlated graph for db1006, with innodb reads, transactions and general query count going up around the same time the problem started. db1006 is part of the s6 cluster, to which both jawiki and ruwiki belong, as well as frwiki which wasn't showing the problem. However db1006's slow log showed a high rate of slow queries for the page_restrictions table (for frwiki AFT). We didn't understand why these queries were slow, as they seemed simple queries for a small table on a primary key. Whether this was a symptom or cause of our problem we didn't know.

When correlating the approximate start time of the problem with changes on the mediawiki platform, I had matched it to a change Reedy did to the CategoryTree extension on Jan 12 around 01:00 UTC https://gerrit.wikimedia.org/r/#/c/107008/, a couple of hours before the problem started to exhibit itself (possibly due to low traffic). I didn't see a direct relation to the issue and opted to continue investigation first. However, without knowing the specifics of CategoryTree and/or MediaWiki internals, I could imagine a relation with this change and various page caching mechanisms, so at 15:10 I decided to just test it, and revert this config change https://gerrit.wikimedia.org/r/#/c/107158/. Immediately poolcounter.log went quiet and the problem disappeared.

Actionables

Status: Done - Analyze CategoryTree problem and implement workaround
Status: Unresolved - Fix monitoring of poolcounter service
- https://rt.wikimedia.org/Ticket/Display.html?id=7108
Status: Done - Improve poolcounter extension error messages. Some context would be helpful, like poolcounter server contacted, pool context, URL. And perhaps error messages even if only in english (as opposed to what's displayed to the user)
- https://bugzilla.wikimedia.org/show_bug.cgi?id=63027
Status: Done - Investigate page_restrictions query slowness on db1006
- SELECT /* Title::loadRestrictions */ pr_type,pr_expiry,pr_level,pr_cascade FROM `page_restrictions` WHERE pr_page = '2720924'; domas says needs FORCE INDEX
- https://rt.wikimedia.org/Ticket/Display.html?id=7126 (closed/rejected)
- Forcing an index seems like the wrong approach here, or perhaps there was some miscommunication somewhere. The example query is properly indexed and fast on any S6 slave today including db1006, so it's likely something else was affecting these queries during the outage, or there was a storm of them, or something else has been fixed in the meantime. Springle (talk) 06:03, 15 April 2014 (UTC)[reply]