Incidents/20160915-MediaWiki

Summary

Deployment of 1.28.0-wmf.19 caused jobs to fail when invoking SiteConfiguration::getConfig. That initially caused Wikidata to no more dispatch updates to the wiki clients (until Thursday 15th 20:55 UTC) and prevented any account creation from Thursday 15th 19:10UTC to Friday 16th 12:50 UTC.

Reverting all wikis to 1.28.0-wmf.18 solved the problem.

The root cause is a failure to update the HHVM bytecode cache because of a filesize limit of 512MBytes enforced via ulimit. The rough chain of call being:

jobrunner service
/rpc/RunJobs.php
wfShellExec( 'mwscript maintenance/getConfiguration.php' ); with ulimit filesize=512MBytes
HHVM in CLI mode attempting to write to either /var/cache/hhvm/fcgi.sq3 or /var/cache/hhvm/cli.sq3 (undetermined). EFBIG (File too large)

As a result wfShellExec() fails with exit code 153 (which is 128 + value of Posix signal SIGXFSZ 25), since the job fails to get configuration from the shell call, it errors out.

The extensive debugging details are on task T145819.

As a result ALL wikis are at 1.28.0-wmf.18 as of Friday 15th 12:50 UTC.

Timeline

This is a wall of text, you can probably skip it entirely and move straight to Conclusions below.

Tuesday 13th at 19:00 UTC

1.28.0-wmf.19 is pushed to group0 as scheduled. That caused renames to get stuck on mediawiki.org (task T145596) fixed via https://phabricator.wikimedia.org/rECAU20f54121d177f996c155d5feea6c799a8c9ed242

Wednesday 14th at 19:00 UTC

1.28.0-wmf.19 pushed to group1. It is promptly reverted due to MediaWiki:Common.css no more being included (task T145673), first noticed on the Hebrew Wikipedia which has the Infobox on the left instead of floating on the right.

After the rollback train conductor Antoine unblocks stuck renames on mediawiki.org (task T145596). They originally failed due to a database related issue (commit)

Only group0 is running 1.28.0-wmf.19

Thursday 15th at 19:00 UTC

The renaming being fixed, 1.28.0-wmf.19 is pushed to group1 at 19:10 UTC. A user notice that Wikidata is lagged. The Wikidata process that dispatch updates to client Wikis is lagging. Aude assumes it is a temporary spike and but we fill task T145819 nonetheless. After more monitoring it is evident the lag is not resolving, the Grafana boards for Wikidata Dispatch having all the informations.

Logs show that SiteConfiguration::getConfig() fails on terbium.

We rollback Wikidata to 1.28.0-wmf.18. The Wikidata dispatch jobs are processed again. The situation seems to be resolved.

After some more monitoring, and to stay on schedule with the train deployment all wikis but wikidata ones are upgraded to 1.28.0-wmf.19 at 20:55 UTC. The account renaming have been fixed the day before, the Wikidata dispatch works fine, there is not much error in fatalmonitor. It is assumed that every is fine.

At 22:00 UTC everything looks fine.

Night goes on in Europe

Friday 16th at 8:00 UTC Antoine dig and further investigate the Wikidata SiteConfiguration::getConfig issue. The extensive debugging is show on task T145819 with support from Addshore. That points out a rather messy situation with MediaWiki has a web service having to shell out to mwscript maintenance/getConfiguration.php T111441. Since that is invoked in the context of the webservice, ulimit is applied and restrict file size to 512 MBytes. Turns out that an unknown reason, HHVM attempt to refresh its Sqlite3 bytecode cache when being run on 1.28.0-wmf.19 but not on 1.28.0-wmf.18 .

The error is manually reproducible from the CLI on terbium. At 12:00 UTC, Antoine deletes /var/cache/hhvm/cli.hhbc.sq3 and the error is gone.

12:40 UTC, Antoine announces decision to rollback all wikis to 1.28.0-wmf.18, with the hope that HHVM would not have to trigger the write to the cli cache since 1.28.0-wmf.18 did not. That is the safe thing to do on a Friday, potentially a hack to remove the filesize limit would have worked as well.

12:45 UTC Antoine notices quite late that Account creation are 100% erroring out since 1.28.0-wmf.19 deployment at 19:10 UTC the day before (see comment and graph on T145839). That confirms rollbacking.

12:50 UTC rebuilt wikiversions.php and synchronized wikiversions files: All wikis back to 1.28.0-wmf.18 :( T145819

Monitoring ensue, the jobs are all happy.

Conclusions

Despite an enormous amount of metrics being collected for the infrastructure or the MediaWiki stack, most do not generate an alarm. The 100% account creation error rate must be made a critical alarm.
Antoine could not find a good dashboard or overview of jobs failure by type of job/wiki/version/error reason
Having MediaWiki as a web service shelling out to run a maintenance script is absolutely terrible and confusing
It has been absolutely not trivial to find out that HHVM tried to write to the bytecode cache and got denied. The shell out fails to capture standard error.

Actionables

MediaWiki / HHVM related:

SiteConfiguration::getConfig must stop using mwscript maintenance/getConfiguration.php (task T111441)
Investigate HHVM writing to the bytecode cache even when 1.28.0-wmf.19 got deployed everywhere. (Task to be filled)
Have the train deployment to purge HHVM bytecode cache for both cli and fcgi. (task T146226)

Jobs related:

Monitoring and alerts for "business" metrics (task T140942)
Add monitoring alarm for Account creation errors (task T146090)
Add monitoring alarm for global and type of jobs errors (Task to be filled)
wikidev people cant read /var/log/mediawiki/jobrunner.log (task T146040)
job errors are not easily discoverable:
- RunJobs logs in logstash should be enhanced to support filtering by job type (task T146469)
- Antoine created https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-runjobs

Future of train:

Decide a plan to push 1.28.0-wmf.19 again. To be discussed by Release Engineering team on September 19th team meeting.
- See the thread on wikitech-l for status/plan.