
User:CDanis (WMF)/Diagnosing opcache corruption

From Wikitech


This page contains historical information.
On Kubernetes, we never install new code into an already-running PHP-FPM server, which prevents opcache issues.


[Image: Doctor House meme, captioned "IT'S NOT OPCACHE ...except sometimes it is"]


OPcache improves PHP performance by storing precompiled script bytecode in shared memory, thereby removing the need for PHP to load and parse scripts on each request.

Very, very rarely, a race condition or other error occurs at PHP script load time and produces "mysterious" or "impossible" error messages (undefined method, missing property, class mismatch, "typos" in string literals). In practice at WMF, given our scale and our release cadence, this happens roughly one to three times per quarter.

While OPcache corruption is an attractive explanation for a wide variety of problems, ordinary code bugs are far more likely. If you hear hoofbeats, think horses, not zebras.

Likely symptoms of opcache corruption

If even one of the symptoms below is not present, then you are NOT debugging a case of opcache corruption.
The error message is "impossible"
Examples: string constants that appear to contain typos although the source code is correct; "method doesn't exist" for a method that definitely exists; "file not found" for a file that is definitely included in the release. OPcache stores compiled bytecode as well as interned copies of all string literals, so any of these errors can result from corruption.
The impossible error(s) occur on a very small number of servers (approximately 1 to 6)
Opcache corruption requires a series of very-low-probability, mostly-independent events to occur.
If occurring on multiple servers, the impossible errors differ from server to server
It is astronomically improbable to have the same very-low-probability, mostly-independent events occur for the same files across multiple servers.
Restarting php-fpm fixes the issue
All deployers are able to run sudo -i /usr/local/sbin/restart-php7.2-fpm, which depools, restarts, and repools the server; it is always safe to invoke this command on a couple of servers simultaneously. If the restart doesn't fix the issue, it is astronomically unlikely to have been OPcache corruption. (If you do somehow obtain a set of PHP files which deterministically reproduces OPcache corruption, please please please let us know.)
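
Two of the symptoms above (few affected servers, and no impossible error shared between servers) can be checked mechanically once the errors have been pulled out of the logs. The following is a minimal sketch, not an existing WMF tool: the function name, the (host, message) record shape, and the hard limit of 6 hosts (taken from the "approximately 1 to 6" guideline) are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: hard upper bound derived from the "approximately 1 to 6" guideline.
MAX_AFFECTED_HOSTS = 6


def consistent_with_opcache_corruption(records):
    """Check a batch of errors against two of the symptoms listed above.

    `records` is an iterable of (host, error_message) pairs; this shape is
    hypothetical and would need adapting to the real log format.
    """
    hosts_by_message = defaultdict(set)
    affected_hosts = set()
    for host, message in records:
        hosts_by_message[message].add(host)
        affected_hosts.add(host)

    # No errors at all, or too many servers affected: not a candidate.
    if not affected_hosts or len(affected_hosts) > MAX_AFFECTED_HOSTS:
        return False

    # Every distinct error message must be confined to a single server.
    return all(len(hosts) == 1 for hosts in hosts_by_message.values())
```

A True result only means these two checks did not rule OPcache corruption out. The error still has to be "impossible", and a php-fpm restart still has to make it go away.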

Symptoms which disqualify OPcache

The same error message occurs at high rate on more than a single server
It is astronomically improbable to have the same very-low-probability, mostly-independent events occur for the same files across multiple servers.
The beginning of errors isn't associated with a rollout, backport, configuration change, or server reboot/PHP engine restart
PHP source is compiled into OPcache on first use and then retained for the lifetime of the process, so there is no known mechanism by which corruption can occur outside of these events (although there have been a few as-yet-unexplained possible occurrences).
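
The timing criterion can be sketched in the same way: compare the time of the first error against the times of recent rollouts, backports, configuration changes, and restarts. This is an illustrative sketch only. The function name is hypothetical, and the one-hour window is an arbitrary assumption; the text above gives no figure for how soon after an event the errors should begin.

```python
from datetime import datetime, timedelta


def follows_cache_fill_event(first_error, events, window=timedelta(hours=1)):
    """Return True if `first_error` falls within `window` after any event.

    `events` holds datetimes of rollouts, backports, configuration changes,
    server reboots, or PHP restarts. The default window is an assumption,
    not a documented threshold.
    """
    return any(
        timedelta(0) <= first_error - event <= window
        for event in events
    )


# Example: errors that start 20 minutes after a deploy are associated with it.
deploy = datetime(2020, 1, 1, 12, 0)
associated = follows_cache_fill_event(datetime(2020, 1, 1, 12, 20), [deploy])
```

If this returns False for every plausible window, the errors are not tied to any event that would cause scripts to be recompiled, which points away from OPcache.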