Network Error Logging
Intro
There are many classes of reliability issues (e.g. failures or misconfigurations in intermediate networks) that we only find out about through direct, manual reports from users or, in very widespread cases, because traffic is noticeably 'missing' and below expected rates.
Many modern browsers[1] support a feature called Network Error Logging, or NEL. On successful requests, we ask browsers to remember "if you later encounter an error talking to us, let an error reporting endpoint know".
Asking browsers to enable NEL is implemented by serving the HTTP response headers Report-To and NEL, which together define a set of endpoints that can receive reports, sampling fractions for each of failures and successes, and a TTL for this entire definition to be stored in the user's browser. See also Sample Policy Definitions.
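As a rough illustration of how the two headers fit together, the sketch below builds a matching Report-To / NEL pair. The JSON field names follow the Reporting API and NEL specifications; the group name, endpoint path, and TTL are hypothetical placeholders, not our production policy.

```python
import json

# Hypothetical values for illustration only; the production policy may differ.
GROUP = "example-nel"  # assumed group name
ENDPOINT = "https://intake-logging.wikimedia.org/example/path"  # path is made up
MAX_AGE_SECONDS = 86400  # assumed TTL of one day


def build_nel_headers(failure_fraction=0.05, success_fraction=0.0):
    """Return the Report-To and NEL response headers as a dict of strings."""
    # Report-To names a group of endpoints that may receive reports.
    report_to = {
        "group": GROUP,
        "max_age": MAX_AGE_SECONDS,
        "endpoints": [{"url": ENDPOINT}],
    }
    # NEL points at that group and sets the sampling fractions.
    nel = {
        "report_to": GROUP,
        "max_age": MAX_AGE_SECONDS,
        "failure_fraction": failure_fraction,
        "success_fraction": success_fraction,
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


if __name__ == "__main__":
    for name, value in build_nel_headers().items():
        print(f"{name}: {value}")
```

The NEL header's report_to value must match the group declared in Report-To; otherwise the browser has no endpoint to deliver reports to.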
Our policy sets a 5% failure_fraction, a tradeoff between fidelity and the load placed on Logstash.
We use a GeoDNS trick to receive reports more promptly: gdnsd is configured to return the "next-best" edge site for a given user for the intake-logging.wikimedia.org domain, where reports are received. This is because if a user needs to send a NEL report, it's very likely that, from their perspective, something is wrong with their usual 'primary'/closest site.
That being said, browsers will buffer reports and retry submission later if they are unable to send them immediately. No absolute time is included, due to both clock skew and privacy fingerprinting concerns, but the age field of the report indicates the number of milliseconds elapsed between generation and successful submission. EventGate timestamps each report in the _dt field.
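Since the report carries only a relative age, the time of the original error has to be estimated from the receipt timestamp. A minimal sketch, assuming _dt is an ISO 8601 UTC string and age is an integer number of milliseconds (the sample report below is made up):

```python
from datetime import datetime, timedelta


def estimate_error_time(received_at_iso, age_ms):
    """Approximate when the error happened: receipt time minus the report's age."""
    # Older Python versions do not parse a trailing "Z", so normalise it first.
    received_at = datetime.fromisoformat(received_at_iso.replace("Z", "+00:00"))
    return received_at - timedelta(milliseconds=age_ms)


if __name__ == "__main__":
    # Hypothetical report fragment; real reports carry more fields.
    report = {"_dt": "2024-01-01T12:00:00Z", "age": 90000}
    print(estimate_error_time(report["_dt"], report["age"]).isoformat())
    # prints 2024-01-01T11:58:30+00:00
```

The result is only an estimate: it ignores any delay between the browser submitting the report and EventGate stamping it.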
Dashboards
When diagnosing network connectivity issues, focus especially on sudden increases in tcp.timed_out reports, as well as tcp.address_unreachable.
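The same check can be approximated outside the dashboards. The sketch below compares report-type counts between a baseline window and the current window; it is not how the dashboards are implemented, and the ratio and minimum-count thresholds are arbitrary assumptions chosen for illustration.

```python
from collections import Counter

WATCHED_TYPES = ("tcp.timed_out", "tcp.address_unreachable")


def find_spikes(baseline_types, current_types, ratio=3.0, min_count=10):
    """Return watched report types whose current count is at least `min_count`
    and at least `ratio` times the baseline count."""
    baseline = Counter(baseline_types)
    current = Counter(current_types)
    spikes = []
    for report_type in WATCHED_TYPES:
        now = current[report_type]
        # Floor the baseline at 1 so an empty baseline does not make
        # the ratio check trivially true.
        before = max(baseline[report_type], 1)
        if now >= min_count and now >= ratio * before:
            spikes.append(report_type)
    return spikes


if __name__ == "__main__":
    # Made-up sample data: each entry is the type of one received report.
    baseline = ["tcp.timed_out"] * 5 + ["tcp.address_unreachable"] * 5
    current = ["tcp.timed_out"] * 40 + ["tcp.address_unreachable"] * 6
    print(find_spikes(baseline, current))  # prints ['tcp.timed_out']
```

Because reports are sampled, small absolute counts are noisy; the min_count guard is there to avoid flagging a jump from, say, one report to three.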
Serving the necessary response headers
For our implementation, see sub https_deliver_networkerrorlogging in wikimedia-frontend.vcl.erb.
Our report receiver implementation
TODO EventGate, Kafka, logstash