Analytics/Systems/EventLogging/NotErrorLogging

From Wikitech

EventLogging is not well suited to error logging

There have been many discussions, past and present (see Task T203814), about using EventLogging for client-side error logging. While EventLogging is well suited to handling events, and ingesting client-side traces shares similarities with ingesting events, EventLogging is not well suited to be a client-side error logging library, for several reasons:

1) EventLogging is designed around handling data that validates against a schema. While error messages might be JSON, there is no real value in validating them against a schema. The error happened, and the backend should ingest it regardless of whether it validates; it is not "curated data".

2) Any system we use to handle error logging should be tier-1; EventLogging is tier-2.

3) Any system that handles errors needs to be able to group by stack trace and be good at handling free text. EventLogging is not, for the reasons in 1): it is made to deal with data that abides by a schema. It does not group events by free text (such as stack traces), so an error that occurs on a million pageviews in one hour will appear in the database a million times rather than once with a count of one million. This is because EventLogging is made for distinct events, and errors are not distinct events: all users of Chrome 68 might be running into the same error for the same reason. This is a big deal, and it is why a solution tailored to the error space is needed. See 4).

4) There is no need to reinvent the wheel. Sentry (https://sentry.io/welcome/) is well-established software that does this very thing: client-side error logging. See attempts to install Sentry at the Foundation: https://phabricator.wikimedia.org/tag/sentry/

While grouping stack traces on the server side is a core use of Sentry, in order to deal with bursty traffic an error logging solution probably also needs to do some client-side normalization and deduplication of stack traces, so that errors are already partly processed by the time they reach the server. Sentry comes with a client-side library that does part of this pre-processing.

5) Privacy concerns. This is a smaller concern than the ones listed above, but since it has come up it is listed for completeness. A logging system normally has short retention. EL data is retained for 90 days, and whitelisted data is retained for longer. Erroneously whitelisting error messages for longer retention can lead to privacy concerns such as these: https://phabricator.wikimedia.org/T136851
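
To make the grouping argument in 3) and 4) concrete, here is a minimal Python sketch of the difference between storing one row per error and storing one entry per normalized stack trace with a count. Everything in it is an assumption made for illustration: the function names fingerprint and group_errors are hypothetical, and the normalization rules (dropping URL query strings and line/column numbers) are not Sentry's actual grouping algorithm and are not part of EventLogging.

```python
import hashlib
import re
from collections import Counter


def fingerprint(stack_trace: str) -> str:
    """Return a stable key for a stack trace (hypothetical normalization rules)."""
    normalized_lines = []
    for line in stack_trace.strip().splitlines():
        line = line.strip()
        # Assumption: query strings (e.g. cache-busting versions) do not
        # distinguish one error from another, so drop them.
        line = re.sub(r"\?[^\s:)]*", "", line)
        # Assumption: line and column numbers vary between builds, so drop them.
        line = re.sub(r":\d+(:\d+)?", "", line)
        normalized_lines.append(line)
    joined = "\n".join(normalized_lines)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def group_errors(stack_traces):
    """Count occurrences per fingerprint instead of keeping one row per error."""
    counts = Counter()
    for trace in stack_traces:
        counts[fingerprint(trace)] += 1
    return counts
```

With this kind of grouping, a million reports of the same error collapse into a single key whose count is one million, which is the behaviour described in 3) as missing from a store built for distinct, schema-validated events.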