Data Platform/Data quality/User agent entropy
This document describes a proof of concept (POC) made by the Analytics team. It studied the feasibility of using entropy calculations on user agent values to produce data quality metrics fit for alarming. One motivation for this study was an incident in the EventLogging system where, due to a change in a library, collected events had all sub-fields of the parsed user agent field nullified. EventLogging failed silently, raising no error, and we had no system in place to check the quality of the data, so a couple of weeks passed before we noticed. After discussing data quality alarming many times, we had the idea of using entropy calculations on EventLogging user agent fields (or sub-fields) to generate data quality metrics that we could alarm on whenever they showed anomalies. Below we discuss the process we followed to calculate those metrics, the criteria we used, the results we got, and some conclusions.
Process
We implemented a Hive UDAF (user-defined aggregate function) that, given a column of numeric frequencies, returns their entropy. For more details on how it works, see the code on GitHub, which is comprehensively documented in comments.
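To illustrate what such an aggregation computes, here is a minimal Python sketch of Shannon entropy over a list of frequencies. It is not the UDAF itself: the function name `entropy`, the base-2 logarithm, and the sample counts are assumptions made for illustration, and the actual implementation may normalize or handle edge cases differently.

```python
import math
from typing import Iterable


def entropy(frequencies: Iterable[float]) -> float:
    """Shannon entropy, in bits, of a distribution given as raw counts.

    Assumption: base-2 logarithm; the real UDAF may use another base.
    """
    counts = [f for f in frequencies if f > 0]  # zero counts contribute nothing
    total = sum(counts)
    if total == 0:
        return 0.0  # no observations: define the entropy as 0
    # p * log2(1 / p), summed over all observed values
    return sum((c / total) * math.log2(total / c) for c in counts)


# Hypothetical counts of events per distinct value of one sub-field.
evenly_spread = entropy([25, 25, 25, 25])  # four equally likely values
single_value = entropy([100])              # every event has the same value
```

With these inputs, four equally frequent values give 2 bits of entropy, while a column where every event carries the same value gives 0.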
We also implemented a couple of Hive SQL queries that use the entropy UDAF to generate the following metrics:
- useragent_os_family_entropy: Entropy within the useragent subfield os_family.
- useragent_browser_family_entropy: Entropy within the useragent subfield browser_family.
- useragent_device_family_entropy: Entropy within the useragent subfield device_family.
- useragent_combined_entropy: Entropy of the concatenation of the useragent subfields os_family, browser_family and device_family.
The queries collected those entropy values from the EventLogging table event.navigationtiming on an hourly basis for a couple of months, and stored them in the table wmf.data_quality_hourly.
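As a rough illustration of what the queries compute for a single hour of data, the following Python sketch derives the four metrics from a batch of events. The function `useragent_entropies`, the dictionary shape of the events, the sample values, and the use of a tuple to stand in for the concatenation are all assumptions; the actual Hive queries may combine the sub-fields differently.

```python
import math
from collections import Counter

SUBFIELDS = ("os_family", "browser_family", "device_family")


def entropy(frequencies):
    """Shannon entropy in bits of raw counts (base 2 is an assumption)."""
    counts = [f for f in frequencies if f > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts)


def useragent_entropies(events):
    """Compute per-sub-field and combined entropy for one batch of events."""
    per_field = {name: Counter() for name in SUBFIELDS}
    combined = Counter()
    for event in events:
        values = tuple(event.get(name) for name in SUBFIELDS)
        for name, value in zip(SUBFIELDS, values):
            per_field[name][value] += 1
        combined[values] += 1  # the tuple plays the role of the concatenation
    metrics = {
        f"useragent_{name}_entropy": entropy(counter.values())
        for name, counter in per_field.items()
    }
    metrics["useragent_combined_entropy"] = entropy(combined.values())
    return metrics


# Hypothetical events for one hour.
events = [
    {"os_family": "Windows", "browser_family": "Chrome", "device_family": "Other"},
    {"os_family": "Windows", "browser_family": "Firefox", "device_family": "Other"},
    {"os_family": "Android", "browser_family": "Chrome", "device_family": "Samsung"},
    {"os_family": "iOS", "browser_family": "Safari", "device_family": "iPhone"},
]
metrics = useragent_entropies(events)
```

In this toy batch each sub-field has one value seen twice and two values seen once (1.5 bits), while all four combinations are distinct (2 bits), so the combined metric is at least as large as any individual one.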
Criteria
The event that we used to check whether the generated metrics would be fit for data quality alarming is the user agent parser upgrade that we did around September 18th, 2019 (see: https://phabricator.wikimedia.org/T212854). This upgrade significantly changed the distribution of several sub-fields of the parsed user agent field, so it should be reflected in the metrics clearly enough for an anomaly detection algorithm to spot it. If the metrics are capable of exposing that, then they should also be capable of exposing data quality issues like the one we described in the summary above.
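The incident from the summary is an extreme case of such a distribution change, which a small Python sketch can make concrete: when every event carries a null sub-field, the distribution collapses to a single value and its entropy drops to zero. The counts below are invented for illustration, and the base-2 `entropy` helper is an assumption rather than the UDAF itself.

```python
import math
from collections import Counter


def entropy(frequencies):
    """Shannon entropy in bits of raw counts (base 2 is an assumption)."""
    counts = [f for f in frequencies if f > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts)


# Hypothetical os_family counts for a healthy hour.
healthy_hour = Counter({"Windows": 50, "Android": 30, "iOS": 20})

# Hypothetical hour during the incident: every sub-field was nullified.
broken_hour = Counter({None: 100})

healthy_entropy = entropy(healthy_hour.values())
broken_entropy = entropy(broken_hour.values())
```

The healthy hour yields an entropy well above 1 bit, while the nullified hour yields exactly 0, a drop far larger than the shifts caused by the parser upgrade.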
Results
The following graphs show the entropy value of each one of the described metrics.
Useragent os family entropy
We can see that, from the upgrade on, the amplitude of the seasonality is greater. The lows are in roughly the same ballpark, but the highs are significantly higher. Note, though, that a couple of weeks prior to the upgrade there was already a progressive growth towards that norm; we are not sure what caused it. The change in amplitude is caused by the change in how the UA parser labels operating system families. One might expect a UA parser upgrade to increase the number of OSes recognized and thus increase entropy. In reality, sudden changes more often than not come from bug fixes, which can move entropy in either direction. For example, user agents labeled "windows 95" prior to the upgrade are now labeled "windows", which decreases the entropy of the OS field; at the same time, "Google Search App" is no longer falsely counted as Safari, which increases entropy.
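The two opposite effects described above can be checked with a small Python sketch: merging two labels into one never increases entropy, and splitting one label into two never decreases it. All counts are invented for illustration, the labels are only loosely modeled on the examples in the text, and the base-2 `entropy` helper is an assumption.

```python
import math


def entropy(frequencies):
    """Shannon entropy in bits of raw counts (base 2 is an assumption)."""
    counts = [f for f in frequencies if f > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts)


# Merging labels: "windows 95" is folded into "windows".
before_merge = {"windows": 60, "windows 95": 10, "mac os x": 30}
after_merge = {"windows": 70, "mac os x": 30}

# Splitting labels: "Google Search App" is no longer counted as Safari.
before_split = {"Safari": 40, "Chrome": 60}
after_split = {"Safari": 30, "Google Search App": 10, "Chrome": 60}

merge_delta = entropy(after_merge.values()) - entropy(before_merge.values())
split_delta = entropy(after_split.values()) - entropy(before_split.values())
```

Here `merge_delta` is negative and `split_delta` is positive, so the net effect of a parser upgrade on a sub-field's entropy depends on which kind of relabeling dominates.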
Useragent browser family entropy
In this one we also see a change from the UA parser upgrade on. In this case the amplitude of the seasonality is reduced. The highs are maintained more or less in the same region, but the lows are higher. This could be caused by a more specific labeling of browser families by the new UA parser.
Useragent device family entropy
In this graph we see a much clearer change of ballpark, with the entropy value of the related subfield almost doubling. Again the timeline matches the upgrade of the UA parser. This could well be caused by a more specific and comprehensive labeling of the device family by the new UA parser.
Useragent combined entropy
As this is the entropy of the combination of the 3 previous sub-fields, we can expect it to show a mixture of their behaviors. The changes in the first and second graphs are much more subtle, growing from approx. 1.8 to 2, or from approx. 2.6 to 2.7. But the signal of the third graph is much stronger, practically doubling the value, which makes the combined metric much more useful for detecting sudden jumps that should not happen in this type of data.
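To give an idea of how a sudden jump in an hourly entropy series could be flagged, here is a minimal Python sketch. The POC does not prescribe an anomaly detection algorithm; the trailing-window median and MAD rule, the function `detect_jumps`, its parameters, and the sample series are all assumptions chosen for simplicity.

```python
import statistics


def detect_jumps(series, window=24, threshold=5.0, min_scale=0.01):
    """Return indices whose value is far from the trailing-window median.

    A point is flagged when its distance to the median of the previous
    `window` points exceeds `threshold` times their median absolute
    deviation (MAD), floored at `min_scale` to avoid a zero scale.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        median = statistics.median(history)
        mad = statistics.median([abs(x - median) for x in history])
        scale = max(mad, min_scale)
        if abs(series[i] - median) > threshold * scale:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly entropy values: a stable day, then a sudden doubling.
stable_day = [1.00, 1.02, 0.98, 1.01] * 6
with_jump = stable_day + [2.0]
without_jump = stable_day + [1.00, 1.02, 0.98, 1.01]

flagged = detect_jumps(with_jump)
quiet = detect_jumps(without_jump)
```

On this toy data the doubled value at index 24 is flagged and the series without a jump produces no alarms. Because the real metrics show seasonality, a more careful detector would probably compare each hour with the same hour on previous days rather than with the immediately preceding hours.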
Conclusions
We concluded that:
- The use of entropy calculations can definitely generate metrics that expose data quality issues.
- The more specific metrics will expose more specific issues, while the combined metrics can expose a wider range of issues.
- It remains to be seen whether a too-wide combination, like combining all user agent sub-fields, would end up masking issues specific to one sub-field.
- It remains to be seen if entropy calculations can be useful for other types of underlying data (not user agent).