Determining the number of distinct elements in a set (known as cardinality estimation) is a challenge when the set is very large. Since computing the exact cardinality of a set is linear in the cardinality itself, algorithms should be used to approximate cardinality in many of our use cases. For example, counting the number of unique editors in a month across all projects can be done using the traditional methods for counting (SELECT COUNT DISTINCT when using SQL queries). However, counting the number of unique search queries on a monthly basis or the number of unique IP addresses that access Wikimedia projects (from 120,000 request logs we receive every second) in a month requires new ways for counting uniques.
A user id of length n requires 8n bits to be represented. With 120,000 requests per second and the estimated large number of unique user IDs accessing Wikimedia services, the space required to store the ids is huge (more than 50GB). Finding the number of uniques in this huge set requires huge RAM space (100+GB) for counting unique daily. Counting weekly or monthly uniques exactly is in practice impossible.
We can use Bernoulli Sampling to reduce the number of records to count uniques over. How many records we sample depends on the error we can tolerate. The size of the sample should be determined by running experiments for the specific Wikimedia data-sets.
It is worth to note that sampling a very large data-set can be a challenge on its own. One way to overcome the challenge is to use a uniform hashing function, for example, Md5, on the user ids. Then, we can filter users based on how their hashing ends. For example, all the users whose hashing ends with 00 can be used in the sample.
Linear Probabilistic Counter
The Linear Probabilistic Counter allows us to specify the desired level of accuracy. The algorithm works by assigning a bitmap in memory initialized to all zeros. It applyes a hash function on each entry, mapping each entry to a bit in the bitmap. Then, the algorithm computes the ratio of empty bits and estimates the cardinality as
cardinality = -(size of the bitmap) x Ln(ratio of empty bits/size of the bitmap)
HyperLogLog algorithm was developed as an improvement to Linear and Log algorithms. The algorithm can estimate cardinality N using LogLog(N)+O(1). The details of the algorithm are explained in the paper. What the algorithm does is equivalent to recording the longest run of heads in a series of coin flips and determining the number of head coin flips from that sequence.
A hive implementation of the algorithm is developed.
Better than HyperLogLog
An improvement to HyperLogLog is presented by Stefan Heule et al. in HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. The improvements increase the accuracy of the estimate while decreasing memory usage.
Please note that (since this changeset was merged: https://gerrit.wikimedia.org/r/#/c/182478/ ) we have "native" partitioning on Hive and thus we can use Hive's TABLESAMPLE syntax to query the tables that already have bucketed data to get a random sample.
This not only speeds queries cause the "processed" tables have a less taxing format on resources, but also allows us to do true sampling so we are not all combing through the whole raw dataset when a subset should suffice. Note that in order to obtain a significant performance gain, bucketing needs to happen on the columns by which the webrequest table is clustered (hash-partitioned): (hostname, sequence). (Bucketing by rand() instead is not efficient, because TABLESAMPLE will then still need to scan the entire table.)
A general example of a "sampled query":
SELECT a.useragent, COUNT(*) FROM (SELECT user_agent AS useragent FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON hostname, sequence) WHERE year=2017 AND month = 1 AND day = 1 AND access_method = "mobile web") AS a GROUP BY a.useragent;