From Wikitech

Analytics is the systematic computational analysis of data or statistics, for the purposes of discovery, interpretation, and communication of meaningful patterns.

In the context of the Wikimedia Foundation, the term Analytics generally refers to work carried out on the Analytics Cluster and the Data Lake by various WMF staff and volunteers.

The Wikitech pages under the Analytics path are reference documentation for users of these systems.

The Data Engineering team is responsible for managing the Analytics Cluster and the Data Lake, so some pages on topics such as cluster operations and data governance are found under the Data Engineering path instead.

Analytics Cluster

The Analytics Cluster comprises a number of systems that help researchers, data scientists, machine learning engineers, and other authorized parties access the Data Lake.

If you believe that you need access to the cluster, please refer to Analytics/Data access.

Data Lake

The term Data Lake refers to the set of data files (also referred to as datasets) that are stored on the Hadoop Distributed File System (HDFS).

Many of these datasets are managed by the Data Engineering team through pipelines that are deployed to production and monitored.

However, members of the analytics-privatedata-users group may also create their own data files in Hadoop, which enables custom Hive tables and manipulation of data from tools such as Jupyter and Spark.
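As a rough illustration of what a custom Hive table over user-created files might look like, the following minimal sketch builds a HiveQL `CREATE EXTERNAL TABLE` statement as a string. The helper `build_create_table`, the database and table names, the columns, and the HDFS location are all hypothetical and chosen only for this example; they do not describe any real dataset or cluster convention. The sketch only generates the statement and does not connect to Hive or Spark.

```python
def build_create_table(database, table, columns, location):
    """Return a HiveQL statement for an external Parquet table.

    `columns` is a list of (name, hive_type) pairs.
    """
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` ({cols}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


# All names below are hypothetical placeholders.
ddl = build_create_table(
    "example_db",
    "page_counts",
    [("page_title", "STRING"), ("view_count", "BIGINT")],
    "/user/example_user/page_counts",
)
print(ddl)
```

The resulting statement could then be run from whichever Hive or Spark SQL client you have access to. An external table is used in this sketch so that dropping the table would leave the underlying files in place.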

See also

  • The Product Analytics style guide has code conventions for SQL, Python, and R. Consider adopting them to make your work more consistent and easier for others in the movement to read!