Data Platform/Systems/Hive

Apache Hive is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in HDFS. WMF's first use case was to enable querying of unsampled webrequest logs.

As of February 2021, we are running Hive 2.3.6.

Use

The Hive command-line interface is available on all the analytics clients. Here's how to use it:

$ kinit

$ hive

(hive takes a few seconds to start. You should see a prompt like hive (default)> when it’s ready.)

Hive can also be easily accessed through Python using wmfdata-python

#!/usr/bin/env python3
import wmfdata as wmf

# Returns a Pandas dataframe
wmf.hive.run("SHOW TABLES FROM event")

Access

In order to access Hive, you need an account with production shell access in the analytics-privatedata-users user group. For more instructions, see Data_Platform/Data access.

Some of the data in Hive, like the webrequest logs, are private data so only analytics-privatedata-users can access it. If you are requesting access to Hive, you probably want to be in this group.

Querying

Data_Platform/Systems/Hive/Queries (includes a FAQ about common tasks and problems)
Data_Platform/Systems/Hive/QueryUsingUDF

While hive supports SQL, there are some differences: see the Hive Language Manual for more info.

Maintained tables

(see also Data_Platform/Data_Lake)

Webrequest (raw and refined)
pageview_hourly
projectview_hourly
pagecounts_all_sites
mediacounts
mobile_apps_session_metrics
EventLogging data, in the event database (details)

Notes

The wmf_raw and wmf databases contain Hive tables maintained by Analytics. You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
Hive has the ability to map tables on top of almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON SerDe to be able to serialize/deserialize to/from JSON. We use the JsonSerDe included with Hive-HCatalog.
The HCatalog .jar will be automatically added to a Hive client's auxpath. You shouldn't need to think about it.

Troubleshooting

See the FAQ

Subpages of Data Platform/Systems/Hive

References