As of February 2021, we are running Hive 2.3.6.
The Hive command-line interface is available on all the analytics clients. Here's how to use it:
hive takes a few seconds to start. You should see a prompt like
hive (default)> when it’s ready.)
Hive can also be easily accessed through Python using wmfdata-python
import wmfdata as wmf
# Returns a Pandas dataframe
wmf.hive.run("SHOW TABLES FROM event")
Some of the data in Hive, like the webrequest logs, are private data so only
analytics-privatedata-users can access it. If you are requesting access to Hive, you probably want to be in this group.
- Analytics/Cluster/Hive/Queries (includes a FAQ about common tasks and problems)
While hive supports SQL, there are some differences: see the Hive Language Manual for more info.
(see also Analytics/Data)
- Webrequest (raw and refined)
- EventLogging data, in the
- The wmf_raw and wmf databases contain Hive tables maintained by Analytics. You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
- Hive has the ability to map tables on top of almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON SerDe to be able to serialize/deserialize to/from JSON. We use the JsonSerDe included with Hive-HCatalog.
- The HCatalog .jar will be automatically added to a Hive client's auxpath. You shouldn't need to think about it.
See the FAQ