Data Platform/Systems/Hive/Queries
Writing queries
Handy column names
set hive.resultset.use.unique.column.names=false;
Use partitions
There is a lot of data in Hive, so you want to make sure your queries have time bounds so they complete in a reasonable amount of time. For tables with event data you will be frequently using something like this in your queries:
WHERE year = 2025 AND month = 1 AND day = 5
You might also need to append AND hour = NN
as well if the dataset is sufficiently large.
Other datasets might be partitioned in other ways, for example a "snapshot" (e.g. Analytics/Data Lake/Edits/MediaWiki history), or wiki (e.g. the monthly sqoop'ed tables in the wmf_raw
database use both snapshot and wiki).
In fact, you must restrict your query to some Hive "partition", otherwise Hive will fail with the error "No partition predicate found". The partitioning of most tables is by year/month/day/hour.
For example, this query fails with an error:
SELECT agent_type FROM wmf.webrequest LIMIT 5; Error: Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" (state=42000,code=10041)
Once you restrict the query to specific partitions as shown below, the error goes away:
SELECT
agent_type
FROM wmf.webrequest
WHERE webrequest_source = 'text'
AND year = 2023 AND month = 5 AND day = 4 AND hour = 0
LIMIT 5
INSERT OVERWRITE ... IF NOT EXISTS
When using INSERT OVERWRITE
with IF NOT EXISTS
be sure to use unqualified tables. Otherwise, the IF NOT EXISTS
gets silently ignored.
Earliest data available in hadoop
What’s the most efficient way to obtain the earliest, non-truncated hour or date of data still available in hadoop?
show partitions webrequest;
Will show you all of the available partitions. Webrequest data in Hadoop is currently pruned after 62 days.
Avoiding overgreedy scans / Operator precedence
In HiveQL, AND
takes precedence over OR
. This means that if one's trying to select all "mobile" and "text" partitions for 2015-01-01, this query will not work:
WHERE year = 2015
AND month = 1
AND day = 1
AND webrequest_source = 'mobile'
OR webrequest_source = 'text'
This does not work because the precedence rules results in Hive parsing it at follows:
WHERE (year = 2015
AND month = 1
AND day = 1
AND webrequest_source = 'mobile')
OR webrequest_source = 'text'
This means hive would select all "text" partitions. This unintentionally skews the data you want to obtain, makes the query take longer, and keeps all "text" partitions locked.
You can use parenthesis to override operator precedence. To select all "mobile" and "text" partitions for 2015-01-01 can be done using this statement:
WHERE year = 2015
AND month = 1
AND day = 1
AND (webrequest_source = 'mobile'
OR webrequest_source = 'text')
User-defined functions
You can create or reuse UDFs (user-defined functions), see Analytics/Cluster/Hive/QueryUsingUDF.
WMF has developed several UDFs, such as is_wikimedia_bot
, geocoded_data
, etc.
Some are run to create its derived tables from raw Hadoop information.
Slow queries
A simple SELECT LIMIT like the following will prepare all data from a partition before applying the LIMIT clause, so it's always best to specify the lowest-level relevant partition for the query.
SELECT http_status,uri_path FROM webrequest WHERE year = 2014 LIMIT 10;
Counting rows which match a criterion
A common use case in querying data is counting the number of rows which match a criterion. With Hive, a good way to do this is by casting an expression which produces a boolean value to an integer and taking the sum. For example:
select sum(cast(distinct_editors >= 500 as int))
from geowiki_monthly
where month = "2018-03"
This counts the number of rows where the month
is March 2018 and the value of distinct_editors
is greater than 500.
You can't simply use count()
on a boolean expression, as in count(distinct_editors >= 500)
, because that counts the rows where the expression is not null—which includes rows where it is false!
Create your own database
Hive uses databases to organize tables. You can create databases for your own use, and by convention we use our shell username as database name. Here is an example of command to create a database:
CREATE DATABASE my_user_name;
Running queries
Command line
hive
or beeline
, you first need to authenticate with Kerberos. See also Authenticate via Kerberos.There are two command line programs available for accessing Hive, hive
and beeline
. hive
is officially deprecated in favor of beeline
, but as of October 2018, the Analytics team does not recommend migrating to it. The hive
client still has significantly better error reporting and a few other advantages.
Generally, the two clients can be used identically; see Analytics/Cluster/Beeline for Beeline-specific documentation.
Once you can ssh
to stat1004 you can simply access the hive command-line interface by entering hive
or beeline
. Here's the start of a sample interactive session:
nuria@stat1004:~$ beeline
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show databases;
wmf
...
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> use wmf;
No rows affected (0.019 seconds)
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show tables;
tab_name
aqs_hourly
browser_general
last_access_uniques_daily
last_access_uniques_monthly
mediacounts
mobile_apps_session_metrics
mobile_apps_session_metrics_by_os
mobile_apps_uniques_daily
mobile_apps_uniques_monthly
pagecounts_all_sites
pageview_hourly
pageview_unexpected_values
pageview_whitelist
projectview_hourly
webrequest
webrequest_empty
16 rows selected (0.028 seconds)
You can also write your query in a text file, e.g. named my_query.hql, and execute it. This example redirects the output of the query (by default in tab-separated values format) to results.txt:
$ beeline -f my_query.hql > results.txt
Use wmf
You should use wmf
database (instead of the wmf_raw
database) if you can, or your queries will be slow. The wmf
database includes filtered and preprocessed data.
Output to a file
beeline -f my_query.hql > out.txt
Run queries in Cron
# The USER variable needs to be set with the username
# that will run the queries. Example: USER=elukey
USER=<user> /usr/local/bin/beeline -f my_query.hql > out.txt
Run long queries in a screen session
- Screen: If your queries take a long time to run you can execute them in a
screen
session. If you are accidentally disconnected, or you intentionally disconnect, from a screen session, your remote shell and queries continue to run. - Email notification: For long-running queries, one may want to get an email alert as soon as the result is available. See phab:P7948 for an example of how to do this using a bash script.
Vertical query results
!set outputformat vertical
Loading data
TSV file
If you have a data file you'd like to load in to Hive (perhaps to join with an existing Hive table), start by copying it onto one of the stats or notebook machines. Then, create a table in Hive with a "delimited" row format:
CREATE TABLE tablename (tablespec)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS TEXTFILE
You can easily change the terminator string from "\t" to "," if you have a CSV file.
Finally, use the hive
command line client on that machine to run the following query:
LOAD DATA LOCAL INPATH '{{local path to file}}'
OVERWRITE INTO TABLE {{name}}
Note that you cannot use beeline
since it will look on the Hive server instead for your data file, even when you use the LOCAL
keywoard.
Editing Schemas
Hive allows ALTER TABLE
statements, but there are gotchas compared to a traditional RDBMS where you would issue such commands:
- There's something really weird going on with the wmf_raw database. When doing
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
sometimes it just doesn't work, and Hive gives an error when trying to, for example,desc wmf_raw.webrequest
. This is not consistent but it did happen to both ottomata and milimetric. - Adding a column to wmf_raw.webrequest broke selects from that table for existing partitions. Both
select * ...
andselect hostname ...
broke, so it's not a matter of Hive metadata. The fix is to drop a partition and recreate it, querying will then work for that partition only.
Troubleshooting
Query fails with generic "Execution Error"
You may find that queries fail with an error message like "FAILED: Execution Error" and an error code like "return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask".
In this case, the first thing to do is make sure you have the detailed error message. For some reason, HiveServer2 or Thrift clients like the Hue interface, the beeline
command line program, or the impyla Python library don't give specific error messages. Try rerunning your query using the hive
command line program, which should give you more detail on the error you've encountered.
Hadoop containers run out of memory
If your query fails with a diagnostic message like the following:
Container [pid=40261,containerID=container_e85_1538672679435_0171_01_002286] is running beyond physical memory limits. Current usage: 2.3 GB of 2 GB physical memory used; 4.1 GB of 4.2 GB virtual memory used. Killing container.
Try increasing the map memory setting with the command SET mapreduce.map.memory.mb=4096
.
JsonSerDe Errors
If you get errors like:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
then enter the following:
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar ;
Alternatively, you could try
ADD JAR /usr/lib/hive/lib/hive-serde.jar;
Killing a running query
Once you submit a query, it is handed off to Hadoop. Hadoop runs the query as a YARN application. The Hive CLI is then detached from the actual application. If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application. To kill the application, you have to tell YARN you want it dead.
Note the application ID from when your query started. You should see something like:
Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/
The application ID in this case is application_1387838787660_12241. To kill this application, run:
yarn application -kill application_1387838787660_12241
Scratch space
You can create your own database using regular SQL
CREATE DATABASE dartar;
Batch mode
You can save the output of a hive query into a TSV by using hive in batch mode, e.g.
hive -e "USE wmf; SELECT foo;" > ./bar.tsv
Dates
Dates are expressed as integers due to how we have configured this data. You may encounter this error if you treat dates as strings by quoting:
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.reflect.InvocationTargetException
> [...]
> Caused by: java.lang.reflect.InvocationTargetException
> [...]
> Caused by: java.io.EOFException
> [...]
Out of Memory Errors on Client
When performing large queries, the hive client may run out of memory. Just look for an out of memory error at the top of the stack trace.
Invoke Hive via the following to allocate more memory
export HADOOP_HEAPSIZE=2048 && beeline
Out of Memory Errors when exporting large datasets
If your query generates a large dataset and you are writing it out to a file (e.g. beeline -f query.hql > output.tsv
), your query might fail due to beeline buffering all the data before writing it out, thereby running out of memory. Passing the parameter --incremental=true
to make beeline print to stdout incrementally as data comes in. The previous example turns into the following command:
beeline --incremental=true -f query.hql > output.tsv
Another solution is of course to switch to a data processing approach that does not require the data to be exported.
My query does not start
Check if the cluster is overloaded. If it is, let the Analytics team know.
My query does not progress / finish
Querying through Hive is not as fast as querying a plain SQL table. So expect to wait an hour for your result to show up.
But if the progress counter in your query does not increase (like at least a 1% per minute), you are either unintentionally querying a lot of data, or the cluster is stalled.
If you are querying “a lot” of data (a week worth of webrequest logs is definitely already more “a lot”!), consider splitting up your query. That will increase performance for you, cut down runtime of your query, and it will leave cluster resources available to other users. If you run into blockers, let the Analytics team know.
If your query is small, nice and neat, and does not progress nonetheless, please check if the cluster is overloaded. If it is, let the Analytics team know.
Search through logs
If your job is finished, you can find all of the job logs in HDFS at:
/mnt/hdfs/var/log/hadoop-yarn/apps/<user>/
Where USER is your username (echo $USER at a shell prompt) and application_id is the application_XXXXXX_XXXX
number that yarn gives you when your app starts.
You can also access these via the yarn CLI, like:
yarn logs -applicationId <applicationId>
More debugging information in CLI
You can get more debugging information in your CLI by launching Hive with a special configuration option:
hive -hiveconf hive.root.logger=INFO,console
Use RLIKE instead of LIKE when working with multi-line text
As documented in this hive issue, the LIKE
operator in hive does't match patterns over multi-lines strings.
Example: my_string LIKE ''abc%def'
will return false if my_string
is abc\ndef
.
In order to have the above working, you need to use RLIKE
with the dot-matching-all flag set (?s)
, as by default . doesn't match new-lines (this is the reason of the bug in Hive).
Warning: don't forget to use beginning ^
and end-of string $
regex markers when using RLIKE, as it matches subportions of string while like matches entire strings.
Example: my_string RLIKE '(?s)^abc.*def$'
matches abc\ndef
EventLogging
To query EventLogging data from Hadoop using Hive, see: