Analytics/Cluster/Systems/Hive/Compression


Uncompressed vs. Snappy compressed Sequence Files

I ran a rough comparison of data size and Hive query time for webrequest data stored in HDFS uncompressed vs. as Snappy-compressed Sequence Files.

Uncompressed

Size

In the first nine hours (00 through 08) of January 7th, 2014, uncompressed JSON-formatted mobile webrequest logs imported into HDFS via Kafka totaled 91.1 GB. Each hourly import was between 8 and 12 GB.

hdfs dfs -du -s -h /wmf/data/external/webrequest_mobile/hourly/2014/01/07/{00..08}
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/00
10.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/01
11.3 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/02
11.7 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/03
10.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/04
9.9 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/05
9.1 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/06
8.6 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/07
8.4 G  /wmf/data/external/webrequest_mobile/hourly/2014/01/07/08

91.1 G
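As a quick sanity check on the total, the hourly figures from the listing can be summed directly. This is only a sketch that re-enters the rounded values printed by hdfs dfs -du by hand; the variable names are illustrative and not part of any tool.

```python
# Hourly sizes in GB as printed by `hdfs dfs -du -s -h` above (hours 00..08).
hourly_gb = [10.6, 10.9, 11.3, 11.7, 10.6, 9.9, 9.1, 8.6, 8.4]

# Round to one decimal place to match the precision of the listing.
total_gb = round(sum(hourly_gb), 1)
smallest, largest = min(hourly_gb), max(hourly_gb)

print(f"{len(hourly_gb)} hourly directories, {total_gb} GB total")
print(f"hourly range: {smallest} to {largest} GB")
```

Because the inputs are already rounded, the sum only approximates the true byte count; the HDFS Read counters in the Hive job output are the more precise source.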


Query Time

Running a select count(*) query on a single hour took 44.276 seconds and launched 42 mappers. Running the same query over hours 00 through 08 (nine hourly partitions, since between is inclusive) took 158.627 seconds and launched 343 mappers.

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour=00;
...
MapReduce Total cumulative CPU time: 16 minutes 11 seconds 670 msec
Ended Job = job_1387838787660_0365
MapReduce Jobs Launched:
Job 0: Map: 42  Reduce: 1   Cumulative CPU: 971.67 sec   HDFS Read: 11422909543 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 16 minutes 11 seconds 670 msec
OK
_c0
16641115
Time taken: 44.276 seconds

select count(*) from webrequest_mobile where year=2014 and month=01 and day=07 and hour between 00 and 08;
...
MapReduce Total cumulative CPU time: 0 days 4 hours 16 minutes 12 seconds 420 msec
Ended Job = job_1387838787660_0363
MapReduce Jobs Launched:
Job 0: Map: 343  Reduce: 1   Cumulative CPU: 15372.42 sec   HDFS Read: 98055786272 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 16 minutes 12 seconds 420 msec
OK
_c0
143199253
Time taken: 158.627 seconds
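The counters in the two job summaries above can be turned into a few derived figures, such as input per mapper and how wall-clock time scaled with data volume. The sketch below copies the numbers from the output by hand; JobStats and the helper names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass

MIB = 1024 ** 2  # bytes per mebibyte


@dataclass
class JobStats:
    """Counters copied by hand from the Hive job summaries above."""
    mappers: int
    hdfs_read_bytes: int
    wall_seconds: float
    rows: int


one_hour = JobStats(mappers=42, hdfs_read_bytes=11422909543,
                    wall_seconds=44.276, rows=16641115)
all_hours = JobStats(mappers=343, hdfs_read_bytes=98055786272,
                     wall_seconds=158.627, rows=143199253)


def mib_per_mapper(job: JobStats) -> float:
    """Average input read per map task, in MiB."""
    return job.hdfs_read_bytes / job.mappers / MIB


def mib_per_second(job: JobStats) -> float:
    """Aggregate read throughput over wall-clock time, in MiB/s."""
    return job.hdfs_read_bytes / job.wall_seconds / MIB


# How much more data the larger query read, and how much longer it took.
data_ratio = all_hours.hdfs_read_bytes / one_hour.hdfs_read_bytes
time_ratio = all_hours.wall_seconds / one_hour.wall_seconds

for name, job in (("one hour", one_hour), ("hours 00-08", all_hours)):
    print(f"{name}: {mib_per_mapper(job):.1f} MiB/mapper, "
          f"{mib_per_second(job):.1f} MiB/s")
print(f"data grew {data_ratio:.2f}x, wall time grew {time_ratio:.2f}x")
```

The larger query reads roughly 8.6 times the data in roughly 3.6 times the wall-clock time, which suggests the single-hour run is dominated by fixed job overhead or does not fill the cluster; that interpretation is an inference from these two data points, not something measured separately.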

Snappy compressed Sequence Files

I recently got SequenceFileRecordWriterProvider.java merged upstream into LinkedIn's Camus. Using it in place of StringRecordWriterProvider.java writes out the same data as Snappy-compressed Hadoop Sequence Files.

Size

The same JSON data, imported for the same period (hours 00 through 08) and Snappy compressed, totaled 21.9 GB, 24% of the uncompressed size.

hdfs dfs -du -s -h /user/otto/data/compressed/webrequest_mobile/hourly/2014/01/07/{00..08}

2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/00
2.6 G  data/compressed/webrequest_mobile/hourly/2014/01/07/01
2.7 G  data/compressed/webrequest_mobile/hourly/2014/01/07/02
2.8 G  data/compressed/webrequest_mobile/hourly/2014/01/07/03
2.5 G  data/compressed/webrequest_mobile/hourly/2014/01/07/04
2.4 G  data/compressed/webrequest_mobile/hourly/2014/01/07/05
2.2 G  data/compressed/webrequest_mobile/hourly/2014/01/07/06
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/07
2.1 G  data/compressed/webrequest_mobile/hourly/2014/01/07/08

21.9 G
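The 24% figure can be reproduced from the two directory listings. The sketch below re-enters the rounded hourly sizes by hand and also computes the ratio per hour, to check that the compression ratio is stable across the period; all names are illustrative.

```python
# Hourly sizes in GB, copied from the two `hdfs dfs -du -s -h` listings (hours 00..08).
uncompressed_gb = [10.6, 10.9, 11.3, 11.7, 10.6, 9.9, 9.1, 8.6, 8.4]
compressed_gb = [2.5, 2.6, 2.7, 2.8, 2.5, 2.4, 2.2, 2.1, 2.1]

# Overall ratio of compressed to uncompressed size.
total_ratio = sum(compressed_gb) / sum(uncompressed_gb)

# Ratio for each hour, pairing the listings position by position.
per_hour_ratio = [c / u for c, u in zip(compressed_gb, uncompressed_gb)]

print(f"overall: {total_ratio:.1%} of original size")
print("per hour:", ", ".join(f"{r:.1%}" for r in per_hour_ratio))
```

With these rounded inputs every hour falls between roughly 23% and 25%, so the overall figure is not driven by a single outlier hour.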

Query Time

The same select count(*) query on a single hour of compressed data took 86.232 seconds, about twice as long as on uncompressed data. Running the query over hours 00 through 08 of compressed data took only about 8% longer than the 158.627 seconds measured on the uncompressed data. The number of mappers launched was the same as in the uncompressed case.
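The "about twice as long" comparison for the single-hour query follows from the two wall-clock times quoted on this page. A minimal sketch, with percent_longer as a hypothetical helper name:

```python
def percent_longer(baseline_seconds: float, measured_seconds: float) -> float:
    """Return how much longer `measured_seconds` is than the baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (measured_seconds / baseline_seconds - 1.0) * 100.0


# Single-hour query times quoted in the text (seconds).
uncompressed_one_hour = 44.276
compressed_one_hour = 86.232

slowdown = compressed_one_hour / uncompressed_one_hour
print(f"single hour: {slowdown:.2f}x, "
      f"{percent_longer(uncompressed_one_hour, compressed_one_hour):.1f}% longer")
```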


Summary

Compressing the JSON webrequest logs with Snappy cuts storage to about 24% of the uncompressed size, at the cost of only a slight slowdown for large queries (about 8% over hours 00 through 08). Smaller queries are hit harder: the single-hour count took about twice as long. I will run another test once I have more data to compare (a month), but if the results are approximately the same I will not update this page.

Recommendation: use Snappy compression for all webrequest imports.