Analytics/Team/Conferences/Apache Big Data Europe - November 2016

Luca Toscano and Joseph Allemandou attended to that conference for the analytics team.

Slides of the talks: http://events.linuxfoundation.org/events/apache-big-data-europe/program/slides

Talk we attended

Day 1

Keynotes

Flink

Definitely the next big streaming thing

The Apache way

Since we are using Apache software alot, would make sense to be trained in contributing when joining the WMF.

Session 1: GearPump

Smaller feature set than Flink, lighter but not ready
Interesting for mini clusters but not more

Session 2: Hive optimisations in 2.X

Improvements: latency / scalability / HQL to standard SQL
Multiple engines, partition pruning, joins, vectorized execution
Optimizer: tradeoff between query planning and runtime
Metastore change: remove ORM, hbase instead of mysql
Calcite as query optimiser
- rule-based + cost-base (fix-point cost or iteration limit)
- Examples:
  - cost based - join reordering - Bushy joins instead of sequential
  - rule based - predicate rewriting
  - semi-join reduction
- Gain: small for some, but for some, huge
Something else: Materialized view support -- Using LLAP

Session 3: Real time @ Uber

Kafka: Regional + aggregate (uReplicator +/-= MirrorMaker)
Schema enforcement with Proxy

Session 4: Planetary dataset

Different DBs for different scales
Tiles approach
Spark for computing, with incremental jobs

Session 5: Hive + Druid integration using Calcite

Real integration on the Hive side (create external table materiallized in Druid - DruidStorageHandler - Wow !)
Query druid as much as possible based on optimizer rewrite
Load data from druid to hive, then run rest of query in hive
Version: Hive 2.2.0 (Q12017) - if everything goes well

Day 2

Keynotes

Hadoop @ Uber

Use mesos under hadoop to share cluster among services (hadoop, Presto)
Allow better utilisation of resources by elastic reallocation
Scheduling resource management is difficult - Myrias had issues for Uber, built their own scheduler

ASF Big Tent

difference ASF / github -> Structure in project management (long life)
Community management

Session 1 - Hbase

Fast data retrieval on Hadoop
Has issues with big blobs (more than 50M)
Too broad in my opinion

Session 2 - Calcite adapter dev

Calcite is very powerfull - allows abstract representations of queries and optimisation / rewriting
This makes it difficult to grab at the beginning - High entry cost
Classical approach is to extend Enumerables object for calcite

session 3 - Data science - 50 years

Awesome discussion on stats vs engineering in data science

session 4 - Contributing to Spark

Pick "starter" bugs in Spark Jira
Provide tests
Be patient and don't hesitate to contact people
GraphX is dead, use GraphFrames instead

Session 5 - Apache Bahir

New project - Umbrella for Spark / Flink connectors
Connectors for Spark and Flink - MQTT, Akka, 0mq, Twitter, ActiveMQ, Flume, Reddis

Day 3

Keynotes

TensorFlow

Python / GPU + CPU
TensorFlow Playground - UI materializing layers (awesome)
Tensor flow for poets
Example with trying to predict asteroid collision with earth

Hadoop lessons learnt

Worth paying the price of standardisation upfront

Session 1 - Beam

Programming model for data flows (both batch and stream)

PCollection - Parallel collection of timestamped elements
Bounded / Unbounded PCollections (batch / stream)
Windowing builtin
Engine agnostic: Spark, Flink, GearPump, Apex

Session 2 - Kudu

Read/write as on hbase, and request analytics style
Fast analytics with changing data
Using columnar in memory format for fast analytics
Using random-read approach for fast retrieval/modification

Session 3 - Parquet in details

Cross-platform (Support for Python and C++
Columnar: Vectorized operation, small footprint, predicate push-down
Frameworks: Hive, impala, drill, presto, spark, python pandas !!!
nested-data (from Dremel)- Blog paper from twitter
How is it built:
- File
  - Row group (unit of paralelisation) - Stats (for fast predicate resolution)
    - Columns chunks
      - pages - Stats
Having stats allow fast filtering - data gathering for specific queries (range, etc)
Encoding:
- RLE+BitPacking on dictionary
- Dictionary - Plain or RLE (repetitions!) (dictionary and data stored separately - different pages)
Compression:
- Not a lot of gain since data is already compressed
- use snappy by default for fast decompression)
- Compression at page level - makes predicate and stats usage possible without decompression - So great :)
Predicate pushdown - only decode/transfer needed data (columns needed, rows from stats)

Ideas and discussions

Kafka

1) Kafka manager UI (Yahoo) - https://github.com/yahoo/kafka-manager

2) Chaperone - new tool to be released from Uber (partition rebalancing and failure scenario prediction)

3) uReplicatior - Alternative to MirrorMaker based on Apache Helix - https://eng.uber.com/ureplicator/

4) Uber config:

- Smart clients ---> Kafka Broker HTTP Proxy ---> Kafka Brokers

- The HTTP layer guarantees to have simpler clients not aware of the Kafka protocol.

- The client on the Uber app is very smart and it buffers events/data before sending it to the HTTP layer of the Uber infrastructure. It also knows how to drop data if a network partition happens, buffering only the important data hoping for the connection to come up again soon.

- Kafka producers are all async, but ISR are two for most of the topic partitions.

- Need to follow up with the Uber guys to figure out what are the downsides of running an HTTP layer (very similar to what we do with EventBus).

- One Kafka cluster per DC and the uReplicator used to collect data in the central DC for processing.

Hadoop

1) YARN Timeline Server to get info about map/reduce jobs (not sure if part of Yarn UI)

2) Apache Ambari to manage the Hadoop cluster - https://ambari.apache.org/

3) Apache Zeppelin to visualize Cluster's status (notebook, similar to Jupyter?)

4) Apache Myriad is an interesting project to follow to run Yarn on Mesos. Still incubating with some deficiencies (like only static allocation).

Spark

There is a Spark connector (not the standard one) able to differentiate the number of RDD partitions to create, not defaulting to the number of Topic's partitions set in Kafka.

https://github.com/dibbhatt/kafka-spark-consumer

Infrastructure in general

1) Kafka is the constant for ingesting data at scale.

2) Spark streaming seems to be used a lot, more present than Flink in this year's talks.

3) Presto is used extensively for query to HDFS data.

4) Mesos is another constant in big data infrastructure sharing resources to maximize utilization. It has been mentioned a lot as stepping stone to scale infrastructure beyond a certain point but nobody seemed to explain why and what are the compromises. Uber shed some light during its keynote on Hadoop talking about Myriad, a way to join Yarn and Mesos (still incubating, for example at the moment it does not handle dynamic allocation of resources - it does not release them back to Mesos).

Interesting companies

1) Uber uses extensively Kafka and it is a Tier1 system for them, tons of open source. Following them would be nice.

2) DigitalOcean is collecting tons of data from infrastructure (they are similar to AWS) using Kafka, Spark and Presto.

Streaming

Streaming is very sexy, but doesn't really fit the webrequest use cases as-is:
- We want to store data in parquet - Flink will not be good at that.
  - Streaming needs regular checkpoints to prevent reconstructing too big of a state in case of error.
  - This would mean writing samll parquet files regularly.
  - But, parquet advantages are correlated to data repetition
  - So we would loose some ofp arquet impact (in term of compression and computation)
- Idea: We could refine on the fly and write avro (easier and better perf)
- We still would need batch to convert to parquet (perf ???)
- This way of doing things implies keeping both raw and refined and pageviews etc in kafka - Way bigger - but doable

Seems interesting for eventlogging
- Smaller data, easier
- We could easily partition by schema + revision + date and create a hive table by schema + revision
- Maybe push in clickhouse ????
- This would gives us real-life experience with streaming system - The A-Team could lead the effort toward proper streaming infrastructure
- CR on Gerrit (schema validation without event error handling) - https://gerrit.wikimedia.org/r/#/c/321936
Let's provide an example of spark streaming over our kafka cluster to allow ops to read and filter webrequest data in pseudo realtime CR on gerrit using Spark Streaming - https://gerrit.wikimedia.org/r/#/c/322266/

Federated SQL query service

Providing a single SQL entrypoint to our multiple datasources and expect fast result is really something that sense for me (Joseph)

I view Two approaches to handling multiple systems:

Calcite seems a very good abstraction layer for federating various querying systems
- If we go down that way, we should write a paper or do a talk (perf results), dixit Apache PMC chair
- spark connector for calcite is not maintained - we should make it live again
Use Presto as a fast query engine that can use subsystems
- Mesos + Myriad to provide elasticity and resource sharing among systems (use of presto and hadoop on same machines) -- Seems complex
- Uber had issues with Myriad - But would we have have other options?
- Slider would allow to run distributed system in yarn easily (more easily than with mesos?)
Pivot as a FrontEnd now that it's not open source anymore
- Luca and Joseph think we'd rather use something else
- We should rerying superset (previsouly caravel)