Analytics/Team/Conferences/Apache Big Data Europe - November 2016

Luca Toscano and Joseph Allemandou attended this conference for the Analytics team.

Slides of the talks:

Talks we attended

Day 1


  • Definitely the next big streaming thing
The Apache way
  • Since we use Apache software a lot, it would make sense to be trained in contributing when joining the WMF.

Session 1: GearPump

  • Smaller feature set than Flink, lighter but not ready
  • Interesting for mini clusters but not more

Session 2: Hive optimisations in 2.X

  • Improvements: latency / scalability / HQL to standard SQL
  • Multiple engines, partition pruning, joins, vectorized execution
  • Optimizer: tradeoff between query planning and runtime
  • Metastore change: remove ORM, HBase instead of MySQL
  • Calcite as query optimiser
    • rule-based + cost-based (fix-point cost or iteration limit)
    • Examples:
      • cost based - join reordering - Bushy joins instead of sequential
      • rule based - predicate rewriting
      • semi-join reduction
    • Gain: small for some queries, huge for others
  • Something else: Materialized view support -- Using LLAP
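
The "rule-based, stop at fix point or iteration limit" idea can be sketched in a few lines. This is a toy predicate rewriter written for these notes, not Calcite code: the tuple-based expression format and the names `apply_rules`, `rewrite_once` and `optimize` are all assumptions made for illustration.

```python
# Toy predicate tree (hypothetical format): ("and", a, b), ("or", a, b),
# ("not", a), a bool literal, or a string naming an opaque condition.

def apply_rules(expr):
    """Try the rewrite rules at this node only; return expr itself if none fires."""
    op = expr[0]
    if op == "not":
        arg = expr[1]
        if isinstance(arg, bool):
            return not arg                      # NOT TRUE -> FALSE
        if isinstance(arg, tuple) and arg[0] == "not":
            return arg[1]                       # NOT NOT x -> x
    elif op in ("and", "or"):
        left, right = expr[1], expr[2]
        identity = (op == "and")                # TRUE for AND, FALSE for OR
        if left is identity:
            return right                        # TRUE AND x -> x
        if right is identity:
            return left
        if left is (not identity) or right is (not identity):
            return not identity                 # FALSE AND x -> FALSE
        if left == right:
            return left                         # x AND x -> x
    return expr


def rewrite_once(expr):
    """One top-down pass: rewrite a node if a rule fires, else visit its children."""
    if not isinstance(expr, tuple):
        return expr
    rewritten = apply_rules(expr)
    if rewritten is not expr:
        return rewritten
    return (expr[0], *(rewrite_once(child) for child in expr[1:]))


def optimize(expr, max_iterations=10):
    """Repeat passes until nothing changes (fix point) or the limit is hit."""
    for _ in range(max_iterations):
        rewritten = rewrite_once(expr)
        if rewritten == expr:
            return expr
        expr = rewritten
    return expr
```

With the input `("and", ("not", ("not", "x")), ("or", "x", False))`, the first pass simplifies both children and the second pass collapses the duplicate, so more than one pass is needed to reach the fix point. A cost-based planner would add a cost estimate per candidate plan on top of this loop.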

Session 3: Real time @ Uber

  • Kafka: Regional + aggregate (uReplicator, roughly equivalent to MirrorMaker)
  • Schema enforcement with Proxy

Session 4: Planetary dataset

  • Different DBs for different scales
  • Tiles approach
  • Spark for computing, with incremental jobs

Session 5: Hive + Druid integration using Calcite

  • Real integration on the Hive side (create external table materialized in Druid - DruidStorageHandler - Wow!)
  • Query Druid as much as possible based on optimizer rewrite
  • Load data from Druid to Hive, then run the rest of the query in Hive
  • Version: Hive 2.2.0 (Q1 2017) - if everything goes well

Day 2


Hadoop @ Uber
  • Use Mesos under Hadoop to share the cluster among services (Hadoop, Presto)
  • Allows better utilisation of resources through elastic reallocation
  • Scheduling resource management is difficult - Myriad had issues for Uber, so they built their own scheduler
ASF Big Tent
  • Difference between ASF and GitHub: structure in project management (long life)
  • Community management

Session 1 - HBase

  • Fast data retrieval on Hadoop
  • Has issues with big blobs (more than 50M)
  • Too broad in my opinion

Session 2 - Calcite adapter dev

  • Calcite is very powerful - allows abstract representations of queries and optimisation / rewriting
  • This makes it difficult to grasp at first - high entry cost
  • Classical approach is to extend Enumerables object for Calcite

Session 3 - Data science - 50 years

  • Awesome discussion on stats vs engineering in data science

Session 4 - Contributing to Spark

  • Pick "starter" bugs in Spark Jira
  • Provide tests
  • Be patient and don't hesitate to contact people
  • GraphX is dead, use GraphFrames instead

Session 5 - Apache Bahir

  • New project - Umbrella for Spark / Flink connectors
  • Connectors for Spark and Flink - MQTT, Akka, 0mq, Twitter, ActiveMQ, Flume, Redis

Day 3


  • Python / GPU + CPU
  • TensorFlow Playground - UI materializing layers (awesome)
  • TensorFlow for Poets
  • Example with trying to predict asteroid collision with earth
Hadoop lessons learnt
  • Worth paying the price of standardisation upfront

Session 1 - Beam

  • Programming model for data flows (both batch and stream)
  • PCollection - Parallel collection of timestamped elements
  • Bounded / Unbounded PCollections (batch / stream)
  • Windowing builtin
  • Engine agnostic: Spark, Flink, GearPump, Apex
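
To make "timestamped elements + built-in windowing" concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK; `fixed_windows` and `stream_windows` are hypothetical helpers, and the streaming variant assumes elements arrive in timestamp order (real engines handle late data with watermarks and triggers).

```python
from collections import defaultdict


def fixed_windows(elements, size_s):
    """Bounded case: group (timestamp, value) pairs into [start, start + size_s) windows."""
    windows = defaultdict(list)
    for ts, value in elements:
        start = ts - ts % size_s          # start of the window this element falls in
        windows[start].append(value)
    return dict(windows)


def stream_windows(events, size_s):
    """Unbounded case: yield (window_start, values) once a later window begins.

    Assumes in-order timestamps, so a window is closed as soon as an
    element belonging to a different window shows up.
    """
    current_start, buffered = None, []
    for ts, value in events:
        start = ts - ts % size_s
        if current_start is not None and start != current_start:
            yield current_start, buffered
            buffered = []
        current_start = start
        buffered.append(value)
    if buffered:                          # flush the last open window at end of input
        yield current_start, buffered
```

The same windowing rule is applied in both functions; only the moment at which results are emitted differs, which is the batch / stream unification the talk described.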

Session 2 - Kudu

  • Read/write as on HBase, with analytics-style queries
  • Fast analytics with changing data
  • Using columnar in memory format for fast analytics
  • Using random-read approach for fast retrieval/modification

Session 3 - Parquet in details

  • Cross-platform (support for Python and C++)
  • Columnar: Vectorized operation, small footprint, predicate push-down
  • Frameworks: Hive, Impala, Drill, Presto, Spark, Python pandas!!!
  • Nested data (from Dremel) - blog post from Twitter
  • How is it built:
    • File
      • Row group (unit of parallelisation) - Stats (for fast predicate resolution)
        • Columns chunks
          • pages - Stats
  • Having stats allows fast filtering - data gathering for specific queries (range, etc)
  • Encoding:
    • RLE+BitPacking on dictionary
    • Dictionary - Plain or RLE (repetitions!) (dictionary and data stored separately - different pages)
  • Compression:
    • Not a lot of gain since data is already compressed by the encoding
    • Use Snappy by default (fast decompression)
    • Compression at page level - makes predicate and stats usage possible without decompression - So great :)
  • Predicate pushdown - only decode/transfer needed data (columns needed, rows from stats)
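
The two mechanisms above (dictionary + RLE encoding, and min/max stats used to skip row groups) can be illustrated with a small pure-Python sketch. It only mimics the ideas: it is not the Parquet on-disk format (which uses a hybrid RLE / bit-packing encoding), and all function names and the dict-based row-group layout are assumptions for illustration.

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [tuple(r) for r in runs]


def make_row_groups(rows, column, size):
    """Cut rows into groups of `size` and keep min/max stats for one column."""
    groups = []
    for i in range(0, len(rows), size):
        chunk = rows[i:i + size]
        values = [r[column] for r in chunk]
        groups.append({"min": min(values), "max": max(values), "rows": chunk})
    return groups


def scan_range(groups, column, lo, hi):
    """Return matching rows and how many row groups actually had to be read."""
    matches, groups_read = [], 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue                      # stats prove no row can match: skip the group
        groups_read += 1
        matches.extend(r for r in g["rows"] if lo <= r[column] <= hi)
    return matches, groups_read
```

For example, the column `[200, 200, 200, 404, 404, 200]` becomes the dictionary `[200, 404]` plus the runs `[(0, 3), (1, 2), (0, 1)]`. The more a column repeats, the shorter the runs list gets, which is why repetition drives the size gains.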

Ideas and discussions


1) Kafka manager UI (Yahoo)

2) Chaperone - new tool to be released from Uber (partition rebalancing and failure scenario prediction)

3) uReplicator - Alternative to MirrorMaker based on Apache Helix

4) Uber config:

- Smart clients ---> Kafka Broker HTTP Proxy ---> Kafka Brokers

- The HTTP layer keeps clients simple: they do not need to know the Kafka protocol.

- The client in the Uber app is very smart: it buffers events/data before sending them to the HTTP layer of the Uber infrastructure. It also knows how to drop data if a network partition happens, buffering only the important data in the hope that the connection comes back soon.

- Kafka producers are all async, but most topic partitions have two in-sync replicas (ISR).

- Need to follow up with the Uber guys to figure out what the downsides of running an HTTP layer are (very similar to what we do with EventBus).

- One Kafka cluster per DC, with uReplicator used to collect data in the central DC for processing.
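
The "smart client" behaviour described above can be sketched as follows. This is a guess at the general shape, not Uber's implementation: the class `BufferingClient`, the `important` flag, the `send` callable standing in for the HTTP proxy call, and the connectivity callback are all hypothetical.

```python
from collections import deque


class BufferingClient:
    """Buffers events, and while offline keeps only the important ones."""

    def __init__(self, send, max_buffered=1000):
        self.send = send                  # hypothetical: posts one event, raises ConnectionError when offline
        self.max_buffered = max_buffered
        self.buffer = deque()
        self.online = True

    def on_connectivity_change(self, online):
        """Assumed to be called by the host app when the network state changes."""
        self.online = online

    def track(self, event, important=False):
        """Queue an event; return False if it was dropped."""
        if not self.online and not important:
            return False                  # offline: drop everything that is not important
        if len(self.buffer) >= self.max_buffered:
            self.buffer.popleft()         # buffer full: evict the oldest event
        self.buffer.append(event)
        return True

    def flush(self):
        """Send buffered events in order; stop at the first failure."""
        sent = 0
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                self.online = False       # keep the event for the next attempt
                return sent
            self.buffer.popleft()
            sent += 1
        return sent
```

An event is only removed from the buffer after `send` succeeds, so a failed flush never loses data that was already accepted.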


1) YARN Timeline Server to get info about MapReduce jobs (not sure if it is part of the YARN UI)

2) Apache Ambari to manage the Hadoop cluster

3) Apache Zeppelin to visualize the cluster's status (notebook, similar to Jupyter?)

4) Apache Myriad is an interesting project to follow for running YARN on Mesos. Still incubating, with some deficiencies (like only static allocation).


There is a Spark connector (not the standard one) that lets you choose the number of RDD partitions to create, rather than defaulting to the number of partitions of the Kafka topic.
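
One way such a connector could decouple RDD partitions from Kafka partitions is to split each Kafka partition's offset range into several sub-ranges. The sketch below shows only that arithmetic; it is not the connector's code, and `split_offset_ranges` and the tuple layout are assumptions made for illustration.

```python
def split_offset_ranges(offset_ranges, splits_per_partition):
    """Split each (topic, partition, from_offset, until_offset) range into sub-ranges.

    `until_offset` is exclusive. Each sub-range would back one RDD partition,
    so a topic with P partitions can yield up to P * splits_per_partition tasks.
    """
    result = []
    for topic, partition, start, end in offset_ranges:
        total = end - start
        # Never create more sub-ranges than there are messages (but at least one).
        n = max(1, min(splits_per_partition, total))
        base, extra = divmod(total, n)
        cursor = start
        for i in range(n):
            size = base + (1 if i < extra else 0)   # spread the remainder over the first ranges
            result.append((topic, partition, cursor, cursor + size))
            cursor += size
    return result
```

The sub-ranges are contiguous and non-overlapping, so every offset is read exactly once. Ordering across sub-ranges of the same Kafka partition is no longer guaranteed once they run in parallel.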

Infrastructure in general

1) Kafka is the constant for ingesting data at scale.

2) Spark streaming seems to be used a lot, more present than Flink in this year's talks.

3) Presto is used extensively to query HDFS data.

4) Mesos is another constant in big data infrastructure, sharing resources to maximize utilization. It was mentioned a lot as a stepping stone to scale infrastructure beyond a certain point, but nobody seemed to explain why or what the compromises are. Uber shed some light during its Hadoop keynote when talking about Myriad, a way to join YARN and Mesos (still incubating; for example, at the moment it does not handle dynamic allocation of resources, i.e. it does not release them back to Mesos).

Interesting companies

1) Uber uses Kafka extensively; it is a Tier 1 system for them, and they release tons of open source. Following them would be nice.

2) DigitalOcean is collecting tons of data from infrastructure (they are similar to AWS) using Kafka, Spark and Presto.


  • Streaming is very sexy, but doesn't really fit the webrequest use cases as-is:
    • We want to store data in Parquet - Flink will not be good at that.
      • Streaming needs regular checkpoints to avoid reconstructing too big a state in case of error.
      • This would mean writing small Parquet files regularly.
      • But Parquet's advantages are correlated to data repetition
      • So we would lose some of Parquet's impact (in terms of compression and computation)
    • Idea: We could refine on the fly and write Avro (easier and better perf)
    • We would still need batch to convert to Parquet (perf ???)
    • This way of doing things implies keeping both raw and refined and pageviews etc in kafka - Way bigger - but doable
  • Seems interesting for eventlogging
    • Smaller data, easier
    • We could easily partition by schema + revision + date and create a Hive table by schema + revision
    • Maybe push into ClickHouse ????
    • This would give us real-life experience with a streaming system - The A-Team could lead the effort toward proper streaming infrastructure
    • CR on Gerrit (schema validation without event error handling)
  • Let's provide an example of Spark Streaming over our Kafka cluster to allow ops to read and filter webrequest data in pseudo real time (CR on Gerrit using Spark Streaming)
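
The "partition by schema + revision + date, one Hive table per schema + revision" idea for eventlogging could look like the sketch below. The event field names (`schema`, `revision`, `timestamp`), the `/eventlogging` base path and the `year=/month=/day=` directory layout are all assumptions made for illustration, not an agreed design.

```python
from collections import defaultdict
from datetime import datetime, timezone


def table_name(event):
    """One table per schema + revision (hypothetical naming scheme)."""
    return f"{event['schema']}_{event['revision']}"


def partition_path(event, base="/eventlogging"):
    """Hive-style date partition directory for an event (UTC, epoch-seconds timestamp)."""
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return (f"{base}/{table_name(event)}"
            f"/year={day.year}/month={day.month:02d}/day={day.day:02d}")


def group_by_partition(events):
    """Bucket a micro-batch of events by target directory before writing."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_path(event)].append(event)
    return dict(groups)
```

A streaming job would call something like `group_by_partition` on each micro-batch and append one file per directory. With small batches this produces many small files, which is the same concern raised above for Parquet.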

Federated SQL query service

Providing a single SQL entry point to our multiple data sources and expecting fast results is really something that makes sense to me (Joseph)

I see two approaches to handling multiple systems:

  • Calcite seems a very good abstraction layer for federating various querying systems
    • If we go down that way, we should write a paper or give a talk (perf results), dixit Apache PMC chair
    • Spark connector for Calcite is not maintained - we should revive it
  • Use Presto as a fast query engine that can use subsystems
    • Mesos + Myriad to provide elasticity and resource sharing among systems (use of presto and hadoop on same machines) -- Seems complex
    • Uber had issues with Myriad - But would we have other options?
    • Slider would make it easy to run distributed systems on YARN (more easily than with Mesos?)
  • Pivot as a front end, now that it's not open source anymore
    • Luca and Joseph think we'd rather use something else
    • We should retry Superset (previously Caravel)