Analytics/Systems/Cluster/Gobblin

From Wikitech

Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.

Until 2021, we used Camus for this purpose. T238400 has some information on how Gobblin was chosen as its replacement.

Gobblin jobs

Gobblin jobs are declared in Puppet.

WMF's Gobblin fork

The Data Engineering team maintains a fork of Gobblin. We use this fork to maintain our own gobblin-wmf module in the wmf branch. The gobblin-wmf module mostly contains code for interacting with Event Platform based events in Kafka. The master branch should track upstream.

Releasing new Gobblin versions

We upload our gobblin-wmf artifacts directly to Archiva, then add them as git-fat jar files in Analytics/Systems/Cluster/Deploy/Refinery, and deploy them as we do other jar artifacts with analytics/refinery.

We do not (as of 2021-07) have an automated release process for Gobblin. You must manually upload the packaged artifact jars to Archiva, then manually download them and git add them to analytics/refinery.
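
Because the upload and download steps above are manual, it can help to confirm that the jar downloaded from Archiva is byte-identical to the one that was built before running git add. The following is a minimal sketch of such a check, not part of the documented release process; the function names sha256_of and jars_match are hypothetical helpers introduced here for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large jars are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jars_match(built_jar: Path, downloaded_jar: Path) -> bool:
    """Return True if the locally built jar and the downloaded copy are byte-identical."""
    return sha256_of(built_jar) == sha256_of(downloaded_jar)
```

For example, jars_match(Path("built.jar"), Path("downloaded.jar")) returns True only when both files have the same contents; the two file names are placeholders and should be replaced with the actual paths of the packaged and downloaded jars.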