Data Platform/Systems/Bigtop Packages

From Wikitech

Overview

We use Apache Bigtop (https://bigtop.apache.org/) as our Hadoop distribution and use their toolchain to build Debian packages for all of the components. The upstream source for Bigtop is here: https://github.com/apache/bigtop/

Our current build target is Bigtop version 3.4.0.

Bigtop uses the Gradle build system throughout, along with Puppet-provisioned build slaves that run under Docker.

An example command to build a single component, such as hadoop, under Debian 12 (bookworm) is as follows:

docker run --rm -v `pwd`:/ws --workdir /ws bigtop/slaves:3.4.0-debian-12 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean hadoop-pkg'
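The same pattern should extend to several components in one run. The following is a minimal sketch, assuming every component exposes a `<name>-pkg` Gradle task in the way hadoop does; the `pkg_targets` helper and the chosen components are hypothetical, and the command is only printed, not executed.

```shell
#!/usr/bin/env bash
set -eu

# Turn component names into Gradle package targets (hadoop -> hadoop-pkg),
# joined by single spaces. Hypothetical helper, not part of Bigtop.
pkg_targets() {
  printf '%s-pkg\n' "$@" | paste -sd' ' -
}

# Hypothetical selection of components to build together.
targets="$(pkg_targets hadoop hive spark)"

# Print the command rather than run it, so the sketch is safe to try.
echo docker run --rm -v "$(pwd)":/ws --workdir /ws bigtop/slaves:3.4.0-debian-12 \
  bash -c "'. /etc/profile.d/bigtop.sh; ./gradlew allclean ${targets}'"
```

Removing the leading echo would run the build for real.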

WMF Build Script

We maintain a build script for Bigtop here: https://gitlab.wikimedia.org/repos/data-engineering/bigtop-build

We need this script for two reasons:

  1. We have a specific set of components that we need to build.
  2. We need to build packages for at least two distributions.

The top level script to run is: build_all_bigtop_distros_wmf.sh

This script does the following:

  1. Creates build slaves for debian-11 and debian-12
  2. Uses each build slave to run build_bigtop_wmf.sh
  3. Amends the packages to include an infix denoting the correct Debian version.

Unlike most other software that we build, Bigtop packages are currently built on an engineer's workstation and then uploaded to the APT repository for serving.
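The three steps that build_all_bigtop_distros_wmf.sh performs for each distribution can be pictured with the dry-run sketch below. It is not the real script: `distro_suffix` and `plan_distro` are hypothetical helpers, and the actual commands used to create the build slaves and invoke build_bigtop_wmf.sh are not shown here, so the sketch only prints a plan.

```shell
#!/usr/bin/env bash
set -eu

# Map a build slave flavour to the package infix (debian-11 -> deb11).
distro_suffix() {
  printf 'deb%s\n' "${1#debian-}"
}

# Print, rather than run, the three steps for one distribution.
plan_distro() {
  local distro="$1"
  echo "create build slave for ${distro}"
  echo "run build_bigtop_wmf.sh on the ${distro} build slave"
  echo "amend packages with -$(distro_suffix "$distro")"
}

for distro in debian-11 debian-12; do
  plan_distro "$distro"
done
```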

Package Amendment

The default build mechanism in bigtop does not include any information about which distribution the packages were built for in the generated package name and control files.

This causes an issue for us, since we have to ensure that we have packages available for both bullseye (Debian 11) and bookworm (Debian 12) on apt.wikimedia.org for distribution.

Reprepro does not currently support serving identically named files for different distributions.

To get around this restriction, the amend_bigtop_packages_wmf.sh script does the following for each package file generated:

  1. Unpack the deb file including the metadata files with dpkg-deb -R
  2. Modify the DEBIAN/control file and append either -deb11 or -deb12 to the Version: field
  3. Re-pack the deb file with dpkg-deb -b, adding either -deb11 or -deb12 to the file name

This will allow us to host all of these packages with reprepro concurrently.
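The amendment steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of amend_bigtop_packages_wmf.sh: the function names are hypothetical, the suffix is assumed to go directly after the version in a `name_version_arch.deb` file name, and `sed -i -E` assumes GNU sed.

```shell
#!/usr/bin/env bash
set -eu

# Append -<suffix> to the Version: field of a DEBIAN/control file.
amend_control() {
  sed -i -E "s/^(Version: .*)$/\1-$2/" "$1"
}

# Insert -<suffix> after the version in name_version_arch.deb
# (assumed placement; the real script may differ).
amend_filename() {
  local base="${1%.deb}"
  printf '%s-%s_%s.deb\n' "${base%_*}" "$2" "${base##*_}"
}

# Unpack one package, amend its control file, and re-pack it
# next to the original under the amended file name.
amend_deb() {
  local deb="$1"
  local suffix="$2"
  local workdir
  workdir="$(mktemp -d)"
  dpkg-deb -R "$deb" "$workdir"
  amend_control "$workdir/DEBIAN/control" "$suffix"
  dpkg-deb -b "$workdir" \
    "$(dirname "$deb")/$(amend_filename "$(basename "$deb")" "$suffix")"
  rm -rf "$workdir"
}

# Only act when a suffix and at least one package file are given.
if [ "$#" -ge 2 ]; then
  suffix="$1"
  shift
  for deb in "$@"; do
    amend_deb "$deb" "$suffix"
  done
fi
```

Saved under a hypothetical name such as amend_sketch.sh, it would be called with the suffix first and the package files after it, for example: amend_sketch.sh deb11 followed by the paths of the generated .deb files.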

Historical - Bigtop 1.5

Upstream maintains a branch called branch-1.5, which was the version that we used until recently. Unfortunately for us, upstream decided not to support Debian 11 (bullseye) or later as an installation target, whereas we had to do so. We therefore created our own fork of their repository, which is https://gitlab.wikimedia.org/repos/data-engineering/bigtop

The previous version of our build scripts was kept in a branch here: https://gitlab.wikimedia.org/repos/data-engineering/bigtop/-/tree/update_bigtop_1.5_build

The components that we currently build with Bigtop are:

  • bigtop-groovy
  • bigtop-jsvc
  • bigtop-tomcat
  • bigtop-utils
  • hadoop
  • hbase
  • hive
  • mahout
  • oozie
  • solr
  • spark
  • sqoop
  • sqoop2