Data Platform/Systems/Bigtop Packages
Overview
We use Apache Bigtop (https://bigtop.apache.org/) as our Hadoop distribution and use their toolchain to build Debian packages for all of the components. The upstream source for Bigtop is here: https://github.com/apache/bigtop/
Upstream maintains a branch called branch-1.5, which is the version that we used until recently. Unfortunately for us, they have decided not to support Debian 11 (Bullseye) or later as an installation target, whereas we need to. We have therefore created our own fork of their repository, which is https://gitlab.wikimedia.org/repos/data-engineering/bigtop
Bigtop uses the Gradle build system throughout, together with puppetized build slaves that run under Docker.
An example command to build a single component, such as hadoop, under bullseye is as follows:
docker run --rm -v `pwd`:/ws --workdir /ws bigtop/slaves:1.5.0-debian-11 bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean hadoop-pkg'
The components that we currently build with Bigtop are:
- bigtop-groovy
- bigtop-jsvc
- bigtop-tomcat
- bigtop-utils
- hadoop
- hbase
- hive
- mahout
- oozie
- solr
- spark
- sqoop
- sqoop2
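The single-component command shown earlier can in principle be extended to several components by passing more than one `<component>-pkg` target to one gradle invocation. The following is a minimal sketch under that assumption; the helper names `pkg_targets` and `print_build_command` are hypothetical and are not part of the Bigtop repository. It only prints the docker command rather than running it, so that it can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper: turn component names into gradle "<name>-pkg" targets.
pkg_targets() {
  targets=""
  for component in "$@"; do
    # Append "<component>-pkg", separated by a space after the first entry.
    targets="${targets:+$targets }${component}-pkg"
  done
  printf '%s\n' "$targets"
}

# Hypothetical helper: print (do not run) a docker command that would build
# the given components in a single gradle invocation. The image tag is the
# one used in the bullseye example above.
print_build_command() {
  image="bigtop/slaves:1.5.0-debian-11"
  printf "docker run --rm -v \"\$(pwd)\":/ws --workdir /ws %s bash -c '. /etc/profile.d/bigtop.sh; ./gradlew allclean %s'\n" \
    "$image" "$(pkg_targets "$@")"
}

# Example: show the command for three of the components listed above.
print_build_command hadoop hbase hive
```

Running allclean once at the start of a combined invocation, rather than once per component, avoids the clean step being repeated between component builds.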
Unlike most other software that we build, Bigtop packages are currently built on an engineer's workstation and then uploaded to the APT repository for serving.
WMF Build Script
We have added a couple of scripts to our branch-1.5 because:
- We have a specific set of components that we need to build
- We use an operating system that is unsupported
The top level script to run is: build_all_bigtop_distros_wmf.sh
This script does the following:
- Creates build slaves for debian-10 and debian-11
- Uses each build slave to run build_bigtop_wmf.sh
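As a rough illustration of the second step, the loop below prints one docker command per distribution. It is a sketch, not the contents of build_all_bigtop_distros_wmf.sh: the image tag pattern `bigtop/slaves:1.5.0-<distro>` is extrapolated from the bullseye example, the location of build_bigtop_wmf.sh at the top of the workspace is an assumption, and the function name `print_distro_builds` is hypothetical. The step that creates the build slave images is not shown.

```shell
#!/bin/sh
# Hypothetical sketch: print (do not run) the per-distro build commands.
# Assumes an image named bigtop/slaves:1.5.0-<distro> already exists for
# each distro and that build_bigtop_wmf.sh sits at the workspace root.
print_distro_builds() {
  for distro in debian-10 debian-11; do
    printf "docker run --rm -v \"\$(pwd)\":/ws --workdir /ws bigtop/slaves:1.5.0-%s bash -c '. /etc/profile.d/bigtop.sh; ./build_bigtop_wmf.sh'\n" \
      "$distro"
  done
}

print_distro_builds
```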
Package Amendment
The default build mechanism in bigtop does not include any information about which distribution the packages were built for.
This causes an issue for us, since we have to ensure that we have packages available for both buster and bullseye on apt.wikimedia.org for distribution.
To get around this, the build_bigtop_wmf.sh script does the following for each package file generated:
- Unpack the deb file, including the metadata files, with dpkg-deb -R
- Modify the DEBIAN/control file and append either -deb10 or -deb11 to the Version: field
- Re-pack the deb file with dpkg-deb -b, adding either -deb10 or -deb11 to the file name
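The three steps above can be sketched in shell as follows. This is an illustration, not the code in build_bigtop_wmf.sh: the function names `append_version_suffix` and `amend_deb` are hypothetical, and placing the suffix immediately before the `.deb` extension of the file name is an assumption, since the exact naming is not described here. File ownership handling during the re-pack is not shown.

```shell
#!/bin/sh
# Hypothetical helper: append a suffix such as "-deb11" to the Version: field
# of a DEBIAN/control file. Does nothing if the suffix is already present.
append_version_suffix() {
  control_file="$1"
  suffix="$2"
  if grep -q "^Version: .*${suffix}\$" "$control_file"; then
    return 0
  fi
  # "&" in the sed replacement stands for the whole matched Version: line.
  sed "s/^Version: .*/&${suffix}/" "$control_file" > "${control_file}.tmp" &&
    mv "${control_file}.tmp" "$control_file"
}

# Hypothetical helper: unpack a .deb, amend its version, and re-pack it under
# a new file name. Requires dpkg-deb.
amend_deb() {
  deb="$1"
  suffix="$2"
  workdir="$(mktemp -d)"
  dpkg-deb -R "$deb" "$workdir"                      # unpack, including DEBIAN/
  append_version_suffix "$workdir/DEBIAN/control" "$suffix"
  dpkg-deb -b "$workdir" "${deb%.deb}${suffix}.deb"  # assumed file naming
  rm -rf "$workdir"
}

# Example (not executed here): amend_deb path/to/package.deb -deb11
```

Keeping the control file edit in its own function makes it possible to check the version rewrite without needing a real package file.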
This allows us to host all of these packages concurrently with reprepro.