Data Engineering/Systems/DataHub/Upgrading
The upstream DataHub repository is: https://github.com/linkedin/datahub/
At the moment we maintain a fork of DataHub here: https://gerrit.wikimedia.org/r/admin/repos/analytics/datahub
The reasons why we do this are:
- DataHub does not publish binary artifacts other than its docker images
- We need to keep PipelineLib configuration files and Blubber build pipeline files alongside the codebase
Currently our changes are made in the wmf branch, and we frequently squash any changes to that branch down to a single commit.
When a new release is required we perform the following operations.
- Update the code in a feature branch
- Merge to the wmf branch to publish the new containers
- Create a feature branch in the deployment-charts repository and update the image version in the helm charts
- Deploy the new version with helmfile
Update the code
- Check out the code locally.
git checkout -b datahub_upgrade_branch
- Add the upstream remote if it does not already exist
git remote add linkedin-github git@github.com:datahub-project/datahub.git
- Fetch the master branch from the upstream remote.
git remote update linkedin-github
- Push the master branch from the upstream repository to our gerrit repository.
git push origin linkedin-github/master:master
- Also push the tags to the remote repository
git push origin --tags
- Check out the wmf branch.
git checkout wmf
- Rebase your current branch against the tag of the new version. In this case it is v0.8.34.
git rebase -i v0.8.34
- Fix any merge conflicts if encountered
- Force-push the branch to gerrit
git push --force-with-lease
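The steps above, from updating the upstream remote onwards, can be sketched as a single shell function. This is only an illustration and not existing tooling: the names `run`, `upgrade_wmf_branch` and `DRY_RUN` are hypothetical, and the sketch assumes the linkedin-github remote has already been added. The git commands themselves are the ones listed in the steps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run(), upgrade_wmf_branch() and DRY_RUN are
# illustrative names, not part of any existing tooling.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: upgrade_wmf_branch <tag>, for example: upgrade_wmf_branch v0.8.34
upgrade_wmf_branch() {
  local tag="$1"
  local upstream="linkedin-github"   # assumes this remote already exists

  if [ -z "$tag" ]; then
    echo "usage: upgrade_wmf_branch <tag>" >&2
    return 2
  fi

  # Fetch upstream, then mirror its master branch and tags to our repository.
  run git remote update "$upstream" || return 1
  run git push origin "$upstream/master:master" || return 1
  run git push origin --tags || return 1

  # Rebase the wmf branch onto the new release tag.
  run git checkout wmf || return 1
  # If the rebase stops on conflicts, the function returns here;
  # resolve them and push manually afterwards.
  run git rebase -i "$tag" || return 1
  run git push --force-with-lease || return 1
}
```

Running `DRY_RUN=1 upgrade_wmf_branch v0.8.34` prints the commands without executing them, which is a cheap way to review the sequence before touching the repository.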
Deploy datahub CLI tool
The version of the CLI tool has to match the server version, so we have to:
- Update the datahub-cli version in the packaged virtual environment
- Build and publish to Archiva: Analytics/Systems/Archiva#Uploading_Dependency_Artifacts
- Update the artifact version and metadata ingestion jobs in the airflow jobs repository
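Because the CLI version has to match the server version, a small check before publishing can catch a mismatch early. The sketch below is an assumption-laden illustration: `tag_to_version` and `versions_match` are hypothetical helper names, and it assumes the CLI version string is simply the release tag without its leading "v" (for example v0.8.34 and 0.8.34), which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of DataHub or of our repositories.

# Strip a leading "v" from a release tag: v0.8.34 -> 0.8.34.
# A value without the prefix is returned unchanged.
tag_to_version() {
  printf '%s\n' "${1#v}"
}

# Succeed only if both arguments name the same version,
# ignoring an optional leading "v" on either side.
versions_match() {
  [ "$(tag_to_version "$1")" = "$(tag_to_version "$2")" ]
}
```

A call such as `versions_match "v0.8.34" "0.8.34" || echo "CLI and server versions differ"` could then be run with the server tag and the version pinned in the virtual environment. How to read the installed CLI version programmatically is left out, since the output format of the CLI is not covered here.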