Data Platform/Systems/Airflow/Developer guide/Python Job Repos
This page provides a tutorial on how to design a Python-based job repository in GitLab that publishes job artifacts that can be scheduled and launched by Airflow. There is also an example GitLab repository that follows all of these recommendations: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project.
Overview
We intentionally want to separate job logic from scheduling logic. A job should be standalone and parameterized so that, given specific inputs, it produces certain outputs. Airflow is a scheduler that runs the job with the input parameters for a particular run, usually based on timestamps or incoming data.
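As an illustration, a standalone, parameterized job entrypoint might look something like the sketch below (the script name, arguments, and logic are hypothetical, not taken from the example project):
# bin/do_etl.py - hypothetical standalone job entrypoint.
# All inputs arrive as parameters, so a scheduler like Airflow (or a human)
# can run it for any particular date or dataset.
import argparse

def main():
    parser = argparse.ArgumentParser(description='Example parameterized job')
    parser.add_argument('--input-path', required=True, help='where to read data from')
    parser.add_argument('--output-path', required=True, help='where to write results')
    parser.add_argument('--run-date', required=True, help='logical date of this run, e.g. 2022-04-01')
    args = parser.parse_args()

    # Job logic goes here: read from input_path, transform, write to output_path.
    print(f'Processing {args.input_path} for {args.run_date} into {args.output_path}')

if __name__ == '__main__':
    main()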
A job repository can be used to specify all dependencies and logic needed to run a job. In order for the Airflow Scheduler to launch the job, it needs to be able to access the job code and dependencies somewhere.
Data Engineering has implemented reusable GitLab CI pipelines to automate the generation of job 'artifacts', as well as tooling to deploy these artifacts so that Airflow can access them.
As of 2022-04, the CI pipelines focus on Python-based jobs (or anything that uses conda environments), but artifact deployment can work with any kind of artifact file (zip files, jars, compiled binaries, etc.).
GitLab Job Repository Setup
Python package setup
You must minimally have the following:
- A conda-environment.yaml file that specifies at least the Python version:
dependencies:
- python=3.7
- A pip installable project setup, e.g. pyproject.toml, setup.cfg, setup.py, etc., such that running pip install . in your project dir will work (see the sketch below).
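For example, a minimal pip installable layout could look something like this (a sketch only; the package name and metadata are illustrative, not prescriptive):
# pyproject.toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

# setup.cfg
[metadata]
name = example_job_project
version = 0.1.0.dev

[options]
packages = find: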
Optionally, to use automated releases, you should use bump2version to manage your package version. If you use setup.cfg to manage your package version, then you need a .bumpversion.cfg file as follows:
[bumpversion]
current_version = 0.1.0.dev
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z0-9]+))?
serialize =
    {major}.{minor}.{patch}.{release}
    {major}.{minor}.{patch}

[bumpversion:part:release]
optional_value = unused
values =
    dev
    unused

[bumpversion:file:setup.cfg]
search = version = {current_version}
replace = version = {new_version}
When you first create this file, make sure that the current_version ends in ".dev", and that the version in setup.cfg matches it exactly.
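You can optionally sanity check the bump2version configuration locally before relying on the CI release jobs. Assuming bump2version is installed in your environment, a dry run prints what would change without modifying any files:
# Run from the project root: shows the bump that would be performed, without writing anything.
bump2version --dry-run --verbose --allow-dirty minor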
GitLab CI setup
If you are choosing to use automated release versioning, then your .gitlab-ci.yml file should contain the following.
# Include conda_artifact_repo.yml to add release and conda env publishing jobs.
include:
  - project: 'repos/data-engineering/workflow_utils'
    ref: v0.19.0
    file: '/gitlab_ci_templates/pipelines/conda_artifact_repo.yml'
Alternatively, if you choose not to use automated release versioning, include just the publish_conda_env job directly:
# Include just the publish_conda_env job.
# This does not include automated releasing, so you will need to either manually
# run the publish_conda_env job, or manually push tags to trigger the
# publish_conda_env job.
include:
  - project: 'repos/data-engineering/workflow_utils'
    ref: v0.19.0
    file: '/gitlab_ci_templates/jobs/publish_conda_env.yml'
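With this variant nothing is released automatically, so to publish via a tag you push one yourself, for example (the tag name below is only illustrative; use whatever versioning scheme your project follows):
# Tag the commit you want to publish and push the tag to GitLab to trigger the pipeline.
git tag v0.2.0
git push origin v0.2.0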
Automated Release GitLab Project Setup
If you choose to use automated releasing, you'll need to allow GitLab CI to push commits as follows:
- Create a project access token:
  - On the left sidebar, go to Settings > Access Tokens.
  - Enter a token name like gitlab-ci.
  - Set or delete the expiration date.
  - Select the Maintainer role. This makes sure that GitLab CI can push to your main branch, where a Maintainer can write by default.
  - Tick the api and write_repository scopes.
  - Push the button and copy the token.
- Set CI variables:
  - On the left sidebar, go to Settings > CI/CD, expand Variables and add the following key-value pair:
    - CI_PROJECT_PASSWORD - paste the project access token. Tick the Mask variable flag.
  - If you selected the 'Protect variable' checkbox when you created the CI_PROJECT_PASSWORD variable, you'll need to designate certain branches as protected. Usually this is just done for the main branch: go to Settings > Repository, expand Branches and add the main (or other) branch as a protected branch.
How to Deploy
This is a step-by-step tutorial to deploy a new release to Airflow; feel free to read the following sections for details.
- On the left sidebar, go to CI/CD > Pipelines.
- Click on the play button and select trigger_release.
- On the left sidebar, go to Packages and registries > Package Registry.
- Click on the first item, right-click the asset file name, and copy the URL.
- Branch out of airflow-dags.
- Update the URL in the DAG config.
- Merge into main.
- Deploy the DAGs (deployment permissions required):
me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/${YOUR_AIRFLOW_INSTANCE}/
me@deploy1002:~$ scap deploy
Replace ${YOUR_AIRFLOW_INSTANCE} with one of the available instances.
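For example, assuming you are deploying to the analytics instance referenced elsewhere on this page (scap deploy also accepts an optional log message):
me@deploy1002:~$ cd /srv/deployment/airflow-dags/analytics/
me@deploy1002:~$ scap deploy "Deploy example-job-project 0.15.0 conda env"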
Job Repository Conda Env Artifact Publishing
Assuming you are using automated releases and you've followed all the setup instructions above, to publish a conda env job artifact you'll do the following.
Development
You can publish a .dev version of your conda env from any commit on the main branch. To do so, from a commit Pipeline, manually run the publish_conda_env job. This will publish a conda env to your project's Generic Package Registry.
Releasing
- Make sure the changes you want to deploy are merged into your main branch.
- Go to CI/CD -> Pipelines -> Run Pipeline (blue button in the upper right).
  - The only variable here you might want to edit is POST_RELEASE_VERSION_BUMP. This is the part of the semantic version to bump after releasing. Allowed values are major, minor and patch. Default is minor.
  - Click Run Pipeline at the bottom. This will launch a new pipeline for the latest commit on your main branch.
- Go to CI/CD -> Pipelines and click on the pipeline you just launched.
- Once any tests have finished, you should be able to manually run the trigger_release job. This job will: remove the .dev version, commit and tag, and then bump the version and make a new commit to main. After this is done, the new .dev version will be bumped in main, and a tag will have been created and pushed to GitLab.
- The creation of a new tag in GitLab will automatically launch a new pipeline that will build and publish a conda env artifact to your GitLab Project's Generic Package Registry. Go to CI/CD -> Pipelines and you should see a pipeline running for a tag commit titled something like 'Release version 0.15.0'. This is the pipeline that will make a GitLab release and publish the conda env.
Once the release tag pipeline finishes, you should have a new GitLab Release as well as a conda dist env artifact published in your project's Generic Package Registry.
Deploying your conda env artifact for use by Airflow
Go to Packages & Registries -> Package Registry and you should see a list of all the conda env artifacts. To deploy one so that Airflow can use it, declare the artifact in your airflow-dags instance's artifact config file.
Example: I want to use example-job-project 0.15.0 conda env artifact. At https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/packages/113, I can copy the URL for the .tgz artifact file. I then use this URL when I declare the artifact in e.g. analytics/config/artifacts.yaml:
artifacts:
  # ...
  example-job-project-0.15.0.conda.tgz:
    id: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/package_files/487/download
This will then allow me to use the airflow-dags dag_config.artifact to refer to this artifact by name in my DAG code:
# STILL WIP!
from airflow import DAG

from analytics.config import dag_config
from wmf_airflow_common.operators.spark import SparkSubmitOperator

with DAG(
    # ...
) as dag:
    etl = SparkSubmitOperator.for_virtualenv(
        # This will be translated to a cached URL (in HDFS) accessible by Airflow.
        # By default, the alias name of the extracted archive directory will be 'venv'.
        virtualenv_archive=dag_config.artifact('example-job-project-0.15.0.conda.tgz'),
        # This should be a relative path to the pyspark job entrypoint in the archive.
        # Note that this needs to end in .py if it is really a pyspark job!
        application='bin/pyspark_job_file.py',
    )
Spark and Conda
TODO
GitLab CI UI test integration
GitLab CI has the ability to integrate test coverage and reporting in its UI.
For pytest reporting, make sure your pytest job outputs a junitxml format report by adding a flag like --junitxml=junit_pytest_report.xml. Then, declare a junit report artifact in your test job that points at this file.
For coverage, add a --cov-report=xml flag to your pytest command. Then, declare a cobertura report artifact in your test job that points at the generated coverage.xml file.
Full example:
In your setup.cfg [tool:pytest] section, or in your pytest.ini file:
# Coverage and junit XML report formats are output for use with GitLab CI UI.
addopts = -svv --failed-first --cov-report=xml --cov-report=term --cov=example_job_project --junitxml=junit_pytest_report.xml tests example_job_project
Then, in your .gitlab-ci.yml file in your test job:
test:
  stage: test
  script:
    # pytest, tox, whatever you prefer.
    - pytest
  # Match coverage total from job log output.
  # See: https://docs.gitlab.com/ee/ci/yaml/index.html#coverage
  # This is what allows for use of the GitLab coverage badge.
  coverage: '/^TOTAL.+?(\d+\%)$/'
  # Add these artifacts to integrate with MR and Pipeline UIs.
  artifacts:
    when: always
    reports:
      # This shows test reports in the Pipeline test tab.
      junit: junit_pytest_report.xml
      # This shows coverage information in Merge Request diffs.
      cobertura: coverage.xml