Data Engineering/Systems/Airflow/Developer guide/Python Job Repos

This page is a tutorial on how to design a Python-based job repository in GitLab that publishes job artifacts which can be scheduled and launched by Airflow. There is also an example GitLab repository that follows all of these recommendations: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project.

Overview

We intentionally want to separate job logic from scheduling logic. A job should be standalone and parameterized so that, given specific inputs, it produces certain outputs. Airflow is a scheduler that runs the job with the input parameters for a particular run, usually based on timestamps or incoming data.
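
For example, here is a minimal sketch of what such a standalone, parameterized job entrypoint might look like. The file name matches the entrypoint used in the DAG example further down; the parameters and logic are purely illustrative.

# bin/pyspark_job_file.py - a hypothetical job entrypoint; parameter names and logic are illustrative only.
import argparse


def run(input_path: str, output_path: str, run_date: str) -> None:
    # Job logic lives here: read from input_path, transform, and write results to output_path.
    ...


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Standalone, parameterized job.')
    parser.add_argument('--input-path', required=True)
    parser.add_argument('--output-path', required=True)
    parser.add_argument('--run-date', required=True, help='Passed by the scheduler for each run.')
    args = parser.parse_args()
    run(args.input_path, args.output_path, args.run_date)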

A job repository can be used to specify all the dependencies and logic needed to run a job. In order for the Airflow Scheduler to launch the job, it needs to be able to access the job code and dependencies somewhere.

Data Engineering has implemented reusable GitLab CI pipelines to automate the generation of job 'artifacts', as well as tooling to deploy these artifacts so that Airflow can access them.

As of 2022-04, the CI pipelines focus on Python-based jobs (or anything that uses conda environments), but the artifact deployment can work with any kind of artifact file (zip files, jars, compiled binaries, etc.).

GitLab Job Repository Setup

Python package setup

You must minimally have the following:

  • A conda-environment.yaml file that specifies, at minimum, the Python version (a fuller sketch is shown just after this list):
dependencies:
  - python=3.7
  • A pip-installable project setup, e.g. pyproject.toml, setup.cfg, setup.py, etc. I.e. pip install . in your project directory will work.
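
For reference, a slightly fuller conda-environment.yaml sketch. This assumes the standard conda environment file format, so other conda packages your job needs can presumably be listed alongside the Python version; the extra entries below are purely illustrative.

dependencies:
  - python=3.7
  # Illustrative extras: any additional conda packages the job needs.
  - pip
  - pyarrow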

Optionally, to use automated releases, you should use bump2version to manage your package version. If you use setup.cfg to manage your package version, then you need a .bumpversion.cfg file as follows:

[bumpversion]
current_version = 0.1.0.dev
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z0-9]+))?
serialize = 
	{major}.{minor}.{patch}.{release}
	{major}.{minor}.{patch}

[bumpversion:part:release]
optional_value = unused
values = 
	dev
	unused

[bumpversion:file:setup.cfg]
search = version = {current_version}
replace = version = {new_version}

When you first create this file, make sure that the current_version ends in ".dev", and that the version in setup.cfg matches this exactly.
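
For reference, here is a minimal setup.cfg sketch whose version matches the .bumpversion.cfg above, and which also covers the pip-installable requirement. The package name is illustrative; it assumes setuptools declarative config, with a one-line setup.py shim (from setuptools import setup; setup()) so that pip install . works.

[metadata]
name = example_job_project
version = 0.1.0.dev

[options]
packages = find: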

GitLab CI setup

If you are choosing to use automated release versioning, then your .gitlab-ci.yml file should contain the following.

# Include conda_artifact_repo.yml to add release and conda env publishing jobs.
include:
  - project: 'repos/data-engineering/workflow_utils'
    ref: v0.10.0
    file: '/gitlab_ci_templates/pipelines/conda_artifact_repo.yml'


ALTERNATIVELY, if you choose not to use automated release versioning, then you include just the publish_conda_env job directly:

# Include just the publish_conda_env job.
# This does not include automated releasing, so you will need to either manually
# run the publish_conda_env job, or manually push tags to trigger the
# publish_conda_env job.
include:
  - project: 'repos/data-engineering/workflow_utils'
    ref: v0.10.0
    file: '/gitlab_ci_templates/jobs/publish_conda_env.yml'

Automated Release GitLab Project Setup

If you choose to use automated releasing, you'll need to allow GitLab CI to push commits as follows:

  1. create a project access token
    • on the left sidebar, go to Settings > Access Tokens
    • enter a token name like gitlab-ci
    • set or delete the expiration date
    • select the Maintainer role. This makes sure that GitLab CI can push to your main branch, where a Maintainer can write by default
    • tick the api and write_repository scopes
    • click the create button and copy the generated token
  2. set CI variables
    • on the left sidebar, go to Settings > CI/CD, expand Variables and add the following key-value pairs:
      • CI_PROJECT_PASSWORD - paste the project access token. Tick the Mask variable flag.

How to Deploy

This is a step-by-step tutorial for deploying a new release to Airflow; feel free to read the following sections for details.

  1. On the left sidebar, go to CI/CD > Pipelines
  2. Click on the play button and select trigger_release
  3. On the left sidebar, go to Packages and registries > Package Registry
  4. Click on the first item, right-click the asset file name, and copy the URL
  5. Create a branch of airflow-dags
  6. Update the artifact URL in your Airflow instance's artifact config (e.g. artifacts.yaml)
  7. Merge into main
  8. Deploy the DAGs (deployment permissions required):
me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/${YOUR_AIRFLOW_INSTANCE}/
me@deploy1002:~$ scap deploy

Replace ${YOUR_AIRFLOW_INSTANCE} with one of the available instances.
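
For example, assuming your DAGs are deployed to the analytics instance (the same instance used in the artifact declaration example below):

me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/analytics/
me@deploy1002:~$ scap deploy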

Job Repository Conda Env Artifact Publishing

Assuming you are using automated releases and you've followed all the setup instructions above, to publish a conda env job artifact you'll do the following.

Development

You can publish a .dev version of your conda env from any commit on the main branch. To do so, from a commit's pipeline, manually run the publish_conda_env job. This will publish a conda env to your project's Generic Package Registry.

Releasing

  1. Make sure the changes you want to deploy are merged into your main branch.
  2. Go to CI/CD -> Pipelines -> Run Pipeline (blue button in the upper right)
    1. The only variable here you might want to edit is POST_RELEASE_VERSION_BUMP. This is the part of the semantic version to bump after releasing. Allowed values are major, minor and patch. Default is minor.
    2. Click Run Pipeline at the bottom. This will launch a new pipeline for the latest commit on your main branch.
  3. Go to CI/CD -> Pipelines and click on the pipeline you just launched.
  4. Once any tests have finished, you should be able to manually run the trigger_release job. This job will remove the .dev suffix from the version, commit and tag the release, and then bump the version and make a new commit to main. After this is done, main will contain the new .dev version, and the release tag will have been created and pushed to GitLab. For example, 0.15.0.dev would be released and tagged as 0.15.0, and main would move on to 0.16.0.dev (assuming the default minor bump).
  5. The creation of a new tag in GitLab will automatically launch a new pipeline that builds and publishes a conda env artifact to your GitLab project's Generic Package Registry. Go to CI/CD -> Pipelines and you should see a pipeline running for a tag commit titled something like 'Release version 0.15.0'. This is the pipeline that will make a GitLab release and publish the conda env.

Once the release tag pipeline finishes, you should have a new GitLab Release as well as a conda dist env artifact published in your project's Generic Package Registry.

Deploying your conda env artifact for use by Airflow

Go to Packages & Registries -> Package Registry and you should see a list of all the conda env artifacts. To deploy an artifact so that Airflow can use it, you should declare it in your airflow-dags instance's artifact config file.

Example: I want to use the example-job-project 0.15.0 conda env artifact. At https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/packages/113, I can copy the URL for the .tgz artifact file. I then use this URL when I declare the artifact in e.g. analytics/config/artifacts.yaml:

artifacts:
# ...
  example-job-project-0.15.0.conda.tgz:
    id: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/package_files/487/download

This will then allow me to use the airflow-dags dag_config.artifact to refer to this artifact by name in my DAG code:

# STILL WIP!

from airflow import DAG

from analytics.config import dag_config
from wmf_airflow_common.operators.spark import SparkSubmitOperator

with DAG(
    # ...
) as dag:
    etl = SparkSubmitOperator.for_virtualenv(
        # This will be translated to a cached URL (in HDFS) accessible by Airflow.
        # By default, the alias name of the extracted archive directory will be 'venv'.
        virtualenv_archive=dag_config.artifact('example-job-project-0.15.0.conda.tgz'),

        # This should be a relative path to the pyspark job entrypoint in the archive.
        # Note that this needs to end in .py if it is really a pyspark job!
        application='bin/pyspark_job_file.py',
    )

Spark and Conda

TODO

GitLab CI UI test integration

GitLab CI has the ability to integrate test coverage and reporting in its UI.

For pytest reporting, make sure your pytest job outputs a JUnit XML format report by adding a flag like --junitxml=junit_pytest_report.xml. Then, add a junit report artifact to your test job that points at this file.

For coverage, add a --cov-report=xml flag to your pytest command; this writes a coverage.xml file. Then, add a cobertura report artifact to your test job that points at this file.

Full example:

In your setup.cfg [tool:pytest] section, or in your pytest.ini file:

# Coverage and junit XML report formats are output for use with GitLab CI UI.
addopts = -svv --failed-first --cov-report=xml --cov-report=term --cov=example_job_project --junitxml=junit_pytest_report.xml tests example_job_project

Then, in your .gitlab-ci.yml file in your test job:

test:
  stage: test
  script:
    # Run your tests however you prefer, e.g.:
    - pytest

  # Match coverage total from job log output.
  # See: https://docs.gitlab.com/ee/ci/yaml/index.html#coverage
  # This is what allows for use of the GitLab coverage badge.
  coverage: '/^TOTAL.+?(\d+\%)$/'
  
  # Add these artifacts to integrate with MR and Pipeline UIs.
  artifacts:
    when: always
    reports:
      # This shows test reports in the Pipeline test tab
      junit: junit_pytest_report.xml
      # This shows coverage information in Merge Request diffs.
      cobertura: coverage.xml