Data Engineering/Systems/Airflow/Airflow testing instance tutorial

This page explains how to create your own Airflow instance on a stats machine. You can use it to test the DAGs that you are developing before merging them into the code base.

Creating your own Airflow instance

1. All steps in this tutorial assume you are logged in to your preferred stats machine via SSH.

ssh stat1007.eqiad.wmnet

2. Also, make sure at all times that your Kerberos authentication ticket is fresh. You'll only be able to execute tests in Airflow for as long as your ticket is valid, so consider renewing it for long-running tests.

kinit
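
If you are unsure whether your current ticket is still valid, you can check its expiration time and re-run kinit before starting a long test:

klist
# check the "Expires" column of the ticket list; re-run kinit if it is about to expire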

Installing Airflow

1. Make sure you have a dedicated directory for Airflow in your home folder. It should contain a subfolder named dags, where you will put your DAG files.

mkdir -p ~/airflow/dags

2. Set the environment variable AIRFLOW_HOME to your Airflow folder. This tells Airflow where to set up its configuration and database files, and where to find your DAG files.

export AIRFLOW_HOME=~/airflow

3. Change directory to your Airflow folder.

cd ~/airflow

4. Create a Python virtual environment. This lets you install all the Python packages Airflow requires without altering any other Python environments you may have.

python3 -m venv venv/

5. Activate the Python virtual environment. This sets some environment variables that control which Python executable and packages are used, and changes your command line prompt to show the active environment.

source venv/bin/activate
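
You can confirm the virtual environment is active by checking which Python executable is in use:

which python
# should point to the venv, e.g. ~/airflow/venv/bin/python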

6. Make sure the https_proxy environment variable is set, so that you can download Python packages from the internet.

export https_proxy=http://webproxy.eqiad.wmnet:8080

7. Install all required Python packages. Note that Airflow is installed together with its HDFS, Hive and Kerberos extras. The flask-admin version needs to be pinned to 1.4.0, because newer versions break when spinning up the Airflow web server (as of 2020-04-15).

pip install wheel
pip install hmsclient
pip install apache-airflow[hdfs,hive,kerberos]
pip install flask-admin==1.4.0
pip install pyarrow

8. Execute Airflow's db init command. Airflow will create a SQLite database file, a logs folder and a config file, all under your Airflow directory. The installation is finished at this point.

airflow db init
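
To double check the installation, you can print the Airflow version and list the files that were just created (exact file names may vary slightly between Airflow versions):

airflow version
ls ~/airflow
# expect airflow.cfg, the SQLite database file, a logs folder and your dags folder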


Configuring Airflow

1. If you just installed Airflow following the previous section, skip this step. If not, make sure your environment is set up correctly.

export AIRFLOW_HOME=~/airflow
cd ~/airflow
source venv/bin/activate

2. Obtain your Kerberos credentials cache path and principal. Execute Kerberos' klist command. You'll find the path to your credentials cache under Ticket cache: FILE:<path>, and your principal under Default principal.

klist
# copy your credentials cache path and your principal

3. Edit ~/airflow/airflow.cfg and assign the following configuration values.

# under the [core] section
load_examples = False
security = kerberos
# under the [kerberos] section
ccache = <your credentials cache path>
principal = <your principal>
reinit_frequency = 3600
kinit_path = kinit 

4. The Hive metastore connection needs to be configured from the Airflow UI. For that, spin up the Airflow web server. Use another port if 8080 is already taken.

airflow webserver -p 8080

5. On your local machine, create an SSH tunnel to the stats machine where you are running Airflow. Use the port that you specified when launching the web server.

ssh stat1007.eqiad.wmnet -L 8080:stat1007.eqiad.wmnet:8080

6. Open http://localhost:8080/connection/list/ (change the port if needed) in your browser, and click on the edit button for the connection with Conn Type = hive_metastore. Set the following configurations and save the changes:

Conn id = analytics-test-hive (the same string defined in your configuration, currently in config/dag_config.py)
Host = analytics-hive.eqiad.wmnet
Port = 9083 (This is the default port)
Extra = {"authMechanism": "GSSAPI"}

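Alternatively, the connection can be created from the command line instead of the UI. This is a sketch assuming the Airflow 2 connections CLI; it adds a new connection with the desired id rather than editing the default hive_metastore one:

airflow connections add 'analytics-test-hive' \
    --conn-type 'hive_metastore' \
    --conn-host 'analytics-hive.eqiad.wmnet' \
    --conn-port 9083 \
    --conn-extra '{"authMechanism": "GSSAPI"}'
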
7. You can now stop the Airflow web server on your stats machine. Airflow is now configured to access Hive. The steps followed so far don't need to be repeated: whenever you want to test an Airflow DAG, just jump to the next section.


Executing a DAG

1. If you just configured Airflow following the previous section, skip this step. If not, make sure your environment is set up correctly.

export AIRFLOW_HOME=~/airflow
cd ~/airflow
source venv/bin/activate

2. Execute the Airflow web server inside a screen/tmux session. This will spin up the Airflow UI. Use another port if necessary.

screen -S airflow_webserver
airflow webserver -p 8080

3. Execute the Airflow scheduler inside a screen/tmux session. This will spin up the service that executes the DAGs.

screen -S airflow_scheduler
airflow scheduler
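
If you are not used to screen: detach from a session with Ctrl-a d (the services keep running in the background), and reattach later to check on them:

screen -ls                     # list your screen sessions
screen -r airflow_webserver    # reattach to the web server session
screen -r airflow_scheduler    # reattach to the scheduler session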

4. On your local machine, create an SSH tunnel to the stats machine where you are running Airflow. Use the port that you specified when launching the web server. After that, you should be able to see Airflow's UI by opening http://localhost:8080/ (change the port if needed) in your browser.

ssh stat1007.eqiad.wmnet -L 8080:stat1007.eqiad.wmnet:8080

5. To add a new DAG to your Airflow instance, just scp the DAG's Python file to the dags folder on the corresponding stats machine.

scp dagFile.py stat1007.eqiad.wmnet:airflow/dags/dagFile.py

6. After a bit, you should see your new DAG under the DAGs tab in the Airflow UI (refresh the page). By default, new DAGs are turned off in Airflow, so for your DAG to run you need to turn it on using the ON/OFF toggle in the Airflow UI. You can access the DAG execution logs via the Airflow UI as well: open the detail page of your DAG and select the Tree View. You'll see small colored boxes that represent the executions of each of your DAG's tasks; click on one to access its corresponding logs.
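
You can also check and exercise your DAG from the command line on the stats machine, which is handy for quick iterations. This sketch assumes the Airflow 2 CLI; replace the DAG id, task id and date with your own. Note that tasks test runs a single task without recording any state in the database:

airflow dags list
airflow tasks test <dag_id> <task_id> 2021-01-01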

Kerberos Considerations

When your job needs a keytab (for example, a SimpleSkeinOperator), the following is an example that can work around the various hurdles in our environment. It assumes you have cloned your DAGs to ~/airflow-dags and have an Airflow instance at ~/airflow with dags_folder pointing at the DAGs you're testing:

REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt \
AIRFLOW_HOME=$HOME/airflow \
HOME=$AIRFLOW_HOME \
PYTHONPATH=$HOME/airflow-dags \
sudo --preserve-env=AIRFLOW_HOME,PYTHONPATH,HOME,REQUESTS_CA_BUNDLE \
    -u analytics-privatedata kerberos-run-command analytics-privatedata \
    /home/milimetric/.conda/envs/airflow_development/bin/airflow tasks test <<dag_id>> <<task_id>> <<start_date>>

TODO: figure out how Marcel updated run_dev_instance.sh and how the above should be updated