What is dbt and why do we use it?
dbt (data build tool) is an open source framework for modeling data. dbt helps data teams organize, transform, and document their data using code. When data is collected (from banner logs, apps like Zendesk, CRMs, etc.), it's often messy, hard to understand, and spread across many tables. Modeling is the process of cleaning, combining, and organizing that raw data into useful shapes so analysts or dashboards can make sense of it. dbt has the following features that make it great for modeling data:
- dbt encourages modular design: instead of one huge SQL script, you write small pieces (called "models") that build on each other. dbt understands how models depend on each other. This makes it easy to:
- only update what has changed instead of rerunning all transformations
- track data lineage end to end, so you can see how raw data is transformed into analysis-ready data. This makes it easier to debug and track down issues
- dbt helps ensure data quality through built-in testing, which verifies that data matches certain assumptions, like revenue always being >= 0, or that there are no duplicate donors in the database
- dbt makes it easy to automatically document models, providing a data dictionary so folks can understand the meaning of the data they're looking at
- dbt reduces boilerplate code needed to read and write data from the database, so the team can focus on implementing business logic
- dbt code is version controlled, which helps us ensure that updates can be deployed safely and rolled back quickly if needed
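To make the modular style concrete, here's a hypothetical pair of files (the model and column names are invented for illustration, not taken from our actual project). A model references another model with ref(), which is how dbt learns the dependency graph behind the lineage and selective-rebuild features above:

```sql
-- models/core/donors.sql (hypothetical example)
-- ref() declares a dependency on the stg_donations model, so dbt
-- knows to build stg_donations first and can track lineage.
select
    donor_id,
    donor_name,
    sum(amount) as total_revenue
from {{ ref('stg_donations') }}
group by donor_id, donor_name
```

A data quality assumption like "revenue is always >= 0" can then be written as a singular test: dbt runs the query during dbt test and fails the test if it returns any rows:

```sql
-- tests/assert_revenue_non_negative.sql (hypothetical example)
-- Fails if any donor has negative total revenue.
select donor_id, total_revenue
from {{ ref('donors') }}
where total_revenue < 0
```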
dbt & Dagster
We use our orchestrator, Dagster, to run dbt via the dagster-dbt Python package. This package automatically converts dbt models into Dagster assets, which makes it easier to integrate dbt with the other data sources managed by Dagster. For example, this integration lets us:
- Use Dagster's UI or APIs to run subsets of our dbt models
- Track failures, logs, and run history for individual dbt models or entire jobs
- Define dependencies between individual dbt models and other data assets. For example, run dbt models only after we load new Zendesk or Google Sheets data
Dagster relies on dbt's manifest.json to integrate dbt into Dagster. The manifest file contains a full representation of our dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
Generating the manifest.json file
The manifest file gets generated as part of our fundraising_code_update deployment script. If we choose to restart Dagster at the end of fundraising_code_update, the dagster systemd init script runs. That script runs dbt docs generate, which generates the manifest.json for Dagster to read.
dbt environment variables
Whether running in production or locally, dbt uses the following environment variables (in production, these live in /etc/default/dagster on the analytics host):
- DBT_USER: the Trino username used by dbt. In production, this value is dagster; for local development, it is the user's username. E.g. my (Avishua Stein) username is astein, and so is my Trino username.
- DBT_SCHEMA: the Trino schema dbt writes data to. We use an adjusted version of the generate_schema_name macro to generate this value. See the below section on dynamically generating schema names for more details.
- TRINO_PASSWORD: the Trino login password.
- DBT_TARGET_PATH: the target path dbt uses to write certain files like the manifest.json.
- DBT_LOG_PATH: the path dbt writes logs to, including the dbt.log file and individual run logs
Like most other analytics environment variables, dbt running in production inherits its environment variables from Puppet via the dagster.env.erb and dagster/manifests/init.pp files; the actual values can be found in /etc/default/dagster. When running dbt locally, these variables live in the ~/.analytics.env file in the user's home directory.
Dynamically generating schema names with the generate_schema_name macro
Imagine you are developing locally in dbt, adding some columns to the donors model. When you run dbt run, instead of overwriting the production data, the results get written to your named schema, e.g. astein. What is this magic and how does it work? I'll tell you! Our dbt setup uses the generate_schema_name macro to dynamically write data to different Trino schemas depending on who is running dbt and from where. The linked documentation runs through a basic example, but you can find our implementation on the analytics host under macros/generate_schema_name.sql.
Basically, generate_schema_name first checks whether dbt is running against our production profile (target = prod). If it is, we then check whether the model has a custom_schema_name. If it does, we use the custom_schema_name; if not, we use whatever the default schema is, probably the value of DBT_SCHEMA. Custom schema names are defined in our dbt_project.yml on lines starting with +schema. So these lines in the yml
core:
  +schema: core
  intermediate:
    +schema: core_intermediate
tells dbt to put all models in the core/intermediate/ subdirectory into the core_intermediate schema and anything else in the core/ directory into the core schema.
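As an aside, the same schema override can also be set inline in a single model file using dbt's standard config() call (this is a generic dbt feature, shown here with an invented model name as a sketch):

```sql
-- models/core/intermediate/example_model.sql (hypothetical)
-- Per-model equivalent of the +schema line in dbt_project.yml
{{ config(schema='core_intermediate') }}

select 1 as example_column
```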
If dbt isn't running in production, like when developing locally, generate_schema_name will concatenate the default schema (e.g. astein) with the custom_schema_name. So if Avishua Stein runs the .../core/donors.sql model locally, the data will get written to dev_astein_core.
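For intuition, here's a sketch of what such a macro can look like, modeled on the generate_schema_name example in dbt's documentation. This is not our exact code (see macros/generate_schema_name.sql on the analytics host); in particular, exactly where the dev_ prefix comes from is an assumption here:

```sql
-- macros/generate_schema_name.sql (illustrative sketch, not our implementation)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' -%}
        {#- production: use the model's custom schema if set, else the default -#}
        {%- if custom_schema_name is none -%}
            {{ default_schema }}
        {%- else -%}
            {{ custom_schema_name | trim }}
        {%- endif -%}
    {%- elif custom_schema_name is none -%}
        {#- dev, no custom schema: e.g. dev_astein -#}
        dev_{{ default_schema }}
    {%- else -%}
        {#- dev with a custom schema: e.g. dev_astein_core -#}
        dev_{{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```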
dbt-metabase Integration
We use the dbt-metabase package to sync data descriptions and definitions from dbt directly into Metabase. dbt-metabase is automatically run in the dbt-post-merge script whenever the analytics project is updated by fundraising_code_update. Users can also run dbt-metabase to sync their dev schemas between dbt and Metabase. To run dbt-metabase locally, do the following:
1. Check if you have a dbt-metabase-config.yml file in your home directory on fran2001. If you don't, you can copy the template below.
2. Activate the dagster virtual environment: source /srv/dagster/bin/activate
3. Optional: regenerate the dbt manifest.json by running, e.g., dbt parse
4. Run dbt-metabase: dbt-metabase --config-path ~/dbt-metabase-config.yml models. Note that you don't need to use a config file; you can instead pass arguments directly in the command. Check the package repo for examples.
5. If needed, update your model ymls and repeat steps 3 & 4. If needed, change the values in your dbt-metabase-config.yml.
dbt-metabase-config.yml template
config:
  # manifest_path should be $DBT_TARGET_PATH/manifest.json, but note that you need to provide the absolute path to the file.
  # e.g. instead of ~/target/manifest.json, use /home/user/target/manifest.json
  manifest_path: /home/astein/target/manifest.json # replace astein with your username. e.g. jmando, chudson, etc.
  metabase_url: http://127.0.0.1:9081
  metabase_api_key: <hidden> # ask an SRE or someone on analytics if you need this
  # check the dbt-metabase repo for more options
models:
  metabase_database: "trino" # name of the database in metabase to use, probably 'trino'. check https://metabase.frdev.wikimedia.org/admin/databases for options.
  include_schemas: "dev_astein_core" # list of schemas to sync. try not to sync over someone else's dev data