Fundraising/techops/docs/analytics stack/how to guides
How to add environment variables to dagster
The dagster user sources its environment variables from fran2001:/etc/default/dagster. This file is created by puppet and its content comes from frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb. dagster.env.erb is in turn parameterized by puppet with values located in the dagster class in frpm1002:/etc/puppet/modules/dagster/manifests/init.pp. Non-secret values are defined directly in this class, while secrets are defined as empty strings and inherit their values from puppet-private.
To add a new non-secret dagster environment variable, sample_var, follow these instructions:
- in the dagster class at the top of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp, define the type, name, and value of the variable: `String $sample_var = 'fun_value',`
- in frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb, define the environment variable and its value, following the parameterization syntax: `SAMPLE_VAR="<%= @sample_var -%>"`
  - Note that this will usually be the same as the variable name in init.pp, but uppercase. However, there are some cases where these won't be the same. E.g. dlt uses very specifically named variables that are a mouthful and not necessarily descriptive, like `DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL` to represent the minIO endpoint url. In this case, the puppet variable is named `dlt_minio_endpoint_url`; simpler names make it easier to reason about what a variable does and to grep/discover variables in puppet. Here the dagster.env.erb entry looks like: `DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL="<%= @dlt_minio_endpoint_url -%>"`
- after pushing and merging these files into the puppet repo, go to the dagster host, fran2001, and run puppet: `sudo puppet agent -tv`
- verify that the variable was set correctly by looking for it in /etc/default/dagster
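Putting the non-secret steps together, the two file changes might look like this (using the sample_var example above; a sketch, not a verbatim diff):

```
# frpm1002:/etc/puppet/modules/dagster/manifests/init.pp, in the dagster class:
String $sample_var = 'fun_value',

# frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb:
SAMPLE_VAR="<%= @sample_var -%>"
```

After the next puppet run, fran2001:/etc/default/dagster should contain SAMPLE_VAR="fun_value".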
The process for adding secret env vars to dagster is similar but has some extra steps. To add a new secret dagster environment variable, say an api_key for cool_new_tool (a new BI tool), we need to follow the above instructions but also update the puppet-private repo.
- in the dagster class at the top of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp, define the type and name of the secret var, but leave the value as an empty string, `''`. This value will automatically be inherited.
- in frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb, define the environment variable and its value, following the parameterization syntax: `SAMPLE_VAR="<%= @sample_var -%>"`
- each tool or service in puppet-private has its own provider file ending in .pp. If we're adding a secret for a new tool, we'll need to create this file at frpm1002://puppet-private/manifests/passwords/cool_new_tool.pp.
  - Rather than code the boilerplate yourself, it's easier to copy an existing file like metabase.pp and replace metabase with cool_new_tool.
- to add the api_key var, your cool_new_tool.pp should look like:

# vim: set ts=4 et sw=4:
class private::passwords::cool_new_tool {
    $secret = {
        'api_key' => 'super_secret_value',
    }
}
- That's all we need to do in puppet-private, so once you're happy with your changes, add, commit, and push them to the repo. Then merge the changes by running `sudo puppet-merge private` on frpm1002. After merging, your variables will be accessible in the puppet repo!
- Near the top of frpm1002:/etc/puppet/modules/role/manifests/analytics.pp, add an `include` statement alongside the other statements: `include private::passwords::cool_new_tool`
- Further down in the same file, find the dagster class and add your secret's name and value, following the formatting of the other entries:
cool_new_tool_api_key => $private::passwords::cool_new_tool::secret['api_key'],
- that's it! commit and push the changes, then merge them and, on fran2001, sync puppet: `sudo puppet agent -tv`
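For reference, here are all the secret-side pieces from the steps above collected in one place (super_secret_value is a placeholder):

```
# frpm1002://puppet-private/manifests/passwords/cool_new_tool.pp:
class private::passwords::cool_new_tool {
    $secret = {
        'api_key' => 'super_secret_value',
    }
}

# frpm1002:/etc/puppet/modules/role/manifests/analytics.pp, near the top:
include private::passwords::cool_new_tool

# ...and further down, inside the dagster class:
cool_new_tool_api_key => $private::passwords::cool_new_tool::secret['api_key'],
```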
How to add new data sources in Trino
To add a new table into Trino from e.g. dlt, create the table in the Trino 'hive' catalog. This catalog uses the Trino hive connector to interact with data in minIO. The minIO path to use is dlt/raw_data/google_sheets/ and your DDL might look like:
create table "hive"."raw_google_sheets"."budget_and_forecast" (
fiscal_year VARCHAR,
month_received_sorted VARCHAR,
budget_category VARCHAR,
country_name VARCHAR,
budget DOUBLE,
reforecast DOUBLE,
_dlt_load_id VARCHAR,
_dlt_id VARCHAR
) WITH ( format = 'PARQUET', external_location = 's3a://dlt/raw_data/google_sheets/seed__budget_and_forecast/');
Note the external_location config and how it matches the minIO path. Then, the analytics engineers will be able to create a dbt source from this data and dbt will handle rewriting it in iceberg format. Also note that dlt adds two metadata columns to the end of the data, _dlt_load_id and _dlt_id; be sure to include those in the DDL.
Adding a new data source from google sheets
As of 2025-08-15, all google sheets data is ingested into Trino from this gsheet. To add a new table to Trino from gsheets:
- add the data to a new tab in the gsheet
- set the data type of the first non-header row to plain text. this will prevent google sheets from truncating any decimals
- in dagster, under the Deployment section, click 'Reload all'. This triggers dlt to look at the gsheet, find the new sheet you've added, and register it as a dagster asset
- in dagster, run the daily_google_sheets_job. Note that if you get an error on the first run you may need to run the job again.
- finally, in trino, run ddl to create the table like in the example above
- the data should now be accessible as a source you can use in dbt, or you can just query the data directly in trino
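Once the job has run and the trino table exists, a dbt source entry for it might look like the following sketch (names follow the DDL example above; the actual sources file layout is up to the analytics engineers):

```yaml
version: 2

sources:
  - name: raw_google_sheets
    database: hive            # the trino catalog
    schema: raw_google_sheets
    tables:
      - name: budget_and_forecast
```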
Adding new columns to a google sheet source
The only thing you need to do when adding new columns to the end of a google sheet is drop the table in trino and re-run the create table statement with the new columns. You can get the ddl in trino by running SHOW CREATE TABLE <tbl>, e.g. show create table hive.raw_google_sheets.sheets__asana_main_fundraising_calendar. Take the output of this, add your column to the end, run it, and the data should now be available in trino!
I recommend setting all columns to VARCHAR to avoid google sheets formatting issues
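The column-addition flow can be sketched as follows (new_column is a hypothetical name; dropping a table backed by external_location only removes trino's metadata, not the parquet files in minIO):

```sql
-- 1. grab the current DDL
SHOW CREATE TABLE hive.raw_google_sheets.sheets__asana_main_fundraising_calendar;

-- 2. drop the table (the underlying data in minIO is untouched)
DROP TABLE hive.raw_google_sheets.sheets__asana_main_fundraising_calendar;

-- 3. re-run the DDL from step 1 with the new column appended to the column list
CREATE TABLE hive.raw_google_sheets.sheets__asana_main_fundraising_calendar (
    -- ...existing columns from the SHOW CREATE TABLE output...
    new_column VARCHAR
) WITH ( ... );  -- keep the format and external_location from the SHOW CREATE TABLE output
```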
How to add new python packages to the dagster virtual environment
Before adding a new python package to the dagster venv, we first need to add that package to the PyPI allowlist, otherwise we'll get a 500 error when trying to install it. Changes should be tested in virtualbox; here I'll outline the process that works for both vb and prod.
The pypi allowlist is in frpm1002:/etc/puppet/hieradata/external/pypi.yaml and its entries are regex strings matching the package name, semver, and file extension. It's usually enough to copy an existing line and edit in your package name. After updating the allowlist we still need to get the package installed. Puppet manages the dagster venv and installs packages listed in the pip::virtualenv section of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp. Once the package is added, let's try installing the dagster venv!
Puppet will only recreate the dagster venv if the venv isn't already present, i.e. if /srv/dagster/ doesn't exist. So to reinstall the venv, make sure you move/back up that dir first, e.g. mv /srv/dagster /srv/dagster.20250715. Then run puppet agent -tv and puppet will attempt to recreate the dagster venv. Note that sometimes your package will have dependencies that also need to be added to the allowlist. I don't have a great way of dealing with this other than repeating the process of adding packages, trying to install the venv, and adding the packages that failed ¯\_(ツ)_/¯. You'll know this is happening because puppet will output an error like:
Notice: /Stage[main]/Dagster/Pip::Virtualenv[/srv/dagster]/Exec[venv_packages /srv/dagster]/returns: ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='frpm1002.frack.eqiad.wmnet', port=443): Max retries exceeded with url: /packages/14/4b/ead00905132820b623732b175d66354e9d3e69fcf2a5dcdab780664e7896/google_api_core-2.25.1-py3-none-any.whl.metadata (Caused by ResponseError('too many 500 error responses'))
This error means that the google_api_core package is not on the allowlist.
How to pin python package versions in virtual environments
You can pin package versions in the virtualenv's puppet provider file with the usual pip syntax: lib==1.0.0. You can also specify versions with >= and <=, but beware that using these operators requires extra quoting so they don't get interpreted by puppet as output operators. This example illustrates how to pin deepdiff>=8.6.1; note the extra double quotes (").
pip::virtualenv { $virtualenv_dir:
    packages => [
        '"deepdiff>=8.6.1"',
        'google-api-python-client',
    ],
}
How to use dbt defer when developing models
the dbt defer flag lets you reference a different dbt manifest json file in your dbt commands. As a reminder, the manifest file contains virtually ALL details of a dbt project and its most recent runs. This means that e.g. when you want to develop and test a model, you don't need to already have the upstream dependencies in your dev schema; you can defer missing models to prod data instead.
Say you're developing a new_model that depends on model_x, but you haven't rebuilt model_x in a long time or it doesn't even exist in your dev schema. That's ok! If you use dbt run -s "new_model" --defer --state /srv/dagster_data/dbt_target, then instead of giving you an error like Compilation Error in model... relation "iceberg"."dev_astein_core_intermediate"."model_x" does not exist and no default value was provided, dbt will just use "iceberg"."core_intermediate"."model_x" instead!
No special setup is needed to use defer and it works for many dbt commands; just add --defer --state /srv/dagster_data/dbt_target to your dbt command! By default, dbt will only defer to prod models if the model doesn't exist for your local profile. If you want to use prod models even when they exist in your dev schema, add --favor-state to the end of your command. E.g.: dbt run --select goals --defer --state /srv/dagster_data/dbt_target --favor-state
How to fix new dbt models not showing in dagster
as root:
source /srv/dagster/bin/activate    # activate the dagster venv
set -a                              # auto-export variables set by the next source
source /etc/default/dagster         # load dagster's environment variables
cd /srv/analytics/dagster/fundraising_analytics_dbt
dbt compile --target prod           # regenerate the manifest so dagster picks up the new models
How to set dlt configs with environment variables
TODO
How to add metabase environment variables
Metabase environment variables are defined in puppet/modules/metabase/templates/metabase.env.erb
How to add a new package (duckdb)
To add a new package or program, there are a bunch of steps:
- create a puppet pg (package_getter) provider and update the package_getter conf
- create a deb package template
- whitelist the download url
- configure which hosts the package will be available on
- build the package
- install the package on hosts
You can also check the prod frpm puppet commits from 2025-12-12 to see *most* of the duckdb changes. They're just missing the initial package template in `~/packages/package/`.
Steps
- Compile the info you'll need: the package's download url, executable name, and file type. E.g. duckdb's download url is https://github.com/duckdb/duckdb/releases/download/{version}/duckdb_cli-linux-amd64.gz and the gzip contains a single binary executable named "duckdb"
- Whitelist the download url. If the package comes from github you just need to add it to the github nginx config in modules/role/manifests/build.pp: add it to the end of `$location_pattern` and make sure it's separated from the previous entries with a `|`. If the package comes from elsewhere, you'll need to whitelist that site. This should be done first so you can actually test your build_package logic; otherwise you won't even be able to download the source file.
- TODO: add instructions to fill out the package template in `~/packages/{pkg}`
- in puppet, create some providers and update some configs:
  - modules/base/manifests/package/duckdb.pp: this file tells puppet how apt should handle the file, makes sure it gets installed, and pins it to apt. You can copy this file and bulk-rename duckdb to your package.
  - modules/role/manifests/analytics.pp: this file contains general config for the analytics role. The only change we need to make here (assuming no other config values like secrets are needed for the program) is to add an `include base::package::duckdb` (substituting your package name). This makes the duckdb provider run as part of the analytics role class.
  - modules/package_getter/manifests/init.pp: this file contains, among other things, config for pg (package_getter), the script we use as part of the package build process. This is also where we put basic config values for our package that get passed to the build_package process. We just need to add a new entry under `$projects`: include the base download url (not including the tag or version), the github project and repo, and the plugin name. The plugin name will probably just be the name of the program; it tells pg to pull the program's specific {plugin}.py helper. E.g. setting this to 'duckdb' tells pg to use package_getter/pg/duckdb.py for instructions on preparing the duckdb package. Well, technically, puppet uses these values to populate package_getter.conf with each package's config.
  - modules/package_getter/files/pg/duckdb.py: this file defines the logic for building the tarball that gets built into a package, provides input data for the build_package cli when it shows you the available versions, and orchestrates the steps from downloading the package to actually prepping the tarball. See this debian tutorial for helpful details on the packaging process and needed files.
    - note that in the duckdb case, the versions are prepended with 'v'. build_package wants a numeric version, so we've defined a `tag_to_rel` function that takes the available versions and strips the 'v'.
    - the `available_releases()` function checks the available package versions in the remote repo (e.g. github), formats them as a dict with the version number and changelog description, and adds them to the args used by build_package. Note that it calls the `tag_to_rel` function defined in the last step.
    - the `prep_source` function orchestrates the steps required to download the source file from the remote repo and tar it up! This function isn't usually identical between package plugin files, but there are overlapping steps/functions that get called. At a minimum, prep_source should:
      - create (and if necessary, empty) the dirs that will hold the package
      - download the file from the remote repo to a local dir
      - if necessary, extract the files from the compressed download. The duckdb download is a gzip archive containing a single file named 'duckdb', so this function downloads the gz and then extracts the binary
        - note that the location fetch_file or the gz extract function downloads the remote file to should match what's in ~/packages/{package}/debian-1/install
      - call `prep_orig_tarball`, which just tars up the file or dir
    - `fetch_and_build()` calls the above functions and then calls `create_and_import_package()`, which calls these and other functions to build the package and push it to reprepro
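To make the plugin pieces concrete, here's a minimal python sketch of the version-handling functions described above. The function names come from the text, but the signatures and the shape of the tag data are assumptions; copy the real duckdb.py for the exact interface pg expects.

```python
def tag_to_rel(tag):
    """Strip the leading 'v' from a github tag so build_package gets a
    numeric version, e.g. 'v1.2.3' -> '1.2.3'."""
    return tag[1:] if tag.startswith('v') else tag


def available_releases(tags):
    """Format remote tags as a dict of version number -> changelog description
    for build_package. `tags` is a hypothetical stand-in for whatever pg
    fetches from the remote repo (here, github-release-style dicts)."""
    return {tag_to_rel(t['tag_name']): t.get('body', '') for t in tags}
```

For example, available_releases([{'tag_name': 'v1.4.0', 'body': 'bugfixes'}]) returns {'1.4.0': 'bugfixes'}.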
TODO: add more instructions on actually building, installing, and deploying