Fundraising/techops/docs/analytics stack/how to guides

From Wikitech

How to add environment variables to dagster

The dagster user sources its environment variables from fran2001:/etc/default/dagster. This file is created by puppet, and its content comes from frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb. dagster.env.erb is in turn parameterized by puppet with values located in the dagster class in frpm1002:/etc/puppet/modules/dagster/manifests/init.pp. Non-secret values are defined directly in this class, while secrets are defined as empty strings and inherited from puppet-private.

To add a new non-secret dagster environment variable, sample_var, follow these instructions:

  1. in the dagster class at the top of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp, define the type, name, and value of the variable: String $sample_var = 'fun_value',
  2. in frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb define the environment variable and its value, following the parameterization syntax: SAMPLE_VAR="<%= @sample_var -%>"
    1. Note that this will usually be the same as the variable name in init.pp, but uppercase. However, there are some cases where these won't be the same. E.g. dlt uses very specifically named variables that are a mouthful and not necessarily descriptive, like DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL to represent the minIO endpoint URL. In this case, the puppet variable is named dlt_minio_endpoint_url; simpler names make it easier to reason about what a variable does and to grep for/discover variables in puppet. The dagster.env.erb entry then looks like: DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL="<%= @dlt_minio_endpoint_url -%>"
  3. after pushing and merging these files into the puppet repo, go to the dagster host, fran2001, and run puppet: sudo puppet agent -tv
  4. verify that the variable was set correctly by looking for it in /etc/default/dagster
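For a quick sanity check of the resulting file, the KEY="value" format of /etc/default/dagster can be parsed with a few lines of Python. This is just an illustration; the sample file contents below are made up, not production values:

```python
# Sketch: parse a KEY="value" env file like /etc/default/dagster and
# check that a variable landed with the expected value.
# The sample contents are illustrative, not real production values.

def parse_env_file(text):
    """Parse simple KEY="value" lines into a dict, skipping comments/blanks."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = '''
# managed by puppet
SAMPLE_VAR="fun_value"
DESTINATION__FILESYSTEM__CREDENTIALS__ENDPOINT_URL="http://minio.example:9000"
'''

env = parse_env_file(sample)
print(env["SAMPLE_VAR"])  # fun_value
```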

The process for adding secret env vars to dagster is similar but has some extra steps. To add a new secret dagster environment variable, api_key, for a cool_new_tool like a new BI tool, we need to follow the above instructions but also update the puppet-private repo.

  1. in the dagster class at the top of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp, define the type and name of the secret var, but leave the value as an empty string, ''. This value will automatically be inherited from puppet-private.
  2. in frpm1002:/etc/puppet/modules/dagster/templates/dagster.env.erb define the environment variable and its value, following the parameterization syntax: COOL_NEW_TOOL_API_KEY="<%= @cool_new_tool_api_key -%>"
  3. each tool or service in puppet-private has its own provider file ending in .pp. If we're adding a secret for a new tool, we'll need to create this file in frpm1002://puppet-private/manifests/passwords/cool_new_tool.pp.
    1. Rather than coding the boilerplate yourself, it's easier to copy an existing file like metabase.pp and replace metabase with cool_new_tool
  4. to add the api_key var, your cool_new_tool.pp should look like:
    # vim: set ts=4 et sw=4:
    
    class private::passwords::cool_new_tool {
        $secret = {
            'api_key' => 'super_secret_value',
        }
    }
    
  5. That's all we need to do in puppet-private, so once you're happy with your changes, add, commit, and push them to the repo. Then merge the changes by running sudo puppet-merge private on frpm1002. After merging, your variables will be accessible in the puppet repo!
  6. Near the top of frpm1002:/etc/puppet/modules/role/manifests/analytics.pp add an include statement alongside the other statements: include private::passwords::cool_new_tool
  7. Further down in the same file, find the dagster class and add your secret's name and value, following the formatting of the other entries:
    1. cool_new_tool_api_key            => $private::passwords::cool_new_tool::secret['api_key'],
  8. that's it! commit and push the changes, merge them into the puppet repo, then on fran2001 run puppet to sync: sudo puppet agent -tv
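Conceptually, the pieces above fit together like this: puppet fills the erb template's placeholders from class parameters, and the secret parameter's value comes from the puppet-private secret hash. A rough Python sketch of that substitution (the values and the mini ERB regex are illustrative, not how puppet actually renders templates):

```python
import re

# Illustrative stand-ins for the puppet pieces described above.
non_secrets = {"sample_var": "fun_value"}            # defined directly in init.pp
private_secrets = {"api_key": "super_secret_value"}  # from cool_new_tool.pp in puppet-private

# The role class wires the secret into a class parameter (cf. analytics.pp).
params = dict(non_secrets)
params["cool_new_tool_api_key"] = private_secrets["api_key"]

template = 'COOL_NEW_TOOL_API_KEY="<%= @cool_new_tool_api_key -%>"'

# Substitute <%= @name -%> placeholders, loosely mimicking ERB.
rendered = re.sub(r"<%=\s*@(\w+)\s*-?%>", lambda m: params[m.group(1)], template)
print(rendered)  # COOL_NEW_TOOL_API_KEY="super_secret_value"
```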


How to add new data sources in Trino

To add a new table into Trino from e.g. dlt, create the table in the trino 'hive' schema. This schema is used by the Trino hive connector to interact with data in minIO. The minIO path to use is dlt/raw_data/google_sheets/ and your DDL might look like:

create table "hive"."raw_google_sheets"."budget_and_forecast" (
 fiscal_year VARCHAR,
 month_received_sorted VARCHAR,
 budget_category VARCHAR,
 country_name VARCHAR,
 budget DOUBLE,
 reforecast DOUBLE,
 _dlt_load_id VARCHAR,
 _dlt_id VARCHAR
 ) WITH ( format = 'PARQUET', external_location = 's3a://dlt/raw_data/google_sheets/seed__budget_and_forecast/');

Note the external_location config and how it matches the minIO path. Then, the analytics engineers will be able to create a dbt source from this data, and dbt will handle rewriting it in iceberg format. Also note that dlt adds two metadata columns to the end of the data, _dlt_load_id and _dlt_id; be sure to include those in the DDL.
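To make the metadata-column requirement concrete, here's a small sketch that builds DDL like the example above and always appends the two dlt columns. The build_ddl helper is hypothetical, not part of any tool mentioned here:

```python
# Sketch: build a Trino CREATE TABLE statement for a dlt-loaded table,
# always appending dlt's two metadata columns (_dlt_load_id, _dlt_id).

DLT_METADATA_COLUMNS = [("_dlt_load_id", "VARCHAR"), ("_dlt_id", "VARCHAR")]

def build_ddl(table, columns, external_location):
    cols = columns + DLT_METADATA_COLUMNS
    col_sql = ",\n ".join(f"{name} {typ}" for name, typ in cols)
    return (
        f"create table {table} (\n {col_sql}\n ) WITH ( format = 'PARQUET', "
        f"external_location = '{external_location}');"
    )

ddl = build_ddl(
    '"hive"."raw_google_sheets"."budget_and_forecast"',
    [("fiscal_year", "VARCHAR"), ("budget", "DOUBLE")],
    "s3a://dlt/raw_data/google_sheets/seed__budget_and_forecast/",
)
print(ddl)
```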

Adding a new data source from google sheets

As of 2025-08-15, all google sheets data is ingested into Trino from this gsheet. To add a new table to Trino from gsheets:

  1. add the data to a new tab in the gsheet
  2. set the data type of the first non-header row to plain text. This will prevent google sheets from truncating any decimals
  3. in dagster, under the Deployment section, click 'Reload all'. This triggers dlt to look at the gsheet, find the new sheet you've added, and register it as a dagster asset
  4. in dagster, run the daily_google_sheets_job. Note that if you get an error on the first run you may need to run the job again.
  5. finally, in trino, run ddl to create the table like in the example above
  6. the data should now be accessible as a source you can use in dbt, or you can just query the data directly in trino

Adding new columns to a google sheet source

The only thing you need to do when adding new columns to the end of a google sheet is drop the table in trino and re-run the create table statement with the new columns. You can get the DDL in trino by running SHOW CREATE TABLE <tbl>, e.g. show create table hive.raw_google_sheets.sheets__asana_main_fundraising_calendar. Take the output of this, add your columns to the end, run it, and the data should now be available in trino!

I recommend setting all columns to VARCHAR to avoid google sheets formatting issues.
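As an illustration of the edit, the sketch below takes SHOW CREATE TABLE-style output and splices a new column in before the closing parenthesis of the column list. It's naive, hypothetical string surgery (it assumes the keyword WITH appears exactly once, right after the column list), so review the result before running it in Trino:

```python
# Sketch: splice a new column into SHOW CREATE TABLE output, just before the
# closing parenthesis of the column list. Naive string surgery: it assumes
# WITH appears exactly once, right after the column list.

def add_column(ddl, name, typ="VARCHAR"):
    head, with_kw, tail = ddl.partition("WITH")
    close = head.rindex(")")          # close of the column list
    cols = head[:close].rstrip()
    return f"{cols},\n   {name} {typ}\n{head[close:]}{with_kw}{tail}"

sample_ddl = (
    "CREATE TABLE hive.raw_google_sheets.example (\n"
    "   a varchar,\n"
    "   b varchar\n"
    ")\n"
    "WITH (\n"
    "   format = 'PARQUET'\n"
    ")"
)

new_ddl = add_column(sample_ddl, "c")
print(new_ddl)
```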



How to add new python packages to the dagster virtual environment

Before adding a new python package to the dagster venv, we first need to add that package to the PyPI allowlist; otherwise we'll get a 500 error when trying to install it. Changes should be tested in virtualbox; here I'll outline the process that works for both vb and prod.

The pypi allowlist is in frpm1002://etc/puppet/hieradata/external/pypi.yaml and its entries are regex strings matching the package name, semver, and file extension. It's usually enough to copy an existing line and edit in your package name. After updating the allowlist, we still need to get the package installed. Puppet manages the dagster venv and installs packages listed in the pip::virtualenv section of frpm1002:/etc/puppet/modules/dagster/manifests/init.pp. Once the package is added, let's try installing the dagster venv!

Puppet will only recreate the dagster venv if the venv isn't already present, i.e. if /srv/dagster/ doesn't exist. So to reinstall the venv, make sure you move/back up that dir to another dir, e.g. mv /srv/dagster /srv/dagster.20250715. Then, run puppet agent -tv and puppet will attempt to recreate the dagster venv. Note that sometimes your package will have dependencies that also need to be added to the allowlist. I don't have a great way of dealing with this other than repeating the process of adding packages, trying to install the venv, and adding the packages that failed ¯\_(ツ)_/¯. You'll know this is happening because puppet will output an error like:

Notice: /Stage[main]/Dagster/Pip::Virtualenv[/srv/dagster]/Exec[venv_packages /srv/dagster]/returns: ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='frpm1002.frack.eqiad.wmnet', port=443): Max retries exceeded with url: /packages/14/4b/ead00905132820b623732b175d66354e9d3e69fcf2a5dcdab780664e7896/google_api_core-2.25.1-py3-none-any.whl.metadata (Caused by ResponseError('too many 500 error responses'))

This error means that the google_api_core package is not on the allowlist.
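Since the allowlist entries are regexes over package file names, it can help to test a candidate pattern locally before pushing it. The pattern below is illustrative; the exact entry format in pypi.yaml may differ:

```python
import re

# Hypothetical allowlist-style entry: package name, a dotted numeric version,
# and an allowed file extension. The real pypi.yaml pattern format may differ.
pattern = re.compile(r"^google[_-]api[_-]core-\d+(\.\d+)*.*\.(whl|tar\.gz)$")

assert pattern.match("google_api_core-2.25.1-py3-none-any.whl")
assert pattern.match("google-api-core-2.25.1.tar.gz")
assert not pattern.match("some_other_package-1.0.0.whl")
```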

How to pin python package versions in virtual environments

You can pin package versions in the virtualenv's puppet provider file with the usual pip syntax: lib==1.0.0. You can also specify versions with >= and <=, but beware that using these operators requires extra quoting so they don't get interpreted by puppet as output operators. This example illustrates how to pin deepdiff>=8.6.1. Note the extra double quotes ".

        pip::virtualenv { $virtualenv_dir:
            packages        => [
                '"deepdiff>=8.6.1"',
                'google-api-python-client',
            ],
        }

How to use dbt defer when developing models

The dbt defer flag lets you reference a different dbt manifest json file in your dbt commands. As a reminder, the manifest file contains virtually ALL details of a dbt project and its most recent runs. This means that, e.g., when you want to develop and test a model, you don't need to already have the upstream dependencies in your dev schema; you can defer missing models to use prod data instead.

Say you're developing a new_model that depends on model_x, but you haven't rebuilt model_x in a long time or it doesn't even exist in your dev schema. That's ok! If you use dbt run -s "new_model" --defer --state /srv/dagster_data/dbt_target, then instead of giving you an error like Compilation Error in model... relation "iceberg"."dev_astein_core_intermediate"."model_x" does not exist and no default value was provided, dbt will just use "iceberg"."core_intermediate"."model_x" instead!

No special setup is needed to use defer, and it works for many dbt commands. Just add --defer --state /srv/dagster_data/dbt_target to your dbt command! By default, dbt will only defer to models in prod if the model doesn't exist for your local profile. If you want to use prod models even if they exist in your dev schemas, add --favor-state to the end of your command. E.g.: dbt run --select goals --defer --state /srv/dagster_data/dbt_target --favor-state
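The resolution rule can be summarized in a few lines of simplified, hypothetical Python (this is not dbt's actual code, just the decision logic as described above):

```python
# Simplified sketch of dbt's defer resolution rule (not dbt's actual code):
# prefer the dev relation, fall back to the prod (state) relation when the
# model is missing from dev, and prefer prod outright under --favor-state.

def resolve_relation(model, dev_schema, prod_schema, defer=True, favor_state=False):
    in_dev = model in dev_schema
    if defer and (favor_state or not in_dev) and model in prod_schema:
        return f'"iceberg"."{prod_schema[model]}"."{model}"'
    if in_dev:
        return f'"iceberg"."{dev_schema[model]}"."{model}"'
    raise LookupError(f"{model} not found in dev or prod")

dev = {"new_model": "dev_astein_core_intermediate"}
prod = {"model_x": "core_intermediate", "new_model": "core_intermediate"}

print(resolve_relation("model_x", dev, prod))                      # deferred to prod
print(resolve_relation("new_model", dev, prod))                    # dev copy wins
print(resolve_relation("new_model", dev, prod, favor_state=True))  # prod wins
```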

How to fix new dbt models not showing in dagster

as root:

source /srv/dagster/bin/activate                      # activate the dagster venv
set -a                                                # auto-export variables set by the next source
source /etc/default/dagster                           # load dagster's environment variables
cd /srv/analytics/dagster/fundraising_analytics_dbt
dbt compile --target prod                             # regenerate the dbt manifest that dagster reads

How to set dlt configs with environment variables

TODO

How to add metabase environment variables

Metabase environment vars are defined in puppet/modules/metabase/templates/metabase.env.erb

How to add a new package (duckdb)

To add a new package or program, there are a bunch of steps:

  1. create a puppet pg/ package_getter provider and update the package_getter conf
  2. create a deb package template
  3. whitelist the download url
  4. configure which hosts the package will be available on
  5. build the package
  6. install the package on hosts

You can also check the prod frpm puppet commits from 2025-12-12 to see *most* of the duckdb changes. They're just missing the initial package template in `~/packages/package/`.

Steps

  1. Compile the info you'll need: the package's download url, executable name, and file type. E.g. duckdb's download url is https://github.com/duckdb/duckdb/releases/download/{version}/duckdb_cli-linux-amd64.gz and the gzip contains a single binary executable named "duckdb"
  2. Whitelist the download url. If the package comes from github, you just need to add it to the github nginx config in modules/role/manifests/build.pp: add it to the end of $location_pattern and make sure it's separated from the previous entries with a |. If it comes from elsewhere, you'll need to whitelist the site. This should be done first so you can actually test your build_package logic; otherwise you won't even be able to download the source file.
  3. TODO: add instructions to fill out the package template in `~/packages/{pkg}`
  4. in puppet- create some providers and update some configs:
    1. modules/base/manifests/package/duckdb.pp: this file tells puppet how apt should handle the file and to make sure it gets installed. you can copy this file and bulk rename duckdb to your package
    2. modules/role/manifests/analytics.pp: this file contains general config for the analytics role. The only change we need to make here (assuming no other config values like secrets are needed for the program) is to add an include base::package::duckdb statement; this makes the duckdb provider run as part of the analytics role class. The duckdb.pp file itself just contains logic to make sure the package gets installed by puppet and pinned in apt
    3. modules/package_getter/manifests/init.pp: this file contains, among other things, config for pg/ package_getter, the script we use as part of the package build process. This is also where we put basic config values for our package that get passed to the build_package process. We just need to add a new entry under $projects: include the base download url (not including the tag or version), the github project and repo, and the plugin name. The plugin name will probably just be the name of the program and tells pg to pull the program's specific {plugin}.py helper process; e.g. setting this to 'duckdb' tells pg to use package_getter/pg/duckdb.py for instructions on preparing the duckdb package. (Technically, puppet uses these values to populate package_getter.conf with each package's config.)
    4. modules/package_getter/files/pg/duckdb.py: this file defines logic for building the tarball that gets built into a package, provides input data for the build_package cli when it shows you the available versions, and orchestrates the steps from downloading the package to actually prepping the tarball. See this debian tutorial for helpful details on the packaging process and needed files
      1. note that in the duckdb case, the versions are prepended with 'v'. build_package wants a numeric version, so we've defined a tag_to_rel function that takes the available versions and strips the 'v'.
      2. the available_releases() function checks the available package versions in the remote repo (e.g. github), formats them as a dict with the version number and changelog description, and adds them to the args used by build_package. note that we call the tag_to_rel function defined in the last step.
      3. the prep_source function orchestrates the steps required to download the source file from the remote repo and tar it up! This function isn't usually identical between packages' plugin files, but there are overlapping steps/functions that get called. At a minimum, prep_source should:
        1. create (and if necessary, empty) the dirs that will hold the package
        2. download the file from the remote repo to a local dir
        3. if necessary, extract the files from the compressed download. the duckdb download is a gz dir containing a single file named 'duckdb', so this function downloads the gz and then extracts the binary
          1. note that the location that fetch_file (or the gz extract function) downloads the remote file to should match what's in ~/packages/{package}/debian-1/install
      4. calls `prep_orig_tarball` which just tars up the file or dir
      5. fetch_and_build() calls the above functions and then calls create_and_import_package() which calls these and other functions to build the package and push it to reprepro
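The tag_to_rel and available_releases helpers described above can be sketched as follows. The function names follow the text; the bodies are an illustration, not the real package_getter/pg/duckdb.py:

```python
# Sketch of the duckdb plugin's version helpers described above.
# build_package wants numeric versions, but duckdb tags look like 'v1.2.3'.

def tag_to_rel(tag):
    """Strip a leading 'v' from a release tag: 'v1.2.3' -> '1.2.3'."""
    return tag[1:] if tag.startswith("v") else tag

def available_releases(remote_tags):
    """Map remote tags to {numeric_version: changelog_description}."""
    return {tag_to_rel(t): f"duckdb release {t}" for t in remote_tags}

releases = available_releases(["v1.2.2", "v1.3.0"])
print(sorted(releases))  # ['1.2.2', '1.3.0']
```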

TODO: add more instructions on actually building, installing, and deploying