Data Engineering/Systems/Conda

We use Conda to manage packages and virtual environments on the analytics clients.

Environments are created by cloning conda-analytics, a custom Conda distribution maintained by the Data Engineering team.

Use with Jupyter

For basic instructions on using Conda within our Jupyter environment, see Data Engineering/Systems/Jupyter#Conda environments.

Use outside Jupyter

This section applies to Conda use outside of Jupyter (that is, when you connect to one of the analytics clients with a plain SSH terminal session).

In most cases, you can use the standard Conda commands (e.g. conda install, conda remove, conda list, conda deactivate). This section covers the exceptions where we have custom commands to support our cloning-based workflow.

Creating a new environment

In the terminal, run conda-analytics-clone. This creates a new clone of conda-analytics for you in ~/.conda/envs.

By default, the environment is named with its creation time and your username (for example, 2022-11-04T19.32.00_xcollazo). If you prefer, you can give it a custom name: conda-analytics-clone my-cool-env.
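As a brief sketch of the custom-name form: my-cool-env is only an illustrative name, and the guard is there so that the snippet does nothing on a machine that lacks the conda-analytics scripts.

```shell
# Illustrative name; omit the argument to get the default time-and-username name.
env_name="my-cool-env"

if command -v conda-analytics-clone >/dev/null 2>&1; then
    # Clone conda-analytics into ~/.conda/envs/my-cool-env
    conda-analytics-clone "$env_name"
    # Confirm that the new environment directory exists
    ls -d "$HOME/.conda/envs/$env_name"
else
    echo "conda-analytics-clone not found; run this on an analytics client" >&2
fi
```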

Listing environments

$ conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo     /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo     /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo     /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base                  *  /opt/conda-analytics

Activating an environment

Run source conda-analytics-activate my-cool-env.

You can achieve the same thing with vanilla commands:

$ source /opt/conda-analytics/etc/profile.d/conda.sh
$ conda activate my-cool-env

To activate the read-only base environment instead, run source conda-analytics-activate base.

Installing packages

With a Conda environment activated, you can install packages by running conda install {{package}} in the terminal. If you are using Conda outside of Jupyter, you will first have to set your environment to use the HTTP proxy.
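The proxy setup usually amounts to exporting the standard proxy variables in your shell before running conda install. In this minimal sketch, the proxy address is an assumption rather than something stated on this page, so check it against the HTTP proxy documentation before relying on it.

```shell
# ASSUMPTION: proxy address; verify it against the HTTP proxy documentation.
proxy_url="http://webproxy.eqiad.wmnet:8080"

# Conda and pip both read these standard environment variables.
export http_proxy="$proxy_url"
export https_proxy="$proxy_url"

echo "Using proxy: $https_proxy"
```

The variables only last for the current shell session, so they have to be set again in each new SSH session (or added to your shell profile).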

Conda will install packages from the Conda Forge channel by default. You can manually select a different channel by adding --channel {{channel}} to the command. The easiest way to search Conda Forge for a specific package is to do a regular web search with the qualifier "site:anaconda.org/conda-forge/".

If a Python package you need is not available from Conda Forge, you can use Pip instead.
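One way to script the "Conda Forge first, Pip otherwise" choice is sketched below. The function name install_from_forge_or_pip is hypothetical, and the sketch assumes that conda search exits with a non-zero status when no package matches. Note that a network failure (for example, a missing proxy setting) would also make the search fail and trigger the Pip fallback.

```shell
# Hypothetical helper: prefer Conda Forge, fall back to pip.
# Call it with an environment already activated.
install_from_forge_or_pip() {
    pkg="$1"
    if conda search --channel conda-forge "$pkg" >/dev/null 2>&1; then
        # The package exists on Conda Forge, so install it with Conda.
        conda install --yes --channel conda-forge "$pkg"
    else
        # Not found on Conda Forge: install it with pip into the active environment.
        pip install "$pkg"
    fi
}

# Example call (some-package is a placeholder, not a real recommendation):
#   install_from_forge_or_pip some-package
```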

The Mamba solver

Conda has a newer dependency solver, libmamba (the "Mamba solver"), which installs packages much faster than the classic solver and avoids some common problems.

It was made the default solver in phab:T337258, and that change was deployed in conda-analytics version 0.0.23.

It is still possible to select the classic solver instead of libmamba if required, by adding the argument --solver classic to any conda install command.
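The following sketch shows how to check which solver an environment is configured to use and how to fall back to the classic solver for a single install. The function names are hypothetical; conda config --show solver and the --solver option are assumed to be available, which should hold for Conda versions that ship libmamba support.

```shell
# Hypothetical helper: print the solver configured for the current Conda setup.
show_conda_solver() {
    conda config --show solver
}

# Hypothetical helper: install the given packages with the classic solver
# for this one command only; the configured default is left unchanged.
conda_install_classic() {
    conda install --solver classic "$@"
}

# Example calls (some-package is a placeholder):
#   show_conda_solver
#   conda_install_classic some-package
```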

Troubleshooting

Installing packages is extremely slow

Create a new Conda environment, which will use the Mamba solver by default.

Conda fails to solve the environment with an error about not finding an old version of Conda

Sometimes, after updating Conda to a new version, it will no longer be able to install new packages. Instead, it will fail to solve the environment with the error message ResolvePackageNotFound: conda={{old version}}.

To fix this, create a new environment, which will use the Mamba solver by default.

Spark 3 insert statement requirements

Using an INSERT statement in Spark 3 SQL, or write.insertInto() in PySpark 3, causes the environment's Python executable to be called. If the code runs from a cron job that loads a custom Python environment, this can fail because that executable isn't available on the cluster. One way to solve this is to create the Spark session with wmfdata.spark.create_session(ship_python_env=True), which ships the Python environment to the cluster nodes.

Broken environments

We have found that using the Mamba solver can lead to broken environments. So far, the only known case is installing R 4.2.3, but check T343823 to see whether other cases have been reported. One way to prevent this is to first do a "dry run" of the installation to see what would be installed and updated. To do this, add the -d (--dry-run) option to the command. For example, conda install -d R shows what would be installed and updated if R were installed, without changing the environment.
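To make the dry-run output easier to review, it can be saved to a file. This is a small sketch: the function name preview_install and the log location are illustrative choices, not part of conda-analytics.

```shell
# Hypothetical helper: show what an install would change, without applying it,
# and keep a copy of the plan for later review.
preview_install() {
    # --dry-run (short form: -d) reports the planned changes only.
    conda install --dry-run "$@" 2>&1 | tee "${TMPDIR:-/tmp}/conda-dry-run.log"
}

# Example call:
#   preview_install R
```

Look through the saved plan for unexpected downgrades or removals before running the real conda install.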

R support

conda-analytics was built with Python in mind. Other languages are passively supported, and R is not included by default.

However, you can easily install R into your environment:

# Make sure you are using a conda env. This is not necessary inside Jupyter.
$ source conda-analytics-activate my-cool-r-env

# Install the conda R package into your user conda environment.
$ conda install R

# R is now fully contained in your user conda environment.
$ which R
/home/xcollazo/.conda/envs/my-cool-r-env/bin/R

# You can install additional R packages using conda
$ conda install r-tidyverse

# Enable R notebooks in Jupyter
$ conda install r-irkernel

# This step must be done in a Jupyter terminal.
$ R

R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
[...]

> IRkernel::installspec()

Tips

You should install additional R packages using Conda whenever possible. However, if a package you need is not available, you can use R's own package manager by running install.packages() during an R session.

Because the version of R from Conda Forge is 4.2 or newer, you have access to recent language features such as raw string literals (for example, r"(C:\path)") and the built-in pipe operator (|>), which replaces the need for magrittr's %>% in most cases.

It is also recommended to create a ~/.Rprofile file with the following:

options(
  repos = c(
    CRAN = "https://cran.rstudio.com/",
    STAN = "https://mc-stan.org/r-packages/"
  ),
  mc.cores = 4
)
Sys.setenv(MAKEFLAGS = "-j4")
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)

Querying the data lake and MariaDB in R

You can use the reticulate R package to call Python from R, and through it use the wmfdata Python package to query the data lake, for example with Spark as the backend. wmfdata is installed by default in your conda-analytics environment, so it does not need to be installed separately.

First, install the reticulate and jsonlite R packages:

install.packages('reticulate')
install.packages('jsonlite')

You should then be able to run the following R code to point reticulate at your active Conda environment and import wmfdata:

library(jsonlite)
library(reticulate)

# Ask conda for details about the current session, returned as JSON
conda_env_data <- paste(
    system2(
        "conda",
        args = c("info", "-a", "--json"),
        stdout = TRUE),
    collapse = '') |>
    fromJSON()

# Tell reticulate to use the currently active Conda environment
use_condaenv(conda_env_data[['active_prefix']])

# Import the wmfdata Python package
wmfdata <- import('wmfdata')

You should then be able to use the various wmfdata backends for queries. For example, you can use wmfdata$spark$run() to run a Spark query, provided you have already authenticated with Kerberos using kinit.

brms and lme4

If you would like to use brms and/or lme4 for statistical modeling, install the packages the same way you installed R (see the instructions above):

conda install r-brms r-lme4

Open a Terminal in JupyterLab or an SSH session (if you haven't yet) and run the following in R:

install.packages("BH")

BH (a dependency of brms) has to be installed this way even when brms itself is installed via Conda; the reason is not known.

conda-analytics

conda-analytics is based on Miniconda and has some extra packages as well as scripts for cloning the environment. On the analytics clients, it is available in /opt/conda-analytics.

The code used to build new releases of conda-analytics lives in gitlab:repos/data-engineering/conda-analytics/. The actual releases live in the associated package registry.