We use Conda to manage packages and environments on the analytics clients.
As of Nov 2022, all Conda environments are based on the anaconda-wmf environment. However, we will soon change our setup so that newly created environments are based on the conda-analytics environment. Existing anaconda-wmf environments will continue to work until 31 March 2023.
To tell which type of environment you are in, run conda info --base from a terminal. If the output is /usr/lib/anaconda-wmf, it's an anaconda-wmf environment. If it is something else, it's a conda-analytics environment.
Using Jupyter to migrate from
From Jupyter Console:
- Select File -> Hub Control Panel. This will open a new tab with ‘Stop My Server’ and ‘Start My Server’ buttons.
- Select ‘Stop My Server’ and then, when ready, ‘Start My Server’.
- From the ‘Select a job profile:’ pulldown menu, select ‘Create and use new cloned conda environment…’. The ‘Spawning server…’ progress bar will run. Once it completes, you should get a Server Ready message and a tab will open with a new console window. If the initial cloning action fails, you may need to log out and try again.
- To test that you are indeed operating in conda-analytics, open a Terminal and run the command
conda list | grep spark
If the clone was successful, you will see the following version numbers:
When you next log in to use Jupyter, you should be operating in the new conda-analytics environment.
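To check versions programmatically rather than by eye, the output of conda list can be parsed. This is a rough sketch that assumes the usual whitespace-separated conda list layout (name, version, build, channel) with # comment lines. The function name is illustrative, and the sample rows are invented for the example; they are not the version numbers you should expect to see. Note that it matches on the package name only, which is slightly stricter than grep spark.

```python
from typing import Dict


def spark_packages(conda_list_output: str) -> Dict[str, str]:
    """Map package name -> version for packages whose name contains 'spark'."""
    versions = {}
    for line in conda_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines that conda prints.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2 and "spark" in fields[0].lower():
            versions[fields[0]] = fields[1]
    return versions


# Hypothetical sample: these rows are made up, not real expected versions.
SAMPLE = """# packages in environment at /home/otto/.conda/envs/example:
#
# Name      Version   Build   Channel
numpy       1.0.0     py_0    conda-forge
pyspark     3.0.0     py_0    conda-forge
"""

print(spark_packages(SAMPLE))  # {'pyspark': '3.0.0'}
```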
Key library changes
anaconda-wmf uses Spark 2, while conda-analytics upgrades to Spark 3. There are many differences between these two major versions, particularly in Spark SQL. For more information, see Upgrading from Spark SQL 2.4 to 3.0 in the Spark SQL Guide.
Basic Conda use
Listing conda environments
/usr/lib/anaconda-wmf/bin/conda env list
# conda environments:
#
2020-08-19T16.19.37_otto     /home/otto/.conda/envs/2020-08-19T16.19.37_otto
2020-08-19T16.47.40_otto     /home/otto/.conda/envs/2020-08-19T16.47.40_otto
2020-08-19T16.56.54_otto     /home/otto/.conda/envs/2020-08-19T16.56.54_otto
2020-08-19T16.59.40_otto     /home/otto/.conda/envs/2020-08-19T16.59.40_otto
2020-12-13T19.40.09_otto     /home/otto/.conda/envs/2020-12-13T19.40.09_otto
base                  *  /usr/lib/anaconda-wmf
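If you need this listing in a script, the output can be parsed into structured records. The sketch below assumes the layout shown above: # comment lines, then one environment per line as name and path, with * marking the active environment. The CondaEnv type and parse_env_list function are illustrative names; unnamed environments (path only) and paths containing spaces are not handled.

```python
from typing import List, NamedTuple


class CondaEnv(NamedTuple):
    name: str
    path: str
    active: bool


def parse_env_list(output: str) -> List[CondaEnv]:
    """Parse the text printed by `conda env list` into CondaEnv records."""
    envs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        active = "*" in fields  # '*' marks the currently active environment
        fields = [f for f in fields if f != "*"]
        # Keep only well-formed "name path" rows.
        if len(fields) == 2:
            envs.append(CondaEnv(fields[0], fields[1], active))
    return envs
```

For example, the last row of the listing above would be parsed as CondaEnv(name="base", path="/usr/lib/anaconda-wmf", active=True).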
Listing installed packages
To see the packages installed in your current environment, run conda list. Note that this will not include the packages in the base anaconda-wmf environment, which are also accessible in the current environment. To see the packages installed in the base environment, run conda list -n base.
This new base environment supports Spark 3 and is based on Miniconda, which is a minimal version of Anaconda. Having a minimal set of dependencies allows you to better control package versions to fit your needs, while also lowering the maintenance burden. You can see what packages are included by default here.
Anaconda is a prepackaged Conda distribution for mostly Python-based analytics and research purposes. We have developed a modified version of Anaconda named
anaconda-wmf that includes some extra packages and scripts for creating 'stacked' conda user environments. These Conda user environments allow users to install packages into their own conda environment without modifying the base anaconda environment.
It has a large list of packages already installed, and these packages are installed on all Hadoop worker nodes. This environment only supports Spark 2.
Anaconda base environment
To use the readonly Anaconda base environment, you can simply run python or other executables directly out of
/usr/lib/anaconda-wmf/bin. If you prefer to activate the anaconda base environment, run
and a new conda environment will be created for you in ~/.conda/envs. When used, this environment will automatically append the base conda environment Python load paths to its own. If the same package is installed in both environments, your user conda environment's package will take precedence.
If you prefer, you can name your conda environment
There are several ways to activate a conda user environment. Running source conda-activate-stacked on its own will attempt to guess the most recent conda environment to activate. If you only have one conda environment, this will work.
You can also specify the name of the conda env to activate. Run
/usr/lib/anaconda-wmf/bin/conda info --envs to get a list of available conda environments. E.g.
source conda-activate-stacked otto_2020-08-17T20.52.02
Or, you can run the 'activate' script out of your conda environment path:
Installing packages into your user conda environment
After activating your user conda environment, you can set http proxy env vars and install conda and pip packages. E.g.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
conda install -c conda-forge <desired_conda_package>
pip install --ignore-installed <desired_pip_package>
Conda is much preferred over pip if the package you need is available via Conda. Conda can better track packages and their install locations than pip.
Note the --ignore-installed flag for pip install. This is only needed if you are installing a pip package into your Conda environment that already exists in the base anaconda-wmf environment.
These packages will be installed into the currently activated Conda user environment.
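When installs are driven from a script rather than typed by hand, the same proxy settings and commands can be assembled in Python. This is a minimal sketch: the proxy URL and command-line flags come from the commands above, while the helper names proxied_env and install_command are made up for illustration.

```python
import os
from typing import Dict, List, Mapping, Optional

# Proxy URL taken from the commands above.
WEBPROXY = "http://webproxy.eqiad.wmnet:8080"


def proxied_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the HTTP(S) proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = WEBPROXY
    env["https_proxy"] = WEBPROXY
    return env


def install_command(package: str, use_pip: bool = False) -> List[str]:
    """Build the conda (preferred) or pip install command for one package."""
    if use_pip:
        # --ignore-installed matters when the package already exists in the base env.
        return ["pip", "install", "--ignore-installed", package]
    return ["conda", "install", "-c", "conda-forge", package]
```

From within an activated user environment, something like subprocess.run(install_command("some-package"), env=proxied_env(), check=True) would then run the install with the proxy applied only to that child process, leaving your shell's environment untouched.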
Deactivating your user conda environment
Or, since the user conda env's bin dir has been added to your path, you should also be able to just run
Stacked conda environments
Conda supports activating environments 'stacked' on another one. However, all this 'stacking' does by default is leave the base conda environment's bin directory on your PATH. It does not allow for python dependencies from multiple environments.
Our customization fixes this. When conda-create-stacked is run, an anaconda.pth file is created in the new conda environment's site-packages directory. This file tells Python to add the anaconda-wmf base environment's Python search paths to its own. If a package is present in both environments, the stacked conda environment's version will take precedence.
For more details on why upstream Conda has not implemented this behavior, see this GitHub issue.
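The mechanism relies on standard Python behaviour: directories listed in a .pth file inside a site-packages directory are appended to sys.path after that directory itself, which is why the stacked environment wins on conflicts. The sketch below reproduces this with temporary directories standing in for the two environments. The directory names are placeholders, and the contents of the real anaconda.pth file may differ.

```python
import site
import sys
import tempfile
from pathlib import Path
from typing import List


def paths_added_by_pth() -> List[str]:
    """Create a throwaway 'user' site dir whose .pth file points at a
    throwaway 'base' site dir, and report the order Python adds them in."""
    with tempfile.TemporaryDirectory() as tmp:
        base_site = Path(tmp) / "base" / "site-packages"
        user_site = Path(tmp) / "user" / "site-packages"
        base_site.mkdir(parents=True)
        user_site.mkdir(parents=True)

        # Each line of a .pth file names a directory to append to sys.path.
        (user_site / "anaconda.pth").write_text(f"{base_site}\n")

        before = list(sys.path)
        site.addsitedir(str(user_site))  # adds user_site, then processes its .pth files
        added = [p for p in sys.path if p not in before]
        sys.path[:] = before  # restore sys.path so the demo has no side effects

        # Report the role of each directory rather than the temporary path.
        return [Path(p).parent.name for p in added]


print(paths_added_by_pth())  # ['user', 'base']
```

Because imports search sys.path in order, a package present in both directories would be loaded from the 'user' one.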
Persistent import issues
If you find that packages fail to import properly and that the issue is not resolved by creating a new Conda environment, the issue may still be due to something specific to your environment (for an example, see task T313249). Try deleting your
~/.conda folder and see if that fixes the issue.
Otherwise, you can try an even "harder" reset by deleting everything in your home folder. Make sure to include hidden files since this is where the problematic configuration would be. Obviously, make completely sure that you have backed up all your files first.
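Since deleting is irreversible, one cautious variant is to rename ~/.conda instead of removing it, so that it can be restored if it turns out not to be the cause. This is a sketch of that alternative, not the procedure described above; the function name and the backup naming scheme are assumptions.

```python
import time
from pathlib import Path
from typing import Optional


def set_aside_conda_dir(home: Path) -> Optional[Path]:
    """Rename <home>/.conda to a timestamped backup directory.

    Returns the backup path, or None if there was no .conda directory.
    """
    conda_dir = home / ".conda"
    if not conda_dir.exists():
        return None
    # Hypothetical naming scheme: .conda.bak-<timestamp>
    backup = home / f".conda.bak-{time.strftime('%Y%m%dT%H%M%S')}"
    conda_dir.rename(backup)
    return backup
```

Calling set_aside_conda_dir(Path.home()) and then creating a fresh environment gives the same effect as deleting the folder. Once the import problem is confirmed fixed, the backup can be removed.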
Spark 3 insert statement requirements
Running an INSERT statement in Spark 3 SQL or calling write.insertInto() in PySpark 3 results in the environment's Python executable being called. If the code is run from a cron job that loads a custom Python environment, this might result in errors because that executable isn't available on the cluster. One way to solve this is to use wmfdata.spark.create_session(ship_python_env = True) to create a custom Spark session that ships the Python environment to the cluster nodes.
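As an illustration, a cron-driven script might wrap the insert as follows. This is only a sketch: create_session(ship_python_env=True) is the call named above, but the wrapper function and its table-name parameters are hypothetical, and it can only run where wmfdata and a Spark cluster are available.

```python
def insert_with_shipped_env(source_table: str, target_table: str) -> None:
    """Insert all rows of source_table into target_table using a Spark
    session that ships the current Python environment to the cluster."""
    # Imported here because wmfdata is only available on the analytics clients.
    from wmfdata import spark as wmfdata_spark

    session = wmfdata_spark.create_session(ship_python_env=True)
    # Both table names are supplied by the caller; neither is assumed to exist.
    session.table(source_table).write.insertInto(target_table)
```

Shipping the environment adds start-up time to the session, so it is probably only worth enabling for jobs that actually hit this error.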
WMF's anaconda environment support was built with Python in mind. Other languages are passively supported.
R is included in the base anaconda-wmf environment, but it is not installed into the user conda environment by default. Doing so makes the size of user environments much larger, and makes distributing them to HDFS take much longer.
To install R packages into your user environment, do the following:
# Make sure you are using a conda env. This is not necessary if running in Jupyter.
source conda-activate-stacked

# Enable http proxy. This is not necessary if running in Jupyter.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
export no_proxy=127.0.0.1,localhost,.wmnet

# R is currently the base anaconda-wmf R.
which R
/usr/lib/anaconda-wmf/bin/R

# Install the conda R package into your user conda environment.
conda install R

# R is now fully contained in your user conda environment.
which R
/home/otto/.conda/envs/2021-04-07T21.37.00_otto/bin/R
You should now be able to install R packages using R's package manager via
However, just like with Python, installing R packages with conda is preferred over using R's package manager. If a conda R package exists, you should be able to just install it like:
$ conda install r-tidyverse
After installing R using the steps above, launch a new Terminal session in JupyterLab (found under Other or File > New > Terminal) and open R:
Once R loads, use the following commands to install and set up the R kernel (otherwise you'll get errors when creating an R notebook):
It is also recommended to create a ~/.Rprofile file with the following:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
options(
  repos = c(
    CRAN = "https://cran.rstudio.com/",
    STAN = "https://mc-stan.org/r-packages/"
  ),
  mc.cores = 4
)
Sys.setenv(MAKEFLAGS = "-j4")
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)
/bin/gtar not found
If you attempt to install from a Git repository (e.g. wmfdata via remotes::install_github("wikimedia/wmfdata-r")), you may get the following error:
Downloading GitHub repo wikimedia/wmfdata-r@HEAD
sh: 1: /bin/gtar: not found
sh: 1: /bin/gtar: not found
Error: Failed to install 'wmfdata' from GitHub:
  error in running command
In addition: Warning messages:
1: In system(cmd) : error in running command
2: In utils::untar(tarfile, ...)
This is an issue with Conda's R. The only workaround is to run Sys.setenv(TAR = system("which tar", intern = TRUE)) before the install commands.
lme4 depends on the nloptr package, which is very difficult to build from source (which is what happens if you try to install it directly in R) and is the primary hurdle for installing lme4 on our systems. The easiest way to install nloptr is with:
conda install r-nloptr
Then you can run install.packages("lme4") in R. To verify that it works:
library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
First, install pkg-config with:
conda install pkg-config
This is necessary for installing some of brms's dependencies. Then you can install in R:
To verify that it works run the following:
library(brms)
prior1 <- prior(normal(0,10), class = b) + prior(cauchy(0,2), class = sd)
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient), data = epilepsy, family = poisson(), prior = prior1)
The code used to build new releases of anaconda-wmf lives in operations/debs/anaconda-wmf.