snakemake, singularity, and HPC
Posted on Mon 01 August 2022 in posts
This is a bit of a sequel to my post last May about using conda environments with snakemake on the HPC. That solution worked for me, but it was hacky and limited. This post describes a different solution: a singularity container holding a conda environment, which snakemake can use either locally or with SLURM, and which can mount extra dependencies.
I put this together for my spatial frequency preferences project, and you can read that repo's README for more details on how to use it. This post will attempt to describe the different components and why they work, which will hopefully make it easier to modify them for other uses. The following description is all for NYU's greene cluster (which uses SLURM), but can hopefully be adapted to other clusters and job submission systems.
My goal was to make it easier to rerun my analyses, allowing other people to make use of snakemake and the cluster to run them quickly, without requiring lots of setup and understanding of snakemake, SLURM, or singularity. To do that, I wanted to create a single container with a wrapper script that would be able to:
- Run locally.
- Run in an interactive session on greene.
- Use snakemake to handle job submission on greene.
- Mount paths for additional requirements that cannot be easily included in the container.
- Be backed up for long-term archival.
- Not require the user to modify any source code themselves.
In order to accomplish the first three goals, I needed to have the container and script work with both docker and singularity: docker does not play nicely with computing clusters (since it wants to run everything with sudo permissions), and singularity is difficult for the typical user to set up on their personal machine.
The fourth goal was necessary because some of the requirements for re-running my full analysis have non-open-source licenses (Matlab requires the user to pay for a license, and FSL and Freesurfer both require registration), which makes including them in a container difficult. (With Matlab, at least, I could compile the required code into mex files and include those, but I've never done that before, I didn't write the required Matlab code (and so don't understand it perfectly), and from preliminary research and discussion with NYU HPC staff, I don't think it's as straightforward as I had originally hoped.)
The fifth goal is necessary because Docker Hub makes no promises about its images staying around forever (and, in particular, is planning on deleting images from free accounts after six months of inactivity), so I can't rely on users always being able to find the image there.
In order to accomplish the above, we need to create a container that includes all the dependencies we can and allows for mounting the ones we can't include. We will create a wrapper script, in python, which handles a lot of boilerplate: mounting the required paths and remapping the arguments passed to snakemake so that they use the correct paths within the container. We also want this wrapper script to work with only the default python 3 libraries, since any additional packages will only be available within the container. Finally, we need to configure snakemake so that it uses the container for the jobs it submits to the cluster.
With all that in mind, let's walk through the files that implement this solution. It involves four files from my spatial-frequency-preferences repo (build_docker, singularity_env.sh, run_singularity.py, and config.json), plus the singularity branch of my slurm snakemake profile.
build_docker
# This is a Dockerfile, but we use a non-default name to prevent mybinder from
# using it. we also don't expect people to build the docker image themselves, so
# hopefully it shouldn't trip people up.
FROM mambaorg/micromamba
# git is necessary for one of the packages we install via pip, gcc is required
# to install another one of those packages, and we need to be root to use apt
USER root
RUN apt -y update
RUN apt -y install git gcc
# switch back to the default user
USER micromamba
# create the directory we'll put our dependencies in.
RUN mkdir -p /home/sfp_user/
# copy over the conda environment yml file
COPY ./environment.yml /home/sfp_user/sfp-environment.yml
# install the required python packages and remove unnecessary files.
RUN micromamba install -n base -y -f /home/sfp_user/sfp-environment.yml && \
micromamba clean --all --yes
# get the specific commit of the MRI_tools repo that we need
RUN git clone https://github.com/WinawerLab/MRI_tools.git /home/sfp_user/MRI_tools; cd /home/sfp_user/MRI_tools; git checkout 8508652bd9e6b5d843d70be0910da413bbee432e
# get the matlab toolboxes that we need
RUN git clone https://github.com/cvnlab/GLMdenoise.git /home/sfp_user/GLMdenoise
RUN git clone https://github.com/vistalab/vistasoft.git /home/sfp_user/vistasoft
# get the slurm snakemake profile, the singularity branch
RUN mkdir -p /home/sfp_user/.config/snakemake
RUN git clone -b singularity https://github.com/billbrod/snakemake-slurm.git /home/sfp_user/.config/snakemake/slurm
# copy over the env.sh file, which sets environmental variables
COPY ./singularity_env.sh /home/sfp_user/singularity_env.sh
This is a standard Dockerfile (we use a different name because we also want to be able to launch a Binder instance from this repo, and Binder defaults to using Dockerfile if present), which builds on the mamba container (mamba is a drop-in replacement for conda that, in my experience, solves the environment much more quickly). The main thing it does is install the conda environment used by my analysis. It also grabs the other git repos that are required: the MRI_tools repo (used for pre-processing the fMRI data), two matlab packages, and the singularity branch of my slurm snakemake profile.
To then get this image onto the cluster, you'll need to push it to Docker Hub. Build and push it like so:
sudo docker build --tag=billbrod/sfp:latest -f build_docker ./
sudo docker push billbrod/sfp:latest
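On the cluster, you can then convert the Docker Hub image into the .sif file that singularity needs, using singularity pull. A minimal sketch, assuming singularity is available there (on many clusters you may first need something like module load singularity):

# pulls the image from Docker Hub and converts it to a local .sif file
singularity pull sfp.sif docker://billbrod/sfp:latest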
singularity_env.sh
#!/usr/bin/env bash
# set up environment variables for other libraries, add them to path
export FREESURFER_HOME=/home/sfp_user/freesurfer
export PATH=$FREESURFER_HOME/bin:$PATH
export PATH=/home/sfp_user/matlab/bin:$PATH
export FSLOUTPUTTYPE=NIFTI_GZ
export FSLDIR=/home/sfp_user/fsl
export PATH=$FSLDIR/bin:$PATH
# modify the config.json file so it points to the location of MRI_tools,
# GLMDenoise, and Vistasoft within the container
if [ -f /home/sfp_user/spatial-frequency-preferences/config.json ]; then
cp /home/sfp_user/spatial-frequency-preferences/config.json /home/sfp_user/sfp_config.json
sed -i 's|"MRI_TOOLS":.*|"MRI_TOOLS": "/home/sfp_user/MRI_tools",|g' /home/sfp_user/sfp_config.json
sed -i 's|"GLMDENOISE_PATH":.*|"GLMDENOISE_PATH": "/home/sfp_user/GLMdenoise",|g' /home/sfp_user/sfp_config.json
sed -i 's|"VISTASOFT_PATH":.*|"VISTASOFT_PATH": "/home/sfp_user/vistasoft",|g' /home/sfp_user/sfp_config.json
fi
This file gets copied into the container and will get sourced as soon as the container starts up (see the run_singularity.py section below for how we do this). It sets up environmental variables for the extra dependencies and adds them to the path, as well as modifying config.json so that it points to where those packages are located within the container. Note that these software packages (matlab, FSL, and Freesurfer) are not included in the container, but because of how we've set up the run_singularity.py script, we know where they'll be mounted.
config.json
{
"DATA_DIR": "/scratch/wfb229/sfp",
"WORKING_DIR": "/scratch/wfb229/preprocess",
"MATLAB_PATH": "/share/apps/matlab/2020b",
"FREESURFER_HOME": "/share/apps/freesurfer/6.0.0",
"FSLDIR": "/share/apps/fsl/5.0.10",
"MRI_TOOLS": "/home/billbrod/Documents/MRI_tools",
"GLMDENOISE_PATH": "/home/billbrod/Documents/MATLAB/toolboxes/GLMdenoise",
"VISTASOFT_PATH": "/home/billbrod/Documents/MATLAB/toolboxes/vistasoft",
"TESLA_DIR": "/mnt/Tesla/spatial_frequency_preferences",
"EXTRA_FILES_DIR": "/mnt/winawerlab/Projects/spatial_frequency_preferences/extra_files",
"SUBJECTS_DIR": "/mnt/winawerlab/Freesurfer_subjects",
"RETINOTOPY_DIR": "/mnt/winawerlab/Projects/Retinotopy/BIDS"
}
Snakemake allows for a configuration file, either yml or json, which we use to specify a variety of paths. We use json here, even though it doesn't allow for comments, because it can be parsed by the standard python library. These paths should all be set to locations on your machine / the cluster (not within the container). The above is an example that works for my user on the NYU greene cluster.
When using the container, only the first five paths need to be set (from DATA_DIR to FSLDIR; the final ones are used either when running without the container or when copying data into a BIDS-compliant format). DATA_DIR gives the location of the data set and where we'll place the output of the analysis, and WORKING_DIR is a working directory for preprocessing, used only temporarily in that step. The next three are the root directories of the matlab, Freesurfer, and FSL installations. To find their locations, make sure they're on your path (on a cluster, this probably means running module load) and then run e.g., which matlab (or which mri_convert, etc.) to see where they're installed. Note that we want the root directory of the install, not the bin/ folder containing the binary executables: if which matlab returns /share/apps/matlab/2020b/bin/matlab, we just want /share/apps/matlab/2020b.
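For example, on a cluster that uses environment modules, finding the matlab root might look like the following (module names and versions are cluster-specific):

module load matlab/2020b
which matlab
# prints /share/apps/matlab/2020b/bin/matlab, so MATLAB_PATH
# should be set to /share/apps/matlab/2020b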
Of all the files needed for this process, this is the only one that requires modification by the user, and my spatial-frequency-preferences README includes a long description of what the different fields are, which ones need to be set, etc.
run_singularity.py
This is the main script that the user will run, which I generally refer to as the "wrapper script". If everything is working, the user will simply pass the command they wish to run to this script and it will bind the various directories, set environmental variables, and properly construct the arguments for singularity or docker, whichever the user wishes to use.
Because this script is so much larger than the rest, we'll step through it section by section, and I'll include the full script at the end.
#!/usr/bin/env python3
import argparse
import subprocess
import os
import os.path as op
import json
import re
from glob import glob
First, the shebang line specifies that this script must be run with python 3, and then we import the necessary python modules.
# slurm-related paths. change these if your slurm is set up differently or you
# use a different job submission system. see docs
# https://sylabs.io/guides/3.7/user-guide/appendix.html#singularity-s-environment-variables
# for full description of each of these environmental variables
os.environ['SINGULARITY_BINDPATH'] = os.environ.get('SINGULARITY_BINDPATH', '') + ',/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd'
os.environ['SINGULARITYENV_PREPEND_PATH'] = os.environ.get('SINGULARITYENV_PREPEND_PATH', '') + ':/opt/slurm/bin'
os.environ['SINGULARITY_CONTAINLIBS'] = os.environ.get('SINGULARITY_CONTAINLIBS', '') + ',' + ','.join(glob('/opt/slurm/lib64/libpmi*'))
The next several lines handle some slurm-related boilerplate. We take some singularity-related environmental variables and add on paths related to the slurm configuration, so that we'll have access to the slurm commands (e.g., sbatch) from within the container. These will likely vary across SLURM clusters and so will need to be modified if using this on any cluster other than NYU's greene.
def check_singularity_envvars():
"""Make sure SINGULARITY_BINDPATH, SINGULARITY_PREPEND_PATH, and SINGULARITY_CONTAINLIBS only contain existing paths
"""
for env in ['SINGULARITY_BINDPATH', 'SINGULARITYENV_PREPEND_PATH', 'SINGULARITY_CONTAINLIBS']:
paths = os.environ[env]
joiner = ',' if env != "SINGULARITYENV_PREPEND_PATH" else ':'
paths = [p for p in paths.split(joiner) if op.exists(p)]
os.environ[env] = joiner.join(paths)
def check_bind_paths(volumes):
"""Check that paths we want to bind exist, return only those that do."""
return [vol for vol in volumes if op.exists(vol.split(':')[0])]
The next two functions make sure we only pass existing paths through to the container, whether via the environmental variables discussed above or via the user-specified bind paths. By excluding non-existent directories from the bind paths, we let the user leave unset any options they don't need, so that they don't have to mount directories containing software they won't use.
def main(image, args=[], software='singularity', sudo=False):
"""Run sfp singularity container!
Parameters
----------
image : str
If running with singularity, the path to the .sif file containing the
singularity image. If running with docker, name of the docker image.
args : list, optional
command to pass to the container. If empty (default), we open up an
interactive session.
software : {'singularity', 'docker'}, optional
Whether to run image with singularity or docker
sudo : bool, optional
If True, we run docker with `sudo`. If software=='singularity', we
ignore this.
"""
The main() function accepts the same arguments as the command-line script, which we'll discuss at the end.
check_singularity_envvars()
with open(op.join(op.dirname(op.realpath(__file__)), 'config.json')) as f:
config = json.load(f)
volumes = [
f'{op.dirname(op.realpath(__file__))}:/home/sfp_user/spatial-frequency-preferences',
f'{config["MATLAB_PATH"]}:/home/sfp_user/matlab',
f'{config["FREESURFER_HOME"]}:/home/sfp_user/freesurfer',
f'{config["FSLDIR"]}:/home/sfp_user/fsl',
f'{config["DATA_DIR"]}:{config["DATA_DIR"]}',
f'{config["WORKING_DIR"]}:{config["WORKING_DIR"]}'
]
volumes = check_bind_paths(volumes)
# join puts --bind between each of the volumes, we also need it in the
# beginning
volumes = '--bind ' + " --bind ".join(volumes)
First, main() takes care of the paths and environmental variables. We check the singularity-related environmental variables, as discussed above, then load in the config.json file, also discussed earlier. We use the user-supplied values here to set up the arguments for the --bind flag (equivalent to docker's --volume flag). Note that the software-related paths (the ones containing the project repo, matlab, freesurfer, and FSL) are all remapped to static paths within /home/sfp_user/, while the data-related paths (DATA_DIR and WORKING_DIR) are left unchanged. The data paths are unchanged because there are too many references to them in how snakemake structures its commands for me to comprehensively remap, so we just preserve them. We then check that all the paths that we'll be binding exist and remove any that don't with the check_bind_paths() function, discussed above. Finally, we combine that list of volumes into a string to pass through to singularity, formatted like --bind VOL1 --bind VOL2 ....
# if the user is passing a snakemake command, need to pass
# --configfile /home/sfp_user/sfp_config.json, since we modify the config
# file when we source singularity_env.sh
if args and 'snakemake' == args[0]:
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', *args[1:]]
args specifies what the user wants to do with the container, and there are four possibilities:
1. args is empty, in which case we open up an interactive session.
2. args is a list of strings, and the first string is snakemake.
3. args is a single string, and starts with snakemake.
4. args is either a single string or a list of strings, and doesn't contain snakemake.
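For concreteness, here's roughly what each of these looks like from the command line (using the main_figure_paper target from my project, with sfp.sif as the image):

./run_singularity.py sfp.sif                                  # 1: interactive session
./run_singularity.py sfp.sif snakemake main_figure_paper      # 2: list whose first string is snakemake
./run_singularity.py sfp.sif 'snakemake -n main_figure_paper' # 3: single string starting with snakemake
./run_singularity.py sfp.sif ls /home/sfp_user                # 4: doesn't contain snakemake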
The if block above, which checks whether args[0] is exactly snakemake, handles possibility #2. In this case, we know that the arguments contain no flags for snakemake (if they did, args would have to be quoted and we'd be in possibility #3). We therefore modify the command to be run from inside the container, adding the specification of the config file (which we modified with singularity_env.sh) and the paths to both the working directory and the Snakefile (these last two mean that we can run the snakemake command from other paths within the container). Finally, we append the rest of the user-specified arguments.
# in this case they passed a string so args[0] contains snakemake and then
# a bunch of other stuff
elif args and args[0].startswith('snakemake'):
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', args[0].replace('snakemake ', ''), *args[1:]]
# if the user specifies --profile slurm, replace it with the
# appropriate path. We know it will be in the last one of args and
# nested below the above elif because if they specified --profile then
# the whole thing had to be wrapped in quotes, which would lead to this
# case.
if '--profile slurm' in args[-1]:
args[-1] = args[-1].replace('--profile slurm',
'--profile /home/sfp_user/.config/snakemake/slurm')
# then need to make sure to mount this
elif '--profile' in args[-1]:
profile_path = re.findall('--profile (.*?) ', args[-1])[0]
profile_name = op.split(profile_path)[-1]
volumes.append(f'{profile_path}:/home/sfp_user/.config/snakemake/{profile_name}')
args[-1] = args[-1].replace(f'--profile {profile_path}',
f'--profile /home/sfp_user/.config/snakemake/{profile_name}')
This block corresponds to possibility #3. The beginning is very similar to #2, explained above, except that we include the rest of args[0] after removing the word snakemake. The following sections either remap the snakemake profile from slurm to the slurm profile within the container, or find the specified path of the snakemake profile and add it to the list of volumes to mount within the container. Note that this will fail if the user just specifies the name of the profile, rather than giving the full path. It could be made more general by checking whether the profile_path variable grabbed with the regex looks like a path and, if not, searching the locations that snakemake searches for profiles (notably, ~/.config/snakemake), but that seemed like a lot of work.
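For reference, a minimal sketch of that more general approach might look like the following (resolve_profile is a hypothetical helper, not part of the actual script):

import os.path as op

def resolve_profile(profile_path):
    """Resolve a snakemake profile argument to a path we can mount."""
    # if the user gave us an actual path, use it as-is
    if op.exists(profile_path):
        return profile_path
    # otherwise, treat it as a profile name and check the main location
    # snakemake searches for profiles
    candidate = op.expanduser(op.join('~/.config/snakemake', profile_path))
    if op.isdir(candidate):
        return candidate
    raise ValueError(f"Can't find snakemake profile {profile_path}!")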
# open up an interactive session if the user hasn't specified an argument,
# otherwise pass the argument to bash. regardless, make sure we source the
# env.sh file
if not args:
args = ['/bin/bash', '--init-file', '/home/sfp_user/singularity_env.sh']
This block corresponds to possibility #1: no arguments passed. In this case, we simply open up an interactive bash session, sourcing singularity_env.sh before we start.
else:
args = ['/bin/bash', '-c',
# this needs to be done with single quotes on the inside so
# that's what bash sees, otherwise we run into
# https://stackoverflow.com/questions/45577411/export-variable-within-bin-bash-c;
# double-quoted commands get evaluated in the *current* shell,
# not by /bin/bash -c
f"'source /home/sfp_user/singularity_env.sh; {' '.join(args)}'"]
This block handles possibilities #2 through #4: we pass the arguments through to the bash interpreter, making sure to source singularity_env.sh first. In an ideal world, this sourcing could be done with the same --init-file arg as used in the previous block, but that wasn't working for me. Note the importance of single quotes.
# set these environmental variables, which we use for the jobs submitted to
# the cluster so they know where to find the container and this script
env_str = f"--env SFP_PATH={op.dirname(op.realpath(__file__))} --env SINGULARITY_CONTAINER_PATH={image}"
These two environmental variables are used by our snakemake slurm profile, described in the next section, so that all the jobs we submit know where to find the container and the wrapper script, since they need both as well.
# the -e flag makes sure we don't pass through any environment variables
# from the calling shell, while --writable-tmpfs enables us to write to the
# container's filesystem (necessary because singularity_env.sh makes a
# temporary config.json file)
if software == 'singularity':
exec_str = f'singularity exec -e {env_str} --writable-tmpfs {volumes} {image} {" ".join(args)}'
Now, we combine all the strings we've been configuring into a single string that we can execute. The only new components are the -e flag, which ensures we don't pass through any extra environmental variables from the calling shell, and the --writable-tmpfs flag, which allows us to write to the container's filesystem (not just the mounted ones); we need this because we modify the snakemake configuration file.
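For illustration, the assembled command ends up looking something like this (paths abbreviated and hypothetical, with most of the --bind flags omitted):

singularity exec -e --env SFP_PATH=/path/to/repo --env SINGULARITY_CONTAINER_PATH=sfp.sif \
    --writable-tmpfs --bind /path/to/repo:/home/sfp_user/spatial-frequency-preferences \
    --bind /share/apps/matlab/2020b:/home/sfp_user/matlab \
    sfp.sif /bin/bash -c 'source /home/sfp_user/singularity_env.sh; snakemake ...'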
elif software == 'docker':
volumes = volumes.replace('--bind', '--volume')
exec_str = f'docker run {volumes} -it {image} {" ".join(args)}'
if sudo:
exec_str = 'sudo ' + exec_str
If we're using docker instead of singularity, the command is slightly different: we replace --bind with --volume and change some of the other flags. This command doesn't use the -e flag or the extra environmental variable manipulations that singularity requires, because docker cannot be used on the cluster, and so will never need to interact with the job scheduler (e.g., SLURM).
print(exec_str)
# we use shell=True because we want to carefully control the quotes used
subprocess.call(exec_str, shell=True)
Finally, we print out the command we're running (largely for debugging purposes; I'm not sure how useful it is to the user) and call it using subprocess.
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description=("Run billbrod/sfp container. This is a wrapper, which binds the appropriate"
" paths and sources singularity_env.sh, setting up some environmental variables.")
)
parser.add_argument('image',
help=('If running with singularity, the path to the '
'.sif file containing the singularity image. '
'If running with docker, name of the docker image.'))
parser.add_argument('--software', default='singularity', choices=['singularity', 'docker'],
help="Whether to run this with singularity or docker")
parser.add_argument('--sudo', '-s', action='store_true',
help="Whether to run docker with sudo or not. Ignored if software==singularity")
parser.add_argument("args", nargs='*',
help="Command to pass to the container. If empty, we open up an interactive session.")
args = vars(parser.parse_args())
main(**args)
The very bottom of the script contains the argparse configuration, which makes the command-line help string more informative.
The user calls the script like so: ./run_singularity.py path/to/singularity_image.sif CMD if using singularity, and ./run_singularity.py user/image_name --software docker -s CMD if using docker (the -s flag runs docker as sudo, and so may not be necessary, depending on their docker configuration). Note that this CMD is the equivalent of args discussed above. CMD can be left empty (in which case an interactive session will be opened) or be a string that will be run inside the container, such as the snakemake commands required to recreate the analysis, e.g., 'snakemake main_figure_paper' (for my spatial-frequency-preferences project). Note that single quotes are necessary if any flags are included in CMD, in order to prevent run_singularity.py from trying to interpret them.
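Putting it all together, a user on greene could submit the full analysis to the cluster with something like the following (main_figure_paper is the target from my project; the -j flag caps the number of simultaneous jobs and can be adjusted):

./run_singularity.py path/to/sfp.sif 'snakemake --profile slurm -j 10 main_figure_paper'

Because the command contains flags, it's wrapped in single quotes, and run_singularity.py remaps --profile slurm to the profile's location inside the container, as described above.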
Finally, the complete script:
#!/usr/bin/env python3
import argparse
import subprocess
import os
import os.path as op
import json
import re
from glob import glob
# slurm-related paths. change these if your slurm is set up differently or you
# use a different job submission system. see docs
# https://sylabs.io/guides/3.7/user-guide/appendix.html#singularity-s-environment-variables
# for full description of each of these environmental variables
os.environ['SINGULARITY_BINDPATH'] = os.environ.get('SINGULARITY_BINDPATH', '') + ',/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd'
os.environ['SINGULARITYENV_PREPEND_PATH'] = os.environ.get('SINGULARITYENV_PREPEND_PATH', '') + ':/opt/slurm/bin'
os.environ['SINGULARITY_CONTAINLIBS'] = os.environ.get('SINGULARITY_CONTAINLIBS', '') + ',' + ','.join(glob('/opt/slurm/lib64/libpmi*'))
def check_singularity_envvars():
"""Make sure SINGULARITY_BINDPATH, SINGULARITY_PREPEND_PATH, and SINGULARITY_CONTAINLIBS only contain existing paths
"""
for env in ['SINGULARITY_BINDPATH', 'SINGULARITYENV_PREPEND_PATH', 'SINGULARITY_CONTAINLIBS']:
paths = os.environ[env]
joiner = ',' if env != "SINGULARITYENV_PREPEND_PATH" else ':'
paths = [p for p in paths.split(joiner) if op.exists(p)]
os.environ[env] = joiner.join(paths)
def check_bind_paths(volumes):
"""Check that paths we want to bind exist, return only those that do."""
return [vol for vol in volumes if op.exists(vol.split(':')[0])]
def main(image, args=[], software='singularity', sudo=False):
"""Run sfp singularity container!
Parameters
----------
image : str
If running with singularity, the path to the .sif file containing the
singularity image. If running with docker, name of the docker image.
args : list, optional
command to pass to the container. If empty (default), we open up an
interactive session.
software : {'singularity', 'docker'}, optional
Whether to run image with singularity or docker
sudo : bool, optional
If True, we run docker with `sudo`. If software=='singularity', we
ignore this.
"""
check_singularity_envvars()
with open(op.join(op.dirname(op.realpath(__file__)), 'config.json')) as f:
config = json.load(f)
volumes = [
f'{op.dirname(op.realpath(__file__))}:/home/sfp_user/spatial-frequency-preferences',
f'{config["MATLAB_PATH"]}:/home/sfp_user/matlab',
f'{config["FREESURFER_HOME"]}:/home/sfp_user/freesurfer',
f'{config["FSLDIR"]}:/home/sfp_user/fsl',
f'{config["DATA_DIR"]}:{config["DATA_DIR"]}',
f'{config["WORKING_DIR"]}:{config["WORKING_DIR"]}'
]
volumes = check_bind_paths(volumes)
# join puts --bind between each of the volumes, we also need it in the
# beginning
volumes = '--bind ' + " --bind ".join(volumes)
# if the user is passing a snakemake command, need to pass
# --configfile /home/sfp_user/sfp_config.json, since we modify the config
# file when we source singularity_env.sh
if args and 'snakemake' == args[0]:
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', *args[1:]]
# in this case they passed a string so args[0] contains snakemake and then
# a bunch of other stuff
elif args and args[0].startswith('snakemake'):
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', args[0].replace('snakemake ', ''), *args[1:]]
# if the user specifies --profile slurm, replace it with the
# appropriate path. We know it will be in the last one of args and
# nested below the above elif because if they specified --profile then
# the whole thing had to be wrapped in quotes, which would lead to this
# case.
if '--profile slurm' in args[-1]:
args[-1] = args[-1].replace('--profile slurm',
'--profile /home/sfp_user/.config/snakemake/slurm')
# then need to make sure to mount this
elif '--profile' in args[-1]:
profile_path = re.findall('--profile (.*?) ', args[-1])[0]
profile_name = op.split(profile_path)[-1]
volumes.append(f'{profile_path}:/home/sfp_user/.config/snakemake/{profile_name}')
args[-1] = args[-1].replace(f'--profile {profile_path}',
f'--profile /home/sfp_user/.config/snakemake/{profile_name}')
# open up an interactive session if the user hasn't specified an argument,
# otherwise pass the argument to bash. regardless, make sure we source the
# env.sh file
if not args:
args = ['/bin/bash', '--init-file', '/home/sfp_user/singularity_env.sh']
else:
args = ['/bin/bash', '-c',
# this needs to be done with single quotes on the inside so
# that's what bash sees, otherwise we run into
# https://stackoverflow.com/questions/45577411/export-variable-within-bin-bash-c;
# double-quoted commands get evaluated in the *current* shell,
# not by /bin/bash -c
f"'source /home/sfp_user/singularity_env.sh; {' '.join(args)}'"]
# set these environmental variables, which we use for the jobs submitted to
# the cluster so they know where to find the container and this script
env_str = f"--env SFP_PATH={op.dirname(op.realpath(__file__))} --env SINGULARITY_CONTAINER_PATH={image}"
# the -e flag makes sure we don't pass through any environment variables
# from the calling shell, while --writable-tmpfs enables us to write to the
# container's filesystem (necessary because singularity_env.sh makes a
# temporary config.json file)
if software == 'singularity':
exec_str = f'singularity exec -e {env_str} --writable-tmpfs {volumes} {image} {" ".join(args)}'
elif software == 'docker':
volumes = volumes.replace('--bind', '--volume')
exec_str = f'docker run {volumes} -it {image} {" ".join(args)}'
if sudo:
exec_str = 'sudo ' + exec_str
print(exec_str)
# we use shell=True because we want to carefully control the quotes used
subprocess.call(exec_str, shell=True)
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description=("Run billbrod/sfp container. This is a wrapper, which binds the appropriate"
" paths and sources singularity_env.sh, setting up some environmental variables.")
)
parser.add_argument('image',
help=('If running with singularity, the path to the '
'.sif file containing the singularity image. '
'If running with docker, name of the docker image.'))
parser.add_argument('--software', default='singularity', choices=['singularity', 'docker'],
help="Whether to run this with singularity or docker")
parser.add_argument('--sudo', '-s', action='store_true',
help="Whether to run docker with sudo or not. Ignored if software==singularity")
parser.add_argument("args", nargs='*',
help=("Command to pass to the container. If empty, we open up an interactive session."
" If it contains flags, surround with SINGLE QUOTES (not double)."))
args = vars(parser.parse_args())
main(**args)
slurm snakemake profile
If we want to submit jobs to the job scheduler, we need to tell snakemake how to do so, and to use the container when it does. We do this via a custom profile, which can be found on the singularity branch of my snakemake-slurm github repository. This is based on the canonical slurm snakemake profile. I originally created my version 5 years ago, and it's possible the original has picked up helpful changes since then, so you can use either mine or the original for this (the only changes I've made, besides those described below, are a small change to add support for the --gres option, required for requesting GPUs, and a small change to the partition handling).
The only file that is specific to the implementation discussed in this post is slurm-jobscript.sh:
#!/bin/bash
#SBATCH --export=SINGULARITY_CONTAINER_PATH,SFP_PATH
# properties = {properties}
# q is a special formatting symbol used by snakemake to tell it to escape quotes
# correctly
$SFP_PATH/run_singularity.py $SINGULARITY_CONTAINER_PATH {exec_job:q}
Note that this script makes use of the SINGULARITY_CONTAINER_PATH and SFP_PATH environmental variables, which we made sure to set in the wrapper script run_singularity.py above. They specify the location of the singularity container holding the environment and the directory containing the code for the project (and thus, run_singularity.py). Each job that snakemake submits to the cluster will use this script as the template for its jobscript, so we are ensuring that each of those jobs uses run_singularity.py to run within the container. We also use the :q formatting symbol (a special snakemake one, not available in standard python) to escape quotes correctly, which is important since, as discussed in the previous section, we need to control the single quotes carefully to make sure the flags are interpreted by the right software.
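To make this concrete, after snakemake fills in {properties} and {exec_job}, a submitted jobscript might look something like the following (a hypothetical rendering; the properties line and the exact snakemake command substituted for {exec_job} depend on the rule and the snakemake version):

#!/bin/bash
#SBATCH --export=SINGULARITY_CONTAINER_PATH,SFP_PATH
# properties = {"rule": "preprocess", "jobid": 42, ...}
$SFP_PATH/run_singularity.py $SINGULARITY_CONTAINER_PATH 'snakemake path/to/output --snakefile ... --force ...'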
Archiving
Finally, we would like to be able to back up the containers for long-term archiving. I wrote this code to make my analysis and figure creation reproducible for a scientific paper, which means I would like the code to remain runnable for as long as possible, even though it is unlikely that many people will use it. Docker Hub had planned to delete images from free accounts after six months of inactivity, though it's unclear what the status of that plan is. Regardless, we should not be relying on Docker Hub for long-term archiving, and so we need another solution.
Fortunately, singularity creates a .sif file containing the container when pulling it, and docker can export the container into a .tar file using docker save (e.g., docker save billbrod/sfp:v1.0.0 > sfp_v1.0.0_docker.tar). These files can then be archived like any other file. They are rather large (2.5GB for the .sif file, 5.1GB for the .tar file), and so something like the Open Science Framework, where I normally place my academic research-related files, is not an option.
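For example, creating both archive files for a tagged release might look like this (assuming the image was also tagged and pushed as v1.0.0):

# singularity writes the .sif file directly when pulling
singularity pull sfp_v1.0.0.sif docker://billbrod/sfp:v1.0.0
# docker requires an explicit save step
sudo docker save billbrod/sfp:v1.0.0 > sfp_v1.0.0_docker.tar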
Ultimately, I used NYU's Faculty Digital Archive. Something like Amazon Web Services' S3 buckets (or other cloud storage) would be another option, though not free. Whatever service you use should give objects a DOI and allow for versioning; ideally, it should also be run by a non-profit and be open source, but that might not be possible. If you're at a university, I recommend reaching out to your university library or HPC team to see if they have any suggestions. Fundamentally, archiving and sharing the container is not that different from archiving and sharing a large data set.
Archiving the container somewhere publicly accessible allows users to download and use it even if the Docker Hub image is deleted. With singularity, the .sif file is used directly, whereas with docker, docker load is required (e.g., docker load < sfp_v1.0.0_docker.tar).
Conclusion
This post outlines how to create and use a container that includes as many dependencies as possible, allows for mounting any additional dependencies (such as those that require a license), can be used identically locally and on the cluster, and can use snakemake to submit jobs to the cluster. While the scripts I wrote are all slurm-specific (and, even more so, specific to NYU's greene cluster), hopefully they can provide a jumping-off point for others.