snakemake, singularity, and HPC
Posted on Mon 01 August 2022 in posts
This is a bit of a sequel to my post last May about using conda environments with snakemake on the HPC. That solution worked for me, but it was hacky and limited. This post describes a different solution: a singularity container holding a conda environment, which snakemake can use either locally or with SLURM, and which can mount extra dependencies.
I put this together for my spatial frequency preferences project, and you can read that repo's README for more details on how to use it. This post will attempt to describe the different components and why they work, which will hopefully make it easier to modify them for other uses. The following description is all for NYU's greene cluster (which uses SLURM), but can hopefully be adapted to other clusters and job submission systems.
My goal was to make it easier to rerun my analyses, allowing other people to make use of snakemake and the cluster to run them quickly, without requiring lots of setup and understanding of snakemake, SLURM, or singularity. To do that, I wanted to create a single container with a wrapper script that would be able to:
- Run locally.
- Run in an interactive session on greene.
- Use snakemake to handle job submission on greene.
- Mount paths for additional requirements that cannot be easily included in the container.
- Be backed up for long-term archival.
- Not require the user to modify any source code themselves.
In order to accomplish the first three goals, I needed to have the container and script work with both docker and singularity: docker does not play nicely with computing clusters (since it wants to run everything with sudo permissions), and singularity is difficult for the typical user to set up on their personal machine.
The fourth goal was necessary because some of the requirements for re-running my full analysis have non-open-source licenses (Matlab requires the user to pay for a license, and FSL and Freesurfer both require registration), which makes including them in a container difficult. (With Matlab, at least, I could compile the required code into mex files and include those, but I've never done that before, I didn't write the required Matlab code (and so don't understand it perfectly), and from preliminary research and discussion with NYU HPC staff, I don't think it's as straightforward as I had originally hoped.)
The fifth goal is necessary because Docker Hub makes no promises about its images staying around forever (and, in particular, is planning on deleting images from free accounts after six months of inactivity), so I can't rely on users always being able to find the image there.
In order to accomplish the above, we need to create a container that includes all the dependencies we can and allows for mounting the ones we can't include. We will create a wrapper script, in python, which handles a lot of boilerplate: mounting the required paths and remapping the arguments passed to snakemake so that they use the correct paths within the container. We also want this wrapper script to work with only the default python 3 libraries, since any additional packages will only be available within the container. Finally, we need to configure snakemake so that it uses the container for the jobs it submits to the cluster.
With all that in mind, let's walk through the files that implement this solution. It involves four files from my spatial-frequency-preferences repo (build_docker, singularity_env.sh, run_singularity.py, and config.json), plus the singularity branch of my slurm snakemake profile.
build_docker
# This is a Dockerfile, but we use a non-default name to prevent mybinder from
# using it. we also don't expect people to build the docker image themselves, so
# hopefully it shouldn't trip people up.
FROM mambaorg/micromamba
# git is necessary for one of the packages we install via pip, gcc is required
# to install another one of those packages, and we need to be root to use apt
USER root
RUN apt -y update
RUN apt -y install git gcc
# switch back to the default user
USER micromamba
# create the directory we'll put our dependencies in.
RUN mkdir -p /home/sfp_user/
# copy over the conda environment yml file
COPY ./environment.yml /home/sfp_user/sfp-environment.yml
# install the required python packages and remove unnecessary files.
RUN micromamba install -n base -y -f /home/sfp_user/sfp-environment.yml && \
micromamba clean --all --yes
# get the specific commit of the MRI_tools repo that we need
RUN git clone https://github.com/WinawerLab/MRI_tools.git /home/sfp_user/MRI_tools; cd /home/sfp_user/MRI_tools; git checkout 8508652bd9e6b5d843d70be0910da413bbee432e
# get the matlab toolboxes that we need
RUN git clone https://github.com/cvnlab/GLMdenoise.git /home/sfp_user/GLMdenoise
RUN git clone https://github.com/vistalab/vistasoft.git /home/sfp_user/vistasoft
# get the slurm snakemake profile, the singularity branch
RUN mkdir -p /home/sfp_user/.config/snakemake
RUN git clone -b singularity https://github.com/billbrod/snakemake-slurm.git /home/sfp_user/.config/snakemake/slurm
# copy over the env.sh file, which sets environmental variables
COPY ./singularity_env.sh /home/sfp_user/singularity_env.sh
This is a standard Dockerfile (we use a different name because we also want to be able to launch a Binder instance from this repo, and Binder defaults to using Dockerfile if present), which builds on the mamba container (mamba is a drop-in replacement for conda that, in my experience, solves the environment much more quickly). The main thing it does is install the conda environment used by my analysis. It also grabs the other git repos that are required: the MRI_tools repo (used for pre-processing the fMRI data), two matlab packages, and the singularity branch of my slurm snakemake profile.
To then get this image onto the cluster, you'll need to push it to Docker Hub. Build and push it like so:
sudo docker build --tag=billbrod/sfp:latest -f build_docker ./
sudo docker push billbrod/sfp:latest
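On the cluster, you can then convert the Docker Hub image into the .sif file that singularity needs, using singularity pull. A minimal sketch, assuming singularity is available there (on many clusters you may first need something like module load singularity):

# pulls the image from Docker Hub and converts it to a local .sif file
singularity pull sfp.sif docker://billbrod/sfp:latest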
singularity_env.sh
#!/usr/bin/env bash
# set up environment variables for other libraries, add them to path
export FREESURFER_HOME=/home/sfp_user/freesurfer
export PATH=$FREESURFER_HOME/bin:$PATH
export PATH=/home/sfp_user/matlab/bin:$PATH
export FSLOUTPUTTYPE=NIFTI_GZ
export FSLDIR=/home/sfp_user/fsl
export PATH=$FSLDIR/bin:$PATH
# modify the config.json file so it points to the location of MRI_tools,
# GLMDenoise, and Vistasoft within the container
if [ -f /home/sfp_user/spatial-frequency-preferences/config.json ]; then
cp /home/sfp_user/spatial-frequency-preferences/config.json /home/sfp_user/sfp_config.json
sed -i 's|"MRI_TOOLS":.*|"MRI_TOOLS": "/home/sfp_user/MRI_tools",|g' /home/sfp_user/sfp_config.json
sed -i 's|"GLMDENOISE_PATH":.*|"GLMDENOISE_PATH": "/home/sfp_user/GLMdenoise",|g' /home/sfp_user/sfp_config.json
sed -i 's|"VISTASOFT_PATH":.*|"VISTASOFT_PATH": "/home/sfp_user/vistasoft",|g' /home/sfp_user/sfp_config.json
fi
This file gets copied into the container and will get sourced as soon as the container starts up (see the run_singularity.py section below for how we do this). It sets up environmental variables for the extra dependencies and adds them to the path, as well as modifying config.json so that it points to where those packages are located within the container. Note that these software packages (matlab, FSL, and Freesurfer) are not included in the container, but because of how we've set up the run_singularity.py script, we know where they'll be mounted.
config.json
{
"DATA_DIR": "/scratch/wfb229/sfp",
"WORKING_DIR": "/scratch/wfb229/preprocess",
"MATLAB_PATH": "/share/apps/matlab/2020b",
"FREESURFER_HOME": "/share/apps/freesurfer/6.0.0",
"FSLDIR": "/share/apps/fsl/5.0.10",
"MRI_TOOLS": "/home/billbrod/Documents/MRI_tools",
"GLMDENOISE_PATH": "/home/billbrod/Documents/MATLAB/toolboxes/GLMdenoise",
"VISTASOFT_PATH": "/home/billbrod/Documents/MATLAB/toolboxes/vistasoft",
"TESLA_DIR": "/mnt/Tesla/spatial_frequency_preferences",
"EXTRA_FILES_DIR": "/mnt/winawerlab/Projects/spatial_frequency_preferences/extra_files",
"SUBJECTS_DIR": "/mnt/winawerlab/Freesurfer_subjects",
"RETINOTOPY_DIR": "/mnt/winawerlab/Projects/Retinotopy/BIDS"
}
Snakemake allows for a configuration file, either yml or json, which we use to specify a variety of paths. We use json here, even though it doesn't allow for comments, because it can be parsed by the standard python library. These paths should all be set to locations on your machine / the cluster (not within the container). The above is an example that works for my user on the NYU greene cluster.
When using the container, only the first five paths need to be set (from DATA_DIR to FSLDIR; the final ones are used either when running without the container or when copying data into a BIDS-compliant format). DATA_DIR gives the location of the data set and where we'll place the output of the analysis, and WORKING_DIR is a working directory for preprocessing, used only temporarily in that step. The next three are the root directories of the matlab, Freesurfer, and FSL installations. To find their locations, make sure they're on your path (on a cluster, this probably means running module load) and then run e.g., which matlab (or which mri_convert, etc.) to see where they're installed. Note that we want the root directory of the install, not the bin/ folder containing the binary executables: if which matlab returns /share/apps/matlab/2020b/bin/matlab, we just want /share/apps/matlab/2020b.
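For example, on a cluster that uses environment modules, finding the matlab root might look like the following (module names and versions are cluster-specific):

module load matlab/2020b
which matlab
# prints /share/apps/matlab/2020b/bin/matlab, so MATLAB_PATH
# should be set to /share/apps/matlab/2020b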
Of all the files needed for this process, this is the only one that requires modification by the user, and my spatial-frequency-preferences README includes a long description of what the different fields are, which ones need to be set, etc.
run_singularity.py
This is the main script that the user will run, which I generally refer to as the "wrapper script". If everything is working, the user will simply pass the command they wish to run to this script and it will bind the various directories, set environmental variables, and properly construct the arguments for singularity or docker, whichever the user wishes to use.
Because this script is so much larger than the rest, we'll step through it section by section, and I'll include the full script at the end.
#!/usr/bin/env python3
import argparse
import subprocess
import os
import os.path as op
import json
import re
from glob import glob
First, the shebang line specifies that this script must be run with python 3, and then we import the necessary python modules.
# slurm-related paths. change these if your slurm is set up differently or you
# use a different job submission system. see docs
# https://sylabs.io/guides/3.7/user-guide/appendix.html#singularity-s-environment-variables
# for full description of each of these environmental variables
os.environ['SINGULARITY_BINDPATH'] = os.environ.get('SINGULARITY_BINDPATH', '') + ',/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd'
os.environ['SINGULARITYENV_PREPEND_PATH'] = os.environ.get('SINGULARITYENV_PREPEND_PATH', '') + ':/opt/slurm/bin'
os.environ['SINGULARITY_CONTAINLIBS'] = os.environ.get('SINGULARITY_CONTAINLIBS', '') + ',' + ','.join(glob('/opt/slurm/lib64/libpmi*'))
The next several lines handle some slurm-related boilerplate. We take some singularity-related environmental variables and add on paths related to the slurm configuration, so that we'll have access to the slurm commands (e.g., sbatch) from within the container. These will likely vary across SLURM clusters and so will need to be modified if using this on any cluster other than NYU's greene.
def check_singularity_envvars():
"""Make sure SINGULARITY_BINDPATH, SINGULARITY_PREPEND_PATH, and SINGULARITY_CONTAINLIBS only contain existing paths
"""
for env in ['SINGULARITY_BINDPATH', 'SINGULARITYENV_PREPEND_PATH', 'SINGULARITY_CONTAINLIBS']:
paths = os.environ[env]
joiner = ',' if env != "SINGULARITYENV_PREPEND_PATH" else ':'
paths = [p for p in paths.split(joiner) if op.exists(p)]
os.environ[env] = joiner.join(paths)
def check_bind_paths(volumes):
"""Check that paths we want to bind exist, return only those that do."""
return [vol for vol in volumes if op.exists(vol.split(':')[0])]
The next two functions make sure we only pass existing paths through to the container, whether via the environmental variables discussed above or via the user-specified bind paths. By excluding non-existent directories from the bind paths, we let the user leave unset any options they don't need, so that they don't have to mount directories containing software they won't use.
def main(image, args=[], software='singularity', sudo=False):
"""Run sfp singularity container!
Parameters
----------
image : str
If running with singularity, the path to the .sif file containing the
singularity image. If running with docker, name of the docker image.
args : list, optional
command to pass to the container. If empty (default), we open up an
interactive session.
software : {'singularity', 'docker'}, optional
Whether to run image with singularity or docker
sudo : bool, optional
If True, we run docker with `sudo`. If software=='singularity', we
ignore this.
"""
The main() function accepts the same arguments as the command-line script, which we'll discuss at the end.
check_singularity_envvars()
with open(op.join(op.dirname(op.realpath(__file__)), 'config.json')) as f:
config = json.load(f)
volumes = [
f'{op.dirname(op.realpath(__file__))}:/home/sfp_user/spatial-frequency-preferences',
f'{config["MATLAB_PATH"]}:/home/sfp_user/matlab',
f'{config["FREESURFER_HOME"]}:/home/sfp_user/freesurfer',
f'{config["FSLDIR"]}:/home/sfp_user/fsl',
f'{config["DATA_DIR"]}:{config["DATA_DIR"]}',
f'{config["WORKING_DIR"]}:{config["WORKING_DIR"]}'
]
volumes = check_bind_paths(volumes)
# join puts --bind between each of the volumes, we also need it in the
# beginning
volumes = '--bind ' + " --bind ".join(volumes)
First, main() takes care of the paths and environmental variables. We check the singularity-related environmental variables, as discussed above, then load in the config.json file, also discussed earlier. We use the user-supplied values here to set up the arguments for the --bind flag (equivalent to docker's --volume flag). Note that the software-related paths (the ones containing the project repo, matlab, freesurfer, and FSL) are all remapped to static paths within /home/sfp_user/, while the data-related paths (DATA_DIR and WORKING_DIR) are left unchanged. The data paths are unchanged because there are too many references to them in how snakemake structures its commands for me to comprehensively remap, so we just preserve them. We then check that all the paths that we'll be binding exist and remove any that don't with the check_bind_paths() function, discussed above. Finally, we combine that list of volumes into a string to pass through to singularity, formatted like --bind VOL1 --bind VOL2 ....
# if the user is passing a snakemake command, need to pass
# --configfile /home/sfp_user/sfp_config.json, since we modify the config
# file when we source singularity_env.sh
if args and 'snakemake' == args[0]:
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', *args[1:]]
args specifies what the user wants to do with the container, and there are four possibilities:
1. args is empty, in which case we open up an interactive session.
2. args is a list of strings, and the first string is snakemake.
3. args is a single string, and starts with snakemake.
4. args is either a single string or a list of strings, and doesn't contain snakemake.
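For concreteness, here's roughly what each of these looks like from the command line (using the main_figure_paper target from my project, with sfp.sif as the image):

./run_singularity.py sfp.sif                                  # 1: interactive session
./run_singularity.py sfp.sif snakemake main_figure_paper      # 2: list whose first string is snakemake
./run_singularity.py sfp.sif 'snakemake -n main_figure_paper' # 3: single string starting with snakemake
./run_singularity.py sfp.sif ls /home/sfp_user                # 4: doesn't contain snakemake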
The if block above, which checks whether args[0] is exactly snakemake, handles possibility #2. In this case, we know that the arguments contain no flags for snakemake (if they did, args would have to be quoted and we'd be in possibility #3). We therefore modify the command to be run from inside the container, adding the specification of the config file (which we modified with singularity_env.sh) and the paths to both the working directory and the Snakefile (these last two mean that we can run the snakemake command from other paths within the container). Finally, we append the rest of the user-specified arguments.
# in this case they passed a string so args[0] contains snakemake and then
# a bunch of other stuff
elif args and args[0].startswith('snakemake'):
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', args[0].replace('snakemake ', ''), *args[1:]]
# if the user specifies --profile slurm, replace it with the
# appropriate path. We know it will be in the last one of args and
# nested below the above elif because if they specified --profile then
# the whole thing had to be wrapped in quotes, which would lead to this
# case.
if '--profile slurm' in args[-1]:
args[-1] = args[-1].replace('--profile slurm',
'--profile /home/sfp_user/.config/snakemake/slurm')
# then need to make sure to mount this
elif '--profile' in args[-1]:
profile_path = re.findall('--profile (.*?) ', args[-1])[0]
profile_name = op.split(profile_path)[-1]
volumes.append(f'{profile_path}:/home/sfp_user/.config/snakemake/{profile_name}')
args[-1] = args[-1].replace(f'--profile {profile_path}',
f'--profile /home/sfp_user/.config/snakemake/{profile_name}')
This block corresponds to possibility #3. The beginning is very similar to #2, explained above, except that we include the rest of args[0] after removing the word snakemake. The following sections either remap the snakemake profile from slurm to the slurm profile within the container, or find the specified path of the snakemake profile and add it to the list of volumes to mount within the container. Note that this will fail if the user just specifies the name of the profile, rather than giving the full path. It could be made more general by checking whether the profile_path variable grabbed with the regex looks like a path and, if not, searching the locations that snakemake searches for profiles (notably, ~/.config/snakemake), but that seemed like a lot of work.
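For reference, a minimal sketch of that more general approach might look like the following (resolve_profile is a hypothetical helper, not part of the actual script):

import os.path as op

def resolve_profile(profile_path):
    """Resolve a snakemake profile argument to a path we can mount."""
    # if the user gave us an actual path, use it as-is
    if op.exists(profile_path):
        return profile_path
    # otherwise, treat it as a profile name and check the main location
    # snakemake searches for profiles
    candidate = op.expanduser(op.join('~/.config/snakemake', profile_path))
    if op.isdir(candidate):
        return candidate
    raise ValueError(f"Can't find snakemake profile {profile_path}!")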
# open up an interactive session if the user hasn't specified an argument,
# otherwise pass the argument to bash. regardless, make sure we source the
# env.sh file
if not args:
args = ['/bin/bash', '--init-file', '/home/sfp_user/singularity_env.sh']
This block corresponds to possibility #1: no arguments passed. In this case, we simply open up an interactive bash session, sourcing singularity_env.sh before we start.
else:
args = ['/bin/bash', '-c',
# this needs to be done with single quotes on the inside so
# that's what bash sees, otherwise we run into
# https://stackoverflow.com/questions/45577411/export-variable-within-bin-bash-c;
# double-quoted commands get evaluated in the *current* shell,
# not by /bin/bash -c
f"'source /home/sfp_user/singularity_env.sh; {' '.join(args)}'"]
This block handles possibilities #2 through #4: we pass the arguments through to the bash interpreter, making sure to source singularity_env.sh first. In an ideal world, this sourcing could be done with the same --init-file arg as used in the previous block, but that wasn't working for me. Note the importance of single quotes.
# set these environmental variables, which we use for the jobs submitted to
# the cluster so they know where to find the container and this script
env_str = f"--env SFP_PATH={op.dirname(op.realpath(__file__))} --env SINGULARITY_CONTAINER_PATH={image}"
These two environmental variables are used by our snakemake slurm profile, described in the next section, so that all the jobs we submit know where to find the container and the wrapper script, since they need both as well.
# the -e flag makes sure we don't pass through any environment variables
# from the calling shell, while --writable-tmpfs enables us to write to the
# container's filesystem (necessary because singularity_env.sh makes a
# temporary config.json file)
if software == 'singularity':
exec_str = f'singularity exec -e {env_str} --writable-tmpfs {volumes} {image} {" ".join(args)}'
Now, we combine all the strings we've been configuring into a single string that we can execute. The only new components are the -e flag, which ensures we don't pass through any extra environmental variables from the calling shell, and the --writable-tmpfs flag, which allows us to write to the container's filesystem (not just the mounted ones); we need this because we modify the snakemake configuration file.
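For illustration, the assembled command ends up looking something like this (paths abbreviated and hypothetical, with most of the --bind flags omitted):

singularity exec -e --env SFP_PATH=/path/to/repo --env SINGULARITY_CONTAINER_PATH=sfp.sif \
    --writable-tmpfs --bind /path/to/repo:/home/sfp_user/spatial-frequency-preferences \
    --bind /share/apps/matlab/2020b:/home/sfp_user/matlab \
    sfp.sif /bin/bash -c 'source /home/sfp_user/singularity_env.sh; snakemake ...'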
elif software == 'docker':
volumes = volumes.replace('--bind', '--volume')
exec_str = f'docker run {volumes} -it {image} {" ".join(args)}'
if sudo:
exec_str = 'sudo ' + exec_str
If we're using docker instead of singularity, the command is slightly different: we replace --bind with --volume and change some of the other flags. This command doesn't use the -e flag or the extra environmental variable manipulations that singularity requires, because docker cannot be used on the cluster, and so will never need to interact with the job scheduler (e.g., SLURM).
print(exec_str)
# we use shell=True because we want to carefully control the quotes used
subprocess.call(exec_str, shell=True)
Finally, we print out the command we're running (largely for debugging purposes; I'm not sure how useful it is to the user) and call it using subprocess.
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description=("Run billbrod/sfp container. This is a wrapper, which binds the appropriate"
" paths and sources singularity_env.sh, setting up some environmental variables.")
)
parser.add_argument('image',
help=('If running with singularity, the path to the '
'.sif file containing the singularity image. '
'If running with docker, name of the docker image.'))
parser.add_argument('--software', default='singularity', choices=['singularity', 'docker'],
help="Whether to run this with singularity or docker")
parser.add_argument('--sudo', '-s', action='store_true',
help="Whether to run docker with sudo or not. Ignored if software==singularity")
parser.add_argument("args", nargs='*',
help="Command to pass to the container. If empty, we open up an interactive session.")
args = vars(parser.parse_args())
main(**args)
The very bottom of the script contains the argparse configuration, which makes the command-line help string more informative.
The user calls the script like so: ./run_singularity.py path/to/singularity_image.sif CMD if using singularity, and ./run_singularity.py user/image_name --software docker -s CMD if using docker (the -s flag runs docker as sudo, and so may not be necessary, depending on their docker configuration). Note that this CMD is the equivalent of args discussed above. CMD can be left empty (in which case an interactive session will be opened) or be a string that will be run inside the container, such as the snakemake commands required to recreate the analysis, e.g., 'snakemake main_figure_paper' (for my spatial-frequency-preferences project). Note that single quotes are necessary if any flags are included in CMD, in order to prevent run_singularity.py from trying to interpret them.
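Putting it all together, a user on greene could submit the full analysis to the cluster with something like the following (main_figure_paper is the target from my project; the -j flag caps the number of simultaneous jobs and can be adjusted):

./run_singularity.py path/to/sfp.sif 'snakemake --profile slurm -j 10 main_figure_paper'

Because the command contains flags, it's wrapped in single quotes, and run_singularity.py remaps --profile slurm to the profile's location inside the container, as described above.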
Finally, the complete script:
#!/usr/bin/env python3
import argparse
import subprocess
import os
import os.path as op
import json
import re
from glob import glob
# slurm-related paths. change these if your slurm is set up differently or you
# use a different job submission system. see docs
# https://sylabs.io/guides/3.7/user-guide/appendix.html#singularity-s-environment-variables
# for full description of each of these environmental variables
os.environ['SINGULARITY_BINDPATH'] = os.environ.get('SINGULARITY_BINDPATH', '') + ',/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd'
os.environ['SINGULARITYENV_PREPEND_PATH'] = os.environ.get('SINGULARITYENV_PREPEND_PATH', '') + ':/opt/slurm/bin'
os.environ['SINGULARITY_CONTAINLIBS'] = os.environ.get('SINGULARITY_CONTAINLIBS', '') + ',' + ','.join(glob('/opt/slurm/lib64/libpmi*'))
def check_singularity_envvars():
"""Make sure SINGULARITY_BINDPATH, SINGULARITY_PREPEND_PATH, and SINGULARITY_CONTAINLIBS only contain existing paths
"""
for env in ['SINGULARITY_BINDPATH', 'SINGULARITYENV_PREPEND_PATH', 'SINGULARITY_CONTAINLIBS']:
paths = os.environ[env]
joiner = ',' if env != "SINGULARITYENV_PREPEND_PATH" else ':'
paths = [p for p in paths.split(joiner) if op.exists(p)]
os.environ[env] = joiner.join(paths)
def check_bind_paths(volumes):
"""Check that paths we want to bind exist, return only those that do."""
return [vol for vol in volumes if op.exists(vol.split(':')[0])]
def main(image, args=[], software='singularity', sudo=False):
"""Run sfp singularity container!
Parameters
----------
image : str
If running with singularity, the path to the .sif file containing the
singularity image. If running with docker, name of the docker image.
args : list, optional
command to pass to the container. If empty (default), we open up an
interactive session.
software : {'singularity', 'docker'}, optional
Whether to run image with singularity or docker
sudo : bool, optional
If True, we run docker with `sudo`. If software=='singularity', we
ignore this.
"""
check_singularity_envvars()
with open(op.join(op.dirname(op.realpath(__file__)), 'config.json')) as f:
config = json.load(f)
volumes = [
f'{op.dirname(op.realpath(__file__))}:/home/sfp_user/spatial-frequency-preferences',
f'{config["MATLAB_PATH"]}:/home/sfp_user/matlab',
f'{config["FREESURFER_HOME"]}:/home/sfp_user/freesurfer',
f'{config["FSLDIR"]}:/home/sfp_user/fsl',
f'{config["DATA_DIR"]}:{config["DATA_DIR"]}',
f'{config["WORKING_DIR"]}:{config["WORKING_DIR"]}'
]
volumes = check_bind_paths(volumes)
# join puts --bind between each of the volumes, we also need it in the
# beginning
volumes = '--bind ' + " --bind ".join(volumes)
# if the user is passing a snakemake command, need to pass
# --configfile /home/sfp_user/sfp_config.json, since we modify the config
# file when we source singularity_env.sh
if args and 'snakemake' == args[0]:
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', *args[1:]]
# in this case they passed a string so args[0] contains snakemake and then
# a bunch of other stuff
elif args and args[0].startswith('snakemake'):
args = ['snakemake', '--configfile', '/home/sfp_user/sfp_config.json',
'-d', '/home/sfp_user/spatial-frequency-preferences',
'-s', '/home/sfp_user/spatial-frequency-preferences/Snakefile', args[0].replace('snakemake ', ''), *args[1:]]
# if the user specifies --profile slurm, replace it with the
# appropriate path. We know it will be in the last one of args and
# nested below the above elif because if they specified --profile then
# the whole thing had to be wrapped in quotes, which would lead to this
# case.
if '--profile slurm' in args[-1]:
args[-1] = args[-1].replace('--profile slurm',
'--profile /home/sfp_user/.config/snakemake/slurm')
# then need to make sure to mount this
elif '--profile' in args[-1]:
profile_path = re.findall('--profile (.*?) ', args[-1])[0]
profile_name = op.split(profile_path)[-1]
volumes.append(f'{profile_path}:/home/sfp_user/.config/snakemake/{profile_name}')
args[-1] = args[-1].replace(f'--profile {profile_path}',
f'--profile /home/sfp_user/.config/snakemake/{profile_name}')
# open up an interactive session if the user hasn't specified an argument,
# otherwise pass the argument to bash. regardless, make sure we source the
# env.sh file
if not args:
args = ['/bin/bash', '--init-file', '/home/sfp_user/singularity_env.sh']
else:
args = ['/bin/bash', '-c',
# this needs to be done with single quotes on the inside so
# that's what bash sees, otherwise we run into
# https://stackoverflow.com/questions/45577411/export-variable-within-bin-bash-c;
# double-quoted commands get evaluated in the *current* shell,
# not by /bin/bash -c
f"'source /home/sfp_user/singularity_env.sh; {' '.join(args)}'"]
# set these environmental variables, which we use for the jobs submitted to
# the cluster so they know where to find the container and this script
env_str = f"--env SFP_PATH={op.dirname(op.realpath(__file__))} --env SINGULARITY_CONTAINER_PATH={image}"
# the -e flag makes sure we don't pass through any environment variables
# from the calling shell, while --writable-tmpfs enables us to write to the
# container's filesystem (necessary because singularity_env.sh makes a
# temporary config.json file)
if software == 'singularity':
exec_str = f'singularity exec -e {env_str} --writable-tmpfs {volumes} {image} {" ".join(args)}'
elif software == 'docker':
volumes = volumes.replace('--bind', '--volume')
exec_str = f'docker run {volumes} -it {image} {" ".join(args)}'
if sudo:
exec_str = 'sudo ' + exec_str
print(exec_str)
# we use shell=True because we want to carefully control the quotes used
subprocess.call(exec_str, shell=True)
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description=("Run billbrod/sfp container. This is a wrapper, which binds the appropriate"
" paths and sources singularity_env.sh, setting up some environmental variables.")
)
parser.add_argument('image',
help=('If running with singularity, the path to the '
'.sif file containing the singularity image. '
'If running with docker, name of the docker image.'))
parser.add_argument('--software', default='singularity', choices=['singularity', 'docker'],
help="Whether to run this with singularity or docker")
parser.add_argument('--sudo', '-s', action='store_true',
help="Whether to run docker with sudo or not. Ignored if software==singularity")
parser.add_argument("args", nargs='*',
help=("Command to pass to the container. If empty, we open up an interactive session."
" If it contains flags, surround with SINGLE QUOTES (not double)."))
args = vars(parser.parse_args())
main(**args)
slurm snakemake profile
If we want to submit jobs to the job scheduler, we need to tell snakemake how to do so, and to use the container when it does. We do this via a custom profile, which can be found on the singularity branch of my snakemake-slurm github repository. This is based on the canonical slurm snakemake profile. I originally created my version 5 years ago, and it's possible the original has picked up helpful changes since then, so you can use either mine or the original for this (the only changes I've made, besides those described below, are a small change to add support for the --gres option, required for requesting GPUs, and a small change to the partition handling).
The only file that is specific to the implementation discussed in this post is slurm-jobscript.sh:
#!/bin/bash
#SBATCH --export=SINGULARITY_CONTAINER_PATH,SFP_PATH
# properties = {properties}
# q is a special formatting symbol used by snakemake to tell it to escape quotes
# correctly
$SFP_PATH/run_singularity.py $SINGULARITY_CONTAINER_PATH {exec_job:q}
Note that this script makes use of the SINGULARITY_CONTAINER_PATH and SFP_PATH environmental variables, which we made sure to set in the wrapper script run_singularity.py above. They specify the location of the singularity container holding the environment and the directory containing the code for the project (and thus, run_singularity.py). Each job that snakemake submits to the cluster will use this script as the template for its jobscript, so we are ensuring that each of those jobs uses run_singularity.py to run within the container. We also use the :q formatting symbol (a special snakemake one, not available in standard python) to escape quotes correctly, which is important since, as discussed in the previous section, we need to control the single quotes carefully to make sure the flags are interpreted by the right software.
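To make this concrete, after snakemake fills in {properties} and {exec_job}, a submitted jobscript might look something like the following (a hypothetical rendering; the properties line and the exact snakemake command substituted for {exec_job} depend on the rule and the snakemake version):

#!/bin/bash
#SBATCH --export=SINGULARITY_CONTAINER_PATH,SFP_PATH
# properties = {"rule": "preprocess", "jobid": 42, ...}
$SFP_PATH/run_singularity.py $SINGULARITY_CONTAINER_PATH 'snakemake path/to/output --snakefile ... --force ...'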
Archiving
Finally, we would like to be able to back up the containers for long-term archiving. I wrote this code to make my analysis and figure creation reproducible for a scientific paper, which means I would like the code to remain runnable for as long as possible, even though it is unlikely that many people will use it. Docker Hub had planned to delete images from free accounts after six months of inactivity, though it's unclear what the status of that plan is. Regardless, we should not be relying on Docker Hub for long-term archiving, and so we need another solution.
Fortunately, singularity creates a .sif file containing the container when pulling it, and docker can export the container into a .tar file using docker save (e.g., docker save billbrod/sfp:v1.0.0 > sfp_v1.0.0_docker.tar). These files can then be archived like any other file. They are rather large (2.5GB for the .sif file, 5.1GB for the .tar file), and so something like the Open Science Framework, where I normally place my academic research-related files, is not an option.
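For example, creating both archive files for a tagged release might look like this (assuming the image was also tagged and pushed as v1.0.0):

# singularity writes the .sif file directly when pulling
singularity pull sfp_v1.0.0.sif docker://billbrod/sfp:v1.0.0
# docker requires an explicit save step
sudo docker save billbrod/sfp:v1.0.0 > sfp_v1.0.0_docker.tar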
Ultimately, I used NYU's Faculty Digital Archive. Something like Amazon Web Services' S3 buckets (or other cloud storage) would be another option, though not free. Whatever service you use should give objects a DOI and allow for versioning; ideally, it should also be run by a non-profit and be open source, but that might not be possible. If you're at a university, I recommend reaching out to your university library or HPC team to see if they have any suggestions. Fundamentally, archiving and sharing the container is not that different from archiving and sharing a large data set.
Archiving the container somewhere publicly accessible allows users to download and use it even if the Docker Hub image is deleted. With singularity, the .sif file is used directly, whereas with docker, docker load is required (e.g., docker load < sfp_v1.0.0_docker.tar).
Conclusion
This post outlines how to create and use a container that includes as many dependencies as possible, allows for mounting any additional dependencies (such as those that require a license), can be used identically locally and on the cluster, and can use snakemake to submit jobs to the cluster. While the scripts I wrote are all slurm-specific (and, even more so, specific to NYU's greene cluster), hopefully they can provide a jumping-off point for others.