Using NVIDIA NIM Containers
We will assume you have set up your Apptainer cache directory as noted in the Apptainer documentation.
We provide two examples: one using a stock Llama 3.1 (8B) model image, and another where we build a container for the Evo DNA foundation model.
NVIDIA provides Docker images on the NGC Site. Since Docker is ill-suited to an HPC environment, Marlowe uses Apptainer, which works seamlessly with Docker images. However, NGC requires authentication to download images, so a one-time setup is required.
One-time Setup
- Create an NVIDIA Developer account if you haven't already done so.

- Get an API key for logging into NVIDIA GPU Cloud (NGC). For our example, you can obtain one by clicking Get API Key at the top of the sample Python code on the model's NGC page.
- To avoid having to deal with this every time, you can save the username and key in your ~/.bash_profile and make sure it takes effect (source it, or log out and back in again):

export APPTAINER_DOCKER_USERNAME='$oauthtoken'
export APPTAINER_DOCKER_PASSWORD="NGC_API_KEY"
- Beware that Python packages such as triton make use of caches, typically in home directories. Since space is limited there, you should make a symbolic link to a larger/faster directory, for example:

ln -s /scratch/m223813/.triton_cache ~/.triton

The same goes for other packages that use ~/.cache for Hugging Face downloads; it is better to make a symbolic link for that too (see the consolidated sketch after this list).
- Apptainer also uses a cache that can become large, so it is best to create a cache directory in scratch and set an environment variable in your ~/.bash_profile to point Apptainer at it:

export APPTAINER_CACHEDIR=/scratch/<your_space>/.apptainer_cache
Llama Example
- Pull down the Llama image. You can search for it on the NGC website and find a copyable link for the image, then create an Apptainer image (.sif file) as below:

you@login-02$ cd /scratch/m223813
you@login-02$ apptainer pull docker://nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
- Run an interactive job on the partition provided for you; we use 8 GPUs and ask for 30 minutes in our example. Note down the node name, which is typically something like n01 or n02. We'll assume n01 in what follows.

you@login-02$ srun --partition=<your_partition> --gres=gpu:8 --ntasks=1 --time=30:00 --pty /bin/bash
- This Llama example runs a web service, so you need a tool such as tmux to split the screen in two: one pane where you will run the web service (call it A) and another where you will send requests (B). Run the container in session A; this will take about 10 minutes the first time. Note the use of the API key from step 2, which can be set up once in your ~/.bash_profile for convenience.

export LOCAL_NIM_CACHE=$SCRATCH/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
apptainer run --nv \
    --bind "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    --env NGC_API_KEY=$NGC_API_KEY \
    llama-3.1-8b-instruct_1.3.3.sif

Once the API service has started, you will see lines like the following:

INFO 2025-02-07 11:36:27.25 server.py:82] Started server process [1275338]
INFO 2025-02-07 11:36:27.25 on.py:48] Waiting for application startup.
INFO 2025-02-07 11:36:27.52 on.py:62] Application startup complete.
INFO 2025-02-07 11:36:27.69 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
- Switch to session B to hit the API endpoint. Below is the result of a test API call using curl on n01 where we ask for a limerick about Marlowe:

curl -X 'POST' \
    'http://localhost:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role":"user", "content":"Write a limerick about Marlowe GPU Cluster (31 DGXs)"}],
        "max_tokens": 64
    }'

The JSON output contains:

There once was a cluster so fine,
Marlowe DGXs, 31 in line,
Processing with care,
Through NVIDIA GPUs fair,
In AI work, a speedy shrine.
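If you script session B, it helps to wait until the service reports ready before sending requests. Below is a minimal sketch; it assumes the image exposes the standard NIM /v1/health/ready and OpenAI-compatible /v1/models endpoints, so check your image's documentation:

# Poll the (assumed) readiness endpoint until the service is up
until curl -sf http://localhost:8000/v1/health/ready > /dev/null; do
    echo "waiting for the NIM service..."
    sleep 10
done

# List the models the service exposes
curl -s http://localhost:8000/v1/models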
Evo Container
Often, it is more convenient to create a custom container that can be used over and over to run many jobs. We demonstrate by building such a container for the Evo model using the definition file evo.def below.
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:25.01-py3
%post
# Update package list and install development tools
apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    wget \
    git \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
# Several NVIDIA containers contain bad triton versions, which
# look for libcuda.so in the wrong places, so delete if installed
# and reinstall latest
pip uninstall -y triton
pip install triton
pip install evo-model
%runscript
# Run an Evo example script
echo "Running Evo example"
# Clone the repo if not already present (a repeat run would otherwise fail)
[ -d evo ] || git clone --depth 1 https://github.com/evo-design/evo.git
cd evo
python -m scripts.generate \
--model-name evo-1-131k-base \
--prompt ACGT \
--n-samples 10 \
--n-tokens 100 \
--temperature 1. \
--top-k 4 \
--device cuda:0
This definition file starts off using the pytorch:25.01-py3 container from NVIDIA NGC and then pip installs evo-model. One could use other pytorch containers, but we’ve found that earlier versions of the containers (e.g. pytorch:24.02-py3) contain triton package versions that do a bad job of locating libcuda.so. If you use those, be sure to uninstall the triton package and update to the latest version. In the worst case, you may have to build it yourself.
The %runscript section runs an example from the GitHub repo and uses just one GPU device.
Building the container image doesn’t require a GPU, so one can do the following on the login node.
export APPTAINER_DOCKER_USERNAME='$oauthtoken'
export APPTAINER_DOCKER_PASSWORD="YOUR NGC API KEY"
apptainer build evo.sif evo.def
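Before queuing a job, you can do a quick sanity check of the freshly built image on the login node; no GPU is needed just to confirm the package landed in the image:

apptainer exec evo.sif pip show evo-model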
The build creates a file evo.sif that can then be run on a GPU node via apptainer run <image.sif>, which will simply execute the example in the %runscript section. Or you can use an sbatch script:
#!/bin/bash
#SBATCH --job-name=run_evo
#SBATCH -p batch
#SBATCH --nodes=1
#SBATCH -A marlowe-m000xxx
#SBATCH -G 1
#SBATCH --time=2:00:00
#SBATCH --chdir=/scratch/m000xxx/
#SBATCH --error=/scratch/m000xxx/run_evo.err
export APPTAINER_DOCKER_USERNAME='$oauthtoken'
export APPTAINER_DOCKER_PASSWORD="YOUR NGC API KEY"
module load slurm
apptainer run --nv --bind /scratch/m000xxx evo.sif
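Assuming the script above is saved as run_evo.sbatch (the file name is arbitrary), submit it from the login node and monitor it with squeue:

sbatch run_evo.sbatch
squeue -u $USER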
Here is sample output:
Running Evo example
Generated sequences:
Prompt: "ACGT", Output: "AGACAAGGGCATACACCCCACCCTCAGTAAACTTCGGCCTGCCCTTGGAG...", Score: -1.458
Prompt: "ACGT", Output: "CCGGTTCCTCGGCCGTCTCCTCCGGCGCCAGATCGTAGATATTGGCAACT...", Score: -1.601
Prompt: "ACGT", Output: "GTTCGACGAGCCGGTGGCAGTGCAGCATCCGGCCCGCCTGGGAGACCTCC...", Score: -1.520
...
One can run something other than the code in the %runscript section. The line

apptainer shell --nv --bind /scratch/m000xxx evo.sif

will drop one into a shell inside the container. Or one can pass a command to apptainer exec:
apptainer exec --nv --bind /scratch/m000xxx evo.sif python -m scripts.generate \
--model-name evo-1-131k-base \
--prompt ACGT \
--n-samples 10 \
--n-tokens 100 \
--temperature 1. \
--top-k 4 \
--device cuda:0
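The same pattern works for your own code; for example (the script path below is hypothetical):

# Run a hypothetical script of your own inside the container
apptainer exec --nv --bind /scratch/m000xxx evo.sif \
    python /scratch/m000xxx/my_evo_job.py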
The evo.sif file can be shared among your lab members or others and will help with reproducibility.
A few things to note:
- Tools such as Evo download models and other artifacts into caches, typically in the home directory (e.g. ~/.cache). To avoid running out of disk quota, it is best to symlink the cache into the scratch area and to ensure that scratch is mounted in the container; the latter is done via the --bind option above.

- Apptainer itself uses a cache; once again, it is best to set the APPTAINER_CACHEDIR variable as indicated at the top of this document.