The HPC Container Community Survey

This is the first HPC Container Community Survey, intended to provide insights into container usage across the HPC community. The idea was originally proposed as part of the Containers Working Group, but it was stopped in its tracks. Earlier this year we realized how valuable the insights would be, even just to understand the basics of container technology usage, and the effort was rekindled. We received 202 responses in total and presented the results at the High Performance Container Workshop (agenda). We think this first year was a great success!


About People

What is your primary role?

The leading category is System Administrator, followed by Research Software Engineer, Software Engineer, and then Manager and Computer Scientist. We believe this sample explains why most survey respondents reported having intermediate or expert level experience with containers - these are the folks actively provisioning, building, and supporting the container technologies. In future years we might consider expanding the audience to more scientific communities to get feedback from more beginners.

We can tell from this question that we have a diverse community.

What is your primary environment?

Most of you are in academia, followed by commercial environments, national laboratories, and even consulting. This is an expected result, and it's fantastic to see the diversity of our community.

Container usage in HPC spans academia, national labs, industry, and beyond.

How do you rate your experience with containers?

It was surprising to see the majority of respondents report intermediate to expert usage. We likely need to do a better job of engaging with beginner container users.

The majority of the HPC containers community that we surveyed reports intermediate to expert ability.

The HPC containers community has a lot of developers! This question would be interesting to expand into the kind of developer. For example, working on a core container technology is different from working on a container orchestration tool, which is different again from a metadata extraction tool.

Developers, developers, developers! We have a surprisingly large developer base in this survey audience.


About Container Technologies

Which container technologies are supported on the system(s) you are working on?

Singularity / Apptainer is the most commonly provided container technology, followed by Docker and Podman. It’s not clear in what context Docker is being provided. We likely need to expand the questions to ask about the type of cluster or resource where the container technology is provided to give better insight into this answer.

It's clear that centers provide a range of supported technologies, with a handful being more likely to be found than others.

On those same systems and out of the set above, which HPC container technologies are you using?

Usage mirrors provision, although fewer people report using each technology than report it being provided. For Singularity the difference is small (~20), but for Docker (~35) and Podman (only half of those who reported their center provides it actually report using it) the differences are larger. This question suggests that centers should keep abreast of what users actually want to use versus what is provided.

However, of that set, fewer are reported to be used regularly.

What container technologies do you use on your local machine(s), personal or for work?

Logically, Docker (and having root) is the standard and preferred container technology when we have full control. Of the rootless “HPC” set, Singularity / Apptainer is next, followed by Podman.

For local usage, Docker is king.

Which HPC container technologies have you not used that you would like to use?

This question is interesting because the majority of people aren’t interested in trying a new one, suggesting they are satisfied, or at least not interested in other options they have not tried. It’s not clear if people chose responses for the other technologies just for the heck of it (and don’t intend to actually try them) or if these individuals will make a concerted effort to try them. My (@vsoch) guess is the former - it’s a survey question that people were providing an answer to, and they likely won’t prioritize going out of their way to try a new one. What we additionally need to ask here is why they want to try a new one. Likely a missing feature or ability is a stronger driver than “Sure, might be fun.”


About Images

What specification or recipe do you use to build containers?

Dockerfile is the clear leader here, and this makes sense because the other container technologies support either building from a Dockerfile directly or pulling down containers built from one.
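As a minimal sketch of why this works (the image names below are placeholders, not anything from the survey), a single Dockerfile-built image can be consumed by the other runtimes too:

```bash
# Build an OCI image from a Dockerfile; Podman accepts the same syntax.
docker build -t myapp:latest .

# Singularity / Apptainer can then convert that image into a SIF file
# straight from the local Docker daemon, without a registry in between.
apptainer build myapp_latest.sif docker-daemon:myapp:latest
```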

Do you use any supporting tools to build containers?

Our HPC package managers are leaders in helping us to build containers.
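For example, here is a hedged sketch of what this can look like with Spack (the package specs are illustrative only): Spack can generate the container recipe from an environment file, which is then built like any other Dockerfile.

```bash
# spack.yaml describes the software environment; the specs are examples only.
cat > spack.yaml <<'EOF'
spack:
  specs:
    - hdf5+mpi
    - openmpi
  container:
    format: docker
EOF

# Generate a Dockerfile from the environment and build it as usual.
spack containerize > Dockerfile
docker build -t my-spack-env:latest .
```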

Once built, do you tend to push containers to a central registry?

The fact that almost half of the community is not pushing images to a central registry is concerning, as it suggests those builds are less likely to be reproducible. If you need help creating a CI/CD pipeline or exploring options for registries (public or private), you can ask your local HPC administrators or research software engineers. This could also reflect the survey population, in that the majority of HPC administrators provide container technologies but do not actively build them.
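For anyone who wants to get started, a minimal, hedged sketch of the push side (the image and organization names are made up) looks something like this:

```bash
# Tag the locally built image with its registry location and push it
# (after authenticating, e.g. with docker login ghcr.io).
docker tag myapp:latest ghcr.io/my-lab/myapp:1.0.0
docker push ghcr.io/my-lab/myapp:1.0.0

# Anyone with access, including an HPC user via Apptainer, can later pull
# the exact same image, which is what buys you reproducibility.
apptainer pull myapp.sif docker://ghcr.io/my-lab/myapp:1.0.0
```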

What container registries are you pushing builds to?

We have a lot of registries, and despite a few issues over the years, Docker Hub is still the leader. GitLab is a close second, which might reflect the fact that it can be self-hosted and thus provided on premises by national labs and academic centers. GitHub Packages and Quay.io are next in line, and GitHub Packages makes sense as it is tightly paired with GitHub Actions, the CI/CD service for GitHub.

In what context(s) are you using containers?

Using containers for HPC applications and simulations makes sense, as does using them for developer environments (local or remote) and Kubernetes. The surprising result here was their use for provisioning.


Finishing Up

Do you typically have to build containers for multiple architectures?

The majority of our community is building for a single architecture, but a non-trivial number are building for more than one, so it is a valid use case that deserves attention.
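For those curious what that looks like in practice, here is a hedged sketch using Docker's buildx (the image name is a placeholder); tools like Podman offer comparable multi-architecture support:

```bash
# Build and push a single image manifest covering both x86_64 and aarch64.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t ghcr.io/my-lab/myapp:1.0.0 \
  --push .
```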

Do you use CI/CD for automated build and deploy?

This result could be reflective of the survey population, in that the majority of, for example, HPC administrators aren’t actively building and testing containers. If it’s reflective of overall practices, the result is more concerning. If almost half of you aren’t using CI/CD for automated builds and deploys, please consult a research software engineer or support staff if you’d like to do this but do not know how.
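If you'd like a starting point, the core of an automated pipeline is usually just a few commands run on every push. Here is a hedged sketch in shell form (the image name is a placeholder, and GIT_COMMIT is assumed to be provided by the CI system); in practice these lines would live inside a CI job, for example a GitHub Actions workflow step:

```bash
#!/bin/sh
set -e  # fail the pipeline if any step fails

# Build, smoke-test, and push on every commit; tagging with the commit SHA
# makes every deployed image traceable to the code that produced it.
docker build -t ghcr.io/my-lab/myapp:"${GIT_COMMIT}" .
docker run --rm ghcr.io/my-lab/myapp:"${GIT_COMMIT}" myapp --version
docker push ghcr.io/my-lab/myapp:"${GIT_COMMIT}"
```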

What are your biggest challenges or pain points when using containers, or reasons that you don’t use them?

Dealing with secrets is a pain point.
Getting MPI to work and making sure the build is optimized for any CPU it is running on.
Users
Docker's control on containers.
Knowing the restrictions placed on what a user can do with containers in their environment. I know singularity has a configuration file but a version of that file in a more user-friendly format/details would be helpful.
I only use them if needed.
Apptainer containers are immutable and lack caching so building is expensive. It does not have layers and we need separate distribution mechanism for first party software that changes often for efficiency.
Co-processor integration
Certificates
Extra complexity, little added value, lack of access to build environment, security concerns, tendency to leave built containers as-is and the resulting steady decent into deprecation hell
Lack of access to system administration
Handling very large container images, the layered approached is not compatible with this sort of workload - engineering applications, large software bundles with regular updates
root access
Everything HPC is singularity. Everything else is docker (I'm looking at you Nvidia)
Challenge: end-user education that containers are not a magical box that solves all software development/packaging/portability/distribution issues.
Challenge: end-user education that containers do not replace the need for revision control of the underlying software and the container build steps.
Challenge: end-user education that containers are not necessarily static (ie., updates needed to address security issues in OS or library components w/in the container, not the application).
Challenge: networking config of containers used for infrastructure services.
Integration with MPI is sometimes an issue
Convincing management that it is both safe and efficient to use containers on our compute resources. There still is some resistance, both culturally and out of technical ignorance, to using containers on research clusters.
Integration of HPC and viz libs / concepts, security
Security and lack of patched clusters
container size, weird apptainer behaviors between different ways to run containers, multiple architectures
For many things (not everything) containers seem to add an extra hoop to jump through without any proportional added value. For more complex things (kubernetes / orchestration / etc), it seems difficult to get much benefit until you're ready to "drink the koolaid" and go all in.
APIs and tooling can be a bit painful. Authentication is always tricky.
Proprietary hardware with software; GPU MPI injection
Although it makes sense, not being able to access them when they are in an improper state and having to get around with containerization commands instead of direct CLI commands inside of the container.
Running out of local disk space when building containers
Easy deploying of Containers in the HPC Cluster beyond Kubernetes
container size, especially for GPU or ML workloads. the conversion between docker and sif format to run in apptainer
Environment interoperability (multiple binaries, environment variables, etc)
Deploying MPI based software across multiple nodes on a Slurm system. Been able to execute on single node, but not multiple nodes.
Managing the size of containers with full compiler dev environments
Use of GPUs w/ custom CUDA libraries was a challenge when experimenting, but most of that comes from just starting out with using containers.
Variations in builds and differing dependencies. And MPI
Multi node e.g. MPI
user permission SUID etc
people upload images/containers with incomplete documentation
I guess any mention of containers make some of my coworkers flip out and mumble about root but they are also wedded to uh really old enterprise linux distros soooo it's all stressful. Hopefully they will retire soon.
Apptainer and Singularity are common in HPC, but they are outliers in the wider OCI ecosystem. For example, the images aren't supported in DockerHub or Quay; they don't come with a hypervisor for Mac or Windows; and they have no orchestration.
Image build time.
Ineffective policy implementation due to misunderstandings about container functionality by policy authors
Builds with spack
Retooling for containers yields little gain for a lot of work.
I didn't test widely containers on HPC but I want to
Networking and l2/l3 custom
Ease of use
For most code, initally no containers just python code etc and conda. Though if more standardised, containers are great.
Need to prepare two build scripts for AARCH64 and X86-64 container for optimised HPC workload. Try to minimise the size of container image file.
If you rely on pulling base image from public registry rather than copy kept in local registry the packages can change and break things.
docker security, place to save/backup docker images, took too long to backup, not as easy to get multi-node to run with containers, e.g. need RDMA support needs special setup on container (need user-level network driver on containers from different vendors)
Interconnects, MPI, GPU/accelerator
Not being able to use the same container technology on my local machine (Mac OS) and the HPC I use (Linux). I prefer Docker, which is what I use locally, but this cannot be used on the HPC. I cannot use Singularity/Apptainer locally, since I can't build there. So I use Docker locally and convert to SIF from Docker and upload it to the HPC whenever I need a custom image on the HPC.
From my experience managing teams of scientists (AI/HPC), the main challenge they face when trying to adopt containers is their lack of DevOps skills. Scientists want to prove a point and show that the idea in a scientific paper can be brought to life. However, once they prove their point, i.e. once their code runs in a given system (local, server, etc), they have little to no incentive to make it easier for others to reproduce their pipeline. Packaging applications into containers are extremely complex for scientists -- even with tools like Docker, they end up wasting a lot of time trying to get it right and ready for deployment.
Reproducibility of builds
Updating a container takes too long time, e.g. installing a new software
Challenges in compatibility between docker and singularity. Environment variables and some docker definition file are not equivalent or translated exactly in singularity.
Challenges in using MPI when using containers in a SLURM hpc system, specially system libraries vs container libraries.
Challenges when using CUDA in containers in a SLURM system, similar to the MPI case.
Lack of knowledge about free registries to upload the containers
Hard to get docker containers to work on M2 macbook.
long build times (especially with spack) paired with try and error for specific software needs
I am developing a Python library, which in turn depends on several other python and C libraries. I build a Singularity container with all dependencies. I would like to also install my own library in the container but I need it in "editable" mode for development, which I can't do in a read-only container.
This means I end up having to run the container from the library repo so it is findable at import time. I don't have much experience with containers so maybe there's a proper fix that I haven't found yet. Some kind of "developer" mode for interpreted languages that don't need a compilation step would be really nice.
I tend to use off-the-shelf containers provided by vendors (e.g. NGC) and the pain is usually about having to build a new container based on them to install relatively simple/small dependencies. Sometimes, when I don't want to build and manage an additional image, I just have a start up script to install such dependencies at runtime, just before my application executes. It would be nice to have this part facilitated somehow.
MPI support
Getting users to understand apptainer and the hassles of root
Lack of a standard manifest to indicate what's inside the container
Running rootless podman as a systemd service is, AFAICT, broken.
The story for using containerised MPI workloads (e.g. under Slurm) is complicated and seems to be not well understood.
The biggest pain points is to verify if there is any vulnerability on the containers.
Documentations for non-software engineers is limited and usually lacking applications example for my field (chemistry) which is a pity since it can unlock powerful workflows
Security--use of containers on high side systems is a PITA even when they solve problems.
Some HPC centers have poor support, like poor SLURM integration (no Pyxis), no easy way to pull them from registries or squashed files, no easy way to build containers at the HPC facility requiring transferring containers over slow networks, etc.
Running containers the way you would do on an HPC
Difficult to go to the "source code" of a container. Often given a running one to debug, and the original Dockerfile is nowhere to be found/nobody knows/etc.
User adoption and education
Containerization of certain legacy packages in bioinformatics
Supporting users in developing containers - that extra "hump" of getting an application or project containerized is often frustrating for users. Needing privilege to build containers is part of this, but also the perceived complexity around the "idea" of what a container is. It would be great for the user experience to be "start an interactive job, install/update your software, hit the "save" button to use it again".
Tend to prefer building from source
Getting from a working container with lots of layers down to something that's actually portable and usable. Providing enough flexibility for users to bring their own datasets.
Debugging - error messages not always informative
The dichotomy between dev (macOS) and production (Linux) environments; Docker is forbidden on the former so we use Singularity, while Singularity is not supported on the latter so we use Docker
compatibility with container and host MPIs and GPU drivers
In HPC workloads, using MPI efficiently is almost impossible. In particular, the shared memory communication is the weak point.
My limited knowledge of OS - a challenge often overcame by trial and error (and lots of google time).
Keeping up with system changes that impact the containers
The complexity
complexity - getting others up to speed
I had a lot of issues in the past using containers on old systems (GLIBC compatibility issues).
Complexity, reproducibility is too lax
file systems
scanning containers for malware on endpoints when not downloading from a trusted repository
Portability and sharing with users.
Explaining container paradigm/workflows to those unfamiliar with the concept.
Steep learning curve for users, so it's hard to explain users how to use singularity containers
Mostly getting users over the intimidation hurdle to use them! We use SHPC to obscure some containers, but I find users are hesitant to dive into containers directly
Most work, for my staff, is systems oriented where we build tools from source code
Lack of familiarity/experience and concern from our security team.
The requirement for subuid/subgid for builds and the lack of support for building on NFS/GPFS/Lustre filesystems. This is a major reason we don't support rootless podman on our HPC cluster and that we have to have local scratch disks for /tmp and apptainer builds.
Requiring elevated privileges (Docker), or the system not having a container runtime installed. Not being as secure for CI job runners as spinning up an ephemeral VM.
Compatibility with the host system
"Clever" things that do things for you but are a hassle (eg hidden singularity scripts, or complicated entry point scripts)
scheduling, location affinity, resource utilization
Visibility into the processes running inside the containers
Creating containers as non-root user requires manual configuration to permit
biggest pain: size overhead
Container builds requiring sudo
Time to build and troubleshoot iterations. Still new and defining our standards for building, base containers, naming conventions, user experience, all of it
Porting
MPI of proprietary HPC software

What can containers not do that you wish they could? What features would you like to see?

common X forwarding, finer grained platform differentiation (OCI WG), reuse of layers independent of layer rank
Mount inside a container (this is probably out of scope for containers in HPC)
Better MPI/PMIx support
reduction in complexity. using podman to create a simple jail isn't simple (should just use a jail or chroot tho lol)
Vulnerability scanning across the board. Generating an SBOM.
A better distribution story?
More attention to co-processor integration
make applications portable, make results reliably repeatable
Performance for specific architectures
Better support for rootless in shared clusters with appropriate networking and user namespaces. Better support for rootless builds.
Hardware isolation. HPC ecosystem needs to use singularity more for background processes, services (loghosts, web, open ondemand, etc), needs to be used to sequester users into individual or lab based clusters similar to virtuals (good for security), why isn't each user login a singularity container?
Security vulnerability reporting native to the container at build time; the equivalent of running "dnf -n update" or "apt list --upgradable" within the container when the container is launched and on a scheduled basis, and reporting the results to an address defined when the container was created.
Security built in to easily identify if a container has been patched or scanned
Easier integration with different CPU architectures and better performance especially for Docker.
I've been able to do everything I can do on bare metal in a container. I recently created a remote desktop container which is one of the last things I wish I was able to do.
Remote storage on startup (think, samba mount on start)
Proper pass through of sssd user ids
docker with normal user permission
I don't have enough experience to know how to answer this question...With my ex-CIO/IT manager hat on, I have to give the first answer to all such questions: Can security of containers be improved?
It's sort of done but would be nice to have better support for docker-compose style workloads. Any sort of container networking for singularity/apptainer would be nice.
Run Singularity/Apptainer images locally similarly to Docker Desktop.
Automate clean-up of the image following build
Better compatibility
Schedulers like Slurm need a better ability to deploy a collection of containers with private, encrypted container networking. Of course Kubernetes could be used, but that requires a very different HPC architecture than traditional HPC.
Possibly more tutorials and simplicification of using containers
Help to cross the OS platform automatically
Be reproducible in perpetuity, which they are far from
Generate Dockerfiles
Work without root permissions, that's the reason we use apptainer and not docker
See my previous answer.
It'd be great if we could easily combine containers !
I've had customers wish the ability to distribute containers that are usable by an end user but otherwise not modifiable/accessible. Basically a binary that can't be distributed outside of the container.
Easy way to import the MPI implementation from the HPC center, which is typically tuned to that center.
Easy "debug mode/tools" - you can always `exec` into a container, but then there's a vastly different set of software installed (might not have `curl`, and so on). In k8s environments the concept is easier (create a new container in the pod with all your tools, share namespaces & mounts) but still not perfect.
Be as fast as native baremetal applications.
I'd like to see checkpoint-migration-restore with production-ready integration with as many container technologies and schedulers as possible.
The CNCF and HPC facilities should take a close look at docker-nvidia-glx-desktop and docker-nvidia-egl-desktop, as well as selkies-gstreamer for graphical containers in unprivileged clusters.
Most HPC clusters and NSF-funded infrastructure are unprivileged and utilizing hardware-accelerated GUIs within a container was historically hard to perform.
Easily access data on a broad range of systems, be truly portable.
Seamless interaction without careful setup
Currently they do all I need
Out of the box infiniband, id management, out of the box GPU support.
Magically identify there system dependences, magically deal with version skews on the system
host mpi version conflicts
Provide a quick and standardized metadata/provenance report, to be used in other provenance formats like RO-Crate
sudo inside container
Docker and user identity is still a headache, but this is fine in singularity/apptainer
We have had great success with our researchers able to get going when our environment may not support all the requirements
Having something like the ONBUILD instructions in the Docker container format be supported by the OCI container format
Drivers! (Yes I know)
improvement in regards to graphics
Nesting containers
Not sure yet. But need to chain multiple containers in a pipeline to support various researcher workflows (haven't done this yet and hoping its possible)
Support for Linux device tree validation - would like containers built for specific architectures and accelerators (rdma nic, gpu, etc). This feature could be a serialization of the device tree file that gets validated with a string comparison on container initialization, or maybe a file with linear fields for different features that have hashed values which are validated against the machine running the image so that the container runtime could provide useful errors (architectures dont match, etc). Users could pick and choose what information to bake into the container so as little as architecture (x86, arm, riscv, etc) or as much as accelerator (gpu, rdma nic, etc)