Docker + GPUs


👉 Note: All docker notes.
👉 My Dockerfile setup on Github.

WSL + Windows

👉 Note: WSL + Windows

With Tensorflow or PyTorch

👉 Official doc for TF + docker
👉 Note: Docker + TF.
👉 An example of a PyTorch Docker image with GPU support.

Basic installation

⚠️ Warning: You must (successfully) install the GPU driver on your (Linux) machine before proceeding with the steps in this note. Go to the "Check info" section to check the availability of your drivers.

ℹ️ Info: (Maybe just for me) Everything works perfectly on Pop!_OS 20.04. I tried Pop!_OS 21.10 and ran into a lot of problems, so stay with 20.04!

sudo apt update

sudo apt install -y nvidia-container-runtime
# You may need to replace the line above with
sudo apt install nvidia-docker2
sudo apt install nvidia-container-toolkit

sudo apt install -y nvidia-cuda-toolkit
# restart required
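
# Quick sanity check after installing (a sketch; the CUDA image tag is an assumption,
# pick one that matches your driver)
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi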

If you have problems installing nvidia-docker2, read this section!

Check info

# Verify that your computer has a graphic card
lspci | grep -i nvidia
# First, install drivers and check
nvidia-smi
# output: NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0
# It's the maximum CUDA version that your driver supports
# Check current version of cuda
nvcc --version
# If nvcc is not available, it may be in /usr/local/cuda/bin/
# Add this location to PATH
# modify ~/.zshrc or ~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH

# You may need to install
sudo apt install -y nvidia-cuda-toolkit

If below command doesn't work, try to install nvidia-docker2 (read this section).

# Install and check nvidia-docker
dpkg -l | grep nvidia-docker
# or
nvidia-docker version
# Verify the --gpus option under docker run
docker run --help | grep -i gpus
# output: --gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)

Does Docker work with GPU?

# List all GPU devices
docker run -it --rm --gpus all ubuntu nvidia-smi -L
# output: GPU 0: GeForce GTX 1650 (...)
# ERROR ?
# docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
# ERROR ?
# Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

# Solution: install nvidia-docker2
# Verifying again with nvidia-smi
docker run -it --rm --gpus all ubuntu nvidia-smi

# Return something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 55C P0 11W / N/A | 369MiB / 4096MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
# and another box listing the running GPU processes

It's archived, but still useful:
# Test a working setup container-toolkit
# Update 14/04/2022: the tag "latest" has been deprecated => check your system versions and use
# the corresponding tag
# The following code is for reference only, it no longer works
docker run --rm --gpus all nvidia/cuda nvidia-smi
# Test a working setup container-runtime
# Update 14/04/2022: the code below no longer works because nvidia/cuda doesn't have
# the "latest" tag!
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

# Error response from daemon: Unknown runtime specified nvidia.
# Search below for "/etc/docker/daemon.json"
# Maybe it helps.
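
Since the "latest" tag is gone, a tag-pinned test like this should work today (a sketch; the exact tag is an assumption, pick one from hub.docker.com/r/nvidia/cuda/tags that matches your CUDA version):

docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi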

Check cudnn

whereis cudnn
# cudnn: /usr/include/cudnn.h

# Check cudnn version
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
# or try this (it works for me, cudnn 8)
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

Install nvidia-docker2

More information (ref)

This package is the only docker-specific package of any of them. It takes the script associated with the nvidia-container-runtime and installs it into docker's /etc/docker/daemon.json file for you. This then allows you to run (for example) docker run --runtime=nvidia ... to automatically add GPU support to your containers. It also installs a wrapper script around the native docker CLI called nvidia-docker which lets you invoke docker without needing to specify --runtime=nvidia every single time. It also lets you set an environment variable on the host (NV_GPU) to specify which GPUs should be injected into a container.
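
For illustration, a minimal sketch of what that gives you (the CUDA image tag is an assumption):

# with nvidia-docker2 installed, these two should behave the same
docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi
nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi
# the wrapper also honors NV_GPU to choose which GPUs get injected
NV_GPU=0 nvidia-docker run --rm nvidia/cuda:11.0-base nvidia-smi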

👉 (Follow this for the most up-to-date instructions) Official installation guide.

Note: (Only for me) Use the commands below.

Command lines (for a quick preview)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

# NOTE FOR POPOS 20.04
# replace above line with
distribution=ubuntu20.04

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2

👇 Read more about the error below.

# Error?
# Read more:
# Depends: nvidia-container-toolkit (>= 1.9.0-1) but 1.5.1-1pop1~1627998766~20.04~9847cf2 is to be installed

# create a new file
sudo nano /etc/apt/preferences.d/nvidia-docker-pin-1002
# with below content
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
# then save

# try again
sudo apt-get install -y nvidia-docker2
# restart docker
sudo systemctl restart docker

# wanna check?
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# check version
nvidia-docker version

Difference: nvidia-container-toolkit vs nvidia-container-runtime

👉 What's the difference between the latest nvidia-docker and nvidia container runtime?

In that note, the author says that with Docker 19.03+ (check with docker --version), nvidia-container-toolkit is used for --gpus (in docker run ...), while nvidia-container-runtime is needed for --runtime=nvidia (which can also be used in a docker-compose file).

However, if you want to use Kubernetes with Docker 19.03, you actually need to continue using nvidia-docker2 because Kubernetes doesn't support passing GPU information down to docker through the --gpus flag yet. It still relies on the nvidia-container-runtime to pass GPU information down the stack via a set of environment variables.
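
For a quick comparison, a sketch of the two invocation styles (the image tag is an assumption):

# toolkit route (Docker 19.03+)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# runtime route (what "runtime: nvidia" in docker-compose and older Kubernetes setups rely on)
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi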

👉 Installation Guide - NVIDIA Cloud Native Technologies documentation

Using docker-compose?

Purpose?

# instead of using
docker run \
    --gpus all \
    --name docker_thi_test \
    --rm \
    -v abc:abc \
    -p 8888:8888 \
    <image_name>
# we use this with docker-compose.yml
docker-compose up
# check version of docker-compose
docker-compose --version
# If "version" in docker-compose.yml < 2.3
# Modify: /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
# restart our docker daemon
sudo pkill -SIGHUP dockerd
# If "version" in docker-compose.yml >=2.3
# docker-compose.yml => able to use "runtime"
version: '2.3' # MUST BE >=2.3 AND <3
services:
testing:
ports:
- "8000:8000"
runtime: nvidia
volumes:
- ./object_detection:/object_detection

👉 Check more in my repo my-dockerfiles on Github.

Run the test,

docker pull tensorflow/tensorflow:latest-gpu-jupyter
mkdir -p ~/Downloads/test/notebooks

Without using docker-compose.yml (tensorflow) (cf. this note for more)

docker run --name docker_thi_test -it --rm -v $(realpath ~/Downloads/test/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
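
A quick sanity check (a sketch, assuming the container above is running) that TensorFlow inside it actually sees the GPU:

docker exec -it docker_thi_test python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"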

With docker-compose.yml?

# ~/Downloads/test/Dockerfile
FROM tensorflow/tensorflow:latest-gpu-jupyter

# ~/Downloads/test/docker-compose.yml
version: '2'
services:
  jupyter:
    container_name: 'docker_thi_test'
    build: .
    volumes:
      - ./notebooks:/tf/notebooks # notebook directory
    ports:
      - 8888:8888 # exposed port for jupyter
    environment:
      - NVIDIA_VISIBLE_DEVICES=0 # which gpu do you want to use for this container
      - PASSWORD=12345

Then run,

docker-compose run --rm jupyter

Check usage of GPU

# Linux only
nvidia-smi
# Returns something like this
# |===============================+======================+======================|
# | 0 GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
# | N/A 53C P8 2W / N/A | 3861MiB / 3914MiB | 2% Default |
# | | | N/A |
# +-------------------------------+----------------------+----------------------+

# => 3861MiB / 3914MiB is used!

# +-----------------------------------------------------------------------------+
# | Processes: GPU Memory |
# | GPU PID Type Process name Usage |
# |=============================================================================|
# | 0 3019 C ...e/scarter/anaconda3/envs/tf1/bin/python 3812MiB |
# +-----------------------------------------------------------------------------+

# => Process 3019 is using the GPU
# All processes that use GPU
sudo fuser -v /dev/nvidia*
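
Another way to list compute processes directly from nvidia-smi (a sketch):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv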

Kill process

# Kill a single process
sudo kill -9 3019

Reset GPU

# reset all GPUs
sudo nvidia-smi --gpu-reset
# reset a single GPU (index 0)
sudo nvidia-smi --gpu-reset -i 0

Errors with GPU

# Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
# Function call stack:
# train_function

👉 Check this answer as a reference!

👇 Use a GPU.

# Limit the GPU memory to be used
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Problems with pytorch versions: check this.


RuntimeError: cuda runtime error (804) : forward compatibility was attempted on non supported HW at /pytorch/aten/src/THC/THCGeneral.cpp:47 (possibly after a system update that included the NVIDIA CLI tools) => Same problem as below; you need to restart the computer.


nvidia-smi: Failed to initialize NVML: Driver/library version mismatch.

This thread: just restart the computer.
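
If a reboot is not possible right away, reloading the NVIDIA kernel modules sometimes clears the mismatch (a sketch, assuming no process is currently using the GPU; module names may differ on your system):

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia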

Make NVIDIA work in docker (Linux)

⚠️ Warning: This section still works (as of 26-Oct-2020), but it's obsolete; prefer the newer methods above.

One idea: use the NVIDIA driver of the base machine; don't install any driver inside Docker!

Detail of steps

  1. First, make sure your base machine has an NVIDIA driver.

    # list all gpus
    lspci -nn | grep '\[03'

    # check nvidia & cuda versions
    nvidia-smi
  2. Install nvidia-container-runtime

    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

    sudo apt-get update

    sudo apt-get install nvidia-container-runtime
  3. Note that we cannot use docker-compose.yml in this case!

  4. Create an image img_datas with Dockerfile is

    FROM nvidia/cuda:10.2-base

    RUN apt-get update && \
        apt-get -y upgrade && \
        apt-get install -y python3-pip python3-dev locales git

    # install dependencies
    COPY requirements.txt requirements.txt
    RUN python3 -m pip install --upgrade pip && \
        python3 -m pip install -r requirements.txt

    COPY . .

    # default command
    CMD [ "jupyter", "lab", "--no-browser", "--allow-root", "--ip=0.0.0.0" ]
  5. Create a container,

    docker run --name docker_thi --gpus all -v /home/thi/folder_1/:/srv/folder_1/ -v /home/thi/folder_1/git/:/srv/folder_2 -dp 8888:8888 -w="/srv" -it img_datas

    # -v: volumes
    # -w: working dir
    # --gpus all: using all gpus on base machine
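
    # quick sanity check (a sketch): confirm the running container sees the GPU
    docker exec -it docker_thi nvidia-smi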

This article is also very interesting and helpful in some cases.

References

  1. Difference between base, runtime and devel in Dockerfile of CUDA.
  2. Dockerfile on Github of Tensorflow.
