A notable point in this post is the installation of TensorFlow 2.9.1 (TF) with CUDA 11.3, which is not officially supported!
👉 Note: All Docker notes.
Since the exact versions matter a lot in this post, keep in mind that everything I write reflects the state of things at the time of writing!
From this to that
I want to create a Dockerfile based on a machine with the following specifications,
- My computer: Dell XPS 7590 (Intel i7 9750H/2.6GHz, GeForce GTX 1650 Mobile, RAM 32GB, SSD 1TB).
- OS: Pop!_OS 20.04 LTS (a distribution based upon Ubuntu 20.04).
- Nvidia driver (on the physical machine): 510.73.05
- CUDA (on the physical machine): 10.1
- Docker engine: 20.10.17
- Python: 3.9.7
With this Dockerfile we can create a container that supports,
- TensorFlow: 2.9.1.
- PyTorch: 1.12.1+cu113
- CUDA: 11.3
- cuDNN: 8
- Python: 3.8.10
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- Zsh & oh-my-zsh are already installed.
- Jupyter notebook is installed and automatically runs as an entrypoint.
- OpenSSH support (for accessing the container via SSH)
The final Dockerfile is on GitHub.
TL;DR;
Yes, you don't have to read the other sections; this one alone is enough to get everything running!
Install the GPU driver and Docker, and make them communicate with each other; read my note about Docker & GPU.
If you don't need the details and just want to use my Dockerfile to build an image, read "The final workflow" section.
Motivation: I want to try detectron2, which requires CUDA 11.3, while the latest version of CUDA officially supported by TensorFlow is 11.2.
Create a Dockerfile based on this official image from NVIDIA (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) to build a testing image. This image already has CUDA 11.3 and cuDNN 8 installed.
# Create the image "img_sample" from the file "Dockerfile"
docker build -t img_sample . -f Dockerfile
# Create and run a container from the image "img_sample"
docker run --name container_sample --gpus all -itd -w="/working" img_sample bash
# Enter the "container_sample" container
docker exec -it container_sample bash
# In the container "container_sample"
# Check if the NVIDIA Driver is recognized
nvidia-smi
# Check the version of CUDA
nvcc --version
# Check the version of cuDNN
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
Follow step 5 of the official tutorial with just a normal command:
pip install tensorflow==2.9.1
and verify the installation by
# TF works with CPU?
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
# TF works with GPU?
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"If everything is OK, we are successfull!
Then add the installation of PyTorch by
# Dockerfile
RUN pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
And other stuff:
# Install python packages in the requirements.txt file
COPY requirements.txt requirements.txt
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install -r requirements.txt
COPY . .
# Install and setup Zsh to replace the default "bash"
RUN apt-get install -y zsh && apt-get install -y curl
RUN PATH="$PATH:/usr/bin/zsh"
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
# Install OpenSSH
RUN apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:qwerty' | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
EXPOSE 22
Go back to "The final workflow" to see how it will be used after all.
Motivation
I want to try detectron2 from Meta AI, a library that provides state-of-the-art detection and segmentation algorithms. This library requires cuda=11.3 for a smooth installation. Therefore, I need a Docker container with cuda=11.3 + TensorFlow + PyTorch for this task. However, the latest version of cuda officially supported by TF is cuda=11.2.
If you do not necessarily need a special version of cuda that TF may not support, you can simply use TF's official Docker images. So, if you find that the versions of TF, CUDA and cuDNN match, just use one of them as a base Docker image or follow the official instructions from TensorFlow. This post is a general idea of how we can install TF with other versions of CUDA and cuDNN.
The final workflow
Download this Dockerfile file and rename it as Dockerfile (Yes, without extension!).
Create a requirements.txt file and add all the required Python packages (with their versions) there, like so
matplotlib==3.3.4
pandas==1.1.5
scikit-learn==0.24.2
Create a new Docker image with name "img_ai",
docker build -t img_ai . -f Dockerfile
Create and start a new container with name "container_ai" based on the image img_ai,
# Change /home/thi/git/ to your own local directory
docker run --name container_ai --gpus all \
  -v /home/thi/git/:/git/ \
  -d -p 8888:8888 \
  -p 6789:22 \
  -w="/git" -it img_ai
Enter the container and check,
docker exec -it container_ai zsh # Yes, we use zsh!!!
In the container,
# GPU driver
nvidia-smi
# CUDA version
nvcc --version
# cuDNN version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
# TensorFlow works with CPU?
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
# TensorFlow works with GPU?
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Torch works with GPU?
python3 -c "import torch; print(torch.cuda.is_available())"If you want to use SSH to access the container?
# First run the ssh server in the container first
docker exec container_ai $(which sshd) -Ddp 22
# Access it via
ssh -p 6789 root@localhost
# password of root: qwerty
Go to http://localhost:8888 to open the Jupyter Notebook.
That's it!
Installation and setup
Make sure the GPU driver is successfully installed on your machine and read this note to let the Docker Engine communicate with the physical GPU.
Basically, the following commands should work.
# Check if a GPU is available
lspci | grep -i nvidia
# Check NVIDIA driver info
nvidia-smi
# Check the version of cuda
nvcc --version
# Verify your nvidia-docker installation
docker run --gpus all --rm nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
If docker -v gives a version earlier than 19.03, you have to use --runtime=nvidia instead of --gpus all.
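For example, on such an older Docker Engine the verification command above would look like this (assuming the "nvidia" runtime from nvidia-docker2 is installed),
# Legacy syntax for Docker < 19.03 (needs the nvidia runtime from nvidia-docker2)
docker run --runtime=nvidia --rm nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi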
Choose a base image
Most problems come from TF: it is imperative to match the versions of TF, CUDA and cuDNN. You can check this link for the corresponding versions of TF, cuDNN and CUDA (we call it "list-1"). A natural way to choose a base image is to start from a TF docker image and then install PyTorch separately. This is a Dockerfile I built with this idea (tf-2.8.1-gpu, torch-1.12.1+cu113). However, as you can see in the "list-1" list, official TF builds only support CUDA 11.2 (or 11.0 or 10.1). If you want to install TF with CUDA 11.3, it's impossible if you start from the official build.
Based on the official tutorial of installing TF with pip, to install TF 2.9.1, we need cudatoolkit=11.2 and cudnn=8.1.0. What if we manage to get cuda=11.3 and cudnn=8 first and then look for a way to install TF 2.9?
From the NVIDIA public hub repository, I found this image (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) which already has cuda=11.3 and cudnn=8 installed.
What is the difference between base, runtime and devel in the names of images from the NVIDIA public hub? Check this.
I create a very simple Dockerfile starting from this base image to check if we can install torch=1.12.1+cu113 and tensorflow=2.9.1.
👉 A simple "Dockerfile" file based on nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
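The linked file is the source of truth; as a rough sketch (the exact package list below is my assumption), such a testing Dockerfile could be as simple as,
# A minimal testing Dockerfile (sketch; see the linked file for the exact content)
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# Avoid interactive prompts from apt during the build
ENV DEBIAN_FRONTEND=noninteractive
# We need Python + pip before we can "pip install tensorflow"
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /working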
# Create the image "img_sample" from the file "Dockerfile"
docker build -t img_sample . -f Dockerfile
# Create and run a container from the image "img_sample"
docker run --name container_sample --gpus all -itd -w="/working" img_sample bash
# Enter the "container_sample" container
docker exec -it container_sample bash
# In the container "container_sample"
# Check if the NVIDIA Driver is recognized
nvidia-smi
# Check the version of CUDA
nvcc --version
# Check the version of cuDNN
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
Then I try to install tensorflow following step 5 of this official tutorial,
pip install --upgrade pip
pip install tensorflow==2.9.1
YES! It's that simple!
Let's check if it works (step 6 in the tutorial).
# Verify the CPU setup
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
# A tensor should be returned, something like
# tf.Tensor(-686.383, shape=(), dtype=float32)
# Verify the GPU setup
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# A list of GPU devices should be returned, something like
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
🎉 Voilà, it works like clockwork!
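As an extra sanity check (not part of the tutorial), you can ask TF which CUDA/cuDNN versions its wheel was built against via tf.sysconfig.get_build_info(). For the official TF 2.9.1 wheel this reports CUDA 11.2, which still runs fine in our CUDA 11.3 container because CUDA 11.x minor versions are compatible with each other,
# Which CUDA/cuDNN versions was this TF wheel compiled against?
python3 -c "import tensorflow as tf; info = tf.sysconfig.get_build_info(); print(info['cuda_version'], info['cudnn_version'])"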
Add more things in the Dockerfile
Basically, we are done with the main part of this post. This section mainly explains why I also include the Zsh installation and OpenSSH setup in the Dockerfile.
Python libraries
All normally used packages (with their corresponding versions) are stored in the requirements.txt file. We need to copy this file into the container and start the installation process,
COPY requirements.txt requirements.txt
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install -r requirements.txt
COPY . .
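By the way, if you already have a working Python environment on your machine, one way to bootstrap this requirements.txt is pip freeze; it pins everything, so you may want to trim the output to the packages you actually need,
# Run on your machine: dump the current environment into requirements.txt
python3 -m pip freeze > requirements.txt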
PyTorch
Install PyTorch by following the official instructions,
RUN pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
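Once the image is built, a quick way to confirm that you really got the CUDA 11.3 build of torch (and not a CPU-only wheel) is to print its version strings,
# In the container: the output should look like "1.12.1+cu113 11.3"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"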
Zsh
👉 Note: Terminal + ZSH.
Add the following lines to install and set up Zsh. Why do we need Zsh instead of the default bash? Because we want a better-looking command line, not just white text. Another problem sometimes arises with the "backspace" key: when you type something and use backspace to correct a mistake, the previous character is not removed as it should be, and other characters appear instead. This problem was mentioned once before at the end of this section in another note.
# Dockerfile
RUN apt-get install -y zsh && apt-get install -y curl
RUN PATH="$PATH:/usr/bin/zsh"
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
After Zsh is installed,
# Instead of using this
docker exec -it docker_name bash
# Use this
docker exec -it docker_name zsh
One note: When adding an alias, be sure to add it to both .bashrc and .zshrc as follows,
# Dockerfile
RUN echo 'alias python="python3"' >> ~/.bashrc
RUN echo 'alias python="python3"' >> ~/.zshrc
OpenSSH
👉 Note: Local connection between 2 computers.
If you want to access a running container via SSH, you must install and run OpenSSH in that container and expose port 22.
# Dockerfile
RUN apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:qwerty' | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
EXPOSE 22
Don't forget to publish port 22 when you create a new container,
# add other options as needed
docker run --name container_name --gpus all \
  -d -p 6789:22 \
  -it image_name
One more step: if a jupyter notebook is already running (at port 8888) in your container, you need to run the following command to get the SSH server running,
docker exec docker_ai $(which sshd) -Ddp 22
Now if you want to access this container via SSH,
ssh -p 6789 root@localhost
Use the password qwerty (it's set in the code above, in the line RUN echo 'root:qwerty' | chpasswd)!
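If you connect to the container often, you can store the connection settings in the ~/.ssh/config file on your physical machine (the host alias docker_ai below is just an example name),
# ~/.ssh/config (on the physical machine)
Host docker_ai
    HostName localhost
    Port 6789
    User root
After that, ssh docker_ai is enough to get in.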
JupyterLab
It's great if our image has a running jupyter notebook server as an entrypoint, so that every time we create a new container, there's already a jupyter notebook running and we can just use it.
# Dockerfile
RUN python3 -m pip install jupyterlab
CMD /bin/bash -c 'jupyter lab --no-browser --allow-root --ip=0.0.0.0 --NotebookApp.token="" --NotebookApp.password=""'
Don't forget to expose port 8888 when you create a new container,
# add other options as needed
docker run --name container_name --gpus all \
  -d -p 8888:8888 \
  -it image_name
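To check that the notebook server actually started, you can read the container's logs; Jupyter prints its startup messages there,
# The JupyterLab startup messages (and errors, if any) appear here
docker logs container_name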
Go to http://localhost:8888 to open the notebook.