As part of a research group that prioritizes reproducibility, we’ve gradually started using Docker containers to ensure our code runs consistently across machines. While some members of the group have implemented this well, others (myself included) have only just come around.
Now, with a paper nearing submission, it’s become clear that I need a containerized environment for my work. The timing? Not ideal. My supervisor is away on holiday. But, truth be told, I’m not the most patient person. So I figured: if I need a distraction from integrating comments into my paper anyway, why not dive into Docker and see where I end up?
Honestly, it was mostly okay. There are a lot of resources out there, but you really need to know what you’re looking for. And that’s somehow the hard part. Because I couldn’t find a single, clear guide that exactly matched what I was trying to do, I’ve decided to write one — for my future self and for you.
Therefore, this post is for you if your goals include:
- Setting up Docker containers for data science research
- Working on a Windows machine but with Windows Subsystem for Linux (WSL) instead of Hyper-V
- Using VSCode as your environment
Choosing WSL 2 might seem a bit random at first. But after some quick reading and a few chats with AI tools, you’ll likely come across the advice that, compared with Hyper-V, the WSL 2 backend tends to be more flexible for development, especially over time and across different types of projects.
So, let’s get to it.
1. Understand What Docker Containers Are
Before diving in, make sure you understand the concept of Docker containers. The usual analogy is shipping containers: each project is neatly packaged with everything it needs to run – the code, dependencies and environment settings. This way, the project behaves the same regardless of where it is deployed.
Yes, you could keep everything in a local folder, but Docker keeps each project’s environment isolated from the others. And when sharing your work, you don’t need to send instructions like, “You’ll need to install X, Y, and Z before this runs.” You just share the image, or the Dockerfile that builds it.
2. Install Docker and Set Up WSL 2
To get started on Windows, you’ll need to set up WSL 2 and install Docker Desktop.
- Follow this Microsoft guide to set up the WSL environment, and use this extended guide to connect it to Docker.
- You’ll be prompted to create a Linux username and password. Just follow the prompts.
- Download Docker Desktop from the official Docker docs.
- During setup, select the option to use the WSL 2 backend.
- Once installed, you can check if Docker works by opening a terminal in Ubuntu (your WSL distribution) and running:
docker --version
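If the version prints but later Docker commands fail, a common cause is that the Docker daemon is not reachable from WSL. The sketch below is one way to tell the two cases apart from the Ubuntu terminal. The function name check_docker and its messages are made up for illustration; only the docker commands themselves come from the Docker CLI.

```shell
# Hypothetical helper: distinguishes "CLI missing" from "daemon not reachable".
check_docker() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker CLI not found (is WSL integration enabled in Docker Desktop?)"
        return 1
    fi
    # `docker info` talks to the daemon, so it fails when Docker Desktop is not running.
    if ! docker info >/dev/null 2>&1; then
        echo "docker CLI found, but the daemon is not reachable (is Docker Desktop running?)"
        return 2
    fi
    echo "docker is ready: $(docker --version)"
}

check_docker || true
```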
3. Move Your Project into WSL and Open It in VSCode
You can use files in your regular Windows directories (e.g., C:/Users/), but for best performance and compatibility, work directly in the WSL file system. Thus:
- Copy your project folder from the Windows directory into the WSL directory wsl…/home/username/.
- Open VSCode and install the Remote Development Extension Pack from Microsoft.
- In the Ubuntu terminal, run these lines one after the other:
cd your_project_folder_name
code .
cd sets the directory to your project folder and code . opens VSCode, running inside WSL.
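The copy step can also be done from the Ubuntu terminal, because WSL mounts Windows drives under /mnt (so C: appears as /mnt/c). Here is a minimal sketch of that. The helper name copy_into_wsl and the example paths in the comments are placeholders to adjust to your own machine.

```shell
# Hypothetical helper: copy a project folder from a Windows drive into the WSL file system.
copy_into_wsl() {
    src="$1"    # e.g. /mnt/c/Users/your_windows_user/your_project_folder_name (placeholder)
    dest="$2"   # e.g. "$HOME"
    if [ ! -d "$src" ]; then
        echo "source folder not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    # -r copies the folder and everything inside it
    cp -r "$src" "$dest"/
}
```

For example, copy_into_wsl /mnt/c/Users/your_windows_user/your_project_folder_name "$HOME" would put the project in your Linux home folder, ready for the cd and code . lines above.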
4. Containerize the Project with Docker
Now that Docker and WSL are working and your project is open in VSCode, it’s time to create the container.
- Create a Dockerfile, i.e., open a text editor and create a file with content such as the one below. Save the file as ‘Dockerfile’ with no extension.
FROM python:3.10
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
CMD ["python", "main.py"]
To get an idea of what should go into my Dockerfile, I checked Dockerfiles in my supervisor’s and other Data Science repositories and used some AI help.
- Build the Docker image by running the following command in your terminal.
docker build -t my_project_dir_name .
How long the build takes depends on what goes into the image. After a successful build, you can see the image and containers in the Docker panel in VSCode if you’ve installed the Docker extension (it is a separate install, not part of the Remote Development pack from earlier).
- Create a .dockerignore file to exclude large or unnecessary files from your image. Ideally, include only the script folder in the image and mount other folders during project sessions.
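To make the last step concrete, here is a rough sketch of a starter .dockerignore and of running the image with a folder mounted rather than copied in. The helper names, the ignored entries, and the data folder are all assumptions for illustration; swap in whatever is large or irrelevant in your own project.

```shell
# Hypothetical helper: write a starter .dockerignore unless one already exists.
write_dockerignore() {
    dir="${1:-.}"
    if [ -e "$dir/.dockerignore" ]; then
        echo "$dir/.dockerignore already exists, leaving it alone"
        return 0
    fi
    # Placeholder entries: list whatever is large or irrelevant in your project.
    cat > "$dir/.dockerignore" <<'EOF'
.git
**/__pycache__
**/.ipynb_checkpoints
data
results
EOF
}

# Hypothetical helper: run the image with a local data folder mounted at /app/data.
run_with_data() {
    image="$1"
    if docker image inspect "$image" >/dev/null 2>&1; then
        # --rm removes the container when it exits; -v mounts ./data into the container.
        docker run --rm -v "$(pwd)/data:/app/data" "$image"
    else
        echo "image '$image' not found: run the docker build step first"
    fi
}
```

From the project folder, write_dockerignore . followed by a rebuild keeps the listed folders out of the image, and run_with_data my_project_dir_name makes the data folder available inside the container at run time instead.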
5. Devcontainers for Data Science
The steps above give you a functional Docker image, but they assume you run everything from a terminal. That is fine for app development but not ideal for data science, where we often work in interactive sessions or Jupyter notebooks.
Devcontainers let you open your project inside the container in VSCode, with full access to notebooks, scripts, extensions, and interactive tools. To set one up:
- Inside your project directory, create a folder called .devcontainer and add a devcontainer.json file such as:
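{
  "name": "My Research Project",
  // Paths are relative to this file, so ../ points at the project root,
  // where the Dockerfile from step 4 lives.
  "build": {
    "dockerfile": "../Dockerfile",
    "context": ".."
  },
  "customizations": {
    "vscode": {
      "settings": {
        "terminal.integrated.defaultProfile.linux": "bash"
      },
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter"
      ]
    }
  },
  "postCreateCommand": "pip install -r requirements.txt"
}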
- Once created, from VSCode, open the Command Palette (Ctrl+Shift+P) and select: Dev Containers: Reopen in Container.
Now you are running inside a container with full access to notebooks and extensions.
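If you want to double-check where a terminal is actually running, one common heuristic is to look for the /.dockerenv file that Docker creates inside containers. This is a convention rather than an official API, and the function name in_container is made up for this sketch.

```shell
# Heuristic: Docker creates /.dockerenv inside containers. The path is a parameter
# only so the check can be tried against other marker files.
in_container() {
    [ -f "${1:-/.dockerenv}" ]
}

if in_container; then
    echo "inside a Docker container"
else
    echo "not inside a container (probably the WSL or host shell)"
fi
```

Run it in the VSCode integrated terminal after reopening in the container, and again in a plain Ubuntu terminal, to see the difference.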
You can check out this comprehensive guide on Devcontainers for data science.
6. GitHub and Final Tips
Back up your project to GitHub by initializing a Git repository in your WSL project folder and pushing it to GitHub. Here again, use a .gitignore file to keep large files out of the repository.
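As a rough sketch of that step, the helper below adds a few starter entries to .gitignore without overwriting what is already there. The helper name and the listed entries are placeholders for illustration, and the repository URL in the commented git commands is something you would fill in yourself.

```shell
# Hypothetical helper: add starter entries to the project's .gitignore.
# Appends rather than overwrites, and skips entries that are already listed.
add_gitignore_entries() {
    dir="${1:-.}"
    for entry in 'data/' 'results/' '__pycache__/' '.ipynb_checkpoints/' '*.pyc'; do
        if ! grep -qxF -- "$entry" "$dir/.gitignore" 2>/dev/null; then
            echo "$entry" >> "$dir/.gitignore"
        fi
    done
}

# After calling add_gitignore_entries in the project folder, a typical first push
# looks like this (the URL is a placeholder):
#   git init
#   git add .
#   git commit -m "Initial commit"
#   git branch -M main
#   git remote add origin <your-repository-URL>
#   git push -u origin main
```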
Expect a few hiccups during this setup, and some steps might fail due to machine-specific issues. Thus, exercise patience or refuse to sleep until you resolve the issues. But take breaks. Breaks help refresh the mind and give new insights.
Leverage AI as Copilot. Tools like GitHub Copilot, ChatGPT, and others can help you draft your Dockerfile or devcontainer.json and understand the syntax or config options in these files. But always compare with working examples and ask what each line actually does.
How to stop a running container:
When you’re done working, you don’t want containers running in the background forever.
- In VSCode: Open the Command Palette (Ctrl+Shift+P) and run Dev Containers: Close Remote Connection to shut the container down cleanly.
- Or, in the Ubuntu terminal: list running containers with
docker ps
and then stop the container using its ID:
docker stop <container-ID>
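If you would rather not copy IDs by hand, the two terminal commands can be combined. The sketch below stops every running container that was started from a given image, such as the one built in step 4. The function name stop_by_image is hypothetical, and containers that VSCode starts for a devcontainer may use a different, generated image name, so check docker ps first.

```shell
# Hypothetical helper: stop every running container that was started from a given image.
stop_by_image() {
    image="$1"
    # -q prints only container IDs; the ancestor filter matches containers created from the image.
    ids="$(docker ps -q --filter "ancestor=$image" 2>/dev/null || true)"
    if [ -z "$ids" ]; then
        echo "no running containers for image '$image'"
        return 0
    fi
    # $ids is left unquoted on purpose so each ID becomes its own argument.
    docker stop $ids
}
```

For example: stop_by_image my_project_dir_name.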
That’s it!
Now I’m going back to revising my paper. Have fun with your setup.