Talk for the Research Computing Leeds Conference: Save your tears for the data - A touch of Docker in a Data Scientist's workflow.
Slides: PDF
Many data science teams have become multilingual, leveraging R, Python, Julia and friends in their work. On top of that, different data scientists have different preferences in their coding environments and operating systems. While this diversity allows data scientists to work with the tools they are most comfortable with, it can become a pain to share the same projects across machines with different configurations. This talk illustrates how data scientists can leverage Dev Containers to create portable, reproducible and tailored development environments, which can be instantiated reliably across different operating systems and hardware. Data scientists can therefore focus on what they love and do best (i.e. data science) without having to worry about the hassle of reproducing their work, deploying their analysis dashboards or even deploying their models.
A development container is a running Docker container with a well-defined tool/runtime stack and its prerequisites. You can try out development containers with GitHub Codespaces in the cloud, or with Visual Studio Code Remote - Containers on a machine that has Docker installed, as per the instructions below:
Follow these steps to open this workshop in a Codespace:
- Click the Code drop-down menu on the repo and select the Open with Codespaces option.
- Select + New codespace at the bottom of the pane.
For more info, check out the GitHub documentation.
Follow these steps to open this workshop in a container using the VS Code Remote - Containers extension:
- If this is your first time using a development container, please ensure your system meets the prerequisites (i.e. have Docker installed) in the getting started steps.
- Press F1 and select the Add Development Container Configuration Files... command for Remote-Containers or Codespaces.
  Note: If needed, you can drag-and-drop the .devcontainer folder from this sub-folder in a locally cloned copy of this repository into the VS Code file explorer instead of using the command.
- Select this definition. You may also need to select Show All Definitions... for it to appear.
- Finally, press F1 and run Remote-Containers: Reopen Folder in Container to start using the definition.
At some point, you may want to make changes to your container, such as installing a new package. You'll need to rebuild your container for your changes to take effect.
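As a rough sketch of what such a change might look like, the fragment below adds a package install step to a hypothetical .devcontainer/devcontainer.json. The container name, the Dockerfile path and the installed package are assumptions made for illustration, not this repository's actual configuration; `build.dockerfile` and `postCreateCommand` are standard devcontainer.json properties.

```jsonc
{
  // Hypothetical fragment of .devcontainer/devcontainer.json (illustrative only).
  "name": "data-science-workshop",          // assumed name
  "build": { "dockerfile": "Dockerfile" },  // assumes a Dockerfile sits next to this file

  // Runs once after the container is created, so it only takes effect
  // on a fresh build or a rebuild. Assumes pip is available in the image.
  "postCreateCommand": "pip install pandas"
}
```

After saving the file, press F1 and run Remote-Containers: Rebuild Container so the change is applied.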
We create a dev container that can support data science tasks in R, Python, VS Code and RStudio.
Toggle the terminal: Ctrl + `
Navigate to the PORTS tab, then click on the 🌐 icon.
The default username and password are both rstudio (as it should be? 🤭)
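For context, an RStudio entry typically shows up in the PORTS tab because the dev container forwards the port RStudio Server listens on (8787 by default). A minimal sketch of such a configuration follows. The container name and the choice of the `rocker/rstudio` image are assumptions for illustration, and the definition shipped with this talk may be set up differently.

```jsonc
{
  // Illustrative devcontainer.json; not necessarily the definition used in this repo.
  "name": "rstudio-dev",            // hypothetical name
  "image": "rocker/rstudio",        // assumption: a Rocker image that ships RStudio Server

  // Forward RStudio Server's default port so it appears in the PORTS tab.
  "forwardPorts": [8787],

  // Optional: give the forwarded port a readable label in the PORTS tab.
  "portsAttributes": {
    "8787": { "label": "RStudio" }
  }
}
```

With a setup along these lines, clicking the 🌐 icon next to port 8787 opens the RStudio login page in the browser.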
- The Rocker Project: Docker Containers for the R Environment
- A repository of development container definitions
- Introduction to dev containers
- Saturn Cloud Webinar: Docker for Data Scientists by Jacqueline Nolis
- Valohai: Docker for Data Science: What every data scientist should know about Docker
Thank you to the following folks for providing helpful info on how to set up RStudio server on a dev container:
- David Smith: Zero-setup R workshops with GitHub Codespaces
- えいつぴ (@eitsupi): For helpful info on using RStudio in a Rocker container
- Eric Nantz (R-Podcast): For the episode "Fully containerized R dev environment with Docker, RStudio, and VS-Code"