In this workshop we will take a data analysis pipeline implemented in a Jupyter notebook and convert it into a script that can be run from the command line. We will then convert this script into a Python package: a collection of code modules supporting a pre-defined set of command-line tools. Finally, we will refactor the package by applying the paradigm of object-oriented programming.
The purpose of this course is not to dissuade you from using Jupyter! Notebooks are an incredibly accessible and powerful tool for data scientists and researchers alike. However, as an experiment expands in scope and scale, the limiting features of notebooks start to become readily apparent. We will focus on the process of software design: where and when in the course of building an analysis pipeline you may want to consider investing the effort to leverage the other tools at your disposal as a Python developer.
- Notebooks
  - predicting UFO sightings, as implemented in a Jupyter notebook
  - the advantages and disadvantages of notebooks
  - when in the development of an experiment to consider moving beyond a notebook
- Scripts
  - converting a notebook into a script
  - parametrizing a script using argparse
  - modularizing a script using helper functions
- Packages
  - how packages are designed in Python
  - possible ways to structure your package
  - creating package infrastructure
  - sharing your package with the world
- Classes
  - applying object-oriented programming within a package
  - how OOP affects package structure
  - refactoring a class design to introduce hierarchical class structure
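As a taste of the Scripts topics listed above, here is a minimal sketch of a script parametrized with argparse and split into helper functions. It is not taken from the workshop code: the argument names (`first_year`, `last_year`, `--num-lags`), the helper `year_range`, and the script name `predict.py` are all hypothetical and chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Define the command-line interface (all option names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of a parametrized analysis script."
    )
    parser.add_argument("first_year", type=int, help="first year of sightings to use")
    parser.add_argument("last_year", type=int, help="last year of sightings to use")
    parser.add_argument("--num-lags", type=int, default=5,
                        help="number of past time steps fed to the model")
    return parser.parse_args(argv)


def year_range(first_year, last_year):
    """Helper function: validate the requested years and expand them into a list."""
    if first_year > last_year:
        raise ValueError("first_year must not be later than last_year")
    return list(range(first_year, last_year + 1))


def main(argv=None):
    """Tie the pieces together; argv=None makes argparse read sys.argv[1:]."""
    args = parse_args(argv)
    return {"years": year_range(args.first_year, args.last_year),
            "num_lags": args.num_lags}


# Passing the arguments explicitly mimics running
# `python predict.py 1990 1992 --num-lags 3` (the script name is hypothetical).
settings = main(["1990", "1992", "--num-lags", "3"])
print(settings)
```

In a real script the final call would usually sit under an `if __name__ == "__main__":` guard and call `main()` with no arguments, so that the values come from the command line. Keeping parsing, validation, and the main routine in separate functions is what later makes it straightforward to move them into package modules.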
These materials are designed for users with at least some knowledge of Python, and particularly with experience using Jupyter notebooks to build data analysis experiments. You may also want to refresh your acquaintance with Python packages such as requests, pandas, matplotlib, and scikit-learn before starting this workshop.
To run the code included in this workshop, you'll need access to a command-line environment with a conda installation. In this environment, choose a place to check out the course repository:
git clone [email protected]:michal-g/Notebooks-to-Packages.git
In the newly created folder Notebooks-to-Packages you'll find the workshop materials, including the code. Create the environment needed to run the code and activate it using:
conda create --name notebooks-packages -c conda-forge python=3.9 pandas plotly jupyter imageio matplotlib \
'scikit-learn<1.1' nbconvert nbformat
conda activate notebooks-packages
pip install kaleido skits
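Once the environment is active, one optional way to confirm that the installs succeeded is to check that each package can be found by Python. The snippet below is a sketch rather than part of the workshop materials; the helper name `missing_modules` is hypothetical, and the list of import names is an assumption based on the packages installed above (scikit-learn, for instance, is imported as `sklearn`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed for the packages installed above
# (scikit-learn is imported as `sklearn`).
WORKSHOP_MODULES = ["pandas", "plotly", "matplotlib", "sklearn", "imageio",
                    "nbconvert", "nbformat", "kaleido", "skits"]

missing = missing_modules(WORKSHOP_MODULES)
print("missing modules:", missing if missing else "none")
```

If any names are reported as missing, re-running the conda and pip commands inside the activated environment is a reasonable first step.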
v1
presented as part of Princeton Wintersession 2023
The dataset nuforc_events_complete.csv was downloaded from Link Wentz' repo on January 12, 2024.