Sous chef is the Python monorepo for the data team.
This guide will help you set up your machine with the necessary tools to start developing with sous-chef. It will go through the following steps:
- Python and pyenv installation
- Poetry installation
- Git installation
- sous-chef cloning
- sous-chef CLI installation
Start by making sure your GitHub account is set up correctly.
- Configure two-factor authentication in your GitHub profile settings.
- If your GitHub account isn't set up with your cheffelo.com email address, add it in your profile settings.
You need a code editor to work on the code. We recommend using either VSCode or Cursor (which is an AI powered code editor built on VSCode).
Cursor
The benefit of Cursor is that it understands our code base well, offers great autocomplete, and lets you ask questions about the code base.
- Go to https://www.cursor.com and install the version for your machine.
- Follow the on-screen prompts to complete the installation
- Keyboard: Default
- Language: Up to you
- Codebase-wide: Enable
- Add terminal command: Install cursor
- If you are currently using VSCode, it will ask you if you want to import your VSCode settings, click yes if you want Cursor to act like your current VSCode setup
- Autocomplete preference: Continue with default
- Data preferences: Privacy mode
- Sign up and login using GitHub
- We will now install a few extensions to help with development. Open up a terminal in Cursor and run the following:
code --install-extension innoverio.vscode-dbt-power-user && \
code --install-extension databricks.databricks && \
code --install-extension analysis-services.TMDL && \
code --install-extension GerhardBrueckl.powerbi-vscode && \
code --install-extension jianfajun.dax-language && \
code --install-extension ms-python.python && \
code --install-extension samuelcolvin.jinjahtml && \
code --install-extension redhat.vscode-yaml && \
code --install-extension sdras.night-owl
- We installed some theme options. To change the theme, press Ctrl (or Cmd) + Shift + P, select Preferences: Color Theme, and choose your theme.
VSCode
- Go to https://code.visualstudio.com/download and install the version for your machine.
- We will now install a few extensions to help with development. Open up a terminal in VSCode and run the following:
code --install-extension innoverio.vscode-dbt-power-user && \
code --install-extension databricks.databricks && \
code --install-extension analysis-services.TMDL && \
code --install-extension GerhardBrueckl.powerbi-vscode && \
code --install-extension jianfajun.dax-language && \
code --install-extension ms-python.python && \
code --install-extension samuelcolvin.jinjahtml && \
code --install-extension redhat.vscode-yaml && \
code --install-extension sdras.night-owl
- We installed some theme options. To change the theme, press Ctrl (or Cmd) + Shift + P, select Preferences: Color Theme, and choose your theme.
A lot of the steps from here onwards will require copying commands into your terminal, so we recommend opening up your code editor on one half of your screen, and this readme on the other half.
Python is the backbone of sous-chef, so we need to install it first. Because each project can use a different Python version, we will also install pyenv to manage Python versions on your local machine.
Windows
- Open your code editor and the built-in terminal with PowerShell.
- Install pyenv-win.
Invoke-WebRequest -UseBasicParsing -Uri "https://raw.githubusercontent.com/pyenv-win/pyenv-win/master/pyenv-win/install-pyenv-win.ps1" -OutFile "./install-pyenv-win.ps1"; &"./install-pyenv-win.ps1"
If you get a "Running scripts is disabled on this system" error, you can enable script execution by running the following:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned
- Add pyenv to your path
Add PYENV, PYENV_HOME and PYENV_ROOT to your environment variables:
[System.Environment]::SetEnvironmentVariable('PYENV',$env:USERPROFILE + "\.pyenv\pyenv-win\","User")
[System.Environment]::SetEnvironmentVariable('PYENV_ROOT',$env:USERPROFILE + "\.pyenv\pyenv-win\","User")
[System.Environment]::SetEnvironmentVariable('PYENV_HOME',$env:USERPROFILE + "\.pyenv\pyenv-win\","User")
Now add the following paths to your user PATH variable in order to access the pyenv command:
[System.Environment]::SetEnvironmentVariable('path', $env:USERPROFILE + "\.pyenv\pyenv-win\bin;" + $env:USERPROFILE + "\.pyenv\pyenv-win\shims;" + [System.Environment]::GetEnvironmentVariable('path', "User"),"User")
If for some reason you cannot execute the PowerShell commands, type "environment variables for your account" in the Windows search bar and open the Environment Variables dialog.
You will need to create these 3 new variables in the System Variables section (bottom half). Let's assume the username is my_pc.
| Variable | Value |
|---|---|
| PYENV | C:\Users\my_pc\.pyenv\pyenv-win\ |
| PYENV_HOME | C:\Users\my_pc\.pyenv\pyenv-win\ |
| PYENV_ROOT | C:\Users\my_pc\.pyenv\pyenv-win\ |
Then add two more lines to the user variable Path.
C:\Users\my_pc\.pyenv\pyenv-win\bin
C:\Users\my_pc\.pyenv\pyenv-win\shims
- Close and reopen your code editor
- Check if the installation was successful.
pyenv --version
- Check a list of Python versions supported by pyenv-win
pyenv install -l
- Install python 3.11
pyenv install 3.11.5
- Install python 3.10
pyenv install 3.10.11
- Set a Python version as the global version
pyenv global 3.11.5
- Check which Python version you are using and its path
pyenv version
Output: <version> (set by \path\to\.pyenv\pyenv-win\.python-version)
- Check that Python is working
python -c "import sys; print(sys.executable)"
Output: \path\to\.pyenv\pyenv-win\versions\<version>\python.exe
- Install pip so that we can install packages later
python -m ensurepip --upgrade
- Check that pip is working
pip --version
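The PATH changes made earlier in this section can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of pyenv-win: the helper name `missing_pyenv_entries` is hypothetical, and it assumes the default install location under your user profile used above.

```python
from pathlib import PureWindowsPath


def missing_pyenv_entries(path_value: str, user_profile: str) -> list[str]:
    """Return the pyenv-win directories that are absent from a PATH string.

    Assumes the default install location <user profile>\\.pyenv\\pyenv-win.
    """
    root = PureWindowsPath(user_profile) / ".pyenv" / "pyenv-win"
    expected = [root / "bin", root / "shims"]
    # Windows separates PATH entries with ';'. PureWindowsPath compares
    # case-insensitively and ignores trailing backslashes.
    present = [PureWindowsPath(entry) for entry in path_value.split(";") if entry]
    return [str(folder) for folder in expected if folder not in present]
```

On your machine you could call it as `missing_pyenv_entries(os.environ["PATH"], os.environ["USERPROFILE"])` after importing `os`. An empty list means both folders are on your PATH.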
macOS
We will install pyenv and Python using Homebrew.
- Open your code editor and the built-in terminal.
- Install Homebrew if you haven't already
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Use Homebrew to install pyenv
brew update
brew install pyenv
- Check that pyenv is installed correctly by running
pyenv --version
It should return something like pyenv 2.X.X
- Add the following to your .zshrc file. This will enable pyenv in your terminal
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
Then close your terminal and open a new one.
- Install Xcode command line tools
xcode-select --install
- Install pyenv dependencies
brew install openssl readline sqlite3 xz zlib tcl-tk
- Check that pyenv is in your path
which pyenv
- Check that pyenv's shims directory is in your path
echo $PATH | grep --color=auto "$(pyenv root)/shims"
- Install Python 3.11 using pyenv
pyenv install 3.11.5
- Install Python 3.10 using pyenv
pyenv install 3.10.15
- Set the global Python version to 3.11
pyenv global 3.11.5
- Check that pyenv has versions available
pyenv versions
To check that you have set up Python correctly, run the following command:
python --version
It should return something like Python 3.11.X
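If you prefer to check the interpreter from inside Python, the sketch below reports which interpreter is running and whether its minor version is one of the two installed in this guide. The names `SUPPORTED`, `is_supported` and `describe_interpreter` are hypothetical helpers, and the set of versions is an assumption based on the install steps above.

```python
import sys

# Assumption: the minor versions installed in this guide are the ones projects use.
SUPPORTED = {(3, 10), (3, 11)}


def is_supported(version_info=sys.version_info) -> bool:
    """True when the (major, minor) version is one installed by this guide."""
    return tuple(version_info[:2]) in SUPPORTED


def describe_interpreter() -> str:
    """Summarise the running interpreter and where it lives on disk."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"Python {version} at {sys.executable}"
```

When pyenv is active, the path reported by `describe_interpreter()` should point into your pyenv directory rather than a system-wide Python.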
Poetry is a tool for dependency management in Python projects. It helps manage project dependencies, virtual environments, and package publishing.
Windows
- Install Poetry using the official installer in PowerShell
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
- Add Poetry to your PATH by running
[System.Environment]::SetEnvironmentVariable(
"path",
[System.Environment]::GetEnvironmentVariable("path", "User") + ";%APPDATA%\Python\Scripts",
"User"
)
Restart your code editor
- Check that you have set up Poetry correctly
poetry --version
It should return something like Poetry version 2.X.X
If you get an access denied error, you may need to add %APPDATA%\Python\Scripts to the exclusions:
- Go to Settings > Security > Virus & threat protection
- Under Virus & threat protection settings select Manage settings
- Under Exclusions select Add or remove exclusions
- Select Add an exclusion
- Choose Folder
- Locate and add your Python/Scripts folder (e.g. C:\Users\<your-username>\AppData\Local\Programs\Python\Scripts)
- Set Poetry to create the virtualenv inside each project directory
poetry config virtualenvs.in-project true
Note: running poetry self update on Windows may be problematic. If so, re-install Poetry by running the following:
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
macOS
- Install Poetry using the official installer
curl -sSL https://install.python-poetry.org | python3 -
- Add poetry to your path
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
- Check that you have set up Poetry correctly
poetry --version
It should return something like Poetry version 2.X.X
- Set poetry to create project specific virtualenvs and install the poetry-plugin-shell plugin
poetry config virtualenvs.in-project true
poetry self add poetry-plugin-shell
Git is a version control system that allows you to track changes to your code. It is essential for managing and collaborating on projects.
Windows
- Visit https://git-scm.com/download/win
- Download and run the Git installer for Windows.
- There will be a lot of prompts to choose between. Choose the default options apart from the following:
- Default branch name: main
- Default editor: your code editor
- Reopen your code editor
- Open command prompt and check that Git is installed correctly by running:
git --version
It should return something like git version 2.X.X
- Install GitHub CLI
winget install --id github.cli
Accept the source agreements when prompted
macOS
- Install Git
brew install git
- Install GitHub CLI
brew install gh
Once git is installed, we should provide git with our full name and email address.
- Start by setting your full name.
git config --global user.name "Your Full Name"
- Set the email address, this should be the email address you use to access GitHub.
git config --global user.email "Your GitHub Email"
- Set the pull strategy to only fast-forward.
We recommend requiring git to only fast-forward when pulling, instead of trying (and potentially failing) to rebase. This will stop git from putting itself in a bad state.
git config --global pull.ff only
- Check that you have set up git correctly:
git --version
It should return something like git version 2.X.X
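To confirm that the three settings above were stored, you can read them back. The sketch below is one possible check, not an official tool: `read_git_config` shells out to `git config --global --get`, and `config_problems` (a hypothetical helper) compares the values against what this guide recommends.

```python
import subprocess


def read_git_config(key: str) -> str:
    """Return the global git config value for key, or '' when it is unset."""
    result = subprocess.run(
        ["git", "config", "--global", "--get", key],
        capture_output=True,
        text=True,
        check=False,  # git exits with a non-zero code when the key is unset
    )
    return result.stdout.strip()


def config_problems(config: dict[str, str]) -> list[str]:
    """List the settings that differ from the recommendations in this guide."""
    problems = []
    if not config.get("user.name"):
        problems.append("user.name is not set")
    if "@" not in config.get("user.email", ""):
        problems.append("user.email is not set to an email address")
    if config.get("pull.ff") != "only":
        problems.append("pull.ff is not set to 'only'")
    return problems
```

A typical call would be `config_problems({key: read_git_config(key) for key in ("user.name", "user.email", "pull.ff")})`, which returns an empty list when everything is configured as described.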
Before cloning the repository, you need to authenticate your local machine with GitHub. This step ensures that you have the necessary permissions to access the repository.
- Authenticate with GitHub
gh auth login
- Follow the prompts to complete the authentication process using the following answers
- What account do you want to log into? - GitHub.com
- What is your preferred protocol for Git operations? - HTTPS
- Authenticate Git with your GitHub credentials? - Yes
- How would you like to authenticate GitHub CLI? - Login with a web browser
Now let's get the sous-chef repo cloned to your local machine.
We recommend creating a directory named cheffelo within your home directory and placing all of Cheffelo's source code repositories in there.
- Navigate to your home/user directory
cd %USERPROFILE% (Windows)
cd ~ (macOS)
- Create a cheffelo directory
mkdir cheffelo
- CD into the cheffelo directory
cd cheffelo
- Clone the sous-chef repo
git clone https://github.com/cheffelo/sous-chef sous-chef
- CD into the sous-chef directory
cd sous-chef
On Windows, Windows Defender blocks access to the sous-chef folder by default. To fix this, add the folder to the exclusions:
- Go to Settings > Security > Virus & threat protection
- Under Virus & threat protection settings select Manage settings
- Under Exclusions select Add or remove exclusions
- Select Add an exclusion
- Choose Folder
- Add the sous-chef folder
We use pre-commit hooks to ensure code quality and consistency across the monorepo.
Windows
- Install pre-commit
pip install pre-commit
If pip is not installed, then go to the Install Python section to install this
- Install the pre-commit hooks
cd cheffelo/sous-chef
pre-commit install
- Update the pre-commit hooks
pre-commit autoupdate
If you get a Permission denied error when trying to run pre-commit, then it's probably because it's being blocked by Windows Defender. To fix this, do the following:
- Go to Settings > Security > Virus & threat protection
- Under Virus & threat protection settings select Manage settings
- Under Exclusions select Add or remove exclusions
- Select Add an exclusion
- Choose Folder
- Add the folder outlined in the error, typically:
Users\your.name\.cache\pre-commit
macOS
- Install pre-commit
brew install pre-commit
- Install the pre-commit hooks
cd cheffelo/sous-chef
pre-commit install
- Update the pre-commit hooks
pre-commit autoupdate
Now that the sous-chef repo is cloned, run the following commands to install the dependencies and activate the chef cli.
cd cheffelo/sous-chef
poetry shell
poetry install
This will spin up a new Python virtualenv and activate it in a new shell.
It will also install the core utils (chef) to manage the monorepo.
If this was successful, then you're fully set up in sous-chef and can start working on projects.
To check that the chef cli is working, run the following command:
chef --help
It should return a list of commands that you can use.
We use pyenv to manage different Python versions, and Poetry to create virtualenvs with the correct dependencies for each project.
Let's test creating a new Python environment for a project.
Windows
- cd into a project
cd cheffelo/sous-chef/projects/data-model
- Create a new virtualenv
poetry shell
- Install the project dependencies
poetry install
- Check that you are in a virtualenv
python -c "import sys; print(sys.executable)"
Should output something like:
C:\Users\<your-username>\cheffelo\sous-chef\projects\data-model\.venv\Scripts\python.exe
macOS
- cd into a project
cd cheffelo/sous-chef/projects/data-model
- Create a new virtualenv
poetry shell
- Install the project dependencies
poetry install
- Activate the virtualenv
source .venv/bin/activate
- Check that you are in a virtualenv
which python
Should output something like:
/Users/<your-username>/cheffelo/sous-chef/projects/data-model/.venv/bin/python
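The same check can be made from inside Python on either operating system. This is a minimal sketch under the assumption that Poetry is configured with virtualenvs.in-project true, so the environment lives in a .venv folder inside the project. The helper names are hypothetical.

```python
import sys
from pathlib import PurePosixPath, PureWindowsPath


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """Inside a virtualenv, sys.prefix points at the venv and differs from sys.base_prefix."""
    return prefix != base_prefix


def expected_interpreter(project_dir: str, windows: bool) -> str:
    """Interpreter path for an in-project .venv (layout differs per OS)."""
    if windows:
        return str(PureWindowsPath(project_dir) / ".venv" / "Scripts" / "python.exe")
    return str(PurePosixPath(project_dir) / ".venv" / "bin" / "python")
```

Running `in_virtualenv()` with no arguments inside the Poetry shell should return True, and `sys.executable` should match the path returned by `expected_interpreter` for your project directory.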
If you want to run local code in Databricks, you need to first connect to Databricks using the Databricks extension in your code editor.
- Click on the Databricks logo in the left-hand side of your code editor.
- Click on "Migrate existing project to Databricks"
Now that we have sous-chef set up, we can start creating new projects and packages using the chef cli.
To create a new project, run the following command and provide a name for your project when prompted:
chef create project
To create a new package:
chef create package
For more information about the chef cli, view packages/chef/README.md
Library name is a human-readable name. E.g: Analytics API
Project name is a name without spaces or uppercase letters, used for workflows and folders. E.g: analytics-api
Module name is the Python package name, which uses underscores. E.g: analytics_api
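The relationship between the three names can be sketched in a few lines of Python. This only illustrates the convention described above and is not the chef cli's own implementation. The function names are hypothetical.

```python
import re


def project_name(library_name: str) -> str:
    """Lowercase the library name and join its words with hyphens."""
    # Any run of characters other than a-z and 0-9 becomes a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", library_name.lower()).strip("-")


def module_name(project: str) -> str:
    """Python package names cannot contain hyphens, so use underscores."""
    return project.replace("-", "_")


print(project_name("Analytics API"))               # analytics-api
print(module_name(project_name("Analytics API")))  # analytics_api
```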
In this section, you will find a few use cases to describe how to develop different projects.
In this section we will showcase how to set up a simple ML application.
Make sure you have access to the chef cli.
Note
If you do not have access to the chef cli, try running the following:
poetry shell
poetry install
With the chef cli activated, run the following to create a new project:
chef create project
This will prompt you with several questions. However, it is mainly the Project Name that needs to be filled in. For the remaining prompts, simply press Enter, unless you want to customise the project further.
The command will add a basic project structure needed for Python development, and a few files for basic development locally, and on Databricks.
We are now ready to start adding our custom code.
Open up a new shell / terminal, and move into the project directory. This can be done by running:
cd projects/<your-project-name>
This project will have its own Python environment, which prevents conflicting Python packages across projects and also enables us to use different Python versions per project.
As a result, we need to create a new virtual environment and install the project packages again. Therefore, run the following:
poetry shell
poetry install
We can now add external and internal packages with:
chef add data-contracts # Internal package at `packages/data-contracts`
chef add streamlit # External UI package
To showcase a simple example, add the following streamlit app to app.py. This will search for recipe embeddings in a vector database.
import asyncio

import streamlit as st
from project_owners.owner import Owner
from data_contracts.recommendations.recipe import RecipeFeatures
from data_contracts.recommendations.store import recommendation_feature_contracts
from aligned import feature_view, String, Bool, FileSource, model_contract
from aligned.exposed_model.ollama import ollama_embedding_contract
from aligned.sources.lancedb import LanceDBConfig

# Local vector database that stores the recipe embeddings
vector_db = LanceDBConfig(path="./vector_db")
recipe = RecipeFeatures()

# Model contract that embeds each recipe name and writes it to a vector index
RecipeEmbedding = ollama_embedding_contract(
    input=recipe.recipe_name,
    entities=recipe.recipe_id,
    model="nomic-embed-text",
    endpoint="http://our-embedding-service:11434",
    contract_name="recipe_embedding",
    contacts=[Owner.matsmoll().markdown()],
    output_source=vector_db.table("recipe_embeddings").as_vector_index("recipes")
)


async def main():
    recipe_to_search = "Laks med soya og ris"
    st.write(f"Searching for '{recipe_to_search}'")

    store = recommendation_feature_contracts()
    store.add_model(RecipeEmbedding)

    # Embed all recipes and store them in the vector index
    with st.spinner("Creating Embeddings"):
        await store.model("recipe_embedding").predict_over(
            RecipeFeatures.query().all()
        ).insert_into_output_source()

    # Look up the five recipes closest to the search text
    similar_recipes = await (
        store.vector_index("recipes")
        .nearest_n_to(
            {"recipe_name": [recipe_to_search]},
            number_of_records=5,
        )
        .to_pandas()
    )

    st.title("Similar recipes")
    st.write(similar_recipes)


if __name__ == "__main__":
    asyncio.run(main())
You can run the project locally through Docker with a small application. Ensure that the startup command is added to the docker-compose.yaml file first.
services:
  app:
    platform: linux/amd64
    build:
      context: ../../
      dockerfile: projects/<project-name>/docker/Dockerfile
    volumes:
      - ./:/opt/projects/<project-name>/
      - ./../../packages:/opt/packages
    command: "python -m streamlit run app.py --server.fileWatcherType poll"
    depends_on:
      - base
    ports:
      - 8500:8501
    env_file:
      - ../../.env
      - .env
  ...
Now start up the application with:
chef up app
This will build the project, install everything that is needed, and start up the server at http://127.0.0.1:8500.
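The URL uses port 8500 because of the ports entry in the compose file: the short form is HOST:CONTAINER, so 8500 on your machine is forwarded to 8501 inside the container, which is Streamlit's default port. The sketch below shows how the local URL follows from that entry. The helper names are hypothetical, and it only handles the simple two-part form used here.

```python
def parse_port_mapping(mapping: str) -> tuple[int, int]:
    """Split a compose 'HOST:CONTAINER' port entry into (host_port, container_port)."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


def local_url(mapping: str, host: str = "127.0.0.1") -> str:
    """Build the URL to open in your browser for a given port entry."""
    host_port, _container_port = parse_port_mapping(mapping)
    return f"http://{host}:{host_port}"


print(local_url("8500:8501"))  # http://127.0.0.1:8500
```

If port 8500 is already taken on your machine, changing only the left-hand side of the mapping moves the app to another local port without touching the container.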
Data science applications are a subtype of a Python project, meaning you can use everything described in the Python Project use case. However, to manage the unpredictability of data and ML, the following may also be needed:
- Experiment tracking
- Model versioning - through a model registry
- Feature store - to load offline point-in-time data, and low latency online data.
- Big Brain compute - aka. extra RAM / disk
- Out of memory compute - through Spark / distributed processing
- Job orchestration
- Model serving endpoint
- Monitor and validate data - either data drift or semantic expectations
- Evaluate model online performance
- Explain model outputs
For all of this, we default to Databricks' components: MLflow, Spark, Databricks' feature-engineering package, and Databricks Asset Bundles.
However, we still use Docker to control the dependencies through the docker/Dockerfile.databricks file. See the databricks-env README for more details.
