
chore: multiple python version support with latest pyspark and hail #974

Merged
project-defiant merged 47 commits into dev on Jan 28, 2025

Conversation

@project-defiant (Contributor) commented on Jan 16, 2025

✨ Context

This PR closes #3189 and #3680, and links to opentargets/orchestration#94.

🛠 What does this PR implement

This PR allows pyspark>=3.5.0,<3.6 along with the newest hail version that supports it.
These constraints caused errors when poetry tried to resolve the dependencies, which led to further development work; the full list is described below:

  • replaced poetry with uv as the dependency management tool (~5x faster dependency resolution, without resolution issues)
  • added support for X.Y.Z-devXX releases on push to the dev branch
  • updated the way the dev dataproc cluster is created (Open Targets users only) - see details below
  • added support for multiple python versions in gentropy (py310, py311, py312; see the sketch after this list); py313 support fails due to missing BLAS when installing scipy
  • support for pyspark>=3.5.0,<3.6
  • bumped the hail version to stay in sync with the pyspark dependency
  • dropped google as a dependency and added the more granular google-cloud-storage (to be removed in a future release along with other google-dependent packages)
  • updated GitHub Actions to use uv and to test a matrix of python versions (requires changes in the repository rules)
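
A minimal sketch (not the actual CI config, and assuming a uv version with managed interpreters) of exercising the supported python matrix locally:

# Hypothetical local equivalent of the CI version matrix: `uv run --python`
# provisions/syncs an environment for the requested interpreter and runs
# the test suite under it.
for py in 3.10 3.11 3.12; do
    uv run --python "$py" pytest
done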

dev cluster setup

Previously the dev cluster was created from a package pushed to the Google Cloud Storage bucket under a namespace matching the git ref name. This has now changed: the cluster installs gentropy directly from the GitHub branch (or tag) specified in the orchestration - see the changes in opentargets/orchestration#94.
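
For illustration only (the real logic lives in install_dependencies_on_cluster.sh, and "dev" below is a placeholder ref), the branch install boils down to a VCS requirement:

# Install gentropy straight from a GitHub ref instead of a GCS-hosted package.
uv pip install --system "gentropy @ git+https://github.com/opentargets/gentropy.git@dev"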

Benefits of new solution

  • only one copy of the static files used for running gentropy steps (install_dependencies_on_cluster.sh and cli.py) is kept in GCS.
  • speeds up the development process, since one does not need to wait for the GitHub Actions to finish before setting up the cluster from airflow on the branch.
  • speeds up dependency resolution with uv pip install (pip resolution of gentropy with support for multiple python versions takes ~20 minutes, hitting the initialization actions timeout; uv does the same thing in around 1.5 minutes) - a rough way to reproduce the comparison is sketched after this list.
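
An illustrative way to reproduce the resolver comparison on a gentropy checkout (timings will vary with machine and cache state):

# Resolve dependencies without installing, with pip and uv respectively.
time python -m pip install --dry-run --quiet .
time uv pip install --dry-run .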

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@ireneisdoomed (Contributor) left a comment

Thank you!
2 questions:

  1. Should we also support pyspark versions lower than 3.5? AFAIK there is nothing in the API that is incompatible. I know we have custom code to compare dataframes, something that PySpark 3.5 supports natively.
  2. We need to update the Dataproc version too. Should we make a pipeline run to test that nothing breaks?

@project-defiant (Author) replied:

> 1. Should we also support pyspark versions lower than 3.5?
> 2. We need to update the Dataproc version too. Should we make a pipeline run to test that nothing breaks?

These are very nice points!

  1. I am afraid that poetry will not allow us to support more pyspark versions; the latest hail requires a pyspark version between 3.5 and 3.6, otherwise we could do it.
  2. Yes, I am planning to do that right after I finish the harmonisation :). Let's wait for this check before we merge.

@github-actions bot added the Step label on Jan 21, 2025
@ireneisdoomed (Contributor) left a comment

Really cool!! New PySpark, support for several Python versions (dev version set to 3.11), and migration to uv.

I suggest changing the PR title, this is clearly a lie haha

Setting up the dev environment was smooth, simply by doing make setup-dev (except that uv wasn't in my path, see minor comment below).

I haven't used uv myself, so I can't really evaluate pros and cons. My only question is to confirm that we are not missing anything from CI/CD after the migration?

Resolved review threads: utils/clean_status.sh, pyproject.toml, Dockerfile (outdated).
if ! command -v uv &>/dev/null; then
    echo "uv was not found, installing uv..."
    # Install uv via the official installer script.
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Pick up the freshly written env file so uv is on PATH in this shell.
    source "$HOME/.local/bin/env"
fi
@ireneisdoomed (Contributor) commented:

After installing the environment, uv wasn't on my path. Should we add export PATH="$HOME/.local/bin:$PATH" here?

@project-defiant (Author) replied:

This is a good point! The uv setup script looks over 3 locations ($HOME/.local/bin/{uv,uvx} is the last fallback, used when neither XDG_BIN_HOME nor XDG_DATA_HOME is defined), which is one of the standards that most Unix-based machines follow (apparently not macOS).

I will make an additional fix for this!
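
For reference, the lookup described above is approximately (a sketch of the installer's behaviour, not its exact code):

# Approximate install-dir resolution performed by the uv installer:
#   1. $XDG_BIN_HOME, if set
#   2. $XDG_DATA_HOME/../bin, if XDG_DATA_HOME is set
#   3. $HOME/.local/bin as the final fallback
install_dir="${XDG_BIN_HOME:-${XDG_DATA_HOME:+$XDG_DATA_HOME/../bin}}"
install_dir="${install_dir:-$HOME/.local/bin}"
echo "uv would be installed into: $install_dir"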

@project-defiant (Author) replied:

I looked over how uv is set up; the installation path is specified at runtime, so referencing it directly would mean recreating the lookup I described above. On the good note, uv adds itself to the PATH in the shell rc file, so ideally one just needs to source the shell rc to have it available as an executable. There should be no need to add ${HOME}/.local/bin/ to the PATH, as it might not be the correct location by default on Linux machines.

@project-defiant (Author) replied on Jan 27, 2025:

To explain in short:

  • uv sets the export in an env file that holds the paths to uv and uvx.
  • You just have to source the shell rc file (~/.zshrc in my case) after a successful installation to make it available in the terminal (this is because Make runs in a separate shell, and the parent shell does not inherit from the child shell). I have added a notification about this after the installation process. See the sketch below.
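
Concretely, after make setup-dev finishes, either of these picks up the new PATH in the current shell (assuming the default install location):

# Reload the shell rc that uv appended to (zsh here; use ~/.bashrc on bash):
source ~/.zshrc
# ...or source the env file the installer wrote directly:
source "$HOME/.local/bin/env"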


 test: ## Run tests
 	@echo "Running Tests..."
-	@poetry run pytest
+	@uv run pytest
@ireneisdoomed (Contributor) commented:

Please test this yourself, but the current command goes idle without running any tests.

Suggested change:

-	@uv run pytest
+	@uv run pytest .

@project-defiant (Author) replied on Jan 27, 2025:

The tests still pass on my side. I am not sure why the dot helps here; assuming we already add the testpath(s) to the pytest options, the dot just overrides them. While testing your solution I got the following cache-related errors:

import file mismatch:
imported module 'a_creating_spark_session' has this __file__ attribute:
  /home/mindos/Projects/OpenTargets/gentropy/docs/src_snippets/howto/python_api/a_creating_spark_session.py
which is not the same as the test file we want to collect:
  /home/mindos/Projects/OpenTargets/gentropy/site/src_snippets/howto/python_api/a_creating_spark_session.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

This seems to be due to the fact that I had previously generated the docs, which contain some duplicated tests; see the cleanup sketch below.
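
Following the HINT above, a cleanup along these lines resolves the mismatch (paths taken from the error message):

# Remove the generated docs site plus byte-code/pytest caches that cause
# the duplicate-module collection error.
rm -rf site .pytest_cache
find . -type d -name __pycache__ -exec rm -rf {} +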

On the note of test collection speed:

$ time uv run pytest --collect-only . > /dev/null
uv run pytest --collect-only . > /dev/null  11,81s user 2,56s system 115% cpu 12,390 total
$ time uv run pytest --collect-only > /dev/null
uv run pytest --collect-only > /dev/null  11,70s user 2,62s system 116% cpu 12,269 total

In the first run the site dir had been removed.

@project-defiant changed the title from "chore: bump pyspark version" to "chore: multiple python version support with latest pyspark and hail" on Jan 27, 2025
@ireneisdoomed (Contributor) left a comment

Thank you for addressing the comments!

@project-defiant merged commit 3d31edd into dev on Jan 28, 2025
7 checks passed
Linked issues that may be closed by this PR: Allow multiple python versions · Update Dataproc and PySpark in Genetics
3 participants