
chore: multiple python version support with latest pyspark and hail #974

Merged
project-defiant merged 47 commits into dev on Jan 28, 2025

Conversation

@project-defiant (Contributor) commented on Jan 16, 2025

✨ Context

This PR closes #3189 and #3680, and links to opentargets/orchestration#94.

🛠 What does this PR implement

This PR allows pyspark>=3.5.0,<3.6 along with the newest hail version that supports it.
These constraints caused errors when poetry tried to resolve the dependencies, which led to further development work; the full list is described below:

  • replaced poetry with uv as the dependency management tool (~5x faster dependency resolution, without resolution issues)
  • added support for X.Y.Z-devXX releases on push to the dev branch
  • updated the way the dev dataproc cluster is created (Open Targets users only) - see details below
  • added support for multiple python versions in gentropy (py310, py311, py312; see the sketch after this list); py313 support fails due to missing BLAS when installing scipy
  • support for pyspark>=3.5.0,<3.6
  • bumped the hail version to stay in sync with the pyspark dependency
  • dropped google as a dependency and added the more granular google-cloud-storage (to be removed in a future release along with other google-dependent packages)
  • updated GitHub Actions to use uv and to test a matrix of python versions (requires changes in the repository rules)
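
A minimal sketch (not the actual CI config, and assuming a uv version with managed interpreters) of exercising the supported python matrix locally:

# Hypothetical local equivalent of the CI version matrix: `uv run --python`
# provisions/syncs an environment for the requested interpreter and runs
# the test suite under it.
for py in 3.10 3.11 3.12; do
    uv run --python "$py" pytest
done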

dev cluster setup

Previously the dev cluster was created from a package pushed to the Google Cloud Storage bucket under a namespace matching the git ref name. This has now changed: the cluster installs gentropy directly from the GitHub branch (or tag) specified in the orchestration - see the changes in opentargets/orchestration#94.
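
For illustration only (the real logic lives in install_dependencies_on_cluster.sh, and "dev" below is a placeholder ref), the branch install boils down to a VCS requirement:

# Install gentropy straight from a GitHub ref instead of a GCS-hosted package.
uv pip install --system "gentropy @ git+https://github.com/opentargets/gentropy.git@dev"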

Benefits of new solution

  • only one copy of the static files used for running gentropy steps (install_dependencies_on_cluster.sh and cli.py) is kept in GCS.
  • speeds up the development process, since one does not need to wait for the GitHub Actions to finish before setting up the cluster from airflow on the branch.
  • speeds up dependency resolution with uv pip install (pip resolution of gentropy with support for multiple python versions takes ~20 minutes, hitting the initialization actions timeout; uv does the same thing in around 1.5 minutes) - a rough way to reproduce the comparison is sketched after this list.
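
An illustrative way to reproduce the resolver comparison on a gentropy checkout (timings will vary with machine and cache state):

# Resolve dependencies without installing, with pip and uv respectively.
time python -m pip install --dry-run --quiet .
time uv pip install --dry-run .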

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@ireneisdoomed (Contributor) left a comment

Thank you!
2 questions:

  1. Should we also support pyspark versions lower than 3.5? AFAIK there is nothing in the API that is incompatible. I know we have custom code to compare dataframes, something that PySpark 3.5 supports natively.
  2. We need to update the Dataproc version too. Should we make a pipeline run to test that nothing breaks?

@project-defiant (Author) replied:

> 1. Should we also support pyspark versions lower than 3.5?
> 2. We need to update the Dataproc version too. Should we make a pipeline run to test that nothing breaks?

These are very nice points!

  1. I am afraid that poetry will not allow us to support more pyspark versions; the latest hail requires a pyspark version between 3.5 and 3.6, otherwise we could do it.
  2. Yes, I am planning to do that right after I finish the harmonisation :). Let's wait for this check before we merge.

@github-actions bot added the Step label on Jan 21, 2025
@ireneisdoomed (Contributor) left a comment

Really cool!! New PySpark, support for several Python versions (dev version set to 3.11), and migration to uv.

I suggest changing the PR title, this is clearly a lie haha

Setting up the dev environment was smooth, simply by doing make setup-dev (except that uv wasn't in my path, see minor comment below).

I haven't used uv myself, so I can't really evaluate pros and cons. My only question is to confirm that we are not missing anything from CI/CD after the migration?

Resolved review threads: utils/clean_status.sh, pyproject.toml, Dockerfile (outdated).
if ! command -v uv &>/dev/null; then
    echo "uv was not found, installing uv..."
    # Install uv via the official installer script.
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Pick up the freshly written env file so uv is on PATH in this shell.
    source "$HOME/.local/bin/env"
fi
@ireneisdoomed (Contributor) commented:

After installing the environment, uv wasn't on my path. Should we add export PATH="$HOME/.local/bin:$PATH" here?

@project-defiant (Author) replied:

This is a good point! The uv setup script looks over 3 locations ($HOME/.local/bin/{uv,uvx} is the last fallback, used when neither XDG_BIN_HOME nor XDG_DATA_HOME is defined), which is one of the standards that most Unix-based machines follow (apparently not macOS).

I will make an additional fix for this!
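
For reference, the lookup described above is approximately (a sketch of the installer's behaviour, not its exact code):

# Approximate install-dir resolution performed by the uv installer:
#   1. $XDG_BIN_HOME, if set
#   2. $XDG_DATA_HOME/../bin, if XDG_DATA_HOME is set
#   3. $HOME/.local/bin as the final fallback
install_dir="${XDG_BIN_HOME:-${XDG_DATA_HOME:+$XDG_DATA_HOME/../bin}}"
install_dir="${install_dir:-$HOME/.local/bin}"
echo "uv would be installed into: $install_dir"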

@project-defiant (Author) replied:

I looked over how uv is set up; the installation path is specified at runtime, so referencing it directly would mean recreating the lookup I described above. On the good note, uv adds itself to the PATH in the shell rc file, so ideally one just needs to source the shell rc to have it available as an executable. There should be no need to add ${HOME}/.local/bin/ to the PATH, as it might not be the correct location by default on Linux machines.

@project-defiant (Author) replied on Jan 27, 2025:

To explain in short:

  • uv sets the export in an env file that holds the paths to uv and uvx.
  • You just have to source the shell rc file (~/.zshrc in my case) after a successful installation to make it available in the terminal (this is because Make runs in a separate shell, and the parent shell does not inherit from the child shell). I have added a notification about this after the installation process. See the sketch below.
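
Concretely, after make setup-dev finishes, either of these picks up the new PATH in the current shell (assuming the default install location):

# Reload the shell rc that uv appended to (zsh here; use ~/.bashrc on bash):
source ~/.zshrc
# ...or source the env file the installer wrote directly:
source "$HOME/.local/bin/env"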


 test: ## Run tests
 	@echo "Running Tests..."
-	@poetry run pytest
+	@uv run pytest
@ireneisdoomed (Contributor) commented:

Please test this yourself, but the current command goes idle without running any tests.

Suggested change:

-	@uv run pytest
+	@uv run pytest .

@project-defiant (Author) replied on Jan 27, 2025:

The tests still pass on my side. I am not sure why the dot helps here; assuming we already add the testpath(s) to the pytest options, the dot just overrides them. While testing your solution I got the following cache-related errors:

import file mismatch:
imported module 'a_creating_spark_session' has this __file__ attribute:
  /home/mindos/Projects/OpenTargets/gentropy/docs/src_snippets/howto/python_api/a_creating_spark_session.py
which is not the same as the test file we want to collect:
  /home/mindos/Projects/OpenTargets/gentropy/site/src_snippets/howto/python_api/a_creating_spark_session.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

This seems to be due to the fact that I had previously generated the docs, which contain some duplicated tests; see the cleanup sketch below.
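
Following the HINT above, a cleanup along these lines resolves the mismatch (paths taken from the error message):

# Remove the generated docs site plus byte-code/pytest caches that cause
# the duplicate-module collection error.
rm -rf site .pytest_cache
find . -type d -name __pycache__ -exec rm -rf {} +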

On the note of test collection speed:

$ time uv run pytest --collect-only . > /dev/null
uv run pytest --collect-only . > /dev/null  11,81s user 2,56s system 115% cpu 12,390 total
$ time uv run pytest --collect-only > /dev/null
uv run pytest --collect-only > /dev/null  11,70s user 2,62s system 116% cpu 12,269 total

In the first run the site dir had been removed.

@project-defiant changed the title from "chore: bump pyspark version" to "chore: multiple python version support with latest pyspark and hail" on Jan 27, 2025
@ireneisdoomed (Contributor) left a comment

Thank you for addressing the comments!

@project-defiant merged commit 3d31edd into dev on Jan 28, 2025
7 checks passed
Linked issues that may be closed by this PR: Allow multiple python versions · Update Dataproc and PySpark in Genetics
3 participants