chore: multiple python version support with latest pyspark and hail #974
Conversation
Force-pushed from 371441b to 630c0c9.
Thank you!
Two questions:
- Should we also support pyspark versions lower than 3.5? AFAIK, there is nothing in the API that is not compatible. I know we have custom code to compare dataframes, something that PySpark 3.5 now supports natively (see the sketch after this list).
- We need to update the Dataproc version too. Should we make a pipeline run to test that nothing breaks?
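A minimal sketch of the built-in comparison referred to above (assuming `pyspark>=3.5`; the example DataFrames are illustrative):

```python
# Minimal sketch: PySpark 3.5+ ships assertDataFrameEqual, which could
# replace custom dataframe-comparison code (example data is illustrative).
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.master("local[1]").getOrCreate()

expected = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
actual = spark.createDataFrame([(2, "b"), (1, "a")], ["id", "label"])

# Raises an AssertionError with a readable diff on mismatch;
# row order is ignored by default (checkRowOrder=False).
assertDataFrameEqual(actual, expected)
```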
These are very nice points!
Force-pushed from 887b5f7 to 98d464d.
Really cool!! New PySpark, support for several Python versions (dev version set to 3.11), and migration to uv.
I suggest changing the PR title; "chore" is clearly a lie haha.
Setting up the dev environment was smooth, simply by running `make setup-dev` (except that uv wasn't in my PATH, see the minor comment below). I haven't used uv myself, so I can't really evaluate pros and cons. My only question is to confirm that we are not missing anything from CI/CD after the migration?
utils/install_dependencies.sh (outdated)
```sh
# Install uv if it is not already available on the PATH.
if ! command -v uv &>/dev/null; then
    echo "uv was not found, installing uv..."
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Pick up the uv/uvx paths exported by the installer's env file.
    source "$HOME/.local/bin/env"
fi
```
After installing the environment, uv wasn't on my PATH. Should we add `export PATH="$HOME/.local/bin:$PATH"` here?
This is a good point! The uv setup script looks over 3 locations; `$HOME/.local/bin/{uv,uvx}` is the last fallback, used when neither `XDG_BIN_HOME` nor `XDG_DATA_HOME` is defined. This follows one of the standards that most Unix-based machines adhere to (apparently not macOS).
I will make an additional fix for this!
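For reference, a hedged reconstruction of that lookup order (not the installer's actual code; the `XDG_DATA_HOME`-derived path is an assumption):

```sh
# Reconstruction of the uv installer's target-directory lookup as described
# above; the XDG_DATA_HOME-derived path is an assumption.
if [ -n "${XDG_BIN_HOME:-}" ]; then
    install_dir="$XDG_BIN_HOME"
elif [ -n "${XDG_DATA_HOME:-}" ]; then
    install_dir="$XDG_DATA_HOME/../bin"
else
    install_dir="$HOME/.local/bin"   # last fallback, as noted above
fi
echo "uv and uvx would land in: $install_dir"
```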
I looked over how uv is set up; the installation path is decided at runtime, so referencing it means recreating the lookup I described above. On the good side, uv adds itself to the PATH in the shell rc file, so ideally one just needs to source the shell rc to have it available as an executable. There should be no need to add `${HOME}/.local/bin/` to the PATH, as it might not be the correct location by default on Linux machines.
To explain in short:
- uv sets the export in the env file that holds the paths to the `uv` and `uvx` executables.
- You just have to source the shell rc file (`~/.zshrc` in my example) after the successful installation to make uv available in the terminal (this is because Make runs in a separate shell, and the parent shell does not inherit from the child shell). I have added a notification about this after the installation process; see the example below.
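A minimal example of that step (assuming zsh; the env file path matches the one sourced in `utils/install_dependencies.sh`):

```sh
# Make uv visible in the current terminal after `make setup-dev`:
source ~/.zshrc                  # or ~/.bashrc, depending on your shell
# equivalently, source the env file the installer writes:
source "$HOME/.local/bin/env"
```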
```diff
 test: ## Run tests
 	@echo "Running Tests..."
-	@poetry run pytest
+	@uv run pytest
```
Please test this yourself, but the current command goes idle without running any tests. Suggested change:

```diff
-	@uv run pytest
+	@uv run pytest .
```
The tests still run OK on my side, although collection does take a long lookup. I am not sure why the dot helps here; assuming we already add the `testpaths` to the pytest options, the dot just overrides them. While testing your solution I got the following cache-related errors:

```
import file mismatch:
imported module 'a_creating_spark_session' has this __file__ attribute:
  /home/mindos/Projects/OpenTargets/gentropy/docs/src_snippets/howto/python_api/a_creating_spark_session.py
which is not the same as the test file we want to collect:
  /home/mindos/Projects/OpenTargets/gentropy/site/src_snippets/howto/python_api/a_creating_spark_session.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
```

This seems to be because I had previously generated the docs, which contain some duplicated test modules.
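One possible cleanup following the pytest HINT above (assuming it is run from the repository root and that `site` only holds generated docs):

```sh
# Remove the generated docs directory and stale bytecode that confuse
# pytest collection.
rm -rf site
find . -type d -name '__pycache__' -exec rm -rf {} +
find . -type f -name '*.pyc' -delete
```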
On the note of test collection speeds:

```
(base) mindos@mindos ~/Projects/OpenTargets/gentropy pyspark-bump ± time uv run pytest --collect-only . 1> /dev/null
uv run pytest --collect-only . > /dev/null  11,81s user 2,56s system 115% cpu 12,390 total
(base) mindos@mindos ~/Projects/OpenTargets/gentropy pyspark-bump ± time uv run pytest --collect-only 1> /dev/null
uv run pytest --collect-only > /dev/null  11,70s user 2,62s system 116% cpu 12,269 total
```

(in the first run the `site` dir was removed)
Thank you for addressing the comments!
✨ Context
This PR closes #3189, closes #3680, and links to opentargets/orchestration#94.
🛠 What does this PR implement
This PR focuses on allowing `pyspark>=3.5.0,<3.6` along with the newest hail version that supports it.
The changes above caused errors when poetry tried to resolve the dependencies. This led to more developments; the full list is described below (a hedged dependency sketch follows the list):
- `X.Y.Z-devXX` releases on push to the `dev` branch
- Replaced the umbrella `google` dependency with the more granular `google-cloud-storage` (to be removed in a future release along with the other google-dependent packages)
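A hypothetical `pyproject.toml` excerpt for the changes above; only the pyspark range is stated in this PR, the other pins are illustrative:

```toml
[project]
name = "gentropy"
requires-python = ">=3.10,<3.13"  # illustrative multi-Python range
dependencies = [
    "pyspark>=3.5.0,<3.6",        # range stated in this PR
    "hail",                       # newest hail that supports pyspark 3.5
    "google-cloud-storage",       # granular replacement for `google`
]
```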
dev cluster setup
Previously the dev cluster was created from the package pushed to the Google Cloud Storage bucket under a namespace matching the git ref name. This has now changed: the cluster installs gentropy directly from the GitHub branch (or tag) specified in the orchestration; see the changes in opentargets/orchestration#94.
Benefits of the new solution
- No need to store the package (and `cli.py`) in the GCS.
- Installation with `uv pip install` (pip resolution of gentropy with support for multiple Python versions takes ~20 minutes, leading to initialization-action timeouts; uv does the same thing in around 1.5 minutes). A hypothetical sketch of this step follows the list.
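A hypothetical sketch of the new cluster install step (branch name is illustrative; see opentargets/orchestration#94 for the real wiring):

```sh
# Install gentropy straight from a GitHub branch instead of a GCS-hosted
# package; `dev` is an illustrative branch name.
uv pip install --system "gentropy @ git+https://github.com/opentargets/gentropy.git@dev"
```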
🙈 Missing
🚦 Before submitting
- Did you open the PR against the `dev` branch?
- Did you make sure tests pass (`make test`)?
- Did you run the pre-commit checks (`poetry run pre-commit run --all-files`)?