Constrain versions of PyTorch and CI artifacts in CI Runs, upgrade to dgl 2.4 #4690
Looks good to me
(Summarizing some offline conversations, to get this into the public record here on GitHub) For the last few days (unsure how long), CI jobs here targeting
full conda solve error trace (click me)
how to reproduce this (click me)
docker run \
--rm \
--gpus 1 \
--env CI=false \
--env RAPIDS_BUILD_TYPE="pull-request" \
--env RAPIDS_REPOSITORY="rapidsai/cugraph" \
--env RAPIDS_REF_NAME=pull-request/4690 \
--env RAPIDS_SHA=922571b6db5f721a287897b3c5acc81b3fe07f69 \
-v $(pwd):/opt/work \
-w /opt/work \
--network host \
-it rapidsai/ci-conda:cuda11.8.0-rockylinux8-py3.10 \
bash
RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"
rapids-logger "Downloading artifacts from previous jobs"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)
rapids-logger "Generate Python testing dependencies"
rapids-dependency-file-generator \
--output conda \
--file-key test_python \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml
rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg
conda activate test_cugraph_pyg
CONDA_CUDA_VERSION="11.8"
PYG_URL="https://data.pyg.org/whl/torch-2.3.0+cu118.html"
rapids-mamba-retry install \
--channel "${CPP_CHANNEL}" \
--channel "${PYTHON_CHANNEL}" \
--channel pyg \
"cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"pytorch>=2.3,<2.4" \
"ogb"
This only shows up in the Lines 187 to 189 in 5fad435
The PyTorch floor here was raised to
So what can we do?
Ideally, there would be
But there are no PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc.
The options I can think of:
update on #4690 (comment)
After offline discussion with @alexbarghi-nv @jakirkham @tingyu66, we decided to replace uses of
commit: f267c77
They're built from the same sources, and
Thanks James! 🙏
AIUI this matches what we discussed
Also grepped for any remaining pyg dependency lines to fix and didn't find any
Included one informational note below, but no action needed
Approving to unblock
All of the build and test jobs are now passing, and spot-checking the logs it looks to me like they're using the correct, expected versions of dependencies 🎉 The
The most recent docs build (yesterday) did "succeed" .... but only by using 24.08 packages 😱
It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages. In my experience with
There absolutely is a
I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.
code to do that (click me)
docker run \
--rm \
--gpus 1 \
--env CI=false \
--env RAPIDS_BUILD_TYPE="pull-request" \
--env RAPIDS_REPOSITORY="rapidsai/cugraph" \
--env RAPIDS_REF_NAME=pull-request/4690 \
--env RAPIDS_SHA=f267c771707d4007c6869b4a0a79feb3e0c27700 \
-v $(pwd):/opt/work \
-w /opt/work \
--network host \
-it rapidsai/ci-conda:cuda11.8.0-ubuntu22.04-py3.10 \
bash
RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)
rapids-dependency-file-generator \
--output conda \
--file-key docs \
--matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml
rapids-mamba-retry env create --yes -f env.yaml -n docs
conda activate docs
if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
CONDA_CUDA_VERSION="11.8"
DGL_CHANNEL="dglteam/label/cu118"
else
CONDA_CUDA_VERSION="12.1"
DGL_CHANNEL="dglteam/label/cu121"
fi
rapids-mamba-retry install \
--channel "${CPP_CHANNEL}" \
--channel "${PYTHON_CHANNEL}" \
--channel conda-forge \
--channel nvidia \
--channel "${DGL_CHANNEL}" \
"libcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"pylibcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"cugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"cugraph-dgl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"cugraph-service-server=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"cugraph-service-client=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"libcugraph_etl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"pylibcugraphops=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
"pylibwholegraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
pytorch \
"cuda-version=${CONDA_CUDA_VERSION}"
python -c "import cugraph_dgl.convert"
Following that code shared above, that can be reproduced without actually invoking
python -c "import cugraph_dgl.convert"
Walking down the trace:
python -c "import dgl"
conda install -c conda-forge torchdata
python -c "import dgl"
conda install -c conda-forge pydantic
python -c "import dgl"
So what do we do?
I'm not sure. Looks like
Those seem to have not made it in until
Here in
I'm not sure how to fix this. The
https://anaconda.org/dglteam/dgl/files?version=&channel=cu118
Maybe we want the
https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118
Summarizing recent commits:
Here in the 24.10 release of
and requiring this label on the
As @alexbarghi-nv pointed out to me, something similar is being done in
For wheels, I've updated the
I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting... latest nightlies of
Thanks for the help everyone!
/merge
Thanks James! 🙏
## Summary

Follow-up to #4690.

Proposes consolidating stuff like this in CI scripts:

```shell
pip install A
pip install B
pip install C
```

Into this:

```shell
pip install A B C
```

## Benefits of these changes

Reduces the risk of creating a broken environment with incompatible packages. Unlike `conda`, `pip` does not evaluate the requirements of all installed packages when you run `pip install`.

Installing `torch` and `cugraph-dgl` at the same time, for example, gives us a chance to find out about packaging issues like *"`cugraph-dgl` and `torch` have conflicting requirements on `{other_package}`"* at CI time.

Similar change from `cudf`: rapidsai/cudf#16575

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #4701
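The consolidation described in that commit message could be sketched roughly as below. The `install_together` function and the `DRY_RUN` variable are hypothetical names invented for this illustration and are not part of the cugraph CI scripts; the package list is only an example.

```shell
# install_together: hypothetical helper (not from the cugraph CI scripts).
# It passes every requirement to a single `pip install` invocation, so pip
# sees all of the constraints at the same time instead of one at a time.
# With DRY_RUN=1 it only prints the command it would run.
install_together() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "python -m pip install $*"
    else
        python -m pip install "$@"
    fi
}

# Dry run only, so nothing is downloaded here.
DRY_RUN=1
install_together "torch>=2.3,<2.4" cugraph-dgl ogb
# prints: python -m pip install torch>=2.3,<2.4 cugraph-dgl ogb
```

Running `python -m pip check` after a real install is another way to surface conflicting requirements among the packages that ended up in the environment.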
Another step towards completing the work started in #53

Fixes #15

Contributes to rapidsai/build-planning#111

Proposes changes to get CI running on pull requests for `cugraph-pyg` and `cugraph-dgl`

## Notes for Reviewers

Workflows for nightly builds and publishing nightly packages are intentionally not included here. See #58 (comment)

Notebook tests are intentionally not added here... they'll be added in the next PR.

Pulls in changes from these other upstream PRs that had not been ported over to this repo:

* rapidsai/cugraph#4690
* rapidsai/cugraph#4393

Authors:
- James Lamb (https://github.com/jameslamb)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
- Alex Barghi (https://github.com/alexbarghi-nv)
- Bradley Dice (https://github.com/bdice)

URL: #59
We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the cugraph-gnn repository going forward, where we can pin a specific pytorch version for testing.
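One way a test script might guard against silently picking up the wrong PyTorch series is a small version check like the sketch below. The `version_in_series` helper is hypothetical (it does not come from the cugraph or cugraph-gnn repositories), and the hard-coded version string stands in for whatever the environment actually reports.

```shell
# version_in_series VERSION SERIES: hypothetical helper for illustration.
# Succeeds when VERSION equals SERIES or starts with "SERIES.",
# e.g. 2.3.1 is in the 2.3 series, while 2.4.0 and 2.30.0 are not.
version_in_series() {
    case "$1" in
        "$2"|"$2".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: in CI this value might come from something like
#   python -c "import torch; print(torch.__version__)"
installed="2.3.1"

if version_in_series "${installed}" "2.3"; then
    echo "ok: pytorch ${installed} is in the 2.3 series"
else
    echo "unexpected pytorch version: ${installed}" >&2
fi
```

A check like this only reports a mismatch after the fact; the tighter constraint in the install command is still what keeps the solver from choosing the wrong packages.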