Try GPU CI with cupy (DNM) #466
Conversation
MNT: Re-rendered with conda-build 3.21.7+119.g1b221ef0, conda-smithy 3.22.1.post.dev3, and conda-forge-pinning 2022.12.19.14.36.50
Co-authored-by: Amit Kumar <[email protected]>
Hi! This is the friendly automated conda-forge-webservice. It appears you are making a pull request from a branch in your feedstock and not a fork. This procedure will generate a separate build for each push to the branch and is thus not allowed. See our documentation for more details. Please close this pull request and remake it from a fork of this feedstock. Have a great day!
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR (…)
Oh wow @jaimergp many thanks for the test drive! It seems to work fine! (cc: @kmaehashi, we're testing GPU CI for conda-forge!) Jaime, can we turn on tests? Even running a subset of GPU tests is a great improvement.
Great, thank you @jaimergp for testing!
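For reference, a minimal GPU smoke test of the kind that could be enabled first might look like the sketch below; the specific checks are an illustrative assumption, not the recipe's actual test suite.

```python
# Minimal GPU smoke test (illustrative only; the chosen operations are an
# assumption, not part of this feedstock's recipe).
import cupy

# Confirm the runner actually exposes a GPU to the container/VM.
print(cupy.cuda.runtime.getDeviceCount(), "GPU(s) visible")

x = cupy.arange(10)                  # allocates on the GPU
assert int(cupy.sum(x).get()) == 45  # kernel launch + device-to-host copy
```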
Thanks for the tips! My goal here is not so much to make everything pass, but at least to ensure that the UX is nice, and having the machine die mid-job is not nice 😬 I'll try adding more deps and see if that passes, but in the meantime I wonder if this is a resource starvation issue. It passes on CUDA 11, but maybe CUDA 12 is heavier? How much disk/RAM are you using in your test boxes? Thanks!
Sadly it's an ephemeral VM and OpenStack doesn't offer any immediate way of keeping a history around as far as I can see, but it would be interesting to have some info, so I'll see what we can do.
We are running with 8 CPU cores & 20 GB of disk. As for RAM, we allocate 52 GB, but this includes RAM-disk space, so I'm not sure how much is actually required for the tests themselves, unfortunately. With that said, I guess the problem is elsewhere, as 99% of the tests have passed. How about adding …
Hm, these machines are:
Ok if I don't run …
We still see cufft errors though, despite the packages being in meta.yaml:
I see. I think this is a package layout problem specific to conda. CuPy expects that headers can be found in …. We either have to patch CuPy or rework the conda package layout; neither is a trivial task. I suggest we note this issue and move on.
Also, the cuFFT callback support in CuPy was never expected to work with conda packages, because the static libcufft & headers were not shipped in the past. NVIDIA is working on a new solution that would lift this limitation for Python libraries like CuPy, so let's not be bothered by this 🙂
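To make the mismatch concrete, a hedged diagnostic along these lines could show where CuPy thinks CUDA lives versus where the conda packages actually put the cuFFT headers; the candidate paths below are assumptions about the conda layout, not confirmed locations.

```python
# Hedged diagnostic: compare CuPy's detected CUDA setup with the on-disk
# location of the conda-provided headers. Candidate paths are assumptions.
import os
import cupy

cupy.show_config()  # prints the CUDA root and library versions CuPy detected

prefix = os.environ.get("CONDA_PREFIX", "")
for candidate in ("include/cufft.h",
                  "targets/x86_64-linux/include/cufft.h"):
    path = os.path.join(prefix, candidate)
    print(path, "->", "exists" if os.path.exists(path) else "missing")
```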
The VM test finished correctly, but it's true that we are dangerously close to the RAM limits: [Grafana plot of memory usage] Since this is not using the GitHub runner (I ran the build manually), maybe it does OOM, and hence the issues we are seeing? What do you think @aktech?
I think that does explain all the jobs that failed without apparent reason; GitHub's message about lack of Memory/CPU resources was right. We also have a larger flavor available with 16 GB RAM; if we know (or can find out) that it won't exceed that, then we can give that a try as well.
I don't think we know how much memory is used, so we probably need to collect more data first. Perhaps it is worth running something like … Is there some way to store artifacts from completed CI runs in our setup here? Can we store results even if the job fails?
I haven't tried adding CI artifacts yet, actually. That's a good point. After increasing the VM RAM to 16 GB it doesn't OOM. We are also investigating how we can protect the GHA runner from the OOM killer a bit, so other processes are stopped instead of that one. That should alleviate the disconnection problems we've seen. If the runner process dies, there's nothing we can do about sending CI artifacts or other "post-mortem" diagnosis steps via GHA.
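One way to collect that data is a small background memory sampler like the sketch below (using psutil; the output file name and sampling interval are assumptions), whose log could then be uploaded as a CI artifact.

```python
# Hedged sketch of a memory sampler to run alongside the test suite; its log
# file would become a CI artifact. Interval and output path are assumptions.
import time
import psutil

with open("memory_log.csv", "w") as log:
    log.write("timestamp,used_bytes,percent\n")
    while True:
        mem = psutil.virtual_memory()          # system-wide memory statistics
        log.write(f"{time.time():.0f},{mem.used},{mem.percent}\n")
        log.flush()                            # keep the log usable even if the job dies
        time.sleep(10)
```

In practice this would be started in the background before pytest and terminated when the tests finish, with the resulting CSV attached to the run.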
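One possible way to shield the runner is to lower its OOM score so the kernel prefers to kill the test processes instead. The sketch below assumes the self-hosted runner process is named Runner.Listener and that the job has the privileges needed to set a negative score; both are assumptions.

```python
# Hedged sketch: make the Linux OOM killer prefer other processes over the
# GHA runner. Assumes the runner process matches "Runner.Listener" and that
# we may write a negative oom_score_adj (usually requires root).
import subprocess

pids = subprocess.check_output(["pgrep", "-f", "Runner.Listener"]).split()
for pid in pids:
    with open(f"/proc/{int(pid)}/oom_score_adj", "w") as f:
        f.write("-1000")  # -1000 exempts the process from the OOM killer
```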
Q: I may have lost track of the chronological order of the commits & comments above: was OOM hit only when …?
Correct! There's a grafana plot a few messages above.
Not sure if there's any memory leak; what if we run the cupyx tests in a separate process?
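A rough way to do that split is sketched below; running each test tree in its own process means memory held by one suite is released before the next starts. The test paths are assumptions about the CuPy source layout.

```python
# Hedged sketch: run each test tree in a fresh process so its memory is
# released before the next suite starts. Paths are assumptions.
import subprocess
import sys

for suite in ("tests/cupy_tests", "tests/cupyx_tests"):
    ret = subprocess.run([sys.executable, "-m", "pytest", suite]).returncode
    if ret not in (0, 5):  # pytest exit code 5 means "no tests collected"
        sys.exit(ret)
```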
Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)

Same as #446

Issues:
- libcuda.so.1 not found on aarch64 🤔 (see the quick check sketched below)
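A quick way to confirm the libcuda.so.1 symptom is the hedged check below; libcuda.so.1 ships with the NVIDIA driver rather than with any conda package, so a machine without the driver installed will not have it.

```python
# Hedged check for the aarch64 failure: is the CUDA driver library loadable?
import ctypes

try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded")
except OSError as exc:
    print("libcuda.so.1 not loadable:", exc)
```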