
Try GPU CI with cupy (DNM) #466

Closed · wants to merge 73 commits
Conversation

@jaimergp (Member) commented Jan 23, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Same as #446

Issues:

@conda-forge-webservices (Contributor)

Hi! This is the friendly automated conda-forge-webservice.

It appears you are making a pull request from a branch in your feedstock and not a fork. This procedure will generate a separate build for each push to the branch and is thus not allowed. See our documentation for more details.

Please close this pull request and remake it from a fork of this feedstock.

Have a great day!

@conda-forge-webservices (Contributor)

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@jaimergp (Member Author)

@leofang - seems to be working :D Can you take a look at the logs for 0d99872? I'll see what happens with qemu now.

@leofang (Member) commented Jan 27, 2023

Oh wow @jaimergp many thanks for the test drive! It seems to work fine! (cc: @kmaehashi, we're testing GPU CI for conda-forge!)
https://github.com/conda-forge/cf-autotick-bot-test-package-feedstock/actions/runs/3998064331/jobs/6860216209#step:3:4910

Jaime, can we turn on tests? Even running a subset of GPU tests is a great improvement.
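
For reference, a GPU smoke test in run_test.py could be as small as the sketch below; this is only an illustration of the kind of check involved (the array size and assertion are made up), not what the feedstock actually runs.

import cupy

# Make sure the runner actually exposes a CUDA device before doing any real work.
print("Visible CUDA devices:", cupy.cuda.runtime.getDeviceCount())

# A trivial device round-trip: allocate on the GPU, compute, compare on the host.
x = cupy.arange(1024, dtype=cupy.float32)
assert float(x.sum()) == 1023 * 1024 / 2
print("CuPy GPU smoke test passed")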

@kmaehashi (Member)

Great, thank you @jaimergp for testing!


@jaimergp (Member Author)

Thanks for the tips! My goal here is not so much to make everything pass, but to ensure that the UX is nice, and having the machine die mid-job is not a nice UX 😬 I'll try adding more deps and see if that passes, but in the meantime I wonder if this is a resource starvation issue. It passes on CUDA 11, but maybe CUDA 12 is heavier? How much disk/RAM are you using in your test boxes? Thanks!

@jaimergp (Member Author)

Interesting, do we have any metrics on what the machine was doing before it stopped? Was there high memory usage or something else amiss?

Sadly it's an ephemeral VM and OpenStack doesn't offer any immediate way of keeping a history around as far as I can see, but it would be interesting to have some info, so I'll see what we can do.

@kmaehashi (Member)

How much disk/RAM are you using in your test boxes?

We are running with 8 CPU cores & 20 GB of disk. As for RAM, we allocate 52 GB, but this includes RAM disk space, so I'm not sure how much is actually required for the tests themselves, unfortunately. That said, I guess the problem is elsewhere, as 99% of the tests have passed. How about adding -v to pytest, or trying a subset (e.g., cupy_tests only) and seeing what happens?
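
As a concrete sketch of that suggestion (assuming the upstream CuPy test layout is available in the test environment, which may not match what run_test.py actually does):

import sys
import pytest

# -v prints one line per test, so if the job dies we know exactly where it stopped.
ret = pytest.main(["-v", "tests/cupy_tests"])
sys.exit(ret)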

@jaimergp (Member Author)

Hm, these machines are:

CPU   RAM     Disk
4     12 GB   60 GB

@jaimergp (Member Author)

OK, if I don't run cupyx_tests it doesn't error out badly. I have a VM with Grafana running the whole thing now, so we'll see how it goes :)

We still see cufft errors though, despite the packages being in meta.yaml:

In file included from /tmp/tmpmfodjtfc/tmpnl7smisg/cupy_callback_c05c60196d1dae6877fd7811c4f34c5b.cpp:771:
/tmp/tmpmfodjtfc/tmpnl7smisg/cupy_cufft.h:11:10: fatal error: cufft.h: No such file or directory
   11 | #include <cufft.h>
      |          ^~~~~~~~~
compilation terminated.
_ Test1dCallbacks.test_fft_load[_param_2_{n=None, norm=None, shape=(10, 10)}] __
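
For the record, deselecting that directory can be done straight from pytest; a minimal sketch, again assuming the upstream CuPy test layout:

import pytest

# Run everything except the cupyx_tests tree that was causing trouble above.
pytest.main(["-v", "tests", "--ignore=tests/cupyx_tests"])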

@leofang (Member) commented Oct 19, 2023

We still see cufft errors though, despite the packages being in meta.yaml:

I see. I think this is a package layout problem specific to conda. CuPy expects the headers to be found in $CUDA_PATH/include, so in the past we've set $CUDA_PATH to $CONDA_PREFIX in the activation script. But with CUDA 12.0+ the CTK headers are only available in $CONDA_PREFIX/targets/x86_64-linux/include/, hence the missing headers.

We either have to patch CuPy or rework the conda package layout; neither is a trivial task. I suggest we note this issue and move on.
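
A rough way to see the layout difference described above on a given machine (the paths are taken from the comment; nothing here reflects how the packages were eventually fixed):

import os

prefix = os.environ["CONDA_PREFIX"]

# Where CuPy looks when $CUDA_PATH points at the conda prefix.
legacy = os.path.join(prefix, "include", "cufft.h")
# Where the CUDA 12.x conda packages actually place the CTK headers.
new_layout = os.path.join(prefix, "targets", "x86_64-linux", "include", "cufft.h")

for path in (legacy, new_layout):
    print(path, "->", "found" if os.path.exists(path) else "missing")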

@leofang (Member) commented Oct 19, 2023

Also, the cuFFT callback support in CuPy was never expected to work with conda packages, since the static libcufft & headers were not shipped in the past. NVIDIA is working on a new solution that should lift this limitation for Python libraries like CuPy, so let's not be bothered by this 🙂

@jaimergp (Member Author)

At 76% of the test suite, this is how it's looking:

tests/cupyx_tests/scipy_tests/signal_tests/test_signaltools.py ......... [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
.................................................s...................... [ 76%]

@jaimergp (Member Author)

The VM test finished correctly, but it's true that we are dangerously close to the RAM limits:

[Grafana screenshot: memory usage during the build]

Since this is not using the GitHub runner (I ran the build manually), maybe with the runner it does OOM, hence the issues we are seeing? What do you think, @aktech?

@jaimergp (Member Author)

CUDA 12 logs

@aktech commented Oct 24, 2023

Since this is not using the GitHub runner (I ran the build manually), maybe with the runner it does OOM, hence the issues we are seeing? What do you think, @aktech?

I think that does explain all the jobs that failed without an apparent reason; GitHub's message about a lack of resources (memory/CPU) was right. We also have a larger flavor available with 16 GB RAM; if we know (or can find out) that the build won't exceed that, we can give it a try as well.

@jakirkham (Member)

I don't think we know how much memory is used, so we probably need to collect more data first.

Perhaps it is worth running something like pytest-monitor to collect and analyze data about memory usage in tests. This stores results in a SQLite DB, which can be analyzed later, but we would need to persist that database as an artifact for retrieval once the job has ended.

Is there some way to store artifacts from completed CI runs in our setup here? Can we store results even if the job fails?
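
A rough sketch of that workflow, assuming pytest-monitor's default .pymon SQLite database; the table and column names below are from memory and should be checked against the plugin's documentation:

import sqlite3

# After a run such as `pytest tests/cupy_tests` with pytest-monitor installed,
# metrics land in ./.pymon by default; list the most memory-hungry tests.
con = sqlite3.connect(".pymon")
rows = con.execute(
    "SELECT ITEM, MEM_USAGE FROM TEST_METRICS "
    "ORDER BY MEM_USAGE DESC LIMIT 20"
).fetchall()
for item, mem_mb in rows:
    print(f"{mem_mb:10.1f} MB  {item}")
con.close()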

@jaimergp (Member Author)

I haven't tried adding CI artifacts yet, actually. That's a good point. After increasing the VM RAM to 16 GB it doesn't OOM. We are also investigating how we can protect the GHA runner from the OOM killer a bit, so other processes are stopped instead of that one. That should alleviate the disconnection problems we've seen. If the runner process dies, there's nothing we can do about sending CI artifacts or running other "post-mortem" diagnosis steps via GHA.
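
One possible way to bias the Linux OOM killer away from the runner process (a sketch only; negative adjustments need root, and how the runner PID is obtained here is hypothetical):

from pathlib import Path

def protect_from_oom(pid: int, score: int = -1000) -> None:
    # -1000 tells the kernel to never OOM-kill this process; 0 is the default.
    Path(f"/proc/{pid}/oom_score_adj").write_text(str(score))

# Hypothetical usage, with the GHA runner's PID looked up by other means:
# protect_from_oom(runner_pid)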

@jaimergp (Member Author)

store_build_artifacts requires 7z and/or zip, but these are not present in the VM image right now. We'll add them in the next update.

@leofang (Member) commented Oct 27, 2023

Q: I may have lost track of the chronological order of the commits & comments above: was the OOM hit only when cupyx_tests was enabled?

@jaimergp (Member Author)

Correct! There's a grafana plot a few messages above.

@leofang (Member) commented Oct 27, 2023

Not sure if there's a memory leak; what if we run the cupyx tests in a separate process?
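
A minimal sketch of that idea (paths again assume the upstream test layout): run cupyx_tests in their own interpreter, so any memory they leak is returned to the OS when that process exits.

import subprocess
import sys

# Main suite first, with the suspect directory excluded.
main = subprocess.run(
    [sys.executable, "-m", "pytest", "-v", "tests", "--ignore=tests/cupyx_tests"]
)
# Then the cupyx tests on their own, in a fresh process.
cupyx = subprocess.run([sys.executable, "-m", "pytest", "-v", "tests/cupyx_tests"])
sys.exit(main.returncode or cupyx.returncode)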

@jaimergp closed this Nov 3, 2024