Python bindings to cuFileDriverOpen() and cuFileDriverClose() #514

Open: wants to merge 15 commits into base branch-24.12

madsbk
Member

@madsbk madsbk commented Oct 24, 2024

Changes:

  • Adding Python bindings to cuFileDriverOpen() and cuFileDriverClose().
  • We now only open the cuFile driver explicitly in CUDA versions older than v12.2.
  • Introducing kvikio.cufile_driver.initialize(), which opens the cuFile driver and closes it again at module exit.
  • Let CI fail if KvikIO wasn't built with cuFile support.
    • Except on cuda11.8+arm64; cuFile didn't support arm until cuda v12.4.
  • Some refactoring and cleanup.
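
A minimal sketch of the pattern the list above describes for initialize(): open the driver, then register the matching close to run at interpreter exit. FakeDriver and the register parameter are hypothetical stand-ins, added so the sketch runs without a GPU or KvikIO; the body of the real function may differ.

```python
import atexit


class FakeDriver:
    """Hypothetical stand-in for the cuFile driver bindings; it only records calls."""

    def __init__(self):
        self.calls = []

    def driver_open(self):
        self.calls.append("open")

    def driver_close(self):
        self.calls.append("close")


def initialize(driver, register=atexit.register):
    """Open the driver now and schedule the matching close for exit.

    `register` defaults to atexit.register; it is a parameter only so the
    sketch can be exercised without waiting for interpreter shutdown.
    """
    driver.driver_open()
    register(driver.driver_close)
```

With KvikIO installed, the user-side equivalent would be a single call to kvikio.cufile_driver.initialize() early in the program.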

@madsbk madsbk added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Oct 24, 2024
@madsbk madsbk force-pushed the cufile_driver branch 2 times, most recently from 47a0452 to 4b8181c Compare October 24, 2024 13:10
@madsbk madsbk force-pushed the cufile_driver branch 10 times, most recently from 4967be7 to e7bde28 Compare October 25, 2024 11:50
@madsbk madsbk force-pushed the cufile_driver branch 2 times, most recently from 3d02555 to 0b4537f Compare October 26, 2024 11:00
@madsbk
Member Author

madsbk commented Oct 26, 2024

@EricKern calling kvikio.cufile_driver.initialize() should fix your segfault. Are you able to build KvikIO in your setup? If so, could you test this PR?

@EricKern

Thank you @madsbk for your efforts!
I'll try to build it next week. Maybe I can even find time tomorrow.

We want to trigger a CI error if a package was built without cuFile support
@EricKern

I built this PR and reran my small segfault reproducer script without explicitly opening and closing the driver.
It still segfaults when I set profile.cufile_stats in cufile.json to anything above 0.
It also happens when I explicitly open and close the driver.

If profile.cufile_stats=0 everything works fine.

I guess my segfault (#497) is unrelated to the driver initialization and destruction.

I have tested this on my local machine where I currently don't have a GDS-supported file system. So no actual writing happened.
Only initialization and then cufile's switch to its own compatibility mode.
But even then, the segfault was reproducible on another machine.

@madsbk madsbk force-pushed the cufile_driver branch 5 times, most recently from 37d318f to 3f8f8b7 Compare October 27, 2024 13:51
@madsbk
Member Author

madsbk commented Oct 28, 2024

Thanks @EricKern, good to get this confirmed. Let's continue the discussion in #497

@madsbk madsbk marked this pull request as ready for review October 28, 2024 06:59
@madsbk madsbk requested review from a team as code owners October 28, 2024 06:59
return cufile_driver.driver_close()


def initialize() -> None:
Contributor

question: Whose job is it to call initialize?

Member Author

The user for now. If we find a Python example that segfaults because of cuFile's termination issues, we should consider calling it in __init__.py.

Contributor

@wence- wence- left a comment

Tiny suggestions, none are blocking, I think

cpp/include/kvikio/shim/cufile.hpp: two outdated review threads, resolved
Comment on lines 114 to 115
if (!stream_available) { // The stream API was introduced in CUDA 12.2.
driver_open();
Contributor

Can we gate this behind the thing we actually mean, i.e. CUDA < 12? It seems we could use something like:

#if CUDA_VERSION_LT(12, 0)
driver_open();
#endif

Using a hypothetical macro.

Member Author

Actually, the versions of the CTK and cuFile might not match, so I updated the comment:

    // cuFile is supposed to open and close the driver automatically, but
    // because of a bug in cuFile v1.4 (CUDA v11.8), it sometimes segfaults:
    // <https://github.com/rapidsai/kvikio/issues/159>.
    // We use the stream API as a version indicator of cuFile; it was introduced
    // in cuFile v1.7 (CUDA v12.2).

std::cerr << "Unable to close GDS file driver: " << cufileop_status_error(error.err)
<< std::endl;
}
if (!stream_available) { driver_close(); }
Contributor

Same here.

* cuFile accepts multiple calls to `cuFileDriverOpen()`; only the first call opens
* the driver, but every call should have a matching call to `cuFileDriverClose()`.
*/
void driver_open()
Contributor

question/doc: Should we note that calling these functions should be unnecessary and that this is a workaround for the 11.8 bug? Or does it turn out that, because cuFile's automatic open/close is wrong, we do need to do this manually?

Member Author

It is used by kvikio::DriverInitializer, which can be used to pay the initialization overhead up front.
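
The pairing rule from the doc comment in this thread (every open needs a matching close) maps naturally onto a Python context manager. The sketch below is a hedged illustration: RefCountedDriver is a hypothetical model that assumes the driver is reference counted and closes on the last matching close, and driver_session is an illustrative helper, not part of KvikIO.

```python
from contextlib import contextmanager


class RefCountedDriver:
    """Hypothetical model: only the first open opens the driver, and
    every open must be balanced by a close."""

    def __init__(self):
        self.refcount = 0
        self.is_open = False

    def driver_open(self):
        if self.refcount == 0:
            self.is_open = True  # only the first call really opens
        self.refcount += 1

    def driver_close(self):
        if self.refcount == 0:
            raise RuntimeError("driver_close() without a matching driver_open()")
        self.refcount -= 1
        if self.refcount == 0:
            self.is_open = False  # assumed: the last matching close shuts it down


@contextmanager
def driver_session(driver):
    """Pair driver_open() with driver_close(), even if the body raises."""
    driver.driver_open()
    try:
        yield driver
    finally:
        driver.driver_close()
```

Opening a session early in a program pays the initialization cost up front, and nested sessions stay balanced because each one closes only what it opened.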

from kvikio._lib import cufile_driver # type: ignore

# TODO: Wrap nicely, maybe as a dataclass?
DriverProperties = cufile_driver.DriverProperties
Contributor

nit: Add a short docstring?

"""
What are the DriverProperties?
"""

Member Author

defer to #526

python/kvikio/kvikio/cufile_driver.py: review thread resolved