[Testing] Fix RL9 Nvidia driver issue due to RL9 new release #1839

Merged 1 commit on Nov 29, 2024

Conversation

LujieDuan
Contributor

@LujieDuan LujieDuan commented Nov 29, 2024

Description

Rocky Linux 9.5 was released on Nov 19.

RL9's default repo configuration (/etc/yum.repos.d/rocky.repo) fetches packages through a mirrorlist (e.g., mirrorlist=https://mirrors.rockylinux.org/mirrorlist?arch=$basearch&repo=AppStream-$releasever$rltype), and the mirrorlist automatically resolves to the RL9.5 repo even while the GCE image is still on 9.4.

This is fine for most packages, but CUDA requires the kernel-devel package whose version exactly matches the running OS kernel, and RL9.4's kernel-devel does not exist in the RL9.5 repo.
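One way to confirm the mismatch on an affected VM is to compare the running kernel release with the kernel-devel builds the enabled repos actually offer. The sketch below is illustrative only: the helper name `kernel_devel_available` is hypothetical, and the `repoquery` invocation in the usage comment is an assumption about the host's dnf/yum tooling rather than something taken from this PR.

```shell
#!/usr/bin/env bash
# Sketch (assumption, not part of this PR): report whether an exact
# kernel release appears in a list of available kernel-devel builds.
# The list is read from stdin, one "<version>-<release>.<arch>" per line.

kernel_devel_available() {
  local running="$1"   # e.g. the output of `uname -r`
  # -F: fixed string, -x: whole-line match, -q: quiet (exit status only)
  grep -Fxq -- "$running"
}

# Usage on a real host (commented out; needs network access and assumes
# that this repoquery form is supported by the installed dnf/yum-utils):
# dnf repoquery --showduplicates --qf '%{version}-%{release}.%{arch}' kernel-devel \
#   | kernel_devel_available "$(uname -r)" \
#   || echo "no kernel-devel matching $(uname -r) in the enabled repos"
```

If the check fails on a 9.4 image, the enabled repos only carry builds for a different kernel, which is the failure mode described above.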

Fix:

Once GCE builds a new RL9.5 image, the driver issue will resolve itself. To fix the issue right now and prevent it from recurring, add a new repo that matches the OS version.
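A minimal sketch of what a version-matched repo could look like follows. It is not the change made in this PR: the function name `render_pinned_repo`, the repo id, the file name, the vault URL layout, and the GPG key path are all assumptions chosen for illustration and should be checked against the actual Rocky Linux mirrors before use.

```shell
#!/usr/bin/env bash
# Sketch (assumptions throughout, see lead-in): render a yum repo stanza
# pinned to a specific OS minor version instead of the mirrorlist.

render_pinned_repo() {
  local version_id="$1"   # e.g. 9.4, as in VERSION_ID from /etc/os-release
  local basearch="$2"     # e.g. x86_64
  cat <<EOF
[appstream-pinned-${version_id}]
name=Rocky Linux ${version_id} - AppStream (pinned)
baseurl=https://dl.rockylinux.org/vault/rocky/${version_id}/AppStream/${basearch}/os/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-Rocky-9
EOF
}

# Usage on a real host (commented out; needs root and network access):
# . /etc/os-release
# render_pinned_repo "$VERSION_ID" "$(uname -m)" \
#   | sudo tee /etc/yum.repos.d/rocky-pinned.repo >/dev/null
# sudo yum install -y "kernel-devel-$(uname -r)"
```

Pinning by `VERSION_ID` rather than by a hard-coded version keeps the extra repo aligned with whatever minor release the image is actually running, so the same step keeps working after the next minor release.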

Notes:

  • Also removed a previous version pin; the script now installs the latest CUDA (12.6.3).
  • We won't need to apply a similar fix to RL8, since RL8.10 is already the last RL8 release.

Related issue

b/380251927

How has this been tested?

Integration tests passing.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

sudo yum install -y kernel-devel-$(uname -r) pciutils gcc make wget yum-utils

# Installing latest version of NVIDIA CUDA and driver
# Data Center/Tesla drivers and CUDA are released on different schedules;
# normally we install the matching versions of driver and CUDA
# ($DRIVER_VERSION == $CUDA_BUNDLED_DRIVER_VERSION); due to https://github.com/NVIDIA/open-gpu-kernel-modules/issues/550
Contributor Author

The issue mentioned above has been fixed in the new version of CUDA, so here we can simplify the steps and install CUDA and the driver from a single package.

@LujieDuan LujieDuan requested review from a team, rafaelwestphal and braydonk and removed request for a team and rafaelwestphal November 29, 2024 18:25
@LujieDuan LujieDuan merged commit 19ec7da into master Nov 29, 2024
69 checks passed
@LujieDuan LujieDuan deleted the lujieduan-20241129-nvidia-driver branch November 29, 2024 18:56
2 participants