new branch with linux header changes #65

Open
wants to merge 1 commit into
base: main
Empty file removed ubuntu22.04/empty
Empty file.
41 changes: 21 additions & 20 deletions ubuntu22.04/nvidia-driver
@@ -63,39 +63,40 @@ _resolve_kernel_version() {
}

# Install the kernel modules header/builtin/order files and generate the kernel version string.
_install_prerequisites() (
_install_prerequisites() {
local tmp_dir=$(mktemp -d)

trap "rm -rf ${tmp_dir}" EXIT
cd ${tmp_dir}

if [ "${PERSIST_DRIVER}" = false ]; then
Contributor

Ideally we should not be relying on the PERSIST_DRIVER environment variable as that feature is not introduced in this PR.

rm -rf /lib/modules/${KERNEL_VERSION}
Contributor

I am not sure why cleaning up this directory is required on every driver container instantiation. @shivamerla @tariq1890 would you happen to know why? If we can remove this, then we no longer need this conditional.

Contributor

On container startup, if we do not mount /lib/modules or /usr/src explicitly from the host, both of these directories will be empty inside of the container. As an example:

$ kubectl exec -n gpu-operator -it nvidia-driver-daemonset-7qwn5 -- sh
# ls -ltr /lib/modules/
ls: cannot access '/lib/modules/': No such file or directory
# ls -ltr /usr/src/
total 0

Based on this observation, what are your thoughts on the following proposal?

  1. Remove the following command from our script as it appears to not be needed: rm -rf /lib/modules/${KERNEL_VERSION}
  2. Only install the linux-modules package if /lib/modules/${KERNEL_VERSION} directory does not exist.

fi

if [ ! -d "/usr/src/linux-headers-$(uname -r)/" ]; then
echo "Installing Linux kernel headers..."
apt-get -qq install --no-install-recommends linux-headers-${KERNEL_VERSION} > /dev/null
fi
Comment on lines +76 to +79
Contributor

@cdesiniotis Jul 17, 2024

An observation -- if we mount /usr/src from the host AND /usr/src/linux-headers-<kernel-version> does not exist, meaning the linux headers are not available on the host, then we end up installing the headers both in the container AND on the host at /usr/src/linux-headers-<kernel-version>. I am not sure if this is desirable.
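
One way to detect this case before installing anything is to check whether /usr/src is a mount target. The following is only a minimal sketch: the helper name `_is_mounted` is hypothetical and not part of this PR, and it assumes the mount table is readable at /proc/mounts inside the container.

```shell
# Hypothetical helper, not part of this PR.
# Succeeds if $1 is listed as a mount target in the mount table.
# $2 is the mount table to read (defaults to /proc/mounts); it is a
# parameter only so the function can be exercised against a fixture file.
_is_mounted() {
    local target="$1"
    local mounts_file="${2:-/proc/mounts}"
    # Field 2 of each /proc/mounts entry is the mount target.
    awk -v t="${target}" '$2 == t { found = 1 } END { exit !found }' "${mounts_file}"
}
```

The header installation could then be skipped, or at least logged, when `_is_mounted /usr/src` succeeds and /usr/src/linux-headers-${KERNEL_VERSION} is missing. Whether skipping is the right behavior is still an open question.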


echo ${KERNEL_VERSION} > version

rm -rf /lib/modules/${KERNEL_VERSION}
echo "Generating Linux kernel version string..."

mkdir -p /lib/modules/${KERNEL_VERSION}/proc

echo "Installing Linux kernel headers..."
apt-get -qq install --no-install-recommends linux-headers-${KERNEL_VERSION} > /dev/null

echo "Installing Linux kernel module files..."
apt-get -qq download linux-image-${KERNEL_VERSION} && dpkg -x linux-image*.deb .
{ apt-get -qq download linux-modules-${KERNEL_VERSION} && dpkg -x linux-modules*.deb . || true; } 2> /dev/null
Comment on lines -79 to -80
Contributor

Similar to the linux-headers package, shouldn't we have a conditional here along the lines of "if the linux-image / linux-modules packages are not installed, then install them"? I am not sure of the best way to determine whether we need to install them...
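
One possible check is to look for the files the script would otherwise copy into place, rather than querying dpkg (the package database inside the container would not know about files mounted from the host). This is a sketch under that assumption. The helper name `_kernel_modules_present` and its `root` parameter are hypothetical and not part of this PR.

```shell
# Hypothetical helper, not part of this PR.
# Succeeds if the kernel/ tree plus modules.order and modules.builtin
# already exist for the given kernel version.
# $2 is an optional root prefix (empty by default, i.e. the real /lib/modules);
# it exists only so the check can be run against a scratch directory.
_kernel_modules_present() {
    local kernel_version="$1"
    local root="${2:-}"
    local mod_dir="${root}/lib/modules/${kernel_version}"
    [ -d "${mod_dir}/kernel" ] \
        && [ -f "${mod_dir}/modules.order" ] \
        && [ -f "${mod_dir}/modules.builtin" ]
}
```

With this, the download and extract steps would run only when `_kernel_modules_present "${KERNEL_VERSION}"` fails.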

mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION}
mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION}
depmod ${KERNEL_VERSION}

echo "Generating Linux kernel version string..."
mv version /lib/modules/${KERNEL_VERSION}/proc

ls -1 boot/vmlinuz-* | sed 's/\/boot\/vmlinuz-//g' - > version
if [ -z "$(<version)" ]; then
echo "Could not locate Linux kernel version string" >&2
return 1
Comment on lines -87 to -90
Contributor

@shivamerla after reviewing this again, do we really need to install the linux-image-$KERNEL_VERSION package at all? It seems we only use it to construct the kernel version string. However, the kernel version string is already assumed to be set correctly in the KERNEL_VERSION environment variable before we ever reach this point in the script.

Based on my understanding, I think we can remove this code block entirely in all cases and never install the linux-image-$KERNEL_VERSION package.
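
A rough sketch of what that could look like, assuming the proc/version file only needs to contain the value of KERNEL_VERSION. The helper name `_write_kernel_version_string` and its `root` parameter are hypothetical and not part of this PR.

```shell
# Hypothetical helper, not part of this PR.
# Writes the kernel version string to <root>/lib/modules/<version>/proc/version
# from the value passed in, without downloading linux-image.
# $2 is an optional root prefix (empty by default); it exists only for testing.
_write_kernel_version_string() {
    local kernel_version="$1"
    local root="${2:-}"
    if [ -z "${kernel_version}" ]; then
        echo "Could not locate Linux kernel version string" >&2
        return 1
    fi
    local proc_dir="${root}/lib/modules/${kernel_version}/proc"
    mkdir -p "${proc_dir}"
    echo "${kernel_version}" > "${proc_dir}/version"
}
```

The error message is copied from the existing script so that the failure mode stays recognizable if KERNEL_VERSION is unexpectedly empty.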

if [ "${PERSIST_DRIVER}" = false ]; then
mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION}
mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION}
depmod ${KERNEL_VERSION}
Comment on lines +90 to +92
Contributor

Note, these commands won't work as they require that the linux-image and linux-modules deb packages were downloaded and extracted locally first.

Comment on lines +89 to +92
Contributor

Based on my comment https://github.com/NVIDIA/gpu-driver-container/pull/65/files#r1681872260 I would recommend the following change:

Suggested change
if [ "${PERSIST_DRIVER}" = false ]; then
mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION}
mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION}
depmod ${KERNEL_VERSION}
if [ ! -d "/lib/modules/${KERNEL_VERSION}" ]; then
{ apt-get -qq download linux-modules-${KERNEL_VERSION} && dpkg -x linux-modules*.deb . || true; } 2> /dev/null
mv lib/modules/${KERNEL_VERSION}/modules.* /lib/modules/${KERNEL_VERSION}
mv lib/modules/${KERNEL_VERSION}/kernel /lib/modules/${KERNEL_VERSION}
depmod ${KERNEL_VERSION}

fi
mv version /lib/modules/${KERNEL_VERSION}/proc
)
}

# Cleanup the prerequisites installed above.
_remove_prerequisites() {
if [ "${PACKAGE_TAG:-}" != "builtin" ]; then
apt-get -qq purge linux-headers-${KERNEL_VERSION} > /dev/null
apt-get -qq purge linux-headers-${KERNEL_VERSION} > /dev/null || true
# TODO remove module files not matching an existing driver package.
fi
}