Commit db0297d

feat(gpu): Update and enhance GPU initialization script
This commit introduces several improvements to the `install_gpu_driver.sh` script and its documentation:

1. **Refactored for Custom Images:** The script now supports a deferred configuration model when used for building custom Dataproc images. By passing `--metadata invocation-type=custom-images`, the script performs driver/toolkit installations but defers Hadoop/Spark-specific settings to the first boot of a cluster instance via a systemd service (`dataproc-gpu-config.service`). This ensures compatibility with the custom image build process.

2. **Improved Dependency Handling:**
   * Added logic to handle potentially missing `kernel-devel` packages from vaulted or staging Rocky Linux repositories.
   * Ensures `python3-venv` is installed on Ubuntu 2.2+ for the GPU agent.
   * Corrected Conda root path for Dataproc 2.3+.

3. **Enhanced Repository and Key Management:**
   * Updated GPG key fetching for NVIDIA Container Toolkit and CUDA repositories on Debian/Ubuntu to include necessary keys and proxy support.

4. **NVIDIA Artifact Hash Verification:** Added an associative array `recognized_hashes` to store known SHA256 sums for downloaded NVIDIA driver and CUDA `.run` files. The script now checks the hash of downloaded files against this list, although it currently only warns on mismatch.

5. **Documentation Updates (README.md):**
   * Clarified default CUDA versions per Dataproc image series.
   * Updated example `gcloud` commands to be more complete and modern.
   * Detailed the new `invocation-type` metadata for custom image builds.
   * Reorganized and updated sections on cuDNN, metadata parameters, Secure Boot, and troubleshooting.
   * Added an important section on performance implications and the benefits of cache warming, especially when builds from source are required.
   * Noted that the GPU agent now handles metric creation, deprecating the need for the `create_gpu_metrics.py` script.
   * Changed default for `install-gpu-agent` to `true`.

6. **Script Robustness:**
   * Added `set +e` around `get_metadata_attribute` calls to handle missing attributes gracefully.
   * Improved error handling and messages in various functions.
   * Ensured YARN/Spark configurations are only applied if the respective config directories exist.
   * MIG scripts are fetched if missing when a MIG-enabled GPU is detected during the configuration phase.

These changes aim to make the GPU initialization action more reliable, flexible, and easier to use, both for regular cluster creation and custom image building.
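The warn-only hash check described in item 4 could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from `install_gpu_driver.sh`: only the array name `recognized_hashes` comes from the commit message. The function name `verify_hash`, the choice to key the array by file name, and the placeholder entry (the SHA256 of an empty file under a made-up name) are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a warn-only SHA256 check; requires bash 4+ and sha256sum.

# Known-good hashes keyed by file name (keying scheme is an assumption).
# The single entry is a placeholder: the SHA256 of an empty file.
declare -A recognized_hashes=(
  ["example-empty.run"]="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)

# verify_hash FILE
# Returns 0 if FILE's SHA256 matches the recognized value.
# Prints a warning to stderr and returns 1 on a mismatch or an unknown file.
verify_hash() {
  local file="$1"
  local name expected actual
  name="$(basename "${file}")"
  expected="${recognized_hashes[${name}]:-}"

  if [[ -z "${expected}" ]]; then
    echo "WARNING: no recognized hash for ${name}" >&2
    return 1
  fi

  actual="$(sha256sum "${file}" | awk '{print $1}')"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "WARNING: hash mismatch for ${name}: expected ${expected}, got ${actual}" >&2
    return 1
  fi
  return 0
}
```

To keep the warn-only behaviour the commit describes, a caller would ignore the return status, for example `verify_hash "${downloaded_file}" || true`. Making the check enforcing would then be a one-line change at the call site.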
1 parent b1ca547 commit db0297d

3 files changed: +738 -381 lines changed
