feat(gpu): Update and enhance GPU initialization script #1363
Draft
cjac wants to merge 3 commits into GoogleCloudDataproc:main from LLC-Technologies-Collier:gpu-20251011
+1,378 −468
Conversation
/gcbrun
This commit introduces several improvements to the `install_gpu_driver.sh` script and its documentation:

1. **Refactored for Custom Images:** The script now supports a deferred configuration model when used for building custom Dataproc images. By passing `--metadata invocation-type=custom-images`, the script performs the driver/toolkit installations but defers Hadoop/Spark-specific settings to the first boot of a cluster instance via a systemd service (`dataproc-gpu-config.service`). This ensures compatibility with the custom image build process.
2. **Improved Dependency Handling:**
   * Added logic to handle potentially missing `kernel-devel` packages from vaulted or staging Rocky Linux repositories.
   * Ensured `python3-venv` is installed on 2.2+ Ubuntu images for the GPU agent.
   * Corrected the Conda root path for Dataproc 2.3+.
3. **Enhanced Repository and Key Management:** Updated GPG key fetching for the NVIDIA Container Toolkit and CUDA repositories on Debian/Ubuntu to include the necessary keys and proxy support.
4. **NVIDIA Artifact Hash Verification:** Added an associative array, `recognized_hashes`, storing known SHA256 sums for downloaded NVIDIA driver and CUDA `.run` files. The script now checks each downloaded file's hash against this list, although it currently only warns on a mismatch.
5. **Documentation Updates (README.md):**
   * Clarified default CUDA versions per Dataproc image series.
   * Updated example `gcloud` commands to be more complete and modern.
   * Detailed the new `invocation-type` metadata for custom image builds.
   * Reorganized and updated the sections on cuDNN, metadata parameters, Secure Boot, and troubleshooting.
   * Added a section on performance implications and the benefits of cache warming, especially when builds from source are required.
   * Noted that the GPU agent now handles metric creation, deprecating the `create_gpu_metrics.py` script.
   * Changed the default for `install-gpu-agent` to `true`.
6. **Script Robustness:**
   * Added `set +e` around `get_metadata_attribute` calls to handle missing attributes gracefully.
   * Improved error handling and messages in various functions.
   * Ensured YARN/Spark configurations are only applied if the respective config directories exist.
   * MIG scripts are fetched if missing when a MIG-enabled GPU is detected during the configuration phase.
   * Repaired the broken `/etc/init.d/hadoop-yarn-nodemanager` stop function.
   * Removed the dependency on `lspci`.
   * A `false` value for either the `install-gpu-agent` or the `enable-gpu-monitoring` metadata attribute now disables GPU metrics collection.

These changes aim to make the GPU initialization action more reliable, flexible, and easier to use, both for regular cluster creation and custom image building.
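The hash-verification step described in item 4 can be sketched roughly as follows. The `recognized_hashes` array name comes from the commit message; the helper function, its warning text, and the sample entry are illustrative assumptions, not the script's actual code:

```shell
#!/usr/bin/env bash
# Sketch of the recognized_hashes check (bash 4+ associative arrays).
declare -A recognized_hashes
# Hypothetical entry: basename of a downloaded artifact -> known SHA256 sum.
# recognized_hashes["NVIDIA-Linux-x86_64-XXX.run"]="<sha256>"

verify_artifact_hash() {
  local file="$1"
  local expected="${recognized_hashes[$(basename "${file}")]:-}"
  if [[ -z "${expected}" ]]; then
    echo "WARNING: no recognized hash for $(basename "${file}")" >&2
    return 0
  fi
  local actual
  actual="$(sha256sum "${file}" | awk '{print $1}')"
  if [[ "${actual}" != "${expected}" ]]; then
    # Per the PR description, a mismatch currently only warns, it does not abort.
    echo "WARNING: hash mismatch for ${file}: got ${actual}, expected ${expected}" >&2
  fi
}
```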
/gcbrun
This commit addresses several issues related to NodeManager stability on Rocky Linux and fixes errors in the verification scripts.

**NodeManager Restart:**
* Ensures the `hadoop-yarn-nodemanager` service is disabled at the start of the init action to prevent conflicts with the `Restart=always` policy.
* The service is now masked within `yarn_exit_handler` before port checks and unmasked/enabled just before starting.
* The LSB init script (`/etc/init.d/hadoop-yarn-nodemanager`) now uses the `daemon` function correctly, passing the `nodemanager` command without additional `--daemon start` flags, allowing the LSB wrapper to manage the daemon lifecycle.
* Removed a duplicate definition of the `ensure_good_nodemanager_init_script` function.
* Added aggressive port clearing for all NodeManager-related ports in the `stop()` function of the LSB script.

**Verification Script Fixes:**
* Corrected quoting and variable expansion in the `verify_pytorch` command string in `gpu_test_case_base.py` to prevent remote shell syntax errors.
* Fixed an `AttributeError` in `verify_cluster.py` by changing `self.getClusterRegion()` to `self.cluster_region`.

These changes aim to make NodeManager restarts more reliable during the GPU initialization process and ensure the verification scripts run correctly.
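The "aggressive port clearing" mentioned above might look roughly like the sketch below. The use of `fuser` and the idea of passing the ports explicitly are assumptions for illustration; the actual NodeManager ports come from the YARN configuration and the PR's real implementation may differ:

```shell
#!/usr/bin/env bash
# Illustrative port-clearing helper for an LSB-style stop() path.
clear_nodemanager_ports() {
  local port
  for port in "$@"; do
    # Kill anything still bound to the TCP port so a restart can bind cleanly.
    # fuser exits nonzero when the port is free, which is not an error here.
    fuser -k -n tcp "${port}" >/dev/null 2>&1 || true
  done
  return 0
}
```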
/gcbrun
Accepts PR #1357
Fixes Issue #1356
Addresses GoogleCloudDataproc/custom-images#110 in the initialization-actions repository. I don't think a separate Issue was opened for this work.
Fixes a long-running issue about Rocky systems not being able to stop the `hadoop-yarn-nodemanager` service with `/etc/init.d/hadoop-yarn-nodemanager stop` or `systemctl stop hadoop-yarn-nodemanager`. This fix should be contributed upstream to Bigtop; I see that HEAD on that repo has native systemd services, which may address the issue moving forward. This fix only addresses legacy images before the next release of Bigtop.
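A minimal sketch of the kind of LSB-style `stop()` behavior such a fix restores. The pid-file path and the escalation timing here are assumptions for illustration; packaged init scripts derive these values from the Hadoop environment files:

```shell
#!/usr/bin/env bash
# Illustrative pid-file-based stop, with SIGTERM first and SIGKILL as fallback.
PIDFILE="${PIDFILE:-/var/run/hadoop-yarn/yarn-yarn-nodemanager.pid}"  # assumed path

stop_nodemanager() {
  local pid
  [[ -f "${PIDFILE}" ]] || return 0   # nothing recorded as running
  pid="$(cat "${PIDFILE}")"
  if kill -0 "${pid}" 2>/dev/null; then
    kill "${pid}"
    # Give the process up to ~10s to exit cleanly before escalating.
    for _ in $(seq 1 10); do
      kill -0 "${pid}" 2>/dev/null || break
      sleep 1
    done
    if kill -0 "${pid}" 2>/dev/null; then
      kill -9 "${pid}" 2>/dev/null || true
    fi
  fi
  rm -f "${PIDFILE}"
}
```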