@jgainerdewar
Collaborator

Description

Adds the `gpu` runtime attribute required by WDL 1.1 (spec).

A task with this flag set to true is guaranteed to run only if a GPU is available within the runtime environment. It is the responsibility of the execution engine to check prior to execution whether a GPU is provisionable, and if not, preemptively fail the task.

This check is implemented in two steps:

  • For each backend, define whether a GPU is potentially available. GPUs are available on GCP and AWS, unavailable on TES and local/SFS (Cromwell doesn't have a reliable way to detect presence of GPUs in these environments). This is checked in JobPreparationActor and tasks are failed if they require GPU and are running on a backend that doesn't support it.
  • For backends where a GPU may be available (GCP and AWS), each backend manages its own check for whether the task is correctly configured to use a GPU. This is checked in *BatchRuntimeAttributes creation.
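
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every name in it (`GpuCheckSketch`, `Backend`, `TaskRuntime`, `checkBackend`, `checkConfiguration`, `validate`) is hypothetical, and the configuration check is reduced to a single `gpuCount` field, whereas the real backends inspect their own sets of runtime attributes.

```scala
// Hypothetical sketch of the two-step GPU check; none of these names come from Cromwell.
object GpuCheckSketch {
  // Step 1 input: each backend declares whether a GPU is potentially available.
  sealed trait Backend { def gpuPotentiallyAvailable: Boolean }
  case object Gcp extends Backend { val gpuPotentiallyAvailable = true }
  case object Aws extends Backend { val gpuPotentiallyAvailable = true }
  case object Tes extends Backend { val gpuPotentiallyAvailable = false }
  case object Local extends Backend { val gpuPotentiallyAvailable = false }

  // Simplified stand-in for a task's runtime attributes (assumption: one GPU-related field).
  final case class TaskRuntime(gpuRequired: Boolean, gpuCount: Option[Int] = None)

  // Step 1: fail early if the task requires a GPU and the backend cannot offer one.
  def checkBackend(backend: Backend, task: TaskRuntime): Either[String, Unit] =
    if (task.gpuRequired && !backend.gpuPotentiallyAvailable)
      Left(s"Task requires a GPU, but backend $backend does not support GPUs")
    else Right(())

  // Step 2: on a GPU-capable backend, confirm the task is actually configured to use a GPU.
  def checkConfiguration(task: TaskRuntime): Either[String, Unit] =
    if (task.gpuRequired && !task.gpuCount.exists(_ > 0))
      Left("Task requires a GPU, but its runtime attributes do not configure one")
    else Right(())

  // Run both steps in order; the first failure short-circuits the second.
  def validate(backend: Backend, task: TaskRuntime): Either[String, Unit] =
    checkBackend(backend, task).flatMap(_ => checkConfiguration(task))
}
```

In this sketch a task with `gpuRequired = false` passes on every backend, which reflects the intent that only tasks explicitly requiring a GPU are subject to the check.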

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@jgainerdewar requested a review from a team as a code owner on October 30, 2025, 19:45
Collaborator

@aednichols left a comment

LGTM 👍

If `true`, Cromwell will attempt to ensure that the task can run in an environment with GPU support. The task will be
failed if we can't confirm a GPU is available.

- Google Cloud: Cromwell will attempt to examine other runtime attributes such as `gpuCount`, `gpuType`, `predefinedMachineType` to determine whether the task is configured to use a GPU, and fail the task if it is not.
Collaborator

It would be good to include detail that answers the following user concern:

I use GPUs on Batch, are you saying I need to go through all my WDLs and add gpu: true?! 😖

val supportsGpu: Boolean =
  machineType.toLowerCase.contains("nvidia") ||
    machineType.toLowerCase.contains("gpu") ||
    machineType.toLowerCase.matches("^g[0-9]*-.*")
Collaborator

e.g. a2-highgpu-1g

Suggested change
- machineType.toLowerCase.matches("^g[0-9]*-.*")
+ machineType.toLowerCase.matches("^[ga][0-9]*-.*")

Collaborator Author

That machine type, `a2-highgpu-1g`, is already covered by the line above, `.contains("gpu")`, right?

Collaborator

Oh I see, I didn't think of that. Disregard if you don't feel it's necessary.
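
For reference, the point settled in this thread can be checked with a small standalone version of the heuristic. This is only a sketch: the wrapper object `MachineTypeHeuristic` and the function form of `supportsGpu` are assumptions made for illustration (the snippet under review is a `val` inside a larger class), and the machine type names passed to it are just sample inputs.

```scala
// Standalone copy of the heuristic from the diff, wrapped in a hypothetical object for testing.
object MachineTypeHeuristic {
  def supportsGpu(machineType: String): Boolean = {
    val lower = machineType.toLowerCase
    lower.contains("nvidia") ||        // name mentions the GPU vendor
      lower.contains("gpu") ||         // name mentions "gpu", e.g. a2-highgpu-1g
      lower.matches("^g[0-9]*-.*")     // name starts with "g", optional digits, then a dash
  }
}
```

With this definition, `a2-highgpu-1g` is accepted by the `contains("gpu")` clause, and a name such as `g2-standard-4` is accepted by the regex clause, while `n1-standard-4` is rejected by all three. Broadening the regex to `^[ga][0-9]*-.*` would only change the result for names that start with `a` and do not contain `gpu` or `nvidia`.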
