Prevent memory explosion during GeoTransolver inference on large meshes#1361
Closed
phermosomore wants to merge 2 commits into NVIDIA:main
Conversation
Commit: …er sub batch
Signed-off-by: Pablo Hermoso Moreno <phermosomore@nvidia.com>
Contributor
Greptile Summary: Fixed an OOM issue during GeoTransolver inference on large meshes by disabling broadcast_global_features in the inference datapipe and broadcasting fx per sub-batch.
Important Files Changed
examples/cfd/external_aerodynamics/transformer_models/src/inference_on_zarr.py
Commit: …rence_on_zarr.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Collaborator
[reposting from slack for posterity] Hi @phermosomore - thanks for sharing! For the record, the intention in GeoTransolver is to never broadcast global features. It's something to be set in the config file. Here's the logic:
Additionally, in most of the inference scripts we usually aren't running 10M points at one time but batching them. Is that what you are doing and seeing the OOM?
Collaborator
This is addressed by #1415, so I'll close this here.
PhysicsNeMo Pull Request
Description
This PR fixes an out-of-memory (OOM) issue when running GeoTransolver inference on large meshes (10M+ cells) that was causing the process to be killed.
During inference on full car meshes, the broadcast_global_features: true setting caused fx (global features: air density, stream velocity) to be replicated to every mesh point before sub-batching.
This, combined with downstream processing in the ContextProjector, exceeded GPU memory before even the first forward pass.
The GeoTransolver model uses a global_tokenizer (ContextProjector) that processes the global features through linear projections and multi-head attention. When fx is broadcast to 2M+ tokens upfront, the intermediate activations and attention computations scale linearly with mesh size, causing OOM.
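A back-of-envelope sketch of the linear scaling described above. The mesh size and feature count come from this PR; the hidden width is a hypothetical value, not taken from the actual GeoTransolver config:

```python
# Illustrative arithmetic: why broadcasting fx upfront exhausts GPU memory
# before the first forward pass. hidden=256 is an assumed projection width.
n_points = 10_000_000   # full car mesh, 10M+ cells
n_global = 2            # global features: air density, stream velocity
hidden = 256            # hypothetical ContextProjector hidden width
bytes_fp32 = 4

fx_single = 1 * n_global * bytes_fp32             # (B=1, 1, 2): 8 bytes
fx_broadcast = n_points * n_global * bytes_fp32   # replicated to every point
proj_activation = n_points * hidden * bytes_fp32  # one projection activation

print(f"fx broadcast: {fx_broadcast / 1e6:.0f} MB")       # ~80 MB
print(f"one activation: {proj_activation / 1e9:.2f} GB")  # ~10 GB
```

Even though the broadcast fx itself is modest, each intermediate activation downstream of it scales with the full point count, which is where the memory explosion occurs.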
Solution:
inference_on_vtk.py: Force broadcast_global_features: false in the datapipe for inference, regardless of the training config. This keeps fx as a single token (B, 1, 2).
inference_on_zarr.py: Modified batched_inference_loop to broadcast fx per sub-batch dynamically:
If fx is single-token → expand to match sub-batch size
If fx is full-mesh (legacy path) → slice for sub-batch
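The two branches above can be sketched as follows. This is a NumPy illustration, not the actual batched_inference_loop code; the helper name is hypothetical:

```python
import numpy as np

def fx_for_subbatch(fx, start, end):
    """Hypothetical helper mirroring the modified batched_inference_loop:
    return global features shaped for the sub-batch window [start, end)."""
    n_sub = end - start
    if fx.shape[1] == 1:
        # Single-token fx (B, 1, 2): expand to the sub-batch size on the fly,
        # so only n_sub copies ever exist in memory at once.
        return np.repeat(fx, n_sub, axis=1)
    # Legacy full-mesh fx (B, N, 2): slice out the sub-batch window.
    return fx[:, start:end, :]

fx_single = np.array([[[1.2, 30.0]]])      # (1, 1, 2) single global token
out = fx_for_subbatch(fx_single, 0, 4)     # expanded to (1, 4, 2)
fx_full = np.tile(fx_single, (1, 10, 1))   # legacy pre-broadcast path
out2 = fx_for_subbatch(fx_full, 2, 6)      # sliced to (1, 4, 2)
```

Both paths hand the model the same per-sub-batch shape, so peak memory is bounded by the sub-batch size rather than the full mesh.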
Why This Doesn't Affect Inference Quality
Since all tokens in broadcast fx have identical values, the aggregation result is mathematically equivalent. The model sees the same sub-batch size it was trained on, just processed sequentially instead of all at once.
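A quick numerical check of this equivalence claim, with NumPy for illustration; the mean here is a stand-in for whatever permutation-invariant aggregation the model actually applies to the global tokens:

```python
import numpy as np

fx = np.array([[[1.2, 30.0]]])   # single global token, shape (B, 1, 2)
fx_b = np.repeat(fx, 8, axis=1)  # broadcast to 8 identical tokens

# Aggregating identical tokens (here: mean over the token axis) returns
# exactly the single-token value, so both paths yield the same result.
assert np.allclose(fx_b.mean(axis=1), fx[:, 0, :])
```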
Checklist
Dependencies
None
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.