fix(data): preserve other batch edge_index in Batch.append() #46

Open

Ryan-Reese wants to merge 4 commits into NVIDIA:main from Ryan-Reese:fix/batch-append-mutates-input

Conversation

Ryan-Reese (Contributor) commented Apr 2, 2026

ALCHEMI Toolkit Pull Request

Description

Batch.append() temporarily offsets the other batch's edge_index by the receiver's atom count for correct concatenation, but never restores the original values. This silently corrupts the input batch — any subsequent use of the appended batch produces wrong edge indices.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Related Issues

Relates to #21 (Data layer gaps and limitations)

Changes Made

  • Save other's original edge_index reference before applying the node offset
  • Wrap concatenation in try/finally to restore other's edge_index even if concatenation raises
  • Reject shared-storage aliasing (batch.append(batch) or shared MultiLevelStorage) which cannot work correctly with in-place concatenation
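
The guard plus save/restore pattern above can be sketched with a toy stand-in (plain Python lists instead of the real nvalchemi storage classes; `ToyBatch` and its attributes are illustrative, not the actual implementation):

```python
# Illustrative toy model of the fix, NOT the real nvalchemi.data.Batch:
# edges are plain [src, dst] pairs and "storage" is the edge list itself.
class ToyBatch:
    def __init__(self, num_atoms, edge_index):
        self.num_atoms = num_atoms
        self.edge_index = edge_index  # list of [src, dst] pairs

    def append(self, other):
        # Guard: in-place concatenation cannot work when both sides
        # share the same underlying storage.
        if other is self or other.edge_index is self.edge_index:
            raise ValueError("shares storage; append a clone instead")
        saved_ei = other.edge_index  # save the original reference
        try:
            # Temporarily offset other's edges by the receiver's atom count.
            offset = self.num_atoms
            other.edge_index = [[s + offset, d + offset]
                                for s, d in saved_ei]
            self.edge_index = self.edge_index + other.edge_index
            self.num_atoms += other.num_atoms
        finally:
            # Restore even if concatenation raised, so the input batch
            # is never left corrupted.
            other.edge_index = saved_ei
```

With `b1 = ToyBatch(3, [[0, 1], [1, 2]])` and `b2 = ToyBatch(4, [[0, 1], [1, 2], [2, 3]])`, `b1.append(b2)` grows `b1` to seven atoms while `b2.edge_index` comes back unchanged.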

Reproduction

from nvalchemi.data import AtomicData, Batch
import torch

d1 = AtomicData(
    atomic_numbers=torch.ones(3, dtype=torch.long),
    positions=torch.randn(3, 3),
    edge_index=torch.tensor([[0, 1], [1, 2]]),
)
d2 = AtomicData(
    atomic_numbers=torch.ones(4, dtype=torch.long),
    positions=torch.randn(4, 3),
    edge_index=torch.tensor([[0, 1], [1, 2], [2, 3]]),
)

b1 = Batch.from_data_list([d1])
b2 = Batch.from_data_list([d2])

b2_ei_before = b2.edge_index.clone()
b1.append(b2)

# BEFORE fix: b2.edge_index is [[3,4],[4,5],[5,6]] — corrupted
# AFTER fix:  b2.edge_index is [[0,1],[1,2],[2,3]] — preserved
assert torch.equal(b2.edge_index, b2_ei_before)

Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • Verified bug reproduction before/after fix on GPU node
  • All 68 existing test/data/test_batch.py tests pass with no regressions

Checklist

  • I have performed a self-review of my code
  • I have updated the documentation (if applicable)

Additional Notes

Known limitation (pre-existing): append() is not transactional with respect to self — if group.concatenate() raises mid-loop, self may be left partially updated. This is a pre-existing architectural issue unrelated to this fix and would require cloning all storage groups before mutation. The try/finally added here only guarantees other is restored.
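
The group-cloning approach mentioned above could be sketched roughly like this (a hypothetical helper, not part of this PR; the dict-of-lists shape and names are illustrative):

```python
import copy

def transactional_mutate(groups, mutate):
    """Hypothetical sketch: snapshot every mutable group before a
    multi-group mutation and roll all of them back if it raises.
    `groups` is an illustrative dict of name -> list of rows."""
    snapshot = {name: copy.deepcopy(data) for name, data in groups.items()}
    try:
        mutate(groups)
    except Exception:
        # Roll back in place so existing references stay valid.
        for name, data in snapshot.items():
            groups[name][:] = data
        raise
```

Deep-copying every group before each append is exactly the cost alluded to above, which is why this PR leaves the transactional behaviour of self out of scope.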

Batch.append() temporarily offsets the other batch's edge_index by the
receiver's atom count for correct concatenation, but never restores the
original values. This silently corrupts the input batch — any subsequent
use of the appended batch (e.g. in inflight batching or multi-stage
pipelines) produces wrong edge indices.

Save and restore the original edge_index reference around the
concatenation in a try/finally block to keep the input batch intact even
if concatenation raises. Also reject shared-storage aliasing
(batch.append(batch) or shared MultiLevelStorage) which cannot work
correctly with in-place concatenation.
copy-pr-bot bot commented Apr 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Ryan-Reese marked this pull request as ready for review April 2, 2026 17:32

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR fixes a silent mutation bug in Batch.append() where other's edge_index was permanently offset by the receiver's node count after concatenation. The fix saves the original tensor reference before the offset, wraps the concatenation in try/finally to guarantee restoration, and adds an early guard to reject self-aliasing. The three new tests cover all the new code paths and the previous reviewer comments have been fully addressed.

Important Files Changed

  • nvalchemi/data/batch.py: Adds the self-aliasing guard and the try/finally that restores other's edge_index after append; logic is correct and the finally block safely uses the cached local other_edges.
  • test/data/test_batch.py: Adds three new tests covering edge_index preservation, the self-append guard, and the shared-storage guard; all tests correctly exercise the new behaviour.

Reviews (4). Last reviewed commit: "ci: retrigger checks"

Comment on lines +989 to +992
finally:
    # Restore other's edge_index to avoid mutating the input batch.
    if saved_ei is not None:
        other._edges_group._data["edge_index"] = saved_ei

P2 other._edges_group re-evaluated in finally instead of using the cached local

The finally block calls other._edges_group (a property that does self._storage.groups.get("edges")) instead of reusing the already-resolved other_edges local variable. Under normal operation both refer to the same object, but using the local is slightly cheaper and makes the symmetry with the setup code explicit:

Suggested change:

    finally:
        # Restore other's edge_index to avoid mutating the input batch.
        if saved_ei is not None:
-           other._edges_group._data["edge_index"] = saved_ei
+           other_edges._data["edge_index"] = saved_ei

Comment on lines +964 to +969
if other is self or other._storage is self._storage:
    raise ValueError(
        "Cannot append a Batch that shares storage with the "
        "receiver (would corrupt both). Use "
        "batch.append(batch.clone()) instead."
    )

P2 Missing test coverage for the new guard and the core fix

Neither the ValueError for self-aliasing nor the edge_index preservation behavior described in the PR have a corresponding test. The existing test_append only checks graph/node counts on b1 and would not catch a regression. Consider adding two assertions:

# test self-aliasing guard
with pytest.raises(ValueError, match="shares storage"):
    b1.append(b1)

# test other's edge_index is not mutated
b2_ei_before = b2.edge_index.clone()
b1.append(b2)
assert torch.equal(b2.edge_index, b2_ei_before)

Collaborator

@Ryan-Reese please consider this suggestion.

Contributor Author

included

Add three tests to TestBatchMutation covering the new append() safety
behaviour:
- test_append_preserves_other_edge_index: verifies input batch's
  edge_index is not mutated after append
- test_append_self_raises: verifies self-append raises ValueError
- test_append_shared_storage_raises: verifies shared-storage append
  raises ValueError

Use cached other_edges variable in finally block per review feedback.
@Ryan-Reese

/ok to test efa2639

@laserkelvin

Coverage is failing - I'll check out the branch locally and confirm it's okay

@Ryan-Reese

@laserkelvin thanks :)

@laserkelvin left a comment

I've confirmed locally that coverage is fine and tests pass

            group.concatenate(other_group)
        else:
            group.extend_for_appended_graphs(n_other)
    try:

My concerns with this bit are twofold:

  1. There's no exception being handled explicitly
  2. There's overhead associated with try/except/finally; it's not huge but it can be non-negligible

If possible, I would rewrite it without the try block, unless there is something you're guarding against
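
For comparison, a no-try variant along these lines might look like the following sketch (plain-list stand-ins, not the actual nvalchemi code; note that if the concatenation step raised, `other_edges` would be left offset, which is exactly what the try/finally guards against):

```python
def append_no_try(self_edges, other_edges, offset):
    """Sketch of append without try/finally: offset in place,
    concatenate, then restore only on the happy path."""
    saved = list(other_edges)  # keep the original rows
    other_edges[:] = [[s + offset, d + offset] for s, d in other_edges]
    self_edges.extend(other_edges)  # concatenation step
    other_edges[:] = saved          # restore; skipped if extend raised
    return self_edges
```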
