get_device_context tensor goes stale if heap_bases change after init #467

@mawad-amd

Bug

get_device_context() builds a new torch.tensor from self.heap_bases.tolist() on every call (see #466). Once #466 is fixed by precomputing the tensor in __init__, the context tensor will hold a snapshot of heap_bases at construction time.

If heap_bases were to change after init (e.g., via refresh_peer_access() after a new shmem.allocate() or as_symmetric() call with a future allocator), the precomputed context tensor would contain stale base addresses. Kernels using DeviceContext would translate pointers using wrong bases, causing silent data corruption or hangs.

Today this is not a bug: both the torch and vmem allocators produce stable heap_bases after the first refresh_peer_access(). It will become one if an allocator ever remaps peer VA ranges.
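To make the failure mode concrete, here is a minimal CPU-only sketch of a snapshot going stale. It does not use the Iris classes. The two leading header entries, the int64 dtype, and the example addresses are assumptions chosen for illustration; only the `.tolist()` snapshot pattern comes from the description above.

```python
import torch

# Stand-in for self.heap_bases (values and dtype are illustrative assumptions).
heap_bases = torch.tensor([0x1000, 0x2000], dtype=torch.int64)

# Snapshot built once from a Python list, as a precomputed context would be.
# Assumed layout: two header entries, then one base address per rank.
snapshot = torch.tensor([0, 2] + heap_bases.tolist(), dtype=torch.int64)

# Simulate a hypothetical future allocator remapping rank 1's base.
heap_bases[1] = 0x9000

# torch.tensor() copied the data, so the snapshot still holds the old base.
stale = snapshot[3].item() != heap_bases[1].item()
print("snapshot is stale:", stale)
```

Because the snapshot is an independent copy, nothing propagates the new base into it unless some code writes it there explicitly.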

Fix

After precomputing self._device_context in __init__, add an in-place update in refresh_peer_access():

self._device_context[2:2+self.num_ranks] = self.heap_bases

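This is a one-line change that allocates nothing and is CUDAGraph safe, since the tensor keeps the same storage and only its contents change.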
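A rough sketch of how the proposed update could look in context, run on CPU so it needs no GPU. `HeapContext` is a hypothetical stand-in, not the actual class in iris/iris.py. The assumption that the two leading entries are the current rank and the rank count, the int64 dtype, and the `new_bases` parameter are all illustrative; only the slice assignment itself is taken from the fix above.

```python
import torch


class HeapContext:
    """Hypothetical stand-in for the object that owns heap_bases."""

    def __init__(self, cur_rank: int, heap_bases: torch.Tensor):
        self.num_ranks = heap_bases.numel()
        self.heap_bases = heap_bases.clone()
        # Precompute the context once. Assumed layout:
        # [cur_rank, num_ranks, base_0, ..., base_{num_ranks - 1}]
        self._device_context = torch.empty(
            2 + self.num_ranks, dtype=torch.int64, device=heap_bases.device
        )
        self._device_context[0] = cur_rank
        self._device_context[1] = self.num_ranks
        self._device_context[2:2 + self.num_ranks] = self.heap_bases

    def refresh_peer_access(self, new_bases: torch.Tensor) -> None:
        # new_bases stands in for whatever the real refresh would compute.
        self.heap_bases = new_bases.clone()
        # The proposed fix: overwrite the base slots in place.
        self._device_context[2:2 + self.num_ranks] = self.heap_bases

    def get_device_context(self) -> torch.Tensor:
        # Return the precomputed tensor instead of building a new one.
        return self._device_context


ctx = HeapContext(0, torch.tensor([0x1000, 0x2000], dtype=torch.int64))
tensor_before = ctx.get_device_context()
ptr_before = tensor_before.data_ptr()

ctx.refresh_peer_access(torch.tensor([0x3000, 0x4000], dtype=torch.int64))
same_storage = ctx.get_device_context().data_ptr() == ptr_before
print("same storage after refresh:", same_storage)
```

The slice assignment writes into the existing storage, so a reference to the tensor obtained before the refresh observes the new bases at the same address.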

Component

iris/iris.py, iris/symmetric_heap.py

Labels

bug, iris
