Skip to content

Add MetaInstance declarative layer (rebased)#1519

Open
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/meta-instance
Open

Add MetaInstance declarative layer (rebased)#1519
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/meta-instance

Conversation

@AlexCheema
Copy link
Contributor

@AlexCheema AlexCheema commented Feb 17, 2026

Summary

  • Adds a MetaInstance declarative layer for managing model instances with automatic placement, retry logic, and lifecycle management
  • Includes dashboard UI for creating/managing MetaInstances with node selection, sharding config, and error feedback
  • Adds JACCL SideChannel relay via named pipes for distributed inference coordination
  • Implements task cancellation on client disconnect
  • Retry logic with max 3 attempts, error collection, and persistent failure display

Rebased onto current main with independent changes extracted into separate PRs:

Test plan

  • basedpyright passes (0 errors)
  • ruff check passes
  • nix fmt applied (0 changes)
  • pytest: 291 passed, 1 skipped (1 pre-existing Rust bindings failure)

🤖 Generated with Claude Code

@AlexCheema
Copy link
Contributor Author

Code Review -- PR #1519: Add MetaInstance declarative layer (rebased)

CI Status: All 6 checks passed (aarch64-darwin, x86_64-linux, aarch64-linux -- both pipeline runs).

Overview

Large PR (~3,400 additions, 32 files) introducing a declarative MetaInstance abstraction: users specify model constraints (model ID, sharding, instance type, min nodes, node preferences) and the system automatically places, monitors, retries, and recreates instances. Also adds JACCL SideChannel relay via named pipes for distributed inference coordination, a ProcessManager protocol for reconciliation, dashboard UI for MetaInstance lifecycle, and removes MlxIbv throughout.

This is a rebase of #1447 (previously merged then reverted).


Critical Issues

1. Duplicate TaskCancelled in Command union type (commands.py)

    | CreateInstance
    | DeleteInstance
    | TaskCancelled        # original
    | CreateMetaInstance
    | DeleteMetaInstance
    | TaskCancelled        # duplicate!
    | TaskFinished

The new variants were inserted above TaskFinished but TaskCancelled was not removed from its original position. Pydantic discriminated unions may behave unpredictably with duplicates. The second TaskCancelled must be removed.

2. Duplicate _cancel_sender operations in shutdown() (runner_supervisor.py)

def shutdown(self):
    self._ev_recv.close()
    self._task_sender.close()
    self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))   # NEW
    self._cancel_sender.close()                                # NEW
    self._event_sender.close()
    self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))   # OLD -- will raise ClosedResourceError
    self._cancel_sender.close()                                # OLD -- closes already-closed channel
    self._close_pipe_fds()

The second send() after close() will raise ClosedResourceError at shutdown. The old lines must be removed.


Significant Issues

3. Dashboard tautological condition (+page.svelte line ~557)

const matchesSelectedRuntime = (runtime: InstanceMeta): boolean =>
    selectedInstanceType === "MlxRing"
      ? runtime === "MlxRing"
      : runtime === "MlxJaccl" || runtime === "MlxJaccl";

After removing MlxIbv, both operands of || are identical. Should be just runtime === "MlxJaccl". Not a behavioral bug (result is the same), but a clear copy-paste artifact that will confuse future readers.

4. Dead field _child_pipe_fds never assigned (runner_supervisor.py)

The field _child_pipe_fds: tuple[int, int] | None = None is declared and cleaned up in _close_pipe_fds(), but never assigned a value anywhere in the codebase. It will always be None, making the cleanup code dead. Either the assignment was accidentally omitted from FIFO setup, or the field should be removed.

5. _jaccl_pending not cleaned on instance deletion (master/main.py)

When an instance is deleted (whether via cascade delete, network failure, or retry limit exceeded), accumulated _jaccl_pending[instance_id] data is never cleaned up. The cleanup only happens on successful gather completion. If a runner crashes mid-sequence and the instance is deleted, pending data for that instance leaks. Over many create/delete cycles this grows unbounded.

Suggested fix in _apply_and_broadcast or _event_processor:

if isinstance(event, InstanceDeleted):
    self._jaccl_pending.pop(event.instance_id, None)

6. No timeout on FIFO open (runner_supervisor.py)

If the child process crashes before opening its end of the named pipes, the parent's os.open() calls block forever:

async with anyio.create_task_group() as open_tg:
    open_tg.start_soon(open_read)
    open_tg.start_soon(open_write)

Wrapping in anyio.fail_after() would prevent a permanent hang.

7. No timeout on pipe relay waiter (runner_supervisor.py)

After sending JacclSideChannelData, the relay waits indefinitely for JacclSideChannelGathered:

await waiter.wait()  # no timeout

If the master crashes or the event is never delivered, this coroutine hangs forever. A timeout with retry or error handling would improve resilience.

8. Placement strategy changed from smallest to largest cycles (placement.py, placement_utils.py)

get_smallest_cycles is renamed to get_largest_cycles and the logic inverted. This changes placement behavior for ALL instances, not just MetaInstance-backed ones. Previously the system preferred minimal resource usage per model; now it prefers maximum parallelism. A min_nodes=1 request could now be spread across all nodes if a full-cluster cycle exists. The PR description does not call this out. Is this intentional for non-MetaInstance placements?


Minor Issues

9. mp.set_start_method("spawn", force=True) (main.py)

Changed from force=False to force=True. This silently overrides any previously-set start method instead of raising RuntimeError. The justification is not documented. If a library has already set a different start method intentionally, this masks it.

10. RDMA validation now strict (placement.py)

MlxJaccl without RDMA cycles now raises ValueError("Requested RDMA (MlxJaccl) but no RDMA-connected cycles available") instead of silently falling back to non-RDMA cycles. More correct but a behavioral change for callers that relied on the fallback.

11. Reconcile loop 10x faster (master/main.py)

_reconcile() sleeps 1s (was 10s in _plan()). MetaInstanceReconciler calls ModelCard.load() per unsatisfied MetaInstance per cycle, though cached. Worth monitoring in clusters with many unsatisfied MetaInstances. Consider exponential backoff per MetaInstance after repeated placement failures.

12. _update_meta_instance uses **fields: object (apply.py)

No type checking on field names or values. A typo in a field name silently passes until runtime model_copy call. Consider explicit parameters or TypedDict.

13. Unrelated fix bundled in (download/coordinator.py)

The downloaded_bytes >= total_bytes > 0 check for emitting DownloadCompleted is a separate bug fix. Not blocking, but overlaps with #1512's scope.


What's Good

  • ProcessManager protocol: Clean async def reconcile(self, state: State) -> Sequence[Event] interface. State is frozen/read-only, events are pure return values. Composable and independently testable. The extraction of NodeTimeoutReconciler, InstanceHealthReconciler, and MetaInstanceReconciler from the monolithic _plan() loop is a significant architectural improvement.

  • _apply_and_broadcast() consolidation: Replaces the hacky _loopback_processor with a clean single method. State is updated synchronously before the await, preventing interleaving under cooperative scheduling.

  • Event sourcing correctness: Pure apply() functions with guard clauses for missing references. The mutual exclusion between InstanceRetrying (removes runners, increments failures) and InstanceDeleted (removes instance, increments failures) paths is correctly maintained. apply_instance_retrying returns early when the instance is already deleted to avoid double-counting.

  • Comprehensive test coverage: ~1,520 lines across two new test files covering MetaInstance lifecycle, retry logic with boundary conditions, connection health checks (Ring and JACCL topology, IP-level matching), runner failure detection, cascade delete with task cancellation, and concurrent MetaInstances.

  • Dashboard improvements: Per-node status display during loading/warming, MetaInstance creation UI with node selection, JACCL error banner with tooltip, retry status display. The unifiedDisplayItems approach that merges MetaInstances and orphan instances into a single list is clean.

  • API validation: place_instance now validates placement before queuing the command, returning 400 immediately instead of silently dropping in the master.

  • Terminal peer guard in plan.py: _create_runner checks that no peer runners are in terminal state before creating, preventing thundering herd during failures. Workers wait for master's InstanceRetrying signal.


Verdict

Well-architected declarative layer that replaces ad-hoc instance management with a proper reconciliation pattern. The core design -- ProcessManager protocol, event sourcing for MetaInstance state, health reconciliation -- is sound and extensible. The JACCL SideChannel relay is a significant new capability.

Must fix before merge:

  1. Remove duplicate TaskCancelled from Command union
  2. Remove duplicate _cancel_sender calls in shutdown()

Strongly recommend:
3. Fix tautological MlxJaccl || MlxJaccl in dashboard
4. Remove or properly assign _child_pipe_fds
5. Clean up _jaccl_pending on instance deletion
6. Add timeout to FIFO open

Clarify:
7. Is the smallest-to-largest cycle placement change intentional for all placements?
8. Why is force=True needed for mp.set_start_method?

Review only -- not a merge approval.

@AlexCheema
Copy link
Contributor Author

Bug: Worker crash on meta-instance delete

Tested on: alexcheema/meta-instance @ 17cb9e54
Cluster: 4× Mac Mini M4 Pro, EXO_LIBP2P_NAMESPACE=meta-instance-test

What happened

Create and inference work fine. Deleting a meta-instance crashes worker nodes.

Steps to reproduce

# Create
curl -X POST http://mac-mini-1:52415/meta_instance \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/Llama-3.2-3B-Instruct-4bit", "sharding": "Pipeline", "min_nodes": 1}'

# Delete
curl -X DELETE http://mac-mini-1:52415/meta_instance/<meta_instance_id>

Crash 1 — anyio.ClosedResourceError (worker/main.py:249)

ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  File "src/exo/worker/main.py", line 249, in plan_step
      runner.shutdown()
      ...
      self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))
  anyio.ClosedResourceError

runner.shutdown() sends to _cancel_sender, but the channel is already closed by the time plan_step runs during delete. The unhandled exception propagates up through the anyio TaskGroup and tears down the worker.

Crash 2 — Rust/PyO3 destructor panic

Immediately follows crash 1 on the affected nodes:

thread 'tokio-runtime-worker' panicked at interpreter_lifecycle.rs:117:
assertion `left != right` failed: The Python interpreter is not initialized
  and the `auto-initialize` feature is not enabled.

thread caused non-unwinding panic. aborting.
panic in a destructor during cleanup

The Rust event loop tries to call back into Python during cleanup, but the Python interpreter is already gone. Results in a hard abort.

Impact

  • 2 of 4 nodes (minis 3 and 4) fully crashed (process aborted)
  • 2 of 4 nodes (minis 1 and 2) survived with degraded state
  • Cluster does not self-recover

Suggested fix

Guard the runner.shutdown() call in plan_step against a closed channel:

# src/exo/worker/main.py ~line 249
try:
    runner.shutdown()
except ClosedResourceError:
    pass  # Runner already shut down

The Rust destructor panic may be a downstream consequence of the Python crash — worth verifying if it goes away once the ClosedResourceError is handled.

What works ✅

  • 4-node cluster formation and election
  • POST /meta_instance — creates meta-instance, auto-spawns child instance with correct shard assignments
  • GET /meta_instances — lists correctly
  • Inference via /v1/chat/completions through the meta-instance

@AlexCheema
Copy link
Contributor Author

Fix: DELETE /meta_instance crash (ClosedResourceError)

Commit: 5b487171

Root Cause

RunnerSupervisor.shutdown() had duplicate _cancel_sender.send() + _cancel_sender.close() calls (lines 188-189 copied again at 191-192). The second pair always hit an already-closed channel, throwing anyio.ClosedResourceError. This was unhandled inside the worker's TaskGroup, tearing it down and triggering a Rust/PyO3 destructor panic on affected nodes.

Changes (2 files, +11/-7)

src/exo/worker/runner/runner_supervisor.py

  • Removed duplicate _cancel_sender.send() / _cancel_sender.close() block
  • Wrapped remaining send/close in try/except ClosedResourceError

src/exo/worker/main.py

  • Added contextlib.suppress(ClosedResourceError) around runner.shutdown() in the plan_step Shutdown handler (line 249 crash site from the report)
  • Added same guard in the Worker.run() finally block (line 120) for the shutdown-all-runners loop

Verification

  • basedpyright: 0 errors
  • ruff: clean
  • pytest: 261 passed, 1 skipped

@Evanev7
Copy link
Member

Evanev7 commented Feb 19, 2026

This MLX_JACCL_PIPE doesn't seem to exist - and shouldn't be a part of this PR anyway.

@AlexCheema
Copy link
Contributor Author

This MLX_JACCL_PIPE doesn't seem to exist - and shouldn't be a part of this PR anyway.

Yeah that is in my MLX fork. I agree it shouldn't be part of this PR. I'm currently working on splitting this up into 5 separate PRs as it has got too big and there are a number of separate features here.

AlexCheema added a commit that referenced this pull request Feb 19, 2026
Implement named-pipe (FIFO) based relay for JACCL all_gather operations
across the exo control plane, enabling distributed tensor operations
between MlxJaccl runner instances.

Components:
- Base64Bytes type + JacclSideChannelData/Gathered event types
- RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads
  local data from runner, emits events, waits for gathered result, and
  writes ordered data back
- Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT
  env vars for C++ SideChannel
- Worker: routes JacclSideChannelGathered events to RunnerSupervisors
- Master: _handle_jaccl_side_channel() accumulates per-runner data and
  emits gathered event when all runners for an instance have contributed
- mx_any() docstring explaining all_sum for GPU deadlock prevention

Extracted from meta-instance branch (#1519) — PR 4 of 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexCheema added a commit that referenced this pull request Feb 19, 2026
Three independent fixes extracted from meta-instance branch (#1519):

- Use force=True for mp.set_start_method("spawn") to prevent errors
  when the start method was already set by another initialization path
- Detect already-complete downloads on restart instead of reporting them
  as DownloadPending (checks downloaded_bytes >= total_bytes)
- Guard runner.shutdown() with contextlib.suppress(ClosedResourceError)
  to handle already-closed resources during worker teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexCheema added a commit that referenced this pull request Feb 21, 2026
Adds a declarative MetaInstance system for managing model instances with
automatic placement, retry logic (max 3 attempts), and lifecycle
management via a reconciliation loop.

- Process managers for instance health, meta-instance lifecycle, and
  node timeout detection
- Reconciliation engine driving state transitions and cascading deletes
- Dashboard UI for creating/managing MetaInstances with node selection,
  sharding config, retry status, and error feedback
- JACCL SideChannel integration for distributed inference coordination
- Comprehensive test suite (25+ edge cases)

Split from original #1519. Independent bug fixes extracted to:
  #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download
  detection), #1580 (RDMA warning)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from 5b48717 to 45bdae1 Compare February 21, 2026 19:54
AlexCheema added a commit that referenced this pull request Feb 21, 2026
Adds a declarative MetaInstance system for managing model instances with
automatic placement, retry logic (max 3 attempts), and lifecycle
management via a reconciliation loop.

- Process managers for instance health, meta-instance lifecycle, and
  node timeout detection
- Reconciliation engine driving state transitions and cascading deletes
- Dashboard UI for creating/managing MetaInstances with node selection,
  sharding config, retry status, and error feedback
- JACCL SideChannel integration for distributed inference coordination
- Comprehensive test suite (25+ edge cases)

Split from original #1519. Independent bug fixes extracted to:
  #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download
  detection), #1580 (RDMA warning)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from 45bdae1 to f8304c4 Compare February 21, 2026 20:12
AlexCheema added a commit that referenced this pull request Feb 21, 2026
Adds a declarative MetaInstance system for managing model instances with
automatic placement, retry logic (max 3 attempts), and lifecycle
management via a reconciliation loop.

- Process managers for instance health, meta-instance lifecycle, and
  node timeout detection
- Reconciliation engine driving state transitions and cascading deletes
- Dashboard UI for creating/managing MetaInstances with node selection,
  sharding config, retry status, and error feedback
- JACCL SideChannel integration for distributed inference coordination
- Comprehensive test suite (25+ edge cases)

Split from original #1519. Independent bug fixes extracted to:
  #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download
  detection), #1580 (RDMA warning)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from f8304c4 to 8819ae4 Compare February 21, 2026 20:38
AlexCheema added a commit that referenced this pull request Feb 21, 2026
Three independent fixes extracted from meta-instance branch (#1519):

- Use force=True for mp.set_start_method("spawn") to prevent errors
  when the start method was already set by another initialization path
- Detect already-complete downloads on restart instead of reporting them
  as DownloadPending (checks downloaded_bytes >= total_bytes)
- Guard runner.shutdown() with contextlib.suppress(ClosedResourceError)
  to handle already-closed resources during worker teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexCheema added a commit that referenced this pull request Feb 21, 2026
Implement named-pipe (FIFO) based relay for JACCL all_gather operations
across the exo control plane, enabling distributed tensor operations
between MlxJaccl runner instances.

Components:
- Base64Bytes type + JacclSideChannelData/Gathered event types
- RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads
  local data from runner, emits events, waits for gathered result, and
  writes ordered data back
- Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT
  env vars for C++ SideChannel
- Worker: routes JacclSideChannelGathered events to RunnerSupervisors
- Master: _handle_jaccl_side_channel() accumulates per-runner data and
  emits gathered event when all runners for an instance have contributed
- mx_any() docstring explaining all_sum for GPU deadlock prevention

Extracted from meta-instance branch (#1519) — PR 4 of 5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a declarative MetaInstance system for managing model instances with
automatic placement, retry logic (max 3 attempts), and lifecycle
management via a reconciliation loop.

- Process managers for instance health, meta-instance lifecycle, and
  node timeout detection
- Reconciliation engine driving state transitions and cascading deletes
- Dashboard UI for creating/managing MetaInstances with node selection,
  sharding config, retry status, and error feedback
- JACCL SideChannel integration for distributed inference coordination
- Comprehensive test suite (25+ edge cases)

Split from original #1519. Independent bug fixes extracted to:
  #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download
  detection), #1580 (RDMA warning)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema AlexCheema force-pushed the alexcheema/meta-instance branch from 8819ae4 to 928c41b Compare February 21, 2026 21:15
AlexCheema added a commit that referenced this pull request Feb 22, 2026
## Summary
- Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard
components
- MlxIbv was replaced by MlxJaccl; these are leftover type checks
- Split from #1519 for independent review

## Test plan
- [x] Visual inspection of dashboard components

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
AlexCheema added a commit that referenced this pull request Feb 23, 2026
## Summary
- When MlxJaccl (RDMA) placement is requested but no RDMA-connected
cycles exist, raise a clear ValueError instead of silently falling back
to non-RDMA cycles
- Split from #1519 for independent review

## Test plan
- [x] basedpyright — 0 errors
- [x] ruff check — passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Evanev7 pushed a commit that referenced this pull request Feb 23, 2026
Three independent fixes extracted from meta-instance branch (#1519):

- Use force=True for mp.set_start_method("spawn") to prevent errors
  when the start method was already set by another initialization path
- Detect already-complete downloads on restart instead of reporting them
  as DownloadPending (checks downloaded_bytes >= total_bytes)
- Guard runner.shutdown() with contextlib.suppress(ClosedResourceError)
  to handle already-closed resources during worker teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evanev7 pushed a commit that referenced this pull request Feb 23, 2026
Three independent fixes extracted from meta-instance branch (#1519):

- Use force=True for mp.set_start_method("spawn") to prevent errors
  when the start method was already set by another initialization path
- Detect already-complete downloads on restart instead of reporting them
  as DownloadPending (checks downloaded_bytes >= total_bytes)
- Guard runner.shutdown() with contextlib.suppress(ClosedResourceError)
  to handle already-closed resources during worker teardown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
adurham pushed a commit to adurham/exo that referenced this pull request Feb 27, 2026
- Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard
components
- MlxIbv was replaced by MlxJaccl; these are leftover type checks
- Split from exo-explore#1519 for independent review

- [x] Visual inspection of dashboard components

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
adurham pushed a commit to adurham/exo that referenced this pull request Feb 27, 2026
…lore#1585)

## Summary
- When MlxJaccl (RDMA) placement is requested but no RDMA-connected
cycles exist, raise a clear ValueError instead of silently falling back
to non-RDMA cycles
- Split from exo-explore#1519 for independent review

## Test plan
- [x] basedpyright — 0 errors
- [x] ruff check — passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants