Add MetaInstance declarative layer (rebased) by AlexCheema · Pull Request #1519 · exo-explore/exo

AlexCheema · 2026-02-17T22:03:38Z

Summary

Adds a MetaInstance declarative layer for managing model instances with automatic placement, retry logic, and lifecycle management
Includes dashboard UI for creating/managing MetaInstances with node selection, sharding config, and error feedback
Adds JACCL SideChannel relay via named pipes for distributed inference coordination
Implements task cancellation on client disconnect
Retry logic with max 3 attempts, error collection, and persistent failure display

Rebased onto current main with independent changes extracted into separate PRs:

fix: suppress closure errors in runnersupervisor and force spawn start method #1547 — misc bug fixes (spawn force, shutdown guard)
refactor: extract JACCL SideChannel pipe relay #1546 — JACCL SideChannel pipe relay extraction
fix: change RDMA AVAILABLE to RDMA NOT ENABLED warning #1580 — RDMA warning wording (merged)
fix: detect completed downloads by checking final file exists #1582 — download completion detection fix
chore: remove deprecated MlxIbv dashboard references #1584 — MlxIbv dashboard cleanup
fix: raise error when MlxJaccl requested without RDMA cycles #1585 — RDMA placement error check

Test plan

basedpyright passes (0 errors)
ruff check passes
nix fmt applied (0 changes)
pytest: 291 passed, 1 skipped (1 pre-existing Rust bindings failure)

🤖 Generated with Claude Code

AlexCheema · 2026-02-17T23:36:14Z

Code Review -- PR #1519: Add MetaInstance declarative layer (rebased)

CI Status: All 6 checks passed (aarch64-darwin, x86_64-linux, aarch64-linux -- both pipeline runs).

Overview

Large PR (~3,400 additions, 32 files) introducing a declarative MetaInstance abstraction: users specify model constraints (model ID, sharding, instance type, min nodes, node preferences) and the system automatically places, monitors, retries, and recreates instances. Also adds JACCL SideChannel relay via named pipes for distributed inference coordination, a ProcessManager protocol for reconciliation, dashboard UI for MetaInstance lifecycle, and removes MlxIbv throughout.

This is a rebase of #1447 (previously merged then reverted).

Critical Issues

1. Duplicate TaskCancelled in Command union type (commands.py)

    | CreateInstance
    | DeleteInstance
    | TaskCancelled        # original
    | CreateMetaInstance
    | DeleteMetaInstance
    | TaskCancelled        # duplicate!
    | TaskFinished

The new variants were inserted above TaskFinished but TaskCancelled was not removed from its original position. Pydantic discriminated unions may behave unpredictably with duplicates. The second TaskCancelled must be removed.

2. Duplicate _cancel_sender operations in shutdown() (runner_supervisor.py)

def shutdown(self):
    self._ev_recv.close()
    self._task_sender.close()
    self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))   # NEW
    self._cancel_sender.close()                                # NEW
    self._event_sender.close()
    self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))   # OLD -- will raise ClosedResourceError
    self._cancel_sender.close()                                # OLD -- closes already-closed channel
    self._close_pipe_fds()

The second send() after close() will raise ClosedResourceError at shutdown. The old lines must be removed.

Significant Issues

3. Dashboard tautological condition (+page.svelte line ~557)

const matchesSelectedRuntime = (runtime: InstanceMeta): boolean =>
    selectedInstanceType === "MlxRing"
      ? runtime === "MlxRing"
      : runtime === "MlxJaccl" || runtime === "MlxJaccl";

After removing MlxIbv, both operands of || are identical. Should be just runtime === "MlxJaccl". Not a behavioral bug (result is the same), but a clear copy-paste artifact that will confuse future readers.

4. Dead field _child_pipe_fds never assigned (runner_supervisor.py)

The field _child_pipe_fds: tuple[int, int] | None = None is declared and cleaned up in _close_pipe_fds(), but never assigned a value anywhere in the codebase. It will always be None, making the cleanup code dead. Either the assignment was accidentally omitted from FIFO setup, or the field should be removed.

5. _jaccl_pending not cleaned on instance deletion (master/main.py)

When an instance is deleted (whether via cascade delete, network failure, or retry limit exceeded), accumulated _jaccl_pending[instance_id] data is never cleaned up. The cleanup only happens on successful gather completion. If a runner crashes mid-sequence and the instance is deleted, pending data for that instance leaks. Over many create/delete cycles this grows unbounded.

Suggested fix in _apply_and_broadcast or _event_processor:

if isinstance(event, InstanceDeleted):
    self._jaccl_pending.pop(event.instance_id, None)

6. No timeout on FIFO open (runner_supervisor.py)

If the child process crashes before opening its end of the named pipes, the parent's os.open() calls block forever:

async with anyio.create_task_group() as open_tg:
    open_tg.start_soon(open_read)
    open_tg.start_soon(open_write)

Wrapping in anyio.fail_after() would prevent a permanent hang.

7. No timeout on pipe relay waiter (runner_supervisor.py)

After sending JacclSideChannelData, the relay waits indefinitely for JacclSideChannelGathered:

await waiter.wait()  # no timeout

If the master crashes or the event is never delivered, this coroutine hangs forever. A timeout with retry or error handling would improve resilience.

8. Placement strategy changed from smallest to largest cycles (placement.py, placement_utils.py)

get_smallest_cycles is renamed to get_largest_cycles and the logic inverted. This changes placement behavior for ALL instances, not just MetaInstance-backed ones. Previously the system preferred minimal resource usage per model; now it prefers maximum parallelism. A min_nodes=1 request could now be spread across all nodes if a full-cluster cycle exists. The PR description does not call this out. Is this intentional for non-MetaInstance placements?

Minor Issues

9. mp.set_start_method("spawn", force=True) (main.py)

Changed from force=False to force=True. This silently overrides any previously-set start method instead of raising RuntimeError. The justification is not documented. If a library has already set a different start method intentionally, this masks it.

10. RDMA validation now strict (placement.py)

MlxJaccl without RDMA cycles now raises ValueError("Requested RDMA (MlxJaccl) but no RDMA-connected cycles available") instead of silently falling back to non-RDMA cycles. More correct but a behavioral change for callers that relied on the fallback.

11. Reconcile loop 10x faster (master/main.py)

_reconcile() sleeps 1s (was 10s in _plan()). MetaInstanceReconciler calls ModelCard.load() per unsatisfied MetaInstance per cycle, though cached. Worth monitoring in clusters with many unsatisfied MetaInstances. Consider exponential backoff per MetaInstance after repeated placement failures.

12. _update_meta_instance uses **fields: object (apply.py)

No type checking on field names or values. A typo in a field name silently passes until runtime model_copy call. Consider explicit parameters or TypedDict.

13. Unrelated fix bundled in (download/coordinator.py)

The downloaded_bytes >= total_bytes > 0 check for emitting DownloadCompleted is a separate bug fix. Not blocking, but overlaps with #1512's scope.

What's Good

ProcessManager protocol: Clean async def reconcile(self, state: State) -> Sequence[Event] interface. State is frozen/read-only, events are pure return values. Composable and independently testable. The extraction of NodeTimeoutReconciler, InstanceHealthReconciler, and MetaInstanceReconciler from the monolithic _plan() loop is a significant architectural improvement.
_apply_and_broadcast() consolidation: Replaces the hacky _loopback_processor with a clean single method. State is updated synchronously before the await, preventing interleaving under cooperative scheduling.
Event sourcing correctness: Pure apply() functions with guard clauses for missing references. The mutual exclusion between InstanceRetrying (removes runners, increments failures) and InstanceDeleted (removes instance, increments failures) paths is correctly maintained. apply_instance_retrying returns early when the instance is already deleted to avoid double-counting.
Comprehensive test coverage: ~1,520 lines across two new test files covering MetaInstance lifecycle, retry logic with boundary conditions, connection health checks (Ring and JACCL topology, IP-level matching), runner failure detection, cascade delete with task cancellation, and concurrent MetaInstances.
Dashboard improvements: Per-node status display during loading/warming, MetaInstance creation UI with node selection, JACCL error banner with tooltip, retry status display. The unifiedDisplayItems approach that merges MetaInstances and orphan instances into a single list is clean.
API validation: place_instance now validates placement before queuing the command, returning 400 immediately instead of silently dropping in the master.
Terminal peer guard in plan.py: _create_runner checks that no peer runners are in terminal state before creating, preventing thundering herd during failures. Workers wait for master's InstanceRetrying signal.

Verdict

Well-architected declarative layer that replaces ad-hoc instance management with a proper reconciliation pattern. The core design -- ProcessManager protocol, event sourcing for MetaInstance state, health reconciliation -- is sound and extensible. The JACCL SideChannel relay is a significant new capability.

Must fix before merge:

Remove duplicate TaskCancelled from Command union
Remove duplicate _cancel_sender calls in shutdown()

Strongly recommend:
3. Fix tautological MlxJaccl || MlxJaccl in dashboard
4. Remove or properly assign _child_pipe_fds
5. Clean up _jaccl_pending on instance deletion
6. Add timeout to FIFO open

Clarify:
7. Is the smallest-to-largest cycle placement change intentional for all placements?
8. Why is force=True needed for mp.set_start_method?

Review only -- not a merge approval.

AlexCheema · 2026-02-18T20:29:37Z

Bug: Worker crash on meta-instance delete

Tested on: alexcheema/meta-instance @ 17cb9e54
Cluster: 4× Mac Mini M4 Pro, EXO_LIBP2P_NAMESPACE=meta-instance-test

What happened

Create and inference work fine. Deleting a meta-instance crashes worker nodes.

Steps to reproduce

# Create
curl -X POST http://mac-mini-1:52415/meta_instance \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/Llama-3.2-3B-Instruct-4bit", "sharding": "Pipeline", "min_nodes": 1}'

# Delete
curl -X DELETE http://mac-mini-1:52415/meta_instance/<meta_instance_id>

Crash 1 — `anyio.ClosedResourceError` (worker/main.py:249)

ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  File "src/exo/worker/main.py", line 249, in plan_step
      runner.shutdown()
      ...
      self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK"))
  anyio.ClosedResourceError

runner.shutdown() sends to _cancel_sender, but the channel is already closed by the time plan_step runs during delete. The unhandled exception propagates up through the anyio TaskGroup and tears down the worker.

Crash 2 — Rust/PyO3 destructor panic

Immediately follows crash 1 on the affected nodes:

thread 'tokio-runtime-worker' panicked at interpreter_lifecycle.rs:117:
assertion `left != right` failed: The Python interpreter is not initialized
  and the `auto-initialize` feature is not enabled.

thread caused non-unwinding panic. aborting.
panic in a destructor during cleanup

The Rust event loop tries to call back into Python during cleanup, but the Python interpreter is already gone. Results in a hard abort.

Impact

2 of 4 nodes (minis 3 and 4) fully crashed (process aborted)
2 of 4 nodes (minis 1 and 2) survived with degraded state
Cluster does not self-recover

Suggested fix

Guard the runner.shutdown() call in plan_step against a closed channel:

# src/exo/worker/main.py ~line 249
try:
    runner.shutdown()
except ClosedResourceError:
    pass  # Runner already shut down

The Rust destructor panic may be a downstream consequence of the Python crash — worth verifying if it goes away once the ClosedResourceError is handled.

What works ✅

4-node cluster formation and election
POST /meta_instance — creates meta-instance, auto-spawns child instance with correct shard assignments
GET /meta_instances — lists correctly
Inference via /v1/chat/completions through the meta-instance

AlexCheema · 2026-02-18T20:51:50Z

Fix: DELETE /meta_instance crash (ClosedResourceError)

Commit: 5b487171

Root Cause

RunnerSupervisor.shutdown() had duplicate _cancel_sender.send() + _cancel_sender.close() calls (lines 188-189 copied again at 191-192). The second pair always hit an already-closed channel, throwing anyio.ClosedResourceError. This was unhandled inside the worker's TaskGroup, tearing it down and triggering a Rust/PyO3 destructor panic on affected nodes.

Changes (2 files, +11/-7)

src/exo/worker/runner/runner_supervisor.py

Removed duplicate _cancel_sender.send() / _cancel_sender.close() block
Wrapped remaining send/close in try/except ClosedResourceError

src/exo/worker/main.py

Added contextlib.suppress(ClosedResourceError) around runner.shutdown() in the plan_step Shutdown handler (line 249 crash site from the report)
Added same guard in the Worker.run() finally block (line 120) for the shutdown-all-runners loop

Verification

basedpyright: 0 errors
ruff: clean
pytest: 261 passed, 1 skipped

Evanev7 · 2026-02-19T13:28:23Z

This MLX_JACCL_PIPE doesn't seem to exist - and shouldn't be a part of this PR anyway.

AlexCheema · 2026-02-19T13:29:58Z

This MLX_JACCL_PIPE doesn't seem to exist - and shouldn't be a part of this PR anyway.

Yeah that is in my MLX fork. I agree it shouldn't be part of this PR. I'm currently working on splitting this up into 5 separate PRs as it has got too big and there are a number of separate features here.

Implement named-pipe (FIFO) based relay for JACCL all_gather operations across the exo control plane, enabling distributed tensor operations between MlxJaccl runner instances. Components: - Base64Bytes type + JacclSideChannelData/Gathered event types - RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads local data from runner, emits events, waits for gathered result, and writes ordered data back - Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT env vars for C++ SideChannel - Worker: routes JacclSideChannelGathered events to RunnerSupervisors - Master: _handle_jaccl_side_channel() accumulates per-runner data and emits gathered event when all runners for an instance have contributed - mx_any() docstring explaining all_sum for GPU deadlock prevention Extracted from meta-instance branch (#1519) — PR 4 of 5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement named-pipe (FIFO) based relay for JACCL all_gather operations across the exo control plane, enabling distributed tensor operations between MlxJaccl runner instances. Components: - Base64Bytes type + JacclSideChannelData/Gathered event types - RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads local data from runner, emits events, waits for gathered result, and writes ordered data back - Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT env vars for C++ SideChannel - Worker: routes JacclSideChannelGathered events to RunnerSupervisors - Master: _handle_jaccl_side_channel() accumulates per-runner data and emits gathered event when all runners for an instance have contributed - mx_any() docstring explaining all_sum for GPU deadlock prevention Extracted from meta-instance branch (#1519) — PR 4 of 5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

## Summary - Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard components - MlxIbv was replaced by MlxJaccl; these are leftover type checks - Split from #1519 for independent review ## Test plan - [x] Visual inspection of dashboard components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## Summary - When MlxJaccl (RDMA) placement is requested but no RDMA-connected cycles exist, raise a clear ValueError instead of silently falling back to non-RDMA cycles - Split from #1519 for independent review ## Test plan - [x] basedpyright — 0 errors - [x] ruff check — passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard components - MlxIbv was replaced by MlxJaccl; these are leftover type checks - Split from exo-explore#1519 for independent review - [x] Visual inspection of dashboard components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…lore#1585) ## Summary - When MlxJaccl (RDMA) placement is requested but no RDMA-connected cycles exist, raise a clear ValueError instead of silently falling back to non-RDMA cycles - Split from exo-explore#1519 for independent review ## Test plan - [x] basedpyright — 0 errors - [x] ruff check — passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

AlexCheema force-pushed the alexcheema/meta-instance branch from 7beff8b to 17cb9e5 Compare February 17, 2026 22:09

AlexCheema mentioned this pull request Feb 17, 2026

feat: disable RDMA option when no Thunderbolt 5 hardware #1505

Open

6 tasks

AlexCheema mentioned this pull request Feb 19, 2026

fix: suppress closure errors in runnersupervisor and force spawn start method #1547

Merged

AlexCheema force-pushed the alexcheema/meta-instance branch from 5b48717 to 45bdae1 Compare February 21, 2026 19:54

This was referenced Feb 21, 2026

chore: remove deprecated MlxIbv dashboard references #1584

Merged

fix: raise error when MlxJaccl requested without RDMA cycles #1585

Merged

AlexCheema force-pushed the alexcheema/meta-instance branch from 45bdae1 to f8304c4 Compare February 21, 2026 20:12

AlexCheema force-pushed the alexcheema/meta-instance branch from f8304c4 to 8819ae4 Compare February 21, 2026 20:38

AlexCheema force-pushed the alexcheema/meta-instance branch from 8819ae4 to 928c41b Compare February 21, 2026 21:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MetaInstance declarative layer (rebased)#1519

Add MetaInstance declarative layer (rebased)#1519
AlexCheema wants to merge 1 commit intomainfrom
alexcheema/meta-instance

AlexCheema commented Feb 17, 2026 •

edited

Loading

Uh oh!

AlexCheema commented Feb 17, 2026

Uh oh!

AlexCheema commented Feb 18, 2026

Uh oh!

AlexCheema commented Feb 18, 2026

Uh oh!

Evanev7 commented Feb 19, 2026

Uh oh!

AlexCheema commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AlexCheema commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

AlexCheema commented Feb 17, 2026

Code Review -- PR #1519: Add MetaInstance declarative layer (rebased)

Overview

Critical Issues

Significant Issues

Minor Issues

What's Good

Verdict

Uh oh!

AlexCheema commented Feb 18, 2026

Bug: Worker crash on meta-instance delete

What happened

Steps to reproduce

Crash 1 — anyio.ClosedResourceError (worker/main.py:249)

Crash 2 — Rust/PyO3 destructor panic

Impact

Suggested fix

What works ✅

Uh oh!

AlexCheema commented Feb 18, 2026

Fix: DELETE /meta_instance crash (ClosedResourceError)

Root Cause

Changes (2 files, +11/-7)

Verification

Uh oh!

Evanev7 commented Feb 19, 2026

Uh oh!

AlexCheema commented Feb 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AlexCheema commented Feb 17, 2026 •

edited

Loading

Crash 1 — `anyio.ClosedResourceError` (worker/main.py:249)