Add MetaInstance declarative layer (rebased)#1519
Conversation
7beff8b to
17cb9e5
Compare
Code Review -- PR #1519: Add MetaInstance declarative layer (rebased)CI Status: All 6 checks passed (aarch64-darwin, x86_64-linux, aarch64-linux -- both pipeline runs). OverviewLarge PR (~3,400 additions, 32 files) introducing a declarative This is a rebase of #1447 (previously merged then reverted). Critical Issues1. Duplicate | CreateInstance
| DeleteInstance
| TaskCancelled # original
| CreateMetaInstance
| DeleteMetaInstance
| TaskCancelled # duplicate!
| TaskFinishedThe new variants were inserted above 2. Duplicate def shutdown(self):
self._ev_recv.close()
self._task_sender.close()
self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK")) # NEW
self._cancel_sender.close() # NEW
self._event_sender.close()
self._cancel_sender.send(TaskId("CANCEL_CURRENT_TASK")) # OLD -- will raise ClosedResourceError
self._cancel_sender.close() # OLD -- closes already-closed channel
self._close_pipe_fds()The second Significant Issues3. Dashboard tautological condition ( const matchesSelectedRuntime = (runtime: InstanceMeta): boolean =>
selectedInstanceType === "MlxRing"
? runtime === "MlxRing"
: runtime === "MlxJaccl" || runtime === "MlxJaccl";After removing 4. Dead field The field 5. When an instance is deleted (whether via cascade delete, network failure, or retry limit exceeded), accumulated Suggested fix in if isinstance(event, InstanceDeleted):
self._jaccl_pending.pop(event.instance_id, None)6. No timeout on FIFO open ( If the child process crashes before opening its end of the named pipes, the parent's async with anyio.create_task_group() as open_tg:
open_tg.start_soon(open_read)
open_tg.start_soon(open_write)Wrapping in 7. No timeout on pipe relay waiter ( After sending await waiter.wait() # no timeoutIf the master crashes or the event is never delivered, this coroutine hangs forever. A timeout with retry or error handling would improve resilience. 8. Placement strategy changed from smallest to largest cycles (
Minor Issues9. Changed from 10. RDMA validation now strict (
11. Reconcile loop 10x faster (
12. No type checking on field names or values. A typo in a field name silently passes until runtime 13. Unrelated fix bundled in ( The What's Good
VerdictWell-architected declarative layer that replaces ad-hoc instance management with a proper reconciliation pattern. The core design -- ProcessManager protocol, event sourcing for MetaInstance state, health reconciliation -- is sound and extensible. The JACCL SideChannel relay is a significant new capability. Must fix before merge:
Strongly recommend: Clarify: Review only -- not a merge approval. |
Bug: Worker crash on meta-instance deleteTested on: What happenedCreate and inference work fine. Deleting a meta-instance crashes worker nodes. Steps to reproduce# Create
curl -X POST http://mac-mini-1:52415/meta_instance \
-H "Content-Type: application/json" \
-d '{"model_id": "mlx-community/Llama-3.2-3B-Instruct-4bit", "sharding": "Pipeline", "min_nodes": 1}'
# Delete
curl -X DELETE http://mac-mini-1:52415/meta_instance/<meta_instance_id>Crash 1 —
|
Fix: DELETE /meta_instance crash (ClosedResourceError)Commit: Root Cause
Changes (2 files, +11/-7)
Verification
|
|
This MLX_JACCL_PIPE doesn't seem to exist - and shouldn't be a part of this PR anyway. |
Yeah that is in my MLX fork. I agree it shouldn't be part of this PR. I'm currently working on splitting this up into 5 separate PRs as it has got too big and there are a number of separate features here. |
Implement named-pipe (FIFO) based relay for JACCL all_gather operations across the exo control plane, enabling distributed tensor operations between MlxJaccl runner instances. Components: - Base64Bytes type + JacclSideChannelData/Gathered event types - RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads local data from runner, emits events, waits for gathered result, and writes ordered data back - Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT env vars for C++ SideChannel - Worker: routes JacclSideChannelGathered events to RunnerSupervisors - Master: _handle_jaccl_side_channel() accumulates per-runner data and emits gathered event when all runners for an instance have contributed - mx_any() docstring explaining all_sum for GPU deadlock prevention Extracted from meta-instance branch (#1519) — PR 4 of 5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5b48717 to
45bdae1
Compare
Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
45bdae1 to
f8304c4
Compare
Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f8304c4 to
8819ae4
Compare
Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement named-pipe (FIFO) based relay for JACCL all_gather operations across the exo control plane, enabling distributed tensor operations between MlxJaccl runner instances. Components: - Base64Bytes type + JacclSideChannelData/Gathered event types - RunnerSupervisor: FIFO creation, _pipe_relay() async loop that reads local data from runner, emits events, waits for gathered result, and writes ordered data back - Bootstrap: opens FIFOs in child process, sets MLX_JACCL_PIPE_IN/OUT env vars for C++ SideChannel - Worker: routes JacclSideChannelGathered events to RunnerSupervisors - Master: _handle_jaccl_side_channel() accumulates per-runner data and emits gathered event when all runners for an instance have contributed - mx_any() docstring explaining all_sum for GPU deadlock prevention Extracted from meta-instance branch (#1519) — PR 4 of 5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a declarative MetaInstance system for managing model instances with automatic placement, retry logic (max 3 attempts), and lifecycle management via a reconciliation loop. - Process managers for instance health, meta-instance lifecycle, and node timeout detection - Reconciliation engine driving state transitions and cascading deletes - Dashboard UI for creating/managing MetaInstances with node selection, sharding config, retry status, and error feedback - JACCL SideChannel integration for distributed inference coordination - Comprehensive test suite (25+ edge cases) Split from original #1519. Independent bug fixes extracted to: #1547 (misc fixes), #1546 (JACCL sidechannel), #1582 (download detection), #1580 (RDMA warning) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8819ae4 to
928c41b
Compare
## Summary - Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard components - MlxIbv was replaced by MlxJaccl; these are leftover type checks - Split from #1519 for independent review ## Test plan - [x] Visual inspection of dashboard components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary - When MlxJaccl (RDMA) placement is requested but no RDMA-connected cycles exist, raise a clear ValueError instead of silently falling back to non-RDMA cycles - Split from #1519 for independent review ## Test plan - [x] basedpyright — 0 errors - [x] ruff check — passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three independent fixes extracted from meta-instance branch (#1519): - Use force=True for mp.set_start_method("spawn") to prevent errors when the start method was already set by another initialization path - Detect already-complete downloads on restart instead of reporting them as DownloadPending (checks downloaded_bytes >= total_bytes) - Guard runner.shutdown() with contextlib.suppress(ClosedResourceError) to handle already-closed resources during worker teardown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove legacy MlxIbvInstance references from ChatSidebar and ModelCard components - MlxIbv was replaced by MlxJaccl; these are leftover type checks - Split from exo-explore#1519 for independent review - [x] Visual inspection of dashboard components 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…lore#1585) ## Summary - When MlxJaccl (RDMA) placement is requested but no RDMA-connected cycles exist, raise a clear ValueError instead of silently falling back to non-RDMA cycles - Split from exo-explore#1519 for independent review ## Test plan - [x] basedpyright — 0 errors - [x] ruff check — passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Rebased onto current
mainwith independent changes extracted into separate PRs:Test plan
🤖 Generated with Claude Code