Skip to content

feat: disable RDMA option when no Thunderbolt 5 hardware#1505

Open
AlexCheema wants to merge 1 commit intomainfrom
fix/rdma-tb5-tooltip
Open

feat: disable RDMA option when no Thunderbolt 5 hardware#1505
AlexCheema wants to merge 1 commit intomainfrom
fix/rdma-tb5-tooltip

Conversation

@AlexCheema
Copy link
Contributor

Summary

  • Adds a hasRdmaCapableNodes derived check that detects if >=2 nodes have Thunderbolt 5 hardware
  • Disables the MLX RDMA button when TB5 hardware is not available (grayed out with cursor-not-allowed)
  • Shows a hover tooltip explaining "RDMA requires Thunderbolt 5 hardware on at least 2 nodes"
  • Prevents accidental selection of RDMA instance type that would fail at placement time

Closes #1351

Test plan

  • Verify RDMA button is disabled and dimmed when cluster has no TB5 nodes
  • Verify tooltip appears on hover over disabled RDMA button
  • Verify RDMA button works normally when TB5 nodes are present
  • npm run build succeeds
  • uv run basedpyright — 0 errors
  • nix fmt — 0 files changed

🤖 Generated with Claude Code

When the cluster has fewer than 2 nodes with Thunderbolt 5 hardware,
the MLX RDMA button is now disabled with a tooltip explaining that
RDMA requires Thunderbolt 5. This prevents users from selecting an
instance type that will fail at placement time.

Closes #1351

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexCheema
Copy link
Contributor Author

PR #1505 Review: "feat: disable RDMA option when no Thunderbolt 5 hardware"

Clean, focused PR. 1 file changed, +51/-21. CI all green. Closes #1351.


Summary

Adds a hasRdmaCapableNodes derived check that detects if >=2 nodes have Thunderbolt 5 interfaces. When TB5 hardware is not available, the MLX RDMA button is disabled (grayed out, cursor-not-allowed, opacity-40) with a hover tooltip explaining the requirement.

What's Done Well

  • Small, focused scope — single responsibility, easy to review
  • Good UX: disabled state with visual feedback (opacity + cursor) plus explanatory tooltip on hover
  • Correct reactive pattern: $derived.by() recomputes automatically when tbIdentifiers changes
  • Defensive: the onclick handler has a guard (if (!hasRdmaCapableNodes) return) in addition to the disabled attribute — belt and suspenders
  • Tooltip styling is consistent with the existing dashboard aesthetic (backdrop blur, border, dark bg)

Minor Concerns

  1. Radio button visual state when RDMA was previously selected: If a user previously selected RDMA (saved to localStorage), then connects to a cluster without TB5, the radio button shows as unselected (due to selectedInstanceType === 'MlxIbv' && hasRdmaCapableNodes). However, the underlying selectedInstanceType is still "MlxIbv" in state. If the user then launches a model without clicking "MLX Ring", they'd send MlxIbv to the backend. Consider auto-switching selectedInstanceType to "MlxRing" when hasRdmaCapableNodes becomes false, e.g.:

    $effect(() => {
      if (!hasRdmaCapableNodes && selectedInstanceType === "MlxIbv") {
        selectedInstanceType = "MlxRing";
        saveLaunchDefaults();
      }
    });
  2. TB5 detection heuristic: node.interfaces.length > 0 checks if any network interface is present, but is this specifically TB5? Could a node have non-TB5 interfaces in the tbIdentifiers data? If tbIdentifiers is already pre-filtered to only TB5, then this is fine — but the variable name tbIdentifiers suggests it might include all Thunderbolt versions.

  3. Tooltip accessibility: The tooltip uses CSS opacity-0 invisiblegroup-hover:opacity-100 group-hover:visible transition, which works for mouse users. Keyboard-only users who tab to the disabled button won't see the tooltip. Consider adding focus-within alongside group-hover, or use aria-describedby pointing to the tooltip text.

Verdict

Clean and correct. The auto-fallback for stale RDMA selection (concern #1) is the only functional issue worth addressing before merge.

@AlexCheema
Copy link
Contributor Author

Code Review: PR #1505 — feat: disable RDMA option when no Thunderbolt 5 hardware

Author: AlexCheema
Status: OPEN (auto-merge enabled, squash)
Changes: +51 / -21 in dashboard/src/routes/+page.svelte

Closes #1351.

Overview

Adds a hasRdmaCapableNodes derived check that detects whether >=2 nodes have
Thunderbolt 5 hardware. When TB5 is unavailable, the MLX RDMA button is disabled
(grayed out, cursor-not-allowed) with a hover tooltip explaining the requirement.

Correctness — hasRdmaCapableNodes

PASS. The derived value:

const hasRdmaCapableNodes = $derived.by(() => {
    const ids = tbIdentifiers;
    if (!ids) return false;
    const tb5NodeCount = Object.values(ids).filter(
        (node) => node.interfaces.length > 0,
    ).length;
    return tb5NodeCount >= 2;
});

This checks:

  1. tbIdentifiers is available (from /state endpoint's nodeThunderbolt data)
  2. Counts nodes that have at least one TB5 interface (interfaces.length > 0)
  3. Requires >= 2 such nodes (RDMA needs at least 2 peers)

The existing showTb5Info derived (lines 100-113) already uses the same pattern
(checking tbIdentifiers → filtering by interfaces.length > 0 → checking count >= 2)
for the TB5 setup info banner, confirming this is the established way to detect
TB5 hardware in the dashboard.

Correctness — Button Behavior

PASS. The RDMA button changes:

  1. disabled={!hasRdmaCapableNodes} — HTML disabled attribute prevents form
    interaction

  2. onclick has early return: if (!hasRdmaCapableNodes) return; — belt-and-
    suspenders with the disabled attribute

  3. Visual styling splits on hasRdmaCapableNodes:

    • Disabled: cursor-not-allowed opacity-40, no hover effects
    • Active selection requires && hasRdmaCapableNodes — prevents showing
      the yellow selected state when disabled
    • Radio dot only shown when selectedInstanceType === "MlxIbv" && hasRdmaCapableNodes
  4. Tooltip: wrapped in {#if !hasRdmaCapableNodes} with CSS group-hover
    visibility transition. Uses z-50 to layer above other content.

Correctness — Tooltip Implementation

PASS. The button is now wrapped in a <div class="group relative"> container.
The tooltip uses:

  • opacity-0 invisiblegroup-hover:opacity-100 group-hover:visible
  • transition-all duration-200 for smooth appearance
  • Fixed positioning below the button (top-full mt-2)
  • w-64 width constraint, z-50 layering

The tooltip only renders when !hasRdmaCapableNodes, so it doesn't consume
DOM space when RDMA is available.

Edge Case — Stale Selection

NOTE. If a user previously selected MlxIbv (saved to localStorage) and then
TB5 hardware becomes unavailable, selectedInstanceType remains "MlxIbv" in
state. The button appears grayed out (correct), but the variable still holds
"MlxIbv" — which means placement previews would request MlxIbv and likely get
an error response. This is a minor UX issue, not a correctness bug. The user
would see the error in the preview and click MlxRing. A future improvement
could auto-reset to MlxRing when hasRdmaCapableNodes becomes false.

Edge Case — Data Loading

SAFE. Before tbIdentifiers loads (initial page load), ids is undefined/null,
so hasRdmaCapableNodes returns false. The RDMA button starts disabled and
becomes enabled once TB5 data arrives. This prevents a flash of enabled state
followed by disabling, which would be worse UX.

Nits

None.

Verdict

LGTM. Clean UX improvement that prevents users from selecting an instance type
that would fail at placement time. The TB5 detection logic is consistent with
the existing pattern used for the setup info banner. The tooltip provides clear
guidance. Minor edge case with stale localStorage selection is non-blocking.

@AlexCheema
Copy link
Contributor Author

Code Review — PR #1505: Disable RDMA option when no Thunderbolt 5 hardware

CI: ALL PASSING (aarch64-darwin, x86_64-linux, aarch64-linux)

Overview

Single file change (+51/-21) in dashboard/src/routes/+page.svelte. Adds a hasRdmaCapableNodes derived check and disables the MLX RDMA button with a tooltip when TB5 hardware is unavailable. Closes #1351.

Correctness — hasRdmaCapableNodes

const hasRdmaCapableNodes = $derived.by(() => {
    const ids = tbIdentifiers;
    if (!ids) return false;
    const tb5NodeCount = Object.values(ids).filter(
        (node) => node.interfaces.length > 0,
    ).length;
    return tb5NodeCount >= 2;
});

Checks tbIdentifiers (from /state endpoint's nodeThunderbolt data), counts nodes with at least one TB5 interface, requires >= 2 (RDMA needs peer-to-peer). This is the same pattern already used by the existing showTb5Info derived value (lines 100–113) for the TB5 setup info banner. ✅ Consistent.

Data loading edge case: Before tbIdentifiers loads, ids is null → returns false → RDMA button starts disabled. This prevents a flash of enabled-then-disabled state. ✅ Safe.

Correctness — button behavior

Four layers of protection:

  1. disabled={!hasRdmaCapableNodes} — HTML disabled attribute
  2. onclick guard: if (!hasRdmaCapableNodes) return; — belt and suspenders
  3. Visual styling: cursor-not-allowed opacity-40 when disabled, no hover effects
  4. Radio dot + yellow selected state both gated on && hasRdmaCapableNodes — prevents showing selected appearance when disabled

Tooltip

Wrapped in {#if !hasRdmaCapableNodes} — only renders when needed. Uses group-hover CSS on a parent <div class="group relative"> container. opacity-0 invisiblegroup-hover:opacity-100 group-hover:visible with transition-all duration-200. Positioned below button (top-full mt-2), w-64, z-50. ✅ Clean implementation.

Edge case — stale localStorage selection

If a user previously selected MlxIbv (saved to localStorage) and TB5 hardware later becomes unavailable, selectedInstanceType remains "MlxIbv" in state. The button appears grayed out (correct visually), but launching would still send MlxIbv to the backend. This is a minor UX issue — the user would see a placement error and switch manually. A future improvement could auto-reset to MlxRing via a $effect. Non-blocking.

Verdict

LGTM. Clean, focused UX improvement — prevents users from selecting an instance type that would fail at placement time. TB5 detection logic is consistent with the existing showTb5Info pattern. Tooltip provides clear guidance. Single commit, no regressions.

@AlexCheema
Copy link
Contributor Author

Review: PR #1505 — feat: disable RDMA option when no Thunderbolt 5 hardware

Summary

Dashboard-only change that disables the MLX RDMA button when fewer than 2 nodes have Thunderbolt 5 hardware. Shows a hover tooltip explaining the requirement. Prevents users from selecting an instance type that would fail at placement time.

Analysis

Detection logic is correct. hasRdmaCapableNodes derives from nodeThunderbolt state, counting nodes where interfaces.length > 0 (i.e., TB5 hardware detected by the system custodian). The >= 2 threshold is correct — RDMA requires at least two nodes to communicate.

UI implementation is clean:

  • Button is wrapped in a <div class="group relative"> to scope the tooltip hover
  • disabled attribute prevents form submission; onclick guard (if (!hasRdmaCapableNodes) return) prevents programmatic selection
  • Styling degrades gracefully: opacity-40 + cursor-not-allowed when disabled, conditional hover/active states
  • Radio dot only renders when both selected AND RDMA is available — avoids visual inconsistency
  • Tooltip uses the standard pattern from elsewhere in the dashboard (group-hover:opacity-100 group-hover:visible)

Edge case handled: If a user had previously selected RDMA (saved in localStorage via saveLaunchDefaults), then disconnects their TB5 hardware, the button appears disabled but selectedInstanceType could still be "MlxIbv" from the persisted defaults. The radio dot won't render (guarded by && hasRdmaCapableNodes), so the UI won't look selected, but the internal state is technically stale. This is a minor edge case — the placement would fail anyway with a clear error, and the user would see the button as disabled.

Note: This PR uses "MlxIbv" for the instance type, which is the current value on main. PR #1519 renames this to "MlxJaccl" — these will need to be reconciled when both merge.

Verdict

Clean, focused UX improvement that prevents a confusing failure mode. LGTM.

@AlexCheema
Copy link
Contributor Author

Code Review: PR #1505feat: disable RDMA option when no Thunderbolt 5 hardware

Overview

Disables the MLX RDMA instance type button when the cluster doesn't have at least 2 nodes with Thunderbolt 5 hardware. Shows a hover tooltip explaining the requirement.

Files changed: dashboard/src/routes/+page.svelte (+51, -21)

Verdict: Has a bug — stale selectedInstanceType when RDMA becomes unavailable


Bug: Stale selectedInstanceType causes invisible RDMA launch

When a user previously saved MlxIbv as their instance type (via saveLaunchDefaults), then opens the dashboard on a cluster without TB5 hardware:

  1. applyLaunchDefaults() (line 282) sets selectedInstanceType = "MlxIbv" unconditionally
  2. The RDMA button is disabled and visually dimmed (opacity-40)
  3. The RDMA radio dot is hidden (selectedInstanceType === 'MlxIbv' && hasRdmaCapableNodes → false)
  4. The MlxRing radio dot is also hidden (selectedInstanceType === 'MlxRing' → false)
  5. Neither button appears selected, but selectedInstanceType is still "MlxIbv"
  6. If the user launches a model, the API call sends instance_meta=MlxIbv (line ~721), which will fail at placement time — the exact scenario this PR is trying to prevent

Suggested fix — add a reactive guard that resets to MlxRing when RDMA becomes unavailable:

$effect(() => {
  if (!hasRdmaCapableNodes && selectedInstanceType === "MlxIbv") {
    selectedInstanceType = "MlxRing";
    saveLaunchDefaults();
  }
});

Or alternatively, guard in applyLaunchDefaults:

if (defaults.instanceType === "MlxIbv" && !hasRdmaCapableNodes) {
  selectedInstanceType = "MlxRing";
} else {
  selectedInstanceType = defaults.instanceType;
}

Other Analysis

1. hasRdmaCapableNodes detection — Correct ✅

The logic mirrors the existing tb5WithoutRdma check (lines 102–113): a node has TB5 if node.interfaces.length > 0. Requiring >= 2 nodes is correct since RDMA is a multi-node protocol.

2. Button disable mechanics — Mostly correct

  • disabled={!hasRdmaCapableNodes} prevents native button behavior
  • if (!hasRdmaCapableNodes) return; in onclick is redundant with disabled but harmless as defense-in-depth
  • cursor-not-allowed opacity-40 provides clear visual feedback
  • Hover effect (hover:border-exo-yellow/50) is correctly suppressed when disabled

3. Tooltip — Good UX ✅

The tooltip uses CSS-only group-hover pattern (no JS state management), positioned below the button with absolute top-full. The z-50 ensures it renders above other elements. The invisiblevisible transition with opacity creates a smooth fade-in.

4. Wrapping in <div class="group relative"> — Correct

The button is wrapped in a div to enable the group-hover tooltip pattern. The relative positioning anchors the absolutely-positioned tooltip.


Summary

  • Core concept: ✅ Detecting TB5 hardware and disabling the RDMA button is the right UX
  • Bug: 🐛 Stale selectedInstanceType from localStorage can silently remain "MlxIbv" even when the button appears unselected, leading to failed launches
  • Fix needed: Reset selectedInstanceType to "MlxRing" when hasRdmaCapableNodes is false
  • Tooltip UX: ✅ Clean CSS-only approach

@AlexCheema
Copy link
Contributor Author

Code Review: PR #1505 — feat: disable RDMA option when no Thunderbolt 5 hardware

Summary

Dashboard UX improvement: disables the MLX RDMA button when the cluster has fewer than 2 nodes with TB5 hardware, with a tooltip explaining the requirement.

Review

hasRdmaCapableNodes logic is correct:

const tb5NodeCount = Object.values(ids).filter(
    (node) => node.interfaces.length > 0,
).length;
return tb5NodeCount >= 2;

Checks that at least 2 nodes have TB5 interfaces. ✅

UX is good:

  • Button is disabled and dimmed (opacity-40, cursor-not-allowed) ✅
  • Tooltip on hover explains the requirement ✅
  • onclick is guarded with if (!hasRdmaCapableNodes) return
  • HTML disabled attribute is set ✅

Issues

1. The check tests for interfaces, not specifically TB5
The hasRdmaCapableNodes check looks at node.interfaces.length > 0, which tests for any Thunderbolt interface, not specifically TB5. If a node has TB4 interfaces, they would be counted. However, looking at the data flow, tbIdentifiers only contains TB5 identifiers (filtered upstream), so this is correct in practice. A comment clarifying this assumption would help.

2. Still uses MlxIbv not MlxJaccl
The button still references selectedInstanceType = "MlxIbv" and the comparison selectedInstanceType === 'MlxIbv'. Based on other PRs (like #1519 which removes MlxIbv), this may need updating. If MlxIbv is being deprecated in favor of MlxJaccl, this PR should use the new name.

Verdict

Good UX improvement. The main concern is the MlxIbv vs MlxJaccl naming, which may conflict with other in-flight PRs.

LGTM with the naming caveat.

@AlexCheema
Copy link
Contributor Author

Code Review -- PR #1505: feat: disable RDMA option when no Thunderbolt 5 hardware

CI: ALL PASSING (aarch64-darwin, x86_64-linux, aarch64-linux)

Overview

Single-file change (+51/-21) in dashboard/src/routes/+page.svelte. Adds a hasRdmaCapableNodes derived value that detects whether >=2 nodes have Thunderbolt 5 interfaces, then uses it to disable the MLX RDMA button with a tooltip when TB5 hardware is unavailable. Closes #1351.


Critical: Second RDMA button is not guarded

There are two RDMA instance type selectors in this file. The main one (~line 2892) is updated by this PR, but the compact/sidebar version at lines 4659-4682 (labeled "RDMA (Fast)") is completely unguarded:

<!-- lines 4659-4682: NOT updated by this PR -->
<button
  onclick={() => {
    selectedInstanceType = "MlxIbv";
    saveLaunchDefaults();
  }}
  class="... cursor-pointer {selectedInstanceType === 'MlxIbv' ...}"
>
  ...
  RDMA (Fast)
</button>

A user can select RDMA through this second button even when hasRdmaCapableNodes is false. This bypasses the entire protection this PR intends to add. The same disabled + guard + tooltip treatment needs to be applied here.


Significant: Stale selectedInstanceType from localStorage

When a user previously saved MlxIbv as their instance type (via saveLaunchDefaults), then connects to a cluster without TB5 hardware:

  1. applyLaunchDefaults() at line 531 sets selectedInstanceType = defaults.instanceType unconditionally -- restoring "MlxIbv"
  2. The RDMA button appears disabled and the radio dot is hidden (both gated on && hasRdmaCapableNodes)
  3. But selectedInstanceType is still "MlxIbv" in state
  4. The MLX Ring button also shows as unselected (selectedInstanceType === 'MlxRing' is false)
  5. Neither button appears selected -- confusing UI state
  6. Launching a model sends instance_meta: "MlxIbv" to the backend (lines 386, 991), which fails at placement -- the exact scenario this PR aims to prevent

Suggested fix -- add a reactive fallback:

$effect(() => {
  if (!hasRdmaCapableNodes && selectedInstanceType === "MlxIbv") {
    selectedInstanceType = "MlxRing";
    saveLaunchDefaults();
  }
});

Minor

  1. Tooltip not accessible to keyboard users. The tooltip uses CSS group-hover only. Users navigating by keyboard who tab to the disabled button will not see the explanation. Adding group-focus-within:opacity-100 group-focus-within:visible alongside the hover classes would fix this.

  2. Redundant onclick guard. The if (!hasRdmaCapableNodes) return; in the onclick handler is redundant with the HTML disabled attribute (which prevents click events in all browsers). Harmless as defense-in-depth but not strictly necessary.


What's Good

  • Detection logic is correct and consistent. hasRdmaCapableNodes uses the same tbIdentifiers + interfaces.length > 0 pattern as the existing tb5WithoutRdma derived value (lines 102-113). The >= 2 threshold is correct for RDMA which requires peer-to-peer.
  • Safe during initial load. Before tbIdentifiers loads, ids is null/undefined, so hasRdmaCapableNodes returns false. The RDMA button starts disabled and enables only when TB5 data arrives. This avoids a flash of enabled-then-disabled state.
  • Clean tooltip implementation. CSS-only group-hover pattern with opacity-0 invisible transition, positioned with absolute top-full, z-50 layering. Only rendered when !hasRdmaCapableNodes so no DOM cost when RDMA is available.
  • Small, focused scope. Single-responsibility change that is easy to reason about.

Verdict

The concept is right but the implementation has a gap: the second RDMA button (compact/sidebar "RDMA (Fast)" at line 4659) is not protected, allowing users to bypass the TB5 check entirely. The stale localStorage issue can also silently leave selectedInstanceType as "MlxIbv" when no button appears selected, leading to a failed launch.

Both should be fixed before merging.

Review only -- not a merge approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Show user if they have an RDMA-enabled device

1 participant