Add support for Qwen3.5 tensor parallel by 0xDaizz · Pull Request #1644 · exo-explore/exo

0xDaizz · 2026-03-02T06:57:45Z

Motivation

Qwen3.5 MoE models (e.g., Qwen3.5-397B-A17B-6bit) are now supported by mlx-lm via the qwen3_5_moe model type, but exo lacks tensor parallel sharding support for this architecture. This prevents running large Qwen3.5 models across multiple nodes.

Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to Qwen3-Next, but with a different projection layout: separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projections instead of Qwen3-Next's combined in_proj_qkvz and in_proj_ba. This requires architecture-aware sharding logic.

Changes

`auto_parallel.py`

Add imports for Qwen3_5TextModel, Qwen3_5MoeModel, Qwen3_5DecoderLayer, and Qwen3_5SparseMoeBlock from mlx_lm.models.qwen3_5 / qwen3_5_moe
Add Qwen3.5 models to the QwenShardingStrategy dispatch table
Handle Qwen3.5's separate GatedDeltaNet projections via hasattr(linear_attn, "in_proj_qkvz") conditional
Use section-aware sharding for in_proj_qkv with segments=[key_dim, key_dim + key_dim] to correctly split q/k/v sections across devices
Add Qwen3_5SparseMoeBlock to MoE and shared expert sharding
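
To illustrate the hasattr-based dispatch described above, here is a minimal sketch. The helper name `gated_delta_projection_names` and the `SimpleNamespace` stand-ins are hypothetical and exist only for this example; the actual code in `auto_parallel.py` operates on real mlx-lm modules and may be structured differently.

```python
from types import SimpleNamespace


def gated_delta_projection_names(linear_attn):
    """Return the input-projection attribute names that need sharding.

    Hypothetical helper: it only demonstrates telling the two layouts
    apart by checking for the fused `in_proj_qkvz` attribute.
    """
    if hasattr(linear_attn, "in_proj_qkvz"):
        # Qwen3-Next layout: fused projections
        return ["in_proj_qkvz", "in_proj_ba"]
    # Qwen3.5 layout: separate projections
    return ["in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a"]


# Stand-in objects that only carry the attribute names of each layout.
qwen3_next_like = SimpleNamespace(in_proj_qkvz=object(), in_proj_ba=object())
qwen3_5_like = SimpleNamespace(
    in_proj_qkv=object(), in_proj_z=object(), in_proj_b=object(), in_proj_a=object()
)

print(gated_delta_projection_names(qwen3_next_like))
print(gated_delta_projection_names(qwen3_5_like))
```

Checking for the fused attribute keeps a single code path for both architectures without importing a class per layout.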

`model_cards.py`

Add Qwen3_5MoeForConditionalGeneration to the supports_tensor whitelist

`utils_mlx.py`

Add Qwen3.5 EOS token IDs (248046, 248044) to get_eos_token_ids_for_model(), following the existing pattern for GLM, Kimi, and GPT-OSS models
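
As a rough sketch of what a model-specific EOS mapping can look like: the function name `eos_token_ids_for` and the substring match on the model ID are assumptions made for this example, not the actual logic of get_eos_token_ids_for_model(). Only the two token IDs come from this PR.

```python
from typing import List, Optional

# Token IDs taken from the PR description.
QWEN3_5_EOS_TOKEN_IDS = [248046, 248044]


def eos_token_ids_for(model_id: str) -> Optional[List[int]]:
    """Hypothetical lookup: return extra EOS token IDs for known model families."""
    if "qwen3.5" in model_id.lower():
        return list(QWEN3_5_EOS_TOKEN_IDS)
    # Unknown family: signal that the tokenizer default should be used.
    return None


print(eos_token_ids_for("Qwen3.5-397B-A17B-6bit"))
```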

Why It Works

Qwen3.5's GatedDeltaNet has an in_proj_qkv linear layer with three concatenated sections: [q(key_dim), k(key_dim), v(value_dim)]. A naive contiguous split (segments=1) would slice across section boundaries, corrupting q/k/v values and producing garbled output.

By passing segments=[key_dim, key_dim + key_dim] to shard_linear(), each section is split independently before distributing across devices. This ensures every rank receives correctly aligned q, k, and v components.
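
The difference between the two splits can be seen in a small NumPy sketch. `shard_rows` is a hypothetical stand-in written for this example, not the real shard_linear(), and it assumes that `segments` lists the row indices at which the fused weight is cut into sections. The dimensions are toy values.

```python
import numpy as np


def shard_rows(rows, rank, world_size, segments=1):
    """Return the rows owned by `rank` (hypothetical helper).

    segments == 1 treats the fused rows as a single section; a list of
    indices cuts them into sections that are each split evenly across ranks.
    """
    sections = [rows] if segments == 1 else np.split(rows, segments, axis=0)
    parts = []
    for section in sections:
        chunk = section.shape[0] // world_size
        parts.append(section[rank * chunk:(rank + 1) * chunk])
    return np.concatenate(parts, axis=0)


key_dim, value_dim, world_size = 4, 8, 2

# Label every output row of the fused projection with the section it belongs to.
rows = np.array(["q"] * key_dim + ["k"] * key_dim + ["v"] * value_dim)

naive_rank0 = shard_rows(rows, rank=0, world_size=world_size)
aware_rank0 = shard_rows(
    rows, rank=0, world_size=world_size, segments=[key_dim, key_dim + key_dim]
)

print("naive rank 0:", "".join(naive_rank0))          # qqqqkkkk
print("section-aware rank 0:", "".join(aware_rank0))  # qqkkvvvv
```

With the contiguous split, rank 0 holds all of q and k and none of v, while rank 1 holds only v. With the section-aware split, each rank holds half of every section.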

The remaining separate projections (in_proj_z, in_proj_b, in_proj_a) and the MoE layers follow the same all_to_sharded / sharded_to_all pattern already used for Qwen3-Next.
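
For readers unfamiliar with the pattern, the sketch below shows the general idea in NumPy, under the assumption that all_to_sharded means splitting a weight along its output dimension and sharded_to_all means splitting along its input dimension followed by a sum across ranks. The shapes and variable names are illustrative only, and the loop over ranks stands in for real multi-device execution.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 2

x = rng.standard_normal((3, 8))        # (tokens, hidden)
w_up = rng.standard_normal((8, 16))    # all-to-sharded: split output columns
w_down = rng.standard_normal((16, 8))  # sharded-to-all: split input rows

# Single-device reference result.
reference = (x @ w_up) @ w_down

shard = w_up.shape[1] // world_size
partials = []
for rank in range(world_size):
    cols = slice(rank * shard, (rank + 1) * shard)
    hidden_shard = x @ w_up[:, cols]                 # each rank computes a slice of the activations
    partials.append(hidden_shard @ w_down[cols, :])  # partial output from this rank

# Summing the partial outputs plays the role of the all-reduce across ranks.
combined = sum(partials)

print(np.allclose(combined, reference))
```

Pairing the two splits this way means only one reduction is needed per block, since the intermediate activations never have to be gathered.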

Test Plan

Manual Testing

Hardware: 2x Mac Studio M3 Ultra (512GB each), connected via Thunderbolt 5 direct cable
Backend: MlxJaccl (RDMA over Thunderbolt 5), Tensor Parallelism
Model: Qwen3.5-397B-A17B-6bit (301GB, 6-bit quantized MLX format, 60 layers, 512 experts)

Correctness verification:

"What is 2+2?" → 4 (finish_reason: stop, EOS handled correctly)
Extended generation produces coherent, well-structured reasoning and output
No special token leakage (<|im_end|>, <|endoftext|>) in API responses

Benchmark results (/bench/chat/completions endpoint):

| Test | Prompt Tokens | Gen Tokens | Prefill TPS | Gen TPS | Peak Memory |
|---|---|---|---|---|---|
| Short→Short | 12 | 50 | 0.8 | 38.6 | 162.1 GB |
| Short→Medium | 20 | 300 | 13.7 | 37.8 | 162.1 GB |
| Medium→Medium | 143 | 256 | 10.3 | 37.7 | 162.3 GB |
| Short→Long | 21 | 512 | 14.0 | 37.6 | 162.1 GB |

Generation throughput is consistent across all test sizes, ranging from 37.6 to 38.6 tok/s. Peak memory is stable at ~162 GB per node.

Automated Testing

No new automated tests added. Existing tests are unaffected — changes only add new model dispatch paths alongside the existing Qwen3-Next/Qwen3-MoE sharding logic. Pre-existing basedpyright error count (8) is unchanged.

Evanev7 · 2026-03-02T09:58:07Z

testing this now

Add tensor parallel sharding support for Qwen3.5 MoE models:

- Add Qwen3_5MoeForConditionalGeneration to tensor parallel whitelist
- Add Qwen3.5 model/layer/MoE imports and dispatch in auto_parallel.py
- Handle Qwen3.5's separate GatedDeltaNet projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) vs Qwen3-Next's combined projections (in_proj_qkvz, in_proj_ba)
- Use section-aware sharding for in_proj_qkv with segments=[key_dim, key_dim+key_dim] to correctly split q/k/v sections across devices
- Add Qwen3.5 EOS token IDs (248046, 248044) to model-specific mapping
- Add Qwen3.5 SparseMoeBlock to shared expert sharding

Tested with Qwen3.5-397B-A17B-6bit across 2x M3 Ultra nodes via Jaccl (RDMA over Thunderbolt 5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

0xDaizz · 2026-03-02T10:09:22Z

@Evanev7 just fixed the test issue!

hw and others added 3 commits March 2, 2026 19:08

style: fix import ordering for ruff I001

0e2f7f7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: apply ruff format for CI treefmt compliance

6a3b123

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

0xDaizz force-pushed the feat/qwen3.5-support branch from 8fe2ece to 6a3b123 Compare March 2, 2026 10:08

0xDaizz wants to merge 3 commits into exo-explore:main from 0xDaizz:feat/qwen3.5-support
