Add support for Qwen3.5 tensor parallel#1644

Open
0xDaizz wants to merge 3 commits into exo-explore:main from 0xDaizz:feat/qwen3.5-support
Conversation

0xDaizz commented Mar 2, 2026

Motivation

Qwen3.5 MoE models (e.g., Qwen3.5-397B-A17B-6bit) are now supported by mlx-lm via the qwen3_5_moe model type, but exo lacks tensor parallel sharding support for this architecture. This prevents running large Qwen3.5 models across multiple nodes.

Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to Qwen3-Next, but with a different projection layout — separate in_proj_qkv, in_proj_z, in_proj_b, in_proj_a instead of Qwen3-Next's combined in_proj_qkvz and in_proj_ba. This requires architecture-aware sharding logic.

Changes

auto_parallel.py

  • Add imports for Qwen3_5TextModel, Qwen3_5MoeModel, Qwen3_5DecoderLayer, and Qwen3_5SparseMoeBlock from mlx_lm.models.qwen3_5 / qwen3_5_moe
  • Add Qwen3.5 models to the QwenShardingStrategy dispatch table
  • Handle Qwen3.5's separate GatedDeltaNet projections via hasattr(linear_attn, "in_proj_qkvz") conditional
  • Use section-aware sharding for in_proj_qkv with segments=[key_dim, key_dim + key_dim] to correctly split q/k/v sections across devices
  • Add Qwen3_5SparseMoeBlock to MoE and shared expert sharding
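The conditional dispatch in the bullets above can be sketched as follows. This is an illustrative stand-in, not exo's actual API: the class names and `select_sharding_path` helper are hypothetical, and only the `hasattr(linear_attn, "in_proj_qkvz")` check mirrors the PR.

```python
# Illustrative sketch of the hasattr-based dispatch described above.
# Class and function names are hypothetical stand-ins, not exo's API.
class Qwen3NextGDN:
    """Qwen3-Next GatedDeltaNet: combined projections."""
    def __init__(self):
        self.in_proj_qkvz = object()  # combined q/k/v/z projection
        self.in_proj_ba = object()    # combined b/a projection

class Qwen3_5GDN:
    """Qwen3.5 GatedDeltaNet: four separate projections."""
    def __init__(self):
        self.in_proj_qkv = object()
        self.in_proj_z = object()
        self.in_proj_b = object()
        self.in_proj_a = object()

def select_sharding_path(linear_attn) -> str:
    """Pick the projection layout to shard, mirroring the PR's conditional."""
    if hasattr(linear_attn, "in_proj_qkvz"):
        return "qwen3_next"  # shard combined in_proj_qkvz / in_proj_ba
    return "qwen3_5"         # shard separate in_proj_qkv / _z / _b / _a

print(select_sharding_path(Qwen3NextGDN()))  # → qwen3_next
print(select_sharding_path(Qwen3_5GDN()))    # → qwen3_5
```

Keying the dispatch on the attribute that actually differs between the two layouts avoids hard-coding model-type strings.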

model_cards.py

  • Add Qwen3_5MoeForConditionalGeneration to the supports_tensor whitelist

utils_mlx.py

  • Add Qwen3.5 EOS token IDs (248046, 248044) to get_eos_token_ids_for_model(), following the existing pattern for GLM, Kimi, and GPT-OSS models

Why It Works

Qwen3.5's GatedDeltaNet has an in_proj_qkv linear layer with three concatenated sections: [q(key_dim), k(key_dim), v(value_dim)]. A naive contiguous split (segments=1) would slice across section boundaries, corrupting q/k/v values and producing garbled output.

By passing segments=[key_dim, key_dim + key_dim] to shard_linear(), each section is split independently before distributing across devices. This ensures every rank receives correctly aligned q, k, and v components.
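The difference between the naive and section-aware splits can be shown with a toy example. This is not exo's shard_linear API, just a NumPy sketch with key_dim=4, value_dim=8, and 2 ranks, labelling each weight row by the section it belongs to:

```python
# Toy demonstration of why a naive contiguous split corrupts the
# concatenated [q, k, v] sections, while a section-aware split keeps
# every rank's q/k/v aligned. Not exo's actual shard_linear API.
import numpy as np

key_dim, value_dim, n_ranks = 4, 8, 2
# Rows of in_proj_qkv's output: [q(key_dim), k(key_dim), v(value_dim)]
labels = np.array(["q"] * key_dim + ["k"] * key_dim + ["v"] * value_dim)

# Naive contiguous split: rank 0 gets all of q and all of k,
# rank 1 gets only v — the section boundaries are ignored.
naive = np.array_split(labels, n_ranks)

# Section-aware split: cut at the segment boundaries [key_dim, 2*key_dim]
# first, then split each section across ranks independently.
sections = np.split(labels, [key_dim, 2 * key_dim])
aware = [
    np.concatenate([np.array_split(s, n_ranks)[r] for s in sections])
    for r in range(n_ranks)
]

print([list(r) for r in naive])  # rank 1 holds no q or k at all
print([list(r) for r in aware])  # every rank holds q, k, and v
```

With the section-aware split, each rank receives half of q, half of k, and half of v, matching how the downstream attention computation expects its shards to be laid out.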

The remaining separate projections (in_proj_z, in_proj_b, in_proj_a) and the MoE layers follow the same all_to_sharded / sharded_to_all pattern already used for Qwen3-Next.

Test Plan

Manual Testing

Hardware: 2x Mac Studio M3 Ultra (512GB each), connected via Thunderbolt 5 direct cable
Backend: MlxJaccl (RDMA over Thunderbolt 5), Tensor Parallelism
Model: Qwen3.5-397B-A17B-6bit (301GB, 6-bit quantized MLX format, 60 layers, 512 experts)

Correctness verification:

  • "What is 2+2?" → "4" (finish_reason: stop, EOS handled correctly)
  • Extended generation produces coherent, well-structured reasoning and output
  • No special token leakage (<|im_end|>, <|endoftext|>) in API responses

Benchmark results (/bench/chat/completions endpoint):

| Test | Prompt Tokens | Gen Tokens | Prefill TPS | Gen TPS | Peak Memory |
|---|---|---|---|---|---|
| Short→Short | 12 | 50 | 0.8 | 38.6 | 162.1 GB |
| Short→Medium | 20 | 300 | 13.7 | 37.8 | 162.1 GB |
| Medium→Medium | 143 | 256 | 10.3 | 37.7 | 162.3 GB |
| Short→Long | 21 | 512 | 14.0 | 37.6 | 162.1 GB |

Generation throughput is consistent at ~37.6–38.6 tok/s across all test sizes, and peak memory is stable at ~162 GB per node.

Automated Testing

No new automated tests were added. Existing tests are unaffected — the changes only add new model dispatch paths alongside the existing Qwen3-Next/Qwen3-MoE sharding logic. The pre-existing basedpyright error count (8) is unchanged.

Evanev7 (Member) commented Mar 2, 2026

testing this now

hw and others added 3 commits March 2, 2026 19:08
Add tensor parallel sharding support for Qwen3.5 MoE models:

- Add Qwen3_5MoeForConditionalGeneration to tensor parallel whitelist
- Add Qwen3.5 model/layer/MoE imports and dispatch in auto_parallel.py
- Handle Qwen3.5's separate GatedDeltaNet projections (in_proj_qkv,
  in_proj_z, in_proj_b, in_proj_a) vs Qwen3-Next's combined projections
  (in_proj_qkvz, in_proj_ba)
- Use section-aware sharding for in_proj_qkv with segments=[key_dim,
  key_dim+key_dim] to correctly split q/k/v sections across devices
- Add Qwen3.5 EOS token IDs (248046, 248044) to model-specific mapping
- Add Qwen3.5 SparseMoeBlock to shared expert sharding

Tested with Qwen3.5-397B-A17B-6bit across 2x M3 Ultra nodes via
Jaccl (RDMA over Thunderbolt 5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0xDaizz force-pushed the feat/qwen3.5-support branch from 8fe2ece to 6a3b123 on March 2, 2026 at 10:08
0xDaizz (Author) commented Mar 2, 2026

@Evanev7 just fixed the test issue!
