Add support for Qwen3.5 tensor parallel #1644
Open
0xDaizz wants to merge 3 commits into exo-explore:main
Member
testing this now
Add tensor parallel sharding support for Qwen3.5 MoE models:

- Add Qwen3_5MoeForConditionalGeneration to tensor parallel whitelist
- Add Qwen3.5 model/layer/MoE imports and dispatch in auto_parallel.py
- Handle Qwen3.5's separate GatedDeltaNet projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) vs Qwen3-Next's combined projections (in_proj_qkvz, in_proj_ba)
- Use section-aware sharding for in_proj_qkv with segments=[key_dim, key_dim+key_dim] to correctly split q/k/v sections across devices
- Add Qwen3.5 EOS token IDs (248046, 248044) to model-specific mapping
- Add Qwen3.5 SparseMoeBlock to shared expert sharding

Tested with Qwen3.5-397B-A17B-6bit across 2x M3 Ultra nodes via Jaccl (RDMA over Thunderbolt 5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8fe2ece to 6a3b123
Author
@Evanev7 just fixed the test issue!
Motivation
Qwen3.5 MoE models (e.g., Qwen3.5-397B-A17B-6bit) are now supported by mlx-lm via the qwen3_5_moe model type, but exo lacks tensor parallel sharding support for this architecture. This prevents running large Qwen3.5 models across multiple nodes.

Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to Qwen3-Next, but with a different projection layout: separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a instead of Qwen3-Next's combined in_proj_qkvz and in_proj_ba. This requires architecture-aware sharding logic.

Changes
auto_parallel.py
- Import Qwen3_5TextModel, Qwen3_5MoeModel, Qwen3_5DecoderLayer, and Qwen3_5SparseMoeBlock from mlx_lm.models.qwen3_5 / qwen3_5_moe
- Add the Qwen3.5 model and layer types to the QwenShardingStrategy dispatch table
- Distinguish the separate Qwen3.5 projections from the combined Qwen3-Next projections with a hasattr(linear_attn, "in_proj_qkvz") conditional
- Use section-aware sharding for in_proj_qkv with segments=[key_dim, key_dim + key_dim] to correctly split q/k/v sections across devices
- Add Qwen3_5SparseMoeBlock to MoE and shared expert sharding

model_cards.py
- Add Qwen3_5MoeForConditionalGeneration to the supports_tensor whitelist

utils_mlx.py
- Add the Qwen3.5 EOS token IDs (248046, 248044) to get_eos_token_ids_for_model(), following the existing pattern for GLM, Kimi, and GPT-OSS models

Why It Works
Qwen3.5's GatedDeltaNet has an in_proj_qkv linear layer with three concatenated sections: [q(key_dim), k(key_dim), v(value_dim)]. A naive contiguous split (segments=1) would slice across section boundaries, corrupting q/k/v values and producing garbled output.

By passing segments=[key_dim, key_dim + key_dim] to shard_linear(), each section is split independently before distributing across devices. This ensures every rank receives correctly aligned q, k, and v components.

The remaining separate projections (in_proj_z, in_proj_b, in_proj_a) and the MoE layers follow the same all_to_sharded / sharded_to_all pattern already used for Qwen3-Next.

Test Plan
Manual Testing
Hardware: 2x Mac Studio M3 Ultra (512GB each), connected via Thunderbolt 5 direct cable
Backend: MlxJaccl (RDMA over Thunderbolt 5), Tensor Parallelism
Model: Qwen3.5-397B-A17B-6bit (301GB, 6-bit quantized MLX format, 60 layers, 512 experts)

Correctness verification:
- "What is 2+2?" → 4 (finish_reason: stop, EOS handled correctly)
- No stray special tokens (<|im_end|>, <|endoftext|>) in API responses

Benchmark results (/bench/chat/completions endpoint):

Generation throughput is consistent at ~37.8 tok/s across all test sizes. Peak memory is stable at ~162 GB per node.
Automated Testing
No new automated tests added. Existing tests are unaffected: the changes only add new model dispatch paths alongside the existing Qwen3-Next/Qwen3-MoE sharding logic. The pre-existing basedpyright error count (8) is unchanged.
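
The section-aware split described under Why It Works could be checked in isolation along the following lines. This is a minimal pure-Python sketch, not exo's shard_linear(): the helper names (split_sections, naive_shard, section_aware_shard) are hypothetical, the sizes are toy values, and it assumes that segments lists cumulative split points along the output dimension, which is how segments=[key_dim, key_dim + key_dim] reads.

```python
def split_sections(rows, segments):
    """Cut rows at the cumulative boundaries listed in segments (assumed semantics)."""
    bounds = [0, *segments, len(rows)]
    return [rows[a:b] for a, b in zip(bounds, bounds[1:])]


def naive_shard(rows, world_size, rank):
    """Contiguous split of the whole output dimension, ignoring q/k/v boundaries."""
    per_rank = len(rows) // world_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


def section_aware_shard(rows, segments, world_size, rank):
    """Split each section on its own, then concatenate this rank's slices."""
    shard = []
    for section in split_sections(rows, segments):
        per_rank = len(section) // world_size
        shard.extend(section[rank * per_rank:(rank + 1) * per_rank])
    return shard


# Toy dimensions (hypothetical); labels stand in for rows of the in_proj_qkv weight.
key_dim, value_dim, world_size = 4, 8, 2
rows = (
    [f"q{i}" for i in range(key_dim)]
    + [f"k{i}" for i in range(key_dim)]
    + [f"v{i}" for i in range(value_dim)]
)
segments = [key_dim, key_dim + key_dim]

naive = [naive_shard(rows, world_size, r) for r in range(world_size)]
aware = [section_aware_shard(rows, segments, world_size, r) for r in range(world_size)]

for r in range(world_size):
    print(f"rank {r} naive: {naive[r]}")
    print(f"rank {r} aware: {aware[r]}")
```

With these toy sizes, the naive split gives rank 0 all of q and k and rank 1 all of v, while the section-aware split gives each rank half of every section.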