Mixtral is a state-of-the-art AI model developed by Mistral AI, utilizing a sparse mixture-of-experts (MoE) architecture.
To get started, follow the instructions in the [mistral-inference](https://github.com/mistralai/mistral-inference) repository to download the model. Once downloaded, run `llama_or_mistral_ckpt.py` to convert the checkpoint into a MaxText-compatible format. You can then proceed with decoding, pretraining, and finetuning. A complete Mixtral 8x7B example is available in the `end_to_end/tpu/mixtral/8x7b` test scripts.
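As a minimal sketch, the conversion step typically looks like the command below; the flag names and paths are illustrative assumptions, so check the script's argument parser and the `end_to_end/tpu/mixtral/8x7b` scripts for the exact invocation.

```sh
# Convert the downloaded Mixtral checkpoint into a MaxText-compatible checkpoint.
# Flag names and paths are placeholders; see llama_or_mistral_ckpt.py and the
# end_to_end/tpu/mixtral/8x7b test scripts for the exact arguments.
python3 MaxText/llama_or_mistral_ckpt.py \
  --base-model-path /path/to/downloaded/mixtral-8x7B \
  --maxtext-model-path gs://your-bucket/mixtral-8x7b-maxtext \
  --model-size mixtral-8x7b
```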
Additionally, Mixtral integrates with MegaBlocks, an efficient dropless MoE strategy, which is enabled by setting the `megablox` flag to `True` (the default).
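For example, a training run can set the flag as a command-line override along with the model name. This is a sketch assuming MaxText's standard `key=value` overrides; the bucket paths and run name are placeholders.

```sh
# Sketch of a Mixtral training run with MegaBlocks enabled.
# megablox=True is already the default and is shown here only for explicitness.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=mixtral_8x7b_test \
  model_name=mixtral-8x7b \
  megablox=True \
  base_output_directory=gs://your-bucket/output \
  dataset_type=synthetic
```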
Model flops utilization (MFU) for training on TPU v5p:
| Model size | Accelerator type | TFLOP/chip/sec | Model flops utilization (MFU) |
| --- | --- | --- | --- |
| Mixtral 8x7B | v5p-128 | 251.94 | 54.89% |
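For context, MFU is the achieved throughput divided by the chip's peak throughput: a TPU v5p chip peaks at roughly 459 bf16 TFLOP/s, so 251.94 / 459 ≈ 54.89%.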