Update DeepSpeed test case with improved documentation and scripts #867

KeitaW · 2025-09-28T00:08:16Z

Summary

This PR updates the DeepSpeed test case with improved documentation, better script organization, and fixes for Pyxis/Enroot container configuration.

Changes

Container Configuration

Add --container-mount-home flag so the home directory is mounted in Pyxis containers
Fix DATA_PATH variable definition syntax error
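
For reviewers unfamiliar with Pyxis, a minimal sketch of how the flag could appear in an srun argument array. The image path, the DATA_PATH value, and the SRUN_ARGS name are hypothetical placeholders, not taken from this PR; the block only prints the command so it can be inspected without a Slurm cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values for illustration; replace with your own paths.
IMAGE="${IMAGE:-/fsx/deepspeed.sqsh}"
DATA_PATH="${DATA_PATH:-/fsx/data}"   # note: no spaces around '=' in a bash assignment

# Pyxis options passed to srun. --container-mount-home asks Pyxis to
# bind-mount the user's home directory into the container.
declare -a SRUN_ARGS=(
    --container-image "$IMAGE"
    --container-mounts "${DATA_PATH}:${DATA_PATH}"
    --container-mount-home
)

# Print the command instead of running it (no Slurm needed for this sketch).
printf '%q ' srun "${SRUN_ARGS[@]}"
echo
```

On a cluster, the printed command would be followed by the program to run inside the container.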

Documentation Improvements

Update Llama2 model download instructions with two methods:
- Option 1 (Recommended): Download via HuggingFace CLI with step-by-step authentication
- Option 2: Direct download from Meta with download.sh script
Add detailed HuggingFace CLI commands for model download
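
As a rough illustration of Option 1, the HuggingFace CLI flow might look like the sketch below. The repository ID meta-llama/Llama-2-7b-hf and the local directory are assumptions (the PR only mentions Llama2-7B), and the run helper with its DRY_RUN switch is a hypothetical wrapper added so the commands can be previewed without network access or credentials.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed values; the PR does not state the exact repository or target path.
MODEL_ID="${MODEL_ID:-meta-llama/Llama-2-7b-hf}"
LOCAL_DIR="${LOCAL_DIR:-./Llama2-7b-hf}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually execute the commands

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [[ "$DRY_RUN" == "1" ]]; then
        printf '%q ' "$@"
        echo
    else
        "$@"
    fi
}

# Step 1: authenticate (prompts for an access token when executed).
run huggingface-cli login

# Step 2: download the model files into a local directory.
run huggingface-cli download "$MODEL_ID" --local-dir "$LOCAL_DIR"
```

Access to the Llama2 weights on HuggingFace requires accepting Meta's license for the model first, so the download step fails for accounts that have not been granted access.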

Script Improvements

Consolidate finetune_llama.sbatch into the root directory for better organization
Convert variable definitions to declare -a array style for consistency
Update argument passing to use proper bash array expansion ("${ARRAY[@]}")
Add FINETUNE_ARGS, CONVERT_HF2MDS_ARGS, and CONVERT_MDS2HF_ARGS arrays
Fix command-line argument passing to include COMM_ARGS for conversion commands
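
To show why the array style matters, here is a small sketch comparing quoted array expansion with a flattened string. The array names COMM_ARGS and FINETUNE_ARGS come from this PR, but their contents here are hypothetical, as are the count_args helper and the example path containing a space.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Helper for the demo: report how many arguments were received.
count_args() { echo "$#"; }

# Hypothetical contents; the real arrays in finetune_llama.sbatch may differ.
declare -a COMM_ARGS=(
    --tensor-model-parallel-size 1
    --data-path "/fsx/my data/alpaca.json"   # path with a space, on purpose
)
declare -a FINETUNE_ARGS=(
    --micro-batch-size 2
)

# Quoted [@] expansion keeps each element as exactly one argument.
count_args "${COMM_ARGS[@]}" "${FINETUNE_ARGS[@]}"   # prints 6

# Flattening to a string and expanding unquoted splits the path in two.
FLAT="${COMM_ARGS[*]} ${FINETUNE_ARGS[*]}"
count_args $FLAT                                     # prints 7
```

The same pattern lets shared arguments be reused across commands, for example passing COMM_ARGS together with a conversion-specific array.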

Cleanup

Remove deprecated scripts from scripts/ directory:
- convert-weights-hf-to-megatron-deepspeed.sh
- scripts/finetune_llama.sbatch
- scripts/finetune_llama.sh

Testing

Tested weight conversion process with Llama2-7B on AWS P5en instances
Verified Pyxis container mounting with home directory access

Related Work

This PR complements the fix submitted to Megatron-DeepSpeed for the DummyOptim state_dict issue: KeitaW/Megatron-DeepSpeed#1

Co-authored-by: aravneelaws [email protected]

- Add --container-mount-home flag to Pyxis configuration for proper home directory mounting
- Update Llama2 download documentation with latest methods (HuggingFace CLI and Meta direct download)
- Consolidate finetune_llama.sbatch script to root directory with improved structure
- Convert variable definitions to declare -a array style for better bash practices
- Fix DATA_PATH variable definition syntax error
- Update argument passing to use proper array expansion
- Remove deprecated scripts from scripts/ directory
- Add detailed HuggingFace CLI download instructions with authentication steps

Co-authored-by: aravneelaws <[email protected]>

KeitaW assigned aravneelaws Sep 28, 2025

aravneelaws added 2 commits October 3, 2025 20:23

- Updated script path (b783214)
- Fixed formatting typo (e04eb9b)
