@KeitaW KeitaW commented Sep 28, 2025

Summary

This PR updates the DeepSpeed test case with improved documentation, better script organization, and fixes for Pyxis/Enroot container configuration.

Changes

Container Configuration

  • Add the --container-mount-home flag so that the user's home directory is mounted inside Pyxis containers
  • Fix a syntax error in the DATA_PATH variable definition
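As a rough illustration of the container change, the sketch below assembles Pyxis options for srun in a bash array and prints the resulting command instead of launching a job. The image path, the DATA_PATH value, and the SRUN_ARGS name are placeholders assumed for this example, not values taken from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values (assumptions for illustration, not from the PR).
IMAGE="${IMAGE:-${PWD}/deepspeed.sqsh}"
DATA_PATH="${DATA_PATH:-/fsx/data}"

# Pyxis options passed to srun. --container-mount-home asks Pyxis to
# mount the user's home directory inside the container.
declare -a SRUN_ARGS=(
    "--container-image=${IMAGE}"
    "--container-mounts=${DATA_PATH}:${DATA_PATH}"
    "--container-mount-home"
)

# Print the command rather than running it, so the sketch works
# on a machine without Slurm.
echo "srun ${SRUN_ARGS[*]} <your-command>"
```

On a cluster with Pyxis installed, replacing the final echo with `srun "${SRUN_ARGS[@]}" <your-command>` would launch the step with the home directory available inside the container.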

Documentation Improvements

  • Update the Llama2 model download instructions to cover two methods:
    • Option 1 (Recommended): Download via the HuggingFace CLI, with step-by-step authentication
    • Option 2: Download directly from Meta with the download.sh script
  • Add the detailed HuggingFace CLI commands needed for the model download
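For orientation, Option 1 might look roughly like the sketch below. The repository ID meta-llama/Llama-2-7b-hf, the target directory, and the RUN_DOWNLOAD switch are assumptions made for this example; the exact commands in the updated README may differ. The script only prints the download command unless RUN_DOWNLOAD=1 is set, because the real download needs network access and an account that has been granted access to the gated model.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed values for illustration; adjust to the model and path you need.
MODEL_ID="${MODEL_ID:-meta-llama/Llama-2-7b-hf}"
LOCAL_DIR="${LOCAL_DIR:-./Llama2-7b-hf}"

# Download command, kept in an array so each argument stays intact.
declare -a DOWNLOAD_CMD=(
    huggingface-cli download "${MODEL_ID}"
    --local-dir "${LOCAL_DIR}"
)

if [[ "${RUN_DOWNLOAD:-0}" == "1" ]]; then
    # Interactive step: prompts for a HuggingFace access token.
    huggingface-cli login
    "${DOWNLOAD_CMD[@]}"
else
    # Dry run: show what would be executed.
    echo "Would run: ${DOWNLOAD_CMD[*]}"
fi
```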

Script Improvements

  • Move finetune_llama.sbatch into the root directory so there is a single copy of the script
  • Convert variable definitions to declare -a array style for consistency
  • Pass arguments using quoted bash array expansion ("${ARRAY[@]}")
  • Add FINETUNE_ARGS, CONVERT_HF2MDS_ARGS, and CONVERT_MDS2HF_ARGS arrays
  • Fix argument passing so that the conversion commands also receive COMM_ARGS
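The reason for the array style can be seen in a small, self-contained sketch. The flags below are placeholders and count_args is a hypothetical stand-in for the real training or conversion command; only the array names COMM_ARGS and CONVERT_HF2MDS_ARGS come from the PR. Quoted expansion keeps each array element as one argument, while unquoted expansion splits elements on whitespace.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the real command: prints its argument count.
count_args() {
    echo "$#"
}

# Placeholder contents; the real arrays in finetune_llama.sbatch hold
# different flags.
declare -a COMM_ARGS=(--example-shared-flag "value with spaces")
declare -a CONVERT_HF2MDS_ARGS=(--example-convert-flag)

# Quoted expansion: one argument per array element.
QUOTED_COUNT=$(count_args "${COMM_ARGS[@]}" "${CONVERT_HF2MDS_ARGS[@]}")

# Unquoted expansion: "value with spaces" is split into three words.
UNQUOTED_COUNT=$(count_args ${COMM_ARGS[@]} ${CONVERT_HF2MDS_ARGS[@]})

echo "quoted: ${QUOTED_COUNT}, unquoted: ${UNQUOTED_COUNT}"
# prints "quoted: 3, unquoted: 5"
```

Passing both arrays on the same command line, as in the quoted call, also mirrors the fix that forwards COMM_ARGS to the conversion commands.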

Cleanup

  • Remove deprecated scripts from scripts/ directory:
    • convert-weights-hf-to-megatron-deepspeed.sh
    • scripts/finetune_llama.sbatch
    • scripts/finetune_llama.sh

Testing

  • Tested weight conversion process with Llama2-7B on AWS P5en instances
  • Verified Pyxis container mounting with home directory access

Related Work

This PR complements the fix submitted to Megatron-DeepSpeed for the DummyOptim state_dict issue: KeitaW/Megatron-DeepSpeed#1

Co-authored-by: aravneelaws [email protected]

- Add --container-mount-home flag to Pyxis configuration for proper home directory mounting
- Update Llama2 download documentation with latest methods (HuggingFace CLI and Meta direct download)
- Consolidate finetune_llama.sbatch script to root directory with improved structure
- Convert variable definitions to declare -a array style for better bash practices
- Fix DATA_PATH variable definition syntax error
- Update argument passing to use proper array expansion
- Remove deprecated scripts from scripts/ directory
- Add detailed HuggingFace CLI download instructions with authentication steps

Co-authored-by: aravneelaws <[email protected]>