Commit 6160140

amaurya authored and committed
Fix whitespaces
Signed-off-by: amaurya <[email protected]>
1 parent: 83b66d9 · commit: 6160140

4 files changed: 12 additions & 1 deletion

deepspeed/datastates/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# DataStates-LLM checkpointing engine.
+
+This feature is not enabled by default. To enable, set the following options in ds_config.json and download [DataStates-LLM checkpointing library](https://github.com/DataStates/datastates-llm/). A detailed tutorial is available [here](../../docs/_tutorials/datastates-async-checkpointing.md).
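For context (not part of this commit), below is a rough sketch of enabling the feature from Python by passing an equivalent config dict to `deepspeed.initialize` instead of a ds_config.json file. The `datastates_ckpt` section name and the example values are assumptions based on the options described in the checkpoint-engine README below; consult the linked tutorial for the exact keys.

```python
# Hypothetical sketch: enabling DataStates-LLM checkpointing from Python.
# The "datastates_ckpt" key and its values are assumptions; consult the
# DataStates-LLM tutorial linked above for the exact option names.
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "datastates_ckpt": {            # assumed section name
        "host_cache_size": 16,      # pinned host memory reserved for async flushing, in GB
        "parser_threads": 8,        # checkpoint file requests parsed in parallel
    },
}

# deepspeed.initialize accepts the config as a dict instead of a ds_config.json path.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```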

deepspeed/runtime/checkpoint_engine/README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ class CheckpointEngine(object):
 
 ### Asynchronous Lazy Checkpointing using DataStates-LLM
 
-DataStates-LLM is an asynchronous checkpointing approach optimized for LLM pre-training and can be obtained at https://github.com/DataStates/datastates-llm. To enable datastates-llm checkpointing, specify the `host_cache_size` (in gigabytes) which reserves pinned host memory for asynchronous checkpoint flushing, and `parser_threads` to parse multiple checkpoint file requests in parallel using the following lines in config.json supplied during the launch:
+DataStates-LLM is an asynchronous checkpointing approach optimized for LLM pre-training and can be obtained at https://github.com/DataStates/datastates-llm. A detailed tutorial is available [here](../../../docs/_tutorials/datastates-async-checkpointing.md). To enable datastates-llm checkpointing, specify the `host_cache_size` (in gigabytes) which reserves pinned host memory for asynchronous checkpoint flushing, and `parser_threads` to parse multiple checkpoint file requests in parallel using the following lines in config.json supplied during the launch:
 ```
 {
     ... other deepspeed config options,
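The hunk above is truncated mid-snippet; the full config example lives in the README itself. As an illustration of how the engine is exercised once configured (again, not part of this commit), the sketch below continues the previous example. `save_checkpoint` is the standard DeepSpeed engine method; with an asynchronous engine, the intent is that it can return after the GPU-to-pinned-host copy while the flush to persistent storage continues in the background.

```python
# Sketch of a training loop using the model_engine from the previous example.
# The checkpoint directory and tag are placeholders.
import torch

for step in range(3000):
    batch = torch.randn(8, 1024).to(model_engine.device)
    loss = model_engine(batch).sum()
    model_engine.backward(loss)
    model_engine.step()

    if step % 1000 == 0:
        # With an asynchronous engine, save_checkpoint can return once the
        # parameters are staged in pinned host memory; writing them out
        # continues in the background while training proceeds.
        model_engine.save_checkpoint("checkpoints", tag=f"step{step}")
```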

deepspeed/runtime/config.py

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,7 @@
 from ..profiling.config import DeepSpeedFlopsProfilerConfig
 from ..autotuning.config import DeepSpeedAutotuningConfig
 from ..nebula.config import DeepSpeedNebulaConfig
+from ..datastates.config import DeepSpeedDataStatesConfig
 
 from ..compression.config import get_compression_config, get_quantize_enabled
 from ..compression.constants import *
@@ -908,6 +909,7 @@ def _initialize_params(self, param_dict):
         self.dataloader_drop_last = get_dataloader_drop_last(param_dict)
 
         self.nebula_config = DeepSpeedNebulaConfig(param_dict)
+        self.datastates_config = DeepSpeedDataStatesConfig(param_dict)
 
         self.weight_quantization_config = WeightQuantConfig(
             **param_dict['weight_quantization']) if 'weight_quantization' in param_dict else None
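The new `DeepSpeedDataStatesConfig` is parsed alongside the other per-feature config objects. The actual class lives in `deepspeed/datastates/config.py` and is not shown in this commit; the snippet below is only a hypothetical sketch of the usual DeepSpeed pattern of pulling one section out of the user-supplied `param_dict`, with the `datastates_ckpt` key, field names, and defaults all assumed.

```python
# Hypothetical illustration of the config-object pattern used above: a small
# wrapper that extracts its section from the user-supplied param_dict and
# falls back to defaults when the section is absent. The real class may differ.
class DataStatesConfigSketch:
    def __init__(self, param_dict):
        section = param_dict.get("datastates_ckpt", {})   # assumed section name
        self.enabled = bool(section)                       # feature off unless configured
        self.host_cache_size = section.get("host_cache_size", 0)  # pinned host memory (GB)
        self.parser_threads = section.get("parser_threads", 1)    # parallel request parsers


# Usage mirroring _initialize_params: construct it from the full config dict.
cfg = DataStatesConfigSketch({"datastates_ckpt": {"host_cache_size": 16, "parser_threads": 8}})
print(cfg.enabled, cfg.host_cache_size, cfg.parser_threads)
```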

deepspeed/runtime/engine.py

Lines changed: 6 additions & 0 deletions
@@ -2264,6 +2264,12 @@ def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
                 # https://nvidia.github.io/apex/advanced.html#gradient-clipping
                 master_params = amp.master_params(self.optimizer)
                 clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clipping(), mpu=self.mpu)
+
+        try:
+            self.checkpoint_engine.wait()
+        except Exception as exc:
+            logger.error(f"Error during optimizer wait step: {exc}")
+
         self.optimizer.step()
 
         if hasattr(self.optimizer, '_global_grad_norm'):
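The point of calling `self.checkpoint_engine.wait()` immediately before `self.optimizer.step()` is ordering: an asynchronous engine may still be reading parameter tensors while it flushes a previous checkpoint, so the optimizer must not overwrite them until that flush completes. The toy sketch below (not DataStates-LLM code; every name in it is made up) illustrates the same ordering with a background thread.

```python
# Toy illustration of waiting for an in-flight asynchronous checkpoint flush
# before updating the parameters it is reading.
import threading
import time


class ToyAsyncCheckpointEngine:
    def __init__(self):
        self._thread = None

    def save(self, tensors, path):
        # Launch the flush in the background and return immediately.
        self._thread = threading.Thread(target=self._flush, args=(tensors, path))
        self._thread.start()

    def _flush(self, tensors, path):
        time.sleep(0.1)  # stand-in for serializing and writing the tensors
        print(f"flushed {len(tensors)} tensors to {path}")

    def wait(self):
        # Block until any in-flight flush has finished; a no-op otherwise.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


engine = ToyAsyncCheckpointEngine()
params = [[0.0] * 4]                       # stand-in for model parameters
engine.save(params, "ckpt-step-0")

engine.wait()                              # mirrors self.checkpoint_engine.wait()
params[0] = [p - 0.01 for p in params[0]]  # safe to update only after the flush is done
```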
