From 38c02fed1941b9fef87825a6d715e4095ada5b35 Mon Sep 17 00:00:00 2001
From: Vasiliy Kuznetsov
Date: Tue, 16 Jul 2024 15:57:03 -0700
Subject: [PATCH] update readme (#317)

Summary:
Pull Request resolved: https://github.com/pytorch-labs/float8_experimental/pull/317

cleaning up the readme to reflect latest changes

Reviewed By: drisspg

Differential Revision: D59827460

fbshipit-source-id: aba3d31c6087ddfbf1892b86e31e058569770c50
---
 README.md | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 464e9b1..6102f5a 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,12 @@
 
 This is an early version of a library for accelerating training with float8 in native PyTorch
 according to the recipes laid out in https://arxiv.org/pdf/2209.05433.pdf.
-The codebase strives to stay small, easily hackable, and debuggable with native PyTorch tooling.
-``torch.compile`` is supported out of the box. With ``torch.compile`` on, initial results show
+The codebase strives to stay small, easily hackable, debuggable with native PyTorch tooling,
+and composable with key systems such as autograd, ``torch.compile`` and distributed.
+With ``torch.compile`` on, initial results show
 throughput speedups of up to 1.2x on small scale (8 GPUs) LLaMa pretraining jobs.
 
-:warning: See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features. Key features such as weight cast recomputation in backward and large scale distributed support are not ready yet.
+:warning: See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features.
 
 :warning: Backwards compatibility is not guaranteed at this point. The codebase
 is in active development and will change rapidly.
@@ -25,7 +26,7 @@ pip install -e .
 pip install -e ".[dev]"
 ```
 
-# User API
+# Single GPU User API
 
 We provide two per-tensor scaling strategies: dynamic and delayed. See https://arxiv.org/pdf/2209.05433.pdf, Section 4.3 for more details.
 These strategies are configurable separately for activations (`x`), weights (`w`) and gradients (`dL_dY`).
@@ -113,13 +114,11 @@ for _ in range(N_ITER):
     optimizer.step()
 ```
 
-# 🧭 Code Organization
+# Multi GPU User API
 
-* `float8_experimental/float8_linear.py`
-  - `Float8Linear` (main user facing entry point for Float8Linear)
-* `float8_experimental/float8_tensor.py`
-  - `Float8Tensor`, which allows `Float8Linear` to abide by the `x.dtype == x.grad.dtype` restriction
-  - `ScaledMMConfig` defines the semantics for matmul in the forward and backwards pass
+We compose with the `DTensor`-based [distributed APIs](https://pytorch.org/docs/stable/distributed.tensor.parallel.html),
+such as FSDP, TP and SP. Please see the [torchtitan](https://github.com/pytorch/torchtitan) repository for e2e examples
+of using `float8_experimental` in a distributed setting.
 
 # Testing
 
@@ -127,16 +126,20 @@ for _ in range(N_ITER):
 # run single-GPU unit tests
 pytest test/test_base.py
 
-# run a single-GPU integration test on SAM
-pytest test/test_sam.py
-
 # run single-GPU compile tests
 pytest test/test_compile.py
+
+# run single-GPU numerics integration tests
+pytest test/test_numerics_integration.py
+
 # run a two-GPU integration test on FSDP
 ./test/test_fsdp.sh
 
-# run integration tests for TP/SP (outdated)
-./test/test_tp.sh
+# run integration tests for the DTensor TP/SP integration
+./test/test_dtensor.sh
+
+# run integration tests for the FSDP2 integration
+python test/test_fsdp2/test_fsdp2_eager.py
 
 # run all of these tests
 ./test/test_everything.sh
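
For reviewers who want to try the "Single GPU User API" flow this README documents, the sketch below shows its general shape: swap `torch.nn.Linear` modules for their float8 counterparts, optionally compile, and train as usual. This is a minimal sketch, not code from the README itself: the `swap_linear_with_float8_linear` import path, its call with only the model (dynamic scaling assumed as the default), and the toy model, data, and optimizer are all assumptions about the library at this commit.

```python
# Minimal sketch of the "Single GPU User API" flow described in the README.
# Assumptions (not taken from this diff): `swap_linear_with_float8_linear`
# lives in `float8_experimental.float8_linear_utils`, converts nn.Linear
# modules in place, and defaults to dynamic scaling; a float8-capable GPU
# (e.g. H100) is available. The toy model, data, and optimizer are hypothetical.
import torch
import torch.nn as nn

from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

# toy model; any nn.Module containing nn.Linear layers works
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).to(device="cuda", dtype=torch.bfloat16)

# swap nn.Linear modules for float8 linears (dynamic scaling assumed as default)
swap_linear_with_float8_linear(model)

# optional: torch.compile for the speedups quoted in the README
model = torch.compile(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(4096, 1024, device="cuda", dtype=torch.bfloat16)

# toy training loop, mirroring the loop shown in the README's diff context
N_ITER = 10
for _ in range(N_ITER):
    optimizer.zero_grad()
    y = model(x)
    y.sum().backward()
    optimizer.step()
```

The delayed-scaling recipe additionally requires syncing amax/scale history once per iteration before `optimizer.step()`; that call is omitted here because its exact API is not shown in this diff.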