USP latency test #416
When we conduct experiments on the H100 with CUDA 12 and Torch 2.5.1, both `flux_example.py` and `flux_usp_example.py` (with and without `torch.compile`) exhibit comparable performance. Inference in all configurations consistently completes within a few seconds. What are the versions of the CUDA Runtime and torch on your machine?
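For anyone reproducing this, a minimal sketch for reporting the versions in question; these are standard PyTorch APIs, and the labels are just for illustration:

```python
import torch

print(f"torch: {torch.__version__}")
print(f"CUDA runtime (torch build): {torch.version.cuda}")
print(f"cuDNN: {torch.backends.cudnn.version()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```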
Hello @xibosun! Interesting. I am using: […] I have also: […]
I recall we tested Flux on the H100 using FlashAttention (FA) version 2.7.1. Could you adjust the FA version to see whether performance changes? Unfortunately, we do not currently have access to H100 machines; we would be happy to assist in checking the latency if we had one available.
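A quick way to confirm which FA build is active (the `flash_attn` package exposes a standard `__version__` attribute):

```python
import flash_attn

# The thread compares behavior against FA 2.7.1; pin with e.g.
#   pip install flash-attn==2.7.1
print(flash_attn.__version__)
```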
Hello,
On an 8xH100 80GB node, when running:
I get the following results:
Meanwhile, running `flux_example.py` instead of `flux_usp_example.py` (is this even the intended usage?) produces:
@feifeibear, are the scripts and timings working as expected? In the latter case, the true wall-clock time is much closer to the former case than to a couple of seconds. Did you use the same scripts and timing points to produce the results in performance/flux.md?
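Since the question is about where the timing points sit, here is a generic sketch of how end-to-end latency is often measured around a diffusion call. `pipe` and `prompt` are placeholders, and this is not claimed to be the repo's own timing code; omitting the `torch.cuda.synchronize()` calls can make a run appear to finish "within a few seconds" while kernels are still queued:

```python
import time
import torch

def timed_run(pipe, prompt, warmup=1, iters=3):
    # Warmup runs absorb one-time costs (CUDA context init, torch.compile tracing).
    for _ in range(warmup):
        pipe(prompt)
    torch.cuda.synchronize()  # ensure all queued GPU work has finished
    start = time.perf_counter()
    for _ in range(iters):
        pipe(prompt)
    torch.cuda.synchronize()  # without this, async kernel launches skew the timing
    return (time.perf_counter() - start) / iters
```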
Another example (without and with `torch.compile`, default mode):
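As a side note on what "default mode" means here, a minimal `torch.compile` illustration; a toy module stands in for the actual pipeline:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
compiled = torch.compile(model)  # mode defaults to "default"

x = torch.randn(8, 1024, device="cuda")
_ = compiled(x)  # first call triggers compilation and is much slower
_ = compiled(x)  # subsequent calls run the compiled graph
```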