Add weight streaming #3222

Open
wants to merge 78 commits into base: develop

Conversation

eddieliao
Contributor

Add weight streaming to allow running large models on GPUs with low memory.

Closes #3156.
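For context, a minimal usage sketch of the options this PR adds (the two fields come from the diff below; the parse/compile calls and header paths are how the existing MIGraphX C++ API is typically used and are not part of this change):

```cpp
#include <migraphx/onnx.hpp>
#include <migraphx/program.hpp>
#include <migraphx/compile_options.hpp>
#include <migraphx/register_target.hpp>

int main()
{
    // Parse a model that is too large to fit in GPU memory all at once.
    migraphx::program p = migraphx::parse_onnx("large_model.onnx");

    migraphx::compile_options options;
    options.weight_streaming = true;       // field added in this PR
    options.max_memory       = 8LL << 30;  // field added in this PR: ~8 GiB budget

    // Compile for the GPU target; weights beyond the budget are streamed at runtime.
    p.compile(migraphx::make_target("gpu"), options);
    return 0;
}
```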

eddieliao self-assigned this Jun 26, 2024

codecov bot commented Jul 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.17%. Comparing base (ddc4c0c) to head (0fc282e).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3222   +/-   ##
========================================
  Coverage    92.17%   92.17%           
========================================
  Files          512      512           
  Lines        21387    21387           
========================================
  Hits         19714    19714           
  Misses        1673     1673           


@eddieliao
Contributor Author

17 of the 18 test failures right now for test/schedule_test are due to the check EXPECT(not t.has_stream(one));. This fails because I have added both literals and their associated copy_to_gpu to a stream. Should literals therefore not be included on the stream?

@eddieliao
Contributor Author

From some very basic perf testing, adding the literals onto the stream with copies seems to marginally improve performance. I'm still not sure whether the tests can be ignored, since this is a new feature.

@@ -40,6 +41,8 @@ struct compile_options

bool fast_math = true;
bool exhaustive_tune = false;
bool weight_streaming = false;
Collaborator

I don't think we should have a flag to enable weight streaming; if you run out of memory, it is the only way to run the model, so it should be enabled by default.

Contributor Author

How would we determine whether the model runs out of memory? Unless the model is first compiled and run without weight streaming to test whether there is sufficient memory, but that would also require re-compiling after a failure anyway.

Collaborator

We can take the estimated scratch + literals + parameters to determine how much memory the model will use. The estimated scratch might be an overestimate, but that is a borderline case. It's better to under-use the memory than to over-use it and fail. Furthermore, if the user really wants to run it on the GPU, they can set the max memory larger.
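As a rough sketch of the heuristic being suggested here (all names are illustrative and not from the MIGraphX codebase; this only restates the estimate described above):

```cpp
#include <cstddef>

// Decide whether weight streaming is needed by comparing the estimated
// footprint (scratch + literals + parameters) against the memory budget.
// Hypothetical helper; the real compile pipeline is not shown in this thread.
bool needs_weight_streaming(std::size_t estimated_scratch_bytes,
                            std::size_t literal_bytes,
                            std::size_t parameter_bytes,
                            std::size_t memory_budget_bytes)
{
    const std::size_t estimated_total =
        estimated_scratch_bytes + literal_bytes + parameter_bytes;
    // Prefer under-using memory: stream whenever the estimate exceeds the
    // budget, even if the scratch estimate is somewhat pessimistic.
    return estimated_total > memory_budget_bytes;
}
```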

src/program.cpp (outdated review thread)
src/schedule.cpp (outdated review thread)
src/targets/gpu/target.cpp (outdated review thread)
@@ -40,6 +41,8 @@ struct compile_options

bool fast_math = true;
bool exhaustive_tune = false;
bool weight_streaming = false;
long max_memory = LONG_MAX;
Collaborator

Actually, thinking about this more, I think we want it set to 0 by default, which would mean use the memory available on the GPU, and then the user can set it to the max to always disable weight streaming.

Also, this should probably be std::size_t type.
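A sketch of what that suggestion might look like in compile_options (hypothetical; the diff above still shows `long max_memory = LONG_MAX;` alongside a separate weight_streaming flag):

```cpp
#include <cstddef>

struct compile_options
{
    bool fast_math       = true;
    bool exhaustive_tune = false;
    // 0 (default) means the budget is whatever memory is available on the GPU;
    // setting it to the maximum std::size_t value effectively disables
    // streaming by making the budget unreachable.
    std::size_t max_memory = 0;
};
```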

Contributor Author

What would be the point of setting the max if it always disables weight streaming? I understand the idea of 0 as the default (i.e. 0 = make it run no matter what), but if setting the max disables weight streaming, then it would either (a) always fail with not enough memory, or (b) always work since the max is greater than the size of the literals.

umangyadav removed their request for review August 17, 2024 13:18
@migraphx-bot
Collaborator

Test  Batch  Rate new (0fc282)  Rate old (ddc4c0)  Diff  Compare
torchvision-resnet50 64 3,261.48 3,261.96 -0.01%
torchvision-resnet50_fp16 64 6,988.20 6,986.50 0.02%
torchvision-densenet121 32 2,436.70 2,435.05 0.07%
torchvision-densenet121_fp16 32 4,076.48 4,098.64 -0.54%
torchvision-inceptionv3 32 1,638.92 1,637.72 0.07%
torchvision-inceptionv3_fp16 32 2,762.55 2,760.21 0.08%
cadene-inceptionv4 16 775.80 776.83 -0.13%
cadene-resnext64x4 16 810.72 812.01 -0.16%
slim-mobilenet 64 7,532.36 7,539.31 -0.09%
slim-nasnetalarge 64 211.49 211.58 -0.04%
slim-resnet50v2 64 3,505.22 3,507.45 -0.06%
bert-mrpc-onnx 8 1,147.43 1,148.38 -0.08%
bert-mrpc-tf 1 493.92 462.39 6.82% 🔆
pytorch-examples-wlang-gru 1 434.13 416.01 4.36% 🔆
pytorch-examples-wlang-lstm 1 405.95 387.25 4.83% 🔆
torchvision-resnet50_1 1 797.47 799.30 -0.23%
cadene-dpn92_1 1 398.08 413.07 -3.63% 🔴
cadene-resnext101_1 1 384.99 384.54 0.12%
onnx-taau-downsample 1 343.18 343.19 -0.00%
dlrm-criteoterabyte 1 33.33 33.33 0.02%
dlrm-criteoterabyte_fp16 1 52.74 52.75 -0.02%
agentmodel 1 8,118.46 8,254.64 -1.65%
unet_fp16 2 58.84 58.74 0.17%
resnet50v1_fp16 1 940.57 942.04 -0.16%
resnet50v1_int8 1 996.35 993.52 0.28%
bert_base_cased_fp16 64 1,170.71 1,171.05 -0.03%
bert_large_uncased_fp16 32 363.57 363.48 0.02%
bert_large_fp16 1 199.12 200.70 -0.79%
distilgpt2_fp16 16 2,202.47 2,201.26 0.05%
yolov5s 1 533.05 537.45 -0.82%
tinyllama 1 43.41 43.68 -0.62%
vicuna-fastchat 1 179.88 175.99 2.21%
whisper-tiny-encoder 1 416.38 418.69 -0.55%
whisper-tiny-decoder 1 428.82 428.55 0.06%

This build is not recommended to merge 🔴

@migraphx-bot
Collaborator


     ✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

     ✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

     ✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

     ✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

     ✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

     ✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

     ✅ agentmodel: PASSED: MIGraphX meets tolerance

     ✅ unet: PASSED: MIGraphX meets tolerance

     ✅ resnet50v1: PASSED: MIGraphX meets tolerance

     ✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

     🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output


     ✅ bert_large: PASSED: MIGraphX meets tolerance

     ✅ yolov5s: PASSED: MIGraphX meets tolerance

     ✅ tinyllama: PASSED: MIGraphX meets tolerance

     ✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

     ✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

     ✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

kahmed10 self-assigned this Oct 23, 2024
Labels: enhancement (New feature or request), UAI (Ubuntu related changes for Ubuntu Environments), Windows (Related changes for Windows Environments)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add weight streaming at runtime
5 participants