Add weight streaming #3222
base: develop
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #3222   +/-  ##
========================================
  Coverage    92.17%   92.17%
========================================
  Files          512      512
  Lines        21387    21387
========================================
  Hits         19714    19714
  Misses        1673     1673
```

☔ View full report in Codecov by Sentry.
17 of 18 test failures right now for

Doing some very basic perf testing, adding the literals onto the stream with copies seems to very marginally improve performance. I'm still not sure whether the tests can be ignored, since this is a new feature.
```diff
@@ -40,6 +41,8 @@ struct compile_options
     bool fast_math = true;
     bool exhaustive_tune = false;
+    bool weight_streaming = false;
```
I don't think we should have a flag to enable weight streaming: if you run out of memory, then it is the only way to run the model, so it should be enabled by default.
How would we determine whether the model runs out of memory? Unless the model is first compiled and run without weight streaming to test whether there is sufficient memory, but that would also require re-compiling anyways after a failure.
We can take the estimated scratch + literals + parameters to determine how much memory the model will use. The estimated scratch might be an overestimate, but this is a borderline case. It's better to underuse the memory than to overuse it and fail. Furthermore, if the user really wants to run it on the GPU, they can set the max memory larger.
```diff
@@ -40,6 +41,8 @@ struct compile_options
     bool fast_math = true;
     bool exhaustive_tune = false;
+    bool weight_streaming = false;
+    long max_memory = LONG_MAX;
```
Actually, thinking about this more, I think we want it set to 0 by default, which would mean using the memory available on the GPU; the user can then set the max to always disable weight streaming. Also, this should probably be of `std::size_t` type.
What would be the point of setting the max if it always disables weight streaming? I understand the idea of 0 as the default (i.e. 0 = make it run no matter what), but if setting the max disables weight streaming, then it would either (a) always fail with not enough memory, or (b) always work, since the max is greater than the size of the literals.
This build is not recommended to merge 🔴

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
Add weight streaming to allow running large models on GPUs with low memory.
Closes #3156.