CRE-362: dynamic batching based on observation sizes #1226
Conversation
Force-pushed from ee85ff3 to 41a130e
	return allExecutionIDs, serialized, err
}

func (r *reportingPlugin) Query(_ context.Context, _ ocr3types.OutcomeContext) (types.Query, error) {
I'd like to see a benchmark test. I have no intuition about how expensive the proto serialization will be, nor how many rounds of trial and error we'll need.
That then turns into a concern about the performance budget of repeated serializations.
Good point. The algorithm I wrote is O(log2(n)), so e.g. 10^6 reports ends up with at most 20 (~19.93) optimization rounds (I covered that in one of the unit tests). Since it's a simple structure, serializing it a few times should not be an issue. I agree we should empirically verify how many rounds becomes "too slow" -- solidify that with a bench test, plus warning logs when we get close to a time limit.
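For illustration, here is a minimal Go sketch of the binary-search packing idea under stated assumptions: `size` is a stand-in for the proto-marshaling step, and all names are hypothetical, not the PR's actual code. It demonstrates the round count the comment above refers to.

```go
package main

import "fmt"

// fitCount binary-searches the largest prefix count whose serialized size
// fits the limit. size stands in for "serialize the first `count` items and
// measure the result"; rounds counts serialization attempts.
func fitCount(n int, size func(count int) int, limit int) (count, rounds int) {
	lo, hi := 0, n
	for lo < hi {
		rounds++
		mid := (lo + hi + 1) / 2 // bias upward so lo converges to the answer
		if size(mid) <= limit {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo, rounds
}

func main() {
	// 10^6 items of 100 bytes each against a 50 MB budget: half of them fit,
	// found in at most ~log2(10^6) = 20 serialization rounds.
	count, rounds := fitCount(1_000_000, func(c int) int { return c * 100 }, 50_000_000)
	fmt.Println(count, rounds)
}
```

A benchmark would then measure how long `rounds` real proto serializations take for realistic report sizes.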
… Serializable interface; code refactor;
type Serializable interface {
	Serialize(lggr logger.Logger) ([]string, []byte, error)
	Len() int
	Mid(mid int) Serializable
"Mid" is a bit confusing; at first I thought it returned the middle element. Maybe "Prefix" would be a more accurate name?
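To make the rename suggestion concrete, here is a toy sketch (logger dropped for brevity; `idList` and all other names are illustrative, not from the PR):

```go
package main

import "fmt"

// Serializable, with the suggested Prefix name: Prefix(n) returns a
// Serializable over the first n elements -- arguably clearer than "Mid",
// which reads like "the middle element".
type Serializable interface {
	Serialize() ([]string, []byte, error)
	Len() int
	Prefix(n int) Serializable
}

// idList is a toy implementation over a slice of IDs.
type idList []string

func (l idList) Serialize() ([]string, []byte, error) { return l, nil, nil }
func (l idList) Len() int                             { return len(l) }
func (l idList) Prefix(n int) Serializable            { return l[:n] }

func main() {
	var s Serializable = idList{"a", "b", "c"}
	fmt.Println(s.Prefix(2).Len()) // the first two elements
}
```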
}

weids := make([]string, 0, len(o.reqMap))
for k := range o.reqMap {
Random order will be problematic here. If the leader sends a lot of IDs in the query and every node contributes observations for a random subset of that query, there's no guarantee we will find quorums for any of the IDs later. We should follow the order from the Query and, if necessary, reduce to a prefix of that array.
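A minimal sketch of what "follow the Query's order and reduce to a prefix" could look like, assuming hypothetical names (`request`, `idsInQueryOrder`) rather than the PR's actual types:

```go
package main

import "fmt"

type request struct{} // stand-in for the real per-ID observation/request type

// idsInQueryOrder iterates in the Query's ID order rather than ranging over
// the map (whose iteration order is random), so every node that has to
// truncate ends up with the same prefix and quorums can still form.
func idsInQueryOrder(queryIDs []string, reqMap map[string]request, max int) []string {
	ordered := make([]string, 0, len(queryIDs))
	for _, id := range queryIDs {
		if _, ok := reqMap[id]; !ok {
			continue // no local observation for this ID
		}
		ordered = append(ordered, id)
		if len(ordered) == max {
			break // reduce to a prefix of the Query's order
		}
	}
	return ordered
}

func main() {
	reqMap := map[string]request{"b": {}, "a": {}, "c": {}}
	fmt.Println(idsInQueryOrder([]string{"a", "b", "c"}, reqMap, 2)) // [a b]
}
```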
// It finds the best utilization of space for the protobuf-marshalled structures, using a logarithmic
// (binary-search) approach to identify the optimal number of Requests that can be serialized without
// exceeding the limit (defaultBatchSizeMiB).
func packToSizeLimit(lggr logger.Logger, all Serializable) ([]string, []byte, error) {
I'm starting to doubt that bin-search is actually useful.
- Query elements are all of the same size so we can simply divide max by that size (minus some buffer).
- Observations can be less predictable so that's the only place where bin search makes sense.
- Outcome already does a relatively expensive computation for each request (aggregation). I can't imagine proto marshaling being meaningful compared to that. So we can simply marshal them one-by-one as we process and stop early if we happen to hit the limit (again, with some buffer).
As much as I appreciate the generalization of the approach behind a Serializable interface, I think it causes a lot of unnecessary churn in the existing code.
How about we handle Query and Outcome in a simpler way? If you want, you can still use bin-search for observations (maybe inline for simplicity). Or we could also run an experiment to measure how expensive it is to proto-marshal one-by-one until we hit the limit, compared to bin-search. If marshaling cost is proportional to input size then we can do the simple thing. Or maybe the overhead of every call is super high?
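For comparison, a sketch of the "marshal one-by-one, stop early" alternative described above; `marshalItem` stands in for per-request proto marshaling, `limit` is assumed to already include the safety buffer, and all names are illustrative:

```go
package main

import "fmt"

// packUntilLimit marshals items one at a time and stops as soon as the next
// item would push the total past the budget. Marshaling cost is then
// proportional to the packed output, with no repeated full serializations.
func packUntilLimit(n int, marshalItem func(i int) []byte, limit int) (count, total int) {
	for i := 0; i < n; i++ {
		b := marshalItem(i)
		if total+len(b) > limit {
			break // the next item would exceed the budget; stop early
		}
		total += len(b)
		count++
	}
	return count, total
}

func main() {
	// 10 items of 10 bytes each against a 35-byte budget: 3 items, 30 bytes.
	count, total := packUntilLimit(10, func(int) []byte { return make([]byte, 10) }, 35)
	fmt.Println(count, total) // 3 30
}
```

The experiment mentioned above would compare this loop's per-call overhead against the binary-search variant's repeated whole-prefix serializations.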
	mid = 1 // poor man's ceil
}
candidate := all.Mid(mid)
executionIDs, serialized, err := candidate.Serialize(lggr)
We should probably be extra careful here in case the leader goes crazy and sends a very large query. Maybe a maxmax const for protection?
I also realized that we don't have deduplication of request IDs inside the query - something we could add.
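Both suggestions could be combined into one guard pass over the Query; a minimal sketch, where `maxQueryLen` is the hypothetical protection constant proposed above:

```go
package main

import "fmt"

// dedupeAndCap drops duplicate request IDs while preserving first-seen order,
// and caps the result at maxQueryLen as protection against an oversized
// Query from a misbehaving leader.
func dedupeAndCap(ids []string, maxQueryLen int) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // drop duplicate IDs
		}
		seen[id] = struct{}{}
		out = append(out, id)
		if len(out) == maxQueryLen {
			break // refuse to process an arbitrarily large query
		}
	}
	return out
}

func main() {
	fmt.Println(dedupeAndCap([]string{"a", "b", "a", "c"}, 2)) // [a b]
}
```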
This PR is stale because it has been open 30 days with no activity.