CRE-362: dynamic batching based on observation sizes #1226
Conversation
Force-pushed from ee85ff3 to 41a130e
	return allExecutionIDs, serialized, err
}

func (r *reportingPlugin) Query(_ context.Context, _ ocr3types.OutcomeContext) (types.Query, error) {
I'd like to see a benchmark test. I have no intuition about how expensive the proto serialization will be, nor how many rounds of trial and error we'll need.
That then turns into a concern about the performance budget of repeated serializations.
Good point. The algorithm I wrote is O(log2(n)), so e.g. 10^6 reports ends up with at most 20 (~19.93) optimization rounds (I covered that in one of the unit tests). Since it's a simple structure, serializing it a few times should not be an issue. I agree we should empirically verify how many rounds becomes "too slow" -- solidify that with a bench test, plus warning logs when we get close to a time limit.
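For illustration, here is a minimal Go sketch of the binary-search packing idea under stated assumptions: `size` is a stand-in for the proto-marshaling step, and all names are hypothetical, not the PR's actual code. It demonstrates the round count the comment above refers to.

```go
package main

import "fmt"

// fitCount binary-searches the largest prefix count whose serialized size
// fits the limit. size stands in for "serialize the first `count` items and
// measure the result"; rounds counts serialization attempts.
func fitCount(n int, size func(count int) int, limit int) (count, rounds int) {
	lo, hi := 0, n
	for lo < hi {
		rounds++
		mid := (lo + hi + 1) / 2 // bias upward so lo converges to the answer
		if size(mid) <= limit {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo, rounds
}

func main() {
	// 10^6 items of 100 bytes each against a 50 MB budget: half of them fit,
	// found in at most ~log2(10^6) = 20 serialization rounds.
	count, rounds := fitCount(1_000_000, func(c int) int { return c * 100 }, 50_000_000)
	fmt.Println(count, rounds)
}
```

A benchmark would then measure how long `rounds` real proto serializations take for realistic report sizes.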
… Serializable interface; code refactor;
type Serializable interface {
	Serialize(lggr logger.Logger) ([]string, []byte, error)
	Len() int
	Mid(mid int) Serializable
"Mid" is a bit confusing; at first I thought it returned the middle element. Maybe "Prefix" would be a more accurate name?
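To make the rename suggestion concrete, here is a toy sketch (logger dropped for brevity; `idList` and all other names are illustrative, not from the PR):

```go
package main

import "fmt"

// Serializable, with the suggested Prefix name: Prefix(n) returns a
// Serializable over the first n elements -- arguably clearer than "Mid",
// which reads like "the middle element".
type Serializable interface {
	Serialize() ([]string, []byte, error)
	Len() int
	Prefix(n int) Serializable
}

// idList is a toy implementation over a slice of IDs.
type idList []string

func (l idList) Serialize() ([]string, []byte, error) { return l, nil, nil }
func (l idList) Len() int                             { return len(l) }
func (l idList) Prefix(n int) Serializable            { return l[:n] }

func main() {
	var s Serializable = idList{"a", "b", "c"}
	fmt.Println(s.Prefix(2).Len()) // the first two elements
}
```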
}

weids := make([]string, 0, len(o.reqMap))
for k := range o.reqMap {
Random order will be problematic here. If the leader sends a lot of IDs in the query and every node contributes observations for a random subset of that query, there's no guarantee we will find quorums for any of the IDs later. We should follow the order from the Query and, if necessary, reduce to a prefix of that array.
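A minimal sketch of what "follow the Query's order and reduce to a prefix" could look like, assuming hypothetical names (`request`, `idsInQueryOrder`) rather than the PR's actual types:

```go
package main

import "fmt"

type request struct{} // stand-in for the real per-ID observation/request type

// idsInQueryOrder iterates in the Query's ID order rather than ranging over
// the map (whose iteration order is random), so every node that has to
// truncate ends up with the same prefix and quorums can still form.
func idsInQueryOrder(queryIDs []string, reqMap map[string]request, max int) []string {
	ordered := make([]string, 0, len(queryIDs))
	for _, id := range queryIDs {
		if _, ok := reqMap[id]; !ok {
			continue // no local observation for this ID
		}
		ordered = append(ordered, id)
		if len(ordered) == max {
			break // reduce to a prefix of the Query's order
		}
	}
	return ordered
}

func main() {
	reqMap := map[string]request{"b": {}, "a": {}, "c": {}}
	fmt.Println(idsInQueryOrder([]string{"a", "b", "c"}, reqMap, 2)) // [a b]
}
```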
// It finds the best utilization of space for the protobuf-marshalled structures, using a logarithmic
// (binary-search) approach to identify the optimal number of Requests that can be serialized without
// exceeding the limit (defaultBatchSizeMiB).
func packToSizeLimit(lggr logger.Logger, all Serializable) ([]string, []byte, error) {
I'm starting to doubt that bin-search is actually useful.
- Query elements are all of the same size so we can simply divide max by that size (minus some buffer).
- Observations can be less predictable so that's the only place where bin search makes sense.
- Outcome already does a relatively expensive computation for each request (aggregation). I can't imagine proto marshaling being meaningful compared to that. So we can simply marshal them one-by-one as we process and stop early if we happen to hit the limit (again, with some buffer).
As much as I appreciate the generalization of the approach behind a Serializable interface, I think it causes a lot of unnecessary churn in the existing code.
How about we handle Query and Outcome in a simpler way? If you want, you can still use bin-search for observations (maybe inline for simplicity). Or we could also run an experiment to measure how expensive it is to proto-marshal one-by-one until we hit the limit, compared to bin-search. If marshaling cost is proportional to input size then we can do the simple thing. Or maybe the overhead of every call is super high?
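For comparison, a sketch of the "marshal one-by-one, stop early" alternative described above; `marshalItem` stands in for per-request proto marshaling, `limit` is assumed to already include the safety buffer, and all names are illustrative:

```go
package main

import "fmt"

// packUntilLimit marshals items one at a time and stops as soon as the next
// item would push the total past the budget. Marshaling cost is then
// proportional to the packed output, with no repeated full serializations.
func packUntilLimit(n int, marshalItem func(i int) []byte, limit int) (count, total int) {
	for i := 0; i < n; i++ {
		b := marshalItem(i)
		if total+len(b) > limit {
			break // the next item would exceed the budget; stop early
		}
		total += len(b)
		count++
	}
	return count, total
}

func main() {
	// 10 items of 10 bytes each against a 35-byte budget: 3 items, 30 bytes.
	count, total := packUntilLimit(10, func(int) []byte { return make([]byte, 10) }, 35)
	fmt.Println(count, total) // 3 30
}
```

The experiment mentioned above would compare this loop's per-call overhead against the binary-search variant's repeated whole-prefix serializations.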
	mid = 1 // poor man's ceil
}
candidate := all.Mid(mid)
executionIDs, serialized, err := candidate.Serialize(lggr)
We should probably be extra careful here in case the leader goes crazy and sends a very large query. Maybe a maxmax const for protection?
I also realized that we don't have deduplication of request IDs inside the query - something we could add.
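Both suggestions could be combined into one guard pass over the Query; a minimal sketch, where `maxQueryLen` is the hypothetical protection constant proposed above:

```go
package main

import "fmt"

// dedupeAndCap drops duplicate request IDs while preserving first-seen order,
// and caps the result at maxQueryLen as protection against an oversized
// Query from a misbehaving leader.
func dedupeAndCap(ids []string, maxQueryLen int) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // drop duplicate IDs
		}
		seen[id] = struct{}{}
		out = append(out, id)
		if len(out) == maxQueryLen {
			break // refuse to process an arbitrarily large query
		}
	}
	return out
}

func main() {
	fmt.Println(dedupeAndCap([]string{"a", "b", "a", "c"}, 2)) // [a b]
}
```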
This PR is stale because it has been open 30 days with no activity.