feat(dot/sync): improve worker pool #4258

haikoschol · 2024-10-15T12:31:21Z

The main difference in the worker pool API is that SubmitBatch() does not block until the whole batch has been processed. Instead, it returns an ID which can be used to retrieve the current state of the batch. In addition, Results() returns a channel over which task results are sent as they become available.

The main improvement this brings is increased concurrency, since results can be processed before the whole batch has been completed.

What has not changed is the overall flow of the Strategy interface: getting a new batch of tasks with NextActions() and processing the results with Process().
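As a rough illustration of the API shape described above, here is a minimal sketch of a pool whose SubmitBatch() returns immediately and whose Results() channel delivers results as they complete. It is not the code in dot/sync/worker_pool.go: the pool type, the Pending() helper, echoTask, and the one-goroutine-per-task model are all assumptions made to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a unit of work. The real interface in dot/sync may differ.
type Task interface {
	ID() string
	Do() (string, error)
}

// TaskResult pairs a task with its outcome.
type TaskResult struct {
	Task   Task
	Result string
	Err    error
}

// BatchID identifies a submitted batch.
type BatchID int

// pool is a toy worker pool (hypothetical, for illustration only).
type pool struct {
	mu      sync.Mutex
	nextID  BatchID
	pending map[BatchID]int // tasks not yet completed, per batch
	results chan TaskResult
}

func newPool(capacity int) *pool {
	return &pool{
		pending: map[BatchID]int{},
		results: make(chan TaskResult, capacity),
	}
}

// SubmitBatch starts every task in its own goroutine and returns at once.
func (p *pool) SubmitBatch(tasks []Task) BatchID {
	p.mu.Lock()
	id := p.nextID
	p.nextID++
	p.pending[id] = len(tasks)
	p.mu.Unlock()

	for _, t := range tasks {
		go func(t Task) {
			res, err := t.Do()
			// Update the batch state before publishing the result.
			p.mu.Lock()
			p.pending[id]--
			p.mu.Unlock()
			p.results <- TaskResult{Task: t, Result: res, Err: err}
		}(t)
	}
	return id
}

// Pending reports how many tasks of a batch are still outstanding.
func (p *pool) Pending(id BatchID) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.pending[id]
}

// Results exposes the channel on which results arrive as they complete.
func (p *pool) Results() <-chan TaskResult { return p.results }

// echoTask is a trivial task used only for this example.
type echoTask string

func (e echoTask) ID() string          { return string(e) }
func (e echoTask) Do() (string, error) { return "done:" + string(e), nil }

func main() {
	p := newPool(8)
	id := p.SubmitBatch([]Task{echoTask("a"), echoTask("b"), echoTask("c")})
	// Results can be consumed one by one, before the batch is complete.
	for i := 0; i < 3; i++ {
		r := <-p.Results()
		fmt.Println("batch", id, "task", r.Task.ID(), "->", r.Result)
	}
	fmt.Println("pending:", p.Pending(id))
}
```

The caller can start handling the first result while the remaining tasks are still running, which is where the increased concurrency comes from.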

Changes

replaced the code in dot/sync/worker_pool.go
adapted SyncService to the API changes of the new worker pool
adapted some expectations in tests regarding how often some mocks are called (hopefully without changing the logic being tested)

Tests

go test github.com/ChainSafe/gossamer/dot/sync

Issues

Closes #4198

CLAassistant · 2024-10-15T12:31:27Z

All committers have signed the CLA.

haikoschol · 2024-10-15T12:33:09Z

Created as a draft for two reasons:

I'd like to run a sync from scratch on Westend and/or Paseo as a regression test for a while.
To discuss and possibly address this TODO:

// SubmitBatch accepts a list of tasks and immediately returns a batch ID. The batch ID can be used to query the status
// of the batch using [GetBatchStatus].
// TODO
// If tasks are submitted faster than they are completed, resChan will run full, blocking the calling goroutine.
// Ideally this method would provide backpressure to the caller in that case. The rejected tasks should then stay in
// FullSyncStrategy.requestQueue until the next round. But this would need to be supported in all sync strategies.
func (w *workerPool) SubmitBatch(tasks []Task) (id BatchID, err error) {

dimartiro · 2024-10-21T13:46:20Z

dot/sync/service.go

+		workerPool: NewWorkerPool(WorkerPoolConfig{
+			MaxRetries: 5,
+			// TODO: This should depend on the actual configuration of the currently used sync strategy.
+			Capacity: defaultNumOfTasks * 10,


Why times 10?

Just an arbitrary value that "should be enough" while I was testing the branch on the Westend genesis instance. Apart from wanting to do this testing, this TODO and the other one about back pressure are the reasons why this is still a draft.
Do you think it makes sense to use the actual value for the number of tasks configured in the sync strategy? And should we add some extra space here or not?

OK, got it. We can move it to a named constant so that its purpose is self-explanatory.

I'd like to try and make this depend on the configuration of the strategy. But if that's too invasive, I'll move it to a constant.

I've done this in b6cf4d7. It required extending the Strategy interface and delaying initialization of the worker pool.
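For readers who want a picture of what "extending the Strategy interface and delaying initialization" could look like, here is a small sketch. It is not the change in b6cf4d7: the method name NumOfTasks(), the capacityFactor headroom, and the Start() method are assumptions. Only WorkerPoolConfig, MaxRetries, and Capacity are taken from the diff above.

```go
package main

import "fmt"

// Strategy is reduced to the one method that matters for this example.
type Strategy interface {
	// NumOfTasks is an assumed name for "how many tasks this strategy
	// hands out per round".
	NumOfTasks() int
}

// WorkerPoolConfig mirrors the fields shown in the diff above.
type WorkerPoolConfig struct {
	MaxRetries int
	Capacity   int
}

type workerPool struct{ cfg WorkerPoolConfig }

// NewWorkerPool is a stand-in constructor for this sketch.
func NewWorkerPool(cfg WorkerPoolConfig) *workerPool {
	return &workerPool{cfg: cfg}
}

// capacityFactor is hypothetical headroom on top of the strategy's batch size.
const capacityFactor = 2

// SyncService creates its pool lazily, once the strategy is known.
type SyncService struct {
	strategy Strategy
	pool     *workerPool // nil until Start is called
}

// Start derives the pool capacity from the strategy configuration.
func (s *SyncService) Start() {
	s.pool = NewWorkerPool(WorkerPoolConfig{
		MaxRetries: 5,
		Capacity:   s.strategy.NumOfTasks() * capacityFactor,
	})
}

// fullSync is a stand-in for a concrete strategy.
type fullSync struct{ numOfTasks int }

func (f fullSync) NumOfTasks() int { return f.numOfTasks }

func main() {
	svc := &SyncService{strategy: fullSync{numOfTasks: 3}}
	svc.Start()
	fmt.Println("pool capacity:", svc.pool.cfg.Capacity)
}
```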

dimartiro · 2024-10-21T15:35:16Z

dot/sync/fullsync.go

-					},
-				})
-			}
+	task, ok := result.Task.(*syncTask)


What if we made result.Task generic, so we can skip this cast?

I considered making this generic and decided against it, but I can't remember why. Will give it a try.

Just tried it. The reason I didn't do this is that SyncService is supposed to maintain the worker pool and work with any kind of Strategy. This won't work if the worker pool is a generic type instantiated with *syncTask.

With the current implementation, only the strategies themselves know the type of the tasks processed by the worker pool.

Good point. That makes me think that maybe the strategies should be responsible for creating their own worker pool.

I'm still in favor of having the service manage one pool that is used for all strategies. I think it's easier to manage in terms of concurrency and gives a good separation of concerns. Is it worth giving that up to avoid a few casts? The strategies definitely know what concrete type they use as sync tasks.

100% agree that the service should be responsible for the worker pool, but I'm still trying to find a way to improve the types. We can keep the casts in the meantime.
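The trade-off discussed in this thread can be sketched as follows: the pool and service only see a Task interface, and each strategy recovers its own concrete type with a checked type assertion. The fields of syncTask, the otherTask type, processResult(), and errUnexpectedTask are assumptions for illustration and not the actual definitions in dot/sync/fullsync.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Task is all the pool and the service know about.
type Task interface{ ID() string }

// TaskResult is what the pool hands back; it is not generic.
type TaskResult struct {
	Task Task
}

// syncTask stands in for the concrete task type of one strategy.
type syncTask struct {
	id         string
	startBlock uint
}

func (t *syncTask) ID() string { return t.id }

// otherTask belongs to a different, hypothetical strategy.
type otherTask struct{}

func (otherTask) ID() string { return "other" }

var errUnexpectedTask = errors.New("unexpected task type")

// processResult shows the checked type assertion a strategy performs.
func processResult(result TaskResult) (uint, error) {
	task, ok := result.Task.(*syncTask)
	if !ok {
		return 0, fmt.Errorf("%w: %T", errUnexpectedTask, result.Task)
	}
	return task.startBlock, nil
}

func main() {
	n, err := processResult(TaskResult{Task: &syncTask{id: "t1", startBlock: 42}})
	fmt.Println(n, err)

	_, err = processResult(TaskResult{Task: otherTask{}})
	fmt.Println(err)
}
```

Using the two-value form of the assertion means a task of the wrong type surfaces as an error instead of a panic, which keeps the cost of the non-generic design low.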

haikoschol · 2024-10-24T12:41:13Z

To discuss and possibly address this TODO:

I've addressed the TODO in f4a9ccc. Not sure if this is the right approach, but I think it is better than always adding new tasks, risking a blocked goroutine.
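The idea of not asking the strategy for more tasks while the pool is behind can be modelled with a simple capacity check. This is a single-threaded sketch under assumed names (HasCapacity, Submit, Complete, runRound) and a counter instead of real tasks; the actual check in f4a9ccc may look quite different.

```go
package main

import "fmt"

// pool tracks how many tasks are in flight against a fixed capacity.
type pool struct {
	capacity int
	inFlight int
}

// HasCapacity reports whether n more tasks fit without blocking.
func (p *pool) HasCapacity(n int) bool { return p.inFlight+n <= p.capacity }

// Submit records n newly submitted tasks.
func (p *pool) Submit(n int) { p.inFlight += n }

// Complete records n finished tasks, never dropping below zero.
func (p *pool) Complete(n int) {
	p.inFlight -= n
	if p.inFlight < 0 {
		p.inFlight = 0
	}
}

// strategy hands out a fixed number of tasks per round.
type strategy struct {
	batch     int
	handedOut int
}

// NextActions returns the size of the next batch of tasks.
func (s *strategy) NextActions() int {
	s.handedOut += s.batch
	return s.batch
}

// runRound performs one iteration of a simplified sync loop. It reports
// whether new tasks were submitted; when it returns false the caller
// would go straight to processing results.
func runRound(p *pool, s *strategy) bool {
	if !p.HasCapacity(s.batch) {
		return false // pool is behind: do not call NextActions
	}
	p.Submit(s.NextActions())
	return true
}

func main() {
	p := &pool{capacity: 4}
	s := &strategy{batch: 2}
	for i := 0; i < 3; i++ {
		fmt.Println("round", i, "submitted:", runRound(p, s))
	}
	p.Complete(2) // two results were processed
	fmt.Println("after completion, submitted:", runRound(p, s))
}
```

Because rejected rounds never call NextActions(), the tasks stay with the strategy until the pool has room again, which avoids blocking the calling goroutine on a full result channel.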

The main difference in the worker pool API is that SubmitBatch() does not block until the whole batch has been processed. Instead, it returns an ID which can be used to retrieve the current state of the batch. In addition, Results() returns a channel over which task results are sent as they become available. The main improvement this brings is increased concurrency, since results can be processed before the whole batch has been completed. What has not changed is the overall flow of the Strategy interface; getting a new batch of tasks with NextActions() and processing the results with Process(). Closes #4232

haikoschol force-pushed the haiko/sync-worker-pool branch 5 times, most recently from 4c0d5cb to c875d08, on October 18, 2024 03:18

dimartiro reviewed Oct 21, 2024

dimartiro requested changes Oct 21, 2024

dimartiro reviewed Oct 21, 2024

dimartiro assigned haikoschol Oct 22, 2024

haikoschol force-pushed the haiko/sync-worker-pool branch from 0588ed0 to a828b43 on October 24, 2024 12:35

haikoschol force-pushed the haiko/sync-worker-pool branch 3 times, most recently from 5b1f455 to 03f66ec, on October 28, 2024 12:29

haikoschol force-pushed the haiko/sync-worker-pool branch 4 times, most recently from e10cc50 to b64d894, on November 6, 2024 10:24

haikoschol added 9 commits November 6, 2024 22:03

chore(dot/sync): move syncTask to fullsync.go

e6ac318

feat(dot/sync): use UUID for syncTask.ID()

49c9d17

chore(dot/sync): use constant for maxTaskRetries

1f6e72f

chore(dot/sync): narrow the type of syncTask.request

3a6a940

feat(dot/sync): make worker pool capacity depend on strategy config

ce99ab7

feat(dot/sync): avoid overwhelming the worker pool

22236a7

When the worker pool falls behind processing tasks, the service won't ask the strategy for more tasks and instead directly runs Process() again.

remove unused error return value from WorkerPool.SubmitBatch()

4e4ff3c

don't limit retries on sync tasks

3af5e2a

It doesn't make sense to give up since we need the blocks in that request to progress.

haikoschol added 2 commits November 6, 2024 22:03

remove peers with network errors from worker pool

3accdcb

ensure ID is included in syncTask.String()

dc9040b

haikoschol force-pushed the haiko/sync-worker-pool branch 6 times, most recently from 1b5b722 to 3502cd4, on November 7, 2024 15:12

WIP

136bcc2

haikoschol force-pushed the haiko/sync-worker-pool branch from 3502cd4 to 136bcc2 on November 8, 2024 08:23
