Integrate TaskManager into NodeGraph and Discovery #445

tegefaulkes · 2022-09-02T08:03:24Z

Description

This PR focuses on updating the Nodes and Discovery domains to use the new Tasks system.

Generally there are 3 places where background queues are being used. Each of these needs to be updated to use the new tasks system from #438.

Discovery system
Nodes pinging and authenticating nodes.
Nodes refreshing buckets on a timeout.
NodeConnectionManager does the network entry procedure with syncNodeGraph.

SetNode details

In NodeConnectionManager when adding a node to the NodeGraph with nodeManager.setNode we can end up with the case where a bucket is full. When this happens we need to ping nodes within the bucket to determine if they're still alive and remove any nodes that don't respond so we can add the new one. We need to convert this to use the scheduler/queue.

By default nodeManager.setNode doesn't ping a node to check if it's online before adding it. It is expected that you ping the node before using setNode to add it. We add nodes whenever we interact with or discover a node. This can happen in the following cases. We learn about a node from other nodes; a connection to the learned node hasn't been made, so it needs to be pinged. A node has connected to us, so it just needs to be added. We connected to a node, so its details need to be updated. Given that, we need the ability to either ping and setNode or just setNode.
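
A minimal sketch of the two entry paths (ping and setNode, or just setNode) plus the full-bucket case. It assumes a hypothetical in-memory `Bucket` and an injected `ping` function; none of these names or signatures are taken from the actual NodeManager, and the real version would hand the pinging off to the task queue rather than awaiting it inline.

```typescript
// Hypothetical, simplified model of a single bucket; not the actual NodeGraph.
type NodeId = string;
type Ping = (nodeId: NodeId) => Promise<boolean>;

class Bucket {
  public readonly nodes: Set<NodeId> = new Set();
  public readonly limit: number;
  constructor(limit: number) {
    this.limit = limit;
  }
  public get full(): boolean {
    return this.nodes.size >= this.limit;
  }
}

/**
 * Adds a node, optionally pinging it first.
 * If the bucket is full, the current occupants are pinged and the ones
 * that don't respond are removed to make room.
 * Resolves to true if the node ended up in the bucket.
 */
async function setNode(
  bucket: Bucket,
  nodeId: NodeId,
  ping: Ping,
  pingFirst: boolean,
): Promise<boolean> {
  // We only learned about this node second-hand, so check it is alive first
  if (pingFirst && !(await ping(nodeId))) return false;
  if (bucket.nodes.has(nodeId)) return true;
  if (bucket.full) {
    // Ping the occupants concurrently and drop the ones that don't respond
    const occupants = Array.from(bucket.nodes);
    const alive = await Promise.all(occupants.map((id) => ping(id)));
    occupants.forEach((id, i) => {
      if (!alive[i]) bucket.nodes.delete(id);
    });
    // Everyone responded, so there is nothing to evict
    if (bucket.full) return false;
  }
  bucket.nodes.add(nodeId);
  return true;
}
```

Here `pingFirst = true` covers the "learned from another node" case, while `pingFirst = false` covers the cases where a connection already exists.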

RefreshBuckets details

nodeManager.refreshBucket needs to be updated to use the new scheduler/queue system. In this case we will be making use of the Scheduler features. A refresh bucket operation needs to be run on a bucket if the bucket hasn't seen activity for an hour. Given this we need to schedule each bucket with an hour delay. If a bucket is updated we need to reset the timer for the bucket. To do this we need to make use of the taskPath and timer updating features of the Scheduler. These refreshBucket tasks should be the lowest priority.

Having the refreshBuckets system use the Tasks system is a little complex since it works as a kind of watchdog system. Here are the requirements of the refreshBucket system.

A refreshBucket operation selects a random NodeId within the target bucket's range of nodes and performs a search for that node.
Each bucket needs to be scheduled to do a refreshBucket operation every hour.
If data is updated within a bucket then this scheduled delay is reset to an hour.
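
To illustrate the first requirement, here is a minimal sketch of picking a random NodeId inside a bucket's range. It assumes the usual Kademlia convention that bucket i holds the IDs whose XOR distance from our own ID lies in [2^i, 2^(i+1)), and it uses small integer IDs instead of real NodeId byte arrays. `bucketIndex`, `randomIdInBucket` and `ID_BITS` are hypothetical, not the actual nodes utilities.

```typescript
// Hypothetical sketch with a 16-bit ID space; real NodeIds are byte arrays.
const ID_BITS = 16;

/** Bucket that `other` falls into relative to `own`: the most significant bit of the XOR distance. */
function bucketIndex(own: number, other: number): number {
  const distance = own ^ other;
  if (distance === 0) throw new RangeError("a node has no bucket for itself");
  return 31 - Math.clz32(distance);
}

/** Picks a random ID whose XOR distance from `own` lies in [2^index, 2^(index+1)). */
function randomIdInBucket(
  own: number,
  index: number,
  random: () => number = Math.random,
): number {
  if (!Number.isInteger(index) || index < 0 || index >= ID_BITS) {
    throw new RangeError("bucket index out of range");
  }
  // The top bit of the distance is fixed at `index`, the lower bits are random
  let distance = 1;
  for (let i = 0; i < index; i++) {
    distance = distance * 2 + (random() < 0.5 ? 0 : 1);
  }
  return own ^ distance;
}
```

The refreshBucket operation would then run a findNode for the returned ID, which fills the bucket with whatever nodes the search comes across.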

Here are some relevant constraints of the tasks system.

A task is scheduled with a delay and is removed when cancelled or completed.
A task state can be checked. It can be scheduled, queued, active, success or failure.
A scheduled task's delay can be updated if it is in the scheduled state.
A task can be 'grouped' using the task's path. We can track a task for a single bucket using a path such as ['refreshBucket', bucketIndex]. This enables us to find existing tasks for buckets.

Given these constraints I think we need to do the following.

During start we need to iterate over the existing refreshBucket tasks using tasks.getTasksByPath, reset the delay on existing tasks and create tasks for buckets missing them.
When any bucket gets updated using nodeManager.setNode we need to update the delay of that bucket's refreshBucket task. This can be done by getting the task for the bucket using tasks.getTasksByPath and updating the scheduled delay if the task is in the scheduled state. If it is queued or active then ignore it; if no task exists, create one. The task's delay can be updated by either cancelling and re-creating it or by using a provided updateDelay method.
The tasks themselves are technically ephemeral so we could remove all refreshBucket tasks during nodeManager.stop(). Minor detail, won't really make a difference to operation.
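
The watchdog behaviour above can be sketched against a hypothetical in-memory stand-in for the Tasks system. `FakeTasks`, its method signatures and the explicit `now` parameter are all assumptions made so the example is self-contained; only the path convention and the scheduled/queued/active rule come from the notes above.

```typescript
// Hypothetical in-memory stand-in for the Tasks system; names are illustrative only.
type TaskStatus = "scheduled" | "queued" | "active";
type TaskPath = ReadonlyArray<string | number>;

interface Task {
  path: TaskPath;
  status: TaskStatus;
  runAt: number; // timestamp in ms at which the task becomes due
}

const REFRESH_DELAY = 60 * 60 * 1000; // one hour

class FakeTasks {
  private tasks: Task[] = [];

  public schedule(path: TaskPath, delay: number, now: number): Task {
    const task: Task = { path, status: "scheduled", runAt: now + delay };
    this.tasks.push(task);
    return task;
  }

  /** Returns every task whose path starts with the given prefix. */
  public getTasksByPath(prefix: TaskPath): Task[] {
    return this.tasks.filter((task) =>
      prefix.every((part, i) => task.path[i] === part),
    );
  }
}

/** Resets the refresh timer for one bucket, creating the task if it is missing. */
function updateRefreshBucketDelay(
  tasks: FakeTasks,
  bucket: number,
  now: number,
): Task {
  const path = ["refreshBucket", bucket];
  const [existing] = tasks.getTasksByPath(path);
  if (existing == null) return tasks.schedule(path, REFRESH_DELAY, now);
  // Only a task that is still waiting can have its delay reset;
  // queued or active tasks are left alone
  if (existing.status === "scheduled") existing.runAt = now + REFRESH_DELAY;
  return existing;
}

/** Run during start: every bucket ends up with exactly one refresh task. */
function setupRefreshBuckets(
  tasks: FakeTasks,
  bucketCount: number,
  now: number,
): void {
  for (let bucket = 0; bucket < bucketCount; bucket++) {
    updateRefreshBucketDelay(tasks, bucket, now);
  }
}
```

setNode would call `updateRefreshBucketDelay` for the bucket it touched, so an active bucket keeps pushing its refresh an hour into the future.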

NodeConnectionManager details

In the NodeConnectionManager we have a method syncNodeGraph that does the following procedure. If syncNodeGraph is run with blocking = false then we need to run it in the background.

For each seed node we request the closest nodes to our NodeId. Then we go through this list, pinging each node and adding its details to our nodeGraph.
Then for every bucket above the closest node's bucket we run a refreshBucket operation.

Looking over this I think it should be part of NodeManager and make calls to NodeConnectionManager for pinging.

As for how to implement this using Tasks.

Do an initial getRemoteNodeClosestNodes to the seed node. This is blocking but it could also be a task.
The first step then schedules pingAndAdd tasks for each of the closest nodes.
The first step then schedules with 0 delay a refreshBucket operation for every bucket above the closest node's bucket.
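
As a rough sketch of the steps above, the first step can be reduced to a function that turns the seed node's response into a list of tasks to schedule. `ScheduledTask`, `planSyncNodeGraph` and the integer IDs are hypothetical; the bucket index is assumed to be the most significant bit of the XOR distance, and returning no tasks when the seed gave us nothing is my own assumption.

```typescript
// Hypothetical sketch; the task shape and function names are illustrative only.
interface ScheduledTask {
  handlerId: "pingAndAdd" | "refreshBucket";
  path: Array<string | number>;
  delay: number;
}

/** Bucket of `other` relative to `own`, assuming small non-negative integer IDs. */
function bucketIndex(own: number, other: number): number {
  return 31 - Math.clz32(own ^ other);
}

/**
 * Plans the follow-up work for a network entry.
 * `closestNodes` stands in for the result of getRemoteNodeClosestNodes.
 */
function planSyncNodeGraph(
  ownId: number,
  closestNodes: number[],
  bucketCount: number,
): ScheduledTask[] {
  const tasks: ScheduledTask[] = [];
  let closestBucket = bucketCount;
  for (const nodeId of closestNodes) {
    if (nodeId === ownId) continue; // we never add ourselves
    tasks.push({
      handlerId: "pingAndAdd",
      path: ["pingAndAdd", nodeId],
      delay: 0,
    });
    // The closest node is the one in the lowest bucket
    closestBucket = Math.min(closestBucket, bucketIndex(ownId, nodeId));
  }
  // Assumption: with no nodes returned there is nothing to plan
  if (tasks.length === 0) return tasks;
  // Refresh every bucket above the closest node's bucket straight away
  for (let bucket = closestBucket + 1; bucket < bucketCount; bucket++) {
    tasks.push({
      handlerId: "refreshBucket",
      path: ["refreshBucket", bucket],
      delay: 0,
    });
  }
  return tasks;
}
```

Because the plan is plain data, the blocking and non-blocking variants only differ in whether the caller waits for the scheduled tasks to finish.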

Other details

Other aspects to consider. The Kademlia findNode operation in the NodeConnectionManager is considered a single operation but is in essence a priority queue search for the target node. We can consider splitting this up into a compound task where each step in the search can be its own task within the process. This would apply to the refreshBucket operation since that is doing a findNode as well.
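
To make the compound-task idea concrete, here is a minimal sketch where the search state is plain data and one step contacts only the closest node not yet asked. Everything here is hypothetical: `SearchState`, `findNodeStep`, the integer IDs and the synchronous `Query` stand-in for getRemoteNodeClosestNodes. It is a simplification of a Kademlia lookup, not the existing findNode.

```typescript
// Hypothetical sketch of an iterative lookup split into resumable steps.
type Id = number;
type Query = (contact: Id, target: Id) => Id[];

interface SearchState {
  target: Id;
  candidates: Id[]; // known nodes, sorted by XOR distance to the target
  contacted: Id[]; // nodes that have already been asked
  done: boolean;
}

/** One unit of work: ask the closest uncontacted candidate for closer nodes. */
function findNodeStep(state: SearchState, query: Query): SearchState {
  if (state.done) return state;
  const next = state.candidates.find(
    (id) => state.contacted.indexOf(id) === -1,
  );
  // Nobody left to ask, so the search ends without finding the target
  if (next === undefined) return { ...state, done: true };
  const candidates = state.candidates.slice();
  for (const id of query(next, state.target)) {
    if (candidates.indexOf(id) === -1) candidates.push(id);
  }
  candidates.sort((a, b) => (a ^ state.target) - (b ^ state.target));
  return {
    target: state.target,
    candidates,
    contacted: state.contacted.concat([next]),
    done: candidates[0] === state.target,
  };
}

/** Drives the steps in a loop; with tasks, each step would schedule the next. */
function findNode(target: Id, seeds: Id[], query: Query): SearchState {
  let state: SearchState = {
    target,
    candidates: seeds.slice().sort((a, b) => (a ^ target) - (b ^ target)),
    contacted: [],
    done: false,
  };
  while (!state.done) state = findNodeStep(state, query);
  return state;
}
```

Since each step takes and returns the whole state, a step could be persisted as task parameters and cancelled or resumed between steps.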

Handlers will need to support cancel-ability. They must take an abort signal and quickly end operation.
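
A minimal sketch of what a cancellable handler could look like, using the standard AbortController and AbortSignal. The handler signature, the `sleep` helper and the plain `Error("aborted")` rejection are assumptions for illustration; the actual handler interface of the Tasks system may differ.

```typescript
// Hypothetical handler shape; only AbortController/AbortSignal are standard APIs.

/** Resolves after `ms`, or rejects as soon as the signal is aborted. */
function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      // Clear the timer so nothing keeps running after cancellation
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

/** Pings nodes one by one and stops promptly once the signal is aborted. */
async function pingNodesHandler(
  nodeIds: string[],
  ping: (nodeId: string, signal: AbortSignal) => Promise<boolean>,
  signal: AbortSignal,
): Promise<string[]> {
  const alive: string[] = [];
  for (const nodeId of nodeIds) {
    // Check between units of work so cancellation takes effect quickly
    if (signal.aborted) throw new Error("aborted");
    // Pass the signal down so a slow ping can also be interrupted
    if (await ping(nodeId, signal)) alive.push(nodeId);
  }
  return alive;
}
```

The important part is that the signal is both checked between steps and passed down into anything that can block, otherwise stopping would still wait on the slowest ping.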

Issues Fixed

Fixes Destroyed agent shuts down remote agent performing background tasks #418
Fixes Continuous connection attempts between a deployed seed node and local agent #415
Fixes Connection dropped/timed out when connecting to deployed agent #414
Fixes Starting Connection Forward infinite loop #413
Fixes Process exit handler for tracking unresolved asynchronous work ("promise deadlocks"/"why is node terminating for no reason?") #307
Related Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) #353 - This will be a separate PR after this one. After we address updating Discovery to use TaskManager.
Related Discovery - revisiting Gestalt Vertices and error handling #328
Related Generic Non-Blocking Task Management ("Queue") for discovery and nodes domains #329 - this issue contains notes on how to refactor the discovery and prioritising gestalt vertices
Related Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297 - this issue contains notes on what should be made timed/cancellable

Tasks

1. Add TaskManager to PolykeyAgent
- At the end of PolykeyAgent.start you need to call scheduler.startProcessing(). It should be a lazy creation and lazy start from the beginning to allow handlers to be registered. This is why PolykeyAgent would create a lazy scheduler, so that it doesn't start the processing loop.
2. Update nodes pinging/authentication queue to use the new tasks system
3. Update refresh buckets to use new tasks system.
~~4. Update Discovery domain to use the new tasks system.~~
5. Update and check nodes tests.
~~6. Update and check Discovery tests.~~
~~7. Address issue Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) #353~~
8. Address issue Destroyed agent shuts down remote agent performing background tasks #418
9. Address issue Continuous connection attempts between a deployed seed node and local agent #415
~~10. Address issue Connection dropped/timed out when connecting to deployed agent #414~~ - Needs more testing, can't address this now. I've added Process exit handler for tracking unresolved asynchronous work ("promise deadlocks"/"why is node terminating for no reason?") #307 to help with part of this.
11. Address issue Starting Connection Forward infinite loop #413
12. refactor setNode garbage collection Integrate TaskManager into NodeGraph and Discovery #445 (comment)
13. Address issue Process exit handler for tracking unresolved asynchronous work ("promise deadlocks"/"why is node terminating for no reason?") #307
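
For task 1, the reason for the lazy creation can be shown with a small sketch. `MiniTaskManager` is a hypothetical miniature written for this example, not the real TaskManager API; only the idea of creating lazily, registering handlers, then calling startProcessing comes from the note above.

```typescript
// Hypothetical miniature of a task manager with a lazy start.
type Handler = (...params: unknown[]) => void;

class MiniTaskManager {
  private handlers: Map<string, Handler> = new Map();
  private pending: Array<{ handlerId: string; params: unknown[] }> = [];
  private processing = false;

  constructor(lazy: boolean = false) {
    // A lazy manager accepts tasks and handlers but does not process yet
    if (!lazy) this.startProcessing();
  }

  public registerHandler(handlerId: string, handler: Handler): void {
    this.handlers.set(handlerId, handler);
  }

  public scheduleTask(handlerId: string, ...params: unknown[]): void {
    this.pending.push({ handlerId, params });
    if (this.processing) this.drain();
  }

  public startProcessing(): void {
    this.processing = true;
    this.drain();
  }

  private drain(): void {
    while (this.pending.length > 0) {
      const task = this.pending.shift()!;
      const handler = this.handlers.get(task.handlerId);
      if (handler == null) {
        throw new Error("no handler registered for " + task.handlerId);
      }
      handler(...task.params);
    }
  }
}
```

With a lazy manager, tasks left over from a previous run simply wait until every domain has registered its handlers and PolykeyAgent.start calls startProcessing at the end. An eager manager would hit the missing-handler error instead.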

Final checklist

tegefaulkes · 2022-09-07T02:30:02Z

Looking at the syncNodeGraph in the NodeConnectionManager I think we can move that functionality into the NodeManager and avoid having the NodeConnectionManager depend on the Tasks. But it does fit within the context of other NodeConnectionManager methods such as findNode, getClosestGlobalNodes and getRemoteNodeClosestNodes so I'm not sure.

CMCDragonkai · 2022-09-07T03:03:30Z

What does this function do again?

tegefaulkes · 2022-09-07T05:58:18Z

That function does the initial network entry procedure for Kademlia by getting the closest nodes to yourself, then pinging them and adding them to your graph. It then sets up initial refreshBuckets operations to fill in the rest of the NodeGraph.

CMCDragonkai · 2022-09-08T11:44:25Z

@tegefaulkes during integration into the node graph, I'd like to ensure that we have understood all the problems that the nodegraph has such as:

CMCDragonkai · 2022-09-08T11:44:58Z

Then we redeploy testnet and focus on #441.

CMCDragonkai · 2022-09-11T05:27:36Z

Discovery has a bug according to vscode:

    // If we don't have one then we can't request data so just skip
    if (authIdentityIds === [] || authIdentityIds[0] == null) {
      return undefined;
    }

It has never had a proper review, so its architecture should be reviewed and probably refactored to align with the model we have developed in the Tasks system, since it's also something that has background tasks. In fact... now that the tasks system centralises background task processing, it's possible that we can have discovery and nodegraph both delegate repeat-processing to the tasks system, and remove their own internal loops, thus simplifying how discovery and nodegraph work!!
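
For reference on the flagged condition: `authIdentityIds === []` is always false, because `[]` creates a new array and `===` compares arrays by reference rather than by contents. The guard still returns early for an empty array only because `authIdentityIds[0] == null` is true in that case, so the first operand is dead code. A minimal sketch of the intended check, with a hypothetical helper name and a `string[]` type assumed for the IDs:

```typescript
// Hypothetical helper illustrating the comparison; not the actual Discovery code.

/** Two separately created arrays are never strictly equal, even when both are empty. */
function sameReference(a: unknown, b: unknown): boolean {
  return a === b;
}

function firstAuthenticatedIdentity(
  authIdentityIds: string[],
): string | undefined {
  // Check the length instead of comparing against a fresh `[]`
  if (authIdentityIds.length === 0 || authIdentityIds[0] == null) {
    return undefined;
  }
  return authIdentityIds[0];
}
```

The `[0] == null` half alone would be enough, but the explicit length check states the intent more clearly.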

tegefaulkes · 2022-09-21T09:37:41Z

Ok, this is good to merge now. There are 3 test domains that need an eyeball still.

Discovery
tests/nat
Nodes - I think I got all the bugs in this but waiting for CI to finish will take a while.

tegefaulkes mentioned this pull request Sep 2, 2022

Feature TaskManager Scheduler and Queue and Context Decorators for Timed and Cancellable #438

Merged

74 tasks

tegefaulkes changed the base branch from staging to feature-queue September 2, 2022 08:09

tegefaulkes force-pushed the feature-tasks_implementation branch from d969836 to 4e3da17 Compare September 6, 2022 01:36

tegefaulkes changed the title ~~Feature tasks implementation~~ Integrate Tasks into NodeGraph and Discovery Sep 6, 2022

tegefaulkes force-pushed the feature-tasks_implementation branch from 4e3da17 to c3bc60e Compare September 6, 2022 03:42

CMCDragonkai force-pushed the feature-queue branch 18 times, most recently from 24e2cb4 to 63fde43 Compare September 11, 2022 15:18

tegefaulkes added 14 commits September 21, 2022 19:31

fix: moved `syncNodeGraph` from NodeConnectionManager to NodeManager

97ce1d8

fix: pingNodes inside of garbageCollectBucket now have timeouts separate from the overall timer; other fixes have been applied.

1229695

fix: cleaning up ephemeral tasks when stopping NodeManager

48298f8

fix: cleaning up errors

9f3d2c0

fix: small fix to updateRefreshBucketDelay

45360d4

fix: rollback of proxy changes, was out of scope for this PR

81e5532

fix: small fix to garbageCollectBucket concurrent pinging

a33ea26

fix: updated default timeout for NodeConnectionManager.pingNode

f11197c

fix: using Symbols for cancelling tasks

0f729b1

fix: TaskManager should extend the CreateDestroyStartStop interface

607095b

fix: test wasn't overriding key-pair generation

43027e7

fix: TaskManager's stopProcessing and stopTasks are now properly idempotent. The @ready decorator caused them to throw if run while `taskManager` was not running. They needed to be called during incomplete startup, so I removed the decorator.

0267d3b

tests: slightly increasing timeouts for two tests

ccc61a8

tests: general fixes for tests failing in CI

b29ba9e

tegefaulkes force-pushed the feature-tasks_implementation branch from 65b8c6f to b29ba9e Compare September 21, 2022 09:31

CMCDragonkai reviewed Sep 21, 2022

src/bin/utils/ExitHandlers.ts (outdated)

CMCDragonkai reviewed Sep 21, 2022

src/bin/utils/ExitHandlers.ts (outdated)

tegefaulkes force-pushed the feature-tasks_implementation branch from 2c04717 to 806bb09 Compare September 21, 2022 09:46

syntax: formatting change for ExitHandlers.ts

23f470b

tegefaulkes force-pushed the feature-tasks_implementation branch from 7073654 to 23f470b Compare September 21, 2022 09:47

CMCDragonkai merged commit 586750e into staging Sep 21, 2022

tegefaulkes mentioned this pull request Oct 4, 2022

Feature: network cancellability and deadlines #468

Merged

8 tasks

CMCDragonkai mentioned this pull request Oct 5, 2022

Update Proxy domain to use timedCancellable #464

Closed

This was referenced Oct 5, 2022

Fix connection bugs resulting from short timeout on NodeManager.pingNode #473

Closed

Updating the Nodes domain with timedCancellable #475

Merged
