Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-termination of identities discover #203

Closed
tegefaulkes opened this issue Jun 12, 2024 · 9 comments · Fixed by MatrixAI/Polykey#739
Closed

Non-termination of identities discover #203

tegefaulkes opened this issue Jun 12, 2024 · 9 comments · Fixed by MatrixAI/Polykey#739
Assignees
Labels
development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@tegefaulkes
Copy link
Contributor

Specification

We've found during testing that in some cases running polykey identities discover <identitiy> --monitor fails to make any progress or terminate properly.

cmcdragonkai ➜ matrix-framework-13-ryzen-7040  ➜ ~/Projects/Polykey-CLI
 $ ./dist/polykey.js identities discover github.com:brynblack --monitor
queued         github.com:brynblack

After a while this will time out the RPC. That makes sense since we have to stream data for the timeout period. We may have to look into increasing the timeout for this command or having the RPC time out with a more descriptive error than a RPC timeout.

Additional context

Tasks

  1. Do some manual testing to find the root cause of the hanging discovery.
  2. Apply a fix as needed.
@tegefaulkes tegefaulkes added the development Standard development label Jun 12, 2024
Copy link

linear bot commented Jun 12, 2024

@tegefaulkes
Copy link
Contributor Author

In my testing it processes the identity just fine. However I don't see any node claims being queued up. That is a problem for MatrixAI/Polykey#737.

Right now the conditions that cause this to hang are very unclear. There are two possiblilities for this happening.

  1. For whatever reason the task for processing that vertex isn't starting. It may be crowded out by other tasks being started first. That would be concerning howerver.
  2. The task is starting, but processing the vertex is hanging on something or taking a very long time.

In either case, I can't confirm anything without being able to replicate the problem.

Copy link
Contributor Author

After some digging I found out two thing.

  1. I wasn't aware you needed to authorise github to be able to process any identities. This explains why my example suceeded while others had an explicit failure.
  2. When processing the identity vertex we seem to be processing the same page over and over again in an infinite loop.

As to why it's infinitely looping I need to dig deeper. I think @brynblack only has multiple newer style claims but I'm not certain about that. What I do know is it's processing a page of claims over and over again and seeing only the same claim over and over again.

@amydevs
Copy link
Member

amydevs commented Jun 13, 2024

Hm that's strange, how did u make it infinitely loop? I'm gonna work on the pagination system a bit, see if u can reproduce it after the changes

@amydevs
Copy link
Member

amydevs commented Jun 13, 2024

the culprit is this:

image

because this happens in a loop, once we exhaust all pages, the next navigation token will eternally stay at the last page. Causing an infinite loop on processing the last page, until the ctx times out.

@amydevs
Copy link
Member

amydevs commented Jun 13, 2024

The reason why this was never caught in a test, was because the TestProvider would never set the nextPaginationToken to null to indicate that there are no more pages of claims to process. My implementation of the pagination loop originally would've only accounted for a seek style pagination mechanism, where pages are processed until we encounter a page with no claims.

Because the logic only sets the nextPaginationToken when it is not null, we end up looping on the last page.

By making the TestProvider used in tests to return a null pagination token when all pages of claims are exhausted, we get test timeouts when running tests/discovery/...:

image

With the fixes applied, the timeout goes away:

image

This proves that this was the problem.

@tegefaulkes
Copy link
Contributor Author

Ok, yeah, that pretty much proves it. While we're here check some edge cases such as failing to process a page, or a single claim within that page.

@amydevs
Copy link
Member

amydevs commented Jun 13, 2024

If you're talking abt failing to fetch a page of claims, or specific claim, ErrorProviderCall will throw within the pagination loop, and bubble all the way up to the try catch block in Discovery.discoverVertexHandler. This will dispatch the EventDiscoveryVertexFailed event, and then throw. This will cause the TaskManager to warn on the logger that the task has thrown with the specific error.

This will really be done to all errors except for the specific Symbol reasons we have defined.

@tegefaulkes
Copy link
Contributor Author

There's some overlap here with the invalid claims format problem we're having as well. So what happens if one of the claims is badly formatted here? Do we fail to process any further claims after that? Can we add some more robust handling of this or event tests demonstrating that case?

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Aug 15, 2024
@CMCDragonkai CMCDragonkai changed the title non-termination of identities discover Non-termination of identities discover Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy
Development

Successfully merging a pull request may close this issue.

3 participants