This repository was archived by the owner on Oct 28, 2022. It is now read-only.

Kubernetes clustering improvements #633

Merged: 5 commits into karimra:main on Aug 11, 2022

Conversation

jhujhiti
Contributor

I was excited to see the Kubernetes clustering feature from #560 in 0.25.0. When testing, I noticed the frequent service update log messages and went exploring in the code to understand why it was so much chattier than my previous Consul deployment, and what I found became this PR.

There's a detailed explanation in each commit here (note especially the RBAC changes required by the first commit), but the upshot is that we can use Endpoints objects directly and avoid having to find the gNMIc pods ourselves. We get some extra benefits for free with Endpoints: non-ready pod addresses are removed from the cluster's discovered services without us having to do the work of finding them. We can then use watches on these objects instead of scraping them repeatedly, which gives us near-real-time detection of changes and, ultimately what I was looking for, a quiet log stream in the steady state, because we only send updates when things actually change.
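To make the "for free" part concrete, here is a minimal sketch (illustrative only, not the PR's code) of reading ready addresses from an Endpoints object with the corev1 types; pods that are not ready sit in NotReadyAddresses and are simply never visited:

```go
package example

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// printReady lists only the ready pod addresses in an Endpoints object.
// Pods that are not ready appear under NotReadyAddresses instead, so they
// are never visited here; no extra filtering is needed on our side.
func printReady(ep *corev1.Endpoints) {
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			name := addr.IP
			if addr.TargetRef != nil {
				name = addr.TargetRef.Name // name of the backing pod
			}
			fmt.Printf("ready: %s (%s)\n", name, addr.IP)
		}
	}
}
```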

I did not see any documentation for the K8s locker yet, but if I just missed it, that will need to be updated too.

This is my first time both with the Kubernetes API and with any nontrivial amount of Go code, so don't be shy about being critical if I've done something wrong.

```go
		targetName = addr.TargetRef.Name
	}
	ls := &lockers.Service{
		ID: fmt.Sprintf("%s-api", targetName),
```
jhujhiti (Contributor, Author)

I'm not sure if it's safe to potentially have two services with the same ID? This would happen in the event that a pod had multiple addresses assigned to it. Obviously the Address field would be different, but since we base the ID on the pod's hostname, it would match for every address on the pod.
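A hypothetical sketch of the collision (the `service` struct and `buildServices` helper are made up to mirror the discussion, not the repository's `lockers.Service`): two addresses on one pod produce two entries whose IDs are identical, differing only in Address.

```go
package example

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// service is a stand-in for lockers.Service with just the fields discussed
// here; buildServices is a made-up helper, not the PR's code.
type service struct {
	ID      string
	Address string
}

func buildServices(subset corev1.EndpointSubset) []service {
	var out []service
	for _, addr := range subset.Addresses {
		targetName := addr.IP
		if addr.TargetRef != nil {
			targetName = addr.TargetRef.Name
		}
		out = append(out, service{
			// Every address of the same pod derives its ID from the same
			// pod name, so IDs collide when a pod has multiple addresses.
			ID:      fmt.Sprintf("%s-api", targetName),
			Address: addr.IP, // the only field that differs between duplicates
		})
	}
	return out
}
```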

@karimra
Owner

karimra commented Jul 19, 2022

Thanks for doing this. I took the lazy approach when implementing the k8s locker, just to get it to work.
I will look into this ASAP.

jhujhiti added 3 commits July 19, 2022 17:06
Instead of using two API calls to get the service and then the
associated pods, we can use just one to get the endpoints associated
with the service we care about.

This change requires a different Role to be bound. The services and pods resources are no longer needed, but endpoints must be added. An entire Role might now look like this (a sketch of the resulting single-call lookup follows the example below):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gnmic
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - get
      - list
      - update
      - create
      - patch
      - delete
      - watch
  - apiGroups:
      - ''
    resources:
      - endpoints
    verbs:
      - get
      - list
      - watch
```
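A minimal sketch of the resulting single-call lookup with client-go (the namespace `default` and service name `gnmic-svc` are placeholders, and the error handling is simplified): one Endpoints Get replaces the earlier Service-then-Pods pair of calls.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster credentials; the Role above must grant get/list/watch on endpoints.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// One API call: the Endpoints object carries the ready pod addresses
	// for the service, so no separate pod lookup is needed.
	ep, err := clientset.CoreV1().Endpoints("default").Get(context.TODO(), "gnmic-svc", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			fmt.Println(addr.IP)
		}
	}
}
```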
Instead of fetching the object on a schedule, we can ask Kubernetes to
send us changes as they happen. This is very similar to the Consul
method. The first response we get will include the ResourceVersion of
the current object, which we can use in subsequent requests to restart
the stream of object updates from where we left off.

This also changes the semantics of the watch timer to match the Consul locker: since we no longer poll for updates, the timer becomes a timeout on the watch API call. It also adds a retry timer that is used in case of an API error, again matching the Consul locker.
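A sketch of what this watch flow can look like with client-go (the timer values, names, and error handling are illustrative assumptions, not the PR's exact code): the ListOptions carry the last seen ResourceVersion so the stream resumes where it left off, TimeoutSeconds plays the role of the watch timer, and a short sleep stands in for the retry timer after an API error.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	timeout := int64(60)     // watch timer: server closes the watch after this many seconds
	retry := 5 * time.Second // retry timer: pause after an API error
	resourceVersion := ""    // filled in from the first response, lets us resume

	for {
		w, err := clientset.CoreV1().Endpoints("default").Watch(ctx, metav1.ListOptions{
			FieldSelector:   "metadata.name=gnmic-svc",
			ResourceVersion: resourceVersion,
			TimeoutSeconds:  &timeout,
		})
		if err != nil {
			time.Sleep(retry)
			continue
		}
		for ev := range w.ResultChan() {
			ep, ok := ev.Object.(*corev1.Endpoints)
			if !ok {
				continue
			}
			resourceVersion = ep.ResourceVersion // remember where we left off
			fmt.Println("endpoints changed:", ev.Type)
		}
		// Channel closed (timeout or server-side close): loop and re-watch
		// from the stored ResourceVersion.
	}
}
```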
I found these messages to be far too chatty and they don't appear to
be helpful in healthy operation.
@jhujhiti
Contributor Author

I just realized that these commits had the wrong email address, so I force-pushed to fix that. No other changes.

@karimra
Owner

karimra commented Aug 2, 2022

Apart from the comment I left, this looks good to me.
There is no specific documentation for the k8s locker; there is just a page about clustering: https://gnmic.kmrd.dev/user_guide/HA/

@jhujhiti
Contributor Author

jhujhiti commented Aug 9, 2022

I think this should handle the ctx.Done() in the inner loop?
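For reference, one common shape for honoring ctx.Done() in the inner loop (a sketch under assumed names, not the exact change in this PR): replace a bare range over the watch's result channel with a select so cancellation exits promptly.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/watch"
)

// drain consumes events from a watch until the channel closes or the
// context is cancelled, so the inner loop no longer ignores ctx.Done().
func drain(ctx context.Context, w watch.Interface) {
	defer w.Stop()
	for {
		select {
		case <-ctx.Done():
			// Caller cancelled: stop watching and return immediately.
			return
		case ev, ok := <-w.ResultChan():
			if !ok {
				// Watch closed (for example on timeout); the caller can
				// re-establish it from the last ResourceVersion.
				return
			}
			_ = ev // handle the endpoint change here
		}
	}
}
```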

@karimra
Owner

karimra commented Aug 11, 2022

Thanks again for the contribution!

@karimra karimra merged commit bb12679 into karimra:main Aug 11, 2022