Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up check_kubernetes_service_replication #4020

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nemacysts
Copy link
Member

@nemacysts nemacysts commented Mar 13, 2025

At some point this ran in <1min, and we have >1 things assuming this is still running in <1min...but the current version of this script actually takes multiple minutes in most of our clusters.

I grabbed a couple flamegraphs of the current version of this (which I've since lost in my scrollback/browser history, but I can re-gen pretty easily) and noticed a couple obvious things:

  • we spend a significant amount of time getting pods - let's parallelize that! Note: we do this with multiprocessing and not multithreading since there's a lot of serialization happening of k8s objects that would likely hold onto the GIL
  • we were spending an obscene amount of time in filter_pods_by_service_instance() since we were calling it over and over again - let's group pods once in a smarter way and pass the grouping around instead :)
  • this one is kinda ???: socket.gethostbyaddr() was pretty prominent in the flamegraphs - we don't actually use the hostnames in this check, so let's add a way to skip that hostname resolution

I also deleted the --additional-namespaces code since it's entirely unused now and it made things slightly cleaner.

EDIT: I regenerated the flamegraphs:

@nemacysts nemacysts requested a review from a team as a code owner March 13, 2025 21:48
@nemacysts
Copy link
Member Author

grr, i forgot that make test doesn't run mypy for some reason

@nemacysts
Copy link
Member Author

huh, what are our tests doing - mypy is showing me that this should have definitely broken some of the other checkers (e.g., the flink one)

@nemacysts nemacysts marked this pull request as draft March 13, 2025 22:30
At some point this ran in <1min, and we have >1 things assuming this is
still running in <1min...but the current version of this script actually
takes multiple minutes in most of our clusters.

I grabbed a couple flamegraphs of the current version of this (which
I've since lost in my scrollback/browser history, but I can re-gen
pretty easily) and noticed a couple obvious things:
* we spend a significant amount of time getting pods - let's parallelize
  that! Note: we do this with multiprocessing and not multithreading
  since there's a lot of serialization happening of k8s objects that
  would likely hold onto the GIL
* we were spending an obscene amount of time in
  filter_pods_by_service_instance() since we were calling it over and
  over again - let's group pods once in a smarter way and pass the
  grouping around instead :)
* this one is kinda ???: socket.gethostbyaddr() was pretty
  prominent in the flamegraphs - we don't actually use the hostnames in
  this check, so let's add a way to skip that hostname resolution

I also deleted the --additional-namespaces code since it's entirely
unused now and it made things slightly cleaner.
@nemacysts nemacysts force-pushed the luisp/PAASTA-18192-speedup-check_kubernetes_service_replication branch from 6f67448 to e09c741 Compare March 17, 2025 22:29
Copy link
Member

@EvanKrall EvanKrall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer grouped_pods to be something more descriptive like pods_by_service_instance or something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants