
Auto-monitor as option #23

Closed
bethac07 opened this issue Mar 8, 2023 · 2 comments · Fixed by #27
Labels
enhancement New feature or request

Comments

@bethac07
Contributor

bethac07 commented Mar 8, 2023

One obvious downside of the monitor is that it needs to be running to work, so a) you have to remember to run it, and b) if the machine it's running on goes down, it stops running.

In general, we had rejected using Lambdas for the monitor because they can only run for 15 minutes. In theory, though, if we had an existing monitor Lambda function, each DS "startCluster" step could create a cron-style schedule carrying the monitor file parameters that triggers that Lambda every (1, 5, etc.) minutes. That Lambda would check the designated resources; if everything is still running it would do nothing, and once the run is done it would clean everything up (which takes less than 15 minutes), including the schedule itself.
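To make that concrete, here's a minimal sketch (not an actual implementation) of what such a scheduled monitor Lambda could look like, assuming startCluster creates an EventBridge rule whose target passes the monitor-file parameters as a constant JSON input; the keys `queue_url`, `spot_fleet_id`, and `rule_name` are placeholders:

```python
# Sketch only: a scheduled Lambda that checks the job queue and spot fleet,
# does nothing while work remains, and tears everything down (including its
# own schedule rule) once the queue drains. All names are hypothetical.
import boto3

sqs = boto3.client("sqs")
ec2 = boto3.client("ec2")
events = boto3.client("events")

def handler(event, context):
    attrs = sqs.get_queue_attributes(
        QueueUrl=event["queue_url"],
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    visible = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])

    # Work still queued or in flight: do nothing until the next scheduled run.
    if visible + in_flight > 0:
        return {"status": "still running"}

    # Queue is drained: cancel the spot fleet, then remove our own schedule rule.
    ec2.cancel_spot_fleet_requests(
        SpotFleetRequestIds=[event["spot_fleet_id"]], TerminateInstances=True
    )
    events.remove_targets(Rule=event["rule_name"], Ids=["monitor-lambda"])
    events.delete_rule(Name=event["rule_name"])
    return {"status": "cleaned up"}
```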

I think we would want this as optional, for two major reasons:

  • Lambdas do cost money. If all we're doing is checking the state of the queue and the spot fleet, it should run quickly and on the smallest possible memory size, but I haven't yet back-of-the-enveloped the expected costs (a rough estimate is sketched just after this list). I can't imagine they'll be thousands of dollars, but they might not be zero either.
  • Sometimes it's useful to be able to temporarily shut the monitor off. Maybe others always start their jobs perfectly on the first try, but sometimes on mine I realize something bad is going on and need to empty the queues and restart the Dockers without doing a full-on infrastructure cleanup and redeployment. That's easy when the monitor is just a Ctrl+C away, but harder for a cron job.
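On the cost question from the first bullet, a very rough back-of-the-envelope under assumed numbers (AWS Lambda's published on-demand pricing of roughly $0.20 per million requests plus roughly $0.0000166667 per GB-second, a 128 MB function, a 5-second check every minute for a week):

```python
# Rough cost estimate for the polling-Lambda option; all figures are assumptions,
# not measurements, and the pricing constants should be checked against current AWS pricing.
invocations = 60 * 24 * 7   # one check per minute for a week = 10,080 runs
seconds_per_run = 5         # a quick queue/spot-fleet status check
memory_gb = 0.128           # smallest Lambda memory size

request_cost = invocations / 1_000_000 * 0.20
compute_cost = invocations * seconds_per_run * memory_gb * 0.0000166667
print(f"~${request_cost + compute_cost:.2f} per week of monitoring")  # ≈ $0.11
```

So even minute-by-minute polling for a week-long run looks like pennies, and would likely land inside Lambda's monthly free tier anyway.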

(Personally, I think there is one additional, unquantifiable benefit to having the monitor be a step the user executes: it reinforces to users that teardown is a thing that needs to happen, rather than letting them blindly trust that it has. Even the best-written, most-debugged auto-teardown code (whether written by us or an Amazon-native service) is eventually going to have a day where it just barfs, so I would rather the responsibility for teardown be clearly placed where it belongs: on the person who spun it up in the first place. But that might just be me shaking my fist at kids these days, blindly trusting that their stuff will work; in my day nothing was automated, we checked things by hand, uphill in the snow both ways, etc.)

What do you think @ErinWeisbart?

@bethac07 bethac07 added the enhancement New feature or request label Mar 8, 2023
@bethac07
Contributor Author

We'd probably want to do #2 at the same time, because otherwise this will be annoying for users to set up.

@bethac07
Contributor Author

bethac07 commented Mar 14, 2023

@ErinWeisbart and I remembered that one of the two of us (which one is unclear) had already thought this through as part of AuSPICES nine months ago, and we realized at the time that a very nice way to trigger it is an alarm on the queue, because then there's no need for ongoing checks. It also requires uploading the monitor file to the bucket.

Ongoing checks are nice for the auto-downscaling, so we might do some back-of-the-envelope calculations of what a "Lambda every X minutes for Y time" costs versus an "alarm for Y time, assuming auto-downscaling knocks 5% off the compute costs", but alarms are certainly more elegant. This does mean two extra steps in the initial setup (an SNS topic and a Lambda), so again, we likely want to do #2.
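Purely for illustration (assumed names, and not necessarily what the eventual PR does), those two extra setup steps could look roughly like this with boto3, wiring an SNS topic and a CloudWatch alarm on the job queue to an already-existing cleanup Lambda:

```python
# Sketch only: create an SNS topic, let it invoke the cleanup Lambda, and alarm
# on the job queue being empty so the Lambda fires only when the run is done.
# Topic name, queue name, and Lambda ARN are placeholders.
import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")
awslambda = boto3.client("lambda")

def wire_up_alarm(queue_name, monitor_lambda_arn):
    topic_arn = sns.create_topic(Name="DS-monitor-topic")["TopicArn"]

    # Allow the topic to invoke the monitor Lambda, then subscribe the Lambda to it.
    awslambda.add_permission(
        FunctionName=monitor_lambda_arn,
        StatementId="AllowSNSInvoke",
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )
    sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=monitor_lambda_arn)

    # Alarm when the queue has been (approximately) empty for three 5-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName=f"{queue_name}-empty",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=0,
        ComparisonOperator="LessThanOrEqualToThreshold",
        AlarmActions=[topic_arn],
    )
```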

@ErinWeisbart ErinWeisbart linked a pull request Mar 14, 2023 that will close this issue