Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: modularize file state #126

Merged
merged 12 commits into from
Aug 2, 2024
Merged

Conversation

eddiebergman
Copy link
Contributor

@eddiebergman eddiebergman commented Aug 2, 2024

Yeah good luck, not reviewable.

This PR is primarily to separate worker (DefaultWorker) from state (NePSState).

  • Previously this was bundled into one big loop in runtime.py.
  • A worker is in charge with interacting with NePSState and deciding when to stop.
  • The state is where resources can be accessed by the worker, or really any code, in a uniform manner.
    • For example, removing the need to know where a trial is located or how to get it, just give me the trial and its outcome. Other instances of resources include SeedSnapshot, ErrDump (containing all global worker errors) and OptimizerInfo.
  • To prevent race conditions with workers there is now a Synced protocol.
    • It is parametrized by two types, i.e. Synced[Resource, Location].
      • For example Synced[Trial, Path] indicated a synced Trial which is indentified by a Path.
    • The Synced provides operations for unique access to resources, ensuring proper locking behaviour and uses a concept of versioning for each individual resource such that a NePSState does not need to reload everything from scratch each time, if it already has the latest version.
    • The Synced protocol is agnostic of the underlying way in which it's performed, this would allow for example, a fully in-memory NePS or a server hosting NePS (@vladislavalerievich, this might be something we consider once you're available)

For trusting this PR, all tests pass, mypy passes and I've added tests on:

  • Ensuring that all resources read from/write to disk work when wrapped inside a Synced, essentially letting them be access and mutated in an atomic way across workers.
  • Test that the operations you can do on a NePSState is atomic and correct, with every optimizer that is currently indexed in SearcherMapping.
  • Check that DefaultWorker respects all stopping criterion and does the loop it's meant to
  • ... pass all existings tests

Additional notes:

  • Given each worker is only ever working on one Trial at a time and that there is only ever one NePSState they are associated with, there are now two functions:
    • neps.runtime.get_in_progress_trial(): This gives you the trial which the worker is currently evaluating. This can be useful to inject functionality into the run_pipeline() of the user, such as load_checkpoint, tblogger, etc... the Trial object has much richer meta information than what is available through just the arguments to run_pipeline(...). Calling this outside of run_pipeline context is considered a developer error, and this is not something we directly advertise to users. For example, calling this while inside the context of sampling a new trial from an optimizer is considered an error as there is no trial currently being evaluated.
    • neps.runtime.get_workers_neps_state(): This gives you the entire NePSState object the work is operating from. This is set as soon as a worker starts and is available for the duration of the program. Calling this before a worker starts is considered a developer error
  • In tblogger and status, I've adapted them to use the new get_workers_neps_state() such that they can easily get a reference to the NePSState the worker is operating under. This reduces the need for them to be explicitly aware of where anything is located and instead ask NePSState for what they need.
    • I haven't put in dedicated functions for the things they require which would be useful and good idea going forward, but low priority for now.
  • There are some non-trivial cases I did not solve yet where hardcoded path are assumed and the NePSState is located in the shared file-system. These can be solved over time and are lower priority. The solution would be to expose the information they are trying to read from disk directly from NePSState, where proper locking can be done.
  • New criterion by which to stop worker such as worker/global evaluation time, worker walltime, worker/global error etc...

I'm sure there's plenty more things but they'll come up as they need to

@eddiebergman eddiebergman merged commit 08f30ae into master Aug 2, 2024
13 checks passed
@eddiebergman eddiebergman deleted the refactor-modularize-file-state branch August 2, 2024 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

1 participant