refactor: modularize file state #126

eddiebergman · 2024-08-02T15:03:36Z

Yeah good luck, not reviewable.

This PR is primarily to separate worker (DefaultWorker) from state (NePSState).

Previously this was bundled into one big loop in runtime.py.
A worker is in charge with interacting with NePSState and deciding when to stop.
The state is where resources can be accessed by the worker, or really any code, in a uniform manner.
- For example, removing the need to know where a trial is located or how to get it, just give me the trial and its outcome. Other instances of resources include SeedSnapshot, ErrDump (containing all global worker errors) and OptimizerInfo.
To prevent race conditions with workers there is now a Synced protocol.
- It is parametrized by two types, i.e. Synced[Resource, Location].
  - For example Synced[Trial, Path] indicated a synced Trial which is indentified by a Path.
- The Synced provides operations for unique access to resources, ensuring proper locking behaviour and uses a concept of versioning for each individual resource such that a NePSState does not need to reload everything from scratch each time, if it already has the latest version.
- The Synced protocol is agnostic of the underlying way in which it's performed, this would allow for example, a fully in-memory NePS or a server hosting NePS (@vladislavalerievich, this might be something we consider once you're available)

For trusting this PR, all tests pass, mypy passes and I've added tests on:

Ensuring that all resources read from/write to disk work when wrapped inside a Synced, essentially letting them be access and mutated in an atomic way across workers.
Test that the operations you can do on a NePSState is atomic and correct, with every optimizer that is currently indexed in SearcherMapping.
Check that DefaultWorker respects all stopping criterion and does the loop it's meant to
... pass all existings tests

Additional notes:

Given each worker is only ever working on one Trial at a time and that there is only ever one NePSState they are associated with, there are now two functions:
- neps.runtime.get_in_progress_trial(): This gives you the trial which the worker is currently evaluating. This can be useful to inject functionality into the run_pipeline() of the user, such as load_checkpoint, tblogger, etc... the Trial object has much richer meta information than what is available through just the arguments to run_pipeline(...). Calling this outside of run_pipeline context is considered a developer error, and this is not something we directly advertise to users. For example, calling this while inside the context of sampling a new trial from an optimizer is considered an error as there is no trial currently being evaluated.
- neps.runtime.get_workers_neps_state(): This gives you the entire NePSState object the work is operating from. This is set as soon as a worker starts and is available for the duration of the program. Calling this before a worker starts is considered a developer error
In tblogger and status, I've adapted them to use the new get_workers_neps_state() such that they can easily get a reference to the NePSState the worker is operating under. This reduces the need for them to be explicitly aware of where anything is located and instead ask NePSState for what they need.
- I haven't put in dedicated functions for the things they require which would be useful and good idea going forward, but low priority for now.
There are some non-trivial cases I did not solve yet where hardcoded path are assumed and the NePSState is located in the shared file-system. These can be solved over time and are lower priority. The solution would be to expose the information they are trying to read from disk directly from NePSState, where proper locking can be done.
New criterion by which to stop worker such as worker/global evaluation time, worker walltime, worker/global error etc...

I'm sure there's plenty more things but they'll come up as they need to

eddiebergman added 12 commits August 2, 2024 16:29

feat: Shared resource objects for state

8975368

feat: Jobqueue

12ee400

refactor(state): NePS State set up and tested

167fb28

refactor(runtime): Implement runtime with the new state

506d244

test: Pass them tests

89f0849

fix: fix test errors from rebase

e807a43

fix: Remove exploratory ask-and-tell

785818d

ci: Add pytest_cases dev dependancy

774aa6f

fix: Use from __futures__ import annotations in test files

4d8df2a

fix: Use from __future__ import annotations

6f5821d

fix: Sigh, more from __future__ import annotations

e4856b9

fix: Introduce sleep() for windows test -_-

ded7514

eddiebergman merged commit 08f30ae into master Aug 2, 2024
13 checks passed

eddiebergman deleted the refactor-modularize-file-state branch August 2, 2024 16:23

This was referenced Aug 5, 2024

[UX] Show error and traceback when something goes wrong #62

Closed

Potential issues if multiple checkpoints for one config #19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: modularize file state #126

refactor: modularize file state #126

eddiebergman commented Aug 2, 2024 •

edited

Loading

refactor: modularize file state #126

refactor: modularize file state #126

Conversation

eddiebergman commented Aug 2, 2024 • edited Loading

Additional notes:

eddiebergman commented Aug 2, 2024 •

edited

Loading