
Refactor Task system to retain ProtocolDAG, upload ProtocolUnitResults and ResultFiles as they complete #180

Open
dotsdl opened this issue Sep 19, 2023 · 0 comments · May be fixed by #104


dotsdl commented Sep 19, 2023

Currently, when a compute service executes a Task, it generates a new ProtocolDAG locally, executes it, and pushes the (successful or failed) ProtocolDAGResult back to the server. This adds the serialized ProtocolDAGResult to the object store, and a ProtocolDAGResultRef to the state store. A Task can have any number of failed ProtocolDAGResultRefs, and (typically) a single successful ProtocolDAGResultRef.
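The current flow can be sketched roughly as follows. This is a minimal illustration only; the class shapes and function names here (`execute_task`, `result_refs`, `ok`) are hypothetical stand-ins, not the actual gufe/alchemiscale API:

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolDAGResult:
    """Stand-in for the real gufe ProtocolDAGResult; `ok` marks success."""
    ok: bool


@dataclass
class Task:
    """Stand-in for a Task; the state store keeps one ProtocolDAGResultRef
    per pushed ProtocolDAGResult (any number failed, typically one success)."""
    result_refs: list = field(default_factory=list)


def execute_task(task: Task) -> ProtocolDAGResult:
    """Current behavior: build a fresh ProtocolDAG locally, execute it in
    full, then push the whole (successful or failed) result to the server,
    which stores the serialized result in the object store and a
    ProtocolDAGResultRef in the state store."""
    dag_result = ProtocolDAGResult(ok=True)  # pretend execution succeeded
    task.result_refs.append(dag_result)      # ref recorded server-side
    return dag_result
```

Note that nothing is persisted until the whole DAG finishes, which is exactly what prevents checkpointing and incremental ResultFile upload.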

This approach does not currently support ResultFile upload (files produced by ProtocolUnits that are desired for permanent storage, available on-demand to users later), nor does it allow for a ProtocolDAG that successfully executes some ProtocolUnits to be started again from where it left off on another compute service (checkpointing). Our aim is to support both of these in alchemiscale.

This proposal should accomplish both:

  • instead of storing results in the object store keyed by ProtocolDAGResult, we should store them keyed by Task/ProtocolDAG
    • this fits with the idea that the same file storage system can be used to enable partial restarts
  • a Task gets a ProtocolDAGRef in state store upon creation, serialized ProtocolDAG in object store
  • as ProtocolDAG is executed on compute service, ProtocolUnitResults and ResultFiles shipped to object store
  • on success, a complete ProtocolDAGResult shipped to object store, ProtocolDAGResultRef added to state store; same retrieval pattern as before
  • on failure, same as above but for a failed ProtocolDAGResult
  • when another compute service picks up a Task, it checks for existence of a ProtocolDAGRef; if present, pulls ProtocolDAG and its associated ProtocolUnitResults from object store
    • it then finds the ProtocolUnits in the ProtocolDAG that have not been executed successfully (whether failed or never run), identifies their dependency ProtocolUnitResults, pulls those dependencies' ResultFiles if included in their outputs, and proceeds with DAG execution

This has some nice properties:

  • a Task only ever has a single ProtocolDAGRef, which may have any number of failed ProtocolDAGResultRefs and at most one successful ProtocolDAGResultRef
  • we don't have to do odd workarounds to utilize gufe storage system for ResultFiles (see gufe#186 and gufe#234 for current state as of this writing)
  • we get architectural support for checkpointing for ProtocolDAGs, reducing waste and time to results
  • still mostly the same system in terms of execution, status model, Task claiming, result retrieval, etc.
  • gives what is needed to support ResultFile retrieval on the user side
  • gives what is needed to support extends on the compute side, where one or more ResultFiles may be needed to extend a ProtocolDAG from a previous ProtocolDAGResult