From 242822306447b34e8c00a20a73aff308139be0c1 Mon Sep 17 00:00:00 2001 From: Terry Brady Date: Tue, 2 Apr 2024 16:01:27 -0700 Subject: [PATCH] migrate design to new repo --- design/queue-2023/api.md | 7 - design/queue-2023/data.md | 169 ----- design/queue-2023/diagrams/README.md | 6 - design/queue-2023/diagrams/batches.mmd | 15 - design/queue-2023/diagrams/batches.mmd.svg | 1 - design/queue-2023/diagrams/jobs.mmd | 22 - design/queue-2023/diagrams/jobs.mmd.svg | 1 - design/queue-2023/java-api.md | 171 ----- design/queue-2023/manager.md | 3 - design/queue-2023/queue-admin.md | 68 -- design/queue-2023/ruby-api.md | 158 ---- design/queue-2023/service.md | 30 - design/queue-2023/states.md | 117 --- design/queue-2023/tests.yml | 8 - design/queue-2023/transition.md | 841 --------------------- design/queue-2023/use-cases.md | 595 --------------- 16 files changed, 2212 deletions(-) delete mode 100644 design/queue-2023/api.md delete mode 100644 design/queue-2023/data.md delete mode 100644 design/queue-2023/diagrams/README.md delete mode 100644 design/queue-2023/diagrams/batches.mmd delete mode 100644 design/queue-2023/diagrams/batches.mmd.svg delete mode 100644 design/queue-2023/diagrams/jobs.mmd delete mode 100644 design/queue-2023/diagrams/jobs.mmd.svg delete mode 100644 design/queue-2023/java-api.md delete mode 100644 design/queue-2023/manager.md delete mode 100644 design/queue-2023/queue-admin.md delete mode 100644 design/queue-2023/ruby-api.md delete mode 100644 design/queue-2023/service.md delete mode 100644 design/queue-2023/states.md delete mode 100644 design/queue-2023/tests.yml delete mode 100644 design/queue-2023/transition.md delete mode 100644 design/queue-2023/use-cases.md diff --git a/design/queue-2023/api.md b/design/queue-2023/api.md deleted file mode 100644 index 559a925..0000000 --- a/design/queue-2023/api.md +++ /dev/null @@ -1,7 +0,0 @@ -# Merritt ZK API - -- [Design](README.md) - -## Implementations -- [Ruby API](ruby-api.md) -- [Java API](https://cdluc3.github.io/merritt-tinker/org/cdlib/mrt/package-summary.html) diff --git a/design/queue-2023/data.md b/design/queue-2023/data.md deleted file mode 100644 index 1a76875..0000000 --- a/design/queue-2023/data.md +++ /dev/null @@ -1,169 +0,0 @@ -# Queue Data - -- [Design](README.md) - -## Queue Data - - - -### Space Considerations -> ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data. This data can come in the form of configuration, status information, rendezvous, etc. A common property of the various forms of coordination data is that they are relatively small: measured in kilobytes. The ZooKeeper client and the server implementations have sanity checks to ensure that znodes have less than 1M of data, but the data should be much less than that on average. [^1] -[^1]: https://zookeeper.apache.org/doc/r3.3.3/zookeeperProgrammers.html#Data+Access - -#### Final vs Volatile data fields -- As we write to zookeeper, should be distinguish our static fields (submitter, file name) from the volatile fields (status, space_needed, last update)? - -## Design Ideas - -| Zookeeper Node Path | Node Data Type | Fields | Created By | Modified By | Comment | -| - | - | - | - | - | - | -| /batches/BID/lock | none | - | Pending, Reporting | - | **Ephemeral node** to lock a batch, deleted by the thread that creates the node | -| /batches/BID/submission | json | profile_name
submitter
payload_filename

erc_what
erc_who
erc_when
erc_where
type
submission_mode | creation | none | | -| /batches/BID/status | json | status
last_modified | creation | all jobs done | | -| /batches/BID/status-report | json | failed_jobs | failure | failure | last status report sent to user | -| /batches/BID/states/STATE/JID | none | - | | | STATE = pending / held / processing / failed / completed
Create watcher to watch for states/processing to be empty| -| /jobs/JID/lock | none | - | Several states | - | **Ephemeral node** to lock a job, deleted by the thread that creates the node | -| /jobs/JID/bid | string | batch_id | creation | none | | -| /jobs/JID/configuration | json | batch_id
profile_name
submitter
payload_url
payload_type
response_type
working_dir
local_id | creation | none | | -| /jobs/JID/status | json | status
last_successful_status
last_modification_date
retry_count | creation | none | | -| /jobs/JID/priority | int | - | creation | estimating | | -| /jobs/JID/space_needed | long | - | creation | estimating | | -| /jobs/JID/identifiers | json | primary_id
local_id: [] | creation | processing | | -| /jobs/states/STATE/PP-JID | none | - | | | PP = priority
STATE = pending / held / estimating / provisioning / downloading / processing / recording / notify / failed / completed | - -### Job Transition - -- Processing /jobs/states/StateX/PP-JID -- Job finishes StateX -- Update /jobs/JID/status data - - last_successful_status = StateX - - status = StateY - - last_modification_date = now -- Delete /jobs/states/StateX/PP-JID -- Create /jobs/states/StateY/PP-JID - - Note: The prior state might have altered the priority -- If StateY == Completed - - Delete /batches/BID/states/processing/JID - - Create /batches/BID/states/completed/JID -- If StateY == Failed - - Delete /batches/BID/states/processing/JID - - Create /batches/BID/states/failed/JID -- If /batches/BID/states/processing is empty, watcher will trigger batch notification - -## Batch Data - -### Batch Object Data Elements - -```mermaid -classDiagram - class Batch { - final String batch_id - final BatchSubmissionInfo - BatchState state - Hash~String_JobStatus~ jobs_status - String error_message - } - class BatchSubmissionInfo { - final String profile_name - final String submitter - final ManifestType manifest_type - final String payload_filename - final ResponseType response_type - final SubmissionMode submission_mode - } - class JobStatus { - JobState status - JobState last_reported_status - Date last_update - } - class BatchState{ - <> - } - class JobState{ - <> - } - class ManifestType{ - <> - SingleFile, - ObjectManfiest, - ManifestOfContainers, - ManifestOfManifests - } - class ResponseType{ - <> - XML, - JSON, - turtle - } - class SubmissionMode{ - <> - Add, - Update, - Reset - } -``` - -### Enum -- [BatchState.java](https://github.com/CDLUC3/merritt-tinker/blob/main/state-transition/src/main/java/org/cdlib/mrt/BatchState.java) - -## Job Data - -### Job Queue Data Elements -```mermaid -classDiagram - class Job { - final String job_id - final String batch_id - final BatchSubmissionInfo - final String payload_url - final PayloadType payload_type - final ResponseType callback_response_type - - String working_directory - - JobState status - JobState last_successful_state - Time status_updated - String error_message - - int retry_count - String local_id - String ark - int priority - long space_needed - } - class PayloadType{ - <> - File, - Manifest, - Container - } -``` -### Enum -- [JobState.java](https://github.com/CDLUC3/merritt-tinker/blob/main/state-transition/src/main/java/org/cdlib/mrt/JobState.java) - -### Questions -- Store timing info - - ---- -## Legacy Zookeeper Data Structure - -### Record Data -- Ingest currently serializes java properties -- Inventory currently serializes XML data - -### Record Keys -The ingest service currently packs a priority value into the path name for the zookeeper record. -- /ingest/mrtQ-02100000000003 -- (document the component parts here) -- Question: priority may become a more dynamic property in the future - - We could have a baseline priority in the pathname (for sorting) and an actual priority in the payload - - We could also explore renaming a path dynamically when a priority change is appropriate - -### Record Sorting - -#### Current Implementation -In Merritt's current zookeeper implementation, record headers contain binary data. -- Status: 1 byte status field with each byte representing a different queue state -- Time: 8 byte long representing the number of seconds since 1970 diff --git a/design/queue-2023/diagrams/README.md b/design/queue-2023/diagrams/README.md deleted file mode 100644 index 5ce01b8..0000000 --- a/design/queue-2023/diagrams/README.md +++ /dev/null @@ -1,6 +0,0 @@ -## make svg -```sh -docker run --rm -v "$(pwd):/data" minlag/mermaid-cli:latest -i /data/batches.mmd -docker run --rm -v "$(pwd):/data" minlag/mermaid-cli:latest -i /data/jobs.mmd -``` - diff --git a/design/queue-2023/diagrams/batches.mmd b/design/queue-2023/diagrams/batches.mmd deleted file mode 100644 index 7adfccc..0000000 --- a/design/queue-2023/diagrams/batches.mmd +++ /dev/null @@ -1,15 +0,0 @@ -%%{init: {'theme': 'neutral', 'securityLevel': 'loose', 'themeVariables': {'fontFamily': 'arial'}}}%% -graph LR - START --> Pending - click START "javascript:alert(222)" "Tip" - Pending --> Held - Pending --> Processing - Held -.-> Processing - Processing --> Reporting - Reporting --> COMPLETED - Reporting --> Failed - Failed -.-> UpdateReporting - UpdateReporting --> Failed - UpdateReporting --> COMPLETED - Failed -.-> DELETED - Held -.-> DELETED diff --git a/design/queue-2023/diagrams/batches.mmd.svg b/design/queue-2023/diagrams/batches.mmd.svg deleted file mode 100644 index 6962285..0000000 --- a/design/queue-2023/diagrams/batches.mmd.svg +++ /dev/null @@ -1 +0,0 @@ -
START
Pending
Held
Processing
Reporting
COMPLETED
Failed
UpdateReporting
DELETED
\ No newline at end of file diff --git a/design/queue-2023/diagrams/jobs.mmd b/design/queue-2023/diagrams/jobs.mmd deleted file mode 100644 index 159f224..0000000 --- a/design/queue-2023/diagrams/jobs.mmd +++ /dev/null @@ -1,22 +0,0 @@ -%%{init: {'theme': 'neutral', 'securityLevel': 'loose', 'themeVariables': {'fontFamily': 'arial'}}}%% -graph TD - START --> Pending - Pending --> Held - Pending --> Estimating - Held -.-> Pending - Estimating --> Provisioning - Provisioning --> Downloading - Downloading --> Processing - Downloading --> Failed - Failed -.-> Downloading - Processing --> Recording - Processing --> Failed - Failed -.-> Processing - Recording --> Notify - Recording --> Failed - Failed -.-> Recording - Notify --> COMPLETED - Notify --> Failed - Failed -.-> Notify - Failed -.-> DELETED - Held -.-> DELETED diff --git a/design/queue-2023/diagrams/jobs.mmd.svg b/design/queue-2023/diagrams/jobs.mmd.svg deleted file mode 100644 index 66c9649..0000000 --- a/design/queue-2023/diagrams/jobs.mmd.svg +++ /dev/null @@ -1 +0,0 @@ -
START
Pending
Held
Estimating
Provisioning
Downloading
Processing
Failed
Recording
Notify
COMPLETED
DELETED
\ No newline at end of file diff --git a/design/queue-2023/java-api.md b/design/queue-2023/java-api.md deleted file mode 100644 index 2ff9f5b..0000000 --- a/design/queue-2023/java-api.md +++ /dev/null @@ -1,171 +0,0 @@ -# Merritt ZK Java API - -- [Design](README.md) / [API](api.md) - -## States - -```java -package org.cdlib.mrt; - -public interface IngestState { - public List nextStates(); - public String name(); - public IngestState stateChange(IngestState next); - default boolean isDeletable(); - default boolean stateChangeAllowed(IngestState next); - default IngestState success(); - default IngestState fail(); - public static JSONObject statesAsJson(IngestState[] values); -} - -public enum BatchState implements IngestState{ - Pending, - Held, - Processing, - Reporting , - Failed, - UpdateReporting, - Completed, - Deleted; -} - -public enum JobState implements IngestState { - Pending, - Held, - Estimating, - Provisioning, - Downloading, - Processing, - Recording, - Notify, - Failed, - Completed, - Deleted; -} - -``` - -## ZK Node Keys - -```java -public enum ZKKey { - STATUS("status"), - LOCK("lock"), - BATCH_SUBMISSION("submission"), - BATCH_STATUS_REPORT("status-report"), - JOB_CONFIGURATION("configuration"), - JOB_IDENTIFIERS("identifiers"), - JOB_PRIORITY("priority"), - JOB_SPACE_NEEDED("space_needed"), - JOB_BID("bid"); -} -``` - -## Merritt ZK API - -```java -public class MerrittZKNodeInvalid extends Exception { - public MerrittZKNodeInvalid(String message); -} - -public class MerrittStateError extends Exception { - public MerrittZKNodeInvalid(String message); -} - -abstract public class QueueItem - private String id; - private JSONObject data; - private IngestState status; - - public QueueItem(String id); - public QueueItem(String id, JSONObject data); - public String id(); - public JSONObject data(); - public IngestState status(); - public void loadProperties(ZooKeeper client) throws MerrittZKNodeInvalid; - public String stringProperty(ZooKeeper client, ZKKey key) throws MerrittZKNodeInvalid; - public JSONObject jsonProperty(ZooKeeper client, ZKKey key) throws MerrittZKNodeInvalid; - public int intProperty(ZooKeeper client, ZKKey key) throws MerrittZKNodeInvalid; - public long longProperty(ZooKeeper client, ZKKey key) throws MerrittZKNodeInvalid; - public void setData(ZooKeeper client, ZKKey key, Object data) throws MerrittZKNodeInvalid; - - public abstract String dir(); - public abstract String prefix(); - public abstract IngestState initState(); - public abstract IngestState[] states(); - public String path(); - public abstract IngestState resolveStatus(String s); - - public static String serialize(Object data); - public static String createId(ZooKeeper client, String prefix); - public JSONObject statusObject(IngestState status); - public void setStatus(ZooKeeper client, IngestState status) throws MerrittZKNodeInvalid; - public boolean lock(ZooKeeper client) throws MerrittZKNodeInvalid; - public boolean unlock(ZooKeeper client) throws MerrittZKNodeInvalid; - public abstract void delete(ZooKeeper client) throws MerrittStateError; -} - -public class Batch extends QueueItem { - private boolean hasFailure; - - public Batch(String id); - public Batch(String id, JSONObject data); - - public boolean hasFailure(); - - public String dir(); - public String prefix(); - public IngestState initState(); - public IngestState[] states(); - public static String prefixPath(); - public IngestState resolveStatus(String s); - - public static Batch createBatch(ZooKeeper client, JSONObject submission); - - public void delete(ZooKeeper client) throws MerrittStateError; - - public static Batch aquirePendingBatch(ZooKeeper client); - public static Batch aquireCompletedBatch(ZooKeeper client); -} - -public class Job extends QueueItem { - private String bid; - private int priority; - private long space_needed; - private String jobStatePath; - private String batchStatePath; - - public String dir(); - public String prefix(); - public IngestState initState(); - public IngestState[] states(); - public static String prefixPath(); - public IngestState resolveStatus(String s); - - public Job(String id, String bid); - public Job(String id, String bid, JSONObject data); - - public static String dir(); - public static String prefix(); - public static String prefixPath(); - public String bid(); - public int priority(); - public long spaceNeeded(); - - public static Job createJob(ZooKeeper client, String bid, JSONObject configuration); - @Override public void loadProperties(ZooKeeper client); - public void setPriority(ZooKeeper client, int priority); - public void setSpaceNeeded(ZooKeeper client, long space_needed); - public void setStatus(ZooKeeper client, IngestState status); - public String batch_state_subpath(); - public void setBatchStatePath(ZooKeeper client); - public void setJobStatePath(ZooKeeper client); - - public void delete(ZooKeeper client) throws MerrittStateError; - - public JSONObject statusObject(IngestState status); - public static Job acquireJob(ZooKeeper client, IngestState status); -} - - -``` diff --git a/design/queue-2023/manager.md b/design/queue-2023/manager.md deleted file mode 100644 index ea09534..0000000 --- a/design/queue-2023/manager.md +++ /dev/null @@ -1,3 +0,0 @@ -# Queue Manager Service - -- [Design](README.md) diff --git a/design/queue-2023/queue-admin.md b/design/queue-2023/queue-admin.md deleted file mode 100644 index 2ffc42c..0000000 --- a/design/queue-2023/queue-admin.md +++ /dev/null @@ -1,68 +0,0 @@ -# Queue Administration - -- [Design](README.md) - -## Purpose -Migrate Queue Administration Tasks from Ingest to the Merritt Admin Tool - -## Potential Enhancements to Enable Shift of Admin Functionality - -### Read from Zookeeper from Admin Tool - -- This is very do-able. -- Current ingest queue uses java property serialization. This may be difficult for ruby code to read. -- Proposal: modify the Ingest Queue Item to be serialized as JSON instead - -### Write to Zookeeper from Admin Tool - -- This is very do-able. -- Assumes binary data can be written back as-is from ruby -- Proposal: modify the Ingest Queue Item to be serialized as JSON instead - -### Publish Ingest Profiles as an Artifact - -- Ingest service will pull profiles from a deployed artifact (zip file) rather than cloning git -- Admin tool code will pull profiles from a deployed artifact (zip file) rather than requesting data from ingest -- See https://github.com/CDLUC3/mrt-doc-private/issues/80 - -### ~Mount ZFS to Lambda~ - -- This is not recommended -- Conceptually, this could allow the remaining set of admin functions to be performed entirely from Lambda - -## Existing Ingest Admin Endpoints - -|service|admin endpoint|future loc|feature needed | comment| -|-|-|-|-|-| -|ingest|/state| NA | | /admin/state duplicates /state | -|ingest|/help| NA | | /admin/help duplicates /state | -|ingest|POST reset| ?? | | | -|ingest|/locks| admin| read zookeeper from admin| | -|ingest|/queues| admin | read zookeeper from admin | | -|ingest|/queues-acc| admin | read zookeeper from admin | | -|ingest|/queues-inv| admin | read zookeeper from admin | | -|ingest|/queue| admin | read zookeeper from admin| | -|ingest|/queue/{queue}| admin | read zookeeper from admin | diffult with current java property serialization | -|ingest|/queue-acc/{queue}| admin | read zookeeper from admin | | -|ingest|/queue-inv/{queue}| admin | read zookeeper from admin | | -|ingest|/lock/{lock}| admin |read zookeeper from admin | | -|ingest|POST /requeue/{queue}/{id}/{fromState}| admin | write zookeeper from admin | | -|ingest|POST /deleteq/{queue}/{id}/{fromState}| admin | write zookeeper from admin | | -|ingest|POST /cleanupq/{queue}| admin | write zookeeper from admin | | -|ingest|POST /{action: hold or release}/{queue}/{id}| admin | write zookeeper from admin | | -|ingest|POST /release-all/{queue}/{profile}| admin | write zookeeper from admin | | -|ingest|{profilePath}| admin | profiles as artifact | | -|ingest|/profiles-full| admin| profiles as artifact | | -|ingest|/profile/{profile}| admin | profiles as artifact| | -|ingest|/profile/admin/{env}/{type}/{profile}| admin| profiles as artifact | | -|ingest|/bids/{batchAge}| ingest | mount zfs to lambda | keep in ingest | -|ingest|/bid/{batchID}| ingest | mount zfs to lambda | keep in ingest| -|ingest|/bid/{batchID}/{batchAge}| ingest | mount zfs to lamda | keep in ingest| -|ingest|/jid-erc/{batchID}/{jobID}| ingest| mount zfs to lambda | keep in ingest| -|ingest|/jid-file/{batchID}/{jobID}| ingest | mount zfs to lambda| keep in ingest| -|ingest|/jid-manifest/{batchID}/{jobID}| ingest | mount zfs to lambda| keep in ingest| -|ingest|POST /submission/{request: freeze or thaw}/{collection}| admin | implement hold/freeze in ZK | | -|ingest|POST /submissions/{request: freeze or thaw}| admin | implement hold/freeze in ZK | | -|ingest|POST /profile/{type}| admin? | | Is this simply a template edit? If so, could the admin tool do this?| -|access|POST /flag/set/access/#{qobj}|admin|write zookeeper from admin |Access Queue freeze/thaw| -|access|POST /flag/clear/access/#{qobj}|admin|write zookeeper from admin |Access Queue freeze/thaw| diff --git a/design/queue-2023/ruby-api.md b/design/queue-2023/ruby-api.md deleted file mode 100644 index 10c8086..0000000 --- a/design/queue-2023/ruby-api.md +++ /dev/null @@ -1,158 +0,0 @@ -# Merritt ZK Ruby API - -- [Design](README.md) / [API](api.md) - -## States - -```rb -module MerrittZK - class IngestState - # clients should not need this method - def initialize(status, next_status) - - # returns the status object - def status - # returns the name of the status - def name - # returns a hash of the next allowable status values - def next_status - # indicates if deletion is permitted from the current status (no next status values) - def deletable? - # returns the next successful state object - def success - # returns the name of the next successful state - def fail - # returns the name of the next failure state - def state_change_allowed(state) - # perform a state change - def state_change(state) - def to_s - end - - class JobState < IngestState - # Get hash of states - def self.states - # Get initial state object - def self.init - - def state_change(state) - def success - def fail - - # Singleton Enum-like objects - def self.Pending - def self.Held - def self.Estimating - def self.Provisioning - def self.Downloading - def self.Processing - def self.Recording - def self.Notify - def self.Failed - def self.Deleted - def self.Completed - end - - class BatchState < IngestState - # Get hash of states - def self.states - # Get initial state object - def self.init - - def state_change(state) - def success - def fail - - # Singleton Enum-like objects - def self.Pending - def self.Held - def self.Processing - def self.Reporting - def self.UpdateReporting - def self.Failed - def self.Deleted - def self.Completed - end - -end -``` - -## ZK API - -```rb -module MerrittZK - class MerrittZKNodeInvalid < StandardError - def initialize(message) - end - - class MerrittStateError < StandardError - def initialize(message) - end - - class QueueItem - # initialize a new queue item - def initialize(id, data: nil) - attr_reader :id, :status - def states - # load a queue item from zookeeper - def load(zk) - # load properties associated with a queue item - def load_properties(zk) - # retrieve a string property from zk - def string_property(zk, key) - # retrieve a json property from zk - def json_property(zk, key) - # retrieve an integer property from zk - def int_property(zk, key) - # set data for a zookeeper node - def set_data(zk, key, data) - # path for a queue item - def path - # path to the status object for a queue item - def status_path - # serialize an object as a string - def self.serialize(v) - # create a sequential queue node which will generate a unique id - def self.create_id(zk, prefix) - # generate the status object for a queue item - def status_object(status) - # save/update the status for a queue item - def set_status(zk, status) - # lock a queue item with an ephemeral lock - def lock(zk) - # release the lock on a queue item - def unlock(zk) - end - - class Batch < QueueItem - def states - def self.dir - def self.prefix_path - def path - def delete(zk) - def self.create_batch(zk, submission) - def self.acquire_pending_batch(zk) - def self.acquire_completed_batch(zk) - end - - class Job < QueueItem - def initialize(id, bid: nil, data: nil) - def load_properties(zk) - attr_reader :bid, :priority, :space_needed, :jobstate - def set_priority(zk, priority) - def set_space_needed(zk, space_needed) - def set_status(zk, status) - def batch_state_subpath - def set_batch_state_path(zk) - def set_job_state_path(zk) - def states - def self.prefix_path - def path - def delete(zk) - def self.create_job(zk, bid, data) - def status_object(status) - def self.acquire_job(zk, state) - end -end -``` - diff --git a/design/queue-2023/service.md b/design/queue-2023/service.md deleted file mode 100644 index 2899733..0000000 --- a/design/queue-2023/service.md +++ /dev/null @@ -1,30 +0,0 @@ -# Underlying Queue Service - -- [Design](README.md) - -## Why [Zookeeper](https://zookeeper.apache.org/)? - -Explain our rationale. - -## Queueing Alternatives - -- [Amazon SQS](https://aws.amazon.com/pm/sqs/) - no priority, messages have a shorter duration than we need -- [Amazon MQ](https://aws.amazon.com/amazon-mq/) (deploys to managed EC2 instances) - - [Apache MQ](https://activemq.apache.org/) (uses Zookeeper!) - - [Rabbit MQ](https://www.rabbitmq.com/) -- [Apache Kafka](https://kafka.apache.org/intro) (more like Amazon SNS) - -## Distributed Locking -- [Redis](https://redis.io/) -- [DynamoDB](https://aws.amazon.com/dynamodb) - -## State Transition (for Queue Items) -- [Amazon Eventbridge](https://aws.amazon.com/eventbridge/) -- [Amazon Step Functions](https://docs.aws.amazon.com/step-functions/index.html) - -## AWS Consultation with Kevin and Maria -- Consider EMR for running ZK, may be a costly option: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zookeeper.html -- Consider AWS Step Functions instead of queueing with ZK - - Kevin will brainstorm and estimate if this could be more affordable than our 5 t3.small instances -- Consider fargate + event stream to configure hosts: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_event_stream.html - - Kevin will send Ashley code samples diff --git a/design/queue-2023/states.md b/design/queue-2023/states.md deleted file mode 100644 index 65fe7c3..0000000 --- a/design/queue-2023/states.md +++ /dev/null @@ -1,117 +0,0 @@ -# Queue States - -- [Design](README.md) - -## Batch Queue - -### Batch Queue State Diagram - -```mermaid -graph LR - START --> Pending - click START "javascript:alert(222)" "Tip" - Pending --> Held - Pending --> Processing - Held -.-> Processing - Processing --> Reporting - Reporting --> COMPLETED - Reporting --> Failed - Failed -.-> UpdateReporting - UpdateReporting --> Failed - UpdateReporting --> COMPLETED - Failed -.-> DELETED - Held -.-> DELETED -``` - ---- -## Batch Queue States -_A dashed line indicates and administrative action initiated by the Merritt Team_ - -### Pending -Batch is ready to be processed -### Held -Collection is HELD. The hold must be released before the batch can proceed. -### Processing -Payload is analyzed. If the payload is a manifest, it will be downloaded. Jobs are created in the job queue. -### Reporting -All jobs have COMPLETED or FAILED, a summary e-mail is sent to the depositor. -### COMPLETED -All jobs COMPLETED -### Failed -At least one job FAILED -### UpdateReporting -Determine if any previously FAILED jobs are not complete. If so, notify the depositor by email. - ---- -## Job Queue - -### Job Queue State Diagram -```mermaid -graph TD - START --> Pending - Pending --> Held - Pending --> Estimating - Held -.-> Pending - Estimating --> Provisioning - Provisioning --> Downloading - Downloading --> Processing - Downloading --> Failed - Failed -.-> Downloading - Processing --> Recording - Processing --> Failed - Failed -.-> Processing - Recording --> Notify - Recording --> Failed - Failed -.-> Recording - Notify --> COMPLETED - Notify --> Failed - Failed -.-> Notify - Failed -.-> DELETED - Held -.-> DELETED -``` - ---- -## Job Queue States -### Pending -Job is waiting to be acquired by the queue -### Held -Since Job was queued, the collection has been put into a HELD state. The job will require an administrateive action to release it after the hold is released. -### Estimating -Perform HEAD requests for all content to estimate the size of the ingest. If HEAD requests are not supported, no value is updated and the job should proceed with limited information. This could potentially affect priority for the job -### Provisioning -Once dynamic provisioning is implemented (ie zfs provisioning), wait for dedicated file system to be provisioned. - -If not dedicated file system is specified, use default working storage. -- if working storage is more than 80% full, then wait -- otherwise, use default working storage -### Downloading -One or more downloads is in progress. This can be a multi-threaded step. Threads are not managed in the queue. -### Processing -All downloads complete; perform Merritt Ingest -- validiate checksum check -- mint identifiers -- create system files -- notify storage -### Recording -Storage is complete; ready for Inventory. -The Inventory service will operate on this step. -### Notify -Invoke callback if needed -Notify batch handler that the job is complete -### COMPLETED -Storage and inventory are complete, cleanup job folder -### Failed -The queue will track the last successful step so that the job can be resumed at the appropriate step. - ---- - - -## Design Questions - -- Should we have separate states for "Active Provisioning" vs "Capacity Checks"? -- Should each job know about its list of downloads so that the download step could be resumed at a specific point? -- -- reset status -### Failed --> Notify -- reset status -### Failed --> Deleted (admin function) diff --git a/design/queue-2023/tests.yml b/design/queue-2023/tests.yml deleted file mode 100644 index bb78e42..0000000 --- a/design/queue-2023/tests.yml +++ /dev/null @@ -1,8 +0,0 @@ -test_id: - title: Test name - input: - path: value - api-calls: - - create-batch - output: - path: value diff --git a/design/queue-2023/transition.md b/design/queue-2023/transition.md deleted file mode 100644 index 590c1c1..0000000 --- a/design/queue-2023/transition.md +++ /dev/null @@ -1,841 +0,0 @@ -# State Transition Details - -- [Design](README.md) - -## Conventions - -- BID: Zookeeper will manage the incrementing id numbers for batches -- JID: Zookeeper will manage the incrementing id numbers for jobs - -### Job Creation - -> [!NOTE] -> A new job id will be written in 3 places -> - To the list of jobs `/jobs/JID` -> - To the batch's job queue `/batches/BID/states/STATE/JID` -> - this queue is used to determine when a batch is complete -> - To the actual job queue `/jobs/states/STATE/XX-JID` -> - XX is the job's initial priority -> - This is the only place where the priority appears in the path name -> -> When applying a change to the job queue, the following sequence should be used -> - Update job data `/jobs/JID` -> - Delete old job queue entry `/jobs/states/OLD_STATE/XX-JID` -> - Create new job queue entry `/jobs/states/NEW_STATE/XX-JID` -> - Update the batch queue state (if applicable) -> - Delete old batch queue state `/batches/BID/states/OLD_STATE/JID` -> - Create new batch queue state `/batches/BID/states/NEW_STATE/JID` - -### Locking Batches and Jobs - -Locks on Jobs and Batches should be implemented with a [Zookeeper ephemeral lock](https://zookeeper.apache.org/doc/r3.4.5/zookeeperOver.html#Nodes+and+ephemeral+nodes). If a zookeeper client process terminates, ephemeral locks are released. - -- [ ] TODO: Review this with Mark - -### Acquiring a Batch -```yml -/batches/BID/lock: #ephemeral node to be held by a consumer daemon -``` - -### Acquiring a Job -```yml -/jobs/JID/lock: #ephemeral node to be held by a consumer daemon -``` - -## Consumer Daemons to Create - -- Batch Pending -- Batch Reporting -- Batch Update Reporting -- Job Pending -- Job Estimating -- Job Provisioning (when work is held in this state, sleep between cycles) -- Job Downloading -- Job Processing -- Job Recording (implemented by Merritt Inventory) -- Job Notify - -## Batch: API Call Triggers the Creation and Queuing of a Batch - -User submits a submission payload. -A batch is created using the payload url. -Regardless of the type of submission, the payload should be represented as a URL. -This step should be as lightweight as possible. - -### Input -```yml -submitter: submitter -type: file # container or file in the case of a zip deposit -profile: profile -payload_url: payload_url -manifest: -- file1.checkm loc001 -- file2.checkm loc002 -- file3.checkm loc003 ark123 -``` - -### ZK Nodes - -```yml -/batches/bid0001/submission: - profile_name: profile_name - submitter: submitter - payload_url: payload_url - erc_what: title - erc_who: author - erc_when: date - erc_where: - type: file - submission_mode: add -/batches/bid0001/status: - status: pending - last_modified: now -``` - -## Batch: Acquire Pending Batch - -### Identifying Pending Batches -- A "pending batch" can be identified by the absence of a "/states" child -- If a batch has a "/states" child, the queue will ignore it - -### Create Lock -```yml -/batches/bid001/lock: #ephemeral -``` - -### State Description -If a Collection Hold is in place, change status to Held and stop processing. - -The differences in batch submission types (single file, object manifest, manifest of manifests, manifest of containers) should be handled at this phase. -One job will be spawned for each object that needs to be created for the payload. - -If configured in the profile, a summary email should be sent to the depositor confirming the queueing of the batch of jobs. - -### Batch: Pending to Held - -If the collection is in a held state, the batch should move to a held status. -An administrative action is necessary to release the hold. - -```yml -/batches/bid0001/status: - status: held - last_modified: now -# DELETE /batches/bid001/lock -``` - -### Batch: Pending --> Processing - -```yml -/batches/bid0001/status: - status: processing - last_modified: now -# DELETE /batches/bid001/lock -```### Output - -#### Job Details -```yml -/jobs/jid0001/configuration: - batch_id: bid0001 - profile_name: profile_name - submitter: submitter - payload_url: file1.checkm - payload_type: object_manifest - response_type: response_type - response_type: tbd - submission_mode: add - working_dir: /zfs/queue/bid0001/jid0001 -/jobs/jid0001/identifiers: - primary: - local_id: [loc001] -/jobs/jid0001/status: - status: pending - last_successful_status: #nil - last_modification_date: now - retry_count: 0 -/jobs/jid0001/priority: 5 -/jobs/jid0002/configuration: - batch_id: bid0001 - profile_name: profile_name - submitter: submitter - payload_url: file2.checkm - payload_type: object_manifest - response_type: response_type - response_type: tbd - submission_mode: add - working_dir: /zfs/queue/bid0001/jid0002 - local_id: loc002 -/jobs/jid0002/identifiers: - primary: - local_id: [loc002] -/jobs/jid0002/status: status: - status: pending - last_successful_status: #nil - last_modification_date: now - retry_count: 0 -/jobs/jid0002/priority: 5 -/jobs/jid0003/configuration: - batch_id: bid0001 - profile_name: profile_name - submitter: submitter - payload_url: file2.checkm - payload_type: object_manifest - response_type: response_type - response_type: tbd - submission_mode: add - working_dir: /zfs/queue/bid0001/jid0003 -/jobs/jid0003/identifiers: - primary: ark123 - local_id: [loc003] -/jobs/jid0003/status: status: - status: pending - last_successful_status: #nil - last_modification_date: now - retry_count: 0 -/jobs/jid0003/priority: 5 -``` - -#### Place jobs in job queue, allowing sorting by priority -```yml -/jobs/states/pending/05-jid0001: #no data - acts as a reference -/jobs/states/pending/05-jid0002: #no data - acts as a reference -/jobs/states/pending/05-jid0003: #no data - acts as a reference -``` - -#### Place jobs references in batch queue -```yml -/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference -/batches/bid0001/states/batch-processing/jid0002: #no data - acts as a reference -/batches/bid0001/states/batch-processing/jid0003: #no data - acts as a reference -``` - -## Batch: Held --> Pending (Admin Action) - -An administrative action is performed to release a "Held" batch. -After confirming that the target collection is no longer "Held", proceed to the Processing step. - -```yml -/batches/bid0001/status: - status: pending - last_modified: now -``` - -## The Job Queue - -The Job Queue runs independently from the Batch Queue -- The keys in the job queue thread are sorted by job priority which ensures that higher priority jobs will be initiated first -- If a collection hold has been set since the job was created, set the job state to Held -- Otherwise, set the job state to Processing - -## Job: Acquire Pending Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### Job: Pending --> Failed - -A job will immediately fail under the following conditions -- if payload digest does not match depositor digest -- if manifest is corrupt - -Recovery is not possible under these conditions. A new submission will be required. - -```yml -/jobs/jid0001/status: - status: failed - last_successful_status: #nil - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/pending/05-jid0001: -/jobs/states/failed/05-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Pending --> Held - -The job will be kept in a Held state until an administrative action releases the job. - -```yml -/jobs/jid0002/status: - status: held - last_successful_status: #nil - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/pending/05-jid0001: -/jobs/states/held/05-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Pending --> Estimating - -Once a job is acquired, it will move to an Estimating step. - -### Status -```yml -/jobs/jid0002/status: - status: estimating - last_successful_status: #nil - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/pending/05-jid0001: -/jobs/states/estimating/05-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Held --> Pending (Admin Action) - -Job is administratively released back to a Pending status. - -### Status -```yml -/jobs/jid0002/status: - status: pending - last_successful_status: #nil - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/held/05-jid0001: -/jobs/states/pending/05-jid0001: #no data - acts as a reference -``` - -## Job: Acquire Estimating Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -The first step of a job is to estimate the resources that will be needed to process the job. -This will be accomplished by running HEAD reqeusts for content to be ingested and calculating a size estimate for the object. -If a job is excessively large , the job priority may be adjusted. - -The estimating step does not fail. If a proper size calculation cannot be made for a job, the space_needed should be set to 0 and job priority may be adjusted. - -### Output - -```yml -/jobs/jid0002/space_needed: 1000000000 -/jobs/jid0002/priority: 10 -``` - -### Job: Estimating --> Provisioning - -```yml -/jobs/jid0002/status: - status: provisioning - last_successful_status: estimating - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/estimating/05-jid0001: -/jobs/states/provisioning/10-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Acquire Provisioning Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -The Provisioning state will be used to determine if there are sufficient system resources for a job to procede. -At the simplest level, this state would allow us to throttle all subsequent ingests if our ZFS capacity is insufficient to support a specific download. -Unestimated jobs should be held in this state if the ZFS capacity is below a specific threshold. - -Additionally, this state could be used to hold a job while resources are dynamically provisioned from AWS. This will not be a feature of the initial release. - -Jobs that fail the provisioning test will remain in this state, so it is important that ALL jobs in this state get evaluated. If some jobs are retained in the provisioning state, it might make sense for the provisioning thread to sleep between tests. - -### Job: Provisioning --> Downloading - -```yml -/jobs/jid0002/status: - status: downloading - last_successful_status: provisioning - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/provisioning/10-jid0001: -/jobs/states/downloading/10-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Acquire Downloading Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -The Downloading step performs the following actions -- Performs a GET request on every download (multi-threaded), with a finite number of retries -- Saves files to working folder -- Recalculate space_needed (in case estimate was inaccurate) -- Perform digest validation (if user-supplied in manifest) - -### Ouptut (if changes detected) - -```yml -/jobs/jid0001/space_needed: 1000000000 -/jobs/jid0001/priority: 10 -``` - -### Job: Downloading --> Processing - -```yml -/jobs/jid0001/status: - status: processing - last_successful_status: downloading - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/downloading/10-jid0001: -/jobs/states/processing/10-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Downloading --> Failed (downloading) - -If any individual download does not succeed (after a set number of retries), the job will go to a failed state. - -```yml -/jobs/jid0002/status: - status: failed - last_successful_status: provisioning # retain_prior_value - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/provisioning/10-jid0001: -/jobs/states/failed/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Acquire Processing Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -The processing step is where the bulk of Merritt Ingest processing takes place - -- Local_id lookup -- Mint ark using EZID if needed -- Write ark out to zookeeper immediately - - If the job is restarted, a new id will not need to be minted -- if local_id does not match user-supplied ark, fail -- Write ERC file -- Write dublin_core file -- Check digest for each file if needed (HandlerDigest) -- Create storage manifest (HandlerDigest) -- Request storage worker for handling request (very low risk of failure) -- Call storage enpoint to pass storage manifest -- Check return status from storage - -### Output - -```yml -/jobs/jid0002/identifier: - primary: 555 - local_id: [loc002] -``` - -### Job: Processing --> Recording - -```yml -/jobs/jid0002/status: - status: recording - last_successful_status: processing - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/processing/10-jid0001: -/jobs/states/recording/10-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Processing --> Failed (processing) - -Jobs may fail processing due to minting failure or storage failures. - -```yml -/jobs/jid0002/status: - status: failed - last_successful_status: downloading # retain_prior_value - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/processing/10-jid0001: -/jobs/states/failed/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Acquire Recording Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -This step will be processed by the Merritt Inventory service. -- Inventory will read the storage manifest -- Inventory will update the Merritt INV database - -This will satisfy one of the key motivations for the queue redesign effort. -By processing the inventory step from the ingest queue, the depositor notification process will ensure that content is immediately accessible from Merritt. -Previously, it was possible that depositors were notified of a successful ingest BEFORE content had been recorded in inventory. - -### Job: Recording --> Notify - -```yml -/jobs/jid0002/status: - status: notify - last_successful_status: recording - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/recording/10-jid0001: -/jobs/states/notify/10-jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Recording --> Failed (recording) - -This status change indicates that an error occurred while recording an object change in the inventory database. - -```yml -/jobs/jid0002/status: - status: failed - last_successful_status: processing # retain_prior_value - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/recording/10-jid0001: -/jobs/states/failed/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Job: Acquire Notify Job - -### Create Lock -```yml -/jobs/jid0001/lock: #ephemeral -``` - -### State Description - -If a callback has been configured in a collection profile, the callback will be invoked for the job. -As the status of the job is changed to "completed", the batch object for the job will be notified of the update (potentially via a Zookeeper "Watcher"). -This will allow the batch to determine if the entire job has been completed. - -### Job: Notify --> Completed - -```yml -/jobs/jid0002/status: - status: completed - last_successful_status: notify - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/notify/10-jid0001: -/jobs/states/completed/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-completed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -### Job: Notify --> Failed - -If the event of a callback failure, the job will go to a Failed state. - -```yml -/jobs/jid0002/status: - status: failed - last_successful_status: recording # retain_prior_value - last_modification_date: now - retry_count: 0 # no change -# DELETE /jobs/states/notify/10-jid0001: -/jobs/states/failed/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-processing/jid0001: -/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference -# DELETE /jobs/jid0001/lock -``` - -## Resuming failed jobs - -The failed job will be resumed via an admin action. -The resumed job will restart at an appropriate state based on the "last_successful_state". - -### Job: Failed --> Downloading (Admin Action) - -```yml -/jobs/jid0002/status: - status: downloading - last_successful_status: provisioning # no change - last_modification_date: now - retry_count: 1 # increment by 1 -# DELETE /jobs/states/failed/10-jid0001: -/jobs/states/downloading/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-failed/jid0001: -/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference -``` - -### Job: Failed --> Processing (Admin Action) - -```yml -/jobs/jid0002/status: - status: processing - last_successful_status: downloading # no change - last_modification_date: now - retry_count: 1 # increment by 1 -# DELETE /jobs/states/failed/10-jid0001: -/jobs/states/processing/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-failed/jid0001: -/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference -``` - -### Job: Failed --> Recording (Admin Action) - -```yml -/jobs/jid0002/status: - status: recording - last_successful_status: processing # no change - last_modification_date: now - retry_count: 1 # increment by 1 -# DELETE /jobs/states/failed/10-jid0001: -/jobs/states/recording/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-failed/jid0001: -/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference -``` - -### Job: Failed --> Notify (Admin Action) - -```yml -/jobs/jid0002/status: - status: notify - last_successful_status: processing # no change - last_modification_date: now - retry_count: 1 # increment by 1 -# DELETE /jobs/states/failed/10-jid0001: -/jobs/states/notify/10-jid0001: #no data - acts as a reference -# DELETE /batches/bid0001/states/batch-failed/jid0001: -/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference -``` - -## Job: Completed --> DELETED (Automated Task) - -Upon completion of the job, the job's ZFS working directory (producer AND system) can be deleted. - -Other job-related data will be retained in zookeeper to facilitate reporting. - -```yml -# DELETE /jobs/states/completed/10-jid0001: -``` - -## Job: Failed --> DELETED (Admin Action) - -If the batch is not yet completed, confirm that the user understands that job deletion will prevent notification of job-related information. - -Upon deletion of a failed job, the job's zookeeper nodes and the ZFS working directory can be deleted. - -```yml -# DELETE /jobs/jid0001/configuration: -# DELETE /jobs/jid0001/status: -# DELETE /jobs/jid0001/priority: -# DELETE /jobs/jid0001/ark: -# DELETE /jobs/states/failed/10-jid0001: -# DELETE /batches/bid0001/states/batch-failed/jid0001: -``` - -## Job: Held --> DELETED (Admin Action) - -If the batch is not yet completed, confirm tha tthe user understands that job deletion will prevent notification of job-related information. - -Upon completion of a held job, the job's zookeeper nodes and the ZFS working directory can be deleted. - -```yml -# DELETE /jobs/jid0001/configuration: -# DELETE /jobs/jid0001/status: -# DELETE /jobs/jid0001/priority: -# DELETE /jobs/jid0001/ark: -# DELETE /jobs/states/held/10-jid0001: -# DELETE /batches/bid0001/states/batch-processing/jid0001: -``` - -## Batch: Processing --> Reporting (Automated by event) - -Once the last job for a batch has either failed or completed, the batch will move to a reporting step. - -```yml -# NOTE the absence of /batches/bid0001/states/batch-processing/*: -# NOTE check for the presence of /batches/bid0001/states/batch-failed/*: -# NOTE check for the presence of /batches/bid0001/states/batch-completed/*: -/batches/bid0001/status: - status: reporting - last_modified: now -``` - -## Batch: Acquire Reporting Batch - -### Create Lock -```yml -/batches/bid0001/lock: #ephemeral -``` - -### State Description - -The reporting phase will gather a list of completed jobs for a batch and failed jobs for a batch. -This will be compiled into a report for the depositor. - -The list of failed jobs should be saved to a zookeeper node so their status can be re-evaluated for a subsequent report. - -### Output - -```yml -/batches/bid0001/status-report: - last_modified: now - failed_jobs: - # array of jids - successful_jobs: - # array of jids -``` - -### Batch: Reporting --> Completed - -```yml -/batches/bid0001/status: - status: completed - last_modified: now -# DELETE /batches/bid0001/lock -``` - -### Batch: Reporting --> Failed - -The reporting phase will gather a list of completed jobs for a batch and failed jobs for a batch. -This will be compiled into a report for the depositor. - -```yml -/batches/bid0001/status: - status: failed - last_modified: now -# DELETE /batches/bid0001/lock -``` - -## Batch: Failed --> UpdateReporting (Admin Action) - -This status change will be triggered by an administrative action. This action indicates that attempts to troubleshoot failed jobs for a batch have concluded. - -```yml -/batches/bid0001/status: - status: update-reporting - last_modified: now -``` - -## Batch: Acquire Update Reporting Batch - -### Create Lock -```yml -/batches/bid0001/lock: #ephemeral -``` - -### State Description - -A subsequent report will be sent to the depositor indicating jobs that succeeded since the last report was sent. - -It _might_ make sense to also indicate the jobs that were not resolved since the prior report was sent. - -### Output - -```yml -/batches/bid0001/status-report: - last_modified: now - failed_jobs: - # array of jids - successful_jobs: - # array of jids -``` - -### Batch: UpdateReporting --> Completed - -A subsequent report will be sent to the depositor indicating jobs that succeeded since the last report was sent. - -```yml -/batches/bid0001/status: - status: completed - last_modified: now -# DELETE /batches/bid0001/lock -``` - -### Batch: UpdateReporting --> Failed - -```yml -/batches/bid0001/status: - status: failed - last_modified: now -# DELETE /batches/bid0001/lock -``` - -## Batch: Failed --> DELETED (Admin Action) - -An administrative action will trigger the delete of a failed batch (and any outstanding jobs for that batch). - -This action should only be taken once all attempts at job recovery have been exhausted. - -```yml -# DELETE /batches/bid0001/status: -# DELETE /batches/bid0001/status-report: -# DELETE /batches/bid0001/submission: -# for every JID in /batches/bid0001/states/batch-*/*: -# DELETE /batches/bid0001/states/batch-completed/JID: -# DELETE /jobs/states/STATE/*-JID if present -# DELETE /jobs/JID/configuration: -# DELETE /jobs/JID/status: -# DELETE /jobs/JID/priority: -# DELETE /jobs/JID/ark: -``` - -## Batch: Held --> Deleted (Admin Action) - -An administrative action will trigger the delete of a held batch. - -_Execute this step with caution since the depositor will not be notified of this action._ - -```yml -# DELETE /batches/bid0001/status: -# DELETE /batches/bid0001/status-report: -# DELETE /batches/bid0001/submission: -# for every JID in /batches/bid0001/states/batch-*/*: -# DELETE /batches/bid0001/states/batch-completed/JID: -# DELETE /jobs/states/STATE/*-JID if present -# DELETE /jobs/JID/configuration: -# DELETE /jobs/JID/status: -# DELETE /jobs/JID/priority: -# DELETE /jobs/JID/ark: -``` - -## Batch: Completed --> Automatic Cleanup - -Clean up the remnants of a properly completed batch. - -```yml -# DELETE /batches/bid0001/status: -# DELETE /batches/bid0001/status-report: -# DELETE /batches/bid0001/submission: -# for every JID in /batches/bid0001/states/batch-completed/*: -# DELETE /batches/bid0001/states/batch-completed/JID: -# DELETE /jobs/states/STATE/*-JID if present -# DELETE /jobs/JID/configuration: -# DELETE /jobs/JID/status: -# DELETE /jobs/JID/priority: -# DELETE /jobs/JID/ark: -``` diff --git a/design/queue-2023/use-cases.md b/design/queue-2023/use-cases.md deleted file mode 100644 index a9d33ee..0000000 --- a/design/queue-2023/use-cases.md +++ /dev/null @@ -1,595 +0,0 @@ -# Batch Queue Use Cases - -- [Design](README.md) - -## Use Case: Successful Batch - -### User submits manifest with 3 items - -```mermaid -graph TD - Manifest[/Batch Manifest/] - Ingest(Ingest Service) - Batch[Batch: Pending] - Manifest --> Ingest - Ingest --> Batch -``` - -### Batch Queue starts Batch - -```mermaid -graph TD - Batch[Batch: Processing] -``` - -### Batch downloads manifest and creates 3 jobs - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Pending] - Job2[Job 2: Pending] - Job3[Job 3: Pending] - Batch --> |job1_payload_url| Job1 - Batch --> |job2_payload_url| Job2 - Batch --> |job3_payload_url| Job3 -``` - -### Jobs Begin - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: Processing] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Job 2 Completes - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- |notify| Job2 - Batch --- Job3 -``` -### Job 3 completes -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 -``` - -### Job 1 completes -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- |notify| Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Batch Reports Job Status to Depositor -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Email(Email Status to Depositor) - Batch --> StatusReport - StatusReport --> Email -``` - -### Batch Completes -```mermaid -graph TD - Batch[Batch: COMPLETED] -``` - -## Use Case: Failed Batch - -### User submits manifest with 3 items - -```mermaid -graph TD - Manifest[/Batch Manifest/] - Ingest(Ingest Service) - Batch[Batch: Pending] - Manifest --> Ingest - Ingest --> Batch -``` - -### Batch Queue starts Batch - -```mermaid -graph TD - Batch[Batch: Processing] -``` - -### Batch downloads manifest and creates 3 jobs - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Pending] - Job2[Job 2: Pending] - Job3[Job 3: Pending] - Batch --> |job1_payload_url| Job1 - Batch --> |job2_payload_url| Job2 - Batch --> |job3_payload_url| Job3 -``` - -### Jobs Begin - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: Processing] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Job 2 Completes - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- |notify| Job2 - Batch --- Job3 -``` -### Job 3 fails -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 -``` - -### Job 1 completes -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- |notify| Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Batch Reports Job Status to Depositor -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Email(Email Status to Depositor) - Batch --> StatusReport - StatusReport --> Email -``` - -### Batch Goes to Failed state -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - - -### Admin Deletes Batch after determining that resubmission of failed job is not possible -```mermaid -graph TD - Batch[Batch: DELETED] -``` - - -## Use Case: Failed Batch with Successful Retry - - -### User submits manifest with 3 items - -```mermaid -graph TD - Manifest[/Batch Manifest/] - Ingest(Ingest Service) - Batch[Batch: Pending] - Manifest --> Ingest - Ingest --> Batch -``` - -### Batch Queue starts Batch - -```mermaid -graph TD - Batch[Batch: Processing] -``` - -### Batch downloads manifest and creates 3 jobs - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Pending] - Job2[Job 2: Pending] - Job3[Job 3: Pending] - Batch --> |job1_payload_url| Job1 - Batch --> |job2_payload_url| Job2 - Batch --> |job3_payload_url| Job3 -``` - -### Jobs Begin - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: Processing] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Job 2 Completes - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- |notify| Job2 - Batch --- Job3 -``` -### Job 3 fails -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 -``` - -### Job 1 completes -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- |notify| Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Batch Reports Job Status to Depositor -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Email(Email Status to Depositor) - Batch --> StatusReport - StatusReport --> Email -``` - -### Batch Goes to Failed state -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Job 3 is restarted -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Job 3 completes -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Admin changes batch state to UpdateReporting -```mermaid -graph TD - Batch[Batch: UpdateReporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Email is sent to depositor showing the status change for Job 3 -```mermaid -graph TD - Batch[Batch: UpdateReporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: COMPLETE] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --> StatusReport - StatusReport -.-> Batch - Email(Email Status to Depositor) - StatusReport --> Email -``` - -### Batch Status is COMPLETE -```mermaid -graph TD - Batch[Batch: COMPLETE] -``` - -## Use Case: Failed Batch with Unsuccessful Retry - - -### User submits manifest with 3 items - -```mermaid -graph TD - Manifest[/Batch Manifest/] - Ingest(Ingest Service) - Batch[Batch: Pending] - Manifest --> Ingest - Ingest --> Batch -``` - -### Batch Queue starts Batch - -```mermaid -graph TD - Batch[Batch: Processing] -``` - -### Batch downloads manifest and creates 3 jobs - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Pending] - Job2[Job 2: Pending] - Job3[Job 3: Pending] - Batch --> |job1_payload_url| Job1 - Batch --> |job2_payload_url| Job2 - Batch --> |job3_payload_url| Job3 -``` - -### Jobs Begin - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: Processing] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Job 2 Completes - -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- |notify| Job2 - Batch --- Job3 -``` -### Job 3 fails -```mermaid -graph TD - Batch[Batch: Processing] - Job1[Job 1: Processing] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 -``` - -### Job 1 completes -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- |notify| Job1 - Batch --- Job2 - Batch --- Job3 -``` - -### Batch Reports Job Status to Depositor -```mermaid -graph TD - Batch[Batch: Reporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Email(Email Status to Depositor) - Batch --> StatusReport - StatusReport --> Email -``` - -### Batch Goes to Failed state -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Job 3 is restarted -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Processing] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Job 3 fails again -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- |notify| Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Admin changes batch state to UpdateReporting -```mermaid -graph TD - Batch[Batch: UpdateReporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --- StatusReport -``` - -### Since there is no job state change since last report, no email is sent -```mermaid -graph TD - Batch[Batch: UpdateReporting] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --> StatusReport - StatusReport -.-> Batch -``` - -### Batch Status is Failed -```mermaid -graph TD - Batch[Batch: Failed] - Job1[Job 1: COMPLETE] - Job2[Job 2: COMPLETE] - Job3[Job 3: Failed] - Batch --- Job1 - Batch --- Job2 - Batch --- Job3 - StatusReport[/StatusReport/] - Batch --> StatusReport -``` - -### Admin Deletes Batch after determining that resubmission of failed job is not possible -```mermaid -graph TD - Batch[Batch: DELETED] -```