| 1 | +# Understanding and Running the DSC Workflow |
| 2 | + |
| 3 | +_This documentation describes the DSC workflow and how to run the application._ |
| 4 | + |
| 5 | +**DISCLAIMER**: While the CLI application is runnable on its own, the DSO Step Function offers a simplified user interface for running the full ETL pipeline. For more details on the DSO Step Function and how to use it, see https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4690542593/DSpace+Submission+Orchestrator+DSO. |
| 6 | + |
| 7 | +## The DSC Workflow |
| 8 | + |
| 9 | +The DSC workflow consists of the following key steps: |
| 10 | + |
| 11 | +1. Create a batch |
| 12 | +2. Queue a batch for ingest |
| 13 | +3. Ingest items into DSpace (run DSS) |
| 14 | +4. Inspect ingest results |
| 15 | + |
| 16 | +Note that DSC does not ingest items into DSpace itself; that step is handled by _DSS_. The DSC CLI provides commands for all other steps in the DSC workflow. |
| 17 | + |
| 18 | +### Create a batch |
| 19 | +DSC processes deposits in "batches": collections of item submissions grouped by a unique identifier (the batch ID). DSC requires that the item submission assets (metadata and bitstream files) are uploaded to a "folder" in S3 named after the batch ID. Some requestors upload the submission assets to S3 themselves; in other cases, the files must be retrieved (via API requests) and uploaded during the batch creation step. |
| 20 | + |
| 21 | +At the end of this step: |
| 22 | +* If all item submission assets are complete: |
| 23 | + - A batch folder with complete item submission assets exists in the DSO S3 bucket |
| 24 | + - Each item submission in the batch is recorded in DynamoDB (with `status="batch_created"`) |
| 25 | + - **[OPTIONAL]** An email is sent reporting the number of created item submissions. The email includes a CSV file with the batch records from DynamoDB. |
| 26 | +* If any item submission assets were invalid (missing metadata and/or bitstreams): |
| 27 | + - A batch folder with incomplete item submission assets exists in the DSO S3 bucket |
| 28 | + - **[OPTIONAL]** An email is sent reporting that zero item submissions were created. The email |
| 29 | + includes a CSV file describing the failing item submissions with the corresponding error message. |
| 30 | + |
| 31 | +**Data syncing** |
| 32 | + |
| 33 | +✨ If the batch folder was already created elsewhere (e.g., in an S3 bucket in a different deployment environment), DSC can sync the data and avoid repeating the data retrieval steps. |
| 34 | + |
| 35 | +### Queue a batch for ingest |
| 36 | +DSC retrieves the batch records from DynamoDB, and for each item submission, it performs the following steps: |
| 37 | +* Determine whether the item submission should be sent to the DSS input queue |
| 38 | +* Map/transform the source metadata to follow the Dublin Core schema |
| 39 | +* Create and upload a metadata JSON file in the batch folder (under `dspace_metadata/`) |
| 40 | +* Send a message to the DSS input queue |
| 41 | + |
| 42 | + Note: The message is structured in accordance with the [Submission Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/submission-message-specification.md). |
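
The metadata mapping step above can be illustrated with a minimal sketch. The source field names, the `FIELD_MAPPING` table, and the output shape (`{"metadata": [{"key": ..., "value": ...}]}`) are all assumptions made for illustration; the actual mapping is defined per workflow and may look quite different.

```python
# Hypothetical mapping from source field names to Dublin Core keys.
# Real workflows define their own mappings; these names are illustrative only.
FIELD_MAPPING = {
    "title": "dc.title",
    "author": "dc.contributor.author",
    "date": "dc.date.issued",
}


def to_dublin_core(source_metadata: dict) -> dict:
    """Map source metadata to a list of Dublin Core key/value entries."""
    entries = []
    for source_field, dc_key in FIELD_MAPPING.items():
        value = source_metadata.get(source_field)
        if value is None:
            continue  # skip fields that are absent from the source record
        # Repeatable fields (e.g., several authors) become one entry per value.
        values = value if isinstance(value, list) else [value]
        entries.extend({"key": dc_key, "value": v} for v in values)
    return {"metadata": entries}
```

Source fields without an entry in the mapping are dropped, which keeps the generated file limited to fields the schema recognizes.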
| 43 | + |
| 44 | +At the end of this step: |
| 45 | +* Batch records in DynamoDB are updated. Updates are made to the following fields: |
| 46 | + - `status`: Indicates submit status |
| 47 | + - `status_details`: Set to error messages (if message failed to send) |
| 48 | + - `last_run_date`: Set to current run date |
| 49 | + - `submit_attempts`: Increments by 1 |
| 50 | +* **[OPTIONAL]** An email is sent reporting the counts for each submission status. The email includes a CSV file with the batch records from DynamoDB, reflecting the latest information. |
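
As a rough sketch of the record updates listed above, the function below applies them to a plain dictionary standing in for a DynamoDB record. The field names come from the list; the status values `"submit_success"` and `"submit_failed"` and the helper name `record_submit_attempt` are hypothetical and may not match the values DSC actually writes.

```python
from datetime import datetime, timezone
from typing import Optional


def record_submit_attempt(record: dict, error: Optional[str] = None) -> dict:
    """Return a copy of a batch record updated after one submit attempt."""
    updated = dict(record)
    # Hypothetical status values; only "batch_created" appears in this document.
    updated["status"] = "submit_failed" if error else "submit_success"
    # Error message if the message failed to send, otherwise empty.
    updated["status_details"] = error
    updated["last_run_date"] = datetime.now(timezone.utc).isoformat()
    updated["submit_attempts"] = record.get("submit_attempts", 0) + 1
    return updated
```

Returning a copy rather than mutating the input keeps the previous state available for comparison or reporting.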
| 51 | + |
| 52 | +### Run DSS |
| 53 | +DSS consumes the submission messages from the input queue in SQS. DSS uses a client to interact with DSpace. For each item submission, DSS reads the metadata JSON file and bitstreams from S3, using the information provided in the message, and creates an item with bitstreams in DSpace. |
| 54 | + |
| 55 | +At the end of this step: |
| 56 | +* Result messages are written to the output queue for DSC (`dss-output-dsc`). |
| 57 | + |
| 58 | + Note: The message is structured in accordance with the [Result Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/result-message-specification.md). |
| 59 | + |
| 60 | +### Inspect ingest results |
| 61 | +DSC consumes result messages from its output queue, parsing the messages to determine whether the associated item was ingested into DSpace. It then loops through the batch records from DynamoDB, updating those that have a corresponding result message. Additional steps with the item submission may be performed on behalf of the requestor (e.g., custom reports). |
| 62 | + |
| 63 | +At the end of this step: |
| 64 | +* Batch records in DynamoDB are updated. Updates are made to the following fields: |
| 65 | + - `dspace_handle`: Set to generated DSpace handle (if item was ingested) |
| 66 | + - `status`: Indicates ingest status |
| 67 | + - `status_details`: Set to error messages (if item failed ingest) |
| 68 | + - `last_result_message`: Set to result message string |
| 69 | + - `last_run_date`: Set to current run date |
| 70 | + - `ingest_attempts`: Increments by 1 |
| 71 | +* Result messages are deleted from the queue |
| 72 | +  - If any errors occur while processing a result message, that message remains in the queue. |
| 73 | +* An email is sent reporting the counts for each ingest status. The email includes a CSV file with the batch records from DynamoDB, reflecting the latest information. |
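
The matching of result messages to batch records can be sketched as follows. This is an illustration under stated assumptions, not the DSC implementation: the simplified result shape (`item_identifier`, `result_type`, `handle`, `error`), the status values, and the function name `apply_results` are all hypothetical. The real message layout is defined by the Result Message Specification.

```python
import json
from datetime import datetime, timezone


def apply_results(records: list, results: list) -> list:
    """Return updated copies of records that have a matching result message."""
    # Index the (hypothetical, simplified) result messages by item identifier.
    results_by_id = {result["item_identifier"]: result for result in results}
    run_date = datetime.now(timezone.utc).isoformat()
    updated_records = []
    for record in records:
        updated = dict(record)
        result = results_by_id.get(record["item_identifier"])
        if result is not None:
            ingested = result["result_type"] == "success"
            # Hypothetical status values, for illustration only.
            updated["status"] = "ingest_success" if ingested else "ingest_failed"
            updated["dspace_handle"] = result.get("handle") if ingested else None
            updated["status_details"] = None if ingested else result.get("error")
            updated["last_result_message"] = json.dumps(result, sort_keys=True)
            updated["last_run_date"] = run_date
            updated["ingest_attempts"] = record.get("ingest_attempts", 0) + 1
        updated_records.append(updated)
    return updated_records
```

Records without a corresponding result message are returned unchanged, so they can be picked up again on a later run.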
| 74 | + |
| 75 | +## The DSC CLI |
| 76 | + |
| 77 | +### `pipenv run dsc` |
| 78 | + |
| 79 | +```text |
| 80 | +Usage: -c [OPTIONS] COMMAND [ARGS]... |
| 81 | +
|
| 82 | + DSC CLI. |
| 83 | +
|
| 84 | +Options: |
| 85 | + -w, --workflow-name TEXT The workflow to use for the batch of DSpace |
| 86 | + submissions [required] |
| 87 | + -b, --batch-id TEXT A unique identifier for the workflow run, also |
| 88 | + used as an S3 prefix for workflow run files |
| 89 | + [required] |
| 90 | + -v, --verbose Pass to log at debug level instead of info |
| 91 | + --help Show this message and exit. |
| 92 | +
|
| 93 | +Commands: |
| 94 | + create Create a batch of item submissions. |
| 95 | + finalize Process the result messages from the DSS output queue... |
| 96 | + reconcile Reconcile bitstreams with item identifiers from the metadata. |
| 97 | + submit Send a batch of item submissions to the DSpace Submission... |
| 98 | + sync Sync data between two directories using the aws s3 sync... |
| 99 | +``` |
| 100 | + |
| 101 | +### `pipenv run dsc -w <workflow-name> -b <batch-id> create` |
| 102 | + |
| 103 | +```text |
| 104 | +Usage: -c create [OPTIONS] |
| 105 | +
|
| 106 | + Create a batch of item submissions. |
| 107 | +
|
| 108 | +Options: |
| 109 | + --sync-data / --no-sync-data Invoke 'sync' CLI command. |
| 110 | + --sync-dry-run Display the operations that would be performed |
| 111 | + using the sync command without actually |
| 112 | + running them |
| 113 | + -s, --sync-source TEXT Source directory formatted as a local |
| 114 | + filesystem path or an S3 URI in |
| 115 | + s3://bucket/prefix form |
| 116 | + -d, --sync-destination TEXT Destination directory formatted as a local |
| 117 | + filesystem path or an S3 URI in |
| 118 | + s3://bucket/prefix form |
| 119 | + -e, --email-recipients TEXT The recipients of the batch creation results |
| 120 | + email as a comma-delimited string |
| 121 | + --help Show this message and exit. |
| 122 | +``` |
| 123 | + |
| 124 | +**Important:** If the boolean flag `--sync-data` is set, the `sync` CLI command is invoked, which executes a basic [`aws s3 sync`](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) command. |
| 125 | + |
| 126 | +### `pipenv run dsc -w <workflow-name> -b <batch-id> submit` |
| 127 | + |
| 128 | +```text |
| 129 | +Usage: -c submit [OPTIONS] |
| 130 | +
|
| 131 | + Send a batch of item submissions to DSS. |
| 132 | +
|
| 133 | +Options: |
| 134 | + -c, --collection-handle TEXT The handle of the DSpace collection to which |
| 135 | + the batch will be submitted [required] |
| 136 | + -e, --email-recipients TEXT The recipients of the submission results email |
| 137 | + as a comma-delimited string |
| 138 | + --help Show this message and exit. |
| 139 | +``` |
| 140 | + |
| 141 | +### `pipenv run dsc -w <workflow-name> -b <batch-id> finalize` |
| 142 | + |
| 143 | +```text |
| 144 | +Usage: -c finalize [OPTIONS] |
| 145 | +
|
| 146 | + Process the result messages from the DSC output queue. |
| 147 | +
|
| 148 | +Options: |
| 149 | + -e, --email-recipients TEXT The recipients of the submission results email |
| 150 | + as a comma-delimited string [required] |
| 151 | + --help Show this message and exit. |
| 152 | +``` |
| 153 | + |
| 154 | +### `pipenv run dsc -w <workflow-name> -b <batch-id> sync` |
| 155 | + |
| 156 | +```text |
| 157 | +Usage: -c sync [OPTIONS] |
| 158 | +
|
| 159 | + Sync data between two directories using the aws s3 sync command. |
| 160 | +
|
| 161 | + If 'source' and 'destination' are not provided, the method will derive |
| 162 | + values based on the required '--batch-id / -b' and 'workflow-name / -w' |
| 163 | + options and S3 bucket env vars: |
| 164 | +
|
| 165 | + * source: batch path in S3_BUCKET_SYNC_SOURCE |
| 166 | +
|
| 167 | + * destination: batch path in S3_BUCKET_SUBMISSION_ASSETS |
| 168 | +
|
| 169 | + This command accepts both local file system paths and S3 URIs in |
| 170 | + s3://bucket/prefix form. It synchronizes the contents of the source |
| 171 | + directory to the destination directory, and is configured to delete files in |
| 172 | + the destination that are not present in the source exclude files in the |
| 173 | + dspace_metadata/ directory. |
| 174 | +
|
| 175 | + Although the aws s3 sync command recursively copies files, it ignores empty |
| 176 | + directories from the sync. |
| 177 | +
|
| 178 | +Options: |
| 179 | + -s, --source TEXT Source directory formatted as a local filesystem |
| 180 | + path or an S3 URI in s3://bucket/prefix form |
| 181 | + -d, --destination TEXT Destination directory formatted as a local |
| 182 | + filesystem path or an S3 URI in s3://bucket/prefix |
| 183 | + form |
| 184 | + --dry-run Display the operations that would be performed using |
| 185 | + the sync command without actually running them |
| 186 | + --help Show this message and exit. |
| 187 | +``` |
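
The help text above describes how `sync` falls back to environment variables and which files it protects from deletion. A minimal sketch of that behavior is shown below, assuming the batch path takes the form `<workflow-name>/<batch-id>/` (an assumption; the real layout may differ). The function name `build_sync_command` is hypothetical. The `--delete`, `--exclude`, and `--dryrun` flags are standard `aws s3 sync` options.

```python
import os
from typing import Optional


def build_sync_command(
    workflow_name: str,
    batch_id: str,
    source: Optional[str] = None,
    destination: Optional[str] = None,
    dry_run: bool = False,
    env: Optional[dict] = None,
) -> list:
    """Build an `aws s3 sync` argument list for a batch folder."""
    env = os.environ if env is None else env
    # Assumed batch path layout; adjust to match the real bucket structure.
    batch_path = f"{workflow_name}/{batch_id}/"
    source = source or f"s3://{env['S3_BUCKET_SYNC_SOURCE']}/{batch_path}"
    destination = (
        destination or f"s3://{env['S3_BUCKET_SUBMISSION_ASSETS']}/{batch_path}"
    )
    command = [
        "aws", "s3", "sync", source, destination,
        "--delete",                       # remove destination files missing from source
        "--exclude", "dspace_metadata/*", # leave generated metadata files untouched
    ]
    if dry_run:
        command.append("--dryrun")
    return command
```

The resulting list could be passed to `subprocess.run(command, check=True)`; building it separately makes the command easy to inspect before anything is executed.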