Commit 1d028e2

[wip]
1 parent 3b133c3 commit 1d028e2

3 files changed

+164
-5
lines changed

README.md

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -1,9 +1,10 @@
1 1
# dspace-submission-composer
2-
An application for creating messages for the [DSpace Submission Service application](https://github.com/MITLibraries/dspace-submission-service).
2+
DSpace Submission Composer (DSC) is a Python CLI application that prepares items for ingest into DSpace.
3 3

4-
# Application Description
4+
DSC is a component of the DSpace Submission Orchestrator (DSO), a collection of microservices that form a data pipeline for ingesting items into DSpace repositories. The application's name highlights a key step of the DSC workflow in which it "composes" and sends a message to an SQS queue. These messages follow the specification set by the [DSpace Submission Service (DSS)](https://github.com/MITLibraries/dspace-submission-service), another component of DSO. Together, DSC and DSS follow a message-driven architecture, communicating over message queues in SQS.
5 5

6-
Description of the app
6+
See additional documentation:
7+
* [Understanding and Running the DSC Workflow](docs/how_to_run.md)
7 8

8 9
## Development
9 10

docs/how_to_run.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
1+
# Understanding and Running the DSC Workflow
2+
3+
_This documentation describes the DSC workflow and how to run the application._
4+
5+
**DISCLAIMER**: While the CLI application is runnable on its own, the DSO Step Function offers a simplified user interface for running the full ETL pipeline. For more details on the DSO Step Function and how to use it, see https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4690542593/DSpace+Submission+Orchestrator+DSO.
6+
7+
## The DSC Workflow
8+
9+
The DSC workflow consists of the following key steps:
10+
11+
1. Create a batch
12+
2. Queue a batch for ingest
13+
3. Ingest items into DSpace
14+
4. Inspect ingest results
15+
16+
It's important to note that DSC is not responsible for ingesting items into DSpace; this task is handled by _DSS_. The DSC CLI provides commands for all other steps in the DSC workflow.
17+
18+
What the step function does with each key step....
19+
20+
### Create a batch
21+
DSC processes deposits in "batches"; a batch is a collection of item submissions grouped by a unique identifier (the batch ID). DSC requires that the item submission assets (metadata and bitstream files) are uploaded to a "folder" in S3 named after the batch ID. Some requestors upload the submission assets to S3 themselves; in other cases, these files need to be retrieved (via API requests) and uploaded during the batch creation step.
22+
23+
At the end of this step:
24+
* If all item submission assets are complete:
25+
- A batch folder with complete item submission assets exists in the DSO S3 bucket
26+
- Each item submission in the batch is recorded in DynamoDB (with `status="batch_created"`)
27+
- **[OPTIONAL]** An email is sent reporting the number of created item submissions. The email includes a CSV file with the batch records from DynamoDB.
28+
* If any item submission assets were invalid (missing metadata and/or bitstreams):
29+
- A batch folder with incomplete item submission assets exists in the DSO S3 bucket
30+
- **[OPTIONAL]** An email is sent reporting that zero item submissions were created. The email
31+
includes a CSV file describing the failing item submissions with the corresponding error message.
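The completeness check behind these two outcomes can be sketched as follows. This is a minimal illustration under stated assumptions, not DSC's implementation: the function name `find_incomplete_items`, its inputs, and the rule that a bitstream belongs to an item when its filename starts with the item identifier followed by an underscore are all hypothetical.

```python
def find_incomplete_items(
    metadata_ids: set[str], bitstream_keys: list[str]
) -> dict[str, str]:
    """Return an error message for every item identifier with missing assets.

    Assumption (hypothetical): a bitstream belongs to an item when its
    filename starts with the item identifier followed by an underscore.
    """
    errors: dict[str, str] = {}

    # Collect the identifier prefix carried by each bitstream filename.
    bitstream_ids = {key.split("/")[-1].split("_", 1)[0] for key in bitstream_keys}

    # Items that have metadata but no bitstreams.
    for item_id in sorted(metadata_ids - bitstream_ids):
        errors[item_id] = "missing bitstreams"

    # Bitstreams that have no matching metadata.
    for item_id in sorted(bitstream_ids - metadata_ids):
        errors[item_id] = "missing metadata"

    return errors


# Example data is made up for illustration.
errors = find_incomplete_items(
    metadata_ids={"item-001", "item-002"},
    bitstream_keys=["batch-aaa/item-001_thesis.pdf", "batch-aaa/item-003_data.csv"],
)
```

An empty result corresponds to the first outcome above (all assets complete); a non-empty result corresponds to the second, where the error messages would feed the CSV report.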
32+
33+
### Queue a batch for ingest
34+
DSC retrieves the batch records from DynamoDB, and for each item submission, it performs the following steps:
35+
* Determine whether the item submission should be sent to the DSS input queue
36+
* Map/transform the source metadata to follow the Dublin Core schema
37+
* Create and upload a metadata JSON file in the batch folder (under `dspace_metadata/`)
38+
* Send a message to the DSS input queue
39+
40+
Note: The message is structured in accordance with the [Submission Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/submission-message-specification.md).
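The metadata mapping step can be sketched as below. This is an illustration only: the `FIELD_MAPPING` table, the `to_dublin_core` helper, and the `{"metadata": [{"key": ..., "value": ...}]}` layout are assumptions made for the example. The actual mapping is defined per workflow, and the required file layout is governed by the DSS specification linked above.

```python
import json

# Hypothetical mapping from source field names to Dublin Core fields.
FIELD_MAPPING = {
    "title": "dc.title",
    "author": "dc.contributor.author",
    "date": "dc.date.issued",
}


def to_dublin_core(source: dict) -> list[dict[str, str]]:
    """Map source metadata to a flat list of Dublin Core key/value entries."""
    entries: list[dict[str, str]] = []
    for field, dc_key in FIELD_MAPPING.items():
        values = source.get(field)
        if values is None:
            continue  # Unmapped or absent fields are skipped.
        if isinstance(values, str):
            values = [values]  # Treat a single value as a one-element list.
        entries.extend({"key": dc_key, "value": value} for value in values)
    return entries


# Made-up source record for illustration.
source_metadata = {"title": "A Thesis", "author": ["Doe, Jane", "Roe, Richard"]}

# Serialized form that would be uploaded under `dspace_metadata/`.
metadata_json = json.dumps({"metadata": to_dublin_core(source_metadata)}, indent=2)
```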
41+
42+
At the end of this step:
43+
* Batch records in DynamoDB are updated. Updates are made to the following fields:
44+
- `status`: Indicates submit status
45+
- `status_details`: Set to error messages (if message failed to send)
46+
- `last_run_date`: Set to current run date
47+
- `submit_attempts`: Increments by 1
48+
* **[OPTIONAL]** An email is sent reporting the counts for each submission status. The email includes a CSV file with the batch records from DynamoDB, reflecting the latest information.
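The field updates listed above can be sketched with an in-memory record. This is a simplified stand-in, not the real DynamoDB model: the `ItemSubmissionRecord` class, the `record_submit_attempt` helper, and the status values `submit_success` and `submit_failed` are hypothetical. Only `batch_created` and the field names come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ItemSubmissionRecord:
    """In-memory stand-in for a batch record (not the real DynamoDB model)."""

    item_identifier: str
    status: str = "batch_created"
    status_details: Optional[str] = None
    last_run_date: Optional[str] = None
    submit_attempts: int = 0


def record_submit_attempt(
    record: ItemSubmissionRecord,
    run_date: datetime,
    error: Optional[str] = None,
) -> ItemSubmissionRecord:
    """Apply the post-submit field updates to a single record."""
    record.submit_attempts += 1
    record.last_run_date = run_date.isoformat()
    if error is None:
        record.status = "submit_success"  # hypothetical status value
        record.status_details = None
    else:
        record.status = "submit_failed"  # hypothetical status value
        record.status_details = error
    return record


run_date = datetime(2025, 1, 1, tzinfo=timezone.utc)
sent = record_submit_attempt(ItemSubmissionRecord("item-001"), run_date)
failed = record_submit_attempt(
    ItemSubmissionRecord("item-002"), run_date, error="queue unavailable"
)
```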
49+
50+
### Run DSS
51+
DSS consumes the submission messages from the input queue in SQS. DSS uses a client to interact with DSpace. For each item submission, DSS reads the metadata JSON file and bitstreams from S3, using the information provided in the message, and creates an item with bitstreams in DSpace.
52+
53+
At the end of this step:
54+
* Result messages are written to the output queue for DSC (`dss-output-dsc`).
55+
56+
Note: The message is structured in accordance with the [Result Message Specification](https://github.com/MITLibraries/dspace-submission-service/blob/main/docs/specifications/result-message-specification.md).
57+
58+
### Inspect ingest results
59+
DSC consumes result messages from its output queue, parsing the messages to determine whether the associated item was ingested into DSpace. It then loops through the batch records from DynamoDB, updating those that have a corresponding result message. Additional steps with the item submission may be performed on behalf of the requestor (e.g., custom reports).
60+
61+
At the end of this step:
62+
* Batch records in DynamoDB are updated. Updates are made to the following fields:
63+
- `dspace_handle`: Set to generated DSpace handle (if item was ingested)
64+
- `status`: Indicates ingest status
65+
- `status_details`: Set to error messages (if item failed ingest)
66+
- `last_result_message`: Set to result message string
67+
- `last_run_date`: Set to current run date
68+
- `ingest_attempts`: Increments by 1
69+
* Result messages are deleted from the queue
69+
- If any errors occur during the processing of a result message, the result message will remain in the queue.
70+
* An email is sent reporting the counts for each ingest status. The email includes a CSV file with the batch records from DynamoDB, reflecting the latest information.
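The reconciliation of result messages with batch records can be sketched as below. The message layout used here (`item_identifier`, `ingested`, `handle`, `error`) is a made-up simplification; the real layout is defined by the Result Message Specification. The helper name, the status values `ingest_success` and `ingest_failed`, and the sample handle are likewise assumptions.

```python
import json


def apply_result_messages(
    records: dict[str, dict], messages: list[dict], run_date: str
) -> list[dict]:
    """Update records from result messages; return messages left in the queue."""
    remaining: list[dict] = []
    for message in messages:
        try:
            # Read everything needed before mutating, so a malformed
            # message never leaves a record half-updated.
            item_id = message["item_identifier"]
            ingested = message["ingested"]
            record = records[item_id]
        except KeyError:
            remaining.append(message)  # Processing failed: keep in queue.
            continue

        record["ingest_attempts"] = record.get("ingest_attempts", 0) + 1
        record["last_result_message"] = json.dumps(message, sort_keys=True)
        record["last_run_date"] = run_date
        if ingested:
            record["status"] = "ingest_success"  # hypothetical status value
            record["dspace_handle"] = message.get("handle")
            record["status_details"] = None
        else:
            record["status"] = "ingest_failed"  # hypothetical status value
            record["status_details"] = message.get("error")
    return remaining


# Made-up records and messages for illustration.
records = {"item-001": {}, "item-002": {"ingest_attempts": 1}}
messages = [
    {"item_identifier": "item-001", "ingested": True, "handle": "123456789/1"},
    {"item_identifier": "item-002", "ingested": False, "error": "timeout"},
    {"unexpected": "payload"},
]
remaining = apply_result_messages(records, messages, run_date="2025-01-01")
```

Messages that cannot be processed are returned rather than dropped, mirroring the rule that a result message stays in the queue when an error occurs.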
72+
73+
## The DSC CLI
74+
75+
### `pipenv run dsc`
76+
77+
```text
78+
Usage: -c [OPTIONS] COMMAND [ARGS]...
79+
80+
DSC CLI.
81+
82+
Options:
83+
-w, --workflow-name TEXT The workflow to use for the batch of DSpace
84+
submissions [required]
85+
-b, --batch-id TEXT A unique identifier for the workflow run, also
86+
used as an S3 prefix for workflow run files
87+
[required]
88+
-v, --verbose Pass to log at debug level instead of info
89+
--help Show this message and exit.
90+
91+
Commands:
92+
create Create a batch of item submissions.
93+
finalize Process the result messages from the DSS output queue...
94+
reconcile Reconcile bitstreams with item identifiers from the metadata.
95+
submit Send a batch of item submissions to the DSpace Submission...
96+
sync Sync data between two directories using the aws s3 sync...
97+
```
98+
99+
### `pipenv run dsc -w <workflow-name> -b <batch-id> create`
100+
101+
```text
102+
Usage: -c create [OPTIONS]
103+
104+
Create a batch of item submissions.
105+
106+
Options:
107+
--sync-data / --no-sync-data
108+
--sync-dry-run Display the operations that would be performed
109+
using the sync command without actually
110+
running them
111+
-s, --sync-source TEXT Source directory formatted as a local
112+
filesystem path or an S3 URI in
113+
s3://bucket/prefix form
114+
-d, --sync-destination TEXT Destination directory formatted as a local
115+
filesystem path or an S3 URI in
116+
s3://bucket/prefix form
117+
-e, --email-recipients TEXT The recipients of the batch creation results
118+
email as a comma-delimited string
119+
--help Show this message and exit.
120+
```
121+
122+
### `pipenv run dsc -w <workflow-name> -b <batch-id> submit`
123+
124+
```text
125+
Usage: -c submit [OPTIONS]
126+
127+
Send a batch of item submissions to DSS.
128+
129+
Options:
130+
-c, --collection-handle TEXT The handle of the DSpace collection to which
131+
the batch will be submitted [required]
132+
-e, --email-recipients TEXT The recipients of the submission results email
133+
as a comma-delimited string
134+
--help Show this message and exit.
135+
```
136+
137+
### `pipenv run dsc -w <workflow-name> -b <batch-id> finalize`
138+
139+
```text
140+
Usage: -c finalize [OPTIONS]
141+
142+
Process the result messages from the DSC output queue.
143+
144+
Options:
145+
-e, --email-recipients TEXT The recipients of the submission results email
146+
as a comma-delimited string [required]
147+
--help Show this message and exit.
148+
```
149+
150+
CLI commands are defined to mirror the names of the workflow steps (with the exception of running DSS). The next sections cover each step in more detail.
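The order in which the commands are invoked for a single batch can be sketched as below. The helper `build_dsc_command` is hypothetical, and the workflow name, batch ID, collection handle, and email address are placeholders. Only the flags shown in the help output above are used, and the commands are built rather than executed.

```python
import shlex


def build_dsc_command(
    workflow_name: str, batch_id: str, command: str, *args: str
) -> list[str]:
    """Build the argument list for one DSC CLI invocation."""
    return ["pipenv", "run", "dsc", "-w", workflow_name, "-b", batch_id, command, *args]


# Placeholder values; substitute real ones for an actual run.
commands = [
    build_dsc_command("my-workflow", "batch-aaa", "create"),
    build_dsc_command("my-workflow", "batch-aaa", "submit", "-c", "123456789/1"),
    # DSS ingests the items between `submit` and `finalize`.
    build_dsc_command("my-workflow", "batch-aaa", "finalize", "-e", "user@example.com"),
]

for cmd in commands:
    # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```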
151+
152+
prepare items for DSpace...
153+
submit items into DSpace......
154+
creating submission packages.....
155+
DSS ingests the SIPs...
156+
157+
While the CLI is the main entry point for DSC, the workflow modules handle the core functionality invoked by the CLI.

dsc/cli.py

Lines changed: 3 additions & 2 deletions
@@ -41,6 +41,7 @@ def main(
41 41
    batch_id: str,
42 42
    verbose: bool,  # noqa: FBT001
43 43
) -> None:
44+
    """DSC CLI."""
44 45
    ctx.ensure_object(dict)
45 46
    ctx.obj["start_time"] = perf_counter()
46 47
    workflow_class = Workflow.get_workflow(workflow_name)
@@ -288,7 +289,7 @@ def submit(
288 289
    collection_handle: str,
289 290
    email_recipients: str | None = None,
290 291
) -> None:
291-
    """Send a batch of item submissions to the DSpace Submission Service (DSS)."""
292+
    """Send a batch of item submissions to DSS."""
292 293
    workflow = ctx.obj["workflow"]
293 294
    workflow.submit_items(collection_handle)
294 295

@@ -308,7 +309,7 @@ def submit(
308 309
    required=True,
309 310
)
310 311
def finalize(ctx: click.Context, email_recipients: str) -> None:
311-
    """Process the result messages from the DSS output queue according the workflow."""
312+
    """Process the result messages from the DSC output queue."""
312 313
    workflow = ctx.obj["workflow"]
313 314
    workflow.finalize_items()
314 315
    workflow.send_report(
