Build an ingestion system (Open)
Description
The https://github.com/developmentseed/eoapi-k8s project doesn't include an ingestor out of the box, so we'll need to make one or leverage an existing one from another project.
Activity
ceholden changed the title from "Design an ingestion system" to "Build an ingestion system"
ceholden commented on Apr 16, 2025
Background
The current OAM upload process involves,

1. Users upload new data to the HOT OAM catalog using a dataset upload page on the OAM Browser website (docs), implemented as a React component (source). This upload form supports data sources from local files, URLs, or from Dropbox. The frontend asks the backend API for a presigned URL (`/uploads/url`) that permits end users to upload their data to AWS S3 for storage. Once the image is uploaded to S3, the form passes metadata to the OAM data catalog through the catalog API's `POST /uploads` endpoint.
2. The catalog API `POST /uploads` endpoint records the upload request in the `"uploads"` collection and each uploaded image in the `"images"` collection, then kicks off the transcode worker.
3. The transcode worker processes the uploaded imagery and reports the result back to the OAM API through a callback endpoint (`/uploads/:id/:sceneIdx/:imageId`).
4. Upon receiving this callback the OAM API updates processing information about the uploaded image in the `"images"` collection so users can know their upload has finished processing. This callback also records the details in the OAM catalog in the `"meta"` collection, thereby making the upload available in the OAM catalog through the `/meta` endpoint.
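To make the flow concrete, here is a minimal client-side sketch of steps 1 and 2 in Python. The endpoint paths come from the description above, but the HTTP methods, auth scheme, and payload/response field names are assumptions about the OAM API rather than its confirmed contract.

```python
# Hypothetical sketch of the upload flow described above; only the
# endpoint paths come from the text, everything else is assumed.
import requests

OAM_API = "https://api.openaerialmap.org"  # assumed base URL


def upload_image(path: str, metadata: dict, token: str) -> dict:
    headers = {"Authorization": f"Bearer {token}"}  # auth scheme assumed

    # 1. Ask the OAM API for a presigned S3 URL.
    resp = requests.get(f"{OAM_API}/uploads/url", headers=headers)
    resp.raise_for_status()
    presigned_url = resp.json()["results"]["url"]  # response shape assumed

    # 2. Upload the image bytes directly to S3 using the presigned URL.
    with open(path, "rb") as src:
        requests.put(presigned_url, data=src).raise_for_status()

    # 3. Register the upload's metadata with the catalog API.
    resp = requests.post(f"{OAM_API}/uploads", json=metadata, headers=headers)
    resp.raise_for_status()
    return resp.json()
```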
Connection to STAC API
For a STAC ingestion system, we have two related concerns to think about,
1. For the translation, the OAM metadata specification uses a format based on the Open Imagery Network (OIN) metadata specification. It's very common to have a custom metadata format, and we have tools to translate this OAM specification into STAC (https://github.com/hotosm/stactools-hotosm). We could potentially run this translation as part of the STAC ingestion system, or we could translate the OAM metadata into STAC as part of the system that invokes the ingestion.
2. For making the connection between the OAM Browser upload page and our STAC system, there are generally two main patterns,
   - Synchronous ingestion, in which the STAC API inserts the Item while handling the request (e.g., the STAC API Transaction extension)
   - Asynchronous ingestion, in which Items are enqueued for later batch insertion; e.g., `eoapi-cdk` (link) exposes an API to enqueue ingestion of STAC Items into PgSTAC

One of the primary benefits of the asynchronous ingestion pathway is the potential for higher ingestion throughput by performing batch inserts and pre-optimizing inserts to align with database table partitions. The primary benefit of synchronous ingestion is the immediate feedback on whether or not the STAC Item was successfully ingested (sketched below).
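To make the synchronous pattern concrete, here is a minimal sketch that translates an OAM/OIN metadata record and POSTs the resulting Item to a Transaction-extension endpoint. The `create_item` helper name is an assumption about `stactools-hotosm` (a `create_item`-style function is the stactools convention), and the STAC API URL and collection id are hypothetical.

```python
# Minimal sketch of synchronous ingestion, assuming the STAC API has the
# Transaction extension enabled. The stactools-hotosm function name and
# the collection id are assumptions, not confirmed APIs.
import requests

from stactools.hotosm import stac as hotosm_stac  # module path assumed

STAC_API = "https://stac.example.com"  # hypothetical STAC API endpoint
COLLECTION_ID = "openaerialmap"        # hypothetical collection id


def ingest_synchronously(oin_metadata: dict) -> None:
    # Translate the OIN-formatted record into a pystac Item.
    item = hotosm_stac.create_item(oin_metadata)  # function name assumed

    # POST the Item via the Transaction extension; a non-2xx response
    # gives the caller immediate feedback that ingestion failed.
    resp = requests.post(
        f"{STAC_API}/collections/{COLLECTION_ID}/items",
        json=item.to_dict(),
    )
    resp.raise_for_status()
```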
Proposal
Given the existing OAM catalog setup and design considerations for STAC ingestion of OAM metadata, I propose that we,

1. Add a custom extension to our STAC API that can translate (using the `stactools-hotosm` package) and insert/upsert OIN formatted metadata (e.g., `POST /ingest/oam`)
2. Add a call in the OAM API's `/uploads/:id/:sceneIdx/:imageId` endpoint to the STAC API to perform STAC ingestion.

If we follow this proposed idea, the imagery upload process would look like,

[diagram of the proposed upload process]
Implications
If we want to implement this proposal we would need several sets of steps,

1. Write a custom extension for our STAC API (e.g., an `OamIngestionExtension`). This extension will glue `stactools-hotosm` and `stac-fastapi-pgstac` to translate and insert/upsert OAM metadata records into our STAC catalog (see the sketch after this list).
2. Deploy a build of `stac-fastapi-pgstac` that includes this `OamIngestionExtension`.
3. Update the OAM API's transcode callback to invoke the `OamIngestionExtension` ingestion endpoint.
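As a rough illustration of step 1, here is a minimal sketch of what an `OamIngestionExtension` could look like, assuming it boils down to a FastAPI router attached to the `stac-fastapi-pgstac` application, that `stactools-hotosm` exposes a `create_item`-style helper, and that `pypgstac` handles the upsert. All module and function names below are assumptions, not confirmed APIs.

```python
# Hypothetical sketch of the proposed OamIngestionExtension; the
# stactools-hotosm helper name and pgSTAC wiring are assumptions.
from fastapi import APIRouter, FastAPI, HTTPException
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

from stactools.hotosm import stac as hotosm_stac  # module path assumed

router = APIRouter()


@router.post("/ingest/oam", status_code=201)
def ingest_oam(oin_metadata: dict) -> dict:
    # Translate the OIN-formatted record into a STAC Item.
    try:
        item = hotosm_stac.create_item(oin_metadata)  # name assumed
    except Exception as exc:
        raise HTTPException(status_code=422, detail=f"Invalid OAM metadata: {exc}")

    # Upsert the Item into pgSTAC. PgstacDB reads connection settings
    # from the standard PG* environment variables.
    with PgstacDB() as db:
        Loader(db=db).load_items(iter([item.to_dict()]), insert_mode=Methods.upsert)
    return item.to_dict()


def register(app: FastAPI) -> None:
    """Attach the OAM ingestion route to a stac-fastapi-pgstac app."""
    app.include_router(router, tags=["OAM ingestion"])
```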
Alternatives Considered
Respond to OAM metadata JSON uploads to S3
One alternate idea that might be nice from a decoupling perspective (i.e., no code change in OAM API) would be to,

1. Use the `_meta.json` that is uploaded to S3 when metadata is added to the Mongo based catalog
2. Subscribe a "translate" worker to `s3:ObjectCreated:*` events on this bucket with the suffix pattern `_meta.json`
3. Translate the metadata to STAC and run the `pypgstac` loader directly from this "translate" worker (see the sketch after this list)

I also like this but it seemed like more work for something that would be largely thrown away when the entire catalog is based off STAC. It'd be nice to avoid modifying OAM API, but this decoupling means we couldn't leverage the status tracking built into the OAM API.
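A minimal sketch of such a worker, written as an AWS Lambda-style handler. The bucket event wiring, the OAM metadata shape, and the `stactools-hotosm` helper name are all assumptions.

```python
# Hypothetical S3-event worker: on s3:ObjectCreated:* for *_meta.json,
# translate the OAM metadata to STAC and upsert it into pgSTAC.
import json

import boto3
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

from stactools.hotosm import stac as hotosm_stac  # module path assumed

s3 = boto3.client("s3")


def handler(event: dict, context: object) -> None:
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch and parse the newly created _meta.json object.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        oin_metadata = json.loads(body)

        # Translate to STAC and upsert into pgSTAC.
        item = hotosm_stac.create_item(oin_metadata)  # name assumed
        with PgstacDB() as db:
            Loader(db=db).load_items(
                iter([item.to_dict()]), insert_mode=Methods.upsert
            )
```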
Resynchronize the entire OAM catalog into PgSTAC
Largely included for completeness, the minimal effort approach to keeping the existing OAM catalog in sync with the STAC based catalog would be to reingest everything. This could be done as a scheduled task or manually (e.g., using this notebook). This would be simple but has relatively high latency and involves the most compute work, although given the small size of the OAM catalog that might not be a big deal.
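For illustration, a minimal resync sketch that pages through the OAM catalog's `/meta` endpoint and upserts everything into pgSTAC. The `/meta` response shape (`results` list) and paging parameters are assumptions about the OAM API, and the `stactools-hotosm` helper name is assumed as before.

```python
# Hypothetical full-catalog resync: page through /meta and upsert all
# records into pgSTAC. Response shape and paging parameters are assumed.
import requests
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

from stactools.hotosm import stac as hotosm_stac  # module path assumed

OAM_META = "https://api.openaerialmap.org/meta"  # assumed base URL


def resync(page_size: int = 100) -> None:
    with PgstacDB() as db:
        loader = Loader(db=db)
        page = 1
        while True:
            resp = requests.get(OAM_META, params={"limit": page_size, "page": page})
            resp.raise_for_status()
            results = resp.json().get("results", [])  # response shape assumed
            if not results:
                break
            # Translate each OAM record to a STAC Item and batch-upsert.
            items = (hotosm_stac.create_item(r).to_dict() for r in results)
            loader.load_items(items, insert_mode=Methods.upsert)
            page += 1
```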
Future Directions
It seems reasonable that OAM would want to consolidate all of the data ingestion process to use this new catalog instead of relying on a hybrid of OAM API and STAC API. The missing components to this process are,

- `rio-cogeo` for COG creation,
- `rio-tiler` for generating thumbnails, and
- `stactools-hotosm` for generating rich metadata.
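As a taste of what that future Python transcoding path could look like, here is a minimal COG-plus-thumbnail sketch using `rio-cogeo` and `rio-tiler`. These library calls are real, but treating them as the future OAM transcoder is this issue's speculation, not an existing implementation.

```python
# Sketch of a future Python transcode step: create a COG with rio-cogeo
# and a small PNG thumbnail with rio-tiler.
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles
from rio_tiler.io import Reader


def transcode(src_path: str, cog_path: str, thumb_path: str) -> None:
    # Convert the source image to a Cloud Optimized GeoTIFF.
    cog_translate(src_path, cog_path, cog_profiles.get("deflate"))

    # Render a small preview image from the new COG.
    with Reader(cog_path) as cog:
        img = cog.preview(max_size=256)
        with open(thumb_path, "wb") as dst:
            dst.write(img.render(img_format="PNG"))
```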
gadomski commented on Apr 16, 2025
👍🏼
If the existing metadata generation were in Python I'd want to push it farther left (in your diagram) but since it's not, I think this design feels correct.
A+ description, btw
aliziel commented on Apr 17, 2025
Correct me if I'm misunderstanding, but this sounds like a dual-write situation (between Mongo and pgSTAC) so just want to clarify how we maintain pgSTAC consistency on failure.
Earlier we decided on uploading through Mongo and serving through pgSTAC. If a failure occurs in Mongo or during the S3 upload, that would constitute a failed write and should throw. If a failure occurs on write to pgSTAC though, should we use Mongo as a fallback for lookups and translate on the fly? I don't think we need to orchestrate distributed transactions or streaming CDC or anything, but I'm wondering if we can make retries sufficiently robust to handle failures at any of the multiple upstream writes without compromising too much on data integrity. Or, at risk of blending together the a/synchronous patterns, handle failures/resyncing in a separate process so it's all easier to unplug + rewire later.
It's been a while since I've used Mongo, so I may be missing something there as well (not accounting for S3).
gadomski commented on Apr 17, 2025
I don't think so, for two reasons:
Given both of those reasons, I think a regular re-index to bring mongo and pgstac into sync should be Good Enough™ to solve any data inconsistencies.
ceholden commented on Apr 17, 2025
+1 to what Pete said, also considering,
There's also a dual write scenario already (writing to Mongo then S3), but I added an "Alternatives Considered" section, which might be nice from a decoupling perspective, to the original comment, #182 (comment)
spwoodcock commented on Apr 18, 2025
Thanks for the great writeup! 🎉
There are many moving pieces here - I haven't had a chance to fully dig into the whole pgSTAC plan, as I have been buried in some work for FieldTM & DroneTM.
Could you help me understand the "Resynchronize the entire OAM catalog into PgSTAC" option please?
As for the two other options proposed, we can dig into them more if the option above isn't possible, but will comment briefly 👍
Respond to OAM metadata JSON uploads to S3
OAM-API --> pgSTAC