
Build an ingestion system #182

Open

Description

@gadomski

https://github.com/developmentseed/eoapi-k8s doesn't include an ingestor out of the box, so we'll need to make one or leverage an existing one from another project.

Activity

Changed the title from "Design an ingestion system" to "Build an ingestion system" on Apr 9, 2025
ceholden commented on Apr 16, 2025

Background

The current OAM upload process involves,

Users upload new data to the HOT OAM catalog using the dataset upload page on the OAM Browser website (docs), implemented as a React component (source). The upload form accepts data from local files, URLs, or Dropbox. The frontend asks the backend API for a presigned URL (/uploads/url) that permits end users to upload their data to AWS S3 for storage. Once the image is uploaded to S3, the form passes metadata to the OAM data catalog through the catalog API's POST /uploads endpoint.

The catalog API POST /uploads endpoint,

  • records the user metadata associated with the upload into a MongoDB "uploads" collection
  • records information about each image in an upload to a MongoDB "images" collection
  • kicks off a background task to "transcode" each image into a COG
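
For illustration, the flow above looks roughly like this from a client's perspective. This is a minimal sketch, not the actual OAM Browser code: the endpoint paths (/uploads/url, /uploads) come from the description above, while the base URL, request parameters, and response shapes are assumptions.

```python
# Hedged sketch of the existing OAM upload flow, client side.
# Base URL, request parameters, and response shapes are assumptions.
import requests

OAM_API = "https://api.openaerialmap.org"  # assumed base URL
HEADERS = {"Authorization": "Bearer <user-token>"}  # auth details omitted

# 1. Ask the backend for a presigned S3 URL for the file we want to upload
resp = requests.get(
    f"{OAM_API}/uploads/url",
    params={"name": "scene.tif", "type": "image/tiff"},  # assumed parameters
    headers=HEADERS,
)
presigned_url = resp.json()["results"]["url"]  # assumed response shape

# 2. Upload the raw image directly to S3 using the presigned URL
with open("scene.tif", "rb") as f:
    requests.put(presigned_url, data=f)

# 3. Register the upload's metadata; this creates the "uploads"/"images"
#    records and kicks off the background transcoding job
requests.post(
    f"{OAM_API}/uploads",
    json={"scenes": [{"title": "My scene", "urls": [presigned_url]}]},  # assumed shape
    headers=HEADERS,
)
```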

The transcode worker,

  • converts the image into a COG and creates a thumbnail
  • records detailed imagery metadata (GSD, projection, file size, etc)
  • uploads the COG and thumbnail to the destination bucket
  • calls back to OAM API after successful completion with the imagery locations and detailed metadata (/uploads/:id/:sceneIdx/:imageId)

Upon receiving this callback, the OAM API updates processing information about the uploaded image in the "images" collection so users know their upload has finished processing. The callback handler also records the details in the "meta" collection, thereby making the upload available in the OAM catalog through the /meta endpoint.

Connection to STAC API

For a STAC ingestion system, we have two related concerns to think about,

  1. How will OAM metadata be translated into STAC?
  2. How do we want to connect the existing data ingestion system (OAM Browser "upload" page) to the STAC ingestion system proposed in this ticket?

For the translation, the OAM metadata specification uses a format based on the Open Imagery Network (OIN) metadata specification. It's very common to have a custom metadata format, and we have tools to translate this OAM specification into STAC (https://github.com/hotosm/stactools-hotosm). We could potentially run this translation as part of the STAC ingestion system, or we could translate the OAM metadata into STAC as part of the system that invokes the ingestion.
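
To make the translation concern concrete, here is a minimal sketch of turning one OAM/OIN record into a STAC Item. In practice this logic lives in stactools-hotosm; the sketch below uses pystac directly, and the OAM field names (id, bbox, acquisition_start, gsd, platform, image_url) are illustrative assumptions rather than the exact specification.

```python
# Illustrative OAM/OIN -> STAC translation; in practice stactools-hotosm would
# provide this. Field names on the `meta` dict are assumptions for illustration.
from datetime import datetime, timezone

import pystac
from shapely.geometry import box, mapping


def oam_to_stac_item(meta: dict) -> pystac.Item:
    """Translate one OAM metadata record (dict) into a STAC Item."""
    bbox = meta["bbox"]  # assumed [west, south, east, north]
    acquired = datetime.fromisoformat(meta["acquisition_start"].replace("Z", "+00:00"))
    if acquired.tzinfo is None:
        acquired = acquired.replace(tzinfo=timezone.utc)  # assume UTC if unspecified
    item = pystac.Item(
        id=meta["id"],
        geometry=mapping(box(*bbox)),
        bbox=bbox,
        datetime=acquired,
        properties={"gsd": meta.get("gsd"), "platform": meta.get("platform")},
    )
    item.add_asset(
        "image",
        pystac.Asset(href=meta["image_url"], media_type=pystac.MediaType.COG, roles=["data"]),
    )
    return item
```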

For making the connection between the OAM Browser upload page and our STAC system, there are generally two main patterns,

  • Synchronous ingestion using STAC API "transactions" endpoints
  • Asynchronous ingestion, typically on a background worker that directly connects to the PgSTAC database. For example,
    • The "Ingestor API" from eoapi-cdk (link) exposes an API to enqueue ingestion of STAC Items into PgSTAC
    • Other systems use bucket notifications, notification topics, or message queues directly as mechanisms for enqueuing ingestion of STAC Items

One of the primary benefits of the asynchronous ingestion pathway is the potential for higher ingestion throughput, by performing batch inserts and pre-optimizing inserts to align with database table partitions. The primary benefit of synchronous ingestion is the immediate feedback on whether or not the STAC Item was successfully ingested.
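
As a concrete (hedged) example of the synchronous pattern, an ingest call against the standard Transactions extension route would look roughly like this. The POST /collections/{collection_id}/items route is the standard transactions route; the base URL, collection id, and auth scheme are assumptions.

```python
# Synchronous ingestion through the STAC API Transactions extension.
# Base URL, collection id, and the bearer token are assumptions.
import requests

STAC_API = "https://stac.example.org"  # assumed STAC API base URL
COLLECTION = "openaerialmap"           # assumed collection id

oam_meta: dict = ...  # one OAM metadata record, e.g. fetched from the /meta endpoint
item = oam_to_stac_item(oam_meta).to_dict()  # translation sketched above
item["collection"] = COLLECTION

resp = requests.post(
    f"{STAC_API}/collections/{COLLECTION}/items",
    json=item,
    headers={"Authorization": "Bearer <m2m-token>"},  # assumed machine-to-machine auth
    timeout=30,
)
resp.raise_for_status()  # the immediate-feedback benefit: a failed ingest surfaces here
```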

Proposal

Given the existing OAM catalog setup and design considerations for STAC ingestion of OAM metadata, I propose that we,

  • Translate the OAM metadata from the OAM format to STAC as part of the STAC services.
    • This makes sense tactically, as our OAM-to-STAC translation code and the stac-fastapi ecosystem are written in Python, while the OAM API is written in JavaScript.
    • This choice is also more forward-looking, as it keeps the translation component within the new codebases and services we're building, which will make it easier if we eventually want to deprecate the OAM API.
  • Ingest STAC records using endpoints on the STAC API service
    • Use the STAC API "transactions" endpoints for native STAC Items
    • Create a new API endpoint to translate (using the stactools-hotosm package) and insert/upsert OIN-formatted metadata (e.g., POST /ingest/oam)
  • Keep the OAM Browser connection to the OAM API, but connect the OAM API /uploads/:id/:sceneIdx/:imageId endpoint to the STAC API to perform STAC ingestion.
    • We can eventually directly connect the OAM frontend to STAC API for ingestion, but we would need additional support for authentication/authorization, data ingestion (presigned URLs, etc), and data preprocessing (transcoding to COG, creating thumbnails, etc).
    • For now we can have the STAC ingestion be an additional step after adding to the existing OAM API catalog data store (MongoDB). For example we could place the call right here
    • Connecting OAM API to STAC API would simplify authorization for the STAC API transactions endpoints. We could communicate between OAM API and STAC API using a simple user/password or machine to machine token.

If we follow this proposed idea, the imagery upload process would look like,

sequenceDiagram
    participant BROWSER as OAM Browser
    participant OAMAPI as OAM API
    participant MONGO as OAM MongoDB
    participant OAMWORKER as OAM Transcoder
    participant STACAPI as STAC API
    participant PGSTAC as PgSTAC

    BROWSER->>OAMAPI: /oauth/<provider>
    OAMAPI->>BROWSER: auth details

    BROWSER->>OAMAPI: POST /uploads
    OAMAPI->>MONGO: Record "upload" and "images"
    OAMAPI->>OAMWORKER: Queue transcoding job
    activate OAMWORKER

    BROWSER->>OAMAPI: GET /uploads/:id
    OAMAPI->>BROWSER: "processing"

    OAMWORKER->>OAMAPI: Update imagery status
    deactivate OAMWORKER
    OAMAPI->>MONGO: Record "meta", making upload available to OAM catalog

    OAMAPI->>STACAPI: POST or PUT /ingest/oam
    activate STACAPI
    STACAPI->>STACAPI: Transform OAM metadata into STAC Item
    STACAPI->>PGSTAC: insert/upsert Item
    STACAPI->>OAMAPI: 2xx response code
    deactivate STACAPI

    BROWSER->>OAMAPI: GET /uploads/:id
    OAMAPI->>BROWSER: "complete"

    BROWSER->>STACAPI: search/filter/list
    STACAPI->>BROWSER: STAC Items

Implications

If we want to implement this proposal we would need four sets of steps,

  1. Write a new STAC-FastAPI ApiExtension to support the OAM ingestion transactions (e.g., OamIngestionExtension). This extension will glue together stactools-hotosm and stac-fastapi-pgstac to translate OAM metadata records and insert/upsert them into our STAC catalog (a rough sketch follows this list).
  2. Deploy a custom container for stac-fastapi-pgstac that includes this OamIngestionExtension.
  3. Provision authentication details giving permission for OAM API to insert using this OamIngestionExtension
  4. Update OAM API to communicate with our STAC API
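
A rough sketch of step 1 under the assumptions in this ticket: the /ingest/oam route accepts an OAM/OIN record, translates it, and upserts the resulting Item. The router would be wrapped in the proposed OamIngestionExtension and registered on the stac-fastapi-pgstac app; upsert_into_pgstac and oam_to_stac_item are placeholders for the real PgSTAC write path and stactools-hotosm, not existing APIs.

```python
# Sketch of the proposed POST /ingest/oam route. OamIngestionExtension,
# upsert_into_pgstac, and oam_to_stac_item are proposals/placeholders from
# this ticket, not existing APIs.
from fastapi import APIRouter, HTTPException, Request

router = APIRouter(tags=["OAM ingestion"])


@router.post("/ingest/oam", status_code=201)
async def ingest_oam(request: Request, oam_metadata: dict) -> dict:
    """Translate an OAM/OIN metadata record and upsert the STAC Item."""
    try:
        item = oam_to_stac_item(oam_metadata)  # stactools-hotosm in practice
    except (KeyError, ValueError) as exc:
        raise HTTPException(status_code=422, detail=f"Invalid OAM metadata: {exc}")

    # Placeholder: write through whatever PgSTAC connection the app exposes
    # (stac-fastapi-pgstac transactions client or the pypgstac loader).
    await upsert_into_pgstac(request.app, item.to_dict())  # hypothetical helper
    return {"id": item.id}
```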

Alternatives Considered

Respond to OAM metadata JSON uploads to S3

One alternate idea that might be nice from a decoupling perspective (i.e., no code change in OAM API) would be to,

  • Use the fact that OAM API writes a _meta.json to S3 when metadata is added to the Mongo based catalog
  • Add a bucket notification topic for s3:ObjectCreated:* events on this bucket with the suffix pattern _meta.json
  • In our new work, create an SQS queue that is fed by this bucket notification topic.
  • Add a worker (sketched after this list) to,
    • Pull off this queue
    • Translate the OAM metadata to STAC
    • Add to PgSTAC through one of,
      • Insert into PgSTAC using the pypgstac loader directly from this "translate" worker
      • Insert using the STAC API transactions endpoints
      • Forward the converted STAC Item onto another, STAC specific, async ingestion worker. This STAC ingestion worker would insert into PgSTAC directly and might be useful for other efforts like integrating with 3rd party catalogs.
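
A sketch of what that worker could look like, assuming an SNS topic feeding an SQS queue. The queue URL, environment variables, and the pypgstac Loader call are shown approximately and would need checking against the pypgstac docs; oam_to_stac_item is the translation sketched earlier.

```python
# Async worker: consume _meta.json bucket notifications, translate, upsert.
# Queue URL, DSN env vars, and the pypgstac Loader usage are assumptions.
import json
import os

import boto3
from pypgstac.db import PgstacDB
from pypgstac.load import Loader, Methods

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = os.environ["META_JSON_QUEUE_URL"]               # assumed env var
loader = Loader(db=PgstacDB(dsn=os.environ["PGSTAC_DSN"]))  # assumed env var

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        s3_event = json.loads(json.loads(msg["Body"])["Message"])  # SNS envelope -> S3 event
        for record in s3_event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]  # matches the *_meta.json suffix filter
            meta = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
            item = oam_to_stac_item(meta)        # translation sketched earlier
            loader.load_items(iter([item.to_dict()]), insert_mode=Methods.upsert)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```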

I also like this, but it seemed like more work for something that would be largely thrown away once the entire catalog is based on STAC. It'd be nice to avoid modifying the OAM API, but this decoupling means we couldn't leverage the status tracking built into the OAM API.

Resynchronize the entire OAM catalog into PgSTAC

Largely included for completeness, the minimal-effort approach to keeping the STAC-based catalog in sync with the existing OAM catalog would be to reingest everything. This could be done as a scheduled task or manually (e.g., using this notebook). This would be simple, but it has relatively high latency and involves the most compute work, although given the small size of the OAM catalog that might not be a big deal.
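
As a sketch of what the scheduled resync could look like, reusing the translation and loader pieces from the earlier sketches; the /meta pagination parameters and response shape are assumptions.

```python
# Full resynchronization of the OAM catalog into PgSTAC, run on a schedule.
# The /meta pagination parameters and response shape are assumptions; `loader`,
# `Methods`, and `oam_to_stac_item` are the pieces sketched earlier.
import requests

OAM_API = "https://api.openaerialmap.org"  # assumed base URL
page, limit = 1, 100

while True:
    resp = requests.get(f"{OAM_API}/meta", params={"page": page, "limit": limit}, timeout=60)
    results = resp.json().get("results", [])
    if not results:
        break
    items = (oam_to_stac_item(meta).to_dict() for meta in results)
    loader.load_items(items, insert_mode=Methods.upsert)
    page += 1
```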

Future Directions

It seems reasonable that OAM would want to consolidate the entire data ingestion process onto this new catalog instead of relying on a hybrid of the OAM API and the STAC API. The missing components for this process are,

  1. User authentication
    • We might be able to use stac-auth-proxy to integrate identity providers with our STAC API and associated services.
  2. Imagery preprocessing (COG transcoding, thumbnail generation, metadata creation)
    • We might use rio-cogeo for COG creation, rio-tiler for generating thumbnails, and stactools-hotosm for generating rich metadata (see the sketch at the end of this list).
    • This image preprocessing would generate STAC metadata directly, eliminating the need to convert from the OAM specification
  3. Preprocessing orchestration, status tracking, and communication with the frontend
    • The design of this probably wouldn't change very much, but it might utilize different tools for background task processing.
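
A small sketch of what that preprocessing step might look like with the tools named above; paths are placeholders, and band rescaling / color handling is omitted.

```python
# Future preprocessing sketch: rio-cogeo for COG transcoding, rio-tiler for a
# thumbnail. Paths are placeholders; rescaling/color handling is omitted.
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles
from rio_tiler.io import Reader

# 1. Transcode the uploaded image to a Cloud-Optimized GeoTIFF
cog_translate("upload.tif", "upload_cog.tif", cog_profiles.get("deflate"))

# 2. Render a small PNG thumbnail from the COG
with Reader("upload_cog.tif") as src:
    thumbnail_png = src.preview(max_size=256).render(img_format="PNG")
with open("upload_thumbnail.png", "wb") as f:
    f.write(thumbnail_png)

# 3. STAC metadata would then be generated directly for the COG + thumbnail
#    (e.g., with stactools-hotosm helpers), skipping the OAM -> STAC translation.
```
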
gadomski (Collaborator, Author) commented on Apr 16, 2025

👍🏼

If the existing metadata generation were in Python I'd want to push it farther left (in your diagram) but since it's not, I think this design feels correct.

A+ description, btw

aliziel commented on Apr 17, 2025

Correct me if I'm misunderstanding, but this sounds like a dual-write situation (between Mongo and pgSTAC), so I just want to clarify how we maintain pgSTAC consistency on failure.

Earlier we decided on uploading through Mongo and serving through pgSTAC. If a failure occurs in Mongo or during the S3 upload, that would constitute a failed write and should throw. If a failure occurs on the write to pgSTAC, though, should we use Mongo as a fallback for lookups and translate on the fly? I don't think we need to orchestrate distributed transactions or streaming CDC or anything, but I'm wondering if we can make retries sufficiently robust to handle failures at any of the multiple upstream writes without compromising too much on data integrity. Or, at the risk of blending together the a/synchronous patterns, handle failures/resyncing in a separate process so it's all easier to unplug + rewire later.

It's been a while since I've used Mongo, so I may be missing something there as well (not accounting for S3).

gadomski (Collaborator, Author) commented on Apr 17, 2025

If a failure occurs on the write to pgSTAC, though, should we use Mongo as a fallback for lookups and translate on the fly?

I don't think so, for two reasons:

  • The volume of the archive is small (the full archive is 2 MB when compressed)
  • Folks aren't relying on the system for short-duration (i.e., <24 hr) turnaround (plz correct me if this isn't true)

Given both of those reasons, I think a regular re-index to bring mongo and pgstac into sync should be Good Enough™ to solve any data inconsistencies.

ceholden commented on Apr 17, 2025

If a failure occurs on write to pgSTAC though, should we use Mongo as a fallback for lookups and translate on the fly

+1 to what Pete said, also considering,

  • The end goal is to replace Mongo as the data store for metadata, so having to write to both places is just temporary for the prototype.
  • Having to support reads/queries against the non-STAC metadata in Mongo, plus translating it on the fly, would explode the amount of work required
  • There's error handling to signal for intervention. If the "update metadata" callback fails (on a write to either store), the OAM API server will throw an error. The caller (the "transcoder") will see the failure, and its failure handler will mark the upload processing as failed.

There's also already a dual-write scenario (writing to Mongo, then S3). I added an "Alternatives Considered" section to the original comment describing an option that might be nice from a decoupling perspective: #182 (comment)

spwoodcock (Member) commented on Apr 18, 2025

Thanks for the great writeup! 🎉

There are many moving pieces here - I haven't had a chance to fully dig into the whole pgSTAC plan, as I have been buried in some work for FieldTM & DroneTM.

Could you help me understand the Resynchronize the entire OAM catalog into PgSTAC option please?

  • It's mentioned at the bottom in a way that makes it sound like it's not a great choice? What are the main downsides?
  • As you say, the OAM metadata isn't that big, so it makes total sense to me to re-ingest it into a much more resilient platform.
    • pgSTAC is based on Postgres, which we use for all our tools at HOT.
    • I don't have many nice words to say about Mongo. We have a managed Mongo database that keeps being force upgraded, causing issues with our outdated API and Mongo drivers.
    • Ideally we want a relational database, with all the advantages that come with postgres (in-house knowledge, scalability, etc).
  • Would this make things easier when using eoAPI, and help maintainability into the future?
  • I also mention this option, as the code for OAM-API is significantly outdated. I had to go through the painful process of a Node 16 upgrade recently when things broke, with many patches of outdated libs. It's not a maintainable solution. Sure, it could last a few more years without a re-write, but I'm not holding my breath.

EDIT: after writing the above, I realised I was probably being a bit silly. I'm pretty sure we are already migrating into pgSTAC based on previous discussions, but presumably this solution is suggesting we continue to write to OAM-API, then periodically ingest into pgSTAC? In that case, is the main downside the latency this introduces, since we need to do this on a schedule?

Personally, even if this is the case, I still like this approach, as the time spent engineering workarounds could be better spent starting the replacement for the uploader 😄

As for the two other options proposed, we can dig into them more if the option above isn't possible, but I'll comment briefly 👍

Respond to OAM metadata JSON uploads to S3

  • Nice idea, but I'm not a huge fan due to the reliance on more AWS services (SQS). Sure, we have credits now, but we need to be ready in the event that AWS pulls them.
  • We could consider options like hosting MinIO in the cluster (depending on a cost comparison / the amount of imagery storage required), as it includes events by default and is open-source / rock-solid. But that's probably outside the scope of this project.
  • Any other alternatives I thought about here aren't great (polling etc).

OAM-API --> pgSTAC

  • This is a well thought out and nice option 😄
  • My main concern echoes what I wrote above about OAM-API.
  • But as this is a temp solution, it's not a bad one!
  • I wouldn't expect it to be easy to modify the OAM-API code, but let's see.
  • If the re-ingestion pathway is a definite no go, this seems like the best choice!

17 remaining items