Skip to content

mongodb_cdc: add autoschema support#4091

Open
Jeffail wants to merge 2 commits intomainfrom
autoschema-mongodb-cdc
Open

mongodb_cdc: add autoschema support#4091
Jeffail wants to merge 2 commits intomainfrom
autoschema-mongodb-cdc

Conversation

@Jeffail
Copy link
Collaborator

@Jeffail Jeffail commented Mar 11, 2026

Add schema metadata to mongodb_cdc messages using a two-tier strategy: Tier 1 queries $jsonSchema validators at startup for accurate type info, Tier 2 infers schema from document structure as a fallback. Both tiers use key-set fingerprinting for lightweight schema change detection.

Schema is attached as immutable "schema" metadata in benthos Common Schema format, enabling downstream processors like parquet_encode to automatically determine column types. The "database_schema" metadata field is also added with the database name.

Key design decisions:

  • Fields are always sorted alphabetically for deterministic ordering
  • Partial updates and documentKey-only deletes use cached schema
  • Schema cache is reset on reconnect

Add schema metadata to mongodb_cdc messages using a two-tier strategy:
Tier 1 queries $jsonSchema validators at startup for accurate type info,
Tier 2 infers schema from document structure as a fallback. Both tiers
use key-set fingerprinting for lightweight schema change detection.

Schema is attached as immutable "schema" metadata in benthos Common
Schema format, enabling downstream processors like parquet_encode to
automatically determine column types. The "database_schema" metadata
field is also added with the database name.

Key design decisions:
- Fields are always sorted alphabetically for deterministic ordering
- Partial updates and documentKey-only deletes use cached schema
- Schema cache is reset on reconnect
@claude
Copy link

claude bot commented Mar 11, 2026

Commits
LGTM

Review
Clean implementation of two-tier schema discovery (Tier 1: $jsonSchema validators, Tier 2: document inference) with key-set fingerprinting for change detection. Well-tested with both unit and integration tests.

  1. Error wrap prefixes in fetchCollectionSchema use imperative/noun form instead of the required gerund form — comment (convention: fmt.Errorf("sending batch: %w", err)).

- Inject _id into Tier 1 ($jsonSchema) schemas so the key-set
  fingerprint matches real documents and the validator's
  required/optional classification is preserved
- Remove unused keysFromSchema function
- Fix TOCTOU race in getOrInferCollectionSchema by using a single Lock
- Move schema pre-fetch from background goroutine into Connect() so the
  change stream is not delayed
- Use gerund form in error wrap prefixes per project conventions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jeffail Jeffail force-pushed the autoschema-mongodb-cdc branch from 875997a to 2266396 Compare March 12, 2026 08:32
@claude
Copy link

claude bot commented Mar 12, 2026

Commits
LGTM

Review
This PR adds autoschema support to mongodb_cdc with a well-structured two-tier approach: Tier 1 queries $jsonSchema validators at startup, Tier 2 infers from document structure as fallback. Schema caching with key-set fingerprinting for change detection is implemented correctly with proper mutex protection.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants