fix: Merge dataset item schemas across heterogeneous shapes #854

Merged
MQ37 merged 3 commits into master from fix/dataset-schema-merge-empty-shapes on May 21, 2026

@MQ37 MQ37 commented May 16, 2026

Context

get-dataset-schema, call-actor, and get-actor-run all share generateSchemaFromItems, which delegated to to-json-schema@0.2.5. Any two items with different key sets collapsed to {type:'array',items:{type:'object'}} — properties wiped out. Reported on a real NYC restaurants dataset where ~half the items carried markdown and half didn't.

Solution

Replaced the library with an in-house inferrer in src/utils/schema_generation.ts. The merge takes the set-union of property keys and recurses; primitive type conflicts emit JSON Schema type arrays (e.g. ["string","null"]). Also drops the arrayMode field from get-dataset-schema: it only existed as a workaround for the buggy mode:'all', and all internal callers were already passing arrayMode:'all' anyway.
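For reviewers who want the merge rule in one place, here is a minimal sketch of the idea described above. It is not the code in src/utils/schema_generation.ts: the names `Schema`, `inferSchema`, and `mergeSchemas` are hypothetical, the sample items are made up, and it assumes JSON-compatible input (no `undefined`, functions, or class instances).

```typescript
// Hypothetical sketch of set-union schema merging; not the PR's actual implementation.
type JsonType = "null" | "boolean" | "integer" | "number" | "string" | "array" | "object";

interface Schema {
  type?: JsonType | JsonType[];
  properties?: Record<string, Schema>;
  items?: Schema;
}

// Map a JSON-compatible value to its JSON Schema type name.
function typeOf(value: unknown): JsonType {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  if (typeof value === "number") return Number.isInteger(value) ? "integer" : "number";
  if (typeof value === "boolean") return "boolean";
  if (typeof value === "string") return "string";
  return "object";
}

// Infer a schema for one value; array elements are merged into a single `items` schema.
function inferSchema(value: unknown): Schema {
  const type = typeOf(value);
  if (type === "array") {
    const schemas = (value as unknown[]).map((element) => inferSchema(element));
    return schemas.length > 0 ? { type, items: schemas.reduce(mergeSchemas) } : { type };
  }
  if (type === "object") {
    const properties: Record<string, Schema> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      properties[key] = inferSchema(child);
    }
    return { type, properties };
  }
  return { type };
}

function typesOf(schema: Schema): JsonType[] {
  if (schema.type === undefined) return [];
  return Array.isArray(schema.type) ? schema.type : [schema.type];
}

function mergeSchemas(a: Schema, b: Schema): Schema {
  const merged: Schema = {};

  // Conflicting types become a type array, in first-seen order.
  const types = Array.from(new Set([...typesOf(a), ...typesOf(b)]));
  if (types.length === 1) merged.type = types[0];
  else if (types.length > 1) merged.type = types;

  // Set-union of property keys; keys present on both sides are merged recursively.
  if (a.properties || b.properties) {
    const properties: Record<string, Schema> = { ...(a.properties ?? {}) };
    for (const [key, child] of Object.entries(b.properties ?? {})) {
      const existing = Object.prototype.hasOwnProperty.call(properties, key)
        ? properties[key]
        : undefined;
      properties[key] = existing ? mergeSchemas(existing, child) : child;
    }
    merged.properties = properties;
  }

  const items = a.items && b.items ? mergeSchemas(a.items, b.items) : a.items ?? b.items;
  if (items) merged.items = items;
  return merged;
}

// Made-up items with differing key sets, similar in spirit to the reported dataset.
const schema = inferSchema([
  { name: "Example Diner", markdown: "# Menu" },
  { name: "Example Deli", rating: 4.5 },
  { name: null },
]);
console.log(JSON.stringify(schema, null, 2));
```

With these inputs the item schema keeps all three properties (`name`, `markdown`, `rating`) instead of collapsing to a bare `{type:'object'}`, and `name` gets the type array `["string","null"]`. One simplification to note: this sketch keeps `["integer","number"]` as a union even though `number` already covers integers.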

Worth your attention

  • No external dependency, no supply-chain surface. to-json-schema was last published in 2020 and the upstream repo is dead. Owning ~120 LOC of pure JSON-Schema inference is cheaper than auditing an unmaintained transitive surface on a server that handles customer Apify tokens.
  • Type-array unions for primitive conflicts. {x:1} + {x:"hi"} produces {"type":["integer","string"]} — spec-valid JSON Schema, handled natively by LLMs reading the tool output. Verified the generated schema is never Ajv-validated downstream (checked both this repo and apify-mcp-server-internal — Ajv only validates tool input args).
  • arrayMode field removed from get-dataset-schema. Technically a public API change. Safe because (a) all 3 internal callers always passed arrayMode:'all', and (b) the 'first' mode was never useful — to-json-schema applies it recursively to nested arrays too, which is almost never what callers want.
  • Drops the upstream's format:"style" false positive. Free-form Markdown text was being tagged with a CSS-ish format. The new format detector covers only uri, date-time, date, email, uuid — the unambiguous ones.
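To make the last bullet concrete, a conservative format detector limited to those five formats could look roughly like the sketch below. The patterns and the `detectFormat` name are assumptions for illustration, not the ones in src/utils/schema_generation.ts; in particular, restricting `uri` to http(s) URLs is a choice made here to keep the example unambiguous.

```typescript
// Hypothetical sketch of a narrow string-format detector; patterns are illustrative only.
type Format = "uri" | "date-time" | "date" | "email" | "uuid";

// Order matters: `uri` is checked before `email` so that a URL containing
// userinfo (e.g. https://user@host.tld) is not misreported as an email.
const FORMAT_PATTERNS: [Format, RegExp][] = [
  ["date-time", /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/],
  ["date", /^\d{4}-\d{2}-\d{2}$/],
  ["uuid", /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i],
  ["uri", /^https?:\/\/\S+$/i],
  ["email", /^[^\s@]+@[^\s@]+\.[^\s@]+$/],
];

// Returns the first matching format, or undefined for free-form text.
function detectFormat(value: string): Format | undefined {
  for (const [format, pattern] of FORMAT_PATTERNS) {
    if (pattern.test(value)) return format;
  }
  return undefined;
}

console.log(detectFormat("https://apify.com"));
console.log(detectFormat("# Menu\nSome *Markdown* text"));
```

Every pattern is anchored at both ends and none allows whitespace, so free-form Markdown falls through to `undefined` and gets no `format` keyword at all.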

Follow-up

Replaces `to-json-schema` with an in-house inferrer. The old library's
`mode: 'all'` collapsed any two items with differing key sets to
`{type: 'array', items: {type: 'object'}}` — properties wiped out.
Affected three tools that share `generateSchemaFromItems`:
`get-dataset-schema`, `call-actor`, `get-actor-run`.

The new inferrer set-unions property keys and emits JSON Schema type
arrays for primitive conflicts. Drops the `arrayMode` field from
`get-dataset-schema` (only existed as a workaround for the lib bug).
Also drops the upstream's noisy `format: 'style'` false positive on
free-form text.
@github-actions github-actions bot added the labels t-ai (Issues owned by the AI team.) and tested (Temporary label used only programmatically for some analytics.) May 16, 2026
@MQ37 MQ37 requested a review from jirispilka May 16, 2026 13:16
@jirispilka jirispilka left a comment

One nit; otherwise it looks good.
I'll test it in the follow-up PR.

Comment thread src/utils/schema_generation.ts Outdated
@MQ37 MQ37 merged commit f2691c0 into master May 21, 2026
15 checks passed
@MQ37 MQ37 deleted the fix/dataset-schema-merge-empty-shapes branch May 21, 2026 09:17