fix: Merge dataset item schemas across heterogeneous shapes #854
Merged
Replaces `to-json-schema` with an in-house inferrer. The old library's
`mode: 'all'` collapsed any two items with differing key sets to
`{type: 'array', items: {type: 'object'}}` — properties wiped out.
Affected three tools that share `generateSchemaFromItems`:
`get-dataset-schema`, `call-actor`, `get-actor-run`.
The new inferrer set-unions property keys and emits JSON Schema type
arrays for primitive conflicts. Drops the `arrayMode` field from
`get-dataset-schema` (only existed as a workaround for the lib bug).
Also drops the upstream's noisy `format: 'style'` false positive on
free-form text.
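The merge rule described above (set-union of property keys, type arrays for primitive conflicts) can be sketched roughly as follows. This is an illustrative sketch, not the code in `src/utils/schema_generation.ts`: the names `Json`, `Schema`, `inferSchema`, and `mergeSchemas` are hypothetical, format detection is left out, and type arrays are kept in first-seen order as an assumption.

```typescript
// Hypothetical sketch of a key-union schema merge; not the PR's actual implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Schema {
    type?: string | string[];
    properties?: Record<string, Schema>;
    items?: Schema;
}

// Normalize `type` to a list so single types and type arrays merge the same way.
function typeList(schema: Schema): string[] {
    if (schema.type === undefined) return [];
    return Array.isArray(schema.type) ? schema.type : [schema.type];
}

function mergeSchemas(a: Schema, b: Schema): Schema {
    const merged: Schema = {};

    // Primitive conflicts become a JSON Schema type array (first-seen order, deduplicated).
    const types = Array.from(new Set(typeList(a).concat(typeList(b))));
    if (types.length === 1) merged.type = types[0];
    else if (types.length > 1) merged.type = types;

    // Set-union of property keys; keys present on both sides are merged recursively.
    if (a.properties || b.properties) {
        const left = a.properties ?? {};
        const right = b.properties ?? {};
        const keys = Array.from(new Set(Object.keys(left).concat(Object.keys(right))));
        const properties: Record<string, Schema> = {};
        for (const key of keys) {
            const l = left[key];
            const r = right[key];
            properties[key] = l && r ? mergeSchemas(l, r) : ((l ?? r) as Schema);
        }
        merged.properties = properties;
    }

    // Array item schemas are merged the same way.
    if (a.items && b.items) merged.items = mergeSchemas(a.items, b.items);
    else if (a.items || b.items) merged.items = a.items ?? b.items;

    return merged;
}

function inferSchema(value: Json): Schema {
    if (value === null) return { type: 'null' };
    if (Array.isArray(value)) {
        // Fold all element schemas into one `items` schema.
        const items = value
            .map((element) => inferSchema(element))
            .reduce<Schema | undefined>((acc, s) => (acc ? mergeSchemas(acc, s) : s), undefined);
        return items ? { type: 'array', items } : { type: 'array' };
    }
    if (typeof value === 'boolean') return { type: 'boolean' };
    if (typeof value === 'string') return { type: 'string' };
    if (typeof value === 'number') return { type: Number.isInteger(value) ? 'integer' : 'number' };

    // Remaining case: a plain object.
    const properties: Record<string, Schema> = {};
    for (const key of Object.keys(value)) {
        properties[key] = inferSchema(value[key]);
    }
    return { type: 'object', properties };
}

// Items with differing key sets keep all their properties.
console.log(JSON.stringify(inferSchema([{ name: 'A', markdown: '# hi' }, { name: 'B' }])));
```

In this sketch, `inferSchema([{ x: 1 }, { x: 'hi' }])` yields an `items` schema whose `x` property has `type: ['integer', 'string']`, and an item missing `markdown` no longer erases the property contributed by its sibling.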
jirispilka
approved these changes
May 20, 2026
One nit, otherwise it looks good.
I'll test it in the follow-up PR.
Co-authored-by: Jiří Spilka <jiri.spilka@apify.com>
Context
`get-dataset-schema`, `call-actor`, and `get-actor-run` all share `generateSchemaFromItems`, which delegated to `to-json-schema@0.2.5`. Any two items with different key sets collapsed to `{type: 'array', items: {type: 'object'}}`, with all properties wiped out. Reported on a real NYC restaurants dataset where ~half the items carried `markdown` and half didn't.
Solution
Replaced the library with an in-house inferrer in `src/utils/schema_generation.ts`. The merge does a set-union of property keys and recurses; primitive type conflicts emit JSON Schema `type` arrays (e.g. `["string","null"]`). Drops the `arrayMode` field from `get-dataset-schema`: it only existed as a workaround for the buggy `mode: 'all'`, and all internal callers were already passing `arrayMode: 'all'` anyway.
Worth your attention
- `to-json-schema` was last published in 2020 and the upstream repo is dead. Owning ~120 LOC of pure JSON-Schema inference is cheaper than auditing an unmaintained transitive surface on a server that handles customer Apify tokens.
- `{x:1}` + `{x:"hi"}` produces `{"type":["integer","string"]}`: spec-valid JSON Schema, handled natively by LLMs reading the tool output. Verified the generated schema is never Ajv-validated downstream (checked both this repo and `apify-mcp-server-internal`; Ajv only validates tool input args).
- `arrayMode` field removed from `get-dataset-schema`. Technically a public API change. Safe because (a) all 3 internal callers always passed `arrayMode: 'all'`, and (b) the `'first'` mode was never useful: `to-json-schema` applies it recursively to nested arrays too, which is almost never what callers want.
- `format: "style"` false positive. Free-form Markdown text was being tagged with a CSS-ish format. The new format detector covers only `uri`, `date-time`, `date`, `email`, `uuid` (the unambiguous ones).
Follow-up