Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor json_schema.py, implement JSON Schema to YAML #1182

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lapp0
Copy link
Contributor

@lapp0 lapp0 commented Sep 30, 2024

Overview

Refactor json_schema.py to be more coherent and extensible. Use extensibility to implement JSON Schema to YAML.

Changes

  • Convert large function to_regex into a class JSONSchemaRegexGenerator with visitors which implement JSON Schema rules, and formatters which implement pattern construction.
  • Implement YAMLRegexGenerator by subclassing JSONSchemaRegexGenerator and overriding some formatters.

Tests:

  • Update test_json_schema.py so it's existing tests also apply to YAML.
    • No changes to these tests otherwise to ensure stable behavior (with the exception of fixes to anyOf and allOf)
  • Enable previously disabled test_generate.py::test_generate_json, test both json and yaml modes.
  • Incorporate json-schema.org test suite (~1250 tests)

Behavioral Changes

The only behavior changes are:

  • Convert features which result in incorrect patterns to NotImplementedError
  • Fix anyOf, allOf, oneOf
    • anyOf: Previously broken, now ORs sub-patterns
    • allOf: Previously broken, now ANDs sub-patterns via positive lookahead
    • oneOf: Warns user that it's using anyOf instead, and calls anyOf

The rules are much closer to the JSON Schema spec with main, however JSON Schema spec isn't always desirable. Users can legalize the JSON Schema compliant validation rules via strict_json_schema_subset=False, resulting in:

  • items: If unspecified, allow additional items without constraints
  • properties: If unspecified, allow additional properties without constraints

json-schema.org test suite

This is a large change-set. To verify correctness, in addition to ensuring current tests pass, test_json_schema_full.py tests compliance with JSON Schema by retrieving 1,245 test cases from the official json-schema.org test suite.

On this Branch On main
Pattern invalid: FP & FN (bad: invisible) 38 246
Raise NotImplementedError (acceptable: visible) 944 693
Pattern valid 263 306

Raising NotImplementedError makes it clear to the user why a schema would fail during generation, and it does so before generation.

test_json_schema_to_yaml_compliance

For each of the 263 tests which pass in test_json_schema_to_json_compliance, we test to verify their corresponding yaml pattern is also correct.

TODO

  • Refactor json_schema so its clean and extensible
  • Validate refactor integrity through extensive json-schema test suite
  • Implement JSON Schema to YAML
  • Apply test_json_schema_full.py to yaml
  • Finish integrating patrice's patterns
  • Improve doc-strings
  • Update docs to reflect new behaviour surrounding JSON Schema spec-compliant implementation
    • No longer needed thanks to strict_json_schema_subset

Further Work

  • Enable additional YAML subsets. This implementation can be easily extended to allow parameterization of how the YAML subset is defined. (e.g. quotes around the string or not, indentation rules, etc)
  • Refactor because json_schema.py does too much. This new structure makes separation of concerns clear, easing a refactor.
  • Implement: JSONSchemaRegexGenerator.to_automata(...) Not using a pattern intermediate would simplify things.
  • Implement NotImplemented components based on users opening issues.

outlines/fsm/json_schema.py Outdated Show resolved Hide resolved
@lapp0 lapp0 force-pushed the json-schema-yaml branch 2 times, most recently from db73b4c to c939c70 Compare September 30, 2024 22:54
@lapp0 lapp0 added tests Linked to library tests structured generation Linked to structured generation JSON correctness Everything related to the generation correctness run-benchmarks labels Oct 1, 2024
@lapp0 lapp0 force-pushed the json-schema-yaml branch 2 times, most recently from ce488b6 to 3a28324 Compare October 1, 2024 20:44
@lapp0 lapp0 requested a review from brandonwillard October 1, 2024 20:50
@lapp0 lapp0 marked this pull request as ready for review October 1, 2024 20:50
@lapp0 lapp0 force-pushed the json-schema-yaml branch from 3a28324 to 9648b30 Compare October 1, 2024 21:26
Copy link
Member

@brandonwillard brandonwillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON schema code is now in outlines-core. Unless there was an issue opened for that to be done separately, it looks like we missed it in #1175.

These changes will need to be moved and—potentially—ported to Rust. At the very least, we can't have two versions of the same basic JSON schema logic.

@lapp0 lapp0 force-pushed the json-schema-yaml branch from 9648b30 to 3efb728 Compare October 1, 2024 22:38
@lapp0
Copy link
Contributor Author

lapp0 commented Oct 1, 2024

Seems there was a miscommunication. Thanks for clarifying @brandonwillard, I'll get started on porting the changes to Rust.

@lapp0 lapp0 force-pushed the json-schema-yaml branch from 3efb728 to 99fa1ec Compare October 1, 2024 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
correctness Everything related to the generation correctness JSON run-benchmarks structured generation Linked to structured generation tests Linked to library tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Translate Pydantic models into a regular expression that accept the corresponding YAML
2 participants