-
-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite #373
base: master
Are you sure you want to change the base?
Rewrite #373
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #373 +/- ##
==========================================
- Coverage 78.87% 78.66% -0.22%
==========================================
Files 54 54
Lines 4488 4481 -7
==========================================
- Hits 3540 3525 -15
- Misses 948 956 +8 ☔ View full report in Codecov by Sentry. |
1ecc875
to
9d61042
Compare
Hi @Stranger6667 , I'm new to rust and came across this crate to validate some json files that I have for kubernetes.
|
Hi @Stranger6667 ,
By default, the url scheme is
Note:
|
9d61042
to
26dbb89
Compare
0b2f7e6
to
54dac9c
Compare
489bce8
to
7f97346
Compare
|
||
/// Unique identifier of a node in a graph. | ||
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)] | ||
pub(crate) struct NodeId(usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be NonZeroUsize
- the 0th node should not be addressable
#[derive(Debug, Copy, Clone)] | ||
pub(crate) struct Node<'s> { | ||
pub(crate) value: &'s Value, | ||
kind: JsonSchemaNodeKind, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be named like TraversalScope
schema: &'s Value, | ||
root: &'s Resolver, | ||
resolvers: &'s HashMap<&str, Resolver>, | ||
) -> Result<CompressedRangeGraph> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe store node ids of parent / children / siblings right here? then we may avoid having a separate vec for edges + there will be no need to maintain a stack during validation (each node knows where to go after validation is done at its level):
struct Node {
parent: Option<NonZeroUsize>,
children: Option<Range<NonZeroUsize>>,
data: Validator
}
c8ee78a
to
8eacf2d
Compare
Overview
This is a work-in-progress PR that brings the
jsonschema-rewrite
crate that takes a different approach to its internal structure.At a certain stage, it will be ready for review, but feel free to share your thoughts here.
Eventually, this PR will include all features provided by the current version, but at the moment it is more a proof of concept.
Porting some optimizations from the current version will be needed for proper benchmarking, but first, I'd like to focus on correctness.
Motivation
Validation performance. Building a "compiled" JSON Schema representation is a lesser concern, as it is supposed to be created once per schema and then reused. However, once the implementation is stable, I'd like to improve it too.
My personal use case is filtering randomly generated data leaving only samples that match some schema. It is a CPU-bound task and I'd like it to be as fast as possible (it also affects my AWS bills). Most of the data is generated by construction, but some parts are hard to implement this way, hence resorting to filtering.
Lessons learned
Over the last few years, I spent quite some time optimizing the current implementation and at some point, I realized that it is fundamentally limited by the internal structure I chose to use in the very beginning.
From my perspective the most annoying details of the current approach are:
validate
. In applicator keywords, we almost always allocate a new vector for children's validation errors. I.e. it is not a truly lazy iteration - generally, it does validation first and then iterates over errors.validate
. Another aspect is that we useflat_map
excessively and it produces a lot of code inside the resulting binary (according tollvm-lines
). It affects compilation time (and presumably performance too).RWLock
for reference resolving during validation. That adds some overhead - I removed it with some unsafe stuff (it was a single-threaded environment, so it was ok) and it gave ~10-12% improvement for Open API JSON Schema benchmark.serde_json
. This forces the users to convert schemas & inputs toserde_json::Value
, which is a huge performance hit for bindings like Python's where data transformation could eat up to 85% of total validation time.apply
calls.Ideas
Here are ideas on how each issue could be approached.
Fragmented data layout & annotations
Store input schemas in as few allocations as possible. My current idea is:
Applicator keywords may have some edges, e.g.
properties
might store indexes into the edges vector, then those edges will point back to the keywords vector. This should be enough to traverse keywords in the right order and extract input components needed for specific keywords.Keywords may need additional data, e.g.
required
needs a list of properties. This stuff may live in some shared arena too - I'd like to avoid nested allocations completely. Eventually, it would be nice to make it work inno-std
environments, where the user can provide their own memory where the "compiled" schema version will store its data. Dunno how it is going to play with custom keywords where the user may have arbitrary data, but it should be solvable.Similarly, annotations could be stored in a separate container to avoid loading them during
is_valid
/validate
, but this should be benchmarked first.Allocations during
validate
& excessive code generationHere we may implement the
Iterator
trait that will traverse over the schema & input and store its state on manually-handled stacks. E.g., with the layout above we can have two vectors to store the traversal position in keywords & edges. It gets a bit tricky with applicator keywords that may need to run validation on all their children first, but we still can store this state in a struct that implementsIterator
. This way it should be just allocations to store the iterator state.It also implies that we won't need
flat_map
this much - return types should be simpler. E.g., instead of always returning an iterator we can returnResult<(), ValidationError>
/Option<ValidationError>
from keywords and then use it in theIterator
trait implementation.Locking during reference resolving & async
Resolve everything upfront, then there will be no need to store anything extra during validation, hence no locking is needed.
The layout above also allows us to handle recursive references much easier and gives us an opportunity to optimize even more by inlining trivial schemas (like
ajv
does), so we avoid some jumps across the schema at all.If this will be just a single function, it is easier to make it async:
Not sure about the most ergonomic interface here, but separating resolving definitely makes async easier to work with.
Input layout &
serde_json
This one could be done later, but the idea would be to create a public trait, e.g.
Json
, andmake this library use it in its public API. This trait would make it possible to work with any inputs like it is a JSON value - checking types, extracting values, iterating, etc. By default, we can implement it only for
serde_json::Value
. However, I am not sure if it would be completely zero-cost forserde_json::Value
.Progress
The current state contains:
is_valid