feat: configure patterns regex engine #487

Open: wants to merge 3 commits into master
126 changes: 126 additions & 0 deletions jsonschema/src/compilation/options.rs
@@ -261,6 +261,107 @@ static META_SCHEMA_VALIDATORS: Lazy<AHashMap<schemas::Draft, JSONSchema>> = Lazy
store
});

/// Fancy regex crate options
#[derive(Clone, Default)]
pub struct FancyRegexOptions {
/// Limit for how many times backtracking should be attempted for fancy regexes (where
/// backtracking is used). If this limit is exceeded, execution returns an error.
/// This prevents a regex with catastrophic backtracking from running for too long.
///
/// Default is `1_000_000` (1 million).
pub backtrack_limit: Option<usize>,
/// Set the approximate size limit of the compiled regular expression.
///
/// This option is forwarded from the wrapped `regex` crate. Note that depending on the used
/// regex features there may be multiple delegated sub-regexes fed to the `regex` crate. As
/// such the actual limit is closer to `<number of delegated regexes> * delegate_size_limit`.
pub delegate_size_limit: Option<usize>,
/// Set the approximate size of the cache used by the DFA.
///
/// This option is forwarded from the wrapped `regex` crate. Note that depending on the used
/// regex features there may be multiple delegated sub-regexes fed to the `regex` crate. As
/// such the actual limit is closer to `<number of delegated regexes> *
/// delegate_dfa_size_limit`.
pub delegate_dfa_size_limit: Option<usize>,
}

/// Regex crate options
#[derive(Clone, Default)]
pub struct RegexOptions {
/// Sets the approximate size limit, in bytes, of the compiled regex.
///
/// This roughly corresponds to the amount of heap memory, in
/// bytes, occupied by a single regex. If the regex would otherwise
/// approximately exceed this limit, then compiling that regex will
/// fail.
///
/// The main utility of a method like this is to avoid compiling
/// regexes that use an unexpected amount of resources, such as
/// time and memory. Even if the memory usage of a large regex is
/// acceptable, its search time may not be. Namely, worst case time
/// complexity for search is `O(m * n)`, where `m ~ len(pattern)` and
/// `n ~ len(haystack)`. That is, search time depends, in part, on the
/// size of the compiled regex. This means that putting a limit on the
/// size of the regex limits how much a regex can impact search time.
///
/// The default for this is some reasonable number that permits most
/// patterns to compile successfully.
pub size_limit: Option<usize>,

/// Set the approximate capacity, in bytes, of the cache of transitions
/// used by the lazy DFA.
///
/// While the lazy DFA isn't always used, it tends to be the most
/// commonly used regex engine in default configurations. It tends to
/// adopt the performance profile of a fully built DFA, but without the
/// downside of taking worst case exponential time to build.
///
/// The downside is that it needs to keep a cache of transitions and
/// states that are built while running a search, and this cache
/// can fill up. When it fills up, the cache will reset itself. Any
/// previously generated states and transitions will then need to be
/// re-generated. If this happens too many times, then this library
/// will bail out of using the lazy DFA and switch to a different regex
/// engine.
///
/// If your regex provokes this particular downside of the lazy DFA,
/// then it may be beneficial to increase its cache capacity. This will
/// potentially reduce the frequency of cache resetting (ideally to
/// `0`). While it won't fix all potential performance problems with
/// the lazy DFA, increasing the cache capacity does fix some.
///
/// There is no easy way to determine, a priori, whether increasing
/// this cache capacity will help. In general, the larger your regex,
/// the more cache it's likely to use. But that isn't an ironclad rule.
/// For example, a regex like `[01]*1[01]{N}` would normally produce a
/// fully built DFA that is exponential in size with respect to `N`.
/// The lazy DFA will prevent exponential space blow-up, but its cache
/// is likely to fill up, even when it's large and even for smallish
/// values of `N`.
///
/// If you aren't sure whether this helps or not, it is sensible to
/// set this to some arbitrarily large number in testing, such as
/// `usize::MAX`. Namely, this represents the amount of capacity that
/// *may* be used. It's probably not a good idea to use `usize::MAX` in
/// production though, since it implies there are no controls on heap
/// memory used by this library during a search. In effect, set it to
/// whatever you're willing to allocate for a single regex search.
pub dfa_size_limit: Option<usize>,
}

/// Regex implementations with options
#[derive(Clone)]
pub enum RegexEngine {
FancyRegex(FancyRegexOptions),
Regex(RegexOptions),
}

impl Default for RegexEngine {
fn default() -> Self {
RegexEngine::FancyRegex(FancyRegexOptions::default())
}
}

/// Full configuration to guide the `JSONSchema` compilation.
///
/// Using a `CompilationOptions` instance you can configure the supported draft,
@@ -278,6 +379,7 @@ pub struct CompilationOptions {
validate_schema: bool,
ignore_unknown_formats: bool,
keywords: AHashMap<String, Arc<dyn KeywordFactory>>,
patterns_regex_engine: RegexEngine,
}

impl Default for CompilationOptions {
@@ -293,6 +395,7 @@ impl Default for CompilationOptions {
validate_formats: None,
ignore_unknown_formats: true,
keywords: AHashMap::default(),
patterns_regex_engine: Default::default(),
}
}
}
@@ -430,6 +533,29 @@ impl CompilationOptions {
self
}

/// Use a specific regex engine, with its options, for `pattern` keywords.
///
/// Available engines:
/// - [RegexEngine::Regex] - An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs. https://github.com/rust-lang/regex
/// - [RegexEngine::FancyRegex] - Rust library for regular expressions using "fancy" features like look-around and backreferences. https://github.com/fancy-regex/fancy-regex
///
/// Default: [RegexEngine::FancyRegex]
///
/// ```rust
/// use jsonschema::{CompilationOptions, RegexEngine, RegexOptions};
/// let mut options = CompilationOptions::default();
/// // Set the `regex`-backed engine for the `pattern` keyword
/// options.with_patterns_regex_engine(RegexEngine::Regex(RegexOptions::default()));
/// ```
pub fn with_patterns_regex_engine(&mut self, regex_engine: RegexEngine) -> &mut Self {
self.patterns_regex_engine = regex_engine;
self
}

pub(crate) fn patterns_regex_engine(&self) -> &RegexEngine {
&self.patterns_regex_engine
}

#[inline]
fn content_encoding_check_and_converter(
&self,
50 changes: 33 additions & 17 deletions jsonschema/src/keywords/format.rs
@@ -1,7 +1,6 @@
//! Validator for `format` keyword.
use std::{net::IpAddr, str::FromStr, sync::Arc};

use fancy_regex::Regex;
use once_cell::sync::Lazy;
use serde_json::{Map, Value};
use url::Url;
@@ -15,26 +14,27 @@ use crate::{
primitive_type::PrimitiveType,
validator::Validate,
Draft,
CompilationOptions,
};

static DATE_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}\z").expect("Is a valid regex"));
static IRI_REFERENCE_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r"^(\w+:(/?/?))?[^#\\\s]*(#[^\\\s]*)?\z").expect("Is a valid regex"));
static JSON_POINTER_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r"^(/(([^/~])|(~[01]))*)*\z").expect("Is a valid regex"));
static RELATIVE_JSON_POINTER_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r"^(?:0|[1-9][0-9]*)(?:#|(?:/(?:[^~/]|~0|~1)*)*)\z").expect("Is a valid regex")
static DATE_RE: Lazy<fancy_regex::Regex> =
Lazy::new(|| fancy_regex::Regex::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}\z").expect("Is a valid regex"));
static IRI_REFERENCE_RE: Lazy<fancy_regex::Regex> =
Lazy::new(|| fancy_regex::Regex::new(r"^(\w+:(/?/?))?[^#\\\s]*(#[^\\\s]*)?\z").expect("Is a valid regex"));
static JSON_POINTER_RE: Lazy<fancy_regex::Regex> =
Lazy::new(|| fancy_regex::Regex::new(r"^(/(([^/~])|(~[01]))*)*\z").expect("Is a valid regex"));
static RELATIVE_JSON_POINTER_RE: Lazy<fancy_regex::Regex> = Lazy::new(|| {
fancy_regex::Regex::new(r"^(?:0|[1-9][0-9]*)(?:#|(?:/(?:[^~/]|~0|~1)*)*)\z").expect("Is a valid regex")
});
static TIME_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(
static TIME_RE: Lazy<fancy_regex::Regex> = Lazy::new(|| {
fancy_regex::Regex::new(
r"^([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])(\.[0-9]{6})?(([Zz])|([+|\-]([01][0-9]|2[0-3]):[0-5][0-9]))\z",
).expect("Is a valid regex")
});
static URI_REFERENCE_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r"^(\w+:(/?/?))?[^#\\\s]*(#[^\\\s]*)?\z").expect("Is a valid regex"));
static URI_TEMPLATE_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(
static URI_REFERENCE_RE: Lazy<fancy_regex::Regex> =
Lazy::new(|| fancy_regex::Regex::new(r"^(\w+:(/?/?))?[^#\\\s]*(#[^\\\s]*)?\z").expect("Is a valid regex"));
static URI_TEMPLATE_RE: Lazy<fancy_regex::Regex> = Lazy::new(|| {
fancy_regex::Regex::new(
r#"^(?:(?:[^\x00-\x20"'<>%\\^`{|}]|%[0-9a-f]{2})|\{[+#./;?&=,!@|]?(?:[a-z0-9_]|%[0-9a-f]{2})+(?::[1-9][0-9]{0,3}|\*)?(?:,(?:[a-z0-9_]|%[0-9a-f]{2})+(?::[1-9][0-9]{0,3}|\*)?)*})*\z"#
)
.expect("Is a valid regex")
@@ -281,12 +281,28 @@ impl Validate for JSONPointerValidator {
}
}
}
format_validator!(RegexValidator, "regex");
struct RegexValidator {
schema_path: JSONPointer,
config: Arc<CompilationOptions>,
Owner comment: Maybe keeping the regex engine options will be enough, instead of storing the whole set of options? Thinking more about just storing the necessary minimum, rather than performance. However, since cloning RegexEngine is cheap, it could also be better performance-wise during validation (though at the cost of a pointer indirection)

}
impl RegexValidator {
pub(crate) fn compile<'a>(context: &CompilationContext) -> CompilationResult<'a> {
let schema_path = context.as_pointer_with("format");
Ok(Box::new(Self { schema_path, config: Arc::clone(&context.config) }))
}
}

impl core::fmt::Display for RegexValidator {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
concat!("format: ", "regex").fmt(f)
}
}

impl Validate for RegexValidator {
validate!("regex");
fn is_valid(&self, instance: &Value) -> bool {
if let Value::String(item) = instance {
pattern::convert_regex(item).is_ok()
pattern::convert_regex(item, self.config.patterns_regex_engine()).is_ok()
} else {
true
}
22 changes: 15 additions & 7 deletions jsonschema/src/keywords/pattern.rs
@@ -5,6 +5,8 @@ use crate::{
paths::JsonPointerNode,
primitive_type::PrimitiveType,
validator::Validate,
RegexEngine,
regex::{Regex, RegexError}
};
use once_cell::sync::Lazy;
use serde_json::{Map, Value};
@@ -18,7 +20,7 @@ static CONTROL_GROUPS_RE: Lazy<regex::Regex> =

pub(crate) struct PatternValidator {
original: String,
pattern: fancy_regex::Regex,
pattern: Regex,
schema_path: JSONPointer,
}

@@ -30,7 +32,7 @@ impl PatternValidator {
) -> CompilationResult<'a> {
match pattern {
Value::String(item) => {
let pattern = match convert_regex(item) {
let pattern = match convert_regex(item, context.config.patterns_regex_engine()) {
Ok(r) => r,
Err(_) => {
return Err(ValidationError::format(
@@ -76,11 +78,15 @@ impl Validate for PatternValidator {
}
}
Err(e) => {
let RegexError::FancyRegex(fancy_error) = e else {
unreachable!("Only fancy regex returns an error")
};

return error(ValidationError::backtrack_limit(
self.schema_path.clone(),
instance_path.into(),
instance,
e,
fancy_error,
));
}
}
@@ -104,7 +110,7 @@ impl core::fmt::Display for PatternValidator {

// ECMA 262 has differences
#[allow(clippy::result_large_err)]
pub(crate) fn convert_regex(pattern: &str) -> Result<fancy_regex::Regex, fancy_regex::Error> {
pub(crate) fn convert_regex(pattern: &str, regex_engine: &RegexEngine) -> Result<Regex, RegexError> {
// replace control chars
let new_pattern = CONTROL_GROUPS_RE.replace_all(pattern, replace_control_group);
let mut out = String::with_capacity(new_pattern.len());
@@ -143,7 +149,7 @@ pub(crate) fn convert_regex(pattern: &str) -> Result<fancy_regex::Regex, fancy_r
out.push(current);
}
}
fancy_regex::Regex::new(&out)
Regex::new(&out, regex_engine)
}

#[allow(clippy::arithmetic_side_effects)]
@@ -186,7 +192,8 @@ mod tests {
#[test_case(r"^\W+$", "1_0", false)]
#[test_case(r"\\w", r"\w", true)]
fn regex_matches(pattern: &str, text: &str, is_matching: bool) {
let compiled = convert_regex(pattern).expect("A valid regex");
let regex_engine = RegexEngine::FancyRegex(Default::default());
let compiled = convert_regex(pattern, &regex_engine).expect("A valid regex");
assert_eq!(
compiled.is_match(text).expect("A valid pattern"),
is_matching
Expand All @@ -196,7 +203,8 @@ mod tests {
#[test_case(r"\")]
#[test_case(r"\d\")]
fn invalid_escape_sequences(pattern: &str) {
assert!(convert_regex(pattern).is_err())
let regex_engine = RegexEngine::FancyRegex(Default::default());
assert!(convert_regex(pattern, &regex_engine).is_err())
}

#[test_case("^(?!eo:)", "eo:bands", false)]
6 changes: 3 additions & 3 deletions jsonschema/src/keywords/pattern_properties.rs
@@ -8,8 +8,8 @@ use crate::{
schema_node::SchemaNode,
validator::{format_validators, PartialApplication, Validate},
};
use fancy_regex::Regex;
use serde_json::{Map, Value};
use crate::regex::Regex;

pub(crate) struct PatternPropertiesValidator {
patterns: Vec<(Regex, SchemaNode)>,
@@ -26,7 +26,7 @@ impl PatternPropertiesValidator {
for (pattern, subschema) in map {
let pattern_context = keyword_context.with_path(pattern.as_str());
patterns.push((
match Regex::new(pattern) {
match Regex::new(pattern, context.config.patterns_regex_engine()) {
Ok(r) => r,
Err(_) => {
return Err(ValidationError::format(
@@ -137,7 +137,7 @@ impl SingleValuePatternPropertiesValidator {
let keyword_context = context.with_path("patternProperties");
let pattern_context = keyword_context.with_path(pattern);
Ok(Box::new(SingleValuePatternPropertiesValidator {
pattern: match Regex::new(pattern) {
pattern: match Regex::new(pattern, context.config.patterns_regex_engine()) {
Ok(r) => r,
Err(_) => {
return Err(ValidationError::format(
3 changes: 2 additions & 1 deletion jsonschema/src/lib.rs
@@ -96,8 +96,9 @@ mod resolver;
mod schema_node;
mod schemas;
mod validator;
mod regex;

pub use compilation::{options::CompilationOptions, JSONSchema};
pub use compilation::{
    options::{CompilationOptions, FancyRegexOptions, RegexEngine, RegexOptions},
    JSONSchema,
};
pub use error::{ErrorIterator, ValidationError};
pub use keywords::custom::Keyword;
pub use resolver::{SchemaResolver, SchemaResolverError};
4 changes: 2 additions & 2 deletions jsonschema/src/properties.rs
@@ -1,5 +1,5 @@
use ahash::AHashMap;
use fancy_regex::Regex;
use crate::regex::Regex;
use serde_json::{Map, Value};

use crate::{
@@ -144,7 +144,7 @@ pub(crate) fn compile_patterns<'a>(
let mut compiled_patterns = Vec::with_capacity(obj.len());
for (pattern, subschema) in obj {
let pattern_context = keyword_context.with_path(pattern.as_str());
if let Ok(compiled_pattern) = Regex::new(pattern) {
if let Ok(compiled_pattern) = Regex::new(pattern, context.config.patterns_regex_engine()) {
let node = compile_validators(subschema, &pattern_context)?;
compiled_patterns.push((compiled_pattern, node));
} else {