This repository has been archived by the owner on May 29, 2024. It is now read-only.

First pass at [Ruby][Mongo] validate simple filter rules #3174 #430

Closed
wants to merge 11 commits

Conversation

mchernyavskaya
Contributor

Closes https://github.com/elastic/enterprise-search-team/issues/3174

Basic high-level validation of simple filtering rules. There are a lot of edge cases and we can't cover them all, so this is a best-effort attempt to provide at least some rough coverage.

The PR also includes pre-processing of start_with and end_with rules in Mongo (the first PR for simple rules didn't cover those, as they were specified as post-processing in the implementation document, but since technically they're just a kind of regex, I didn't see why we couldn't do them in pre-processing for Mongo).
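
As an aside, a minimal sketch of the kind of translation this implies (illustrative only, not necessarily the PR's exact code): a start_with or end_with value becomes an anchored, escaped regex that the Mongo Ruby driver accepts directly as a query value.

# Illustrative sketch only: map start_with / end_with rule values to anchored
# regex filters for a MongoDB query. Regexp.escape guards against special chars.
def regex_filter(field, rule_type, value)
  escaped = Regexp.escape(value.to_s)
  case rule_type
  when 'start_with'
    { field => /^#{escaped}/ }
  when 'end_with'
    { field => /#{escaped}$/ }
  end
end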

Checklists

Pre-Review Checklist

  • Covered the changes with automated tests
  • Tested the changes locally

Related Pull Requests

For Elastic Internal Use Only

  • Considered corresponding documentation changes to contribute separately
  • New configuration settings added in this PR follow the official guidelines
  • Built gems (both connectors_utility and connectors_service), included them into Enterprise Search, and tested that Enterprise Search works well with the new gem versions. Instructions can be found here

raise "#{e.key} is required: #{e.message}"
end

def self.flex_fetch(hash, key, default_value = nil)
Member

Contributor Author

Yeah, if we do it in a monkey patch. =) But here, we don't.
Frankly, I don't see a good way to do it with the key types unknown. Moreover, even
hash[key] || hash.fetch(key.to_sym, nil) || hash.fetch(key.to_s, nil) won't work properly in the case where the value is boolean false (learned that the hard way, hence this ugly code).
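
For illustration, a minimal sketch of what such a helper could look like (an assumption, not necessarily the PR's exact implementation): it checks key presence with key? instead of relying on truthiness, so explicit false values survive.

# Hypothetical sketch: fetch by string or symbol key without relying on
# truthiness, so an explicit `false` value is returned rather than the default.
def self.flex_fetch(hash, key, default_value = nil)
  return hash[key] if hash.key?(key)
  return hash[key.to_sym] if key.respond_to?(:to_sym) && hash.key?(key.to_sym)
  return hash[key.to_s] if hash.key?(key.to_s)
  default_value
end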

Member

When resolving merge conflicts, I convinced myself we didn't need this. We can just rely on string keys.

@rule = SimpleRule.flex_fetch(rule_hash, RULE)
@value = SimpleRule.flex_fetch(rule_hash, VALUE)
@id = SimpleRule.flex_fetch(rule_hash, ID)
@order = SimpleRule.flex_fetch(rule_hash, ORDER, 0)
Member

If we're going to add order here, I suggest we require it, and not provide a default value. Otherwise we risk non-deterministic ordering in specs if we don't have an explicit order field defined and they all get :order => 0 from this default.

Contributor Author

My thinking was that if we don't have order, then the ordering probably doesn't matter (like it doesn't now for pre-processing) and we can safely assume some default value. But you're probably right.

@@ -11,23 +11,144 @@

module Connectors
module Base
class FilteringRulesValidationError < StandardError; end
Member

should this go in errors.rb?

def initialize(rules)
@rules = (rules || []).map(&:with_indifferent_access).filter { |r| r[:id] != 'DEFAULT' }.sort_by { |r| r[:order] }
begin
sorted = (rules || []).map { |r| SimpleRule.new(r) }.filter { |r| r.id != 'DEFAULT' }.sort_by(&:order)
Member

I like this change. I'd avoided it because of the order requirement, but you've solved that now.
Should we unify this sorting with the one done by the PostProcessingEngine to DRY up our logic?
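
One hypothetical way to DRY that up (the module and method names below are illustrative, not from the PR): extract the shared normalization into a helper that both the pre-processing parser and the PostProcessingEngine call.

# Illustrative helper (not part of the PR): build SimpleRule objects,
# drop the DEFAULT rule, and sort by order in one shared place.
module RulesOrdering
  def self.sorted_rules(rule_hashes)
    (rule_hashes || [])
      .map { |hash| SimpleRule.new(hash) }
      .reject { |rule| rule.id == 'DEFAULT' }
      .sort_by(&:order)
  end
end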

Comment on lines 72 to 79
# # check for overlapping ranges
# ranges = field_rules.filter { |r| r[:rule] == '>' || r[:rule] == '<' }
# ranges.each_with_index do |r, i|
# next if i == ranges.size - 1
# next_r = ranges[i + 1]
# if r[:value] == next_r[:value]
# raise FilteringRulesValidationError.new("Contradicting rules for field: #{field}. Can't have overlapping ranges.")
# end
Member

should this block be removed?

Contributor Author

Yeah, it should; it's WIP to document the thought process. Frankly, the ranges feel so cumbersome to validate... there are just too many cases.

lib/connectors/base/simple_rules_parser.rb (resolved)
lib/connectors/mongodb/mongo_rules_parser.rb (resolved)
spec/core/filtering/simple_rule_spec.rb (resolved)
mchernyavskaya and others added 3 commits November 15, 2022 18:40
https://github.com/elastic/enterprise-search-team/issues/3174
added some range pre-validation
some of the processed cases aren't covered by tests (drop_invalid_range_rules)
…ring-rules

# Conflicts:
#	lib/core/filtering/post_process_engine.rb
#	lib/core/filtering/simple_rule.rb
#	spec/connectors/mongodb/mongo_rules_parser_spec.rb
@seanstory
Member

@mchernyavskaya I failed to get this ready to merge. It took me too long to figure out how to resolve the merge conflicts (I couldn't rebase your branch, and I still don't really understand why), and I'm just out of time. I think we'll just need to save this issue for 8.7 :(

@mchernyavskaya
Contributor Author

@seanstory no worries. Taking this out of the DM: I feel like I'm going down a rabbit hole with these filtering rules (and specifically, with validation). I'm second-guessing this pre- and post-processing model more and more, so maybe that's something we should revisit.

@artem-shelkovnikov
Member

Checked the PR and spoke to @mchernyavskaya about it. It indeed feels like a rabbit hole: the number of possible combinations of invalid rules is potentially very high, and the complexity of validating them is enormous.

I think the same problem can probably be approached from a different perspective: let users enter any rules they want, but provide good logging and tracing to show what query was generated (and maybe even why), and why the expected data was not ingested.

It's probably still good to validate each individual rule for sanity purposes, though - for example, that a RANGE filter is actually [X; Y] and not [-INF, X], [Y; +INF].
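
A minimal sketch of that kind of per-rule check (the 'from'/'to' field names here are assumed for illustration, not taken from the PR):

# Hypothetical per-rule sanity check: a range [from; to] is only valid when
# the lower bound does not exceed the upper bound.
def validate_range!(rule)
  from = rule['from']
  to = rule['to']
  if from && to && from > to
    raise FilteringRulesValidationError.new("Invalid range [#{from}; #{to}] for rule #{rule['id']}")
  end
end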

@mchernyavskaya
Contributor Author

@artem-shelkovnikov yep - I also proposed something like this in the Ingestion Sync agenda; I want it at least to be up for discussion.

@seanstory
Member

I believe we can close this now, as it was implemented in a separate PR.
