-
Notifications
You must be signed in to change notification settings - Fork 551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prep DataModel for removal #1102
base: main
Are you sure you want to change the base?
Conversation
It isn't used publicly anywhere, and it adds more API surface area to worry about. This is still backwards compatible, as self._len is still saved/restored byt pickle the same way, now we just access it directly instead of going via __len__().
We iterate through the input more than once, so an Iterable is insufficient.
Before, we never checked for this, so when we called list.index() we just got back the first instance. I swapped us over to a lookup table because that makes more sense, but I wanted to also add this check because if someone had duplicate names, this would quitely change the behavior underneath someone: before it gave the first index, now it gives the last.
These variables 1. you can't currently create them because they inherit from Variable, not FieldVariable, so they don't appear in the datamodel.VARIABLE_CLASSES lookup table 2. You SHOULDN'T be able to instantiate these variables directly from a variable definition
Before the flow was: 1. create all the primary variables EXCEPT for the interactions 2. Expand those. 3. go back through and NOW create the InteractionVariables I found this confusing. We parse the var definitions twice in separate places, and InteractionVariables are a special case in the first pass. The other point of this is to make it so that there is one list of Variable instances which is the single source of truth. Once we do this then we can turn all the other private instance variables of DataModel into @functools.cached_property's, which will make it much easier to ensure pickle compatibility in the future, because only one variable needs to get saved and restored
all_variables.append(variable_object) | ||
|
||
if only_custom: | ||
no_blocking_variables = all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing from #1098 (comment)
I think it would be better if this was more duck-typed though, such as
no_blocking_variables = not any(len(v.predicates) for v in variables)
.
That would be a slight change though, since Variable.predicates
is filled with an ExistsPredicate
for any var def with the has missing
flag. Perhaps this is OK? But I don't think so.
I think really there needs to be a more official API where variables declare the predicates they export.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.
https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37
if we move that code, then we could do pretty much what you are saying.
no_blocking_variables = not any(len(v.predicates) for v in variables)
we might also want to have a check like
only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
warning.warn('You are only using the Exists Predicate which is usually pretty bad')
all_variables += interactions(variable_definitions, self.primary_variables) | ||
variables = typify_variables(variable_definitions) | ||
non_interactions: list[FieldVariable] = [ | ||
v for v in variables if not isinstance(v, InteractionType) # type: ignore[misc] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another case where I think a Variable should have a more official API for if they export if they are an Interaction or not. Maybe should think of a better name for this too. Atomic vs derived? leaf vs nonleaf? base vs derived?
no_blocking_variables = all( | ||
isinstance(v, (CustomType, InteractionType)) for v in variables | ||
) | ||
if no_blocking_variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this check should happen in DataModel.init, and not in this utility function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems of a piece with the "Incorrect variable specification: variable" error?
var_d = {var.name: var for var in primary_variables} | ||
|
||
def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]: | ||
field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this should feels like it needs a more official API, or should be responsibility of the InteractionType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.
- isolating the interaction features for us, so we can put them in the right place in the sequence of features.
- actually resolving the feature indices.
seems like we kind of need to do both.
all_variables.append(variable_object) | ||
|
||
if only_custom: | ||
no_blocking_variables = all( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.
https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37
if we move that code, then we could do pretty much what you are saying.
no_blocking_variables = not any(len(v.predicates) for v in variables)
we might also want to have a check like
only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
warning.warn('You are only using the Exists Predicate which is usually pretty bad')
no_blocking_variables = all( | ||
isinstance(v, (CustomType, InteractionType)) for v in variables | ||
) | ||
if no_blocking_variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems of a piece with the "Incorrect variable specification: variable" error?
result.extend(variable.higher_vars) | ||
else: | ||
result.append(variable) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we added an iter method to Variable so that we could just do something like.
features = []
for variable in variables:
for feature in variable:
features.append(feature)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea in general, but I would want to name it to differentiate between predicates and features, since Variables supply both. Perhaps Variable.features
?
var_d = {var.name: var for var in primary_variables} | ||
|
||
def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]: | ||
field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.
- isolating the interaction features for us, so we can put them in the right place in the sequence of features.
- actually resolving the feature indices.
seems like we kind of need to do both.
@@ -58,21 +58,16 @@ def all_subclasses( | |||
|
|||
|
|||
class DerivedType(Variable): | |||
type = "Derived" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong but my understanding is:
This would only be used if someone passes in a variable definition like {"type": "Derived"}
, and then we would create an instance of DerivedType, but that doesn't make sense because DerivedType is an abstract base class and shouldn't ever be created directly.
def __init__(self, definition: VariableDefinition): | ||
self.name = "(%s: %s)" % (str(definition["name"]), str(definition["type"])) | ||
super(DerivedType, self).__init__(definition) | ||
|
||
|
||
class MissingDataType(Variable): | ||
type = "MissingData" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as for DerivedType
thanks for working through this @NickCrews. this is some of the first code i wrote for project and there's a lot i didn't know! |
Split off from #1098
This makes it easier to create Variable instances from variable definitions, because Variables are responsible for both predicates and distance functions. So, if we split DataModel into these two separate parts, we're going to need to do something like
so it would be important that we can create our variable instances once, and then pass those around to the other parts.