
Replay optimized postgres persistor fixes #403

Merged (20 commits, Feb 5, 2024)

Conversation

erikrozendaal
Member

This fixes two problems in the ReplayOptimizedPostgresPersistor:

  1. The documentation specifies that a default aggregate_id index is automatically added. Unfortunately the implementation was broken.
  2. The index used the hash value of the record keys but did not handle hash collisions at all. So when the hash function collided, records with different keys could be affected, causing data corruption.

This PR fixes both issues and (tries to) clean up the code.
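The collision problem can be shown with a minimal sketch. The names here (`HashOnlyIndex`, `CollidingKey`) are illustrative, not Sequent's actual classes; the point is only that an index keyed by the hash value alone cannot tell two colliding keys apart:

```ruby
require 'set'

# Sketch of the bug: the bucket is identified by the hash value only.
class HashOnlyIndex
  def initialize
    @buckets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def add(key, record)
    @buckets[key.hash] << record # BUG: only the hash value identifies the bucket
  end

  def find(key)
    @buckets[key.hash] # on a collision this returns records for *other* keys
  end
end

# A key class whose hash we can force, to simulate a collision.
class CollidingKey
  attr_reader :name

  def initialize(name, forced_hash)
    @name = name
    @forced_hash = forced_hash
  end

  def hash
    @forced_hash
  end

  def eql?(other)
    other.is_a?(CollidingKey) && other.name == name
  end
end

a = CollidingKey.new('alice', 42)
b = CollidingKey.new('bob', 42) # a different key with the same hash value

idx = HashOnlyIndex.new
idx.add(a, 'record-for-alice')
idx.find(b) # returns the record for 'alice', even though we asked for 'bob'
```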

The implementation did not match the documentation: no index on
`aggregate_id` was created, even when the record class supported this
attribute. The index was also not added when indexes were specified
explicitly.

Now the `aggregate_id` index is always added when supported and for
record classes where no indexes are defined it will be the only
index (instead of no indexes).
Hashing can result in collisions (the same hash value for different
input values), so hash values cannot be reliably used to index the
records.
Index fields are now always sorted so we can simply use `==` to see if
an index has the same fields as the where clause.

Avoid duplicate index lookups by having `find` return `nil` when there
is no matching index.
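A sketch of the matching strategy described above, with illustrative names (not the persistor's real API): index fields are stored sorted, matching against a where clause is a plain `==` on sorted key lists, and lookup returns `nil` when no index applies so the caller can fall back to a scan without a second lookup:

```ruby
# Illustrative sketch: fields are sorted once, at construction.
SimpleIndex = Struct.new(:fields) do
  def initialize(fields)
    super(fields.sort)
  end

  def matches?(where_clause)
    fields == where_clause.keys.sort
  end
end

# Returns the matching index, or nil when there is none.
def find_matching_index(indexes, where_clause)
  indexes.find { |index| index.matches?(where_clause) }
end

indexes = [
  SimpleIndex.new(%i[command_record_id aggregate_id]),
  SimpleIndex.new(%i[aggregate_id]),
]

find_matching_index(indexes, aggregate_id: 'a', command_record_id: 1) # matches
find_matching_index(indexes, event_type: 'MyEvent')                   # => nil
```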
```ruby
end
end
end.dup
end).dup
```
Member Author
I think this dup only needs to happen when the records are found in an index. So I propose moving this to Index#find.

Also, can it just be freeze and let the caller dup if required? Or is this too much of a trap?
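As a generic Ruby illustration of the trade-off being discussed (not the persistor's actual code): returning a frozen array protects the internal state, but turns a caller's accidental mutation into a `FrozenError`, which is the potential "trap":

```ruby
records = %w[r1 r2].freeze # imagine this is the array find_records returns

mutation_failed = false
begin
  records << 'r3' # a caller that used to rely on an implicit dup now fails
rescue FrozenError
  mutation_failed = true
end

copy = records.dup # callers that need mutation must dup explicitly
copy << 'r3'
```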

Member
I find this dup a bit weird indeed. I can't really recall why it is needed, and unfortunately can't find anything in the old commit messages. It has been there since the beginning. Since only the array is dupped, not the objects in it, I am not sure what it is supposed to protect against.

@lvonk (Member) left a comment
I found a potential issue in `find_records`.


Simply clear the hash so its configuration (`compare_by_identity`) is
preserved.
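This works because `Hash#clear` removes the entries but keeps the hash's configuration, whereas assigning a fresh hash would lose it:

```ruby
lookup = {}.compare_by_identity
lookup.clear
lookup.compare_by_identity? # => true: clearing keeps the configuration

other = {}.compare_by_identity
other = {}                  # reassignment replaces the configured hash entirely
other.compare_by_identity?  # => false
```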
Structs allow accessing attributes with the `[]` operator using both
Strings and Symbols, so there is no need to convert strings to symbols
first.

Extract Symbol to String normalization into a separate method.
The in-memory structs can easily be generated using "ordinary"
meta-programming. The only difference is that the struct classes are
anonymous since there is no longer a constant referring to the
class. This also avoids polluting the Ruby namespace.

Also add the missing `eql?` override to ensure the records are stored
correctly in the record `Set`.
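A sketch of what generating anonymous struct classes with value equality could look like (illustrative code, not Sequent's exact implementation):

```ruby
require 'set'

# Build an anonymous struct class for the given attributes; no constant is
# assigned, so the Ruby namespace is not polluted.
def build_record_struct(attributes)
  Struct.new(*attributes) do
    # eql?/hash are what Hash and Set use to detect duplicate members.
    def eql?(other)
      self.class == other.class && to_a == other.to_a
    end

    def hash
      to_a.hash
    end
  end
end

record_class = build_record_struct(%i[aggregate_id sequence_number])
a = record_class.new('agg-1', 1)
b = record_class.new('agg-1', 1)

Set.new([a, b]).size # equal records collapse to one entry
```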
Since it no longer pollutes the global namespace it no longer needs to
be a singleton and can become a simple instance variable.
Use compare_by_identity for the record sets, but still keep the
equality/hash overrides for in-memory structs for consistency with
ActiveRecord.

Remove unnecessary `dup`.
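The difference between value and identity comparison for sets can be shown directly in plain Ruby (`Set#compare_by_identity` requires Ruby 3.1+):

```ruby
require 'set'

a = ['same']
b = ['same'] # equal by value, but a distinct object

value_set = Set.new([a, b])
value_set.size # => 1: value equality merges them

identity_set = Set.new.compare_by_identity # Ruby 3.1+
identity_set << a << b
identity_set.size # => 2: distinct objects stay distinct
```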
Break up multi-column indexes into multiple single-attribute
indexes. Match a where clause against all indexes and use
set intersection to find the candidate records that will match the
where clause.
Symbols cause less GC pressure and are faster to compare (they are
immutable, and the same symbol uses the same instance).
Also ensure sets use `compare_by_identity`.
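A sketch of the intersection strategy, with illustrative data and names: each attribute has its own index, and a multi-attribute where clause is answered by intersecting the candidate sets of every indexed field:

```ruby
require 'set'

# One single-attribute index per field: field => { value => set of record ids }.
indexes = {
  aggregate_id: { 'agg-1' => Set[:r1, :r2], 'agg-2' => Set[:r3] },
  event_type:   { 'Created' => Set[:r1, :r3] },
}

# Intersect the candidate sets of every indexed field in the where clause.
# Returns nil when no field is indexed, so the caller can fall back to a scan.
def candidate_records(indexes, where_clause)
  sets = where_clause.filter_map do |field, value|
    index = indexes[field]
    index && index.fetch(value, Set.new)
  end
  return nil if sets.empty?
  sets.reduce(:&) # set intersection narrows down the candidates
end

candidate_records(indexes, aggregate_id: 'agg-1', event_type: 'Created') # => Set[:r1]
candidate_records(indexes, unindexed_field: 'x')                         # => nil
```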
```ruby
record_sets = indexes.flat_map do |field|
  if !normalized_where_clause.include? field
    []
```

```ruby
def find(record_class, normalized_where_clause)
```
Member
What does `normalized_where_clause` mean? Isn't it better to allow any valid where clause and normalize in this method?

Member
Or fail if it isn't normalized

Member Author
This is an internal API and the normalization happens in the `find_records` method.

@erikrozendaal erikrozendaal merged commit cb3f51c into master Feb 5, 2024
9 checks passed
@erikrozendaal erikrozendaal deleted the replay-optimized-postgres-persistor-fixes branch February 5, 2024 09:28