-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(i): Rewrite document fetcher #3279
base: develop
Are you sure you want to change the base?
refactor(i): Rewrite document fetcher #3279
Conversation
It hasn't been used in 3 years, it can die and come back if/when needed.
d85d035
to
f03a867
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3279 +/- ##
===========================================
+ Coverage 77.95% 78.04% +0.09%
===========================================
Files 382 388 +6
Lines 35364 35298 -66
===========================================
- Hits 27568 27548 -20
+ Misses 6148 6120 -28
+ Partials 1648 1630 -18
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 19 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
// deleted is a fetcher that orchastrates the fetching of deleted and active documents. | ||
type deleted struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Why not rename to deletedFetcher
, I know the package name is fetcher
but still prefer the more descriptive name especially as you also have newDeletedFetcher
function.
// deleted is a fetcher that orchastrates the fetching of deleted and active documents. | |
type deleted struct { | |
// deleted is a fetcher that orchastrates the fetching of deleted and active documents. | |
type deletedFetcher struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike fetcher.deletedFetcher
now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fred and I have rather similar views on this kind of naming stuff and although I'm not speaking for him, I do prefer the explicitness of deletedFetcher
here. deleted
is just too generic of a term, same with document
which given the context of the package could just as easily refer to an actual document.
As for the import/package full naming (ie: fetcher.deletedFetcher
) this is also OK and not against any language idioms or best practises, because the alternative (fetcher.deleted
) is not a clear enough naming scheme (ignoring the fact that this isnt a public type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike
fetcher.deletedFetcher
now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.
Why newDeletedFetcher
over newDeleted
then also haha? The only reason deleted
is easy to understand for us is because we have context. A new developer just stumbling on this file even though the package is called fetcher
might not grasp what a general term like deleted
is referring to here, let alone what a "deleted fetcher" is.
I do understand that this is subjective so happy to follow consensus. I do think I am bias to longer and descriptive names haha my ideal would definitely be deletedAndActiveDocumentFetcher
or activeDocumentAndDeletedFetcher
for this and activeDocumentFetecher
for the other one that is currently document
@@ -362,7 +362,7 @@ func TestQueryWithIndexOnOneToOnePrimaryRelation_IfFilterOnIndexedFieldOfRelatio | |||
// we make 3 index fetch to get the 3 address with city == "Montreal" | |||
// then we scan all 10 users to find one with matching "address_id" for each address | |||
// after this we fetch the name of each user | |||
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(33).WithIndexFetches(3), | |||
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(60).WithIndexFetches(3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Whats the context on field fetches increases, is it because all fields are read now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted in the description. It is a large increase due to the lack of indexing (and other potential optimisations), so we end up doing a full scan for every parent record - field count here has gone from (3*10*1+(1*3)) to (3*10*2) (note if you rename address
to baddress
on develop branch you get the same result).
// document is the type responsible for fetching documents from the datastore. | ||
// | ||
// It does not filter the data in any way. | ||
type document struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Similar renaming suggestion as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above, I don't think document
is the best naming here, too generic/general and could be interpreted in a myriad of ways (which I initially did).
In addition to my above comment, from a purely gramatical POV, the fetcher
is a verb indicating action of some kind, which makes sense since the fetchers are a system of pulling and processing key values into a consistent output form. Where as document
is a noun and indicates a static object. (This grammar argument isn't my primary concern with the naming, just an observation)
reviewing now |
"github.com/sourcenetwork/defradb/internal/keys" | ||
) | ||
|
||
// document is the type responsible for fetching documents from the datastore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question/suggestion: Is this only for active documents? or can work for both?
Perhaps this suggestion if only for active:
// document is the type responsible for fetching documents from the datastore. | |
// document is the type responsible for fetching only active documents from the datastore. |
Relevant issue(s)
Resolves #3277
Description
Rewrites the document fetcher.
Motivation
The old document fetcher had grown over the last few years and had IMO turned into a bit of a mess with a large amount of mutable state, and a lot of different concepts blurred into the same type where they were interleaved amongst each other that made maintaining the file pretty horrible.
Picking up #3275 motivated me to make this change now, so I can integrate the new package into something that I (hopefully we) can read.
What has not been done
The Fetcher interface has not changed (much, see first commit) - it would involve touching more parts of the codebase and would be a lot more hassle, it would have minimal impact on introducing CoreKV.
What has been done
The old code has been deleted and replaced with something written from scratch. Please review each line as if it is brand new code, complain about anything you like. Please be quite critical, especially if you actually liked the old fetcher code.
Behavioural changes
Only one significant change has been made to the behaviour.
Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read.
The old behaviour makes very little sense to me, as best case scenario, only fields that had names lexicographically larger than the last filter field for the last document in the prefix would have been avoided being scanned. This is quite tiny, and the code required to do so is almost certainly more computationally expensive than the saving (across the database's lifetime). Note: the primary side of relations gained the most from this, until we add automatic indexing of relations.
Even once CoreKV is introduced, with its
seek
functionality, skipping past those key-values would require the iterator to seek backwards and forwards (or have two iterators, with extra seeking), and the savings would likely be minimal at best.Medium-long term this means that doing any field-based filtering makes little sense, and we might as well do it outside of the fetcher. This is more hassle than it is worth to change in this PR, so, in order to keep a manageable scope to this PR a
filtered
fetcher has been implemented in order to host fetcher-level filtering. Hopefully we can remove that too soon.