
refactor(i): Rewrite document fetcher #3279

Merged
Merged 6 commits into sourcenetwork:develop on Dec 16, 2024

Conversation

@AndrewSisley (Contributor) commented Nov 28, 2024

Relevant issue(s)

Resolves #3277

Description

Rewrites the document fetcher.

Motivation

The old document fetcher had grown over the last few years and had, IMO, turned into a bit of a mess: a large amount of mutable state, and a lot of different concepts blurred into the same type and interleaved with each other, which made maintaining the file pretty horrible.

Picking up #3275 motivated me to make this change now, so I can integrate the new package into something that I (hopefully we) can read.

What has not been done

The Fetcher interface has not changed (much — see the first commit). Changing it would involve touching more parts of the codebase and would be a lot more hassle, while having minimal impact on introducing CoreKV.

What has been done

The old code has been deleted and replaced with something written from scratch. Please review each line as if it were brand-new code, and complain about anything you like. Please be quite critical, especially if you actually liked the old fetcher code.

Behavioural changes

Only one significant change has been made to the behaviour.

Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read.

The old behaviour makes very little sense to me: in the best case, only fields with names lexicographically larger than the last filter field of the last document in the prefix would have avoided being scanned. That saving is tiny, and the code required to achieve it is almost certainly more computationally expensive than the saving itself (across the database's lifetime). Note: the primary side of relations gained the most from this, until we add automatic indexing of relations.

Even once CoreKV is introduced, with its seek functionality, skipping past those key-values would require the iterator to seek backwards and forwards (or have two iterators, with extra seeking), and the savings would likely be minimal at best.

Medium to long term, this means that doing any field-based filtering in the fetcher makes little sense, and we might as well do it outside the fetcher. That change is more hassle than it is worth in this PR, so, in order to keep the scope manageable, a filtered fetcher has been implemented to host fetcher-level filtering. Hopefully we can remove that too, soon.
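The filtered fetcher described above can be pictured as a thin decorator over a stream of fully-read documents: the filter only ever sees complete documents, mirroring the new "filter after all selected fields are read" behaviour. A minimal sketch with hypothetical types (`doc`, `docIter`, `filtered` are illustrative names, not the actual DefraDB API):

```go
package main

import "fmt"

// doc stands in for a fully-fetched document (all selected fields read).
type doc map[string]any

// docIter is a pull-style iterator: returns the next doc, and false when done.
type docIter func() (doc, bool)

// sliceIter adapts a slice of docs into a docIter, for demonstration.
func sliceIter(docs []doc) docIter {
	i := 0
	return func() (doc, bool) {
		if i >= len(docs) {
			return nil, false
		}
		d := docs[i]
		i++
		return d, true
	}
}

// filtered wraps an inner iterator and yields only matching documents.
// The predicate sees the whole document: filtering happens only after
// all selected fields have been read, as in the new behaviour.
func filtered(next docIter, match func(doc) bool) docIter {
	return func() (doc, bool) {
		for {
			d, ok := next()
			if !ok {
				return nil, false
			}
			if match(d) {
				return d, true
			}
		}
	}
}

func main() {
	docs := []doc{{"name": "Alice", "age": 30}, {"name": "Bob", "age": 20}}
	it := filtered(sliceIter(docs), func(d doc) bool { return d["age"].(int) >= 25 })
	for d, ok := it(); ok; d, ok = it() {
		fmt.Println(d["name"]) // prints "Alice"
	}
}
```

Because the decorator composes over any `docIter`, the same filtering stage can later be lifted outside the fetcher entirely, which is the stated medium-term goal.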

Todo

  • Make sure to wait for the release to be cut before merging

@AndrewSisley AndrewSisley added area/query Related to the query component refactor This issue specific to or requires *notable* refactoring of existing codebases and components code quality Related to improving code quality labels Nov 28, 2024
@AndrewSisley AndrewSisley added this to the DefraDB v0.15 milestone Nov 28, 2024
@AndrewSisley AndrewSisley requested a review from a team November 28, 2024 20:27
@AndrewSisley AndrewSisley self-assigned this Nov 28, 2024

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 79.44162% with 81 lines in your changes missing coverage. Please review.

Project coverage is 78.07%. Comparing base (6da615e) to head (791bb79).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
internal/db/fetcher/wrapper.go 78.95% 13 Missing and 7 partials ⚠️
internal/db/fetcher/document.go 85.85% 10 Missing and 5 partials ⚠️
internal/db/fetcher/multi.go 74.14% 10 Missing and 5 partials ⚠️
internal/db/fetcher/prefix.go 75.81% 10 Missing and 5 partials ⚠️
internal/db/fetcher/filtered.go 70.00% 6 Missing and 3 partials ⚠️
internal/db/fetcher/permissioned.go 82.86% 4 Missing and 2 partials ⚠️
internal/db/fetcher/versioned.go 80.00% 1 Missing ⚠️
Additional details and impacted files


@@             Coverage Diff             @@
##           develop    #3279      +/-   ##
===========================================
- Coverage    78.20%   78.07%   -0.13%     
===========================================
  Files          382      388       +6     
  Lines        35448    35398      -50     
===========================================
- Hits         27720    27636      -84     
- Misses        6102     6127      +25     
- Partials      1626     1635       +9     
Flag Coverage Δ
all-tests 78.07% <79.44%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
internal/db/collection.go 70.35% <100.00%> (+0.47%) ⬆️
internal/db/collection_get.go 79.31% <100.00%> (ø)
internal/db/collection_index.go 87.50% <ø> (-0.03%) ⬇️
internal/db/fetcher/fetcher.go 100.00% <ø> (+22.10%) ⬆️
internal/db/fetcher/indexer.go 83.69% <ø> (-0.11%) ⬇️
internal/lens/fetcher.go 71.50% <ø> (-0.13%) ⬇️
internal/planner/scan.go 89.88% <100.00%> (-0.04%) ⬇️
internal/db/fetcher/versioned.go 76.80% <80.00%> (-3.42%) ⬇️
internal/db/fetcher/permissioned.go 82.86% <82.86%> (ø)
internal/db/fetcher/filtered.go 70.00% <70.00%> (ø)
... and 4 more

... and 17 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6da615e...791bb79.

Comment on lines 19 to 20
// deleted is a fetcher that orchastrates the fetching of deleted and active documents.
type deleted struct {
@shahzadlone (Member) commented Nov 29, 2024

suggestion: Why not rename it to deletedFetcher? I know the package name is fetcher, but I still prefer the more descriptive name, especially as you also have a newDeletedFetcher function.

Suggested change
// deleted is a fetcher that orchastrates the fetching of deleted and active documents.
type deleted struct {
// deleted is a fetcher that orchastrates the fetching of deleted and active documents.
type deletedFetcher struct {

@AndrewSisley (Contributor Author)

I dislike fetcher.deletedFetcher now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.

(Member)

Fred and I have rather similar views on this kind of naming and, although I'm not speaking for him, I do prefer the explicitness of deletedFetcher here. deleted is just too generic a term, same with document, which, given the context of the package, could just as easily refer to an actual document.

As for the full package-qualified naming (i.e. fetcher.deletedFetcher), this is also OK and not against any language idioms or best practices, because the alternative (fetcher.deleted) is not a clear enough naming scheme (ignoring the fact that this isn't a public type).

@shahzadlone (Member) commented Dec 2, 2024

I dislike fetcher.deletedFetcher now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.

Why newDeletedFetcher over newDeleted then, haha? The only reason deleted is easy to understand for us is because we have context. A new developer just stumbling on this file, even though the package is called fetcher, might not grasp what a general term like deleted is referring to here, let alone what a "deleted fetcher" is.

I do understand that this is subjective, so I'm happy to follow consensus. I do think I am biased towards longer, descriptive names haha; my ideal would definitely be deletedAndActiveDocumentFetcher or activeDocumentAndDeletedFetcher for this, and activeDocumentFetcher for the other one that is currently document.

(Contributor)

I don't like either deleted or deletedFetcher. The documentation for the struct is not deletion-specific; it covers both deleted and active documents. If the main thing this fetcher does is orchestrating something, let's call it orchestratingFetcher.

It's also not clear, just by looking at its fields, why deleted includes a deletedFetcher.

Another question: who deleted this fetcher? :D

@AndrewSisley (Contributor Author) commented Dec 11, 2024

I don't like either deleted or deletedFetcher. The documentation for the struct is not deletion-specific; it covers both deleted and active documents. If the main thing this fetcher does is orchestrating something, let's call it orchestratingFetcher.

I can remove references to the deleted-ness of that type. It really just manages the fetching of documents across multiple statuses; nothing there is delete-specific. Abstracting the naming will make the code a little more obscure IMO, but it is probably more correct.

New naming for it will probably be statusFetcher or similar.

If anyone objects to this change please let me know :)

  • Remove deleted-ness of deleted fetcher

@AndrewSisley (Contributor Author) commented Dec 11, 2024

mainFetcher or docFetcher

:) I don't see myself being able to call it mainFetcher, and having a docFetcher and a documentFetcher in the same code-space is quite unpleasant IMO. I might tweak the wrapper to wrappingFetcher or bridgeFetcher/tempFetcher if no better names pop up in the meantime.

  • Rename wrapper

@AndrewSisley (Contributor Author) commented Dec 11, 2024

Fetcher suffix

People are in favour of this, so I'll add the suffix to all.

  • Add Fetcher suffix to type names.

(Collaborator)

docFetcher and a documentFetcher in the same code-space is quite unpleasant IMO

docFetcher == documentFetcher. There is no duplication intended :)

@AndrewSisley (Contributor Author) commented Dec 11, 2024

docFetcher == documentFetcher. There is no duplication intended :)

Ah! I thought that was a rename of wrapper (because of it being at the top, and the main suggestion) 😁

You have no problem with wrapper's current name?

@@ -362,7 +362,7 @@ func TestQueryWithIndexOnOneToOnePrimaryRelation_IfFilterOnIndexedFieldOfRelatio
// we make 3 index fetch to get the 3 address with city == "Montreal"
// then we scan all 10 users to find one with matching "address_id" for each address
// after this we fetch the name of each user
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(33).WithIndexFetches(3),
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(60).WithIndexFetches(3),
(Member)

question: What's the context on the field-fetch increases? Is it because all fields are read now?

@AndrewSisley (Contributor Author) commented Nov 29, 2024

Noted in the description. It is a large increase due to the lack of indexing (and other potential optimisations), so we end up doing a full scan for every parent record. The field count here has gone from (3*10*1 + (1*3)) to (3*10*2). (Note: if you rename address to baddress on the develop branch you get the same result.)
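For concreteness, the arithmetic in the comment above can be checked directly: 3 matching addresses, each triggering a full scan of 10 users. Under the old behaviour only the filter field was read per user, plus the selected field for the 3 matches; under the new behaviour both fields are read for every user on every scan. (The variable names below are illustrative, not from the codebase.)

```go
package main

import "fmt"

func main() {
	addresses, users := 3, 10

	// Old behaviour: one filter field (address_id) read per user per scan,
	// plus the selected field (name) for each of the 3 matching users.
	oldFieldFetches := addresses*users*1 + 1*addresses

	// New behaviour: both fields (address_id and name) read for every user
	// on each of the 3 full scans.
	newFieldFetches := addresses * users * 2

	fmt.Println(oldFieldFetches, newFieldFetches) // prints "33 60"
}
```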

(Contributor)

I missed the part in the description explaining why number of fields fetched increased.

Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read.

Is it this one?

And what do you mean by "lack of indexing"?

I still don't understand why it fetches 6 fields for every User doc.

@AndrewSisley (Contributor Author)

I still don't understand why it fetches 6 fields for every User doc.

It doesn't. It fetches 2 fields per User doc, it is just that each doc is fetched 3 times (not new, see my calc). This is because of the lack of indexing/other-optimizations that I mentioned - a full scan is performed for each Address document.

(Contributor)

Yes, you mentioned "lack of indexing", and I asked afterwards what you meant; you haven't answered yet.

It's clear why we do a full scan of primary docs for every related doc, but it's not clear why you need to fetch 2 fields for every primary doc. The request contains only name in the selected fields.
What it did before: for every primary doc we would read address_id to compare it with the related docs we found using the index. That's already 30: 10 reads for each of the 3 docs. Then for every matching primary doc we would additionally read name. So +3.

I assume the answer lies somewhere here "Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read."
Which for this context I interpret as the following: while doing full scan we always fetch 2 fields: address_id to match against the related doc and name because it's in the selection set. In this case I wonder if we really have to read the selected fields before filter.

Sorry, I'm just trying to understand the new behaviour. Not trying to be picky.

@AndrewSisley (Contributor Author) commented Dec 12, 2024

You didn't answer yet.

Ah, I thought you had missed my earlier (small) expansion :) I'll expand further.

Currently a full scan is done for every parent. This is not ideal. Indexing the child-to-parent side would remove the need for this (I thought we had already discussed doing that as a team); it would convert the full table scan into a point lookup. At the moment I think users can already achieve this by explicitly declaring the index from the secondary side.

There are also other, non-index optimisations we could do here long term that would at least let us scan child records once for all parents (instead of a full scan for every parent) if we don't have an index.

It also looks like it doesn't take advantage of the fact that it is a one-to-one relation, and keeps fetching after the first doc is found (a planner problem, not a fetcher one, IMO).

but it's not clear why you need to fetch 2 fields for every primary doc

Two fields are required: one for the filter, one for the select. The develop branch does this anyway; if you rename address to oddress (or anything lexicographically larger than name) it will fetch 60 fields. The 33 field count on develop is also something of a lie at the moment: develop currently reads 60 fields (value and key) from disk anyway, so no IO is saved; it just skips a small amount of Defra Go code for the other 27 fields that are discarded.

Once we integrate the CoreKV lib into Defra, with its exposed seek logic, we can actually rework the many performance deficiencies in the way the fetcher handles fields (the old and current fetchers both work in roughly the same horrible way here).

Once CoreKV is integrated, and the other nastiness reworked, we can actually measure the effect of skipping the extra IO of fields lexicographically larger than the largest filter field. This is where John/Fred disagree with me re: performance. IO can actually be really quite cheap, and I think there are reasonable odds that the logic required to skip those properties would have either a negative or minimal effect on overall system performance. Once we have measured the two approaches (simple vs seek) we can decide whether we want to take on the maintenance cost of the seek-based solution.

(Contributor)

Okay, I see. Thanks for the detailed explanation, it was very helpful.

@jsimnz (Member) commented Dec 2, 2024

reviewing now

"github.com/sourcenetwork/defradb/internal/keys"
)

// document is the type responsible for fetching documents from the datastore.
(Member)

question/suggestion: Is this only for active documents, or can it work for both?

Perhaps this suggestion, if it is only for active:

Suggested change
// document is the type responsible for fetching documents from the datastore.
// document is the type responsible for fetching only active documents from the datastore.

@AndrewSisley (Contributor Author)

This fetches all documents. Active and deleted.

@fredcarle (Collaborator) left a comment

There is going to be a lot of refactoring needed when integrating core-kv but I understand this is a change in the right direction. Looking forward to the follow-up PR.

Make sure the other comments are resolved before merging.

Comment on lines 48 to 75
func NewDocumentFetcher() Fetcher {
return &wrapper{}
}

func (f *wrapper) Init(
ctx context.Context,
identity immutable.Option[acpIdentity.Identity],
txn datastore.Txn,
acp immutable.Option[acp.ACP],
col client.Collection,
fields []client.FieldDefinition,
filter *mapper.Filter,
docMapper *core.DocumentMapping,
showDeleted bool,
) error {
f.identity = identity
f.txn = txn
f.acp = acp
f.col = col
f.fields = fields
f.filter = filter
f.docMapper = docMapper
f.showDeleted = showDeleted

return nil
}

func (f *wrapper) Start(ctx context.Context, prefixes ...keys.Walkable) error {
(Collaborator)

thought: I know why you're doing this but the pattern New -> Init -> Start... 🤮 lol. Hopefully this will be changed soon.

@AndrewSisley (Contributor Author)

I really dislike this pattern, and we do the same in planner too. Both need to die.
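One hypothetical alternative to the New -> Init -> Start dance (a sketch under assumed names, not a proposal for the actual DefraDB API) is a single constructor taking functional options, so the value is fully initialised the moment it exists and no partially-built states are observable:

```go
package main

import "fmt"

// fetcher is a stand-in type; the real Fetcher carries txn, acp, filter, etc.
type fetcher struct {
	showDeleted bool
	fields      []string
}

// Option mutates the fetcher during construction only.
type Option func(*fetcher)

// WithDeleted enables fetching of deleted documents.
func WithDeleted() Option { return func(f *fetcher) { f.showDeleted = true } }

// WithFields restricts the selection set.
func WithFields(fs ...string) Option { return func(f *fetcher) { f.fields = fs } }

// newFetcher returns a fully-initialised, ready-to-iterate fetcher:
// no separate Init or Start call, so no half-built intermediate states.
func newFetcher(opts ...Option) *fetcher {
	f := &fetcher{}
	for _, o := range opts {
		o(f)
	}
	return f
}

func main() {
	f := newFetcher(WithDeleted(), WithFields("name", "age"))
	fmt.Println(f.showDeleted, len(f.fields)) // prints "true 2"
}
```

The trade-off is that everything Init/Start currently needs (context, transaction, prefixes) must be available at construction time, which is presumably why the three-phase pattern exists today.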

}
}

func (f *deleted) NextDoc() (immutable.Option[string], error) {
(Collaborator)

praise: I wasn't sure initially when I looked at this implementation but now I really like it! Nice simple way to keep the docs in order.
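The orchestration being praised here behaves like a two-way ordered merge: whichever of the active and deleted streams currently holds the smaller docID yields next. A minimal standalone sketch of that shape (hypothetical, using sorted slices in place of the actual fetchers):

```go
package main

import "fmt"

// mergeOrdered merges two already-sorted docID streams into one sorted
// stream, always yielding the smaller head first: the same shape as
// choosing between the active and deleted fetchers by current docID.
func mergeOrdered(active, deleted []string) []string {
	out := make([]string, 0, len(active)+len(deleted))
	for len(active) > 0 && len(deleted) > 0 {
		if active[0] <= deleted[0] {
			out = append(out, active[0])
			active = active[1:]
		} else {
			out = append(out, deleted[0])
			deleted = deleted[1:]
		}
	}
	out = append(out, active...)
	return append(out, deleted...)
}

func main() {
	fmt.Println(mergeOrdered(
		[]string{"bae-1", "bae-4"},
		[]string{"bae-2", "bae-3"},
	)) // prints "[bae-1 bae-2 bae-3 bae-4]"
}
```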

@islamaliev (Contributor) left a comment

Overall this looks really great. I had a hard time comprehending what was happening in the old code; this one is well-structured and easy to understand. Thanks Andy.

I just have a serious concern about the number of fields that the executor is reporting now. I hope we can clarify/solve it before this is merged.

Comment on lines 21 to 27
activeFetcher fetcher
activeDocID immutable.Option[string]

deletedFetcher fetcher
deletedDocID immutable.Option[string]

currentFetcher fetcher
(Contributor)

suggestion: it would be nice to have per-field docs explaining the differences between the fetchers, especially between activeFetcher and currentFetcher

@AndrewSisley (Contributor Author) commented Dec 11, 2024

Not a problem - I'll add them.

  • Doc deleted fetcher props

Comment on lines +39 to +46
// The most recently yielded item from kvResultsIter.
currentKV keyValue
// nextKV may hold a datastore key value retrieved from kvResultsIter
// that was not yet ready to be yielded from the instance.
//
// When the next document is requested, this value should be yielded
// before resuming iteration through the kvResultsIter.
nextKV immutable.Option[keyValue]
(Contributor)

praise: great comments. Thanks
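The currentKV/nextKV pair documented above is a one-item lookahead buffer: iterating a flat key-value stream sometimes reads one KV past the current document's boundary, and that KV must be replayed first on the next request. A generic sketch of the pattern (hypothetical names, strings in place of real KVs):

```go
package main

import "fmt"

// lookahead wraps a pull iterator with a single-slot pushback buffer.
type lookahead struct {
	next   func() (string, bool)
	buffer *string // a value read too early, yielded before resuming next()
}

// Next yields the buffered value first, if any, then resumes iteration.
func (it *lookahead) Next() (string, bool) {
	if it.buffer != nil {
		v := *it.buffer
		it.buffer = nil
		return v, true
	}
	return it.next()
}

// Push stashes a value that was read past a boundary so the following
// Next call replays it, mirroring the nextKV field above.
func (it *lookahead) Push(v string) { it.buffer = &v }

func main() {
	vals := []string{"k1", "k2"}
	i := 0
	it := &lookahead{next: func() (string, bool) {
		if i >= len(vals) {
			return "", false
		}
		v := vals[i]
		i++
		return v, true
	}}
	v, _ := it.Next() // read one KV...
	it.Push(v)        // ...notice it crossed a boundary, put it back
	v2, _ := it.Next()
	fmt.Println(v, v2) // prints "k1 k1"
}
```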

Comment on lines 109 to 117
if dsKey.DocID != f.currentKV.Key.DocID {
f.currentKV = keyValue{
Key: dsKey,
Value: res.Value,
}
break
}
(Contributor)

question: if they are equal, don't we need to update f.currentKV.Value?

@AndrewSisley (Contributor Author) commented Dec 11, 2024

currentKV is only relevant when crossing document boundaries. I'll see if I can find a good documentation-place to make that clearer.

  • Improve currentKV documentation, or maybe just the if-block.

@AndrewSisley (Contributor Author)

Made more sense to change the code than to try and explain it.

return immutable.Some[EncodedDocument](&doc), nil
}

func (f *document) appendKv(doc *encodedDocument, kv keyValue) error {
(Contributor)

nitpick: rename to appendKV

@AndrewSisley (Contributor Author) commented Dec 11, 2024

:) Cheers

  • Fix casing

Comment on lines +78 to +81
prefixes = make([]keys.DataStoreKey, 0, len(uniquePrefixes))
for prefix := range uniquePrefixes {
prefixes = append(prefixes, prefix)
}
(Contributor)

suggestion: why not write it like this:

i := 0
for prefix := range uniquePrefixes {
	prefixes[i] = prefix
	i += 1
}
prefixes = prefixes[0:len(uniquePrefixes)]

This will reuse existing memory.

(Collaborator)

I'm pretty sure the way Andy did it also uses existing memory because the capacity has been set.

(Contributor)

Capacity is good, but make always allocates a new backing array; the old one is discarded.

(Collaborator)

Ah, I had missed the reuse of prefixes. Given that it's always going to be a small array, I'm not sure it has much impact either way, but Andy can decide which one he prefers.

@AndrewSisley (Contributor Author)

Leaving as-is: the suggestion is less readable, and I class it as a premature optimization given the likely tiny gains (the relative size of which is unmeasured at the moment).
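For reference, the two variants being discussed look like this side by side (hypothetical string keys standing in for keys.DataStoreKey). The first performs one fresh allocation of the right capacity; the second truncates and reuses the original slice's backing array, assuming its capacity is sufficient:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	uniquePrefixes := map[string]struct{}{"/1": {}, "/2": {}}

	// Variant 1 (as in the PR): make with capacity performs a single fresh
	// allocation; the old backing array becomes garbage.
	v1 := make([]string, 0, len(uniquePrefixes))
	for p := range uniquePrefixes {
		v1 = append(v1, p)
	}

	// Variant 2 (as suggested): truncate to zero length and append, reusing
	// the existing backing array when cap(prefixes) >= len(uniquePrefixes).
	prefixes := []string{"/1", "/1", "/2"} // pre-dedup contents
	v2 := prefixes[:0]
	for p := range uniquePrefixes {
		v2 = append(v2, p)
	}

	sort.Strings(v1)
	sort.Strings(v2)
	fmt.Println(v1, v2) // prints "[/1 /2] [/1 /2]"
}
```

Both produce the same logical result; the difference is one small allocation, which supports the "impact is tiny either way" reading above.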

Comment on lines +85 to +87
slices.SortFunc(prefixes, func(a, b keys.DataStoreKey) int {
return strings.Compare(a.ToString(), b.ToString())
})
(Contributor)

suggestion: I assume the array of prefixes is not going to be large in any reasonable case, and this approach would be fine.

ToString() is not the cheapest, and will be called about 2·N·log(N) times. We could make it just N.

The sorting itself could also be done radix-style, first sorting by CollectionRootID, then by InstanceType... though this wouldn't give us the scanner's exact sequence.

@AndrewSisley (Contributor Author)

I don't think this is worth the maintenance cost at the moment. Whilst probably more significant than the make comment, its user impact is still unmeasured, and IMO we have far better things to spend our time on right now.
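The "call ToString just N times" idea mentioned above can be sketched as a decorate-sort-undecorate pass (the prefix type below is a hypothetical stand-in for keys.DataStoreKey): pre-render each sort key once, sort the pairs, then unwrap, instead of rendering inside the comparator on every comparison:

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// prefix is a stand-in for keys.DataStoreKey with a non-trivial ToString.
type prefix struct{ id int }

func (p prefix) ToString() string { return fmt.Sprintf("/%03d", p.id) }

// sortPrefixes sorts in the same order as comparing ToString directly,
// but calls ToString exactly once per element (N renders) rather than
// inside the comparator (roughly 2·N·log N renders).
func sortPrefixes(ps []prefix) {
	type keyed struct {
		key string
		p   prefix
	}
	ks := make([]keyed, len(ps))
	for i, p := range ps {
		ks[i] = keyed{p.ToString(), p}
	}
	slices.SortFunc(ks, func(a, b keyed) int {
		return strings.Compare(a.key, b.key)
	})
	for i, k := range ks {
		ps[i] = k.p
	}
}

func main() {
	ps := []prefix{{3}, {1}, {2}}
	sortPrefixes(ps)
	fmt.Println(ps) // sorted by rendered key
}
```

The cost is an extra O(N) of memory for the keyed copies, which is why, for small prefix lists, the straightforward SortFunc-over-ToString version is a reasonable choice.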


  • It hasn't been used in 3 years; it can die and come back if/when needed.
  • Is technically more correct, and unlikely to have a meaningful impact on performance.
@AndrewSisley AndrewSisley merged commit 363fe9c into sourcenetwork:develop Dec 16, 2024
43 of 44 checks passed
@AndrewSisley AndrewSisley deleted the 3277-cleanup-fetcher branch December 16, 2024 04:10
Labels
area/query Related to the query component code quality Related to improving code quality refactor This issue specific to or requires *notable* refactoring of existing codebases and components

Successfully merging this pull request may close these issues.

Cleanup fetcher
5 participants