Switch to persistent adaptive radix trees #9
Conversation
Benchmark results vs immutable-radix:
"Part_Insert" is without the watch channel management, and "Part_TrackMutate_Insert" is with. "SingleRadix_TrackMutate" still creates the watch channels but doesn't close them (so it's not apples-to-apples). This definitely shows that it's worth going this direction. It's almost twice as fast and half the allocations. We can then for the revision index just have a single watch channel per table rather than per node. There's not that much missing now. Need support for |
force-pushed from b1d9082 to af8bcbe
The impact of this is really visible in the reconciler benchmark:
It's 50% faster and we're using almost half the amount of memory for the objects.
force-pushed from 07c06be to b9ae279
Found a few small things. In general things seem to check out. However, part/iterator.go and part/txn.go have some very dense code which I find difficult to reason through. That is partially due to also having to wrap my head around the ART implementation, which is new to me.
There are a few edge cases in there which are not covered by the current tests, so I would like to schedule a codewalk to go through it together before my final verdict.
Thanks for the review! What edge cases are not covered? I think it'd be great to have someone else besides me write a few test cases for this and try to break it...
force-pushed from d399c2d to fd6eb05
part
looking good so far - need to further review iterator and the tests, and then the integration, but it's hard work 😅
// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

package part
all of part seems to be undocumented from the user perspective. If you could add a godoc to func NewTree, type Tree and type Txn, that would already help quite a bit, unless you think it's not super useful to document part as a free-standing lib.
Bonus for some simple examples (or let me know and I'll try to write them).
Sure! I'll do a pass to add docs to exported types & funcs.
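To illustrate what such godoc could look like, here is a hedged sketch; the type shapes and signatures are assumed from the discussion above, not copied from the actual part source.

// Package part implements persistent (immutable) adaptive radix trees.
package part

// Tree is an immutable adaptive radix tree keyed by []byte. Operations
// that modify the tree return a new *Tree and leave the receiver
// untouched, so a Tree can be shared between readers without locking.
type Tree[T any] struct {
    root *node[T]
    size int
}

// NewTree returns a new empty tree.
func NewTree[T any]() *Tree[T] { return &Tree[T]{} }

// Txn is a transaction against a Tree: it batches modifications and
// amortizes node copying. Commit returns the resulting immutable *Tree;
// the original tree is never modified.
type Txn[T any] struct {
    root *node[T]
    size int
}

// Txn starts a new transaction against the tree.
func (t *Tree[T]) Txn() *Txn[T] { return &Txn[T]{root: t.root, size: t.size} }

// Commit finalizes the transaction and returns the new tree.
func (txn *Txn[T]) Commit() *Tree[T] { return &Tree[T]{root: txn.root, size: txn.size} }

// node is a placeholder for the internal ART node representation.
type node[T any] struct{}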
part/txn.go (outdated)
type Txn[T any] struct {
    opts *options
    root *node[T]
    size int
what does the size of a transaction track? Is it the number of operations (i.e. inserts/deletes) or is it an optimization keeping track of the size of the resulting tree?
Number of objects in the tree, for implementing Len(). I'll document the fields.
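As a minimal sketch of that answer, building on the Txn fields shown in the diff hunk above (whether Len lives on Tree, Txn, or both is assumed here):

// size is kept up to date by Insert and Delete, so Len answers in O(1)
// without walking the tree.
func (txn *Txn[T]) Len() int {
    return txn.size
}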
}

func (txn *Txn[T]) Commit() *Tree[T] {
    txn.mutated = nil
could clear here, but I'm not sure what the differences are - do we rely on len(map) == 0 or on map == nil somewhere? I'd imagine clear being potentially more efficient by avoiding GC'ing the map memory.
Ah interesting. I'll check if clear is better from the GC perspective!
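A hypothetical micro-benchmark for that comparison (in a _test.go file) might look like the following; clear for maps needs Go 1.21+, and this is not the Benchmark_txn_1 benchmark whose results are quoted below.

package part

import "testing"

// Drop the map and allocate a fresh one next time, mirroring
// txn.mutated = nil on Commit followed by a lazy re-allocation.
func Benchmark_mutated_nil(b *testing.B) {
    m := make(map[int]struct{}, 64)
    for i := 0; i < b.N; i++ {
        m[i] = struct{}{}
        m = nil
        m = make(map[int]struct{}, 64)
    }
    _ = m
}

// Keep the map and its backing storage, just delete all entries.
func Benchmark_mutated_clear(b *testing.B) {
    m := make(map[int]struct{}, 64)
    for i := 0; i < b.N; i++ {
        m[i] = struct{}{}
        clear(m)
    }
}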
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/part
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Benchmark_txn_1-8 1608686 775.2 ns/op 1290049 objects/sec
PASS
ok github.com/cilium/statedb/part 2.275s
part % go test . -bench txn_1\$ (pr/joamaki/part)statedb
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb/part
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Benchmark_txn_1-8 1767748 643.4 ns/op 1554270 objects/sec
PASS
First one is with clear, second without. So clearly worse.
clearly worse 😁 anyway - good to know!
part/txn.go (outdated)
// Reuse the same slice in the transaction to hold the parents in order to avoid
// allocations.
if txn.deleteParents == nil {
    txn.deleteParents = make([]deleteParent[T], 0, 32)
instead of 32, should this depend on the (estimated) height of the tree?
I don't think we need to do anything fancy like that. I just picked a number that's big enough most of the time. I can move this into a constant and add a comment about why this number was chosen.
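A sketch of that constant-plus-comment change, building on the fields in the hunk above (the constant and helper names here are made up):

// deleteParentsCap is the initial capacity of the reusable parents
// slice. It just needs to cover the typical tree height; 32 is big
// enough most of the time, and the slice grows for deeper trees.
const deleteParentsCap = 32

func (txn *Txn[T]) ensureDeleteParents() {
    // Reuse the same slice across deletes in the transaction to avoid
    // allocating on every operation.
    if txn.deleteParents == nil {
        txn.deleteParents = make([]deleteParent[T], 0, deleteParentsCap)
    }
}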
part/node.go (outdated)
    return nCopy
}

func (n *node[T]) promote(watch bool) *node[T] {
putting this here for lack of a better place. The paper has specialised the structure of each inner node, whereas your implementation "only" specialises node256 - I'm guessing you're leaving room for further optimization to keep some simplicity? Or did you benchmark and the other structures didn't do much good?
I went for simplicity, so I didn't even initially specialize node256 (until I benchmarked and noticed it makes a ton of sense). Since the perf is already 2x what we had before, I would leave it at this for now.
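For context, a rough sketch of the node kinds being discussed; the layouts are illustrative, and per the PR description below, this implementation skips the Node16 SIMD comparison and the Node48 key array and only gives the 256-wide node a specialised layout. (promote in the hunk above presumably grows a full node into the next larger kind.)

// The ART paper adapts an inner node's layout to its fan-out.
type nodeKind uint8

const (
    nodeKind4   nodeKind = iota // up to 4 children
    nodeKind16                  // up to 16 children
    nodeKind48                  // up to 48 children
    nodeKind256                 // up to 256 children
)

// In the specialised 256-wide node there is no key array at all:
// the child slot is indexed directly by the next key byte.
type node256Sketch[T any] struct {
    children [256]*node256Sketch[T]
}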
case nodeKind48:
    return n.node48().children[0:n.size():48]
case nodeKind256:
    return n.node256().children[:]
why the [:]? Could it also be left out, or be [:256] I guess?
We need [:] to convert it to a slice. [:256] would be the same as [:].
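A quick standalone illustration of that point about slicing an array:

package main

import "fmt"

func main() {
    var children [256]*int

    a := children[:]    // slice over the whole backing array
    b := children[:256] // identical: same pointer, length and capacity
    fmt.Println(len(a), cap(a), len(b), cap(b)) // 256 256 256 256
    fmt.Println(&a[0] == &b[0])                 // true: same backing array
}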
LGTM pending a few nits and questions, but I think this is good enough as is already.
txn = tree.Txn()
for _, i := range keys {
    _, hadOld = txn.Insert(intKey(i), i)
inject another get which confirms they're also gone after committing?
Done.
}

func Test_iterate(t *testing.T) {
    sizes := []int{0, 1, 10, 100, 1000, rand.Intn(1000)}
any specific behaviour we want for calling Next() on iterators which have already finished?
Added assert.
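The convention being asserted is presumably the usual Go one, illustrated here with a standalone slice iterator (not the part iterator, whose Next signature isn't shown in this diff):

package main

import "fmt"

// sliceIter shows the convention: once the iterator is exhausted, every
// further Next call keeps returning ok=false instead of panicking or
// wrapping around.
type sliceIter[T any] struct {
    items []T
}

func (it *sliceIter[T]) Next() (v T, ok bool) {
    if len(it.items) == 0 {
        return v, false // stays false on every subsequent call
    }
    v, it.items = it.items[0], it.items[1:]
    return v, true
}

func main() {
    it := &sliceIter[int]{items: []int{1, 2}}
    for v, ok := it.Next(); ok; v, ok = it.Next() {
        fmt.Println(v)
    }
    _, ok := it.Next() // already finished
    fmt.Println(ok)    // false, and remains false
}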
part/iterator.go (outdated)
edges := [][]*node[T]{}

var traverseToMin func(n *node[T])
traverseToMin = func(n *node[T]) {
nit: declaration and definition are separate, could they be joined?
No, it can't be, since traverseToMin calls traverseToMin.
I lifted traverseToMin into a top-level function.
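A standalone illustration of why the declaration and assignment have to be separate for a recursive closure, and of the top-level alternative:

package main

import "fmt"

// traverse := func(n int) { traverse(n - 1) } does not compile:
// the closure body refers to traverse before it is declared.
func main() {
    var traverse func(n int)
    traverse = func(n int) {
        if n == 0 {
            return
        }
        fmt.Println(n)
        traverse(n - 1)
    }
    traverse(3)

    // The alternative taken in the PR: lift the closure into a named
    // top-level function, which can simply call itself.
    countdown(3)
}

func countdown(n int) {
    if n == 0 {
        return
    }
    fmt.Println(n)
    countdown(n - 1)
}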
force-pushed from 54dcc41 to ac5cf9f
runtime.GC wasn't called and the memstats were read after Stop(). Signed-off-by: Jussi Maki <[email protected]>
Add an implementation of Persistent Adaptive Radix Trees. Adaptive Radix Trees are described in the paper https://db.in.tum.de/~leis/papers/ART.pdf. Compared to go-immutable-radix, which is non-adaptive, the adaptive radix trees use less memory and avoid an extra pointer chase, as the children array is embedded into the node. This implementation also allows using a watch channel only at the root node, which is more efficient for the revision index where per-node watch channels are not useful (we can't use them with LowerBound). Signed-off-by: Jussi Maki <[email protected]>
Refresh of part-add-custom-implementation
Code coverage reports showed that there are a few edge cases that are not covered. This commit adds a few tests to cover some of those. Signed-off-by: Dylan Reimerink <[email protected]>
Deduplicate WriteTxn benchmark code and add tests for larger transaction sizes. Signed-off-by: Jussi Maki <[email protected]>
Use the part library instead of go-immutable-radix. Since part doesn't have a reverse iterator drop the Last and LastWatch support (they were never used, we can add them back if we need them).

Benchmark results before:

statedb % go test . -bench . -benchmem
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkDB_WriteTxn_1-8 332079 3431 ns/op 291454 objects/sec 3080 B/op 53 allocs/op
BenchmarkDB_WriteTxn_10-8 567682 2026 ns/op 493603 objects/sec 1501 B/op 23 allocs/op
BenchmarkDB_WriteTxn_100-8 570948 2010 ns/op 497480 objects/sec 1399 B/op 19 allocs/op
BenchmarkDB_WriteTxn_1000-8 469208 2715 ns/op 368260 objects/sec 1495 B/op 18 allocs/op
BenchmarkDB_WriteTxn_10000-8 1000000 2609 ns/op 383238 objects/sec 1349 B/op 21 allocs/op
BenchmarkDB_WriteTxn_100_SecondaryIndex-8 309526 3554 ns/op 281402 objects/sec 2208 B/op 33 allocs/op
BenchmarkDB_RandomInsert-8 1108 1010195 ns/op 989908 objects/sec 796015 B/op 12214 allocs/op
BenchmarkDB_RandomReplace-8 250 4612771 ns/op 216789 objects/sec 2383117 B/op 33382 allocs/op
BenchmarkDB_SequentialInsert-8 398 2997096 ns/op 333656 objects/sec 1492145 B/op 18255 allocs/op
BenchmarkDB_Baseline_Part_Insert-8 4711 222810 ns/op 4488140 objects/sec 115342 B/op 3071 allocs/op
BenchmarkDB_Baseline_SingleRadix_Insert-8 3504 331964 ns/op 3012375 objects/sec 353663 B/op 5103 allocs/op
BenchmarkDB_Baseline_SingleRadix_TrackMutate_Insert-8 3328 333911 ns/op 2994818 objects/sec 353952 B/op 5106 allocs/op
BenchmarkDB_Baseline_Part_TrackMutate_Insert-8 4189 273483 ns/op 3656536 objects/sec 214634 B/op 4096 allocs/op
BenchmarkDB_Baseline_SingleRadix_Lookup-8 13726 84329 ns/op 11858358 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_Baseline_Part_Lookup-8 10000 101451 ns/op 9857032 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_Baseline_Hashmap_Insert-8 17227 70050 ns/op 14275499 objects/sec 86552 B/op 64 allocs/op
BenchmarkDB_Baseline_Hashmap_Lookup-8 81237 14801 ns/op 67561608 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_DeleteTracker_Baseline-8 416 2930618 ns/op 341225 objects/sec 1758387 B/op 23413 allocs/op
BenchmarkDB_DeleteTracker-8 194 6178728 ns/op 161846 objects/sec 3392362 B/op 40843 allocs/op
BenchmarkDB_RandomLookup-8 8728 137090 ns/op 7294481 objects/sec 144 B/op 1 allocs/op
BenchmarkDB_SequentialLookup-8 9162 128953 ns/op 7754759 objects/sec 8000 B/op 1000 allocs/op
BenchmarkDB_FullIteration_All-8 103431 11386 ns/op 87823494 objects/sec 296 B/op 8 allocs/op
BenchmarkDB_FullIteration_Get-8 83902 13900 ns/op 71943487 objects/sec 336 B/op 9 allocs/op
BenchmarkDB_PropagationDelay-8 240777 4609 ns/op 41.00 50th_µs 48.00 90th_µs 199.0 99th_µs 3083 B/op 50 allocs/op
PASS
ok github.com/cilium/statedb 38.595s

Benchmark results after:

statedb % go test . -bench . -benchmem
goos: linux
goarch: amd64
pkg: github.com/cilium/statedb
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkDB_WriteTxn_1-8 412078 2748 ns/op 363939 objects/sec 2408 B/op 38 allocs/op
BenchmarkDB_WriteTxn_10-8 949206 1267 ns/op 789411 objects/sec 663 B/op 12 allocs/op
BenchmarkDB_WriteTxn_100-8 995774 1145 ns/op 873221 objects/sec 521 B/op 9 allocs/op
BenchmarkDB_WriteTxn_1000-8 968202 1350 ns/op 740537 objects/sec 516 B/op 9 allocs/op
BenchmarkDB_WriteTxn_10000-8 1000000 1507 ns/op 663678 objects/sec 486 B/op 9 allocs/op
BenchmarkDB_WriteTxn_100_SecondaryIndex-8 559314 2031 ns/op 492404 objects/sec 844 B/op 18 allocs/op
BenchmarkDB_RandomInsert-8 1326 841762 ns/op 1187984 objects/sec 418200 B/op 9170 allocs/op
BenchmarkDB_RandomReplace-8 438 2705055 ns/op 369678 objects/sec 837668 B/op 19188 allocs/op
BenchmarkDB_SequentialInsert-8 759 1636911 ns/op 610907 objects/sec 515916 B/op 9185 allocs/op
BenchmarkDB_Baseline_Part_Insert-8 5148 213659 ns/op 4680351 objects/sec 115341 B/op 3071 allocs/op
BenchmarkDB_Baseline_SingleRadix_Insert-8 3252 345056 ns/op 2898080 objects/sec 353662 B/op 5103 allocs/op
BenchmarkDB_Baseline_SingleRadix_TrackMutate_Insert-8 3238 371801 ns/op 2689613 objects/sec 353951 B/op 5106 allocs/op
BenchmarkDB_Baseline_Part_TrackMutate_Insert-8 3883 292706 ns/op 3416395 objects/sec 214636 B/op 4096 allocs/op
BenchmarkDB_Baseline_SingleRadix_Lookup-8 13491 88791 ns/op 11262375 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_Baseline_Part_Lookup-8 10000 104666 ns/op 9554213 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_Baseline_Hashmap_Insert-8 17108 71518 ns/op 13982445 objects/sec 86539 B/op 64 allocs/op
BenchmarkDB_Baseline_Hashmap_Lookup-8 83188 13999 ns/op 71434361 objects/sec 0 B/op 0 allocs/op
BenchmarkDB_DeleteTracker_Baseline-8 934 1317762 ns/op 758863 objects/sec 525331 B/op 12317 allocs/op
BenchmarkDB_DeleteTracker-8 488 2450865 ns/op 408019 objects/sec 925854 B/op 18652 allocs/op
BenchmarkDB_RandomLookup-8 6177 165365 ns/op 6047254 objects/sec 144 B/op 1 allocs/op
BenchmarkDB_SequentialLookup-8 8071 146586 ns/op 6821924 objects/sec 8000 B/op 1000 allocs/op
BenchmarkDB_FullIteration_All-8 103074 11917 ns/op 83911284 objects/sec 280 B/op 8 allocs/op
BenchmarkDB_FullIteration_Get-8 85909 14108 ns/op 70881358 objects/sec 320 B/op 9 allocs/op
BenchmarkDB_PropagationDelay-8 414364 2914 ns/op 26.00 50th_µs 30.00 90th_µs 69.00 99th_µs 1406 B/op 28 allocs/op
PASS
ok github.com/cilium/statedb 36.180s

In summary:
- Half the number of allocations in a transaction of 100 inserts
- Almost double the throughput: 497k/s -> 873k/s
- Replacing an existing object: 216k/s -> 369k/s
- Propagation latency down from 41us to 26us for 50th percentile

Signed-off-by: Jussi Maki <[email protected]>
With the move to the part library we can drop all references to go-immutable-radix. Signed-off-by: Jussi Maki <[email protected]>
Signed-off-by: Jussi Maki <[email protected]>
Have a proper leaf node instead of storing the leaf inside an empty node4. This will reduce memory usage and takes out an allocation and a pointer dereference. For 1M dummy objects the memory usage is ~15% less and the insert speed is ~15% faster, with 30% less allocations. The leaf node is still wasting some space as it shares the header and the 'leaf' and 'prefix' fields could be removed.

Before:
Benchmark_Insert_RootOnlyWatch-8 4873 213619 ns/op 4681237 objects/sec 148882 B/op 3076 allocs/op
Benchmark_Insert-8 4107 267379 ns/op 3740016 objects/sec 249387 B/op 4114 allocs/op
Inserting batch 1000/1000 ... Waiting for reconciliation to finish ...
1000000 objects reconciled in 3.10 seconds (batch size 1000)
Throughput 322570.39 objects per second
Allocated 8012221 objects, 471726kB bytes, 527128kB bytes still in use

After:
Benchmark_Insert_RootOnlyWatch-8 5898 187664 ns/op 5328686 objects/sec 116722 B/op 2071 allocs/op
Benchmark_Insert-8 4978 257483 ns/op 3883746 objects/sec 217225 B/op 3109 allocs/op
benchmark % go run . -objects 1000000
Inserting batch 1000/1000 ... Waiting for reconciliation to finish ...
1000000 objects reconciled in 2.73 seconds (batch size 1000)
Throughput 365990.20 objects per second
Allocated 6011516 objects, 409163kB bytes, 484024kB bytes still in use

Signed-off-by: Jussi Maki <[email protected]>
To avoid having the useless 'leaf' field in the leaf struct, move it out from header into each individual node struct. We still have the 'prefix' in the leaf struct that could in principle be removed by special-casing, e.g. instead of always looking at the prefix one would first check if the node is a leaf and then do full key comparison instead. Leaving that for a later commit. Signed-off-by: Jussi Maki <[email protected]>
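A rough sketch of the layout change these two commits describe; the field names and shapes are illustrative, not the actual structs.

// Before: every node shared a header that carried a leaf pointer, and a
// value was stored as a leaf hanging off an otherwise empty node4.
type headerBefore[T any] struct {
    prefix []byte
    leaf   *leafSketch[T]
}

// After: leaves are their own node kind and the leaf pointer lives only
// in the inner node kinds that need it, saving an allocation and a
// pointer dereference per lookup. The prefix (full key) is still kept in
// the leaf; the commit notes it could later be dropped by comparing keys
// directly.
type leafSketch[T any] struct {
    prefix []byte
    value  T
    watch  chan struct{}
}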
Additionally move some exported functions to the top of the file to make it easier for readers. Embed Tree[T] into Txn[T] to avoid repeating the same fields. Signed-off-by: Jussi Maki <[email protected]>
lowerbound() was sufficiently large, so just move the closure up to its own top-level function.
force-pushed from ac5cf9f to 305b96d
The leaf's watch channel wasn't closed on delete. Fix this and add a test case to check that all watch channels of deleted nodes are closed.
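Independent of the part API (which isn't shown here), the standard Go way for such a test to assert that a watch channel was closed is a non-blocking receive:

// closed reports whether ch has been closed without blocking: a receive
// from a closed channel returns immediately, while a receive from an
// open channel that nobody sends on would block, so we fall through to
// the default case.
func closed(ch <-chan struct{}) bool {
    select {
    case <-ch:
        return true
    default:
        return false
    }
}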
Focused my review on part and looks good to me. 🚀
I'll take another look at the benchmarks and tests later, but don't need to block this further.
Switch StateDB to use a custom implementation of Persistent Adaptive Radix Trees (https://db.in.tum.de/~leis/papers/ART.pdf).
The implementation isn't a completely faithful implementation of the paper: the Node16 SIMD tricks and the Node48 key array are skipped. This made the implementation simpler, but did leave some performance improvements on the table.
The only functional change to StateDB is to drop the Last and LastWatch methods, since they require a reverse iterator implementation and we didn't use them, so we can just drop them for now.

Benchmark before:
Benchmark after:
Highlights: