Below are some plots taken from the CSV output for tests at different key insert…
### Method
Before a test runs, *n*-sized randomised insertion, search and deletion arrays of keys ranging from 0 to *n*-1 are generated using a [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle). Using premade shuffled sequences guarantees that every operation succeeds, with no repetitions. Keys are inserted, searched and deleted by iterating over these arrays, and the duration of every *k* operations is measured and then divided by *k*. The incremental search during deletion (or decremental search) first finds *k* nodes from the randomised deletion array, then deletes them, and so on, in batches of *k*. The incremental search during insertion is non-optimal, because every *k* insertions it generates a *k*-sized random sample of previously inserted keys, doing so by fetching insert array indices from the search array and taking them modulo the current node count. There are definitely duplicates happening, but overall the behaviour is random enough - the lower-right corner plots are *almost* symmetrical. These incremental / decremental search times are subtracted from the insertion and deletion times, and that is what constitutes the approximate rebalancing overhead - obviously any spikes are superimposed. The rebalance overhead plots are pretty much flat (at higher node counts at least), which again is a known property of red-black trees.
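To make the above concrete, here is a minimal C sketch of how the shuffled key arrays and the per-batch timing could be produced. It is not the project's actual harness: `struct rb_tree`, `rb_insert()` and the `rand()`-based shuffle are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct rb_tree;                                  /* opaque handle, assumed        */
void rb_insert(struct rb_tree *t, uint64_t key); /* assumed insert signature      */

/* Fill keys[] with 0 .. n-1 and shuffle them with Fisher-Yates. */
static void make_shuffled_keys(uint64_t *keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        keys[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);     /* crude PRNG, fine for a sketch */
        uint64_t tmp = keys[i];
        keys[i] = keys[j];
        keys[j] = tmp;
    }
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Insert all keys in batches of k, emitting "nodes,seconds-per-op" CSV rows. */
static void timed_inserts(struct rb_tree *t, const uint64_t *keys, size_t n, size_t k)
{
    for (size_t done = 0; done < n; done += k) {
        size_t batch = (n - done < k) ? (n - done) : k;
        double t0 = now_sec();
        for (size_t i = 0; i < batch; i++)
            rb_insert(t, keys[done + i]);
        printf("%zu,%.9g\n", done + batch, (now_sec() - t0) / (double)batch);
    }
}
```

Search and deletion batches would be timed the same way, with the incremental / decremental search time measured separately and subtracted as described above.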
### 100k nodes:
I did not spend enough time analysing these plots, so for now the noise towards…
All plots were made using [kst2 / kst-plot](https://kst-plot.kde.org/), which remains my all-time favourite graphing and data analysis package, and likely a mark of my age and unchanging habits. *While I have your attention - I have trawled through many papers, plots and benchmarks in my life, and let me say this: those who fail to specify axis descriptions and units should not bother to publish plots at all and should go back to school instead. K, thx, bye.*
<sup>*</sup> - the OS clock in the server was left uncorrected during the 500M run and it was drifting badly enough to affect the perceived duration of each 20000-node batch; this is why the sustained search time is apparently decreasing for that run. It is not. This was corrected for the 1000M run. Why weren't you using `CLOCK_MONOTONIC` you say? I was. Duration measurement could probably be rewritten with `rdtsc` (TODO).
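For reference, a rough sketch of what the `rdtsc`-based measurement mentioned in that TODO could look like (not part of the project); `tsc_hz` is an assumed, machine-specific calibration value, and on anything without an invariant TSC this would need more care.

```c
#include <stdint.h>
#include <x86intrin.h>                  /* __rdtsc(), _mm_lfence() on GCC/Clang */

static inline uint64_t cycles_now(void)
{
    _mm_lfence();                       /* order the read against surrounding work */
    uint64_t c = __rdtsc();
    _mm_lfence();
    return c;
}

/* Convert a measured cycle delta for one k-operation batch into seconds. */
static double batch_seconds(uint64_t start, uint64_t end, double tsc_hz)
{
    return (double)(end - start) / tsc_hz;
}
```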
## FAQ
Q: *If this is for a dictionary index, why are you going with red-black trees and not hash tables? Surely you do not need in-order traversal or ranged searches for a dictionary?*
A: Of course. Actually it is even worse, because this index will hold *hashes* of full tree node paths of a hierarchical configuration container. Implementing a non-shite *dynamically resizing* hash table, however, is a somewhat less trivial task than implementing a red-black tree, and with a red-black tree there is no amortisation, no load factor and no emptiness either, even if temporary. I wrote this for convenience, not for ultimate performance, even though some gains could be made from top-down operations, and a *k*-ary tree could make great use of vectorisation. Plus, this was a fun exercise. The aforementioned non-shite dynamically resizing hash table is next on my list, but let me tell you a secret: I am not a computer scientist. I am not a real programmer either!
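Purely as an illustration of the use case above - the tree being keyed by a hash of a full node path rather than by the path itself - something like the following would do. The FNV-1a hash and the example path are my assumptions here, not the project's actual hashing scheme.

```c
#include <stdint.h>

/* 64-bit FNV-1a over a NUL-terminated path string (illustrative choice of hash). */
static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 14695981039346656037ull;   /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ull;              /* FNV prime */
    }
    return h;
}

/* Usage sketch: a hypothetical configuration path becomes a fixed-width key
 * that the red-black tree can index. */
static uint64_t path_key(void)
{
    return fnv1a64("system/network/eth0/mtu");   /* hypothetical path */
}
```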