Commit 501ddbd

Added a FAQ section
1 parent 998142a commit 501ddbd

File tree

1 file changed: +8 −2 lines


README.md

Lines changed: 8 additions & 2 deletions
@@ -224,7 +224,7 @@ Below are some plots taken from the CSV output for tests at different key insert
 
 ### Method
 
-Before a test runs, *n*-sized randomised insertion, search and deletion arrays of keys ranging from 0 to *n*-1 are generated using [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle). Using premade shuffled sequences guarantees that an operation is successful every time, with no repetitions. Keys are inserted, searched and deleted by iterating over these arrays, and the duration of every *k* operations is measured and then divided by *k*. The incremental search during deletion (or decremental search) first finds *k* nodes from the randomised deletion array, then deletes them, and so on, in batches of *k*. The incremental search during insertion is non-optimal, because it generates a *k*-sized random sample of previously inserted keys every *k* insertions, but it does so by fetching insert array indices from the search array, reducing them modulo the current node count. There are definitely duplicates happening, but overall the behaviour is random enough - the lower-right corner plots are *almost* symmetrical. These incremental / decremental search times are subtracted from insertion and deletion times, and that is what constitutes the approximate rebalancing overhead - obviously any spikes are superimposed. Rebalance overhead function plots are pretty much flat, which again is a known property of red-black trees.
+Before a test runs, *n*-sized randomised insertion, search and deletion arrays of keys ranging from 0 to *n*-1 are generated using [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle). Using premade shuffled sequences guarantees that an operation is successful every time, with no repetitions. Keys are inserted, searched and deleted by iterating over these arrays, and the duration of every *k* operations is measured and then divided by *k*. The incremental search during deletion (or decremental search) first finds *k* nodes from the randomised deletion array, then deletes them, and so on, in batches of *k*. The incremental search during insertion is non-optimal, because it generates a *k*-sized random sample of previously inserted keys every *k* insertions, but it does so by fetching insert array indices from the search array, reducing them modulo the current node count. There are definitely duplicates happening, but overall the behaviour is random enough - the lower-right corner plots are *almost* symmetrical. These incremental / decremental search times are subtracted from insertion and deletion times, and that is what constitutes the approximate rebalancing overhead - obviously any spikes are superimposed. Rebalance overhead function plots are pretty much flat (at higher node counts at least), which again is a known property of red-black trees.
 
 ### 100k nodes:
 
@@ -250,4 +250,10 @@ I did not spend enough time analysing these plots, so for now the noise towards
 
 All plots were made using [kst2 / kst-plot](https://kst-plot.kde.org/), which remains my all-time favourite graphing and data analysis package, and likely a mark of my age and unchanging habits. *While I have your attention - I have trawled through many papers, plots and benchmarks in my life, and let me say this: those who fail to specify axis descriptions and units should not bother to publish plots at all and go back to school instead. K, thx, bye.*
 
-<sup>*</sup> - the OS clock in the server was left uncorrected during the 500M run and it was drifting badly enough to affect the perceived duration of each 20000-node batch; this is why the sustained search time is apparently decreasing for that run. It is not. This was corrected for the 1000M run. Why weren't you using `CLOCK_MONOTONIC`, you say? I was. Duration measurement could probably be rewritten with `rdtsc` (TODO).
+<sup>*</sup> - the OS clock in the server was left uncorrected during the 500M run and it was drifting badly enough to affect the perceived duration of each 20000-node batch; this is why the sustained search time is apparently decreasing for that run. It is not. This was corrected for the 1000M run. Why weren't you using `CLOCK_MONOTONIC`, you say? I was. Duration measurement could probably be rewritten with `rdtsc` (TODO).
+
+## FAQ
+
+Q: *If this is for a dictionary index, why are you going with red-black trees and not hash tables? Surely you do not need in-order traversal or ranged searches for a dictionary?*
+
+A: Of course. Actually it is even worse, because this index will hold *hashes* of full tree node paths of a hierarchical configuration container. Implementing a non-shite *dynamically resizing* hash table, however, is a somewhat less trivial task than implementing a red-black tree, and a red-black tree has no amortisation, no load factor and no emptiness either, even if temporary. I wrote this for convenience, not for ultimate performance, even though some gains could be made from top-down operations, and a *k*-ary tree could make great use of vectorisation. Plus, this was a fun exercise. The aforementioned non-shite dynamically resizing hash table is next on my list, but let me tell you a secret: I am not a computer scientist. I am not a real programmer either!
