-
Notifications
You must be signed in to change notification settings - Fork 62
/
benchmarks.Rmd
198 lines (162 loc) · 9.52 KB
/
benchmarks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "ANN Benchmarks"
author: "Amos Elberg"
date: '`r Sys.Date()`'
output: github_document
---
```{r setupbenchmark,eval=T,echo=F,warning=F,error=F,message=F}
# Note to reader: Please don't steal the semi-distinctive visual style I spent several minutes creating for myself.
library(ggplot2,
quietly = TRUE)
library(RColorBrewer,
quietly = TRUE)
library(wesanderson,
quietly = TRUE)
library(dplyr, quietly = TRUE)
library(magrittr, quietly = TRUE)
knitr::opts_chunk$set(collapse = TRUE,
comment = "#>",
fig.width = 7,
fig.height = 5)
colors_discrete <- function(x) rep(wes_palette("Darjeeling",
n = min(x, 5)),
2)[1:x]
colors_divergent_discrete <- function(x)
grDevices::colorRampPalette(RColorBrewer::brewer.pal(x, "Spectral"))
colors_continuous <- function(x) wes_palette(name = "Zissou",
n = x,
type = "continuous")
nacol <- colors_discrete(4)[4]
theme_set(
theme_bw() %+replace%
theme(
legend.title = element_text(size = rel(0.8),
face = "bold"),
legend.margin = unit(0, "cm"),
legend.position = "bottom",
legend.key.size = unit(0.5, "lines"),
legend.text = element_text(size = unit(8, "points")),
axis.title.y = element_text(angle = 90),
axis.text = element_text(size = rel(0.7)),
plot.margin = unit(c(0, 0.5, 1, 0), "lines"),
axis.title = element_text(size = rel(0.8),
face = "bold"),
title = element_text(size = rel(0.9))
)
)
```
## Note:
This was formerly a vignette benchmarking the performance of version 0.1.9 of the `largeVis` package. I have removed it as a vignette because of the overhead required to recalculate the benchmarks each time the code is refactored. Version 0.1.10 tested to be slightly (5-10%) faster in the Annoy phase, and significantly (30%) faster in the neighborhood iteration phase, than the benchmarks shown below would indicate.
## Overview
Besides manifold visualization, `largeVis` also includes an extremely efficient approximate nearest-neighbor search that runs in $O(n)$ time.
This vignette includes benchmarks and recommendations for adjusting hyperparameters in the neighbor search for best results.
## Hyperparameters
The `randomProjectionTreeSearch` function has three hyperparameters that trade-off accuracy and efficiency in the neighbor search:
1. `n_trees` - In the first phase of the function, the number of random projection trees to create.
2. `tree_threshold` - The maximum number of any nodes on a random projection tree leaf. If, after branching, the number of nodes in a branch exceeds this threshold, the branch will be divided again.
3. `max_iters` - The number of iterations for the neighborhood-exploration phase of the algorithm.
## Data Collection \& Methodology
The data in the benchmarks below was obtained by running the `benchmark.R` script, which is installed along with the package, on two machines.
The aim was to replicate as much as possible the methodology used by Erik Bernhardsson's [ANN Benchmark](https://github.com/erikbern/ann-benchmarks) github. However, `ANN Benchmark` is designed for libraries that are designed to build a neighbor index and then rapidly process queries against the index. The measure used by `ANN Benchmark` is therefore queries-per-second. By contract, `largeVis` is concerned with getting neighbors for all of the nodes in a finite dataset as quickly as possible.
Times shown for `RcppAnnoy` include the time to build a searchable index and query neighbors for all rows in the dataset.
The data used is the 1-million vector, 128-feature [SIFT Dataset](http://corpus-texmex.irisa.fr/), which is the test data used by `ANN Benchmark`.
Benchmarks were run on several machines. First, benchmarks were run on a workstation and a server with $K = 100$. Benchmarks were then run on an AWS c4.2xlarge instance with $K = 100$ and $K = 50$, to replicate as closely as possible the conditions of `ANN Benchmark`.
Results that appear to have used virtual memory, in that the completion time was radically discontinuous with other results from the same machine, were discarded.
I welcome submissions of output from the script from other hardware.
## Comparison With Annoy
The following chart illustrates performance versus the `Annoy` library, as implemented through the `RcppAnnoy` R package.
To facilitate comparison with the ANN Benchmark charts, the Y-axis shows the number of vectors processed per second.
```{r plotpeformance,echo=F,fig.align='center',warning=FALSE,message=FALSE}
load(system.file("extdata", "benchmark.Rda", package = "largeVis"))
benchmark %>%
filter(machine != 'Large Server',
machine == 'Workstation' | K == 50) %>%
mutate(facet = precision,
facet = ifelse(facet < 0.95, '', 'Closeup'),
facet = factor(facet)) %>%
ggplot(aes( y = time,
x = precision,
group = series,
fill = series,
shape = series)) +
geom_point(size = 1.5, alpha = 0.7, color = "grey80") +
scale_y_log10(name = "Speed, log (nodes / seconds)") +
scale_x_continuous("Precision",
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 0.925, 0.95, 0.975, 1.0)) +
facet_grid(K + machine ~ facet, scales = "free") +
scale_fill_manual(name = "Method & n. iter.",
values = colors_divergent_discrete(nlevels(benchmark$series))(nlevels(benchmark$series))) +
scale_shape_manual(name = "Method & n. iter.",
values = c(21, 21, 21, 21, 23)) +
# guides(color = guide_legend(nrow=3)) +
ggtitle(expression(
atop("Precision-Performance tradeoff, RcppAnnoy and largeVis",
atop(italic("(n = 10000; Upper Right is Better)"))
)
))
```
## Approximate Equivalence of Number of Trees and Tree Threshold
There is an approximate trade-off in memory use between the tree threshold and number of trees. Peak memory consumption during the tree search phase = N * n_trees * threshold.
The trade-off is not precise because the tree split phase will return fewer nodes per tree than the threshold. On average, it should return about 3/4 of the threshold.
On the following chart, points that share the same values of n_trees * threshold, referred to as `tth`, (and number of neighborhood exploration iterations), are shown as the same series.
```{r constn,echo=F,warning=F}
bench <- benchmark %>%
filter(method == 'largeVis', machine == 'Large Server') %>%
mutate(nn = threshold * n_trees) %>%
group_by(max_iters, nn) %>%
filter(n() > 2) %>%
mutate(series = paste(max_iters, ", ", nn, sep = " "))
bench$facet <- factor(ifelse(bench$n_trees >= 4, "", "n. trees < 10"))
bench %>%
ggplot(aes(y = time,
x = precision,
fill = series,
group = series,
color = factor(n_trees))) +
geom_point(size = 1.5, alpha = 0.8, shape = 21) +
scale_fill_manual("n. iter, tth", values = colors_divergent_discrete(6)(6)) +
scale_color_grey("n. trees", start = 0.8, end = 0 ) +
# guides(color = FALSE) +
# scale_shape(name = "Iterations", solid = FALSE) +
facet_grid(machine ~ .) +
scale_y_log10(name = "Speed, log (nodes / second)", limits = c(1e2,1e5)) +
scale_x_continuous("Precision",
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1)) +
ggtitle(expression(
atop("Precision-Performance tradeoff, n_trees and tree_threshold",
atop(italic("(100-NN precision, n = 10000; Upper Right is Better)")))))
```
Results that hold nn constant while varying the number of trees and threshold tend to cluster together, however increasing the number of trees (while holding tth constant) tends to improve accuracy and decrease speed. The degree of dispersion increases when a neighborhood exploration iteration is added.
On the charts below, n_trees * threshold is referred to as `tth`.
## Effect of Increasing `tth` vs. `max_iters`
```{r tree_threshold,echo=F}
bench <- benchmark %>%
filter(method == 'largeVis',
machine != 'Large Server') %>%
mutate(label = ifelse(threshold == 128, "128", "Other"),
label = factor(label),
facet = precision,
facet = ifelse(facet < 0.85, '', 'Closeup'))
bench$facet <- factor(bench$facet)
bench %>%
arrange(nn) %>%
mutate(max_iters = factor(max_iters)) %>%
ggplot(aes(y = time,
x = precision ,
color = max_iters,
group = max_iters)) +
geom_point(size = 0.2, alpha =0.8) +
geom_line(size = 0.2, alpha = 0.8) +
facet_grid(K + machine ~ facet, scales = 'free') +
scale_y_log10(name = "Speed, log (nodes / second)") +
scale_x_continuous("Precision",
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.92, 0.94, 0.96, 0.98, 1.0)) +
# scale_shape_discrete(name = "", solid = FALSE) +
# guides(color = FALSE) +
scale_color_manual("n. iter", values = colors_discrete(4)) +
ggtitle(expression(
atop("Precision-Performance tradeoff, effect of increasing tth vs. max_iters",
atop(italic("(n = 10000; Upper Right is Better)")))))
```
Each line shows the precision-performance tradeoff that can be obtained by varying the number of trees and threshold while holding the number of iterations constant. As the chart shows, neighborhood exploration iterations move the precision-performance frontier outward. The marginal benefit of additional iterations declines as iterations are added.
This is consistent with the conclusions of the original paper authors.