This repository contains scripts and notebooks for benchmarking scikit-learn nearest neighbors algorithms (brute force, k-d tree, and ball tree). This work is related to the scikit-learn neighbors heuristic issue #8213 and is being addressed in pull request #17148. Scikit-learn version 0.22.2.post1 is used.
The `sklearn_neighbors_benchmark` directory contains utilities to run experiments and save the results to `results.csv`.
`run_experiments.py` allows you to run a set of experiments, saving the results in `results.csv`. Note that duplicated experiments will be run again, but their results will not be saved.
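The deduplication behavior described above can be sketched with the standard library alone. The column names and the parameter key below are illustrative assumptions, not the repository's actual `results.csv` schema:

```python
import csv
import os

RESULTS_FILE = "results.csv"
# Hypothetical experiment key; the actual columns in results.csv may differ.
KEY_FIELDS = ["algorithm", "dataset", "n_samples", "n_features", "n_neighbors"]


def already_saved(params, path=RESULTS_FILE):
    """Return True if an experiment with the same key was saved before."""
    if not os.path.exists(path):
        return False
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if all(str(params[k]) == row.get(k) for k in KEY_FIELDS):
                return True
    return False


def save_result(params, result, path=RESULTS_FILE):
    """Append a result row, skipping duplicates of previously saved runs."""
    if already_saved(params, path):
        return False  # the experiment ran, but the result is not saved again
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=KEY_FIELDS + list(result))
        if new_file:
            writer.writeheader()
        writer.writerow({**params, **result})
    return True
```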
The `jakevdp_benchmark` directory contains Jake VanderPlas's benchmark, with a modified version that computes brute-force timings instead of estimating them.
Datasets used:

- `covertype`: forest cartographic variables, shape `(110_393, 54)`, version 1 from OpenML.
- `creditcard`: PCA-transformed credit card transaction variables, shape `(284_807, 29)`, version 1 from OpenML.
- `mnist_pca`: the first 100 MNIST PCA components (explaining 70% of the variance), shape `(70_000, 100)`, version 1 from OpenML.
- `synthetic_low_intrinsic_dim`: standard normal samples divided by 1,000 (except 5 features), shape `(110_000, 100)`.
- `synthetic_standard_normal`: standard normal samples, shape `(110_000, 100)`.
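The two synthetic datasets can be reproduced with a few lines of NumPy. Which 5 features keep unit scale in `synthetic_low_intrinsic_dim` is an assumption here; the repository may pick them differently:

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic_standard_normal: i.i.d. standard normal features.
X_normal = rng.standard_normal(size=(110_000, 100))

# synthetic_low_intrinsic_dim: same sampling, but every feature except 5
# is divided by 1,000, leaving ~5 dominant directions (low intrinsic dim).
# Assumption: the first 5 features are the ones left at unit scale.
X_low = rng.standard_normal(size=(110_000, 100))
X_low[:, 5:] /= 1_000
```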
Parameters studied:

- `algorithm`
- `dataset`
- `n_samples` at construction time
- `n_features`
- `n_neighbors`
- `n_jobs`: number of parallel jobs to run for the neighbors search
- `n_threads`: number of threads that can be used in OpenMP/BLAS thread pools
Results saved:

- `time_construction_mean`
- `time_construction_std`
- `time_querying_mean`
- `time_querying_std`
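A minimal sketch of how these four statistics could be collected. The `NearestNeighbors` estimator and its `fit`/`kneighbors` calls are scikit-learn's real API; the helper function and its signature are illustrative assumptions:

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors


def benchmark(X_fit, X_query, algorithm="kd_tree", n_neighbors=5, n_repeats=3):
    """Time index construction and querying, returning mean/std over repeats."""
    t_fit, t_query = [], []
    for _ in range(n_repeats):
        nn = NearestNeighbors(n_neighbors=n_neighbors, algorithm=algorithm)
        t0 = time.perf_counter()
        nn.fit(X_fit)  # construction: builds the tree (or stores data, for brute force)
        t_fit.append(time.perf_counter() - t0)
        t0 = time.perf_counter()
        nn.kneighbors(X_query)  # querying: k nearest neighbors of each query point
        t_query.append(time.perf_counter() - t0)
    return {
        "time_construction_mean": np.mean(t_fit),
        "time_construction_std": np.std(t_fit),
        "time_querying_mean": np.mean(t_query),
        "time_querying_std": np.std(t_query),
    }
```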
Miscellaneous:

- To get robust results, the number of query points is fixed to 10,000.
- Each experiment is repeated 3 times, with random feature sampling for real-world datasets.
- Real-world datasets are standardized.
- `metric` is fixed to `euclidean`.
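The preprocessing protocol above (random feature sampling per repetition, standardization, a held-out set of query points) can be sketched as follows; the function name and the exact way query points are split off are assumptions:

```python
import numpy as np


def prepare_repeat(X, n_features, n_query=10_000, seed=0):
    """Prepare one repetition: sample features, standardize, hold out queries.

    Mirrors the protocol above: each repetition samples a random feature
    subset of a real-world dataset, standardizes it, and holds out query
    points (10,000 in the actual benchmark) from the construction set.
    """
    rng = np.random.RandomState(seed)
    cols = rng.choice(X.shape[1], size=n_features, replace=False)
    Xs = X[:, cols]
    Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)  # standardize each feature
    query_idx = rng.choice(X.shape[0], size=n_query, replace=False)
    mask = np.ones(X.shape[0], dtype=bool)
    mask[query_idx] = False
    return Xs[mask], Xs[query_idx]  # (construction set, query set)
```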