Commit 06fc0b7

Preparing for 0.6.0 diskannpy release (#407)
* Some early staging for README updates and pyproject updates for a 0.6.0 release for diskannpy.
* Trying to fix the CI badge to point toward main's latest build
* Updating documentation for pdoc generation
* Documentation updates. Tightened up the API to drop list support (there were entirely too many cases where it wouldn't work, and it's easier to just tell people to convert it themselves)
* Some module reorganization to make pdoc actually display the docstrings for variables re-exported at the top level
* A copy paste happened that shouldn't have.
* Updating the apps to use the new 0.6.0 api
* Addressing PR feedback
* Some of the documentation changes didn't get made in both from_file or the constructor
1 parent 1eac702 commit 06fc0b7

19 files changed (+1109, -647 lines changed)

README.md

Lines changed: 17 additions & 9 deletions
@@ -1,6 +1,12 @@
 # DiskANN

-[![DiskANN Pull Request Build and Test](https://github.com/microsoft/DiskANN/actions/workflows/pr-test.yml/badge.svg)](https://github.com/microsoft/DiskANN/actions/workflows/pr-test.yml)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-NeurIPS%3A_DiskANN-blue)](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Arxiv%3A_Fresh--DiskANN-blue)](https://arxiv.org/abs/2105.09613)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Filtered--DiskANN-blue)](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf)
+[![DiskANN Main](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml/badge.svg?branch=main)](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml)
+[![PyPI version](https://img.shields.io/pypi/v/diskannpy.svg)](https://pypi.org/project/diskannpy/)
+[![Downloads shield](https://pepy.tech/badge/diskannpy)](https://pepy.tech/project/diskannpy)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

 DiskANN is a suite of scalable, accurate and cost-effective approximate nearest neighbor search algorithms for large-scale vector search that support real-time changes and simple filters.
 This code is based on ideas from the [DiskANN](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf), [Fresh-DiskANN](https://arxiv.org/abs/2105.09613) and the [Filtered-DiskANN](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf) papers with further improvements.
@@ -12,8 +18,6 @@ contact [[email protected]](mailto:[email protected]) with any additio

 See [guidelines](CONTRIBUTING.md) for contributing to this project.

-
-
 ## Linux build:

 Install the following packages through apt-get
@@ -71,12 +75,16 @@ OR for Visual Studio 2017 and earlier:
 ```
 <full-path-to-installed-cmake>\cmake ..
 ```
-* This will create a diskann.sln solution. Open it from VisualStudio and build either Release or Debug configuration.
-* Alternatively, use MSBuild:
+**This will create a diskann.sln solution**. Now you can:
+
+- Open it from VisualStudio and build either Release or Debug configuration.
+- `<full-path-to-installed-cmake>\cmake --build build`
+- Use MSBuild:
 ```
 msbuild.exe diskann.sln /m /nologo /t:Build /p:Configuration="Release" /property:Platform="x64"
 ```
-* This will also build gperftools submodule for libtcmalloc_minimal dependency.
+
+* This will also build gperftools submodule for libtcmalloc_minimal dependency.
 * Generated binaries are stored in the x64/Release or x64/Debug directories.

 ## Usage:
## Usage:
@@ -88,16 +96,16 @@ Please see the following pages on using the compiled code:
 - [Commandline examples for using in-memory streaming indices](workflows/dynamic_index.md)
 - [Commandline interface for building and search in memory indices with label data and filters](workflows/filtered_in_memory.md)
 - [Commandline interface for building and search SSD based indices with label data and filters](workflows/filtered_ssd_index.md)
-- To be added: Python interfaces and docker files
+- [diskannpy - DiskANN as a python extension module](python/README.md)

 Please cite this software in your work as:

 ```
 @misc{diskann-github,
-author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan}},
+author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash}},
 title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
 url = {https://github.com/Microsoft/DiskANN},
-version = {0.5},
+version = {0.6.0},
 year = {2023}
 }
 ```

pyproject.toml

Lines changed: 11 additions & 2 deletions
@@ -11,7 +11,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "diskannpy"
-version = "0.5.0.rc5"
+version = "0.6.0"

 description = "DiskANN Python extension module"
 readme = "python/README.md"
@@ -25,17 +25,26 @@ authors = [
     {name = "Dax Pryce", email = "[email protected]"}
 ]

+[project.optional-dependencies]
+dev = ["black", "isort", "mypy"]
+
 [tool.setuptools]
 package-dir = {"" = "python/src"}

+[tool.isort]
+profile = "black"
+multi_line_output = 3
+
+[tool.mypy]
+plugins = "numpy.typing.mypy_plugin"
+
 [tool.cibuildwheel]
 manylinux-x86_64-image = "manylinux_2_28"
 test-requires = ["scikit-learn~=1.2"]
 build-frontend = "build"
 skip = ["pp*", "*-win32", "*-manylinux_i686", "*-musllinux*"]
 test-command = "python -m unittest discover {project}/python/tests"

-
 [tool.cibuildwheel.linux]
 before-build = [
     "dnf makecache --refresh",

python/README.md

Lines changed: 23 additions & 8 deletions
@@ -1,9 +1,17 @@
 # diskannpy

+[![DiskANN Paper](https://img.shields.io/badge/Paper-NeurIPS%3A_DiskANN-blue)](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Arxiv%3A_Fresh--DiskANN-blue)](https://arxiv.org/abs/2105.09613)
+[![DiskANN Paper](https://img.shields.io/badge/Paper-Filtered--DiskANN-blue)](https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf)
+[![DiskANN Main](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml/badge.svg?branch=main)](https://github.com/microsoft/DiskANN/actions/workflows/push-test.yml)
+[![PyPI version](https://img.shields.io/pypi/v/diskannpy.svg)](https://pypi.org/project/diskannpy/)
+[![Downloads shield](https://pepy.tech/badge/diskannpy)](https://pepy.tech/project/diskannpy)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
 ## Installation
 Packages published to PyPI will always be built using the latest numpy major.minor release (at this time, 1.25).

-Conda distributions for versions 1.19-1.25 will be completed as a future effort. In the meantime, feel free to
+Conda distributions for versions 1.19-1.25 will be completed as a future effort. In the meantime, feel free to
 clone this repository and build it yourself.

 ## Local Build Instructions
## Local Build Instructions
@@ -16,11 +24,18 @@ build `diskannpy` with these additional instructions.
 In the root folder of DiskANN, there is a file `pyproject.toml`. You will need to edit the version of numpy in both the
 `[build-system.requires]` section, as well as the `[project.dependencies]` section. The version numbers must match.

+#### Linux
 ```bash
-python3.11 -m venv venv # versions from python3.8 and up should work. on windows, you might need to use py -3.11 -m venv venv
-source venv/bin/activate # linux
-# or
-venv\Scripts\Activate.{ps1, bat} # windows
+python3.11 -m venv venv # versions from python3.9 and up should work
+source venv/bin/activate
+pip install build
+python -m build
+```
+
+#### Windows
+```powershell
+py -3.11 -m venv venv # versions from python3.9 and up should work
+venv\Scripts\Activate.ps1
 pip install build
 python -m build
 ```
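
The README text in this hunk requires the numpy version in `[build-system.requires]` and in `[project.dependencies]` to match. A minimal sketch of checking that before running `python -m build`, assuming both pins are written as quoted requirement strings such as `"numpy==1.25"`. The helper names `numpy_pins` and `numpy_pins_match` are hypothetical and not part of diskannpy:

```python
import re

# Matches quoted requirement strings such as "numpy==1.25" or "numpy~=1.25".
# A version operator is required so that unrelated strings like
# "numpy.typing.mypy_plugin" are not mistaken for a requirement.
_NUMPY_PIN = re.compile(r'"(numpy\s*[=~<>!][^"]*)"')


def numpy_pins(pyproject_text):
    """Return every numpy requirement string found in the TOML text."""
    return _NUMPY_PIN.findall(pyproject_text)


def numpy_pins_match(pyproject_text):
    """True when at least two numpy pins exist and all are identical."""
    pins = numpy_pins(pyproject_text)
    return len(pins) >= 2 and len(set(pins)) == 1
```

Calling `numpy_pins_match(open("pyproject.toml").read())` from the DiskANN root would flag a mismatch before the build starts. A real TOML parser would be more robust than a regular expression; the regex keeps the sketch free of dependencies.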
@@ -31,10 +46,10 @@ The built wheel will be placed in the `dist` directory in your DiskANN root. Ins
 Please cite this software in your work as:
 ```
 @misc{diskann-github,
-author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan}},
+author = {Simhadri, Harsha Vardhan and Krishnaswamy, Ravishankar and Srinivasa, Gopal and Subramanya, Suhas Jayaram and Antonijevic, Andrija and Pryce, Dax and Kaczynski, David and Williams, Shane and Gollapudi, Siddarth and Sivashankar, Varun and Karia, Neel and Singh, Aditi and Jaiswal, Shikhar and Mahapatro, Neelam and Adams, Philip and Tower, Bryan and Patel, Yash}},
 title = {{DiskANN: Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search}},
 url = {https://github.com/Microsoft/DiskANN},
-version = {0.5},
+version = {0.6.0},
 year = {2023}
 }
 ```

python/apps/in-mem-dynamic.py

Lines changed: 14 additions & 15 deletions
@@ -40,26 +40,25 @@ def insert_and_search(
     npts, ndims = utils.get_bin_metadata(indexdata_file)

     if dtype_str == "float":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.float32, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.float32, querydata_file)
-        data = utils.bin_to_numpy(np.float32, indexdata_file)
+        dtype = np.float32
     elif dtype_str == "int8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.int8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.int8, querydata_file)
-        data = utils.bin_to_numpy(np.int8, indexdata_file)
+        dtype = np.int8
     elif dtype_str == "uint8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.uint8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.uint8, querydata_file)
-        data = utils.bin_to_numpy(np.uint8, indexdata_file)
+        dtype = np.uint8
     else:
         raise ValueError("data_type must be float, int8 or uint8")

+    index = diskannpy.DynamicMemoryIndex(
+        distance_metric="l2",
+        vector_dtype=dtype,
+        dimensions=ndims,
+        max_vectors=npts,
+        complexity=Lb,
+        graph_degree=graph_degree
+    )
+    queries = diskannpy.vectors_from_file(querydata_file, dtype)
+    data = diskannpy.vectors_from_file(indexdata_file, dtype)
+
     tags = np.zeros(npts, dtype=np.uintc)
     timer = utils.Timer()
     for i in range(npts):
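
The hunk above reads the point count and dimensionality with `utils.get_bin_metadata` and loads vectors with `diskannpy.vectors_from_file`, neither of which is shown in this commit view. As a rough, standard-library-only sketch of what such a reader might do, assume the `.bin` file starts with two little-endian 4-byte integers (number of points, then dimensions) followed by the vector values in row-major order. The function `read_bin` and the `_TYPECODES` table are hypothetical names for illustration, not diskannpy APIs:

```python
import struct
from array import array

# Assumed mapping from the app's dtype strings to `array` type codes:
# "f" is a 4-byte float, "b" a signed byte, "B" an unsigned byte.
_TYPECODES = {"float": "f", "int8": "b", "uint8": "B"}


def read_bin(path, dtype_str):
    """Read an assumed DiskANN-style .bin file.

    Returns (npts, ndims, values), where `values` is a flat array and
    vector i occupies values[i * ndims:(i + 1) * ndims].
    """
    try:
        typecode = _TYPECODES[dtype_str]
    except KeyError:
        raise ValueError("data_type must be float, int8 or uint8") from None

    with open(path, "rb") as f:
        # Header: two little-endian 32-bit integers.
        npts, ndims = struct.unpack("<ii", f.read(8))
        values = array(typecode)
        # `array` uses native byte order, so the float case assumes a
        # little-endian host. Raises EOFError if the file is truncated.
        values.fromfile(f, npts * ndims)
    return npts, ndims, values
```

A table lookup replaces the `if`/`elif` chain here for the same reason the commit hoists `dtype` out of the branches: the type is chosen once and everything after it is written a single time.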

python/apps/insert-in-clustered-order.py

Lines changed: 15 additions & 16 deletions
@@ -24,26 +24,25 @@ def insert_and_search(
     npts, ndims = utils.get_bin_metadata(indexdata_file)

     if dtype_str == "float":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.float32, ndims, npts, Lb, graph_degree, False
-        )
-        queries = utils.bin_to_numpy(np.float32, querydata_file)
-        data = utils.bin_to_numpy(np.float32, indexdata_file)
+        dtype = np.float32
     elif dtype_str == "int8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.int8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.int8, querydata_file)
-        data = utils.bin_to_numpy(np.int8, indexdata_file)
+        dtype = np.int8
     elif dtype_str == "uint8":
-        index = diskannpy.DynamicMemoryIndex(
-            "l2", np.uint8, ndims, npts, Lb, graph_degree
-        )
-        queries = utils.bin_to_numpy(np.uint8, querydata_file)
-        data = utils.bin_to_numpy(np.uint8, indexdata_file)
+        dtype = np.uint8
     else:
         raise ValueError("data_type must be float, int8 or uint8")

+    index = diskannpy.DynamicMemoryIndex(
+        distance_metric="l2",
+        vector_dtype=dtype,
+        dimensions=ndims,
+        max_vectors=npts,
+        complexity=Lb,
+        graph_degree=graph_degree
+    )
+    queries = diskannpy.vectors_from_file(querydata_file, dtype)
+    data = diskannpy.vectors_from_file(indexdata_file, dtype)
+
     offsets, permutation = utils.cluster_and_permute(
         dtype_str, npts, ndims, data, num_clusters
     )
@@ -52,7 +51,7 @@ def insert_and_search(
     timer = utils.Timer()
     for c in range(num_clusters):
         cluster_index_range = range(offsets[c], offsets[c + 1])
-        cluster_indices = np.array(permutation[cluster_index_range], dtype=np.uintc)
+        cluster_indices = np.array(permutation[cluster_index_range], dtype=np.uint32)
         cluster_data = data[cluster_indices, :]
         index.batch_insert(cluster_data, cluster_indices + 1, num_insert_threads)
         print('Inserted cluster', c, 'in', timer.elapsed(), 's')
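
The loop in this hunk treats `offsets` as cluster boundaries into `permutation`: cluster `c` owns the entries from `offsets[c]` up to, but not including, `offsets[c + 1]`, and each row index is passed to `batch_insert` as the tag `index + 1`. A small plain-Python sketch of that slicing, assuming `offsets` has one more entry than there are clusters. The function `cluster_batches` is a hypothetical helper, and lists stand in for the numpy arrays the app uses:

```python
def cluster_batches(offsets, permutation):
    """Yield (cluster_id, row_indices, tags) for each cluster.

    `offsets` holds the start of each cluster inside `permutation`,
    plus one final entry marking the end of the last cluster.
    """
    for c in range(len(offsets) - 1):
        rows = permutation[offsets[c]:offsets[c + 1]]
        # Mirror the app: tags are the row indices shifted by one.
        tags = [r + 1 for r in rows]
        yield c, rows, tags


if __name__ == "__main__":
    # Two clusters: the first owns rows 3 and 0, the second rows 4, 1, 2.
    for c, rows, tags in cluster_batches([0, 2, 5], [3, 0, 4, 1, 2]):
        print("cluster", c, "rows", rows, "tags", tags)
```

On the dtype change itself: `np.uintc` is the platform's C `unsigned int`, while `np.uint32` is always 32 bits wide, so the new line fixes the width explicitly.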
