fix: build one partition at a time to avoid OOM #2484

Closed
wants to merge 7 commits into from

BubbleCal
Contributor

We can build a single HNSW graph in parallel, so we don't need to build many partitions at a time.

This still slows the build, because the IO operations (reading the shuffled files) can't be done in advance.
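
The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_partition_parallel` and `build_all_partitions` are hypothetical names, the "build" step only counts rows, and plain scoped threads from the standard library stand in for the real HNSW builder. The parallelism sits inside one partition while the partitions themselves are visited one after another.

```rust
use std::thread;

/// Hypothetical stand-in for building one partition's graph.
/// Splits the rows into chunks and processes the chunks on scoped threads.
/// Returns the number of rows processed.
fn build_partition_parallel(rows: &[u64], num_workers: usize) -> usize {
    if rows.is_empty() {
        // Empty partitions are skipped without spawning any work.
        return 0;
    }
    let workers = num_workers.max(1);
    // Ceiling division, so every row lands in some chunk.
    let chunk_size = (rows.len() + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = rows
            .chunks(chunk_size)
            // Placeholder work: a real builder would insert nodes here.
            .map(|chunk| s.spawn(move || chunk.len()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}

/// Visits partitions strictly one after another, so only one partition
/// is being built at any moment.
fn build_all_partitions(partitions: &[Vec<u64>], num_workers: usize) -> Vec<usize> {
    partitions
        .iter()
        .map(|rows| build_partition_parallel(rows, num_workers))
        .collect()
}
```

In this sketch nothing reads the next partition while the current one is being built, which is the IO cost mentioned above.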

Signed-off-by: BubbleCal <[email protected]>
github-actions bot added the bug label on Jun 18, 2024
@@ -296,8 +290,8 @@ pub(super) async fn write_hnsw_quantization_index_partitions(
let mut part_files = Vec::with_capacity(ivf.num_partitions());
let mut aux_part_files = Vec::with_capacity(ivf.num_partitions());
let tmp_part_dir = Path::from_filesystem_path(TempDir::new()?)?;
let mut tasks = Vec::with_capacity(ivf.num_partitions());
let sem = Arc::new(Semaphore::new(*HNSW_PARTITIONS_BUILD_PARALLEL));
let mut aux_ivf = IvfData::empty();
Contributor

Is this the right name? Why does aux_ivf contain IvfData?

Contributor Author

aux_ivf tracks the storage partitions.

if row_id_array.is_empty() {
tasks.push(tokio::spawn(async { Ok(0) }));
ivf.add_partition(offset, 0);
Contributor

In v2, do we track rows instead of byte offsets in the index file?

Contributor Author

Here, offset is a row offset; only IVF_PQ tracks byte offsets.
All indices will track the row range in v2.
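
To make the row-offset bookkeeping concrete, here is a small sketch. It assumes a hypothetical `PartitionRanges` type; this is not lance's `IvfData`, and only the `add_partition(offset, num_rows)` call shape is taken from the diff. An empty partition is recorded with length 0 at the current offset, so later partitions keep their positions.

```rust
use std::ops::Range;

/// Hypothetical tracker of per-partition row ranges (not lance's `IvfData`).
#[derive(Debug, Default)]
struct PartitionRanges {
    offsets: Vec<usize>,
    lengths: Vec<usize>,
}

impl PartitionRanges {
    /// Records a partition starting at row `offset` and spanning `num_rows` rows.
    fn add_partition(&mut self, offset: usize, num_rows: usize) {
        self.offsets.push(offset);
        self.lengths.push(num_rows);
    }

    /// Returns the half-open row range covered by partition `part_id`.
    fn row_range(&self, part_id: usize) -> Range<usize> {
        let start = self.offsets[part_id];
        start..start + self.lengths[part_id]
    }
}

/// Assigns consecutive row ranges given the number of rows in each partition.
fn track_partitions(rows_per_partition: &[usize]) -> PartitionRanges {
    let mut ranges = PartitionRanges::default();
    let mut offset = 0;
    for &num_rows in rows_per_partition {
        // An empty partition still gets an entry, with length 0.
        ranges.add_partition(offset, num_rows);
        offset += num_rows;
    }
    ranges
}
```

Because the ranges count rows, they do not depend on how each row is encoded on disk, unlike byte offsets.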

continue;
}
log::info!("Building HNSW partition {}", part_id);
let num_rows = build_hnsw_quantization_partition(
Contributor

Can we move this function to lance-index/src/hnsw?

BubbleCal
Contributor Author

#2552 can solve this problem

BubbleCal closed this on Aug 8, 2024