fix: build one partition at a time to avoid OOM #2484
We can build a single HNSW graph in parallel, so we don't need to build many partitions at a time. This still slows the build, because the IO operations (reading the shuffled files) can't be done in advance.
Signed-off-by: BubbleCal <[email protected]>
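The trade-off described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in this PR: `Partition`, `load_partition`, `build_partition`, and `build_all` are hypothetical stand-ins, and the "graph build" is replaced by a trivial parallel row count. The point is only the shape of the loop: partitions are loaded and built strictly one after another, so at most one partition's data is resident at a time, while the work inside a partition is spread across threads.

```rust
use std::thread;

/// Hypothetical stand-in for one partition's shuffled vectors.
struct Partition {
    rows: Vec<f32>,
}

/// Hypothetical loader; a real build would read a shuffled file here.
fn load_partition(part_id: usize) -> Partition {
    // Partition 1 is left empty to exercise the skip path.
    let len = if part_id == 1 { 0 } else { 1000 * (part_id + 1) };
    Partition { rows: vec![1.0; len] }
}

/// Hypothetical stand-in for building one graph with `workers` threads.
/// Returns the number of rows processed.
fn build_partition(part: &Partition, workers: usize) -> usize {
    if part.rows.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so every row lands in some chunk.
    let chunk = (part.rows.len() + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = part
            .rows
            .chunks(chunk)
            .map(|c| s.spawn(move || c.len()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}

/// Builds partitions sequentially: load, build in parallel, drop, repeat.
fn build_all(num_partitions: usize, workers: usize) -> Vec<usize> {
    let mut rows_per_partition = Vec::with_capacity(num_partitions);
    for part_id in 0..num_partitions {
        // The load cannot overlap with the previous build in this layout,
        // which is the IO cost mentioned in the description.
        let part = load_partition(part_id);
        rows_per_partition.push(build_partition(&part, workers));
        // `part` is dropped here, before the next partition is loaded.
    }
    rows_per_partition
}
```

Calling `build_all(3, 4)` processes three partitions with four worker threads each and returns the per-partition row counts. Overlapping the next load with the current build would recover the lost IO time, at the cost of holding two partitions in memory.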
@@ -296,8 +290,8 @@ pub(super) async fn write_hnsw_quantization_index_partitions(
let mut part_files = Vec::with_capacity(ivf.num_partitions());
let mut aux_part_files = Vec::with_capacity(ivf.num_partitions());
let tmp_part_dir = Path::from_filesystem_path(TempDir::new()?)?;
let mut tasks = Vec::with_capacity(ivf.num_partitions());
let sem = Arc::new(Semaphore::new(*HNSW_PARTITIONS_BUILD_PARALLEL));
let mut aux_ivf = IvfData::empty();
Is this the right name? Why does `aux_ivf` contain `IvfData`?
`aux_ivf` is for tracking the storage partitions.
if row_id_array.is_empty() {
tasks.push(tokio::spawn(async { Ok(0) }));
ivf.add_partition(offset, 0);
In v2, do we track rows instead of byte offsets in the index file?
Here `offset` is the row offset; only IVF_PQ tracks byte offsets.
All indices will track the row range in v2.
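To make the row-offset bookkeeping concrete, here is a small sketch of tracking each partition as a row range rather than a byte range. `PartitionIndex` and `index_partitions` are hypothetical names and this is not the `IvfData` type from the codebase; only the `add_partition(offset, len)` call shape is borrowed from the diff above. An empty partition is still recorded, with length 0, so partition IDs stay aligned with their positions.

```rust
use std::ops::Range;

/// Hypothetical, simplified partition index that stores row ranges.
#[derive(Debug, Default)]
struct PartitionIndex {
    offsets: Vec<usize>,
    lengths: Vec<usize>,
}

impl PartitionIndex {
    /// Records a partition starting at row `offset` and spanning `len` rows.
    fn add_partition(&mut self, offset: usize, len: usize) {
        self.offsets.push(offset);
        self.lengths.push(len);
    }

    /// Returns the half-open row range covered by `part_id`.
    fn row_range(&self, part_id: usize) -> Range<usize> {
        let start = self.offsets[part_id];
        start..start + self.lengths[part_id]
    }
}

/// Assigns consecutive row ranges, keeping empty partitions as zero-length entries.
fn index_partitions(rows_per_partition: &[usize]) -> PartitionIndex {
    let mut index = PartitionIndex::default();
    let mut offset = 0;
    for &num_rows in rows_per_partition {
        index.add_partition(offset, num_rows);
        offset += num_rows;
    }
    index
}
```

With `index_partitions(&[3, 0, 2])`, the middle partition gets an empty range that starts where the first one ends, and the third partition continues from the same row.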
continue;
}
log::info!("Building HNSW partition {}", part_id);
let num_rows = build_hnsw_quantization_partition(
Can we move this function to `lance-index/src/hnsw`?
#2552 can solve this problem.