Add afterMerge() lifecycle hook to KnnVectorsWriter and PerFieldKnnVe…#15952
Add afterMerge() lifecycle hook to KnnVectorsWriter and PerFieldKnnVe…#15952MrFlap wants to merge 1 commit intoapache:mainfrom
Conversation
|
@MrFlap can you add entry in change log and also unit test for the new interface |
| if (mergeState.infoStream.isEnabled("VV")) { | ||
| mergeState.infoStream.message("VV", "merge done " + mergeState.segmentInfo); | ||
| if (mergeState.infoStream.isEnabled("VV")) { | ||
| mergeState.infoStream.message("VV", "merge done " + mergeState.segmentInfo); | ||
| } | ||
| } | ||
| } | ||
| } finally { | ||
| afterMerge(); | ||
| } | ||
| finish(); |
There was a problem hiding this comment.
Why doesn't finish work? It seems like the writer should know its in a merge context already and finish should just handle that? Adding a new interface seems unnecessary?
There was a problem hiding this comment.
We would like to manage the threads in PerFieldKnnVectorsFormat, which doesn't get a callback once KnnVectorsWriter.finish() happens. Also, the suggested afterMerge() runs in a finally block, so we can be sure to close out the threads even on exception.
There was a problem hiding this comment.
- On an exception with merge, all files, etc. should be closed, do we want to move finish into the
finally? There may be edge cases, for sure... - Are we sure extending PerFieldKnnVectorsFormat is the way forward here? My point is that this type of thing is doable via experts in Lucene in different ways than adding another top-level API to the KnnVectorFormat which then impacts all formats.
I want to be very careful about adding an expert level API that Apache Lucene itself doesn't actually use.
There was a problem hiding this comment.
On an exception with merge, all files, etc. should be closed, do we want to move finish into the finally? There may be edge cases, for sure...
I'm not sure I can give a completely educated call on this, but it seems like the right thing to do. If mergeOneField throws mid-way, the finish() call will be skipped.
Are we sure extending PerFieldKnnVectorsFormat is the way forward here? My point is that this type of thing is doable via experts in Lucene in different ways than adding another top-level API to the KnnVectorFormat which then impacts all formats.
Understandable. I'm not a Lucene expert, but I don't think there is an easy way for PerFieldKnnVectorsFormat to know when a merge completes. I'll continue looking into this. cc @navneet1v
There was a problem hiding this comment.
I am saying, that any expert extending PerFieldKnnVectorsFormat can likely do this in another way and requiring a Lucene change that Lucene doesn't benefit from is a tricky situation.
I am just wondering if this is the best way for Lucene.
Adding these. |
…ctorsFormat Signed-off-by: Andrew Klepchick <aklepchi@amazon.com>
Resolves #15935
Problem
PerFieldKnnVectorsFormatsubclasses that allocate merge-time resources (such as thread pools for concurrent HNSW graph construction) have no way to know when a merge completes.getKnnVectorsFormatForFieldis called per-field during merge, but there is no corresponding callback when the merge finishes. This makes it impossible to cleanly release per-merge resources, leading to resource leaks.Solution
Add a
protected void afterMerge() throws IOExceptionmethod with a no-op default to bothKnnVectorsWriterandPerFieldKnnVectorsFormat.A consumer can then override
afterMerge()to shrink a shared thread pool after each merge completes, ensuring the pool scales elastically with the number of concurrent merges rather than leaking threads indefinitely.Changes
KnnVectorsWriter.java: AddedafterMerge()no-op method. Wrapped the field merge loop inmerge()withtry/finally { afterMerge() }.PerFieldKnnVectorsFormat.java: AddedafterMerge()no-op method on the format.FieldsWriteroverridesafterMerge()to delegate toPerFieldKnnVectorsFormat.this.afterMerge().cc @navneet1v @shatejas