🔨 Fix Semantic Segmentor Memory Spill Issue #988
base: dev-define-KongNet-model
Conversation
Pull request overview
This pull request addresses memory management issues in the semantic segmentor by implementing several optimizations: using zarr-backed arrays to reduce peak memory usage during batch processing, incrementally flushing data to disk instead of computing entire dask arrays at once, and improving memory threshold calculations. The changes aim to prevent out-of-memory errors when processing large whole slide images.
Changes:
- Modified memory threshold monitoring to use currently available memory instead of an initial snapshot
- Refactored prepare_full_batch to use zarr-backed arrays for intermediate storage
- Updated save_to_cache to incrementally flush data block-by-block (a sketch of this pattern follows this list)
- Modified merge_vertical_chunkwise to process probability blocks iteratively
- Added cleanup of temporary directories and improved zarr file removal
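A rough illustration of the incremental-flush pattern described above. This is a minimal sketch only, assuming a zarr buffer that grows along its first axis; the helper names flush_batch and memory_ok, the shapes, and the chunk sizes are illustrative and not taken from the PR.

```python
import numpy as np
import psutil
import zarr

def memory_ok(min_free_bytes: int = 4 * 1024**3) -> bool:
    """Gate further work on *currently available* memory, not an initial snapshot."""
    return psutil.virtual_memory().available > min_free_bytes

def flush_batch(buf: zarr.Array, batch: np.ndarray) -> None:
    """Grow the on-disk array along axis 0 and write the new batch to it."""
    start = buf.shape[0]
    buf.resize((start + batch.shape[0], *buf.shape[1:]))
    buf[start : start + batch.shape[0]] = batch

# A disk-backed buffer that grows batch-by-batch instead of holding every
# probability map in RAM at once.
buf = zarr.open(
    "probabilities.zarr",
    mode="w",
    shape=(0, 256, 256, 3),
    chunks=(16, 256, 256, 3),
    dtype="float32",
)

for _ in range(4):
    if not memory_ok():
        # The real engine would throttle or shrink batches here; the sketch
        # simply stops producing new work.
        break
    batch = np.random.rand(16, 256, 256, 3).astype("float32")
    flush_batch(buf, batch)
```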
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tiatoolbox/models/engine/semantic_segmentor.py | Core memory optimizations including zarr-backed storage, incremental data flushing, improved memory monitoring, and temporary file handling |
| tests/engines/test_semantic_segmentor.py | Adjusted assertion tolerances to account for numerical precision changes from zarr storage |
| """ | ||
| # Ensure we operate on NumPy to avoid Dask out-parameter issues when merging. | ||
| if not isinstance(blocks, np.ndarray): |
Copilot AI · Jan 27, 2026
Converting zarr arrays to numpy using np.asarray will load the entire zarr array into memory, which defeats the purpose of using zarr for memory-efficient storage. This could cause memory issues when processing large arrays. Since this function is called from merge_horizontal with canvas_np that may contain zarr arrays from prepare_full_batch, this conversion can trigger out-of-memory errors on large datasets.
Suggested change:

    -    if not isinstance(blocks, np.ndarray):
    +    # Do not convert zarr arrays to NumPy here to avoid loading the entire
    +    # array into memory; iterating over a zarr array yields NumPy blocks.
    +    if not isinstance(blocks, (np.ndarray, zarr.Array)):
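For context, a minimal sketch of pulling blocks from a zarr (or NumPy) array lazily along the first axis. The helper iter_first_axis_blocks is illustrative and not part of the PR; slicing a zarr.Array loads only the requested region.

```python
import numpy as np
import zarr

def iter_first_axis_blocks(arr, block_size: int):
    """Yield NumPy blocks along axis 0 without materialising the whole array."""
    for start in range(0, arr.shape[0], block_size):
        # Slicing a zarr.Array returns a NumPy array covering only this region,
        # so peak memory stays at one block rather than the full canvas.
        yield np.asarray(arr[start : start + block_size])

# Works the same way for an on-disk zarr array and an in-memory NumPy array.
z = zarr.zeros((1000, 64, 3), chunks=(100, 64, 3), dtype="float32")
for blk in iter_first_axis_blocks(z, block_size=100):
    assert isinstance(blk, np.ndarray)
```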
    canvas_block = canvas.blocks[block_idx, 0, 0].compute()
    count_block = count.blocks[block_idx, 0, 0].compute()
Copilot AI · Jan 27, 2026
The code assumes the dask arrays have 3D block structure by accessing blocks[block_idx, 0, 0]. If the canvas or count arrays have different dimensionality or block structure, this will raise an IndexError. Consider checking the dimensionality or using more flexible indexing to handle edge cases.
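One way to make the block lookup dimension-agnostic, sketched under the same assumption as the original code (a single chunk along every axis other than the first); get_block is an illustrative helper, not part of the PR.

```python
import dask.array as da
import numpy as np

def get_block(arr: da.Array, block_idx: int) -> np.ndarray:
    """Fetch one block along axis 0 for an array of any dimensionality."""
    # Build an index like (block_idx, 0, 0, ...) that matches arr.ndim.
    index = (block_idx,) + (0,) * (arr.ndim - 1)
    return arr.blocks[index].compute()

canvas = da.zeros((1024, 512, 3), chunks=(256, 512, 3))
first = get_block(canvas, 0)  # shape (256, 512, 3)
```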
    )
    raw_predictions["coordinates"] = da.concatenate(coordinates, axis=0)

    shutil.rmtree(save_path.with_name("full_batch_tmp"))
Copilot AI · Jan 27, 2026
The cleanup assumes the directory exists and will fail with FileNotFoundError if prepare_full_batch was never called or if no batches were processed. Consider checking existence first using Path.exists(), wrapping the call in a try-except block, or passing ignore_errors=True to shutil.rmtree so that cleanup cannot cause inference failures.
Suggested change:

    -    shutil.rmtree(save_path.with_name("full_batch_tmp"))
    +    tmp_dir = save_path.with_name("full_batch_tmp")
    +    if tmp_dir.exists():
    +        shutil.rmtree(tmp_dir)
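A one-line alternative with the same effect is shutil.rmtree(save_path.with_name("full_batch_tmp"), ignore_errors=True), which skips the explicit existence check.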
        save_path: Path | str = "temp_fullbatch",
        *,
        is_last: bool,
    ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
Copilot AI · Jan 27, 2026
The return type annotation specifies np.ndarray for the first element, but the function now returns a zarr.Array (created on line 1416). This type mismatch can cause issues downstream when the returned value is passed to concatenate_none (line 500), which only handles np.ndarray and da.Array types. Either convert the zarr array to numpy before returning, or update the type annotation and handle zarr arrays appropriately in concatenate_none.
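One possible direction for the second option, sketched only; concatenate_mixed is a hypothetical stand-in and not the toolbox's concatenate_none. Wrapping zarr-backed parts in dask keeps them on disk until the result is actually computed or written out.

```python
import dask.array as da
import numpy as np
import zarr

def concatenate_mixed(parts: list) -> da.Array:
    """Concatenate NumPy, dask, and zarr arrays lazily along axis 0."""
    wrapped = []
    for part in parts:
        if isinstance(part, zarr.Array):
            wrapped.append(da.from_zarr(part))   # lazy view of the on-disk store
        elif isinstance(part, np.ndarray):
            wrapped.append(da.from_array(part))  # wrap in-memory data
        else:
            wrapped.append(part)                 # assume it is already a dask array
    return da.concatenate(wrapped, axis=0)
```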
    save_path_dir.mkdir(parents=True, exist_ok=True)
    temp_dir = Path(tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(save_path_dir)))
Copilot AI · Jan 27, 2026
The function creates zarr arrays backed by temporary directories (line 1413), but these temporary directories are never cleaned up within this function. They are created for each batch and accumulate in the save_path directory. While line 584 in infer_wsi attempts cleanup with shutil.rmtree(save_path.with_name("full_batch_tmp")), this only removes the parent directory, not the individual tempfile.mkdtemp-created subdirectories if they still have open handles or if the zarr arrays are still being accessed. This can lead to resource leaks and disk space issues.
Suggested change:

    -    save_path_dir.mkdir(parents=True, exist_ok=True)
    -    temp_dir = Path(tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(save_path_dir)))
    +    # Create a dedicated parent directory for full-batch temporary stores.
    +    base_tmp_dir = save_path_dir.with_name("full_batch_tmp")
    +    base_tmp_dir.mkdir(parents=True, exist_ok=True)
    +    temp_dir = Path(
    +        tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(base_tmp_dir))
    +    )
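With the per-batch stores nested under a single full_batch_tmp directory, the shutil.rmtree call in infer_wsi can remove them all in one pass, provided save_path_dir and save_path resolve to the same parent directory.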
    -    if is_last:
    +    if is_last and len(full_output_locs):
             pad_len = len(full_output_locs)
             full_batch_output.resize(total_size + pad_len, *sample_shape)
Copilot AI · Jan 27, 2026
zarr.Array.resize accepts either individual dimension arguments or all dimensions as a single tuple. Passing the full target shape as one tuple, e.g. full_batch_output.resize((total_size + pad_len, *sample_shape)), is unambiguous and portable across zarr versions; passing a computed scalar followed by unpacked dimensions may raise a TypeError at runtime on versions that expect a single shape tuple.
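A standalone sketch of the tuple form recommended above; it is not the PR's code, and the shapes are made up.

```python
import zarr

# A resizable buffer that grows along axis 0.
z = zarr.zeros((8, 256, 256, 3), chunks=(8, 256, 256, 3), dtype="float32")

total_size = z.shape[0]
pad_len = 4
sample_shape = z.shape[1:]

# Passing the full target shape as one tuple is unambiguous across zarr versions.
z.resize((total_size + pad_len, *sample_shape))
assert z.shape == (12, 256, 256, 3)
```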
While I was working on the KongNet notebook PR, I encountered out-of-memory issues and Python kept crashing when processing a relatively large WSI. After investigating (and, I have to admit, getting lots of help from AI), we found the root cause to be the infer_wsi() function from SemanticSegmentor. I relied heavily on AI to make this work, as I find some parts quite difficult to understand, so careful review is needed.

The example slide I was trying to run was:
https://huggingface.co/datasets/TIACentre/TIAToolBox_Remote_Samples/blob/main/sample_wsis/D_P000019_PAS_CPG.tif

The code I was trying to run was:

Before this PR, the code kept crashing on my workstation, which has 32 GB of RAM; memory spiked to 100% just before the crash.
Root Causes:
Key Changes:
Problems Solved: