
Conversation

Collaborator

@Jiaqi-Lv Jiaqi-Lv commented Jan 27, 2026

While I was working on the KongNet notebook PR, I ran into out-of-memory issues: Python kept crashing when processing a relatively large WSI. After investigating (with, admittedly, a lot of help from AI), we traced the root cause to the infer_wsi() function in SemanticSegmentor.

I relied heavily on AI to make this work, as I found some parts quite difficult to understand, so careful review is needed.

The example slide was: https://huggingface.co/datasets/TIACentre/TIAToolBox_Remote_Samples/blob/main/sample_wsis/D_P000019_PAS_CPG.tif. The code I was running was:

from pathlib import Path

from tiatoolbox.models.engine.semantic_segmentor import SemanticSegmentor

segmentor = SemanticSegmentor(model="fcn_resnet50_unet-bcss")
out = segmentor.run(
    images=[Path(wsi_path)],
    patch_mode=False,
    device="cuda",
    save_dir=output_path,
    overwrite=True,
    output_type="annotationstore",
    auto_get_mask=True,
    memory_threshold=25,
    num_workers=0,
    batch_size=8,
)

Before this PR, this code kept crashing on my workstation (32 GB of RAM); memory usage spiked to 100% just before the crash.

Root Causes:

  • Horizontal merge built per-row canvases spanning the full slide width; these large arrays spiked RAM near the end of WSI processing.
  • Masked-output alignment allocated dense zero-filled arrays in RAM for sparse locations.
  • The probability spill path forced a full compute() of the Dask array (see the back-of-the-envelope estimate below).
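
For a sense of scale, here is a back-of-the-envelope estimate of a full-slide probability map. The dimensions and class count below are purely illustrative (not measured from the sample slide), but they show why forcing a full compute() cannot work on a 32 GB machine:

import numpy as np

# Hypothetical slide dimensions and class count, for illustration only.
width_px, height_px, num_classes = 100_000, 80_000, 4
bytes_per_value = np.dtype(np.float32).itemsize  # 4 bytes

full_map_gib = width_px * height_px * num_classes * bytes_per_value / 1024**3
print(f"{full_map_gib:.0f} GiB")  # ~119 GiB, far more than 32 GB of RAM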

Key Changes:

  • Row-bounded horizontal merge: limit canvas width to the current row span, shrinking per-row allocations. [semantic_segmentor.py:1070-1081]
  • NumPy coercion before in-place merges to avoid Dask out= parameter errors. [semantic_segmentor.py:1014-1022]
  • Incremental save_to_cache: flush canvas/count row-chunks to Zarr without materializing the full arrays. [semantic_segmentor.py:1103-1158]
  • Streaming probability spill: write block-by-block instead of computing the whole Dask array; handle 2D/3D blocks with ndim-aware indexing (a minimal sketch follows this list). [semantic_segmentor.py:1234-1268], [semantic_segmentor.py:1263-1267]
  • Full-batch placement to disk: masked-output alignment now uses per-batch temp Zarr directories, eliminating giant in-RAM zero arrays and avoiding chunk-shape collisions. Call site: [semantic_segmentor.py:497-505]; impl: [semantic_segmentor.py:1355-1413]
  • Cleanup/maintenance: docstring fixes.
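
To make the streaming-spill item above concrete, here is a minimal, self-contained sketch of the idea. It is not the PR's actual implementation; the function name, chunking, and Zarr layout are assumptions for illustration:

import dask.array as da
import zarr

def spill_dask_to_zarr(probabilities: da.Array, spill_path: str) -> zarr.Array:
    """Write a Dask array to a Zarr store one block slab at a time.

    Avoids calling probabilities.compute(), which would materialize the
    whole array in RAM; peak memory stays around one slab of blocks.
    """
    out = zarr.open(
        spill_path,
        mode="w",
        shape=probabilities.shape,
        chunks=probabilities.chunksize,
        dtype=probabilities.dtype,
    )
    offset = 0
    # Iterate over block slabs along axis 0; each slab is computed individually.
    for block_idx in range(probabilities.numblocks[0]):
        block = probabilities.blocks[block_idx].compute()
        # ndim-aware write: the trailing ellipsis covers both 2D and 3D blocks.
        out[offset : offset + block.shape[0], ...] = block
        offset += block.shape[0]
    return out

The real code distinguishes 2D and 3D blocks explicitly; the ellipsis above is simply the shortest way to express that in a sketch.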

Problems Solved:

  • OOMs and late-stage crashes during WSI processing.
  • Masked-output alignment allocating massive dense arrays in RAM.
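
On the masked-output alignment point, here is a rough sketch of the disk-backed placement used to avoid dense in-RAM zeros. Names, chunking, and the helper function are assumptions for illustration, not the actual code:

import tempfile
from pathlib import Path

import numpy as np
import zarr

def place_batch_on_disk(
    batch_outputs: np.ndarray,
    batch_indices: np.ndarray,
    total_locations: int,
    tmp_root: Path,
) -> zarr.Array:
    """Scatter a batch of masked outputs into a disk-backed Zarr array.

    Instead of allocating np.zeros((total_locations, *sample_shape)) in RAM,
    create a Zarr array in a per-batch temporary directory and write only
    the rows that actually have outputs.
    """
    sample_shape = batch_outputs.shape[1:]
    store_dir = tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(tmp_root))
    full = zarr.open(
        store_dir,
        mode="w",
        shape=(total_locations, *sample_shape),
        # One row per chunk keeps chunk shapes consistent across batches.
        chunks=(1, *sample_shape),
        dtype=batch_outputs.dtype,
    )
    for row, idx in zip(batch_outputs, batch_indices):
        full[int(idx)] = row
    return full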

@Jiaqi-Lv Jiaqi-Lv requested a review from Copilot January 27, 2026 17:17
@Jiaqi-Lv Jiaqi-Lv self-assigned this Jan 27, 2026
@Jiaqi-Lv Jiaqi-Lv added the enhancement New feature or request label Jan 27, 2026
@Jiaqi-Lv Jiaqi-Lv changed the title 🔨 Fix Semantic Segmentor Memory Issue 🔨 Fix Semantic Segmentor Memory Spill Issue Jan 27, 2026
Contributor

Copilot AI left a comment

Pull request overview

This pull request addresses memory management issues in the semantic segmentor by implementing several optimizations: using zarr-backed arrays to reduce peak memory usage during batch processing, incrementally flushing data to disk instead of computing entire dask arrays at once, and improving memory threshold calculations. The changes aim to prevent out-of-memory errors when processing large whole slide images.

Changes:

  • Modified memory threshold monitoring to use currently available memory instead of initial snapshot
  • Refactored prepare_full_batch to use zarr-backed arrays for intermediate storage
  • Updated save_to_cache to incrementally flush data block-by-block
  • Modified merge_vertical_chunkwise to process probability blocks iteratively
  • Added cleanup of temporary directories and improved zarr file removal
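
As a rough illustration of the first item above (checking currently available memory rather than a start-up snapshot), the logic is essentially a re-query of system memory on every check. A minimal sketch, assuming psutil and a threshold expressed in GB (the PR's actual unit and function name may differ):

import psutil

def below_memory_threshold(memory_threshold_gb: float) -> bool:
    """Return True when currently available RAM drops below the threshold.

    Re-querying psutil each time tracks real memory pressure, instead of
    comparing against a snapshot taken once at start-up.
    """
    available_gb = psutil.virtual_memory().available / 1024**3
    return available_gb < memory_threshold_gb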

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

  • tiatoolbox/models/engine/semantic_segmentor.py: Core memory optimizations including zarr-backed storage, incremental data flushing, improved memory monitoring, and temporary file handling
  • tests/engines/test_semantic_segmentor.py: Adjusted assertion tolerances to account for numerical precision changes from zarr storage


"""
# Ensure we operate on NumPy to avoid Dask out-parameter issues when merging.
if not isinstance(blocks, np.ndarray):

Copilot AI Jan 27, 2026

Converting zarr arrays to numpy using np.asarray will load the entire zarr array into memory, which defeats the purpose of using zarr for memory-efficient storage. This could cause memory issues when processing large arrays. Since this function is called from merge_horizontal with canvas_np that may contain zarr arrays from prepare_full_batch, this conversion can trigger out-of-memory errors on large datasets.

Suggested change
-    if not isinstance(blocks, np.ndarray):
+    # Do not convert zarr arrays to NumPy here to avoid loading the entire
+    # array into memory; iterating over a zarr array yields NumPy blocks.
+    if not isinstance(blocks, (np.ndarray, zarr.Array)):

Comment on lines +1168 to +1169
canvas_block = canvas.blocks[block_idx, 0, 0].compute()
count_block = count.blocks[block_idx, 0, 0].compute()

Copilot AI Jan 27, 2026

The code assumes the dask arrays have 3D block structure by accessing blocks[block_idx, 0, 0]. If the canvas or count arrays have different dimensionality or block structure, this will raise an IndexError. Consider checking the dimensionality or using more flexible indexing to handle edge cases.

)
raw_predictions["coordinates"] = da.concatenate(coordinates, axis=0)

shutil.rmtree(save_path.with_name("full_batch_tmp"))

Copilot AI Jan 27, 2026

The cleanup assumes the directory exists and will fail with FileNotFoundError if prepare_full_batch was never called or if no batches were processed. Consider checking existence first using Path.exists() or wrapping in a try-except block with ignore_errors=True parameter for shutil.rmtree to prevent this from causing inference failures.

Suggested change
-    shutil.rmtree(save_path.with_name("full_batch_tmp"))
+    tmp_dir = save_path.with_name("full_batch_tmp")
+    if tmp_dir.exists():
+        shutil.rmtree(tmp_dir)

save_path: Path | str = "temp_fullbatch",
*,
is_last: bool,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:

Copilot AI Jan 27, 2026

The return type annotation specifies np.ndarray for the first element, but the function now returns a zarr.Array (created on line 1416). This type mismatch can cause issues downstream when the returned value is passed to concatenate_none (line 500), which only handles np.ndarray and da.Array types. Either convert the zarr array to numpy before returning, or update the type annotation and handle zarr arrays appropriately in concatenate_none.

Comment on lines +1412 to +1413
save_path_dir.mkdir(parents=True, exist_ok=True)
temp_dir = Path(tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(save_path_dir)))

Copilot AI Jan 27, 2026

The function creates zarr arrays backed by temporary directories (line 1413), but these temporary directories are never cleaned up within this function. They are created for each batch and accumulate in the save_path directory. While line 584 in infer_wsi attempts cleanup with shutil.rmtree(save_path.with_name("full_batch_tmp")), this only removes the parent directory, not the individual tempfile.mkdtemp-created subdirectories if they still have open handles or if the zarr arrays are still being accessed. This can lead to resource leaks and disk space issues.

Suggested change
-    save_path_dir.mkdir(parents=True, exist_ok=True)
-    temp_dir = Path(tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(save_path_dir)))
+    # Create a dedicated parent directory for full-batch temporary stores.
+    base_tmp_dir = save_path_dir.with_name("full_batch_tmp")
+    base_tmp_dir.mkdir(parents=True, exist_ok=True)
+    temp_dir = Path(
+        tempfile.mkdtemp(prefix="full_batch_tmp_", dir=str(base_tmp_dir))
+    )

-    if is_last:
+    if is_last and len(full_output_locs):
         pad_len = len(full_output_locs)
         full_batch_output.resize(total_size + pad_len, *sample_shape)

Copilot AI Jan 27, 2026

The zarr.Array.resize method signature typically requires individual dimension arguments, not a tuple followed by unpacked values. This line should be full_batch_output.resize(total_size + pad_len, *sample_shape) or pass all dimensions as a single tuple. The current syntax mixing tuple and unpacking may cause a TypeError at runtime.

@Jiaqi-Lv Jiaqi-Lv marked this pull request as ready for review January 27, 2026 18:33
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@shaneahmed shaneahmed added this to the Release v2.0.0 milestone Jan 28, 2026
@shaneahmed shaneahmed changed the base branch from dev-define-KongNet-model to dev-define-engines-abc January 28, 2026 09:50
@shaneahmed shaneahmed changed the base branch from dev-define-engines-abc to dev-define-KongNet-model January 28, 2026 09:50