Batch inference issues when ROI close to image edge #45

Closed
xalexalex opened this issue Nov 12, 2023 · 12 comments · Fixed by #47
Labels
bug Something isn't working


@xalexalex

When the ROI for inference is close to (or touches) the image edge, I get the following error:
[image: screenshot of the error]

I can reproduce it on both the OpenSlide and BioFormats backends. With the new version of the wsinfer extension (v0.3.0), if I set a batch size = 1, I don't get the error but whenever inference encounters a tile on the edge, it is markedly slower (I can see the hiccups in the progress bar: at first once every few batches [because only some batches will end with a bottom-edge tile] and then for each batch [since on the last column, each batch has an edge tile]).

With batch size >1, if after the error I look at the detections, I can see detections up to and excluding the first batch containing an edge tile.

example video showing the behavior with batch size 4 and then 1.

@kaczmarj kaczmarj added the bug Something isn't working label Nov 12, 2023
@kaczmarj
Collaborator

kaczmarj commented Nov 12, 2023

thanks for the bug report. the video is very helpful!

i wonder if the hiccups are caused by padding the images to get them to the required size.

but this also brings up a concern for me. the images should be 224x224 after resizing. how is an image of 188x188 attempting to be batched with a 224x224 image, if all of the images should be resized to 224x224? is the 188x188 not undergoing the resizing for some reason? (but of course, a truncated patch should not be upscaled to 224x224, because the physical spacing will be wrong).

also i am wondering why the image is a square (188x188) as opposed to a rectangle where one direction is shorter than the other direction. (although 188x188 could be the bottom right corner patch).

if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?

by the way @xalexalex - how do you record your videos? i would like to do something similar :)

@xalexalex
Author

xalexalex commented Nov 12, 2023

> i wonder if the hiccups are caused by padding the images to get them to the required size.

I thought the same thing, but it seems to take a bit too much time for a simple padding. I suspect something else is going on.

> also i am wondering why the image is a square (188x188) as opposed to a rectangle where one direction is shorter than the other direction. (although 188x188 could be the bottom right corner patch).

Good catch. Are we perhaps assuming that width == height? In the case in the video, the tile that triggers the error was a bottom-edge tile, so width > height; it was definitely not the bottom-right corner tile.

> if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?

I will enumerate possible solutions from quick & hacky to thoughtful & expensive:

  1. simply discard the edge tile (might be a temporary quickfix.)
  2. if ROI intersects image boundary, tile from the image boundary backwards, so that the last tile is level with the image boundary. if ROI spans the whole width (or height) of the image and thus you can't avoid this problem on both sides, fall back to (1).
  3. pad with something standard, e.g. white (or qupath's white for the given image)
  4. leave the choice to the user in the config.json of each model, with (3) being the default

I would vote for 1. At most 2. But I think 1 will solve this problem quickly and nobody will ever complain. In my WSIs edge tiles are always uninformative and this error is simply a hindrance.
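To make options 1 to 3 concrete, here is a minimal sketch along a single axis. It is not code from the extension: `tile_origins` and its parameters are hypothetical names, and the "shift" branch simply skips the tile when it is larger than the image, which corresponds to the fallback to (1).

```python
def tile_origins(roi_start, roi_end, tile, image_size, strategy="discard"):
    """Return tile start coordinates along one axis of the image.

    strategy (hypothetical names for the options listed above):
      "discard" - drop tiles that extend past the image (option 1)
      "shift"   - move an edge tile back inside the image (option 2)
      "pad"     - keep the tile as requested; the caller pads it (option 3)
    """
    if strategy not in ("discard", "shift", "pad"):
        raise ValueError(f"unknown strategy: {strategy}")
    origins = []
    for start in range(roi_start, roi_end, tile):
        inside = start >= 0 and start + tile <= image_size
        if inside or strategy == "pad":
            origins.append(start)
        elif strategy == "shift" and tile <= image_size:
            # Clamp the tile so it lies fully inside the image.
            shifted = max(0, min(start, image_size - tile))
            if shifted not in origins:
                origins.append(shifted)
        # Otherwise the tile is discarded.
    return origins
```

With "shift", the last tile overlaps its neighbour, so detections in the overlap would need to be handled somehow (for example by keeping only one of them).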

> by the way @xalexalex - how do you record your videos? i would like to do something similar :)

This is going to be very crude, but:

  • open a terminal
  • sleep 1; ffmpeg -video_size 1920x1080 -framerate 25 -f x11grab -i :0.0 output.mp4
  • quickly switch to what you want to record
  • when you're done, go back to the terminal and CTRL-C

@xalexalex
Author

Quick update: the error also happens on the left and top edges. So solution (1) seems to be the quickest way to fix this, whereas (2) is not doable until we understand where the actual problem is.

@petebankhead
Member

> if this isn't already done, we should decide how to pad the images. we could pad them with some constant color (like white). or we could mirror the patch. we could also exclude the patch if it extends past the boundaries of the slide. thoughts everyone?

Does the WSInfer Python code have a strategy for this?

To clarify: the QuPath implementation of inference is completely independent. It should agree with whatever is done in Python as much as possible for consistency, but it is difficult to guarantee identical results because some core operations might be implemented differently (e.g. the precise interpolation used when resizing tiles, which can make a big difference).

petebankhead added a commit to petebankhead/qupath-extension-wsinfer that referenced this issue Nov 13, 2023
Fix qupath#45
This uses zero-padding (other padding may be preferable).

Also fix threading bug when selected objects are updated from another thread (e.g. a script) and the dialog is showing.

Slightly reduce vertical height by reducing padding.
@petebankhead petebankhead mentioned this issue Nov 13, 2023
@petebankhead
Member

This PR addresses this by using zero-padding: #47

That is what should have been happening already... I just missed the bug because I was restricted to a batch size of 1 on my Mac (which is no longer a restriction).

Other boundary criteria could be considered if WSInfer handles it differently in Python.

I also added a comment where the tile resizing is applied:

// For example, using the Python WSInfer 0.5.0 output for the image at
// https://github.com/qupath/qupath-docs/issues/89 (30619 tiles):
//  BufferedImageTools Tumor prob Mean Absolute Difference: 0.0026328298250342763
//  OpenCV Tumor prob Mean Absolute Difference:             0.07625036735485102

Basically, the method of interpolation makes a difference in how well the Python and QuPath implementations agree. I believe Python uses bilinear interpolation, but the results quoted above both use bilinear interpolation, just implemented differently and this is enough to cause disagreements.

I think perfect agreement between Python and QuPath would be very difficult to achieve (and require some substantial changes), but this figure gives some idea of the difference. If I use something other than bilinear interpolation in QuPath, I see much larger disagreements.
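As a toy illustration of how two "linear" resizers can disagree, the sketch below resamples a 1-D row using two common coordinate conventions and reports the mean absolute difference. It is not the code of BufferedImageTools, OpenCV, or Pillow; `resize_linear` and `mean_absolute_difference` are hypothetical helpers, and the two conventions are only one possible source of disagreement.

```python
def resize_linear(values, new_len, align_corners):
    """Resample a 1-D sequence with linear interpolation.

    align_corners=True  : first/last output samples sit on first/last inputs
    align_corners=False : output samples sit at pixel centres
    """
    n = len(values)
    out = []
    for i in range(new_len):
        if align_corners:
            pos = i * (n - 1) / (new_len - 1) if new_len > 1 else 0.0
        else:
            pos = (i + 0.5) * n / new_len - 0.5
        pos = min(max(pos, 0.0), n - 1)  # clamp to the valid index range
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def mean_absolute_difference(a, b):
    """Mean of |a[i] - b[i]| over two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


row = [0, 10, 20, 30]
corner_aligned = resize_linear(row, 2, align_corners=True)   # [0.0, 30.0]
pixel_centred = resize_linear(row, 2, align_corners=False)   # [5.0, 25.0]
difference = mean_absolute_difference(corner_aligned, pixel_centred)  # 5.0
```

Both results are valid linear interpolations of the same row, yet they differ at every sample, which is the kind of effect that then propagates into the model's output probabilities.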

@kaczmarj
Collaborator

> Does the WSInfer Python code have a strategy for this?

wsinfer python pads patches with 0. actually this is an implementation detail of openslide and tiffslide (they will pad with 0 if the patch is at the edge of a slide). i should add tests to wsinfer python that make sure this continues to happen with future versions.
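For readers unfamiliar with this behaviour, a minimal sketch of zero-padding an out-of-bounds region request is shown here. It works on a plain list-of-rows image and is only an illustration of the idea; `read_region_zero_padded` is a hypothetical helper, not the openslide, tiffslide, or QuPath API.

```python
def read_region_zero_padded(image, x, y, width, height):
    """Return a height x width tile; pixels outside the image are 0.

    image is a list of rows, each row a list of pixel values.
    (x, y) is the top-left corner of the request and may be negative.
    """
    img_h = len(image)
    img_w = len(image[0]) if img_h else 0
    tile = [[0] * width for _ in range(height)]
    # Intersection of the requested region with the image bounds.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + width, img_w), min(y + height, img_h)
    if x0 >= x1 or y0 >= y1:
        return tile  # request lies entirely outside the image
    for yy in range(y0, y1):
        # Copy the valid part of the row into its place in the tile.
        tile[yy - y][x0 - x:x1 - x] = image[yy][x0:x1]
    return tile


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
top_left = read_region_zero_padded(image, -1, -1, 3, 3)
# top_left == [[0, 0, 0], [0, 1, 2], [0, 4, 5]]
```

The returned tile always has the requested size, so every tile in a batch has the same shape before any resizing is applied.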

> I think perfect agreement between Python and QuPath would be very difficult to achieve

i agree, and i believe it shouldn't be our goal to achieve perfect agreement. by the way, the bilinear resampling in wsinfer/python is performed by Pillow.

@xalexalex
Author

@petebankhead I tested the current HEAD and unfortunately the issue isn't fixed for me. Could anyone else check?

I get this error:

Successful run without edge tiles:

19:39:56.648 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Running prost-latest for 80 tiles
19:39:58.730 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Finished 80 tiles in 2 seconds (26 ms per tile)

Run that errors out on edge tile:

Nov 13, 2023 7:40:03 PM javafx.fxml.FXMLLoader$ValueElement processValue
WARNING: Loading FXML document with JavaFX API of version 20.0.1 by JavaFX runtime of version 20
19:40:03.876 [wsinfer1] [WARN ] ai.djl.repository.SimpleRepository - Simple repository pointing to a non-archive file.
19:40:03.978 [wsinfer1] [INFO ] qupath.ext.wsinfer.WSInfer - Running prost-latest for 75 tiles
19:40:04.192 [wsinfer-tiles1] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (12189, -51, 963, 963)
19:40:04.216 [wsinfer-tiles2] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (14115, -51, 963, 963)
19:40:04.219 [wsinfer-tiles4] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (13152, -51, 963, 963)
19:40:04.294 [wsinfer-tiles1] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (21819, -51, 963, 963)
19:40:04.519 [wsinfer-tiles2] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (22782, -51, 963, 963)
19:40:04.707 [wsinfer-tiles3] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (23745, -51, 963, 963)
19:40:04.740 [wsinfer-tiles3] [WARN ] qupath.ext.wsinfer.TileLoader - Detected out-of-bounds tile request - results may be influenced by padding (24708, -51, 963, 963)
19:40:04.808 [wsinfer1] [ERROR] qupath.ext.wsinfer.WSInfer - Error running model prost-latest
ai.djl.translate.TranslateException: java.lang.IllegalArgumentException: You cannot batch data with different input shapes(3, 963, 963) vs (3, 224, 224)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:193)
        at qupath.ext.wsinfer.WSInfer.runInference(WSInfer.java:243)
        at qupath.ext.wsinfer.ui.WSInferController$WSInferTask.call(WSInferController.java:553)
        at qupath.ext.wsinfer.ui.WSInferController$WSInferTask.call(WSInferController.java:499)
        at javafx.concurrent.Task$TaskCallable.call(Task.java:1426)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: You cannot batch data with different input shapes(3, 963, 963) vs (3, 224, 224)
        at ai.djl.translate.StackBatchifier.batchify(StackBatchifier.java:83)
        at ai.djl.inference.Predictor.processInputs(Predictor.java:300)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:181)
        ... 8 common frames omitted
Caused by: ai.djl.engine.EngineException: stack expects each tensor to be equal size, but got [3, 224, 224] at entry 0 and [3, 963, 963] at entry 3
        at ai.djl.pytorch.jni.PyTorchLibrary.torchStack(Native Method)
        at ai.djl.pytorch.jni.JniUtils.stack(JniUtils.java:626)
        at ai.djl.pytorch.engine.PtNDArrayEx.stack(PtNDArrayEx.java:662)
        at ai.djl.pytorch.engine.PtNDArrayEx.stack(PtNDArrayEx.java:33)
        at ai.djl.ndarray.NDArrays.stack(NDArrays.java:1825)
        at ai.djl.ndarray.NDArrays.stack(NDArrays.java:1785)
        at ai.djl.translate.StackBatchifier.batchify(StackBatchifier.java:54)
        ... 10 common frames omitted

@kaczmarj kaczmarj reopened this Nov 13, 2023
@kaczmarj
Collaborator

thanks @xalexalex

it appears that these patches are not being resized for some reason... i'm not sure what would cause this.

ai.djl.translate.TranslateException: java.lang.IllegalArgumentException: You cannot batch data with different input shapes(3, 963, 963) vs (3, 224, 224)
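The error arises because stacking tensors into a batch requires every tensor to have the same shape. A small sketch of a check that could be run before batching is shown below; `find_shape_mismatches` is a hypothetical helper, not part of DJL or the extension.

```python
def find_shape_mismatches(shapes):
    """Return (index, shape) for each entry whose shape differs from entry 0."""
    if not shapes:
        return []
    expected = tuple(shapes[0])
    return [(i, tuple(s)) for i, s in enumerate(shapes) if tuple(s) != expected]


# Shapes taken from the stack trace above: entry 3 was never resized.
batch_shapes = [(3, 224, 224), (3, 224, 224), (3, 224, 224), (3, 963, 963)]
mismatches = find_shape_mismatches(batch_shapes)
# mismatches == [(3, (3, 963, 963))]
```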

@petebankhead
Member

This works for me with the zoo models on both Windows and Mac.

@xalexalex can you specify which model you are using, or share the config.json?

Two explanations I can think of:

  1. You've still got an 'old' version of the extension installed, and QuPath is using it instead
  2. All the zoo models contain a 'resize' transform in the config.json... which may be required in the current implementation

@kaczmarj I notice that https://huggingface.co/kaczmarj/pancancer-lymphocytes-inceptionv4.tcga/blob/main/config.json contains a patch size of 100 but resizes to 299... is this intended?

@kaczmarj
Collaborator

> @kaczmarj I notice that https://huggingface.co/kaczmarj/pancancer-lymphocytes-inceptionv4.tcga/blob/main/config.json contains a patch size of 100 but resizes to 299... is this intended?

i realize it looks odd, but it is intended. i triple-checked the original implementation (https://github.com/ShahiraAbousamra/til_classification).

@xalexalex
Author

> 2. All the zoo models contain a 'resize' transform in the `config.json`... which may be required in the current implementation

This was indeed true. Adding a resize transform in config.json did the trick. Now both zoo models and my custom models work flawlessly. Thanks!
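For anyone checking a custom model, here is a rough sketch of a sanity check for a missing resize step. The key names are assumptions: it supposes the loaded `config.json` lists preprocessing steps under a `transform` key, each with a `name` field, so adjust it to whatever your own config actually contains. `has_resize_transform` is a hypothetical helper.

```python
def has_resize_transform(config):
    """True if any listed transform is named 'resize' (case-insensitive).

    Assumes config["transform"] is a list of dicts with a "name" key;
    these key names are an assumption, not a documented schema.
    """
    transforms = config.get("transform", [])
    return any(
        isinstance(t, dict) and str(t.get("name", "")).lower() == "resize"
        for t in transforms
    )


# Hypothetical configs, for illustration only.
with_resize = {"transform": [{"name": "Resize"}, {"name": "ToTensor"}]}
without_resize = {"transform": [{"name": "ToTensor"}]}
```

If the check returns False, tiles would reach the batching step at their original pixel size, which matches the 963 vs 224 mismatch in the log above.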

@kaczmarj
Collaborator

kaczmarj commented Dec 1, 2023

> This was indeed true. Adding a resize transform in config.json did the trick. Now both zoo models and my custom models work flawlessly. Thanks!

fantastic! glad it is working
