
possible bug in save_cog_with_dask? #185

Open
jerry-kobold opened this issue Oct 9, 2024 · 3 comments

@jerry-kobold

At my company we have been using `save_cog_with_dask` heavily to produce COGs from very large rasters. For certain raster/chunk combinations we have observed the function fail with an `AssertionError`. I did some digging and isolated the problem to the following code:

```python
# odc/geo/cog/_mpu.py:190

def can_flush(pw: PartsWriter):
    if self.write_credits < 1:
        return False
    if self.started_write:
        return self.is_final or len(data) >= pw.min_write_sz
    if self.is_final:
        return len(data) > self.lhs_keep
    return len(data) - self.lhs_keep >= pw.min_write_sz

# odc/geo/cog/_mpu.py:207
assert can_flush(write)
```

It seems that for certain combinations of raster dimensions and chunking scheme, a situation can arise in which `self.write_credits` is 0 but the write has not completed. The following code should reproduce the problem:

```python
from pathlib import Path

import dask.array as da
import numpy as np
import rioxarray  # noqa: F401 -- registers the .rio accessor on DataArray
import xarray as xr
from odc.geo.cog import save_cog_with_dask

ylen = 43284
xlen = 9615
bands = 2
fake_data = da.random.default_rng().integers(
    1, 1000, size=(bands, ylen, xlen), chunks=(1, 4000, 4000), dtype=np.int32
)
fake_raster = xr.DataArray(
    fake_data,
    coords={"band": [1, 2], "y": np.arange(ylen), "x": np.arange(xlen)},
).rio.write_crs("EPSG:4326")

output = Path("tiff-cog-test.tiff")
if output.exists():
    output.unlink()

fut = save_cog_with_dask(fake_raster, str(output), compression="deflate")
fut.compute()
```

The dimensions in this example were chosen because they match the dimensions of an actual raster for which this problem arose with this specific chunking. Using the time-honored method of `distributed.print`-based debugging, I isolated the error to the line that checks `self.write_credits < 1`. Playing around with the source, I noticed that changing this line to

```python
if self.write_credits < 1 and not self.is_final:
```

makes the error go away and nothing else seems to break, but I don't know whether that's the right fix.

@Kirill888
Member

Maybe the assert should be

```python
assert self.is_final or can_flush(write)
```

@jerry-kobold
Author

jerry-kobold commented Oct 10, 2024

From what I can tell that's basically the same thing as what I did, right? Happy to make a PR for this if it would help.

edit: would it make more sense to move the `if self.is_final` check to the top of `can_flush` so that condition gets evaluated first? Or is there a scenario where that could cause problems?

@Kirill888
Copy link
Member

My concern is that I still don't understand whether the assert is overly aggressive or whether there is an actual logic error; for example, it could be triggered by rare Dask scheduler events such as work stealing.

In general, the MPU code needs another pass of simplification now that we know it works most of the time.
