-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First draft of top level masking functions for bitmask and enumerated… #187
base: develop
Are you sure you want to change the base?
Conversation
2ecd02a
to
479dc32
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #187 +/- ##
=========================================
Coverage 95.44% 95.45%
=========================================
Files 31 32 +1
Lines 5534 5656 +122
=========================================
+ Hits 5282 5399 +117
- Misses 252 257 +5 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
@alexgleith I think that chosen approach for representing "bit query" in I'd prefer if we just took a string like this mask = bits_to_bool(xx, "xx01_1xxx") instead of proposed mask = bits_to_bool(xx, [3,4,5], b0001_1000) |
Other option I guess is a dictionary mask = bits_to_bool(xx, {5:0, 4:1, 3:1}) but that's way more prone to errors: off by 1 or LSB/MSB confusion. |
3a4eb8c
to
504bbb8
Compare
504bbb8
to
f4f5a5b
Compare
7cfd466
to
39c42cb
Compare
Would be great to get your review @robbibt and perhaps an opinionated DEA Fmask implementation |
|
||
return mask_clouds( | ||
xx=xx, | ||
qa_name=qa_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there is a default qa_name, I'm concerned this might fail without clear information to the user if the data they've loaded does not have the scl
band, and they haven't realised. Should there be checks in the mask_s2
and mask_ls
functions that check whether the supplied qa_name
band is actually available?
SNOW = 5 | ||
CLEAR = 6 | ||
WATER = 7 | ||
# Not sure how to implement these yet... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've developed a way to do this in eo-insights: https://github.com/frontiersi/eo-insights/blob/412cfb5e4b416c1cfc7d52286a8b9eb241e730d2/src/eo_insights/masking.py#L267-L268
A two-bit pair can have four values: 0 = 0b00, 1 = 0b01, 2 = 0b10, 3 = 0b11. You can appropriately represent this by inserting the requested integer value at the minimum bit using integer_value << min(bits)
such that a value of 2
at bits [2, 3]
for a 4-bit integer produces 0b1000
and a value of 3
at bits [2, 3]
for a 4-bit integer produces 0b1100
.
""" | ||
Perform cloud masking for Sentinel-2 L2A products. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything to prevent the user from mistakenly using these functions on inappropriate datasets? What happens if the user has loaded Harmonised Landsat Sentinel-2, which uses fmask, which has a different set of flag definitions (and nodata value) to either of the defaults included here? They would need to know that they can't use either of these functions, and have to use the mask_clouds
function with custom variables relevant to their product. I'm worried that the inclusion of opinionated masking here is going to trip people up, and there need to be appropriate checks to inform the user that they might be using the function inappropriately.
|
||
# "Scales" and "offsets" is used by GDAL. | ||
if scale is None: | ||
scale = xx.attrs.get("scales") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if these attributes aren't available? Should you return None
?
@alexgleith -- is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2 products you have opinionated masking for, either from a single tiff, or a STAC query? I'd like to run a few tests to understand what might happen if a user applies the opinionated masking functions naively and makes a mistake. You could just send this to me directly if you don't want to have it as part of the repo. |
Here's a notebook: https://gist.github.com/alexgleith/84ebc85e17b904a48319845e42947352 |
Thanks for putting this together, Alex! At this stage, my main concern is that it would be very easy for the user to misuse What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General feedback:
This is brilliant, thanks so much for working on this @alexgleith - getting masking functionality into odc-geo
is something I've wanted for a long time, and I can see this making many of our current workflows so much easier.
I do think I agree with Caitlin that that we should probably break this PR into two chunks to consider seperately. The generic tooling is "easier" to progress, and is a perfect fit for this repo. However, the opinionated masking is trickier, especially given all the possible edge cases that can occur when users naively start applying the functions to various datasets (e.g. USGS Level 1 Landsat, USGS Level 2, DEA Collection 3, DE Africa, E84, HLS etc). So perhaps, just include the following functions in this initial PR:
bits_to_bool
: generic bitmaskingenum_to_bool
: generic "is-in" maskingscale_and_offset
: generic re-scaling of data by scale/offsetsmask_invalid_data
: generic conversion of integer nodata values toNaN
(given both this andscale_and_offset
convert integer data to floating point, I wonder ifmask_invalid_data
should become just a special variant of this function and callscale_and_offset
under-the-hood?)mask_clouds
: generic framework/workflow for masking clouds from a dataset (e.g. qa band selection, re-scaling, mask creation, mask application
For mask_clouds
in particular, I think it would be really nice if rather than passing in a custom function, a user could simply provide the bit flags or the enumerated values directly to the function, e.g. perhaps something like this:
# Enum masking
ds.odc.mask_clouds(qa_name="fmask", qa_enum=[0, 4], scale=0.0001, offset=0.0)
# Bit masking
ds.odc.mask_clouds(qa_name="pixel_qa", qa_bits=[4, 3], scale=0.0000275, offset=-0.2)
That would let mask_clouds
be used for generic cloud masking tasks and for future satellite datasets with different sets of masking bands and flags than we have seen so far.
Specific feedback:
-
.odc.mask_invalid_data
is great - although the first time I ran it, I was suprised that it operated "in-place", e.g. modifying the existing array without having to assign it to a new output. Is this intended? Both approaches could be useful - I guess we could include aninplace=False
param in there to allow users to decide how they want to apply it.ds.odc.mask_invalid_data() # vs ds_masked = ds.odc.mask_invalid_data()
-
.odc.enum_to_bool
works a treat! Again, this one returns a new output instead of modifying data in-place, so we should probably standardise to make sure functionality is expected. -
I've tested
odc.bits_to_bool
on our WO data, and it works great for masking by individual flags, e.g.ds.water.isel(time=0).odc.bits_to_bool(bits=[7])
- I actually quite like being able to pass in a simple list of bits. However, I'm not sure how to use it for more complex things like this (which I think relates to @caitlinadams' comments above):Bitmasking is something I've found super confusing forever, and I know it's something our users struggle with a lot too. The
datacube.utils.make_mask
function is probably still the most intuitive implementation I've seen, where users can do something like this using human-readable flags. I wonder if we could extent this function to allow something similar here, if variable flags exist in the data, or if a user manually provides them?make_mask(ds.water, water_observed=True)
-
I tried running
ds.odc.scale_and_offset(scale=0.0001)
on DEA's ARD scaled between 0-10000. It did work, but it also operated in-place which was unexpected, and it appeared to restore nodata values back into the dataset even after converting to float (so the data had values from-999.0
to1.0
. I think it's probably a safe assumption that nodata values can be set toNaN
after rescaling to floating point - I don't think having an integer nodata value is useful at that point.
def bits_to_bool( | ||
xx: DataArray, | ||
bits: Sequence[int] | None = None, | ||
bitflags: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct that bits
and bitflags
are mutually exclusive, e.g. users could provide either option? If so, we should clarify this in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so if you and Kirill both find this confusing, perhaps it's problematic.
Any suggestions on something better? Improving docs is simple enough... but maybe there's a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, not super confusing, just need to be clear in the docs that a user doesn't need to provide both I think (and most users except the most advanced ones will probably ignore bitflags
completely).
) -> DataArray | Dataset: | ||
""" | ||
Apply scale and offset to the DataArray. Leave scale and offset blank to use | ||
the values from the DataArray's attrs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these attributes that come with a particular STAC extension? If so, makes me wonder if we should configure these in our DEA datacube products too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's fairly standard. I'm not sure of the odc-stac library sets then, but it should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rioxarray
does put them in (it takes those from TIFF tags). Unlike rioxarray
, odc-stac
loads from multiple sources that, at least in theory, can have different scale and offset for each asset being loaded, also that information can come from either STAC or from TIFF/netcdf.
There is also this possible confusion of "have these parameters been applied already or not?".
@robbibt Scale and offset are part of stac raster extension, see
https://odc-stac.readthedocs.io/en/latest/stac-best-practice.html
in STAC land it means something like "pixels are stored in a quantized form as an integer, to get back to floating point use these scale and offset parameters for all values that are not NODATA, for NODATA use NaN".
:param scale: Scale value for the dataset | ||
:param offset: Offset value for the dataset | ||
:param clip: Clip range for the data | ||
:param includ_cirrus: Whether to include cirrus in the mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring here doesn't match the current params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, on it.
xx: DataArray | Dataset, | ||
scale: float | None = None, | ||
offset: float | None = None, | ||
clip: Annotated[Sequence[int | float], 2] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this can be applied at the dataset level, we should add a "skip_bands" param here to match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I guess one other thing to consider here is that there's a bunch of similar masking tools in We should see if any of that content can be re-used here too. |
After chatting with @caitlinadams today, I think I'll suggest that we make the following changes:
And a range of code cleanup and doc fixes, plus examples are required. |
I really like the idea of being able to pass in custom dictionary kwarg configs - I think that would allow us to have a really nice generic function that can be highly customisable and support a wide range of current and future sensors. |
A work in progress, but currently includes functions:
odc.geo.masking.bits_to_bool
, which makes a mask using bit flagsodc.geo.masking.enum_to_bool
, which makes a mask usingisin
lookupsscale_and_offset
, which applies offset and scale values.Also:
odc.geo.masking.mask_clouds
and the two functions belowodc.geo.masking.mask_ls
: apply an opinionated mask to landsat c2l2sr dataodc.geo.masking.mask_s2
: apply an opinionated mask to sentinel-2 c1l2 data.I've added odc xarray extensions for them, which I think works. Tested briefly.
So, thoughts? Opinions? Fears and doubts?