Extraction pipelines & data fetching blocks #36
Conversation
Remaining for the Sentinel-2 Pipeline:
@jdries You can start profiling the backend bottlenecks within the Sentinel-2 pipeline. There is a full dataset with a realistic density here: So far, I managed to run jobs with 100 samples, with a peak memory usage a bit higher than 5 GB and 3 hours of processing costing 400 credits (CDSE-STAGING: j-2403205396124e2f8d417a0c39667ced). Memory usage seems fine, but it would be nice to speed up the job. The results are in this folder: I did other tests with far fewer samples, and it seems to me that memory scales up slowly with the number of samples, which is good. This line here subsamples the divided openEO jobs into a single one for testing purposes:
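As a rough illustration of that last line, here is a minimal sketch of subsampling the dataframe of split extraction jobs so only one openEO batch job is launched while testing; `job_df`, the function name and the GeoDataFrame type are assumptions, not the actual names used in the repository.

```python
# Hypothetical sketch: reduce the dataframe of split extraction jobs to a single
# entry so that only one openEO batch job is launched during testing.
# `job_df` is an assumed name, not the variable used in the repository.
import geopandas as gpd

def subsample_jobs_for_testing(job_df: gpd.GeoDataFrame, n: int = 1) -> gpd.GeoDataFrame:
    """Keep only the first `n` split jobs (n=1 for a quick test run)."""
    return job_df.iloc[:n].copy()
```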
Memory settings are quite high compared to what's actually used; my first guess would be to try lowering them. The majority of time is spent reading data. It's clearly still the case that reading many small blocks from object storage is not efficient.
@jdries The S1 pipeline seems to work well; I managed to extract 500 samples with close to 6 GB of memory.
Found another cause of memory usage in S2: the distance-to-cloud computation works on the whole timeseries, while it should only need individual observations.
This will also require the UDF to be adjusted; maybe this needs to be checked with @VincentVerelst. This would be my guess for a UDF without a time dimension:
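A minimal sketch of such a per-observation UDF, assuming the openeo `apply_datacube` UDF signature, a cube with dims (bands, y, x) once the time dimension has been dropped, an "SCL" band, and SCL classes 3/8/9/10 treated as shadow or cloud; none of these choices come from the repository itself.

```python
# Illustrative sketch of a distance-to-cloud UDF that works per observation
# (no time dimension). Band name "SCL" and the cloud classes are assumptions.
import numpy as np
import xarray as xr
from scipy.ndimage import distance_transform_edt
from openeo.udf import XarrayDataCube


def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    array = cube.get_array()                    # expected dims: (bands, y, x)
    scl = array.sel(bands="SCL")
    cloud = np.isin(scl.values, [3, 8, 9, 10])  # shadow + cloud classes
    # Distance (in pixels) from each pixel to the nearest cloudy pixel.
    distance = distance_transform_edt(~cloud)
    out = xr.DataArray(
        distance[np.newaxis, ...],
        dims=("bands", "y", "x"),
        coords={"bands": ["distance_to_cloud"], "y": array.y, "x": array.x},
    )
    return XarrayDataCube(out)
```

Because each invocation only sees a single observation, the distance transform never holds the full timeseries in memory, which is the memory saving described above.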
Another one: the cloud mask is now applied, but there's a `merge_cubes` before the mask process.
@jdries Without the two optimizations you mentioned here, I managed to run the extraction of 500 points with 8 GB of peak memory, 580 credits and 4 hours wall time (cdse-staging j-24032149eeb847fbbfdc5a6bfca9622c). I will run a new one with the same amount of samples and see the difference in memory usage. I also ran an extraction on S1 data with 6 GB peak memory for 500 points, 160 credits and ~1h30 wall time. Seems quite efficient to me, but could you give it a look to see how we can optimize it? (cdse production j-240321d24a59494d848ae4eedc695edf)
For Sentinel-1, can you try setting executor-memory to 1G while keeping memory-overhead the same?
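For reference, a sketch of what that suggestion could look like as openEO job options; the overhead value below is just a placeholder for whatever the job currently uses, not a value from this thread.

```python
# Sketch of the suggested Sentinel-1 settings: lower executor memory to 1G and
# keep the memory overhead untouched. The overhead value is a placeholder.
job_options = {
    "executor-memory": "1G",           # lowered as suggested
    "executor-memoryOverhead": "2G",   # unchanged (placeholder value)
}
# passed to the batch job, e.g. cube.execute_batch(job_options=job_options)
```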
For Sentinel-2, I'm going to try writing an alternative loader. It may take a little bit of time, but it might be worth it, as I again notice the big difference between loading data in bulk versus per sample.
The new strategy looks promising: the job ran with 2.5 GB per executor, taking only 57 CPU hours in Spark.
I still need to work on ensuring correctness of the result. Also, the end of the job takes too long to look up sample metadata; this could probably be done in parallel as a quick fix.
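One way that parallel quick fix could look, sketched with a thread pool; `lookup_metadata` and `sample_ids` are placeholders rather than functions from the codebase.

```python
# Hypothetical sketch of parallelising the per-sample metadata lookup with a
# thread pool; `lookup_metadata` and `sample_ids` are placeholders.
from concurrent.futures import ThreadPoolExecutor

def lookup_all_metadata(sample_ids, lookup_metadata, max_workers: int = 8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup_metadata, sample_ids))
```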
@jdries For the S1 job, I think that limiting the memory overhead of the executor to 1 GB is too little. This job seems to crash (I think for OOM reasons): j-240325d99399443cabd88a9b4c0fb17f (CDSE-STAGING)
@GriffinBabe my suggestion was to lower executor-memory to 1G and keep the memory overhead the same; you tried it the other way round, which I would indeed expect to crash.
A working version of the performance improvement is available on CDSE staging. The patches look fine on first inspection, but a deeper comparison is needed. To enable the optimization, we need to set a feature flag on the load_collection call that loads the optical bands; see the sketch after this comment.
The job options also have a large impact on performance. Note that I run with a small number of executors to minimize loss when idle, but you can increase it if you want to iterate faster.
Note that the feature flag enables a very experimental code path, so please don't start using it everywhere without asking or properly checking the results. Processes like mask_scl_dilation are certainly not yet supported. As for next steps, I would recommend starting sample extraction without cloud masking applied and properly verifying the results; then we can add the cloud masking once the more complex strategy works. I also still see other things that can be made faster, but it takes some time to integrate everything.
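A sketch of the mechanics being described, with placeholder values: the feature-flag name and the job-option values below are illustrative, not the ones from the original jobs, and the snippet assumes the openeo Python client's `result_node()`/`update_arguments()` hook for attaching backend-specific arguments to load_collection.

```python
# Illustrative only: "experimental-loader" is a placeholder flag name and the
# job options are placeholder values, not the settings used in the jobs above.
import openeo

connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

cube = connection.load_collection("SENTINEL2_L2A", bands=["B02", "B03", "B04"])
# Attach a backend-specific feature flag to the load_collection node
# (assuming the client's result_node()/update_arguments() mechanism).
cube.result_node().update_arguments(featureflags="experimental-loader")

job = cube.execute_batch(
    out_format="NetCDF",
    job_options={
        "executor-memory": "2G",
        "executor-memoryOverhead": "1G",
        # Few executors to minimise idle loss; raise this to iterate faster.
        "max-executors": 10,
    },
)
```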
Nice work, though I have to say the code is quite lengthy and it doesn't feel very modular. If this is what a user typically needs to do, it's quite something. I have the feeling that some of it could be embedded in GFMAP and also written a bit more modularly, so that the actual user-facing extraction scripts are more readable. What do you think?
@kvantricht This pull request will be much larger than it used to be, as I also started working on the inference blocks that use GFMAP to extract the data in WorldCereal, instead of the usual recycled cropclass/HRL code that was in the repository. The old code will still be accessible through the version history, but I removed it in this branch (to be merged once we confirm that a fetched datacube has the same values here as it had with the previous pipeline).
Did you tackle my previous comments? Wouldn't it be good to have a new stable version (so a merged PR) before you're actually gone?
@kvantricht I resolved your comments and updated the pipelines. I think there will be a few changes later when extracting S1 and METEO, but we have a solid foundation.
```python
# Rescale to uint16, multiplying by 100 first
cube = cube * 100
```
So it is confirmed that `cube` contains the unscaled values, which is why you reintroduce it here?
"description": "Sentinel-2 bands", | ||
"type": "application/x-netcdf", | ||
"roles": ["data"], | ||
"proj:shape": [64, 64], |
Does this have to be hardcoded here? What if patch size changes?
I agree that ideally it shouldn't be there. We discussed this with @VincentVerelst and found no easy solution, as this definition is for a specific patch size and a specific set of selected bands; it is therefore application-specific.
However, in the long run it should be part of the extraction pipeline in GFMAP. For example, the feature extractor and the postprocessing parameters should automatically create those STAC definitions in parallel. But it is not a trivial issue and needs to be designed with a long-term view.
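To make the idea concrete, here is a minimal sketch (using pystac) of deriving the asset definition from the patch size instead of hardcoding [64, 64]; the function name, asset href and `patch_size` parameter are all hypothetical, not values from the repository.

```python
# Illustrative sketch only: derive the asset metadata from the patch size
# instead of hardcoding "proj:shape": [64, 64]. The function name, href and
# patch_size are hypothetical.
import pystac

def make_s2_asset(href: str, patch_size: int) -> pystac.Asset:
    return pystac.Asset(
        href=href,
        description="Sentinel-2 bands",
        media_type="application/x-netcdf",
        roles=["data"],
        extra_fields={"proj:shape": [patch_size, patch_size]},
    )
```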
Approved for now after discussion with Darius. Some future changes planned.
Implemented extraction pipelines for Sentinel2, Sentinel1 and AGERA5
The following pipelines are finished: