Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow FieldTimeSeries to pass keyword arguments to jldopen #3739

Merged
merged 15 commits into from
Oct 29, 2024

Conversation

ali-ramadhan
Copy link
Member

This PR takes some code I had to pass kwargs to jldopen through FieldTimeSeries.

But from the dicussion in issue #3680 it sounds like maybe a more future-proof way to achieve this is to add the kwargs object to the backend object (either InMemory or OnDisk)?

I guess there's a bit of a conflict of wording here? Backend refers to how the data is stored for a FieldTimeSeries. So we could end up with OnDisk{NetCDF} and OnDisk{JLD2} for example. And maybe also for PartlyInMemory. But for fully InMemory there's no "backend"?

Resolves #3680

@glwagner
Copy link
Member

Right InMemory is more like a nothing backend, there's nothing back there.

On ClimaOcean there's an example how we extended this infrastructure to a particular kind of NetCDF backend: https://github.com/CliMA/ClimaOcean.jl/blob/dd9148a8f702699ddf2947d3a1a094adddbd9bb1/src/DataWrangling/JRA55.jl#L218

The implementation in this PR seems simple enough though. @ali-ramadhan maybe can you provide a little syntax example how you expect this to be used? @navidcy take a look and see what you think

@glwagner glwagner requested a review from navidcy August 28, 2024 03:47
@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Aug 28, 2024

I've used it to open JLD2 files in parallel using multiple threads with @threads so that each thread can process one time snapshot. Here's an example:

u = FieldTimeSeries("fields.jld2", "u";
    backend = OnDisk(),
    backend_kw = Dict(:parallel_read => true)
)

Nt = length(u)
temporal_reduction = zeros(Nt)

Threads.@threads for n in 1:Nt
    temporal_reduction[n] = median(u[n])
end

Of course, this post-processing is much slower than just doing it on-the-fly as the simulation runs. But it's still useful to be able to do this.

One especially useful use case is parallel plotting from a FieldTimeSeries.

@glwagner
Copy link
Member

Nice!

Only API suggestion is that the extra kw could be directly passed to FieldTimeSeries, so

u = FieldTimeSeries("fields.jld2", "u"; backend = OnDisk(), parallel_read=true)

Under the hood this can work with something like

function FieldTimeSeries(path::String, name::String; backend=InMemory(), kw...)
    # etc
    backend_kw = Dict(kw)
    # etc
end

You're the primary user though so you can state your preference!

@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Aug 30, 2024

I agree that kw... results in a cleaner API with no passing of dicts.

I'm just wondering if there will be a use case in the future that will require passing a different set of kwargs. Then we'll need kw1, kw2 as kw... cannot disambiguate between kw1 and kw2.

@ali-ramadhan
Copy link
Member Author

I thought about this a bit more and further think we should stick to keeping kw a dict so that the API is more consistent with how the JLD2OutputWriter constructor works as it also expects a dict.

@glwagner
Copy link
Member

I thought about this a bit more and further think we should stick to keeping kw a dict so that the API is more consistent with how the JLD2OutputWriter constructor works as it also expects a dict.

We could also change JLD2OutputWriter!

@ali-ramadhan
Copy link
Member Author

Definitely agree the two interfaces should be consistent. I'd be okay with either, although still slightly favor the current interface in case we will want to pass two sets of kwargs in the future.

Would you support merging this PR and we can open an issue to discuss the best interface?

@glwagner
Copy link
Member

Yes for sure support that

@glwagner
Copy link
Member

Was hoping @navidcy might review but I can give it one

return FieldTimeSeries{LX, LY, LZ}(data, grid, backend, boundary_conditions,
indices, times, path, name, time_indexing)
return FieldTimeSeries{LX, LY, LZ}(data, grid, backend, boundary_conditions, indices,
times, path, name, time_indexing, backend_kw)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think theere needs to be changes to adapt_structure which I do not see in this PR. Possibly this is not tested right now. @simone-silvestri

Copy link
Member Author

@ali-ramadhan ali-ramadhan Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Ah true. I guess originally we didn't envision passing FieldTimeSeries into GPU kernels and I don't see a test that does this.

But with features like FieldTimeSeries forcing (PR #3760) an adapt_structure will be necessary. Actually not sure how tests will pass in that PR without adapting.

I can add some tests here though. There are no tests that use backend_kw...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think there are no tests about this. The adapt for a FieldTimeSeries returns a different type: GPUAdaptedFieldTimeSeries that is a little slimmer in terms of parameter space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but this PR breaks FieldTimeSeries on GPU which will break everything in ClimaOcean. @simone-silvestri can you add some tests here?

@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Oct 7, 2024

Thanks @simone-silvestri for adding a GPU test for FieldTimeSeries! I've adapted it correctly now and tests pass locally so I think this PR is ready for review.

Actually I should add a test that uses the new kwargs. Would do a test where multiple threads open the same FieldTimeSeries but don't think we have multi-threaded tests.

@ali-ramadhan
Copy link
Member Author

Updated the branch and added tests that backend_kw works with FieldTimeSeries and FieldDataset so I think this PR is actually ready for review now!

@ali-ramadhan
Copy link
Member Author

Just bumping my review request 🥺

Co-authored-by: Simone Silvestri <[email protected]>
@simone-silvestri
Copy link
Collaborator

I am just not extremely sure about the name backend_kwargs: at first glance, I thought it would be the keyword arguments of the backend, but it is just related to the reader of the underlying data (which for the moment, is only JLD2) and is not necessarily related to the backend (for the moment).
I guess jld2_kw like in the JLD2OutputWriter does not work if we want to extend this to NETCDF. would something like reader_kwargs or IO_kwargs work you think?
Otherwise, if backend_kwargs is ok for the majority I can accept this.

@ali-ramadhan
Copy link
Member Author

I do agree that backend_kw is maybe not the best name for this. I think I named it so out of the discussion in #3680 around "storing the kwargs in the backend".

But yeah there's a backend where the data is stored, e.g. in memory or on disk, but there's also how the data is read, e.g. JLD2 or NetCDF. "Backend kwargs" should go into the call to OnDisk for example. So reader_kwargs makes sense to me (or reader_kw if we want to be consistent with JLD2OutputWriter property names).

Maybe the better names will become clear once FieldTimeSeries supports reading from NetCDF (something I'm interested in working on).

@ali-ramadhan
Copy link
Member Author

Curious if @glwagner has any thoughts on backend_kw vs. reader_kw, but I'm thinking of renaming to reader_kw then potentially revising the naming in a future refactor of OutputReaders (probably as part of supporting NetCDF).

@ali-ramadhan ali-ramadhan merged commit 97d7344 into main Oct 29, 2024
46 checks passed
@ali-ramadhan ali-ramadhan deleted the ali/fts-backend-kwargs branch October 29, 2024 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support passing keyword arguments to jldopen when using FieldTimeSeries
3 participants