Where to store rasters? #23

Open
tpoisot opened this issue Mar 3, 2021 · 9 comments
@tpoisot
Contributor

tpoisot commented Mar 3, 2021

#21 ended up being about org admin, so let me restate the issue here:

  1. we want to store raster data centrally
  2. we want the user to have flexibility about where "centrally" is (can be a folder, can be a server I guess, etc)
  3. we want to make a default decision when the user is not specifying anything, e.g. for small datasets

The current solution is to require ENV["RASTERDATASOURCES_PATH"] - this can work but it requires setting the variable even for small datasets, which is an additional step for users.

The solution I suggested in #21 was to use traits for the different types, but this is also possibly confusing - sometimes things will stop working unless the variable is set (and setting it from the session will not make it permanent).

Solutions like Artifacts don't work because we don't want to download ALL data when the package is built.

Here is my current thinking on this - we might want to keep the idea of an ENV["RASTERDATASOURCES_PATH"], and have a greet() function that reminds users of what it does. Specifically, if no such path is set, we could use a folder in @__DIR__ to store the data. Users who don't want to make a choice will have their data there, and users who want to specify a path can still do so.
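A minimal sketch of the fallback described above, using only Base Julia. The names `storage_root` and the `rasterdata` folder are hypothetical and not part of the package; `greet` is only the reminder function mentioned in the comment, and its wording here is an assumption.

```julia
# Hypothetical sketch, not the package's actual API.
const PATH_VAR = "RASTERDATASOURCES_PATH"

# Use the user's path when the variable is set and non-empty,
# otherwise fall back to a folder next to this file.
function storage_root(env = ENV; fallback = joinpath(@__DIR__, "rasterdata"))
    path = get(env, PATH_VAR, "")
    return isempty(path) ? fallback : path
end

# Reminder of what the variable does and where data will currently go.
function greet(env = ENV)
    return "Rasters are stored in $(storage_root(env)); set ENV[\"$PATH_VAR\"] to change this."
end
```

Passing `env` as an argument keeps the lookup testable with a plain `Dict` instead of the real environment.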

@rafaqz what do you think?

@rafaqz
Member

rafaqz commented Mar 4, 2021

Isn't this just going back to your original proposal? Downloading 30GB will still go into @__DIR__ by default. For me the important consideration is that that doesn't happen.

It's not clear to me why traits won't work. The idea is that the variable is set outside the session, and that will only affect large datasets anyway. Maybe we should just use the new preferences system to set this instead of the env var. But a PR would give us something more substantial to discuss.

We can't use Artifacts because there are millions of files.

@tpoisot
Contributor Author

tpoisot commented Mar 4, 2021

Agreed that we can't use Artifacts. I'm slowly realizing the size of other datasets - I might end up being convinced that maybe asking to set a path is not unreasonable - I also don't want to write into @__DIR__.

@rafaqz
Member

rafaqz commented Mar 4, 2021

Let's use this: https://github.com/JuliaPackaging/Preferences.jl?

Then you can set the preference in session, after an error gives you an example of what to do, and the path will stick once you set it. The trait can work as planned earlier, and we only throw the error requiring the preference to be set for the large weather datasets.
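One way the trait idea could look, as a rough sketch in Base Julia. All type and function names (`SmallData`, `LargeData`, `BioClimSketch`, `WeatherSketch`, `sizeclass`, `storage_path`) are made up for illustration. The `stored` argument stands in for whatever the preference lookup returns; inside a real package it might come from Preferences.jl, which is not loaded here.

```julia
# Hypothetical sketch: a size trait decides whether a stored path is required.
abstract type SizeClass end
struct SmallData <: SizeClass end
struct LargeData <: SizeClass end

# Stand-in source types, not the package's real ones.
struct BioClimSketch end
struct WeatherSketch end

# Sources are small unless declared otherwise.
sizeclass(::Type) = SmallData()
sizeclass(::Type{WeatherSketch}) = LargeData()

# `stored` is the user's saved path, or `nothing` if none was set.
storage_path(T::Type, stored, default) = storage_path(sizeclass(T), T, stored, default)

# Small datasets quietly fall back to the default location.
storage_path(::SmallData, T, stored, default) = something(stored, default)

# Large datasets refuse to download until a path has been chosen.
function storage_path(::LargeData, T, stored, default)
    stored === nothing &&
        error("$T is a large dataset: set a storage path before downloading it.")
    return stored
end
```

With this split, `storage_path(BioClimSketch, nothing, default)` returns the default, while the same call for `WeatherSketch` throws until a path is stored.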

@rafaqz
Member

rafaqz commented Mar 4, 2021

Yeah, literally multiple terabytes!

Preferences.jl could be the middle ground we need; it's less weird than ENV.

@rafaqz
Member

rafaqz commented Mar 4, 2021

For cesar it's important to have both scales... We'll run GrowthMaps.jl with tiny climate datasets for exploration and sharing ideas, but swap to hundred-GB datasets for fitting real models - the GrowthMaps/GeoData/RasterDataSources combo abstracts that away and the output is the same format.

@rafaqz
Member

rafaqz commented Mar 4, 2021

But I think you are right for Bioclim and Climate. I had to set the path in support scripts for a paper, and it's pretty awful, and makes the script not reproducible without editing.

@tpoisot
Contributor Author

tpoisot commented Mar 4, 2021

Let's definitely go with Preferences - I'll work on this when I've made progress on the future bioclim data

@asinghvi17

asinghvi17 commented Mar 13, 2023

Would it make sense to use a scratch space from Scratch.jl as a default, with the user being given the option to override that by some mechanism (either Preferences.jl or some environment variable)?
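The override order suggested here could be sketched as follows, in Base Julia only. `first_set` and `resolve_root` are hypothetical names, and `scratch_default` is just a string standing in for whatever directory a scratch space would provide; neither Scratch.jl nor Preferences.jl is called.

```julia
# Hypothetical sketch: the first source that is set wins.
function first_set(candidates...)
    for c in candidates
        c !== nothing && !isempty(c) && return c
    end
    return nothing
end

# Order assumed here: stored preference, then environment variable,
# then the scratch-style default.
function resolve_root(; preference = nothing, env = ENV, scratch_default)
    from_env = get(env, "RASTERDATASOURCES_PATH", nothing)
    return first_set(preference, from_env, scratch_default)
end
```

The order of the two override mechanisms is an assumption; the comment above leaves open which one would be used.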

@rafaqz
Member

rafaqz commented Mar 13, 2023

The Scratch.jl docs kind of say not to use it for this use case:

Because the scratch space location on disk is not very user-friendly, scratch spaces should, in general, not be used for a storing files that the user must interact with through a file browser. In that event, packages should simply write out to disk at a location given by the user.

I personally occasionally manage these files in a browser - say to copy them for someone else when I've downloaded a lot.

But I know the current solution kind of sucks too.

Some of these future climate datasets and current weather datasets are many GB downloadable with a single command, so we need to be a little bit careful about the location and let users access and manage it.
