Create data package for spatialEpisim #3

Closed
bryce-carson opened this issue Jun 10, 2024 · 9 comments
Labels
enhancement New feature or request UNSTABLE The work is to be completed in the unstable branch

Comments

@bryce-carson
Collaborator

A package is a good way to distribute the data, because the script which generated the data can be distributed alongside it. Essentially, "modularize" the data so that spatialEpisim itself is not just some heap of data, scripts, and then a Shiny app consuming that.

Remember, "code as data" also means "data as code", which implies using data():

data(spatRastersAggregated, package = "spatialEpisim")
@bryce-carson bryce-carson added the enhancement New feature or request label Jun 12, 2024
@bryce-carson
Collaborator Author

Per the topic, sharing the pre-aggregated data in a package rather than as bundled data makes sense if the user is installing/running our Shiny app with shiny::runGitHub("ashokkrish/spatialEpisim"). This will, presumably, install all packages required and then download the application and run it. Shipping the data alongside the application is a good idea, but the data might be useful to someone else separately. It also simplifies the application marginally if we don't bother to read the data ourselves and only need to run data(spatRastersAggregated, package = "spatialEpisim").

This isn't necessary, however.
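
A minimal sketch of the data() approach discussed above. The helper name load_packaged_data is hypothetical, and the sketch assumes the dataset is stored in a form that data() can load; the demonstration uses mtcars from the datasets package as a stand-in, because the spatialEpisim data may not be installed. Note that the package argument of data() is a character vector, so the package name is quoted.

```r
# Hypothetical helper: load one dataset from an installed package and return
# it, rather than letting data() assign it into the global environment.
load_packaged_data <- function(name, package) {
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("Package '", package, "' is not installed.")
  }
  env <- new.env()
  # `list` and `package` are character vectors; an unquoted package name
  # would be evaluated as a variable.
  utils::data(list = name, package = package, envir = env)
  if (!exists(name, envir = env, inherits = FALSE)) {
    stop("Dataset '", name, "' not found in package '", package, "'.")
  }
  get(name, envir = env, inherits = FALSE)
}

# Stand-in demonstration with a dataset that ships with R.
cars <- load_packaged_data("mtcars", "datasets")
```

Under the same assumptions, the application would call load_packaged_data("spatRastersAggregated", "spatialEpisim") instead of reading the files itself.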

@bryce-carson bryce-carson self-assigned this Jul 18, 2024
@bryce-carson bryce-carson added the UNSTABLE The work is to be completed in the unstable branch label Jul 18, 2024
@bryce-carson
Collaborator Author

Preferred, now that I've read more about it: drat (https://dirk.eddelbuettel.com/code/drat.html)

@bryce-carson
Collaborator Author

Backup of data in our private Google Drive

Don't worry, it's restricted. Only people who've been granted access to that Google Drive folder can utilize the link; for others it leads nowhere.

@bryce-carson
Collaborator Author

@ashokkrish, please see the above comment for a link to a backup of ye olde data/ folder. Part of creating the data package will be to remove these files from the repository (in the leftover-spaghetti-casserole branch, at first).

Keeping lots of large data files in a Git repository has a cost: because they are binary files, they affect Git's performance differently than text files such as source code, or small binary files. Large binary files slow Git down when it pushes and pulls, or when someone updates a very old copy of the repository to the current one.

The slowdown is not a big deal, but when I remove the data from the repository it will also be removed from the repository's history, which would effectively break archived versions of the software from long ago, because they expect the data to be available.

What do you think?

#35 is related, because if we're not given permission to redistribute the data we'll need to remove it from our history as well; otherwise we would still be redistributing it through older versions of our software repository, which is redistribution nonetheless. Did the Copyright office at MRU have anything to say about that issue?
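
Before removing anything, it may help to see which files are actually large. The following is a minimal sketch, not part of the repository: list_large_files is a hypothetical helper, the 1 MB threshold is an arbitrary assumption, and the demonstration runs against a temporary directory rather than the real data/ folder.

```r
# Hypothetical helper: list files under `dir` whose size is at least
# `min_bytes`, largest first.
list_large_files <- function(dir, min_bytes = 1e6) {
  paths <- list.files(dir, recursive = TRUE, full.names = TRUE)
  sizes <- file.size(paths)
  keep <- !is.na(sizes) & sizes >= min_bytes
  out <- data.frame(path = paths[keep], bytes = sizes[keep],
                    stringsAsFactors = FALSE)
  out[order(out$bytes, decreasing = TRUE), , drop = FALSE]
}

# Demonstration in a temporary directory with one large and one small file.
demo_dir <- file.path(tempdir(), "large-file-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(raw(2e6), file.path(demo_dir, "big.tif"))
writeLines("small", file.path(demo_dir, "notes.txt"))
large <- list_large_files(demo_dir, min_bytes = 1e6)
```

Pointing the helper at the repository's data/ folder would give a candidate list of files to move into the data package.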

@bryce-carson
Collaborator Author

@ashokkrish, I believe we should publish the data that _we are allowed to redistribute_ under a compatible license, and publish this on Zenodo, which is supported by CERN and is hosted in their data centre (so it's highly available and reliable).

In my work with Prof. Jon Mee, I published the data that I created as a research output on Zenodo and Dryad.

The only original data product I've produced in this research is the preaggregated WorldPop data. There's probably already an existing data product of this somewhere else; we could search for that and find out whether it's in a scientific data repository with high availability, or we can just go ahead with publishing that alone, and publish other original outputs separately.

We need to be careful about re-hosting and redistributing data we don't have rights to, such as the GADM and WorldPop data that I haven't manipulated (preaggregated). We might need to prune those data from this repository.

@bryce-carson
Collaborator Author

@ashokkrish, you could also share the data that users will need through Google Drive with a public link, if you're confident in your rights to do so (your rights to share the data).

@bryce-carson
Collaborator Author

I have created a data package, spatialEpisim.data.

Its R package sources are here: https://github.com/KrishnamurthyLab/spatialEpisim.data
Its R package website is here: https://krishnamurthylab.github.io/spatialEpisim.data/

It ONLY contains the seed data and the preaggregated GeoTIFF data from WorldPop, because that's the data we have rights to redistribute.
