GSoC 2020 Project Ideas
Please ask questions here. Tag @apoorvaeternity, @ethanwhite, @henrykironde
Preferred names (Apoorva, Henry, Ethan), Preferred_greeting (Hi|Hello|Dear|Thanks|Thank you [First_name])
The code of conduct should be your first read.
The Data Retriever is a package manager for data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The Data Retriever handles tabular and spatial data, including compressed versions of these data forms, i.e., zip, gz, and tar files.
The goal of the project is to enable the Data Retriever platform to ingest other forms of raw data. The project will introduce support for the raw data formats XML, JSON, NetCDF, HDF, Excel, SQLite, and GeoJSON.
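As an illustration of what ingesting a non-tabular format can involve, the sketch below flattens a small JSON array of nested objects into CSV text using only the Python standard library. It is not the Data Retriever's implementation or API; the function names `flatten` and `json_to_csv`, the dotted-key naming scheme, and the sample records are all assumptions made for this example.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def json_to_csv(json_text):
    """Convert a JSON array of objects into CSV text (hypothetical helper)."""
    records = [flatten(r) for r in json.loads(json_text)]
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


# Made-up sample data: the second record lacks the nested "lat" field.
raw = '[{"id": 1, "site": {"name": "A", "lat": 29.6}}, {"id": 2, "site": {"name": "B"}}]'
print(json_to_csv(raw))
```

Formats such as NetCDF or HDF would need dedicated libraries and a different mapping to tables; the sketch only covers the simplest JSON case.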
- Difficult
- Knowledge of Python
- Knowledge of Object Oriented Programming
- Knowledge of Git, continuous development and deployment tools
- Knowledge of R and Julia Programming
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @apoorvaeternity
- @henrysenyondo
- @ethanwhite
Data Retriever: Improve environment setup and installation on all platforms for all Data Retriever ecosystem services
The main Data Retriever package is a Python package with both a command line interface (CLI) and a Python interface. The platform is coupled with the Retriever-recipes repository, which stores the data packages. Additionally, the platform can be used from Julia and R through wrapper packages. The Julia package, Retriever.jl, and the R package, rdataretriever, are both hosted on GitHub. To be maximally useful, installation and use of the retriever should be easy from all three languages (Python, R, and Julia) and on all three operating systems (OS X, Windows, and Linux).
The goal of the project is to boost the usability of the Data Retriever ecosystem by making installation easy. Users should be able to install any of the packages with minimal steps or guidance, in a way that is intuitive for users of R, Julia, or Python.
This project will involve automating as much of the installation process for the Python package as possible within the R and Julia wrappers so that it is as close to a normal R or Julia package install as possible. This will involve the use of the reticulate package in R (as well as renv) and the PyCall package in Julia. These packages both support the conda package management system for installing Python packages. The goal is to either have the Python package installed automatically as part of the R/Julia package installations or to include functions associated with those packages that perform the installation (e.g., rdataretriever::install_core_retriever()).
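The decision logic behind such an install helper can be sketched in Python: check whether the retriever module is already importable and, if not, work out which install command to use. This is a minimal sketch under stated assumptions, not how install_core_retriever() is implemented. The function names `is_installed` and `build_install_command` are hypothetical, and the conda-forge channel is assumed as the package source. The script only prints the command rather than running it.

```python
import importlib.util
import shutil
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def build_install_command(package="retriever", conda_path=None, channel="conda-forge"):
    """Return the install command as an argument list (hypothetical helper).

    Uses conda when a conda executable path is given, otherwise falls back to pip
    for the running interpreter. The channel name is an assumption.
    """
    if conda_path:
        return [conda_path, "install", "--yes", "--channel", channel, package]
    return [sys.executable, "-m", "pip", "install", package]


if is_installed("retriever"):
    print("retriever is already available")
else:
    # shutil.which returns None when conda is not on PATH.
    command = build_install_command(conda_path=shutil.which("conda"))
    print("would run:", " ".join(command))
```

Returning the command as a list instead of executing it keeps the sketch safe to run and makes the choice between conda and pip easy to test on each operating system.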
One of the challenges with this task is ensuring that it works consistently across operating systems and development environments. For example, we have encountered situations where things that work smoothly in reticulate on Linux don't work in the same way on Windows, and we have seen cases where things work differently in RStudio than when running R directly. See https://github.com/ropensci/rdataretriever/issues/199 for an example of some of the challenges.
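When chasing differences of this kind, it can help to record the environment the Python side is actually running in. The sketch below gathers a few such details with the standard library. The function name `environment_report` and the choice of fields are assumptions for illustration. The check on the `RSTUDIO` environment variable is likewise an assumption about how an RStudio session can be detected, and `CONDA_PREFIX` is only set inside an activated conda environment.

```python
import os
import platform
import sys


def environment_report():
    """Collect basic facts about the running environment (hypothetical helper)."""
    return {
        "os": platform.system(),                      # e.g. "Linux", "Windows", "Darwin"
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,          # which interpreter is in use
        "conda_prefix": os.environ.get("CONDA_PREFIX"),  # None outside an active conda env
        "rstudio": "RSTUDIO" in os.environ,           # assumption: set by RStudio sessions
    }


for field, value in environment_report().items():
    print(f"{field}: {value}")
```

Attaching a report like this to an issue makes it easier to compare a working Linux setup with a failing Windows or RStudio one.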
Developing good documentation to help guide users through any non-automated steps will also be important.
This project will involve the use of modern DevOps technologies, such as Continuous Integration and Continuous Deployment pipelines, to test these solutions.
- Moderate Difficulty
- Knowledge of Python programming
- Knowledge of Git, continuous development and deployment tools
- Knowledge of R and Julia Programming
- Working knowledge of Python and R package managers including conda
The team at the Data Retriever primarily interacts via issues and pull requests on GitHub or through the Gitter channel.
- @apoorvaeternity
- @henrysenyondo
- @ethanwhite