GSoC 2019 Project Ideas
Please ask questions here. Tag @zhangcandrew, @henrykironde, @ethanwhite
Preferred names: Andrew, Henry, Ethan. Preferred greeting: Hi/Hello/Dear/Thanks/Thank you [First_name].
The code of conduct should be your first read.
Please ask questions here. Tag @ethanwhite, @henrykironde, @zhangcandrew.
The Data Retriever is a package manager for your data. It automatically finds, downloads, and pre-processes publicly available datasets, and stores them in a ready-to-analyse state. Currently the core software ships with its JSON script metadata. We want to move this metadata to a separate project location to help with organization, maintenance, and testing.
This project aims to scale up the number of datasets usable with the retriever and to standardize the maintenance of these scripts.
- Moderate Difficulty
- Knowledge of Python
- Knowledge of Web Requests
The Data Retriever team primarily interacts via issues and pull requests on GitHub.
- @henrysenyondo
- @ethanwhite
- @zhangcandrew
Please ask questions here. Tag @ethanwhite, @henrysenyondo, @zhangcandrew.
As script file versions change and updates are pushed to the retriever, reproducibility becomes an issue. Research scientists need to be able to reproduce the same output consistently.
The goal of the project is to reproduce the exact output obtained from a given combination of retriever version and script version at a previously tagged point in time. We can achieve this by using Docker to capture specific retriever versions and by caching previous versions of our JSON scripts.
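The caching half of this idea could look roughly like the sketch below. It is only an illustration, not the retriever's actual design: the function names `snapshot_script` and `load_script`, the cache layout, and the `name`, `version`, and `url` fields used in the sample data are all assumptions made for the example. Each script is stored once under its name, version, and a content hash, so an older version can be loaded again later.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot_script(cache_dir: Path, script: dict) -> str:
    """Store one immutable copy of a script and return its SHA-256 content hash.

    Hypothetical helper: assumes the script dict has "name" and "version" keys.
    """
    # Canonical serialization, so identical metadata always hashes the same.
    payload = json.dumps(script, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    target = cache_dir / script["name"] / f"{script['version']}-{digest[:12]}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # never overwrite an existing snapshot
        target.write_bytes(payload)
    return digest


def load_script(cache_dir: Path, name: str, version: str) -> dict:
    """Return the cached script for an exact name and version (hypothetical helper)."""
    matches = sorted((cache_dir / name).glob(f"{version}-*.json"))
    if not matches:
        raise FileNotFoundError(f"no cached copy of {name} {version}")
    return json.loads(matches[0].read_text(encoding="utf-8"))


# Made-up sample data: two versions of the same script.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    snapshot_script(cache, {"name": "demo", "version": "1.0.0", "url": "https://example.org/a.csv"})
    snapshot_script(cache, {"name": "demo", "version": "1.1.0", "url": "https://example.org/b.csv"})
    old = load_script(cache, "demo", "1.0.0")
    print(old["url"])  # https://example.org/a.csv
```

Pairing a cached script version with a Docker image tagged for a specific retriever release would then pin both halves of the combination.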
- Moderate Difficulty
- Knowledge of Python
- Knowledge of Docker
- Principles of Object Oriented Programming
- Familiarity with Git Provenance
The Data Retriever team primarily interacts via issues and pull requests on GitHub.
- @henrysenyondo
- @ethanwhite
Please ask questions here. Tag @ethanwhite, @henrysenyondo, @zhangcandrew.
As Retriever functionality increases, we want to be sure not to neglect efficiency. In particular, for large datasets, we want to be able to process the data without having to download all of it first.
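One way to picture this kind of incremental processing is the minimal sketch below. It is not the retriever's implementation: the function name `iter_chunks`, the chunk size, and the sample columns are assumptions for illustration. An in-memory `io.StringIO` stands in for a remote file here. The same generator should work on any iterable of text lines, so a streamed HTTP response could be handled the same way.

```python
import csv
import io
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size parsed rows from a CSV text stream."""
    reader = csv.DictReader(lines)
    chunk: List[dict] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


# Made-up sample data standing in for a large remote CSV file.
stream = io.StringIO("id,value\n1,10\n2,20\n3,30\n")

total = 0
for chunk in iter_chunks(stream, chunk_size=2):
    # Only one chunk of rows is held in memory at a time.
    total += sum(int(row["value"]) for row in chunk)

print(total)  # 60
```

Because the generator only ever holds one chunk, memory use depends on `chunk_size` rather than on the size of the dataset.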
- Moderate Difficulty
- Knowledge of Python
- Knowledge of Web Requests
The Data Retriever team primarily interacts via issues and pull requests on GitHub.
- @henrysenyondo
- @ethanwhite
- @zhangcandrew