Add uploading of datasets. #52

Open
ManuelAlvarezC opened this issue Apr 9, 2020 · 4 comments
Labels
enhancement (New feature or request)

Comments

@ManuelAlvarezC
Collaborator

Description

Trello card: https://trello.com/c/tb08vrGi

We need to upload the datasets generated by our data sources to make them easily accessible to other teams.

To do so, we need to (a rough sketch of steps 1–4 follows the list):

  1. Create a set of default parameters for each data source, which may be empty.
  2. Create a function that, given a data_source, runs it with the defined parameters, stores the result as a CSV, and packs it in a folder with a copy of the audit and metapackage.json files.
  3. Make a function, using this notebook as a template, that takes the path to a data package and uploads it to Kaggle.
  4. Create a function that takes no arguments and iterates through the data sources, generating the data packages and uploading them to Kaggle.
  5. Create a GitHub Action that runs the function from step 4 every 24 hours.
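
A minimal sketch of how steps 1–4 could fit together, assuming each data source is a callable that returns a pandas.DataFrame (as the data source spec requires). All names here (DATA_SOURCES, DEFAULT_PARAMS, package_data_source, audit.json) are illustrative placeholders, and the Kaggle upload is left as a stub since it depends on the notebook / kaggle-storage-client:

```python
# Illustrative sketch only; registry names and file names are assumptions.
import shutil
from pathlib import Path

# Step 1: default parameters for each data source (may be empty).
DEFAULT_PARAMS = {
    "census": {},
    "meteo": {"days_back": 7},
}

# Hypothetical registry mapping names to callables returning a pandas.DataFrame.
DATA_SOURCES = {}  # e.g. {"census": get_census, "meteo": get_meteo}


def package_data_source(name, output_dir="packages"):
    """Step 2: run a data source and pack the CSV with the audit/metapackage files."""
    df = DATA_SOURCES[name](**DEFAULT_PARAMS.get(name, {}))
    package_dir = Path(output_dir) / name
    package_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(package_dir / f"{name}.csv", index=False)
    for extra in ("audit.json", "metapackage.json"):  # assumed file names
        shutil.copy(extra, package_dir / extra)
    return package_dir


def upload_package(package_dir):
    """Step 3: upload the data package to Kaggle (e.g. with kaggle-storage-client)."""
    raise NotImplementedError("depends on the notebook used as a template")


def run_all():
    """Step 4: regenerate and upload every data package.

    A scheduled GitHub Action would call this every 24 hours (step 5).
    """
    for name in DATA_SOURCES:
        upload_package(package_data_source(name))
```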
ManuelAlvarezC added the enhancement (New feature or request) label on Apr 9, 2020
@smartcaveman
Member

Comments/Suggestions

  • While we may only need CSV inputs for this repository at this point, it's highly likely that a reusable solution would need to be robust enough to consume and produce additional formats (JSON, JSON-LD, RDF, XML, YAML, etc.). Understanding this, it would be wise to implement some extensibility here by parameterizing the formats passed to whatever function processes the data sources.
  • The source for the library referenced by the sample notebook is at kaggle-storage-client. Please raise any issues on that project if this implementation identifies obstacles to its use. It was built with the intention of making this kind of thing easier.
  • Step 4 doesn't need to run if neither the code nor the data sources have changed since the last execution. To simplify this comparison, store the hash of the source files (see the sketch after this list).
  • Step 4 could become reusable for other teams' datasets if we parameterize either (1) the set of data sources or (2) a configuration file containing the set of data sources.
  • W3C DCAT describes recommended semantics for describing aggregations of datasets. This is not necessary to consider yet, but may be helpful down the line as these processes become more complex.
  • dataflows and datapackage-pipelines may provide useful examples for what a mature, generic implementation of this kind of process might look like.
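
A rough sketch of how that hash comparison could work, assuming a small JSON cache file; the file name and layout are arbitrary:

```python
# Illustrative only: combine the hashes of the relevant files for a data source
# and skip the run when the combined hash matches the one stored last time.
import hashlib
import json
from pathlib import Path

HASH_FILE = Path(".last_run_hashes.json")  # hypothetical cache location


def files_hash(paths):
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def needs_run(name, paths):
    stored = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    return stored.get(name) != files_hash(paths)


def record_run(name, paths):
    stored = json.loads(HASH_FILE.read_text()) if HASH_FILE.exists() else {}
    stored[name] = files_hash(paths)
    HASH_FILE.write_text(json.dumps(stored, indent=2))
```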

@ManuelAlvarezC
Collaborator Author

Hi @smartcaveman, thanks for your comments. Let me answer along your quotes:

Comments/Suggestions

  • While we may only need CSV inputs for this repository at this point, it's highly likely that a reusable solution would need to be robust enough to consume and produce additional formats (JSON, JSON-LD, RDF, XML, YAML, etc.). Understanding this, it would be wise to implement some extensibility here by parameterizing the formats passed to whatever function processes the data sources.

Indeed, that's why the data source specification demands that the output be a pandas.DataFrame. This way, changing the output format is one line, and making it work with different format parameters only a couple more.
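
As a concrete illustration of that point (a sketch, not code from the repository), a writer parameterized by format could be as small as:

```python
import pandas as pd

# Map format names to pandas' built-in serializers; the format names are
# illustrative, and "parquet" needs an optional engine such as pyarrow installed.
WRITERS = {
    "csv": lambda df, path: df.to_csv(path, index=False),
    "json": lambda df, path: df.to_json(path, orient="records"),
    "parquet": lambda df, path: df.to_parquet(path, index=False),
}


def save_dataframe(df: pd.DataFrame, path: str, fmt: str = "csv"):
    WRITERS[fmt](df, path)
```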

  • The source for the library referenced by the sample notebook is at kaggle-storage-client. Please raise any issues on that project if this implementation identifies obstacles to its use. It was built with the intention of making this kind of thing easier.

Will do. Thanks.

  • Step 4 doesn't need to run if neither the code nor the data sources have changed since the last execution. To simplify this comparison, store the hash for source files.

Some of our datasets, like the census, won't be changing, so it was already considered (although not written in the issue) that some data sources shouldn't run at every execution.
Also, other data sources, like meteo or COVID cases, may have new data every day, so it makes sense to run them even if the code hasn't changed.

However, I will make sure that these, let's call them "static", data sources are also executed whenever their code is updated (roughly sketched below).
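
Roughly, under the assumption that each data source can be flagged as static (the flag and the code-change check are placeholders):

```python
# Hypothetical scheduling rule: "static" sources only rerun when their code has
# changed; sources with daily updates (meteo, covid cases, ...) always rerun.
STATIC_SOURCES = {"census"}  # illustrative set


def should_run(name, code_changed):
    if name in STATIC_SOURCES:
        return code_changed
    return True
```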

  • Step 4 could become reusable to other teams' datasets if we parameterize either (1) the set of data sources; or, (2) a configuration file containing the set of data sources

This is an interesting remark. Yesterday I had a call with Anton regarding this issue, and I have kept that in mind while thinking about the design of the solution.

  • W3C DCAT describes recommended semantics for describing aggregations of datasets. This is not necessary to consider yet, but may be helpful down the line as these processes become more complex.
  • dataflows and datapackage-pipelines may provide useful examples for what a mature, generic implementation of this kind of process might look like.

I will check them in more detail over the weekend if I have time. They definitely look really interesting.

@smartcaveman
Member

@ManuelAlvarezC sorry, my comment was ambiguous. Re: "To simplify this comparison, store the hash for source files.", I was referring to both source code and data source files. So, if you download a large dataset that's expensive to process and the hash of the dataset is the same as it was the last time it was processed, then it doesn't need to be reprocessed.
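
A sketch of that idea, reading the downloaded file in chunks so large datasets don't have to fit in memory (the helper names are made up):

```python
import hashlib


def dataset_hash(path, chunk_size=1 << 20):
    """Hash the raw downloaded file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_if_changed(raw_path, last_hash, process):
    """Run the expensive processing step only when the raw data has changed."""
    current = dataset_hash(raw_path)
    if current != last_hash:
        process(raw_path)
    return current
```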

@hyberson

hyberson commented May 1, 2020
