This document provides recommendations for releasing experiment artifacts to aid the research community.
To put discussion into action, the templates directory contains a base README.md template that will aid the discovery and ingestion of experimental artifacts. Please consider using the README.md template for your artifact repositories.
List any prerequisite dependencies your experiment has to aid in its reproducibility. Some recommendations to consider:
- Assume little, if any, background on the technologies your artifact relies on; this increases the likelihood of re-use.
- Keep your instructions clear and concise so that researchers looking to re-use your artifacts can do so without your assistance.
- If feasible, provide a fully reproducible version of your artifact by leveraging a system such as Docker: publish a Docker image on Docker Hub and include the Dockerfile used to re-create it (a minimal sketch follows this list).
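As a rough illustration, here is a minimal Dockerfile sketch; the base image, file names, and entry point are hypothetical placeholders rather than a prescribed setup:

```dockerfile
# Minimal sketch of a Dockerfile for a reproducible experiment image.
# The base image, file names, and entry point below are hypothetical;
# substitute the actual dependencies and scripts your experiment uses.
FROM python:3.11-slim

WORKDIR /experiment

# Pin dependencies so the environment can be re-created exactly.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the experiment code and supporting scripts into the image.
COPY . .

# Run the experiment end to end by default.
CMD ["python", "run_experiment.py"]
```

The image can then be built with `docker build -t <user>/<experiment> .` and published with `docker push`, so others can pull and run it without rebuilding your environment by hand.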
Share your artifacts in the form in which they are most useful. Consider items like the following as you share them with other researchers:
- Completeness of the artifacts. Make sure to include everything that is necessary to validate your experiment.
- Artifacts are most useful when shared in a common or well-known format, which aids re-use by other researchers. While proprietary formats may sometimes be required, point out any tools that provide access to that data to lower the learning curve for potential users.
Consider including artifacts like the following (one possible repository layout is sketched after this list):
- Datasets
- Code
- Configurations
- Experiment setup tools
- System images (e.g., Docker images)
- Testbed-specific initialization scripts
- Publications related to this artifact
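As one hypothetical example (all directory names are placeholders), a repository containing these artifacts might be laid out as follows:

```
your-artifact-repo/
├── README.md      # built from the template in the templates directory
├── Dockerfile     # re-creates the experiment environment
├── data/          # datasets, or pointers to externally hosted data
├── src/           # experiment code
├── config/        # configuration files
├── scripts/       # setup and testbed-specific initialization scripts
└── docs/          # related publications and notes
```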
Experiments may have subtleties that could not be explained in the research paper due to space constraints. Including the code you used for setup, collection, reformatting, or analysis, along with a description of what that code does and where it fits in the experimental pipeline, can be extremely helpful to your fellow researchers.
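Even a short header comment stating a script's pipeline stage, inputs, and outputs goes a long way. A minimal hypothetical sketch in Python (all file, script, and field names below are placeholders):

```python
"""Hypothetical example of documenting where a script fits in the pipeline.

Stage 2 of 3: reformatting. Converts the raw collector output
(data/raw/*.csv, produced by collect.py in stage 1) into the tidy
format consumed by the analysis script in stage 3.
"""
import csv
import pathlib

RAW_DIR = pathlib.Path("data/raw")        # output of the collection stage
OUT_FILE = pathlib.Path("data/tidy.csv")  # input to the analysis stage

def main() -> None:
    rows = []
    for raw in sorted(RAW_DIR.glob("*.csv")):
        with raw.open(newline="") as f:
            for record in csv.DictReader(f):
                # Keep only the fields the analysis stage needs.
                rows.append({"trial": record["trial"],
                             "latency_ms": record["latency_ms"]})
    with OUT_FILE.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trial", "latency_ms"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()
```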
If your work falls within a specific scientific discipline, it may be useful to include additional discipline-specific items. For example:
- For ML-based efforts, consider checking out the ML recommendations here.
It can be useful to provide some of your key results as part of sharing your artifacts. Consider including a table of those results and graphs showing them, along with the specific commands used to generate those results from the shared artifacts.
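As a hypothetical sketch of such a regeneration script (the file, column, and figure names are placeholders, and matplotlib is assumed to be available):

```python
# Hypothetical sketch: regenerate a results figure from shared artifacts,
# so readers can verify it against the published graph.
# Assumes the analysis stage exported results/summary.csv.
import csv
import matplotlib.pyplot as plt

with open("results/summary.csv", newline="") as f:
    rows = list(csv.DictReader(f))

x = [float(r["load"]) for r in rows]
y = [float(r["latency_ms"]) for r in rows]

plt.plot(x, y, marker="o")
plt.xlabel("Offered load")
plt.ylabel("Latency (ms)")
plt.savefig("latency_vs_load.png")  # compare against the published figure
```

Documenting the exact invocation (e.g., `python plot_results.py`) next to the published table or graph lets readers check their regenerated output against yours.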
The following storage options and tools are drawn from the ML recommendations mentioned above, modified to suit the needs of this document:
| Service | Versioning | Storage | Bandwidth | Notes |
|---------|------------|---------|-----------|-------|
| Zenodo | Yes | 50GB | Free | DOI; provides long-term preservation |
| GitHub Releases | Yes | 2GB file limit | Free | |
| OneDrive | Yes | 2GB (free) / 1TB (with Office 365) | Free | |
| Google Drive | Yes | 15GB | Free | |
| Dropbox | Yes | 2GB (paid unlimited) | Free | |
| AWS S3 | Yes | Paid only | Paid | |
Tools that can help manage and version artifacts across such services include:
- DAGsHub - a way to track experiments, version data, models & pipelines, using Git
- RClone - provides unified access to many different cloud storage providers
- dvc - an open-source version control system designed for machine learning projects (see the sketch after this list)
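As a hedged example of how consumers might then retrieve a specific released version of a dataset, dvc exposes a small Python API; the repository URL, file path, and tag below are placeholders:

```python
# Hypothetical sketch: read a specific, tagged version of a dataset that
# was published with dvc, without cloning the whole repository.
# The repo URL, file path, and revision below are placeholders.
import dvc.api

with dvc.api.open(
    "data/measurements.csv",  # path tracked by dvc in that repo
    repo="https://github.com/example/artifact-repo",
    rev="v1.0",               # git tag identifying the release
) as f:
    header = f.readline().strip()
    print(header)
```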
If you'd like to contribute to these best practices, please open an issue on this GitHub repository or submit a pull request. Also, please check out the NSF SEARCCH project.
All content in this repository is licensed under the MIT license.