Data Science Project Folder Structure

Discover a standardized Data Science project directory layout in this repository, crafted in line with industry-leading practices. This structure ensures your data projects are intuitive, organized, and sustainable.

Table of Contents

  1. Project Architecture
  2. Folder Descriptions
  3. Contribution

Project Architecture

This structure is crafted to guide every data science professional and enthusiast through a streamlined workflow, from raw data ingestion to deployment of the final model API. The project's folder structure looks like this:

.
├── data
│   ├── processed
│   └── raw
├── docs
├── models
│   └── logs
├── notebooks
│   ├── 1_data_preparation
│   ├── 2_exploratory_data_analysis_and_preprocessing
│   ├── 3_feature_engineering
│   └── 4_model_training_and_evaluation
├── references
├── reports
│   └── figures
└── src
    ├── back_end
    │   └── config
    └── front_end
        └── config

Folder Descriptions

data

Every data science initiative commences with raw data. This directory shepherds data from its pristine, untouched state to a refined version suitable for further analysis.

  • raw: Retain your untouched, initial datasets here for traceability.
  • processed: Reserved for datasets that have undergone preliminary cleansing, transformations, or other preparatory steps, as sketched below.
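
For illustration, here is a minimal Python sketch of the raw-to-processed flow. It assumes a CSV dataset; the file names, column, and cleansing step are hypothetical, not conventions defined by this template.

    # Hypothetical example: move a dataset from data/raw to data/processed.
    import pandas as pd

    df = pd.read_csv("data/raw/customers.csv")  # untouched source data
    df = df.dropna(subset=["customer_id"])  # an illustrative cleansing step
    df.to_csv("data/processed/customers_clean.csv", index=False)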

docs

A dedicated space for API documentation. Every sophisticated codebase should be paired with robust documentation, and this directory keeps that documentation consistent and clear.

models

Effective model governance is pivotal. This directory houses trained models, from classifiers to regressors and other paradigms.

  • logs: Model training is an iterative process that generates vital logs with every run and hyperparameter adjustment. Archive them here for future audits and performance checks, as sketched below.
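
As a hedged illustration, the sketch below trains a toy model, writes its training log under models/logs, and persists the fitted model. The estimator, metric, and file names are assumptions, not part of this template.

    # Hypothetical example: persist a trained model and its training log.
    import logging

    import joblib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    logging.basicConfig(filename="models/logs/train.log", level=logging.INFO)

    # Tiny synthetic dataset so the sketch runs end to end.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    model = LogisticRegression().fit(X, y)
    logging.info("training accuracy: %.3f", model.score(X, y))
    joblib.dump(model, "models/model.joblib")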

notebooks

Jupyter notebooks are indispensable tools in contemporary data science. This directory encapsulates distinct stages of the data science workflow:

  • 1_data_preparation: Prepare your datasets for exhaustive analysis.
  • 2_exploratory_data_analysis_and_preprocessing: A deep dive into data characteristics, visualization, and preprocessing to prime the data for feature extraction.
  • 3_feature_engineering: Devise and refine dataset attributes to elevate model efficacy.
  • 4_model_training_and_evaluation: Turn prepared data into valuable predictions through iterative cycles of model training, tuning, and evaluation.

references

Scientific undertakings like Data Science are collaborative by nature. This directory collects the scholarly articles, research papers, and other resources that have informed the project.

reports

Tailored for stakeholders and for longitudinal reviews, this section aggregates synthesized findings, interpretations, and other pertinent reports.

  • figures: A coherent visual speaks volumes. Catalog visual representations from EDA or model evaluation here for seamless reference.

src

The operational core of the solution. As experimental code matures toward deployment, this directory hosts the refined, production-ready versions: API interfaces, utility modules, and other integral software units.
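
As one hedged illustration of how the config subdirectories might be used, the sketch below reads back-end settings from a YAML file. The file name, its keys, and the use of PyYAML are assumptions, not conventions defined by this template.

    # Hypothetical example: load service settings from src/back_end/config.
    from pathlib import Path

    import yaml  # PyYAML; an assumed dependency, not mandated by this template

    CONFIG_PATH = Path("src/back_end/config/settings.yaml")

    def load_config() -> dict:
        """Read back-end settings from the config directory."""
        with CONFIG_PATH.open() as f:
            return yaml.safe_load(f)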

Contribution

We greatly value collaboration and believe that collective insights drive optimal outcomes. If you're considering enhancing this repository, be it through bug fixes, feature suggestions, or documentation refinements, your expertise is invaluable to us.

Guidelines for Effective Contribution:

  1. Initialize: Start by forking this repository. Once done, clone your fork to your local environment.

    git clone <URL_OF_YOUR_FORK>

  2. Environment Setup: Ensure all dependencies and prerequisites are correctly installed and configured.

  3. Branch Creation: For every new contribution, create a separate branch. This keeps the main branch stable and unaffected by ongoing development.

    git checkout -b <YOUR_NEW_BRANCH_NAME>

  4. Implement Changes: Develop the enhancements or fixes, making sure the changes are well documented and pass any existing tests.

  5. Committing: Commit your modifications with a concise, informative commit message that sheds light on the changes made.

    git commit -m "Descriptive message about the changes"

  6. Pushing Changes: Push the updates to your fork on GitHub.

    git push origin <YOUR_NEW_BRANCH_NAME>

  7. Pull Request: Navigate to the 'Pull Requests' tab of the main repository and open a new pull request. Ensure the base branch is the repository's main branch and the compare branch is your contribution branch. Include a detailed description outlining the changes and their motivations.

  8. Feedback: After you submit the PR, our maintainers will review it and communicate any required changes or suggestions. Constructive dialogue is essential for refining contributions.

If at any point you are unsure or require clarity on the process, please raise an issue or get in touch. Your involvement, dedication, and expertise are highly esteemed, and together we can ensure this repository remains at the forefront of industry standards.
