Coffee-Quality-Analysis

A complete clustering analysis of a dataset describing coffee bean characteristics.
The purpose of this project is to find out whether coffee beans from different parts of the world have similar values across multiple variables, and whether they can be grouped into clusters based on the data recorded for each type of bean.

Before the clustering analysis, a PCA (Principal Component Analysis) is run to determine which variables are most informative, that is, the ones that account for most of the variability in the data and are therefore the best ones to use for the clustering process.
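As a rough illustration of this PCA step, here is a minimal sketch using scikit-learn. The data is synthetic (the project itself reads the coffee CSV), and the variable names, the number of components and the choice of `svd_solver="full"` as the "SVD" solver are assumptions made for the example, not the project's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned data: 200 samples, 5 numeric variables
# driven by 2 hidden factors plus a little noise (assumption for this sketch).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA is sensitive to scale, so each variable is standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Fit the same PCA with the full SVD solver and with the default "auto" solver.
results = {}
for solver in ("full", "auto"):
    results[solver] = PCA(n_components=5, svd_solver=solver).fit(X_scaled)

# Scree-plot data: share of variance explained by each component.
explained = results["full"].explained_variance_ratio_
cumulative = np.cumsum(explained)

# Variables ranked by the absolute size of their loading on the first component.
pc1_loadings = np.abs(results["full"].components_[0])
ranking = np.argsort(pc1_loadings)[::-1]

print("explained variance ratio:", np.round(explained, 3))
print("cumulative:", np.round(cumulative, 3))
print("variable indices ranked by PC1 loading:", ranking)
```

Variables with large loadings on the first few components are the ones that carry most of the variability, which is the sense in which PCA can guide the choice of clustering inputs.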

Main Libraries Used for The Project

Scikit-Learn

MLXTend

SciPy

Yellowbrick

Clustering Evaluation Methods

Elbow Method

Silhouette
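
To show how the two evaluation methods differ in practice, the sketch below computes both on synthetic data with three well-separated groups. The data, the range of K and the variable names are assumptions for illustration; the project uses Yellowbrick and its own functions for the actual plots.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups in 3D (assumption for this sketch).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [-10.0, 10.0, -10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(60, 3)) for c in centers])

inertias = {}     # Elbow Method: within-cluster sum of squares for each K
silhouettes = {}  # Silhouette Method: mean silhouette coefficient for each K
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# The elbow is read off the inertia curve by eye (or with a heuristic),
# while the silhouette gives a score that can simply be maximized.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("K with the highest silhouette:", best_k)
```

The inertia keeps shrinking as K grows, so the Elbow Method needs a judgment about where the curve flattens; the silhouette score peaks at a specific K, which makes the result easier to automate.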

Project Operation:

1. Retrieves the dataset
2. Cleans the data
3. Exports the cleaned data to a secondary CSV file
4. Runs an EDA to generate an overview of the data and prints the main descriptive statistics, which should be known before starting the analysis
5. Preprocesses the data to prepare it for the analysis
6. Runs a PCA with two different solvers
6.1 SVD (Singular Value Decomposition)
6.2 Auto-Solver (the default set by the libraries)
7. Exports the scree plots, which show the explained variance for each dimension
8. Runs clustering on the data using the K-Means algorithm
9. Analyzes the data obtained from the clustering process
10. Estimates the optimal number of clusters using the Elbow Method and plots the results
11. Estimates the optimal number of clusters using the Silhouette Method (generally the more reliable of the two, since it does not depend on visually locating an elbow)
11.1 Three different distances are used: Euclidean, Minkowski and Manhattan
11.2 The function that implements the Silhouette Method also returns the best distance for each number of clusters K
12. Represents the clustering results in a 3D plot with three variables, generating one plot for each K
Example: if the analysis runs the clustering process with K from 2 to 10, then 9 plots are generated, one for each K.
13. Exports the plots and prints the clustering results to the terminal
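
Steps 11.1 and 11.2 could be sketched roughly as follows. The helper name `silhouette_by_distance`, the synthetic data and the Minkowski order `p=3` are hypothetical choices for this example and are not taken from the project. Note that scikit-learn's K-Means always fits with Euclidean distance, so in this sketch the distance only changes how the resulting labels are scored.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_distance(X, k_values, minkowski_p=3):
    """Hypothetical helper: score K-Means labels under three distances per K."""
    # Extra keyword arguments are forwarded to the distance computation.
    # With p=2 Minkowski equals Euclidean, so p=3 is used here (assumption).
    metrics = {
        "euclidean": {},
        "manhattan": {},
        "minkowski": {"p": minkowski_p},
    }
    report = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores = {
            name: silhouette_score(X, labels, metric=name, **kwargs)
            for name, kwargs in metrics.items()
        }
        best = max(scores, key=scores.get)
        report[k] = {"scores": scores, "best_distance": best}
    return report


# Synthetic stand-in for three preprocessed variables (assumption).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0], [8.0, 8.0, 0.0], [0.0, 8.0, 8.0]])
X = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])

# K from 2 to 10 inclusive gives 9 values, hence 9 results (and 9 plots).
k_values = range(2, 11)
report = silhouette_by_distance(X, k_values)

for k, entry in report.items():
    best = entry["best_distance"]
    print(k, best, round(entry["scores"][best], 3))
```

Each entry of `report` holds the three scores for one K together with the distance that scored highest, which mirrors the idea of returning the best distance for each K.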

Original source of the dataset: https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi

