A complete clustering analysis of a coffee bean characteristics dataset.
The purpose of this project is to find out whether coffee beans from different parts of the world have similar values across multiple variables, and whether they can be grouped into clusters based on the data for each type of bean.
Before the clustering analysis, a PCA (Principal Component Analysis) is run to identify the most informative variables, that is, the ones that account for most of the variability in the data and are therefore the best candidates for the clustering process.
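As a rough illustration of how a PCA can point to the most informative variables, here is a minimal sketch using scikit-learn. It is not the project's own code: the data is synthetic, the column names are hypothetical, and mapping the "SVD" solver to `svd_solver="full"` is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned numeric columns (hypothetical names,
# not taken from the real dataset).
rng = np.random.default_rng(0)
columns = ["aroma", "flavor", "acidity", "body", "moisture"]
shared = rng.normal(size=(200, 1))
X = np.hstack([
    shared + 0.1 * rng.normal(size=(200, 3)),  # three strongly correlated variables
    rng.normal(size=(200, 2)),                 # two independent variables
])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Fit with an exact SVD solver and with the library default ("auto").
fitted = {}
for solver in ("full", "auto"):
    fitted[solver] = PCA(n_components=3, svd_solver=solver).fit(X_scaled)

pca = fitted["full"]
# Share of the total variance captured by each component (scree plot input).
explained = pca.explained_variance_ratio_
print("explained variance ratio:", explained)

# Rank variables by the absolute size of their loading on the first component.
loadings = np.abs(pca.components_[0])
ranking = [columns[i] for i in np.argsort(loadings)[::-1]]
print("variables ranked by PC1 loading:", ranking)
```

In this sketch the three correlated columns should rank highest on the first component, which is the kind of signal used to choose variables for clustering.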
- It retrieves the dataset
- Cleans the data
- Exports the cleaned data into a secondary CSV file
- Executes an EDA (Exploratory Data Analysis) to generate an overview of the data and prints the main descriptive statistics needed before starting the analysis
- Preprocesses the data to prepare it for the analysis
- Executes a PCA with two different solvers
  - 6.1 SVD (Singular Value Decomposition)
  - 6.2 Auto-Solver (Set by default by the libraries)
- Exports the Scree Plots which represent the explained variance for each dimension
- Executes clustering on the data using the K-Means algorithm
- Analyzes the data obtained from the clustering process
- Calculates the optimal number of clusters using the Elbow Method and plots the results
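- Calculates the optimal number of clusters using the Silhouette Method (generally a more reliable indicator than the Elbow Method, since it scores both how compact and how well separated the clusters are)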
  - 11.1 Three different distances are used: Euclidean, Minkowski and Manhattan
  - 11.2 The function which implements the Silhouette Method also returns the best distance for each number of clusters K
- Represents the clustering results in a 3D plot with three variables. One plot is generated for each K. Example: if the analysis runs the clustering process with K from 2 to 10, then 9 plots are generated, one for each K.
- Exports the plots and prints the clustering results on the terminal
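The clustering and model-selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses scikit-learn, synthetic blobs instead of the coffee data, and a Minkowski order of `p=3` chosen only so that it differs from Euclidean (`p=2`) and Manhattan (`p=1`). Note that K-Means itself always optimizes squared Euclidean distance, so here the three distances only affect the silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups in three variables
# (an assumption for illustration; the real input is the preprocessed dataset).
centers = [[0, 0, 0], [10, 10, 10], [-10, 10, 0], [10, -10, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# Distance name -> extra keyword arguments forwarded to the distance function.
distances = {
    "euclidean": {},
    "manhattan": {},
    "minkowski": {"p": 3},  # assumed order; p=3 is not from the original project
}

inertias = {}       # Elbow Method input: within-cluster sum of squares per K
best_distance = {}  # Silhouette Method: (best distance name, score) per K

for k in range(2, 11):  # K from 2 to 10 gives 9 results
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

    scores = {
        name: silhouette_score(X, model.labels_, metric=name, **kwargs)
        for name, kwargs in distances.items()
    }
    winner = max(scores, key=scores.get)
    best_distance[k] = (winner, scores[winner])

# K with the highest silhouette score across all distances.
best_k = max(best_distance, key=lambda k: best_distance[k][1])

for k in sorted(best_distance):
    name, score = best_distance[k]
    print(f"K={k}: inertia={inertias[k]:.2f}, best distance={name}, silhouette={score:.3f}")
print("K with the highest silhouette score:", best_k)
```

For the Elbow Method, plotting `inertias` against K and looking for the point where the curve flattens gives the candidate K; the silhouette scores give a second, numeric criterion to compare against it.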
Original source of the dataset: https://www.kaggle.com/datasets/fatihb/coffee-quality-data-cqi