This repository contains implementations of three unsupervised clustering algorithms: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), K-Means, and K-Means EM (Expectation-Maximization). These algorithms are used to group similar data points together without any prior knowledge of the labels or categories.
Clustering is a fundamental task in unsupervised machine learning, where the goal is to group similar objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups. This repository provides implementations of three clustering techniques:
- DBSCAN: A density-based clustering algorithm that can find arbitrarily shaped clusters and is robust to outliers.
- K-Means: A popular partitioning method that divides data into `k` clusters by minimizing the variance within each cluster.
- K-Means EM: An extension of K-Means that uses the Expectation-Maximization (EM) algorithm to estimate the parameters of a mixture of Gaussian distributions.
## DBSCAN

- Description: DBSCAN groups points that are closely packed together and marks isolated points as outliers. It works well with clusters of varying shapes and sizes.
- Parameters:
  - `eps`: The maximum distance between two points for them to be considered neighbors.
  - `min_samples`: The minimum number of points required to form a dense region (i.e., a cluster).
- Advantages:
  - Can find arbitrarily shaped clusters.
  - Does not require specifying the number of clusters.
  - Robust to noise and outliers.
- Disadvantages:
  - Not suitable for datasets with widely varying densities, since a single `eps` cannot fit them all.
  - Performance depends on the choice of `eps` and `min_samples`.
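As a rough illustration of how `eps` and `min_samples` drive the clustering, here is a minimal NumPy sketch of DBSCAN (illustrative only, not this repository's implementation; it builds a full O(n²) distance matrix for simplicity):

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN sketch: returns cluster ids (0, 1, ...), -1 for noise."""
    n = len(X)
    # Full pairwise distance matrix (O(n^2) memory; fine for small data)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)            # -1 marks noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        # Only unvisited core points (>= min_samples neighbors) seed clusters
        if visited[i] or len(neighbors[i]) < min_samples:
            continue
        stack, visited[i] = [i], True
        while stack:                   # flood-fill the density-connected region
            j = stack.pop()
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # expand only from core points
                for q in neighbors[j]:
                    if not visited[q]:
                        visited[q] = True
                        stack.append(q)
        cluster += 1
    return labels
```

With two tight groups and one isolated point, the isolated point keeps the label `-1` (noise) while each group gets its own cluster id.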
## K-Means

- Description: K-Means is a centroid-based algorithm that partitions the dataset into `k` clusters, where each data point belongs to the cluster with the nearest mean (centroid).
- Parameters:
  - `k`: The number of clusters to form.
  - `max_iter`: Maximum number of iterations for the algorithm.
  - `tol`: Tolerance to declare convergence.
- Advantages:
  - Simple and fast.
  - Works well for spherical clusters.
- Disadvantages:
  - Requires the number of clusters (`k`) to be specified in advance.
  - Sensitive to the initial placement of centroids.
  - Assumes clusters are spherical and evenly sized.
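The assign-then-update loop behind `k`, `max_iter`, and `tol` can be sketched with NumPy (a simplified stand-in, not this repository's implementation; initializing centroids by sampling data points is an assumption of the sketch):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal Lloyd's K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points (one common heuristic)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.linalg.norm(new - centroids) < tol:  # tol declares convergence
            centroids = new
            break
        centroids = new
    return centroids, labels
```

Because the result depends on the initial centroids (a disadvantage noted above), implementations often run several restarts and keep the lowest-variance solution.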
## K-Means EM

- Description: K-Means EM is a variant of K-Means that incorporates the Expectation-Maximization algorithm. It models the data as a mixture of Gaussian distributions and iteratively refines the cluster assignments and parameters.
- Parameters:
  - `k`: The number of clusters to form.
  - `max_iter`: Maximum number of iterations for the EM algorithm.
  - `tol`: Tolerance to declare convergence.
- Advantages:
  - Handles overlapping clusters.
  - Provides a probabilistic cluster membership.
- Disadvantages:
  - Requires the number of clusters (`k`) to be specified in advance.
  - Computationally more expensive than standard K-Means.
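A minimal EM loop for a Gaussian mixture makes the E-step/M-step refinement concrete. This is a sketch under simplifying assumptions (spherical covariances, unit starting variance, farthest-point initialization), not necessarily what this repository implements:

```python
import numpy as np

def gmm_em(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a spherical Gaussian mixture.
    Returns (means, responsibilities)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialization keeps the starting means apart (assumption)
    idx = [int(rng.integers(n))]
    for _ in range(1, k):
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d2.argmax()))
    means = X[idx].astype(float)
    variances = np.ones(k)             # unit starting variance (assumption)
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point (log domain)
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        log_p = (np.log(weights) - 0.5 * d * np.log(2 * np.pi * variances)
                 - sq / (2 * variances))
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means, and per-component variance
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        variances = (resp * sq).sum(axis=0) / (d * nk)
        ll = log_norm.sum()            # total log-likelihood
        if ll - prev_ll < tol:         # tol declares convergence
            break
        prev_ll = ll
    return means, resp
```

Here `resp[i, j]` is the probability that point `i` was generated by component `j`, which is the probabilistic membership mentioned above; `resp.argmax(axis=1)` recovers hard cluster labels.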
To use the clustering algorithms, clone the repository and install the required dependencies:
```bash
git clone https://github.com/yourusername/unsupervised-ML_Clustering.git
cd unsupervised-ML_Clustering
pip install -r requirements.txt
```