Skip to content

AminaFaragai/Country-Dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Country Clustering Analysis

Introduction

This machine learning project aims to implement the K-means clustering algorithm on the provided country dataset. The Elbow method and the silhouette method were utilized to determine the optimum number of clusters. Both 3 and 4 clusters were explored to compare the model, resulting in similar outcomes.

Problem Statement

The dataset 'countries-data' contains information about developmental indices of numerous countries. Categorizing countries based on socio-economic and health factors can help determine the overall development of each country.

Importing Modules and Reading the Data

The data was imported from 'country-data.csv' for analysis. Information about child mortality, exports, health, imports, income, inflation, life expectancy, total fertility, and GDP per capita of several countries were contained in the file. Clustering the countries will group them based on similar levels of development, allowing identification of the least developed and most developed countries.

Data Visualization

Pairplot was used to visualize the features of the data, showcasing linear relationships and distributions within the dataset. Correlation analysis using a heatmap revealed associations between different features.

Model Building using K-means Clustering Algorithm

The elbow method of K-means clustering was used to cluster all the data, with high correlations observed between import and export of countries, child mortality and life expectancy, health and GDP per capita, income and GDP per capita, and child mortality and total fertility. Distplot and boxplot were used to understand the skewness of variables and identify outliers, respectively.

Handling Outliers

Outliers were identified using boxplots, and the optimal number of clusters for the given data was determined to be 4.

Comparing the Models

Models built using K-means with 3 and 4 clusters produced similar outputs, indicating the model's effectiveness. The top driving factors for clustering were identified as GDP per capita, income, and child mortality.

Conclusion

Low GDP per capita and income imply a high rate of child mortality. Life expectancy in under-developed countries remained consistent across both 3 and 4 clusters. The top 15 under-developed countries were found to be the same in both clusters.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages