S27_MachineLearning_Hard

Problem Statement

Machine learning is now widely used across many fields. In essence, machine learning algorithms use statistical and mathematical models to "learn" from data so that they can make inferences about new data. In supervised learning, each data point has a known label, usually a class label (for classification problems) or a numerical value (for regression problems). Unsupervised learning (such as clustering) uses unlabeled data to find patterns and relationships among features.

Given a dataset, implement k-means clustering to analyze the data. Each data point consists of 2 numerical features. Given this dataset and a number of clusters, k, the solution should converge on a clustering of the data points. After clustering, the program should print how many data points are in each cluster.

Example:

Calling kmeans("datasetfilepath.csv", 4), where the last argument is the number of clusters (k = 4), might produce the following output:

cluster1: 20 data points

cluster2: 30 data points

cluster3: 20 data points

cluster4: 50 data points

Note: The order of the clusters does not matter, as long as the result has two clusters with 20 data points, one with 30, and one with 50. Slight variation is not unusual with random initialization (e.g. 20, 22, 28, 50 instead of 20, 20, 30, 50).
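As a starting point, the task can be sketched in plain Python using Lloyd's algorithm, the usual iterative form of k-means. This is a minimal sketch under stated assumptions, not a reference solution: the problem does not specify the CSV layout, so the code assumes the two features sit in the first two columns and skips any row that does not parse as two numbers (such as a header). The helper names load_points, nearest, and cluster_sizes, as well as the seed and max_iter parameters, are hypothetical and introduced only for illustration.

```python
import csv
import random


def load_points(path):
    """Read (x, y) pairs from a CSV file (assumed layout: first two columns)."""
    points = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            try:
                points.append((float(row[0]), float(row[1])))
            except (ValueError, IndexError):
                continue  # assumption: a header or malformed row, so skip it
    return points


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def cluster_sizes(points, k, seed=None, max_iter=100):
    """Run Lloyd's algorithm and return the number of points in each cluster."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return [labels.count(i) for i in range(k)]


def kmeans(path, k):
    """Cluster the points in the CSV file and print the size of each cluster."""
    sizes = cluster_sizes(load_points(path), k)
    for i, size in enumerate(sizes, start=1):
        print(f"cluster{i}: {size} data points")
```

Because the starting centroids are chosen at random, a single run can settle on a poorer clustering, which is one source of the slight variation mentioned in the note. Running the algorithm several times with different seeds and comparing the results is a common way to reduce that effect.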

