NTUOSS-DataOdyssey

By Han Simeng from NTU Open Source Society

Artwork by Brendan Hyde


Workshop Details
When: Friday, 9 Sep 2018, 6:30 PM - 8:30 PM
Where: LT1, NTU North Spine Plaza
Who: NTU Open Source Society
Questions: We will host a Pigeonhole Live session to collect questions about the workshop

Errors

For errors, typos or suggestions, please do not hesitate to post an issue! Pull requests are very welcome, thank you!

Disclaimer: This workshop is for educational purposes only. No prototype or outcome of any type is intended for commercial use.


Introduction

Machine learning is an interdisciplinary subject where computer science and statistics intersect.
In today's workshop, we will focus on the practical side of machine learning, i.e., coding. With a conventional algorithm, we supply an input and the algorithm returns an output according to rules we wrote ourselves.
With a machine learning algorithm, we first feed in a lot of data and let the algorithm work out for itself how it should respond to that data. This process, called training, determines the parameters of the machine learning model.


In supervised machine learning, we feed both the inputs and their labels into the model, and it learns to predict the output for new inputs. Think of supervised learning as learning with a teacher who tells you the right answers.

In unsupervised machine learning, we feed only the inputs, and the model learns structure, such as groups of similar points, from the inputs alone. Think of unsupervised learning as learning without a teacher. Not all real-world data comes with labels, which is why unsupervised learning is necessary.

The second workshop introduces two machine learning algorithms to demonstrate how the field can be used in real-world scenarios:
logistic regression, a supervised method for classification problems, and k-means clustering, an unsupervised method that groups similar data points into clusters.

We will use scikit-learn, a Python package that implements many machine learning algorithms.
Logistic Regression with scikit-learn
K-Means with scikit-learn

Google Colaboratory

See NTUOSS-PandasBasics for a comprehensive introduction to using Google Colaboratory for data science projects; we will walk through it together.
Copy this notebook to your own drive

Go to this link to download the data used in this workshop and upload it to Google Colaboratory.

Odyssey Begins

  1. Supervised Odyssey: Supervised Classification
  2. Unsupervised Odyssey: Unsupervised Classification
  3. End of journey

Supervised Odyssey: Supervised Classification

Packing up: Environment Setup

Import the logistic regression module from sklearn, along with the plotting packages.


Data Exploration

Use numpy to load the file into an array
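The loading step might look like the sketch below. It assumes the data file is comma-separated text with two score columns followed by a 0/1 label; the in-memory sample stands in for the workshop file, and its values are invented for illustration.

```python
import io
import numpy as np

# Hypothetical stand-in for the workshop data file:
# each row holds two scores and a 0/1 label (values invented for illustration).
sample = io.StringIO(
    "34.6,78.0,0\n"
    "60.2,86.3,1\n"
    "79.0,75.3,1\n"
    "45.1,56.3,0\n"
)

# np.loadtxt accepts a file path or a file-like object;
# delimiter="," splits each line on commas.
data = np.loadtxt(sample, delimiter=",")

X = data[:, :2]  # feature columns (the two scores)
y = data[:, 2]   # label column (assumed: 1 = admitted, 0 = rejected)

print(data.shape)  # (4, 3)
```

With the real file, passing its path to `np.loadtxt` in place of `sample` should work the same way, provided the file uses the assumed layout.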

Inspect more details

Plot all data
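One way to plot the data is a scatter plot with a different marker per class. This is a sketch only: `plot_scores` is a hypothetical helper, the axis labels are assumptions, and the four points are invented stand-ins for the workshop data.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_scores(X, y):
    """Scatter the two score columns, one marker style per class (hypothetical helper)."""
    fig, ax = plt.subplots()
    admitted = y == 1
    ax.scatter(X[admitted, 0], X[admitted, 1], marker="+", label="admitted")
    ax.scatter(X[~admitted, 0], X[~admitted, 1], marker="o", label="rejected")
    ax.set_xlabel("score 1")  # assumed axis names
    ax.set_ylabel("score 2")
    ax.legend()
    return ax

# Invented sample points standing in for the workshop data.
X = np.array([[34.6, 78.0], [60.2, 86.3], [79.0, 75.3], [45.1, 56.3]])
y = np.array([0, 1, 1, 0])

ax = plot_scores(X, y)
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.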


Data Classification: Logistic Regression

Logistic regression is used when the dependent variable (the target) is categorical, i.e., we want to find the class that each sample belongs to. For example, to filter spam, we decide whether each email belongs to the spam class or the normal class.

Algorithm Intuition (online demo)



Coding with sklearn

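The sigmoid function, g(z) = 1 / (1 + e^(-z)), adds non-linearity to the model and maps any real number to a value between 0 and 1.
z is the input to the sigmoid function: the dot product of the input X and the weights w.

The logistic regression predictive function applies the sigmoid to z, so its output can be read as the probability of the positive class.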
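The sigmoid and the predictive function can be sketched in a few lines of numpy. The weights, the separate intercept term `b`, and the sample inputs are made-up values for illustration, not the trained parameters from the workshop.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of the positive class: sigmoid of the dot product plus intercept."""
    z = X @ w + b
    return sigmoid(z)

# Hypothetical parameters and inputs, chosen so the arithmetic is easy to follow.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 2.0],   # z = 2 - 2 = 0
              [3.0, 1.0]])  # z = 3 - 1 = 2

p = predict_proba(X, w, b)
print(p[0])  # 0.5, because sigmoid(0) = 0.5
```

A probability above 0.5 corresponds to z > 0, which is why the set of points with z = 0 forms the decision boundary.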

To run logistic regression with scikit-learn, we first create a LogisticRegression object.
Then we fit the model to the data.
The fitted model's intercept_ and coef_ attributes hold the learned parameters (weights).
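A minimal sketch of the create-then-fit steps follows. The data is synthetic: the rule "admitted when the two scores sum to more than 130" is an assumption made up so that the example is self-contained, and `max_iter=1000` is simply a generous iteration budget for unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the workshop data (assumed rule, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(30, 100, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 130).astype(int)

# Create the model object, then fit it to the data.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# For a binary problem, intercept_ has shape (1,) and coef_ has shape (1, n_features).
print(model.intercept_.shape)  # (1,)
print(model.coef_.shape)       # (1, 2)

accuracy = model.score(X, y)   # fraction of training samples classified correctly
```

Note that scikit-learn applies L2 regularization by default, so the fitted weights may differ from those of an unregularized implementation.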


Result Visualization

After obtaining the parameters, let's visualize the result by plotting the decision boundary.
Students whose scores fall above the decision boundary are predicted to be admitted, while students below it are predicted to be rejected.
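With two features, the decision boundary is the straight line where the model's probability equals 0.5, that is, where intercept + w1*x1 + w2*x2 = 0. The sketch below computes two points on that line from a fitted model; the synthetic data and the admission rule are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumed rule: admitted when the scores sum above 130).
rng = np.random.default_rng(0)
X = rng.uniform(30, 100, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 130).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

b = model.intercept_[0]
w1, w2 = model.coef_[0]

# Solve b + w1*x1 + w2*x2 = 0 for x2 at the left and right edges of the data.
x1_line = np.array([X[:, 0].min(), X[:, 0].max()])
x2_line = -(b + w1 * x1_line) / w2

# Sanity check: points on the boundary should get probability 0.5.
boundary_points = np.column_stack([x1_line, x2_line])
boundary_proba = model.predict_proba(boundary_points)[:, 1]
```

Passing `x1_line` and `x2_line` to a line-plotting call such as matplotlib's `plt.plot` on top of the scatter plot draws the boundary.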

Now let's use our trained logistic regression model to predict whether a student will be accepted or rejected.
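Prediction on new inputs could be sketched as follows. The training data is synthetic (assumed rule: admitted when the two scores sum to more than 130), and the two new students are invented examples placed far from the boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the workshop data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(30, 100, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 130).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Two hypothetical students: one with low scores, one with high scores.
new_students = np.array([[45.0, 50.0],
                         [90.0, 95.0]])

predictions = model.predict(new_students)              # class labels, 0 or 1
probabilities = model.predict_proba(new_students)[:, 1]  # P(admitted) per student
```

`predict` returns the class whose probability is highest, while `predict_proba` exposes the underlying probabilities when a confidence value is more useful than a hard label.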


Unsupervised Odyssey: Unsupervised Classification

Import the image reading module from matplotlib and the K-Means module from sklearn.


Data Exploration

Read the image.
The image is loaded as a three-dimensional array of shape (700, 1000, 3):
700 is the number of rows (the height in pixels).
1000 is the number of columns (the width in pixels).
3 is the number of color channels: the R, G and B values of each pixel.
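The array layout can be checked with a few lines of numpy. A random array stands in for the workshop image here, since the image file itself is not part of this text; with the real file, `matplotlib.image.imread` would return an array with the same kind of layout.

```python
import numpy as np

# Stand-in for the array returned by reading the image
# (random pixel values, same shape as described above).
img = np.random.default_rng(0).random((700, 1000, 3))

rows, cols, channels = img.shape
print(rows, cols, channels)  # 700 1000 3

# Indexing with a row and a column returns that pixel's R, G, B values.
top_left_pixel = img[0, 0]
```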


Image Compression: K-Means

Algorithm Intuition (Online Demo)

K-means is one of the most popular unsupervised clustering algorithms.
"K" in K-means refers to the number of clusters, k.
"Means" refers to the means, or centroids, of the clusters, which the algorithm computes repeatedly.

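To make the intuition concrete, here is a minimal numpy sketch of the standard k-means iteration (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name, the fixed iteration count and the random initialization are simplifying assumptions; scikit-learn's implementation is more elaborate.

```python
import numpy as np

def kmeans(points, k, n_iters=10, seed=0):
    """Plain k-means sketch: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Two well-separated synthetic blobs, around (0, 0) and (10, 10).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
```

On data this cleanly separated, the two centroids should settle near the blob centers within a few iterations.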

Coding with sklearn

Reshape the image into a 2-dimensional array in which each row is one pixel and the three columns are its R, G and B values.
To run the K-means algorithm, we first create a scikit-learn KMeans object with the number of clusters set to 20, which is the number of colors we want in the compressed image. We then fit the model to the data and replace each pixel with the centroid of its cluster to compress the image.
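These steps could be sketched as follows. A small random image stands in for the workshop photo so the example runs quickly, and `n_init=3` and `random_state=0` are arbitrary choices made for reproducibility, not values taken from the workshop notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small random stand-in for the workshop image (assumption, for illustration).
img = np.random.default_rng(0).random((60, 80, 3))

# Flatten to one row per pixel: shape (60 * 80, 3).
X = img.reshape(-1, 3)

# 20 clusters means 20 colors in the compressed image.
kmeans = KMeans(n_clusters=20, n_init=3, random_state=0)
kmeans.fit(X)

# Replace every pixel with the centroid (mean color) of its cluster.
X_recovered = kmeans.cluster_centers_[kmeans.labels_]
```

The compression comes from storing only the 20 centroid colors plus one small cluster index per pixel, instead of three full color values per pixel.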


Data Visualization

Reshape X_recovered so that it has the same dimensions as the original image.
Now we can plot the original and the compressed images side by side.
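A sketch of the reshape and the side-by-side plot is shown below. As before, a small random image replaces the workshop photo, and the panel titles and figure size are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Small random stand-in image, compressed to 20 colors with k-means.
img = np.random.default_rng(0).random((40, 50, 3))
X = img.reshape(-1, 3)
kmeans = KMeans(n_clusters=20, n_init=3, random_state=0).fit(X)
X_recovered = kmeans.cluster_centers_[kmeans.labels_]

# Restore the (rows, columns, channels) layout of the original image.
img_recovered = X_recovered.reshape(img.shape)

# Show the original and the compressed image next to each other.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(img)
axes[0].set_title("Original")
axes[1].imshow(img_recovered)
axes[1].set_title("Compressed (20 colors)")
for ax in axes:
    ax.axis("off")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.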


End of Odyssey!

End of Journey

Congratulations on completing the Machine Learning Odyssey!
In this workshop we have learned how to use machine learning algorithms to solve some simple real-world problems.
In the next workshop, which is also the last in the NTUOSS Data Science workshop series, we will teach you deep learning, a subfield of machine learning that is even more interesting!


An approachable book if you want to learn more: A Course in Machine Learning

About

Repository for workshop on Practical Machine Learning
