Week 3 | Machine Learning

Introduction

Machine Learning hardly needs an introduction today: millions of people research in this field, and every other day there's a new state-of-the-art technique. At its core, Machine Learning is about automating and improving a computer's learning process from experience, without the logic being explicitly programmed, i.e. without human assistance.

In traditional programming, we feed in Data (Input) + Program (Logic), run it on the machine, and get the Output.

In Machine Learning, we feed in Data (Input) + Output, run it on the machine during training, and the machine creates its own Program (Logic), which can then be evaluated during testing.
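As a toy illustration of this idea, the sketch below (made-up data, plain NumPy) "trains" on inputs and outputs and lets the machine recover the logic itself, here a line's slope and intercept:

```python
import numpy as np

# Hypothetical training data: inputs with known outputs (roughly y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])

# "Training": fit a line by least squares, i.e. the machine derives
# its own logic (slope and intercept) from Data + Output.
A = np.vstack([x, np.ones_like(x)]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# The learned "program" maps new inputs to outputs.
def predict(new_x):
    return slope * new_x + intercept

print(round(predict(5.0), 1))  # close to 11, as y = 2x + 1 would give
```

The slope and intercept were never written into the code; they were learned from the data, which is the whole distinction drawn above.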

Excited? Now go on and begin your journey into this vast and fast-moving field of Computer Science here.

Resources

  • This is for those who have some coding experience but have never done Machine Learning before. If your concepts of Python, or programming in general, feel shaky, first complete our tutorial on Python. Even if you have completed the renowned Andrew Ng Machine Learning course, go through this article, because here every algorithm is implemented in Python instead of MATLAB.

  • Only after you have gone through and implemented the algorithms in the above article should you continue with this one. It introduces you to all the important concepts and applications of Neural Networks.

Tasks

1. Stanford CS231n Assignments

Stanford runs an amazing course CS231n: Convolutional Neural Networks for Visual Recognition whose assignments serve as a perfect way to practice and strengthen your concepts.

  • The First Assignment makes you implement kNN, SVM, Softmax, and a simple Neural Network without any ML libraries.

  • The Second Assignment helps you get acquainted with Backpropagation, Batch Normalisation, Dropout, CNNs, and deep learning frameworks.

  • The Third Assignment is where you'll implement RNNs, LSTMs, and GANs.

You should definitely check out their excellent Notes and Video Lectures if you get stuck somewhere or have difficulty understanding a particular concept.
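For a taste of what the first assignment involves, here is a minimal kNN classifier written with plain NumPy on made-up data (a sketch of the technique, not the assignment's actual starter code):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest
    training points (Euclidean distance), with no ML libraries."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train), via broadcasting.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for each test point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Majority vote over the neighbours' labels.
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Two well-separated clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.2, 0.1], [5.1, 5.1]]), k=3))
# prints [0 1]
```

The assignment versions add vectorised distance tricks and cross-validation over k, but the core idea is exactly this.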

2. Document Classification

Document classification is an example of Machine Learning (ML) applied to Natural Language Processing (NLP). It is especially useful for publishers, news sites, blogs, or anyone who deals with a lot of content. In this assignment, we will implement different clustering and classification algorithms to categorise documents from the real-world BBC Dataset.

Data Preprocessing:

  • Download the BBC Dataset [~ 5 MB] which consists of 2225 documents from the BBC news website corresponding to stories in 5 topical areas (business, entertainment, politics, sport, tech) from 2004-2005.

  • Write a function that reads all the *.txt files present in each of the 5 topical folders, normalises the text of each document, & creates a dataframe (use pandas) with headers similar to:

    sr_no | doc_text                            | class
    1     | Ad sales boost time ... that stake. | business
    ...   | ...                                 | ...
  • Now, we can create feature vectors for each of these documents & append them as corresponding columns to the above dataframe. Try to experiment with the following models to create the feature vectors:

      • Bag of Words (BoW) counts
      • TF-IDF weights

    You can choose to implement these from scratch or use existing implementations from sklearn.

    The dataframe should now look like:

    sr_no | doc_text                           | class    | bow_vectors    | tfidf_vectors
    1     | ad sales boost time ... that stake | business | {"ad": 1, ...} | {"ad": 1, ...}
    ...   | ...                                | ...      | ...            | ...
  • Shuffle the rows of this dataframe & split it into a training, validation & test set. You could choose splits such as 70 : 10 : 20 [training : validation : test]
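The preprocessing steps above could be sketched as follows. The folder names, file contents, and helper name `read_bbc` here are illustrative stand-ins for the real dataset; the snippet builds its own tiny corpus in a temporary directory so it runs end-to-end:

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

def read_bbc(root):
    """Read every *.txt file under root/<class>/ into a dataframe,
    lower-casing the text as a simple normalisation step."""
    rows = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".txt"):
                with open(os.path.join(folder, fname), encoding="latin-1") as fh:
                    rows.append({"doc_text": fh.read().lower(), "class": label})
    df = pd.DataFrame(rows)
    df.insert(0, "sr_no", range(1, len(df) + 1))
    return df

# Build a tiny stand-in corpus with the same folder layout as the BBC dataset.
root = tempfile.mkdtemp()
corpus = {
    "business": ["Ad sales boost profit", "Shares rise on profit news",
                 "Firm cuts costs again", "Market gains lift shares",
                 "Bank profits beat forecast"],
    "sport": ["Team wins cup final", "Striker scores twice",
              "Coach praises the squad", "Injury rules out captain",
              "Fans cheer late winner"],
}
for label, texts in corpus.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts):
        with open(os.path.join(root, label, f"{i:03d}.txt"), "w") as fh:
            fh.write(text)

df = read_bbc(root)

# Feature vectors: Bag of Words counts and TF-IDF weights as extra columns.
df["bow_vectors"] = list(CountVectorizer().fit_transform(df["doc_text"]).toarray())
df["tfidf_vectors"] = list(TfidfVectorizer().fit_transform(df["doc_text"]).toarray())

# Shuffle and split 70 : 10 : 20 into training / validation / test sets.
train_df, rest = train_test_split(df, test_size=0.3, random_state=0, shuffle=True)
val_df, test_df = train_test_split(rest, test_size=2/3, random_state=0)
print(len(df), len(train_df), len(val_df), len(test_df))  # 10 7 1 2
```

On the real dataset, only the corpus-building block changes: point `read_bbc` at the extracted BBC folder instead.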

Training the Classifier

You can now implement the following algorithms for the document classification task:

  • K-Means Clustering
  • K-Nearest Neighbours (KNN), which, unlike the other two, is a supervised classification algorithm
  • Gaussian Mixture Model (GMM)

You can try to work with different distance formulations like Cosine Distance, Euclidean Distance, Manhattan Distance, Chebyshev Distance, etc.

You can use techniques such as K-Fold Cross-Validation to check for over-fitting of the model.

Once trained, test your model on the 'test' split.
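A minimal sketch of the training step, using sklearn on a hypothetical mini-corpus in place of the real BBC data: unsupervised K-Means alongside a KNN classifier with cosine distance, scored by cross-validation as suggested above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical mini-corpus standing in for the BBC documents.
docs = ["shares rise on strong profit", "profit boost lifts shares",
        "firm shares gain as profit grows", "team wins the cup final",
        "team striker scores in final", "final whistle delights team"]
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = business, 1 = sport
X = TfidfVectorizer().fit_transform(docs)

# Unsupervised: K-Means groups the documents without seeing the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Supervised: KNN with cosine distance, scored by 3-fold cross-validation
# to check for over-fitting.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
scores = cross_val_score(knn, X, labels, cv=3)
print(km.labels_, scores.mean())
```

Swapping `metric="cosine"` for `"euclidean"`, `"manhattan"`, or `"chebyshev"` is how you would try the different distance formulations mentioned above.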

3. Kaggle Contests

Kaggle is a platform for predictive modelling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.

The following is a list of some contests that you can take part in by creating ML/DL models:

You can try any preprocessing methods, algorithms & ensembles for these challenges. Deep learning architectures such as CNNs, Autoencoders & RNNs could come in handy while attempting them.

4. Denoising an image

Go to denoising-task to find the problem statement and relevant data. You don’t need to know about Markov Random Fields (MRF) priors for attempting this task. The following information is sufficient, though slides for MRF priors and image denoising are present if you wish to learn more:

For each choice of the function g():

  • Minimise the following energy (by gradient descent) to get the denoised image:

    E(x) = Σᵢ { a·(yᵢ − xᵢ)² + g(xᵢ − xᵢ₁) + g(xᵢ − xᵢ₂) + g(xᵢ − xᵢ₃) + g(xᵢ − xᵢ₄) }

    where i₁, i₂, i₃, i₄ are the 4 neighbouring pixels of i, y is the noisy image, and x is the denoised image.

  • The role of g() is edge preservation: neighbouring pixel values shouldn't differ by much, and should differ significantly only at edges. The (yᵢ − xᵢ)² term performs noise removal, while a weighs noise removal against edge preservation.
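One way to carry out this minimisation is sketched below, assuming a simple quadratic g(u) = u² and periodic image boundaries (both simplifying choices for illustration); the actual task lets you plug in other g() functions:

```python
import numpy as np

def g(u):
    """Quadratic smoothness penalty (one simple choice of g)."""
    return u ** 2

def g_prime(u):
    return 2 * u

SHIFTS = [(0, 1), (0, -1), (1, 1), (1, -1)]  # 4-neighbourhood via array rolls

def energy(x, y, a):
    """E(x) = sum_i [ a*(y_i - x_i)^2 + sum of g over the 4 neighbours ].
    np.roll wraps around, i.e. periodic boundaries (a simplification)."""
    e = a * np.sum((y - x) ** 2)
    for axis, shift in SHIFTS:
        e += np.sum(g(x - np.roll(x, shift, axis=axis)))
    return e

def denoise(y, a=1.0, lr=0.05, steps=200):
    """Gradient descent on E. Each neighbour pair appears twice in the
    sum over i, hence the factor 2 on the smoothness part of the gradient."""
    x = y.copy()
    for _ in range(steps):
        grad = 2 * a * (x - y)
        for axis, shift in SHIFTS:
            grad += 2 * g_prime(x - np.roll(x, shift, axis=axis))
        x -= lr * grad
    return x

rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
clean[4:12, 4:12] = 1.0                     # toy image with sharp edges
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
denoised = denoise(noisy, a=1.0)
print(energy(denoised, noisy, 1.0) < energy(noisy, noisy, 1.0))  # True
```

With the quadratic g every difference is penalised equally, so edges blur; edge-preserving choices of g grow more slowly for large differences, which is exactly the trade-off the task asks you to explore.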