Project 2: Kushagra #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 18 commits into `master`
162 changes: 156 additions & 6 deletions Project2-Character-Recognition/README.md
@@ -3,12 +3,162 @@ CUDA Character Recognition

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Author: Kushagra
  * [LinkedIn](https://www.linkedin.com/in/kushagragoel/)
* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 16GB, NVIDIA Quadro P1000 @ 4GB (Moore 100B Lab)

____________________________________________________________________________________

## Breaking the Ice
![](img/PrettyOutput.jpg)

## Table of Contents
1. [Introduction](#intro)
2. [What is a Multi Layer Perceptron](#mlp)
   1. [Forward Pass](#forward)
   2. [Backward Pass](#backward)
3. [Implementation](#impl)
4. [Performance Analysis](#perform)
5. [Humble Brag](#brag)
6. [References](#ref)


<a name = "intro"/>

## Introduction
In this project, we build a generic multi-layer perceptron from scratch in CUDA. We then train it on three different datasets:
* MNIST
* A custom dataset of upper- and lower-case letters
* XOR

<a name = "mlp"/>

## What is a Multi Layer Perceptron
To understand what a multi-layer perceptron is, we start by looking at a single perceptron. The following images give an idea of how a perceptron functions and the biological motivation behind it:

Neuron | Perceptron
:-------------------------:|:-------------------------:
![](img/neuron.png) | ![](img/neuron_model.jpeg)
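
In equation form, a single perceptron with inputs $x_i$, weights $w_i$, bias $b$, and activation $f$ computes:

$$y = f\Big(\sum_i w_i x_i + b\Big)$$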

And with several such perceptrons stacked into layers, we get a multi-layer perceptron, as depicted here:
![](img/MNISTmlp.png)

<a name = "forward"/>

### Forward Pass
A forward pass means going from inputs to outputs in an MLP: we compute the values at each intermediate layer and multiply them by that layer's weights to produce the inputs for the next layer.
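
To make the idea concrete, here is a minimal CPU-side C++ sketch of a forward pass over a stack of fully connected layers. This is illustrative only, not the project's CUDA code; the row-major weight layout and the helper names are assumptions made for the example.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Multiply a 1 x inDim input row by an inDim x outDim weight matrix (row-major).
std::vector<float> linear(const std::vector<float>& in,
                          const std::vector<float>& W, int inDim, int outDim) {
    std::vector<float> out(outDim, 0.0f);
    for (int j = 0; j < outDim; ++j)
        for (int i = 0; i < inDim; ++i)
            out[j] += in[i] * W[i * outDim + j];
    return out;
}

// dims holds the layer sizes (input, hidden..., output); weights[l] is dims[l] x dims[l+1].
std::vector<float> forwardPass(std::vector<float> x,
                               const std::vector<std::vector<float>>& weights,
                               const std::vector<int>& dims) {
    for (std::size_t l = 0; l + 1 < dims.size(); ++l) {
        x = linear(x, weights[l], dims[l], dims[l + 1]);
        if (l + 2 < dims.size())                         // hidden layers use ReLU;
            for (float& v : x) v = std::max(v, 0.0f);    // the last layer would use softmax instead
    }
    return x;
}
```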


<a name = "backward"/>

### Backward Pass
In this step, we calculate how our trainable parameters affect the loss and adjust them accordingly. This is more popularly known as backpropagation and is essentially an application of the chain rule from calculus.

A depiction of the forward and backward passes:
![](img/partial_derivative_notations.png)
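
For reference, once the gradient of the loss with respect to each weight matrix is known, the update in plain gradient descent (with learning rate $\eta$, i.e. the `learningRate` argument of `backward`) is:

$$W \leftarrow W - \eta \, \frac{\partial L}{\partial W}$$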

<a name = "impl"/>

## Implementation
On to the fun stuff. I implemented this project trying to keep the MLP architecture as generic as possible, while also hiding as much implementation detail from the user as possible. Thanks to this design, we can have:
* A variable number of hidden layers
* Variable sizes of hidden layers
* Variable batch sizes for faster training

The following class definition shows how this is done.
### MultiLayerPerceptron
```cpp
class MultiLayerPerceptron {
    std::vector<FullyConnectedLayer*> layers;
    int batchDim;

public:
    MultiLayerPerceptron(int inputDim, int numHiddenLayers, int *hiddenDim, int outputDim, int batchDim);
    void forward(float *input, float *output, bool test = false);
    void backward(float *output, float *predicted, float learningRate);
    float loss(float *label, float *predicted);
};
```
We see here that the MultiLayerPerceptron holds a vector of FullyConnectedLayers, which are instantiated using the hiddenDim array. The forward and backward methods perform the operations described above: the MultiLayerPerceptron takes input from the user, iterates through the layers, calls their respective forward and backward methods, and finally returns the prediction.
As long as a class extends FullyConnectedLayer and implements the forward and backward methods, we can add it to our MultiLayerPerceptron.
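
For illustration, here is a hypothetical usage sketch in the PyTorch-ish spirit mentioned later in this README. The dimensions, buffer names, and hyper-parameters are made up for the example and are not taken from the project's actual test code.

```cpp
#include <vector>

// Hypothetical training sketch; batchInput/batchLabels are assumed to already
// hold one batch of flattened images and one-hot labels.
void trainSketch(float *batchInput, float *batchLabels) {
    const int inputDim = 784, outputDim = 10, batchDim = 32, numHidden = 2;
    int hiddenDims[numHidden] = { 128, 64 };          // 784 -> 128 -> 64 -> 10

    MultiLayerPerceptron mlp(inputDim, numHidden, hiddenDims, outputDim, batchDim);

    std::vector<float> predicted(batchDim * outputDim);
    for (int epoch = 0; epoch < 100; ++epoch) {
        mlp.forward(batchInput, predicted.data());           // forward pass over the batch
        float l = mlp.loss(batchLabels, predicted.data());   // loss, e.g. logged per epoch
        mlp.backward(batchLabels, predicted.data(), 0.01f);  // assuming `output` is the ground truth
    }
}
```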

### FullyConnectedLayer
```cpp
class FullyConnectedLayer {
    float *weight = NULL;
    float *inputs = NULL;
    int inputDim;
    int batchDim;
    int outputDim;
    bool lastLayer;

public:
    FullyConnectedLayer(int inputDim, int outputDim, int batchDim, bool lastLayer);
    void forward(float *input, float *output, bool test = false);
    void backward(float learningRate, float *incomingGradient, float *outgoingGradient);
    int getInputDim();
    int getOutputDim();
};
```
This class represents a single hidden layer in the multi-layer perceptron. Its forward and backward methods contain the core logic that computes the hidden values and, ultimately, the output.
Setting lastLayer to true tells the layer to use softmax so that it outputs probabilities; otherwise each layer uses ReLU as its activation.
One thing that is absent here is biases (although the first layer can handle biases if we append a 1 to each sample). In our experiments, biases in the input layer were sufficient to give good results.
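
To give a flavour of what the layer's forward step might launch on the GPU, here is a minimal CUDA kernel sketch. This is an illustration under assumed conventions (row-major buffers, one thread per output element), not the project's actual kernel.

```cuda
// Sketch: output = activation(input x weight) for a batch,
// one thread per (sample, output-unit) pair; row-major layout assumed.
__global__ void kernFullyConnectedForward(int batchDim, int inputDim, int outputDim,
                                          const float *input, const float *weight,
                                          float *output, bool lastLayer) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batchDim * outputDim) return;

    int sample = idx / outputDim;   // which row of the batch
    int j      = idx % outputDim;   // which output unit

    float acc = 0.0f;
    for (int i = 0; i < inputDim; ++i)
        acc += input[sample * inputDim + i] * weight[i * outputDim + j];

    // Hidden layers apply ReLU here; the last layer's softmax needs a
    // row-wise normalization, so it would be handled in a separate step.
    output[idx] = lastLayer ? acc : fmaxf(acc, 0.0f);
}
```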

### Isn't this too complicated?
Yes, but actually no. We calculate the gradients of each layer in a clever and efficient fashion.
The magic is in the FullyConnectedLayer implementation: each layer receives a partial gradient from the next layer, which it uses to calculate the gradients for its own weights as well as the information it should pass along to the previous layer. The following image gives a better idea of how this works:

<img src="img/Autograd.jpg" alt="PyTorch Autograd" width="600"/>

And if someone is curious about the maths behind it, don't worry, I've got you covered. With a little bit of maths, it's not hard to see that the current layer receives the derivative of the loss w.r.t. its output, and that the derivative of the loss w.r.t. the current layer's input is what needs to be passed to the previous layer.

<img src="img/EquationsPart1.jpg" alt="NumBoidsVsFPS" width="600"/>
<img src="img/EquationsPart2.jpg" alt="If you can read this, you must be a genius or a doctor" width="600"/>

<a name = "perform"/>

## Performance Analysis
Let's look at how our implementation performs on the datasets:

### XOR

<img src="charts/XORLoss.jpg" width="600"/>

### Custom Dataset

<img src="charts/CharacterRecognitionLoss.jpg" width="600"/>

### MNIST

<img src="charts/MNISTLoss.jpg" width="600"/>


The loss decreases smoothly on all three datasets, which indicates that the implementation trains correctly.

### Observations
* XOR would not train when the hidden layer had fewer than 5 units. With 5 or more, the loss keeps decreasing for as long as we train, which is expected: the softmax keeps pushing the probabilities towards 1 without ever actually reaching it.
* Character recognition reaches 100% accuracy with very little training and just 1 hidden layer. This means it is overfitting badly, which is expected since we have only one data point per class. We can revisit this later and add some kind of regularization/penalty so the MLP doesn't simply memorize the input.
* To check whether our MLP actually learns anything, we tried an alternate dataset, MNIST. Here we use 2 hidden layers with fewer hidden units, since more hidden units would exhaust the GPU's memory. This is because we are doing batch gradient descent and storing the inputs of all layers, which leads to significant memory consumption (a rough estimate follows below).
* Another interesting observation on MNIST: just adding biases to the input layer produced a huge improvement, to the point that the network actually started predicting correct values instead of giving the same answer for all inputs.
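
As a rough back-of-the-envelope illustration of that memory pressure (the numbers here are assumed for the example, not measured): batching all 60,000 MNIST images through a hidden layer of 1,000 units means storing a $60000 \times 1000$ activation matrix in float32, i.e.

$$60000 \times 1000 \times 4\ \text{B} \approx 240\ \text{MB}$$

per layer, before counting weights, gradient buffers, and the other layers' stored inputs, which adds up quickly on a 4 GB card.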


<a name = "brag"/>

## Humble Brag
* As Archimedes once said, give me enough CUDA cores and I will meet you at the global minimum: you can have as many layers as you want, each with as many hidden units as you want. (Almost) nothing is hard-coded.
* The interface of the multi-layer perceptron follows PyTorch's style and is therefore highly intuitive to anyone with PyTorch experience.
* Batch gradient descent: you don't have to run the MLP again and again for different inputs; just batch them up, run the MLP once with the right parameters, and let the implementation handle the un-fun stuff for you. On a serious note, batch gradient descent with generic classes introduced several complications about how to calculate and propagate the gradients, which are now handled gracefully by the Autograd style of backpropagation.
* MNIST: the network successfully learns the correct labels for hand-drawn digits, how cool is that!!
* Not giving up even when Windows did :)
<img src="img/bsod.gif" width="300"/>



<a name = "ref"/>

## References
* http://cs231n.github.io/neural-networks-1/
* https://corochann.com/mnist-training-with-multi-layer-perceptron-1149.html
* https://www.ritchievink.com/blog/2017/07/10/programming-a-neural-network-from-scratch/
* https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec
* http://cs231n.stanford.edu/handouts/linear-backprop.pdf
100 changes: 100 additions & 0 deletions Project2-Character-Recognition/bookKeeping/characterLosses.csv
@@ -0,0 +1,100 @@
3.95124
3.95123
3.95121
3.95118
3.95112
3.95099
3.95074
3.9502
3.949
3.9463
3.94008
3.92563
3.89247
3.82118
3.69507
3.53697
3.35283
3.1358
2.88086
2.6352
2.59458
2.14501
1.87281
1.86736
1.61598
1.20534
0.901629
0.793453
0.737934
0.587051
0.432607
0.351006
0.356106
0.337291
0.317566
0.246129
0.213596
0.173226
0.128
0.107839
0.0979261
0.075282
0.0709963
0.0601452
0.0587817
0.0496783
0.0470992
0.041245
0.0382672
0.0347674
0.0324327
0.0304023
0.0287442
0.0273021
0.0259963
0.0247993
0.0237088
0.0226944
0.0217569
0.0208874
0.0200767
0.0193191
0.0186157
0.0179514
0.0173283
0.0167464
0.0161945
0.0156764
0.0151852
0.0147219
0.0142862
0.0138674
0.0134729
0.0131006
0.0127408
0.0124031
0.0120795
0.0117706
0.0114763
0.011194
0.0109258
0.0106667
0.01042
0.0101831
0.0099549
0.00973787
0.00952755
0.00932477
0.00913258
0.00894522
0.0087638
0.00859288
0.00842283
0.00826305
0.00810679
0.0079552
0.00781064
0.00766904
0.00753303
0.00740097
98 changes: 98 additions & 0 deletions Project2-Character-Recognition/bookKeeping/mnistLosses.csv
@@ -0,0 +1,98 @@
2.30258
2.30182
2.29723
2.25676
2.10882
2.13333
1.85369
1.54322
1.91046
2.24429
1.99083
1.98546
2.77747
2.51213
2.86779
3.02199
3.08375
2.58157
2.34638
2.1385
2.00144
1.34227
1.22468
1.18058
1.08096
1.03805
0.990021
0.969472
0.959191
0.960439
0.937562
0.978949
0.978516
0.932206
0.867721
0.856745
0.852589
0.847493
0.844346
0.836677
0.823922
1.38576
1.65748
1.18833
1.14968
0.985466
0.876304
0.85324
0.843678
0.838025
0.829976
0.825726
0.82305
0.818249
0.815175
0.813122
0.81156
0.810342
0.80937
0.808547
0.807909
0.807415
0.807029
0.806829
0.807355
0.807097
0.808126
0.805296
0.804825
0.80313
0.802475
0.801791
0.801563
0.801426
0.800862
0.785671
0.774506
0.768242
0.757633
0.75637
0.755529
0.754989
0.754938
0.754965
0.754398
0.754042
0.753789
0.753555
0.753346
0.753154
1.67381
0.825814
0.790645
0.780903
0.77252
0.768511
0.764477
0.761343