Skip to content

Project 2: Srinath Rajagopalan #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
106 changes: 99 additions & 7 deletions Project2-Character-Recognition/README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,106 @@
CUDA Character Recognition
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 2 - Character Recognition**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Srinath Rajagopalan
* [LinkedIn](https://www.linkedin.com/in/srinath-rajagopalan-07a43155), [twitter](https://twitter.com/srinath132)
* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Nvidia Quadro P1000 4GB (Moore 100B Lab)

### (TODO: Your README)
### Neural Networks in CUDA for Character Recognition


![](data-set/01out.bmp) ![](data-set/07out.bmp) ![](data-set/37out.bmp) ![](data-set/52out.bmp)

In this we project, I have implemented a feedforward neural network, from scratch, in C++ and CUDA to make efficient use of the GPU to parallelize the matrix operations.

The network is trained to achieve a 100% accuracy on the provide character recognition dataset of 52 images with a resolution of 101x101. This network will NOT generalize to unseen test samples and that is not the focus of this project. The focus is on the performance and scalability comparisons and appreciating the role of the GPU.

The network architecture is fixed to have one hidden layer but one can configure the image size, number of images, and the number of hidden neurons.

## Computation Graph

The setup is a 2-layer neural network with the following configuration
* Input - `X` of dimension `52 x 10201`
* Hidden - `X2` of dimension `52 x 10` with `ReLU` non-linearity and `10` hidden neurons
* Output - `probs` of dimension `52x52` predicted probability for each sample over 52 classes

In effectively 2 matrix multiply operations, we compute the predictions for ALL `N` examples at the same time. It is helpful to think of the operations being performed on the matrices themselves and utilizing matrix differentiation to calculate the gradients for the backward pass.

The following Python-Numpy, borrowed from [CS231n: CNNs for Visual Recognition](https://cs231n.github.io/neural-networks-case-study/), illustrates how we can vectorize the forward pass and backward pass computation.

```
### Forward Pass
# evaluate class scores with a 2-layer Neural Network
hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
scores = np.dot(hidden_layer, W2) + b2
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

### Evaluate Loss
correct_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(correct_logprobs)/num_examples


### Backward Pass
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples # derivative wrt output layer

dW2 = np.dot(hidden_layer.T, dscores) # derivative wrt W2
db2 = np.sum(dscores, axis=0, keepdims=True)

dhidden = np.dot(dscores, W2.T) # derivative wrt hidden layer
dhidden[hidden_layer <= 0] = 0 # derivative through ReLU

dW = np.dot(X.T, dhidden) # derivative wrt W
db = np.sum(dhidden, axis=0, keepdims=True)
```

The above code is treated as an API one can adhere to. It's easy to see how easily all the above operations can be individually parallelized. To achieve this the following CUDA Kernels have been implemented

* Matrix multiply on any two matrices of any size
* Slicing operations to fetch the scores of the correct class for each example
* ReLU activation
* Softmax activation
* Matrix addition
* Element-wise matrix multiplication
* Filter to zero-out negative elements in a matrix

## Training Curves

The training curve is plotted for the toy XOR operation and for the character recognition dataset.

![](data/training_curve_xor.png)

![](data/training_curve.png)


Predictions on character-recognition is given below

![](data/final_predictions.png)

## Performance Analysis on Memorizing Random Images

With 52 examples, and enough number of hidden neurons, we can achieve 100% accuracy on random images (101x101 resolution)!

![](data/white_noise.png)

The performance analysis is based on generating random images of different resolutions (changing `d`) and different number of hidden neurons (changing `h`). Resolution could be scaled to a maximum of 1 million pixels before running out of memory. Number of hidden neurons could be scaled to a maximum of 10K neurons before running out of memeory.

![](data/perf_image.png)

![](data/perf_neurons.png)


As we scale to bigger models and higher resolution images, memory is definitely a main bottleneck. Fitting the enitre dataset into the GPU global memory will be impossible. So we have to work the CPU hard disk. From a computatiton perspective, there can be several improvements

1) Efficient matrix-multiply operations by the tiling approach using shared memory. As of now, I am a doing a naive matrix multiplication which is doing many redundant global memory access.

2) Loss calculation can be parallelized by using parallel-reduction written in stream-compaction. The tricky part is to tweak the implementation to work access the index in a 2D view.

## Extra Credit
The forward and backward pass is computed for all `N` exmaples simultaneously in one-shot using matrix-multiply and slicing operations.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Original file line number Diff line number Diff line change
Expand Up @@ -7,5 +7,5 @@ set(SOURCE_FILES

cuda_add_library(character_recognition
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_61
)
Loading