diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..6351544 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -1,14 +1,106 @@
 CUDA Character Recognition
 ======================
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2 - Character Recognition**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Srinath Rajagopalan
+  * [LinkedIn](https://www.linkedin.com/in/srinath-rajagopalan-07a43155), [twitter](https://twitter.com/srinath132)
+* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Nvidia Quadro P1000 4GB (Moore 100B Lab)
 
-### (TODO: Your README)
+### Neural Networks in CUDA for Character Recognition
+
+![](data-set/01out.bmp) ![](data-set/07out.bmp) ![](data-set/37out.bmp) ![](data-set/52out.bmp)
+
+In this project, I have implemented a feedforward neural network, from scratch, in C++ and CUDA, making efficient use of the GPU to parallelize the matrix operations.
+
+The network is trained to 100% accuracy on the provided character-recognition dataset of 52 images at a resolution of 101x101. This network will NOT generalize to unseen test samples, and that is not the focus of this project. The focus is on performance and scalability comparisons, and on appreciating the role of the GPU.
+
+The network architecture is fixed at one hidden layer, but the image size, the number of images, and the number of hidden neurons are all configurable.
+
+## Computation Graph
+
+The setup is a 2-layer neural network with the following configuration:
+* Input - `X` of dimension `52 x 10201`
+* Hidden - `X2` of dimension `52 x 10`, with `ReLU` non-linearity and `10` hidden neurons
+* Output - `probs` of dimension `52 x 52`, the predicted probability for each sample over the 52 classes
+
+In effectively 2 matrix-multiply operations, we compute the predictions for ALL `N` examples at the same time. It is helpful to think of the operations as acting on the matrices themselves, and to use matrix differentiation to calculate the gradients for the backward pass.
+
+The following Python/NumPy code, borrowed from [CS231n: CNNs for Visual Recognition](https://cs231n.github.io/neural-networks-case-study/), illustrates how we can vectorize the forward-pass and backward-pass computation.
+
+```
+### Forward Pass
+# evaluate class scores with a 2-layer Neural Network
+hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
+scores = np.dot(hidden_layer, W2) + b2
+exp_scores = np.exp(scores)
+probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
+
+### Evaluate Loss
+correct_logprobs = -np.log(probs[range(num_examples),y])
+data_loss = np.sum(correct_logprobs)/num_examples
+
+
+### Backward Pass
+dscores = probs
+dscores[range(num_examples),y] -= 1
+dscores /= num_examples # derivative wrt output layer
+
+dW2 = np.dot(hidden_layer.T, dscores) # derivative wrt W2
+db2 = np.sum(dscores, axis=0, keepdims=True)
+
+dhidden = np.dot(dscores, W2.T) # derivative wrt hidden layer
+dhidden[hidden_layer <= 0] = 0 # derivative through ReLU
+
+dW = np.dot(X.T, dhidden) # derivative wrt W
+db = np.sum(dhidden, axis=0, keepdims=True)
+```
+
+The above code is treated as an API to adhere to, and all of the above operations can be individually parallelized. To achieve this, the following CUDA kernels have been implemented:
+
+* Matrix multiplication for two matrices of any size
+* Slicing operations to fetch the score of the correct class for each example
+* ReLU activation
+* Softmax activation
+* Matrix addition
+* Element-wise matrix multiplication
+* Filter to zero out negative elements in a matrix
+
+## Training Curves
+
+The training curve is plotted for the toy XOR problem and for the character-recognition dataset.
+
+![](data/training_curve_xor.png)
+
+![](data/training_curve.png)
+
+
+Predictions on the character-recognition dataset are given below:
+
+![](data/final_predictions.png)
+
+## Performance Analysis on Memorizing Random Images
+
+With 52 examples and a large enough number of hidden neurons, we can achieve 100% accuracy on random images (101x101 resolution)!
+
+![](data/white_noise.png)
+
+The performance analysis is based on generating random images at different resolutions (changing `d`) and with different numbers of hidden neurons (changing `h`). Resolution could be scaled up to a maximum of 1 million pixels before running out of memory; the number of hidden neurons could be scaled up to a maximum of 10K before running out of memory.
+
+![](data/perf_image.png)
+
+![](data/perf_neurons.png)
+
+
+As we scale to bigger models and higher-resolution images, memory is definitely the main bottleneck. Fitting the entire dataset into GPU global memory becomes impossible, so we would have to stream data from host memory and disk. From a computation perspective, there are several possible improvements (hedged sketches of both follow the list):
+
+1) Efficient matrix multiplication via the tiling approach using shared memory. As of now, I am doing a naive matrix multiplication, which performs many redundant global-memory accesses.
+2) The loss calculation can be parallelized using a parallel reduction, as written for the stream-compaction part of this project. The tricky part is to tweak the implementation to access the data through 2D indices.
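For improvement (1), a minimal sketch of a tiled kernel, assuming the same row-major layout and `m x n` times `n x k` convention as the naive `kernMatrixMultiply` in `mlp.cu` below (the kernel name and the `TILE` constant are illustrative, not part of this commit):

```
#define TILE 16

// Each block computes a TILE x TILE patch of C = A * B by staging tiles of A and B
// in shared memory, so each global element is read once per tile pass instead of
// once per output element.
__global__ void kernTiledMatrixMultiply(const float *A, const float *B, float *C,
                                        int m, int n, int k) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float sum = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; t++) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // guard the loads so sizes that are not multiples of TILE still work
        tileA[threadIdx.y][threadIdx.x] = (row < m && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < k) ? B[bRow * k + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE; i++)
            sum += tileA[threadIdx.y][i] * tileB[i][threadIdx.x];
        __syncthreads();
    }

    if (row < m && col < k)
        C[row * k + col] = sum;
}
```

For improvement (2), the same idea also covers the two `TODO: Should be parallelized` host loops in `mlp.cu` (`softmaxNormalize` and `calculateLoss`), since both reduce over rows. A hedged sketch, assuming one block per row and a power-of-two block size (`kernRowSum` and its launch configuration are illustrative):

```
// One block per row: reduces the n entries of row blockIdx.x into rowSums[blockIdx.x].
__global__ void kernRowSum(const float *A, float *rowSums, int n) {
    extern __shared__ float partial[];
    int row = blockIdx.x;

    // grid-stride over the row, so n may exceed blockDim.x
    float sum = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        sum += A[row * n + j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // standard shared-memory tree reduction (requires power-of-two blockDim.x)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        rowSums[row] = partial[0];
}

// e.g. the softmax denominators over N rows of C scores:
// kernRowSum<<<N, 128, 128 * sizeof(float)>>>(dev_probs, dev_rowSums, C);
```

`softmaxNormalize` would then divide each row by its sum with one more element-wise kernel, and `calculateLoss` would feed `-log(probs[i][y[i]])` through the same reduction.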
+
+## Extra Credit
+The forward and backward passes are computed for all `N` examples simultaneously, in one shot, using matrix-multiply and slicing operations.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
index 7446175..9e834c1 100644
--- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt
+++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
@@ -7,5 +7,5 @@ set(SOURCE_FILES
 
 cuda_add_library(character_recognition
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu
index 5a3ed7f..7ead417 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.cu
+++ b/Project2-Character-Recognition/character_recognition/mlp.cu
@@ -3,6 +3,10 @@
 #include "common.h"
 #include "mlp.h"
 
+#define blockSize 16
+
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg)
+
 namespace CharacterRecognition {
     using Common::PerformanceTimer;
     PerformanceTimer& timer()
@@ -10,6 +14,21 @@ namespace CharacterRecognition {
         static PerformanceTimer timer;
         return timer;
     }
+
+    // Debug helper: copy a device matrix to the host and print it
+    void printArray2D(float *dev_X, int nR, int nC) {
+        float *X = new float[nR * nC];
+        cudaMemcpy(X, dev_X, nR * nC * sizeof(float), cudaMemcpyDeviceToHost);
+
+        for (int i = 0; i < nR; i++) {
+            for (int j = 0; j < nC; j++)
+                printf("%.4f ", X[i*nC + j]);
+            printf("\n");
+        }
+        printf("\n");
+
+        delete[] X;
+    }
 
     // TODO: __global__
@@ -23,5 +42,527 @@ namespace CharacterRecognition {
     }
     */
 
+    // Element-wise exponential in place: A[i][j] = exp(A[i][j])
+    __global__ void kernExp(float *A, int m, int n) {
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= m || tx >= n) return;
+
+        A[ty*n + tx] = exp(A[ty*n + tx]);
+    }
+
+    // In-place ReLU: clamp negative entries to zero
+    __global__ void kernReLU(float *A, int m, int n) {
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= m || tx >= n) return;
+
+        if (A[ty*n + tx] < 0) A[ty*n + tx] = 0.0;
+    }
+
+    // In-place ReLU derivative: 1 where A > 0, 0 elsewhere
+    __global__ void kernDerivativeReLU(float *A, int m, int n) {
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= m || tx >= n) return;
+
+        if (A[ty*n + tx] > 0) {
+            A[ty*n + tx] = 1.0;
+        }
+        else {
+            A[ty*n + tx] = 0.0;
+        }
+    }
+
+    // Row-wise argmax: max[i] = index of the largest of the C scores in row i
+    __global__ void kernMax(float *A, int *max, int N, int C) {
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (tx >= N) return;
+
+        int maxIndex = 0;
+        float maxVal = A[tx*C + 0];
+        for (int i = 1; i < C; i++) {
+            float val = A[tx*C + i];
+            if (val > maxVal) {
+                maxVal = val;
+                maxIndex = i;
+            }
+        }
+
+        max[tx] = maxIndex;
+    }
+
+    __global__ void kernMatrixTranspose(float *A, float *B, int m, int n)
+    {
+        int tx = blockIdx.x * blockDim.x + threadIdx.x;
+        int ty = blockIdx.y * blockDim.y + threadIdx.y;
+
+        if (tx >= n || ty >= m) return;
+
+        int idx = ty * n + tx;
+        int transIdx = tx * m + ty;
+        B[transIdx] = A[idx];
+    }
+
+    __global__ void kernMatrixMultiply(float *A, float *B, float *C, int m, int n, int k) {
+        /*
+        A -> m X n and B -> n X k
+        C -> m X k is the output array
+        */
+
+        int row = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int col = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (row >= m || col >= k) return;
+
+        float sum = 0.0;
+        for (int i = 0; i < n; i++) {
+            sum += A[row*n + i] * B[i*k + col];
+        }
+
+        C[row*k + col] = sum;
+    }
+
+    __global__ void kernElementMatrixMultiply(float *A, float *B, float *C, int m, int n) {
+        /*
+        A, B, and C are all m X n
+        C = A * B element-wise
+        */
+
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= m || tx >= n) return;
+
+        C[ty*n + tx] = A[ty*n + tx] * B[ty*n + tx];
+    }
+
+    __global__ void kernElementMatrixAdd(float *A, float *B, float *C, float alpha, int m, int n) {
+        /*
+        A, B, and C are all m X n
+        C = A - alpha * B (the gradient-descent update when alpha is the learning rate)
+        */
+
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= m || tx >= n) return;
+
+        C[ty*n + tx] = A[ty*n + tx] - alpha * B[ty*n + tx];
+    }
+
+    // dscores must already hold a copy of probs; this subtracts 1 at the correct
+    // class of each row and divides everything by N
+    __global__ void kernDerivativeLossScores(float *probs, int *y, float *dscores, int N, int C) {
+        int ty = (blockDim.y * blockIdx.y) + threadIdx.y;
+        int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
+
+        if (ty >= N || tx >= C) return;
+
+        if (tx == y[ty]) {
+            dscores[ty*C + tx] = probs[ty*C + tx] - 1;
+            dscores[ty*C + tx] /= N;
+        } else {
+            dscores[ty*C + tx] /= N;
+        }
+    }
+
+    void softmaxExp(float *dev_A, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernExp<<<dimGrid, dimBlock>>>(dev_A, m, n);
+    }
+
+    void softmaxNormalize(float *dev_A, int m, int n) {
+        // TODO: Should be parallelized (currently a host-side row normalization)
+        float *A = new float[m * n];
+        cudaMemcpy(A, dev_A, m*n * sizeof(float), cudaMemcpyDeviceToHost);
+
+        for (int i = 0; i < m; i++) {
+            float sum = 0;
+            for (int j = 0; j < n; j++)
+                sum += A[i*n + j];
+
+            for (int j = 0; j < n; j++)
+                A[i*n + j] /= sum;
+        }
+
+        cudaMemcpy(dev_A, A, m*n * sizeof(float), cudaMemcpyHostToDevice);
+
+        delete[] A;
+    }
+
+    void calculateLoss(float *dev_probs, int N, int C, int *dev_y, float *loss) {
+        // TODO: Should be parallelized (currently a host-side reduction)
+        float *probs = new float[N * C];
+        cudaMemcpy(probs, dev_probs, N * C * sizeof(float), cudaMemcpyDeviceToHost);
+
+        int *y = new int[N];
+        cudaMemcpy(y, dev_y, N * 1 * sizeof(int), cudaMemcpyDeviceToHost);
+
+        // average cross-entropy: -log of the probability of the correct class
+        float totalLoss = 0;
+        for (int i = 0; i < N; i++) {
+            int label = y[i];
+            totalLoss += -log(probs[i*C + label]);
+        }
+
+        totalLoss /= N;
+
+        *loss = totalLoss;
+
+        delete[] probs;
+        delete[] y;
+    }
+
+    void reLU(float *dev_A, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernReLU<<<dimGrid, dimBlock>>>(dev_A, m, n);
+    }
+
+    void derivativeReLU(float *dev_A, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernDerivativeReLU<<<dimGrid, dimBlock>>>(dev_A, m, n);
+    }
+
+    void matrixMultiply(float *dev_A, float *dev_B, float *dev_C, int m, int n, int k) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (k + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernMatrixMultiply<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, m, n, k);
+    }
+
+    void matrixElementMultiply(float *dev_A, float *dev_B, float *dev_C, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernElementMatrixMultiply<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, m, n);
+    }
+
+    void matrixTranspose(float *dev_A, float *dev_B, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernMatrixTranspose<<<dimGrid, dimBlock>>>(dev_A, dev_B, m, n);
+    }
+
+    void softmaxDerivativeWrtScores(float *dev_probs, int *dev_y, float *dev_dscores, int N, int C) {
+        /*
+        Calculates dL/dscores, where probs = softmax(scores):
+        dscores = probs; dscores[range(N), y] -= 1; dscores /= N
+        */
+        cudaMemcpy(dev_dscores, dev_probs, N * C * sizeof(float), cudaMemcpyDeviceToDevice);
+
+        int gridRows = (N + blockSize - 1) / blockSize;
+        int gridCols = (C + blockSize - 1) / blockSize;
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+        kernDerivativeLossScores<<<dimGrid, dimBlock>>>(dev_probs, dev_y, dev_dscores, N, C);
+    }
+
+    void calculateDerviativeW2(int N, int d, int C, int h1, float *dscores, float *X2, float *dW2) {
+        float *X2Trans;
+        cudaMalloc((void**)&X2Trans, h1 * N * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc X2Trans failed");
+        matrixTranspose(X2, X2Trans, N, h1);
+
+        // dL/dW2 = X2.T * dscores (h1 x N times N x C = h1 x C)
+        matrixMultiply(X2Trans, dscores, dW2, h1, N, C);
+
+        cudaFree(X2Trans);
+        checkCUDAErrorWithLine("cudaFree X2Trans failed");
+    }
+
+    void calculateDerviativeW1(int N, int d, int C, int h1, float *X, float *fc, float *W2, float *dscores, float *dW1) {
+        float *dW1_1;
+        cudaMalloc((void**)&dW1_1, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dW1_1 failed");
+        float *W2Trans;
+        cudaMalloc((void**)&W2Trans, C * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc W2Trans failed");
+
+        matrixTranspose(W2, W2Trans, h1, C);
+
+        // dW1_1 = dscores * W2.T (N x C times C x h1 = N x h1)
+        matrixMultiply(dscores, W2Trans, dW1_1, N, C, h1);
+
+        float *dfcRelu;
+        cudaMalloc((void**)&dfcRelu, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dfcRelu failed");
+
+        cudaMemcpy(dfcRelu, fc, N * h1 * sizeof(float), cudaMemcpyDeviceToDevice);
+
+        // gradient through the ReLU: zero wherever the pre-activation was negative
+        derivativeReLU(dfcRelu, N, h1);
+
+        float *dfc;
+        cudaMalloc((void**)&dfc, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dfc failed");
+
+        matrixElementMultiply(dW1_1, dfcRelu, dfc, N, h1);
+
+        float *XTrans;
+        cudaMalloc((void**)&XTrans, d * N * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc XTrans failed");
+
+        matrixTranspose(X, XTrans, N, d);
+
+        // dL/dW1 = X.T * dfc (d x N times N x h1 = d x h1)
+        matrixMultiply(XTrans, dfc, dW1, d, N, h1);
+
+        cudaFree(dW1_1);
+        checkCUDAErrorWithLine("cudaFree dW1_1 failed");
+        cudaFree(W2Trans);
+        checkCUDAErrorWithLine("cudaFree W2Trans failed");
+        cudaFree(dfcRelu);
+        checkCUDAErrorWithLine("cudaFree dfcRelu failed");
+        cudaFree(dfc);
+        checkCUDAErrorWithLine("cudaFree dfc failed");
+        cudaFree(XTrans);
+        checkCUDAErrorWithLine("cudaFree XTrans failed");
+    }
+
+    // One SGD step: C = A - alpha * B, applied to the weight matrices
+    void updateWeights(float *dev_A, float *dev_B, float *dev_C, float alpha, int m, int n) {
+        int gridRows = (m + blockSize - 1) / blockSize;
+        int gridCols = (n + blockSize - 1) / blockSize;
+
+        dim3 dimGrid(gridCols, gridRows);
+        dim3 dimBlock(blockSize, blockSize);
+
+        kernElementMatrixAdd<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, alpha, m, n);
+    }
+
+    void allocateMemoryTrain(int N, int d, int C, int h1, float **fc, float **X2, float **scores, float **dscores, float **dW1, float **dW2) {
+        cudaMalloc((void **)fc, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc fc failed!");
+
+        cudaMalloc((void **)X2, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc X2 failed!");
+
+        cudaMalloc((void **)scores, N * C * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc scores failed!");
+
+        cudaMalloc((void **)dscores, N * C * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dscores failed!");
+
+        cudaMalloc((void **)dW2, h1 * C * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dW2 failed!");
+
+        cudaMalloc((void **)dW1, d * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc dW1 failed!");
+    }
+
+    void freeMemoryTrain(float *fc, float *X2, float *scores, float *dscores, float *dW1, float *dW2) {
+        cudaFree(fc);
+        checkCUDAErrorWithLine("cudaFree fc failed!");
+
+        cudaFree(X2);
+        checkCUDAErrorWithLine("cudaFree X2 failed!");
+
+        cudaFree(scores);
+        checkCUDAErrorWithLine("cudaFree scores failed!");
+
+        cudaFree(dscores);
+        checkCUDAErrorWithLine("cudaFree dscores failed!");
+
+        cudaFree(dW2);
+        checkCUDAErrorWithLine("cudaFree dW2 failed!");
+
+        cudaFree(dW1);
+        checkCUDAErrorWithLine("cudaFree dW1 failed!");
+    }
+
+    void allocateMemoryInference(int N, int d, int C, int h1, float **fc, float **X2, float **scores) {
+        cudaMalloc((void **)fc, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc fc failed!");
+
+        cudaMalloc((void **)X2, N * h1 * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc X2 failed!");
+
+        cudaMalloc((void **)scores, N * C * sizeof(float));
+        checkCUDAErrorWithLine("cudaMalloc scores failed!");
+    }
+
+    void freeMemoryInference(float *fc, float *X2, float *scores) {
+        cudaFree(fc);
+        checkCUDAErrorWithLine("cudaFree fc failed!");
+
+        cudaFree(X2);
+        checkCUDAErrorWithLine("cudaFree X2 failed!");
+
+        cudaFree(scores);
+        checkCUDAErrorWithLine("cudaFree scores failed!");
+    }
 
     // TODO: implement required elements for MLP sections 1 and 2 here
+
+    void trainStep(int N, int d, int C, int h1, float alpha, float *X, int *y, float *loss, float *W1, float *W2) {
+        /*
+        X  -> N x d
+        y  -> N x 1
+        W1 -> d x h1
+        W2 -> h1 x C
+        */
+
+        cudaEvent_t start, stop;
+        cudaEventCreate(&start);
+        cudaEventCreate(&stop);
+
+        cudaEventRecord(start);
+
+        // first fully connected layer: fc = X * W1
+        float *fc, *X2, *scores, *dscores, *dW1, *dW2;
+        allocateMemoryTrain(N, d, C, h1, &fc, &X2, &scores, &dscores, &dW1, &dW2);
+
+        matrixMultiply(X, W1, fc, N, d, h1);
+        //printf("Fc:\n");
+        //printArray2D(fc, N, h1);
+
+        // apply ReLU activation: X2 = ReLU(fc)
+        cudaMemcpy(X2, fc, N * h1 * sizeof(float), cudaMemcpyDeviceToDevice);
+        reLU(X2, N, h1);
+        //printf("ReLU:\n");
+        //printArray2D(X2, N, h1);
+
+        // calculate log-scores for the softmax
+        matrixMultiply(X2, W2, scores, N, h1, C);
+        //printf("Scores:\n");
+        //printArray2D(scores, N, C);
+
+        // calculate softmax probabilities: exponentiate all elements and normalize each row by its sum
+        softmaxExp(scores, N, C);
+        softmaxNormalize(scores, N, C);
+        //printf("Probs:\n");
+        //printArray2D(scores, N, C);
+
+        // calculate the loss
+        calculateLoss(scores, N, C, y, loss);
+        printf("Loss: %.4f\n", *loss);
+
+        // **** BACKPROPAGATION STARTS ****
+        softmaxDerivativeWrtScores(scores, y, dscores, N, C);
+
+        calculateDerviativeW2(N, d, C, h1, dscores, X2, dW2);
+
+        calculateDerviativeW1(N, d, C, h1, X, fc, W2, dscores, dW1);
+
+        updateWeights(W1, dW1, W1, alpha, d, h1);
+
+        updateWeights(W2, dW2, W2, alpha, h1, C);
+
+        freeMemoryTrain(fc, X2, scores, dscores, dW1, dW2);
+
+        cudaEventRecord(stop);
+        cudaEventSynchronize(stop);
+        float milliseconds = 0;
+        cudaEventElapsedTime(&milliseconds, start, stop);
+        //printf("Time taken for one train step: %.2f\n", milliseconds);
+
+        cudaEventDestroy(start);
+        cudaEventDestroy(stop);
+    }
+
+    void predict(int N, int d, int C, int h1, float *X, int *y, float *W1, float *W2, int *predictions) {
+        float *fc, *X2, *scores;
+        allocateMemoryInference(N, d, C, h1, &fc, &X2, &scores);
+
+        matrixMultiply(X, W1, fc, N, d, h1);
+
+        cudaMemcpy(X2, fc, N * h1 * sizeof(float), cudaMemcpyDeviceToDevice);
+        reLU(X2, N, h1);
+
+        matrixMultiply(X2, W2, scores, N, h1, C);
+
+        // calculate softmax probabilities: exponentiate all elements and normalize each row by its sum
+        softmaxExp(scores, N, C);
+        softmaxNormalize(scores, N, C);
+
+        //printArray2D(scores, N, C);
+
+        int *dev_pred;
+        cudaMalloc((void **)&dev_pred, N * sizeof(int));    // one argmax per example
+
+        int numBlocks = (N + blockSize - 1) / blockSize;
+        kernMax<<<numBlocks, blockSize>>>(scores, dev_pred, N, C);
+        cudaMemcpy(predictions, dev_pred, N * sizeof(int), cudaMemcpyDeviceToHost);
+
+        cudaFree(dev_pred);
+        freeMemoryInference(fc, X2, scores);
+    }
+
+    void predictAndAcc(int N, int d, int C, int h1, float *X, int *y, float *W1, float *W2) {
+        int *predictions = new int[N];
+        predict(N, d, C, h1, X, y, W1, W2, predictions);
+
+        for (int i = 0; i < N; i++) {
+            printf("Prediction for example %d is %d\n", i + 1, predictions[i]);
+        }
+
+        int *host_y = new int[N];
+        cudaMemcpy(host_y, y, N * 1 * sizeof(int), cudaMemcpyDeviceToHost);
+
+        float accuracy = 0.0;
+        for (int i = 0; i < N; i++) {
+            accuracy += (predictions[i] == host_y[i]);
+        }
+        accuracy /= N;
+
+        //printf("W1:\n");
+        //printArray2D(W1, d, h1);
+
+        //printf("W2:\n");
+        //printArray2D(W2, h1, C);
+
+        printf("\n\nAccuracy is %.4f\n\n", accuracy);
+
+        delete[] predictions;
+        delete[] host_y;
+    }
+
 }
diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h
index 2096228..812dfc0 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.h
+++ b/Project2-Character-Recognition/character_recognition/mlp.h
@@ -6,4 +6,7 @@ namespace CharacterRecognition {
     Common::PerformanceTimer& timer();
 
     // TODO: implement required elements for MLP sections 1 and 2 here
+
+    void trainStep(int N, int d, int C, int h1, float alpha, float *X, int *y, float *loss, float *W1, float *W2);
+    void predictAndAcc(int N, int d, int C, int h1, float *X, int *y, float *W1, float *W2);
 }
diff --git a/Project2-Character-Recognition/data/final_predictions.png b/Project2-Character-Recognition/data/final_predictions.png
new file mode 100644
index 0000000..dc444a5
Binary files /dev/null and b/Project2-Character-Recognition/data/final_predictions.png differ
diff --git a/Project2-Character-Recognition/data/memorize_random.png b/Project2-Character-Recognition/data/memorize_random.png
new file mode 100644
index 0000000..bac1bf8
Binary files /dev/null and b/Project2-Character-Recognition/data/memorize_random.png differ
diff --git a/Project2-Character-Recognition/data/perf_image.png b/Project2-Character-Recognition/data/perf_image.png
new file mode 100644
index 0000000..bc75291
Binary files /dev/null and b/Project2-Character-Recognition/data/perf_image.png differ
diff --git a/Project2-Character-Recognition/data/perf_neurons.png b/Project2-Character-Recognition/data/perf_neurons.png
new file mode 100644
index 0000000..60830b4
Binary files /dev/null and b/Project2-Character-Recognition/data/perf_neurons.png differ
diff --git a/Project2-Character-Recognition/data/training_curve.png b/Project2-Character-Recognition/data/training_curve.png
new file mode 100644
index 0000000..3aa6a34
Binary files /dev/null and b/Project2-Character-Recognition/data/training_curve.png differ
diff --git a/Project2-Character-Recognition/data/training_curve_xor.png b/Project2-Character-Recognition/data/training_curve_xor.png
new file mode 100644
index 0000000..f6d2861
Binary files /dev/null and b/Project2-Character-Recognition/data/training_curve_xor.png differ
diff --git a/Project2-Character-Recognition/data/white_noise.png b/Project2-Character-Recognition/data/white_noise.png
new file mode 100644
index 0000000..aefe249
Binary files /dev/null and b/Project2-Character-Recognition/data/white_noise.png differ
diff --git a/Project2-Character-Recognition/data/xor_predictions.PNG b/Project2-Character-Recognition/data/xor_predictions.PNG
new file mode 100644
index 0000000..bdd7564
Binary files /dev/null and b/Project2-Character-Recognition/data/xor_predictions.PNG differ
diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp
index 11dd534..0f0c2ab 100644
--- a/Project2-Character-Recognition/src/main.cpp
+++ b/Project2-Character-Recognition/src/main.cpp
@@ -10,143 +10,217 @@
 #include <character_recognition/mlp.h>
 #include <character_recognition/common.h>
 #include "testing_helpers.hpp"
+#include <fstream>
+#include <iostream>
+#include <string>
+#include <sstream>
 
-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int *a = new int[SIZE];
-int *b = new int[SIZE];
-int *c = new int[SIZE];
+#include <random>
+#include <ctime>
+
+std::random_device rd;  // Will be used to obtain a seed for the random number engine
+std::mt19937 gen(rd());
+std::uniform_real_distribution<> dis(-0.01, 0.01);
+
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg)
+
+void printArray2D(float *X, int nR, int nC) {
+    for (int i = 0; i < nR; i++) {
+        for (int j = 0; j < nC; j++)
+            printf("%.2f ", X[i*nC + j]);
+        printf("\n");
+    }
+}
+
+void printArray2D(int *X, int nR, int nC) {
+    for (int i = 0; i < nR; i++) {
+        for (int j = 0; j < nC; j++)
+            printf("%d ", X[i*nC + j]);
+        printf("\n");
+    }
+}
+
+// The 4-example XOR toy problem (the third input is a constant bias feature)
+void fillInputXOR(float *X, int *y) {
+    X[0] = 0.0, X[1] = 0.0, X[2] = 1.0;
+    X[3] = 0.0, X[4] = 1.0, X[5] = 1.0;
+    X[6] = 1.0, X[7] = 0.0, X[8] = 1.0;
+    X[9] = 1.0, X[10] = 1.0, X[11] = 1.0;
+
+    y[0] = 0;
+    y[1] = 1;
+    y[2] = 1;
+    y[3] = 0;
+}
+
+// Load the 52 images and labels from the data-set folder
+void fillImage(float *X, int *y) {
+    int j = 0;
+    for (int i = 1; i <= 52; i++) {
+        std::string fileName;
+        if (i <= 9) {
+            fileName = "../data-set/0" + std::to_string(i) + "info.txt";
+        }
+        else {
+            fileName = "../data-set\\" + std::to_string(i) + "info.txt";
+        }
+
+        std::ifstream myfile(fileName);
+        std::string line;
+        //std::cout << fileName << '\n';
+        if (myfile.is_open())
+        {
+            // first line: the label (1-indexed in the files)
+            std::getline(myfile, line);
+            y[i-1] = std::stoi(line) - 1;
+
+            // skip the second line
+            std::getline(myfile, line);
+
+            // third line: the pixels, scaled to [0, 1]
+            std::getline(myfile, line);
+            std::string buf;
+            std::stringstream ss(line);
+            while (ss >> buf) {
+                //std::cout << j << '\n';
+                X[j++] = ((float)std::stoi(buf)) / 255.0;
+            }
+
+            myfile.close();
+        }
+    }
+}
+
+void fillImageRandom(float *X, int *y, int N, int d) {
+    int j = 0;
+    for (int i = 1; i <= N; i++) {
+        y[i - 1] = i - 1;   // each random image is its own class
+
+        for (int k = 0; k < d; k++) {
+            X[j++] = dis(gen);
+        }
+    }
+}
+
+void generateRandomWeights(float *W, int nR, int nC) {
+    for (int i = 0; i < nR; i++) {
+        for (int j = 0; j < nC; j++)
+            W[i*nC + j] = dis(gen);
+    }
+}
 
 int main(int argc, char* argv[]) {
     // Scan tests
 
-    printf("\n");
-    printf("****************\n");
-    printf("** SCAN TESTS **\n");
-    printf("****************\n");
-
-    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    // initialize b using StreamCompaction::CPU::scan you implement
-    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
-    // At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. 
- zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + //int N = 4; + //int d = 3; + //int C = 2; + //int h1 = 4; + + //float *X = new float[N * d * sizeof(float)]; + //int *y = new int[N * 1 * sizeof(int)]; + //float *W1 = new float[d * h1 * sizeof(float)]; + //float *W2 = new float[h1 * C * sizeof(float)]; + //float loss_val = 0.0; + //float *loss = &loss_val; + + //float alpha = 0.5; + + //fillInputXOR(X, y); + //generateRandomWeights(W1, d, h1); + //generateRandomWeights(W2, h1, C); + + + //printf("X:\n"); + //printArray2D(X, N, d); + //printf("\n"); + //printf("W1:\n"); + //printArray2D(W1, d, h1); + //printf("\n"); + //printf("W2:\n"); + //printArray2D(W2, h1, C); + //printf("\n"); + + //for (int i = 1; i <= 1000; i++) { + // printf("\n\nIteration %d\n\n", i); + // CharacterRecognition::trainStep(N, d, C, h1, alpha, X, y, loss, W1, W2); + // + //} + + //CharacterRecognition::predictAndAcc(N, d, C, h1, X, y, W1, W2); + + int N = 52; + int d = 101 * 101; + int C = 52; + int h1 = 10; + + float *X = new float[N * d]; + int *y = new int[N]; + // + ////fillImageRandom(X, y, N, d); + fillImage(X, y); + //fillInputXOR(X, y); + + float *W1 = new float[d * h1 * sizeof(float)]; + float *W2 = new float[h1 * C * sizeof(float)]; + float loss_val = 0.0; + float *loss = &loss_val; + + float alpha = 0.5; + + generateRandomWeights(W1, d, h1); + generateRandomWeights(W2, h1, C); + + float *dev_X, *dev_W1, *dev_W2; + int *dev_y; + cudaMalloc((void **)&dev_X, N * d * sizeof(float)); + checkCUDAErrorWithLine("cudaMalloc failed!"); + cudaMalloc((void **)&dev_y, N * 1 * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc failed!"); + cudaMalloc((void **)&dev_W1, d * h1 * sizeof(float)); + checkCUDAErrorWithLine("cudaMalloc failed!"); + cudaMalloc((void **)&dev_W2, 
h1 * C * sizeof(float));
+    checkCUDAErrorWithLine("cudaMalloc dev_W2 failed!");
+
+    cudaMemcpy(dev_X, X, N * d * sizeof(float), cudaMemcpyHostToDevice);
+    cudaMemcpy(dev_y, y, N * 1 * sizeof(int), cudaMemcpyHostToDevice);    // y is an int array, so copy sizeof(int) per element
+    cudaMemcpy(dev_W1, W1, d * h1 * sizeof(float), cudaMemcpyHostToDevice);
+    cudaMemcpy(dev_W2, W2, h1 * C * sizeof(float), cudaMemcpyHostToDevice);
+
+    //std::ofstream myfile;
+    //myfile.open("loss_curve_XOR.txt");
+
+    for (int i = 1; i <= 100; i++) {
+        printf("\n\nIteration %d\n\n", i);
+        CharacterRecognition::trainStep(N, d, C, h1, alpha, dev_X, dev_y, loss, dev_W1, dev_W2);
+
+        //myfile << i << " " << *loss << '\n';
+    }
+
+    CharacterRecognition::predictAndAcc(N, d, C, h1, dev_X, dev_y, dev_W1, dev_W2);
+
+    //myfile.close();
+
+    cudaFree(dev_X);
+    checkCUDAErrorWithLine("cudaFree dev_X failed!");
+    cudaFree(dev_y);
+    checkCUDAErrorWithLine("cudaFree dev_y failed!");
+    cudaFree(dev_W1);
+    checkCUDAErrorWithLine("cudaFree dev_W1 failed!");
+    cudaFree(dev_W2);
+    checkCUDAErrorWithLine("cudaFree dev_W2 failed!");
+
+    delete[] X;
+    delete[] y;
+    delete[] W1;
+    delete[] W2;
 }
diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md
index 0e38ddb..d3ce611 100644
--- a/Project2-Stream-Compaction/README.md
+++ b/Project2-Stream-Compaction/README.md
@@ -1,14 +1,122 @@
 CUDA Stream Compaction
 ======================
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2 - Stream Compaction**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Srinath Rajagopalan
+  * [LinkedIn](https://www.linkedin.com/in/srinath-rajagopalan-07a43155), [twitter](https://twitter.com/srinath132)
+* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Nvidia Quadro P1000 4GB (Moore 100B Lab)
 
-### (TODO: Your README)
+### Scan and Stream Compaction
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this project, I have implemented the exclusive-scan and stream-compaction algorithms in the following configurations:
+1) CPU scan and CPU stream compaction
+2) Naive GPU scan
+3) Work-efficient GPU scan and stream compaction
+4) Scan from Thrust (as a benchmark)
+
+Scan and compaction are best understood with the following example:
+
+* scan:
+  - goal: produce the prefix-sum array of a given array (we only care about exclusive scan here)
+  - input
+    - [1 5 0 1 2 0 3]
+  - output
+    - [0 1 6 6 7 9 9]
+* compact:
+  - goal: pack the elements != 0 closely and neatly together
+  - input
+    - [1 5 0 1 2 0 3]
+  - output
+    - [1 5 1 2 3]
+
+## GPU Implementation
+### Naive Scan
+Though exclusive scan looks like a sequential operation, we can parallelize it by dividing the computation into `log(n)` passes over the array. In the first pass, we do `a[i] = a[i] + a[i-1]`; in the `k`-th pass, each element adds in the value `2^(k-1)` positions to its left. The invariant we maintain is that by the `k`-th pass, the array holds the correct (inclusive) values up to index `2^(k-1)` (excluded), so we perform the two-way addition only for `index >= 2^(k-1)`. After `log(n)` passes, the array contains an inclusive scan, which we convert to an exclusive scan with a shift-right operation. Using `n` threads, each pass takes `O(1)` time and we have `log(n)` passes, so the total time complexity is `O(log(n))`. However, what about the total number of adds? A sequential scan performs only `n` adds, while the naive parallel implementation does `O(n)` adds in each pass, so its total number of adds is `O(n log n)`.
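As a worked illustration on the example array `[1 5 0 1 2 0 3]`:

* after the pass with offset 1: `[1 6 5 1 3 2 3]`
* after the pass with offset 2: `[1 6 6 7 8 3 6]`
* after the pass with offset 4: `[1 6 6 7 9 9 12]` (the inclusive scan)
* after the shift-right: `[0 1 6 6 7 9 9]` (the exclusive scan)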
+
+### Work-efficient Scan
+The work-efficient implementation aims to bring the total number of addition operations down to `O(n)`. This approach uses a balanced-binary-tree view to structure the computation. In the up-sweep phase, we go from the leaf nodes to the root, building partial sums in place. In the down-sweep phase, we traverse back down from the root, reusing those partial sums to build the scan in place. This approach requires the array length to be a power of 2, so when it is not, we pad the array to the next power of 2.
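Continuing the example, `[1 5 0 1 2 0 3]` is first padded to `[1 5 0 1 2 0 3 0]`:

* up-sweep, `d = 0`: `[1 6 0 1 2 2 3 3]`
* up-sweep, `d = 1`: `[1 6 0 7 2 2 3 5]`
* up-sweep, `d = 2`: `[1 6 0 7 2 2 3 12]` (the root now holds the total sum)
* zero the last element: `[1 6 0 7 2 2 3 0]`
* down-sweep, `d = 2`: `[1 6 0 0 2 2 3 7]`
* down-sweep, `d = 1`: `[1 0 0 6 2 7 3 9]`
* down-sweep, `d = 0`: `[0 1 6 6 7 9 9 12]` (the exclusive scan)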
+
+### Work-efficient Compaction
+
+In compaction, we remove all elements of the array that do not satisfy a certain criterion. We first create a boolean array indicating which elements satisfy the criterion. We then run the work-efficient scan on this _boolean array_ to build its prefix sum. For every element that will be included in the final array, the exclusive scan gives the _index_ at which it must be written. The writes are performed by a parallel scatter operation.
+
+
+### Performance Analysis
+1) A block size that worked well for each of the configurations was `128`. I experimented with sizes from 256 to 512 and the performance remained similar, while decreasing the block size below `64` led to a drop in performance for all the GPU implementations. Thus, all comparisons are benchmarked with the block size fixed at 128.
+
+2) Performance graphs comparing the different implementations are included below. For work-efficient compaction, I implemented two versions, one of which is inefficient but passes all the test cases for array sizes up to 2^25. That version calls the work-efficient scan function implemented as part of the scan setup; to stick to that API, it also performs several `cudaMemcpy` transfers from device to host and host to device that would not be required if the intermediate scan buffers never had to live on the CPU. The more efficient version, however, does not pass the non-power-of-two test case at array size `2^25`.
+
+![](data/scan_perf_15.png)
+![](data/scan_perf_20.png)
+![](data/scan_perf_25.png)
+
+![](data/compact_perf_15.png)
+![](data/compact_perf_20.png)
+![](data/compact_perf_25.png)
+
+**What do the above graphs mean?**
+
+For smaller inputs, the CPU implementation is significantly faster than the GPU ones. The GPU naive scan invokes a kernel `log(n)` times, and there is a lot of overhead in launching a kernel within a loop; for small sizes this cost is not justified. The overhead is compounded in the work-efficient scan, so for small array sizes we are better off with the CPU. However, the CPU scan _also_ performs better than the naive scan for large array sizes. Why? I theorize that this is due to the additional adds required by the naive parallel approach; this is also reflected in the fact that the work-efficient scan is faster than the CPU implementation at the larger array sizes. Thrust, unsurprisingly, performs better than everything I have implemented. This is probably because, though the algorithmic complexity might still be the same, Thrust makes better use of the hardware resources: using shared memory to reduce global-memory accesses, and avoiding issues like bank conflicts when accessing shared memory.
+
+Work-efficient compaction is significantly slower if it reuses the work-efficient scan implementation as-is, because of the additional `cudaMemcpy` calls from host to device and device to host. We can improve this by keeping all the intermediate arrays on the GPU, minimizing CPU-GPU transfers. But scan updates the array in place, and we still need the boolean array when performing the scatter, so we have to copy it to another location in GPU memory. This is still the better alternative, since a device-to-device transfer is far cheaper than device-to-host.
+
+### Tests output
+
+```
+****************
+** SCAN TESTS **
+****************
+    [   8  18  36  48  30   4   0  24  16  19  40  47  23 ...  23   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.0017ms    (std::chrono Measured)
+    [   0   8  26  62 110 140 144 144 168 184 203 243 290 ... 25535 25558 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.0016ms    (std::chrono Measured)
+    [   0   8  26  62 110 140 144 144 168 184 203 243 290 ... 25468 25509 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.251904ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.058368ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.099328ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.142368ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.05264ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.082112ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   2   2   2   0   0   0   2   0   1   0   3   3 ...   1   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.0026ms    (std::chrono Measured)
+    [   2   2   2   2   2   1   3   3   2   2   1   3   2 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.0164ms    (std::chrono Measured)
+    [   2   2   2   2   2   1   3   3   2   2   1   3   2 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.124ms    (std::chrono Measured)
+    [   2   2   2   2   2   1   3   3   2   2   1   3   2 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.459776ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.529408ms    (CUDA Measured)
+    passed
+```
diff --git a/Project2-Stream-Compaction/data/compact_perf_15.png b/Project2-Stream-Compaction/data/compact_perf_15.png
new file mode 100644
index 0000000..8f3dbca
Binary files /dev/null and b/Project2-Stream-Compaction/data/compact_perf_15.png differ
diff --git a/Project2-Stream-Compaction/data/compact_perf_20.png b/Project2-Stream-Compaction/data/compact_perf_20.png
new file mode 100644
index 0000000..bff2838
Binary files /dev/null and b/Project2-Stream-Compaction/data/compact_perf_20.png differ
diff --git a/Project2-Stream-Compaction/data/compact_perf_25.png b/Project2-Stream-Compaction/data/compact_perf_25.png
new file mode 100644
index 0000000..a3711f3
Binary files /dev/null and b/Project2-Stream-Compaction/data/compact_perf_25.png differ
diff --git a/Project2-Stream-Compaction/data/data_perf.xlsx b/Project2-Stream-Compaction/data/data_perf.xlsx
new file mode 100644
index 0000000..de9d548
Binary files /dev/null and b/Project2-Stream-Compaction/data/data_perf.xlsx differ
diff --git a/Project2-Stream-Compaction/data/scan_perf_15.png b/Project2-Stream-Compaction/data/scan_perf_15.png
new file mode 100644
index 0000000..df505b8
Binary files /dev/null and b/Project2-Stream-Compaction/data/scan_perf_15.png differ
diff --git a/Project2-Stream-Compaction/data/scan_perf_20.png b/Project2-Stream-Compaction/data/scan_perf_20.png
new file mode 100644
index 0000000..5f95d5e
Binary files /dev/null and b/Project2-Stream-Compaction/data/scan_perf_20.png differ
diff --git a/Project2-Stream-Compaction/data/scan_perf_25.png b/Project2-Stream-Compaction/data/scan_perf_25.png
new file mode 100644
index 0000000..27f1522
Binary files /dev/null and b/Project2-Stream-Compaction/data/scan_perf_25.png differ
diff --git a/Project2-Stream-Compaction/data_perf.xlsx b/Project2-Stream-Compaction/data_perf.xlsx
new file mode 100644
index 0000000..de9d548
Binary files /dev/null and b/Project2-Stream-Compaction/data_perf.xlsx differ
diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp
index d016553..3569b2e 100644
--- a/Project2-Stream-Compaction/src/main.cpp
+++ b/Project2-Stream-Compaction/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 10; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
 int *c = new int[SIZE];
diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
index cdbef77..4bb0dc2 100644
--- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
+++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES
 
 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu
index 2ed6d63..01fe486 100644
--- a/Project2-Stream-Compaction/stream_compaction/common.cu
+++ b/Project2-Stream-Compaction/stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
         */
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = (blockDim.x * blockIdx.x) + threadIdx.x;
+            if (index >= n) return;
+
+            // 1 if the element survives compaction, 0 otherwise
+            bools[index] = idata[index] != 0;
        }
 
        /**
@@ -32,7 +35,14 @@ namespace StreamCompaction {
         */
        __global__ void kernScatter(int n, int *odata,
                const int *idata, const int *bools, const int *indices) {
-            // TODO
+
+            int index = (blockDim.x * blockIdx.x) + threadIdx.x;
+            if (index >= n) return;
+
+            // surviving elements land at the index given by the exclusive scan
+            if (bools[index]) {
+                odata[indices[index]] = idata[index];
+            }
+
        }
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu
index a2d3e6c..34a1de9 100644
--- a/Project2-Stream-Compaction/stream_compaction/cpu.cu
+++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu
@@ -18,9 +18,31 @@ namespace StreamCompaction {
         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
+            // The timer throws if it is already running (e.g. when scan is
+            // called from compactWithScan), so swallow that case.
+            bool exception = false;
+            try {
+                timer().startCpuTimer();
+            }
+            catch (const std::runtime_error& ex) {
+                exception = true;
+            }
+
+            if (n <= 0) return;
+            if (n == 1) {
+                odata[0] = 0;
+                return;
+            }
+            odata[0] = 0;
+            odata[1] = idata[0];
+
+            // exclusive scan: odata[i] = sum of idata[0..i-1]
+            for (int i = 2; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
+
+            if (!exception)
+                timer().endCpuTimer();
        }
 
        /**
@@ -30,9 +52,17 @@ namespace StreamCompaction {
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+
+            int k = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[k] = idata[i];
+                    k++;
+                }
+            }
+
            timer().endCpuTimer();
-            return -1;
+            return k;
        }
 
        /**
@@ -42,9 +72,25 @@ namespace StreamCompaction {
         */
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+
+            int *checkNonZero = new int[n];
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) checkNonZero[i] = 1;
+                else checkNonZero[i] = 0;
+            }
+
+            int *prefixSum = new int[n];
+            scan(n, prefixSum, checkNonZero);
+
+            // scatter: the exclusive scan gives each survivor its output index
+            for (int i = 0; i < n; i++) {
+                if (checkNonZero[i] != 0) {
+                    odata[prefixSum[i]] = idata[i];
+                }
+            }
+
+            int count = checkNonZero[n - 1] == 0 ? prefixSum[n - 1] : prefixSum[n - 1] + 1;
            timer().endCpuTimer();
-            return -1;
+
+            delete[] checkNonZero;
+            delete[] prefixSum;
+            return count;
        }
    }
 }
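Both `compactWithScan` above and the GPU `compact` below follow the same map/scan/scatter scheme; on the README's sample input it plays out as:

```
idata           [1 5 0 1 2 0 3]
bools           [1 1 0 1 1 0 1]   // 1 wherever idata != 0
exclusive scan  [0 1 2 2 3 4 4]   // output index for each surviving element
scatter         odata[scan[i]] = idata[i] wherever bools[i] == 1
odata           [1 5 1 2 3]       // count = scan[6] + bools[6] = 5
```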
diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu
index 2db346e..7f3e3c7 100644
--- a/Project2-Stream-Compaction/stream_compaction/efficient.cu
+++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "efficient.h"
 
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg)
+
 namespace StreamCompaction {
    namespace Efficient {
        using StreamCompaction::Common::PerformanceTimer;
@@ -12,15 +14,181 @@ namespace StreamCompaction {
            return timer;
        }
 
+        __global__ void resetZeros(int n, int *a) {
+            int index = (blockDim.x*blockIdx.x) + threadIdx.x;
+            if (index >= n) return;
+            a[index] = 0;
+        }
+
+        // Up-sweep (reduce) phase: at level d, combine pairs 2^(d+1) apart, in place
+        __global__ void upSweep(int n, int d, int *idata) {
+            int index = (blockDim.x*blockIdx.x) + threadIdx.x;
+
+            int twoPowd1 = 1 << (d + 1);
+            int twoPowd = 1 << d;
+
+            if ((index % twoPowd1 != twoPowd1 - 1) || index >= n) return;
+
+            int k = index - twoPowd1 + 1;
+            idata[index] += idata[k + twoPowd - 1];
+        }
+
+        // Down-sweep phase: push the partial sums back down the implicit tree
+        __global__ void downSweep(int n, int d, int *idata) {
+            int index = (blockDim.x*blockIdx.x) + threadIdx.x;
+
+            int twoPowd1 = 1 << (d + 1);
+            int twoPowd = 1 << d;
+
+            if ((index % twoPowd1 != twoPowd1 - 1) || index >= n) return;
+
+            int k = index - twoPowd1 + 1;
+            int t = idata[k + twoPowd - 1];
+            idata[k + twoPowd - 1] = idata[index];
+            idata[index] += t;
+        }
+
+        // Debug helper: print a host array
+        void printxxx(int n, const int *a) {
+            for (int i = 0; i < n; i++) {
+                printf("%d ", a[i]);
+            }
+            printf("\n\n\n");
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            bool exception = false;
+
+            int *dev_idata;
+
+            int numThreads = 128;
+            int numBlocks = (n + numThreads - 1) / numThreads;
+
+            int nOrig = n;
+            int d_max = ilog2ceil(n);
+
+            int twoPowN = 1 << d_max;
+            if (n != twoPowN) {
+                // pad to the next power of two and zero the tail
+                int diff = twoPowN - n;
+
+                cudaMalloc((void **)&dev_idata, (n + diff) * sizeof(int));
+                checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+                // recompute the block count so the padded tail is covered too
+                numBlocks = (n + diff + numThreads - 1) / numThreads;
+                resetZeros<<<numBlocks, numThreads>>>(n + diff, dev_idata);
+
+                cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+                n = n + diff;
+            } else {
+                cudaMalloc((void **)&dev_idata, n * sizeof(int));
+                checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+                cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            }
+
+            // scan is also called from compact(), whose timer is already running
+            try {
+                timer().startGpuTimer();
+            }
+            catch (const std::runtime_error& ex) {
+                exception = true;
+            }
+
+            for (int d = 0; d < d_max; d++) {
+                upSweep<<<numBlocks, numThreads>>>(n, d, dev_idata);
+            }
+
+            // reset last element to zero before the down-sweep
+            int zero = 0;
+            cudaMemcpy(dev_idata + n - 1, &zero, sizeof(int), cudaMemcpyHostToDevice);
+
+            for (int d = d_max - 1; d >= 0; d--) {
+                downSweep<<<numBlocks, numThreads>>>(n, d, dev_idata);
+            }
+
+            if (!exception)
+                timer().endGpuTimer();
+
+            // copy back only the original n results; odata may be smaller than the padded array
+            cudaMemcpy(odata, dev_idata, nOrig * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+        }
+
+        // Same as scan(), but idata/odata are device pointers (used by the all-on-GPU compact)
+        void scanCompact(int n, int *odata, const int *idata) {
+            bool exception = false;
+
+            int *dev_idata;
+
+            int numThreads = 128;
+            int numBlocks = (n + numThreads - 1) / numThreads;
+
+            int nOrig = n;
+            int d_max = ilog2ceil(n);
+
+            int twoPowN = 1 << d_max;
+            if (n != twoPowN) {
+                int diff = twoPowN - n;
+
+                cudaMalloc((void **)&dev_idata, (n + diff) * sizeof(int));
+                checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+                numBlocks = (n + diff + numThreads - 1) / numThreads;
+                resetZeros<<<numBlocks, numThreads>>>(n + diff, dev_idata);
+
+                cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+                n = n + diff;
+            }
+            else {
+                cudaMalloc((void **)&dev_idata, n * sizeof(int));
+                checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+                cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            }
+
+            try {
+                timer().startGpuTimer();
+            }
+            catch (const std::runtime_error& ex) {
+                exception = true;
+            }
+
+            for (int d = 0; d < d_max; d++) {
+                upSweep<<<numBlocks, numThreads>>>(n, d, dev_idata);
+            }
+
+            // reset last element to zero before the down-sweep
+            int zero = 0;
+            cudaMemcpy(dev_idata + n - 1, &zero, sizeof(int), cudaMemcpyHostToDevice);
+
+            for (int d = d_max - 1; d >= 0; d--) {
+                downSweep<<<numBlocks, numThreads>>>(n, d, dev_idata);
+            }
+
+            if (!exception)
+                timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_idata, nOrig * sizeof(int), cudaMemcpyDeviceToDevice);
+
+            cudaFree(dev_idata);
+        }
+
        /**
         * Performs stream compaction on idata, storing the result into odata.
         * All zeroes are discarded.
@@ -31,10 +199,122 @@ namespace StreamCompaction {
         * @returns The number of elements remaining after compaction.
         */
        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+            int numThreads = 128;
+            int numBlocks = (n + numThreads - 1) / numThreads;
+
+            int *dev_checkZeros, *dev_sumIndices, *dev_odata, *dev_idata;
+
+            cudaMalloc((void **)&dev_checkZeros, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_checkZeros failed!");
+            cudaMalloc((void **)&dev_sumIndices, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_sumIndices failed!");
+            cudaMalloc((void **)&dev_odata, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_odata failed!");
+            cudaMalloc((void **)&dev_idata, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+
+            StreamCompaction::Common::kernMapToBoolean<<<numBlocks, numThreads>>>(n, dev_checkZeros, dev_idata);
+
+            // NOTE: this is the inefficient version -- it round-trips the boolean
+            // array through the host so it can reuse the host-facing scan()
+            int *checkZeros = new int[n];
+            cudaMemcpy(checkZeros, dev_checkZeros, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            int *sumIndices = new int[n];
+            scan(n, sumIndices, checkZeros);
+
+            cudaMemcpy(dev_sumIndices, sumIndices, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            StreamCompaction::Common::kernScatter<<<numBlocks, numThreads>>>(n, dev_odata, dev_idata, dev_checkZeros, dev_sumIndices);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // last scan value + last flag = number of surviving elements
+            int count = checkZeros[n - 1] == 0 ? sumIndices[n - 1] : sumIndices[n - 1] + 1;
+
+            //delete[] checkZeros;
+            //delete[] sumIndices;
+
+            //printf("hey\n");
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(dev_checkZeros);
+            cudaFree(dev_sumIndices);
+
+            return count;
        }
+
+        // The faster, all-on-GPU version; kept commented out because it fails the
+        // non-power-of-two test at 2^25 (see the README).
+        //int compact(int n, int *odata, const int *idata) {
+
+        //    int numThreads = 128;
+        //    int numBlocks = (n + numThreads - 1) / numThreads;
+
+        //    int *dev_checkZeros, *dev_sumIndices, *dev_odata, *dev_idata;
+
+        //    cudaMalloc((void **)&dev_checkZeros, n * sizeof(int));
+        //    checkCUDAErrorWithLine("cudaMalloc dev_checkZeros failed!");
+        //    cudaMalloc((void **)&dev_sumIndices, n * sizeof(int));
+        //    checkCUDAErrorWithLine("cudaMalloc dev_sumIndices failed!");
+        //    cudaMalloc((void **)&dev_odata, n * sizeof(int));
+        //    checkCUDAErrorWithLine("cudaMalloc dev_odata failed!");
+        //    cudaMalloc((void **)&dev_idata, n * sizeof(int));
+        //    checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+        //    cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+        //    timer().startGpuTimer();
+
+        //    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, numThreads>>>(n, dev_checkZeros, dev_idata);
+
+        //    //int *checkZeros = new int[n];
+        //    //cudaMemcpy(checkZeros, dev_checkZeros, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+        //    //int *sumIndices = new int[n];
+        //    scanCompact(n, dev_sumIndices, dev_checkZeros);
+
+        //    //cudaMemcpy(dev_sumIndices, sumIndices, n * sizeof(int), cudaMemcpyHostToDevice);
+
+        //    StreamCompaction::Common::kernScatter<<<numBlocks, numThreads>>>(n, dev_odata, dev_idata, dev_checkZeros, dev_sumIndices);
+
+        //    timer().endGpuTimer();
+
+        //    cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+        //    int *sumIndices = new int[n];
+        //    int *checkZeros = new int[n];
+        //    cudaMemcpy(checkZeros, dev_checkZeros, n * sizeof(int), cudaMemcpyDeviceToHost);
+        //    cudaMemcpy(sumIndices, dev_sumIndices, n * sizeof(int), cudaMemcpyDeviceToHost);
+        //    int count = checkZeros[n - 1] == 0 ? sumIndices[n - 1] : sumIndices[n - 1] + 1;
+
+        //    //delete[] checkZeros;
+        //    //delete[] sumIndices;
+
+        //    //printf("hey\n");
+
+        //    cudaFree(dev_idata);
+        //    cudaFree(dev_odata);
+        //    cudaFree(dev_checkZeros);
+        //    cudaFree(dev_sumIndices);
+
+        //    delete[] sumIndices;
+        //    delete[] checkZeros;
+
+        //    return count;
+        //}
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..dc8b82b 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "naive.h"
 
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg)
+
 namespace StreamCompaction {
    namespace Naive {
        using StreamCompaction::Common::PerformanceTimer;
@@ -11,15 +13,78 @@ namespace StreamCompaction {
            static PerformanceTimer timer;
            return timer;
        }
-        // TODO: __global__
+
+        // One pass of the naive scan: add in the element 2^(d-1) positions to the left
+        __global__ void prefixSum(int n, int d, int *odata, const int *idata) {
+            int index = (blockDim.x*blockIdx.x) + threadIdx.x;
+
+            int min_k = 1 << (d - 1);
+            if (index < min_k || index >= n) return;
+
+            odata[index] = idata[index - min_k] + idata[index];
+        }
+
+        // Convert the inclusive scan to an exclusive one
+        __global__ void shiftRight(int n, int *odata, const int *idata) {
+            int index = (blockDim.x*blockIdx.x) + threadIdx.x;
+
+            if (index >= n) return;
+
+            if (index == 0) {
+                odata[index] = 0;
+                return;
+            }
+
+            odata[index] = idata[index - 1];
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            int numThreads = 128;
+            int numBlocks = (n + numThreads - 1) / numThreads;
+
+            int d_max = ilog2ceil(n);
+
+            int *dev_idata, *dev_odata1;
+            cudaMalloc((void **)&dev_idata, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+            cudaMalloc((void **)&dev_odata1, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_odata1 failed!");
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemcpy(dev_odata1, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            // two buffers, so each pass reads consistent values while writing
+            int *out1 = dev_idata;
+            int *out2 = dev_odata1;
+
+            timer().startGpuTimer();
+
+            for (int d = 1; d <= d_max; d++) {
+                prefixSum<<<numBlocks, numThreads>>>(n, d, out2, out1);
+                cudaMemcpy(out1, out2, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            }
+
+            shiftRight<<<numBlocks, numThreads>>>(n, out2, out1);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, out2, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata1);
        }
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..f6c5fc3 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -6,6 +6,8 @@
 #include "common.h"
 #include "thrust.h"
 
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg)
+
 namespace StreamCompaction {
    namespace Thrust {
        using StreamCompaction::Common::PerformanceTimer;
@@ -18,11 +20,32 @@ namespace StreamCompaction {
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-            timer().endGpuTimer();
+
+            int *dev_idata, *dev_odata;
+            cudaMalloc((void **)&dev_odata, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_odata failed!");
+            cudaMalloc((void **)&dev_idata, n * sizeof(int));
+            checkCUDAErrorWithLine("cudaMalloc dev_idata failed!");
+
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            // wrap the raw device pointers so Thrust treats them as device iterators
+            thrust::device_ptr<int> dev_idataItr(dev_idata);
+            thrust::device_ptr<int> dev_odataItr(dev_odata);
+
+            timer().startGpuTimer();
+
+            thrust::exclusive_scan(dev_idataItr, dev_idataItr + n, dev_odataItr);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
        }
    }
 }
diff --git a/README.md b/README.md
index 3a0b2fe..eceb145 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,13 @@
 CUDA Number Algorithms
 ======================
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2 - Number Algorithms**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Srinath Rajagopalan + * [LinkedIn](https://www.linkedin.com/in/srinath-rajagopalan-07a43155), [twitter](https://twitter.com/srinath132) +* Tested on: Windows 10, i7-6700 @ 3.4GHz 16GB, Nvidia Quadro P1000 4GB (Moore 100B Lab) -### (TODO: Your README) - -Link to the readmes of the other two subprojects. - -Add anything else you think is relevant up to this point. -(Remember, this is public, so don't put anything here that you don't want to share with the world.) +1) [Project-Stream Compaction](Project2-Stream-Compaction/README.md) +2) [Project-Character-Recognition](Project2-Character-Recognition/README.md)