diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..2c5bf83 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -3,12 +3,62 @@ CUDA Character Recognition
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Grace Gilbert
+ * gracelgilbert.com
+* Tested on: Windows 10, i9-9900K @ 3.60GHz 64GB, GeForce RTX 2080 40860MB
+
+## Overview
+In this project I implemented a multilayer perceptron (mlp), a simple neural network used for machine learning. The project has two states: the training state and the predicting state. In the training state, we are given a set of data inputs and their desired output values. In this case, the main data set is 52 English characters (upper and lower case), each with a .txt file containing grayscale values of an image of the character. We train the mlp using the .txt file data and the expected values, adjusting weights based on error calculations until they converge on good weights that match the expected output values with high consistency. Once we have these good weights, we can use the mlp in the predicting state to predict the output value for a .txt file's data without knowing its desired output in advance.
+
+### Project Run Options
+At the top of main.cpp, there are multiple settings that the user can adjust.
+- The first, training, indicates whether we are in the training state (1) or the predicting state (0).
+- Next is the accepted error, which determines what error value is low enough for the weights that achieved it to be considered good enough. The lower this value is, the longer it will take for the weights to converge, but the more accurate the predictions will be.
+- Next is the number of random weight attempts. To find good weights, we start with randomly assigned weights and then see if they converge. If they do not converge after a certain number of iterations of training on the input data, we reset and try new random weights. The number of random weight attempts determines how many times we will reset and try new weights before giving up on convergence.
+- Related is the number of convergence iterations. When testing a set of weights, we repeatedly run the data through the mlp network and refine the weights until the error is low enough. This number determines how many times we will refine the weights before giving up on that set of weights and resetting.
+- The final option is the use xor value. I had some trouble training the character recognition data, but was able to get weights for the simpler xor example to work, where the input is a pair of 0's and 1's and the output is the xor of the pair. If this value is set to 1, training and prediction run on the xor example rather than the character data. If set to 0, the program reads the character data from the text files and runs training and prediction on that.
+
+## Training
+The overall training pipeline is to generate a set of random weights, run each piece of input data through the mlp network, and iteratively adjust the weights based on the output and error returned from the mlp network until the error is sufficiently low. It is not guaranteed that every set of initial weight guesses will converge to good final weight values, so after some number of iterations of testing the input data and adjusting the weights, we reset and make new random weight guesses. If after many resets of the guesses we still have not reached convergence, we stop to avoid an infinitely running program. Once converged weights are found, we output them.
+### Random Weight Generation
+The random weights are generated on the GPU: the weights do not depend on each other, so they can be generated in parallel. For the random seed, I take the elapsed time between the start of training and the time at which the weights are generated, so that each set of random weights has a different seed.
+### MLP Network
+The mlp network takes in one piece of input data at a time. This input data is made up of a certain number of values. In the case of the xor data, the input is made up of two numbers, whereas in the case of the character data, the input is made up of 10201 numbers (101 by 101). There is then a second layer of values called the hidden layer. The number of hidden layer values is somewhere between 1 and the number of input values. For each pair of an input value and a hidden layer value, there is a corresponding, predetermined weight. Each hidden layer value is found by taking a dot product between the input values and the weights corresponding to that hidden layer index. That dot product is then put into an equation to find the hidden layer value:
+
+```
+hidden layer value = 1/(1 + exp(-dot product value))
+```
+
+This completes the first layer of the mlp. Each hidden layer value also has a weight. The final layer is just one value, as this will be our output. To find that final value, we first sum the products of each hidden layer value and its weight. Then we perform the same operation described above on this sum. This is our output. If we were in the prediction state, this would be the conclusion of the mlp, as we would return this output as the prediction based on the given weights.
+
+Below is a diagram illustrating the layers and weights in the mlp network:
+
+![](img/MLP.png)
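+
+As a concrete illustration, below is a minimal CUDA sketch of this layer-1 computation. It is a simplified version of the `kernLayer1Mult` kernel in `mlp.cu`; the names here are illustrative rather than the exact ones in the code.
+
+```
+// One thread per hidden layer value: dot product of the input with that
+// value's weights, followed by the 1/(1 + e^-x) activation.
+__global__ void kernHiddenLayerSketch(int numHidden, int inputSize,
+                                      const float *input, const float *weights,
+                                      float *hidden) {
+    int j = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (j >= numHidden) {
+        return;
+    }
+    float sum = 0.0f;
+    for (int i = 0; i < inputSize; ++i) {
+        sum += input[i] * weights[j + numHidden * i];
+    }
+    hidden[j] = 1.0f / (1.0f + expf(-sum));
+}
+```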
+
+### Error propagation
+In training mode, we want to use the mlp output to calculate error and adjust the weights to lower that error, a process called error propagation. To calculate the error of an individual data input, we find:
+```
+error = (output - expected value)^2
+```
+The expected value is a numerical value representing the input data's correct value. For the xor example, it is 1 for true and 0 for false, and for the characters, the characters are numbered 1 through 52.
+
+Next, we calculate the partial derivative of the individual data value's error with respect to each weight:
+
+![](img/PartialDeriv.PNG)
+
+We then accumulate all of the error values from each input's mlp run. Using this accumulated value, combined with the partial derivative mentioned above, we come out with a delta that gets added to each weight:
+
+![](img/Delta.PNG)
+
+In my implementation, I run the mlp on each piece of input data, calculating the output, the error, and the partial derivatives for each weight. During this iteration over all the data, I accumulate the total error, which I then use to calculate the delta values that I add to the weights, modifying the weights for the next iteration over the data. If the total accumulated error is low enough, no more error propagation is needed, as the weights are sufficiently converged.
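+
+To make the update step concrete, here is a minimal sketch of how the accumulated error and a weight's partial derivative combine into the delta. It mirrors the `kernAddDelta` kernel in `mlp.cu`; the 1/5 scale factor is simply the constant used there, and the names are illustrative.
+
+```
+// One thread per weight: delta = -(accumulated error / 5) * partial derivative.
+__global__ void kernWeightUpdateSketch(int numWeights, float accumulatedError,
+                                       const float *partials, float *weights) {
+    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (i >= numWeights) {
+        return;
+    }
+    weights[i] += -(accumulatedError / 5.0f) * partials[i];
+}
+```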
+
+## Predicting
+In the predicting phase, there are no iterative loops of running the mlp. Instead, we assign predetermined weights, ideally weights that were found to be good predictors in training. We then run the mlp on the input data with these weights and simply compute the output, with no error propagation. This output becomes our prediction. In both the xor and character data, the expected values are all discrete integers, so when outputting the prediction, I round the mlp result to the nearest integer. In my network, I was unable to find converged weights for the character data, so the predictor almost always outputs all 0's, not close to the expected values ranging from 1 to 52. However, for the xor data, given the weights I found that converged, the predictor is accurate on all inputs.
+
+## Challenges
+I ran into multiple challenges working on this project. Initially, when calculating the error propagation, instead of using the total accumulated error over all inputs to find the delta values for the weights, I was finding an error value per input. This threw off my weight adjustments, so my weights were never converging. Once I caught this, I reorganized my code to output the partial derivatives and output value for each input, and only after accumulating all of the error did I calculate the delta.
+
+Another challenge I faced and was unable to fix was that once I switched to the character data, the weights never converged and all of the data points kept outputting the same value, even though they contained different data and had different expected values. There may have been an error in how I was reading in and passing along the larger data sets. Another possibility is that I did not let the weights train long enough to converge. Once the data grew to 10201 values per input across 52 inputs, the mlp loops ran significantly slower and I ran out of time to let them run long enough to potentially converge. I am somewhat skeptical that they would have converged, however, as it seemed incorrect that they would all output the same value on every iteration. Due to this challenge, I was unable to find good weights for the character data, so my prediction for that data set is almost meaningless, just using random weights on a potentially buggy input data set.
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
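+
+To summarize the flow described above, here is a compact, illustrative sketch of the training driver. It is not the exact code in main.cpp (which also handles file reading and the xor presets); the run-preset values are passed in as parameters here, the seeds are simplified, and the include path assumes the project's layout.
+
+```
+#include <character_recognition/mlp.h>
+
+// Illustrative training driver: retry random weights until the accumulated
+// error over all inputs drops below the accepted threshold.
+bool trainSketch(int numInputs, int inputSize, int numHidden,
+                 int maxAttempts, int maxIters, float acceptedErr,
+                 float *w1, float *w2, float **inputs, const float *expected,
+                 float **partials1, float **partials2) {
+    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
+        // New random guesses for both weight layers (seeded by the attempt index).
+        CharacterRecognition::fillRandomWeights(inputSize * numHidden, w1, (float)(attempt + 1));
+        CharacterRecognition::fillRandomWeights(numHidden, w2, (float)(attempt + 2));
+        for (int iter = 0; iter < maxIters; ++iter) {
+            float accumulatedError = 0.0f;
+            // Forward pass over every input, accumulating squared error.
+            for (int k = 0; k < numInputs; ++k) {
+                float out = CharacterRecognition::mlp(inputSize, numHidden, expected[k],
+                                                      w1, w2, inputs[k],
+                                                      partials1[k], partials2[k]);
+                accumulatedError += (out - expected[k]) * (out - expected[k]);
+            }
+            if (accumulatedError < acceptedErr) {
+                return true;  // converged: these weights are good enough
+            }
+            // Error propagation: nudge every weight by its delta.
+            for (int k = 0; k < numInputs; ++k) {
+                CharacterRecognition::updateWeights(inputSize * numHidden, accumulatedError,
+                                                    partials1[k], w1);
+                CharacterRecognition::updateWeights(numHidden, accumulatedError,
+                                                    partials2[k], w2);
+            }
+        }
+    }
+    return false;  // give up rather than loop forever
+}
+```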
diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt index 7446175..c5e28b0 100644 --- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt +++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt @@ -7,5 +7,5 @@ set(SOURCE_FILES cuda_add_library(character_recognition ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_75 ) diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu index 5a3ed7f..7e40401 100644 --- a/Project2-Character-Recognition/character_recognition/mlp.cu +++ b/Project2-Character-Recognition/character_recognition/mlp.cu @@ -2,6 +2,9 @@ #include #include "common.h" #include "mlp.h" +#include +#include +#include namespace CharacterRecognition { using Common::PerformanceTimer; @@ -10,18 +13,366 @@ namespace CharacterRecognition { static PerformanceTimer timer; return timer; } - - // TODO: __global__ - - /** - * Example of use case (follow how you did it in stream compaction) - */ - /*void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - */ - // TODO: implement required elements for MLP sections 1 and 2 here + int blockSize = 128; + dim3 threadsPerBlock(blockSize); + + __host__ __device__ unsigned int hash(unsigned int a) { + a = (a + 0x7ed55d16) + (a << 12); + a = (a ^ 0xc761c23c) ^ (a >> 19); + a = (a + 0x165667b1) + (a << 5); + a = (a + 0xd3a2646c) ^ (a << 9); + a = (a + 0xfd7046c5) + (a << 3); + a = (a ^ 0xb55a4f09) ^ (a >> 16); + return a; + } + + __global__ void kernFillRandom(int N, float *weights, float time) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + + thrust::default_random_engine rng(hash((int)(index * time))); + thrust::uniform_real_distribution unitDistrib(-12, 12); + + + weights[index] = (float)unitDistrib(rng); + } + + + void fillRandomWeights(int n, float *data, float seed) { + float *dev_weightsArray; + + cudaMalloc((void**)&dev_weightsArray, n * sizeof(float)); + checkCUDAError("cudaMalloc dev_weightsArray failed!"); + + int numThreads = n; + dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize); + + kernFillRandom<<>>(numThreads, dev_weightsArray, seed); + checkCUDAError("kernFillRandom failed!"); + + cudaMemcpy(data, dev_weightsArray, n * sizeof(float), cudaMemcpyDeviceToHost); + + cudaFree(dev_weightsArray); + } + + + __global__ void kernLayer1Mult(int numHidden, float *hiddenLayers, int inputSize, const float* input, const float *weights) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= numHidden) { + return; + } + float sum = 0; + for (int i = 0; i < inputSize; ++i) { + sum += input[i] * weights[index + numHidden * i]; + } + + hiddenLayers[index] = 1 / (1 + exp(-sum)); + } + + __global__ void kernLayer2Mult(int n, int numHiddenlayers, float *output, const float *hiddenLayers, const float *weights) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) { + return; + } + float sum = 0; + for (int i = 0; i < numHiddenlayers; ++i) { + sum += hiddenLayers[i] * weights[i]; + } + output[index] = 1 / (1 + exp(-sum)); + } + + __global__ void kernPartialErrorDeriv1(int n, + float expectedValue, float output, int inputSize, int numHidden, + const float *input, const float *hidden, const float *weights2, float *partials1) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index 
>= n) { + return; + } + + //float originalWeight = adjustedWeights[index]; // Do the memory acces first and let the following math hide latency + + int inputIndex = floorf(index / (numHidden)); + int hiddenIndex = index % numHidden; + + float inputValue = input[inputIndex]; + float hiddenValue = hidden[hiddenIndex]; + float hiddenWeight = weights2[hiddenIndex]; + + float partialErrorDeriv = -inputValue * (1 / (1 + exp(-hiddenValue))) * + (1 - (1 / (1 + exp(-hiddenValue)))) * (expectedValue - output) * + (1 / (1 + exp(-output))) * (1 - (1 / (1 + exp(-output)))) * + hiddenWeight; + + //float deltaWeight = (error / 10.0) * partialErrorDeriv; + + partials1[index] = partialErrorDeriv; + } + + __global__ void kernPartialErrorDeriv2(int n, + float expectedValue, float output, + const float *hidden, float *partials2) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) { + return; + } + + //float originalWeight = adjustedWeights[index]; + + float partialErrorDeriv = (-(expectedValue - output)) * (1 / (1 + exp(-output))) * (1 - (1 / (1 + exp(-output)))) * hidden[index]; + + //float deltaWeight = (error / 10.0) * partialErrorDeriv; + + partials2[index] = partialErrorDeriv; + + + + } + + + float mlp(int inputSize, int numHiddenLayers, float expectedValue, + const float *weights1, const float *weights2, + const float *idata, + float *partialDerivatives1, float *partialDerivatives2) { + // size of input is 2 for xor and 512 by 512 for characters + // hidden layer somewhere between 1 and size of input + // first number of weights is size of hidden layer * size of input + // second number of weights is size of hidden layer * size of output(1) + + int numWeights1 = inputSize * numHiddenLayers; + int numWeights2 = numHiddenLayers; + + + // Initialize buffers + float *dev_inputData; + float *dev_hidden; + float *dev_weights1; + float *dev_weights2; + float *dev_output; + + float *dev_partials1; + float *dev_partials2; + + float *host_output; + + + // Malloc for buffers + cudaMalloc((void**)&dev_inputData, inputSize * sizeof(float)); + checkCUDAError("cudaMalloc dev_inputData failed!"); + + cudaMalloc((void**)&dev_hidden, numHiddenLayers * sizeof(float)); + checkCUDAError("cudaMalloc dev_hidden failed!"); + + cudaMalloc((void**)&dev_weights1, numWeights1 * sizeof(float)); + checkCUDAError("cudaMalloc dev_weights1 failed!"); + + cudaMalloc((void**)&dev_weights2, numWeights2 * sizeof(float)); + checkCUDAError("cudaMalloc dev_weights2 failed!"); + + cudaMalloc((void**)&dev_partials1, numWeights1 * sizeof(float)); + checkCUDAError("cudaMalloc dev_partials1 failed!"); + + cudaMalloc((void**)&dev_partials2, numWeights2 * sizeof(float)); + checkCUDAError("cudaMalloc dev_partials2 failed!"); + + cudaMalloc((void**)&dev_output, sizeof(float)); + checkCUDAError("cudaMalloc dev_output failed!"); + + cudaMallocHost((void**)&host_output, sizeof(float)); + checkCUDAError("cudaMallocHost host_output failed!"); + + // Fille input and weights data + cudaMemcpy(dev_inputData, idata, inputSize * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(dev_weights1, weights1, numWeights1 * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(dev_weights2, weights2, numWeights2 * sizeof(float), cudaMemcpyHostToDevice); + + // Perform the multiplications for layer 1 to get the hidden layers + int numThreads = numHiddenLayers; + dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize); + + kernLayer1Mult<<>>(numHiddenLayers, dev_hidden, inputSize, dev_inputData, dev_weights1); + + // perform the 
multiplications for layer 2 to get the output value + int layer2_numThreads = 1; + dim3 layer2_blocksPerGrid((layer2_numThreads + blockSize - 1) / blockSize); + + kernLayer2Mult<<>>(1, numHiddenLayers, dev_output, dev_hidden, dev_weights2); + + // Copy the output onto the host + cudaMemcpy(host_output, dev_output, sizeof(float), cudaMemcpyDeviceToHost); + float output = host_output[0]; + + // Find the error from the output + //float error = (output - expectedValue) * (output - expectedValue); + //std::cout << "error " << error << std::endl; + + // Adjust the weights of the layer 1 weights + int weight1Adjust_numThreads = numWeights1; + dim3 weight1Adjust_blocksPerGrid((weight1Adjust_numThreads + blockSize - 1) / blockSize); + + kernPartialErrorDeriv1<<>>(numWeights1, expectedValue, + output, inputSize, + numHiddenLayers, dev_inputData, dev_hidden, + dev_weights2, dev_partials1); + + // Copy the weights into the input array + cudaMemcpy(partialDerivatives1, dev_partials1, numWeights1 * sizeof(float), cudaMemcpyDeviceToHost); + + + // Adjust the weights of the layer 2 weights + int weight2Adjust_numThreads = numWeights2; + dim3 weight2Adjust_blocksPerGrid((weight2Adjust_numThreads + blockSize - 1) / blockSize); + + kernPartialErrorDeriv2<<>>(numWeights2, + expectedValue, output, dev_hidden, dev_partials2); + + cudaMemcpy(partialDerivatives2, dev_partials2, numWeights2 * sizeof(float), cudaMemcpyDeviceToHost); + + //for (int i = 0; i < numWeights1; ++i) { + // //std::cout << "adjusted weight: " << adjustedWeights1[i] << std::endl; + //} + + + // Free buffer memory + cudaFree(dev_inputData); + cudaFree(dev_hidden); + cudaFree(dev_weights1); + cudaFree(dev_weights2); + cudaFree(dev_partials1); + cudaFree(dev_partials2); + cudaFree(dev_output); + cudaFreeHost(host_output); + + + return output; + + } + + + __global__ void kernAddDelta(int n, float accumulatedError, const float *partials, + float *weights) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) { + return; + } + + float delta = -(accumulatedError / 5.0) * partials[index]; + weights[index] += delta; + } + + void updateWeights(int numWeights, float accumulatedError, const float *partials, float *weights) { + float *dev_partials; + float *dev_weights; + + cudaMalloc((void**)&dev_partials, numWeights * sizeof(float)); + checkCUDAError("cudaMalloc dev_partials failed!"); + + cudaMalloc((void**)&dev_weights, numWeights * sizeof(float)); + checkCUDAError("cudaMalloc dev_weights failed!"); + + cudaMemcpy(dev_partials, partials, numWeights * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(dev_weights, weights, numWeights * sizeof(float), cudaMemcpyHostToDevice); + + + int numThreads = numWeights; + dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize); + + kernAddDelta<<>>(numWeights, accumulatedError, dev_partials, dev_weights); + + cudaMemcpy(weights, dev_weights, numWeights * sizeof(float), cudaMemcpyDeviceToHost); + + cudaFree(dev_weights); + cudaFree(dev_partials); + + + } + + + + + float mlpNoError(int inputSize, int numHiddenLayers, float expectedValue, + const float *weights1, const float *weights2, + const float *idata) { + // size of input is 2 for xor and 512 by 512 for characters + // hidden layer somewhere between 1 and size of input + // first number of weights is size of hidden layer * size of input + // second number of weights is size of hidden layer * size of output(1) + + int numWeights1 = inputSize * numHiddenLayers; + int numWeights2 = numHiddenLayers; + + + // Initialize buffers + float 
*dev_inputData; + float *dev_hidden; + float *dev_weights1; + float *dev_weights2; + float *dev_output; + + float *dev_partials1; + float *dev_partials2; + + float *host_output; + + // Malloc for buffers + cudaMalloc((void**)&dev_inputData, inputSize * sizeof(float)); + checkCUDAError("cudaMalloc dev_inputData failed!"); + + cudaMalloc((void**)&dev_hidden, numHiddenLayers * sizeof(float)); + checkCUDAError("cudaMalloc dev_hidden failed!"); + + cudaMalloc((void**)&dev_weights1, numWeights1 * sizeof(float)); + checkCUDAError("cudaMalloc dev_weights1 failed!"); + + cudaMalloc((void**)&dev_weights2, numWeights2 * sizeof(float)); + checkCUDAError("cudaMalloc dev_weights2 failed!"); + + cudaMalloc((void**)&dev_output, sizeof(float)); + checkCUDAError("cudaMalloc dev_output failed!"); + + cudaMallocHost((void**)&host_output, sizeof(float)); + checkCUDAError("cudaMallocHost host_output failed!"); + + // Fille input and weights data + cudaMemcpy(dev_inputData, idata, inputSize * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(dev_weights1, weights1, numWeights1 * sizeof(float), cudaMemcpyHostToDevice); + cudaMemcpy(dev_weights2, weights2, numWeights2 * sizeof(float), cudaMemcpyHostToDevice); + + // Perform the multiplications for layer 1 to get the hidden layers + int numThreads = numHiddenLayers; + dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize); + + kernLayer1Mult << > > (numHiddenLayers, dev_hidden, inputSize, dev_inputData, dev_weights1); + + // perform the multiplications for layer 2 to get the output value + int layer2_numThreads = 1; + dim3 layer2_blocksPerGrid((layer2_numThreads + blockSize - 1) / blockSize); + + kernLayer2Mult << > > (1, numHiddenLayers, dev_output, dev_hidden, dev_weights2); + + // Copy the output onto the host + cudaMemcpy(host_output, dev_output, sizeof(float), cudaMemcpyDeviceToHost); + float output = host_output[0]; + + + + // Free buffer memory + cudaFree(dev_inputData); + cudaFree(dev_hidden); + cudaFree(dev_weights1); + cudaFree(dev_weights2); + cudaFree(dev_output); + cudaFreeHost(host_output); + + + return output; + + } + + } diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h index 2096228..10f1d8f 100644 --- a/Project2-Character-Recognition/character_recognition/mlp.h +++ b/Project2-Character-Recognition/character_recognition/mlp.h @@ -6,4 +6,17 @@ namespace CharacterRecognition { Common::PerformanceTimer& timer(); // TODO: implement required elements for MLP sections 1 and 2 here + void fillRandomWeights(int n, float *data, float seed); + + float mlp(int inputSize, int numHiddenLayers, float expectedValue, + const float *weights1, const float *weights2, + const float *idata, float *partialDerivatives1, float *partialDerivatives2); + + float mlpNoError(int inputSize, int numHiddenLayers, float expectedValue, + const float *weights1, const float *weights2, + const float *idata); + + void updateWeights(int numWeights, float accumulatedError, const float *partials, + float *weights); + } diff --git a/Project2-Character-Recognition/img/Delta.PNG b/Project2-Character-Recognition/img/Delta.PNG new file mode 100644 index 0000000..0bed70a Binary files /dev/null and b/Project2-Character-Recognition/img/Delta.PNG differ diff --git a/Project2-Character-Recognition/img/PartialDeriv.PNG b/Project2-Character-Recognition/img/PartialDeriv.PNG new file mode 100644 index 0000000..bc50042 Binary files /dev/null and b/Project2-Character-Recognition/img/PartialDeriv.PNG differ diff 
--git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp index 11dd534..2dd498f 100644 --- a/Project2-Character-Recognition/src/main.cpp +++ b/Project2-Character-Recognition/src/main.cpp @@ -10,143 +10,268 @@ #include #include #include "testing_helpers.hpp" +#include +#include +#include +#include +#include +#include +#include +#include + + + +// Define run presets +#define training 1 // If set to 1, indicates we are in training mode +#define acceptedError 0.001 +#define numRandomWeightAttempts 10 +#define numConvergeIterations 1000 +#define useXor 1 -const int SIZE = 1 << 8; // feel free to change the size of array -const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; int main(int argc, char* argv[]) { - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes. - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, 
c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. - zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + // Setup input data depending on xor or character data + int inputSize = 10201; + int numInputs = 52; + if (useXor) { + inputSize = 2; + numInputs = 4; + } + + // Fill in all data: + std::vector inputData = std::vector(); + std::vector expectedData = std::vector(); + + if (useXor) { + float *data1 = new float[2]; + float *data2 = new float[2]; + float *data3 = new float[2]; + float *data4 = new float[2]; + + data1[0] = 0; + data1[1] = 0; + data2[0] = 0; + data3[1] = 0; + data2[1] = 1; + data3[0] = 1; + data4[0] = 1; + data4[1] = 1; + + inputData.push_back(data1); + inputData.push_back(data2); + inputData.push_back(data3); + inputData.push_back(data4); + expectedData.push_back(0); + expectedData.push_back(1); + expectedData.push_back(1); + expectedData.push_back(0); + } + + else { + // Read in character training data: + for (int i = 1; i <= numInputs; ++i) { + std::string filename = std::to_string(i) + 
"info.txt"; + if (i < 10) { + filename = "0" + filename; + } + filename = "../data-set/" + filename; + + std::ifstream inputFile; + inputFile.open(filename); + if (inputFile.is_open()) { + std::string firstLine; + getline(inputFile, firstLine); + int expectedVal = std::stoi(firstLine); + + std::string secondLine; + getline(inputFile, secondLine); + int inputLength = std::stoi(secondLine); + + float *currData = new float[inputLength]; + int counter = 0; + + std::string dataLine; + getline(inputFile, dataLine); + + std::stringstream stream(dataLine); + while (1) { + int n; + stream >> n; + if (!stream) { + break; + } + currData[counter] = n; + counter++; + } + inputData.push_back(currData); + expectedData.push_back(expectedVal); + } + inputFile.close(); + } + } + + + if (training) { + // Setup weights arrays: + int numHiddenLayers = ceil((inputSize + 1) / 2.0); + + int layer1_numWeights = inputSize * numHiddenLayers; + int layer2_numWeights = numHiddenLayers; + + float *layer1_weights = new float[layer1_numWeights]; + float *layer2_weights = new float[layer2_numWeights]; + + + if (useXor) { + layer1_weights[0] = 10.1; + layer1_weights[1] = 0.9; + layer1_weights[2] = 20; + layer1_weights[3] = 0.87; + + layer2_weights[0] = 41; + layer2_weights[1] = -54; + } + + std::vector partials1 = std::vector(); + std::vector partials2 = std::vector(); + + for (int i = 0; i < numInputs; ++i) { + partials1.push_back(new float[layer1_numWeights]); + partials2.push_back(new float[layer2_numWeights]); + } + + // Begin loop iteration + auto start = std::chrono::steady_clock::now(); + int numRandIters = 0; + float accumulatedError = 3.0; // Larger than accepted error + bool done = false; + int numRandAttempts = useXor ? 1 : numRandomWeightAttempts; + while (!done && numRandIters < numRandAttempts) { + // Fill new random weights (if xor, use preset weights) + if (!useXor) { + auto end1 = std::chrono::steady_clock::now(); + CharacterRecognition::fillRandomWeights(layer1_numWeights, layer1_weights, std::chrono::duration_cast(end1 - start).count()); + auto end2 = std::chrono::steady_clock::now(); + CharacterRecognition::fillRandomWeights(layer2_numWeights, layer2_weights, std::chrono::duration_cast(end2 - start).count()); + std::cout << "NEW WEIGHTS" << std::endl; + } + + int numInnerIters = 0.0; + // Try refining weights iteratively + while (!done && numInnerIters < numConvergeIterations) { + accumulatedError = 0.0; + bool resultAll1 = true; + + // Run each input through mlp, accumulating error + for (int k = 0; k < numInputs; ++k) { + float currExpected = expectedData.at(k); + float output = CharacterRecognition::mlp(inputSize, numHiddenLayers, currExpected, + layer1_weights, layer2_weights, inputData.at(k), partials1.at(k), partials2.at(k)); + if (output != 1) { + resultAll1 = false; + } + + float currError = (output - currExpected) * (output - currExpected); + std::cout << "expected output: " << currExpected << " Result: " << output << std::endl; + accumulatedError += currError; + } + if (resultAll1) { + break; + } + accumulatedError /= 2.0; + std::cout << "Accumulated error: " << accumulatedError << std::endl; + if (accumulatedError < acceptedError) { + done = true; + } + // Adjust weights based on accumulated error + if (!done) { + for (int k = 0; k < numInputs; ++k) { + float* partialValues1 = partials1.at(k); + float* partialValues2 = partials2.at(k); + + CharacterRecognition::updateWeights(layer1_numWeights, + accumulatedError, partialValues1, layer1_weights); + + 
CharacterRecognition::updateWeights(layer2_numWeights, + accumulatedError, partialValues2, layer2_weights); + } + } + if (done) { + std::cout << "DONE" << std::endl; + } + numInnerIters++; + } + numRandIters++; + } + // Print out final weights and error after either converging to good weights or failing to converge + std::cout << "FINAL ERROR: " << accumulatedError << std::endl; + std::cout << "WEIGHTS:" << std::endl; + for (int i = 0; i < layer1_numWeights; ++i) { + std::cout << "layer 1 weight " << i << ": " << layer1_weights[i] << std::endl; + } + for (int i = 0; i < layer2_numWeights; ++i) { + std::cout << "layer 2 weight " << i << ": " << layer2_weights[i] << std::endl; + } + + // Delete data arrays stored in map + for (float* i : inputData) { + delete[] i; + } + for (float* i : partials1) { + delete[] i; + } + for (float* i : partials2) { + delete[] i; + } + delete[] layer1_weights; + delete[] layer2_weights; + } + else { + // Setup weights arrays: + int numHiddenLayers = ceil((inputSize + 1) / 2.0); + + int layer1_numWeights = inputSize * numHiddenLayers; + int layer2_numWeights = numHiddenLayers; + + float *layer1_weights = new float[layer1_numWeights]; + float *layer2_weights = new float[layer2_numWeights]; + + + if (useXor) { + layer1_weights[0] = 10.1302; + layer1_weights[1] = 0.854664; + layer1_weights[2] = 20.0219; + layer1_weights[3] = 0.837066; + + layer2_weights[0] = 41.009; + layer2_weights[1] = -53.9937; + } + + auto start = std::chrono::steady_clock::now(); + // if not xor, use random weights because never found good weights + if (!useXor) { + auto end1 = std::chrono::steady_clock::now(); + CharacterRecognition::fillRandomWeights(layer1_numWeights, layer1_weights, std::chrono::duration_cast(end1 - start).count()); + auto end2 = std::chrono::steady_clock::now(); + CharacterRecognition::fillRandomWeights(layer2_numWeights, layer2_weights, std::chrono::duration_cast(end2 - start).count()); + std::cout << "NEW WEIGHTS" << std::endl; + } + + // Run input data through mlp to get output with given weights + for (int k = 0; k < numInputs; ++k) { + float currExpected = expectedData.at(k); + float output = CharacterRecognition::mlpNoError(inputSize, numHiddenLayers, currExpected, + layer1_weights, layer2_weights, inputData.at(k)); + // Print out results, rounding output to integer to match input type + std::cout << "expected output: " << currExpected << " Result: " << round(output) << std::endl; + } + + // Delete data arrays stored in map + for (float* i : inputData) { + delete[] i; + } + delete[] layer1_weights; + delete[] layer2_weights; + } + } diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md index 0e38ddb..bce5c20 100644 --- a/Project2-Stream-Compaction/README.md +++ b/Project2-Stream-Compaction/README.md @@ -3,12 +3,182 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Grace Gilbert + * gracelgilbert.com +* Tested on: Windows 10, i9-9900K @ 3.60GHz 64GB, GeForce RTX 2080 40860MB -### (TODO: Your README) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Overview +In this project, I implemented the stream compaction algorithm on the GPU in CUDA. 
Stream compaction is an algorithm that, given an array of values marked to remove or keep, removes the unwanted values and returns a new, shorter array containing only the values to keep. Below is a diagram representing the stream compaction algorithm:
+
+![](img/StreamCompaction.PNG)
+
+This algorithm has many practical applications, including path tracing, as it lets us mark certain elements as unwanted and remove them. While there is a simple way to perform this algorithm using loops on the CPU, it can also be parallelized to run more efficiently on the GPU.
+
+An important step in the stream compaction algorithm is the scan algorithm. A scan walks through an array and computes a running sum of its elements. An exclusive scan excludes the current index from the accumulated sum, whereas an inclusive scan includes it. Stream compaction uses an exclusive scan. Below is a diagram representing the scan algorithm:
+
+![](img/exclusiveScan.PNG)
+
+I implemented four versions of the above algorithms: scan and stream compaction on the CPU, a naive version of scan on the GPU, a work-efficient version of both on the GPU, and a version using Thrust's scan implementation.
+
+## CPU
+#### Scan
+The scan algorithm on the CPU is a simple loop over the data. For an exclusive scan, we set the first value of the output to 0, as no sum has been accumulated yet. Then, from index 1 through arrayLength - 1, we set the output at that index to the sum of the output at the previous index and the input array at the previous index:
+
+```
+outputData[0] = 0;
+for (int k = 1; k < n; ++k) {
+    outputData[k] = outputData[k - 1] + inputData[k - 1];
+}
+```
+
+#### Stream Compaction without Scan
+The basic stream compaction algorithm tracks a counter of how many elements to keep we have seen so far. When we see an element to keep, we set the output at the counter's index to that element's value, then increment the counter.
+```
+int counter = 0;
+for (int k = 0; k < n; ++k) {
+    int currentValue = inputData[k];
+    if (currentValue != 0) {
+        outputData[counter] = currentValue;
+        counter++;
+    }
+}
+```
+
+#### Stream Compaction with Scan and Scatter
+In this version, I start by creating a temporary array that contains a 1 wherever the input array had a nonzero value, and a 0 where the input array had a zero value:
+```
+int *tempArray = new int[n];
+for (int k = 0; k < n; ++k) {
+    tempArray[k] = (inputData[k] != 0) ? 1 : 0;
+}
+```
+I then call scan on that 0/1 array using the CPU implementation from above. Next, I iterate through all the indices in the input array, and if the value should be included, I read the scan result at that index and write the input value to that position in the output array.
+```
+int *scanResult = new int[n];
+scan(n, scanResult, tempArray);
+for (int k = 0; k < n; ++k) {
+    if (tempArray[k]) {
+        int index = scanResult[k];
+        outputData[index] = inputData[k];
+    }
+}
+```
+
+## GPU
+In order to parallelize scan on the GPU, we add pairs of elements of the array in parallel. We do this in multiple steps, eventually filling out the entire scanned array. The way in which we group the pairs of elements to sum at each step affects the efficiency.
+
+These groupings involve repeatedly splitting the array in half, and they only work when the array length is a power of 2. Therefore, we pad the input array with 0's, which do not affect the scan result, until it is as large as the next power of 2, as sketched below.
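+
+A minimal sketch of this padding step is shown here (illustrative; `ilog2ceil` is the log2-ceiling helper the project already uses, and the buffer names are made up):
+
+```
+// Round the length up to the next power of two and zero-pad the device buffer
+// so the tree-based scan steps line up.
+int bufferLength = 1 << ilog2ceil(n);
+int *dev_data;
+cudaMalloc((void**)&dev_data, bufferLength * sizeof(int));
+cudaMemset(dev_data, 0, bufferLength * sizeof(int));
+cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+```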
+
+### Naive
+#### Scan
+The first implementation of a GPU scan is the naive one. This implementation starts by adding all adjacent pairs of elements. Then, in the next step, it adds elements one index farther apart from each other. The distance between the elements added together in parallel doubles at each step, until eventually the first half of the array is added to the second half. The following diagram outlines the naive implementation:
+
+![](img/naiveParallelScan.PNG)
+
+This process has O(log n) steps, with O(n) adds run in parallel per step, making the total number of additions O(n log n).
+
+In this version, the output is an inclusive scan, so I had to shift all of the values right and insert a 0 at the first index. I performed this shift operation in a kernel: if the index is 0, it writes a 0 into the shifted array, and otherwise it shifts the value right. A sketch of one naive step and this shift is shown below.
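+
+The following is a rough CUDA sketch of one naive scan step and the final right shift. It is illustrative only: the project's actual kernels ping-pong between two buffers, with the offset doubling from 1 up to n/2 across steps, and the exact names differ.
+
+```
+// One naive step: every element at index >= offset adds in the element
+// `offset` positions to its left; other elements are copied through.
+__global__ void kernNaiveScanStep(int n, int offset, const int *idata, int *odata) {
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k >= n) {
+        return;
+    }
+    odata[k] = (k >= offset) ? idata[k - offset] + idata[k] : idata[k];
+}
+
+// Convert the inclusive result to an exclusive scan: shift right, insert 0.
+__global__ void kernShiftRight(int n, const int *idata, int *odata) {
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k >= n) {
+        return;
+    }
+    odata[k] = (k == 0) ? 0 : idata[k - 1];
+}
+```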
+
+### Work Efficient
+#### Scan
+In a more efficient GPU implementation, we utilize a binary tree structure to further optimize the scan algorithm. In this version there are two stages, the up-sweep and the down-sweep.
+
+In the up-sweep, we first sum adjacent pairs, creating n/2 sums. We then sum adjacent pairs of the n/2 sums, creating n/4 sums. We continue this until there is one final sum, as pictured below:
+
+![](img/efficientScanUpSweep.PNG)
+
+Next comes the down-sweep. Because we can treat the array as a tree, I will refer to the left and right children of indices. The root of the tree is the final index, initially holding the final value from the up-sweep. We start by replacing this value with 0. Then, at each node, we save the left child's value, copy the node's current value into the left child's spot, and store the sum of the saved value and the node's current value in the right child. We then continue this at every level of the tree, running all the nodes at a level in parallel, until we've filled out the n leaves. This algorithm is pictured below:
+
+![](img/efficientScanDownSweep.PNG)
+
+The up-sweep process has O(n) adds. The down-sweep process has O(n) adds and O(n) swaps. This makes the total work O(n), asymptotically better than the O(n log n) additions of the naive scan.
+
+### Thrust
+#### Scan
+For this implementation, I simply cast the input and output array buffers to thrust device pointers and then run Thrust's exclusive scan on the buffers.
+```
+thrust::exclusive_scan(dev_thrust_inputArray, dev_thrust_inputArray + bufferLength, dev_thrust_outputArray);
+```
+
+## Performance Analysis and Questions
+I ran each of the algorithms with a block size of 128. For the block size to be optimal, it should be a power of two: most of the thread counts we launch are multiples or powers of two, so a power-of-two block size helps keep the blocks filled to capacity, which is most efficient.
+
+I did performance analyses comparing the different implementations of scan and stream compaction on both power-of-two and non-power-of-two array sizes. The following charts show this analysis:
+
+### Scan
+#### Power of 2 size
+##### Full data
+![](img/ScanPerformanceChart.PNG)
+##### Close up on smaller sizes
+![](img/ScanPerformanceChartSmallSizes.PNG)
+#### Non-power of 2 size
+##### Full data
+![](img/ScanPerformanceChartNonPower2.PNG)
+##### Close up on smaller sizes
+![](img/ScanPerformanceChartNonPower2SmallSizes.PNG)
+
+### Stream Compaction
+#### Power of 2 size
+##### Full data
+![](img/StreamCompactionPerformanceChart.PNG)
+##### Close up on smaller sizes
+![](img/StreamCompactionPerformanceChartSmallSizes.PNG)
+#### Non-power of 2 size
+##### Full data
+![](img/StreamCompactionPerformanceChartNonPower2.PNG)
+##### Close up on smaller sizes
+![](img/StreamCompactionPerformanceChartNonPower2SmallSizes.PNG)
+
+The performance data indicate that at smaller sizes, the CPU implementations run faster than either of the GPU implementations. For scan, the work-efficient version is consistently faster than the naive version, even at smaller array sizes. However, once the array size becomes large enough, starting at roughly 2^20, the GPU versions begin to outpace the CPU versions, and the work-efficient implementation beats the naive implementation by a wider margin.
+
+The reason for this may be that the GPU implementations, even though buffer setup and final copying time were not counted in the performance analysis, still involve more overhead than the CPU implementation. Only once the array becomes large enough do the performance benefits outweigh the GPU overhead. This suggests that the GPU's performance bottleneck is memory I/O, as that is where the overhead comes from. Because the GPU versions eventually beat the CPU implementations, we know that the memory I/O overhead does not grow as quickly as the computation time of the CPU loops as the array becomes longer.
+
+Additionally, most of the implementations do not differ drastically between the power-of-two and non-power-of-two array lengths, except for the Thrust implementation, which becomes less efficient when the array length is not a power of two.
+
+### Optimizations (extra credit)
+#### Naive thread count
+In the naive algorithm, within the parallel portion of the algorithm, there is a check to see if the index is at least a certain value:
+
+![](img/NaiveScanPseudoCode.PNG)
+
+I realized that every thread whose index was too small was simply returning without doing any work. To optimize this, I calculated exactly how many threads would actually end up doing work:
+```
+for (int d = 1; d <= ilog2ceil(n); ++d) {
+    int numThreads = bufferLength - pow(2, d - 1);
+}
+```
+I only launched this many threads. However, the indices were then incorrect, since thread indices start at 0 but the threads that do work correspond to the larger indices. To fix this, I offset all of the thread indices by the cutoff value, enabling proper indexing without launching any redundant threads, as in the sketch below.
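+
+A small illustrative sketch of this offset indexing (not the project's exact kernel):
+
+```
+// Launch only (bufferLength - offset) threads; each thread remaps its index
+// into the range that actually does work. Elements below `offset` are simply
+// carried over from the previous buffer.
+__global__ void kernNaiveScanStepCompact(int numThreads, int offset,
+                                         const int *idata, int *odata) {
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= numThreads) {
+        return;
+    }
+    int k = index + offset;
+    odata[k] = idata[k - offset] + idata[k];
+}
+```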
+
+This optimization improved the runtime of the scan, as shown in the data below:
+
+Prior to thread count optimization:
+
+![](img/naiveScanWithNaiveThreadCount.PNG)
+
+With thread count optimization:
+
+![](img/naiveScanWithBetterThreadcount.PNG)
+
+Note that this data shows a relative improvement, not the absolute performance of the algorithm as shown in the charts above. I had not yet started documenting performance properly, so this was not run in release mode and the timers included the final memory operations. However, both runs were performed under the same conditions, scanning arrays of size 2^7 and 2^7 - 3.
+
+#### Work Efficient Implementation Optimizations
+When I first implemented the work-efficient scan, it was slower than the naive implementation, so I made some changes to optimize it. As in the naive implementation, I made sure to launch only as many threads as needed, and this number gets cut in half each iteration for both the up-sweep and the down-sweep:
+```
+for (int d = 0; d < ilog2ceil(n); ++d) {
+    int power = pow(2, d);
+    int numThreads = bufferLength / (2 * power);
+}
+```
+To then ensure proper indexing within the kernel, I multiplied the index by 2 * power.
+
+I also noticed that the thread operations used 2^d frequently. At first, I was calculating pow(2, d) within the kernel, but realized this was inefficient, as it is not a trivial operation. Instead, I calculated the power before launching the threads and passed in the value. For the cases where 2^(d + 1) was needed instead, I simply multiplied the passed-in power by 2.
+
+The final optimization I made is in the part of the algorithm where we have to set the hypothetical root of the tree to 0 before beginning the down-sweep. Initially, I copied the device buffer data to the CPU, indexed into the root position and changed it to 0, then copied it back into the device buffer. I realized that this copying between host and device was inefficient. Instead, to keep all the data on the GPU, I launched a kernel with a single thread and an offset value. Since the kernel's only thread index is 0, it writes to index 0 + offset; I pass the root's index as the offset and set that element to 0 in the kernel. This saved two copies of the data between host and device.
+
+Overall, these changes fairly dramatically improved the runtime of the scan operation. The improvement can be seen in the following two images, the first showing the runtime prior to the optimizations, the second showing the runtime afterwards. As mentioned above, these are relative improvements, as I did not perform these tests in release mode, and the timers include final data copying.
Both of these are run on arrays length 2^7 and 2^7 - 3: + +![](img/workEfficientScanLessEfficient.PNG) + +![](img/workEfficientScanMoreEfficient.PNG) diff --git a/Project2-Stream-Compaction/img/NaiveScanPseudoCode.PNG b/Project2-Stream-Compaction/img/NaiveScanPseudoCode.PNG new file mode 100644 index 0000000..e286155 Binary files /dev/null and b/Project2-Stream-Compaction/img/NaiveScanPseudoCode.PNG differ diff --git a/Project2-Stream-Compaction/img/ScanPerformanceChart.PNG b/Project2-Stream-Compaction/img/ScanPerformanceChart.PNG new file mode 100644 index 0000000..c0721b6 Binary files /dev/null and b/Project2-Stream-Compaction/img/ScanPerformanceChart.PNG differ diff --git a/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2.PNG b/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2.PNG new file mode 100644 index 0000000..d5f85b7 Binary files /dev/null and b/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2.PNG differ diff --git a/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2SmallSizes.PNG b/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2SmallSizes.PNG new file mode 100644 index 0000000..073383d Binary files /dev/null and b/Project2-Stream-Compaction/img/ScanPerformanceChartNonPower2SmallSizes.PNG differ diff --git a/Project2-Stream-Compaction/img/ScanPerformanceChartSmallSizes.PNG b/Project2-Stream-Compaction/img/ScanPerformanceChartSmallSizes.PNG new file mode 100644 index 0000000..7bf9170 Binary files /dev/null and b/Project2-Stream-Compaction/img/ScanPerformanceChartSmallSizes.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompaction.PNG b/Project2-Stream-Compaction/img/StreamCompaction.PNG new file mode 100644 index 0000000..4e25110 Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompaction.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompactionPerformanceChart.PNG b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChart.PNG new file mode 100644 index 0000000..05c7db0 Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChart.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2.PNG b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2.PNG new file mode 100644 index 0000000..075d4ce Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2SmallSizes.PNG b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2SmallSizes.PNG new file mode 100644 index 0000000..faf3af6 Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartNonPower2SmallSizes.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartSmallSizes.PNG b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartSmallSizes.PNG new file mode 100644 index 0000000..ce1f154 Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompactionPerformanceChartSmallSizes.PNG differ diff --git a/Project2-Stream-Compaction/img/StreamCompactionWithScan.PNG b/Project2-Stream-Compaction/img/StreamCompactionWithScan.PNG new file mode 100644 index 0000000..6da2f07 Binary files /dev/null and b/Project2-Stream-Compaction/img/StreamCompactionWithScan.PNG differ diff --git a/Project2-Stream-Compaction/img/efficientScanDownSweep.PNG b/Project2-Stream-Compaction/img/efficientScanDownSweep.PNG new file 
mode 100644 index 0000000..4c12fbd Binary files /dev/null and b/Project2-Stream-Compaction/img/efficientScanDownSweep.PNG differ diff --git a/Project2-Stream-Compaction/img/efficientScanUpSweep.PNG b/Project2-Stream-Compaction/img/efficientScanUpSweep.PNG new file mode 100644 index 0000000..03abef9 Binary files /dev/null and b/Project2-Stream-Compaction/img/efficientScanUpSweep.PNG differ diff --git a/Project2-Stream-Compaction/img/exclusiveScan.PNG b/Project2-Stream-Compaction/img/exclusiveScan.PNG new file mode 100644 index 0000000..eb84b1d Binary files /dev/null and b/Project2-Stream-Compaction/img/exclusiveScan.PNG differ diff --git a/Project2-Stream-Compaction/img/naiveParallelScan.PNG b/Project2-Stream-Compaction/img/naiveParallelScan.PNG new file mode 100644 index 0000000..2460e49 Binary files /dev/null and b/Project2-Stream-Compaction/img/naiveParallelScan.PNG differ diff --git a/Project2-Stream-Compaction/img/naiveScanWithBetterThreadcount.PNG b/Project2-Stream-Compaction/img/naiveScanWithBetterThreadcount.PNG new file mode 100644 index 0000000..c5c69c5 Binary files /dev/null and b/Project2-Stream-Compaction/img/naiveScanWithBetterThreadcount.PNG differ diff --git a/Project2-Stream-Compaction/img/naiveScanWithNaiveThreadCount.PNG b/Project2-Stream-Compaction/img/naiveScanWithNaiveThreadCount.PNG new file mode 100644 index 0000000..4de18a3 Binary files /dev/null and b/Project2-Stream-Compaction/img/naiveScanWithNaiveThreadCount.PNG differ diff --git a/Project2-Stream-Compaction/img/workEfficientScanLessEfficient.PNG b/Project2-Stream-Compaction/img/workEfficientScanLessEfficient.PNG new file mode 100644 index 0000000..9374438 Binary files /dev/null and b/Project2-Stream-Compaction/img/workEfficientScanLessEfficient.PNG differ diff --git a/Project2-Stream-Compaction/img/workEfficientScanMoreEfficient.PNG b/Project2-Stream-Compaction/img/workEfficientScanMoreEfficient.PNG new file mode 100644 index 0000000..affa212 Binary files /dev/null and b/Project2-Stream-Compaction/img/workEfficientScanMoreEfficient.PNG differ diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp index d016553..8e86b7a 100644 --- a/Project2-Stream-Compaction/src/main.cpp +++ b/Project2-Stream-Compaction/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 3; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -29,6 +29,9 @@ int main(int argc, char* argv[]) { genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; + for (int i = 0; i < SIZE; ++i) { + a[i] = i; + } printArray(SIZE, a, true); // initialize b using StreamCompaction::CPU::scan you implement @@ -51,7 +54,7 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -71,28 +74,28 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, 
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -137,14 +140,14 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit
diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
index cdbef77..185a604 100644
--- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
+++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES
 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_75
     )
diff --git a/Project2-Stream-Compaction/stream_compaction/common.h b/Project2-Stream-Compaction/stream_compaction/common.h
index 996997e..4085b6a 100644
--- a/Project2-Stream-Compaction/stream_compaction/common.h
+++ b/Project2-Stream-Compaction/stream_compaction/common.h
@@ -108,6 +108,14 @@ namespace StreamCompaction {
             return prev_elapsed_time_gpu_milliseconds;
         }
 
+        bool getCpuTimerStarted() {
+            return cpu_timer_started;
+        }
+
+        bool getGpuTimerStarted() {
+            return gpu_timer_started;
+        }
+
         // remove copy and move functions
         PerformanceTimer(const PerformanceTimer&) = delete;
         PerformanceTimer(PerformanceTimer&&) = delete;
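
The `getCpuTimerStarted()` / `getGpuTimerStarted()` accessors added to `PerformanceTimer` above let a scan that is invoked from inside another timed routine (for example `compactWithScan` calling `scan` below) avoid restarting a timer that is already running. A self-contained sketch of that guard pattern, with a toy timer standing in for `PerformanceTimer` (the names here are illustrative stand-ins, not the project's actual classes):

```
#include <cstdio>

// Toy stand-in for a performance timer, used only to show the
// "start/stop the timer only if this call owns it" guard.
struct ToyTimer {
    bool cpuTimerStarted = false;
    void startCpuTimer() { cpuTimerStarted = true;  /* record start time here */ }
    void endCpuTimer()   { cpuTimerStarted = false; /* record elapsed time here */ }
    bool getCpuTimerStarted() const { return cpuTimerStarted; }
};

ToyTimer timer;

void scanSketch() {
    bool newTimer = !timer.getCpuTimerStarted(); // false when an outer routine is already timing
    if (newTimer) timer.startCpuTimer();
    // ... scan body would go here ...
    if (newTimer) timer.endCpuTimer();           // only the outermost caller stops the clock
}

void compactWithScanSketch() {
    timer.startCpuTimer();  // outer call owns the clock
    scanSketch();           // nested call detects the running timer and leaves it alone
    timer.endCpuTimer();
    std::printf("compact timed once; the nested scan did not reset the timer\n");
}

int main() { compactWithScanSketch(); return 0; }
```
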
diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu
index a2d3e6c..fd33591 100644
--- a/Project2-Stream-Compaction/stream_compaction/cpu.cu
+++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu
@@ -18,9 +18,23 @@ namespace StreamCompaction {
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
+            bool newTimer = true;
+            if (timer().getCpuTimerStarted()) {
+                newTimer = false;
+            }
+            if (newTimer) {
+                timer().startCpuTimer();
+            }
             // TODO
-            timer().endCpuTimer();
+            if (n > 0) {
+                odata[0] = 0;
+                for (int k = 1; k < n; ++k) {
+                    odata[k] = odata[k - 1] + idata[k - 1];
+                }
+            }
+            if (newTimer) {
+                timer().endCpuTimer();
+            }
         }
 
         /**
@@ -31,8 +45,16 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int counter = 0;
+            for (int k = 0; k < n; ++k) {
+                int currVal = idata[k];
+                if (currVal != 0) {
+                    odata[counter] = currVal;
+                    counter++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return counter;
         }
 
         /**
@@ -43,8 +65,26 @@ namespace StreamCompaction {
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int *tempArray = new int[n];
+            for (int k = 0; k < n; ++k) {
+                tempArray[k] = (int) idata[k] != 0;
+            }
+            int counter = 0;
+            int *scanResult = new int[n];
+            scan(n, scanResult, tempArray);
+            for (int k = 0; k < n; ++k) {
+                if (tempArray[k]) {
+                    int index = scanResult[k];
+                    odata[index] = idata[k];
+                    counter++;
+                }
+            }
+
+            delete[] scanResult;
+            delete[] tempArray;
             timer().endCpuTimer();
-            return -1;
+            return counter;
         }
     }
 }
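
The `compactWithScan` implementation above is the standard map/scan/scatter recipe: map the input to 0/1 flags, exclusive-scan the flags, and use the scanned values as output indices. A small worked example in plain C++ on made-up data (illustrative only):

```
#include <cstdio>

// Worked example of compaction via an exclusive scan on a tiny, made-up input.
int main() {
    const int n = 8;
    int idata[n] = {1, 0, 3, 0, 0, 4, 5, 0};

    int flags[n], scanned[n], odata[n];
    for (int k = 0; k < n; ++k) flags[k] = (idata[k] != 0);  // map:  1 0 1 0 0 1 1 0

    scanned[0] = 0;                                          // exclusive scan of flags:
    for (int k = 1; k < n; ++k)                              //       0 1 1 2 2 2 3 4
        scanned[k] = scanned[k - 1] + flags[k - 1];

    int count = 0;
    for (int k = 0; k < n; ++k)                              // scatter: scanned[k] counts the
        if (flags[k]) {                                      // kept elements before k, so the
            odata[scanned[k]] = idata[k];                    // writes never collide
            ++count;
        }

    for (int k = 0; k < count; ++k) std::printf("%d ", odata[k]);  // prints: 1 3 4 5
    std::printf("(count = %d)\n", count);
    return 0;
}
```
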
diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu
index 2db346e..c58050a 100644
--- a/Project2-Stream-Compaction/stream_compaction/efficient.cu
+++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu
@@ -12,15 +12,131 @@ namespace StreamCompaction {
             return timer;
         }
 
+        int blockSize = 128;
+        dim3 threadsPerBlock(blockSize);
+
+        __global__ void kernUpSweep(int N, int power, int *opArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+
+            int k = index * 2 * power;
+
+            opArray[k + 2 * power - 1] += opArray[k + power - 1];
+        }
+
+        __global__ void kernDownSweep(int N, int power, int *opArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+
+            int k = index * 2 * power;
+
+            int t = opArray[k + power - 1];
+            int s = opArray[k + 2 * power - 1];
+            opArray[k + power - 1] = s;
+            opArray[k + 2 * power - 1] = s + t;
+        }
+
+        __global__ void kernSetLastZero(int N, int offset, int *opArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+            opArray[index + offset] = 0;
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+            int bufferLength = 1 << ilog2ceil(n);
+            int *dev_inputArray;
+
+            cudaMalloc((void**)&dev_inputArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_inputArray failed!");
+
+            cudaMemset(dev_inputArray, 0, bufferLength * sizeof(int));
+
+            cudaMemcpy(dev_inputArray, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            bool newTimer = true;
+            if (timer().getGpuTimerStarted()) {
+                newTimer = false;
+            }
+            if (newTimer) {
+                timer().startGpuTimer();
+            }
+            for (int d = 0; d < ilog2ceil(n); ++d) {
+                int power = pow(2, d);
+                int numThreads = bufferLength / (2 * power);
+
+                dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernUpSweep<<<blocksPerGrid, threadsPerBlock>>>(numThreads, power, dev_inputArray);
+                checkCUDAError("kernUpSweep failed!");
+            }
+
+            // NOTE: dev_inputArray is now in upsweep stage
+
+            int numThreadsSetZero = 1;
+            dim3 blocksPerGridSetZero((numThreadsSetZero + blockSize - 1) / blockSize);
+            kernSetLastZero<<<blocksPerGridSetZero, threadsPerBlock>>>(numThreadsSetZero, bufferLength - 1, dev_inputArray);
+
+            for (int d = ilog2ceil(n) - 1; d >= 0; --d) {
+                int power = pow(2, d);
+                int numThreads = bufferLength / (2 * power);
+                dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernDownSweep<<<blocksPerGrid, threadsPerBlock>>>(numThreads, power, dev_inputArray);
+                checkCUDAError("kernDownSweep failed!");
+            }
+
+            if (newTimer) {
+                timer().endGpuTimer();
+            }
+            cudaMemcpy(odata, dev_inputArray, n * sizeof(int), cudaMemcpyDeviceToHost);
+
             // TODO
-            timer().endGpuTimer();
+
+            cudaFree(dev_inputArray);
         }
+
+        __global__ void kernComputeInOutArray(int N, int *srcArray, int *dstArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+            dstArray[index] = (int)(srcArray[index] != 0);
+        }
+
+        __global__ void kernComputeSize(int N, int offset, int *dst, int *src1, int *src2) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+
+            dst[index] = src1[index + offset] + src2[index + offset];
+        }
+
+        __global__ void kernScatter(int N, int *dst, int *src, int *indexFinder) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+
+            int currVal = src[index];
+            int dstIndex = -1;
+            if (currVal != 0) {
+                dstIndex = indexFinder[index];
+            }
+            if (dstIndex >= 0) {
+                dst[dstIndex] = currVal;
+            }
+        }
+
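
The scan above follows the Blelloch work-efficient pattern: `kernUpSweep` builds a reduction tree in place, `kernSetLastZero` zeroes the final slot, and `kernDownSweep` pushes partial sums back down, launching only `bufferLength / (2 * power)` threads per depth. A serial, host-side trace of the same indexing on a tiny made-up array (a sketch for intuition, not the GPU code path):

```
#include <cstdio>

// Host-side trace of the up-sweep / down-sweep phases on a tiny power-of-two
// array, mirroring the indexing in kernUpSweep/kernDownSweep (k = thread * 2 * power).
int main() {
    int a[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    const int n = 8, logn = 3;

    // Up-sweep (reduce): after this, a[n - 1] holds the total sum (25).
    for (int d = 0; d < logn; ++d) {
        int power = 1 << d;
        for (int k = 0; k < n; k += 2 * power)
            a[k + 2 * power - 1] += a[k + power - 1];
    }

    a[n - 1] = 0;  // zero the last element before the down-sweep

    // Down-sweep: each pass swaps and accumulates partial sums back down the tree.
    for (int d = logn - 1; d >= 0; --d) {
        int power = 1 << d;
        for (int k = 0; k < n; k += 2 * power) {
            int t = a[k + power - 1];
            a[k + power - 1] = a[k + 2 * power - 1];
            a[k + 2 * power - 1] += t;
        }
    }

    for (int k = 0; k < n; ++k) std::printf("%d ", a[k]);  // 0 3 4 11 11 15 16 22
    std::printf("\n");
    return 0;
}
```
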
     /**
      * Performs stream compaction on idata, storing the result into odata.
      * All zeroes are discarded.
@@ -31,10 +147,76 @@ namespace StreamCompaction {
      * @returns The number of elements remaining after compaction.
      */
        int compact(int n, int *odata, const int *idata) {
+            int bufferLength = 1 << ilog2ceil(n);
+            int *dev_inputArray;
+            int *dev_tempInOutArray;
+            int *dev_scanArray;
+            int *dev_resultLength;
+
+            int *host_resultLength;
+
+            cudaMalloc((void**)&dev_inputArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_inputArray failed!");
+
+            cudaMalloc((void**)&dev_tempInOutArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_tempInOutArray failed!");
+
+            cudaMalloc((void**)&dev_scanArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_scanArray failed!");
+
+            cudaMalloc((void**)&dev_resultLength, sizeof(int));
+            checkCUDAError("cudaMalloc dev_resultLength failed!");
+
+            cudaMallocHost((void**)&host_resultLength, sizeof(int));
+            checkCUDAError("cudaMallocHost host_resultLength failed!");
+
+            cudaMemset(dev_inputArray, 0, bufferLength * sizeof(int));
+
+            cudaMemcpy(dev_inputArray, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
             // TODO
-            timer().endGpuTimer();
-            return -1;
+            int numThreads = bufferLength;
+            dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize);
+
+            kernComputeInOutArray<<<blocksPerGrid, threadsPerBlock>>>(numThreads, dev_inputArray, dev_tempInOutArray);
+            checkCUDAError("kernComputeInOutArray failed!");
+
+            scan(bufferLength, dev_scanArray, dev_tempInOutArray);
+
+            // NOTE: dev_scanArray now holds bufferLength scanned values of the 0/1 array
+
+            int numThreadsComputeSize = 1;
+            dim3 blocksPerGridComputeSize((numThreadsComputeSize + blockSize - 1) / blockSize);
+            kernComputeSize<<<blocksPerGridComputeSize, threadsPerBlock>>>(numThreadsComputeSize, bufferLength - 1, dev_resultLength, dev_scanArray, dev_tempInOutArray);
+            checkCUDAError("kernComputeSize failed!");
+
+            cudaMemcpy(host_resultLength, dev_resultLength, sizeof(int), cudaMemcpyDeviceToHost);
+            int length = host_resultLength[0];
+
+            int *dev_final;
+            cudaMallocHost((void**)&dev_final, length * sizeof(int));
+
+            int numThreadsScatter = bufferLength;
+            dim3 blocksPerGridScatter((numThreadsScatter + blockSize - 1) / blockSize);
+
+            kernScatter<<<blocksPerGridScatter, threadsPerBlock>>>(numThreadsScatter, dev_final, dev_inputArray, dev_scanArray);
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_final, length * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_inputArray);
+            cudaFree(dev_tempInOutArray);
+            cudaFree(dev_scanArray);
+            cudaFree(dev_resultLength);
+            cudaFreeHost(host_resultLength);
+            cudaFreeHost(dev_final); // dev_final was allocated with cudaMallocHost above
+
+            return length;
         }
     }
 }
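
`kernComputeSize` in the compaction above relies on the scan being exclusive: the compacted length is the last scanned value (kept elements strictly before the final slot) plus the last flag (whether the final element itself is kept). A tiny illustration with made-up numbers:

```
#include <cstdio>

// Why compact's length is scan[last] + flag[last] for an exclusive scan.
int main() {
    int flags[8]   = {1, 0, 1, 0, 0, 1, 1, 0};  // 1 where the input element is kept
    int scanned[8] = {0, 1, 1, 2, 2, 2, 3, 4};  // exclusive scan of flags

    // scanned[7] counts the kept elements strictly before index 7;
    // adding flags[7] accounts for whether the last element itself is kept.
    int length = scanned[7] + flags[7];         // 4 + 0 = 4
    std::printf("compacted length = %d\n", length);
    return 0;
}
```
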
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..acb7e20 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -12,14 +12,99 @@ namespace StreamCompaction {
             return timer;
         }
         // TODO: __global__
+        int blockSize = 128;
+        dim3 threadsPerBlock(blockSize);
+
+        __global__ void kernSumPairs(int N, int d, int *srcArray, int *dstArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+            int power = powf(2, d - 1);
+            index += power;
+            dstArray[index] = srcArray[index - power] + srcArray[index];
+
+            /*int power = powf(2, d - 1);
+            if (index >= power) {
+                dstArray[index] = srcArray[index - power] + srcArray[index];
+            }*/
+        }
+
+        __global__ void kernShift(int N, int *srcArray, int *dstArray) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= N) {
+                return;
+            }
+
+            if (index == 0) {
+                dstArray[index] = 0;
+                return;
+            }
+            dstArray[index] = srcArray[index - 1];
+        }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+            int bufferLength = 1 << ilog2ceil(n);
+            int *dev_arrayA;
+            int *dev_arrayB;
+
+            cudaMalloc((void**)&dev_arrayA, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_arrayA failed!");
+
+            cudaMalloc((void**)&dev_arrayB, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_arrayB failed!");
+
+            cudaMemset(dev_arrayA, 0, bufferLength * sizeof(int));
+            cudaMemset(dev_arrayB, 0, bufferLength * sizeof(int));
+
+            cudaMemcpy(dev_arrayA, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemcpy(dev_arrayB, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
             // TODO
-            timer().endGpuTimer();
+            bool alternator = true;
+            for (int d = 1; d <= ilog2ceil(n); ++d) {
+                int numThreads = bufferLength - pow(2, d - 1); // TODO: can this be smaller?
+
+                dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize);
+
+                if (alternator) {
+                    kernSumPairs<<<blocksPerGrid, threadsPerBlock>>>(numThreads, d, dev_arrayA, dev_arrayB);
+                    checkCUDAError("kernSumPairs failed!");
+
+                    cudaMemcpy(dev_arrayA, dev_arrayB, bufferLength * sizeof(int), cudaMemcpyDeviceToDevice);
+                    alternator = false;
+                }
+                else {
+                    kernSumPairs<<<blocksPerGrid, threadsPerBlock>>>(numThreads, d, dev_arrayB, dev_arrayA);
+                    checkCUDAError("kernSumPairs failed!");
+
+                    cudaMemcpy(dev_arrayB, dev_arrayA, bufferLength * sizeof(int), cudaMemcpyDeviceToDevice);
+                    alternator = true;
+                }
+            }
+            // Note: dev_arrayA/B are now inclusive scans.
+            // We will take B and shift it on the gpu, storing the exclusive scan in A
+
+            int numThreads = bufferLength;
+            dim3 blocksPerGrid((numThreads + blockSize - 1) / blockSize);
+
+            kernShift<<<blocksPerGrid, threadsPerBlock>>>(numThreads, dev_arrayB, dev_arrayA);
+            checkCUDAError("kernShift failed!");
+
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_arrayA, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_arrayA);
+            cudaFree(dev_arrayB);
         }
     }
 }
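
The naive scan is the Hillis-Steele formulation: on pass `d`, every element adds in its neighbor `2^(d-1)` positions to the left, so after `ilog2ceil(n)` passes the buffer holds an inclusive scan, which `kernShift` then converts to an exclusive one. A serial trace of those passes on made-up data (for intuition only):

```
#include <cstdio>
#include <cstring>

// Serial trace of the Hillis-Steele (naive) scan passes, followed by the
// right-shift that converts the inclusive result to an exclusive one.
int main() {
    const int n = 8;
    int a[n] = {3, 1, 7, 0, 4, 1, 6, 3};
    int b[n];

    for (int d = 1; (1 << (d - 1)) < n; ++d) {
        int power = 1 << (d - 1);
        for (int i = 0; i < n; ++i)            // ping-pong: read a, write b
            b[i] = (i >= power) ? a[i - power] + a[i] : a[i];
        std::memcpy(a, b, sizeof(a));
        // after d=1: 3 4 8 7 4 5 7 9
        // after d=2: 3 4 11 11 12 12 11 14
        // after d=3: 3 4 11 11 15 16 22 25   (inclusive scan)
    }

    int out[n];
    out[0] = 0;                                // shift right, insert the identity
    for (int i = 1; i < n; ++i) out[i] = a[i - 1];

    for (int i = 0; i < n; ++i) std::printf("%d ", out[i]);  // 0 3 4 11 11 15 16 22
    std::printf("\n");
    return 0;
}
```
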
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..096bfdc 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -18,10 +18,32 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
        void scan(int n, int *odata, const int *idata) {
+            int bufferLength = n;
+            int *dev_inputArray;
+            int *dev_outputArray;
+
+            cudaMalloc((void**)&dev_inputArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_inputArray failed!");
+
+            cudaMalloc((void**)&dev_outputArray, bufferLength * sizeof(int));
+            checkCUDAError("cudaMalloc dev_outputArray failed!");
+
+            cudaMemset(dev_inputArray, 0, bufferLength * sizeof(int));
+            cudaMemset(dev_outputArray, 0, bufferLength * sizeof(int));
+
+            cudaMemcpy(dev_inputArray, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            thrust::device_ptr<int> dev_thrust_inputArray = thrust::device_pointer_cast(dev_inputArray);
+            thrust::device_ptr<int> dev_thrust_outputArray = thrust::device_pointer_cast(dev_outputArray);
+
             timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_thrust_inputArray, dev_thrust_inputArray + bufferLength, dev_thrust_outputArray);
+
+            cudaMemcpy(odata, dev_outputArray, n * sizeof(int), cudaMemcpyDeviceToHost);
+
             timer().endGpuTimer();
+
+            cudaFree(dev_inputArray);
+            cudaFree(dev_outputArray);
         }
     }
 }
diff --git a/README.md b/README.md
index 3a0b2fe..a167dee 100644
--- a/README.md
+++ b/README.md
@@ -3,14 +3,12 @@ CUDA Number Algorithms
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Grace Gilbert
+  * gracelgilbert.com
+* Tested on: Windows 10, i9-9900K @ 3.60GHz 64GB, GeForce RTX 2080 40860MB
 
-### (TODO: Your README)
+[Stream Compaction](/Project2-Stream-Compaction/README.md)
 
-Link to the readmes of the other two subprojects.
+[Character Recognition](/Project2-Character-Recognition/README.md)
 
-Add anything else you think is relevant up to this point.
-(Remember, this is public, so don't put anything here that you don't want to share with the world.)