diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..a02969c 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -1,14 +1,58 @@
-CUDA Character Recognition
-======================
+# Project 2b: CUDA Character Recognition
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2 - Character Recognition**
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+Caroline Lachanski: [LinkedIn](https://www.linkedin.com/in/caroline-lachanski/), [personal website](http://carolinelachanski.com/)
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Tested on: Windows 10, i5-6500 @ 3.20GHz 16GB, GTX 1660 (personal computer)
 
-### (TODO: Your README)
+## Project Description
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+The goal of this project was to implement a small neural network that recognizes characters from black-and-white image inputs. I specifically implemented a multi-layer perceptron with one hidden layer. In this project, I let the size of the hidden layer be the average of the size of the input layer and the size of the output layer (which is always 1).
+![](./img/MLP.png)
+
+Training starts when a set of data is input into the perceptron, becoming the first layer. This data is multiplied by a series of weights (initially created randomly), each of which corresponds to one input node and one hidden-layer node. These weights determine the importance, or "weight," each specific input has in determining the output. All of the products associated with the same hidden-layer node are summed, then passed into the activation function, which in this project is the sigmoid function, f(x) = 1/(1+e^-x). The results of this function then become the "input" to the next layer, which has its own set of weights and follows a similar weighting and summing process. Since our final layer is one node, we are left with one output, which is also passed through the activation function.
+
+Each input into the network results in a single output value. The expected value for this input is subtracted from the actual output, the difference is squared, and the result is summed with the corresponding result from every other input. This sum is divided by 2 to give us the total error for this round of training.
+
+![](./img/error_equation.png)
+
+We then calculate how much each specific weight contributes to the difference between the actual and expected output (the error's partial derivative with respect to that weight). Each of these values is multiplied by a negative lambda (here defined as (total error / 5)) to get the delta to be added to that specific weight. This is what is known as back-propagation, and it should ultimately reduce the error of the overall system.
+
+![](./img/delta_weight.png)
+
+We can continue this reduction of error until the error falls below a certain threshold or we reach our maximum number of iterations. Once training is complete, we record the final weights to a text file, which can then be used as the weights when we actually want to run our system on a single input.
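+As a quick illustration of the forward pass described above (a single-threaded sketch; the actual implementation launches one CUDA thread per node, and the names here are illustrative, not the repo's code):
+
+```
+#include <cmath>
+#include <vector>
+
+// f(x) = 1 / (1 + e^-x)
+float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }
+
+// One forward pass. weights1[h + i * numHidden] connects input node i to
+// hidden node h; weights2[h] connects hidden node h to the single output node.
+float forward(const std::vector<float> &input,
+              const std::vector<float> &weights1,
+              const std::vector<float> &weights2, int numHidden) {
+    float outSum = 0.0f;
+    for (int h = 0; h < numHidden; h++) {
+        float hiddenSum = 0.0f;
+        for (int i = 0; i < (int)input.size(); i++) {
+            hiddenSum += input[i] * weights1[h + i * numHidden]; // weighted sum into node h
+        }
+        outSum += sigmoid(hiddenSum) * weights2[h]; // hidden activation feeds the output sum
+    }
+    return sigmoid(outSum); // final activation gives the network's output
+}
+```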
+
+I initially trained and tested my network on a small XOR example, then moved on to training and testing on the much larger character recognition example (52 characters, each with an input array of 101x101 = 10,201 floats).
+
+## Output
+
+The weights produced by training for both the XOR and character recognition examples can be found in text files in this repository. After training, I tested the network on each possible input. Here is the result for XOR:
+
+```
+*********************************
+*********** XOR TESTS ***********
+*********************************
+Expected output: 0.000000
+Actual output: 0.001515
+
+Expected output: 1.000000
+Actual output: 0.977023
+
+Expected output: 1.000000
+Actual output: 0.976850
+
+Expected output: 0.000000
+Actual output: 0.019346
+
+Total error: 0.000720
+```
+
+There seems to be an issue with my character recognition test, as the actual output for each tested character is 0.
+
+## Challenges
+
+This project was my first foray into machine learning, and only my second project using CUDA/GPU programming, so it was a bit of a challenge. I struggled with choosing how to represent and pass around data, particularly for error calculation and back-propagation. For example, when calculating the partial error derivatives for each weight and summing them to get a delta w for each weight, I initially made one large float buffer that held all the partial derivatives for all weights and all inputs, and worked with it using various indexing schemes. This worked fine for the very small XOR example, but for the actual character recognition example, which has over 50 million weights, I had trouble fitting everything in one buffer. I had to divide the data into pieces while still allowing various calculations to run on it in parallel, and I still don't think what I did was the best choice.
+
+In the end, I don't believe my back-propagation/weight-update functionality works fully, as my network stops reducing error after only one iteration for the character recognition example. Nonetheless, I learned a lot about neural networks through this project.
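+One memory-friendly pattern here (and roughly what `adjustWeights` in `mlp.cu` ends up doing) is to keep only one input's partial derivatives on the device at a time and fold each input's scaled deltas into the weights in sequence. A sketch, with illustrative names, assuming the kernels and `blockSize` macro from `mlp.cu`:
+
+```
+// Sketch: apply one training input's deltas at a time, so the device only
+// ever holds numWeights floats of scratch space.
+void applyAllDeltas(int numTotalInput, int numWeights, float lambda,
+                    float **hostDerivatives, float *dev_weights, float *dev_scratch) {
+    dim3 gridSize((numWeights + blockSize - 1) / blockSize, 1, 1);
+    for (int i = 0; i < numTotalInput; i++) {
+        cudaMemcpy(dev_scratch, hostDerivatives[i], numWeights * sizeof(float),
+                   cudaMemcpyHostToDevice);
+        kernScale<<<gridSize, blockSize>>>(numWeights, dev_scratch, lambda);              // delta = lambda * dE/dw
+        kernModifyWeights<<<gridSize, blockSize>>>(numWeights, dev_weights, dev_scratch); // w += delta
+    }
+}
+```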
diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt index 7446175..c5e28b0 100644 --- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt +++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt @@ -7,5 +7,5 @@ set(SOURCE_FILES cuda_add_library(character_recognition ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_75 ) diff --git a/Project2-Character-Recognition/character_recognition/common.cu b/Project2-Character-Recognition/character_recognition/common.cu index 2a754d4..68fa464 100644 --- a/Project2-Character-Recognition/character_recognition/common.cu +++ b/Project2-Character-Recognition/character_recognition/common.cu @@ -1,15 +1,15 @@ -#include "common.h" - -void checkCUDAErrorFn(const char *msg, const char *file, int line) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess == err) { - return; - } - - fprintf(stderr, "CUDA error"); - if (file) { - fprintf(stderr, " (%s:%d)", file, line); - } - fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); -} +#include "common.h" + +void checkCUDAErrorFn(const char *msg, const char *file, int line) { + cudaError_t err = cudaGetLastError(); + if (cudaSuccess == err) { + return; + } + + fprintf(stderr, "CUDA error"); + if (file) { + fprintf(stderr, " (%s:%d)", file, line); + } + fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); +} diff --git a/Project2-Character-Recognition/character_recognition/common.h b/Project2-Character-Recognition/character_recognition/common.h index 6aede64..8330494 100644 --- a/Project2-Character-Recognition/character_recognition/common.h +++ b/Project2-Character-Recognition/character_recognition/common.h @@ -1,126 +1,128 @@ -#pragma once - -#include -#include - -#include -#include -#include -#include -#include -#include - -#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) -#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) - -/** - * Check for CUDA errors; print and exit if there was a problem. - */ -void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); - -inline int ilog2(int x) { - int lg = 0; - while (x >>= 1) { - ++lg; - } - return lg; -} - -inline int ilog2ceil(int x) { - return x == 1 ? 
0 : ilog2(x - 1) + 1; -} - - -namespace Common { - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds = 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; - }; -} +#pragma once + +#include +#include + +#include +#include +#include +#include +#include +#include + +#define blockSize 256 + +#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) +#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) + +/** + * Check for CUDA errors; print and exit if there was a problem. + */ +void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); + +inline int ilog2(int x) { + int lg = 0; + while (x >>= 1) { + ++lg; + } + return lg; +} + +inline int ilog2ceil(int x) { + return x == 1 ? 
0 : ilog2(x - 1) + 1;
+}
+
+
+namespace Common {
+    /**
+    * This class is used for timing the performance
+    * Uncopyable and unmovable
+    *
+    * Adapted from WindyDarian(https://github.com/WindyDarian)
+    */
+    class PerformanceTimer
+    {
+    public:
+        PerformanceTimer()
+        {
+            cudaEventCreate(&event_start);
+            cudaEventCreate(&event_end);
+        }
+
+        ~PerformanceTimer()
+        {
+            cudaEventDestroy(event_start);
+            cudaEventDestroy(event_end);
+        }
+
+        void startCpuTimer()
+        {
+            if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+            cpu_timer_started = true;
+
+            time_start_cpu = std::chrono::high_resolution_clock::now();
+        }
+
+        void endCpuTimer()
+        {
+            time_end_cpu = std::chrono::high_resolution_clock::now();
+
+            if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+            std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+            prev_elapsed_time_cpu_milliseconds =
+                static_cast<float>(duro.count());
+
+            cpu_timer_started = false;
+        }
+
+        void startGpuTimer()
+        {
+            if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+            gpu_timer_started = true;
+
+            cudaEventRecord(event_start);
+        }
+
+        void endGpuTimer()
+        {
+            cudaEventRecord(event_end);
+            cudaEventSynchronize(event_end);
+
+            if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+            cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+            gpu_timer_started = false;
+        }
+
+        float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+        {
+            return prev_elapsed_time_cpu_milliseconds;
+        }
+
+        float getGpuElapsedTimeForPreviousOperation() //noexcept
+        {
+            return prev_elapsed_time_gpu_milliseconds;
+        }
+
+        // remove copy and move functions
+        PerformanceTimer(const PerformanceTimer&) = delete;
+        PerformanceTimer(PerformanceTimer&&) = delete;
+        PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+        PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+    private:
+        cudaEvent_t event_start = nullptr;
+        cudaEvent_t event_end = nullptr;
+
+        using time_point_t = std::chrono::high_resolution_clock::time_point;
+        time_point_t time_start_cpu;
+        time_point_t time_end_cpu;
+
+        bool cpu_timer_started = false;
+        bool gpu_timer_started = false;
+
+        float prev_elapsed_time_cpu_milliseconds = 0.f;
+        float prev_elapsed_time_gpu_milliseconds = 0.f;
+    };
+}
diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu
index 5a3ed7f..4bd098a 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.cu
+++ b/Project2-Character-Recognition/character_recognition/mlp.cu
@@ -1,27 +1,430 @@
-#include
-#include
-#include "common.h"
-#include "mlp.h"
-
-namespace CharacterRecognition {
-    using Common::PerformanceTimer;
-    PerformanceTimer& timer()
-    {
-        static PerformanceTimer timer;
-        return timer;
-    }
-
-    // TODO: __global__
-
-    /**
-    * Example of use case (follow how you did it in stream compaction)
-    */
-    /*void scan(int n, int *odata, const int *idata) {
-        timer().startGpuTimer();
-        // TODO
-        timer().endGpuTimer();
-    }
-    */
-
-    // TODO: implement required elements for MLP sections 1 and 2 here
-}
+#include
+#include
+#include "common.h"
+#include "mlp.h"
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define EPSILON 0.0005
+// parenthesized so that comparisons like numIter > MAX_ITER compare against 32,
+// rather than parsing as (numIter > 1) << 5
+#define MAX_ITER (1 << 5)
+
+#define XOR_HARD_CODED_WEIGHTS 1
+
+namespace CharacterRecognition {
+    using Common::PerformanceTimer;
+    PerformanceTimer& timer()
+    {
+        static PerformanceTimer timer;
+        return timer;
+    }
+
+    __host__ __device__ unsigned int hash(unsigned int a) {
+        a = (a + 0x7ed55d16) + (a << 12);
+        a = (a ^ 0xc761c23c) ^ (a >> 19);
+        a = (a + 0x165667b1) + (a << 5);
+        a = (a + 0xd3a2646c) ^ (a << 9);
+        a = (a + 0xfd7046c5) + (a << 3);
+        a = (a ^ 0xb55a4f09) ^ (a >> 16);
+        return a;
+    }
+
+
+    __host__ __device__ float generateRandom(float time, int index, float range) {
+        thrust::default_random_engine rng(hash((int)(index * time)));
+        thrust::uniform_real_distribution<float> unitDistrib(-1, 1);
+
+        return range * unitDistrib(rng);
+    }
+
+
+    // compute a random float value between -range and range for each element
+    __global__ void kernRandom(int n, int time, float* out) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        out[index] = generateRandom(time, index, 10.0);
+    }
+
+
+    // weighted sum feeding each of the n nodes of the next layer
+    __global__ void kernComputeLayerSum(int n, float *out, int inCount, float *in, float *weights, int weightsOffset) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        float sum = 0.0;
+        for (int i = 0; i < inCount; i++) {
+            sum += in[i] * weights[index + (i * n) + weightsOffset];
+        }
+
+        out[index] = sum;
+    }
+
+
+    __host__ __device__ float activationFxn(float x) {
+        // activation function: f(x) = 1 / (1 + e^-x)
+        return (1.0 / (1.0 + exp(-x)));
+    }
+
+
+    __global__ void kernComputeActivationFxn(int n, float *in) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        in[index] = activationFxn(in[index]);
+    }
+
+
+    // partial derivative of the error with respect to each first-layer weight (one thread per weight)
+    __global__ void kernComputePartialDerivativeLayer1(int n, float *outDerivatives, float *weights,
+        int numWeights1, float *input, int numHidden, float *hiddenOutput, float output, float expected) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        int inputIndex = index / numHidden;
+        int hiddenIndex = index % numHidden;
+        float hiddenActivation = activationFxn(hiddenOutput[hiddenIndex]);
+        float outputActivation = activationFxn(output);
+
+        float partialDerivative = -input[inputIndex] * hiddenActivation * (1.0 - hiddenActivation) *
+            (expected - output) * outputActivation * (1.0 - outputActivation) * weights[numWeights1 + hiddenIndex];
+
+        outDerivatives[index] = partialDerivative;
+    }
+
+
+    // partial derivative of the error with respect to each second-layer weight
+    __global__ void kernComputePartialDerivativeLayer2(int n, float *outDerivatives, float *weights,
+        int numWeights1, float *input, int numHidden, float *hiddenOutput, float output, float expected) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        float outputActivation = activationFxn(output);
+        float partialDerivative = -(expected - output) * outputActivation * (1.0 - outputActivation) * hiddenOutput[index];
+
+        outDerivatives[numWeights1 + index] = partialDerivative;
+    }
+
+
+    void createWeights(int numWeights, float *weights) {
+        float *dev_weights;
+        cudaMalloc((void**)&dev_weights, numWeights * sizeof(float));
+        checkCUDAError("cudaMalloc dev_weights failed!");
+
+        // fill weights with random numbers between -10 and 10
+        dim3 gridSize = dim3((numWeights + blockSize - 1) / blockSize, 1, 1);
+        kernRandom<<<gridSize, blockSize>>>(numWeights, 1, dev_weights);
+        checkCUDAError("kernRandom failed!");
+
+        // copy weights back to host
+        cudaMemcpy(weights, dev_weights, numWeights * sizeof(float), cudaMemcpyDeviceToHost);
+        checkCUDAError("cudaMemcpy weights dev_weights failed!");
+
+#if XOR_HARD_CODED_WEIGHTS
+        // hard-coded starting weights for the XOR example
+        weights[0] = 10.1;
+        weights[1] = 0.9;
+        weights[2] = 20.0;
+        weights[3] = 0.87;
+        weights[4] = 41.0;
+        weights[5] = -54.0;
+#endif // XOR_HARD_CODED_WEIGHTS
+
+        cudaFree(dev_weights);
+        checkCUDAError("cudaFree dev_weights failed!");
+    }
+
+
+    void getWeightsFromFile(int numWeights, float *weights, std::string filename) {
+        std::string prefix = "../weights_";
+        std::string suffix = ".txt";
+        std::stringstream buffer;
+        buffer << prefix << filename << suffix;
+
+        int index = 0;
+        std::ifstream inputFile(buffer.str());
+        if (inputFile.is_open()) {
+            std::string line;
+            while (std::getline(inputFile, line)) {
+                weights[index] = stof(line);
+                index++;
+            }
+        }
+    }
+
+
+    __global__ void kernScale(int n, float *buffer, float scale) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        buffer[index] = scale * buffer[index];
+    }
+
+
+    __global__ void kernScanUpsweep(int n, int iteration, float *buffer) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        int power1 = 1 << (iteration + 1);
+        if (index % power1 == 0) {
+            int power2 = 1 << iteration;
+            buffer[index + power1 - 1] += buffer[index + power2 - 1];
+        }
+    }
+
+
+    __global__ void kernModifyWeights(int n, float *weights, float *allWeightsErrors, int numTotalInput) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        for (int i = 0; i < numTotalInput; i++) {
+            weights[index] += allWeightsErrors[i * n + index];
+        }
+    }
+
+    // note: unused; as written, every thread would add the same errors to weights[currWeight]
+    __global__ void kernModifyWeights2(int n, float *weights, float *weightsErrors, int numTotalInput, int currWeight) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        for (int i = 0; i < numTotalInput; i++) {
+            weights[currWeight] += weightsErrors[i];
+        }
+    }
+
+    __global__ void kernModifyWeights(int n, float *weights, float *weightsErrors) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+
+        weights[index] += weightsErrors[index];
+    }
+
+
+    void adjustWeights(int numTotalInput, int numWeights, float *weights, std::vector<float*> allWeightsDerivatives, float error) {
+        float lambda = -error / 5.0;
+
+        float *dev_weights;
+        cudaMalloc((void**)&dev_weights, numWeights * sizeof(float));
+        checkCUDAError("cudaMalloc dev_weights failed!");
+
+        cudaMemcpy(dev_weights, weights, numWeights * sizeof(float), cudaMemcpyHostToDevice);
+        checkCUDAError("cudaMemcpy dev_weights weights failed!");
+
+        float *dev_weightErrors;
+        cudaMalloc((void**)&dev_weightErrors, numWeights * sizeof(float));
+        checkCUDAError("cudaMalloc dev_weightErrors failed!");
+
+        for (int i = 0; i < allWeightsDerivatives.size(); i++) {
+            cudaMemcpy(dev_weightErrors, allWeightsDerivatives[i], numWeights * sizeof(float), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_weightErrors allWeightsDerivatives[i] failed!");
+
+            // for each weight derivative, compute delta weight
+            dim3 gridSize = dim3((numWeights + blockSize - 1) / blockSize, 1, 1);
+            kernScale<<<gridSize, blockSize>>>(numWeights, dev_weightErrors, lambda);
+            checkCUDAError("kernScale failed!");
+
+            kernModifyWeights<<<gridSize, blockSize>>>(numWeights, dev_weights, dev_weightErrors);
+            checkCUDAError("kernModifyWeights failed!");
+        }
+
+        // copy weights back to host
+        cudaMemcpy(weights, dev_weights, numWeights * sizeof(float), cudaMemcpyDeviceToHost);
+        checkCUDAError("cudaMemcpy weights dev_weights failed!");
+
+        cudaFree(dev_weightErrors);
+        cudaFree(dev_weights);
+        checkCUDAError("cudaFree failed!");
+    }
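+
+    // adjustWeights above implements gradient descent: each weight w receives
+    //   w += lambda * dE/dw, with lambda = -(total error) / 5,
+    // i.e. a step whose size scales with the current total error, taken in the
+    // direction that decreases the error.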
+
+
+    float runOneInput(int numInput, int numHidden, const float *input, const float *weights, float *weightErrors, float expected, bool training) {
+        int numWeights1 = numInput * numHidden;
+        int numWeights2 = numHidden;
+
+        float *dev_input;
+        float *dev_hiddenLayer;
+        float *dev_output;
+        float *dev_weights;
+        float *dev_weightErrors;
+
+        // malloc device buffers
+        cudaMalloc((void**)&dev_input, numInput * sizeof(float));
+        checkCUDAError("cudaMalloc dev_input failed!");
+
+        cudaMalloc((void**)&dev_hiddenLayer, numHidden * sizeof(float));
+        checkCUDAError("cudaMalloc dev_hiddenLayer failed!");
+
+        cudaMalloc((void**)&dev_output, numHidden * sizeof(float));
+        checkCUDAError("cudaMalloc dev_output failed!");
+
+        cudaMalloc((void**)&dev_weights, (numWeights1 + numWeights2) * sizeof(float));
+        checkCUDAError("cudaMalloc dev_weights failed!");
+
+        cudaMalloc((void**)&dev_weightErrors, (numWeights1 + numWeights2) * sizeof(float));
+        checkCUDAError("cudaMalloc dev_weightErrors failed!");
+
+        // copy weights from host to device
+        cudaMemcpy(dev_weights, weights, (numWeights1 + numWeights2) * sizeof(float), cudaMemcpyHostToDevice);
+        checkCUDAError("cudaMemcpy dev_weights weights failed!");
+
+        // copy input from host to device
+        cudaMemcpy(dev_input, input, numInput * sizeof(float), cudaMemcpyHostToDevice);
+        checkCUDAError("cudaMemcpy dev_input input failed!");
+
+        // compute first layer summation
+        dim3 gridSize = dim3((numHidden + blockSize - 1) / blockSize, 1, 1);
+        kernComputeLayerSum<<<gridSize, blockSize>>>(numHidden, dev_hiddenLayer, numInput, dev_input, dev_weights, 0);
+        checkCUDAError("kernComputeLayerSum failed!");
+
+        // compute result of hidden layer (activation function)
+        gridSize = dim3((numHidden + blockSize - 1) / blockSize, 1, 1);
+        kernComputeActivationFxn<<<gridSize, blockSize>>>(numHidden, dev_hiddenLayer);
+        checkCUDAError("kernComputeActivationFxn failed!");
+
+        // compute second layer summation (a single thread computes the lone output node)
+        gridSize = dim3((1 + blockSize - 1) / blockSize, 1, 1);
+        kernComputeLayerSum<<<gridSize, blockSize>>>(1, dev_output, numHidden, dev_hiddenLayer, dev_weights, numWeights1);
+        checkCUDAError("kernComputeLayerSum failed!");
+
+        // compute activation function of output layer node
+        float output;
+        cudaMemcpy(&output, dev_output, 1 * sizeof(float), cudaMemcpyDeviceToHost);
+        output = activationFxn(output);
+
+        if (!training) {
+            std::cout << "Expected output: " << std::to_string(expected) << std::endl;
+            std::cout << "Actual output: " << std::to_string(output) << std::endl;
+            std::cout << std::endl;
+        }
+
+        // if training, compute partial derivatives for error/weight
+        if (training) {
+            // first layer weights
+            gridSize = dim3((numWeights1 + blockSize - 1) / blockSize, 1, 1);
+            kernComputePartialDerivativeLayer1<<<gridSize, blockSize>>>(numWeights1, dev_weightErrors, dev_weights, numWeights1,
+                dev_input, numHidden, dev_hiddenLayer, output, expected);
+            checkCUDAError("kernComputePartialDerivativeLayer1 failed!");
+
+            // second layer weights
+            gridSize = dim3((numWeights2 + blockSize - 1) / blockSize, 1, 1);
+            kernComputePartialDerivativeLayer2<<<gridSize, blockSize>>>(numWeights2, dev_weightErrors, dev_weights, numWeights1,
+                dev_input, numHidden, dev_hiddenLayer, output, expected);
+            checkCUDAError("kernComputePartialDerivativeLayer2 failed!");
+
+            // copy derivatives to host
+            cudaMemcpy(weightErrors, dev_weightErrors, (numWeights1 + numWeights2) * sizeof(float), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy weightErrors dev_weightErrors failed!");
+        }
+
+        cudaFree(dev_input);
+        cudaFree(dev_hiddenLayer);
+        cudaFree(dev_output);
+        cudaFree(dev_weights);
+        cudaFree(dev_weightErrors);
+        checkCUDAError("cudaFree failed!");
+
+        // return square of difference
+        return (output - expected) * (output - expected);
+    }
+
+    void train(int numInput, std::vector<float*> inputs, std::vector<float> expected, std::string filename) {
+        int numTotalInput = inputs.size();
+
+        // determine how many nodes in hidden layer (average of # of nodes in input and # of nodes in output)
+        int numHidden = ceil((numInput + 1.0) / 2.0);
+
+        // determine number of weights in all layers
+        int numWeights = (numInput * numHidden) + numHidden;
+
+        // get weights
+        float *weights = new float[numWeights];
+        createWeights(numWeights, weights);
+
+        // create weight errors
+        std::vector<float*> allWeightErrors;
+        for (int i = 0; i < numTotalInput; i++) {
+            allWeightErrors.push_back(new float[numWeights]);
+        }
+
+        float totalError;
+        int numIter = 0;
+
+        while (true) {
+            totalError = 0.0;
+            // train on each input
+            for (int i = 0; i < numTotalInput; i++) {
+                float *in = inputs[i];
+                totalError += runOneInput(numInput, numHidden, in, weights, allWeightErrors[i], expected[i], true);
+            }
+            totalError /= 2.0;
+
+            if (totalError < EPSILON || numIter > MAX_ITER) {
+                // finished training, save weights to txt file
+                std::ofstream weightsFile;
+                weightsFile.open("../weights_" + filename + ".txt");
+                for (int i = 0; i < numWeights; i++) {
+                    float weight = weights[i];
+                    weightsFile << weight;
+                    weightsFile << "\n";
+                }
+                weightsFile.close();
+                break; // break while loop
+            }
+            else {
+                adjustWeights(numTotalInput, numWeights, weights, allWeightErrors, totalError);
+            }
+            numIter++;
+        }
+
+        delete[] weights;
+
+        for (int i = 0; i < numTotalInput; i++) {
+            delete[] allWeightErrors[i];
+        }
+    }
+
+    void run(int numInput, std::vector<float*> inputs, std::vector<float> expected, std::string filename) {
+
+        // determine how many nodes in hidden layer (average of # of nodes in input and # of nodes in output)
+        int numHidden = ceil((numInput + 1.0) / 2.0);
+
+        // determine number of weights in all layers
+        int numWeights = (numInput * numHidden) + numHidden;
+
+        // get weights
+        float* weights = new float[numWeights];
+        getWeightsFromFile(numWeights, weights, filename);
+
+        float totalError = 0.f;
+        for (int i = 0; i < inputs.size(); i++) {
+            float *in = inputs[i];
+            totalError += runOneInput(numInput, numHidden, in, weights, nullptr, expected[i], false);
+        }
+
+        std::cout << "Total error: " << std::to_string(totalError / 2.0) << std::endl;
+
+        delete[] weights;
+    }
+
+
+}
diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h
index 2096228..fcfc3a9 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.h
+++ b/Project2-Character-Recognition/character_recognition/mlp.h
@@ -1,9 +1,13 @@
-#pragma once
-
-#include "common.h"
-
-namespace CharacterRecognition {
-    Common::PerformanceTimer& timer();
-
-    // TODO: implement required elements for MLP sections 1 and 2 here
-}
+#pragma once
+
+#include "common.h"
+#include <vector>
+
+namespace CharacterRecognition {
+    Common::PerformanceTimer& timer();
+
+    void train(int numInput, std::vector<float*> inputs, std::vector<float> expected, std::string filename);
+
+    void run(int numInput, std::vector<float*> inputs, std::vector<float> expected, std::string filename);
+
+}
diff --git a/Project2-Character-Recognition/img/delta_weight.png b/Project2-Character-Recognition/img/delta_weight.png
new file mode 100644
index 0000000..6acd2ca
Binary files /dev/null and b/Project2-Character-Recognition/img/delta_weight.png differ
diff --git a/Project2-Character-Recognition/img/error_equation.png b/Project2-Character-Recognition/img/error_equation.png
new file mode 100644
index 0000000..724cd32 Binary files /dev/null and b/Project2-Character-Recognition/img/error_equation.png differ diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp index 11dd534..4a8708a 100644 --- a/Project2-Character-Recognition/src/main.cpp +++ b/Project2-Character-Recognition/src/main.cpp @@ -1,152 +1,116 @@ -/** - * @file main.cpp - * @brief Stream compaction test program - * @authors Kai Ninomiya - * @date 2015 - * @copyright University of Pennsylvania - */ - -#include -#include -#include -#include "testing_helpers.hpp" - -const int SIZE = 1 << 8; // feel free to change the size of array -const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; - -int main(int argc, char* argv[]) { - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes. - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - 
printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. - zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; -} +/** + * @file main.cpp + * @brief Stream compaction test program + * @authors Kai Ninomiya + * @date 2015 + * @copyright University of Pennsylvania + */ + +#include +#include +#include +#include "testing_helpers.hpp" +#include +#include +#include +#include +#include + +#define TRAINING 0 +#define CHAR_RECOGNITION 1 + +int main(int argc, char* argv[]) { + +#if CHAR_RECOGNITION + printf("\n"); + printf("*********************************\n"); + printf("** CHARACTER RECOGNITION TESTS **\n"); + printf("*********************************\n"); + + int numTotalInput = 52; + std::vector inputs; + std::vector expected; + + // collect character recognition data set + int numInput; + + std::string prefix = "../data-set/"; + std::string suffix = "info.txt"; + for (int i = 1; i < numTotalInput + 1; i++) { + std::stringstream buffer; + buffer << prefix << std::setfill('0') << std::setw(2) << i << suffix; + + 
std::ifstream inputFile(buffer.str()); + if (inputFile.is_open()) { + std::string line; + // get expected character + std::getline(inputFile, line); + expected.push_back(atoi(line.c_str()) / 255.0); + + // get number of characters in input + std::getline(inputFile, line); + numInput = atoi(line.c_str()); + + // get input + std::getline(inputFile, line); + + std::istringstream iss(line); + std::string oneInput; + int counter = 0; + float *input = new float[numInput]; + while (std::getline(iss, oneInput, ' ')) { + if (!oneInput.empty()) { + input[counter] = stof(oneInput) / 255.0; + counter++; + } + } + inputs.push_back(input); + } + } + +#if TRAINING + CharacterRecognition::train(numInput, inputs, expected, "char_recognition"); +#else // #if TRAINING + CharacterRecognition::run(numInput, inputs, expected, "char_recognition"); +#endif // #else // #if TRAINING + + for (int i = 0; i < numTotalInput; i++) { + delete[] inputs[i]; + } + +#else // #if CHAR_RECOGNITION + printf("\n"); + printf("*********************************\n"); + printf("*********** XOR TESTS ***********\n"); + printf("*********************************\n"); + + int numTotalInput = 4; + + std::vector inputs; + std::vector expected; + const int numInput = 2; + + float input1[numInput] = { 0, 0 }; + float input2[numInput] = { 0, 1 }; + float input3[numInput] = { 1, 0 }; + float input4[numInput] = { 1, 1 }; + inputs.push_back(input1); + inputs.push_back(input2); + inputs.push_back(input3); + inputs.push_back(input4); + + expected.push_back(0); + expected.push_back(1); + expected.push_back(1); + expected.push_back(0); + +#if TRAINING + CharacterRecognition::train(numInput, inputs, expected, "XOR"); +#else // #if TRAINING + CharacterRecognition::run(numInput, inputs, expected, "XOR"); +#endif // #else // #if TRAINING + +#endif // #else // #if CHAR_RECOGNITION + + system("pause"); // stop Win32 console from closing on exit +} diff --git a/Project2-Character-Recognition/src/testing_helpers.hpp b/Project2-Character-Recognition/src/testing_helpers.hpp index b28a8d2..d6cc4e3 100644 --- a/Project2-Character-Recognition/src/testing_helpers.hpp +++ b/Project2-Character-Recognition/src/testing_helpers.hpp @@ -1,76 +1,76 @@ -#pragma once - -#include -#include -#include -#include -#include - -template -int cmpArrays(int n, T *a, T *b) { - for (int i = 0; i < n; i++) { - if (a[i] != b[i]) { - printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]); - return 1; - } - } - return 0; -} - -void printDesc(const char *desc) { - printf("==== %s ====\n", desc); -} - -template -void printCmpResult(int n, T *a, T *b) { - printf(" %s \n", - cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); -} - -template -void printCmpLenResult(int n, int expN, T *a, T *b) { - if (n != expN) { - printf(" expected %d elements, got %d\n", expN, n); - } - printf(" %s \n", - (n == -1 || n != expN) ? "FAIL COUNT" : - cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); -} - -void zeroArray(int n, int *a) { - for (int i = 0; i < n; i++) { - a[i] = 0; - } -} - -void onesArray(int n, int *a) { - for (int i = 0; i < n; i++) { - a[i] = 1; - } -} - -void genArray(int n, int *a, int maxval) { - srand(time(nullptr)); - - for (int i = 0; i < n; i++) { - a[i] = rand() % maxval; - } -} - -void printArray(int n, int *a, bool abridged = false) { - printf(" [ "); - for (int i = 0; i < n; i++) { - if (abridged && i + 2 == 15 && n > 16) { - i = n - 2; - printf("... 
"); - } - printf("%3d ", a[i]); - } - printf("]\n"); -} - -template -void printElapsedTime(T time, std::string note = "") -{ - std::cout << " elapsed time: " << time << "ms " << note << std::endl; +#pragma once + +#include +#include +#include +#include +#include + +template +int cmpArrays(int n, T *a, T *b) { + for (int i = 0; i < n; i++) { + if (a[i] != b[i]) { + printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]); + return 1; + } + } + return 0; +} + +void printDesc(const char *desc) { + printf("==== %s ====\n", desc); +} + +template +void printCmpResult(int n, T *a, T *b) { + printf(" %s \n", + cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); +} + +template +void printCmpLenResult(int n, int expN, T *a, T *b) { + if (n != expN) { + printf(" expected %d elements, got %d\n", expN, n); + } + printf(" %s \n", + (n == -1 || n != expN) ? "FAIL COUNT" : + cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); +} + +void zeroArray(int n, int *a) { + for (int i = 0; i < n; i++) { + a[i] = 0; + } +} + +void onesArray(int n, int *a) { + for (int i = 0; i < n; i++) { + a[i] = 1; + } +} + +void genArray(int n, int *a, int maxval) { + srand(time(nullptr)); + + for (int i = 0; i < n; i++) { + a[i] = rand() % maxval; + } +} + +void printArray(int n, int *a, bool abridged = false) { + printf(" [ "); + for (int i = 0; i < n; i++) { + if (abridged && i + 2 == 15 && n > 16) { + i = n - 2; + printf("... "); + } + printf("%3d ", a[i]); + } + printf("]\n"); +} + +template +void printElapsedTime(T time, std::string note = "") +{ + std::cout << " elapsed time: " << time << "ms " << note << std::endl; } \ No newline at end of file diff --git a/Project2-Character-Recognition/weights_XOR.txt b/Project2-Character-Recognition/weights_XOR.txt new file mode 100644 index 0000000..3e1e69f --- /dev/null +++ b/Project2-Character-Recognition/weights_XOR.txt @@ -0,0 +1,6 @@ +10.1574 +0.801176 +20.0398 +0.80065 +41.0118 +-53.9937 diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md index 0e38ddb..6c8b1aa 100644 --- a/Project2-Stream-Compaction/README.md +++ b/Project2-Stream-Compaction/README.md @@ -1,14 +1,158 @@ -CUDA Stream Compaction -====================== +# Project 2a: CUDA Stream Compaction +**University of Pennsylvania, CIS 565: GPU Programming and Architecture, +Project 2 - Stream Compaction** -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +Caroline Lachanski: [LinkedIn](https://www.linkedin.com/in/caroline-lachanski/), [personal website](http://carolinelachanski.com/) -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +Tested on: Windows 10, i5-6500 @ 3.20GHz 16GB, GTX 1660 (personal computer) -### (TODO: Your README) +## Project Description + +This goal of this project was to gain to familiarity with writing parallel algorithms, specifically Scan and Stream Compaction (used to remove values of zero from an array), from scratch in CUDA. The implementations in this project were mainly based on [these slides](https://onedrive.live.com/view.aspx?resid=A6B78147D66DD722!93669&ithint=file%2cpptx&authkey=!AOZdbr6KT8OQ9fs). + +### Scan + +The Scan algorithm (also known as prefix sum) takes in an input array and returns an array where each element j in the output array is a sum of the previous elements from the input array. The Scan algorithm can be inclusive or exclusive. 
+
+![](img/scan_example.png)
+
+### Stream Compact
+The goal of Stream Compaction is, given an input array of elements, to create a new array containing the elements of the input array that meet a certain criteria (e.g. being non-zero or non-null), while preserving the order of elements from the input array.
+
+![](img/compact_goal_visual.png)
+
+Stream Compaction can be accomplished in a few steps. Let In be our input array.
+1. Map the input array to a new array of bools B where B[i] = 1 if In[i] meets the specified criteria, and B[i] = 0 if In[i] does not. (B is the yellow array in the picture below.)
+2. Run an exclusive Scan on B to get C, the Scan result. C will contain the indices into our output array. (C is the first blue array in the picture below.)
+3. Run Scatter on C, the result of Scan: write element In[i] to the output array at Out[C[i]] if B[i] = 1.
+
+![](img/compact_visual.png)
+
+This project contains several implementations of the above algorithms: a CPU version, a naive GPU version, a work-efficient GPU version, and one using the Thrust library functions.
+
+## Implementation Details
+
+### CPU
+
+The CPU implementations were done simply to allow me to become more familiar with the algorithms. These are straightforward to implement, and they include Scan, Scatter, Stream Compaction without Scan, and Stream Compaction using Scan and Scatter. For example, here is pseudo-code for a single-threaded Scan:
+
+![](img/CPU_scan.png)
+
+
+### Naive
+
+The first GPU implementation of Scan is a naive one, and it produces an inclusive Scan that must be converted (using a shift-right kernel) into an exclusive Scan.
+
+![](img/naive_parallel_scan.png)
+
+![](img/naive_parallel_scan_visual.png)
+
+### Work-Efficient
+
+Next we have work-efficient GPU versions of Scan, Scatter, and Stream Compaction. The work-efficient version of Scan involves two phases, an upsweep:
+
+![](img/efficient_upsweep.png)
+
+![](img/efficient_upsweep_visual.png)
+
+...and a downsweep:
+
+![](img/efficient_downsweep.png)
+
+![](img/efficient_downsweep_visual.png)
+
+Because of work-efficient Scan's underlying conceptual structure as a balanced binary tree, the input array must be padded with zeros until its size is a power of 2.
+
+We can then use Scan in a GPU implementation of Stream Compaction, which also uses Scatter and MapToBoolean kernels.
+
+### Thrust
+
+Lastly, we perform Stream Compaction with the Thrust library's thrust::exclusive_scan.
+
+## Analysis
+
+We can look at the performance of the various implementations. Here's a graph of the performance of Scan vs. block size:
+
+![](img/scan_graph.png)
+
+A faster implementation takes less time, so lower lines on the graph indicate better performance than higher ones. It may seem like "Thrust (power of 2)" is missing, but it is right beneath the "Thrust" line, since their performances were so similar.
+
+Oddly enough, Work-Efficient performs the worst, Naive the second worst, and CPU better than both of those. This is not necessarily unexpected. The current implementation of Work-Efficient actually has very poor warp occupancy: the number of threads doing useful work decreases by half at each depth, but those threads cannot be retired early because of how they are organized. Some re-indexing and reorganization of the threads should fix this issue, as the sketch below illustrates.
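+
+The following is illustrative of the indexing change meant here, not the code used for the measurements above: at depth d, launch only numActive = n / 2^(d+1) threads with stride = 2^d, and map each dense thread index to its sparse array offset so the live threads stay contiguous:
+
+```
+// Illustrative re-indexed upsweep: thread t handles the pair ending at
+// (t + 1) * 2 * stride - 1, so active threads are packed together and
+// whole warps can retire as the depth increases.
+__global__ void kernUpsweepDense(int numActive, int stride, int *data) {
+    int t = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (t >= numActive) {
+        return;
+    }
+    int index = (t + 1) * 2 * stride - 1;
+    data[index] += data[index - stride];
+}
+```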
+
+Additionally, the upsweep and downsweep outer for-loops are run on the CPU, launching one kernel per depth, which does not help the runtime at all. But we can clearly see from the Thrust results that a GPU implementation of Scan can be far faster than a CPU implementation... as long as you do it right. As to why Thrust performs so much better, possible reasons include better indexing for higher warp occupancy and better hiding of memory latency.
+
+One thing to notice: for CPU and Naive, the power of 2 cases (which run on an array whose size is a power of 2) take slightly longer than their non-power of 2 counterparts (which run on an array whose size is only slightly smaller). This can likely be attributed to the fact that for these more basic, less parallelized implementations, a larger input simply takes longer to process (note that CPU, the most serialized implementation, sees the biggest difference between the two). For the Work-Efficient cases, there is very little difference between power of 2 and non-power of 2, since the non-power of 2 input is padded with zeros until its size reaches a power of 2. The Thrust implementation is similarly unaffected by whether the size is a power of 2 or not, likely because it is very well parallelized or also pads to a power of 2 size.
+
+A second thing to notice: changing block size does not seem to have a significant effect on any of the implementations, although block sizes of 64 and 1024 seem to cause slightly worse performance in my GPU implementations.
+
+Here's a graph of the performance of Compact vs. block size:
+
+![](img/compact_graph.png)
+
+Again, a better implementation takes less time, so lower lines on the graph indicate better performance. Also, it may seem like "CPU (no scan, power of 2)" is missing, but it is right beneath the "CPU (no scan)" line, since their performances were so similar.
+
+We see performance patterns similar to those for Scan. Work-Efficient Compact performs worse than the CPU version, and the CPU version that uses a CPU Scan performs worse than the CPU versions that do not use Scan. This perhaps indicates that Scan only yields a faster Compact when it is parallelized. Regarding the Work-Efficient results: first, Work-Efficient Compact uses Work-Efficient Scan, which already performs poorly, as discussed above. Additionally, the Work-Efficient timing includes several memory copies, which can decrease performance.
+
+Again, block size seems to have relatively little effect on performance.
+
+## Output
+
+Here is the output of the test cases for 33,554,432 elements, with a block size of 1024.
+```
+
+****************
+** SCAN TESTS **
+****************
+ [ 30 46 22 3 29 47 40 19 4 43 5 18 39 ... 24 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 179.922ms (std::chrono Measured)
+ [ 0 30 76 98 101 130 177 217 236 240 283 288 306 ... 821999768 821999792 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 64.8479ms (std::chrono Measured)
+ [ 0 30 76 98 101 130 177 217 236 240 283 288 306 ...
821999720 821999754 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 1067.39ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 1046.83ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 2464.74ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 2473.12ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.0384ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 2.14227ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 1 0 1 0 1 0 2 2 3 2 3 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 98.5023ms (std::chrono Measured) + [ 1 1 1 1 2 2 3 2 3 3 1 2 1 ... 3 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 100.227ms (std::chrono Measured) + [ 1 1 1 1 2 2 3 2 3 3 1 2 1 ... 2 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 445.056ms (std::chrono Measured) + [ 1 1 1 1 2 2 3 2 3 3 1 2 1 ... 3 3 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2690.43ms (CUDA Measured) + [ 1 1 1 1 2 2 3 2 3 3 1 2 1 ... 3 3 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 2689.84ms (CUDA Measured) + [ 1 1 1 1 2 2 3 2 3 3 1 2 1 ... 2 3 ] + passed +Press any key to continue . . . +``` -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) diff --git a/Project2-Stream-Compaction/img/CPU_scan.png b/Project2-Stream-Compaction/img/CPU_scan.png new file mode 100644 index 0000000..8d60134 Binary files /dev/null and b/Project2-Stream-Compaction/img/CPU_scan.png differ diff --git a/Project2-Stream-Compaction/img/compact_goal_visual.png b/Project2-Stream-Compaction/img/compact_goal_visual.png new file mode 100644 index 0000000..ee72925 Binary files /dev/null and b/Project2-Stream-Compaction/img/compact_goal_visual.png differ diff --git a/Project2-Stream-Compaction/img/compact_graph.png b/Project2-Stream-Compaction/img/compact_graph.png new file mode 100644 index 0000000..531b38d Binary files /dev/null and b/Project2-Stream-Compaction/img/compact_graph.png differ diff --git a/Project2-Stream-Compaction/img/compact_visual.png b/Project2-Stream-Compaction/img/compact_visual.png new file mode 100644 index 0000000..ce64f38 Binary files /dev/null and b/Project2-Stream-Compaction/img/compact_visual.png differ diff --git a/Project2-Stream-Compaction/img/efficient_downsweep.png b/Project2-Stream-Compaction/img/efficient_downsweep.png new file mode 100644 index 0000000..60d90c7 Binary files /dev/null and b/Project2-Stream-Compaction/img/efficient_downsweep.png differ diff --git a/Project2-Stream-Compaction/img/efficient_downsweep_visual.png b/Project2-Stream-Compaction/img/efficient_downsweep_visual.png new file mode 100644 index 0000000..fecc70c Binary files /dev/null and b/Project2-Stream-Compaction/img/efficient_downsweep_visual.png differ diff --git a/Project2-Stream-Compaction/img/efficient_upsweep.png b/Project2-Stream-Compaction/img/efficient_upsweep.png new file mode 100644 index 0000000..064d717 Binary files /dev/null and b/Project2-Stream-Compaction/img/efficient_upsweep.png differ diff --git a/Project2-Stream-Compaction/img/efficient_upsweep_visual.png b/Project2-Stream-Compaction/img/efficient_upsweep_visual.png new file mode 100644 index 0000000..dee394a Binary 
files /dev/null and b/Project2-Stream-Compaction/img/efficient_upsweep_visual.png differ diff --git a/Project2-Stream-Compaction/img/naive_parallel_scan.png b/Project2-Stream-Compaction/img/naive_parallel_scan.png new file mode 100644 index 0000000..db7b5b3 Binary files /dev/null and b/Project2-Stream-Compaction/img/naive_parallel_scan.png differ diff --git a/Project2-Stream-Compaction/img/naive_parallel_scan_visual.png b/Project2-Stream-Compaction/img/naive_parallel_scan_visual.png new file mode 100644 index 0000000..c1a18ce Binary files /dev/null and b/Project2-Stream-Compaction/img/naive_parallel_scan_visual.png differ diff --git a/Project2-Stream-Compaction/img/scan_example.png b/Project2-Stream-Compaction/img/scan_example.png new file mode 100644 index 0000000..83e64ed Binary files /dev/null and b/Project2-Stream-Compaction/img/scan_example.png differ diff --git a/Project2-Stream-Compaction/img/scan_graph.png b/Project2-Stream-Compaction/img/scan_graph.png new file mode 100644 index 0000000..15fc959 Binary files /dev/null and b/Project2-Stream-Compaction/img/scan_graph.png differ diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp index d016553..83470d6 100644 --- a/Project2-Stream-Compaction/src/main.cpp +++ b/Project2-Stream-Compaction/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 25; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -54,11 +54,11 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ + //For bug-finding only: Array of 1s to help find bugs in stream compaction or scan +// onesArray(SIZE, c); +// printDesc("1s array for finding bugs"); +// StreamCompaction::Naive::scan(SIZE, c, c); +// printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); @@ -67,6 +67,12 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(NPOT, b, c); + //For bug-finding only: Array of 1s to help find bugs in stream compaction or scan +// identityArray(SIZE, c); +// printDesc("work-efficient scan, 1s array for finding bugs"); +// StreamCompaction::Efficient::scan(SIZE, c, c); +// printArray(SIZE, c, true); + zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); @@ -137,14 +143,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git 
a/Project2-Stream-Compaction/src/testing_helpers.hpp b/Project2-Stream-Compaction/src/testing_helpers.hpp index b28a8d2..59902fe 100644 --- a/Project2-Stream-Compaction/src/testing_helpers.hpp +++ b/Project2-Stream-Compaction/src/testing_helpers.hpp @@ -43,6 +43,12 @@ void zeroArray(int n, int *a) { } } +void identityArray(int n, int *a) { + for (int i = 0; i < n; i++) { + a[i] = i; + } +} + void onesArray(int n, int *a) { for (int i = 0; i < n; i++) { a[i] = 1; diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt index cdbef77..185a604 100644 --- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt +++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt @@ -13,5 +13,5 @@ set(SOURCE_FILES cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_75 ) diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu index 2ed6d63..edadc07 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.cu +++ b/Project2-Stream-Compaction/stream_compaction/common.cu @@ -23,7 +23,12 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) { + return; + } + + bools[index] = idata[index] == 0 ? 0 : 1; } /** @@ -32,8 +37,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO - } + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) { + return; + } + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/Project2-Stream-Compaction/stream_compaction/common.h b/Project2-Stream-Compaction/stream_compaction/common.h index 996997e..d8831c2 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.h +++ b/Project2-Stream-Compaction/stream_compaction/common.h @@ -10,6 +10,8 @@ #include #include +#define blockSize 1024 + #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu index a2d3e6c..67a041d 100644 --- a/Project2-Stream-Compaction/stream_compaction/cpu.cu +++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu @@ -19,10 +19,25 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = idata[i - 1] + odata[i - 1]; + } + timer().endCpuTimer(); } + /** + * CPU scan (prefix sum) with no timers + */ + void scanNoTimer(int n, int *odata, const int *idata) { + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = idata[i - 1] + odata[i - 1]; + } + } + /** * CPU stream compaction without using the scan function. 
 
         /**
          * CPU stream compaction without using the scan function.
          *
@@ -30,11 +45,34 @@
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            // copy over non-zero values to odata
+            int currIndex = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[currIndex] = idata[i];
+                    currIndex++;
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+            return currIndex;
         }
 
+        /**
+         * CPU scatter
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int scatter(int n, int *odata, const int *idata, const int *bools, const int *scanOutput) {
+            for (int i = 0; i < n; i++) {
+                if (bools[i] == 1) {
+                    odata[scanOutput[i]] = idata[i];
+                }
+            }
+            // scanOutput is an exclusive scan, so add the last flag to count a kept final element
+            return scanOutput[n - 1] + bools[n - 1];
+        }
+
         /**
          * CPU stream compaction using scan and scatter, like the parallel version.
          *
@@ -42,9 +80,26 @@
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int *bools = new int[n];
+            int *scanResult = new int[n];
+
+            // map input to binary array
+            for (int i = 0; i < n; i++) {
+                bools[i] = idata[i] == 0 ? 0 : 1;
+            }
+
+            // scan binary array
+            scanNoTimer(n, scanResult, bools);
+
+            // scatter
+            int outSize = scatter(n, odata, idata, bools, scanResult);
+
+            delete[] bools;
+            delete[] scanResult;
+
             timer().endCpuTimer();
-            return -1;
+
+            return outSize;
         }
     }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu
index 2db346e..6ed4913 100644
--- a/Project2-Stream-Compaction/stream_compaction/efficient.cu
+++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu
@@ -12,15 +12,151 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernScanUpsweep(int n, int iteration, int *buffer) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            // integer shifts instead of ceil(pow(...)) avoid double-precision rounding at large sizes
+            int power1 = 1 << (iteration + 1);
+            if (index % power1 == 0) {
+                int power2 = 1 << iteration;
+                buffer[index + power1 - 1] += buffer[index + power2 - 1];
+            }
+        }
+
+        __global__ void kernScanDownsweep(int n, int iteration, int *buffer) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            int power1 = 1 << (iteration + 1);
+            if (index % power1 == 0) {
+                int power2 = 1 << iteration;
+
+                // swap the left child's partial sum down, accumulating it into the right child
+                int temp = buffer[index + power2 - 1];
+                buffer[index + power2 - 1] = buffer[index + power1 - 1];
+                buffer[index + power1 - 1] += temp;
+            }
+        }
+
+        // finds the next power of 2 greater than or equal to n
+        int nextPowerOfTwo(int n) {
+            if (n && !(n & (n - 1)))
+                return n;
+
+            int count = 0;
+            while (n != 0) {
+                n >>= 1;
+                count++;
+            }
+
+            return 1 << count;
+        }
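+
+        // sanity trace of the balanced-tree passes on an all-ones array of 8 (illustrative, not project output):
+        //   upsweep:    [1,1,1,1,1,1,1,1] -> [1,2,1,2,1,2,1,2] -> [1,2,1,4,1,2,1,4] -> [1,2,1,4,1,2,1,8]
+        //   zero root:  [1,2,1,4,1,2,1,0]
+        //   downsweep:  [1,2,1,0,1,2,1,4] -> [1,0,1,2,1,4,1,6] -> [0,1,2,3,4,5,6,7], the exclusive scan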
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            // pad up to the next power of two so the balanced tree is complete
+            int nPowerOfTwo = nextPowerOfTwo(n);
+
+            int *dev_buffer;
+
+            // malloc device buffer
+            cudaMalloc((void**)&dev_buffer, nPowerOfTwo * sizeof(int));
+            checkCUDAError("cudaMalloc dev_buffer failed!");
+
+            // copy input to device buffer
+            cudaMemcpy(dev_buffer, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_buffer idata failed!");
+
+            // fill rest of device buffer with zero
+            cudaMemset(dev_buffer + n, 0, (nPowerOfTwo - n) * sizeof(int));
+            checkCUDAError("cudaMemset dev_buffer failed!");
+
+            timer().startGpuTimer();
+
+            // upsweep
+            dim3 gridSize = dim3((nPowerOfTwo + blockSize - 1) / blockSize, 1, 1);
+            for (int d = 0; d < ilog2ceil(nPowerOfTwo); d++) {
+                kernScanUpsweep<<<gridSize, blockSize>>>(nPowerOfTwo, d, dev_buffer);
+                checkCUDAError("kernScanUpsweep failed!");
+            }
+
+            cudaDeviceSynchronize();
+
+            // set root to zero
+            cudaMemset(dev_buffer + nPowerOfTwo - 1, 0, 1 * sizeof(int));
+            checkCUDAError("cudaMemset dev_buffer failed!");
+
+            // downsweep
+            for (int d = ilog2ceil(nPowerOfTwo) - 1; d >= 0; d--) {
+                kernScanDownsweep<<<gridSize, blockSize>>>(nPowerOfTwo, d, dev_buffer);
+                checkCUDAError("kernScanDownsweep failed!");
+            }
+
+            cudaDeviceSynchronize();
+            timer().endGpuTimer();
+
+            // copy output to host
+            cudaMemcpy(odata, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata dev_buffer failed!");
+
+            cudaFree(dev_buffer);
+            checkCUDAError("cudaFree failed!");
         }
 
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata, NO TIMER
+         */
+        void scanNoTimer(int n, int *odata, const int *idata) {
+            int nPowerOfTwo = nextPowerOfTwo(n);
+
+            int *dev_buffer;
+
+            // malloc device buffer
+            cudaMalloc((void**)&dev_buffer, nPowerOfTwo * sizeof(int));
+            checkCUDAError("cudaMalloc dev_buffer failed!");
+
+            // copy input to device buffer
+            cudaMemcpy(dev_buffer, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_buffer idata failed!");
+
+            // fill rest of device buffer with zero
+            cudaMemset(dev_buffer + n, 0, (nPowerOfTwo - n) * sizeof(int));
+            checkCUDAError("cudaMemset dev_buffer failed!");
+
+            // upsweep (the grid must cover the padded size, not just n)
+            dim3 gridSize = dim3((nPowerOfTwo + blockSize - 1) / blockSize, 1, 1);
+            for (int d = 0; d < ilog2ceil(nPowerOfTwo); d++) {
+                kernScanUpsweep<<<gridSize, blockSize>>>(nPowerOfTwo, d, dev_buffer);
+                checkCUDAError("kernScanUpsweep failed!");
+            }
+
+            cudaDeviceSynchronize();
+
+            // set root to zero
+            cudaMemset(dev_buffer + nPowerOfTwo - 1, 0, 1 * sizeof(int));
+            checkCUDAError("cudaMemset dev_buffer failed!");
+
+            // downsweep
+            for (int d = ilog2ceil(nPowerOfTwo) - 1; d >= 0; d--) {
+                kernScanDownsweep<<<gridSize, blockSize>>>(nPowerOfTwo, d, dev_buffer);
+                checkCUDAError("kernScanDownsweep failed!");
+            }
+
+            cudaDeviceSynchronize();
+
+            // copy output to host
+            cudaMemcpy(odata, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata dev_buffer failed!");
+
+            cudaFree(dev_buffer);
+            checkCUDAError("cudaFree failed!");
+        }
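+
+        // a small trace of the compact pipeline below (illustrative values, not project output):
+        //   idata   = [3, 0, 7, 0, 4]
+        //   bools   = [1, 0, 1, 0, 1]      map to boolean
+        //   indices = [0, 1, 1, 2, 2]      exclusive scan of bools
+        //   odata   = [3, 7, 4], count 3   scatter each kept element to indices[i]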
 
         /**
          * Performs stream compaction on idata, storing the result into odata.
          * All zeroes are discarded.
@@ -31,10 +167,69 @@
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+            int *dev_in;
+            int *dev_out;
+            int *dev_bools;
+            int *dev_indices;
+
+            // malloc device buffers
+            cudaMalloc((void**)&dev_in, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_in failed!");
+
+            cudaMalloc((void**)&dev_out, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_out failed!");
+
+            cudaMalloc((void**)&dev_bools, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_bools failed!");
+
+            cudaMalloc((void**)&dev_indices, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_indices failed!");
+
+            // copy input to device buffer
+            cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_in idata failed!");
+
+            timer().startGpuTimer();
+
+            // compute bools buffer
+            Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_in);
+            checkCUDAError("kernMapToBoolean failed!");
+
+            cudaDeviceSynchronize();
+
+            // copy bools to host
+            int *bools = new int[n];
+            cudaMemcpy(bools, dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy bools dev_bools failed!");
+
+            // run exclusive scan on bools
+            int *indices = new int[n];
+            scanNoTimer(n, indices, bools);
+            // the scan is exclusive, so a kept final element is not counted in indices[n - 1]
+            int outputSize = bools[n - 1] == 0 ? indices[n - 1] : indices[n - 1] + 1;
+
+            // copy indices to device
+            cudaMemcpy(dev_indices, indices, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_indices indices failed!");
+
+            // scatter
+            Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_out, dev_in, dev_bools, dev_indices);
+            cudaDeviceSynchronize();
+
+            timer().endGpuTimer();
+
+            // copy output to host
+            cudaMemcpy(odata, dev_out, outputSize * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata dev_out failed!");
+
+            delete[] bools;
+            delete[] indices;
+
+            cudaFree(dev_in);
+            cudaFree(dev_out);
+            cudaFree(dev_bools);
+            cudaFree(dev_indices);
+            checkCUDAError("cudaFree failed!");
+
+            return outputSize;
         }
     }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..4b1f1ac 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -11,15 +11,74 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        __global__ void kernInclusiveScanIteration(int n, int iteration, int *out, int *in) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            // integer shift instead of ceil(pow(...)) fixes the rounding issue previously hit at 2048
+            int nextIndex = 1 << (iteration - 1);
+            if (index >= nextIndex) {
+                out[index] = in[index - nextIndex] + in[index];
+            }
+        }
+
+        __global__ void kernShiftRight(int n, int *out, int *in) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+
+            out[index] = index == 0 ? 0 : in[index - 1];
+        }
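+
+        // per-level trace for [1, 2, 3, 4] (illustrative, not project output):
+        //   d=1 (offset 1): [1, 3, 5, 7]
+        //   d=2 (offset 2): [1, 3, 6, 10]   inclusive scan
+        //   shift right:    [0, 1, 3, 6]    exclusive scan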
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            int *dev_in;
+            int *dev_out;
+
+            // malloc device buffers
+            cudaMalloc((void**)&dev_in, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_in failed!");
+            cudaMalloc((void**)&dev_out, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_out failed!");
+
+            // copy input to device
+            cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_in idata failed!");
+            cudaMemcpy(dev_out, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_out idata failed!");
+
+            // perform inclusive scan
+            timer().startGpuTimer();
+
+            dim3 gridSize = dim3((n + blockSize - 1) / blockSize, 1, 1);
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                // run one iteration
+                kernInclusiveScanIteration<<<gridSize, blockSize>>>(n, d, dev_out, dev_in);
+                checkCUDAError("kernInclusiveScanIteration failed!");
+
+                // copy out to in (ping-pong so no thread reads a value written this iteration)
+                cudaMemcpy(dev_in, dev_out, n * sizeof(int), cudaMemcpyDeviceToDevice);
+                checkCUDAError("cudaMemcpy dev_in dev_out failed!");
+            }
+
+            // convert to exclusive scan
+            kernShiftRight<<<gridSize, blockSize>>>(n, dev_out, dev_in);
+            checkCUDAError("kernShiftRight failed!");
+
+            timer().endGpuTimer();
+
+            // copy output to host
+            cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata dev_out failed!");
+
+            cudaFree(dev_in);
+            cudaFree(dev_out);
+            checkCUDAError("cudaFree failed!");
         }
     }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..f7e403b 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -18,11 +18,28 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int *dev_in;
+            cudaMalloc((void**)&dev_in, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_in failed!");
+
+            // copy input to device buffer
+            cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_in idata failed!");
+
+            thrust::device_vector<int> dev_thrust_in(dev_in, dev_in + n);
+            thrust::device_vector<int> dev_thrust_out(n);
+
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_thrust_in.begin(), dev_thrust_in.end(), dev_thrust_out.begin());
+            checkCUDAError("thrust::exclusive_scan failed!");
             timer().endGpuTimer();
+
+            int *dev_out = thrust::raw_pointer_cast(dev_thrust_out.data());
+            cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata dev_out failed!");
+
+            cudaFree(dev_in);
+            checkCUDAError("cudaFree dev_in failed!");
         }
     }
 }
diff --git a/README.md b/README.md
index 3a0b2fe..2474c95 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,16 @@
-CUDA Number Algorithms
-======================
+# Project 2: CUDA Number Algorithms
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
+Project 2**
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+Caroline Lachanski: [LinkedIn](https://www.linkedin.com/in/caroline-lachanski/), [personal website](http://carolinelachanski.com/)
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+Tested on: Windows 10, i5-6500 @ 3.20GHz 16GB, GTX 1660 (personal computer)
 
-### (TODO: Your README)
+## Project Description
 
-Link to the readmes of the other two subprojects.
+This repository contains two subprojects:
 
-Add anything else you think is relevant up to this point.
-(Remember, this is public, so don't put anything here that you don't want to share with the world.)
+1. [Stream Compaction](https://github.com/clach/Project2-Number-Algorithms/blob/master/Project2-Stream-Compaction)
+
+2. [Character Recognition](https://github.com/clach/Project2-Number-Algorithms/tree/master/Project2-Character-Recognition)