Keys_Falcon_Flansburgh_RMD.Rmd

---
title: "Statistical Analysis of the Relationship Between Mental Health, Emotional Wellbeing, and Athletic Identity "
author: "Falcon-Flansburgh, Victoria, and Keys, Rob"
output: html_document
date: "2024-07-10"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

The Covid-19 pandemic forced lock downs and quarantines for nearly the entire population, and we attempt to reveal the connection between post-pandemic mental health and the level of athletics in an individual's life. We use multiple statistical techniques to analyze the data, such as multiple linear regression, logistic regression, regression trees, and ANOVA tests.

Our chosen data set was collected in May of 2021 through an online questionnaire, where participants answered about their current emotional well being and mental health, how athletics affected their self-identity, the impact athletics have in their life, the length of their covid isolation, and basic background information. The data set has over 80 columns of data with 688 participants, and this large size allows for complex analysis across data columns and an ample sample size for analysis.

[Data_Sheet_1_Comparing Mental Health of Athletes and Non-athletes as They Emerge From a COVID-19 Pandemic Lockdown.CSV (figshare.com)](https://figshare.com/articles/dataset/Data_Sheet_1_Comparing_Mental_Health_of_Athletes_and_Non-athletes_as_They_Emerge_From_a_COVID-19_Pandemic_Lockdown_CSV/14623635/1) 

The code block below loads the needed libraries and one package install as necessary.

```{r}

#install.packages("car") #Uncomment and run once

# Libraries
library(readr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(car)
library(tree)
```

The code block below imports our data from the file to a variable.

```{r}
originalData <- read_csv('Data_Sheet_1_Comparing Mental Health of Athletes and Non-athletes as They Emerge From a COVID-19 Pandemic Lockdown.csv')
```

# Multiple Linear Regression

*Data cleaning*

The code block below eliminates rows with values that indicate a null or not understandable response for the 4 variables about to be used in the next code block.

```{r}
# Data cleanup
Data <- originalData %>% filter(originalData$`Emotional Wellbeing` < 999)
Data <- Data %>% filter(Data$`I consider myself an athlete` < 999)
Data <- Data %>% filter(Data$`most of my friends are athletes` < 999)
Data <- Data %>% filter(Data$`Sport is the most important part of my life` < 999)

Data

```

*Model Building*

The code block below uses multiple linear regression to measure the effects of self-identifying as an athlete, having athlete friends, and having sports as the most important aspect of your life on the effect of emotional wellbeing.

```{r}
Emo_Data <- Data %>% select(`Emotional Wellbeing`, `I consider myself an athlete`, `most of my friends are athletes`, `Sport is the most important part of my life`) #create a dataframe with just the relevant variables

Emo_Well_Fit <- lm(Emo_Data$`Emotional Wellbeing` ~ Emo_Data$`I consider myself an athlete` + Emo_Data$`most of my friends are athletes` + Emo_Data$`Sport is the most important part of my life`) #create the model

summary(Emo_Well_Fit)

avPlots(Emo_Well_Fit) #create graphs for each variable's relationship with the predicted variable

```

*Analysis*

Looking at the individual variable graphs, we can see a slight positive correlation between both 'I consider myself an athlete' and 'Most of my friends are athletes' with Emotional well being, and a slight negative correlation between 'Sport is the most important part of my life' with Emotional well being. By comparing the p values to 0.05 to determine significance, we can see that both 'I consider myself an athlete' and 'Sport is the most important part of my life' are significant, and we can say they impact emotional well being. Despite both being sport-related metrics, we can see that they have opposite signs, indicating that they have opposite effects on emotional well being, which makes sense that being an athlete and working out will improve your mental health, but valuing a sport above all else and then not being able to play during the lockdown would negatively affect you.

# Linear Regression

*Model Building*

The code block below uses linear regression to predict emotional wellbeing for athletes and non-athletes using weeks spent social distancing, as well as plotting the results.

```{r}
ggplot(Data, aes(x=Data$`Weeks Social Distancing`, y=Data$`Emotional Wellbeing`, color=factor(Data$`Athlete/Non-Athlete`)))+
geom_point()+
geom_smooth(method=lm, se=FALSE)+
scale_color_manual(name="",values = c("coral", "blue"), labels = c('Athletes', 'Non-Athletes')) #cleans up legend

Emo_Weeks_Fit <- lm(Data$`Emotional Wellbeing` ~ Data$`Weeks Social Distancing`*as.factor(Data$`Athlete/Non-Athlete`), data = Data)
summary(Emo_Weeks_Fit)
```

*Analysis*

The regression line for athletes and non athletes shows that athletes' emotional wellbeing has a positive correlation with weeks spent social distancing, while non-athletes' emotional wellbeing has a negative correlation. This could be due to a variety of factors associated with being an athlete, but would suggest that athletes seemed to handle the social reduction better than non-athletes, actually having social distancing improve their mental health, compared to the expected result of a negative correlation like non-athletes showed. However, none of the p values are less than 0.05, and therefore this data is statistically insignificant, and we cannot draw any conclusions from it.

# Logistic Regression

*Data cleaning*

The code block below creates a dataset clean of null/unusable data for the desired variables. It then splits it into training and testing data for the model to use.

```{r}
logData = originalData %>% select(`Athlete/Non-Athlete`,`Emotional Wellbeing`,`Social Wellbeing`,`Psychological Wellbeing`)

logData = logData %>% filter(logData$`Athlete/Non-Athlete` < 999)
logData = logData %>% filter(logData$`Emotional Wellbeing` < 999)
logData = logData %>% filter(logData$`Social Wellbeing` < 999)
logData = logData %>% filter(logData$`Psychological Wellbeing` < 999)

train <- sample(1:nrow(logData), nrow(logData)*0.75)
log_train <- logData[train, ]
log_test <- logData[-train, ]
```

*Model Building*

This code block creates a logistic regression model using 3 "well-being" measures to attempt to predict whether the respondent is an athlete or non/athlete, and then tests the accuracy of the model.

```{r}
log_model <- glm(as.factor(`Athlete/Non-Athlete`) ~ `Emotional Wellbeing` + `Social Wellbeing` + `Psychological Wellbeing`, data = log_train, family = binomial())
summary(log_model)

test_pred <- predict(log_model, log_test, type = "response")
prediction <- ifelse(test_pred > 0.50, 1, 0)

confusion_matrix <- table(Predicted = prediction, Actual = log_test$`Athlete/Non-Athlete`)
confusion_matrix
```

*Analysis*

Interpreting the coefficients of our log model shows that per unit increase in emotional wellbeing, the log odds of the individual being an athlete increases by 0.018, per unit increase in social wellbeing, the log odds of being an athlete increase by 0.042, and per unit increase in psychological wellbeing, the log odds decrease by 0.036. This would suggest that athletes have better social well-being, potentially due team sports and sports communities, but they also have worse emotional and psychological wellbeing coming out of the lockdown. However, none of the p values are significant (\<0.05), and therefore we cannot draw conclusions from the data.

# Paired variables graphs

*Data cleaning*

This code block creates and filters the data to the necessary variables for our future variable pairing

```{r}
mlrData <- originalData %>% select(`MHC-SF OVERALL`,`AIMS_ TOTAL`,`I consider myself an athlete`,`I have many goals related to sport`,`most of my friends are athletes`,`Sport is the most important part of my life`,`I spend more time thinking about sport than anything else`,`I feel bad about myself when I do badly in sport`,`I would be very depressed if I were injured and could not compete in sport`)

mlrData <- mlrData %>% filter(`AIMS_ TOTAL` <= 100, `MHC-SF OVERALL` <= 100,`I consider myself an athlete`<= 100,`I have many goals related to sport`<= 100,`most of my friends are athletes`<= 100,`Sport is the most important part of my life`<= 100,`I spend more time thinking about sport than anything else`<= 100,`I feel bad about myself when I do badly in sport`<= 100,`I would be very depressed if I were injured and could not compete in sport`<= 100)
```

*Model Building*

This code block creates graphs to illustrate the relationship between a series of athletic variables to 'MHC-SF Overall', a comprehensive mental health statistic created by using other survey answers.

```{r}
pairs(mlrData,lower.panel=NULL,)
```

*Analysis*

Its hard to determine any linear relationships from these graphhs due to the data being largely evenly spread, not indicating a positive or negative relationship between most variables. There is a strong positive linear correlation between 'AIMS_TOTAL' with all the other predictor variables, but this is to be expected because 'AIMS_TOTAL' is a comprehensive athletic identity score, and the other predictors are very similar metrics. The only variable that appears it could have a linear correlation with 'MHC-SF' is 'My goals are related to sports', where we can see a void in the lower left box, indicating some level of a negative correlation between having athletic goals and mental health, perhaps a consequence of not being able to meet those goals during the lockdown.

# Regression Trees

*Data cleaning*

The code block below does some additional cleaning of the data, selecting usable columns, and splits it into a training set and a test set for use in regression trees.

```{r warning=FALSE}
cleanData = originalData[c(2,3,4,7,8,9,10,12,14:27,34:84)]
for(i in 1:73){
  cleanData = cleanData %>% filter(cleanData[i] < 100)
}
train <- sample(1:nrow(cleanData), nrow(cleanData)*0.75)
training = cleanData[train,]
testing = cleanData[-train,]
```

*Model Building*

The code block below creates a regression tree to predict "emotional wellbeing" using all available predictors, minus those too similar to emotional wellbeing to be useful.

```{r warning=FALSE}
specific_tree_model <- tree(cleanData$`Emotional Wellbeing` ~ cleanData$`Gender:` + cleanData$`Age Group:` + cleanData$`Country During Lockdown` + cleanData$`Marital Status:` + cleanData$`Smoking Status:` + cleanData$`Five Fruit and Veg` + cleanData$`Hours sleep:` + cleanData$`Shielded?` + cleanData$`Weeks Social Distancing` + cleanData$`# in lockdown bubble:` + cleanData$`Athlete/Non-Athlete` + cleanData$`AIMS_ TOTAL` + cleanData$`Social Identity` + cleanData$`I consider myself an athlete` + cleanData$`I have many goals related to sport` + cleanData$`most of my friends are athletes` + cleanData$`Exclusivity` + cleanData$`Sport is the most important part of my life` + cleanData$`I spend more time thinking about sport than anything else` + cleanData$`Negative Affectivity` + cleanData$`I feel bad about myself when I do badly in sport` + cleanData$`I would be very depressed if I were injured and could not compete in sport` + cleanData$`RES_TOTAL` + cleanData$`I tend to bounce back quickly after hard times` + cleanData$`I have a hard time making it through stressful events*` + cleanData$`It does not take me long to recover from a stressful event` + cleanData$`It is hard for me to snap back when something bad happens*` + cleanData$`I usually come through difficult times with little trouble` + cleanData$`I tend to take a long time to get over setbacks in my life*` + cleanData$`I miss having people around` + cleanData$`There are many people I can trust completely*` + cleanData$`There are enough people I feel close to*` + cleanData$`There are plenty of people I can rely on when I have problems*`, data = training)
plot(specific_tree_model)
text(specific_tree_model,pretty=0,cex=0.6)
summary(specific_tree_model)
```

The code block below cross-validates the above tree to determine the optimal size.

```{r warning=FALSE}
#cross-validation of the above tree
cv.specific <- cv.tree(specific_tree_model)
cv.specific
plot(cv.specific$size,cv.specific$dev,type='b')
```

The code block below prunes the tree to its optimal size, and tests the pruned vs non-pruned trees to prove its optimization.

```{r warning=FALSE}
#pruning the above tree

prune.specific <- prune.tree(specific_tree_model, best = 6)
plot(prune.specific)
text(prune.specific,pretty=0, cex = 0.6)
summary(prune.specific)

#getting mse for non-pruned tree
prediction <- round(predict(specific_tree_model, newdata = testing))
mse <- mean((testing$`Emotional Wellbeing` - prediction)^2)
print("Testing mse of non-pruned tree")
mse
#getting mse for pruned tree
prediction <- round(predict(prune.specific, newdata = testing))
mse <- mean((testing$`Emotional Wellbeing` - prediction)^2)
print("Testing mse of pruned tree")
mse
```

*Analysis*

Our pruned tree did significantly better on the test data than the non-pruned, meaning we can say it is the optimal size and we can analyze the pruned tree's nodes to see the variables most found to be predictive of mental health. While the non-pruned tree had nodes directly related to athletics, pruning the tree actually removes these, leaving us with only 4 variables: resilience, hours of sleep, 'there are enough people I feel close to', and 'there are plenty of people I can rely on when I have problems'. These variables could be linked to athletics, but this tree shows that athletics are not the most significant predictors of mental health.

*Model Building*

The code block below creates a regression tree to predict emotional wellbeing using only variables related to sports and athletics.

```{r warning=FALSE}
#this tree only uses predictors that are related to sports and athletics
morespecific_tree_model <- tree(cleanData$`Emotional Wellbeing` ~ cleanData$`Athlete/Non-Athlete` + cleanData$`AIMS_ TOTAL` + cleanData$`I consider myself an athlete` + cleanData$`I have many goals related to sport` + cleanData$`most of my friends are athletes` + cleanData$`Sport is the most important part of my life` + cleanData$`I spend more time thinking about sport than anything else` + cleanData$`I feel bad about myself when I do badly in sport` + cleanData$`I would be very depressed if I were injured and could not compete in sport`, data = training)

plot(morespecific_tree_model)
text(morespecific_tree_model,pretty=0,cex=0.7)
summary(morespecific_tree_model)
```

The code below cross-validates and prunes the tree to its optimal size, as well as a third tree that was pruned to a manually selected value that is experimentally more accurate. Then tests all 3 models for their MSE.

```{r warning=FALSE}
#finding optimal tree size
cv.tree(morespecific_tree_model)
tree_size = c(6,5,4,3,1)
optimalSize=c()
#for(i in 1:100){
  optimalSize = c(optimalSize,tree_size[which.min(cv.tree(morespecific_tree_model)$dev)])
#}
table(optimalSize)

#getting mean square error rate for non-pruned tree
prediction = round(predict(morespecific_tree_model, newdata = testing))
mse <- mean((testing$`Emotional Wellbeing` - prediction)^2)
print("Testing mse of non-pruned tree")
mse

#creating pruned tree
newtree=prune.tree(morespecific_tree_model,best=5)
plot(newtree)
text(newtree,pretty=0,cex=0.7)

#getting mse for pruned tree
prediction = round(predict(newtree, newdata = testing))
mse <- mean((testing$`Emotional Wellbeing` - prediction)^2)
print("Testing mse of pruned tree")
mse

#creating extra pruned tree
manualTree=prune.tree(morespecific_tree_model,best=3)
plot(manualTree)
text(manualTree,pretty=0,cex=0.7)

#getting mse for pruned tree
prediction = round(predict(manualTree, newdata = testing))
mse <- mean((testing$`Emotional Wellbeing` - prediction)^2)
print("Testing mse of manually pruned tree")
mse
```

*Analysis*

Creating a tree of only sport-related metrics both allows us to see that it actually has a lower mean squared error rate than the tree created with more metrics, and also see the sport variables considered to be the most significant predictors. 'I feel bad about myself when I do badly in sport', 'most of my friends are athletes', 'I consider myself an athlete', and 'AIMS_TOTAL' were all variables used in the pruned tree as predictors. Analyzing the predictions, we can see the best mental health comes from not letting sports/losing significantly decrease your mood, but still having a community of athlete friends. We can also see the worst mental health comes from those who are affected by losing, but also do not consider themselves athletes.

# ANOVA test

*Model Building*

The code below uses an ANOVA test to see if there is a significant difference in the means of Emotional Wellbeing for Athletes vs. Non-Athletes.

```{r warning=FALSE}
#ANOVA Comparison between the mean well being of non athletes and the mean well being of athletes
cleanData$`Athlete/Non-Athlete`[cleanData$`Athlete/Non-Athlete` == 1] <- "Athlete"
cleanData$`Athlete/Non-Athlete`[cleanData$`Athlete/Non-Athlete` == 2] <- "Non-Athlete"

aov_results <- aov(cleanData$`Emotional Wellbeing` ~ factor(cleanData$`Athlete/Non-Athlete`))
summary(aov_results)
boxplot(cleanData$`Emotional Wellbeing` ~ factor(cleanData$`Athlete/Non-Athlete`), col = c("red", "blue"))
```

The code below uses an ANOVA to see if there is a signifcant difference in the means of Resiliency Total for Athletes vs. Non-Athletes.

```{r warning=FALSE}
#ANOVA Comparison between the mean resliency of non athletes and the mean well being of athletes
aov_results <- aov(cleanData$`RES_TOTAL` ~ factor(cleanData$`Athlete/Non-Athlete`))
summary(aov_results)
```

*Analysis*

Comparing the sum squares, we can see that the noise caused by residuals is significantly greater than the effect of being an athlete, and that is also represented in the p-value of 0.238, which is greater than 0.05 and thus is not statistically significant, and we cannot say that there is a significant difference between athletes and non athletes.

# Multiple Linear Regression (part 2!)

*Model Building*

The code below runs a multiple linear regression with Emotional Wellbeing as the response variable and belonging to a community, the number of people in the lockdown bubble, and having trusting relationships as the predictors.

```{r warning=FALSE}
#MLR
fit <- lm(cleanData$`Emotional Wellbeing` ~ cleanData$`That you belonged to a community (like a social group or your neighbourhood)` + cleanData$`# in lockdown bubble:` + cleanData$`That you had warm and trusting relationships with others`)
summary(fit)

avPlots(fit)
```

The code below checks to see if backwards elimination would improve the model above.

```{r warning=FALSE}
#MLR backwards elim
fit1 <- lm(cleanData$`Emotional Wellbeing` ~ + cleanData$`# in lockdown bubble:` + cleanData$`That you had warm and trusting relationships with others`)
summary(fit1)

fit2 <- lm(cleanData$`Emotional Wellbeing` ~ cleanData$`That you belonged to a community (like a social group or your neighbourhood)` + cleanData$`That you had warm and trusting relationships with others`)
summary(fit2)

fit3 <- lm(cleanData$`Emotional Wellbeing` ~ cleanData$`That you belonged to a community (like a social group or your neighbourhood)`)
summary(fit3)

```

*Analysis*

The added-variable plots show strong positive relationships between belonging to a community and having warm trusting relationships with emotional well-being, and a slight negative relationship between the size of your lockdown bubble and emotional well-being. Further, the p-values of the MLR shows that both metrics about having relationships were statistically significant (p\<0.05), while the size of your lockdown bubble was statistically insignificant.

# Project management

Due to an indivisible class size, our group only consisted of two people, and therefore our workload was larger than other groups, but it was also easier to communicate who was doing which aspects of the project. We were in relatively constant communication over email, and this RMD file and our slideshow presentation were a true joint effort, with us going back and forth adding and updating files as we saw fit. The workload seemed evenly split between us, and neither of us have a problem with how it unfolded.

# Conclusions

Athletes were not significantly more or less likely to have mental health struggles in the wake of the COVID-19 pandemic. The linear regression for emotional wellbeing as a function of how important sports were to the person had an R\^2 below 0.1 as did the regression model using weeks social distancing by athlete vs. non-athlete. The ANOVA tests had p-values above 0.2 so they did not show any significant difference between athletes vs. non-athletes. The regression tree with no limiting parameters and the tree with only some limiting parameters simply based the decision for emotional wellbeing off predictors like happiness or emotional resiliency, which is not useful data since there was no significant difference in those factors between athletes and non-athletes. The regression tree using only athletic parameters had a mean squared error of 9.54, which given that the maximum emotional wellbeing in the data is 15, is very high. The linear regression using belonging to a community, the number of people in the lockdown bubble, and feeling like they have loving relationships had an R\^2 of 0.408, which shows some correlation. The main conclusion drawn given this data is that despite the inability to play sports during the lockdown, athletes did not suffer more emotional problems than non-athletes, and factors like their community, and their general wellbeing were far more important.