This project was completed as a group project for the University of Calgary course DATA603 - Statistical Modelling with Data.
Sleep is a fundamental biological need for humans, as highlighted in Maslow’s hierarchy of needs. However, sleep alone is not sufficient; it must be efficient. Sleep efficiency refers to the proportion of time spent actually sleeping while in bed. Understanding sleep efficiency is critical because poor sleep quality can negatively impact individuals across multiple dimensions, including cognitive, emotional, and physical health.
The primary problem this project aims to address is identifying the key factors that influence sleep efficiency and understanding how these factors interact to affect sleep quality.
This problem is challenging because sleep schedules and habits differ from person to person, so a single model may not predict sleep efficiency accurately for everyone.
By identifying the factors that influence sleep efficiency, we can provide evidence-based recommendations to help individuals improve their sleep quality and, consequently, their overall quality of life.
The dataset was obtained from Kaggle and is free to use for project purposes. Its columns record various factors that may affect sleep efficiency.
- Age: The participant’s age. Sleep habits and quality might change as a person gets older.
- Gender: The participant’s gender. We will investigate whether different genders show different patterns in sleep efficiency.
- Bedtime: The time at which a person goes to bed. It is a key part of the body’s natural circadian rhythm.
- Wake-up time: The time a person wakes up. This can disrupt sleep cycles if it is inconsistent or too early.
- Sleep duration: The amount of time spent sleeping. Both insufficient and excessive sleep can harm sleep quality.
- REM sleep: The percentage of total sleep that is REM sleep. It is crucial for cognitive restoration and emotional regulation.
- Deep sleep: The percentage of total sleep that is deep sleep. This phase is essential for physical recovery and immune function.
- Light sleep: The percentage of total sleep that is light sleep. While less restorative, light sleep still plays a role in transitioning between sleep stages.
- Awakenings: The number of times a participant awoke during the night. Frequent awakenings can fragment the sleep stages.
- Caffeine consumption: The amount of caffeine consumed (mg) in the 24 hours before bedtime. Caffeine could keep someone awake longer than necessary and disrupt their circadian rhythm.
- Alcohol consumption: The amount of alcohol consumed (oz) in the 24 hours before bedtime. Alcohol could keep someone awake longer than necessary and disrupt their circadian rhythm.
- Smoking status: Whether the participant smokes or not. Nicotine is a stimulant that can interfere with falling asleep and staying asleep.
- Exercise frequency: The number of times a participant exercises in a week. Regular physical activity has been shown to reduce stress and improve sleep quality.
- Sleep efficiency: The response variable. It is quantitative and, as stated in the project proposal, measured as a percentage. Because no values of 0 or 100 were recorded, and values close to 100 make up less than 3% of the records in the dataset, we consider it viable to model sleep efficiency in this project.
Our results indicate that we were able to create a multiple linear regression model to predict sleep efficiency with a variety of significant predictors from the dataset.
full_model = lm(Sleep.efficiency ~ Age + factor(Gender) + Sleep.duration + REM.sleep.percentage + Deep.sleep.percentage + Light.sleep.percentage + Awakenings + Caffeine.consumption + Alcohol.consumption + factor(Smoking.status) + Exercise.frequency + Bedtime_shifted + Wakeuptime_shifted, data = sleep_data)
Based on the plot, the relationship captured by the full additive model appears nonlinear. This suggests the linearity assumption may be violated, and adding interactions or higher-order terms might better capture the underlying trend.
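As a rough illustration of why higher-order terms can help when an additive fit looks curved, the sketch below compares a straight-line fit with a quadratic fit using a partial F-test. It uses synthetic data only; `toy`, `x`, and `y` are hypothetical names and are not part of the sleep dataset or the project code.

```r
# Synthetic data for illustration only (not the Kaggle sleep data):
# the response y curves in the hypothetical predictor x.
set.seed(1)
toy <- data.frame(x = runif(200, min = 0, max = 10))
toy$y <- 50 + 8 * toy$x - 0.6 * toy$x^2 + rnorm(200, sd = 2)

# Additive (straight-line) fit versus a fit with a quadratic term.
additive  <- lm(y ~ x, data = toy)
quadratic <- lm(y ~ x + I(x^2), data = toy)

# Partial F-test: does the squared term explain significantly more variance?
comparison <- anova(additive, quadratic)
p_value <- comparison[["Pr(>F)"]][2]

# Adjusted R-squared for both fits, for a side-by-side comparison.
adj_r2 <- c(additive  = summary(additive)$adj.r.squared,
            quadratic = summary(quadratic)$adj.r.squared)

print(comparison)
print(adj_r2)
```

A residuals-versus-fitted plot for either fit can be drawn with `plot(additive, which = 1)`; a visible curve in that plot is the kind of pattern discussed above.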
We checked the variance inflation factors (VIF) and found multicollinearity among REM.sleep.percentage, Deep.sleep.percentage, and Light.sleep.percentage.
Therefore, we checked every candidate model in which these predictors were not collinear.
Based on our results, the model with REM.sleep.percentage and Deep.sleep.percentage showed no multicollinearity, and both predictors were significant.
Therefore, we chose the model with REM.sleep.percentage and Deep.sleep.percentage.
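The VIF check can be sketched in base R as follows. This is a minimal illustration on synthetic data, assuming three percentage columns that nearly sum to 100; the column names `rem`, `deep`, `light`, the means and spreads, and the helper `vif_by_hand` are all hypothetical. The report does not state which function was used; `car::vif()` is one commonly used option.

```r
# Synthetic data for illustration only (not the Kaggle sleep data).
set.seed(2)
n <- 300
toy <- data.frame(rem  = round(rnorm(n, mean = 22, sd = 3)),
                  deep = round(rnorm(n, mean = 50, sd = 12)))
# Light sleep is (almost) the remainder; the +/-1 term mimics rounding error.
toy$light <- 100 - toy$rem - toy$deep + sample(-1:1, n, replace = TRUE)

# VIF for predictor p is 1 / (1 - R^2), where R^2 comes from regressing
# p on all the other predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_all <- vif_by_hand(toy, c("rem", "deep", "light"))  # all three stages
vif_two <- vif_by_hand(toy, c("rem", "deep"))           # light sleep dropped

print(round(vif_all, 1))
print(round(vif_two, 2))
```

In this toy setup the VIFs are large when all three percentages are included and close to 1 once one of them is dropped, which is the pattern that motivates keeping only two of the three sleep-stage variables.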
Our final full additive model is:
interactionmodel = lm(Sleep.efficiency ~ REM.sleep.percentage + Age + Awakenings + Exercise.frequency + factor(Smoking.status) + Alcohol.consumption + Deep.sleep.percentage + factor(Smoking.status)*Deep.sleep.percentage + Awakenings*Deep.sleep.percentage + Age*Deep.sleep.percentage + REM.sleep.percentage*Awakenings, data = sleep_data)
The residual plot shows no strong overall pattern, but the residuals do curve at both low and high fitted values.
Since we don’t want to over-fit our model or invalidate any assumptions, we decided to investigate three different promising higher order models going forward to see which one ends up meeting the most criteria.
model1 = lm(Sleep.efficiency ~ REM.sleep.percentage + Age + Awakenings + Exercise.frequency + Smoking.status + Alcohol.consumption + Deep.sleep.percentage + Smoking.status*Deep.sleep.percentage + Awakenings*Deep.sleep.percentage + Age*Deep.sleep.percentage + REM.sleep.percentage*Awakenings + I(Deep.sleep.percentage^2) + I(Deep.sleep.percentage^3) + I(Awakenings^2) + I(Awakenings^3) + I(Awakenings^4), data=sleep_data)
model2 = lm(Sleep.efficiency ~ REM.sleep.percentage + Age + Awakenings + Exercise.frequency + Smoking.status + Alcohol.consumption + Deep.sleep.percentage + Smoking.status*Deep.sleep.percentage + Awakenings*Deep.sleep.percentage + Age*Deep.sleep.percentage + REM.sleep.percentage*Awakenings + I(Awakenings^2) + I(Awakenings^3) + I(Awakenings^4) + I(Deep.sleep.percentage^2) + I(Deep.sleep.percentage^3) + I(Age^2), data = sleep_data)
model3 = lm(Sleep.efficiency ~ REM.sleep.percentage + Age + Awakenings + Exercise.frequency + Smoking.status + Alcohol.consumption + Deep.sleep.percentage + Smoking.status*Deep.sleep.percentage + Awakenings*Deep.sleep.percentage + Age*Deep.sleep.percentage + REM.sleep.percentage*Awakenings + I(Age^2) + I(Age^3) + I(Age^4) + I(Age^5) + I(Deep.sleep.percentage^2) + I(Deep.sleep.percentage^3) + I(Deep.sleep.percentage^4) + I(Deep.sleep.percentage^5), data = sleep_data)
In model 1, by using the Breusch-Pagan test, and getting a
In model 2, by using the Breusch-Pagan test, and getting a
In model 3, by using the Breusch-Pagan test, and getting a
All of these models have all significant predictors and a high
Based on these final graphs, we can see that adding higher-order terms has improved the linearity of all the models and we can proceed with checking other assumptions.
None of the data points in any of the models is flagged as an influential outlier, because all of them have small Cook’s distance values. This means that no point has abnormally high influence on the outcome of the model, so we don’t have to remove any observations.
Since each row in the data is associated with a unique test subject and the rows are not related to each other as a time series, we can safely assume that the measurements are independent. If we suspected the measurements might not be independent, we could plot the error terms in the order in which they occur in the dataset and look for any pattern in the plot.
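The influence check described above can be sketched with base R's `cooks.distance()`. The data are synthetic, the names `toy`, `x1`, `x2`, and `y` are hypothetical, and the cutoff of 1 is a common rule of thumb assumed here; the report does not state which threshold was used.

```r
# Synthetic data for illustration only (not the Kaggle sleep data).
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# One Cook's distance per observation: how much the fitted values change
# when that observation is left out.
cook <- cooks.distance(fit)

# Assumed rule of thumb: flag observations with Cook's distance above 1.
flagged <- which(cook > 1)

cat("Largest Cook's distance:", max(cook), "\n")
cat("Points above the cutoff:", length(flagged), "\n")
```

The same values can be inspected visually with `plot(fit, which = 4)`, which draws Cook's distance against observation number.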
These residual and scale-location plots seem to have slight patterns, suggesting the models might have heteroscedasticity. We need to investigate further by using the Breusch-Pagan test, with the null hypothesis that the models have homoscedasticity and the alternative hypothesis that the models have heteroscedasticity.
The results from the Breusch-Pagan tests show evidence of heteroscedasticity in model 1, while the results for model 2 and model 3 are consistent with homoscedasticity, with model 3 being the best option here.
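To make the mechanics of the Breusch-Pagan test concrete, here is a minimal base-R sketch of its studentized (Koenker) form: regress the squared residuals on the model's predictors and compare n times the auxiliary R-squared with a chi-squared distribution. The helper `bp_test` and the synthetic columns `x`, `y_const`, and `y_fan` are hypothetical; in practice `lmtest::bptest()` is a commonly used implementation, and the report does not show the exact call that was used.

```r
# Studentized Breusch-Pagan test written out by hand (an illustrative sketch).
bp_test <- function(fit) {
  X  <- model.matrix(fit)      # predictors of the original model (with intercept)
  e2 <- residuals(fit)^2       # squared residuals
  aux <- lm.fit(X, e2)         # auxiliary regression of e^2 on the predictors
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2      # n * R^2
  df <- ncol(X) - 1            # number of predictors, excluding the intercept
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Synthetic data for illustration only (not the Kaggle sleep data).
set.seed(4)
n <- 300
toy <- data.frame(x = runif(n, min = 1, max = 10))
toy$y_const <- 3 + 2 * toy$x + rnorm(n, sd = 2)            # constant error variance
toy$y_fan   <- 3 + 2 * toy$x + rnorm(n, sd = 0.5 * toy$x)  # variance grows with x

bp_const <- bp_test(lm(y_const ~ x, data = toy))
bp_fan   <- bp_test(lm(y_fan ~ x, data = toy))

print(bp_const)
print(bp_fan)
```

A small p-value, as expected for the fan-shaped series, leads to rejecting the null hypothesis of homoscedasticity; a large p-value means the test found no evidence against it.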
According to the Q-Q plot, there is a noticeable bow-shaped pattern and signs of kurtosis in the points along the diagonal line, suggesting that the residuals are not normally distributed.
We can also confirm this by running a Shapiro-Wilk test for normality, with the null hypothesis that the residuals are normally distributed and the alternative hypothesis that the residuals are not normally distributed.
The
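The Shapiro-Wilk check on model residuals can be sketched with base R's `shapiro.test()`. The data below are synthetic, with deliberately skewed errors so that the test has something to detect; `toy`, `x`, and `y` are hypothetical names and the output says nothing about the project's actual models.

```r
# Synthetic data for illustration only (not the Kaggle sleep data).
set.seed(5)
n <- 200
toy <- data.frame(x = rnorm(n))
# Centered exponential errors: mean zero but clearly right-skewed.
toy$y <- 1 + 2 * toy$x + (rexp(n) - 1)

fit <- lm(y ~ x, data = toy)

# H0: residuals are normally distributed; H1: they are not.
sw <- shapiro.test(residuals(fit))
print(sw)
```

A p-value below the chosen significance level leads to rejecting normality. The matching visual check is `qqnorm(residuals(fit))` followed by `qqline(residuals(fit))`.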
Therefore the final model is:
Also, the sub-model for smokers:
and the sub-model for non-smokers:
The analysis conducted provides promising insights into how various lifestyle and physiological factors influence sleep efficiency, even if the model is not perfectly suited for individual-level predictions. The final model explains 85.44% of the variance in sleep efficiency, highlighting key influences such as exercise, alcohol, smoking, age, awakenings, and deep sleep percentage.
Alcohol consumption and smoking both demonstrate negative effects on sleep efficiency, with smoking showing a particularly strong detrimental impact. These findings align with existing medical research about substance use and sleep quality. Interestingly, while smoking generally reduces sleep efficiency, the model suggests deep sleep may slightly mitigate this effect, possibly due to nicotine's temporary relaxing properties.
The relationship between age and sleep efficiency proves complex, following a nonlinear pattern that changes across different life stages. This likely reflects how various life circumstances and health factors influence sleep differently at various ages, rather than being solely caused by biological aging itself.
Sleep architecture plays a crucial role in sleep efficiency. REM sleep shows a clear positive association with better sleep quality, supporting its importance for cognitive restoration. Deep sleep presents a more nuanced relationship, where moderate amounts are beneficial but excessive duration may become counterproductive, indicating balance is key.
The unexpected positive association between nighttime awakenings and sleep efficiency warrants further investigation. While initially counterintuitive, this effect is likely explained by the negative offset provided by the interaction between awakenings and deep sleep percentage. Individuals who are able to achieve a large percentage of deep sleep seem to be less affected by awakenings since they can still achieve enough deep sleep. The positive interpretation of awakenings might also suggest measurement limitations in the study design.
- Deng, Z., Liu, L., Liu, W., Liu, R., Ma, T., Xin, Y., Xie, Y., Zhang, Y., Zhou, Y., & Tang, Y. (2024). Alterations in the fecal microbiota of methamphetamine users with bad sleep quality during abstinence. BMC Psychiatry, 24(1), 324-12. https://doi.org/10.1186/s12888-024-05773-5
- ENSIAS. (2021). Sleep Efficiency Dataset. Kaggle. Retrieved March 11, 2025, from https://www.kaggle.com/datasets/equilibriumm/sleep-efficiency/data
- Fjell, A. M., Sørensen, Ø., Wang, Y., Amlien, I. K., Baaré, W. F. C., Bartrés-Faz, D., Boraxbekk, C., Brandmaier, A. M., Demuth, I., Drevon, C. A., Ebmeier, K. P., Ghisletta, P., Kievit, R., Kühn, S., Madsen, K. S., Nyberg, L., Solé-Padullés, C., Vidal-Piñeiro, D., Wagner, G., . . . Walhovd, K. B. (2023). Is short sleep bad for the brain? Brain structure and cognitive function in short sleepers. The Journal of Neuroscience, 43(28), 5241-5250. https://doi.org/10.1523/JNEUROSCI.2330-22.2023
- Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50(4), 370-396.
- Pan, L., Li, L., Peng, H., Fan, L., Liao, J., Wang, M., Tan, A., & Zhang, Y. (2022). Association of depressive symptoms with marital status among the middle-aged and elderly in rural China: Serial mediating effects of sleep time, pain and life satisfaction. Journal of Affective Disorders, 303, 52-57. https://doi.org/10.1016/j.jad.2022.01.111
- Wang, L., & Aton, S. J. (2022). Perspective – ultrastructural analyses reflect the effects of sleep and sleep loss on neuronal cell biology. Sleep (New York, N.Y.), 45(5), 1. https://doi.org/10.1093/sleep/zsac047