-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment7.Rmd
80 lines (58 loc) · 3.69 KB
/
Assignment7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
title: "Assignment 7"
author: "Sreedevi Kesavan"
date: "12/03/2021"
output:
md_document:
variant: markdown_github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
JD: This is cool, and a nice investigation – but you did not really do a glm! Gaussian with identity link is just an lm. The thing you improved was using poly instead of a linear fit. But (as we tried to explain) you can have a polynomial relationship in a plain lm (the model is a linear function of the parameters, not necessarily the input variables). Grade 2/3. Sorry for delay!
## Modelling trends in New HIV Cases and the Mother to Child transmission rate in Easter Africa using GLMs
Last day we that new HIV infections can be correlated to the increase in mother-baby transmission i.e. as the mother to child transmission rate increases the number of new HIV cases in young adults also increases. We ran a standard linear regression using the lm() function and although it seemed like a good fit based on the pearson correlation coefficient, the diagnostic plots suggested otherwise. For one the Residual vs Fitted plot clearly indicated that the data was not fitting a line. This week I am going back to that same data and using the glm() function testing to see if there are other models that would fit the data better and create cleaner diagnostic plots. To broaden the analysis I have decided to increase the scope of the data from 2005-2019 to 2000-2019.
```{r, results = 'hide', warning=FALSE, message=FALSE}
#Packages Used ----
library(ggplot2)
library(emmeans)
```
This data has been downloaded from <https://data.unicef.org/dv_index/>
```{r }
Plot_ready <- read.csv('Cor_data.csv')
Plot_ready_S <- read.csv('Cor_data_Sew.csv')
```
I have not shown the data handling again this week, the final tables from last week have been imported for the purposes of his exercise.
### Initial Diagnostics
```{r p2}
m1 <- lm(new_HIV~ M_to_C, data=Plot_ready)
#Plot
p2 <- plot(m1, id.n=4)
m3 <- shapiro.test(Plot_ready$new_HIV) # W = 0.87212, p-value = 0.01281
m4 <- shapiro.test(Plot_ready$M_to_C) # W = 0.88695, p-value = 0.02364
```
### Final Diagnostics
The biggest issue with the Initial Diagnostic seen above is the residual vs fitted plot clearly suggests a non-linear fit. The sharpiro test on the 2000-2019 data suggests a normal distribution and since the scale-location plot suggests a consistent degree of variance, I think the best approach for a GLM is a Gaussian distribution with a polynomial fit.
As seen from the second run at diagnostics, the first residuals vs fitted has significantly flattened, and we can now move to the actual plot of the model.
```{r m2}
#Diagnostic Test ----
m1 <- glm(new_HIV~poly(M_to_C, 2),Plot_ready,family=gaussian(link = "identity"))
#Plot
m2 <- plot(m1, id.n=4)
```
### Final Model
```{r plot}
plot <- ggplot(Plot_ready, aes(M_to_C, new_HIV)) +
geom_point() +
geom_smooth(method='glm', formula = y~poly(x,2)) +
xlab("Mother-to-child HIV transmission rate per 100") +
ylab("Est. No. of new HIV cases") + theme_bw(); plot
```
### Comapring Males to Females using emmeans
To study whether there is a difference in the correlation above w.r.t. Sex, I used the emmeans package and took into account sex based difference in the data. The lack of overlap with the confidence intervals around the two means alone indicate a significant different between the two groups, but unlike the linear regression from last week the confidence intervals are a lot larger than last week.
```{r p}
#emmeans pairwise comparison on the bases of sex ----
lmboth <- glm(new_HIV~poly(M_to_C, 2) + Sex , Plot_ready_S, family=gaussian(link = "identity"))
e1 <- emmeans(lmboth, "Sex")
p <- plot(e1) + theme_bw(); p
```