
# Data Exploration {-}

To better understand the dataset, we can explore the distribution of patients and observations per patient.

```{r, cache=TRUE}
## Trial characteristics
n <- length(unique(simulated.data$id))
n1 <- length(unique(simulated.data$id[simulated.data$Z==1]))
n0 <- length(unique(simulated.data$id[simulated.data$Z==0]))
averageObs <- mean(table(simulated.data$id))

cat(n, "participants total\n",
    n1, "were randomized to the treatment arm\n",
    n0, "were randomized to the control arm\n",
    "On average each patient has", round(averageObs, 1), "observations\n")
```

There are 1000 individuals in each treatment arm with approximately 59 observations per person. The proportion of trial participants who experience the event of interest in the trial dataset is:

```{r, cache=TRUE}
## Event rate
e.rate <- sum(simulated.data$Y)/
  length(unique(simulated.data$id))
cat("Event rate is", round(e.rate*100,1), "%\n")
```
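
The same summaries can also be broken down by arm. The following is a minimal sketch on a small hypothetical dataset (`toy.data`, with made-up values, not the trial data). It assumes, as the calculation above does, that `Y` equals 1 only in the interval in which the event occurs, so the maximum of `Y` within a person is that person's event indicator.

```{r}
## Hypothetical long-format data: one row per person per interval
toy.data <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  Z  = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  Y  = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

## Collapse to one row per person: arm, number of observations, event indicator
per.person <- data.frame(
  id    = sort(unique(toy.data$id)),
  Z     = as.vector(tapply(toy.data$Z, toy.data$id, max)),
  n.obs = as.vector(table(toy.data$id)),
  event = as.vector(tapply(toy.data$Y, toy.data$id, max))
)

## Mean number of observations per person and event rate, by arm
mean.obs.by.arm   <- tapply(per.person$n.obs, per.person$Z, mean)
event.rate.by.arm <- tapply(per.person$event, per.person$Z, mean)

mean.obs.by.arm
event.rate.by.arm
```

Collapsing to one row per person first keeps the denominator equal to the number of participants rather than the number of person-time rows.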

Next, we can quantify non-adherence within the trial by calculating what proportion of individuals became non-adherent by their last observation:

```{r, cache=TRUE}
## Non-adherence rates by arm

## Treatment arm: person-time rows where the previous treatment value
## (Alag1) still matched assignment
simulated.data.treated <- simulated.data[simulated.data$Z==1 & 
                                           simulated.data$Alag1==1,]
## Control arm: person-time rows where the previous treatment value
## (Alag1) still matched assignment
simulated.data.control <- simulated.data[simulated.data$Z==0 & 
                                           simulated.data$Alag1==0,]

## Rows where treatment switches away from assignment (restricted to t0 < 59),
## divided by the number of participants in the arm
pt1dev <- nrow(simulated.data.treated[simulated.data.treated$A==0 &
                                        simulated.data.treated$t0<59,])/n1
pt0dev <- nrow(simulated.data.control[simulated.data.control$A==1 &
                                        simulated.data.control$t0<59,])/n0

cat("In the treatment arm", round(pt1dev*100, 1),
    "% were non-adherent by their last observation\n",
    "In the control arm", round(pt0dev*100, 1),
    "% were non-adherent by their last observation\n")
```
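
As a cross-check on the calculation above, non-adherence can also be summarized per person rather than per row. The sketch below uses a small hypothetical dataset (`toy.adh`, made-up values) and assumes that `A` is the treatment actually received in each interval, so a deviation is any row where `A` differs from `Z`. Unlike the chunk above, it does not apply the `t0 < 59` restriction, and it counts each person at most once even if they deviate in several intervals.

```{r}
## Hypothetical long-format data (not the trial data)
toy.adh <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  Z  = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  A  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
)

## Flag each person who ever receives a treatment other than the one assigned
ever.dev <- as.vector(tapply(toy.adh$A != toy.adh$Z, toy.adh$id, any))
## Assigned arm for each person (constant within person)
arm <- as.vector(tapply(toy.adh$Z, toy.adh$id, max))

## Proportion of individuals who ever deviated, by arm
prop.dev <- tapply(ever.dev, arm, mean)
round(prop.dev * 100, 1)
```

If non-adherence is absorbing (once a person deviates they stay off protocol), the per-row and per-person counts should agree apart from the time restriction.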

As the last part of our exploration, let's look at the covariates in this dataset and how they are distributed between the two treatment arms. In this example trial, there is one measured baseline covariate and two time-varying covariates. The estimation methods detailed in this document can be expanded to handle additional baseline or time-varying covariates by using the complete set of observed baseline covariates wherever $B$ is used, and the complete set of observed time-varying covariates wherever $L_{1}$ and $L_{2}$ are used.

```{r, cache=TRUE}
simulated.data %>% 
  group_by(Z) %>% 
  summarize(B_mean = mean(B),
            B_sd = sd(B),
            L1_mean = mean(L1),
            L1_sd = sd(L1),
            L2_prop = mean(L2))
```

The baseline covariate is well balanced between the two treatment arms, whereas the time-varying covariates are not. This can occur when the time-varying covariates are associated with the receipt of treatment.
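
One common way to put a number on balance is the standardized mean difference (SMD): the difference in arm means divided by a pooled standard deviation. This is an added illustration rather than part of the analysis above. The helper `smd()` and the data frame `toy.cov` are hypothetical, and the values are made up.

```{r}
## Standardized mean difference: difference in arm means over the pooled SD
smd <- function(x, z) {
  m1 <- mean(x[z == 1])
  m0 <- mean(x[z == 0])
  s.pooled <- sqrt((var(x[z == 1]) + var(x[z == 0])) / 2)
  (m1 - m0) / s.pooled
}

## Hypothetical covariate values (not the trial data)
toy.cov <- data.frame(
  Z  = c(1, 1, 1, 1, 0, 0, 0, 0),
  B  = c(0.1, 0.4, 0.6, 0.9, 0.2, 0.4, 0.6, 0.8),
  L1 = c(1.5, 2.0, 2.5, 3.0, 0.5, 1.0, 1.5, 2.0)
)

smd.B  <- smd(toy.cov$B,  toy.cov$Z)   # arm means are equal, so close to 0
smd.L1 <- smd(toy.cov$L1, toy.cov$Z)   # arm means differ by 1 unit
c(B = smd.B, L1 = smd.L1)
```

Because the SMD is unitless, it allows covariates measured on different scales, such as $B$ and $L_{1}$, to be compared directly.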