---
title: "Probability distribution, central limit theorem and confidence intervals"
author: "Carl Herrmann, Maïwen Caudron-Herger"
date: "`r Sys.Date()`"
output: html_document
---
# 0. Recap of the previous sheet
In the previous exercise sheet we learned about distributions and how to use them in R: the p-, q-, d- and r- functions, and the normal and Poisson distributions.
------------------------------------------------------------------------
# 1. Introduction and Objectives
In the previous lectures you learned about the major types of probability distributions, such as the normal, binomial, negative binomial and Student's t distributions. In this tutorial you will complete your training by learning more about the binomial distribution, and we will play around with the central limit theorem and confidence intervals.
------------------------------------------------------------------------
# 2. Binomial distribution
A binomial distribution is defined as:
$$P(x) = \frac{n!}{x!(n-x)!}\cdot p^x \cdot (1-p)^{n-x}$$
Where $x$ is the number of successes out of $n$ experiments and $p$ is the probability of success.
- $mean = n \cdot p$
- $variance = n \cdot p \cdot (1 - p)$
- $sd = \sqrt{n \cdot p \cdot (1 - p)}$
The design of the experiment is as follows:
- The experiment is repeated several times, and the repetitions are independent of one another
- Each experiment has exactly two outcomes
- The probability of success is constant and does not change between experiments
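These formulas can be sanity-checked with a quick simulation (a sketch; the seed and the number of replicates are arbitrary choices):

```{r}
# Simulate 100,000 binomial experiments with n = 10, p = 0.5
set.seed(1)
x <- rbinom(1e5, size = 10, prob = 0.5)
mean(x)   # should be close to n * p = 5
sd(x)     # should be close to sqrt(n * p * (1 - p)) ~ 1.58
```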
We can for example compute the probability of having 7 heads in a series of 10 throws:
```{r}
dbinom(7,size=10,prob = 0.5)
```
Or we can compute what the probability is to get 7 or more heads:
```{r}
pbinom(6,size=10,prob=0.5,lower.tail=FALSE)
```
Beware that this syntax means **strictly more than 6**, i.e. 7 or more!!
> How would you compute the probability to get less than 5 heads?
> What would `qbinom(0.3,size=10,prob=0.5,lower.tail=FALSE)` represent?
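One way to check these two questions (a sketch; note that `pbinom(4, ...)` with the default `lower.tail = TRUE` gives $P(X \le 4)$, i.e. less than 5 heads):

```{r}
# P(X < 5) = P(X <= 4): probability of less than 5 heads in 10 throws
pbinom(4, size = 10, prob = 0.5)

# qbinom(0.3, size = 10, prob = 0.5, lower.tail = FALSE) is the smallest
# number of heads k with P(X > k) <= 0.3 -- the 70% quantile of X
qbinom(0.3, size = 10, prob = 0.5, lower.tail = FALSE)
```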
------------------------------------------------------------------------
# 3. Central limit theorem
In this chapter, we will test the central limit theorem using the example of pipetting errors in the lab.
Imagine you have a set of three pipettes: a P2 (up to 2 µl), a P10 and a P200. You would like to test your pipettes and check their accuracy. To do that, you weigh drops of water corresponding to 2 µl with the P2, 5 µl with the P10 and 50 µl with the P200.
To proceed, you first consider samples of 1 measurement each. You perform 10 measurements.
The results of such experiments can be simulated using a uniform distribution:
```{r}
# Measurements with P2, varying between 1.5 µl and 2.5 µl
P2.set1 <- sapply(1:10, function(x) {
runif(1,min = 1.5, max=2.5)
})
P2.set1
```
Since each sample has one measurement, the mean of each sample is ... the measurement itself:
```{r}
P2.set1.mean = sapply(P2.set1,mean)
P2.set1.mean
```
```{r}
plot(density(P2.set1.mean,bw=0.4), xlim = c(1,3),main='distribution of mean values')
```
We now repeat these measurements, this time with 5 drops in each of the 10 samples:
```{r}
N = 5
P2.sets = lapply(1:10, function(x) {
runif(N,min = 1.5, max=2.5)
})
P2.sets
```
And we calculate the mean of each sample:
```{r}
P2.mean <- sapply(P2.sets,mean)
P2.mean
```
How are the mean values distributed?
```{r}
plot(density(P2.mean,bw=0.2), xlim = c(1,3), main = 'Distribution of mean values')
```
Compare this distribution of means over samples with N=5 to the distribution of means over samples with N=1. What do you notice?
> What about increasing the number of drops? Try with increasing numbers of drops.
Let's do that systematically for an increasing number of drops per sample, repeated 100 times:
```{r}
# Number of drops per sample (sample sizes)
N = c(2,5,10,20,50,100,500)
lapply(N,function(n) {
P2.exp = lapply(1:100, function(x) {
runif(n,min = 1.5, max=2.5)
#rnorm(n,mean=2)
})
# Calculate the mean
P2.exp.mean = sapply(P2.exp,mean)
plot(density(P2.exp.mean),
type='l',
xlab = "volume",
main = paste("Sample mean, n=",n),
xlim = c(1.5,2.5))
})
```
> It starts to look a bit clearer, doesn't it? Which kind of distribution is it?
We can have a look at the distribution of the mean values for increasing N.
```{r}
# Sample sizes
N <- 1:100 # Goes from 1 to 100
#
# Calculate the mean
P2.exp.mean <- sapply(N,function(n) {
mean(sapply(1:100, function(y){
mean(runif(n,min = 1.5, max=2.5)) # to get 100 mean values for each N
}) ) # to get the mean of the distribution
})
# Plot the mean values against the sample size
plot(P2.exp.mean, ylab = "Mean value for P2 with increasing N",
     pch = 20, ylim = c(1.9, 2.1), xlab = 'Sample size')
abline(h = 2, col = "blue", lty = 2)
```
#### Conclusion:
The fluctuations around the real expectation value get smaller and smaller: approximating the unknown expectation value through the mean of the sample becomes more and more accurate!
We can also have a look at the standard deviation.
```{r}
# Sample sizes
N <- 1:100 # Goes from 1 to 100
#
# Calculate the sd of the mean values (is actually the standard error!)
P2.exp.se <- sapply(N,function(n) {
sd(sapply(1:100, function(y){
mean(runif(n,min = 1.5, max=2.5)) # to get 100 mean values for each N
}) ) # to get the sd of the distribution (of the mean values)
})
# Plot the standard error against the sample size
plot(P2.exp.se, ylab = "SE value for P2",xlab = 'Sample size',pch=20)
```
> What are the characteristics of these values? What is the main take-home message?
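One possible take-home check (a sketch): for the uniform distribution on $[1.5, 2.5]$ µl, the standard deviation is $\sigma = 1/\sqrt{12} \approx 0.29$, so the standard error of the mean should decrease as $\sigma/\sqrt{N}$:

```{r}
# Compare the simulated standard error of the mean to sigma / sqrt(N)
sigma <- (2.5 - 1.5) / sqrt(12)   # sd of a Unif(1.5, 2.5) variable
set.seed(42)                      # arbitrary seed for reproducibility
for (n in c(5, 20, 100)) {
  se.sim <- sd(sapply(1:1000, function(y) mean(runif(n, min = 1.5, max = 2.5))))
  cat("N =", n, ": simulated SE =", round(se.sim, 4),
      ", theory =", round(sigma / sqrt(n), 4), "\n")
}
```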
------------------------------------------------------------------------
# 4. Confidence interval
The confidence interval describes the interval containing the (unknown) expectation value of a distribution with 95% confidence. This means that if we draw 100 random samples and compute a 95% confidence interval for each, the true expectation value $\mu$ will lie inside the interval in roughly 95 of the 100 cases.
Let us try a simulation: we consider a random variable distributed according to a Poisson distribution $$P(x) = \frac{{e^{ - \lambda } \lambda ^x }}{{x!}}$$ Here, *we know the true value of the expectation value*. We want to get an estimate for $\lambda$, and check if the confidence interval contains the true expectation value.
For example, a farmer expects to collect 75 eggs from his hens per hour.
```{r}
lambda = 75
```
For 100 days, he now collects the eggs $N=8$ times a day (each time during one hour). We want to compute the mean $m_N$ over these $N=8$ realizations, determine the 95% confidence interval, and check how often the expectation value $\mu$ is inside the confidence interval.
Remember that the 95% CI is given by $$
[m_N-t_{95,N-1}\frac{\sigma}{\sqrt{N}},m_N+t_{95,N-1}\frac{\sigma}{\sqrt{N}}]
$$ where $t_{95,N-1}$ is the critical value of the $t$-distribution with $N-1$ degrees of freedom.
Let's start by creating our samples:
```{r}
# size of the sample
N = 8
#
# we now draw 100 times samples of size N=8
X = lapply(1:100,function(i) {rpois(N,lambda = lambda)})
```
Now, we calculate the mean and the standard deviation of the respective samples:
```{r}
# we compute the means
Xm = sapply(X,mean)
# and the sample standard deviations
Xsd = sapply(X,sd)
#
```
Next, we determine the upper and lower bounds of the 95% CI. Remember that the confidence interval is based on a $t$-distribution whose number of degrees of freedom is the sample size minus 1 ($N-1=7$ in this case).
```{r}
df = N-1
tc = qt(c(0.975),df) # this is the critical value for the t-distribution for df = N-1 degrees of freedom and 95% CI
Xl = Xm-tc*Xsd/sqrt(N) # lower bound of the 95% CI
Xh = Xm+tc*Xsd/sqrt(N) # upper bound of the 95% CI
```
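As a cross-check, base R's `t.test()` computes the same t-based interval; its bounds should match our manual calculation (a sketch, shown for the first sample only):

```{r}
# t.test() returns the same t-based 95% confidence interval
t.test(X[[1]], conf.level = 0.95)$conf.int
c(Xl[1], Xh[1])   # manual bounds for the same sample
```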
Finally, we determine whether each sample mean is found within the 95% CI or not:
```{r,results='hide'}
col = c('red','blue')
## vector of TRUE/FALSE if the real expectation value lambda is inside the interval
i.ok = factor(Xl < lambda & Xh > lambda, levels = c(FALSE, TRUE)) # fix both levels so the colors stay correct even if no interval misses
## plot the mean values and the confidence intervals
plot(Xm, ylim = c(50, 100), pch = 20, ylab = "",
     main = paste("Mean values and confidence intervals, N =", N))
abline(h = lambda, lty = 3)
lapply(1:length(Xl), function(i) {
  points(c(i, i), c(Xl[i], Xh[i]), type = "l", col = col[i.ok[i]], lwd = 2)
})
```
Here, the red/blue bars represent the confidence intervals, the black dots the means of the sample values, and the dotted line at $\lambda$ represents the true expectation value. Whenever the true expectation value is within the CI, the bar is blue; if not, the bar is red. How often is the true expectation value outside the CI? Count the red bars!
It happens `r sum(!as.logical(i.ok))` times, which fits pretty well with the expected 5%.
> Repeat this simulation, but now with samples of $N=24$ (again 100 times)\
> What do you observe?\
> How often is the true expectation value outside the CI?
> Change to 90% CI and check if that works!
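A minimal sketch for the first prompt (same pipeline as above, only the sample size changes; the seed is an arbitrary choice):

```{r}
# Repeat the simulation with samples of size N = 24
set.seed(7)                                  # arbitrary seed
N2 <- 24
X2 <- lapply(1:100, function(i) rpois(N2, lambda = lambda))
Xm2 <- sapply(X2, mean)
Xsd2 <- sapply(X2, sd)
tc2 <- qt(0.975, df = N2 - 1)
ok <- (Xm2 - tc2 * Xsd2 / sqrt(N2) < lambda) & (Xm2 + tc2 * Xsd2 / sqrt(N2) > lambda)
sum(!ok)   # the intervals get narrower, but still about 5% should miss
```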
## Exercises
### Exercise 1
1. How large are the chances of getting at least 50 heads out of 100 coin flips?
2. How would you simulate the effect of an unfair coin, say one which lands heads 2/3 of the time? How large are the chances of getting at least 50 heads out of 100 with this coin?
3. You invited a group of 20 friends to a bar. For every member, there is an 80% chance that he/she will show up. If you want to buy a treat for all of your friends, how many should you buy to have a 70% chance that you will have just enough for everyone?
#### Solution
```{r}
# 1.
pbinom(49,size=100,prob=0.5,lower.tail=FALSE)
# or
1-pbinom(49,size=100,prob=0.5,lower.tail=TRUE)
# 2.
pbinom(49,size=100,prob=2/3,lower.tail=FALSE)
# or
1-pbinom(49,size=100,prob=2/3,lower.tail=TRUE)
# 3.
# smallest number of treats k such that P(number of guests <= k) >= 0.7,
# with each of the 20 friends showing up with probability 0.8
qbinom(0.7,size=20,prob=0.8)
```
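The third question can also be double-checked by simulation (a sketch; the seed and the number of simulated parties are arbitrary):

```{r}
# Simulate 100,000 parties: each of the 20 friends shows up with p = 0.8
set.seed(123)
guests <- rbinom(1e5, size = 20, prob = 0.8)
# smallest number of treats k such that P(guests <= k) >= 0.7;
# should agree with qbinom(0.7, size = 20, prob = 0.8)
min(which(sapply(0:20, function(k) mean(guests <= k)) >= 0.7)) - 1
```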
### Exercise 2
You are buying 10 packs of gummy bears. You particularly like the red ones and the green ones. A pack contains 6 different colors and you expect them to be equally distributed. There are 84 pieces per 200g pack.
1. What is the expected number of red or green gummy bears per pack?
2. You selected your 10 packs according to the colors you could see in the pack. At home, you counted the following bears per pack:\
- for the red ones: 12 16 17 12 16 13 11 18 13 19\
- for the green ones: 11 10 15 16 12 14 13 10 13 17\
Was your selection procedure a success? In other words, is the expected value below (congrats!), within or above (bad luck!) the 95% CI?
#### Solution
Expected amounts of red and green per pack:
```{r}
green.exp = 84/6
red.exp = 84/6
green.exp; red.exp
```
```{r}
packs = 10
red.obs = c(12,16,17,12,16,13,11,18,13,19)
green.obs = c(11,10,15,16,12,14,13,10,13,17)
m.red = mean(red.obs)
m.green = mean(green.obs)
k.red = sum(red.obs)
k.green = sum(green.obs)
## confidence interval
##critical value, using a t-distribution
t_95 = qt(0.975,df=packs-1)
## CI using sqrt(mean) as the sd estimate (Poisson approximation for count data)
m.red-t_95*sqrt(m.red)/sqrt(length(red.obs))
m.red+t_95*sqrt(m.red)/sqrt(length(red.obs))
```
```{r}
## alternatively, CI using the sample standard deviation
m.red-t_95*sd(red.obs)/sqrt(length(red.obs))
m.red+t_95*sd(red.obs)/sqrt(length(red.obs))
```
```{r}
## green: CI using sqrt(mean) as the sd estimate
m.green-t_95*sqrt(m.green)/sqrt(length(green.obs))
m.green+t_95*sqrt(m.green)/sqrt(length(green.obs))
```
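For symmetry with the red counts, the interval can also be computed with the sample standard deviation of the green observations (a sketch following the pattern above):

```{r}
## green: CI using the sample standard deviation
m.green - t_95 * sd(green.obs) / sqrt(length(green.obs))
m.green + t_95 * sd(green.obs) / sqrt(length(green.obs))
```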