1. Introduction
2. Why We Are Using Think Stats
3. Instructions for Cloning the Repo
4. Required Exercises
5. Optional Exercises
6. Recommended Reading
7. Resources
Use Allen Downey's Think Stats (second edition) book for getting up to speed with core ideas in statistics and how to approach them programmatically. This book is available online, or you can buy a paper copy if you would like.
Use this book as a reference when answering the 6 required statistics questions below. The Think Stats book is approximately 200 pages in length. It is recommended that you read the entire book, particularly if you are less familiar with introductory statistical concepts.
Complete the following exercises along with the questions in this file. Some can be solved using code provided with the book. The preface of Think Stats explains how to use the code.
Communicate the problem, how you solved it, and the solution, within each of the following markdown files. (You can include code blocks and images within markdown.)
The stats exercises have been chosen to introduce and solidify statistical concepts relevant to data science. The solutions for these exercises are available in the ThinkStats repository on GitHub. You should focus on understanding the statistical concepts, practicing Python programming, and interpreting the results. If you are stuck, review the solutions and recode the Python in a way that is more understandable to you.
For example, in the first exercise, the author has already written a function to compute Cohen's d. You could import it, or you could write your own code to practice Python and develop a deeper understanding of the concept.
Think Stats uses more complex Python than the introductory Python tutorials do. That is intentional: it prepares you for the bootcamp.
One of the skills to learn here is reading other people’s code. The author is quite experienced, so his code is a good place to learn how functions and imports work.
Using the code referenced in the book, follow the step-by-step instructions below.
Step 1. Create a directory on your computer where you will do the prework. Below is an example:
(Mac): /Users/yourname/ds/metis/metisgh/prework
(Windows): C:/ds/metis/metisgh/prework
Step 2. cd into the prework directory. Use git to clone the ThinkStats2 repo from GitHub to your computer.
$ git clone https://github.com/AllenDowney/ThinkStats2.git
Step 3. Put your IPython notebook or Python code files in this directory, so that they can import the modules provided with the book:
(Mac): /Users/yourname/ds/metis/metisgh/prework/ThinkStats2/code
(Windows): C:/ds/metis/metisgh/prework/ThinkStats2/code
Include your Python code, results and explanation (where applicable).
Q1. Think Stats Chapter 2 Exercise 4 (effect size of Cohen's d)
Cohen's d is an example of effect size. Other examples of effect size are correlation between two variables, mean difference, regression coefficients, and standardized test statistics such as t, Z, and F. In this example, you will compute Cohen's d to quantify the difference between two groups of data.
You will see effect size again and again in results of algorithms that are run in data science. For instance, in the bootcamp, when you run a regression analysis, you will recognize the t-statistic as an example of effect size.
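As a rough illustration of what "difference between two groups" means here, the following is a minimal sketch of Cohen's d written from scratch. It is not the function shipped with the book; the name `cohen_d`, the sample data, and the choice to weight each group's sample variance by its size when pooling are assumptions made for this example (other pooling conventions exist).

```python
import math
import statistics


def cohen_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    # Sample variances (n - 1 in the denominator).
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    # Pool the variances, weighting each by its group size (an assumed convention).
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    return diff / math.sqrt(pooled_var)


# Hypothetical data, for illustration only.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
print(cohen_d(group_a, group_b))  # 0.5: the means differ by half a pooled standard deviation
```

Because d is expressed in units of standard deviation, it can be compared across datasets measured on different scales.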
Q2. Think Stats Chapter 3 Exercise 1 (actual vs. biased)
This problem presents a robust example of actual vs biased data. As a data scientist, it will be important to examine not only the data that is available, but also the data that may be missing but highly relevant. You will see how the absence of this relevant data will bias a dataset, its distribution, and ultimately, its statistical interpretation.
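One way to see how such a bias arises is to reweight a distribution so that each value x is observed in proportion to x, which is what happens when you survey members of a group about the size of their group. The sketch below assumes that mechanism; the function names and the household distribution are hypothetical and are not taken from the book's data.

```python
def bias_pmf(pmf):
    """Reweight a PMF so each value x is observed in proportion to x."""
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    """Mean of a PMF given as a {value: probability} dict."""
    return sum(x * p for x, p in pmf.items())


# Hypothetical distribution of children per household (not survey data).
actual = {0: 0.4, 1: 0.2, 2: 0.3, 3: 0.1}
biased = bias_pmf(actual)

print(f"actual mean: {pmf_mean(actual):.2f}")
print(f"biased mean: {pmf_mean(biased):.2f}")
```

Households with no children drop out of the biased view entirely, and larger households are overrepresented, so the biased mean is higher than the actual mean.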
Q3. Think Stats Chapter 4 Exercise 2 (random distribution)
This question asks you to examine the function that produces random numbers. Is it really random? A good way to test that is to examine the PMF and CDF of the list of random numbers and visualize the distribution. If you're not sure what a PMF is, read more about it in Chapter 3.
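A numeric version of that check can be sketched without plotting: if the numbers are uniform on [0, 1), the fraction of the sample at or below x should be close to x. The helper `empirical_cdf`, the sample size, and the seed below are choices made for this illustration, not part of the exercise.

```python
import random


def empirical_cdf(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


random.seed(17)  # fixed seed so the run is repeatable
sample = [random.random() for _ in range(1000)]

# For a uniform distribution on [0, 1), the CDF at x is x itself.
checkpoints = [i / 10 for i in range(11)]
max_gap = max(abs(empirical_cdf(sample, x) - x) for x in checkpoints)
print(f"largest gap from the uniform CDF: {max_gap:.3f}")
```

A small gap is consistent with a uniform distribution. A plot of the CDF, which should be close to a straight line, makes the same point visually.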
Q4. Think Stats Chapter 5 Exercise 1 (normal distribution of blue men)
This is a classic example of hypothesis testing using the normal distribution. The effect size used here is the Z-statistic.
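The calculation can be sketched with the standard library's `statistics.NormalDist`. The parameters below (male heights with mean 178 cm and standard deviation 7.7 cm, and a required height range of 5'10" to 6'1") are recalled from the book's exercise and should be treated as assumptions to verify against the text.

```python
from statistics import NormalDist

# Assumed parameters, recalled from the book's exercise; verify before use.
MEAN_CM, STD_CM = 178.0, 7.7
CM_PER_INCH = 2.54

low_cm = (5 * 12 + 10) * CM_PER_INCH   # 5'10"
high_cm = (6 * 12 + 1) * CM_PER_INCH   # 6'1"

heights = NormalDist(mu=MEAN_CM, sigma=STD_CM)
fraction = heights.cdf(high_cm) - heights.cdf(low_cm)

# z-scores: distance of each bound from the mean, in standard deviations.
z_low = (low_cm - MEAN_CM) / STD_CM
z_high = (high_cm - MEAN_CM) / STD_CM
print(f"z from {z_low:.2f} to {z_high:.2f}: fraction = {fraction:.3f}")
```

Under these assumed parameters, roughly a third of the population falls inside the range.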
Bayes' Theorem is an important tool in understanding what we really know, given evidence of other information we have, in a quantitative way. It helps incorporate conditional probabilities into our conclusions.
Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin? Assume we observe the following probabilities in the population: fraternal twin is 1/125 and identical twin is 1/300.
Using the formula for Bayes' Theorem, the probability that Elvis was an identical twin is 5/17, or approximately 0.29.
I got this answer by assigning the events as follows:
- A: Elvis has an identical twin.
- B: Elvis has a twin of either type.
From there, the probabilities are as follows:
- P(A|B): Probability that Elvis has an identical twin given that he has a twin of either type.
- P(B|A): Probability that Elvis has a twin given that he has an identical twin. I assume this to be 1.
- P(A): Probability that Elvis has an identical twin, which is given as 1/300, or approximately 0.0033.
- P(B): Probability that Elvis has a twin, which is the sum of 1/125 and 1/300, or 17/1500, approximately 0.0113.
From there, P(A|B) = P(B|A) P(A) / P(B) = (1/300) / (17/1500) = 5/17, or approximately 0.29.
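The same calculation can be checked with exact fractions, which avoids rounding the intermediate probabilities. The first part of this sketch uses the events A and B exactly as defined above. The second part is a variant that is not part of the answer above: it conditions on the twin being a brother and assumes fraternal twins are the same sex half the time.

```python
from fractions import Fraction

p_identical = Fraction(1, 300)   # P(A), given in the problem
p_fraternal = Fraction(1, 125)   # given in the problem

# Events as defined in the text: B is "has a twin of either type".
p_twin = p_identical + p_fraternal            # P(B)
p_twin_given_identical = 1                    # P(B|A)
posterior = p_twin_given_identical * p_identical / p_twin
print(posterior, round(float(posterior), 3))  # 5/17, about 0.294

# Variant (assumption, not from the text): the evidence is "has a twin brother",
# with fraternal twins assumed to be the same sex half the time.
p_brother = p_identical * 1 + p_fraternal * Fraction(1, 2)
posterior_brother = p_identical / p_brother
print(posterior_brother, round(float(posterior_brother), 3))  # 5/11, about 0.455
```

The two results differ because the second treats the twin's sex as extra evidence: a same-sex twin is more likely under the identical hypothesis than under the fraternal one.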
How do frequentist and Bayesian statistics compare?
Bayesian and frequentist statistics are similar in that both provide methods for testing hypotheses and constructing interval estimates. Bayesian statistics gives more weight to the prior distribution, which encodes what is believed about the unknown quantities before the data are seen. Frequentist statistics gives more thought to stating a hypothesis before the experiment and to designing the experiment so that the hypothesis can be evaluated over repeated trials. The two schools of thought differ in their interpretation of what "probability" means: for frequentists, it is the long-run frequency with which an outcome occurs over repeated trials. For Bayesians, it is a degree of belief about the value of an unknown, which is treated as a random variable.
The result is that a Bayesian analysis produces a probability distribution that describes the main variables in an experiment, whereas a frequentist analysis yields a decision to reject or not reject the hypothesis stated in advance about those variables.
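To make the contrast concrete, here is a small sketch that analyzes the same coin-flip data both ways. The data (14 heads in 20 flips), the uniform prior, and the grid approximation are all assumptions chosen for illustration.

```python
from math import comb

# Hypothetical data: 14 heads in 20 flips of a coin.
heads, flips = 14, 20

# Frequentist: exact two-sided p-value under the hypothesis p = 0.5.
# Sum the probability of every outcome at least as far from 10 heads as observed.
extreme = [k for k in range(flips + 1)
           if abs(k - flips / 2) >= abs(heads - flips / 2)]
p_value = sum(comb(flips, k) for k in extreme) / 2 ** flips

# Bayesian: posterior over p on a grid, starting from a uniform prior.
grid = [i / 1000 for i in range(1001)]
likelihood = [p ** heads * (1 - p) ** (flips - heads) for p in grid]
total = sum(likelihood)
posterior = [w / total for w in likelihood]
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
prob_favors_heads = sum(w for p, w in zip(grid, posterior) if p > 0.5)

print(f"p-value: {p_value:.3f}")
print(f"posterior mean of p: {posterior_mean:.3f}")
print(f"P(p > 0.5 | data): {prob_favors_heads:.3f}")
```

The frequentist output is a single number to compare against a significance threshold chosen in advance. The Bayesian output is a whole distribution over p, from which summaries such as the mean or the probability that p exceeds 0.5 can be read.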
The following exercises are optional, but we highly encourage you to complete them if you have the time.
Q7. Think Stats Chapter 7 Exercise 1 (correlation of weight vs. age)
In this exercise, you will compute the effect size of correlation. Correlation measures the relationship of two variables, and data science is about exploring relationships in data.
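As a reminder of what the correlation coefficient measures, the following sketch computes Pearson's r from scratch as the covariance divided by the product of the standard deviations. The function name and the data are hypothetical. The exercise itself uses the book's dataset, and rank-based alternatives such as Spearman's correlation are not shown here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    std_x = math.sqrt(sum(a * a for a in dx) / n)
    std_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (std_x * std_y)


# Hypothetical ages (years) and weights (pounds), for illustration only.
ages = [19, 23, 27, 31, 35]
weights = [6.9, 7.2, 7.1, 7.5, 7.4]
print(round(pearson_r(ages, weights), 3))
```

The result always lies between -1 and 1. Values near 0 indicate a weak linear relationship, though a nonlinear relationship may still exist.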
Q8. Think Stats Chapter 8 Exercise 2 (sampling distribution)
In the theoretical world, all data related to an experiment or a scientific problem would be available. In the real world, some subset of that data is available. This exercise asks you to take samples from an exponential distribution and examine how the standard error and confidence intervals vary with the sample size.
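The experiment can be sketched as a simulation: draw many samples of size n from an exponential distribution, estimate the rate from each one, and summarize how the estimates spread. The function `simulate`, the true rate of 2.0, the number of iterations, and the use of RMSE around the true rate as the standard error are assumptions made for this example.

```python
import math
import random


def simulate(n, lam=2.0, iterations=500, seed=1):
    """Estimate lam from samples of size n; return (standard error, 90% CI)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        sample = [rng.expovariate(lam) for _ in range(n)]
        estimates.append(1 / (sum(sample) / n))   # estimator: 1 / sample mean
    # Standard error taken here as the RMSE of the estimates around the true lam.
    stderr = math.sqrt(sum((e - lam) ** 2 for e in estimates) / iterations)
    estimates.sort()
    ci = (estimates[int(0.05 * iterations)], estimates[int(0.95 * iterations)])
    return stderr, ci


for n in (10, 100, 1000):
    stderr, ci = simulate(n)
    print(f"n={n}: stderr={stderr:.3f}, 90% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

As n grows, the standard error shrinks and the confidence interval narrows around the true rate.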
Q9. Think Stats Chapter 6 Exercise 1 (skewness of household income)
Q10. Think Stats Chapter 8 Exercise 3 (scoring)
Q11. Think Stats Chapter 9 Exercise 2 (resampling)
Read Allen Downey's Think Bayes book. It is available online for free, or you can buy a paper copy if you would like.
Some people enjoy video content such as Khan Academy's Probability and Statistics or the much longer and more in-depth Harvard Statistics 110. You might also be interested in the book Statistics Done Wrong or a very short overview from School of Data.