title | author | geometry | toc | toc-depth | header-includes | include-before |
---|---|---|---|---|---|---|
Data Science Self-Assessment | Galvanize Inc. | margin=1in | true | 2 | | \begin{center} \includegraphics[width=4cm]{imgs/logo.png} \end{center} |
This document is designed to give you an idea of the baseline of Python, SQL and probability/statistics knowledge required to apply for the Data Science Immersive program. If understanding any of the scripts included in this PDF is challenging, we encourage you to take the time to study Python and/or SQL and/or probability/statistics before beginning the application process. For a list of free Python, SQL and probability/statistics resources, please refer to the DSI Study Resources PDF.
This document starts with some simple Python statements which you should be able to evaluate without actually executing them. We then proceed to more advanced challenges that require a solid understanding of strings, lists, sets, dictionaries, file I/O, and functions. We then continue the self-assessment with a variety of SQL statements you should be comfortable with. We end the document with probability/statistics exercises that cover counting (permutations, combinations), probability (conditional probability, Bayes’ Theorem), probability distributions for discrete and continuous random variables, descriptive and inferential statistics, as well as basic linear regression.
Without running the scripts, can you tell what the output will be? If you have some Python or programming background, this section should take very little time.
\begin{multicols}{3}
\begin{minted}[linenos]{python}
list_num = [1,2,3]
for num in list_num:
    total = 0
    total += num
print total
\end{minted}

\begin{minted}[linenos]{python}
list_num = [1,2,3]
total = 0
for num in list_num:
    total += num
print total
\end{minted}

\begin{minted}[linenos]{python}
list_num = [1,2,3]
total = 0
for num in list_num:
    total += num
    print total
\end{minted}
\end{multicols}
\begin{multicols}{2}
\begin{minted}[linenos, fontsize=\footnotesize]{python}
def my_function1(my_list):
    output = []
    for item in my_list:
        output.append(item)
        return item

print my_function1(['cat', 'bad', 'dad'])
\end{minted}

\begin{minted}[linenos, fontsize=\footnotesize]{python}
def my_function2(my_list):
    output = []
    for item in my_list:
        output.append(item)
        return output

print my_function2(['cat', 'bad', 'dad'])
\end{minted}
\end{multicols}
\newpage
\begin{multicols}{2}
\begin{minted}[linenos, fontsize=\footnotesize]{python}
def my_function3(my_list):
    output = []
    for item in my_list:
        output.append(item)
    return item

print my_function3(['cat', 'bad', 'dad'])
\end{minted}

\begin{minted}[linenos, fontsize=\footnotesize]{python}
def my_function4(my_list):
    for item in my_list:
        output = []
        output.append(item)
    return output

print my_function4(['cat', 'bad', 'dad'])
\end{minted}
\end{multicols}
\begin{multicols}{2}
\begin{minted}[linenos, fontsize=\footnotesize]{python}
def my_function5(my_list):
    output = []
    for item in my_list:
        output.append(item)
    return output

print my_function5(['cat', 'bad', 'dad'])
print my_function5(['cat', 'bad', 'dad'])
\end{minted}

\begin{minted}[linenos, fontsize=\footnotesize]{python}
output = []
def my_function6(my_list):
    for item in my_list:
        output.append(item)
    return output

print my_function6(['cat', 'bad', 'dad'])
print my_function6(['cat', 'bad', 'dad'])
\end{minted}
\end{multicols}
Functions, blocks of reusable code, keep your code modular, well organized and easily maintainable. You should try to keep your code organized in functions. Take a look at each of the following snippets of code and organize them into functions.
- We want a function that takes a list of numbers and returns a new list in which 10 has been added to each number.

\begin{pythoncode}
list_num = [1,2,3]
list_add_10 = []
for num in list_num:
    list_add_10.append(num + 10)
print list_add_10
\end{pythoncode}

- We want a function that takes a list of strings and returns a list containing the length of each word.

\begin{pythoncode}
list_words = ['great', 'job', 'so', 'far']
list_length_words = []
for word in list_words:
    list_length_words.append(len(word))
print list_length_words
\end{pythoncode}
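As a pattern for this kind of refactoring, here is a minimal sketch on a different task (squaring each number). The function name `square_all` is purely illustrative, and the snippet assumes Python 3, whereas the snippets above use the Python 2 `print` statement.

```python
def square_all(numbers):
    """Return a new list containing the square of each number."""
    squared = []
    for num in numbers:
        squared.append(num ** 2)
    return squared


result = square_all([1, 2, 3])
print(result)
```

The input list comes in as a parameter and the result goes out through `return`, so the same logic can be reused on any list without editing the loop.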
\newpage
Practice, practice, practice: we encourage you to work through these challenges.
Write a function that counts the number of times each of the given letters appears in a document. The output should be a dictionary.
\begin{pythoncode}
def letter_counter(path_to_file, letters_to_count):
    '''
    Returns the number of times specified letters appear in a file

    Parameters
    -----------
    path_to_file: str
        Relative or absolute path to file of interest
    letters_to_count: str
        String containing the letters to count in the text

    Returns
    --------
    letter_dict: dict
        - key: letter
        - value: the count of that letter in the file
        The counting is case insensitive

    Example
    --------
    ```file.txt
    This is the file of interest. Count my vowels!
    ```
    >>> letter_counter('file.txt', 'aeiou')
    {'i': 4, 'e': 5, 'o': 3, 'u': 1}
    '''
    pass
\end{pythoncode}
\newpage
Write a function that removes one occurrence of a given item from a list. Do not use the methods `.pop()` or `.remove()`! If the item is not present in the list, the output should be 'The item is not in the list'.
\begin{pythoncode}
def remove_item(list_items, item_to_remove):
    '''
    Remove first occurrence of item from list

    Parameters
    ----------
    list_items: list
    item_to_remove: object
        The object to be removed from list_items

    Returns
    --------
    - if the item is in the list: list
        list with first occurrence of item removed
    - if the item is not in the list: str
        'The item is not in the list'

    Example
    --------
    >>> list_items = [1,3,7,8,0]
    >>> remove_item(list_items, 7)
    [1,3,8,0]
    '''
    pass
\end{pythoncode}
\newpage
The simple substitution cipher basically consists of substituting every plaintext character for a different ciphertext character. The following is an example of one possible cipher from http://practicalcryptography.com/ciphers/simple-substitution-cipher/:
- Plain alphabet : abcdefghijklmnopqrstuvwxyz
- cipher alphabet: phqgiumeaylnofdxjkrcvstzwb
\begin{pythoncode}
def cipher(text, cipher_alphabet, option='encipher'):
    '''
    Run text through a particular cipher alphabet

    Parameters
    -----------
    text: str
        Either the plain text to encipher, or the cipher text to decipher
    cipher_alphabet: dict
        Dictionary specifying {'original_letter': 'cipher_letter'}
    option: str (default 'encipher')
        'encipher' (accept plain text and output cipher text)
        'decipher' (accept cipher text and output plain text)

    Returns
    --------
    cipher text by default,
    plain text if option is set to decipher

    >>> d = dict(zip('abcdefghijklmnopqrstuvwxyz',
                     'phqgiumeaylnofdxjkrcvstzwb'))
    >>> cipher('defend the east wall of the castle',
               d)
    'giuifg cei iprc tpnn du cei qprcni'
    >>> cipher('giuifg cei iprc tpnn du cei qprcni',
               d,
               option='decipher')
    'defend the east wall of the castle'
    '''
    pass
\end{pythoncode}
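One possible starting point, sketched here under the assumption that the cipher alphabet is a one-to-one mapping: deciphering can reuse the same dictionary with its keys and values swapped. The names `plain`, `substituted`, `encipher_map` and `decipher_map` are illustrative only, and handling characters outside the alphabet (such as spaces) is left to the exercise.

```python
plain = 'abcdefghijklmnopqrstuvwxyz'
substituted = 'phqgiumeaylnofdxjkrcvstzwb'

# Map each plain letter to its cipher letter.
encipher_map = dict(zip(plain, substituted))

# Swap keys and values to go from cipher letter back to plain letter.
decipher_map = {cipher_letter: plain_letter
                for plain_letter, cipher_letter in encipher_map.items()}

print(encipher_map['d'])  # g
print(decipher_map['g'])  # d
```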
\newpage
Implement a function that counts the number of isograms in a list of strings.
- An isogram is a word that has no repeating letters, consecutive or non-consecutive.
- Assume the empty string is an isogram and that the function should be case insensitive.
\begin{pythoncode}
def count_isograms(list_of_words):
    '''
    Count the number of strings without repeating characters in a list

    Parameters
    -----------
    list_of_words: list of strings

    Returns
    -------
    count of isograms (as integer)

    >>> count_isograms(['conduct', 'letter', 'contract', 'hours', 'interview'])
    1
    '''
    pass
\end{pythoncode}
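The isogram definition can be checked for a single word by comparing the number of distinct characters with the length of the word. This is a minimal sketch with a hypothetical helper name, `is_isogram`; counting over a whole list is left as the exercise.

```python
def is_isogram(word):
    """Return True if no character repeats in word (case insensitive)."""
    lowered = word.lower()
    # A set keeps one copy of each character, so the two sizes match
    # only when every character is unique.
    return len(set(lowered)) == len(lowered)


print(is_isogram('Python'))
print(is_isogram('Level'))
print(is_isogram(''))
```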
Write a function that returns a list of matching items. Each item is a tuple holding a letter and a number, and we consider item 1 to match item 2 if:
- Both their letters are vowels (aeiou), or both are consonants, and
- The sum of their numbers is a multiple of 3.

Because the pair (1,2) contains the same information as (2,1), the output list should contain only one of them.
\begin{pythoncode}
def matching_pairs(data_list):
    '''
    Parameters
    ----------
    data_list: a list of tuples (letter, number)

    Returns
    -------
    A list of the matching pairs referenced by their index (index_A, index_B).
    Each pair should appear only once. (A,B) is the same as (B,A)

    >>> data = [('a', 4), ('b', 5), ('c', 1), ('d', 3), ('e', 2), ('f',6)]
    >>> matching_pairs(data)
    [(0,4), (1,2), (3,5)]
    '''
    pass
\end{pythoncode}
\newpage
You should be able to write SQL queries that use `SELECT`, `FROM`, `WHERE`, and `CASE` clauses, aggregates, and `JOIN`s. To check your work, you can run your queries on W3Schools' site (http://bit.ly/1foSkgu).
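Queries can also be tried locally. The sketch below assumes Python 3 and its built-in `sqlite3` module, loads only a few rows of the `flags` table used in this section into an in-memory database, and runs a query that is not one of the exercises. SQLite's dialect may differ slightly from the one used on the site above.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE flags ('
    'name TEXT, country TEXT, w_prop INTEGER, '
    'l_prop INTEGER, adoption_date INTEGER)'
)

# A subset of the flags table, for illustration only.
rows = [
    ('Tricolour', 'France', 2, 3, 1830),
    ('Union Jack', 'United Kingdom', 1, 2, 1801),
    ('Hinomaru', 'Japan', 2, 3, 1999),
]
conn.executemany('INSERT INTO flags VALUES (?, ?, ?, ?, ?)', rows)

# Example query: flags adopted before 1900, oldest first.
old_flags = conn.execute(
    'SELECT name, country FROM flags '
    'WHERE adoption_date < 1900 ORDER BY adoption_date'
).fetchall()
print(old_flags)

conn.close()
```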
We will be querying the following tables.
Table 1: flags
name | country | w_prop | l_prop | adoption_date |
---|---|---|---|---|
"Tricolour" | "France" | 2 | 3 | 1830 |
"Union Jack" | "United Kingdom" | 1 | 2 | 1801 |
"The Star-Spangled Banner" | "USA" | 10 | 19 | 1960 |
"Hinomaru" | "Japan" | 2 | 3 | 1999 |
"NA" | "Brazil" | 7 | 10 | 1992 |
"Jalur Gemilang" | "Malaysia" | 1 | 2 | 1963 |
where `w_prop` is the width proportion and `l_prop` is the length proportion.
Table 2: countries
country | capital | continent |
---|---|---|
"France" | "Paris" | "Europe" |
"Malaysia" | "Kuala Lumpur" | "Asia" |
"Brazil" | "Brasilia" | "South America" |
"United Kingdom" | "London" | "Europe" |
"Japan" | "Tokyo" | "Asia" |
"USA" | "Washington DC" | "North America" |
"Germany" | "Berlin" | "Europe" |
- Use the `WHERE` clause to show the countries with a flag ratio of 2:3 (i.e. `w_prop` = 2 and `l_prop` = 3).
- Use `IN` to check if an item is in a list and show the countries on a continent that is either Europe or North America.
- Use `BETWEEN xxx AND xxx` to show the names of flags and countries that have a width proportion higher than 1 but lower than 8.
- Use `LIKE 'X%'` to show countries that have a name that starts with 'U'.
- Use `CASE` to show countries, their capital and a column to indicate whether the continent is 'Eurasia' (i.e. Europe or Asia) or 'Americas' (North or South America). Add a filter to select countries with capitals that are at least 7 characters long.
Aggregates include commands such as `DISTINCT`, `COUNT`, `SUM`, `GROUP BY`, `HAVING`, and `ORDER BY`. Try using these commands on the following questions!
- Use `DISTINCT` to list the continents in the countries table; each continent should appear only once.
- Use `COUNT` to see how many countries are in Europe.
- Use `GROUP BY` to count how many countries are in each continent, with continents alphabetically ordered (hint: use `ORDER BY`).
- Use `HAVING` to determine which continents are represented at least twice in the countries table.
- Use `JOIN` to display the capital, the country, and the flag name.
- Use `JOIN` and `WHERE` to display the continents associated with the flags in the flags table when the flag has a name (i.e. not 'NA').
- Use `JOIN` and `HAVING` to display continents that have at least 2 countries represented as well as the average adoption date of the flag (as `avg_date`).
\newpage
Here is a small selection of exercises to make sure you know how to apply your knowledge of statistics, probability and simple regression. If you want to practice some more, or to practice on exercises that come with solutions, check out the links in each section. They come from the recommended resources (Khan Academy, Udacity and the probability review).
Table of contents
- Counting: permutations, combinations
- Probability: Probability of an event, Probability of 2 or more events (Conditional probability, Independent and dependent events, Mutually exclusive events, Bayes’ Theorem)
- Probability distribution (Binomial, Geometric and Poisson distributions for discrete random variables, Uniform, Normal and Exponential distributions for continuous random variables)
- Descriptive Statistics: mean, variance, standard deviation, range, IQR
- Inferential Statistics: confidence intervals, hypothesis testing, inference for proportions and means
- Linear regression: model performance, interpretation of coefficients, underfitting/overfitting
NOTE: Some exercises are labeled as Extra Credit, and as such are not mandatory.
- How many ways can you arrange the numbers 1, 2, 3, 4 and 5?
- How many ways can you arrange 1, 1, 2, 3, 4?
- How many ways can you arrange two 3s and three 5s?
Some links: http://bit.ly/2iGgrir, http://bit.ly/2jXtFIt
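Arrangements with repeated items follow the multiset formula n! / (k1! k2! ...), where each k is the number of copies of one distinct item. Below is a minimal Python 3 sketch on a different example (the letters of 'AABBC'); `count_arrangements` is a hypothetical helper name.

```python
from collections import Counter
from math import factorial


def count_arrangements(items):
    """Number of distinct orderings of items, allowing repeated items."""
    total = factorial(len(items))
    # Divide out the orderings that only swap identical copies.
    for copies in Counter(items).values():
        total //= factorial(copies)
    return total


print(count_arrangements('ABC'))    # 3! = 6
print(count_arrangements('AABBC'))  # 5! / (2! * 2! * 1!) = 30
```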
- How many different poker hands (5 cards) can you have? A deck holds 52 cards.
- There are six flavors of ice cream: Stracciatella, Mint Chocolate Chip, Cookies and Cream, Butter Pecan, Pistachio, and Pralines and Cream. How many three-scoop ice creams can you make if all the scoops must be different flavors?

Extra Credit: what happens if you can take several scoops of the same flavor?
Some links: http://bit.ly/2iNIXSF, http://bit.ly/2jXlDiI
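When order does not matter and items cannot repeat, the count is the binomial coefficient "n choose k". The sketch below uses a different, made-up example and assumes Python 3.8+ for `math.comb`; it cross-checks the formula by listing every selection.

```python
from itertools import combinations
from math import comb

toppings = ['olives', 'mushrooms', 'peppers', 'onions']  # hypothetical data

# Formula: n! / (k! * (n - k)!)
n_pairs = comb(len(toppings), 2)

# Brute-force check: list every unordered pair explicitly.
all_pairs = list(combinations(toppings, 2))

print(n_pairs, len(all_pairs))
```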
- In a deck of cards (52 cards), what is the probability of picking a queen? A heart? Of picking a card that is neither a queen nor a heart?
- If I do not replace the cards, what is the probability of picking 2 kings? 4 diamonds? How do these probabilities change if I replace the cards after each draw?
Some links: http://bit.ly/2iNCwyS, http://bit.ly/OtSNH2, http://bit.ly/2j7R4qF
- What is the probability that the total of two dice is less than four, knowing that the first die is a 2?
- 25% of candidates for a web developer position can code in both JavaScript and HTML. 70% of all candidates can code in JavaScript and 50% can code in HTML. What is the probability that a candidate can code in HTML, knowing that they can code in JavaScript?
Some links: http://bit.ly/2iGktHi
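Answers to conditional probability questions can be checked by enumeration, using P(A | B) = P(A and B) / P(B). This is a sketch on a different question (two fair dice: the probability that the total is at least 10, given that the first die shows 6), assuming Python 3.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Event B: the first die shows 6.
given = [o for o in outcomes if o[0] == 6]
# Event A and B: first die shows 6 and the total is at least 10.
both = [o for o in given if sum(o) >= 10]

p_b = Fraction(len(given), len(outcomes))
p_a_and_b = Fraction(len(both), len(outcomes))
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/2
```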
- Number of kids dressed as pumpkins or ghosts on Halloween night and the amount of candy they received:
| Amount of Candy | less than 10 | 10 - 20 | 20 - 30 | greater than 30 |
| :-------------: | :----------: | :-----------: | :-----------: | :----------: |
| Pumpkins | 5 | 10 | 60 | 25 |
| Ghosts | 15 | 40 | 80 | 15 |
- What is the probability that a kid dressed as a pumpkin gets 20 or more pieces of candy? How about if he dresses as a ghost?
- What is the probability that a kid obtains less than 10 pieces of candy?
- What is the probability that two siblings, one dressed as a ghost and one dressed as a pumpkin, each receive 20 to 30 pieces of candy?
- You toss a fair die twice. What is the probability of getting less than 3 on the first toss and an even number on the second?
Some links: http://bit.ly/2jmalpl
Let's consider a population from which we draw a sample of 40 individuals. The probability that no one in the sample wears glasses is 26%. The probability that exactly one individual wears glasses is 32%. What is the probability of
(a) Obtaining not more than one individual wearing glasses in a sample?
(b) Obtaining more than one individual wearing glasses in a sample?
Some links: http://bit.ly/2jmjyxO
- To detect a medical condition, patients are given two tests. 25% of the patients receive positive results on both tests and 42% of the patients receive positive results on the first test. What percent of those who had a positive result on the first test also had a positive result on the second test?
- Extra Credit: A jar contains red and blue marbles. You draw two marbles one after the other without replacing the first marble in the jar. You know that:
  - The probability of selecting a blue marble and then a red marble is 30%.
  - The probability of selecting a red marble on the first draw is 50%.

  You first draw a red marble. What is the probability of selecting a blue marble on the second draw?
Some links: http://bit.ly/2jmjHRS
Common problems relying on discrete (Binomial, Geometric, Poisson) or continuous (Uniform, Normal, Exponential) probability distributions.
Here are some exercises (http://bit.ly/2j7GK25) with their solutions as video.
- Fair coin: Imagine you were to flip a fair coin 10 times. What would be the probability of getting exactly 5 heads?
- Unfair coin: You have a coin with which you are twice as likely to get heads as tails. You flip the coin 100 times. What is the probability of getting 20 tails? What is the probability of getting at least one heads?
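Coin-flip questions of this kind rest on the binomial probability mass function, P(X = k) = C(n, k) p^k (1 - p)^(n - k). This is a minimal sketch with different numbers; `binomial_pmf` is a hypothetical helper name and `math.comb` assumes Python 3.8+.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


print(binomial_pmf(2, 4, 0.5))  # 6 / 16 = 0.375
```

"At least one" questions are usually easiest through the complement: 1 minus the probability of zero successes.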
Suppose you have an unfair coin, with a 68% chance of getting tails. What is the probability that the first head will be on the 3rd trial?
On average 20 taxis drive past your office every 30 minutes. What is the probability that 30 taxis will drive by in 1 hour?
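Counts of events over a time window are commonly modeled with the Poisson distribution, P(X = k) = rate^k e^(-rate) / k!, where the rate must be expressed for the same window as the count. This is a sketch with made-up numbers; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """P(exactly k events) when events occur at the given average rate."""
    return rate ** k * exp(-rate) / factorial(k)


# Hypothetical example: 2 emails per hour on average, so 6 per 3 hours.
rate_3_hours = 2 * 3
print(poisson_pmf(4, rate_3_hours))
```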
Let
Extra Credit: Let
Let the random variable
Extra Credit:
- Suppose $X$ has a standard normal distribution. Compute $P(X > 9)$, $P(1 < X < 3)$ and $P(X > -3)$.
- The weight in pounds of individuals in a population of interest has a normal distribution, with a mean of 150 and a standard deviation of 40. What is the expected range of values that describe the weight of 68% of the population (Hint: use the empirical rule)? Of the people who weigh more than 170 pounds, what percent weigh more than 200 pounds (Hint: this is conditional probability)?
Give the mean, median and mode of the following data:
(20, 45, 68, 900, 57, 45, 33, 35, 45, 22)
Do you think the mean is a good summary statistic? Why or why not?
Give the mean, the variance, the standard deviation, the range and the interquartile range of the following data:
(20, 45, 68, 900, 57, 45, 33, 35, 45, 22)
Give the expression of the mean and the variance for a discrete random variable
Give the expression of the mean and the variance for a continuous random variable
- We are polling to get the approval rate of the president. Out of a population of 4 million, 6014 were surveyed and 3485 expressed their approval. Construct a 95% confidence interval for the approval rate of the president.
- The weights of a random sample of 100 individuals from a population of interest were recorded, yielding a sample average weight of 150 pounds and a sample standard deviation of 20 pounds. Construct a 95% confidence interval for the average weight of the population.
- What is the definition of a significance level? Of a p-value?
- Would you use a one-tailed or a two-tailed test in the following cases:
  - Investigating if women are paid less than men.
  - Comparing the click-through rate of a website when the 'subscribe' button is green vs. when it is blue.
- A man goes to trial. In a hypothesis testing framework, let's define the null hypothesis as Not Guilty and the alternative hypothesis as Guilty.
  - What type of error is made when the man is actually not guilty but the verdict returned is guilty?
  - What type of error is made when the man is actually guilty but the verdict returned is not guilty?
- We want to test the hypothesis that at least 68% of the Canadian population (aged 18+) went to the movies at least once in the past 12 months with a significance level of 5%. We surveyed 4,000 respondents and found 3,012 did go at least once to the movies in the past 12 months. How would your conclusion compare if you only had 40 respondents, 30 of whom went to the movies at least once in the past 12 months?
Some links: http://bit.ly/2jIM1h3
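One common way to set up a test on a single proportion is a z-test based on the normal approximation, which is reasonable only when the sample is large. The sketch below uses made-up numbers, a hypothetical function name, and assumes Python 3.8+ for `statistics.NormalDist`; it tests H0: p >= p0 against H1: p < p0.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_z_test(successes, n, p0):
    """z statistic and p-value for H0: p >= p0 against H1: p < p0."""
    p_hat = successes / n
    # Standard error of the sample proportion under the null hypothesis.
    std_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / std_error
    # Lower-tail probability, since small p_hat is evidence against H0.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical data: 45 successes out of 100, null value p0 = 0.5.
z, p_value = one_sided_proportion_z_test(45, 100, 0.5)
print(z, p_value)
```

The null hypothesis is rejected when the p-value falls below the chosen significance level.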
- We want to test the hypothesis that the average weight in North America is at least 175 pounds. The mean of the weights of the 100 individuals sampled is 178 pounds, with a sample standard deviation of 8 pounds. What are your conclusions?
Some links: http://bit.ly/2jmht5d
- We want to investigate the claim that, on average, sea turtles lay 110 eggs in a nest. Volunteers have gone out and counted the number of eggs in 20 nests. What do you conclude?
- Data:
101, 120, 154, 89, 97, 132, 126, 105, 94, 111, 98, 90, 88, 115, 99, 85, 131, 127, 116
Some links: http://bit.ly/2j7KpN2
- Is there a meaningful difference between the proportion of teenagers vs that of adults that go to the movies at least once per month?
- Data:
- 1000 teenagers are surveyed, 780 answer positively.
- 1000 adults are surveyed, 620 answer positively.
Some links: http://bit.ly/2j7GUXg
- Is there a meaningful difference between the average wingspan of bald eagles vs that of crowned eagles?
- Data for bald eagles (in ft):
[7.4, 7.7, 6.0, 6.7, 8.3, 6.5, 6.9, 7.7, 7.8, 7.3, 6.9, 6.5, 6.3, 4.8, 8.0, 6.8, 5.8, 6.9, 6.3, 6.3, 6.4, 5.1, 6.9, 7.6, 5.6, 6.5, 6.7, 7.8, 6.6, 6.9, 7.0, 6.4, 7.4, 6.0, 7.0, 5.3, 5.8, 6.4, 7.1, 5.5, 7.0, 6.7, 5.8, 6.1, 7.1, 7.9, 7.7, 6.2, 5.3, 6.4, 6.9, 5.9, 7.8, 5.6, 5.0, 5.5, 6.4, 7.1, 8.6, 9.3, 6.8, 7.6, 7.2, 7.1, 5.8, 5.9, 5.1, 6.6, 6.8, 5.7, 6.3, 7.3, 6.3, 7.2, 7.7, 6.0, 7.2, 5.9, 7.2, 7.0, 7.4, 6.5, 7.8, 5.9, 6.3, 6.3, 8.3, 5.9, 6.9, 7.8]
- Data for crowned eagles (in ft):
[5.3, 5.6, 5.8, 5.3, 5.6, 4.9, 5.7, 5.4, 5.8, 5.4, 6.0, 5.4, 5.1, 5.4, 5.2, 5.7, 4.8, 5.8, 5.7, 5.1, 5.3, 5.4, 5.7, 6.6, 5.0, 5.4, 5.3, 5.5, 5.2, 5.6, 5.2, 5.9, 5.7, 5.8, 5.5, 5.2, 4.0, 5.8, 5.2, 6.2, 5.4, 4.6, 5.3, 5.8, 6.3, 4.8, 5.6, 5.4, 5.2, 5.4, 5.1, 6.0, 6.1, 5.4, 5.4, 5.3, 5.0, 6.0, 5.0, 5.8, 5.1, 5.3, 4.8, 5.6, 5.7, 6.1, 5.0, 6.4, 5.1, 4.6, 5.3, 6.0, 4.8, 5.4, 4.3, 5.4, 5.1, 4.7, 6.0, 5.5, 5.4, 5.6, 5.2, 5.8, 5.3, 4.9, 5.3, 5.5, 5.7, 4.7, 6.0, 5.6, 4.9, 5.4, 4.3, 5.5, 4.9, 5.3, 5.6, 6.0]
Some links: http://bit.ly/2jva7OY
- Dataset

| x | 0 | 1 | 2 | 3 | 5 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| y | 1 | 2.1 | 3.2 | 4 | 6.1 |

(a) Plot the corresponding scatter plot.
(b) Find the least squares regression line $y = ax + b$. Add it to your plot.
(c) Estimate the value of $y$ when $x = 4$.
*Extra Credit*: Can you do these steps in Python?
- Dataset
| x | 0 | 1 | 2 | 3 | 4 | 7 | 9 | 11 | 30 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| y | 2. | 4.9 | 8. | 10.8 | 13.9 | 23.1 | 29. | 35. | 92.1 |
(a) Find the least squares regression line for the given data points.
(b) Plot the given points and the regression line on the same graph.
- We have the following (x,y) points:
`[(0, 42.0), (1, -101.0), (2, 21.0), (3, -38.0), (4, 5.0), (7, 20.0), (9, 293.0), (11, 266.0), (15, 625.0), (20, 1266.0), (25, 1757.0), (30, 2844.0)]`
(a) Plot the data.
(b) How do you think a linear model would perform? How about a 100 degree polynomial model? How would you figure out which of these models was preferable?
(c) How would you model the relationship between these features?
- We have a dataset that gives the height and age of a sample of people. The ages range from 1 to 60 years. We decide to compute the correlation coefficient to understand the relationship between these features.
(a) Do you expect the correlation coefficient to be positive or negative?
(b) What are some of the limitations of this approach?
Some links: http://bit.ly/2jXyDF6, http://bit.ly/2jqXuRp, http://bit.ly/2jxlCFA
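For a single feature, the least squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. This is a minimal pure-Python sketch on made-up points; `least_squares_line` is a hypothetical helper name, and libraries such as NumPy offer equivalent routines.

```python
def least_squares_line(xs, ys):
    """Return slope a and intercept b of the line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```

A prediction at a new x is then `a * x + b`.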
- What are Linear Regression and Logistic Regression? How are they different?
- Describe cross-validation and its role in model selection.
- Generally speaking, as we increase the complexity of the model we are evaluating, how do the model's bias and variance change?
- A bank that grants auto loans is building a model, using historical sales data, to predict the price that a used car will sell for. Why is the average error between the predicted and actual price NOT an appropriate metric for evaluating the performance of the model?
- In linear regression, how should coefficients be interpreted? What is the difference between the size of a coefficient and its statistical significance?
- Name two ways to measure the accuracy of a linear regression model.
Some links: http://stanford.io/1Ry9D60