From 9a2107ac061145a03e44194217ea018946c3af1c Mon Sep 17 00:00:00 2001
From: Sean Raleigh
Date: Tue, 8 Aug 2023 15:28:23 -0600
Subject: [PATCH] Update Chapter 1

---
 01-intro_to_r.qmd       | 21 +++++++++---------
 docs/01-intro_to_r.html | 29 +++++++++++++++----------
 docs/index.html         | 21 +++++++++---------
 docs/search.json        | 16 +++++++-------
 index.qmd               | 48 +++++++++++++++++++++--------------------
 5 files changed, 72 insertions(+), 63 deletions(-)

diff --git a/01-intro_to_r.qmd b/01-intro_to_r.qmd
index 73aa123..9ce6bce 100644
--- a/01-intro_to_r.qmd
+++ b/01-intro_to_r.qmd
@@ -42,7 +42,7 @@ We'll return to the Console in a moment.

Next, look at the upper-right corner of the screen. There are at least three tabs in this pane starting with "Environment", "History", and "Connections". The "Environment" (also called the "Global Environment") keeps track of things you define while working with R. There's nothing to see there yet because we haven't defined anything! The "History" tab will likewise be empty; again, we haven't done anything yet. We won't use the "Connections" tab in this course. (Depending on the version of RStudio you are using and its configuration, you may see additional tabs, but we won't need them for this course.)

-Now look at the lower-right corner of the screen. There are likely five tabs here: "Files", "Plots", "Packages", "Help", and "Viewer". The "Files" tab will eventually contain the files you upload or create. "Plots" will show you the result of commands that produce graphs and charts. "Packages" will be explained later. "Help" is precisely what it sounds like; this will be a very useful place for you to get to know. We will never use the "Viewer" tab, so don't worry about it.
+Now look at the lower-right corner of the screen. There are likely six tabs here: "Files", "Plots", "Packages", "Help", "Viewer", and "Presentation". The "Files" tab will eventually contain the files you upload or create. "Plots" will show you the result of commands that produce graphs and charts. "Packages" will be explained later. "Help" is precisely what it sounds like; this will be a very useful place for you to get to know. We will never use the "Viewer" or "Presentation" tabs, so don't worry about them.

## Try something!

@@ -57,7 +57,7 @@ and hit Enter.

Congratulations! You just ran your first command in R. It's all downhill from here. R really is nothing more than a glorified calculator.

-Okay, let's do something slightly more sophisticated. It's important to note that R is case-sensitive, which means that lowercase letters and uppercase letters are treated differently. Type the following, making sure you use a lowercase `c`, and hit Enter:
+Okay, let's do something slightly more sophisticated. It's important to note that R is case-sensitive, which means that lowercase letters and uppercase letters are treated differently. Type the following, making sure you use a lowercase `x` and lowercase `c`, and hit Enter:

```{r}
x <- c(1, 3, 4, 7, 9)
```
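(A quick sketch of the case-sensitivity point above, for anyone typing along at the Console; the uppercase `X` here is deliberately undefined.)

```r
x <- c(1, 3, 4, 7, 9)  # lowercase x, lowercase c: this works
x                      # [1] 1 3 4 7 9
X                      # Error: object 'X' not found -- 'X' and 'x' are different names
```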
@@ -119,16 +119,16 @@ It makes no difference what letter or combination of letters we use to name our

mean_x <- mean(x)
```

-just saves the mean to a differently named variable. In general, variable names can be any combination of characters that are letters, numbers, underscore symbols (`_`), and dots (`.`). (In this course, we will prefer underscores over dots.) You cannot use spaces or any other special character in the names of variables.^[The official spec says that a valid variable name "consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number."] You should avoid variable names that are the same words as predefined R functions; for example, we should not type `mean <- mean(x)`.
+just saves the mean to a differently named variable. In general, variable names can be any combination of characters that are letters, numbers, underscore symbols (`_`), and dots (`.`). (In this course, we will prefer underscores over dots.) You cannot use spaces or any other special characters in the names of variables.^[The official spec says that a valid variable name "consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number."] You should avoid variable names that are the same words as predefined R functions; for example, we should not type `mean <- mean(x)`.

## Load packages

Packages are collections of commands, functions, and sometimes data that people all over the world write and maintain. These packages extend the capabilities of R and add useful tools. For example, we would like to use the `palmerpenguins` package because it includes an interesting data set on penguins.

-If you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you'll need to type `install.packages("palmerpenguins")` if you've never used the `palmerpenguins` package before. If you are using RStudio through a browser, you may not be able to install packages because you may not have admin privileges. If you need a package that is not installed, contact the person who administers your server.
+If you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you'll need to type `install.packages("palmerpenguins")` at the Console. (This is assuming you've never used the `palmerpenguins` package before. Once a package is installed the first time, it never has to be installed again.) If you are using RStudio through a browser, the packages you need should be pre-installed for you. In fact, you may not be able to install packages yourself because you may not have admin privileges. If you need a package that is not installed, contact the person who administers your server.

-The data set is called `penguins`. Let's see what happens when we try to access this data set without loading the package that contains it. Try typing this:
+After we've installed the package (a one-time process), we will need to load the package in every R session in which we want to use it. For example, the `palmerpenguins` package contains a data set called `penguins`. Let's see what happens when we try to access this data set without loading the package that contains it. Try typing this:

```{r}
#| error: true
@@ -155,12 +155,12 @@ Now R knows about the `penguins` data, so the last command printed some of it to

Go look at the "Packages" tab in the pane in the lower-right corner of the screen. Scroll down a little until you get to the "P"s. You should be able to find the `palmerpenguins` package. You'll also notice a check mark by it, indicating that this package is loaded into your current R session.
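(A compact sketch of the install-once, load-every-session workflow described above, assuming `palmerpenguins` has already been installed; `search()` is base R and lists what is attached to the current session, the same information the check mark in the Packages tab conveys.)

```r
penguins                  # Error: object 'penguins' not found -- package not loaded yet

library(palmerpenguins)   # load the package for this session
penguins                  # now prints the first rows of the penguins tibble

search()                  # attached packages; "package:palmerpenguins" now appears
```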
-You must use the `library` command in every new R session in which you want to use a package.^[If you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you'll want to know that `install.packages` only has to be run once, the first time you want to install a package. If you're using RStudio Workbench, you don't even need to type that because your server admin will have already done it for you.] If you terminate your R session, R forgets about the package. If you are ever in a situation where you are trying to use a command and you know you're typing it correctly, but you're still getting an error, check to see if the package containing that command has been loaded with `library`. (Many R commands are "base R" commands, meaning they come with R and no special package is required to access them. The set of `letters` you used above is one such example.)
+You must use the `library` command in every new R session in which you want to use a package. If you terminate your R session, R forgets about the package. If you are ever in a situation where you are trying to use a command and you know you're typing it correctly, but you're still getting an error, check to see if the package containing that command has been loaded with `library`. (Many R commands are "base R" commands, meaning they come with R and no special package is required to access them. The set of `letters` you used above is one such example.)

## Getting help

-There are four important ways to get help with R. The first is the obvious "Help" tab in the lower-right pane on your screen. Click on that tab now. In the search bar at the right, type `penguins` and hit Enter. Take a few minutes to read the help file.
+There are three important ways to get help with R. The first is the obvious "Help" tab in the lower-right pane on your screen. Click on that tab now. In the search bar at the right, type `penguins` and hit Enter. Take a few minutes to read the help file.

Help files are only as good as their authors. Fortunately, most package developers are conscientious enough to write decent help files. But don't be surprised if the help file doesn't quite tell you what you want to know. And for highly technical R functions, sometimes the help files are downright inscrutable. Try looking at the help file for the `grep` function. Can you honestly say you have any idea what this command does or how you might use it? Over time, as you become more knowledgeable about how R works, these help files get less mysterious.

The second way of getting help is from the Console. Go to the Console and type

@@ -186,7 +186,7 @@ You should have received an error because there is no command called `letter`. T

and scroll down a bit in the Help pane. Two question marks tell R not to be too picky about the spelling. This will bring up a whole bunch of possibilities in the Help pane, representing R's best guess as to what you might be searching for. (In this case, it's not easy to find. You'd have to know that the help file for `letters` appeared on a help page called `base::Constants`.)
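(For reference, the two Console help styles contrasted above look like this; `mean` is used only as a familiar example of a name you know exactly.)

```r
?mean      # one question mark: open the help page for an exact, known name
??letter   # two question marks: fuzzy search across all installed help files
```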
-The fourth way to get help---and often the most useful way---is to use your best friend, the search engine. You don't want to just search for "R". (That's the downside of using a single letter of the alphabet for the name of a programming language.) However, if you type "R __________" where you fill in the blank with the topic of interest, search engines usually do a pretty good job sending you to relevant pages. Within the first few hits, in fact, you'll often see an online copy of the same help file you see in R. Frequently, the next few hits lead to [StackOverflow](https://stackoverflow.com) where very knowledgeable people post very helpful responses to common questions.
+The third way to get help---and often the most useful way---is to use your best friend, the internet. You don't want to just type "R" into a search engine. (That's the downside of using a single letter of the alphabet for the name of a programming language.) However, if you type "R __________" where you fill in the blank with the topic of interest, search engines usually do a pretty good job sending you to relevant pages. Within the first few hits, in fact, you'll often see an online copy of the same help file you see in R. Frequently, the next few hits lead to [StackOverflow](https://stackoverflow.com) where very knowledgeable people post very helpful responses to common questions.

Use a search engine to find out how to take the square root of a number in R. Test out your newly-discovered function on a few numbers to make sure it works.

@@ -227,7 +227,7 @@ We can customize this by specifying the number of rows to print. (Don't forget a

head(penguins, n = 10)
```

-The `tail` command does something similar.
+The `tail` command does something similar, but for data from the last few rows.

```{r}
tail(penguins)
```

@@ -242,10 +242,11 @@ library(palmerpenguins)
```

```{r}
+#| eval: true
penguins
```

-You can scroll through the rows by using the numbers at the bottom or the "Next" button. You can scroll through the variables by clicked the little black arrow pointed to the right in the upper-right corner. The only thing you can't do here that you can do with `View` is sort the columns.
+You can scroll through the rows by using the numbers at the bottom or the "Next" button. You can scroll through the variables by clicking the little black arrow pointed to the right in the upper-right corner. The only thing you can't do here that you can do with `View` is sort the columns.

We want to understand the "structure" of our data. For this, we use the `str` command. Try it:
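(The hunk context above stops just short of the code chunk it introduces; in the rendered chapter that chunk is simply the following, which prints each variable's name, type, and first few values.)

```r
str(penguins)
```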
diff --git a/docs/index.html b/docs/index.html
index 9530872..5745bb0 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -356,16 +356,14 @@

Philosophy and ped

Eventually, I got interested in Bayesian statistics and read everything I could get my hands on. I became convinced that Bayesian statistics is the right way to do statistical analysis. I started teaching special topics courses in Bayesian Data Analysis and working with students on research projects that involved Bayesian methods. If it were up to me, every introductory statistics class in the world would be taught using Bayesian methods. I know that sounds like a strong statement. (And I put it in boldface, so it looks even stronger.) But I truly believe that in an alternate universe where Fisher and his disciples didn’t “win” the stats wars of the 20th century (and perhaps one in which computing power got a little more advanced a little earlier in the development of statistics), we would all be Bayesians. Bayesian thinking is far more intuitive and more closely aligned with our intuitions about probabilities and uncertainty.

Unfortunately, our current universe timeline didn’t play out that way. So we are left with frequentism. It’s not that I necessarily object to frequentist tools. All tools are just tools, after all. However, the standard form of frequentist inference, with its null hypothesis significance testing, P-values, and confidence intervals, can be confusing. It’s bad enough that professional researchers struggle with them. We teach undergraduate students in introductory classes.

Okay, so we are stuck not in the world we want, but the world we’ve got. At my institution and most others, intro stats is a service course that trains far more people who are outside the fields of mathematics and statistics. In that world, students will go on to careers where they interact with research that reports P-values and confidence intervals.

-

So what’s the best we can do for our students, given that limitation? We need to be laser-focused on teaching the frequentist logic of inference the best we can. I want student to see P-values in papers and know how to interpret those P-values correctly. I want students to understand what a confidence intervals tells them—and even more importantly, what it does not tell them. I want students to respect the severe limitations inherent in tests of significance. If we’re going to train frequentists, the least we can do is help them become good frequentists.

+

So what’s the best we can do for our students, given that limitation? We need to be laser-focused on teaching the frequentist logic of inference the best we can. I want students to see P-values in papers and know how to interpret those P-values correctly. I want students to understand what a confidence interval tells them—and even more importantly, what it does not tell them. I want students to respect the severe limitations inherent in tests of significance. If we’re going to train frequentists, the least we can do is help them become good frequentists.

One source of inspiration for good statistical pedagogy comes from the Guidelines for Assessment and Instruction in Statistics Education (GAISE), a set of recommendations made by experienced stats educators and endorsed by the American Statistical Association. Their college guidelines are as follows:

  • Teach statistical thinking.
      1. Teach statistics as an investigative process of problem-solving and decision-making.
      2. Give students experience with multivariable thinking.
  • Focus on conceptual understanding.
  • Integrate real data with a context and purpose.
  • Foster active learning.
@@ -387,12 +385,12 @@

    Course structure

    As explained above, this book is meant to be a workbook that students complete as they’re reading.

    At Westminster University, we host Posit Workbench on a server that is connected to our single sign-on (SSO) systems so that students can access RStudio through a browser using their campus online usernames and passwords. If you have the ability to convince your IT folks to get such a server up and running, it’s highly worth it. Rather than spending the first day of class troubleshooting while students try to install software on their machines, you can just have them log in and get started right away. Campus admins install packages and tweak settings to make sure all students have a standardized interface and consistent experience.

    If you don’t have that luxury, you will need to have students download and install both R and RStudio. The installation processes for both pieces of software are very easy and straightforward for the majority of students. The book chapters here assume that the necessary packages are installed already, so if your students are running R on their own machines, they will need to use install.packages at the beginning of some of the chapters for any new packages that are introduced. (They are mentioned at the beginning of each chapter with instructions for installing them.)

    -

    Chapter 1 is fully online and introduces R and RStudio very gently using only commands at the Console. By the end of Chapter 1, they will have created a project called intro_stats in RStudio that should be used all semester to organize their work. There is a reminder at the beginning of all subsequent chapter to make sure they are in that project before starting to do any work. (Generally, there is no reason they will exit the project, but some students get curious and click on stuff.)

    -

    In Chapter 2, students are taught to click a link to download a Quarto document (.qmd). I have found that students struggle initially to get this file to the right place. If students are using RStudio Workbench online, they will need to use the “Upload” button in the Files tab in RStudio to get the file from their Downloads folder (or wherever they tell their machine to put downloaded files from the internet) into RStudio. If students are using R on their own machines, they will need to move the file from their Downloads folder into their project directory. There are some students who have never had to move files around on their computers, so this is a task that might require some guidance from classmates, TAs, or the professor. The location of the project directory and the downloaded files can vary from one machine to the next. They will have to use something like File Explorer for Windows or the Finder for MacOS, so there isn’t a single set of instructions that will get all students’ files successfully in the right place. Once the file is in the correct location, students can just click on it to open it in RStudio and start reading. Chapter 2 is all about using Quarto documents: markdown syntax, R code chunks, and inline code.

    +

    Chapter 1 is fully online and introduces R and RStudio very gently using only commands at the Console. By the end of Chapter 1, students will have created a project called intro_stats in RStudio that should be used all semester to organize their work. There is a reminder at the beginning of all subsequent chapters to make sure they are in that project before starting to do any work. (Generally, there is no reason they will exit the project, but some students get curious and click on stuff.)

    +

    In Chapter 2, students are taught to click a link to download a Quarto document (.qmd). I have found that students struggle initially to get this file to the right place. If students are using RStudio online, they will need to use the “Upload” button in the Files tab in RStudio to get the file from their Downloads folder (or wherever they tell their machine to put downloaded files from the internet) into RStudio. If students are using R on their own machines, they will need to move the file from their Downloads folder into their project directory. There are some students who have never had to move files around on their computers, so this is a task that might require some guidance from classmates, TAs, or the professor. The location of the project directory and the downloaded files can vary from one machine to the next. They will have to use something like File Explorer for Windows or the Finder for MacOS, so there isn’t a single set of instructions that will get all students’ files successfully in the right place. Once the file is in the correct location, students can just click on it to open it in RStudio and start reading. Chapter 2 is all about using Quarto documents: markdown syntax, R code chunks, and inline code.

    By Chapter 3, a rhythm is established that students will start to get used to:

    • Open the book online and open RStudio.
    • -
    • Install any packages in RStudio that are new to that chapter. (Not necessary for those using RStudio Workbench in a browser.)
    • +
    • Install any packages in RStudio that are new to that chapter. (Not necessary for those using RStudio in a browser.)
    • Check to make sure they are in the intro_stats project.
    • Click the link online to download the Quarto document.
    • Move the Quarto document from the Downloads folder to the project directory.
    • Open up the Quarto document.
@@ -400,19 +398,20 @@ Course structure

    • Restart R and Run All Chunks.
    • Start reading and working.
    +

    When students finish each assignment, they will Restart R and Run All Chunks one last time and then “Render” the Quarto document, which will create HTML output that can then be submitted. (Hopefully, they will also take the opportunity to spell check and proofread thoroughly before submission. It’s important to proofread the HTML document not just for the writing, but also to make sure that the code output and formatting all looks correct.)

    Chapters 3 and 4 focus on exploratory data analysis for categorical and numerical data, respectively.

    Chapter 5 is a primer on data manipulation using dplyr.

    Chapters 6 and 7 cover correlation and regression. This “early regression” approach mirrors the IMS text. (IMS eventually circles back to hypothesis testing for regression, but this book does not. That’s a topic that is covered extensively in most second-semester stats classes.)

    Chapters 8–11 are crucial for building the logical foundations for inference. The idea of a sampling distribution under the assumption of a null hypothesis is built up slowly and intuitively through randomization and simulation. By the end of Chapter 11, students will be fully introduced to the structure of a hypothesis test, and hopefully will have experienced the first sparks of intuition about why it “works.” All inference in this book is conducted using a “rubric” approach—basically, the steps are broken down into bite-sized pieces and students are expected to work through each step of the rubric every time they run a test. (The rubric steps are shown in the Appendix.)

    Chapter 12 introduces a few more steps to the rubric for confidence intervals. As we are still using randomization to motivate inference, confidence intervals are calculated using the bootstrap approach for now.

    -

    Once students have developed a conceptual intuition for sampling distributions using simulation, we can introduce probability models as well. Chapter 13 introduces normal models and Chapter 14 explains why they are often appropriate for modeling sampling distributions.

    +

    Once students have developed a conceptual intuition for sampling distributions using simulation, we can introduce probability models. Chapter 13 introduces normal models and Chapter 14 explains why they are often appropriate for modeling sampling distributions.

    The final chapters of the book (Chapters 15–22) are simply applications of inference in specific data settings: inference for one (Ch. 15) and two (Ch. 16) proportions, Chi-square tests for goodness-of-fit (Ch. 17) and independence (Ch. 18), inference for one mean (Ch. 19), paired data (Ch. 20), and two independent means (Ch. 21), and finally ANOVA (Ch. 22). Along the way, students learn about the chi-square, Student t, and F distributions. Although the last part of the book follows a fairly traditional parametric approach, every chapter still includes randomization and simulation to some degree so that students don’t lose track of the intuition behind sampling distributions under the assumption of a null hypothesis.

    Onward and upward

    I hope you enjoy the textbook. You can provide feedback two ways:

      -
    1. The preferred method is to file an issue on the Github page: https://github.com/VectorPosse/intro_stats/issues

      +
    1. The preferred method is to file an issue on the Github page: https://github.com/VectorPosse/intro_stats/issues

    2. Alternatively, send me an email:

It’s just sitting there waiting for you to do something. Unlike other programs where you run commands from menus, R requires you to know what you need to type to make it work.\nWe’ll return to the Console in a moment.\nNext, look at the upper-right corner of the screen. There are at least three tabs in this pane starting with “Environment”, “History”, and “Connections”. The “Environment” (also called the “Global Environment”) keeps track of things you define while working with R. There’s nothing to see there yet because we haven’t defined anything! The “History” tab will likewise be empty; again, we haven’t done anything yet. We won’t use the “Connections” tab in this course. (Depending on the version of RStudio you are using and its configuration, you may see additional tabs, but we won’t need them for this course.)\nNow look at the lower-right corner of the screen. There are likely six tabs here: “Files”, “Plots”, “Packages”, “Help”, “Viewer”, and “Presentation”. The “Files” tab will eventually contain the files you upload or create. “Plots” will show you the result of commands that produce graphs and charts. “Packages” will be explained later. “Help” is precisely what it sounds like; this will be a very useful place for you to get to know. We will never use the “Viewer” or “Presentation” tabs, so don’t worry about them." }, { "objectID": "01-intro_to_r.html#try-something", "href": "01-intro_to_r.html#try-something", "title": "1  Introduction to R", "section": "1.4 Try something!", - "text": "1.4 Try something!\nSo let’s do something in R! Go back to the Console and at the command prompt (the “>” symbol with the blinking cursor), type\n\n1+1\n\nand hit Enter.\nCongratulations! You just ran your first command in R. It’s all downhill from here. R really is nothing more than a glorified calculator.\nOkay, let’s do something slightly more sophisticated. It’s important to note that R is case-sensitive, which means that lowercase letters and uppercase letters are treated differently. Type the following, making sure you use a lowercase c, and hit Enter:\n\nx <- c(1, 3, 4, 7, 9)\n\nYou have just created a “vector”. When we use the letter c and enclose a list of things in parentheses, we tell R to “combine” those elements. So, a vector is just a collection of data. The little arrow <- says to take what’s on the right and assign it to the symbol on the left. The vector x is now saved in memory. As long as you don’t terminate your current R session, this vector is available to you.\nCheck out the “Environment” pane now. You should see the vector x that you just created, along with some information about it. Next to x, it says num, which means your vector has numerical data. Then it says [1:5] which indicates that there are five elements in the vector x.\nAt the command prompt in the Console, type\n\nx\n\nand hit Enter. Yup, x is there. R knows what it is. You may be wondering about the [1] that appears at the beginning of the line. To see what that means, try typing this (and hit Enter—at some point here I’m going to stop reminding you to hit Enter after everything you type):\n\ny <- letters\n\nR is clever, so the alphabet is built in under the name letters.\nType\n\ny\n\nNow can you see what the [1] meant above? 
Assuming the letters spilled onto more than one line of the Console, you should see a number in brackets at the beginning of each line telling you the numerical position of the first entry in each new line.\nSince we’ve done a few things, check out the “Global Environment” in the upper-right corner. You should see the two objects we’ve defined thus far, x and y. Now click on the “History” tab. Here you have all the commands you have run so far. This can be handy if you need to go back and re-run an earlier command, or if you want to modify an earlier command and it’s easier to edit it slightly than type it all over again. To get an older command back into the Console, either double-click on it, or select it and click the “To Console” button at the top of the pane.\nWhen we want to re-use an old command, it has usually not been that long since we last used it. In this case, there is an even more handy trick. Click in the Console so that the cursor is blinking at the blank command prompt. Now hit the up arrow on your keyboard. Do it again. Now hit the down arrow once or twice. This is a great way to access the most recently used commands from your command history.\nLet’s do something with x. Type\n\nsum(x)\n\nI bet you figured out what just happened.\nNow try\n\nmean(x)\n\nWhat if we wanted to save the mean of those five numbers for use later? We can assign the result to another variable! Type the following and observe the effect in the Environment.\n\nm <- mean(x)\n\nIt makes no difference what letter or combination of letters we use to name our variables. For example,\n\nmean_x <- mean(x)\n\njust saves the mean to a differently named variable. In general, variable names can be any combination of characters that are letters, numbers, underscore symbols (_), and dots (.). (In this course, we will prefer underscores over dots.) You cannot use spaces or any other special character in the names of variables.1 You should avoid variable names that are the same words as predefined R functions; for example, we should not type mean <- mean(x)." + "text": "1.4 Try something!\nSo let’s do something in R! Go back to the Console and at the command prompt (the “>” symbol with the blinking cursor), type\n\n1+1\n\nand hit Enter.\nCongratulations! You just ran your first command in R. It’s all downhill from here. R really is nothing more than a glorified calculator.\nOkay, let’s do something slightly more sophisticated. It’s important to note that R is case-sensitive, which means that lowercase letters and uppercase letters are treated differently. Type the following, making sure you use a lowercase x and lowercase c, and hit Enter:\n\nx <- c(1, 3, 4, 7, 9)\n\nYou have just created a “vector”. When we use the letter c and enclose a list of things in parentheses, we tell R to “combine” those elements. So, a vector is just a collection of data. The little arrow <- says to take what’s on the right and assign it to the symbol on the left. The vector x is now saved in memory. As long as you don’t terminate your current R session, this vector is available to you.\nCheck out the “Environment” pane now. You should see the vector x that you just created, along with some information about it. Next to x, it says num, which means your vector has numerical data. Then it says [1:5] which indicates that there are five elements in the vector x.\nAt the command prompt in the Console, type\n\nx\n\nand hit Enter. Yup, x is there. R knows what it is. You may be wondering about the [1] that appears at the beginning of the line. 
To see what that means, try typing this (and hit Enter—at some point here I’m going to stop reminding you to hit Enter after everything you type):\n\ny <- letters\n\nR is clever, so the alphabet is built in under the name letters.\nType\n\ny\n\nNow can you see what the [1] meant above? Assuming the letters spilled onto more than one line of the Console, you should see a number in brackets at the beginning of each line telling you the numerical position of the first entry in each new line.\nSince we’ve done a few things, check out the “Global Environment” in the upper-right corner. You should see the two objects we’ve defined thus far, x and y. Now click on the “History” tab. Here you have all the commands you have run so far. This can be handy if you need to go back and re-run an earlier command, or if you want to modify an earlier command and it’s easier to edit it slightly than type it all over again. To get an older command back into the Console, either double-click on it, or select it and click the “To Console” button at the top of the pane.\nWhen we want to re-use an old command, it has usually not been that long since we last used it. In this case, there is an even more handy trick. Click in the Console so that the cursor is blinking at the blank command prompt. Now hit the up arrow on your keyboard. Do it again. Now hit the down arrow once or twice. This is a great way to access the most recently used commands from your command history.\nLet’s do something with x. Type\n\nsum(x)\n\nI bet you figured out what just happened.\nNow try\n\nmean(x)\n\nWhat if we wanted to save the mean of those five numbers for use later? We can assign the result to another variable! Type the following and observe the effect in the Environment.\n\nm <- mean(x)\n\nIt makes no difference what letter or combination of letters we use to name our variables. For example,\n\nmean_x <- mean(x)\n\njust saves the mean to a differently named variable. In general, variable names can be any combination of characters that are letters, numbers, underscore symbols (_), and dots (.). (In this course, we will prefer underscores over dots.) You cannot use spaces or any other special characters in the names of variables.1 You should avoid variable names that are the same words as predefined R functions; for example, we should not type mean <- mean(x)." }, { "objectID": "01-intro_to_r.html#load-packages", "href": "01-intro_to_r.html#load-packages", "title": "1  Introduction to R", "section": "1.5 Load packages", - "text": "1.5 Load packages\nPackages are collections of commands, functions, and sometimes data that people all over the world write and maintain. These packages extend the capabilities of R and add useful tools. For example, we would like to use the palmerpenguins package because it includes an interesting data set on penguins.\nIf you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you’ll need to type install.packages(\"palmerpenguins\") if you’ve never used the palmerpenguins package before. If you are using RStudio through a browser, you may not be able to install packages because you may not have admin privileges. If you need a package that is not installed, contact the person who administers your server.\nThe data set is called penguins. Let’s see what happens when we try to access this data set without loading the package that contains it. Try typing this:\n\npenguins\n\nYou should have received an error. 
That makes sense because R doesn’t know anything about a data set called penguins.\nNow—assuming you have the palmerpenguins package installed—type this at the command prompt:\n\nlibrary(palmerpenguins)\n\nIt didn’t look like anything happened. However, in the background, all the stuff in the palmerpenguins package became available to use.\nLet’s test that claim. Hit the up arrow twice and get back to where you see this at the Console (or you can manually re-type it, but that’s no fun!):\n\npenguins\n\nNow R knows about the penguins data, so the last command printed some of it to the Console.\nGo look at the “Packages” tab in the pane in the lower-right corner of the screen. Scroll down a little until you get to the “P”s. You should be able to find the palmerpenguins package. You’ll also notice a check mark by it, indicating that this package is loaded into your current R session.\nYou must use the library command in every new R session in which you want to use a package.2 If you terminate your R session, R forgets about the package. If you are ever in a situation where you are trying to use a command and you know you’re typing it correctly, but you’re still getting an error, check to see if the package containing that command has been loaded with library. (Many R commands are “base R” commands, meaning they come with R and no special package is required to access them. The set of letters you used above is one such example.)" + "text": "1.5 Load packages\nPackages are collections of commands, functions, and sometimes data that people all over the world write and maintain. These packages extend the capabilities of R and add useful tools. For example, we would like to use the palmerpenguins package because it includes an interesting data set on penguins.\nIf you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you’ll need to type install.packages(\"palmerpenguins\") at the Console. (This is assuming you’ve never used the palmerpenguins package before. Once a package is installed the first time, it never has to be installed again.) If you are using RStudio through a browser, the packages you need should be pre-installed for you. In fact, you may not be able to install packages yourself because you may not have admin privileges. If you need a package that is not installed, contact the person who administers your server.\nAfter we’ve installed the package (a one-time process), we will need to load the package in every R session in which we want to use it. For example, the palmerpenguins package contains a data set called penguins. Let’s see what happens when we try to access this data set without loading the package that contains it. Try typing this:\n\npenguins\n\nYou should have received an error. That makes sense because R doesn’t know anything about a data set called penguins.\nNow—assuming you have the palmerpenguins package installed—type this at the command prompt:\n\nlibrary(palmerpenguins)\n\nIt didn’t look like anything happened. However, in the background, all the stuff in the palmerpenguins package became available to use.\nLet’s test that claim. Hit the up arrow twice and get back to where you see this at the Console (or you can manually re-type it, but that’s no fun!):\n\npenguins\n\nNow R knows about the penguins data, so the last command printed some of it to the Console.\nGo look at the “Packages” tab in the pane in the lower-right corner of the screen. Scroll down a little until you get to the “P”s. 
You should be able to find the palmerpenguins package. You’ll also notice a check mark by it, indicating that this package is loaded into your current R session.\nYou must use the library command in every new R session in which you want to use a package. If you terminate your R session, R forgets about the package. If you are ever in a situation where you are trying to use a command and you know you’re typing it correctly, but you’re still getting an error, check to see if the package containing that command has been loaded with library. (Many R commands are “base R” commands, meaning they come with R and no special package is required to access them. The set of letters you used above is one such example.)" }, { "objectID": "01-intro_to_r.html#getting-help", "href": "01-intro_to_r.html#getting-help", "title": "1  Introduction to R", "section": "1.6 Getting help", - "text": "1.6 Getting help\nThere are four important ways to get help with R. The first is the obvious “Help” tab in the lower-right pane on your screen. Click on that tab now. In the search bar at the right, type penguins and hit Enter. Take a few minutes to read the help file.\nHelp files are only as good as their authors. Fortunately, most package developers are conscientious enough to write decent help files. But don’t be surprised if the help file doesn’t quite tell you what you want to know. And for highly technical R functions, sometimes the help files are downright inscrutable. Try looking at the help file for the grep function. Can you honestly say you have any idea what this command does or how you might use it? Over time, as you become more knowledgeable about how R works, these help files get less mysterious.\nThe second way of getting help is from the Console. Go to the Console and type\n\n?letters\n\nThe question mark tells R you need help with the R command letters. This will bring up the help file in the same Help pane you were looking at before.\nSometimes, you don’t know exactly what the name of the command is. For example, suppose we misremembered the name and thought it was letter instead of letters. Try typing this:\n\n?letter\n\nYou should have received an error because there is no command called letter. Try this instead:\n\n??letter\n\nand scroll down a bit in the Help pane. Two question marks tell R not to be too picky about the spelling. This will bring up a whole bunch of possibilities in the Help pane, representing R’s best guess as to what you might be searching for. (In this case, it’s not easy to find. You’d have to know that the help file for letters appeared on a help page called base::Constants.)\nThe fourth way to get help—and often the most useful way—is to use your best friend, the search engine. You don’t want to just search for “R”. (That’s the downside of using a single letter of the alphabet for the name of a programming language.) However, if you type “R __________” where you fill in the blank with the topic of interest, search engines usually do a pretty good job sending you to relevant pages. Within the first few hits, in fact, you’ll often see an online copy of the same help file you see in R. Frequently, the next few hits lead to StackOverflow where very knowledgeable people post very helpful responses to common questions.\nUse a search engine to find out how to take the square root of a number in R. Test out your newly-discovered function on a few numbers to make sure it works." + "text": "1.6 Getting help\nThere are three important ways to get help with R. 
The first is the obvious “Help” tab in the lower-right pane on your screen. Click on that tab now. In the search bar at the right, type penguins and hit Enter. Take a few minutes to read the help file.\nHelp files are only as good as their authors. Fortunately, most package developers are conscientious enough to write decent help files. But don’t be surprised if the help file doesn’t quite tell you what you want to know. And for highly technical R functions, sometimes the help files are downright inscrutable. Try looking at the help file for the grep function. Can you honestly say you have any idea what this command does or how you might use it? Over time, as you become more knowledgeable about how R works, these help files get less mysterious.\nThe second way of getting help is from the Console. Go to the Console and type\n\n?letters\n\nThe question mark tells R you need help with the R command letters. This will bring up the help file in the same Help pane you were looking at before.\nSometimes, you don’t know exactly what the name of the command is. For example, suppose we misremembered the name and thought it was letter instead of letters. Try typing this:\n\n?letter\n\nYou should have received an error because there is no command called letter. Try this instead:\n\n??letter\n\nand scroll down a bit in the Help pane. Two question marks tell R not to be too picky about the spelling. This will bring up a whole bunch of possibilities in the Help pane, representing R’s best guess as to what you might be searching for. (In this case, it’s not easy to find. You’d have to know that the help file for letters appeared on a help page called base::Constants.)\nThe third way to get help—and often the most useful way—is to use your best friend, the internet. You don’t want to just type “R” into a search engine. (That’s the downside of using a single letter of the alphabet for the name of a programming language.) However, if you type “R __________” where you fill in the blank with the topic of interest, search engines usually do a pretty good job sending you to relevant pages. Within the first few hits, in fact, you’ll often see an online copy of the same help file you see in R. Frequently, the next few hits lead to StackOverflow where very knowledgeable people post very helpful responses to common questions.\nUse a search engine to find out how to take the square root of a number in R. Test out your newly-discovered function on a few numbers to make sure it works." }, { "objectID": "01-intro_to_r.html#understanding-the-data", "href": "01-intro_to_r.html#understanding-the-data", "title": "1  Introduction to R", "section": "1.7 Understanding the data", - "text": "1.7 Understanding the data\nLet’s go back to the penguins data contained in the penguins data set from the palmerpenguins package.\nThe first thing we do to understand a data set is to read the help file on it. (We’ve already done this for the penguins data.) Of course, this only works for data files that come with R or with a package that can be loaded into R. If you are using R to analyze your own data, presumably you don’t need a help file. And if you’re analyzing data from another source, you’ll have to go to that source to find out about the data.\nWhen you read the help file for penguins, you may have noticed that it described the “Format” as being “A tibble with 344 rows and 8 variables.” What is a “tibble”?\nThe word “tibble” is an R-specific term that describes data organized in a specific way. 
A more common term is “data frame” (or sometimes “data table”). The idea is that in a data frame, the rows and the columns have very specific interpretations.\nEach row of a data frame represents a single object or observation. So in the penguins data, each row represents a penguin. If you have survey data, each row will usually represent a single person. But an “object” can be anything about which we collect data. State-level data might have 50 rows and each row represents an entire state.\nEach column of a data frame represents a variable, which is a property, attribute, or measurement made about the objects in the data. For example, the help file mentions that various pieces of information are recorded about each penguin, like species, bill length, flipper length, body mass, sex, and so on. These are examples of variables. In a survey, for example, the variables will likely be the responses to individual questions.\nWe will use the terms tibble and data frame interchangeably in this course. They are not quite synonyms: tibbles are R-specific implementations of data frames, the latter being a more general term that applies in all statistical contexts. Nevertheless, there are no situations (at least not encountered in this course) where it makes any difference if a data set is called a tibble or a data frame.\nWe can also look at the data frame in “spreadsheet” form. Type\n\nView(penguins)\n\n(Be sure you’re using an upper-case “V” in View.) A new pane should open up in the upper-left corner of the screen. In that pane, the penguins data appears in a grid format, like a spreadsheet. The observations (individual penguins) are the rows and the variables (attributes and measurements about the penguins) are the columns. This will also let you sort each column by clicking on the arrows next to the variable name across the top.\nSometimes, we just need a little peek at the data. Try this to print just a few rows of data to the Console:\n\nhead(penguins)\n\nWe can customize this by specifying the number of rows to print. (Don’t forget about the up arrow trick!)\n\nhead(penguins, n = 10)\n\nThe tail command does something similar.\n\ntail(penguins)\n\nWhen we’re working with HTML documents like this one, it’s usually not necessary to use View, head, or tail because the HTML format will print the data frame a lot more neatly than it did in the Console. You do not need to type the following code; just look below it for the table that appears.\n\n\nWarning: package 'palmerpenguins' was built under R version 4.3.1\n\n\n\npenguins\n\nYou can scroll through the rows by using the numbers at the bottom or the “Next” button. You can scroll through the variables by clicked the little black arrow pointed to the right in the upper-right corner. The only thing you can’t do here that you can do with View is sort the columns.\nWe want to understand the “structure” of our data. For this, we use the str command. Try it:\n\nstr(penguins)\n\nThis tells us several important things. First it says that we are looking at a tibble with 344 observations of 8 variables. We can isolate those pieces of information separately as well, if needed:\n\nNROW(penguins)\n\n\nNCOL(penguins)\n\nThese give you the number of rows and columns, respectively.\nThe str command also tells us about each of the variables in our data set. We’ll talk about these later.\nWe need to be able to summarize variables in the data set. 
The summary command is one way to do it:\n\nsummary(penguins)\n\nYou may not recognize terms like “Median” or “1st Qu.” or “3rd Qu.” yet. Nevertheless, you can see why this summary could come in handy." + "text": "1.7 Understanding the data\nLet’s go back to the penguins data contained in the penguins data set from the palmerpenguins package.\nThe first thing we do to understand a data set is to read the help file on it. (We’ve already done this for the penguins data.) Of course, this only works for data files that come with R or with a package that can be loaded into R. If you are using R to analyze your own data, presumably you don’t need a help file. And if you’re analyzing data from another source, you’ll have to go to that source to find out about the data.\nWhen you read the help file for penguins, you may have noticed that it described the “Format” as being “A tibble with 344 rows and 8 variables.” What is a “tibble”?\nThe word “tibble” is an R-specific term that describes data organized in a specific way. A more common term is “data frame” (or sometimes “data table”). The idea is that in a data frame, the rows and the columns have very specific interpretations.\nEach row of a data frame represents a single object or observation. So in the penguins data, each row represents a penguin. If you have survey data, each row will usually represent a single person. But an “object” can be anything about which we collect data. State-level data might have 50 rows and each row represents an entire state.\nEach column of a data frame represents a variable, which is a property, attribute, or measurement made about the objects in the data. For example, the help file mentions that various pieces of information are recorded about each penguin, like species, bill length, flipper length, body mass, sex, and so on. These are examples of variables. In a survey, for example, the variables will likely be the responses to individual questions.\nWe will use the terms tibble and data frame interchangeably in this course. They are not quite synonyms: tibbles are R-specific implementations of data frames, the latter being a more general term that applies in all statistical contexts. Nevertheless, there are no situations (at least not encountered in this course) where it makes any difference if a data set is called a tibble or a data frame.\nWe can also look at the data frame in “spreadsheet” form. Type\n\nView(penguins)\n\n(Be sure you’re using an upper-case “V” in View.) A new pane should open up in the upper-left corner of the screen. In that pane, the penguins data appears in a grid format, like a spreadsheet. The observations (individual penguins) are the rows and the variables (attributes and measurements about the penguins) are the columns. This will also let you sort each column by clicking on the arrows next to the variable name across the top.\nSometimes, we just need a little peek at the data. Try this to print just a few rows of data to the Console:\n\nhead(penguins)\n\nWe can customize this by specifying the number of rows to print. (Don’t forget about the up arrow trick!)\n\nhead(penguins, n = 10)\n\nThe tail command does something similar, but for data from the last few rows.\n\ntail(penguins)\n\nWhen we’re working with HTML documents like this one, it’s usually not necessary to use View, head, or tail because the HTML format will print the data frame a lot more neatly than it did in the Console. 
You do not need to type the following code; just look below it for the table that appears.\n\n\nWarning: package 'palmerpenguins' was built under R version 4.3.1\n\n\n\npenguins\n\n\n\n \n\n\n\nYou can scroll through the rows by using the numbers at the bottom or the “Next” button. You can scroll through the variables by clicking the little black arrow pointed to the right in the upper-right corner. The only thing you can’t do here that you can do with View is sort the columns.\nWe want to understand the “structure” of our data. For this, we use the str command. Try it:\n\nstr(penguins)\n\nThis tells us several important things. First it says that we are looking at a tibble with 344 observations of 8 variables. We can isolate those pieces of information separately as well, if needed:\n\nNROW(penguins)\n\n\nNCOL(penguins)\n\nThese give you the number of rows and columns, respectively.\nThe str command also tells us about each of the variables in our data set. We’ll talk about these later.\nWe need to be able to summarize variables in the data set. The summary command is one way to do it:\n\nsummary(penguins)\n\nYou may not recognize terms like “Median” or “1st Qu.” or “3rd Qu.” yet. Nevertheless, you can see why this summary could come in handy." }, { "objectID": "01-intro_to_r.html#understanding-the-variables", @@ -116,7 +116,7 @@ "href": "01-intro_to_r.html#footnotes", "title": "1  Introduction to R", "section": "", - "text": "The official spec says that a valid variable name “consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number.”↩︎\nIf you have installed R and RStudio on your own machine instead of accessing RStudio through a browser, you’ll want to know that install.packages only has to be run once, the first time you want to install a package. If you’re using RStudio Workbench, you don’t even need to type that because your server admin will have already done it for you.↩︎" + "text": "The official spec says that a valid variable name “consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number.”↩︎" }, { "objectID": "02-using_quarto-web.html#introduction", diff --git a/index.qmd b/index.qmd index 18891b6..cdf38fd 100644 --- a/index.qmd +++ b/index.qmd @@ -20,17 +20,17 @@ In 2015, a group of interdisciplinary faculty at Westminster University (then ca Since then, I have been revising and updating the modules a little every semester. At some point, however, it became clear that some big changes needed to happen: -- The modules were more or less aligned with the OpenIntro book *Introduction to Statistics with Randomization and Simulation* (ISRS) by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel. That book has now been supplanted by [*Introduction to Modern Statistics* (IMS)](https://openintro-ims.netlify.app/) by Mine Çetinkaya-Rundel and Johanna Hardin, also published through the OpenIntro project. -- The initial materials were written mostly using a mix of base R tools, some `tidyverse` tools, and the amazing resources of the `mosaic` package. I wanted to convert everything to be more aligned with `tidyverse` packages now that they are mature, well-supported, and becoming a *de facto* standard for doing data analysis in R. -- The initial choice of data sets that served as examples and exercises for students was guided by convenience. 
As I had only a short amount of time to write an entire textbook from scratch, I tended to grab the first data sets I could find that met the conditions needed for the statistical principles I was trying to illustrate. It has become clear in the last few years that the material will be more engaging with more interesting data sets. Ideally, we should use at least some data sets that speak to issues of social justice. -- Making statistics more inclusive requires us to confront some ugly chapters in the development of the subject. Statistical principles are often named after people. (These are supposedly the people who "discovered" the principle, but keep in mind Stigler's Law of Eponymy which states that no scientific discovery is truly named after its original discoverer. In a neat bit of self-referential irony, Stephen Stigler was not the first person to make this observation.) The beliefs of some of these people were problematic. For example, Francis Galton (famous for the concept of "regression to the mean"), Karl Pearson (of the Pearson correlation coefficient), and Ronald Fisher (famous for many things, including the P-value) were all deeply involved in the eugenics movement of the late 19th and early 20th century. The previous modules almost never referenced this important historical background and context. Additionally, it's important to discuss ethics, whether that be issues of data provenance, data manipulation, choice of analytic techniques, framing conclusions, and many other topics. +* The modules were more or less aligned with the OpenIntro book *Introduction to Statistics with Randomization and Simulation* (ISRS) by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel. That book has now been supplanted by [*Introduction to Modern Statistics* (IMS)](https://openintro-ims.netlify.app/) by Mine Çetinkaya-Rundel and Johanna Hardin, also published through the OpenIntro project. +* The initial materials were written mostly using a mix of base R tools, some `tidyverse` tools, and the amazing resources of the `mosaic` package. I wanted to convert everything to be more aligned with `tidyverse` packages now that they are mature, well-supported, and becoming a *de facto* standard for doing data analysis in R. +* The initial choice of data sets that served as examples and exercises for students was guided by convenience. As I had only a short amount of time to write an entire textbook from scratch, I tended to grab the first data sets I could find that met the conditions needed for the statistical principles I was trying to illustrate. It has become clear in the last few years that the material will be more engaging with more interesting data sets. Ideally, we should use at least some data sets that speak to issues of social justice. +* Making statistics more inclusive requires us to confront some ugly chapters in the development of the subject. Statistical principles are often named after people. (These are supposedly the people who "discovered" the principle, but keep in mind Stigler's Law of Eponymy which states that no scientific discovery is truly named after its original discoverer. In a neat bit of self-referential irony, Stephen Stigler was not the first person to make this observation.) The beliefs of some of these people were problematic. 
For example, Francis Galton (famous for the concept of "regression to the mean"), Karl Pearson (of the Pearson correlation coefficient), and Ronald Fisher (famous for many things, including the P-value) were all deeply involved in the eugenics movement of the late 19th and early 20th century. The previous modules almost never referenced this important historical background and context. Additionally, it's important to discuss ethics, whether that be issues of data provenance, data manipulation, choice of analytic techniques, framing conclusions, and many other topics.

The efforts of my revisions are here online. I've tried to address all the concerns mentioned above:

-- The chapter are arranged to align somewhat with IMS. There isn't quite a one-to-one correspondence, but teachers who want to use the chapters of my book to supplement instruction from IMS, or vice versa, should be able to do so pretty easily. In the [Appendix](Concordance.qmd), I've included a concordance that shows how the books' chapters match up, along with some notes that explain when one book does more or less than the other.
-- The book is now completely aligned with the `tidyverse` and other packages that are designed to integrate into the `tidyverse`. All plotting is done with `ggplot2` and all data manipulation is done with `dplyr`, `tidyr`, and `forcats`. Tables are created using `tabyl` from the `janitor` package. Inference is taught using the cool tools in the `infer` package.
-- I have made an effort to find more interesting data sets. It's tremendously difficult to find data that is both fascinating on its merits and also meets the pedagogical requirements of an introductory statistics course. I would like to use even more data that addresses social justice issues. There's some in the book now, and I plan to incorporate even more in the future as I come across data sets that are suitable.
-- When statistical tools are introduced, I have tried to give a little historical context about their development if I can. I've also tried to frame every step of the inferential process as a decision-making process that requires not only analytical expertise, but also solid ethical grounding. Again, there's a lot more I could do here, and my goal is to continue to develop more such discussion as I can in future revisions.
+* The chapters are arranged to align somewhat with IMS. There isn't quite a one-to-one correspondence, but teachers who want to use the chapters of my book to supplement instruction from IMS, or vice versa, should be able to do so pretty easily. In the [Appendix](Concordance.qmd), I've included a concordance that shows how the books' chapters match up, along with some notes that explain when one book does more or less than the other.
+* The book is now completely aligned with the `tidyverse` and other packages that are designed to integrate into the `tidyverse`. All plotting is done with `ggplot2` and all data manipulation is done with `dplyr`, `tidyr`, and `forcats`. Tables are created using `tabyl` from the `janitor` package. Inference is taught using the cool tools in the `infer` package.
+* I have made an effort to find more interesting data sets. It's tremendously difficult to find data that is both fascinating on its merits and also meets the pedagogical requirements of an introductory statistics course. I would like to use even more data that addresses social justice issues. There's some in the book now, and I plan to incorporate even more in the future as I come across data sets that are suitable.
+* When statistical tools are introduced, I have tried to give a little historical context about their development if I can. I've also tried to frame every step of the inferential process as a decision-making process that requires not only analytical expertise, but also solid ethical grounding. Again, there's a lot more I could do here, and my goal is to continue to develop more such discussion as I can in future revisions.

Now, instead of a bunch of separate module files, all the material is gathered in one place as chapters of a book. In each chapter (starting with Chapter 2), students can download the chapter as a Quarto document, open it in RStudio, and work through the material.

@@ -51,13 +51,13 @@ Unfortunately, our current universe timeline didn't play out that way. So we are

Okay, so we are stuck not in the world we want, but the world we've got. At my institution and most others, intro stats is a service course that trains far more people who are outside the fields of mathematics and statistics. In that world, students will go on to careers where they interact with research that reports P-values and confidence intervals.

-So what's the best we can do for our students, given that limitation? We need to be laser-focused on teaching the frequentist logic of inference the best we can. I want student to see P-values in papers and know how to interpret those P-values correctly. I want students to understand what a confidence intervals tells them---and even more importantly, what it does not tell them. I want students to respect the severe limitations inherent in tests of significance. If we're going to train frequentists, the least we can do is help them become good frequentists.
+So what's the best we can do for our students, given that limitation? We need to be laser-focused on teaching the frequentist logic of inference the best we can. I want students to see P-values in papers and know how to interpret those P-values correctly. I want students to understand what a confidence interval tells them---and even more importantly, what it does not tell them. I want students to respect the severe limitations inherent in tests of significance. If we're going to train frequentists, the least we can do is help them become good frequentists.

One source of inspiration for good statistical pedagogy comes from the [Guidelines for Assessment and Instruction in Statistics Education (GAISE)](https://www.amstat.org/education/guidelines-for-assessment-and-instruction-in-statistics-education-(gaise)-reports), a set of recommendations made by experienced stats educators and endorsed by the American Statistical Association. Their college guidelines are as follows:

1. Teach statistical thinking.
-    - Teach statistics as an investigative process of problem-solving and decision-making.
-    - Give students experience with multivariable thinking.
+    (a) Teach statistics as an investigative process of problem-solving and decision-making.
+    (b) Give students experience with multivariable thinking.
2. Focus on conceptual understanding.
3. Integrate real data with a context and purpose.
4. Foster active learning.

@@ -82,20 +82,22 @@ At Westminster University, we host Posit Workbench on a server that is connected

If you don't have that luxury, you will need to have students download and install both R and RStudio. The installation processes for both pieces of software are very easy and straightforward for the majority of students. The book chapters here assume that the necessary packages are installed already, so if your students are running R on their own machines, they will need to use `install.packages` at the beginning of some of the chapters for any new packages that are introduced. (They are mentioned at the beginning of each chapter with instructions for installing them.)

-Chapter 1 is fully online and introduces R and RStudio very gently using only commands at the Console. By the end of Chapter 1, they will have created a project called `intro_stats` in RStudio that should be used all semester to organize their work. There is a reminder at the beginning of all subsequent chapter to make sure they are in that project before starting to do any work. (Generally, there is no reason they will exit the project, but some students get curious and click on stuff.)
+Chapter 1 is fully online and introduces R and RStudio very gently using only commands at the Console. By the end of Chapter 1, students will have created a project called `intro_stats` in RStudio that should be used all semester to organize their work. There is a reminder at the beginning of all subsequent chapters to make sure they are in that project before starting to do any work. (Generally, there is no reason they will exit the project, but some students get curious and click on stuff.)

-In Chapter 2, students are taught to click a link to download a Quarto document (`.qmd`). I have found that students struggle initially to get this file to the right place. If students are using RStudio Workbench online, they will need to use the "Upload" button in the Files tab in RStudio to get the file from their Downloads folder (or wherever they tell their machine to put downloaded files from the internet) into RStudio. If students are using R on their own machines, they will need to move the file from their Downloads folder into their project directory. There are some students who have never had to move files around on their computers, so this is a task that might require some guidance from classmates, TAs, or the professor. The location of the project directory and the downloaded files can vary from one machine to the next. They will have to use something like File Explorer for Windows or the Finder for MacOS, so there isn't a single set of instructions that will get all students' files successfully in the right place. Once the file is in the correct location, students can just click on it to open it in RStudio and start reading. Chapter 2 is all about using Quarto documents: markdown syntax, R code chunks, and inline code.
+In Chapter 2, students are taught to click a link to download a Quarto document (`.qmd`). I have found that students struggle initially to get this file to the right place. If students are using RStudio online, they will need to use the "Upload" button in the Files tab in RStudio to get the file from their Downloads folder (or wherever they tell their machine to put downloaded files from the internet) into RStudio. If students are using R on their own machines, they will need to move the file from their Downloads folder into their project directory. There are some students who have never had to move files around on their computers, so this is a task that might require some guidance from classmates, TAs, or the professor. The location of the project directory and the downloaded files can vary from one machine to the next. They will have to use something like File Explorer for Windows or the Finder for MacOS, so there isn't a single set of instructions that will get all students' files successfully in the right place. Once the file is in the correct location, students can just click on it to open it in RStudio and start reading. Chapter 2 is all about using Quarto documents: markdown syntax, R code chunks, and inline code.

By Chapter 3, a rhythm is established that students will start to get used to:

-- Open the book online and open RStudio.
-- Install any packages in RStudio that are new to that chapter. (Not necessary for those using RStudio Workbench in a browser.)
-- Check to make sure they're are in the `intro_stats` project.
-- Click the link online to download the Quarto document.
-- Move the Quarto document from the Downloads folder to the project directory.
-- Open up the Quarto document.
-- Restart R and Run All Chunks.
-- Start reading and working.
+* Open the book online and open RStudio.
+* Install any packages in RStudio that are new to that chapter. (Not necessary for those using RStudio in a browser.)
+* Check to make sure they are in the `intro_stats` project.
+* Click the link online to download the Quarto document.
+* Move the Quarto document from the Downloads folder to the project directory.
+* Open up the Quarto document.
+* Restart R and Run All Chunks.
+* Start reading and working.
+
+When students finish each assignment, they will Restart R and Run All Chunks one last time and then "Render" the Quarto document, which will create HTML output that can then be submitted. (Hopefully, they will also take the opportunity to spell check and proofread thoroughly before submission. It's important to proofread the HTML document not just for the writing, but also to make sure that the code output and formatting all look correct.)

Chapters 3 and 4 focus on exploratory data analysis for categorical and numerical data, respectively.

Chapter 5 is a primer on data manipulation using `dplyr`.

Chapters 6 and 7 cover correlation and regression. This "early regression" approach mirrors the IMS text. (IMS eventually circles back to hypothesis testing for regression, but this book does not. That's a topic that is covered extensively in most second-semester stats classes.)

@@ -107,7 +109,7 @@ Chapters 8--11 are crucial for building the logical foundations for inference. T

Chapter 12 introduces a few more steps to the rubric for confidence intervals. As we are still using randomization to motivate inference, confidence intervals are calculated using the bootstrap approach for now.

-Once students have developed a conceptual intuition for sampling distributions using simulation, we can introduce probability models as well. Chapter 13 introduces normal models and Chapter 14 explains why they are often appropriate for modeling sampling distributions.
+Once students have developed a conceptual intuition for sampling distributions using simulation, we can introduce probability models. Chapter 13 introduces normal models and Chapter 14 explains why they are often appropriate for modeling sampling distributions.

The final chapters of the book (Chapters 15--22) are simply applications of inference in specific data settings: inference for one (Ch. 15) and two (Ch. 16) proportions, Chi-square tests for goodness-of-fit (Ch. 17) and independence (Ch. 18), inference for one mean (Ch. 19), paired data (Ch. 20), and two independent means (Ch. 21), and finally ANOVA (Ch. 22). Along the way, students learn about the chi-square, Student t, and F distributions. Although the last part of the book follows a fairly traditional parametric approach, every chapter still includes randomization and simulation to some degree so that students don't lose track of the intuition behind sampling distributions under the assumption of a null hypothesis.
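For students working on their own machines, the per-chapter routine described above boils down to a couple of Console commands plus two RStudio menu actions. A minimal sketch follows, assuming a chapter that happens to introduce the `janitor` package; the package name and the file name are placeholders for whatever a given chapter actually uses, not files or requirements shipped with the book.

```r
# One-time install of any package that is new to the chapter.
# (Skip this step on a managed server where packages are pre-installed.)
install.packages("janitor")

# Load the package in the current session; this must be repeated in every new R session.
library(janitor)

# After "Restart R and Run All Chunks" and a final proofread, clicking "Render"
# in RStudio produces the HTML file to submit. With Quarto installed, the same
# render step could also be run from a terminal (file name is a placeholder):
#   quarto render chapter.qmd
```
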
@@ -116,6 +118,6 @@ The final chapters of the book (Chapters 15--22) are simply applications of infe

I hope you enjoy the textbook. You can provide feedback two ways:

-1. The preferred method is to file an issue on the Github page: https://github.com/VectorPosse/intro_stats/issues
+1. The preferred method is to file an issue on the GitHub page: <https://github.com/VectorPosse/intro_stats/issues>

 2. Alternatively, send me an email: [sraleigh\@westminsteru.edu](mailto:sraleigh@westminsteru.edu){.email}