Skip to content

Introduction to R: June 2, 2021

Saranya Canchi edited this page Jun 2, 2021 · 3 revisions

When: Wednesday June 2, 2021 10am-noon PDT/ 1-3pm EDT

Co-Instructors: Saranya Canchi, Marisa Lim

Moderator: Abhijna Parigi

Helpers: Jose Sanchez, Jeremy Walter

Description:

This free two hour workshop will provide an overview of the basics of R programming language along with RStudio which is a user-friendly environment for working with R. You will be introduced to the syntax, variables, functions, packages along with the various data structures in R. You will also learn basics of data wrangling from reading data from files, subsetting, merging and exporting data out of the R environment.

While we wait to get started --

  1. ✔️ Have you checked out the pre-workshop resources page?
  2. 📝 Please fill out our pre-workshop survey if you have not already done so! Click the survey link here
  3. 💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button: Binder

Introduction

We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.

Have you heard of the NIH Common Fund Data Ecosystem?

Put up a ✔️ for yes and a ❎ for no!

You can contact us at [email protected]

How to ask questions

If you have questions at any point,

  • Drop them in the chat, or
  • Direct messages to the moderator or helpers are welcome, or
  • Unmute yourself and ask during the workshop

We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.

Some background

What is R?

R is a statistical computing and data visualization programming language.

What is RStudio?

R is often used via integrated development environments (IDE), and RStudio is probably the most popular option. R and Rstudio work on Mac, Linux, and Windows operating systems. The Rstudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface.

  • RStudio panels
  • Commonly used options in the panels
    • View file system, help docs
    • View/load/manage R package list
    • View/save plots
  • Customizing layout, etc. --> let's increase the font size

Data types and structures

For this tutorial, we are using sequencing data from a European Nucleotide Archive dataset (PRJEB5348). This dataset is for an RNA-Seq experiment comparing two yeast strains, SNF2 and WT. They sequenced 96 samples in 7 different lanes, with 45 wt and 45 mut strains. Thus, there are 45 biological replicates and 7 technical replicates present, with a total of 672 samples! The datasets were compared to test the underlying statistical distribution of read counts in RNA-Seq data, and to test which differential expression programs work best.

Examine RNA quality and quantity

We will examine the information about the RNA quality and quantity that the authors measured for all their samples.

We are going to explore those datasets using RStudio, combining both base R and a package called dplyr.

Some Useful Info 🏝️

  • Nucleic Acid Conc. = RNA concentration in the sample in units of nanogram per microliter (ng/ul)
  • 260/280 = ratio measured with a spectrophotometer (i.e., Nanodrop) to assess RNA sample purity. For RNA, the ratio should be ~2.0
  • Total RNA = total RNA molecular weight in the sample in units of microgram (ugm)
  • General information about RNA quality for RNA-Seq

📓 Please note!

The Rstudio binder comes pre-loaded with R and the R packages we're using in today's workshop. On your local computer, you need to install R, Rstudio, and the R packages before using them:

# install
install.packages('ggplot2')
install.packages('readr')
install.packages('dplyr')

Alternatively, all of these packages belong to the tidyverse and can be installed along with several other useful packages with one installation command. However, we are not doing this today because it will be very slow.

# install.packages('tidyverse')

Load in R packages with the library() function:

library(ggplot2)
library(readr)
library(dplyr)

The notes about masked objects mean certain R packages have used the same function name (i.e. both the stats and dplyr have their own functions called filter - check with ?stats::filter and ?dplyr::filter)

# are used for notes in scripts or "commented out" lines of code that we don't want to run

The data file we're using today is stored in an online repository (https://osf.io/pzs7g/) and it is a tab-delimited .tsv file. To read in the dataset in R, we are using the read_tsv() function from the readr R package.

A few notes about R syntax -- The <- is called an assignment operator and it is an R convention for assigning values to variable names (i.e., in Python it is =), here we assign the input data table values to the variable experiment_info.

Functions like read_tsv specify required and optional inputs within ( ).

experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')

If you run the read_tsv() function before loading the readr R package, R will give you an error message that the function can't be found.

You will see a warning message, let's talk about what that means:

Warning: 3 parsing failures.
row col   expected    actual                             file
  2  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
  3  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'
  4  -- 12 columns 9 columns 'https://osf.io/pzs7g/download/'

Warning message:
Missing column names filled in: 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12] 

Now let's look at the data!

# how big is the data set ?
dim(experiment_info)

# examine the first 6 rows across all columns of the data
head(experiment_info)

Tibbles - "a modern reimagining of the data.frame"

Tibbles are another flavor of data frame that are introduced as part of the tidyverse packages i.e. dplyr in this case to enable ease of data manipulation.

Instructor Switch!

There are six data types in R:

Data Type Description Examples
Numeric numerical values 1, 1.5, 20, pi
Character letters or words, enclosed in quotes “random”, “2”, “TRUE”, “@”, “#”
Integer for integer numbers; the L indicates to R that it’s an integer) 2L, 500L, -17L
Logical for boolean type data TRUE, FALSE, T, F
Complex numbers with real and imaginary parts 4+ 3i
Raw contains raw bytes encoding
# what type of data structure is this ?
class(experiment_info) 

There are other useful commands to get attribute information about a dataset.

# how many number of rows ?
nrow(experiment_info)

# how many number of columns ?
ncol(experiment_info)

# what are the column names of your dataset ?
colnames(experiment_info)

# what are the rownames of your dataset ?
rownames(experiment_info)

# what does the last few lines of my dataset look like ?
tail(experiment_info)

# what is the structure of my dataset ?
str(experiment_info)

data.frame is tabular data similar to an excel spreadsheet which can contain mixed data types.

vectors are uni-dimensional data structures with uniform data types.

Let's create a vector from the table we have.

a260_vector <- experiment_info$A260

# confirm the data structure
class(a260_vector)

# get the data type of the vector
typeof(a260_vector)

We can access individual elements within the vector using it's position. Indices start at 1 and use the [] notation.

# obtain the fifth element in the vector
a260_vector[5]

# subset the fifth and tenth element
a260_vector[c(5,10)]

# subset the fifth through tenth elements
a260_vector[c(5:10)]

Exercise 1

Generate a vector containing the yeast strain.

Solution
yeast_strain_vector <- experiment_info$Yeast Strain

# The above code with result in error because R cannot parse spaces in column name. We need to add back command around the name of the column.

yeast_strain_vector <- experiment_info$`Yeast Strain`

Alternatively, you could also use the index of the column:

# dataframe[row number, column number]
# subset based on column index
yeast_strain_vector <- experiment_info[,1]

Another, two dimensional data structure is the matrix. The major difference between matrix and dataframe is that matrix can have only one type of data type i.e. either character or numeric but not both!

Let's create a matrix containing the Nucleic Acid Conc., 260/280 ratio and Total RNA values.

# Create a matrix by subsetting the data frame by selecting the appropriate columns by column names
yeast_strain_matrix <- data.matrix(experiment_info[,c("Nucleic Acid Conc.",
                                                      "260/280",
                                                      "Total RNA")])
# View the data
head(yeast_strain_matrix )

Data Wrangling with dplyr

Why might you need to wrangle data?

  • Real world data is messy and needs some formatting before you can use it for analysis.
  • Often, you are only interested in a subset of the data and want to work with it without loading the entire set.
  • Some programs/ visualizations require data to be in a specific format.

There are many ways to wrangle data. We will show you how to use an R package called dplyr that contains sets of tools for dataframe manipulation.

For this workshop we will cover a few of them:

dplyr verb Description
select() subset columns
filter() subset rows matching some condition
mutate() create new columns using information from other columns
arrange() sort results
count() count discrete values
group_by() group data for analysis or plotting
summarize() create summary statistics

select()

There are some empty columns in the experiment data file we have been working with. Let's create a subset of data without those empty columns using the select function.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            Yeast Strain, 
                            Nucleic Acid Conc., 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            Total RNA)

# As before, since R cannot parse spaces in column names, we need to enclose them in backticks to indicate that these words belong together.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            `Total RNA`)

# As a general rule, it is best to avoid column names that start with a number; we can use backticks for this column name.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            `260/280`,
                            `Total RNA`)

# Check the dimensions of the subsetted dataframe
dim(experiment_info_cleaned)

You can also use select to specify the columns you don't want. This can be done with the - notation.

# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)

Exercise 2

Select the columns Sample, Yeast Strain, A260 and A280 and assign them to a new object called "experiment_data".

Solution
experiment_data <-  select(experiment_info_cleaned, Sample, `Yeast Strain`, A260, A280)
# Check the data
head(experiment_data)

filter()

We learned to subset entire columns. Using the filter function, we can subset based on conditions, using a logical vector with TRUE and FALSE, or with operators.

Conditional Subsetting Description
> greater than the value
>= greater than or equal to the value
< less than the value
<= less than or equal to the value
| applies logical operator OR
& applies logical operator AND
== equal to the value
%in% operator used to identify if the value belongs to a vector
!= not equal to the value

The filter function works on rows, so we can use it to subset data based on conditions in a column. Here's how you would use the function:

filter(experiment_info_cleaned, `Nucleic Acid Conc.` > 1500)
filter(experiment_info_cleaned, `A260/A280` >= 2.0 & `Nucleic Acid Conc.` > 1500)

Exercise 3

Create a new object called wt_high_conc that contains data from WT strains and contain nucleic acid concentration of more than 1500 ng/uL.

Solution
wt_high_conc<- filter(experiment_info_cleaned, `Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500)

# Check the data
head(wt_high_conc)

Pipes

So far, we have applied the actions individually. What if we wanted to apply select and filter at the same time?

We can use the concept of a pipe which is represented by %>% in R. This is equivalent to | in shell.

Pipes are available via the magrittr package, installed automatically with dplyr. If you use RStudio, the shortcut to produce the pipe is:

  • Ctrl + Shift + M on a PC or
  • Cmd + Shift + M on a Mac

Here's how you would use it:

experiment_info_wt <- experiment_info_cleaned %>% 
  filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>% 
  select(Sample, `Yeast Strain`, A260, A280)

(when using pipes, the order of the command matters!)

mutate()

mutate can be used to create new columns in our dataframe using existing data.

mutate(experiment_info_cleaned, 
concentration_ug_uL = `Nucleic Acid Conc.` / 1000) 

Exercise 4 (Bonus)

Create a new table called library_start that includes the columns sample, yeast strain and two new columns called RNA_100 with the calculation of microliters to have 100ng of RNA and another column called water that says how many microliters of water we need to add to that to reach 50uL.

Solution
# create data object
library_start <- experiment_info_cleaned %>% 
  mutate(RNA_100 = 100/ `Nucleic Acid Conc.`,
          water = 50 - RNA_100) %>% 
  select(Sample, `Yeast Strain`, RNA_100, water) 

# check the data
head(library_start)

Split-apply-combine

The approach of split-apply-combine allows you to split the data into groups, apply a function/analysis and then combine the results. We can do this using group_by() and summarize()

group_by() takes as argument the column name/s that contain categorical variables that can be used to group the data.

experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
  
# summarize using more than one column
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`))

Next, we can use arrange() to sort the data in either ascending (defalt order) or descending order.

# arrange new table in ascending mean concentrations
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(mean_concentration) 
  
# arrange new table in descending mean concentrations 
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
  mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(desc(mean_concentration)) 

We can also examine the number of categorical values using count()

experiment_info_cleaned %>% 
  count(`Yeast Strain`)

Exercise 5 (Bonus)

Calculate the mean value for A260/A280 for each of the yeast strains. Can these sample be used for downstream analysis ?

Solution
# calculate mean values
experiment_info_cleaned %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_ratio = mean(`260/280`))

Instructor switch!

Plotting with ggplot

ggplot and its related packages are a flexible and widely-used plotting R package. You can make standard plots (i.e., scatter, histogram, box, bar, density plots), as well as a variety of other plot types like maps and phylogenetic trees!

It's a very helpful tool for exploring your data/results and generating publication-ready figures.

We're going to create some plots with the tibble dataframe from the previous section. To make it easier to specify column names, let's rename to remove spaces (you could also specify the column names using backticks as we did above):

# current column names
colnames(experiment_info_cleaned)
# new names specified with a vector
colnames(experiment_info_cleaned) <- c('Sample', 'Yeast_strain', 'Nucleic_Acid_Conc.',
                               'ng/ul', 'A260', 'A280', 'A260_280',
                               'Total_RNA', 'ugm')

The ggplot syntax always begins with a function that specifies the data and then layers on additional functions to specify a whole bunch of customizable plot aesthetics. Here, we specify the data object as the experiment_info_cleaned tibble and set the axes with the mapping flag.

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) 

Rstudio has some really helpful features for finding help docs, and for autofilling. Use the tab key to autofill variable or R package function names for example. Find help docs about functions or R packages with the ? key:

# help docs for ggplot() function
? ggplot 

# help docs for ggplot2 R package
? ggplot2

Let's make a scatter plot by adding geom_point() with a + to the ggplot() function:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
  geom_point() +
  ggtitle('Scatter plot')

Notice the indentation after the + sign.

Add transparency to points (also works for other plot types!) with the alpha option (0=completely transparent, 1=solid color) and change point size with size:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc.)) +
         geom_point(alpha=0.6, size=4) +
         ggtitle('Scatter with transparent points')

We can modify the code to fill points by yeast strain and make point size correspond to the 260/280 ratio:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., fill=Yeast_strain,
                   size=A260_280)) +
  geom_point(alpha=0.6, pch=21, col='white') +
  ggtitle('Scatter: color by strain \n and size by 260/280 ratio')

There are 2 strains in this dataset, WT and snf2.

unique(experiment_info_cleaned$Yeast_strain)

Let's make a boxplot by strain:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain, y=Total_RNA)) +
  geom_boxplot() +
  ggtitle('Boxplot')

Boxplots can be more informative if we add points to see the distribution of the data:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain, y=Total_RNA)) +
  geom_point(alpha=0.6) +
  geom_boxplot() +
  ggtitle('Boxplot w/ points')

Exercise 6

  • What happens if you reverse the order of the geom_point() and geom_boxplot() functions for the boxplot code above?
Solution

Above, the points are plotted first, then the boxplot. If we reverse the order, the points will be plotted on top of the boxplot.

Exercise 7

Solution

geom_jitter() 'jitters' the points so they are not overlapping as much.

More modifications and plot types!

  • Modifying axis label text: xlab() and ylab()
  • Facets: facet_wrap() or facet_grid()
ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA, y=Nucleic_Acid_Conc., col=Yeast_strain)) +
  geom_point() +
  facet_wrap(~Yeast_strain) +
  xlab('Total RNA (ugm)') +
  ylab('Nucleic Acid Conc. (ng/ul)') +
  ggtitle('Facet scatter plot')

Exercise 8

  • Try making a histogram (geom_histogram()) and a density plot (geom_density()) of the A260_280 ratio values. (Hint: a histogram plots counts on the y-axis so you only need to specify the x-axis variable in the ggplot() function.)
Solutions

Here is the basic code - you may need to modify the bin size:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280))+
      geom_histogram(bins=20) +
      ggtitle('Histogram')

You could add colors and plot by strain type like this:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      ggtitle('Histogram by strain')

And we could facet the plot by strain like this:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      facet_grid(Yeast_strain~.) + #this sets whether the facet is horizontal or vertical. here, we will get 2 rows of histograms by yeast strain
      ggtitle('Facet histogram')

One more edit! Let's change the colors on the histogram. There are many built-in color palettes! Check more out here:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_histogram(bins=20, col='black') +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) + # Add blue and yellow colors that are more colorblind friendly for plotting
      ggtitle('Histogram w/ custom colors')

Now, let's switch this to a density plot and change the plot background

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_density(alpha=0.6) +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
      theme_bw() +
      ggtitle('Density plot')

Hopefully this exercise demonstrates how flexible ggplot is!

Save plots

Use the ggsave() function to save images (variety of file options available), for example:

myplot <- ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_density(alpha=0.6) +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
      theme_bw() +
      ggtitle('Density plot')
ggsave(filename='mydensityplot.jpg', plot=myplot, height=4, width=4, units='in', dpi=600)

There are many more options to edit plot aesthetics and there are lots of excellent tutorials and resources, including:

Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!

For example, select a file, click "More", click "Export":

Wrap up

📝 Please fill out our post-workshop survey! We use the feedback to help improve our training materials and future workshops. Click here for survey!

Questions?

Acknowledgements: This lesson was inspired by material from 2019 ANGUS Intro to R lesson

Resources

Workshop FAQs

Q: I am receiving this error when I run the code error: could not find function <name of the function>

A: This generally indicates that the package containing the function or command you are trying to use is not loaded into. You can fix this by running this code: library(<name of the package>) in the R Console.

Q: How is the class() function different from typeof()?

A: class() in R informs you of the type of the data type in your dataset while typeof() determines the (R internal) type or storage mode of any object. You can find useful discussion on this stack exchange thread.

Q: What is the advantage of using <- as an assignment operator, instead of =?

A: It is a choice made by the creators of R about the different usage. But typically you use <- for assignment and = within functions for arguments.

Q: Can you specify a column name to rename?

A: You can use the column index to rename a specific column. For example colnames(experiment_info_cleaned)[2] <- “yeast_strain”

Clone this wiki locally