---
title: 'Lesson 1: Adopting principles of reproducible research'
output:
html_document: default
editor_options:
markdown:
wrap: 72
---
```{r setup_1, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(fs)
library(here)
```
## What is reproducible research?
In its simplest form, reproducible research is the principle that any
research result can be reproduced by anybody. Or, per Wikipedia: "The
term reproducible research refers to the idea that the ultimate product
of academic research is the paper along with the laboratory notebooks
and full computational environment used to produce the results in the
paper such as the code, data, etc. that can be used to reproduce the
results and create new work based on the research."
Reproducibility can be achieved when the following criteria are met
[(Marcelino
2016)](https://www.r-bloggers.com/what-is-reproducible-research/):

- All methods are fully reported
- All data and files used for the analysis are available
- The process of analyzing raw data is well reported and preserved
*But I'm not doing research for a publication, so why should I care
about reproducibility?*
- Someone else may need to run your analysis (or you may want someone
else to do the analysis so it's less work for you)
- You may want to improve on that analysis
- You will probably want to run the same exact analysis or a very
similar analysis on the same data set or a new data set in the
future
**"Everything you do, you will probably have to do over again."**
[(Noble
2009)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)
There are core practices we will cover in this lesson to help get your
code to be more reproducible and reusable:
- Adopt a style convention for coding
- Develop a standardized but easy-to-use project structure
- Enforce reproducibility when working with projects and packages
We will cover using a version control system, another practice, in the
next lesson.
## Adopt a style convention for coding
### Pipes
A common convention supported by the tidyverse that you may not be familiar with, or may not use consistently, is the almighty pipe `%>%`. The pipe allows you to chain functions together sequentially so that your code is more concise and more readable. This is a nice, intuitive [example](https://twitter.com/andrewheiss/status/1359583543509348356) from Andrew Heiss, an educator at Georgia State University:
```{r, eval = FALSE}
my_morning <- I %>%
wake_up(time = "8:00") %>%
get_out_of_bed(side = "correct") %>%
get_dressed(pants = TRUE, shirt = TRUE) %>%
leave_house(car = TRUE, bike = FALSE)
```
Contrast that with the same operations but using sequential steps without the pipes:
```{r, eval = FALSE}
my_morning <- wake_up(I, time = "8:00")
my_morning <- get_out_of_bed(my_morning, side = "correct")
my_morning <- get_dressed(my_morning, pants = TRUE, shirt = TRUE)
my_morning <- leave_house(my_morning, car = TRUE, bike = FALSE)
```
This is readable but contains quite a bit of duplicative code.
Pipes are not compatible with all functions but should work with all of
the tidyverse package functions (the magrittr package that defines the
pipe is included in the tidyverse). In general, functions expect data as
the primary argument and you can think of the pipe as feeding the data
to the function. From the perspective of coding style, the most useful
suggestion for using pipes is arguably to write the code so that each
function is on its own line. The tidyverse style guide [section on
pipes](http://style.tidyverse.org/pipes.html) is pretty helpful, which leads us to a general tool for making code readable: style guides.
### Style guides
Reading other people's code can be extremely difficult. Actually,
reading your own code is often difficult, particularly if you haven't
laid eyes on it in a long time and are trying to reconstruct what you
did.
One thing that can help is to adopt certain conventions around how your
code looks, and style guides are handy resources to help with this. We
recommend the [Tidyverse style guide](http://style.tidyverse.org/) as
the style guide that much of the R community working with Tidyverse
tools has converged to. The Tidyverse guide was originally derived from
Google's [R Style
Guide](https://google.github.io/styleguide/Rguide.xml), but since that
time Google has updated their style guide to pull from the Tidyverse
guide.
Some highlights:
- Use underscores to separate words in a name (see above comments for
  file names)
- Put a space before and after operators (such as `==`, `+`, `<-`),
  but there are a few exceptions such as `^` or `:`
- Use `<-` rather than `=` for assignment
- Try to limit code to 80 characters per line; if a function call is
  too long, use one line each for the function name, each argument, and
  the closing parenthesis
```{r, eval = FALSE}
# Good
do_something_very_complicated(
something = "that",
requires = many,
arguments = "some of which may be long"
)
# Bad
do_something_very_complicated("that", requires, many, arguments,
"some of which may be long"
)
```
### Packages supporting code style
You're not alone in your efforts to write readable code: there are multiple packages for that. We will not cover them in depth here but it is good to be aware of them:
- [styler](http://styler.r-lib.org/) is a package that allows you to
  interactively reformat a chunk of code, a file, or a directory
  - styler can function as an Addin within RStudio (look above your
    markdown window for addins already installed in your RStudio)
  - You can highlight code, apply styler via the Addins menu, and the
    code will automatically be formatted per the tidyverse style guide
- [formatR](https://yihui.org/formatr/) allows you to reformat whole
  files and directories
- [lintr](https://github.com/jimhester/lintr) checks code and provides
  output on formatting issues
So, if you have some old scripts you want to make more readable, you can
unleash styler or formatR on the file(s) to reformat them. Functionality
from lintr has been built into more recent versions of RStudio: look for
markers to the left of code chunks in the editor window.
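As a quick illustration (not run here, and the file and directory paths are just placeholders), styler can also be called programmatically rather than through the Addins menu:

```{r, eval = FALSE}
# Reformat a snippet of code per the tidyverse style guide
styler::style_text("my_var=c(1,2,3)")

# Reformat an entire file or directory in place
styler::style_file("src/analysis.R")
styler::style_dir("src")
```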
## Develop a standard project structure
In their article "Good enough practices in scientific computing", Wilson
et al. highlight useful recommendations for organizing projects [(Wilson
2017)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510):
- **Put each project in its own directory, which is named after the
project**
- Put text documents associated with the project in the doc directory
- **Put raw data and metadata in a data directory and files generated
during cleanup and analysis in a results directory**
- Put project source code in the src directory
- Put compiled programs in the bin directory
- **Name all files to reflect their content or function**
Because we are focusing on using RMarkdown, notebooks, and less complex
types of analyses, we are going to focus on the recommendations in bold
in this course. All of these practices are recommended and we encourage
everyone to read the original article to better understand motivations
behind the recommendations.
### Put each project in its own directory, which is named after the project
Putting projects into their own directories helps to ensure that
everything you need to run an analysis is in one place. That helps you
minimize manual navigation to try and tie everything together (assuming
you create the directory as a first step in the project).
What is a project? Wilson et al. suggest dividing projects based on
"overlap in data and code files." I tend to think about this question
from the perspective of output, so a project is going to be the unit of
work that creates an analysis document that will go on to wider
consumption. If I am going to create multiple documents from the same
data set, that will likely be included in the same project. It gets me
to the same place that Wilson et al. suggest, but very often you start a
project with a deliverable document in mind and then decide to branch
out or not down the road.
Now that we're thinking about creating directories for projects and
directory structure in general, let's take the opportunity to review
some basic commands and configuration related to directories in R. We
are going to use functions available in both base R as well as the [fs
package](https://github.com/r-lib/fs), which provides clearer names for
functions as well as clearer output for directories and filenames. The fs
package should have been installed if you completed the pre-course
instructions, and you can load it if needed by running `library(fs)`.
**Exercise 1**
1. Navigate to "Global Options" under the Tools menu in the RStudio
application and note the *Default working directory (when not in a
project)*
2. Navigate to your Console and get the working directory using
`getwd()`
3. If you haven't already installed the fs package (from the pre-course
instructions), do so now: `install.packages("fs")`. Then load the
package with `library(fs)` if you did not already run the set up
chunk above.
4. Review the contents of your current folder using `dir_ls()`. (Base
equivalent: `list.files()`)
5. Now try to set your working directory using `setwd("test_dir")`.
What happened?
6. Create a new test directory using `dir_create("test_dir")`. (Base
equivalent: `dir.create("test_dir")`)
7. Review your current directory
8. Set your directory to the test directory you just created
9. Using the Files window (bottom right in RStudio, click on **Files**
tab if on another tab), navigate to the test directory you just
created and list the files. *Pro tip: The More menu here has
shortcuts to set the currently displayed directory as your working
directory and to navigate to the current working directory*
10. Navigate back to one level above the directory you created using
`setwd("..")` and list the files
11. Delete the directory you created using the `dir_delete()` function.
Learn more about how to use the function by reviewing the
documentation: `?dir_delete`. (Base equivalent: `unlink()` +
additional arguments)
**End Exercise**
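If you want to rehearse these steps outside the exercise, the same workflow can be sketched with the base R equivalents, run inside a temporary directory so it leaves no trace (the directory name is arbitrary):

```{r}
# A runnable sketch of the exercise steps using base R equivalents
old_wd <- setwd(tempdir())            # setwd() returns the previous directory
dir.create("test_dir", showWarnings = FALSE)  # base equivalent of dir_create()
"test_dir" %in% list.files()          # base equivalent of dir_ls()
setwd("test_dir")                     # step into the new directory
setwd("..")                           # ".." moves up one level
unlink("test_dir", recursive = TRUE)  # base equivalent of dir_delete()
setwd(old_wd)                         # restore the original working directory
```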
The functions in the fs package include arguments and capabilities that
can be helpful for finding directories or files with names that have a
specific pattern. From our project directory, we may want to see the
files in a specific folder, without changing the directory of the
folder. We can use the path argument in the function:
```{r}
dir_ls(path = "data")
```
Another really handy argument to the `dir_ls()` function is `glob`. This
allows you to supply a "wild card" pattern to retrieve records fitting a
specific pattern. The syntax uses an asterisk to stand for any sequence
of characters, either at the beginning or end of an expression. For
example, we may only want to retrieve the Excel files from our data
directory, so we would match the file extension:
```{r}
dir_ls(path = "data", glob = "*.xlsx")
```
Or, we may be interested in only the sample csv files that are denoted
by "\_s":
```{r}
dir_ls(path = "data", glob = "*_s.csv")
```
Note that the asterisk at the beginning of the pattern followed by
characters to match against at the end requires that the text pattern be
at the very end of the string.
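For comparison, base R's `list.files()` can do the same filtering, but with a regular expression (its `pattern` argument) instead of a glob. A small self-contained sketch, using made-up file names in a temporary directory:

```{r}
# list.files() filters with a regular expression rather than a glob,
# so the glob "*.xlsx" becomes the regex "\\.xlsx$"
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("run1_s.csv", "run1_qc.csv", "report.xlsx")))

list.files(demo_dir, pattern = "\\.xlsx$")   # Excel files only
list.files(demo_dir, pattern = "_s\\.csv$")  # sample csv files only
```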
**Optional Exercise (If you do not already have a project directory)**
Now that you're warmed up with navigating through directories using R,
let's use functionality that's built into RStudio to make our
project-oriented lives easier. To enter this brave new world of project
directories, let's make a home for our projects. (Alternately, if you
already have a directory that's a home for your projects, set your
working directory there.)

1. Using the Files navigation window (bottom right, Files tab), navigate
   to your home directory or any directory you'd like to place your
   future RStudio projects
2. Create a "Projects" directory
3. Set your directory to the "Projects" directory
```{r, eval = FALSE}
dir_create("Projects")
setwd("Projects")
```
Alternately, you can do the above steps within your operating system
(eg. on a Mac, open Finder window and create a folder) or if you are
comfortable working at the command line, you can make a directory there.
In the newest version of RStudio (version 1.1), you have the option of
opening up a command line prompt under the Terminal tab (on the left
side, next to the Console tab).
**End Exercise**
**Exercise 2**
Let's start a new project:
1. Navigate to the **File** menu and select **New Project...** OR
Select the **Create a project** button on the global toolbar (2nd
from the left)
2. Select **New Directory** option
3. In the Project Type prompt, select **New Project**
4. In the Directory Name prompt under Create New Project, enter
"sample-project-structure"
5. In the Create Project as a Subdirectory of prompt under Create New
Project, navigate to the Projects folder you just created (or
another directory of your choosing). You can type in the path or hit
the **Browse** button to find the directory. Check the option for
"Open in a new session" and create your project.
**End Exercise**
So, what exactly does creating a Project in RStudio do for you? In a
nutshell, using these Projects allows you to drop what you're doing,
close RStudio, and then open the Project to pick up where you left off.
Your data, history, settings, open tabs, etc. will be saved for you
automatically.
Does using a RStudio Project allow someone else to pick up your code and
just use it? Or let you come back to a Project 1 year later and have
everything work magically? Not by itself, but with a few more tricks you
will be able to more easily re-run or share your code.
### Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory
Before we broke up with Excel, it was standard operating procedure to
perform our calculations and data manipulations in the same place that
our data lived. This is not necessarily incompatible with
reproducibility, if we have very careful workflows and make creative use
of macros. However, once you have modified your original input file, it
may be non-trivial to review what you actually did to your original raw
data (particularly if you did not save it as a separate file). Moreover,
Excel generally lends itself to non-repeatable manual data manipulation
that can take extensive detective work to piece together.
Using R alone will not necessarily save you from these patterns; they
just take a different form. Instead of clicking around, dragging, and
entering formulas, you might find yourself throwing different functions
at your data in a different order each time you open up R. While it
takes some effort to overwrite your original data file in R, other
non-ideal patterns of file management that are common in Excel-land can
creep up on you if you're not careful.
One solution to help avoid these issues is to maintain the separation of
church and state (to use a poor analogy): explicitly organize your
analysis so that raw data lives in one directory (the *data*
directory) and the results of running your R code are placed in another
directory (eg. *results* or *output*). You can take this concept a
little further and include other directories within your project folder
to better organize work such as *figures*, *documents* (for
manuscripts), or *processed_data*/*munge* (if you want to create
intermediate data sets). You have a lot of flexibility and there are
multiple resources that provide some guidance [(Parzakonis
2017)](https://statsravingmad.com/measure/sample-r-project-structure/),
[(Muller
2017)](http://blog.jom.link/implementation_basic_reproductible_workflow.html),
[(Software Carpentry
2016)](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/).
**Exercise 3**
Be sure to work within the RStudio window that contains your
"sample-project-structure" project. Refer to the top right of the window
and you should see the project name displayed there. Let's go ahead and
create a minimal project structure by running the following code within
the console:
```{r, eval = FALSE}
library(fs)
dir_create("data") # raw data
dir_create("output") # output from analysis
dir_create("cache") # intermediate data (after processing raw data)
dir_create("src") # code goes into this folder
```
This is a bare bones structure that should work for future projects you
create. Refer to the content below if you decide you want to adopt a
standard directory structure for your projects on top of using RStudio
Projects.
Keep this project open in a separate window for now. We will revisit it
as we learn about version control.
**End Exercise**
*Further exploration/tools for creating projects:*
The directory creation code in the above exercise can be packaged into a
function that creates the folder structure for you (either within or
outside of a project). Software Carpentry has a nice refresher on
writing functions:
<https://swcarpentry.github.io/r-novice-inflammation/02-func-R/>.
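As a sketch of that idea, here is a minimal base R function (the function name is our own) that recreates the Exercise 3 skeleton wherever you point it:

```{r}
# Build the standard directory skeleton inside a given project folder
create_project_skeleton <- function(root = ".") {
  dirs <- c("data", "output", "cache", "src")
  for (d in dirs) {
    dir.create(file.path(root, d), showWarnings = FALSE, recursive = TRUE)
  }
  invisible(file.path(root, dirs))
}

# Example: build the skeleton in a temporary folder
demo_root <- file.path(tempdir(), "demo-project")
create_project_skeleton(demo_root)
list.files(demo_root)
```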
There is also a dedicated
[ProjectTemplate](http://projecttemplate.net/index.html) package with a
nice "minimal project layout" that can be a good starting point if you
want R to do more of the work for you. This package duplicates some
functionality that an RStudio Project provides, so you probably want to
run it outside of an RStudio Project, but it is a good tool to be aware
of.
### Name all files (and variables) to reflect their content or function
This concept is pretty straightforward: assume someone else will be
working with your code and analysis and won't intuitively understand
cryptic names. Rather than output such as results.csv, a file name of
morphine_precision_results.csv offers more insight. [Wilson et
al.](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)
make the good point that using sequential numbers will come back to bite
you as your project evolves: for example, "figure_2.txt" for a
manuscript may eventually become "figure_3.txt". We'll get into it in
the next section, but the final guidance with regard to file names is to
use a style convention for file naming that makes it easier to read
names and manipulate files in R. One common issue is whitespace in file
names: it can be annoying when writing out file names in scripts, so
underscores are preferable. Another issue is the use of capital letters:
all-lowercase names are easier to write out. As an example, rather than
"Opiate Analysis.csv", the preferred name might be
"opiate_analysis.csv".
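A tiny helper (the function name is our own invention) shows how mechanical this convention can be:

```{r}
# Lowercase a file name and replace runs of whitespace with underscores
tidy_file_name <- function(x) {
  gsub("\\s+", "_", tolower(x))
}

tidy_file_name("Opiate Analysis.csv")  # "opiate_analysis.csv"
```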
## Enforce reproducibility of the directories and packages
### Scenario 1: Sharing your project with a colleague
Let's think about a happy time a couple months from now. You've
completed this R course, have learned some new tricks, and you have
written an analysis of your mass spec data, bundled as a nice project in
a directory named "mass_spec_analysis". You're very proud of the
analysis you've written and your colleague wants to run the analysis on
similar data. You send them your analysis project (the whole directory)
and when they run it they immediately get the following error when
trying to load the data file with the `read.csv("file.csv")` command:
**Error in file(file, "rt") : cannot open the connection**
**In addition: Warning message:**
**In file(file, "rt") :**
**cannot open file 'file.csv': No such file or directory**
Hmmm, R can't find the file, even though you set the working directory
for your folder using
`setwd("/Users/username/path/to/mass_spec_analysis")`.
What is the problem? Setting your working directory is actually the
problem here, because it is almost guaranteed that the path to a
directory on your computer does not match the path to the directory on
another computer. That path may not even work on your own computer a
couple years from now!
Fear not, there is a package for that! The [here](https://cran.r-project.org/web/packages/here/index.html) package is a
helpful way to "anchor" your project to a directory without setting your
working directory. The here package uses a pretty straightforward syntax
to help you point to the file you want. In the example above, where
file.csv is a data file in the root directory (I know, not ideal
practice per our discussion on project structure above), then you can
reference the file using `here("file.csv")`, where `here()` indicates
the current directory. So reading the file could be accomplished with
`read.csv(here("file.csv"))`, and it could be run by anyone you share
the project with.
The here package couples well with an RStudio Project because here uses
an algorithm that determines the top-level directory by looking for
specific files:

- Creating an RStudio Project creates an .Rproj file that tells here
  which directory is the project top-level directory
- If you don't create a Project in RStudio, you can create an empty file
  named .here in the top-level directory to tell here where to go
- There are a variety of other file types the package looks for
  (including a .git file, which is generated if you have a project on
  GitHub)
I encourage you to read the following post by Jenny Bryan that includes
her strong opinions about setting your working directory:
[Project-oriented workflow](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/).
Moral of the story: avoid using `setwd()` and complicated paths to your
file - use `here()` instead!
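To demystify that algorithm, here is a toy base R sketch of the same idea: walk up the directory tree until a marker file is found. This is an illustration of the concept, not the here package's actual implementation:

```{r}
# Walk upward from `start` until a directory contains a marker file
# (e.g. an .Rproj file or an empty .here file)
find_root <- function(start = getwd(),
                      markers = c("\\.Rproj$", "^\\.here$")) {
  dir <- normalizePath(start)
  repeat {
    hits <- unlist(lapply(markers, function(m) {
      list.files(dir, pattern = m, all.files = TRUE)
    }))
    if (length(hits) > 0) return(dir)     # found a marker: this is the root
    parent <- dirname(dir)
    if (parent == dir) return(NA_character_)  # reached the filesystem root
    dir <- parent
  }
}

# Example: create a fake project with a .here marker and find the root
# from a subfolder
proj <- file.path(tempdir(), "fake-project")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(proj, ".here"))
find_root(start = file.path(proj, "data"))
```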
### Scenario 2: Running your 2018 code in 2019
Now imagine you've written a nice analysis for your mass spec data but
let it sit on the shelf for 6 months or a year. In the meantime, you've
updated R and your packages multiple times. You rerun your analysis on
the same old data set and either (a) one or more lines of code no longer
works or (b) the output of your analysis is different than the first
time you ran it. Very often these problems arise because one or more of
the packages you use in your code have been updated since the first time
you ran your analysis. Sometimes package updates change the input or
output that specific functions expect or produce, or alter the behavior
of packages in unexpected ways. These problems also arise when sharing
code with colleagues because different users may have different versions
of packages installed.
Why do we run into this problem? Packages are installed in directories
called library paths. Whenever you load a package, R searches for it in
your library paths. You may have multiple library paths: by default
there is typically a system library, but you may have a user library as
well. You can see your library paths with the `.libPaths()` function. R
goes through the library paths in order, so it may look for user
packages before moving to system packages. Ultimately, R defaults to
using the first version of a specific package it finds as it moves
through the library paths. If you have a script or notebook that was
developed with a different version of a package than the one R defaults
to, you can end up with different output than expected or an analysis
that does not work. Packages developed by the RStudio team, like the
tidyverse group of packages, are more likely to have significant testing
and to avoid breaking changes. However, this is not a guarantee.
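You can inspect this search order yourself from the console:

```{r}
# The library paths R searches, in order
.libPaths()

# Where a specific package was found, and which version R will load
# (the stats package ships with base R, so this works on any install)
find.package("stats")
packageVersion("stats")
```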

The generalized way to avoid this issue is to manage packages on a project by project basis: instead of R using the default version of a package it finds, tie specific versions of packages to a project. There are a couple options for accomplishing this.

#### Option 1: checkpoint
Arguably the most lightweight solution to this problem is the [checkpoint](https://cran.r-project.org/web/packages/checkpoint/index.html) package.
The basic premise behind checkpoint is that it allows you to use a
package as it existed on a specific date. There is a daily snapshot of
all packages on CRAN (the R package repository), dating back to
2014-09-17. By using checkpoint you can be confident that the version of
the package you reference in your code is the same version that anyone
else running your code will be using.
The behavior of checkpoint makes it complicated to test out in this
section: the package is tied to a project and by default searches for
every package called within your project (via `library()` or
`require()`).
The checkpoint package is very helpful in writing reproducible analyses,
but there are some limitations/considerations with using it:
- retrieving and installing packages adds to the amount of time it takes to run your analysis
- package updates over time may fix bugs so changes in output may be more accurate
- checkpoint is tied to projects, so alternate structures that don't use projects may not be able to utilize the package
#### Option 2: renv
There is another solution to this problem that has tighter integration with RStudio: the [renv](https://rstudio.github.io/renv/articles/renv.html) package. renv allows you to maintain specific package versions on a project by project level. Unlike checkpoint, package versions are not determined by date. Instead, you initialize a project and use renv functions to detect whatever packages you've loaded, along with their version information, and generate a file with that data so that the environment can be replicated easily on another system or by someone else.
- The functionality is initialized with the `renv::init()` function, which captures the state of your default R libraries into a project-local library that future R sessions will use when they open the project
- The `renv::snapshot()` function records the packages and versions currently in use to a lockfile, so run it again after you install or update packages
- The `renv::restore()` function reinstalls packages in a project to match the versions recorded in the lockfile
There is a nice summary of renv here: <https://kevinushey-2020-rstudio-conf.netlify.app/slides.html#1>.
Note: the renv package was developed as a more stable solution than its predecessor [packrat](https://cran.r-project.org/web/packages/packrat/index.html).
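Putting those functions together, a typical renv session might look like the following sketch (not run here; it requires the renv package and an open project):

```{r, eval = FALSE}
renv::init()      # set up a project-local library for this project
# ...install packages and develop your analysis...
renv::snapshot()  # record the package versions in use to renv.lock
# Later, or on a collaborator's machine:
renv::restore()   # reinstall the exact versions recorded in renv.lock
```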
Either approach to package management will work - the important point
here is to be proactive about how you manage your packages, especially
if you know your code will be used over and over again in the future.
## Summary
- Reproducible research is the principle that any research result can
be reproduced by anybody
- Practices in reproducible research also offer benefits to the
code author in producing clearer, easier-to-understand code and
being able to easily repeat past work
- Important practices in reproducible research include:
- Developing a standardized but easy-to-use project structure
- Adopting a style convention for coding
- Enforcing reproducibility when working with projects and
packages