---
title: 'Lesson 1: Adopting principles of reproducible research'
output:
html_document: default
editor_options:
markdown:
wrap: 72
---
```{r setup_1, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(fs)
library(here)
```
## What is reproducible research?
In its simplest form, reproducible research is the principle that any
research result can be reproduced by anybody. Or, per Wikipedia: "The
term reproducible research refers to the idea that the ultimate product
of academic research is the paper along with the laboratory notebooks
and full computational environment used to produce the results in the
paper such as the code, data, etc. that can be used to reproduce the
results and create new work based on the research."
Reproducibility can be achieved when the following criteria are met
[(Marcelino
2016)](https://www.r-bloggers.com/what-is-reproducible-research/):

- All methods are fully reported
- All data and files used for the analysis are available
- The process of analyzing raw data is well reported and preserved
*But I'm not doing research for a publication, so why should I care
about reproducibility?*
- Someone else may need to run your analysis (or you may want someone
else to do the analysis so it's less work for you)
- You may want to improve on that analysis
- You will probably want to run the same exact analysis or a very
similar analysis on the same data set or a new data set in the
future
**"Everything you do, you will probably have to do over again."**
[(Noble
2009)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)
There are core practices we will cover in this lesson to help get your
code to be more reproducible and reusable:
- Adopt a style convention for coding
- Develop a standardized but easy-to-use project structure
- Enforce reproducibility when working with projects and packages
We will cover using a version control system, another practice, in the
next lesson.
## Adopt a style convention for coding
### Pipes
A common convention supported by the tidyverse that you may not be familiar with, or may not use consistently, is the almighty pipe `%>%`. The pipe allows you to chain functions together sequentially so that your code is more concise and more readable. This is a nice, intuitive [example](https://twitter.com/andrewheiss/status/1359583543509348356) from Andrew Heiss, an educator at Georgia State University:
```{r, eval = FALSE}
my_morning <- I %>%
wake_up(time = "8:00") %>%
get_out_of_bed(side = "correct") %>%
get_dressed(pants = TRUE, shirt = TRUE) %>%
leave_house(car = TRUE, bike = FALSE)
```
Contrast that with the same operations but using sequential steps without the pipes:
```{r, eval = FALSE}
my_morning <- wake_up(I, time = "8:00")
my_morning <- get_out_of_bed(my_morning, side = "correct")
my_morning <- get_dressed(my_morning, pants = TRUE, shirt = TRUE)
my_morning <- leave_house(my_morning, car = TRUE, bike = FALSE)
```
This is readable but contains quite a bit of duplicative code.
Pipes are not compatible with all functions but should work with all of
the tidyverse package functions (the magrittr package that defines the
pipe is included in the tidyverse). In general, functions expect data as
the primary argument and you can think of the pipe as feeding the data
to the function. From the perspective of coding style, the most useful
suggestion for using pipes is arguably to write the code so that each
function is on its own line. The tidyverse style guide [section on
pipes](http://style.tidyverse.org/pipes.html) is pretty helpful, which leads us to a general tool for making code readable: style guides.
### Style guides
Reading other people's code can be extremely difficult. Actually,
reading your own code is often difficult, particularly if you haven't
laid eyes on it in a long time and are trying to reconstruct what you
did.
One thing that can help is to adopt certain conventions around how your
code looks, and style guides are handy resources to help with this. We
recommend the [Tidyverse style guide](http://style.tidyverse.org/) as
the style guide that much of the R community working with Tidyverse
tools has converged to. The Tidyverse guide was originally derived from
Google's [R Style
Guide](https://google.github.io/styleguide/Rguide.xml), but since that
time Google has updated their style guide to pull from the Tidyverse
guide.
Some highlights:
- Use underscores to separate words in a name (see above comments for
  file names)
- Put a space before and after operators (such as `==`, `+`, `<-`),
  but there are a few exceptions such as `^` or `:`
- Use `<-` rather than `=` for assignment
- Try to limit code to 80 characters per line; if a function call is
  too long, use one line each for the function name, each argument, and
  the closing parenthesis
```{r, eval = FALSE}
# Good
do_something_very_complicated(
something = "that",
requires = many,
arguments = "some of which may be long"
)
# Bad
do_something_very_complicated("that", requires, many, arguments,
"some of which may be long"
)
```
### Packages supporting code style
You're not alone in your efforts to write readable code: there are multiple packages for that. We will not cover them in depth here but it is good to be aware of them:
- [styler](http://styler.r-lib.org/) is a package that allows you to
  interactively reformat a chunk of code, a file, or a directory
  - styler can function as an Addin within RStudio (look above your
    markdown window for addins already installed in your RStudio)
  - You can highlight code, apply styler via the Addins menu, and the
    code will automatically be formatted per the tidyverse style guide
- [formatR](https://yihui.org/formatr/) allows you to reformat whole
  files and directories
- [lintr](https://github.com/jimhester/lintr) checks code and provides
  output on formatting issues
So, if you have some old scripts you want to make more readable, you can
unleash styler or formatR on the file(s) to reformat them. Functionality
from lintr has been built into more recent versions of RStudio: look for
markers to the left of code chunks in the editor window.
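As a quick illustration (not run here, and the file and directory paths are just placeholders), styler can also be called programmatically rather than through the Addins menu:

```{r, eval = FALSE}
# Reformat a snippet of code per the tidyverse style guide
styler::style_text("my_var=c(1,2,3)")

# Reformat an entire file or directory in place
styler::style_file("src/analysis.R")
styler::style_dir("src")
```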
## Develop a standard project structure
In their article "Good enough practices in scientific computing", Wilson
et al. highlight useful recommendations for organizing projects [(Wilson
2017)](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510):
- **Put each project in its own directory, which is named after the
project**
- Put text documents associated with the project in the doc directory
- **Put raw data and metadata in a data directory and files generated
during cleanup and analysis in a results directory**
- Put project source code in the src directory
- Put compiled programs in the bin directory
- **Name all files to reflect their content or function**
Because we are focusing on using RMarkdown, notebooks, and less complex
types of analyses, we are going to focus on the recommendations in bold
in this course. All of these practices are recommended and we encourage
everyone to read the original article to better understand motivations
behind the recommendations.
### Put each project in its own directory, which is named after the project
Putting projects into their own directories helps to ensure that
everything you need to run an analysis is in one place. That helps you
minimize manual navigation to try and tie everything together (assuming
you create the directory as a first step in the project).
What is a project? Wilson et al. suggest dividing projects based on
"overlap in data and code files." I tend to think about this question
from the perspective of output, so a project is going to be the unit of
work that creates an analysis document that will go on to wider
consumption. If I am going to create multiple documents from the same
data set, that will likely be included in the same project. It gets me
to the same place that Wilson et al. suggest, but very often you start a
project with a deliverable document in mind and then decide to branch
out or not down the road.
Now that we're thinking about creating directories for projects and
directory structure in general, let's take the opportunity to review
some basic commands and configuration related to directories in R. We
are going to use functions available in both base R as well as the [fs
package](https://github.com/r-lib/fs), which provides clearer names for
functions as well as clearer output for directories and filenames. The fs
package should have been installed if you completed the pre-course
instructions, and you can load it if needed by running `library(fs)`.
**Exercise 1**
1. Navigate to "Global Options" under the Tools menu in the RStudio
application and note the *Default working directory (when not in a
project)*
2. Navigate to your Console and get the working directory using
`getwd()`
3. If you haven't already installed the fs package (from the pre-course
instructions), do so now: `install.packages("fs")`. Then load the
package with `library(fs)` if you did not already run the set up
chunk above.
4. Review the contents of your current folder using `dir_ls()`. (Base
equivalent: `list.files()`)
5. Now try to set your working directory using `setwd("test_dir")`.
What happened?
6. Create a new test directory using `dir_create("test_dir")`. (Base
equivalent: `dir.create("test_dir")`)
7. Review your current directory
8. Set your directory to the test directory you just created
9. Using the Files window (bottom right in RStudio, click on **Files**
tab if on another tab), navigate to the test directory you just
created and list the files. *Pro tip: The More menu here has
shortcuts to set the currently displayed directory as your working
directory and to navigate to the current working directory*
10. Navigate back to one level above the directory you created using
`setwd("..")` and list the files
11. Delete the directory you created using the `dir_delete()` function.
Learn more about how to use the function by reviewing the
documentation: `?dir_delete`. (Base equivalent: `unlink()` +
additional arguments)
**End Exercise**
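If you want to rehearse these steps outside the exercise, the same workflow can be sketched with the base R equivalents, run inside a temporary directory so it leaves no trace (the directory name is arbitrary):

```{r}
# A runnable sketch of the exercise steps using base R equivalents
old_wd <- setwd(tempdir())            # setwd() returns the previous directory
dir.create("test_dir", showWarnings = FALSE)  # base equivalent of dir_create()
"test_dir" %in% list.files()          # base equivalent of dir_ls()
setwd("test_dir")                     # step into the new directory
setwd("..")                           # ".." moves up one level
unlink("test_dir", recursive = TRUE)  # base equivalent of dir_delete()
setwd(old_wd)                         # restore the original working directory
```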
The functions in the fs package include arguments and capabilities that
can be helpful for finding directories or files with names that have a
specific pattern. From our project directory, we may want to see the
files in a specific folder, without changing the directory of the
folder. We can use the path argument in the function:
```{r}
dir_ls(path = "data")
```
Another really handy argument to the `dir_ls()` function is `glob`. This
allows you to supply a "wild card" pattern to retrieve records fitting a
specific pattern. The syntax uses an asterisk to stand for any sequence
of characters, either at the beginning or end of an expression. For
example, we may only want to retrieve the Excel files from our data
directory, so we would match the file extension:
```{r}
dir_ls(path = "data", glob = "*.xlsx")
```
Or, we may be interested in only the sample csv files that are denoted
by "\_s":
```{r}
dir_ls(path = "data", glob = "*_s.csv")
```
Note that the asterisk at the beginning of the pattern followed by
characters to match against at the end requires that the text pattern be
at the very end of the string.
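For comparison, base R's `list.files()` can do the same filtering, but with a regular expression (its `pattern` argument) instead of a glob. A small self-contained sketch, using made-up file names in a temporary directory:

```{r}
# list.files() filters with a regular expression rather than a glob,
# so the glob "*.xlsx" becomes the regex "\\.xlsx$"
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("run1_s.csv", "run1_qc.csv", "report.xlsx")))

list.files(demo_dir, pattern = "\\.xlsx$")   # Excel files only
list.files(demo_dir, pattern = "_s\\.csv$")  # sample csv files only
```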
**Optional Exercise (If you do not already have a project directory)**
Now that you're warmed up with navigating through directories using R,
let's use functionality that's built into RStudio to make our
project-oriented lives easier. To enter this brave new world of project
directories, let's make a home for our projects. (Alternately, if you
already have a directory that's a home for your projects, set your
working directory there.)

1. Using the Files navigation window (bottom right, Files tab), navigate
   to your home directory or any directory you'd like to place your
   future RStudio projects
2. Create a "Projects" directory
3. Set your directory to the "Projects" directory
```{r, eval = FALSE}
dir_create("Projects")
setwd("Projects")
```
Alternately, you can do the above steps within your operating system
(eg. on a Mac, open Finder window and create a folder) or if you are
comfortable working at the command line, you can make a directory there.
In the newest version of RStudio (version 1.1), you have the option of
opening up a command line prompt under the Terminal tab (on the left
side, next to the Console tab).
**End Exercise**
**Exercise 2**
Let's start a new project:
1. Navigate to the **File** menu and select **New Project...** OR
Select the **Create a project** button on the global toolbar (2nd
from the left)
2. Select **New Directory** option
3. In the Project Type prompt, select **New Project**
4. In the Directory Name prompt under Create New Project, enter
"sample-project-structure"
5. In the Create Project as a Subdirectory of prompt under Create New
Project, navigate to the Projects folder you just created (or
another directory of your choosing). You can type in the path or hit
the **Browse** button to find the directory. Check the option for
"Open in a new session" and create your project.
**End Exercise**
So, what exactly does creating a Project in RStudio do for you? In a
nutshell, using these Projects allows you to drop what you're doing,
close RStudio, and then open the Project to pick up where you left off.
Your data, history, settings, open tabs, etc. will be saved for you
automatically.
Does using a RStudio Project allow someone else to pick up your code and
just use it? Or let you come back to a Project 1 year later and have
everything work magically? Not by itself, but with a few more tricks you
will be able to more easily re-run or share your code.
### Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory
Before we broke up with Excel, it was standard operating procedure to
perform our calculations and data manipulations in the same place that
our data lived. This is not necessarily incompatible with
reproducibility, if we have very careful workflows and make creative use
of macros. However, once you have modified your original input file, it
may be non-trivial to review what you actually did to your original raw
data (particularly if you did not save it as a separate file). Moreover,
Excel generally lends itself to non-repeatable manual data manipulation
that can take extensive detective work to piece together.
Using R alone will not necessarily save you from these patterns; they
just take a different form. Instead of clicking around, dragging, and
entering formulas, you might find yourself throwing different functions
at your data in a different order each time you open up R. While it
takes some effort to overwrite your original data file in R, other
non-ideal patterns of file management that are common in Excel-land can
creep up on you if you're not careful.
One solution to help avoid these issues is to maintain the separation of
church and state (to use a poor analogy): explicitly organize your
analysis so that raw data lives in one directory (the *data*
directory) and the results of running your R code are placed in another
directory (eg. *results* or *output*). You can take this concept a
little further and include other directories within your project folder
to better organize work such as *figures*, *documents* (for
manuscripts), or *processed_data*/*munge* (if you want to create
intermediate data sets). You have a lot of flexibility and there are
multiple resources that provide some guidance [(Parzakonis
2017)](https://statsravingmad.com/measure/sample-r-project-structure/),
[(Muller
2017)](http://blog.jom.link/implementation_basic_reproductible_workflow.html),
[(Software Carpentry
2016)](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/).
**Exercise 3**
Be sure to work within the RStudio window that contains your
"sample-project-structure" project. Refer to the top right of the window
and you should see the project name displayed there. Let's go ahead and
create a minimal project structure by running the following code within
the console:
```{r, eval = FALSE}
library(fs)
dir_create("data") # raw data
dir_create("output") # output from analysis
dir_create("cache") # intermediate data (after processing raw data)
dir_create("src") # code goes into this folder
```
This is a bare bones structure that should work for future projects you
create. Refer to the content below if you decide you want to adopt a
standard directory structure for your projects on top of using RStudio
Projects.
Keep this project open in a separate window for now. We will revisit it
as we learn about version control.
**End Exercise**
*Further exploration/tools for creating projects:*
The directory creation code in the above exercise can be packaged into a
function that creates the folder structure for you (either within or
outside of a project). Software Carpentry has a nice refresher on
writing functions:
<https://swcarpentry.github.io/r-novice-inflammation/02-func-R/>.
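As a sketch of that idea, here is a minimal base R function (the function name is our own) that recreates the Exercise 3 skeleton wherever you point it:

```{r}
# Build the standard directory skeleton inside a given project folder
create_project_skeleton <- function(root = ".") {
  dirs <- c("data", "output", "cache", "src")
  for (d in dirs) {
    dir.create(file.path(root, d), showWarnings = FALSE, recursive = TRUE)
  }
  invisible(file.path(root, dirs))
}

# Example: build the skeleton in a temporary folder
demo_root <- file.path(tempdir(), "demo-project")
create_project_skeleton(demo_root)
list.files(demo_root)
```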
There is also a dedicated
[ProjectTemplate](http://projecttemplate.net/index.html) package with a
nice "minimal project layout" that can be a good starting point if you
want R to do more of the work for you. This package duplicates some
functionality that an RStudio Project provides, so you probably want to
run it outside of an RStudio Project, but it is a good tool to be aware
of.
### Name all files (and variables) to reflect their content or function
This concept is pretty straightforward: assume someone else will be
working with your code and analysis and won't intuitively understand
cryptic names. Rather than output such as results.csv, a file name of
morphine_precision_results.csv offers more insight. [Wilson et
al.](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510)
make the good point that using sequential numbers will come back to bite
you as your project evolves: for example, "figure_2.txt" for a
manuscript may eventually become "figure_3.txt". We'll get into it in
the next section, but the final guidance with regard to file names is to
use a style convention for file naming that makes it easier to read
names and manipulate files in R. One common issue is whitespace in file
names: it can be annoying when writing out file names in scripts, so
underscores are preferable. Another issue is the use of capital letters:
all-lowercase names are easier to write out. As an example, rather than
"Opiate Analysis.csv", the preferred name might be
"opiate_analysis.csv".
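A tiny helper (the function name is our own invention) shows how mechanical this convention can be:

```{r}
# Lowercase a file name and replace runs of whitespace with underscores
tidy_file_name <- function(x) {
  gsub("\\s+", "_", tolower(x))
}

tidy_file_name("Opiate Analysis.csv")  # "opiate_analysis.csv"
```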
## Enforce reproducibility of the directories and packages
### Scenario 1: Sharing your project with a colleague
Let's think about a happy time a couple months from now. You've
completed this R course, have learned some new tricks, and you have
written an analysis of your mass spec data, bundled as a nice project in
a directory named "mass_spec_analysis". You're very proud of the
analysis you've written and your colleague wants to run the analysis on
similar data. You send them your analysis project (the whole directory)
and when they run it they immediately get the following error when
trying to load the data file with the `read.csv("file.csv")` command:
**Error in file(file, "rt") : cannot open the connection**
**In addition: Warning message:**
**In file(file, "rt") :**
**cannot open file 'file.csv': No such file or directory**
Hmmm, R can't find the file, even though you set the working directory
for your folder using
`setwd("/Users/username/path/to/mass_spec_analysis")`.
What is the problem? Setting your working directory is actually the
problem here, because it is almost guaranteed that the path to a
directory on your computer does not match the path to the directory on
another computer. That path may not even work on your own computer a
couple years from now!
Fear not, there is a package for that! The [here](https://cran.r-project.org/web/packages/here/index.html) package is a
helpful way to "anchor" your project to a directory without setting your
working directory. The here package uses a pretty straightforward syntax
to help you point to the file you want. In the example above, where
file.csv is a data file in the root directory (I know, not ideal
practice per our discussion on project structure above), then you can
reference the file using `here("file.csv")`, where `here()` indicates
the current directory. So reading the file could be accomplished with
`read.csv(here("file.csv"))`, and it could be run by anyone you share
the project with.
The here package couples well with an RStudio Project because here uses
an algorithm that determines the top-level directory by looking for
specific files:

- Creating an RStudio Project creates an .Rproj file that tells here
  which directory is the project top-level directory
- If you don't create a Project in RStudio, you can create an empty file
  named .here in the top-level directory to tell here where to go
- There are a variety of other file types the package looks for
  (including a .git file, which is generated if you have a project on
  GitHub)
I encourage you to read the following post by Jenny Bryan that includes
her strong opinions about setting your working directory:
[Project-oriented workflow](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/).
Moral of the story: avoid using `setwd()` and complicated paths to your
file - use `here()` instead!
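To demystify that algorithm, here is a toy base R sketch of the same idea: walk up the directory tree until a marker file is found. This is an illustration of the concept, not the here package's actual implementation:

```{r}
# Walk upward from `start` until a directory contains a marker file
# (e.g. an .Rproj file or an empty .here file)
find_root <- function(start = getwd(),
                      markers = c("\\.Rproj$", "^\\.here$")) {
  dir <- normalizePath(start)
  repeat {
    hits <- unlist(lapply(markers, function(m) {
      list.files(dir, pattern = m, all.files = TRUE)
    }))
    if (length(hits) > 0) return(dir)     # found a marker: this is the root
    parent <- dirname(dir)
    if (parent == dir) return(NA_character_)  # reached the filesystem root
    dir <- parent
  }
}

# Example: create a fake project with a .here marker and find the root
# from a subfolder
proj <- file.path(tempdir(), "fake-project")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(proj, ".here"))
find_root(start = file.path(proj, "data"))
```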
### Scenario 2: Running your 2018 code in 2019
Now imagine you've written a nice analysis for your mass spec data but
let it sit on the shelf for 6 months or a year. In the meantime, you've
updated R and your packages multiple times. You rerun your analysis on
the same old data set and either (a) one or more lines of code no longer
works or (b) the output of your analysis is different than the first
time you ran it. Very often these problems arise because one or more of
the packages you use in your code have been updated since the first time
you ran your analysis. Sometimes package updates change the input or
output that specific functions expect or produce, or alter the behavior
of packages in unexpected ways. These problems also arise when sharing
code with colleagues because different users may have different versions
of packages installed.
Why do we run into this problem? Packages are installed in directories
called library paths. Whenever you load a package, R searches for it in
your library paths. You may have multiple library paths: by default
there is typically a system library, but you may have a user library as
well. You can see your library paths with the `.libPaths()` function. R
goes through the library paths in order, so it may look for user
packages before moving to system packages. Ultimately, R defaults to
using the first version of a specific package it finds as it moves
through the library paths. If you have a script or notebook that was
developed with a different version of a package than the one R defaults
to, you can end up with different output than expected or an analysis
that does not work. Packages developed by the RStudio team, like the
tidyverse group of packages, are more likely to have significant testing
and to avoid breaking changes. However, this is not a guarantee.
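You can inspect this search order yourself from the console:

```{r}
# The library paths R searches, in order
.libPaths()

# Where a specific package was found, and which version R will load
# (the stats package ships with base R, so this works on any install)
find.package("stats")
packageVersion("stats")
```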

The generalized way to avoid this issue is to manage packages on a project by project basis: instead of R using the default version of a package it finds, tie specific versions of packages to a project. There are a couple options for accomplishing this.

#### Option 1: checkpoint
Arguably the most lightweight solution to this problem is the [checkpoint](https://cran.r-project.org/web/packages/checkpoint/index.html) package.
The basic premise behind checkpoint is that it allows you to use a
package as it existed on a specific date. There is a daily snapshot of
all packages on CRAN (the R package repository), dating back to
2014-09-17. By using checkpoint you can be confident that the version of
the package you reference in your code is the same version that anyone
else running your code will be using.
The behavior of checkpoint makes it complicated to test out in this
section: the package is tied to a project and by default searches for
every package called within your project (via `library()` or
`require()`).
The checkpoint package is very helpful in writing reproducible analyses,
but there are some limitations/considerations with using it:
- retrieving and installing packages adds to the amount of time it takes to run your analysis
- package updates over time may fix bugs so changes in output may be more accurate
- checkpoint is tied to projects, so alternate structures that don't use projects may not be able to utilize the package
#### Option 2: renv
There is another solution to this problem that has tighter integration with RStudio: the [renv](https://rstudio.github.io/renv/articles/renv.html) package. renv allows you to maintain specific package versions on a project by project level. Unlike checkpoint, package versions are not determined by date. Instead, you initialize a project and use renv functions to detect whatever packages you've loaded, along with their version information, and generate a file with that data so that the environment can be replicated easily on another system or by someone else.
- The functionality is initialized with the `renv::init()` function, which captures the state of your default R libraries into a project-local library that future R sessions will use when they open the project
- The `renv::snapshot()` function records the packages and versions currently in use to a lockfile, so run it again after you install or update packages
- The `renv::restore()` function reinstalls packages in a project to match the versions recorded in the lockfile
There is a nice summary of renv here: <https://kevinushey-2020-rstudio-conf.netlify.app/slides.html#1>.
Note: the renv package was developed as a more stable solution than its predecessor [packrat](https://cran.r-project.org/web/packages/packrat/index.html).
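Putting those functions together, a typical renv session might look like the following sketch (not run here; it requires the renv package and an open project):

```{r, eval = FALSE}
renv::init()      # set up a project-local library for this project
# ...install packages and develop your analysis...
renv::snapshot()  # record the package versions in use to renv.lock
# Later, or on a collaborator's machine:
renv::restore()   # reinstall the exact versions recorded in renv.lock
```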
Either approach to package management will work - the important point
here is to be proactive about how you manage your packages, especially
if you know your code will be used over and over again in the future.
## Summary
- Reproducible research is the principle that any research result can
be reproduced by anybody
- Practices in reproducible research also offer benefits to the
code author in producing clearer, easier-to-understand code and
being able to easily repeat past work
- Important practices in reproducible research include:
- Developing a standardized but easy-to-use project structure
- Adopting a style convention for coding
- Enforcing reproducibility when working with projects and
packages