-
Notifications
You must be signed in to change notification settings - Fork 16
/
ggplot2_intro.qmd
509 lines (354 loc) · 19 KB
/
ggplot2_intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
---
title: "R ggplot2: introductory data visualisation"
format:
gfm:
toc: true
toc-depth: 2
editor: source
date: today
author: UQ Library
---
## What are we going to learn?
This session is directed at R users (a beginner level is sufficient) who are new to the ggplot2 package.
During this session, you will:
* Have a visualisation package installed (ggplot2)
* Learn how to explore data visually
* Learn about the 3 essential ggplot2 components
* Use different kinds of visualisations
* Layer several visualisations
* Learn how to customise a plot with colours, labels and themes.
## Open RStudio
* If you are using your own laptop please open RStudio
+ If you need theme, we have [installation instruction](/R/Installation.md#r--rstudio-installation-instructions)
+ Make sure you have a working Internet connection
* On the Library computers (the first time takes about 10 min):
+ Log in with your UQ username and password (use your student account if you have both a staff and student account)
+ Make sure you have a working Internet connection
+ Open the ZENworks application
+ Look for RStudio
+ Double-click on RStudio which will install both R and RStudio, and open RStudio
## Essential shortcuts
Remember some of the most commonly used RStudio shortcuts:
* function or dataset help: press <kbd>F1</kbd> with your cursor anywhere in a function name.
* execute from script: <kbd>Ctrl</kbd> + <kbd>Enter</kbd>
* assignment operator (`<-`): <kbd>Alt</kbd> + <kbd>-</kbd>
## Installing ggplot2
We first need to make sure we have the **ggplot2 package** available on our computer. We can use the "Install" button in the "Packages" pane, or we can execute this command in the console: `install.packages("ggplot2")`
You only need to install a package once, but you need to load it every time you start a new R session.
## Setting up a project
> Everything we write today will be saved in your R project. Please remember to save it on your H drive or USB if you are using a Library computer.
Let's create a new **R project** to keep everything tidy:
* Click the "File" menu button (top left corner), then "New Project"
* Click "New Directory"
* Click "New Project"
* In "Directory name", type the name of your project, e.g. "ggplot2_intro"
* Select the folder where to locate your project: e.g. a `Documents/RProjects` folder, which you can create if it doesn't exist yet.
* Click the "Create Project" button
* Create a folder to store our plots:
+ `dir.create("plots")`
We will write ggplot2 code more comfortably in a **script**:
* Menu: Top left corner, click the green "plus" symbol, or press the shortcut (for Windows/Linux) <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>N</kbd> or (for Mac) <kbd>Cmd</kbd>+<kbd>Shift</kbd>+<kbd>N</kbd>. This will open an "Untitled1" file.
* Go to "File > Save" or press (for Windows/Linux) <kbd>Ctrl</kbd>+<kbd>S</kbd> or (for Mac) <kbd>Cmd</kbd>+<kbd>S</kbd>. This will ask where you want to save your file and the name of the new file.
* Call your file "process.R"
We can straight away **load the package** by adding this command to our script and executing it:
```{r}
library(ggplot2)
```
> Remember to use <kbd>Ctrl</kbd>+<kbd>Enter</kbd> to execute a command from the script.
## Finding help
We are going to work with different datasets that come with the ggplot2 package. For any dataset or function doubts that you might have, don't forget the two main ways to bring up a help page:
1. the command: `?functionname`
1. the keyboard shortcut: press <kbd>F1</kbd> after writing a function name
### Introducing ggplot2
The R package ggplot2 was developed by Hadley Wickham with the objective of creating a grammar of graphics for categorical data (in 2007). It is based on the book _The Grammar of Graphics_ Developed by Leland Wilkinson (first edition published in 1999).
It is now part of the group of data science packages called Tidyverse.
## The components of the Grammar of Graphics
The Grammar of Graphics is based on the idea that you can build every graph from the same few components.
The components are:
* Data
* Mapping
* Statistics
* Scales
* Geometries
* Facets
* Coordinates
* Theme
In this introductory session, we will mainly focus on the **data**, the **mapping**, the **statistics**, the **geometries** and the **theme**.
## ggplot2's three essential components
In ggplot2, the 3 main components that we usually have to provide are:
1. Where the **data** comes from,
2. the **aesthetic mappings**, and
3. a **geometry**.
For our first example, let's use the `msleep` dataset (from the ggplot2 package), which contains data about mammals' sleeping patterns.
> You can find out about the dataset with `?msleep`.
Let's start with specifying where the **data** comes from in the `ggplot()` function:
```{r}
ggplot(data = msleep)
```
This is not very interesting. We need to tell ggplot2 *what* we want to visualise, by **mapping** *aesthetic elements* (like our axes) to *variables* from the data. We want to visualise how common different conservations statuses are, so let's associate the right variable to the x axis:
```{r}
ggplot(data = msleep,
mapping = aes(x = conservation))
```
ggplot2 has done what we asked it to do: the conservation variable is on the x axis. But nothing is shown on the plot area, because we haven't defined *how* to represent the data, with a `geometry_*` function:
```{r}
ggplot(data = msleep,
mapping = aes(x = conservation)) +
geom_bar()
```
Now we have a useful plot: we can see that a lot of animals in this dataset don't have a conservation status, and that "least concern" is the next most common value.
We can see our three essential elements in the code:
1. the **data** comes from the `msleep` object;
1. the variable `conservation` is **mapped to the aesthetic** `x` (i.e. the x axis);
1. the **geometry** is `"bar"`, for "bar chart".
Here, we don't need to specify what variable is associated to the y axis, as the "bar" geometry automatically does a count of the different values in the `conservation` variable. That is what **statistics** are applied automatically to the data.
> In ggplot2, each geometry has default statistics, so we often don't need to specify which stats we want to use. We could use a `stat_*()` function instead of a `geom_*()` function, but most people start with the geometry (and let ggplot2 pick the default statistics that are applied).
## Line plots
Let's have a look at another dataset: the `economics` dataset from the US. Learn more about it with `?economics`, and have a peak at its structure with:
```{r}
str(economics)
```
Do you think that unemployment is stable over the years? Let's have a look with a line plot, often used to visualise time series:
```{r}
ggplot(data = economics,
mapping = aes(x = date,
y = unemploy)) +
geom_line()
```
Let's go through our essential elements once more:
* The `ggplot()` function initialises a ggplot object. In it, we declare the **input data frame** and specify the set of plot aesthetics used throughout all layers of our plot;
* The `aes()` function groups our **mappings of aesthetics to variables**;
* The `geom_<...>()` function specifies what **geometric element** we want to use.
## Scatterplots
Scatterplots are often used to look at the relationship between two variables. Let's try it with a new dataset: `mpg` (which stands for "miles per gallon"), a dataset about fuel efficiency of different models of cars.
```{r eval=FALSE}
?mpg
str(mpg)
```
Do you think that big engines use fuel more efficiently than small engines?
We can focus on two variables:
* `displ`: a car’s engine size, in litres.
* `hwy`: a car’s fuel efficiency on the highway, in miles per gallon.
For the geometry, we now have use "points":
```{r}
ggplot(data = mpg,
mapping = aes(x = displ,
y = hwy)) +
geom_point()
```
Notice how the points seem to be aligned on a grid? That's because the data was rounded. If we want to better visualise the density of points, we can use the "count" geometry, which makes the dots bigger when data points have the same `x` and `y` values:
```{r}
ggplot(data = mpg,
mapping = aes(x = displ,
y = hwy)) +
geom_count()
```
Alternatively, we can avoid overlapping of points by using the "jitter" geometry, which gives the points a little shake:
```{r}
ggplot(data = mpg,
mapping = aes(x = displ,
y = hwy)) +
geom_jitter()
```
Even though the position of the dots does not match exactly the original `x` and `y` values, it does help visualise densities better.
The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?
However, we can see some outliers. We need to find out more about our data.
## Adding aesthetics
We can highlight the "class" factor by adding a new aesthetic:
```{r}
ggplot(data = mpg,
mapping = aes(x = displ,
y = hwy,
colour = class)) +
geom_jitter()
```
It seems that two-seaters are more fuel efficient than other cars with a similar engine size, which can be explained by the lower weight of the car. The general trend starts to make more sense!
We now know how to create a simple scatterplot, and how to visualise extra variables. But how can we better represent a correlation?
## Trend lines
A trend line can be created with the `geom_smooth()` function:
```{r}
ggplot(mpg,
aes(x = displ,
y = hwy)) +
geom_smooth()
```
> We stopped using the argument names because we know in which order they appear: first the data, then the mapping of aesthetics. Let's save ourselves some typing from now on!
The console shows you what function / formula was used to draw the trend line. This is important information, as there are countless ways to do that. To better understand what happens in the background, open the function's help page and notice that the default value for the `method` argument is "NULL". Read up on how it automatically picks a suitable method depending on the sample size, in the "Arguments" section.
Want a linear trend line instead? Add the argument `method = "lm"` to your function:
```{r}
ggplot(mpg,
aes(x = displ,
y = hwy)) +
geom_smooth(method = "lm")
```
## Layering
A trend line is usually displayed on top of the scatterplot. How can we combine several layers? We can string them with the `+` operator:
```{r}
ggplot(mpg,
aes(x = displ,
y = hwy)) +
geom_point() +
geom_smooth()
```
> The order of the functions matters: the points will be drawn before the trend line, which is probably what you're after.
## The `colour` aesthetic
We can once again add some information to our visualisation by mapping the `class` variable to the `colour` aesthetic:
```{r}
ggplot(mpg,
aes(x = displ,
y = hwy)) +
geom_point(aes(colour = class)) +
geom_smooth()
```
**Challenge 1 – where should aesthetics be defined?**
Take the last plot we created:
```{r eval=FALSE}
ggplot(mpg,
aes(x = displ,
y = hwy)) +
geom_point(aes(colour = class)) +
geom_smooth()
```
What would happen if you moved the `colour = class` aesthetic from the geometry function to the `ggplot()` call?
Different geometries can also have their own mappings that overwrite the defaults.
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.
## Saving a plot
Like your visualisation? You can export it with the "Export" menu in the "Plots" pane.
* Building a document or a slideshow? You can copy it straight to your clipboard, and paste it into it.
* A PDF is a good, quick option to export an easily shareable file with vector graphics. Try for example the "A5" size, the "Landscape" orientation, and save it into your "plots" directory.
* More options are available in the "Save as image..." option. PNG is a good compressed format for graphics, but if you want to further customise your visualisation in a different program, use SVG or EPS, which are vector formats. (Try to open an SVG file in [Inkscape](https://inkscape.org/) for example.)
To save the last plot with a command, you can use the `ggsave()` function:
```{r eval=FALSE}
ggsave(filename = "plots/fuel_efficiency.png")
```
This is great to automate the export process for each plot in your script, but `ggsave()` also has extra options, like setting the DPI, which is useful for getting the right resolution for a specific use. For example, to export a plot for your presentation:
```{r eval=FALSE}
ggsave(filename = "plots/fuel_efficiency.png", dpi = "screen")
```
> Saving a .svg file with requires installing the svglite package. This packages seems so work best installing in a fresh R session (Session > Restart R) from source `install.packages("svglite", type = "source")`. Then load the library `library(svglite)` rerun your code including loading previous libraries (`ggplot2` etc.) and now saving a plot with a .svg extension should work!
**Challenge 2 – add a variable and a smooth line**
Let's use a similar approach to what we did with the `mpg` dataset.
Take our previous unemployment visualisation, but represented with points this time:
```{r eval=FALSE}
ggplot(economics,
aes(x = date,
y = unemploy)) +
geom_point()
```
How could we:
1. Add a smooth line for the number of unemployed people. Are there any interesting arguments that could make the smoother more useful?
2. Colour the points according to the median duration of unemployment (see `?economics`)
```{r}
ggplot(economics,
aes(x = date,
y = unemploy)) +
geom_point(aes(colour = uempmed)) +
geom_smooth()
```
> See how the legend changes depending on the type of data mapped to the `colour` aesthetic? (i.e. categorical vs continuous)
This default "trend line" is not particularly useful. We could make it follow the data more closely by using the `span` argument. The closer to 0, the closer to the data the smoother will be:
```{r}
ggplot(economics,
aes(x = date,
y = unemploy)) +
geom_point(aes(colour = uempmed)) +
geom_smooth(span = 0.1)
```
You can now see why this is called a "smoother": we can fit a smooth curve to data that varies a lot.
To further refine our visualisation , we could visualise the unemployment rate rather than the number of unemployed people, by calculating it straight into our code:
```{r}
ggplot(economics,
aes(x = date,
y = unemploy / pop)) +
geom_point(aes(colour = uempmed)) +
geom_smooth(span = 0.1)
```
The [early 1980s recession](https://en.wikipedia.org/wiki/Early_1980s_recession) now seems to have had a more significant impact on unemployment than the [Global Financial Crisis](https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%9308) of 2007-2008.
## Bar charts and ordered factors
Let's use the `diamonds` dataset now.
The `diamonds` dataset comes with ggplot2 and contains information about ~54,000 diamonds, including the price, carat, colour, clarity, and cut quality of each diamond.
Let's have a look at the data:
```{r eval=FALSE}
diamonds
summary(diamonds)
?diamonds
```
Back to bar charts. Consider a basic bar chart, as drawn with `geom_bar()`. The following chart displays the total number of diamonds in the `diamonds` dataset, grouped by cut:
```{r}
ggplot(diamonds,
aes(x = cut)) +
geom_bar()
```
The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
`cut` is an ordered factor, which you can confirm by printing it to the console:
```{r}
head(diamonds$cut)
```
See how ggplot2 respects that order in the bar chart?
### Customising a plot
Let's see how we can customise our bar chart's look.
#### Change a geometry's default colour
First, we can pick our favourite colour in `geom_bar()`:
```{r}
ggplot(diamonds,
aes(x = cut)) +
geom_bar(fill = "tomato")
```
If you are curious about what colour names exist in R, you can use the `colours()` function.
### Change labels
We can also modify labels with the `labs()` function to make our plot more self-explanatory:
```{r}
ggplot(diamonds,
aes(x = cut)) +
geom_bar(fill = "tomato") +
labs(title = "Where are the bad ones?",
x = "Quality of the cut",
y = "Number of diamonds")
```
Let's have a look at what `labs()` can do:
```{r eval=FALSE}
?labs
```
It can edit the title, the subtitle, the x and y axes labels, and the caption.
> Remember that captions and titles are better sorted out in the publication itself, especially for accessibility reasons (e.g. to help with screen readers).
### Horizontal bar charts
For a horizontal bar chart, we can map the `cut` variable to the `y` aesthetic instead of `x`. But remember to also change your labels around!
```{r}
ggplot(diamonds,
aes(y = cut)) + # switch here...
geom_bar(fill = "tomato") +
labs(title = "Where are the bad ones?",
y = "Quality of the cut", # ...but also here!
x = "Number of diamonds") # ...and here!
```
This is particularly helpful when long category names overlap under the x axis.
### Built-in themes
The `theme()` function allows us to really get into the details of our plot's look, but some `theme_*()` functions make it easy to apply a built-in theme, like `theme_bw()`:
```{r}
ggplot(diamonds,
aes(y = cut)) +
geom_bar(fill = "tomato") +
labs(title = "Where are the bad ones?",
y = "Quality of the cut",
x = "Number of diamonds") +
theme_bw()
```
Try `theme_minimal()` as well, and if you want more options, install the `ggthemes` package!
## Play time!
**Challenge 3: explore geometries**
When creating a new layer, start typing `geom_` and see what suggestions pop up. Are there any suggestions that sound useful or familiar to you?
Modify your plots, play around with different layers and functions, and ask questions!
## Close project
Closing RStudio will ask you if you want to save your workspace and scripts.
Saving your workspace is usually not recommended if you have all the necessary commands in your script.
## Useful links
* For ggplot2:
+ [ggplot2 cheatsheet](https://www.rstudio.org/links/data_visualization_cheat_sheet)
+ Official [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/)
+ Official [ggplot2 website](https://ggplot2.tidyverse.org/)
+ [Chapter on data visualisation](https://r4ds.hadley.nz/data-visualize.html) in the book _R for Data Science_
+ [From Data to Viz](https://www.data-to-viz.com/), a website to explore different visualisations and the code that generates them
+ Selva Prabhakaran's [_r-statistics.co_ section on ggplot2](https://r-statistics.co/ggplot2-Tutorial-With-R.html)
+ [Coding Club's data visualisation tutorial](https://ourcodingclub.github.io/tutorials/datavis/)
+ [STHDA's ggplot2 essentials](https://www.sthda.com/english/wiki/ggplot2-essentials)
* Our compilation of [general R resources](/R/usefullinks.md)