
# ggplot2: Grammar of Graphics {#ggplot2}

In previous chapters, you have learned about tables that are the main way of representing data in psychological research and in R. In the following ones, you will learn how to manipulate data in these tables: change it, aggregate or transform individual groups of data, use it for statistical analysis. But before that you need to understand how to store your data in the table in the optimal (or, at least, recommended) way. First, I will introduce the idea of _tidy data_, the concept that gave [Tidyverse](https://www.tidyverse.org/) its name. Next, we will see how tidy data helps you visualize relationships between variables. Don't forget to download the [notebook](notebooks/Seminar 05 - ggplot2.qmd).

## Tidy data {#tidydata}
Tidy data follows [three rules](https://r4ds.had.co.nz/tidy-data.html):

* variables are in columns,
* observations are in rows,
* values are in cells.

This probably sounds so straightforward that you wonder "Can a table not be tidy?" As a matter of fact, _a lot_ of typical results of psychological experiments are not tidy. Imagine an experiment where participants rated a face on symmetry, attractiveness, and trustworthiness. Typically (at least in my experience), the data will be stored as follows:
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
widish_df <- 
  tibble(Participant = c(1, 1, 2, 2),
         Face = rep(c("M1", "M2"), 2), 
         Symmetry = c(6, 4, 5, 3),
         Attractiveness = c(4, 7, 2, 7),
         Trustworthiness = c(3, 6, 1, 2))

knitr::kable(widish_df, align="c")
```

This is a very typical table optimized for _humans_. A single row contains all responses about a single face, so it is easy to visually compare responses of individual observers. Often, the table is even wider so that a single row holds all responses from a single observer (in my experience, a lot of online surveys produce data in this format).
```{r echo=FALSE, message=FALSE, warning=FALSE}
wide_df <- 
  tibble(Participant = c(1, 2),
         M1.Symmetry = c(6, 5),
         M1.Attractiveness = c(4, 2),
         M1.Trustworthiness = c(3, 1),
         M2.Symmetry = c(4, 3),
         M2.Attractiveness = c(7, 7),
         M2.Trustworthiness = c(6, 2))

knitr::kable(wide_df, align="c")
```

So, what is wrong with it? Don't we have variables in columns, observations in rows, and values in cells? Not really. You can already see it when comparing the two tables above. The _face identity_ is a variable, however, in the second table it is hidden in column names. Some columns are about face `M1`, other columns are about `M2`, etc. So, if you are interested in analyzing symmetry judgments across all faces and participants, you will need to select all columns that end with `.Symmetry` and figure out a way to extract the face identity from columns' names. Thus, face _is_ a variable but is not a column in the second table. 
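
As a minimal sketch of how the face identity could be pulled out of the column names, assuming that the dot in names such as `M1.Symmetry` always separates the face from the judgment (the names `wide_demo` and `face_demo` are made up for this illustration):

```{r}
library(tibble)
library(tidyr)

# the second table from above, recreated so that the example is self-contained
wide_demo <- tibble(Participant = c(1, 2),
                    M1.Symmetry = c(6, 5),
                    M1.Attractiveness = c(4, 2),
                    M1.Trustworthiness = c(3, 1),
                    M2.Symmetry = c(4, 3),
                    M2.Attractiveness = c(7, 7),
                    M2.Trustworthiness = c(6, 2))

# split every column name at the dot: the part before it goes into
# the new Face column, the part after it (".value") stays a column name
face_demo <- pivot_longer(wide_demo,
                          cols = -Participant,
                          names_to = c("Face", ".value"),
                          names_sep = "\\.")
face_demo
```

The result has the same layout as the first table: `Face` is now a proper column rather than a piece of a column name.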

Then, what about the first table, which has `Face` as a column, is it tidy? The short answer: Not really but that depends on your goals as well! In the experiment, we collected _responses_ (these are numbers in cells) for different _judgments_. The latter are a variable but it is hidden in column names. Thus, a _tidy_ table for this data would be

```{r echo=FALSE, message=FALSE, warning=FALSE}
long_df <- tidyr::pivot_longer(widish_df, 
                               cols=c("Symmetry", "Attractiveness", "Trustworthiness"),
                               names_to="Judgment",
                               values_to="Response")
knitr::kable(long_df, align="c")
```

This table is (very) tidy and it makes it easy to group data by every combination of variables (e.g., by face and judgment, by participant and judgment), perform statistical analysis, etc. However, it may not always be the best way to represent the data. For example, if you would like to model `Trustworthiness` using `Symmetry` and `Attractiveness` as predictors, then the first table is more suitable. In the end, the table structure must fit your needs, not the other way around. Still, what you probably want is a _tidy_ table to begin with because it is best suited for most things you will want to do with the data and because it makes it easy to transform the data to match your specific needs (e.g., going from the third table to the first one via pivoting).
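
The pivoting mentioned above can be sketched as follows. This is only an illustration under the assumption that the data looks like the small tables in this section; the names `widish_demo`, `long_demo`, and `wide_again` are made up for it.

```{r}
library(tibble)
library(tidyr)

# the first table, recreated so that the example is self-contained
widish_demo <- tibble(Participant = c(1, 1, 2, 2),
                      Face = rep(c("M1", "M2"), 2),
                      Symmetry = c(6, 4, 5, 3),
                      Attractiveness = c(4, 7, 2, 7),
                      Trustworthiness = c(3, 6, 1, 2))

# first table -> third (tidy) table: judgments move from column names into a column
long_demo <- pivot_longer(widish_demo,
                          cols = c("Symmetry", "Attractiveness", "Trustworthiness"),
                          names_to = "Judgment",
                          values_to = "Response")

# third table -> first table: judgments become separate columns again
wide_again <- pivot_wider(long_demo,
                          names_from = "Judgment",
                          values_from = "Response")
wide_again
```

Nothing is lost along the way, so you can keep the tidy table as the master copy and produce the wider layout only when a particular analysis calls for it.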

Most data you will get from experiments will not be tidy. We will spend quite some time learning how to tidy it up but first let us see how already tidy data makes it easy to visualize relationships in it.

## ggplot2
The [ggplot2](https://ggplot2.tidyverse.org/) package is my main tool for data visualization in R. ggplot2 tends to make really good looking production-ready plots (this is not a given, a default-looking Matlab plot is, or used to be when I used Matlab, pretty ugly). Hadley Wickham was influenced by the works of [Edward Tufte](https://www.edwardtufte.com/tufte/) when developing ggplot2. Although the aesthetic aspect goes beyond our seminar, if you need to visualize data in the future, I strongly recommend reading Tufte's books. In fact, it is such an informative and aesthetically pleasing experience that I would recommend reading them in any case.

More importantly, ggplot2 uses a grammar-based approach to describing a plot that makes it conceptually different from most other software such as Matlab, Matplotlib in Python, etc. You need to get used to it but once you do, you probably will never want to go back.

A plot in _ggplot2_ is described in three parts:

1. Aesthetics: Relationship between data and visual properties that define working space of the plot (which variables map on individual axes, color, size, fill, etc.).
2. Geometrical primitives that visualize your data (points, lines, error bars, etc.) that are _added_ to the plot.
3. Other properties of the plot (scaling of axes, labels, annotations, etc.) that are _added_ to the plot.

You always need the first one. But you do not need to specify the other two, even though a plot without geometry in it looks very empty. Let us start with a very simple artificial example table below. I simulate a response as
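$$
\begin{aligned}
\text{Response} ={} & \text{Normal}(\mu = 1,\ \sigma = 0.2) \\
& - \text{Normal}(\mu = 2 \cdot \text{ConditionIndex},\ \sigma = 0.4) \\
& + \text{Normal}(\mu = \text{Intensity},\ \sigma = 0.4)
\end{aligned}
$$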
```{r echo=FALSE, message=FALSE, warning=FALSE,  fig.align="center"}
library(tidyverse)
simple_tidy_data <- 
  expand.grid(Condition = c("A", "B", "C"), Intensity = 1:8) |>
  mutate(Response = rnorm(n(), 1, 0.2) - rnorm(n(), as.numeric(as.factor(Condition)), 0.2) * 2 + rnorm(n(), Intensity, 0.4))

knitr::kable(simple_tidy_data)
```

We plot this data by 1) _defining aesthetics_ (mapping `Intensity` on the x-axis, `Response` on the y-axis, and `Condition` on color) and 2) _adding_ lines to the plot (note the plus^[One potentially confusing bit is the use of `+` in ggplot2 but of the pipe `|>` everywhere else. The difference is deliberate and fundamental. The pipe `|>` passes the output to the next function, whereas `+` _adds_ something to the already existing plot.] in `+ geom_line()`).
```{r}
ggplot(data=simple_tidy_data, aes(x = Intensity, y = Response, color=Condition)) + 
  geom_line()
```
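
To make the difference between `|>` and `+` concrete, here is a small sketch that uses both in one statement. The table `pipe_demo` and its values are hypothetical and serve only this illustration.

```{r fig.align = 'center'}
library(ggplot2)

# a tiny made-up table
pipe_demo <- data.frame(Intensity = 1:5,
                        Response = c(1.2, 2.1, 2.9, 4.2, 4.8))

pipe_plot <-
  # the pipe passes the table to ggplot() as its first argument (data)...
  pipe_demo |>
  ggplot(aes(x = Intensity, y = Response)) +
  # ...whereas the line is *added* to the plot that ggplot() returned
  geom_line()

pipe_plot
```

Replacing the `+` with `|>` here would try to pass the whole plot to `geom_line()` as an argument, which is not what that function expects.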

As I already wrote, technically, the only thing you need to define is aesthetics, so let us not add anything to the plot by dropping the `+ geom_line()`.
```{r fig.align = 'center'}
ggplot(data=simple_tidy_data, aes(x = Intensity, y = Response, color=Condition))
```

Told you it would look empty, and yet you can already see _ggplot2_ in action. Notice that the axes are labeled and their limits are set. You cannot see the legend (legends are not plotted without corresponding geometry) but it is also ready behind the scenes. This is because our initial call specified the most important part: how individual variables map on various properties, even before we tell _ggplot2_ which visuals we will use to plot the data. When we specified that the x-axis will represent `Intensity`, _ggplot2_ figured out the range of values, so it knows _where_ it is going to plot whatever we decide to plot, as well as the axis label. Points, lines, bars, error bars, and whatnot will span only that range.

The same goes for other properties such as color. We wanted _color_ to represent the condition. Again, we may not know what exactly we will be plotting (points, lines?) or even how many different visuals we will be adding to the plot (just lines? points + lines? points + lines + linear fit?) but we do know that whatever visual we add, if it can have color, its color _must_ represent the condition for that data point. The beauty of _ggplot2_ is that it analyses your data, figures out how many colors you need, and is ready to apply them _consistently_ to all visuals you add later. It ensures that all points, bars, lines, etc. share the same coordinate scaling and the same color, size, and fill mapping across the entire plot. This may sound trivial but typically (e.g., in Matlab or Matplotlib), it is _your_ job to make sure that all these properties match and that they represent the same value across all visual elements. And this is a pretty tedious job, particularly when you decide to change your mappings and have to redo all individual components by hand.

In _ggplot2_, this dissociation between mapping and visuals means you can tinker with one of them at a time. E.g., you can keep the visuals but change the grouping, or see whether the effect of condition is easier to see via line type, size, or shape of the points. Or you can keep the mapping and see whether adding another visual makes the plot easier to understand. Note that some mappings also _group_ your data, so when you use a group-based visual (e.g., a linear regression line), _ggplot2_ knows which data belongs together and performs the computation per group.
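
As a sketch of "keeping the visuals but changing the mapping", the chunk below simulates data following the formula above (the seed and the name `mapping_demo` are arbitrary choices for this illustration) and maps `Condition` on line type and point shape instead of color. The two geoms are written without any arguments, so only the `aes()` call decides how conditions are told apart.

```{r fig.align = 'center'}
library(ggplot2)

# simulated data, same recipe as the formula above
set.seed(42)
mapping_demo <- expand.grid(Condition = c("A", "B", "C"), Intensity = 1:8)
n_rows <- nrow(mapping_demo)
mapping_demo$Response <-
  rnorm(n_rows, 1, 0.2) -
  2 * rnorm(n_rows, as.numeric(mapping_demo$Condition), 0.2) +
  rnorm(n_rows, mapping_demo$Intensity, 0.4)

# same visuals (lines and points), different mapping for Condition
linetype_plot <-
  ggplot(mapping_demo, aes(x = Intensity, y = Response,
                           linetype = Condition, shape = Condition)) +
  geom_line() +
  geom_point()

linetype_plot
```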

Let us see how you can keep the relationship mapping but add more visuals. Let us add both lines and points.
```{r fig.align = 'center'}
ggplot(data=simple_tidy_data, aes(x = Intensity, y = Response, color=Condition)) + 
  geom_line() +
  geom_point() # this is new!
```

In the plot above, we _kept_ the relationship between variables and properties but said "Oh, and, please, throw in some points as well". And ggplot2 knows how to add the points so that they appear at the proper location and in the proper color. But we want more!
```{r fig.align = 'center'}
ggplot(data=simple_tidy_data, aes(x = Intensity, y = Response, color=Condition)) + 
  geom_line() +
  geom_point() +
  # a linear regression over all dots in the group
  geom_smooth(method="lm", formula = y ~ x, se=FALSE, linetype="dashed") 
```

Now we added a linear regression line that helps us to better see the relationship between `Intensity` and `Response`. Again, we simply wished for another visual to be added (`method="lm"` means that we wanted to summarize the data via linear regression, `formula = y ~ x` means that we regress the y-axis variable on the x-axis variable with no further covariates, `se=FALSE` means no standard error stripe, and `linetype="dashed"` just makes it easier to distinguish the linear fit from the solid data line).

Or, we can keep the _visuals_ but see whether changing _mapping_ would make it more informative (we need to specify `group=Intensity` as continuous data is not grouped automatically).
```{r fig.align = 'center'}
ggplot(data=simple_tidy_data, aes(x = Condition, y = Response, color=Intensity, group=Intensity)) + 
  geom_line() +
  geom_point() +
  geom_smooth(method="lm", se=FALSE,  formula = y ~ x, linetype="dashed")
```
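
The effect of the explicit `group` can be checked on a small made-up table (the names `group_demo`, `Time`, and `Dose` are hypothetical and used only here). The sketch builds the same line plot with and without `group` and asks `layer_data()` how many groups the line layer ended up with.

```{r fig.align = 'center'}
library(ggplot2)

# three dose levels measured at four time points
group_demo <- data.frame(Time = rep(1:4, times = 3),
                         Dose = rep(c(10, 20, 30), each = 4),
                         Response = c(1, 2, 3, 4,  2, 4, 6, 8,  3, 6, 9, 12))

# Dose is continuous, so it does not split the data into groups
ungrouped_plot <-
  ggplot(group_demo, aes(x = Time, y = Response, color = Dose)) +
  geom_line()

# the explicit group requests one line per dose level
grouped_plot <-
  ggplot(group_demo, aes(x = Time, y = Response, color = Dose, group = Dose)) +
  geom_line()

# number of distinct groups seen by the line layer in each plot
length(unique(layer_data(ungrouped_plot)$group))
length(unique(layer_data(grouped_plot)$group))

grouped_plot
```

Without `group`, all twelve observations should end up in a single group, so the line zigzags between dose levels at every time point; with it, each dose gets its own line.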

Or, we can check whether splitting into several plots helps.
```{r fig.align = 'center'}
ggplot(data=simple_tidy_data, aes(x = Intensity, y = Response, color=Condition)) + 
  geom_line() +
  geom_point() +
  geom_smooth(method="lm", formula = y ~ x, se=FALSE, linetype="dashed") +
  facet_grid(. ~ Condition)  + # makes a separate subplot for each group
  theme(legend.position = "none") # we don't need the legend as conditions are split between facets 
```

Again, note that all three plots live on the same scale for the x- and y-axis, making them easy to compare (you will fully appreciate this magic if you ever struggled with ensuring optimal and consistent scaling by hand in Matlab). I went through so many examples to stress how _ggplot2_ allows you to think about the aesthetics of variable mapping _independently_ of the actual visual representation (and vice versa).

Now let us explore _ggplot2_ by doing exercises. I recommend using the [ggplot2 reference page](https://ggplot2.tidyverse.org/) and the [cheatsheet](https://rstudio.github.io/cheatsheets/html/data-visualization.html) when you are doing the exercises.

## Auto efficiency: continuous x-axis
We start by visualizing how car efficiency, measured as miles-per-gallon, is affected by various factors such as production year, size of the engine, type of transmission, etc. The data is in the table [mpg](https://ggplot2.tidyverse.org/reference/mpg.html), which is part of the _ggplot2_ package. Thus, you need to first [import](#library) the library and then load the table via [data()](#data) function. Take a look at the [table description](https://ggplot2.tidyverse.org/reference/mpg.html) to familiarize yourself with the variables.

First, let us look at the relationship between car efficiency in the city cycle (`cty`), engine displacement (`displ`), and drive train type (`drv`) using colored points. As a reminder, the call should look like this:
```
ggplot(data_table_name, aes(x = var1, y = var2, color = var3, shape = var4, ...)) + 
  geom_primitive1() + 
  geom_primitive2() +
  ...
```
Think about [aesthetics](https://ggplot2.tidyverse.org/reference/aes.html), i.e., which variables are mapped on each axis and which is best depicted as color.

::: {.practice}
Do exercise 1.
:::

Do you see any clear dependence? Let us try to make it more evident by adding the [geom_smooth](https://ggplot2.tidyverse.org/reference/geom_smooth.html) geometric primitive.

::: {.practice}
Do exercise 2.
:::

Both engine size (displacement) and drive train have a clear effect on car efficiency. Let us visualize the number of cylinders (`cyl`) as well. Include it by mapping it on the _size_ of the geometry.

::: {.practice}
Do exercise 3.
:::

Currently, we are mixing together cars produced at different times. Let us visually separate them by turning each year into a subplot via [facet_wrap](https://ggplot2.tidyverse.org/reference/facet_wrap.html) function.

::: {.practice}
Do exercise 4.
:::

The dependence you plotted does not look linear but instead saturates at a certain low level of efficiency. This sort of dependence can be easier to see on a logarithmic scale. See functions for different [scales](https://ggplot2.tidyverse.org/reference/scale_continuous.html) and use logarithmic scaling for the y-axis.

::: {.practice}
Do exercise 5.
:::

Note that by now we managed to include _five_ variables into our plots. We could continue by including transmission or fuel type but that would be pushing it, as too many variables can make a plot confusing and cryptic. Instead, let us make it prettier by using more meaningful axes labels (`xlab()`, `ylab()` functions) and adding a plot title ([labs](https://ggplot2.tidyverse.org/reference/labs.html)).

::: {.practice}
Do exercise 6.
:::

## Auto efficiency: discrete x-axis
The previous section used a continuous engine displacement variable for the x-axis (at least that is my assumption on how you mapped the variables). Frequently, you need to plot data for discrete groups: experimental groups, conditions, treatments, etc. Let us practice on the same [mpg](https://ggplot2.tidyverse.org/reference/mpg.html) data set but visualize the relationship between the drive train (`drv`) and highway cycle efficiency (`hwy`). Start by using points as visuals.

::: {.practice}
Do exercise 7.
:::

One problem with the plot is that all points for a given drive train are plotted at the same x-axis location. This means that if two points also share the same y-value, they overlap and appear as just one dot. This makes it hard to understand the density: one dot can mean one point, or two, or a hundred. A better way to plot such data is by using [box](https://ggplot2.tidyverse.org/reference/geom_boxplot.html) or [violin](https://ggplot2.tidyverse.org/reference/geom_violin.html) plots^[You can also use an excellent package called [ggbeeswarm](https://github.com/eclarke/ggbeeswarm) that distributes dots to show the distribution.]. Experiment by using them instead of points.

::: {.practice}
Do exercise 8.
:::

Again, let's up the ante and split plots via both number of cylinders and year of manufacturing. Use the [facet_grid](https://ggplot2.tidyverse.org/reference/facet_grid.html) function to generate a grid of plots.

::: {.practice}
Do exercise 9.
:::

Let us again improve our presentation by using better axes labels and figure title.

::: {.practice}
Do exercise 10.
:::


## Mammals sleep: single variable
Now let us work on plotting a distribution for a single variable using the [mammals sleep dataset](https://ggplot2.tidyverse.org/reference/msleep.html). For this, you need to map the `sleep_total` variable on the x-axis and plot a [histogram](https://ggplot2.tidyverse.org/reference/geom_histogram.html). Explore the available options, in particular `bins`, which determines the number of bins and, therefore, their width. Note that there is no "correct" number of bins to use. _ggplot2_ defaults to 30 but a small sample would probably be better served with fewer bins and, vice versa, with a large data set you can afford hundreds of bins.

::: {.practice}
Do exercise 11.
:::

Using a histogram gives you exact counts for each bin. However, the appearance may change quite dramatically if you use fewer or more bins. An alternative way to represent the same information is via [smoothed density estimates](https://ggplot2.tidyverse.org/reference/geom_density.html). They use a sliding window and compute an estimate at each point but also include points _around_ it and weight them according to a kernel (a Gaussian one by default). This makes the plot look smoother and will mask sudden jumps in density (counts) as you, effectively, average over many bins. Whether this approach is better for visualizing data depends on the sample you have and the message you are trying to get across. It is always worth checking both (just like it is worth checking different numbers of bins in a histogram) to see which way is the best for your specific case.
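
To see what "weighting points according to a kernel" means in numbers, here is a minimal sketch that computes a Gaussian kernel density estimate at a single point by hand and compares it with base R's `density()` function. The sample, the bandwidth, and all variable names are made up for this illustration.

```{r}
# hypothetical sample, e.g., hours of sleep for five animals
sleep_hours <- c(2, 3, 5, 8, 9)
bandwidth <- 1        # width of the Gaussian kernel, picked by hand
estimate_at <- 5      # the point at which we want the density estimate

# every observation contributes, but the further away it is, the smaller its weight
kernel_weights <- dnorm((estimate_at - sleep_hours) / bandwidth)
by_hand <- sum(kernel_weights) / (length(sleep_hours) * bandwidth)

# density() evaluates the estimate on a grid, so we interpolate to our point
grid_estimate <- density(sleep_hours, bw = bandwidth, kernel = "gaussian")
from_density <- approx(grid_estimate$x, grid_estimate$y, xout = estimate_at)$y

c(by_hand = by_hand, from_density = from_density)
```

The two numbers should agree closely but not exactly, because `density()` works on a grid and relies on numerical approximations. A larger bandwidth spreads the weights over more distant observations, which is what makes the curve smoother.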

::: {.practice}
Do exercise 12.
:::

Let us return to using histograms and plot a distribution per `vore` variable (its values are `carni`, `herbi`, `insecti`, `omni`, or `NA`). You can map it on the `fill` color of the histogram, so that each _vore_ kind will be binned separately.

::: {.practice}
Do exercise 13.
:::

The plot may look confusing because by default _ggplot2_ colors values for each group differently but stacks all of them together to produce the total histogram counts^[Perhaps it is me, but I find this more confusing than helpful.]. One way to disentangle the individual histograms is via [facet_grid](https://ggplot2.tidyverse.org/reference/facet_grid.html) function. Use it to plot `vore` distribution in separate rows.

::: {.practice}
Do exercise 14.
:::

That did the trick but there is an alternative way to plot individual distributions on the same plot by setting the `position` argument of `geom_histogram()` to `"identity"` (it is `"stack"` by default).

::: {.practice}
Do exercise 15.
:::

Hmm, shouldn't we have more carnivores, what is going on? Opacity is the answer. A bar "in front" occludes any bars that are "behind" it. Go back to the exercise and fix that by specifying `alpha` argument that controls transparency. It is `1` (completely opaque) by default and can go down to `0` (fully transparent as in "invisible"), so see which intermediate value works the best.

## Mapping for all visuals versus just one visual
In the previous exercise, you assigned a constant value to the `alpha` (transparency) argument. You could do this in _two_ places: inside either the `ggplot()` or the `geom_histogram()` call. In the former case, you set the `alpha` level for _all_ geometric primitives on the plot (for this, it must go inside `aes()`, because layers inherit only the aesthetics defined there), whereas in the latter you set it only for the histogram. To better see the difference, reuse your code for plotting city cycle efficiency versus engine size (should be exercise #6) and set `alpha` either for all visuals (in the `ggplot()` call) or for some visuals (e.g., only for points).

::: {.practice}
Do exercise 16.
:::

## Mapping on variables versus constants
In the previous exercise, you assigned a constant value to the `alpha` (transparency) argument. However, transparency is a property just like `x`, `color`, or `size`. Thus, there are _two_ ways you can use them:

* _inside_ `aes(x=column)`, where `column` is a column in the table you supplied via `data=`,
* _outside_ of `aes` by stating `x=value`, where `value` is some constant value or a variable _that is not in the table_.

Test this by setting the `size` in the previous plot to a constant outside of aesthetics or to a variable inside of it.
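
Before you try it with `size`, here is the same idea sketched for `color` on a different table, the `iris` data set that ships with R (the plot names are made up for this illustration).

```{r fig.align = 'center'}
library(ggplot2)

# inside aes(): color is mapped on the Species column,
# so each species gets its own color and the plot gets a legend
mapped_plot <-
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species))

# outside aes(): color is a constant, so every point is red and there is no legend
constant_plot <-
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "red")

mapped_plot
constant_plot
```

A common slip is to write `aes(color = "red")`: this treats `"red"` as data with a single value rather than as a color name, so the points get the first default color of the scale and a legend with one entry labeled "red".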

::: {.practice}
Do exercise 17.
:::

## Themes

Finally, if you are not a fan of the way the plots look, you can quickly modify this by using some other theme. You can define it yourself (there are lots of options you can specify for your [theme](https://ggplot2.tidyverse.org/reference/theme.html)) or can use one of the [ready-mades](https://ggplot2.tidyverse.org/reference/ggtheme.html). Explore the latter option, find the one you like the best.

::: {.practice}
Do exercise 18.
:::

## You ain't seen nothing yet
What you explored is just the tip of the iceberg. There are many more geometric primitives, annotations, scales, themes, etc. It would take an entire separate seminar to do _ggplot2_ justice. However, the basics will get you started and you can always consult the reference, books (see below), or me once you need more.

## Further reading
If plotting data is part of your daily routine, I recommend reading the [ggplot2 book](https://ggplot2-book.org/). It gives you an in-depth view of the package and goes through many possibilities that it offers. You may not need all of them but I find it useful to know that they exist (who knows, I might need them one day). Another book worth reading is [Data Visualization: A Practical Introduction](https://kieranhealy.org/publications/dataviz/) by Kieran Healy.

## Extending ggplot2
There are 128 (as of 05.10.2023) extensions that you can find at the [ggplot2 website](https://exts.ggplot2.tidyverse.org/gallery/). They add more ways to plot your data, more themes, animated plots, etc. If you feel that _ggplot2_ does not have the geometric primitives you need, take a look at the gallery and, most likely, you will find something that fits the bill.

One package that I use particularly often is [patchwork](https://patchwork.data-imaginist.com/). It was created "to make it ridiculously simple to combine separate ggplots into the same graphic". It is a bold promise but the authors do make good on it. It is probably the easiest way to combine multiple plots but you can also consider the [cowplot](https://github.com/wilkelab/cowplot) and [gridExtra](https://cran.r-project.org/web/packages/gridExtra/index.html) packages.
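
A minimal sketch of what combining plots with patchwork looks like, assuming the package is installed (the two plots and their names are made up for this illustration and use the `iris` data set that ships with R):

```{r fig.align = 'center'}
library(ggplot2)
library(patchwork)

scatter_plot <-
  ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

histogram_plot <-
  ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20)

# `|` places the plots side by side, `/` stacks them on top of each other
side_by_side <- scatter_plot | histogram_plot
stacked <- scatter_plot / histogram_plot

side_by_side
stacked
```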

## ggplot2 cannot do everything
There are many different plotting routines and packages for R but I would recommend using _ggplot2_ as your main tool. However, that does not mean that it must be your only tool; after all, CRAN is brimming with packages. In particular, _ggplot2_ is built for plotting data from a _single and tidy_ table, meaning it is less optimal for plotting data in other cases. E.g., you can use it to combine information from several tables in one plot but things become less automatic and consistent. Similarly, you can plot data which is stored in non-tidy tables or even in individual vectors but that makes it less intuitive and more convoluted. No package can do everything and _ggplot2_ is no exception.
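
To illustrate what combining several tables in one plot looks like, here is a small sketch with two made-up tables, one with individual responses and one with per-condition means (all names and values are hypothetical). The second layer gets its own `data=` and its own mapping for the y-axis.

```{r fig.align = 'center'}
library(ggplot2)

# individual responses
responses_demo <- data.frame(Condition = rep(c("A", "B"), each = 3),
                             Response = c(1, 2, 3, 4, 5, 6))

# per-condition means, stored in a separate table
means_demo <- data.frame(Condition = c("A", "B"),
                         MeanResponse = c(2, 5))

two_tables_plot <-
  ggplot(responses_demo, aes(x = Condition, y = Response)) +
  geom_point() +
  # this layer uses its own table and overrides the mapping for y;
  # the mapping for x is inherited, so the table needs a Condition column
  geom_point(data = means_demo, aes(y = MeanResponse), color = "red", size = 4)

two_tables_plot
```

It works, but you now have to make sure yourself that the column names and the meaning of the values match across the tables, which is the kind of bookkeeping a single tidy table spares you.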