bm432-datavis.Rmd

---
title: "BM432 Data Visualisation"
author: "Morgan Feeney and Leighton Pritchard"
date: "2023-10-18"
output:
  ioslides_presentation:
    css: ./includes/custom.css
    font-family: 'Helvetica'
    widescreen: True
    self_contained: True
    logo: images/science_logo_400x400.jpg
    df_print: paged
editor_options: 
  chunk_output_type: console
---

<style>
.forceBreak { -webkit-column-break-after: always; break-after: column; }
</style>

```{r setup, include=FALSE}
# Package imports
library(dplyr)
library(ggplot2)
library(ggthemes)
library(googlesheets4)
library(knitr)
library(stringr)
library(tidyr)
library(tm)
library(wordcloud)
library(wordcloud2)

# Authenticate with Google Forms
gs4_deauth()

# Grab data and clean it
dfm = read_sheet("https://docs.google.com/spreadsheets/d/1y8SM2IdBDEbRky23s3FQUfkMvxCBK0opEICz00ojclo/edit?resourcekey#gid=1938337482")
colnames(dfm) = c("timestamp", "figure", "effectiveness", "understandable",
                  "appealing",
                  "bestpractice", "bestpractice_errors", "bestpractice_unsure",
                  "effectiveness_comment", "improvements",
                  "colours", "fonts", "labels", "statistics", "whitespace",
                  "data", "legend", "title", "reproduce_paper", "reproduce_raw",
                  "describe", "figure_comments", "time_taken", "numfigures",
                  "prereading", "other_comments")

# Ignore responses prior to 11/10/2023
dfm = dfm %>% filter(timestamp > "2023-10-11")

# Convert figure choice to integer
dfm = dfm %>% separate(figure, c("fignum"), remove=FALSE)

# Make columns the appropriate datatype
factor_headers = c("figure", "fignum", "bestpractice", "time_taken")
for (colname in factor_headers) {
  dfm[[colname]] = as.factor(dfm[[colname]])
}
dfm$fignum = factor(dfm$fignum, levels=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
dfm$time_taken = factor(dfm$time_taken, levels=c("Less than 5 minutes","About 5 or 10 minutes","About 15 minutes","About 30 minutes","More than 30 minutes"))

# Function to generate Economist-style figures from categories
draw_economist_categories = function(dfm, categories, fig_num) {
  ratings = dfm %>%
    filter(fignum == fig_num) %>%
    pivot_longer(cols=all_of(categories),
                 names_to = "category",
                 values_to = "rating")
  ratings$rating = factor(ratings$rating, levels=c("1", "2", "3", "4", "5"))
  p = ggplot(ratings, aes(x=rating))
  p = p + geom_bar() +
    stat_count(geom="text", color="white", aes(label=..count..), vjust=2)
  p = p + facet_wrap(~category)
  p = p + theme_economist() + scale_x_discrete(drop=FALSE)
  p
}

# Function to generate Tufte-style figures from categories
draw_tufte_categories = function(dfm, categories, fig_num) {
  ratings = dfm %>%
    filter(fignum == fig_num) %>%
    pivot_longer(cols=all_of(categories),
                 names_to = "category",
                 values_to = "rating")
  p = ggplot(ratings, aes(x=category, y=rating))
  p = p + geom_boxplot(outlier.colour="red") + geom_point(position=position_jitter(height=0.01, width=0.1), alpha=0.35)
  p = p + theme_tufte() + ylim(1, 5) + coord_flip()
  p
}

# Function to generate hc-style figures from categories
draw_hc_categories = function(dfm, categories) {
  ratings = dfm %>%
    pivot_longer(cols=all_of(categories),
                 names_to = "category",
                 values_to = "rating")
  p = ggplot(ratings, aes(x=fignum, y=rating))
  p = p + geom_boxplot(outlier.colour="red") + geom_point(position=position_jitter(height=0.01, width=0.1), alpha=0.35)
  p = p + facet_wrap(~category, ncol=1)
  p = p + theme_hc() + scale_x_discrete(drop=FALSE)
  p
}

# Process character column into a wordcloud
process_words = function(dfrm, column) {
  corpus = Corpus(VectorSource(dfrm[column]))
  comments = corpus %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("english"))
  wordmat <- as.matrix(TermDocumentMatrix(comments))
  words <- sort(rowSums(wordmat), decreasing=TRUE)
  dfm_words <- data.frame(word = names(words), freq=words)
  dfm_words
}

# Code block options
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
```

# 1. Introduction

## Learning Objectives

- You should be able to critically analyse how data is visualised 
- You should be able to judge a figure's clarity and potential for misunderstanding
- You should be able to identify potential sources of bias
- You should understand how to create effective figures for your own work

## Background Reading

- Introduction to Data Visualisation
  - [https://sipbs-compbiol.github.io/BM432/notebooks/04-data_presentation_workshop.html](https://sipbs-compbiol.github.io/BM432/notebooks/04-data_presentation_workshop.html)
  
- Interactive Data Visualisation examples
  - [https://sipbs-bm432.shinyapps.io/03-04a-barchart/](https://sipbs-bm432.shinyapps.io/03-04a-barchart/)
  
- Additional information:
  - ["Fundamentals of Data Visualisation" _by Claus O. Wilke_](https://clauswilke.com/dataviz/)
  - ["Data Visualisation: A Practical Introduction" _by Kieran Healy_](https://socviz.co/index.html#preface)
  
## Exercise: Ten figures (three per student)

- **For your assigned figure, consider the following:**
  - What type of data is being presented?
  - Are the data presented effectively? (why/why not?)
  - How can the data presentation be improved?
  - Use the DOI provided to find the paper the figure is from, if you need more information than the figure legend
- Fill in the _pro forma_ with your answers to the questions above (one sentence each)
  - [Google Form](https://docs.google.com/forms/d/e/1FAIpQLScECon-B5N-JkK2PIB4YiEyy8k8iR9srls9UBFYq2NWqEB3Rw/viewform)
  
# 2. Summary Results
  
## Responses by figure

- We received `r nrow(dfm)` ratings in total (at three figures per student, this is `r round(nrow(dfm)/3, 1)` students responding)

<center>
```{r overall_response}
figcounts = dfm %>% group_by(fignum) %>% summarize(count = n())
p = ggplot(figcounts, aes(y=fignum, x=count))
p + geom_point(size=8) + xlim(0, max(figcounts$count)) + theme_economist()
```
</center>

## Overall effectiveness

- How was effectiveness scored, distributed across all figures?

<center>
```{r overall_effectiveness}
p = ggplot(dfm, aes(x=effectiveness))
p = p + geom_histogram(binwidth=1) +
  stat_bin(binwidth=1, geom="text", color="white", aes(label=..count..), vjust=2)
p + theme_economist() + xlim(0.5, 5.5)
```
</center>

## Overall understandability

- How was understandability scored, distributed across all figures?

<center>
```{r overall_understanding}
p = ggplot(dfm, aes(x=understandable))
p = p + geom_histogram(binwidth=1) +
  stat_bin(binwidth=1, geom="text", color="white", aes(label=..count..), vjust=2)
p + theme_economist() + xlim(0.5, 5.5)
```
</center>

## Overall appeal

- How was appeal scored, distributed across all figures?

<center>
```{r overall_appeal}
p = ggplot(dfm, aes(x=appealing))
p = p + geom_histogram(binwidth=1) +
  stat_bin(binwidth=1, geom="text", color="white", aes(label=..count..), vjust=2)
p + theme_economist()  + xlim(0.5, 5.5)
```
</center>

## Time taken per figure

- How long did you take, per figure?

<center>
```{r overall_time_taken}
p = ggplot(dfm %>% drop_na(time_taken), aes(x=time_taken))
p = p + geom_bar() + stat_count(geom="text", aes(label=..count..), vjust=1.5, colour="white")
p + theme_economist() + theme(axis.text.x = element_text(angle = 90)) + scale_x_discrete(drop=FALSE)
```
</center>

# 3. Results By Figure

## Effectiveness/Understandability/Appeal

- How effective/understandable/appealing did you think each figure was?

<center>
```{r understandable_by_fig}
draw_hc_categories(dfm, c("effectiveness", "understandable", "appealing"))
```
</center>

## Colours/Fonts/Labels

- How well did each figure use colours, fonts, and labels?

<center>
```{r presentation_by_fig}
draw_hc_categories(dfm, c("colours", "fonts", "labels"))
```
</center>

## Statistics/Whitespace/Data

- How well did each figure use statistics, whitespace, and data?

<center>
```{r statistics_by_fig}
draw_hc_categories(dfm, c("statistics", "whitespace", "data"))
```
</center>

## Reproduction

- How well did you think you could reproduce each figure?

<center>
```{r reproduction_by_fig}
draw_hc_categories(dfm, c("describe", "reproduce_paper", "reproduce_raw"))
```
</center>

# 4. Specific Figures

## Figure 1 (doi:10.1016/j.cell.2021.08.029)

<center>
<img src="images/figure_1.jpg" height="500px" />
</center>

## Figure 1 (doi:10.1016/j.cell.2021.08.029) <img src="images/figure_1.jpg" height="100px" />

<center>
```{r fig1_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 1)
```
</center>

## Figure 1 (doi:10.1016/j.cell.2021.08.029) <img src="images/figure_1.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_1.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Increase contrast between font and background (black on dark blue, red/gray on white) $\rightarrow$ easier to read
  - 1D is not colour-blind friendly
  - Scales on A and B are very different, which could mislead the reader
  - Numbers on phylogenetic trees (A&B) are not explained in fig legend (bootstraps? but this could be made clearer)
  - 1D boxplot or similar for G & H instead of bar charts
  - Use of whitespace could be better: this feels quite cluttered and the flow from A $\rightarrow$ B $\rightarrow$ C $\rightarrow$ D is not very clear


## Figure 1 (doi:10.1016/j.cell.2021.08.029) <img src="images/figure_1.jpg" height="100px" /> {.smaller}

```{r fig1_improvements}
kable(dfm %>% filter(fignum==1) %>% select(improvements))
```


## Figure 2 (doi:10.1016/j.cell.2021.08.028)

<center>
<img src="images/figure_2.jpg" height="500px" />
</center>

## Figure 2 (doi:10.1016/j.cell.2021.08.028) <img src="images/figure_2.jpg" height="100px" />

<center>
```{r fig2_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 2)
```
</center>

## Figure 2 (doi:10.1016/j.cell.2021.08.028) <img src="images/figure_2.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_2.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Images missing scale bars/scale bars hard to see
  - Good to show actual data instead of barcharts
  - Red/green fluorescence, and rainbow colour scale hard to read/interpret / not particularly accessible for colourblind readers
  - Data could perhaps be simplified/summarized, e.g. with E or F moved to supplemental 
  - y-axis scales a little misleading
  - Use of whitespace could be better


## Figure 2 (doi:10.1016/j.cell.2021.08.028) <img src="images/figure_2.jpg" height="100px" /> {.smaller}

```{r fig2_improvements}
kable(dfm %>% filter(fignum==2) %>% select(improvements))
```


## Figure 3 (doi:10.1016/j.cell.2021.07.030)

<center>
<img src="images/figure_3.jpg" height="500px" />
</center>

## Figure 3 (doi:10.1016/j.cell.2021.07.030) <img src="images/figure_3.jpg" height="100px" />

<center>
```{r fig3_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 3)
```
</center>

## Figure 3 (doi:10.1016/j.cell.2021.07.030) <img src="images/figure_3.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_3.jpg" width="100%" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Inconsistent y-axis scales make it difficult to compare panels in part A
  - Why have mirrored y-axes at all?
  - Map and phylogenetic trees are missing scales
  - Colour choices between AB and CDE can be a little confusing
  - Text, particularly in panel E, too small to read clearly
  - Reasonably good use of whitespace


## Figure 3 (doi:10.1016/j.cell.2021.07.030) <img src="images/figure_3.jpg" height="100px" /> {.smaller}

```{r fig3_improvements}
kable(dfm %>% filter(fignum==3) %>% select(improvements))
```


## Figure 4 (doi:10.1016/j.cell.2021.08.016)

<center>
<img src="images/figure_4.jpg" height="500px" />
</center>

## Figure 4 (doi:10.1016/j.cell.2021.08.016) <img src="images/figure_4.jpg" height="100px" />

<center>
```{r fig4_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 4)
```
</center>

## Figure 4 (doi:10.1016/j.cell.2021.08.016) <img src="images/figure_4.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_4.jpg" width="100%" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Repetitive parts of y-axes could be simplified (all are normalized to GAPDH)
  - y-axes need to have the same scale
  - would you fit a straight line to that?
  - Good colour choices
  - Layout could have been improved – should be 3 rows with 1 column for each IFN, to make it easier for the reader to compare e.g. A, G, and M


## Figure 4 (doi:10.1016/j.cell.2021.08.016) <img src="images/figure_4.jpg" height="100px" /> {.smaller}

```{r fig4_improvements}
kable(dfm %>% filter(fignum==4) %>% select(improvements))
```


## Figure 5 (doi:10.1016/j.cell.2021.07.029)

<center>
<img src="images/figure_5.jpg" height="500px" />
</center>

## Figure 5 (doi:10.1016/j.cell.2021.07.029) <img src="images/figure_5.jpg" height="100px" />

<center>
```{r fig5_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 5)
```
</center>

## Figure 5 (doi:10.1016/j.cell.2021.07.029) <img src="images/figure_5.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_5.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - A flow chart illustrating the FACS technique chosen here would be helpful for the reader
  - Different scales on the x and y axes may be confusing and potentially misleading to readers
  - B-D, F – show the datapoints as 1D scatterplot?
  - G – fluorescence micrograph not colourblind friendly
  - Use of the same colour for tumor, adenoma, IBD lesion may potentially be misleading


## Figure 5 (doi:10.1016/j.cell.2021.07.029) <img src="images/figure_5.jpg" height="100px" /> {.smaller}

```{r fig5_improvements}
kable(dfm %>% filter(fignum==5) %>% select(improvements))
```


## Figure 6 (doi:10.1016/j.cell.2021.07.022)

<center>
<img src="images/figure_6.jpg" height="500px" />
</center>

## Figure 6 (doi:10.1016/j.cell.2021.07.022) <img src="images/figure_6.jpg" height="100px" />

<center>
```{r fig6_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 6)
```
</center>

## Figure 6 (doi:10.1016/j.cell.2021.07.022) <img src="images/figure_6.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_6.jpg" width="100%" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Good example of flow chart to illustrate experimental design
  - D-G could perhaps be combined into one graph – reduce cognitive burden on the reader
  - Some confusion in the colour scheme (P11 lysate/F2 lysate both indicated in green, inconsistent use of red, orange)
  - Nice to see box plots overlaid with datapoints


## Figure 6 (doi:10.1016/j.cell.2021.07.022) <img src="images/figure_6.jpg" height="100px" /> {.smaller}

```{r fig6_improvements}
kable(dfm %>% filter(fignum==6) %>% select(improvements))
```


## Figure 7 (doi:10.1016/j.cell.2021.07.023)

<center>
<img src="images/figure_7.jpg" height="500px" />
</center>

## Figure 7 (doi:10.1016/j.cell.2021.07.023) <img src="images/figure_7.jpg" height="100px" />

<center>
```{r fig7_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 7)
```
</center>

## Figure 7 (doi:10.1016/j.cell.2021.07.023) <img src="images/figure_7.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_7.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - Good use of flow diagram to illustrate experimental procedure
  - Text too small to read easily
  - Colours rather bewildering and hard to interpret, especially in G
  - heatmap in I difficult to accurately interpret
  - use of whitespace could be better
  - axis labels in B and F could be clearer


## Figure 7 (doi:10.1016/j.cell.2021.07.023) <img src="images/figure_7.jpg" height="100px" /> {.smaller}

```{r fig7_improvements}
kable(dfm %>% filter(fignum==7) %>% select(improvements))
```


## Figure 8 (doi:10.1016/j.cell.2021.08.003)

<center>
<img src="images/figure_8.jpg" height="500px" />
</center>

## Figure 8 (doi:10.1016/j.cell.2021.08.003) <img src="images/figure_8.jpg" height="100px" />

<center>
```{r fig8_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 8)
```
</center>

## Figure 8 (doi:10.1016/j.cell.2021.08.003) <img src="images/figure_8.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_8.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - A colour scale may be difficult to interpret – some shades seem very close, size of dots hard to classify
  - good to show datapoints
  - y axes in C should be same scale
  - t-SNEs difficult to read as presented
  - red/green colour palette not colourblind friendly
  - inconsistent use of colours between panels
  - scale bars missing for RNA ISH (H, right panels)


## Figure 8 (doi:10.1016/j.cell.2021.08.003) <img src="images/figure_8.jpg" height="100px" /> {.smaller}

```{r fig8_improvements}
kable(dfm %>% filter(fignum==8) %>% select(improvements))
```


## Figure 9 (doi:10.1016/j.cell.2021.07.024)

<center>
<img src="images/figure_9.jpg" height="500px" />
</center>

## Figure 9 (doi:10.1016/j.cell.2021.07.024) <img src="images/figure_9.jpg" height="100px" />

<center>
```{r fig9_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 9)
```
</center>

## Figure 9 (doi:10.1016/j.cell.2021.07.024) <img src="images/figure_9.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_9.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - colours in B are dark/hard to distinguish
  - different axes in D and E make it difficult to compare data
  - circular presentation of chromosomes is confusing/misleading
  - reddish/green colour choices might be OK but there are better options
  - A doesn’t add much value to the figure
  - may be better ways to present B – lines through datapoints?


## Figure 9 (doi:10.1016/j.cell.2021.07.024) <img src="images/figure_9.jpg" height="100px" /> {.smaller}

```{r fig9_improvements}
kable(dfm %>% filter(fignum==9) %>% select(improvements))
```


## Figure 10 (doi:10.1016/j.cell.2021.07.006)

<center>
<img src="images/figure_10.jpg" height="500px" />
</center>

## Figure 10 (doi:10.1016/j.cell.2021.07.006) <img src="images/figure_10.jpg" height="100px" />

<center>
```{r fig10_main_summary}
draw_tufte_categories(dfm, c("effectiveness", "understandable", "appealing", "colours", "fonts", "labels", "statistics", "whitespace", "data", "legend", "title", "reproduce_paper", "reproduce_raw", "describe"), 10)
```
</center>

## Figure 10 (doi:10.1016/j.cell.2021.07.006) <img src="images/figure_10.jpg" height="100px" /> {.columns-2 .smaller}

<center>
<img src="images/figure_10.jpg" height="450px" />
</center>

<p class="forceBreak"></p>

- Suggested improvements:
  - kinetic data better presented in a table
  - concentration of what?
  - include keys with figures, don’t make the reader look back and forth with the legend
  - Coloured line in B not explained in key or figure legend
  - match y-axes scales in B
  - confusing/unnecessary to split H into two graphs
  - include full colour key in G (missing purple), H (orange)
  - Text could be a little larger to make it easier to read
  - Good flow from A-H, good use of whitespace


## Figure 10 (doi:10.1016/j.cell.2021.07.006) <img src="images/figure_10.jpg" height="100px" /> {.smaller}

```{r fig10_improvements}
kable(dfm %>% filter(fignum==10) %>% select(improvements))
```


# 5. Summing Up

## General Comments

- Colour choices
- Larger figures/graphs, more space between figures/graphs
- Too much data per figure
- Split into multiple figures
- Remove unnecessary data (how do we define this?)
- “The data is presented in a manner that would likely be inaccessible for people without prior experience. A move toward a more palatable/digestible format will facilitate better science communication in the future.”

## Visualising Data About Data Visualisation

- What did you say about figure effectiveness?

<center>
```{r eff_wordcloud}
effwords = process_words(dfm, "effectiveness_comment")
wordcloud2(data=effwords, size=1.6, color='random-dark')
```
</center>

## Visualising Data About Data Visualisation

- What words did you use to describe figure improvements?

<center>
```{r imp_wordcloud}
impwords = process_words(dfm, "improvements")
wordcloud(words = impwords$word, freq = impwords$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35,            colors=brewer.pal(8, "Dark2"))
```
</center>

## Data Visualisation is Not Neutral  {.columns-2}

<center>
<img src="images/bojo1.jpg" width="110%" />
</center>

<center>
<img src="images/bojo2.jpg" width="110%" />
</center>