---
title: "Answers to questions asked in the tutorial"
author: "Matthias Grenié & Marten Winter"
date: "Monday July 5th 2021"
output:
  html_document:
    toc: yes
editor_options:
  chunk_output_type: console
---

Here are the answers to all the questions asked in the tutorial.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```

# Loading the data (**Q1**-**Q6**)

## Summarizing the data

* **Q1**: How many plots were sampled?

```{r q1}
nrow(plot_data)
```

**Answer**: There are 180 sampled plots. 


* **Q2**: How many species are there in the dataset?

```{r q2}
nrow(species_traits)
```

**Answer**: There are 691 species in the dataset.


* **Q3**: How many traits are available?

```{r q3}
head(species_traits, 2)
```

**Answer**: There are 13 traits in the dataset.


* **Q4**: How many of them are continuous? How many of them are discrete?

**Answer**: Looking at the trait table, there are 3 continuous traits (`height`, `sla`, and `wood.dens`) and 10 discrete ones. (The README file confirms that the `seed` column does indeed contain a categorical trait.)
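
To check the trait types programmatically rather than by eye, one option is to test each column with `is.numeric()`. The sketch below uses a small made-up data frame (`toy_traits` is hypothetical, not the tutorial's `species_traits`). Note that a categorical trait stored as numbers would be counted as numeric by this check (the `seed` column appears to be such a case), so the README keeps the final say.

```r
# Toy trait table: hypothetical values, only to illustrate the approach
toy_traits = data.frame(
  height    = c(1.2, 0.4, 3.1),
  sla       = c(12.5, 30.1, 8.7),
  wood.dens = c(0.61, 0.35, 0.72),
  pgf       = factor(c("F", "C", "H")),
  woody     = factor(c("yes", "no", "yes"))
)

# TRUE for columns stored as numbers, FALSE otherwise
is_continuous = vapply(toy_traits, is.numeric, logical(1))

n_continuous = sum(is_continuous)   # columns stored as numbers
n_discrete   = sum(!is_continuous)  # factor or character columns
```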

* **Q5**: What is the most numerous family among all observed species?

```{r q5}
summary(species_traits)
```

**Answer**: The `summary()` function displays the number of observations in each category, ordered from the most abundant category down. The `family` column shows that "Leguminosae" (a.k.a. Fabaceae) is the most numerous family in the dataset.
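
Reading the answer off `summary()` relies on the column being stored as a factor. A more direct route, sketched here on a made-up vector (`toy_family` is hypothetical), is to tabulate the values and sort the counts:

```r
# Hypothetical family names for six species
toy_family = c("Leguminosae", "Rubiaceae", "Leguminosae",
               "Poaceae", "Leguminosae", "Rubiaceae")

# Count species per family, largest count first
family_counts = sort(table(toy_family), decreasing = TRUE)

# Name of the most numerous family
most_numerous = names(family_counts)[1]
```

The same two calls should work for the genus question, replacing the family vector with the genus one.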

* **Q6**: What is the most numerous genus?

**Answer**: Similarly, the most numerous genus is "Unknown", which corresponds to individuals that were not identified at the genus level.

# Functional diversity (**Q7**-**Q19**)

## Computing Biomass-weighted mean traits per plot (**Q7**-**Q10**)

* **Q7**: How would you describe the relationship between the different CWMs and forest loss?

**Answer**: We are only eyeballing the relationships, but it seems that:

* The CWM-Height seems to decrease with forest loss.
* The CWM-SLA seems to increase with forest loss.
* There seems to be no relationship between the CWM-wood density and forest loss.

* **Q8**: Can you test the correlation using the function `cor.test()` and does it support your previous statements?

```{r q8}
cor.test(cwm_env$forestloss17, cwm_env$height)
cor.test(cwm_env$forestloss17, cwm_env$sla)
cor.test(cwm_env$forestloss17, cwm_env$wood.dens)
```

**Answer**:

* We observe a negative correlation (-0.48) between CWM-Height and forest loss.
* We observe a positive correlation (0.48) between CWM-SLA and forest loss.
* We observe a slight negative correlation (-0.24) between CWM-wood density and forest loss. 

* **Q9**: How would you describe the understorey vegetation changes with increasing forest loss?

**Answer**: The CWM values reflect the dominant trait values of the community. This means that with increasing forest loss we observe shorter species, species with higher SLA, and species with lower wood density. Linking back to ecology, this is probably due to a change in understorey vegetation type, from woody species in unlogged areas to more herbaceous species in heavily logged areas.
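
As a reminder of what a biomass-weighted community mean is, here is a minimal sketch on made-up numbers. All object names below are hypothetical, and the tutorial's own code may compute the CWMs differently. Each species' trait value is weighted by its share of the plot's biomass.

```r
# Toy community table: one row per species per plot (hypothetical values)
toy_comm = data.frame(
  plot.code = c("p1", "p1", "p1", "p2", "p2"),
  species   = c("sp_a", "sp_b", "sp_c", "sp_a", "sp_c"),
  biomass   = c(10, 30, 60, 80, 20),
  height    = c(2.0, 1.0, 0.5, 2.0, 0.5)
)

# One CWM per plot: sum(biomass * trait) / sum(biomass)
toy_cwm = sapply(
  split(toy_comm, toy_comm$plot.code),
  function(d) weighted.mean(d$height, d$biomass)
)
```

In this toy example plot `p1` is dominated by the short species and gets a CWM of 0.8, while `p2` is dominated by the tall one and gets 1.7.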

* **Q10**: How does this observation compare to the above description of the change of understorey vegetation along the forest loss gradient?

```{r q10}
cor.test(non_quanti_cwm$forestloss17, non_quanti_cwm$woody_no)
```

**Answer**: The `woody_no` CWM represents the proportion of non-woody species in the community. We observe a medium positive correlation between this CWM and forest loss, which tells us that non-woody species become more frequent in the understorey vegetation as forest loss increases.

## Building the functional space (**Q11**-**Q12**)

* **Q11**: Using the metadata available in the `README.txt` file, what is the meaning of the `pgf` column?

**Answer**: The `README.txt` file (in the `data/doi_10.5061_dryad.f77p7__v1/` folder) describes the full dataset with the meaning of all columns. It says:

> pgf: Plant growth form: A = fern, B = graminoid, C = forb, D = herbaceous climber, E = herbaceous shrub, F = tree sapling, G = woody climber, H = woody shrub, na = indeterminate

So the `pgf` column describes the plant growth form.

* **Q12**: How do you interpret the PCoA results given your answer to the previous question?

**Answer**: The PCoA shows groups of species according to their plant growth form. In particular, category "F" (tree saplings) seems to separate from all other growth forms. This means that tree saplings have traits distinct from those of all other growth form groups.
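
For readers unfamiliar with PCoA: it takes a species-by-species dissimilarity matrix and places the species along a few axes that preserve those dissimilarities as well as possible. The sketch below uses base R `cmdscale()` on Euclidean distances between made-up continuous traits. This is an assumption made for illustration only: a trait table mixing continuous and categorical traits would normally call for a dissimilarity such as Gower's, and all `toy_*` objects are hypothetical.

```r
set.seed(1)

# Two made-up groups of species with contrasting traits
toy_traits = data.frame(
  height = c(rnorm(5, 5, 0.5), rnorm(5, 1, 0.5)),
  sla    = c(rnorm(5, 10, 1), rnorm(5, 25, 1)),
  row.names = paste0("sp", 1:10)
)
toy_pgf = rep(c("F", "C"), each = 5)

# Dissimilarities on standardized traits, then PCoA on two axes
toy_dist = dist(scale(toy_traits))
toy_pcoa = cmdscale(toy_dist, k = 2, eig = TRUE)

# Species coordinates, colored by growth form
plot(toy_pcoa$points,
     col = ifelse(toy_pgf == "F", "darkgreen", "orange"), pch = 19,
     xlab = "PCoA axis 1", ylab = "PCoA axis 2")
```

Groups that form separate clouds in such a plot have distinct trait combinations, which is the pattern described for the tree saplings.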


## Computing functional diversity indices (**Q13**-**Q15**)

* **Q13**: How would you describe the relationships between functional diversity and forest loss and road density?

**Answer**: None of the relationships is straightforward: the plots show no clear trend.

* **Q14**: Using the plot generated by the code beneath how could you describe the relationships between the three different functional diversity indices we computed?

**Answer**: This plot is a [pairs plot](https://www.statology.org/pairs-plots-r/): it shows the relationship between each pair of variables and displays the corresponding correlation coefficient in the upper panels. We observe no correlation between Rao's Quadratic Entropy and Functional Evenness. We observe a slight correlation between Functional Evenness and Functional Richness. And we observe a strong correlation between Functional Richness and Rao's Quadratic Entropy. This correlation suggests that the two indices co-vary even though they are supposed to assess independent dimensions of functional diversity. This may be due to a hidden relationship with species richness.
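
The `panel.cor` function passed to `pairs()` in the tutorial is not defined in this document. A minimal stand-in, adapted from the example in `?pairs`, could look like the sketch below. `panel_cor_sketch` is a hypothetical name, and the tutorial's version may format the coefficient differently. The built-in `iris` data is used only to have something to plot.

```r
# Panel function writing the correlation coefficient in the middle of a panel
panel_cor_sketch = function(x, y, digits = 2, ...) {
  # Temporarily switch the panel to a 0-1 coordinate system
  usr = par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))

  r = cor(x, y, use = "complete.obs")
  text(0.5, 0.5, format(r, digits = digits))
}

# Scatterplots below the diagonal, correlation coefficients above it
pairs(iris[, 1:4], upper.panel = panel_cor_sketch)
```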

* **Q15**: How does the relationship between indices with species richness compare with the one observed with total biomass values? (You can use the function `cor.test()` if you want to test the association)

```{r q15}
pairs(ntaxa ~ FRic + FEve + Q, data = site_rich_fd, upper.panel = panel.cor)
pairs(tot_biomass ~ FRic + FEve + Q, data = site_rich_fd,
      upper.panel = panel.cor)
cor.test(site_rich_fd$ntaxa, site_rich_fd$FRic)
cor.test(site_rich_fd$tot_biomass, site_rich_fd$FRic)

cor.test(site_rich_fd$ntaxa, site_rich_fd$Q)
cor.test(site_rich_fd$tot_biomass, site_rich_fd$Q)
```

**Answer**: We observe a strong correlation between species richness and functional richness, as well as a medium correlation between species richness and Rao's quadratic entropy. These correlations become weak and absent, respectively, when using total biomass.

## Null modeling (**Q16**-**Q19**)

* **Q16**: How would you describe verbally the position of the observed value of FRic for site "a100f177r" compared to the null distribution?

**Answer**: The observed value is on the right tail of the distribution, which means that it is slightly higher than the null expectation.

* **Q17**: What's the quantile of the observed FRic value in the end?

```{r q17}
# The observed value of FRic for the site
subset(site_fd, plot.code == "a100f177r")$FRic

# The null distribution of FRic for the same site
summary(subset(null_fd_999, site == "a100f177r")$FRic)
```

**Answer**: The observed FRic is close to the third quartile (75th percentile).
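
`summary()` only brackets the observed value between quartiles. To get its exact position, one option is to compute the share of null values that are lower than or equal to it. The sketch below uses simulated numbers: `toy_null` and `toy_obs` are hypothetical stand-ins for the site's null distribution and its observed FRic.

```r
set.seed(42)

# Hypothetical null distribution (999 values) and observed value
toy_null = rnorm(999, mean = 10, sd = 2)
toy_obs  = 11.3

# Proportion of null values lower than or equal to the observed one
obs_quantile = mean(toy_null <= toy_obs)

# Same quantity through the empirical cumulative distribution function
obs_quantile_ecdf = ecdf(toy_null)(toy_obs)
```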

* **Q18**: Using the `subset()` function with the greater-than-or-equal `>=` and lower-than-or-equal `<=` operators, can you determine how many sites show a significant deviation from the null expectation? (absolute SES >= 2)

```{r q18}
# First way of answering the question: two separate subset() calls
nrow(subset(ses_fd, ses_FRic >= 2))
nrow(subset(ses_fd, ses_FRic <= -2))

# Or with a single call selecting the rows by absolute SES values
nrow(subset(ses_fd, abs(ses_FRic) >= 2))
```

**Answer**: We observe 16 sites with a significant deviation from the null expectation.
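
For reference, an SES expresses the gap between the observed value and the mean of the null distribution in units of the null standard deviation. A minimal sketch on simulated values follows. All names are hypothetical, and the tutorial's `ses_fd` table was not necessarily built this way.

```r
set.seed(7)

# Rows = 5 toy sites, columns = 999 null values per site
toy_null = matrix(rnorm(5 * 999, mean = 10, sd = 2), nrow = 5)

# Hypothetical observed values for the same 5 sites
toy_obs = c(10.1, 15.2, 9.4, 4.5, 11.0)

# SES = (observed - mean of null) / standard deviation of null
toy_ses = (toy_obs - rowMeans(toy_null)) / apply(toy_null, 1, sd)

# Number of sites deviating from the null expectation (|SES| >= 2)
n_significant = sum(abs(toy_ses) >= 2)
```

Here the second site sits well above its null distribution and the fourth well below it, so two of the five toy sites are flagged.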

* **Q19**: Using similar code as used for observed values, what are the relationships between SES values and forest loss?

```{r q19}
par(mfrow = c(1, 1))

site_env_ses_fd = merge(ses_fd, plot_data[, c("plot.code", "forestloss17")],
                        by = "plot.code")

plot(site_env_ses_fd$forestloss17, site_env_ses_fd$ses_FRic,
     xlab = "Forest loss (%)", ylab = "SES of Functional Richness (FRic)",
     main = "SES Functional Richness vs. forest loss")
cor.test(site_env_ses_fd$forestloss17, site_env_ses_fd$ses_FRic)
cor.test(site_env_ses_fd$forestloss17, site_env_ses_fd$ses_Q)
cor.test(site_env_ses_fd$forestloss17, site_env_ses_fd$ses_FEve)
```

**Answer**: The relationship is not obvious from the plot, but we observe a medium negative correlation between the SES of functional richness and forest loss. This means that logging activity may decrease the functional richness of understorey vegetation. We observe a similar trend with the SES of Rao's quadratic entropy. There is no trend with the SES of functional evenness.

# Phylogenetic diversity (**Q20**-**Q27**)

## Getting the phylogenetic tree (**Q20**-**Q21**)

* **Q20**: How many taxa are in the phylogenetic tree?

```{r q20}
phylo_tree
```

**Answer**: There are 611 tips in the phylogenetic tree, so it contains 611 taxa.

* **Q21**: How does this number compare to the number of taxa found in the dataset?

**Answer**: There are 691 taxa in the dataset so 80 taxa are missing from the phylogenetic tree.
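
To see which taxa are missing, and not only how many, the two sets of names can be compared with `setdiff()`. The sketch below uses made-up character vectors (both `toy_*` vectors are hypothetical). With an `ape` `phylo` object the tip names would be in its `tip.label` element, provided the names are formatted the same way on both sides.

```r
# Hypothetical taxon names in the dataset and at the tips of the tree
toy_dataset_taxa = c("Acacia_sp", "Ficus_sp", "Piper_sp", "Unknown_1")
toy_tree_tips    = c("Acacia_sp", "Ficus_sp", "Piper_sp")

# Taxa present in the dataset but absent from the tree
missing_from_tree = setdiff(toy_dataset_taxa, toy_tree_tips)

length(missing_from_tree)
```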

## Computing phylogenetic diversity indices (**Q22**-**Q24**)

* **Q22**: What do you notice with the species names? Especially compared to the ones available in `species_traits`.

**Answer**: The tip labels in the phylogenetic tree only contain the species epithet (second part of the [binomial name](https://en.wikipedia.org/wiki/Binomial_nomenclature)), which corresponds to the `species` column in the `species_traits` data.frame. This may be a source of confusion in the phylogenetic tree and may explain why it contains fewer taxa than the dataset.

* **Q23**: What is the relationship between the weighted and the unweighted version of the MPD?

```{r q23}
cor.test(obs_mpd$mpd_unweighted, obs_mpd$mpd_weighted)
```

**Answer**: There is a positive relationship between both.
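
To make the difference between the two versions concrete, here is a minimal sketch on a made-up distance matrix. The unweighted MPD is the plain mean of the pairwise distances between the species present. The weighted version shown here weights each pair by the product of the two abundances, which is one possible choice stated as an assumption: `picante::mpd()` with `abundance.weighted = TRUE` may handle details such as within-species pairs differently. All `toy_*` names are hypothetical.

```r
# Hypothetical phylogenetic distances between three species
toy_dist = matrix(c(0, 2, 6,
                    2, 0, 6,
                    6, 6, 0),
                  nrow = 3,
                  dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Hypothetical abundances: species "c" is distant but rare
toy_abund = c(a = 10, b = 10, c = 1)

# Each pair of distinct species counted once
pairs_idx = lower.tri(toy_dist)

# Unweighted: every pair counts the same
mpd_unweighted = mean(toy_dist[pairs_idx])

# Weighted: pairs of abundant species count more
pair_weights = outer(toy_abund, toy_abund)
mpd_weighted = weighted.mean(toy_dist[pairs_idx], pair_weights[pairs_idx])
```

In this toy community the weighted value (about 2.67) is lower than the unweighted one (about 4.67) because the distant species is rare.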

* **Q24**: What is the relationship between MPD and taxa richness? And with forest loss? Plot these relationships to visualize them and use the `cor.test()` function to validate your observations.

```{r q24}
par(mfrow = c(1, 2))
plot(obs_mpd$ntaxa, obs_mpd$mpd_weighted)
plot(obs_mpd$forestloss17, obs_mpd$mpd_weighted)

cor.test(obs_mpd$ntaxa, obs_mpd$mpd_weighted)
cor.test(obs_mpd$forestloss17, obs_mpd$mpd_weighted)
```

**Answer**: We observe a positive relationship between species richness and MPD, supported by a medium positive correlation. We observe no relationship between forest loss and MPD.

## Null modeling (**Q25**-**Q27**)

* **Q25**: Explain what the column `mpd.obs.z` means. How does this compare with the SES values we computed for functional diversity indices?

**Answer**: From the help page of `picante::ses.mpd`, accessed with `?picante::ses.mpd`:

> `mpd.obs.z` Standardized effect size of mpd vs. null communities (= (mpd.obs - mpd.rand.mean) / mpd.rand.sd, equivalent to -NRI)

These are exactly SES values, but for phylogenetic diversity. The name comes from [Z-scoring variables](https://en.wikipedia.org/wiki/Standard_score).

* **Q26**: How does the standardized value relate to taxa richness?

```{r q26}
cor.test(ses_mpd_999$ntaxa, ses_mpd_999$mpd.obs.z)
```

**Answer**: They show no relationship, which indicates that the null models **corrected** for the richness effect seen in the observed values.

* **Q27**: What are the relationships between MPD values considering null models and forest loss? Visualize the relationships with the `plot()` function, validate your observations with the `cor.test()` function.

```{r q27}
par(mfrow = c(1, 1))

ses_mpd_999_env = merge(ses_mpd_999, plot_data[, c("plot.code", "forestloss17")],
                        by = "plot.code")

plot(ses_mpd_999_env$forestloss17, ses_mpd_999_env$mpd.obs.z)

cor.test(ses_mpd_999_env$forestloss17, ses_mpd_999_env$mpd.obs.z)
```

**Answer**: There is still no clear relationship.

# Comparing facets (**Q28**)

* **Q28**: How related are the observed values of functional diversity and phylogenetic diversity? What about the SESs?

**Answer**: Observed MPD is correlated with observed Q and observed FRic, and with richness. All these correlations disappear with the SES values. (This is the effect of null modeling, which accounts for the relationships with species richness.)

# Modelling the effect of logging (**Q29**-**Q31**)

## Single-predictor models (**Q29**-**Q30**)

* **Q29**: How would you qualify the effect of forest loss on the taxa richness?

**Answer**: Forest loss does not seem to affect taxa richness.

* **Q30**: With the same formula build similar models with the other predictors `roaddensprim` and `roaddistprim`. How do they compare with forest loss?

```{r q30}

mod_taxa_dens = lm(ntaxa ~ roaddensprim, data = plot_div_env)
mod_taxa_dist = lm(ntaxa ~ roaddistprim, data = plot_div_env)

par(mfrow = c(1, 2))
plot(mod_taxa_dens$model$roaddensprim, mod_taxa_dens$model$ntaxa,
     xlab = "Primary Road Density", ylab = "Species Richness")
abline(coef = coef(mod_taxa_dens), col = "darkred", lwd = 1)

plot(mod_taxa_dist$model$roaddistprim, mod_taxa_dist$model$ntaxa,
     xlab = "Distance to nearest primary road", ylab = "Species Richness")
abline(coef = coef(mod_taxa_dist), col = "darkred", lwd = 1)
```

**Answer**: Similarly we observe no clear relationships between species richness and other disturbance variables.

## Multi-predictor models (**Q31**)

* **Q31**: What can you say about the effect of the disturbances on the different diversity metrics? What is the explanatory power of our models?

```{r q31}
summary(mod_taxa_all)
summary(mod_fd_all)
summary(mod_pd_all)
```

**Answer**: We do not observe any variable with a strong effect. Most of our models show a low R-squared (< 20%), which means they have poor explanatory power. We should probably think more carefully about the determinants of our explanatory variables and account for the non-independence of our dataset (block design in sampling).
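
Instead of reading the values off the printed output, the R-squared and the coefficient table can be extracted from the object returned by `summary()`. The sketch below fits a model on simulated data: `toy` and `toy_mod` are hypothetical and unrelated to the tutorial's `mod_taxa_all`, `mod_fd_all`, and `mod_pd_all`.

```r
set.seed(3)

# Simulated predictors and a response with a very weak signal
toy = data.frame(
  forestloss = runif(60, 0, 100),
  roaddens   = runif(60, 0, 5)
)
toy$ntaxa = 20 + 0.01 * toy$forestloss + rnorm(60, sd = 5)

toy_mod     = lm(ntaxa ~ forestloss + roaddens, data = toy)
toy_summary = summary(toy_mod)

# Share of variance explained, raw and adjusted for the number of predictors
toy_summary$r.squared
toy_summary$adj.r.squared

# Estimates, standard errors, t values and p-values as a matrix
coef(toy_summary)
```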