---
title: "Multivariate mismatched seedling traits vs. seed microbiome"
author: "Quentin D. Read"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    code_folding: show
---
```{css, echo = FALSE}
.gt_table {
margin-bottom: 20px;
margin-top: 20px;
}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Summary
In this document I extend the single-variable approach from the previous notebook to a multivariate one. The initial model fitting to obtain trait BLUPs for each maternal plant is done with a multivariate linear regression. Its predictions are essentially the same as what you would get from fitting separate univariate models; the only difference is that it also estimates residual correlations between the traits (which we ignore moving forward anyway). Then, instead of a random forest, a multiple-output neural network model is fit, again repeated for every posterior sample, with leave-one-out cross-validation. The process is repeated for 16S and ITS. In addition to the neural network, I also fit PLS models for both 16S and ITS.
Both the PLS and neural network models produce results indistinguishable from a null model. I think there really is not much signal in the data, so this dataset will not give us much insight into the relationship between the seed microbiome and the traits of the resulting seedling. Not having microbiome and traits from the same seedlings, but only from siblings or half-siblings of the same mother plant, may be obscuring any relationship that does exist.
# Setup
```{r load packages}
library(data.table)
library(brms)
library(ggplot2)
library(purrr)
library(keras)
library(mixOmics)
library(ggdist)
options(mc.cores = 4, brms.backend = 'cmdstanr', brms.file_refit = 'on_change')
# Import data. Corrected labels, data accessed from g drive 2023-06-06
abundance16S <- fread('project/data/16S_abundance.csv')
abundanceITS <- fread('project/data/ITS_Abundance.csv')
traits <- fread('project/data/seedling_data_updated.csv')
trait_names <- c('Days to germinate', 'Leaf height (cm)', 'Rooting depth (cm)', 'Root tip count', 'Primary root count', 'Leaf count')
```
```{r ggplot theme, include = FALSE}
windowsFonts(`fgbook` = windowsFont('Franklin Gothic Book'))
theme_set(
theme_bw(base_family = 'fgbook') +
theme(panel.grid = element_blank(),
strip.background = element_blank())
)
```
# Multivariate trait model
We will use the following traits: days to germination, leaf height, rooting depth, root tip count, number of primary roots, and number of leaves.
First, eliminate any individuals that have no traits recorded. Also, there are some individuals that only have one trait recorded: days to germination. These will not contribute very much to the information about the maternal plant either, so we will remove them too.
```{r}
mv_traits <- c("days_to_germinate", "leaf_height_cm", "rooting_depth_cm", "root_tip_count", "no_primary_roots", "no_leaves")
traits_std <- traits[apply(traits[, lapply(.SD, \(x) !is.na(x)), .SDcols = mv_traits], 1, sum) > 1,
mget(c('maternal_plant_code', 'country_origin', mv_traits))]
```
We are left with 50 seedlings. Here are the ones with any remaining missing values: there are only 3, all missing days to germination. Because there are so few, we simply impute those values using the median from their maternal plant.
```{r}
traits_std[!complete.cases(traits_std)]
```
```{r}
traits_std[, days_to_germinate := ifelse(is.na(days_to_germinate), median(days_to_germinate, na.rm = TRUE), days_to_germinate),
by = .(maternal_plant_code)]
```
Transform all variables with a log transformation. Use log(x+1) for leaf height and number of leaves because there are a few zero values.
```{r}
traits_std[, leaf_height_cm := leaf_height_cm + 1]
traits_std[, no_leaves := no_leaves + 1]
traits_std[, (mv_traits) := lapply(.SD, log), .SDcols = mv_traits]
```
Standardize the traits (z-transformation) so that they are in common units.
```{r}
traits_std[, (mv_traits) := lapply(.SD, scale), .SDcols = mv_traits]
```
Fit the multivariate linear regression model with population as fixed effect and random intercept for maternal plant. There will be 4000 posterior samples. This model takes about 2 minutes to compile and run. It is basically six regressions put together except that residual correlations between each pair of traits are also estimated.
```{r}
mv_trait_fit <- brm(
mvbind(days_to_germinate, leaf_height_cm, rooting_depth_cm, root_tip_count, no_primary_roots, no_leaves) ~ country_origin + (1 | maternal_plant_code),
data = traits_std,
chains = 4, iter = 2000, warmup = 1000, seed = 825,
file = 'project/fits/mv_trait_fit'
)
```
Extract the posterior distribution of the BLUPs for each of the 9 maternal plants (plant J had no offspring with traits). If you add the fixed and random effect together you get a predicted value for each of the 4000 posterior samples for each maternal plant for each of the 6 traits, so this is a 4000x9x6 array.
```{r}
pred_grid <- unique(traits_std[, .(maternal_plant_code, country_origin)])
trait_post_blups <- predict(mv_trait_fit, newdata = pred_grid, summary = FALSE)
```
Reduce the number of posterior samples to 100 for easier calculation. Later if it becomes important to do so we can remove this step and parallelize the computations. We are left with a 100x9x6 array.
```{r}
set.seed(1104)
n_samples <- 100
trait_post_blups_subsample <- trait_post_blups[sample(1:dim(trait_post_blups)[1], n_samples), , ]
```
# Use aggregated microbiome to predict traits
This uses the quick and dirty method of representing the maternal microbiome by the aggregated microbiome of all the offspring of each maternal plant. This gives a reasonably good estimate of the "mean" microbiome of each maternal plant, but it does not account for the uncertainty that arises because we can only observe the maternal microbiome obliquely, by sampling the microbiomes of its offspring. Later we may be able to come up with a method, using ALDEx2 or some other means, to account for this uncertainty.
As of 2023-10-16, the aggregation is done as follows:
- Remove maternal plant J from the data as it has no seedling traits.
- Remove any taxa that have all 0 abundance across all remaining microbiomes.
- Do CLR transformation, first adding 1 to all counts.
- Take the average log-ratio transformed value for each taxon, grouped by maternal plant.
```{r}
aggregate_abundance <- function(abundance) {
# Remove maternal plant J as it has no seedling traits.
J_columns <- grep('J$', names(abundance), value = TRUE)
abundance[, c(J_columns) := NULL]
# Remove any taxa that now have all zero abundance in the remaining 9 maternal plants.
zero_rows <- rowSums(abundance[,-1]) == 0
abundance <- abundance[!zero_rows]
# Do CLR transformation, adding 1 to all counts. Replace original values with transformed.
abundance_CLR <- logratio.transfo(abundance[,-1], logratio = 'CLR', offset = 1)
abundance_CLR <- cbind(abundance[, .(V1)], as(abundance_CLR, 'matrix'))
# Reshape to longform
abundance_long <- melt(abundance_CLR, id.vars = 'V1')
abundance_long[, maternal_plant_code := substr(variable, 3, 3)]
abundance_long[, sampleID := substr(variable, 1, 2)]
# Aggregate by taking the average log ratio of each taxon grouped by maternal plant
abundance_agg <- abundance_long[, .(value = mean(value)), by = .(V1, maternal_plant_code)]
# Return to wideform
abundance_agg_wide <- dcast(abundance_agg, maternal_plant_code ~ V1)
}
# Do aggregation and strip off identifier column.
abundance16S_agg_forfitting <- as.matrix(aggregate_abundance(abundance16S)[,-1])
abundanceITS_agg_forfitting <- as.matrix(aggregate_abundance(abundanceITS)[,-1])
```
## Multi-output neural network model
### Fit the model with LOO cross-validation
Some of the following code was adapted from [this post](https://www.datatechnotes.com/2020/01/multi-output-regression-example-with.html). First, we need to create a Keras model. Here it is a neural network with an input layer, an output layer, and two hidden layers (each followed by dropout) in between. The inputs are vectors of length 1293 for 16S and length 585 for ITS (the microbiomes) and the outputs are vectors of length 6 (the traits). The model is fit using the `keras` package, which means Python code runs behind the scenes via the `reticulate` package that calls Python from within R.
Every iteration through this code, we redefine and recompile the model and fit a multi-output neural network for a single posterior sample (a slice from the three-dimensional array, i.e. a 9x6 matrix). Because of the small sample size, we use leave-one-out cross-validation: fit the model to 8 of the 9 maternal microbiomes (an 8x1293 matrix for 16S or an 8x585 matrix for ITS), predict the 9th, and repeat until all 9 have been predicted.
We have two nested loops: the outer loop over the trait posterior samples for which we repeat the model fit (100), and the inner loop over the rows of each sample for the leave-one-out cross-validation. The predicted values for each holdout are saved; the corresponding observed values are taken from the BLUP array when the results are examined below.
> NOTE: After much work I was able to configure my laptop (Windows 10) so that this code would run on it. However it takes a little while to run, so I ran it on the SciNet HPC anyway. For any case with a larger dataset, it will be better to run on the HPC.
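Before launching the long loops, it can be worth confirming that R can actually reach the Python backend. This is a minimal, optional check (not part of the original pipeline) using standard helpers from the `reticulate` and `keras` packages:
```{r, eval = FALSE}
# Optional sanity check of the Python/Keras setup (a sketch; not part of the original analysis).
reticulate::py_config()       # shows which Python installation reticulate will use
keras::is_keras_available()   # TRUE if the Keras backend can be loaded
```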
#### 16S
```{r, eval = FALSE}
results <- list()
for (i in 1:n_samples) {
for (j in 1:nrow(pred_grid)) {
# Define and compile model.
nnetmodel <- keras_model_sequential() |>
layer_dense(units = 100, activation = 'relu', input_shape = dim(abundance16S_agg_forfitting)[2]) |>
layer_dropout(rate = 0.3) |>
layer_dense(units = 32, activation = 'relu') |>
layer_dropout(rate = 0.3) |>
layer_dense(units = dim(trait_post_blups)[3], activation = 'linear')
nnetmodel |> compile(loss = 'mse', optimizer = 'adam')
# Create train and test split with just a single holdout row and the rest used for fitting.
ab_train <- abundance16S_agg_forfitting[-j, ]
ab_test <- abundance16S_agg_forfitting[j, , drop = FALSE]
trait_train <- trait_post_blups_subsample[i, -j, ]
trait_test <- t(as.matrix(trait_post_blups_subsample[i, j, ]))
# Fit model and get predictions for the holdout row.
nnetmodel |> fit(
x = ab_train, y = trait_train, epochs = 1000, verbose = 0
)
results[[length(results) + 1]] <- data.frame(sample = i, row = j, predict(nnetmodel, ab_test))
}
message('Sample ', i, ' of ', n_samples, ' complete.')
}
results_df <- do.call(rbind, results)
fwrite(results_df, 'project/fits/nnet_mv_trait_results_16S.csv')
```
#### ITS
```{r, eval = FALSE}
results <- list()
for (i in 1:n_samples) {
for (j in 1:nrow(pred_grid)) {
# Define and compile model.
nnetmodel <- keras_model_sequential() |>
layer_dense(units = 100, activation = 'relu', input_shape = dim(abundanceITS_agg_forfitting)[2]) |>
layer_dropout(rate = 0.3) |>
layer_dense(units = 32, activation = 'relu') |>
layer_dropout(rate = 0.3) |>
layer_dense(units = dim(trait_post_blups)[3], activation = 'linear')
nnetmodel |> compile(loss = 'mse', optimizer = 'adam')
# Create train and test split with just a single holdout row and the rest used for fitting.
ab_train <- abundanceITS_agg_forfitting[-j, ]
ab_test <- abundanceITS_agg_forfitting[j, , drop = FALSE]
trait_train <- trait_post_blups_subsample[i, -j, ]
trait_test <- t(as.matrix(trait_post_blups_subsample[i, j, ]))
# Fit model and get predictions for the holdout row.
nnetmodel |> fit(
x = ab_train, y = trait_train, epochs = 1000, verbose = 0
)
results[[length(results) + 1]] <- data.frame(sample = i, row = j, predict(nnetmodel, ab_test))
}
message('Sample ', i, ' of ', n_samples, ' complete.')
}
results_df <- do.call(rbind, results)
fwrite(results_df, 'project/fits/nnet_mv_trait_results_ITS.csv')
```
### Examine results
Examine how well the predicted values match the observed values. Note that everything here has been log-transformed and then standardized.
Calculate the RMSE for each of the traits for each posterior sample. Because the traits have been standardized, all results will be in standard deviation units.
```{r}
results_df_16S <- fread('project/fits/nnet_mv_trait_results_16S.csv')
results_df_ITS <- fread('project/fits/nnet_mv_trait_results_ITS.csv')
rmse_16S <- list()
rmse_ITS <- list()
for (i in 1:n_samples) {
trait_obs <- trait_post_blups_subsample[i,,]
trait_pred_16S <- results_df_16S[sample == i, -(1:2)]
trait_pred_ITS <- results_df_ITS[sample == i, -(1:2)]
# RMSE for each column.
rmse_16S[[length(rmse_16S) + 1]] <- sqrt(apply((trait_obs - trait_pred_16S)^2, 2, mean))
rmse_ITS[[length(rmse_ITS) + 1]] <- sqrt(apply((trait_obs - trait_pred_ITS)^2, 2, mean))
}
rmse_df_16S <- do.call(rbind, rmse_16S) |> as.data.table() |> setnames(names(traits_std)[-(1:2)])
rmse_df_ITS <- do.call(rbind, rmse_ITS) |> as.data.table() |> setnames(names(traits_std)[-(1:2)])
```
Show the distributions of the RMSEs.
```{r}
melt(rmse_df_16S, variable.name = 'trait', value.name = 'RMSE') |>
ggplot(aes(x = trait, y = RMSE)) +
stat_interval(aes(color = after_stat(level)), position = position_dodge(width = .5)) +
stat_summary(fun = median, geom = 'point', size = 2, position = position_dodge(width = .5)) +
geom_hline(yintercept = 1, linetype = 'dashed', linewidth = 1, color = 'gray50') +
ggtitle('16S') +
scale_color_brewer(palette = 'Blues') +
theme(legend.position = 'bottom') +
scale_x_discrete(labels = trait_names)
melt(rmse_df_ITS, variable.name = 'trait', value.name = 'RMSE') |>
ggplot(aes(x = trait, y = RMSE)) +
stat_interval(aes(color = after_stat(level)), position = position_dodge(width = .5)) +
stat_summary(fun = median, geom = 'point', size = 2, position = position_dodge(width = .5)) +
geom_hline(yintercept = 1, linetype = 'dashed', linewidth = 1, color = 'gray50') +
ggtitle('ITS') +
scale_color_brewer(palette = 'Blues') +
theme(legend.position = 'bottom') +
scale_x_discrete(labels = trait_names)
```
The RMSEs are in standard deviation units and are at or above 1 for all traits; a null model should have RMSE = 1 standard deviation. It seems that the microbiome does not predict the traits.

Ultimately, I think that regardless of which statistical method is used, the relationship between seed microbiome and the traits of the resulting seedling may depend heavily on the individual seed. Because we do not have any one-to-one correspondence between seed microbiome and seedling traits, we may not be able to say much about this relationship. We have one set of seeds with the microbiome sampled and another set with traits measured, and they do not come from the same individuals. Some are siblings or at least half-siblings because they come from the same maternal plant, but if the relationship between microbiome and traits depends on which components of the maternal microbiome happened to make it into a particular seed, then our data cannot uncover that relationship no matter what we do.
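As a quick check of the RMSE = 1 benchmark, here is a minimal sketch (not part of the original analysis) computing the RMSE of a null prediction that ignores the microbiome entirely and always predicts the grand mean (0 on the standardized scale) for each trait. Because the traits were z-transformed, these values should come out very close to 1.
```{r}
# Null-model benchmark (sketch): predict the grand mean (0 on the z-transformed scale)
# for every seedling and compute each trait's RMSE. Values near 1 confirm that
# RMSE = 1 is the appropriate "no information" baseline for the plots above.
round(sapply(traits_std[, mget(mv_traits)], \(x) sqrt(mean((x - mean(x))^2))), 3)
```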
## PLS
The neural network model may be a bit too data-hungry for the relatively small sample size that we have. Thus I will use PLS instead to try to simultaneously do dimension reduction on the microbiome matrix and look at any relationship with the multivariate seedling trait matrix.
Here, for all 4000 posterior BLUP matrices for the traits, I fit a PLS regression with 3 components, regressing the trait matrix onto the microbiome matrix (this is done for both 16S and ITS). I conduct leave-one-out cross-validation by repeating the fit 9 times, once with each of the 9 maternal plants left out, and generating a prediction for the held-out plant. Each prediction has 3 values for each of the 6 traits, using the first, the first two, and all three components to predict. I calculate the RMSE of the predictions and then plot the distribution of RMSEs across all 4000 posterior iterations for each trait and for each of the three numbers of components.
This code takes only a few minutes to run, so I was able to run it on my laptop. Results are saved and reloaded to generate this notebook.
```{r, eval = FALSE}
# Define function to fit PLS using Leave-one-out cross-validation and return the RMSEs for each of the three components.
fit_pls <- function(abundance_mat, trait_mat, n_components) {
pls_loo_fits <- lapply(1:9, \(i) pls(X = abundance_mat[-i,], Y = trait_mat[-i,], ncomp = n_components, mode = 'regression'))
pls_loo_preds <- lapply(1:9, \(i) predict(pls_loo_fits[[i]], newdata = abundance_mat[i, , drop = FALSE]))
pls_loo_predmats <- lapply(1:n_components, \(comp) do.call(rbind, lapply(pls_loo_preds, \(p) p$predict[,,comp])))
pls_loo_rmses <- lapply(1:n_components, \(comp) sqrt(apply((trait_mat - pls_loo_predmats[[comp]])^2, 2, mean)))
as.data.table(cbind(component = 1:n_components, do.call(rbind, pls_loo_rmses)))
}
# Apply function to 16S and ITS for all posterior iterations; 3 components used.
n_components <- 3
pls_rmses_16S <- apply(trait_post_blups, 1, \(tmat) fit_pls(abundance16S_agg_forfitting, tmat, n_components))
pls_rmses_16S <- cbind(iter = rep(1:dim(trait_post_blups)[1], each = n_components), do.call(rbind, pls_rmses_16S))
pls_rmses_ITS <- apply(trait_post_blups, 1, \(tmat) fit_pls(abundanceITS_agg_forfitting, tmat, n_components))
pls_rmses_ITS <- cbind(iter = rep(1:dim(trait_post_blups)[1], each = n_components), do.call(rbind, pls_rmses_ITS))
save(pls_rmses_16S, pls_rmses_ITS, file = 'project/fits/PLS_RMSE.RData')
```
```{r}
load('project/fits/PLS_RMSE.RData')
```
Show the distributions of the RMSEs as interval plots. The median RMSE is plotted as a black dot.
> Note that I put a horizontal line at RMSE = 1 because we are using standardized data with a standard deviation of 1. A pure null model with no information would therefore give RMSE = 1; that is the bar to beat, and any model with RMSE >= 1 does no better than the null model.
```{r}
pls_rmses_16S_long <- melt(pls_rmses_16S, id.vars = c('iter', 'component'), variable.name = 'trait', value.name = 'RMSE')
pls_rmses_16S_long[, component := factor(component)]
less1_16S <- pls_rmses_16S_long[, .(p = sum(RMSE < 1)/.N), by = .(trait, component)]
ggplot(pls_rmses_16S_long, aes(x = trait, y = RMSE)) +
stat_interval(aes(color_ramp = after_stat(level), color = component), position = position_dodge(width = .5)) +
stat_summary(aes(group = component), fun = median, geom = 'point', size = 2, position = position_dodge(width = .5)) +
geom_hline(yintercept = 1, linetype = 'dashed', linewidth = 1, color = 'gray50') +
ggtitle('16S') +
scale_color_brewer(palette = 'Dark2', name = 'number of components used to predict') +
theme(legend.position = 'bottom') +
scale_x_discrete(labels = trait_names)
pls_rmses_ITS_long <- melt(pls_rmses_ITS, id.vars = c('iter', 'component'), variable.name = 'trait', value.name = 'RMSE')
pls_rmses_ITS_long[, component := factor(component)]
less1_ITS <- pls_rmses_ITS_long[, .(p = sum(RMSE < 1)/.N), by = .(trait, component)]
ggplot(pls_rmses_ITS_long, aes(x = trait, y = RMSE)) +
stat_interval(aes(color_ramp = after_stat(level), color = component), position = position_dodge(width = .5)) +
stat_summary(aes(group = component), fun = median, geom = 'point', size = 2, position = position_dodge(width = .5)) +
geom_hline(yintercept = 1, linetype = 'dashed', linewidth = 1, color = 'gray50') +
ggtitle('ITS') +
scale_color_brewer(palette = 'Dark2', name = 'number of components used to predict') +
theme(legend.position = 'bottom') +
scale_x_discrete(labels = trait_names)
```
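The objects `less1_16S` and `less1_ITS` computed above hold, for each trait and number of components, the proportion of posterior RMSE draws that fall below 1. Displaying them (a small addition, not in the original write-up) gives a numerical complement to the interval plots:
```{r}
# Proportion of posterior RMSE draws below 1 (i.e., better than the null model)
# for each trait and number of PLS components.
less1_16S
less1_ITS
```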
As you can see from the above plots for 16S and ITS, most of the prediction results are worse than the null model. In particular the median of the RMSE posterior distribution is at or above 1 for all six traits. **Interpretation: When we fit a PLS model to 8/9 of the plants, relating their seedlings' aggregated microbiome to their seedlings' aggregated traits, and predict the held-out maternal plant's average seedling trait value using its average seedling microbiome, we get about the same result as if we had completely ignored microbiome and just made a naive prediction using the mean seedling trait values across all maternal plants.** I believe that having RMSE distributions centered around 1 or slightly above 1, as we do here, is consistent with there being no signal to retrieve from the microbiome data that has any relationship with the trait data. We get the same distribution we might expect by chance under the "null hypothesis" that microbiome doesn't predict traits.
We also see that adding more components from the PLS to the prediction does not improve the prediction at all. This is yet another confirmation that there is not that much seedling trait signal in the seedling microbiome data.