-
Notifications
You must be signed in to change notification settings - Fork 6
/
05-mblmodels.Rmd
324 lines (244 loc) · 16.1 KB
/
05-mblmodels.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
# MBL Models
This section of the [demo script](https://whrc.github.io/Soil-Predictions-MIR/getting-started.html#demo-script) creates MBL models and gets predictions from them:
```{r, eval=FALSE}
#----------------------------------------------#
# Memory Based Learner Model #
#----------------------------------------------#
source("Functions/mbl_functions.R")
source("Functions/perform_functions.R")
# Make Model
mbl.OC <- runMBL(PROP="OC", REFNAME="refSet", PREDNAME="predSet")
# Extract Predictions
mbl.predictions <- getModResults(PROP="OC", MODTYPE="MBL", MODNAME= "mbl.OC", PREDNAME= "predSet")
```
## Model Theory {-}
### Overview {-}
**Memory-Based Learning** (MBL) is a local modeling approach that can be used to predict a given soil property from a set of spectral data, the **prediction set**.
Like PLS, this approach relies on a **reference set**, containing both spectral data and known values for the soil property of interest (ie. Organic Carbon).
While PLS create a single **global model** which can be applied to all samples in the prediction set, MBL makes a **local model** for each prediction.
Local models are built from a sample's **nearest neighbors**: samples in the reference set that are most similar to the sample being predicted.
Similarity is measured by **spectral similarity**, which should reflect similarities in soil composition. Since each sample has a customized model, predictions are often **more accurate** than PLS predictions.
However, MBL models can be quite **computationally intensive** since
1) A model is built for each sample being predicted
2) All samples in the prediction and reference set must be related in terms of similarity
### Animation {-}
The animation below illustrates how local modeling works in MBL. It is shown in multidimensional space since each spectral column is a dimension of the dataset.
```{r, echo=FALSE}
knitr::include_graphics(path="./images/mbl_vid.gif")
```
**A** Shows all the samples in the prediction set (red), overlaying all the samples in the reference set (gray)
**B** Shows a circle indicating the nearest neighbors of a sample being predicted
**C** Shows all the samples of the prediction set with their respective nearest neighbors
**D** Shows how local models will be created for each prediction from these nearest neighbors
Resemble Powerpoint: http://www.fao.org/fileadmin/user_upload/GSP/docs/Spectroscopy_dec13/SSW2013_f.pdf
## Running MBL {-}
Running an MBL modeling approach is accomplished using a couple functions from the [resemble package](https://whrc.github.io/Soil-Predictions-MIR/getting-started.html#resemble): `mblControl()` and `mbl()`. Full documentation for these functions is linked below and the following sections describe how they can be used.
**MBL**- [*Resemble mbl() Documentation*](https://www.rdocumentation.org/packages/resemble/versions/1.2.2/topics/mbl)
```
mbl(Yr, Xr, Yu = NULL, Xu,
mblCtrl = mblControl(),
dissimilarityM,
group = NULL,
dissUsage = "predictors",
k, k.diss, k.range,
method,
pls.c, pls.max.iter = 1, pls.tol = 1e-6,
noise.v = 0.001,
...)
```
**MBL Control**- [*Resemble mblControl() Documentation*](https://www.rdocumentation.org/packages/resemble/versions/1.2.2/topics/mblControl)
```
mblControl(sm = "pc",
pcSelection = list("opc", 40),
pcMethod = "svd",
ws = if(sm == "movcor") 41,
k0,
returnDiss = FALSE,
center = TRUE,
scaled = TRUE,
valMethod = c("NNv", "loc_crossval"),
localOptimization = TRUE,
resampling = 10,
p = 0.75,
range.pred.lim = TRUE,
progress = TRUE,
cores = 1,
allowParallel = TRUE)
```
### `runMBL()` {-}
`runMBL()` is a wrapper function for loading the appropriate datasets and calling the resemble functions for running an mbl model. It is used directly in the [demo script](https://whrc.github.io/Soil-Predictions-MIR/getting-started.html#demo-script) as shown below:
```{r, eval=FALSE}
source("Functions/mbl_functions.R")
mbl.OC <- runMBL(PROP="OC", REFNAME="refSet", PREDNAME="predSet")
```
* `runMBL()`
+ `PROP`: *string*- The column name of the soil property of interest.
+ `REFNAME`: *string*- The name of the reference set variable, if it is already loaded into the R environment. Use REFNAME or REFPATH
+ `REFPATH`: *string*- The path of the RData file containing your reference set, if the reference set is not already loaded. Use REFNAME or REFPATH
+ `PREDNAME`: *string*- The name of the prediction set variable, if it is already loaded into the R environment. Use PREDNAME or PREDPATH
+ `PREDPATH`: *string*- The path of the RData file containing your prediction set, if the prediction set is not already loaded. Use PREDNAME or PREDPATH
+ `SAVENAME`: *string*- The name assigned to the model when it is saved. Default is set to paste0("mbl.",PROP) which would save "mbl.OC.RData" for example in the "Models" folder
_
* Load the resemble package
```{r, eval=FALSE}
# Run MBL Model
library(resemble)
```
* Load the data
+ If `REFPATH` is not NA, it will load the reference set at the path passed. Otherwise, it assumes you have passed in `REFNAME`, the variable name of a reference set already loaded. We use the `get()` command, rather than the variable itself, so that the name of the variable can be saved with our prediction performance. The same applies to the prediction set.
```{r, eval=FALSE}
# Load Reference Set
if(!is.na(REFPATH)){
REFNAME <- load(REFPATH) # If REFPATH is given
}
refset <- get(REFNAME) # load as variable REFSET
# Load Prediction Set
if(!is.na(PREDPATH)){
PREDNAME <- load(PREDPATH)
}
predSet <- get(PREDNAME)
```
* Define inputs and eliminate rows with NA values
```{r eval=FALSE}
# Define Input Datasets
Xu <- predSet$spc # Prediction Spectra
Yu <- sqrt(predSet[,PROP]) # Prediction Lab Data
Yr <- sqrt(refSet[,PROP]) # Reference Spectra
Xr <- refSet$spc # Reference Lab Data
# Get Rid of NAs
Xu <- Xu[!is.na(Yu),]
Xr <- Xr[!is.na(Yr),]
Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]
```
* Set Control Parameters
+ In this example, nearest neighbors will be determined in principal component space (`sm='pc'`), the optimal number of principal components will be used and up to 50 components will be tested (`pcSelection = list('opc',50)`), and nearest neighbor validation will be used (`valMethod = 'NNv'`)- meaning for each prediction, a model will be built will all but the nearest neighbor of the sample being predicted.
```{r, eval=FALSE}
ctrl <- mblControl(sm = 'pc', pcSelection = list('opc', 50),
valMethod = 'NNv',center=TRUE,scale=FALSE,allowParallel=FALSE)
```
* Run MBL Model
+ **Option 1** - Will make local partial least squares regression models using 40, 60, 80 and 100 nearest neighbors (`k= seq(40, 100, by = 20)`) and using 6 components (`pls.c = 6`) to make predictions. Will not use the dissimilarity matrx (`dissUsage = 'none'`).
+ **Option 2**- Will create weighted partial least squares regression models (`method = "wapls1"`) using the neighbors within 0.3, 0.4,...1 distance from the sample being predicted (`k.diss = seq(0.3, 1, by=0.1)`), with a minimum of 20 neighbors used (`k.range = c(20, nrow(refSet))`) and predicting with 3 to 20 of the components of the model (`pls.c = c(minpls=3, maxpls=20)`)
Option 1
```{r, eval=FALSE}
mbl.sqrt <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu, mblCtrl = ctrl, dissUsage = 'none',
k = seq(40, 100, by = 20), method = 'pls', pls.c = 6)
```
Option 2
```{r, eval=FALSE}
mbl.sqrt <- mbl(Yr = Yr, Xr = Xr, Xu = Xu, mblCtrl = ctrl, dissUsage = "none", k.diss = seq(0.3, 1, by=0.1),
k.range = c(20, nrow(refSet)), pls.c = c(minpls=3, maxpls=20), method = "wapls1")
```
* Save the model
```{r, eval=FALSE}
# Save MBL Model
if(SAVENAME != "none"){
modelName <- paste("mbl", PROP, sep=".") # Assign Object Name
assign(modelName, mbl.sqrt)
savefile <- paste("./Models/mbl", PROP,"RData", sep=".") # Save File
save(list= modelName, file = savefile)
cat(paste("\nModel",modelName, "saved to", savefile)) # Print Save Location
}
```
### Modeling Parameters {-}
This section explains some of the main ways to customize and optimize mbl models using `mbl()` and `mblControl()` in the [resemble package](https://whrc.github.io/Soil-Predictions-MIR/getting-started.html#resemble). Full documentation on these functions are linked below:
Below is an example workflow showing the decision points of modeling with MBL. These parameters will be describe in the following subsections.
![](./images/mbl_step.png)
#### Input Datasets {-}
* The `mbl()` function accepts 4 different data products, `Xu` `Xr` `Yu` and `Yr`, summarized in the table below:
![](./images/mbl_input_data.png)
* Both ***X***s are matrices with spectral data and both ***Y***s are vectors with lab data for the property of interest.
* ***u*** indicates "uncertain" for our prediction set, and ***r*** indicates "reference" for our reference set.
* `Yu` is optional, since not all prediction sets will have associated lab data. If this is the case, set `Yu` to `NULL`.
* See the data preprocessing tab to prepare these datasets prior to modeling. In addition, it is necessary to remove all rows in the reference set inputs (`Yr` and `Xr`) that have `NA` values. If you would like to include `Yu` but there are missing values, you must also remove those rows in both prediction set inputs (`Yu` and `Xu`).
+ Number of columns in `Xr` must equal that of `Xu`.
+ Number of rows in `Yr` must equal that of `Yu`, if provided.
#### Matrix of Spectral Neighbors {-}
* When selecting nearest neighbors to build a local model, the `mbl()` function references a **spectral dissimilarity matrix**, which relates samples in the prediction and reference sets.
* This matrix can be created by setting the `sm` parameter in `mblControl()`, or can be passed into the `mbl()` function as `dissimilarityM` if a matrix has already been made.
* For creating the matrix, you will have to decide **how spectral dissimilarity will be calculated** by setting a couple variables in mblControl():
+ `sm` can be set to a variety of different methods for measuring distance in a multidimensional space. We have used `"pls" "pc" "euclid" "cosine" "cor" and "movcor"`
+ `pcSelection` determines how the number of principal components will be chosen for calculating Mahalonobis dissimilarity (when sm = "pc", "loc.pc", "pls" or "loc.pls")
+ We have this set to the default options of `(opc,40)` meaning the optimal principal component method will be used and up to 40 components will be tested.
.
* Lastly, you can specify **how the matrix will be used** within the local models, if at all, by setting the `dissUsage` parameter to `"weights" "predictors" or "none"`.
+ If set to `"predictors"`, the column of the matrix which shows similarity to the sample being predicted, will be added as a predictor variable to build the local model.
+ If set to `"weights"`, the neighbors are weighted based on dissimilarity/distance (those closer to the sample being predicted receive more weight in the model).
.
* The **matrix format** will look like one of the following, depending on how it will be used...
+ A. All reference and prediction sets samples as rows and columns ("predictors")
![](./images/matrix1.png)
+ B. Reference set samples as rows, prediction set samples as columns ("weights")
![](./images/matrix2.png)
#### Neighbor Selection {-}
* The `mbl()` function allows you to specify how many nearest neighbors will be used to build local models, by setting either `k`, or `k.diss` and `k.range`.
+ Option 1: Set `k` to a sequence of numbers to test, for how many neighbors to include.
+ `seq(40, 40, by=20)` , would perform 1 iteration, using 40 nearest neighbors
+ `seq(40, 100, by=20)`, would perform 4 iterations, using 40, 60, 80 and 100 nearest neighbors
+ Option 2:
+ Set a dissimilarity threshold `k.diss` that limits the distance to search for neighbors from a sample. You can think of it as the radius of the circles shown in the model theory animation.
![](./images/k.diss.png)
+ Set `k.range` to the minimum and maximum number of neighbors you want to include, within the `k.diss` distance.
#### Modeling Method {-}
* Once neighbors are selected, MBL builds local models using the multivariate regression method specified with the variable `method` in the `mbl()` function.
+ `pls` for partial least squares regression
+ `wapls1` for weighted average pls
+ `gpr` for gaussian process with dot product covariance
* `pls.c` allows you to set the number of pls components to be used if either "pls" or "wasp1" is used.
+ A single number if `pls` is used
+ A vector containing the minimum and maximum number of components to be used, if `wasp1` is used
#### Validation Method {-}
* You can specify the validation method by setting the parameter `valMethod` within the `mblControl()` function.
+ `NNv` for leave-nearest-neighbour-out cross validation
+ `loc_crossval` for local leave group out cross validation
+ `none` If you chose not to validate the model. This will improve processing speed.
## Getting MBL Predictions {-}
* **MBL predictions** are stored in the mbl model as `MODEL$results$model-name$pred`
* Since you can run the MBL model with different numbers of nearest neighbors or different dissimilarity thresholds, there can be **sets of predictions** stored
* To **choose the best model**, we look for the model with the minimum standardized rmse, and extract predictions from this model, using the functions `bestModMBL()` and `getPredMBL()` sourced from `Functions/mbl_functions.R`
* In the [demo script](https://whrc.github.io/Soil-Predictions-MIR/getting-started.html#demo-script), `getPredMBL()` is called within a wrapper function, `getModResults()`. Complete documentation of this function can be found under the [Model Performance]() tab
```{r, eval=FALSE}
mbl.predictions <- getModResults(PROP="OC", MODTYPE="MBL", MODNAME= "plsr.OC", PREDNAME= "predSet")
```
### `bestModMBL()` {-}
The following function returns the model with the **lowest standardized root mean standard error**- an index showing how much predictions varied from observations^[[Find more about the RMSE here](https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/#:~:text=Root%20Mean%20Square%20Error%20(RMSE)%20is%20the%20standard%20deviation%20of,the%20line%20of%20best%20fit.)].
```{r, eval=FALSE}
bestModMBL <- function(mbl.sqrt){
valType <- mbl.sqrt$cntrlParam$valMethod
if(valType=="NNv"){
index_best_model <- which.min(mbl.sqrt$nnValStats$st.rmse)
}
if(valType=="loc_crossval"){
index_best_model <- which.min(mbl.sqrt$localCrossValStats$st.rmse)
}
best_model_name <- names(mbl.sqrt$results)[index_best_model]
return(best_model_name)
}
```
### `getPredMBL()` {-}
This function calls `bestModMBL()` to choose a model to get predictions from, unless `model_name` is specified otherwise. Since the lab data was square root transformed when building the model, it is back transformed (squared) after predictions are made.
```{r, eval=FALSE}
getPredMBL <- function(mbl.sqrt, model_name=NULL){
if(is.null(model_name)){
model_name <- bestModMBL(mbl.sqrt)
}
sqrt_preds <- eval(parse( text=paste0("mbl.sqrt$results$", model_name,"$pred" )))
predictions <- c(sqrt_preds)^2 # Reverse transform
return(predictions)
}
```
### `getLabMBL()` {-}
If lab data was input when building the model as Yu, it will be stored at `MODEL$results$model-name$yu.obs`. Otherwise, you will have to get it from your original prediction dataset
```{r, eval=FALSE}
getLabMBL <- function(mbl.sqrt){
Yu <- mbl.sqrt$call$Yu
if(!is.null(Yu)){
sqrt_lab <- eval(parse( text=paste0("mbl.sqrt$results$",best_model_name,"$yu.obs" )))
}else{
sqrt_lab <- NULL
}
lab <- c(sqrt_lab)^2
return(lab)
}
```