``` {r 10-load-packages-and-setup, echo = FALSE, include = FALSE, warning=FALSE}
library(broom)
library(cowplot)
library(dismo)
library(dplyr)
library(ggplot2)
library(ggpubr)
library(ggsn)
library(ggspatial)
library(gridExtra)
library(gstat)
library(knitr)
library(leaflet)
library(mapproj)
library(maptools)
library(plotly)
library(raster)
library(RColorBrewer)
library(rgdal)  # provides readOGR(), used to load the shapefiles below
library(rgl)
library(scales)
library(sf)
library(sp)
library(spatialreg)
library(spatstat)
library(spdep)
library(tmap)
knitr::knit_hooks$set(webgl = hook_webgl)
```
```{r echo=FALSE}
yml_content <- yaml::read_yaml("chapterauthors.yml")
author <- yml_content[["spatialEstimation"]][["author"]]
coauthor <- yml_content[["spatialEstimation"]][["coauthor"]]
```
# Spatial Estimation {#spatial-estimation}
Written by
```{r results='asis', echo=FALSE}
cat(author, "and", coauthor)
```
**Geostatistics** uses statistical tools to characterize the distribution of a random variable across a geographical region of interest [@getis_spatial_2004]. Like any other data, spatial data may be incomplete, which can limit our analytical capacity in the unsampled regions. In the context of spatial data, we wish to understand spatial phenomena, such as events or processes that occur over large areas, but we are usually limited in our capacity to sample everywhere. Most of the time, only discrete and unrepresentative information on the spatial phenomenon is collected, which does not allow us to create a precise and continuous map of the phenomenon over the entire geographic area of interest. How can we best utilize the available samples to represent the phenomenon across the entire geographic area? This chapter introduces some basic ideas about different types of sampling strategies in a spatial context, as well as various types of spatial statistics that can be used to predict the occurrence of spatial phenomena at unsampled locations.
:::: {.box-content .learning-objectives-content}
::: {.box-title .learning-objectives-top}
## Learning Objectives {-}
:::
1. Differentiate the advantages of spatial sampling strategies
2. Describe the relationship between observations at different spatial scales using autocorrelation and semivariogram
3. Apply methods of spatial interpolation to predict observations at unknown locations
4. Apply methods of spatial prediction using regression models
::::
## Key Terms {-}
Spatial autocorrelation,
## Spatial Autocorrelation
A key characteristic that distinguishes spatial data from other types of data is the fact that spatial phenomena are frequently spatially autocorrelated. **Spatial autocorrelation** is the relationship between a variable of interest and itself when measured at different locations [@cliff_ad_and_ord_spatial_1973]. Any two samples are more likely to be correlated, or similar to each other, when they are closer in space than two samples taken farther apart. For example, if we measured air temperature at our current location, then walked 1 m away and measured it again, then walked 10 km and measured it again, we would expect that the temperatures sampled 1 m apart would be more similar than the temperatures measured 10 km apart. The precise degree and sign (positive or negative) of spatial autocorrelation will vary from phenomenon to phenomenon because some phenomena change more quickly over space than others. As far as we can tell, all spatial phenomena exhibit spatial autocorrelation at some, but not all, distances. In other words, we have never observed a natural spatial phenomenon that is truly random. Spatial autocorrelation provides a tremendous advantage for estimating statistics and characteristics of populations in space, but it also introduces several important caveats.
Consider the following example: **Figure \@ref(fig:10-spatial-autocorrelation)** shows a clustered pattern of square cells (representing some variable of interest), which indicates positive spatial autocorrelation (left), and a complete checkerboard distribution of cells, which indicates negative spatial autocorrelation (right).
:::: {.box-content .call-out-content}
::: {.box-title .call-out-top}
## Recall This {-}
:::
<p id="box-text">
## Domain
The study area from which a spatial sample is taken.
## Attributes
The information attached to the study objects that are spatially distributed in a domain. Often termed the variable of interest.
</p>
::::
```{r 10-spatial-autocorrelation, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Example of a positive (left) and negative spatial autocorrelation (right) for a give domain. Nepal, CC-BY-SA-4.0.")
n<- matrix(c(1:400), nrow=20, ncol=20)
# Create row indicator
df <- expand.grid(x=1:ncol(n),y=1:nrow(n))
df$val <- n[as.matrix(df[c('y','x')])]
df$val<-rep(c(1:5), each = 80)
# Subset odd rows
cols <- c("1" = "red", "2" = "blue", "3" = "darkgreen", "4" = "orange", "5"="black")
p <- ggplot(df)+ geom_tile(aes(x, y, fill =val),
colour = "black", width=1, height=1, size=1)+
scale_fill_gradient2(low=muted("blue"), high=muted("red"))+
scale_y_reverse() +
theme_classic() +
theme(legend.position = "none")+
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
############# negative#########
m <- matrix(c(1:625), nrow=25, ncol=25)
df3 <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df3$val <- m[as.matrix(df3[c('y','x')])]
odd_indexes<-seq(1,nrow(df3),2)
even_indexes<-seq(2,nrow(df3),2)
df3[odd_indexes, "val"] <- "0"
df3[even_indexes, "val"] <- "1"
q <- ggplot(df3)+ geom_tile(aes(x, y, fill = ifelse(val > 0,val, NA)),
colour = "black", width=1, height=1, size=1)+
#n <- df3[sample(nrow(df3), 40, replace = FALSE, prob = NULL),]
scale_y_reverse() +
theme_classic() +
theme(legend.position = "none")+
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
grids_bs <- plot_grid(p,q,ncol = 2, labels = c('Positive autocorrelation', 'Negative autocorrelation'), align = "h")
grids_bs
```
## Moran's I
**Moran's I** [@moran_notes_1950] is a correlation coefficient that measures the degree of spatial autocorrelation in an attribute of the data. It is based on the spatial covariance standardized by the variance of the data [@moran_notes_1950], and it is computed from a neighborhood list derived from a spatial weight matrix [@suryowati_comparison_2018]. The value of Moran's I ranges from -1 to 1, where 1 indicates perfect positive spatial autocorrelation, 0 indicates a random pattern, and -1 indicates perfect negative autocorrelation [@moran_notes_1950]. Moran's I is calculated using the following formula [@moran_notes_1950]:
$$
I = \frac{1}{s^2} \frac{\sum_{i}\sum_{j}w_{ij}(y_i-\bar{y})(y_j-\bar{y})}{\sum_{i}\sum_{j}w_{ij}}
$$
Where, $$I=\text{the Moran's I statistic}$$
$$y_i=\text{the value of the variable at location } i$$
$$y_j=\text{the value of the variable at location } j$$
$$s^2=\text{the variance of the variable}$$
$$w_{ij}=\text{the spatial weight between locations } i \text{ and } j$$
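To make the formula concrete, here is a minimal sketch that computes Moran's I directly from the definition for a toy one-dimensional transect with a hypothetical binary weight matrix (adjacent observations are neighbors); the values and weights are illustrative, not part of the case study data.
```{r 10-morans-i-sketch, warning=FALSE, message=FALSE}
# Toy transect of five values; adjacent observations are neighbors
y <- c(2, 3, 5, 7, 11)
n <- length(y)

# Binary spatial weight matrix w_ij: 1 for adjacent pairs, 0 otherwise
W <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

ybar <- mean(y)
s2 <- sum((y - ybar)^2) / n                # variance (1/n form)
num <- sum(W * outer(y - ybar, y - ybar))  # sum_ij w_ij (y_i - ybar)(y_j - ybar)
I <- (1 / s2) * num / sum(W)               # Moran's I
I
```
The same value can be cross-checked with `moran()` from the **spdep** package, which we use in the case study below.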
## Case Study 1
For this case study, we will use ground plot data from the Change Monitoring Inventory (CMI) program [for details: @province_of_bc_provincial_2018] for the Williams Lake and 100 Mile House timber supply areas (TSA) in the province of British Columbia, Canada. The Williams Lake TSA and 100 Mile House TSA are divided into 18 and 8 blocks, respectively (**Figure \@ref(fig:10-morans-I)**). There are a total of 456 CMI plots used in this study (**Figure \@ref(fig:10-morans-I)**). The total basal area ($m^2/ha$) is our variable of interest. For each polygon in the Williams Lake and 100 Mile House TSAs, total basal area was calculated by summing the basal area of the CMI plots in that polygon. For this part of the exercise, we want to understand whether there is any spatial relationship (autocorrelation) between the total basal areas measured in the TSA polygons. We will quantitatively measure the presence or absence of **spatial autocorrelation** using **Moran's I** and **Geary's C**.
```{r 10-morans-I-load-data-and-setup, echo=F, warning=F, message=F}
options(stringsAsFactors = FALSE)
filename <- readOGR(dsn="data/10",layer="Block_basa_area")
plots<- read.csv("data/10/CMI.csv",header=T)
```
```{r 10-morans-I, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("CMI plots location for williams lake TSA and 100 miles house TSA. Basal area has been calculated using the sum of the basal area for each plot (dot) over the given polygon. Nepal, CC-BY-SA-4.0.")
###Morans'I using spdep'#########################################
####################### Plot the data #######################
# note that you don't need to call maptools to run the code below but it needs to be installed.
# to add a north arrow and a scale bar to the map libraries: ggsn and mapproj
# set factors to false
### Convert the spatial data into a data frame ##########
filename_df <- tidy(filename)
# make sure the shapefile attribute table has an id column
filename$id <- rownames(filename@data)
# join the attribute table from the spatial object to the new data frame
filename_df <- left_join(filename_df,
filename@data,
by = "id")
will<- readOGR(dsn="data/10",layer="100_and_will")
will_df <- tidy(will)
# make sure the shapefile attribute table has an id column
will$id <- rownames(will@data)
# join the attribute table from the spatial object to the new data frame
will_df <- left_join(will_df,
will@data,
by = "id")
###################### Bringing the CMI plots over williams lake block and 100 miles block
##################### Creating the map using ggplot###################
t<-ggplot() +
geom_polygon(data = filename_df,
aes(x = long, y = lat, group = group,
fill =Basal ),color="black")+
geom_polygon(data = will_df,
aes(x = long, y = lat, group = group),alpha=0.005,color="red")+
geom_label(aes(x=400000,
y=5750000,label="Williams lake TSA"),
label.padding = unit(0.55, "lines"), # Rectangle size around label
label.size = 0.35,
color = "black",
fill="#69b3a2")+
geom_label(aes(x=620000,
y=5700000,label="100 miles house TSA"),
label.padding = unit(0.55, "lines"), # Rectangle size around label
label.size = 0.35,
color = "black",
fill="#69b3a2")+
geom_point(data=plots, aes(utm_eastin,utm_northi), inherit.aes = FALSE,
alpha = 0.5, size =1.5) + coord_equal()+
scale_fill_continuous(type="viridis",option="E")+
ggsn::north(filename_df, scale =0.15, location = "bottomleft")+
annotation_scale(line_width = 1.5,
height = unit(0.3, "cm"),text_cex =1.5)+
theme_bw()+
labs(x = "", y = "")+
theme(axis.text = element_blank(), axis.ticks = element_blank())+
guides(alpha=F)+
labs(fill="Basal_area")+
theme(legend.position = c(0.93,0.10), legend.direction = "vertical")+
theme(legend.key = element_rect(fill = "white", colour = "white"))+
theme(legend.background = element_rect(fill="grey",
size=0.6, linetype="solid",
colour ="white"))+
theme(legend.key.size = unit(0.3, "cm"))+
theme(legend.text=element_text(size=10),
legend.title=element_text(size=10))
t
```
## Calculating Moran's I {-}
We will calculate Moran's I for the basal area variable associated with each polygon.
## Using Contiguity {-}
**Define Neighborhood**
The Moran's I statistic is the correlation coefficient for the relationship between a variable and its surrounding values. But before we go about computing this correlation, we need a way to define a neighborhood. There are two common approaches: contiguity, which applies to spatial **polygon data**, and distance-based approaches, which apply to both spatial **point data** and polygon data. For polygon data, contiguity-based neighborhood selection is typically done using one of two widely used methods, known as the ***Rook's case*** and the ***Queen's case*** (**Figure \@ref(fig:10-using-contiguity)**).
```{r 10-using-contiguity, echo=FALSE, message=FALSE, warning=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Rook's (left) and Queen's (right) case for searching the neighborhood (grey unit) for the darker unit in the center. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:9), nrow=3, ncol=3)
df3 <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df3$val <- m[as.matrix(df3[c('y','x')])]
p<-ggplot(df3,aes(x=x, y=y, label=val)) +
geom_tile(data=df3, fill='transparent', colour = 'black') +
geom_rect(aes(xmin =1.5,xmax = 2.5,ymin = 1.5, ymax = 2.5),
fill = 'black',alpha=0.5,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 1.5,ymin = 2.5, ymax = 3.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =2.5,xmax = 3.5,ymin = 2.5, ymax = 3.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 1.5,ymin = 0.5, ymax = 1.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =2.5,xmax = 3.5,ymin = 0.5, ymax = 1.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
q<-ggplot(df3,aes(x=x, y=y, label=val)) +
geom_tile(data=df3, fill='transparent', colour = 'black') +
geom_rect(aes(xmin =1.5,xmax = 2.5,ymin = 1.5, ymax = 2.5),
fill = 'black',alpha=0.5,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 1.5,ymin = 2.5, ymax = 3.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =2.5,xmax = 3.5,ymin = 2.5, ymax = 3.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 1.5,ymin = 0.5, ymax = 1.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =2.5,xmax = 3.5,ymin = 0.5, ymax = 1.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 1.5,ymin = 1.5, ymax = 2.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =2.5,xmax = 3.5,ymin = 1.5, ymax = 2.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =1.5,xmax = 2.5,ymin = 2.5, ymax = 3.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =1.5,xmax = 2.5,ymin = 0.5, ymax = 1.5),
fill = 'black',alpha=0.02,color = "black",size = 2)+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
grids_bs <- plot_grid(p,q,ncol = 2, labels = c('Rooks contiguity', 'Queens contiguity'), align = "h")
grids_bs
```
**Step 1: Build Neighborhood**
Since we are working with polygons, we will use the Queen's contiguity to build the neighborhood list using the **poly2nb** function in the **spdep** package in R, and then plot the linkages.
```{r 10-using-contiguity-2, warning=FALSE, message=FALSE}
####################### Plot the data #######################
### Convert the spatial data into data frame ##########
filename_df <- tidy(filename)
# make sure the shapefile attribute table has an id column
filename$id <- rownames(filename@data)
# join the attribute table from the spatial object to the new data frame
filename_df <- left_join(filename_df,
filename@data,
by = "id")
#Searching neighborhood
w1 <- poly2nb(filename,row.names=filename$id,queen=T) ###### queens case
coords <- coordinates(filename)
plot(w1, coords, col="grey")
```
**Step 2: Getting a Spatial Weight Matrix for Neighborhood List**
```{r 10-using-contiguity-3, warning=FALSE, message=FALSE}
ww <- nb2listw(w1,style='B')
```
**Step 3: Calculate Moran's Correlation Coefficient Using the Spatial Weight Matrix for Neighbors**
```{r 10-using-contiguity-4, warning=FALSE, message=FALSE}
## calculating Moran's I
moran(filename$Basal, ww, n=length(ww$neighbours), S0=Szero(ww))
```
**Step 4: Conduct the Significance Test for the Calculated Moran's I Value**
```{r 10-using-contiguity-5, warning=FALSE, message=FALSE}
moran.test(filename$Basal, ww)
```
**Using K-Nearest Neighbors**
**Step 1: Select the 3 Nearest Neighbors Using a Distance-Based Approach**
We use the **knearneigh** function from the **spdep** package in R to find the neighbors, then build a spatial weights list.
```{r 10-using-contiguity-6, warning=FALSE, message=FALSE}
# Searching neighborhood
col.knn <- knearneigh(coords, k=3)
w<-knn2nb(col.knn,row.names = filename$id)
coords <- coordinates(filename)
plot(w, coords, col="grey")
```
**Step 2: Build a Spatial Weight Matrix for the Neighborhood List**
```{r 10-using-contiguity-7, warning=FALSE, message=FALSE}
#spatial weight
ww1 <- nb2listw(w,style='B')
#ww1
```
**Step 3: Calculate the Moran's I Coefficient**
```{r 10-using-contiguity-8, warning=FALSE, message=FALSE}
## Calculating Moran's I
moran(filename$Basal, ww1, n=length(ww1$neighbours), S0=Szero(ww1))
```
**Step 4: Significance Test for Calculated Moran's I**
```{r 10-using-contiguity-9, warning=FALSE, message=FALSE}
moran.test(filename$Basal, ww1)
```
Note that the value of Moran's I changed based on how we calculated the neighborhood list using the two different approaches, and the interpretation changes accordingly. With the contiguity-based neighbors, we found a negative value for I (-0.11), indicating weak negative spatial autocorrelation. When we run the significance test, we can see that the p-value > 0.05, indicating the autocorrelation is not significant. Using the nearest-neighbor approach, we found that I (0.017) indicated weak positive spatial autocorrelation. One reason for the difference is that the k-nearest neighbor approach can reach polygons at greater distances and so can include more polygons than contiguous neighbors defined by either Queen's or Rook's contiguity [@suryowati_comparison_2018].
## Geary's C
Another, more local measure of spatial autocorrelation is **Geary's C** [@geary_contiguity_1954]. While Moran's I standardizes the spatial autocovariance by the variance of the data, Geary's C uses the sum of squared differences between pairs of data values as its measure of covariance [@geary_contiguity_1954]. Both statistics depend on the spatial structure of the data, specified by a neighborhood and its spatial weights matrix. The value of Geary's C ranges from 0 to some unbounded positive value, where 1 indicates spatial randomness, values less than 1 indicate positive spatial autocorrelation, and values greater than 1 indicate negative spatial autocorrelation [@geary_contiguity_1954]. It is calculated using the following formula:
$$ C=\frac{(n-1) \sum_{i}^n\sum_{j}^nw_{ij}(y_i-y_j)^2}{2\sum_{i}^n\sum_{j}^nw_{ij}\sum_{i}(y_{i}-\bar{y})^2}$$
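As with Moran's I, a minimal sketch with a toy transect and a hypothetical binary weight matrix (illustrative values only, not the case study data) shows how the formula works:
```{r 10-gearys-c-sketch, warning=FALSE, message=FALSE}
# Same style of toy transect and adjacency weights as the Moran's I sketch
y <- c(2, 3, 5, 7, 11)
n <- length(y)
W <- matrix(0, n, n)
for (i in 1:(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

num <- (n - 1) * sum(W * outer(y, y, function(a, b) (a - b)^2))
den <- 2 * sum(W) * sum((y - mean(y))^2)
num / den   # Geary's C: < 1 positive, 1 random, > 1 negative autocorrelation
```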
**Steps 1 and 2: Build the neighborhood list and spatial weight matrix exactly as we did for Moran's I**
**Step 3: Use the geary function from the "spdep" package to calculate Geary's C**
***For the Queen's Case***
```{r 10-gearys-c, warning=FALSE, message=FALSE}
## Geary C
geary(filename$Basal, ww, n=length(ww$neighbours),n1=length(ww$neighbours)-1, S0=Szero(ww))
## Significance test for Geary C
geary.test(filename$Basal, ww)
```
***For the Nearest Neighbor Method***
```{r 10-gearys-c-2, warning=FALSE, message=FALSE}
## Geary C
geary(filename$Basal, ww1, n=length(ww1$neighbours),n1=length(ww1$neighbours)-1, S0=Szero(ww1))
## Significance test for Geary C
geary.test(filename$Basal, ww1)
```
Note that the value of Geary's C indicated positive spatial autocorrelation using both the Queen's case and the k-nearest neighbor approach. However, the spatial autocorrelation was not significant, as indicated by a p-value > 0.05. In terms of significance, Moran's I and Geary's C are in agreement.
## Populations, Samples and Statistics
Suppose you are interested in measuring the heights of trees in a forest. There are a lot of trees in this forest and you cannot measure all of them. So you cleverly decide to sample some locations and infer the mean tree height of the forest from this sample. In this example, the trees that you measured are the __sample__ $n$, the forest is the total __population__ of trees $N$, and the __statistic__ that you are interested in estimating is the mean tree height of the forest:
$$
x̄ = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
Lowercase $x$ represents any sampled tree and $x̄$ represents the mean of the __sample__, which is an __estimate__ of the __population__ mean. It is important to recognize that statistics of samples and populations are discussed and represented differently. The true mean tree height of the forest is denoted by the Greek lowercase mu $μ$, and we can never know this value without measuring every tree. So instead, we __infer__ the population statistic using the sampled data and hope that the sample statistic will be close to the true population statistic, but we must accept that they are rarely equivalent $x̄≠μ$. The same equation above can be re-written for the population mean:
$$
μ = \frac{1}{N} \sum_{i=1}^{N} X_i
$$
Notice here we are using $μ$ to signify the population mean tree height, uppercase $N$ signifies all the trees in the population, and uppercase $X_i$ is one tree in the population. At this point, you might take a moment to appreciate that as our sample $n$ approaches $N$, our sample mean $x̄$ should also approach the population mean $μ$. In other words, the more trees we sample, the more likely we are going to have an accurate estimate of $μ$. It might surprise you that this conclusion is likely true for large samples $n$, but not small samples $n$. Why? Because any randomly-selected tree from our forest $n=1$ could be much taller or shorter than the mean just due to random chance alone. In fact, if our forest has two distinct cohorts of trees, one very young emerging below the canopy and one established in the overstory (i.e., bi-modal distribution), then the population mean might never be approximated by any particular tree. This is problematic for us, because we want to accurately estimate the population statistic whilst exhausting as few of our resources as possible to do so (i.e., time, people, equipment, budget).
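A small simulation (synthetic data, for illustration only) shows both points: with a bi-modal forest, the sample mean gets closer to the population mean as $n$ grows, while a very small sample can land far from it:
```{r 10-sample-mean-sketch, warning=FALSE, message=FALSE}
set.seed(42)
# Synthetic bi-modal forest: a young cohort (~5 m) and an overstory (~30 m)
heights <- c(rnorm(5000, mean = 5, sd = 1),
             rnorm(5000, mean = 30, sd = 4))
mu <- mean(heights)   # the "true" population mean

# Sample means for increasing sample sizes
for (n in c(5, 50, 500)) {
  xbar <- mean(sample(heights, n))
  cat("n =", n, "  sample mean =", round(xbar, 2),
      "  population mean =", round(mu, 2), "\n")
}
```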
When we made our measures of tree heights, it would be reasonable for us to expect that the tree heights that we sampled are going to be different (i.e., most trees will not have the same height). The magnitude of those differences in the sample are captured by the **variance**, which is a measure of the dispersion of our tree heights $x$ relative to the sample mean $x̄$:
$$
s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i-x̄)^2
$$
We square the differences so that the variance is always positively signed and consequently variance always has squared units, hence $s^2$. If we take the square root of the variance, we have the **standard deviation**:
$$
s = \sqrt{s^2}
$$
The population standard deviation is represented by Greek lowercase sigma $σ$ and population variance is $σ^2$.
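A quick check of these formulas in R (toy values, for illustration); note that base R's `var()` and `sd()` divide by $n-1$ rather than $n$, so they differ slightly from the $1/n$ form shown above:
```{r 10-variance-sketch, warning=FALSE, message=FALSE}
x <- c(12.1, 14.3, 9.8, 15.0, 11.6)   # sampled tree heights (m)
xbar <- mean(x)
s2 <- sum((x - xbar)^2) / length(x)   # sample variance, 1/n form
s <- sqrt(s2)                         # sample standard deviation
c(variance = s2, sd = s)
```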
## Significance Testing
The significance level of a test statistic from the sample is compared with the critical value of the test statistic obtained by repeatedly reordering the data with random proportions [@jacquez_spatial_1999]. Interpretation of this significance level in classical statistics is done using a **p-value** which, if less than or equal to a defined threshold, also known as **alpha α** (usually 5% or 0.05), leads us to reject the **null hypothesis** of "no difference from random" [@jacquez_spatial_1999]. In other words, when the p-value is less than or equal to 0.05, we conclude that the statistic we observe for the sample is unlikely to have occurred due to random chance alone. This interpretation is important because, recall, if our sample is small, there is a random chance that any statistic we calculate from it reflects that particular sample rather than a true characteristic of the larger population. For example, imagine the case where we randomly sample some trees, but by chance we happen to only sample young trees, and so our mean tree height from the sample is not representative of the population. The p-value represents the probability that our statistic from the sample is observed due to random chance, so we want this value to be as low as possible.
While $α=0.05$ is widely used, there are cases where we may set a stricter standard; for example, where the risk of incorrectly rejecting the null hypothesis, known as a "false positive" or **Type I error**, could have significant consequences for human health or safety. Consider the case where a new drug is being tested for efficacy and the p-value in the clinical trial is 0.05, so the null hypothesis that the placebo is better than the drug is rejected. But there is still a 5% probability that there is no difference between the new drug and a placebo. The implication is that some of the people who take the new drug may not experience any benefits. If the population eligible for this drug is large, this can translate to many people being impacted by this decision. You can see how this can be problematic because there might be alternative therapies or drugs that are more effective than this new drug that the population is not utilizing. The opposite case is where we believe that the null hypothesis is true, but it is in fact false, known as a "false negative" or **Type II error**. When we want to be sure that our result is not spurious, the alpha threshold may be tightened to 0.01 or even 0.001, so that the chance of a false positive falls to one in a hundred or one in a thousand.
## Classical Statistics vs. Geostatistics
In classical statistics, the variance is assumed to be totally random when samples are drawn from a defined population [@jacquez_spatial_1999] and are assumed to come from one distribution [@steel_principles_1980]. Inferences about the population can be made by comparing a statistic calculated for the sample to the distribution of that statistic under the null hypothesis for the assumed distribution [@jacquez_spatial_1999].
In geostatistics, the variance of a spatial phenomenon is assumed to be partly random, and each point in the field represents a sample from some distribution [@jacquez_spatial_1999]. However, the distribution at any one point may differ completely from all other points in its shape, mean, and variance. The distribution of differences in sample values separated by a specified distance is assumed to be the same over the entire field. Sample values that exhibit spatial autocorrelation due to their proximity to each other will have a relatively small random variance of the distribution of differences. If the sample values do not exhibit spatial autocorrelation, then the variance is larger. The **semivariance**, or half of the variance, is used to measure the similarity between points at a given distance apart. The distance at which we compare the semivariance of sample values is known as the **lag distance**. If we graph semivariance against lag distance, we have created a **semivariogram**. Since we are usually dealing with samples and not the full population, we will almost never know the exact semivariogram; instead, we must rely on fitting models to __estimate__ this relationship.
Thus, geostatistics is distinguished from classical statistics by the fact that we need to estimate the semivariogram function and then incorporate the semivariogram function to estimate the values at unsampled locations using various predictive spatial methods.
## Semivariogram Modeling
The semivariogram is a basic geostatistical tool for measuring the spatial autocorrelation of a variable measured at different spatial locations. The semivariance is a measure of the variability of a variable between two locations separated by a certain lag distance [@isaaks_introduction_1989]. For example, we can measure how a variable $y$ changes in value between sites $i$ and $j$ by calculating the difference between the site values $y_i-y_j$. If the surface represented by the two points is continuous and the lag is a small distance, we would expect this difference to be small [@isaaks_introduction_1989]. With increasing lag distance, we would expect the variability to increase. Let us translate this intuitive statement into an equation known as the empirical variogram:
$$\gamma{(h)}=\frac{1}{2N(h)}\sum_{i,j=1}^{N(h)}{({y_i}-y_j)}^2$$
Where, $$\gamma{(h)}=\text{the semivariance at a spatial lag } h$$
$$i, j=\text{indices of two sampled locations separated by the lag } h$$
$$y_{i}=\text{measured value of the variable of interest at location } i$$
$$y_{j}=\text{measured value of the variable of interest at location } j$$
$$N(h)=\text{the number of sampled pairs separated by the lag } h$$
Like the familiar variance, it is a sum of squared differences divided by the number $N(h)$ of sampled pairs. Unlike the simple variance about a mean that we discussed earlier, the semivariogram measures the difference between two sample values. The __semi__ in semivariogram comes from the fact that the variance is divided in half.
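A minimal sketch (a toy one-dimensional transect with observations one unit apart, illustrative values only) computes $\gamma(h)$ directly from the formula:
```{r 10-empirical-variogram-sketch, warning=FALSE, message=FALSE}
y <- c(4, 6, 5, 8, 9, 7, 10, 12)   # values along a transect, 1 unit apart

# Semivariance at lag h: half the mean squared difference of pairs h apart
semivariance <- function(y, h) {
  d <- y[(1 + h):length(y)] - y[1:(length(y) - h)]
  sum(d^2) / (2 * length(d))
}

sapply(1:4, function(h) semivariance(y, h))   # gamma(h) for lags 1 to 4
```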
A semivariogram is a graph with semivariance on the y-axis and lag distance on the x-axis (Figure \@ref(fig:10-semivariogram-modelling)). There are some important features of a semivariogram that can be used to interpret the nature and structure of spatial autocorrelation.
```{r 10-semivariogram-modelling, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
## An example semivariogram with all the components using the "Fulmar" [see details @pebesma_mapping_2005] data from "gstat" package in R.
fig_cap <- paste0("An example semivariogram with all the components using the 'Fulmar'data from 'gstat' package in R. Nepal, CC-BY-SA-4.0.")
data(fulmar)
fulmar.spdf <- SpatialPointsDataFrame(cbind(fulmar$x,fulmar$y),
fulmar)
fulmar.spdf <- fulmar.spdf[fulmar.spdf$year==1999,]
proj4string(fulmar.spdf) <- CRS("+init=epsg:32631")
evgm <- variogram(fulmar~1,fulmar.spdf,
boundaries=seq(0,250000,l=51))
fvgm <- fit.variogram(evgm,vgm(3,"Sph",100000,1))
preds = variogramLine(fvgm, maxdist = max(evgm$dist))
g<-ggplot(evgm,aes(x=dist,y=gamma))+geom_point()+
geom_line(data = preds)+
geom_segment(aes(x=110000,y=0,xend=110000,yend=13.1)) +
geom_segment(aes(x = 0, y = 1.8, xend = 110000, yend = 1.8))+
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1.8))+
geom_label(aes(x=50000,
y=2,label="Range"),
label.padding = unit(0.55, "lines"), # Rectangle size around label
label.size = 0.35,
color = "black",
fill="#69b3a2")+
geom_label(aes(x=110000,
y=7.5,label="Sill"),
label.padding = unit(0.55, "lines"), # Rectangle size around label
label.size = 0.35,
color = "black",
fill="#69b3a2")+
geom_label(aes(x=0,
y=0.8,label="Nugget"),
label.padding = unit(0.20, "lines"), # Rectangle size around label
label.size = 0.15,
color = "black",
fill="#69b3a2")+
labs(x = "lag distance (h)", y = "Semi-variance")+
theme_bw()
g
```
**Range $a$:** The distance at which a variogram model first flattens out. This is the distance up to which the variable is considered to be spatially autocorrelated (Figure \@ref(fig:10-semivariogram-modelling)). For variogram models that are bounded and approach their sills only asymptotically, the range is generally taken as the distance at which the model reaches 95% of the sill.
**Nugget $c_0$:** Nugget refers to unaccounted autocorrelation due to a smaller lag distance than sampling distance or due to errors and imprecision arising from sampling (Figure \@ref(fig:10-semivariogram-modelling)). The nugget is also where the variance function $\gamma{(h)}$ intercepts the y-axis. Without the nugget, we would expect all variogram models to evaluate to zero variance at zero lag $\gamma{(0)}=0$, and conceptually this makes sense because how can you have variance between any observation and itself?
**Sill $s$:** The value of variance that a variogram model attains at a given range (Figure \@ref(fig:10-semivariogram-modelling)).
$$
s=c_0+c_1
$$
**Partial sill $c_1$:** The sill minus the nugget.
$$
c_1=s-c_0
$$
**Partial sill to total sill ratio:** The structural variance explained by the fitted variogram model [@rossi_geostatistical_1992]. This is the amount of variance that is spatially autocorrelated [@rossi_geostatistical_1992].
$$ Ratio =\frac {c_1}{c_0+c_1}$$
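For example, with an illustrative nugget of $c_0=5$ and partial sill of $c_1=20$, the sill and the partial-sill-to-sill ratio follow directly from the definitions above:
```{r 10-variogram-components-sketch, warning=FALSE, message=FALSE}
c0 <- 5          # nugget (illustrative value)
c1 <- 20         # partial sill (illustrative value)
s <- c0 + c1     # sill
ratio <- c1 / s  # proportion of the variance that is spatially structured
c(sill = s, ratio = ratio)
```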
Usually, we are interested in modeling the semivariance (semivariogram) or variance (variogram) of a process so that we can make predictions from a model rather than an incomplete set of sampled observations. In this way, we can fit a model to our paired observations and use a continuous variogram function to estimate the variance at any lag. These models are known as **theoretical variograms** and in the following sections we will compare several commonly-used models, summarized in Figure \@ref(fig:10-theoretical-variogram-models).
```{r 10-theoretical-variogram-models, fig.cap = fig_cap, out.width= "75%", echo = FALSE}
vv <- rev(seq(0,10,0.01))
## Linear variogram
vv_linear <- rev(vv/max(vv))
#plot(vv_linear,type="l")
## Quadratic variogram
qua <- function(x) {
xx <- (max(x)*(x^2))/(1+(x^2))
xx <- xx/max(xx)
xx
}
vv_quadratic <- rev(qua(vv))
#plot(vv_quadratic),type="l")
## Exponential variogram
vv_exponential <- (max(exp(vv))-exp(vv))/max(exp(vv))
#plot(vv_exponential,type="l")
## Gaussian variogram
gau <- function(x) {
xx <- 1-exp(-(x/max(x))^2)
xx <- xx/max(xx)
xx
}
vv_gaussian <- rev(gau(vv))
#plot(vv_gaussian,type="l")
## Spherical variogram
sph <- function(x) {
xx <- ((3/2)*(x/max(x)))-((1/2)*((x/max(x))^3))
xx <- xx/max(xx)
xx
}
vv_spherical <- rev(sph(vv))
#plot(vv_spherical,type="l")
## Power variogram
pow <- function(x,a) {
xx <- x^a
xx <- xx/max(xx)
xx
}
vv_power_05 <- rev(pow(vv,0.5))
vv_power_15 <- rev(pow(vv,1.5))
#plot(vv_power_15,type="l")
#png("./images/10-theoretical-variogram-models.png",width=1800,height=1800,res=300)
#par(mar=c(4,4,1,1))
#colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
#plot(vv,type="n",xlim=c(1,1000),ylim=c(0,1),xlab="Lag (h)",ylab="Variance λ(h)",xaxt="n",yaxt="n")
#axis(1,at=c(0,1000),label=c("0","Range (a)"))
#axis(2,at=c(0,1),label=c("0","Sill (s)"))
#vv_nugget <- vv_linear*0+1
#vv_nugget[1] <- 0
#lines(vv_nugget,lwd=2,col=colors[1])
#lines(vv_linear,lwd=2,col=colors[2])
#lines(vv_quadratic,lwd=2,col=colors[3])
#lines(vv_exponential,lwd=2,col=colors[4])
#lines(vv_gaussian,lwd=2,col=colors[5])
#lines(vv_spherical,lwd=2,col=colors[6])
#lines(vv_power_05,lwd=2,col=colors[7])
#lines(vv_power_15,lwd=2,col=colors[8])
#legend(650, 0.4, legend=c("Nugget" ,"Linear", "Quadratic", "Exponential", "Gaussian", "Spherical", "Power (a=0.5)", "Power (a=1.5)"), col=colors, lty=1, lwd=2, cex=1, title="Variogram Models")
#dev.off()
fig_cap <- paste0("Comparison of theoretical variogram models. The upper bound of the variance in the plot is the sill for variogram models that have sills and the upper limit of the lag in the plot is the range for variogram models that have ranges. Note that the linear and power models are not bounded and extend infinitely beyond the plot space. The exponential, quadratic, and Gaussian models approach the sill asymptotically as lag distances approach infinity. Pickell, CC-BY-SA-4.0.")
knitr::include_graphics("images/10-theoretical-variogram-models.png")
```
### Nugget variogram model
A **nugget model** is the simplest model that assumes there is no relationship between variance and lag and consequently there is no range for this model. In other words, the variance is assumed to be constant. This is also the least likely case for most natural spatial phenomena. If you used this model to make spatial predictions, then your predictions would be made as if your data had no spatial component at all, so this is also the least useful model for spatial prediction. The nugget model is expressed as:
$$
\begin{array}{ccc}
\gamma{(h)} = 0 & \text{for }h=0, \\
\gamma{(h)} = c_0 & \text{ for }h>0
\end{array}
$$
### Linear variogram model
A **linear model** assumes a linear and constant slope $b$ between variance and lag. Like the nugget model, the linear model has no defined range or sill. Unlike the nugget model, the variance is assumed to grow without bound, and therefore it is not possible to distinguish between correlated and uncorrelated lags. Note that the special case of $b=0$ is equivalent to the nugget model. A linear effect in spatial autocorrelation suggests a trend in your spatial data. The linear model is expressed as:
$$
\gamma{(h)} = bh+c_0
$$
Where $b$ is the slope to be estimated.
### Quadratic variogram model
A **quadratic or logistic model** assumes variance increases near-exponentially, but the shape is more "S"-shaped or sigmoid. The quadratic model is expressed as:
$$
\gamma{(h)} = \frac{ah^2}{1+bh^2}+c_0
$$
Where $a$ and $b$ are parameters to be estimated.
### Exponential variogram model
An **exponential model** assumes spatial autocorrelation decreases exponentially with increasing lag distance. The variance in the exponential variogram approaches the sill asymptotically as $h → ∞$, so unlike the spherical and nugget models, an exponential model never reaches a constant variance beyond the range. Since the variance continues to increase marginally with infinite lag distance, the practical range is defined as the lag distance at which the model reaches 95% of the sill, which occurs at $3a$; lag distances $h$ beyond this are considered not spatially autocorrelated for the given variable. The exponential model is expressed as:
$$
\gamma{(h)} = c_1(1-e^{-\frac{h}{a}})+c_0
$$
### Gaussian variogram model
A **Gaussian model** assumes that spatial autocorrelation is extremely high at short lag distances (i.e., variance is low) and then falls quickly towards zero (i.e., variance is high) at farther lag distances. This is a good model to use if you expect high local autocorrelation or phenomena that change rapidly with distance. The practical range is defined as the lag distance at which the model reaches 95% of the sill, which occurs at $\sqrt{3}a$; lag distances $h$ beyond this are considered not spatially autocorrelated for the given variable. The Gaussian model is expressed as:
$$
\gamma{(h)} = c_1(1-e^{-(\frac{h}{a})^2})+c_0
$$
### Spherical variogram model
A **spherical model** assumes a progressive, but not constant, decrease of spatial autocorrelation (i.e., increasing variance) until the range, beyond which autocorrelation is zero and variance remains constant. Note that the slope of a spherical model is generally steeper than a linear model where $h<a$. The spherical model is one of the most commonly used models because of its assumption of zero autocorrelation beyond the range $h>a$. The spherical model is expressed as:
$$
\begin{array}{ccc}
\gamma{(h)} = c_1(\frac{3}{2}\frac{h}{a}-\frac{1}{2}(\frac{h}{a})^3)+c_0 & \text{for }0<h≤a, \\
\gamma{(h)} = s = c_0 + c_1 & \text{ for }h>a
\end{array}
$$
### Power variogram model
A **power model** modulates the variance by raising the lag to a power $λ$:
$$
\begin{array}{ccc}
\gamma{(h)} = bh^λ+c_0 & \text{for }0<λ<2, \\
\gamma{(h)} = bh+c_0 & \text{for }λ=1
\end{array}
$$
The power model has no range or sill that can be estimated, and the variance process is considered to be infinite, the same as the linear model. Notably, $λ=1$ is equivalent to the linear model.
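In the **gstat** package, these theoretical models can be specified with `vgm(psill, model, range, nugget)`, where `psill` is the partial sill $c_1$ and `nugget` is $c_0$; for the power model, the `range` slot holds the exponent. The parameter values below are illustrative placeholders, not fitted values:
```{r 10-vgm-models-sketch, warning=FALSE, message=FALSE}
library(gstat)

vgm(model = "Nug")                                        # nugget
vgm(psill = 1, model = "Exp", range = 300, nugget = 0.1)  # exponential
vgm(psill = 1, model = "Gau", range = 300, nugget = 0.1)  # Gaussian
vgm(psill = 1, model = "Sph", range = 300, nugget = 0.1)  # spherical
vgm(psill = 1, model = "Pow", range = 0.5)                # power with exponent 0.5

head(vgm())   # table of model codes that gstat understands
```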
### Case Study: Selecting the appropriate variogram model {-}
We will look at some different models for estimating the semivariogram. We will use ground plot data from the young stand monitoring (YSM) program [@province_of_bc_provincial_2018] for the Fort Saint Johns Timber Supply Area (TSA) in the province of British Columbia, Canada. Fort Saint Johns is divided into six blocks (Figure \@ref(fig:10-FSJ-plot)). There are a total of 108 YSM plots used in this study (Figure \@ref(fig:10-FSJ-plot)). The total basal area ($m^2/ha$) is the variable that we want to predict spatially. For each of the YSM plots, we will calculate the total basal area by adding the basal area of all trees within the plot. We will explore different variogram models with these data and check which model provides the best fit.
```{r 10-FSJ-plot, echo=FALSE, fig.cap=fig_cap, message=FALSE, warning=FALSE}
fig_cap <- paste0("Location of Fort Saint Johns Timber Supply Area (TSA) in British Columbia (BC), Canada and the young stand monitoring (YSM) plots. Nepal, CC-BY-SA-4.0.")
knitr::include_graphics("images/10-FSJ-plot.png")
```
```{r 10-gaussian, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Semivariogram comparing different variogram models for basal area (m2/ha) of the young stand monitoring plots. Nepal, CC-BY-SA-4.0.")
data<-read.csv("data/10/FSJ.csv",header=T)
## summarize the basal area at plot level
data1<- data %>%
group_by(utm_easting,utm_northing) %>%
summarise(Basal=sum(baha_L))
coordinates(data1)= ~ utm_easting+utm_northing
## Model formula
TheVariogram <- variogram(Basal~1, data=data1)
## Initiating the parameters for the variogram, starting search window
nugmodel <- vgm(model="Nug")
linmodel <- vgm(model="Lin")
expmodel <- vgm(model="Exp")
gaumodel <- vgm(model="Gau")
sphmodel <- vgm(model="Sph")
pow05model <- vgm(model="Pow", range=0.5)
pow15model <- vgm(model="Pow", range=1.5)
## Fitting the variogram model
nugfit <- fit.variogram(TheVariogram, model=nugmodel)
linfit <- fit.variogram(TheVariogram, model=linmodel)
expfit <- fit.variogram(TheVariogram, model=expmodel)
gaufit <- fit.variogram(TheVariogram, model=gaumodel)
sphfit <- fit.variogram(TheVariogram, model=sphmodel)
pow05fit <- fit.variogram(TheVariogram, model=pow05model)
pow15fit <- fit.variogram(TheVariogram, model=pow15model)
## Predicting the variogram
nugpreds <- variogramLine(nugfit, maxdist=max(TheVariogram$dist))
linpreds <- variogramLine(linfit, maxdist=max(TheVariogram$dist))
exppreds <- variogramLine(expfit, maxdist=max(TheVariogram$dist))
gaupreds <- variogramLine(gaufit, maxdist=max(TheVariogram$dist))
sphpreds <- variogramLine(sphfit, maxdist=max(TheVariogram$dist))
pow05preds <- variogramLine(pow05fit, maxdist=max(TheVariogram$dist))
pow15preds <- variogramLine(pow15fit, maxdist=max(TheVariogram$dist))
## Plot
colors <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
plot(TheVariogram$dist/1000,TheVariogram$gamma,xlim=c(0,100),ylim=c(-1000,3500),xlab="Lag h (km)",ylab="Semivariance γ(h)",type="n")
lines(nugpreds$dist/1000,nugpreds$gamma,lwd=2,col=colors[1])
lines(linpreds$dist/1000,linpreds$gamma,lwd=2,col=colors[2])
#lines(vv_quadratic,lwd=2,col=colors[3])
lines(exppreds$dist/1000,exppreds$gamma,lwd=2,col=colors[4])
lines(gaupreds$dist/1000,gaupreds$gamma,lwd=2,col=colors[5])
lines(sphpreds$dist/1000,sphpreds$gamma,lwd=2,col=colors[6])
lines(pow05preds$dist/1000,pow05preds$gamma,lwd=2,col=colors[7])
lines(pow15preds$dist/1000,pow15preds$gamma,lwd=2,col=colors[8])
points(TheVariogram$dist/1000,TheVariogram$gamma,pch=20)
```
Just looking at the variograms, it appears that all four models fit our data well, indicating that there is a strong spatial correlation in basal area per hectare of live trees between the plots. However, we will use all the components of the semivariogram models to pick the best-fitting variogram.
```{r 10-circular-2, echo=FALSE, warning=FALSE, message=FALSE, tab.cap = FALSE}
Model<-c("Circular", "Gaussian", "Spherical", "Exponential")
Range<-c("7433.68", "4446.62","9729.37", "3871.76")
Nugget<-c("552.71","0.00","14.13","0.00")
Sill<-c("2338.27","2995.05","2980.05","2994.05")
Partial_sill<-c("1785.56","4446.62","2966.37","2994.05")
Sill_to_Sill<-c("0.76","1.00","0.99","1.00")
summary<-data.frame(Model,Range,Nugget,Sill,Partial_sill,Sill_to_Sill)
knitr::kable(summary, caption="Summary of the variogram components for the four fitted models.")
```
We can see that the partial-sill-to-sill ratio is highest for the Gaussian and exponential semivariograms. This indicates that these models attribute the largest share of the total semivariance to spatial autocorrelation. Similarly, both models indicate spatial autocorrelation in basal area over a very short range. Since, with an exponential semivariogram, autocorrelation only disappears at an infinite distance, it is usually better to pick a Gaussian model in cases like this example.
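One defensible way to choose among candidate variogram models is cross-validation. The hedged sketch below is self-contained: it uses the classic `meuse` data shipped with the **sp** package rather than the YSM data, and compares cross-validated kriging errors across three fitted models; the model with the lowest root-mean-square error would be preferred.
```{r 10-variogram-cv-sketch, warning=FALSE, message=FALSE}
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- ~x + y

set.seed(1)
ev <- variogram(log(zinc) ~ 1, meuse)        # empirical semivariogram
fits <- list(Exp = fit.variogram(ev, vgm(model = "Exp")),
             Gau = fit.variogram(ev, vgm(model = "Gau")),
             Sph = fit.variogram(ev, vgm(model = "Sph")))

# 5-fold cross-validated RMSE of ordinary kriging under each model
sapply(fits, function(m) {
  cv <- krige.cv(log(zinc) ~ 1, meuse, model = m, nfold = 5, verbose = FALSE)
  sqrt(mean(cv$residual^2))
})
```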
## Sampling
**Sampling** can be defined as the process of selecting some part of a population in order to make an inference and estimate some parameters about the whole population [@thompson_sampling_2012]. For example, to estimate the amount of biomass of trees in the University of British Columbia Malcolm Knapp Research Forest, scientists collect data on tree size and height from 100 randomly distributed small plots across the forest. Based on allometric equations that relate these tree size measures to the mass of different species, the biomass of the entire Malcolm Knapp Research Forest can be estimated. Similarly, to estimate the amount of recoverable oil in a region, a few (highly expensive) sample holes can be drilled (example adapted from @thompson_sampling_2012). The situation is similar in a national opinion survey, in which only a sample of the people in the population are contacted, and the opinions in the sample are used to estimate the proportions of the various opinions in the whole population (example adapted from @thompson_sampling_2012).
Sampling should not be confused with an observational study. In an observational study, one has little-to-no control over the inclusion of units in the study, whereas sampling usually follows a well-defined protocol for defining both the population under study and the inclusion criteria of units to sample [@thompson_sampling_2012]. Broadly, sampling can be categorized into two groups [@teddlie_mixed_2007]:
1. Probability sampling
2. Non-probability sampling
Before getting into the details of the different types of sampling, we will familiarize ourselves with some key sampling terms and their definitions.
:::: {.box-content .call-out-content}
::: {.box-title .call-out-top}
## Recall This {-}
:::
<p id="box-text">
## Population
Any large, spatially defined entity of plots, people, trees, animals, etc., from which samples are drawn and measurements of certain characteristics are taken.
## Sampling Design
The procedure by which the sample of units is selected from the population is called the sampling design.
## Sampling Unit
The smallest entity within a population from which information about the population is drawn is known as the sampling unit. For example, in a survey of potential internet users across all of BC, the sampling unit could be a certain number of households in each city across BC.
</p>
::::
## Probability Sampling
Probability sampling techniques are mostly used in studies that involve extensive quantitative analysis [@tashakkori_sage_2010]. Probability sampling involves selecting a large number of units from a population such that the probability of inclusion for every member of the population is determinable [@tashakkori_sage_2010].
## Simple Random Sampling
In simple random sampling, each sampling unit within a given population has an equal probability of being selected in a sample [@thompson_sampling_2012]. For example, suppose we would like to measure the tree heights of all the trees in a simple random sample of 60 plots, with spatial locations given by the plot center coordinates, from a forest divided into 625 spatially defined plots, as in **Figure \@ref(fig:10-simple-random-sampling)**. Notice that there is no distinctive pattern in how plots are selected for the measurement of tree heights; this is the **random** part of simple random sampling.
As investigators, when we make a sequence of selections from a population, at each step a new and distinct set of sampling units is selected into the sample, each unit having an equal probability of being selected at each step.
For example, when we take another sample of 60 plots (**Figure \@ref(fig:10-simple-random-sampling-2)**), we can see that different **sampling units (plots)** are selected from the ones we obtained before; this reflects the **equal probability** of each sampling unit being selected. A minimal sketch of this idea in R follows.
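This sketch draws two independent simple random samples of 60 plot IDs from a population of 625 (illustrative IDs only) and shows that they differ:
```{r 10-srs-sketch, warning=FALSE, message=FALSE}
set.seed(1)
plot_ids <- 1:625                     # one ID per plot in the population

srs_a <- sample(plot_ids, size = 60)  # without replacement by default
srs_b <- sample(plot_ids, size = 60)  # an independent second draw

length(intersect(srs_a, srs_b))       # the two samples overlap only by chance
```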
```{r 10-simple-random-sampling, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Simple random sample of 60 units from a population of 625 units. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:625), nrow=25, ncol=25)
df <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df$val <- m[as.matrix(df[c('y','x')])]
n <- df[sample(nrow(df), 60, replace = FALSE, prob = NULL),]
p<-ggplot(n,aes(x=x, y=y, label=val)) +
geom_tile(data=df, fill='transparent', colour = 'black') +
geom_tile(data=n,fill='black')+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
```{r 10-simple-random-sampling-2, echo=FALSE,warning=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Another simple random sample of 60 units. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:625), nrow=25, ncol=25)
df2 <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df2$val <- m[as.matrix(df[c('y','x')])]
n <- df2[sample(nrow(df2), 60, replace = FALSE, prob = NULL),]
p<-ggplot(n,aes(x=x, y=y, label=val)) +
geom_tile(data=df2, fill='transparent', colour = 'black') +
geom_tile(data=n,fill='black')+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
## Stratified Random Sampling
When the population under study is not **homogeneous** (similar in biological characteristics) across the entire study area and instead follows some sort of gradient, a stratified random sampling method is used [@thompson_sampling_2012]. The principle of stratification is to partition the population in such a way that the units within a stratum are as similar as possible [@teddlie_mixed_2007]. Random samples are drawn from each stratum to ensure adequate sampling of all groups [@teddlie_mixed_2007]. Even though one stratum may differ markedly from another, a stratified sample with the desired number of units from each stratum in the population will tend to be "representative" of the population as a whole [@howell_area_2020].
For example, a forest under study is divided **(stratified)** into similar regions (Figures \@ref(fig:10-stratified-random-sampling) and \@ref(fig:10-stratified-random-sampling-2)) defined by elevation, soil moisture, and soil nutrient gradients, and random samples are taken within each stratum. Stratification of a study region, regardless of its size, can help to spread the sample over the entire study area. A minimal sketch of stratified sampling in R follows.
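This sketch stratifies a synthetic population of 400 plots into three illustrative strata and draws a fixed number of plots at random within each, using **dplyr** (loaded above; `slice_sample()` requires dplyr ≥ 1.0):
```{r 10-stratified-sketch, warning=FALSE, message=FALSE}
set.seed(1)
strat_plots <- data.frame(id = 1:400,
                          stratum = rep(c("low", "mid", "high"),
                                        length.out = 400))

stratified <- strat_plots %>%
  group_by(stratum) %>%
  slice_sample(n = 10) %>%   # 10 random plots from every stratum
  ungroup()

table(stratified$stratum)
```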
```{r 10-stratified-random-sampling, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Stratified random sample within unequal strata within a study area. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:400), nrow=20, ncol=20)
df3 <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df3$val <- m[as.matrix(df3[c('y','x')])]
n <- df3[sample(nrow(df3), 40, replace = FALSE, prob = NULL),]
p<-ggplot(n,aes(x=x, y=y, label=val)) +
geom_tile(data=df3, fill='transparent', colour = 'black') +
geom_tile(data=n,fill='black')+
geom_rect(aes(xmin =0.5,xmax = 10.5,ymin = 0.5, ymax = 20.5),
fill = 'pink',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =10.5,xmax = 20.5,ymin = 0.5, ymax = 5.5),
fill = 'skyblue',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =10.5,xmax = 20.5,ymin = 5.5, ymax = 20.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
```{r 10-stratified-random-sampling-2, echo=FALSE,warning=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Stratified random sample from equal strata within a study area. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:400), nrow=20, ncol=20)
df <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df$val <- m[as.matrix(df[c('y','x')])]
n <- df[sample(nrow(df), 40, replace = FALSE, prob = NULL),]
p<-ggplot(n,aes(x=x, y=y, label=val)) +
geom_tile(data=df, fill='transparent', colour = 'black') +
geom_tile(data=n,fill='black')+
geom_rect(aes(xmin =0.5,xmax = 10.5,ymin = 0.5, ymax = 10.5),
fill = 'pink',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =0.5,xmax = 10.5,ymin = 10.5, ymax = 20.5),
fill = 'skyblue',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =10.5,xmax = 20.5,ymin = 10.5, ymax = 20.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin =10.5,xmax = 20.5,ymin = 0.5, ymax = 10.5),
fill = 'brown',alpha=0.02,color = "black",size = 2)+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
## Systematic Sampling
A systematic sample uses a fixed grid or array to assign plots in a regular pattern (**Figure \@ref(fig:10-systematic-sampling)**) [@mcroberts_sampling_2014]. The advantage of systematic sampling is that it maximizes the average distance between the plots and therefore minimizes spatial correlation among observations and increases statistical efficiency [@mcroberts_sampling_2014]. In addition, a systematic sample, which is clearly seen to be representative in some sense, can be very convincing to decision-makers who lack experience with sampling [@mcroberts_sampling_2014]. Raster grids such as digital elevation models (DEM) are examples of systematic samples. A minimal sketch of systematic sampling in R follows.
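This sketch takes every $k$-th plot from a numbered population of 400, starting from a random unit within the first interval (illustrative IDs only):
```{r 10-systematic-sketch, warning=FALSE, message=FALSE}
set.seed(1)
N <- 400   # plots in the population
k <- 10    # sampling interval

start <- sample(1:k, 1)   # random start within the first interval
seq(start, N, by = k)     # IDs of the systematic sample
```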
```{r 10-systematic-sampling, echo=FALSE,warning=FALSE,message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Sample every second observation in the row. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:400), nrow=20, ncol=20)
# Create row indicator
df <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df$val <- m[as.matrix(df[c('y','x')])]
row_odd <- seq_len(nrow(df)) %% 2
col_odd<- seq_len(ncol(df)) %% 2
data_row_odd <- df[row_odd == 0,]
data_col_odd <- df[col_odd == 0,]
# Subset odd rows
p<-ggplot(data_col_odd,aes(x=x, y=y, label=val)) +
geom_tile(data=df, fill='transparent', colour = 'black') +
geom_rect(data=data_col_odd,aes(xmin = x, ymin = y, xmax = x + 0.3, ymax = y + 0.6),fill='black')+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
```{r 10-systematic-sampling-2, echo=FALSE,warning=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Sample all the observation in every second column. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:400), nrow=20, ncol=20)
# Create row indicator
df <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df$val <- m[as.matrix(df[c('y','x')])]
row_odd <- seq_len(nrow(df)) %% 2
data_row_odd <- df[row_odd == 0,]
# Subset odd rows
p<-ggplot(data_row_odd,aes(x=x, y=y, label=val)) +
geom_tile(data=df, fill='transparent', colour = 'black') +
geom_rect(data=data_row_odd,aes(xmin = x, ymin = y, xmax = x + 0.3, ymax = y + 0.6),fill='black')+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
## Cluster Sampling
In cluster sampling, rather than sampling individual units, which might be geographically spread over great distances, we sample groups (clusters) of plots that occur naturally in the study area [@teddlie_mixed_2007]. Cluster sampling is employed when we want to use time and money more efficiently while still generating a valid probability sample [@teddlie_mixed_2007]. A minimal sketch of cluster sampling in R follows.
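This sketch groups a synthetic population of 400 plots into 20 clusters, selects four whole clusters at random, and keeps every plot inside the chosen clusters:
```{r 10-cluster-sketch, warning=FALSE, message=FALSE}
set.seed(1)
cluster_id <- rep(1:20, each = 20)        # 400 plots in 20 clusters of 20

chosen <- sample(unique(cluster_id), 4)   # randomly select 4 whole clusters
sampled_plots <- which(cluster_id %in% chosen)

length(sampled_plots)                     # all 80 plots in the chosen clusters
```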
```{r 10-cluster-sampling, echo=FALSE, warning=FALSE, message=FALSE, fig.cap=fig_cap}
fig_cap <- paste0("Cluster of plots selected from entire study area. Nepal, CC-BY-SA-4.0.")
m <- matrix(c(1:400), nrow=20, ncol=20)
df <- expand.grid(x=1:ncol(m),y=1:nrow(m))
df$val <- m[as.matrix(df[c('y','x')])]
p<-ggplot(df,aes(x=x, y=y,label=val)) +
geom_tile(data=df, fill='transparent', colour = 'black') +
geom_rect(aes(xmin = 0.5,xmax = 3.5,ymin = 14.5, ymax = 20.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin = 5.5,xmax = 7.5,ymin = 5.5, ymax = 7.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin = 12.5,xmax = 14.5,ymin = 12.5, ymax = 14.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
geom_rect(aes(xmin = 15.5,xmax = 20.5,ymin = 0.5, ymax = 5.5),
fill = 'lightgrey',alpha=0.02,color = "black",size = 2)+
scale_y_reverse() +
theme_classic() +
theme(axis.text = element_blank(),
panel.grid = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p
```
## Non-probability Sampling
Non-probability sampling is generally used in qualitative studies. It is also known as **purposive or adaptive sampling**, and is defined as selecting units (e.g., individuals, groups of individuals, institutions) based on specific purposes associated with answering a research question. **Purposive or adaptive** sampling can be classified into three broad categories [@teddlie_mixed_2007]:
## Representative Sampling
This type of sampling is used when we want to select samples that will represent broader groups as closely as possible [@teddlie_mixed_2007]. One example of representative sampling is selecting 100 Douglas-fir and 50 spruce trees for the measurement of tree height from a study area within the Malcolm Knapp Research Forest, BC, consisting of 500 Douglas-fir and 300 spruce trees.
## Unique Case Sampling