-
Notifications
You must be signed in to change notification settings - Fork 17
/
03-types-of-data.Rmd
451 lines (330 loc) · 43.4 KB
/
03-types-of-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
```{r echo=FALSE, warning=FALSE}
yml_content <- yaml::read_yaml("chapterauthors.yml")
author <- yml_content[["typesOfData"]][["author"]]
coauthor <- yml_content[["typesOfData"]][["coauthor"]]
```
# Data Types and Spatial Data Models {#types-of-data}
Written by
```{r results='asis', echo=FALSE}
cat(author, "and", coauthor)
```
In the previous chapter, we discussed some of the unique challenges associated with representing spatial data in a GIS, and how to account for these these with geographic coordinates systems and map projections. In this chapter we will discuss more broadly how to represent both spatial and non-spatial data in a Geographic Information System. We will introduce the different types of data that can represent non-spatial attributes and discuss the different scales this data can be measured on. Then we will introduce the different *spatial data models* we use to link the spatial and non-spatial data. Finally, we will cover some of the different file types that can be used to store data.
:::: {.box-content .learning-objectives-content}
::: {.box-title .learning-objectives-top}
### Learning Objectives {-}
:::
1. Types of Spatial Phenomena
2. Measurement Scales of Data: Quantitative vs. Qualitative
3. Overview of Raster and Vector Data Models
4. Data Resolutions
5. Common File Types in GIS
::::
### Key Terms {-}
Phenomena, Discrete Object, Continuous Field, Qualitative, Quantitative, Measurement Scale, Raster, Vector, Resolution,
## Types of Phenomena
**Phenomenon**, noun, plural *Phenomena*: *1* a fact or situation that is observed to exist or happen, especially one whose cause or explanation is in question. *2* a remarkable person, thing, or event [@oxford_languages_phenomena_nodate]. Essentially, anything and everything are phenomenon: lightning, a country, coastlines, a dog on a kayak. Broadly speaking, in GIS we categorize phenomena as **discrete** or **continuous**. Both kinds of phenomena can be represented in a GIS, but they come with different considerations and cannot always be represented with the same kind of data model.
```{r 3-dog-on-a-boat, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Yarrow enjoying the scenery at Aloutte Lake, she's quite the phenomenon indeed. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-dog-on-a-boat.jpg")
```
<br>
### Discrete Objects
Discrete objects are finite and have distinct boundaries. Each object is a unique, self contained entity whose geography can be exactly defined. Because each object is unique and self contained, collections of objects are countable. A concrete example of a discrete objects would be buildings. They are real physical objects with well defined boundaries. We can count the number of buildings on a college campus or in a city. National and sub-national boundaries are also discrete objects. They (typically) have well defined boundaries and we can easily count the number of nations or provinces. They are not, however, real physical objects. Political boundaries are arbitrary human constructs.
```{r 3-vector, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Provinces are clearly delineated, distinct objects, despite having no real physical presence. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-vector.png")
```
### Continuous Fields
Continuous fields are infinite and lack defined boundaries. Fields can be measured at an infinite number of locations. However, similar values tend to cluster in space so we can often make assumptions based on finite observations of continuous fields. One of the most common examples of a continuous field is elevation. This is a physical property associated with every location on earth. We can't count the "number of elevations" because space is infinitely divisible and everywhere in space can have an elevation.
(ref:3-elevation-caption) Elevation of the Sunshine Coast in British Columbia, from the Natural Resources Canada Digital Elevation Model [@nrcan_canadian_2021].
```{r 3-elevation, fig.cap = fig_cap, out.width= "75%", echo = FALSE}
if (knitr:::is_latex_output()) {
fig_cap <- "(ref:3-elevation-caption)"
} else {
fig_cap <- paste0("Elevation of the Sunshine Coast in British Columbia, from the Natural Resources Canada Digital Elevation Model [@nrcan_canadian_2021].")
}
knitr::include_graphics("images/03-elevation.png")
```
<br/>
### Imperfect Distinctions
Few phenomena will fit perfectly and exclusively into one category or the other. That said, its helpful for us to think about the discrete v. continuous dichotomy. As long as we recognize that it's not a perfect classification. Whether a phenomenon is considered discrete or continuous depends on scale (both spatial and temporal) and perspective. Some phenomenon are a bit of both. Take the coastlines, they can be treated as discrete or continuous. At the scale of an individual beach over hours, tides can cause wide variations in water levels/position. How/where does one draw the discrete line representing the "coast". At this scale, the coast isn't really a discreet object, rather a continuous field known as the inter-tidal zone. Zoom out a bit and those fluctuations aren't particularly relevant if you want to make a map of Pacific Rim National Park. The coast could be considered a discrete object. But if you change the timescale and look at sea level rise projections, then you're dealing with a continuous field.
```{r 3-beach-small-scale, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("The west coast of Vancouver Island. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-beach-small-scale.png")
```
A lightning strike is an electric discharge between the atmosphere and the ground. A lightning strike is a discrete object. The precise location of the strike can be pin pointed, the number of strikes during a storm can be counted. But what about the actual lightning bolt? That is more of a continuous field, it is not really possible to measure the exact boundaries of the path the electric discharge takes. Then we can look at other things, like the probability of lightning strikes. Figure \@ref(fig:3-lightning-density) is a continuous field, calculated from counting discrete objects.
```{r 3-lightning-density, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Global lightning strike density per month. Skeeter, CC-BY-SA-4.0.")
if (knitr:::is_latex_output()) {
knitr::include_graphics("images/03-lightning-density-static.png")
} else {
knitr::include_graphics("images/03-lightning-density.gif")
}
```
## Types of Data
Within the context of a Geographic Information System, each piece of information pertaining to a phenomena can be referred to as an **Attribute**. An phenomena can have many different attributes associated with it, but each attribute can broadly be said to address one of three questions: **What**, **When**, or **Where**? Attributes that describe *where* are known as **Spatial Data** while all other attributes are **Non-Spatial Data**. All data, spatial and non-spatial, can broadly be classified as either **qualitative** or **quantitative**. These data types are fundamentally different and are therefore measured on fundamentally different scales. The types of analysis we can conduct with qualitative data are more limited than quantitative data, but that does not necessarily mean quantitative data are “better” than qualitative.
```{r 3-data-types, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Summary of the types of data. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-data-types.png")
```
### Qualitative Data
Qualitative data are categorical; they are strictly descriptive and lack any meaningful numeric value. They describe the qualities of a phenomenon, without giving us any numeric information. Most qualitative data you will work with in GIS are textual or coded numerals, but there are circumstances where you may encounter non-textual data (e.g. images, sound clips, videos) in a dataset. Qualitative data can be "spatial" in nature (e.g. relative directional descriptors: left/right, near/far, north/south), but because they lack numeric values, they cannot be used for spatial analysis. Qualitative data can be measured on either a **Nominal** or **Ordinal** scale.
#### Nominal Scale
These are data that just consist of names or categories with no ranking or direction are nominal. One category is not more or less, better or worse than another, they are just different. A good example would be flower types. Other examples would be zoning categories, colors, flavors of ice cream, place names, etc. With nominal data, you can check for equality between entities and you can count occurrence. These are the only operations we can do. You can't calculate
```{r 3-flowers-nominal, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Each flower is different, but no flower is 'more' or 'less' a flower than any of the others. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-flowers-nominal.png")
```
```{r 3-nominal-operations, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Checking equality with flower species. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-nominal-operations.png")
```
#### Ordinal Scale
Ordinal scale data are categories that also have a some ranking or directionality. A good example would be relative sizes (see Figure \@ref(fig:3-dogs-ordinal)). Some other good examples of ordinal data include spice levels (mild, medium, hot), residential zoning density (low, medium, high), and survey responses.
```{r 3-dogs-ordinal, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("We can see Yarrow is taller than her sister Shamsa, so we can rank these dogs by height. However, we haven't measured their heights, so we don not know how much taller Yarrow is than Shamsa. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-dogs-ordinal.png")
```
```{r 3-ordinal-operations, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("In some circumstances, we can directly calculate the median (middle value) of an ordinal set. With odd numbered sets (e.g. Group 1), the median is simply the middle value of the set, when sorted lowest to highest. We can always take the median when we have an odd number. With even numbered sets, its a bit more complicated. The median, is the average of the middle two values. For Group 2, the middle values (5th and 6th) are both 'Neutral', so we don't have an issue. But for Group 3, the 5th value is 'Neutral' and the 6th value is 'Agree'. We cannot directly average these two ordinal values. One solution is to arbitrarily assign a numeric score to the ordinal categories (e.g., 1-5). This would then allow you to show the median is between 'Neutral' and 'Agree'. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-ordinal-operations.png")
```
The only arithmetic operations we can do with nominal data are checking for equality (True/False), counting occurrences (frequencies), and calculating the mode (most frequent occurrence). With ordinal data, we can do these operations as well, plus a few more. We can check the order/rank (greater than, less than) and in some circumstances we can calculate the median (see Figure \@ref(fig:3-ordinal-operations)).
(Graded Membership) When trying to group real world phenomena into categories, there are often "exceptions" that blur the lines a bit. Take this example: you are trying to develop a land cover classification scheme for Garibaldi Provincial park in British Columbia. Some of the land surface is unquestionably alpine tundra and some is certainly forest area. However, the transition between forest and alpine meadow is not an abrupt line. How/where do you draw the line? Examples like this are known as fuzzy variables, and we often use a *Graded Membership* scale to assign them to categories. With the landscape classification, a simple approach would be a "winner take all" approach. If a plot is 5% bare rock, 40% forest, and 45% alpine meadow, the area will be classified as alpine meadow. From that point forward, in the GIS, that area will be treated as alpine meadow, any information about the variability within the area will be lost. In practice, many of the qualitative data we work with in GIS, especially those describing natural phenomena, are actually graded membership variables.
```{r 3-fuzzy, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("In this example, we see an alpine landscape in Garibalid Provinical Park, BC. We can see patches of forest and patches of meadow. But where, exactly, would we draw the boundary between these two landscape classes. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-fuzzy.jpg")
```
### Quantitative Data
Quantitative data are numeric; they describe the quantities associated with an phenomena. The numerical values that are separated by a unit that has some inherent meaning (as opposed to the arbitrary numeric codes like in the ordinal data example). This allows us to conduct a wider range of arithmetic operations on quantitative data. In addition to the operations we perform on Qualitative data; with numeric data we can always calculate measures of central tendency (mean/median) and we can add/subtract values to calculate differences.
Numeric data can be either **discrete** or **continuous**. Discrete variables (e.g. population) are obtained by counting and values within a range cannot be infinitely subdivided. You can have a population of 1, 37, or 179 but you cannot have a population of 2.3. Continuous variables (e.g. temperature) can take an infinite number of values a given range, but they cannot be counted. You can have temperatures of 10, 10.5, or 10.1167 °C, but a temperature of 10°C does not mean you have 10 individual degrees of temperature. Quantitative data (both discrete and continuous) can be measured on either an **Interval** or **Ratio** scale. These types of quantitative data are closely related, but have one important distinction.
#### Ratio Data
Ratio data have fixed, meaningful, absolute zero points. The absolute zero point means ratio data cannot take negative values. It also means that we can multiply/divide two values to calculate a meaningful ratio between them (hence the name). A good example of ratio data are population total (see figure). Population counts start at zero and go up from there. A population of zero means there are no residents, and its impossible to have a negative population. Other examples of ratio data include: temperature (*in degrees Kelvin*), precipitation, tree height, income, rental cost, and units of time (years, seconds, etc.)
```{r 3-ratio-population, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Because of the fixed, meaningful zero point, we can calculate ratios between populations: e.g. Manitoba's population is 1/10th that of Ontario, British Columbia has 129 times as many people as Nunavut. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-ratio-population.png")
```
Interval data on the other hand, have an arbitrarily set zero point. This means they can can take negative values. Because the zero point is arbitrary, we cannot multiply/divide two values or calculate meaningful ratios between two values. A good example of interval data is temperature measured in Celsius, and comparing it to Kelvin highlights the difference between the two data types (see Figure \@ref(fig:3-interval-ratio-temperature)). The conversion between Kelvin (K, ratio) and Celsius (C, interval) is very simple: $^\circ$C = K-273.15. Zero Kelvin is "Absolute Zero" or the lack of temperature, while 0$^\circ$C is the freezing point of water (273.15 K above absolute zero). Other examples of interval data include: the pH scale, IQ test scores, elevation (relative to a datum) dates (April 12th, 2011), and times (11:00 A.M.).
```{r 3-interval-ratio-temperature, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("The ratio between two temperatures in Celsius is not meaningful, 20 C is not 'twice' as warm as 10 C. Kelvin's zero point is fixed to absolute zero, the 'absence' of temperature. So we can calculate the ratio, 293.15 K is 1.035 times warmer than 283.15 K. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-interval-ratio-temperature.png")
```
### Derived Ratio: Normalizing Data
Sometimes we want to account for the influence of one variable when analyzing another. To do this, we can divide one value by by another to get the ratio of the two, also known as a **derived ratio**. This process is sometimes referred to as **Normalizing** or **Standardizing** our data. The basic formula is: $C=\frac{A}{B}$, where A is our variable of interest, B is our confounding variable, and C is our new derived ratio. There are many circumstances where we might need to do this. One common example is population density: Canada and Poland both have populations of ~ 38 million people but Canada had 32x the land area of Poland. Any comparison of the these two nations that fails to account for the size disparity would be seriously flawed. Another key example are affordability indexes. The example below shows how normalization can be applied to a households expenditures on food. Income and household expenditures on food are strongly related (wealthy regions tend to purchase more expensive food). An analysis of the cost of food that does not account for this relationship would not adequately account for the *affordability* of food in a given region. Dividing household food expenditures by household income, we get the proportion of income spent on food. This is a much more accurate representation of the affordability of food and highlights that the poorest communities are most severely impacted by increasing food costs.
```{r 3-income-v-food, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Income versus household expenditures on Food by Census Subdivisions in British Columbia. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-income-v-food.png")
```
```{r 3-income-v-income-on-food, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Income versus Fraction of Income Spent on Food by Census Subdivisions in British Columbia. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-income-v-income-on-food.png")
```
### Summary of Data Types
| Operation |Nominal|Ordinal|Interval|Ratio|
|---------------|-------|-------|--------|-----|
|Equality |x |x |x |x |
|Counts/Mode |x |x |x |x |
|Rank/Order | |x |x |x |
|Median | |x |x |x |
|Add/Subtract | | |x |x |
|Mean | | |x |x |
|Multiply/Divide| | | |x |
## Spatial Is Special
You might encounter the phrase "Spatial is special" in your time studying GIS. Spatial data is the foundation of Geographic Information Science, it is what distinguishes GIS from the broader field of data science. This was succinctly summarized by Waldo Tobler in The First Law of Geography:
- *"Everything is related to everything else, but near things are more related than distant things."*
```{r 3-spatial-is-special, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Visualization of Tobler's First Law. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-spatial-is-special.jpg")
```
This might seem obvious: people interact more if they live in the same city, orca pods in different areas develop different dialects, hemlocks on Vancouver Island are more related to their neighbors than to to hemlock in the New Brunswick. Generally, near things are more related to one another, but it *does not guarantee similarity*. Downtown Vancouver averages 40 cm of snow/year, but the ski resort on Grouse Mountain 15 km north gets over 9 m. These locations are impacted by the same storm systems, but the 1200 m elevation difference causes vastly different quantities and different types of precipitation.
The measure of similarity between objects across space called **spatial autocorrelation**. Spatial autocorrelation allows us to make some a key assumptions when representing spatial data. We don't have to measure a phenomena everywhere in order to represent it adequately. We only need to measure it at specific locations or over regular intervals. If point A is in dense forest, it is likely point B 10 m away is also in a dense forest. We do not have to get the location of every tree in the forest. Instead, we can look at the average presence of trees over a larger area.
```{r 3-bc-snow, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Screengrab from Hectares BC, can easily make into better map. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-bc-snow.png")
```
## Spatial Data Models
As discussed in the previous chapter, spatial data is three-dimensional, though we usually project it into two-dimensions for simplicity. Because of the unique transformations that must be applied to spatial data, it must be treated and represented differently than the non-spatial data that describe *what* is happening and *when*. We cannot simply put all of our data into a spreadsheet and start analyzing it. We have to use **Spatial Data Models** to organize our data and link our spatial and non-spatial data. Spatial data models store geographic data in a systematic way so that we can effectively display, query, edit, and analyze our data within a GIS.
There are two main types of spatial data models: the **Raster** and **Vector** models. The raster data model represents spatial data as grid of cells, and each cell has one non-spatial attribute associated with it. The vector data model represents spatial data as either points, lines, or polygons that are each linked to one or more non-spatial attributes. These two models represent the world in fundamentally different ways. One is not inherently better than the other, but they are better suited for different circumstances. The choice of which model to use is often dictated by three main factors:
1) The type of phenomena we are trying to represent.
2) The scale at which we plan to analyze our data.
3) How we plan to use the data.
```{r 3-vector-v-raster, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Representing space in the raster model vs. the vector model. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-vector-v-raster.jpg")
```
### Raster Data Model
The raster data model represents a phenomena across space as a gridded set of cell (or pixels). The cell size determines the **Resolution** of the raster image, that is the smallest feature we can resolve with the raster. A 10 m resolution raster has cells that are 10 x 10 m (100 m2), a 2 m resolution has cells that are 2 x 2 m (4 m2). Along with the cell size, the number of rows and columns dictates the extent (or bounds) of a raster image. A raster with a 1 m cell size, 5 rows, and 5 columns, will cover an area of 5 m x 5 m (25 m2). Because of the full coverage within their bounds, raster data models are very well suited for representing *continuous phenomena* where cell values correspond to measured (or estimated) value at specific location. In GIS, rasters are commonly encountered as: satellite and drone imagery, elevation models, climate data, model outputs, and scanned maps.
```{r 3-raster-example, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Example of raster data. Skeeter, CC-BY-SA-4.0")
knitr::include_graphics("images/03-raster-example.png")
```
The value of a pixel can be quantitative (e.g. elevation) or qualitative (e.g. land use). Each pixel/cell can only have a single value associated with it. Multiple bands can be combined to store or more information, as is done with a RGB color photograph. Algebraic expressions can also be performed quickly and efficiently with raster layers a inputs. This is known as raster overlay, and is one of the key advantages to raster data. If layer A = Average July Temperature and layer B = Average January Temperature, then A – B will give us the Average Temperature Range across the rasters domain.
```{r 3-raster-overlay, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Raster math illustration. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-raster-overlay.png")
```
Rasters data relies on Spatial Autocorrelation and The First Law of Geography, the model assumes that *all areas* within a given cell are equally represented by the cell value. Depending on the resolution of the raster and the scale of the task at hand, this may or may not be an effective assumption. If you are trying to represent the coastline of Nova Scotia, 100 m or even 1 km resolution cells will likely suffice (see Figure \@ref(fig:3-raster-resolution)). However, 10 km cells severely degrade the quality of the representation and at a 100 km cell size, the province is indistinguishable.
```{r 3-raster-resolution, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Raster Resolution. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-raster-resolution.png")
```
The above example is related to something known as them *mixed pixel* problem. Each cell in a raster can only have **one** value. So how do you handle when the area a cell covers contains **multiple** examples. Possible approaches are:
* **Majority/Mode**, the cell value is determined by the value/class covering the largest area within each cell. This can be useful for **discrete** phenomena, but generally will not be helpful for continuous phenomena.
* **Touches All**, can be useful for discrete phenomena if you need to prioritize specific class(es) you can designate it them to be assigned to any pixel the touch (e.g., with flood/fire risk or most other hazards, you want to take an inclusive approach when defining risk zones. Better safe than sorry.)
* **Nearest Neighbor/Center Point**, the cell value is determined by the value/class only at the center point of the cell. This method is quick to calculate but can under- or over-estimate repeating phenomena with frequencies lining up with the raster resolution (e.g., City Blocks/Roads, rows in agricultural fields).
* **Average**, when working with continuous phenomena (e.g., rainfall, temperature, elevation) it might be best to use the average value across the cell instead. If multiple observations are available calculate the spatially weighted average within each cell. If we are working with discrete phenomena, this method is generally less useful.
When the data resolution is very high, relative to the scale of the map/analysis, the specific choice of method will produce negligible differences. If you are working with a 25 m resolution land cover classification and doing a continental scale analysis, the improper attribution of boundary pixels will not have a huge impact on the results. If the data resolution is low relative to the scale of your analysis, the choice of method could have a significant impact on your results.
Raster data can come in many different formats. **GeoTIFF** which has the extension .tif is one of the most common. This format is based of the Tag Image File Format (TIFF), a common file type used by graphic artists and photographers. A TIFF file stores metadata (data about the data) as tags. For instance, your camera might store a tag that describes the make and model of the camera and another for the date the photo was taken when it saves a picture. A GeoTIFF is a standard .tif image format plus additional tags spatial tags denoting spatial information including:
* Extent (minimum x,y and maximum x,y)
* Resolution (cell size)
* Projection, Coordinate system, and datum
Other file types you will likely encounter when working with raster data include:
* 1) IMG - A proprietary image format commonly used by ESRI products
* 2) JPEG2000 - A geospatial version of the common .jpg image type
* 3) ASCII - An older human readable format (simple text file) with slower performance than the types listed above.
### Vector Data
The vector data model is much more well suited to represent discrete phenomena than the raster data model. A vector feature is a representation of a discrete object as a set of x,y coordinate pairs (points) linked to set of descriptive attribute about that object. A vector feature's coordinates can consist of just one (x,y) pair to form a single point feature, or multiple points which can be connected to form lines or polygons (see Figure \@ref(fig:3-vector-types)). The non-spatial attribute data is typically stored in a **Tabular** format separate from the spatial data, and it is linked using an index. One of the key advantages of the vector model is the ability to store and retrieve many attributes them quickly. In GIS, vector data are commonly encountered as: political boundaries, cenus data, pathways (road, trails, etc.), point location (stop sign, fire hydrant), etc.
```{r 3-vector-types, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Vector objects (points, lines, or polygons) are stored along with any number of attribute. Point, line, and polygon data are typically stored in separate files. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-vector-types.png")
```
**Points** are “zero-dimensional”, they have no length or width or area. A point feature is just an individual (x,y) coordinate pair representing a precise location, that has some linked attribute information. Points are great for representing a variety of objects, depending on the scale. Fire hydrants, light poles, and trees are suitable to be represented as points in almost any application. If you are making a map of mines in British Columbia, or cities across Canada, it is probably acceptable to just display them as points.
```{r 3-vector-points, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("An example of point data showing locations of trees. The points are labeled with their index (unique ID number) which correspoonds to the attribute table below which stores more information about eacht tree. Skeeter, CC-BY-SA-4.0")
knitr::include_graphics("images/03-vector-points.png")
```
|index|Longitude (X)|Latitude (Y)|Name |Age|Height|
|----:|------------:|-----------:|-----|--:|-----:|
| 0| 0.44| 0.03|Fir | 54| 119|
| 1| 0.55| 0.71|Fir | 29| 56|
| 2| 0.89| 0.33|Fir | 82| 197|
| 3| 0.18| 0.02|Fir | 46| 98|
| 4| 0.65| 0.51|Maple| 87| 212|
| 5| 0.43| 0.81|Maple| 73| 172|
| 6| 0.38| 0.86|Maple| 94| 233|
| 7| 0.68| 0.04|Cedar| 34| 68|
| 8| 0.15| 0.13|Cedar| 36| 73|
**Lines** are one-dimensional, they have length, but no width and thus no area. A line consists of two or more points. Every line must have a start point and end point, they may also have any number of middle points, called vertices. A vertex is just any point where two or more lines meet. Lines are also great for representing a variety of objects, depending on the scale. Hiking trails, flight paths, coastlines, and power lines are suitable to be represented as lines in almost most applications. When making smaller scale maps, it is often sufficient to represent rivers as lines, though at large scales we might elect to use a polygon.
```{r 3-bc-road-map, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Roads are typically reprented as line data. Though they obviously have an area, unless we are making a very large scale map, we do not need (or have the room) to show that on a map. This British Columbia road altlas makes use of line data, representing roads a lines and using different colors to denote the type of road. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-bc-road-map.jpg")
```
**Polygons** are two-dimensional, they have both a length and width and therefore we can also calculate their area. All polygons consist of a set of at three or more points (vertices) connected by line segments called “edges” that connect to form an enclosed shape. All polygons form an enclosed shape, but some can also have "holes" (think doughnuts!), these holes are sometimes called interior rings. Each interior ring is a separate set vertices and edges that is wholly contained within the polygon and no two interior rings can overlap. Polygons are useful for representing many different objects depending: political boundaries boundaries, Köppen climate zones, lakes, continents, etc. At large scales they can represent things like buildings which we might choose to represent as points at smaller scales.
Sometimes, a discrete object has multiple parts, that are spatially separated. In these circumstances, the vector model allows for multi-polygon, multi-line, or multi-point objects. A good example of when a multi-polygon would be useful is the StatsCanada provincial boundary file (see Figure \@ref(fig:3-vector-2)). Roads sometimes need to be stored as multi-lines as well, for example Highway 1 crosses the Georgia Straight from Vancouver to Nanaimo. If we want the to represent the entire Highway as one object, we need to use a multi-line.
```{r 3-vector-2, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("This is the official Stats Canada provincial boundary layer. All the other coastal provinces and territories have islands. We do not need to represent every island as a separate object, so we can 'bundle' together the polygons as multipolygons. The landlocked provinces do not have any coastlines and are represented as simple polygons reather than multipolygons. The attribute table bellow corresponds to the map and lists the geometry type (polygon/multipolygon). Skeeter, CC-BY-SA-4.0")
knitr::include_graphics("images/03-vector.png")
```
| PRNAME |Province ID|Population| Area |Geometry Type|
|-------------------------|----------:|---------:|------:|-------------|
|Newfoundland and Labrador| 10| 525572| 373872|MultiPolygon |
|Prince Edward Island | 11| 157329| 5660|MultiPolygon |
|Nova Scotia | 12| 971451| 53338|MultiPolygon |
|New Brunswic | 13| 779940| 71450|MultiPolygon |
|Quebec | 24| 8536855|1365128|MultiPolygon |
|Ontario | 35| 14666590| 917741|MultiPolygon |
|Manitoba | 46| 1389952| 553556|MultiPolygon |
|Saskatchewan | 47| 1206019| 591670|Polygon |
|Alberta | 48| 4511223| 642317|Polygon |
|British Columbia | 59| 5111756| 925186|MultiPolygon |
|Yukon | 60| 41774| 474391|MultiPolygon |
|Northwest Territories | 61| 45217|1183085|MultiPolygon |
|Nunavut | 62| 39419|1936113|MultiPolygon |
Vector data also has a **resolution** although it has a somewhat different definition in the context of the vector model. Vector resolution is determined by the smallest resolvable feature. Another way to describe vector resolution, would be the distance between vertices. The greater the distance between vertices, the fewer vertices there are per polygon and the lower the resolution. If a vector object (line or polygon) has many vertices, we will have a higher resolution representation of the feature.
```{r 3-vector-resolution, echo=FALSE, out.width = '75%', fig.cap=fig_cap}
fig_cap <- paste0("Vector image of Nova Scotia at different resolutions. Here the original polygon (top left) has been downsampled to lower resolutions, by setting the minimum allowable distance between verticies. As the distance beween vercicies increases, the resolution decreases and the coastline becomes less distinguishable. Skeeter, CC-BY-SA-4.0.")
knitr::include_graphics("images/03-vector-resolution.png")
```
#### File Types
Like raster data, vector data can also come in many different formats. The **shapefile** format which has the extension .shp is one of the most common file types you will encounter. A shapefile stores the geographic coordinates of each vertex in the vector, as well as metadata including:
* The spatial extent of the shapefile (i.e., geographic area that the shapefile covers). The spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.
* Object type - whether the shapefile includes points, lines, or polygons.
* Coordinate reference system (CRS)
* Attributes - for example, a line shapefile that contains the locations of streams, might contain the name of each stream.
Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.
**GeoJSON** is a simple, lightweight format for storing a variety of geographic data structures. It is most commonly encountered in web mapping and other open source applications. GeoJSON supports the following geometries: Point, Line, Polygon, MultiPoint, MultiLine, and MultiPolygon objects. Unlike with shapefiles, one GeoJSON file can contain any mix of geometries. An objects with and its attributes are a Feature object. A set of Features is a FeatureCollection. GeoJSON has the added benefit of allowing you to encode stylistic choices within the file. If you would like to explore this format a bit more, copy the code below and paste it in the [online GeoJSON editor geojson.io](https://geojson.io/#map=2/20.0/0.0). You can make changes and see them reflected on your the map.
```json
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"marker-color": "#blue",
"marker-size": "medium",
"marker-symbol": "circle",
"Name": "Vancouver"
},
"geometry": {
"type": "Point",
"coordinates": [
-123.04687499999999,
49.23912083246698
]
}
},
{
"type": "Feature",
"properties": {
"marker-color": "red",
"marker-size": "medium",
"marker-symbol": "square",
"Name": "Victoria"
},
"geometry": {
"type": "Point",
"coordinates": [
-123.40942382812501,
48.516604348867475
]
}
}
]
}
```
**Simple text files** are human readable file formats (.txt, .csv) that are suitable for storing point and attribute data. You will often encounter .txt or .csv files when working with weather data for instance. Coordinates (typically latitude and longitude) are stored in a text files along with the other attributes. We can bring this type of file into a GIS, but we need to convert the data to point features before we can display it.
*Canadian Weather Station File*
| Name | Province |Climate ID|Latitude (Decimal Degrees)|Longitude (Decimal Degrees)|
|------------------------|----------------|---------:|-------------------------:|--------------------------:|
|ACTIVE PASS |BRITISH COLUMBIA| 1010066| 48.87| -123.28|
|ALBERT HEAD |BRITISH COLUMBIA| 1010235| 48.40| -123.48|
|BAMBERTON OCEAN CEMENT |BRITISH COLUMBIA| 1010595| 48.58| -123.52|
|BEAR CREEK |BRITISH COLUMBIA| 1010720| 48.50| -124.00|
|BEAVER LAKE |BRITISH COLUMBIA| 1010774| 48.50| -123.35|
|BECHER BAY |BRITISH COLUMBIA| 1010780| 48.33| -123.63|
|BRENTWOOD BAY 2 |BRITISH COLUMBIA| 1010960| 48.60| -123.47|
|BRENTWOOD CLARKE ROAD |BRITISH COLUMBIA| 1010961| 48.57| -123.45|
|BRENTWOOD W SAANICH RD |BRITISH COLUMBIA| 1010965| 48.57| -123.43|
|CENTRAL SAANICH VEYANESS|BRITISH COLUMBIA| 1011467| 48.58| -123.42|
## Choice of Spatial Data Model
There is no "best" spatial data model. Rasters are more well suited for some applications and vector data are better suited for others. The section summarizes some of the key considerations that influence which model is suited for which situations.
### Comparing Data Models
| Vector | Raster |
|------------------------------------------------|-----------------------------------------------|
|Usually <b>discrete objects</b> |Usually <b>continuous fields</b> |
|Points, Lines, and/or Polygons |Grid of cells (pixels) with continuous coverage|
|Each object can have <b>many</b> attributes |Each cell has <b>one</b> value per band (layer)|
|Objects may overlap, have gaps, or be continuous|One raster image can have <b>many</b> bands |
### Raster Data Model
| Advantages | Disadvantages |
|--------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Well suited for continuous variables: in space <b>and</b> time|Large file size: <b>exponentially</b> proportional to resolution and <b>linearly</b> proportional to number of bands. |
|Simple data structure makes overlay is easy and efficient |Loss of information during rasterization (mixed pixel problem, see case study). Reductions in cell size may lead to inability to recognize spatial features.|
### Vector Data Model
| Advantages | Disadvantages |
|-------------------------------------------------------|---------------------------------------------------------------------------------------------|
|Compact data structure: smaller file sizes |Complex data structures compared to rasters |
|A good representation of discrete objects |Topology (connectivity) - can be a huge head ache when creating a layer |
|Easy to query and select by attributes |Some tasks (overlay of layers) can be computationally expensive |
|Graphic output is usually more aesthetically pleasing |No variability within polygons possible |
|Topology (connectivity) - Proximity & Network Analysis |Less suited for continuous variables (requires significant generalization) or temporal change|
### Which Data Model is Best?
No single data model is suitable for all types of data or analysis.
* Most GIS systems employ both raster and vector data structures so that the user can choose the model best suited to the representation of their data
* It is possible to convert back and forth between models
* However, this results in a loss of information and may introduce additional error each time a conversion is made
### Reflection Questions {-}
1. Explain the difference between continuous fields and discrete objects.
2. Define Quantitative data and Qualitative data.
3. What is the role of resolution in raster data?
4. How does the vector data model differ from the raster data model?