---
output: html_document
editor_options:
  chunk_output_type: console
---
# Introduction and Setup {#intro}
## Introduction {#intro-intro}
Citizen science data are increasingly making important contributions to ecological research and conservation. One of the most common forms of citizen science data is derived from members of the public recording species observations. [eBird](https://ebird.org/) [@sullivanEBirdEnterpriseIntegrated2014] is the largest of these biological citizen science programs. As of January 2019, the database contained nearly 600 million bird observations from every country in the world, with observations of nearly every bird species on Earth. The eBird database is valuable to researchers across the globe, due to its year-round, broad spatial coverage, high volumes of [open access](https://en.wikipedia.org/wiki/Open_data) data, and applications to many ecological questions. These data have been widely used in scientific research to study phenology, species distributions, population trends, evolution, behavior, global change, and conservation. However, robust inference with eBird data requires careful processing of the data to address the challenges associated with citizen science datasets. This book, and the [associated paper](https://www.biorxiv.org/content/10.1101/574392v1), outlines a set of best practices for addressing these challenges and making reliable estimates of species distributions from eBird data.
There are two key characteristics that distinguish eBird from many other citizen science projects and facilitate robust ecological analyses: the checklist structure enables non-detection to be inferred and the effort information associated with a checklist facilitates robust analyses by accounting for variation in the observation process [@lasorteOpportunitiesChallengesBig2018; @kellingFindingSignalNoise2018]. When a participant submits data to eBird, sightings of multiple species from the same observation period are grouped together into a single **checklist**. **Complete checklists** are those for which the participant reported all birds that they were able to detect and identify. Critically, this enables scientists to infer counts of zero individuals for the species that were not reported. If checklists are not complete, it's not possible to ascertain whether the absence of a species on a list was a non-detection or the result of a participant not recording the species. In addition, citizen science projects occur on a spectrum from those with predefined sampling structures that resemble more traditional survey designs, to those that are unstructured and collect observations opportunistically. eBird is a **semi-structured** project, having flexible, easy to follow protocols that attract many participants, but also collecting data on the observation process (e.g. amount of time spent birding, number of observers, etc.), which can be used in subsequent analyses [@kellingFindingSignalNoise2018].
Despite the strengths of eBird data, species observations collected through citizen science projects present a number of challenges that are not found in conventional scientific data. The following are some of the primary challenges associated with these data, each of which will be addressed throughout this book:
- **Taxonomic bias:** participants often have preferences for certain species, which may lead to preferential recording of some species over others [@greenwoodCitizensScienceBird2007; @tullochBehaviouralEcologyApproach2012]. Restricting analyses to complete checklists largely mitigates this issue.
- **Spatial bias:** most participants in citizen science surveys sample near their homes [@luckAlleviatingSpatialConflict2004], in easily accessible areas such as roadsides [@kadmonEffectRoadsideBias2004], or in areas and habitats of known high biodiversity [@prendergastCorrectingVariationRecording1993]. A simple method to reduce this spatial bias, and the one we describe, is to create an equal area grid over the region of interest and sample a given number of checklists from within each grid cell (see the sketch following this list).
- **Temporal bias:** participants preferentially sample when they are available, such as weekends [@courterWeekendBiasCitizen2013], and at times of year when they expect to observe more birds, notably during spring migration [@sullivanEBirdEnterpriseIntegrated2014]. To address the weekend bias, we recommend using a temporal scale of a week or multiple weeks for most analyses.
- **Spatial precision:** the spatial location of an eBird checklist is given as a single latitude-longitude point; however, this point may not be precise for two main reasons. First, for traveling checklists, the location represents just one point on the route covered. Second, eBird checklists are often assigned to a **hotspot** (a common location shared by all birders visiting a popular birding site) rather than the exact location where the birds were observed. For these reasons, it's not appropriate to align eBird locations with very precise habitat covariates, and we recommend summarizing covariates within a neighborhood around the checklist location.
- **Class imbalance:** for bird species that are rare or hard to detect, the data may be highly imbalanced, with many more checklists recording non-detections than detections. For these species, a distribution model that predicts the species is absent everywhere will have high accuracy but no ecological value. We'll follow the methods for addressing class imbalance proposed by Robinson et al. [-@robinsonUsingCitizenScience2018].
- **Variation in detectability:** detectability describes the probability that a species present in an area is detected and identified. Detectability varies by season, habitat, and species [@johnstonSpeciesTraitsExplain2014; @johnstonEstimatesObserverExpertise2018]. Furthermore, eBird data are collected with high variation in effort, time of day, number of observers, and external conditions such as weather, all of which can affect the detectability of species [@ellisEffectsWeatherTime2018; @oliveiraObservationDiurnalSoaring2018]. Detectability is therefore particularly important to consider when comparing across seasons, habitats, or species. Since eBird uses a semi-structured protocol that collects variables associated with variation in detectability, we'll be able to account for a larger proportion of this variation in our analyses.
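To make the grid-based subsampling mentioned above concrete, here is a minimal sketch of spatiotemporal subsampling using only `dplyr` and `lubridate`. It assumes a hypothetical data frame `checklists` with `checklist_id`, `latitude`, `longitude`, and `observation_date` columns (names chosen purely for illustration), and it uses simple coordinate rounding in place of the equal area hexagonal grid used in later chapters:
```{r intro-subsample-sketch, eval = FALSE}
library(dplyr)
library(lubridate)

# assign each checklist to a coarse grid cell and a week of the year, then
# randomly keep a single checklist from each cell-week combination
checklists_sampled <- checklists %>%
  mutate(cell = paste(round(latitude / 0.25), round(longitude / 0.25)),
         week = week(observation_date)) %>%
  group_by(cell, week) %>%
  slice_sample(n = 1) %>%
  ungroup()
```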
The remainder of this book will demonstrate how to address these challenges using real data from eBird to produce reliable estimates of species distributions. In general, we'll take a two-pronged approach to dealing with unstructured data and maximizing the value of citizen science data: imposing more structure onto the data via data filtering and including covariates in models to account for the remaining variation.
The next two chapters show how to access and prepare [eBird data](#ebird) and [land cover covariates](#covariates), respectively. The remaining three chapters provide examples of different species distribution models that can be fit using these data: [encounter rate models](#encounter), [occupancy models](#occupancy), and [abundance models](#abundance). Although these examples focus on the use of eBird data, in many cases they also apply to similar citizen science datasets.
## Prerequisites {#intro-pre}
To understand the code examples used throughout this book, some knowledge of the programming language [R](https://www.r-project.org/) is required. If you don't meet this requirement, or begin to feel lost trying to understand the code used in this book, we suggest consulting one of the excellent free resources available online for learning R. For those with little or no prior programming experience, [Hands-On Programming with R](https://rstudio-education.github.io/hopr/) is an excellent introduction. For those familiar with the basics of R who want to take their skills to the next level, we suggest [R for Data Science](https://r4ds.had.co.nz/) as the best resource for learning how to work with data in R.
## Setup {#intro-setup}
### Data package {#intro-setup-data}
The first two chapters of this book focus on obtaining and preparing eBird and land cover data for the modeling that will occur in the remaining chapters. These steps can be time consuming and laborious. If you'd like to skip straight to the analysis, [**download this package of prepared data**](https://github.com/cornelllabofornithology/ebird-best-practices/raw/master/data/data.zip). Unzip this file so that the contents are in the `data/` subdirectory of your RStudio project folder. This will allow you to jump right into the modeling and ensure that you're using exactly the same data that were used when creating this book. This is a good option if you don't have enough hard drive space to store the full eBird dataset, which is more than 200 GB uncompressed, or don't have a fast enough internet connection to download it.
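If you prefer to download and extract the data package from within R rather than through your browser, a sketch like the following should work (run it from within your RStudio project; depending on how the archive is structured, you may need to adjust `exdir` so that the files end up in the `data/` subdirectory):
```{r intro-data-package, eval = FALSE}
# download the prepared data package and extract it into the project directory
dl_url <- paste0("https://github.com/cornelllabofornithology/",
                 "ebird-best-practices/raw/master/data/data.zip")
tmp_zip <- tempfile(fileext = ".zip")
download.file(dl_url, destfile = tmp_zip, mode = "wb")
unzip(tmp_zip, exdir = ".")
unlink(tmp_zip)
```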
### Software {#intro-setup-software}
The examples throughout this website use the programming language **R** [@R-base] to work with eBird data. If you don't have R installed, [download it now](https://cloud.r-project.org/); if you already have R, chances are you're using an outdated version, so [update it to the latest version now](https://cloud.r-project.org/). R is updated regularly, and **it is important that you have the most recent version of R** to avoid headaches when installing packages. We suggest checking every couple of months to see if a new version has been released.
We strongly encourage R users to use **RStudio**. RStudio is not required to follow along with this book; however, it will make your R experience significantly better. If you don't have RStudio, [download it now](https://www.rstudio.com/products/rstudio/download/#download); if you already have it, [update it](https://www.rstudio.com/products/rstudio/download/#download), because new versions with useful additional features are regularly released. Pro tip: immediately go into the RStudio preferences (Tools > Global Options) and, on the General pane, uncheck "Restore .RData into workspace at startup" and set "Save workspace to .RData on exit" to "Never". This will avoid cluttering your R session with old data and save you headaches down the road.
Due to the massive size of the eBird dataset, working with it requires the Unix command-line utility AWK. You won't need to use AWK directly, since the R package `auk` does this hard work for you, but you do need AWK to be installed on your computer. Linux and Mac users should already have AWK installed on their machines; however, Windows users will need to [install Cygwin](https://www.cygwin.com/) to gain access to AWK. Cygwin is free software that allows Windows users to use Unix tools. Cygwin should be installed in the default location (`C:/cygwin/bin/gawk.exe` or `C:/cygwin64/bin/gawk.exe`) in order for everything to work correctly. Note: there's no need to do anything at the "Select Utilities" screen; AWK will be installed by default.
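Once `auk` is installed (see the next section), you can confirm that R can locate AWK with a quick sanity check; `auk_get_awk_path()` returns the path to the AWK executable it will use (or indicates that AWK can't be found). This is optional and shown only as a troubleshooting aid:
```{r intro-awk-check, eval = FALSE}
# check that auk can find the AWK executable
auk::auk_get_awk_path()
```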
### R packages {#intro-setup-packages}
The examples in this book use a variety of R packages for accessing eBird data, working with spatial data, data processing and manipulation, and model fitting. To install all the packages necessary to work through this book, run the following code:
```{r packages, eval = FALSE}
install.packages("remotes")
remotes::install_github("mstrimas/ebppackages")
```
Note that several of the spatial packages require external system dependencies. If installing these packages fails, consult the [instructions for installing dependencies on the `sf` package website](https://r-spatial.github.io/sf/#installing). Finally, **ensure all R packages are updated** to their most recent versions by clicking on the Update button on the Packages tab in RStudio.
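If you prefer working from the console instead of the RStudio Packages tab, the standard base R alternative below updates every installed package that has a newer version available:
```{r intro-update-packages, eval = FALSE}
# update all installed packages to their latest versions
update.packages(ask = FALSE)
```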
### Tidyverse {#intro-setup-tidyverse}
Throughout this book, we use packages from the [Tidyverse](https://www.tidyverse.org/), an opinionated collection of R packages designed for data science. Packages such as [`ggplot2`](http://ggplot2.tidyverse.org/), for data visualization, and [`dplyr`](http://dplyr.tidyverse.org/), for data manipulation, are two of the best known Tidyverse packages; however, there are many more. In the following chapters, we often use Tidyverse functions without explanation. If you encounter a function you're unfamiliar with, consult the documentation for help (e.g. `?mutate` to see help for the `dplyr` function `mutate()`). More generally, the free online book [R for Data Science](http://r4ds.had.co.nz/) by [Hadley Wickham](http://hadley.nz/) is the best introduction to working with data in R using the Tidyverse.
The one piece of the Tidyverse that we will cover here, because it is ubiquitous throughout this book and unfamiliar to many, is the pipe operator `%>%`. The pipe operator takes the expression to the left of it and "pipes" it into the first argument of the expression on the right, i.e. one can replace `f(x)` with `x %>% f()`. The pipe makes code significantly more readable by avoiding nested function calls, reducing the need for intermediate variables, and making sequential operations read left-to-right. For example, to add a new variable to a data frame, then summarize using a grouping variable, the following are equivalent:
```{r pipes}
library(dplyr)
# pipes
mtcars %>%
  mutate(wt_kg = 454 * wt) %>%
  group_by(cyl) %>%
  summarize(wt_kg = mean(wt_kg))
# intermediate variables
mtcars_kg <- mutate(mtcars, wt_kg = 454 * wt)
mtcars_grouped <- group_by(mtcars_kg, cyl)
summarize(mtcars_grouped, wt_kg = mean(wt_kg))
# nested function calls
summarize(
  group_by(
    mutate(mtcars, wt_kg = 454 * wt),
    cyl
  ),
  wt_kg = mean(wt_kg)
)
```
Once you become familiar with the pipe operator, we believe you'll find the above example using the pipe the easiest of the three to read and interpret.
### Getting eBird data access {#intro-setup-ebird}
The complete eBird database is provided via the [eBird Basic Dataset (EBD)](https://ebird.org/science/download-ebird-data-products), a large text file. To access the EBD, begin by [creating an eBird account and signing in](https://secure.birds.cornell.edu/cassso/account/create). Then visit the [eBird Data Access page](https://ebird.org/data/download) and fill out the data access request form. eBird data access is free; however, you will need to request access in order to download the EBD. Filling out the access request form allows eBird to keep track of the number of people using the data and obtain information on the applications for which the data are used.
Once you have access to the data, proceed to the [download page](https://ebird.org/data/download/ebd). Download both the World EBD (~ 42 GB compressed, ~ 210 GB uncompressed) and the corresponding Sampling Event Data (~ 3.5 GB compressed, ~ 11 GB uncompressed). The former provides observation-level data, while the latter provides checklist-level data; both files are required for species distribution modeling. If limited hard drive space or a slow internet connection make dealing with these large files challenging, consult Section \@ref(ebird-size-custom) for details on a method for downloading a subset of the EBD.
The downloaded data files will be in `.tar` format and should be unarchived. The resulting directories will contain files with the extension `.txt.gz`; these should be uncompressed (on Windows use [7-Zip](https://www.7-zip.org/), on Mac use the default system uncompression utility) to produce two text files (e.g., `ebd_relAug-2019.txt` and `ebd_sampling_relAug-2019.txt`). Move these two large, uncompressed `.txt` files to a sensible, central location on your computer. In general, we suggest creating an `ebird/` folder nested in a `data/` folder within your home directory (i.e. `~/data/ebird/`) to store these files, and throughout the remainder of this chapter we'll assume you've placed the data there. If you choose to store the EBD elsewhere, you will need to update references to this folder in the code. If the files are too large to fit on your computer's hard drive, they can be stored on an external hard drive.
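If you'd rather do the unarchiving and decompression from within R, a sketch along these lines should work; the file names match the August 2019 release used here and will differ for other releases, the download location is assumed to be `~/Downloads/`, and the `R.utils` package is assumed to be installed:
```{r intro-unpack-ebd, eval = FALSE}
library(R.utils)

# paths and file names are illustrative: adjust to match your downloads
ebd_dir <- "~/data/ebird"
untar("~/Downloads/ebd_relAug-2019.tar", exdir = ebd_dir)
untar("~/Downloads/ebd_sampling_relAug-2019.tar", exdir = ebd_dir)
# decompress the .txt.gz files; gunzip() removes the .gz files afterwards
gunzip(file.path(ebd_dir, "ebd_relAug-2019.txt.gz"))
gunzip(file.path(ebd_dir, "ebd_sampling_relAug-2019.txt.gz"))
```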
Each time you want to access eBird data in an R project, you'll need to reference the full path to these text files, for example `~/data/ebird/ebd_relAug-2019.txt`. In general, it's best to avoid using absolute paths in R scripts because it makes them less portable: if you're sharing the files with someone else, they'll need to change the file paths to point to the location at which they've stored the eBird data. The R package `auk` provides a workaround for this by allowing users to set an environment variable (`EBD_PATH`) that points to the directory where you've stored the eBird data. To set this variable, use the function [`auk_set_ebd_path()`](https://cornelllabofornithology.github.io/auk/reference/auk_set_ebd_path.html). For example, if the EBD and Sampling Event Data files are in `~/data/ebird/`, use:
```{r set-ebd-path, eval = FALSE}
# set ebd path
auk::auk_set_ebd_path("~/data/ebird/")
```
After **restarting your R session**, you should be able to refer directly to the EBD or Sampling Event Data files within `auk` functions (e.g., `auk_ebd("ebd_relAug-2019.txt")`). Provided your collaborators have also set `EBD_PATH`, your scripts should now be portable.
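For example, once `EBD_PATH` is set, referencing both files (using the August 2019 release file names from above) looks something like the following minimal sketch:
```{r intro-auk-ebd, eval = FALSE}
library(auk)
# refer to the EBD and sampling event data by file name alone,
# relying on EBD_PATH to supply the directory
ebd <- auk_ebd("ebd_relAug-2019.txt",
               file_sampling = "ebd_sampling_relAug-2019.txt")
```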
You now have access to the full eBird dataset! Note, however, that **the EBD is updated monthly**. If you want the most recent eBird records, be sure to **regularly download an updated version**. Finally, **whenever you update the EBD, always update the `auk` package as well**; this will ensure that `auk` can handle any changes to the EBD that may have occurred.
### GIS data {#intro-setup-gis}
Throughout this book, we'll be producing maps of species distributions. To provide context for these distributions, we'll need GIS data for political boundaries. [Natural Earth](https://www.naturalearthdata.com/) is the best source for a range of tightly integrated vector and raster GIS data for producing professional cartographic maps. The R package [`rnaturalearth`](https://github.com/ropensci/rnaturalearth) provides a convenient method for accessing these data from within R. We'll also need [Bird Conservation Region (BCR)](http://nabci-us.org/resources/bird-conservation-regions/) boundaries, which are available through [Bird Studies Canada](https://www.birdscanada.org/research/gislab/index.jsp?targetpg=bcr&targetpg=bcr).
These GIS data layers are most easily accessed by [**downloading the data package**](https://github.com/cornelllabofornithology/ebird-best-practices/raw/master/data/data.zip) for this book.
These data were generated using the following code. If you intend to run this code yourself, first create an RStudio project, so the files will be stored within the `data/` subdirectory of the project. This will allow us to load these data in later chapters as they're needed. Note that calls to `ne_download()` often produce warnings suggesting that you've used the incorrect "category"; these can safely be ignored.
```{r intro-setup-gis, eval = FALSE}
library(sf)
library(rnaturalearth)
library(dplyr)

# file to save spatial data
gpkg_dir <- "data"
if (!dir.exists(gpkg_dir)) {
  dir.create(gpkg_dir)
}
f_ne <- file.path(gpkg_dir, "gis-data.gpkg")

# download bcrs
tmp_dir <- normalizePath(tempdir())
tmp_bcr <- file.path(tmp_dir, "bcr.zip")
paste0("https://www.birdscanada.org/research/gislab/download/",
       "bcr_terrestrial_shape.zip") %>%
  download.file(destfile = tmp_bcr)
unzip(tmp_bcr, exdir = tmp_dir)
bcr <- file.path(tmp_dir, "BCR_Terrestrial_master_International.shp") %>%
  read_sf() %>%
  select(bcr_code = BCR, bcr_name = LABEL) %>%
  filter(bcr_code == 27)
# clean up
list.files(tmp_dir, "bcr", ignore.case = TRUE, full.names = TRUE) %>%
  unlink()

# political boundaries
# land border with lakes removed
ne_land <- ne_download(scale = 50, category = "cultural",
                       type = "admin_0_countries_lakes",
                       returnclass = "sf") %>%
  filter(CONTINENT == "North America") %>%
  st_set_precision(1e6) %>%
  st_union()
# country lines
# downloaded globally then filtered to north america with st_intersects()
ne_country_lines <- ne_download(scale = 50, category = "cultural",
                                type = "admin_0_boundary_lines_land",
                                returnclass = "sf") %>%
  st_geometry()
ne_country_lines <- st_intersects(ne_country_lines, ne_land, sparse = FALSE) %>%
  as.logical() %>%
  {ne_country_lines[.]}
# states, north america
ne_state_lines <- ne_download(scale = 50, category = "cultural",
                              type = "admin_1_states_provinces_lines",
                              returnclass = "sf") %>%
  filter(adm0_a3 %in% c("USA", "CAN")) %>%
  mutate(iso_a2 = recode(adm0_a3, USA = "US", CAN = "CAN")) %>%
  select(country = adm0_name, country_code = iso_a2)

# output
unlink(f_ne)
write_sf(ne_land, f_ne, "ne_land")
write_sf(ne_country_lines, f_ne, "ne_country_lines")
write_sf(ne_state_lines, f_ne, "ne_state_lines")
write_sf(bcr, f_ne, "bcr")
```
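As a quick check that the GeoPackage was written correctly, you can list its layers and plot one of them; this is just a sanity check and isn't required for the later chapters:
```{r intro-gis-check, eval = FALSE}
library(sf)
# list the layers stored in the geopackage and plot the bcr boundary
st_layers("data/gis-data.gpkg")
bcr <- read_sf("data/gis-data.gpkg", layer = "bcr")
plot(st_geometry(bcr))
```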