Issues with duplicate rows following zero filling #83

mayamonk · 2024-07-23T22:26:30Z

I'm running into what looks like a duplication error while producing zero-filled data from the complete eBird dataset to create a smaller presence-absence dataset. Following both tutorials, "Best Practices for Using eBird Data" (https://ebird.github.io/ebird-best-practices/) and "Introduction to auk" (https://cornelllabofornithology.github.io/auk/articles/auk.html#quick-start), the step that collapses the zero-filled data results in each entry being duplicated, in most cases about 322 times.

For instance, here is the code I used while following the "Introduction to auk" tutorial:

library(auk)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lubridate)
library(readr)
library(sf)

states <- c("US-GA", "US-IL", "US-CO", "US-IN", "US-WI", "US-FL", "US-AZ", "US-NY", "US-MO", "US-WA", "US-DE")

input_file <- "/Volumes/UES_LAB/UWIN_acad_perf_analysis/ebird/ebd_US_relJun-2023.txt/ebd_US_relJun-2023.txt"
output_file <- "ebd-filtered-states.txt"
ebird_data <- input_file %>%
  auk_ebd() %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = output_file) %>%
  read_ebd()

ebird_data %>%
glimpse()

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp) %>%
  auk_state(states) %>%
  auk_complete()
filters

ebd_sed_filtered <- auk_filter(filters, file = "ebd_filteredPA.txt", file_sampling = "sampling_filteredPA.txt")
ebd_sed_filtered

read_ebd(ebd_sed_filtered)

A tibble: 1,070 × 48

read_ebd(f_ebd)

A tibble: 1,070 × 48

read_ebd(f_smp)
A tibble: 1,070 × 48

At this point the data show 1,070 entries and everything has worked as expected.

ebd_zf <- auk_zerofill(ebd_sed_filtered)
ebd_zf

Zero-filled EBD: 1,096 unique checklists, for 322 species.

ebd_zf_df <- collapse_zerofill(ebd_zf)
class(ebd_zf_df)
ebd_zf_df

A tibble: 352,912 × 57

After collapse_zerofill(), each entry appears to be duplicated around 322 times. The other tutorial, "Best Practices for Using eBird Data", behaves the same way: the entries are duplicated after running:

zerofill <- auk_zerofill(observations, checklists, collapse = TRUE)

It also results in the same total number of entries: 352,912. Attempts to remove the duplicates in code are unsuccessful, for example:

unique.data.frame(zerofill)
unique.array(zerofill)
unique.matrix(zerofill)

("zerofill" is the name of the zero-filled dataset; none of these calls change the result.)

Has anyone run into this issue, or does anyone know of a possible solution? Thanks!

mstrimas · 2024-07-26T14:26:40Z

I'm confused about what's happening here:

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp)

You seem to be using the same file for both the observation dataset and the checklists dataset. I also don't understand why you have this second round of filtering. The idea is to filter the observations and checklists at the same time, i.e.

observations_input <- "ebd_US_relJun-2023.txt"
checklists_input <- "ebd_US_relJun-2023_sampling.txt"
observations_output <- "ebd-filtered-states_observations.txt"
checklists_output <- "ebd-filtered-states_checklists.txt"
ebird_data <- auk_ebd(observations_input, file_sampling = checklists_input) %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = observations_output, file_sampling = checklists_output)

The checklist file (ending in _sampling) is provided when you select the "Include sampling event data" check box when downloading the data.
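
One thing worth checking against the numbers reported above: 1,096 checklists × 322 species = 352,912, which is exactly the row count after collapsing. Zero-filled data in long format has one row per checklist-species pair, so a count like this may be expected output rather than duplication. The base-R sketch below illustrates the idea with made-up data; it does not call auk, and the checklist IDs, species, and counts are hypothetical (the column names are chosen to mirror those in auk's collapsed output).

```r
# Hypothetical toy data: 3 checklists, 2 species, 2 positive detections.
checklists <- data.frame(checklist_id = c("S1", "S2", "S3"))
observations <- data.frame(
  checklist_id = c("S1", "S3"),
  scientific_name = c("Turdus migratorius", "Cardinalis cardinalis"),
  observation_count = c(2L, 1L)
)

# Zero-filling by hand: pair every checklist with every species.
species <- unique(observations$scientific_name)
grid <- expand.grid(
  checklist_id = checklists$checklist_id,
  scientific_name = species,
  stringsAsFactors = FALSE
)

# Attach the detections; pairs with no detection get a count of 0.
zf <- merge(grid, observations,
            by = c("checklist_id", "scientific_name"), all.x = TRUE)
zf$observation_count[is.na(zf$observation_count)] <- 0L
zf$species_observed <- zf$observation_count > 0

# Rows = checklists x species, and no checklist-species pair repeats.
nrow(zf) == nrow(checklists) * length(species)
anyDuplicated(zf[, c("checklist_id", "scientific_name")]) == 0
```

Running the same two checks on the real collapsed data frame (row count against checklists × species, and anyDuplicated() on checklist_id plus scientific_name) would show whether any rows are truly repeated. Comparing whole rows with unique() finds nothing to drop when each row differs in its species.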
