Issues with duplicate rows following zero filling #83

Open
mayamonk opened this issue Jul 23, 2024 · 1 comment

@mayamonk

I'm running into a duplication problem while producing zero-filled data from the complete eBird dataset to create a smaller presence-absence dataset. Following both tutorials, "Best Practices for Using eBird Data" (https://ebird.github.io/ebird-best-practices/) and "Introduction to auk" (https://cornelllabofornithology.github.io/auk/articles/auk.html#quick-start), the step that collapses the zero-filled data results in each entry being duplicated, and most entries appear to be duplicated 322 times.

For instance, here is the code I used while following the "Introduction to auk" tutorial:

library(auk)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(lubridate)
library(readr)
library(sf)

states <- c("US-GA", "US-IL", "US-CO", "US-IN", "US-WI", "US-FL", "US-AZ", "US-NY", "US-MO", "US-WA", "US-DE")

input_file <- "/Volumes/UES_LAB/UWIN_acad_perf_analysis/ebird/ebd_US_relJun-2023.txt/ebd_US_relJun-2023.txt"
output_file <- "ebd-filtered-states.txt"
ebird_data <- input_file %>%
  auk_ebd() %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = output_file) %>%
  read_ebd()

ebird_data %>%
  glimpse()

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp) %>%
  auk_state(states) %>%
  auk_complete()
filters

ebd_sed_filtered <- auk_filter(filters, file = "ebd_filteredPA.txt", file_sampling = "sampling_filteredPA.txt")
ebd_sed_filtered

read_ebd(ebd_sed_filtered)

A tibble: 1,070 × 48

read_ebd(f_ebd)

A tibble: 1,070 × 48

read_ebd(f_smp)
A tibble: 1,070 × 48

Here the data show 1,070 entries, and everything has worked so far.

ebd_zf <- auk_zerofill(ebd_sed_filtered)
ebd_zf

Zero-filled EBD: 1,096 unique checklists, for 322 species.

ebd_zf_df <- collapse_zerofill(ebd_zf)
class(ebd_zf_df)
ebd_zf_df

A tibble: 352,912 × 57

After collapse_zerofill, each entry is duplicated around 322 times. The other tutorial, "Best Practices for Using eBird Data", behaves the same way: the entries are duplicated after this code:

zerofill <- auk_zerofill(observations, checklists, collapse = TRUE)

It also results in the same total number of entries: 352,912. Attempts to remove the duplicates with code such as the following are unsuccessful:

unique.data.frame(zerofill)
unique.array(zerofill)
unique.matrix(zerofill)

("zerofill" is the name of the zero-filled dataset; none of these calls changes it.)

Has anyone run into this issue or know of a possible solution? Thanks!

@mstrimas
Contributor

I'm confused about what's happening here:

f_ebd <- output_file
f_smp <- output_file
filters <- auk_ebd(f_ebd, file_sampling = f_smp)

You seem to be using the same file for both the observation dataset and the checklist dataset. I also don't understand why you have this second round of filtering. The idea is to filter the observations and checklists at the same time, i.e.

observations_input <- "ebd_US_relJun-2023.txt"
checklists_input <- "ebd_US_relJun-2023_sampling.txt"
observations_output <- "ebd-filtered-states_observations.txt"
checklists_output <- "ebd-filtered-states_checklists.txt"
ebird_data <- auk_ebd(observations_input, file_sampling = checklists_input) %>%
  auk_date(date = c("2011-01-01", "2012-12-31")) %>%
  auk_country(country = "United States") %>%
  auk_state(states) %>%
  auk_complete() %>%
  auk_filter(file = observations_output, file_sampling = checklists_output)

The checklist file (ending in _sampling) is provided when you select the "Include sampling event data" check box when downloading the data.
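
A note on the row count reported above: 1,096 checklists times 322 species is exactly 352,912, which is consistent with a collapsed zero-filled table holding one row per checklist and species combination rather than exact duplicates. The sketch below illustrates the idea in base R on made-up toy data. It does not call auk, and the data frames, column names, and species are hypothetical, chosen only for illustration.

```r
# Toy data (hypothetical): 3 checklists and 2 species.
checklists <- data.frame(
  checklist_id = c("C1", "C2", "C3"),
  stringsAsFactors = FALSE
)
observations <- data.frame(
  checklist_id = c("C1", "C1", "C3"),
  species = c("robin", "wren", "robin"),
  stringsAsFactors = FALSE
)

# Zero-filling in miniature: one row for every checklist x species pair.
grid <- expand.grid(
  checklist_id = checklists$checklist_id,
  species = unique(observations$species),
  stringsAsFactors = FALSE
)

# Flag the pairs that were actually reported; all others are absences.
observed_keys <- paste(observations$checklist_id, observations$species)
grid$species_observed <- paste(grid$checklist_id, grid$species) %in% observed_keys

nrow(grid)                             # 3 checklists * 2 species = 6 rows
nrow(unique(grid))                     # still 6: rows differ by species
nrow(grid[grid$species == "robin", ])  # 3: one row per checklist

# The same arithmetic with the numbers reported in this issue.
1096 * 322                             # 352912
```

In the toy table each checklist appears once per species, so unique() removes nothing, while subsetting to a single species leaves one row per checklist. If the same holds for the real output, filtering on the species column of the collapsed data before analysis would be the analogous step.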
