include more trait datasets incl. Std version #20

Open
3 tasks done
fdschneider opened this issue Nov 15, 2017 · 15 comments
Comments

@fdschneider
Member

fdschneider commented Nov 15, 2017

The package should provide more datasets from the living spreadsheet (fdschneider/bexis_traits#20).

  • identify data for integration
  • write script to extract data upon call of data() (files are placed in 'data/' directory)
  • include documentation in the package file 'R/data.R'

A standardised version of each dataset should be provided as well (linking to trait Thesauri and taxon Ontologies).

@caterinap
Member

Not super sure about including more datasets in the package itself (I don't know if there is an "ideal" size for a package). If we do, they should be small, I guess.
We can alternatively/also provide a tutorial with more examples on how to handle different trait datasets using the package (not only the CC.0 ones).

@fdschneider
Member Author

Ha! The trick is, we're not including the datasets, we just provide code to pull the datasets from their source:

See the files documented in data.R. The file is downloaded and made available for use only when you call data(carabids). The package remains small. The user decides what to download.
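As a rough illustration of this on-demand idea, here is a minimal base-R sketch. It is not the package's actual code: the temp CSV stands in for a remote source, the script content is a hypothetical example of what a file such as 'data/carabids.R' might hold, and sys.source() is used here only to mimic the step where data() evaluates a data script.

```r
# Stand-in for the remote source (hypothetical): a small CSV in a temp file.
# In a real data script this would be the URL of the published dataset.
remote <- tempfile(fileext = ".csv")
writeLines(c("species,body_length_mm",
             "Abax ovalis,13.5",
             "Carabus nemoralis,24"), remote)

# Hypothetical content of a data script such as 'data/carabids.R'.
# Nothing is read from the source until the script is evaluated.
script <- tempfile(fileext = ".R")
writeLines(
  sprintf("carabids <- utils::read.csv(%s, stringsAsFactors = FALSE)",
          deparse(remote)),
  script
)

# data(carabids) would evaluate the script on demand;
# sys.source() mimics that step in a separate environment.
env <- new.env()
before <- exists("carabids", envir = env, inherits = FALSE)  # not loaded yet
sys.source(script, envir = env)
after <- exists("carabids", envir = env, inherits = FALSE)   # loaded now
```

The point of the design is that shipping only the script keeps the package small, while the table itself is fetched at the moment the user asks for it.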

The package vignette contains plenty of advice on how to harmonize your own data, or data from other sources.

@caterinap
Member

Then it's all good!! Sorry, I need to dive a bit more into the package!

@caterinap
Member

@fdschneider I started to add more datasets in https://github.com/caterinap/traitdataform/tree/master/data. See if it's fine, I can continue adding more later in the week.
Also added more entries in the spreadsheet and a new column indicating if the dataset is in the package.

@Rekyt
Contributor

Rekyt commented Nov 21, 2017

Hi @fdschneider, your initiative seems really cool! I hope to use it soon ;)

A lot of work has been done by the people who built the EcoData Retriever (http://www.data-retriever.org/, GitHub repo); you can see the available datasets here.

I'm also thinking about the traits package by rOpenSci. Maybe you could write some wrappers around those existing tools?

@fdschneider
Member Author

@Rekyt Thanks. Yes, I looked into those. We basically use the same idea as Retriever when pulling example datasets from the original sources on Figshare or wherever. The 'traits' package is great for tapping APIs of more extensive databases. There is also the package 'TR8'.

It would be cool to have wrappers for these data sources that add harmonization on top.

@caterinap
Member

Ok, now all CC.0 datasets are in the package, in the same form as the "carabids" one. On Windows I did not get errors when building the package (only warnings).

Some remarks:

  • I have not yet modified 3 datasets (biotraits, plantsBROT and plantsD3) because they have a CC BY 4.0 license. For the moment they are still there and we can decide to remove or modify them later.
  • In heteroptera_raw I did not convert the coordinates to decimal degrees because we do not import any package to do so (as far as I saw).
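If importing an extra package is the only obstacle, the conversion could be sketched in base R along these lines. This assumes the coordinates are stored as degree/minute/second strings with a hemisphere letter, which is a guess about the format of heteroptera_raw; the function name dms_to_decimal is hypothetical.

```r
# Hypothetical helper: convert degree/minute/second strings to decimal degrees.
# Assumes input such as 51°28'40"N; missing minutes or seconds count as 0.
dms_to_decimal <- function(x) {
  # Pull out the numeric parts (degrees, minutes, seconds) of each string.
  parts <- regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))
  # Southern and western hemispheres get a negative sign.
  hemi <- ifelse(grepl("[SW]", x), -1, 1)
  vapply(seq_along(x), function(i) {
    p <- c(as.numeric(parts[[i]]), 0, 0)[1:3]
    hemi[i] * (p[1] + p[2] / 60 + p[3] / 3600)
  }, numeric(1))
}

# Made-up example coordinates (degree sign written as a Unicode escape).
coords <- c("51\u00b028'40\"N", "0\u00b030'00\"W")
decimal <- dms_to_decimal(coords)
```

Strings without any number come back as NA, so unparseable entries would stay visible for manual checking.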

Have a look and let me know if you want to add/remove/change anything!

@fdschneider
Member Author

Great, thanks.

I will pull and test it.

I wasn't aware that some of those datasets have so many traits. Great job mapping them to the ontologies.
However, I just noticed that the URIs in Nadja's list are not correct. They should correspond to the URL including the heading anchor, e.g. https://ecologicaltraitdata.github.io/TraitDataList/#age_at_reproduction.
We should fix this in the TraitDataList repository, @nadjasimons.

Furthermore, I thought that some of the cryptic trait names might be replaced by more intuitive trait names.
E.g., if the thesaurus call states

X10.2_SocialGrpSize = traitdataform::as.trait("social_group_size", expectedUnit = NA,
valueType = "numeric"),

then the function standardize() will keep the original name in traitName and provide the more intuitive one in traitNameStd.
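The effect on the data can be pictured with a small base-R sketch. This is not the implementation of standardize(); it only illustrates the idea of keeping the original name next to a standardized one. The lookup vector and the second trait name are made up for illustration.

```r
# Long-format trait table with original (cryptic) trait names; values are made up.
traits <- data.frame(
  scientificName = c("Species a", "Species b"),
  traitName = c("X10.2_SocialGrpSize", "some_unmapped_trait"),
  traitValue = c(3, 120),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from original names to more intuitive standard names.
lookup <- c(X10.2_SocialGrpSize = "social_group_size")

# Keep traitName untouched and add the standardized name as a new column;
# names without a match in the lookup become NA.
traits$traitNameStd <- unname(lookup[traits$traitName])
```

Keeping both columns means the link back to the source dataset is never lost, even when the standardized vocabulary changes.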

The CC BY 4.0 data could be added in the future in just the same way, since we always state the correct reference.

I think the Ricklefs data on passerine birds can't be included since it is not labelled as public domain or CC BY. Sorry, that license statement in the documentation is my fault, I guess. I already removed it from the current version.

@caterinap
Member

caterinap commented Nov 28, 2017

ok, so I will:

  • change the URIs once they are fixed
  • modify cryptic trait names
  • add the CC BY 4.0 datasets (when I have a bit of time)

Concerning the passerines, I actually checked before adding the dataset, and in the metadata (a Word file in the supplementary material) he states:

  1. Copyright restrictions: None
  2. Proprietary restrictions: None
  3. Costs: None

So I guess that we could keep it.

@fdschneider
Member Author

Ok, thanks. No pressure. Whenever you find time.

The passerines: I'm relieved. After a colleague assured me that the data are open, I was desperately looking for this disclaimer but didn't find it. Great 'bad example' for open data labelling.

@nadjasimons

I fixed URIs in the trait data list

@fdschneider fdschneider removed this from the v0.3 milestone Nov 26, 2018
@fdschneider
Member Author

For now this is on hold because it overlaps with functionality provided by Will Pearse's natdb package (@willpearse). It includes 100+ datasets with short recipes (see this file) and, in the process, fixes some major heterogeneity in the data (like replacing abbreviations with species names or adding units). I have not had the time to investigate how the data are processed into a virtual database.
We should figure out how the two packages can complement each other.

Regardless, I would like to include Caterina's pull request for v1.0 to have some more example datasets to draw from.

@fdschneider fdschneider added this to the v1.0 milestone Nov 26, 2018
@willpearse

willpearse commented Dec 4, 2018 via email

@fdschneider
Member Author

Thanks Will, and sorry for not keeping up with our earlier e-mail discussion. I wanted to get a first functional version out before looking further into interfaces with other tools.
Let me know how I can help make this work seamlessly with your package.

@willpearse

willpearse commented Dec 7, 2018 via email

5 participants