Mostly minor edits in the four Rmd files. #9

Open: wants to merge 4 commits into base: master
19 changes: 9 additions & 10 deletions index.Rmd
@@ -6,23 +6,23 @@ date: "v0.6, released: 14 Nov. 2017"

# Glossary of terms

This defined vocabulary aims at providing all essential terms to describe datasets of functional trait measurements and facts for ecological research. Many terms refine terms from the Darwin Core Standard and it's extensions (terms of DWC are referenced thus in field 'Refines'; the full Darwin Core Standard can be found here: http://rs.tdwg.org/dwc/terms/index.htm).
This defined vocabulary aims at providing all essential terms to describe datasets of functional trait measurements and facts for ecological research. Many terms refine terms from the Darwin Core Standard and its extensions (terms of DWC are referenced thus in field 'Refines'; the full Darwin Core Standard can be found here: http://rs.tdwg.org/dwc/terms/index.htm).

The glossary of terms is ordered into a **core section** with essential columns for trait data, extensions that allow additional layers of information to be provided, and a vocabulary for **metadata** information of particular importance for trait data.

Another section provides defined terms and structure for **trait Thesauri**, i.e. lists of trait definitions.
A third section provides defined terms and structure for **trait thesauri**, i.e. lists of trait definitions.

We provide three **extensions** of the vocabulary, that allow for additional information on the trait measurement.

- the `Occurrence` extension contains information on the level of individual specimens, such as date and location and method of sampling and preservation, or physiological specifications of the phenotype, such as sex, life stage or age.
- the `Occurrence` extension contains information on the level of individual specimens, such as date, location, and method of sampling and preservation, or physiological specifications of the phenotype, such as sex, life stage or age.
- the `MeasurementOrFact` extension takes information at the level of single measurements or reported values, such as the original literature from where the value is cited, the method of measurement or statistical method of aggregation.
- The `BiodiversityExploratories` extension provides columns for localisation for trait data from the Biodiversity Exploratories sites (www.biodiversity-exploratories.de).
- The `BiodiversityExploratories` extension provides columns for linking trait data from the Biodiversity Exploratories to the respective project sites (www.biodiversity-exploratories.de).

This glossary of terms is available as

- this human-readable reference (html file), including commentaries and further definitions
- a csv table file (the 'source' file, [TraitDataStandard.csv](https://github.com/EcologicalTraitData/ETS/raw/master/TraitDataStandard.csv))
- a machine readable RDF ontology file, compliant with semantic web standards accessible via an API (produced by and hosted on GFBio Terminology Server)
- a machine readable RDF ontology file, compliant with semantic web standards accessible via an API (produced by and hosted on the GFBio Terminology Server)

## Table of contents

@@ -67,11 +67,11 @@ for(j in namespace) {
# Core traitdata terms

For the essential primary data (trait value, taxon assignment, trait name), the trait data standard recommends reporting the original naming and value scheme as used by the data provider. However, to ensure compatibility with other datasets, the original data provider's information should be duplicated into standardized columns indexed by appending `Std` to the column name.
This ensures compatibility on the provider's side and transparency for data users on the reported measurements and facts, and enables checking for inconsistencies and misspellings in the complete dataset provided by the author. If provided, the standardized fields allow merging heterogeneous data sources into a single table to perform further analyses. This practice of double bookkeeping of trait data has successfully established for the TRY database on plant traits, for instance (Kattge et al. 2011. TRY – a global database of plant traits. Global Change Biology, 17, 2905–2935).
This ensures compatibility on the provider's side and transparency for data users on the reported measurements and facts, and enables checking for inconsistencies and misspellings in the complete dataset provided by the author. If provided, the standardized fields allow merging heterogeneous data sources into a single table to perform further analyses. This practice of double bookkeeping of trait data has been successfully established for the TRY database on plant traits, for instance (Kattge et al. 2011. TRY – a global database of plant traits. Global Change Biology, 17, 2905–2935).
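
The double bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`scientificName`, `traitName`, `traitValue`, `traitUnit`), the lookup tables and the example values are hypothetical; only the convention of appending `Std` to the column name is taken from the text.

```python
# Hypothetical lookup tables a data curator might maintain (invented values).
TRAIT_NAME_MAP = {"femur len.": "femur_length", "Femur length": "femur_length"}
UNIT_FACTORS_TO_MM = {"mm": 1.0, "cm": 10.0}


def add_std_columns(row):
    """Return a copy of `row` with standardized duplicates of the original columns."""
    out = dict(row)  # the provider's original values stay untouched
    # Collapse stray whitespace in the taxon name.
    out["scientificNameStd"] = " ".join(row["scientificName"].split())
    # Map the provider's trait name onto a harmonized name (None if unknown).
    out["traitNameStd"] = TRAIT_NAME_MAP.get(row["traitName"])
    # Convert the value to a common unit (None if the unit is unknown).
    factor = UNIT_FACTORS_TO_MM.get(row["traitUnit"])
    out["traitValueStd"] = None if factor is None else float(row["traitValue"]) * factor
    out["traitUnitStd"] = None if factor is None else "mm"
    return out


# Two invented provider tables with different naming and unit schemes.
provider_a = [{"scientificName": "Carabus  auratus", "traitName": "femur len.",
               "traitValue": "0.5", "traitUnit": "cm"}]
provider_b = [{"scientificName": "Carabus auratus", "traitName": "Femur length",
               "traitValue": "4.1", "traitUnit": "mm"}]

# Once the Std columns exist, heterogeneous sources can be stacked into one table.
merged = [add_std_columns(r) for r in provider_a + provider_b]
```

Keeping the original columns next to the `Std` columns means that a questionable standardization can always be traced back to what the provider actually reported.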

By linking to (public) ontologies via the field `taxonID`, further taxonomic information can be extracted for analysis. Alternatively, `taxonID` may also link to an accompanying datasheet that contains information on the taxonomic resolution or specification of the observation.

Similarly, linking to trait terminologies (a 'Thesaurus') via the field `traitID` allows an unambiguous interpretation of the trait measurement. If no online ontology is available, an accompanying dataset should specify the trait definition. For setting up such a Thesaurus, we propose the use of terms provided in section 'Traitlist' below.
Similarly, linking to trait terminologies (a 'Thesaurus') via the field `traitID` allows an unambiguous interpretation of the trait measurement. If no online ontology is available, an accompanying dataset should specify the trait definition. For setting up such a thesaurus, we propose the use of terms provided in section 'Traitlist' below.

```{r, results = 'asis', echo = FALSE}

@@ -99,11 +99,10 @@ parseterms("Traitdata")

# Metadata vocabulary

For datasets collate from multiple other datasets
There is the set of information that applies to the entire trait-dataset, which classifies them as metadata.
For datasets collated from multiple other datasets, there is the subset of information that applies to the entire trait-dataset, which classifies it as metadata.


To retain the rights of the original data contributor, the field `rightsHolder` states the person or organization that owns or manages the rights to the data; `bibliographicCitation` states a bibliographic reference which should be cited when the data is used; and license specifies under which terms and conditions the data can be used, re-used and/or published. This information always applies to one single fact or measurement,
To retain the rights of the original data contributor, the field `rightsHolder` states the person or organization that owns or manages the rights to the data; `bibliographicCitation` states a bibliographic reference which should be cited when the data is used; and `license` specifies under which terms and conditions the data can be used, re-used and/or published. This information always applies to one single fact or measurement.

Further information on the larger dataset which originally contained this entry can be stored in `datasetID`, `datasetName`, `author` <!-- -->. These columns should hence give credit to the person who compiled the original dataset and is responsible for the correct identification and reporting of the rights holder.
This information may usually be kept in the metadata of the dataset, but if datasets from different sources are merged, it should be referred to by a unique identifier (`datasetID`) or be reported as additional columns in the merged dataset (`author`, `license`, ...; see Dublin Core Metadata standards, Ref).
4 changes: 2 additions & 2 deletions structure.Rmd
@@ -17,11 +17,11 @@ There are two possibilities to integrate further information to the core trait d

For choosing one or the other, the trade-off is data-consistency and readability *vs.* avoidance of content duplication:

For standalone dataset publications on a hosting service with only little information content beside the core traitdata columns, the first would be the preferred format, since it facilitates an analysis of cofactors and correlations further down the road. If datasets of different source are merged, the information is readily available without the risk of breaking the reference to an external datasheet.
For standalone dataset publications on a hosting service with only little information content beside the core traitdata columns, the first option would be preferred, since it facilitates an analysis of cofactors and correlations further down the road. If datasets of different source are merged, the information is readily available without the risk of breaking the reference to an external datasheet.
Other cases where key data columns would be placed in the same table as the core data are traits assessed at a higher level of organisation, e.g. microbial functional traits assessed at the community level from a soil sample. Here, location or measurement information is the primary focus of the investigation (see vocabulary extensions below).
A general rule on whether a column describes asset data or is part of the central dataset is ill-advised. Therefore, our glossary of terms and its extensions should be used to describe the scientific data according to the study context.

The latter links separate data sheets by identifiers, which has the advantage of tidy datasets and avoids duplication of verbose information [@wickham14]. As a rule of thumb, the columns of the 'Measurement or Fact' and 'Occurrence' extension would be stored in a separate data sheet. The use of Darwin Core Archives [http://eol.org/info/structured_data_archives, DwC-A; @robertson09] is the recommended structure for GBIF [@gbif17, http://tools.gbif.org/dwca-assistant/] or EOL TraitBank [@parr16, http://eol.org/info/cp_archives]. These are .zip archives containing data table txt-files along with a descriptive metadata file (in .xml format). Detailled descriptions and tools for validation can be found on the website of EOL (http://eol.org/info/cp_archives) and GBIF (http://tools.gbif.org/dwca-assistant/).
The second option links separate data sheets by identifiers, which has the advantage of tidy datasets and avoids duplication of verbose information [@wickham14]. As a rule of thumb, the columns of the 'Measurement or Fact' and 'Occurrence' extension would be stored in a separate data sheet. The use of Darwin Core Archives [http://eol.org/info/structured_data_archives, DwC-A; @robertson09] is the recommended structure for GBIF [@gbif17, http://tools.gbif.org/dwca-assistant/] or EOL TraitBank [@parr16, http://eol.org/info/cp_archives]. These are .zip archives containing data table txt-files along with a descriptive metadata file (in .xml format). Detailed descriptions and tools for validation can be found on the website of EOL (http://eol.org/info/cp_archives) and GBIF (http://tools.gbif.org/dwca-assistant/).

The metadata of any dataset that employs this data structure should refer to the respective version of the Ecological Traitdata Standard as "Schneider et al. 2017 Ecological Traitdata Standard v1.0, DOI: XXXX.xxxx, URL: https://ecologicaltraitdata.github.io/ETS/v1.0/". In addition to the versioned online reference, the dataset should also cite the methods paper "Schneider et al. (in preparation) ..." for an explanation of the rationale.

20 changes: 10 additions & 10 deletions thesauri.Rmd
@@ -20,11 +20,11 @@ A project-specific trait thesaurus may be a table of terms containing the follow

- a human readable, informative trait name (`trait`)
- unique dataset-specific identifier (`Identifier`), which is referenced in the trait data-set
- a short, unambiguous verbal definition (`traitDescription`) which may make use of standard terms provided in other Ontologies, e.g. the definition for 'fruit mass' in TOP reads: "the mass (PATO:mass), either fresh or dried, of a fruit (PO:fruit)", referring to Phenotypic Characteristics Ontology PATO and Planteome Plant Ontology, PO (http://top-thesaurus.org/annotationInfo?viz=1&&trait=Fruit_mass).
- constrain the legit factor levels (for categorical data, `factorLevels`) or expected standard units (`expectedUnit` for numerical data). The type of values should be differentiated in the field `valueType` by specifying 'numerical', 'logical', 'integer', 'categorical' traits.
- link the term to a broader or narrower term (`broaderTerm`, `narrowerTerm`), related terms (`relatedTerm`) or synonyms (`synonym`), e.g. the definition of 'femur length of first leg, left side' is narrower than 'femur length' which is narrower than 'leg trait' which is narrower than 'locomotion trait'. This extends the trait list into a semantic web resource, facilitates the classification of traits, and allows for cross-taxon comparative studies at the level of broader terms [@garnier17].
- a short, unambiguous verbal definition (`traitDescription`) which may make use of standard terms provided in other ontologies, e.g. the definition for 'fruit mass' in TOP reads: "the mass (PATO:mass), either fresh or dried, of a fruit (PO:fruit)", referring to Phenotypic Characteristics Ontology PATO and Planteome Plant Ontology, PO (http://top-thesaurus.org/annotationInfo?viz=1&&trait=Fruit_mass).
- legit factor levels (for categorical data, `factorLevels`) or expected standard units (`expectedUnit` for numerical data). The type of values should be differentiated in the field `valueType` by specifying 'numerical', 'logical', 'integer', or 'categorical' traits.
- links of each term to broader and/or narrower terms (`broaderTerm`, `narrowerTerm`), related terms (`relatedTerm`) or synonyms (`synonym`), e.g. the definition of 'femur length of first leg, left side' is narrower than 'femur length' which is narrower than 'leg trait' which is narrower than 'locomotion trait'. This extends the trait list into a semantic web resource, facilitates the classification of traits, and allows for cross-taxon comparative studies at the level of broader terms [@garnier17].
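
As a rough illustration of how the thesaurus fields listed above could be used, the following Python sketch checks reported values against a thesaurus entry. The entries, identifiers and the helper `check_measurement` are hypothetical; only the field names `trait`, `valueType`, `expectedUnit` and `factorLevels` come from the list.

```python
# A tiny in-memory thesaurus keyed by identifier; all entries are invented.
THESAURUS = {
    "T001": {"trait": "femur_length", "valueType": "numerical",
             "expectedUnit": "mm", "factorLevels": None},
    "T002": {"trait": "wing_form", "valueType": "categorical",
             "expectedUnit": None,
             "factorLevels": ["macropterous", "brachypterous"]},
}


def check_measurement(trait_id, value, unit=None):
    """Return a list of problems found when comparing a value to its thesaurus entry."""
    entry = THESAURUS.get(trait_id)
    if entry is None:
        return [f"unknown trait identifier: {trait_id}"]
    problems = []
    if entry["valueType"] == "numerical":
        # Numerical traits must parse as a number and use the expected unit.
        try:
            float(value)
        except (TypeError, ValueError):
            problems.append(f"value {value!r} is not numerical")
        if unit != entry["expectedUnit"]:
            problems.append(f"unit {unit!r} differs from expected {entry['expectedUnit']!r}")
    elif entry["valueType"] == "categorical":
        # Categorical traits must use one of the permitted factor levels.
        if value not in entry["factorLevels"]:
            problems.append(f"level {value!r} not in factorLevels")
    return problems
```

A call such as `check_measurement("T002", "winged")` would report that the level is not among the permitted factor levels, which is the kind of inconsistency a shared thesaurus is meant to expose.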

## defining expected values
## Defining expected values

Traits are not only defined in terms of their interpretation, but are ideally also standardised in terms of numerical units and, even more important, the use of factor levels. This is challenging, given the range of data types that fall within datasets of functional traits.

@@ -36,33 +36,33 @@ To cope with this variety of data types, definitions should refer to other well-

Such reference definitions should also refer to methodological handbooks (@perez-harguindeguy13; @moretti17), which standardise the process of measurement.

## semantic web
## Semantic web

Online ontologies extend into (machine readable) semantic web resources by providing a hierarchical classification of traits (TOP) or a relational tree of functional traits (e.g. TOP or T-SITA).

Each trait definition may link to a broader or narrower term. For instance, the definition of 'femur length of first leg, left side' is narrower than 'femur length' which is narrower than 'leg trait' which is narrower than 'locomotion trait'. (Ref semantic database methods)
This links traits of similar functional meaning.
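
The broader-term chain from the example above can be followed programmatically. The sketch below is a minimal illustration, assuming the links are available as a simple mapping from each term to its broader term; the 'tibia length' entry and both helper functions are hypothetical additions for the sake of the example.

```python
# Mapping of each term to its broader term; the femur chain follows the text,
# 'tibia length' is an invented sibling trait.
BROADER_TERM = {
    "femur length of first leg, left side": "femur length",
    "femur length": "leg trait",
    "tibia length": "leg trait",
    "leg trait": "locomotion trait",
}


def broader_chain(term):
    """Return the term followed by all successively broader terms."""
    chain = [term]
    seen = {term}
    while chain[-1] in BROADER_TERM:
        parent = BROADER_TERM[chain[-1]]
        if parent in seen:  # guard against cyclic definitions
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def common_broader_terms(a, b):
    """Return the broader terms shared by two traits (excluding the traits themselves)."""
    return set(broader_chain(a)[1:]) & set(broader_chain(b)[1:])
```

Here `common_broader_terms("femur length", "tibia length")` identifies the level at which two differently named measurements become comparable.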

These systematic approaches to traits will be very useful for comparing similar traits measured in different taxa at higher hierarchical levels.
Furthermore, trait definitions may refer to related terms that describe a similar feature but with other means, or synonyms defined in other trait ontologies. By providing this semantic interlinkage of trait ontologies, a web of definitions is spun across the internet which allows researchers and search engines to relate independent trait measurements with each other.
Furthermore, trait definitions may refer to related terms that describe a similar feature but with other means, or synonyms defined in other trait ontologies. By providing this semantic interlinkage of trait ontologies, a web of definitions is spun across the internet which allows researchers and search engines to relate independent trait measurements to each other.

The benefit of such classifications will increase if open Application Programming Interfaces (APIs) provide a way to extract the definitions and higher-level trait hierarchies programmatically via software tools. To harmonize trait data across databases, future trait standard initiatives should provide this functionality.


# Publish trait thesauri or ontologies

The biggest challenge in building semantic web resources for scientific communities is building a consensus vocabulary that is sufficiently broad while being specific enough for all most use cases. Thesauri and ontologies may be published to cover the specific project context only (e.g. ecosystem, region or experiment), or to be of general use and application for an entire taxonomic group or methodology.
The biggest challenge in building semantic web resources for scientific communities is building a consensus vocabulary that is sufficiently broad while being specific enough for most use cases. Thesauri and ontologies may be published to cover the specific project context only (e.g. ecosystem, region or experiment), or to be of general use and application for an entire taxonomic group or methodology.
The claim of generality comes with higher demands on consensus building.

That said, different models of publication of trait thesauri may apply depending on the claim of generality.

## publish as supplementary information
## Publish as supplementary information

If the thesaurus is only meant to translate the specific project dataset, the definition of traits may be published as metadata along with the trait dataset. This can be done by adding it to the Archive uploaded to a low-threshold fileserver. We highly recommend the use of Darwin Core Archives for this use case.

## Publication on own website

If the thesaurus is of broader use, e.g. for a specific organism group, a region or ecosystem, or a broader project context, thesauri are rather defined by pragmatic choice than a claim of completeness. Consensus is achieved within a limited community of researchers. In this case, multiple indidual datasets refer to the same thesaurus. Therefore, it should be published as a simple, static web resource.
If the thesaurus is of broader use, e.g. for a specific organism group, a region or ecosystem, or a broader project context, thesauri are rather defined by pragmatic choice than a claim of completeness. Consensus is achieved within a limited community of researchers. In this case, multiple individual datasets refer to the same thesaurus. Therefore, it should be published as a simple, static web resource.

A simple way to publish a list of trait definitions for a project may be as a public repository on development platforms like GitHub.

@@ -71,4 +71,4 @@ A simple way to publish a list of trait definitions for a project may be as a pu
Online ontologies hosted with accredited ontology servers have the advantage of providing a persistent and direct link of the term on the internet (a *Uniform Resource Identifier*, URI).
Terminology portals or registries, such as the GFBio Terminology Service, the OBO Foundry, or Ontobee, may provide a central host for trait ontologies.

# References
# References