Glossary

Most of this glossary is identical to the glossary of the general DNA publishing guide. Entries unique to this version are marked with "(§)".

Atlas of Living Australia (ALA)

The ALA is a web-based platform that pulls together Australian biodiversity data from multiple sources, making it accessible and reusable to anyone (see https://www.ala.org.au/about-ala/). The open infrastructure platform developed by the ALA is also used by several other countries for their own national biodiversity data platform (see https://living-atlases.gbif.org/).

Amplicon Sequence Variant (ASV)

Unique DNA sequence derived from high-throughput sequencing and denoising, and assumed to represent a biologically real sequence variant. See also Operational Taxonomic Unit (OTU) and (Callahan et al. 2017).

Application Programming Interface (API)

Set of protocols and tools for interaction and data transmission between different computer applications.

Barcode Index Numbers (BINs)

Species-level Operational Taxonomic Units (OTUs) derived from clustering of the cytochrome c oxidase I (COI) gene in animals. Each BIN is assigned a globally unique identifier, and is made available in a searchable database within the Barcode of Life Data System (BOLD).

Barcode of Life Data System (BOLD)

BOLD is the reference database maintained by the Centre for Biodiversity Genomics in Guelph on behalf of the International Barcode of Life Consortium (IBOL). It hosts data on barcode reference specimens and sequences for eukaryote species, particularly COI for animals, and maintains the Barcode Index Number (BIN; Ratnasingham & Hebert 2013) system, which provides identifiers for OTUs of approximately species rank, based on clusters of closely similar sequences.

Biodiversity data platform

General online resource to discover and access biodiversity data derived from various sources, such as natural history collections, citizen science, ecology and monitoring projects, and genetic sequences. Can be global (GBIF) or national (ALA).

Clustering

In taxonomic classification, the process of grouping organisms together according to some similarity criterion. See Operational Taxonomic Unit.
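
As a rough illustration of a similarity criterion, the sketch below groups sequences greedily around centroids. The `identity` measure (fraction of matching positions in equal-length sequences), the 0.8 threshold and the example sequences are assumptions chosen for brevity, not the method of any particular clustering tool.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else start a new cluster."""
    clusters = []  # list of (centroid, members) tuples
    for seq in sequences:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            # No existing centroid was similar enough: seq becomes a new centroid.
            clusters.append((seq, [seq]))
    return clusters


# Hypothetical example sequences; the second differs from the first at one position.
clusters = greedy_cluster(["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"], threshold=0.8)
for centroid, members in clusters:
    print(centroid, len(members))
```

The result depends on input order, since the first sequence seen in each group becomes its centroid.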

Community (bulk) DNA

DNA from bulk samples (e.g. plankton samples or Malaise trap samples consisting of several individuals from many species). For the purpose of this guide, bulk sample DNA is included in the eDNA concept.

Darwin Core Archive (DwC-A)

Compressed (ZIP) file format for exchange of biodiversity data compiled in accordance with the Darwin Core (DwC) standard. Essentially a self-contained set of interconnected CSV files and an XML document describing included files and data columns, and their mutual relationships.
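
The structure described above can be sketched with the Python standard library: the example builds a tiny archive in memory and reads the core file back using the XML descriptor. The file names `meta.xml` and `occurrence.txt`, the sample record and the heavily simplified descriptor are assumptions for illustration; a real descriptor carries more attributes and may list extension files.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Simplified, hypothetical descriptor: one core file with three columns.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
  </core>
</archive>"""

OCCURRENCE_CSV = (
    "occurrenceID,scientificName,decimalLatitude\n"
    "occ-1,Salmo trutta,59.33\n"
)


def build_archive() -> bytes:
    """Write the descriptor and the data file into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("meta.xml", META_XML)
        archive.writestr("occurrence.txt", OCCURRENCE_CSV)
    return buffer.getvalue()


def read_core(archive_bytes: bytes) -> list:
    """Locate the core file via the descriptor and return its rows as dicts."""
    ns = {"dwca": "http://rs.tdwg.org/dwc/text/"}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
        location = root.find("dwca:core/dwca:files/dwca:location", ns).text
        with archive.open(location) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


records = read_core(build_archive())
print(records[0]["scientificName"])
```

For brevity the sketch takes column names from the CSV header row rather than from the `field` elements, which is what a full reader would use.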

Darwin Core term

A standardized field name (e.g. decimalLatitude is the official DwC term for geographical latitude). (§)

Darwin Core (DwC) standard

Standard for sharing and publishing biodiversity data, originating from the Biodiversity Information Standards (TDWG) community. In principle, a set of terms used for describing different entities of biodiversity observations, such as sampling events, occurrences and taxa. Current Darwin Core terms are described in the Quick Reference Guide.

Data vocabulary

Set of preferred terms or concepts with specific, well-defined meanings and interrelationships, facilitating data exchange and reuse.

ddPCR (droplet digital Polymerase Chain Reaction)

Droplet digital PCR. Method for measuring the absolute amount of DNA (number of copies) of one marker in a sample. See also qPCR.

Denoising

In metabarcoding, method for separation of true biological sequences (see ASVs) from spurious sequence variants caused by PCR amplification and sequencing error.

Digital Object Identifier (DOI)

Long-lasting reference used to uniquely identify (and locate) digital information objects, such as a biodiversity data set or a scientific publication.

DNA barcoding and metabarcoding (amplicon sequencing)

Use of short, standardized DNA fragments to identify individual organisms via sequencing. Metabarcoding combines barcoding with high-throughput DNA sequencing, using universal primers to amplify and sequence large groups of organisms in eDNA samples.

DNA marker

A DNA fragment used as a marker of some property (e.g., taxonomic affiliation). May, but does not have to, be a gene or a part of a gene.

DNA metabarcoding database

Database containing DNA sequences (DNA barcodes) from previously recovered or studied organisms. The reference sequences should ideally come from individuals of described, well-studied species (the type specimen being the ideal source) or from individuals identified to a higher taxonomic level (e.g., genus, family), but they may also stem from eDNA sequencing efforts. It is wise not to trust “reference sequences” blindly.

DNA-derived data

An extension to the Occurrence core that captures information relating to DNA (e.g. primers, the sequence, the sequencing platform). This extension is based on the MIxS standard used by the "GenBanks". (§)

DNA probe

A short, synthetic single-stranded DNA fragment with fluorescent labelling that binds to a selected region of target DNA (marker) during PCR. Increases specificity and can be used in addition to primers in qPCR and ddPCR to detect and quantify a genetic marker.

European Bioinformatics Institute (EMBL-EBI)

Intergovernmental organization for bioinformatics research and services, part of the European Molecular Biology Laboratory (EMBL), providing e.g. (raw) sequence reads and assembly data via the European Nucleotide Archive (ENA).

Environmental DNA (eDNA)

DNA from an environmental sample, e.g. soil, water, air or host organism. An often used definition is that environmental DNA is the genetic material (DNA) obtained from environmental samples without any obvious evidence of biological source material (Thomsen and Willerslev 2015).

European Nucleotide Archive (ENA)

European repository for nucleotide sequences, covering raw sequencing data, sequence assembly information and functional annotation. Includes the Sequence Read Archive (SRA), and is maintained by the European Bioinformatics Institute (EMBL-EBI), as part of the International Nucleotide Sequence Database Collaboration (INSDC).

Endpoint

In the context of GBIF, an "endpoint" refers to a URL or web address where a DwC-A can be accessed through the internet, and indexed by GBIF. (§)

FASTQ

Text-based standard for storing molecular sequences and associated quality measures deriving from high-throughput sequencing (HTS). For each sequence position, single ASCII characters are used to represent the base call (identified nucleotide) and the quality score, respectively.
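
To make the per-position encoding concrete, the following minimal sketch reads one four-line FASTQ record and converts each quality character to a numeric score. It assumes the common Phred+33 encoding (score = ASCII code minus 33) and unwrapped records of exactly four lines; the read name and sequence are made up.

```python
# Hypothetical single-record FASTQ content: header, bases, separator, qualities.
FASTQ_TEXT = """@read1
ACGT
+
II5!
"""


def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert quality characters to integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def parse_fastq(text: str):
    """Yield (read_id, sequence, scores) for each four-line record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, phred_scores(quality)


fastq_records = list(parse_fastq(FASTQ_TEXT))
for read_id, sequence, scores in fastq_records:
    print(read_id, sequence, scores)
```

Under this encoding `I` corresponds to a score of 40, `5` to 20 and `!` to 0, so the last base call in the example is the least reliable.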

Global Biodiversity Information Facility (GBIF)

International network and research infrastructure, mainly focused on mobilizing and providing open access to global biodiversity data.

Global Genome Biodiversity Network (GGBN)

International network of institutions concerned with efficient sharing and usage of genomic biodiversity samples and associated metadata, e.g. promoting the Darwin Core-compatible GGBN Data Standard.

Global Positioning System (GPS)

Satellite navigation system operated by the United States Space Force.

High-throughput sequencing (HTS)

Different technologies for massively parallel sequencing, producing millions of DNA sequence reads from library preparations of genetic material, rather than targeting single amplicons as in traditional Sanger sequencing. Also called Next Generation Sequencing (NGS).

Ingestion

Process of importing data from heterogeneous sources, such as local databases, text files or spreadsheets, to a common destination system, such as an online biodiversity data platform, for storage and further analysis. Typically includes steps of extraction, transformation (cleaning) and loading (ETL).

Indexing

Organization of information in accordance with a specific schema or structure, making data easier to access and present.

International Nucleotide Sequence Database Collaboration (INSDC)

Joint effort of the DNA Data Bank of Japan (DDBJ), EMBL and NCBI to provide global public access to nucleotide sequence data and associated information.

Integrated Publishing Toolkit (IPT)

The Integrated Publishing Toolkit (commonly referred to as the IPT) is free open-source software developed by GBIF and used by organizations around the world to create and manage repositories for sharing biodiversity datasets.

Metagenomics

PCR-free sequencing of random genomic fragments in a mixed sample.

Minimum Information about any (x) Sequence (MIxS) standard

Family of standards (checklists) for sequence metadata, developed by the Genomic Standards Consortium (GSC).

molecular Operational Taxonomic Unit (mOTU)

See Operational Taxonomic Unit (OTU).

National Center for Biotechnology Information (NCBI)

Division of the United States National Library of Medicine (NLM) housing important bioinformatics resources, such as the GenBank database of DNA sequences, and the Sequence Read Archive (SRA) of high-throughput sequencing data.

Next Generation Sequencing (NGS)

See High-throughput sequencing (HTS).

Occurrence

An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.

Occurrence core

The part of DwC that includes all the central information (fields) on biological occurrences in GBIF (e.g. spatiotemporal data, taxonomy, etc.), including occurrences derived from eDNA data. (§)

Operational Taxonomic Unit (OTU)

Cluster of organisms based on similarity in specific DNA marker sequence(s), used for taxonomic classification. Includes, for example, Species Hypotheses in UNITE, and Barcode Index Numbers in the Barcode of Life Data System (BOLD). Amplicon Sequence Variants (ASVs) may be considered analogous to zero-radius OTUs (zOTUs).

OTU table

Spreadsheet that holds the number of sequencing reads detected for each OTU/sequence in each sample.
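
A minimal sketch of how such a table can be assembled from per-read assignments is shown below. The sample names, OTU identifiers and counts are invented, and the layout (OTUs as rows, samples as columns) is one common convention rather than a requirement.

```python
from collections import Counter

# Hypothetical (sample, OTU) pairs, one per sequencing read.
read_assignments = [
    ("sample_A", "OTU_1"), ("sample_A", "OTU_1"), ("sample_A", "OTU_2"),
    ("sample_B", "OTU_2"), ("sample_B", "OTU_3"), ("sample_B", "OTU_3"),
    ("sample_B", "OTU_3"),
]


def build_otu_table(assignments):
    """Count reads per (sample, OTU) and arrange them as OTU rows by sample columns."""
    counts = Counter(assignments)
    samples = sorted({sample for sample, _ in assignments})
    otus = sorted({otu for _, otu in assignments})
    # A Counter returns 0 for combinations that were never observed.
    return {otu: {sample: counts[(sample, otu)] for sample in samples} for otu in otus}


otu_table = build_otu_table(read_assignments)

# Print as tab-separated text, the form in which OTU tables are often exchanged.
sample_names = sorted(next(iter(otu_table.values())))
print("\t".join(["OTU"] + sample_names))
for otu, row in otu_table.items():
    print("\t".join([otu] + [str(row[name]) for name in sample_names]))
```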

Polymerase Chain Reaction (PCR)

Technique for fast amplification and detection of specific fragments of target DNA (or RNA) sequences. Amplified regions are determined by the pair of PCR primers used in the reaction.

Pipeline

In bioinformatics, a set of algorithms or tools applied in a predefined workflow to process e.g. high-throughput sequencing (HTS) data.

Primers (PCR primers)

Short, synthetic, single-stranded DNA fragments that bind to a selected region of target DNA (marker) to initiate replication during PCR. A pair of primers is necessary for the polymerase enzyme to amplify the selected marker.

qPCR (quantitative Polymerase Chain Reaction)

Quantitative PCR. Method that measures relative DNA quantity of a marker in a sample. See also ddPCR.

Sample

Material (water, soil, gut content, etc.) obtained for analysis.

Sequence alignment

Bioinformatic process of comparing and arranging two or more molecular (DNA, RNA or protein) sequences to detect similarities caused by e.g. evolutionary relatedness.
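
As an illustration of arranging two sequences, here is a compact sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (+1 match, -1 mismatch, -1 gap) and the example sequences are assumptions; production tools use more elaborate scoring schemes and heuristics.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned against gaps only

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


top, bottom, total = global_align("ACGT", "AGT")
print(top)
print(bottom)
print(total)
```

With these settings the missing `C` in the second sequence is represented by a gap character, so the two output rows have equal length and can be compared position by position.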

Species Hypothesis (SH)

Species-level Operational Taxonomic Unit (OTU) as defined in the UNITE database and sequence management environment, for Fungi.

Specimen

An individual animal, plant, fungus, etc. used as an example of its species or type for scientific study or display.

Sequence Read Archive (SRA)

Public repository of high-throughput (NGS) sequencing data, with instances operated by the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). Includes both raw (non-denoised) sequencing output and sequence alignments. One of three components of the European Nucleotide Archive (ENA), and previously known as the Short Read Archive.

Target-capture sequencing

Sequencing of DNA fragments isolated with hybridization probes.

UNITE

UNITE is a web-based sequence management environment centred on the eukaryotic nuclear ribosomal ITS region. All public sequences are clustered into species hypotheses (SHs), which are assigned unique DOIs. An SH-matching service outputs various elements of information, including what species are present in eDNA samples, whether these species are potentially undescribed new species, other studies in which they were recovered, whether the species are alien to a region, and whether they are threatened. The DOIs are connected to the taxonomic backbone of the PlutoF platform and GBIF, such that they are accompanied by a taxon name where available. The data used in UNITE are hosted and managed in PlutoF. Data are represented through a range of standards, primarily Darwin Core, MIxS, and DMP Common Standard; partial support is available for EML, MCL, and GGBN. PlutoF exports data primarily through the CSV and FASTA formats. PlutoF can also be used to publish data in GBIF (using the DwC format) and to prepare GenBank submission files. It is furthermore possible to download species lists from your data and to download your project as a JSON document in which the project data are hierarchically structured.

Zero-radius OTU (zOTU)

See ASV.