This is an R package with utility functions to assist in harmonizing digital soil repositories. The purpose of the Soil Organic Carbon Data Rescue and Harmonization Repository (SOC-DRaHR) is to collect harmonization scripts for data sets related to soil carbon to enable scientific research. This project identifies soil carbon datasets that are publicly available, provides data harmonization scripts to integrate those data sets, and provides output scripts for a harmonized data product. Data can come from field surveys, field manipulation studies, and laboratory experiments.
This project is not a data repository or archive but instead a code repository. End users are responsible for complying with ALL data use policies of the orginal data providers, please check with the orginal archives and repositories to ensure you are complying with use policies.
Here are some important resources to get you started:
- International Soil Carbon Network is our parent organization.
- Our code of conduct is based on the The Contributor Covenant.
- How to contribute
- Our mailing list
To quickly access ISCN3 data:
library(devtools)
install_github("ISCN/SOCDRaHR2")
library(SOCDRaH2)
ISCN3 <- ISCN3(dataDir='YourDataDir')
- Data is downloaded from an archived url link.
- Meta data is encoded as a set of data table, and is ideally generated from the associated metadata on the archiving repository.
- Data columns that share common
variables
are directly comparable with trivial unit conversions.
Data should be archived on a TRUST-ed repository (Lin et al, 2020) with an associated download url.
SOC-DRaR2 scripts will then fetch the data from the identified url and may transform the data depending on the original structure.
In general, R/read*.R
scripts return a set of relational data tables that is wide or un-normalized where each row is a unique sample and each column an associated measurement or information about a measurement.
The distinction between data and metadata is blurry and often varies between each contributing data set.
While we acknowledge that meta-data is often encoded as some XML tree structure, for this project we have opted to go with a data tables. The metadata has the following structure:
header | description |
---|---|
data_id |
the name of the data product |
table_id |
the name of the data table |
column_id |
the name of the column header OR NA if the information is in the entry column |
variable |
a valid ISCN measurement name that the column is associated with (see more on variables below) |
data_type |
what kind of is is either in the column OR encoded in the data_entry |
entry |
NA if this describes the values of a column OR the value of the data_type |
For example, in ISC3:
data_id | table_id | column_id | variable | data_type | entry |
---|---|---|---|---|---|
ISCN3 | NA |
NA |
NA |
download_url | ftp:[...]xlsx |
ISCN3 | layer | 13c (‰) | 13c | value_number | NA |
ISCN3 | layer | NA |
13c | unit | permille |
Note that this format is easy to derive from a wide table format
data_id | table_id | column_id | variable | data_type | value_string | unit |
---|---|---|---|---|---|---|
ISCN3 | NA |
NA |
download_url | NA |
ftp:[...]xlsx |
NA |
ISCN3 | layer | 13c (‰) | 13c | value_number | NA |
permille |
Valid data_type
entries include: value_number
, value_string
, unit
, sigma
, measurement_method
, observation_note
, and id
.
The proposed vocabulary list from the ISCN 2016 data template is in data/vocabularyList.csv
.
ISCN 2016 has 295 terms grouped across 9 tables.
This has been reduced to 123 terms across 5 tables for ISCN3 and many of the 2016 template terms were never used.
In addition 58 new terms were added by ISCN3, mainly from the NRCS-Kellog Soil Science Lab.
This is currently being document and cleaned up.
An example of the head of this vocabulary is:
data_table | data_column | variable | unit |
---|---|---|---|
site | site_name | Site Name | NA |
site | site_note | Site Notes | NA |
site | country | Country | NA |
site | province | Province | NA |
site | state | State | (us_states) |
site | county | County | NA |
site | lat | Latitude | dec. deg |
site | long | Longitude | dec. deg |
Data columns that are comparable with trivial conversions (generally unit conversion) share a variable
.
In some cases this requires unit conversions (eg grams to kilograms) but conversion factors should be coded as separate variables (eg loss on ignition times a conversion factor equals organic carbon fraction).
A current draft of the variable table is here data/vocabularyList.csv
, it is under active development.
Please see the CONTRIBUTING document for more details on how to contribute, including how to identify datasets, contribute to the code, and everything else that is needed to run an open source community project. Contributors will be offered co-authorship for repository DOI but can not be guaranteed co-authorship on manuscripts, proposals or studies that utilize this repository. This repository is licensed here.
Repository versions will be assigned DOIs as needed. Any manuscripts which use these data harmonization scripts are asked to cite the appropreate repository version but are not required to list contributors as co-authors. The aggregation scripts here are licensed under BSD 2-clause. See LICENSE for details.
Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020). https://doi.org/10.1038/s41597-020-0486-7
International Soil Radiocarbon Database (ISRaD) a data curation project around soil radiocarbon and fractionation measurements.
Environmental Data Initiative has a GitRepo that has a similar approach to aggregating survey data.
Earth Sciences Information Partners
DataOne is a go-to resource for finding environmental data in general and searches many different established repositories. They have an R package for interacting with their search engine here
Five ways consortia can catalyse open science, Nature, 2017