Managing schema.yaml
A LinkML schema.yaml file can be maintained directly in a text editor or other editing tool. Alternatively, schema.yaml can be generated from specifications held in team-editable spreadsheets such as Google Sheets, which makes it easier to maintain:
- LinkML offers a Schemasheets spreadsheet method.
- DataHarmonizer has a similar system pitched at users without any LinkML experience, but which still requires a bit of programmer-level setup. Three files are required which can be kept in a template folder alongside schema.yaml etc.
- schema_core.yaml for specifying the necessary parts of a LinkML schema as a whole
- schema_slots.tsv, a tab delimited text file for specifying templates and their fields (LinkML classes and their slots).
- schema_enums.tsv, a tab delimited text file for specifying categorical pick lists that a field might require.
To generate or refresh schema.yaml from the above, run this in the template's directory:
> python3 ../../script/tabular_to_schema.py
For the .tsv and .yaml files, UTF-8 characters are generally acceptable, but it helps to normalize any typographic quotes or dashes in column headers or field text to the basic hyphen (-) and straight quote (") characters.
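The normalization described above can be done before saving the files. The following is a minimal sketch, not part of DataHarmonizer; the function name `normalize_punctuation` and the particular set of characters replaced are assumptions chosen for illustration.

```python
# Map common typographic characters to their basic ASCII equivalents.
# The set below is an assumption; extend it to cover whatever your
# spreadsheet tool produces.
REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}
TABLE = str.maketrans(REPLACEMENTS)


def normalize_punctuation(text):
    """Replace curly quotes and long dashes with plain ASCII characters."""
    return text.translate(TABLE)


# Curly quotes and an en dash become straight quotes and a hyphen.
print(normalize_punctuation("\u201cCovid\u201319\u201d"))  # "Covid-19"
```

All other UTF-8 characters pass through unchanged, so accented letters and similar text are preserved.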
As mentioned in the template intro, the schema.yaml contains a list of all possible templates for a folder, and supporting information. For those who are building schema.yaml from the 3 files above, the schema_core.yaml file populates schema.yaml's top level entities. Generally schema.yaml (and schema_core.yaml) contain:
- A generic resolvable URI for the schema.
- A name and description for the schema.
- An imports section that, as shown below, includes the LinkML built-in data types such as decimal and date.
- A list of prefixes that may occur in ontology or other term IRI references.
- Objects containing dictionaries of classes (templates), slots (fields), enumerations (pick lists), types (datatypes), and settings (search & replace key / values).
From the example schema_core.yaml below, the "CanCOGeN Covid-19" schema will be built, with one "CanCOGeN Covid-19" class (template) which DataHarmonizer will show in its menu system.
    id: https://example.com/CanCOGeN_Covid-19
    name: CanCOGeN_Covid-19
    description: ""
    imports:
      - "linkml:types"
    prefixes:
      linkml: "https://w3id.org/linkml/"
      GENEPIO: "http://purl.obolibrary.org/obo/GENEPIO_"
    classes:
      dh_interface:
        name: dh_interface
        description: "A DataHarmonizer interface"
        from_schema: https://example.com/CanCOGeN_Covid-19
      "CanCOGeN Covid-19":
        name: "CanCOGeN Covid-19"
        description: Canadian specification for Covid-19 clinical virus biosample data gathering
        is_a: dh_interface
    slots: {}
    enums: {}
    types:
      WhitespaceMinimizedString:
        name: "WhitespaceMinimizedString"
        typeof: string
        description: "A string that has all whitespace trimmed off of beginning and end, and all internal whitespace segments reduced to single spaces. Whitespace includes #x9 (tab), #xA (linefeed), and #xD (carriage return)."
        base: str
        uri: xsd:token
      Provenance:
        name: "Provenance"
        typeof: string
        description: "A field containing a DataHarmonizer versioning marker. It is issued by DataHarmonizer when validation is applied to a given row of data."
        base: str
        uri: xsd:token
    settings:
      Title_Case: "(((?<=\\b)[^a-z\\W]\\w*?|[\\W])+)"
      UPPER_CASE: "[A-Z\\W\\d_]*"
      lower_case: "[a-z\\W\\d_]*"
As described below, the "slots: {}" and "enums: {}" dictionaries are filled in by the tabular_to_schema.py script, which processes the content of schema_slots.tsv and schema_enums.tsv. These files can be managed as tabs in a Google spreadsheet. For example, in viral pathogen data collection standards, the CanCOGeN-slots and CanCOGeN-enums tabs are copied in their entirety into the /template/canada_covid19/ folder's tab-delimited schema_slots.tsv and schema_enums.tsv files, which are then processed along with schema_core.yaml to create schema.yaml.
IMPORTANT: DataHarmonizer will add each schema class as a template to its menu system if it finds that the class has an "is_a" relationship to the special "dh_interface" class.
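To illustrate the rule above, here is a minimal sketch of how template classes could be picked out of a loaded schema. It is not DataHarmonizer's own code: the function name `list_templates` is hypothetical, the schema is assumed to be already parsed into a Python dictionary, and only a direct "is_a: dh_interface" entry is checked.

```python
def list_templates(schema):
    """Return the names of classes that declare is_a: dh_interface."""
    classes = schema.get("classes") or {}
    return [
        name
        for name, spec in classes.items()
        # Assumption: only a direct is_a link to dh_interface is considered.
        if (spec or {}).get("is_a") == "dh_interface"
    ]


# A cut-down version of the classes section from the schema_core.yaml example.
schema = {
    "classes": {
        "dh_interface": {"name": "dh_interface"},
        "CanCOGeN Covid-19": {"name": "CanCOGeN Covid-19", "is_a": "dh_interface"},
    }
}

print(list_templates(schema))  # ['CanCOGeN Covid-19']
```

The dh_interface class itself is not listed, because it has no "is_a" entry.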
A slot specification lists the slot name, description, range of possible values, mappings, required or recommended status, etc. If the slot offers a menu of choices, that menu is contained in the enums dictionary, specified by schema_enums.tsv.
The first row of the schema_slots.tsv and schema_enums.tsv files contains LinkML slot names or DataHarmonizer-friendly variants, as described below. Ensure that the field content of these files has no extra carriage returns or line feeds (or spaces instead of tabs), as these will likely cause errors when the files are read and compiled into schema.yaml. To detect an erroneous line feed, view the .tsv file in a text editor with the "word wrap" feature turned off: if text from one row continues onto the next row, a carriage return was copied over from the content of a spreadsheet cell.
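The same check can be automated. The sketch below is an illustration rather than part of tabular_to_schema.py; the function name `find_ragged_rows` is hypothetical. It assumes that every row should have as many tab-separated fields as the header row, so a row split by a stray line feed, or one that uses spaces instead of tabs, shows up with a different field count.

```python
def find_ragged_rows(tsv_text):
    """Return (line_number, field_count) for rows that do not match the header."""
    rows = tsv_text.splitlines()
    if not rows:
        return []
    expected = rows[0].count("\t") + 1
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if not row.strip():
            continue  # ignore blank lines
        count = row.count("\t") + 1
        if count != expected:
            problems.append((number, count))
    return problems


sample = (
    "title\tname\trange\n"
    "sample ID\tsample_id\tstring\n"
    "host age\thost_age line one\n"          # cell text broken by a line feed
    "line two\tdecimal\n"                    # remainder of the broken row
    "collection date    collection_date    date\n"  # spaces instead of tabs
)

print(find_ragged_rows(sample))  # [(3, 2), (4, 2), (5, 1)]
```

To check a real file, read it with `open(path, encoding="utf-8", newline="").read()` and pass the text to the function.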
property | description |
---|---|
class_name | a semi-colon delimited list of classes (templates) that the current row applies to. (It will be reused row-by-row until it is set by a different value on a subsequent row). |
slot_group | A user-friendly section label that this slot will be listed under in the two row header DataHarmonizer user interface. |
slot_uri | an ontology id or URI that provides a unique semantic web identifier for this slot. (Was Ontology ID in DH <= v0.15.5) |
title | A user-friendly label that gets displayed in the second row of the spreadsheet column header. |
name | optional but may supply the database field name, if different from the title. (In the template code title will be copied into empty name entries.) |
range | A data type that a slot value validates to, which can be a date, decimal, a picklist menu name of categorical choices, etc. |
range_2 | An additional data type that a slot value validates to, which can include a semi-colon delimited list of other picklist menus. This and the range field are converted into the "any_of" structure of range specifications. This enables a menu of metadata values like "Missing", "Not Collected" etc. |
identifier | If true, this field's value must be unique within the column (of tabular data). |
multivalued | If true, more than one value is allowed for this field. Multiple values are usually delimited by semi-colons. |
required | If true, this field requires a data value. |
recommended | If true, this field is suggested for data entry but is not required. |
minimum_value | The minimum value that a decimal or integer value can take. Can also hold a date to test against, including the special value "{today}". See todos reference above. |
maximum_value | The maximum value that a decimal or integer value can take. Can also hold a date to test against, including the special value "{today}". See todos reference above. |
pattern | A regular expression to validate a field's textual content by. Include the ^ and $ start and end of line qualifiers for a full string match. Example simple email validation: ^\S+@\S+\.\S+$ |
structured_pattern | A LinkML system for specifying strings containing regular expression pattern names which are compiled into a pattern. Takes advantage of search and replace operations of names stored in a schema's settings dictionary. |
description | Helpful description of what field is about. Available in column help info. |
comments | Data entry guidance for a field. Available in column help info. |
examples | An array of values which are displayed as a bulleted list in column help info. |
EXPORT_... | A list of 0 or more export target columns that provide instructions for mapping to external database fields. Each becomes an export template option. |
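Two of the properties in the table above involve a small transformation: range and range_2 are merged into an "any_of" structure, and structured_pattern names are replaced by entries from the schema's settings dictionary. The following sketch shows one way each could work. Both function names are hypothetical, "NullValueMenu" is an illustrative picklist name, and the curly-brace placeholder form (such as "{UPPER_CASE}") follows the LinkML structured_pattern convention; the actual script may handle these differently.

```python
import re


def build_range(range_value, range_2=""):
    """Combine the range and range_2 columns into a slot range specification."""
    names = [range_value.strip()] if range_value.strip() else []
    names += [part.strip() for part in range_2.split(";") if part.strip()]
    if not names:
        return {}
    if len(names) == 1:
        return {"range": names[0]}
    # Several acceptable ranges are expressed as a list of alternatives.
    return {"any_of": [{"range": name} for name in names]}


def expand_structured_pattern(syntax, settings):
    """Replace each {name} in syntax with settings[name]; leave unknown names as-is."""
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: settings.get(match.group(1), match.group(0)),
        syntax,
    )


print(build_range("decimal", "NullValueMenu"))
# {'any_of': [{'range': 'decimal'}, {'range': 'NullValueMenu'}]}

settings = {"UPPER_CASE": r"[A-Z\W\d_]*"}
pattern = expand_structured_pattern("^{UPPER_CASE}$", settings)
print(pattern)  # ^[A-Z\W\d_]*$
print(bool(re.fullmatch(pattern, "ABC 123")))  # True
print(bool(re.fullmatch(pattern, "abc")))      # False
```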
The class_name field's ability to list several classes allows:
- Two classes in the same row to have an identical slot specification, which will be provided in the schema's "slots" dictionary.
- A slot (field) and its specification can be listed for one class on one row, and on the next row mention the same slot for a different class. The tabular_to_schema.py script will tease out what is common in the two slots and will store that in the generic slot definition, and meanwhile place the properties unique to each class in the class's respective slot_usage entry for that slot.
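The second case can be pictured with a short sketch. This is not the tabular_to_schema.py implementation; the function name `split_slot_properties` and the class names are hypothetical, and the rule used here (a property is common only if every class gives it the same value) is an assumption.

```python
def split_slot_properties(per_class):
    """Split per-class specs of one slot into shared and class-specific parts.

    per_class maps a class name to that class's properties for the slot.
    Returns (common, slot_usage).
    """
    if not per_class:
        return {}, {}
    specs = list(per_class.values())
    # A property is common when all classes define it with an equal value.
    common = {
        key: value
        for key, value in specs[0].items()
        if all(key in spec and spec[key] == value for spec in specs[1:])
    }
    # Whatever is left over belongs in each class's slot_usage entry.
    slot_usage = {
        class_name: {k: v for k, v in spec.items() if k not in common}
        for class_name, spec in per_class.items()
    }
    return common, slot_usage


rows = {
    "Template A": {"title": "sample ID", "range": "string", "required": True},
    "Template B": {"title": "sample ID", "range": "string", "required": False},
}
common, slot_usage = split_slot_properties(rows)
print(common)      # {'title': 'sample ID', 'range': 'string'}
print(slot_usage)  # {'Template A': {'required': True}, 'Template B': {'required': False}}
```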
Enumerations are flat or hierarchic lists of categorical (nominal) choices a slot can have in its range (value).
property | description |
---|---|
title | Title of enumeration (menu) of categorical choices. |
meaning | A curie or IRI of an ontology term that clarifies this item's semantic meaning. |
menu_1 | An enumeration item's label - this is at the top-level of a hierarchic menu |
menu_2 | A 2nd tier menu item label |
menu_3 | A 3rd tier menu item label |
menu_4 | A 4th tier menu item label |
menu_5 | A 5th tier menu item label |
description | A description of a menu choice. |
EXPORT_... | A possible export database field and value that this choice can be transformed to. These get transformed into an enumeration item's exact_mappings. |
In the future, DataHarmonizer will work with LinkML to extend enumeration functionality so that menus can be compiled by dynamically fetching branches of ontologies during schema generation.
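As an illustration of how the menu_1 to menu_5 columns can express a hierarchy, the sketch below turns the rows of a single enumeration into a dictionary of choices, linking each item to the most recent item one tier above it. The function name `build_permissible_values`, the use of an "is_a" key to record the parent, and the "EX:0000001" identifier are all assumptions made for this example.

```python
def build_permissible_values(rows):
    """Convert menu rows (dicts keyed by column name) into nested choices."""
    values = {}
    ancestors = []  # ancestors[d] holds the latest label seen at tier d + 1
    for row in rows:
        for depth in range(5):
            label = (row.get(f"menu_{depth + 1}") or "").strip()
            if label:
                break
        else:
            continue  # no menu label on this row
        # If a tier was skipped, attach to the deepest ancestor available.
        depth = min(depth, len(ancestors))
        entry = {"text": label}
        if row.get("meaning"):
            entry["meaning"] = row["meaning"]
        if depth > 0:
            entry["is_a"] = ancestors[depth - 1]
        del ancestors[depth:]
        ancestors.append(label)
        values[label] = entry
    return values


rows = [
    {"menu_1": "Animal"},
    {"menu_2": "Bird"},
    {"menu_3": "Chicken"},
    {"menu_2": "Mammal"},
    {"menu_1": "Environment", "meaning": "EX:0000001"},
]
for label, entry in build_permissible_values(rows).items():
    print(label, entry.get("is_a"))
```

In this example "Chicken" is linked to "Bird", while "Bird" and "Mammal" are both linked to "Animal".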
If EXPORT_... columns are used, DataHarmonizer's "Export To ..." menu will list one or more data formats that a template's current spreadsheet data can be exported to.
If an EXPORT_XYZ column appears in the header row of the schema_slots.tsv file, the XYZ part becomes an export format option under the DataHarmonizer "Export To ..." menu, and the values in this column are added to the respective slot's exact_mappings dictionary to indicate which target columns the source template column values end up in.
If an EXPORT_XYZ column appears in the header row of the schema_enums.tsv file, then an enumeration item choice will get an exact_mappings entry prefixed by "XYZ:" pointing to the target database and field.
Two additional features enable many common transform tasks:
- A semicolon ";" symbol existing in an EXPORT_ field value will cause the source template's field value to be channeled to the multiple export fields separated by the semicolon. If an export field target is mentioned on multiple field specification rows of the template, then the values of the source fields, if any, will be concatenated into the export target field, in order (with delimiters as placeholders for any empty component values).
- In addition, if a targeted export field is specified as a key:value pair, i.e. in "[export field name]:[string]" format, then the input field value will be transformed to the given export string value. This allows conversion of values from source to export, and is pertinent to selection list choices that vary across systems but are semantically equivalent.
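The two features above can be sketched as follows. This is a simplified illustration, not DataHarmonizer's export code: the function names, the field names in the example, and the ";" used to join concatenated values are assumptions, and the key:value replacement is applied here to any non-empty source value.

```python
def parse_export_targets(cell):
    """Split an EXPORT_ cell into (target_field, replacement_or_None) pairs."""
    targets = []
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue
        field, separator, replacement = part.partition(":")
        targets.append((field.strip(), replacement.strip() if separator else None))
    return targets


def export_row(source_row, export_spec, delimiter=";"):
    """Build one export record.

    export_spec maps each source field name to its EXPORT_ cell text.
    """
    collected = {}
    for source_field, cell in export_spec.items():
        value = source_row.get(source_field, "")
        for target, replacement in parse_export_targets(cell):
            # A key:value target swaps a non-empty value for the fixed string.
            output = replacement if (replacement is not None and value) else value
            collected.setdefault(target, []).append(output)
    # Empty components still contribute a delimiter as a placeholder.
    return {target: delimiter.join(parts) for target, parts in collected.items()}


export_spec = {
    "collector": "submitter;contact",
    "lab": "submitter",
    "host": "host_name:Homo sapiens",
}
row = {"collector": "Dr. A", "lab": "Lab B", "host": "Human"}
print(export_row(row, export_spec))
# {'submitter': 'Dr. A;Lab B', 'contact': 'Dr. A', 'host_name': 'Homo sapiens'}
```

Because "submitter" is targeted by two source fields, their values are concatenated in the order the fields are listed.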
There are inevitably export data transformation cases that the above functionality can't handle. For these, custom coding in an export.js file is required.