
Managing schema.yaml


Building schema.yaml via field and picklist tables

A LinkML schema.yaml file can be maintained directly with a text editor or other editing tool, but it is often easier to generate schema.yaml from specifications held in team-editable spreadsheets such as Google Sheets. There are a few ways to do this:

  • LinkML offers a Schemasheets spreadsheet method.
  • DataHarmonizer has a similar system pitched at users without any LinkML experience, though it still requires a bit of programmer-level setup. Three files are required, which can be kept in a template folder alongside schema.yaml:
      • schema_core.yaml, for specifying the necessary parts of the LinkML schema as a whole.
      • schema_slots.tsv, a tab-delimited text file for specifying templates and their fields (LinkML classes and their slots).
      • schema_enums.tsv, a tab-delimited text file for specifying the categorical picklists that a field might require.

In all of the .tsv and .yaml files, UTF-8 characters are generally acceptable, but it helps to normalize any fancy quotes or dashes in column headers and field text to plain - dashes and straight " quotes.

The first row of schema_slots.tsv and schema_enums.tsv contains LinkML slot names or DataHarmonizer-friendly variants, as described below. Ensure that the field content of these files does not contain stray carriage returns or line feeds (or spaces instead of tabs), as these will likely cause errors when the files are read and compiled into schema.yaml. Stray line breaks can be detected by viewing either .tsv file in a text editor with the "word wrap" feature turned off: if text from one row spills onto the next row, a line break was probably copied over from the content of a spreadsheet cell.

Examples of the Google Sheets sources for the schema_slots.tsv and schema_enums.tsv files are in tabs of this table of viral pathogen data collection standards. For example, the CanCOGeN-slots and CanCOGeN-enums tabs are copied in their entirety into the /template/canada_covid19/ folder's tab-delimited schema_slots.tsv and schema_enums.tsv files, which are then processed along with schema_core.yaml to create schema.yaml.
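For orientation, the folder layout implied by the paths above and below looks roughly like this (the canada_covid19 folder is just the example used here; other templates follow the same pattern):

script/
  tabular_to_schema.py       # run from inside a template folder as ../../script/tabular_to_schema.py
template/
  canada_covid19/
    schema_core.yaml         # hand-edited schema skeleton
    schema_slots.tsv         # copied from the CanCOGeN-slots sheet tab
    schema_enums.tsv         # copied from the CanCOGeN-enums sheet tab
    schema.yaml              # generated output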

To generate or refresh schema.yaml from the above, run this in the template's directory:

> python3 ../../script/tabular_to_schema.py

From the example schema_core.yaml below, the "CanCOGeN Covid-19" schema will be built, with one "CanCOGeN Covid-19" class (template) which DataHarmonizer will show in its menu system.

id: https://example.com/CanCOGeN_Covid-19
name: CanCOGeN_Covid-19
description: ""
imports:
  - "linkml:types"
prefixes:
  linkml: "https://w3id.org/linkml/"
  GENEPIO: "http://purl.obolibrary.org/obo/GENEPIO_"
classes:
  dh_interface:
    name: dh_interface
    description: "A DataHarmonizer interface"
    from_schema: https://example.com/CanCOGeN_Covid-19
  "CanCOGeN Covid-19":
    name: "CanCOGeN Covid-19"
    description: Canadian specification for Covid-19 clinical virus biosample data gathering
    is_a: dh_interface
slots: {}
enums: {}
types:
  WhitespaceMinimizedString:
    name: "WhitespaceMinimizedString"
    typeof: string
    description: "A string that has all whitespace trimmed off of beginning and end, and all internal whitespace segments reduced to single spaces. Whitespace includes #x9 (tab), #xA (linefeed), and #xD (carriage return)."
    base: str
    uri: xsd:token
  Provenance:
    name: "Provenance"
    typeof: string
    description: "A field containing a DataHarmonizer versioning marker. It is issued by DataHarmonizer when validation is applied to a given row of data."
    base: str
    uri: xsd:token
settings:
  Title_Case: "(((?<=\\b)[^a-z\\W]\\w*?|[\\W])+)"
  UPPER_CASE: "[A-Z\\W\\d_]*"
  lower_case: "[a-z\\W\\d_]*"

As described later, the "slots: {}" and "enums: {}" dictionaries get filled in by the tabular_to_schema.py script which processes schema_slots.tsv and schema_enums.tsv content.
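Once tabular_to_schema.py runs, those dictionaries are populated from the .tsv rows. As a rough sketch only (the slot and menu names below are invented for illustration, not taken from the actual CanCOGeN template), the generated entries look something like this:

slots:
  symptom onset date:            # illustrative slot name only
    name: symptom onset date
    title: symptom onset date
    slot_group: Host information # section label in the two row header
    range: date
enums:
  null value menu:               # illustrative enumeration name only
    name: null value menu
    permissible_values:
      Not Applicable: {}
      Not Collected: {}
      Missing: {}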

IMPORTANT: DataHarmonizer will add each schema class as a template to its menu system if it finds that the class has an "is_a" relationship to the special "dh_interface" class.
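For example, a second template could be exposed in the menu simply by giving another class an is_a relationship to dh_interface; the "Hospital Intake" class name below is hypothetical:

classes:
  dh_interface:
    name: dh_interface
    description: "A DataHarmonizer interface"
  "CanCOGeN Covid-19":
    name: "CanCOGeN Covid-19"
    is_a: dh_interface        # shown in the template menu
  "Hospital Intake":          # hypothetical additional template
    name: "Hospital Intake"
    is_a: dh_interface        # also shown in the template menu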

Template field (slot) specification:

The content of schema.yaml was described above. To build schema.yaml from lists of slots and enumerations, the structure of schema_slots.tsv and schema_enums.tsv needs to be detailed.

schema_slots.tsv

| property | description |
| -- | -- |
| class_name | A semicolon-delimited list of the classes (templates) that the current row applies to. The value is reused row by row until a different value is set on a subsequent row. |
| slot_group | A user-friendly section label that this slot will be listed under in the two-row header of the DataHarmonizer user interface. |
| slot_uri | An ontology ID or URI that provides a unique semantic web identifier for this slot. (Was "Ontology ID" in DH <= v0.15.5.) |
| title | A user-friendly label that is displayed in the second row of the spreadsheet column header. |
| name | Optional; may supply the database field name if it differs from the title. (In the template code, title is copied into empty name entries.) |
| range | A data type that a slot value validates to, which can be a date, decimal, the name of a picklist menu of categorical choices, etc. |
| range_2 | An additional data type that a slot value validates to, which can include a semicolon-delimited list of other picklist menus. This and the range field are converted into the "any_of" structure of range specifications (see the sketch after this table). This enables a menu of metadata values like "Missing", "Not Collected", etc. |
| identifier | If true, this field's value should be unique within its column (of tabular data). |
| multivalued | If true, more than one value is allowed for this field. Multiple values are usually delimited by semicolons. |
| required | If true, this field requires a data value. |
| recommended | If true, this field is suggested for data entry but not required. |
| minimum_value | The minimum value that a decimal or integer field can take on. Can also hold a date to test against, including the special value "{today}". See the todos reference above. |
| maximum_value | The maximum value that a decimal or integer field can take on. Can also hold a date to test against, including the special value "{today}". See the todos reference above. |
| pattern | A regular expression to validate a field's textual content against. Include ^ and $ start and end of line anchors for a full string match. Example of simple email validation: ^\S+@\S+\.\S+$ |
| structured_pattern | A LinkML mechanism for specifying strings containing named regular expression patterns which are compiled into a pattern. It takes advantage of search-and-replace of names stored in the schema's settings dictionary (see the sketch after this table). |
| description | A helpful description of what the field is about. Shown in the column help info. |
| comments | Data entry guidance for the field. Shown in the column help info. |
| examples | An array of values which are displayed as a bulleted list in the column help info. |
| EXPORT_... | A list of 0 or more export target columns that provide instructions for mapping to external database fields. Each becomes an export template option. |
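As a rough sketch (slot and menu names below are invented for illustration, not taken from a real template), a row with both range and range_2 set, and another row using structured_pattern, could end up in the generated schema.yaml roughly as follows:

slots:
  host age:                        # illustrative name only
    name: host age
    any_of:                        # built from range plus range_2
      - range: decimal
      - range: "null value menu"   # e.g. "Missing", "Not Collected"
    minimum_value: 0
    maximum_value: 120
  isolate ID:                      # illustrative name only
    name: isolate ID
    structured_pattern:
      syntax: "{UPPER_CASE}"       # name looked up in the schema's settings dictionary
      partial_match: false
      interpolated: true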

The class_name field's ability to list several classes allows:

  • Two classes on the same row to share an identical slot specification, which will be provided in the schema's "slots" dictionary.
  • A slot (field) and its specification to be listed for one class on one row, and for a different class on the next row. The tabular_to_schema.py script will tease out what is common to the two slot specifications and store that in the generic slot definition, while placing the properties unique to each class in that class's slot_usage entry for the slot (see the sketch below).
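A rough sketch of the result, with invented class and slot names: what is common lands in the shared slot definition, and the per-class differences land in each class's slot_usage:

slots:
  specimen collection date:        # shared definition, illustrative name only
    name: specimen collection date
    range: date
classes:
  "Template A":                    # hypothetical template classes
    is_a: dh_interface
    slots:
      - specimen collection date
    slot_usage:
      specimen collection date:
        required: true             # only Template A requires it
  "Template B":
    is_a: dh_interface
    slots:
      - specimen collection date
    slot_usage:
      specimen collection date:
        recommended: true          # Template B merely recommends it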

schema_enums.tsv:


The first row of the CanCOGeN-enums sheet shows the columns involved: title, meaning, menu_1 through menu_5, description, a set of EXPORT_ columns (EXPORT_GISAID, EXPORT_CNPHI, EXPORT_NML_LIMS, EXPORT_BIOSAMPLE, EXPORT_VirusSeq_Portal), and additional curation columns (ontology label, definition, definition source, exact synonym, broad synonym, narrow synonym, Parental Term, Parental Term URL, Issues / Status, Other).

| property | description |
| -- | -- |
| title | Title of the enumeration (menu) of categorical choices. |
| meaning | An ontology identifier for a menu choice. |
| menu_1 … menu_5 | The text of a menu choice, placed in the column matching its nesting depth (a menu_2 choice is nested under the preceding menu_1 choice, and so on). |
| description | A description of the menu or a given menu choice. |
| EXPORT_... | Export mapping columns; see the "EXPORT_ fields" section below. |

EXPORT_ fields

Template specification columns named EXPORT_[export format keyword] (e.g. EXPORT_GISAID) are translated into a simple export format data structure associated with each applicable template column field in data.js.

If the export format keyword is listed in the TEMPLATES dictionary at the top of main.js, its dictionary key becomes an option in the DataHarmonizer "Export To ..." menu.

  • For DataHarmonizer template column (field) specifications, a value in an EXPORT_[export format keyword] column is treated as the name of the export format column that the template spreadsheet's column data is exported to.
  • Similarly, for template select and multi-select (picklist) values, a value in an EXPORT_[export format keyword] column causes the source field's chosen value to be transformed into the given target export value.

Two additional features enable many common transform tasks:

  • A semicolon ";" in an EXPORT_ field value causes the source template field's value to be channeled to each of the export fields separated by the semicolons. If an export field target is mentioned on multiple field specification rows of the template, the values of the source fields, if any, are concatenated into the export target field in order (with delimiters kept as placeholders for any empty component values).
  • In addition, if a targeted export field is specified as a key:value pair, i.e. in "[export field name]:[string]" format, then the input field value is transformed into the given export string value (for instance, a hypothetical cell value of "host_health_state:Unknown" would send the string "Unknown" to the host_health_state export field). This allows conversion of values from source to export, which is pertinent to picklist choices that vary between systems but are semantically equivalent.

There are inevitably export data transformation cases that the above functionality can't handle. These require custom coding in an export.js file. Documentation on this is coming soon!