Cross-linguistic data often compares languages using comparative concepts, e.g. typological features like in WALS, or Swadesh terms as in many wordlists.
While it may sometimes be enough to refer to such a concept by ID, e.g. using a value like 116A as parameterReference to refer to WALS feature 116A in a Structure Dataset, often additional metadata must be provided. This should be done in CLDF datasets by including a ParameterTable, i.e. a table with "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ParameterTable", and pointing to rows in this table using the parameterReference property in the ValueTable.
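As a minimal sketch of how a consumer might follow parameterReference, the snippet below pairs value rows with the parameter rows they point to, using only the Python standard library. The column names Parameter_ID and Language_ID, the language identifiers and the sample rows are illustrative assumptions, not data copied from a published dataset.

```python
import csv
import io

# Illustrative file contents; a real dataset would ship these as
# parameters.csv and values.csv alongside its metadata file.
PARAMETERS_CSV = """ID,Name,Description
116A,Polar Questions,How polar questions are marked
"""

VALUES_CSV = """ID,Language_ID,Parameter_ID,Value
1,lang1,116A,Question particle
2,lang2,116A,Interrogative word order
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_parameters(values, parameters):
    """Pair each value row with the parameter row it references.

    Assumes the parameterReference column is named Parameter_ID and the
    identifier column of the ParameterTable is named ID.
    """
    by_id = {row["ID"]: row for row in parameters}
    return [(value, by_id[value["Parameter_ID"]]) for value in values]


pairs = attach_parameters(read_rows(VALUES_CSV), read_rows(PARAMETERS_CSV))
for value, parameter in pairs:
    print(value["Language_ID"], parameter["Name"], value["Value"])
```

A lookup keyed by ID keeps the join simple and makes a dangling reference fail loudly with a KeyError instead of passing silently.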
Often, values for parameters are just text, e.g. word forms in the case of a CLDF Wordlist. In this case, the text string representing the value in the CSV table can simply be interpreted "as is" by CLDF consumers.
If a parameter represents a categorical (or ordinal) variable, it is recommended to provide the list of possible values in a CodeTable (possibly extended with a column indicating the ordering of these values in the case of ordinal variables). The ValueTable should then include a codeReference column, but also list the string value in the value column.
While this introduces some redundancy, it ensures compatibility with somewhat simplistic data access methods which may be
employed e.g. for data visualization.
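The following sketch illustrates how a consumer could use a CodeTable with an ordering column to rank the values of an ordinal parameter. The parameter, the codes, the column name Ordinal and the reference column name Code_ID are hypothetical choices made for this example only.

```python
import csv
import io

# Hypothetical CodeTable with an extra column giving the order of the codes.
CODES_CSV = """ID,Parameter_ID,Name,Ordinal
size-small,size,small,1
size-average,size,average,2
size-large,size,large,3
"""

# Hypothetical ValueTable: Code_ID references the code, Value repeats a
# plain string so that simple tools can use the data without the lookup.
VALUES_CSV = """ID,Language_ID,Parameter_ID,Code_ID,Value
1,lang1,size,size-large,large
2,lang2,size,size-small,small
3,lang3,size,size-average,average
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def rank_values(values, codes):
    """Sort value rows by the rank of the code each row references."""
    rank = {row["ID"]: int(row["Ordinal"]) for row in codes}
    return sorted(values, key=lambda row: rank[row["Code_ID"]])


ordered = rank_values(read_rows(VALUES_CSV), read_rows(CODES_CSV))
for row in ordered:
    print(row["Language_ID"], row["Value"])
```

Only the ranking step needs the CodeTable; a plotting script that just wants labels can read the Value column directly, which is the redundancy described above.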
Sometimes typological surveys use data binning to transform values of
varying data types (often numeric) into categorical data. Ideally, though, this step should be left to data analysis,
unless the "bins" have some theoretical foundation. To make it possible to store string representations of typed data
in CSV while still specifying how this data should be interpreted, a columnSpec column can be added to the ParameterTable. CLDF consumers SHOULD then consult the value of this column when reading values associated with the parameter.
As an example, we use the Python package csvw to obtain a reader for typed data as specified by a columnSpec value:
>>> import json
>>> from csvw import Column
>>> # Read the datatype description from a string value of the columnSpec column:
>>> reader = Column.fromvalue(json.loads('{"datatype": {"base": "decimal", "minimum": "1", "maximum": "11"}}'))
>>> # Use this reader to interpret string values from the value column as appropriate Python objects:
>>> reader.read('3.4')
Decimal('3.4')
>>> reader.read('30')
...
ValueError: value must be <= 11
Tip
This mechanism even allows list-valued parameter values. If, for example, a parameter's value for columnSpec is the string {"datatype": "integer", "separator": " "}, values for the parameter can be read as follows:
>>> reader = Column.fromvalue(json.loads('{"datatype": "integer", "separator": " "}'))
>>> reader.read('1 2 3')
[1, 2, 3]
See also the related discussion at #109
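To show how a consumer might apply a different columnSpec to each parameter, here is a rough sketch that builds one reader per row of a ParameterTable. It is a deliberately simplified, standard-library stand-in and not the csvw API: the helper make_reader is hypothetical and understands only the integer, decimal and string base types, minimum, maximum and separator. The three parameters are made up; a real consumer would more likely delegate the parsing to csvw as in the example above.

```python
import json
from decimal import Decimal

# Made-up ParameterTable rows; ColumnSpec holds a JSON string or is empty.
PARAMETERS = [
    {"ID": "count", "ColumnSpec": '{"datatype": "integer", "separator": " "}'},
    {"ID": "ratio", "ColumnSpec": '{"datatype": {"base": "decimal", "minimum": "1", "maximum": "11"}}'},
    {"ID": "word", "ColumnSpec": ""},
]

# Only this small subset of CSVW base types is supported by the sketch.
BASE_TYPES = {"integer": int, "decimal": Decimal, "string": str}


def make_reader(column_spec):
    """Return a function that parses a string value according to column_spec."""
    if not column_spec:
        return lambda text: text  # no spec: interpret the string "as is"
    spec = json.loads(column_spec)
    datatype = spec.get("datatype", "string")
    if isinstance(datatype, str):
        datatype = {"base": datatype}  # normalise the short form
    convert = BASE_TYPES[datatype.get("base", "string")]

    def read_one(text):
        value = convert(text)
        if "minimum" in datatype and value < convert(datatype["minimum"]):
            raise ValueError("value must be >= " + datatype["minimum"])
        if "maximum" in datatype and value > convert(datatype["maximum"]):
            raise ValueError("value must be <= " + datatype["maximum"])
        return value

    separator = spec.get("separator")
    if separator is None:
        return read_one
    # A separator turns the value into a list of individually parsed items.
    return lambda text: [read_one(item) for item in text.split(separator)]


# One reader per parameter, looked up by the parameter's ID.
readers = {row["ID"]: make_reader(row["ColumnSpec"]) for row in PARAMETERS}

print(readers["count"]("1 2 3"))
print(readers["ratio"]("3.4"))
print(readers["word"]("hand"))
```

Building the readers once and keying them by parameter ID means each value row needs only a dictionary lookup on its parameterReference before parsing.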
The ParameterTable of a Wordlist from the Intercontinental Dictionary Series is described here: https://github.com/intercontinental-dictionary-series/lindseyende/blob/v2.0/cldf/cldf-metadata.json#L269-L300
Since the parameters in this Wordlist are the lexical concepts listed in the IDS concept list, the corresponding Concepticon concept sets are specified using the concepticonReference property.
ParameterTable: parameters.csv
Name/Property | Datatype | Cardinality | Description |
---|---|---|---|
ID | string | singlevalued | A unique identifier for a row in a table. To allow usage of identifiers as path components of URLs, IDs must only contain alphanumeric characters, underscore and hyphen. |
Name | string | unspecified | A title, name or label for an entity. |
Description | string | unspecified | A description for an entity. |
ColumnSpec | json | singlevalued | A column specification given as JSON representation of a CSVW column description. This column specification may be used by CLDF consumers to read a parameter's value as typed data. Note that a CSVW datatype description is not sufficient, because parsing a string value must also be informed by the column properties |