In this tutorial we use `cldfbench` to create a CLDF `StructureDataset` from the data of the WALS feature "Consonant Inventories".
Throughout the tutorial we will interact with `cldfbench`'s command line interface. This CLI is a single command, `cldfbench`, which provides access to subcommands. To get a list of available subcommands, run

    $ cldfbench -h

To get help on the usage of a particular subcommand (e.g. the subcommand `new`), run

    $ cldfbench new -h
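For readers unfamiliar with this CLI layout, the following is a minimal sketch of the "single command with subcommands" pattern using Python's standard `argparse` module. It is not `cldfbench`'s actual implementation: the tool name `mytool`, the `--out` option and the help texts are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser for one command that dispatches to subcommands."""
    parser = argparse.ArgumentParser(prog="mytool")  # hypothetical tool name
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each subcommand has its own parser, and therefore its own -h output.
    new = subparsers.add_parser("new", help="create a dataset skeleton")
    new.add_argument("--out", default=".", help="hypothetical output directory option")

    download = subparsers.add_parser("download", help="download raw data")
    download.add_argument("dataset", help="path to the dataset module")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["download", "theid/cldfbench_theid.py"])
print(args.command, args.dataset)
```

With such a layout, `mytool -h` lists the subcommands and `mytool new -h` shows the options of one subcommand, which mirrors the two help invocations above.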
- Create a dataset directory, initialized with a skeleton suitable for curation with `cldfbench`:

    $ cldfbench new
    id: theid
    title: The Title
    license: CC-BY
    url:
    citation:

Note that we specified `theid` as the dataset ID when prompted. So now we should see a directory `theid` in our working directory:

    $ tree theid/
    theid/
    ├── cldf
    │   └── README.md
    ├── cldfbench_theid.py
    ├── etc
    │   └── README.md
    ├── metadata.json
    ├── raw
    │   └── README.md
    ├── setup.cfg
    ├── setup.py
    └── test.py
- Now edit the Python module `theid/cldfbench_theid.py`, filling in functionality to download the "raw" data from WALS. A `cldfbench.Dataset` provides several convenience methods for this kind of task. So in our case, it's a one-liner:

    def cmd_download(self, args):
        self.raw_dir.download('https://wals.info/feature/1A.tab', '1A.tsv')

Having implemented the command, we can run it from the command line:

    $ cldfbench download theid/cldfbench_theid.py
    INFO running cmd_download on theid ...
    INFO ... done theid [0.5 secs]

And inspect whether it did the right thing:

    $ tree theid/raw/
    theid/raw/
    ├── 1A.tsv
    └── README.md
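To make the download step less of a black box, here is a rough standard-library sketch of what fetching a URL into the raw directory amounts to. This is an assumption about the general idea, not the code behind `self.raw_dir.download`; the function name `download` and its signature are hypothetical.

```python
import pathlib
import urllib.request


def download(url, target_dir, fname):
    """Fetch `url` and store it as `fname` inside `target_dir`; return the path."""
    target_dir = pathlib.Path(target_dir)
    # Create the directory if needed, so the sketch also works on a fresh checkout.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / fname
    urllib.request.urlretrieve(url, str(target))
    return target
```

Under these assumptions, calling `download('https://wals.info/feature/1A.tab', 'theid/raw', '1A.tsv')` would leave the file at `theid/raw/1A.tsv`, matching the listing above.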
- Now we want to convert WALS' quirky `tab` format to nice CLDF. We do so by implementing the `cldf_specs` method, to specify that we want to create a `StructureDataset` (note that this should replace the already existing definition of `cldf_specs` in the `Dataset` class):

    def cldf_specs(self):
        from cldfbench import CLDFSpec
        return CLDFSpec(dir=self.cldf_dir, module='StructureDataset')

and implementing the `cmd_makecldf` method:

    def cmd_makecldf(self, args):
        from csvw.dsv_dialects import Dialect
        for row in self.raw_dir.read_csv(
            '1A.tsv',
            dicts=True,
            dialect=Dialect(
                skipRows=5,  # Ignore the citation info on top
                skipBlankRows=True,
                delimiter='\t',
            )
        ):
            args.writer.objects['ValueTable'].append({
                'ID': row['wals code'],
                'Language_ID': row['wals code'],
                'Parameter_ID': '1A',
                'Value': row['description'],
            })
Let's break this down:

- We iterate over the rows of the downloaded data. Again, `cldfbench` provides convenient access to a `csvw.dsv.reader`, which understands multiple CSV dialects. We specify a dialect that can cope with WALS' format, ignoring the citation info at the top, and splitting columns on `\t`.
- For each row in the input data, we append a row to the `StructureDataset`'s `ValueTable`.
- Because we only create a single CLDF dataset here, we do not need to call `with self.cldf_writer(...) as ds:` explicitly. Instead, an initialized `cldfbench.cldf.CLDFWriter` instance is available as `args.writer`.
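The row handling described above can be approximated with the standard library alone, which may help when debugging the input file outside of `cldfbench`. This is a sketch under assumptions: it uses `csv.DictReader` rather than `csvw`, the helper names `iter_rows` and `to_values` are hypothetical, the sample lines are invented placeholders rather than real WALS data, and the exact interplay of `skipRows` and `skipBlankRows` in `csvw` may differ from the simple "skip first, then drop blanks" order used here.

```python
import csv


def iter_rows(lines, skip_rows=5, delimiter="\t"):
    """Yield one dict per data row of a tab-separated export with a preamble."""
    # Drop the leading citation lines, then any blank lines, so that the
    # first remaining line is the header row.
    content = [
        line for i, line in enumerate(lines)
        if i >= skip_rows and line.strip()
    ]
    yield from csv.DictReader(content, delimiter=delimiter)


def to_values(rows, parameter_id="1A"):
    """Map input rows to dicts shaped like the ValueTable rows of the tutorial."""
    return [
        {
            "ID": row["wals code"],
            "Language_ID": row["wals code"],
            "Parameter_ID": parameter_id,
            "Value": row["description"],
        }
        for row in rows
    ]


# Invented sample input: five preamble lines, a header, and two data rows.
sample = [
    "citation line 1\n", "citation line 2\n", "citation line 3\n",
    "citation line 4\n", "\n",
    "wals code\tname\tdescription\n",
    "xxa\tLanguage A\tSmall\n",
    "xxb\tLanguage B\tLarge\n",
]
for value in to_values(iter_rows(sample)):
    print(value)
```

Note that, as in the tutorial code, the language code doubles as the value ID, which only works because the dataset contains a single parameter.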
Again, we can run the command from the command line:

    $ cd theid
    $ cldfbench makecldf cldfbench_theid.py --glottolog ../glottolog/glottolog
    INFO running cmd_makecldf on theid ...
    INFO ... done theid [0.1 secs]
Note that we had to pass an additional argument: The path to a clone of the glottolog/glottolog repository, because CLDF datasets should link to standard reference catalogs.
Inspecting the results, we see ...

    $ tree theid/cldf
    theid/cldf
    ├── README.md
    ├── StructureDataset-metadata.json
    └── values.csv

... a valid CLDF dataset:

    $ cldf stats theid/cldf/StructureDataset-metadata.json
    <cldf:v1.0:StructureDataset at theid/cldf>
    key            value
    -------------  ----------------------------------------------------
    dc:conformsTo  http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
    rdf:type       http://www.w3.org/ns/dcat#Distribution

    Path        Type          Rows
    ----------  ----------  ------
    values.csv  ValueTable     563
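As a quick cross-check of the row count, one could also inspect `values.csv` directly. The following sketch assumes the file carries the four columns written in `cmd_makecldf`; the function name `check_values` is hypothetical, and the check does not replace proper validation against the CLDF specification.

```python
import csv
import pathlib


def check_values(path, required=("ID", "Language_ID", "Parameter_ID", "Value")):
    """Return the number of data rows after checking columns and ID uniqueness."""
    with pathlib.Path(path).open(encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        missing = [col for col in required if col not in (reader.fieldnames or [])]
        if missing:
            raise ValueError("missing columns: {}".format(", ".join(missing)))
        ids = [row["ID"] for row in reader]
    # Row IDs must be unique within a table.
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate values in the ID column")
    return len(ids)
```

Called as `check_values('theid/cldf/values.csv')`, it should return the same number of rows that `cldf stats` reports for the `ValueTable`.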
The CLDF data we have created so far was pretty bare-bones. We should at least include some metadata about the languages, and ideally also some information about the feature (or parameter in CLDF lingo), e.g. a description of the values for the categorical variable used in this example.