Manipulating CLDF data with pycldf

The pycldf package provides tools and Python APIs to read and write CLDF datasets.

Exploring datasets using pycldf.orm

Starting with version 1.18, pycldf provides a convenient Python API to interactively (or programmatically) explore CLDF datasets:

from collections import Counter
from tabulate import tabulate
from pycldf import Dataset

Now we can instantiate a pycldf.Dataset from data on the web:

wals = Dataset.from_metadata('https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json')

Note that we use the URL of the raw metadata file of a particular version, namely the release tagged as v2020. For "production" use, e.g. analyses for publications, you should use the long-term archived release on Zenodo, identified by its DOI. Since the Zenodo deposit contains a zip archive of the dataset, this requires downloading and unzipping first. For exploratory analysis, we enjoy hassle-free data access by URL, which loads the data directly into memory rather than onto the hard disk.
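For the "production" route, the unzip-then-load step might be sketched as follows. This uses only the standard library to unpack an archive that has already been downloaded and to locate the metadata file. The helper name `unpack_and_find_metadata` and the file name `wals.zip` are hypothetical, and the assumption that the archive contains exactly one file matching `*-metadata.json` should be checked against the actual deposit.

```python
import zipfile
from pathlib import Path


def unpack_and_find_metadata(archive, target_dir):
    """Unpack a dataset zip archive and return the path of its CLDF metadata file.

    Hypothetical helper: assumes the archive contains exactly one file whose
    name ends in "-metadata.json" (as in StructureDataset-metadata.json).
    """
    target_dir = Path(target_dir)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)  # creates target_dir if it does not exist
    candidates = sorted(target_dir.rglob('*-metadata.json'))
    if len(candidates) != 1:
        raise ValueError(
            'expected one metadata file, found {}'.format(len(candidates)))
    return candidates[0]


# Usage (not run here; "wals.zip" is a placeholder for the downloaded archive):
# from pycldf import Dataset
# md = unpack_and_find_metadata('wals.zip', 'wals-release')
# wals = Dataset.from_metadata(md)
```

The returned path can be passed to `Dataset.from_metadata` in place of the URL, so the rest of the exploration stays the same.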

Now we can look at features we are interested in, using pycldf's ORM (see https://github.com/cldf/pycldf#object-oriented-access-to-cldf-data) ...

feature1 = wals.get_object('ParameterTable', '1A')

... count the datapoints by value ...

values = Counter(v.code.name for v in feature1.values)

... and look at the result ...

print('\n{}\n\n{}'.format(feature1.name, tabulate(values.most_common())))

... which should look as follows:

| value | # |
| --- | --- |
| Average | 201 |
| Moderately small | 122 |
| Moderately large | 94 |
| Small | 89 |
| Large | 57 |
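The counting step, `Counter(v.code.name for v in feature1.values)`, can be tried without loading any data. The sketch below uses made-up stand-in objects that only mimic the `.code.name` attribute chain; they are not pycldf objects, and the names and counts are not WALS data.

```python
from collections import Counter
from types import SimpleNamespace

# Made-up stand-ins mimicking the `.code.name` attribute chain; not real WALS data.
toy_values = [
    SimpleNamespace(code=SimpleNamespace(name=name))
    for name in ['Average', 'Small', 'Average', 'Large', 'Average', 'Small']
]

# Same expression as in the walkthrough: one count per distinct code name.
toy_counts = Counter(v.code.name for v in toy_values)

# most_common() returns (name, count) pairs sorted by descending count,
# which is the list-of-rows shape that tabulate accepts.
rows = toy_counts.most_common()
print(rows)  # [('Average', 3), ('Small', 2), ('Large', 1)]
```

Because `most_common()` already sorts by descending count, the table above lists the most frequent value first without any extra sorting.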