These scripts illustrate how to use basic R (i.e. without any external dependencies, packages, etc.) to work with CLDF data sets. merge() and the df[filter$condition %in% filter2$condition,] construct are used to combine the different data sources and perform basic queries and filtering tasks.
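As a minimal sketch of these two constructs, the snippet below uses two small, invented data frames standing in for a CLDF LanguageTable and ValueTable. The column names (ID, Name, Language_ID, Parameter_ID, Value) follow CLDF's default column naming, but this is an assumption that should be checked against the metadata of the dataset at hand.

```r
# Invented toy data standing in for languages.csv and values.csv.
languages <- data.frame(
  ID = c("lang1", "lang2", "lang3"),
  Name = c("Alpha", "Beta", "Gamma"),
  stringsAsFactors = FALSE
)
values <- data.frame(
  ID = c("v1", "v2", "v3", "v4"),
  Language_ID = c("lang1", "lang2", "lang2", "lang4"),
  Parameter_ID = c("p1", "p1", "p2", "p1"),
  Value = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Combine: attach language metadata to each value (inner join by default,
# so the value pointing at the unknown "lang4" is dropped).
merged <- merge(values, languages, by.x = "Language_ID", by.y = "ID")

# Filter: keep only the languages that have at least one value.
with_values <- languages[languages$ID %in% values$Language_ID, ]

print(merged)
print(with_values)
```

Passing all.x = TRUE to merge() would keep values without a matching language instead of dropping them.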
Note: While accessing CLDF data using standard data analysis tools as shown here is easy, this approach should be combined with consultation of the JSON metadata supplied with a CLDF dataset, to verify assumptions regarding syntax (e.g. the CSV dialect) and semantics (e.g. the mapping of column names to CLDF properties) of the data files. Such issues can be circumvented by loading the CLDF data into a SQLite database using pycldf's cldf createdb command and then accessing the data as shown below.
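When reading the CSV files directly instead, it can help to spell out the dialect assumptions in the read.csv() call rather than rely on R's defaults. The sketch below writes a small invented file to a temporary path and reads it back assuming a comma delimiter, double-quote quoting, a header row and UTF-8 encoding. These settings are assumptions for illustration; the dialect actually in force is the one declared in the dataset's metadata.

```r
# Invented sample file; in practice `path` would point at e.g. a languages.csv.
path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,Name,Glottocode",
    "l1,\"Alpha, Northern\",alph1234",
    "l2,NA,"),
  path
)

langs <- read.csv(
  path,
  sep = ",",                 # assumed delimiter
  quote = "\"",              # assumed quote character
  header = TRUE,             # assumed header row
  encoding = "UTF-8",        # assumed encoding
  na.strings = "",           # only empty cells are missing; the string "NA" is kept
  stringsAsFactors = FALSE,  # keep identifiers as plain strings
  check.names = FALSE        # do not rewrite column names
)

print(langs)
```

Setting na.strings = "" matters for identifiers or names that happen to be the literal string "NA", which R would otherwise read as a missing value.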
- simple_access_to_values.R (or its notebook version) illustrates a very basic analysis on the basis of the WALS CLDF dump.
- wals_ids_comparison.R (or its notebook version) illustrates, in a more involved fashion, how to filter and analyse different CLDF dumps together (WALS and IDS, in this case).
- typology_visualisation.R is a more involved example, outlining how to access, merge, filter, and post-process data for visualisation purposes. See also the associated helper file with all the functions that are used in the example. This is based on code provided by @bambooforest.
As an example, we'll poke around in the Glottolog CLDF data. Let's download release v4.6:
$ curl -LO https://github.com/glottolog/glottolog-cldf/archive/refs/tags/v4.6.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 7622k 0 7622k 0 0 262k 0 --:--:-- 0:00:28 --:--:-- 255k
$ unzip v4.6.zip
Archive: v4.6.zip
c8eefe82b4c87f3c566a8e5181bacf714661e18e
creating: glottolog-cldf-4.6/
...
Now we need to install pycldf and load the CLDF into SQLite:
$ pip install pycldf
$ cldf createdb glottolog-cldf-4.6/cldf/cldf-metadata.json glottolog.sqlite
INFO <cldf:v1.0:StructureDataset at glottolog-cldf-4.6/cldf> loaded in glottolog.sqlite
Let's connect to the database via RSQLite:
> library(RSQLite)
> conn <- dbConnect(RSQLite::SQLite(), "glottolog.sqlite")
> dbListTables(conn)
[1] "CodeTable" "LanguageTable" "ParameterTable"
[4] "SourceTable" "ValueTable" "ValueTable_SourceTable"
The database schema (in particular table and column names) follows the rules described here.
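Before writing queries it is useful to check which columns a table actually has. The sketch below shows one way to do this with dbListFields() and a LIMIT query. To stay self-contained it builds a tiny in-memory database with invented rows; the cldf_id and cldf_name columns are assumed stand-ins for what the real LanguageTable contains. Against the real data, the connection to glottolog.sqlite would be used instead.

```r
library(DBI)
library(RSQLite)

# In-memory toy database with invented rows (not the real Glottolog data).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "LanguageTable", data.frame(
  cldf_id = c("abcd1234", "efgh5678"),
  cldf_name = c("Alpha", "Beta"),
  stringsAsFactors = FALSE
))

# List the column names of a table.
fields <- dbListFields(conn, "LanguageTable")
print(fields)

# Peek at the first row without loading the whole table.
first_row <- dbGetQuery(conn, "SELECT * FROM LanguageTable LIMIT 1")
print(first_row)

dbDisconnect(conn)
```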
Now we can let dplyr loose on the data:
> library(dplyr)
> languages <- tbl(conn, "languagetable")
> values <- tbl(conn, "valuetable")
> aes <- values %>% filter(cldf_parameterReference == "aes")
> inner_join(aes, languages, by=c("cldf_languageReference" = "cldf_id")) %>% group_by(cldf_codeReference) %>% summarise(langs = count(cldf_languageReference))
# Source: lazy query [?? x 2]
# Database: sqlite 3.38.5
# [glottolog.sqlite]
cldf_codeReference langs
<chr> <int>
1 aes-extinct 1250
2 aes-moribund 414
3 aes-nearly_extinct 351
4 aes-not_endangered 2956
5 aes-shifting 1837
6 aes-threatened 1537
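The same kind of aggregation can also be written as plain SQL and run with dbGetQuery(), which returns an ordinary data frame. The following is a sketch on an in-memory database with invented rows, reusing the table and column names from the session above. Running it against glottolog.sqlite would only require swapping the connection.

```r
library(DBI)
library(RSQLite)

# In-memory toy database with invented rows (not the real Glottolog data).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "LanguageTable", data.frame(
  cldf_id = c("l1", "l2", "l3"),
  stringsAsFactors = FALSE
))
dbWriteTable(conn, "ValueTable", data.frame(
  cldf_id = c("v1", "v2", "v3", "v4"),
  cldf_languageReference = c("l1", "l2", "l3", "l1"),
  cldf_parameterReference = c("aes", "aes", "aes", "other"),
  cldf_codeReference = c("aes-extinct", "aes-shifting", "aes-shifting", "other-x"),
  stringsAsFactors = FALSE
))

# Count languages per endangerment code, restricted to the "aes" parameter.
counts <- dbGetQuery(conn, "
  SELECT v.cldf_codeReference, COUNT(v.cldf_languageReference) AS langs
  FROM ValueTable AS v
  JOIN LanguageTable AS l ON v.cldf_languageReference = l.cldf_id
  WHERE v.cldf_parameterReference = 'aes'
  GROUP BY v.cldf_codeReference
  ORDER BY v.cldf_codeReference
")
print(counts)

dbDisconnect(conn)
```

With dplyr, calling collect() at the end of a lazy pipeline similarly pulls the result into a local data frame.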