cldf_r

Working with CLDF data in R

These scripts illustrate how to work with CLDF datasets in base R, i.e. without any external packages or other dependencies. merge() and the df[filter$condition %in% filter2$condition,] construct are used to combine the different data sources and to perform basic queries and filtering tasks.
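As a minimal sketch of these two constructs, the snippet below uses two tiny data frames. The rows, the language IDs and the column names (ID, Language_ID, Parameter_ID, Macroarea) are made up for illustration and only loosely modelled on typical CLDF tables; real datasets should be checked against their metadata.

```r
# Hypothetical miniature stand-ins for a CLDF LanguageTable and ValueTable.
languages <- data.frame(
  ID = c("lang1", "lang2", "lang3"),
  Name = c("Language A", "Language B", "Language C"),
  Macroarea = c("Eurasia", "Africa", "Eurasia"),
  stringsAsFactors = FALSE
)
values <- data.frame(
  ID = c("v1", "v2", "v3", "v4"),
  Language_ID = c("lang1", "lang2", "lang3", "lang1"),
  Parameter_ID = c("p1", "p1", "p1", "p2"),
  Value = c("yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Filtering: keep only the values whose language belongs to a selected subset.
eurasia <- languages[languages$Macroarea == "Eurasia", ]
eurasia_values <- values[values$Language_ID %in% eurasia$ID, ]

# Combining: attach the language metadata to every value row.
# The key column of `languages` (ID) is matched against Language_ID in `values`.
merged <- merge(values, languages, by.x = "Language_ID", by.y = "ID")

print(eurasia_values)
print(merged)
```

Note that merge() performs an inner join by default; passing all.x = TRUE would keep values whose language is missing from the language table.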

Note: While accessing CLDF data with standard data analysis tools as shown here is easy, this approach should be combined with a look at the JSON metadata supplied with a CLDF dataset, to verify assumptions about the syntax (e.g. the CSV dialect) and the semantics (e.g. the mapping of column names to CLDF properties) of the data files. Such issues can be circumvented by loading the CLDF data into a SQLite database using pycldf's cldf createdb command and then accessing the data as shown in the SQLite section below.
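The following sketch shows one way such a check might look. It assumes the jsonlite package is installed (so it goes beyond base R), and it parses a hypothetical, heavily trimmed metadata snippet from a string rather than a real cldf-metadata.json. In real datasets the dialect may be declared elsewhere in the metadata or omitted entirely, and many columns carry no propertyUrl.

```r
library(jsonlite)  # assumption: jsonlite is installed; it is not part of base R

# Hypothetical, heavily trimmed stand-in for a dataset's cldf-metadata.json.
metadata_json <- '{
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "tables": [
    {
      "url": "values.csv",
      "dialect": {"delimiter": ",", "header": true},
      "tableSchema": {
        "columns": [
          {"name": "ID",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
          {"name": "Language_ID",
           "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"},
          {"name": "Comment"}
        ]
      }
    }
  ]
}'
metadata <- fromJSON(metadata_json, simplifyVector = FALSE)

# Syntax: which delimiter does the first table declare (if any)?
delimiter <- metadata$tables[[1]]$dialect$delimiter

# Semantics: map each column name to the CLDF property it is linked to.
column_map <- do.call(rbind, lapply(metadata$tables, function(tab) {
  cols <- tab$tableSchema$columns
  data.frame(
    file = tab$url,
    column = vapply(cols, function(col) col$name, character(1)),
    property = vapply(cols, function(col) {
      # Columns without a propertyUrl have no CLDF semantics attached.
      if (is.null(col$propertyUrl)) NA_character_
      else sub(".*#", "", col$propertyUrl)
    }, character(1)),
    stringsAsFactors = FALSE
  )
}))

print(delimiter)
print(column_map)
```

For a real dataset, the string would be replaced by the path to the metadata file, and the declared delimiter could then be passed on to read.csv() via its sep argument.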

  • simple_access_to_values.R (or its notebook version) illustrates a very basic analysis on the basis of the WALS CLDF dump.

  • wals_ids_comparison.R (or its notebook version) illustrates, in a more involved fashion, how to filter and analyse different CLDF dumps together (WALS and IDS, in this case).

  • typology_visualisation.R, a more involved example, outlines how to access, merge, filter, and post-process data for visualisation purposes. See also the associated helper file with all the functions that are used in the example. This is based on code provided by @bambooforest, here.

Working with CLDF via SQLite

As an example, we'll poke around in the Glottolog CLDF data. Let's download release v4.6:

$ curl -LO https://github.com/glottolog/glottolog-cldf/archive/refs/tags/v4.6.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 7622k    0 7622k    0     0   262k      0 --:--:--  0:00:28 --:--:--  255k
$ unzip v4.6.zip 
Archive:  v4.6.zip
c8eefe82b4c87f3c566a8e5181bacf714661e18e
   creating: glottolog-cldf-4.6/
...

Now we need to install pycldf and load the CLDF into SQLite:

$ pip install pycldf
$ cldf createdb glottolog-cldf-4.6/cldf/cldf-metadata.json glottolog.sqlite
INFO    <cldf:v1.0:StructureDataset at glottolog-cldf-4.6/cldf> loaded in glottolog.sqlite

Let's connect to the database via RSQLite:

> library(RSQLite)
> conn <- dbConnect(RSQLite::SQLite(), "glottolog.sqlite")
> dbListTables(conn)
[1] "CodeTable"              "LanguageTable"          "ParameterTable"        
[4] "SourceTable"            "ValueTable"             "ValueTable_SourceTable"

The database schema (in particular table and column names) follows the rules described here.

Now we can let dplyr loose on the data:

> library(dplyr)
> languages <- tbl(conn, "languagetable")
> values <- tbl(conn, "valuetable")
> aes <- values %>% filter(cldf_parameterReference == "aes")

> inner_join(aes, languages, by=c("cldf_languageReference" = "cldf_id")) %>% group_by(cldf_codeReference) %>% summarise(langs = count(cldf_languageReference))
# Source:   lazy query [?? x 2]
# Database: sqlite 3.38.5
#   [glottolog.sqlite]
  cldf_codeReference langs
  <chr>              <int>
1 aes-extinct         1250
2 aes-moribund         414
3 aes-nearly_extinct   351
4 aes-not_endangered  2956
5 aes-shifting        1837
6 aes-threatened      1537
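The same kind of aggregation can also be written as plain SQL and sent through DBI, without dplyr. The sketch below does not touch glottolog.sqlite: it builds a small in-memory database whose table and column names mirror the ones used above, filled with made-up rows, so that it can be run on its own.

```r
library(DBI)  # RSQLite must be installed as well; it provides the SQLite driver

# In-memory database with hypothetical toy rows; only the table and column
# names are taken from the Glottolog example above.
toy <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy, "LanguageTable", data.frame(
  cldf_id = c("lang1", "lang2", "lang3"),
  cldf_name = c("Language A", "Language B", "Language C"),
  stringsAsFactors = FALSE
))
dbWriteTable(toy, "ValueTable", data.frame(
  cldf_id = c("v1", "v2", "v3", "v4"),
  cldf_languageReference = c("lang1", "lang2", "lang3", "lang1"),
  cldf_parameterReference = c("aes", "aes", "aes", "other"),
  cldf_codeReference = c("aes-extinct", "aes-extinct", "aes-threatened", "other-x"),
  stringsAsFactors = FALSE
))

# Join values to languages, keep the "aes" parameter, count languages per code.
counts <- dbGetQuery(toy, "
  SELECT v.cldf_codeReference, COUNT(v.cldf_languageReference) AS langs
  FROM ValueTable AS v
  JOIN LanguageTable AS l ON v.cldf_languageReference = l.cldf_id
  WHERE v.cldf_parameterReference = 'aes'
  GROUP BY v.cldf_codeReference
  ORDER BY v.cldf_codeReference
")
print(counts)

# Close the connection once the results have been fetched.
dbDisconnect(toy)
```

Against the real database, the query string should work unchanged with the conn object from above, assuming the schema matches the table listing shown earlier. Unlike the lazy dplyr query, dbGetQuery() returns an ordinary data frame.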