phonology.Rmd

---
title: "On phonology of East Caucasian languages"
author: "G. Moroz"
always_allow_html: true
date: 'Last update: November 2021'
bibliography: "data/orig_bib/phonology.bib"
link-citations: true
csl: apa.csl
output:
  html_document:
    number_sections: true
    anchor_sections: true
    pandoc_args: --shift-heading-level-by=-1
---

```{r, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.width = 9.5, fig.height = 6)
library(lingtypology)
library(tidyverse)
df <- read_csv("https://raw.githubusercontent.com/agricolamz/new_caucasian_phonology_dataset/master/database.csv")
```

```{r}
library(RefManageR)
BibOptions(check.entries = FALSE, style = 'text', bib.style = 'authoryear')
article_citation <- BibEntry(bibtype = 'Incollection', 
 key='moroz2020',
 title='On phonology of East Caucasian languages',
 author='George Moroz',
 year='2021',
 editor= 'Daniel, Michael  and Filatov, Konstantin and Moroz, George and Mukhin, Timofey and Naccarato, Chiara and Verhees, Samira',
 publisher='Linguistic Convergence Laboratory, NRU HSE',
 address='Moscow',
 booktitle= 'Typological Atlas of the Languages of Daghestan (TALD)',
 url='http://lingconlab.ru/dagatlas')
```

## {.tabset .tabset-fade .tabset-pills -} 

### Plain text {-}
```{r, results = 'asis'}
print(article_citation, .opts = list(style = 'text'))
```

### BibTeX {-}

```{r}
print(article_citation, .opts = list(style = 'Bibtex'))
```

## Introduction

There are many studies dedicated to the phonology of the indigenous languages of the Caucasus, including [@catford77; @smeets94a; @smeets94b; @alekseev01; @hewitt04; @grawunder17; @begus21; @boris21b; @boris21a; @koryakov21], and [@kk90] on the East Caucasian languages in particular. There are also many studies on the historical-comparative phonetics of these languages, such as [@bokarev60; @gudava64; @imnayshvili77; @akiev77; @gigineyshvili77; @talibov80; @bokarev81; @nikolayev94; @nichols03; @ardoteli09; @mudrak19; @mudrak20] and many others. Since the amount of grammatical descriptions for particular idioms is increasing, we currently have a lot more information about the phonological inventories represented in specific villages, so we do not need to extrapolate our knowledge of standard languages onto all villages where the language is spoken. Even though we have a lot of material on different East Caucasian languages, in order to be able to compare them we still need a unified description of those inventories. In order to solve this task, I compiled a database of phoneme inventories of East Caucasian and neighboring languages that can be accessed and downloaded [here](https://raw.githubusercontent.com/agricolamz/new_caucasian_phonology_dataset/master/database.csv).

On average, East Caucasian languages (as well as other indigenous languages of the Caucasus) have more consonants and vowels than other languages of the world. The main reasons for this are the following:

* the presence of **ejective consonants** (with the exception of Udi);
* the presence of **uvular consonants**;
* the presence of **lateral obstruents** in Andic and Tsezic languages;
* widespread **labialization**, gemination and fortis/lenis distinction across Dagestan;
* common triangle vowel systems (*i*, *e*, *a*, *o*, *u*) are complicated with **nasalization** (mostly in Andic and some Tsezic languages), **long vowels** (Nakh, Andic and Tsezic), **pharyngealisation** and **umlaut vowels** (in languages spoken in the south, possibly due to Azerbaijani influence).

## Inventory size

The map below shows the size of the phoneme inventories in languages of the eastern Caucasus:

```{r}
set.seed(42)
df %>% 
  distinct(lang, source, glottocode) %>% 
  group_by(lang) %>% 
  slice_sample(n = 1) %>% 
  mutate(filter_languages = TRUE) ->
  filter_for_maps
```

```{r}
df %>% 
  left_join(filter_for_maps) %>% 
  filter(filter_languages) %>% 
  count(lang, source, glottocode) ->
  for_map

map.feature(languages = for_map$lang,
            minichart = "pie",
            minichart.data = for_map$n,
            minichart.labels = TRUE,
            width = 3,
            tile = 'Stamen.TonerLite')
```

As we can see, the inventory size of the languages in our dataset range from `r min(range(for_map$n))` (Georgian and Kumyk) to `r max(range(for_map$n))` (Northern Akhvakh). We can compare the obtained numbers with the PHOIBLE database [@phoible], which contains phoneme inventories of the languages of the world[^exclude_EA]:

[^exclude_EA]: When performing such a comparison of our dataset and the one provided by PHOIBLE, the question arises whether we need to exclude East Caucasian and other languages of the Caucasus from the PHOIBLE subsample that we use. In this text we decided to exclude them for the sake of comparability. Not excluding them slightly changed the shape of the density plot shown below, but the change was extremely small. Keep in mind that neither of the samples are balanced, so that they are not ideal for comparison: different language families are overrepresented in both samples, and the sizes of the datasets are different (PHOIBLE's 2169 languages vs our dataset of 50 idioms) etc.

```{r download phoible, cache=TRUE}
ph <- phoible.feature(na.rm = FALSE)

set.seed(42)
ph %>% 
  distinct(language, inventoryid, glottocode) %>% 
  group_by(language) %>% 
  sample_n(1) %>% 
  anti_join(df, by = "glottocode") %>% 
  pull(inventoryid) ->
  ph_sample
```

```{r}
ph %>% 
  filter(inventoryid %in% ph_sample) %>% 
  count(language, inventoryid) %>% 
  ggplot(aes(n))+
  geom_density(data = for_map, aes(x = n))+
  geom_density(color = "red")+
  theme_minimal()+
  labs(x = "number of segments",
       subtitle = "Density functions of East Caucasian inventory sizes (in black)\nwith overlayed with the density function of inventory sizes of the languages of the world (in red)")
```

As demonstrated on the plot, East Caucasian languages in general have big segment inventories (with mean, median and mode near 60 segments) compared to languages of the rest of the world. A small peak around 40 can be explained by the presence of non-indigenous languages in our dataset. Further we will see that overall large inventories are mostly caused by large consonant inventories, although vowels and diphthongs also play a significant role.

### Consonant inventory size

We can compare the consonant inventories in the same way we compared the whole phoneme inventories earlier:

```{r}
df %>% 
  left_join(filter_for_maps) %>% 
  filter(filter_languages,
         segment_type == "consonant") %>% 
  count(lang, source) ->
  for_map

map.feature(languages = for_map$lang,
            minichart = "pie",
            minichart.data = for_map$n,
            minichart.labels = TRUE,
            width = 3,
            tile = 'Stamen.TonerLite')
```

As we can see, the inventory size differs from `r min(range(for_map$n))` (Azerbaijani) to `r max(range(for_map$n))` (Northern Akhvakh). We can compare the obtained numbers with the PHOIBLE database [@phoible]:

```{r}
set.seed(42)
ph %>% 
  filter(inventoryid %in% ph_sample,
         segmentclass == "consonant") %>% 
  count(language, inventoryid) %>% 
  ggplot(aes(n))+
  geom_density(data = for_map, aes(x = n))+
  geom_density(size = 1, color = "red")+
  theme_minimal()+
  labs(x = "number of consonant segments",
       subtitle = "Density of East Caucasian consonant inventory sizes (in black)\nwith overlayed density function of consonant inventory sizes of languages of the world (in red)")
```

As demonstrated on the plot, the majority of languages from PHOIBLE have less consonants than East Caucasian languages. This result is caused by different subsystems of East Caucasian languages like ejectives, labialized consonants, uvulars and post uvulars. More or less the same phonological profile can be found in other indigenous language families of the Caucasus --- among the West Caucasian languages, for example, we find Ubykh [@fenwick11: 16--17], which has one of the largest consonant inventories in the world.

### Vowel inventory size

We can do the same comparison for the vowel inventories:

```{r}
df %>% 
  left_join(filter_for_maps) %>% 
  filter(filter_languages,
         segment_type == "vowel" | segment_type == "diphthong") %>% 
  count(lang, source) ->
  for_map

map.feature(languages = for_map$lang,
            minichart = "pie",
            minichart.data = for_map$n,
            minichart.labels = TRUE,
            width = 3,
            tile = 'Stamen.TonerLite')
```

The vowel inventory size ranges from `r min(range(for_map$n))` (Avar) to `r max(range(for_map$n))` (Chechen). Again we can compare the obtained numbers with the PHOIBLE database [@phoible]:

```{r}
ph %>% 
  filter(segmentclass == "vowel",
         inventoryid %in% ph_sample) %>% 
  count(language, inventoryid) %>% 
  ggplot(aes(n))+
  geom_density(data = for_map, aes(x = n))+
  geom_density(size = 1, color = "red")+
  theme_minimal()+
  labs(x = "number of vowel or dipthong segments",
       subtitle = "Density of East Caucasian vowel and dipthong inventory sizes (in black)\nwith overlayed density function of vowel inventory sizes of languages of the world (in red)")
```

As demonstrated in the plot, the vowel/diphthong inventory sizes of East Caucasian languages are slightly bigger than the average of the languages of the world. As expected, the PHOIBLE data reveal an average vowel/diphthong inventory size of around 5, while the East Caucasian dataset shows a mean, median and mode around 10 vowels/diphthongs. Diphthongs are present only in Nakh languages and in Hinuq [@forker13]. However, it is worth mentioning that the distinction between diphthongs and combinations of vowels and semivowels like *j* and *w* is not clear in East Caucasian languages. There is a tendency to have closing diphthongs or combinations of a vowel and a semivowel at the end of the syllable (like *ai*/*aj*, *eu*/*ew*) and opening diphthongs or combinations of a vowel and a semivowel at the beginning of the syllable (like *ia*/*ja*, *ue*/*we*). As far as I am aware, there is no phonological difference between diphthongs and combinations of vowels and semivowels in any East Caucasian language. I can stipulate that scholars of Nakh languages tend to describe those units as diphthongs while scholars of Daghestanian languages tend to describe them as combinations of a vowel and a semivowel.

## References {-}