Welcome to the GeLaTo dataset

GEnes and LAnguages TOgether: a resource for multidisciplinary studies on human genetic and linguistic variation

The GeLaTo dataset is a worldwide diversity panel of available population genetic samples matched with databases of linguistic diversity. Each genetic population is associated with the main language spoken by its people. The genetic data were chosen according to a few essential guidelines: maximum compatibility and standardization, modern high-quality data, avoidance of ascertainment bias, availability for different regions of the world, and high resolution to capture recent events. The dataset provides computed summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions. The resource is designed to explore connections between human linguistic diversity and the history and diversity of human groups. The scientific information in GeLaTo should be used with respect for people's cultures and traditions.

Linguistic and Anthropological curation:

Each genetic population considered is matched with a unique GlottoCode identifier, which corresponds to the main language spoken by the population. This information is recovered from the original genetic publication and is inferred from direct sampling observation, cultural/linguistic self-identification, or geographical characterization, with the assistance of linguists and anthropologists (for a list of people who contributed expertise, see Credits). Languages introduced during the colonial era (widely diffused trans-national languages) are not considered, to exclude the wave of historical language shift documented over the past ~2 centuries.

The GlottoCode link returns the linguistic classification of the genetic population samples.
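
As a rough illustration of how a GlottoCode can be checked and turned into a link, the sketch below validates the identifier format (four lowercase alphanumeric characters followed by four digits) and builds the corresponding Glottolog languoid URL. The function names and the population labels are hypothetical and are not part of GeLaTo; the example only assumes the public Glottolog URL scheme.

```python
import re

# Glottocodes consist of four lowercase alphanumeric characters plus four digits.
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")

# URL prefix used by Glottolog for languoid pages.
GLOTTOLOG_LANGUOID_URL = "https://glottolog.org/resource/languoid/id/"


def is_valid_glottocode(code: str) -> bool:
    """Return True if `code` has the shape of a glottocode (format check only)."""
    return GLOTTOCODE_PATTERN.fullmatch(code) is not None


def glottolog_url(code: str) -> str:
    """Build the Glottolog languoid URL for a well-formed glottocode."""
    if not is_valid_glottocode(code):
        raise ValueError(f"not a well-formed glottocode: {code!r}")
    return GLOTTOLOG_LANGUOID_URL + code


# Hypothetical population labels mapped to glottocodes (illustration only).
example_assignments = {
    "ExamplePopulationA": "basq1248",  # Basque
    "ExamplePopulationB": "Basque",    # a language name, not a glottocode
}

# Keep only the entries whose identifier passes the format check.
example_links = {
    population: glottolog_url(code)
    for population, code in example_assignments.items()
    if is_valid_glottocode(code)
}
```

The format check only confirms that an identifier is well formed; whether it refers to an existing languoid still has to be verified against Glottolog itself.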

The geographic location of each population is based on information about the genetic samples, not on linguistic information. Migrants are located in their place of origin before the migration when this information is available; details for migrant populations are given in the curation notes.

Multilingualism is a common feature of human populations. In cases of multilingualism, we consider only one language as the main ("non-colonial") language present in the population. In some cases, suggestions for alternative language assignments are given in the curation notes.

Genetic statistics provided:

Methods from population genetics are used to calculate values of diversity within and between populations. The genetic summary statistics provided correspond to measures of relatedness and are shaped by the interaction of the ancestors of the individuals who contributed their genetic profiles. The summary statistics associated with each population are suitable for investigating population history and are not intended for any medical or commercial purposes.

Further information on the population genetics methods employed is available in variables.

GeLaTo goals:

  • Allowing geneticists to properly characterize the human history behind the molecular data, and providing an accessible reference dataset for regional or worldwide comparisons.
  • Allowing linguists, historians and cultural anthropologists to integrate information on genealogical relatedness and demography, which can be robustly inferred from the genetic data.
  • Allowing scholars of various disciplines to approach questions of major relevance on human diversity from a truly multidisciplinary perspective, and to develop a more realistic understanding of the complex mechanisms behind human migration, contact and cultural transmission.

How to cite

If you use this data, please cite

Barbieri et al. 2022. A global analysis of matches and mismatches between human genetic and linguistic histories. PNAS. DOI: 10.1073/pnas.2122084119

as well as the released version of the dataset.