vignette.Rmd

---
title: "How to control for non-independence in cross-national analyses: An R tutorial"
author: "Scott Claessens"
date: "29/04/2022"
output: md_document
---

```{r echo=F}
options(width = 300)
oecd <- c("AT","AU","BE","CA","CH","CL","CZ","DE","DK",
          "EE","ES","FI","FR","GB","GR","HU","IE","IL",
          "IS","IT","JP","KR","LT","LU","LV","MX","NL",
          "NO","NZ","PL","PT","SE","SI","SK","TR","US")
```

I recently posted a pre-print on PsyArxiv entitled ["The non-independence of nations and why it matters"](https://psyarxiv.com/m6bsn/). In the paper, myself and Quentin Atkinson argue that many cross-national correlations in the social sciences are flawed because they do not adequately take into account various forms of non-independence between nations. Nations are linked to one another via spatial proximity and shared cultural ancestry, and failing to account for these dependencies can increase the chance of false positives.

That's the issue: but how to rectify it? In the paper, we suggest some Bayesian modelling techniques that researchers can use to overcome the problem. However, we did not go into these techniques in detail. So I thought it would be useful to provide a brief R tutorial on the methods we used in the paper, so that interested researchers can apply them to their own cross-national datasets.

In this blog post, I will walk through methods of dealing with (1) spatial non-independence, and (2) cultural phylogenetic non-independence. In each case, I will use simulated data, show that a naive cross-national correlation produces an inflated effect size, and then present an alternative method that controls for non-independence. The techniques I will use are not the only ways of dealing with non-independence, but they are the techniques I learned through folks like Richard McElreath and Paul-Christian Bürkner.

For this tutorial, we will need the following R packages: _tidyverse_ for plotting and data wrangling, _brms_ for fitting Bayesian models, _ggcorrplot_ for plotting proximity matrices, _ggrepel_ for neatly plotting country labels, _readxl_ for reading in Excel files, and _targets_ for extracting objects from our reproducible analysis pipeline.

```{r warning=FALSE, error=FALSE, message=FALSE}
library(tidyverse)
library(brms)
library(ggcorrplot)
library(ggrepel)
library(readxl)
library(targets)
```

The data and code for this vignette can be found on GitHub: https://github.com/ScottClaessens/crossNationalCorrelations

# Spatial non-independence

## Spatially non-independent simulated data

For this demonstration, we load a subset of OECD nations from one of the many simulated datasets from our paper. This dataset contains a standardised predictor variable (X) and a standardised outcome variable (Y), both of which are strongly spatially autocorrelated.

```{r}
# load dataset from file
spatialSimData <- read.csv("data/vignette/spatialSimData.csv")
```

We can plot the relationship between the nation-level predictor and outcome variables.

```{r warning=FALSE, message=FALSE}
ggplot(spatialSimData, aes(x, y, label = country)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "lm") +
  theme_classic()
```

The relationship between the simulated variables appears to be positive. But is this just due to spatial non-independence? From the plot, it looks like this might be the case. For example, neighbouring countries, such as Sweden (SE) and Finland (FI), have very similar values on both variables, while the distant countries Australia (AU) and New Zealand (NZ) are far off in the top right hand corner of the plot.

## Fit naive correlation

We can easily get the naive Pearson's correlation between the national-level variables.

```{r}
cor(spatialSimData$y, spatialSimData$x) %>% round(2)
```
We can also get this same correlation from fitting a Bayesian regression. This is where we use the R package _brms_. We set some regularising priors on the intercept, slope, and residual variance in this model.

```{r message=FALSE, warning=FALSE, error=FALSE}
# fit naive bayesian regression
m1.1 <- brm(y ~ x, data = spatialSimData, 
            cores = 4, seed = 2113,
            # regularising priors
            prior = c(prior(normal(0, 1), class = Intercept),
                      prior(normal(0, 1), class = b),
                      prior(exponential(2), class = sigma)))
```

```{r}
# get model summary
summary(m1.1)
```

Because both variables are standardised, the slope is comparable to the Pearson's correlation reported above. The model converged normally, and the 95% credible interval for the slope is above zero. In other words, this is a "significantly positive" relationship.

## Fit correlation with Gaussian Process

But is spatial non-independence ruining our inference? To find out, we fit another model including a Gaussian Process over latitudes and longitudes for nations. A Gaussian Process adds a random intercept for each nation, and these random intercepts are allowed to covary according to the distance between latitude-longitude coordinates. For a more detailed discussion of this method of dealing with spatial autocorrelation, see Section 14.5 in Richard McElreath's book Statistical Rethinking (second edition), or his 2022 lecture on the topic [here](https://www.youtube.com/watch?v=PIuqxOBJqLU).

We can include Gaussian Processes in _brms_ models using the `gp()` function. This function takes a variable --- or multiple variables, in our case --- and, under the hood, computes a normalised distance matrix between all the cases for that variable. It then estimates a covariance function specifying how these distances relate to the covariance between nations. If the data are spatially autocorrelated, the model will estimate strong spatial covariance between nations, which could "soak up" much of the relationship we saw in the previous section.

We fit this model by adding a `gp()` term to our formula.

```{r message=FALSE, warning=FALSE, error=FALSE}
# fit Bayesian regression with Gaussian Process
m1.2 <- brm(y ~ x + gp(latitude, longitude), 
            data = spatialSimData, cores = 4, seed = 2113,
            # regularising priors
            prior = c(prior(normal(0, 1), class = Intercept),
                      prior(normal(0, 1), class = b),
                      prior(exponential(2), class = sdgp),
                      prior(exponential(2), class = sigma)),
            # tune the mcmc sampler
            control = list(adapt_delta = 0.99))
```

```{r}
# get model summary
summary(m1.2)
```

As expected, the slope has reduced in size, and the 95% credible interval now includes zero. This suggests that the positive relationship we initially found in our naive regression was confounded by spatial non-independence.

# Cultural phylogenetic non-independence

## Culturally non-independent simulated data

It is also possible for nations to be culturally non-independent --- that is, they share common influences from shared cultural ancestors. For example, even though the United Kingdom and New Zealand are on opposite sides of the planet, they are culturally similar due to the effects of British colonialism. This source of non-independence can be just as important as spatial proximity, especially when we are interested in cultural norms and institutions.

As before, we load a subset of OECD nations from a simulated dataset containing a standardised predictor variable (X) and a standardised outcome variable (Y). Both of these variables are strongly culturally autocorrelated.

```{r}
# load dataset from file
culturalSimData <- read.csv("data/vignette/culturalSimData.csv")
```

Let's plot the relationship between the nation-level predictor and outcome variables.

```{r warning=FALSE, message=FALSE}
ggplot(culturalSimData, aes(x, y, label = country)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "lm") +
  theme_classic()
```

There's a positive relationship between the variables. But a close look at the plot suggests some cultural clustering. For example, the English-speaking nations Australia (AU), New Zealand (NZ), the United States (US), and the United Kingdom (GB) are all towards the left of the plot, whereas nations speaking more distantly related languages, such as Japan (JP) and Greece (GR), are further to the right.

## Fit naive correlation

As before, we can get the naive Pearson's correlation.

```{r}
cor(culturalSimData$y, culturalSimData$x) %>% round(2)
```

And we can estimate this correlation in a naive Bayesian regression model.

```{r message=FALSE, warning=FALSE, error=FALSE}
# fit naive bayesian regression
m2.1 <- brm(y ~ x, data = culturalSimData, 
            cores = 4, seed = 2113,
            # regularising priors
            prior = c(prior(normal(0, 1), class = Intercept),
                      prior(normal(0, 1), class = b),
                      prior(exponential(2), class = sigma)))
```

```{r}
# get model summary
summary(m2.1)
```

The 95% credible interval for the slope is above zero, suggesting that there is a "significantly positive" relationship.

## Fit correlation with covarying random effects

"Cultural similarity" does not have a fixed coordinate system like longitude and latitude, making Gaussian Processes tricky in _brms_. Instead, we can use another technique that allows nation random intercepts to covary, but we _specify the covariance matrix in advance rather than estimate it in the model_. This will allow us to control for cultural non-independence.

What covariance matrix should we specify in advance? The matrix that we use in the paper is the linguistic proximity between nations, weighted by the proportion of speakers of each language in each nation. "Linguistic proximity" here means how related two languages are on a phylogeny of languages. We just average this measure over all languages spoken in each nation, weighted by speaker percentages. This gives us a sense of how "culturally similar" different nations are.

It is simpler to just show the matrix. Below, we load the full linguistic distance matrix, subset it to just our OECD nations, and convert it to a linguistic proximity matrix.

```{r}
# load full linguistic distance matrix
linDist <- 
  read_xlsx("data/networks/2F Country Distance 1pml adj.xlsx") %>%
  select(-ISO) %>%
  as.matrix()

# rows and columns identical in symmetrical matrix
rownames(linDist) <- colnames(linDist)

# subset to only OECD countries
linDist <- linDist[oecd,oecd]

# nation's distance from itself is zero
diag(linDist) <- 0

# normalise between 0 and 1
linDist <- linDist / max(linDist)

# convert to linguistic proximity
linProx <- 1 - linDist
```

We can then plot this `linProx` object, to get a sense of the cultural relatedness of different nations.

```{r fig.height=8,fig.width=8}
ggcorrplot(linProx, hc.order = TRUE, type = "lower") +
  theme(legend.title = element_blank())
```

From this plot, we can see that majority English-speaking nations have high linguistic proximity (i.e. cultural similarity), as do Scandinavian nations, nations in Western Europe, and nations in Eastern Europe.

We can use this linguistic proximity matrix in our modelling. In _brms_, we can specify in advance how our random effects should be correlated by using the `cov` argument in the `gr()` function. The only restrictions are that the matrix that we feed in must have row and column names that match our variable, and that the matrix is positive definite. To learn more about this method, see [this vignette on phylogenetic regression](https://cran.r-project.org/web/packages/brms/vignettes/brms_phylogenetics.html) from Paul-Christian Bürkner.

```{r message=FALSE, warning=FALSE, error=FALSE}
# fit bayesian regression with covarying random effects
m2.2 <- brm(y ~ x + (1 | gr(country, cov = linProx)), 
            data = culturalSimData, cores = 4, seed = 2113,
            # pre-specified covariance matrix must be passed to brms via data2 argument
            data2 = list(linProx = linProx),
            # regularising priors
            prior = c(prior(normal(0, 1), class = Intercept),
                      prior(normal(0, 1), class = b),
                      prior(exponential(2), class = sd),
                      prior(exponential(2), class = sigma)),
            # tune mcmc sampler
            control = list(adapt_delta = 0.999))
```

```{r}
# get model summary
summary(m2.2)
```

Note that there were a number of divergent transitions, even after we increased `adapt_delta`, and the number of effective samples are not great. But the Rhat values are all beneath 1.05, suggesting that this model is reasonable.

As expected, we find that the positive relationship between our variables goes away when we control for cultural phylogenetic non-independence between nations.

# Applying these methods in your research

There are a number of advantages to using Gaussian Processes and covarying random effects to deal with various forms of non-independence, beyond what I've outlined here. Both methods are possible to use when we have multiple data points per nation. Both methods can also be used for any conceivable likelihood --- logistic regression for binary variables, cumulative link regression for ordinal variables, beta regression for variables bounded between 0 and 1, etc. All of this is possible using the _brms_ package.

Happy cross-national modelling!