
Substantial heritability with simulated random outcome #14

Open
aneumann-science opened this issue May 29, 2017 · 2 comments

aneumann-science commented May 29, 2017

Dear all,

I wanted to use BGLR to estimate the genome-wide variance of a trait explained by SNPs and DNA methylation. A colleague and I noticed that BGLR estimates substantial SNP heritability or DNA methylation variance components, in the range of 10-25%, even when the outcome is randomly simulated. Below is R code, using the example data and scripts from the paper, that illustrates the problem. My expectation is that the variance component should be 0%, yet the example estimates a heritability of 24%. I wonder whether this is expected behavior. The only explanation I can think of is that the prior distribution has a large influence on the posterior mean. If that is the case, however, I wonder how to interpret outcomes with true variance components of 20-30%, since these would seem indistinguishable from random associations.

Best,

Alex

### Boxes 9 and S3 #####################################################################

### Box 9 ################################################################################
#1# Loading and preparing the input data 
library(BGLR); data(wheat);  set.seed(123);
X<-wheat.X; A<-wheat.A/mean(diag(wheat.A));  
y<-rnorm(599)

#2# Computing the genomic relationship matrix 
X<-scale(X,center=TRUE,scale=TRUE) 
G<-tcrossprod(X)/ncol(X) 

#3# Computing the eigen-value decomposition of G 
EVD<-eigen(G) 

#4# Setting the linear predictor 
ETA<-list(PED=list(K=A, model="RKHS"), 
          MRK=list(V=EVD$vectors,d=EVD$values, model="RKHS") 
) 

#5# Fitting the model 
fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000,saveAt="PGBLUP_") 
save(fm,file="fmPG_BLUP.rda") 

# Residual variance 
varE<-scan("PGBLUP_varE.dat") 

#varA and varU 
varA<-scan("PGBLUP_ETA_PED_varU.dat") 
varU<-scan("PGBLUP_ETA_MRK_varU.dat") 


varG<-varU+varA 
h2<-varG/(varE+varG) 
mean(h2);sd(h2) 

mean(varU/varG) 
mean(varA/varG)

gdlc (Owner) commented May 31, 2017

Hello Alex, and thank you very much for your comment.

Neither ML, REML, nor Bayesian estimators are necessarily unbiased. These estimators are expected to be asymptotically unbiased, but small-sample bias can occur. The bias depends on many factors, including the n/p ratio, the signal-to-noise ratio in the data (i.e., the true heritability), and the extent of relationships (for distantly related individuals G is not very different from I, which makes it more difficult to separate genetic from non-genetic signals). In my experience, small sample size, a low signal-to-noise ratio, and distantly related individuals give larger small-sample bias.

In small samples Bayesian estimates can be highly influenced by the prior. When we use proper priors (as done in BGLR) the prior has no mass at values VAR <= 0, so negative estimates cannot occur. Over conceptual repeated sampling you therefore get some corner solutions (estimates at essentially zero) and some positive estimates, but never negative ones. Averaged over conceptual repeated sampling this gives a positive estimate, and thus bias under the null (VAR = 0).

BGLR uses scaled-inverse chi-square priors (in the parameterization used, E[var] = S/(df-2) and the mode is S/(df+2)). You can reduce the influence of the prior by using a very small scale and very small degrees of freedom.
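
(As a small illustration, not part of the original reply: the prior mean and mode implied by a given S and df can be checked directly; priorSummary below is a hypothetical helper that simply evaluates the two formulas above.)

# Hypothetical helper: mean and mode of a scaled-inverse chi-square prior
# with scale S and degrees of freedom df (the mean is defined only for df > 2).
priorSummary <- function(S, df) c(mean = S/(df - 2), mode = S/(df + 2))
priorSummary(S = 1, df = 5)        # fairly informative prior
priorSummary(S = 1e-5, df = 4.1)   # weak prior: mean and mode are both near zero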

The example below shows how to do that. For a model with one random effect I get very low bias (average h2 estimate ~0.005 for Bayes and ~0.007 for ML). The bias with two random effects was higher (average h2 estimate of ~0.015) but still low.

If you look at the trace plot of varU (plot(varU)), you will see that the variance is very close to zero in most iterations, then jumps to values around 0.1 and comes back toward zero. It never touches zero because the prior has zero mass at var = 0. Given this poor mixing, longer chains than the ones used in the example may be needed. You can try this with your data using very small df and S; if your trace plot shows long segments of the chain close to zero, that may indicate a very low (or null) signal.

In short, try reducing S0 and df0 and look at the trace plots.
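
For instance (a minimal sketch, assuming the saveAt prefix "PGBLUP_" from the code above and BGLR's default thinning of 5), the sampled variance components can be read back and their traces inspected like this:

# Read the sampled variance components written by BGLR and plot their traces.
varU <- scan("PGBLUP_ETA_MRK_varU.dat")
varE <- scan("PGBLUP_varE.dat")
keep <- -(1:(2000/5))                           # drop the burn-in samples (burnIn/thin)
plot(varU[keep], type = "l", ylab = "varU")     # long stretches near zero suggest little or no signal
plot(varU[keep]/(varU[keep] + varE[keep]), type = "l", ylab = "h2")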

Another thing you can do is add noise with known variance to your (real) data and see whether the estimated h2 and variance components change. If varG increases when you add a bit of noise, that indicates part of the noise variance is being wrongly captured as 'signal'.
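
A rough sketch of that check (here y and ETA stand for your real outcome and linear predictor; the amount of added noise is arbitrary):

# Add noise of known variance and refit; ideally only varE should increase
# (by roughly addVar), while the genetic variance stays the same.
set.seed(1)
addVar <- 0.5*var(y)
yNoisy <- y + rnorm(length(y), sd = sqrt(addVar))
fm0 <- BGLR(y = y,      ETA = ETA, nIter = 12000, burnIn = 2000, saveAt = "orig_",  verbose = FALSE)
fm1 <- BGLR(y = yNoisy, ETA = ETA, nIter = 12000, burnIn = 2000, saveAt = "noisy_", verbose = FALSE)
c(fm0$ETA[[1]]$varU, fm1$ETA[[1]]$varU)   # genetic variance: should be similar
c(fm0$varE, fm1$varE)                     # residual variance: should differ by ~addVar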

Hope this helps

Gustavo

Code

## scripts for ML

# -2 log-likelihood (up to an additive constant) of y ~ N(mu, varU*K + varE*I),
# computed via the eigendecomposition K = V diag(d) V'; logVar = c(log(varE), log(varU))
neg2LogLik<-function(logVar,V,y,d,n=length(y)){
  y<-y-mean(y)
  Vy<-crossprod(V,y)
  Vy2<-as.vector(Vy)^2
  varE<-exp(logVar[1])
  varU<-exp(logVar[2])
  lambda<-varU/varE
  dStar<-(d*lambda+1)
  sumLogD<-sum(log(dStar))
  neg2LogLik_1<- ( n*log(varE) + sumLogD )
  neg2LogLik_2<- (sum(Vy2/dStar))/varE
  out<- neg2LogLik_1+neg2LogLik_2
  return(out)
}

fitML=function(y,EVD=NULL,K=NULL){
  if(is.null(EVD)){
    if(is.null(K)){
      stop('provide either K or its eigenvalue decomposition')
    }else{
      EVD=eigen(K)
    }
  }
  vP=var(y)
  logVar=log(vP*c(.5,.5))       # starting values on the log scale: half of var(y) per component
  tmp<-EVD$values>1e-5          # drop near-null eigenvalues
  d=EVD$values[tmp]
  V=EVD$vectors[,tmp]

  fm=optim(fn=neg2LogLik,V=V,d=d,n=length(y),par=logVar,y=y)
  out<-exp(fm$par)
  out=c(out,out[2]/sum(out),-fm$value/2)   # log-likelihood = -neg2LogLik/2 (up to a constant)
  names(out)<-c('varE','varU','h2','log-lik')
  return(out)
}

### Simulation ###########################################################################
h2_bayes=rep(NA,100)
h2_bayes_PG=rep(NA,100)
h2_ml=rep(NA,100)

#1# Loading and preparing the input data 
library(BGLR); data(wheat);  
X<-wheat.X; A<-wheat.A/mean(diag(wheat.A)); 

#2# Computing the genomic relationship matrix 
X<-scale(X,center=TRUE,scale=TRUE) 
G<-tcrossprod(X)
G<-G/mean(diag(G))

EVD.G=eigen(G)
EVD.A=eigen(A)

for(i in 1:100){ # 100 MC replicates
	
  y<-rnorm(599)

  # BGLR, 1 variance component
  ETA<-list(list(V=EVD.G$vectors,d=EVD.G$values, model="RKHS",df0=0.0001,S0=.00001)) 
  fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000,saveAt="PGBLUP_",verbose=F,df0=0.0001,S0=.00001) 
  varG=scan('PGBLUP_ETA_1_varU.dat')
  varE=scan('PGBLUP_varE.dat')
  tmp<-varG/(varE+varG) 
  h2_bayes[i]=mean(tmp)
  
  # ML 1 variance component
  h2_ml[i]=fitML(y,EVD=EVD.G,K=NULL)["h2"]

  # Bayes 2 components
  ETA<-list(list(V=EVD.G$vectors,d=EVD.G$values, model="RKHS",df0=0.0001,S0=.00001),
            list(V=EVD.A$vectors,d=EVD.A$values, model="RKHS",df0=0.0001,S0=.00001)
  ) 
  fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000,saveAt="PGBLUP_",verbose=F,df0=0.0001,S0=.00001) 
  varG=scan('PGBLUP_ETA_1_varU.dat')
  varA=scan('PGBLUP_ETA_2_varU.dat')
  varE=scan('PGBLUP_varE.dat')
  tmp<-(varG+varA)/(varE+varG+varA) 
  h2_bayes_PG[i]=mean(tmp)
  print(c(mean(h2_bayes,na.rm=TRUE),mean(h2_ml,na.rm=TRUE),mean(h2_bayes_PG,na.rm=TRUE)))
  print(i)
}

aneumann-science (Author) commented

Dear Gustavo,

Thank you for the detailed explanation and helpful suggestions. I will explore the issues and changes you mentioned with my data.

Best,

Alex
