Skip to content

Latest commit

 

History

History
954 lines (948 loc) · 29.3 KB

ego.org

File metadata and controls

954 lines (948 loc) · 29.3 KB
 \section*{Introduction}
 The program \ty{ego} tests for enrichment of GO terms in a sample of
 genes. It does so by reading a list of observed gene IDs
 (Table~\ref{tab:eg}) and multiple ID lists that together represent the
 null distribution of random ID picks. These random lists can be
 generated by the program \ty{shuffle}. \ty{ego} then calculates the
 frequency with which each GO category occupied by at least, say, ten
 genes has at least an equal occupancy in the expected lists. This
 frequency is the Monte-Carlo $P$-value for enrichment of that
 category, $\ppm$. In other words, $\ppm$ is, the error
 probability when rejecting the null hypothesis that the observed
 occupancy is due to chance. In addition to $\ppm$, \ty{ego} also
 calculates a parametric $P$-value, $\ppp$, based on the assumption that
 the null distribution of GO counts is normal. Mapping the gene IDs
 onto GO IDs is done using the file \ty{gene2go} available from
 \begin{verbatim}
 ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz
 \end{verbatim}

 \begin{table}
   \caption{A list of observed gene IDs (\textbf{A}) is transformed
     by \ty{ego} into a list of GO-IDs and their enrichment
     (\textbf{B}).}\label{tab:eg}
   \begin{center}
     \begin{tabular}{cc}
	\textbf{A} & \textbf{B}\\
	\input{../ego/egoTab1} & \input{../ego/egoTab2}
     \end{tabular}
   \end{center}
 \end{table}

 As an auxiliary functionality, \ty{ego} can also classify the input
 IDs into their GO categories, map the IDs onto their gene symbols, and
 print what I call a GO/sym table, for example
 Table~\ref{tab:eg2}. This requires the genome's annotations in the GFF
 format distributed by the NCBI.

 \begin{table}
   \caption{Example GO/sym table.}\label{tab:eg2}
   \begin{center}
     \resizebox{\textwidth}{!}{
	\begin{tabular}{cclll}
	\hline
	GO & Count & Category & Description & Sym\\\hline
	\input{../ego/egoTab3}\\
	\hline
	\end{tabular}
     }
   \end{center}
 \end{table}

 \section*{Implementation}
 The outline of \ty{ego} contains hooks for imports, types, methods,
 functions, and the logic of the main function.
package main

import (
	  //<<Imports, Ch.~\ref{ch:eg}>>
)
//<<Types, Ch.~\ref{ch:eg}>>
//<<Methods, Ch.~\ref{ch:eg}>>
//<<Functions, Ch.~\ref{ch:eg}>>
func main() {
	  //<<Main function, Ch.~\ref{ch:eg}>>
}
In the main function, we set the program name and its usage, declare
the options, parse the options, and parse the input files.
util.Name("ego")
//<<Set usage, Ch.~\ref{ch:eg}>>
//<<Declare options, Ch.~\ref{ch:eg}>>
//<<Parse options, Ch.~\ref{ch:eg}>>
//<<Parse input files, Ch.~\ref{ch:eg}>>
"github.com/evolbioinf/gin/util"
The usage consists of the actual usage message, an explanation of the
purpose of \ty{ego}, and an example command.
u := "ego [-h] -o gene2go [option]... " +
	  "obsID.txt [expID.txt]..."
p := "Calculate enrichment of GO terms for " +
	  "observed list of gene IDs, given a "+
	  "large number of expected ID lists."
e := "ego -o gene2go obsID.txt expID.txt"
clio.Usage(u, p, e)
We import \ty{clio}.
"github.com/evolbioinf/clio"
Apart from the version, \ty{-v}, we declare five options, the file
containing the GO terms, \ty{-o}, the GFF file for printing the GO/sym
table, \ty{-g}, the minimum occupancy of a GO term to be included
in the final result, \ty{-m}, and whether to print the raw counts
instead of the default enrichment analysis.
var optV = flag.Bool("v", false, "version")
var optO = flag.String("o", "", "file of mapping " +
	  "gene IDs to GO terms")
var optG = flag.String("g", "",
	  "GFF file to print GO/sym table")
var optM = flag.Int("m", 10, "minimum occupancy of GO term")
var optR = flag.Bool("r", false, "print raw gene counts")
We import \ty{flag}.
"flag"
We parse the options and respond to \ty{-v} as this stops the
program. We also check there is a file mapping gene IDs onto GO terms
and bail with a friendly message if no such GO file exists. Then we
read the GO file. We also declare a map of gene ID to gene symbol and
fill it if there is a GFF file.
flag.Parse()
if *optV { util.Version() }
if *optO == "" {
	  m := "please enter a file mapping gene IDs " +
		  "to GO accessions using -o"
	  log.Fatal(m)
}
//<<Read GO file, Ch.~\ref{ch:eg}>>
var id2sym map[string]string
if *optG != "" {
	  //<<Fill \ty{id2sym}, Ch.~\ref{ch:eg}>>
}
We import \ty{log}.
"log"
To read the GO file, we open and scan it.
f := util.Open(*optO)
defer f.Close()
sc := bufio.NewScanner(f)
//<<Scan GO file, Ch.~\ref{ch:eg}>>
"bufio"
During the scan, we extract the required GO attributes from the GO
file. Then we store the GO ID as a function of the gene ID. Since a
gene ID typically maps to many GO IDs, we store the relationship
between a gene ID and a GO ID as a map between one gene ID and a map
of GO IDs. We also store the GO description and the GO category as
functions of the GO ID.
id2go := make(map[string]map[string]bool)
go2descr := make(map[string]string)
go2cat := make(map[string]string)
for sc.Scan() {
	  if sc.Text()[0] == '#' { continue }
	  //<<Extract GO attributes, Ch.~\ref{ch:eg}>>
	  //<<Store GO ID, Ch.~\ref{ch:eg}>>
	  //<<Store GO description, Ch.~\ref{ch:eg}>>
	  //<<Store GO category, Ch.~\ref{ch:eg}>>
}
We check the GO file has eight columns and bail otherwise.  The gene
ID is in column 2, the GO ID in column 3, the description of the GO
term in column 6, and its category in column 8.
fields := strings.Split(sc.Text(), "\t")
if len(fields) != 8 {
	  m := "gene2go file has %d columns instead of " +
		  "the expected 8; are you using " +
		  "the correct file?"
	  log.Fatalf(m, len(fields))
}
geneId := fields[1]
goId := fields[2]
goDescr := fields[5]
goCat := fields[7]
"strings"
We check whether a given gene ID already has an associated map of GO
IDs. If not, we make it. Then we store the GO ID.
if id2go[geneId] == nil {
	  id2go[geneId] = make(map[string]bool)
}
id2go[geneId][goId] = true
GO descriptions may end with a GO ID in square brackets, which is
redundant, so we cut it off with the preceding blank before storing
the description. We also make sure the remaining description doesn't
end in a blank.
li := strings.LastIndex(goDescr, "[")
if li > -1 && strings.Contains(goDescr, "GO:") {
	  if li > 0 {
		  goDescr = goDescr[:li-1]
	  } else {
		  goDescr = goDescr[:li]
	  }
}
go2descr[goId] = goDescr
We store the GO category.
go2cat[goId] = goCat
id2sym = make(map[string]string)
f := util.Open(*optG)
sc := bufio.NewScanner(f)
for sc.Scan() {
	  t := sc.Text()
	  if t[0] == '#' { continue }
	  fields := strings.Split(t, "\t")
	  //<<Check number of fields, Ch.~\ref{ch:eg}>>
	  if fields[2] == "gene" {
		  //<<Store gene symbol, Ch.~\ref{ch:eg}>>
	  }
}
if len(fields) != 9 {
	  m := "expecting 9 fields in GFF file, " +
		  "but you have %d"
	  log.Fatalf(m, len(fields))
}
sym := ""
gid := ""
attributes := strings.Split(fields[8], ";")
for _, attribute := range attributes {
	  kv := strings.Split(attribute, "=")
	  //<<Search for gene ID, Ch.~\ref{ch:eg}>>
	  //<<Search for gene symbol, Ch.~\ref{ch:eg}>>
}
//<<Check gene ID, Ch.~\ref{ch:eg}>>
//<<Check gene symbol, Ch.~\ref{ch:eg}>>
id2sym[gid] = sym
if kv[0] == "Dbxref" {
	  ids := strings.Split(kv[1], ",")
	  for _, id := range ids {
		  arr := strings.Split(id, ":")
		  if arr[0] == "GeneID" {
			  gid = arr[1]
		  }
	  }
}
if kv[0] == "Name" {
	  sym = kv[1]
}
if gid == "" {
	  fmt.Fprintf(os.Stderr,
		  "couldn't find gene ID in %q\n",
		  attributes)
}
"fmt"
"os"
if sym == "" {
	  fmt.Fprintf(os.Stderr, "couldn't find name in %q\n",
		  attributes)
} else if gid == "" {
	  log.Fatalf("found name but no gene ID in %q\n",
		  attributes)
}
The remaining tokens on the command line are taken as the names of
input files. The first of these contains the observed IDs. If the IDs
file doesn't exist, we bail. Otherwise, we open it and scan the
observed IDs. If desired, we print these observed IDs in a table next
to their GO terms and exit. But, remember, what we are really
interested in, is the GO occupancy. So once we've read the gene IDs,
we convert them to GO counts.
files := flag.Args()
if len(files) < 1 {
	  log.Fatal("please enter a file with observed IDs")
}
//<<Open observed IDs file, Ch.~\ref{ch:eg}>>
//<<Scan observed IDs file, Ch.~\ref{ch:eg}>>
if *optG != "" {
	  //<<Print GO/sym table, Ch.~\ref{ch:eg}>>
	  os.Exit(0)
}
//<<Convert observed IDs to GO counts, Ch.~\ref{ch:eg}>>
We open the observed IDs file and defer closing it again, as
usual.
f = util.Open(files[0])
defer f.Close()
We scan the observed IDs, which we assume to be presented in a single
column. If there is more than one column, we bail with a message. To
ensure that each gene is counted only once, we store the gene IDs in a
map.
sc = bufio.NewScanner(f)
ids := make(map[string]bool)
for sc.Scan() {
	  fields := strings.Fields(sc.Text())
	  if len(fields) > 1 {
		  m := "gene IDs should be " +
			  "in a single column"
		  log.Fatal(m)
	  }
	  ids[fields[0]] = true
}
We construct the GO/sym table, sort it, and print it.
//<<Construct GO/sym table, Ch.~\ref{ch:eg}>>
//<<Sort GO/sym table, Ch.~\ref{ch:eg}>>
//<<Print sorted Go/sym table, Ch.~\ref{ch:eg}>>
For each gene ID, we look up the symbol. Then we iterate over the GO
terms associated with the gene ID.
g2s := make(map[string][]string)
for id, _ := range ids {
	  sym := id2sym[id]
	  gterms := id2go[id]
	  //<<Iterate over GO terms for ID, Ch.~\ref{ch:eg}>>
}
We add the gene symbol to the symbol slice of each GO term. For GO
terms that don't have a symbol slice yet, we create one.
for gterm, _ := range gterms {
	  sl := g2s[gterm]
	  if sl == nil {
		  sl = make([]string, 0)
	  }
	  sl = append(sl, sym)
	  g2s[gterm] = sl
}
To sort the GO/sym table by GO term, we store the GO terms and their
associated symbols in a slice of type \ty{gosym}. This can be sorted
by casting to type \ty{gosymSlice}. While constructing the \ty{gosym}
slice, we also sort the symbols slices alphabetically.
var gosyms []*gosym
for g, symbols := range g2s {
	  gs := new(gosym)
	  gs.g = g
	  sort.Strings(symbols)
	  gs.s = symbols
	  gosyms = append(gosyms, gs)
}
sort.Sort(gosymSlice(gosyms))
We import \ty{sort}.
"sort"
We declare the type \ty{gosym} as a struct with a string to hold the
GO term and a slice of strings for the symbols.
type gosym struct {
	  g string
	  s []string
}
We also declare the type \ty{gosymSlice} as a slice of pointers go
\ty{gosym}.
type gosymSlice []*gosym
To make \ty{gosymSlice} sortable, we implement the three methods of
the \ty{Sort} interface, \ty{Len}, \ty{Less}, and \ty{Swap}. \ty{Len}
and \ty{Swap} are standard.
func (g gosymSlice) Len() int {
	  return len(g)
}
func (g gosymSlice) Swap(i, j int) {
	  g[i], g[j] = g[j], g[i]
}
The actual sorting is done according to the number of symbols
associated with the GO term. In case of equal numbers, the GO term is
sorted alphabetically.
func (g gosymSlice) Less(i, j int) bool {
	  l1 := len(g[i].s)
	  l2 := len(g[j].s)
	  if l1 != l2 {
		  return l1 < l2
	  }
	  return g[i].g < g[j].g
}
We print the sorted GO/sym table in order of descending symbol number
using a tab writer. First we print the gene ID, the number of symbols,
the GO category, and the GO description. Then we print the symbols and
close the line with a newline. Th the end, we flush the tab writer.
w := tabwriter.NewWriter(os.Stdout, 1, 0, 2, ' ', 0)
for i := len(gosyms)-1; i >= 0; i-- {
	  gosym := gosyms[i]
	  fmt.Fprintf(w, "%s\t%d\t%s\t%s\t", gosym.g,
		  len(gosym.s), go2cat[gosym.g],
		  go2descr[gosym.g])
	  //<<Print symbols, Ch.~\ref{ch:eg}>>
	  fmt.Fprintf(w, "\n")
}
w.Flush()
"text/tabwriter"
for i, symbol := range gosym.s {
	  if i > 0 {
		  fmt.Fprintf(w, " ")
	  }
	  fmt.Fprintf(w, "%s", symbol)
}
The conversion between ID lists and GO occupancy is carried out
here and again later when we read the expected lists. So we delegate
it to a call to the function \ty{gid2go}.
obsGOcounts := gid2go(ids, id2go)
Inside \ty{gid2go}, we allocate a map for the GO counts and iterate
over the gene IDs. For each ID we look up its associated GO IDs, which
we count.
func gid2go(ids map[string]bool,
	  id2go map[string]map[string]bool) map[string]int {
	  og := make(map[string]int)
	  for id, _ := range ids {
		  //<<Count GO IDs for gene, Ch.~\ref{ch:eg}>>
	  }
	  return og
}
We skip the gene IDs for which we cannot look up GO IDs.
gm := id2go[id]
if gm == nil { continue }
for g, _ := range gm {
	  og[g]++
}
The remaining files contain expected sets of symbols. They are scanned
in turn using the function \ty{scan}, which takes as arguments the
observed GO counts, the map from ID to GO, the map from GO to its
description, the map from GO to its category, and the minimum
occupancy of a GO term.
files = files[1:]
clio.ParseFiles(files, scan, obsGOcounts,
	  id2go, go2descr, go2cat, *optM, *optR)
Inside \ty{scan}, we retrieve the arguments, prepare the enrichment
analysis, scan the file, and analyze the IDs. Once we have scanned all
IDs, the last set of IDs still remains to be analyzed before we report
the enrichment analysis, perhaps followed by the raw gene counts.
func scan(r io.Reader, args ...interface{}) {
	  //<<Retrieve arguments, Ch.~\ref{ch:eg}>>
	  sc := bufio.NewScanner(r)
	  //<<Prepare enrichment analysis, Ch.~\ref{ch:eg}>>
	  for sc.Scan() {
		  //<<Analyze IDs, Ch.~\ref{ch:eg}>>
	  }
	  //<<Analyze last set of IDs, Ch.~\ref{ch:eg}>>
	  //<<Report enrichment analysis, Ch.~\ref{ch:eg}>>
	  if raw {
		  //<<Print raw counts, Ch.~\ref{ch:eg}>>
	  }
}
We import \ty{io}.
"io"
We retrieve the six arguments passed to \ty{scan}.
obsGOcounts := args[0].(map[string]int)
id2go := args[1].(map[string]map[string]bool)
go2descr := args[2].(map[string]string)
go2cat := args[3].(map[string]string)
minOcc := args[4].(int)
raw := args[5].(bool)
The enrichment analysis depends on the number of samples we analyze
and the IDs they contain, so we declare variables for these two
things. In addition, there are two statistics we produce, the expected
occupancy of the observed GO terms, and the frequency with which a
random occupancy is at least as great as that observed
(Table~\ref{tab:eg}B). So we allocate a map to sum up the expected GO
counts, a map to count excessive GO counts, and a map to store slices
of the expected GO counts.
n := 0
ids := make(map[string]bool)
expGOcounts := make(map[string]int)
eCounts := make(map[string]int)
eVals := make(map[string][]int)
Whenever we encounter a blank line, we've traversed a sample. At that
point we increment our sample counter, convert the IDs to GO counts,
analyze them, and reset the map of IDs.
fields := strings.Fields(sc.Text())
if len(fields) > 1 {
	  log.Fatalf("can't parse %q", sc.Text())
}
if sc.Text() == "" {
	  n++
	  eg := gid2go(ids, id2go)
	  //<<Analyze GO counts, Ch.~\ref{ch:eg}>>
	  //<<Reset IDs, Ch.~\ref{ch:eg}>>
} else {
	  ids[fields[0]] = true
}
We iterate over the GO terms found in the observed sample. Their
occupancy is summed and stored. Moreover, terms where the expected
occupancy is at least as great as that observed are counted.
for k, o := range obsGOcounts {
	  e := eg[k]
	  expGOcounts[k] += e
	  eVals[k] = append(eVals[k], e)
	  if e >= o {
		  eCounts[k]++
	  }
}
for id, _ := range ids {
	  delete(ids, id)
}
We analyze the last set of IDs.
n++
eg := gid2go(ids, id2go)
//<<Analyze GO counts, Ch.~\ref{ch:eg}>>
We can now report the results of our enrichment analysis. We'd like to
sort the output by $P$-value. So we declare a type that holds one GO
term, \ty{GOterm}. It is a structure containing the accession,
description, and category. It also contains the observed occupancy,
the expected occupancy, $\ppm$, and $\ppp$.
type GOterm struct {
	  a, d, c string
	  o, e, pm, pp float64
}
We collect the results in a slice of GO terms. For each GO term we
calculate $\ppm$ and $\ppp$, before we fill in the term.
GOterms := make([]*GOterm, 0)
for a, o := range obsGOcounts {
	  gt := new(GOterm)
	  //<<Calculate $\ppm$, Ch.~\ref{ch:eg}>>
	  //<<Calculate $\ppp$, Ch.~\ref{ch:eg}>>
	  //<<Fill in GO term, Ch.~\ref{ch:eg}>>
	  GOterms = append(GOterms, gt)
}
We declare a slice of GO terms.
type GOslice []*GOterm
To make the slice sortable, we implement the methods of the \ty{Sort}
interface, \ty{Len}, \ty{Less}, and \ty{Swap}. We do \ty{Len} and
\ty{Swap} first, as they are standard.
func (g GOslice) Len() int { return len(g) }
func (g GOslice) Swap(i, j int) {
	  g[i], g[j] = g[j], g[i]
}
The actual sorting is done according to $\pm$. If $\pm$-values are
identical, we sort by descending enrichment ratio. If all else fails,
we sort the GO accessions alphabetically.
func (g GOslice) Less(i, j int) bool {
	  if g[i].pm != g[j].pm {
		  return g[i].pm < g[j].pm
	  }
	  er1 := g[i].o / g[i].e
	  er2 := g[j].o / g[j].e
	  if er1 != er2 {
		  return er1 > er2
	  }
	  return g[i].a < g[j].a
}
pm := float64(eCounts[a]) / float64(n)
if pm == 0 {
	  pm = -1
}
var mean, sd float64
//<<Calculate mean of GO counts, Ch.~\ref{ch:eg}>>
//<<Calculate standard deviation of GO counts, Ch.~\ref{ch:eg}>>
x := (float64(o) - mean) / (sd * math.Sqrt(2))
pp := 1.0 - 0.5 * (1.0 + math.Erf(x))
"math"
data := eVals[a]
nn := len(data)
for i := 0; i < nn; i++ {
	  mean += float64(data[i])
}
mean /= float64(nn)
v := 0.0
for i := 0; i < nn; i++ {
	  s := mean - float64(data[i])
	  v += s * s
}
v /= float64(nn - 1)
sd = math.Sqrt(v)
gt.pm = pm
gt.pp = pp
gt.a = a
gt.d = go2descr[a]
gt.c = go2cat[a]
gt.o = float64(o)
gt.e = float64(expGOcounts[a]) / float64(n)
We sort the results and print the terms with minimum occupancy to a
table. In the table header we also pass the number of iterations to
help with downstream analyses.
sort.Sort(GOslice(GOterms))
w := tabwriter.NewWriter(os.Stdout, 1, 0, 2, ' ', 0)
fmt.Fprintf(w, "#GO\tO\tE\tO/E\tP_m(n=%.1e)\t", float64(n))
fmt.Fprint(w, "P_p\tCategory\tDescription\n")
for _, g := range GOterms {
	  if int(g.o) < minOcc { continue }
	  fmt.Fprintf(w,
		  "%s\t%d\t%.3g\t%.3g\t%.3g\t%.3g\t%s\t%s\n",
		  g.a, int(g.o), g.e, g.o/g.e, g.pm, g.pp, g.c, g.d)
}
w.Flush()
We import \ty{sort}.
"sort"
for _, term := range GOterms {
	  a := term.a
	  fmt.Printf("R %s", a)
	  counts := eVals[a]
	  for _, count := range counts {
		  fmt.Printf(" %d", count)
	  }
	  fmt.Printf("\n")
}
We're finished writing \ty{ego}, so let's test it.
\section*{Testing}
The outline of our testing program has hooks for imports and the
testing logic.
package main

import (
	  "testing"
	  //<<Testing imports, Ch.~\ref{ch:eg}>>
)
func TestEgo(t *testing.T) {
	  //<<Testing, Ch.~\ref{ch:eg}>>
}
We construct the tests and run them.
var tests []*exec.Cmd
//<<Construct tests, Ch.~\ref{ch:eg}>>
for i, test := range tests {
	  //<<Run test, Ch.~\ref{ch:eg}>>
}
To begin with, we run \ty{ego} on a list of IDs obtained by shuffling
\ty{i1.txt} and annotating the intervals.
ge := "../data/gene2go"
ob := "../data/obsId.txt"
ra := "../data/expId.txt"
test := exec.Command("./ego", "-o", ge, ob, ra)
tests = append(tests, test)
We import \ty{exec}.
"os/exec"
As a second test, we reduce the minimum number of genes per GO term
from default ten to five.
test = exec.Command("./ego", "-o", ge, "-m", "5", ob, ra)
tests = append(tests, test)
test = exec.Command("./ego", "-o", ge, "-r", ob, ra)
tests = append(tests, test)
We run the test and compare what we get with what we want, which is
contained in \ty{r1.txt}, \ty{r2.txt}, and \ty{r3.txt}.
get, err := test.Output()
if err != nil {
	  t.Errorf("can't run %q", test)
}
f := "r" + strconv.Itoa(i+1) + ".txt"
want, err := ioutil.ReadFile(f)
if err != nil {
	  t.Errorf("can't open %q", f)
}
if !bytes.Equal(get, want) {
	  t.Errorf("get:\n%s\nwant:\n%s\n", get, want)
}
We import \ty{strconv}, \ty{ioutil}, and \ty{bytes}.
"strconv"
"io/ioutil"
"bytes"