\documentclass[10.5pt]{article}
\usepackage{longtable}
\usepackage{graphicx}
\usepackage{lineno}
\usepackage{amssymb}
\usepackage{hyperref}
\hypersetup{colorlinks=true, urlcolor=blue, citecolor=black}
\usepackage[comma, sort&compress]{natbib}
\usepackage{fullpage}
\usepackage{setspace}
\newcommand{\peer}{\textit{P. eremicus}}
\newcommand{\per}{\textit{Peromyscus eremicus}}
\newcommand{\pcr}{\textit{Peromyscus crinitus}}
\newcommand{\eg}{\textit{e.g.,}}
\newcommand{\tit}{\textit}
\begin{document}
\begin{flushleft}
{\large
\textbf{Using genomics to understand adaptation in non-model organisms }
}
\vspace{8mm}
\noindent
Matthew D MacManes$^{1,2}$$^\ast$\\
\vspace{4mm}
\bf{1} \textnormal{\em{University of New Hampshire, Durham, NH 03824}} \\
\end{flushleft}
\vspace{8mm}
\begin{abstract}
The genetics of phenotypic variation and adaptation is one of the central problems in evolutionary biology. Indeed, gaining a deeper mechanistic understanding of the links between genotype, phenotype, and fitness has long been a focus of research. Only recently, with the advent of high-throughput sequencing, have methods for the study of these links become available. Unlike traditional genetics, high-throughput sequencing has the power to assay all variation within entire genomes, which enables biologists to begin to understand the genetic underpinnings of complex traits. During the year 2018--2019, I will analyze DNA and RNA sequence data to better understand (1) the genomics of adaptation to desert life, (2) adaptation to a unique social environment, and (3) the genetics of color polymorphism in poison frogs. In addition, I am an active developer of several open-source \textit{de novo} transcriptome assemblers (Oyster River Protocol, Trinity), and a moderate amount of time is requested to support my work on the development of computational algorithms. To conduct this work, I am asking for an allocation of 800,000 service units on the XSEDE resource \textsc{Bridges}, split between the large-memory and regular-memory systems.
\end{abstract}
\doublespacing
\linenumbers
\vspace{12mm}
\section*{Introduction}
For biologists interested in understanding the relationship between fitness, genotype, and phenotype, these are exciting times. Rapidly developing sequencing technologies now afford researchers like myself an unprecedented opportunity to gain a deep understanding of the genome-level processes that together underlie adaptation. This newfound ability, combined with an understanding of ontogeny, evolutionary history, and fitness benefit, represents an extremely rich and novel context within which to test hypotheses regarding adaptation. \\
\noindent
Unlike traditional Sanger sequencing, where production was measured in hundreds or thousands of bases, next-generation sequencing, specifically Illumina sequencing, can produce billions of base pairs of data every day from whole genomes, specific tissues, or parts of the genome, during specific stages in an organism's development. While this newfound ability to generate massive amounts of data has opened up novel avenues of study, it has challenged the existing computational infrastructure. Here, I propose to leverage the computational power of the XSEDE system (specifically PSC \textsc{Bridges}) against problems that just a few years ago were considered intractable. The proposed research is an integrative project that deals with a fundamental biological process: adaptation. Specifically, I aim to begin analysis on a number of projects (\eg\ genetics of color polymorphism), and to continue work on adaptation to unique environmental conditions: the desert, and a unique social environment. \\
\noindent
My previous work on computational genomics, both as a PhD student and as a postdoctoral scholar, has prepared me well to pursue this line of research. Having been a Co-PI and PI on previous XSEDE grants (TG-MCB110134 and IBN100014) has given me the technical skills necessary to efficiently leverage XSEDE resources for analysis of these types of data. Several of these previous projects have been featured in XSEDE-related publications (\href{http://goo.gl/3lXKf}{Behavioral Genomics: Alone Time for Tucos}, \href{http://goo.gl/fasl6}{Monogamy and the Immune System}, and \href{http://goo.gl/iGKSN}{White Paper: Blacklight at Pittsburgh Supercomputing Center Shines Light on Life Sciences Research}).
\section*{Specific Study Objectives}
\subsection*{Desert Biology}
\emph{How do organisms respond to extreme and ever-changing environments?} In a world where anthropogenic influences have resulted in unprecedented and rapid climate change \citep{IPCC:2007uz}, understanding how animals respond to these changes is of paramount importance. Critically, organisms living in harsh environments (\eg\ high altitude, deep sea, deserts) represent natural experiments, given that they have evolved physiological mechanisms allowing for persistence. Indeed, understanding the ways in which these organisms survive will shape our ideas about adaptation and the potential for other species to adapt to increasingly erratic and extreme climatic patterns. Specifically, given our changing climate, identifying the physiological mechanisms through which animals currently facing water stress (\eg\ the cactus mouse) have adapted is urgently needed, perhaps more so today than ever before, if we are to be proactive in our attempts at mitigating ongoing human-induced climate change.
For the past several years, the MacManes lab has been working at the interface of ecology, physiology, and evolutionary genomics. We are interested in understanding the proximate (sensu \citet{Tinbergen:1963ty}) underpinnings of adaptive phenotypes, and we study them within the context of their ultimate evolutionary explanations. To accomplish these goals, we use the powerful and well-developed cactus mouse (\per) system, specifically asking how these desert rodents survive acute and chronic dehydration. We integrate our expertise in physiology with substantial genomic resources and an understanding of the native ecology to generate a synthetic understanding of the ecophysiological and genomic mechanisms that allow for persistence \citep{MacManes:2017bu,Kordonowy:2017jk,Kordonowy:2016fq,MacManes:2014br}. These studies have revealed a dramatic response to dehydration, which includes substantial weight loss (mean=23\%) and electrolyte derangement (especially sodium) but, importantly, without renal compromise \citep{MacManes:2017bu}. These findings, combined with studies of renal gene expression \citep{MacManes:2017bu}, have led us to form hypotheses related to metabolic water production via $\beta$-oxidation of fat, and alterations in renal microcirculation. \\
\subsection*{Social Behavior}
\noindent
Becoming a parent causes big changes in behavior. For animals that exhibit parental care, the successful rearing of offspring involves a shift from aggressive and sexual behaviors to caring and nurturing ones, but what in the brain mediates this transition? How flexible are these changes in response to unpredictable environmental perturbations, and how is behavior altered because of them? In a rapidly changing world, understanding how the environment affects the brain and how, in turn, the brain affects the behavioral transition into parental care will shed light on how changes in environment can ultimately affect fitness. In an interdisciplinary collaboration, Dr. Calisi-Rodríguez (Barnard College) and Dr. MacManes (University of New Hampshire) will characterize changes at all neural genetic and specific proteomic levels in male and female rock doves (\textit{Columba livia}) during the transition into parental care and in response to environmental manipulations upon transition. \\
\subsection*{Vertebrate Color Polymorphism}
\noindent
Color polymorphism, particularly in extreme cases like poison dart frogs where color is linked to toxicity, has long been held up as a classic example of adaptation. Interestingly, despite all the interest, we have a relatively poor understanding of the genetics of color, especially in cases where individuals vary dramatically in color or patterning. This is particularly true in amphibians, for which genomic data are depauperate. With collaborators (Kyle Summers, Rasmus Nielsen), we have begun a project that aims to uncover the genomic architecture of color polymorphism in Dendrobatid frogs. We have collected sequence data for several projects that examine coloration in poison frogs: (1) gene expression in different color morphs of the green and black poison frog \textit{Dendrobates auratus}; (2) gene expression in different color morphs of the poison frog \textit{Ranitomeya imitator}, together with an analysis of when candidate color genes are expressed during development, using a developmental series of data from each population of \textit{R. imitator}; and (3) analyses of the genomic architecture of different mimetic species that have converged on the same color and pattern. For the latter, we will compare our results from the developmental series of \textit{R. imitator} with developmental series from two co-mimetic color morphs of \textit{Ranitomeya variabilis} and two co-mimetic color morphs of \textit{R. fantastica}. (4) Finally, and most ambitiously, we will attempt to link color, pattern, toxicity, diet, and gene expression in multiple tissues from six populations of \textit{R. imitator} from an area in Peru where the species changes from a yellow striped appearance to an orange banded appearance, sampling populations of each color morph as well as two populations in the middle of this hybrid zone, where \textit{R. imitator} exhibits substantial phenotypic variation.
This project will attempt to link the color and pattern of a polymorphic aposematic species in the wild to other fundamental questions about aposematic species, such as the genomic mechanisms that allow toxin sequestration and how toxicity is extraordinarily variable within and between species. In addition to frogs, we are beginning a parallel project using salamanders of the genus \textit{Ensatina} (\url{http://www.biomedcentral.com/1471-2148/11/194/figure/F1?highres=y}), a classic model for the study of speciation and color polymorphism. We have generated sequence data from the skin of several individuals, and data analysis will soon commence.
\subsection*{Transcriptome Assembly}
For all of biology, modern sequencing technologies have provided an unprecedented opportunity to gain a deep understanding of the genome-level processes that underlie a very wide array of natural phenomena, from intracellular metabolic processes to global patterns of population variability. Transcriptome sequencing has been influential \citep{Mortazavi:2008jj,Wang:2009di}, particularly in functional genomics \citep{Lappalainen:2013el,Cahoy:2008hm}, and has resulted in discoveries not possible even just a few years ago. This is in large part due to the scale at which these studies may be conducted \citep{Li:2017bq, Tan:2017ix}. Unlike studies of adaptation based on one or a small number of candidate genes (\eg\ \citealp{Fitzpatrick:2005vd,Panhuis:2006kp}), modern studies may assay the entire suite of expressed transcripts (the transcriptome) simultaneously. In addition to issues of scale, as a direct result of enhanced dynamic range, newer sequencing studies have an increased ability to simultaneously reconstruct and quantitate lowly and highly expressed transcripts \citep{Wolf:2013hd,Vijay:2012gy}. Lastly, improved methods for the detection of differences in gene expression (\eg\ \citealp{Robinson:2010cw,Love:2014cf}) across experimental treatments have resulted in increased resolution for studies aimed at understanding changes in gene expression. \\
As a direct result of the widespread popularity of transcriptome sequencing, a diverse toolset for the assembly of transcriptomes exists, with each tool potentially reconstructing transcripts that others fail to reconstruct. Amongst the earliest specialized \tit{de novo} transcriptome assemblers were the packages \texttt{Trans-ABySS} \citep{Robertson:2010ih}, \texttt{Oases} \citep{Schulz:2012je}, and \texttt{SOAPdenovoTrans} \citep{Xie:2013wu}, which were fundamentally based on the popular \tit{de Bruijn} graph-based genome assemblers \texttt{ABySS} \citep{Simpson:2009iv}, \texttt{Velvet} \citep{Zerbino:2008bm}, and \texttt{SOAP} \citep{Li:2008in}, respectively. These early efforts gave rise to a series of more specialized \tit{de novo} transcriptome assemblers, namely \texttt{Trinity} \citep{Haas:2013jq} and \texttt{IDBA-Tran} \citep{Peng:2013eu}. While the \tit{de Bruijn} graph approach remains powerful, newly developed software explores novel parts of the algorithmic landscape, which offers substantial benefits if the novel methods reconstruct different fractions of the transcriptome. \texttt{BinPacker} \citep{Liu:2016hh}, for instance, abandons the \tit{de Bruijn} graph approach to model the assembly problem after the classical bin packing problem, while \texttt{Shannon} \citep{Kannan:2016be} uses information theory rather than a set of heuristics chosen by software engineers. These newer assemblers, by implementing fundamentally different assembly algorithms, may reconstruct fractions of the transcriptome that other assemblers fail to accurately assemble.
In addition to the variety of tools available for the \tit{de novo} assembly of transcripts, several tools are available for pre-processing of reads via read trimming (\eg\ \texttt{Skewer} \citep{Jiang:2014cx}, \texttt{Trimmomatic} \citep{Bolger:2014ek}, \texttt{Cutadapt} \citep{Martin:2011va}), read normalization (\texttt{khmer} \citep{Pell:2012id}), and read error correction (\texttt{SEECER} \citep{Le:2013dy}, \texttt{RCorrector} \citep{Song:2015in}, and \texttt{Reptile} \citep{Yang:2010kv}). Similarly, benchmarking tools that evaluate the quality of assembled transcriptomes have been developed, including \texttt{TransRate} \citep{SmithUnna:2016go}, \texttt{BUSCO} (\underline{B}enchmarking \underline{U}niversal \underline{S}ingle-\underline{C}opy \underline{O}rthologs; \citealp{Simao:2015kk}), and \texttt{Detonate} \citep{Li:2014cm}. Despite the development of these evaluative tools, the work proposed here represents the first systematic effort coupling them with the development of a \textit{de novo} transcriptome assembly pipeline.
The ease with which these tools may be used to produce and characterize transcriptome assemblies belies the true complexity underlying the overall process \citep{Ungaro:2017kf, Wang:2017gc, Moreton:2015fw, Yang:2013iz}. Indeed, the subtle (and not so subtle) methodological challenges associated with transcriptome reconstruction may result in highly variable assembly quality. In particular, while most tools run using default settings, these defaults may be sensible only for one specific (often unspecified) use case or data type. Because parameter optimization is both dataset-dependent and factorial in nature, an exhaustive optimization, particularly of entire pipelines, is not feasible. Given this, the production of a \tit{de novo} transcriptome assembly requires a large investment in time and resources, with each step requiring careful consideration. Here, I propose an evidence-based protocol for assembly that results in the production of high-quality transcriptome assemblies across a variety of commonplace experimental conditions and taxonomic groups. \\
A significant portion of the computational effort is related to further development of the Oyster River Protocol \citep{MacManes:2015iz} for transcriptome assembly. Specifically, this method explicitly considers and attempts to address many of the shortcomings described by \citet{Vijay:2012gy}, by leveraging a multi-kmer and multi-assembler strategy. This innovation is critical, as all assembly solutions treat the sequence read data in ways that bias transcript recovery. Specifically, the development of assembly software requires a set of heuristics that are necessary given the scope of the assembly problem itself. Given that each software development team carries with it a unique set of ideas related to these heuristics while implementing various assembly algorithms, individual assemblers exhibit unique assembly behavior. By leveraging a multi-assembler approach, the strengths of one assembler may complement the weaknesses of another. In addition to biases related to assembly heuristics, it is well known that assembly kmer length has important effects on transcript reconstruction, with shorter kmers more efficiently reconstructing lower-abundance transcripts relative to more highly abundant transcripts. Given this, assembling with multiple different kmer lengths and then merging the resultant assemblies may effectively reduce this type of bias. Recognizing these issues, I hypothesize that an assembly that results from the combination of multiple different assemblers and assembly kmer lengths will be better than each individual assembly across a variety of metrics.
\section*{Methods and Computational Plan}
Common Experimental Design and Methods: Many of the projects proposed here will use Illumina sequencing of DNA and tissue-specific RNA molecules to better understand the genomic mechanisms underlying phenotype. Because none of the species I am currently working with have reference genomes, a computationally costly assembly step is required. Although transcriptome assembly is less challenging than genome assembly \citep{Bradnam:2013uu,Earl:2011gt}, significant hurdles still exist (see \citealp{Francis:2013gc,Vijay:2012gy,Pyrkosz:2013tm} for examples of the types of challenges). \\
\noindent
Most modern \textit{de novo} assemblers attempt to solve this problem via a \textit{de Bruijn} graph representation of sequence neighborhoods, where sequences are decomposed into tiled sub-reads of length $k$ ($k$-mers) and sequences sharing $k-1$ bases are connected by directed edges. Currently, there are several open-source, peer-reviewed software packages available for short-read \textit{de novo} assembly (\textsc{ABySS} \citep{Birol:2009ia,Simpson:2009iv}, \textsc{Ray} \citep{Boisvert:2010dz}, \textsc{Velvet} \citep{Zerbino:2008bm}, \textsc{Oases} \citep{Schulz:2012je}, \textsc{SOAPdenovo} \citep{Li:2009cx}, and \textsc{Trinity} \citep{Haas:2013jq,Grabherr:2011jb}). These software packages vary with regard to their ability to reconstruct genome sequences as coverage and DNA sequence complexity change, and thus each assembler will produce different results. All \textit{de novo} genome and transcriptome assemblers have very high RAM requirements. For some of my previous work, RAM usage has exceeded 512~GB, which suggests that a large computer with shared-memory architecture may be best suited for this work. \\
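\noindent
As a small worked illustration of the $k$-mer decomposition described above (the read \texttt{ACGTAC} and the choice $k=3$ are hypothetical, chosen only for readability), a read of length $L$ yields $L-k+1$ overlapping $k$-mers:
\[
\mathtt{ACGTAC} \;\longrightarrow\; \{\mathtt{ACG},\ \mathtt{CGT},\ \mathtt{GTA},\ \mathtt{TAC}\}, \qquad L-k+1 = 6-3+1 = 4 .
\]
Consecutive $k$-mers share $k-1=2$ bases, giving the directed path $\mathtt{ACG}\rightarrow\mathtt{CGT}\rightarrow\mathtt{GTA}\rightarrow\mathtt{TAC}$. Real assemblers use much larger values of $k$ and add error handling that this toy example omits. \\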
\noindent
The general workflow for transcriptome assembly is as follows. Raw sequencing reads are uploaded to Bridges. A single .bz2 compressed archival copy is sent to the PSC Data Supercell. The working copy of the data, which can be in excess of 100~GB for a single experiment, is quality trimmed. In brief, because each nucleotide has a corresponding quality value, we can elect to remove nucleotides whose quality is below a specified threshold. In addition to reducing runtime and RAM use, this process can improve the resultant assembly quality (64 SUs per assembly). Next, an error correction phase is begun. This step, about which I recently published a paper \citep{MacManes:2013ec}, has been shown to reduce error in the final assembly. \\
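\noindent
To make the quality threshold concrete, and assuming the Phred scaling that is conventional for Illumina reads (the example values below are illustrative, not the settings used in this project), a quality value $Q$ encodes the estimated probability $p$ that a base call is wrong:
\[
Q = -10\,\log_{10} p \qquad\Longleftrightarrow\qquad p = 10^{-Q/10},
\]
so that $Q=20$ corresponds to $p=0.01$ and $Q=30$ to $p=0.001$. Trimming at a threshold $Q_{\min}$ therefore removes bases whose estimated error probability exceeds $10^{-Q_{\min}/10}$. \\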
\noindent
After these pre-assembly steps are completed, assembly begins. The first steps in assembly, using the software packages \textsc{Trinity}, \textsc{Shannon}, and \textsc{SPAdes}, include a set of parameter sweeps, followed by a final assembly. Lastly, an assembly refinement step is initiated. \\
\noindent
Assembly is coordinated using the Oyster River Protocol \citep{MacManes:2015iz}, which is already installed on Bridges. Each assembly uses approximately 512~GB of RAM and runs for between 1 and 7 days, depending on the size of the dataset. Given that tens of datasets are to be assembled, 200,000 SUs are requested for this aim of the proposal. \\
\noindent
After assembly, transcript expression is quantified via an RNAseq experiment. These quantification experiments are done locally, and are included here for explanatory purposes only. RNAseq experiments offer three main benefits over traditional array-based experiments \citep{tHoen:2008hn}. First, no \textit{a priori} knowledge of the genes underlying phenotypic differences is required \citep{Gilad:2009km}. This is especially important in non-model organisms, because it opens up the possibility of identifying novel genes. Second, because the number of reads matching a given genomic region is directly related to transcript abundance, very small, yet potentially biologically relevant, differences are detectable \citep{Mortazavi:2008jj}. Lastly, alternative splicing events that can lead to different phenotypes can be discovered \citep{Sultan:2008jh}. These processes are relatively rapid on low-memory nodes. For this part of the workflow, we request 185,000 SUs on the RM partition of Bridges. \\
\noindent
After quantification, population genetic analyses are done. These analyses focus on estimating selection coefficients for each transcript, and are extremely computationally intensive. While each individual analysis is relatively fast, because roughly 20,000 transcripts (the typical number per experiment) are to be tested, the overall computational requirements are large. A small code scaling experiment was performed several months ago on Bridges. \\
\section*{Justification of resources requested}
Having served as a Co-PI and PI on previous XSEDE allocations (IBN-100014, MCB-110134), I am very familiar with many aspects of XSEDE resource use, specifically its efficient use for transcriptome and genome assembly. The previous allocation has resulted in several peer-reviewed publications; several others are in preparation (see CV), and data analysis (on Blacklight) for several more is currently in progress. XSEDE resources are critical for this work. Several of the datasets consist or will consist of hundreds of millions of 100-base-pair sequence reads, which sum to more than 250~GB of raw input data. These data are the starting material for analyses that require a very large amount of RAM ($\sim$512~GB of RAM loaded onto a single core), which is substantially more than any computer resource available to me locally at UNH. \\
\begin{center}
\begin{tabular}{ | c | c | c | c |}
\hline
Task & RM (SUs) & LM (SUs) & Pylon storage \\ \hline
Assembly & & 200,000 & \\ \hline
Mapping & 185,000 & & \\ \hline
Storage & & & 25~TB \\ \hline
\end{tabular}
\end{center}
\singlespacing
\bibliographystyle{model2-names-edit}
\bibliography{formatted7}
\end{document}