The Desai Lab


The response of microbial and viral populations to natural selection determines how much diversity they create and maintain, how quickly they adapt to novel environments, and how readily they evolve new features. It is key to understanding the evolution of antibiotic resistance and the response of pathogens to immune selection, and to related clonal processes such as the somatic evolution of cancers. The overall goal of our work is to understand evolution and population genetics in these largely asexual populations, using mathematical models of evolutionary dynamics, empirical studies of natural genetic variation, and direct observations of adaptation in experimental microbial populations. Broadly speaking, we ask two types of questions. First, what is the structure of the enormous and high-dimensional map between the genotypes accessible to a population and the phenotypes relevant to its evolution? This determines the landscape on which evolution takes place. Second, how does evolution navigate this landscape? That is, how does a particular combination of selection pressures, population sizes, mutation rates, and other factors determine the outcomes of the evolutionary process, and what signatures does this leave on patterns of genetic variation within and between populations? These questions drive the two main research directions in our lab:

(1) The statistical structure of the genotype-phenotype map. We develop novel experimental and statistical approaches to infer the genetic architecture of complex traits, with a focus on characterizing the genotype-phenotype maps that determine how microbial and viral populations evolve.

(2) Evolutionary dynamics and population genetics in microbial and viral populations. We use both mathematical models and experimental approaches to analyze the dynamics of molecular evolution and patterns of genomic sequence diversity within and between populations.

Our focus is on characterizing statistical properties of genotype-phenotype maps and of evolutionary trajectories, rather than on describing the details of the biochemical or cell biological mechanisms by which specific mutations lead to a particular phenotype. This is driven by our view that statistical features of evolution are in principle predictable, while the mechanistic details will often be specific to each individual biological system.

The Statistical Structure of the Genotype-Phenotype Map

All evolution ultimately depends on the nature of the available genetic variation: what are the possible mutations, the rates at which they arise, and their phenotypic effects in present and future environments? Our lab is working to characterize these genotype-phenotype maps in a variety of settings. Examples of research directions in our ongoing and future work include:

Mapping the genetic basis of complex traits in yeast:
A major focus of ongoing work in the lab is to develop novel experimental and statistical tools to characterize the genetic architecture of complex traits (i.e. genotype-phenotype maps). The basic idea behind any such method in quantitative genetics is to measure genotypes and phenotypes of a set of diverse individuals, and then infer the genetic basis of the phenotypes by finding statistical associations between particular genetic variants and corresponding phenotypes. A large community of researchers focuses on developing and applying these methods in the context of human genome-wide association studies (GWAS). However, there are fundamental constraints and confounding factors associated with quantitative genetics in humans (e.g. population structure and data privacy constraints, among many others), and experimentally validating results presents enormous challenges.

We are taking an orthogonal approach, by developing novel experimental and statistical methods that allow us to dramatically increase the power and throughput of quantitative genetics (specifically, QTL mapping) in a model organism, budding yeast. In recent work, we have demonstrated that we can increase statistical power by increasing the sample size of mapping panels by two orders of magnitude over the previous state of the art. This is enabled by methods we have developed for (1) efficient and accurate genotyping using low-coverage sequencing, (2) combinatorial barcoding to enable accurate high-throughput phenotyping, and (3) new statistical methods to detect numerous densely-spaced small-effect QTLs as well as epistatic and dominance effects. Our results show that we can infer genotype-phenotype maps with unprecedented resolution, including the identity of weak-effect loci and epistatic interactions between them. In ongoing and future work, we plan to use these methods to infer genotype-phenotype maps for a variety of important traits across multiple yeast crosses, and use these as a new way to probe the architecture of genetic networks in yeast. We are also developing computational and statistical tools for inferring the structure of these landscapes. This includes novel approaches to fine-map causal loci, to identify epistatic interactions and other "global" nonlinear effects, and to jointly map the genetic basis of multiple traits, which provides a new and principled way to investigate shared genetic architecture across phenotypes (i.e. pleiotropy). Finally, we are building on a recently developed retron-based barcoded CRISPR system to reconstruct tens of thousands of specific variants (and combinations of variants) in hundreds of genetic backgrounds, which makes it possible to conduct large-scale direct experimental validation. 
Together, these approaches make it possible to systematically study the statistical structure of genotype-phenotype maps in budding yeast. Given the proven utility of yeast in advancing our understanding of human biology (orthologous genes typically have orthologous functions, and the exceptions have proven informative), there is reason to believe that these insights into the genetic architecture of complex traits may also be broadly applicable. In addition, this system provides an ideal setting to test and validate novel statistical inference methods that we anticipate will eventually be generalizable to other settings, including human GWAS.
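To make the basic logic of QTL mapping concrete, the following is a minimal Python sketch of a single-locus scan on simulated data, and of why larger mapping panels help detect small-effect loci. It is an illustration only, not the lab's pipeline: the function names (simulate_cross, single_locus_scan), the assumption of unlinked biallelic loci in haploid segregants, the effect sizes, and the noise level are all hypothetical choices made for this example.

```python
import numpy as np

def simulate_cross(n_segregants, n_loci, causal_effects, noise_sd=1.0, seed=0):
    """Simulate haploid segregants with unlinked biallelic loci (a simplification).

    causal_effects maps locus index -> additive effect on the phenotype.
    """
    rng = np.random.default_rng(seed)
    # Each segregant inherits allele 0 or 1 at each locus with equal probability.
    genotypes = rng.integers(0, 2, size=(n_segregants, n_loci)).astype(float)
    effects = np.zeros(n_loci)
    for locus, effect in causal_effects.items():
        effects[locus] = effect
    phenotypes = genotypes @ effects + rng.normal(0.0, noise_sd, n_segregants)
    return genotypes, phenotypes

def single_locus_scan(genotypes, phenotypes):
    """Return a LOD score per locus from a marker-by-marker linear regression.

    For a single-marker regression, LOD = -(n / 2) * log10(1 - r^2),
    where r is the genotype-phenotype correlation at that marker.
    """
    n = len(phenotypes)
    g = genotypes - genotypes.mean(axis=0)
    p = phenotypes - phenotypes.mean()
    r = (g * p[:, None]).sum(axis=0) / np.sqrt((g ** 2).sum(axis=0) * (p ** 2).sum())
    return -0.5 * n * np.log10(1.0 - r ** 2)

# Hypothetical example: two causal loci among 50, scanned in a panel of 5,000.
genotypes, phenotypes = simulate_cross(5000, 50, {3: 0.5, 17: -0.3})
lod = single_locus_scan(genotypes, phenotypes)
top_two = set(np.argsort(lod)[-2:].tolist())
```

Because the LOD score at a causal locus grows roughly in proportion to the number of segregants, increasing the panel size is what brings small-effect loci above the noise. A real analysis would also need to handle linkage between nearby markers, genotyping uncertainty, and multiple testing, none of which this sketch attempts.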

Inferring sparse latent structure in genotype-phenotype maps:
One example of the statistical inference tools we are working to develop is a set of methods to infer low-dimensional latent structure in genotype-phenotype maps (i.e. a simpler representation of the map). Our approach is based on the idea that correlations among multiple phenotypes across related individuals may reflect some pattern of shared genetic architecture: individual genetic loci affect multiple phenotypes (pleiotropy), creating relationships between phenotypes. A natural hypothesis is that pleiotropic effects reflect a relatively small set of "core" cellular processes: each genetic locus affects one or a few core processes, and these core processes in turn determine the observed phenotypes. We are working to develop methods to infer the identity and genetic architecture of this space of core processes. While there are numerous existing methods to infer latent structure in high-dimensional biological data, in general these do not preserve the modular structure we aim to identify. Instead, we have recently proposed a novel sparse structure discovery (SSD) method, which uses a penalized matrix decomposition designed to identify latent structure that reflects expectations for modular architecture. We have shown that this can identify lower-dimensional structure that reflects interpretable and potentially biologically meaningful cellular processes. In ongoing and future work, we are working to generalize this method to reflect other types of latent structure, including nonlinear effects and epistatic interactions, and to apply it broadly across other types of data (e.g. it represents a novel way to infer sparse latent structure in tissue-specific transcriptomic data). We are also applying recently developed transformer models, which represent an alternative approach to finding structure in these high-dimensional landscapes.
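As an illustration of what a penalized matrix decomposition with sparsity on both factors can look like, here is a minimal Python sketch. It is a generic L1-penalized factorization fit by alternating proximal gradient steps, not the SSD method itself, whose details are not given here. The names (sparse_factorize, lam_w, lam_m), the squared-error loss, and the reading of the factors as a locus-to-process matrix W and a process-to-phenotype matrix M are assumptions made for this example.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink entries toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(B, W, M, lam_w, lam_m):
    """Squared reconstruction error plus L1 penalties on both factors."""
    resid = B - W @ M
    return 0.5 * np.sum(resid ** 2) + lam_w * np.abs(W).sum() + lam_m * np.abs(M).sum()

def sparse_factorize(B, k, lam_w=0.1, lam_m=0.1, n_iter=200, seed=0):
    """Approximate B (loci x phenotypes) as W (loci x k) @ M (k x phenotypes).

    Each update is a proximal gradient step with step size 1/L, where L is the
    Lipschitz constant of the gradient of the squared-error term in that block.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(B.shape[0], k))
    M = rng.normal(scale=0.1, size=(k, B.shape[1]))
    history = [objective(B, W, M, lam_w, lam_m)]
    for _ in range(n_iter):
        # Update W with M fixed; the gradient with respect to W is (W M - B) M^T.
        step = 1.0 / (np.linalg.norm(M @ M.T, 2) + 1e-12)
        W = soft_threshold(W - step * (W @ M - B) @ M.T, step * lam_w)
        # Update M with W fixed; the gradient with respect to M is W^T (W M - B).
        step = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-12)
        M = soft_threshold(M - step * W.T @ (W @ M - B), step * lam_m)
        history.append(objective(B, W, M, lam_w, lam_m))
    return W, M, history

# Hypothetical toy input: a random 30 x 8 matrix of locus effects on phenotypes.
B = np.random.default_rng(1).normal(size=(30, 8))
W, M, history = sparse_factorize(B, k=3)
```

The L1 penalties push many entries of W and M to exactly zero, so each locus loads on few latent processes and each process touches few phenotypes. That sparsity is the generic property this kind of decomposition shares with the modular architecture described above; the choice of penalty strengths and of k is left open here.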

Binding landscapes for immune-pathogen coevolutionary dynamics:
Our adaptive immune systems are engaged in a constant coevolutionary struggle with the pathogens that challenge them, as pathogens adapt to evade our immune response and our immune repertoires shift in turn. These coevolutionary dynamics take place across a vast and high-dimensional landscape of potential pathogen and immune receptor sequence variants (antibodies and T-cell receptors). Mapping the relationship between these genotypes and the phenotypes that determine immune-pathogen interactions is critical for understanding, predicting, and controlling disease. My lab is working to empirically characterize these genotype-phenotype maps. We focus on several key phenotypes (e.g. protein stability, binding affinity of antibodies to relevant antigens or of pathogens to relevant host proteins, and neutralization of pathogenic strains by sera) that are thought to be major drivers of immune-pathogen coevolution. While these traits do not encompass all of the selection pressures relevant for coevolution, they are useful proxies for important aspects of this process.

To more comprehensively analyze the enormous and high-dimensional genotype-phenotype maps relevant for immune-pathogen coevolution, my lab and others have developed methods to dramatically increase the throughput of phenotypic measurements. For example, we have recently applied Tite-Seq, a high-throughput method for measuring tens of thousands of equilibrium binding constants in parallel, to map changes in binding affinity along the maturation pathway of several broadly neutralizing anti-influenza antibodies, by creating combinatorially complete libraries of all heavy-chain mutations separating germline from mature antibodies (up to hundreds of thousands of variants, depending on the antibody). By measuring the binding affinity of each antibody variant to the relevant vaccine component antigens, we determined how different combinations of mutations provide varying levels of potency and breadth, providing insight into the nature of the binding landscapes and the types of selection pressures that can lead to these unusual antibodies. More recently, we used similar approaches to map the effects of all possible combinations of the 15 mutations separating the ancestral Wuhan Hu-1 strain of the SARS-CoV-2 spike protein receptor binding domain from the Omicron BA.1 variant on binding to human ACE2 and a set of representative monoclonal antibodies. Our results revealed a highly epistatic landscape that supports the hypothesis that Omicron evolution involved a period of relaxed selection pressures, for example during chronic infection of an immunocompromised host.
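The scale of a combinatorially complete library, and the simplest way to read epistasis off it, can be sketched in a few lines of Python. A complete library over 15 biallelic sites contains 2^15 = 32,768 genotypes. The toy phenotype below (an additive term plus one pairwise interaction, standing in for something like a log binding affinity) and the function names are hypothetical and chosen only for illustration; they are not measured values.

```python
from itertools import product

def enumerate_genotypes(n_sites):
    """All binary genotypes over n_sites (0 = ancestral state, 1 = mutated)."""
    return list(product((0, 1), repeat=n_sites))

def pairwise_epistasis(phenotype, background, i, j):
    """Deviation from additivity for sites i and j on a given background.

    Computed from the double-mutant cycle: f(11) - f(10) - f(01) + f(00).
    """
    def with_sites(a, b):
        genotype = list(background)
        genotype[i], genotype[j] = a, b
        return phenotype(tuple(genotype))
    return with_sites(1, 1) - with_sites(1, 0) - with_sites(0, 1) + with_sites(0, 0)

def toy_landscape(genotype):
    """Hypothetical phenotype: additive effects plus one interaction (sites 0, 1)."""
    additive = sum(0.25 * g for g in genotype)
    interaction = 1.5 * genotype[0] * genotype[1]
    return additive + interaction

library = enumerate_genotypes(15)
ancestral = library[0]  # the all-zeros genotype
eps_interacting = pairwise_epistasis(toy_landscape, ancestral, 0, 1)
eps_additive = pairwise_epistasis(toy_landscape, ancestral, 2, 3)
```

In this toy landscape the double-mutant cycle recovers the interaction coefficient for sites 0 and 1 and returns zero for a purely additive pair. With a complete library, the same cycle can be evaluated on every background, which is what makes it possible to ask how epistasis itself depends on the other mutations present.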

In ongoing and future work, we are continuing to characterize combinatorial landscapes involving mutations found in B-cell receptor sequences and pathogen variants (e.g. more recent Omicron variants). We are also working to dramatically increase the throughput of other types of phenotypic assays (e.g. using methods analogous to Tite-Seq to measure the ability of large libraries of antibody variants to neutralize a specific viral strain, or the ability of a specific antibody sequence to neutralize a large library of viral variants). In addition, we are conducting highly parallel directed evolution experiments to analyze how population genetic parameters of the process interact with the binding landscape to determine the outcomes of affinity maturation. For example, we are evolving hundreds of human germline antibody lineages in the presence of a variety of selection pressures imposed by different influenza antigen variants. Together with the theoretical frameworks developed in other aspects of our research, this work will help us understand how evolutionary dynamics in the immune system determine the success (or failure) of adaptive immune responses, and provide insight into why specific antibodies do or do not emerge.

As with any high-dimensional genotype-phenotype map, the central problem in empirically characterizing immune-pathogen coevolutionary landscapes is the enormous scale of sequence space. Regardless of how rapidly we improve experimental throughput, we will never be able to comprehensively survey even a small fraction of all possible genotypes. A critical challenge is therefore to determine which combinations of the approaches described above, along with other strategies which we may not yet have conceived, will provide the most power for extrapolation. That is: what types of measurements (and which computational approaches) will best allow us to accurately infer the larger-scale structure of the coevolutionary landscape from inevitably limited data? There is reason to be optimistic that this is possible: evolution itself cannot and does not comprehensively explore sequence space. Instead, it explores and selects trajectories based on relatively limited information. Thus, if we can collect a similar sort of information, it should be possible to make at least general statistical predictions about how evolution will act.

Evolutionary dynamics and population genetics in microbial and viral populations

Most existing frameworks in evolutionary dynamics and population genetics assume that populations exist near a fitness optimum. In this view, evolution primarily involves neutral mutations, purging of deleterious variants, and occasional beneficial mutations that spread through the population in rare selective sweeps. However, over the past two decades it has become clear that these assumptions can be grossly misleading, particularly in the large and mostly clonal populations characteristic of microbes and viruses. Instead, these populations often exist away from fitness optima, with many beneficial and deleterious mutations often present simultaneously. Because recombination is limited, selection cannot act on each separately, but only on combinations of mutations linked together on physical chromosomes. This dramatically reduces the efficiency of natural selection, places enormous constraints on adaptation, and makes it difficult to predict how these populations evolve. Grappling with these effects requires a combination of empirical work (to help determine what processes are essential, what can be neglected, and what types of effects demand explanation) and new theoretical frameworks (to provide a basis for predicting evolutionary dynamics and for drawing inferences from empirical data). Our lab has been a leader in both areas, and we are now poised to expand both our empirical and our theoretical work beyond the context of simple laboratory evolution experiments and into more realistic natural settings.

Direct observations of the dynamics of molecular evolution:
Over the past decade we have conducted a number of studies to directly observe the dynamics of molecular evolution in laboratory evolution experiments. For example, we have used whole-population whole-genome sequencing to identify individual spontaneously arising mutations and to track their frequencies through time in hundreds of parallel budding yeast populations, and analyzed molecular evolution across 60,000 generations in a long-term E. coli experiment conducted by Rich Lenski’s group. Our results have highlighted the critical role played by hitchhiking and clonal interference in constraining adaptation. We have also directly quantified how recombination affects the efficiency of selection, analyzed the role of epistatic interactions, and studied the spontaneous evolution of ecological interactions. More recently, we have developed a new approach to track evolutionary dynamics using “renewable” DNA barcoding methods, which allow us to follow the fates of individual cell lineages through time at frequencies as low as one in a million. This is essential to observe competition between rare lineages, which is often crucial in determining the outcomes of evolution.
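One simple way to turn a barcode frequency trajectory into a fitness estimate is sketched below in Python. It assumes that the lineage's fitness advantage s over the rest of the population is constant over the sampled interval, so that the logit of its frequency grows linearly with slope s per generation. That assumption, the function names, and the example numbers are hypothetical; the sketch ignores sequencing noise, genetic drift, and changes in population mean fitness, all of which matter in practice for rare lineages.

```python
import numpy as np

def logit(frequencies):
    """Log-odds of a frequency; linear in time under constant relative fitness."""
    f = np.asarray(frequencies, dtype=float)
    return np.log(f / (1.0 - f))

def estimate_selection_coefficient(generations, frequencies):
    """Least-squares slope of logit(frequency) against generation number."""
    slope, _intercept = np.polyfit(generations, logit(frequencies), 1)
    return slope

# Hypothetical noiseless trajectory: a lineage starting at one in a million
# with a 5% per-generation advantage, sampled every 10 generations.
generations = np.arange(0, 100, 10)
true_s = 0.05
start = logit(1e-6)
frequencies = 1.0 / (1.0 + np.exp(-(start + true_s * generations)))
estimated_s = estimate_selection_coefficient(generations, frequencies)
```

On noiseless input the estimate recovers the simulated advantage. With real read counts, one would typically work with counts rather than frequencies and model the sampling noise explicitly, since a lineage at a frequency near one in a million is represented by very few reads.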

While these laboratory experiments have provided important insights, they are inherently artificial. We now aim to extend these observational methods into natural populations. For example, we recently analyzed molecular evolution in communities of budding yeast and bacteria in non-aseptic bioethanol production in million-liter open fermenters in Brazil. These populations are vastly more complex than our laboratory experiments, and involve constant immigration, ecological interactions, and environmental fluctuations. However, many of the same themes emerge. We are also launching new directions involving tracking immune-pathogen coevolutionary dynamics (e.g. using B-cell receptor repertoire sequencing data generated by ourselves and others) and within-strain evolutionary dynamics in host-associated microbial communities (e.g. plant-associated communities and the human gut microbiome). This work is inherently exploratory, but previous experience has shown that important surprises often emerge, as new observations demand explanation.

Evolutionary dynamics in rapidly evolving populations:
Existing methods in theoretical population genetics have been of limited utility in analyzing rapidly evolving microbial and viral populations, where selection often acts on multiple linked mutations simultaneously. The basic problem is that each mutation occurs at first in a single individual, making its fate crucially dependent on genetic drift, the other mutations in this genetic background, and nonlinear interactions with competing lineages. In past work, my lab has played a leading role in developing a new theoretical framework for studying these effects, which couples a linear but stochastic analysis of each individual mutation with a nonlinear but deterministic model of the rest of the population. We and others have applied this approach to predict the rate and genetic basis of adaptation and the statistics of frequency changes of each mutation through time. However, the field has yet to fully engage with many of the complexities involved in natural populations. For example, it remains unclear how rapidly evolving microbial and viral populations are affected by selection pressures that fluctuate across space and time, by the interactions between recombination and epistasis, or by ecological interactions in more complex communities. In ongoing and future work, we are working to develop new theoretical methods to explain these effects. Our longer-term goal is to use this theory as the basis for more powerful and principled ways to infer evolutionary history from sequence data.
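The "linear but stochastic" part of this picture can be illustrated with a textbook branching-process calculation, sketched here in Python. It assumes that a new mutant lineage with advantage s leaves a Poisson-distributed number of offspring with mean 1 + s each generation, independently of the rest of the population. Under that assumption the probability p that the lineage escapes loss by genetic drift solves p = 1 - exp(-(1 + s) p), which is close to 2s when s is small. This is a standard approximation offered as an illustration, not the lab's framework, and the function name is hypothetical.

```python
import math

def establishment_probability(s, n_bisections=200):
    """Survival probability of a branching process with Poisson(1 + s) offspring.

    Solves g(p) = p + expm1(-(1 + s) * p) = 0 for the positive root by bisection.
    Assumes s > 0 and large enough that the root lies well above 1e-12.
    """
    if s <= 0:
        return 0.0  # a neutral or deleterious lineage is eventually lost
    def g(p):
        return p + math.expm1(-(1.0 + s) * p)
    lo, hi = 1e-12, 1.0  # g(lo) < 0 and g(hi) > 0 bracket the positive root
    for _ in range(n_bisections):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_small = establishment_probability(0.01)
p_large = establishment_probability(0.5)
```

Even a mutation with a 1% advantage is lost by drift roughly 98% of the time in this model, which is why the fate of each mutation depends so strongly on its earliest generations. The calculation deliberately leaves out the other effects named above: the genetic background on which the mutation arises and competition with other lineages.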

Predicting how linked selection shapes patterns of diversity:
A central goal of population genetics is to predict how natural selection shapes patterns of genomic sequence diversity. However, most existing methods are limited to looking for deviations from neutral expectations, or to explaining the action of selection at a single locus. In contrast, when selection is pervasive, genetic variation at each site is affected by selection at other linked sites (linked selection), and existing methods break down. In recent years, as experimental studies and empirical data from a variety of systems (including microbes, viruses, humans, and Drosophila) all point to the widespread importance of linked selection, it has become clear that this is a crucial gap in our understanding. Building on earlier ideas, my lab has developed a “structured coalescent” framework to account for these effects. We first calculate how variation in fitness within a population is determined by the collective fates of many lineages of individual mutants. We then calculate the frequency distribution of these lineages, and use this to trace the mutational ancestry of each lineage backwards in time. This allows us to completely characterize the joint distribution of the genealogy and the selected mutations. We have used this approach in a series of studies to analyze how various forms of linked selection affect expected patterns of sequence diversity. We have also recently introduced an entirely novel forward-time approach to these questions. In ongoing and future work, we plan to use these methods to analyze complications critical for interpreting diversity in natural populations (e.g. interactions between recombination and linked selection) and to analyze more complex aspects of genetic diversity (e.g. those involving samples taken at different times). Our goal is to use this work as the basis for developing new statistical methods to look for these patterns in empirical data, and apply these methods to infer evolutionary history in natural populations.
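For context on what "deviations from neutral expectations" means in practice, the Python sketch below computes two classical diversity estimators from an unfolded site frequency spectrum (SFS). Under the standard neutral model with constant population size, the expected number of sites at which the derived allele appears i times in a sample of n sequences is theta / i, and both Watterson's estimator and pairwise diversity then recover theta. The function names and the example spectra are hypothetical, and the sketch describes the neutral baseline only, not the structured coalescent framework.

```python
def expected_neutral_sfs(n, theta):
    """Expected unfolded SFS under the standard neutral model: theta / i."""
    return [theta / i for i in range(1, n)]

def watterson_theta(sfs):
    """Watterson's estimator: segregating sites divided by the harmonic sum."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n

def pairwise_diversity(sfs):
    """Mean number of differences between two sampled sequences (pi)."""
    n = len(sfs) + 1
    n_pairs = n * (n - 1) / 2
    # A site with i derived copies differs in i * (n - i) of the sample pairs.
    return sum(i * (n - i) * count for i, count in enumerate(sfs, start=1)) / n_pairs

neutral = expected_neutral_sfs(20, theta=5.0)
theta_w = watterson_theta(neutral)
pi = pairwise_diversity(neutral)

# Hypothetical spectrum with an excess of rare variants (all singletons).
skewed = [10, 0, 0, 0]
skew_gap = pairwise_diversity(skewed) - watterson_theta(skewed)
```

When the two estimators disagree, as in the skewed example where pairwise diversity falls below Watterson's estimator, the spectrum departs from the neutral expectation. An excess of rare variants is one commonly discussed departure, but such summary statistics alone cannot distinguish between linked selection and demographic explanations, which is part of the motivation for models that describe the genealogy and the selected mutations jointly.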