Workshop on Algebraic Statistics and Computational Biology
Abstracts of talks
Christopher Burge
Towards an RNA Splicing Code
Splicing of pre-messenger RNAs is required for the expression of most human genes, and alternative splicing affects of these, and plays important roles in development and disease. This talk will focus on recent progress towards development of an 'RNA splicing code', i.e. a set of rules/algorithm capable of simulating exon recognition by the splicing machinery to predict the splicing pattern of any human primary transcript, and on computational methods for the detection of AS events conserved between human and mouse. Our work on the splicing code involves systematic computational and experimental screens for splicing regulatory elements and use of the resulting information to improve splicing simulation algorithms. Alternative splicing events conserved since the divergence of human and mouse are likely of primary biological importance, but only on the order of several hundred to ~1,000 such events are known. A set of sequence features has been identified that distinguish exons subject to evolutionarily conserved alternative splicing, which we call 'alternative-conserved exons' (ACEs), from other orthologous human/mouse exons. These features have been integrated into a regularized least-squares classification algorithm, ACEScan, which was run genome-wide to identify over 2,000 predicted ACEs. Alternative splicing could be verified in both human and mouse tissues using an RT-PCR-sequencing protocol for 21 of 30 (70%) predicted ACEs tested, supporting the validity of a majority of ACEScan predictions. Predicted ACEs were much more likely to preserve reading frame, and less likely to disrupt protein domains than other AS events, and were enriched in genes expressed in the brain and in genes involved in transcriptional regulation, RNA processing and development.
Andreas Dress
The Category of X-Nets
In phylogenetics, network analysis has become an indispensable tool. In the lecture, the "Category of X-Nets" will be introduced and the relevant parameters needed for specifying a net will be discussed as examples of quantities whose statistical analysis can hopefully gain a lot from "Algebraic Statistics".
Mathias Drton
Algebraic Methods for Gaussian Random Variables
The classic distributional assumption in statistical modeling of continuous variables is the assumption of a normal distribution, also known as the Gaussian distribution. This is reflected in particular in computational biology, where dependencies among gene expression measurements are often modeled within the Gaussian framework. This talk provides an introduction to normal distribution theory and to statistical models based on assumptions of normality. Emphasis is placed on problems in which algebraic techniques can be applied.
Vanja Dukic
Resolution-invariant Binary Partition Priors: Application to Breast Cancer Survival Studies
This talk will introduce a class of Bayesian semiparametric models for analyzing cancer survival data. These models are designed to account for individual patient characteristics and heterogeneity in both baseline hazards and treatment effects, and also incorporate various 'a priori' smoothness and shape assumptions about the hazard curve. The method relies on a new class of "multi-resolution" priors, obtained by binary partitions of the cumulative hazard in a recursive way so that the 'a priori' structure remains invariant to the choice of resolution (i.e. the depth of the tree), and so that the prior belief is consistent under aggregation. In the end, we obtain a smooth estimate of the survival and hazard curves for "multiple resolutions", i.e. for sets of time points of interest simultaneously. Properties of the multi-resolution priors and resulting posterior computational methods will be discussed in detail, and illustrated with an analysis of a large multicenter clinical trial of tamoxifen (a treatment for breast cancer).
Komei Fukuda
Polyhedral Computation in Algebraic Statistics
Recent advances in algebraic statistics clearly suggest the importance of polyhedral computation, which is the study of computational problems associated with convex polyhedra in general dimension. In this talk, we review various key problems in polyhedral computation arising from computational algebra and biology applications, including the problems of computing the convex hull of a finite set of points and the Minkowski addition of several polytopes (bounded polyhedra). Our main objectives are to discuss theoretical complexities of these problems and to present what one can practically compute using existing algorithms and implementations.
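As a small illustration of the two problems named in this abstract, the following Python sketch computes a planar convex hull with Andrew's monotone chain method and obtains the Minkowski sum of two convex polygons as the hull of all pairwise vertex sums. The function names and the brute-force strategy are assumptions chosen for brevity. They are not the algorithms or implementations discussed in the talk, and they do not extend to general dimension.

```python
from itertools import product


def cross(o, a, b):
    """Signed area test: positive if o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would create a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def minkowski_sum(P, Q):
    """Minkowski sum of two convex polygons given by their vertex lists."""
    sums = [(p[0] + q[0], p[1] + q[1]) for p, q in product(P, Q)]
    return convex_hull(sums)


# Hypothetical inputs: a unit square and a right triangle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (1, 0), (0, 1)]
pentagon = minkowski_sum(square, triangle)
```

The pairwise-sum approach takes time proportional to the product of the vertex counts before the hull step, which is acceptable for toy inputs only.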
Christine Heitsch
Deciphering the Information Encoded in RNA Viral Genomes
The formation of base pairs within single-stranded RNA molecules, such as the Hepatitis C viral genome, creates structure and affects function, thereby conveying biological information. We investigate how RNA viral genomes encode structure and function by analyzing a combinatorial model of RNA folding. Single-stranded RNA sequences are understood to self-bond with a complex interplay between energetically beneficial stacked base pairs, or "helices," and destabilizing single-stranded structures called "loops." One result of our combinatorial analysis demonstrates the importance of local helical constraints in specifying a global structure while another characterizes the minimal loop energy configurations in our model of RNA folding. As we will discuss, this work not only provides new insights into the coding of secondary structure in RNA sequences but also suggests new directions for analyzing functional motifs in RNA viral genomes.
Reinhard Laubenbacher
Computational Algebra Methods in Systems Biology
Recent advances in measurement technologies have made it possible to obtain experimental data about organisms at system levels and on a large scale. For instance, DNA microarray technology can provide simultaneous snapshots of the activity levels of all 25,000 genes in a human cell; new functional MRI technology provides global images of brain activity; and new in vivo imaging technology gives unprecedented insight into the functioning of our immune system. The availability of such data makes it possible for the first time to aim at an understanding of whole subsystems of an organism, from intracellular molecular signaling networks all the way to the structure of organismal networks such as the immune system. This is the goal of systems biology, which plays an increasingly important role in diverse research areas such as drug design and cancer biology. Mathematics provides the natural language and the tool set for systems biology.
This talk will focus on the problem of the identification of biochemical networks from a collection of experimental observations. In almost all cases the available data vastly underdetermine the network, so model selection becomes a central problem. The algorithm for model selection presented here relies on tools from computational algebraic geometry, and we will outline a comprehensive modeling program within the framework of polynomial dynamical systems over finite fields. An important aspect of model selection is the incorporation of biological constraints on model dynamics. This leads to the problem of inferring information about the dynamics of polynomial systems from their structure. We will present several results related to this problem. We will also discuss the development of a software package for polynomial dynamical systems within the computer algebra software system Macaulay2.
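To make the notion of a polynomial dynamical system over a finite field concrete, here is a minimal Python sketch. The three-variable system over the field with two elements is a hypothetical example invented for illustration; it is not a network from the talk, and the code does not use Macaulay2 or any algorithm presented there. It simply enumerates the state space to find fixed points and follows one trajectory.

```python
from itertools import product

P = 2  # field size (assumption: the two-element field, for simplicity)


def step(state):
    """Apply the hypothetical update polynomials, with arithmetic mod P.

    f1 = x1 * x2,  f2 = x1 + x3,  f3 = x2
    """
    x1, x2, x3 = state
    return ((x1 * x2) % P, (x1 + x3) % P, x2 % P)


def fixed_points():
    """Exhaustively list the states s with step(s) == s."""
    return [s for s in product(range(P), repeat=3) if step(s) == s]


def trajectory(state):
    """Follow the dynamics until a state repeats; return the visited states."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen


steady_states = fixed_points()
path = trajectory((1, 1, 1))
```

Exhaustive enumeration is feasible here only because the state space has eight elements. Inferring dynamics from the structure of the polynomials, as the abstract describes, is aimed at avoiding this kind of enumeration.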
Franziska Michor
The Mathematics of Cancer Therapy
Evolutionary concepts such as mutation and selection can best be described when formulated as mathematical equations. Cancer arises as a consequence of somatic evolution. Therefore, a mathematical approach can be used to understand the process of cancer initiation, progression and treatment. I will discuss a mathematical analysis of chronic myeloid leukemia during therapy with the chemotherapeutic agent Gleevec.
Lior Pachter
Biological Sequence Analysis
We introduce the Drosophila genome projects, and survey some of the sequence analysis problems that are central to ongoing studies of the genomes. Using genome alignments as an example, we show how natural considerations of robustness and accuracy lead to problems about algebraic statistical models and polytopes. We then describe the results of a whole genome parametric alignment, which builds on ideas in the ASCB book from Chapters 2, 4, 7 and 22.
Niles Pierce
Computational Analysis and Design of Nucleic Acid Systems
RNA and single-stranded DNA are versatile construction materials that can be programmed to self-assemble into nanoscale devices driven by the free energy of base pair formation. This talk will describe our efforts to develop a general computational framework for the analysis and design of nucleic acid systems. Experimental demonstrations will include the locomotion of a synthetic DNA walker and biosensing using the mechanism of hybridization chain reaction.
John Rhodes
Algebraic Models in Phylogenetics
The central problem of phylogenetics is the inference of an evolutionary tree from sequence data. Stochastic models of sequence evolution along trees lead to polynomial expressions for probabilities of observations, with the form of the polynomials reflecting both the model and the topology of the tree. Algebraic techniques can therefore be used to study such models, and their relationship to data.
Polynomial relationships between expected observations, called "phylogenetic invariants," were originally proposed as an inference tool in the late 1980s, but difficulties initially limited their development. After a survey of representative results in understanding invariants for various models, this talk will highlight some recent uses and directions for further development. For instance, in addition to potentially providing useful inference tools, invariants have also led to new results on how quite general models of sequence evolution will still produce recoverable phylogenetic signals (identifiability of models).
Eduardo Sontag
Control Systems Theory and a Qualitative/Quantitative Approach to Systems Biology
Modern biology is in need of powerful tools to help understand, organize, quantify, and conceptualize the properties of protein, gene, and metabolic networks. Among the issues of greatest interest are the analysis of information processing, signaling, robustness, feedback, and dynamical properties of such networks. To a large extent, these topics constitute the focus of control systems theory, which is a sophisticated and deeply developed field of mathematics and theoretical engineering. Systems theory allows one to constrain and predict internal structure from input/output experiments, quantify sensitivities, analyze the effects of feedback loops, and verify the controllability and observability of components and the identifiability of parameters.
Nevertheless, I argue in this talk that in spite of its immense success in engineering, "off the shelf" application of known control theory is not always appropriate. This is because detailed models are hard to come by: it is virtually impossible to experimentally validate the form of the nonlinearities used in reaction terms and, even when such forms are known, to estimate coefficients (parameters) accurately. New tools must be developed in order to bridge the "data-rich/data-poor" dichotomy that exists in systems biology. I illustrate this point by describing a new approach to the analysis of certain highly nonlinear dynamical systems. It blends qualitative and graph-theoretic network knowledge of the type often obtained from biological experiments with a small amount of "quantitative" data such as is obtained from steady state step responses. This novel approach emerged originally from the study of possible multi-stability or oscillations in feedback loops in cell signal transduction, but it turns out to be of more general applicability. The mathematical techniques rely heavily on the theory of monotone systems with inputs and outputs.
Bernd Sturmfels
Parametric Inference
This lecture gives an introduction to the mathematical underpinnings of parametric inference for statistical models used in biological sequence analysis. The relevant parts of the ASCB book are Sections 2.2, 2.3 and 3.4 and Chapters 5 through 10. In parametric inference, one computes the Newton polytope of a multivariate polynomial arising from the model and the data in question. We discuss results on the structure and complexity of these polytopes, and we take a closer look at the inference functions of hidden Markov models.
Seth Sullivant
What is Algebraic Statistics?
The emerging field of algebraic statistics advocates the use of polynomial algebra as a tool for statistical analysis. The underlying principle is that many natural families of probability distributions on discrete random variables are parametrized algebraic varieties. Knowing the polynomials which define these sets of probability distributions can be useful for making statistical inferences and provides a different viewpoint for some problems in probability theory. I will try to illustrate this point with examples from graphical models, phylogeny reconstruction and conditionally specified models.
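A standard textbook instance of the underlying principle is the independence model for two binary random variables: a 2x2 joint probability table factors into its marginals exactly when its determinant vanishes. The Python sketch below checks this polynomial condition with exact rational arithmetic. The helper names and the numerical values are assumptions made for illustration and are not taken from the talk.

```python
from fractions import Fraction


def independence_table(a, b):
    """Joint distribution of independent binary X, Y with P(X=1)=a, P(Y=1)=b."""
    return [
        [(1 - a) * (1 - b), (1 - a) * b],
        [a * (1 - b), a * b],
    ]


def determinant(p):
    """The defining polynomial p00*p11 - p01*p10 of the independence model."""
    return p[0][0] * p[1][1] - p[0][1] * p[1][0]


# A point on the variety: parameters chosen arbitrarily.
independent = independence_table(Fraction(1, 3), Fraction(1, 4))

# A point off the variety: X and Y are perfectly correlated.
dependent = [
    [Fraction(1, 2), Fraction(0)],
    [Fraction(0), Fraction(1, 2)],
]

det_independent = determinant(independent)
det_dependent = determinant(dependent)
```

Using `Fraction` keeps the check exact, so a zero determinant is a genuine zero of the polynomial rather than a floating-point near miss.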
Glenn Tesler
The Fragile Breakage versus Random Breakage Models of Chromosome Evolution
Co-authors: Qian Peng and Pavel Pevzner, Department of Computer Science and Engineering, University of California, San Diego.
For many years, studies of chromosome evolution were dominated by the random breakage theory, which implies that there are no rearrangement hot spots in the human genome. In 2003, Pevzner and Tesler argued against the random breakage model and proposed an alternative "fragile breakage" model of chromosome evolution. In 2004, Sankoff and Trinh argued against the fragile breakage model and raised doubts that Pevzner and Tesler provided any evidence of rearrangement hot spots.
We investigate whether Sankoff and Trinh indeed revealed a flaw in the Pevzner and Tesler arguments. We show that Sankoff and Trinh's synteny block generation algorithm is flawed and that their parameters do not reflect the realities of the comparative genomic architecture of human and mouse. We further argue that if Sankoff and Trinh had fixed these problems, their arguments in support of the random breakage model would disappear.
Sumio Watanabe
Algebraic Geometry and Singular Statistics
Many statistical models, for example normal mixtures, hidden Markov models, artificial neural networks, Bayesian networks, reduced rank regressions, and stochastic context-free grammars, are statistically singular. In these models, the set of parameters of a smaller model is an analytic set or algebraic variety with singularities within that of a larger model, hence their Fisher information matrices are not of full rank. In a singular statistical model, neither the maximum likelihood estimator nor the Bayes a posteriori distribution converges to the normal distribution. In this presentation, I introduce the resolution theorem in algebraic geometry, and explain the following three points. (1) The likelihood function can be made well-defined by blowing up, even in singular models. (2) The likelihood function, as an empirical process on the parameter set, converges to a Gaussian process on the analytic set. (3) The Bayes generalization error is equal to the largest pole of the zeta function of the statistical model.
