INVITED TALKS 
Chris Bailey-Kellogg, Dartmouth College 
John Capra, Gladstone Institute and Vanderbilt University 
Lenore Cowen, Tufts University 
Oliver Eulenstein, Iowa State University 
Andrey Rzhetsky, University of Chicago 
Hagit Shatkay, University of Delaware 
Amarda Shehu, George Mason University
Jinbo Xu, TTI and University of Chicago 
Wei Wang, University of California, Los Angeles 


CONTRIBUTED TALKS 
Jianwen Fang, University of Kansas 
Debashis Sahoo, Stanford University 
Andrew Wong, University of Delaware 


TITLES & ABSTRACTS


Chris Bailey-Kellogg, Dartmouth College
Optimization algorithms for the design of immunotolerant biotherapies


The explosive growth of biotherapeutic agents is revolutionizing treatment of numerous diseases, but innovations in biotherapies have also created new challenges for drug design and development. One distinguishing risk factor of therapeutic proteins is the prospect of eliciting an immune response in humans. To meet this challenge, we have developed optimization algorithms that minimize a protein's T cell epitope content while simultaneously ensuring that the engineered variant maintains a high level of stability and activity. Immunogenicity is assessed via T cell epitope predictors that score peptide binding potential against class II MHC molecules. The structural and functional consequences of deimmunizing mutations are evaluated with statistical sequence potentials and molecular mechanics force fields. Our algorithms then map the Pareto frontier of designs balancing both criteria. The application of these algorithms will be highlighted through comparative analysis with previously published deimmunization efforts as well as our collaborative experimental validation using beta-lactamase, a model therapeutic candidate with utility in ADEPT cancer therapies.
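The idea of mapping a Pareto frontier over two competing criteria can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the `Design` tuple, the variant names, and the numeric scores are all hypothetical, and both scores are assumed to be "lower is better" (an epitope score for immunogenicity and an energy for stability).

```python
from typing import List, Tuple

# Hypothetical representation: (variant name, epitope score, energy).
# Both scores are assumed to be "lower is better".
Design = Tuple[str, float, float]


def pareto_frontier(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design beats on both criteria.

    Exact duplicates in score space collapse to a single representative.
    """
    frontier: List[Design] = []
    best_energy = float("inf")
    # Sweep in order of increasing epitope score (ties broken by energy).
    # A design is kept only if its energy is strictly lower than that of
    # every design with an equal or lower epitope score.
    for design in sorted(designs, key=lambda d: (d[1], d[2])):
        if design[2] < best_energy:
            frontier.append(design)
            best_energy = design[2]
    return frontier


# Made-up candidate variants, for illustration only.
candidates = [
    ("wild_type", 10.0, -5.0),
    ("variant_1", 6.0, -4.0),
    ("variant_2", 6.0, -3.0),
    ("variant_3", 3.0, -1.0),
    ("variant_4", 8.0, -3.5),
]

for name, epitope_score, energy in pareto_frontier(candidates):
    print(name, epitope_score, energy)
```

Each design on the frontier represents a different trade-off: moving along it lowers the epitope score only at the cost of a less favorable energy.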


D.C. Osipovitch, A.S. Parker, C.D. Makokha, J. Desrosiers, W.C. Kett, L. Moise, C. Bailey-Kellogg, and K.E. Griswold, Design and analysis of immune-evading enzymes for ADEPT therapy, Protein Eng. Des. Sel., 2012, in press.



John Capra, Gladstone Institute and Vanderbilt University
ProteinHistorian: Tools for the Comparative Analysis of Eukaryote Protein Origin


The evolutionary history of a protein reflects the functional history of its ancestors. Recent phylogenetic studies identified distinct evolutionary signatures that characterize proteins involved in cancer, Mendelian disease, and different ontogenic stages. In this talk, I will introduce ProteinHistorian, a tool for identifying enrichment of specific evolutionary origins in protein sets of interest. ProteinHistorian's approach to analyzing phylogenetic origins is similar to that commonly used for Gene Ontology functional annotation enrichment analysis. Given an input protein set of interest, ProteinHistorian estimates the phylogenetic age of each protein and compares the resulting phylogenetic distribution to a relevant background set. ProteinHistorian allows considerable flexibility in the definition of protein age by including several algorithms for estimating ages from different databases of evolutionary relationships. To illustrate the utility of ProteinHistorian, I will describe its role in elucidating the evolutionary origin of a newly discovered regulatory mechanism unique to multicellular animals.
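The comparison of a protein set's age distribution against a background can be sketched in the style of Gene Ontology enrichment analysis. The abstract does not name the statistical test used, so the one-sided hypergeometric test below is an assumption, as are the function names, the age labels, and the toy data; the query set is assumed to be a subset of the background.

```python
from collections import Counter
from math import comb
from typing import Dict


def hypergeom_upper_tail(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when drawing N of M items without replacement,
    where n of the M items belong to the category of interest."""
    upper = min(n, N)
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / comb(M, N)


def age_enrichment(query: Dict[str, str], background: Dict[str, str]) -> Dict[str, float]:
    """Map each age category to a one-sided enrichment p-value.

    `query` and `background` map protein IDs to an age label; the labels
    here are hypothetical stand-ins for estimated phylogenetic ages.
    """
    background_counts = Counter(background.values())
    query_counts = Counter(query.values())
    M, N = len(background), len(query)
    return {
        age: hypergeom_upper_tail(query_counts[age], M, background_counts[age], N)
        for age in background_counts
    }


# Toy data: 10 background proteins, 4 of them assigned a "Metazoa" origin.
background = {f"p{i}": ("Metazoa" if i < 4 else "Eukaryota") for i in range(10)}
query = {pid: background[pid] for pid in ("p0", "p1", "p2")}

for age, p_value in age_enrichment(query, background).items():
    print(age, round(p_value, 4))
```

In practice many age categories are tested at once, so a multiple-testing correction would normally be applied to the resulting p-values.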


Capra JA, Williams AG, Pollard KS (2012) PLoS Comput Biol 8(6): e1002567.



Lenore Cowen, Tufts University 
SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone


One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models (HMMs). However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta sheets. These dependencies have been partially captured in the HMM setting by simulated evolution in the training phase and can be fully captured by Markov random fields (MRFs). However, the MRFs can be computationally prohibitive when beta strands are interleaved in complex topologies. We introduce SMURFLite, a method that combines both simplified MRFs and simulated evolution to substantially improve remote homology detection for beta structures. Unlike previous MRF-based methods, SMURFLite is computationally feasible on any beta-structural motif. We test SMURFLite on all propeller and barrel folds in the mainly-beta class of the SCOP hierarchy in stringent cross-validation experiments. We show a mean 26% (median 16%) improvement in area under curve (AUC) for beta-structural motif recognition as compared with HMMER (a well-known HMM method), a mean 33% (median 19%) improvement as compared with RAPTOR (a well-known threading method), and even a mean 18% (median 10%) improvement in AUC over HHpred (a profile-profile HMM method), despite HHpred's use of extensive additional training data. We demonstrate SMURFLite's ability to scale to whole genomes by running a SMURFLite library of 207 beta-structural SCOP superfamilies against the entire genome.


Daniels NM, Hosur R, Berger B, and Cowen LJ (2012) Bioinformatics (28): 1216.



Oliver Eulenstein, Iowa State University 
Supertrees and Supertree Problems


Modern sequencing techniques have provided deep and rich data for phylogenetic inference. However, along with the new data came the challenging problem of how to infer accurate, large-scale, comprehensive, and well-resolved phylogenetic trees from the data. A common way to address this challenge is to exploit the many phylogenetic trees available in tree databases by solving supertree problems. Supertree problems seek a species tree that represents a collection of typically discordant trees with different taxa as well as possible, guided by some objective. These problems have been used to infer large-scale phylogenies for several critical biological groups. Perhaps most notably, supertree problems were used to derive the first nearly complete species-level phylogeny of extant mammals. Furthermore, supertree studies have been used in addressing conservation issues, biodiversity hotspots, and responses to global change.


Today, a large variety of different types of supertree problems have been developed. For many of these problems time complexity classes and theoretical properties have been well studied. Unfortunately, nearly all of the supertree problems that are typically used in practice are NP-hard. However, effective heuristics have been developed to address most of them.


In my talk, I will give an overview of supertree problems, categorizing them by their design principles, theoretical properties, and algorithmic solutions. Finally, I will present an efficient knowledge-enhanced approach for exact supertree construction that allows existing evolutionary information to be included in order to reduce the complexity of the solution space.



Andrey Rzhetsky, University of Chicago 
Modeling and inferring genetic overlap between disease phenotypes from clinical data.


Geneticists and epidemiologists often observe that certain hereditary disorders co-occur in individual patients significantly more (or significantly less) frequently than expected, suggesting that there is genetic variation that predisposes its bearer to multiple disorders, or that protects against some disorders while predisposing to others. We suggest that, by using a large number of phenotypic observations about multiple disorders and an appropriate statistical model, we can infer genetic overlaps between phenotypes.



Hagit Shatkay, University of Delaware 
What We Found on Our Way to Building a Classifier: A Critical Analysis of the Screening Questionnaire for Young Athletes.


The American Heart Association (AHA) has recommended a 12-element questionnaire for pre-participation screening of athletes (before they take on athletic activities), in order to reduce and hopefully prevent sudden cardiac death in young athletes. This screening procedure is widely used throughout the United States. As part of a study on cardiovascular disorders in young athletes, we set out to pursue a classification task, namely training a machine-learning-based classifier to automatically categorize several hundred athletes into risk levels based on their respective answers to the AHA questionnaire. Surprisingly, rather than producing such a classifier, the classification results, along with an analysis of the information content of the questions, suggest that the AHA-recommended procedure does not effectively distinguish between Normal and Non-normal hearts as identified by cardiologists using Electro- and Echo-cardiogram examinations. The talk will describe the study, the unexpected results, and some of the conclusions.


Joint work with: Quazi Abidur Rahman (School of Computing, Queen's University, Kingston, ON) and Sivajothi Kanagalingam, Aurelio Pinheiro & Theodore Abraham (Heart and Vascular Institute, Johns Hopkins University, Baltimore, MD)



Amarda Shehu, George Mason University

Probabilistic Search Frameworks for Modeling Structures and Motions of Protein Systems

Elucidating the structures and motions that protein systems employ for biological activity is central to understanding biology and treating disease, but it is challenging in silico. Our recent work focuses on modeling biologically-active structures of a protein and mapping out the conformational transitions between them. Our algorithms employ ideas from evolutionary computation to address hard non-linear optimization problems in the context of modeling biologically-active structures from amino-acid sequence information. Robotics-inspired ideas are employed to obtain a probabilistic search framework that is shown to be versatile and effective in modeling both structures and the transitions between them. Inspired by the use of subdivisions and projections of the robot configurational space in sampling-based motion planning, a tree-based search employs projections of the explored conformational space and underlying energy surface in order to adaptively guide the search and direct computational resources to relevant regions of the search space. Applications and analyses on diverse small-to-medium size proteins show enhanced sampling of the conformational space and effective modeling of biologically-active structures. In addition, conformational transitions are obtained that connect functional states of significant structural dissimilarity in multimodal protein systems.

Brian Olson and Amarda Shehu. Evolutionary-inspired Probabilistic Search for Enhancing Sampling of Local Minima in the Protein Energy Surface. Proteome Science 2012, 10(Suppl1): S5.

Kevin Molloy and Amarda Shehu. Adaptive Tree-based Search to Sample Conformational Paths Connecting Protein Functional States, BMC Struct Biol (to appear).


Wei Wang, University of California, Los Angeles 
Mining Genetic Interactions in Genome-Wide Association Study


Advanced biotechnologies have made high-throughput data collection feasible in human and other model organisms. The availability of such data holds promise for dissecting complex biological processes. Making sense of the flood of biological data poses great statistical and computational challenges. I will discuss the problem of mining gene-gene interactions in high-throughput genetic data. Finding genetic interactions is an important biological problem since many common diseases are caused by the joint effects of genes. Previously, it was considered intractable to find genetic interactions at the whole-genome scale due to the enormous search space. The problem was commonly addressed using heuristics, which do not guarantee the optimality of the solution. I will show that by utilizing an upper bound of the test statistic and effectively indexing the data, we can dramatically prune the search space and reduce the computational burden. Moreover, our algorithms are guaranteed to find the optimal solution. In addition to handling specific statistical tests, our algorithms can be applied to a wide range of study types by utilizing convexity, a property shared by many commonly used statistics.
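The pruning principle described above, skipping candidate pairs whose upper bound cannot beat the best score found so far while still returning an optimal pair, can be sketched on a toy problem. Everything here is an assumption made for illustration: the binary genotype encoding, the centered phenotype vector `y`, the stand-in statistic `pair_score`, and its bound are not the tests or bounds used in the talk.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def pair_score(xi: Sequence[int], xj: Sequence[int], y: Sequence[float]) -> float:
    """Toy interaction statistic: |sum of y over samples carrying both variants|."""
    return abs(sum(a * b * t for a, b, t in zip(xi, xj, y)))


def upper_bound(xi: Sequence[int], y: Sequence[float]) -> float:
    """Bound on pair_score for any pair containing xi.

    A partner variant can only select a subset of xi's carriers, so the
    score cannot exceed the larger of the positive and negative mass of y
    among those carriers.
    """
    pos = sum(t for a, t in zip(xi, y) if a and t > 0)
    neg = sum(-t for a, t in zip(xi, y) if a and t < 0)
    return max(pos, neg)


def best_pair_pruned(
    genotypes: List[Sequence[int]], y: Sequence[float]
) -> Tuple[float, Optional[Tuple[int, int]], int]:
    """Return (best score, best pair, number of pairs actually evaluated)."""
    bounds = [upper_bound(x, y) for x in genotypes]
    # Visit variants in order of decreasing bound so pruning triggers early.
    order = sorted(range(len(genotypes)), key=lambda i: -bounds[i])
    best, best_pair, evaluated = -1.0, None, 0
    for rank, i in enumerate(order):
        if bounds[i] <= best:
            break  # no remaining pair can beat the current best
        for j in order[rank + 1:]:
            if bounds[j] <= best:
                break  # later partners have even smaller bounds
            evaluated += 1
            score = pair_score(genotypes[i], genotypes[j], y)
            if score > best:
                best, best_pair = score, (min(i, j), max(i, j))
    return best, best_pair, evaluated


def best_score_brute_force(genotypes: List[Sequence[int]], y: Sequence[float]) -> float:
    """Exhaustive reference used to check the pruned search."""
    return max(
        pair_score(genotypes[i], genotypes[j], y)
        for i, j in combinations(range(len(genotypes)), 2)
    )


# Made-up data: 4 variants observed in 6 samples.
y = [2.0, 1.0, -1.0, -2.0, 1.5, -1.5]
genotypes = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
]

score, pair, evaluated = best_pair_pruned(genotypes, y)
print(score, pair, evaluated, best_score_brute_force(genotypes, y))
```

Because a pair is skipped only when its bound is no larger than a score already achieved, the pruned search returns the same maximum as the exhaustive one while evaluating fewer pairs.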



Jinbo Xu, TTI and University of Chicago
Deriving protein statistical potential using the Boltzmann law and machine learning


Although protein energy potentials have been studied extensively, designing a highly accurate one is still challenging. Many knowledge-based statistical potentials are derived from the inverse of the Boltzmann law and consist of two major components: the observed atomic interaction probability and the reference state. These potentials differ mainly in the reference state and use a similar simple counting method to estimate the observed probability, which is usually assumed to depend only on atom types. This work takes a rather different view of the observed probability and parameterizes it by the protein sequence profile context of the atoms and the radius of gyration, in addition to atom types. Experiments confirm that our position-specific statistical potential outperforms the currently popular ones in several decoy discrimination tests. Our results imply that, in addition to the reference state, the observed probability also differentiates energy potentials, and that evolutionary information greatly boosts their performance.
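For orientation, the conventional counting approach that the abstract contrasts itself with can be sketched via the inverse Boltzmann relation, E = -kT ln(P_obs / P_ref). This is a generic illustration of that baseline and not the position-specific method of the talk; the distance-bin layout, the pseudocount, the function name, and the counts are all hypothetical.

```python
import math
from typing import List, Sequence


def statistical_potential(
    observed_counts: Sequence[float],
    reference_counts: Sequence[float],
    kT: float = 1.0,
    pseudocount: float = 1.0,
) -> List[float]:
    """Energy per distance bin from the inverse Boltzmann relation.

    observed_counts: how often a given atom-type pair is seen in each bin.
    reference_counts: counts for the chosen reference state in each bin.
    The pseudocount (an assumption of this sketch) avoids log(0) in empty bins.
    """
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Made-up counts for one atom-type pair over three distance bins.
observed = [30, 10, 5]
reference = [15, 15, 15]
print(statistical_potential(observed, reference))
```

Bins observed more often than the reference state predicts receive negative (favorable) energies, and bins observed less often receive positive ones.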


Zhao and Xu (2012) Structure 20:1118.



Jianwen Fang, University of Kansas
Protein structures and sequence mining for improving computational design of stable proteins.


The ability to design proteins with enhanced stability is important both theoretically and practically. Computational methods for predicting stabilizing mutations are attractive due to their potential low cost and time-saving properties over current experimental approaches. Despite extensive studies in the past decade, effective and robust computational algorithms for designing thermo-stable proteins are still in critical demand.


In this presentation, I will describe several novel algorithms for predicting stabilizing mutations of proteins based on large-scale protein structure and sequence mining. The main focus will be PROTS, a protein thermo-stability potential based on sequential and structural four-residue fragments. PROTS is derived from a non-redundant representative collection of thousands of thermophilic and mesophilic protein structures and a large set of point mutations with experimentally determined changes of melting temperatures. To the best of our knowledge, PROTS is the first protein stability predictor based on integrated analysis and mining of these two types of data. Besides conventional cross validation and blind testing, we introduce hypothetical reverse mutations as a means of testing the robustness of protein thermo-stability predictors. In all tests, PROTS demonstrates the ability to reliably predict mutation-induced thermo-stability changes as well as classify thermophilic and mesophilic proteins. In addition, this white-box predictor allows easy interpretation of the factors that influence mutation-induced protein stability changes at the residue level.


Li Y, Zhang J, Tai D, Middaugh CR, Zhang Y, Fang J. Proteins (2011).



Debashis Sahoo, Stanford University
Three differentiation states risk-stratify bladder cancer into distinct subtypes


Current clinical judgment in bladder cancer (BC) relies primarily on pathological stage and grade. We investigated whether a molecular classification of tumor cell differentiation, based on a developmental biology approach, can provide additional prognostic information. Exploiting large preexisting gene-expression databases, we developed a biologically supervised computational model to predict markers that correspond with BC differentiation. To provide mechanistic insight, we assessed relative tumorigenicity and differentiation potential via xenotransplantation. We then correlated the prognostic utility of the identified markers to outcomes within gene expression and formalin-fixed paraffin-embedded (FFPE) tissue datasets. Our data indicate that BC can be subclassified into three subtypes, on the basis of their differentiation states: basal, intermediate, and differentiated, where only the most primitive tumor cell subpopulation within each subtype is capable of generating xenograft tumors and recapitulating downstream populations. We found that keratin 14 (KRT14) marks the most primitive differentiation state that precedes KRT5 and KRT20 expression. Furthermore, KRT14 expression is consistently associated with worse prognosis in both univariate and multivariate analyses. We identify here three distinct BC subtypes on the basis of their differentiation states, each harboring a unique tumor-initiating population.


Jens-Peter Volkmer, Debashis Sahoo, Robert Chin, Philip Ho, Chad Tang, Antonina Kurtova, Stephen Willingham, Senthil Pazhanisamy, Humberto Contreras-Trujillo, Theresa Storm, Yair Lotan, Andrew Beck, Benjamin Chung, Ash Alizadeh, Guilherme Godoy, Seth Lerner, Matt van de Rijn, Linda Shortliffe, Irving Weissman and Keith Chan. PNAS (2012).



Andrew Wong, Queen's University, Kingston, ON
Protein function prediction using text-based features extracted from the biomedical literature: The CAFA Challenge.


Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand.

 

Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors.

 

Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins, using text-based features derived from PubMed abstracts associated with each protein. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge.

 

Andrew Wong and Hagit Shatkay. Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge. BMC Bioinformatics (to appear).