Papers by: Google Scholar | DBLP | CiteSeer | CSB
Papers by year:
-
Creswell, C.Y., Beal, M.J., Chen, J., Cornell T., Nilsson, L. and Srihari, R.K.
Automatically Extracting Nominal Mentions of Events with a Bootstrapped Probabilistic Classifier
To appear in Proceedings of CoLing-ACL 2006 (International Committee on Computational Linguistics and the Association for Computational Linguistics), 17-21 July, Sydney, Australia, 2006. (poster)
[pdf]
[abstract]
Most approaches to event extraction focus on mentions anchored in
verbs. However, many mentions of events surface as noun
phrases. Detecting them can increase the recall of event extraction
and provide the foundation for detecting relations between events.
This paper describes a weakly-supervised method for detecting nominal
event mentions that combines techniques from word sense disambiguation
(WSD) and lexical acquisition to create a classifier that labels noun
phrases as denoting events or non-events. The classifier uses
bootstrapped probabilistic generative models of the contexts of events
and non-events. The contexts are the lexically-anchored semantic
dependency relations that the NPs appear in. Our method dramatically
improves with bootstrapping, and comfortably outperforms lexical
lookup methods based on much larger hand-crafted resources.
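A rough sketch of the bootstrapping loop described above, assuming a naive Bayes generative model over context features; the names and thresholds here are illustrative, not the system from the paper:

    import math
    from collections import Counter

    def train(labelled):
        # labelled: list of (context_features, label) pairs,
        # with label in {"event", "non-event"}.
        priors = Counter()
        counts = {"event": Counter(), "non-event": Counter()}
        for feats, lab in labelled:
            priors[lab] += 1
            counts[lab].update(feats)
        return priors, counts

    def p_event(feats, priors, counts, alpha=1.0):
        # Smoothed naive Bayes posterior P(event | context features).
        vocab = len(set(counts["event"]) | set(counts["non-event"]))
        logp = {}
        for lab in ("event", "non-event"):
            total = sum(counts[lab].values())
            s = math.log(priors[lab] + alpha)
            for f in feats:
                s += math.log((counts[lab][f] + alpha) / (total + alpha * vocab))
            logp[lab] = s
        m = max(logp.values())
        return math.exp(logp["event"] - m) / sum(math.exp(v - m) for v in logp.values())

    def bootstrap(seed, unlabelled, rounds=5, threshold=0.9):
        # Refit on the model's own most confident labellings each round.
        labelled = list(seed)
        for _ in range(rounds):
            priors, counts = train(labelled)
            confident = [(f, "event" if p > threshold else "non-event")
                         for f in unlabelled
                         for p in [p_event(f, priors, counts)]
                         if p > threshold or p < 1 - threshold]
            labelled = list(seed) + confident
        return train(labelled)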
-
Beal, M.J. and Krishnamurthy, P.
Clustering Gene Expression Time Course Data with Countably Infinite Hidden Markov Models
To appear in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), 13-16 July, MIT, 2006.
Also presented by Praveen at NESCAI'06, Cornell, April 28-29 2006.
[pdf]
[abstract]
Most existing approaches to clustering gene expression time course data treat the different time points as independent dimensions and are invariant to permutations, such as reversal, of the experimental time course. Approaches utilizing HMMs have been shown to be helpful in this regard, but are hampered by having to choose model architectures with appropriate complexities. Here we propose for a clustering application an HMM with a countably infinite state space; inference in this model is possible by recasting it in the hierarchical Dirichlet process (HDP) framework (Teh et al. 2006), and hence we call it the HDP-HMM. We show that the infinite model outperforms model selection methods over finite models, and traditional time-independent methods, as measured by a variety of external and internal indices for clustering on two large publicly available data sets. Moreover, we show that the infinite models utilize more hidden states and employ richer architectures (e.g. state-to-state transitions) without the damaging effects of overfitting.
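For orientation, the generative process of such an HDP-HMM can be written in the standard notation of Teh et al. (2006) (this is the generic formulation, not necessarily the paper's exact parameterisation):

    \beta \sim \mathrm{GEM}(\gamma)                    % global stick-breaking weights
    \pi_k \mid \beta \sim \mathrm{DP}(\alpha, \beta)   % transition row out of state k
    s_t \mid s_{t-1} \sim \pi_{s_{t-1}}                % hidden state sequence
    y_t \mid s_t \sim F(\theta_{s_t})                  % emission, e.g. Gaussian expression levels

Sharing the base weights \beta across all transition rows is what lets the chain revisit a common, countably infinite pool of states.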
-
Beal, M.J. and Ghahramani, Z.
Variational Bayesian Learning of Directed Graphical Models with Hidden Variables
To appear in Bayesian Analysis 1(4), 2006.
[preprint pdf]
[abstract]
A key problem in statistics and machine learning is inferring suitable
structure of a model given some observed data. A Bayesian approach to
model comparison makes use of the marginal likelihood of each
candidate model to form a posterior distribution over models;
unfortunately for most models of interest, notably those containing
hidden or latent variables, the marginal likelihood is intractable to
compute.
We present the variational Bayesian (VB) algorithm for directed
graphical models, which optimises a lower bound approximation to the
marginal likelihood in a procedure similar to the standard EM
algorithm. We show that for a large class of models, which we call
conjugate exponential, the VB algorithm is a straightforward
generalisation of the EM algorithm that incorporates uncertainty over
model parameters. In a thorough case study using a small class of
bipartite DAGs containing hidden variables, we compare the accuracy of
the VB approximation to existing asymptotic-data approximations such
as the Bayesian Information Criterion (BIC) and the Cheeseman-Stutz
(CS) criterion, and also to a sampling based gold standard, Annealed
Importance Sampling (AIS). We find that the VB algorithm is
empirically superior to CS and BIC, and much faster than AIS.
Moreover, we prove that a VB approximation can always be constructed in
such a way that guarantees it to be more accurate than the CS
approximation.
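In outline, the quantity optimised by VB is the standard lower bound (generic form, written here for observed data y, hidden variables x and parameters \theta):

    \ln p(y \mid m) \;\ge\; \mathcal{F}(q_x, q_\theta)
      = \int q_x(x)\, q_\theta(\theta) \ln \frac{p(y, x, \theta \mid m)}{q_x(x)\, q_\theta(\theta)} \, dx \, d\theta ,

and the VB EM iteration alternates the VB-E step,
    q_x(x) \propto \exp \langle \ln p(y, x \mid \theta, m) \rangle_{q_\theta} ,
with the VB-M step,
    q_\theta(\theta) \propto p(\theta \mid m) \exp \langle \ln p(y, x \mid \theta, m) \rangle_{q_x} ,
each of which can only increase \mathcal{F}.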
-
Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Hierarchical Dirichlet Processes
To appear, Journal of the American Statistical Association, December 2006.
[preprint pdf] [older tech report version]
[abstract]
We consider problems involving groups of data, where each observation within a group is
a draw from a mixture model, and where it is desirable to share mixture components between
groups. We assume that the number of mixture components is unknown a priori and is to be
inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one
for each group, where the well-known clustering property of the Dirichlet process provides a
nonparametric prior for the number of mixture components within each group. Given our desire
to tie the mixture models in the various groups, we consider a hierarchical model, specifically
one in which the base measure for the child Dirichlet processes is itself distributed according to
a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily
share atoms. Thus, as desired, the mixture models in the different groups necessarily share
mixture components. We discuss representations of hierarchical Dirichlet processes in terms of
a stick-breaking process, and a generalization of the Chinese restaurant process that we refer
to as the "Chinese restaurant franchise." We present Markov chain Monte Carlo algorithms
for posterior inference in hierarchical Dirichlet process mixtures, and describe applications to
problems in information retrieval and text modelling.
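The stick-breaking representation referred to above takes the following standard form (usual notation):

    G_0 = \sum_{k=1}^{\infty} \beta_k \, \delta_{\phi_k}, \qquad \beta \sim \mathrm{GEM}(\gamma), \quad \phi_k \sim H,
    G_j = \sum_{k=1}^{\infty} \pi_{jk} \, \delta_{\phi_k}, \qquad \pi_j \sim \mathrm{DP}(\alpha_0, \beta),

so each group-level measure G_j places all of its mass on the same countable set of atoms \{\phi_k\} as the base measure G_0, which is what forces the groups to share mixture components.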
-
Sridharan, K., Beal, M.J. and Govindaraju, V.
Competitive Mixtures of Neurons
To appear at the 18th International Conference on Pattern Recognition (ICPR'06).
[pdf]
[abstract]
We propose a competitive finite mixture of neurons (or perceptrons) for solving binary classification problems. Our classifier includes a prior for the weights between different neurons such that it prefers mixture models made up from neurons having classification boundaries as orthogonal to each other as possible. We derive an EM algorithm for learning the mixing proportions and weights of each neuron, consisting of an exact E step and a partial M step, and show that our model covers the regions of high posterior probability in weight space and tends to reduce overfitting. We demonstrate the way in which our mixture classifier works using a toy 2-dimensional data set, showing the effective use of strategically positioned components in the mixture. We further compare its performance against SVMs and one-hidden-layer neural networks on four real-world data sets from the UCI repository, and show that even a relatively small number of neurons with appropriate competitive priors can achieve superior classification accuracies on held-out test data.
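A minimal sketch of the exact E step for a mixture of logistic neurons (the orthogonality-encouraging prior and the partial M step are omitted; names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def e_step(X, y, W, mix):
        # X: (n, d) inputs; y: (n,) labels in {0, 1};
        # W: (K, d) weights of the K neurons; mix: (K,) mixing proportions.
        p1 = sigmoid(X @ W.T)                     # (n, K): P(y=1 | x, w_k)
        lik = np.where(y[:, None] == 1, p1, 1.0 - p1)
        r = mix[None, :] * lik                    # unnormalised responsibilities
        return r / r.sum(axis=1, keepdims=True)   # P(component k | x, y)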
-
Ghosh, J., Beal, M.J., Ngo, H. and Qiao, C.
On Profiling Mobility and Predicting Locations of Campus-wide Wireless Network Users
To appear in ACM/SIGMOBILE REALMAN, in conjunction with ACM MobiHoc, Florence, Italy, May 2006.
[pdf]
[related TR]
[abstract]
In this paper we analyze a year-long mobility trace of wireless network users collected on the ETH Zurich campus. Unlike earlier work in Chen et al. (2005) and Ghosh et al. (2005a), we profile the movement patterns of wireless users and predict their locations. More specifically, we show that each network user regularly visits a list of places, such as buildings (also referred to as "hubs"), each with some probability. The daily list of hubs, along with their corresponding visit probabilities, is referred to as a mobility profile. We also show that over a period of time (e.g., a week), a user may repeatedly follow a mixture of mobility profiles, with certain probabilities associated with each of the profiles. Our analysis of the mobility trace data not only validates the existence of our so-called sociological orbits (Ghosh et al., 2005b), but also demonstrates the advantages of exploiting them in performing hub-level location predictions. In particular, we show that such profile-based location predictions are more precise than common statistical approaches based on observed hub visitation frequencies alone.
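To make the prediction rule concrete, here is a hypothetical sketch (the data structures are illustrative, not the paper's): with a hub-visit distribution per profile and a weight per profile, the predicted hub distribution is a mixture.

    def predict_hub_distribution(profiles, weights):
        # profiles: list of {hub: visit probability}; weights: P(profile).
        mixture = {}
        for profile, w in zip(profiles, weights):
            for hub, p in profile.items():
                mixture[hub] = mixture.get(hub, 0.0) + w * p
        return mixture

    profiles = [{"library": 0.6, "cafeteria": 0.4},
                {"lab": 0.7, "library": 0.3}]
    weights = [0.25, 0.75]
    prediction = predict_hub_distribution(profiles, weights)
    print(max(prediction.items(), key=lambda kv: kv[1]))  # ('lab', 0.525)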
-
Srinivasan, H., Srihari, S.N., Beal, M.J., Fang, G. and Phatak, P.
Comparison of ROC-based and Likelihood methods for Fingerprint Verification
Proc. Biometric Technology for Human Identification III: SPIE Defense and Security Symposium, Orlando, FL, April 17-22, 2006.
[pdf]
[abstract]
The fingerprint verification task answers the question of whether or not two fingerprints belong to the same
finger. The paper focuses on the classification aspect of fingerprint verification. Classification is the third and
final step after the two earlier steps of feature extraction, where a known set of features (minutiae points)
has been extracted from each fingerprint, and scoring, where a matcher has determined a degree of match between
the two sets of features. Since this is a binary classification problem involving a single variable, the commonly
used threshold method is related to the so-called receiver operating characteristic (ROC). In the ROC approach
the optimal threshold on the score is chosen so as to decide match or non-match. Such a method works
well when there is a well-registered fingerprint image. On the other hand, more sophisticated methods are needed
when there exists a partial imprint of a finger as in the case of latent prints in forensics or due to limitations
of the biometric device. In such situations it is useful to consider classification methods based on computing the
likelihood ratio of match/non-match. Such methods are commonly used in some biometric and forensic domains
such as speaker verification where there is a much higher degree of uncertainty. This paper compares the two
approaches empirically for the fingerprint classification task when the number of available minutiae is varied.
In both the ROC-based and likelihood ratio methods, learning is from a general population of pairs,
each of which is labeled as being from the same finger or from different fingers. In the ROC-based method the
best operating point is derived from the ROC curve. In the likelihood method the distributions of same finger
and different finger scores are modeled using Gaussian and Gamma distributions. The performances of the two
methods are compared for varying numbers of available minutiae points. Results show that the likelihood method
performs better than the ROC-based method when fewer minutiae points are available; both methods converge to
the same accuracy as more minutiae points become available.
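A sketch of the likelihood-ratio classifier (the Gaussian/Gamma choice follows the abstract; the training scores here are synthetic placeholders, not the paper's data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    same = rng.normal(8.0, 1.5, 500)   # placeholder same-finger match scores
    diff = rng.gamma(2.0, 1.5, 500)    # placeholder different-finger scores

    mu, sd = stats.norm.fit(same)                  # Gaussian for same finger
    a, loc, scale = stats.gamma.fit(diff, floc=0)  # Gamma for different fingers

    def llr(score):
        # Log-likelihood ratio for the hypothesis "same finger".
        return (stats.norm.logpdf(score, mu, sd)
                - stats.gamma.logpdf(score, a, loc, scale))

    print(llr(9.0) > 0)  # a positive LLR favours "same finger"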
-
Ghosh, J., Beal, M.J., Ngo, H. and Qiao, C.
On Profiling Mobility and Predicting Locations of Campus-wide Wireless Network Users
Technical Report # 2005-27 (December), Department of Computer Science and Engineering, University at Buffalo, SUNY, 2006.
[pdf]
[abstract]
In this paper, we analyze a year-long mobility trace of wireless network users collected on the ETH Zurich campus. Unlike earlier work in Chen et al. (2005), Kim et al. (2006), and Tuduce and Gross (2005), we profile the movement patterns of wireless users and predict their locations. More specifically, we show that each network user regularly visits a list of places, such as buildings (also referred to as "hubs"), each with some probability. The daily list of hubs, along with their corresponding visit probabilities, is referred to as a mobility profile. We also show that over a period of time (e.g., a week), a user may repeatedly follow a mixture of mobility profiles, with certain probabilities associated with each of the profiles. Our analysis of the mobility trace data not only validates the existence of our so-called sociological orbits (Ghosh et al., 2005), but also demonstrates the advantages of exploiting them in performing hub-level location predictions. Moreover, such profile-based location predictions are found not only to be more precise than a common statistical approach based on observed hub visitation frequencies, but also to incur a much lower overhead. We further illustrate the benefit of profiling users' mobility by discussing relevant work and suggesting applications in different types of wireless networks, including mobile ad hoc networks.
-
Beal, M.J., Falciani, F., Ghahramani Z., Rangel, C. and Wild, D.L.
A Bayesian Approach to Reconstructing Genetic Regulatory Networks with Hidden Factors
In Bioinformatics 21:349-356, 2005.
[web abstract]
[pdf]
supplementary data page
VBLDS software
[abstract]
Motivation: We have used state-space models to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. State-space models are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects which cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation etc.
Results: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing 'gene-gene' interactions over time. In this paper, variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long-term behavior (e.g. programmed cell death) of these cells.
Availability: Supplementary data is available at http://public.kgi.edu/~wild/index.htm and Matlab source code for variational Bayesian learning of state-space models is available at http://www.cs.toronto.edu/~beal/software.html.
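The underlying state-space model has the standard linear-Gaussian form (generic notation; the paper's parameterisation may include additional input terms):

    x_t = A x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q),
    y_t = C x_t + v_t,     \quad v_t \sim \mathcal{N}(0, R),

where y_t collects the observed expression levels at time t and x_t the hidden factors (unmeasured genes, regulatory protein levels, degradation effects); the 'gene-gene' interactions over time are read off from the learned model matrices.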
-
Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes
In Neural Information Processing Systems 17:1385-1392, MIT Press, 2005.
[ps]
[pdf]
[ps.gz]
[pdf.gz]
[djvu]
[bibtex]
NIPS 2004
project page
[abstract]
We propose the hierarchical Dirichlet process (HDP), a nonparametric
Bayesian model for clustering problems involving multiple groups of
data. Each group of data is modeled with a mixture, with the number of
components being open-ended and inferred automatically by the model.
Further, components can be shared across groups, allowing dependencies
across groups to be modeled effectively as well as conferring generalization
to new groups. Such grouped clustering problems occur often in
practice, e.g. in the problem of topic discovery in document corpora. We
report experimental results on three text corpora showing the effective
and superior performance of the HDP over previous models.
-
Srinivasan, H., Srihari, S.N. and Beal, M.J.
Signature Verification using Kolmogorov-Smirnov Statistic
In Proceedings of the 12th Conference of the International Graphonomics Society (IGS), Salerno, Italy, June 2005.
[pdf]
[abstract]
Automatic signature verification of scanned documents is presented here. The verification strategy is applicable in scenarios where there are multiple knowns (genuine signature samples) from a writer. First, the learning process involves learning the variation and similarities among the known genuine samples from the given writer; the classification problem then answers the question of whether or not a given questioned sample belongs to the ensemble of known samples. The learning strategy compares pairs of signature samples from amongst the known samples to obtain a distribution in distance space that represents the variation amongst samples for that particular writer. The corresponding classification method compares the questioned sample with all the available knowns to obtain another distribution in distance space. The classification task is then to compare the two distributions to obtain a probability of their similarity, which represents the probability of the questioned sample belonging to the ensemble of knowns. These strategies are applied to the problem of signature verification and performance results are presented.
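A minimal sketch of the comparison step using the two-sample Kolmogorov-Smirnov statistic (the distance function and names are placeholders, not the paper's features):

    import numpy as np
    from scipy.stats import ks_2samp

    def pairwise_distances(samples, dist):
        # Distances among the known genuine samples of one writer.
        return [dist(a, b) for i, a in enumerate(samples)
                           for b in samples[i + 1:]]

    def verify(knowns, questioned, dist):
        within = pairwise_distances(knowns, dist)        # known vs known
        against = [dist(questioned, k) for k in knowns]  # questioned vs known
        # KS p-value: a measure of how plausibly both distance samples
        # come from the same underlying distribution.
        return ks_2samp(within, against).pvalue

    euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))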
-
Srinivasan, H., Beal, M.J. and Srihari, S.N.
Machine Learning Approaches for Person Identification and Verification
In Proceedings of the SPIE Defense and Security Symposium, Orlando, FL, March 2005.
[pdf]
[abstract]
New machine learning strategies are proposed for person identification, which can be used in several biometric modalities such as friction ridges, handwriting, signatures and speech. The biometric or forensic performance task answers the question of whether or not a sample belongs to a known person. Two different learning paradigms are discussed: person-independent (or general learning) and person-dependent (or person-specific learning). In the first paradigm, learning is from a general population of pairs, each of which is labelled as being from the same person or from different persons--the learning process determines the range of variations for given persons and between different persons. In the second paradigm the identity of a person is learnt when presented with multiple known samples of that person--where the variation and similarities within a particular person are learnt. The person-specific learning strategy is seen to perform better than general learning (5% higher performance with signatures). Improvement of person-specific performance with an increasing number of samples is also observed.
-
Srihari, S.N., Beal, M.J., Bandi, K., Shah, V. and Krishnamurthy, P.
A Statistical Model for Writer Verification
In Proc. 8th International Conference on Document Analysis and Recognition (ICDAR), Seoul, Korea, August 2005, pp. 1105-1109.
[pdf]
[abstract]
A statistical model for determining whether a pair of documents, a known and a questioned, were written by the same individual is proposed. The model has the following four components: (i) discriminating elements, e.g., global features and characters, are extracted from each document, (ii) differences between corresponding elements from each document are computed, (iii) using conditional probability estimates of each difference, the log-likelihood ratio (LLR) is computed for the hypotheses that the documents were written by the same or different writers; the conditional probability estimates themselves are determined from labelled samples using either Gaussian or gamma estimates for the differences assuming their statistical independence, and (iv) distributions of the LLRs for same and different writer LLRs are analyzed to calibrate the strength of evidence into a standard nine-point scale used by questioned document examiners. The model is illustrated with experimental results for a specific set of discriminating elements.
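Under the independence assumption in component (iii), the LLR decomposes in the standard way:

    \mathrm{LLR} = \sum_i \ln \frac{p(d_i \mid \text{same writer})}{p(d_i \mid \text{different writers})},

where d_i is the difference computed for the i-th discriminating element and each conditional density is the Gaussian or gamma estimate fit to labelled samples.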
-
Attias, H.T. and Beal, M.J.
Tree of Latent Mixtures for Bayesian Modelling and Classification of High Dimensional Data
Technical Report No. 2005-06. Dept. Computer Science and Engineering, University at Buffalo, SUNY. January 1, 2005.
[pdf]
[ps.gz]
[abstract]
Many domains of interest to machine learning, such as audio and video, computational biology, climate modelling, and quantitative finance, involve very high dimensional data. Such data are often characterized by statistical structure that includes correlations on multiple scales. Attempts to model those high dimensional data raise problems that are much less significant in low dimensions. This paper presents a novel approach to modelling and classification in high dimensions. The approach is based on a graphical model with a tree structure, which performs density estimation on multiple spatial scales. Exact inference in this model is computationally intractable; two algorithms for approximate inference using variational techniques are derived and analyzed. We discuss learning and classification using these algorithms, and demonstrate their performance on a handwritten digit recognition task.
-
Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Hierarchical Dirichlet Processes
Technical Report 653, UC Berkeley Statistics, 2004.
[JASA version]
[pdf]
project page
[abstract]
We consider problems involving groups of data, where each observation
within a group is a draw from a mixture model, and where it is
desirable to share mixture components between groups.
We assume that the number of mixture components is unknown a priori
and is to be inferred from the data. In this setting it is natural
to consider sets of Dirichlet processes, one for each group, where
the well-known clustering property of the Dirichlet process provides
a nonparametric prior for the number of mixture components within
each group. Given our desire to tie the mixture models in the
various groups, we consider a hierarchical model, specifically one
in which the base measure for the child Dirichlet processes is
itself distributed according to a Dirichlet process. Such a base
measure being discrete, the child Dirichlet processes necessarily
share atoms. Thus, as desired, the mixture models in the different
groups necessarily share mixture components. We discuss representations
of hierarchical Dirichlet processes in terms of a stick-breaking
process, and a generalization of the Chinese restaurant process
that we refer to as the ``Chinese restaurant franchise.'' We present
Markov chain Monte Carlo algorithms for posterior inference in
hierarchical Dirichlet process mixtures, and describe applications
to problems in information retrieval and text modelling.
-
Neal, R.M., Beal, M.J. and Roweis, S.T.
Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
In Neural Information Processing Systems 16:401-408, MIT Press, 2004.
[pdf]
[ps.gz]
[abstract]
We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of "pools" of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used, in effect, to discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers.
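A sketch of the core dynamic program (forward filtering over pool indices, then backward sampling; the importance weights that correct for how pool states are proposed are omitted, and all names are illustrative):

    import numpy as np

    def logsumexp(v):
        m = v.max()
        return m + np.log(np.exp(v - m).sum())

    def categorical(logw, rng):
        p = np.exp(logw - logsumexp(logw))
        return rng.choice(len(p), p=p / p.sum())

    def sample_sequence(pools, log_trans, log_obs, rng):
        # pools[t]: candidate states at time t; log_trans(s, s_next) and
        # log_obs(s, t) are the model's log transition/observation densities.
        T = len(pools)
        alpha = [np.array([log_obs(s, 0) for s in pools[0]])]
        for t in range(1, T):
            prev = alpha[-1]
            alpha.append(np.array([
                log_obs(s, t) + logsumexp(prev + np.array(
                    [log_trans(sp, s) for sp in pools[t - 1]]))
                for s in pools[t]]))
        idx = [0] * T
        idx[T - 1] = categorical(alpha[T - 1], rng)   # draw the final state
        for t in range(T - 2, -1, -1):                # then its predecessors
            w = alpha[t] + np.array(
                [log_trans(sp, pools[t + 1][idx[t + 1]]) for sp in pools[t]])
            idx[t] = categorical(w, rng)
        return [pools[t][idx[t]] for t in range(T)]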
-
Beal, M.J. and Ghahramani, Z.
The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures
In Bayesian Statistics 7:453-464, Oxford University Press, 2003.
Preprint:
[pdf]
[ps.gz].
Volume preface
[pdf]
and contents
[pdf]
[abstract]
We present an efficient procedure for estimating the marginal likelihood of probabilistic models with latent variables or incomplete data. This method constructs and optimises a lower bound on the marginal likelihood using variational calculus, resulting in an iterative algorithm which generalises the EM algorithm by maintaining posterior distributions over both latent variables and parameters. We define the family of conjugate-exponential models--which includes finite mixtures of exponential family models, factor analysis, hidden Markov models, linear state-space models, and other models of interest--for which this bound on the marginal likelihood can be computed very simply through a modification of the standard EM algorithm. In particular, we focus on applying these bounds to the problem of scoring discrete directed graphical model structures (Bayesian networks). Extensive simulations comparing the variational bounds to the usual approach based on the Bayesian Information Criterion (BIC) and to a sampling-based gold standard method known as Annealed Importance Sampling (AIS) show that variational bounds substantially outperform BIC in finding the correct model structure at relatively little computational cost, while approaching the performance of the much more costly AIS procedure. Using AIS allows us to provide the first serious case study of the tightness of variational bounds. We also analyse the performance of AIS through a variety of criteria, and outline directions in which this work can be extended.
-
Beal, M.J.
Variational Algorithms for Approximate Bayesian Inference
PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. (281 pages)
[pdf]
[ps.gz]
or as individual chapters:
Ch 1: Introduction |
(31) |
[pdf 410k]
[ps.gz 335k]
|
Ch 2: Variational Bayesian Theory |
(38) |
[pdf 491k]
[ps.gz 379k]
|
Ch 3: Variational Bayesian Hidden Markov Models |
(24) |
[pdf 529k]
[ps.gz 652k]
|
Ch 4: Variational Bayesian Mixture of Factor Analysers |
(53) |
[pdf 980k]
[ps.gz 906k]
|
Ch 5: Variational Bayesian Linear Dynamical Systems |
(47) |
[pdf 1.1M]
[ps.gz 1.3M]
|
Ch 6: Learning the structure of discrete-variable graphical models with hidden variables |
(44) |
[pdf 749k]
[ps.gz 689k]
|
[abstract]
The Bayesian framework for machine learning allows for the
incorporation of prior knowledge in a coherent way, avoids overfitting
problems, and provides a principled basis for selecting between
alternative models. Unfortunately the computations required are
usually intractable. This thesis presents a unified variational
Bayesian (VB) framework which approximates these computations in
models with latent variables using a lower bound on the marginal
likelihood.
Chapter 1 presents background material on Bayesian inference,
graphical models, and propagation algorithms. Chapter 2 forms the
theoretical core of the thesis, generalising the
expectation-maximisation (EM) algorithm for learning maximum
likelihood parameters to the VB EM algorithm which integrates over
model parameters. The algorithm is then specialised to the large
family of conjugate-exponential (CE) graphical models, and several
theorems are presented to pave the road for automated VB derivation
procedures in both directed and undirected graphs (Bayesian and Markov
networks, respectively).
Chapters 3-5 derive and apply the VB EM algorithm to three
commonly-used and important models: mixtures of factor analysers,
linear dynamical systems, and hidden Markov models. It is shown how
model selection tasks such as determining the dimensionality,
cardinality, or number of variables are possible using VB
approximations. Also explored are methods for combining sampling
procedures with variational approximations, to estimate the tightness
of VB bounds and to obtain more effective sampling algorithms.
Chapter 6 applies VB learning to a long-standing problem of scoring
discrete-variable directed acyclic graphs, and compares the
performance to annealed importance sampling amongst other
methods. Throughout, the VB approximation is compared to other methods
including sampling, Cheeseman-Stutz, and asymptotic approximations
such as BIC. The thesis concludes with a discussion of evolving
directions for model selection including infinite models and
alternative approximations to the marginal likelihood.
-
Beal, M.J., Jojic, N. and Attias, H.
A Graphical Model for Audio-Visual Object Tracking
In IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), July 2003 (Vol. 25, No. 7), pp. 828--836.
[web abstract]
[pdf]
[abstract]
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.
-
Beal, M.J., Ghahramani, Z. and Rasmussen, C.E.
The Infinite Hidden Markov Model
In Advances in Neural Information Processing Systems 14:577-584, eds. T. Dietterich, S. Becker, Z. Ghahramani, MIT Press, 2002.
[pdf]
[ps.gz]
[abstract]
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite--consider, for example, symbols being possible words appearing in English text.
-
Beal, M.J., Attias, H. and Jojic, N.
Audio-Video Sensor Fusion with Probabilistic Graphical Models
In Proc. European Conf. on Computer Vision (ECCV), May 2002.
[web abstract]
[abstract]
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.
-
Beal, M.J., Jojic, N. and Attias, H.
A Self-Calibrating Algorithm for Speaker Tracking Based on Audio-Visual Statistical Models
In Proc. Int. Conf. on Acoustics Speech and Signal Proc. (ICASSP), May 2002.
[pdf]
[ps.gz]
[abstract]
We present a self-calibrating algorithm for audio-visual tracking using two microphones and a camera. The algorithm uses a parametrized statistical model which combines simple models of video and audio. Using unobserved variables, the model describes the process that generates the observed data. Hence, it is able to capture and exploit the statistical structure of the audio and video data, as well as their mutual dependencies. The model parameters are estimated by the EM algorithm; object templates are learned and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location using the model. Successful performance is demonstrated on real multimedia clips.
-
Beal, M.J. and Ghahramani, Z.
The Variational Kalman Smoother
Gatsby Unit Technical Report TR01-003 (contains an error that is corrected in the Ph.D. thesis).
[pdf]
[ps]
[abstract]
In this note we outline the derivation of the variational Kalman smoother, in the context of Bayesian Linear Dynamical Systems. The smoother is an efficient algorithm for the E-step in the Expectation-Maximisation (EM) algorithm for linear-Gaussian state-space models. However, inference approximations are required if we hold distributions over parameters. We derive the E-step updates for the hidden states (the variational smoother), and the M-step updates for the parameter distributions. We show that inference of the hidden state is tractable for any distribution over parameters, provided the expectations of certain quantities are available, analytically or otherwise.
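Schematically (generic notation, not quoted from the report): for the linear-Gaussian state-space model x_t = A x_{t-1} + w_t, y_t = C x_t + v_t, the variational E step sets

    q_x(x_{1:T}) \propto \exp \big\langle \ln p(x_{1:T}, y_{1:T} \mid A, C, Q, R) \big\rangle_{q_\theta(A, C, Q, R)},

so the smoother only needs expectations such as \langle A \rangle, \langle A^\top Q^{-1} A \rangle and \langle C^\top R^{-1} C \rangle under q_\theta, whatever form that distribution takes.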
-
Ghahramani, Z. and Beal, M.J.
Propagation Algorithms for Variational Bayesian Learning
In Advances in Neural Information Processing Systems 13:507-513, eds. T.K. Leen, T. Dietterich, V. Tresp, MIT Press, 2001.
[pdf]
[ps.gz]
[poster]
[software]
[abstract]
Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.
-
Ghahramani, Z. and Beal, M.J.
Graphical Models and Variational Methods
Book chapter in Advanced Mean Field methods - Theory and Practice, eds. D. Saad and M. Opper, MIT Press, 2001. jacket
Part of this work was discussed at the NIPS*99 workshop of the same title as the book.
[pdf]
[ps.gz]
[abstract]
We review the use of variational methods of approximating inference
and learning in probabilistic graphical models. In particular, we
focus on variational approximations to the integrals required for
Bayesian learning. For models in the conjugate-exponential family, a
generalisation of the EM algorithm is derived that iterates between
optimising hyperparameters of the distribution over parameters, and
inferring the hidden variable distributions. These approximations make
use of available propagation algorithms for probabilistic
graphical models. We give two case studies of how the variational
Bayesian approach can be used to learn model structure: inferring the
number of clusters and dimensionalities in a mixture of factor
analysers, and inferring the dimension of the state space of a linear
dynamical system. Finally, importance sampling corrections to the
variational approximations are discussed, along with their
limitations.
-
Ghahramani, Z. and Beal, M.J.
Variational Inference for Bayesian Mixtures of Factor Analysers
In Advances in Neural Information Processing Systems 12:449-455, eds. S. A. Solla, T.K. Leen, K. Müller, MIT Press, 2000.
[pdf]
[ps.gz]
[poster]
[software]
[abstract]
We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples.
By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior, not only in this model but for variational approximations in general.
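The importance-sampling correction mentioned in the final paragraph uses the variational posterior as the proposal; in its simplest standard form,

    p(y) = \mathbb{E}_{q(\theta)}\!\left[\frac{p(y, \theta)}{q(\theta)}\right]
         \approx \frac{1}{N} \sum_{n=1}^{N} \frac{p(y, \theta^{(n)})}{q(\theta^{(n)})},
    \qquad \theta^{(n)} \sim q(\theta),

which is an unbiased estimator of the evidence whenever q has support wherever p(y, \cdot) does.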
-
Classical and Bayesian approaches to reconstructing genetic regulatory networks (28th July 2004)
A poster presented at the 12th International Conference on Intelligent Systems for Molecular Biology (ISMB'04) in Edinburgh on 02/08/04.
[pdf]
[ps.gz]
-
Variational Bayesian quick-reference sheet (updated 1st April 2002)
Something I cooked up during the summer of 2000. Ongoing, and not thoroughly checked for mistakes. This version is for A4 paper.
[pdf]
[ps.gz]
[html]
-
Variational Scoring of Graphical Model Structures (15/09/03)
A recent talk on VB structure learning that I gave at the Toronto Machine Learning group meetings.
[pdf]
-
Variational Inference in the Conjugate-Exponential Family (16/08/00)
A talk I gave to David J.C. MacKay's group in
Inferential Sciences at Cambridge University. Many of the slides'
contents are from the NIPS 2001 paper with Zoubin.
[pdf]
[ps.gz]