nips nips2006 knowledge-graph by maker-knowledge-mining

nips 2006 knowledge graph


similar papers computed by tfidf model


similar papers computed by lsi model


similar papers computed by lda model


papers list:

1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

Author: Michael D. Lee, Ian G. Fuss, Daniel J. Navarro

Abstract: We present a computational Bayesian approach for Wiener diffusion models, which are prominent accounts of response time distributions in decision-making. We first develop a general closed-form analytic approximation to the response time distributions for one-dimensional diffusion processes, and derive the required Wiener diffusion as a special case. We use this result to undertake Bayesian modeling of benchmark data, using posterior sampling to draw inferences about the interesting psychological parameters. With the aid of the benchmark data, we show the Bayesian account has several advantages, including dealing naturally with the parameter variation needed to account for some key features of the data, and providing quantitative measures to guide decisions about model construction. 1

2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

Author: Yee W. Teh, David Newman, Max Welling

Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.

3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off fidelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational benefits of the approach to the some of the problems on which IC has been demonstrated. 1

4 nips-2006-A Humanlike Predictor of Facial Attractiveness

Author: Amit Kagian, Gideon Dror, Tommer Leyvand, Daniel Cohen-or, Eytan Ruppin

Abstract: This work presents a method for estimating human facial attractiveness, based on supervised learning techniques. Numerous facial features that describe facial geometry, color and texture, combined with an average human attractiveness score for each facial image, are used to train various predictors. Facial attractiveness ratings produced by the final predictor are found to be highly correlated with human ratings, markedly improving previous machine learning achievements. Simulated psychophysical experiments with virtually manipulated images reveal preferences in the machine's judgments which are remarkably similar to those of humans. These experiments shed new light on existing theories of facial attractiveness such as the averageness, smoothness and symmetry hypotheses. It is intriguing to find that a machine trained explicitly to capture an operational performance criteria such as attractiveness rating, implicitly captures basic human psychophysical biases characterizing the perception of facial attractiveness in general. 1 I n trod u cti on Philosophers, artists and scientists have been trying to capture the nature of beauty since the early days of philosophy. Although in modern days a common layman's notion is that judgments of beauty are a matter of subjective opinion, recent findings suggest that people might share a common taste for facial attractiveness and that their preferences may be an innate part of the primary constitution of our nature. Several experiments have shown that 2 to 8 months old infants prefer looking at faces which adults rate as being more attractive [1]. In addition, attractiveness ratings show very high agreement between groups of raters belonging to the same culture and even across cultures [2]. Such findings give rise to the quest for common factors which determine human facial attractiveness. Accordingly, various hypotheses, from cognitive, evolutional and social perspectives, have been put forward to describe the common preferences for facial beauty. Inspired by Sir Francis Galton’s photographic method of composing faces [3], Rubenstein, Langlois and Roggman created averaged faces by morphing multiple images together and proposed that averageness is the answer for facial attractiveness [4, 5]. Human judges found these averaged faces to be attractive and rated them with attractiveness ratings higher than the mean rating of the component faces composing them. Grammer and Thornhill have investigated symmetry and averageness of faces and concluded that symmetry was more important than averageness in facial attractiveness [6]. Little and colleagues have agreed that average faces are attractive but claim that faces with certain extreme features, such as extreme sexually dimorphic traits, may be more attractive than average faces [7]. Other researchers have suggested various conditions which may contribute to facial attractiveness such as neonate features, pleasant expressions and familiarity. Cunningham and his associates suggest a multiple fitness model in which there is no single constructing line that determines attractiveness. Instead, different categories of features signal different desirable qualities of the perceived target [8]. Even so, the multiple fitness model agrees that some facial qualities are universally physically attractive to people. Apart from eliciting the facial characteristics which account for attractiveness, modern researchers try to describe underlying mechanisms for these preferences. Many contributors refer to the evolutionary origins of attractiveness preferences [9]-[11]. According to this view, facial traits signal mate quality and imply chances for reproductive success and parasite resistance. Some evolutionary theorists suggest that preferred features might not signal mate quality but that the “good taste” by itself is an evolutionary adaptation (individuals with a preference for attractiveness will have attractive offspring that will be favored as mates) [9]. Another mechanism explains attractiveness' preferences through a cognitive theory - a preference for attractive faces might be induced as a by-product of general perception or recognition mechanisms [5, 12]: Attractive faces might be pleasant to look at since they are closer to the cognitive representation of the face category in the mind. These cognitive representations are described as a part of a cognitive mechanism that abstracts prototypes from distinct classes of objects. These prototypes relate to average faces when considering the averageness hypothesis. A third view has suggested that facial attractiveness originates in a social mechanism, where preferences may be dependent on the learning history of the individual and even on his social goals [12]. Different studies have tried to use computational methods in order to analyze facial attractiveness. Averaging faces with morph tools was done in several cases (e.g. [5, 13]). In [14], laser scans of faces were put into complete correspondence with the average face in order to examine the relationship between facial attractiveness, age, and averageness. Another approach was used in [15] where a genetic algorithm, guided by interactive user selections, was programmed to evolve a “most beautiful” female face. [16] used machine learning methods to investigate whether a machine can predict attractiveness ratings by learning a mapping from facial images to their attractiveness scores. Their predictor achieved a significant correlation of 0.6 with average human ratings, demonstrating that facial beauty can be learned by a machine, at least to some degree. However, as human raters still do significantly outperform the predictor of [16], the challenge of constructing a facial attractiveness machine with human level evaluation accuracy has remained open. A primary goal of this study is to surpass these results by developing a machine which obtains human level performance in predicting facial attractiveness. Having accomplished this, our second main goal is to conduct a series of simulated psychophysical experiments and study the resemblance between human and machine judgments. This latter task carries two potential rewards: A. To determine whether the machine can aid in understanding the psychophysics of human facial attractiveness, capitalizing on the ready accessibility of the analysis of its inner workings, and B. To study whether learning an explicit operational ratings prediction task also entails learning implicit humanlike biases, at least for the case of facial attractiveness. 2 2.1 T h e f aci al trai n in g d atab as e: Acq u i s i ti on , p rep roces s i n g an d rep res en tati on Rating facial attractiveness The chosen database was composed of 91 facial images of American females, taken by the Japanese photographer Akira Gomi. All 91 samples were frontal color photographs of young Caucasian females with a neutral expression. All samples were of similar age, skin color and gender. The subjects’ portraits had no accessories or other distracting items such as jewelry. All 91 facial images in the dataset were rated for attractiveness by 28 human raters (15 males, 13 females) on a 7-point Likert scale (1 = very unattractive, 7 = very attractive). Ratings were collected with a specifically designed html interface. Each rater was asked to view the entire set before rating in order to acquire a notion of attractiveness scale. There was no time limit for judging the attractiveness of each sample and raters could go back and adjust the ratings of already rated samples. The images were presented to each rater in a random order and each image was presented on a separate page. The final attractiveness rating of each sample was its mean rating across all raters. To validate that the number of ratings collected adequately represented the ``collective attractiveness rating'' we randomly divided the raters into two disjoint groups of equal size. For each facial image, we calculated the mean rating on each group, and calculated the Pearson correlation between the mean ratings of the two groups. This process was repeated 1,000 times. The mean correlation between two groups was 0.92 ( = 0.01). This corresponds well to the known level of consistency among groups of raters reported in the literature (e.g. [2]). Hence, the mean ratings collected are stable indicators of attractiveness that can be used for the learning task. The facial set contained faces in all ranges of attractiveness. Final attractiveness ratings range from 1.42 to 5.75 and the mean rating was 3.33 ( = 0.94). 2.2 Data preprocessing and representation Preliminary experimentation with various ways of representing a facial image have systematically shown that features based on measured proportions, distances and angles of faces are most effective in capturing the notion of facial attractiveness (e.g. [16]). To extract facial features we developed an automatic engine that is capable of identifying eyes, nose, lips, eyebrows, and head contour. In total, we measured 84 coordinates describing the locations of those facial features (Figure 1). Several regions are suggested for extracting mean hair color, mean skin color and skin texture. The feature extraction process was basically automatic but some coordinates needed to be manually adjusted in some of the images. The facial coordinates are used to create a distances-vector of all 3,486 distances between all pairs of coordinates in the complete graph created by all coordinates. For each image, all distances are normalized by face length. In a similar manner, a slopes-vector of all the 3,486 slopes of the lines connecting the facial coordinates is computed. Central fluctuating asymmetry (CFA), which is described in [6], is calculated from the coordinates as well. The application also provides, for each face, Hue, Saturation and Value (HSV) values of hair color and skin color, and a measurement of skin smoothness. Figure 1: Facial coordinates with hair and skin sample regions as represented by the facial feature extractor. Coordinates are used for calculating geometric features and asymmetry. Sample regions are used for calculating color values and smoothness. The sample image, used for illustration only, is of T.G. and is presented with her full consent. Combining the distances-vector and the slopes-vector yields a vector representation of 6,972 geometric features for each image. Since strong correlations are expected among the features in such representation, principal component analysis (PCA) was applied to these geometric features, producing 90 principal components which span the sub-space defined by the 91 image vector representations. The geometric features are projected on those 90 principal components and supply 90 orthogonal eigenfeatures representing the geometric features. Eight measured features were not included in the PCA analysis, including CFA, smoothness, hair color coordinates (HSV) and skin color coordinates. These features are assumed to be directly connected to human perception of facial attractiveness and are hence kept at their original values. These 8 features were added to the 90 geometric eigenfeatures, resulting in a total of 98 image-features representing each facial image in the dataset. 3 3.1 E xp eri men ts an d resu l ts Predictor construction and validation We experimented with several induction algorithms including simple Linear Regression, Least Squares Support Vector Machine (LS-SVM) (both linear as well as non-linear) and Gaussian Processes (GP). However, as the LS-SVM and GP showed no substantial advantage over Linear Regression, the latter was used and is presented in the sequel. A key ingredient in our methods is to use a proper image-features selection strategy. To this end we used subset feature selection, implemented by ranking the image-features by their Pearson correlation with the target. Other ranking functions produced no substantial gain. To measure the performance of our method we removed one sample from the whole dataset. This sample served as a test set. We found, for each left out sample, the optimal number of image-features by performing leave-one-out-cross-validation (LOOCV) on the remaining samples and selecting the number of features that minimizes the absolute difference between the algorithm's output and the targets of the training set. In other words, the score for a test example was predicted using a single model based on the training set only. This process was repeated n=91 times, once for each image sample. The vector of attractiveness predictions of all images is then compared with the true targets. These scores are found to be in a high Pearson correlation of 0.82 with the mean ratings of humans (P-value < 10 -23), which corresponds to a normalized Mean Squared Error of 0.39. This accuracy is a marked improvement over the recently published performance results of a Pearson correlation of 0.6 on a similar dataset [16]. The average correlation of an individual human rater to the mean correlations of all other raters in our dataset is 0.67 and the average correlation between the mean ratings of groups of raters is 0.92 (section 2.1). It should be noted that we tried to use this feature selection and training procedure with the original geometric features instead of the eigenfeatures, ranking them by their correlation to the targets and selecting up to 300 best ranked features. This, however, has failed to produce good predictors due to strong correlations between the original geometric features (maximal Pearson correlation obtained was 0.26). 3.2 S i m i l a r i t y o f ma c h i n e a n d h u m a n j u d g m e n t s Each rater (human and machine) has a 91 dimensional rating vector describing its Figure 2: Distribution of mean Euclidean distance from each human rater to all other raters in the ratings space. The machine’s average distance form all other raters (left bar) is smaller than the average distance of each of the human raters to all others. attractiveness ratings of all 91 images. These vectors can be embedded in a 91 dimensional ratings space. The Euclidian distance between all raters (human and machine) in this space was computed. Compared with each of the human raters, the ratings of the machine were the closest, on average, to the ratings of all other human raters (Figure 2). To verify that the machine ratings are not outliers that fall out of clusters of human raters (even though their mean distance from the other ratings is small) we surrounded each of the rating vectors in the ratings space with multidimensional spheres of several radius sizes. The machine had more human neighbors than the mean number of neighbors of human raters, testifying that it does not fall between clusters. Finally, for a graphic display of machine ratings among human ratings we applied PCA to machine and human ratings in the rating space and projected all ratings onto the resulting first 2 and 3 principal components. Indeed, the machine is well placed in a mid-zone of human raters (Figure 3). 5 Machine 0 Machine 0 -4 -8 -5 7 5 0 -10 -10 -5 0 5 10 (a) 0 -7 -5 (b) Figure 3: Location of machine ratings among the 28 human ratings: Ratings were projected into 2 dimensions (a) and 3 dimensions (b) by performing PCA on all ratings and projecting them on the first principal components. The projected data explain 29.8% of the variance in (a) and 36.6% in (b). 3.3 Psychophysical experiments in silico A number of simulated psychophysical experiments reveal humanlike biases of the machine's performance. Rubenstein et al. discuss a morphing technique to create mathematically averaged faces from multiple face images [5]. They reported that averaged faces made of 16 and 32 original component images were rated higher in attractiveness than the mean attractiveness ratings of their component faces and higher than composites consisting of fewer faces. In their experiment, 32-component composites were found to be the most attractive. We used a similar technique to create averaged virtually-morphed faces with various numbers of components, nc, and have let the machine predict their attractiveness. To this end, coordinate values of the original component faces were averaged to create a new set of coordinates for the composite. These coordinates were used to calculate the geometrical features and CFA of the averaged face. Smoothness and HSV values for the composite faces were calculated by averaging the corresponding values of the component faces 1. To study the effect of nc on the attractiveness score we produced 1,000 virtual morph images for each value of n c between 2 and 50, and used our attractiveness predictor (section 3.1) to compute the attractiveness scores of the resulting composites. In accordance with the experimental results of [5], the machine manifests a humanlike bias for higher scores of averaged composites over their components’ mean score. Figure 4a, presenting these results, shows the percent of components which were rated as less attractive than their corresponding composite, for each number of components n c. As evident, the attractiveness rating of a composite surpasses a larger percent of its components’ ratings as nc increases. Figure 4a also shows the mean scores of 1,000 1 HSV values are converted to RGB before averaging composites and the mean scores of their components, for each n c (scores are normalized to the range [0, 1]). Their actual attractiveness scores are reported in Table 1. As expected, the mean scores of the components images are independent of n c, while composites’ scores increase with nc. Mean values of smoothness and asymmetry of the composites are presented in Figure 4b. 0.4 Smoothness Asymmetry 0.8 0.2 0.75 0 -0.2 0.7 -0.4 0.65 -0.6 0.6 Fraction of less attractive components Composite's score (normalized) Components' mean score (normalized) 0.55 2 10 20 30 40 -0.8 50 -1 2 Number of components in composite 10 20 30 40 50 Number of components in composite (a) (b) Figure 4: Mean results over 1,000 composites made of varying numbers of image components: (a) Percent of components which were rated as less attractive than their corresponding composite accompanied with mean scores of composites and the mean scores of their components (scores are normalized to the range [0, 1]. actual attractiveness scores are reported in Table 1). (b) Mean values of smoothness and asymmetry of 1,000 composites for each number of components, nc. Table 1: Mean results over 1,000 composites made of varying numbers of component images NUMBER OF COMPONENTS IN COMPOSITE COMPOSITE SCORE COMPONENTS MEAN SCORE 2 4 12 25 50 3.46 3.66 3.74 3.82 3.94 3.34 3.33 3.32 3.32 3.33 COMPONENTS RATED LOWER THAN COMPOSITE (PERCENT) 55 64 70 75 81 % % % % % Recent studies have provided evidence that skin texture influences judgments of facial attractiveness [17]. Since blurring and smoothing of faces occur when faces are averaged together [5], the smooth complexion of composites may underlie the attractiveness of averaged composites. In our experiment, a preference for averageness is found even though our method of virtual-morphing does not produce the smoothening effect and the mean smoothness value of composites corresponds to the mean smoothness value in the original dataset, for all nc (see Figure 4b). Researchers have also suggested that averaged faces are attractive since they are exceptionally symmetric [18]. Figure 4b shows that the mean level of asymmetry is indeed highly correlated with the mean scores of the morphs (Pearson correlation of -0.91, P-value < 10 -19). However, examining the correlation between the rest of the features and the composites' scores reveals that this high correlation is not at all unique to asymmetry. In fact, 45 of the 98 features are strongly correlated with attractiveness scores (|Pearson correlation| > 0.9). The high correlation between these numerous features and attractiveness scores of averaged faces indicates that symmetry level is not an exceptional factor in the machine’s preference for averaged faces. Instead, it suggests that averaging causes many features, including both geometric features and symmetry, to change in a direction which causes an increase in attractiveness. It has been argued that although averaged faces are found to be attractive, very attractive faces are not average [18]. A virtual composite made of the 12 most attractive faces in the set (as rated by humans) was rated by the machine with a high score of 5.6 while 1,000 composites made of 50 faces got a maximum score of only 5.3. This type of preference resembles the findings of an experiment by Perrett et al. in which a highly attractive composite, morphed from only attractive faces, was preferred by humans over a composite made of 60 images of all levels of attractiveness [13]. Another study by Zaidel et al. examined the asymmetry of attractiveness perception and offered a relationship between facial attractiveness and hemispheric specialization [19]. In this research right-right and left-left chimeric composites were created by attaching each half of the face to its mirror image. Subjects were asked to look at left-left and right-right composites of the same image and judge which one is more attractive. For women’s faces, right-right composites got twice as many ‘more attractive’ responses than left-left composites. Interestingly, similar results were found when simulating the same experiment with the machine: Right-right and left-left chimeric composites were created from the extracted coordinates of each image and the machine was used to predict their attractiveness ratings (taking care to exclude the original image used for the chimeric composition from the training set, as it contains many features which are identical to those of the composite). The machine gave 63 out of 91 right-right composites a higher rating than their matching left-left composite, while only 28 left-left composites were judged as more attractive. A paired t-test shows these results to be statistically significant with P-value < 10 -7 (scores of chimeric composites are normally distributed). It is interesting to see that the machine manifests the same kind of asymmetry bias reported by Zaidel et al, though it has never been explicitly trained for that. 4 Di s cu s s i on In this work we produced a high quality training set for learning facial attractiveness of human faces. Using supervised learning methodologies we were able to construct the first predictor that achieves accurate, humanlike performance for this task. Our results add the task of facial attractiveness prediction to a collection of abstract tasks that has been successfully accomplished with current machine learning techniques. Examining the machine and human raters' representations in the ratings space identifies the ratings of the machine in the center of human raters, and closest, in average, to other human raters. The similarity between human and machine preferences has prompted us to further study the machine’s operation in order to capitalize on the accessibility of its inner workings and learn more about human perception of facial attractiveness. To this end, we have found that that the machine favors averaged faces made of several component faces. While this preference is known to be common to humans as well, researchers have previously offered different reasons for favoring averageness. Our analysis has revealed that symmetry is strongly related to the attractiveness of averaged faces, but is definitely not the only factor in the equation since about half of the image-features relate to the ratings of averaged composites in a similar manner as the symmetry measure. This suggests that a general movement of features toward attractiveness, rather than a simple increase in symmetry, is responsible for the attractiveness of averaged faces. Obviously, strictly speaking this can be held true only for the machine, but, in due of the remarkable ``humnalike'' behavior of the machine, it also brings important support to the idea that this finding may well extend also to human perception of facial attractiveness. Overall, it is quite surprising and pleasing to see that a machine trained explicitly to capture an operational performance criteria such as rating, implicitly captures basic human psychophysical biases related to facial attractiveness. It is likely that while the machine learns the ratings in an explicit supervised manner, it also concomitantly and implicitly learns other basic characteristics of human facial ratings, as revealed by studying its

5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

Author: Yoshinobu Kawahara, Takehisa Yairi, Kazuo Machida

Abstract: In this paper, we present a subspace method for learning nonlinear dynamical systems based on stochastic realization, in which state vectors are chosen using kernel canonical correlation analysis, and then state-space systems are identified through regression with the state vectors. We construct the theoretical underpinning and derive a concrete algorithm for nonlinear identification. The obtained algorithm needs no iterative optimization procedure and can be implemented on the basis of fast and reliable numerical schemes. The simulation result shows that our algorithm can express dynamics with a high degree of accuracy. 1

7 nips-2006-A Local Learning Approach for Clustering

Author: Mingrui Wu, Bernhard Schölkopf

Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1

8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf

Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1

9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

Author: Daniel J. Navarro, Thomas L. Griffiths

Abstract: The additive clustering model is widely used to infer the features of a set of stimuli from their similarities, on the assumption that similarity is a weighted linear function of common features. This paper develops a fully Bayesian formulation of the additive clustering model, using methods from nonparametric Bayesian statistics to allow the number of features to vary. We use this to explore several approaches to parameter estimation, showing that the nonparametric Bayesian approach provides a straightforward way to obtain estimates of both the number of features used in producing similarity judgments and their importance. 1

10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

Author: David Barber, Bertrand Mesot

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used. The observation (or ‘visible’) vt ∈ RV is linearly related to the hidden state ht ∈ RH with additive noise η by vt = B(st )ht + η v (st ) p(vt |ht , st ) = N (B(st )ht , Σv (st )) ≡ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear, ht = A(st )ht−1 + η h (st ), ≡ p(ht |ht−1 , st ) = N A(st )ht−1 , Σh (st ) (2) The switch st may depend on both the previous st−1 and ht−1 . This is an augmented SLDS (aSLDS), and defines the model T p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ) p(v1:T , h1:T , s1:T ) = t=1 The standard SLDS[4] considers only switch transitions p(st |st−1 ). At time t = 1, p(s1 |h0 , s0 ) simply denotes the prior p(s1 ), and p(h1 |h0 , s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4]. s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. 2 Expectation Correction Our approach to approximate p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel ‘correction’ smoother for the simpler LDS [1].The method consists of a single forward pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single backward pass to correct this into a smoothed posterior p(ht , st |v1:T ). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7]. 2.1 Forward Pass (Filtering) Readers familiar with ADF may wish to continue directly to Section (2.2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ) (3) The exact representation of p(ht |st , v1:t ) is a mixture with O(S t ) components. We therefore approximate this with a smaller I-component mixture I p(ht |st , v1:t ) ≈ p(ht |it , st , v1:t )p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean f (it , st ) and covariance F (it , st ). To find a recursion for these parameters, consider p(ht+1 |st+1 , v1:t+1 ) = p(ht+1 |st , it , st+1 , v1:t+1 )p(st , it |st+1 , v1:t+1 ) (4) st ,it Evaluating p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) by first computing the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements, Σhh = A(st+1 )F (it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh B T (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F (it , st ), µv = B(st+1 )A(st+1 )f (it , st ), µh = A(st+1 )f (it , st ) (5) and then conditioning on vt+1 1 . For the case S = 1, this forms the usual Kalman Filter recursions[1]. Evaluating p(st , it |st+1 , v1:t+1 ) The mixture weight in (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) (6) 1 p(x|y) is a Gaussian with mean µx + Σxy Σ−1 (y − µy ) and covariance Σxx − Σxy Σ−1 Σyx . yy yy The first factor in (6), p(vt+1 |it , st , st+1 , v1:t ) is a Gaussian with mean µv and covariance Σvv , as given in (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, (7) will generally need to be computed numerically. Closing the recursion We are now in a position to calculate (4). For each setting of the variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form I p(ht+1 |st+1 , v1:t+1 ) ≈ p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ) it+1 =1 Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(it+1 |st+1 , v1:t+1 ), it+1 ∈ 1, . . . , I are defined, completing the description of how to form a recursion for p(ht+1 |st+1 , v1:t+1 ) in (3). A recursion for the switch variable is given by p(st+1 |v1:t+1 ) ∝ p(vt+1 |st+1 , it , st , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |vt ) = p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) it ,st ,st+1 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ) and our aim is to find a recursion for these parameters. A useful starting point for a recursion is: p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ) p(ht , st |v1:T ) = st+1 The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) (8) ht+1 The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (9) The difficulty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian2 . One possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms ‘standard’ EC) is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), where p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ p(st+1 |v1:T )p(st |st+1 , v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (10) st+1 2 In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians, see [7]. However, since in (9) the two terms p(ht+1 |st+1 , v1:T ) will only be approximately computed during the recursion, our approximation to p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians. Evaluating p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found from a forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht . The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , which are given by ht+1 = A(st+1 )ft (st ), Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 )+Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ) Given the statistics of (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − − ht = A (st , st+1 )ht+1 + ←(st , st+1 ) η ← − ← − − − where A (st , st+1 ) and ←(st , st+1 ) ∼ N (← t , st+1 ), Σ (st , st+1 )) are easily found using η m(s conditioning. Averaging the above reversed dynamics over p(ht+1 |st+1 , v1:T ), we find that p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 )+← t , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 )+ Σ (st , st+1 ) m(s These equations directly mirror the standard RTS backward pass[1]. Evaluating p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) ′ ′ s′ p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) (13) t Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the forward pass, (7). In (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing (11). Computing the average of (13) with respect to p(ht+1 |st+1 , v1:T ) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(ht+1 |st+1 , v1:T ). More sophisticated methods (see [7]) such as sampling from the Gaussian p(ht+1 |st+1 , v1:T ) have the advantage that covariance information is used3 . Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the joint (8) by conditioning on st to form the mixture p(ht |st , v1:T ) = p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = p(st+1 |v1:T ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) st+1 3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme. 2.3 Relation to other methods The EC Backward pass is closely related to Kim’s method [8]. In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple backward pass. The other ‘approximation’ in EC is to numerically compute the average in (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (15) Since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim’s method is that the approximation, (15), is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversed is proposed. However, the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal inference p(ht , st |v1:T ). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters)[13], are essentially mixture of delta-function approximations. Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior. 3 Demonstration Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure(2) for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h1:T , s1:T ), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferences, and compare how our most probable posterior smoothed estimates arg maxst p(st |v1:T ) compare with the assumed correct sample st . We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal vt . For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate PF RBPF EP ADFS KimS ECS ADFM KimM ECM 1000 800 600 400 200 0 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 Figure 2: The number of errors in estimating p(st |v1:T ) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors are over 1000 experiments. The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a multiple of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H). H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . At time t = 1, the priors are p1 = uniform, with h1 drawn from N (10 ∗ randn(H, 1), IH ). the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the RaoBlackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim’s method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS[7]. 4 Application to Noise Robust ASR Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC, and serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that h the observed signal vt is modeled as a noise corrupted version of a clean hidden signal vt : h vt = vt + ηt ˜ 4 with ηt ∼ N (0, σ 2 ) ˜ ˜ Generalized EP [5], which groups variables together improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter personal communication. Noise Variance 0 10−7 10−6 10−5 10−4 10−3 SNR (dB) 26.5 26.3 25.1 19.7 10.6 0.7 HMM 100.0% 100.0% 90.9% 86.4% 59.1% 9.1% SAR-HMM 97.0% 79.8% 56.7% 22.2% 9.7% 9.1% AR-SLDS 96.8% 96.8% 96.4% 94.8% 84.0% 61.2% Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise. The dynamics of the clean signal is modeled by a switching AR process R h vt = h h cr (st )vt−r + ηt (st ), h ηt (st ) ∼ N (0, σ 2 (st )) r=1 where st ∈ {1, . . . , S} denotes which of a set of AR coefficients cr (st ) are to be used at time t, h and ηt (st ) is the so-called innovation noise. When σ 2 (st ) ≡ 0, this model reproduces the SARHMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ 2 (st ) > 0 the model can be recast as an SLDS. To do this we define ht as a vector which contains the R most recent clean hidden samples ht = h vt h . . . vt−r+1 T (16) and we set A(st ) to be an R × R matrix where the first row contains the AR coefficients −cr (st ) and the rest is a shifted down identity matrix. For example, for a third order (R = 3) AR process, A(st ) = −c1 (st ) −c2 (st ) −c3 (st ) 1 0 0 0 1 0 . (17) The hidden covariance matrix Σh (s) has all elements zero, except the top-left most which is set to the innovation variance. To extract the first component of ht we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise 2 variance is given by Σv ≡ σv . A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σh is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16]. 4.1 Training & Evaluation Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and ‘oh’) from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix. Each state was associated with a 10thorder AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients cr (s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(st | st−1 ) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at a SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps. 5 Discussion We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case. The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks. Acknowledgements This work is supported by the EU Project FP6-0027787. This paper only reflects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and Xiao-Rong Li. Estimation and Tracking : Principles, Techniques and Software. Artech House, Norwood, MA, 1998. [2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing systems (NIPS 13), pages 981–987, 2001. [3] A. T. Cemgil, B. Kappen, and D. Barber. A Generative Model for Music Transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679 – 694, 2006. [4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002. [5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005. [6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001. [7] D. Barber. Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems. Journal of Machine Learning Research, 7:2515–2540, 2006. [8] C-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994. [9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Art. Intelligence, pages 216–223, 2002. [10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001. [11] G. Kitagawa. The Two-Filter Formula for Smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994. [12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998. [13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. [14] B. Mesot and D. Barber. Switching Linear Dynamical Systems for Noise Robust Speech Recognition. IDIAP-RR 08, 2006. [15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noiserobust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005. [16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005. [17] R.G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP84, volume 3, 1984.

11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions

Author: Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand

Abstract: We provide a PAC-Bayesian bound for the expected loss of convex combinations of classifiers under a wide class of loss functions (which includes the exponential loss and the logistic loss). Our numerical experiments with Adaboost indicate that the proposed upper bound, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 1 Intoduction The PAC-Bayes approach [1, 2, 3, 4, 5] has been very effective at providing tight risk bounds for large-margin classifiers such as the SVM [4, 6]. Within this approach, we consider a prior distribution P over a space of classifiers that characterizes our prior belief about good classifiers (before the observation of the data) and a posterior distribution Q (over the same space of classifiers) that takes into account the additional information provided by the training data. A remarkable result that came out from this line of research, known as the “PAC-Bayes theorem”, provides a tight upper bound on the risk of a stochastic classifier (defined on the posterior Q) called the Gibbs classifier. In the context of binary classification, the Q-weighted majority vote classifier (related to this stochastic classifier) labels any input instance with the label output by the stochastic classifier with probability more than half. Since at least half of the Q measure of the classifiers err on an example incorrectly classified by the majority vote, it follows that the error rate of the majority vote is at most twice the error rate of the Gibbs classifier. Therefore, given enough training data, the PAC-Bayes theorem will give a small risk bound on the majority vote classifier only when the risk of the Gibbs classifier is small. While the Gibbs classifiers related to the large-margin SVM classifiers have indeed a low risk [6, 4], this is clearly not the case for the majority vote classifiers produced by bagging [7] and boosting [8] where the risk of the associated Gibbs classifier is normally close to 1/2. Consequently, the PAC-Bayes theorem is currently not able to recognize the predictive power of the majority vote in these circumstances. In an attempt to progress towards a theory giving small risk bounds for low-risk majority votes having a large risk for the associated Gibbs classifier, we provide here a risk bound for convex combinations of classifiers under quite arbitrary loss functions, including those normally used for boosting (like the exponential loss) and those that can give a tighter upper bound to the zero-one loss of weighted majority vote classifiers (like the sigmoid loss). Our numerical experiments with Adaboost [8] indicate that the proposed upper bound for the exponential loss and the sigmoid loss, computed on the training set, behaves very similarly as the true loss estimated on the testing set. 2 Basic Definitions and Motivation We consider binary classification problems where the input space X consists of an arbitrary subset of Rn and the output space Y = {−1, +1}. An example is an input-output (x, y) pair where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, probability distribution D on X × Y. We consider learning algorithms that work in a fixed hypothesis space H of binary classifiers and produce a convex combination fQ of binary classifiers taken from H. Each binary classifier h ∈ H contribute to fQ with a weight Q(h) ≥ 0. For any input example x ∈ X , the real-valued output fQ (x) is given by fQ (x) = Q(h)h(x) , h∈H where h(x) ∈ {−1, +1}, fQ (x) ∈ [−1, +1], and called the posterior distribution1 . h∈H Q(h) = 1. Consequently, Q(h) will be Since fQ (x) is also the expected class label returned by a binary classifier randomly chosen according to Q, the margin yfQ (x) of fQ on example (x, y) is related to the fraction WQ (x, y) of binary classifiers that err on (x, y) under measure Q as follows. Let I(a) = 1 when predicate a is true and I(a) = 0 otherwise. We then have: WQ (x, y) − Since E (x,y)∼D 1 2 = E h∼Q I(h(x) = y) − 1 2 = E − h∼Q yh(x) 1 = − 2 2 Q(h)yh(x) h∈H 1 = − yfQ (x) . 2 WQ (x, y) is the Gibbs error rate (by definition), we see that the expected margin is just one minus twice the Gibbs error rate. In contrast, the error for the Q-weighted majority vote is given by E (x,y)∼D I WQ (x, y) > 1 2 = ≤ ≤ E 1 1 tanh (β [2WQ (x, y) − 1]) + 2 2 tanh (β [2WQ (x, y) − 1]) + 1 (∀β > 0) E exp (β [2WQ (x, y) − 1]) E lim (x,y)∼D β→∞ (x,y)∼D (x,y)∼D (∀β > 0) . Hence, for large enough β, the sigmoid loss (or tanh loss) of fQ should be very close to the error rate of the Q-weighted majority vote. Moreover, the error rate of the majority vote is always upper bounded by twice that sigmoid loss for any β > 0. The sigmoid loss is, in turn, upper bounded by the exponential loss (which is used, for example, in Adaboost [9]). More generally, we will provide tight risk bounds for any loss function that can be expanded by a Taylor series around WQ (x, y) = 1/2. Hence we consider any loss function ζQ (x, y) that can be written as def ζQ (x, y) = = 1 1 + 2 2 1 1 + 2 2 ∞ g(k) (2WQ (x, y) − 1) k=1 ∞ (1) k g(k) k=1 k E − yh(x) h∼Q , (2) and our task is to provide tight bounds for the expected loss ζQ that depend on the empirical loss ζQ measured on a training sequence S = (x1 , y1 ), . . . , (xm , ym ) of m examples, where def ζQ = E (x,y)∼D ζQ (x, y) ; def ζQ = 1 m m ζQ (xi , yi ) . (3) i=1 Note that by upper bounding ζQ , we are taking into account all moments of WQ . In contrast, the PAC-Bayes theorem [2, 3, 4, 5] currently only upper bounds the first moment E WQ (x, y). (x,y)∼D 1 When H is a continuous set, Q(h) denotes a density and the summations over h are replaced by integrals. 3 A PAC-Bayes Risk Bound for Convex Combinations of Classifiers The PAC-Bayes theorem [2, 3, 4, 5] is a statement about the expected zero-one loss of a Gibbs classifier. Given any distribution over a space of classifiers, the Gibbs classifier labels any example x ∈ X according to a classifier randomly drawn from that distribution. Hence, to obtain a PACBayesian bound for the expected general loss ζQ of a convex combination of classifiers, let us relate ζQ to the zero-one loss of a Gibbs classifier. For this task, let us first write k E E − yh(x) = h∼Q (x,y)∼D E E E (−y)k h1 (x)h2 (x) · · · hk (x) . ··· E h1 ∼Q h2 ∼Q hk ∼Q (x,y) Note that the product h1 (x)h2 (x) · · · hk (x) defines another binary classifier that we denote as h1−k (x). We now define the error rate R(h1−k ) of h1−k as def R(h1−k ) = = I (−y)k h1−k (x) = sgn(g(k)) E (x,y)∼D (4) 1 1 + · sgn(g(k)) E (−y)k h1−k (x) , 2 2 (x,y)∼D where sgn(g) = +1 if g > 0 and −1 otherwise. If we now use E h1−k ∼Qk ζQ = = = to denote E E h1 ∼Q h2 ∼Q 1 1 + 2 2 1 1 + 2 2 1 + 2 · · · E , Equation 2 now becomes hk ∼Q ∞ k g(k) E E − yh(x) h∼Q (x,y)∼D k=1 ∞ |g(k)| · sgn(g(k)) k=1 ∞ |g(k)| E E R(h1−k ) − h1−k ∼Qk k=1 E h1−k ∼Qk (x,y)∼D 1 2 (−y)k h1−k (x) . (5) Apart, from constant factors, Equation 5 relates ζQ the the zero-one loss of a new type of Gibbs classifier. Indeed, if we define def ∞ c = |g(k)| , (6) k=1 Equation 5 can be rewritten as 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| E def h1−k ∼Qk k=1 R(h1−k ) = R(GQ ) . (7) The new type of Gibbs classifier is denoted above by GQ , where Q is a distribution over the product classifiers h1−k with variable length k. More precisely, given an example x to be labelled by GQ , we first choose at random a number k ∈ N+ according to the discrete probability distribution given by |g(k)|/c and then we choose h1−k randomly according to Qk to classify x with h1−k (x). The risk R(GQ ) of this new Gibbs classifier is then given by Equation 7. We will present a tight PAC-Bayesien bound for R(GQ ) which will automatically translate into a bound for ζQ via Equation 7. This bound will depend on the empirical risk RS (GQ ) which relates to the the empirical loss ζQ (measured on the training sequence S of m examples) through the equation 1 c ζQ − 1 2 + 1 1 = 2 c ∞ |g(k)| k=1 E h1−k ∼Qk def RS (h1−k ) = RS (GQ ) , where RS (h1−k ) def = 1 m m I (−yi )k h1−k (xi ) = sgn(g(k)) . i=1 (8) Note that Equations 7 and 8 imply that ζQ − ζQ = c · R(GQ ) − RS (GQ ) . Hence, any looseness in the bound for R(GQ ) will be amplified by the scaling factor c on the bound for ζQ . Therefore, within this approach, the bound for ζQ can be tight only for small values of c. Note however that loss functions having a small value of c are commonly used in practice. Indeed, learning algorithms for feed-forward neural networks, and other approaches that construct a realvalued function fQ (x) ∈ [−1, +1] from binary classification data, typically use a loss function of the form |fQ (x) − y|r /2, for r ∈ {1, 2}. In these cases we have 1 1 |fQ (x) − y|r = 2 2 r r = 2r−1 |WQ (x, y)| , E yh(x) − 1 h∼Q which gives c = 1 for r = 1, and c = 3 for r = 2. Given a set H of classifiers, a prior distribution P on H, and a training sequence S of m examples, the learner will output a posterior distribution Q on H which, in turn, gives a convex combination fQ that suffers the expected loss ζQ . Although Equation 7 holds only for a distribution Q defined by the absolute values of the Taylor coefficients g(k) and the product distribution Qk , the PAC-Bayesian theorem will hold for any prior P and posterior Q defined on def H∗ = Hk , (9) k∈N+ and for any zero-one valued loss function (h(x), y)) defined ∀h ∈ H∗ and ∀(x, y) ∈ X × Y (not just the one defined by Equation 4). This PAC-Bayesian theorem upper-bounds the value of kl RS (GQ ) R(GQ ) , where def kl(q p) = q ln q 1−q + (1 − q) ln p 1−p denotes the Kullback-Leibler divergence between the Bernoulli distributions with probability of success q and probability of success p. Note that an upper bound on kl RS (GQ ) R(GQ ) provides both and upper and a lower bound on R(GQ ). The upper bound on kl RS (GQ ) R(GQ ) depends on the value of KL(Q P ), where def E ln KL(Q P ) = h∼Q Q(h) P (h) denotes the Kullback-Leibler divergence between distributions Q and P defined on H∗ . In our case, since we want a bound on R(GQ ) that translates into a bound for ζQ , we need a Q that satisfies Equation 7. To minimize the value of KL(Q P ), it is desirable to choose a prior P having properties similar to those of Q. Namely, the probabilities assigned by P to the possible values of k will also be given by |g(k)|/c. Moreover, we will restrict ourselves to the case where the k classifiers from H are chosen independently, each according to the prior P on H (however, other choices for P are clearly possible). In this case we have KL(Q P ) = 1 c = 1 c = 1 c = ∞ |g(k)| k=1 E h1−k ∼Qk ln |g(k)| · Qk (h1−k ) |g(k)| · P k (h1−k ) ∞ k |g(k)| E k=1 ∞ k=1 h1 ∼Q ... E hk ∼Q ln i=1 Q(hi ) P (hi ) Q(h) |g(k)| · k E ln h∼Q P (h) k · KL(Q P ) , (10) where 1 c def k = ∞ |g(k)| · k . (11) k=1 We then have the following theorem. Theorem 1 For any set H of binary classifiers, any prior distribution P on H∗ , and any δ ∈ (0, 1], we have 1 m+1 KL(Q P ) + ln ≥ 1−δ. Pr ∀Q on H∗ : kl RS (GQ ) R(GQ ) ≤ S∼D m m δ Proof The proof directly follows from the fact that we can apply the PAC-Bayes theorem of [4] to priors and posteriors defined on the space H∗ of binary classifiers with any zero-one valued loss function. Note that Theorem 1 directly provides upper and lower bounds on ζQ when we use Equations 7 and 8 to relate R(GQ ) and RS (GQ ) to ζQ and ζQ and when we use Equation 10 for KL(Q P ). Consequently, we have the following theorem. Theorem 2 Consider any loss function ζQ (x, y) defined by Equation 1. Let ζQ and ζQ be, respectively, the expected loss and its empirical estimate (on a sample of m examples) as defined by Equation 3. Let c and k be defined by Equations 6 and 11 respectively. Then for any set H of binary classifiers, any prior distribution P on H, and any δ ∈ (0, 1], we have Pr S∼D m ∀Q on H : kl 1 1 1 ζQ − + c 2 2 1 1 1 ζQ − + c 2 2 ≤ 4 1 m+1 k · KL(Q P ) + ln m δ ≥ 1−δ. Bound Behavior During Adaboost We have decided to examine the behavior of the proposed bounds during Adaboost since this learning algorithm generally produces a weighted majority vote having a large Gibbs risk E (x,y) WQ (x, y) (i.e., small expected margin) and a small Var (x,y) WQ (x, y) (i.e., small variance of the margin). Indeed, recall that one of our main motivations was to find a tight risk bound for the majority vote precisely under these circumstances. We have used the “symmetric” version of Adaboost [10, 9] where, at each boosting round t, the weak learning algorithm produces a classifier ht with the smallest empirical error m t = Dt (i)I[ht (xi ) = yi ] i=1 with respect to the boosting distribution Dt (i) on the indices i ∈ {1, . . . , m} of the training examples. After each boosting round t, this distribution is updated according to Dt+1 (i) = 1 Dt (i) exp(−yi αt ht (xi )) , Zt where Zt is the normalization constant required for Dt+1 to be a distribution, and where αt = 1 ln 2 1− t . t Since our task is not to obtain the majority vote with the smallest possible risk but to investigate the tightness of the proposed bounds, we have used the standard “decision stumps” for the set H of classifiers that can be chosen by the weak learner. Each decision stump is a threshold classifier that depends on a single attribute: it outputs +y when the tested attribute exceeds the threshold and predicts −y otherwise, where y ∈ {−1, +1}. For each decision stump h ∈ H, its boolean complement is also in H. Hence, we have 2[k(i) − 1] possible decision stumps on an attribute i having k(i) possible (discrete values). Hence, for data sets having n attributes, we have exactly n |H| = 2 i=1 2[k(i) − 1] classifiers. Data sets having continuous-valued attributes have been discretized in our numerical experiments. From Theorem 2 and Equation 10, the bound on ζQ depends on KL(Q P ). We have chosen a uniform prior P (h) = 1/|H| ∀h ∈ H. We therefore have Q(h) def = Q(h) ln Q(h) + ln |H| = −H(Q) + ln |H| . KL(Q P ) = Q(h) ln P (h) h∈H h∈H At boosting round t, Adaboost changes the distribution from Dt to Dt+1 by putting more weight on the examples that are incorrectly classified by ht . This strategy is supported by the propose bound on ζQ since it has the effect of increasing the entropy H(Q) as a function of t. Indeed, apart from tiny fluctuations, the entropy was seen to be nondecreasing as a function of t in all of our boosting experiments. We have focused our attention on two different loss functions: the exponential loss and the sigmoid loss. 4.1 Results for the Exponential Loss The exponential loss EQ (x, y) is the obvious choice for boosting since, the typical analysis [8, 10, 9] shows that the empirical estimate of the exponential loss is decreasing at each boosting round 2 . More precisely, we have chosen def 1 exp (β [2WQ (x, y) − 1]) . EQ (x, y) = (12) 2 For this loss function, we have c = eβ − 1 β . k = 1 − e−β Since c increases exponentially rapidly with β, so will the risk upper-bound for EQ . Hence, unfortunately, we can obtain a tight upper-bound only for small values of β. All the data sets used were obtained from the UCI repository. Each data set was randomly split into two halves of the same size: one for the training set and the other for the testing set. Figure 1 illustrates the typical behavior for the exponential loss bound on the Mushroom and Sonar data sets containing 8124 examples and 208 examples respectively. We first note that, although the test error of the majority vote (generally) decreases as function of the number T of boosting rounds, the risk of the Gibbs classifier, E (x,y) WQ (x, y) increases as a function of T but its variance Var (x,y) WQ (x, y) decreases dramatically. Another striking feature is the fact that the exponential loss bound curve, computed on the training set, is essentially parallel to the true exponential loss curve computed on the testing set. This same parallelism was observed for all the UCI data sets we have examined so far.3 Unfortunately, as we can see in Figure 2, the risk bound increases rapidly as a function of β. Interestingly however, the risk bound curves remain parallel to the true risk curves. 4.2 Results for the Sigmoid Loss We have also investigated the sigmoid loss TQ (x, y) defined by 1 def 1 + tanh (β [2WQ (x, y) − 1]) . TQ (x, y) = 2 2 2 (13) In fact, this is true only for the positive linear combination produced by Adaboost. The empirical exponential risk of the convex combination fQ is not always decreasing as we shall see. 3 These include the following data sets: Wisconsin-breast, breast cancer, German credit, ionosphere, kr-vskp, USvotes, mushroom, and sonar. 0.6 0.4 0.5 0.3 0.4 0.2 0.1 EQ bound EQ on test 0.3 EQ bound EQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 1: Behavior of the exponential risk bound (EQ bound), the true exponential risk (EQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. 0.5 0.8 0.7 0.4 0.6 0.5 0.3 0.4 β=1 β=2 β=3 β=4 MV error on test 0.2 0.1 β=1 β=2 β=3 β=4 MV error on test 0.3 0.2 0.1 0 0 1 40 80 120 160 T 1 40 80 120 160 T Figure 2: Behavior of the true exponential risk (left) and the exponential risk bound (right) for different values of β on the Mushroom data set. Since the Taylor series expansion for tanh(x) about x = 0 converges only for |x| < π/2, we are limited to β ≤ π/2. Under these circumstances, we have c = k = tan(β) 1 . cos(β) sin(β) Similarly as in Figure 1, we see on Figure 3 that the sigmoid loss bound curve, computed on the training set, is essentially parallel to the true sigmoid loss curve computed on the testing set. Moreover, the bound appears to be as tight as the one for the exponential risk on Figure 1. 5 Conclusion By trying to obtain a tight PAC-Bayesian risk bound for the majority vote, we have obtained a PAC-Bayesian risk bound for any loss function ζQ that has a convergent Taylor expansion around WQ = 1/2 (such as the exponential loss and the sigmoid loss). Unfortunately, the proposed risk 0.6 0.4 0.5 0.4 0.3 0.2 0.1 TQ bound TQ on test 0.3 TQ bound TQ on test E(WQ ) on test MV error on test Var(WQ ) on test 0.2 E(WQ ) on test MV error on test Var(WQ ) on test 0.1 0 0 0 40 80 120 160 T 0 40 80 120 160 T Figure 3: Behavior of the sigmoid risk bound (TQ bound), the true sigmoid risk (TQ on test), the Gibbs risk (E(WQ ) on test), its variance (Var(WQ ) on test), and the test error of the majority vote (MV error on test) as of function of the boosting round T for the Mushroom (left) and the Sonar (right) data sets. The risk bound and the true risk were computed for β = ln 2. bound is tight only for small values of the scaling factor c involved in the relation between the expected loss ζQ of a convex combination of binary classifiers and the zero-one loss of a related Gibbs classifier GQ . However, it is quite encouraging to notice in our numerical experiments with Adaboost that the proposed loss bound (for the exponential loss and the sigmoid loss), behaves very similarly as the true loss. Acknowledgments Work supported by NSERC Discovery grants 262067 and 122405. References [1] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999. [2] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269, 2002. [3] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003. [4] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005. [5] Francois Laviolette and Mario Marchand. PAC-Bayes risk bounds for sample-compressed ¸ Gibbs classifiers. Proceedings of the 22nth International Conference on Machine Learning (ICML 2005), pages 481–488, 2005. [6] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423– 430. MIT Press, Cambridge, MA, 2003. [7] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997. [9] Robert E. Schapire and Yoram Singer. Improved bosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651– 1686, 1998.

12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data

Author: Johanna M. Zumer, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan

Abstract: We have developed a novel algorithm for integrating source localization and noise suppression based on a probabilistic graphical model of stimulus-evoked MEG/EEG data. Our algorithm localizes multiple dipoles while suppressing noise sources with the computational complexity equivalent to a single dipole scan, and is therefore more efficient than traditional multidipole fitting procedures. In simulation, the algorithm can accurately localize and estimate the time course of several simultaneously-active dipoles, with rotating or fixed orientation, at noise levels typical for averaged MEG data. Furthermore, the algorithm is superior to beamforming techniques, which we show to be an approximation to our graphical model, in estimation of temporally correlated sources. Success of this algorithm for localizing auditory cortex in a tumor patient and for localizing an epileptic spike source are also demonstrated. 1

13 nips-2006-A Scalable Machine Learning Approach to Go

Author: Lin Wu, Pierre F. Baldi

Abstract: Go is an ancient board game that poses unique opportunities and challenges for AI and machine learning. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. A system trained using only 9 × 9 amateur game data performs surprisingly well on a test set derived from 19 × 19 professional game data. Possible directions for further improvements are briefly discussed. 1

14 nips-2006-A Small World Threshold for Economic Network Formation

Author: Eyal Even-dar, Michael Kearns

Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1

15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

Author: Oliver Williams

Abstract: This paper describes a Gaussian process framework for inferring pixel-wise disparity and bi-layer segmentation of a scene given a stereo pair of images. The Gaussian process covariance is parameterized by a foreground-backgroundocclusion segmentation label to model both smooth regions and discontinuities. As such, we call our model a switched Gaussian process. We propose a greedy incremental algorithm for adding observations from the data and assigning segmentation labels. Two observation schedules are proposed: the first treats scanlines as independent, the second uses an active learning criterion to select a sparse subset of points to measure. We show that this probabilistic framework has comparable performance to the state-of-the-art. 1

16 nips-2006-A Theory of Retinal Population Coding

Author: Eizaburo Doi, Michael S. Lewicki

Abstract: Efficient coding models predict that the optimal code for natural images is a population of oriented Gabor receptive fields. These results match response properties of neurons in primary visual cortex, but not those in the retina. Does the retina use an optimal code, and if so, what is it optimized for? Previous theories of retinal coding have assumed that the goal is to encode the maximal amount of information about the sensory signal. However, the image sampled by retinal photoreceptors is degraded both by the optics of the eye and by the photoreceptor noise. Therefore, de-blurring and de-noising of the retinal signal should be important aspects of retinal coding. Furthermore, the ideal retinal code should be robust to neural noise and make optimal use of all available neurons. Here we present a theoretical framework to derive codes that simultaneously satisfy all of these desiderata. When optimized for natural images, the model yields filters that show strong similarities to retinal ganglion cell (RGC) receptive fields. Importantly, the characteristics of receptive fields vary with retinal eccentricities where the optical blur and the number of RGCs are significantly different. The proposed model provides a unified account of retinal coding, and more generally, it may be viewed as an extension of the Wiener filter with an arbitrary number of noisy units. 1

17 nips-2006-A recipe for optimizing a time-histogram

Author: Hideaki Shimazaki, Shigeru Shinomoto

Abstract: The time-histogram method is a handy tool for capturing the instantaneous rate of spike occurrence. In most of the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time-histogram to the underlying rate has been selected by individual researchers in an unsystematic manner. We propose an objective method for selecting the bin size of a time-histogram from the spike data, so that the time-histogram best approximates the unknown underlying rate. The resolution of the histogram increases, or the optimal bin size decreases, with the number of spike sequences sampled. It is notable that the optimal bin size diverges if only a small number of experimental trials are available from a moderately fluctuating rate process. In this case, any attempt to characterize the underlying spike rate will lead to spurious results. Given a paucity of data, our method can also suggest how many more trials are needed until the set of data can be analyzed with the required resolution. 1

18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons

Author: Chiara Bartolozzi, Giacomo Indiveri

Abstract: Selective attention is the strategy used by biological sensory systems to solve the problem of limited parallel processing capacity: salient subregions of the input stimuli are serially processed, while non–salient regions are suppressed. We present an mixed mode analog/digital Very Large Scale Integration implementation of a building block for a multi–chip neuromorphic hardware model of selective attention. We describe the chip’s architecture and its behavior, when its is part of a multi–chip system with a spiking retina as input, and show how it can be used to implement in real-time flexible models of bottom-up attention. 1

19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis

Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant. 1

20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

21 nips-2006-AdaBoost is Consistent

22 nips-2006-Adaptive Spatial Filters with predefined Region of Interest for EEG based Brain-Computer-Interfaces

23 nips-2006-Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models

24 nips-2006-Aggregating Classification Accuracy across Time: Application to Single Trial EEG

25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

26 nips-2006-An Approach to Bounded Rationality

27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

29 nips-2006-An Information Theoretic Framework for Eukaryotic Gradient Sensing

30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

31 nips-2006-Analysis of Contour Motions

32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

33 nips-2006-Analysis of Representations for Domain Adaptation

34 nips-2006-Approximate Correspondences in High Dimensions

35 nips-2006-Approximate inference using planar graph decomposition

36 nips-2006-Attentional Processing on a Spike-Based VLSI Neural Network

37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

39 nips-2006-Balanced Graph Matching

40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

41 nips-2006-Bayesian Ensemble Learning

42 nips-2006-Bayesian Image Super-resolution, Continued

43 nips-2006-Bayesian Model Scoring in Markov Random Fields

44 nips-2006-Bayesian Policy Gradient Algorithms

45 nips-2006-Blind Motion Deblurring Using Image Statistics

46 nips-2006-Blind source separation for over-determined delayed mixtures

47 nips-2006-Boosting Structured Prediction for Imitation Learning

48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

49 nips-2006-Causal inference in sensorimotor integration

50 nips-2006-Chained Boosting

51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

52 nips-2006-Clustering appearance and shape by learning jigsaws

53 nips-2006-Combining causal and similarity-based reasoning

54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data

57 nips-2006-Conditional mean field

58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

59 nips-2006-Context dependent amplification of both rate and event-correlation in a VLSI network of spiking neurons

60 nips-2006-Convergence of Laplacian Eigenmaps

61 nips-2006-Convex Repeated Games and Fenchel Duality

62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

65 nips-2006-Denoising and Dimension Reduction in Feature Space

66 nips-2006-Detecting Humans via Their Pose

67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

69 nips-2006-Distributed Inference in Dynamical Systems

70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

73 nips-2006-Efficient Methods for Privacy Preserving Face Detection

74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

75 nips-2006-Efficient sparse coding algorithms

76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

77 nips-2006-Fast Computation of Graph Kernels

78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests

79 nips-2006-Fast Iterative Kernel PCA

80 nips-2006-Fundamental Limitations of Spectral Clustering

81 nips-2006-Game Theoretic Algorithms for Protein-DNA binding

82 nips-2006-Gaussian and Wishart Hyperkernels

83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization

86 nips-2006-Graph-Based Visual Saliency

87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

88 nips-2006-Greedy Layer-Wise Training of Deep Networks

89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

94 nips-2006-Image Retrieval and Classification Using Local Distance Functions

95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

96 nips-2006-In-Network PCA and Anomaly Detection

97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

98 nips-2006-Inferring Network Structure from Co-Occurrences

99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons

100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

103 nips-2006-Kernels on Structured Objects Through Nested Histograms

104 nips-2006-Large-Scale Sparsified Manifold Regularization

105 nips-2006-Large Margin Component Analysis

106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

108 nips-2006-Large Scale Hidden Semi-Markov SVMs

109 nips-2006-Learnability and the doubling dimension

110 nips-2006-Learning Dense 3D Correspondence

111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations

112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

113 nips-2006-Learning Structural Equation Models for fMRI

114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

115 nips-2006-Learning annotated hierarchies from relational data

116 nips-2006-Learning from Multiple Sources

117 nips-2006-Learning on Graph with Laplacian Regularization

118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

120 nips-2006-Learning to Traverse Image Manifolds

121 nips-2006-Learning to be Bayesian without Supervision

122 nips-2006-Learning to parse images of articulated bodies

123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding

124 nips-2006-Linearly-solvable Markov decision problems

125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

126 nips-2006-Logistic Regression for Single Trial EEG Classification

127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights

128 nips-2006-Manifold Denoising

129 nips-2006-Map-Reduce for Machine Learning on Multicore

130 nips-2006-Max-margin classification of incomplete data

131 nips-2006-Mixture Regression for Covariate Shift

132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

134 nips-2006-Modeling Human Motion Using Binary Latent Variables

135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification

137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

138 nips-2006-Multi-Task Feature Learning

139 nips-2006-Multi-dynamic Bayesian Networks

140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

141 nips-2006-Multiple timescales and uncertainty in motor adaptation

142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype

143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

145 nips-2006-Neurophysiological Evidence of Cooperative Mechanisms for Stereo Computation

146 nips-2006-No-regret Algorithms for Online Convex Programs

147 nips-2006-Non-rigid point set registration: Coherent Point Drift

148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity

149 nips-2006-Nonnegative Sparse PCA

150 nips-2006-On Transductive Regression

151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

153 nips-2006-Online Clustering of Moving Hyperplanes

154 nips-2006-Optimal Change-Detection and Spiking Neurons

155 nips-2006-Optimal Single-Class Classification Strategies

156 nips-2006-Ordinal Regression by Extended Binary Classification

157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier

158 nips-2006-PG-means: learning the number of clusters in data

159 nips-2006-Parameter Expanded Variational Bayesian Methods

160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron

163 nips-2006-Prediction on a Graph with a Perceptron

164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

166 nips-2006-Recursive Attribute Factoring

167 nips-2006-Recursive ICA

168 nips-2006-Reducing Calibration Time For Brain-Computer Interfaces: A Clustering Approach

169 nips-2006-Relational Learning with Gaussian Processes

170 nips-2006-Robotic Grasping of Novel Objects

171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation

173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

174 nips-2006-Similarity by Composition

175 nips-2006-Simplifying Mixture Models through Function Approximation

176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets

178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

179 nips-2006-Sparse Representation for Signal Classification

180 nips-2006-Speakers optimize information density through syntactic reduction

181 nips-2006-Stability of $K$-Means Clustering

182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures

183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction

184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

185 nips-2006-Subordinate class recognition using relational object models

186 nips-2006-Support Vector Machines on a Budget

187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks

189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

192 nips-2006-Theory and Dynamics of Perceptual Bistability

193 nips-2006-Tighter PAC-Bayes Bounds

194 nips-2006-Towards a general independent subspace analysis

195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

196 nips-2006-TrueSkill™: A Bayesian Skill Rating System

197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall

198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

203 nips-2006-implicit Online Learning with Kernels