nips nips2006 nips2006-131 knowledge-graph by maker-knowledge-mining

131 nips-2006-Mixture Regression for Covariate Shift


Source: pdf

Author: Masashi Sugiyama, Amos J. Storkey

Abstract: In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 jp Abstract In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. [sent-7, score-0.306]

2 The situations where the training and test data are from different distributions is called covariate shift. [sent-9, score-0.911]

3 Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. [sent-10, score-1.018]

4 Recent work on covariate shift can be understood in terms of mixture regression. [sent-13, score-1.207]

5 Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. [sent-14, score-0.772]

6 1 Introduction There is a common presumption in developing supervised methods that the distribution of training points used for learning supervised models will match the distribution of points seen in a new test scenario. [sent-16, score-0.369]

7 The expectation that the training and test points follow the same distribution is explicitly stated in [2, p. [sent-17, score-0.224]

8 ” Cases where test scenarios truly match the training data are probably rare. [sent-27, score-0.23]

9 The problem of mismatch has been grappled within literature from a number of fields, and has become known as covariate shift [14]. [sent-28, score-0.965]

10 Specific examples of covariate shift include situations in reinforcement learning [c. [sent-29, score-1.014]

11 The common issue of sample selection bias [7] is a particular case of covariate shift. [sent-34, score-0.705]

12 Much of the recent analysis of covariate shift has been made in the context of assessing the asymptotic bias of various estimators [15]. [sent-35, score-1.053]

13 where the model from which the training data is generated is not included in the training model class), some typical estimators, such as least squares approaches, produce biased asymptotic estimators [14]. [sent-38, score-0.375]

14 It might appear that the presumption of matched models in Bayesian analysis means covariate shift is not an issue: failure or otherwise under situations of covariate shift is solved by valid choice for the prior distribution over conditional models. [sent-39, score-2.095]

15 In fact we can categorise at least three different types of covariate shift: 1. [sent-41, score-0.649]

16 Independent covariate shift: Ptrain (y|x) = Ptest (y|x), but Ptrain (x) = Ptest (x). [sent-42, score-0.628]

17 Then Case 1 is the only one of the above where covariate shift will have no effect on modelling. [sent-48, score-0.965]

18 Case 3 involves a more general assumption, and arguably can be used to cover most situations of covariate shift, by incorporating any known structural characteristics of the problem into some latent variable r. [sent-50, score-0.754]

19 Change in the distribution of x points implicitly informs us about variation in the targets y via the shift in the latent variable r, which is the causal factor for the change. [sent-51, score-0.424]

20 The purpose of this paper is to provide a generative framework for analysis of covariate shift. [sent-52, score-0.651]

21 The main advantages of this new formulation over previous approaches are • It provides an explicit model for the changes occurring in situations of covariate shift, and hence the predictions that result from it. [sent-53, score-0.742]

22 • There is no need to presume the training and test distributions are known. [sent-54, score-0.294]

23 Furthermore the test covariates are also used as part of the model estimation procedure, resulting in better predictions. [sent-55, score-0.212]

24 • Utilising the test covariate distribution gives performance benefits over using the same model for training data alone. [sent-58, score-0.89]

25 • All the usual machinery for mixture of experts are available, and so this approach allows model selection and many natural extensions. [sent-59, score-0.325]

26 A specific form of mixture regression model is formulated and an Expectation Maximisation solution is given in Section 3. [sent-62, score-0.399]

27 2 Prior work Covariate shift will be interpreted, in the context of this work, using mixture of regressor models, where the regression model is dependent on a latent class variable. [sent-69, score-0.995]

28 The benefits of the mixture of regressor approach for heterogeneous data was discussed in [17], but not formulated specifically for the problem of covariate shift. [sent-71, score-1.086]

29 This paper establishes for the first time the relationship between the mixture of regressor model and the typical statistical results in the literature on covariate shift. [sent-72, score-1.09]

30 The main differences of our approach from a standard mixture of regressor formalism is that we utilise the training and test distributions as part of the model and do not use only a conditional model, and we allow coupling of regressors across different mixture components. [sent-73, score-1.222]

31 The mixture of regressors form for (x, y) used in this paper is a specific from of mixture of experts [10]. [sent-75, score-0.785]

32 The problem of sample selection bias is related to covariate shift. [sent-77, score-0.705]

33 The problem of sample selection bias differs from the case in this paper as here there is no fundamental requirement of distribution overlap between the training and test sets. [sent-79, score-0.281]

34 Second, the presumption is different: rather than there being a sample rejection process that characterised the difference between training and test sets, there is a sample production process that differs. [sent-81, score-0.372]

35 Each datum x is assumed to have been generated from one of a number of data sources using a mixture distribution corresponding to the source. [sent-83, score-0.368]

36 The proportions of each of the sources varies across the training and test datasets. [sent-84, score-0.371]

37 Hence, in the context of this paper, we understand covariate shift to be effected by a change in the contribution of different sources to the data. [sent-85, score-1.037]

38 The two sets of sources have the following characteristics: • Source set 1 corresponds to sources that may occur in the test data, and potentially also in the training data, and are associated with regression model P1 (y|x). [sent-89, score-0.507]

39 • Source set 2 corresponds to sources that occur only in the training data, and are associated with regression model P2 (y|x). [sent-90, score-0.335]

40 By taking this approach we note that we will be able to separate out effects that we expect to be only characteristics of the training data from effects that are common across training and test sets. [sent-91, score-0.313]

41 The full generative model for the observed data consists of the model for the training data D and model for the test data T . [sent-92, score-0.401]

42 The test data is just used to determine the nature of the covariate shift, and consists of only of the covariates x, and not any targets y. [sent-93, score-0.815]

43 We emphasise that we do not presume to have seen the test data we wish to predict. [sent-94, score-0.214]

44 Rather a prior model is built for the training and test data, and this is then conditioned on the information from the training data and the known covariates for the test data but not the unknown targets. [sent-95, score-0.553]

45 This significantly extends the previous work on covariate shift, in that the model allows for unknown training and test distributions, and utilises a mixture model approach for representing the relationship between the two. [sent-98, score-1.176]

46 2, we will show how the previous results on covariate shift are special cases of the general model. [sent-101, score-0.984]

47 We will develop this formalism for any parametric form for the regressors P (y|x). [sent-102, score-0.294]

48 The model takes the following form • The distribution of the training data and test data are denoted PD and PT respectively, and are unknown in general. [sent-104, score-0.314]

49 • Source set 1 consists of M mixture distributions, where mixture t is denoted P1t (x). [sent-105, score-0.51]

50 • Source set 2 consists of M2 mixture distributions, where mixture t is denoted P2t (x). [sent-107, score-0.51]

51 2 If a component i is associated with a regression model j, this means that any datum x generated from the mixture component i, will also have a corresponding y generated from the associated regression model Pj (y|x). [sent-110, score-0.716]

52 Finally γ1t are the proportions of each mixture from source set 1 in the test data. [sent-112, score-0.567]

53 (2) ν∈T µ where s denotes the source set used to generate the data point µ, and tµ denotes the particular mixture from that source set used to generate the data point µ. [sent-116, score-0.47]

54 In words, this says that the model for the training dataset involves sampling the particular source set sµ , then the mixture component tµ from that particular source set. [sent-117, score-0.606]

55 Given these we then sample an xµ from the relevant mixture and a yµ conditionally on xµ from the relevant regressor. [sent-118, score-0.262]

56 The same procedure is followed for the test set, except now there is only one source set to consider. [sent-119, score-0.209]

57 1 EM algorithm A maximum likelihood solution for the parameters (β, γ, Θ, Ω) can be obtained for this model (given the training data and test covariate) using Expectation Maximisation (EM) [3]. [sent-122, score-0.262]

58 Denote the µ responsibility of mixture i for data point µ by αi . [sent-126, score-0.31]

59 The E-step update uses current parameter values to compute the responsibility (denoted by αs) of each mixture 1t and 2t for each data point µ in the training set and each data point ν in the test set using µ αst = T γ1t P1t (xµ |Ω) . [sent-128, score-0.54]

60 T µ t γ1t P1t (x |Ω) D βs γst Pst (xµ |Ω)Ps (yµ |xµ , Θ) ν and α1t = D βs γst Pst (xµ |Ω)Ps (yµ |xµ , Θ) s,t (4) µ We set α2t = 0 for ν ∈ T , as none of these mixtures are represented in the test set. [sent-129, score-0.233]

61 The parameters of the mixture model distributions are then updated with the usual M steps for the relevant mixture component, and the regression parameters are updated using maximum responsibility-weighted likelihood. [sent-130, score-0.664]

62 The test data is associated with a single regression model P1 (y|x), and so the predictive distribution for the test set is the learnt predictor P1 (y|xi ) for each point xi in the test set. [sent-132, score-0.613]

63 2 Importance Weighted Least Squares Previous results in modelling covariate shift can be obtained as special cases of the general approach taken in this paper. [sent-135, score-1.049]

64 Suppose also that the two regressors have a large and identical variance Σ. [sent-137, score-0.273]

65 In this simple case, we do not need to know the actual test points (in this framework these are only used to infer the test distribution, which is assumed given here). [sent-138, score-0.262]

66 Hence we never need to learn β1 or the parameters associated with mixture 2 in this procedure. [sent-142, score-0.268]

67 This is the Importance Weighted Least Squares estimator for covariate shift [14]. [sent-145, score-0.965]

68 A simple extension of this model will allow the large variance assumption to be relaxed, so the model can use the regressor information for computing responsibilities. [sent-146, score-0.23]

69 1 Examples Generated Test Data We demonstrate the mixture of regressors approach to covariate shift (MRCS) on generated test data: a one dimensional regression problem with two sources each corresponding to different linear regressors. [sent-148, score-1.796]

70 Regression performance for MRCS with Gaussian mixtures and linear regressors is compared with three other cases. [sent-149, score-0.385]

71 The first is an importance weighted least squares estimator (IWLS) given the best mixture model fit for the data, corresponding to the current standard for modelling covariate shift. [sent-150, score-1.108]

72 The second uses a mixture of regressors model that ignores the form of the test data, but chooses the regressor corresponding to the mixtures which best match the test data distribution using a KL divergence measure (MRKL). [sent-151, score-1.093]

73 This corresponds to recognising that covariate shift can happen, but ignoring the nature of the test distribution in the modelling process, and trying to choose the best of the two regressors. [sent-152, score-1.177]

74 The third case is where the mixture of regressors is used simply as a standard regression model, ignoring the possibility of covariate shift (MRREG). [sent-153, score-1.631]

75 The generative procedure for each of the 100 test datasets involves generating random parameter values for between 1 and 3 mixtures for each of two linear regressors. [sent-154, score-0.313]

76 Test and training datasets of 200 data points each are generated from these mixtures and regression models, using different mixing proportions in each case. [sent-155, score-0.529]

77 Even though the regularity conditions for BIC do not hold for mixture models, it has been shown that BIC is consistent in this case [12]. [sent-160, score-0.242]

78 The results of these tests show the significant benefits of explicit recognition of covariate shift over straight regression even compared with the use of the same mixture of regressors model, but without reference to the test distribution. [sent-162, score-1.804]

79 It also shows benefits of the approach of this paper over the current state of the art for modelling covariate shift. [sent-163, score-0.693]

80 Again MRCS performs best, and consistently gives better performance on the test data for more than 70 percent of the test cases. [sent-167, score-0.287]

81 It can be seen that both IWLS and MRKL fail to untangle the regressors associated with the overlapping central clusters in the training data and hence perform badly in that region of the test data. [sent-171, score-0.557]

82 ’ data labels the points for which the test regressor has greater responsibility, and the ’+’ data labels points for which the training only regressor has greater responsibility. [sent-175, score-0.628]

83 Table 1: Average mean square error over all 100 datasets for each choice of fixed model mixture size. [sent-176, score-0.322]

84 MRCS - mixture of regressors for covariate shift, IWLS - Importance weighted least squares, MRKL - Mixture of regressors, evaluated on regressor with best fit to test distribution. [sent-178, score-1.475]

85 MRREG - Mixture of regressors as a standard regression model, ignoring covariate shift. [sent-179, score-1.052]

86 This provides a natural scenario for demonstrating covariate shift. [sent-209, score-0.628]

87 To demonstrate covariate shift we can consider the prediction task trained on cars from one place of origin and tested on cars from another place of origin. [sent-216, score-1.122]

88 We train the model using data on cars from origin 1, and test on cars from origin 2 and origin 3. [sent-218, score-0.47]

89 We use the same test algorithms as the previous section, but now using a Gaussian process regressors for each regression function. [sent-219, score-0.538]

90 Critically, we note that all covariate shift methods performed better than a straight Gaussian Process predictor in this situation. [sent-224, score-1.034]

91 The mixture of Gaussian processes did not perform as well as the methods which explicitly recognised the covariate shift, although interestingly did perform better than a straight Gaussian process predictor. [sent-225, score-0.939]

92 5 Discussion This paper establishes that explicit generative modelling of covariate shift can bring improvements over conditional regression models, or over standard covariate shift methods that ignore the dependent data in the modelling process. [sent-227, score-2.325]

93 The method is also better than using an identical mixture of regressors model for the training data alone, as it utilises the positions of the independent test points to help refine the mixture locations and the separation of regressors. [sent-228, score-1.096]

94 This framework is currently being extended to the case of multiple training and test datasets using a fully Bayesian scheme, and will be the subject of future work. [sent-230, score-0.232]

95 706 similar to Latent Dirichlet Allocation, where each dataset is built from a number of contributing regression components, where each component is expressed in different proportions in each dataset. [sent-245, score-0.261]

96 6 Conclusion In this paper a novel approach to the problem of covariate shift has been developed that is demonstrably better than state of the art regression approaches, and better than the current standard for covariate shift. [sent-247, score-1.756]

97 These have been tested on both generated data, and on a real problem of covariate shift, derived from a standard UCI dataset. [sent-248, score-0.647]

98 Specifically we provide explicit modelling of the covariate shift process by assuming a shift in the proportions of a number of latent components. [sent-250, score-1.593]

99 A mixture of regressors model is used for this purpose, but it differs from standard mixture of regressors by allowing sharing of the regression functions between mixture components and explicitly including a model for the test set as part of the process. [sent-251, score-1.582]

100 Improving predictive inference under covariate shift by weighting the log-likelihood function. [sent-336, score-0.965]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('covariate', 0.628), ('shift', 0.337), ('regressors', 0.273), ('mixture', 0.242), ('mrcs', 0.21), ('regressor', 0.166), ('iwls', 0.134), ('ptest', 0.134), ('regression', 0.125), ('test', 0.121), ('proportions', 0.116), ('mrkl', 0.115), ('ptrain', 0.115), ('mixtures', 0.112), ('source', 0.088), ('training', 0.083), ('st', 0.08), ('mrreg', 0.077), ('fs', 0.071), ('origin', 0.067), ('bic', 0.067), ('mst', 0.067), ('presume', 0.067), ('latent', 0.067), ('modelling', 0.065), ('presumption', 0.057), ('em', 0.055), ('sources', 0.051), ('pd', 0.051), ('squares', 0.05), ('pt', 0.047), ('ps', 0.047), ('importance', 0.046), ('maximisation', 0.046), ('cars', 0.045), ('responsibility', 0.042), ('covariates', 0.04), ('clusterwise', 0.038), ('kst', 0.038), ('pst', 0.038), ('utilises', 0.038), ('bias', 0.034), ('fuel', 0.033), ('characterised', 0.033), ('consumption', 0.033), ('model', 0.032), ('straight', 0.031), ('datum', 0.03), ('presumed', 0.03), ('situations', 0.03), ('estimators', 0.029), ('involves', 0.029), ('hence', 0.028), ('minimisation', 0.028), ('datasets', 0.028), ('experts', 0.028), ('bayesian', 0.027), ('denoted', 0.026), ('associated', 0.026), ('ignoring', 0.026), ('data', 0.026), ('dependent', 0.026), ('generalisation', 0.025), ('assessing', 0.025), ('supervised', 0.025), ('ts', 0.025), ('sugiyama', 0.024), ('says', 0.024), ('explicit', 0.024), ('weighted', 0.024), ('discussed', 0.024), ('selection', 0.023), ('gp', 0.023), ('tests', 0.023), ('distributions', 0.023), ('bene', 0.023), ('generative', 0.023), ('gaussian', 0.022), ('criterion', 0.022), ('establishes', 0.022), ('learnt', 0.022), ('change', 0.021), ('least', 0.021), ('formalism', 0.021), ('prior', 0.021), ('choice', 0.02), ('component', 0.02), ('sample', 0.02), ('points', 0.02), ('conditional', 0.019), ('predictor', 0.019), ('relaxed', 0.019), ('better', 0.019), ('generated', 0.019), ('reinforcement', 0.019), ('process', 0.019), ('special', 0.019), ('uci', 0.018), ('models', 0.018), ('occur', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 131 nips-2006-Mixture Regression for Covariate Shift

Author: Masashi Sugiyama, Amos J. Storkey

Abstract: In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making. 1

2 0.11444797 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring significant computational benefits in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simplified models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.

3 0.08864592 46 nips-2006-Blind source separation for over-determined delayed mixtures

Author: Lars Omlor, Martin Giese

Abstract: Blind source separation, i.e. the extraction of unknown sources from a set of given signals, is relevant for many applications. A special case of this problem is dimension reduction, where the goal is to approximate a given set of signals by superpositions of a minimal number of sources. Since in this case the signals outnumber the sources the problem is over-determined. Most popular approaches for addressing this problem are based on purely linear mixing models. However, many applications like the modeling of acoustic signals, EMG signals, or movement trajectories, require temporal shift-invariance of the extracted components. This case has only rarely been treated in the computational literature, and specifically for the case of dimension reduction almost no algorithms have been proposed. We present a new algorithm for the solution of this problem, which is based on a timefrequency transformation (Wigner-Ville distribution) of the generative model. We show that this algorithm outperforms classical source separation algorithms for linear mixtures, and also a related method for mixtures with delays. In addition, applying the new algorithm to trajectories of human gaits, we demonstrate that it is suitable for the extraction of spatio-temporal components that are easier to interpret than components extracted with other classical algorithms. 1

4 0.081647508 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

Author: David Barber, Bertrand Mesot

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used. The observation (or ‘visible’) vt ∈ RV is linearly related to the hidden state ht ∈ RH with additive noise η by vt = B(st )ht + η v (st ) p(vt |ht , st ) = N (B(st )ht , Σv (st )) ≡ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear, ht = A(st )ht−1 + η h (st ), ≡ p(ht |ht−1 , st ) = N A(st )ht−1 , Σh (st ) (2) The switch st may depend on both the previous st−1 and ht−1 . This is an augmented SLDS (aSLDS), and defines the model T p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ) p(v1:T , h1:T , s1:T ) = t=1 The standard SLDS[4] considers only switch transitions p(st |st−1 ). At time t = 1, p(s1 |h0 , s0 ) simply denotes the prior p(s1 ), and p(h1 |h0 , s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the filtered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both filtered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4]. s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. 2 Expectation Correction Our approach to approximate p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel ‘correction’ smoother for the simpler LDS [1].The method consists of a single forward pass to recursively find the filtered posterior p(ht , st |v1:t ), followed by a single backward pass to correct this into a smoothed posterior p(ht , st |v1:T ). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7]. 2.1 Forward Pass (Filtering) Readers familiar with ADF may wish to continue directly to Section (2.2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the filtered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ) (3) The exact representation of p(ht |st , v1:t ) is a mixture with O(S t ) components. We therefore approximate this with a smaller I-component mixture I p(ht |st , v1:t ) ≈ p(ht |it , st , v1:t )p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean f (it , st ) and covariance F (it , st ). To find a recursion for these parameters, consider p(ht+1 |st+1 , v1:t+1 ) = p(ht+1 |st , it , st+1 , v1:t+1 )p(st , it |st+1 , v1:t+1 ) (4) st ,it Evaluating p(ht+1 |st , it , st+1 , v1:t+1 ) We find p(ht+1 |st , it , st+1 , v1:t+1 ) by first computing the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements, Σhh = A(st+1 )F (it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh B T (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F (it , st ), µv = B(st+1 )A(st+1 )f (it , st ), µh = A(st+1 )f (it , st ) (5) and then conditioning on vt+1 1 . For the case S = 1, this forms the usual Kalman Filter recursions[1]. Evaluating p(st , it |st+1 , v1:t+1 ) The mixture weight in (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) (6) 1 p(x|y) is a Gaussian with mean µx + Σxy Σ−1 (y − µy ) and covariance Σxx − Σxy Σ−1 Σyx . yy yy The first factor in (6), p(vt+1 |it , st , st+1 , v1:t ) is a Gaussian with mean µv and covariance Σvv , as given in (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, (7) will generally need to be computed numerically. Closing the recursion We are now in a position to calculate (4). For each setting of the variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form I p(ht+1 |st+1 , v1:t+1 ) ≈ p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ) it+1 =1 Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefficients p(it+1 |st+1 , v1:t+1 ), it+1 ∈ 1, . . . , I are defined, completing the description of how to form a recursion for p(ht+1 |st+1 , v1:t+1 ) in (3). A recursion for the switch variable is given by p(st+1 |v1:t+1 ) ∝ p(vt+1 |st+1 , it , st , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |vt ) = p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) it ,st ,st+1 2.2 Backward Pass (Smoothing) The main contribution of this paper is to find a suitable way to ‘correct’ the filtered posterior p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ) and our aim is to find a recursion for these parameters. A useful starting point for a recursion is: p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ) p(ht , st |v1:T ) = st+1 The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) (8) ht+1 The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (9) The difficulty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian2 . One possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms ‘standard’ EC) is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), where p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ p(st+1 |v1:T )p(st |st+1 , v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (10) st+1 2 In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians, see [7]. However, since in (9) the two terms p(ht+1 |st+1 , v1:T ) will only be approximately computed during the recursion, our approximation to p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians. Evaluating p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we find p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found from a forward dynamics from the filtered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht . The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , which are given by ht+1 = A(st+1 )ft (st ), Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 )+Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ) Given the statistics of (11), we may now condition on ht+1 to find p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − − ht = A (st , st+1 )ht+1 + ←(st , st+1 ) η ← − ← − − − where A (st , st+1 ) and ←(st , st+1 ) ∼ N (← t , st+1 ), Σ (st , st+1 )) are easily found using η m(s conditioning. Averaging the above reversed dynamics over p(ht+1 |st+1 , v1:T ), we find that p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 )+← t , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 )+ Σ (st , st+1 ) m(s These equations directly mirror the standard RTS backward pass[1]. Evaluating p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) ′ ′ s′ p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) (13) t Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the forward pass, (7). In (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing (11). Computing the average of (13) with respect to p(ht+1 |st+1 , v1:T ) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(ht+1 |st+1 , v1:T ). More sophisticated methods (see [7]) such as sampling from the Gaussian p(ht+1 |st+1 , v1:T ) have the advantage that covariance information is used3 . Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the joint (8) by conditioning on st to form the mixture p(ht |st , v1:T ) = p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = p(st+1 |v1:T ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) st+1 3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme. 2.3 Relation to other methods The EC Backward pass is closely related to Kim’s method [8]. In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple backward pass. The other ‘approximation’ in EC is to numerically compute the average in (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (15) Since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ), this can be computed simply from the filtered results alone. The fundamental difference therefore between EC and Kim’s method is that the approximation, (15), is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is defined by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difficulties arise. Unlike EP, EC is not based on a consistency criterion and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversed is proposed. However, the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically finessed. In [12] a variational method approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal inference p(ht , st |v1:T ). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters)[13], are essentially mixture of delta-function approximations. Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior. 3 Demonstration Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure(2) for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h1:T , s1:T ), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferences, and compare how our most probable posterior smoothed estimates arg maxst p(st |v1:T ) compare with the assumed correct sample st . We chose conditions that, from the viewpoint of classical signal processing, are difficult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal vt . For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate PF RBPF EP ADFS KimS ECS ADFM KimM ECM 1000 800 600 400 200 0 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 Figure 2: The number of errors in estimating p(st |v1:T ) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors are over 1000 experiments. The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a multiple of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H). H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . At time t = 1, the priors are p1 = uniform, with h1 drawn from N (10 ∗ randn(H, 1), IH ). the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the RaoBlackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conflicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best filtered results are given using ADF, since this is better able to represent the variance in the filtered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the filtered results considerably. One should bear in mind that both EC and Kim’s method use the same ADF filtered results. This demonstrates that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), (12), may bring significant benefits. We found similar conclusions for experiments with an aSLDS[7]. 4 Application to Noise Robust ASR Here we briefly present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC, and serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classifier, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difficult, mainly because the explicit influence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that h the observed signal vt is modeled as a noise corrupted version of a clean hidden signal vt : h vt = vt + ηt ˜ 4 with ηt ∼ N (0, σ 2 ) ˜ ˜ Generalized EP [5], which groups variables together improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter personal communication. Noise Variance 0 10−7 10−6 10−5 10−4 10−3 SNR (dB) 26.5 26.3 25.1 19.7 10.6 0.7 HMM 100.0% 100.0% 90.9% 86.4% 59.1% 9.1% SAR-HMM 97.0% 79.8% 56.7% 22.2% 9.7% 9.1% AR-SLDS 96.8% 96.8% 96.4% 94.8% 84.0% 61.2% Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise. The dynamics of the clean signal is modeled by a switching AR process R h vt = h h cr (st )vt−r + ηt (st ), h ηt (st ) ∼ N (0, σ 2 (st )) r=1 where st ∈ {1, . . . , S} denotes which of a set of AR coefficients cr (st ) are to be used at time t, h and ηt (st ) is the so-called innovation noise. When σ 2 (st ) ≡ 0, this model reproduces the SARHMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ 2 (st ) > 0 the model can be recast as an SLDS. To do this we define ht as a vector which contains the R most recent clean hidden samples ht = h vt h . . . vt−r+1 T (16) and we set A(st ) to be an R × R matrix where the first row contains the AR coefficients −cr (st ) and the rest is a shifted down identity matrix. For example, for a third order (R = 3) AR process, A(st ) = −c1 (st ) −c2 (st ) −c3 (st ) 1 0 0 0 1 0 . (17) The hidden covariance matrix Σh (s) has all elements zero, except the top-left most which is set to the innovation variance. To extract the first component of ht we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise 2 variance is given by Σv ≡ σv . A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σh is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16]. 4.1 Training & Evaluation Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and ‘oh’) from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix. Each state was associated with a 10thorder AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefficients cr (s) are copied into the first row of the hidden transition matrix A(s) and the same discrete transition distribution p(st | st−1 ) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at a SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps. 5 Discussion We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that due to the forgetting which commonly occurs in Markovian models, a finite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case. The main benefit of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks. Acknowledgements This work is supported by the EU Project FP6-0027787. This paper only reflects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and Xiao-Rong Li. Estimation and Tracking : Principles, Techniques and Software. Artech House, Norwood, MA, 1998. [2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing systems (NIPS 13), pages 981–987, 2001. [3] A. T. Cemgil, B. Kappen, and D. Barber. A Generative Model for Music Transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679 – 694, 2006. [4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002. [5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005. [6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001. [7] D. Barber. Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems. Journal of Machine Learning Research, 7:2515–2540, 2006. [8] C-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994. [9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Art. Intelligence, pages 216–223, 2002. [10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001. [11] G. Kitagawa. The Two-Filter Formula for Smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994. [12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998. [13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. [14] B. Mesot and D. Barber. Switching Linear Dynamical Systems for Noise Robust Speech Recognition. IDIAP-RR 08, 2006. [15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noiserobust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005. [16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005. [17] R.G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP84, volume 3, 1984.

5 0.080193326 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

6 0.076265283 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

7 0.075691491 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

8 0.069045015 126 nips-2006-Logistic Regression for Single Trial EEG Classification

9 0.066026293 33 nips-2006-Analysis of Representations for Domain Adaptation

10 0.064371526 113 nips-2006-Learning Structural Equation Models for fMRI

11 0.060109485 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

12 0.057797417 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

13 0.054875743 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

14 0.051818788 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

15 0.051747341 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression

16 0.051438827 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

17 0.051065434 20 nips-2006-Active learning for misspecified generalized linear models

18 0.051050417 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data

19 0.05032498 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

20 0.048160594 66 nips-2006-Detecting Humans via Their Pose


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.159), (1, 0.028), (2, 0.018), (3, -0.06), (4, -0.047), (5, 0.019), (6, 0.109), (7, -0.016), (8, -0.009), (9, 0.082), (10, -0.024), (11, 0.044), (12, 0.03), (13, 0.077), (14, 0.07), (15, -0.035), (16, 0.114), (17, -0.041), (18, -0.032), (19, -0.019), (20, -0.066), (21, -0.077), (22, -0.046), (23, 0.103), (24, 0.09), (25, -0.118), (26, 0.175), (27, 0.137), (28, -0.002), (29, 0.023), (30, 0.029), (31, -0.044), (32, -0.122), (33, -0.13), (34, -0.039), (35, -0.141), (36, 0.053), (37, 0.157), (38, 0.014), (39, -0.019), (40, 0.087), (41, -0.014), (42, -0.022), (43, -0.013), (44, 0.097), (45, -0.024), (46, -0.123), (47, -0.039), (48, 0.128), (49, -0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9501307 131 nips-2006-Mixture Regression for Covariate Shift

Author: Masashi Sugiyama, Amos J. Storkey

Abstract: In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making. 1

2 0.55743289 158 nips-2006-PG-means: learning the number of clusters in data

Author: Yu Feng, Greg Hamerly

Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1

3 0.53706563 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects

Author: Seyoung Kim, Padhraic Smyth

Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group level variability in both the mixing proportions and the component parameters. Variabilities in mixing proportions across groups are handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI). 1

4 0.47718608 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1

5 0.47308254 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

Author: Steffen Bickel, Tobias Scheffer

Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users. 1

6 0.46655646 175 nips-2006-Simplifying Mixture Models through Function Approximation

7 0.45747685 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

8 0.41061851 5 nips-2006-A Kernel Method for the Two-Sample-Problem

9 0.40652102 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments

10 0.40215117 113 nips-2006-Learning Structural Equation Models for fMRI

11 0.3985346 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

12 0.39514214 121 nips-2006-Learning to be Bayesian without Supervision

13 0.39074767 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

14 0.384523 180 nips-2006-Speakers optimize information density through syntactic reduction

15 0.3802931 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

16 0.37254393 46 nips-2006-Blind source separation for over-determined delayed mixtures

17 0.36834213 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

18 0.36577144 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

19 0.33958638 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype

20 0.33792698 20 nips-2006-Active learning for misspecified generalized linear models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.128), (3, 0.011), (7, 0.065), (9, 0.028), (20, 0.024), (22, 0.102), (44, 0.059), (55, 0.022), (57, 0.07), (65, 0.05), (69, 0.023), (71, 0.014), (85, 0.267), (90, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78149205 131 nips-2006-Mixture Regression for Covariate Shift

Author: Masashi Sugiyama, Amos J. Storkey

Abstract: In supervised learning there is a typical presumption that the training and test points are taken from the same distribution. In practice this assumption is commonly violated. The situations where the training and test data are from different distributions is called covariate shift. Recent work has examined techniques for dealing with covariate shift in terms of minimisation of generalisation error. As yet the literature lacks a Bayesian generative perspective on this problem. This paper tackles this issue for regression models. Recent work on covariate shift can be understood in terms of mixture regression. Using this view, we obtain a general approach to regression under covariate shift, which reproduces previous work as a special case. The main advantages of this new formulation over previous models for covariate shift are that we no longer need to presume the test and training densities are known, the regression and density estimation are combined into a single procedure, and previous methods are reproduced as special cases of this procedure, shedding light on the implicit assumptions the methods are making. 1

2 0.60293365 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1

3 0.60175377 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

4 0.59899354 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

5 0.59793174 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

6 0.59582019 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

7 0.59556216 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

8 0.5953815 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

9 0.59411645 65 nips-2006-Denoising and Dimension Reduction in Feature Space

10 0.59161901 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

11 0.58855021 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

12 0.58734953 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

13 0.58560848 138 nips-2006-Multi-Task Feature Learning

14 0.5854305 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

15 0.58505774 175 nips-2006-Simplifying Mixture Models through Function Approximation

16 0.58179915 194 nips-2006-Towards a general independent subspace analysis

17 0.58116597 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

18 0.58034903 169 nips-2006-Relational Learning with Gaussian Processes

19 0.57985932 79 nips-2006-Fast Iterative Kernel PCA

20 0.57922214 126 nips-2006-Logistic Regression for Single Trial EEG Classification