nips nips2010 nips2010-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and low-rank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. [sent-11, score-0.595]
2 While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. [sent-12, score-0.22]
3 To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. [sent-13, score-0.313]
4 In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and low-rank representations of the data, capable of handling large-scale problems. [sent-14, score-0.218]
5 Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. [sent-15, score-0.866]
6 1 Introduction Latent representations of data are widespread tools in supervised and unsupervised learning. [sent-16, score-0.141]
7 In supervised learning, latent models are often used in a generative way, e.g., through mixture models on the input variables only, which may not lead to increased predictive performance. [sent-18, score-0.218] [sent-20, score-0.122]
9 This has led to numerous works on supervised dimension reduction (e.g., [1, 2]), where the final discriminative goal of prediction is taken explicitly into account during the learning process. [sent-21, score-0.140] [sent-23, score-0.225]
11 In this context, various probabilistic models have been proposed, such as mixtures of experts [3] or discriminative restricted Boltzmann machines [4], where a layer of hidden variables is used between the inputs and the outputs of the supervised learning model. [sent-24, score-0.568]
12 Parameters are usually estimated by expectation-maximization (EM), a method that is computationally efficient but whose cost function may have many local maxima in high dimensions. [sent-25, score-0.1]
13 In this paper, we consider a simple discriminative latent class (DLC) model where inputs and outputs are independent given the latent representation. [sent-26, score-0.483]
14 – We provide in Section 2 a quadratic (non-convex) local approximation of the log-likelihood of our model based on the EM auxiliary function. [sent-28, score-0.253]
15 – We provide in Section 3 a novel probabilistic interpretation of discriminative clustering with added benefits, such as fewer hyperparameters than previous approaches [5, 6, 7]. [sent-31, score-0.451]
16 – We design in Section 4 a low-rank optimization method for non-convex quadratic problems over a product of simplices. [sent-32, score-0.168]
17 This method relies on a convex relaxation over completely positive matrices. [sent-33, score-0.159]
18 – We perform experiments on text documents in Section 5, where we show that our inference technique outperforms existing supervised dimension reduction and clustering methods. [sent-34, score-0.375]
19 2 Probabilistic discriminative latent class models We consider a set of N observations xn ∈ Rp, and their labels yn ∈ {1, . . . , M}. [sent-35, score-0.599]
20 We assume that each observation xn has a certain probability to be in one of K latent classes, modeled by introducing hidden variables zn ∈ {1, . . . , K}, and that these classes should be predictive of the label yn. [sent-42, score-0.503] [sent-45, score-0.213]
22 We model directly the conditional probability of zn given the input data xn and the probability of the label yn given zn , while making the assumption that yn and xn are independent given zn (leading to the directed graphical model xn → zn → yn ). [sent-46, score-1.575]
23 More precisely, we assume that, given xn, zn follows a multinomial logit model while, given zn, yn is a multinomial variable: p(zn = k | xn) = exp(wk^T xn + bk) / Σ_{j=1}^K exp(wj^T xn + bj) and p(yn = m | zn = k) = αkm, (1) with wk ∈ Rp, bk ∈ R and Σ_{m=1}^M αkm = 1. [sent-47, score-1.337]
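To make Eq. (1) concrete, here is a minimal numpy sketch of the forward model: a softmax over wk^T xn + bk for p(z | x), followed by marginalization over the latent class to get p(y | x). Variable names and array shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def predict_proba(X, W, b, alpha):
    """Sketch of the DLC forward model of Eq. (1).

    X:     (N, p) inputs x_n
    W:     (p, K) columns w_k of the multinomial logit model
    b:     (K,)   intercepts b_k
    alpha: (K, M) rows p(y = m | z = k), each summing to 1
    Returns p(y_n = m | x_n) as an (N, M) array, marginalizing z_n.
    """
    logits = X @ W + b                              # w_k^T x_n + b_k
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    pz = np.exp(logits)
    pz /= pz.sum(axis=1, keepdims=True)             # p(z_n = k | x_n)
    return pz @ alpha                               # sum_k p(z=k|x) p(y=m|z=k)
```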
24 First, it is an instance of a mixture of experts [3] where each expert has a constant prediction. [sent-59, score-0.108]
25 It has thus weaker predictive power than general mixtures of experts; however, it allows efficient optimization as shown in Section 4. [sent-60, score-0.126]
26 It would be interesting to extend our optimization techniques to the case of experts with non-constant predictions. [sent-61, score-0.115]
27 This is what is done in [8] where a convex relaxation of EM for a similar mixture of experts is considered. [sent-62, score-0.267]
28 However, [8] considers the maximization with respect to hidden variables rather than their marginalization, which is essential in our setting to have a well-defined probabilistic model. [sent-63, score-0.114]
29 Note also that in [8], the authors derive a convex relaxation of the softmax regression problems, while we derive a quadratic approximation. [sent-64, score-0.365]
30 Indeed, if we marginalize the latent variable z, we get that the probability of y given x is a linear combination of softmax functions of linear functions of the input variables x. [sent-67, score-0.257]
31 Thus, the only difference with a two-layer neural network with softmax functions for the last layer is the fact that our last layer considers linear parameterization in the mean parameters rather than in the natural parameters of the multinomial variable. [sent-68, score-0.238]
32 Among probabilistic models, a discriminative restricted Boltzmann machine (RBM) [4, 9] models p(y|z) as a softmax function of linear functions of z. [sent-70, score-0.389]
33 Again, this distinction between mean parameters and natural parameters allows us to derive a quadratic approximation of our cost function. [sent-72, score-0.153]
34 It would of course be of interest to extend our optimization technique to the discriminative RBM. [sent-73, score-0.276]
35 3 Inference We consider the negative conditional log-likelihood of yn given xn (regularized in w to avoid overfitting), where θ = (α, w, b) and ynm is equal to 1 if yn = m and 0 otherwise: ℓ(θ) = −(1/N) Σ_{n=1}^N Σ_{m=1}^M ynm log p(ynm = 1 | xn) + (λ/(2K)) ||w||_F². [sent-77, score-0.527] [sent-78, score-0.162]
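Using the same array conventions as the forward-model sketch above (with a one-hot label matrix Y), the regularized negative conditional log-likelihood can be evaluated as below; the λ/(2K) scaling follows the reconstruction of the extracted formula and is an assumption.

```python
import numpy as np

def neg_log_likelihood(X, Y, W, b, alpha, lam):
    """l(theta) = -(1/N) sum_n sum_m y_nm log p(y_nm = 1 | x_n) + (lam / (2K)) ||W||_F^2."""
    N, K = X.shape[0], W.shape[1]
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    pz = np.exp(logits)
    pz /= pz.sum(axis=1, keepdims=True)
    p_y = pz @ alpha                                 # marginalize the latent class
    data_term = -np.sum(Y * np.log(p_y + 1e-12)) / N
    return data_term + lam / (2.0 * K) * np.sum(W ** 2)
```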
37 The EM algorithm can be viewed as a two-step block-coordinate descent procedure [11], where the first step (E-step) consists in finding the optimal auxiliary variables ξ, given the parameters of the model θ. [sent-90, score-0.162]
38 In our case, the result of this step is obtained in closed form as ξnk ∝ (yn^T αk) exp(wk^T xn + bk), with ξn^T 1K = 1. [sent-91, score-0.374]
39 The second step (M-step) consists of finding the best set of parameters θ, given the auxiliary variables ξ. [sent-92, score-0.111]
40 Optimizing the parameters αk leads to the closed-form updates αk ∝ Σ_{n=1}^N ξnk yn with αk^T 1M = 1, while optimizing jointly over w and b leads to a softmax regression problem, which we solve with Newton's method. [sent-93, score-0.352]
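A rough sketch of the two closed-form updates just described (the joint (w, b) update is a weighted softmax regression, solved in the paper by Newton's method and omitted here); shapes follow the earlier sketches and are assumptions.

```python
import numpy as np

def e_step(X, Y, W, b, alpha):
    """xi[n, k] proportional to (y_n^T alpha_k) exp(w_k^T x_n + b_k), rows summing to 1."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    xi = (Y @ alpha.T) * np.exp(logits)             # (N, K)
    return xi / xi.sum(axis=1, keepdims=True)

def m_step_alpha(xi, Y):
    """alpha_k proportional to sum_n xi[n, k] y_n, each row renormalized to sum to 1."""
    alpha = xi.T @ Y                                # (K, M)
    return alpha / alpha.sum(axis=1, keepdims=True)
```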
41 Since F (ξ, θ) is not jointly convex in ξ and θ, this procedure stops when it reaches a local minimum, and its performance strongly depends on its initialization. [sent-94, score-0.201]
42 We propose, in the following section, a robust initialization for EM given our latent model, based on an approximation of the auxiliary cost function obtained with the M-step. [sent-95, score-0.333]
43 In this section, we focus on deriving a quadratic approximation of this function, which will be minimized to obtain an initialization for EM. [sent-105, score-0.213]
44 We consider second-order Taylor expansions around the value of ξ corresponding to uniformly distributed latent variables zn, independent of the observations xn, i.e., ξnk = 1/K. [sent-106, score-0.503]
45 This choice is motivated by the lack of a priori information on the latent classes. [sent-109, score-0.129]
46 Assuming uniformly distributed variables zn and independence between zn and xn implies that wk^T xn + bk = 0. [sent-113, score-0.864]
47 Its solution may be obtained in closed form and leads to: Jwb(ξ) = cst + (K/(2N)) tr( ξξ^T (I − A(X, λ)) ), where A(X, λ) = ΠN (I − X(NλI + X^T ΠN X)^{-1} X^T) ΠN. [sent-119, score-0.204]
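A small numpy sketch of this closed-form surrogate, assuming ΠN denotes the usual centering matrix I − (1/N) 1N 1N^T (an assumption consistent with the diffrac-style square-loss cost [7]); illustration only, not the authors' code.

```python
import numpy as np

def quad_approx_matrix(X, lam):
    """A(X, lam) = Pi_N (I - X (N*lam*I + X^T Pi_N X)^{-1} X^T) Pi_N,
    with Pi_N assumed to be the centering matrix."""
    N, p = X.shape
    Pi = np.eye(N) - np.ones((N, N)) / N
    G = N * lam * np.eye(p) + X.T @ Pi @ X
    return Pi @ (np.eye(N) - X @ np.linalg.solve(G, X.T)) @ Pi

def J_wb_surrogate(xi, A, K):
    """Quadratic surrogate, up to an additive constant: (K / (2N)) tr(xi xi^T (I - A))."""
    N = A.shape[0]
    return (K / (2.0 * N)) * np.trace(xi @ xi.T @ (np.eye(N) - A))
```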
48 Japp (ξ) is thus a combination of a term trying to put every point in the same cluster and a term trying to spread them equally. [sent-126, score-0.132]
49 We optimize Japp (ξ) to get a robust initialization for EM. [sent-133, score-0.138]
50 Since the entries of each vector ξn sum to 1, we optimize Japp over a set of N simplices in K dimensions, S = {v ∈ RK | v ≥ 0, v^T 1K = 1}. [sent-134, score-0.135]
51 However, since this function is not convex, minimizing it directly leads to local minima. [sent-135, score-0.109]
52 We propose, in Section 4, a general reformulation of any non-convex quadratic program over a set of N simplices and propose an efficient algorithm to optimize it. [sent-136, score-0.252]
53 3 Discriminative clustering The goal of clustering is to find a low-dimensional representation of unlabeled observations, by assigning them to K different classes, Xu et al. [sent-138, score-0.302]
54 [5] proposes a discriminative clustering framework based on the SVM and [7] simplifies it by replacing the hinge loss function by the square loss, leading to ridge regression. [sent-139, score-0.427]
55 By taking M = N and the labels Y = I, we obtain a formulation similar to [7] where we are looking for a latent representation that can recover the identity matrix. [sent-140, score-0.168]
56 However, unlike [5, 7], our discriminative clustering framework is based on a probabilistic model which may allow natural extensions. [sent-141, score-0.451]
57 Finally, since our formulation is derived from EM, we obtain a natural rounding by applying the EM algorithm after the optimization whereas [7] uses a coarse k-means rounding. [sent-144, score-0.221]
58 4 Optimization of quadratic functions over simplices To initialize the EM algorithm, we must minimize the non-convex quadratic cost function defined by Eq. [sent-146, score-0.363]
59 More precisely, we are interested in the following problems: min f(V) = (1/2) tr(V V^T B) s.t. V lies in the product of N simplices. [sent-148, score-0.158]
60 Traditional convex relaxation methods [14] would rewrite the objective function as v^T Q v = tr(Q v v^T) = tr(Q T), where T = v v^T is a rank-one matrix which satisfies the set of constraints T ∈ D_NK = {T ∈ R^(NK×NK) | T ≥ 0, . . . , ∀ n, m ∈ {1, . . . , N}}. [sent-158, score-0.211]
61 With the unit-rank constraint, optimizing over v is exactly equivalent to optimizing over T . [sent-166, score-0.106]
62 The problem is relaxed into a convex problem by removing the rank constraint, leading to a semidefinite programming problem (SDP) [15]. [sent-167, score-0.101]
63 In particular, [17] shows that the non-convex optimization with respect to V leads to the global optimum of the convex problem in T. [sent-171, score-0.372]
64 For R ≥ 5, it turns out that the intersection of CP K and F is the convex hull of the matrices vv T such that v is an element of the product of simplices [16]. [sent-179, score-0.246]
65 This implies that the convex optimization problem of minimizing tr (QT ) over CP K ∩ F is equivalent to our original problem (for which no polynomial-time algorithm is known). [sent-180, score-0.259]
66 However, because of the positivity constraint, one cannot find a local minimum of a differentiable function in polynomial time. [sent-184, score-0.106]
67 In order to derive a local descent algorithm, we reformulate the constraints (7-8) in terms of V (details can be found in the supplementary material). [sent-188, score-0.115]
68 Thus, projecting Z on D is equivalent to finding the solution of: min_{a≥0} L(a) = Σ_{n=1}^N max_{λn∈R} min_{Un≥0} (1/2)||Un − Zn||² + λn(1K^T Un − a), where (λn)_{n≤N} are Lagrange multipliers. [sent-206, score-0.102]
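The inner problem above amounts to a Euclidean projection of each block Zn onto a (possibly rescaled) simplex {U ≥ 0, 1^T U = a}; a standard sorting-based routine for that building block is sketched below as an assumed implementation detail, not something spelled out in the extracted text.

```python
import numpy as np

def project_scaled_simplex(z, a=1.0):
    """Euclidean projection of a vector z onto {u >= 0, sum(u) = a}, a > 0 (sorting method)."""
    z_sorted = np.sort(z)[::-1]                      # descending order
    css = np.cumsum(z_sorted) - a
    idx = np.arange(1, z.size + 1)
    rho = idx[z_sorted - css / idx > 0][-1]          # last index keeping positivity
    theta = css[rho - 1] / rho
    return np.maximum(z - theta, 0.0)
```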
69 We show that the performances are equivalent, but our algorithm can scale up to larger databases. [sent-219, score-0.102]
70 We also consider the problem of supervised and unsupervised discriminative clustering. [sent-220, score-0.366]
71 For supervised and unsupervised multilabel classification, we first optimize the second-order approximation Japp , using the reformulation (9). [sent-223, score-0.183]
72 We use a projected gradient descent method with Armijo’s rule along the projection arc for backtracking [19]. [sent-224, score-0.116]
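A generic sketch of projected gradient descent with Armijo backtracking along the projection arc, in the spirit of [19]; the objective, gradient, and projection are placeholders to be supplied by the caller.

```python
import numpy as np

def projected_gradient(x0, f, grad, project, step0=1.0, sigma=1e-4,
                       beta=0.5, max_iter=200, tol=1e-6):
    """Minimize f over a convex set given its Euclidean projection operator."""
    x = project(x0)
    for _ in range(max_iter):
        g, t = grad(x), step0
        while True:
            x_new = project(x - t * g)
            # Armijo sufficient-decrease test along the projection arc
            if f(x_new) <= f(x) + sigma * np.dot(g.ravel(), (x_new - x).ravel()):
                break
            t *= beta
            if t < 1e-12:
                break
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```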
73 We also compare this rounding with another heuristic obtained by taking v* = argmin_{Vr} f(Vr) (“min round”). [sent-227, score-0.131]
74 We compare our optimization of the non-convex quadratic problem (9) in V , to the convex SDP in T = V V T on the set of constraints defined by T ∈ DN K , (7) and (8). [sent-230, score-0.269]
75 Thus we set N = 10 and K = 2 on discriminative clustering problems (which are described later in this section). [sent-234, score-0.376]
76 We compare the performances of these algorithms after rounding. [sent-235, score-0.102]
77 For the SDP, we take ξ* = T 1NK, and for our algorithm we report performances obtained for both roundings discussed above (“avg round” and “min round”). [sent-236, score-0.233]
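As we read the two heuristics from the text, “avg round” sums the R columns of V while “min round” keeps the single column with the lowest objective value; the sketch below assumes V is stored as an (N*K, R) array whose columns reshape row-wise into (N, K) blocks, and renormalizes rows onto the simplex. These storage and renormalization conventions are assumptions.

```python
import numpy as np

def avg_round(V, N, K):
    """'avg round': sum the columns of V, reshape to (N, K), renormalize each row."""
    xi = V.sum(axis=1).reshape(N, K)
    return xi / (xi.sum(axis=1, keepdims=True) + 1e-12)

def min_round(V, N, K, f):
    """'min round': keep the column V_r minimizing the objective f, then renormalize."""
    r_best = min(range(V.shape[1]), key=lambda r: f(V[:, r]))
    xi = V[:, r_best].reshape(N, K)
    return xi / (xi.sum(axis=1, keepdims=True) + 1e-12)
```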
78 On these small examples, our algorithm associated with “min round” reaches performance similar to the SDP, whereas, associated with “avg round”, its performance drops. [sent-237, score-0.102]
79 We compare the performances of the two different roundings, “min round” and “avg round” on discriminative clustering problems. [sent-239, score-0.478]
80 We also compare our algorithm for a given R, to two baselines where we solve independently problem (4) R times and then apply the same roundings (“ind - min round” and “ind - avg round”). [sent-241, score-0.337]
81 We look at the average performances as the number of noise dimensions increases in discriminative clustering problems. [sent-251, score-0.571]
82 Our method outperforms the baseline whatever rounding we use. [sent-252, score-0.171]
83 Figure 1 shows that on problems with a small number of latent classes (K < 5), we obtain better performance by taking the column associated with the lowest value of the cost function (“min round”) than by summing all the columns (“avg round”). [sent-253, score-0.370]
84 A potential explanation is that summing the columns of V gives a solution close to (1/K) 1N 1K^T in expectation, thus in the region where our quadratic approximation is valid. [sent-255, score-0.166]
85 Moreover, the best column of V is usually a local minimum of the quadratic approximation, which we have found to be close to similar local minima of our original problem, therefore, preventing the EM algorithm from converging to another solution. [sent-256, score-0.291]
86 We also compare to EM without our initialization (“rand-init”) but with 50 random initializations, a local descent method which is close to back-propagation in a two-layer neural network and which in this case strongly suffers from local minima. [sent-274, score-0.321]
87 An interesting result on computational time is that EM without our initialization needs more steps to obtain a local minimum. [sent-275, score-0.16]
88 The comparison with “rand-init” shows the importance of our convex initialization. [sent-280, score-0.101]
89 Indeed, the number of latent classes needed to correctly separate two classes of text is small. [sent-282, score-0.281]
90 Figure 3: Clustering error when increasing the number of noise dimensions. [sent-294, score-0.246]
91 Figure 3 shows the optimization performance of the EM algorithm with 10 random starting points with (“DLC”) and without (“rand-init”) our initialization method. [sent-302, score-0.147]
92 We compare their performances to K-means, Gaussian Mixture Model (“GMM”), Diffrac [7] and max-margin clustering (“MMC”) [24]. [sent-303, score-0.253]
93 Following [7], we take linearly separable bumps in a two-dimensional space and add dimensions containing random independent Gaussian noise (e.g., …). [sent-304, score-0.129]
94 We show in Figure 4 additional examples which are not linearly separable. [sent-316, score-0.110]
95 6 Conclusion We have presented a probabilistic model for supervised dimension reduction, together with associated optimization tools to improve upon EM. [sent-317, score-0.266]
96 Application to text classification has shown that our model outperforms related ones and we have extended it to unsupervised situations, thus drawing new links between probabilistic models and discriminative clustering. [sent-318, score-0.436]
97 In terms of probabilistic models, such techniques should generalize to other latent variable models. [sent-320, score-0.204]
98 Finally, some additional structure could be added to the problem to take into account more specific problems, such as multiple instance learning [25], multi-label learning or discriminative clustering for computer vision [26, 27]. [sent-321, score-0.376]
99 Diffrac: a discriminative and flexible framework for clustering. [sent-361, score-0.225]
100 CVX: Matlab software for disciplined convex programming, version 1. [sent-433, score-0.101]
wordName wordTfidf (topN-words)
[('round', 0.257), ('japp', 0.226), ('discriminative', 0.225), ('avg', 0.221), ('em', 0.215), ('zn', 0.21), ('ynm', 0.162), ('clustering', 0.151), ('vr', 0.147), ('rounding', 0.131), ('ewk', 0.129), ('vrn', 0.129), ('latent', 0.129), ('xn', 0.125), ('yn', 0.12), ('quadratic', 0.117), ('sdp', 0.109), ('tr', 0.107), ('bk', 0.107), ('performances', 0.102), ('convex', 0.101), ('ind', 0.098), ('dlc', 0.097), ('initialization', 0.096), ('simplices', 0.093), ('rn', 0.089), ('supervised', 0.089), ('softmax', 0.089), ('diffrac', 0.085), ('classification', 0.084), ('pointwise', 0.082), ('initializations', 0.077), ('km', 0.077), ('probabilistic', 0.075), ('non', 0.074), ('cp', 0.073), ('auxiliary', 0.072), ('cvx', 0.069), ('xw', 0.069), ('nk', 0.069), ('multinomial', 0.069), ('projection', 0.065), ('italie', 0.065), ('jwb', 0.065), ('postings', 0.065), ('ppxa', 0.065), ('qv', 0.065), ('roundings', 0.065), ('local', 0.064), ('experts', 0.064), ('relaxation', 0.058), ('mmc', 0.057), ('joulin', 0.057), ('paris', 0.056), ('df', 0.054), ('classi', 0.054), ('classes', 0.054), ('optimizing', 0.053), ('unsupervised', 0.052), ('cst', 0.052), ('linli', 0.052), ('vv', 0.052), ('dimension', 0.051), ('descent', 0.051), ('min', 0.051), ('blei', 0.051), ('ridge', 0.051), ('optimization', 0.051), ('inef', 0.05), ('summing', 0.049), ('dn', 0.049), ('noise', 0.048), ('wk', 0.048), ('bach', 0.047), ('trying', 0.047), ('boltzmann', 0.046), ('minima', 0.046), ('dimensions', 0.045), ('leads', 0.045), ('text', 0.044), ('samuel', 0.044), ('rieure', 0.044), ('mixture', 0.044), ('positivity', 0.042), ('normale', 0.042), ('optimize', 0.042), ('outperforms', 0.04), ('layer', 0.04), ('variables', 0.039), ('predictive', 0.039), ('formulation', 0.039), ('avenue', 0.038), ('cluster', 0.038), ('ecole', 0.037), ('stops', 0.036), ('mixtures', 0.036), ('linearly', 0.036), ('expansion', 0.036), ('cost', 0.036), ('vn', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999869 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
2 0.17007771 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
3 0.14552335 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
Author: Akshat Kumar, Shlomo Zilberstein
Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances. 1
4 0.13673285 290 nips-2010-t-logistic regression
Author: Nan Ding, S.v.n. Vishwanathan
Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1
5 0.12993856 228 nips-2010-Reverse Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several stateof-the-art methods, showing excellent performance. 1
6 0.1296892 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
7 0.11895198 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
8 0.11818224 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
9 0.10861441 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
10 0.10771903 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
11 0.10569905 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
12 0.10269021 287 nips-2010-Worst-Case Linear Discriminant Analysis
13 0.10231339 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
14 0.09973485 99 nips-2010-Gated Softmax Classification
15 0.096909687 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
16 0.096160203 235 nips-2010-Self-Paced Learning for Latent Variable Models
17 0.091745995 243 nips-2010-Smoothness, Low Noise and Fast Rates
18 0.089850694 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
19 0.086328082 192 nips-2010-Online Classification with Specificity Constraints
20 0.085297003 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
topicId topicWeight
[(0, 0.284), (1, 0.087), (2, 0.087), (3, -0.05), (4, -0.019), (5, 0.003), (6, 0.057), (7, 0.065), (8, 0.022), (9, 0.009), (10, 0.062), (11, 0.021), (12, 0.241), (13, -0.077), (14, 0.066), (15, -0.011), (16, 0.015), (17, 0.127), (18, 0.106), (19, -0.051), (20, 0.104), (21, -0.072), (22, -0.088), (23, 0.047), (24, -0.028), (25, -0.016), (26, -0.002), (27, 0.155), (28, -0.082), (29, -0.035), (30, -0.059), (31, -0.045), (32, 0.074), (33, 0.064), (34, 0.01), (35, 0.055), (36, 0.043), (37, -0.029), (38, -0.087), (39, 0.119), (40, 0.051), (41, 0.082), (42, 0.043), (43, -0.042), (44, -0.015), (45, 0.045), (46, -0.007), (47, 0.119), (48, 0.005), (49, -0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.95157206 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
2 0.67700505 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
3 0.67070591 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
4 0.66721481 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
5 0.62904543 228 nips-2010-Reverse Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several stateof-the-art methods, showing excellent performance. 1
6 0.60782522 290 nips-2010-t-logistic regression
7 0.55748695 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
8 0.55548978 287 nips-2010-Worst-Case Linear Discriminant Analysis
10 0.5376479 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
11 0.50312442 235 nips-2010-Self-Paced Learning for Latent Variable Models
12 0.49586946 99 nips-2010-Gated Softmax Classification
13 0.48958847 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
14 0.47511086 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
15 0.46391004 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
16 0.4559902 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
17 0.45529711 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
18 0.45476383 262 nips-2010-Switched Latent Force Models for Movement Segmentation
19 0.44248986 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
20 0.43797225 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
topicId topicWeight
[(13, 0.094), (17, 0.024), (26, 0.166), (27, 0.068), (30, 0.063), (35, 0.02), (45, 0.191), (50, 0.055), (52, 0.04), (60, 0.084), (77, 0.052), (78, 0.014), (90, 0.052)]
simIndex simValue paperId paperTitle
1 0.86840463 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
Author: Daniel Golovin, Andreas Krause, Debajyoti Ray
Abstract: We tackle the fundamental problem of Bayesian active learning with noise, where we need to adaptively select from a number of expensive tests in order to identify an unknown hypothesis sampled from a known prior distribution. In the case of noise–free observations, a greedy algorithm called generalized binary search (GBS) is known to perform near–optimally. We show that if the observations are noisy, perhaps surprisingly, GBS can perform very poorly. We develop EC2 , a novel, greedy active learning algorithm and prove that it is competitive with the optimal policy, thus obtaining the first competitiveness guarantees for Bayesian active learning with noisy observations. Our bounds rely on a recently discovered diminishing returns property called adaptive submodularity, generalizing the classical notion of submodular set functions to adaptive policies. Our results hold even if the tests have non–uniform cost and their noise is correlated. We also propose E FF ECXTIVE , a particularly fast approximation of EC 2 , and evaluate it on a Bayesian experimental design problem involving human subjects, intended to tease apart competing economic theories of how people make decisions under uncertainty. 1
2 0.85237694 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu
Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1
same-paper 3 0.8499679 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.81422132 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
5 0.80675185 265 nips-2010-The LASSO risk: asymptotic results and real world examples
Author: Mohsen Bayati, José Pereira, Andrea Montanari
Abstract: We consider the problem of learning a coefficient vector x0 ∈ RN from noisy linear observation y = Ax0 + w ∈ Rn . In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator x. In this case, a popular approach consists in solving an ℓ1 -penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN). For sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas. Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.
6 0.80164289 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
7 0.80020231 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
8 0.79958355 258 nips-2010-Structured sparsity-inducing norms through submodular functions
9 0.79753131 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
10 0.79644293 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
11 0.79498029 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
12 0.79496199 261 nips-2010-Supervised Clustering
13 0.79450381 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
14 0.79421878 222 nips-2010-Random Walk Approach to Regret Minimization
15 0.793329 63 nips-2010-Distributed Dual Averaging In Networks
16 0.79296362 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
17 0.79278231 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
18 0.79166657 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
19 0.79109168 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
20 0.79065561 280 nips-2010-Unsupervised Kernel Dimension Reduction