nips nips2008 nips2008-175 knowledge-graph by maker-knowledge-mining

175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning


Source: pdf

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. [sent-3, score-0.201]

2 The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. [sent-5, score-0.142]

3 PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. [sent-6, score-0.35]

4 1 Introduction Column generation (CG) [3] is a technique widely used in linear programming (LP) for solving large-sized problems. [sent-8, score-0.152]

5 The proposed work here—which we dub matrix generation (MG)—extends the column generation technique to non-polyhedral semidefinite constraints. [sent-10, score-0.204]

6 In particular, as an application we show how to use it for solving a semidefinite metric learning problem. [sent-11, score-0.217]

7 Given the limitations of current semidefinite programming (SDP) solvers to deal with large-scale problems, the work presented here is of importance for many real applications. [sent-14, score-0.135]

8 The choice of a metric has a direct effect on the performance of many algorithms such as the simplest k-NN classifier and some clustering algorithms. [sent-15, score-0.184]

9 Much effort has been spent on learning a good metric for pattern recognition and data mining. [sent-16, score-0.154]

10 Clearly a good metric is task-dependent: different applications should use different measures for (dis)similarity between objects. [sent-17, score-0.154]

11 We show how a Mahalanobis metric is learned from examples of proximity comparison among triples of training data. [sent-18, score-0.274]

12 For example, assuming that we are given triples of images ai, aj and ak (ai, aj have the same labels and ai, ak have different labels, ai ∈ RD), we want to learn a metric between pairs of images such that the distance from aj to ai (distij) is smaller than the distance from ak to ai (distik). [sent-19, score-2.27]
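
To make the triplet constraint concrete, here is a minimal numpy sketch (our own illustration with made-up data; the variable names are not from the paper) that evaluates the two distances for one triplet under a given trace-one PSD matrix X and checks the desired ordering distij < distik.

```python
import numpy as np

def mahalanobis_sq(x, y, X):
    """Squared Mahalanobis-type distance (x - y)^T X (x - y) for a PSD matrix X."""
    d = x - y
    return float(d @ X @ d)

rng = np.random.default_rng(0)
D = 5
a_i, a_j, a_k = rng.normal(size=(3, D))   # a_i, a_j same class; a_k different class
X = np.eye(D) / D                          # a trace-one PSD matrix (here: scaled identity)

dist_ij = mahalanobis_sq(a_i, a_j, X)
dist_ik = mahalanobis_sq(a_i, a_k, X)
print(dist_ij < dist_ik)                   # the ordering the learned metric should satisfy
```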

13 Triplets like this are the input of our metric learning algorithm. [sent-20, score-0.154]

14 By casting the problem as optimization of the inner product of the linear transformation matrix and its transpose, the formulation is based on solving a semidefinite program. [sent-21, score-0.167]

15 The algorithm finds an optimal linear transformation that maximizes the margin between distances distij and distik . [sent-22, score-0.315]

16 State-of-the-art solvers like CPLEX [4] can solve large problems up to millions of variables and constraints. [sent-27, score-0.181]

17 This motivates us to develop an LP approach to solve our SDP metric learning problem. [sent-28, score-0.203]

18 The general idea of CG is that, instead of solving the original large-scale problem (master problem), one works on a restricted master problem with a reasonably small subset of variables at each step. [sent-32, score-0.235]

19 The dual of the restricted master problem is solved by the simplex method, and the optimal dual solution is used to find the new column to be included into the restricted master problem. [sent-33, score-0.64]

20 For the first time, LPBoost shows that in an LP framework, unknown weak hypotheses can be learned from the dual although the space of all weak hypotheses is infinitely large. [sent-35, score-0.216]

21 Metric learning using convex optimization has attracted a lot of attention recently [6–8]. [sent-37, score-0.172]

22 [6] learns a Mahalanobis metric for clustering using convex optimization to minimize the distance between examples belonging to the same class, while at the same time restricting examples in different classes not to be too close. [sent-42, score-0.404]

23 The work in [7] also learns a Mahalanobis metric using SDP by optimizing a modified k-NN classifier. [sent-43, score-0.154]

24 They replace the positive semidefinite (PSD) conic constraint with a sequence of linear constraints, using the fact that a diagonally dominant matrix must be PSD (but not vice versa). [sent-47, score-0.285]

25 We denote the space of D × D symmetric matrices by SD, and the space of positive semidefinite matrices by SD+. [sent-52, score-0.212]

26 Given a set Sp = {x1, · · · , xM} in a real vector or matrix space, the convex hull of Sp spanned by the M elements in Sp is defined as: convM(Sp) = { Σ_{i=1}^{M} θi xi | θi ≥ 0, Σ_{i=1}^{M} θi = 1, xi ∈ Sp }. [sent-63, score-0.274]

27 Let us define Γ1 to be the space of all positive semidefinite matrices X ∈ SD+ with trace equaling one: Γ1 = { X | X ⪰ 0, Tr(X) = 1 }; and Ω1 to be the space of all positive semidefinite matrices with both trace and rank equaling one: Ω1 = { Z | Z ⪰ 0, Tr(Z) = 1, rank(Z) = 1 }. [sent-67, score-0.535]
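
The membership conditions for Γ1 and Ω1 are easy to test numerically; the following sketch (our own, assuming numpy) checks a rank-one matrix against Ω1 and a convex combination of such matrices against Γ1.

```python
import numpy as np

def in_gamma1(X, tol=1e-9):
    """X in Gamma1: symmetric PSD with unit trace."""
    sym = np.allclose(X, X.T, atol=tol)
    psd = np.all(np.linalg.eigvalsh(X) >= -tol)
    return sym and psd and abs(np.trace(X) - 1.0) < tol

def in_omega1(Z, tol=1e-9):
    """Z in Omega1: additionally rank one."""
    return in_gamma1(Z, tol) and np.linalg.matrix_rank(Z, tol=1e-7) == 1

u = np.array([3.0, 4.0])
u /= np.linalg.norm(u)
Z = np.outer(u, u)                  # trace-one, rank-one PSD matrix
X = 0.5 * Z + 0.5 * np.eye(2) / 2   # convex combination of elements of Omega1

print(in_omega1(Z), in_gamma1(X))   # True True
```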

28 Let Ω2 be a convex polytope defined as Ω2 = { λ ∈ RD | λk ≥ 0, ∀k = 1, · · · , D, Σ_{k=1}^{D} λk = 1 }; then the points with only one element equaling one and all the others being zero are the extreme points (vertices) of Ω2. [sent-72, score-0.441]

29 If λ′ is not an extreme point of Ω2, then it must be expressible as a convex combination of a few other points in Ω2: λ′ = Σ_{i=1}^{M} θi λi, θi > 0, Σ_{i=1}^{M} θi = 1, and λi ≠ λ′. [sent-75, score-0.333]

30 Therefore such a convex combination does not exist and λ′ must be an extreme point. [sent-80, score-0.3]

31 It is trivial to see that any λ that has more than one active element is an convex combination of the above-defined extreme points. [sent-81, score-0.3]

32 In other words, all Z ∈ Ω1 together form the set of extreme points of Γ1. [sent-87, score-0.171]

33 Proof: It is easy to check that any convex combination Σ_i θi Zi, such that Zi ∈ Ω1, resides in Γ1, with the following two facts: (1) a convex combination of PSD matrices is still a PSD matrix; (2) Tr(Σ_i θi Zi) = Σ_i θi Tr(Zi) = 1. [sent-88, score-0.394]

34 It is immediate to see that a matrix Z such that Z ⪰ 0, Tr(Z) = 1 and rank(Z) > 1 cannot be an extreme point of Γ1. [sent-92, score-0.201]

35 The only candidates for extreme points are the rank-one matrices (λ1 = 1 and λ2 = · · · = λD = 0). [sent-93, score-0.241]

36 Moreover, it is not possible that some rank-one matrices are extreme points and others are not because the other two constraints Z ⪰ 0 and Tr(Z) = 1 do not distinguish between different rank-one matrices. [sent-94, score-0.299]

37 Hence, all Z ∈ Ω1 together form the set of extreme points of Γ1. [sent-95, score-0.171]

38 Furthermore, Γ1 is a convex and compact set, which must have extreme points. [sent-96, score-0.269]

39 The Krein–Milman theorem [9] tells us that a convex and compact set is equal to the convex hull of its extreme points. [sent-97, score-0.48]

40 Strictly speaking, the union of convex hulls may not be a convex hull in general. [sent-105, score-0.342]

41 A density matrix of rank one is called a pure state, and a density matrix of rank higher than one is called a mixed state. [sent-108, score-0.316]

42 Typically a boosting algorithm [12] creates a single strong learner by incrementally adding base (weak) learners to the final strong learner. [sent-111, score-0.326]

43 In general, a boosting algorithm builds on a user-specified base learning procedure and runs it repeatedly on modified data that are outputs from the previous iterations. [sent-113, score-0.229]

44 The inputs to a boosting algorithm are a set of training examples x and their corresponding class labels y. [sent-114, score-0.15]

45 This is exactly the problem that boosting methods have been designed to solve. [sent-120, score-0.15]

46 This observation inspires us to solve a special type of SDPs using boosting techniques. [sent-121, score-0.199]

47 A sparse greedy approximation algorithm proposed by Zhang [2] is an efficient way of solving a class of convex problems, which provides fast convergence rates. [sent-122, score-0.223]

48 It is shown in [2] that boosting algorithms can be interpreted within the general framework of [2]. [sent-123, score-0.15]

49 The algorithm finds ui ∈ V, i = 1, · · · , and 0 ≤ λ ≤ 1 such that the cost function F((1 − λ)ui−1 + λui) is approximately minimized; the solution is then updated as ui = (1 − λ)ui−1 + λui and the iteration goes on. [sent-126, score-0.292]
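
As a toy illustration of this greedy scheme (our own example, not the paper's algorithm), the sketch below minimizes a simple convex cost over the probability simplex by repeatedly blending the current iterate with a vertex chosen from the linearized cost, using an exact line search for λ.

```python
import numpy as np

def greedy_convex_min(target, n_iter=100):
    """Minimise F(u) = ||u - target||^2 over the probability simplex with the
    greedy updates u_i = (1 - lam) u_{i-1} + lam v_i, where v_i is a simplex vertex."""
    D = target.size
    u = np.ones(D) / D                        # start at the simplex barycentre
    for _ in range(n_iter):
        grad = 2.0 * (u - target)
        v = np.zeros(D)
        v[np.argmin(grad)] = 1.0              # vertex with the most favourable linearisation
        d = v - u
        denom = 2.0 * float(d @ d) + 1e-12
        lam = float(np.clip(-(grad @ d) / denom, 0.0, 1.0))  # exact line search on [0, 1]
        u = (1.0 - lam) * u + lam * v
    return u

target = np.array([0.7, 0.2, 0.1])
print(greedy_convex_min(target))              # approaches the target, which lies in the simplex
```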

50 4 Large-margin Semidefinite Metric Learning We consider the Mahalanobis metric learning problem as an example although the proposed technique can be applied to many other problems in machine learning such as nonparametric kernel matrix learning [13]. [sent-127, score-0.217]

51 We are given a set of training examples ai ∈ RD , i = 1, 2, · · · . [sent-128, score-0.211]

52 The task is to learn a distance metric such that with the learned metric, classification or clustering will achieve better performance on testing data. [sent-129, score-0.263]

53 Mathematically we are given a set S which contains the training triplets: S = {(ai, aj, ak) | distij < distik}, where distij measures the distance between ai and aj with a certain metric. [sent-131, score-1.222]

54 Equivalently, we are learning a linear transformation P ∈ RD×d such that dist is the Euclidean distance in the projected space: distij = ‖P⊤ai − P⊤aj‖₂² = (ai − aj)⊤ PP⊤ (ai − aj). [sent-133, score-1.035]
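
The two expressions for distij above are easy to check numerically; the following sketch (our own, with random data) verifies that the projected Euclidean distance equals the quadratic form in X = PP⊤.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 6, 3
P = rng.normal(size=(D, d))                  # linear transformation R^D -> R^d
a_i, a_j = rng.normal(size=(2, D))

lhs = np.linalg.norm(P.T @ a_i - P.T @ a_j) ** 2
X = P @ P.T                                  # the PSD matrix actually being learned
rhs = (a_i - a_j) @ X @ (a_i - a_j)
print(np.isclose(lhs, rhs))                  # True: the two forms of dist_ij agree
```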

55 We wish to maximize the margin, which is defined as the difference between distik and distij. [sent-137, score-0.363]

56 That is, ρ = distik − distij = (ai − ak )⊤ X(ai − ak ) − (ai − aj )⊤ X(ai − aj ). [sent-138, score-0.931]

57 The resulting optimization problem is: max_{X, ρ, ξ} ρ − C Σ_{r=1}^{|S|} ξr, subject to X ⪰ 0, Tr(X) = 1, ξ ≥ 0, and (ai − ak)⊤ X(ai − ak) − (ai − aj)⊤ X(ai − aj) ≥ ρ − ξr, ∀(ai, aj, ak) ∈ S. (3) [sent-142, score-0.972]

58 Note that the constraint Tr(X) = 1 removes the scale ambiguity because the distance inequalities are scale invariant. [sent-147, score-0.134]

59 To simplify our exposition, we write Ar = (ai − ak )(ai − ak )⊤ − (ai − aj )(ai − aj )⊤ . [sent-148, score-0.648]
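
With this notation, the margin constraint in (3) reads ⟨Ar, X⟩ ≥ ρ − ξr. A small hedged helper (our own naming, assuming numpy) that builds Ar from one triplet:

```python
import numpy as np

def build_A(a_i, a_j, a_k):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T, so that
    <A_r, X> = dist_ik - dist_ij for the metric induced by X."""
    dik = a_i - a_k
    dij = a_i - a_j
    return np.outer(dik, dik) - np.outer(dij, dij)

rng = np.random.default_rng(2)
a_i, a_j, a_k = rng.normal(size=(3, 4))
A_r = build_A(a_i, a_j, a_k)
X = np.eye(4) / 4
print(np.trace(A_r @ X))         # equals dist_ik - dist_ij under X
```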

60 Therefore it can be solved using off-the-shelf SDP solvers like CSDP [15]. [sent-151, score-0.145]

61 Current solvers can only solve problems up to a few thousand variables, which makes many applications intractable. [sent-153, score-0.14]

62 Using the decomposition result above, we can replace the PSD conic constraint in (3) with a linear convex combination of rank-one unitary PSD matrices: X = Σ_{i=1}^{M} θi Zi. [sent-158, score-0.284]

63 The above problem is still very hard to solve since it has non-convex rank constraints and an indefinite number of variables (M is indefinite because there are an indefinite number of rank-one matrices). [sent-162, score-0.243]

64 However if we somehow know matrices Zi (i = 1, · · · ) a priori, we can then drop all the constraints imposed on Zi (i = 1, · · · ) and the problem becomes a linear program; or more precisely a semi-infinite linear program (SILP) because it has an infinitely large set of variables θ. [sent-163, score-0.169]

65 Column generation is a state-of-the-art method for optimally solving difficult large-scale optimization problems. [sent-164, score-0.149]

66 So we must be able to solve the subproblem: given a set of dual values, one either identifies a variable that has a favorable reduced cost, or indicates that such a variable does not exist. [sent-171, score-0.168]

67 The dual of the restricted problem is: minimize π subject to Σ_{r=1}^{|S|} ⟨Ar, Zi⟩ wr ≤ π, i = 1, · · · , M; Σ_{r=1}^{|S|} wr = 1; 0 ≤ wr ≤ C, r = 1, · · · , |S|. (D1) [sent-178, score-0.744]
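
Assuming the restricted dual indeed takes the LPBoost-style form just stated (the objective "minimize π" is our reading; the summary only shows the constraints), it is a small LP that an off-the-shelf solver handles directly. A hedged sketch using scipy.optimize.linprog, with our own function and variable names:

```python
import numpy as np
from scipy.optimize import linprog

def solve_restricted_dual(A_list, Z_list, C):
    """Solve min pi s.t. sum_r <A_r, Z_i> w_r <= pi (all i), sum_r w_r = 1,
    0 <= w_r <= C.  Variables are x = [w_1, ..., w_R, pi]."""
    R, M = len(A_list), len(Z_list)
    G = np.array([[np.trace(A @ Z) for A in A_list] for Z in Z_list])  # G[i, r] = <A_r, Z_i>
    c = np.zeros(R + 1); c[-1] = 1.0                                   # objective: pi
    A_ub = np.hstack([G, -np.ones((M, 1))])                            # G w - pi <= 0
    A_eq = np.hstack([np.ones((1, R)), np.zeros((1, 1))])              # sum_r w_r = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(M), A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0.0, C)] * R + [(None, None)], method="highs")
    return res.x[:R], res.x[-1]

# Example usage with random symmetric A_r and rank-one, trace-one Z_i:
rng = np.random.default_rng(0)
A_list = [(lambda B: B + B.T)(rng.normal(size=(4, 4))) for _ in range(6)]
Z_list = [np.outer(u, u) / (u @ u) for u in rng.normal(size=(2, 4))]
w, pi = solve_restricted_dual(A_list, Z_list, C=10.0)
```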

68 For convex programs with strong duality, the duality gap is zero, which means that the optimal values of the primal and dual problems coincide. [sent-179, score-0.473]

69 The LP solved using Z is usually termed the restricted master problem (RMP). [sent-184, score-0.185]

70 Because the primal variables correspond to the dual constraints, solving RMP is equivalent to solving a relaxed version of the dual problem. [sent-185, score-0.478]

71 If we can prove that among all the constraints that we have not added to the dual problem, no single constraint is violated, then we can conclude that solving the restricted problem is equivalent to solving the original problem. [sent-187, score-0.409]

72 The violated constraints correspond to variables in the primal that are not in the RMP. [sent-189, score-0.218]

73 Hence, as in LPBoost [1] we have a base learning algorithm as an oracle that either finds a new Z′ such that Σ_{r=1}^{|S|} ⟨Ar, Z′⟩ wr > π̃, where π̃ is the solution of the current restricted problem, or a guarantee that such a Z′ does not exist. [sent-192, score-0.413]

74 Again, here the weights wr (r = 1, · · · , |S|) in the base learning problem (B1) are obtained by solving the current restricted dual problem (D1). [sent-197, score-0.481]

75 We now have a criterion that guarantees the optimal convex combination over all Z’s satisfying the constraints in Γ2 has been found. [sent-199, score-0.22]

76 We have Σ_{r=1}^{|S|} ⟨Ar, Z⟩ wr = ⟨Σ_{r=1}^{|S|} wr Ar, Z⟩ = u⊤ (Σ_{r=1}^{|S|} wr Ar) u. [sent-206, score-0.744]

77 By denoting H̃ = Σ_{r=1}^{|S|} wr Ar, (6) the optimization in (B1) equals: max_u u⊤ H̃ u, subject to ‖u‖₂ = 1. [sent-207, score-0.289]
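
In other words, the base learner only has to compute the largest eigenvalue/eigenvector pair of H̃ and return Z′ = uu⊤. A minimal numpy sketch (our own code; a Lanczos-type solver would be used at scale, as the paper notes):

```python
import numpy as np

def base_learner(A_list, w):
    """Form H = sum_r w_r A_r and return (lambda_max, Z') where Z' = u u^T is built
    from the leading eigenvector u of H (unit norm, so Tr(Z') = 1 and rank(Z') = 1)."""
    H = sum(w_r * A_r for w_r, A_r in zip(w, A_list))
    eigvals, eigvecs = np.linalg.eigh(H)      # eigenvalues in ascending order
    u = eigvecs[:, -1]                        # leading eigenvector
    return eigvals[-1], np.outer(u, u)
```

Comparing the returned largest eigenvalue with the current π̃ then gives the termination test: if it does not exceed π̃ (up to a small tolerance), no violated dual constraint remains.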

78 There are approximate eigenvalue solvers which guarantee that, for a symmetric matrix U and any ε > 0, a vector v is found such that v⊤ U v ≥ λmax − ε. [sent-211, score-0.153]

79 Algorithm 1: PSDBoost for semidefinite metric learning. [sent-218, score-0.154]

80 Input: Training set triplets (ai , aj , ak ) ∈ S; Calculate Ar , r = 1, · · · from S using Equation (4). [sent-219, score-0.412]

81 wr = 1/|S|, r = 1, · · · , |S| (uniform dual weights). [sent-224, score-0.367]

82 Add Z′ to the restricted master problem, which corresponds to a new constraint in Problem (D1 ); 4. [sent-230, score-0.186]

83 Solve the dual (D1 ) to obtain updated π and wr (r = 1, · · · , |S|); 5. [sent-231, score-0.367]

84 Calculate the primal variable θ from the optimality conditions and the last solved dual LP; 2. [sent-234, score-0.246]

85 Putting all the above analysis together, we summarize our PSDBoost algorithm for metric learning in Algorithm 1. [sent-236, score-0.154]
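
For orientation only, here is a compact, hedged sketch of the overall loop (our own simplified reconstruction, not the authors' implementation): it alternates the eigenvector base learner with the restricted dual LP and stops when no dual constraint is violated beyond a tolerance. Recovering the primal weights θ (and hence X = Σ θi Zi) from the final LP's optimality conditions, as in Algorithm 1, is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linprog

def psdboost_loop(A_list, C=10.0, max_iter=30, eps=1e-6):
    """Column-generation loop: the base learner proposes a rank-one Z' from the
    leading eigenvector of sum_r w_r A_r; the restricted dual LP over the current
    Z_i's updates (w, pi); stop when max_Z <sum_r w_r A_r, Z> <= pi + eps."""
    R = len(A_list)
    w = np.ones(R) / R                                    # uniform dual weights
    pi = None
    Z_list = []
    for _ in range(max_iter):
        H = sum(w_r * A_r for w_r, A_r in zip(w, A_list))
        lam, vecs = np.linalg.eigh(H)
        if pi is not None and lam[-1] <= pi + eps:        # no violated dual constraint remains
            break
        Z_list.append(np.outer(vecs[:, -1], vecs[:, -1]))  # add the new base matrix Z'
        # Restricted dual LP over x = [w_1..w_R, pi]: min pi s.t. G w <= pi, sum w = 1, 0 <= w <= C.
        M = len(Z_list)
        G = np.array([[np.trace(A @ Z) for A in A_list] for Z in Z_list])
        c = np.zeros(R + 1); c[-1] = 1.0
        A_ub = np.hstack([G, -np.ones((M, 1))])
        A_eq = np.hstack([np.ones((1, R)), np.zeros((1, 1))])
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(M), A_eq=A_eq, b_eq=np.array([1.0]),
                      bounds=[(0.0, C)] * R + [(None, None)], method="highs")
        w, pi = res.x[:R], res.x[-1]
    return Z_list, w, pi

rng = np.random.default_rng(0)
A_list = [(lambda B: B + B.T)(rng.normal(size=(5, 5))) for _ in range(8)]
Z_list, w, pi = psdboost_loop(A_list)
```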

86 At each iteration the solution is provably better than the preceding one, and its rank is larger by at most one. [sent-240, score-0.171]

87 The bounded rank follows from the fact that rank(A + B) ≤ rank(A) + rank(B) for all matrices A and B. [sent-243, score-0.165]
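
This bound is what keeps the rank growth linear in the number of iterations: after k updates the solution is a combination of at most k rank-one matrices. A trivial numerical check (our own):

```python
import numpy as np

rng = np.random.default_rng(3)
D, k = 10, 3
Z_list = [np.outer(u, u) for u in rng.normal(size=(k, D))]   # k rank-one base matrices
X = sum(Z / k for Z in Z_list)                                # convex combination with weights 1/k
print(np.linalg.matrix_rank(X) <= k)                          # True: rank grows by at most one per term
```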

88 An advantage of the proposed PSDBoost algorithm over standard boosting schemes is the totally-corrective weight update in each iteration, which leads to faster convergence. [sent-244, score-0.15]

89 The coordinate descent optimization employed by standard boosting algorithms is known to have a slow convergence rate in general. [sent-245, score-0.191]

90 When the size of the bundle becomes too large, bundle methods select columns to be discarded and the selected information is aggregated into a single column. [sent-251, score-0.181]

91 It can be shown that as long as the aggregated column is introduced in the bundle, the bundle algorithm remains convergent, although different selection of discarded columns may lead to different convergence speeds. [sent-252, score-0.156]

92 The triplets are obtained as follows: for each point ai, we find its nearest neighbor aj in the same class and its nearest neighbor ak in a different class. [sent-256, score-0.699]
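
A hedged numpy sketch of this triplet-construction step (our own implementation; the paper gives only the description above):

```python
import numpy as np

def build_triplets(X, y):
    """For each point a_i, pair it with its nearest same-class neighbour a_j and its
    nearest different-class neighbour a_k, yielding index triplets (i, j, k)."""
    triplets = []
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # all pairwise squared distances
    np.fill_diagonal(d2, np.inf)                               # exclude self-matches
    for i in range(len(X)):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        if len(same) == 0 or len(diff) == 0:
            continue
        j = same[np.argmin(d2[i, same])]
        k = diff[np.argmin(d2[i, diff])]
        triplets.append((i, j, k))
    return triplets

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
print(len(build_triplets(X, y)))
```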

93 To show the convergence, we have plotted the optimal values of the dual problem (D1 ) at each iteration in Figure 1. [sent-258, score-0.16]

94 Moreover, for many problems such as metric and kernel learning, we are usually interested not in the numerical accuracy of the solution but in the test error. [sent-264, score-0.189]

95 The results show that PSDBoost converges quickly and the learned metric is very similar to the results obtained by a standard SDP solver. [sent-268, score-0.185]

96 7 Conclusion We have presented a new boosting algorithm, PSDBoost, for learning a positive semidefinite matrix. [sent-272, score-0.192]

97 In particular, as an example, we use PSDBoost to learn a distance metric for classification. [sent-273, score-0.202]

98 Distance metric learning, with application to clustering with side-information. [sent-317, score-0.184]

99 Distance metric learning for large margin nearest neighbor classification. [sent-330, score-0.224]

100 The dashed line shows the ground truth obtained by directly solving the original primal SDP (3) using interior-point methods. [sent-425, score-0.136]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psdboost', 0.33), ('sdp', 0.251), ('wr', 0.248), ('semide', 0.247), ('ai', 0.211), ('aj', 0.191), ('ar', 0.177), ('psd', 0.17), ('distij', 0.165), ('metric', 0.154), ('boosting', 0.15), ('extreme', 0.138), ('cg', 0.138), ('sp', 0.138), ('ak', 0.133), ('convex', 0.131), ('mahalanobis', 0.124), ('dual', 0.119), ('distik', 0.118), ('lpboost', 0.118), ('rmp', 0.118), ('opt', 0.112), ('zi', 0.101), ('tr', 0.101), ('rank', 0.095), ('lp', 0.092), ('solvers', 0.091), ('triplets', 0.088), ('lanczos', 0.083), ('master', 0.08), ('hull', 0.08), ('base', 0.079), ('bundle', 0.076), ('sd', 0.074), ('primal', 0.073), ('ui', 0.072), ('equaling', 0.071), ('matrices', 0.07), ('nite', 0.069), ('conic', 0.067), ('inde', 0.067), ('matrix', 0.063), ('solving', 0.063), ('sdps', 0.062), ('eigenvalue', 0.06), ('constraints', 0.058), ('constraint', 0.055), ('solved', 0.054), ('column', 0.051), ('restricted', 0.051), ('canberra', 0.05), ('rd', 0.05), ('solve', 0.049), ('distance', 0.048), ('proximity', 0.048), ('convm', 0.047), ('csdp', 0.047), ('nitely', 0.046), ('violated', 0.046), ('generation', 0.045), ('programming', 0.044), ('eigenvalues', 0.043), ('australian', 0.043), ('positive', 0.042), ('triples', 0.041), ('cplex', 0.041), ('lps', 0.041), ('iteration', 0.041), ('optimization', 0.041), ('variables', 0.041), ('neighbor', 0.038), ('restarted', 0.038), ('dist', 0.038), ('nds', 0.037), ('trace', 0.037), ('dantzig', 0.035), ('conv', 0.035), ('eigenvector', 0.035), ('zeros', 0.035), ('learners', 0.035), ('solution', 0.035), ('subproblem', 0.033), ('nicta', 0.033), ('points', 0.033), ('weak', 0.033), ('classi', 0.033), ('margin', 0.032), ('learned', 0.031), ('strong', 0.031), ('inequalities', 0.031), ('largest', 0.031), ('combination', 0.031), ('pp', 0.031), ('calculates', 0.031), ('symmetric', 0.03), ('clustering', 0.03), ('alleviate', 0.029), ('aggregated', 0.029), ('essence', 0.029), ('greedy', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

2 0.11756263 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime

Author: Ohad Shamir, Naftali Tishby

Abstract: unkown-abstract

3 0.1166169 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

Author: Xuming He, Richard S. Zemel

Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1

4 0.11583425 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

Author: David Sontag, Amir Globerson, Tommi S. Jaakkola

Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1

5 0.11162885 118 nips-2008-Learning Transformational Invariants from Natural Movies

Author: Charles Cadieu, Bruno A. Olshausen

Abstract: We describe a hierarchical, probabilistic model that learns to extract complex motion from movies of the natural environment. The model consists of two hidden layers: the first layer produces a sparse representation of the image that is expressed in terms of local amplitude and phase variables. The second layer learns the higher-order structure among the time-varying phase variables. After training on natural movies, the top layer units discover the structure of phase-shifts within the first layer. We show that the top layer units encode transformational invariants: they are selective for the speed and direction of a moving pattern, but are invariant to its spatial structure (orientation/spatial-frequency). The diversity of units in both the intermediate and top layers of the model provides a set of testable predictions for representations that might be found in V1 and MT. In addition, the model demonstrates how feedback from higher levels can influence representations at lower levels as a by-product of inference in a graphical model. 1

6 0.10999398 78 nips-2008-Exact Convex Confidence-Weighted Learning

7 0.10626676 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

8 0.10286751 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

9 0.10187382 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

10 0.10142195 202 nips-2008-Robust Regression and Lasso

11 0.098560922 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning

12 0.095571078 48 nips-2008-Clustering via LP-based Stabilities

13 0.094721638 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

14 0.092761181 15 nips-2008-Adaptive Martingale Boosting

15 0.090937436 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

16 0.089092664 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

17 0.08452186 143 nips-2008-Multi-label Multiple Kernel Learning

18 0.083847187 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

19 0.083525039 168 nips-2008-Online Metric Learning and Fast Similarity Search

20 0.08126875 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.231), (1, -0.058), (2, -0.12), (3, 0.044), (4, 0.009), (5, -0.0), (6, -0.038), (7, -0.11), (8, -0.036), (9, -0.032), (10, 0.065), (11, 0.011), (12, 0.058), (13, 0.121), (14, 0.062), (15, 0.051), (16, -0.006), (17, -0.02), (18, 0.006), (19, -0.088), (20, -0.014), (21, 0.025), (22, -0.123), (23, 0.072), (24, -0.034), (25, -0.093), (26, -0.017), (27, 0.046), (28, 0.006), (29, -0.028), (30, -0.031), (31, -0.001), (32, -0.1), (33, -0.194), (34, 0.039), (35, -0.228), (36, -0.006), (37, -0.035), (38, 0.065), (39, 0.037), (40, -0.003), (41, -0.052), (42, -0.051), (43, -0.149), (44, -0.045), (45, 0.101), (46, -0.029), (47, -0.187), (48, -0.023), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9471342 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

2 0.60368586 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime

Author: Ohad Shamir, Naftali Tishby

Abstract: unkown-abstract

3 0.51547331 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning

Author: Zenglin Xu, Rong Jin, Irwin King, Michael Lyu

Abstract: We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method. 1

4 0.50077575 143 nips-2008-Multi-label Multiple Kernel Learning

Author: Shuiwang Ji, Liang Sun, Rong Jin, Jieping Ye

Abstract: We present a multi-label multiple kernel learning (MKL) formulation in which the data are embedded into a low-dimensional space directed by the instancelabel correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, which can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained convex optimization problem. In addition, we show that the objective function of the approximate formulation is differentiable with Lipschitz continuous gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.

5 0.49696648 104 nips-2008-Improved Moves for Truncated Convex Models

Author: Philip Torr, M. P. Kumar

Abstract: We consider the problem of obtaining the approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose an improved st-MINCUT based move making algorithm. Unlike previous move making approaches, which either provide a loose bound or no bound on the quality of the solution (in terms of the corresponding Gibbs energy), our algorithm achieves the same guarantees as the standard linear programming (LP) relaxation. Compared to previous approaches based on the LP relaxation, e.g. interior-point algorithms or treereweighted message passing (TRW), our method is faster as it uses only the efficient st-MINCUT algorithm in its design. Furthermore, it directly provides us with a primal solution (unlike TRW and other related methods which solve the dual of the LP). We demonstrate the effectiveness of the proposed approach on both synthetic and standard real data problems. Our analysis also opens up an interesting question regarding the relationship between move making algorithms (such as α-expansion and the algorithms presented in this paper) and the randomized rounding schemes used with convex relaxations. We believe that further explorations in this direction would help design efficient algorithms for more complex relaxations.

6 0.49135533 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

7 0.48506704 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

8 0.48184735 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

9 0.46346346 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

10 0.46313459 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

11 0.46090639 48 nips-2008-Clustering via LP-based Stabilities

12 0.41765147 33 nips-2008-Bayesian Model of Behaviour in Economic Games

13 0.39952815 15 nips-2008-Adaptive Martingale Boosting

14 0.39840424 202 nips-2008-Robust Regression and Lasso

15 0.39608285 168 nips-2008-Online Metric Learning and Fast Similarity Search

16 0.39348125 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers

17 0.3908585 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

18 0.38679528 214 nips-2008-Sparse Online Learning via Truncated Gradient

19 0.38118154 239 nips-2008-Tighter Bounds for Structured Estimation

20 0.37703052 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.075), (7, 0.085), (12, 0.037), (28, 0.148), (57, 0.064), (63, 0.058), (71, 0.026), (77, 0.068), (83, 0.055), (86, 0.27)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76695871 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

2 0.67955917 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

Author: Jun Zhu, Eric P. Xing, Bo Zhang

Abstract: Learning graphical models with hidden variables can offer semantic insights to complex data and lead to salient structured predictors without relying on expensive, sometime unattainable fully annotated training data. While likelihood-based methods have been extensively explored, to our knowledge, learning structured prediction models with latent variables based on the max-margin principle remains largely an open problem. In this paper, we present a partially observed Maximum Entropy Discrimination Markov Network (PoMEN) model that attempts to combine the advantages of Bayesian and margin based paradigms for learning Markov networks from partially labeled data. PoMEN leads to an averaging prediction rule that resembles a Bayes predictor that is more robust to overfitting, but is also built on the desirable discriminative laws resemble those of the M3 N. We develop an EM-style algorithm utilizing existing convex optimization algorithms for M3 N as a subroutine. We demonstrate competent performance of PoMEN over existing methods on a real-world web data extraction task. 1

3 0.62130195 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

4 0.61269838 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

5 0.60904837 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

6 0.60796982 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

7 0.60791409 75 nips-2008-Estimating vector fields using sparse basis field expansions

8 0.60666394 202 nips-2008-Robust Regression and Lasso

9 0.60573399 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

10 0.60550594 216 nips-2008-Sparse probabilistic projections

11 0.60416394 200 nips-2008-Robust Kernel Principal Component Analysis

12 0.60355222 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

13 0.60275602 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

14 0.60198587 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

15 0.60167336 196 nips-2008-Relative Margin Machines

16 0.60105741 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

17 0.60103613 143 nips-2008-Multi-label Multiple Kernel Learning

18 0.60006434 81 nips-2008-Extracting State Transition Dynamics from Multiple Spike Trains with Correlated Poisson HMM

19 0.59922463 238 nips-2008-Theory of matching pursuit

20 0.59897292 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models