nips nips2005 nips2005-166 knowledge-graph by maker-knowledge-mining

166 nips-2005-Robust Fisher Discriminant Analysis


Source: pdf

Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd

Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. [sent-5, score-0.501]

2 Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. [sent-6, score-0.234]

3 The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. [sent-7, score-0.668]

4 For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. [sent-8, score-0.315]

5 Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., [sent-10, score-0.702]

6 robust Fisher LDA in a high dimensional feature space. [sent-12, score-0.235]

7 1 Introduction Fisher linear discriminant analysis (LDA), a widely-used technique for pattern classification, finds a linear discriminant that yields optimal discrimination between two classes which can be identified with two random variables, say X and Y in Rn . [sent-13, score-1.007]

8 A discriminant that maximizes the Fisher discriminant ratio f(w, µx, µy, Σx, Σy) = (wT(µx − µy))2 / (wT(Σx + Σy)w) is given by wnom = (Σx + Σy)−1(µx − µy), which gives the maximum Fisher discriminant ratio (µx − µy)T(Σx + Σy)−1(µx − µy) = max_{w≠0} f(w, µx, µy, Σx, Σy). [sent-15, score-1.584]
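
Below is a minimal numpy sketch of this nominal computation, not the authors' code: given sample estimates of the two class means and covariances, it forms wnom = (Σx + Σy)−1(µx − µy) and the corresponding maximum Fisher discriminant ratio. All function and variable names are illustrative.

```python
import numpy as np

def nominal_fisher_lda(mu_x, mu_y, Sigma_x, Sigma_y):
    """Nominal Fisher LDA from estimated means/covariances (illustrative sketch)."""
    d = mu_x - mu_y
    S = Sigma_x + Sigma_y
    w_nom = np.linalg.solve(S, d)      # w_nom = (Sigma_x + Sigma_y)^{-1} (mu_x - mu_y)
    ratio = float(d @ w_nom)           # (mu_x - mu_y)^T (Sigma_x + Sigma_y)^{-1} (mu_x - mu_y)
    return w_nom, ratio
```

With sample estimates plugged in, w_nom is exactly the discriminant whose sensitivity the rest of the paper addresses.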

9 In applications, the problem data µx, µy, Σx, and Σy are not known but are estimated from sample data. [sent-16, score-0.033]

10 Fisher LDA can be sensitive to the problem data: the discriminant wnom computed from an estimate of the parameters µx , µy , Σx , and Σy can give very poor discrimination for another set of problem data that is also a reasonable estimate of the parameters. [sent-17, score-0.599]

11 In this paper, we attempt to systematically alleviate this sensitivity problem by explicitly incorporating a model of data uncertainty in the classification problem and optimizing for the worst-case scenario under this model. [sent-18, score-0.234]

12 We assume that the problem data µx , µy , Σx , and Σy are uncertain, but known to belong to a convex compact subset U of Rn × Rn × Sn × Sn . [sent-19, score-0.201]

13 This assumption simply means that for each possible value of the means and covariances, the classes are distinguishable via Fisher LDA. [sent-22, score-0.11]

14 The worst-case analysis problem of finding the worst-case means and covariances for a given discriminant w can be written as minimize f (w, µx , µy , Σx , Σy ) subject to (µx , µy , Σx , Σy ) ∈ U, (1) with variables µx , µy , Σx , and Σy . [sent-23, score-0.849]

15 The optimal value of this problem is the worst-case Fisher discriminant ratio (over the class U of possible means and covariances), and any optimal points for this problem are called worst-case means and covariances. [sent-24, score-0.78]

16 We will show in §2 that (1) is a convex optimization problem, since the Fisher discriminant ratio is a convex function of µx , µy , Σx , Σy for a given discriminant w. [sent-26, score-1.341]

17 As a result, it is computationally tractable to find the worst-case performance of a discriminant w over the set of possible means and covariances. [sent-27, score-0.527]
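
As an illustration of problem (1), the following CVXPY sketch evaluates the worst-case Fisher discriminant ratio of a fixed w over one example of a convex uncertainty set (norm balls around nominal means, Frobenius balls around nominal covariances). The paper allows any convex compact U, so this particular set, and all names below, are assumptions made only for the example.

```python
import cvxpy as cp
import numpy as np

def worst_case_fisher_ratio(w, mu_x0, mu_y0, Sig_x0, Sig_y0, delta, rho_x, rho_y):
    """Solve problem (1) for a fixed discriminant w over an illustrative set U."""
    n = len(w)
    mu_x, mu_y = cp.Variable(n), cp.Variable(n)
    Sig_x = cp.Variable((n, n), PSD=True)
    Sig_y = cp.Variable((n, n), PSD=True)
    # Fisher ratio (w'(mu_x - mu_y))^2 / (w'(Sig_x + Sig_y)w): convex in the
    # parameters for fixed w, so minimizing it over U is a convex problem.
    num = w @ (mu_x - mu_y)
    den = cp.quad_form(w, Sig_x + Sig_y)          # affine in (Sig_x, Sig_y)
    objective = cp.Minimize(cp.quad_over_lin(num, den))
    constraints = [                               # example uncertainty set U; it should
        cp.norm(mu_x - mu_x0) <= delta,           # keep the covariances positive definite
        cp.norm(mu_y - mu_y0) <= delta,
        cp.norm(Sig_x - Sig_x0, 'fro') <= rho_x,
        cp.norm(Sig_y - Sig_y0, 'fro') <= rho_y,
    ]
    prob = cp.Problem(objective, constraints)
    prob.solve()
    return prob.value                             # worst-case Fisher discriminant ratio
```

At the optimum, mu_x.value, mu_y.value, Sig_x.value, Sig_y.value are worst-case means and covariances for the given w.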

18 The robust Fisher LDA problem is to find a discriminant that maximizes the worst-case Fisher discriminant ratio. [sent-28, score-1.183]

19 This can be cast as the optimization problem: maximize min_{(µx,µy,Σx,Σy)∈U} f(w, µx, µy, Σx, Σy) subject to w ≠ 0, (2) with variable w. [sent-29, score-0.207]

20 Here we choose a linear discriminant that maximizes the Fisher discrimination ratio, with the worst possible means and covariances that are consistent with our data uncertainty model. [sent-31, score-0.856]

21 The main result of this paper is to give an effective method for solving the robust Fisher LDA problem (2). [sent-32, score-0.246]

22 We will show in §2 that the robust optimal Fisher discriminant w⋆ can be found as follows. [sent-33, score-0.709]

23 First, we solve the (convex) optimization problem: minimize max_{w≠0} f(w, µx, µy, Σx, Σy) = (µx − µy)T(Σx + Σy)−1(µx − µy) subject to (µx, µy, Σx, Σy) ∈ U, (3) with variables (µx, µy, Σx, Σy). [sent-34, score-0.183]

24 Let (µx⋆, µy⋆, Σx⋆, Σy⋆) denote any optimal point. [sent-35, score-0.044]

25 Then the discriminant w⋆ = (Σx⋆ + Σy⋆)−1(µx⋆ − µy⋆) (4) is a robust optimal Fisher discriminant, i.e., it solves the robust Fisher LDA problem (2). [sent-36, score-0.709]

26 Moreover, we will see that µx⋆, µy⋆ and Σx⋆, Σy⋆ are worst-case means and covariances for the robust optimal Fisher discriminant w⋆. [sent-39, score-0.951]

27 Since convex optimization problems are tractable, this means that we have a tractable general method for computing a robust optimal Fisher discriminant. [sent-40, score-0.537]

28 A robust Fisher discriminant problem of modest size can be solved by standard convex optimization methods. [sent-41, score-0.924]

29 For some special forms of the uncertainty model, the robust optimal Fisher discriminant can be solved more efficiently than by a general convex optimization formulation. [sent-44, score-1.036]
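
A hedged CVXPY sketch of the two-step recipe (3)-(4) above, again with an assumed ball-type uncertainty set standing in for a general U: problem (3) is a matrix fractional minimization, and the robust optimal discriminant is then read off from the worst-case parameters via (4). Names and the choice of U are illustrative, not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def robust_fisher_lda(mu_x0, mu_y0, Sig_x0, Sig_y0, delta, rho_x, rho_y):
    """Sketch of robust Fisher LDA via problem (3) and formula (4)."""
    n = len(mu_x0)
    mu_x, mu_y = cp.Variable(n), cp.Variable(n)
    Sig_x = cp.Variable((n, n), PSD=True)
    Sig_y = cp.Variable((n, n), PSD=True)
    # Problem (3): minimize the matrix fractional function over the set U.
    objective = cp.Minimize(cp.matrix_frac(mu_x - mu_y, Sig_x + Sig_y))
    constraints = [                                # illustrative uncertainty set U
        cp.norm(mu_x - mu_x0) <= delta,
        cp.norm(mu_y - mu_y0) <= delta,
        cp.norm(Sig_x - Sig_x0, 'fro') <= rho_x,
        cp.norm(Sig_y - Sig_y0, 'fro') <= rho_y,
    ]
    cp.Problem(objective, constraints).solve()
    # Formula (4): w* = (Sig_x* + Sig_y*)^{-1} (mu_x* - mu_y*).
    w_star = np.linalg.solve(Sig_x.value + Sig_y.value, mu_x.value - mu_y.value)
    return w_star
```

By the result stated above, the optimal point of (3) also provides worst-case means and covariances for the returned w_star.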

30 Recently, there has been a growing interest in kernel Fisher discriminant analysis, i.e., Fisher LDA in a high dimensional feature space. [sent-48, score-0.505]

31 Our results can be extended to robust kernel Fisher discriminant analysis under certain uncertainty models. [sent-53, score-0.804]

32 Various types of robust classification problems have been considered in the prior literature. [sent-55, score-0.213]

33 Most of the research has focused on formulating robust classification problems that can be efficiently solved via convex optimization. [sent-58, score-0.417]

34 In particular, the robust classification method developed in [6] is based on the criterion g(w, µx, µy, Σx, Σy) = |wT(µx − µy)| / ((wT Σx w)1/2 + (wT Σy w)1/2), which is similar to the Fisher discriminant ratio f. [sent-59, score-0.729]

35 With a specific uncertainty model on the means and covariances, the robust classification problem with discrimination criterion g can be cast as a second-order cone program, a special type of convex optimization problem [5]. [sent-60, score-0.743]

36 With general uncertainty models, however, it is not clear whether robust discriminant analysis with g can be performed via convex optimization. [sent-61, score-0.935]

37 2 Robust Fisher LDA We first consider the worst-case analysis problem (1). [sent-62, score-0.049]

38 Here we consider the discriminant w as fixed, and the parameters µx , µy , Σx , and Σy are variables, constrained to lie in the convex uncertainty set U. [sent-63, score-0.729]

39 To show that (1) is a convex optimization problem, we must show that the Fisher discriminant ratio is a convex function of µx , µy , Σx , and Σy . [sent-64, score-0.889]

40 To show this, we express the Fisher discriminant ratio f as the composition f(w, µx, µy, Σx, Σy) = g(H(µx, µy, Σx, Σy)), where g(u, t) = u^2/t and H is the function H(µx, µy, Σx, Σy) = (wT(µx − µy), wT(Σx + Σy)w). [sent-65, score-0.559]

41 The function H is linear (as a mapping from µx , µy , Σx , and Σy into R2 ), and the function g is convex (provided t > 0, which holds here). [sent-66, score-0.168]

42 Thus, the composition f is a convex function of µx , µy , Σx , and Σy . [sent-67, score-0.193]

43 Consider a function of the form R(w, a, B) = (wT a)2 / (wT B w), (5) which is the Rayleigh quotient for the matrix pair aaT ∈ Sn+ and B ∈ Sn++, evaluated at w. [sent-70, score-0.044]

44 The robust Fisher LDA problem (2) is equivalent to a problem of the form: maximize min_{(a,B)∈V} R(w, a, B) subject to w ≠ 0, (6) where a = µx − µy, B = Σx + Σy, and V = {(µx − µy, Σx + Σy) | (µx, µy, Σx, Σy) ∈ U}. (7) [sent-71, score-0.383]

45 (This equivalence means that robust FLDA is a special type of robust matched filtering problem studied in the 1980s.) [sent-72, score-0.548]

46 We will prove a ‘nonconventional’ minimax theorem for a Rayleigh quotient of the form (5), which will establish the main result described in §1. [sent-75, score-0.146]

47 To do this, we consider a problem of the form: minimize aT B−1 a subject to (a, B) ∈ V, (8) with variables a ∈ Rn and B ∈ Sn++, where V is a convex compact subset of Rn × Sn++ such that for each (a, B) ∈ V, a is not zero. [sent-76, score-0.285]

48 The objective of this problem is a matrix fractional function and so is convex on Rn × Sn++; see [3, §3]. [sent-77, score-0.217]

49 It follows that (3) is a convex optimization problem. [sent-81, score-0.205]

50 The following theorem states the minimax theorem for the function R. [sent-82, score-0.117]

51 While R is convex in (a, B) for fixed w, it is not concave in w for fixed (a, B), so conventional convex-concave minimax theorems do not apply here. [sent-83, score-0.255]

52 Let (a⋆, B⋆) be an optimal solution to the problem (8), and let w⋆ = (B⋆)−1 a⋆. [sent-85, score-0.077]

53 Then (w⋆, a⋆, B⋆) satisfies the minimax property R(w⋆, a⋆, B⋆) = max_{w≠0} min_{(a,B)∈V} R(w, a, B) = min_{(a,B)∈V} max_{w≠0} R(w, a, B), (9) and the saddle point property R(w, a⋆, B⋆) ≤ R(w⋆, a⋆, B⋆) ≤ R(w⋆, a, B) for all w ∈ Rn\{0} and all (a, B) ∈ V. (10) [sent-86, score-0.284]

54 It suffices to prove (10), since the saddle point property (10) implies the minimax property (9) [1, §2]. [sent-88, score-0.162]

55 What remains is to show that min_{(a,B)∈V} R(w⋆, a, B) = R(w⋆, a⋆, B⋆). (11) [sent-91, score-0.04]

56 Since a⋆ and B⋆ are optimal for the convex problem (8) (by definition), they must satisfy the optimality condition (12): ⟨∇a(aT B−1 a)|(a⋆,B⋆), a − a⋆⟩ + ⟨∇B(aT B−1 a)|(a⋆,B⋆), B − B⋆⟩ ≥ 0 for all (a, B) ∈ V (see [3, §4]). [sent-92, score-0.301]

57 Now we turn to the convex optimization problem: minimize R(w⋆, a, B) subject to (a, B) ∈ V, (13) with variables (a, B). [sent-96, score-0.322]

58 We will show that (a⋆ , B ⋆ ) is optimal for this problem, which will establish (11). [sent-97, score-0.044]

59 A pair (ā, B̄) is optimal for (13) if and only if ⟨∇a((w⋆T a)2/(w⋆T B w⋆))|(ā,B̄), a − ā⟩ + ⟨∇B((w⋆T a)2/(w⋆T B w⋆))|(ā,B̄), B − B̄⟩ ≥ 0 for all (a, B) ∈ V. [sent-98, score-0.044]

60 Substituting ā = a⋆ and B̄ = B⋆, and noting that a⋆T w⋆/(w⋆T B⋆ w⋆) = 1, the optimality condition reduces to 2w⋆T(a − a⋆) − w⋆T(B − B⋆)w⋆ ≥ 0 for all (a, B) ∈ V, which is precisely (12). [sent-100, score-0.056]

61 Thus, we have shown that (a⋆ , B ⋆ ) is optimal for (13), which in turn establishes (11). [sent-101, score-0.044]

62 3 Robust Fisher LDA with product form uncertainty models In this section, we focus on robust Fisher LDA with the product form uncertainty model U = M × S, (14) where M is the set of possible means and S is the set of possible covariances. [sent-102, score-0.472]

63 For this model, the worst-case Fisher discriminant ratio can be written as min_{(µx,µy,Σx,Σy)∈U} f(w, µx, µy, Σx, Σy) = min_{(µx,µy)∈M} (wT(µx − µy))2 / max_{(Σx,Σy)∈S} wT(Σx + Σy)w. [sent-103, score-0.578]

64 If we can find an analytic expression for max_{(Σx,Σy)∈S} wT(Σx + Σy)w (as a function of w), we can simplify the robust Fisher LDA problem. [sent-104, score-0.253]

65 The worst-case Fisher discriminant ratio can be expressed as min_{(µx,µy)∈M} (wT(µx − µy))2 / (wT(Σ̄x + Σ̄y + (δx + δy)I)w). [sent-112, score-0.516]

66 This is the same worst-case Fisher discriminant ratio obtained for a problem in which the covariances are certain, i.e., [sent-113, score-0.776]

67 fixed to be Σ̄x + δx I and Σ̄y + δy I, and the means lie in the set M. [sent-115, score-0.078]

68 We conclude that a robust optimal Fisher discriminant with the uncertainty model (14) in which S has the form (15) can be found by solving a robust Fisher LDA problem with these fixed values for the covariances. [sent-116, score-1.041]

69 The problem (17) is relatively simple: it involves minimizing a convex quadratic function over the set of possible µx and µy . [sent-118, score-0.22]

70 When µx and µy each lie in some confidence ellipsoid, the problem (17) is to minimize a convex quadratic subject to two convex quadratic constraints. [sent-121, score-0.496]

71 Such a problem is readily solved in O(n^3) flops, since the dual problem has two variables, and evaluating the dual function and its derivatives can be done in O(n^3) flops [3]. [sent-122, score-0.087]

72 Thus, the effort to solve the robust problem is the same order (i.e., [sent-123, score-0.213]

73 n^3) as solving the nominal Fisher LDA (but with a substantially larger constant). [sent-125, score-0.183]
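
To make the product-form case concrete, here is a hedged CVXPY sketch under the assumptions of this section: the covariance uncertainty is absorbed into the fixed matrices Σ̄x + δxI and Σ̄y + δyI, and the means are constrained to confidence ellipsoids parameterized by matrices Px and Py. Reading problem (17) as minimizing the quadratic (µx − µy)T B−1(µx − µy) over M, with B the fixed covariance sum, is an interpretation on our part; all names are illustrative.

```python
import cvxpy as cp
import numpy as np

def robust_lda_product_form(mu_x0, mu_y0, Sig_x_bar, Sig_y_bar,
                            delta_x, delta_y, P_x, P_y):
    """Robust Fisher LDA sketch for the product-form model: fixed covariance
    sum B, means restricted to ellipsoids (mu - mu0)' P^{-1} (mu - mu0) <= 1."""
    n = len(mu_x0)
    B = Sig_x_bar + delta_x * np.eye(n) + Sig_y_bar + delta_y * np.eye(n)
    sym = lambda M: 0.5 * (M + M.T)               # guard against tiny asymmetry
    mu_x, mu_y = cp.Variable(n), cp.Variable(n)
    # Minimize the convex quadratic (mu_x - mu_y)' B^{-1} (mu_x - mu_y) over M.
    objective = cp.Minimize(cp.quad_form(mu_x - mu_y, sym(np.linalg.inv(B))))
    constraints = [cp.quad_form(mu_x - mu_x0, sym(np.linalg.inv(P_x))) <= 1,
                   cp.quad_form(mu_y - mu_y0, sym(np.linalg.inv(P_y))) <= 1]
    cp.Problem(objective, constraints).solve()
    # Worst-case means plus the fixed B give w* through formula (4).
    return np.linalg.solve(B, mu_x.value - mu_y.value)
```

A custom dual ascent over the two constraint multipliers would recover the O(n^3) cost discussed above; the generic solver call here is only for clarity.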

74 4 Numerical results To demonstrate robust Fisher LDA, we use the sonar and ionosphere benchmark problems from the UCI repository. [sent-126, score-0.34]

75 The two benchmark problems have 208 and 351 points, respectively, and the dimension of each data point is 60 and 34, respectively. [sent-131, score-0.037]

76 We use the training set to compute the optimal discriminant and then test its performance using the test set. [sent-133, score-0.523]

77 α = 0.3 means that 30% of the data points are used for training, and 70% are used to test the resulting discriminant. [sent-137, score-0.055]

78 For various values of α, we generate 100 random partitions of the data (for each of the two benchmark problems), and collect the results. [sent-138, score-0.037]

79 The parameters are estimated through a resampling technique [4] as follows. [sent-140, score-0.04]

80 For a given training set we create 100 new sets by resampling the original training set with a uniform distribution over all the data points. [sent-141, score-0.094]

81 For each of these sets we estimate its mean and covariance and then take their average values as the nominal mean and covariance. [sent-142, score-0.182]

82 We also evaluate the covariance Σµ of all the means obtained with the resampling. [sent-143, score-0.07]

83 This choice corresponds to a 50% confidence ellipsoid in the case of a Gaussian distribution. [sent-145, score-0.038]

84 The parameters ρx and ρy are taken to be the maximum deviations between the covariances and the average covariances in the Frobenius norm sense, over the resampling of the training set. [sent-146, score-0.441]
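
A small numpy sketch of this resampling procedure for a single class. The function names, the default of 100 resamples, and the omission of the final rescaling of Σµ into a 50% confidence ellipsoid follow our reading of the description above; this is not the authors' code.

```python
import numpy as np

def resampled_parameters(X, n_resamples=100, seed=0):
    """Bootstrap-style estimation of the nominal mean/covariance, the covariance
    of the resampled means, and the Frobenius radius rho for one class."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    means, covs = [], []
    for _ in range(n_resamples):
        idx = rng.integers(0, N, size=N)          # uniform resampling with replacement
        Xb = X[idx]
        means.append(Xb.mean(axis=0))
        covs.append(np.cov(Xb, rowvar=False))
    mu_nom = np.mean(means, axis=0)               # nominal mean
    Sigma_nom = np.mean(covs, axis=0)             # nominal covariance
    Sigma_mu = np.cov(np.asarray(means), rowvar=False)  # spread of the resampled means
    rho = max(np.linalg.norm(C - Sigma_nom, 'fro') for C in covs)
    # Scaling Sigma_mu into a 50% confidence ellipsoid (the choice of P above)
    # is an additional step not shown here.
    return mu_nom, Sigma_nom, Sigma_mu, rho
```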

85 [Figure 1 shows two panels, ionosphere and sonar, plotting TSA (%) against α (%) for the robust and nominal discriminants.] Figure 1: Test-set accuracy (TSA) for sonar and ionosphere benchmark versus size of the training set. [sent-147, score-1.004]

86 The solid line represents the robust Fisher LDA results and the dotted line the nominal Fisher LDA results. [sent-148, score-0.38]

87 For each of our two problems, and for each value of α, we show the average test set accuracy (TSA), as well as the standard deviation (over the 100 instances of each problem with the given value of α). [sent-151, score-0.033]

88 The plots show the robust Fisher LDA performs substantially better than the nominal Fisher LDA for small training sets, but this performance gap disappears as the training set becomes larger. [sent-152, score-0.465]

89 5 Robust kernel Fisher discriminant analysis In this section we show how to ‘kernelize’ the robust Fisher LDA. [sent-153, score-0.718]

90 We will consider only a specific class of uncertainty models; the arguments we develop here can be extended to more general cases. [sent-154, score-0.086]

91 In the kernel approach we map the problem to a higher dimensional space Rf via a mapping φ : Rn → Rf so that the new decision boundary is more general and possibly nonlinear. [sent-155, score-0.087]

92 The uncertainty model we consider has the form µφ(x) − µφ(y) = µ̄φ(x) − µ̄φ(y) + P uf, ‖uf‖ ≤ 1, ‖Σφ(x) − Σ̄φ(x)‖F ≤ ρx, ‖Σφ(y) − Σ̄φ(y)‖F ≤ ρy. (18) [sent-157, score-0.286]

93 Here the vectors µ̄φ(x), µ̄φ(y) represent the nominal means, the matrices Σ̄φ(x), Σ̄φ(y) represent the nominal covariances, and the (positive semidefinite) matrix P and the constants ρx and ρy represent the confidence regions in the feature space. [sent-158, score-0.427]

94 The worst-case Fisher discriminant ratio in the feature space is then given by min_{‖uf‖≤1, ‖Σφ(x)−Σ̄φ(x)‖F≤ρx, ‖Σφ(y)−Σ̄φ(y)‖F≤ρy} (wfT(µ̄φ(x) − µ̄φ(y) + P uf))2 / (wfT(Σφ(x) + Σφ(y))wf). [sent-159, score-0.969]

95 The robust kernel Fisher discriminant analysis problem is to find the discriminant in the feature space that maximizes this ratio. [sent-160, score-1.258]

96 To apply the kernel trick to the problem (19), the nonlinear decision boundary should be entirely expressed in terms of inner products of the mapped data only. [sent-162, score-0.103]

97 Let ν⋆ be an optimal solution of the problem: maximize min_{ξT D4 ξ ≤ 1} νT(D1 + D2ξ)(D1 + D2ξ)Tν / (νT D3 ν) subject to ν ≠ 0. (20) [sent-173, score-0.141]

98 Then wf⋆ = Φν⋆ is an optimal solution of the problem (19). [sent-174, score-0.308]

99 Moreover, for every point z ∈ Rn, wf⋆T φ(z) = Σ_{i=1}^{Nx} νi⋆ K(z, xi) + Σ_{i=1}^{Ny} ν⋆_{i+Nx} K(z, yi). [sent-175, score-0.237]
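
This last identity is what makes the robust kernel discriminant usable at test time: once ν⋆ is known, the discriminant value at a new point requires only kernel evaluations against the training points. A small sketch follows, with an RBF kernel and the stacking of ν⋆ (first Nx entries for the x class, remaining Ny for the y class) assumed for illustration.

```python
import numpy as np

def rbf_kernel(z, u, gamma=1.0):
    """Illustrative RBF kernel K(z, u) = exp(-gamma * ||z - u||^2)."""
    return np.exp(-gamma * np.sum((z - u) ** 2))

def kernel_discriminant_value(z, nu_star, X, Y, kernel=rbf_kernel):
    """Evaluate w_f*' phi(z) = sum_i nu_i* K(z, x_i) + sum_i nu_{i+Nx}* K(z, y_i)."""
    Nx = len(X)
    value = sum(nu_star[i] * kernel(z, X[i]) for i in range(Nx))
    value += sum(nu_star[Nx + i] * kernel(z, Y[i]) for i in range(len(Y)))
    return value
```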

100 A mathematical programming approach to the kernel Fisher algorithm, 2001. [sent-217, score-0.037]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fisher', 0.554), ('discriminant', 0.452), ('lda', 0.269), ('nx', 0.233), ('robust', 0.213), ('wf', 0.191), ('covariances', 0.187), ('bw', 0.183), ('convex', 0.168), ('nominal', 0.167), ('wt', 0.147), ('sn', 0.11), ('uf', 0.1), ('minimax', 0.087), ('uncertainty', 0.086), ('ny', 0.084), ('ratio', 0.064), ('rn', 0.057), ('means', 0.055), ('tsa', 0.05), ('ionosphere', 0.045), ('sonar', 0.045), ('optimal', 0.044), ('discrimination', 0.043), ('py', 0.042), ('subject', 0.04), ('resampling', 0.04), ('boyd', 0.04), ('rf', 0.04), ('min', 0.04), ('optimality', 0.04), ('ellipsoid', 0.038), ('rnx', 0.038), ('sy', 0.038), ('wnom', 0.038), ('optimization', 0.037), ('benchmark', 0.037), ('kernel', 0.037), ('px', 0.034), ('problem', 0.033), ('rayleigh', 0.033), ('maximizes', 0.033), ('cast', 0.033), ('ops', 0.03), ('aat', 0.03), ('sx', 0.03), ('alleviate', 0.03), ('max', 0.029), ('dence', 0.029), ('quotient', 0.028), ('yi', 0.027), ('training', 0.027), ('cone', 0.027), ('minimize', 0.026), ('composition', 0.025), ('classi', 0.025), ('tr', 0.024), ('maximize', 0.024), ('saddle', 0.023), ('frobenius', 0.023), ('lie', 0.023), ('written', 0.022), ('feature', 0.022), ('systematically', 0.021), ('solved', 0.021), ('constants', 0.021), ('tractable', 0.02), ('matched', 0.019), ('xi', 0.019), ('quadratic', 0.019), ('variables', 0.018), ('express', 0.018), ('semide', 0.018), ('property', 0.018), ('cation', 0.018), ('boundary', 0.017), ('represent', 0.017), ('verd', 0.017), ('alessandro', 0.017), ('kernelize', 0.017), ('numerical', 0.017), ('mapped', 0.016), ('stanford', 0.016), ('matrix', 0.016), ('product', 0.016), ('condition', 0.016), ('incorporating', 0.016), ('analysis', 0.016), ('substantially', 0.016), ('proposition', 0.016), ('prove', 0.016), ('theorem', 0.015), ('covariance', 0.015), ('sensitivity', 0.015), ('special', 0.015), ('rep', 0.015), ('bhattacharyya', 0.015), ('ellipsoids', 0.015), ('disappears', 0.015), ('formulating', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 166 nips-2005-Robust Fisher Discriminant Analysis

Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd

Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1

2 0.1534079 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1

3 0.12642372 167 nips-2005-Robust design of biological experiments

Author: Patrick Flaherty, Adam Arkin, Michael I. Jordan

Abstract: We address the problem of robust, computationally-efficient design of biological experiments. Classical optimal experiment design methods have not been widely adopted in biological practice, in part because the resulting designs can be very brittle if the nominal parameter estimates for the model are poor, and in part because of computational constraints. We present a method for robust experiment design based on a semidefinite programming relaxation. We present an application of this method to the design of experiments for a complex calcium signal transduction pathway, where we have found that the parameter estimates obtained from the robust design are better than those obtained from an “optimal” design. 1

4 0.11394371 104 nips-2005-Laplacian Score for Feature Selection

Author: Xiaofei He, Deng Cai, Partha Niyogi

Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1

5 0.089062214 52 nips-2005-Correlated Topic Models

Author: John D. Lafferty, David M. Blei

Abstract: Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets. 1

6 0.087739028 189 nips-2005-Tensor Subspace Analysis

7 0.078272574 50 nips-2005-Convex Neural Networks

8 0.067777842 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

9 0.065050915 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

10 0.062684953 149 nips-2005-Optimal cue selection strategy

11 0.057710346 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

12 0.057236239 48 nips-2005-Context as Filtering

13 0.056356836 58 nips-2005-Divergences, surrogate loss functions and experimental design

14 0.056036148 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

15 0.055967979 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

16 0.050738931 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

17 0.050501555 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

18 0.05007682 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

19 0.048737582 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

20 0.046059448 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.136), (1, 0.071), (2, -0.05), (3, -0.03), (4, 0.03), (5, -0.006), (6, 0.003), (7, -0.036), (8, -0.064), (9, -0.099), (10, 0.033), (11, -0.127), (12, -0.095), (13, -0.026), (14, -0.052), (15, -0.059), (16, -0.075), (17, 0.009), (18, -0.231), (19, -0.134), (20, 0.007), (21, -0.033), (22, -0.132), (23, 0.091), (24, -0.085), (25, 0.003), (26, 0.215), (27, -0.099), (28, 0.017), (29, -0.046), (30, 0.02), (31, -0.07), (32, -0.003), (33, 0.049), (34, -0.16), (35, 0.172), (36, -0.126), (37, -0.059), (38, 0.14), (39, 0.066), (40, 0.074), (41, -0.032), (42, -0.116), (43, -0.058), (44, 0.057), (45, 0.084), (46, 0.117), (47, -0.014), (48, 0.027), (49, -0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96366751 166 nips-2005-Robust Fisher Discriminant Analysis

Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd

Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is to show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1

2 0.62149763 167 nips-2005-Robust design of biological experiments

Author: Patrick Flaherty, Adam Arkin, Michael I. Jordan

Abstract: We address the problem of robust, computationally-efficient design of biological experiments. Classical optimal experiment design methods have not been widely adopted in biological practice, in part because the resulting designs can be very brittle if the nominal parameter estimates for the model are poor, and in part because of computational constraints. We present a method for robust experiment design based on a semidefinite programming relaxation. We present an application of this method to the design of experiments for a complex calcium signal transduction pathway, where we have found that the parameter estimates obtained from the robust design are better than those obtained from an “optimal” design. 1

3 0.53361791 104 nips-2005-Laplacian Score for Feature Selection

Author: Xiaofei He, Deng Cai, Partha Niyogi

Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1

4 0.52363425 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1 Supervised Learning of Metrics The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction since whenever we construct a feature ´Üµ for an input using a simple distance funcspace , we can measure distances between ܽ ܾ ¾ tion (e.g. Euclidean) ´Ü½ µ ´Ü¾ µ℄ in feature space. Thus by fixing , any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when the ´Üµ is a linear projection of Ü ¾ Ö so that ´Üµ Ï Ü. The Euclidean distance between ´Ü½ µ and ´Ü¾ µ is then the Mahalanobis distance ´Ü½ µ   ´Ü¾ µ ¾ ´Ü½   ܾ µÌ ´Ü½   ܾ µ, where Ï Ì Ï is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix . This is also the goal of the current work. A common approach to learning metrics is to assume some knowledge in the form of equiv- alence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher’s Linear Discriminant Analysis (see e.g. [6]). Other more recent methods are [10, 9, 5] which seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5]. 
However, unlike NCA, the optimization problem is convex and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution, but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods which use only global covariance structure.

2 The Approach of Collapsing Classes

Given a set of n labeled examples (xi, yi), where xi ∈ Rr and yi is one of k class labels, we seek a similarity measure between two points in the input space. We focus on metrics of Mahalanobis form

d(xi, xj | A) = dAij = (xi − xj)T A (xi − xj),   (1)

where A is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of the same class look close, whereas those in different classes appear far. Our approach starts with the ideal case when this is true in the most optimistic sense: same class points are at zero distance, and different class points are infinitely far. Alternatively, this can be viewed as mapping x via a linear projection Wx (with A = WT W) such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal case analysis of the algorithm results in all same cluster points being mapped to a single point.

To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). Specifically, for each xi we define a conditional distribution over points j ≠ i:

pA(j | i) = exp(−dAij) / Σk≠i exp(−dAik).   (2)

If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal "bi-level" distribution

p0(j | i) ∝ 1 if yi = yj, and 0 if yi ≠ yj.   (3)

Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least rank[A] + 2 points in each class (note that rank[A] ≤ r). Then pA(j | i) = p0(j | i) for all i, j implies that under A all points in the same class will be mapped to a single point, infinitely far from other class points (footnote 1).

Thus it is natural to seek a matrix A such that pA(j | i) is as close as possible to p0(j | i). Since we are trying to match distributions, we minimize the KL divergence KL[p0 || pA]:

minimize over PSD A:   f(A) = Σi KL[ p0(· | i) || pA(· | i) ].   (4)

The crucial property of this optimization problem is that it is convex in the matrix A. To see this, first note that any convex combination of feasible solutions, αA0 + (1 − α)A1 with 0 ≤ α ≤ 1, is still a feasible solution, since the set of PSD matrices is convex. Next, we show that the cost at αA0 + (1 − α)A1 never exceeds the corresponding convex combination of the costs at the endpoints, i.e. that the objective itself is convex in A. To do this, we rewrite the objective function in the form (footnote 2)

f(A) = Σi Σj: yj = yi dAij + Σi log Σk≠i exp(−dAik),

where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why f(A) is convex, first note that dAij = (xi − xj)T A (xi − xj) is linear in A, and thus convex. The second term is a log-sum-exp of affine functions of A and is therefore also convex (see [4], page 74).

Footnote 1 (proof sketch): The infinite separation between points of different classes follows simply from pA(j | i) ∝ exp(−dAij) = 0 whenever yi ≠ yj. For a given point xi, all points j in its class receive equal probability under p0, so pA(j | i) must be equal for all of them; due to the structure of pA(j | i) in Equation 2, and because this holds for every point in xi's class, all the points in that class are equidistant from each other. However, it is easy to show that the maximum number of distinct equidistant points (also known as the equilateral dimension [1]) in rank[A] dimensions is rank[A] + 1. Since by assumption there are at least rank[A] + 2 points in the class of xi, and A maps points into a rank[A]-dimensional subspace, it follows that all points in the class are identical.

2.1 Convex Duality

Since our optimization problem is convex, it has an equivalent convex dual. Specifically, the convex dual of Eq.
(4) is the following entropy maximization problem:

maximize over p(· | i):   Σi H[ p(· | i) ]   subject to   Σij p0(j | i) vij vijT = Σij p(j | i) vij vijT,   (5)

where vij = xi − xj, H[·] is the entropy function, and we require Σj p(j | i) = 1 and p(j | i) ≥ 0.

To prove this duality we start with the proposed dual and obtain the original problem in Equation 4 as its dual. Write the Lagrangian for the above problem, where the multiplier A is PSD (footnote 3):

L(p, A, β) = −Σi H[ p(· | i) ] − Tr( A ( Σij p0(j | i) vij vijT − Σij p(j | i) vij vijT ) ) − Σi βi ( Σj p(j | i) − 1 ).

The dual function is defined as g(A, β) = minp L(p, A, β). To derive it, we first solve for the minimizing p by setting the derivative of L(p, A, β) with respect to p(j | i) equal to zero:

0 = 1 + log p(j | i) + Tr( A vij vijT ) − βi   ⟹   p(j | i) = exp(βi − 1) exp(−Tr(A vij vijT)).

Plugging this solution back into the Lagrangian gives the dual function g(A, β); the dual problem is to maximize it, and the maximization over β can be carried out analytically. Now note that Tr(A vij vijT) = vijT A vij = dAij. Substituting, the maximized dual reduces to

g(A) = −Σi Σj: yj = yi dAij − Σi log Σk≠i exp(−dAik),

which is (up to the constants noted earlier) minus our original target function f(A). Since g(A) is to be maximized over PSD matrices, we have the desired duality result, identifying the Lagrange multiplier with the metric A.

Footnote 2: Up to an additive constant Σi H[ p0(· | i) ].
Footnote 3: We consider the equivalent problem of minimizing minus entropy.

2.1.1 Relation to covariance based and embedding methods

The convex dual derived above reveals an interesting relation to covariance based learning methods. The sufficient statistics used by the algorithm are a set of n "spread" matrices. Each matrix is of the form Σj p0(j | i) vij vijT. The algorithm tries to find a maximum entropy distribution which matches these matrices when averaged over the sample. This should be contrasted with the covariance matrices used in metric learning such as Fisher's Discriminant Analysis. The latter uses the within and between class covariance matrices. The within covariance matrix is similar to the covariance matrix used here, but is calculated with respect to the class means, whereas here it is calculated separately for every point and is centered on this point. This highlights the fact that MCML is not based on Gaussian assumptions, where it is indeed sufficient to calculate a single class covariance. Our method can also be thought of as a supervised version of the Stochastic Neighbour Embedding algorithm [7], in which the "target" distribution is p0 (determined by the class labels) and the embedding points are not completely free but are instead constrained to be of the form Wxi.

2.2 Optimizing the Convex Objective

Since the optimization problem in Equation 4 is convex, it is guaranteed to have only a single minimum which is the globally optimal solution (footnote 4). It can be optimized using any appropriate numerical convex optimization machinery; all methods will yield the same solution, although some may be faster than others. One standard approach is to use interior point Newton methods. However, these algorithms require the Hessian to be calculated, which would require resources that scale poorly with the dimension and could be prohibitive in our case.
Instead, we have experimented with using a first order gradient method, specifically the projected gradient approach as in [10]. At each iteration we take a small step in the direction of the negative gradient of the objective function (footnote 5), followed by a projection back onto the PSD cone. This projection is performed simply by taking the eigen-decomposition of A and removing the components with negative eigenvalues. The algorithm is summarized below:

Input: a set of labeled data points (xi, yi), i = 1, …, n.
Output: a PSD metric A which optimally collapses classes.
Initialization: initialize A0 to some PSD matrix (randomly or using some initialization heuristic).
Iterate:
- Gradient step: At+1 = At − ε ∇f(At), where ∇f(A) = Σij (p0(j | i) − pA(j | i)) (xi − xj)(xi − xj)T.
- PSD projection: compute the eigen-decomposition At+1 = Σk λk uk ukT and set At+1 = Σk max(λk, 0) uk ukT.

(An illustrative code sketch of this iteration appears a few paragraphs below.)

Of course, in principle it is possible to optimize over the dual instead of the primal, but in our case, if the training data consists of n points in r-dimensional space, then the primal has only O(r2/2) variables while the dual has O(n2), so it will almost always be more efficient to operate on the primal directly. One exception may be the kernel version (Section 4), where the primal is also of size O(n2).

Footnote 4: When the data can be exactly collapsed into single class points, there will be multiple solutions at infinity. However, this is very unlikely to happen in real data.
Footnote 5: In the experiments, we used an Armijo-like step size rule, as described in [3].

3 Low Dimensional Projections for Feature Extraction

The Mahalanobis distance under a metric A can be interpreted as a linear projection of the original inputs by the square root of A, followed by Euclidean distance in the projected space. Matrices A which have less than full rank correspond to Mahalanobis distances based on low dimensional projections. Such metrics and the induced distances can be advantageous for several reasons [5]. First, low dimensional projections can substantially reduce the storage and computational requirements of a supervised method, since only the projections of the training points must be stored and the manipulations at test time all occur in the lower dimensional feature space. Second, low dimensional projections re-represent the inputs, allowing for a supervised embedding or visualization of the original data. If we consider matrices with rank at most q, we can always represent them in the form A = WT W for some projection matrix W of size q × r. This corresponds to projecting the original data into a q-dimensional space specified by the rows of W. However, rank constraints on a matrix are not convex [4], and hence the rank constrained problem is not convex and is likely to have local minima which make the optimization difficult and ill-defined, since it becomes sensitive to initial conditions and to the choice of optimization method. Luckily, there is an alternative approach to obtaining low dimensional projections which does specify a unique solution, by sequentially solving two globally tractable problems. This is the approach we follow here. First we solve for a (potentially) full rank metric A using the convex program outlined above, and then obtain a low rank projection from it via spectral decomposition. This is done by diagonalizing A into the form A = Σk=1..r λk vk vkT, where λ1 ≥ λ2 ≥ … ≥ λr are the eigenvalues of A and vk are the corresponding eigenvectors. To obtain a low rank projection we constrain the sum above to include only the q terms corresponding to the q largest eigenvalues: Aq = Σk=1..q λk vk vkT.
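As an unofficial illustration of the projected-gradient iteration summarized above, here is a minimal NumPy sketch. It is a simplified reading of the algorithm, not the authors' code: the fixed step size `eps`, the iteration count, and all function names are my own assumptions (the paper uses an Armijo-type step size rule [3]), and the sketch assumes every class contains at least two points.

```python
import numpy as np

def pairwise_sq_mahalanobis(X, A):
    # d_ij = (x_i - x_j)^T A (x_i - x_j) for all pairs
    diff = X[:, None, :] - X[None, :, :]                 # shape (n, n, r)
    return np.einsum('ijk,kl,ijl->ij', diff, A, diff)

def mcml_projected_gradient(X, y, n_iter=200, eps=1e-3):
    """Sketch of MCML: gradient step on f(A), then projection onto the PSD cone."""
    n, r = X.shape
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    p0 = same / same.sum(axis=1, keepdims=True)          # ideal "bi-level" p_0(j|i)
    A = np.eye(r)                                        # PSD initialization
    diff = X[:, None, :] - X[None, :, :]
    outer = np.einsum('ijk,ijl->ijkl', diff, diff)       # (x_i - x_j)(x_i - x_j)^T, all pairs
    for _ in range(n_iter):
        d = pairwise_sq_mahalanobis(X, A)
        np.fill_diagonal(d, np.inf)                      # exclude j = i from the softmax
        pA = np.exp(-d)
        pA /= pA.sum(axis=1, keepdims=True)              # p^A(j|i), Eq. (2)
        grad = np.einsum('ij,ijkl->kl', p0 - pA, outer)  # gradient of f(A)
        A = A - eps * grad                               # gradient step
        w, U = np.linalg.eigh(A)                         # project back onto the PSD cone
        A = (U * np.maximum(w, 0.0)) @ U.T
    return A
```

The O(n²r²) tensor of pairwise outer products is precomputed only to keep the sketch short; a practical implementation would trade memory for recomputation.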
The projection resulting from this rank-q truncation is uniquely defined (up to an irrelevant unitary transformation) as W = diag(√λ1, …, √λq) [v1, …, vq]T.

In general, the projection returned by this approach is not guaranteed to be the same as the projection corresponding to minimizing our objective function subject to a rank constraint on A, unless the optimal metric A is of rank less than or equal to q. However, as we show in the experimental results, it is often the case that for practical problems the optimal A has an eigen-spectrum which is rapidly decaying, so that many of its eigenvalues are indeed very small, suggesting the low rank solution will be close to optimal.

4 Learning Metrics with Kernels

It is interesting to consider the case where the inputs xi are mapped into a high dimensional feature space φ(xi) and a Mahalanobis distance is sought in this space. We focus on the case where dot products in the feature space may be expressed via a kernel function, such that φ(xi) · φ(xj) = k(xi, xj) for some kernel k. We now show how our method can be changed to accommodate this setting, so that the optimization depends only on dot products. Consider the regularized target function

f(A) = Σi KL[ p0(· | i) || pA(· | i) ] + λ Tr(A),   (6)

where the regularizing factor is equivalent to the Frobenius norm of the projection matrix W, since Tr(A) = ||W||2. Differentiating with respect to W we obtain W = UΦ, where U is some matrix which specifies W as a linear combination of sample points and the i-th row of the matrix Φ is φ(xi). Defining the PSD matrix B = UT U, we can recast our optimization as looking for a PSD matrix B, where the Mahalanobis distance becomes (φ(xi) − φ(xj))T ΦT B Φ (φ(xi) − φ(xj)) = (gi − gj)T B (gi − gj), with gi = Φ φ(xi) the vector of kernel values between xi and the training points. This is exactly our original distance, with xi replaced by gi, and it depends only on dot products in feature space. The regularization term also depends solely on dot products, since Tr(A) = Tr(ΦT B Φ) = Tr(B Φ ΦT) = Tr(B K), where K = Φ ΦT is the kernel matrix. Note that the trace is a linear function of B, keeping the problem convex. Thus, as long as dot products can be represented via kernels, the optimization can be carried out without explicitly using the high dimensional space. To obtain a low dimensional solution, we follow the approach in Section 3: obtain a decomposition B = VT Λ V (footnote 6) and take the projection matrix to be the first q rows of Λ1/2 V. The remainder of the construction shows, via a Kernel-PCA-style argument, that the required eigenvectors can be expressed as linear combinations of the training samples, so that the final projection of a point x depends on x only through its kernel values with the training points; low dimensional projections are obtained by keeping only the first q components of this projection.

5 Experimental Results

We compared our method to several metric learning algorithms on a supervised classification task. Training data was first used to learn a metric over the input space. Then this metric was used in a 1-nearest-neighbor algorithm to classify a test set. The datasets we investigated were taken from the UCI repository and have been used previously in evaluating supervised methods for metric learning [10, 5]. To these we added the USPS handwritten digits (downsampled to 8x8 pixels) and the YALE faces [2] (downsampled to 31x22).
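Before turning to the comparative evaluation, here is a small NumPy sketch (mine, not the authors') of the two-stage recipe of Section 3 and the projection W = diag(√λ1, …, √λq)[v1 … vq]T defined above: truncate a learned PSD matrix A to its top q eigenpairs and check that distances under Aq coincide with squared Euclidean distances after projecting by W. The random test matrix and the dimensions are arbitrary assumptions.

```python
import numpy as np

def low_rank_projection(A, q):
    """Return (A_q, W): the rank-q truncation of A and the induced q x r projection."""
    w, V = np.linalg.eigh(A)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:q]            # indices of the q largest eigenvalues
    lam, Vq = w[idx], V[:, idx]              # lambda_1..lambda_q and v_1..v_q (columns)
    W = np.sqrt(np.maximum(lam, 0.0))[:, None] * Vq.T   # diag(sqrt(lam)) [v_1 ... v_q]^T
    A_q = Vq @ np.diag(lam) @ Vq.T
    return A_q, W

# Sanity check on a random PSD matrix: W^T W reconstructs A_q, and the Mahalanobis
# distance under A_q equals the Euclidean distance after projecting by W.
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
A = B @ B.T
A_q, W = low_rank_projection(A, q=2)
assert np.allclose(W.T @ W, A_q)
x1, x2 = rng.normal(size=6), rng.normal(size=6)
d = x1 - x2
assert np.isclose(d @ A_q @ d, np.sum((W @ x1 - W @ x2) ** 2))
```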
The algorithms used in the comparative evaluation were:

- Fisher's Linear Discriminant Analysis (LDA), which projects on the eigenvectors of SW−1 SB, where SW and SB are the within and between class covariance matrices.
- The method of Xing et al [10], which minimizes the mean within class distance while keeping the mean between class distance larger than one.
- Principal Component Analysis (PCA). There are several possibilities for scaling the PCA projections. We tested several and report results of the empirically superior one (PCAW), which scales the projection components so that the covariance matrix after projection is the identity. PCAW often performs poorly on high dimensions, but globally outperforms all other variants.

We also evaluated the kernel version of MCML with an RBF kernel (denoted by KMCML) (footnote 7). Since all methods allow projections to lower dimensions, we compared performance for different projection dimensions (footnote 8). The out-of-sample performance results (based on 40 random splits of the data, taking 70% for training and 30% for testing (footnote 9)) are shown in Figure 1. It can be seen that, when used in a simple nearest-neighbour classifier, the metric learned by MCML almost always performs as well as, or significantly better than, those learned by all other methods, across most dimensions. Furthermore, the kernel version of MCML outperforms the linear one on most datasets.

Footnote 6: Where V is orthonormal and the eigenvalues in Λ are sorted in decreasing order.
Footnote 7: The regularization parameter and the width of the RBF kernel were chosen using 5-fold cross-validation. KMCML was only evaluated for datasets with less than 1000 training points.
Footnote 8: To obtain low dimensional mappings we used the approach outlined in Section 3.
Footnote 9: Except for the larger datasets, where 1000 random samples were used for training.

Figure 1 (panels: Wine, Balance, Ion, Soybean-small, Protein, Spam, Yale7, Housing, Digits; x-axis: projection dimension, y-axis: error rate): Classification error rate on several UCI datasets, USPS digits and YALE faces, for different projection dimensions. Algorithms are our Maximally Collapsing Metric Learning (MCML), Xing et al. [10], PCA with whitening transformation (PCAW) and Fisher's Discriminant Analysis (LDA). Standard errors of the means shown on curves. No results given for XING on YALE and KMCML on Digits and Spam due to the data size.

5.1 Comparison to non convex procedures

The methods in the previous comparison are all well defined, in the sense that they are not susceptible to local minima in the optimization. They also have the added advantage of obtaining projections to all dimensions using one optimization run. Below, we also compare the MCML results to the results of two non-convex procedures. The first is the non-convex variant of MCML (NMCML): the objective function of MCML can be optimized w.r.t. the projection matrix W, where A = WT W. Although this is no longer a convex problem, it is not constrained and is thus easier to optimize. The second non-convex method is Neighbourhood Components Analysis (NCA) [5], which attempts to directly minimize the error incurred by a nearest neighbor classifier. For both methods we optimized the matrix W by restarting the optimization separately for each size of W.
Minimization was performed using a conjugate gradient algorithm, initialized by LDA or randomly. Figure 2 shows results on a subset of the UCI datasets. It can be seen that the performance of NMCML is similar to that of MCML, although it is less stable, possibly due to local minima, and both methods usually outperform NCA. The inset in each figure shows the spectrum of the MCML matrix A, revealing that it often drops quickly after a few dimensions. This illustrates the effectiveness of our two stage optimization procedure, and suggests its low dimensional solutions are close to optimal.

Figure 2 (panels: Wine, Balance, Soybean, Protein, Ion, Housing; x-axis: projection dimension, y-axis: error rate): Classification error for non convex procedures, and the MCML method. Eigen-spectra for the MCML solution are shown in the inset.

6 Discussion and Extensions

We have presented an algorithm for learning maximally collapsing metrics (MCML), based on the intuition of collapsing classes into single points. MCML assumes that each class may be collapsed to a single point, at least approximately, and thus is only suitable for unimodal class distributions (or for simply connected sets if kernelization is used). However, if points belonging to a single class appear in several disconnected clusters in input (or feature) space, it is unlikely that MCML could collapse the class into a single point. It is possible that, using a mixture of distributions, an EM-like algorithm can be constructed to accommodate this scenario. The method can also be used to learn low dimensional projections of the input space. We showed that it performs well, even across a range of projection dimensions, and consistently outperforms existing methods. Finally, we have shown how the method can be extended to projections in high dimensional feature spaces using the kernel trick. The resulting nonlinear method was shown to improve classification results over the linear version.

References

[1] N. Alon and P. Pudlak. Equilateral sets in l_p^n. Geom. Funct. Anal., 13(3), 2003.
[2] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In ECCV (1), 1996.
[3] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21(2):174-184, 1976.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004.
[5] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.
[6] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[7] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002.
[8] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2001.
[9] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proc. of ECCV, 2002.
[10] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2004.

5 0.46089786 189 nips-2005-Tensor Subspace Analysis

Author: Xiaofei He, Deng Cai, Partha Niyogi

Abstract: Previous work has demonstrated that the image variations of many objects (human faces in particular) under variable lighting can be effectively modeled by low dimensional linear spaces. The typical linear subspace learning algorithms include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locality Preserving Projection (LPP). All of these methods consider an n1 × n2 image as a high dimensional vector in Rn1 ×n2 , while an image represented in the plane is intrinsically a matrix. In this paper, we propose a new algorithm called Tensor Subspace Analysis (TSA). TSA considers an image as the second order tensor in Rn1 ⊗ Rn2 , where Rn1 and Rn2 are two vector spaces. The relationship between the column vectors of the image matrix and that between the row vectors can be naturally characterized by TSA. TSA detects the intrinsic local geometrical structure of the tensor space by learning a lower dimensional tensor subspace. We compare our proposed approach with PCA, LDA and LPP methods on two standard databases. Experimental results demonstrate that TSA achieves better recognition rate, while being much more efficient. 1
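As a side note, the tensor (matrix) representation described in this abstract boils down to bilinear projections of the image matrix rather than projections of a flattened vector. The sketch below is a generic illustration of that representational idea only; the projection matrices are random placeholders, not the ones TSA would actually learn, and the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 32, 24          # image height and width (arbitrary)
l1, l2 = 4, 3            # target tensor-subspace dimensions (arbitrary)

X = rng.normal(size=(n1, n2))     # an image kept as a matrix, not flattened
U = rng.normal(size=(n1, l1))     # column-space projection (placeholder)
V = rng.normal(size=(n2, l2))     # row-space projection (placeholder)

Y = U.T @ X @ V                   # bilinear map into an l1 x l2 representation
print(Y.shape)                    # (4, 3): l1*l2 features vs. n1*n2 for a flattened vector
```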

6 0.32414138 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

7 0.30351767 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

8 0.30245519 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms

9 0.29159066 48 nips-2005-Context as Filtering

10 0.28808692 52 nips-2005-Correlated Topic Models

11 0.28258306 58 nips-2005-Divergences, surrogate loss functions and experimental design

12 0.27155852 50 nips-2005-Convex Neural Networks

13 0.26398641 15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels

14 0.25974643 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

15 0.253901 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

16 0.25068533 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

17 0.2358602 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

18 0.23188852 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

19 0.22539638 159 nips-2005-Q-Clustering

20 0.22240154 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.058), (10, 0.04), (14, 0.365), (27, 0.018), (31, 0.014), (34, 0.12), (55, 0.026), (69, 0.059), (73, 0.03), (88, 0.084), (91, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75693291 166 nips-2005-Robust Fisher Discriminant Analysis

Author: Seung-jean Kim, Alessandro Magnani, Stephen Boyd

Abstract: Fisher linear discriminant analysis (LDA) can be sensitive to the problem data. Robust Fisher LDA can systematically alleviate the sensitivity problem by explicitly incorporating a model of data uncertainty in a classification problem and optimizing for the worst-case scenario under this model. The main contribution of this paper is show that with general convex uncertainty models on the problem data, robust Fisher LDA can be carried out using convex optimization. For a certain type of product form uncertainty model, robust Fisher LDA can be carried out at a cost comparable to standard Fisher LDA. The method is demonstrated with some numerical examples. Finally, we show how to extend these results to robust kernel Fisher discriminant analysis, i.e., robust Fisher LDA in a high dimensional feature space. 1

2 0.5537408 177 nips-2005-Size Regularized Cut for Data Clustering

Author: Yixin Chen, Ya Zhang, Xiang Ji

Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1

3 0.44121104 114 nips-2005-Learning Rankings via Convex Hull Separation

Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram

Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
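Since this abstract turns on the generalized Wilcoxon-Mann-Whitney statistic, a small sketch may help: for the binary special case it mentions, the WMW statistic of a scoring function is the fraction of correctly ordered (positive, negative) pairs, i.e. the area under the ROC curve. The function below is a generic illustration with made-up scores, not the paper's algorithm.

```python
import numpy as np

def wilcoxon_mann_whitney(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
    Equals the area under the ROC curve in the binary case."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Toy usage with hypothetical classifier scores
pos = np.array([0.9, 0.8, 0.4])
neg = np.array([0.7, 0.3, 0.2, 0.1])
print(wilcoxon_mann_whitney(pos, neg))   # 11 of the 12 pairs are ordered correctly
```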

4 0.43940589 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

5 0.43852413 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

6 0.43552965 105 nips-2005-Large-Scale Multiclass Transduction

7 0.43425322 160 nips-2005-Query by Committee Made Real

8 0.43306094 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

9 0.43019947 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

10 0.4297272 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

11 0.42932555 184 nips-2005-Structured Prediction via the Extragradient Method

12 0.42739737 77 nips-2005-From Lasso regression to Feature vector machine

13 0.42623401 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

14 0.42612928 58 nips-2005-Divergences, surrogate loss functions and experimental design

15 0.42535314 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

16 0.42534646 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

17 0.42490986 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

18 0.42443523 195 nips-2005-Transfer learning for text classification

19 0.42424467 47 nips-2005-Consistency of one-class SVM and related algorithms

20 0.42423046 54 nips-2005-Data-Driven Online to Batch Conversions