nips nips2005 nips2005-123 knowledge-graph by maker-knowledge-mining

123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables


Source: pdf

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. [sent-7, score-0.127]

2 Recent progress in machine learning has mainly focused on supervised classification of such structured variables. [sent-8, score-0.571]

3 In this paper, we investigate structured classification in a semi-supervised setting. [sent-9, score-0.483]

4 We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. [sent-10, score-1.077]

5 Unlike transductive algorithms, our formulation naturally extends to new test points. [sent-11, score-0.25]

6 Many real-world classification problems, on the other hand, involve sequential or structural dependencies between multiple labels. [sent-14, score-0.129]

7 For example, labeling the words in a sentence with their part-of-speech tags involves a sequential dependency between the tags; finding the parse tree of a sentence involves a structural dependency among the labels in the parse tree. [sent-15, score-0.579]

8 Recently, there has been a growing interest in generalizing kernel methods to predict structured and inter-dependent variables in a supervised learning setting, such as dual perceptron [7], SVMs [2, 15, 14] and kernel logistic regression [1, 11]. [sent-16, score-0.883]

9 In this paper, we investigate classification of structured objects in a semi-supervised setting. [sent-18, score-0.483]

10 The goal of semi-supervised learning is to improve learning from a small sample of labeled inputs by exploiting a large sample of unlabeled data. [sent-19, score-0.72]

11 This idea has recently attracted a considerable amount of interest due to the ubiquity of unlabeled data. [sent-20, score-0.322]

12 In many applications, from data mining to speech recognition, it is easy to produce large amounts of unlabeled data, while labeling is often manual and expensive. [sent-21, score-0.401]

13 This is also the case for many structured classification problems. [sent-22, score-0.483]

14 the labels of two inputs x and x′ are likely to be the same if x and x′ are similar. [sent-26, score-0.109]

15 The unlabeled points reveal the intrinsic structure, which is then utilized by the classification algorithm. [sent-28, score-0.396]

16 A discriminative approach to semi-supervised learning was developed by Belkin, Sindhwani and Niyogi [3, 13], where the Laplacian operator associated with unlabeled data is used as an additional penalty (regularizer) on the space of functions in a Reproducing Kernel Hilbert Space. [sent-29, score-0.446]

17 The additional regularization from the unlabeled data can be represented as a new kernel — a “graph regularized” kernel. [sent-30, score-0.478]

18 In this paper, building on [3, 13], we present a discriminative semi-supervised learning formulation for problems that involve structured and inter-dependent outputs and give experimental results on max-margin semi-supervised structured classification using graph-regularized kernels. [sent-31, score-1.196]

19 The solution of the optimization problem that utilizes both labeled and unlabeled data is a linear combination of the graph regularized kernel evaluated at the parts of the labeled inputs only, leading to a large reduction in the number of parameters. [sent-32, score-1.312]

20 It is important to note that our classification function is defined on all input points whereas some previous work is only defined for the input points in the (labeled and unlabeled) training sample, as they use standard graph kernels, which are restricted to in-sample data points by definition. [sent-33, score-0.197]

21 There is an extensive literature on semi-supervised learning and a growing number of studies on learning structured and inter-dependent variables. [sent-34, score-0.567]

22 [8] propose a semi-supervised learning method for standard classification that extends to out-of-sample points. [sent-37, score-0.081]

23 [5] is one of the first studies investigating the semi-supervised structured learning problem in a discriminative framework. [sent-40, score-0.607]

24 The most relevant previous work is the transductive structured learning proposed by Lafferty et al. [sent-41, score-0.704]

25 2 Supervised Learning for Structured Variables In structured learning, the goal is to learn a mapping h : X → Y from structured inputs to structured response values, where the inputs and response values form a dependency structure. [sent-44, score-1.643]

26 We denote the set of feasible input-output pairs by Z ⊆ X × Y. [sent-47, score-0.161]

27 It is common to construct a discriminant function F : Z → ℝ which maps the feasible input-output pairs to a compatibility score of the pair. [sent-48, score-0.232]

28 To make a prediction for x, this score is maximized over the set of feasible outputs, h(x) = argmax_{y ∈ Y(x)} F(x, y). [sent-49, score-0.137]

29 In Markov random fields, x is a graph, y is a labeling of the nodes of x, and a local fragment (a part) of ⟨x, y⟩ is a clique in x and its labeling. [sent-51, score-0.294]

30 In parsing with probabilistic context-free grammars, a local fragment (a part) of ⟨x, y⟩ consists of a branch of the tree y, where a branch is an internal node in y together with its children, plus all pairs of a leaf node in y with the word in x labeled by that node. [sent-52, score-0.447]

31 Note that a given branch structure, such as NP → Det N, can occur more than once in a given parse tree. [sent-53, score-0.115]

32 We assume a "counting function", c, such that for p ∈ P and ⟨x, y⟩ ∈ Z, c(p, ⟨x, y⟩) gives the number of times that the part p occurs in the pair ⟨x, y⟩ (the count of p in ⟨x, y⟩). [sent-55, score-0.099]

33 For a Mercer kernel k : P × P → ℝ on P, there is an associated RKHS Hk of functions f : P → ℝ, where f measures the goodness of a part p. [sent-56, score-0.234]

34 . . . with xi ∈ Γ, and we take Y(x) to be the set of all sequences y1 , . . . [sent-63, score-0.136]

35 We can take P to be the set of all pairs ⟨s, s′⟩ plus all pairs ⟨s, u⟩ with s, s′ ∈ Σ and u ∈ Γ. [sent-67, score-0.188]

36 Note that in this example there are two types of parts — pairs of hidden states and pairs of a hidden state and an observation. [sent-70, score-0.405]
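
To make the part-based chain setup concrete, the following is a minimal Python sketch (not the paper's implementation): a labeled sequence is decomposed into hidden-state pairs and state–observation pairs, a pair ⟨x, y⟩ is scored by summing part scores, and prediction is an exhaustive argmax over a small candidate set. Here `parts_of` and `part_score` are illustrative stand-ins for the counting function c and the learned goodness function f.

```python
# Minimal sketch of part-based scoring and argmax decoding for a chain (illustrative only).
from collections import Counter

def parts_of(x, y):
    """Decompose a labeled chain <x, y> into its local parts:
    hidden-state transition pairs and state-observation emission pairs."""
    parts = Counter()
    for i, (obs, state) in enumerate(zip(x, y)):
        parts[("emit", state, obs)] += 1            # pair <s, u>, s in Sigma, u in Gamma
        if i > 0:
            parts[("trans", y[i - 1], state)] += 1  # pair <s, s'>, s, s' in Sigma
    return parts

def discriminant(x, y, part_score):
    """F(x, y) = sum_p c(p, <x, y>) * f(p)."""
    return sum(count * part_score(p) for p, count in parts_of(x, y).items())

def predict(x, candidate_ys, part_score):
    """h(x) = argmax_{y in Y(x)} F(x, y), here by exhaustive search over a small candidate set."""
    return max(candidate_ys, key=lambda y: discriminant(x, y, part_score))
```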

37 In the supervised learning scenario, we are given a sample S of pairs (⟨x1 , y1 ⟩, . . . [sent-72, score-0.222]

38 The goal is to learn a function f on the local parts P with small expected loss EP [L(x, y, f )] where L is a prescribed loss function. [sent-79, score-0.395]

39 This is commonly realized by learning the f that minimizes the regularized loss functional f∗ = argmin_{f ∈ Hk} Σ_{i=1}^{l} L(xi , yi , f) + λ‖f‖²_k , (5) where . . . [sent-80, score-0.435]

40 A variety of loss functions L have been considered in the literature. [sent-82, score-0.107]

41 Let P(x) ⊆ P be the set of parts having nonzero count in some pair ⟨x, y⟩ for y ∈ Y(x). [sent-86, score-0.214]

42 Definition: A loss L is local if L(x, y, f) is determined by the value of f on the set P(x), i.e. . . . [sent-89, score-0.139]

43 For any local loss function L and sample S there exist weights αp for p ∈ P(S) such that f ∗ as defined by (5) can be written as follows. [sent-93, score-0.179]

44 f∗(p) = Σ_{p′ ∈ P(S)} αp′ k(p′, p) (7) Thus, even though the set of feasible outputs for x generally scales exponentially with the size of the output, the solution can be represented in terms of the parts of the sample, whose number commonly scales polynomially. [sent-94, score-0.288]
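
A short sketch of how the expansion in (7) could be evaluated at an arbitrary part; the RBF part kernel and the weights α below are illustrative placeholders, not quantities from the paper.

```python
# Sketch of the representer expansion f*(p) = sum_{p' in P(S)} alpha_{p'} k(p', p).
import numpy as np

def part_kernel(p, q, gamma=1.0):
    """A Mercer kernel on vector-encoded parts; an RBF kernel is chosen only for illustration."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.exp(-gamma * np.sum((p - q) ** 2)))

def f_star(p, sample_parts, alpha, gamma=1.0):
    """Evaluate the learned part-scoring function at part p via the kernel expansion."""
    return sum(a * part_kernel(q, p, gamma) for q, a in zip(sample_parts, alpha))
```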

45 This is true for any loss function that partitions into parts, which is the case for the loss functions discussed above. [sent-95, score-0.214]

46 3 A Semi-Supervised Learning Approach to Structured Variables In semi-supervised learning, we are given a sample S consisting of l input-output pairs {(x1 , y1 ), . . . [sent-96, score-0.134]

47 from the probability distribution P on Z and u unlabeled input patterns {xl+1 , . . . [sent-102, score-0.322]

48 . . . , xl+u } and let Z(S) be the set of all pairs ⟨x, y⟩ with x ∈ X(S) and y ∈ Y(x). [sent-111, score-0.094]

49 If the true classification function is smooth wrt the underlying marginal distribution, one can utilize unlabeled data points to favor functions that are smooth in this sense. [sent-112, score-0.55]

50 [13] prove that the minimizer of (8) is in the span of a new kernel function (details below) evaluated at labeled data only. [sent-118, score-0.353]

51 Here, we generalize this framework to structured variables and give a simplified derivation of the new kernel. [sent-119, score-0.483]

52 The smoothness assumption in the structured setting states that f should be smooth on the underlying density on the parts P; thus we enforce f to assign similar goodness scores to two parts p and p′ if p and p′ are similar, for all parts of Z(S). [sent-120, score-1.01]

53 Let P(S) be the union of all sets P(z) for z ∈ Z(S) and let W be a symmetric matrix where Wp,p′ represents the similarity of p and p′ for p, p′ ∈ P(S). [sent-121, score-0.099]

54 Note that the last term depends only on the value of f on the parts in the set P(S). [sent-124, score-0.149]
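
For concreteness, a small sketch of this graph-Laplacian smoothness penalty over the parts in P(S); it assumes the similarity matrix W and the vector of values of f on P(S) have already been computed.

```python
# Sketch of the extra smoothness regularizer: f^T L f with the combinatorial graph Laplacian
# L = D - W, which equals 0.5 * sum_{p, p'} W_{p,p'} (f(p) - f(p'))^2.
import numpy as np

def laplacian_penalty(W, f_vals):
    W = np.asarray(W, dtype=float)
    f_vals = np.asarray(f_vals, dtype=float)
    L = np.diag(W.sum(axis=1)) - W   # graph Laplacian over the parts in P(S)
    return float(f_vals @ L @ f_vals)
```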

55 Then, for any local loss L(x, y, f ), we immediately have the following Representer Theorem for the semi-supervised structured case where S includes the labeled and the unlabeled data. [sent-125, score-1.109]

56 Note that (11) applies to any local loss function, and if L(x, y, f) is convex in f, as is the case for the logistic or hinge loss, then (11) is convex in α. [sent-127, score-0.252]

57 We now have a loss function over labeled data regularized by the L2 norm (wrt the inner product Q), for which we can re-invoke the Representer Theorem. [sent-128, score-0.334]

58 . . . , xl }, Z(S′) be the set of all pairs ⟨x, y⟩ with x ∈ X(S′) and y ∈ Y(x), and P(S′) be the set of all parts having nonzero count for some pair in Z(S′). [sent-132, score-0.308]

59 Decomposing α = Σ_{p ∈ P(S′)} βp γp + α⊥, the component α⊥ can only increase the quadratic term in the optimization problem. [sent-135, score-0.08]

60 Since γp^⊤ Q α⊥ = 0, we conclude that the optimal solution to (11) is given by α∗ = Σ_{p ∈ P(S′)} βp γp = βKQ−1 , (12) where β is required to be sparse, such that only the entries corresponding to parts from the labeled data are nonzero. [sent-137, score-0.314]

61 Plugging this into the original equations we get k̃(p, p′) = kp^⊤ Q−1 kp′ (13), fβ(p′) = Σ_{p ∈ P(S′)} βp k̃(p, p′) (14), and β∗ = argmin_β L(S′, fβ) + β^⊤ K̃β (15), where kp is the vector of k(p, p′) for all p′ ∈ P(S) and K̃ is the matrix of k̃(p, p′); k̃ is the same as in [13]. [sent-138, score-0.474]

62 We call k̃ the graph-regularized kernel, in which unlabeled data points are used to augment the base kernel k wrt the standard graph kernel to take the underlying density on parts into account. [sent-140, score-1.009]

63 This kernel is defined over the complete part space, whereas standard graph kernels are restricted to P(S) only. [sent-141, score-0.349]

64 Given the graph-regularized kernel, the semi-supervised structured learning problem is reduced to supervised structured learning. [sent-142, score-1.054]

65 Since in semi-supervised learning problems labeled data points are, in general, far fewer than unlabeled ones, the dimensionality of the optimization problem is greatly reduced by this reduction. [sent-143, score-0.61]
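
A hedged sketch of computing the matrix of k̃(p, p′) = kp^⊤ Q−1 kp′ over the parts: the exact Q is defined in the paper but not reproduced in the extracted sentences above, so the form Q = λK + γKLK used below is an assumption chosen to combine the RKHS norm with the Laplacian penalty.

```python
# Sketch of the graph-regularized kernel K~ over P(S); Q = lam*K + gamma*K L K is an assumed form.
import numpy as np

def graph_regularized_kernel(K, L, lam=1.0, gamma=1.0):
    """K: base kernel matrix over all parts P(S); L: graph Laplacian over P(S).
    Returns the matrix whose (i, j) entry is k_i^T Q^{-1} k_j."""
    Q = lam * K + gamma * (K @ L @ K)
    Q_inv = np.linalg.pinv(Q)        # pseudo-inverse for numerical safety
    return K @ Q_inv @ K

def restrict_to_labeled(K_tilde, labeled_part_idx):
    """Only rows/columns of parts from the labeled inputs enter the final expansion."""
    idx = np.asarray(labeled_part_idx)
    return K_tilde[np.ix_(idx, idx)]
```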

66 4 Structured Max-Margin Learning We now investigate optimizing the hinge loss as defined by (6) using the graph-regularized kernel k̃. [sent-144, score-0.362]

67 Defining γ^{x,y} to be the vector where γp^{x,y} = c(p, ⟨x, y⟩) is the count of p in ⟨x, y⟩, the linear discriminant can be written in matrix notation for x ∈ S as F_{fβ}(x, y) = β^⊤ K̃ γ^{x,y}. [sent-145, score-0.175]

68 Then, the optimization problem for margin maximization is β∗ = argmin_β min_ξ Σ_{i=1}^{l} ξi + β^⊤ K̃β, subject to ξi ≥ max_{ŷ ∈ Y(xi)} [ Δ(ŷ, yi) − β^⊤ K̃ (γ^{xi,yi} − γ^{xi,ŷ}) ] ∀i ≤ l. [sent-146, score-0.277]
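
A rough sketch of evaluating the margin-rescaled slack of one labeled example; exhaustive enumeration over a small candidate set stands in for proper loss-augmented inference (e.g. Viterbi on a chain), and a Hamming loss is assumed for Δ.

```python
# Sketch: xi_i = max_{y_hat} [ Delta(y_hat, y_i) - beta^T K~ (gamma^{x_i,y_i} - gamma^{x_i,y_hat}) ].
import numpy as np

def hamming(y_hat, y_true):
    return sum(a != b for a, b in zip(y_hat, y_true))

def slack(beta, K_tilde, gamma_of, x, y_true, candidate_ys):
    """gamma_of(x, y) must return the part-count vector over P(S) for the pair <x, y>."""
    g_true = gamma_of(x, y_true)
    worst = 0.0                      # y_hat = y_true gives loss 0 and margin 0, so xi_i >= 0
    for y_hat in candidate_ys:
        margin = float(beta @ K_tilde @ (g_true - gamma_of(x, y_hat)))
        worst = max(worst, hamming(y_hat, y_true) - margin)
    return worst
```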

69 5 Semi-Supervised vs Transductive Learning Since one major contribution of this paper is learning a classifier for structured objects that is defined over the complete part space P, we now examine the differences between semi-supervised and transductive learning in more detail. [sent-151, score-0.78]

70 The most common approach to realize the smoothness assumption is to construct a data dependent kernel kS derived from the graph Laplacian on a nearest neighbor graph on the labeled and unlabeled input patterns in the sample S. [sent-152, score-0.924]

71 Given kS , one can construct a function f̃∗ on S as f̃∗ = argmin_{f ∈ HkS} Σ_{i=1}^{l} L(xi , yi , f) + λ‖f‖²_{kS} . [sent-154, score-0.191]

72 This observation in the transductive setting leads to the following optimization problem, when the kernel of the optimization problem is taken to be a linear combination of a graph kernel kS and a standard kernel k restricted to P(S). [sent-156, score-0.846]

73 f̄∗ = argmin_{f ∈ H(µ1 k+µ2 kS)} Σ_{i=1}^{l} L(xi , yi , f) + λ‖f‖²_{(µ1 k+µ2 kS)} (18) A structured semi-supervised algorithm based on (18) has been evaluated in [11]. [sent-157, score-0.706]

74 The kernel in (18) is the weighted mean of k and kS , whereas the graph-regularized kernel, resulting from a weighted mean of two regularizers, is the harmonic mean of k and kS [16]. [sent-158, score-0.156]
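
A schematic way to see this distinction, assuming invertible Gram matrices (an illustration of the general principle, not the paper's exact operator identity): summing two regularizers sums the inverse kernels, so the resulting effective kernel is a harmonic-type mean rather than an arithmetic one.

```latex
\begin{align*}
\bar{k} &\propto \mu_1 k + \mu_2 k_S
  && \text{kernel in (18): weighted (arithmetic) mean of the two kernels}\\
\tilde{K} &\propto \bigl(\mu_1 K^{-1} + \mu_2 K_S^{-1}\bigr)^{-1}
  && \text{weighted mean of regularizers} \;\Rightarrow\; \text{harmonic mean of kernels}
\end{align*}
```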

75 An important distinction between f̄∗ and f∗ in (8), the optimization performed in this paper, is that f̄∗ is only defined on P(S) (only on observations in the training data) while f∗ is defined on all of P and can be used for novel (out-of-sample) inputs x. [sent-159, score-0.118]

76 The out-of-sample extension problem is already a serious limitation for transductive learning, but it is even more severe in the structured case, where parts of P can be composed of multiple observation tokens. [sent-161, score-0.811]

77 6 Experiments Similarity Graph: We build the similarity matrix W over P(S) using a K-nearest-neighbor relationship. [sent-162, score-0.099]

78 Otherwise, the similarity is given by a heat kernel. [sent-164, score-0.111]

79 In our applications the structure is a simple chain; therefore the cliques involve single observation–label pairs, Wp,p′ = δ(y(up ), y(up′ )) exp(−‖up − up′ ‖²/t), (19) where up denotes the observation part of p and y(u) denotes the labeling of u.¹ [sent-165, score-0.175]
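
A possible implementation sketch of (19): parts whose labels differ get similarity zero, and otherwise a heat kernel on the observation vectors, restricted to the nearest neighbours; the parameters t and n_neighbors are placeholders, not the values used in the experiments.

```python
# Sketch of building the part similarity matrix W for a chain model (Eq. 19), illustrative only.
import numpy as np

def build_similarity(obs, labels, n_neighbors=5, t=1.0):
    """obs: (n_parts, d) observation vectors u_p; labels: length-n_parts array of labels y(u_p)."""
    obs = np.asarray(obs, dtype=float)
    labels = np.asarray(labels)
    d2 = ((obs[:, None, :] - obs[None, :, :]) ** 2).sum(-1)        # squared distances
    W = np.exp(-d2 / t) * (labels[:, None] == labels[None, :])     # heat kernel * label delta
    mask = np.zeros_like(W, dtype=bool)                            # keep nearest neighbours only
    for i in range(len(obs)):
        mask[i, np.argsort(-W[i])[: n_neighbors + 1]] = True       # +1 leaves room for self
    W = W * (mask | mask.T)
    np.fill_diagonal(W, 0.0)
    return W
```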

80 Applications: We performed experiments using a simple chain model for pitch accent (PA) prediction and OCR. [sent-167, score-0.21]

81 We ran experiments comparing the semi-supervised structured model (referred to as STR) with baselines, varying the number of unlabeled sequences (U:0, U:80, U:200); per-label accuracies are given in Table 1. [sent-170, score-0.483]

82 For STR, we used an RBF kernel as the base kernel ko in (4). (Table 1: Per-label accuracy for Pitch Accent.) [sent-187, score-0.399]

83 We chose the width of the RBF kernel by cross-validation on SVM and used the same value for STR. [sent-189, score-0.156]

84 We report the average results of experiments with 5 random selections of labeled sequences in Tables 1 and 2, with 4 labeled sequences on the left side of Table 1, 40 on the right side, and 10 in Table 2. [sent-191, score-0.46]

85 We varied the number of unlabeled sequences and reported the per-label accuracy of test sequences (on top of each cell) and of unlabeled sequences (bottom) (when U > 0). [sent-192, score-0.876]

86 The results on pitch accent prediction show the advantage of a sequence model over a non-structured one. ¹For more complicated parts, different measures can apply. [sent-193, score-0.165]

87 For example, in sequence classification, if the classifier is evaluated wrt the correctly classified individual labels in the sequence, W can be set such that [sent-194, score-0.165]

88 Wp,p′ = Σ_{u∈p, u′∈p′} δ(y(u), y(u′)) s̃(u, u′), where s̃ denotes some similarity measure such as the heat kernel. [sent-196, score-0.142]

89 If the evaluation is over segments of the sequence, the similarity can be Wp,p′ = δ(y(p), y(p′)) Σ_{u∈p, u′∈p′} s̃(u, u′), where y(p) denotes all the label nodes in the part p. [sent-197, score-0.126]

90 We also observe the usefulness of unlabeled data in both the structured and unstructured models: as U increases, so does the accuracy. [sent-199, score-0.847]

91 The improvement from unlabeled data and from structured classification can be considered additive. [sent-200, score-0.805]

92 The small difference between the accuracy of in-sample unlabeled data and the test data indicates the natural extension of our framework to new data points. [sent-201, score-0.359]

93 Even though unlabeled data improves accuracy, performing sequence classification is not helpful due to the sparsity of structural information. [sent-203, score-0.381]

94 Since |Σ| = 15 and there are only 10 labeled sequences with average length 8. [sent-204, score-0.23]

95 (Table 2: OCR.) We presented a discriminative approach to semi-supervised learning of structured and inter-dependent response variables. [sent-212, score-0.607]

96 In this framework, we derived a maximum margin formulation and presented experiments for a simple chain model. [sent-213, score-0.114]

97 Our approach naturally extends to the classification of unobserved structured inputs and this is supported by our empirical results which showed similar accuracy on in-sample unlabeled data and out-of-sample test data. [sent-214, score-0.95]

98 Learning to classify text from labeled and unlabeled documents. [sent-274, score-0.487]

99 Beyond the point cloud: from transductive to semi-supervised learning. [sent-280, score-0.179]

100 Support vector machine learning for interdependent and structured output spaces. [sent-293, score-0.563]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('structured', 0.483), ('unlabeled', 0.322), ('ks', 0.224), ('argmin', 0.191), ('transductive', 0.179), ('labeled', 0.165), ('str', 0.163), ('kernel', 0.156), ('ff', 0.153), ('parts', 0.149), ('hk', 0.13), ('loss', 0.107), ('graph', 0.101), ('classi', 0.098), ('pairs', 0.094), ('sindhwani', 0.094), ('ocr', 0.093), ('wrt', 0.093), ('svm', 0.085), ('erent', 0.083), ('di', 0.082), ('discriminative', 0.082), ('kq', 0.082), ('chicago', 0.079), ('labeling', 0.079), ('altun', 0.075), ('xi', 0.071), ('belkin', 0.071), ('icml', 0.07), ('pitch', 0.069), ('kp', 0.069), ('inputs', 0.069), ('feasible', 0.067), ('sequences', 0.065), ('count', 0.065), ('accent', 0.063), ('brefeld', 0.063), ('erty', 0.063), ('rhks', 0.063), ('representer', 0.062), ('parse', 0.062), ('regularized', 0.062), ('cation', 0.061), ('similarity', 0.061), ('niyogi', 0.06), ('structural', 0.059), ('kernels', 0.058), ('dependency', 0.056), ('clique', 0.054), ('branch', 0.053), ('laplacian', 0.053), ('ko', 0.05), ('heat', 0.05), ('fragment', 0.05), ('tsochantaridis', 0.05), ('optimization', 0.049), ('hofmann', 0.046), ('tags', 0.046), ('supervised', 0.046), ('chain', 0.045), ('erence', 0.044), ('goodness', 0.044), ('dr', 0.044), ('il', 0.044), ('intrinsic', 0.042), ('learning', 0.042), ('utilizes', 0.042), ('sentence', 0.042), ('unstructured', 0.042), ('sample', 0.04), ('labels', 0.04), ('outputs', 0.039), ('neighbor', 0.039), ('extends', 0.039), ('la', 0.038), ('regularizer', 0.038), ('vector', 0.038), ('convex', 0.038), ('matrix', 0.038), ('accuracy', 0.037), ('score', 0.037), ('margin', 0.037), ('hinge', 0.037), ('smooth', 0.036), ('pa', 0.035), ('sequential', 0.035), ('involve', 0.035), ('discriminant', 0.034), ('part', 0.034), ('hidden', 0.034), ('commonly', 0.033), ('prediction', 0.033), ('points', 0.032), ('local', 0.032), ('evaluated', 0.032), ('cluster', 0.032), ('formulation', 0.032), ('denotes', 0.031), ('marginal', 0.031), ('quadratic', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

2 0.23455104 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

3 0.2074022 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

4 0.20327395 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

5 0.19622844 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

6 0.14308925 105 nips-2005-Large-Scale Multiclass Transduction

7 0.12988704 41 nips-2005-Coarse sample complexity bounds for active learning

8 0.12986307 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

9 0.12804423 114 nips-2005-Learning Rankings via Convex Hull Separation

10 0.11947069 76 nips-2005-From Batch to Transductive Online Learning

11 0.11940482 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

12 0.11439464 47 nips-2005-Consistency of one-class SVM and related algorithms

13 0.11424589 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

14 0.10977361 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

15 0.10313845 50 nips-2005-Convex Neural Networks

16 0.092525735 195 nips-2005-Transfer learning for text classification

17 0.087265827 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

18 0.085784383 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

19 0.080729395 185 nips-2005-Subsequence Kernels for Relation Extraction

20 0.07765688 58 nips-2005-Divergences, surrogate loss functions and experimental design


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.284), (1, 0.168), (2, -0.106), (3, -0.117), (4, -0.005), (5, 0.135), (6, 0.18), (7, 0.082), (8, 0.125), (9, 0.067), (10, 0.025), (11, -0.123), (12, -0.048), (13, -0.067), (14, -0.133), (15, -0.203), (16, 0.129), (17, 0.028), (18, 0.074), (19, 0.037), (20, -0.01), (21, -0.031), (22, 0.061), (23, -0.041), (24, 0.152), (25, -0.016), (26, 0.017), (27, -0.016), (28, -0.074), (29, -0.131), (30, 0.023), (31, -0.02), (32, -0.01), (33, -0.09), (34, 0.045), (35, 0.066), (36, 0.084), (37, -0.022), (38, -0.039), (39, 0.005), (40, 0.051), (41, -0.043), (42, 0.054), (43, -0.015), (44, 0.013), (45, -0.074), (46, 0.031), (47, -0.008), (48, -0.018), (49, 0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95888215 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

2 0.77962679 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

3 0.71967697 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

4 0.71808296 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

5 0.68025821 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer

Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constraint quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result and works for hundred thousands of examples or hundreds of kernels to be combined. 1

6 0.66862792 105 nips-2005-Large-Scale Multiclass Transduction

7 0.6190396 184 nips-2005-Structured Prediction via the Extragradient Method

8 0.55388093 103 nips-2005-Kernels for gene regulatory regions

9 0.55173206 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

10 0.54936242 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

11 0.51170492 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

12 0.49108449 114 nips-2005-Learning Rankings via Convex Hull Separation

13 0.4812184 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

14 0.47601649 195 nips-2005-Transfer learning for text classification

15 0.4701272 160 nips-2005-Query by Committee Made Real

16 0.46619746 185 nips-2005-Subsequence Kernels for Relation Extraction

17 0.46052587 76 nips-2005-From Batch to Transductive Online Learning

18 0.42103025 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

19 0.39805555 41 nips-2005-Coarse sample complexity bounds for active learning

20 0.38598117 47 nips-2005-Consistency of one-class SVM and related algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.052), (10, 0.024), (27, 0.04), (31, 0.042), (34, 0.158), (41, 0.017), (50, 0.296), (55, 0.031), (65, 0.017), (69, 0.029), (73, 0.048), (88, 0.105), (91, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87260413 126 nips-2005-Metric Learning by Collapsing Classes

Author: Amir Globerson, Sam T. Roweis

Abstract: We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance. 1

same-paper 2 0.83872098 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

3 0.76855642 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

Author: Frank Wood, Stefan Roth, Michael J. Black

Abstract: Probabilistic modeling of correlated neural population firing activity is central to understanding the neural code and building practical decoding algorithms. No parametric models currently exist for modeling multivariate correlated neural data and the high dimensional nature of the data makes fully non-parametric methods impractical. To address these problems we propose an energy-based model in which the joint probability of neural activity is represented using learned functions of the 1D marginal histograms of the data. The parameters of the model are learned using contrastive divergence and an optimization procedure for finding appropriate marginal directions. We evaluate the method using real data recorded from a population of motor cortical neurons. In particular, we model the joint probability of population spiking times and 2D hand position and show that the likelihood of test data under our model is significantly higher than under other models. These results suggest that our model captures correlations in the firing activity. Our rich probabilistic model of neural population activity is a step towards both measurement of the importance of correlations in neural coding and improved decoding of population activity. 1

4 0.66497725 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1

5 0.63869971 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

6 0.61165911 114 nips-2005-Learning Rankings via Convex Hull Separation

7 0.59911907 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

8 0.59335846 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

9 0.5919047 50 nips-2005-Convex Neural Networks

10 0.59114003 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

11 0.57337195 105 nips-2005-Large-Scale Multiclass Transduction

12 0.57164234 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

13 0.56625944 195 nips-2005-Transfer learning for text classification

14 0.56392884 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

15 0.56351984 177 nips-2005-Size Regularized Cut for Data Clustering

16 0.56341308 189 nips-2005-Tensor Subspace Analysis

17 0.56290257 58 nips-2005-Divergences, surrogate loss functions and experimental design

18 0.56281793 102 nips-2005-Kernelized Infomax Clustering

19 0.56254047 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

20 0.55893636 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models