nips nips2004 nips2004-127 knowledge-graph by maker-knowledge-mining

127 nips-2004-Neighbourhood Components Analysis


Source: pdf

Author: Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, Ruslan Salakhutdinov

Abstract: In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. [sent-3, score-0.084]

2 The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. [sent-4, score-0.096]

3 It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. [sent-5, score-0.128]

4 Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. [sent-6, score-0.113]

5 The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction. [sent-7, score-0.233]

6 Its appeal stems from the fact that its decision surfaces are nonlinear, there is only a single integer parameter (which is easily tuned with cross-validation), and the expected quality of predictions improves automatically as the amount of training data increases. [sent-9, score-0.045]

7 These advantages, shared by many non-parametric methods, reflect the fact that although the final classification machine has quite high capacity (since it accesses the entire reservoir of training data at test time), the trivial learning procedure rarely causes overfitting itself. [sent-10, score-0.104]

8 The first is computational, since it must store and search through the entire training set in order to classify a single test point. [sent-12, score-0.126]

9 (Storage can potentially be reduced by “editing” or “thinning” the training data; and in low dimensional input spaces, the search problem can be mitigated by employing data structures such as KD-trees or ball-trees[4]. [sent-13, score-0.144]

10 ) The second is a modeling issue: how should the distance metric used to define the “nearest” neighbours of a test point be defined? [sent-14, score-0.327]

11 In this paper, we attack both of these difficulties by learning a quadratic distance metric which optimizes the expected leave-one-out classification error on the training data when used with a stochastic neighbour selection rule. [sent-15, score-0.476]

12 Furthermore, we can force the learned distance metric to be low rank, thus substantially reducing storage and search costs at test time. [sent-16, score-0.367]

13 2 Stochastic Nearest Neighbours for Distance Metric Learning We begin with a labeled data set consisting of n real-valued input vectors x1 , . [sent-17, score-0.039]

14 We want to find a distance metric that maximizes the performance of nearest neighbour classification. [sent-24, score-0.431]

15 Ideally, we would like to optimize performance on future test data, but since we do not know the true data distribution we instead attempt to optimize leave-one-out (LOO) performance on the training data. [sent-25, score-0.127]

16 In what follows, we restrict ourselves to learning Mahalanobis (quadratic) distance metrics, which can always be represented by symmetric positive semi-definite matrices. [sent-26, score-0.084]

17 We estimate such metrics through their inverse square roots, by learning a linear transformation of the input space such that in the transformed space, KNN performs well. [sent-27, score-0.275]

18 If we denote the transformation by a matrix A we are effectively learning a metric $Q = A^\top A$ such that $d(x, y) = (x - y)^\top Q (x - y) = (Ax - Ay)^\top (Ax - Ay)$. [sent-28, score-0.284]
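
To make the parameterization concrete, here is a minimal sketch (not the authors' code; A, x and y are arbitrary placeholder values) showing that the learned quadratic distance is just a squared Euclidean distance after applying A:

```python
import numpy as np

def quadratic_distance(A, x, y):
    """d(x, y) = (x - y)^T A^T A (x - y), i.e. the squared Euclidean
    distance between Ax and Ay in the transformed space."""
    diff = A @ (x - y)
    return float(diff @ diff)

# Arbitrary illustrative values (not from the paper).
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
x = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])
print(quadratic_distance(A, x, y))  # equals (x - y)^T Q (x - y) with Q = A^T A
```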

19 The actual leave-one-out classification error of KNN is quite a discontinuous function of the transformation A, since an infinitesimal change in A may change the neighbour graph and thus affect LOO classification performance by a finite amount. [sent-29, score-0.301]

20 Instead, we adopt a more well behaved measure of nearest neighbour performance, by introducing a differentiable cost function based on stochastic (“soft”) neighbour assignments in the transformed space. [sent-30, score-0.527]

21 In particular, each point i selects another point j as its neighbour with some probability pij , and inherits its class label from the point it selects. [sent-31, score-0.446]
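
The explicit formula for these probabilities is elided in the extraction above; in NCA they are a softmax over negative squared distances in the transformed space, and the objective is the expected number of correctly classified points. A hedged NumPy sketch (function and variable names are mine, not the paper's):

```python
import numpy as np

def soft_assignments(A, X):
    """p_ij = exp(-||Ax_i - Ax_j||^2) / sum_{k != i} exp(-||Ax_i - Ax_k||^2), with p_ii = 0."""
    Z = X @ A.T                                            # transformed points, shape (n, d)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                           # a point never selects itself
    P = np.exp(-sq)
    return P / P.sum(axis=1, keepdims=True)

def nca_objective(A, X, labels):
    """f(A) = sum_i sum_{j in C_i} p_ij: expected number of points that
    inherit the correct label under stochastic neighbour selection."""
    P = soft_assignments(A, X)
    same_class = labels[:, None] == labels[None, :]        # C_i: points sharing i's label
    return float((P * same_class).sum())
```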

22 Of course, since the cost function above is not convex, some care must be taken to avoid local maxima during training. [sent-33, score-0.048]

23 However, unlike many other objective functions (where good optima are not necessarily deep but rather broad) it has been our experience that the larger we can drive f during training the better our test performance will be. [sent-34, score-0.137]

24 Notice that by learning the overall scale of A as well as the relative directions of its rows we are also effectively learning a real-valued estimate of the optimal number of neighbours (K). [sent-36, score-0.109]

25 This estimate appears as the effective perplexity of the distributions pij . [sent-37, score-0.296]

26 If the learning procedure wants to reduce the effective perplexity (consult fewer neighbours) it can scale up A uniformly; similarly, by scaling down all the entries in A it can increase the perplexity and effectively average over more neighbours during the stochastic selection. [sent-38, score-0.268]

27 Maximizing the objective function f (A) is equivalent to minimizing the L1 norm between the true class distribution (having probability one on the true class) and the stochastic class distribution induced by pij via A. [sent-39, score-0.451]

28 To speed up the gradient computation, the sums that appear in equations (5) and (7) over the data points and over the neighbours of each point can be truncated (one because we can do stochastic gradient rather than exact gradient, and the other because pij drops off quickly). [sent-42, score-0.429]
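
For concreteness, here is a sketch of plain full-batch gradient ascent on f(A). The gradient expression below is the standard NCA gradient (what equations (5) and (7) appear to refer to, as best I can tell from context); the truncation and stochastic-gradient speedups described above are omitted, and all names, step sizes and iteration counts are illustrative only:

```python
import numpy as np

def nca_gradient(A, X, labels):
    """df/dA = 2 A sum_i ( p_i sum_k p_ik x_ik x_ik^T - sum_{j in C_i} p_ij x_ij x_ij^T ),
    where x_ij = x_i - x_j and p_i = sum_{j in C_i} p_ij."""
    n, D = X.shape
    Z = X @ A.T
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)
    P = np.exp(-sq)
    P /= P.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    G = np.zeros((D, D))
    for i in range(n):
        Xdiff = X[i] - X                                    # rows are x_ij = x_i - x_j
        weighted_all = (Xdiff.T * P[i]) @ Xdiff             # sum_k p_ik x_ik x_ik^T
        weighted_same = (Xdiff.T * (P[i] * same[i])) @ Xdiff
        p_i = float((P[i] * same[i]).sum())
        G += p_i * weighted_all - weighted_same
    return 2 * A @ G

def nca_fit(X, labels, d, iters=200, lr=0.01, seed=0):
    """Gradient *ascent* on f(A) from a small random d x D starting matrix."""
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((d, X.shape[1]))
    for _ in range(iters):
        A += lr * nca_gradient(A, X, labels)
    return A
```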

29 3 Low Rank Distance Metrics and Nonsquare Projection Often it is useful to reduce the dimensionality of input data, either for computational savings or for regularization of a subsequent learning algorithm. [sent-43, score-0.076]

30 Linear dimensionality reduction techniques (which apply a linear operator to the original data in order to arrive at the reduced representation) are popular because they are both fast and themselves relatively immune to overfitting. [sent-44, score-0.17]

31 By restricting A to be a nonsquare matrix of size d×D, NCA can also do linear dimensionality reduction. [sent-47, score-0.241]

32 In this case, the learned metric will be low rank, and the transformed inputs will lie in Rd . [sent-48, score-0.223]

33 (Since the transformation is linear, without loss of generality we only consider the case d ≤ D. [sent-49, score-0.132]

34 ) By making such a restriction, we can potentially reap many further benefits beyond the already convenient method for learning a KNN distance metric. [sent-50, score-0.084]

35 In particular, by choosing d ≪ D we can vastly reduce the storage and search-time requirements of KNN. [sent-51, score-0.045]

36 Selecting d = 2 or d = 3 we can also compute useful low dimensional visualizations on labeled datasets, using only a linear projection. [sent-52, score-0.122]

37 The algorithm is exactly the same: optimize the cost function (3) using gradient descent on a nonsquare A. [sent-53, score-0.215]

38 Our method requires no matrix inversions and assumes no parametric model (Gaussian or otherwise) for the class distributions or the boundaries between them. [sent-54, score-0.139]

39 For now, the dimensionality of the reduced representation (the number of rows in A) must be set by the user. [sent-55, score-0.1]

40 By using a highly rectangular A so that d ≪ D, we can significantly reduce the computational load of KNN at the expense of restricting the allowable metrics to be those of rank at most d. [sent-56, score-0.187]

41 To achieve this, we apply the NCA learning algorithm to find the optimal transformation A, and then we store only the projections of the training points yn = Axn (as well as their labels). [sent-57, score-0.263]

42 At test time, we classify a new point xtest by first computing its projection ytest = Axtest and then doing KNN classification on ytest using the yn and a simple Euclidean metric. [sent-58, score-0.231]

43 If d is relatively small (say less than 10), we can preprocess the yn by building a KD-tree or a ball-tree to further increase the speed of search at test time. [sent-59, score-0.087]
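
A sketch of this test-time pipeline, using scikit-learn's KD-tree-backed nearest-neighbour classifier (the matrix A is assumed to come from an NCA optimization such as the nca_fit sketch above; none of this is the authors' code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def build_projected_knn(A, X_train, y_train, k=1):
    """Store only the d-dimensional projections y_n = A x_n (plus labels);
    a KD-tree makes search fast when d is small."""
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")
    knn.fit(X_train @ A.T, y_train)
    return knn

def classify(knn, A, X_test):
    """Project test points with the same A, then do plain Euclidean KNN in R^d."""
    return knn.predict(X_test @ A.T)
```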

44 The storage requirements of this method are O(dN) + Dd compared with O(DN) for KNN in the original input space. [sent-60, score-0.045]

45 4 Experiments in Metric Learning and Dimensionality Reduction We have evaluated the NCA algorithm against standard distance metrics for KNN and other methods for linear dimensionality reduction. [sent-61, score-0.264]

46 We have also investigated the use of linear dimensionality reduction using NCA (with nonsquare A) for visualization as well as reduced-complexity classification on several datasets. [sent-66, score-0.301]

47 First, we generated a synthetic three-dimensional dataset (shown in the top row of figure 2) which consists of 5 classes (shown by different colors). [sent-68, score-0.099]

48 In two dimensions, the classes are distributed in concentric circles, while the third dimension is just Gaussian noise, uncorrelated with the other dimensions or the class label. [sent-69, score-0.198]
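
A rough generator for a dataset of this shape (a sketch only: the ring radii, noise level and points per class are guesses, since the text does not give them):

```python
import numpy as np

def make_concentric_rings(n_per_class=100, n_classes=5, noise_std=2.0, seed=0):
    """3-D synthetic data: classes are concentric rings in the first two
    dimensions; the third dimension is class-independent Gaussian noise."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_classes):
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_class)
        radius = (c + 1.0) + 0.1 * rng.standard_normal(n_per_class)
        ring = np.column_stack([radius * np.cos(theta),
                                radius * np.sin(theta),
                                noise_std * rng.standard_normal(n_per_class)])
        X.append(ring)
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)
```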

49 If the noise variance is large enough, the projection found by PCA is forced to include the noise (as shown on the top left of figure 2). [sent-70, score-0.07]

50 (A full rank Euclidean metric would also be misled by this dimension. [sent-71, score-0.201]

51 ) The classes are not convex and cannot be linearly separated, hence the results obtained from LDA will be inappropriate (as shown in figure 2). [sent-72, score-0.085]

52 In contrast, NCA adaptively finds the best projection without assuming any parametric structure in the low dimensional representation. [sent-73, score-0.149]
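
The comparison described here can be reproduced in outline with off-the-shelf PCA and LDA alongside the NCA sketches above (this reuses the hypothetical make_concentric_rings and nca_fit helpers defined earlier, and is only an illustration of the experimental setup, not the authors' script):

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_concentric_rings()                     # synthetic 3-D, 5-class data
Z_pca = PCA(n_components=2).fit_transform(X)       # unsupervised: keeps the noisy axis if its variance is large
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # Gaussian / linear-boundary assumptions
A = nca_fit(X, y, d=2)                             # non-parametric, optimized for KNN
Z_nca = X @ A.T
```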

53 We have also applied NCA to the UCI “wine” dataset, which consists of 178 points labeled into 3 classes, and to a database of gray-scale images of faces consisting of 18 classes (each a separate individual) and 560 dimensions (image size is 20 × 28). [sent-74, score-0.323]

54 The face dataset consists of 1800 images (100 for each person). [sent-75, score-0.113]

55 Finally, we applied our algorithm to a subset of the USPS dataset of handwritten digit images, consisting of the first five digit classes (“one” through “five”). [sent-76, score-0.309]

56 As can be seen in figure 2, when a two-dimensional projection is used, the classes are consistently much better separated by the NCA transformation than by either PCA (which is unsupervised) or LDA (which has access to the class labels). [sent-78, score-0.353]

57 Of course, the NCA transformation is still only a linear projection, just optimized with a cost function which explicitly encourages local separation. [sent-79, score-0.246]

58 To further quantify the projection results we can apply a nearest-neighbor classification in the projected space. [sent-80, score-0.093]

59 Using the same projection learned at training time, we project the training set and all future test points and perform KNN in the low-dimensional space using the Euclidean measure. [sent-81, score-0.246]

60 The results under the PCA, LDA, LDA followed by RCA and NCA transformations (using K=1) appear in figure 1. [sent-82, score-0.041]

61 The NCA projection consistently gives superior performance in this highly constrained low- [...] [sent-83, score-0.604]

[Figure 1 residue: panel titles "distance metric learning − training / testing" (top row) and "rank 2 transformation − training / testing" (bottom row); legend entries NCA, diag-NCA, RCA, whitened Euclidean (top) and NCA, LDA+RCA, LDA, PCA (bottom); dataset labels bal, ion, iris, wine, hous, digit; axis ticks omitted.]

64 Figure 1: KNN classification accuracy (left train, right test) on UCI datasets balance, ionosphere, iris, wine and housing and on the USPS handwritten digits. [sent-117, score-0.745]

65 Results are averages over 40 realizations of splitting each dataset into training (70%) and testing (30%) subsets (for USPS 200 images for each of the 10 digit classes were used for training and 500 for testing). [sent-118, score-0.345]

66 Top panels show distance metric learning (square A) and bottom panels show linear dimensionality reduction down to d = 2. [sent-119, score-0.409]

67 In summary, we have found that when labeled data is available, NCA performs better both in terms of classification performance in the projected representation and in terms of visualization of class separation as compared to the standard methods of PCA and LDA. [sent-121, score-0.184]

68 5 Extensions to Continuous Labels and Semi-Supervised Learning Although we have focused here on discrete classes, linear transformations and fully supervised learning, many extensions of this basic idea are possible. [sent-122, score-0.071]

69 Clearly, a nonlinear transformation function A(·) could be learned using any architecture (such as a multilayer perceptron) trainable by gradient methods. [sent-123, score-0.206]

70 Furthermore, it is possible to extend the classification framework presented above to the case of a real valued (continuous) supervision signal by defining the set of “correct matches” Ci for point i to be those points j having similar (continuous) targets. [sent-124, score-0.062]

71 This naturally leads to the idea of “soft matches”, in which the objective function becomes a sum over all pairs, each weighted by their agreement according to the targets. [sent-125, score-0.06]

72 The data are reduced from their original dimensionalities (D=3, D=13, D=560, D=256 respectively) to the d=2 dimensions shown. [sent-127, score-0.058]

73 Figure 3: The two dimensional outputs of the neural network on a set of test cases. [sent-128, score-0.056]

74 On the left, each point is shown using a line segment that has the same orientation as the input face. [sent-129, score-0.033]

75 in a video of a person’s face we may assume that pose and expression vary slowly in time even if no individual frames are ever labeled explicitly with numerical pose or expression values. [sent-134, score-0.133]

76 To illustrate this, we generate pairs of faces in the following way: First we choose two faces at random from the FERET-B dataset (5000 isolated faces that have a standard orientation and scale). [sent-135, score-0.316]

77 The first face is rotated by an angle uniformly distributed between ±45o and scaled to have a height uniformly distributed between 25 and 35 pixels. [sent-136, score-0.048]

78 The second face (which is of a different person) is given the same rotation and scaling but with Gaussian noise of ±1. [sent-137, score-0.048]

79 The pair is given a weight, wab , which is the probability density of the added noise divided by its maximum possible value. [sent-140, score-0.038]

80 We then trained a neural network with one hidden layer of 100 logistic units to map from the 35×35 pixel intensities of a face to a point, y, in a 2-D output space. [sent-141, score-0.048]

81 Backpropagation was used to minimize the cost function in Eq. 8, [sent-142, score-0.048]

82 which encourages the faces in a pair to be placed close together: $\mathrm{Cost} = - \sum_{\mathrm{pair}(a,b)} w_{ab} \log \frac{\exp(-\|y_a - y_b\|^2)}{\sum_{c,d} \exp(-\|y_c - y_d\|^2)}$ (8), where c and d are indices over all of the faces, not just the ones that form a pair. [sent-143, score-0.155]
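
A small sketch of this cost (Eq. 8 as reconstructed above), with Y the 2-D network outputs, pairs a list of index pairs, and w the weights w_ab; excluding the c = d terms from the denominator is my assumption, since the text only says the sum runs over all faces:

```python
import numpy as np

def pair_cost(Y, pairs, w):
    """Cost = - sum_{pair(a,b)} w_ab * log( exp(-||y_a - y_b||^2) / sum_{c != d} exp(-||y_c - y_d||^2) )."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)    # pairwise squared distances of outputs
    E = np.exp(-sq)
    np.fill_diagonal(E, 0.0)       # drop c = d terms (assumption)
    denom = E.sum()                # same denominator for every pair
    cost = 0.0
    for (a, b), w_ab in zip(pairs, w):
        cost -= w_ab * np.log(E[a, b] / denom)
    return cost
```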

83 Four example faces are shown to the right; horizontally the pairs agree and vertically they do not. [sent-144, score-0.081]

84 Figure 3 above shows that the feedforward neural network discovered polar coordinates without the user having to decide how to represent scale and orientation in the output space. [sent-145, score-0.033]

85 6 Relationships to Other Methods and Conclusions Several papers recently addressed the problem of learning Mahalanobis distance functions given labeled data or at least side-information in the form of equivalence constraints. [sent-146, score-0.147]

86 RCA is implicitly assuming a Gaussian distribution for each class (so it can be described using only the first two moments of the class-conditional distribution). [sent-148, score-0.063]

87 al attempt to find a transformation which minimizes all pairwise squared distances between points in the same class; this implicitly assumes that classes form a single compact connected set. [sent-150, score-0.217]

88 For highly multimodal class distributions this cost function will be severely penalized. [sent-151, score-0.139]

89 Lowe[6] proposed a method similar to ours but used a more limited idea for learning a nearest neighbour distance metric. [sent-152, score-0.304]

90 In parallel there has been work on learning low rank transformations for fast classification and visualization. [sent-154, score-0.144]

91 The classic LDA algorithm[3] is optimal if all class distributions are Gaussian with a single shared covariance; this assumption, however, is rarely true. [sent-155, score-0.118]

92 [5], [2]) make the transformation more robust to outliers and to numerical instability when not enough datapoints are available. [sent-159, score-0.132]

93 ) In general, there are two classes of regularization assumption that are common in linear methods for classification. [sent-161, score-0.089]

94 The first is a strong parametric assumption about the structure of the class distributions (typically enforcing connected or even convex structure); the second is an assumption about the decision boundary (typically enforcing a hyperplane). [sent-162, score-0.207]

95 Our method makes neither of these assumptions, relying instead on the strong regularization imposed by restricting ourselves to a linear transformation of the original inputs. [sent-163, score-0.201]

96 Future research on the NCA model will investigate using local estimates of K as derived from the entropy of the distributions pij ; the possible use of a stochastic classification rule at test time; and more systematic comparisons between the objective functions f and g. [sent-164, score-0.385]

97 To conclude, we have introduced a novel non-parametric learning method — NCA — that handles the tasks of distance learning and dimensionality reduction in a unified manner. [sent-165, score-0.2]

98 Although much recent effort has focused on non-linear methods, we feel that linear embedding has still not fully fulfilled its potential for either visualization or learning. [sent-166, score-0.089]

99 Acknowledgments Thanks to David Heckerman and Paul Viola for suggesting that we investigate the alternative cost g(A) and the case of diagonal A. [sent-167, score-0.071]

100 A new lda-based face recognition system which can solve the small sample size problem. [sent-181, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nca', 0.627), ('knn', 0.287), ('rca', 0.217), ('pij', 0.214), ('lda', 0.186), ('neighbour', 0.169), ('wine', 0.147), ('transformation', 0.132), ('metric', 0.127), ('xik', 0.115), ('hous', 0.096), ('nonsquare', 0.096), ('xij', 0.096), ('ci', 0.096), ('digit', 0.093), ('neighbours', 0.084), ('iris', 0.084), ('distance', 0.084), ('faces', 0.081), ('ion', 0.077), ('bal', 0.077), ('mahalanobis', 0.077), ('pca', 0.076), ('dimensionality', 0.076), ('metrics', 0.074), ('rank', 0.074), ('projection', 0.07), ('pik', 0.063), ('class', 0.063), ('classi', 0.062), ('objective', 0.06), ('visualization', 0.059), ('classes', 0.059), ('euclidean', 0.055), ('usps', 0.054), ('perplexity', 0.054), ('stochastic', 0.051), ('nearest', 0.051), ('pi', 0.05), ('axi', 0.048), ('ytest', 0.048), ('face', 0.048), ('cost', 0.048), ('gradient', 0.046), ('storage', 0.045), ('training', 0.045), ('concentric', 0.042), ('neighbourhood', 0.042), ('loo', 0.042), ('transformations', 0.041), ('reduction', 0.04), ('dataset', 0.04), ('labeled', 0.039), ('transformed', 0.039), ('restricting', 0.039), ('wab', 0.038), ('testing', 0.038), ('gure', 0.037), ('supervision', 0.036), ('encourages', 0.036), ('whitened', 0.036), ('dimensions', 0.034), ('cation', 0.033), ('yn', 0.033), ('person', 0.033), ('orientation', 0.033), ('enforcing', 0.032), ('test', 0.032), ('linear', 0.03), ('labels', 0.03), ('low', 0.029), ('consistently', 0.029), ('ax', 0.028), ('ay', 0.028), ('distributions', 0.028), ('learned', 0.028), ('matches', 0.028), ('store', 0.027), ('rarely', 0.027), ('points', 0.026), ('parametric', 0.026), ('panels', 0.026), ('convex', 0.026), ('optimize', 0.025), ('effectively', 0.025), ('images', 0.025), ('suffers', 0.024), ('handwritten', 0.024), ('xing', 0.024), ('reduced', 0.024), ('dimensional', 0.024), ('equivalence', 0.024), ('diagonal', 0.023), ('dn', 0.023), ('expression', 0.023), ('projected', 0.023), ('search', 0.022), ('uci', 0.022), ('boundaries', 0.022), ('maximizing', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 127 nips-2004-Neighbourhood Components Analysis

Author: Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, Ruslan Salakhutdinov

Abstract: In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction. 1

2 0.17698289 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

Author: Jieping Ye, Ravi Janardan, Qi Li

Abstract: Linear Discriminant Analysis (LDA) is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as face recognition and image retrieval. An intrinsic limitation of classical LDA is the so-called singularity problem, that is, it fails when all scatter matrices are singular. A well-known approach to deal with the singularity problem is to apply an intermediate dimension reduction stage using Principal Component Analysis (PCA) before LDA. The algorithm, called PCA+LDA, is used widely in face recognition. However, PCA+LDA has high costs in time and space, due to the need for an eigen-decomposition involving the scatter matrices. In this paper, we propose a novel LDA algorithm, namely 2DLDA, which stands for 2-Dimensional Linear Discriminant Analysis. 2DLDA overcomes the singularity problem implicitly, while achieving efficiency. The key difference between 2DLDA and classical LDA lies in the model for data representation. Classical LDA works with vectorized representations of data, while the 2DLDA algorithm works with data in matrix representation. To further reduce the dimension by 2DLDA, the combination of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied, where LDA is preceded by 2DLDA. The proposed algorithms are applied on face recognition and compared with PCA+LDA. Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient. 1

3 0.12284677 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

Author: Michael Fink

Abstract: We describe a framework for learning an object classifier from a single example. This goal is achieved by emphasizing the relevant dimensions for classification using available examples of related classes. Learning to accurately classify objects from a single training example is often unfeasible due to overfitting effects. However, if the instance representation provides that the distance between each two instances of the same class is smaller than the distance between any two instances from different classes, then a nearest neighbor classifier could achieve perfect performance with a single training example. We therefore suggest a two stage strategy. First, learn a metric over the instances that achieves the distance criterion mentioned above, from available examples of other related classes. Then, using the single examples, define a nearest neighbor classifier where distance is evaluated by the learned class relevance metric. Finding a metric that emphasizes the relevant dimensions for classification might not be possible when restricted to linear projections. We therefore make use of a kernel based metric learning algorithm. Our setting encodes object instances as sets of locality based descriptors and adopts an appropriate image kernel for the class relevance metric learning. The proposed framework for learning from a single example is demonstrated in a synthetic setting and on a character classification task. 1

4 0.10423689 125 nips-2004-Multiple Relational Embedding

Author: Roland Memisevic, Geoffrey E. Hinton

Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1

5 0.095792957 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

Author: Volker Roth

Abstract: The problem of detecting “atypical objects” or “outliers” is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classification to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify “atypical objects” by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is “rich enough” in the sense that it asymptotically provides an unbiased estimator for the true density. In order to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied. 1

6 0.086650126 163 nips-2004-Semi-parametric Exponential Family PCA

7 0.085190661 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

8 0.08329796 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

9 0.076891698 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

10 0.075253263 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

11 0.073040053 87 nips-2004-Integrating Topics and Syntax

12 0.071248539 145 nips-2004-Parametric Embedding for Class Visualization

13 0.071079835 68 nips-2004-Face Detection --- Efficient and Rank Deficient

14 0.070575878 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

15 0.068497822 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

16 0.062954232 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

17 0.062119015 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

18 0.059494637 115 nips-2004-Maximum Margin Clustering

19 0.059366271 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

20 0.058378715 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, 0.081), (2, -0.067), (3, -0.054), (4, 0.015), (5, 0.066), (6, -0.091), (7, -0.022), (8, 0.166), (9, 0.059), (10, 0.01), (11, -0.104), (12, -0.134), (13, -0.01), (14, 0.004), (15, -0.043), (16, -0.077), (17, -0.007), (18, 0.061), (19, 0.052), (20, 0.029), (21, -0.005), (22, 0.101), (23, 0.071), (24, 0.128), (25, 0.032), (26, -0.135), (27, 0.063), (28, 0.029), (29, 0.074), (30, -0.018), (31, -0.025), (32, -0.005), (33, 0.015), (34, -0.019), (35, 0.021), (36, 0.034), (37, -0.011), (38, 0.042), (39, 0.059), (40, 0.03), (41, 0.049), (42, -0.032), (43, 0.043), (44, 0.011), (45, -0.049), (46, 0.027), (47, -0.096), (48, 0.088), (49, -0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91285634 127 nips-2004-Neighbourhood Components Analysis

Author: Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, Ruslan Salakhutdinov

Abstract: In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction. 1

2 0.7783289 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

Author: Jieping Ye, Ravi Janardan, Qi Li

Abstract: Linear Discriminant Analysis (LDA) is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as face recognition and image retrieval. An intrinsic limitation of classical LDA is the so-called singularity problem, that is, it fails when all scatter matrices are singular. A well-known approach to deal with the singularity problem is to apply an intermediate dimension reduction stage using Principal Component Analysis (PCA) before LDA. The algorithm, called PCA+LDA, is used widely in face recognition. However, PCA+LDA has high costs in time and space, due to the need for an eigen-decomposition involving the scatter matrices. In this paper, we propose a novel LDA algorithm, namely 2DLDA, which stands for 2-Dimensional Linear Discriminant Analysis. 2DLDA overcomes the singularity problem implicitly, while achieving efficiency. The key difference between 2DLDA and classical LDA lies in the model for data representation. Classical LDA works with vectorized representations of data, while the 2DLDA algorithm works with data in matrix representation. To further reduce the dimension by 2DLDA, the combination of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied, where LDA is preceded by 2DLDA. The proposed algorithms are applied on face recognition and compared with PCA+LDA. Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient. 1

3 0.69293314 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

Author: Tao Xiong, Jieping Ye, Qi Li, Ravi Janardan, Vladimir Cherkassky

Abstract: Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition. Recently, a novel LDA algorithm based on QR Decomposition, namely LDA/QR, has been proposed, which is competitive in terms of classification accuracy with other LDA algorithms, but it has much lower costs in time and space. However, LDA/QR is based on linear projection, which may not be suitable for data with nonlinear structure. This paper first proposes an algorithm called KDA/QR, which extends the LDA/QR algorithm to deal with nonlinear data by using the kernel operator. Then an efficient approximation of KDA/QR called AKDA/QR is proposed. Experiments on face image data show that the classification accuracy of both KDA/QR and AKDA/QR are competitive with Generalized Discriminant Analysis (GDA), a general kernel discriminant analysis algorithm, while AKDA/QR has much lower time and space costs. 1

4 0.54364163 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon

Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For p nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1

5 0.5249843 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

Author: Volker Roth

Abstract: The problem of detecting “atypical objects” or “outliers” is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classification to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify “atypical objects” by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is “rich enough” in the sense that it asymptotically provides an unbiased estimator for the true density. In order to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied. 1

6 0.50874996 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

7 0.46676648 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification

8 0.4594759 125 nips-2004-Multiple Relational Embedding

9 0.45494735 145 nips-2004-Parametric Embedding for Class Visualization

10 0.45260009 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

11 0.4520106 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

12 0.3998664 87 nips-2004-Integrating Topics and Syntax

13 0.39429909 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

14 0.39154649 163 nips-2004-Semi-parametric Exponential Family PCA

15 0.38877252 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

16 0.3762973 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

17 0.35978988 205 nips-2004-Who's In the Picture

18 0.35657999 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

19 0.35301679 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

20 0.34626219 68 nips-2004-Face Detection --- Efficient and Rank Deficient


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.11), (15, 0.157), (26, 0.041), (31, 0.024), (33, 0.189), (35, 0.048), (39, 0.03), (50, 0.036), (54, 0.014), (61, 0.225), (71, 0.01), (76, 0.012), (87, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86031908 127 nips-2004-Neighbourhood Components Analysis

Author: Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, Ruslan Salakhutdinov

Abstract: In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction. 1

2 0.77085322 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

3 0.76807719 178 nips-2004-Support Vector Classification with Input Data Uncertainty

Author: Jinbo Bi, Tong Zhang

Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1

4 0.76602781 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang

Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.

5 0.76549107 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.76312143 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

7 0.76230276 125 nips-2004-Multiple Relational Embedding

8 0.76121855 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

9 0.76100588 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

10 0.76009035 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

11 0.75908273 102 nips-2004-Learning first-order Markov models for control

12 0.75900114 163 nips-2004-Semi-parametric Exponential Family PCA

13 0.75814986 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

14 0.75732076 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

15 0.75722873 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

16 0.75709718 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

17 0.75648457 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

18 0.75640291 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

19 0.75636739 70 nips-2004-Following Curved Regularized Optimization Solution Paths

20 0.75570428 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks