nips nips2010 nips2010-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. [sent-8, score-1.343]
2 Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. [sent-9, score-0.949]
3 We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. [sent-10, score-1.108]
4 As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. [sent-11, score-0.955]
5 Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. [sent-12, score-1.799]
6 1 Introduction The classic dichotomy between generative and discriminative methods for classification in machine learning can be clearly seen in two distinct performance regimes as the number of training examples is varied [12, 18]. [sent-13, score-0.442]
7 On the other hand, more flexible discriminative methods—which are interested in a direct measure of p(y|x)—can accurately capture the true posterior structure p(y|x) when the number of training examples is large. [sent-15, score-0.177]
8 Thus, given enough training examples, the best performing classification algorithms have typically employed purely discriminative methods. [sent-16, score-0.223]
9 However, due to the curse of dimensionality when D is large, the number of data examples may not be sufficient for discriminative methods to approach their asymptotic performance limits. [sent-17, score-0.397]
10 In this case, it may be possible to improve discriminative methods by exploiting knowledge of generative models. [sent-18, score-0.386]
11 There has been recent work on hybrid models showing some improvement [14, 15, 20], but mainly the generative models have been improved through the discriminative formulation. [sent-19, score-0.386]
12 In this work, we consider a very simple discriminative classifier, the nearest neighbor classifier, where the class label of an unknown datum is chosen according to the class label of the nearest known datum. [sent-20, score-1.482]
13 The choice of a metric to define nearest is then crucial, and we show how this metric can be locally defined based upon knowledge of generative models. [sent-21, score-1.604]
14 Previous work on metric learning for nearest neighbor classification has focused on a purely discriminative approach. [sent-22, score-1.44]
15 The metric is parameterized by a global quadratic form which is then optimized on the training data to maximize pairwise separation between dissimilar points, and to minimize the pairwise separation of similar points [3, 9, 10, 21, 26]. [sent-23, score-0.662]
16 Here, we show how the problem of learning a metric can be related to reducing the theoretical bias of the nearest neighbor classifier. [sent-24, score-1.416]
17 Though the performance of the nearest neighbor classifier has good theoretical guarantees in the limit of infinite data, finite sampling effects can introduce a bias which can be minimized by the choice of an appropriate metric. [sent-25, score-0.992]
18 By directly trying to reduce this bias at each point, we will see the classification error is significantly reduced compared to the global class-separating metric. [sent-26, score-0.19]
19 We show how to choose such a metric by analyzing the probability distribution on nearest neighbors, provided we know the underlying generative models. [sent-27, score-1.122]
20 Analyses of nearest neighbor distributions have been discussed before [11, 19, 24, 25], but we take a simpler approach and derive the metric-dependent term in the bias directly. [sent-28, score-0.923]
21 Earlier work considered optimizing a metric function in a generative setting [7, 8], but the resulting derivation was inaccurate and does not improve nearest neighbor performance. [sent-31, score-1.48]
22 Other work first showed how a generative model can be used to derive a special kernel, called the Fisher kernel [12], which can be related to a distance function. [sent-33, score-0.35]
23 Unfortunately, the Fisher kernel is quite generic, and need not necessarily improve nearest neighbor performance. [sent-34, score-0.788]
24 Our generative approach also provides a theoretical relationship between metric learning and the dimensionality reduction problem. [sent-35, score-0.941]
25 In order to find better projections for classification, research on dimensionality reduction using labeled training data has utilized information-theoretic measures such as the Bhattacharyya divergence [6] and mutual information [2, 17]. [sent-36, score-0.257]
26 We argue how these problems can be connected with metric learning for nearest neighbor classification within the general framework of F-divergences. [sent-37, score-1.244]
27 We will also explain how dimensionality reduction is entirely different from metric learning in the generative approach, whereas in the discriminative setting, it is simply a special case of metric learning where particular directions are shrunk to zero. [sent-38, score-1.601]
28 In section 2, we motivate our approach by comparing the metric dependency of the discriminative and generative approaches for nearest neighbor classification. [sent-40, score-1.63]
29 After we derive the bias due to finite sampling in section 3, we show, in section 4, how minimizing this bias results in a local metric learning algorithm. [sent-41, score-0.833]
30 In section 5, we explain how metric learning should be understood in a generative perspective, in particular, its relationship with dimensionality reduction. [sent-42, score-0.879]
31 2 Metric and Nearest Neighbor Classification In recent work, determining a good metric for nearest neighbor classification is believed to be crucial. [sent-45, score-1.244]
32 However, traditional generative analysis of this problem has simply ignored the metric issue with good reason, as we will see in section 2.2. [sent-46, score-0.718]
33 In this section, we explain the apparent contradiction between two different approaches to this issue, and briefly describe how the resolution of this contradiction will lead to a metric learning method that is both theoretically and practically plausible. [sent-48, score-0.55]
34 2.1 Metric Learning for Nearest Neighbor Classification A nearest neighbor classifier determines the label of an unknown datum according to the label of its nearest neighbor. [sent-50, score-1.28]
35 In general, the meaning of the term nearest is defined along with the notion of distance in data space. [sent-51, score-0.492]
36 One common choice for this distance is the Mahalanobis distance with a positive definite square matrix A ∈ R^{D×D}, where D is the dimensionality of the data space. [sent-52, score-0.278]
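A minimal sketch of the nearest neighbor rule under such a Mahalanobis metric, to make the role of A concrete; the code below is our own illustration (hypothetical function names and toy data), not code from the paper.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance (x - y)^T A (x - y) for a positive definite A."""
    d = x - y
    return float(d @ A @ d)

def nn_predict(x_test, X_train, y_train, A=None):
    """1-nearest-neighbor prediction under the metric induced by A (Euclidean if A is None)."""
    A = np.eye(X_train.shape[1]) if A is None else A
    dists = [mahalanobis_sq(x_test, x, A) for x in X_train]
    return y_train[int(np.argmin(dists))]

# Toy usage with hypothetical data.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 1])
print(nn_predict(np.array([0.4, 0.2]), X, y))                          # Euclidean metric
print(nn_predict(np.array([0.4, 0.2]), X, y, A=np.diag([1.0, 25.0])))  # anisotropic metric
```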
37 This recent work has assumed the following common heuristic: the training data in different classes should be separated in a new metric space. [sent-54, score-0.509]
38 Given training data, a global A is optimized such that directions separating different class data are extended, and directions binding same class data together are shrunk [3, 9, 10, 21, 26]. [sent-55, score-0.212]
39 2.2 Theoretical Performance of Nearest Neighbor Classifier Contrary to recent metric learning approaches, a simple theoretical analysis using a generative model displays no sensitivity to the choice of the metric. [sent-58, score-0.764]
40 With infinite samples, the probability of misclassification using a nearest neighbor classifier can be obtained: $E_{\mathrm{Asymp}} = \int \frac{p_1(x)\,p_2(x)}{p_1(x)+p_2(x)}\,dx$, (2) which is better known by its relationship to an upper bound, twice the optimal Bayes error [4, 7, 8]. [sent-63, score-0.801]
41 By looking at the asymptotic error in a linearly transformed z-space, we can show that Eq. (2) is invariant to the choice of metric. [sent-64, score-0.155]
42 If we consider a linear transformation z = L^T x using a full rank matrix L, and the distribution q_c(z) for c ∈ {1, 2} in z-space satisfying p_c(x)dx = q_c(z)dz with the accompanying measure change dz = |L|dx, we see E_Asymp in z-space is unchanged. [sent-66, score-0.233]
43 Since any positive definite A can be decomposed as A = LL^T, we can say the asymptotic error remains constant even as the metric shrinks or expands any spatial directions in data space. [sent-67, score-0.671]
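Spelling out the change of variables makes the invariance explicit; this is a worked restatement of the argument above, with $q_c(z) = p_c(x)/|L|$ following from $p_c(x)dx = q_c(z)dz$:

$$
\int \frac{q_1(z)\,q_2(z)}{q_1(z)+q_2(z)}\,dz
= \int \frac{\frac{p_1(x)}{|L|}\cdot\frac{p_2(x)}{|L|}}{\frac{p_1(x)+p_2(x)}{|L|}}\;|L|\,dx
= \int \frac{p_1(x)\,p_2(x)}{p_1(x)+p_2(x)}\,dx
= E_{\mathrm{Asymp}} .
$$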
44 This difference in behavior in terms of metric dependence can be understood as a special property that arises from infinite data. [sent-68, score-0.541]
45 When we do not have infinite samples, the expectation of error is biased in that it deviates from the asymptotic error, and the bias is dependent on the metric. [sent-69, score-0.333]
46 From a theoretical perspective, the asymptotic error is the theoretical limit of the expected error, and the bias reduces as the number of samples increases. [sent-70, score-0.398]
47 Since this difference is not considered in previous research, the aforementioned metric will not exhibit performance improvements when the number of samples is large. [sent-71, score-0.511]
48 In the next section, we look at the performance bias associated with finite sampling directly and find a metric that minimizes the bias from the asymptotic theoretical error. [sent-72, score-0.979]
49 3 Performance Bias due to Finite Sampling Here, we obtain the expectation of nearest neighbor classification error from the distribution of nearest neighbors in different classes. [sent-73, score-1.282]
50 As we consider a finite number of samples, the nearest neighbor from a point x0 appears at a finite distance dN > 0. [sent-74, score-0.85]
51 This non-zero distance gives rise to the performance difference from its theoretical limit (2). [sent-75, score-0.163]
52 Now, under the condition that the nearest neighbor appears at the distance dN from the test point, the expectation of the probability p(x_NN) at the nearest neighbor point is derived by averaging the probability over the D-dimensional hypersphere of radius dN, as in Fig. 1. [sent-77, score-1.704]
53 $\mathbb{E}_{x_{NN}}\!\left[\,p(x_{NN}) \mid d_N, x_0\,\right] = p(x_0) + \mathbb{E}_{x_{NN}}\!\left[(x - x_0)^T\right]\nabla p(x_0) + \tfrac{1}{2}\,\mathbb{E}_{x_{NN}}\!\left[(x - x_0)^T \nabla\nabla^T p(x_0)\,(x - x_0)\right] = p(x_0) + \frac{d_N^2}{2D}\,\nabla^2 p\big|_{x=x_0} \equiv \tilde{p}(x_0) \quad (4)$  Figure 1: The nearest neighbor x_NN appears at a finite distance dN from x0 due to finite sampling. [sent-81, score-0.85]
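The second-order term in Eq. (4) comes from averaging a Taylor expansion over the hypersphere: the first-order term vanishes by symmetry, and the second moment of a point drawn on a radius-$d_N$ sphere is $d_N^2/D$ per coordinate. A short restatement in our notation:

$$
\mathbb{E}\!\left[(x-x_0)(x-x_0)^T \,\middle|\, d_N\right] = \frac{d_N^2}{D}\,I
\quad\Longrightarrow\quad
\frac{1}{2}\,\mathbb{E}\!\left[(x-x_0)^T\,\nabla\nabla^T p(x_0)\,(x-x_0)\right]
= \frac{1}{2}\,\mathrm{tr}\!\left[\nabla\nabla^T p(x_0)\,\frac{d_N^2}{D}I\right]
= \frac{d_N^2}{2D}\,\nabla^2 p(x_0) .
$$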
54 If we look at the expected error, it is the expectation of the probability that the test point and its neighbor are labeled differently. [sent-84, score-0.435]
55 The residual term can be considered as the finite sampling bias of the error discussed earlier. [sent-88, score-0.225]
56 Under the coordinate transformation z = L^T x and the distributions p(x) on x and q(z) on z, we see that this bias term is dependent upon the choice of a metric A = LL^T. [sent-89, score-0.643]
57 $\frac{1}{2}\int \frac{q_1^2\,\nabla^2 q_2 + q_2^2\,\nabla^2 q_1 - q_1 q_2\,(\nabla^2 q_1 + \nabla^2 q_2)}{(q_1+q_2)^2}\,dz = \frac{1}{2}\int \frac{\mathrm{tr}\!\left[A^{-1}\!\left(p_1^2\,\nabla\nabla^T p_2 + p_2^2\,\nabla\nabla^T p_1 - p_1 p_2\,(\nabla\nabla^T p_1 + \nabla\nabla^T p_2)\right)\right]}{(p_1+p_2)^2}\,dx \quad (7)$ which is derived using p(x)dx = q(z)dz and $|L|\,\nabla^2 q = \mathrm{tr}[A^{-1}\nabla\nabla^T p]$. [sent-90, score-0.181]
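The identity $|L|\,\nabla^2 q = \mathrm{tr}[A^{-1}\nabla\nabla^T p]$ used above follows from the same change of variables; a short check in our notation:

$$
\nabla_z = L^{-1}\nabla_x,\qquad q(z)=\frac{p(x)}{|L|}
\quad\Longrightarrow\quad
|L|\,\nabla_z^2\, q(z)
= \mathrm{tr}\!\left[L^{-1}\,\nabla_x\nabla_x^T p(x)\,L^{-T}\right]
= \mathrm{tr}\!\left[(LL^T)^{-1}\nabla\nabla^T p(x)\right]
= \mathrm{tr}\!\left[A^{-1}\nabla\nabla^T p(x)\right].
$$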
58 Thus, finding the metric that minimizes the quantity given in Eq. (7) reduces the deviation from the asymptotic performance. [sent-92, score-0.507]
59 4 Reducing Deviation from the Asymptotic Performance Finding the local metric that minimizes the bias can be formulated as a semi-definite programming (SDP) problem of minimizing the squared residual with respect to a positive semi-definite metric A: $\min_A\ (\mathrm{tr}[A^{-1}B])^2$ s.t. $A \succeq 0$. [sent-95, score-1.216]
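As a minimal sketch (not the paper's closed-form spectral solution, and with function names of our own), the snippet below uses the fact that when B has both positive and negative eigenvalues one can pick strictly positive weights in B's eigenbasis so that $\mathrm{tr}[A^{-1}B] = 0$, which drives the objective $(\mathrm{tr}[A^{-1}B])^2$ to its global minimum of zero.

```python
import numpy as np

def local_metric_from_B(B, eps=1e-12):
    """One simple feasible construction for min_A (tr[A^{-1} B])^2 over A > 0.

    When B has mixed-sign eigenvalues, strictly positive weights in B's eigenbasis
    can be chosen so that tr(A^{-1} B) = 0, attaining the global minimum of zero.
    This is a sketch, not the paper's exact spectral solution.
    """
    lam, U = np.linalg.eigh(B)                 # B = U diag(lam) U^T, lam ascending
    pos_sum = lam[lam > eps].sum()
    neg_sum = lam[lam < -eps].sum()
    if pos_sum <= eps or neg_sum >= -eps:
        return np.eye(B.shape[0])              # trace cannot be cancelled; keep Euclidean
    w = np.ones_like(lam)
    w[lam > eps] = -neg_sum / pos_sum          # positive by construction
    A_inv = U @ np.diag(w) @ U.T               # positive definite since all w_i > 0
    return np.linalg.inv(A_inv)

# Toy usage with a mixed-sign B.
B = np.diag([2.0, -1.0, 0.5])
A = local_metric_from_B(B)
print(np.trace(np.linalg.inv(A) @ B))          # ~0 up to floating point error
```

When B is entirely positive (or negative) semi-definite the trace cannot be cancelled, and this sketch simply falls back to the Euclidean metric; the paper's actual spectral solution should be consulted for the exact construction.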
60 In principle, distances should then be defined as geodesic distances using this local metric on a Riemannian manifold. [sent-102, score-0.552]
61 However, this is computationally difficult, so we use the surrogate metric A = γI + Aopt and treat γ as a regularization parameter that is learned in addition to the local metric Aopt. [sent-103, score-0.64]
62 The asymptotic error with C-class distributions can be extended to $\frac{1}{C}\sum_{c=1}^{C}\int \frac{p_c \sum_{j\neq c} p_j}{\sum_i p_i}\,dx$ using the posteriors of each class, and it replaces B in Eq. (11). [sent-105, score-0.374]
63 5 Metric Learning in Generative Models Traditional metric learning methods can be understood as being purely discriminative. [sent-107, score-0.587]
64 In general, their motivations are similar to those of supervised dimensionality reduction methods, which try to find a low-dimensional space where the separation between classes is maximized. [sent-109, score-0.241]
65 Their dimensionality reduction is not that different from metric learning; it is often a special case of metric learning where the metric in particular directions is forced to be zero. [sent-110, score-1.175]
66 In the generative approach, however, the relationship between dimensionality reduction and metric learning is different. [sent-111, score-0.895]
67 As in the discriminative case, dimensionality reduction in generative models tries to obtain class separation in a transformed space. [sent-112, score-0.702]
68 (12) Examples of this divergence include the Bhattacharyya divergence $\int \sqrt{p_1(x)\,p_2(x)}\,dx$ when $\phi(t) = \sqrt{t}$, and the KL-divergence $-\int p_1(x) \log \frac{p_2(x)}{p_1(x)}\,dx$ when $\phi(t) = -\log(t)$. [sent-115, score-0.198]
69 Unlike dimensionality reduction, we cannot use these criteria for metric learning because any F-divergence is metric-invariant. [sent-118, score-0.584]
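The invariance is again a one-line change of variables; writing the F-divergence as $D_\phi(p_1, p_2) = \int p_1(x)\,\phi(p_2(x)/p_1(x))\,dx$ (our shorthand for Eq. (12), which is not quoted in full here):

$$
\int q_1(z)\,\phi\!\left(\frac{q_2(z)}{q_1(z)}\right)dz
= \int \frac{p_1(x)}{|L|}\,\phi\!\left(\frac{p_2(x)/|L|}{p_1(x)/|L|}\right)|L|\,dx
= \int p_1(x)\,\phi\!\left(\frac{p_2(x)}{p_1(x)}\right)dx .
$$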
70 The local $\tilde{p}/p$ of the three classes are plotted on the right using a Euclidean metric I and for the optimal metric Aopt. [sent-122, score-1.034]
71 The solution Aopt tries to keep the ratio $\tilde{p}/p$ over the different classes as similar as possible when the distance dN is varied. [sent-123, score-0.137]
72 Therefore, in generative models, the metric learning problem is qualitatively different from the dimensionality reduction problem in this aspect. [sent-125, score-0.895]
73 One interpretation is that the F-measure can be understood as a measure of dimensionality reduction in an asymptotic situation. [sent-126, score-0.352]
74 In this case, the role of metric learning can be defined as moving the expected F-measure toward the asymptotic F-measure by appropriate metric adaptation. [sent-127, score-1.105]
75 Rewriting Eq. (9) into $(p_2 - p_1)(p_2\,\nabla^2 p_1 - p_1\,\nabla^2 p_2)$, we can see that the optimal metric tries to minimize the difference between $\nabla^2 p_1/p_1$ and $\nabla^2 p_2/p_2$. [sent-131, score-0.531]
76 This means that the expected nearest neighbor classification error, given a neighbor at distance dN, will be least biased due to finite sampling. [sent-134, score-0.875]
77 Fig. 2 shows how the learned local metric Aopt varies at a point x for a 3-class Gaussian example, and how the ratio of $\tilde{p}/p$ is kept as similar as possible. [sent-136, score-0.552]
78 6 Experiments We apply our algorithm for learning a local metric to synthetic and various real datasets and see how well it improves nearest neighbor classification performance. [sent-137, score-1.391]
79 Simple standard Gaussian distributions are used to learn the generative model, with parameters including the mean vector µ and covariance matrix Σ for each class. [sent-138, score-0.271]
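For such a Gaussian N(µ, Σ), the density gradient and Hessian have closed forms, $\nabla p = -p\,\Sigma^{-1}(x-\mu)$ and $\nabla\nabla^T p = p\,(\Sigma^{-1}(x-\mu)(x-\mu)^T\Sigma^{-1} - \Sigma^{-1})$, which is all that is needed to evaluate the bias matrix of Eq. (7) at a query point. The sketch below is our own illustration (hypothetical function names; "our reading of Eq. (7)" is an assumption) of assembling that matrix for two classes.

```python
import numpy as np

def gaussian_density_grad_hess(x, mu, cov):
    """Value, gradient and Hessian of the N(mu, cov) density at x (closed forms)."""
    D = len(mu)
    prec = np.linalg.inv(cov)
    diff = x - mu
    p = np.exp(-0.5 * diff @ prec @ diff) / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(cov))
    d = prec @ diff
    grad = -p * d                              # -p * Sigma^{-1} (x - mu)
    hess = p * (np.outer(d, d) - prec)         # p * (Sigma^{-1}(x-mu)(x-mu)^T Sigma^{-1} - Sigma^{-1})
    return p, grad, hess

def bias_matrix(x, mu1, cov1, mu2, cov2):
    """Pointwise matrix in the integrand of Eq. (7), up to a constant factor:
    (p2 - p1) (p2 H1 - p1 H2) / (p1 + p2)^2  -- our reading of the text."""
    p1, _, H1 = gaussian_density_grad_hess(x, mu1, cov1)
    p2, _, H2 = gaussian_density_grad_hess(x, mu2, cov2)
    return (p2 - p1) * (p2 * H1 - p1 * H2) / (p1 + p2) ** 2

# Toy usage: two 2-D Gaussian class models and one query point.
x0 = np.array([0.5, 0.0])
B = bias_matrix(x0, np.zeros(2), np.eye(2), np.ones(2), 2.0 * np.eye(2))
print(np.linalg.eigvalsh(B))                   # B is symmetric; its eigenvalues feed the metric construction
```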
80 We compare the performance of our method (GLML—Generative Local Metric Learning) with recent discriminative metric learning methods that report state-of-the-art performance on a number of datasets. [sent-140, score-0.69]
81 We also compare with a local metric given by the Fisher kernel [12] assuming a single Gaussian for the generative model and using the location parameter to derive the Fisher information matrix. [sent-143, score-0.814]
82 The metric from the Fisher kernel was not originally intended for nearest neighbor classification, but it is the only other reported algorithm that learns a local metric from generative models. [sent-144, score-2.058]
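For reference, with a single Gaussian and the location parameter µ as above, the Fisher information reduces to the constant matrix $\Sigma^{-1}$ (a standard identity, written here in our notation):

$$
\nabla_{\mu}\log p(x\mid\mu,\Sigma) = \Sigma^{-1}(x-\mu),
\qquad
\mathcal{I}(\mu)
= \mathbb{E}\!\left[\nabla_{\mu}\log p\;\nabla_{\mu}\log p^{T}\right]
= \Sigma^{-1}\,\mathbb{E}\!\left[(x-\mu)(x-\mu)^{T}\right]\Sigma^{-1}
= \Sigma^{-1}.
$$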
83 As the dimensionality grows, the performance of the original nearest neighbor classifier degrades. [sent-152, score-0.893]
84 However, we see that the proposed local metric substantially outperforms the standard discriminative nearest neighbor classifier in high-dimensional spaces. [sent-153, score-1.493]
85 As shown in Fig. 3, our local metric algorithm generally outperforms most of the other metrics across most of the datasets. [sent-169, score-0.552]
86 On quite a number of datasets, many of the other methods do not outperform the original Euclidean nearest neighbor classifier. [sent-170, score-0.762]
87 On the other hand, the local metric derived from simple Gaussian distributions always shows a performance gain over the naive nearest neighbor classifier. [sent-172, score-1.406]
88 In contrast, using Bayes rule with these simple Gaussian generative models often results in very poor performance. [sent-173, score-0.236]
89 The computational time using a local metric is also very competitive, since the underlying SDP optimization has a simple spectral solution. [sent-174, score-0.552]
90 This is in contrast to other methods which numerically solve for a global metric using an SDP over the data points. [sent-175, score-0.507]
91 7 Conclusions In our study, we showed how a local metric for nearest neighbor classification can be learned using generative models. [sent-176, score-1.55]
92 The learning algorithm is derived from an analysis of the asymptotic performance of the nearest neighbor classifier, such that the optimal metric minimizes the bias of the expected performance of the classifier. [sent-178, score-1.594]
93 This connection to generative models is very powerful, and can easily be extended to include missing data—one of the large advantages of generative models. [sent-179, score-0.472]
94 The Fisher kernel and BM are omitted for (f)∼(i) and (h)∼(i) respectively, since their performances are much worse than the naive nearest neighbor classifier. [sent-244, score-0.816]
95 Here we used simple Gaussians for the generative models, but this could also easily be extended to include other possibilities such as mixture models, hidden Markov models, or other dynamic generative models. [sent-246, score-0.472]
96 The kernelization of this work is straightforward, and the extension to the k-nearest neighbor setting using the theoretical distribution of k-th nearest neighbors is an interesting future direction. [sent-247, score-0.833]
97 Another possible future avenue of work is to combine dimensionality reduction and metric learning using this framework. [sent-248, score-0.659]
98 Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. [sent-358, score-0.22]
99 On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. [sent-370, score-0.264]
100 Distance metric learning for large margin nearest neighbor classification. [sent-423, score-1.244]
wordName wordTfidf (topN-words)
[('metric', 0.482), ('nearest', 0.404), ('neighbor', 0.358), ('generative', 0.236), ('dn', 0.172), ('aopt', 0.172), ('discriminative', 0.15), ('exn', 0.13), ('bias', 0.126), ('glml', 0.123), ('asymptotic', 0.116), ('classi', 0.106), ('dimensionality', 0.102), ('edn', 0.098), ('dx', 0.092), ('distance', 0.088), ('fisher', 0.081), ('reduction', 0.075), ('xn', 0.074), ('easymp', 0.074), ('twonorm', 0.074), ('pc', 0.073), ('local', 0.07), ('fukunaga', 0.065), ('separation', 0.064), ('dz', 0.062), ('wine', 0.06), ('understood', 0.059), ('usps', 0.058), ('datum', 0.056), ('sdp', 0.055), ('en', 0.054), ('divergence', 0.053), ('eigenvectors', 0.053), ('expectation', 0.052), ('tries', 0.049), ('llt', 0.049), ('qc', 0.049), ('discriminant', 0.047), ('purely', 0.046), ('theoretical', 0.046), ('nite', 0.046), ('datasets', 0.044), ('bm', 0.044), ('heteroscedastic', 0.043), ('verd', 0.043), ('itml', 0.043), ('korea', 0.043), ('seoul', 0.043), ('rd', 0.042), ('shrunk', 0.04), ('hypersphere', 0.04), ('lmnn', 0.04), ('error', 0.039), ('eigenvalues', 0.038), ('kulkarni', 0.037), ('neighbour', 0.037), ('men', 0.037), ('er', 0.036), ('bhattacharyya', 0.035), ('distributions', 0.035), ('directions', 0.034), ('contradiction', 0.034), ('waveform', 0.034), ('synthetic', 0.033), ('cation', 0.032), ('hessian', 0.031), ('residual', 0.031), ('german', 0.031), ('ionosphere', 0.031), ('shen', 0.03), ('advances', 0.029), ('realizations', 0.029), ('speech', 0.029), ('performance', 0.029), ('label', 0.029), ('sampling', 0.029), ('gaussians', 0.028), ('naive', 0.028), ('consist', 0.028), ('pj', 0.028), ('tr', 0.027), ('lt', 0.027), ('pronounced', 0.027), ('training', 0.027), ('lab', 0.026), ('kernel', 0.026), ('class', 0.026), ('pages', 0.026), ('pi', 0.026), ('benchmark', 0.025), ('neighbors', 0.025), ('minimizes', 0.025), ('gaussian', 0.025), ('global', 0.025), ('digits', 0.025), ('jaakkola', 0.025), ('pattern', 0.025), ('wang', 0.025), ('expected', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
2 0.28886399 138 nips-2010-Large Margin Multi-Task Metric Learning
Author: Shibin Parameswaran, Kilian Q. Weinberger
Abstract: Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to support vector machines (svm) by Evgeniou et al. [15]. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers. 1
3 0.15954235 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
4 0.14789587 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
Author: Mario Fritz, Kate Saenko, Trevor Darrell
Abstract: Metric constraints are known to be highly discriminative for many objects, but if training is limited to data captured from a particular 3-D sensor the quantity of training data may be severly limited. In this paper, we show how a crucial aspect of 3-D information–object and feature absolute size–can be added to models learned from commonly available online imagery, without use of any 3-D sensing or reconstruction at training time. Such models can be utilized at test time together with explicit 3-D sensing to perform robust search. Our model uses a “2.1D” local feature, which combines traditional appearance gradient statistics with an estimate of average absolute depth within the local window. We show how category size information can be obtained from online images by exploiting relatively unbiquitous metadata fields specifying camera intrinstics. We develop an efficient metric branch-and-bound algorithm for our search task, imposing 3-D size constraints as part of an optimal search for a set of features which indicate the presence of a category. Experiments on test scenes captured with a traditional stereo rig are shown, exploiting training data from from purely monocular sources with associated EXIF metadata. 1
5 0.1375265 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
Author: Siwei Lyu
Abstract: Divisive normalization (DN) has been advocated as an effective nonlinear efficient coding transform for natural sensory signals with applications in biology and engineering. In this work, we aim to establish a connection between the DN transform and the statistical properties of natural sensory signals. Our analysis is based on the use of multivariate t model to capture some important statistical properties of natural sensory signals. The multivariate t model justifies DN as an approximation to the transform that completely eliminates its statistical dependency. Furthermore, using the multivariate t model and measuring statistical dependency with multi-information, we can precisely quantify the statistical dependency that is reduced by the DN transform. We compare this with the actual performance of the DN transform in reducing statistical dependencies of natural sensory signals. Our theoretical analysis and quantitative evaluations confirm DN as an effective efficient coding transform for natural sensory signals. On the other hand, we also observe a previously unreported phenomenon that DN may increase statistical dependencies when the size of pooling is small. 1
6 0.11755527 287 nips-2010-Worst-Case Linear Discriminant Analysis
7 0.113399 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
8 0.10861441 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
9 0.10389563 250 nips-2010-Spectral Regularization for Support Estimation
10 0.10306538 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
11 0.098594971 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
12 0.091005944 116 nips-2010-Identifying Patients at Risk of Major Adverse Cardiovascular Events Using Symbolic Mismatch
13 0.089857869 280 nips-2010-Unsupervised Kernel Dimension Reduction
14 0.088743486 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
15 0.087031521 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
16 0.081330836 222 nips-2010-Random Walk Approach to Regret Minimization
17 0.079594716 228 nips-2010-Reverse Multi-Label Learning
18 0.079181015 223 nips-2010-Rates of convergence for the cluster tree
19 0.077655949 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
20 0.06454251 202 nips-2010-Parallelized Stochastic Gradient Descent
topicId topicWeight
[(0, 0.206), (1, 0.071), (2, 0.021), (3, -0.082), (4, 0.097), (5, 0.07), (6, 0.085), (7, -0.056), (8, -0.045), (9, -0.008), (10, 0.059), (11, -0.034), (12, 0.076), (13, -0.158), (14, 0.036), (15, -0.044), (16, 0.05), (17, -0.077), (18, -0.037), (19, -0.029), (20, -0.115), (21, -0.146), (22, -0.095), (23, 0.049), (24, -0.024), (25, -0.054), (26, 0.196), (27, 0.043), (28, -0.181), (29, -0.188), (30, -0.135), (31, 0.055), (32, -0.073), (33, -0.16), (34, 0.127), (35, 0.069), (36, -0.003), (37, -0.066), (38, 0.016), (39, 0.01), (40, -0.062), (41, -0.019), (42, -0.112), (43, 0.067), (44, 0.037), (45, 0.099), (46, 0.112), (47, 0.094), (48, -0.094), (49, -0.163)]
simIndex simValue paperId paperTitle
same-paper 1 0.96696234 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
2 0.67189997 138 nips-2010-Large Margin Multi-Task Metric Learning
Author: Shibin Parameswaran, Kilian Q. Weinberger
Abstract: Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to support vector machines (svm) by Evgeniou et al. [15]. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers. 1
3 0.57285893 287 nips-2010-Worst-Case Linear Discriminant Analysis
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
4 0.54730141 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
Author: Yi-da Wu, Shi-jie Lin, Hsin Chen
Abstract: The Diffusion Network(DN) is a stochastic recurrent network which has been shown capable of modeling the distributions of continuous-valued, continuoustime paths. However, the dynamics of the DN are governed by stochastic differential equations, making the DN unfavourable for simulation in a digital computer. This paper presents the implementation of the DN in analogue Very Large Scale Integration, enabling the DN to be simulated in real time. Moreover, the logdomain representation is applied to the DN, allowing the supply voltage and thus the power consumption to be reduced without limiting the dynamic ranges for diffusion processes. A VLSI chip containing a DN with two stochastic units has been designed and fabricated. The design of component circuits will be described, so will the simulation of the full system be presented. The simulation results demonstrate that the DN in VLSI is able to regenerate various types of continuous paths in real-time. 1
5 0.52078521 280 nips-2010-Unsupervised Kernel Dimension Reduction
Author: Meihong Wang, Fei Sha, Michael I. Jordan
Abstract: We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.
6 0.4996455 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
7 0.49254945 124 nips-2010-Inductive Regularized Learning of Kernel Functions
8 0.46109232 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
9 0.43494859 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
10 0.42067823 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
11 0.40194437 116 nips-2010-Identifying Patients at Risk of Major Adverse Cardiovascular Events Using Symbolic Mismatch
12 0.40134901 250 nips-2010-Spectral Regularization for Support Estimation
13 0.3948606 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
14 0.39120421 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
15 0.38981956 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
16 0.38081264 2 nips-2010-A Bayesian Approach to Concept Drift
17 0.37765881 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
18 0.36657128 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
19 0.35705727 228 nips-2010-Reverse Multi-Label Learning
20 0.34371835 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
topicId topicWeight
[(13, 0.033), (17, 0.017), (27, 0.062), (30, 0.051), (35, 0.016), (45, 0.167), (48, 0.012), (50, 0.035), (52, 0.034), (60, 0.419), (77, 0.024), (78, 0.013), (90, 0.039)]
simIndex simValue paperId paperTitle
1 0.92385375 223 nips-2010-Rates of convergence for the cluster tree
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1
2 0.88991272 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan
Abstract: We propose a new LP relaxation for obtaining the MAP assignment of a binary MRF with pairwise potentials. Our relaxation is derived from reducing the MAP assignment problem to an instance of a recently proposed Bipartite Multi-cut problem where the LP relaxation is guaranteed to provide an O(log k) approximation where k is the number of vertices adjacent to non-submodular edges in the MRF. We then propose a combinatorial algorithm to efficiently solve the LP and also provide a lower bound by concurrently solving its dual to within an approximation. The algorithm is up to an order of magnitude faster and provides better MAP scores and bounds than the state of the art message passing algorithm of [1] that tightens the local marginal polytope with third-order marginal constraints. 1
3 0.88600338 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
Author: Tobias Glasmachers
Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1
same-paper 4 0.84684998 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
5 0.79636788 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
6 0.72263342 229 nips-2010-Reward Design via Online Gradient Ascent
7 0.6097154 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
8 0.60718721 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
9 0.58223206 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
10 0.57700574 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
11 0.57471353 4 nips-2010-A Computational Decision Theory for Interactive Assistants
12 0.57310593 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
13 0.57155013 102 nips-2010-Generalized roof duality and bisubmodular functions
14 0.57061243 138 nips-2010-Large Margin Multi-Task Metric Learning
15 0.57017547 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
17 0.5690369 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
18 0.5685125 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
19 0.56312168 287 nips-2010-Worst-Case Linear Discriminant Analysis
20 0.56083518 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods