nips nips2003 nips2003-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew Schultz, Thorsten Joachims
Abstract: This paper presents a method for learning a distance metric from relative comparisons such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper presents a method for learning a distance metric from relative comparisons such as “A is closer to B than A is to C”. [sent-3, score-0.688]
2 Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. [sent-4, score-0.122]
3 We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. [sent-5, score-0.129]
4 We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. [sent-6, score-0.074]
5 1 Introduction Distance metrics are an essential component in many applications ranging from supervised learning and clustering to product recommendations and document browsing. [sent-7, score-0.197]
6 Since designing such metrics by hand is difficult, we explore the problem of learning a metric from examples. [sent-8, score-0.402]
7 In particular, we consider relative and qualitative examples of the form “A is closer to B than A is to C”. [sent-9, score-0.231]
8 We believe that feedback of this type is more easily available in many application settings than quantitative examples (e. [sent-10, score-0.073]
9 35”) as considered in metric Multidimensional Scaling (MDS) (see [4]), or absolute qualitative feedback (e. [sent-13, score-0.381]
10 Building on the study in [7], search-engine query logs are one example where feedback of the form “A is closer to B than A is to C” is readily available for learning a (more semantic) similarity metric on documents. [sent-16, score-0.438]
11 Given a ranked result list for a query, documents that are clicked on can be assumed to be semantically closer than those documents that the user observed but decided not to click on (i.e. “Aclick is closer to Bclick than Aclick is to Cnoclick”). [sent-17, score-0.262] [sent-19, score-0.077]
13 In contrast, drawing the conclusion that “Aclick and Cnoclick are not similar” is probably less justified, since a Cnoclick high in the presented ranking is probably still closer to Aclick than most documents in the collection. [sent-20, score-0.198]
14 In this paper, we present an algorithm that can learn a distance metric from such relative and qualitative examples. [sent-21, score-0.7]
15 Given a parametrized family of distance metrics, the algorithm discriminatively searches for the parameters that best fulfill the training examples. [sent-22, score-0.339]
16 Taking a maximum-margin approach [9], we formulate the training problem as a convex quadratic program for the case of learning a weighting of the dimensions. [sent-23, score-0.139]
17 We evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. [sent-24, score-0.074]
18 Vectors are denoted with an arrow x where xi is the ith entry in vector x. [sent-26, score-0.112]
19 The vector 0 is the vector composed of all zeros, and 1 is the vector composed of all ones. [sent-27, score-0.054]
20 As training data, we receive a subset Ptrain of all potential relative comparisons defined over the set Xtrain . [sent-41, score-0.199]
21 Each relative comparison (i, j, k) ∈ Ptrain with xi, xj, xk ∈ Xtrain has the semantics “xi is closer to xj than xi is to xk”. [sent-42, score-1.017]
22 The goal of the learner is to learn a weighted distance metric dw (·, ·) from Ptrain and Xtrain that best approximates the desired notion of distance on a new set of test points Xtest , Xtrain ∩ Xtest = ∅. [sent-43, score-0.97]
23 We evaluate the performance of a metric dw(·, ·) by how many of the relative comparisons in Ptest it fulfills on the test set. [sent-44, score-0.5]
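To make this evaluation protocol concrete, here is a minimal sketch (not code from the paper; the toy data, array shapes, and function names are assumptions) of scoring a diagonal-weighted Euclidean metric by the fraction of relative comparisons it fulfills:

```python
# Sketch: score a diagonal-weighted Euclidean metric by the fraction of
# relative comparisons (i, j, k) -- "x_i is closer to x_j than to x_k" --
# that it fulfills on a held-out set.
import numpy as np

def weighted_distance(x, y, w):
    """d_w(x, y) = sqrt(sum_i w_i (x_i - y_i)^2) for non-negative weights w."""
    diff = x - y
    return np.sqrt(np.dot(w, diff * diff))

def fraction_fulfilled(X, triplets, w):
    """Fraction of triplets with d_w(x_i, x_j) < d_w(x_i, x_k)."""
    hits = sum(
        weighted_distance(X[i], X[j], w) < weighted_distance(X[i], X[k], w)
        for i, j, k in triplets
    )
    return hits / len(triplets)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))        # toy document vectors
    w = np.ones(20)                       # unweighted baseline
    Ptest = [tuple(rng.choice(100, 3, replace=False)) for _ in range(500)]
    print("fraction fulfilled:", fraction_fulfilled(X, Ptest, w))
```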
24 3 Parameterized Distance Metrics A (pseudo) distance metric d(x, y) is a function over pairs of objects x and y from some set X. [sent-45, score-0.551]
25 d(x, y) is a pseudo metric iff it obeys the following four properties for all x, y, and z: d(x, x) = 0; d(x, y) = d(y, x); d(x, y) ≥ 0; d(x, y) + d(y, z) ≥ d(x, z). It is a metric iff it also obeys d(x, y) = 0 ⇒ x = y. [sent-46, score-0.145]
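As a small illustration (an invented example, not part of the paper), the following sketch numerically spot-checks these properties for a weighted Euclidean distance with non-negative weights:

```python
# Sketch: numerically spot-check the pseudo-metric properties (identity at
# x = x, symmetry, non-negativity, triangle inequality) on random samples.
import numpy as np

def check_pseudo_metric(d, dim=5, trials=1000, tol=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y, z = rng.normal(size=(3, dim))
        assert abs(d(x, x)) <= tol                      # d(x, x) = 0
        assert abs(d(x, y) - d(y, x)) <= tol            # symmetry
        assert d(x, y) >= -tol                          # non-negativity
        assert d(x, y) + d(y, z) >= d(x, z) - tol       # triangle inequality
    return True

w = np.array([2.0, 0.5, 1.0, 0.0, 3.0])                 # non-negative weights
d_w = lambda x, y: np.sqrt(np.dot(w, (x - y) ** 2))
print(check_pseudo_metric(d_w))                          # passes all checks
```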
26 In this paper, we consider a distance metric dA,W(x, y) between vectors x, y ∈ ℝ^N, parameterized by two matrices, A and W. [sent-47, score-0.551]
27 dA,W(x, y) = sqrt((x − y)^T A W A^T (x − y))   (1)   W is a diagonal matrix with non-negative entries and A is any real matrix. [sent-48, score-0.031]
28 Note that the matrix A W A^T is positive semi-definite, so that dA,W(x, y) is a valid distance metric. [sent-49, score-0.316]
29 In the simplest case, A is the identity matrix I, and dI,W(x, y) = sqrt((x − y)^T I W I^T (x − y)) = sqrt((x − y)^T W (x − y)) is a weighted Euclidean distance dI,W(x, y) = sqrt(Σ_i Wii (xi − yi)^2). [sent-51, score-0.285]
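A minimal sketch (NumPy-based; the helper name and toy inputs are assumptions) of computing dA,W for an arbitrary real matrix A and a non-negative diagonal W, supplied as a weight vector:

```python
# Sketch: d_{A,W}(x, y) = sqrt((x - y)^T A W A^T (x - y)), with W given
# as its vector of non-negative diagonal entries.
import numpy as np

def d_AW(x, y, A, w_diag):
    z = A.T @ (x - y)                      # project the difference with A^T
    return np.sqrt(np.dot(w_diag, z * z))  # weighted squared norm, then sqrt

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
print(d_AW(x, y, np.eye(5), np.ones(5)))                    # plain Euclidean
print(d_AW(x, y, rng.normal(size=(5, 5)), rng.random(5)))   # general A and W
```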
30 This corresponds to applying a linear transformation to the input data with the matrix A. [sent-53, score-0.054]
31 After the transformation, the distance becomes a Euclidean distance on the transformed input points AT x, AT y. [sent-54, score-0.617]
32 Let Φ be the matrix where the i-th column is the (training) vector xi projected into a feature space using the function φ(xi ). [sent-56, score-0.173]
33 Then dΦ,W(φ(x), φ(y)) = sqrt(((φ(x) − φ(y))^T Φ) W (Φ^T (φ(x) − φ(y))))   (3)   = sqrt(Σ_{i=1}^n Wii (K(x, xi) − K(y, xi))^2)   (4)   is a distance metric in the feature space. [sent-57, score-0.775]
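A small sketch of this kernelized distance (assuming an RBF kernel, as used in the experiments below; the function names and toy data are illustrative, not the authors' code). Only kernel evaluations against the training vectors are needed:

```python
# Sketch: d_{Phi,W}(phi(x), phi(y)) = sqrt(sum_i W_ii (K(x, x_i) - K(y, x_i))^2)
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def d_kernel(x, y, X_train, w_diag, kernel=rbf_kernel):
    kx = np.array([kernel(x, xi) for xi in X_train])   # K(x, x_i) for all i
    ky = np.array([kernel(y, xi) for xi in X_train])   # K(y, x_i) for all i
    diff = kx - ky
    return np.sqrt(np.dot(w_diag, diff * diff))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
x, y = rng.normal(size=4), rng.normal(size=4)
print(d_kernel(x, y, X_train, np.ones(len(X_train))))
```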
34 4 An SVM Algorithm for Learning from Relative Comparisons Given a training set Ptrain of n relative comparisons over a set of vectors Xtrain , and the matrix A, we aim to fit the parameters in the diagonal matrix W of distance metric dA,W (x, y) so that the training error (i. [sent-58, score-0.866]
35 Finding a solution of zero training error is equivalent to finding a W that fulfills the following set of constraints. [sent-61, score-0.054]
36 ∀(i, j, k) ∈ Ptrain : dA,W (xi , xk ) − dA,W (xi , xj ) > 0 (5) If the set of constraints is feasible and a W exists that fulfills all constraints, the solution is typically not unique. [sent-62, score-0.329]
37 We aim to select a matrix AW AT such that dA,W (x, y) remains as close to an unweighted Euclidean metric as possible. [sent-63, score-0.36]
38 As in classification SVMs, we add slack variables [3] to account for constraints that cannot be satisfied. [sent-69, score-0.113]
39 min (1/2) ||A W A^T||_F^2 + C Σ_{i,j,k} ξijk
s.t. ∀(i,j,k) ∈ Ptrain: (xi − xk)^T A W A^T (xi − xk) − (xi − xj)^T A W A^T (xi − xj) ≥ 1 − ξijk,  ξijk ≥ 0,  Wii ≥ 0.
The sum of the slack variables ξijk in the objective is an upper bound on the number of violated constraints. [sent-71, score-0.06]
40 All distances dA,W (x, y) can be written in the following linear form. [sent-74, score-0.078]
41 If we let w be the vector of diagonal elements of W, then the distance can be written as dA,W(x, y) = sqrt(((x − y)^T A) W (A^T (x − y))) = sqrt(w^T ((A^T x − A^T y) ∗ (A^T x − A^T y)))   (6)   where ∗ denotes the element-wise product. [sent-75, score-0.285]
42 If we let ∆xi,xj = (A^T xi − A^T xj) ∗ (A^T xi − A^T xj), then the constraints in the optimization problem can be rewritten in the linear form w^T(∆xi,xk − ∆xi,xj) ≥ 1 − ξijk. [sent-76, score-0.112]
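A sketch of the resulting quadratic program for the simplest case A = I, in which ||A W A^T||_F^2 reduces to ||w||^2. This uses cvxpy as a generic off-the-shelf QP solver rather than the adapted SVM training procedure described in the paper, and the toy triplets are purely illustrative:

```python
# Sketch: learn the diagonal weights w from relative comparisons (A = I).
# Each (i, j, k) yields w^T (Delta_{i,k} - Delta_{i,j}) >= 1 - xi_{ijk},
# with Delta_{x,y} = (x - y) * (x - y) elementwise.
import numpy as np
import cvxpy as cp

def learn_weights(X, triplets, C=1.0):
    d = X.shape[1]
    # One row per constraint: Delta_{i,k} - Delta_{i,j}
    D = np.array([(X[i] - X[k]) ** 2 - (X[i] - X[j]) ** 2
                  for i, j, k in triplets])
    w = cp.Variable(d, nonneg=True)                  # diagonal of W
    xi = cp.Variable(len(triplets), nonneg=True)     # slack variables
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [D @ w >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                        # toy data
trips = [tuple(rng.choice(50, 3, replace=False)) for _ in range(200)]
print(learn_weights(X, trips)[:3])                   # first few learned weights
```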
43 In example 1, A is the identity matrix and in example 2 A is composed of the training examples projected into high dimensional space using an RBF kernel. [sent-78, score-0.168]
44 Note that L is positive semi-definite in both cases and that, therefore, the optimization problem is convex quadratic. [sent-83, score-0.049]
45 5 Experiments In Figure 1, we display a graphical example of our method. [sent-84, score-0.026]
46 The input data points are shown in Figure 1a), and our training constraints specify that the distance between two square points should be less than the distance to a circle. [sent-86, score-0.778]
47 Similarly, circles should be closer to each other than to squares. [sent-87, score-0.077]
48 Figure 1 (1b) shows the points after an MDS analysis with the learned distance metric as input. [sent-88, score-0.662]
49 This learned distance metric intuitively corresponds to stretching the x-axis and shrinking the y-axis in the original input space. [sent-89, score-0.661]
50 In this example though, there is no way to use a linear weighting measure to accomplish this task. [sent-91, score-0.039]
51 We used an RBF kernel and learned a distance metric to separate the clusters. [sent-92, score-0.638]
52 We first split X, the set of all 4,183 documents, into separate training and test sets, Xtrain and Xtest . [sent-97, score-0.086]
53 70% of all examples in X were added to Xtrain and the remaining 30% to Xtest. [sent-98, score-0.049]
54 We used a binary feature vector without stemming or stop word removal (63,949 features) to represent each document because it is the least biased distance metric to start out with. [sent-99, score-0.701]
55 It also performed best among several different variations of term weighting, stemming and stopword removal. [sent-100, score-0.111]
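A minimal sketch of such a binary term-presence representation (using scikit-learn's CountVectorizer as a stand-in for the paper's preprocessing; the example documents are invented):

```python
# Sketch: binary bag-of-words vectors without stemming or stop-word removal.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "home page of a computer science student",
    "course page for introduction to machine learning",
    "faculty member research interests and publications",
]
vectorizer = CountVectorizer(binary=True)    # 0/1 term presence only
X = vectorizer.fit_transform(docs)           # sparse (n_docs x n_terms) matrix
print(X.shape)
print(X.toarray()[0])
```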
56 The relative comparison sets, Ptrain and Ptest , were generated as follows. [sent-101, score-0.06]
57 We present results for learning three different notions of distance. [sent-102, score-0.03]
58 • University Distance: This distance is small when the two examples, x, y, are from the same university and larger otherwise. [sent-103, score-0.285]
59 For this data set we used webpages from seven universities. [sent-104, score-0.104]
60 • Topic Distance: This distance metric is small when the two examples, x, y, are from the same topic (e.g. both are student webpages) and larger when they are each from a different topic. [sent-105, score-0.856] [sent-107, score-0.167]
62 • Topic+FacultyStudent Distance: Again, when two examples, x, y, are from the same topic they have a small distance between them, and a larger distance when they come from different topics. [sent-109, score-0.875]
63 However, we add the additional constraint that the distance between a faculty and a student page is smaller than the distance to pages from other topics. [sent-110, score-0.896]
64 To build the training constraints, Ptrain , we first randomly selected three documents, xi , xj , xk , from Xtrain . [sent-111, score-0.412]
65 For the University Distance we added the triplet (i, j, k) to Ptrain if xi and xj were from the same university and xk was from a different university. [sent-112, score-0.381]
66 In building Ptrain for the Topic Distance we added (i, j, k) to Ptrain if xi and xj were from the same topic (e.g. “Student Webpages”) and xk was from a different topic. [sent-113, score-0.518] [sent-115, score-0.473]
68 Thus the constraints would specify that a student webpage is closer to a faculty webpage than a faculty webpage is to a course webpage. [sent-119, score-0.968]
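A sketch of this triplet-sampling scheme (not the authors' code; the label names and counts are made up): draw three documents at random and keep the triplet (i, j, k) when the first two share a label and the third does not:

```python
# Sketch: generate Ptrain-style triplets from class labels (topic or university).
import numpy as np

def sample_triplets(labels, n_draws, seed=0):
    rng = np.random.default_rng(seed)
    n = len(labels)
    triplets = []
    for _ in range(n_draws):
        i, j, k = rng.choice(n, size=3, replace=False)
        if labels[i] == labels[j] and labels[i] != labels[k]:
            triplets.append((i, j, k))
    return triplets

labels = np.array(["student", "faculty", "course", "student", "faculty"] * 20)
triplets = sample_triplets(labels, 5000)
print(len(triplets), "triplets kept out of 5000 draws")
```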
69 [Table 1: Accuracy of different distance metrics on an unseen test set Ptest, comparing the learned dw(·, ·) against static baseline metrics for the University, Topic, and Topic+FacultyStudent distances; the numeric entries are not recoverable from this extraction.] [sent-120, score-0.057] [sent-129, score-0.478]
71 The results of the learned distance measures on unseen test sets Ptest are reported in Table 1. [sent-130, score-0.429]
72 We report the percentage of the relative comparisons in Ptest that were satisfied for each of the three experiments. [sent-132, score-0.145]
73 As a baseline for comparison, we give the results for the static (not learned) distance metric that performs best on the test set. [sent-133, score-0.605]
74 The best performing metric for all static Euclidean distances (Binary and TFIDF) used stemming and stopword removal, which our learned distance did not use. [sent-134, score-0.849]
75 For the other distances, both the Topic Distance and Topic+FacultyStudent Distance satisfied more than 13% more constraints in Ptest than the best unweighted distance. [sent-138, score-0.146]
76 For the second test, we illustrate on the Topic+FacultyStudent data set how the prediction accuracy of the method scales with the number of training constraints. [sent-140, score-0.077]
77 [Figure 2: Learning curves for the Topic+FacultyStudent dataset; the x-axis is the size of the training set Ptrain in thousands of constraints, and the y-axis is the percentage of constraints in Ptest that were satisfied.] [sent-148, score-0.16]
78 The resulting learning curve is shown in Figure 2, where we plot the training set size (in number of constraints) versus the percentage of test constraints satisfied. [sent-149, score-0.169]
79 The test set Ptest was held constant and sampled in the same way as the training set (|Ptest | = 85,907). [sent-150, score-0.086]
80 As a final test of our method, we graphically display our distance metrics in Table 7. [sent-152, score-0.521]
81 We plot three distance metrics: the standard binary distance for the Topic Distance (Figure a), the learned metric for the Topic Distance (Figure b), and the learned metric for the Topic+FacultyStudent Distance (Figure c). [sent-153, score-1.307]
82 To produce the plots in Table 7, all pairwise distances between the points in Xtest were computed and then projected into 2D using a classical, metric MDS algorithm [1]. [sent-154, score-0.398]
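A sketch of this visualization step (using scikit-learn's MDS with a precomputed dissimilarity matrix as a stand-in for the classical metric MDS implementation of [1]; the data and learned weights here are random placeholders):

```python
# Sketch: project points into 2D from a precomputed matrix of pairwise
# learned (diagonal-weighted) distances.
import numpy as np
from sklearn.manifold import MDS

def weighted_dist_matrix(X, w):
    diff = X[:, None, :] - X[None, :, :]              # all pairwise differences
    return np.sqrt(np.einsum("ijd,d->ij", diff ** 2, w))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
w = rng.random(8)                                     # stand-in learned weights
D = weighted_dist_matrix(X, w)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                         # 2D coordinates to plot
print(coords.shape)
```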
83 Figure a) in Table 7 is the result of using the pairwise distances resulting from the unweighted, binary L2 norm in MDS. [sent-155, score-0.109]
84 In Figure b) we see the results of the learned Topic Distance measure. [sent-157, score-0.087]
85 Figure c) shows the result of using the learned Topic+FacultyStudent Distance metric. [sent-159, score-0.087]
86 When compared to Figure b), the Faculty and Student webpages have now moved closer together as desired. [sent-160, score-0.181]
87 6 Related Work The most relevant related work is the work of Xing et al [11] which focused on the problem of learning a distance metric to increase the accuracy of nearest neighbor algorithms. [sent-161, score-0.551]
88 Their work used absolute, qualitative feedback such as “A is similar to B” or “A is dissimilar to B” which is different from the relative constraints considered here. [sent-162, score-0.258]
89 While [10] does not change the distance metric, [2] uses gradient descent to adapt a parameterized distance metric according to user feedback. [sent-165, score-0.886]
90 Metric MDS techniques take as input a matrix D of dissimilarities (or similarities) between all points in some collection and then seek to arrange the points in a d-dimensional space to minimize the stress. [sent-167, score-0.182]
91 The stress of the arrangement is roughly the difference between the distances in the d-dimensional space and the distances input in matrix D. [sent-168, score-0.21]
92 Our work differs because the input is a set of relative comparisons, not quantitative distances, and because it does not project the data into a lower-dimensional space. [sent-170, score-0.214]
93 Non-metric MDS is more similar to our technique than metric MDS. [sent-171, score-0.266]
94 Instead of preserving the exact input distances, non-metric MDS seeks to maintain the rank order of the distances. [sent-172, score-0.105]
95 However, the goal of our method is not a low dimensional projection, but a new distance metric in the original space. [sent-173, score-0.551]
96 7 Conclusion and Future Work In this paper we presented a method for learning a weighted Euclidean distance from relative constraints. [sent-174, score-0.345]
97 This was accomplished by solving a convex optimization problem similar to SVMs to find the maximum margin weight vector. [sent-175, score-0.049]
98 We evaluated the method on a collection of high dimensional text documents and showed that it can successfully learn different notions of distance. [sent-177, score-0.183]
99 Furthermore, the power of the method would be increased if it were possible to learn more complex metrics that go beyond feature weighting, for example by incorporating kernels in a more adaptive way. [sent-180, score-0.157]
100 Distance metric learning, with application to clustering with side information. [sent-249, score-0.3]
wordName wordTfidf (topN-words)
[('ptrain', 0.384), ('topic', 0.305), ('distance', 0.285), ('metric', 0.266), ('facultystudent', 0.216), ('ptest', 0.216), ('xtrain', 0.216), ('ijk', 0.192), ('aw', 0.188), ('mds', 0.171), ('xk', 0.168), ('student', 0.167), ('faculty', 0.159), ('metrics', 0.136), ('xtest', 0.12), ('xi', 0.112), ('webpages', 0.104), ('aclick', 0.096), ('tawat', 0.096), ('webpage', 0.095), ('learned', 0.087), ('comparisons', 0.085), ('wii', 0.083), ('constraints', 0.083), ('documents', 0.079), ('distances', 0.078), ('xj', 0.078), ('closer', 0.077), ('cnoclick', 0.072), ('qualitative', 0.068), ('ful', 0.067), ('wt', 0.065), ('unweighted', 0.063), ('stemming', 0.063), ('relative', 0.06), ('lls', 0.057), ('dw', 0.057), ('training', 0.054), ('tfidf', 0.053), ('project', 0.053), ('semantic', 0.052), ('multidimensional', 0.049), ('stopword', 0.048), ('feedback', 0.047), ('euclidean', 0.046), ('graphically', 0.042), ('satis', 0.04), ('weighting', 0.039), ('course', 0.038), ('lw', 0.035), ('cornell', 0.035), ('clustering', 0.034), ('pseudo', 0.033), ('test', 0.032), ('collection', 0.032), ('obeys', 0.032), ('matrix', 0.031), ('binary', 0.031), ('projected', 0.03), ('notions', 0.03), ('indexing', 0.03), ('slack', 0.03), ('violated', 0.03), ('removal', 0.029), ('query', 0.027), ('seeks', 0.027), ('document', 0.027), ('user', 0.027), ('composed', 0.027), ('examples', 0.026), ('display', 0.026), ('exibility', 0.025), ('convex', 0.025), ('unseen', 0.025), ('xing', 0.025), ('table', 0.024), ('optimization', 0.024), ('iff', 0.024), ('points', 0.024), ('percent', 0.023), ('input', 0.023), ('illustrate', 0.023), ('added', 0.023), ('parameterized', 0.023), ('svm', 0.023), ('static', 0.022), ('text', 0.021), ('quadratic', 0.021), ('learn', 0.021), ('yn', 0.021), ('modelling', 0.021), ('probably', 0.021), ('corinna', 0.021), ('tsang', 0.021), ('versatility', 0.021), ('logs', 0.021), ('buja', 0.021), ('satisfied', 0.021), ('arrange', 0.021), ('claire', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 108 nips-2003-Learning a Distance Metric from Relative Comparisons
Author: Matthew Schultz, Thorsten Joachims
Abstract: This paper presents a method for learning a distance metric from relative comparison such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1
2 0.12059696 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
3 0.11234441 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
Author: John C. Platt
Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijsktra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijsktra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1
4 0.1103695 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet
Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1
5 0.10163052 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
Author: Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, David M. Blei
Abstract: We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting—which of the large collection of possible trees to use? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts. 1
6 0.09649013 126 nips-2003-Measure Based Regularization
7 0.082004473 46 nips-2003-Clustering with the Connectivity Kernel
8 0.076857597 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
9 0.072091252 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
10 0.07028687 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
11 0.069793433 118 nips-2003-Link Prediction in Relational Data
12 0.06579791 148 nips-2003-Online Passive-Aggressive Algorithms
13 0.065548703 164 nips-2003-Ranking on Data Manifolds
14 0.064219989 107 nips-2003-Learning Spectral Clustering
15 0.059669703 145 nips-2003-Online Classification on a Budget
16 0.059460744 113 nips-2003-Learning with Local and Global Consistency
17 0.055948168 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
18 0.055076372 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
19 0.052737612 73 nips-2003-Feature Selection in Clustering Problems
20 0.051884707 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
topicId topicWeight
[(0, -0.169), (1, -0.093), (2, -0.062), (3, 0.017), (4, 0.011), (5, 0.044), (6, -0.09), (7, 0.06), (8, 0.088), (9, -0.05), (10, 0.13), (11, -0.006), (12, 0.017), (13, -0.036), (14, 0.035), (15, -0.026), (16, -0.004), (17, 0.021), (18, -0.085), (19, 0.022), (20, 0.085), (21, -0.067), (22, -0.072), (23, 0.058), (24, 0.009), (25, -0.029), (26, -0.069), (27, -0.123), (28, 0.184), (29, 0.179), (30, 0.048), (31, -0.061), (32, -0.04), (33, 0.031), (34, -0.07), (35, 0.106), (36, 0.02), (37, -0.023), (38, -0.028), (39, 0.118), (40, 0.108), (41, -0.189), (42, 0.024), (43, -0.074), (44, 0.036), (45, 0.092), (46, 0.158), (47, -0.012), (48, -0.099), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.94686443 108 nips-2003-Learning a Distance Metric from Relative Comparisons
Author: Matthew Schultz, Thorsten Joachims
Abstract: This paper presents a method for learning a distance metric from relative comparison such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1
2 0.68467605 71 nips-2003-Fast Embedding of Sparse Similarity Graphs
Author: John C. Platt
Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijsktra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijsktra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1
3 0.60041988 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet
Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1
4 0.49580866 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.473362 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
Author: Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, David M. Blei
Abstract: We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting—which of the large collection of possible trees to use? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts. 1
6 0.418843 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
7 0.39931053 46 nips-2003-Clustering with the Connectivity Kernel
8 0.39342552 118 nips-2003-Link Prediction in Relational Data
9 0.39046511 126 nips-2003-Measure Based Regularization
10 0.38086954 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
11 0.37863341 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
12 0.30166969 120 nips-2003-Locality Preserving Projections
13 0.28621459 128 nips-2003-Minimax Embeddings
14 0.2776807 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
15 0.27110243 145 nips-2003-Online Classification on a Budget
16 0.27060273 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
17 0.26973048 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
18 0.26803142 152 nips-2003-Pairwise Clustering and Graphical Models
19 0.25704592 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
20 0.25070095 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
topicId topicWeight
[(0, 0.031), (11, 0.013), (35, 0.032), (53, 0.072), (71, 0.602), (76, 0.032), (85, 0.048), (91, 0.065)]
simIndex simValue paperId paperTitle
1 0.9717226 133 nips-2003-Mutual Boosting for Contextual Inference
Author: Michael Fink, Pietro Perona
Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. 
While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). 
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. 
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex
same-paper 2 0.96765006 108 nips-2003-Learning a Distance Metric from Relative Comparisons
Author: Matthew Schultz, Thorsten Joachims
Abstract: This paper presents a method for learning a distance metric from relative comparison such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1
3 0.96224409 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
4 0.95938891 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines
Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
5 0.95112556 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
6 0.76005012 117 nips-2003-Linear Response for Approximate Inference
7 0.65397036 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
8 0.61868042 41 nips-2003-Boosting versus Covering
9 0.61429769 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
10 0.61116892 145 nips-2003-Online Classification on a Budget
11 0.60626328 121 nips-2003-Log-Linear Models for Label Ranking
12 0.60608661 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
13 0.60125011 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
14 0.59918171 100 nips-2003-Laplace Propagation
15 0.59487754 122 nips-2003-Margin Maximizing Loss Functions
16 0.59403175 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
17 0.57748699 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
18 0.57715559 187 nips-2003-Training a Quantum Neural Network
19 0.57199025 85 nips-2003-Human and Ideal Observers for Detecting Image Curves
20 0.5706718 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions