jmlr jmlr2009 jmlr2009-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. [sent-8, score-0.329]
2 In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. [sent-10, score-0.278]
3 It can hardly be optimal to use the same distance metric for age and gender classification, even if in both tasks, distances are computed between the same sets of extracted features (e. [sent-30, score-0.393]
4 Motivated by these issues, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples (Chopra et al. [sent-33, score-0.305]
5 In this paper, we show how to learn a Mahalanobis distance metric for kNN classification. [sent-43, score-0.276]
6 The first term penalizes large distances between examples in the same class that are desired as k-nearest neighbors, while the second term penalizes small distances between examples with non-matching labels. [sent-49, score-0.352]
7 The Euclidean distances in the transformed space can equivalently be viewed as Mahalanobis distances in the original space. [sent-51, score-0.282]
8 We exploit this equivalence to cast the problem of distance metric learning as a problem in convex optimization. [sent-52, score-0.282]
9 These previous approaches, however, essentially attempt to learn distance metrics that cluster together all similarly labeled inputs, even those that are not k-nearest neighbors. [sent-64, score-0.3]
10 The globally integrated training of local distance metrics distinguishes our approach from earlier work on discriminant adaptive kNN classification (Hastie and Tibshirani, 1996). Our paper is organized as follows. [sent-77, score-0.277]
11 We obtain a family of metrics over X by computing Euclidean distances after performing a linear transformation x′ = Lx. [sent-104, score-0.332]
12 These metrics compute squared distances as: D_L(xi, xj) = ||L(xi − xj)||^2, (1) where the linear transformation in Eq. [sent-105, score-0.332]
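As a quick numerical illustration of this equivalence (a sketch added here, not code from the paper), the squared distance computed after an arbitrary linear map L matches the Mahalanobis form with M = L⊤L:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((d, d))      # arbitrary linear transformation x -> Lx
M = L.T @ L                          # induced Mahalanobis matrix (positive semidefinite)

xi, xj = rng.standard_normal(d), rng.standard_normal(d)

# Squared Euclidean distance computed in the transformed space ...
dist_transformed = np.sum((L @ (xi - xj)) ** 2)
# ... equals the Mahalanobis form (xi - xj)^T M (xi - xj) in the original space.
dist_mahalanobis = (xi - xj) @ M @ (xi - xj)

assert np.isclose(dist_transformed, dist_mahalanobis)
```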
13 It is common to express squared distances under the metric in Eq. [sent-109, score-0.296]
14 A Mahalanobis distance metric can be parameterized in terms of the matrix L or the matrix M. [sent-121, score-0.302]
15 Many researchers have proposed ways to estimate Mahalanobis distance metrics for the purpose of computing distances in kNN classification. [sent-127, score-0.364]
16 For kNN classification, one seeks a linear transformation such that nearest neighbors computed from the distances in Eq. [sent-132, score-0.44]
17 1 P RINCIPAL C OMPONENT A NALYSIS We briefly review principal component analysis (PCA) (Jolliffe, 1986) in the context of distance metric learning. [sent-147, score-0.308]
18 3 Convex Optimization Recall that the goal of distance metric learning can be stated in two ways: to learn a linear transformation xi → Lxi or, equivalently, to learn a Mahalanobis metric M = L⊤L. [sent-195, score-0.566]
19 It is possible to formulate certain types of distance metric learning as convex optimizations over the cone of positive semidefinite matrices M. [sent-196, score-0.304]
20 1 M AHALANOBIS M ETRIC FOR C LUSTERING A convex objective function for distance metric learning was first proposed by Xing et al. [sent-200, score-0.282]
21 MMC shares a similar goal as LDA: namely, to minimize the distances between similarly labeled inputs while maximizing the distances between differently labeled inputs. [sent-203, score-0.533]
22 MMC differs from LDA in its formulation of distance metric learning as a convex optimization problem. [sent-204, score-0.307]
23 (6) to compute the linear transformation L, MMC solves a convex optimization over the matrix M = L⊤ L that directly represents the Mahalanobis metric itself. [sent-206, score-0.3]
24 In terms of this notation, MMC attempts to maximize the distances between pairs of inputs with different labels (yi j = 0), while constraining the sum over squared distances of pairs of similarly labeled inputs (yi j = 1). [sent-209, score-0.497]
25 Thus a different objective for distance metric learning is needed to preserve this strength of kNN classification—namely, that it does not implicitly make parametric (or other limiting) assumptions about the input distributions. [sent-219, score-0.285]
26 Like LDA and MMC, POLA attempts to learn a metric that shrinks distances between similarly labeled inputs and expands distances between differently labeled inputs. [sent-225, score-0.693]
27 From streaming tuples of this form, POLA attempts to learn a Mahalanobis metric M and a scalar threshold b such that similarly labeled inputs are at most a distance of b − 1 apart, while differently labeled inputs are at least a distance of b + 1 apart. [sent-230, score-0.686]
28 The margin constraints enforced by POLA are designed to learn a distance metric under which all pairs of similarly labeled inputs are closer than all pairs of differently labeled inputs. [sent-243, score-0.562]
29 (2005) considered how to learn a Mahalanobis distance metric especially for kNN classification. [sent-248, score-0.276]
30 The stochastic classifier uses a Mahalanobis distance metric parameterized by the linear transformation x → Lx in Eqs. [sent-251, score-0.317]
31 The main advantage is that distance metric learning in MLCC can be formulated as a convex optimization over the space of positive semidefinite matrices. [sent-288, score-0.307]
32 In common with all of them, we attempt to learn a Mahalanobis distance metric of the form in Eqs. [sent-294, score-0.276]
33 4), our model was conceived specifically to learn a Mahalanobis distance metric that improves the accuracy of kNN classification. [sent-304, score-0.276]
34 Indeed, the three essential ingredients of our model are (i) its convex loss function, (ii) its goal of margin maximization, and (iii) the constraints on the distance metric imposed by accurate kNN classification. [sent-305, score-0.362]
35 1 Intuition and Terminology Our model is based on two simple intuitions (and idealizations) for robust kNN classification: first, that each training input xi should share the same label yi as its k nearest neighbors; second, that training inputs with different labels should be widely separated. [sent-307, score-0.417]
36 Specifically, one term penalizes large distances between nearby inputs with the same label, while the other term penalizes small distances between inputs with different labels. [sent-310, score-0.536]
37 Recall that the goal of learning is to estimate a distance metric under which each input xi has k nearest neighbors that share the same label yi . [sent-313, score-0.603]
38 The target neighbors of xi are those that we desire to be closest to xi ; in particular, we attempt to learn a linear transformation of the input space such that the resulting nearest neighbors of xi are indeed its target neighbors. [sent-315, score-0.701]
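A minimal sketch of the default target-neighbor assignment (the k nearest same-class examples under Euclidean distance, as described later in the text); the function name and array layout are illustrative assumptions, not the authors' code:

```python
import numpy as np

def target_neighbors(X, y, k=3):
    """Return an (n, k) array of indices: the k nearest same-class examples of
    each input under Euclidean distance. Sketch; assumes every class has more
    than k members."""
    n = X.shape[0]
    targets = np.empty((n, k), dtype=int)
    for i in range(n):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]                       # exclude the input itself
        d2 = np.sum((X[same] - X[i]) ** 2, axis=1)   # squared Euclidean distances
        targets[i] = same[np.argsort(d2)[:k]]
    return targets
```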
39 For kNN classification to succeed, the target neighbors of each input xi should be closer than all differently labeled inputs. [sent-325, score-0.342]
40 In particular, for each input xi , we can imagine the target neighbors as establishing a perimeter that differently labeled inputs should not invade. [sent-326, score-0.465]
41 We refer to the differently labeled inputs in the training set that invade this perimeter as impostors; the goal of learning (roughly speaking) is to minimize the number of impostors. [sent-327, score-0.294]
42 For an input xi with label yi and target neighbor xj, an impostor is any input xl with label yl ≠ yi such that ||L(xi − xl)||^2 ≤ ||L(xi − xj)||^2 + 1. [sent-332, score-0.404]
43 (10) In other words, an impostor xl is any differently labeled input that invades the perimeter plus unit margin defined by any target neighbor x j of the input xi . [sent-333, score-0.472]
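The impostor condition in Eq. (10) can be checked directly; the helper below is a hypothetical sketch, assuming X is an n-by-d array of inputs and L the current linear transformation:

```python
import numpy as np

def impostors(X, y, L, i, j):
    """Return indices l of differently labeled inputs x_l that satisfy
    ||L(x_i - x_l)||^2 <= ||L(x_i - x_j)||^2 + 1 for the target pair (i, j).
    Illustrative sketch, not the authors' implementation."""
    margin_radius = np.sum((L @ (X[i] - X[j])) ** 2) + 1.0
    diff = np.flatnonzero(y != y[i])                    # only differently labeled inputs
    d2 = np.sum(((X[i] - X[diff]) @ L.T) ** 2, axis=1)  # ||L(x_i - x_l)||^2 for each l
    return diff[d2 <= margin_radius]
```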
44 Before learning, a training input has both target neighbors and impostors in its local neighborhood. [sent-335, score-0.36]
45 The loss function consists of two terms, one which acts to pull target neighbors closer together, and another which acts to push differently labeled examples further apart. [sent-341, score-0.334]
46 The distance metric is optimized so that: (i) its k=3 target neighbors lie within a smaller radius after training; (ii) differently labeled inputs lie outside this smaller radius by some finite margin. [sent-343, score-0.596]
47 The first term in the loss function penalizes large distances between each input and its target neighbors. [sent-347, score-0.277]
48 (11) only penalizes large distances between inputs and their target neighbors; in particular, it does not penalize large distances between all similarly labeled inputs. [sent-351, score-0.493]
49 The second term in the loss function penalizes small distances between differently labeled examples. [sent-355, score-0.3]
50 , the input xl lies a safe distance away from xi ), then its hinge loss has a negative argument and makes no contribution to the overall loss function. [sent-364, score-0.314]
51 Finally, we combine the two terms εpull (L) and εpush (L) into a single loss function for distance metric learning. [sent-370, score-0.278]
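A naive sketch of the combined loss (1 − µ) εpull + µ εpush, reusing the target-neighbor index array assumed in the earlier sketch; it loops over all pairs and is meant only to make the two terms concrete, not to be an efficient implementation:

```python
import numpy as np

def lmnn_loss(L, X, y, targets, mu=0.5):
    """(1 - mu) * pull term + mu * push term with hinge loss.
    `targets[i]` holds the indices of x_i's target neighbors.
    Exposition-only sketch with an O(n^2 k) loop."""
    Xt = X @ L.T                                   # transformed inputs L x
    pull, push = 0.0, 0.0
    for i in range(X.shape[0]):
        for j in targets[i]:
            d_ij = np.sum((Xt[i] - Xt[j]) ** 2)
            pull += d_ij
            # hinge penalty for every differently labeled input that invades
            # the perimeter of this target pair by less than one unit margin
            d_il = np.sum((Xt[i] - Xt) ** 2, axis=1)
            hinge = np.maximum(0.0, 1.0 + d_ij - d_il)
            push += hinge[y != y[i]].sum()
    return (1.0 - mu) * pull + mu * push
```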
52 Both LMNN and NCA are designed to learn a Mahalanobis distance metric over the input space that improves kNN classification at test time. [sent-395, score-0.334]
53 3 Local Versus Global Distances We emphasize that the loss function for LMNN classification only penalizes large distances between target neighbors as opposed to all examples in the same class. [sent-416, score-0.367]
54 Such classes violate a basic assumption behind metric learning algorithms that attempt to shrink global distances between all similarly labeled examples. [sent-425, score-0.349]
55 In particular, we introduce nonnegative slack variables {ξi jl } for all triplets consisting of an input xi , one of its target neighbors xj , and an impostor xl . [sent-461, score-0.379]
56 Using the slack variables to monitor these margin violations, we obtain the SDP: Minimize (1 − µ) ∑_{i,j} (xi − xj)⊤ M (xi − xj) + µ ∑_{i,j,l} (1 − yil) ξijl, where the sums run over each input xi, its target neighbors xj, and differently labeled inputs xl, subject to: (1) (xi − xl)⊤ M (xi − xl) − (xi − xj)⊤ M (xi − xj) ≥ 1 − ξijl, (2) ξijl ≥ 0, (3) M ⪰ 0. [sent-464, score-0.33]
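For small data sets, the SDP above can be prototyped with a generic modeling tool; the sketch below uses cvxpy (a tooling assumption, not part of the paper) and enumerates all triplets, which scales poorly but mirrors the formulation one-to-one:

```python
import numpy as np
import cvxpy as cp

def lmnn_sdp(X, y, targets, mu=0.5):
    """Small-scale prototype of the LMNN semidefinite program (sketch)."""
    n, d = X.shape
    M = cp.Variable((d, d), PSD=True)                    # Mahalanobis matrix, constraint (3)
    objective, constraints = 0, []
    for i in range(n):
        for j in targets[i]:
            Cij = np.outer(X[i] - X[j], X[i] - X[j])
            objective += (1 - mu) * cp.trace(M @ Cij)    # pull term
            for l in np.flatnonzero(y != y[i]):          # impostor candidates
                Cil = np.outer(X[i] - X[l], X[i] - X[l])
                slack = cp.Variable(nonneg=True)         # xi_ijl, constraint (2)
                constraints.append(cp.trace(M @ Cil) - cp.trace(M @ Cij) >= 1 - slack)
                objective += mu * slack                  # hinge loss via slack
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve(solver=cp.SCS)
    return M.value
```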
57 The slack variables {ξi jl } are sparse because most inputs xi and xl are well separated relative to the distance between xi and any of its target neighbors xj . [sent-467, score-0.541]
58 In particular, for a test example xt with hypothetical label yt , we locate k (similarly labeled) target neighbors (as determined by Euclidean distance to xt or other a priori considerations) and then compute both terms in Eq. [sent-482, score-0.414]
59 For the first term, we accumulate the squared distances to the k target neighbors of xt . [sent-484, score-0.35]
60 , differently labeled examples) that invade the perimeter around xt as determined by its target neighbors; we also accumulate the hinge loss for differently labeled examples whose perimeters are invaded by xt . [sent-487, score-0.43]
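A hedged sketch of this energy-based decision rule: for every hypothetical label it accumulates the pull term to the k nearest examples of that label plus the hinge penalties from differently labeled examples that invade the resulting perimeter, and predicts the label of lowest energy. It omits the symmetric term in which the test point invades the perimeters of training examples; names and defaults are illustrative assumptions.

```python
import numpy as np

def energy_classify(x_test, X, y, L, k=3, mu=0.5):
    """Simplified energy-based LMNN classification of one test example (sketch)."""
    xt = L @ x_test
    Xt = X @ L.T
    d2 = np.sum((Xt - xt) ** 2, axis=1)       # squared distances to all training inputs
    energies = {}
    for label in np.unique(y):
        same = np.flatnonzero(y == label)
        targets = same[np.argsort(d2[same])[:k]]           # k target neighbors for this label
        pull = d2[targets].sum()
        push = sum(np.maximum(0.0, 1.0 + d2[j] - d2[y != label]).sum()
                   for j in targets)                        # hinge losses from impostors
        energies[label] = (1 - mu) * pull + mu * push
    return min(energies, key=energies.get)
```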
61 All variations of lmnn, rca and lda were applied after pre-processing with pca for general noise reduction. [sent-506, score-0.3]
62 The multiple metrics version of lmnn (mm-lmnn) is comparable with multiclass svm on most data sets (with 20news and yaleFaces as the only exceptions). [sent-509, score-0.772]
63 In all of our experiments, we chose the target neighbors based on Euclidean distance in the input space (after dimensionality reduction by PCA or LDA). [sent-541, score-0.36]
64 3 On data sets of this size, a distance metric can be learned in a matter of seconds. [sent-576, score-0.276]
65 3 shows that the LMNN metric outperforms the Euclidean metric and even improves on multiclass SVMs. [sent-617, score-0.339]
66 The middle row shows images from the same class that were among the 3-NN under the learned Mahalanobis metric (after training) but not among the original 3-NN under the Euclidean metric (before training). [sent-622, score-0.382]
67 5 shows some digits whose nearest neighbor changed as a result of learning, from a mismatch using Euclidean distances to a match using Mahalanobis distances. [sent-785, score-0.326]
68 In the absence of prior knowledge, a default choice is to use Euclidean distances to determine target nearest neighbors. [sent-801, score-0.294]
69 While the target nearest neighbors are fixed during learning, however, the actual nearest neighbors may change as a result of the linear transformation of the input space. [sent-802, score-0.608]
70 These changes suggest an iterative approach, in which the Mahalanobis distances learned in one application (or “pass”) of LMNN are used to determine the target nearest neighbors in a subsequent run of the algorithm. [sent-803, score-0.441]
71 For the (p+1)th pass, we can assign target neighbors using the Euclidean distance metric after the linear transformation xi → Lp Lp−1 · · · L1 xi. [sent-805, score-0.528]
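A sketch of this multi-pass scheme, assuming a hypothetical fit_lmnn routine that returns the transformation learned in one pass, and reusing the target_neighbors helper sketched earlier:

```python
import numpy as np

def multipass_lmnn(X, y, fit_lmnn, k=3, passes=3):
    """Compose the transformations of successive LMNN passes; target neighbors
    for pass p+1 are re-assigned by Euclidean distance under L_p ... L_1.
    `fit_lmnn(X_transformed, y, targets)` is a hypothetical placeholder."""
    d = X.shape[1]
    L_total = np.eye(d)
    for _ in range(passes):
        targets = target_neighbors(X @ L_total.T, y, k)   # re-assign target neighbors
        L_p = fit_lmnn(X @ L_total.T, y, targets)         # learn this pass's transformation
        L_total = L_p @ L_total                           # accumulate the composition
    return L_total
```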
72 The principal directions of individual distance metrics are indicated by arrows. [sent-818, score-0.279]
73 In this section, we show how to learn different Mahalanobis distance metrics for different examples in the input space. [sent-837, score-0.28]
74 While the training procedure couples the distance metrics in different clusters, the optimization remains a convex problem in semidefinite programming. [sent-845, score-0.332]
75 The globally integrated training of local distance metrics also distinguishes our approach from earlier work (Hastie and Tibshirani, 1996). [sent-846, score-0.277]
76 To learn these metrics from data, we solve a modified version of the original SDP: Minimize (1 − µ) ∑_{i,j} (xi − xj)⊤ M_{yj} (xi − xj) + µ ∑_{i,j,l} (1 − yil) ξijl subject to: (1) (xi − xl)⊤ M_{yl} (xi − xl) − (xi − xj)⊤ M_{yj} (xi − xj) ≥ 1 − ξijl, (2) ξijl ≥ 0, (3) M_c ⪰ 0 for each class c. [sent-866, score-0.403]
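The only change relative to the single-metric case is that the metric applied to a pair is indexed by the label of the second argument; a small illustrative helper (with Ms assumed to be a label-to-matrix mapping) makes this explicit:

```python
import numpy as np

def multi_metric_sq_distance(Ms, x_i, x_j, label_j):
    """Squared distance in the multiple-metrics extension:
    (x_i - x_j)^T M_{y_j} (x_i - x_j), where the metric belongs to the class
    (or cluster) of x_j. Sketch; `Ms` maps labels to PSD matrices."""
    diff = np.asarray(x_i) - np.asarray(x_j)
    return diff @ Ms[label_j] @ diff
```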
77 Figure 8: Multiple local distance metrics learned for a data set consisting of handwritten digits four, two, one and zero (panels labeled zeros, ones, twos, fours). [sent-873, score-0.284]
78 To speed up training, we initialized the multi-metric optimization by setting each class-dependent metric to the solution from LMNN classification with a single global distance metric. [sent-884, score-0.277]
79 The simplest brute-force way to locate a test example’s nearest neighbors is to compute its distance to all the training examples. [sent-930, score-0.41]
80 In particular, for any test example, a nearest neighbor query consists of computing the distance to each training example and comparing this distance to the k closest examples already located. [sent-937, score-0.458]
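A brute-force query is a few lines of array arithmetic; the sketch below is an illustration of the baseline being discussed, not the paper's implementation:

```python
import numpy as np

def knn_query_bruteforce(X_train, x_test, k=3):
    """Compute the distance from the test example to every training example
    and keep the k smallest (indices returned unordered)."""
    d2 = np.sum((X_train - x_test) ** 2, axis=1)   # squared Euclidean distances
    return np.argpartition(d2, k)[:k]              # indices of the k closest examples
```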
81 Thus we can often reduce the dimensionality of the input data without distorting the nearest neighbor relations. [sent-944, score-0.283]
82 The figure compares the speedups when ball trees are used (blue) versus when ball trees are not used (red) in the lower dimensional space. [sent-956, score-0.344]
83 The bounding regions are set up to guarantee that the distance from a test example to a training example inside the bounding region is at least as large as the distance from the test example to the region's boundary. [sent-968, score-0.298]
84 Thus, for each test example, the training examples inside the bounding region can be ruled out as k nearest neighbors if k training examples have already been found that are closer than the region’s boundary. [sent-969, score-0.367]
85 If a set S of training examples is encapsulated inside a ball with center c and radius r, such that ∀x ∈ S : ||x − c|| ≤ r, then for any test example xt we can bound the distance to any training example inside the ball by the following expression: ∀xi ∈ S : ||xt − xi|| ≥ max(||xt − c|| − r, 0). [sent-976, score-0.602]
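The pruning rule that follows from this bound can be written as a one-line test; the function below is a hedged sketch with illustrative names:

```python
import numpy as np

def can_prune_ball(x_test, center, radius, kth_distance):
    """Ball-tree pruning test based on ||x_t - x_i|| >= max(||x_t - c|| - r, 0):
    if even this lower bound exceeds the current k-th nearest distance, every
    training example inside the ball can be ruled out."""
    lower_bound = max(np.linalg.norm(x_test - center) - radius, 0.0)
    return lower_bound > kth_distance
```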
86 The data structure is based ... Figure 11: The basic idea behind ball trees: for any training example xi inside the ball we can bound the distance ||xt − xi|| from below using (17). [sent-978, score-0.749]
87 If another training example xj outside the ball is already known to be closer than this bound to the test example xt , then the training examples inside the ball can be ruled out as nearest neighbors. [sent-979, score-0.482]
88 Figure 12: The relative speed-up obtained using ball trees to search for margin violations. For any training example xi , and for any similarly labeled example xj that [sent-1006, score-0.39]
89 is one of its target k-nearest neighbors, the impostors consist of all differently labeled examples xl (with yil = 0) that satisfy Eq. [sent-1009, score-0.475]
90 As in their use for kNN search, many subtrees in the depth-first search for impostors can be pruned: if for some ball the lower bound on the distances between examples is already greater than the right-hand side of Eq. [sent-1012, score-0.367]
91 Note that for each training example xi , we only need to search for impostors among other training examples xl that have a different class label (with yil = 0). [sent-1014, score-0.406]
92 12 shows the relative speed-up when ball trees are used to search for margin violations in LMNN classification. [sent-1017, score-0.285]
93 There is an inherent trade-off between dimensionality reduction and nearest ... [Figure: 3-NN classification error (in %) after dimensionality reduction, with ball tree speedups ranging from 15x down to 3x.] [sent-1031, score-0.338]
94 In this section, we explore how the distance metric learned for LMNN classification can be used for more effective dimensionality reduction in ball trees. [sent-1047, score-0.438]
95 From labeled training examples, we have shown how to learn a Mahalanobis distance metric for kNN classification. [sent-1076, score-0.383]
96 At the other extreme, to build a kNN classifier that is as fast as possible at test time, our results suggest combining low-rank distance metrics with ball trees. [sent-1085, score-0.345]
97 The solver iteratively re-estimates the Mahalanobis distance metric as it attempts to minimize the cost function for LMNN classification. [sent-1108, score-0.307]
98 We refer to the Mahalanobis distance metric at the tth iteration as Mt and to its squared Mahalanobis distance in Eq. [sent-1113, score-0.349]
99 To avoid this computational burden, we exploit the fact that the great majority of triples do not incur margin violations: in particular, for each training example, only a very small fraction of differently labeled examples typically lie nearby in the input space. [sent-1148, score-0.288]
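A minimal sketch of one solver iteration under these assumptions: accumulate a gradient from the active set of margin-violating triples (computed elsewhere), take a step, and project back onto the PSD cone by truncating negative eigenvalues. This illustrates the projected-gradient idea, not the authors' optimized solver:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the positive semidefinite cone by
    clipping negative eigenvalues at zero."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def projected_gradient_step(M, grad, step_size):
    """One sketch iteration: descend along the accumulated gradient, keep the
    iterate symmetric, and restore feasibility (M >= 0)."""
    M_new = M - step_size * grad
    M_new = 0.5 * (M_new + M_new.T)   # guard against numerical asymmetry
    return project_psd(M_new)
```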
100 Distance metric learning for large margin nearest neighbor classification. [sent-1457, score-0.394]
wordName wordTfidf (topN-words)
[('lmnn', 0.617), ('knn', 0.399), ('mahalanobis', 0.21), ('nca', 0.192), ('metric', 0.155), ('distances', 0.141), ('metrics', 0.126), ('neighbors', 0.123), ('aul', 0.114), ('einberger', 0.114), ('lda', 0.113), ('nearest', 0.111), ('impostors', 0.108), ('rca', 0.102), ('distance', 0.097), ('ball', 0.097), ('etric', 0.097), ('semide', 0.093), ('istance', 0.092), ('pca', 0.085), ('mmc', 0.084), ('inputs', 0.081), ('neighbor', 0.074), ('mt', 0.066), ('trees', 0.065), ('dimensionality', 0.065), ('transformation', 0.065), ('pola', 0.06), ('classi', 0.058), ('principal', 0.056), ('margin', 0.054), ('lxi', 0.054), ('yil', 0.054), ('training', 0.054), ('labeled', 0.053), ('nt', 0.053), ('euclidean', 0.05), ('dm', 0.05), ('xl', 0.05), ('mnist', 0.049), ('violations', 0.048), ('images', 0.048), ('xi', 0.046), ('differently', 0.045), ('xt', 0.044), ('svms', 0.042), ('mlcc', 0.042), ('perimeter', 0.042), ('target', 0.042), ('gradient', 0.039), ('handwritten', 0.037), ('sdp', 0.037), ('cil', 0.036), ('shental', 0.036), ('solver', 0.036), ('goldberger', 0.036), ('torresani', 0.036), ('weinberger', 0.036), ('hinge', 0.036), ('penalizes', 0.035), ('input', 0.033), ('jl', 0.033), ('mcc', 0.03), ('convex', 0.03), ('earning', 0.029), ('multiclass', 0.029), ('accelerate', 0.027), ('triples', 0.027), ('gt', 0.027), ('loss', 0.026), ('projected', 0.026), ('push', 0.025), ('matrix', 0.025), ('onto', 0.025), ('optimization', 0.025), ('test', 0.025), ('neighborhood', 0.025), ('learn', 0.024), ('chunklet', 0.024), ('olivetti', 0.024), ('learned', 0.024), ('active', 0.024), ('projecting', 0.023), ('rectangular', 0.023), ('slack', 0.023), ('projection', 0.023), ('nearby', 0.022), ('cone', 0.022), ('search', 0.021), ('scales', 0.021), ('dimensional', 0.02), ('cw', 0.02), ('yt', 0.02), ('eigenvector', 0.02), ('pull', 0.02), ('improvements', 0.019), ('yi', 0.019), ('label', 0.019), ('minimize', 0.019), ('ci', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
2 0.32380357 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
3 0.086864032 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
4 0.082941718 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
5 0.07121072 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
6 0.064492382 23 jmlr-2009-Discriminative Learning Under Covariate Shift
7 0.061098706 38 jmlr-2009-Hash Kernels for Structured Data
8 0.059194516 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
9 0.056028277 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
10 0.050170146 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
11 0.04210984 50 jmlr-2009-Learning When Concepts Abound
12 0.039950978 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
13 0.039902203 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
14 0.038545139 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
15 0.036474437 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
16 0.036069427 13 jmlr-2009-Bounded Kernel-Based Online Learning
17 0.03252944 15 jmlr-2009-Cautious Collective Classification
18 0.032504983 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
19 0.032428227 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
20 0.031756159 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
topicId topicWeight
[(0, 0.221), (1, -0.027), (2, 0.101), (3, -0.08), (4, 0.047), (5, -0.252), (6, 0.154), (7, -0.024), (8, -0.459), (9, 0.051), (10, -0.293), (11, -0.228), (12, -0.085), (13, -0.134), (14, 0.006), (15, 0.028), (16, -0.058), (17, 0.052), (18, 0.062), (19, 0.015), (20, 0.211), (21, -0.047), (22, -0.084), (23, 0.062), (24, 0.062), (25, -0.042), (26, -0.037), (27, 0.033), (28, -0.018), (29, -0.075), (30, -0.024), (31, -0.031), (32, -0.118), (33, -0.034), (34, 0.076), (35, -0.017), (36, 0.03), (37, -0.023), (38, -0.043), (39, 0.028), (40, 0.001), (41, -0.023), (42, 0.04), (43, -0.013), (44, -0.006), (45, -0.021), (46, -0.022), (47, 0.037), (48, -0.068), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.91785598 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
2 0.88944638 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
3 0.27814806 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
4 0.27338347 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
5 0.25522017 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
6 0.24740328 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
7 0.21616358 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
8 0.19852486 38 jmlr-2009-Hash Kernels for Structured Data
9 0.18894748 15 jmlr-2009-Cautious Collective Classification
10 0.18538488 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
11 0.18272115 23 jmlr-2009-Discriminative Learning Under Covariate Shift
12 0.16877718 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
13 0.1670038 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
14 0.16643478 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
15 0.16414134 50 jmlr-2009-Learning When Concepts Abound
16 0.15986572 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
17 0.15583536 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
18 0.15526891 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
19 0.14730631 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
20 0.14270622 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
topicId topicWeight
[(8, 0.022), (11, 0.02), (19, 0.013), (26, 0.02), (38, 0.039), (47, 0.013), (52, 0.07), (55, 0.041), (58, 0.034), (66, 0.084), (68, 0.448), (90, 0.082), (96, 0.025)]
simIndex simValue paperId paperTitle
1 0.90038258 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables
same-paper 2 0.82577509 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
3 0.70379823 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2^2 , and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.50969392 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
5 0.48127955 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.46692994 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
7 0.43828791 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
8 0.42597118 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
9 0.42356598 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.41348371 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
11 0.41201821 38 jmlr-2009-Hash Kernels for Structured Data
12 0.40840504 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
13 0.40687165 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
14 0.40222108 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
15 0.40183347 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
16 0.39528543 25 jmlr-2009-Distributed Algorithms for Topic Models
17 0.39517847 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
18 0.39332843 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
19 0.39250442 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models