jmlr jmlr2009 jmlr2009-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. [sent-8, score-0.329]
2 In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. [sent-10, score-0.278]
3 It can hardly be optimal to use the same distance metric for age and gender classification, even if in both tasks, distances are computed between the same sets of extracted features (e. [sent-30, score-0.393]
4 Motivated by these issues, a number of researchers have demonstrated that kNN classification can be greatly improved by learning an appropriate distance metric from labeled examples (Chopra et al. [sent-33, score-0.305]
5 In this paper, we show how to learn a Mahalanobis distance metric for kNN classification. [sent-43, score-0.276]
6 The first term penalizes large distances between examples in the same class that are desired as k-nearest neighbors, while the second term penalizes small distances between examples with non-matching labels. [sent-49, score-0.352]
7 The Euclidean distances in the transformed space can equivalently be viewed as Mahalanobis distances in the original space. [sent-51, score-0.282]
8 We exploit this equivalence to cast the problem of distance metric learning as a problem in convex optimization. [sent-52, score-0.282]
9 These previous approaches, however, essentially attempt to learn distance metrics that cluster together all similarly labeled inputs, even those that are not k-nearest neighbors. [sent-64, score-0.3]
10 The globally integrated training of local distance metrics distinguishes our approach from earlier work on discriminant adaptive kNN classification (Hastie and Tibshirani, 1996). Our paper is organized as follows. [sent-77, score-0.277]
11 We obtain a family of metrics over X by computing Euclidean distances after performing a linear transformation x′ = Lx. [sent-104, score-0.332]
12 These metrics compute squared distances as: D_L(xi, xj) = ||L(xi − xj)||^2, (1) where the linear transformation in Eq. [sent-105, score-0.332]
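As a quick numerical illustration of this equivalence (a sketch added here, not code from the paper), the squared distance computed after an arbitrary linear map L matches the Mahalanobis form with M = L⊤L:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((d, d))      # arbitrary linear transformation x -> Lx
M = L.T @ L                          # induced Mahalanobis matrix (positive semidefinite)

xi, xj = rng.standard_normal(d), rng.standard_normal(d)

# Squared Euclidean distance computed in the transformed space ...
dist_transformed = np.sum((L @ (xi - xj)) ** 2)
# ... equals the Mahalanobis form (xi - xj)^T M (xi - xj) in the original space.
dist_mahalanobis = (xi - xj) @ M @ (xi - xj)

assert np.isclose(dist_transformed, dist_mahalanobis)
```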
13 It is common to express squared distances under the metric in Eq. [sent-109, score-0.296]
14 A Mahalanobis distance metric can be parameterized in terms of the matrix L or the matrix M. [sent-121, score-0.302]
15 Many researchers have proposed ways to estimate Mahalanobis distance metrics for the purpose of computing distances in kNN classification. [sent-127, score-0.364]
16 For kNN classification, one seeks a linear transformation such that nearest neighbors computed from the distances in Eq. [sent-132, score-0.44]
17 1 P RINCIPAL C OMPONENT A NALYSIS We briefly review principal component analysis (PCA) (Jolliffe, 1986) in the context of distance metric learning. [sent-147, score-0.308]
18 3 Convex Optimization Recall that the goal of distance metric learning can be stated in two ways: to learn a linear transformation xi → Lxi or, equivalently, to learn a Mahalanobis metric M = L⊤L. [sent-195, score-0.566]
19 It is possible to formulate certain types of distance metric learning as convex optimizations over the cone of positive semidefinite matrices M. [sent-196, score-0.304]
20 1 M AHALANOBIS M ETRIC FOR C LUSTERING A convex objective function for distance metric learning was first proposed by Xing et al. [sent-200, score-0.282]
21 MMC shares a similar goal as LDA: namely, to minimize the distances between similarly labeled inputs while maximizing the distances between differently labeled inputs. [sent-203, score-0.533]
22 MMC differs from LDA in its formulation of distance metric learning as a convex optimization problem. [sent-204, score-0.307]
23 (6) to compute the linear transformation L, MMC solves a convex optimization over the matrix M = L⊤ L that directly represents the Mahalanobis metric itself. [sent-206, score-0.3]
24 In terms of this notation, MMC attempts to maximize the distances between pairs of inputs with different labels (yi j = 0), while constraining the sum over squared distances of pairs of similarly labeled inputs (yi j = 1). [sent-209, score-0.497]
25 Thus a different objective for distance metric learning is needed to preserve this strength of kNN classification—namely, that it does not implicitly make parametric (or other limiting) assumptions about the input distributions. [sent-219, score-0.285]
26 Like LDA and MMC, POLA attempts to learn a metric that shrinks distances between similarly labeled inputs and expands distances between differently labeled inputs. [sent-225, score-0.693]
27 From streaming tuples of this form, POLA attempts to learn a Mahalanobis metric M and a scalar threshold b such that similarly labeled inputs are at most a distance of b − 1 apart, while differently labeled inputs are at least a distance of b + 1 apart. [sent-230, score-0.686]
28 The margin constraints enforced by POLA are designed to learn a distance metric under which all pairs of similarly labeled inputs are closer than all pairs of differently labeled inputs. [sent-243, score-0.562]
29 (2005) considered how to learn a Mahalanobis distance metric especially for kNN classification. [sent-248, score-0.276]
30 The stochastic classifier uses a Mahalanobis distance metric parameterized by the linear transformation x → Lx in Eqs. [sent-251, score-0.317]
31 The main advantage is that distance metric learning in MLCC can be formulated as a convex optimization over the space of positive semidefinite matrices. [sent-288, score-0.307]
32 In common with all of them, we attempt to learn a Mahalanobis distance metric of the form in Eqs. [sent-294, score-0.276]
33 4), our model was conceived specifically to learn a Mahalanobis distance metric that improves the accuracy of kNN classification. [sent-304, score-0.276]
34 Indeed, the three essential ingredients of our model are (i) its convex loss function, (ii) its goal of margin maximization, and (iii) the constraints on the distance metric imposed by accurate kNN classification. [sent-305, score-0.362]
35 1 Intuition and Terminology Our model is based on two simple intuitions (and idealizations) for robust kNN classification: first, that each training input xi should share the same label yi as its k nearest neighbors; second, that training inputs with different labels should be widely separated. [sent-307, score-0.417]
36 Specifically, one term penalizes large distances between nearby inputs with the same label, while the other term penalizes small distances between inputs with different labels. [sent-310, score-0.536]
37 Recall that the goal of learning is to estimate a distance metric under which each input xi has k nearest neighbors that share the same label yi . [sent-313, score-0.603]
38 The target neighbors of xi are those that we desire to be closest to xi ; in particular, we attempt to learn a linear transformation of the input space such that the resulting nearest neighbors of xi are indeed its target neighbors. [sent-315, score-0.701]
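A minimal sketch of the default target-neighbor assignment (the k nearest same-class examples under Euclidean distance, as described later in the text); the function name and array layout are illustrative assumptions, not the authors' code:

```python
import numpy as np

def target_neighbors(X, y, k=3):
    """Return an (n, k) array of indices: the k nearest same-class examples of
    each input under Euclidean distance. Sketch; assumes every class has more
    than k members."""
    n = X.shape[0]
    targets = np.empty((n, k), dtype=int)
    for i in range(n):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]                       # exclude the input itself
        d2 = np.sum((X[same] - X[i]) ** 2, axis=1)   # squared Euclidean distances
        targets[i] = same[np.argsort(d2)[:k]]
    return targets
```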
39 For kNN classification to succeed, the target neighbors of each input xi should be closer than all differently labeled inputs. [sent-325, score-0.342]
40 In particular, for each input xi , we can imagine the target neighbors as establishing a perimeter that differently labeled inputs should not invade. [sent-326, score-0.465]
41 We refer to the differently labeled inputs in the training set that invade this perimeter as impostors; the goal of learning (roughly speaking) is to minimize the number of impostors. [sent-327, score-0.294]
42 For an input xi with label yi and target neighbor xj, an impostor is any input xl with label yl ≠ yi such that ||L(xi − xl)||^2 ≤ ||L(xi − xj)||^2 + 1. [sent-332, score-0.404]
43 (10) In other words, an impostor xl is any differently labeled input that invades the perimeter plus unit margin defined by any target neighbor x j of the input xi . [sent-333, score-0.472]
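The impostor condition in Eq. (10) can be checked directly; the helper below is a hypothetical sketch, assuming X is an n-by-d array of inputs and L the current linear transformation:

```python
import numpy as np

def impostors(X, y, L, i, j):
    """Return indices l of differently labeled inputs x_l that satisfy
    ||L(x_i - x_l)||^2 <= ||L(x_i - x_j)||^2 + 1 for the target pair (i, j).
    Illustrative sketch, not the authors' implementation."""
    margin_radius = np.sum((L @ (X[i] - X[j])) ** 2) + 1.0
    diff = np.flatnonzero(y != y[i])                    # only differently labeled inputs
    d2 = np.sum(((X[i] - X[diff]) @ L.T) ** 2, axis=1)  # ||L(x_i - x_l)||^2 for each l
    return diff[d2 <= margin_radius]
```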
44 Before learning, a training input has both target neighbors and impostors in its local neighborhood. [sent-335, score-0.36]
45 The loss function consists of two terms, one which acts to pull target neighbors closer together, and another which acts to push differently labeled examples further apart. [sent-341, score-0.334]
46 The distance metric is optimized so that: (i) its k=3 target neighbors lie within a smaller radius after training; (ii) differently labeled inputs lie outside this smaller radius by some finite margin. [sent-343, score-0.596]
47 The first term in the loss function penalizes large distances between each input and its target neighbors. [sent-347, score-0.277]
48 (11) only penalizes large distances between inputs and their target neighbors; in particular, it does not penalize large distances between all similarly labeled inputs. [sent-351, score-0.493]
49 The second term in the loss function penalizes small distances between differently labeled examples. [sent-355, score-0.3]
50 , the input xl lies a safe distance away from xi ), then its hinge loss has a negative argument and makes no contribution to the overall loss function. [sent-364, score-0.314]
51 Finally, we combine the two terms εpull (L) and εpush (L) into a single loss function for distance metric learning. [sent-370, score-0.278]
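A naive sketch of the combined loss (1 − µ) εpull + µ εpush, reusing the target-neighbor index array assumed in the earlier sketch; it loops over all pairs and is meant only to make the two terms concrete, not to be an efficient implementation:

```python
import numpy as np

def lmnn_loss(L, X, y, targets, mu=0.5):
    """(1 - mu) * pull term + mu * push term with hinge loss.
    `targets[i]` holds the indices of x_i's target neighbors.
    Exposition-only sketch with an O(n^2 k) loop."""
    Xt = X @ L.T                                   # transformed inputs L x
    pull, push = 0.0, 0.0
    for i in range(X.shape[0]):
        for j in targets[i]:
            d_ij = np.sum((Xt[i] - Xt[j]) ** 2)
            pull += d_ij
            # hinge penalty for every differently labeled input that invades
            # the perimeter of this target pair by less than one unit margin
            d_il = np.sum((Xt[i] - Xt) ** 2, axis=1)
            hinge = np.maximum(0.0, 1.0 + d_ij - d_il)
            push += hinge[y != y[i]].sum()
    return (1.0 - mu) * pull + mu * push
```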
52 Both LMNN and NCA are designed to learn a Mahalanobis distance metric over the input space that improves kNN classification at test time. [sent-395, score-0.334]
53 3 Local Versus Global Distances We emphasize that the loss function for LMNN classification only penalizes large distances between target neighbors as opposed to all examples in the same class. [sent-416, score-0.367]
54 Such classes violate a basic assumption behind metric learning algorithms that attempt to shrink global distances between all similarly labeled examples. [sent-425, score-0.349]
55 In particular, we introduce nonnegative slack variables {ξi jl } for all triplets consisting of an input xi , one of its target neighbors xj , and an impostor xl . [sent-461, score-0.379]
56 Using the slack variables to monitor these margin violations, we obtain the SDP: Minimize (1 − µ) ∑_{i,j} (xi − xj)⊤ M (xi − xj) + µ ∑_{i,j,l} (1 − yil) ξijl, where the sums run over each input xi, its target neighbors xj, and differently labeled inputs xl, subject to: (1) (xi − xl)⊤ M (xi − xl) − (xi − xj)⊤ M (xi − xj) ≥ 1 − ξijl, (2) ξijl ≥ 0, (3) M ⪰ 0. [sent-464, score-0.33]
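For small data sets, the SDP above can be prototyped with a generic modeling tool; the sketch below uses cvxpy (a tooling assumption, not part of the paper) and enumerates all triplets, which scales poorly but mirrors the formulation one-to-one:

```python
import numpy as np
import cvxpy as cp

def lmnn_sdp(X, y, targets, mu=0.5):
    """Small-scale prototype of the LMNN semidefinite program (sketch)."""
    n, d = X.shape
    M = cp.Variable((d, d), PSD=True)                    # Mahalanobis matrix, constraint (3)
    objective, constraints = 0, []
    for i in range(n):
        for j in targets[i]:
            Cij = np.outer(X[i] - X[j], X[i] - X[j])
            objective += (1 - mu) * cp.trace(M @ Cij)    # pull term
            for l in np.flatnonzero(y != y[i]):          # impostor candidates
                Cil = np.outer(X[i] - X[l], X[i] - X[l])
                slack = cp.Variable(nonneg=True)         # xi_ijl, constraint (2)
                constraints.append(cp.trace(M @ Cil) - cp.trace(M @ Cij) >= 1 - slack)
                objective += mu * slack                  # hinge loss via slack
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve(solver=cp.SCS)
    return M.value
```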
57 The slack variables {ξi jl } are sparse because most inputs xi and xl are well separated relative to the distance between xi and any of its target neighbors xj . [sent-467, score-0.541]
58 In particular, for a test example xt with hypothetical label yt , we locate k (similarly labeled) target neighbors (as determined by Euclidean distance to xt or other a priori considerations) and then compute both terms in Eq. [sent-482, score-0.414]
59 For the first term, we accumulate the squared distances to the k target neighbors of xt . [sent-484, score-0.35]
60 , differently labeled examples) that invade the perimeter around xt as determined by its target neighbors; we also accumulate the hinge loss for differently labeled examples whose perimeters are invaded by xt . [sent-487, score-0.43]
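A hedged sketch of this energy-based decision rule: for every hypothetical label it accumulates the pull term to the k nearest examples of that label plus the hinge penalties from differently labeled examples that invade the resulting perimeter, and predicts the label of lowest energy. It omits the symmetric term in which the test point invades the perimeters of training examples; names and defaults are illustrative assumptions.

```python
import numpy as np

def energy_classify(x_test, X, y, L, k=3, mu=0.5):
    """Simplified energy-based LMNN classification of one test example (sketch)."""
    xt = L @ x_test
    Xt = X @ L.T
    d2 = np.sum((Xt - xt) ** 2, axis=1)       # squared distances to all training inputs
    energies = {}
    for label in np.unique(y):
        same = np.flatnonzero(y == label)
        targets = same[np.argsort(d2[same])[:k]]           # k target neighbors for this label
        pull = d2[targets].sum()
        push = sum(np.maximum(0.0, 1.0 + d2[j] - d2[y != label]).sum()
                   for j in targets)                        # hinge losses from impostors
        energies[label] = (1 - mu) * pull + mu * push
    return min(energies, key=energies.get)
```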
61 All variations of lmnn, rca and lda were applied after pre-processing with pca for general noise reduction. [sent-506, score-0.3]
62 The multiple metrics version of lmnn (mm-lmnn) is comparable with multiclass svm on most data sets (with 20news and yaleFaces as the only exceptions). [sent-509, score-0.772]
63 In all of our experiments, we chose the target neighbors based on Euclidean distance in the input space (after dimensionality reduction by PCA or LDA). [sent-541, score-0.36]
64 3 On data sets of this size, a distance metric can be learned in a matter of seconds. [sent-576, score-0.276]
65 3 shows that the LMNN metric outperforms the Euclidean metric and even improves on multiclass SVMs. [sent-617, score-0.339]
66 The middle row shows images from the same class that were among the 3-NN under the learned Mahalanobis metric (after training) but not among the original 3-NN under the Euclidean metric (before training). [sent-622, score-0.382]
67 5 shows some digits whose nearest neighbor changed as a result of learning, from a mismatch using Euclidean distances to a match using Mahalanobis distances. [sent-785, score-0.326]
68 In the absence of prior knowledge, a default choice is to use Euclidean distances to determine target nearest neighbors. [sent-801, score-0.294]
69 While the target nearest neighbors are fixed during learning, however, the actual nearest neighbors may change as a result of the linear transformation of the input space. [sent-802, score-0.608]
70 These changes suggest an iterative approach, in which the Mahalanobis distances learned in one application (or “pass”) of LMNN are used to determine the target nearest neighbors in a subsequent run of the algorithm. [sent-803, score-0.441]
71 For the (p+1)th pass, we can assign target neighbors using the Euclidean distance metric after the linear transformation xi → Lp Lp−1 · · · L1 xi. [sent-805, score-0.528]
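A sketch of this multi-pass scheme, assuming a hypothetical fit_lmnn routine that returns the transformation learned in one pass, and reusing the target_neighbors helper sketched earlier:

```python
import numpy as np

def multipass_lmnn(X, y, fit_lmnn, k=3, passes=3):
    """Compose the transformations of successive LMNN passes; target neighbors
    for pass p+1 are re-assigned by Euclidean distance under L_p ... L_1.
    `fit_lmnn(X_transformed, y, targets)` is a hypothetical placeholder."""
    d = X.shape[1]
    L_total = np.eye(d)
    for _ in range(passes):
        targets = target_neighbors(X @ L_total.T, y, k)   # re-assign target neighbors
        L_p = fit_lmnn(X @ L_total.T, y, targets)         # learn this pass's transformation
        L_total = L_p @ L_total                           # accumulate the composition
    return L_total
```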
72 The principal directions of individual distance metrics are indicated by arrows. [sent-818, score-0.279]
73 In this section, we show how to learn different Mahalanobis distance metrics for different examples in the input space. [sent-837, score-0.28]
74 While the training procedure couples the distance metrics in different clusters, the optimization remains a convex problem in semidefinite programming. [sent-845, score-0.332]
75 The globally integrated training of local distance metrics also distinguishes our approach from earlier work (Hastie and Tibshirani, 1996). [sent-846, score-0.277]
76 To learn these metrics from data, we solve a modified version of the original SDP: Minimize (1 − µ) ∑_{i,j} (xi − xj)⊤ M_{yj} (xi − xj) + µ ∑_{i,j,l} (1 − yil) ξijl subject to: (1) (xi − xl)⊤ M_{yl} (xi − xl) − (xi − xj)⊤ M_{yj} (xi − xj) ≥ 1 − ξijl, (2) ξijl ≥ 0, (3) M_c ⪰ 0 for each class c. [sent-866, score-0.403]
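The only change relative to the single-metric case is that the metric applied to a pair is indexed by the label of the second argument; a small illustrative helper (with Ms assumed to be a label-to-matrix mapping) makes this explicit:

```python
import numpy as np

def multi_metric_sq_distance(Ms, x_i, x_j, label_j):
    """Squared distance in the multiple-metrics extension:
    (x_i - x_j)^T M_{y_j} (x_i - x_j), where the metric belongs to the class
    (or cluster) of x_j. Sketch; `Ms` maps labels to PSD matrices."""
    diff = np.asarray(x_i) - np.asarray(x_j)
    return diff @ Ms[label_j] @ diff
```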
77 Figure 8: Multiple local distance metrics learned for a data set consisting of handwritten digits four, two, one and zero (panels labeled zeros, ones, twos, fours). [sent-873, score-0.284]
78 To speed up training, we initialized the multi-metric optimization by setting each class-dependent metric to the solution from LMNN classification with a single global distance metric. [sent-884, score-0.277]
79 The simplest brute-force way to locate a test example’s nearest neighbors is to compute its distance to all the training examples. [sent-930, score-0.41]
80 In particular, for any test example, a nearest neighbor query consists of computing the distance to each training example and comparing this distance to the k closest examples already located. [sent-937, score-0.458]
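A brute-force query is a few lines of array arithmetic; the sketch below is an illustration of the baseline being discussed, not the paper's implementation:

```python
import numpy as np

def knn_query_bruteforce(X_train, x_test, k=3):
    """Compute the distance from the test example to every training example
    and keep the k smallest (indices returned unordered)."""
    d2 = np.sum((X_train - x_test) ** 2, axis=1)   # squared Euclidean distances
    return np.argpartition(d2, k)[:k]              # indices of the k closest examples
```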
81 Thus we can often reduce the dimensionality of the input data without distorting the nearest neighbor relations. [sent-944, score-0.283]
82 The figure compares the speedups when ball trees are used (blue) versus when ball trees are not used (red) in the lower dimensional space. [sent-956, score-0.344]
83 The bounding regions are set up to guarantee that the distance from a test example to a training example inside the bounding region is at least as large as the distance from the test example to the region's boundary. [sent-968, score-0.298]
84 Thus, for each test example, the training examples inside the bounding region can be ruled out as k nearest neighbors if k training examples have already been found that are closer than the region’s boundary. [sent-969, score-0.367]
85 If a set S of training examples is encapsulated inside a ball with center c and radius r, such that ∀x ∈ S : ||x − c|| ≤ r, then for any test example xt we can bound the distance to any training example inside the ball by the following expression: ∀xi ∈ S : ||xt − xi|| ≥ max(||xt − c|| − r, 0). [sent-976, score-0.602]
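The pruning rule that follows from this bound can be written as a one-line test; the function below is a hedged sketch with illustrative names:

```python
import numpy as np

def can_prune_ball(x_test, center, radius, kth_distance):
    """Ball-tree pruning test based on ||x_t - x_i|| >= max(||x_t - c|| - r, 0):
    if even this lower bound exceeds the current k-th nearest distance, every
    training example inside the ball can be ruled out."""
    lower_bound = max(np.linalg.norm(x_test - center) - radius, 0.0)
    return lower_bound > kth_distance
```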
86 The data structure is based ... Figure 11: The basic idea behind ball trees: for any training example xi inside the ball we can bound the distance ||xt − xi|| from below using (17). [sent-978, score-0.749]
87 If another training example xj outside the ball is already known to be closer than this bound to the test example xt , then the training examples inside the ball can be ruled out as nearest neighbors. [sent-979, score-0.482]
88 Figure 12: The relative speed-up obtained using ball trees to search for margin violations. For any training example xi , and for any similarly labeled example xj that [sent-1006, score-0.39]
89 is one of its target k-nearest neighbors, the impostors consist of all differently labeled examples xl (with yil = 0) that satisfy Eq. [sent-1009, score-0.475]
90 As in their use for kNN search, many subtrees in the depth-first search for impostors can be pruned: if for some ball the lower bound on the distances between examples is already greater than the right-hand side of Eq. [sent-1012, score-0.367]
91 Note that for each training example xi , we only need to search for impostors among other training examples xl that have a different class label (with yil = 0). [sent-1014, score-0.406]
92 12 shows the relative speed-up when ball trees are used to search for margin violations in LMNN classification. [sent-1017, score-0.285]
93 There is an inherent trade-off between dimensionality reduction and nearest ... [Figure: 3-NN classification error (in %) after dimensionality reduction, with ball tree speedups ranging from 15x down to 3x.] [sent-1031, score-0.338]
94 In this section, we explore how the distance metric learned for LMNN classification can be used for more effective dimensionality reduction in ball trees. [sent-1047, score-0.438]
95 From labeled training examples, we have shown how to learn a Mahalanobis distance metric for kNN classification. [sent-1076, score-0.383]
96 At the other extreme, to build a kNN classifier that is as fast as possible at test time, our results suggest combining low-rank distance metrics with ball trees. [sent-1085, score-0.345]
97 The solver iteratively re-estimates the Mahalanobis distance metric as it attempts to minimize the cost function for LMNN classification. [sent-1108, score-0.307]
98 We refer to the Mahalanobis distance metric at the tth iteration as Mt and to its squared Mahalanobis distance in Eq. [sent-1113, score-0.349]
99 To avoid this computational burden, we exploit the fact that the great majority of triples do not incur margin violations: in particular, for each training example, only a very small fraction of differently labeled examples typically lie nearby in the input space. [sent-1148, score-0.288]
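A minimal sketch of one solver iteration under these assumptions: accumulate a gradient from the active set of margin-violating triples (computed elsewhere), take a step, and project back onto the PSD cone by truncating negative eigenvalues. This illustrates the projected-gradient idea, not the authors' optimized solver:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the positive semidefinite cone by
    clipping negative eigenvalues at zero."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def projected_gradient_step(M, grad, step_size):
    """One sketch iteration: descend along the accumulated gradient, keep the
    iterate symmetric, and restore feasibility (M >= 0)."""
    M_new = M - step_size * grad
    M_new = 0.5 * (M_new + M_new.T)   # guard against numerical asymmetry
    return project_psd(M_new)
```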
100 Distance metric learning for large margin nearest neighbor classification. [sent-1457, score-0.394]
wordName wordTfidf (topN-words)
[('lmnn', 0.617), ('knn', 0.399), ('mahalanobis', 0.21), ('nca', 0.192), ('metric', 0.155), ('distances', 0.141), ('metrics', 0.126), ('neighbors', 0.123), ('aul', 0.114), ('einberger', 0.114), ('lda', 0.113), ('nearest', 0.111), ('impostors', 0.108), ('rca', 0.102), ('distance', 0.097), ('ball', 0.097), ('etric', 0.097), ('semide', 0.093), ('istance', 0.092), ('pca', 0.085), ('mmc', 0.084), ('inputs', 0.081), ('neighbor', 0.074), ('mt', 0.066), ('trees', 0.065), ('dimensionality', 0.065), ('transformation', 0.065), ('pola', 0.06), ('classi', 0.058), ('principal', 0.056), ('margin', 0.054), ('lxi', 0.054), ('yil', 0.054), ('training', 0.054), ('labeled', 0.053), ('nt', 0.053), ('euclidean', 0.05), ('dm', 0.05), ('xl', 0.05), ('mnist', 0.049), ('violations', 0.048), ('images', 0.048), ('xi', 0.046), ('differently', 0.045), ('xt', 0.044), ('svms', 0.042), ('mlcc', 0.042), ('perimeter', 0.042), ('target', 0.042), ('gradient', 0.039), ('handwritten', 0.037), ('sdp', 0.037), ('cil', 0.036), ('shental', 0.036), ('solver', 0.036), ('goldberger', 0.036), ('torresani', 0.036), ('weinberger', 0.036), ('hinge', 0.036), ('penalizes', 0.035), ('input', 0.033), ('jl', 0.033), ('mcc', 0.03), ('convex', 0.03), ('earning', 0.029), ('multiclass', 0.029), ('accelerate', 0.027), ('triples', 0.027), ('gt', 0.027), ('loss', 0.026), ('projected', 0.026), ('push', 0.025), ('matrix', 0.025), ('onto', 0.025), ('optimization', 0.025), ('test', 0.025), ('neighborhood', 0.025), ('learn', 0.024), ('chunklet', 0.024), ('olivetti', 0.024), ('learned', 0.024), ('active', 0.024), ('projecting', 0.023), ('rectangular', 0.023), ('slack', 0.023), ('projection', 0.023), ('nearby', 0.022), ('cone', 0.022), ('search', 0.021), ('scales', 0.021), ('dimensional', 0.02), ('cw', 0.02), ('yt', 0.02), ('eigenvector', 0.02), ('pull', 0.02), ('improvements', 0.019), ('yi', 0.019), ('label', 0.019), ('minimize', 0.019), ('ci', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
2 0.32380357 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
3 0.086864032 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
4 0.082941718 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
5 0.07121072 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
6 0.064492382 23 jmlr-2009-Discriminative Learning Under Covariate Shift
7 0.061098706 38 jmlr-2009-Hash Kernels for Structured Data
8 0.059194516 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
9 0.056028277 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
10 0.050170146 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
11 0.04210984 50 jmlr-2009-Learning When Concepts Abound
12 0.039950978 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
13 0.039902203 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
14 0.038545139 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
15 0.036474437 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
16 0.036069427 13 jmlr-2009-Bounded Kernel-Based Online Learning
17 0.03252944 15 jmlr-2009-Cautious Collective Classification
18 0.032504983 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
19 0.032428227 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
20 0.031756159 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
topicId topicWeight
[(0, 0.221), (1, -0.027), (2, 0.101), (3, -0.08), (4, 0.047), (5, -0.252), (6, 0.154), (7, -0.024), (8, -0.459), (9, 0.051), (10, -0.293), (11, -0.228), (12, -0.085), (13, -0.134), (14, 0.006), (15, 0.028), (16, -0.058), (17, 0.052), (18, 0.062), (19, 0.015), (20, 0.211), (21, -0.047), (22, -0.084), (23, 0.062), (24, 0.062), (25, -0.042), (26, -0.037), (27, 0.033), (28, -0.018), (29, -0.075), (30, -0.024), (31, -0.031), (32, -0.118), (33, -0.034), (34, 0.076), (35, -0.017), (36, 0.03), (37, -0.023), (38, -0.043), (39, 0.028), (40, 0.001), (41, -0.023), (42, 0.04), (43, -0.013), (44, -0.006), (45, -0.021), (46, -0.022), (47, 0.037), (48, -0.068), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.91785598 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
2 0.88944638 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
3 0.27814806 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
4 0.27338347 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
5 0.25522017 25 jmlr-2009-Distributed Algorithms for Topic Models
Author: David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer. Keywords: topic models, latent Dirichlet allocation, hierarchical Dirichlet processes, distributed parallel computation
6 0.24740328 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
7 0.21616358 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
8 0.19852486 38 jmlr-2009-Hash Kernels for Structured Data
9 0.18894748 15 jmlr-2009-Cautious Collective Classification
10 0.18538488 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
11 0.18272115 23 jmlr-2009-Discriminative Learning Under Covariate Shift
12 0.16877718 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
13 0.1670038 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
14 0.16643478 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
15 0.16414134 50 jmlr-2009-Learning When Concepts Abound
16 0.15986572 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
17 0.15583536 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
18 0.15526891 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
19 0.14730631 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
20 0.14270622 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
topicId topicWeight
[(8, 0.022), (11, 0.02), (19, 0.013), (26, 0.02), (38, 0.039), (47, 0.013), (52, 0.07), (55, 0.041), (58, 0.034), (66, 0.084), (68, 0.448), (90, 0.082), (96, 0.025)]
simIndex simValue paperId paperTitle
1 0.90038258 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables
same-paper 2 0.82577509 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semi-definite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
3 0.70379823 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2^2 , and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
4 0.50969392 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
5 0.48127955 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.46692994 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
7 0.43828791 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
8 0.42597118 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
9 0.42356598 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.41348371 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
11 0.41201821 38 jmlr-2009-Hash Kernels for Structured Data
12 0.40840504 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
13 0.40687165 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
14 0.40222108 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
15 0.40183347 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
16 0.39528543 25 jmlr-2009-Distributed Algorithms for Topic Models
17 0.39517847 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
18 0.39332843 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
19 0.39250442 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models