jmlr jmlr2007 jmlr2007-30 knowledge-graph by maker-knowledge-mining

30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms


Source: pdf

Author: Michael Biehl, Anarta Ghosh, Barbara Hammer

Abstract: Learning vector quantization (LVQ) schemes constitute intuitive, powerful classification heuristics with numerous successful applications but, so far, limited theoretical background. We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. The analysis yields typical learning curves, convergence properties, and achievable generalization abilities. This is also possible for heuristic training schemes which do not relate to a cost function. We compare the performance of several algorithms, including Kohonen’s LVQ1 and LVQ+/-, a limiting case of LVQ2.1. The former shows close to optimal performance, while LVQ+/- displays divergent behavior. We investigate how early stopping can overcome this difficulty. Furthermore, we study a crisp version of robust soft LVQ, which was recently derived from a statistical formulation. Surprisingly, it exhibits relatively poor generalization. Performance improves if a window for the selection of data is introduced; the resulting algorithm corresponds to cost function based LVQ2. The dependence of these results on the model parameters, for example, prior class probabilities, is investigated systematically, simulations confirm our analytical findings. Keywords: prototype based classification, learning vector quantization, Winner-Takes-All algorithms, on-line learning, competitive learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. [sent-9, score-0.338]

2 Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. [sent-10, score-0.155]

3 The classification of LVQ is based on a distance measure, usually the Euclidean distance, which quantifies the similarity of given data with so-called prototype or codebook vectors representing the classes. [sent-31, score-0.149]

4 The prototypes are determined in a training process from labeled example data and can be interpreted in a straightforward way as they capture essential features of the data in the very same space. [sent-32, score-0.297]

5 In general, several prototypes can be employed to represent one of the classes. [sent-35, score-0.297]

6 In simple hard or crisp schemes, a data or feature vector is assigned to the closest of all prototypes and the corresponding class. [sent-36, score-0.354]
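
For concreteness, a minimal sketch of this crisp winner-takes-all assignment (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def wta_classify(xi, prototypes, labels):
    """Assign feature vector xi to the class of its closest prototype.

    prototypes: array of shape (P, N), one codebook vector per row
    labels:     array of shape (P,), class label of each prototype
    """
    # squared Euclidean distances to all prototypes
    d = np.sum((prototypes - xi) ** 2, axis=1)
    winner = np.argmin(d)            # winner-takes-all
    return labels[winner]
```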

7 In practice, the computational effort of LVQ training usually scales linearly with the training set size, and that of classification depends only on the (fixed) number of prototypes and the input dimensionality. [sent-43, score-0.297]

8 Furthermore, recent variations of LVQ allow for the incorporation of problem specific metrics or kernels which can be adapted during training such that very few prototypes can model quite complex behavior (Hammer et al. [sent-46, score-0.332]

9 LVQ2.1 and similar strategies can display divergent behavior unless a proper (also heuristic) window rule is included in the update scheme. [sent-66, score-0.218]

10 Under simplifying assumptions, the evolution of these macroscopic order parameters is given by a system of deterministic coupled ordinary differential equations (ODE) which describe the dynamics of on-line learning exactly in the thermodynamic limit. [sent-86, score-0.273]
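
The concrete right-hand sides are derived in the paper; purely as an illustration of the structure of such a description, a generic Euler integration of a coupled ODE system for the order parameters could look as follows (the `rhs` callable is a hypothetical placeholder for the averaged update terms, not the paper's actual equations):

```python
import numpy as np

def integrate_order_parameters(rhs, y0, alpha_max, d_alpha=0.01):
    """Euler scheme for dy/dalpha = rhs(y, alpha), where y collects the
    order parameters (the R and Q overlaps) and alpha plays the role of a
    continuous learning time."""
    alphas = np.arange(0.0, alpha_max + d_alpha, d_alpha)
    y = np.asarray(y0, dtype=float)
    trajectory = [y.copy()]
    for a in alphas[:-1]:
        y = y + d_alpha * rhs(y, a)   # deterministic evolution (N -> infinity)
        trajectory.append(y.copy())
    return alphas, np.array(trajectory)
```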

11 We will use the term LVQ+/- for this prescription and we consider it with and without an idealized early stopping procedure. [sent-95, score-0.18]

12 1 Prototype Vectors We will investigate situations with only two prototype or codebook vectors w S ∈ RN . [sent-109, score-0.149]

13 The classification as parameterized by the prototypes is based on Euclidean distances: Any given input ξ ∈ RN will be assigned to the class of the closest prototype. [sent-112, score-0.297]

14 Note that, in principle, the simple LVQ classifier with only two prototypes could be replaced by a linearly separable classification S = sign[w^perc · ξ − θ^perc] with perceptron weight vector w^perc = w+ − w− and threshold θ^perc = (|w+|² − |w−|²)/2. [sent-114, score-0.401]
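
The stated equivalence follows directly from expanding the squared Euclidean distances; in the notation of the sentence above:

```latex
\|\xi - w_+\|^2 < \|\xi - w_-\|^2
\;\Longleftrightarrow\;
-2\, w_+ \cdot \xi + \|w_+\|^2 \;<\; -2\, w_- \cdot \xi + \|w_-\|^2
\;\Longleftrightarrow\;
\underbrace{(w_+ - w_-)}_{w^{\mathrm{perc}}} \cdot\, \xi \;>\;
\underbrace{\tfrac{1}{2}\big(\|w_+\|^2 - \|w_-\|^2\big)}_{\theta^{\mathrm{perc}}} .
```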

15 Here, however, we are interested in the more complex learning dynamics of independent codebook vectors as provided by LVQ algorithms. [sent-115, score-0.148]

16 The left panel shows the projections h ± = w± · ξ of the data on a randomly chosen pair of orthogonal unit vectors w± . [sent-126, score-0.139]

17 The right panel corresponds to the projections b± = B± · ξ, diamonds mark the position of the cluster centers. [sent-127, score-0.194]

18 Statistical physics has been used earlier to analyse the generalization behavior of perceptron type learning prescriptions in similar models. [sent-134, score-0.143]

19 Here we are interested in the dynamics and generalization ability of LVQ-type learning rules for non-separable data. [sent-142, score-0.152]

20 The winner is moved towards (away from) the example input ξµ if prototype label and class membership of the data agree (differ), respectively. [sent-183, score-0.195]
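
A minimal sketch of this LVQ1 step; the learning rate handling and names are illustrative and not taken from the paper:

```python
import numpy as np

def lvq1_step(prototypes, labels, xi, y, eta):
    """One LVQ1 update: move the winner towards xi if its label matches the
    class y of the example, and away from xi otherwise."""
    d = np.sum((prototypes - xi) ** 2, axis=1)
    J = np.argmin(d)                         # winner-takes-all
    sign = 1.0 if labels[J] == y else -1.0   # attract on agreement, repel on disagreement
    prototypes[J] = prototypes[J] + eta * sign * (xi - prototypes[J])
    return prototypes
```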

21 This plausible strategy realizes a compromise between (I) the representation of data by prototypes with the same class label and (II) the identification of decision boundaries which clearly separate the different classes. [sent-184, score-0.297]

22 Originally it is based on heuristics and aims at a more efficient separation of prototypes which represent different classes. [sent-200, score-0.297]

23 Given a single example, the update concerns two prototypes: (I) the closest among all codebook vectors which belong to the same class as the data, and (II) the closest vector among the prototypes that represent a different class. [sent-201, score-0.323]

24 The so-called correct winner (I) is moved towards the data whereas the wrong winner (II) is moved even farther away. [sent-202, score-0.144]
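
A sketch of this LVQ2.1-style step in which two prototypes are updated from a single example; with only one prototype per class, the "correct" and "wrong" winners reduce to the prototype carrying the example's label and the other one (names are illustrative):

```python
import numpy as np

def lvq_plus_minus_step(prototypes, labels, xi, y, eta):
    """Move the closest correct-class prototype towards xi and the closest
    wrong-class prototype away from xi."""
    labels = np.asarray(labels)
    d = np.sum((prototypes - xi) ** 2, axis=1)
    same = labels == y
    correct_winner = np.where(same)[0][np.argmin(d[same])]
    wrong_winner   = np.where(~same)[0][np.argmin(d[~same])]
    prototypes[correct_winner] += eta * (xi - prototypes[correct_winner])
    prototypes[wrong_winner]   -= eta * (xi - prototypes[wrong_winner])
    return prototypes
```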

25 In order to obtain a first insight into this type of algorithm, we consider the unrestricted version or, in other words, the limit of infinite window size. [sent-206, score-0.142]

26 In our model with two prototypes only, the identification of the respective winners is redundant and the prescription reduces to updating both vectors from each example. [sent-208, score-0.342]

27 The crisp version of RSLVQ is particularly transparent: In the limiting case of deterministic assignments an update analogous to LVQ+/- is performed only if the current configuration of prototypes misclassifies the new example. [sent-220, score-0.354]
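
A sketch of this learning-from-mistakes rule, reusing the hypothetical lvq_plus_minus_step helper from the sketch above: the update is applied only if the current winner-takes-all classification of the example is wrong.

```python
import numpy as np

def lfm_step(prototypes, labels, xi, y, eta):
    """Crisp RSLVQ / learning from mistakes: LVQ+/- style update only on
    misclassified examples."""
    d = np.sum((prototypes - xi) ** 2, axis=1)
    if labels[np.argmin(d)] == y:
        return prototypes                    # example already classified correctly: no update
    # otherwise perform the LVQ+/- step (defined in the earlier sketch)
    return lvq_plus_minus_step(prototypes, labels, xi, y, eta)
```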

28 Nevertheless, our model represents an important aspect of more realistic multi-class multiprototype situations: the competition of the two prototypes which currently define the decision boundary or a portion thereof. [sent-256, score-0.297]

29 Note furthermore that the model also allows us to investigate unsupervised vector quantization (VQ), which is equivalent to the competition of two prototypes within the same class in our framework (see, for example, Biehl et al. [sent-258, score-0.331]

30 Nevertheless, we aim at a principled understanding of how certain ingredients of an algorithm influence its learning dynamics and generalization behavior. [sent-266, score-0.152]

31 Mathematical Analysis We apply the successful theory of on-line learning (see, for example, Biehl and Caticha, 2003; Engel and van den Broeck, 2001; Saad, 1999, for reviews) to describe the dynamics of LVQ algorithms. [sent-269, score-0.175]

32 The mathematical treatment of our model is based on the assumption of high-dimensional data and prototypes ξ, w± ∈ RN . [sent-270, score-0.297]

33 Fluctuations of the stochastic dynamics can be neglected in the limit N → ∞. [sent-278, score-0.15]

34 A continuous time limit leads to the description of the dynamics in terms of coupled, deterministic ordinary differential equations (ODE) for the above mentioned order parameters. [sent-280, score-0.15]

35 The order parameters are RSσ^µ = wS^µ · Bσ and QST^µ = wS^µ · wT^µ. The self-overlaps Q++^µ, Q−−^µ and the symmetric cross-overlap Q+−^µ = Q−+^µ relate to the lengths and relative angle between the prototype vectors, whereas the four quantities RSσ^µ specify their projections into the subspace spanned by {B+, B−}. [sent-288, score-0.179]
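
For a given configuration, these overlaps can be computed directly from the prototype and cluster-center vectors; a small sketch (names are illustrative):

```python
import numpy as np

def order_parameters(w_plus, w_minus, B_plus, B_minus):
    """Projections R_{S,sigma} = w_S . B_sigma and overlaps Q_{S,T} = w_S . w_T
    for two prototypes w_+/- in R^N and the two cluster directions B_+/-."""
    W = {'+': w_plus, '-': w_minus}
    B = {'+': B_plus, '-': B_minus}
    R = {(S, sigma): float(np.dot(W[S], B[sigma])) for S in '+-' for sigma in '+-'}
    Q = {(S, T): float(np.dot(W[S], W[T])) for S in '+-' for T in '+-'}
    return R, Q
```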

36 Here we use the shorthand fS for the modulation function of prototype S, omitting the arguments. [sent-290, score-0.189]

37 3 Self-averaging Property: The thermodynamic limit has an additional simplifying consequence: Fluctuations of the quantities {RSσ^µ, QST^µ} decrease with increasing N and in fact vanish for N → ∞. [sent-314, score-0.148]

38 Most frequently, we will consider prototypes that are initialized as independent random vectors of squared length Q with no prior knowledge about the cluster positions. [sent-336, score-0.352]

39 (13) Obviously, the precise initial positions of prototypes with respect to the cluster geometry can play a crucial role for the learning dynamics. [sent-338, score-0.377]

40 A classification scheme based on two prototypes is restricted to linear decision boundaries. [sent-370, score-0.297]

41 Only for v+ = v− does it become a plane, and LVQ with two prototypes could potentially implement Bayes optimal generalization. [sent-378, score-0.326]

42 As an example, we mention the symmetry breaking specialization of prototypes in unsupervised VQ (Biehl et al. [sent-388, score-0.322]

43 As an attractive feature, the approach provides information about the system which goes beyond its generalization ability, such as the learning dynamics and the location of prototypes. [sent-400, score-0.152]

44 The same idea applies here as we describe the learning dynamics of 2N prototype components in terms of a few characteristic overlaps and disregard their microscopic details. [sent-418, score-0.245]

45 We will compare the typical learning curves of LVQ+/-, an idealized early stopping procedure, LFM, and LFM-W with those of LVQ1. [sent-433, score-0.135]

46 We will put emphasis on the asymptotic behavior in the limit α → ∞, that is, the generalization error achieved from an arbitrarily large number of examples and its dependence on the model parameters. [sent-435, score-0.143]

47 Right panel: Trajectories of prototypes in the limit η → 0, α → ∞. [sent-466, score-0.325]

48 Solid lines correspond to the projections of prototypes into the plane spanned by λB + and λB− (marked by open circles). [sent-467, score-0.354]

49 We consider initialization of prototypes close to the origin, that is, relatively far from the region of high density in our data model. [sent-481, score-0.35]

50 Note that in a WTA algorithm the initial prototypes must not coincide exactly, hence we choose random w± (0) with, for example, squared length Q++ = Q−− = Q = 10^−4 in Eq. [sent-482, score-0.297]
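
A small sketch of such an initialization, assuming nothing beyond what the sentence states (random directions, prescribed squared length, vectors that do not coincide):

```python
import numpy as np

def init_prototypes(N, squared_length=1e-4, seed=0):
    """Two random prototypes close to the origin with squared length
    Q_{++} = Q_{--} = 1e-4, so that they do not coincide exactly."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((2, N))
    w *= np.sqrt(squared_length) / np.linalg.norm(w, axis=1, keepdims=True)
    return w[0], w[1]
```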

51 3 displays the trajectories of prototypes projected into the plane spanned by B+ and B− . [sent-504, score-0.413]

52 Note that, as could be expected from symmetry arguments, the (α → ∞)-asymptotic projections of prototypes into the B± -plane are along the axis connecting the cluster centers. [sent-505, score-0.405]

53 When learning from unbalanced data, for example, p+ > p− as in the example, the prototype representing the stronger cluster will be updated more frequently and in fact overshoots, resulting in the non-monotonic behavior of εg . [sent-512, score-0.213]

54 In the left panel v+ = v− whereas the right panel shows an example with different cluster variances. [sent-516, score-0.277]

55 Solid lines mark the asymptotic εg of LVQ1; the dashed lines correspond to εstop as obtained from an idealized early stopping scheme for LVQ+/- with prototypes initialized in the origin. [sent-535, score-0.482]

56 The result of LVQ+/- with idealized early stopping is still optimal for equal priors, while εg^lvq1 = εg^bld in p+ ≈ 0. [sent-541, score-0.135]

57 2 LVQ+/Here we report results concerning the divergent behavior displayed by the LVQ+/- prescription in its basic form. [sent-544, score-0.167]

58 In this generic case, the evolution of {RSσ , QST } displays a strong divergent behavior: All quantities associated with the prototype representing the weaker cluster display an exponential increase with α. [sent-552, score-0.382]

59 Even in our simple model problem, repulsion will always dominate for one of the prototypes and, like w− in our example, it will be moved arbitrarily far away from the cluster centers as α → ∞. [sent-561, score-0.381]

60 The divergent behavior is also apparent in Figure 6, which displays example trajectories of w ± in the limit η → 0, projected into the B± -plane. [sent-592, score-0.214]

61 Second, both prototypes move to infinity, that is, away from the data, as α → ∞. [sent-602, score-0.297]

62 In a visualization of the learning dynamics in the spirit of Figure 6, the trajectories become parallel to the symmetry axis as p+ = p− . [sent-603, score-0.185]

63 Filled symbols display the position after early stopping in the minimum of εg (α), short solid lines mark the projection of the respective decision boundaries. [sent-615, score-0.159]

64 In both cases, w+ approaches the same stationary position on the symmetry axis, while w− displays divergent behavior as the rescaled α̃ = ηα increases. [sent-621, score-0.195]

65 Right panel: Generalization error εstop of idealized early stopping as a function of p+ for λ = 1 and equal variances v+ = v− = 1. [sent-625, score-0.16]

66 The solid line marks the outcome of LVQ+/- with early stopping when prototypes are initialized in the origin, case (a) in the left panel, which is also displayed in Fig. [sent-627, score-0.443]

67 Another popular strategy is to update only from data that falls into a window close to the current decision boundary, that is, close to the midplane between the prototypes (Kohonen, 1997). [sent-634, score-0.38]
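
One simple way to express such a window condition for two prototypes is to accept only examples whose squared distances to w+ and w− nearly agree; this is an illustrative criterion with a hypothetical threshold delta, not necessarily the exact definition used in the paper:

```python
import numpy as np

def in_window(xi, w_plus, w_minus, delta):
    """True if the example lies close to the midplane between the prototypes,
    i.e., its squared distances to w_+ and w_- differ by less than delta."""
    d_plus = float(np.sum((xi - w_plus) ** 2))
    d_minus = float(np.sum((xi - w_minus) ** 2))
    return abs(d_plus - d_minus) < delta
```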

68 Figure 7: Learning from mistakes (LFM). Left panel: Projected prototype trajectories for LFM with η = 1 and all other parameters as in Fig. [sent-646, score-0.161]

69 Asymptotically, both prototypes approach stationary positions (solid dots) which are not on the symmetry axis of the distribution (dashed line). [sent-649, score-0.369]

70 The idea is to stop the learning process as soon as the divergent behavior starts to increase the generalization error. [sent-666, score-0.154]

71 We are interested in the best generalization error εstop that could be achieved, in principle, by an idealized early stopping method. [sent-670, score-0.165]

72 Hence, we first consider prototypes which are initialized precisely in the origin w + (0) = w− (0) = 0 and without offset from the plane spanned by B+ and B− . [sent-676, score-0.326]

73 For cluster weights in an interval around p+ = 1/2, the result of the early stopping procedure is superior to that of LVQ1. [sent-687, score-0.155]

74 It is important to note that our analysis refers to an idealized early stopping procedure based on perfect knowledge of the current εg. [sent-688, score-0.135]

75 As an example, we consider the initialization of prototypes with an offset from the origin and, more importantly, from the plane spanned by the cluster centers. [sent-698, score-0.434]

76 On the contrary, the above discussed results for early stopping are highly sensitive with respect to initialization and demonstrate, at best, the potential usefulness of the strategy. [sent-703, score-0.153]

77 Achieved (α → ∞) asymptotic generalization error εg as a function of p+ for different rescaled window sizes; from top to bottom: δ → ∞ (solid line, LFM), δ = 6, δ = 5, and δ = 0. [sent-720, score-0.163]

78 In contrast to LVQ1 and LVQ+/-, the typical learning curves of LFM display a monotonic decrease of εg (α), see the right panel of Figure 7 for examples with three different learning rates. [sent-739, score-0.147]

79 Figure 7 (left panel) shows example trajectories of the prototypes in the B ± -plane. [sent-741, score-0.335]

80 Eventually, the projections of the prototypes assume positions along a line which is parallel to the symmetry axis λ(B+ − B− ). [sent-745, score-0.375]

81 Here, the stationarity of order parameters and generalization error does not imply convergence of the prototype vectors themselves. [sent-747, score-0.153]

82 The actual asymptotic configuration of order parameters and prototypes depends furthermore on the learning rate η. [sent-751, score-0.347]

83 The learning rate merely controls the magnitude of the fluctuations orthogonal to {B+ , B− } and the asymptotic distance of prototypes from the decision boundary. [sent-753, score-0.347]

84 Hence, the prototypes will coincide and the consideration of this limit together with α → ∞ is not useful in the case of LFM. [sent-758, score-0.325]

85 It can be seen that LVQ+/- usually displays a divergent behavior whereas LVQ1 and LFM procedures converge towards fixed positions of the order parameters. [sent-794, score-0.173]

86 It should be mentioned that the trajectories of the prototypes need not be the shortest towards the final position and, often, initial overshooting as in LVQ1 can be observed if the classes are not balanced. [sent-799, score-0.335]

87 The outcome of LVQ+/- with early stopping can be close to or even slightly better than the α → ∞ asymptotic performance of LVQ1. [sent-802, score-0.15]

88 The robustness of LFM as a crisp version of RSLVQ is clearly demonstrated by the fact that its asymptotic behavior is not affected by the magnitude of the learning rate. [sent-805, score-0.142]

89 Frequently, at most two prototypes are updated at a given time step also in larger LVQ networks. [sent-820, score-0.297]

90 Hence, such systems can be interpreted as a superposition of pairs of prototypes within classes and at class borders. [sent-821, score-0.297]

91 First Moments: Exploiting the above mentioned statistical independence we can show immediately that ⟨hl⟩k = ⟨wl · ξ⟩k = wl · ⟨ξ⟩k = wl · (λ Bk) = λ Rlk. [sent-860, score-0.242]

92 Finally, we obtain the conditional second moment ⟨hl hm⟩k − ⟨hl⟩k ⟨hm⟩k = vk Qlm + λ² Rlk Rmk − λ² Rlk Rmk = vk Qlm. [sent-869, score-0.176]

93 In an analogous way we get ⟨bl bm⟩k − ⟨bl⟩k ⟨bm⟩k = vk δlm and ⟨hl bm⟩k − ⟨hl⟩k ⟨bm⟩k = vk Rlm. [sent-870, score-0.256]
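
These conditional moments can be checked numerically under the model assumption used here, namely that within cluster k the input is Gaussian with mean λ Bk and independent components of variance vk; a Monte Carlo sketch (names and defaults are illustrative):

```python
import numpy as np

def check_conditional_moments(w, B, k, lam, v, N=100, n_samples=100_000, seed=0):
    """Empirical check of <h_l>_k = lam * R_{lk} and
    Cov_k(h_l, h_m) = v_k * Q_{lm}, with h_l = w_l . xi and xi ~ N(lam*B_k, v_k*I).

    w: array (2, N) of prototype vectors, B: array (2, N) of cluster directions.
    Returns the deviations from the analytical values (should be close to zero)."""
    rng = np.random.default_rng(seed)
    xi = lam * B[k] + np.sqrt(v[k]) * rng.standard_normal((n_samples, N))
    h = xi @ w.T                                          # projections h_l = w_l . xi
    mean_dev = h.mean(axis=0) - lam * (w @ B[k])          # vs. lam * R_{lk}
    cov_dev = np.cov(h, rowvar=False) - v[k] * (w @ w.T)  # vs. v_k * Q_{lm}
    return mean_dev, cov_dev
```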

94 Now we compute the remaining average in (28) in an analogous way and get ⟨Θs⟩k = (1/√(2π)) ∫_ℝ Θ(αsk y + βsk) exp(−y²/2) dy = (1/√(2π)) ∫_{−∞}^{βsk/αsk} exp(−y²/2) dy = Φ(βsk/αsk), with Φ(z) = ∫_{−∞}^{z} e^{−x²/2} dx / √(2π). [sent-926, score-0.272]
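
The function Φ above is simply the cumulative distribution of a standard normal variable; numerically it can be evaluated through the error function:

```python
from math import erf, sqrt

def Phi(z):
    """Phi(z) = integral_{-inf}^{z} exp(-x**2 / 2) / sqrt(2*pi) dx."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```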

95 The mathematical structure is completely analogous to the case of LVQ1 and we obtain the results ⟨Θs^δ⟩k = Φ(βsk^δ / αsk) and ⟨(x)n Θs^δ⟩k = ((Ck αs)n / (√(2π) αsk)) exp(−(βsk^δ)² / (2 αsk²)) + (µk)n Φ(βsk^δ / αsk), (33) as well as the corresponding special cases with δ = 0. [sent-937, score-0.493]

96 Here, we have introduced βsk^δ = αs · µk − βs^δ, βsk^o = αs · µk − βs^o, and αsk = √(αs · Ck αs). [sent-938, score-0.312]

97 Here we have to insert αsk = √(αs · Ck αs) and βsk^o = αs · µk − βs^o, with ασ = (−2σ, +2σ, 0, 0)^T and βσ^o = −(Q+σ+σ − Q−σ−σ). [sent-954, score-0.156]

98 Here again, we have to insert αsk = √(αs · Ck αs), βsk^δ = αs · µk − βs^δ, and βsk^o = αs · µk − βs^o, with ασ = (−2σ, +2σ, 0, 0)^T, βσ^δ = −(Q+σ+σ − Q−σ−σ − δ), and βσ^o = −(Q+σ+σ − Q−σ−σ). [sent-960, score-0.312]

99 5 Analytical Results for LVQ+/The system of ODE (35) can be integrated analytically and the solutions are presented below for initialization of prototypes in the origin, that is, Rlm (0) = Qlm (0) = 0. [sent-963, score-0.35]

100 A theoretical framework for analysing the dynamics of LVQ: A statistical physics approach. [sent-1044, score-0.155]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lvq', 0.627), ('prototypes', 0.297), ('lfm', 0.294), ('sk', 0.156), ('biehl', 0.147), ('qlm', 0.134), ('qst', 0.134), ('prototype', 0.123), ('rs', 0.123), ('dynamics', 0.122), ('ammer', 0.121), ('iehl', 0.121), ('panel', 0.111), ('hosh', 0.103), ('hammer', 0.102), ('bility', 0.097), ('eneralization', 0.097), ('ode', 0.089), ('seo', 0.083), ('window', 0.083), ('lgorithms', 0.08), ('rlm', 0.077), ('ck', 0.074), ('rslvq', 0.07), ('modulation', 0.066), ('wl', 0.066), ('broeck', 0.064), ('divergent', 0.064), ('engel', 0.059), ('kohonen', 0.058), ('bld', 0.057), ('crisp', 0.057), ('cluster', 0.055), ('stopping', 0.055), ('hm', 0.054), ('initialization', 0.053), ('den', 0.053), ('ws', 0.053), ('dqlm', 0.051), ('drlm', 0.051), ('thermodynamic', 0.051), ('watkin', 0.051), ('asymptotic', 0.05), ('displays', 0.049), ('obermayer', 0.049), ('wm', 0.047), ('bm', 0.047), ('early', 0.045), ('caticha', 0.045), ('nbm', 0.045), ('prescription', 0.045), ('prescriptions', 0.045), ('villmann', 0.045), ('hl', 0.044), ('winner', 0.043), ('simplifying', 0.041), ('fs', 0.04), ('trajectories', 0.038), ('heaviside', 0.038), ('nhm', 0.038), ('display', 0.036), ('idealized', 0.035), ('behavior', 0.035), ('quantization', 0.034), ('vk', 0.034), ('physics', 0.033), ('ghosh', 0.033), ('dy', 0.033), ('saad', 0.032), ('sato', 0.032), ('macroscopic', 0.032), ('bk', 0.032), ('hs', 0.031), ('unrestricted', 0.031), ('generalization', 0.03), ('plane', 0.029), ('moved', 0.029), ('projections', 0.028), ('quantities', 0.028), ('limit', 0.028), ('averages', 0.027), ('achievable', 0.027), ('evolution', 0.027), ('dynamical', 0.027), ('codebook', 0.026), ('groningen', 0.026), ('nhl', 0.026), ('perc', 0.026), ('rlk', 0.026), ('positions', 0.025), ('exp', 0.025), ('variances', 0.025), ('stop', 0.025), ('symmetry', 0.025), ('bl', 0.024), ('vq', 0.024), ('schemes', 0.024), ('solid', 0.023), ('displayed', 0.023), ('stationary', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms

Author: Michael Biehl, Anarta Ghosh, Barbara Hammer

Abstract: Learning vector quantization (LVQ) schemes constitute intuitive, powerful classification heuristics with numerous successful applications but, so far, limited theoretical background. We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. The analysis yields typical learning curves, convergence properties, and achievable generalization abilities. This is also possible for heuristic training schemes which do not relate to a cost function. We compare the performance of several algorithms, including Kohonen’s LVQ1 and LVQ+/-, a limiting case of LVQ2.1. The former shows close to optimal performance, while LVQ+/- displays divergent behavior. We investigate how early stopping can overcome this difficulty. Furthermore, we study a crisp version of robust soft LVQ, which was recently derived from a statistical formulation. Surprisingly, it exhibits relatively poor generalization. Performance improves if a window for the selection of data is introduced; the resulting algorithm corresponds to cost function based LVQ2. The dependence of these results on the model parameters, for example, prior class probabilities, is investigated systematically, simulations confirm our analytical findings. Keywords: prototype based classification, learning vector quantization, Winner-Takes-All algorithms, on-line learning, competitive learning

2 0.053702172 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales

Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners

3 0.038383279 70 jmlr-2007-Ranking the Best Instances

Author: Stéphan Clémençon, Nicolas Vayatis

Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates

4 0.038062461 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

5 0.034786664 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

Author: Marc Teboulle

Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods

6 0.03342237 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

7 0.030200206 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

8 0.029882656 15 jmlr-2007-Bilinear Discriminant Component Analysis

9 0.028358689 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

10 0.027346367 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

11 0.027270777 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

12 0.025967373 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

13 0.025845267 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

14 0.025723156 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

15 0.025555205 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

16 0.02481116 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

17 0.02426588 9 jmlr-2007-AdaBoost is Consistent

18 0.024139924 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

19 0.02213867 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

20 0.021986457 52 jmlr-2007-Margin Trees for High-dimensional Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.134), (1, 0.046), (2, -0.006), (3, 0.032), (4, 0.025), (5, -0.032), (6, 0.015), (7, 0.022), (8, 0.008), (9, 0.017), (10, -0.064), (11, -0.047), (12, -0.1), (13, -0.03), (14, 0.001), (15, -0.007), (16, -0.076), (17, -0.019), (18, -0.013), (19, 0.108), (20, 0.04), (21, 0.153), (22, -0.164), (23, 0.072), (24, 0.03), (25, 0.018), (26, -0.027), (27, 0.281), (28, -0.027), (29, -0.329), (30, -0.104), (31, -0.334), (32, -0.036), (33, -0.276), (34, 0.101), (35, -0.075), (36, 0.055), (37, -0.16), (38, 0.118), (39, -0.02), (40, 0.282), (41, -0.11), (42, 0.053), (43, 0.316), (44, 0.008), (45, -0.059), (46, 0.189), (47, 0.058), (48, 0.089), (49, 0.133)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96156704 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms

Author: Michael Biehl, Anarta Ghosh, Barbara Hammer

Abstract: Learning vector quantization (LVQ) schemes constitute intuitive, powerful classification heuristics with numerous successful applications but, so far, limited theoretical background. We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. The analysis yields typical learning curves, convergence properties, and achievable generalization abilities. This is also possible for heuristic training schemes which do not relate to a cost function. We compare the performance of several algorithms, including Kohonen’s LVQ1 and LVQ+/-, a limiting case of LVQ2.1. The former shows close to optimal performance, while LVQ+/- displays divergent behavior. We investigate how early stopping can overcome this difficulty. Furthermore, we study a crisp version of robust soft LVQ, which was recently derived from a statistical formulation. Surprisingly, it exhibits relatively poor generalization. Performance improves if a window for the selection of data is introduced; the resulting algorithm corresponds to cost function based LVQ2. The dependence of these results on the model parameters, for example, prior class probabilities, is investigated systematically, simulations confirm our analytical findings. Keywords: prototype based classification, learning vector quantization, Winner-Takes-All algorithms, on-line learning, competitive learning

2 0.2390943 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales

Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners

3 0.21813032 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

Author: Gaëlle Loosli, Stéphane Canu

Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performances through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons of these different behaviors are given through the analysis of the stopping criterion. Keywords: SVM, CVM, large scale, KKT gap, stopping condition, stopping criteria

4 0.20567301 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

Author: Kristen Grauman, Trevor Darrell

Abstract: In numerous domains it is useful to represent a single example by the set of the local features or parts that comprise it. However, this representation poses a challenge to many conventional machine learning techniques, since sets may vary in cardinality and elements lack a meaningful ordering. Kernel methods can learn complex functions, but a kernel over unordered set inputs must somehow solve for correspondences—generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function called the pyramid match that measures partial match similarity in time linear in the number of features. The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears. We show the pyramid match yields a Mercer kernel, and we prove bounds on its error relative to the optimal partial matching cost. We demonstrate our algorithm on both classification and regression tasks, including object recognition, 3-D human pose inference, and time of publication estimation for documents, and we show that the proposed method is accurate and significantly more efficient than current approaches. Keywords: kernel, sets of features, histogram intersection, multi-resolution histogram pyramid, approximate matching, object recognition

5 0.19203982 15 jmlr-2007-Bilinear Discriminant Component Analysis

Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra

Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization

6 0.17164807 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

7 0.17099616 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

8 0.16185993 70 jmlr-2007-Ranking the Best Instances

9 0.15059879 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

10 0.13210002 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

11 0.12188663 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

12 0.1183003 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

13 0.11613648 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

14 0.11172999 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

15 0.11029004 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

16 0.10391798 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation

17 0.10040187 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

18 0.099897087 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

19 0.096544132 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

20 0.091695175 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.024), (10, 0.028), (12, 0.555), (15, 0.013), (28, 0.057), (40, 0.024), (48, 0.028), (60, 0.026), (80, 0.011), (85, 0.051), (98, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91062009 79 jmlr-2007-Structure and Majority Classes in Decision Tree Learning

Author: Ray J. Hickey

Abstract: To provide good classification accuracy on unseen examples, a decision tree, learned by an algorithm such as ID3, must have sufficient structure and also identify the correct majority class in each of its leaves. If there are inadequacies in respect of either of these, the tree will have a percentage classification rate below that of the maximum possible for the domain, namely (100 Bayes error rate). An error decomposition is introduced which enables the relative contributions of deficiencies in structure and in incorrect determination of majority class to be isolated and quantified. A sub-decomposition of majority class error permits separation of the sampling error at the leaves from the possible bias introduced by the attribute selection method of the induction algorithm. It is shown that sampling error can extend to 25% when there are more than two classes. Decompositions are obtained from experiments on several data sets. For ID3, the effect of selection bias is shown to vary from being statistically non-significant to being quite substantial, with the latter appearing to be associated with a simple underlying model. Keywords: decision tree learning, error decomposition, majority classes, sampling error, attribute selection bias 1

2 0.89218414 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

Author: Matthias Hein, Jean-Yves Audibert, Ulrike von Luxburg

Abstract: Given a sample from a probability measure with support on a submanifold in Euclidean space one can construct a neighborhood graph which can be seen as an approximation of the submanifold. The graph Laplacian of such a graph is used in several machine learning methods like semi-supervised learning, dimensionality reduction and clustering. In this paper we determine the pointwise limit of three different graph Laplacians used in the literature as the sample size increases and the neighborhood size approaches zero. We show that for a uniform measure on the submanifold all graph Laplacians have the same limit up to constants. However in the case of a non-uniform measure on the submanifold only the so called random walk graph Laplacian converges to the weighted Laplace-Beltrami operator. Keywords: graphs, graph Laplacians, semi-supervised learning, spectral clustering, dimensionality reduction

same-paper 3 0.85601443 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms

Author: Michael Biehl, Anarta Ghosh, Barbara Hammer

Abstract: Learning vector quantization (LVQ) schemes constitute intuitive, powerful classification heuristics with numerous successful applications but, so far, limited theoretical background. We study LVQ rigorously within a simplifying model situation: two competing prototypes are trained from a sequence of examples drawn from a mixture of Gaussians. Concepts from statistical physics and the theory of on-line learning allow for an exact description of the training dynamics in highdimensional feature space. The analysis yields typical learning curves, convergence properties, and achievable generalization abilities. This is also possible for heuristic training schemes which do not relate to a cost function. We compare the performance of several algorithms, including Kohonen’s LVQ1 and LVQ+/-, a limiting case of LVQ2.1. The former shows close to optimal performance, while LVQ+/- displays divergent behavior. We investigate how early stopping can overcome this difficulty. Furthermore, we study a crisp version of robust soft LVQ, which was recently derived from a statistical formulation. Surprisingly, it exhibits relatively poor generalization. Performance improves if a window for the selection of data is introduced; the resulting algorithm corresponds to cost function based LVQ2. The dependence of these results on the model parameters, for example, prior class probabilities, is investigated systematically, simulations confirm our analytical findings. Keywords: prototype based classification, learning vector quantization, Winner-Takes-All algorithms, on-line learning, competitive learning

4 0.45418429 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

Author: Rie Johnson, Tong Zhang

Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian

5 0.43127164 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

Author: Dima Kuzmin, Manfred K. Warmuth

Abstract: n Maximum concept classes of VC dimension d over n domain points have size ≤d , and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995) which compresses to labeled subsets of the sample of size equal d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph

6 0.41447493 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

7 0.40984613 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

8 0.37541643 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

9 0.35898781 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

10 0.35319024 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

11 0.34827328 51 jmlr-2007-Loop Corrections for Approximate Inference on Factor Graphs

12 0.33443493 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

13 0.33405459 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm

14 0.32999694 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

15 0.32768512 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

16 0.32653537 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

17 0.32090187 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation

18 0.31509501 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

19 0.31508732 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

20 0.31298602 14 jmlr-2007-Behavioral Shaping for Geometric Concepts