nips nips2002 nips2002-140 knowledge-graph by maker-knowledge-mining

140 nips-2002-Margin Analysis of the LVQ Algorithm


Source: pdf

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototype-based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmic aspects of such algorithms. On the theory side, we present margin-based generalization bounds that suggest that these kinds of classifiers can be more accurate than the 1-NN rule. Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20-year-old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 On the theory side, we present margin-based generalization bounds that suggest that these kinds of classifiers can be more accurate than the 1-NN rule. [sent-15, score-0.449]

2 Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. [sent-16, score-0.925]

3 An alternative approach is to choose a small data set (the prototypes) which represents the original training sample, and apply the nearest neighbour rule only with respect to this small data set. [sent-22, score-0.276]

4 The training data is used to measure the margin of each proposed positioning of the prototypes. [sent-29, score-0.386]

5 We combine these measurements to calculate a risk for each prototype set and select the prototypes that minimize the risk. [sent-30, score-0.67]

6 LVQ iterates over the training data and updates the prototypes' positions. [sent-37, score-0.57]

7 In this paper we show that algorithms derived from the maximal margin principle contain LVQ as a special case. [sent-39, score-0.409]

8 Buckingham and Geva [10] were the first to explore the relations between the maximal margin principle and LVQ. [sent-41, score-0.391]

9 Note that in order to generate a good classification rule the only significant factor is where the decision boundary lies (it is a well-known fact that classification is easier than density estimation [11]). [sent-46, score-0.185]

10 A discussion and definition of margin is provided in section 3. [sent-48, score-0.334]

11 Although LVQ was designed as an approximation to nearest neighbour, the theorem suggests that the former is more accurate in many cases. [sent-52, score-0.161]

12 In practice the tradeoff is controlled by loss functions. [sent-57, score-0.247]

13 Given a new instance x ∈ Rn we predict that it has the same label as the closest prototype, similar to the 1-nearest-neighbour rule (1-NN). [sent-65, score-0.272]

14 We denote the label predicted using a set of prototypes {µj} (j = 1, ..., k) by µ(x). [sent-66, score-0.557]

15 The goal of the learning process in this model is to find a set of prototypes which will accurately predict the labels of unseen instances. [sent-67, score-0.565]

16 The algorithm gets as input a labelled sample S = {(xl, yl)} (l = 1, ..., m), where xl ∈ Rn and yl ∈ Y, and uses it to find a good set of prototypes. [sent-69, score-0.361]

17 The algorithm maintains a set of prototypes, each assigned a predefined label which is kept constant during the learning process. [sent-71, score-0.561]

18 It cycles through the training data S and on each iteration modifies the set of prototypes according to one instance (xt, yt). [sent-72, score-0.632]

19 If the prototype µj has the same label as yt it is attracted to xt, but if the label of µj is different it is repelled from it. [sent-73, score-0.534]

20 Hence LVQ updates the closest prototypes to xt according to the rule: µj ← µj ± αt(xt − µj), (1) where the sign is positive if the labels of xt and µj agree, and negative otherwise. [sent-74, score-1.145]
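
As a concrete illustration of rule (1), here is a minimal NumPy sketch of one LVQ1-style step; the function and variable names are our own and only the update rule itself comes from the text:

```python
import numpy as np

def lvq1_step(prototypes, proto_labels, x_t, y_t, alpha_t):
    """One LVQ1-style step of rule (1): attract the closest prototype when its
    label agrees with y_t, otherwise repel it."""
    dists = np.linalg.norm(prototypes - x_t, axis=1)   # distances to all prototypes
    j = int(np.argmin(dists))                          # closest prototype to x_t
    sign = 1.0 if proto_labels[j] == y_t else -1.0     # labels agree -> attract
    prototypes[j] = prototypes[j] + sign * alpha_t * (x_t - prototypes[j])
    return prototypes
```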

21 The variants of LVQ differ in which prototypes they choose to update in each iteration and in the specific scheme used to modify αt . [sent-76, score-0.598]

22 For instance, LVQ1 and OLVQ1 update only the closest prototype to xt in each iteration. [sent-77, score-0.528]

23 LVQ2.1 modifies the two closest prototypes µi and µj to xt. [sent-79, score-0.844]

24 Exactly one of the prototypes has the same label as xt. [sent-81, score-0.779]

25 The ratio of their distances from xt falls in a window: 1/s ≤ ‖xt − µi‖ / ‖xt − µj‖ ≤ s, where s is the window size. [sent-85, score-0.687]
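
For concreteness, the window condition can be checked as follows; this is a small sketch and the names are our own:

```python
import numpy as np

def in_window(x_t, mu_i, mu_j, s):
    """LVQ2.1 window test: 1/s <= ||x_t - mu_i|| / ||x_t - mu_j|| <= s."""
    ratio = np.linalg.norm(x_t - mu_i) / np.linalg.norm(x_t - mu_j)
    return (1.0 / s) <= ratio <= s
```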

26 One approach is to define margin as the distance between an instance and the decision boundary induced by the classification rule as illustrated in figure 1(a). [sent-89, score-0.601]

27 In this definition the margin is the distance that the classifier can travel without changing the way it labels any of the sample points. [sent-92, score-0.49]

28 This type of margin is used in AdaBoost [5] and is illustrated in figure 1(b). [sent-94, score-0.354]

29 It is possible to apply these two types of margin in the context of LVQ. [sent-95, score-0.354]

30 Although using sample margin is more natural as a first choice, it turns out that this type of margin is both hard to compute and numerically unstable in our context, since small relocations of the prototypes might lead to a dramatic change in the sample margin. [sent-98, score-1.246]

31 Hence we focus on the hypothesis margin and thus have to define a distance measure between two classifiers. [sent-99, score-0.465]

32 We choose to define it as the maximal distance between prototype pairs, as illustrated in figure 2. [sent-100, score-0.623]

33 Note that this definition is not invariant to permutations of the prototypes but it upper bounds the invariant definition. [sent-102, score-0.549]

34 Furthermore, the induced margin is easy to compute (lemma 1) and lower bounds the sample-margin (lemma 2). [sent-103, score-0.396]

35 Figure 1: Sample margin (figure 1(a)) measures how far an instance can travel before it hits the decision boundary. [sent-104, score-0.163]

36 Lemma 1 Let µ = {µj} (j = 1, ..., k) be a set of prototypes and x a sample point. [sent-106, score-0.543]

37 Then the hypothesis margin of µ with respect to x is θ = ½(‖µj − x‖ − ‖µi − x‖), where µi (µj) is the closest prototype to x with the same (alternative) label. [sent-107, score-0.61]
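
Lemma 1 translates directly into code; the following is a small NumPy sketch (the function and array names are our own) that computes the hypothesis margin of a prototype set with respect to a labelled point:

```python
import numpy as np

def hypothesis_margin(prototypes, proto_labels, x, y):
    """Lemma 1: half the difference between the distance to the closest
    prototype with another label and the distance to the closest prototype
    with the same label."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    same = np.asarray(proto_labels) == y
    d_same = dists[same].min()     # closest prototype labelled y
    d_other = dists[~same].min()   # closest prototype with a different label
    return 0.5 * (d_other - d_same)
```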

38 sample-marginS(µ) ≥ hypothesis-marginS(µ). Lemma 2 shows that if we find a set of prototypes with a large hypothesis margin then it has a large sample margin as well. [sent-112, score-1.285]

39 When a classifier is applied to training data it is natural to use the training error as a prediction of the generalization error (the probability of misclassifying an unseen instance). [sent-114, score-0.17]

40 In a prototype-based hypothesis the classifier assigns a confidence level, i.e. a margin, to each prediction. [sent-115, score-0.236]

41 Taking the margin into account by counting instances with small margin as mistakes gives a better prediction and provides a bound on the generalization error. [sent-118, score-0.814]

42 This bound is given in terms of the number of prototypes, the sample size, the margin and the margin based empirical error. [sent-119, score-0.76]

43 The distance between the white and the black prototype sets is the maximal distance between prototype pairs. [sent-122, score-1.128]

44 • Let µ be a set of prototypes with k prototypes from each class. [sent-125, score-1.016]

45 Second, note that the VC dimension grows as the number of prototypes grows (3). [sent-133, score-0.582]

46 This suggests that using too many prototypes might result in poor performance; therefore there is a non-trivial optimal number of prototypes. [sent-134, score-0.508]

47 Hence prototype-based methods are not only faster than Nearest Neighbour, they are more accurate as well. [sent-137, score-0.162]

48 5 Maximizing Hypothesis Margin Through Loss Function: Once margin is properly defined it is natural to ask for an algorithm that maximizes it. [sent-139, score-0.367]

49 Before going any further we have to understand why maximizing the margin is a good idea. [sent-141, score-0.352]

50 [Figure 3 legend: zero-one loss, hinge loss, broken linear loss, exponential loss.] [sent-143, score-1.143]

51 In theorem 1 we saw that the generalization error can be bounded by a function of the margin θ and the empirical θ-error (α). [sent-144, score-0.712]

52 Therefore it is natural to seek prototypes that obtain small θ-error for a large θ. [sent-145, score-0.508]

53 A natural way to solve this problem is through the use of a loss function. [sent-147, score-0.225]

54 The idea is to associate a margin-based loss (a “cost”) with each hypothesis with respect to a sample. [sent-151, score-0.633]

55 We use L to compute the loss of a hypothesis with respect to one instance. [sent-166, score-0.299]

56 When a training set is available we sum the loss over the instances: L(µ) = Σl L(θl), where θl is the margin of the l'th instance in the training data. [sent-167, score-0.663]

57 The two axioms of loss functions guarantee that L(µ) bounds the empirical error. [sent-168, score-0.29]

58 It is common to add more restrictions on the loss function, such as requiring that L is a non-increasing function. [sent-169, score-0.225]

59 However, the only assumption we make here is that the loss function L is differentiable. [sent-170, score-0.225]

60 AdaBoost uses the exponential loss function L(θ) = e^(−βθ) while SVM uses the “hinge” loss L(θ) = (1 − βθ)+, where β > 0 is a scaling factor. [sent-172, score-0.471]
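
A small sketch of these margin-based losses (together with the broken linear loss shown in figure 3) and of the summed sample loss L(µ) = Σl L(θl); the choice β = 1 here is an arbitrary assumption made only for illustration:

```python
import numpy as np

beta = 1.0  # scaling factor; this particular value is our assumption

def hinge_loss(theta):            # used by SVM
    return np.maximum(0.0, 1.0 - beta * theta)

def exponential_loss(theta):      # used by AdaBoost
    return np.exp(-beta * theta)

def broken_linear_loss(theta):    # the loss later related to LVQ2.1 (figure 3)
    return np.minimum(2.0, np.maximum(0.0, 1.0 - beta * theta))

def total_loss(margins, loss_fn):
    # L(mu) = sum over the training instances of L(theta_l)
    return float(np.sum(loss_fn(np.asarray(margins))))
```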

61 See figure 3 for a demonstration of these loss functions. [sent-173, score-0.249]

62 Once a loss function is chosen, the goal of the learning algorithm is to find a hypothesis that minimizes it. [sent-174, score-0.332]

63 Recall that in our case θl = (‖xl − µi‖ − ‖xl − µj‖)/2, where µj and µi are the closest prototypes to xl with the correct and incorrect labels respectively. [sent-176, score-1.294]

64 Hence we have that dθl/dµr = Sl(r) (xl − µr)/‖xl − µr‖, where Sl(r) is a sign function such that Sl(r) = 1 if µr is the closest prototype with the correct label. [sent-177, score-0.68]

65 Sl(r) = −1 if µr is the closest prototype with the incorrect label. [sent-178, score-0.326]

66 Note that if xl = µj the derivative is not defined. [sent-180, score-0.23]

67 Recall that L is a loss function, and γt decays to zero as the algorithm proceeds. [sent-183, score-0.258]

68 Choose initial positions for the prototypes {µj} (j = 1, ..., k). [sent-185, score-0.508]

69 For t = 1, ..., T (or ∞): (a) receive a labelled instance (xt, yt); (b) compute the closest correct and incorrect prototypes to xt, µj and µi, and the margin of xt. [sent-187, score-1.812]

70 This leads to two conclusions: µr = Σl wl xl, where αl = (dL(θl)/dθ) · Sl(r)/‖xl − µr‖ and wl = αl / Σl' αl'. [sent-190, score-0.303]

71 Furthermore, from its definition it is clear that wl ≠ 0 only for the closest prototypes to xl. [sent-192, score-0.87]

72 In other words, wl ≠ 0 if and only if µr is either the closest prototype to xl which has the same label as xl, or the closest prototype to xl with an alternative label. [sent-193, score-1.235]

73 Minimizing the Loss: Using (4) we can find a local minimum of the loss function by a gradient descent algorithm. [sent-196, score-0.243]

74 The iteration in time t computes: µr(t + 1) ← µr(t) + γt Σl (dL(θl)/dθ) Sl(r) (xl − µr(t))/‖xl − µr(t)‖, where γt approaches zero as t increases. [sent-197, score-0.386]

75 This computation can be done iteratively where in each step we update µr only with respect to one sample point xl . [sent-198, score-0.259]

76 This leads to the following basic update step: µr ← µr + γt (dL(θl)/dθ) Sl(r) (xl − µr)/‖xl − µr‖. Note that Sl(r) differs from zero only for the closest correct and incorrect prototypes to xl; therefore a simple online algorithm is obtained and presented as algorithm 1. [sent-199, score-1.207]
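
A minimal sketch of this online step (algorithm 1). The sign is written here as an explicit descent step, so that a non-increasing loss attracts the closest correct prototype and repels the closest incorrect one; the function and variable names are our own:

```python
import numpy as np

def online_loss_step(prototypes, proto_labels, x_l, y_l, gamma_t, dL_dtheta):
    """One online step: move only the closest correct and closest incorrect
    prototypes, scaled by the derivative of the loss at the margin of x_l."""
    dists = np.linalg.norm(prototypes - x_l, axis=1)
    same = np.asarray(proto_labels) == y_l
    j = np.flatnonzero(same)[np.argmin(dists[same])]    # closest correct prototype
    i = np.flatnonzero(~same)[np.argmin(dists[~same])]  # closest incorrect prototype
    theta = 0.5 * (dists[i] - dists[j])                 # margin of x_l
    step = -gamma_t * dL_dtheta(theta)                  # descent; dL/dtheta <= 0, so step >= 0
    for r, s in ((j, 1.0), (i, -1.0)):                  # S_l(r) = +1 / -1
        prototypes[r] = prototypes[r] + step * s * (x_l - prototypes[r]) / dists[r]
    return prototypes
```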

77 LVQ1 and OLVQ1: The online loss minimization (algorithm 1) is a general algorithm applicable with different choices of loss functions. [sent-201, score-0.548]

78 We will now apply it with a couple of loss functions and see how LVQ emerges. [sent-202, score-0.245]

79 Recall that the hinge loss is defined to be L(θ) = (1 − βθ)+ . [sent-204, score-0.382]

80 The derivative of this loss function is −β for θ < 1/β and 0 for θ > 1/β (the “hinge” loss has no derivative at the point θ = 1/β). [sent-205, score-0.487]

81 This is the same update rule as in the LVQ1 and OLVQ1 algorithms [9], apart from the extra factor of β/‖xt − µr‖. [sent-210, score-0.355]

82 However, this is a minor difference since β/‖xt − µr‖ is just a normalizing factor. [sent-211, score-0.242]
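
Plugging the hinge loss into a step like the online_loss_step sketch above makes this connection explicit; the derivative and the resulting rule are shown below, again with the illustrative assumption β = 1:

```python
beta = 1.0  # scaling factor of the hinge loss; the value is assumed for illustration

def hinge_derivative(theta):
    # dL/dtheta for L(theta) = (1 - beta*theta)_+ ; undefined exactly at theta = 1/beta
    return -beta if theta < 1.0 / beta else 0.0

# With this derivative, the online step becomes (for theta < 1/beta)
#   mu_r <- mu_r +/- gamma_t * beta * (x_l - mu_r) / ||x_l - mu_r||,
# i.e. the LVQ1/OLVQ1 rule up to the normalizing factor beta / ||x_l - mu_r||.
```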

83 A demonstration of the effect of OLVQ1 on the “hinge” loss function is provided in figure 4. [sent-212, score-0.249]

84 As expected the loss decreases as the algorithm proceeds. [sent-215, score-0.258]

85 For this purpose we used the LVQ_PAK package [13]. [sent-216, score-0.549]

86 The idea behind the definition of margin, and especially hypothesis margin, was that a minor change in the hypothesis cannot change the way it labels an instance which had a large margin. [sent-219, score-0.567]

87 LVQ2.1 updates µr only if the margin of xt falls inside a certain window. [sent-228, score-0.607]

88 The loss function that corresponds to LVQ2.1 is the broken linear loss function (see figure 3). [sent-230, score-0.29]

89 The broken linear loss is defined to be L(θ) = min(2, (1 − βθ)+). [sent-231, score-0.29]

90 There are differences between LVQ2.1 and the online loss minimization presented here; however, these differences are minor. [sent-236, score-0.29]

91 6 Conclusions and Further Research: In this paper we used the maximal margin principle together with loss functions to derive algorithms for prototype positioning. [sent-237, score-0.796]

92 We also provide generalization bounds for any prototype based classifier. [sent-239, score-0.277]

93 The first is to use other loss functions such as the exponential loss. [sent-241, score-0.246]

94 The proper way to adapt the algorithm to the chosen rule is to define the margin accordingly, and modify the minimization process in the training stage. [sent-243, score-0.499]

95 This may result in a more complicated formula for the derivative of the loss function, but may improve the results significantly in some cases. [sent-249, score-0.262]

96 We also presented a generalization guarantee for prototype-based classifiers that is based on the margin training error. [sent-251, score-0.602]

97 However, allowing the standard version more prototypes achieves the same improvement. [sent-256, score-0.536]

98 Recall that a classifier is parametrised by a set of labelled prototypes that define a Voronoi tessellation. [sent-258, score-0.538]

99 As the number of prototypes grows, the decision boundary consists of more, and shorter lines. [sent-262, score-0.588]

100 Boosting the margin: A new explanation for the effectiveness of voting methods. [sent-302, score-0.359]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('prototypes', 0.508), ('lvq', 0.49), ('margin', 0.334), ('loss', 0.225), ('xt', 0.222), ('xl', 0.193), ('prototype', 0.162), ('hinge', 0.157), ('closest', 0.114), ('sl', 0.109), ('classi', 0.104), ('dl', 0.087), ('neighbour', 0.085), ('generalization', 0.074), ('hypothesis', 0.074), ('rule', 0.069), ('broken', 0.065), ('pak', 0.059), ('travel', 0.059), ('margins', 0.058), ('er', 0.055), ('wl', 0.055), ('yt', 0.052), ('nearest', 0.05), ('incorrect', 0.05), ('label', 0.049), ('voronoi', 0.047), ('rn', 0.046), ('gure', 0.043), ('quantization', 0.043), ('adaboost', 0.042), ('bounds', 0.041), ('nition', 0.04), ('boundary', 0.04), ('decision', 0.04), ('instance', 0.04), ('instances', 0.039), ('variants', 0.039), ('lemma', 0.038), ('maximal', 0.038), ('distance', 0.037), ('derivative', 0.037), ('sample', 0.035), ('freund', 0.035), ('buckingham', 0.034), ('online', 0.034), ('bound', 0.033), ('algorithm', 0.033), ('unseen', 0.032), ('training', 0.032), ('update', 0.031), ('minimization', 0.031), ('labelled', 0.03), ('recall', 0.03), ('updates', 0.03), ('saw', 0.029), ('emerges', 0.029), ('version', 0.028), ('years', 0.027), ('theorem', 0.026), ('dence', 0.026), ('yl', 0.026), ('kohonen', 0.026), ('vc', 0.026), ('grows', 0.026), ('labels', 0.025), ('explanation', 0.025), ('empirical', 0.024), ('hits', 0.024), ('singer', 0.024), ('demonstration', 0.024), ('ers', 0.023), ('let', 0.023), ('prede', 0.023), ('de', 0.023), ('modi', 0.023), ('dimension', 0.022), ('tradeoff', 0.022), ('tishby', 0.022), ('exponential', 0.021), ('induced', 0.021), ('min', 0.021), ('family', 0.021), ('cation', 0.021), ('falls', 0.021), ('minor', 0.02), ('apply', 0.02), ('measure', 0.02), ('illustrated', 0.02), ('maintains', 0.02), ('hence', 0.02), ('choose', 0.02), ('principle', 0.019), ('svm', 0.019), ('boosting', 0.019), ('correct', 0.018), ('density', 0.018), ('algorithms', 0.018), ('descent', 0.018), ('good', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototype-based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmic aspects of such algorithms. On the theory side, we present margin-based generalization bounds that suggest that these kinds of classifiers can be more accurate than the 1-NN rule. Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20-year-old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

2 0.22042269 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

3 0.19469313 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

4 0.15259257 137 nips-2002-Location Estimation with a Differential Update Network

Author: Ali Rahimi, Trevor Darrell

Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.

5 0.14095148 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.

6 0.12909205 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

7 0.11403621 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

8 0.11284489 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

9 0.10739931 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

10 0.10580026 105 nips-2002-How to Combine Color and Shape Information for 3D Object Recognition: Kernels do the Trick

11 0.10539255 156 nips-2002-On the Complexity of Learning the Kernel Matrix

12 0.1019488 92 nips-2002-FloatBoost Learning for Classification

13 0.092116341 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

14 0.089513421 45 nips-2002-Boosted Dyadic Kernel Discriminants

15 0.086118497 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

16 0.084791929 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

17 0.08466173 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

18 0.08096806 46 nips-2002-Boosting Density Estimation

19 0.080722101 96 nips-2002-Generalized² Linear² Models

20 0.080327787 120 nips-2002-Kernel Design Using Boosting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, -0.142), (2, 0.027), (3, -0.063), (4, 0.214), (5, -0.024), (6, -0.004), (7, 0.01), (8, -0.014), (9, -0.05), (10, 0.005), (11, -0.155), (12, -0.103), (13, -0.234), (14, -0.141), (15, 0.011), (16, 0.042), (17, -0.06), (18, 0.017), (19, 0.008), (20, 0.05), (21, -0.044), (22, 0.163), (23, 0.116), (24, -0.159), (25, 0.132), (26, 0.072), (27, -0.044), (28, -0.029), (29, -0.087), (30, 0.086), (31, 0.004), (32, -0.136), (33, 0.091), (34, 0.006), (35, 0.033), (36, 0.019), (37, 0.004), (38, 0.003), (39, 0.041), (40, 0.139), (41, 0.068), (42, -0.133), (43, 0.02), (44, -0.028), (45, 0.09), (46, -0.064), (47, -0.019), (48, -0.089), (49, -0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94172764 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototype-based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmic aspects of such algorithms. On the theory side, we present margin-based generalization bounds that suggest that these kinds of classifiers can be more accurate than the 1-NN rule. Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20-year-old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

2 0.75139576 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unknown-abstract

3 0.74130416 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

4 0.48355761 137 nips-2002-Location Estimation with a Differential Update Network

Author: Ali Rahimi, Trevor Darrell

Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.

5 0.4662815 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

6 0.38715047 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

7 0.38566193 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

8 0.36378646 92 nips-2002-FloatBoost Learning for Classification

9 0.35664481 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

10 0.34727162 58 nips-2002-Conditional Models on the Ranking Poset

11 0.31503731 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

12 0.310826 156 nips-2002-On the Complexity of Learning the Kernel Matrix

13 0.30746353 45 nips-2002-Boosted Dyadic Kernel Discriminants

14 0.3056418 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

15 0.29938805 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

16 0.29747269 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

17 0.28943205 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

18 0.28854308 139 nips-2002-Margin-Based Algorithms for Information Filtering

19 0.28762466 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

20 0.28680921 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.037), (14, 0.022), (23, 0.02), (42, 0.061), (44, 0.203), (54, 0.156), (55, 0.045), (67, 0.013), (68, 0.021), (74, 0.077), (92, 0.085), (98, 0.145)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90040261 85 nips-2002-Fast Kernels for String and Tree Matching

Author: Alex J. Smola, S.v.n. Vishwanathan

Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynarrtic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.

same-paper 2 0.86434186 140 nips-2002-Margin Analysis of the LVQ Algorithm

Author: Koby Crammer, Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Prototype-based algorithms are commonly used to reduce the computational complexity of Nearest-Neighbour (NN) classifiers. In this paper we discuss theoretical and algorithmic aspects of such algorithms. On the theory side, we present margin-based generalization bounds that suggest that these kinds of classifiers can be more accurate than the 1-NN rule. Furthermore, we derive a training algorithm that selects a good set of prototypes using large margin principles. We also show that the 20-year-old Learning Vector Quantization (LVQ) algorithm emerges naturally from our framework. 1

3 0.83789033 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

Author: Peter Sykacek, Stephen J. Roberts

Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.

4 0.76265669 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡   £  £  £ §¤¢ £ © ¨ ¦ ¥ ©   ¡     ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤  ¡ parents B % % 9 C0A@ ! 9  @8 § ¥   ¢   2 ' % % 310  parents    ©¢   £ ¡ !    ' % #!  

5 0.75224227 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter

Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.

6 0.7517463 27 nips-2002-An Impossibility Theorem for Clustering

7 0.74926424 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

8 0.74825269 45 nips-2002-Boosted Dyadic Kernel Discriminants

9 0.74734497 10 nips-2002-A Model for Learning Variance Components of Natural Images

10 0.74611682 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

11 0.74572206 53 nips-2002-Clustering with the Fisher Score

12 0.74182624 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

13 0.74049115 161 nips-2002-PAC-Bayes & Margins

14 0.74004161 120 nips-2002-Kernel Design Using Boosting

15 0.73857027 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

16 0.73626781 46 nips-2002-Boosting Density Estimation

17 0.73616511 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

18 0.73512697 6 nips-2002-A Formulation for Minimax Probability Machine Regression

19 0.73414737 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

20 0.7314499 3 nips-2002-A Convergent Form of Approximate Policy Iteration