nips nips2010 nips2010-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. [sent-6, score-0.945]
2 It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. [sent-7, score-0.322]
3 This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. [sent-8, score-1.466]
4 We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. [sent-9, score-1.214]
5 1 Introduction The invention of the perceptron [12] goes back to the very beginning of AI more than half a century ago. [sent-10, score-0.252]
6 However, ‘for data sets that are not linearly separable, the perceptron learning algorithm will never converge’ (quoted from [1]). [sent-13, score-0.269]
7 The ‘perceptron cycling theorem’ (PCT) [2, 11] states that for the inseparable case the weights remain bounded and do not diverge to infinity. [sent-15, score-0.206]
8 More importantly, the correlations converge to the sample mean with a rate 1/T, which is much faster than sampling based algorithms that converge at a rate 1/√T. [sent-18, score-0.088]
9 1 In the inseparable case, we can interpret the perceptron as a bagging procedure and average predictions instead of picking the single best (or last) weights found during training. [sent-20, score-0.402]
10 2, this is exactly what the voted perceptron (VP) [5] does. [sent-22, score-0.367]
11 Interesting generalization bounds for the voted perceptron have been derived in [5]. [sent-23, score-0.367]
12 In contrast, herding combines the learning and inference phases by treating the weights as dynamic quantities and defining a deterministic set of updates such that averaging predictions preserves certain moments of the training data. [sent-29, score-0.866]
13 The herding algorithm generates a weakly chaotic sequence of weights and a sequence of states of both hidden and visible variables of the MRF model. [sent-30, score-0.877]
14 The intermediate states produced by herding are really ‘representative points’ of an implicit model that interpolates between data cases. [sent-31, score-0.736]
15 However, unlike in perceptron learning, the non-convergence of the weights is needed to generate long, non-periodic trajectories of states that can be averaged over. [sent-34, score-0.341]
16 In this paper, we show that supervised perceptron algorithms and unsupervised herding algorithms can all be derived from the PCT. [sent-35, score-0.945]
17 This connection allows us to strengthen existing herding results. [sent-36, score-0.716]
18 Moreover, the connection suggests new algorithms that lie between supervised perceptron and unsupervised herding algorithms. [sent-38, score-0.961]
19 From the perceptron perspective, conditional herding can be understood as “voted perceptrons with hidden units”. [sent-40, score-1.071]
20 Conditional herding can also be interpreted as the zero temperature limit of discriminative RBMs (dRBMs) [10]. [sent-41, score-0.822]
21 2 Perceptrons, Herding and the Perceptron Cycling Theorem We first review the perceptron cycling theorem that was initially introduced in [11] with a gap in the proof that was fixed in [2]. [sent-42, score-0.339]
22 A sequence of vectors {w_t}, w_t ∈ R^D, t = 0, 1, ... [sent-43, score-0.18]
23 is generated by the following iterative procedure: w_{t+1} = w_t + v_t, where v_t is an element of a finite set V, and the norm of v_t is bounded: max_i ||v_i|| = R < ∞. [sent-46, score-0.396]
24 ∀t ≥ 0: if w_t^T v_t ≤ 0, then there exists a constant M > 0 such that ||w_t − w_0|| < M. [sent-48, score-0.432]
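To make the theorem's conditions concrete, here is a minimal numerical sketch (ours, not from the paper): the update vectors come from a finite, symmetric set so that a v with w_t^T v ≤ 0 always exists; the weights then stay bounded and the running average of the updates decays roughly as 1/T.

import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(25, 5))
V = np.concatenate([base, -base])   # finite, symmetric set: some v with w^T v <= 0 always exists

w = np.zeros(5)                     # w_0
v_sum = np.zeros(5)                 # running sum of the applied updates
for t in range(1, 10001):
    v = V[np.argmin(V @ w)]         # any element with w_t^T v <= 0 satisfies the PCT condition
    w += v                          # w_{t+1} = w_t + v_t
    v_sum += v
    if t in (10, 100, 1000, 10000):
        # ||w_t - w_0|| stays bounded; ||(1/T) sum_t v_t|| shrinks like 1/T.
        print(t, round(np.linalg.norm(w), 3), round(np.linalg.norm(v_sum / t), 5))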
25 ||(1/T) Σ_{t=1}^T Δw_t|| = ||(1/T) Σ_{t=1}^T v_t|| < M/T, i.e., the average update vanishes at rate 1/T. 2.1 Voted Perceptron and Moment Matching The voted perceptron (VP) algorithm [5] repeatedly applies the update rule in Eqn. [sent-54, score-0.477]
26 Predictions of test labels are made after each update and final label predictions are taken as an average of all intermediate predictions. [sent-56, score-0.095]
27 2, where we identify V = {x_i(y_i − y_i^*) | y_i = ±1, y_i^* = ±1, i = 1, ..., N}. [sent-58, score-0.087]
28 For the VP algorithm, the PCT thus guarantees that the moments E_{p̃(x,y)}[x y] (with p̃ the empirical distribution) are matched with E_{p(y^*|x) p̃(x)}[x y^*], where p(y^*|x) is the model distribution implied by how VP generates y^*. [sent-62, score-0.118]
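For reference, a compact sketch of the voted-perceptron procedure described above (illustrative code, not the authors' implementation): labels are ±1, the update on a mistake is x_i(y_i − y_i^*) as in the set V, and test predictions are averaged over all intermediate weight vectors.

import numpy as np

def voted_perceptron(X, y, X_test, epochs=10, seed=0):
    # X: (n, d) training inputs, y: (n,) labels in {-1, +1}, X_test: (m, d) test inputs.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    vote_sum = np.zeros(X_test.shape[0])
    steps = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            y_star = 1.0 if w @ X[i] >= 0 else -1.0
            if y_star != y[i]:
                w += X[i] * (y[i] - y_star)        # element of V; equals 2*y_i*x_i on a mistake
            vote_sum += np.sign(X_test @ w)        # accumulate the intermediate predictions
            steps += 1
    return np.sign(vote_sum / steps)               # averaged ("voted") test prediction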
29 In maximum entropy models, one seeks a model that satisfies a set of expectation constraints (moments) from the training data, while maximizing the entropy of the remaining degrees of freedom [9]. [sent-63, score-0.094]
30 In contrast, a single perceptron strives to learn a deterministic mapping p(y^*|x) = δ[y^* − arg max_y (y w^T x)] that has zero entropy and gets every prediction on every training case correct (where δ is the delta function). [sent-64, score-0.318]
31 Entropy is created in p(y^*|x) only when the weights w_t do not converge. [sent-65, score-0.26]
32 Rather than learning a single ‘best’ MRF model that can be sampled from to estimate quantities of interest, herding combines learning and inference into a single process. [sent-71, score-0.677]
33 In particular, herding produces a trajectory of weights and states that reproduce the moments of the training data. [sent-72, score-0.836]
34 In herding [15], the parameters w are updated as: w_{t+1} = w_t + φ̄ − φ(x_t^*), where φ̄ = (1/N) Σ_i φ(x_i) and x_t^* = arg max_x w_t^T φ(x). [sent-80, score-1.07]
35 This follows from taking the zero temperature limit of the ML objective (see Section 2. [sent-83, score-0.086]
36 The maximization prevents the herding sequence from converging to a single point estimate on this alternative objective. [sent-85, score-0.717]
37 Let {w_t} denote the sequence of weights and {x_t^*} denote the sequence of states (pseudo-samples) produced by herding. [sent-86, score-0.089]
38 We can apply the PCT to herding by identifying V = {φ̄ − φ(x^*) | x^* ∈ X}. [sent-87, score-0.677]
39 It is now easy to see that, in general, herding does not converge, because under very mild conditions we can always find an x^* such that w_t^T v_t < 0. [sent-88, score-0.962]
40 The pseudo-sample averages of the features converge to the data averages φ̄ at a rate 1/T. [sent-91, score-0.108]
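The following toy sketch (our own construction with identity features on binary vectors, not code from the paper) runs the herding update w_{t+1} = w_t + φ̄ − φ(x_t^*) and prints how far the pseudo-sample feature averages are from the data averages; the gap shrinks at roughly the 1/T rate.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Toy state space: x is a vector of 6 binary variables; features are phi(x) = x.
X_space = np.array(list(product([0, 1], repeat=6)), dtype=float)   # all 64 states
data = (rng.random((200, 6)) < 0.3).astype(float)                  # illustrative "training" data
phi_bar = data.mean(axis=0)                                        # data moments (feature averages)

w = rng.normal(size=6)
phi_sum = np.zeros(6)
for t in range(1, 5001):
    x_star = X_space[np.argmax(X_space @ w)]    # x*_t = argmax_x w_t^T phi(x)
    w += phi_bar - x_star                       # herding update; w is never learned and fixed
    phi_sum += x_star
    if t in (100, 1000, 5000):
        # max deviation of pseudo-sample averages from data averages, shrinking like ~1/T
        print(t, round(np.abs(phi_sum / t - phi_bar).max(), 4))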
41 However, the PCT only requires us to find some state x_t^* such that w_t^T v_t ≤ 0, and in most cases this can easily be verified. [sent-99, score-0.252]
42 In this case, maximizations are initialized on all data-cases and the weights are updated by the difference between the average over the data-cases and the average over the {x_i^*} found after (partial) maximization. [sent-103, score-0.085]
43 For obvious reasons, it is now guaranteed that w_t^T v_t ≤ 0. [sent-105, score-0.252]
44 However, the PCT allows us to consider more general features that depend on the weights. Footnote 1: Similar convergence could also be achieved (without concern for generalization performance) by sampling directly from the training data. [sent-113, score-0.127]
45 However, herding converges with rate 1/T and is regularized by the weights to prevent overfitting. [sent-114, score-0.724]
46 2.4 Conditional Herding We are now ready to propose our new algorithm: conditional herding (CH). [sent-125, score-0.724]
47 Like the VP algorithm, CH is concerned with discriminative learning and, therefore, it conditions on the input attributes {xi }. [sent-126, score-0.088]
48 At each iteration t, CH randomly samples a subset of the data-cases and their labels Dt = {xit , yit } ⊆ D. [sent-131, score-0.192]
49 For every member of this mini-batch it computes a hidden variable zit using Eqn. [sent-132, score-0.181]
50 The parameters are then updated as: w_{t+1} = w_t + (η/|D_t|) Σ_{i_t ∈ D_t} (φ(x_{i_t}, y_{i_t}, z_{i_t}) − φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*)). (8) In the positive term, z_{i_t} is found as in Eqn. [sent-134, score-0.812]
51 Since w_t = η Σ_{t'=0}^{t−1} v_{t'} + w_0, the only scale that matters is the relative scale between w_0 and η. [sent-144, score-0.252]
52 Irrespective of the attractor we end up in, the PCT guarantees that: ||(1/T) Σ_{t=1}^T (1/|D_t|) Σ_{i_t} φ(x_{i_t}, y_{i_t}, z_{i_t}) − (1/T) Σ_{t=1}^T (1/|D_t|) Σ_{i_t} φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*)|| ∼ O(1/T). (10) [sent-147, score-0.525]
53 In general, herding systems perform better when we use normalized features: ||φ(x, z, y)|| = R, ∀(x, z, y). [sent-148, score-0.677]
54 The reason is that herding selects states by maximizing the inner product w^T φ, and features with large norms will therefore become more likely to be selected. [sent-149, score-0.757]
55 Finally, predictions on unseen test data are made by: (y_{tst,t}^*, z_{tst,t}^*) = arg max_{y', z'} w_t^T φ(x_{tst}, y', z'). (11) The algorithm is summarized in the algorithm-box below. [sent-153, score-0.251]
56 For t ≥ 0: (a) Choose a subset {xit , yit } = Dt ⊆ D. [sent-157, score-0.192]
57 For each (xit , yit ), choose a hidden state zit . [sent-158, score-0.373]
58 (b) Choose a set of "negative states" {(x_{i_t}^* = x_{i_t}, y_{i_t}^*, z_{i_t}^*)}, such that: (1/|D_t|) Σ_{i_t} w_{t−1}^T φ(x_{i_t}, y_{i_t}, z_{i_t}) ≤ (1/|D_t|) Σ_{i_t} w_{t−1}^T φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*). [sent-159, score-0.896]
59 Predict on test data as follows: (a) For every test case x_{tst,j} at every iteration, choose negative states (y_{tst,jt}^*, z_{tst,jt}^*) in the same way as for the training data. [sent-164, score-0.081]
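Below is an illustrative implementation of the algorithm box (a sketch under our own choice of discriminative-RBM-style features and ±1 hidden units; the class name, parameterization, and helper methods are ours, not the authors'): the positive phase maximizes over z with the true label clamped, the negative phase maximizes jointly over (y, z), and test predictions are accumulated as votes over iterations as in Eqn. 11.

import numpy as np

class ConditionalHerder:
    # Illustrative conditional herding with +/-1 hidden units; labels are integers 0..K-1.
    def __init__(self, dim_x, n_classes, n_hidden, eta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.normal(size=(n_hidden, dim_x))      # hidden-input weights
        self.V = 0.01 * rng.normal(size=(n_hidden, n_classes))  # hidden-label weights
        self.b = np.zeros(n_classes)                            # label biases
        self.c = np.zeros(n_hidden)                             # hidden biases
        self.eta = eta
        self.K = n_classes

    def _best_z(self, x, y):
        # argmax_z w^T phi(x, y, z) for fixed (x, y): elementwise sign of the hidden input
        a = self.W @ x + self.V[:, y] + self.c
        return np.where(a >= 0, 1.0, -1.0)

    def _best_yz(self, x):
        # joint argmax over (y, z): enumerate labels, maximize z for each, keep the best score
        best = None
        for y in range(self.K):
            z = self._best_z(x, y)
            score = z @ (self.W @ x + self.V[:, y] + self.c) + self.b[y]
            if best is None or score > best[0]:
                best = (score, y, z)
        return best[1], best[2]

    def update(self, X_batch, y_batch):
        # w_{t+1} = w_t + eta/|D_t| * sum_i [phi(x_i, y_i, z_i) - phi(x_i, y_i*, z_i*)]
        dW, dV = np.zeros_like(self.W), np.zeros_like(self.V)
        db, dc = np.zeros_like(self.b), np.zeros_like(self.c)
        for x, y in zip(X_batch, y_batch):
            z_pos = self._best_z(x, y)        # positive phase: true label clamped
            y_neg, z_neg = self._best_yz(x)   # negative phase: free maximization
            dW += np.outer(z_pos, x) - np.outer(z_neg, x)
            dV[:, y] += z_pos
            dV[:, y_neg] -= z_neg
            db[y] += 1.0
            db[y_neg] -= 1.0
            dc += z_pos - z_neg
        s = self.eta / len(X_batch)
        self.W += s * dW; self.V += s * dV; self.b += s * db; self.c += s * dc

    def predict_votes(self, X_test):
        # one vote per test case under the current weights; callers average votes over iterations
        return np.array([self._best_yz(x)[0] for x in X_test])

Training would interleave update() on random mini-batches with predict_votes() on the test set; the final label for each test case is the majority over the per-iteration votes, mirroring the averaging in Eqn. 11.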
60 2.5 Zero Temperature Limit of Discriminative MRF Learning Regular herding can be understood as gradient descent on the zero temperature limit of an MRF model. [sent-167, score-0.763]
61 Analogously, CH can be viewed as constant step size gradient updates on the zero temperature limit of discriminative MRFs (see [10] for the corresponding RBM model). [sent-169, score-0.18]
62 p(y | x) = Σ_z exp[w^T φ(y, z, x)] / Σ_{y', z'} exp[w^T φ(y', z', x)]. (13) Similar to herding [14], conditional herding introduces a temperature by replacing w by w/T and takes the limit T → 0 of T ℓ_T, where ℓ = Σ_i log p(y_i | x_i). [sent-171, score-1.487]
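Spelling out the limiting argument (our own rendering of the standard zero-temperature calculation; the form of Eqn. 13 is reconstructed from the fragments above):

p_T(y \mid x) \;=\; \frac{\sum_{z} \exp\!\big[\tfrac{1}{T}\, w^{\top}\phi(y, z, x)\big]}
                         {\sum_{y', z'} \exp\!\big[\tfrac{1}{T}\, w^{\top}\phi(y', z', x)\big]},
\qquad
\lim_{T \to 0}\, T \log \sum_{z} e^{\,w^{\top}\phi(y, z, x)/T} \;=\; \max_{z}\, w^{\top}\phi(y, z, x),

so that

\lim_{T \to 0} T\,\ell_T \;=\; \sum_{i} \Big[\, \max_{z} w^{\top}\phi(y_i, z, x_i) \;-\; \max_{y', z'} w^{\top}\phi(y', z', x_i) \,\Big].

A constant-step (sub)gradient step on this limit is exactly the conditional herding update: features at the clamped maximization minus features at the free maximization.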
63 3 Experiments We studied the behavior of conditional herding on two artificial and four real-world data sets, comparing its performance to that of the voted perceptron [5] and that of discriminative RBMs [10]. [sent-172, score-1.167]
64 We studied conditional herding in the discriminative RBM architecture illustrated in Figure 1. [sent-176, score-0.783]
65 3.1 Artificial Data To investigate the characteristics of VP, dRBMs and CH, we used the techniques to construct decision boundaries on two artificial data sets: (1) the banana data set; and (2) the Lithuanian data set. [sent-199, score-0.119]
66 The decision boundary for VP and CH is located where the sign of the prediction y_tst^* changes. [sent-201, score-0.079]
67 The dRBMs also had 20 hidden units and were trained by running conjugate gradients until convergence. [sent-203, score-0.104]
68 The weights of the dRBMs were initialized by sampling from a Gaussian distribution with a variance of 10^{-4}. [sent-204, score-0.091]
69 The results on the banana data set illustrate the representational advantages of hidden units. [sent-210, score-0.147]
70 Since VP selects data points at random to update the weights, on the banana data set the weight vector of VP tends to oscillate back and forth, yielding a nearly linear decision boundary. [sent-211, score-0.14]
71 In contrast, for CH the simple predictor in the top layer can regress onto M = 20 hidden features. [sent-213, score-0.086]
72 3.2 Real-World Data In addition to the experiments on synthetic data, we also performed experiments on four real-world data sets, namely: (1) the USPS data set, (2) the MNIST data set, (3) the UCI Pendigits data set, and (4) the 20-Newsgroups data set. [sent-216, score-0.085]
73 The MNIST data set contains 70,000 28 × 28 grayscale images of digits, with a fixed division into 60,000 training and 10,000 test instances. [sent-218, score-0.088]
74 We adopted a 1-of-K encoding, where if y_i is the label for data point x_i, then y_i = {y_{i,1}, ..., y_{i,K}}. [sent-228, score-0.095]
75 This partial maximization is not a problem as long as the ensuing configuration satisfies w_t^T v_t ≤ 0. [sent-242, score-0.276]
76 (i.e., we employed early stopping) or until the negative conditional log-likelihood on the training data stopped decreasing. [sent-247, score-0.104]
77 The parameters of the conditional herders were initialized by sampling from a Gaussian distribution. [sent-253, score-0.12]
78 However, since the dimension of the data is typically much greater than the number of classes, the dynamics of the conditional herding system will be largely driven by W. [sent-256, score-0.741]
79 In addition, during the early stages of herding, we adapted the parameter update for the bias on the hidden units θ in such a way that the marginal distribution over the hidden units was nearly uniform. [sent-265, score-0.246]
80 This has the advantage that it encourages high entropy in the hidden units, leading to more useful dynamics of the system. [sent-266, score-0.092]
81 In practice, we update θ as θ_{t+1} = θ_t + (η/|D_t|) Σ_{i_t} ((1 − λ) z̄_{i_t} − z_{i_t}^*), where z̄_{i_t} is the batch mean. [sent-267, score-0.289]
82 λ is initialized to 1 and we gradually halve its value every 500 updates, slowly moving from an entropy-encouraging update to the standard update for the biases of the hidden units. [sent-268, score-0.163]
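A small sketch of this schedule (illustrative; the form of the θ update above is our reading of a garbled equation, and the variable names are ours):

import numpy as np

def update_hidden_bias(theta, z_pos_batch, z_neg_batch, lam, eta):
    # theta_{t+1} = theta_t + eta * ((1 - lam) * z_bar - mean_i z_i*), with z_bar the batch
    # mean of the positive-phase hidden states; lam = 1 pushes the hidden marginals toward
    # uniform, while lam -> 0 recovers the standard bias update z_bar - z*.
    z_bar = z_pos_batch.mean(axis=0)
    return theta + eta * ((1.0 - lam) * z_bar - z_neg_batch.mean(axis=0))

lam, n_updates = 1.0, 0
# inside the training loop:
#   theta = update_hidden_bias(theta, z_pos, z_neg, lam, eta)
#   n_updates += 1
#   if n_updates % 500 == 0:
#       lam *= 0.5        # halve lambda every 500 updates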
83 The results reveal that the addition of hidden units to the voted perceptron leads to significant improvements in terms of generalization error. [sent-274, score-0.471]
84 Furthermore, the results of our experiments indicate that conditional herding performs on par with discriminative RBMs on the MNIST and USPS data sets and better on the 20 Newsgroups data set. [sent-275, score-0.834]
85 The 20 Newsgroups data is high dimensional and sparse, and both VP and CH appear to perform well in this regime. Footnote: Local maxima can also be found by iterating over y_{tst}^{*,k} and z_{tst,j}^{*,k}, but the proposed procedure is more efficient. [sent-276, score-0.094]
86 We use a fixed order of the mini-batches, so that if there are N data cases and the batch size is K, a training error of 0 for N/K consecutive iterations implies an error of 0 on the whole training set. [sent-277, score-0.08]
[Table: generalization errors of VP, discriminative RBMs, and conditional herding under the one-versus-all procedure (100 and 200 hidden units) and the joint procedure (50, 100, and 500 hidden units) on MNIST and the other data sets; the numerical entries were not recoverable.]
89 Techniques to promote sparsity in the hidden layer when training dRBMs exist (see [10]), but we did not investigate them here. [sent-343, score-0.087]
90 This is particularly evident in the low-dimensional UCI Pendigits data set, where the dRBMs start to badly overfit with 500 hidden units, while the test error for CH remains level. [sent-345, score-0.082]
91 4 Concluding Remarks The main contribution of this paper is to expose a relationship between the PCT and herding algorithms. [sent-347, score-0.677]
92 This has allowed us to strengthen certain results for herding - namely, theoretically validating herding with mini-batches and partial optimization. [sent-348, score-1.377]
93 It also directly leads to the insight that non-convergent VPs and herding match moments between data and generated predictions at a rate much faster than random sampling (O(1/T) vs. O(1/√T)). [sent-349, score-0.801]
94 From these insights, we have proposed a new conditional herding algorithm that is the zero-temperature limit of dRBMs [10]. [sent-351, score-0.751]
95 The herding perspective provides a new way of looking at learning as a dynamical system. [sent-352, score-0.699]
96 In fact, the PCT precisely specifies the conditions that need to hold for a herding system (in batch mode) to be a piecewise isometry [7]. [sent-353, score-0.76]
97 A piecewise isometry is a weakly chaotic dynamical system that divides parameter space into cells and applies a different isometry in each cell. [sent-354, score-0.171]
98 For herding, the isometry is given by a translation and the cells are labeled by the states {x∗ , y∗ , z, z∗ }, whichever combination applies. [sent-355, score-0.097]
99 Many interesting results about piecewise isometries have been proven in the mathematics literature such as the fact that the sequence of sampled states grows algebraically with T and not exponentially as in systems with random or chaotic components [6]. [sent-357, score-0.098]
100 Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. [sent-383, score-0.339]
wordName wordTfidf (topN-words)
[('herding', 0.677), ('vp', 0.253), ('perceptron', 0.252), ('pct', 0.243), ('ch', 0.209), ('drbms', 0.204), ('xit', 0.204), ('yit', 0.192), ('wt', 0.18), ('zit', 0.116), ('voted', 0.115), ('pendigits', 0.077), ('vt', 0.072), ('cycling', 0.07), ('hidden', 0.065), ('usps', 0.06), ('discriminative', 0.059), ('temperature', 0.059), ('ytst', 0.058), ('xxx', 0.051), ('mrf', 0.05), ('newsgroups', 0.05), ('dt', 0.048), ('moments', 0.048), ('conditional', 0.047), ('weights', 0.047), ('banana', 0.047), ('inseparable', 0.047), ('mnist', 0.043), ('states', 0.042), ('rbm', 0.042), ('rbms', 0.04), ('units', 0.039), ('lithuanian', 0.038), ('update', 0.038), ('isometry', 0.038), ('predictions', 0.037), ('updates', 0.035), ('converge', 0.033), ('rmax', 0.033), ('division', 0.031), ('chaotic', 0.03), ('perceptrons', 0.03), ('uci', 0.03), ('yi', 0.029), ('attributes', 0.029), ('herders', 0.029), ('minsky', 0.029), ('rosenblatt', 0.029), ('tst', 0.029), ('typeset', 0.029), ('entropy', 0.027), ('limit', 0.027), ('xy', 0.027), ('piecewise', 0.026), ('attractor', 0.025), ('maximization', 0.024), ('chaos', 0.023), ('strengthen', 0.023), ('picked', 0.023), ('training', 0.022), ('dynamical', 0.022), ('sampling', 0.022), ('initialized', 0.022), ('decision', 0.021), ('quebec', 0.021), ('regress', 0.021), ('boltzmann', 0.021), ('label', 0.02), ('features', 0.02), ('dim', 0.019), ('averages', 0.019), ('procedure', 0.019), ('batch', 0.019), ('boundary', 0.018), ('irrespective', 0.018), ('cardinality', 0.018), ('analogously', 0.018), ('representational', 0.018), ('stopped', 0.018), ('grayscale', 0.018), ('maximizing', 0.018), ('theorem', 0.017), ('par', 0.017), ('data', 0.017), ('energy', 0.017), ('cells', 0.017), ('arti', 0.017), ('arg', 0.017), ('montreal', 0.016), ('ml', 0.016), ('started', 0.016), ('cial', 0.016), ('connection', 0.016), ('zt', 0.016), ('convergence', 0.016), ('unsupervised', 0.016), ('generates', 0.016), ('updated', 0.016), ('prevents', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.09446083 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.076715514 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.065957621 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
Author: Shinichi Nakajima, Masashi Sugiyama, Ryota Tomioka
Abstract: Bayesian methods of matrix factorization (MF) have been actively explored recently as promising alternatives to classical singular value decomposition. In this paper, we show that, despite the fact that the optimization problem is non-convex, the global optimal solution of variational Bayesian (VB) MF can be computed analytically by solving a quartic equation. This is highly advantageous over a popular VBMF algorithm based on iterated conditional modes since it can only find a local optimal solution after iterations. We further show that the global optimal solution of empirical VBMF (hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments.
5 0.063582242 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptronlike learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem. 1
6 0.055564415 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
7 0.054739162 235 nips-2010-Self-Paced Learning for Latent Variable Models
8 0.054429892 268 nips-2010-The Neural Costs of Optimal Control
9 0.052298598 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
10 0.050654408 257 nips-2010-Structured Determinantal Point Processes
11 0.050341628 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
12 0.050222259 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
13 0.049635291 212 nips-2010-Predictive State Temporal Difference Learning
14 0.044466376 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
15 0.042277075 103 nips-2010-Generating more realistic images using gated MRF's
16 0.041178565 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
17 0.039939135 202 nips-2010-Parallelized Stochastic Gradient Descent
18 0.039898928 43 nips-2010-Bootstrapping Apprenticeship Learning
19 0.039677281 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
20 0.039165199 99 nips-2010-Gated Softmax Classification
topicId topicWeight
[(0, 0.12), (1, -0.003), (2, 0.002), (3, -0.011), (4, -0.011), (5, 0.023), (6, -0.012), (7, 0.02), (8, -0.05), (9, 0.025), (10, -0.004), (11, -0.046), (12, 0.114), (13, -0.025), (14, -0.053), (15, 0.004), (16, -0.032), (17, -0.044), (18, -0.007), (19, -0.028), (20, 0.002), (21, 0.067), (22, 0.102), (23, 0.045), (24, -0.039), (25, 0.007), (26, 0.06), (27, 0.002), (28, -0.035), (29, -0.043), (30, 0.013), (31, -0.03), (32, 0.025), (33, 0.083), (34, -0.027), (35, 0.051), (36, 0.037), (37, -0.003), (38, 0.061), (39, -0.019), (40, -0.061), (41, -0.007), (42, 0.07), (43, -0.025), (44, -0.052), (45, 0.023), (46, -0.037), (47, -0.028), (48, -0.061), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.88544041 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.74439478 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.66518962 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.63791621 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptronlike learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem. 1
5 0.58799869 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
6 0.55744666 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
7 0.55363202 235 nips-2010-Self-Paced Learning for Latent Variable Models
8 0.54442638 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
9 0.51114213 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
10 0.47184461 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
11 0.46074 212 nips-2010-Predictive State Temporal Difference Learning
12 0.44405648 99 nips-2010-Gated Softmax Classification
13 0.44215459 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
14 0.41669816 271 nips-2010-Tiled convolutional neural networks
15 0.40254003 224 nips-2010-Regularized estimation of image statistics by Score Matching
16 0.40158114 202 nips-2010-Parallelized Stochastic Gradient Descent
17 0.40148988 103 nips-2010-Generating more realistic images using gated MRF's
18 0.39985049 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
19 0.39594749 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
20 0.39592847 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
topicId topicWeight
[(13, 0.031), (17, 0.014), (27, 0.052), (30, 0.044), (35, 0.02), (45, 0.205), (50, 0.038), (52, 0.02), (59, 0.013), (60, 0.033), (61, 0.314), (77, 0.059), (78, 0.018), (90, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.74998903 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.74175668 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
Author: Hugo Larochelle, Geoffrey E. Hinton
Abstract: We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the “glimpse” at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images. 1
3 0.59665859 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
4 0.59410632 282 nips-2010-Variable margin losses for classifier design
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1
5 0.59076881 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
6 0.59035766 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
7 0.58953351 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
8 0.58943027 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
9 0.58916783 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
10 0.58871609 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.58809656 290 nips-2010-t-logistic regression
12 0.58798343 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
13 0.58792424 155 nips-2010-Learning the context of a category
14 0.58777392 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.58768207 243 nips-2010-Smoothness, Low Noise and Fast Rates
16 0.58766258 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
17 0.58753455 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
18 0.5874424 224 nips-2010-Regularized estimation of image statistics by Score Matching
19 0.58741689 1 nips-2010-(RF)^2 -- Random Forest Random Field
20 0.58710045 148 nips-2010-Learning Networks of Stochastic Differential Equations