nips nips2010 nips2010-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. [sent-6, score-0.945]
2 It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. [sent-7, score-0.322]
3 This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. [sent-8, score-1.466]
4 We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. [sent-9, score-1.214]
5 1 Introduction The invention of the perceptron [12] goes back to the very beginning of AI more than half a century ago. [sent-10, score-0.252]
6 However, ‘for data sets that are not linearly separable, the perceptron learning algorithm will never converge’ (quoted from [1]). [sent-13, score-0.269]
7 The ‘perceptron cycling theorem’ (PCT) [2, 11] states that for the inseparable case the weights remain bounded and do not diverge to infinity. [sent-15, score-0.206]
8 More importantly, the correlations converge to the sample mean with a rate 1/T, which is much faster than sampling based algorithms that converge at a rate 1/√T. [sent-18, score-0.088]
9 1 In the inseparable case, we can interpret the perceptron as a bagging procedure and average predictions instead of picking the single best (or last) weights found during training. [sent-20, score-0.402]
10 2, this is exactly what the voted perceptron (VP) [5] does. [sent-22, score-0.367]
11 Interesting generalization bounds for the voted perceptron have been derived in [5]. [sent-23, score-0.367]
12 In contrast, herding combines the learning and inference phases by treating the weights as dynamic quantities and defining a deterministic set of updates such that averaging predictions preserves certain moments of the training data. [sent-29, score-0.866]
13 The herding algorithm generates a weakly chaotic sequence of weights and a sequence of states of both hidden and visible variables of the MRF model. [sent-30, score-0.877]
14 The intermediate states produced by herding are really ‘representative points’ of an implicit model that interpolates between data cases. [sent-31, score-0.736]
15 However, unlike in perceptron learning, the non-convergence of the weights is needed to generate long, non-periodic trajectories of states that can be averaged over. [sent-34, score-0.341]
16 In this paper, we show that supervised perceptron algorithms and unsupervised herding algorithms can all be derived from the PCT. [sent-35, score-0.945]
17 This connection allows us to strengthen existing herding results. [sent-36, score-0.716]
18 Moreover, the connection suggests new algorithms that lie between supervised perceptron and unsupervised herding algorithms. [sent-38, score-0.961]
19 From the perceptron perspective, conditional herding can be understood as “voted perceptrons with hidden units”. [sent-40, score-1.071]
20 Conditional herding can also be interpreted as the zero temperature limit of discriminative RBMs (dRBMs) [10]. [sent-41, score-0.822]
21 2 Perceptrons, Herding and the Perceptron Cycling Theorem We first review the perceptron cycling theorem that was initially introduced in [11] with a gap in the proof that was fixed in [2]. [sent-42, score-0.339]
22 A sequence of vectors {w_t}, w_t ∈ R^D, t = 0, 1, ... [sent-43, score-0.18]
23 is generated by the following iterative procedure: w_{t+1} = w_t + v_t, where v_t is an element of a finite set V, and the norm of v_t is bounded: max_i ||v_i|| = R < ∞. [sent-46, score-0.396]
24 ∀t ≥ 0: if w_t^T v_t ≤ 0, then there exists a constant M > 0 such that ||w_t − w_0|| < M. [sent-48, score-0.432]
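To make the theorem's conditions concrete, here is a minimal numerical sketch (ours, not from the paper): the update vectors come from a finite, symmetric set so that a v with w_t^T v ≤ 0 always exists; the weights then stay bounded and the running average of the updates decays roughly as 1/T.

import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(25, 5))
V = np.concatenate([base, -base])   # finite, symmetric set: some v with w^T v <= 0 always exists

w = np.zeros(5)                     # w_0
v_sum = np.zeros(5)                 # running sum of the applied updates
for t in range(1, 10001):
    v = V[np.argmin(V @ w)]         # any element with w_t^T v <= 0 satisfies the PCT condition
    w += v                          # w_{t+1} = w_t + v_t
    v_sum += v
    if t in (10, 100, 1000, 10000):
        # ||w_t - w_0|| stays bounded; ||(1/T) sum_t v_t|| shrinks like 1/T.
        print(t, round(np.linalg.norm(w), 3), round(np.linalg.norm(v_sum / t), 5))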
25 ||(1/T) Σ_{t=1}^T Δw_t|| = ||(1/T) Σ_{t=1}^T v_t|| < M/T, i.e., the average update vanishes at rate 1/T. 2.1 Voted Perceptron and Moment Matching The voted perceptron (VP) algorithm [5] repeatedly applies the update rule in Eqn. [sent-54, score-0.477]
26 Predictions of test labels are made after each update and final label predictions are taken as an average of all intermediate predictions. [sent-56, score-0.095]
27 2, where we identify V = {x_i(y_i − y_i^*) | y_i = ±1, y_i^* = ±1, i = 1, ..., N}. [sent-58, score-0.087]
28 For the VP algorithm, the PCT thus guarantees that the moments E_{p̃(x,y)}[x y] (with p̃ the empirical distribution) are matched with E_{p(y^*|x) p̃(x)}[x y^*], where p(y^*|x) is the model distribution implied by how VP generates y^*. [sent-62, score-0.118]
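For reference, a compact sketch of the voted-perceptron procedure described above (illustrative code, not the authors' implementation): labels are ±1, the update on a mistake is x_i(y_i − y_i^*) as in the set V, and test predictions are averaged over all intermediate weight vectors.

import numpy as np

def voted_perceptron(X, y, X_test, epochs=10, seed=0):
    # X: (n, d) training inputs, y: (n,) labels in {-1, +1}, X_test: (m, d) test inputs.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    vote_sum = np.zeros(X_test.shape[0])
    steps = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            y_star = 1.0 if w @ X[i] >= 0 else -1.0
            if y_star != y[i]:
                w += X[i] * (y[i] - y_star)        # element of V; equals 2*y_i*x_i on a mistake
            vote_sum += np.sign(X_test @ w)        # accumulate the intermediate predictions
            steps += 1
    return np.sign(vote_sum / steps)               # averaged ("voted") test prediction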
29 In maximum entropy models, one seeks a model that satisfies a set of expectation constraints (moments) from the training data, while maximizing the entropy of the remaining degrees of freedom [9]. [sent-63, score-0.094]
30 In contrast, a single perceptron strives to learn a deterministic mapping p(y^*|x) = δ[y^* − arg max_y (y w^T x)] that has zero entropy and gets every prediction on every training case correct (where δ is the delta function). [sent-64, score-0.318]
31 Entropy is created in p(y^*|x) only when the weights w_t do not converge. [sent-65, score-0.26]
32 Rather than learning a single ‘best’ MRF model that can be sampled from to estimate quantities of interest, herding combines learning and inference into a single process. [sent-71, score-0.677]
33 In particular, herding produces a trajectory of weights and states that reproduce the moments of the training data. [sent-72, score-0.836]
34 In herding [15], the parameters w are updated as: w_{t+1} = w_t + φ̄ − φ(x_t^*), where φ̄ = (1/N) Σ_i φ(x_i) and x_t^* = arg max_x w_t^T φ(x). [sent-80, score-1.07]
35 This follows from taking the zero temperature limit of the ML objective (see Section 2. [sent-83, score-0.086]
36 The maximization prevents the herding sequence from converging to a single point estimate on this alternative objective. [sent-85, score-0.717]
37 Let {w_t} denote the sequence of weights and {x_t^*} denote the sequence of states (pseudo-samples) produced by herding. [sent-86, score-0.089]
38 We can apply the PCT to herding by identifying V = {φ̄ − φ(x^*) | x^* ∈ X}. [sent-87, score-0.677]
39 It is now easy to see that, in general, herding does not converge, because under very mild conditions we can always find an x^* such that w_t^T v_t < 0. [sent-88, score-0.962]
40 The pseudo-sample averages of the features converge to the data averages φ̄ at a rate 1/T. [sent-91, score-0.108]
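The following toy sketch (our own construction with identity features on binary vectors, not code from the paper) runs the herding update w_{t+1} = w_t + φ̄ − φ(x_t^*) and prints how far the pseudo-sample feature averages are from the data averages; the gap shrinks at roughly the 1/T rate.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Toy state space: x is a vector of 6 binary variables; features are phi(x) = x.
X_space = np.array(list(product([0, 1], repeat=6)), dtype=float)   # all 64 states
data = (rng.random((200, 6)) < 0.3).astype(float)                  # illustrative "training" data
phi_bar = data.mean(axis=0)                                        # data moments (feature averages)

w = rng.normal(size=6)
phi_sum = np.zeros(6)
for t in range(1, 5001):
    x_star = X_space[np.argmax(X_space @ w)]    # x*_t = argmax_x w_t^T phi(x)
    w += phi_bar - x_star                       # herding update; w is never learned and fixed
    phi_sum += x_star
    if t in (100, 1000, 5000):
        # max deviation of pseudo-sample averages from data averages, shrinking like ~1/T
        print(t, round(np.abs(phi_sum / t - phi_bar).max(), 4))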
41 However, the PCT only requires us to find some state x_t^* such that w_t^T v_t ≤ 0, and in most cases this can easily be verified. [sent-99, score-0.252]
42 In this case, maximizations are initialized on all data-cases and the weights are updated by the difference between the average over the data-cases and the average over the {x_i^*} found after (partial) maximization. [sent-103, score-0.085]
43 For obvious reasons, it is now guaranteed that w_t^T v_t ≤ 0. [sent-105, score-0.252]
44 However, the PCT allows us to consider more general features that depend on the weights. Footnote 1: Similar convergence could also be achieved (without concern for generalization performance) by sampling directly from the training data. [sent-113, score-0.127]
45 However, herding converges with rate 1/T and is regularized by the weights to prevent overfitting. [sent-114, score-0.724]
46 2.4 Conditional Herding We are now ready to propose our new algorithm: conditional herding (CH). [sent-125, score-0.724]
47 Like the VP algorithm, CH is concerned with discriminative learning and, therefore, it conditions on the input attributes {xi }. [sent-126, score-0.088]
48 At each iteration t, CH randomly samples a subset of the data-cases and their labels Dt = {xit , yit } ⊆ D. [sent-131, score-0.192]
49 For every member of this mini-batch it computes a hidden variable zit using Eqn. [sent-132, score-0.181]
50 The parameters are then updated as: w_{t+1} = w_t + (η/|D_t|) Σ_{i_t ∈ D_t} (φ(x_{i_t}, y_{i_t}, z_{i_t}) − φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*)). (8) In the positive term, z_{i_t} is found as in Eqn. [sent-134, score-0.812]
51 Since w_t = η Σ_{t'=0}^{t−1} v_{t'} + w_0, the only scale that matters is the relative scale between w_0 and η. [sent-144, score-0.252]
52 Irrespective of the attractor we end up in, the PCT guarantees that: ||(1/T) Σ_{t=1}^T (1/|D_t|) Σ_{i_t} φ(x_{i_t}, y_{i_t}, z_{i_t}) − (1/T) Σ_{t=1}^T (1/|D_t|) Σ_{i_t} φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*)|| ∼ O(1/T). (10) [sent-147, score-0.525]
53 In general, herding systems perform better when we use normalized features: ||φ(x, z, y)|| = R, ∀(x, z, y). [sent-148, score-0.677]
54 The reason is that herding selects states by maximizing the inner product w^T φ, and features with large norms will therefore become more likely to be selected. [sent-149, score-0.757]
55 Finally, predictions on unseen test data are made by: (y_{tst,t}^*, z_{tst,t}^*) = arg max_{y', z'} w_t^T φ(x_{tst}, y', z'). (11) The algorithm is summarized in the algorithm-box below. [sent-153, score-0.251]
56 For t ≥ 0: (a) Choose a subset {xit , yit } = Dt ⊆ D. [sent-157, score-0.192]
57 For each (xit , yit ), choose a hidden state zit . [sent-158, score-0.373]
58 (b) Choose a set of "negative states" {(x_{i_t}^* = x_{i_t}, y_{i_t}^*, z_{i_t}^*)}, such that: (1/|D_t|) Σ_{i_t} w_{t−1}^T φ(x_{i_t}, y_{i_t}, z_{i_t}) ≤ (1/|D_t|) Σ_{i_t} w_{t−1}^T φ(x_{i_t}, y_{i_t}^*, z_{i_t}^*). [sent-159, score-0.896]
59 Predict on test data as follows: (a) For every test case x_{tst,j} at every iteration, choose negative states (y_{tst,jt}^*, z_{tst,jt}^*) in the same way as for the training data. [sent-164, score-0.081]
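Below is an illustrative implementation of the algorithm box (a sketch under our own choice of discriminative-RBM-style features and ±1 hidden units; the class name, parameterization, and helper methods are ours, not the authors'): the positive phase maximizes over z with the true label clamped, the negative phase maximizes jointly over (y, z), and test predictions are accumulated as votes over iterations as in Eqn. 11.

import numpy as np

class ConditionalHerder:
    # Illustrative conditional herding with +/-1 hidden units; labels are integers 0..K-1.
    def __init__(self, dim_x, n_classes, n_hidden, eta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.normal(size=(n_hidden, dim_x))      # hidden-input weights
        self.V = 0.01 * rng.normal(size=(n_hidden, n_classes))  # hidden-label weights
        self.b = np.zeros(n_classes)                            # label biases
        self.c = np.zeros(n_hidden)                             # hidden biases
        self.eta = eta
        self.K = n_classes

    def _best_z(self, x, y):
        # argmax_z w^T phi(x, y, z) for fixed (x, y): elementwise sign of the hidden input
        a = self.W @ x + self.V[:, y] + self.c
        return np.where(a >= 0, 1.0, -1.0)

    def _best_yz(self, x):
        # joint argmax over (y, z): enumerate labels, maximize z for each, keep the best score
        best = None
        for y in range(self.K):
            z = self._best_z(x, y)
            score = z @ (self.W @ x + self.V[:, y] + self.c) + self.b[y]
            if best is None or score > best[0]:
                best = (score, y, z)
        return best[1], best[2]

    def update(self, X_batch, y_batch):
        # w_{t+1} = w_t + eta/|D_t| * sum_i [phi(x_i, y_i, z_i) - phi(x_i, y_i*, z_i*)]
        dW, dV = np.zeros_like(self.W), np.zeros_like(self.V)
        db, dc = np.zeros_like(self.b), np.zeros_like(self.c)
        for x, y in zip(X_batch, y_batch):
            z_pos = self._best_z(x, y)        # positive phase: true label clamped
            y_neg, z_neg = self._best_yz(x)   # negative phase: free maximization
            dW += np.outer(z_pos, x) - np.outer(z_neg, x)
            dV[:, y] += z_pos
            dV[:, y_neg] -= z_neg
            db[y] += 1.0
            db[y_neg] -= 1.0
            dc += z_pos - z_neg
        s = self.eta / len(X_batch)
        self.W += s * dW; self.V += s * dV; self.b += s * db; self.c += s * dc

    def predict_votes(self, X_test):
        # one vote per test case under the current weights; callers average votes over iterations
        return np.array([self._best_yz(x)[0] for x in X_test])

Training would interleave update() on random mini-batches with predict_votes() on the test set; the final label for each test case is the majority over the per-iteration votes, mirroring the averaging in Eqn. 11.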
60 2.5 Zero Temperature Limit of Discriminative MRF Learning Regular herding can be understood as gradient descent on the zero temperature limit of an MRF model. [sent-167, score-0.763]
61 Analogously, CH can be viewed as constant step size gradient updates on the zero temperature limit of discriminative MRFs (see [10] for the corresponding RBM model). [sent-169, score-0.18]
62 p(y | x) = Σ_z exp[w^T φ(y, z, x)] / Σ_{y', z'} exp[w^T φ(y', z', x)]. (13) Similar to herding [14], conditional herding introduces a temperature by replacing w by w/T and takes the limit T → 0 of T ℓ_T, where ℓ = Σ_i log p(y_i | x_i). [sent-171, score-1.487]
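Spelling out the limiting argument (our own rendering of the standard zero-temperature calculation; the form of Eqn. 13 is reconstructed from the fragments above):

p_T(y \mid x) \;=\; \frac{\sum_{z} \exp\!\big[\tfrac{1}{T}\, w^{\top}\phi(y, z, x)\big]}
                         {\sum_{y', z'} \exp\!\big[\tfrac{1}{T}\, w^{\top}\phi(y', z', x)\big]},
\qquad
\lim_{T \to 0}\, T \log \sum_{z} e^{\,w^{\top}\phi(y, z, x)/T} \;=\; \max_{z}\, w^{\top}\phi(y, z, x),

so that

\lim_{T \to 0} T\,\ell_T \;=\; \sum_{i} \Big[\, \max_{z} w^{\top}\phi(y_i, z, x_i) \;-\; \max_{y', z'} w^{\top}\phi(y', z', x_i) \,\Big].

A constant-step (sub)gradient step on this limit is exactly the conditional herding update: features at the clamped maximization minus features at the free maximization.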
63 3 Experiments We studied the behavior of conditional herding on two artificial and four real-world data sets, comparing its performance to that of the voted perceptron [5] and that of discriminative RBMs [10]. [sent-172, score-1.167]
64 We studied conditional herding in the discriminative RBM architecture illustrated in Figure 1. [sent-176, score-0.783]
65 3.1 Artificial Data To investigate the characteristics of VP, dRBMs and CH, we used the techniques to construct decision boundaries on two artificial data sets: (1) the banana data set; and (2) the Lithuanian data set. [sent-199, score-0.119]
66 The decision boundary for VP and CH is located where the sign of the prediction y_tst^* changes. [sent-201, score-0.079]
67 The dRBMs also had 20 hidden units and were trained by running conjugate gradients until convergence. [sent-203, score-0.104]
68 The weights of the dRBMs were initialized by sampling from a Gaussian distribution with a variance of 10^{-4}. [sent-204, score-0.091]
69 The results on the banana data set illustrate the representational advantages of hidden units. [sent-210, score-0.147]
70 Since VP selects data points at random to update the weights, on the banana data set the weight vector of VP tends to oscillate back and forth, yielding a nearly linear decision boundary. [sent-211, score-0.14]
71 In contrast, for CH the simple predictor in the top layer can regress onto M = 20 hidden features. [sent-213, score-0.086]
72 3.2 Real-World Data In addition to the experiments on synthetic data, we also performed experiments on four real-world data sets, namely: (1) the USPS data set, (2) the MNIST data set, (3) the UCI Pendigits data set, and (4) the 20-Newsgroups data set. [sent-216, score-0.085]
73 The MNIST data set contains 70,000 28 × 28 grayscale images of digits, with a fixed division into 60,000 training and 10,000 test instances. [sent-218, score-0.088]
74 We adopted a 1-of-K encoding, where if y_i is the label for data point x_i, then y_i = {y_{i,1}, ..., y_{i,K}}. [sent-228, score-0.095]
75 This partial maximization is not a problem as long as the ensuing configuration satisfies w_t^T v_t ≤ 0. [sent-242, score-0.276]
76 (i.e., we employed early stopping) or until the negative conditional log-likelihood on the training data stopped decreasing. [sent-247, score-0.104]
77 The parameters of the conditional herders were initialized by sampling from a Gaussian distribution. [sent-253, score-0.12]
78 However, since the dimension of the data is typically much greater than the number of classes, the dynamics of the conditional herding system will be largely driven by W. [sent-256, score-0.741]
79 In addition, during the early stages of herding, we adapted the parameter update for the bias on the hidden units θ in such a way that the marginal distribution over the hidden units was nearly uniform. [sent-265, score-0.246]
80 This has the advantage that it encourages high entropy in the hidden units, leading to more useful dynamics of the system. [sent-266, score-0.092]
81 In practice, we update θ as θ_{t+1} = θ_t + (η/|D_t|) Σ_{i_t} ((1 − λ) z̄_{i_t} − z_{i_t}^*), where z̄_{i_t} is the batch mean. [sent-267, score-0.289]
82 λ is initialized to 1 and we gradually halve its value every 500 updates, slowly moving from an entropy-encouraging update to the standard update for the biases of the hidden units. [sent-268, score-0.163]
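A small sketch of this schedule (illustrative; the form of the θ update above is our reading of a garbled equation, and the variable names are ours):

import numpy as np

def update_hidden_bias(theta, z_pos_batch, z_neg_batch, lam, eta):
    # theta_{t+1} = theta_t + eta * ((1 - lam) * z_bar - mean_i z_i*), with z_bar the batch
    # mean of the positive-phase hidden states; lam = 1 pushes the hidden marginals toward
    # uniform, while lam -> 0 recovers the standard bias update z_bar - z*.
    z_bar = z_pos_batch.mean(axis=0)
    return theta + eta * ((1.0 - lam) * z_bar - z_neg_batch.mean(axis=0))

lam, n_updates = 1.0, 0
# inside the training loop:
#   theta = update_hidden_bias(theta, z_pos, z_neg, lam, eta)
#   n_updates += 1
#   if n_updates % 500 == 0:
#       lam *= 0.5        # halve lambda every 500 updates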
83 The results reveal that the addition of hidden units to the voted perceptron leads to significant improvements in terms of generalization error. [sent-274, score-0.471]
84 Furthermore, the results of our experiments indicate that conditional herding performs on par with discriminative RBMs on the MNIST and USPS data sets and better on the 20 Newsgroups data set. [sent-275, score-0.834]
85 The 20 Newsgroups data is high dimensional and sparse, and both VP and CH appear to perform well in this regime. Footnote: Local maxima can also be found by iterating over y_{tst}^{*,k} and z_{tst,j}^{*,k}, but the proposed procedure is more efficient. [sent-276, score-0.094]
86 We use a fixed order of the mini-batches, so that if there are N data cases and the batch size is K, a training error of 0 for N/K consecutive iterations implies an error of 0 on the whole training set. [sent-277, score-0.08]
[Table: generalization errors of VP, discriminative RBMs, and conditional herding under the one-versus-all procedure (100 and 200 hidden units) and the joint procedure (50, 100, and 500 hidden units) on MNIST and the other data sets; the numerical entries were not recoverable.]
89 Techniques to promote sparsity in the hidden layer when training dRBMs exist (see [10]), but we did not investigate them here. [sent-343, score-0.087]
90 This is particularly evident in the low-dimensional UCI Pendigits data set, where the dRBMs start to badly overfit with 500 hidden units, while the test error for CH remains level. [sent-345, score-0.082]
91 4 Concluding Remarks The main contribution of this paper is to expose a relationship between the PCT and herding algorithms. [sent-347, score-0.677]
92 This has allowed us to strengthen certain results for herding - namely, theoretically validating herding with mini-batches and partial optimization. [sent-348, score-1.377]
93 It also directly leads to the insight that non-convergent VPs and herding match moments between data and generated predictions at a rate much faster than random sampling (O(1/T) vs. O(1/√T)). [sent-349, score-0.801]
94 From these insights, we have proposed a new conditional herding algorithm that is the zero-temperature limit of dRBMs [10]. [sent-351, score-0.751]
95 The herding perspective provides a new way of looking at learning as a dynamical system. [sent-352, score-0.699]
96 In fact, the PCT precisely specifies the conditions that need to hold for a herding system (in batch mode) to be a piecewise isometry [7]. [sent-353, score-0.76]
97 A piecewise isometry is a weakly chaotic dynamical system that divides parameter space into cells and applies a different isometry in each cell. [sent-354, score-0.171]
98 For herding, the isometry is given by a translation and the cells are labeled by the states {x∗ , y∗ , z, z∗ }, whichever combination applies. [sent-355, score-0.097]
99 Many interesting results about piecewise isometries have been proven in the mathematics literature such as the fact that the sequence of sampled states grows algebraically with T and not exponentially as in systems with random or chaotic components [6]. [sent-357, score-0.098]
100 Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. [sent-383, score-0.339]
wordName wordTfidf (topN-words)
[('herding', 0.677), ('vp', 0.253), ('perceptron', 0.252), ('pct', 0.243), ('ch', 0.209), ('drbms', 0.204), ('xit', 0.204), ('yit', 0.192), ('wt', 0.18), ('zit', 0.116), ('voted', 0.115), ('pendigits', 0.077), ('vt', 0.072), ('cycling', 0.07), ('hidden', 0.065), ('usps', 0.06), ('discriminative', 0.059), ('temperature', 0.059), ('ytst', 0.058), ('xxx', 0.051), ('mrf', 0.05), ('newsgroups', 0.05), ('dt', 0.048), ('moments', 0.048), ('conditional', 0.047), ('weights', 0.047), ('banana', 0.047), ('inseparable', 0.047), ('mnist', 0.043), ('states', 0.042), ('rbm', 0.042), ('rbms', 0.04), ('units', 0.039), ('lithuanian', 0.038), ('update', 0.038), ('isometry', 0.038), ('predictions', 0.037), ('updates', 0.035), ('converge', 0.033), ('rmax', 0.033), ('division', 0.031), ('chaotic', 0.03), ('perceptrons', 0.03), ('uci', 0.03), ('yi', 0.029), ('attributes', 0.029), ('herders', 0.029), ('minsky', 0.029), ('rosenblatt', 0.029), ('tst', 0.029), ('typeset', 0.029), ('entropy', 0.027), ('limit', 0.027), ('xy', 0.027), ('piecewise', 0.026), ('attractor', 0.025), ('maximization', 0.024), ('chaos', 0.023), ('strengthen', 0.023), ('picked', 0.023), ('training', 0.022), ('dynamical', 0.022), ('sampling', 0.022), ('initialized', 0.022), ('decision', 0.021), ('quebec', 0.021), ('regress', 0.021), ('boltzmann', 0.021), ('label', 0.02), ('features', 0.02), ('dim', 0.019), ('averages', 0.019), ('procedure', 0.019), ('batch', 0.019), ('boundary', 0.018), ('irrespective', 0.018), ('cardinality', 0.018), ('analogously', 0.018), ('representational', 0.018), ('stopped', 0.018), ('grayscale', 0.018), ('maximizing', 0.018), ('theorem', 0.017), ('par', 0.017), ('data', 0.017), ('energy', 0.017), ('cells', 0.017), ('arti', 0.017), ('arg', 0.017), ('montreal', 0.016), ('ml', 0.016), ('started', 0.016), ('cial', 0.016), ('connection', 0.016), ('zt', 0.016), ('convergence', 0.016), ('unsupervised', 0.016), ('generates', 0.016), ('updated', 0.016), ('prevents', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.09446083 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.076715514 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.065957621 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
Author: Shinichi Nakajima, Masashi Sugiyama, Ryota Tomioka
Abstract: Bayesian methods of matrix factorization (MF) have been actively explored recently as promising alternatives to classical singular value decomposition. In this paper, we show that, despite the fact that the optimization problem is non-convex, the global optimal solution of variational Bayesian (VB) MF can be computed analytically by solving a quartic equation. This is highly advantageous over a popular VBMF algorithm based on iterated conditional modes since it can only find a local optimal solution after iterations. We further show that the global optimal solution of empirical VBMF (hyperparameters are also learned from data) can also be analytically computed. We illustrate the usefulness of our results through experiments.
5 0.063582242 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptronlike learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem. 1
6 0.055564415 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
7 0.054739162 235 nips-2010-Self-Paced Learning for Latent Variable Models
8 0.054429892 268 nips-2010-The Neural Costs of Optimal Control
9 0.052298598 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
10 0.050654408 257 nips-2010-Structured Determinantal Point Processes
11 0.050341628 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
12 0.050222259 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
13 0.049635291 212 nips-2010-Predictive State Temporal Difference Learning
14 0.044466376 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
15 0.042277075 103 nips-2010-Generating more realistic images using gated MRF's
16 0.041178565 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
17 0.039939135 202 nips-2010-Parallelized Stochastic Gradient Descent
18 0.039898928 43 nips-2010-Bootstrapping Apprenticeship Learning
19 0.039677281 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
20 0.039165199 99 nips-2010-Gated Softmax Classification
topicId topicWeight
[(0, 0.12), (1, -0.003), (2, 0.002), (3, -0.011), (4, -0.011), (5, 0.023), (6, -0.012), (7, 0.02), (8, -0.05), (9, 0.025), (10, -0.004), (11, -0.046), (12, 0.114), (13, -0.025), (14, -0.053), (15, 0.004), (16, -0.032), (17, -0.044), (18, -0.007), (19, -0.028), (20, 0.002), (21, 0.067), (22, 0.102), (23, 0.045), (24, -0.039), (25, 0.007), (26, 0.06), (27, 0.002), (28, -0.035), (29, -0.043), (30, 0.013), (31, -0.03), (32, 0.025), (33, 0.083), (34, -0.027), (35, 0.051), (36, 0.037), (37, -0.003), (38, 0.061), (39, -0.019), (40, -0.061), (41, -0.007), (42, 0.07), (43, -0.025), (44, -0.052), (45, 0.023), (46, -0.037), (47, -0.028), (48, -0.061), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.88544041 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.74439478 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
3 0.66518962 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.63791621 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptronlike learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem. 1
5 0.58799869 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
6 0.55744666 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
7 0.55363202 235 nips-2010-Self-Paced Learning for Latent Variable Models
8 0.54442638 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
9 0.51114213 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
10 0.47184461 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
11 0.46074 212 nips-2010-Predictive State Temporal Difference Learning
12 0.44405648 99 nips-2010-Gated Softmax Classification
13 0.44215459 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
14 0.41669816 271 nips-2010-Tiled convolutional neural networks
15 0.40254003 224 nips-2010-Regularized estimation of image statistics by Score Matching
16 0.40158114 202 nips-2010-Parallelized Stochastic Gradient Descent
17 0.40148988 103 nips-2010-Generating more realistic images using gated MRF's
18 0.39985049 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
19 0.39594749 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
20 0.39592847 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
topicId topicWeight
[(13, 0.031), (17, 0.014), (27, 0.052), (30, 0.044), (35, 0.02), (45, 0.205), (50, 0.038), (52, 0.02), (59, 0.013), (60, 0.033), (61, 0.314), (77, 0.059), (78, 0.018), (90, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.74998903 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
2 0.74175668 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
Author: Hugo Larochelle, Geoffrey E. Hinton
Abstract: We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the “glimpse” at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images. 1
3 0.59665859 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
4 0.59410632 282 nips-2010-Variable margin losses for classifier design
Author: Hamed Masnadi-shirazi, Nuno Vasconcelos
Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1
5 0.59076881 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
6 0.59035766 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
7 0.58953351 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
8 0.58943027 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
9 0.58916783 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
10 0.58871609 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.58809656 290 nips-2010-t-logistic regression
12 0.58798343 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
13 0.58792424 155 nips-2010-Learning the context of a category
14 0.58777392 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.58768207 243 nips-2010-Smoothness, Low Noise and Fast Rates
16 0.58766258 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
17 0.58753455 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
18 0.5874424 224 nips-2010-Regularized estimation of image statistics by Score Matching
19 0.58741689 1 nips-2010-(RF)^2 -- Random Forest Random Field
20 0.58710045 148 nips-2010-Learning Networks of Stochastic Differential Equations