nips nips2010 nips2010-158 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. [sent-8, score-0.472]
2 In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. [sent-9, score-0.403]
3 By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. [sent-10, score-0.105]
4 We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. [sent-11, score-0.129]
5 The success of an online learning algorithm depends critically upon a tradeoff between fitting the current data example and regularizing the solution based upon some memory of prior hypotheses. [sent-14, score-0.192]
6 In this work, we show how to incorporate regularization in an online learning algorithm by constraining the motion of weight vectors in the hypothesis space. [sent-15, score-0.324]
7 In particular, we demonstrate how to use simple constraints on the velocity flow field of Gaussian-distributed weight vectors to regularize online learning algorithms. [sent-16, score-0.418]
8 This process results in herding the motion of the Gaussian weight vectors to yield algorithms that are particularly robust to noisy input data. [sent-17, score-0.264]
9 Recent work has demonstrated how parametric information about the weight vector distribution can be used to guide online learning [1]. [sent-18, score-0.23]
10 CW learning has formal guarantees in the mistake-bound model [7]; however, it can over-fit in certain situations due to its aggressive update rules based upon a separable data assumption. [sent-20, score-0.158]
11 A newer online algorithm, Adaptive Regularization of Weights (AROW), relaxes this separable assumption, resulting in an adaptive regularization for each training example based upon its current confidence [8]. [sent-21, score-0.182]
12 This regularization comes in the form of minimizing a bound on the Kullback-Leibler divergence between Gaussian distributed weight vectors. [sent-22, score-0.131]
13 Here we take a different microscopic view of the online learning process. [sent-23, score-0.115]
14 Instead of reweighting and diffusing the weight vectors in hypothesis space, we model them as flowing under a velocity field given by each data observation. [sent-24, score-0.38]
15 We show that for linear velocity fields, a Gaussian weight vector distribution will maintain its Gaussianity, with corresponding updates for its mean and covariance. [sent-25, score-0.389]
16 The advantage of this approach is that we can incorporate different constraints and regularization on the resulting velocity fields to yield more robust online learning algorithms. [sent-26, score-0.304]
17 These algorithms maintain a Gaussian distribution over possible weight vectors in hypothesis space. [sent-28, score-0.265]
18 In traditional stochastic filtering, weight vectors are first reweighted according to how accurately they describe the current data observation. [sent-29, score-0.18]
19 When the reweighting factor depends linearly upon the weight vector in combination with a Gaussian diffusion model, a weight vector distribution will maintain its Gaussianity under such a transformation. [sent-31, score-0.342]
20 The Kalman filter equations then yield the resulting change in the mean and covariance of the new distribution. [sent-32, score-0.096]
21 Our approach, on the other hand, updates the weight vector distribution with each observation by herding the weight vectors using a velocity field. [sent-33, score-0.573]
22 2 Background. Consider the following online binary classification problem, which proceeds in rounds. [sent-36, score-0.117]
23 On the $i$th round the online algorithm receives an input $x_i \in \mathbb{R}^d$ and applies its current prediction rule to make a prediction $\hat{y}_i \in \mathcal{Y}$, for the binary set $\mathcal{Y} = \{-1, +1\}$. [sent-37, score-0.409]
24 It then receives the correct label $y_i \in \mathcal{Y}$ and suffers a loss $\ell(y_i, \hat{y}_i)$. [sent-38, score-0.282]
25 At this point, the algorithm updates its prediction rule with the pair $(x_i, y_i)$ and proceeds to the next round. [sent-39, score-0.166]
26 A summary of online algorithms can be found in [2]. [sent-40, score-0.111]
27 An initial description for possible online algorithms is provided by the family of passive-aggressive (PA) algorithms for linear classifiers [5]. (Figure 1: (a) Traditional stochastic filtering: weight vectors in the hypothesis space are reweighted according to the new observation and undergo diffusion, resulting in a new weight vector distribution. (b) Herding via a velocity field: weight vectors flow in hypothesis space according to a constrained velocity field, resulting in a new weight vector distribution.) [sent-41, score-0.138]
28 The weight vector $w_i$ at each round is updated with the current input $x_i$ and label $y_i$, by optimizing: [sent-42, score-0.832]
29 $w_{i+1} = \arg\min_w \frac{1}{2}\|w - w_i\|^2 + C\,\ell((x_i, y_i), w)$, (1) [sent-43, score-0.85]
30 where $\ell((x_i, y_i), w)$ is the squared- or hinge-loss function and $C > 0$ controls the tradeoff between optimizing the current loss and being close to the old weight vector. [sent-44, score-0.199]
31 Eq. (1) can also be expressed in dual form, yielding the PA-II update equation: $w_{i+1} = w_i + \alpha_i y_i x_i$, $\alpha_i = \max\{0,\, 1 - y_i (w_i \cdot x_i)\} / \left(\|x_i\|^2 + 1/C\right)$. (2) [sent-46, score-0.708]
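To make the PA-II update concrete, here is a minimal NumPy sketch of one step of Eq. (2); the function name and the hinge-loss choice are our own illustration, not code from the paper.

```python
import numpy as np

def pa2_update(w, x, y, C):
    """One PA-II step (Eq. (2)) with the hinge loss."""
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))   # hinge loss of the current weights
    alpha = loss / (float(np.dot(x, x)) + 1.0 / C)   # dual step size
    return w + alpha * y * x
```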
32 Online confidence-weighted (CW) learning [9, 7] generalized the PA update principle to multivariate Gaussian distributions over the weight vectors, $\mathcal{N}(\mu, \Sigma)$, for binary classification. [sent-50, score-0.286]
33 The mean µ ∈ Rd contains the current estimate for the best weight vector, whereas the Gaussian covariance matrix Σ ∈ Rd×d captures the confidence in this estimate. [sent-51, score-0.201]
34 At each round, the new mean and covariance of the weight vector distribution are chosen by optimizing: $(\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu,\Sigma} D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(\mu_i, \Sigma_i)\right)$ such that $\Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\left[y_i (w \cdot x_i) \ge 0\right] \ge \eta$. [sent-53, score-0.335]
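The probabilistic constraint above has a closed form because the margin $y_i (w \cdot x_i)$ is a one-dimensional Gaussian under $w \sim \mathcal{N}(\mu, \Sigma)$. A sketch of the feasibility check (assuming SciPy for the normal quantile; the helper name is ours):

```python
import numpy as np
from scipy.stats import norm

def cw_constraint_holds(mu, Sigma, x, y, eta):
    """Pr[y (w.x) >= 0] >= eta  <=>  y (mu.x) >= Phi^{-1}(eta) * sqrt(x' Sigma x)."""
    mean_margin = y * float(mu @ x)
    std_margin = float(np.sqrt(x @ Sigma @ x))
    return mean_margin >= norm.ppf(eta) * std_margin
```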
35 In this work, we take the view that the Gaussian distribution over weight vectors is modified by herding according to a velocity flow field. [sent-58, score-0.432]
36 First we show that any change in a Gaussian distributed random variable can be related to a linear velocity field. Theorem 1: Assume that the random variable (r.v.) $W$ is Gaussian, $W \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma \succ 0$; then any Gaussian r.v. $U \sim \mathcal{N}(\mu', \Sigma')$ can be realized as a linear transformation $U = AW + b$ for a suitable matrix $A$ and vector $b$. [sent-59, score-0.174]
37 Thus, the transformation U = AW + b can be viewed as a velocity flow resulting in a change of the underlying Gaussian distribution of weight vectors. [sent-75, score-0.34]
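A quick numerical illustration of this fact (our own sketch, assuming NumPy; the specific numbers are arbitrary): pushing Gaussian samples through $U = AW + b$ yields a Gaussian with mean $A\mu + b$ and covariance $A \Sigma A^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[0.9, 0.1], [0.0, 0.8]])
b = np.array([0.3, -0.1])

W = rng.multivariate_normal(mu, Sigma, size=200_000)
U = W @ A.T + b  # the linear velocity flow U = A W + b

print(np.allclose(U.mean(axis=0), A @ mu + b, atol=2e-2))    # mean maps to A mu + b
print(np.allclose(np.cov(U.T), A @ Sigma @ A.T, atol=2e-2))  # covariance maps to A Sigma A'
```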
38 On the other hand, this microscopic view of the underlying velocity field contains more information than merely tracking the mean and covariance of the Gaussian. [sent-76, score-0.282]
39 This can be seen since many different velocity fields result in the same overall mean and covariance. [sent-77, score-0.193]
40 In the next section, we show how we can define new online learning algorithms by considering various constraints on the overall velocity field. [sent-78, score-0.285]
41 These new algorithms optimize a loss function by constraining the parameters of this velocity field. [sent-79, score-0.282]
42 3 Algorithms. Our algorithms maintain a distribution, or infinite collection, of weight vectors $\{W_i\}$ for each round $i$. [sent-80, score-0.258]
43 Given an instance xi it outputs a prediction based upon the majority of these weight vectors. [sent-81, score-0.242]
44 Each weight vector $W_i$ is then individually updated to $W_{i+1}$ according to a generalized PA rule, $W_{i+1} = \arg\min_W C_i(W)$ where $C_i(W) = \frac{1}{2}(W - W_i)^\top \Sigma_i^{-1} (W - W_i) + C\,\ell((x_i, y_i), W)$, (3) and $\Sigma_i$ is a PSD matrix that will be defined shortly. [sent-82, score-0.25]
45 Clearly, it is impossible to maintain and update an infinite set of vectors, and thus we employ a parametric density $f_i(W_i; \theta_i)$ to weight each vector. [sent-84, score-0.252]
46 We thus employ a Gaussian parametric density with $W \sim \mathcal{N}(\mu_i, \Sigma_i)$, and update the distribution collectively, $W_{i+1} = A_i W_i + b_i$, where $A_i \in \mathbb{R}^{d \times d}$ represents stretching and rotating the distribution, and $b_i \in \mathbb{R}^d$ is an overall translation. [sent-86, score-0.181]
47 3.1 Expectation of the Loss Function. We consider the expectation $\mathbb{E}_{W_i \sim \mathcal{N}(\mu_i, \Sigma_i)}\!\left[\ell\left((x_i, y_i), A W_i + b\right)\right]$. (6) In general, there is no closed-form solution for this expectation, and instead we seek an appropriate approximation or bound. [sent-97, score-0.102]
48 For simplicity we consider binary classification, denote the signed margin by $M = y_i\,(W^\top x)$, and write $\ell((x, y), W) = \ell(M)$. [sent-98, score-0.172]
49 If the loss is relatively concentrated about its mean, then the loss of the expected weight-vector $\mu$ is a good proxy for Eq. (6). [sent-99, score-0.102]
50 A loss function is uniformly $\lambda$-bounded in expectation with respect to $F$ if there exists $\lambda > 0$ such that for all $\theta \in \Theta$ we have $\mathbb{E}[\ell(M)] \le \ell(\mathbb{E}[M]) + \frac{\lambda}{2}\,\mathbb{E}\!\left[(M - \mathbb{E}[M])^2\right]$, where all expectations are with respect to $M \sim f(M; \theta)$. [sent-102, score-0.119]
51 For Gaussian distributions we have $\Theta = \{\mu, \Sigma\}$, and a loss function is uniformly $\lambda$-bounded in expectation if there exists a $\lambda$ such that $\mathbb{E}_{\mathcal{N}(\mu,\Sigma)}\!\left[\ell((x, y), W)\right] \le \ell\!\left((x, y), \mathbb{E}[W]\right) + \frac{\lambda}{2}\, x^\top \Sigma x$. [sent-104, score-0.119]
52 For example, the squared loss $\frac{1}{2}\,(y - W^\top x)^2$ is uniformly ($\lambda=$)1-bounded in expectation since its second derivative is bounded by unity. [sent-109, score-0.141]
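For a quadratic loss the bound in fact holds with equality, since $\mathbb{E}[\ell(M)] = \ell(\mathbb{E}[M]) + \frac{1}{2}\,\ell''\,\mathrm{Var}(M)$; a small Monte Carlo check of this (our own illustration, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
m_mean, m_var = 0.4, 2.3
M = rng.normal(m_mean, np.sqrt(m_var), size=1_000_000)

sq_loss = lambda m: 0.5 * (1.0 - m) ** 2   # squared loss written in terms of the margin M
lhs = sq_loss(M).mean()                    # E[l(M)]
rhs = sq_loss(m_mean) + 0.5 * m_var        # l(E[M]) + (lambda/2) Var(M) with lambda = 1
print(abs(lhs - rhs) < 1e-2)
```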
53 Analogous to the PA-II update (Eq. (2)), $\mu_{i+1} = \mu_i + \alpha_i y_i x_i$, $\alpha_i = \max\{0,\, 1 - y_i (\mu_i \cdot x_i)\} / \left(x_i^\top \Sigma_i x_i + 1/C\right)$. (9) We now focus on minimizing the second term (Eq. (8)). [sent-122, score-0.628]
54 Eq. (8) then becomes $\frac{1}{2}\,\mathrm{Tr}\!\left[(A - I)^\top (A - I)\right] + \frac{C}{2}\, x_i^\top A \Sigma_i A^\top x_i$. [sent-128, score-0.212]
55 Denote the $r$th diagonal element of $\Sigma_i$ by $(\Sigma_i)_{r,r}$ and the $r$th diagonal element of $A$ by $(A)_{r,r}$. [sent-129, score-0.214]
56 Expanding Eq. (8) we get $\frac{1}{2}\left[\mathrm{Tr}\!\left(A^\top \Sigma_i^{-1} A \Sigma_i\right) - \mathrm{Tr}\!\left(\Sigma_i^{-1} A \Sigma_i\right) + \mathrm{Tr}\!\left(\Sigma_i^{-1} \Sigma_i\right) - \mathrm{Tr}\!\left(A^\top \Sigma_i^{-1} \Sigma_i\right)\right] + \frac{C}{2}\, x_i^\top A \Sigma_i A^\top x_i$. [sent-135, score-0.242]
57 Setting the derivative of the last equation with respect to $A$ to zero we get $\Sigma_i^{-1} A \Sigma_i - I + C\, x_i x_i^\top A \Sigma_i = 0$. [sent-136, score-0.128]
58 We multiply both sides by $\Sigma_i^{-1}$ (on the right) and combine terms, $\left(\Sigma_i^{-1} + C\, x_i x_i^\top\right) A = \Sigma_i^{-1}$, yielding [sent-137, score-0.106]
59 $\Sigma_{i+1} = A \Sigma_i A^\top$, so $\Sigma_{i+1}^{-1} = \left(A \Sigma_i A^\top\right)^{-1} = \Sigma_i^{-1} + \left(2C + C^2\, x_i^\top \Sigma_i x_i\right) x_i x_i^\top$. (12) Finally, using the Woodbury identity [12] to compute the updated covariance matrix, [sent-142, score-0.526]
60 $\Sigma_{i+1} = \Sigma_i - \Sigma_i x_i x_i^\top \Sigma_i\, \frac{C^2\, x_i^\top \Sigma_i x_i + 2C}{\left(1 + C\, x_i^\top \Sigma_i x_i\right)^2}$. (13) [sent-143, score-0.53]
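Putting Eqs. (9) and (13) together, one full-covariance NHERD step looks as follows; this is a minimal sketch assuming NumPy, and the function name is ours.

```python
import numpy as np

def nherd_full_update(mu, Sigma, x, y, C):
    """One NHERD step: mean from Eq. (9), full covariance from Eq. (13)."""
    v = float(x @ Sigma @ x)                                   # margin variance x' Sigma x
    alpha = max(0.0, 1.0 - y * float(mu @ x)) / (v + 1.0 / C)
    mu_next = mu + alpha * y * x                               # Eq. (9)
    Sx = Sigma @ x
    shrink = (C * C * v + 2.0 * C) / (1.0 + C * v) ** 2
    Sigma_next = Sigma - np.outer(Sx, Sx) * shrink             # Eq. (13)
    return mu_next, Sigma_next
```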
61 Figure 2: Top and center panels: an illustration of the algorithm’s update (see text). [sent-158, score-0.093]
62 Bottom panel: an illustration of a single update for the five algorithms. [sent-159, score-0.093]
63 The cyan ellipse represents the weight vector distribution before the example is observed. [sent-160, score-0.153]
64 The red square represents the mean of the updated distribution, and the five ellipses represent the covariance of each algorithm after observing the data example $((1, 2), +1)$. [sent-161, score-0.248]
65 The ordering of the area of the five ellipses correlates well with the performance of the algorithms. [sent-162, score-0.106]
66 Both our update (Eq. (12)) and AROW's update (Eq. (8) of [8]) have the same structure of adding $\gamma_i\, x_i x_i^\top$ to $\Sigma_i^{-1}$. [sent-165, score-0.212]
67 AROW sets $\gamma_i = C$ while our update sets $\gamma_i = 2C + C^2\, x_i^\top \Sigma_i x_i$. [sent-166, score-0.305]
68 In this aspect, the NHERD update is more aggressive as it increases the eigenvalues of $\Sigma_i^{-1}$ at a faster rate. [sent-167, score-0.126]
69 Furthermore, its update rate is not constant and depends linearly on the current variance of the margin, $x_i^\top \Sigma_i x_i$; the higher the variance, the faster the eigenvalues of $\Sigma_i$ decrease. [sent-168, score-0.362]
70 Lastly, we note that the update matrix $A_i$ can be written as a product of two terms, one depending on the covariance matrix before the update and the other on the covariance matrix after an AROW update. [sent-169, score-0.244]
71 Formally, let $\tilde{\Sigma}_{i+1}$ be the covariance matrix after updating with the AROW rule, that is, $\tilde{\Sigma}_{i+1}^{-1} = \Sigma_i^{-1} + C\, x_i x_i^\top$ (see Eq. (8) of [8]); then $A_i = \tilde{\Sigma}_{i+1}\, \Sigma_i^{-1}$. [sent-170, score-0.208]
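This factorization is easy to check numerically (our own sketch, assuming NumPy; the matrices are arbitrary): solving $(\Sigma_i^{-1} + C x_i x_i^\top) A = \Sigma_i^{-1}$ for $A$ agrees with $\tilde{\Sigma}_{i+1} \Sigma_i^{-1}$.

```python
import numpy as np

Sigma = np.array([[1.5, 0.3], [0.3, 0.8]])
x, C = np.array([0.7, -1.2]), 0.5

Sigma_inv = np.linalg.inv(Sigma)
A = np.linalg.solve(Sigma_inv + C * np.outer(x, x), Sigma_inv)   # (Sigma^-1 + C x x') A = Sigma^-1
Sigma_arow = np.linalg.inv(Sigma_inv + C * np.outer(x, x))       # covariance after an AROW update
print(np.allclose(A, Sigma_arow @ Sigma_inv))                    # A_i = Sigma~_{i+1} Sigma_i^{-1}
```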
72 The diagonal updates of AROW and NHERD share similar properties. [sent-174, score-0.098]
73 [8] did not specify the update for this case, yet using a derivation similar to that of Sec. 3.1 [sent-175, score-0.093]
74 we get that the AROW update for diagonal matrices is $(\tilde{\Sigma}_{i+1})_{r,r} = (\Sigma_i)_{r,r} / \left(1 + C\, x_{i,r}^2\, (\Sigma_i)_{r,r}\right)$, while the corresponding diagonal element of Eq. (13) is $(\Sigma_{i+1})_{r,r} = (\Sigma_i)_{r,r} / \left(1 + C\, x_{i,r}^2\, (\Sigma_i)_{r,r}\right)^2$. [sent-177, score-0.184]
75 To conclude, the update of NHERD for diagonal covariance matrices is also more aggressive than AROW, as it increases the (diagonal) elements of its inverse faster than AROW. [sent-180, score-0.245]
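A per-coordinate sketch contrasting the two diagonal rules (our own naming; assumes NumPy):

```python
import numpy as np

def diagonal_updates(sigma, x, C):
    """Per-coordinate shrinkage of a diagonal covariance: AROW vs. NHERD."""
    s = C * x**2 * sigma
    arow = sigma / (1.0 + s)           # AROW diagonal rule
    nherd = sigma / (1.0 + s) ** 2     # diagonal element of Eq. (13)
    return arow, nherd

arow, nherd = diagonal_updates(np.ones(3), np.array([0.5, 1.0, 2.0]), C=1.0)
print(np.all(nherd <= arow))           # NHERD shrinks every coordinate at least as fast
```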
76 The Gaussian distribution before the update is isotropic with mean µ = (0, 0) and Σ = I2 . [sent-183, score-0.133]
77 The plot illustrates the update of the mean vector (red square), weight vectors with unit norm $\|w\| = 1$ (blue), and weight vectors with norm $\|w\| = 2$ (green). [sent-185, score-0.432]
78 The ellipses with dashed lines illustrate the weight vectors before the update, and ellipses with solid lines illustrate the weight vectors after the update. [sent-186, score-0.231]
79 The arrows connecting weight vectors from the dashed ellipses to the solid ellipses illustrate the update of individual weight vectors. [sent-192, score-0.106]
80 (Algorithm NHERD. Parameter: $C > 0$. Initialize: $\mu_1 = 0$, $\Sigma_1 = I$. For $i = 1, \ldots, m$: set $\mu_{i+1} = \mu_i + \alpha_i y_i x_i$ with $\alpha_i = \max\{0,\, 1 - y_i(\mu_i \cdot x_i)\} / (x_i^\top \Sigma_i x_i + 1/C)$; Full Covariance: set $\Sigma_{i+1} = \Sigma_i - \Sigma_i x_i x_i^\top \Sigma_i\, \frac{C^2\, x_i^\top \Sigma_i x_i + 2C}{(1 + C\, x_i^\top \Sigma_i x_i)^2}$ (Eq. (13)); Diagonal Covariance: set $(\Sigma_{i+1})_{r,r} = (\Sigma_i)_{r,r} / \left(1 + C\, x_{i,r}^2\, (\Sigma_i)_{r,r}\right)^2$ for $r = 1, \ldots, d$; end for. Return: $\mu_{m+1}$, $\Sigma_{m+1}$.) [sent-193, score-0.287]
81 In both updates the current mean $\mu_i$ is mapped to the next mean $\mu_{i+1}$. [sent-198, score-0.095]
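The algorithm box above translates directly into a training loop; here is a sketch that reuses the hypothetical nherd_full_update function from the earlier snippet (all names are ours).

```python
import numpy as np

def train_nherd(X, Y, C=1.0):
    """Run full-covariance NHERD over a stream of examples (X: m x d, Y: +/-1 labels)."""
    d = X.shape[1]
    mu, Sigma = np.zeros(d), np.eye(d)   # Initialize: mu_1 = 0, Sigma_1 = I
    mistakes = 0
    for x, y in zip(X, Y):
        if np.sign(mu @ x) != y:         # predict with the mean weight vector
            mistakes += 1
        mu, Sigma = nherd_full_update(mu, Sigma, x, y, C)
    return mu, Sigma, mistakes
```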
82 This is a consequence of the linear transformation that ties the update between all weight-vectors. [sent-200, score-0.115]
83 The diagonal update, as designed, maintains a diagonal matrix, yet shrinks the matrix more in the directions that are more “orthogonal” to the example. [sent-201, score-0.147]
84 5 Empirical Evaluation. We evaluate NHERD on several popular datasets for document classification, optical character recognition (OCR), phoneme recognition, as well as on action recognition in video. [sent-207, score-0.206]
85 We compare our new algorithm NHERD with the AROW [8] algorithm, which was found to outperform other baselines [8]: the perceptron algorithm [13], Passive-Aggressive (PA) [5], confidence weighted learning (CW) [9, 7] and second order perceptron [1] on these datasets. [sent-208, score-0.102]
86 The weight of an edge between two algorithms is the fraction of datasets in which the top algorithm achieves lower test error than the bottom algorithm. [sent-216, score-0.208]
87 Given a new example, the algorithm predicts $\hat{y} = \arg\max_z \mu \cdot f(x, z)$, and suffers a loss if $\hat{y} \neq y$. [sent-221, score-0.103]
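A sketch of that multiclass prediction rule (the feature map f(x, z) and the label set are supplied by the caller; the names are our own assumption):

```python
import numpy as np

def predict_multiclass(mu, f, x, labels):
    """Predict y_hat = argmax_z mu . f(x, z) over a finite label set."""
    scores = [float(mu @ f(x, z)) for z in labels]
    return labels[int(np.argmax(scores))]
```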
88 First are datasets from [8]: 36 binary document classification datasets, and 100 binary OCR datasets (45 all-pairs each of USPS and MNIST, and ten one-vs-rest problems from MNIST). [sent-225, score-0.12]
89 Third, we conducted experiments on a TIMIT phoneme classification task. [sent-227, score-0.114]
90 The ten binary classification problems are identified by a pair of phoneme symbols (one or two Roman letters). [sent-231, score-0.181]
91 Fig. 2 illustrates a single update of each of the five algorithms: AROW P, AROW D, NHERD D, NHERD E, NHERD P. [sent-250, score-0.093]
92 Each of the five ellipses represents the Gaussian weight vector distribution after a single update on an example by each of the five algorithms. [sent-251, score-0.324]
93 Interestingly, the resulting volumes (areas) of the different ellipses roughly correspond to the overall performance of the algorithms. [sent-252, score-0.125]
94 The best update – NHERD P – has the smallest ellipse (with lowest-entropy), and the update with the worst performance – AROW D – has the largest, highest-entropy ellipse. [sent-253, score-0.214]
95 Fig. 5 shows the relative improvement in accuracy of NHERD P over AROW P on the ten phoneme recognition tasks with additional 30% label noise. [sent-261, score-0.229]
96 Finally, the bottom-right panel shows the 10-fold accuracy of the five algorithms over the video data, where clearly NHERD P outperforms all other algorithms by a wide margin. [sent-266, score-0.155]
97 Bottom left: relative accuracy improvement of NHERD P over AROW P on the ten phoneme classification tasks. [sent-272, score-0.183]
98 In both cases NHERD P is superior. Conclusions: We have seen how to incorporate velocity constraints in an online learning algorithm. [sent-274, score-0.258]
99 In addition to tracking the mean and covariance of a Gaussian weight vector distribution, regularization of the linear velocity terms is used to herd the normal distribution in the learning process. [sent-275, score-0.456]
100 We empirically evaluated the performance of NHERD on a variety of experimental datasets, and showed that the projected NHERD algorithm generally outperforms all other online learning algorithms on these datasets. [sent-277, score-0.14]
wordName wordTfidf (topN-words)
[('nherd', 0.652), ('arow', 0.541), ('velocity', 0.174), ('cw', 0.128), ('phoneme', 0.114), ('cxi', 0.108), ('ellipses', 0.106), ('xi', 0.106), ('weight', 0.104), ('yi', 0.102), ('update', 0.093), ('online', 0.084), ('herding', 0.077), ('wi', 0.069), ('diagonal', 0.061), ('covariance', 0.058), ('vectors', 0.056), ('crammer', 0.054), ('herd', 0.053), ('ocr', 0.053), ('loss', 0.051), ('perceptron', 0.051), ('ai', 0.051), ('rth', 0.046), ('ve', 0.046), ('tr', 0.045), ('pa', 0.044), ('updated', 0.044), ('text', 0.044), ('panel', 0.044), ('mnist', 0.043), ('dredze', 0.043), ('usps', 0.042), ('gaussian', 0.042), ('margin', 0.037), ('round', 0.037), ('updates', 0.037), ('datasets', 0.035), ('awi', 0.035), ('ewi', 0.035), ('improvment', 0.035), ('uniformly', 0.035), ('ow', 0.035), ('maintain', 0.034), ('ten', 0.034), ('aggressive', 0.033), ('mc', 0.033), ('expectation', 0.033), ('classi', 0.033), ('binary', 0.033), ('aw', 0.032), ('upon', 0.032), ('dkl', 0.031), ('microscopic', 0.031), ('tor', 0.031), ('constraining', 0.03), ('dence', 0.03), ('get', 0.03), ('outperforms', 0.029), ('noise', 0.029), ('ellipse', 0.028), ('koby', 0.028), ('gaussianity', 0.028), ('video', 0.028), ('label', 0.027), ('arg', 0.027), ('rule', 0.027), ('algorithms', 0.027), ('regularization', 0.027), ('maxz', 0.025), ('shrinks', 0.025), ('winning', 0.024), ('diffusion', 0.024), ('tradeoff', 0.024), ('project', 0.024), ('yielding', 0.024), ('bi', 0.023), ('reweighting', 0.023), ('emnlp', 0.023), ('eld', 0.023), ('hypothesis', 0.023), ('derivative', 0.022), ('transformation', 0.022), ('fraction', 0.022), ('panels', 0.022), ('lastly', 0.022), ('classify', 0.021), ('rd', 0.021), ('parametric', 0.021), ('distribution', 0.021), ('bottom', 0.02), ('current', 0.02), ('resulting', 0.019), ('recognition', 0.019), ('document', 0.019), ('mean', 0.019), ('picked', 0.019), ('elds', 0.019), ('ci', 0.019), ('dashed', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
2 0.30141094 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
3 0.09446083 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
4 0.077539422 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to find a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difficult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be defined by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will find some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. 
We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary finite data set. To our knowledge, this is the first result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classifier uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is defined as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to find m=1 weights θ that correctly classify the training examples. Consider first the separable case. Define the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is defined by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while fixing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while fixing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random fields (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. 
Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by configurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing configurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to find a violated constraint. For the pseudo-max objective, finding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisfies particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identified to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simplified instance of the structured learning setting. 
We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identifiability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to infinity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four configurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufficiently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the configurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the figure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identifiability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction defined on a fixed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node fields θi (yi ) for i ∈ V . 
We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k specifies the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input fields xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufficient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classification constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input fields xi (yi ) already suffice to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the first example, we set the input fields for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input fields can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ field for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is finite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). 
We can write the ps (J ) objective for finite data, and its limit when M → ∞, compactly as: 1 m m = max (yi − yi ) xm + Jki yk ps (J ) i M m i yi k∈N (i) p(x) max (yi − yi (x)) xi + → yi i Jki yk (x) dx (6) k∈N (i) ∗ where yi (x) is the label of i for input x when using parameters J . Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ Rn into exclusive regions according to the predicted labels of the neighbors of i (given x). Define Sij = {x : yj (x) = 1 and yk (x) = 0 for k ∈ N (i)\j}. Eq. 6 can then be written as ps (J ) = gi ({Jik }k∈N (i) ) + ˆ i gik (Jik ) , (7) k∈N (i) where gik (Jik ) = x∈Sik p(x) maxyi [(yi −yi (x))(xi +Jik )]dx and gi ({Jik }k∈N (i) ) contains all of ˆ the remaining terms, i.e. where either zero or more than one neighbor is set to one. The function gi ˆ is convex in J since it is a sum of integrals over convex functions. We proceed to show that gik (Jik ) is strictly convex for all choices of i and k ∈ N (i). This will show that ps (J ) is strictly convex since it is a sum over functions strictly convex in each one of the variables in J . For all values xi ∈ (−∞, ∞) there is some x in Sij . This is because for any finite xi and finite J ∗ , the other xj ’s can be chosen so as to give the y configuration corresponding to Sij . Now, since p(x) has full support, we have P (Sij ) > 0 and p(x) > 0 for any x in Sij . As a result, this also holds for the marginal pi (xi |Sij ) over xi within Sij . After some algebra, we obtain: gij (Jij ) = P (Sij ) ∞ p(x)yi (x)(xi + Jij )dx pi (xi |Sij ) max [0, xi + Jij ] dxi − −∞ x∈Sij The integral over the yi (x)(xi + Jij ) expression just adds a linear term to gij (Jij ). The relevant remaining term is (for brevity we drop P (Sij ), a strictly positive constant, and the ij index): h(J) = ∞ pi (xi |Sij ) max [0, xi + J] dxi = −∞ ∞ ˆ pi (xi |Sij )h(xi , J)dxi (8) −∞ ˆ ˆ where we define h(xi , J) = max [0, xi + J]. Note that h(J) is convex since h(xi , J) is convex in J for all xi . We want to show that h(J) is strictly convex. Consider J < J and α ∈ (0, 1) and define ˆ ˆ the interval I = [−J, −αJ − (1 − α)J ]. For xi ∈ I it holds that: αh(xi , J) + (1 − α)h(xi , J ) > ˆ i , αJ + (1 − α)J ) (since the first term is strictly positive and the rest are zero). For all other x, h(x ˆ this inequality holds but is not necessarily strict (since h is always convex in J). We thus have after integrating over x that αh(J) + (1 − α)h(J ) > h(αJ + (1 − α)J ), implying h is strictly convex, as required. Note that we used the fact that p(x) has full support when integrating over I. The function ps (J ) is thus a sum of strictly convex functions in all its variables (namely g(Jik )) plus other convex functions of J , hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point J . Since we know that ps (J ∗ ) = 0 and zero is a lower bound on the value of ps (J ), it follows that J ∗ is the unique minimizer. Thus we have that as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and thus we have consistency. As an example, consider the case of two variables y1 , y2 , with x1 and x2 distributed according to ∗ N (c1 , 1), N (0, 1) respectively. Furthermore assume J12 = 0. Then simple direct calculation yields: 2 2 2 c1 + J12 −c1 1 1 √ (9) e−x /2 dx − √ e−c1 /2 + √ e−(J12 +c1 ) /2 2π 2π 2π −J12 −c1 which is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration). 
g(J12 ) = 5 3 Hardness of Structured Learning Most structured prediction learning algorithms use some form of inference as a subroutine. However, the corresponding prediction task is generally NP-hard. For example, maximizing the discriminant function defined in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difficulty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certificate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set – and we would never know the difference. Theorem 3.1. Given labeled examples {(xm , y m )}M for a fixed but arbitrary graph G, it is m=1 NP-hard to decide whether there exists parameters θ such that ∀m, y m = arg maxy f (y; xm , θ). Proof. Any parameters θ have an equivalent parameterization in canonical form (see section Sec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable by some θ ∈ Θcan . We reduce from unweighted Max-Cut. The Max-Cut problem is to decide, given an undirected graph G, whether there exists a cut of at least K edges. Let G be the same graph as G, with k = 3 states per variable. We construct a small set of examples where a parameter vector will exist that separates the data if and only if there is no cut of K or more edges in G. Let θ be parameters in canonical form equivalent to θij (yi , yj ) = 1 if (yi , yj ) ∈ {(1, 2), (2, 1)}, 0 if yi = yj , and −n2 if (yi , yj ) ∈ {(1, 3), (2, 3), (3, 1), (3, 2)}. We first construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (also supplementary material), which when restricted to the space Θcan , constrain the parameters to equal θ. We then use one more example (xm , y m ) where y m = 3 (every node is in state 3) and, for all i, xm (3) = K−1 and xm (1) = xm (2) = 0. The first i i i n two states encode the original Max-Cut instance, while the third state is used to construct a labeling y m that has value equal to K − 1, and is otherwise not used. Let K ∗ be the value of the maximum cut in G. If in any assignment to the last example there is a variable taking the state 3 and another variable taking the state 1 or 2, then the assignment’s value will be at most K ∗ − n2 , which is less than zero. By construction, the 3 assignment has value K − 1. Thus, the optimal assignment must either be 3 with value K − 1, or some combination of states 1 and 2, which has value at most K ∗ . If K ∗ > K − 1 then 3 is not optimal and the examples are not separable. If K ∗ ≤ K − 1, the examples are separable. This result illustrates the potential difficulty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input. For example, in computer vision, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive results of learnability in such restricted settings. 
By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g., structured perceptron). Thus, when learning is hard, inference must be hard as well. 4 Experiments To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up gain over competing algorithms increases. Our learning algorithm corresponds to solving the following, where we add L2 regularization and use a scaled 0-1 loss, m m e(yi , yi ) = 1{yi = yi }/nm (nm is the number of labels in example m): min θ C m nm M nm m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) + θ −i yi 2 . (10) We will compare the pseudo-max method with learning using structural SVMs, both with exact inference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time. 6 (a) Synthetic (b) Reuters 0.4 exact LP−relaxation pseudo−max 0.15 Test error Test error 0.2 0.1 0.05 0 1 10 2 10 0.2 0.1 0 1 10 3 10 Train size exact LP−relaxation pseudo−max 0.3 2 10 3 10 4 10 Train size Figure 2: Test error as a function of train size for various algorithms. Subfigure (a) shows results for a synthetic setting, while (b) shows performance on the Reuters data. In the synthetic setting we use the discriminant function f (y; x, θ) = ij∈E θij (yi , yj ) + xi θi (yi ), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. i For a weight vector θ ∗ (sampled once, uniformly in the range [−1, 1], and used for all train/test sets) we generate train and test instances by sampling xm uniformly in the range [−5, 5] and then computing the optimal labels y m = arg maxy∈Y f (y; xm , θ ∗ ). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error for the learned weights (with 1000 test samples). For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison of the test error for the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the error converges to that of exact learning, as our consistency results predict. We also test the performance of our algorithm on a multi-label document classification task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max learning gets close to that of the other methods. However, unlike the synthetic setting there is still a small gap, even after seeing the entire train set. 
This could be because the full dataset is not yet large enough to be in the consistent regime (note that exact learning has not flattened either), or because the consistency conditions are not fully satisfied: the data might be non-separable or the support of the input distribution p(x) may be partial. We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random field (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The prediction problem for side-chain placement corresponds to finding the most likely assignment in a pairwise MRF, and fits naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX’s ILP solver. Fig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an LP relaxation, solved with CPLEX, for finding violated constraints.6 We generate training sets of increasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the remaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms. 5 The authors’ data and results are available from: http://cyanover.fhcrc.org/recomb-2007/ We significantly optimized the cutting-plane algorithm, e.g. including a large number of initial cuttingplanes and restricting the weight vector to be positive (which we know to hold at optimality). 7 Specifically, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles for all residues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins. 6 7 Time to train (minutes) Test error (χ1 and χ2) 0.27 0.265 pseudo−max LP−relaxation Soft rep 0.26 0.255 0.25 0 50 100 150 200 Train size 250 250 200 pseudo−max LP−relaxation 150 100 50 0 0 50 100 150 200 Train size 250 Figure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudomax method and a cutting-plane algorithm which uses a LP relaxation for inference, applied to the problem of learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and HCRF (given roughly five times more data) for a fraction of the training time. The original weights (“Soft rep” [3]) used for this energy function have 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difficulty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX’s QP solver) versus cutting-plane is striking. For example, for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup. 
Unfortunately the cutting-plane algorithm took a prohibitive amount of time to be able to run on the larger training sets. Since the data used in learning for protein side-chain placement is both highly non-separable and relatively little, these positive results illustrate the potential wide-spread applicability of the pseudo-max method. 5 Discussion The key idea of our method is to find parameters that prefer the true assignment y m over assignments that differ from it in only one variable, in contrast to all other assignments. Perhaps surprisingly, this weak requirement is sufficient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme so that optimization is performed only over a subset of these constraints. Our work raises a number of important open problems: It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as understanding the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y in the sense that a sufficient number of labels need to be obtained as the correct label for the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries so that the true parameters can be inferred. We expect that generalization analysis will depend on these two properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context. How should one apply this learning algorithm to non-separable data sets? We suggested one approach, based on using a hinge loss for each of the pseudo constraints. One question in this context is, how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches. Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship. 8 References [1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975. [2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002. [3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209 – 1221, 2007. [4] T. Finley and T. Joachims. 
Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304–311. ACM, 2008.
[5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792. 2008.
[7] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
[8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584–591. ACM, 2008.
[9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In Proceedings of the 26th International Conference on Machine Learning, pages 713–720, 2009.
[10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
[11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the European Conference on Machine Learning (ECML). Springer, September 2006.
[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928–935. ACM, 2008.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32. 2004.
[14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.
5 0.063810289 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance, such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss, and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpass all previously reported results on this problem. (A sketch of this loss-adjusted update appears after the list below.) 1
6 0.062098723 243 nips-2010-Smoothness, Low Noise and Fast Rates
7 0.061840191 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
8 0.058163624 22 nips-2010-Active Estimation of F-Measures
9 0.057071712 63 nips-2010-Distributed Dual Averaging In Networks
10 0.056153197 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
11 0.054887105 257 nips-2010-Structured Determinantal Point Processes
12 0.05254858 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
13 0.052515488 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
14 0.051366933 151 nips-2010-Learning from Candidate Labeling Sets
15 0.049859352 235 nips-2010-Self-Paced Learning for Latent Variable Models
16 0.049827024 27 nips-2010-Agnostic Active Learning Without Constraints
17 0.049201168 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
18 0.048504304 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.0482953 138 nips-2010-Large Margin Multi-Task Metric Learning
20 0.04683242 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
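The loss-adjusted rule summarized in item 5 above can be read as a simple perceptron-style step. The following Python sketch is a hedged reading of that abstract, not the paper's algorithm verbatim: the candidate set, the step size eta, the adjustment scale eps, and the sign conventions are illustrative assumptions.

    import numpy as np

    def direct_loss_step(w, phi, x, y_gold, candidates, task_loss, eta=0.1, eps=0.01):
        # standard inference: the model's own prediction
        y_pred = max(candidates, key=lambda y: w @ phi(x, y))
        # loss-adjusted inference: prediction biased toward high task loss
        y_adj = max(candidates, key=lambda y: w @ phi(x, y) + eps * task_loss(y_gold, y))
        # the feature difference approximates the gradient of the task loss,
        # so a gradient step moves away from the loss-adjusted prediction
        return w - (eta / eps) * (phi(x, y_adj) - phi(x, y_pred))

In the small-eps limit this feature difference is, per the abstract's theorem, the gradient of the task loss itself, so the step directly descends the measure one actually cares about rather than a surrogate.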
topicId topicWeight
[(0, 0.157), (1, 0.029), (2, 0.07), (3, -0.015), (4, 0.01), (5, 0.048), (6, -0.058), (7, -0.007), (8, -0.068), (9, 0.049), (10, 0.015), (11, -0.057), (12, 0.103), (13, 0.063), (14, -0.05), (15, 0.076), (16, -0.052), (17, -0.108), (18, -0.049), (19, -0.071), (20, 0.009), (21, 0.022), (22, 0.124), (23, 0.066), (24, -0.118), (25, 0.008), (26, 0.128), (27, 0.026), (28, -0.048), (29, -0.008), (30, 0.033), (31, 0.039), (32, 0.029), (33, 0.104), (34, -0.094), (35, 0.003), (36, 0.018), (37, -0.003), (38, 0.164), (39, -0.045), (40, -0.184), (41, 0.002), (42, 0.076), (43, -0.075), (44, -0.009), (45, 0.077), (46, -0.129), (47, -0.085), (48, -0.098), (49, -0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.86178595 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
2 0.79767913 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework for online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows us to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized to the hinge loss, allowing us to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that uses a hybrid of adaptive and fixed second-order information. We analyze the properties of the algorithm and illustrate its performance on a synthetic dataset. (A sketch of the second-order update family that NAROW builds on appears after the list below.) 1
3 0.70307428 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens van der Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
4 0.6514039 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say "I don't know" as well as to make incorrect predictions on given examples. We analyze the trade-off between saying "I don't know" and making mistakes. If the number of don't-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to the KWIK framework introduced by Li et al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don't-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
5 0.6084168 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance, such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss, and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpass all previously reported results on this problem. 1
6 0.42790484 235 nips-2010-Self-Paced Learning for Latent Variable Models
7 0.41190863 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
8 0.41018471 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
9 0.39539763 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
10 0.38741645 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
11 0.36663517 224 nips-2010-Regularized estimation of image statistics by Score Matching
12 0.36046389 138 nips-2010-Large Margin Multi-Task Metric Learning
13 0.3602519 202 nips-2010-Parallelized Stochastic Gradient Descent
14 0.35855886 228 nips-2010-Reverse Multi-Label Learning
15 0.35657826 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
16 0.35492158 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
17 0.33172318 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
18 0.32850936 22 nips-2010-Active Estimation of F-Measures
19 0.32748944 63 nips-2010-Distributed Dual Averaging In Networks
20 0.31915399 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
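Items 1 and 2 in the list above (Gaussian herding and NAROW) both build on second-order online updates that maintain a Gaussian distribution N(mu, Sigma) over weight vectors. As a reference point, here is a minimal AROW-style update in Python, written from the standard form in the confidence-weighted literature; NHERD and NAROW modify the covariance update, so treat this as a sketch of the family, not either paper's exact rule.

    import numpy as np

    def arow_step(mu, Sigma, x, y, r=1.0):
        # mu, Sigma : mean and covariance of the Gaussian weight distribution
        # x, y      : feature vector and binary label in {-1, +1}
        # r         : regularization parameter trading off update aggressiveness
        margin = y * (mu @ x)
        if margin >= 1.0:
            return mu, Sigma               # confident and correct: no update
        v = x @ Sigma @ x                  # variance of the margin along x
        beta = 1.0 / (v + r)
        alpha = (1.0 - margin) * beta      # hinge-scaled step size
        Sigma_x = Sigma @ x
        mu = mu + alpha * y * Sigma_x      # shift the mean to restore the margin
        Sigma = Sigma - beta * np.outer(Sigma_x, Sigma_x)   # shrink variance along x
        return mu, Sigma

Because Sigma only ever shrinks, directions that have been seen often receive smaller updates — the confidence mechanism that makes this family of methods robust to label noise.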
topicId topicWeight
[(4, 0.221), (13, 0.041), (17, 0.01), (27, 0.08), (30, 0.049), (35, 0.027), (45, 0.192), (50, 0.066), (52, 0.037), (60, 0.043), (64, 0.013), (77, 0.056), (78, 0.025), (90, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.80399895 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
2 0.74066257 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
Author: Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar
Abstract: Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (e.g., the movement duration or the time point of reaching an intermediate goal) to be specified a priori. Here, we present a methodology that is capable of jointly optimizing the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and the control commands, offering, for the first time, a practical approach to tackling generic via-point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state-dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.
3 0.72659212 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of Gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can outperform an equivalent feedforward network. However, a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of "orthogonal" recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. (A toy illustration of this recall-by-sparse-recovery idea appears after the list below.) 1
4 0.72089022 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
5 0.71848255 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model. 1
6 0.71673495 117 nips-2010-Identifying graph-structured activation patterns in networks
7 0.7156136 63 nips-2010-Distributed Dual Averaging In Networks
8 0.71291864 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
9 0.71246862 155 nips-2010-Learning the context of a category
10 0.71191901 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
11 0.71156722 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
12 0.71108615 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
13 0.71097928 17 nips-2010-A biologically plausible network for the computation of orientation dominance
14 0.71008199 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
15 0.7100091 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
16 0.70987797 148 nips-2010-Learning Networks of Stochastic Differential Equations
17 0.70929307 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
18 0.70905173 98 nips-2010-Functional form of motion priors in human motion perception
19 0.70772016 282 nips-2010-Variable margin losses for classifier design
20 0.70762211 265 nips-2010-The LASSO risk: asymptotic results and real world examples
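As a closing illustration for item 3 in the list above, the following toy Python script shows the recall-by-sparse-recovery idea: a linear network's instantaneous state is a linear measurement of the past input sequence, and a sparse input history can be read back out of the state by l1-regularized regression. The orthogonal recurrence, the random feed-in vector, and the Lasso readout are all illustrative assumptions, not the paper's construction.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, T, k = 40, 80, 5                    # neurons, time steps, nonzero inputs

    # orthogonal recurrent weights (QR of a random matrix) and a random feed-in vector
    W, _ = np.linalg.qr(rng.standard_normal((n, n)))
    v = rng.standard_normal(n) / np.sqrt(n)

    # sparse input history: k nonzero entries over T steps
    s = np.zeros(T)
    s[rng.choice(T, size=k, replace=False)] = rng.standard_normal(k)

    # network state after T steps is a linear measurement of the whole history:
    # h = sum_t W^(T-1-t) v s[t], so the columns of A are the vectors W^k v
    A = np.column_stack([np.linalg.matrix_power(W, T - 1 - t) @ v for t in range(T)])
    h = A @ s

    # read the past back out of the n-dimensional state by l1 minimization
    s_hat = Lasso(alpha=1e-3, max_iter=50000).fit(A, h).coef_
    print("relative recovery error:", np.linalg.norm(s_hat - s) / np.linalg.norm(s))

With n = 40 neurons the state here encodes T = 80 time steps of a 5-sparse input — a miniature version of the capacity-beyond-n effect the abstract describes.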