nips nips2008 nips2008-78 knowledge-graph by maker-knowledge-mining

78 nips-2008-Exact Convex Confidence-Weighted Learning


Source: pdf

Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. [sent-3, score-0.415]

2 Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. [sent-4, score-0.112]

3 Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. [sent-5, score-0.25]

4 Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. [sent-6, score-0.135]

5 1 Introduction Online learning methods for linear classifiers, such as the perceptron and passive-aggressive (PA) algorithms [4], have been thoroughly analyzed and are widely used. [sent-7, score-0.218]

6 However, these methods do not model the strength of evidence for different weights arising from differences in the use of features in the data, which can be a serious issue in text classification, where weights of rare features should be trusted less than weights of frequent features. [sent-8, score-0.123]

7 Confidence-weighted (CW) learning [6], motivated by PA learning, explicitly models classifier weight uncertainty with a full multivariate Gaussian distribution over weight vectors. [sent-9, score-0.298]

8 The PA geometrical margin constraint is replaced by the probabilistic constraint that a classifier drawn from the distribution should, with high probability, correctly classify the next example. [sent-10, score-0.254]

9 Although [6] explained CW learning in terms of the standard deviation of the margin induced by the hypothesis Gaussian, in practice they used the margin variance to make the problem convex. [sent-12, score-0.209]

10 In this work, we use their original constraint but maintain convexity, yielding experimental improvements. [sent-13, score-0.162]

11 In large-margin classification, the margin’s magnitude for an instance is sometimes taken as a proxy for prediction confidence for that instance, but that quantity is not calibrated nor is it connected precisely to a measure of weight uncertainty. [sent-16, score-0.204]

12 Bayesian approaches to linear classification, such as Bayesian logistic regression [9], use a simple mathematical relationship between weight uncertainty and prediction uncertainty, which unfortunately cannot be computed exactly. [sent-17, score-0.186]

13 CW learning preserves the convenient computational properties of PA algorithms while providing a precise connection between weight uncertainty and prediction confidence that has led to weight updates that are more effective in practice [6, 5]. [sent-18, score-0.337]

14 We begin with a review of the CW approach, then show that the constraint can be expressed in a convex form, and solve it to obtain a new CW algorithm. [sent-19, score-0.129]

15 Our analysis provides a mistake bound and indicates that the algorithm is invariant to initialization. [sent-21, score-0.167]

16 On round i, the algorithm applies its current linear classification rule hw(x) = sign(w · x) to an instance xi ∈ Rd to produce a prediction ŷi ∈ {−1, +1}, receives the true label yi ∈ {−1, +1} and suffers a loss ℓ(ŷi, yi). [sent-26, score-0.883]

17 As usual, we define the margin of an example on round i as mi = yi (wi · xi ), where positive sign corresponds to a correct prediction. [sent-28, score-0.817]

18 CW classification captures the notion of confidence in the weights of a linear classifier with a probability density on classifier weight vectors, specifically a Gaussian distribution with mean µ ∈ Rd and covariance matrix Σ ∈ Rd×d . [sent-29, score-0.282]

19 The values µp and Σp,p represent knowledge of and confidence in the weight for feature p. [sent-30, score-0.112]

20 The smaller Σp,p , the more confidence we have in the mean weight value µp . [sent-31, score-0.112]

21 Each covariance term Σp,q captures our knowledge of the interaction between features p and q. [sent-32, score-0.102]

22 In the CW model, the traditional signed margin is the mean of the induced univariate Gaussian random variable M ∼ N( y(µ · x), xᵀΣx ). (1) [sent-33, score-0.086]

23 Here, we use the average weight vector E [w] = µ, analogous to Bayes point machines [8]. [sent-35, score-0.112]

24 The information captured by the covariance Σ is then used just to adjust training updates. [sent-36, score-0.102]

25 3 Update Rule The CW update rule of Dredze et al. [sent-37, score-0.129]

26 [6] makes the smallest adjustment to the distribution that ensures the probability of correct prediction on instance i is no smaller than the confidence hyperparameter η ∈ [0, 1]: Pr [yi (w · xi ) ≥ 0] ≥ η. [sent-38, score-0.359]
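Since the margin is Gaussian with mean yi(µ · xi) and variance xiᵀΣxi (sentence 22 above), this probabilistic constraint is equivalent to the deterministic condition yi(µ · xi) ≥ φ √(xiᵀΣxi) with φ = Φ⁻¹(η). The following minimal sketch checks that equivalence numerically; all names (mu, Sigma, x, y, eta) are our own illustrations, not code from the paper.

```python
# A minimal sketch (not from the paper): checking that the probabilistic
# constraint Pr[y*(w.x) >= 0] >= eta matches the deterministic form
# y*(mu.x) >= phi*sqrt(x' Sigma x) with phi = Phi^{-1}(eta).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d = 5
mu = rng.normal(size=d)                      # mean weight vector (illustrative)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                  # PSD covariance over weights
x, y = rng.normal(size=d), 1.0               # one labeled example
eta = 0.9
phi = norm.ppf(eta)                          # phi = Phi^{-1}(eta)

m = y * (mu @ x)                             # mean of the margin M
v = x @ Sigma @ x                            # variance of the margin M
deterministic_ok = m >= phi * np.sqrt(v)

# Monte Carlo estimate of Pr[y*(w.x) >= 0] under w ~ N(mu, Sigma)
w = rng.multivariate_normal(mu, Sigma, size=200_000)
probabilistic_ok = np.mean(y * (w @ x) >= 0) >= eta
print(deterministic_ok, probabilistic_ok)    # agree up to Monte Carlo noise
```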

27 The magnitude of the update is measured by its KL divergence to the previous distribution, yielding the following constrained optimization: (µi+1, Σi+1) = arg min over (µ, Σ) of DKL( N(µ, Σ) ‖ N(µi, Σi) ), s.t. Pr[ yi(w · xi) ≥ 0 ] ≥ η. [sent-39, score-0.134]

28 Unfortunately, while the constraint of this problem (3) is linear in µ, it is not convex in Σ. [sent-45, score-0.129]

29 This change (writing Σ = Υ² for a PSD matrix Υ) yields the following convex optimization with a convex constraint in µ and Υ simultaneously: (µi+1, Υi+1) = arg min over (µ, Υ) of ½ log( det Υi² / det Υ² ) + ½ Tr( Υi⁻² Υ² ) + ½ (µi − µ)ᵀ Υi⁻² (µi − µ), s.t. yi(µ · xi) ≥ φ ‖Υ xi‖. (4) [sent-57, score-0.204]

30 Omitting the PSD constraint for now, we obtain the Lagrangian for (4): L = ½ log( det Υi² / det Υ² ) + ½ Tr( Υi⁻² Υ² ) + ½ (µi − µ)ᵀ Υi⁻² (µi − µ) + α( −yi(µ · xi) + φ ‖Υ xi‖ ). (5) The algorithm (Fig. 1) takes input parameters a > 0 and η ∈ [0.5, 1]. [sent-64, score-0.383]

31 For i = 1, . . . , n: • Receive a training example xi ∈ Rd. • Compute the Gaussian margin distribution Mi ∼ N( (µi · xi), xiᵀ Σi xi ). • Receive the true label yi and compute vi = xiᵀ Σi xi, mi = yi(µi · xi) (11) and ui = ¼ ( −αi vi φ + √( αi² vi² φ² + 4 vi ) )² (12). [sent-69, score-3.659]

32 Here αi = max{ 0, ( −mi ψ + √( mi² φ⁴/4 + vi φ² ξ ) ) / ( vi ξ ) } (14), with φ = Φ⁻¹(η), ψ = 1 + φ²/2, ξ = 1 + φ², and βi = αi φ / ( √ui + vi αi φ ). • Update: µi+1 = µi + αi yi Σi xi, and either Σi+1 = Σi − βi Σi xi xiᵀ Σi (full) or Σi+1⁻¹ = Σi⁻¹ + ( αi φ / √ui ) diag²(xi) (diag) (15). Output the Gaussian distribution N( µn+1, Σn+1 ). [sent-70, score-2.414]
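A minimal NumPy sketch of one full-covariance update, following the reconstruction of (11), (14), (12) and the mean/covariance updates above; the function and variable names (cw_stdev_update, mu, Sigma, and so on) are ours, and this is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def cw_stdev_update(mu, Sigma, x, y, phi):
    """One full-covariance CW-Stdev update; a sketch of Fig. 1 as reconstructed
    above, with our own function and variable names."""
    psi = 1.0 + phi ** 2 / 2.0
    xi_ = 1.0 + phi ** 2
    v = x @ Sigma @ x                        # v_i = x' Sigma_i x          (11)
    m = y * (mu @ x)                         # m_i = y_i (mu_i . x_i)      (11)
    # alpha_i from (14); zero means the confidence constraint already holds
    alpha = max(0.0, (-m * psi + np.sqrt(m ** 2 * phi ** 4 / 4.0
                                         + v * phi ** 2 * xi_)) / (v * xi_))
    if alpha == 0.0:
        return mu, Sigma
    sqrt_u = (-alpha * v * phi
              + np.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) / 2.0   # (12)
    beta = alpha * phi / (sqrt_u + v * alpha * phi)
    Sx = Sigma @ x
    mu_new = mu + alpha * y * Sx                  # mean update
    Sigma_new = Sigma - beta * np.outer(Sx, Sx)   # full covariance update
    return mu_new, Sigma_new

# Usage: eta in (0.5, 1] gives phi = Phi^{-1}(eta) >= 0; start from (0, a*I).
phi = norm.ppf(0.9)
mu, Sigma = np.zeros(3), 1.0 * np.eye(3)
mu, Sigma = cw_stdev_update(mu, Sigma, np.array([1.0, -0.5, 0.2]), +1.0, phi)
```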

33 At the optimum, it must be that ∂L/∂µ = Υi⁻² (µ − µi) − α yi xi = 0 ⇒ µi+1 = µi + α yi Υi² xi (6), where we assumed that Υi is non-singular (PSD). [sent-73, score-0.598]

34 At the optimum, we must also have ∂L/∂Υ = −Υ⁻¹ + ½ ( Υi⁻² Υ + Υ Υi⁻² ) + αφ ( Υ xi xiᵀ + xi xiᵀ Υ ) / ( 2 √( xiᵀ Υ² xi ) ) = 0 (7), from which we obtain the implicit-form update Υi+1⁻² = Υi⁻² + αφ xi xiᵀ / √( xiᵀ Υi+1² xi ) (8). [sent-74, score-3.391]

35 Conveniently, these updates can be expressed in terms of the covariance matrix: µi+1 = µi + α yi Σi xi, Σi+1⁻¹ = Σi⁻¹ + αφ xi xiᵀ / √( xiᵀ Σi+1 xi ). (9) [sent-75, score-1.364]

36 2 Solving for the Lagrange Multiplier α We now determine the value of the Lagrange multiplier α and make the covariance update explicit. [sent-80, score-0.278]

37 We use the Woodbury identity to get Σi+1 = ( Σi⁻¹ + αφ xi xiᵀ / √( xiᵀ Σi+1 xi ) )⁻¹ = Σi − Σi xi xiᵀ Σi / ( √( xiᵀ Σi+1 xi ) / (αφ) + xiᵀ Σi xi ). (10) Let ui = xiᵀ Σi+1 xi, vi = xiᵀ Σi xi, and mi = yi(µi · xi). [sent-82, score-4.817]
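The step from (9) to (10) is the standard Woodbury (Sherman-Morrison) identity for a rank-one update. A small standalone check of that identity, with test values of our own:

```python
# Check of the Woodbury / Sherman-Morrison step behind (10):
# (A^{-1} + c x x')^{-1} = A - A x x' A / (1/c + x' A x)  for PSD A and c > 0,
# with c playing the role of alpha*phi / sqrt(x' Sigma_{i+1} x).
import numpy as np

rng = np.random.default_rng(3)
d = 6
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)
x = rng.normal(size=d)
c = 0.37
lhs = np.linalg.inv(np.linalg.inv(A) + c * np.outer(x, x))
rhs = A - np.outer(A @ x, A @ x) / (1.0 / c + x @ A @ x)
assert np.allclose(lhs, rhs)
```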

38 Multiplying (10) by xiᵀ (left) and xi (right) we get ui = vi − vi ( αφ / ( √ui + vi αφ ) ) vi, which can be solved for ui to obtain √ui = ( −α vi φ + √( α² vi² φ² + 4 vi ) ) / 2. (12) [sent-87, score-2.794]
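A quick sanity check, with arbitrary positive test values of our own, that the closed form for √ui solves the fixed-point relation above:

```python
# Check that sqrt(u) from (12) solves u = v - v*(alpha*phi/(sqrt(u)+v*alpha*phi))*v.
import numpy as np

alpha, v, phi = 0.7, 2.3, 1.3
sqrt_u = (-alpha * v * phi + np.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) / 2.0
u = sqrt_u ** 2
assert np.isclose(u, v - v * (alpha * phi / (sqrt_u + v * alpha * phi)) * v)
```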

39 The KKT conditions for the optimization imply that either α = 0 and no update is needed, or the constraint (4) is an equality after the update. [sent-88, score-0.316]

40 Using (9,10,11,12) we obtain mi + α vi = φ ( −α vi φ + √( α² vi² φ² + 4 vi ) ) / 2, which can be rearranged into a quadratic equation α² vi² (1 + φ²) + 2 α mi vi (1 + φ²/2) + ( mi² − vi φ² ) = 0. [sent-90, score-1.503]

41 The larger root is then γi = ( −mi vi ψ + √( mi² vi² ψ² − vi² ξ ( mi² − vi φ² ) ) ) / ( vi² ξ ), where ψ = 1 + φ²/2 and ξ = 1 + φ². (13) [sent-93, score-1.717]

42 The constraint (4) is satisfied before the update if mi − φ √vi ≥ 0. [sent-94, score-1.271]

43 If mi ≤ 0, then mi ≤ φ √vi and from (13) we have that γi > 0. [sent-95, score-0.916]

44 If, instead, mi ≥ 0, then, again by (13), we have γi > 0 ⇔ mi vi ψ < √( mi² vi² ψ² − vi² ξ ( mi² − vi φ² ) ) ⇔ mi < φ √vi. [sent-96, score-2.837]

45 We summarize the discussion in the following lemma: Lemma 1 The solution of (13) satisfies the KKT conditions, that is either αi ≥ 0 or the constraint of (3) is satisfied before the update with the parameters µi and Σi . [sent-98, score-0.186]

46 We obtain the final form of αi by simplifying (13) together with Lemma 1: αi = max{ 0, ( −mi ψ + √( mi² φ⁴/4 + vi φ² ξ ) ) / ( vi ξ ) }. (14) [sent-99, score-0.418]
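A short numeric check, again with test values of our own, that the larger root (13) satisfies the quadratic from sentence 40 and coincides with the simplified form (14):

```python
import numpy as np

m, v, phi = 0.3, 1.7, 1.28
psi, xi_ = 1 + phi ** 2 / 2, 1 + phi ** 2
gamma = (-m * v * psi
         + np.sqrt(m ** 2 * v ** 2 * psi ** 2
                   - v ** 2 * xi_ * (m ** 2 - v * phi ** 2))) / (v ** 2 * xi_)
# gamma is a root of the quadratic ...
assert np.isclose(v ** 2 * xi_ * gamma ** 2 + 2 * m * v * psi * gamma
                  + (m ** 2 - v * phi ** 2), 0.0)
# ... and equals the simplified closed form (14)
assert np.isclose(gamma, (-m * psi + np.sqrt(m ** 2 * phi ** 4 / 4
                                             + v * phi ** 2 * xi_)) / (v * xi_))
```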

47 To summarize, after receiving the correct label yi the algorithm checks whether the probability of a correct prediction under the current parameters is greater than the confidence threshold η = Φ(φ). [sent-100, score-0.584]

48 In this case the covariance Σ parameter does not influence the decision, only the mean µ. [sent-111, score-0.102]
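Putting the pieces together, a usage sketch of the online loop that reuses the hypothetical cw_stdev_update sketched after the Fig. 1 pseudocode above; as noted, the prediction itself uses only the mean. The data and sizes are illustrative only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, n, a, eta = 10, 500, 1.0, 0.9
phi = norm.ppf(eta)
w_true = rng.normal(size=d)                  # a separable toy problem
X = rng.normal(size=(n, d))
Y = np.where(X @ w_true >= 0, 1.0, -1.0)

mu, Sigma = np.zeros(d), a * np.eye(d)
mistakes = 0
for x, y in zip(X, Y):
    if y * (mu @ x) <= 0:                    # prediction uses only the mean
        mistakes += 1
    mu, Sigma = cw_stdev_update(mu, Sigma, x, y, phi)
print("cumulative mistakes:", mistakes)
```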

49 Furthermore, for length-one input vectors, at the first round we have Σ1 = aI, so the first-round constraint is yi(µ1 · xi) ≥ φ √( a ‖xi‖² ) = φ √a, which is equivalent to the original PA update. [sent-112, score-0.865]

50 Second, the update described above yields full covariance matrices. [sent-113, score-0.238]

51 However, sometimes we may prefer diagonal covariance matrices, which can be achieved by projecting the matrix Σi+1 that results from the update onto the set of diagonal matrices. [sent-114, score-0.377]

52 In fact, if Σi is diagonal then we only need to project xi xiᵀ onto a diagonal matrix. [sent-116, score-0.744]

53 We thus replace (9) with the following update: Σi+1⁻¹ = Σi⁻¹ + ( αi φ / √ui ) diag²(xi), (15) where diag²(xi) is a diagonal matrix with the squares of the elements of xi on its diagonal. [sent-117, score-0.53]
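A corresponding sketch of the diagonal variant, which stores only the diagonal of Σ and applies (15); again the names are ours and this is an illustration rather than the authors' code.

```python
import numpy as np

def cw_stdev_update_diag(mu, sigma_diag, x, y, phi):
    """Diagonal-covariance CW-Stdev variant (sketch of update (15));
    sigma_diag holds the diagonal of Sigma_i."""
    psi = 1.0 + phi ** 2 / 2.0
    xi_ = 1.0 + phi ** 2
    v = np.sum(sigma_diag * x ** 2)          # x' Sigma x for diagonal Sigma
    m = y * (mu @ x)
    alpha = max(0.0, (-m * psi + np.sqrt(m ** 2 * phi ** 4 / 4.0
                                         + v * phi ** 2 * xi_)) / (v * xi_))
    if alpha == 0.0:
        return mu, sigma_diag
    sqrt_u = (-alpha * v * phi
              + np.sqrt(alpha ** 2 * v ** 2 * phi ** 2 + 4.0 * v)) / 2.0
    mu_new = mu + alpha * y * sigma_diag * x
    # (15): Sigma_{i+1}^{-1} = Sigma_i^{-1} + (alpha*phi/sqrt(u_i)) * diag^2(x_i)
    sigma_diag_new = 1.0 / (1.0 / sigma_diag + (alpha * phi / sqrt_u) * x ** 2)
    return mu_new, sigma_diag_new
```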

54 Finally, the following property of our algorithm shows that it can be used with Mercer kernels: Theorem 2 (Representer Theorem) The mean µi and covariance Σi parameters computed by the algorithm in Fig. [sent-120, score-0.102]

55 1 can be written as linear combinations of the input vectors with coefficients that depend only on inner products of input vectors: Σi = ∑p,q<i πp,q(i) xp xqᵀ + aI and µi = ∑p<i νp(i) xp. [sent-121, score-0.102]

56 First, we show that performance does not depend on initialization and then we compute a bound on the number of mistakes that the algorithm makes. [sent-124, score-0.219]

57 1 uses a predefined parameter a to initialize the covariance matrix. [sent-127, score-0.141]

58 Since the decision to update depends on the covariance matrix, which implicitly depends on a through αi and vi, one may assume that a affects performance. [sent-128, score-0.622]

59 In fact the number of mistakes is independent of a, i. [sent-129, score-0.14]

60 Specifically, if it holds for mean and covariance parameters µ and Σ, it holds also for the scaled parameters cµ and c²Σ for any c > 0. [sent-132, score-0.164]

61 If, in addition to predictions, we also need the distribution over weight vectors, the scale parameter a should be calibrated. [sent-135, score-0.112]

62 Let Σi , µi , mi , vi , αi , ui be the quantities obtained throughout the execution of the algorithm described in Fig. [sent-140, score-0.865]

63 Let also Σ̃i, µ̃i, m̃i, ṽi, α̃i, ũi be the corresponding quantities obtained throughout the execution of the algorithm with an alternative initialization of (0, aI) (for some a > 0). [sent-142, score-0.896]

64 The following relations between the two sets of quantities hold: m̃i = √a mi, ṽi = a vi, α̃i = (1/√a) αi, µ̃i = √a µi, ũi = a ui, and Σ̃i = a Σi. [sent-143, score-0.837]

65 From the lemma we see that the quantity m̃i / √ṽi = mi / √vi is invariant to a. [sent-148, score-1.456]

66 Therefore, the behavior of the algorithm in general, and its updates and mistakes in particular, is independent of the choice of a. [sent-149, score-0.179]
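A numeric illustration of this invariance, reusing the hypothetical cw_stdev_update sketch from above with two initializations run on the same example stream (test data are our own):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
d, a, phi = 4, 9.0, norm.ppf(0.9)

mu1, S1 = np.zeros(d), np.eye(d)             # initialization (0, I)
mu2, S2 = np.zeros(d), a * np.eye(d)         # initialization (0, a*I)
for _ in range(5):                           # same examples for both runs
    x = rng.normal(size=d)
    y = 1.0 if rng.random() < 0.5 else -1.0
    mu1, S1 = cw_stdev_update(mu1, S1, x, y, phi)
    mu2, S2 = cw_stdev_update(mu2, S2, x, y, phi)

# scaling lemma: mu scales with sqrt(a), Sigma scales with a
assert np.allclose(mu2, np.sqrt(a) * mu1) and np.allclose(S2, a * S1)
```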

67 2 Analysis in the Mistake Bound Model The main theorem of the paper bounds the number of mistakes made by CW-Stdev. [sent-152, score-0.14]

68 1, initialized with (0, I), with xi ∈ Rd and yi ∈ {−1, +1}. [sent-157, score-0.299]

69 Assume there exist µ* and Σ* such that, for all i for which the algorithm made an update (αi > 0), µ*ᵀ xi yi ≥ µi+1ᵀ xi yi and xiᵀ Σ* xi ≤ xiᵀ Σi+1 xi. (18) [sent-158, score-2.16]

70 The number of mistakes is bounded by ∑ᵢ αi² vi ≤ ( (1 + φ²) / φ² ) ( −log det Σ* + Tr(Σ*) + µ*ᵀ Σn+1⁻¹ µ* − d ). (19) [sent-160, score-0.558]

71 Figure 2: (a) The average and standard deviation of the cumulative number of mistakes for seven algorithms. [sent-170, score-0.244]

72 (b) The average and standard deviation of test error (%) over unseen data for the seven algorithms. [sent-171, score-0.093]

73 The above bound depends on an output of the algorithm, Σn+1, similar to the bound for the second-order perceptron [3]. [sent-174, score-0.359]

74 The two conditions (18) imply linear separability of the input sequence by µ*: µ*ᵀ xi yi ≥(18) µi+1ᵀ xi yi ≥(4) φ √( xiᵀ Σi+1 xi ) ≥(18) φ √( xiᵀ Σ* xi ) ≥ φ minᵢ √( xiᵀ Σ* xi ) > 0, where the superscripts in parentheses refer to the inequalities used. [sent-175, score-2.752]

75 Furthermore, if µ* satisfies the stronger conditions yi(µ* · xi) ≥ ‖xi‖, then from Σi+1 ⪯ I above it follows that (φµ*)ᵀ xi yi ≥ φ ‖xi‖ = φ √( xiᵀ I xi ) ≥ φ √( xiᵀ Σi+1 xi ) = µi+1ᵀ xi yi, where the last equality holds since we assumed that an update was made for the ith example. [sent-178, score-2.995]

76 The quantity µ*ᵀ Σn+1⁻¹ µ* in this bound is analogous to the quantity R² ‖µ*‖² in the perceptron bound [13], except that the norm of the examples does not come in explicitly as the radius R of the enclosing ball, but implicitly through the fact that Σn+1⁻¹ is a sum of example outer products (9). [sent-180, score-0.378]

77 In addition, in this version of the bound we impose a margin of 1 under the condition that examples have unit norm, whereas in the perceptron bound, the margin of 1 is for examples with arbitrary norm. [sent-181, score-0.438]

78 This follows from the fact that (4) is invariant to the norm of xi . [sent-182, score-0.345]

79 Each point’s label depended on the first two coordinates using a separator parallel to the long axis of the ellipsoid, yielding a linearly separable set (Fig. [sent-186, score-0.148]

80 We evaluated five existing online learning algorithms, the perceptron [16], the passive-aggressive (PA) algorithm [4], the second-order perceptron (SOP) [3], CW-Var-diag, and CW-Var-full [6], together with CW-Stdev-diag and CW-Stdev-full. [sent-188, score-0.537]

81 Fig. 2(a) shows the average cumulative mistakes for each algorithm; error bars indicate one unit of standard deviation. [sent-191, score-0.181]

82 Additionally, CW-Var makes more mistakes than CW-Stdev: 8% more in the diagonal case and 17% more in the full. [sent-193, score-0.213]

83 The Gaussian distribution over weight vectors after 50 rounds is represented in Fig. [sent-200, score-0.158]

84 The dotted segment represents the first two coordinates of possible representations of the true hyperplane in the positive quadrant. [sent-203, score-0.112]

85 Nevertheless, the long axis is already parallel to the true set of possible weight vectors. [sent-207, score-0.181]

86 The remaining nine ellipsoids represent the covariance of pairs of noise features. [sent-209, score-0.152]

87 Those ellipsoids are close to circular and have centers close to the origin, indicating that the corresponding feature weights should be near zero but without much confidence. [sent-210, score-0.091]

88 6 Related Work Online additive algorithms have a long history, from the perceptron [16] to more recent methods [10, 4]. [sent-216, score-0.218]

89 Our update has a more general form, in which the input vector xi is linearly transformed using the covariance matrix, both rotating the input and assigning weight specific learning rates. [sent-217, score-0.615]

90 The second-order perceptron (SOP) [3] demonstrated that second-order information can improve on first-order methods. [sent-227, score-0.218]

91 SOP uses the current instance in the correlation matrix for prediction while CW updates after prediction. [sent-230, score-0.126]

92 A variant of CW-Stdev similar to SOP follows from our derivation if we fix the Lagrange multiplier in (5) to a predefined value αi = α, omit the square root, and use a gradient-descent optimization step. [sent-231, score-0.104]

93 Bottom: Feature weight distributions of CW-Stdev-full after 50 examples. [sent-235, score-0.112]

94 Gaussian process classification (GPC) maintains a Gaussian distribution over weight vectors (primal) or over regressor values (dual). [sent-237, score-0.195]

95 Our algorithm uses a different update criterion than the standard GPC Bayesian updates [15, Ch. [sent-238, score-0.141]

96 Bayes point machines [8] maintain a collection of weight vectors consistent with the training data, and use the single linear classifier which best represents the collection. [sent-240, score-0.204]

97 Conceptually, the collection is a non-parametric distribution over the weight vectors. [sent-241, score-0.112]

98 Its online version [7] maintains a finite number of weight-vectors which are updated simultaneously. [sent-242, score-0.093]

99 As in our work, the dual parameters are random variables distributed according to a diagonal Gaussian with example specific variance. [sent-244, score-0.109]

100 We assume linear classifiers as experts and maintain a Gaussian distribution over their weight vectors. [sent-246, score-0.158]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vi', 0.418), ('cw', 0.396), ('xi', 0.299), ('mi', 0.249), ('perceptron', 0.218), ('sop', 0.173), ('dredze', 0.159), ('mistakes', 0.14), ('psd', 0.136), ('yi', 0.132), ('ui', 0.131), ('pa', 0.122), ('det', 0.118), ('weight', 0.112), ('covariance', 0.102), ('update', 0.102), ('dence', 0.089), ('margin', 0.086), ('constraint', 0.084), ('classi', 0.083), ('multiplier', 0.074), ('mistake', 0.073), ('crammer', 0.073), ('diagonal', 0.073), ('lagrange', 0.071), ('koby', 0.068), ('con', 0.065), ('nlp', 0.064), ('diag', 0.06), ('var', 0.059), ('ellipsoid', 0.057), ('gpc', 0.057), ('stdev', 0.057), ('online', 0.056), ('kkt', 0.053), ('std', 0.053), ('round', 0.051), ('er', 0.051), ('tr', 0.05), ('ellipsoids', 0.05), ('hw', 0.05), ('bound', 0.048), ('coordinates', 0.047), ('ai', 0.047), ('vectors', 0.046), ('equality', 0.046), ('invariant', 0.046), ('maintain', 0.046), ('secondorder', 0.045), ('woodbury', 0.045), ('shivaswamy', 0.045), ('convex', 0.045), ('root', 0.045), ('lemma', 0.044), ('rd', 0.043), ('parentheses', 0.042), ('ers', 0.042), ('convexity', 0.041), ('weights', 0.041), ('cumulative', 0.041), ('kivinen', 0.04), ('uncertainty', 0.04), ('updates', 0.039), ('quantities', 0.039), ('initialize', 0.039), ('gaussian', 0.038), ('synthetic', 0.038), ('deviation', 0.037), ('maintains', 0.037), ('axis', 0.037), ('hyperplane', 0.037), ('fernando', 0.037), ('dual', 0.036), ('mark', 0.035), ('lagrangian', 0.034), ('prediction', 0.034), ('full', 0.034), ('herbrich', 0.033), ('satis', 0.032), ('yielding', 0.032), ('writing', 0.032), ('quantity', 0.032), ('parallel', 0.032), ('bayes', 0.031), ('holds', 0.031), ('initialization', 0.031), ('optimization', 0.03), ('unseen', 0.03), ('improvements', 0.029), ('execution', 0.028), ('xp', 0.028), ('conditions', 0.028), ('dotted', 0.028), ('matrix', 0.027), ('rule', 0.027), ('seven', 0.026), ('imply', 0.026), ('instance', 0.026), ('batch', 0.026), ('prede', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 78 nips-2008-Exact Convex Confidence-Weighted Learning

Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1

2 0.15380456 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

Author: Ofer Dekel

Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1

3 0.12939689 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

4 0.11084212 178 nips-2008-Performance analysis for L\ 2 kernel classification

Author: Jooseuk Kim, Clayton Scott

Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1

5 0.10999398 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

Author: Chunhua Shen, Alan Welsh, Lei Wang

Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1

6 0.10543729 171 nips-2008-Online Prediction on Large Diameter Graphs

7 0.10499521 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

8 0.096682474 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

9 0.09604536 194 nips-2008-Regularized Learning with Networks of Features

10 0.092633598 226 nips-2008-Supervised Dictionary Learning

11 0.091632351 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables

12 0.091593057 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

13 0.089316294 224 nips-2008-Structured ranking learning using cumulative distribution networks

14 0.089114897 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

15 0.08732418 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

16 0.085767828 214 nips-2008-Sparse Online Learning via Truncated Gradient

17 0.085327275 228 nips-2008-Support Vector Machines with a Reject Option

18 0.084886529 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

19 0.084790863 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images

20 0.083727196 84 nips-2008-Fast Prediction on a Tree


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.266), (1, -0.05), (2, -0.115), (3, 0.077), (4, -0.035), (5, -0.003), (6, 0.034), (7, -0.064), (8, 0.003), (9, 0.012), (10, 0.089), (11, 0.122), (12, -0.009), (13, -0.003), (14, -0.01), (15, -0.069), (16, 0.051), (17, -0.003), (18, -0.053), (19, 0.091), (20, 0.069), (21, -0.113), (22, -0.039), (23, 0.13), (24, -0.024), (25, 0.006), (26, 0.038), (27, -0.147), (28, 0.054), (29, -0.078), (30, 0.019), (31, -0.07), (32, 0.06), (33, -0.039), (34, -0.025), (35, -0.018), (36, -0.057), (37, 0.039), (38, -0.098), (39, -0.124), (40, -0.021), (41, 0.047), (42, 0.056), (43, 0.08), (44, -0.081), (45, 0.044), (46, 0.117), (47, -0.135), (48, -0.107), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9579556 78 nips-2008-Exact Convex Confidence-Weighted Learning

Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1

2 0.73461777 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

3 0.70225185 178 nips-2008-Performance analysis for L\ 2 kernel classification

Author: Jooseuk Kim, Clayton Scott

Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1

4 0.64561075 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

Author: Ofer Dekel

Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1

5 0.61832565 185 nips-2008-Privacy-preserving logistic regression

Author: Kamalika Chaudhuri, Claire Monteleoni

Abstract: This paper addresses the important tradeoff between privacy and learnability, when designing algorithms for learning from private databases. We focus on privacy-preserving logistic regression. First we apply an idea of Dwork et al. [6] to design a privacy-preserving logistic regression algorithm. This involves bounding the sensitivity of regularized logistic regression, and perturbing the learned classifier with noise proportional to the sensitivity. We then provide a privacy-preserving regularized logistic regression algorithm based on a new privacy-preserving technique: solving a perturbed optimization problem. We prove that our algorithm preserves privacy in the model due to [6]. We provide learning guarantees for both algorithms, which are tighter for our new algorithm, in cases in which one would typically apply logistic regression. Experiments demonstrate improved learning performance of our method, versus the sensitivity method. Our privacy-preserving technique does not depend on the sensitivity of the function, and extends easily to a class of convex loss functions. Our work also reveals an interesting connection between regularization and privacy. 1

6 0.60918099 228 nips-2008-Support Vector Machines with a Reject Option

7 0.55643147 226 nips-2008-Supervised Dictionary Learning

8 0.55206978 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

9 0.53396195 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

10 0.5008601 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

11 0.50044382 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

12 0.49615976 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

13 0.48833829 126 nips-2008-Localized Sliced Inverse Regression

14 0.47063878 27 nips-2008-Artificial Olfactory Brain for Mixture Identification

15 0.46899474 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

16 0.46603447 239 nips-2008-Tighter Bounds for Structured Estimation

17 0.46077573 214 nips-2008-Sparse Online Learning via Truncated Gradient

18 0.46076787 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

19 0.45105314 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

20 0.44977441 72 nips-2008-Empirical performance maximization for linear rank statistics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.09), (7, 0.068), (12, 0.026), (15, 0.013), (28, 0.152), (57, 0.049), (63, 0.377), (71, 0.02), (77, 0.066), (83, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90844178 3 nips-2008-A Massively Parallel Digital Learning Processor

Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic

Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. 
Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve, and so far required application specific IC (ASIC). Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. With the goal of analyzing different architectures we designed a development and testing environment where the parallel computation is mapped onto FPGA’s. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to lowpower FPGA or chip designs are done in a later stage. 2 Al gori th ms - A ri th meti c - A rch i te ctu re For a substantial improvement over a general purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design, but algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. 2.1 Algorithms Characteristic for machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. Using MapReduce, speedups linear with the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms were described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly, and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient since data access patterns vary greatly between algorithms. 
We analyze this here in more detail for SVM and CNN. These algorithms were chosen, because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. The characteristics of the SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation, only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. Challenging is that for each kernel computation a new data vector has to be loaded 4 into the processor, leading to very high I/O requirements. We consider here dimensions of 10 - 10 5 7 and numbers of training data of 10 - 10 , resulting easily in Gigabytes that need to be transferred to the processors at each iteration. 1 2 3 4 5 6 Operation Initialize all αx, Gx Do Find working set αi, αj Update αi, αj Get 2 columns of kernel matrix Update gradients Gx While not converged Computation 2n IO 2n Unit CPU I I I I I * 2n I*2 I * (2d+2dn) I*n CPU CPU FPGA CPU * * * * 2n 10 2nd n Table 1: Compute- and IO-requirements of each step for SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels). Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiple of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant 2 difference to between the SVM and CNN is the Compute/IO ratio: SVM: ~ 1; CNN: ~ L*k > 100. Therefore the requirements for these two algorithms are very different, and handling both cases efficiently is quite a challenge for an architecture design. Operation Load L kernels For all input pixels Shift in new pixel Multiply kernels Shift out result 1 2 3 4 Computation IO 2 L* k n* m 2 n*m*L*k n*m Unit FPGA FPGA FPGA FPGA FPGA Table 2: Compute- and IO-requirements for CNN computation (forward pass), where l kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative for implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects not shown. 2.2 Arithmetic Hardware can be built much more compactly and runs with lower power dissipation, if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms. 
Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important, not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed with a high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at a low resolution. Key is that the hardware is flexible enough and can take advantage of reduced resolution while handling high resolution where necessary. Problem Adult Forest MNIST NORB Kernel: Float Obj. f. # SV 31,930.77 11,486 653,170.7 49,333 4,960.13 6,172 1,243.71 3,077 F-score 77.58 98.29 99.12 93.34 Kernel: 16 bit fixed point Obj. f. # SV F-score 31,930.1 11,490 77.63 652,758 49,299 98.28 4,959.64 6,166 99.11 1,244.76 3,154 93.26 F-sc. (4b in) NA NA 99.11 92.78 Table 3: Comparison of the results of SVM training when the kernels are represented with floating point numbers (32 or 64 bits) (left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd–even); NORB: n=48,560, d=5,184. We developed a simulator that allows running the training algorithms with various resolutions in each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to 16 bit fixed point representations does not affect the accuracy for any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done in low resolutions (4–16 bit in the factors), but the accumulator needs sufficient resolution to avoid over/under flow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolutions (<16 bit), but the weights must be maintained at a high resolution (float) (see e.g. [12]). In our own evaluations 24 bits in the weights tend to be sufficient. Once the network is trained, for the classification low resolutions can be used for the weights as well (<16 bit). 2.3 A rc h i t e c t u re Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA 128 VPE are configured into four SIMD groups; L-S: Load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU. 
Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited for vector processing, and simple vector processing elements (VPE) with fixed-point arithmetic can handle the operations. Since typically large blocks of data are processed with the same operation, groups of VPE can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the highvolume, low precision parts onto the vector accelerators and parts requiring high precision arithmetic onto the CPU. The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports for each group of 32 VPE. By connecting multiple of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPE. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local between a group of VPE and its own memory block. Avoiding movements of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data and the vector dimensionality (see below). A few hundred VPE per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections. 3 I mp l e men tati on of th e P 3 A rch i t ectu re This architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processors (DSP) units and over 1,000 IO pins for data transfers. The boards used here contain each one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM with a total of 1GB, and 2 independent 4MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP with a maximum speed of 550MHz, which corresponds to a theoretical compute-performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230MHz or less for our designs). Nevertheless, we obtain high performances because we can use a large number of DSP units for executing the main computation. The main architecture features are: • Parallel processing (on one chip): 128 VPE (hardware DSP) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set. • Custom Precision: Data are represented with 1 to 16 bit resolution. 
Higher resolutions are possible by operating multiple DSP as one processor. • Overlapping Computation and Communication: CPU-FPGA communication is overlapped with the FPGA computation. • Overlap Memory Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. • High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12GB/s per chip). • • Streaming Data Flow, Simple Access Patterns: Load/store units are tailored for streaming input and output data, and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip. Load/store with (de)compression: For an increase of effective IO bandwidth the load/store units provide compression and decompression in hardware. Figure 2 shows the configuration of the VPEs for vector dot product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst-mode, achieving a throughput of close to one memory word per cycle. But the speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, then the speed becomes compute-bound. Figure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6, alternatively receive results or stream them back to the host. The right side shows how a group of VPE is pipelined to improve clock speed. The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, where operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks, for more efficiency and lower power dissipation. The VPE are daisy-chained and operate as systolic array. In this way we can take advantage of the high computation to IO ratio (Table 2) to reduce the data transfers from memory. 4 E val u ati on s We evaluated SVM training and classification with the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative for applications in image and video analysis. The computation is split between CPU and FPGA as indicated by Table 1. The DDR2 memory banks are clocked at 230MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth. On the FPGA they are decompressed first and distributed to the VPE. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPE faster than 115MHz does not improve performance. A VPE executes a multiplication plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute-rate is lower, about 9.4 GMACS, due to overhead (see Table 4). 
The computation on the host CPU overlaps with that on the FPGA, and has no effect on the speed in the experiments shown here. For the classification the VPE can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in speed that is more than four times higher and a sustained 43.0 GMACS limited by the number and speed of the VPE. Adding a second FPGA card doubles the speed, showing little saturation effects yet, but for more FPGA per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical. # 60k 2M Iterations 8,000 266,900 CPU time 754s -- speed 0.5 -- CPU+MMX time speed 240 s 1.57 531,534 s 1.58 CPU+FPGA time speed 40 s 9.42 88,589 s 9.48 CPU+2 FPGA time speed 21 s 17.9 48,723 s 17.2 Table 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2GHz; CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (just kernel computations). Training algorithm: SMO with second order working set selection. Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10] different bounds for stopping were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because based on the number of iterations we can compute the average GMACS for the kernel computations. As can be seen in Table 5 a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPE. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPE, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped on the FPGA as well and will increase the power dissipation only minimally. Number Clock Operand Power Average of cores speed type dissipation compute speed CPU (Opteron) 1 2.2 GHz float 40 W 0.5 GMACS GPU (from [10]) 128 1.35 GHz float 80 W 7.4 GMACS Cluster (from [11]) 384 1.6 GHz byte > 1 kW 54 GMACS FPGA 128 0.12 GHz 4 bit nibble 9W 9.4 GMACS Table 5: Comparison of performances for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual core CPU (Athlon), 384 MMX units. The GPU was training with 60k samples ([10], table 2, second order), the cluster trained with 2 million samples. Processor Figure 3: Acceleration of SVM training as a function of the number of VPE. MNIST n: 2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPE are experimental, the higher ones are simulations. Curves MNIST, NORB: Multiple FPGA are attached to one CPU. Curve MNIST C: Each FPGA is attached to a separate host CPU. Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because then the CPU and FPGA computation times become comparable. 
For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable for more than 4 FPGAs. Alternatively, a system can be scaled by grouping multiple CPUs, each with one attached FPGA accelerator. The scaling then follows a linear or even super-linear acceleration (curve MNIST C) up to several thousand VPEs. If the CPUs work in a cluster arrangement, the scaling is similar to the one described in [11].

For convolutional neural networks, the architecture of Figure 2 is modified so that a block of VPEs operates as a systolic array. In this way convolutions can be implemented with minimal data movement. In addition to the convolution, sub-sampling, the non-linear functions, and the logistics to handle multiple layers with arbitrary numbers of kernels in each layer are also done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPEs. Clocked at 115 MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead, the sustained speed is about 10 GMACS.

5 Conclusions

By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications.

References

[1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781.
[2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation. In Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida.
[3] Neil, P. (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88.
[4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture. In Proc. 7th KES 2003, Oxford, pp. 169-176.
[5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027-3030.
[6] Pedersen, R. & Schoeberl, M. (2006) An Embedded Support Vector Machine. WISE 2006.
[7] Dey, S., Kedia, M., Agarwal, N. & Basu, A. Embedded Support Vector Machine: Architectural Enhancements and Evaluation. In Proc. 20th Int. Conf. VLSI Design.
[8] Anguita, D., Boni, A. & Ridella, S. (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation. IEEE Trans. Neural Networks, 14/5, pp. 993-1009.
[9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems 19, MIT Press.
[10] Catanzaro, B., Sundaram, N. & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors. In Proc. 25th Int. Conf. Machine Learning, pp. 104-111.
[11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press.
[12] Simard, P. & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. Alspector (eds.), Neural Information Processing Systems 6, pp. 232-239, Morgan Kaufmann.
[13] Savich, A., Moussa, M. & Areibi, S. (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study. IEEE Trans. Neural Networks, 18/1, pp. 240-252.

2 0.89789551 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

Author: Peter Carbonetto, Mark Schmidt, Nando D. Freitas

Abstract: The stochastic approximation method is behind the solution to many important, actively studied problems in machine learning. Despite its far-reaching application, there is almost no work on applying stochastic approximation to learning problems with general constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization. 1

3 0.89246523 212 nips-2008-Skill Characterization Based on Betweenness

Author: Ozgur Simsek, Andre S. Barreto

Abstract: We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms. Our characterization may be used directly to form a set of skills suitable for a given task. More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. 1

same-paper 4 0.83998495 78 nips-2008-Exact Convex Confidence-Weighted Learning

5 0.57483345 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a Laplacian (L1) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

6 0.57094193 75 nips-2008-Estimating vector fields using sparse basis field expansions

7 0.56794447 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

8 0.56767333 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

9 0.56735188 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

10 0.56628436 196 nips-2008-Relative Margin Machines

11 0.56604534 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms

12 0.5554083 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF

13 0.55521733 202 nips-2008-Robust Regression and Lasso

14 0.54750097 85 nips-2008-Fast Rates for Regularized Objectives

15 0.5473063 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

16 0.54591244 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs

17 0.5449931 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

18 0.54342425 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

19 0.54030108 194 nips-2008-Regularized Learning with Networks of Features

20 0.53955555 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging