nips nips2005 nips2005-50 knowledge-graph by maker-knowledge-mining

50 nips-2005-Convex Neural Networks


Source: pdf

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. [sent-7, score-0.776]

2 This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. [sent-8, score-0.593]

3 1 Introduction The objective of this paper is not to present yet another learning algorithm, but rather to point to a previously unnoticed relation between multi-layer neural networks (NNs), Boosting (Freund and Schapire, 1997) and convex optimization. [sent-9, score-0.261]

4 Its main contributions concern the mathematical analysis of an algorithm that is similar to previously proposed incremental NNs, with L1 regularization on the output weights. [sent-10, score-0.246]

5 This analysis helps to understand the underlying convex optimization problem that one is trying to solve. [sent-11, score-0.326]

6 This paper was motivated by the unproven conjecture (based on anecdotal experience) that when the number of hidden units is “large”, the resulting average error is rather insensitive to the random initialization of the NN parameters. [sent-12, score-0.456]

7 When the number of hidden units is large, it seems implausible for none of them to offer a descent direction. [sent-14, score-0.468]

8 More specifically, if the regularization is the L1 norm of the output layer weights, then we show that a “reasonable” solution exists, involving a finite number of hidden units (no more than the number of examples, and in practice typically much less). [sent-16, score-0.743]

9 We present a theoretical algorithm that is reminiscent of Column Generation (Chvátal, 1983), in which hidden neurons are inserted one at a time. [sent-17, score-0.303]

10 What we call “Neural Network” (NN) here is a predictor for supervised learning of the form ŷ(x) = Σi=1..m wi hi(x), where x is an input vector and hi(x) is obtained from a linear discriminant function hi(x) = s(vi · x), with s a fixed non-linearity such as the sign or tanh function. [sent-21, score-0.644]
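
For concreteness, here is a minimal numpy sketch of a predictor of this form (the function names and the default choice of s are ours, purely for illustration, not the authors' code):

```python
import numpy as np

def nn_predict(x, w, V, s=np.tanh):
    """y_hat(x) = sum_i w_i * h_i(x) with h_i(x) = s(v_i . x).

    w : (m,) output weights; V : (m, d) matrix whose rows are the hidden
    weight vectors v_i; s : elementwise non-linearity (e.g. np.sign or np.tanh).
    """
    h = s(V @ x)      # hidden unit activations h_i(x)
    return w @ h      # weighted sum of hidden units

# toy usage
rng = np.random.default_rng(0)
x, V, w = rng.normal(size=3), rng.normal(size=(5, 3)), rng.normal(size=5)
print(nn_predict(x, w, V))
```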

11 Note that in this formulation, an output non-linearity can still be used, by inserting it in the loss function Q. [sent-29, score-0.296]

12 Examples of such loss functions are the quadratic loss ||ŷ − y||², the hinge loss max(0, 1 − y ŷ) (used in SVMs), the cross-entropy loss −y log ŷ − (1 − y) log(1 − ŷ) (used in logistic regression), and the exponential loss e^(−y ŷ) (used in Boosting). [sent-30, score-0.705]
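
For reference, these four losses could be written as follows (a sketch with our own naming; y is assumed to lie in {−1, +1} for the hinge and exponential losses and in {0, 1} for the cross-entropy):

```python
import numpy as np

def quadratic_loss(y_hat, y):           # squared-error regression
    return (y_hat - y) ** 2

def hinge_loss(y_hat, y):               # y in {-1, +1}; used in SVMs
    return np.maximum(0.0, 1.0 - y * y_hat)

def cross_entropy_loss(y_hat, y):       # y in {0, 1}, y_hat in (0, 1); logistic regression
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)

def exponential_loss(y_hat, y):         # y in {-1, +1}; used in Boosting
    return np.exp(-y * y_hat)
```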

13 Boosting-type algorithms look for a base learner hm+1 that is best correlated with the gradient of the average loss on the ŷ(xi) (that would be the residue ŷ(xi) − yi in the case of the square loss). [sent-37, score-0.199]

14 However, we follow a “stepwise”, less greedy, approach, in which all the output weights are optimized at each step, in order to obtain convergence guarantees. [sent-39, score-0.261]

15 2 Core Ideas Informally, consider the set H of all possible hidden unit functions. [sent-44, score-0.322]

16 That is, the set of all possible hidden unit weight vectors vi. [sent-46, score-0.457]

17 Imagine a NN that has all the elements in this set as hidden units. [sent-47, score-0.266]

18 We might want to impose precision limitations on those weights to obtain either a countable or even a finite set. [sent-48, score-0.195]

19 This can be achieved by using a regularization penalty on the output weights that yields sparse solutions, such as the L1 penalty. [sent-51, score-0.395]

20 The only problem is that there are as many variables in this convex program as there are elements in the set H, which may be very large (possibly infinite). [sent-53, score-0.293]

21 However, we find that with L1 regularization, a finite solution is obtained, and that such a solution can be obtained by greedily inserting one hidden unit at a time. [sent-54, score-0.522]

22 Furthermore, it is theoretically possible to check that the global optimum has been reached. [sent-55, score-0.195]

23 An element of W can be understood as the output weights vector in a neural network. [sent-61, score-0.261]

24 Let h(x) : H → R be the function that maps any element hi of H to hi(x). [sent-62, score-0.368]

25 h(x) can be understood as the vector of activations of hidden units when input x is observed. [sent-63, score-0.415]

26 Let Q : R × R → R be a cost function convex in its first argument that takes a scalar prediction ŷ(x) and a scalar target value y and returns a scalar cost. [sent-66, score-0.458]

27 Let Ω : W → R be a convex regularization functional that penalizes for the choice of more “complex” parameters (e.g., Ω(w) = λ||w||1, the L1 norm of the output weights). [sent-69, score-0.343]

28 We define the convex NN criterion C(H, Q, Ω, D, w) with parameter w as follows: C(H, Q, Ω, D, w) = Ω(w) + Σt=1..n Q(w · h(xt), yt). [sent-72, score-0.46]
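
A sketch of how C could be evaluated for a finite set of hidden units, taking Ω to be the L1 penalty λ||w||1 (an assumption consistent with the rest of the paper; the function names are ours):

```python
import numpy as np

def convex_nn_criterion(w, H, y, Q, lam=1.0):
    """C(H, Q, Omega, D, w) = Omega(w) + sum_t Q(w . h(x_t), y_t).

    H : (n, m) matrix whose row t is the activation vector h(x_t) of the m
        hidden units under consideration; w : (m,) output weights.
    Omega(w) is taken to be lam * ||w||_1.
    """
    y_hat = H @ w
    data_fit = sum(Q(p, t) for p, t in zip(y_hat, y))
    return lam * np.abs(w).sum() + data_fit

# example with the quadratic loss
Q = lambda p, t: (p - t) ** 2
H = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
print(convex_nn_criterion(np.array([0.3, -0.2]), H, y, Q))
```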

29 The convex NN cost C(H, Q, Ω, D, w) is a convex function of w. [sent-76, score-0.599]

30 Q(w · h(xt ), yt ) is convex in w and Ω is convex in w, by the above construction. [sent-78, score-0.675]

31 C is additive in Q(w · h(xt ), yt ) and additive in Ω. [sent-79, score-0.153]

32 Note that there are no constraints in this convex optimization program, so that at the global minimum all the partial derivatives of C with respect to elements of w cancel. [sent-81, score-0.445]

33 Theorem 2 says that training NNs from a very large class (with one or more hidden layers) can be seen as convex optimization problems, usually in a very high dimensional space, as long as we allow the number of hidden units to be selected by the learning algorithm. [sent-85, score-1.042]

34 By choosing a regularizer that promotes sparse solutions, we obtain a solution that has a finite number of “active” hidden units (non-zero entries in the output weights vector w). [sent-86, score-0.744]

35 However, even if the solution involves a finite number of active hidden units, the convex optimization problem could still be computationally intractable because of the large number of variables involved. [sent-89, score-0.709]

36 That is, add one hidden unit at a time in an incremental fashion. [sent-92, score-0.361]

37 The important ingredient here is a way to know that we have reached the global optimum, without requiring us to actually visit all the possible hidden units. [sent-93, score-0.382]

38 We show that this can be achieved as long as we can solve the sub-problem of finding a linear classifier that minimizes the weighted sum of classification errors. [sent-94, score-0.158]

39 This can be done exactly only on low dimensional data sets but can be well approached using weighted linear SVMs, weighted logistic regression, or Perceptron-type algorithms. [sent-95, score-0.245]

40 The solution obtained in this way is at least as good as the solution of a more restricted convex optimization problem. [sent-99, score-0.394]

41 The second minimization can be performed with a local descent algorithm, without needing to guarantee that the global optimum will be found. [sent-100, score-0.326]

42 The training criterion is C(w) = K||w||1 + Σt=1..n max(0, 1 − yt w · h(xt)). [sent-102, score-0.234]

43 Let us rewrite this cost function as the constrained optimization problem: minimize over (w, ξ) the objective L(w, ξ) = K||w||1 + Σt=1..n ξt s.t. yt[w · h(xt)] ≥ 1 − ξt and ξt ≥ 0 for all t. [sent-103, score-0.295]
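
For a fixed, finite set of hidden units this constrained problem is a linear program and could be handed to an off-the-shelf convex solver; the sketch below uses CVXPY, which is our choice of tool, not the paper's:

```python
import numpy as np
import cvxpy as cp

def solve_hinge_l1(H, y, K=1.0):
    """min_{w, xi}  K * ||w||_1 + sum_t xi_t
       s.t.         y_t [w . h(x_t)] >= 1 - xi_t,   xi_t >= 0.

    H : (n, m) activations h(x_t) of a fixed finite set of m hidden units.
    """
    n, m = H.shape
    w = cp.Variable(m)
    xi = cp.Variable(n, nonneg=True)
    constraints = [cp.multiply(y, H @ w) >= 1 - xi]
    objective = cp.Minimize(K * cp.norm1(w) + cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value

# toy usage
H = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
print(solve_hinge_l1(H, y))
```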

44 If the number of hidden units is uncountable, then I is a closed bounded interval of R. [sent-122, score-0.415]

45 Therefore, the associated primal problem is the minimization of the cost function of a NN with n + 1 hidden neurons. [sent-138, score-0.421]

46 4 Incremental Convex NN Algorithm In this section we present a stepwise algorithm to optimize a NN, and show that there is a criterion that allows us to verify whether the global optimum has been reached. [sent-139, score-0.325]

47 Input: a training set D = {(x1, y1), . . . , (xn, yn)}, a convex loss function Q, and a scalar regularization penalty λ. [sent-144, score-0.542]

48 s is either the sign function or the tanh function. [sent-145, score-0.175]

49 (2) Set v1 = (0, . . . , 0, 1) and select w1 = argminw1 Σt Q(w1 s(1), yt) + λ|w1|. [sent-149, score-0.153]

50 (3) Repeat: (4) Let qt = Q′(Σj=1..i−1 wj hj(xt), yt). (5) If s = sign: (5a) train a linear classifier h̃i(x) = sign(vi · x) with examples {(xt, sign(qt))} and errors weighted by |qt|, t = 1 . . . n. [sent-151, score-1.1]

51 Equivalently, maximize Σt qt h̃i(xt). (5b) Else (s = tanh): (5c) train a linear classifier h̃i(x) = tanh(vi · x) to maximize Σt qt h̃i(xt). [sent-156, score-1.077]

52 Select the output weights and hidden unit parameters (w1, . . . , wi, v1, . . . , vi) minimizing (exactly or approximately) C = Σt Q(Σj=1..i wj hj(xt), yt) + λ Σj=1..i |wj|, such that ∂C/∂wj = 0 for j = 1 . . . i. [sent-164, score-0.523]

53 (8) Return the predictor ŷ(x) = Σj=1..i wj hj(x). [sent-168, score-0.296]
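
Putting the steps together, here is a rough Python reconstruction of the incremental loop for the quadratic loss with s = sign. It is only a sketch under several assumptions of ours: the sub-problem of step (5a) is solved approximately by smoothing sign into tanh and taking gradient steps, and the output weights are re-optimized by a generic proximal-gradient LASSO solver, not necessarily what the authors used.

```python
import numpy as np

def weighted_sign_classifier(X, q, n_iter=200, lr=0.1, seed=0):
    """Approximate step (5a): find v roughly maximizing sum_t q_t * sign(v . x_t),
    by taking gradient steps on the smoothed objective sum_t q_t * tanh(v . x_t)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    for _ in range(n_iter):
        z = np.tanh(X @ v)
        v += lr * (X.T @ (q * (1.0 - z ** 2)))   # gradient of sum_t q_t tanh(v . x_t)
    return v

def convex_nn(X, y, lam=1.0, max_units=10, lasso_iters=500):
    """Incremental ConvexNN-style loop for Q(p, y) = (p - y)^2 and s = sign."""
    n, d = X.shape
    V, H, w = [], np.zeros((n, 0)), np.zeros(0)
    for _ in range(max_units):
        q = 2.0 * (H @ w - y)                    # q_t = Q'(y_hat(x_t), y_t)
        v = weighted_sign_classifier(X, -q)      # new unit correlated with -gradient
        h = np.sign(X @ v)
        if np.abs(q @ h) <= lam:                 # no unit can improve C: stop
            break
        V.append(v)
        H = np.column_stack([H, h])
        # re-optimize all output weights: L1-penalized least squares solved by
        # proximal gradient descent (ISTA) with soft-thresholding
        w = np.append(w, 0.0)
        step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2 + 1e-12)
        for _ in range(lasso_iters):
            grad = 2.0 * H.T @ (H @ w - y)
            u = w - step * grad
            w = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)
    return np.array(V), w

# toy usage: noisy XOR-like data with an appended constant coordinate
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=(50, 2)), np.ones(50)])
y = np.sign(X[:, 0] * X[:, 1])
V, w = convex_nn(X, y, lam=0.5)
print(len(w), "hidden units selected")
```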

54 A key property of the above algorithm is that, at termination, the global optimum is reached. [sent-169, score-0.195]

55 That is, no hidden unit (linear classifier) can improve the objective. [sent-171, score-0.322]

56 In other words, it involves finding a classifier which minimizes the weighted cost Σt qt sign(v · xt). [sent-174, score-0.731]

57 Algorithm ConvexNN stops when it reaches the global optimum of C(w) = Σt Q(w · h(xt), yt) + λ||w||1. [sent-177, score-0.348]

58 Let w be the output weights vector when the algorithm stops. [sent-179, score-0.261]

59 Because the set of hidden units H we consider is such that when h is in H, −h is also in H, we can assume all weights to be non-negative. [sent-180, score-0.551]

60 By contradiction, if w′ ≠ w is the global optimum, with C(w′) < C(w), then, since C is convex in the output weights, for any ε ∈ (0, 1), we have C(εw′ + (1 − ε)w) ≤ εC(w′) + (1 − ε)C(w) < C(w). [sent-181, score-0.467]

61 Let us denote by Ip the set of strictly positive weights in w (and w′), by Iz the set of weights set to zero in w but to a non-zero value in w′, and by δk the difference w′k − wk in the weight of hidden unit hk between w and w′. [sent-184, score-0.708]

62 We can assume δj < 0 for j ∈ Iz, because instead of setting a small positive weight to hj, one can decrease the weight of −hj by the same amount, which will give either the same cost, or possibly a lower one when the weight of −hj is positive. [sent-185, score-0.367]

63 Σt qt hj(xt) < λ and, ∀j ∈ Iz, ε⁻¹ δj(−λ + Σt qt hj(xt)) > 0 (since (Mason et al. [sent-189, score-0.422]

64 Again, this requires solving as a sub-problem an exact minimization to find a function hi ∈ H that is maximally correlated with the gradient Q′ on the output. [sent-191, score-0.439]

65 Exact Minimization In step (5a) we are required to find a linear classifier that minimizes the weighted sum of classification errors. [sent-193, score-0.158]

66 However, an exact solution can be easily found in O(n^3) computations for d = 2 inputs. [sent-198, score-0.153]

67 Finding a linear classifier that minimizes the weighted sum of classification errors can be achieved in O(n^3) steps when the input dimension is d = 2. [sent-201, score-0.158]

68 When u varies smoothly on the unit circle, as the dot product is a continuous function of its arguments, the changes in the order of the dot products will occur only when there is a pair (i, j) such that u · xi = u · xj . [sent-213, score-0.177]
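
Below is a naive exhaustive version of this search for d = 2 (our own sketch, with an explicit threshold b added for clarity; equivalently one can append a constant input coordinate). It runs in roughly O(n^3 log n), so it is slower than the O(n^3) procedure described here, but it enumerates the same finite set of candidate classifiers.

```python
import numpy as np
from itertools import combinations

def exact_weighted_classifier_2d(X, q):
    """Exhaustively maximize |sum_t q_t * sign(u . x_t - b)| for 2-d inputs X.

    The ordering of the projections u . x_t only changes when u becomes
    perpendicular to some difference x_i - x_j, so it suffices to try one
    direction per pair (i, j) and, for each direction, every threshold b
    lying between consecutive sorted projections."""
    n = len(X)
    directions = [np.array([1.0, 0.0])]                 # fallback direction
    for i, j in combinations(range(n), 2):
        d = X[i] - X[j]
        if not np.allclose(d, 0.0):
            directions.append(np.array([-d[1], d[0]]) / np.linalg.norm(d))
    best, best_u, best_b = -np.inf, directions[0], 0.0
    for u in directions:
        p = X @ u
        order = np.argsort(p)
        # with k points strictly below the threshold (label -1), the objective
        # equals |sum(q) - 2 * (sum of q over those k points)|
        partial = np.concatenate(([0.0], np.cumsum(q[order])))
        scores = np.abs(q.sum() - 2.0 * partial)
        k = int(np.argmax(scores))
        if scores[k] > best:
            if k == 0:
                b = p[order[0]] - 1.0
            elif k == n:
                b = p[order[-1]] + 1.0
            else:
                b = 0.5 * (p[order[k - 1]] + p[order[k]])
            best, best_u, best_b = scores[k], u, b
    return best_u, best_b, best
```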

69 Therefore it is interesting to consider approximate schemes for obtaining a linear classifier with weighted costs. [sent-233, score-0.157]

70 Candidate approximate schemes include the weighted linear SVM (i.e., a linear classifier with hinge loss), the logistic regression classifier, and variants of the Perceptron algorithm. [sent-236, score-0.255]
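
One of these approximate options, sketched with scikit-learn (the wrapper is ours; it assumes LogisticRegression accepts per-example weights through fit's sample_weight argument and that both target classes appear in sign(q)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def approx_weighted_classifier(X, q):
    """Approximate sub-problem: fit a linear classifier on targets sign(q_t)
    with per-example weights |q_t| (a weighted logistic regression variant).

    fit_intercept=False matches the bias-free form h(x) = s(v . x); in practice
    a constant input coordinate can be appended instead.
    """
    targets = np.where(q >= 0, 1, -1)     # both classes must be present
    clf = LogisticRegression(fit_intercept=False)
    clf.fit(X, targets, sample_weight=np.abs(q))
    return clf.coef_.ravel()              # hidden unit weight vector v
```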

71 In that case, step (5c) of the algorithm is not an exact minimization, and one cannot guarantee that the global optimum will be reached. [sent-237, score-0.28]

72 However, it might be reasonable to believe that finding a linear classifier by minimizing a weighted hinge loss should yield solutions close to the exact minimization. [sent-238, score-0.428]

73 In our experiments we used a quadratic Q, for which the optimization of the output weights can be done with a neural network, using the outputs of the hidden layer as inputs. [sent-242, score-0.645]

74 Let us now consider a bit more carefully what it means to tune the vj's in step (7). [sent-243, score-0.17]

75 Indeed, changing the weight vector vj of a selected hidden neuron to decrease the cost is equivalent to a change in the output weights w’s. [sent-244, score-0.841]

76 More precisely, consider the step in which the value of vj becomes v′j. [sent-245, score-0.248]

77 This is equivalent to the following operation on the w's, when wj is the corresponding output weight value: the output weight associated with the value vj of a hidden neuron is set to 0, and the output weight associated with the value v′j of a hidden neuron is set to wj. [sent-246, score-1.542]

78 This corresponds to an exchange between two variables in the convex program. [sent-247, score-0.261]

79 The fact that we are simultaneously making such exchanges on all the hidden units when we tune the vj ’s allows us to move faster towards the global optimum. [sent-249, score-0.705]

80 Extension to multiple outputs The multiple outputs case is more involved than the single-output case because it is not enough to check the condition |Σt ht qt| > λ. [sent-250, score-0.322]

81 Consider a new hidden neuron whose output is hi when the input is xi . [sent-251, score-0.676]

82 Denote by α = [α1, . . . , αno] the vector of output weights between the new hidden neuron and the no output neurons. [sent-255, score-0.704]

83 The gradient with respect to αj is gj = ∂C/∂αj = Σt ht qtj − λ sign(αj), with qtj the value of the j-th output neuron with input xt. [sent-256, score-0.907]

84 This means that if, for a given j, we have |Σt ht qtj| < λ, moving αj away from 0 can only increase the cost. [sent-257, score-0.232]

85 Therefore, the right quantity to consider is (|Σt ht qtj| − λ)+. [sent-258, score-0.232]

86 We must therefore find argmaxv Σj (|Σt ht qtj| − λ)+². [sent-259, score-0.232]

87 As before, this sub-problem is not convex, but it is not as obvious how to approximate it by a convex problem. [sent-260, score-0.298]

88 The stopping criterion becomes: if there is no j such that |Σt ht qtj| > λ, then all weights must remain equal to 0 and a global minimum is reached. [sent-261, score-0.495]
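
A small sketch of this stopping test and of one natural way to score candidate hidden neurons in the multi-output case (the aggregation by a sum of squared positive parts is our reading of the criterion above, not a quote from the paper):

```python
import numpy as np

def passes_stopping_test(h, Qgrad, lam):
    """True if no output weight of this candidate neuron can leave 0 without
    increasing the cost, i.e. |sum_t h_t q_tj| <= lam for every output j.

    h : (n,) outputs h_t of the candidate hidden neuron; Qgrad : (n, n_o)
    matrix of the loss derivatives q_tj; lam : L1 penalty."""
    return bool(np.all(np.abs(h @ Qgrad) <= lam))

def candidate_gain(h, Qgrad, lam):
    """Score sum_j (|sum_t h_t q_tj| - lam)_+^2 used to compare candidates."""
    return float(np.sum(np.maximum(np.abs(h @ Qgrad) - lam, 0.0) ** 2))
```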

89 In these experiments, Q(w · h(xt), yt) = [w · h(xt) − yt]². [sent-263, score-0.306]

90 • Optimize the output weights using a convex optimizer. [sent-265, score-0.522]

91 • In case (b), tune both input and output weights by conjugate gradient descent on C and finally re-optimize the output weights using LASSO regression. [sent-266, score-0.713]

92 • Optionally, remove neurons whose output weight has been set to 0 (see the sketch after this list). [sent-267, score-0.223]
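
The last two steps could look roughly like this, with scikit-learn's Lasso standing in for the "LASSO regression" above (its objective scales the squared error by 1/(2n), so the penalty is only equivalent up to rescaling; the wrapper name is ours):

```python
import numpy as np
from sklearn.linear_model import Lasso

def reoptimize_and_prune(H, y, alpha=0.1, tol=1e-8):
    """Re-optimize the output weights on the hidden-layer activations H (n x m)
    with an L1 penalty, then drop neurons whose output weight is (numerically) 0.

    Returns the surviving weights and the indices of the surviving neurons."""
    lasso = Lasso(alpha=alpha, fit_intercept=False)
    lasso.fit(H, y)
    w = lasso.coef_
    keep = np.flatnonzero(np.abs(w) > tol)      # "active" hidden units
    return w[keep], keep
```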

93 A penalty of λ = 1 was nearly optimal for the exact algorithm whereas a smaller penalty further improved the test classification error of the approximate algorithm. [sent-275, score-0.226]

94 Besides, when running the approximate algorithm for a long time, it converges to a solution whose quadratic error is extremely close to the one of the exact algorithm. [sent-276, score-0.19]

95 5 Conclusion We have shown that training a NN can be seen as a convex optimization problem, and have analyzed an algorithm that can exactly or approximately solve this problem. [sent-277, score-0.361]

96 We have shown that the solution with the hinge loss involved a number of non-zero weights bounded by the number of examples, and much smaller in practice. [sent-278, score-0.427]

97 The above experimental results are in agreement with our initial conjecture: when there are many hidden units we are much less likely to stall in the optimization procedure, because there are many more ways to descend on the convex cost C(w). [sent-280, score-0.818]

98 The only risk of the joint optimization (i.e., of optimizing the w's and v's together) is simply the inability to find a new hidden unit weight vector that can improve the total cost (fit and regularization term) even if there exists one. [sent-284, score-0.542]

99 Note that as a side-effect of the results presented here, we have a simple way to train neural networks with hard-threshold hidden units, since increasing Σt Q′(ŷ(xt), yt) sign(vi · xt) can be achieved either exactly (at great price) or approximately (e.g. [sent-285, score-0.677]

100 by using a cross-entropy or hinge loss on the corresponding linear classifier). [sent-287, score-0.272]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hidden', 0.266), ('convex', 0.261), ('xt', 0.258), ('qt', 0.238), ('iz', 0.222), ('nn', 0.196), ('hj', 0.184), ('hi', 0.184), ('yt', 0.153), ('units', 0.149), ('qtj', 0.148), ('boosting', 0.142), ('weights', 0.136), ('er', 0.125), ('output', 0.125), ('vj', 0.124), ('hinge', 0.116), ('optimum', 0.114), ('wj', 0.112), ('sign', 0.109), ('loss', 0.107), ('nns', 0.099), ('classi', 0.099), ('gradient', 0.092), ('marcotte', 0.086), ('exact', 0.085), ('ht', 0.084), ('regularization', 0.082), ('global', 0.081), ('mason', 0.078), ('minimization', 0.078), ('cost', 0.077), ('friedman', 0.075), ('atal', 0.074), ('chv', 0.074), ('convexnn', 0.074), ('kortanek', 0.074), ('savard', 0.074), ('vi', 0.074), ('ip', 0.072), ('weighted', 0.071), ('solution', 0.068), ('tanh', 0.066), ('optimization', 0.065), ('hettich', 0.064), ('inserting', 0.064), ('weight', 0.061), ('optionally', 0.059), ('countable', 0.059), ('roux', 0.059), ('freund', 0.058), ('nite', 0.057), ('unit', 0.056), ('delalleau', 0.055), ('hm', 0.055), ('logistic', 0.054), ('layer', 0.053), ('descent', 0.053), ('wk', 0.053), ('penalty', 0.052), ('neuron', 0.052), ('schapire', 0.051), ('xi', 0.049), ('linear', 0.049), ('involves', 0.049), ('criterion', 0.046), ('tune', 0.046), ('ci', 0.045), ('verify', 0.045), ('wi', 0.043), ('demiriz', 0.043), ('conjecture', 0.041), ('scalar', 0.04), ('incremental', 0.039), ('wm', 0.039), ('exchanges', 0.039), ('stepwise', 0.039), ('uk', 0.039), ('minimizes', 0.038), ('generation', 0.038), ('constraints', 0.038), ('bengio', 0.037), ('neurons', 0.037), ('approximate', 0.037), ('bennett', 0.037), ('subproblem', 0.037), ('dot', 0.036), ('regression', 0.036), ('reached', 0.035), ('sort', 0.035), ('training', 0.035), ('theorem', 0.035), ('tsch', 0.034), ('programs', 0.034), ('rumelhart', 0.034), ('assertion', 0.034), ('learners', 0.034), ('toy', 0.034), ('le', 0.032), ('program', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

2 0.18978633 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis

Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1

3 0.15339397 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

Author: Purnamrita Sarkar, Andrew W. Moore

Abstract: This paper explores two aspects of social network modeling. First, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. Second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. The generalized model associates each entity with a point in p-dimensional Euclidian latent space. The points can move as time progresses but large moves in latent space are improbable. Observed links between entities are more likely if the entities are close in latent space. We show how to make such a model tractable (subquadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is O(log n). We use both synthetic and real-world data on upto 11,000 entities which indicate linear scaling in computation time and improved performance over four alternative approaches. We also illustrate the system operating on twelve years of NIPS co-publication data. We present a detailed version of this work in [1]. 1

4 0.15182064 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

Author: Ricky Der, Daniel D. Lee

Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.

5 0.13923551 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

6 0.12688863 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

7 0.12647416 58 nips-2005-Divergences, surrogate loss functions and experimental design

8 0.12217209 47 nips-2005-Consistency of one-class SVM and related algorithms

9 0.12196425 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

10 0.1197136 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

11 0.11946099 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

12 0.11411314 126 nips-2005-Metric Learning by Collapsing Classes

13 0.1140236 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

14 0.11125892 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

15 0.10975675 45 nips-2005-Conditional Visual Tracking in Kernel Space

16 0.1088976 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

17 0.10794453 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

18 0.10639264 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

19 0.10580397 95 nips-2005-Improved risk tail bounds for on-line algorithms

20 0.10353869 80 nips-2005-Gaussian Process Dynamical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.326), (1, 0.083), (2, -0.054), (3, -0.149), (4, 0.144), (5, 0.067), (6, 0.067), (7, -0.048), (8, -0.026), (9, -0.035), (10, -0.051), (11, 0.109), (12, -0.289), (13, -0.054), (14, 0.127), (15, 0.033), (16, 0.012), (17, 0.123), (18, 0.002), (19, -0.07), (20, -0.116), (21, 0.134), (22, -0.122), (23, -0.039), (24, -0.146), (25, -0.008), (26, 0.012), (27, 0.035), (28, 0.024), (29, -0.006), (30, 0.14), (31, 0.024), (32, -0.041), (33, -0.001), (34, -0.004), (35, -0.056), (36, 0.04), (37, -0.099), (38, -0.025), (39, -0.034), (40, 0.119), (41, 0.1), (42, 0.034), (43, -0.021), (44, -0.002), (45, 0.072), (46, 0.049), (47, -0.053), (48, -0.028), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97064084 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

2 0.62036359 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis

Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1

3 0.56029165 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

Author: Alan L. Yuille

Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1

4 0.53089565 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

Author: Larry Wasserman, John D. Lafferty

Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1

5 0.52597702 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

Author: Lacey Gunter, Ji Zhu

Abstract: In this paper we derive an algorithm that computes the entire solution path of the support vector regression, with essentially the same computational cost as fitting one SVR model. We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter. 1

6 0.51238424 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

7 0.50455487 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

8 0.48185858 58 nips-2005-Divergences, surrogate loss functions and experimental design

9 0.47842434 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

10 0.47576669 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

11 0.45468807 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

12 0.45043141 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

13 0.44301012 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers

14 0.42547661 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

15 0.42442119 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

16 0.42070431 195 nips-2005-Transfer learning for text classification

17 0.41375121 47 nips-2005-Consistency of one-class SVM and related algorithms

18 0.40758353 126 nips-2005-Metric Learning by Collapsing Classes

19 0.40489623 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

20 0.39907297 184 nips-2005-Structured Prediction via the Extragradient Method


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.092), (10, 0.042), (27, 0.016), (28, 0.138), (31, 0.055), (34, 0.145), (39, 0.027), (41, 0.012), (50, 0.019), (55, 0.064), (60, 0.023), (65, 0.01), (69, 0.053), (73, 0.022), (88, 0.132), (91, 0.071)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92178929 50 nips-2005-Convex Neural Networks

Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1

2 0.84111655 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

3 0.8355177 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

4 0.83532047 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1

5 0.82982767 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1

6 0.82844549 74 nips-2005-Faster Rates in Regression via Active Learning

7 0.82739013 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

8 0.82633299 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

9 0.82572198 114 nips-2005-Learning Rankings via Convex Hull Separation

10 0.82359874 78 nips-2005-From Weighted Classification to Policy Search

11 0.82312196 105 nips-2005-Large-Scale Multiclass Transduction

12 0.82231969 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

13 0.82224214 151 nips-2005-Pattern Recognition from One Example by Chopping

14 0.82179743 160 nips-2005-Query by Committee Made Real

15 0.82111794 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

16 0.82098585 23 nips-2005-An Application of Markov Random Fields to Range Sensing

17 0.8191331 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

18 0.81877321 58 nips-2005-Divergences, surrogate loss functions and experimental design

19 0.81629467 54 nips-2005-Data-Driven Online to Batch Conversions

20 0.81582505 195 nips-2005-Transfer learning for text classification