nips nips2012 nips2012-80 knowledge-graph by maker-knowledge-mining

80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme


Source: pdf

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive), is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, hence, there is no need to resort to any optimization package to implement it. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. [sent-4, score-0.77]

2 The work builds upon recent and elegant results on noncommutative concentration inequalities, i. [sent-5, score-0.056]

3 concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. [sent-7, score-0.06]

4 We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. [sent-8, score-0.119]

5 This learning algorithm, called COPA (for COnfusion Passive-Aggressive), is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, hence, there is no need to resort to any optimization package to implement it. [sent-9, score-0.066]

6 In doing so, we depart from the more widespread viewpoint of relying on the misclassification rate —the probability of misclassifying a point— as a performance measure for multiclass predictors. [sent-11, score-0.096]

7 First, the confusion matrix provides finer-grained information than the misclassification rate, as it allows one to identify precisely the types of errors made by a classifier. [sent-13, score-0.621]

8 Finally, there are many application domains such as medicine, bioinformatics, information retrieval, where the confusion matrix (or an estimate thereof) is precisely the object of interest for an expert who wants to assess the relevance of an automatic prediction procedure. [sent-15, score-0.642]

9 On the one hand, we provide a statistical analysis for the generalization ability of online learning algorithms producing predictors that aim at optimizing the confusion. [sent-18, score-0.11]

10 This requires us to introduce relevant statistical quantities that are taken advantage of via a concentration inequality for matrix martingales proposed by [8]. [sent-19, score-0.06]

11 Motivated by our statistical analysis, we propose an online learning algorithm from the family of passive-aggressive learning algorithms [2]: this algorithm is inherently designed to optimize the confusion matrix, and numerical simulations are provided that show it reaches its goal. [sent-20, score-0.795]

12 Section 2 formalizes our pursued objective of targeting a small confusion error. [sent-22, score-0.622]

13 Section 3 provides our results regarding the ability of online confusion-aware learning procedures to achieve a small confusion, together with the update equations for COPA, a new passive-aggressive learning procedure designed to control the confusion risk. [sent-23, score-1.296]

14 Section 4 reports numerical results that should be viewed as a proof of concept for the effectiveness of our algorithm. [sent-24, score-0.057]

15 1 Problem and Motivation. Notation. We focus on the problem of multiclass classification: the input space is denoted by X and the target space is Y = {1, . . . , Q}. [sent-26, score-0.076]

16 The training sequence Z = {Zt = (Xt, Yt)}_{t=1}^T is made of T random pairs Zt = (Xt, Yt), drawn independently and identically according to some unknown distribution D over X × Y. [sent-30, score-0.067]

17 The sequence of input data will be referred to as X = {Xt}_{t=1}^T and the sequence of corresponding labels as Y = {Yt}_{t=1}^T. [sent-31, score-0.096]

18 We use D_{X|y} for the conditional distribution of X given that Y = y; therefore, for a given sequence y = (y1, . . . , yτ), . . . [sent-34, score-0.065]

19 For a sequence y of labels, T(y) = [T1(y) · · · TQ(y)] ∈ N^Q is such that Tq(y) is the number of times label q appears in y. [sent-42, score-0.048]

20 H ⊆ {f : X → R}^Q, and A designates an online learning algorithm that produces hypothesis ht ∈ H when it encounters a new example Zt. [sent-47, score-0.31]

21 Finally, ℓ = (ℓ_{q|p})_{1≤p,q≤Q} is a family of class-dependent loss functionals ℓ_{q|p} : H × X → R+. [sent-48, score-0.054]

22 For a point x ∈ X of class y ∈ Y, ℓ_{q|y}(h, x) is the cost of h's favoring class q over y for x. [sent-49, score-0.038]

23 The family of (cost-sensitive) misclassification losses ℓ^misclass is defined as follows. [sent-51, score-0.093]

24 ℓ^misclass_{q|y}(h, x) = χ_{[h(x)=q]} C_{yq}, where C_{pq} ∈ R+, ∀p, q ∈ Y, and χ_{[E]} = 1 if E is true and 0 otherwise. [sent-52, score-0.08]

25 The family of multiclass hinge losses ℓ^hinge is such that, given W = {w1, . . . [sent-54, score-0.271]

26 . . . , wQ} ∈ X^Q with Σ_q wq = 0 and the hypothesis hW such that hW(x) = [⟨w1, x⟩ · · · ⟨wQ, x⟩], the loss is ℓ^hinge_{q|y}(hW, x) = |⟨wq, x⟩ + 1/(Q−1)|_+. [sent-57, score-0.463]
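
To make the two loss families concrete, here is a small Python sketch (ours, not from the paper; the clipped-margin form of the hinge loss and all variable names are assumptions reconstructed from the text above) that evaluates ℓ^misclass_{q|y} and a hinge-type ℓ_{q|y} for a linear hypothesis hW.

```python
import numpy as np

def misclass_losses(h_pred, y, C):
    """Cost-sensitive misclassification losses l_{q|y}(h, x) = C[y, q] * 1[h(x) = q].
    h_pred: predicted class for a point x, y: its true class, C: (Q, Q) nonnegative cost matrix.
    Returns a length-Q vector, one entry per candidate class q."""
    Q = C.shape[0]
    losses = np.zeros(Q)
    losses[h_pred] = C[y, h_pred]
    return losses

def hinge_losses(W, x):
    """Hinge-type losses for a linear hypothesis h_W(x) = [<w_1,x> ... <w_Q,x>],
    taken here as |<w_q, x> + 1/(Q-1)|_+ (a reconstruction of the clipped margin above).
    W is a (Q, d) matrix of class weight vectors, assumed to sum to zero across classes."""
    Q = W.shape[0]
    return np.maximum(0.0, W @ x + 1.0 / (Q - 1))

# Tiny usage example with Q = 3 classes and d = 4 features.
rng = np.random.default_rng(0)
Q, d = 3, 4
W = rng.normal(size=(Q, d))
W -= W.mean(axis=0)                      # enforce the sum-to-zero constraint on the w_q
x, y = rng.normal(size=d), 1
C = np.ones((Q, Q)) - np.eye(Q)          # unit costs off the diagonal
print(misclass_losses(int(np.argmax(W @ x)), y, C))
print(hinge_losses(W, x))
```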

27 With class-imbalanced datasets, it is important not to measure the quality of a predictor h based upon its classification rate. [sent-63, score-0.043]

28 A more informative object is the confusion matrix C(h) ∈ R^{Q×Q} of h, which is defined entrywise by C_{pq}(h) = P_{X|p}(h(X) = q). [sent-66, score-0.622]

29 Let us now abuse the notation and denote C(h) the confusion matrix where the diagonal entries are zeroed. [sent-69, score-0.622]

30 This says that, with little additional information, the misclassification rate can be obtained from the confusion matrix, while the converse is not true. [sent-72, score-0.601]
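
The following numpy sketch (ours, for illustration; function and variable names are not from the paper) builds the zero-diagonal confusion matrix from predictions and shows that, together with the class proportions, it recovers the misclassification rate.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, Q):
    """Zero-diagonal confusion matrix: entry (p, q) estimates P(h(X) = q | Y = p) for q != p."""
    Cm = np.zeros((Q, Q))
    for p in range(Q):
        mask = (y_true == p)
        if mask.sum() == 0:
            continue
        for q in range(Q):
            if q != p:
                Cm[p, q] = np.mean(y_pred[mask] == q)
    return Cm

rng = np.random.default_rng(1)
Q, n = 3, 1000
y_true = rng.integers(0, Q, size=n)
y_pred = np.where(rng.random(n) < 0.8, y_true, rng.integers(0, Q, size=n))  # a noisy predictor

Cm = confusion_matrix(y_true, y_pred, Q)
priors = np.bincount(y_true, minlength=Q) / n
R_from_confusion = priors @ Cm.sum(axis=1)      # misclassification rate from the confusion matrix
R_direct = np.mean(y_true != y_pred)            # misclassification rate computed directly
print(R_from_confusion, R_direct)               # the two values coincide
```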

31 It is clear that having the confusion matrix C(h) be zero means that h is a perfect predictor. [sent-73, score-0.622]

32 When such a situation is not possible (if the data are corrupted by noise, for instance), a valuable objective might be to look for a classifier h having a confusion matrix as close to zero as possible. [sent-74, score-0.622]

33 Here, we choose to measure the closeness to zero of matrices through the operator norm ‖·‖, defined as: [sent-75, score-0.115]

34 ‖M‖ = max_{v≠0} ‖Mv‖₂ / ‖v‖₂, (5) where ‖·‖₂ is the standard Euclidean norm; ‖M‖ is merely the largest singular value of M. [sent-76, score-0.057]

35 In addition to being a valid norm, the operator norm has the following nice property, that will be of help to us. [sent-77, score-0.084]

36 Given two matrices A = (a_{ij}) and B = (b_{ij}) of the same dimensions, 0 ≤ a_{ij} ≤ b_{ij}, ∀i, j ⇒ ‖A‖ ≤ ‖B‖. [sent-78, score-0.077]

37 (6) Given the equivalence between norms and the definition of the operator norm, it is easy to see that R(h) ≤ Q‖C(h)‖, so targeting a small confusion matrix for h may have the consequence of implying a small misclassification risk R(h). [sent-79, score-0.805]
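
A quick numerical illustration (ours) of the operator norm as the largest singular value, of property (6), and of the inequality R(h) ≤ Q‖C(h)‖, here checked under uniform class priors:

```python
import numpy as np

def op_norm(M):
    """Operator norm of M, i.e. its largest singular value."""
    return np.linalg.norm(M, ord=2)

rng = np.random.default_rng(2)

# Property (6): for nonnegative matrices, 0 <= a_ij <= b_ij entrywise implies ||A|| <= ||B||.
A = rng.random((4, 4))
B = A + rng.random((4, 4))
print(op_norm(A) <= op_norm(B))                 # True

# R(h) <= Q * ||C(h)|| with C(h) the zero-diagonal confusion matrix; here we take uniform
# class priors, so the misclassification risk is R(h) = (1/Q) * sum_p sum_{q != p} C(h)[p, q].
Q = 5
C = rng.random((Q, Q)) * 0.1
np.fill_diagonal(C, 0.0)
R = C.sum() / Q
print(R, Q * op_norm(C), R <= Q * op_norm(C))   # the inequality holds
```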

38 The discussion conducted so far brings us to a natural goal of multiclass learning, that of minimizing the norm of the confusion matrix, i.e., min_{h∈H} ‖C(h)‖. [sent-80, score-0.715]

39 In order to deal with the latter problem and prepare the ground for tackling the former one from a theoretical point of view, we now introduce and discuss the confusion risk, which is parameterized by a family of loss functions that may be surrogates for the indicator loss χ[·]. [sent-85, score-0.689]

40 The confusion risk C_ℓ(h) of h is defined as C_ℓ(h) = (c_{pq})_{1≤p,q≤Q} [sent-87, score-0.743]

41 ∈ R^{Q×Q}, with c_{pq} = E_{X|p} ℓ_{q|p}(h, X) if p ≠ q, and 0 otherwise. [sent-88, score-0.08]

42 (7) Observe that if the family ℓ^misclass of losses from Example 1 is retained, with C_{pq} = 1 for all p, q, then C_{ℓ^misclass}(h) is precisely the confusion matrix discussed above (with the diagonal set to zero). [sent-89, score-0.895]
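
To make the confusion risk in (7) concrete, the sketch below (ours; the particular surrogate loss is only an example) estimates the entries c_pq = E_{X|p} ℓ_{q|p}(h, X) by class-conditional averaging; with the 0-1 loss and unit costs it reduces to the zero-diagonal confusion matrix.

```python
import numpy as np

def confusion_risk(loss_fn, X, y, Q):
    """Empirical confusion risk: entry (p, q) averages loss_fn(x, p, q) over the points x of
    class p (q != p), with the diagonal kept at zero; loss_fn plays the role of l_{q|p}(h, x)."""
    C = np.zeros((Q, Q))
    for p in range(Q):
        Xp = X[y == p]
        if len(Xp) == 0:
            continue
        for q in range(Q):
            if q != p:
                C[p, q] = np.mean([loss_fn(x, p, q) for x in Xp])
    return C

rng = np.random.default_rng(3)
Q, d, n = 3, 4, 300
W = rng.normal(size=(Q, d))                     # a fixed linear hypothesis h_W
X = rng.normal(size=(n, d))
y = rng.integers(0, Q, size=n)

zero_one = lambda x, p, q: float(np.argmax(W @ x) == q)                 # Example 1 with C_pq = 1
surrogate = lambda x, p, q: max(0.0, W[q] @ x + 1.0 / (Q - 1)) ** 2     # an illustrative surrogate

print(confusion_risk(zero_one, X, y, Q))    # equals the zero-diagonal confusion matrix of h_W
print(confusion_risk(surrogate, X, y, Q))   # confusion risk under the surrogate loss
```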

43 This says that if we resort to surrogates that upper bound the 0-1 indicator function, then the norm of the confusion risk is always larger than the norm of the confusion matrix. [sent-96, score-1.468]

44 Armed with the confusion risk and the last lemma, we may now turn towards the legitimate objective min_{h∈H} ‖C_ℓ(h)‖, a small value of ‖C_ℓ(h)‖ implying a small value of ‖C(h)‖ (which was our primary goal). [sent-97, score-0.681]

45 However, as already mentioned, it is possible to derive learning strategies based on empirical estimates of the confusion risk that ensure ‖C_ℓ(h)‖ will be small. [sent-99, score-0.582]

46 1 (Empirical) Online Confusion Risk. Assume an online learning algorithm A that outputs hypotheses from a family H: run on the training sequence Z ∼ D^T, A outputs hypotheses h = {h0, . . . , h_{T−1}}. [sent-102, score-0.279]

47 Let ℓ = (ℓ_{q|p}) be a family of losses and let p ∈ Y. For h ∈ H and x ∈ X, we define L_{|p}(h, x) = (l_{uv})_{1≤u,v≤Q} ∈ R^{Q×Q}, with l_{uv} = ℓ_{v|p}(h, x) if u = p and 0 otherwise. (8) [sent-107, score-0.093]

48 The matrix L_{|p}(h) ∈ R^{Q×Q} is naturally given by L_{|p}(h) = E_{X|p} L_{|p}(h, X). [sent-110, score-0.04]

49 Let y ∈ Y^T be a fixed sequence of labels and Y a random sequence of T labels. [sent-114, score-0.096]

50 The conditional and non-conditional confusion risks are denoted C_{ℓ,y}(h) and C_{ℓ,Y}(h), respectively. [sent-119, score-0.635]

51 In order to provide our generalization bounds, it will be more convenient for us to work with the conditional confusion risk C_{ℓ,y}(h). [sent-124, score-0.62]

52 A simple argument will then allow us to provide generalization bounds for C_{ℓ,Y}(h). [sent-125, score-0.064]

53 Let h = {h0, . . . , h_{T−1}} be the hypotheses output by A when run on Z_y = {(Xt, yt)}_{t=1}^T, such that Xt is distributed according to D_{X|yt}. [sent-131, score-0.302]

54 The (non-)conditional empirical online confusion risks are C_{ℓ,y}(h, X) = Σ_{t=1}^T L_{|yt}(h_{t−1}, Xt)/T_{yt} and its non-conditional counterpart C_{ℓ,Y}(h, X). [sent-132, score-0.69]

55 Consider a sequence {Uk} of d1 × d2 random matrices, and a fixed sequence of scalars {Mk} such that E_{Uk|U1···Uk−1} Uk = 0 and ‖Uk‖ ≤ Mk almost surely. [sent-139, score-0.096]

56 New Results. This subsection reports our main results. [sent-146, score-0.061]

57 Suppose that the losses in ℓ are such that 0 ≤ ℓ_{q|p} ≤ M for some M > 0. [sent-148, score-0.056]

58 h = {h0, . . . , h_{T−1}} is the set of hypotheses output by A when provided with {(Xt, yt)}_{t=1}^T. [sent-153, score-0.302]

59 where we used that the only row of Ut not to be zero is its yt-th row (see Remark 1), the fact that sup_{v:‖v‖≤1} v · u = ‖u‖₂, and the assumption 0 ≤ ℓ_{q|p} ≤ M, which gives that |∆_{t,q}| ≤ M. [sent-162, score-0.288]

60 Using Corollary 1 on the matrix martingale {Ut}, where ‖Ut‖ ≤ M√Q / T_{yt} almost surely, gives P(‖Σ_t Ut‖ ≥ ε) ≤ 2Q exp(−ε² / (2M²Q Σ_t T_{yt}^{−2})). Setting the right-hand side to δ, and using Σ_t T_{yt}^{−2} = Σ_p T_p^{−1}, gives that, with probability at least 1 − δ, ‖Σ_t Ut‖ ≤ M √(2Q (Σ_p T_p^{−1}) ln(2Q/δ)). [sent-163, score-0.081]
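
As a rough order-of-magnitude check, the snippet below (ours) evaluates the right-hand side of the bound in the form reconstructed above, M√(2Q (Σ_p T_p^{−1}) ln(2Q/δ)); the exact constants should be taken from the paper, not from this reconstruction.

```python
import numpy as np

def confusion_bound(M, class_counts, delta):
    """High-probability bound on ||sum_t U_t|| in the form reconstructed above:
    M * sqrt(2 * Q * sum_p 1/T_p * ln(2Q / delta)).  Constants are indicative only."""
    T = np.asarray(class_counts, dtype=float)
    Q = len(T)
    return M * np.sqrt(2.0 * Q * np.sum(1.0 / T) * np.log(2.0 * Q / delta))

# Example: 3 classes with imbalanced counts and losses bounded by M = 1.
print(confusion_bound(M=1.0, class_counts=[1000, 200, 50], delta=0.05))
```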

61 Indeed, it provides a bound on the norm of the average confusion risks of hypotheses h0, . . . [sent-167, score-0.727]

62 . . . , h_{T−1}, which, from a practical point of view, does not say much about the norm of the confusion risk of a specific hypothesis. [sent-170, score-0.72]

63 In fact, as is usual in online learning [1], it provides a bound on the confusion risk of the Gibbs classifier, which uniformly samples over h0, . . . , h_{T−1}. [sent-171, score-0.735]
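
Concretely, the Gibbs classifier mentioned here can be sketched as follows (our illustration; representing hypotheses as weight matrices is an assumption made for the example):

```python
import numpy as np

def gibbs_predict(hypotheses, x, rng):
    """Gibbs prediction: draw one of the stored hypotheses h_0, ..., h_{T-1} uniformly at
    random and predict with it; each hypothesis is represented as a (Q, d) weight matrix."""
    W = hypotheses[rng.integers(len(hypotheses))]
    return int(np.argmax(W @ x))

rng = np.random.default_rng(7)
Q, d, T = 3, 4, 10
hypotheses = [rng.normal(size=(Q, d)) for _ in range(T)]    # stand-ins for h_0, ..., h_{T-1}
x = rng.normal(size=d)
print([gibbs_predict(hypotheses, x, rng) for _ in range(5)])
```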

64 In that case (with the average hypothesis h defined in (14)), we may show the following theorem, which is much more compatible with the stated goal of trying to find a hypothesis that has small (or at least controlled) confusion risk. [sent-178, score-0.644]

65 In addition to the assumptions of Theorem 1, we now assume that the losses are convex and consider the average hypothesis h (as defined in (14)). [sent-180, score-0.056]

66 It is now time to give the argument allowing us to state results for the non-conditional online confusion risks. [sent-185, score-0.671]

67 In light of the generic results of this subsection, we are now ready to motivate and derive a new online learning algorithm that aims at a small confusion risk. [sent-188, score-0.674]

68 3 Online Learning with COPA This subsection presents a new online algorithm, COPA (for COnfusion Passive-Aggressive learning). [sent-190, score-0.098]

69 A first message from Theorem 2 is that, provided the functions considered are convex, it is relevant to use the average hypothesis h as a predictor. [sent-192, score-0.048]

70 We indeed know by (15) that the confusion risk of h is bounded by ‖C_{ℓ,y}(h, X)‖, plus some quantity directly related to the number of training data encountered. [sent-193, score-0.681]

71 The second message from the theorem is that the focus naturally comes to C_{ℓ,y}(h, X) and the question of how to minimize this quantity. [sent-194, score-0.067]

72 According to Definition 4, C_{ℓ,y}(h, X) is the sum of the instantaneous confusion matrices L_{|yt}(h_{t−1}, Xt)/T_{yt}. [sent-195, score-0.632]

73 However, as remarked before, L_{|yt} has only one nonzero row, its yt-th. [sent-197, score-0.25]

74 Therefore, the operator norm of L_{|yt}(h, Xt)/T_{yt} is simply the Euclidean norm of its yt-th row. [sent-198, score-0.391]

75 Trying to minimize the squared Euclidean norm of this row may be written as min_h (1/(2T_{yt}²)) Σ_{q≠yt} ℓ²_{q|yt}(h, Xt). (16) [sent-199, score-0.094]
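
Both remarks, namely that the operator norm of a one-nonzero-row matrix is the Euclidean norm of that row, and the form of the per-step objective (16), can be checked with a few lines of numpy (ours, for illustration; the clipped-margin loss is an assumed stand-in):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, d = 4, 6

# A matrix whose only nonzero row is row y_t: its operator norm is that row's Euclidean norm.
yt = 2
L = np.zeros((Q, Q))
L[yt] = rng.random(Q)
print(np.linalg.norm(L, ord=2), np.linalg.norm(L[yt]))      # the two values coincide

# Per-step quantity of (16): sum over q != y_t of squared losses of a linear hypothesis,
# scaled by 1/(2 T_{y_t}^2); the clipped-margin loss below is an assumed stand-in.
W = rng.normal(size=(Q, d))
x_t = rng.normal(size=d)
T_yt = 100
losses = np.maximum(0.0, W @ x_t + 1.0 / (Q - 1))
losses[yt] = 0.0                                            # only q != y_t contribute
print(0.5 / T_yt**2 * np.sum(losses**2))
```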

76 The hypothesis space H is made of linear classifiers, so that a sequence of vectors W = {w1, . . . , wQ} defines the hypothesis hW as in Example 2. [sent-202, score-0.099]

77 The family COPA builds upon is ℓ_{q|p}(hW, x) = |⟨wq, x⟩ + 1/(Q−1)|_+, ∀p, q ∈ Y. [sent-206, score-0.453]

78 Given the current vectors W^t = {w1^t, . . . , wQ^t}, using (16), and implementing a passive-aggressive learning strategy [2], the new set of vectors W^{t+1} is searched as the solution of min_{W, Σ_q wq = 0} (1/2) Σ_{q=1}^Q ‖wq − wq^t‖²₂ + (C/(2T_y²)) Σ_{q≠y} |⟨wq, x⟩ + 1/(Q−1)|²_+. (17) [sent-212, score-1.52]

79 It turns out that it is possible to find the solution of this optimization problem without resorting to any involved optimization procedure. [sent-213, score-0.046]

80 This is akin to a result obtained by [6], which applies for another family of loss functions. [sent-214, score-0.054]

81 We indeed prove the following theorem (proof in supplementary material). [sent-215, score-0.051]

82 Suppose we want to solve min_{W, Σ_q wq = 0} (1/2) Σ_{q=1}^Q ‖wq − wq^t‖²₂ + (C/2) Σ_{q≠y} |⟨wq, x⟩ + 1/(Q−1)|²_+. (18) [sent-217, score-1.52]

83 Algorithm 1 COPA. Input: z = {(xt, yt)}_{t=1}^T training set (a realization of Z), R number of epochs over z, C cost. Output: W = {w1, . . . , wQ}. [sent-218, score-0.271]

84 for r = 1 to R do: for t = 1 to T do: receive (xt, yt); compute α∗ according to (20); ∀q, perform the update wq^{τ+1} ← wq^τ − (α∗_q − (1/Q) Σ_{q'=1}^Q α∗_{q'}) xt; τ ← τ + 1; end for; end for; ∀q, wq ← (1/τ) Σ_{k=1}^τ wq^k. Let ℓ^t_q be defined as . . . [sent-224, score-1.877]
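
Since the closed-form update (20) is not reproduced in this extraction, the sketch below (ours; it is not the paper's COPA solver) mimics Algorithm 1 by solving the passive-aggressive step (17) numerically with a few projected-gradient iterations, just to make the training loop concrete. All names and the inner solver are assumptions.

```python
import numpy as np

def pa_step(W, x, y, C, T_y, n_iter=100, lr=0.1):
    """Approximate one passive-aggressive update of problem (17):
    min_W' 0.5 * sum_q ||w'_q - w_q||^2 + C/(2 T_y^2) * sum_{q != y} |<w'_q, x> + 1/(Q-1)|_+^2
    subject to sum_q w'_q = 0, solved here by projected gradient descent as a stand-in for the
    paper's analytical solution (20)."""
    Q = W.shape[0]
    Wp = W.copy()
    for _ in range(n_iter):
        margins = np.maximum(0.0, Wp @ x + 1.0 / (Q - 1))
        margins[y] = 0.0                                  # only q != y contribute to the loss
        grad = (Wp - W) + (C / T_y**2) * np.outer(margins, x)
        Wp -= lr * grad
        Wp -= Wp.mean(axis=0)                             # project back onto sum_q w_q = 0
    return Wp

def copa_train(X, y, Q, C=1.0, epochs=5):
    """COPA-style training loop in the spirit of Algorithm 1, returning averaged weights."""
    n, d = X.shape
    T = np.bincount(y, minlength=Q)                       # class counts T_q
    W = np.zeros((Q, d))
    W_sum, tau = np.zeros((Q, d)), 0
    for _ in range(epochs):
        for t in range(n):
            W = pa_step(W, X[t], y[t], C, T[y[t]])
            W_sum += W
            tau += 1
    return W_sum / tau

# Toy usage on Gaussian clusters.
rng = np.random.default_rng(5)
Q, d, n = 3, 5, 200
centers = rng.normal(scale=3.0, size=(Q, d))
y = rng.integers(0, Q, size=n)
X = centers[y] + rng.normal(size=(n, d))
W_avg = copa_train(X, y, Q)
print("training accuracy:", np.mean(np.argmax(X @ W_avg.T, axis=1) == y))
```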

85 4 Numerical Simulations. Here we report results of preliminary simulations of COPA on a toy dataset. [sent-251, score-0.042]

86 We compare the results of COPA to those of a simple multiclass perceptron procedure (the number of epochs for each algorithm is set to 5). [sent-260, score-0.141]
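
For reference, a minimal multiclass perceptron of the kind used as a baseline here might look as follows (our sketch; the paper's exact variant and its hyperparameters are not specified in this extraction):

```python
import numpy as np

def multiclass_perceptron(X, y, Q, epochs=5):
    """Plain multiclass perceptron: on a mistake, move the true class's weight vector towards
    x and the predicted class's weight vector away from it."""
    n, d = X.shape
    W = np.zeros((Q, d))
    for _ in range(epochs):
        for t in range(n):
            pred = int(np.argmax(W @ X[t]))
            if pred != y[t]:
                W[y[t]] += X[t]
                W[pred] -= X[t]
    return W

# Toy usage on Gaussian clusters.
rng = np.random.default_rng(8)
Q, d, n = 3, 5, 200
centers = rng.normal(scale=3.0, size=(Q, d))
y = rng.integers(0, Q, size=n)
X = centers[y] + rng.normal(size=(n, d))
W = multiclass_perceptron(X, y, Q, epochs=5)
print("training accuracy:", np.mean(np.argmax(X @ W.T, axis=1) == y))
```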

87 The essential finding of these simulations is that COPA and the perceptron achieve similar rates of classification accuracy, [sent-262, score-0.086]

88 around 0.86, but the norm of the confusion matrix of COPA is smaller than that of the perceptron. [sent-264, score-0.679]

89 This means COPA indeed does its job in trying to minimize the confusion risk. [sent-267, score-0.648]

90 5 Conclusion. In this paper, we have provided new bounds for online learning algorithms aiming at controlling their confusion risk. [sent-268, score-0.68]

91 Preliminary numerical simulations tend to support the effectiveness of COPA. [sent-271, score-0.064]

92 First, we would like to know whether efficient update equations can be derived if a simple hinge, instead of a squared hinge, is used. [sent-273, score-0.071]

93 Also, we are interested in comparing the merits of our theoretical results with those recently proposed in [5] and [7], which propose to address learning with the confusion matrix from the algorithmic stability and the PAC-Bayesian viewpoints. [sent-275, score-0.64]

94 Finally, we would like to see how the proposed material can be of some use in the realm of structure prediction and by extension, in the case where the confusion matrix to consider is infinite-dimensional. [sent-276, score-0.622]

95 Consider a finite sequence {Xk} of self-adjoint matrices in dimension d, and a fixed sequence {Ak} of self-adjoint matrices that satisfy E_{k−1} Xk = 0, Xk² ⪯ Ak², and Ak Xk = Xk Ak, almost surely. [sent-280, score-0.175]

96 To prove the result, it suffices to make use of the dilation technique and apply Theorem 4. [sent-283, score-0.049]

97 The self-adjoint dilation D(U) of a matrix U ∈ R^{d1×d2} is the self-adjoint matrix D(U) of order d1 + d2 defined by D(U) = [[0, U], [U*, 0]], where U* is the adjoint of U (as U has only real coefficients here, U* is the transpose of U). [sent-284, score-0.129]
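
A short numpy illustration (ours) of the dilation trick: the largest eigenvalue of D(U) equals the operator norm of U, which is what allows results for self-adjoint matrices to be transferred to rectangular ones.

```python
import numpy as np

def dilation(U):
    """Self-adjoint dilation D(U) = [[0, U], [U^T, 0]] of a real d1 x d2 matrix U."""
    d1, d2 = U.shape
    return np.block([[np.zeros((d1, d1)), U], [U.T, np.zeros((d2, d2))]])

rng = np.random.default_rng(6)
U = rng.normal(size=(3, 5))
D = dilation(U)
print(np.max(np.linalg.eigvalsh(D)), np.linalg.norm(U, ord=2))  # lambda_max(D(U)) = ||U||
```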

98 Considering Uk, we get that, almost surely, D²(Uk) ⪯ Mk² I, and since dilation is a linear operator, we have that E_{Uk|U1···Uk−1} D(Uk) = 0. [sent-286, score-0.066]

99 The sequence of matrices {D(Uk)} is therefore a matrix martingale that verifies the assumptions of Theorem 4, the application of which gives P(λmax(Σ_k D(Uk)) ≥ t) ≤ (d1 + d2) exp(−t²/(2σ²)), with σ² = Σ_k Mk². [sent-287, score-0.143]

100 Exact passive-aggressive algorithm for multiclass classification using support class. [sent-319, score-0.116]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('confusion', 0.582), ('copa', 0.398), ('wq', 0.38), ('tyt', 0.258), ('yt', 0.25), ('ht', 0.206), ('xt', 0.107), ('uk', 0.088), ('ut', 0.082), ('risk', 0.081), ('cpq', 0.08), ('misclass', 0.08), ('hw', 0.076), ('multiclass', 0.076), ('online', 0.072), ('misclassi', 0.071), ('ext', 0.065), ('norm', 0.057), ('dx', 0.056), ('losses', 0.056), ('mk', 0.054), ('tq', 0.053), ('hypotheses', 0.052), ('rq', 0.051), ('hinge', 0.051), ('dilation', 0.049), ('sequence', 0.048), ('recourse', 0.046), ('perceptron', 0.044), ('tp', 0.043), ('simulations', 0.042), ('matrix', 0.04), ('hmaj', 0.04), ('luv', 0.04), ('passiveaggressive', 0.04), ('targetting', 0.04), ('xk', 0.038), ('family', 0.037), ('risks', 0.036), ('euk', 0.035), ('multicategory', 0.035), ('reports', 0.035), ('theorem', 0.033), ('hypothesis', 0.032), ('matrices', 0.031), ('trying', 0.03), ('corollary', 0.03), ('ex', 0.029), ('classi', 0.029), ('operator', 0.027), ('pb', 0.027), ('bounds', 0.026), ('ak', 0.026), ('subsection', 0.026), ('bij', 0.025), ('zt', 0.025), ('martingale', 0.024), ('predictor', 0.024), ('numerical', 0.022), ('generalization', 0.021), ('epochs', 0.021), ('aij', 0.021), ('precisely', 0.02), ('surely', 0.02), ('relying', 0.02), ('ready', 0.02), ('update', 0.02), ('concentration', 0.02), ('remark', 0.02), ('upon', 0.019), ('class', 0.019), ('cation', 0.019), ('made', 0.019), ('instantaneous', 0.019), ('says', 0.019), ('row', 0.019), ('minimize', 0.018), ('indeed', 0.018), ('implying', 0.018), ('surrogate', 0.018), ('shimizu', 0.018), ('asap', 0.018), ('infrequent', 0.018), ('liva', 0.018), ('marseille', 0.018), ('merits', 0.018), ('prepare', 0.018), ('ralaivola', 0.018), ('satellite', 0.018), ('traning', 0.018), ('yoshida', 0.018), ('loss', 0.017), ('argument', 0.017), ('almost', 0.017), ('predictors', 0.017), ('consequence', 0.017), ('conditional', 0.017), ('builds', 0.017), ('euclidean', 0.017), ('message', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

2 0.24753529 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

3 0.22349802 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

4 0.19241241 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

5 0.12788592 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

6 0.12351439 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

7 0.1153986 293 nips-2012-Relax and Randomize : From Value to Algorithms

8 0.10757679 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

9 0.10530928 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.099261172 292 nips-2012-Regularized Off-Policy TD-Learning

11 0.098277189 324 nips-2012-Stochastic Gradient Descent with Only One Projection

12 0.085584387 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

13 0.082509868 168 nips-2012-Kernel Latent SVM for Visual Recognition

14 0.082298785 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

15 0.078300752 227 nips-2012-Multiclass Learning with Simplex Coding

16 0.073435485 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

17 0.070843831 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

18 0.070453994 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

19 0.067689314 303 nips-2012-Searching for objects driven by context

20 0.064326964 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, -0.014), (2, 0.152), (3, 0.238), (4, 0.146), (5, -0.103), (6, 0.0), (7, 0.012), (8, -0.001), (9, 0.022), (10, 0.042), (11, 0.028), (12, 0.064), (13, 0.014), (14, 0.02), (15, -0.016), (16, 0.037), (17, 0.035), (18, 0.034), (19, 0.033), (20, -0.048), (21, 0.023), (22, 0.045), (23, 0.055), (24, 0.027), (25, -0.006), (26, 0.018), (27, -0.037), (28, -0.011), (29, 0.027), (30, -0.027), (31, 0.06), (32, -0.092), (33, 0.021), (34, 0.032), (35, 0.039), (36, -0.016), (37, 0.012), (38, -0.056), (39, -0.077), (40, -0.005), (41, 0.022), (42, 0.001), (43, 0.026), (44, 0.031), (45, 0.028), (46, 0.041), (47, 0.014), (48, 0.006), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93381238 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

2 0.87459898 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

3 0.85107887 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

4 0.72583759 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

5 0.72302645 293 nips-2012-Relax and Randomize : From Value to Algorithms

Author: Sasha Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones, also capturing such “unorthodox” methods as Follow the Perturbed Leader and the R2 forecaster. Understanding the inherent complexity of the learning problem thus leads to the development of algorithms. To illustrate our approach, we present several new algorithms, including a family of randomized methods that use the idea of a “random playout”. New versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on the Littlestone’s dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts. 1

6 0.68366528 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

7 0.68030971 283 nips-2012-Putting Bayes to sleep

8 0.60412425 292 nips-2012-Regularized Off-Policy TD-Learning

9 0.58537823 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.54879415 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

11 0.51712084 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

12 0.51405233 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

13 0.48828384 72 nips-2012-Cocktail Party Processing via Structured Prediction

14 0.48709369 222 nips-2012-Multi-Task Averaging

15 0.46661648 303 nips-2012-Searching for objects driven by context

16 0.43847802 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

17 0.43408582 11 nips-2012-A Marginalized Particle Gaussian Process Regression

18 0.3975971 217 nips-2012-Mixability in Statistical Learning

19 0.38548473 324 nips-2012-Stochastic Gradient Descent with Only One Projection

20 0.38505501 161 nips-2012-Interpreting prediction markets: a stochastic approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (21, 0.044), (27, 0.312), (38, 0.135), (42, 0.023), (54, 0.021), (55, 0.029), (74, 0.028), (76, 0.098), (80, 0.139), (92, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84751701 39 nips-2012-Analog readout for optical reservoir computers

Author: Anteo Smerieri, François Duport, Yvon Paquot, Benjamin Schrauwen, Marc Haelterman, Serge Massar

Abstract: Reservoir computing is a new, powerful and flexible machine learning technique that is easily implemented in hardware. Recently, by using a timemultiplexed architecture, hardware reservoir computers have reached performance comparable to digital implementations. Operating speeds allowing for real time information operation have been reached using optoelectronic systems. At present the main performance bottleneck is the readout layer which uses slow, digital postprocessing. We have designed an analog readout suitable for time-multiplexed optoelectronic reservoir computers, capable of working in real time. The readout has been built and tested experimentally on a standard benchmark task. Its performance is better than non-reservoir methods, with ample room for further improvement. The present work thereby overcomes one of the major limitations for the future development of hardware reservoir computers. 1

2 0.78286666 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

same-paper 3 0.76154542 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

4 0.68558538 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

5 0.66804594 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

6 0.58744681 348 nips-2012-Tractable Objectives for Robust Policy Optimization

7 0.57849765 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

8 0.57845241 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.57480419 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

10 0.57402945 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

11 0.57307559 293 nips-2012-Relax and Randomize : From Value to Algorithms

12 0.57152712 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.5712316 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

14 0.57078844 200 nips-2012-Local Supervised Learning through Space Partitioning

15 0.5669809 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

16 0.56616104 227 nips-2012-Multiclass Learning with Simplex Coding

17 0.56614202 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

18 0.56484544 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

19 0.56361532 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

20 0.56352335 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models