nips nips2011 nips2011-145 knowledge-graph by maker-knowledge-mining

145 nips-2011-Learning Eigenvectors for Free


Source: pdf

Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uuᵀ where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n² parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. [sent-9, score-0.585]

2 Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. [sent-15, score-0.596]

3 Intuitively, learning the n² parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. [sent-17, score-0.376]

4 Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. [sent-18, score-0.457]

5 The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. [sent-21, score-0.45]

6 1 Introduction We consider the extension of the classical online problem of predicting outcomes from a finite alphabet to the matrix domain. [sent-23, score-0.587]

7 Whereas classically the goal is to learn as well as the best multinomial distribution over outcomes, in the matrix case we desire to learn the distribution over dyads that best explains the sequence of dyads seen so far. [sent-27, score-0.758]

8 A distribution on dyads is summarized as a density matrix, i.e. a positive matrix of unit trace. [sent-28, score-0.378]

9 Such matrices are heavily used in quantum physics, where dyads represent states. [sent-31, score-0.442]

10 Even though the matrix generalization led to progress in many contexts, in the original domain of on-line learning, the regret bounds proven for the algorithms in the matrix case are often the same as those provable for the original classical finite alphabet case [17, 19]. [sent-36, score-0.823]

11 Therefore it was posed as an open problem to determine whether this is just a case of a loose classical bound or whether there truly exists a “free matrix lunch” for some of these algorithms [18]. [sent-37, score-0.39]

12 For t = 1, . . . , T do: Algorithm predicts with density matrix Wt−1 ; Nature responds with density matrix Xt . [sent-45, score-0.726]

13 For t = 1, . . . , T do: Algorithm predicts with probability vector ωt−1 ; Nature responds with outcome xt . [sent-50, score-0.497]

14 In the classical case, there are n ≥ 2 outcomes and a distribution is parametrized by an n-dimensional probability vector ω, where ωi is the probability of outcome i. [sent-53, score-0.4]

15 We define a “matrix generalization” of a multinomial which is parametrized by a density matrix W (positive matrix of unit trace). [sent-55, score-0.53]

16 Whereas probability vectors represent uncertainty over n basis vectors, density matrices can be viewed as representing uncertainty over infinitely many dyads in Rn . [sent-59, score-0.418]

17 In the classical case, the algorithm predicts at trial t with multinomial ωt−1 . [sent-60, score-0.411]

18 The on-line algorithms are evaluated by their worst-case regret over data sequences, where the regret is the additional loss of the algorithm over the total loss of the best probability vector chosen in hindsight. [sent-69, score-0.674]

19 In this paper we develop the corresponding matrix setting, where the algorithm predicts with a density matrix Wt−1 , Nature produces a dyad xt xtᵀ, and the algorithm incurs loss −xtᵀ log(Wt−1 ) xt . [sent-70, score-1.329]

20 We are particularly interested in how the regret changes when the algorithms are generalized to the matrix case. [sent-72, score-0.393]

21 Surprisingly we can show that for the Laplace as well as the Krichevsky-Trofimov [10] estimators the worst-case regret is the same in the matrix case as it is in the classical case. [sent-73, score-0.604]

22 For the Last-Step Minimax algorithm [16], we can prove the same regret bound for the matrix case that was proven for the classical case. [sent-74, score-0.629]

23 The vector problems are typically retained as special cases of the matrix problems, where the eigensystem is fixed and only the vector of eigenvalues has to be learned. [sent-80, score-0.398]

24 We exhibit for the first time a basic fundamental problem, for which the regret achievable in the matrix case is no higher than the regret achievable in the original vector setting. [sent-81, score-0.632]

25 Paper outline Definitions and notation are given in the next section, followed by proofs of the free matrix lunch for the three discussed algorithms in Section 3. [sent-82, score-0.395]

26 We also discuss the minimax algorithm for multinomials due to Shtarkov, and the corresponding minimax algorithm for density matrices. [sent-84, score-0.55]

27 We provide strong experimental evidence that the free matrix lunch holds for this algorithm as well. [sent-85, score-0.416]

28 2 Setup The protocols for the classical probability vector prediction problem and the new density matrix prediction problem are displayed side-by-side in Figure 1. [sent-88, score-0.597]

29 During trial t the algorithm predicts with a density matrix Wt−1 . [sent-91, score-0.44]

30 Then Nature responds with an outcome density matrix Xt . [sent-93, score-0.418]

31 The discrepancy between prediction and outcome is measured by the matrix entropic loss ℓ(Wt−1 , Xt ) := − tr(Xt log(Wt−1 )), (1) where log denotes the matrix logarithm. [sent-94, score-0.937]

32 When the outcome density matrix Xt is a dyad xt xtᵀ, then this loss becomes −xtᵀ log(Wt−1 ) xt , which is the simplified form of the entropic loss discussed in the introduction. [sent-95, score-1.355]
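
To make the loss (1) concrete, here is a minimal numpy sketch (ours, not from the paper) of the matrix entropic loss via an eigendecomposition-based matrix logarithm, with a check of the dyad simplification; all helper names are hypothetical.

```python
import numpy as np

def matrix_log(W):
    """Matrix logarithm of a positive definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(W)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def entropic_loss(W, X):
    """Matrix entropic loss (1): -tr(X log W)."""
    return -np.trace(X @ matrix_log(W))

rng = np.random.default_rng(0)
n = 3
x = rng.normal(size=n)
x /= np.linalg.norm(x)              # unit vector u
X = np.outer(x, x)                  # dyad x x^T
A = rng.normal(size=(n, n))
W = A @ A.T + np.eye(n)             # positive definite
W /= np.trace(W)                    # density matrix: unit trace

# For a dyad, the loss reduces to the quadratic form -x^T log(W) x.
assert np.isclose(entropic_loss(W, X), -x @ matrix_log(W) @ x)
```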

33 When Wt−1 is diagonal in some eigensystem, i.e. it has the form Wt−1 = Σ_{i} ωt−1,i ei eiᵀ for some probability vector ωt−1 , and the outcome Xt is an eigendyad ej ejᵀ of the same eigensystem, then this loss simplifies to the classical log loss: ℓ(Wt−1 , Xt ) = − log(ωt−1,j ). [sent-98, score-0.675]

34 We aim to design algorithms with low regret compared to the best fixed density matrix in hindsight. [sent-101, score-0.547]

35 The loss of the best fixed density matrix can be expressed succinctly in terms of the von Neumann entropy, which is defined for any density matrix D as H(D) := − tr(D log D), and the sufficient statistic ST := Σ_{t=1}^{T} Xt : the total loss of the best density matrix in hindsight equals T H(ST /T ). [sent-102, score-0.876]
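
A short sketch (ours) of the von Neumann entropy, assuming — as reconstructed above — that the best fixed density matrix in hindsight is W = ST /T, so that its total loss is T H(ST /T):

```python
import numpy as np

def von_neumann_entropy(D):
    """H(D) = -tr(D log D), computed from eigenvalues, with the 0 log 0 = 0 convention."""
    vals = np.linalg.eigvalsh(D)
    vals = vals[vals > 1e-12]
    return -np.sum(vals * np.log(vals))

def best_hindsight_loss(outcomes):
    """Total entropic loss of the best fixed density matrix, W = S_T / T."""
    S_T = sum(outcomes)
    T = len(outcomes)
    return T * von_neumann_entropy(S_T / T)
```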

36 For a sequence X1 , . . . , XT , the regret of a strategy that issues prediction Wt after observing X1 , . . . , Xt is its total loss minus the loss of the best density matrix in hindsight (2). [sent-106, score-0.368]

37 We will prove a version of the free matrix lunch (FML) for all three algorithms. [sent-116, score-0.395]

38 Finally we discuss the minimax algorithm for which we have experimental evidence that the free matrix lunch holds as well. [sent-117, score-0.586]

39 3.1 Laplace After observing classical data with sufficient statistic vector σt = Σ_{q=1}^{t} e_{xq} , classical Laplace predicts with the probability vector ωt := (σt + 1)/(t + n), consisting of the normalized smoothed counts. [sent-119, score-0.521]

40 By analogy, after observing matrix data with sufficient statistic St = Σ_{q=1}^{t} Xq , matrix Laplace predicts with the correspondingly smoothed matrix Wt := (St + I)/(t + n). (3) [sent-120, score-0.713]

41 The worst-case regret of classical Laplace after T iterations equals log (T +n−1 choose n−1) ≤ (n−1) log(T + 1) (see e.g. [6]). [sent-128, score-0.575]
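
For concreteness, a minimal sketch (ours) of the two Laplace predictors in (3); when all observed dyads share a fixed eigensystem, matrix Laplace is diagonal in that eigensystem with the classical Laplace probabilities as eigenvalues, which is the sense in which the vector problem is a special case.

```python
import numpy as np

def classical_laplace(sigma_t, t, n):
    """Classical Laplace: normalized smoothed counts, (sigma_t + 1) / (t + n)."""
    return (sigma_t + 1.0) / (t + n)

def matrix_laplace(S_t, t, n):
    """Matrix Laplace (3): (S_t + I) / (t + n); unit trace since tr(S_t) = t."""
    return (S_t + np.eye(n)) / (t + n)
```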

42 We now show that in the matrix case, no additional regret is incurred. [sent-131, score-0.393]

43 The worst-case regrets of classical and matrix Laplace coincide. [sent-133, score-0.457]

44 The regret (2) of matrix Laplace can be bounded as follows: Σ_{t=1}^{T} ℓ(Wt−1 , Xt ) − Σ_{t=1}^{T} ℓ(WT∗ , Xt ) ≤ Σ_{t=1}^{T} ( ℓ(Wt−1 , Xt ) − ℓ(Wt∗ , Xt ) ). (4) [sent-136, score-0.393]

45 For any positive matrix with eigendecomposition A = Σ_{i} αi ai aiᵀ, log(A) := Σ_{i} log(αi ) ai aiᵀ. [sent-137, score-0.62]

46 The tth term equals − tr( Xt [ log((St−1 + I)/(t−1+n)) − log(St /t) ] ) = log((t−1+n)/t) − tr( Xt [ log(St−1 + I) − log(St ) ] ). [sent-139, score-0.847]

47 By omitting it, we obtain an upper bound on the regret of matrix Laplace that is tight: for any sequence of identical dyads the matrix part is zero and (4) holds with equality since Wt∗ = WT∗ for all t ≤ T . [sent-144, score-0.859]

48 The same upper bound is also met by classical Laplace on any sequence of identical outcomes [6]. [sent-145, score-0.371]

49 We just showed that matrix Laplace has the same worst-case regret as classical Laplace, even though matrix Laplace learns a matrix of n² parameters whereas classical Laplace only learns n probabilities. [sent-146, score-1.123]

50 Classical KT arises from Jeffreys’ prior, and matrix KT as the solution to the matrix entropic loss minimization problem (3) with n/2 virtual outcomes instead of the n used for Laplace. [sent-157, score-0.52]

51 The leading term in the worst-case regret for classical KT is the optimal (1/2) log(T ) rate per parameter instead of the log(T ) rate for Laplace. [sent-158, score-0.45]

52 More precisely, classical KT’s worst-case regret after T iterations is known to be log( Γ(T +n/2) / Γ(T +1/2) ) + log( Γ(1/2) / Γ(n/2) ) ≤ ((n−1)/2) log(T + 1) + log(π) (see e.g. [6]). [sent-159, score-0.654]
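
A matching sketch (ours) of the KT predictors; the classical add-1/2 smoothing is standard, and the matrix form (St + I/2)/(t + n/2) is our reading of the “n/2 virtual outcomes” remark above.

```python
import numpy as np

def classical_kt(sigma_t, t, n):
    """Classical Krichevsky-Trofimov: add-1/2 smoothing, (sigma_t + 1/2) / (t + n/2)."""
    return (sigma_t + 0.5) / (t + n / 2)

def matrix_kt(S_t, t, n):
    """Assumed matrix KT analogue: (S_t + I/2) / (t + n/2), unit trace since tr(S_t) = t."""
    return (S_t + 0.5 * np.eye(n)) / (t + n / 2)
```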

53 Again we show that no additional regret is incurred in the matrix case. [sent-162, score-0.393]

54 The worst-case regrets of classical and matrix KT coincide. [sent-164, score-0.457]

55 Lemma 1: For positive matrices A, B with A = Σ_{i} αi ai aiᵀ the eigendecomposition of A: H(A + B) ≥ Σ_{i=1}^{n} ( aiᵀ B ai / tr(B) ) H( A + tr(B) ai aiᵀ ). Proof of Theorem 2. [sent-166, score-0.611]

56 We start by telescoping the regret (2) of matrix KT as follows: Σ_{t=1}^{T} [ − tr(Xt log(Wt−1 )) − t H( (St−1 + Xt )/t ) + (t − 1) H( St−1 /(t−1) ) ]. (5) [sent-167, score-0.589]

57 Let us denote the eigendecomposition of St−1 by St−1 = Σ_{i=1}^{n} σi si siᵀ. [sent-169, score-0.388]

58 Notice that since Wt−1 plays in the eigensystem of St−1 , we have: − tr(Xt log(Wt−1 )) = − tr( Xt Σ_{i=1}^{n} log(ωt−1,i ) si siᵀ ) = − Σ_{i=1}^{n} ( siᵀ Xt si ) log(ωt−1,i ). [sent-170, score-1.3]

59 Moreover, it follows from Lemma 1 that: H( (St−1 + Xt )/t ) ≥ Σ_{i=1}^{n} ( siᵀ Xt si ) H( (St−1 + si siᵀ)/t ). [sent-171, score-0.684]

60 Taking this equality and inequality into account, the tth term in (5) is bounded above by: δt := Σ_{i=1}^{n} ( siᵀ Xt si ) [ − log(ωt−1,i ) − t H( (St−1 + si siᵀ)/t ) + (t − 1) H( St−1 /(t−1) ) ], which, in turn, is at most: δt ≤ sup_{i} [ − log(ωt−1,i ) − t H( (St−1 + si siᵀ)/t ) + (t − 1) H( St−1 /(t−1) ) ]. (6) [sent-172, score-1.105]

61 Hence the matrix KT regret is at most the classical KT regret. [sent-182, score-0.604]

62 The crucial part of the KT proof was showing that each term in the telescoped regret (5) can be bounded above by δt as defined in (6), in which all matrices share the same eigensystem, and which is hence equivalent to the corresponding classical expression. [sent-186, score-0.49]

63 Therefore, using the same line of argument, we can show that if for some classical prediction strategy we can obtain a meaningful regret bound by bounding each term in the regret δt independently, then we can obtain the same bound for the corresponding matrix strategy, i.e. the classical strategy played in the eigensystem of the current sufficient statistic. [sent-188, score-0.98]

64 In the classical case for the multinomial distribution, after observing data with sufficient statistic σt−1 , classical LSM predicts with ωt−1 := argmin_ω max_{xt} [ ℓ(ω, xt ) − Σ_{q=1}^{t} ℓ(ωt∗ , xq ) ] = Σ_{i=1}^{n} [ exp(−t H((σt−1 + ei )/t)) / Σ_{j} exp(−t H((σt−1 + ej )/t)) ] ei , where ωt∗ is the best probability vector in hindsight for x1 , . . . , xt , so that Σ_{q=1}^{t} ℓ(ωt∗ , xq ) = t H(σt /t). [sent-195, score-1.405]

65 For our straightforward generalization to the multinomial case, the regret is bounded by ((n−1)/2) ln(T + 1) + 1. [sent-197, score-0.518]

66 Applying the Last Step Minimax principle to density prediction, we obtain matrix LSM which issues prediction: Wt−1 := argmin_W max_{Xt} [ − tr(Xt log(W )) − t H(St /t) ]. We show that matrix LSM learns the eigenvectors without additional regret. [sent-199, score-0.756]
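
A sketch (ours) of both LSM predictors, assuming the closed form reconstructed above and, as argued below, that matrix LSM plays classical LSM in the eigensystem of St−1:

```python
import numpy as np

def entropy(p):
    """Classical entropy H(p) = -sum_i p_i log p_i, with 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def classical_lsm(sigma_prev, t):
    """Last-Step Minimax: omega_{t-1,i} proportional to exp(-t H((sigma_{t-1} + e_i) / t))."""
    n = len(sigma_prev)
    w = np.array([np.exp(-t * entropy((sigma_prev + np.eye(n)[i]) / t)) for i in range(n)])
    return w / w.sum()

def matrix_lsm(S_prev, t):
    """Matrix LSM: classical LSM applied to the eigenvalues, in the eigensystem of S_{t-1}."""
    vals, vecs = np.linalg.eigh(S_prev)
    omega = classical_lsm(vals, t)
    return vecs @ np.diag(omega) @ vecs.T
```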

67 The regrets of classical and matrix LSM are at most ((n−1)/2) ln(T + 1) + 1. [sent-201, score-0.457]

68 By Sion’s minimax theorem [15]: min_W max_{Xt} [ − tr(Xt log(W )) − t H(St /t) ] = max_P min_W E_P [ − tr(Xt log(W )) − t H(St /t) ], where P ranges over probability distributions on density matrices Xt . [sent-205, score-0.756]

69 Using Lemma 1, we can bound the second expression: t E_P [ H(St /t) ] ≥ t E_P [ Σ_{i=1}^{n} ( siᵀ Xt si ) H( (St−1 + si siᵀ)/t ) ] = t Σ_{i=1}^{n} ( siᵀ E_P [Xt ] si ) H( (St−1 + si siᵀ)/t ). [sent-208, score-1.051]

70 Therefore, we have: H( E_P [Xt ] ) ≤ H( Σ_{i=1}^{n} ( siᵀ E_P [Xt ] si ) si siᵀ ) = H(p), where the last entropy is a classical entropy and p is a vector such that pi = siᵀ E_P [Xt ] si . [sent-216, score-0.794]

71 It follows that Wt−1 is the classical LSM strategy in the eigensystem of St−1 , i.e. it plays the classical LSM probabilities on the eigenvectors of St−1 . [sent-226, score-0.46]

72 The proof of the classical LSM guarantee is based on bounding the per-round regret increase: δt := − log(ωt−1,xt ) − t H( (σt−1 + e_{xt} )/t ) + (t − 1) H( σt−1 /(t−1) ), by choosing the worst case w.r.t. xt . [sent-229, score-0.473]

73 Since, for matrices, the worst case of the corresponding matrix version of δt (see (6)) is the diagonal case, the whole analysis immediately goes through and we get the same bound as for classical LSM. [sent-233, score-0.39]

74 The worst-case regrets of classical and matrix LSM coincide. [sent-241, score-0.457]

75 For dyadic data xt xtᵀ it can be incrementally updated in O(n²) time per trial with methods along the lines of [11]. [sent-244, score-0.678]

76 The minimax algorithm for multinomials, due to Shtarkov [14], minimizes the worst-case regret: inf sup [ Σ_{t=1}^{T} ℓ(ωt−1 , xt ) − T H(σT /T ) ], with the infimum over prediction strategies and the supremum over outcome sequences. (9) [sent-247, score-0.788]

77 After observing data with sufficient statistic σt and hence with r := T − t rounds remaining, classical Shtarkov predicts with ωt := Σ_{i=1}^{n} [ φ_{r−1}(σt + ei ) / φ_{r}(σt ) ] ei , where φ_{r}(σ) := Σ_{c1 ,...,cr} exp(−T H( (σ + e_{c1} + · · · + e_{cr})/T )). [sent-251, score-0.54]

78 The regret of classical Shtarkov equals log φT (0) ≈ ((n−1)/2) log(T ) − log(n − 2) + 1 [6]. [sent-259, score-0.575]

79 The minimax algorithm for density matrices, called matrix Shtarkov, optimizes the worst-case regret: inf sup [ Σ_{t=1}^{T} ℓ(Wt−1 , Xt ) − T H(ST /T ) ]. [sent-261, score-0.785]
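
A brute-force sketch (ours) of the classical Shtarkov predictor under the reconstructed definition of φr above; the enumeration is exponential in the remaining horizon r, so this is only for tiny T:

```python
import itertools
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def phi(sigma, r, T):
    """phi_r(sigma): sum over all length-r continuations of exp(-T H(sigma_T / T))."""
    n = len(sigma)
    total = 0.0
    for cont in itertools.product(range(n), repeat=r):
        counts = np.array(sigma, dtype=float)
        for c in cont:
            counts[c] += 1.0
        total += np.exp(-T * entropy(counts / T))
    return total

def classical_shtarkov(sigma_t, t, T):
    """Predict with r = T - t rounds left: omega_i = phi_{r-1}(sigma_t + e_i) / phi_r(sigma_t)."""
    n = len(sigma_t)
    r = T - t
    w = np.array([phi(sigma_t + np.eye(n)[i], r - 1, T) for i in range(n)])
    return w / phi(sigma_t, r, T)
```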

80 Our conjecture is that Lemma 1 holds with the entropy H replaced by the minimax regret tail Rr (Conjecture 2). [sent-270, score-0.549]

81 It follows from this conjecture, using the same argument as for LSM, that matrix Shtarkov predicts in the eigensystem of St , i.e. it plays classical Shtarkov on the eigenvalues of St . [sent-273, score-0.454]

82 The worst-case regrets of classical and matrix Shtarkov coincide. [sent-276, score-0.457]

83 We have verified Conjecture 3 for the matrix Bernoulli case (n = 2) up to T = 5 outcomes, using a grid of 30 dyads per trial, with uniformly spaced (xᵀ e1 )². [sent-277, score-0.378]

84 4 Motivation and Discussion of the Loss Function The matrix entropic loss (1) that we choose as our loss function has a coding interpretation and it is a proper scoring rule. [sent-283, score-0.583]

85 Variable-length quantum coding requires the definition of a code length operator L, which is a positive matrix such that for any density matrix X, tr(XL) gives the expected dimension (“length”) of the message assigned to X. [sent-293, score-0.725]

86 The quantum version of Kraft’s inequality states that, ignoring rounding issues, for every variable-length quantum code with codelength operator L, there exists a density matrix W such that L = − log W . [sent-294, score-0.815]

87 Therefore, the matrix entropic loss can be interpreted as the (expected) code length of the observed outcome. [sent-295, score-0.423]

88 We will say that a matrix loss function ℓ(W , X) is proper if for any distribution P on density matrix outcomes, the expected loss with respect to P is minimized by predicting with the mean outcome of P , i.e. W = E_{X∼P} [X]. [sent-302, score-0.831]

89 The matrix entropic loss (1) is proper, for E_{X∼P} [− tr(X log W )] = − tr( E_{X∼P} [X] log W ) is minimized at W = E_{X∼P} [X] [12]. [sent-305, score-0.79]
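
A numerical spot-check (ours) of properness: the expected entropic loss is minimized by predicting the mean outcome, so no random alternative should beat it.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_density(n):
    A = rng.normal(size=(n, n))
    D = A @ A.T
    return D / np.trace(D)

def matrix_log(W):
    vals, vecs = np.linalg.eigh(W)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

n, k = 3, 5
outcomes = [random_density(n) for _ in range(k)]
probs = rng.dirichlet(np.ones(k))
mean = sum(p * X for p, X in zip(probs, outcomes))

def expected_loss(W):
    return sum(-p * np.trace(X @ matrix_log(W)) for p, X in zip(probs, outcomes))

for _ in range(200):
    assert expected_loss(mean) <= expected_loss(random_density(n)) + 1e-9
```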

90 Therefore, minimization of the matrix entropic loss leads to well-calibrated forecasting, as in the classical case. [sent-306, score-0.581]

91 A second generalization of the log loss to the matrix domain used in quantum physics [12] is the log trace loss ℓ(W , X) := − log tr(XW ). [sent-307, score-0.917]

92 Also for the log trace loss we found an example (not presented) against the FML for the minimax algorithm. [sent-312, score-0.424]

93 However, it equals the matrix entropic loss when X is a dyad. [sent-316, score-0.393]

94 This loss is not a proper scoring function (it behaves like the absolute loss in the vector case) and we have an example that shows that there is no FML for the minimax algorithm in this case (not presented). [sent-318, score-0.449]

95 5 Conclusion We showed that the free matrix lunch holds for the matrix version of the KT estimator. [sent-320, score-0.57]

96 Thus the conjectured free matrix lunch [18] is realized. [sent-321, score-0.395]

97 Perhaps the main open problem is whether the free matrix lunch holds for the matrix minimax algorithm. [sent-323, score-0.74]

98 Also we would like to know what properties of the loss function and algorithm cause the free matrix lunch to occur. [sent-324, score-0.493]

99 From the examples given in this paper it is tempting to believe that you always get a free matrix lunch when upgrading any classical sufficient-statistics-based predictor to a matrix version by just playing this predictor in the eigensystem of the current matrix sufficient statistics. [sent-325, score-1.138]

100 Since ∂H(D)/∂D = − log(D) − I, we can compute f′(γ) = − tr( B log(A + γB) ) + Σ_{i=1}^{n} ( aiᵀ B ai ) aiᵀ log( A + γ tr(B) ai aiᵀ ) ai = tr( B log(A + γ tr(B) I) ) − tr( B log(A + γB) ). Since tr(B) I ⪰ B, we have A + γ tr(B) I ⪰ A + γB; operator monotonicity of the matrix logarithm implies that log(A + γ tr(B) I) ⪰ log(A + γB), hence f′(γ) ≥ 0. [sent-334, score-1.643]
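
Lemma 1 and the monotonicity argument above can be spot-checked numerically; a sketch (ours) with random positive matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

def H(D):
    """-tr(D log D) via eigenvalues, with 0 log 0 = 0."""
    vals = np.linalg.eigvalsh(D)
    vals = vals[vals > 1e-12]
    return -np.sum(vals * np.log(vals))

n = 4
for _ in range(200):
    A = rng.normal(size=(n, n)); A = A @ A.T   # positive definite (a.s.)
    B = rng.normal(size=(n, n)); B = B @ B.T
    alphas, vecs = np.linalg.eigh(A)
    rhs = sum(
        (vecs[:, i] @ B @ vecs[:, i]) / np.trace(B)
        * H(A + np.trace(B) * np.outer(vecs[:, i], vecs[:, i]))
        for i in range(n)
    )
    assert H(A + B) >= rhs - 1e-9
```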


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.311), ('lsm', 0.243), ('regret', 0.239), ('st', 0.225), ('dyads', 0.224), ('eigensystem', 0.224), ('classical', 0.211), ('wt', 0.21), ('tr', 0.196), ('shtarkov', 0.19), ('fml', 0.182), ('quantum', 0.178), ('si', 0.171), ('minimax', 0.17), ('lunch', 0.167), ('density', 0.154), ('matrix', 0.154), ('laplace', 0.141), ('kt', 0.127), ('entropic', 0.118), ('ep', 0.117), ('outcomes', 0.113), ('ai', 0.105), ('log', 0.102), ('loss', 0.098), ('regrets', 0.092), ('conjecture', 0.084), ('predicts', 0.076), ('outcome', 0.076), ('th', 0.074), ('free', 0.074), ('multinomial', 0.068), ('alphabet', 0.065), ('ei', 0.065), ('statistic', 0.06), ('proper', 0.057), ('ex', 0.057), ('trial', 0.056), ('multinomials', 0.056), ('rr', 0.055), ('trace', 0.054), ('bai', 0.047), ('eigendecomposition', 0.046), ('mov', 0.046), ('calculus', 0.042), ('eigenvectors', 0.04), ('matrices', 0.04), ('uu', 0.039), ('prediction', 0.039), ('observing', 0.039), ('virtual', 0.037), ('warmuth', 0.037), ('incurs', 0.036), ('lemma', 0.036), ('entropy', 0.035), ('sup', 0.035), ('dyad', 0.035), ('responds', 0.034), ('inf', 0.033), ('argmin', 0.032), ('coding', 0.032), ('ui', 0.031), ('wouter', 0.03), ('physics', 0.029), ('ej', 0.029), ('code', 0.028), ('logarithm', 0.028), ('issues', 0.026), ('scoring', 0.026), ('bound', 0.025), ('length', 0.025), ('strategy', 0.025), ('horizon', 0.024), ('suf', 0.024), ('online', 0.024), ('rounds', 0.024), ('tth', 0.024), ('xx', 0.024), ('equals', 0.023), ('bounding', 0.023), ('manfred', 0.023), ('arora', 0.023), ('sequence', 0.022), ('desire', 0.022), ('forecasting', 0.022), ('drew', 0.022), ('semide', 0.022), ('learn', 0.022), ('xq', 0.021), ('rounding', 0.021), ('holds', 0.021), ('score', 0.021), ('predicting', 0.02), ('xw', 0.02), ('minimized', 0.02), ('june', 0.02), ('equality', 0.02), ('retained', 0.02), ('trials', 0.019), ('events', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 145 nips-2011-Learning Eigenvectors for Free

Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uuᵀ where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n² parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

2 0.31816661 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14√T and loss 2√2·√T·e^(−√T) in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al. and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(√(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property.

3 0.24625137 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

Author: Dan Garber, Elad Hazan

Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.

4 0.21977034 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

5 0.20954147 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Author: Elad Hazan, Satyen Kale

Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with horizon T, and at most O(T^(2/3)) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant.

6 0.19098271 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

7 0.1755666 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

8 0.14386088 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

9 0.14083649 80 nips-2011-Efficient Online Learning via Randomized Rounding

10 0.12625518 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

11 0.12536937 25 nips-2011-Adaptive Hedge

12 0.11481651 162 nips-2011-Lower Bounds for Passive and Active Learning

13 0.10717384 61 nips-2011-Contextual Gaussian Process Bandit Optimization

14 0.09779308 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

15 0.09765885 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements

16 0.091913521 202 nips-2011-On the Universality of Online Mirror Descent

17 0.090279356 272 nips-2011-Stochastic convex optimization with bandit feedback

18 0.090273783 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

19 0.08463496 32 nips-2011-An Empirical Evaluation of Thompson Sampling

20 0.082710095 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.236), (1, -0.299), (2, -0.013), (3, -0.094), (4, 0.192), (5, 0.0), (6, 0.077), (7, -0.003), (8, 0.105), (9, -0.04), (10, 0.188), (11, -0.018), (12, 0.15), (13, 0.015), (14, -0.046), (15, 0.006), (16, 0.036), (17, 0.049), (18, 0.093), (19, -0.049), (20, -0.018), (21, -0.015), (22, 0.109), (23, 0.003), (24, -0.029), (25, -0.071), (26, 0.04), (27, -0.021), (28, 0.024), (29, -0.021), (30, -0.005), (31, 0.08), (32, 0.045), (33, -0.015), (34, -0.049), (35, 0.034), (36, 0.021), (37, -0.034), (38, -0.043), (39, 0.045), (40, -0.057), (41, -0.012), (42, 0.091), (43, -0.046), (44, -0.015), (45, 0.015), (46, -0.029), (47, 0.091), (48, 0.112), (49, 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96802962 145 nips-2011-Learning Eigenvectors for Free

Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth

Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uuᵀ where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n² parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

2 0.86292887 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Author: Elad Hazan, Satyen Kale

Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with horizon T, and at most O(T^(2/3)) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant.

3 0.84550971 220 nips-2011-Prediction strategies without loss

Author: Michael Kapralov, Rina Panigrahy

Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14√T and loss 2√2·√T·e^(−√T) in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al. and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(√(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property.

4 0.82383233 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

Author: Dan Garber, Elad Hazan

Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.

5 0.76672202 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

6 0.68787456 80 nips-2011-Efficient Online Learning via Randomized Rounding

7 0.67455029 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

8 0.60672027 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

9 0.56799161 21 nips-2011-Active Learning with a Drifting Distribution

10 0.54853928 272 nips-2011-Stochastic convex optimization with bandit feedback

11 0.50856858 162 nips-2011-Lower Bounds for Passive and Active Learning

12 0.50744271 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

13 0.46937954 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

14 0.45341715 25 nips-2011-Adaptive Hedge

15 0.4524003 61 nips-2011-Contextual Gaussian Process Bandit Optimization

16 0.44662809 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

17 0.43556401 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

18 0.41086185 32 nips-2011-An Empirical Evaluation of Thompson Sampling

19 0.3801114 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals

20 0.37562814 139 nips-2011-Kernel Bayes' Rule


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.043), (20, 0.035), (26, 0.023), (31, 0.099), (33, 0.015), (43, 0.056), (45, 0.111), (57, 0.018), (74, 0.027), (83, 0.444), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9628883 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns.

2 0.93683141 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

Author: Omar Z. Khan, Pascal Poupart, John-mark M. Agosta

Abstract: In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test’s relative diagnostic value. We demonstrate that consistency with an expert’s test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario. 1 INTRODUCTION The problem of learning-by-example has the promise to create strong models from a restricted number of cases; certainly humans show the ability to generalize from limited experience. Machine Learning has seen numerous approaches to learning task performance by imitation, going back to some of the approaches to inductive learning from examples [14]. Of particular interest are problemsolving tasks that use a model to infer the source, or cause of a problem from a sequence of investigatory steps or tests. The specific example we adopt is a diagnostic task such as appears in medicine, electro-mechanical fault isolation, customer support and network diagnostics, among others. We define a diagnostic sequence as consisting of the assignment of values to a subset of tests. The diagnostic process embodies the choice of the best next test to execute at each step in the sequence, by measuring the diagnostic value among the set of available tests at each step, that is, the ability of a test to distinguish among the possible causes. One possible implementation with which to carry out this process, the one we apply, is a Bayes network [9]. As with all model-based approaches, provisioning an adequate model can be daunting, resulting in a “knowledge elicitation bottleneck.” A recent approach for easing the bottleneck grew out of the realization that the best time to gain an expert’s insight into the model structure is during the diagnostic process. Recent work in “QueryBased Diagnostics” [1] demonstrated a way to improve model quality by merging model use and model building into a single process. More precisely the expert can take steps to modify the network structure to add or remove nodes or links, interspersed within the diagnostic sequence. In this paper we show how to extend this variety of learning-by-example to include also refinement of model parameters based on the expert’s choice of test, from which we determine constraints. The nature of these constraints, as shown herein, is derived from the value of the tests to distinguish causes, a value referred to informally as value of information [10]. It is the effect of these novel constraints on network parameter learning that is elucidated in this paper. ∗ J. M. Agosta is no longer affiliated with Intel Corporation 1 Conventional statistical learning approaches are not suited to this problem, since the number of cases available from diagnostic sessions is small, and the data from any case is sparse. (Only a fraction of the tests are taken.) But more relevant is that one diagnostic sequence from an expert user represents the true behavior expected of the model, rather than a noisy realization of a case generated by the true model. 
We adopt a Bayesian approach, which offers a principled way to incorporate knowledge (constraints and data, when available) and also consider weakening the constraints, by applying a likelihood to them, so that possibly conflicting constraints can be incorporated consistently. Sec. 2 reviews related work and Sec. 3 provides some background on diagnostic networks and model consistency. Then, Sec. 4 describes an augmented Bayesian network that incorporates constraints implied by an expert’s choice of tests. Some sampling techniques are proposed to find the Maximum a posterior setting of the parameters given the constraints (and any data available). The approach is evaluated in Sec. 5 on synthetic data and a real world manufacturing diagnostic scenario. Finally, Sec. 6 discusses some future work. 2 RELATED WORK Parameter learning for Bayesian networks can be viewed as searching in a high-dimensional space. Adopting constraints on the parameters based on some domain knowledge is a way of pruning this search space and learning the parameters more efficiently, both in terms of data needed and time required. Qualitative probabilistic networks [17] allow qualitative constraints on the parameter space to be specified by experts. For instance, the influence of one variable on another, or the combined influence of multiple variables on another variable [5] leads to linear inequalities on the parameters. Wittig and Jameson [18] explain how to transform the likelihood of violating qualitative constraints into a penalty term to adjust maximum likelihood, which allows gradient ascent and Expectation Maximization (EM) to take into account linear qualitative constraints. Other examples of qualitative constraints include some parameters being larger than others, bounded in a range, within ϵ of each other, etc. Various proposals have been made that exploit such constraints. Altendorf et al. [2] provide an approximate technique based on constrained convex optimization for parameter learning. Niculescu et al. [15] also provide a technique based on constrained optimization with closed form solutions for different classes of constraints. Feelders [6] provides an alternate method based on isotonic regression while Liao and Ji [12] combine gradient descent with EM. de Campos and Ji [4] also use constrained convex optimization, however, they use Dirichlet priors on the parameters to incorporate any additional knowledge. Mao and Lebanon [13] also use Dirichlet priors, but they use probabilistic constraints to allow inaccuracies in the specification of the constraints. A major difference between our technique and previous work is on the type of constraints. Our constraints do not need to be explicitly specified by an expert. Instead, we passively observe the expert and learn from what choices are made and not made [16]. Furthermore, as we shall show later, our constraints are non-convex, preventing the direct application of existing techniques that assume linear or convex functions. We use Beta priors on the parameters, which can easily be extended to Dirichlet priors like previous work. We incorporate constraints in an augmented Bayesian network, similar to Liang et al. [11], though their constraints are on model predictions as opposed to ours which are on the parameters of the network. Finally, we also use the notion of probabilistic constraints to handle potential mistakes made by experts. 
3 3.1 BACKGROUND DIAGNOSTIC BAYES NETWORKS We consider the class of bipartite Bayes networks that are widely used as diagnostic models, though our approach can be used for networks with any structure. The network forms a sparse, directed, causal graph, where arcs go from causes to observable node variables. We use upper case to denote random variables; C for causes, and T for observables (tests). Lower case letters denote values in the domain of a variable, e.g. c ∈ dom(C) = {c, c}, and bold letters denote sets of variables. A ¯ set of marginally independent binary-valued node variables C with distributions Pr(C) represent unobserved causes, and condition the remaining conditionally independent binary-valued test vari2 able nodes T. Each cause conditions one or more tests; likewise each test is conditioned by one or more causes, resulting in a graph with one or more possibly multiply-connected components. The test variable distributions Pr(T |C) incorporate the further modeling assumption of Independence of Causal Influence, the most familiar example being the Noisy-Or model [8]. To keep the exposition simple, we assume that all variables are binary and that conditional distributions are parametrized by the Noisy-Or; however, the algorithms described in the rest of the paper generalize to any discrete non-binary variable models. Conventionally, unobserved tests are ranked in a diagnostic Bayes network by their Value Of Information (VOI) conditioned on tests already observed. To be precise, VOI is the expected gain in utility if the test were to be observed. The complete computation requires a model equivalent to a partially observable Markov decision process. Instead, VOI is commonly approximated by a greedy computation of the Mutual Information between a test and the set of causes [3]. In this case, it is easy to show that Mutual Information is in turn well approximated to second order by the Gini impurity [7] as shown in Equation 1. ] [∑ ∑ GI(C|T ) = Pr(T = t) Pr(C = c|T = t)(1 − Pr(C = c|T = t)) (1) t c We will use the Gini measure as a surrogate for VOI, as a way to rank the best next test in the diagnostic sequence. 3.2 MODEL CONSISTENCY A model that is consistent with an expert would generate Gini impurity rankings consistent with the expert’s diagnostic sequence. We interpret the expert’s test choices as implying constraints on Gini impurity rankings between tests. To that effect, [1] defines the notion of Cause Consistency and Test Consistency, which indicate whether the cause and test orderings induced by the posterior distribution over causes and the VOI of each test agree with an expert’s observed choice. Assuming that the expert greedily chooses the most informative test T ∗ (i.e., test that yields the lowest Gini impurity) at each step, then the model is consistent with the expert’s choices when the following constraints are satisfied: GI(C|T ∗ ) ≤ GI(C|Ti ) ∀i (2) We demonstrate next how to exploit these constraints to refine the Bayes network. 4 MODEL REFINEMENT Consider a simple diagnosis example with two possible causes C1 and C2 and two tests T1 and T2 as shown in Figure 1. To keep the exposition simple, suppose that the priors for each cause are known (generally separate data is available to estimate these), but the conditional distribution of each test is unknown. Using the Noisy-OR parameterizations for the conditional distributions, the number of parameters are linear in the number of parents instead of exponential. 
∏ i i Pr(Ti = true|C) = 1 − (1 − θ0 ) (1 − θj ) (3) j|Cj =true i Here, θ0 = Pr(Ti = true|Cj = f alse ∀j) is the leak probability that Ti will be true when none of i the causes are true and θj = Pr(Ti = true|Cj = true, Ck = f alse ∀k ̸= j) is the link reliability, which indicates the independent contribution of cause Cj to the probability that test Ti will be true. In the rest of this section, we describe how to learn the θ parameters while respecting the constraints implied by test consistency. 4.1 TEST CONSISTENCY CONSTRAINTS Suppose that an expert chooses test T1 instead of test T2 during the diagnostic process. This ordering by the expert implies that the current model (parametrized by the θ’s) must be consistent with the constraint GI(C|T2 ) − GI(C|T1 ) ≥ 0. Using the definition of Gini impurity in Eq. 1, we can rewrite 3 Figure 1: Network with 2 causes and 2 tests Figure 2: Augmented network with parameters and constraints Figure 3: Augmented network extended to handle inaccurate feedback the constraint for the network shown in Fig. 1 as follows: ∑ t1 ( ∑ (Pr(t1 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 Pr(t1 ) − Pr(t1 ) c ,c 1 2 ) ( ) ∑ ∑ (Pr(t2 |c1 , c2 ) Pr(c1 ) Pr(c2 ))2 − Pr(t2 ) − ≥0 Pr(t2 ) t c ,c 2 1 2 (4) Furthermore, using the Noisy-Or encoding from Eq. 3, we can rewrite the constraint as a polynomial in the θ’s. This polynomial is non-linear, and in general, not concave. The feasible space may consist of disconnected regions. Fig. 4 shows the surface corresponding to the polynomial for the 2 1 i i case where θ0 = 0 and θ1 = 0.5 for each test i, which leaves θ2 and θ2 as the only free variables. The parameters’ feasible space, satisfying the constraint consists of the two disconnected regions where the surface is positive. 4.2 AUGMENTED BAYES NETWORK Our objective is to learn the θ parameters of diagnostic Bayes networks given test constraints of the form described in Eq. 4. To deal with non-convex constraints and disconnected feasible regions, we pursue a Bayesian approach whereby we explicitly model the parameters and constraints as random variables in an augmented Bayes network (see Fig. 2). This allows us to frame the problem of learning the parameters as an inference problem in a hybrid Bayes network of discrete (T, C, V ) and continuous (Θ) variables. As we will see shortly, this augmented Bayes network provides a unifying framework to simultaneously learn from constraints and data, to deal with possibly inconsistent constraints, and to express preferences over the degree of satisfaction of the constraints. We encode the constraint derived from the expert feedback as a binary random variable V in the Bayes network. If V is true the constraint is satisfied, otherwise it is violated. Thus, if V is true then Θ lies in the positive region of Fig. 4, and if V is f alse then Θ lies in the negative region. We model the CPT for V as Pr(V |Θ) = max(0, π), where π = GI(C|T1 ) − GI(C|T2 ). Note that the value of GI(C|T ) lies in the interval [0,1], so the probability π will always be normalized. The intuition behind this definition of the CPT for V is that a constraint is more likely to be satisfied if the parameters lie in the interior of the constraint region. We place a Beta prior over each Θ parameter. Since the test variables are conditioned on the Θ parameters that are now part of the network, their conditional distributions become known. For instance, the conditional distribution for Ti (given in Eq. 3) is fully defined given the noisy-or parami eters θj . 
Hence the problem of learning the parameters becomes an inference problem to compute posteriors over the parameters given that the constraint is satisfied (and any data). In practice, it is more convenient to obtain a single value for the parameters instead of a posterior distribution since it is easier to make diagnostic predictions based on one Bayes network. We estimate the parameters by computing a maximum a posteriori (MAP) hypothesis given that the constraint is satisfied (and any data): Θ∗ = arg maxΘ Pr(Θ|V = true). 4 Algorithm 1 Pseudo Code for Gibbs Sampling, Stochastic Hill Climbing and Greedy Search 1 Fix observed variables, let V = true and randomly sample feasible starting state S 2 for i = 1 to #samples 3 for j = 1 to #hiddenV ariables 4 acceptSample = f alse; k = 0 5 repeat 6 Sample s′ from conditional of j th hidden variable Sj 7 S′ = S; Sj = s′ 8 if Sj is cause or test, then acceptSample = true 9 elseif S′ obeys constraints V∗ 10 if algo == Gibbs 11 Sample u from uniform distribution, U(0,1) p(S′ 12 if u < M q(S)′ ) where p and q are the true and proposal distributions and M > 1 13 acceptSample = true 14 elseif algo = = StochasticHillClimbing 15 if likelihood(S′ ) > likelihood(S), then acceptSample = true 16 elseif algo = = Greedy, then acceptSample = true 17 elseif algo = = Greedy 18 k = k+1 19 if k = = maxIterations, then s′ = Sj ; acceptSample = true 20 until acceptSample = = true 21 Sj = s′ 4.3 MAP ESTIMATION Previous approaches for parameter learning with domain knowledge include modified versions of EM or some other optimization techniques that account for linear/convex constraints on the parameters. Since our constraints are non-convex, we propose a new approach based on Gibbs sampling to approximate the posterior distribution, from which we compute the MAP estimate. Although the technique converges to the MAP in the limit, it may require excessive time. Hence, we modify Gibbs sampling to obtain more efficient stochastic hill climbing and greedy search algorithms with anytime properties. The pseudo code for our Gibbs sampler is provided in Algorithm 1. The two key steps are sampling the conditional distributions of each variable (line 6) and rejection sampling to ensure that the constraints are satisfied (lines 9 and 12). We sample each variable given the rest according to the following distributions: ti ∼ Pr(Ti |c, θi ) ∀i cj ∼ Pr(Cj |c − cj , t, θ) ∝ ∏ Pr(Cj ) j ∏ (5) Pr(ti |c, θi ) ∀j i i θj ∼ Pr(Θi |Θ − Θi , t, c, v) ∝ Pr(v|t, Θ) j j ∏ Pr(ti |cj , θi ) ∀i, j (6) (7) i The tests and causes are easily sampled from the multinomials as described in the equations above. However, sampling the θ’s is more difficult due to the factor Pr(v|Θ, t) = max(0, π), which is a truncated mixture of Betas. So, instead of sampling θ from its true conditional, we sample it from a proposal distribution that replaces max(0, π) by an un-truncated mixture of Betas equal to π + a where a is a constant that ensures that π + a is always positive. This is equivalent to ignoring the constraints. Then we ensure that the constraints are satisfied by rejecting the samples that violate the constraints. Once Gibbs sampling has been performed, we obtain a sample that approximates the posterior distribution over the parameters given the constraints (and any data). We return a single setting of the parameters by selecting the sampled instance with the highest posterior probability (i.e., MAP estimate). 
Since we will only return the MAP estimate, it is possible to speed up the search by modifying Gibbs sampling. In particular, we obtain a stochastic hill climbing algorithm by accepting a new sample only if its posterior probability improves upon that of the previous sample 5 Posterior Probability 0.1 0.08 Difference in Gini Impurity 0.1 0.05 0 −0.05 0.06 0.04 0.02 −0.1 1 0 1 1 0.8 0.5 0.6 0.8 0.4 Link Reliability of Test 2 and Cause 1 0 0.6 0.2 0 0.4 Link Reliability of Test 2 and Cause 2 Figure 4: Difference in Gini impurity for the network in 1 2 Fig. 1 when θ2 and θ2 are the only parameters allowed to vary. 0.2 Link Reliability of Test 2 and Cause 1 0 0 0.2 0.4 0.6 0.8 1 Link Reliability of Test 2 and Cause 1 Figure 5: Posterior over parameters computed through calculation after discretization. Figure 6: Posterior over parameters calculated through Sampling. (line 15). Thus, each iteration of the stochastic hill climber requires more time, but always improves the solution. As the number of constraints grows and the feasibility region shrinks, the Gibbs sampler and stochastic hill climber will reject most samples. We can mitigate this by using a Greedy sampler that caps the number of rejected samples, after which it abandons the sampling for the current variable to move on to the next variable (line 19). Even though the feasibility region is small overall, it may still be large in some dimensions, so it makes sense to try sampling another variable (that may have a larger range of feasible values) when it is taking too long to find a new feasible value for the current variable. 4.4 MODEL REFINEMENT WITH INCONSISTENT CONSTRAINTS So far, we have assumed that the expert’s actions generate a feasible region as a consequence of consistent constraints. We handle inconsistencies by further extending our augmented diagnostic Bayes network. We treat the observed constraint variable, V , as a probabilistic indicator of the true constraint V ∗ as shown in Figure 3. We can easily extend our techniques for computing the MAP to cater for this new constraint node by sampling an extra variable. 5 EVALUATION AND EXPERIMENTS 5.1 EVALUATION CRITERIA Formally, for M ∗ , the true model that we aim to learn, the diagnostic process determines the choice of best next test as the one with the smallest Gini impurity. If the correct choice for the next test is known (such as demonstrated by an expert), we can use this information to include a constraint on the model. We denote by V+ the set of observed constraints and by V∗ the set of all possible constraints that hold for M ∗ . Having only observed V+ , our technique will consider any M + ∈ M+ as a possible true model, where M+ is the set of all models that obey V + . We denote by M∗ the set of all models that are diagnostically equivalent to M ∗ (i.e., obey V ∗ and would recommend the MAP same steps as M ∗ ) and by MV+ the particular model obtained by MAP estimation based on the MAP constraints V+ . Similarly, when a dataset D is available, we denote by MD the model obtained MAP by MAP estimation based on D and by MDV+ , the model based on D and V+ . Ideally we would like to find the true underlying model M ∗ , hence we will report the KL divergence between the models found and M ∗ . However, other diagnostically equivalent M ∗ may recommend the same tests as M ∗ and thus have similar constraints, so we also report test consistency with M ∗ (i.e., # of recommended tests that are the same). 
5.2 CORRECTNESS OF MODEL REFINEMENT

Given V∗, our technique for model refinement is guaranteed to choose a model M^MAP ∈ M∗ by construction: if any constraint V∗ ∈ V∗ is violated, the rejection sampling step of our technique rejects that set of parameters. To illustrate this, consider the network in Fig. 2, which has six parameters (four link reliabilities and two leak parameters). Let us fix the leak parameters and the link reliability from the first cause to each test. We can then compute the posterior surface over the two remaining parameters by discretizing each parameter in small steps and calculating the posterior probability at each step, as shown in Fig. 5. Comparing this surface with the one obtained by Gibbs sampling with our technique, shown in Fig. 6, we see that our technique recovers the posterior surface from which we can compute the MAP. We obtain the same MAP estimate with the stochastic hill climbing and greedy search algorithms.

[Figure 7: Mean KL divergence and one standard deviation for a 3-cause, 3-test network when learning with data, constraints, and data+constraints.]
[Figure 8: Test consistency (percentage of tests correctly predicted) for a 3-cause, 3-test network when learning with data, constraints, and data+constraints.]
[Figure 9: Convergence rate comparison of Gibbs sampling, stochastic hill climbing, and greedy sampling: negative log likelihood of the MAP estimate vs. elapsed time, plotted on a log scale from 0 to 1500 seconds.]

5.3 EXPERIMENTAL RESULTS ON SYNTHETIC PROBLEMS

We start by presenting our results on a 3-cause by 3-test fully-connected bipartite Bayes network. We assume that there exists some M∗ ∈ M∗ that we want to learn given V+, and we use our technique to find M^MAP. To evaluate M^MAP, we first compute the constraints V∗ for M∗ to get the feasible region associated with the true model. Next, we sample 100 other models from this feasible region that are diagnostically equivalent. We compare these models with M^MAP (after collecting 200 samples with non-informative priors for the parameters) by computing the KL divergence of M^MAP with respect to each sampled model. We expect the KL divergence to decrease as the number of constraints in V+ increases, since the feasible region becomes smaller. Figure 7 confirms this trend and shows that M^MAP_{DV+} has lower mean KL divergence than M^MAP_{V+}, which in turn has lower mean KL divergence than M^MAP_D. The data points in D are limited to the results of the diagnostic sessions needed to obtain V+; as the constraints increase, more data becomes available, so the results for the data-only approach also improve with increasing constraints.

We also compare the test consistency when learning from data only, constraints only, or both. Given a fixed number of constraints, we enumerate the unobserved trajectories and compute the highest-ranked test using the learnt model and each sampled true model for every trajectory. The test consistency is reported as a percentage, with 100% consistency indicating that the learned and true models had the same highest-ranked test on every trajectory. Figure 8 presents these percentages for the greedy sampling technique (the results are similar for the other techniques).
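For a network as small as this one, both reported metrics can be computed by brute force. The sketch below assumes two hypothetical helpers, joint(model, assignment), returning the probability a model assigns to a complete assignment of its binary variables, and best_test(model, trajectory), returning the model's highest-ranked next test; neither is from the paper.

    import math
    from itertools import product

    def kl_divergence(p_model, q_model, num_vars, joint):
        """KL(P || Q) by enumerating all 2^num_vars binary assignments."""
        kl = 0.0
        for assignment in product((0, 1), repeat=num_vars):
            p = joint(p_model, assignment)
            if p > 0.0:
                kl += p * math.log(p / joint(q_model, assignment))
        return kl

    def test_consistency(learned, true_model, trajectories, best_test):
        """Percentage of trajectories where both models rank the same test first."""
        agree = sum(1 for tr in trajectories
                    if best_test(learned, tr) == best_test(true_model, tr))
        return 100.0 * agree / len(trajectories)

Enumeration is exponential in the number of variables, which is why this style of evaluation is restricted to small synthetic networks.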
It again appears that learning parameters with both constraints and data is better than learning with only constraints, which in turn is usually better than learning with only data. Figure 9 compares the convergence rate of each technique in finding the MAP estimate. As expected, stochastic hill climbing and greedy sampling take less time than Gibbs sampling to find parameter settings with high posterior probability.

5.4 EXPERIMENTAL RESULTS ON REAL-WORLD PROBLEMS

We evaluate our technique on a real-world diagnostic network collected and reported by Agosta et al. [1], where the authors collected detailed session logs over a period of seven weeks in which the entire diagnostic sequence was recorded. The sequences intermingle model building and querying phases. The model network structure was inferred from an expert's sequence of positing causes and tests, and test-ranking constraints were deduced from the expert's test query sequences once the network structure was established.

[Figure 10: Diagnostic Bayesian network collected from user trials and pruned to retain sub-networks with at least one constraint.]
[Figure 11: KL divergence comparison (data only, constraints only, data+constraints) as the number of constraints increases for the real-world problem.]

The 157 sessions captured over the seven weeks resulted in a Bayes network with 115 tests, 82 root causes and 188 arcs. The network consists of several disconnected sub-networks, each identified with a symptom represented by the first test in the sequence, with all subsequent tests applied within the same subnet. There were 20 sessions from which we were able to observe trajectories with at least two tests, resulting in a total of 32 test constraints. We pruned our diagnostic network to remove the sub-networks with no constraints, yielding a Bayes network with 54 tests, 30 root causes, and 67 parameters divided into 7 sub-networks, as shown in Figure 10, and applied our model refinement technique to learn the parameters of each sub-network separately. Since we do not have the true underlying network or the full set of constraints (more constraints could be observed in future diagnostic sessions), we treated the 32 constraints as if they were V∗ and the corresponding feasible region M∗ as if it contained models diagnostically equivalent to the unknown true model. Figure 11 reports the KL divergence between the models found by our algorithms and models sampled from M∗ as we increase the number of constraints. With such limited constraints, and consequently large feasible regions, it is not surprising that the variation in KL divergence is large. Again, the MAP estimate based on both the constraints and the data has lower KL divergence than constraints only and data only.

6 CONCLUSION AND FUTURE WORK

In summary, we presented an approach that can learn the parameters of a Bayes network based on constraints implied by test consistency and any available data. While several approaches exist to incorporate qualitative constraints in learning procedures, our work makes two important contributions: it is the first approach that exploits implicit constraints based on value-of-information assessments, and it is the first approach that can handle non-convex constraints. We demonstrated the approach on synthetic data and on a real-world manufacturing diagnostic problem.
Since data is generally sparse in diagnostics, this work makes an important advance toward mitigating the model acquisition bottleneck, which has so far prevented the widespread application of diagnostic networks. In the future, it would be interesting to generalize this work to reinforcement learning in applications where data is sparse but constraints may be inferred from expert interactions.

Acknowledgments

This work was supported by a grant from Intel Corporation.

References

[1] John Mark Agosta, Omar Zia Khan, and Pascal Poupart. Evaluation results for a query-based diagnostics application. In The Fifth European Workshop on Probabilistic Graphical Models (PGM), Helsinki, Finland, September 13–15, 2010.
[2] Eric E. Altendorf, Angelo C. Restificar, and Thomas G. Dietterich. Learning from sparse data by exploiting monotonicity constraints. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), Edinburgh, Scotland, July 2005.
[3] Brigham S. Anderson and Andrew W. Moore. Fast information value for graphical models. In Proceedings of the Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 51–58, Vancouver, BC, Canada, December 2005.
[4] Cassio P. de Campos and Qiang Ji. Improving Bayesian network parameter learning using constraints. In International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 2008.
[5] Marek J. Druzdzel and Linda C. van der Gaag. Elicitation of probabilities for belief networks: combining qualitative and quantitative information. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 141–148, Montreal, QC, Canada, 1995.
[6] Ad J. Feelders. A new parameter learning method for Bayesian networks with qualitative influences. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, July 2007.
[7] Mara Angeles Gil and Pedro Gil. A procedure to test the suitability of a factor for stratification in estimating diversity. Applied Mathematics and Computation, 43(3):221–229, 1991.
[8] David Heckerman and John S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 26(6):826–831, November 1996.
[9] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–56, 1995.
[10] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, August 1966.
[11] Percy Liang, Michael I. Jordan, and Dan Klein. Learning from measurements in exponential families. In Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
[12] Wenhui Liao and Qiang Ji. Learning Bayesian network parameters under incomplete data with domain knowledge. Pattern Recognition, 42:3046–3056, 2009.
[13] Yi Mao and Guy Lebanon. Domain knowledge uncertainty and probabilistic parameter constraints. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 2009.
[14] Ryszard S. Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–116, 1984.
[15] Radu Stefan Niculescu, Tom M. Mitchell, and R. Bharat Rao. Bayesian network learning with parameter constraints. Journal of Machine Learning Research, 7:1357–1383, 2006.
[16] Mark A. Peot and Ross D. Shachter. Learning from what you don't observe.
In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 439–446, Madison, WI, July 1998.
[17] Michael P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, August 1990.
[18] Frank Wittig and Anthony Jameson. Exploiting qualitative knowledge in the learning of conditional probabilities of Bayesian networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, July 2000.

same-paper 3 0.92965311 145 nips-2011-Learning Eigenvectors for Free

4 0.90799755 13 nips-2011-A blind sparse deconvolution method for neural spike identification

Author: Chaitanya Ekanadham, Daniel Tranchina, Eero P. Simoncelli

Abstract: We consider the problem of estimating neural spikes from extracellular voltage recordings. Most current methods are based on clustering, which requires substantial human supervision and systematically mishandles temporally overlapping spikes. We formulate the problem as one of statistical inference, in which the recorded voltage is a noisy sum of the spike trains of each neuron convolved with its associated spike waveform. Joint maximum-a-posteriori (MAP) estimation of the waveforms and spikes is then a blind deconvolution problem in which the coefficients are sparse. We develop a block-coordinate descent procedure to approximate the MAP solution, based on our recently developed continuous basis pursuit method. We validate our method on simulated data as well as real data for which ground truth is available via simultaneous intracellular recordings. In both cases, our method substantially reduces the number of missed spikes and false positives when compared to a standard clustering algorithm, primarily by recovering overlapping spikes. The method offers a fully automated alternative to clustering methods that is less susceptible to systematic errors. 1

5 0.85855806 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik

Abstract: The ℓ1-regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.

6 0.72303635 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

7 0.68099642 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

8 0.66329062 249 nips-2011-Sequence learning with hidden units in spiking neural networks

9 0.62761027 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

10 0.59334773 86 nips-2011-Empirical models of spiking in neural populations

11 0.58386588 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models

12 0.57841438 219 nips-2011-Predicting response time and error rates in visual search

13 0.57405692 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories

14 0.57297426 75 nips-2011-Dynamical segmentation of single trials from population neural data

15 0.57036877 102 nips-2011-Generalised Coupled Tensor Factorisation

16 0.55775928 217 nips-2011-Practical Variational Inference for Neural Networks

17 0.55351365 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

18 0.55024946 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

19 0.5493266 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

20 0.54670829 24 nips-2011-Active learning of neural response functions with Gaussian processes