jmlr jmlr2010 jmlr2010-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
Reference: text
sentIndex sentText sentNum sentScore
1 The goal is to have a predictor whose predicted probabilities converge (in a certain sense) to the correct ones (that is, to µ-conditional probabilities). [sent-29, score-0.428]
2 In other words, one cannot have a predictor whose error goes to zero for any measure µ. [sent-31, score-0.431]
3 The questions addressed in this work are a part of the following general problem: given an arbitrary set C of measures, how can we find a predictor that performs well when the data is generated by any µ ∈ C , and whether it is possible to find such a predictor at all. [sent-33, score-0.793]
4 More precisely, we show that if a predictor that predicts every µ ∈ C exists, then such a predictor can also be obtained as a weighted sum of countably many elements of C . [sent-40, score-1.169]
5 This result can also be viewed as a justification of the Bayesian approach to sequence prediction: if there exists a predictor which predicts well every measure in the class, then there exists a Bayesian predictor (with a rather simple prior) that has this property too. [sent-41, score-1.17]
6 In this respect it is important to note that the result obtained about such a Bayesian predictor is pointwise (holds for every µ in C ), and stretches far beyond the set its prior is concentrated on. [sent-42, score-0.509]
7 Next, we derive some characterizations of families C for which a predictor exists. [sent-43, score-0.502]
8 We then derive some sufficient conditions for the existence of a predictor which are based on local (truncated to the first n observation) behaviour of measures in the class C . [sent-45, score-0.656]
9 1 Prior Work As was mentioned, if the class C of measures is countable (that is, if C can be represented as C := {µk : k ∈ N}), then there exists a predictor which performs well for any µ ∈ C . [sent-56, score-0.731]
10 Such a predictor can be obtained as a Bayesian mixture ρS := ∑k∈N wk µk , where wk are summable positive real weights, and it has very strong predictive properties; in particular, ρS predicts every µ ∈ C in total variation distance, as follows from the result of Blackwell and Dubins (1962). [sent-57, score-1.407]
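To make this concrete, here is a minimal sketch of such a discrete Bayesian mixture used as a one-step predictor; the callable-measure interface, the particular weights, and the Bernoulli demo class are illustrative assumptions of the sketch, not constructions taken from the paper.

```python
import numpy as np

def mixture_predictor(measures, weights, history, alphabet):
    """Next-symbol conditional probabilities of the mixture rho_S := sum_k w_k mu_k:
    rho_S(a | x_1..n) = sum_k w_k mu_k(x_1..n, a) / sum_k w_k mu_k(x_1..n).
    Each measures[k] is assumed to be a callable returning mu_k(seq), the probability
    of the finite sequence seq under the k-th measure."""
    denom = sum(w * m(history) for w, m in zip(weights, measures))
    return {a: sum(w * m(history + [a]) for w, m in zip(weights, measures)) / denom
            for a in alphabet}

def bernoulli(p):
    """mu_p: Bernoulli i.i.d. measure with parameter p = P(x = 1)."""
    return lambda seq: float(np.prod([p if x == 1 else 1 - p for x in seq]))

# Three Bernoulli measures stand in for a countable class {mu_k : k in N}.
measures = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
weights = [0.25, 0.25, 0.5]   # summable positive weights, here already summing to 1
print(mixture_predictor(measures, weights, [1, 1, 0, 1], alphabet=[0, 1]))
```

Evaluating this ratio is the same as updating the weights to their posterior values wk µk(x1..n) / ∑j wj µj(x1..n), which is what makes ρS a Bayesian predictor.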
11 Total variation distance measures the difference in (predicted and true) conditional probabilities of all future events, that is, not only the probabilities of the next observations, but also of observations that are arbitrarily far off in the future (see formal definitions below). [sent-58, score-0.416]
12 A key observation here is that a predictor ρS = ∑ wk µk may be a good predictor not only when the data is generated by one of the processes µk , k ∈ N, but also when it comes from a much larger class. [sent-66, score-1.123]
13 As was found by Ryabko (1988), the combination ρR := ∑ wk λk is a good predictor not only for the set ∪k∈N Mk of all finite-memory processes, but also for any measure µ coming from a much larger class: that of all stationary measures on X ∞ . [sent-77, score-0.903]
14 Here prediction is possible only in the Cesaro sense (more precisely, ρR predicts every stationary process in expected time-average Kullback-Leibler divergence, see definitions below). [sent-78, score-0.51]
15 The Laplace predictor itself can be obtained as a Bayes mixture over all Bernoulli i.i.d. measures. [sent-79, score-0.432]
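As a quick illustration, a minimal sketch of the Laplace (add-one) predictor follows; presumably this is the predictor referred to as (1) later in the text, and the rule-of-succession equivalence with a uniform-prior Bayes mixture over Bernoulli i.i.d. measures is the standard fact being used here.

```python
from collections import Counter

def laplace_predictor(history, alphabet):
    """Laplace predictor: rho_L(x_{n+1} = a | x_1..n) = (#{i <= n : x_i = a} + 1) / (n + |X|)."""
    counts = Counter(history)
    n = len(history)
    return {a: (counts[a] + 1) / (n + len(alphabet)) for a in alphabet}

print(laplace_predictor([0, 1, 1, 0, 1], alphabet=[0, 1]))  # {0: 3/7, 1: 4/7}
```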
16 Taking a dense subset of the values of these parameters, and a mixture of the corresponding measures, results in a predictor for the class of k-order Markov processes. [sent-90, score-0.471]
17 Mixing over these (for all k ∈ N) yields, as in Ryabko (1988), a predictor for the class of all stationary processes. [sent-91, score-0.482]
18 Thus, for the mentioned classes of processes, a predictor can be obtained as a Bayes mixture of countably many measures in the class. [sent-92, score-0.652]
19 Indeed, a natural way to obtain a predictor for a class C of stochastic processes is to take a Bayesian mixture of the class. [sent-94, score-0.58]
20 The results of the present work show that when a predictor exists it can indeed be given as a Bayesian predictor, which predicts every (and not almost every) measure in the class, while its support is only a countable set. [sent-107, score-0.978]
21 In Blackwell and Dubins (1962) it was shown that if one measure is absolutely continuous with respect to another, then the latter predicts the former (the conditional probabilities converge in a very strong sense). [sent-110, score-0.427]
22 2 The Results First, we show that if there is a predictor that performs well for every measure coming from a class C of processes, then a predictor can also be obtained as a convex combination ∑k∈N wk µk for some µk ∈ C and some wk > 0, k ∈ N. [sent-113, score-1.367]
23 The analysis for the total variation case relies on the fact that if ρ predicts µ in total variation distance, then µ is absolutely continuous with respect to ρ, so that ρ(x1. [sent-115, score-0.716]
24 However, if we settle for a weaker measure of performance, such as expected average KL divergence, measures µ ∈ C are typically singular with respect to a predictor ρ. [sent-120, score-0.693]
25 Combining these predictors for all n results in a predictor that predicts every µ ∈ C in average KL divergence. [sent-126, score-0.819]
26 The proof techniques developed have a potential to be used in solving other questions concerning sequence prediction, in particular, the general question of how to find a predictor for an arbitrary class C of measures. [sent-127, score-0.495]
27 We then exhibit some sufficient conditions on the class C , under which a predictor for all measures in C exists. [sent-128, score-0.561]
28 Conditions of the first type concern separability of C with respect to the total variation distance and the expected average KL divergence. [sent-131, score-0.419]
29 We show that in the case of total variation, separability is a necessary and sufficient condition for the existence of a predictor, whereas in the case of expected average KL divergence it is sufficient but not necessary. [sent-132, score-0.488]
30 We show that, in general, if cn grows subexponentially then a predictor exists that predicts any measure in C in expected average KL divergence. [sent-147, score-0.856]
31 In Section 3 we show that if any predictor works, then there is a Bayesian one that works, while in Section 4 we provide several characterizations of predictable classes of processes. [sent-154, score-0.452]
32 Definition 1 We say that ρ predicts µ in total variation if v(µ, ρ, x1..n ) := sup_{A∈F} |µ(A|x1..n ) − ρ(A|x1..n )| → 0 µ-almost surely. [sent-192, score-0.425]
33 Thus, for a class C of measures there is a predictor ρ that predicts every µ ∈ C in total variation if and only if every µ ∈ C has a density with respect to ρ. [sent-199, score-1.182]
34 Definition 3 We say that ρ predicts µ in expected average KL divergence if (1/n) dn (µ, ρ) → 0. [sent-217, score-0.581]
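For reference, the divergence dn used in this definition and in the statements below is (reconstructing the standard form implied by the later formulas) the KL divergence between the two measures restricted to the first n observations:

```latex
\[
d_n(\mu,\rho) := \mathbb{E}_{\mu}\,\log\frac{\mu(x_1,\dots,x_n)}{\rho(x_1,\dots,x_n)}
= \sum_{x_{1..n}\in X^n}\mu(x_{1..n})\,\log\frac{\mu(x_{1..n})}{\rho(x_{1..n})},
\qquad\text{so that prediction means } \tfrac{1}{n}\, d_n(\mu,\rho)\to 0 .
\]
```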
35 With prediction quality so measured, predictors exist for relatively large classes of measures; most notably, Ryabko (1988) provides a predictor which predicts every stationary process in expected average KL divergence. [sent-219, score-1.01]
36 Fully Nonparametric Bayes Predictors In this section we show that if there is a predictor that predicts every µ in some class C , then there is a Bayesian mixture of countably many elements from C that predicts every µ ∈ C too. [sent-233, score-1.148]
37 If there is a measure ρ such that ρ predicts every µ ∈ C in total variation, then there is a sequence µk ∈ C , k ∈ N such that the measure ν := ∑k∈N wk µk predicts every µ ∈ C in total variation, where wk are any positive weights that sum to 1. [sent-237, score-1.381]
38 This relatively simple fact can be proven in different ways, relying on the mentioned equivalence (Blackwell and Dubins, 1962; Kalai and Lehrer, 1994) of the statements “ρ predicts µ in total variation distance” and “µ is absolutely continuous with respect to ρ. [sent-238, score-0.528]
39 Next we will construct a sequence of measures µk ∈ C , k ∈ N such that the union of the sets Tµk has probability 1 with respect to every µ ∈ C , and will show that this is a sequence of measures whose existence is asserted in the theorem statement. [sent-256, score-0.551]
40 Thus, µ is absolutely continuous with respect to ν, and so, by Theorem 2, ν predicts µ in total variation distance. [sent-285, score-0.528]
41 Thus, examples of families C for which there is a ρ that predicts every µ ∈ C in total variation, are limited to families of measures which have a density with respect to some measure ρ. [sent-286, score-0.814]
42 In particular, even the class of all Bernoulli i.i.d. measures does not allow for a predictor that predicts every measure in total variation. [sent-292, score-0.949]
43 As was mentioned, predicting in total variation distance means predicting with arbitrarily growing horizon (Kalai and Lehrer, 1994), while prediction in expected average KL divergence is only concerned with the probabilities of the next observation, and only on time and data average. [sent-294, score-0.499]
44 For the latter measure of prediction quality, consistent predictors exist not only for the class of all Bernoulli processes, but also for the class of all stationary processes (Ryabko, 1988). [sent-295, score-0.41]
45 If there is a measure ρ such that ρ predicts every µ ∈ C in expected average KL divergence, then there exist a sequence µk ∈ C , k ∈ N and a sequence wk > 0, k ∈ N, such that ∑k∈N wk = 1, and the measure ν := ∑k∈N wk µk predicts every µ ∈ C in expected average KL divergence. [sent-298, score-1.612]
46 An interesting related question (which is beyond the scope of this paper) is how to choose the weights so as to optimize the behaviour of a predictor before the asymptotic. [sent-302, score-0.508]
47 ) We then use, for each given n, the same scheme to cover the set X n with countably many Tµn , as was used in the proof of Theorem 4 to construct a countable covering of the set X ∞ , obtaining for each n a predictor νn . [sent-314, score-0.695]
48 Then the predictor ν is obtained as ∑n∈N wn νn , where the weights decrease subexponentially. [sent-315, score-0.502]
49 We will show that ν predicts every µ ∈ C , and then, at the end of the proof (Step r), we will show how to replace γ by a combination of a countable set of elements of C (in fact, γ is just a regularizer which ensures that the ν-probability of any word is never too close to 0). [sent-369, score-0.55]
50 Thus, from (11) and (8) we get II ≥ −µ(Tµn \Tjn ) log ρ(Tµn \Tjn ) − 1/2 ≥ −µ(Tµn \Tjn ) log εn − 1/2. [sent-421, score-0.601]
51 Combining (9) with the bounds (10), (12) and (13) we obtain dn (µ, ρ) ≥ −log n − µ(Tµn \Tjn ) log εn − 1 − log |X |, so that µ(Tµn \Tjn ) ≤ (dn (µ, ρ) + log n + 1 + log |X |) / (−log εn ). (14) [sent-429, score-1.022]
52 Since dn (µ, ρ) = o(n), we can define the parameters εn in such a way that −log εn = o(n) while at the same time the bound (14) gives µ(Tµn \Tjn ) = o(1). [sent-430, score-0.474]
53 In this case, any mixture predictor ρ := ∑k∈N wk µk predicts all µ ∈ C both in total variation and in expected average KL divergence. [sent-543, score-1.148]
54 The problem of predicting all computable measures was introduced in Solomonoff (1978), where a mixture predictor was proposed. [sent-546, score-0.574]
55 Therefore, by Theorem 2, a predictor that predicts any µ ∈ CB in total variation does not exist, demonstrating that this notion of prediction is rather strong. [sent-559, score-0.873]
56 It is well known (see, e.g., Krichevsky, 1993) that the Laplace predictor (1) predicts every Bernoulli i.i.d. measure in expected average KL divergence. [sent-562, score-0.705]
57 Hence, Theorem 4 implies that there is a countable mixture predictor for this family too. [sent-566, score-0.67]
58 Let {µq : q ∈ Q} be the set of all Bernoulli i.i.d. measures with rational probability of 0, and let ρ := ∑q∈Q wq µq , where wq are arbitrary positive weights that sum to 1. [sent-571, score-0.421]
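A small self-contained sketch of this construction (the finite truncation of Q, the dyadic weights, and the helper names are illustrative choices, not prescribed by the text):

```python
from fractions import Fraction
import numpy as np

def bernoulli(p):
    """mu_q: Bernoulli i.i.d. measure with P(x = 0) = p."""
    return lambda seq: float(np.prod([p if x == 0 else 1 - p for x in seq]))

# A finite truncation of Q, the rationals in [0, 1].
Q = sorted({Fraction(a, b) for b in range(1, 8) for a in range(0, b + 1)})
w = np.array([2.0 ** -(i + 1) for i in range(len(Q))])
w /= w.sum()                                  # positive weights summing to 1

def rho(seq):
    """rho(x_1..n) = sum_q w_q * mu_q(x_1..n)."""
    return sum(wi * bernoulli(float(q))(seq) for wi, q in zip(w, Q))

history = [1, 0, 1, 1]
denom = rho(history)
print({a: rho(history + [a]) / denom for a in (0, 1)})  # next-symbol probabilities
```

Evaluating ρ(x1..n a)/ρ(x1..n) directly, as above, is equivalent to weighting each µq by its posterior weight; any positive weights summing to 1 suffice for the asymptotic claim.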
59 (1/n) dn (µp , ρ) = (1/n) Eµp log [µp (x1..n )/ρ(x1..n )] ≤ (1/n) Eµp log [µp (x1..n )/(wq µq (x1..n ))]. [sent-582, score-0.397]
60 Since this holds for each ε, we conclude that (1/n) dn (µp , ρ) → 0 and ρ predicts every µ ∈ CB in expected average KL divergence. [sent-587, score-0.551]
61 In Ryabko (1988) a predictor ρR was constructed which predicts every stationary process µ ∈ CS in expected average KL divergence. [sent-589, score-0.858]
62 (This predictor is obtained as a mixture of predictors for k-order Markov sources, for all k ∈ N. [sent-590, score-0.522]
63 ) Therefore, Theorem 5 implies that there is also a countable mixture predictor for this family of processes. [sent-591, score-0.67]
64 Such a predictor can be constructed as follows (the proof in this example is based on the proof in Ryabko and Astola, 2006, Appendix 1). [sent-592, score-0.442]
65 We will show that any predictor ν := ∑k∈N ∑q∈Q^{2^k} wk w^k_q µ^k_q , where wk , k ∈ N, and w^k_q , q ∈ Q^{2^k} , k ∈ N, are any sequences of positive real weights that sum to 1, predicts every stationary µ ∈ CS in expected average KL divergence. [sent-599, score-1.548]
66 It is easy to see that for every ε > 0 and every k ∈ N we can find a k-order stationary Markov measure µ^k_{qε} with rational values of the parameters, qε ∈ Q^{2^k} , such that Eµ log [µ(xk+1 |x1..k ) / µ^k_{qε}(xk+1 |x1..k )] ≤ ε. (22) [sent-605, score-0.425]
67 We have (1/n) dn (µ, ν) ≤ −(log wk w^k_{qε})/n + (1/n) dn (µ, µ^k_{qε}) = O(k/n) + (1/n) Eµ log [µ(x1..n )/µ^k_{qε}(x1..n )]. [sent-610, score-0.732]
68 We will construct a sequence of measures νk , k ∈ N, a measure µ, and two sequences of positive weights wk and w′k with ∑k∈N wk = ∑k∈N w′k = 1, for which ν := ∑k∈N wk νk predicts µ in expected average KL divergence, but ν′ := ∑k∈N w′k νk does not. [sent-630, score-1.287]
69 We have dn (µ, ν) = −log(∑k≥n wk ) ≤ −log(wn−2 ) = o(n), but dn (µ, ν′ ) = −log(∑k≥n w′k ) = −log(2^{−n+1} ) = n − 1 ≠ o(n), proving the claim. [sent-634, score-0.584]
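A quick check of the arithmetic behind the second half of this claim (logarithms taken base 2, as the value n − 1 indicates, and assuming the dyadic weights w′k = 2^{−k} implied by the displayed sum):

```latex
\[
\sum_{k\ge n} w'_k = \sum_{k\ge n} 2^{-k} = 2^{-n+1},
\qquad
-\log_2(2^{-n+1}) = n-1 \neq o(n).
\]
```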
70 Characterizing Predictable Classes Knowing that a mixture of a countable subset gives a predictor if there is one, a notion that naturally comes to mind, when trying to characterize families of processes for which a predictor exists, is separability. [sent-636, score-1.231]
71 Can we say that there is a predictor for a class C of measures if and only if C is separable? [sent-637, score-0.528]
72 Nonetheless, in the case of total variation distance we obviously have a candidate topology: that of total variation distance, and indeed separability with respect to this topology is equivalent to the existence of a predictor, as the next theorem shows. [sent-642, score-0.702]
73 1 Separability Definition 6 (unconditional total variation distance) Introduce the (unconditional) total variation distance v(µ, ρ) := sup_{A∈F} |µ(A) − ρ(A)|. [sent-647, score-0.46]
74 There is a measure ρ such that ρ predicts every µ ∈ C in total variation if and only if C is separable with respect to the topology of total variation distance. [sent-649, score-0.877]
75 In this case, any measure ν of the form ν = ∑_{k=1}^∞ wk µk , where {µk : k ∈ N} is any dense countable subset of C and wk are any positive weights that sum to 1, predicts every µ ∈ C in total variation. [sent-650, score-1.168]
76 Let C be separable in total variation distance, and let D = {νk : k ∈ N} be its dense countable subset. [sent-652, score-0.498]
77 We have to show that ν := ∑k∈N wk νk , where wk are any positive real weights that sum to 1, predicts every µ ∈ C in total variation. [sent-653, score-0.881]
78 Hence νk (A) ≥ µ(A) − v(µ, νk ) ≥ ε/2 and ν(A) ≥ wk νk (A) ≥ wk ε/2 > 0. [sent-657, score-0.468]
79 The set of all measures that have a density with respect to ρ, is separable with respect to this distance (for example, a dense countable subset can be constructed based on measures whose densities are step-functions, that take only rational values, see, e. [sent-661, score-0.761]
80 Definition 8 (asymptotic KL “distance” D) Define asymptotic expected average KL divergence between measures µ and ρ as D(µ, ρ) = lim sup_{n→∞} (1/n) dn (µ, ρ). (24) [sent-668, score-0.56]
81 (ii) There is an uncountable set C of measures, and a measure ν, such that ν predicts every µ ∈ C in expected average KL divergence, but µ1 ≠ µ2 implies D(µ1 , µ2 ) = ∞ for every µ1 , µ2 ∈ C ; in particular, C is not separable with respect to D. [sent-670, score-0.589]
82 dn (µ, ν) ≤ Eµ log [µ(x1..n )/(wk νk (x1..n ))] = −log wk + dn (µ, νk ) ≤ nε + o(n). [sent-677, score-0.557]
83 It is easy to check that µ1 ≠ µ2 implies D(µ1 , µ2 ) = ∞ for every µ1 , µ2 ∈ C , but the predictor ν, given by ν(xn = 0) := 1/n independently for different n, predicts every µ ∈ C in expected average KL divergence. [sent-685, score-0.844]
84 For Bernoulli i.i.d. and k-order Markov processes, the (countable) sets of processes that have rational values of the parameters, considered in the previous section, are dense both in the topology of the parametrization and with respect to the asymptotic average divergence D. [sent-692, score-0.487]
85 On the other hand, neither of these latter families is separable with respect to the topology of total variation distance. [sent-695, score-0.418]
86 2 Conditions Based on the Local Behaviour of Measures Next we provide some sufficient conditions for the existence of a predictor based on local characteristics of the class of measures, that is, measures truncated to the first n observations. [sent-697, score-0.597]
87 Yet, the class of measures on X n , obtained by truncating all measures in C0 to the first n observations, coincides with what would be obtained by truncating all deterministic measures to the first n observations, the latter class being obviously not predictable at all (see also examples below). [sent-701, score-0.547]
88 The measures λC (x1..n ) (indexed by n) in general do not immediately define a stochastic process over X ∞ (λC are not consistent for different n); thus, in particular, using average KL divergence for measuring prediction quality would not make sense, since dn (µ(·|x1. [sent-727, score-0.404]
89 Yet, by taking an appropriate mixture, it is still possible to construct a predictor (a stochastic process) based on λ, that predicts all the measures in the class. [sent-741, score-0.796]
90 (1/n) dn (µ, ρ) ≤ (1/n) log(cn (C ) n^2 /w) = (1/n)(log cn (C ) + 2 log n − log w). [sent-760, score-0.434]
91 Indeed, for the class of all deterministic processes (that is, each process from the class produces some fixed sequence of observations with probability 1) we have cn = 2^n , while obviously for this class a predictor does not exist. [sent-790, score-0.667]
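Plugging this class into the bound sketched above gives a quick consistency check (the −log w term is the contribution of the fixed weight in that bound):

```latex
\[
c_n(\mathcal{C}) = 2^{n} \;\Longrightarrow\;
\frac{1}{n}\left(\log c_n(\mathcal{C}) + 2\log n - \log w\right)
= \log 2 + \frac{2\log n - \log w}{n} \not\to 0,
\]
```

so the sufficient condition fails for the deterministic class, which is consistent with (though, since the condition is only sufficient, not a proof of) the fact that no predictor exists for it.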
92 For a class C of measures we are interested in a predictor that has a small (or minimal) worst-case (with respect to the class C ) probability of error. [sent-802, score-0.569]
93 There exists a measure ρC such that (1/n) dn (µ, ρC ) ≤ C(C^n )/n + O(log n / n); in particular, if C(C^n )/n → 0, then ρC predicts every µ ∈ C in expected average KL divergence. [sent-820, score-0.67]
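For orientation, C(C^n) is the capacity of the "channel" from the class to the first n observations; under the standard information-theoretic definition (e.g., Gallager), which this statement presumably relies on, it can be written as

```latex
\[
C(\mathcal{C}^n) = \sup_{\pi} I_{\pi}\left(\mu \,;\, x_{1..n}\right),
\]
```

where the supremum runs over priors π on the class and I_π is the mutual information between the measure drawn according to π and the observations; by the redundancy-capacity theorem this equals the minimax redundancy min_ρ sup_{µ∈C} dn(µ, ρ), which is presumably what the bound above exploits.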
94 Thus, if the channel capacity C(C^n ) grows sublinearly, a predictor can be constructed for the class of processes C . [sent-839, score-0.643]
95 In this case the problem of constructing the predictor is reduced to finding the channel capacities for different n and finding the corresponding measures on which they are attained or approached. [sent-840, score-0.607]
96 More generally, the questions we addressed in this work are a part of a larger problem: given an arbitrary class C of stochastic processes, find the best predictor for it. [sent-852, score-0.438]
97 The second one is to characterize families of processes for which a predictor exists. [sent-855, score-0.596]
98 Towards this end, we have found a rather simple form that some solution to this question has if a solution exists: a Bayesian predictor whose prior is concentrated on a countable set. [sent-862, score-0.615]
99 We have also identified some sufficient conditions under which a predictor can actually be constructed (e. [sent-863, score-0.419]
100 However, the larger question of how to construct an optimal predictor for an arbitrary given family of processes, remains open. [sent-866, score-0.468]
wordName wordTfidf (topN-words)
[('ryabko', 0.432), ('predictor', 0.386), ('predicts', 0.237), ('wk', 0.234), ('countable', 0.203), ('daniil', 0.178), ('kl', 0.177), ('tk', 0.176), ('dn', 0.175), ('jn', 0.151), ('measures', 0.142), ('variation', 0.131), ('rbitrary', 0.127), ('redictors', 0.127), ('processes', 0.117), ('divergence', 0.112), ('cn', 0.106), ('hutter', 0.102), ('stationary', 0.096), ('separability', 0.095), ('families', 0.093), ('predictors', 0.09), ('rocesses', 0.09), ('tkn', 0.089), ('every', 0.082), ('wn', 0.079), ('channel', 0.079), ('kalai', 0.078), ('countably', 0.078), ('dubins', 0.076), ('lehrer', 0.076), ('wq', 0.076), ('blackwell', 0.076), ('bernoulli', 0.074), ('log', 0.074), ('cb', 0.063), ('prediction', 0.062), ('absolutely', 0.062), ('capacity', 0.061), ('behaviour', 0.059), ('total', 0.057), ('gallager', 0.054), ('mk', 0.053), ('krichevsky', 0.051), ('nml', 0.051), ('topology', 0.051), ('environment', 0.048), ('sup', 0.046), ('mixture', 0.046), ('rational', 0.046), ('measure', 0.045), ('separable', 0.045), ('tp', 0.045), ('predictable', 0.043), ('mn', 0.043), ('probabilities', 0.042), ('respect', 0.041), ('theorem', 0.04), ('dense', 0.039), ('fix', 0.038), ('distance', 0.038), ('weights', 0.037), ('existence', 0.036), ('environments', 0.036), ('bayesian', 0.036), ('family', 0.035), ('sequence', 0.034), ('sequences', 0.033), ('conditions', 0.033), ('hk', 0.033), ('expected', 0.033), ('xn', 0.032), ('markov', 0.031), ('stochastic', 0.031), ('universal', 0.029), ('laplace', 0.029), ('parametrization', 0.029), ('asymptotic', 0.028), ('proof', 0.028), ('truncating', 0.027), ('question', 0.026), ('cesaro', 0.025), ('plesner', 0.025), ('solomonoff', 0.025), ('subexponentially', 0.025), ('indeed', 0.025), ('cs', 0.025), ('average', 0.024), ('deterministic', 0.024), ('density', 0.024), ('parametrized', 0.024), ('kn', 0.023), ('let', 0.023), ('minimax', 0.023), ('characterizations', 0.023), ('weaker', 0.022), ('jackson', 0.022), ('normalizer', 0.022), ('actions', 0.021), ('arbitrary', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.11011568 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
3 0.091482624 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
4 0.088997163 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
5 0.079442836 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
6 0.076500662 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.067857884 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
8 0.066847801 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
9 0.061472077 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
10 0.058571644 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
11 0.05565061 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
12 0.053744659 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.050798103 84 jmlr-2010-On Spectral Learning
14 0.048953597 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
15 0.047399953 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
16 0.045367859 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.045325283 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
18 0.044336993 82 jmlr-2010-On Learning with Integral Operators
19 0.042965744 27 jmlr-2010-Consistent Nonparametric Tests of Independence
20 0.041930642 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
topicId topicWeight
[(0, -0.184), (1, -0.068), (2, 0.056), (3, -0.084), (4, -0.078), (5, -0.025), (6, -0.019), (7, 0.127), (8, 0.075), (9, 0.069), (10, 0.016), (11, 0.061), (12, 0.036), (13, -0.177), (14, 0.133), (15, -0.053), (16, 0.123), (17, -0.287), (18, 0.145), (19, -0.032), (20, 0.081), (21, -0.007), (22, 0.021), (23, 0.022), (24, -0.108), (25, 0.173), (26, 0.159), (27, -0.075), (28, -0.104), (29, 0.084), (30, -0.021), (31, -0.004), (32, -0.038), (33, 0.155), (34, 0.138), (35, -0.145), (36, 0.021), (37, 0.303), (38, 0.048), (39, 0.077), (40, 0.059), (41, -0.038), (42, 0.076), (43, -0.001), (44, -0.107), (45, -0.078), (46, -0.169), (47, -0.072), (48, -0.059), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96652955 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.46395227 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
3 0.40282175 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
4 0.29267952 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
5 0.28074342 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
6 0.27671862 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
7 0.27109864 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
8 0.27094439 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
9 0.26056501 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.25860777 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.24752416 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
12 0.23444711 84 jmlr-2010-On Spectral Learning
13 0.22635043 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
14 0.22363566 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
16 0.20830564 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.20411347 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
18 0.17913398 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
19 0.16125561 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
20 0.15823874 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
topicId topicWeight
[(3, 0.045), (8, 0.021), (15, 0.48), (21, 0.017), (24, 0.01), (32, 0.045), (36, 0.062), (37, 0.041), (75, 0.116), (81, 0.014), (85, 0.035), (96, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.81994438 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.66930759 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
3 0.37009147 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
Author: Nguyen Xuan Vinh, Julien Epps, James Bailey
Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance
5 0.34558347 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
6 0.34308174 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.34208769 63 jmlr-2010-Learning Instance-Specific Predictive Models
9 0.33073795 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.32843482 27 jmlr-2010-Consistent Nonparametric Tests of Independence
11 0.32550594 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
12 0.32548296 60 jmlr-2010-Learnability, Stability and Uniform Convergence
13 0.3216477 25 jmlr-2010-Composite Binary Losses
14 0.32100528 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
15 0.32011306 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
16 0.31993341 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
17 0.31985566 109 jmlr-2010-Stochastic Composite Likelihood
18 0.31400359 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
19 0.31210312 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
20 0.31155694 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm