jmlr jmlr2010 jmlr2010-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
Reference: text
sentIndex sentText sentNum sentScore
1 The goal is to have a predictor whose predicted probabilities converge (in a certain sense) to the correct ones (that is, to µ-conditional probabilities). [sent-29, score-0.428]
2 In other words, one cannot have a predictor whose error goes to zero for any measure µ. [sent-31, score-0.431]
3 The questions addressed in this work are a part of the following general problem: given an arbitrary set C of measures, how can we find a predictor that performs well when the data is generated by any µ ∈ C , and whether it is possible to find such a predictor at all. [sent-33, score-0.793]
4 More precisely, we show that if a predictor that predicts every µ ∈ C exists, then such a predictor can also be obtained as a weighted sum of countably many elements of C . [sent-40, score-1.169]
5 This result can also be viewed as a justification of the Bayesian approach to sequence prediction: if there exists a predictor which predicts well every measure in the class, then there exists a Bayesian predictor (with a rather simple prior) that has this property too. [sent-41, score-1.17]
6 In this respect it is important to note that the result obtained about such a Bayesian predictor is pointwise (holds for every µ in C ), and stretches far beyond the set its prior is concentrated on. [sent-42, score-0.509]
7 Next, we derive some characterizations of families C for which a predictor exists. [sent-43, score-0.502]
8 We then derive some sufficient conditions for the existence of a predictor which are based on local (truncated to the first n observation) behaviour of measures in the class C . [sent-45, score-0.656]
9 1 Prior Work As was mentioned, if the class C of measures is countable (that is, if C can be represented as C := {µk : k ∈ N}), then there exists a predictor which performs well for any µ ∈ C . [sent-56, score-0.731]
10 Such a predictor can be obtained as a Bayesian mixture ρS := ∑k∈N wk µk , where wk are summable positive real weights, and it has very strong predictive properties; in particular, ρS predicts every µ ∈ C in total variation distance, as follows from the result of Blackwell and Dubins (1962). [sent-57, score-1.407]
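To make this concrete, here is a minimal sketch of such a discrete Bayesian mixture used as a one-step predictor; the callable-measure interface, the particular weights, and the Bernoulli demo class are illustrative assumptions of the sketch, not constructions taken from the paper.

```python
import numpy as np

def mixture_predictor(measures, weights, history, alphabet):
    """Next-symbol conditional probabilities of the mixture rho_S := sum_k w_k mu_k:
    rho_S(a | x_1..n) = sum_k w_k mu_k(x_1..n, a) / sum_k w_k mu_k(x_1..n).
    Each measures[k] is assumed to be a callable returning mu_k(seq), the probability
    of the finite sequence seq under the k-th measure."""
    denom = sum(w * m(history) for w, m in zip(weights, measures))
    return {a: sum(w * m(history + [a]) for w, m in zip(weights, measures)) / denom
            for a in alphabet}

def bernoulli(p):
    """mu_p: Bernoulli i.i.d. measure with parameter p = P(x = 1)."""
    return lambda seq: float(np.prod([p if x == 1 else 1 - p for x in seq]))

# Three Bernoulli measures stand in for a countable class {mu_k : k in N}.
measures = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
weights = [0.25, 0.25, 0.5]   # summable positive weights, here already summing to 1
print(mixture_predictor(measures, weights, [1, 1, 0, 1], alphabet=[0, 1]))
```

Evaluating this ratio is the same as updating the weights to their posterior values wk µk(x1..n) / ∑j wj µj(x1..n), which is what makes ρS a Bayesian predictor.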
11 Total variation distance measures the difference in (predicted and true) conditional probabilities of all future events, that is, not only the probabilities of the next observations, but also of observations that are arbitrarily far off in the future (see formal definitions below). [sent-58, score-0.416]
12 A key observation here is that a predictor ρS = ∑ wk µk may be a good predictor not only when the data is generated by one of the processes µk , k ∈ N, but also when it comes from a much larger class. [sent-66, score-1.123]
13 As was found by Ryabko (1988), the combination ρR := ∑ wk λk is a good predictor not only for the set ∪k∈N Mk of all finite-memory processes, but also for any measure µ coming from a much larger class: that of all stationary measures on X ∞ . [sent-77, score-0.903]
14 Here prediction is possible only in the Cesaro sense (more precisely, ρR predicts every stationary process in expected time-average Kullback-Leibler divergence, see definitions below). [sent-78, score-0.51]
15 The Laplace predictor itself can be obtained as a Bayes mixture over all Bernoulli i.i.d. measures. [sent-79, score-0.432]
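As a quick illustration, a minimal sketch of the Laplace (add-one) predictor follows; presumably this is the predictor referred to as (1) later in the text, and the rule-of-succession equivalence with a uniform-prior Bayes mixture over Bernoulli i.i.d. measures is the standard fact being used here.

```python
from collections import Counter

def laplace_predictor(history, alphabet):
    """Laplace predictor: rho_L(x_{n+1} = a | x_1..n) = (#{i <= n : x_i = a} + 1) / (n + |X|)."""
    counts = Counter(history)
    n = len(history)
    return {a: (counts[a] + 1) / (n + len(alphabet)) for a in alphabet}

print(laplace_predictor([0, 1, 1, 0, 1], alphabet=[0, 1]))  # {0: 3/7, 1: 4/7}
```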
16 Taking a dense subset of the values of these parameters, and a mixture of the corresponding measures, results in a predictor for the class of k-order Markov processes. [sent-90, score-0.471]
17 Mixing over these (for all k ∈ N) yields, as in Ryabko (1988), a predictor for the class of all stationary processes. [sent-91, score-0.482]
18 Thus, for the mentioned classes of processes, a predictor can be obtained as a Bayes mixture of countably many measures in the class. [sent-92, score-0.652]
19 Indeed, a natural way to obtain a predictor for a class C of stochastic processes is to take a Bayesian mixture of the class. [sent-94, score-0.58]
20 The results of the present work show that when a predictor exists it can indeed be given as a Bayesian predictor, which predicts every (and not almost every) measure in the class, while its support is only a countable set. [sent-107, score-0.978]
21 In Blackwell and Dubins (1962) it was shown that if one measure is absolutely continuous with respect to another, then the latter predicts the former (the conditional probabilities converge in a very strong sense). [sent-110, score-0.427]
22 2 The Results First, we show that if there is a predictor that performs well for every measure coming from a class C of processes, then a predictor can also be obtained as a convex combination ∑k∈N wk µk for some µk ∈ C and some wk > 0, k ∈ N. [sent-113, score-1.367]
23 The analysis for the total variation case relies on the fact that if ρ predicts µ in total variation distance, then µ is absolutely continuous with respect to ρ, so that ρ(x1. [sent-115, score-0.716]
24 However, if we settle for a weaker measure of performance, such as expected average KL divergence, measures µ ∈ C are typically singular with respect to a predictor ρ. [sent-120, score-0.693]
25 Combining these predictors for all n results in a predictor that predicts every µ ∈ C in average KL divergence. [sent-126, score-0.819]
26 The proof techniques developed have a potential to be used in solving other questions concerning sequence prediction, in particular, the general question of how to find a predictor for an arbitrary class C of measures. [sent-127, score-0.495]
27 We then exhibit some sufficient conditions on the class C , under which a predictor for all measures in C exists. [sent-128, score-0.561]
28 Conditions of the first type concern separability of C with respect to the total variation distance and the expected average KL divergence. [sent-131, score-0.419]
29 We show that in the case of total variation, separability is a necessary and sufficient condition for the existence of a predictor, whereas in the case of expected average KL divergence it is sufficient but not necessary. [sent-132, score-0.488]
30 We show that, in general, if cn grows subexponentially then a predictor exists that predicts any measure in C in expected average KL divergence. [sent-147, score-0.856]
31 In Section 3 we show that if any predictor works, then there is a Bayesian one that works, while in Section 4 we provide several characterizations of predictable classes of processes. [sent-154, score-0.452]
32 Definition 1 We say that ρ predicts µ in total variation if v(µ, ρ, x1..n ) := sup_{A∈F} |µ(A|x1..n ) − ρ(A|x1..n )| → 0 µ-almost surely. [sent-192, score-0.425]
33 Thus, for a class C of measures there is a predictor ρ that predicts every µ ∈ C in total variation if and only if every µ ∈ C has a density with respect to ρ. [sent-199, score-1.182]
34 Definition 3 We say that ρ predicts µ in expected average KL divergence if (1/n) dn (µ, ρ) → 0. [sent-217, score-0.581]
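For reference, the divergence dn used in this definition and in the statements below is (reconstructing the standard form implied by the later formulas) the KL divergence between the two measures restricted to the first n observations:

```latex
\[
d_n(\mu,\rho) := \mathbb{E}_{\mu}\,\log\frac{\mu(x_1,\dots,x_n)}{\rho(x_1,\dots,x_n)}
= \sum_{x_{1..n}\in X^n}\mu(x_{1..n})\,\log\frac{\mu(x_{1..n})}{\rho(x_{1..n})},
\qquad\text{so that prediction means } \tfrac{1}{n}\, d_n(\mu,\rho)\to 0 .
\]
```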
35 With prediction quality so measured, predictors exist for relatively large classes of measures; most notably, Ryabko (1988) provides a predictor which predicts every stationary process in expected average KL divergence. [sent-219, score-1.01]
36 Fully Nonparametric Bayes Predictors In this section we show that if there is a predictor that predicts every µ in some class C , then there is a Bayesian mixture of countably many elements from C that predicts every µ ∈ C too. [sent-233, score-1.148]
37 If there is a measure ρ such that ρ predicts every µ ∈ C in total variation, then there is a sequence µk ∈ C , k ∈ N such that the measure ν := ∑k∈N wk µk predicts every µ ∈ C in total variation, where wk are any positive weights that sum to 1. [sent-237, score-1.381]
38 This relatively simple fact can be proven in different ways, relying on the mentioned equivalence (Blackwell and Dubins, 1962; Kalai and Lehrer, 1994) of the statements “ρ predicts µ in total variation distance” and “µ is absolutely continuous with respect to ρ. [sent-238, score-0.528]
39 Next we will construct a sequence of measures µk ∈ C , k ∈ N such that the union of the sets Tµk has probability 1 with respect to every µ ∈ C , and will show that this is a sequence of measures whose existence is asserted in the theorem statement. [sent-256, score-0.551]
40 Thus, µ is absolutely continuous with respect to ν, and so, by Theorem 2, ν predicts µ in total variation distance. [sent-285, score-0.528]
41 Thus, examples of families C for which there is a ρ that predicts every µ ∈ C in total variation, are limited to families of measures which have a density with respect to some measure ρ. [sent-286, score-0.814]
42 In particular, even the class of all Bernoulli i.i.d. measures does not allow for a predictor that predicts every measure in total variation. [sent-292, score-0.949]
43 As was mentioned, predicting in total variation distance means predicting with arbitrarily growing horizon (Kalai and Lehrer, 1994), while prediction in expected average KL divergence is only concerned with the probabilities of the next observation, and only on time and data average. [sent-294, score-0.499]
44 For the latter measure of prediction quality, consistent predictors exist not only for the class of all Bernoulli processes, but also for the class of all stationary processes (Ryabko, 1988). [sent-295, score-0.41]
45 If there is a measure ρ such that ρ predicts every µ ∈ C in expected average KL divergence, then there exist a sequence µk ∈ C , k ∈ N and a sequence wk > 0, k ∈ N, such that ∑k∈N wk = 1, and the measure ν := ∑k∈N wk µk predicts every µ ∈ C in expected average KL divergence. [sent-298, score-1.612]
46 An interesting related question (which is beyond the scope of this paper) is how to choose the weights so as to optimize the behaviour of a predictor before the asymptotic. [sent-302, score-0.508]
47 ) We then use, for each given n, the same scheme to cover the set X n with countably many Tµn , as was used in the proof of Theorem 4 to construct a countable covering of the set X ∞ , obtaining for each n a predictor νn . [sent-314, score-0.695]
48 Then the predictor ν is obtained as ∑n∈N wn νn , where the weights decrease subexponentially. [sent-315, score-0.502]
49 We will show that ν predicts every µ ∈ C , and then, at the end of the proof (Step r), we will show how to replace γ by a combination of a countable set of elements of C (in fact, γ is just a regularizer which ensures that the ν-probability of any word is never too close to 0). [sent-369, score-0.55]
50 Thus, from (11) and (8) we get II ≥ −µ(Tµn \Tjn ) log ρ(Tµn \Tjn ) − 1/2 ≥ −µ(Tµn \Tjn ) log εn − 1/2. [sent-421, score-0.601]
51 Combining (9) with the bounds (10), (12) and (13) we obtain dn (µ, ρ) ≥ −log n − µ(Tµn \Tjn ) log εn − 1 − log |X |, so that µ(Tµn \Tjn ) ≤ (dn (µ, ρ) + log n + 1 + log |X |) / (−log εn ). (14) [sent-429, score-1.022]
52 Since dn (µ, ρ) = o(n), we can define the parameters εn in such a way that −log εn = o(n) while at the same time the bound (14) gives µ(Tµn \Tjn ) = o(1). [sent-430, score-0.474]
53 In this case, any mixture predictor ρ := ∑k∈N wk µk predicts all µ ∈ C both in total variation and in expected average KL divergence. [sent-543, score-1.148]
54 The problem of predicting all computable measures was introduced in Solomonoff (1978), where a mixture predictor was proposed. [sent-546, score-0.574]
55 Therefore, by Theorem 2, a predictor that predicts any µ ∈ CB in total variation does not exist, demonstrating that this notion of prediction is rather strong. [sent-559, score-0.873]
56 It is well known (see, e.g., Krichevsky, 1993) that the Laplace predictor (1) predicts every Bernoulli i.i.d. measure in expected average KL divergence. [sent-562, score-0.705]
57 Hence, Theorem 4 implies that there is a countable mixture predictor for this family too. [sent-566, score-0.67]
58 Let {µq : q ∈ Q} be the set of all Bernoulli i.i.d. measures with rational probability of 0, and let ρ := ∑q∈Q wq µq , where wq are arbitrary positive weights that sum to 1. [sent-571, score-0.421]
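A small self-contained sketch of this construction (the finite truncation of Q, the dyadic weights, and the helper names are illustrative choices, not prescribed by the text):

```python
from fractions import Fraction
import numpy as np

def bernoulli(p):
    """mu_q: Bernoulli i.i.d. measure with P(x = 0) = p."""
    return lambda seq: float(np.prod([p if x == 0 else 1 - p for x in seq]))

# A finite truncation of Q, the rationals in [0, 1].
Q = sorted({Fraction(a, b) for b in range(1, 8) for a in range(0, b + 1)})
w = np.array([2.0 ** -(i + 1) for i in range(len(Q))])
w /= w.sum()                                  # positive weights summing to 1

def rho(seq):
    """rho(x_1..n) = sum_q w_q * mu_q(x_1..n)."""
    return sum(wi * bernoulli(float(q))(seq) for wi, q in zip(w, Q))

history = [1, 0, 1, 1]
denom = rho(history)
print({a: rho(history + [a]) / denom for a in (0, 1)})  # next-symbol probabilities
```

Evaluating ρ(x1..n a)/ρ(x1..n) directly, as above, is equivalent to weighting each µq by its posterior weight; any positive weights summing to 1 suffice for the asymptotic claim.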
59 (1/n) dn (µp , ρ) = (1/n) Eµp log [µp (x1..n )/ρ(x1..n )] ≤ (1/n) Eµp log [µp (x1..n )/(wq µq (x1..n ))]. [sent-582, score-0.397]
60 Since this holds for each ε, we conclude that (1/n) dn (µp , ρ) → 0 and ρ predicts every µ ∈ CB in expected average KL divergence. [sent-587, score-0.551]
61 In Ryabko (1988) a predictor ρR was constructed which predicts every stationary process µ ∈ CS in expected average KL divergence. [sent-589, score-0.858]
62 (This predictor is obtained as a mixture of predictors for k-order Markov sources, for all k ∈ N. [sent-590, score-0.522]
63 ) Therefore, Theorem 5 implies that there is also a countable mixture predictor for this family of processes. [sent-591, score-0.67]
64 Such a predictor can be constructed as follows (the proof in this example is based on the proof in Ryabko and Astola, 2006, Appendix 1). [sent-592, score-0.442]
65 We will show that any predictor ν := ∑k∈N ∑q∈Q^{2^k} wk w^k_q µ^k_q , where wk , k ∈ N, and w^k_q , q ∈ Q^{2^k} , k ∈ N, are any sequences of positive real weights that sum to 1, predicts every stationary µ ∈ CS in expected average KL divergence. [sent-599, score-1.548]
66 It is easy to see that for every ε > 0 and every k ∈ N we can find a k-order stationary Markov measure µ^k_{qε} with rational values of the parameters, qε ∈ Q^{2^k} , such that Eµ log [µ(xk+1 |x1..k ) / µ^k_{qε}(xk+1 |x1..k )] ≤ ε. (22) [sent-605, score-0.425]
67 We have (1/n) dn (µ, ν) ≤ −(log wk w^k_{qε})/n + (1/n) dn (µ, µ^k_{qε}) = O(k/n) + (1/n) Eµ log [µ(x1..n )/µ^k_{qε}(x1..n )]. [sent-610, score-0.732]
68 We will construct a sequence of measures νk , k ∈ N, a measure µ, and two sequences of positive weights wk and w′k with ∑k∈N wk = ∑k∈N w′k = 1, for which ν := ∑k∈N wk νk predicts µ in expected average KL divergence, but ν′ := ∑k∈N w′k νk does not. [sent-630, score-1.287]
69 We have dn (µ, ν) = −log(∑k≥n wk ) ≤ −log(wn−2 ) = o(n), but dn (µ, ν′ ) = −log(∑k≥n w′k ) = −log(2^{−n+1} ) = n − 1 ≠ o(n), proving the claim. [sent-634, score-0.584]
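A quick check of the arithmetic behind the second half of this claim (logarithms taken base 2, as the value n − 1 indicates, and assuming the dyadic weights w′k = 2^{−k} implied by the displayed sum):

```latex
\[
\sum_{k\ge n} w'_k = \sum_{k\ge n} 2^{-k} = 2^{-n+1},
\qquad
-\log_2(2^{-n+1}) = n-1 \neq o(n).
\]
```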
70 Characterizing Predictable Classes Knowing that a mixture of a countable subset gives a predictor if there is one, a notion that naturally comes to mind, when trying to characterize families of processes for which a predictor exists, is separability. [sent-636, score-1.231]
71 Can we say that there is a predictor for a class C of measures if and only if C is separable? [sent-637, score-0.528]
72 Nonetheless, in the case of total variation distance we obviously have a candidate topology: that of total variation distance, and indeed separability with respect to this topology is equivalent to the existence of a predictor, as the next theorem shows. [sent-642, score-0.702]
73 1 Separability Definition 6 (unconditional total variation distance) Introduce the (unconditional) total variation distance v(µ, ρ) := sup_{A∈F} |µ(A) − ρ(A)|. [sent-647, score-0.46]
74 There is a measure ρ such that ρ predicts every µ ∈ C in total variation if and only if C is separable with respect to the topology of total variation distance. [sent-649, score-0.877]
75 In this case, any measure ν of the form ν = ∑_{k=1}^∞ wk µk , where {µk : k ∈ N} is any dense countable subset of C and wk are any positive weights that sum to 1, predicts every µ ∈ C in total variation. [sent-650, score-1.168]
76 Let C be separable in total variation distance, and let D = {νk : k ∈ N} be its dense countable subset. [sent-652, score-0.498]
77 We have to show that ν := ∑k∈N wk νk , where wk are any positive real weights that sum to 1, predicts every µ ∈ C in total variation. [sent-653, score-0.881]
78 Hence νk (A) ≥ µ(A) − v(µ, νk ) ≥ ε/2 and ν(A) ≥ wk νk (A) ≥ wk ε/2 > 0. [sent-657, score-0.468]
79 The set of all measures that have a density with respect to ρ, is separable with respect to this distance (for example, a dense countable subset can be constructed based on measures whose densities are step-functions, that take only rational values, see, e. [sent-661, score-0.761]
80 Definition 8 (asymptotic KL “distance” D) Define asymptotic expected average KL divergence between measures µ and ρ as D(µ, ρ) = lim sup_{n→∞} (1/n) dn (µ, ρ). (24) [sent-668, score-0.56]
81 (ii) There is an uncountable set C of measures, and a measure ν, such that ν predicts every µ ∈ C in expected average KL divergence, but µ1 ≠ µ2 implies D(µ1 , µ2 ) = ∞ for every µ1 , µ2 ∈ C ; in particular, C is not separable with respect to D. [sent-670, score-0.589]
82 dn (µ, ν) ≤ Eµ log [µ(x1..n )/(wk νk (x1..n ))] = −log wk + dn (µ, νk ) ≤ nε + o(n). [sent-677, score-0.557]
83 It is easy to check that µ1 ≠ µ2 implies D(µ1 , µ2 ) = ∞ for every µ1 , µ2 ∈ C , but the predictor ν, given by ν(xn = 0) := 1/n independently for different n, predicts every µ ∈ C in expected average KL divergence. [sent-685, score-0.844]
84 For Bernoulli i.i.d. and k-order Markov processes, the (countable) sets of processes that have rational values of the parameters, considered in the previous section, are dense both in the topology of the parametrization and with respect to the asymptotic average divergence D. [sent-692, score-0.487]
85 On the other hand, neither of these latter families is separable with respect to the topology of total variation distance. [sent-695, score-0.418]
86 2 Conditions Based on the Local Behaviour of Measures Next we provide some sufficient conditions for the existence of a predictor based on local characteristics of the class of measures, that is, measures truncated to the first n observations. [sent-697, score-0.597]
87 Yet, the class of measures on X n , obtained by truncating all measures in C0 to the first n observations, coincides with what would be obtained by truncating all deterministic measures to the first n observations, the latter class being obviously not predictable at all (see also examples below). [sent-701, score-0.547]
88 The measures λC (x1..n ) (indexed by n) in general do not immediately define a stochastic process over X ∞ (λC are not consistent for different n); thus, in particular, using average KL divergence for measuring prediction quality would not make sense, since dn (µ(·|x1. [sent-727, score-0.404]
89 Yet, by taking an appropriate mixture, it is still possible to construct a predictor (a stochastic process) based on λ, that predicts all the measures in the class. [sent-741, score-0.796]
90 (1/n) dn (µ, ρ) ≤ (1/n) log(cn (C ) n^2 /w) = (1/n)(log cn (C ) + 2 log n − log w). [sent-760, score-0.434]
91 Indeed, for the class of all deterministic processes (that is, each process from the class produces some fixed sequence of observations with probability 1) we have cn = 2^n , while obviously for this class a predictor does not exist. [sent-790, score-0.667]
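Plugging this class into the bound sketched above gives a quick consistency check (the −log w term is the contribution of the fixed weight in that bound):

```latex
\[
c_n(\mathcal{C}) = 2^{n} \;\Longrightarrow\;
\frac{1}{n}\left(\log c_n(\mathcal{C}) + 2\log n - \log w\right)
= \log 2 + \frac{2\log n - \log w}{n} \not\to 0,
\]
```

so the sufficient condition fails for the deterministic class, which is consistent with (though, since the condition is only sufficient, not a proof of) the fact that no predictor exists for it.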
92 For a class C of measures we are interested in a predictor that has a small (or minimal) worst-case (with respect to the class C ) probability of error. [sent-802, score-0.569]
93 There exists a measure ρC such that (1/n) dn (µ, ρC ) ≤ C(C^n )/n + O(log n / n); in particular, if C(C^n )/n → 0, then ρC predicts every µ ∈ C in expected average KL divergence. [sent-820, score-0.67]
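For orientation, C(C^n) is the capacity of the "channel" from the class to the first n observations; under the standard information-theoretic definition (e.g., Gallager), which this statement presumably relies on, it can be written as

```latex
\[
C(\mathcal{C}^n) = \sup_{\pi} I_{\pi}\left(\mu \,;\, x_{1..n}\right),
\]
```

where the supremum runs over priors π on the class and I_π is the mutual information between the measure drawn according to π and the observations; by the redundancy-capacity theorem this equals the minimax redundancy min_ρ sup_{µ∈C} dn(µ, ρ), which is presumably what the bound above exploits.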
94 Thus, if the channel capacity C(C^n ) grows sublinearly, a predictor can be constructed for the class of processes C . [sent-839, score-0.643]
95 In this case the problem of constructing the predictor is reduced to finding the channel capacities for different n and finding the corresponding measures on which they are attained or approached. [sent-840, score-0.607]
96 More generally, the questions we addressed in this work are a part of a larger problem: given an arbitrary class C of stochastic processes, find the best predictor for it. [sent-852, score-0.438]
97 The second one is to characterize families of processes for which a predictor exists. [sent-855, score-0.596]
98 Towards this end, we have found a rather simple form that some solution to this question has if a solution exists: a Bayesian predictor whose prior is concentrated on a countable set. [sent-862, score-0.615]
99 We have also identified some sufficient conditions under which a predictor can actually be constructed (e. [sent-863, score-0.419]
100 However, the larger question of how to construct an optimal predictor for an arbitrary given family of processes, remains open. [sent-866, score-0.468]
wordName wordTfidf (topN-words)
[('ryabko', 0.432), ('predictor', 0.386), ('predicts', 0.237), ('wk', 0.234), ('countable', 0.203), ('daniil', 0.178), ('kl', 0.177), ('tk', 0.176), ('dn', 0.175), ('jn', 0.151), ('measures', 0.142), ('variation', 0.131), ('rbitrary', 0.127), ('redictors', 0.127), ('processes', 0.117), ('divergence', 0.112), ('cn', 0.106), ('hutter', 0.102), ('stationary', 0.096), ('separability', 0.095), ('families', 0.093), ('predictors', 0.09), ('rocesses', 0.09), ('tkn', 0.089), ('every', 0.082), ('wn', 0.079), ('channel', 0.079), ('kalai', 0.078), ('countably', 0.078), ('dubins', 0.076), ('lehrer', 0.076), ('wq', 0.076), ('blackwell', 0.076), ('bernoulli', 0.074), ('log', 0.074), ('cb', 0.063), ('prediction', 0.062), ('absolutely', 0.062), ('capacity', 0.061), ('behaviour', 0.059), ('total', 0.057), ('gallager', 0.054), ('mk', 0.053), ('krichevsky', 0.051), ('nml', 0.051), ('topology', 0.051), ('environment', 0.048), ('sup', 0.046), ('mixture', 0.046), ('rational', 0.046), ('measure', 0.045), ('separable', 0.045), ('tp', 0.045), ('predictable', 0.043), ('mn', 0.043), ('probabilities', 0.042), ('respect', 0.041), ('theorem', 0.04), ('dense', 0.039), ('fix', 0.038), ('distance', 0.038), ('weights', 0.037), ('existence', 0.036), ('environments', 0.036), ('bayesian', 0.036), ('family', 0.035), ('sequence', 0.034), ('sequences', 0.033), ('conditions', 0.033), ('hk', 0.033), ('expected', 0.033), ('xn', 0.032), ('markov', 0.031), ('stochastic', 0.031), ('universal', 0.029), ('laplace', 0.029), ('parametrization', 0.029), ('asymptotic', 0.028), ('proof', 0.028), ('truncating', 0.027), ('question', 0.026), ('cesaro', 0.025), ('plesner', 0.025), ('solomonoff', 0.025), ('subexponentially', 0.025), ('indeed', 0.025), ('cs', 0.025), ('average', 0.024), ('deterministic', 0.024), ('density', 0.024), ('parametrized', 0.024), ('kn', 0.023), ('let', 0.023), ('minimax', 0.023), ('characterizations', 0.023), ('weaker', 0.022), ('jackson', 0.022), ('normalizer', 0.022), ('actions', 0.021), ('arbitrary', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.11011568 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
3 0.091482624 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
4 0.088997163 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
5 0.079442836 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
6 0.076500662 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
7 0.067857884 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
8 0.066847801 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
9 0.061472077 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
10 0.058571644 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
11 0.05565061 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
12 0.053744659 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.050798103 84 jmlr-2010-On Spectral Learning
14 0.048953597 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
15 0.047399953 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
16 0.045367859 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.045325283 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
18 0.044336993 82 jmlr-2010-On Learning with Integral Operators
19 0.042965744 27 jmlr-2010-Consistent Nonparametric Tests of Independence
20 0.041930642 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
topicId topicWeight
[(0, -0.184), (1, -0.068), (2, 0.056), (3, -0.084), (4, -0.078), (5, -0.025), (6, -0.019), (7, 0.127), (8, 0.075), (9, 0.069), (10, 0.016), (11, 0.061), (12, 0.036), (13, -0.177), (14, 0.133), (15, -0.053), (16, 0.123), (17, -0.287), (18, 0.145), (19, -0.032), (20, 0.081), (21, -0.007), (22, 0.021), (23, 0.022), (24, -0.108), (25, 0.173), (26, 0.159), (27, -0.075), (28, -0.104), (29, 0.084), (30, -0.021), (31, -0.004), (32, -0.038), (33, 0.155), (34, 0.138), (35, -0.145), (36, 0.021), (37, 0.303), (38, 0.048), (39, 0.077), (40, 0.059), (41, -0.038), (42, 0.076), (43, -0.001), (44, -0.107), (45, -0.078), (46, -0.169), (47, -0.072), (48, -0.059), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96652955 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.46395227 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
3 0.40282175 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
Author: Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines, document classification
4 0.29267952 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
5 0.28074342 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
6 0.27671862 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
7 0.27109864 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
8 0.27094439 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
9 0.26056501 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
10 0.25860777 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.24752416 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
12 0.23444711 84 jmlr-2010-On Spectral Learning
13 0.22635043 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
14 0.22363566 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
16 0.20830564 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.20411347 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
18 0.17913398 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
19 0.16125561 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
20 0.15823874 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
topicId topicWeight
[(3, 0.045), (8, 0.021), (15, 0.48), (21, 0.017), (24, 0.01), (32, 0.045), (36, 0.062), (37, 0.041), (75, 0.116), (81, 0.014), (85, 0.035), (96, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.81994438 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
2 0.66930759 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
3 0.37009147 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
Author: Nguyen Xuan Vinh, Julien Epps, James Bailey
Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance
5 0.34558347 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
6 0.34308174 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
7 0.34208769 63 jmlr-2010-Learning Instance-Specific Predictive Models
9 0.33073795 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.32843482 27 jmlr-2010-Consistent Nonparametric Tests of Independence
11 0.32550594 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
12 0.32548296 60 jmlr-2010-Learnability, Stability and Uniform Convergence
13 0.3216477 25 jmlr-2010-Composite Binary Losses
14 0.32100528 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
15 0.32011306 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
16 0.31993341 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
17 0.31985566 109 jmlr-2010-Stochastic Composite Likelihood
18 0.31400359 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
19 0.31210312 117 jmlr-2010-Why Does Unsupervised Pre-training Help Deep Learning?
20 0.31155694 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm