jmlr jmlr2009 jmlr2009-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
Reference: text
1 From now on, we use the terms control and interrogation interchangeably; control is the conventional expression, whereas the word interrogation expresses our aims better. [sent-68, score-0.762]
2 We show that (iii) using the D-optimality interrogation technique, these two tasks are incoherent in the myopic (i.e., single-step look-ahead) control scheme. [sent-77, score-0.283]
3 Signals derived from this principle for parameter estimation are suboptimal (essentially the worst possible) for the estimation of the driving noise, and vice versa. [sent-79, score-0.588]
4 (iv) We show that for the noise estimation task the two principles, that is, the A- and D-optimality principles, result in the same cost function. [sent-80, score-0.317]
5 (v) For the A-optimality case, we derive equations for the joint estimation of the noise and the parameters. [sent-81, score-0.272]
6 Section 6 deals with our second task, when the goal is the estimation of the driving noise of the RNN. [sent-88, score-0.293]
7 We combine the two tasks for both optimality principles in Section 8 and consider the cost functions for the joint estimation of the parameters and the driving noise. [sent-91, score-0.24]
8 In Section 9 a non-myopic heuristic is introduced for the noise estimation task. [sent-93, score-0.251]
9 Let us assume that we have d simple computational units called 'neurons' in a recurrent neural network: r_{t+1} = g(∑_{i=0}^{I} F_i r_{t−i} + ∑_{j=0}^{J} B_j u_{t+1−j} + e_{t+1}), (1) where {e_t}, the driving noise of the RNN, denotes temporally independent and identically distributed (i.i.d.) noise. [sent-99, score-0.254]
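As an illustration of model (1), the dynamics can be simulated in a few lines. The sketch below assumes g = tanh and I = J = 0 (a single recurrent matrix F and a single input matrix B); all dimensions and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

# Sketch of the noisy RNN of Eq. (1) with g = tanh and I = J = 0,
# i.e., r_{t+1} = g(F r_t + B u_{t+1} + e_{t+1}).
rng = np.random.default_rng(0)
d, c = 4, 2                                  # neurons, control dimension
F = 0.5 * rng.standard_normal((d, d))        # recurrent weights
B = rng.standard_normal((d, c))              # control (input) weights
V = 0.01 * np.eye(d)                         # covariance of driving noise e_t

def step(r, u, g=np.tanh):
    """One transition r_t -> r_{t+1} under control u_{t+1}."""
    e = rng.multivariate_normal(np.zeros(d), V)  # i.i.d. driving noise
    return g(F @ r + B @ u + e)

r = np.zeros(d)
for _ in range(10):
    r = step(r, rng.uniform(-1.0, 1.0, size=c))
```

Because g = tanh, every coordinate of the state stays in (−1, 1), which is convenient for the identification experiments discussed later.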
10 Our task is to estimate the parameter matrices F_i (i = 0, ..., I) and B_j (j = 0, ..., J) and the covariance matrix V, as well as the driving noise e_t, by means of the control signals. [sent-123, score-0.467]
11 (2) In order to estimate the unknown quantities (parameter matrix A, noise et and its covariance matrix V) in an online fashion, we rely on Bayes’ method. [sent-147, score-0.251]
12 Now one can rewrite model (2) as follows: P(A|V) = N_A(M, V, K) (3), P(e_t|V) = N_{e_t}(0, V) (4), P(y_t|A, x_t, V) = N_{y_t}(A x_t, V) (5), and P(V) = PIG_V(α, β) or P(V) = IW_V(Q, n), depending on whether we want to use A- or D-optimality. [sent-176, score-0.439]
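The conjugate structure of (3)-(5) yields simple online update rules. The sketch below shows the standard Bayesian linear regression recursion that this structure implies for the posterior of A (K accumulates x xᵀ, M is the posterior mean); it is a hedged illustration, not the paper's exact algorithm, and it omits the update of V (inverted Wishart or product of inverted Gammas).

```python
import numpy as np

# With prior P(A|V) = N_A(M, V, K) and observation y = A x + e, the
# posterior stays matrix normal with K' = K + x x^T and
# M' = (M K + y x^T) (K')^{-1}  (standard conjugate recursion).
def posterior_update(M, K, x, y):
    K_new = K + np.outer(x, x)
    M_new = (M @ K + np.outer(y, x)) @ np.linalg.inv(K_new)
    return M_new, K_new

rng = np.random.default_rng(1)
d, m = 3, 5
A_true = rng.standard_normal((d, m))       # illustrative ground truth
M, K = np.zeros((d, m)), np.eye(m)         # prior mean and precision-like K
for _ in range(500):
    x = rng.standard_normal(m)
    y = A_true @ x + 0.01 * rng.standard_normal(d)
    M, K = posterior_update(M, K, x, y)
# The posterior mean M approaches A_true as observations accumulate.
```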
13 Assuming that {x}_1^t, {y}_1^t are given, according to the infomax principle our goal is to compute arg max_{u_{t+1}} I(θ, y_{t+1}; {x}_1^{t+1}, {y}_1^t) (6), where I(a, b; c) denotes the mutual information of stochastic variables a and b for fixed parameters c. [sent-184, score-0.235]
14 So, u_{t+1}^{opt} = arg max_{u_{t+1} ∈ U} x_{t+1}^T K_t^{−1} x_{t+1} (14). In what follows, D-optimal control will be referred to as the 'infomax interrogation scheme'. [sent-219, score-0.451]
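For a finite admissible set U, rule (14) is a one-line search. A minimal sketch follows; the candidate set and the regressor map `x_of_u` are illustrative stand-ins, not part of the paper's setup.

```python
import numpy as np

# D-optimal ('infomax') interrogation step of (14): among a finite set of
# admissible controls, pick the one whose regressor x maximizes
# x^T K^{-1} x, i.e., probe where the posterior is least certain.
def d_optimal_control(K, candidates, x_of_u):
    K_inv = np.linalg.inv(K)
    scores = [x_of_u(u) @ K_inv @ x_of_u(u) for u in candidates]
    return candidates[int(np.argmax(scores))]

K = np.diag([1.0, 10.0])                 # dimension 2 already well explored
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
best = d_optimal_control(K, candidates, x_of_u=lambda u: u)
# best is [1, 0]: the rule probes the less-explored direction.
```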
15 tr Var[A | {x}_1^{t+1}, {y}_1^{t+1}] = tr((K_t + x_{t+1} x_{t+1}^T)^{−1}) · E[tr V | {x}_1^{t+1}, {y}_1^{t+1}] = tr((K_t + x_{t+1} x_{t+1}^T)^{−1}) · ∑_{i=1}^{d} (β_{t+1})_i / ((α_{t+1})_i − 1). [sent-247, score-0.303]
16 (D-Optimality Approach for Noise Estimation.) One might wish to compute the optimal control for estimating the noise e_t in (1), instead of solving the identification problem above. [sent-255, score-0.343]
17 Based on (1) and because e_{t+1} = y_{t+1} − ∑_{i=0}^{I} F_i r_{t−i} − ∑_{j=0}^{J} B_j u_{t+1−j} (22), one might think that the best strategy is to use the optimal infomax control of Table 1, since it provides good estimates of the parameters A = [F_I, . [sent-256, score-0.379]
18 At time t + 1, let the estimate of the noise be ê_{t+1} = y_{t+1} − ∑_{i=0}^{I} F̂_t^i r_{t−i} − ∑_{j=0}^{J} B̂_t^j u_{t+1−j}, where F̂_t^i and B̂_t^j denote the parameter estimates at time t. [sent-264, score-0.226]
19 This hints that the control should be u_t = 0 for all times in order to remove the error contribution of the matrices B_j in (23). [sent-272, score-0.442]
20 In other words, for the estimation of the noise we want to design a control signal u_{t+1} such that the next output is the best from the point of view of greedy maximization of the mutual information between the next output y_{t+1} and the noise e_{t+1}. [sent-275, score-0.569]
21 After some mathematical calculation we can prove that the D-optimal interrogation scheme for noise estimation gives rise to the control of Lemma 6. [sent-278, score-0.451]
22 It is worth noting that this D-optimal cost function for noise estimation and the D-optimal cost function derived for parameter estimation in (13) are not compatible with each other. [sent-281, score-0.352]
23 We shall show later (Section 9) that for large t values, expression (25) gives rise to control values close to ut = 0. [sent-283, score-0.435]
24 In this section we investigate the A- and D-optimality principles for the joint parameter and noise estimation task. [sent-301, score-0.259]
25 (A-optimality.) According to the A-optimality principle, the joint objective for parameter and noise estimation is given as ∫ dy_{t+1} P(y_{t+1} | {x}_1^{t+1}, {y}_1^t) tr Var[vec(A), diag(V), e_{t+1} | {x}_1^{t+1}, {y}_1^{t+1}]. [sent-303, score-0.272]
26 The A-optimality principle in the joint parameter and noise estimation task gives rise to the following choice of control: u_{t+1}^{opt} = arg max_{u_{t+1} ∈ U} (1 + x_{t+1}^T K_t^{−1} K_t^{−1} x_{t+1}) / (1 + x_{t+1}^T K_t^{−1} x_{t+1}). [sent-307, score-0.445]
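Reading the lemma's expression as the ratio (1 + xᵀK⁻¹K⁻¹x) / (1 + xᵀK⁻¹x), the A-optimal joint control over a finite candidate set can be sketched as below. Both this reading of the garbled source formula and the candidate set are assumptions.

```python
import numpy as np

# A-optimal joint objective, read as
# score(x) = (1 + x^T K^{-1} K^{-1} x) / (1 + x^T K^{-1} x),
# maximized over admissible controls.
def a_optimal_joint_score(K, x):
    K_inv = np.linalg.inv(K)
    return (1.0 + x @ K_inv @ K_inv @ x) / (1.0 + x @ K_inv @ x)

K = np.diag([0.5, 4.0])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
scores = [a_optimal_joint_score(K, x) for x in candidates]
best = candidates[int(np.argmax(scores))]
```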
27 An implication—as we shall see below—is that we cannot use the D-optimality principle for the joint parameter and noise estimation task. [sent-312, score-0.28]
28 The first term is a finite real number; thus we can conclude that the D-optimality cost function is identically −∞, and therefore the D-optimality principle does not suit the joint parameter and noise estimation task. [sent-314, score-0.312]
29 In this section, we present a non-myopic heuristic for the noise estimation task (25). [sent-317, score-0.277]
30 We will call this non-greedy interrogation heuristic, introduced for noise estimation, the 'τ-infomax noise interrogation'. [sent-334, score-0.617]
31 This non-myopic heuristic acknowledges that parameter estimation of the dynamics is a prerequisite of noise estimation, because improper parameter estimation creates apparent noise, and thus the heuristic sacrifices the first τ steps for parameter estimation. [sent-335, score-0.367]
32 The compromise is that in the first τ steps the performance of the non-myopic control can be worse than that of the other control methods. [sent-337, score-0.358]
33 We note that in the τ-infomax noise interrogation, for large switching time τ and for large t values, |Kt22 | will be large, and hence—according to (29)—the optimal ut for interrogation will be close to 0. [sent-338, score-0.599]
34 A reasonable approximation of the 'τ-infomax noise interrogation' is to use the control given in Table 1 for the first τ steps and to switch to zero interrogation afterwards. [sent-340, score-0.343]
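The switching policy just described can be sketched directly; the exploration phase reuses the quadratic-form rule of Table 1, and the candidate set and regressor map are illustrative.

```python
import numpy as np

# tau-zero schedule: D-optimal control for the first tau steps
# (exploration / parameter estimation), then zero control
# (noise estimation).
def tau_zero_control(t, tau, K, candidates, x_of_u):
    if t < tau:
        K_inv = np.linalg.inv(K)
        scores = [x_of_u(u) @ K_inv @ x_of_u(u) for u in candidates]
        return candidates[int(np.argmax(scores))]
    return np.zeros_like(candidates[0])

cands = [np.ones(2)]
u_explore = tau_zero_control(t=5, tau=21, K=np.eye(2),
                             candidates=cands, x_of_u=lambda u: u)
u_noise = tau_zero_control(t=100, tau=21, K=np.eye(2),
                           candidates=cands, x_of_u=lambda u: u)
```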
35 According to the figure, zero control may give rise to early and sudden drops in the MSE values of matrix F. [sent-376, score-0.259]
36 Not surprisingly, however, zero control is unable to estimate matrix B. [sent-377, score-0.236]
37 As can be expected, τ-zero control, which is identical to D-optimal control in the first τ steps, falls behind D-optimal control after τ, since it changes the objective and estimates the noise rather than the parameters afterwards. [sent-379, score-0.522]
38 We investigated the noise estimation capability of the interrogation in (25) for four cases. [sent-420, score-0.428]
39 The first set of experiments illustrates that the estimation of driving noise et for large τ values barely differs if we replace the τ-infomax noise interrogation with the τ-zero interrogation scheme. [sent-421, score-0.861]
40 Parameters were the same as above and the MSE of the noise estimation was computed. [sent-422, score-0.226]
41 As shown in Fig. 3, for the case of τ = 21 the cost function (25) of the τ-zero interrogation is higher than that of the τ-infomax interrogation. [sent-424, score-0.234]
42 Given that τ-zero and τ-infomax noise interrogation behave similarly for large τ values, we compare the τ-zero interrogation scheme with other schemes in our numerical experiments. [sent-426, score-0.568]
43 In the second experiment we investigated the problem of noise estimation on a toy problem. [sent-427, score-0.226]
44 The following strategies were compared: zero control, infomax control, random control, and τ-zero control for different τ values. [sent-430, score-0.527]
45 It is clear from the figure that neither the zero control nor the infomax (D-optimal) control of Table 1 works for this case. [sent-433, score-0.348]
46 Note, however, that parameter estimation requires keeping the control values non-negligible forever (Table 1). [sent-435, score-0.241]
47 We investigate the D-optimal, the zero, the random, and the greedy noise control of (25), as well as the 71-zero and 101-zero controls. [sent-459, score-0.343]
48 Results show that if we may sacrifice the first τ steps, then the non-myopic τ-zero control gives rise to the smallest MSE of the estimated noise and the smallest values of the cost function (25) after τ steps among all studied control methods. [sent-460, score-0.604]
49 Neither random control nor the infomax interrogation of Table 1 (Fig. [sent-476, score-0.344]
50 (Joint Parameter and Noise Estimations.) In Section 8 we showed that the objective of the D-optimality principle is constant for the joint parameter and noise estimation task. [sent-498, score-0.326]
51 (a): MSE of noise estimation for different control strategies. [sent-512, score-0.405]
52 We compared this strategy with the parameter estimation strategy of the D-optimality principle, with zero control, with random control, and with τ-zero control for several τ values. [sent-520, score-0.653]
53 Inspecting the optimal control series of the winner, we found that the algorithm chooses control values from the boundaries of the hypercube in the first 45 or so iterations. [sent-524, score-0.358]
54 Then up to about 130 iterations it is switching between zero control and controls on the boundaries, but eventually it uses zero controls only. [sent-525, score-0.233]
55 That is, the A-optimality principle is able to detect the need for the switch from high control values (to determine the parameters of the dynamics) to zero control values for noise estimation. [sent-526, score-0.603]
56 Informally, we assume that our sources are doubly covered: they are the driving noise processes of AR processes which cannot be directly observed due to the mixing with an unknown matrix. [sent-532, score-0.231]
57 We will study our methods for this particular example. [sent-533, score-0.343]
[Figure: original noise and its estimates under random, infomax, and zero control (panels a, b); MSE curves comparing zero control, infomax control, random control, and 21-, 51-, and 81-zero control.]
60 Figure 7: Negative logarithm of the objective function for the joint parameter and noise estimation task for different K matrices. [sent-565, score-0.252]
61 Figure 8: MSE of the joint parameter and noise estimation task (d = 15, control dimension 30), comparing A-optimal joint control, D-optimal parameter control, random control, zero control, and 21-, 51-, and 81-zero control. [sent-567, score-1.327]
62 Comparisons between joint parameter and noise estimation using the A-optimality principle, parameter estimation using the D-optimality principle, random control, zero control, and τ-zero control for different τ values. [sent-568, score-0.7]
63 In our numerical experiments we studied the following special case: r_{t+1} = F r_t + B u_{t+1} + C e_{t+1}, where the dimension of the noise was 3 and the dimension of the control was 15. [sent-594, score-0.343]
64 We compared 5 different control methods (zero control, D-optimal control developed for parameter estimation, random control, A-optimal control developed for joint estimation of parameters and noise, as well as the τ-zero control with τ=81 that we developed for noise estimation). [sent-596, score-0.942]
65 , 1000 and then applying the JADE ICA algorithm (Cardoso, 1999) for the estimation of the noise components (et+1 ). [sent-600, score-0.226]
66 In short, τ-zero control performs slightly better than the joint parameter and noise estimation using the A-optimality principle. [sent-615, score-0.405]
67 Figure 9: ARX-ICA experiment; Amari distances (d = 3, control dimension 15) for zero control, D-optimal parameter control, random control, A-optimal joint control, and 81-zero control. [sent-619, score-0.743]
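The Amari distance reported in Figure 9 can be computed as below. This is one standard normalization of the Amari index (0 for a scaled permutation matrix, i.e., perfect ICA recovery up to permutation and scale); the paper's exact normalization may differ by a constant factor.

```python
import numpy as np

# Amari index of P = W @ A_mix: measures distance from a scaled
# permutation matrix, normalized to lie in [0, 1].
def amari_distance(P):
    P = np.abs(P)
    row = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    n = P.shape[0]
    return (row.sum() + col.sum()) / (2.0 * n * (n - 1))

perm = np.array([[0.0, 2.0, 0.0],
                 [0.0, 0.0, -1.0],
                 [3.0, 0.0, 0.0]])   # a scaled permutation matrix
```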
68 In this simulation, we studied the D-optimality principle and compared it with the random control method. [sent-628, score-0.233]
69 Dynamics of the pendulum are also determined by the different masses, that is, the masses of the links and the mass of the end effector as well as by the two motors, which are able to rotate the horizontal link and the swinging link in both directions. [sent-636, score-0.286]
70 [Table: experimental parameters, including the zero angular speed of the swinging link, the time intervals between interrogations, and the maximum control value.] [sent-640, score-0.324]
71 m: mass; l: length; M: mass of the end effector; subscript a: horizontal link; subscript p: swinging link; φ: angle of the horizontal link; θ: angle of the swinging link. [sent-654, score-0.241]
72 We observed these quantities r_t ∈ R^{144} and then calculated the D-optimal control u_{t+1} ∈ R^2 using the algorithm of Table 1, where we approximated the pendulum with the model r̃_{t+1} = F r_t + B u_{t+1}, F ∈ R^{144×144}, B ∈ R^{144×2}. [sent-667, score-0.421]
73 First, we note that the angle of the swinging link and the angular speeds are important from the point of view of the prediction of the dynamics, whereas the angle of the horizontal link can be neglected. [sent-673, score-0.241]
74 So, if our cumulated error estimate at time t was e(t) = ∑_{k=1}^{1728} e_k(t) and the pendulum entered the i-th domain at time t + 1, then we set e_k(t+1) = e_k(t) for all k ≠ i and e_i(t+1) = ||r_{t+1} − r̂_{t+1}||. [sent-689, score-0.244]
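The per-domain error bookkeeping just described amounts to refreshing a single entry of a 1,728-long vector; a small sketch follows (the domain index and the state vectors are illustrative).

```python
import numpy as np

# e(t) = sum_k e_k(t) over the 1,728 discretized domains; at t+1 only
# the entry of the entered domain i is refreshed with the new
# prediction-error norm.
N_DOMAINS = 1728
e = np.zeros(N_DOMAINS)

def update_error(e, i, r_next, r_pred):
    e = e.copy()                  # e_k(t+1) = e_k(t) for all k != i
    e[i] = np.linalg.norm(r_next - r_pred)
    return e

e = update_error(e, i=5, r_next=np.ones(144), r_pred=np.zeros(144))
```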
75 It is hard for the random control to guide the pendulum to the upper domain, so we also investigated how the D-optimal control performs here. [sent-696, score-0.486]
76 For the full domain the number of visited domains is 456 (26%) and 818 (47%) for the random control and the D-optimal control, respectively after 5,000 control steps (Fig. [sent-699, score-0.445]
77 While the D-optimal controlled pendulum visited more domains and achieved smaller errors, the domain-wise estimation error is about the same for the domains visited; both methods gained about 29. [sent-703, score-0.32]
78 The number of visited upper domains is 9 and 114 for the random control and for the D-optimal control, respectively (Fig. [sent-706, score-0.266]
79 (1) Control u_{t+1} is computed from the D-optimal principle, (2) the control acts upon the pendulum, (3) signals are predicted before the control step, (4) sensory information arrives after the control step. [sent-713, score-0.537]
80 (c): visited domains for swing angle above horizontal, (d): upper bound for cumulated estimation error for domains with swing angle above vertical. [sent-727, score-0.3]
81 We note that the D-optimal interrogation scheme is also called InfoMax control in the literature by Lewi et al. [sent-733, score-0.381]
82 The authors modeled spiking neurons and assumed that the main source of the noise is this spiking, which appears at the output of the neurons and adds linearly to the neural activity. [sent-740, score-0.246]
83 Due to the properties of the inverted Wishart distribution, the noise covariance matrix is not restricted to the isotropic form. [sent-765, score-0.249]
84 In the D-optimal case the optimal interrogation strategies for parameter (14) and noise estimation (25) appeared in attractive, intriguingly simple quadratic forms. [sent-768, score-0.428]
85 For the A-optimality principle, we found that the objective of the noise estimation task is identical to that of the D-optimality principle. [sent-772, score-0.252]
86 In the noise estimation task, however, cost functions of the A- and D-optimality principles are the same. [sent-784, score-0.291]
87 For the noise estimation task, we suggested a heuristic solution that we called τ-infomax noise interrogation. [sent-791, score-0.39]
88 Numerical experiments served to show that τ-infomax noise interrogation outperforms several other estimation strategies. [sent-792, score-0.428]
89 The novel τ-infomax noise interrogation uses the D-optimal interrogation of Table 1 for the first τ steps, and applies the noise estimation control detailed in (25) afterwards. [sent-793, score-0.973]
90 This heuristic decreases the estimation error of the coefficients of matrices F and B up to time τ, and thus, upon turning off the explorative D-optimization, tries to minimize the estimation error of the value of the noise at time τ + 1. [sent-794, score-0.313]
91 We introduced the τ-zero interrogation scheme and showed that it is a good approximation of the τ-infomax noise scheme for large τ values. [sent-795, score-0.366]
92 In the joint parameter and noise estimation task the D-optimal principle leads to a meaningless, constant-valued cost function. [sent-796, score-0.338]
93 Still, its myopic optimization gave us better results than the myopic D-optimal objective derived for parameter estimation only. [sent-798, score-0.224]
94 Interestingly, myopic A-optimal interrogations behaved similarly to the non-myopic τ-zero control (D-optimal parameter estimation up to τ steps, then zero control) with emergent automatic τ selection. [sent-799, score-0.349]
95 We illustrated the working of the algorithm on artificial databases both for the parameter estimation problem and for the noise estimation task. [sent-807, score-0.288]
96 We studied the robustness of the algorithm and examined the noise estimation problem for a situation in which the conditions of our theorems were not satisfied, namely when the noise was neither Gaussian nor i.i.d. [sent-809, score-0.39]
97 We found that active control can speed-up the learning process. [sent-816, score-0.229]
98 The pendulum problem demonstrated that D-optimality maximizes mutual information by exploring new areas without significant compromise in the precision of estimation in the visited domains. [sent-823, score-0.234]
99 Using the well-known formula for the entropy of a multivariate normally distributed variable (Cover and Thomas, 1991) and applying the relation |V ⊗ K^{−1}| = |V|^m / |K|^d, we have that H(A; V) = (1/2) ln|V ⊗ K^{−1}| + (dm/2) ln(2πe) = (m/2) ln|V| − (d/2) ln|K| + (dm/2) ln(2πe). [sent-852, score-0.286]
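The determinant identity |V ⊗ K⁻¹| = |V|^m / |K|^d used in this step can be checked numerically:

```python
import numpy as np

# Check |V (x) K^{-1}| = |V|^m / |K|^d for symmetric positive definite
# V (d x d) and K (m x m); (x) denotes the Kronecker product.
rng = np.random.default_rng(2)
d, m = 3, 4
Xv = rng.standard_normal((d, d)); V = Xv @ Xv.T + np.eye(d)
Xk = rng.standard_normal((m, m)); K = Xk @ Xk.T + np.eye(m)
lhs = np.linalg.det(np.kron(V, np.linalg.inv(K)))
rhs = np.linalg.det(V) ** m / np.linalg.det(K) ** d
```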
100 It follows from (36) that H(V) = −E_{IW_V(Q,n)}[−ln Z_{n,d} + (n/2) ln|Q/2| − ((n+d+1)/2) ln|V| − (1/2) tr(V^{−1}Q)] = ln Z_{n,d} − (n/2) ln|Q/2| + ((n+d+1)/2)(ln|Q| − ∑_{i=1}^{d} Ψ((n+1−i)/2) − d ln 2) + nd/2 = ((d+1)/2) ln|Q| + f_{1,4}(d, n), where f_{1,4}(d, n) depends only on d and n. [sent-862, score-0.426]
simIndex simValue paperId paperTitle
same-paper 1 0.99999821 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
2 0.20745072 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
3 0.19846022 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
4 0.14906681 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
5 0.13989849 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
Author: Alexander L. Strehl, Lihong Li, Michael L. Littman
Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the wellknown E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Keywords: reinforcement learning, Markov decision processes, PAC-MDP, exploration, sample complexity
6 0.13950096 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
7 0.12235885 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
8 0.10629082 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
9 0.091345057 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
10 0.077895492 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
11 0.062865108 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
12 0.060409553 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
13 0.058740024 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
14 0.054628104 23 jmlr-2009-Discriminative Learning Under Covariate Shift
15 0.052271906 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
16 0.050170146 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
17 0.048183657 46 jmlr-2009-Learning Halfspaces with Malicious Noise
18 0.048035242 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
19 0.047588706 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
20 0.046190932 67 jmlr-2009-Online Learning with Sample Path Constraints
topicId topicWeight
[(0, 0.307), (1, 0.17), (2, -0.211), (3, 0.123), (4, -0.018), (5, -0.181), (6, -0.027), (7, -0.035), (8, -0.178), (9, -0.195), (10, 0.147), (11, 0.123), (12, 0.042), (13, 0.073), (14, 0.027), (15, 0.058), (16, -0.192), (17, 0.021), (18, 0.061), (19, 0.045), (20, -0.058), (21, 0.036), (22, -0.05), (23, 0.154), (24, -0.163), (25, 0.076), (26, 0.044), (27, 0.052), (28, 0.033), (29, -0.028), (30, -0.047), (31, -0.074), (32, 0.029), (33, -0.104), (34, 0.055), (35, 0.035), (36, 0.053), (37, -0.079), (38, -0.007), (39, -0.053), (40, -0.05), (41, 0.033), (42, -0.037), (43, 0.085), (44, 0.04), (45, -0.084), (46, -0.008), (47, -0.058), (48, 0.076), (49, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.97987664 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design
2 0.70469719 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
3 0.59169239 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
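One concrete Bregman matrix divergence of the kind this abstract refers to is the LogDet (Burg) divergence, D(X, Y) = tr(XY⁻¹) − log det(XY⁻¹) − n, which stays finite only when positive definiteness (and hence rank) is preserved. A rough numerical sketch, illustrative rather than any part of the paper's algorithms:

```python
import numpy as np

def logdet_divergence(X, Y):
    """LogDet (Burg) Bregman matrix divergence between PD matrices:
    D(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n.
    Zero iff X == Y; blows up as Y loses rank, which is why it
    naturally preserves rank and positive semidefiniteness."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return float(np.trace(XYinv) - logdet - n)

X = np.diag([2.0, 3.0])
Y = np.eye(2)
d = logdet_divergence(X, Y)  # tr = 5, log det = log 6, n = 2
```

Minimizing such a divergence subject to constraints is the matrix-nearness template the paper builds on; the rank-preservation property falls out of the divergence itself rather than an explicit rank constraint.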
4 0.57226974 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as demonstrated for ALMA2. Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
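The projection idea described above — on a mistake, add the instance as a support vector only when it cannot be approximated in the span of the current support set, otherwise fold its projection into the existing coefficients — can be sketched roughly as follows. This is a simplified reading of the Projectron idea, not the authors' code; the threshold `eta` and the RBF kernel are illustrative choices.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

class Projectron:
    """Kernel Perceptron whose support set grows only when a mistaken
    instance lies (within eta) outside the span of the current set."""

    def __init__(self, eta=0.1, gamma=1.0):
        self.eta, self.gamma = eta, gamma
        self.sv, self.alpha = [], []

    def _kvec(self, x):
        return np.array([rbf(x, s, self.gamma) for s in self.sv])

    def predict(self, x):
        return float(np.dot(self.alpha, self._kvec(x))) if self.sv else 0.0

    def fit_one(self, x, y):
        if y * self.predict(x) > 0:
            return  # no mistake, no update
        if not self.sv:
            self.sv.append(x)
            self.alpha.append(float(y))
            return
        k = self._kvec(x)
        K = np.array([[rbf(a, b, self.gamma) for b in self.sv]
                      for a in self.sv])
        d = np.linalg.solve(K + 1e-8 * np.eye(len(self.sv)), k)
        # Squared distance of phi(x) from the span of the support set:
        resid_sq = rbf(x, x, self.gamma) - float(k @ d)
        if resid_sq <= self.eta ** 2:
            self.alpha = list(np.asarray(self.alpha) + y * d)  # project
        else:
            self.sv.append(x)
            self.alpha.append(float(y))

clf = Projectron(eta=0.1, gamma=1.0)
for x, y in [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]:
    clf.fit_one(np.array([x]), y)
```

The key line is the residual test: when `resid_sq` is small, the update is absorbed into the existing coefficients, so memory stays bounded while the hypothesis changes only slightly.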
5 0.45256799 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Author: Roberto Esposito, Daniele P. Radicioni
Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meanwhile, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
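For reference, the baseline that CarpeDiem improves on — the standard Viterbi recursion, Θ(T·K²) for T time steps and K labels — looks roughly like this in log space. This is a generic textbook sketch, not the CarpeDiem implementation.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most probable label path for a chain model, in log space.

    log_init:  (K,)   initial label scores
    log_trans: (K, K) transition scores, row = previous label
    log_emit:  (T, K) per-step label scores
    """
    T, K = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # K x K: prev -> current
        back[t] = np.argmax(cand, axis=0)   # best predecessor per label
        score = cand[back[t], np.arange(K)] + log_emit[t]
    # Backtrack from the best final label:
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

log_init = np.array([0.0, -5.0])
log_trans = np.array([[0.0, -5.0], [-5.0, 0.0]])
log_emit = np.array([[0.0, -5.0]] * 3)
path = viterbi(log_init, log_trans, log_emit)
```

The inner `cand` computation is the K² term per time step; CarpeDiem's contribution, per the abstract, is avoiding that full K² sweep in most cases while still returning the exact optimum.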
7 0.37189454 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
8 0.35975701 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
9 0.32573709 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
10 0.28386661 46 jmlr-2009-Learning Halfspaces with Malicious Noise
11 0.27216333 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
12 0.25482646 7 jmlr-2009-An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs
13 0.25077692 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
14 0.23920089 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
15 0.23842816 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
16 0.23200132 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
17 0.23132344 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
18 0.21082489 67 jmlr-2009-Online Learning with Sample Path Constraints
19 0.19985904 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
20 0.19611879 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
topicId topicWeight
[(8, 0.014), (11, 0.023), (21, 0.013), (26, 0.014), (38, 0.023), (47, 0.019), (52, 0.032), (55, 0.58), (58, 0.021), (66, 0.084), (68, 0.017), (90, 0.04), (96, 0.028)]
simIndex simValue paperId paperTitle
1 0.94686645 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1-loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
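The classical Hoeffding bound that this paper generalizes states that for n i.i.d. observations bounded in [0, 1], P(|mean − E[mean]| ≥ t) ≤ 2·exp(−2nt²). A quick Monte-Carlo sanity check on fair coin flips (the values of n, t, and the seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.15, 10_000

# Sample means of n fair-coin flips, repeated `trials` times:
means = rng.binomial(n, 0.5, size=trials) / n

# Empirical frequency of a deviation of at least t from the true mean 0.5:
empirical_tail = float(np.mean(np.abs(means - 0.5) >= t))

# Hoeffding's two-sided bound on that probability:
hoeffding_bound = 2 * np.exp(-2 * n * t ** 2)  # 2 exp(-4.5), about 0.022
```

The bound is loose here (the true tail probability is far smaller), which is exactly what makes Hoeffding-style inequalities robust enough to drive the uniform-deviation arguments the abstract describes — and why extending them to unbounded, dependent losses is nontrivial.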
same-paper 2 0.93552864 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
Author: Barnabás Póczos, András Lőrincz
3 0.91407442 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
Author: Junhui Wang, Xiaotong Shen, Wei Pan
Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method enables recovery of the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors
4 0.49939153 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
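A crude version of the querying idea above — ask for a label only when the current hypothesis is unsure, i.e., when |w·x| falls below a threshold — can be sketched like this. It is a deliberate simplification: a fixed threshold and the standard Perceptron update, whereas the paper's algorithm uses a modified update and an adaptive filtering rule.

```python
import numpy as np

def active_perceptron(points, oracle, threshold=0.5):
    """Query the label oracle only for low-margin points; update on mistakes."""
    w = np.zeros(points.shape[1])
    queries = 0
    for x in points:
        if abs(w @ x) >= threshold:
            continue  # confident: predict without spending a label
        queries += 1
        y = oracle(x)
        if y * (w @ x) <= 0:
            w = w + y * x  # standard Perceptron update on a mistake
    return w, queries

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # points on the unit sphere
w_star = np.eye(5)[0]                          # hidden separator
w, n_queries = active_perceptron(X, lambda x: 1 if x @ w_star >= 0 else -1)
```

Each mistake update adds y·x with y = sign(w*·x), so w·w* only accumulates nonnegative terms; the label savings come from skipping every point the hypothesis already classifies with margin.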
5 0.48248482 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
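The greedy least squares procedure analyzed here — repeatedly pick the feature most correlated with the current residual, then refit on the selected set — is the orthogonal-matching-pursuit-style loop sketched below. This is an illustrative sketch of the generic algorithm, not the paper's analysis code.

```python
import numpy as np

def greedy_least_squares(X, y, k):
    """Greedily select k columns of X to explain y.

    Each round: choose the column most correlated with the residual,
    refit least squares on all chosen columns, recompute the residual.
    """
    support = []
    residual = y.astype(float).copy()
    beta = np.array([])
    for _ in range(k):
        corr = np.abs(X.T @ residual)
        if support:
            corr[support] = -np.inf  # never reselect a chosen feature
        support.append(int(np.argmax(corr)))
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta
    return support, beta

# Noise-free toy problem with an orthogonal design:
X = np.eye(6)
y = np.zeros(6)
y[1], y[4] = 2.0, 3.0
support, beta = greedy_least_squares(X, y, k=2)
```

In this noise-free orthogonal case the greedy rule recovers the true support exactly; the paper's contribution is characterizing when this still holds with noise and correlated designs, under the irrepresentable and sparse eigenvalue conditions.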
6 0.45716017 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
7 0.44730434 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
8 0.43830404 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
9 0.43825457 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
10 0.42992485 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence
11 0.42900419 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
13 0.40956837 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
14 0.40499628 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
15 0.40072826 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
16 0.39962056 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
17 0.39908287 38 jmlr-2009-Hash Kernels for Structured Data
18 0.39212355 18 jmlr-2009-Consistency and Localizability
19 0.39183587 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
20 0.37848917 87 jmlr-2009-Sparse Online Learning via Truncated Gradient