jmlr jmlr2008 jmlr2008-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
Reference: text
sentIndex sentText sentNum sentScore
1 We present a simultaneous online update rule that uses the entire set of binary examples received at each trial while retaining the simplicity of algorithms whose update is based on a single binary example. [sent-19, score-0.444]
2 A template algorithm for additive simultaneous projection in an online learning setting with multiple instances is described in Sec. [sent-106, score-0.358]
3 At trial $t$ the algorithm receives a matrix $X_t$ of size $k_t \times n$, where each row of $X_t$ is an instance, and is required to make a prediction on the label associated with each instance. [sent-131, score-0.551]
4 We allow $\hat{y}_{tj}$ to take any value in $\mathbb{R}$, where the actual label being predicted is $\mathrm{sign}(\hat{y}_{tj})$ and $|\hat{y}_{tj}|$ is the confidence in the prediction. [sent-133, score-0.39]
5 After making a prediction $\hat{y}_t$ the algorithm receives the correct labels $y_t$ where $y_{tj} \in \{-1, 1\}$ for all $j \in [k_t]$. [sent-134, score-0.712]
6 We thus say that the vector $y_t$ was imperfectly predicted if there exists an outcome $j$ such that $y_{tj} \neq \mathrm{sign}(\hat{y}_{tj})$. [sent-137, score-0.496]
7 Therefore, we use an adaptation of the hinge-loss, defined as $\ell(\hat{y}_t, y_t) = \max_j [1 - y_{tj}\hat{y}_{tj}]_+$, as a proxy for the combinatorial error. [sent-140, score-0.876]
8 The quantity $y_{tj}\hat{y}_{tj}$ is often referred to as the (signed) margin of the prediction and ties the correctness and the confidence in the prediction. [sent-141, score-0.765]
9 We use $\ell(\omega_t; (X_t, y_t))$ to denote $\ell(\hat{y}_t, y_t)$ where $\hat{y}_t = X_t \omega_t$. [sent-142, score-0.405]
10 We also denote the set of instances whose labels were predicted incorrectly by $M_t = \{ j \mid \mathrm{sign}(\hat{y}_{tj}) \neq y_{tj} \}$, and similarly the set of instances whose hinge-losses are greater than zero by $\Gamma_t = \{ j \mid [1 - y_{tj}\hat{y}_{tj}]_+ > 0 \}$. [sent-143, score-0.98]
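The quantities above are straightforward to compute. The following sketch (our illustration in NumPy under the notation of this section, not code from the paper; the function name and signature are ours) evaluates the prediction vector, the max-hinge loss, and the sets $M_t$ and $\Gamma_t$ for a single trial.

```python
import numpy as np

def trial_quantities(omega, X, y):
    """Prediction, max-hinge loss, mistake set M_t and support set Gamma_t for one trial.

    omega : (n,) weight vector; X : (k_t, n) instance matrix; y : (k_t,) labels in {-1, +1}.
    """
    y_hat = X @ omega                           # predictions, one per instance
    margins = y * y_hat                         # signed margins y_tj * yhat_tj
    hinge = np.maximum(0.0, 1.0 - margins)      # per-instance hinge losses [1 - y * yhat]_+
    loss = hinge.max()                          # ell(omega; (X, y)) = max_j [1 - y_tj * yhat_tj]_+
    M = np.where(np.sign(y_hat) != y)[0]        # instances whose labels were predicted incorrectly
    Gamma = np.where(hinge > 0)[0]              # instances with strictly positive hinge loss
    return y_hat, loss, M, Gamma
```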
11 Figure 1: Illustration of the simultaneous projections algorithm: each instance casts a constraint on $\omega$ and each such constraint defines a halfspace of feasible solutions. [sent-180, score-0.361]
12 Simultaneous Projection Algorithms. Recall that on trial $t$ the algorithm receives a matrix, $X_t$, of $k_t$ instances, and predicts $\hat{y}_t = X_t \omega_t$. [sent-182, score-0.63]
13 Each instance-label pair casts a constraint on $\omega_t$, namely $y_{tj}\,\omega_t \cdot x_{tj} \ge 1$. [sent-184, score-0.708]
14 $y_{tj}\,\omega \cdot x_{tj} \ge 1 - \xi$, $\xi \ge 0$ (2). We denote the objective function of Eq. [sent-194, score-0.664]
15 The dual optimization problem of $P_t$ is the maximization problem $\max_{\alpha^t_1, \ldots, \alpha^t_{k_t}} \sum_{j=1}^{k_t} \alpha^t_j - \frac{1}{2}\big\|\omega_t + \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}\big\|^2$. [sent-196, score-0.492]
16 The complete derivation is given in Appendix A. [sent-198, score-1.366]
17 For $t = 1, \ldots, T$: receive instance matrix $X_t \in \mathbb{R}^{k_t \times n}$; predict $\hat{y}_t = X_t \omega_t$; receive correct labels $y_t$; suffer loss $\ell(\omega_t; (X_t, y_t))$; if $\ell > 0$: choose importance weights $\mu^t$ s.t. [sent-208, score-0.46]
18 $\mu^t_j \ge 0$ and $\sum_{j=1}^{k_t} \mu^t_j = 1$; choose individual dual solutions $\alpha^t_j$; update $\omega_{t+1} = \omega_t + \sum_{j=1}^{k_t} \mu^t_j \alpha^t_j y_{tj} x_{tj}$ (Figure 2: Template of simultaneous projections algorithm). [sent-210, score-1.015]
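A minimal sketch of one trial of the template in Figure 2 is given below (our illustration, not code from the paper); the callables choose_alpha and choose_mu stand for whichever of the schemes discussed later is used to pick the individual dual solutions and the importance weights.

```python
import numpy as np

def simproj_trial(omega, X, y, choose_mu, choose_alpha):
    """One trial of the additive simultaneous-projections template (sketch of Figure 2)."""
    y_hat = X @ omega                               # predict
    loss = np.maximum(0.0, 1.0 - y * y_hat).max()   # suffer loss ell(omega; (X, y))
    if loss > 0:
        alpha = choose_alpha(omega, X, y)           # individual dual solutions alpha_j^t
        mu = choose_mu(omega, X, y, alpha)          # importance weights on the simplex
        omega = omega + (mu * alpha * y) @ X        # omega_{t+1} = omega_t + sum_j mu_j alpha_j y_j x_j
    return omega, y_hat
```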
19 The minimizer of the primal problem is calculated from the optimal dual solution as follows: $\omega_{t+1} = \omega_t + \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}$. [sent-212, score-0.834]
20 Unfortunately, in the common case, where each $x_{tj}$ is in an arbitrary orientation, there does not exist an analytic solution for the dual problem (Eq. [sent-213, score-0.447]
21 We tackle the problem by breaking it down into $k_t$ reduced problems, each of which focuses on a single dual variable. [sent-216, score-0.482]
22 Each reduced optimization problem amounts to the following problem: $\max_{\alpha^t_j}\ \alpha^t_j - \frac{1}{2}\big\|\omega_t + \alpha^t_j y_{tj} x_{tj}\big\|^2$ s.t. [sent-220, score-0.729]
23 We then choose a non-negative vector $\mu \in \Delta^{k_t}$, where $\Delta^{k_t}$ is the $k_t$-dimensional probability simplex; formally, $\mu_j \ge 0$ and $\sum_{j=1}^{k_t} \mu_j = 1$. [sent-227, score-0.702]
24 Finally, the algorithm uses the combined solution and sets $\omega_{t+1} = \omega_t + \sum_{j=1}^{k_t} \mu^t_j \alpha^t_j y_{tj} x_{tj}$. [sent-232, score-0.695]
25 Figure 3: Schemes for choosing $\mu$ and $\alpha$. [sent-236, score-0.606]
26 Soft Simultaneous Projections. The soft simultaneous projections scheme uses the fact that each reduced problem has an analytic solution, which yields $\alpha^t_j = \min\big\{C,\ \ell(\omega_t; (x_{tj}, y_{tj})) / \|x_{tj}\|^2\big\}$. [sent-250, score-0.942]
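The closed-form reduced solutions above plug directly into the template; the sketch below (again our illustration, with a uniform choice of $\mu$ as one simple option) follows the displayed formula.

```python
import numpy as np

def soft_alpha(omega, X, y, C):
    """Closed-form reduced solutions: alpha_j = min(C, ell(omega; (x_j, y_j)) / ||x_j||^2)."""
    per_loss = np.maximum(0.0, 1.0 - y * (X @ omega))   # hinge loss of each instance
    sq_norms = np.einsum('ij,ij->i', X, X)              # squared norm of each row x_j
    return np.minimum(C, per_loss / np.maximum(sq_norms, 1e-12))

def uniform_mu(k):
    """One simple choice of importance weights: uniform over the k_t instances."""
    return np.full(k, 1.0 / k)
```

These helpers can be passed to the template sketch above, for example with choose_alpha = lambda o, Xm, yv: soft_alpha(o, Xm, yv, C) and choose_mu = lambda o, Xm, yv, a: uniform_mu(len(yv)).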
27 By exploring the structure of the problem at hand we show that this joint optimization problem can be solved efficiently in $k_t \log k_t$ time. [sent-270, score-0.745]
28 $1, \ldots, k_t$ into two sets: indices $j$ for which $\mu_j > 0$ and indices for which $\mu_j = 0$. [sent-332, score-0.351]
29 This variant of the simultaneous projections framework is guaranteed to yield the largest increase in the dual compared to all other simultaneous projections schemes. [sent-389, score-0.589]
30 $\forall t \in [T], \forall j \in [k_t]: y_{tj}\,\omega \cdot x_{tj} \ge 1 - \xi_t$; $\forall t: \xi_t \ge 0$. [sent-400, score-0.664]
31 (11) is: $\max_{\lambda} \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} - \frac{1}{2}\big\|\sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} y_{tj} x_{tj}\big\|^2$ s.t. [sent-404, score-1.736]
32 Throughout our derivation we use the fact that any set of dual variables $\lambda_1, \cdots, \lambda_T$ defines a feasible solution $\omega = \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} y_{tj} x_{tj}$ with a corresponding assignment of the slack variables. [sent-409, score-0.889]
33 (13) can be rewritten as $\max_{\lambda_1, \ldots} \sum_j \lambda_j - \frac{1}{2}\big\|\omega_t + \sum_{j=1}^{k_t} \lambda_j y_{tj} x_{tj}\big\|^2$. [sent-428, score-1.034]
34 $\ldots, X_T, y_T$ be a sequence of examples where $X_t$ is a matrix of $k_t$ examples and $y_t$ are the associated labels. [sent-450, score-0.486]
35 We remind the reader that by unraveling the update of $\omega_t$ we get that $\omega_t = \sum_{s < t} \sum_j \mu^s_j \alpha^s_j y_{sj} x_{sj}$, and that the term $\|\omega_t + C y_{tj} x_{tj}\|^2$ can be upper bounded. [sent-495, score-0.352]
36 Therefore, $\Delta_t$ can further be bounded from below as follows: $\Delta_t \ge \sum_{j \in M_t} \mu^t_j \big(C - \frac{1}{2} C^2 \|y_{tj} x_{tj}\|^2\big) \ge \ldots$ [sent-496, score-0.664]
37 Decomposable Losses. Recall that our algorithms tackle complex decision problems by decomposing each instance into multiple binary decision tasks; thus, on trial $t$ the algorithm receives $k_t$ instances. [sent-521, score-0.556]
38 The classification scheme is evaluated by looking at the maximal violation of the margin constraints, $\ell(\hat{y}_t, y_t) = \max_j [1 - y_{tj}\hat{y}_{tj}]_+$. [sent-522, score-0.477]
39 As a corollary we obtain a Simultaneous Projection algorithm that is competitive with the average performance error on each set of $k_t$ instances. [sent-525, score-0.351]
40 On trial $t$ the algorithm receives a partition of the $k_t$ instances into $r_t$ sets. [sent-527, score-0.6]
41 The definition of the loss is extended to $\hat{\ell}(\hat{y}_t, y_t) = \frac{1}{r_t} \sum_{l=1}^{r_t} \max_{j \in S^t_l} [1 - y_{tj}\hat{y}_{tj}]_+$. [sent-534, score-0.712]
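A short sketch of this extended loss (our illustration; the partition is passed as a list of index arrays, one per set $S^t_l$): the average, over the sets of the partition, of the worst margin violation inside each set.

```python
import numpy as np

def decomposable_loss(y_hat, y, partition):
    """ell_hat(y_hat, y) = (1 / r_t) * sum_l max_{j in S_l} [1 - y_j * y_hat_j]_+ ."""
    hinge = np.maximum(0.0, 1.0 - y * y_hat)
    return float(np.mean([hinge[S].max() for S in partition]))
```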
42 Thus, on each iteration the algorithm receives $k_t$ instances and a partition of the labels into sets $S^t_1, \ldots$ [sent-538, score-0.458]
43 For each set $S^t_l$, $\sum_{j \in S^t_l} \mu^t_j = 1$; choose individual dual solutions $\alpha^t_j$; update $\omega_{t+1} = \omega_t + \sum_{l=1}^{r_t} \sum_{j \in S^t_l} \mu^t_j \alpha^t_j y_{tj} x_{tj}$ (Figure 5: The extended simultaneous projections algorithm for decomposable losses). [sent-554, score-1.032]
44 $y_{tj}\,\omega \cdot x_{tj} \ge 1 - \xi_l$ (18); $\forall l \in [r_t]: \xi_l \ge 0$. The dual of Eq. [sent-558, score-0.416]
45 (18) is thus $\max_{\alpha^t_1, \ldots} \sum_{j=1}^{k_t} \alpha^t_j - \frac{1}{2}\big\|\omega_t + \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}\big\|^2$. [sent-559, score-1.366]
46 $\forall j \in S^t_l: y_{tj}\,\omega \cdot x_{tj} \ge 1 - \xi_{t,l}$, $\forall t\,\forall l: \xi_{t,l} \ge 0$ (19), and its dual is $\max_{\lambda} \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} - \frac{1}{2}\big\|\sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} y_{tj} x_{tj}\big\|^2$ s.t. [sent-578, score-2.162]
47 As previously shown, the simultaneous projection scheme can be viewed as an incremental update to the dual of Eq. [sent-587, score-0.373]
48 It is interesting to note that for every decomposition of the $k_t$ instances into sets, the value of $\hat{\ell}(\omega; (X_t, y_t))$ is upper bounded by $\ell(\omega; (X_t, y_t))$, as $\hat{\ell}$ is the average over the margin violations while $\ell$ corresponds to the worst margin violation. [sent-589, score-0.706]
49 $\big(C - \frac{1}{2} C^2 R^2\big)$. In conclusion, the simultaneous projection scheme allows us to easily obtain online algorithms and update schemes for complex problems; such algorithms are obtained by decomposing a complex problem into multiple binary problems. [sent-596, score-0.406]
50 Simultaneous Multiplicative Updates. In this section we describe and analyze a multiplicative version of the simultaneous projection scheme. [sent-601, score-0.376]
51 Since we now prevent such scaling due to the choice of the simplex domain, we need to slightly modify the definition of the loss and introduce the following definition: $\ell_\gamma(\hat{y}_t, y_t) = \max_j [\gamma - y_{tj}\hat{y}_{tj}]_+$. [sent-612, score-0.892]
52 Recall that on trial $t$ the algorithm receives $k_t$ instances arranged in a matrix $X_t$. [sent-613, score-0.548]
53 $\max_{\alpha^t}\ \gamma \sum_{j=1}^{k_t} \alpha^t_j - \log \sum_{i=1}^{n} \omega_{ti} \exp\big(\sum_{j=1}^{k_t} \tau^i_j\big)$ s.t. $\sum_{j=1}^{k_t} \alpha^t_j \le C$, [sent-621, score-0.366]
54 $\forall j: \alpha^t_j \ge 0$, $\tau_j = \alpha^t_j y_{tj} x_{tj}$ (21). The prediction vector $\omega$ is set as follows: $\omega_i = \omega_{ti} \exp\big(\sum_{j=1}^{k_t} \tau^i_j\big) \Big/ \sum_{l=1}^{n} \omega_{tl} \exp\big(\sum_{j=1}^{k_t} \tau^l_j\big)$. [sent-622, score-0.691]
55 (21) into $k_t$ separate problems, each concerning a single dual variable. [sent-626, score-0.464]
56 (23) with $\tau_j = \alpha^t_j y_{tj} x_{tj}$ and $\alpha^t_j \le C$. We next obtain an exact or approximate solution for each reduced problem as if it were independent of the rest. [sent-630, score-0.713]
57 Each $\mu^t_j \alpha^t_j \ge 0$, and the fact that $\alpha^t_j \le C$ implies that $\sum_{j=1}^{k_t} \mu^t_j \alpha^t_j \le C$. [sent-634, score-0.351]
58 The template of the multiplicative simultaneous projections algorithm is described in Fig. [sent-637, score-0.425]
59 For $t = 1, \ldots, T$: receive instance matrix $X_t \in \mathbb{R}^{k_t \times n}$; predict $\hat{y}_t = X_t \omega_t$; receive correct labels $y_t$; suffer loss $\ell(\omega_t; (X_t, y_t))$; if $\ell > 0$: choose importance weights $\mu^t$ s.t. [sent-651, score-0.46]
60 $\sum_{j=1}^{k_t} \mu^t_j = 1$; choose individual dual solutions $\alpha^t_j$; compute $\tau_j = \alpha^t_j y_{tj} x_{tj}$; update $\omega_{t+1,i} = \omega_{ti} \exp\big(\sum_{j=1}^{k_t} \mu^t_j \tau^i_j\big) \Big/ \sum_{l} \omega_{tl} \exp\big(\sum_{j=1}^{k_t} \mu^t_j \tau^l_j\big)$ (Figure 6: The multiplicative simultaneous projections algorithm). [sent-653, score-1.202]
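A sketch of the multiplicative update of Figure 6, assuming the dual values alpha and the weights mu have already been chosen by one of the schemes (our reading of the displayed update, not code from the paper); omega is assumed to lie on the probability simplex.

```python
import numpy as np

def multiplicative_update(omega, X, y, alpha, mu):
    """omega_{t+1,i} proportional to omega_{t,i} * exp(sum_j mu_j * tau_j[i]), with tau_j = alpha_j * y_j * x_j."""
    tau = (alpha * y)[:, None] * X               # tau_j = alpha_j * y_j * x_j, shape (k_t, n)
    exponent = mu @ tau                          # sum_j mu_j * tau_j, one entry per feature
    unnormalized = omega * np.exp(exponent)
    return unnormalized / unnormalized.sum()     # renormalize back onto the simplex
```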
61 $\ldots, X_T, y_T$ be a sequence of examples where $X_t$ is a matrix of $k_t$ examples and $y_t$ are the associated labels. [sent-666, score-0.486]
62 $\forall t \in [T], \forall j \in [k_t]: y_{tj}\,\omega \cdot x_{tj} \ge \gamma - \xi_t$ s.t. [sent-672, score-0.664]
63 (24) is $\gamma \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{tj} - \log \sum_{i=1}^{n} \exp\big(\sum_{t=1}^{T} \sum_{j=1}^{k_t} \tau^i_{tj}\big)$ s.t. [sent-675, score-0.717]
64 $\forall t \in [T]: \sum_{j=1}^{k_t} \lambda_{tj} \le C$. [sent-677, score-0.351]
65 $\forall t, \forall j: \lambda_{tj} \ge 0$; $\forall t, \forall j: \tau_{tj} = \lambda_{tj} y_{tj} x_{tj}$ (25). We denote the objective of Eq. [sent-678, score-1.216]
66 Lemma 7: Let $\theta = \sum_{l=1}^{t-1} \sum_{j=1}^{k_l} \lambda_{l,j} y_{lj} x_{lj}$ denote the dual variables assigned in trials prior to $t$ by the SimPerc scheme. [sent-688, score-0.824]
67 Then, the difference $\log \sum_{i=1}^{n} \exp\big(\theta_i + C x_{tji} y_{tj}\big) - \log \sum_{i=1}^{n} \exp(\theta_i)$ is upper bounded by $\frac{1}{2} C^2 \|x\|_\infty^2$. [sent-690, score-0.391]
68 To recap, we showed that the instantaneous dual can be seen as incrementally constructing an assignment for a global dual function (Eq. [sent-702, score-0.353]
69 On the synthetic data we compare our simultaneous projections algorithms with a commonly used technique whose updates are based on the most violating constraint on each online round (see for instance Crammer et al. [sent-721, score-0.372]
70 This update form constitutes a feasible solution to the instantaneous dual and casts a simple update for the online algorithm. [sent-740, score-0.463]
71 In order to compare both the additive and the multiplicative versions of our framework, we confined ourselves to the more restrictive setting of the multiplicative schemes as described in Sec. [sent-758, score-0.449]
72 Figure 7 (number of mistakes vs. block size): The number of mistakes suffered by the various additive and multiplicative simultaneous projections methods. [sent-770, score-0.55]
73 We compared all simultaneous projection variants presented earlier, as well as the multiplicative and additive versions of the MaxPA update. [sent-785, score-0.472]
74 Figure 8 (x-axis: percent relevant features): The performance of the additive and multiplicative simultaneous projections algorithms as a function of the sparsity of the hypothesis generating the data. [sent-814, score-0.478]
75 Figure 9 (x-axis: label noise): The number of mistakes of the additive and multiplicative simultaneous projections algorithms as a function of the label noise. [sent-867, score-0.579]
76 Table 1: The percentage of online mistakes of the four additive variants compared to MaxPA and the optimal solver of each instantaneous problem. [sent-911, score-0.366]
77 As the number of instances per trial increases, the performance of all the simultaneous projections variants is comparable and they all perform better than any of the MaxPA variants. [sent-920, score-0.441]
78 Email Classification Experiments. We next tested the performance of the different additive and multiplicative simultaneous projection methods described in Sec. [sent-922, score-0.429]
79 Table 2: The percentage of online mistakes of three additive variants and the MaxPA algorithm compared to their multiplicative counterparts. [sent-1032, score-0.418]
80 Table 3: The percentage of online mistakes of four additive simultaneous projection algorithms. [sent-1106, score-0.377]
81 We then use our simultaneous projections scheme to propose a feasible solution to the optimization problem which competes with any decomposition loss (see Sec. [sent-1115, score-0.371]
82 If, on the other hand, all instances received on trial t are exactly the same, then the simultaneous projections approach cannot hope to attain anything better than the MaxPA algorithm. [sent-1122, score-0.398]
83 On the other hand, the simultaneous projections approach cannot easily construct a feasible dual solution where multiple equality constraints are required. [sent-1134, score-0.418]
84 (2) as follows: $\min_{\omega \in \mathbb{R}^n, \xi} \max_{\alpha^t \ge 0, \beta \ge 0} \frac{1}{2}\|\omega - \omega_t\|^2 + C\xi + \sum_{j=1}^{k_t} \alpha^t_j \big(1 - \xi - y_{tj}\,\omega \cdot x_{tj}\big) - \beta \xi$. [sent-1141, score-0.664]
85 We rearrange the terms in the above equation and rewrite it as follows: $\min_{\omega \in \mathbb{R}^n, \xi} \max_{\alpha^t \ge 0, \beta \ge 0} \sum_{j=1}^{k_t} \alpha^t_j + \frac{1}{2}\|\omega - \omega_t\|^2 - \sum_{j=1}^{k_t} \alpha^t_j y_{tj}\,\omega \cdot x_{tj} + \xi\big(C - \sum_{j=1}^{k_t} \alpha^t_j - \beta\big)$. [sent-1142, score-1.736]
86 (26) is attained by changing the order of the min and max and is given by $\max_{\alpha^t \ge 0, \beta \ge 0} \min_{\omega \in \mathbb{R}^n} \sum_{j=1}^{k_t} \alpha^t_j + \frac{1}{2}\|\omega - \omega_t\|^2 - \sum_{j=1}^{k_t} \alpha^t_j y_{tj}\,\omega \cdot x_{tj} + \min_{\xi}\ \xi\big(C - \sum_{j=1}^{k_t} \alpha^t_j - \beta\big)$. [sent-1144, score-1.807]
87 (27) The equation above can be written equivalently as $\max_{\alpha^t \ge 0} \min_{\omega \in \mathbb{R}^n} \sum_{j=1}^{k_t} \alpha^t_j + \frac{1}{2}\|\omega - \omega_t\|^2 - \sum_{j=1}^{k_t} \alpha^t_j y_{tj}\,\omega \cdot x_{tj}$ s.t. [sent-1145, score-1.761]
88 The constraint $\beta \ge 0$ thus translates to the constraint $\sum_{j=1}^{k_t} \alpha^t_j \le C$. [sent-1152, score-0.393]
89 Fixing $\alpha^t$, the derivative of $L$ with respect to $\omega$ is given by $\frac{\partial L}{\partial \omega} = \omega - \omega_t - \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}$. [sent-1153, score-1.015]
90 Setting the derivative to 0 yields the following equation: $\omega = \omega_t + \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}$. [sent-1154, score-0.664]
91 (28) yields the following simplified constrained optimization problem: $\max_{\alpha^t \ge 0} \sum_{j=1}^{k_t} \alpha^t_j + \frac{1}{2}\big\|\sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}\big\|^2 - \sum_{j=1}^{k_t} \alpha^t_j y_{tj} \big(\omega_t + \sum_{l=1}^{k_t} \alpha^t_l y_{tl} x_{tl}\big) \cdot x_{tj}$. [sent-1156, score-1.366]
92 j=1 Rearranging the terms and adding constants which do not depend of αt , we obtain the following dual problem, kt max ∑ t α j=1 αtj − kt 1 t ω + ∑ αtj ytj xtj 2 j=1 kt ∑ αtj ≤ C s. [sent-1159, score-1.849]
93 $y_{tj}\,\omega \cdot x_{tj} \ge \gamma - \xi$ (29), $\xi \ge 0$. We again use Lagrange theory and rewrite the optimization task above as $\min_{\omega \in \Delta^n, \xi \ge 0} \max_{\alpha^t_j \ge 0} \sum_{i=1}^{n} \omega_i \log\frac{\omega_i}{\omega_{ti}} + C\xi + \sum_{j=1}^{k_t} \alpha^t_j \big(\gamma - \xi - y_{tj}\,\omega \cdot x_{tj}\big)$. [sent-1168, score-0.664]
94 Rearranging the terms in the above expression we obtain $\min_{\omega \in \Delta^n, \xi \ge 0} \max_{\alpha^t_j \ge 0} \sum_{i=1}^{n} \omega_i \log\frac{\omega_i}{\omega_{ti}} + \xi\big(C - \sum_{j=1}^{k_t} \alpha^t_j\big) + \sum_{j=1}^{k_t} \alpha^t_j \big(\gamma - y_{tj}\,\omega \cdot x_{tj}\big)$. [sent-1169, score-1.4]
95 (30) is thus obtained by reversing the order of the min and max and is given by $\max_{\alpha^t_j \ge 0} \min_{\omega \in \Delta^n, \xi \ge 0} \sum_{i=1}^{n} \omega_i \log\frac{\omega_i}{\omega_{ti}} + \xi\big(C - \sum_{j=1}^{k_t} \alpha^t_j\big) + \sum_{j=1}^{k_t} \alpha^t_j \big(\gamma - y_{tj}\,\omega \cdot x_{tj}\big)$. [sent-1171, score-1.419]
96 (31) The equation above can be rewritten equivalently as $\max_{\alpha^t_j, \beta_t} \min_{\omega \in \mathbb{R}^n} \sum_{i=1}^{n} \omega_i \log\frac{\omega_i}{\omega_{ti}} + \sum_{j=1}^{k_t} \alpha^t_j \big(\gamma - y_{tj}\,\omega \cdot x_{tj}\big) + \beta_t \big(\sum_{i=1}^{n} \omega_i - 1\big)$ [sent-1172, score-0.41]
97 s.t. $\sum_{j=1}^{k_t} \alpha^t_j \le C$, $\forall j: \alpha^t_j \ge 0$. [sent-1174, score-1.015]
98 (32), let us first denote the vector $\sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}$ by $\tau$. [sent-1182, score-0.664]
99 (29): $\max_{\alpha^t}\ \gamma \sum_{j=1}^{k_t} \alpha^t_j - \log \sum_{i=1}^{n} \omega_{ti} e^{\tau_i}$ s.t. $\sum_{j=1}^{k_t} \alpha^t_j \le C$, [sent-1189, score-0.736]
100 $\forall j: \alpha^t_j \ge 0$, where $\tau = \sum_{j=1}^{k_t} \alpha^t_j y_{tj} x_{tj}$. [sent-1191, score-1.015]
wordName wordTfidf (topN-words)
[('tj', 0.552), ('ytj', 0.361), ('kt', 0.351), ('xtj', 0.303), ('multiplicative', 0.187), ('simperc', 0.146), ('simultaneous', 0.142), ('yt', 0.135), ('simproj', 0.128), ('dual', 0.113), ('trial', 0.107), ('conproj', 0.105), ('inger', 0.105), ('maxpa', 0.105), ('imultaneous', 0.099), ('rojections', 0.099), ('instantaneous', 0.099), ('projections', 0.096), ('hwartz', 0.089), ('nline', 0.084), ('sing', 0.084), ('simopt', 0.076), ('mistakes', 0.072), ('mistake', 0.072), ('xt', 0.071), ('stl', 0.07), ('sopopo', 0.064), ('online', 0.063), ('additive', 0.053), ('instances', 0.053), ('rt', 0.052), ('update', 0.049), ('trials', 0.047), ('projection', 0.047), ('multiclass', 0.046), ('email', 0.045), ('variants', 0.043), ('recap', 0.041), ('categorization', 0.038), ('ti', 0.038), ('crammer', 0.038), ('receives', 0.037), ('feasible', 0.036), ('solver', 0.036), ('singer', 0.034), ('imperfect', 0.034), ('earning', 0.033), ('tl', 0.032), ('solution', 0.031), ('cast', 0.03), ('block', 0.03), ('label', 0.029), ('competitor', 0.029), ('assignment', 0.028), ('synthetic', 0.028), ('optimization', 0.028), ('attained', 0.027), ('prediction', 0.027), ('primal', 0.026), ('minn', 0.025), ('assures', 0.025), ('th', 0.023), ('casts', 0.023), ('enron', 0.023), ('rkt', 0.023), ('xtji', 0.023), ('scheme', 0.022), ('schemes', 0.022), ('complex', 0.022), ('instance', 0.022), ('constraint', 0.021), ('tied', 0.02), ('ordinal', 0.02), ('kivinen', 0.02), ('receive', 0.019), ('message', 0.019), ('prototype', 0.019), ('max', 0.019), ('reduced', 0.018), ('aggressive', 0.018), ('jc', 0.018), ('aggressiveness', 0.017), ('categorizer', 0.017), ('mmm', 0.017), ('multilabel', 0.017), ('strt', 0.017), ('username', 0.017), ('slack', 0.017), ('attributed', 0.017), ('binary', 0.017), ('decomposable', 0.017), ('labels', 0.017), ('margin', 0.016), ('loss', 0.016), ('losses', 0.016), ('task', 0.016), ('log', 0.015), ('messages', 0.015), ('schapire', 0.015), ('violation', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
2 0.1163381 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
3 0.10349227 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
4 0.081096873 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
5 0.072571129 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
6 0.064845115 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
7 0.050112016 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
8 0.038948279 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
9 0.038942214 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
10 0.032377169 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
11 0.030799801 58 jmlr-2008-Max-margin Classification of Data with Absent Features
12 0.030554626 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.029711505 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
14 0.02938268 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
15 0.028349977 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
16 0.027800614 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
17 0.027434509 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
18 0.026443873 52 jmlr-2008-Learning from Multiple Sources
19 0.026209075 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
20 0.025358493 86 jmlr-2008-SimpleMKL
topicId topicWeight
[(0, 0.145), (1, -0.038), (2, -0.0), (3, 0.026), (4, -0.127), (5, -0.1), (6, -0.039), (7, 0.065), (8, -0.206), (9, -0.181), (10, -0.107), (11, -0.171), (12, -0.051), (13, -0.098), (14, -0.169), (15, 0.06), (16, -0.231), (17, 0.064), (18, 0.164), (19, 0.012), (20, -0.081), (21, -0.152), (22, -0.071), (23, 0.048), (24, -0.016), (25, -0.059), (26, -0.002), (27, 0.211), (28, -0.08), (29, -0.125), (30, -0.051), (31, -0.099), (32, -0.045), (33, -0.152), (34, 0.007), (35, 0.062), (36, -0.104), (37, 0.013), (38, 0.019), (39, -0.036), (40, -0.022), (41, 0.11), (42, 0.1), (43, -0.057), (44, 0.056), (45, 0.0), (46, 0.184), (47, 0.163), (48, -0.108), (49, -0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.95176947 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
2 0.67272788 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
3 0.40770686 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
4 0.34527367 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
5 0.23826483 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
Author: Tianjiao Chu, Clark Glymour
Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices
6 0.23453285 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
7 0.19593817 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
8 0.1956466 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
9 0.18698014 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
10 0.14357351 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
11 0.1423759 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
12 0.14103746 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
13 0.13653614 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
14 0.1331622 86 jmlr-2008-SimpleMKL
15 0.12592642 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
16 0.12349478 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
17 0.12059693 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.11647277 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
19 0.11400247 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
topicId topicWeight
[(0, 0.018), (5, 0.02), (31, 0.014), (40, 0.049), (51, 0.011), (54, 0.093), (58, 0.039), (60, 0.018), (66, 0.045), (76, 0.03), (88, 0.057), (92, 0.034), (94, 0.071), (96, 0.364), (99, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.72910291 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
2 0.37507364 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
3 0.36919922 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
4 0.36674151 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
Author: David Mease, Abraham Wyner
Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost
5 0.36220944 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
6 0.35318738 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
7 0.3520197 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
8 0.35130724 86 jmlr-2008-SimpleMKL
9 0.3461355 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
10 0.34605199 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.34559417 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.34532398 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.34526774 83 jmlr-2008-Robust Submodular Observation Selection
14 0.34244016 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.34095854 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
16 0.33259127 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.33057946 57 jmlr-2008-Manifold Learning: The Price of Normalization
18 0.3297492 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
19 0.32855883 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
20 0.32765642 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management