jmlr jmlr2011 jmlr2011-29 knowledge-graph by maker-knowledge-mining

29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Source: pdf

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efﬁcient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufﬁcient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 COM Microsoft Research One Memorial Drive Cambridge, MA 02142, USA Editor: Russ Greiner Abstract We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. [sent-7, score-0.828]

2 In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efﬁcient algorithm for learning linear predictors that actively samples the attributes of each training instance. [sent-8, score-1.116]

3 This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. [sent-10, score-0.565]

4 Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory 1. [sent-13, score-0.18]

5 Second, each test has some associated cost, and we typically have a budget on the amount of money to spend for collecting the training information. [sent-18, score-0.463]

6 This scenario, where there is a hard constraint on the number of training attributes the ∗. [sent-19, score-0.539]

7 o C ESA -B IANCHI , S HALEV-S HWARTZ AND S HAMIR learner has access to, is known as budgeted learning. [sent-22, score-0.294]

8 1 Note that the constraint on the number of training attributes may be local (no single participant is willing to undergo many tests) or global (the overall number of tests that can be performed is limited). [sent-23, score-0.62]

9 In a different but related budgeted learning setting, the system may be facing a restriction on the number of attributes that can be viewed at test time. [sent-24, score-0.619]

10 This may happen, for example, in a search engine, where a ranking of web pages must be generated for each incoming user query and there is no time to evaluate a large number of attributes to answer the query. [sent-25, score-0.44]

11 We may thus distinguish three basic budgeted learning settings: • Local Budget Constraint: The learner has access to at most k attributes of each individual example, where k is a parameter of the problem. [sent-26, score-0.734]

12 The learner has the freedom to actively choose which of the attributes is revealed, as long as at most k of them will be given. [sent-27, score-0.577]

13 • Global Budget Constraint: The total number of training attributes the learner is allowed to see is bounded by k. [sent-28, score-0.597]

14 As in the local budget constraint setting, the learner has the freedom to actively choose which of the attributes is revealed. [sent-29, score-1.013]

15 In contrast to the local budget constraint setting, the learner can choose to access more than k/m attributes from speciﬁc examples (where m is the overall number of examples) as long as the global number of attributes is bounded by k. [sent-30, score-1.483]

16 • Prediction on a budget: The learner receives the entire training set, however, at test time, the predictor can see at most k attributes of each instance and then must form a prediction. [sent-31, score-0.806]

17 The predictor is allowed to actively choose which of the attributes is revealed. [sent-32, score-0.667]

18 In this paper we focus on budgeted linear regression, and prove negative and positive learning results in the three abovementioned settings. [sent-33, score-0.178]

19 Our ﬁrst result shows that, under a global budget constraint, no algorithm can learn a general d-dimensional linear predictor while observing less than Ω(d) attributes at training time. [sent-34, score-1.103]

20 This is complemented by the following positive result: we show an efﬁcient algorithm for learning under a given local budget constraint of 2k attributes per example, for any k ≥ 1. [sent-35, score-0.911]

21 The algorithm actively picks which attributes to observe in each example in a randomized way depending on past observed attributes, and constructs a “noisy” version of all attributes. [sent-36, score-0.486]

22 We show that the overall number of attributes our algorithm needs to learn a regressor is at most a factor of d bigger than that used by standard regression algorithms that view all the attributes of each example. [sent-38, score-0.969]

23 Ignoring logarithmic factors, the same gap of d exists when the attribute bound of our algorithm is specialized to the choice of parameters that is used to prove the abovementioned Ω(d) lower bound under the global budget constraint. [sent-39, score-0.555]

24 In the prediction on a budget setting, we prove that in general it is not possible (even with an inﬁnite amount of training examples) to build an active classiﬁer that uses at most two attributes of each example at test time, and whose error will be smaller than a constant. [sent-40, score-0.951]

25 This in contrast with the local budget setting, where it is possible to learn a consistent predictor by accessing at most two attributes of each example at training time. [sent-41, score-1.16]

26 2 Our algorithm for the local budget setting actively chooses which attributes to observe for each example. [sent-54, score-0.889]

27 Unlike compressed learning, where learners are both trained and evaluated in the compressed domain, our techniques are mainly designed for a scenario in which only the access to training data is restricted. [sent-69, score-0.19]

28 We note that a recent follow-up work (Hazan and Koren, 2011) present 1-norm and 2-norm based algorithms for our local budget setting, whose theoretical guarantees improve on those presented in this paper, and match our lower bound to within logarithmic factors. [sent-70, score-0.465]

29 The goal of the learner is to ﬁnd a linear predictor x → w, x . [sent-75, score-0.272]

30 In the rest of the paper, we use the term predictor to denote the vector w ∈ Rd . [sent-76, score-0.181]

31 The performance of a predictor w on an instance-target pair, (x, y) ∈ Rd × R, is measured by a loss function ℓ( w, x , y). [sent-77, score-0.211]

32 The goal of the learner is to ﬁnd a predictor with low risk, deﬁned as the expected loss def LD (w) = E (x,y)∼D ℓ( w, x , y) . [sent-81, score-0.302]

33 Impossibility Results Our ﬁrst result states that any budget learning algorithm (local or global) needs in general a budget of Ω(d) attributes for learning a d-dimensional linear predictor. [sent-93, score-1.178]

34 In Section 6 we prove that under the same assumptions as those of Theorem 1, it is possible to learn a predictor using a local budget of two attributes per example and using a total of O (d 2 ) training examples. [sent-96, score-1.113]

35 Next, we consider the prediction on a budget setting. [sent-98, score-0.369]

36 A k-active predictor is restricted to use at most k attributes per test example x, where the choice of the i-th attribute of x may depend on the values of the i − 1 attributes of x that have been already observed. [sent-101, score-1.138]

37 (2002) show that it is possible to learn a k-active predictor from training examples whose performance is slightly worse than that of the best k-active predictor. [sent-103, score-0.27]

38 We now show that even in simple cases in which there exists a linear predictor w⋆ with LD (w⋆ ) = 0, the risk of the best k-active predictor can be high. [sent-105, score-0.408]

39 We use the notation LD (A) to denote the expected loss of the k-active predictor A on a test example. [sent-107, score-0.239]

40 Theorem 2 There exists a weight vector w⋆ ∈ Rd and a distribution D such that w⋆ k LD (w⋆ ) = 0, while any k-active predictor A must have LD (A) ≥ 1 − d . [sent-108, score-0.181]

41 Therefore, the theorem tells us that no k active predictor can get an improvement over the naive predictor of more than d . [sent-110, score-0.41]

42 Without loss of generality, suppose the k-active predictor asks for the ﬁrst k attributes of a test example and forms its prediction to be y. [sent-117, score-0.679]

43 Since the generation of attributes is independent, we have that the value of ˆ xk+1 , . [sent-118, score-0.44]

44 The following theorem shows that even if we restrict w⋆ to have w⋆ 1 = 1, LD (w⋆ ) = 0, and w⋆ 0 > k, we still have that the risk of the best k-active predictor can be non-vanishing. [sent-127, score-0.227]

45 Theorem 3 There exists a weight vector w⋆ ∈ Rd and a distribution D such that w⋆ 1 = 1, LD (w⋆ ) = 0, and w⋆ 0 = ck (for c > 1) such that any k-active predictor A must have LD (A) ≥ 1 1 − 1 ck . [sent-128, score-0.335]

46 Note that if w 0 ≤ k there is a trivial way to predict on a budget of k attributes by always querying the attributes corresponding to the non-zero elements of w⋆ . [sent-131, score-1.249]

47 Without loss of generality, suppose the k-active predictor asks for the ﬁrst k < ck attributes of a test example and form its prediction to be y. [sent-140, score-0.756]

48 Again similarly to the proof of Theorem2, since the ˆ generation of attributes is independent, we have that the value of xk+1 , . [sent-141, score-0.463]

49 Therefore, ˆ  k 2 E (y − w⋆ , x )2 = E  y − ∑ w⋆ xi ˆ ˆ i  + ∑ (w⋆ )2 E[xi2 ] i i=1 ck − k (ck)2 c−1 1 = 2 = 1− c k c  ≥ 0+ i>k 1 ck which concludes our proof. [sent-148, score-0.175]

50 These negative results highlight an interesting phenomenon: in Section 6 we show that one can learn an arbitrarily accurate predictor w with a local budget of k = 2. [sent-149, score-0.607]

51 Therefore, at least in the worst-case sense, learning on a budget is much easier than predicting on a budget. [sent-151, score-0.369]

52 Local Budget Constraint: A Baseline Algorithm In this section we describe a straightforward adaptation of Lasso (Tibshirani, 1996) to the local budget setting. [sent-153, score-0.403]

53 , 2007) with the adaptive sampling of attributes to estimate the gradient of the loss at each step. [sent-156, score-0.502]

54 A popular approach for learning a linear regressor is to minimize the empirical loss on the training set plus a regularization term, which often takes the form of a norm of the predictor w. [sent-157, score-0.313]

55 In the constraint form, the predictor is a solution to the following optimization problem 1 2 min ∑ w, x − y |S| (x,y)∈S w∈Rd (1) s. [sent-161, score-0.214]

56 , 2008) LS (w) − LD (w) ≤ Lmax B 2 ln 2d + ℓmax m 1 2 ln 2m δ that holds with probability at least 1 − δ for all w ∈ Rd such that w 1 ≤ B, where Lmax bounds the Lipschitz constant for the square loss from above, and ℓmax bounds the square loss from above. [sent-172, score-0.258]

57 This sampling process can be easily generalized to the case where k > 1 attributes can be seen. [sent-190, score-0.44]

58 The following theorem shows that similar to Lasso, the Baseline algorithm is competitive with the optimal linear predictor with a bounded 1-norm. [sent-192, score-0.181]

59 2865 C ESA -B IANCHI , S HALEV-S HWARTZ AND S HAMIR Lemma 9 Let vt be the vector constructed at iteration t of the Baseline algorithm and note that 1 m 1 m v = m ∑t=1 2 yt vt . [sent-225, score-0.359]

60 Proof Based on Lemma 7, it is easy to verify that E[2 yt vt ] = 2 yt xt . [sent-228, score-0.541]

61 Additionally, since we sample k elements without replacement, each element of vt is in − d , d (because we assume xt ∞ ≤ 1) k k and thus each element of 2yt vt is in − 2dB , 2dB (because we assume that |yt | ≤ B). [sent-229, score-0.253]

62 Gradient-Based Attribute Efﬁcient Regression In this section, by avoiding the estimation of the matrix xx⊤ , we signiﬁcantly decrease the number of additional examples sufﬁcient for learning with k attributes per training example. [sent-246, score-0.506]

63 We can estimate the gradient using an even number k ≥ 2 of attributes as follows. [sent-249, score-0.472]

64 In many situations, the 1-norm of a good predictor is signiﬁcantly smaller than d and in these cases the gradient based estimate is better than the loss based estimate. [sent-275, score-0.243]

65 Finally, Pegasos updates the predictor according to 1 the gradient descent rule w ← w − λt ∇g(w) where t is the current iteration number. [sent-287, score-0.213]

66 Then, there exists a constant c > 0 such that LD (w) ≤ ¯ min w : w 1 ≤B LD (w) + c dB2 2869 m 1 ln km δ C ESA -B IANCHI , S HALEV-S HWARTZ AND S HAMIR holds with probability at least 1 − δ over both the choice of the training set and the algorithm’s own randomization. [sent-306, score-0.201]

67 Proof Let yt , yt , vt , wt be the values of y, y, v, w, respectively, at each iteration t of the AER algorithm. [sent-307, score-0.581]

68 ˆ ˆ Moreover, let ∇t = 2( wt , xt − yt )xt and ∇t = 2(yt − yt )vt . [sent-308, score-0.576]

69 Recalling 2 2 1 that the right-hand side in the AER update (10) is equal to wt − λt ∇ ft (wt ), we can apply the following logarithmic regret bound for λ-strongly convex functions (Hazan et al. [sent-312, score-0.24]

70 , 2006; Kakade and Shalev-Shwartz, 2008) m m 1 ∑ ft (wt ) − ∑ ft (w⋆ ) ≤ λ t=1 max ∇ ft (wt ) t=1 t 2 ln m which remains valid also in the presence of the projection steps (11). [sent-313, score-0.27]

71 Similarly to the analysis of Pegasos, and using our assumptions on xt ∞ and |yt |, the norm of the gradient ∇ ft (wt ) is bounded as follows 2 ˆ ∇ ft (wt ) ≤ λ wt + 2 yt − yt vt ≤ λ wt + 4Bd . [sent-314, score-0.929]

72 , using an iductive argument) that 1 wt ≤ 4Bd λ 2 , k which yields ∇ ft (wt ) ≤ 8Bd 2870 2 . [sent-317, score-0.178]

73 k E FFICIENT L EARNING WITH PARTIALLY O BSERVED ATTRIBUTES This gives the bound m ˆ ∑ 2(yt − yt ) vt , wt − w⋆ ≤ t=1 Choosing λ = 16d λ 128(dB)2 ln m + m w⋆ λk 2 log(m)/(km) and noting that m ˆ ∑ 2(yt − yt ) vt , wt − w⋆ t=1 · 2 ≤ · ≤ 16dB2 1 2 2 . [sent-318, score-0.911]

74 k The resulting bound is then m E ∑ t=1 wt , xt − yt 2 m ≤ ∑ w⋆ , xt − yt 2 + 16dB2 t=1 m ln m . [sent-320, score-0.78]

75 Since w, xt − yt ≤ 4B2 for all w such that w 1 ≤ B (recall our assumptions on xt and yt ), and using the convexity of the square loss, we obtain that 1 2 1 ln m + 4B2 ln LD (w) ≤ inf LD (w) + 16dB2 ¯ km m δ w : w ≤B holds with probability at least 1 − δ with respect to all random events. [sent-323, score-0.77]

76 4 Since each AER example uses only k attributes while each Lasso example uses all d attributes, the ratio between the total number of attributes AER needs and the number of attributes Lasso needs to achieve the same error is O (d). [sent-327, score-1.32]

77 But, this implies that both AER and Lasso needs the same number of attributes in order to achieve the same level of error! [sent-334, score-0.44]

78 When w⋆ is sparse we have w⋆ 1 ≈ w⋆ 2 and then AER needs more attributes than Lasso. [sent-336, score-0.44]

79 Both ridge regression and Lasso were run in the full information setting: Namely, they enjoyed full access to all attributes of all examples in the training set. [sent-356, score-0.723]

80 The Baseline algorithm and AER, however, were given access to only four attributes from each training example. [sent-357, score-0.558]

81 2872 E FFICIENT L EARNING WITH PARTIALLY O BSERVED ATTRIBUTES The bottom plot of Figure 4 is similar, only that now the X-axis represents the accumulative number of attributes seen by each algorithm rather than the number of examples. [sent-365, score-0.44]

82 For the partialinformation algorithm, the graph ends at approximately 49,000 attributes, which is the total number of attributes accessed by the algorithm after running over all training examples, seeing k = 4 pixels from each example. [sent-366, score-0.57]

83 However, for the full-information algorithms 49,000 attributes are already seen after just 62 examples. [sent-367, score-0.44]

84 In the local budget setting, we have introduced a simple and efﬁcient algorithm, AER, that learns by accessing a pre-speciﬁed number of attributes from each training example. [sent-400, score-0.956]

85 This result is complemented by a general lower bound for the global budget setting which is a factor d smaller than the upper bound achieved by our algorithm. [sent-402, score-0.476]

86 We note that this gap has been recently closed by Hazan and Koren (2011), which in our local budget setting, show 1-norm and 2-norm-based algorithms for learning linear predictors using only O (d) attributes, thus matching our lower bound to within logarithmic factors. [sent-403, score-0.49]

87 In contrast to the local/global budget settings, where we can learn efﬁciently by accessing few attributes of each training example, we showed that accessing a limited number of attributes at test time is a signiﬁcantly harder setting. [sent-406, score-1.46]

88 Indeed, we proved that is not possible to build an active linear predictor that uses two attributes of each test example and whose error is smaller than a certain constant, even when there exists a linear predictor achieving zero error on the same data source. [sent-407, score-0.878]

89 We then show that if some algorithm learns a linear predictor with an extra risk of at most ε, then it must know the value of the good feature. [sent-421, score-0.227]

90 Next, we construct a variant of a multi-armed bandit problem out of our distribution and show that a good learner can yield a good prediction strategy. [sent-422, score-0.201]

91 (2003), to conclude that the number k of attributes viewed by a good learner must satisfy k = Ω d . [sent-424, score-0.531]

92 Given j and yt , xt ∈ {±1} is 1 generated according to P xt,i = yt = 2 + 1{i = j}p where p > 0 is chosen later. [sent-435, score-0.455]

93 It is easy to see that the optimal linear predictor under the distribution P j is w⋆ = 2p e j , and the risk of w⋆ is LP j (w⋆ ) = E j ( w⋆ , x − y)2 = 1 2 + p (1 − 2p)2 + 1 2 − p (1 + 2p)2 = 1 + 4p2 − 8p2 = 1 − 4p2 . [sent-444, score-0.227]

94 , d} is an arm and the reward of pulling i at time t is 1{xNi,t ,i = yNi,t } ∈ {0, 1}, where Ni,t denotes the random number of times arm i has been pulled in the ﬁrst t plays. [sent-457, score-0.273]

95 , d, can learn a linear predictor with LP j (w) − LP j (w⋆ ) < ε using k attributes. [sent-464, score-0.204]

96 5 An Upper Bound On the Reward Of Any Bandit Strategy 1 Recall that under distribution P j the expected reward for pulling arm I is 2 + p1{I = j}. [sent-469, score-0.176]

97 Hence, the total expected reward of a player that runs for T rounds is upper bounded by 1 T + pE j [N j ], 2 where N j = N j,T is the overall number of pulls of arm j. [sent-470, score-0.175]

98 However, since P j xi,s = ys = P j xi,s = ys | ys for all i (including i = j), the knowledge of ys does not provide any information about the distribution of rewards for arm i. [sent-474, score-0.328]

99 In particular, for any bandit strategy there exists some arm j such that the expected reward of the strategy under distribution P j is at most T +p 2 T +T d − T ln(1 − 4p2 ) d ≤ T +p 2 T +T d 6T 2 p d (14) where we used the inequality − ln(1 − q) ≤ 3 q for q ∈ [0, 1/4]. [sent-479, score-0.251]

100 6 Concluding The Proof Take a learning algorithm that ﬁnds an ε-good predictor using k attributes. [sent-482, score-0.181]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('attributes', 0.44), ('aer', 0.371), ('budget', 0.369), ('ld', 0.245), ('yt', 0.187), ('predictor', 0.181), ('ls', 0.154), ('budgeted', 0.151), ('esa', 0.151), ('hamir', 0.151), ('ianchi', 0.151), ('bserved', 0.137), ('hwartz', 0.129), ('wt', 0.121), ('lasso', 0.113), ('bandit', 0.11), ('pegasos', 0.108), ('ln', 0.099), ('learner', 0.091), ('fficient', 0.09), ('vt', 0.086), ('xt', 0.081), ('ck', 0.077), ('arm', 0.076), ('ridge', 0.073), ('baseline', 0.07), ('rd', 0.07), ('training', 0.066), ('reward', 0.065), ('ys', 0.063), ('greiner', 0.058), ('ft', 0.057), ('db', 0.054), ('access', 0.052), ('attribute', 0.049), ('hazan', 0.048), ('active', 0.048), ('lp', 0.047), ('accessing', 0.047), ('risk', 0.046), ('actively', 0.046), ('dichterman', 0.041), ('kapoor', 0.041), ('lmax', 0.041), ('partially', 0.039), ('logarithmic', 0.038), ('xx', 0.038), ('pixels', 0.037), ('compressed', 0.036), ('km', 0.036), ('regressor', 0.036), ('complemented', 0.035), ('deterministically', 0.035), ('pulling', 0.035), ('earning', 0.034), ('player', 0.034), ('local', 0.034), ('constraint', 0.033), ('kakade', 0.032), ('gradient', 0.032), ('full', 0.031), ('regression', 0.03), ('auer', 0.03), ('loss', 0.03), ('randomization', 0.029), ('partial', 0.029), ('deng', 0.029), ('compensate', 0.029), ('js', 0.029), ('test', 0.028), ('abovementioned', 0.027), ('calderbank', 0.027), ('milano', 0.027), ('ohad', 0.027), ('beygelzimer', 0.027), ('accessed', 0.027), ('sgn', 0.025), ('agnostic', 0.025), ('aw', 0.025), ('digits', 0.025), ('predictors', 0.025), ('global', 0.024), ('bound', 0.024), ('nput', 0.023), ('guha', 0.023), ('madani', 0.023), ('willing', 0.023), ('learn', 0.023), ('mnist', 0.023), ('enjoy', 0.023), ('proof', 0.023), ('icml', 0.022), ('lemma', 0.022), ('bd', 0.022), ('concludes', 0.021), ('xr', 0.021), ('koren', 0.021), ('utput', 0.021), ('cun', 0.021), ('pulled', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

2 0.13948146 14 jmlr-2011-Better Algorithms for Benign Bandits

Author: Elad Hazan, Satyen Kale

Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best ﬁxed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some√ Euclidean space, and the cost functions are linear. ˜ Only recently an efﬁcient algorithm attaining O( T ) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which √ ˜ obtains a tighter regret bound of O( Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efﬁcient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning

3 0.13925369 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The ﬁrst method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

4 0.10510471 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the ﬁeld of statistics, is a simpliﬁed model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

5 0.090686917 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a ﬁnite number of global maxima around which the behavior of the function is locally continuous with a known √ smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modiﬁed strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with inﬁnitely many arms, optimistic online optimization, regret bounds, minimax rates

6 0.085783087 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

7 0.083153389 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

8 0.076378822 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

9 0.074410856 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

10 0.072119772 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

11 0.056046374 28 jmlr-2011-Double Updating Online Learning

12 0.054466929 36 jmlr-2011-Generalized TD Learning

13 0.049779028 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

14 0.048664059 59 jmlr-2011-Learning with Structured Sparsity

15 0.04532811 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

16 0.042569477 91 jmlr-2011-The Sample Complexity of Dictionary Learning

17 0.041957028 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

18 0.03965674 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

19 0.039605401 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

20 0.039506074 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.227), (1, 0.289), (2, -0.006), (3, 0.03), (4, 0.038), (5, -0.014), (6, 0.064), (7, -0.002), (8, -0.07), (9, 0.007), (10, 0.045), (11, -0.159), (12, 0.029), (13, -0.012), (14, -0.001), (15, -0.055), (16, -0.127), (17, 0.112), (18, -0.021), (19, -0.02), (20, -0.14), (21, -0.081), (22, -0.012), (23, 0.042), (24, -0.059), (25, 0.124), (26, -0.004), (27, -0.079), (28, 0.005), (29, -0.083), (30, -0.1), (31, 0.095), (32, -0.04), (33, -0.013), (34, 0.045), (35, 0.049), (36, 0.003), (37, 0.038), (38, -0.067), (39, 0.066), (40, 0.027), (41, -0.062), (42, 0.11), (43, -0.111), (44, -0.022), (45, 0.072), (46, -0.019), (47, -0.071), (48, -0.078), (49, 0.152)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94250357 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

2 0.54533088 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

3 0.48218346 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

4 0.47860244 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

5 0.43495959 14 jmlr-2011-Better Algorithms for Benign Bandits

Author: Elad Hazan, Satyen Kale

6 0.41940609 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

7 0.39028609 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

8 0.37101346 59 jmlr-2011-Learning with Structured Sparsity

9 0.3472304 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

10 0.34252042 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

11 0.30323055 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

12 0.29514569 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.2844086 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

14 0.27338099 91 jmlr-2011-The Sample Complexity of Dictionary Learning

15 0.26984686 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

16 0.26565063 22 jmlr-2011-Differentially Private Empirical Risk Minimization

17 0.26284239 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

18 0.25941005 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.25683665 28 jmlr-2011-Double Updating Online Learning

20 0.2453008 36 jmlr-2011-Generalized TD Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.064), (9, 0.029), (10, 0.032), (15, 0.344), (24, 0.051), (31, 0.075), (32, 0.026), (41, 0.029), (60, 0.01), (65, 0.029), (67, 0.031), (71, 0.014), (73, 0.026), (78, 0.119), (90, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71131831 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

2 0.42988271 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classiﬁcation, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a ﬁxed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefﬁcient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefﬁcient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefﬁcient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

3 0.42878842 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez

Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA

4 0.42597482 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

5 0.42117706 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is deﬁned as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classiﬁcation, level sets, connected components

6 0.41814089 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

7 0.41704294 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

8 0.41507453 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

9 0.41333619 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

10 0.41325718 22 jmlr-2011-Differentially Private Empirical Risk Minimization

11 0.41246995 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

12 0.41157275 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

13 0.41114768 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

14 0.4107497 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

15 0.41043004 16 jmlr-2011-Clustering Algorithms for Chains

16 0.41022417 59 jmlr-2011-Learning with Structured Sparsity

17 0.40978938 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

18 0.40812075 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

19 0.40779507 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

20 0.40759003 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing