nips nips2012 nips2012-174 knowledge-graph by maker-knowledge-mining

174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs


Source: pdf

Author: Aharon Birnbaum, Shai Shalev-Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. [sent-2, score-0.255]

2 We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. [sent-6, score-0.181]

3 In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. [sent-7, score-0.181]

4 Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. [sent-8, score-0.388]

5 1 Introduction Some of the most influential machine learning tools are based on the hypothesis class of halfspaces with margin. [sent-9, score-0.187]

6 In this paper we study the computational complexity of learning halfspaces with margin. [sent-11, score-0.214]

7 Relying on the kernel trick, our sole assumption on X is that we are able to calculate efficiently the inner-product between any two instances (see for example Schölkopf and Smola [2002], Cristianini and Shawe-Taylor [2004]). [sent-15, score-0.144]

8 The error rate of a predictor h : X → {±1} is defined as L01(h) = P[h(x) ≠ y], where the probability is over some unknown distribution over X × {±1}. [sent-17, score-0.126]

9 The γ-margin error rate of a predictor x → ⟨w, x⟩ is defined as Lγ (w) = P[y⟨w, x⟩ ≤ γ]. [sent-18, score-0.126]

10 We study the runtime required to learn a predictor such that with high probability over the choice of S, the error rate of the learnt predictor satisfies L01(A(S)) ≤ (1 + α) L∗γ + ϵ, where L∗γ = min_{w:∥w∥=1} Lγ(w). [sent-26, score-0.494]

11 Ben-David and Simon [2000] showed that no proper learning algorithm can satisfy Equation (1) with α = 0 while running in time polynomial in both 1/γ and 1/ϵ. [sent-35, score-0.219]

12 By “proper” we mean an algorithm which returns a halfspace predictor. [sent-36, score-0.13]

13 Most algorithms that are being used in practice minimize a convex surrogate loss. [sent-40, score-0.185]

14 That is, instead of minimizing the number of mistakes on the training set, the algorithms minimize L̂(w) = (1/m) Σ_{i=1}^m ℓ(yi⟨w, xi⟩), where ℓ : R → R is a convex function that upper bounds the 0−1 loss. [sent-41, score-0.263]

15 The advantage of surrogate convex losses is that minimizing them can be performed in time poly(1/γ, 1/ϵ). [sent-43, score-0.218]

16 It is easy to verify that minimizing L̂(w) with respect to the hinge loss yields a predictor that satisfies Equation (1) with α = 1/γ. [sent-44, score-0.305]
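
To make this surrogate-minimization step concrete, here is a minimal sketch that minimizes the regularized hinge loss by subgradient descent. It is a generic stand-in for the solvers discussed here rather than the paper's algorithm; the toy data, step size, and regularization strength are assumptions made only for illustration.

```python
import numpy as np

def hinge_erm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize (1/m) sum_i max(0, 1 - y_i <w, x_i>) + (lam/2) ||w||^2 by subgradient descent."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                       # examples whose hinge loss is nonzero
        grad = -(y[active, None] * X[active]).sum(axis=0) / m + lam * w
        w -= lr * grad
    return w

# toy data: two Gaussian blobs (an illustrative assumption, not data from the paper)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
w = hinge_erm(X, y)
print("zero-one training error:", np.mean(np.sign(X @ w) != y))
```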

17 [2012] have shown that any convex surrogate loss cannot guarantee Equation (1) if α < (1/2)(1/γ − 1). [sent-46, score-0.327]

18 Despite the centrality of this problem, not much is known about the runtime required to guarantee Equation (1) with other values of α. [sent-47, score-0.26]

19 In particular, a natural question is how the runtime changes when enlarging α from 0 to 1/γ. [sent-48, score-0.186]

20 Our main contribution is an upper bound on the required runtime as a function of α. [sent-50, score-0.311]

21 We show that the runtime required to guarantee Equation (1) is at most exp (C τ min{τ, log(1/γ)}), where C is a universal constant (we ignore additional factors which are polynomial in 1/γ, 1/ϵ—see a precise statement with the exact constants in Theorem 1). [sent-52, score-0.441]

22 That is, when we enlarge α, the runtime decreases gradually from being exponential to being polynomial. [sent-53, score-0.186]

23 We also show how one can design specific kernels that fit certain values of α well while minimizing our upper bound on the sample and time complexity. [sent-55, score-0.17]

24 In Section 4 we extend our results to the more challenging learning settings of adversarial online learning and PAC learning with malicious noise. [sent-56, score-0.388]

25 For both cases, we obtain similar upper bounds on the runtime as a function of α. [sent-57, score-0.274]

26 The technique we use in the malicious noise case may be of independent interest. [sent-58, score-0.279]

27 In this case, τ = log(1/γ) and hence the runtime is still polynomial in 1/γ. [sent-60, score-0.367]

28 Their technique is based on a smooth boosting algorithm applied on top of a weak learner which constructs random halfspaces and takes their majority vote. [sent-62, score-0.28]

29 They show that no convex surrogate can obtain α = o(1/γ). [sent-64, score-0.185]

30 As mentioned before, our technique is rather different as we do rely on the hinge loss as a surrogate convex loss. [sent-65, score-0.331]

31 There is no contradiction to Long and Servedio since we apply the convex loss in the feature space induced by our kernel function. [sent-66, score-0.266]

32 The negative result of Long and Servedio holds only if the convex surrogate is applied on the original space. [sent-67, score-0.185]

33 We did not analyze the case α < 5 because the runtime is already exponential in 1/γ even when α = 5. [sent-68, score-0.186]

34 Note, however, that our bound for α = 5 is slightly better than the bound of Shalev-Shwartz et al. [sent-69, score-0.137]

35 Additional related work: The problem of learning kernel-based halfspaces has been extensively studied before in the framework of SVM [Vapnik, 1998, Cristianini and Shawe-Taylor, 2004, Schölkopf and Smola, 2002] and the Perceptron [Freund and Schapire, 1999]. [sent-72, score-0.221]

36 [2009] derived similar results for the case of malicious noise. [sent-81, score-0.308]

37 For example, Kalai and Sastry [2009] solve the problem in polynomial time if there exists a vector w and a monotonically non-increasing function ϕ such that P(Y = 1|X = x) = ϕ(⟨w, x⟩). [sent-83, score-0.246]

38 [2006] also studied the relationship between surrogate convex loss functions and the 0-1 loss function. [sent-85, score-0.379]

39 They introduce the notion of well calibrated loss functions, meaning that the excess risk of a predictor h (over the Bayes optimal) with respect to the 0-1 loss can be bounded using the excess risk of the predictor with respect to the surrogate loss. [sent-86, score-0.747]

40 [2012] show in detail that, without making additional distributional assumptions, the fact that a loss function is well calibrated does not yield finite-sample or finite-time bounds. [sent-89, score-0.195]

41 only studied the case α = 0; we are interested in understanding the whole curve of runtime as a function of α. [sent-93, score-0.186]

42 The upper bounds we derive hold for the above kernel-based SVM with the kernel function K(x, x′) = 1/(1 − (1/2)⟨x, x′⟩). [sent-111, score-0.198]

43 Let A be the algorithm which solves Equation (3) with the kernel function given in Equation (5), and returns the predictor defined in Equation (4). [sent-116, score-0.315]
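
A hedged sketch of this algorithm, with scikit-learn's SVC standing in for the exact regularized solver of Equation (3) and the kernel K(x, x′) = 1/(1 − ⟨x, x′⟩/2) of Equation (5) supplied as a callable; the synthetic data and the value of C are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def inverse_kernel(X, Z):
    # K(x, x') = 1 / (1 - <x, x'>/2); well defined when the instances lie in the unit ball
    return 1.0 / (1.0 - 0.5 * (X @ Z.T))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # project onto the unit ball
y = np.where(X[:, 0] >= 0.0, 1, -1)                              # labels from an illustrative halfspace

clf = SVC(kernel=inverse_kernel, C=1.0).fit(X, y)
print("zero-one training error:", np.mean(clf.predict(X) != y))
```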

44 As a direct corollary we obtain that there is an efficient algorithm that achieves an approximation factor of α = o(1/γ): Corollary 2 For any ϵ, δ, γ ∈ (0, 1), let α = 1/(γ√(log(1/γ))) and let B = 0. [sent-119, score-0.166]

45 As another corollary of Theorem 1 we obtain that for any constant c ∈ (0, 1), it is possible to satisfy Equation (1) with α = c/γ in polynomial time. [sent-122, score-0.25]

46 However, the dependence of the runtime on the constant c is e^{4/c²}. [sent-123, score-0.186]

47 Theorem 3 For any γ, α, let p be a polynomial of the form p(z) = Σ_{j=1}^d βj z^{2j−1} (namely, p is odd) that satisfies max_{z∈[−1,1]} |p(z)| ≤ α and min_{z:|z|≥γ} |p(z)| ≥ 1. [sent-126, score-0.242]

48 Let m be a training set size that satisfies m ≥ 16 max{∥β∥₁², 2 log(4/δ), (1 + α)² log(2/δ)} / ϵ². Let A be the algorithm which solves Equation (3) with the kernel function K(x, x′) = Σ_{j=1}^d |βj| (⟨x, x′⟩)^{2j−1}, and returns the predictor defined in Equation (4). [sent-127, score-0.346]
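
A minimal sketch of the kernel used in Theorem 3 as a function of a coefficient vector β; the particular β below is a placeholder, not a value taken from the paper.

```python
import numpy as np

def odd_poly_kernel(beta):
    """K(x, x') = sum_{j=1}^d |beta_j| * <x, x'>^(2j-1) for a coefficient vector beta."""
    beta = np.asarray(beta, dtype=float)
    def K(X, Z):
        G = X @ Z.T
        return sum(abs(b) * G ** (2 * j - 1) for j, b in enumerate(beta, start=1))
    return K

K = odd_poly_kernel([2.0, -1.5])            # placeholder coefficients
x = np.array([[0.6, 0.2]]); z = np.array([[0.1, 0.5]])
print(K(x, z))                              # = 2.0*<x,z> + 1.5*<x,z>**3
```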

49 The above theorem provides us with a recipe for constructing good kernel functions: Given γ and α, find a vector β with minimal ℓ1 norm such that the polynomial p(z) = Σ_{j=1}^d βj z^{2j−1} satisfies the conditions given in Theorem 3. [sent-129, score-0.423]

50 Indeed, for any β, we can find the extreme points of the polynomial it defines, and then determine whether β satisfies all the constraints or, if it doesn't, we can find a violated constraint. [sent-136, score-0.181]
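
One way to carry out this recipe, sketched below, is to enforce the two constraints on a fine grid of z values (a discretization assumption made here, not in the paper; since p is odd it suffices to constrain z ∈ [0, 1]) and minimize ∥β∥1 with an off-the-shelf LP solver such as scipy's linprog. The degree, grid size, and the example values of γ and α are all illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def design_beta(gamma, alpha, degree=6, grid=400):
    """Find beta minimizing ||beta||_1 s.t. |p(z)| <= alpha on [0, 1] and p(z) >= 1 on [gamma, 1],
    where p(z) = sum_{j=1}^d beta_j z^(2j-1); constraints are imposed on a grid."""
    d = degree
    powers = 2 * np.arange(1, d + 1) - 1                    # exponents 1, 3, 5, ...
    A_all = np.linspace(0.0, 1.0, grid)[:, None] ** powers  # rows evaluate p(z) on [0, 1]
    A_far = np.linspace(gamma, 1.0, grid)[:, None] ** powers
    Z = np.zeros((grid, d))
    I = np.eye(d)
    # variables: [beta (d), t (d)] with the auxiliary t_j >= |beta_j|
    A_ub = np.block([[A_all, Z],        #  p(z) <= alpha
                     [-A_all, Z],       # -p(z) <= alpha
                     [-A_far, Z],       #  p(z) >= 1
                     [I, -I],           #  beta_j - t_j <= 0
                     [-I, -I]])         # -beta_j - t_j <= 0
    b_ub = np.concatenate([np.full(grid, alpha), np.full(grid, alpha),
                           np.full(grid, -1.0), np.zeros(2 * d)])
    c = np.concatenate([np.zeros(d), np.ones(d)])           # minimize sum_j t_j = ||beta||_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
    return res.x[:d] if res.success else None

beta = design_beta(gamma=0.1, alpha=4.5)
print("beta:", beta, "  ||beta||_1:", None if beta is None else float(np.abs(beta).sum()))
```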

51 Lemma 4 Given γ < 2/3, consider the polynomial p(z) = β1 z + β2 z³, where β1 = 1/γ + γ/(1+γ) and β2 = −1/(γ(1+γ)). [sent-139, score-0.181]
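
A quick numerical check of this construction (a sketch: the grid only approximates the true extrema, and γ = 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

gamma = 0.1                                   # any gamma < 2/3
beta1 = 1.0 / gamma + gamma / (1.0 + gamma)
beta2 = -1.0 / (gamma * (1.0 + gamma))
p = lambda z: beta1 * z + beta2 * z ** 3

z = np.linspace(-1.0, 1.0, 200001)
alpha_hat = np.max(np.abs(p(z)))                        # approximates max over [-1,1] of |p(z)|
min_far = np.min(p(np.linspace(gamma, 1.0, 200001)))    # approximates min over [gamma,1] of p(z)
print(f"max |p| on [-1,1] ~ {alpha_hat:.3f} (i.e. ~ {alpha_hat * gamma:.3f}/gamma); "
      f"min p on [gamma,1] ~ {min_far:.3f}")
```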

52 It is interesting to compare the guarantee given in the above lemma to the guarantee of using the vanilla hinge-loss. [sent-144, score-0.187]

53 For the vanilla hinge loss we obtain the approximation factor 1/γ, while for the kernel given in Lemma 4 we obtain the approximation factor of α ≤ 0. [sent-146, score-0.32]

54 [2012] have shown that without utilizing kernels, no convex surrogate loss can guarantee an approximation factor smaller than (1/2)(1/γ − 1). [sent-150, score-0.356]

55 3 Proofs: Given a scalar loss function ℓ : R → R and a vector w, we denote by L(w) = E_{(x,y)∼D}[ℓ(y⟨w, x⟩)] the expected loss value of the predictions of w with respect to a distribution D over X × {±1}. [sent-152, score-0.194]

56 Given the training set S = (x1, y1), . . . , (xm, ym), we denote by L̂(w) = (1/m) Σ_{i=1}^m ℓ(yi⟨w, xi⟩) the empirical loss of w. [sent-156, score-0.176]

57 Similarly, L̂01(w), L̂γ(w), L̂h(w), and L̂ramp(w) are the empirical losses of w with respect to the different loss functions. [sent-161, score-0.236]

58 Since the ramp-loss upper bounds the zero-one loss we have that L01 (v) ≤ Lramp (v). [sent-165, score-0.185]

59 The advantage of using the ramp loss is that it is both a Lipschitz function and it is bounded by 1. [sent-166, score-0.236]

60 A standard Rademacher bound (see e.g. Bartlett and Mendelson [2002], Bousquet [2002]) yields that with probability of at least 1 − δ/2 over the choice of S we have Lramp(v) ≤ L̂ramp(v) + 2√(Bx B / m) + √(2 ln(4/δ) / m), where the last two terms on the right-hand side are denoted ϵ1 (Equation (7)). [sent-169, score-0.169]

61 Since the ramp loss is upper bounded by the hinge-loss, we have shown the following inequalities: L01(v) ≤ Lramp(v) ≤ L̂ramp(v) + ϵ1 ≤ L̂h(v) + ϵ1. [sent-170, score-0.287]

62 However, when the original instance space is also an RKHS, and our kernel is composed on top of the original kernel, the increase in the time complexity is not significant. [sent-175, score-0.137]

63 Claim 5 Let p(z) = Σ_{j=0}^∞ βj z^j be any polynomial that satisfies Σ_{j=0}^∞ βj² 2^j ≤ B, and let w be any vector in X. [sent-176, score-0.215]

64 Then, there exists vw in the RKHS defined by the kernel given in Equation (5), such that ∥vw∥² ≤ B and for all x ∈ X, ⟨vw, ψ(x)⟩ = p(⟨w, x⟩). [sent-177, score-0.465]

65 For any polynomial p, let ℓp(z) = ℓh(p(z)), and let L̂p be defined analogously. [sent-178, score-0.249]

66 By the definition of v as minimizing L̂h(v) over ∥v∥² ≤ B, it follows from the above claim that for any odd p that satisfies Σ_{j=0}^∞ βj² 2^j ≤ B and for any w∗ ∈ X, we have that L̂h(v) ≤ L̂h(vw∗) = L̂p(w∗). [sent-180, score-0.189]

67 Next, it is straightforward to verify that if p is an odd polynomial that satisfies max_{z∈[−1,1]} |p(z)| ≤ α and min_{z∈[γ,1]} p(z) ≥ 1 (Equation (9)), then ℓp(z) ≤ (1 + α)ℓγ(z) for all z ∈ [−1, 1]. [sent-181, score-0.289]
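
The following sketch checks this inequality numerically for the polynomial of Lemma 4, taking ℓh(a) = max(0, 1 − a) and ℓγ(z) = 1[z ≤ γ] as the hinge and γ-margin losses; these standard definitions, and the choice γ = 0.1, are assumptions made for the illustration.

```python
import numpy as np

gamma = 0.1
beta1 = 1.0 / gamma + gamma / (1.0 + gamma)
beta2 = -1.0 / (gamma * (1.0 + gamma))
p = lambda z: beta1 * z + beta2 * z ** 3

z = np.linspace(-1.0, 1.0, 200001)
alpha = np.max(np.abs(p(z)))                      # the alpha this p achieves
ell_p = np.maximum(0.0, 1.0 - p(z))               # ell_p(z) = ell_h(p(z))
ell_gamma = (z <= gamma).astype(float)            # gamma-margin zero-one loss
print("ell_p(z) <= (1 + alpha) * ell_gamma(z) on the grid:",
      bool(np.all(ell_p <= (1.0 + alpha) * ell_gamma + 1e-9)))
```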

68 Let p be an odd polynomial such that Σj βj² 2^j ≤ B and such that Equation (9) holds. [sent-186, score-0.262]

69 Then, there exists a polynomial that satisfies the conditions of Corollary 6 with the parameters α, γ, B. [sent-192, score-0.211]

70 Then, there exists a polynomial that satisfies the conditions of Corollary 6 with the parameters α, γ, B. [sent-194, score-0.211]

71 Proof of Theorem 3: The proof is similar to the proof of Theorem 1 except that we replace Claim 5 with the following: Lemma 9 Let p(z) = Σ_{j=1}^d βj z^{2j−1} be any polynomial, and let w be any vector in X. [sent-196, score-0.134]

72 Then, there exists vw in the RKHS defined by the kernel given in Theorem 3, such that ∥vw∥² ≤ ∥β∥₁ and for all x ∈ X, ⟨vw, ψ(x)⟩ = p(⟨w, x⟩). [sent-197, score-0.465]

73 Next, for any w ∈ X , we define explicitly the vector vw for which ⟨vw , ψ(x)⟩ = p(⟨w, x⟩). [sent-213, score-0.325]

74 4 Extension to other learning models: In this section we briefly describe how our results can be extended to adversarial online learning and to PAC learning with malicious noise. [sent-223, score-0.388]

75 4.1 Online learning: Online learning is performed in a sequence of consecutive rounds, where at round t the learner is given an instance, xt ∈ X, and is required to predict its label. [sent-226, score-0.167]

76 After predicting ŷt, the target label yt is revealed. [sent-227, score-0.184]

77 The goal of the learner is to make as few prediction mistakes as possible. [sent-228, score-0.145]

78 The Perceptron maintains a vector wt and predicts according to ŷt = sign(⟨wt, xt⟩). [sent-231, score-0.178]

79 Initially, w1 = 0, and at round t the Perceptron updates the vector using the rule wt+1 = wt + 1[ŷt ≠ yt] yt xt. [sent-232, score-0.27]
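
A minimal sketch of this update rule in dual (kernelized) form, which is how it combines with the kernels above; the linear kernel and the synthetic stream below are illustrative assumptions, and any of the kernels discussed earlier could be substituted.

```python
import numpy as np

def kernel_perceptron(X, y, K):
    """Online Perceptron in dual form: predict sign(sum over past mistakes s of y_s K(x_s, x_t)),
    and store x_t whenever a prediction mistake is made."""
    mistakes = []
    for t in range(len(y)):
        score = sum(y[s] * K(X[s], X[t]) for s in mistakes)
        y_hat = 1 if score > 0 else -1
        if y_hat != y[t]:              # the update w_{t+1} = w_t + 1[y_hat != y_t] y_t x_t, in dual form
            mistakes.append(t)
    return len(mistakes)

lin_K = lambda a, b: float(a @ b)      # the linear kernel recovers the classical Perceptron
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X @ np.array([1.0, 0.0, 0.0]) >= 0.0, 1, -1)
print("mistakes:", kernel_perceptron(X, y, lin_K))
```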

80 Agmon [1954] and others have shown that if there exists w∗ such that for all t, yt⟨w∗, xt⟩ ≥ 1 and ∥xt∥² ≤ Bx, then the Perceptron will make at most ∥w∗∥² Bx prediction mistakes. [sent-234, score-0.167]

81 This mistake bound has been generalized to the noisy case (see for example Gentile [2003]) as follows. [sent-236, score-0.159]

82 Given (x1, y1), . . . , (xm, ym) and a vector w∗, let Lh(w∗) = (1/m) Σ_{t=1}^m ℓh(yt⟨w∗, xt⟩), where ℓh is the hinge-loss. [sent-240, score-0.158]

83 [2009] have derived a mistake bound whose leading term depends on Lγ(w∗) (namely, it corresponds to α = 0), but the runtime of the algorithm is at least m^{1/γ²}. [sent-245, score-0.373]

84 The main result of this section is to derive a mistake bound for the Perceptron for all values of α between 5 and 1/γ. [sent-246, score-0.128]

85 Furthermore, similarly to the proof of Theorem 1, for any polynomial p that satisfies the conditions of Corollary 6 we have that there exists v∗ in the RKHS corresponding to the kernel, with ∥v∗∥² ≤ B and with Lh(v∗) ≤ (1 + α)Lγ(w∗). [sent-252, score-0.261]

86 The goal of the learner is to output a predictor that has L01 (h) ≤ ϵ, with respect to the “clean” distribution. [sent-259, score-0.219]

87 Auer and Cesa-Bianchi [1998] described a general conversion from online learning to the malicious noise setting. [sent-260, score-0.386]

88 Servedio [2003] used this conversion to derive a bound based on the Perceptron’s mistake bound. [sent-261, score-0.178]

89 In our case, we cannot rely on the conversion of Auer and Cesa-Bianchi [1998] since it requires a proper learner, while the online learner described in the previous section is not proper. [sent-262, score-0.238]

90 Then, solve kernel SVM on the resulting noisy training set. [sent-265, score-0.172]
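
A sketch of this two-step procedure under simplifying assumptions: random label flips stand in for the malicious adversary (which in the actual model may corrupt examples arbitrarily), and the same illustrative kernel and SVC solver as before stand in for the paper's kernel SVM.

```python
import numpy as np
from sklearn.svm import SVC

def inverse_kernel(X, Z):
    return 1.0 / (1.0 - 0.5 * (X @ Z.T))

def train_with_label_noise(X, y, eta, rng):
    """Corrupt an eta fraction of labels, then fit the kernel SVM on the noisy set."""
    y_noisy = y.copy()
    flip = rng.random(len(y)) < eta
    y_noisy[flip] = -y_noisy[flip]
    return SVC(kernel=inverse_kernel, C=1.0).fit(X, y_noisy)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
y = np.where(X[:, 0] >= 0.0, 1, -1)
clf = train_with_label_noise(X, y, eta=0.05, rng=rng)
print("error on the clean labels:", np.mean(clf.predict(X) != y))
```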

91 Let L̄ denote the empirical loss over S̄ and L̂ denote the empirical loss over S. [sent-272, score-0.194]

92 With probability of at least 1 − δ, for any v in the RKHS corresponding to the kernel that satisfies ∥v∥² ≤ B we have that L̄01(v) ≤ L̄ramp(v) + 3ϵ/8 (Equation (11)), by our assumption on m. [sent-275, score-0.14]

93 , the one which achieves correct predictions with margin γ on clean examples), and let vw∗ be its corresponding element in the RKHS (see Claim 5). [sent-281, score-0.13]

94 In the above (Equation (13)), the first inequality holds since the ramp loss is upper bounded by the hinge loss, the second inequality is by the definition of v, the third equality is by Claim 5, the fourth inequality is by the properties of p, and the last inequality follows from the definition of η. [sent-283, score-0.488]

95 5 Summary and Open Problems: We have derived upper bounds on the time and sample complexities as a function of the approximation factor. [sent-287, score-0.149]

96 We further provided a recipe for designing kernel functions with a small time and sample complexity for any given value of approximation factor and margin. [sent-288, score-0.226]

97 Our results are applicable to agnostic PAC Learning, online learning, and PAC learning with malicious noise. [sent-289, score-0.39]

98 Another open question is whether the upper bounds we have derived for an improper learner can also be derived for a proper learner. [sent-292, score-0.313]

99 On-line learning with malicious noise and the closure algorithm. [sent-303, score-0.279]

100 Minimizing the misclassification error rate using a surrogate convex loss. [sent-336, score-0.185]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vw', 0.325), ('lh', 0.302), ('malicious', 0.279), ('lramp', 0.257), ('servedio', 0.227), ('perceptron', 0.221), ('halfspaces', 0.187), ('runtime', 0.186), ('polynomial', 0.181), ('rkhs', 0.161), ('ramp', 0.139), ('bx', 0.134), ('predictor', 0.126), ('equation', 0.126), ('surrogate', 0.126), ('pac', 0.125), ('kernel', 0.11), ('poly', 0.104), ('loss', 0.097), ('learner', 0.093), ('yt', 0.092), ('satis', 0.092), ('halfspace', 0.086), ('klivans', 0.086), ('mistake', 0.083), ('odd', 0.081), ('ym', 0.079), ('cristianini', 0.077), ('claim', 0.075), ('freund', 0.073), ('kalai', 0.072), ('theorem', 0.072), ('corollary', 0.069), ('xm', 0.067), ('distributional', 0.065), ('recipe', 0.06), ('auer', 0.06), ('kj', 0.06), ('convex', 0.059), ('online', 0.057), ('svm', 0.057), ('es', 0.056), ('vanilla', 0.055), ('agnostic', 0.054), ('adversarial', 0.052), ('mistakes', 0.052), ('lp', 0.051), ('upper', 0.051), ('blais', 0.05), ('conversion', 0.05), ('proof', 0.05), ('hinge', 0.049), ('clean', 0.048), ('schapire', 0.048), ('margin', 0.048), ('achievable', 0.047), ('et', 0.047), ('long', 0.045), ('xt', 0.045), ('bound', 0.045), ('guarantee', 0.045), ('transfer', 0.044), ('returns', 0.044), ('rosenblatt', 0.044), ('bartlett', 0.043), ('sign', 0.043), ('lemma', 0.042), ('kernels', 0.041), ('wt', 0.041), ('risk', 0.039), ('log', 0.039), ('proper', 0.038), ('vapnik', 0.038), ('inequality', 0.038), ('bounds', 0.037), ('improper', 0.036), ('solves', 0.035), ('lkopf', 0.034), ('let', 0.034), ('ga', 0.034), ('calibrated', 0.033), ('ha', 0.033), ('minimizing', 0.033), ('shai', 0.032), ('excess', 0.032), ('complexities', 0.032), ('noisy', 0.031), ('training', 0.031), ('fix', 0.03), ('rademacher', 0.03), ('least', 0.03), ('exists', 0.03), ('factor', 0.029), ('misclassi', 0.029), ('derived', 0.029), ('required', 0.029), ('august', 0.028), ('sch', 0.028), ('complexity', 0.027), ('min', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

Author: Aharon Birnbaum, Shai Shalev-Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.

2 0.14682251 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation.

3 0.12820198 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

Author: Harish G. Ramaswamy, Shivani Agarwal

Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1

4 0.11693862 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

5 0.11493658 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T^{1/2} log T) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance.

6 0.1133765 16 nips-2012-A Polynomial-time Form of Robust Regression

7 0.10284473 188 nips-2012-Learning from Distributions via Support Measure Machines

8 0.10280878 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.10245191 227 nips-2012-Multiclass Learning with Simplex Coding

10 0.1023732 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.096124932 293 nips-2012-Relax and Randomize : From Value to Algorithms

12 0.093954489 330 nips-2012-Supervised Learning with Similarity Functions

13 0.088812999 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

14 0.083866529 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

15 0.083408117 324 nips-2012-Stochastic Gradient Descent with Only One Projection

16 0.083290569 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

17 0.08271753 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

18 0.082298785 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

19 0.079159699 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

20 0.078274526 160 nips-2012-Imitation Learning by Coaching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.204), (1, -0.007), (2, 0.124), (3, 0.039), (4, 0.145), (5, -0.005), (6, -0.013), (7, 0.123), (8, -0.073), (9, -0.03), (10, 0.075), (11, 0.047), (12, 0.095), (13, 0.041), (14, 0.055), (15, -0.084), (16, -0.026), (17, -0.032), (18, -0.017), (19, 0.045), (20, -0.018), (21, 0.008), (22, -0.031), (23, 0.009), (24, -0.079), (25, 0.065), (26, -0.066), (27, -0.034), (28, -0.081), (29, -0.065), (30, 0.043), (31, -0.005), (32, -0.047), (33, 0.019), (34, 0.077), (35, 0.017), (36, -0.015), (37, -0.021), (38, -0.019), (39, -0.087), (40, -0.022), (41, -0.023), (42, 0.003), (43, -0.008), (44, -0.069), (45, -0.023), (46, 0.055), (47, 0.027), (48, 0.043), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93986547 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

Author: Aharon Birnbaum, Shai Shalev-Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.

2 0.79426098 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper.

3 0.77080691 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

Author: Harish G. Ramaswamy, Shivani Agarwal

Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1

4 0.76835006 217 nips-2012-Mixability in Statistical Learning

Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson

Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1

5 0.72024471 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

6 0.68120986 16 nips-2012-A Polynomial-time Form of Robust Regression

7 0.67714798 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

8 0.66369534 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

9 0.65330815 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

10 0.6455034 280 nips-2012-Proper losses for learning from partial labels

11 0.62088758 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

12 0.61164916 271 nips-2012-Pointwise Tracking the Optimal Regression Function

13 0.61134601 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition

14 0.59983712 330 nips-2012-Supervised Learning with Similarity Functions

15 0.5962314 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

16 0.58496082 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

17 0.57027966 161 nips-2012-Interpreting prediction markets: a stochastic approach

18 0.56703687 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

19 0.56700438 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

20 0.5611245 167 nips-2012-Kernel Hyperalignment


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.021), (21, 0.034), (38, 0.157), (39, 0.011), (42, 0.037), (53, 0.024), (54, 0.021), (55, 0.033), (74, 0.028), (76, 0.129), (80, 0.13), (92, 0.049), (98, 0.249)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92823017 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

Author: Sean Gerrish, David M. Blei

Abstract: We develop a probabilistic model of legislative data that uses the text of the bills to uncover lawmakers’ positions on specific political issues. Our model can be used to explore how a lawmaker’s voting patterns deviate from what is expected and how that deviation depends on what is being voted on. We derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we demonstrate both improvement in heldout predictive performance and the model’s utility in interpreting an inherently multi-dimensional space. 1

2 0.88419175 267 nips-2012-Perceptron Learning of SAT

Author: Alex Flint, Matthew Blaschko

Abstract: Boolean satisfiability (SAT) as a canonical NP-complete decision problem is one of the most important problems in computer science. In practice, real-world SAT sentences are drawn from a distribution that may result in efficient algorithms for their solution. Such SAT instances are likely to have shared characteristics and substructures. This work approaches the exploration of a family of SAT solvers as a learning problem. In particular, we relate polynomial time solvability of a SAT subset to a notion of margin between sentences mapped by a feature function into a Hilbert space. Provided this mapping is based on polynomial time computable statistics of a sentence, we show that the existence of a margin between these data points implies the existence of a polynomial time solver for that SAT subset based on the Davis-Putnam-Logemann-Loveland algorithm. Furthermore, we show that a simple perceptron-style learning rule will find an optimal SAT solver with a bounded number of training updates. We derive a linear time computable set of features and show analytically that margins exist for important polynomial special cases of SAT. Empirical results show an order of magnitude improvement over a state-of-the-art SAT solver on a hardware verification task.

3 0.82738787 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-gredilla

Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: Low data regime, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.

same-paper 4 0.79825604 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

Author: Aharon Birnbaum, Shai Shalev-Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗γ + ϵ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.

5 0.79571354 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detectionbased discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines. 1

6 0.75700176 180 nips-2012-Learning Mixtures of Tree Graphical Models

7 0.7061047 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

8 0.70487297 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

9 0.7035355 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

10 0.70164186 65 nips-2012-Cardinality Restricted Boltzmann Machines

11 0.70141637 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

12 0.700279 293 nips-2012-Relax and Randomize : From Value to Algorithms

13 0.69991744 227 nips-2012-Multiclass Learning with Simplex Coding

14 0.69941896 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

15 0.69939518 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

16 0.69830102 200 nips-2012-Local Supervised Learning through Space Partitioning

17 0.69757915 292 nips-2012-Regularized Off-Policy TD-Learning

18 0.69716805 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

19 0.69697666 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

20 0.69387859 330 nips-2012-Supervised Learning with Similarity Functions