nips nips2006 nips2006-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. [sent-14, score-0.338]
2 By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. [sent-19, score-0.247]
3 The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. [sent-20, score-0.189]
4 The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. [sent-22, score-0.318]
5 Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. [sent-23, score-0.408]
6 This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. [sent-24, score-0.183]
7 The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. [sent-25, score-0.299]
8 On the one hand the derived VC(F)/n upper-bound on the worst-case expected risk of the one-inclusion strategy learning from F ⊆ {0, 1}^X improved on the PAC-based previous-best by an order of log n. [sent-28, score-0.23]
9 At the same time Haussler [3] introduced the idea of shifting subsets of the n-cube down around the origin—an idea previously developed in Combinatorics—as a powerful tool for learning-theoretic results. [sent-30, score-0.275]
10 In particular, shifting admitted deeply insightful proofs of Sauer’s Lemma and a VC-dimension bound on the density of the one-inclusion graph—the key result needed for the one-inclusion strategy’s expected risk bound. [sent-31, score-0.57]
11 Recently, shifting has impacted work towards the sample compressibility conjecture of [7]. [sent-32, score-0.368]
12 Here we continue to study the one-inclusion graph—the natural graph structure induced by a subset of the n-cube—and its related prediction strategy under the lens of shifting. [sent-35, score-0.381]
13 After the necessary background, we develop the technique of shatter-invariant shifting in Section 3. [sent-36, score-0.275]
14 While a subset’s VC-dimension cannot be increased by shifting, shatter-invariant shifting guarantees a finite sequence of shifts to a fixed-point under which the shattering of a chosen set remains invariant, thus preserving VC-dimension throughout. [sent-37, score-0.406]
15 In Section 4 we apply a group-theoretic symmetrization to tighten the mistake bound—the worst-case expected risk bound—of the deterministic (randomized) one-inclusion strategy from d/n to ⌈D_n^d⌉/n (D_n^d/n), where D_n^d < d for all n, d. [sent-38, score-0.517]
16 The derived D_n^d density bound positively resolves a conjecture of Kuzmin & Warmuth which was suggested as a step towards a correctness proof of the Peeling compression scheme [5]. [sent-39, score-0.426]
17 Finally we generalize the prediction model, the one-inclusion strategy and its bounds from binary to k-class learning in Section 5. [sent-40, score-0.273]
18 Where Ψ_G-dim(F) and Ψ_P-dim(F) denote the Graph and Pollard dimensions of F, the best bound on expected risk for k ∈ N to-date is O(α log α) for α = Ψ_G-dim(F)/n, for consistent learners [8, 1, 2, 4]. [sent-41, score-0.204]
19 For large n this is O(Ψ_G-dim(F) log(n)/n); we derive an improved bound of Ψ_P-dim(F)/n which we show is at most a O(log k) factor from optimal. [sent-42, score-0.094]
20 Thus, as in the binary case, exploiting class structure enables significantly better bounds on expected risk for multiclass prediction. [sent-43, score-0.229]
21 We write the density of graph G = (V, E) as dens(G) = |E|/|V|, the indicator of A as 1[A], and use ∃! to denote unique existence. [sent-51, score-0.515]
22 2.1 The prediction model of learning. We begin with the basic setup of [4]. [sent-54, score-0.109]
23 For notational convenience we write sam(x, f) = ((x_1, f(x_1)), …, (x_m, f(x_m))) for a sample x = (x_1, …, x_m). [sent-56, score-0.118]
24 A prediction strategy is a mapping of the form Q : ∪_{n>1} ((X × {0, 1})^{n−1} × X) → {0, 1}. [sent-61, score-0.229]
25 Definition 2.1 The prediction model of learning concerns the following scenario. [sent-62, score-0.109]
26 Given full knowledge of strategy Q, an adversary picks a distribution P on X and concept f ∈ F so as to maximize the probability that Q errs on the nth point of an i.i.d. P-distributed sample. [sent-63, score-0.163]
27 Thus the measure of performance is the worst-case expected risk M̂_{Q,F}(n) = sup_{f ∈ F} sup_P E_{X∼P^n}[1[Q(sam((X_1, …, X_{n−1}), f), X_n) ≠ f(X_n)]]. [sent-70, score-0.199]
28 A mistake bound for Q with respect to F is an upper-bound on M̂_{Q,F}. [sent-74, score-0.29]
29 In contrast to Valiant’s PAC model, the prediction learning model is not interested in approximating f given an f -labeled sample, but instead in predicting f (Xn ) with small worst-case probability of erring. [sent-75, score-0.109]
30 (Lemma 2.1 [4]) For any n > 1, concept class F and prediction strategy Q, M̂_{Q,F}(n) ≤ sup_{x ∈ X^n} sup_{f ∈ F} (1/n!) Σ_{g ∈ S_n} 1[Q(sam((x_{g(1)}, …, x_{g(n−1)}), f), x_{g(n)}) ≠ f(x_{g(n)})]. [sent-79, score-0.5]
31 A permutation mistake bound for Q with respect to F is an upper-bound on this permutation average, and hence on M̂_{Q,F}. [sent-84, score-0.311]
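The permutation average above can be evaluated exactly on small finite examples. The following sketch is our own illustration, not code from the paper: it enumerates all permutations of a small projected class V = Π_x(F) and uses a simple majority-vote rule as a stand-in for the one-inclusion strategy; all function and variable names are assumptions of the sketch.

```python
from itertools import permutations

def permutation_mistake_average(V, predict):
    """Max over targets in V of the average, over all permutations g of [n], of the
    indicator that `predict` errs on coordinate g(n) given the labels revealed on
    coordinates g(1), ..., g(n-1).  Exact enumeration: small n only."""
    V = list(V)
    n = len(V[0])
    worst = 0.0
    for v in V:                                    # adversary's choice of target labelling
        perms = list(permutations(range(n)))
        errs = sum(
            int(predict(V, {i: v[i] for i in g[:-1]}, g[-1]) != v[g[-1]])
            for g in perms
        )
        worst = max(worst, errs / len(perms))
    return worst

def majority_predict(V, observed, pos):
    """Stand-in strategy: majority vote of coordinate `pos` over vertices consistent with
    the revealed labels (the actual Q_{G,F} would instead use the one-inclusion graph)."""
    consistent = [u for u in V if all(u[i] == y for i, y in observed.items())]
    ones = sum(u[pos] for u in consistent)
    return int(2 * ones >= len(consistent))

# A chain (thresholds on three points): VC dimension 1.
V = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}
print(permutation_mistake_average(V, majority_predict))   # comes out to 1/3 here; d/n is also 1/3
```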
32 Definition 2.3 The Vapnik-Chervonenkis dimension of concept class F is defined as VC(F) = sup{n | ∃x ∈ X^n, Π_x(F) = {0, 1}^n}. [sent-94, score-0.098]
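For concreteness, here is a minimal brute-force sketch of the projection Π_x(F), shattering, and VC dimension for finite classes. It is our own illustration, not the paper's; the concept representation and all names are assumptions of the sketch.

```python
from itertools import combinations

def projection(F, x):
    """Pi_x(F): the set of label vectors induced on the sample x by concepts in F (dicts point -> label)."""
    return {tuple(f[xi] for xi in x) for f in F}

def shatters(V, I):
    """True iff the family V of binary vectors realises all 2^|I| patterns on the index set I."""
    return len({tuple(v[i] for i in I) for v in V}) == 2 ** len(I)

def vc_dimension(V, n):
    """Largest size of a shattered index set (brute force, exponential in n)."""
    for d in range(n, -1, -1):
        if any(shatters(V, I) for I in combinations(range(n), d)):
            return d
    return 0

# Toy example: threshold concepts on the domain {0, 1, 2, 3}.
domain = [0, 1, 2, 3]
F = [{xi: int(xi >= t) for xi in domain} for t in range(5)]
V = projection(F, (0, 1, 2))
print(sorted(V), vc_dimension(V, 3))   # thresholds have VC dimension 1
```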
33 We next describe three important translation families used in this paper. [sent-120, score-0.075]
34 2.3 The one-inclusion prediction strategy. A subset of the n-cube—the projection of some F—induces the one-inclusion graph, which underlies a natural prediction strategy. [sent-130, score-0.373]
35 The one-inclusion (hyper)graph G(V) of a family V ⊆ {0, …, k}^n is the undirected graph with vertex-set V and hyperedge-set E of maximal (with respect to inclusion) sets of pairwise hamming-1 separated vertices. [sent-139, score-0.095]
36 Algorithm 1 (the deterministic multiclass one-inclusion prediction strategy Q_{G,F}). [sent-140, score-0.363]
37 Given: F ⊆ {0, …, k}^X, sam((x_1, …, x_{n−1}), f) ∈ (X × {0, …, k})^{n−1} and x_n ∈ X; Returns: a prediction of f(x_n). Steps: V ← Π_x(F); G ← G(V); orient G so as to minimize the maximum outdegree; V_space ← {v ∈ V | v_1 = f(x_1), …, v_{n−1} = f(x_{n−1})}; if V_space = {v} then return v_n; else return the nth component of the head of the hyperedge V_space in the oriented graph. [sent-146, score-0.258]
38 The one-inclusion graph's prediction strategy Q_{G,F} [4] immediately generalizes to the multiclass prediction strategy described by Algorithm 1. [sent-149, score-0.832]
39 A lower bound in [6] showed that the one-inclusion strategy’s performance is optimal within a factor of 1 + o(1). [sent-154, score-0.094]
40 Replacing orientation with a distribution over each edge induces a randomized strategy QGrand,F . [sent-155, score-0.211]
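As an illustration of the strategy (and of graph density), here is a small sketch for the binary case. It is our own simplification, not the paper's implementation: where Algorithm 1 orients the graph to minimize the maximum outdegree, the sketch falls back to a uniform choice among consistent vertices, in the spirit of the randomized strategy; all names are ours.

```python
import random
from itertools import combinations

def one_inclusion_graph(V):
    """Binary one-inclusion graph: vertices V in {0,1}^n, edges between Hamming-distance-1 pairs."""
    V = [tuple(v) for v in V]
    E = [(u, v) for u, v in combinations(V, 2)
         if sum(a != b for a, b in zip(u, v)) == 1]
    return V, E

def density(V):
    vertices, edges = one_inclusion_graph(V)
    return len(edges) / len(vertices)

def predict_last_label(V, revealed, rng=random):
    """Predict the n-th label given revealed labels {i: y_i} on coordinates 1..n-1.
    If a single consistent vertex remains, return its last bit (as in Algorithm 1);
    otherwise choose uniformly among consistent vertices instead of following a
    min-max-outdegree orientation."""
    consistent = [v for v in V if all(v[i] == y for i, y in revealed.items())]
    n = len(consistent[0])
    if len(consistent) == 1:
        return consistent[0][n - 1]
    return rng.choice(consistent)[n - 1]

V = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}    # a chain: VC dimension 1
print(density(V))                                    # 3/4 <= VC(V) = 1, as in Lemma 2.4
print(predict_last_label(V, {0: 1, 1: 1}))
```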
41 (Lemma 2.4 [4]) For any n ∈ N and V ⊆ {0, 1}^n, dens(G(V)) ≤ VC(V). [sent-160, score-0.33]
42 Consider any s ∈ [n], v ∈ V and let S_s(v; V) be v shifted along s: if v_s = 0, or if v_s = 1 and there exists some u ∈ V differing from v only in the sth coordinate, then S_s(v; V) = v; otherwise v shifts down—its sth coordinate is decreased from 1 to 0. [sent-162, score-0.569]
43 The entire family V can be shifted to S_s(V) = {S_s(v; V) | v ∈ V}, and this shifted vertex-set induces S_s(E), the edge-set of G(S_s(V)), where (V, E) = G(V). [sent-163, score-0.309]
44 A number of properties of shifting follow relatively easily: (1) |S_s(V)| = |V|, by the injectivity of S_s(· ; V); (2) VC(S_s(V)) ≤ VC(V), as S_s(V) shatters I ⊆ [n] ⇒ V shatters I; (3) |E| ≤ |V| · VC(V) when V is closed-below, as V closed-below ⇒ max_{v∈V} ‖v‖_1 ≤ VC(V); (4) |S_s(E)| ≥ |E|, by cases; and (5) there exist T ∈ N and s ∈ [n]^T such that shifting successively along the sequence s reaches a closed-below fixed point. [sent-168, score-0.527]
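A minimal executable sketch of the binary shift (our own code, with invented names) makes properties (1) and (4) easy to check on small examples: apply S_s over all coordinates until a fixed point is reached, which is a closed-below family.

```python
def shift(V, s):
    """S_s(V): each v with v_s = 1 drops its sth coordinate to 0 unless that vector is already in V."""
    V = set(map(tuple, V))
    out = set()
    for v in V:
        if v[s] == 1:
            down = v[:s] + (0,) + v[s + 1:]
            out.add(v if down in V else down)
        else:
            out.add(v)
    return out

def shift_to_closed_below(V):
    """Repeatedly apply S_s over all coordinates until nothing moves (property (5))."""
    V = set(map(tuple, V))
    n = len(next(iter(V)))
    changed = True
    while changed:
        changed = False
        for s in range(n):
            W = shift(V, s)
            if W != V:
                V, changed = W, True
    return V

V = {(1, 1, 0), (1, 0, 1), (0, 1, 1)}
W = shift_to_closed_below(V)
print(len(W) == len(V))   # property (1): cardinality is preserved
print(sorted(W))          # a closed-below family; its edge count can only have grown (property (4))
```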
45 3 Shatter-invariant shifting. While [3] shifts to bound density, the number of edges can increase and the VC-dimension can decrease—both contributing to the observed gap between graph density and capacity. [sent-190, score-0.631]
46 The next result demonstrates that shifting can in fact be controlled to preserve VC-dimension. [sent-191, score-0.296]
47 Consider arbitrary n ∈ N, I ⊆ [n] and V ⊆ {0, 1}^n that shatters I; then a finite sequence of shifts takes V to a closed-below family that still shatters I. [sent-193, score-0.126]
48 Proof: Π_I(·) is invariant to shifting on the complement Ī = [n]\I. [sent-205, score-0.3]
49 So some finite number of shifts on Ī will produce an Ī-closed-below family W that shatters I. [sent-206, score-0.276]
50 Thus the shattering of I is invariant to the shifting of W on I, so that a finite number of shifts on I produces an I-closed-below W that shatters I. [sent-208, score-0.557]
51 Repeating the process a finite number of times until no non-trivial shifts are made produces a closed-below family that shatters I. [sent-209, score-0.276]
52 4 Tightly bounding graph density by symmetrization. Kuzmin and Warmuth [5] introduced D_n^d as a potential bound on the graph density of maximum classes. [sent-211, score-0.552]
53 We begin with properties of D_n^d, a technical lemma, and then proceed to the main result. [sent-212, score-0.107]
54 Lemma 4.2 D_n^d (i) equals the graph density of V_n^d for each n ∈ N and d ∈ [n]; (ii) is strictly upper-bounded by d, for all n; (iii) equals d/2 for all n = d ∈ N; (iv) is strictly monotonic increasing in d (with n fixed); (v) is strictly monotonic increasing in n (with d fixed); and (vi) limits to d as n → ∞. [sent-217, score-0.352]
55 Proof: By counting, for each d ≤ n < ∞, the density of G(V_n^d) equals D_n^d: |E(G(V_n^d))| / |V_n^d| = (Σ_{i=1}^{d} i·C(n,i)) / (Σ_{i=0}^{d} C(n,i)) = (n·Σ_{i=0}^{d−1} C(n−1,i)) / (Σ_{i=0}^{d} C(n,i)) = n·C(n−1, ≤ d−1) / C(n, ≤ d), proving (i). [sent-218, score-0.146]
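The closed form is easy to evaluate numerically. The sketch below is ours (the helper names are invented): it computes D_n^d = n·C(n−1, ≤d−1)/C(n, ≤d), illustrates that the value stays strictly below d while approaching it as n grows, and computes the quantity n*(d) discussed later in the text.

```python
from math import comb, ceil

def binom_leq(n, d):
    """C(n, <=d) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

def D(n, d):
    """D_n^d = n * C(n-1, <=d-1) / C(n, <=d), the density of G(V_n^d)."""
    return n * binom_leq(n - 1, d - 1) / binom_leq(n, d)

def n_star(d, n_max=100_000):
    """First n >= d at which the new deterministic bound ceil(D_n^d)/n matches the old d/n."""
    return next(n for n in range(d, n_max) if ceil(D(n, d)) == d)

for d in (1, 2, 5, 10):
    print(d, round(D(2 * d, d), 3), round(D(100 * d, d), 3))   # strictly below d, approaching d
print([(d, n_star(d)) for d in range(1, 7)])                    # compare with 2*d
```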
56 Lemma 4.3 For arbitrary U, V ⊆ {0, 1}^n with dens(G(V)) ≥ ρ > 0, |U| ≤ |V| and |E(G(U))| ≥ |E(G(V))|, if dens(G(U ∩ V)) < ρ then dens(G(U ∪ V)) > ρ. [sent-238, score-0.99]
57 The density bound D_n^d is plotted (dotted, solid) alongside the previous best d (dashed), for each d ∈ {1, 2, 10} (figure caption). [sent-241, score-0.129]
58 Theorem 4.4 Every family V ⊆ {0, 1}^n with d = VC(V) has (V, E) = G(V) with graph density |E|/|V| ≤ D_n^d < d. [sent-243, score-0.258]
59 Proof: Allow a permutation g ∈ S_n to act on a vector v ∈ {0, 1}^n and a family V ⊆ {0, 1}^n by g(v) = (v_{g(1)}, …, v_{g(n)}) and g(V) = {g(v) | v ∈ V}. [sent-246, score-0.094]
60 Note that a closed-below VC-dimension d family V ⊆ {0, 1}^n satisfies S_n(V) = V iff V = V_n^d, as VC(V) ≥ d implies V contains an embedded d-cube, invariance to S_n implies further that V contains all C(n, d) such cubes, and VC(V) ≤ d implies that V ⊆ V_n^d. [sent-250, score-0.224]
61 Consider now any V*_{n,d} ∈ arg min{|U| : U ∈ arg max dens(G(U))}, the maximization being over closed-below families U ⊆ {0, 1}^n of VC-dimension at most d. [sent-251, score-0.33]
62 Then if dens(G(V*_{n,d} ∩ g(V*_{n,d}))) ≥ dens(G(V*_{n,d})) then V*_{n,d} would not have been selected above (i.e. the intersection would be a no-larger family of no smaller density). [sent-253, score-0.66]
63 Thus dens(G(V*_{n,d} ∩ g(V*_{n,d}))) < dens(G(V*_{n,d})), so by Lemma 4.3 dens(G(V*_{n,d} ∪ g(V*_{n,d}))) > dens(G(V*_{n,d})); but then again V*_{n,d} would not have been selected (i.e. the union would have strictly greater density), so g(V*_{n,d}) = V*_{n,d} for every g ∈ S_n. [sent-258, score-0.66]
64 By the preceding note and Lemma 4.2(i), dens(G(V*_{n,d})) = D_n^{d′} for d′ = VC(V*_{n,d}) ≤ d. [sent-263, score-0.33]
65 By Lemma 4.2(iv) this implies that d′ = d and (6) is true for all closed-below families; V_n^d uniquely maximizes density amongst all closed-below VC-dimension d families in the n-cube. [sent-266, score-0.213]
66 Now let W be a closed-below family obtained from an arbitrary V by a finite sequence of shifts (property (5)). Noting that VC(W) ≤ d and dens(G(V)) ≤ dens(G(W)) by (2) and (1) & (4) respectively, the bound (6) follows directly for V. [sent-269, score-0.754]
67 Furthermore, if we shift so as to preserve VC-dimension (Section 3) then VC(W) = d while still |V| = |W|. [sent-270, score-0.08]
68 And since dens(G(W)) = D_n^d only if W = V_n^d, it follows that V maximizes density amongst all VC-dimension d families in the n-cube, with dens(G(V)) = D_n^d, only if it is maximum. [sent-271, score-0.851]
69 Theorem 4.4 improves on the VC-dimension density bound of Lemma 2.4. [sent-273, score-0.184]
70 This new result immediately implies the following one-inclusion mistake bounds. [sent-275, score-0.218]
71 For small d, n*(d) = min{n ≥ d | ⌈D_n^d⌉ = d}—the first n for which the new and old deterministic one-inclusion mistake bounds coincide—appears to remain very close to 2d. [sent-279, score-0.278]
72 5 Bounds for multiclass prediction. As in the k = 1 case, the key to developing the multiclass one-inclusion mistake bound is in bounding hypergraph density. [sent-283, score-0.702]
73 We proceed by shifting a graph induced by the one-inclusion hypergraph. [sent-284, score-0.392]
74 Proof: We begin by replacing the hyperedge structure E with a related edge structure E′. [sent-290, score-0.077]
75 Two vertices u, v ∈ V are connected in the graph (V, E′) iff there exists an i ∈ [n] such that u, v differ only at i and no w ∈ V exists such that u_i < w_i < v_i and w_j = u_j = v_j on [n]\{i}. [sent-291, score-0.266]
76 Each hyperedge contributes at least one E′ edge, so that |E|/|V| ≤ |E′|/|V| (7). Consider now shifting vertex v ∈ V at shift label t ∈ [k] along shift coordinate s ∈ [n] by S_{s,t}(v; V): if v_s = t and the vector obtained from v by decreasing its sth coordinate to t − 1 is not already in V, then S_{s,t}(v; V) is that vector; otherwise S_{s,t}(v; V) = v. [sent-293, score-0.547]
77 We shift V on s at t as usual; we shift V on s alone by bubbling vertices down to fill gaps below: S_{s,t}(V) = {S_{s,t}(v; V) | v ∈ V} and S_s(V) = S_{s,k}(S_{s,k−1}(… S_{s,1}(V) …)). [sent-307, score-0.181]
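The following is a small executable sketch of one natural reading of the multiclass shift just described (our reconstruction; the paper's exact definition may differ in details, and all names are ours): S_{s,t} drops label t to t − 1 along coordinate s when the slot below is free, and S_s composes these maps over t = 1, …, k.

```python
def shift_label(V, s, t):
    """S_{s,t}(V): a vertex with v_s = t drops to t - 1 along coordinate s
    unless the vector immediately below is already present in V."""
    V = set(map(tuple, V))
    out = set()
    for v in V:
        if v[s] == t:
            down = v[:s] + (t - 1,) + v[s + 1:]
            out.add(v if down in V else down)
        else:
            out.add(v)
    return out

def shift_coordinate(V, s, k):
    """S_s(V) = S_{s,k}(S_{s,k-1}(... S_{s,1}(V) ...)): one bubbling pass along coordinate s."""
    for t in range(1, k + 1):
        V = shift_label(V, s, t)
    return V

V = {(0, 2), (1, 2), (1, 0)}                 # labels in {0, 1, 2}, so n = 2 and k = 2
print(sorted(shift_coordinate(V, 1, 2)))     # occupied labels drop towards 0; cardinality is preserved
```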
78 If i = s then no other vertex w ∈ V can come between u and v during shifting by construction of E , so {Ss (u; V ), Ss (v; V )} ∈ Ss (E ). [sent-314, score-0.275]
79 If both vertices shift down by the same number of labels then they remain connected in Ss (E ). [sent-316, score-0.122]
80 Otherwise assume WLOG that S_s(u; V)_s < S_s(v; V)_s; then the shifted vertices will lose their edge, however since v_s did not shift down to S_s(u; V)_s there must have been some w ∈ V differing from u on {i, s} such that w_s < v_s with S_s(w; V)_s = S_s(u; V)_s. [sent-317, score-0.493]
81 In a finite number of shifts starting from (V, E′), a closed-below family W with induced edge-set F′ will be reached (10). [sent-325, score-0.172]
82 Counting edges in F′ by their upper-adjoining vertices, we have proved that (V, E′) finitely shifts to the closed-below graph (W, F′); combining properties (7)–(11) we have that |E|/|V| ≤ |E′|/|V| ≤ |F′|/|W| ≤ Ψ_P-dim(Π_x(F)), as required. [sent-334, score-0.235]
83 The remaining arguments from the k = 1 case of [4, 3] now imply the multiclass mistake bound. [sent-338, score-0.292]
84 The multiclass one-inclusion prediction strategy satisfies M̂_{Q_{G,F},F}(n) ≤ Ψ_P-dim(F)/n. [sent-344, score-0.229]
85 5.1 A lower bound. We now show that the preceding multiclass mistake bound is optimal to within a O(log k) factor, noting that Ψ_N is smaller than Ψ_P by at most such a factor [1, Theorem 10]. [sent-346, score-0.48]
86 Consider any deterministic or randomized prediction strategy Q and any F ⊆ {0, …, k}^X. [sent-353, score-0.315]
87 Proof: Following [2], we use the probabilistic method to prove the existence of a target in F for which prediction under a distribution P supported by a ΨN -shattered subset is hard. [sent-358, score-0.144]
88 We first choose a set Z = {z_1, …, z_d} Ψ_N-shattered by F and then a subset F_Z ⊆ F of 2^d functions that Ψ_N-shatters Z. [sent-363, score-0.084]
89 For any f ∈ F_Z and x ∈ Z^n with x_n ≠ x_i for all i ∈ [n − 1], exactly half of the functions in F_Z consistent with sam((x_1, …, x_{n−1}), f) output some label i ∈ {0, …, k} on x_n, [sent-366, score-0.287]
90 and the remaining half output some j ∈ {0, …, k} with j ≠ i, so any prediction of f(x_n) errs with probability 1/2 over a uniform random choice of target from F_Z. [sent-372, score-0.169]
91 6 Conclusions and open problems. In this paper we have developed new shifting machinery and tightened the binary one-inclusion mistake bound from d/n to D_n^d/n (⌈D_n^d⌉/n for the deterministic strategy), representing a solid improvement for d ≈ n. [sent-388, score-0.603]
92 We have described the multiclass generalization of the prediction learning model and derived a mistake bound for the multiclass one-inclusion prediction strategy that improves on previous PAC-based expected risk bounds by O(log n) and that is within O(log k) of optimal. [sent-389, score-0.953]
93 Here, shifting with invariance to the shattering of a single set was described; however, we are aware of invariance to more complex shatterings. [sent-390, score-0.389]
94 Another serious application of shatter-invariant shifting, to appear in a sequel to this paper, is to the study of the cubical structure of maximum and maximal classes with connections to the compressibility conjecture of [7]. [sent-391, score-0.093]
95 While Theorem 4.4 resolves one conjecture of Kuzmin & Warmuth [5], the remainder of the conjectured correctness proof for the Peeling compression scheme is known to be false [8]. [sent-393, score-0.215]
96 An open question is whether Theorem 4.4 can be extended over subgroups G ⊂ S_n to gain tighter density bounds. [sent-395, score-0.09]
97 Just as the S_n-invariant V_n^d is the maximizer of density among all closed-below V ⊆ V_n^d, there exist G-invariant families that maximize the density over all of their sub-families. [sent-396, score-0.255]
98 While a general Ψ_G-based bound would allow direct comparison with the PAC-based expected risk bound, it should also be noted that Ψ_P and Ψ_G are in fact incomparable—neither Ψ_G ≤ Ψ_P nor Ψ_P ≤ Ψ_G holds for all classes [1, Theorem 1]. [sent-399, score-0.183]
99 : A general lower bound on the number of examples needed for learning. [sent-419, score-0.094]
100 : The one-inclusion graph algorithm is near optimal for the prediction model of learning. [sent-435, score-0.204]
wordName wordTfidf (topN-words)
[('ss', 0.436), ('vc', 0.34), ('dens', 0.33), ('shifting', 0.275), ('dn', 0.231), ('mistake', 0.196), ('vn', 0.187), ('xn', 0.149), ('vs', 0.132), ('shatters', 0.126), ('strategy', 0.12), ('sam', 0.118), ('prediction', 0.109), ('lemma', 0.107), ('shifted', 0.107), ('warmuth', 0.105), ('sn', 0.104), ('sst', 0.103), ('haussler', 0.1), ('multiclass', 0.096), ('graph', 0.095), ('bound', 0.094), ('density', 0.09), ('kuzmin', 0.09), ('xg', 0.083), ('fz', 0.082), ('shifts', 0.077), ('families', 0.075), ('family', 0.073), ('hypergraph', 0.072), ('sauer', 0.072), ('rubinstein', 0.065), ('risk', 0.064), ('vertices', 0.063), ('vspace', 0.062), ('shift', 0.059), ('theorem', 0.057), ('sup', 0.055), ('witnesses', 0.054), ('shattering', 0.054), ('conjecture', 0.052), ('compression', 0.052), ('symmetrization', 0.049), ('peeling', 0.049), ('zd', 0.049), ('randomized', 0.048), ('pollard', 0.046), ('littlestone', 0.046), ('bounds', 0.044), ('proof', 0.043), ('concept', 0.043), ('compressibility', 0.041), ('eunif', 0.041), ('resolves', 0.041), ('vt', 0.04), ('iv', 0.039), ('bounding', 0.039), ('deterministic', 0.038), ('nition', 0.038), ('hyperedge', 0.036), ('sth', 0.036), ('prp', 0.036), ('subset', 0.035), ('berkeley', 0.034), ('vg', 0.033), ('valiant', 0.031), ('invariance', 0.03), ('equals', 0.03), ('iff', 0.03), ('nth', 0.029), ('correctness', 0.027), ('positively', 0.027), ('exists', 0.027), ('bartlett', 0.026), ('proving', 0.026), ('amongst', 0.026), ('generalizes', 0.026), ('embedded', 0.025), ('inclusion', 0.025), ('invariant', 0.025), ('expected', 0.025), ('vi', 0.024), ('combinatorial', 0.024), ('implies', 0.022), ('induces', 0.022), ('induced', 0.022), ('monotonic', 0.022), ('meeting', 0.022), ('proofs', 0.022), ('coordinate', 0.022), ('permutation', 0.021), ('relating', 0.021), ('strictly', 0.021), ('preserve', 0.021), ('log', 0.021), ('edge', 0.021), ('counting', 0.02), ('replacing', 0.02), ('half', 0.02), ('division', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.16171485 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
3 0.12591363 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
4 0.11169393 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
5 0.105534 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.090283722 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
7 0.084687829 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
8 0.078499615 57 nips-2006-Conditional mean field
9 0.066148333 109 nips-2006-Learnability and the doubling dimension
10 0.063558914 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
11 0.060111824 14 nips-2006-A Small World Threshold for Economic Network Formation
12 0.059451155 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
13 0.059343934 77 nips-2006-Fast Computation of Graph Kernels
14 0.053041078 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
15 0.051893786 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
16 0.051455557 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
17 0.050882641 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
18 0.050643593 35 nips-2006-Approximate inference using planar graph decomposition
19 0.05058828 156 nips-2006-Ordinal Regression by Extended Binary Classification
20 0.050023962 21 nips-2006-AdaBoost is Consistent
topicId topicWeight
[(0, -0.152), (1, 0.057), (2, -0.122), (3, 0.0), (4, 0.037), (5, 0.087), (6, 0.009), (7, 0.114), (8, -0.053), (9, -0.032), (10, 0.127), (11, -0.104), (12, 0.057), (13, -0.018), (14, -0.065), (15, -0.026), (16, 0.015), (17, 0.029), (18, -0.016), (19, -0.168), (20, -0.096), (21, 0.102), (22, 0.064), (23, -0.054), (24, 0.048), (25, -0.087), (26, 0.017), (27, 0.121), (28, 0.078), (29, 0.154), (30, 0.052), (31, 0.056), (32, 0.007), (33, -0.101), (34, 0.033), (35, -0.042), (36, -0.051), (37, -0.06), (38, 0.06), (39, -0.15), (40, 0.066), (41, -0.068), (42, -0.104), (43, -0.104), (44, 0.05), (45, -0.077), (46, 0.048), (47, -0.093), (48, -0.076), (49, -0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.96702778 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.67559808 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
3 0.59108853 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
4 0.5187856 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1
5 0.51279509 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
6 0.49763775 116 nips-2006-Learning from Multiple Sources
7 0.43268961 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
8 0.42975593 14 nips-2006-A Small World Threshold for Economic Network Formation
9 0.41576511 21 nips-2006-AdaBoost is Consistent
10 0.3809565 77 nips-2006-Fast Computation of Graph Kernels
11 0.37978017 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
12 0.37722114 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.33380598 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
14 0.32451406 193 nips-2006-Tighter PAC-Bayes Bounds
15 0.32246703 35 nips-2006-Approximate inference using planar graph decomposition
16 0.31955007 146 nips-2006-No-regret Algorithms for Online Convex Programs
17 0.313214 128 nips-2006-Manifold Denoising
18 0.29883093 155 nips-2006-Optimal Single-Class Classification Strategies
19 0.29873472 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.29311505 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
topicId topicWeight
[(1, 0.053), (7, 0.055), (9, 0.546), (22, 0.04), (44, 0.082), (57, 0.034), (65, 0.04), (69, 0.018), (87, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.93924087 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.9298166 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall
Author: Máté Lengyel, Peter Dayan
Abstract: Many neural areas, notably, the hippocampus, show structured, dynamical, population behavior such as coordinated oscillations. It has long been observed that such oscillations provide a substrate for representing analog information in the firing phases of neurons relative to the underlying population rhythm. However, it has become increasingly clear that it is essential for neural populations to represent uncertainty about the information they capture, and the substantial recent work on neural codes for uncertainty has omitted any analysis of oscillatory systems. Here, we observe that, since neurons in an oscillatory network need not only fire once in each cycle (or even at all), uncertainty about the analog quantities each neuron represents by its firing phase might naturally be reported through the degree of concentration of the spikes that it fires. We apply this theory to memory in a model of oscillatory associative recall in hippocampal area CA3. Although it is not well treated in the literature, representing and manipulating uncertainty is fundamental to competent memory; our theory enables us to view CA3 as an effective uncertainty-aware, retrieval system. 1
3 0.9219709 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the ”Sparse PCA” problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets. 1
4 0.81166881 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
5 0.5459199 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.54424363 158 nips-2006-PG-means: learning the number of clusters in data
7 0.51681298 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.50997543 80 nips-2006-Fundamental Limitations of Spectral Clustering
9 0.50582916 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
10 0.49199998 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
11 0.48089412 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
12 0.45711982 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
13 0.45643002 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
14 0.45150244 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
15 0.45125693 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity
16 0.44594851 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.4427487 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
18 0.44228545 181 nips-2006-Stability of $K$-Means Clustering
19 0.42376474 163 nips-2006-Prediction on a Graph with a Perceptron
20 0.42033356 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds