nips nips2006 nips2006-173 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. [sent-14, score-0.338]
2 By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. [sent-19, score-0.247]
3 The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. [sent-20, score-0.189]
4 The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. [sent-22, score-0.318]
5 Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. [sent-23, score-0.408]
6 This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. [sent-24, score-0.183]
7 The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. [sent-25, score-0.299]
8 On the one hand the derived VC(F)/n upper-bound on the worst-case expected risk of the one-inclusion strategy learning from F ⊆ {0, 1}^X improved on the PAC-based previous-best by an order of log n. [sent-28, score-0.23]
9 At the same time Haussler [3] introduced the idea of shifting subsets of the n-cube down around the origin—an idea previously developed in Combinatorics—as a powerful tool for learning-theoretic results. [sent-30, score-0.275]
10 In particular, shifting admitted deeply insightful proofs of Sauer’s Lemma and a VC-dimension bound on the density of the one-inclusion graph—the key result needed for the one-inclusion strategy’s expected risk bound. [sent-31, score-0.57]
11 Recently, shifting has impacted work towards the sample compressibility conjecture of [7]. [sent-32, score-0.368]
12 Here we continue to study the one-inclusion graph—the natural graph structure induced by a subset of the n-cube—and its related prediction strategy under the lens of shifting. [sent-35, score-0.381]
13 After the necessary background, we develop the technique of shatter-invariant shifting in Section 3. [sent-36, score-0.275]
14 While a subset’s VC-dimension cannot be increased by shifting, shatter-invariant shifting guarantees a finite sequence of shifts to a fixed-point under which the shattering of a chosen set remains invariant, thus preserving VC-dimension throughout. [sent-37, score-0.406]
15 In Section 4 we apply a group-theoretic symmetrization to tighten the mistake bound—the worst-case expected risk bound—of the deterministic (randomized) one-inclusion strategy from d/n to ⌈D_n^d⌉/n (D_n^d/n), where D_n^d < d for all n, d. [sent-38, score-0.517]
16 The derived D_n^d density bound positively resolves a conjecture of Kuzmin & Warmuth which was suggested as a step towards a correctness proof of the Peeling compression scheme [5]. [sent-39, score-0.426]
17 Finally we generalize the prediction model, the one-inclusion strategy and its bounds from binary to k-class learning in Section 5. [sent-40, score-0.273]
18 Where Ψ_G-dim(F) and Ψ_P-dim(F) denote the Graph and Pollard dimensions of F, the best bound on expected risk for k ∈ N to-date is O(α log α) for α = Ψ_G-dim(F)/n, for consistent learners [8, 1, 2, 4]. [sent-41, score-0.204]
19 For large n this is O(Ψ_G-dim(F) log(n)/n); we derive an improved bound of Ψ_P-dim(F)/n which we show is at most a O(log k) factor from optimal. [sent-42, score-0.094]
20 Thus, as in the binary case, exploiting class structure enables significantly better bounds on expected risk for multiclass prediction. [sent-43, score-0.229]
21 We write the density of graph G = (V, E) as dens(G) = |E|/|V|, the indicator of A as 1[A], and use ∃! to denote unique existence. [sent-51, score-0.515]
22 2.1 The prediction model of learning. We begin with the basic setup of [4]. [sent-54, score-0.109]
23 For notational convenience we write sam(x, f) = ((x_1, f(x_1)), …, (x_m, f(x_m))) for a sample x = (x_1, …, x_m). [sent-56, score-0.118]
24 A prediction strategy is a mapping of the form Q : ∪_{n>1} ((X × {0, 1})^{n−1} × X) → {0, 1}. [sent-61, score-0.229]
25 Definition 2.1 The prediction model of learning concerns the following scenario. [sent-62, score-0.109]
26 Given full knowledge of strategy Q, an adversary picks a distribution P on X and concept f ∈ F so as to maximize the probability that Q errs on the nth point of an i.i.d. P-distributed sample. [sent-63, score-0.163]
27 Thus the measure of performance is the worst-case expected risk M̂_{Q,F}(n) = sup_{f ∈ F} sup_P E_{X∼P^n}[1[Q(sam((X_1, …, X_{n−1}), f), X_n) ≠ f(X_n)]]. [sent-70, score-0.199]
28 A mistake bound for Q with respect to F is an upper-bound on M̂_{Q,F}. [sent-74, score-0.29]
29 In contrast to Valiant’s PAC model, the prediction learning model is not interested in approximating f given an f -labeled sample, but instead in predicting f (Xn ) with small worst-case probability of erring. [sent-75, score-0.109]
30 (Lemma 2.1 [4]) For any n > 1, concept class F and prediction strategy Q, M̂_{Q,F}(n) ≤ sup_{x ∈ X^n} sup_{f ∈ F} (1/n!) Σ_{g ∈ S_n} 1[Q(sam((x_{g(1)}, …, x_{g(n−1)}), f), x_{g(n)}) ≠ f(x_{g(n)})]. [sent-79, score-0.5]
31 A permutation mistake bound for Q with respect to F is an upper-bound on this permutation average, and hence on M̂_{Q,F}. [sent-84, score-0.311]
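The permutation average above can be evaluated exactly on small finite examples. The following sketch is our own illustration, not code from the paper: it enumerates all permutations of a small projected class V = Π_x(F) and uses a simple majority-vote rule as a stand-in for the one-inclusion strategy; all function and variable names are assumptions of the sketch.

```python
from itertools import permutations

def permutation_mistake_average(V, predict):
    """Max over targets in V of the average, over all permutations g of [n], of the
    indicator that `predict` errs on coordinate g(n) given the labels revealed on
    coordinates g(1), ..., g(n-1).  Exact enumeration: small n only."""
    V = list(V)
    n = len(V[0])
    worst = 0.0
    for v in V:                                    # adversary's choice of target labelling
        perms = list(permutations(range(n)))
        errs = sum(
            int(predict(V, {i: v[i] for i in g[:-1]}, g[-1]) != v[g[-1]])
            for g in perms
        )
        worst = max(worst, errs / len(perms))
    return worst

def majority_predict(V, observed, pos):
    """Stand-in strategy: majority vote of coordinate `pos` over vertices consistent with
    the revealed labels (the actual Q_{G,F} would instead use the one-inclusion graph)."""
    consistent = [u for u in V if all(u[i] == y for i, y in observed.items())]
    ones = sum(u[pos] for u in consistent)
    return int(2 * ones >= len(consistent))

# A chain (thresholds on three points): VC dimension 1.
V = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}
print(permutation_mistake_average(V, majority_predict))   # comes out to 1/3 here; d/n is also 1/3
```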
32 Definition 2.3 The Vapnik-Chervonenkis dimension of concept class F is defined as VC(F) = sup{n | ∃x ∈ X^n, Π_x(F) = {0, 1}^n}. [sent-94, score-0.098]
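For concreteness, here is a minimal brute-force sketch of the projection Π_x(F), shattering, and VC dimension for finite classes. It is our own illustration, not the paper's; the concept representation and all names are assumptions of the sketch.

```python
from itertools import combinations

def projection(F, x):
    """Pi_x(F): the set of label vectors induced on the sample x by concepts in F (dicts point -> label)."""
    return {tuple(f[xi] for xi in x) for f in F}

def shatters(V, I):
    """True iff the family V of binary vectors realises all 2^|I| patterns on the index set I."""
    return len({tuple(v[i] for i in I) for v in V}) == 2 ** len(I)

def vc_dimension(V, n):
    """Largest size of a shattered index set (brute force, exponential in n)."""
    for d in range(n, -1, -1):
        if any(shatters(V, I) for I in combinations(range(n), d)):
            return d
    return 0

# Toy example: threshold concepts on the domain {0, 1, 2, 3}.
domain = [0, 1, 2, 3]
F = [{xi: int(xi >= t) for xi in domain} for t in range(5)]
V = projection(F, (0, 1, 2))
print(sorted(V), vc_dimension(V, 3))   # thresholds have VC dimension 1
```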
33 We next describe three important translation families used in this paper. [sent-120, score-0.075]
34 2.3 The one-inclusion prediction strategy. A subset of the n-cube—the projection of some F—induces the one-inclusion graph, which underlies a natural prediction strategy. [sent-130, score-0.373]
35 The one-inclusion (hyper)graph G(V) of a family V ⊆ {0, …, k}^n is the undirected graph with vertex-set V and hyperedge-set E of maximal (with respect to inclusion) sets of pairwise hamming-1 separated vertices. [sent-139, score-0.095]
36 Algorithm 1 (the deterministic multiclass one-inclusion prediction strategy Q_{G,F}). [sent-140, score-0.363]
37 Given: F ⊆ {0, …, k}^X, sam((x_1, …, x_{n−1}), f) ∈ (X × {0, …, k})^{n−1} and x_n ∈ X; Returns: a prediction of f(x_n). Steps: V ← Π_x(F); G ← G(V); orient G so as to minimize the maximum outdegree; V_space ← {v ∈ V | v_1 = f(x_1), …, v_{n−1} = f(x_{n−1})}; if V_space = {v} then return v_n; else return the nth component of the head of the hyperedge V_space in the oriented graph. [sent-146, score-0.258]
38 The one-inclusion graph's prediction strategy Q_{G,F} [4] immediately generalizes to the multiclass prediction strategy described by Algorithm 1. [sent-149, score-0.832]
39 A lower bound in [6] showed that the one-inclusion strategy’s performance is optimal within a factor of 1 + o(1). [sent-154, score-0.094]
40 Replacing orientation with a distribution over each edge induces a randomized strategy QGrand,F . [sent-155, score-0.211]
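As an illustration of the strategy (and of graph density), here is a small sketch for the binary case. It is our own simplification, not the paper's implementation: where Algorithm 1 orients the graph to minimize the maximum outdegree, the sketch falls back to a uniform choice among consistent vertices, in the spirit of the randomized strategy; all names are ours.

```python
import random
from itertools import combinations

def one_inclusion_graph(V):
    """Binary one-inclusion graph: vertices V in {0,1}^n, edges between Hamming-distance-1 pairs."""
    V = [tuple(v) for v in V]
    E = [(u, v) for u, v in combinations(V, 2)
         if sum(a != b for a, b in zip(u, v)) == 1]
    return V, E

def density(V):
    vertices, edges = one_inclusion_graph(V)
    return len(edges) / len(vertices)

def predict_last_label(V, revealed, rng=random):
    """Predict the n-th label given revealed labels {i: y_i} on coordinates 1..n-1.
    If a single consistent vertex remains, return its last bit (as in Algorithm 1);
    otherwise choose uniformly among consistent vertices instead of following a
    min-max-outdegree orientation."""
    consistent = [v for v in V if all(v[i] == y for i, y in revealed.items())]
    n = len(consistent[0])
    if len(consistent) == 1:
        return consistent[0][n - 1]
    return rng.choice(consistent)[n - 1]

V = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}    # a chain: VC dimension 1
print(density(V))                                    # 3/4 <= VC(V) = 1, as in Lemma 2.4
print(predict_last_label(V, {0: 1, 1: 1}))
```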
41 (Lemma 2.4 [4]) For any n ∈ N and V ⊆ {0, 1}^n, dens(G(V)) ≤ VC(V). [sent-160, score-0.33]
42 Consider any s ∈ [n], v ∈ V and let S_s(v; V) be v shifted along s: if v_s = 0, or if v_s = 1 and there exists some u ∈ V differing from v only in the sth coordinate, then S_s(v; V) = v; otherwise v shifts down—its sth coordinate is decreased from 1 to 0. [sent-162, score-0.569]
43 The entire family V can be shifted to S_s(V) = {S_s(v; V) | v ∈ V}, and this shifted vertex-set induces S_s(E), the edge-set of G(S_s(V)), where (V, E) = G(V). [sent-163, score-0.309]
44 A number of properties of shifting follow relatively easily: (1) |S_s(V)| = |V|, by the injectivity of S_s(· ; V); (2) VC(S_s(V)) ≤ VC(V), as S_s(V) shatters I ⊆ [n] ⇒ V shatters I; (3) |E| ≤ |V| · VC(V) when V is closed-below, as V closed-below ⇒ max_{v∈V} ‖v‖_1 ≤ VC(V); (4) |S_s(E)| ≥ |E|, by cases; and (5) there exist T ∈ N and s ∈ [n]^T such that shifting successively along the sequence s reaches a closed-below fixed point. [sent-168, score-0.527]
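A minimal executable sketch of the binary shift (our own code, with invented names) makes properties (1) and (4) easy to check on small examples: apply S_s over all coordinates until a fixed point is reached, which is a closed-below family.

```python
def shift(V, s):
    """S_s(V): each v with v_s = 1 drops its sth coordinate to 0 unless that vector is already in V."""
    V = set(map(tuple, V))
    out = set()
    for v in V:
        if v[s] == 1:
            down = v[:s] + (0,) + v[s + 1:]
            out.add(v if down in V else down)
        else:
            out.add(v)
    return out

def shift_to_closed_below(V):
    """Repeatedly apply S_s over all coordinates until nothing moves (property (5))."""
    V = set(map(tuple, V))
    n = len(next(iter(V)))
    changed = True
    while changed:
        changed = False
        for s in range(n):
            W = shift(V, s)
            if W != V:
                V, changed = W, True
    return V

V = {(1, 1, 0), (1, 0, 1), (0, 1, 1)}
W = shift_to_closed_below(V)
print(len(W) == len(V))   # property (1): cardinality is preserved
print(sorted(W))          # a closed-below family; its edge count can only have grown (property (4))
```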
45 3 Shatter-invariant shifting. While [3] shifts to bound density, the number of edges can increase and the VC-dimension can decrease—both contributing to the observed gap between graph density and capacity. [sent-190, score-0.631]
46 The next result demonstrates that shifting can in fact be controlled to preserve VC-dimension. [sent-191, score-0.296]
47 Consider arbitrary n ∈ N, I ⊆ [n] and V ⊆ {0, 1}^n that shatters I; then a finite sequence of shifts takes V to a closed-below family that still shatters I. [sent-193, score-0.126]
48 Proof: Π_I(·) is invariant to shifting on the complement Ī = [n]\I. [sent-205, score-0.3]
49 So some finite number of shifts on Ī will produce an Ī-closed-below family W that shatters I. [sent-206, score-0.276]
50 Thus the shattering of I is invariant to the shifting of W on I, so that a finite number of shifts on I produces an I-closed-below W that shatters I. [sent-208, score-0.557]
51 Repeating the process a finite number of times until no non-trivial shifts are made produces a closed-below family that shatters I. [sent-209, score-0.276]
52 4 Tightly bounding graph density by symmetrization. Kuzmin and Warmuth [5] introduced D_n^d as a potential bound on the graph density of maximum classes. [sent-211, score-0.552]
53 We begin with properties of D_n^d, a technical lemma, and then proceed to the main result. [sent-212, score-0.107]
54 Lemma 4.2 D_n^d (i) equals the graph density of V_n^d for each n ∈ N and d ∈ [n]; (ii) is strictly upper-bounded by d, for all n; (iii) equals d/2 for all n = d ∈ N; (iv) is strictly monotonic increasing in d (with n fixed); (v) is strictly monotonic increasing in n (with d fixed); and (vi) limits to d as n → ∞. [sent-217, score-0.352]
55 Proof: By counting, for each d ≤ n < ∞, the density of G(V_n^d) equals D_n^d: |E(G(V_n^d))| / |V_n^d| = (Σ_{i=1}^{d} i·C(n,i)) / (Σ_{i=0}^{d} C(n,i)) = (n·Σ_{i=0}^{d−1} C(n−1,i)) / (Σ_{i=0}^{d} C(n,i)) = n·C(n−1, ≤ d−1) / C(n, ≤ d), proving (i). [sent-218, score-0.146]
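The closed form is easy to evaluate numerically. The sketch below is ours (the helper names are invented): it computes D_n^d = n·C(n−1, ≤d−1)/C(n, ≤d), illustrates that the value stays strictly below d while approaching it as n grows, and computes the quantity n*(d) discussed later in the text.

```python
from math import comb, ceil

def binom_leq(n, d):
    """C(n, <=d) = sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

def D(n, d):
    """D_n^d = n * C(n-1, <=d-1) / C(n, <=d), the density of G(V_n^d)."""
    return n * binom_leq(n - 1, d - 1) / binom_leq(n, d)

def n_star(d, n_max=100_000):
    """First n >= d at which the new deterministic bound ceil(D_n^d)/n matches the old d/n."""
    return next(n for n in range(d, n_max) if ceil(D(n, d)) == d)

for d in (1, 2, 5, 10):
    print(d, round(D(2 * d, d), 3), round(D(100 * d, d), 3))   # strictly below d, approaching d
print([(d, n_star(d)) for d in range(1, 7)])                    # compare with 2*d
```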
56 Lemma 4.3 For arbitrary U, V ⊆ {0, 1}^n with dens(G(V)) ≥ ρ > 0, |U| ≤ |V| and |E(G(U))| ≥ |E(G(V))|, if dens(G(U ∩ V)) < ρ then dens(G(U ∪ V)) > ρ. [sent-238, score-0.99]
57 The density bound D_n^d is plotted (dotted, solid) alongside the previous best d (dashed), for each d ∈ {1, 2, 10} (figure caption). [sent-241, score-0.129]
58 Theorem 4.4 Every family V ⊆ {0, 1}^n with d = VC(V) has (V, E) = G(V) with graph density |E|/|V| ≤ D_n^d < d. [sent-243, score-0.258]
59 Proof: Allow a permutation g ∈ S_n to act on a vector v ∈ {0, 1}^n and a family V ⊆ {0, 1}^n by g(v) = (v_{g(1)}, …, v_{g(n)}) and g(V) = {g(v) | v ∈ V}. [sent-246, score-0.094]
60 Note that a closed-below VC-dimension d family V ⊆ {0, 1}^n satisfies S_n(V) = V iff V = V_n^d, as VC(V) ≥ d implies V contains an embedded d-cube, invariance to S_n implies further that V contains all C(n, d) such cubes, and VC(V) ≤ d implies that V ⊆ V_n^d. [sent-250, score-0.224]
61 Consider now any V*_{n,d} ∈ arg min{|U| : U ∈ arg max dens(G(U))}, the maximization being over closed-below families U ⊆ {0, 1}^n of VC-dimension at most d. [sent-251, score-0.33]
62 Then if dens(G(V*_{n,d} ∩ g(V*_{n,d}))) ≥ dens(G(V*_{n,d})) then V*_{n,d} would not have been selected above (i.e. the intersection would be a no-larger family of no smaller density). [sent-253, score-0.66]
63 Thus dens(G(V*_{n,d} ∩ g(V*_{n,d}))) < dens(G(V*_{n,d})), so by Lemma 4.3 dens(G(V*_{n,d} ∪ g(V*_{n,d}))) > dens(G(V*_{n,d})); but then again V*_{n,d} would not have been selected (i.e. the union would have strictly greater density), so g(V*_{n,d}) = V*_{n,d} for every g ∈ S_n. [sent-258, score-0.66]
64 By the preceding note and Lemma 4.2(i), dens(G(V*_{n,d})) = D_n^{d′} for d′ = VC(V*_{n,d}) ≤ d. [sent-263, score-0.33]
65 By Lemma 4.2(iv) this implies that d′ = d and (6) is true for all closed-below families; V_n^d uniquely maximizes density amongst all closed-below VC-dimension d families in the n-cube. [sent-266, score-0.213]
66 Now let W be a closed-below family obtained from an arbitrary V by a finite sequence of shifts (property (5)). Noting that VC(W) ≤ d and dens(G(V)) ≤ dens(G(W)) by (2) and (1) & (4) respectively, the bound (6) follows directly for V. [sent-269, score-0.754]
67 Furthermore, if we shift so as to preserve VC-dimension (Section 3) then VC(W) = d while still |V| = |W|. [sent-270, score-0.08]
68 And since dens(G(W)) = D_n^d only if W = V_n^d, it follows that V maximizes density amongst all VC-dimension d families in the n-cube, with dens(G(V)) = D_n^d, only if it is maximum. [sent-271, score-0.851]
69 Theorem 4.4 improves on the VC-dimension density bound of Lemma 2.4. [sent-273, score-0.184]
70 This new result immediately implies the following one-inclusion mistake bounds. [sent-275, score-0.218]
71 For small d, n*(d) = min{n ≥ d | ⌈D_n^d⌉ = d}—the first n for which the new and old deterministic one-inclusion mistake bounds coincide—appears to remain very close to 2d. [sent-279, score-0.278]
72 5 Bounds for multiclass prediction. As in the k = 1 case, the key to developing the multiclass one-inclusion mistake bound is in bounding hypergraph density. [sent-283, score-0.702]
73 We proceed by shifting a graph induced by the one-inclusion hypergraph. [sent-284, score-0.392]
74 Proof: We begin by replacing the hyperedge structure E with a related edge structure E′. [sent-290, score-0.077]
75 Two vertices u, v ∈ V are connected in the graph (V, E′) iff there exists an i ∈ [n] such that u, v differ only at i and no w ∈ V exists such that u_i < w_i < v_i and w_j = u_j = v_j on [n]\{i}. [sent-291, score-0.266]
76 Each hyperedge contributes at least one E′ edge, so that |E|/|V| ≤ |E′|/|V| (7). Consider now shifting vertex v ∈ V at shift label t ∈ [k] along shift coordinate s ∈ [n] by S_{s,t}(v; V): if v_s = t and the vector obtained from v by decreasing its sth coordinate to t − 1 is not already in V, then S_{s,t}(v; V) is that vector; otherwise S_{s,t}(v; V) = v. [sent-293, score-0.547]
77 We shift V on s at t as usual; we shift V on s alone by bubbling vertices down to fill gaps below: S_{s,t}(V) = {S_{s,t}(v; V) | v ∈ V} and S_s(V) = S_{s,k}(S_{s,k−1}(… S_{s,1}(V) …)). [sent-307, score-0.181]
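The following is a small executable sketch of one natural reading of the multiclass shift just described (our reconstruction; the paper's exact definition may differ in details, and all names are ours): S_{s,t} drops label t to t − 1 along coordinate s when the slot below is free, and S_s composes these maps over t = 1, …, k.

```python
def shift_label(V, s, t):
    """S_{s,t}(V): a vertex with v_s = t drops to t - 1 along coordinate s
    unless the vector immediately below is already present in V."""
    V = set(map(tuple, V))
    out = set()
    for v in V:
        if v[s] == t:
            down = v[:s] + (t - 1,) + v[s + 1:]
            out.add(v if down in V else down)
        else:
            out.add(v)
    return out

def shift_coordinate(V, s, k):
    """S_s(V) = S_{s,k}(S_{s,k-1}(... S_{s,1}(V) ...)): one bubbling pass along coordinate s."""
    for t in range(1, k + 1):
        V = shift_label(V, s, t)
    return V

V = {(0, 2), (1, 2), (1, 0)}                 # labels in {0, 1, 2}, so n = 2 and k = 2
print(sorted(shift_coordinate(V, 1, 2)))     # occupied labels drop towards 0; cardinality is preserved
```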
78 If i = s then no other vertex w ∈ V can come between u and v during shifting by construction of E , so {Ss (u; V ), Ss (v; V )} ∈ Ss (E ). [sent-314, score-0.275]
79 If both vertices shift down by the same number of labels then they remain connected in Ss (E ). [sent-316, score-0.122]
80 Otherwise assume WLOG that S_s(u; V)_s < S_s(v; V)_s; then the shifted vertices will lose their edge, however since v_s did not shift down to S_s(u; V)_s there must have been some w ∈ V differing from u on {i, s} such that w_s < v_s with S_s(w; V)_s = S_s(u; V)_s. [sent-317, score-0.493]
81 In a finite number of shifts starting from (V, E′), a closed-below family W with induced edge-set F′ will be reached (10). [sent-325, score-0.172]
82 Counting edges in F′ by their upper-adjoining vertices, we have proved that (V, E′) finitely shifts to the closed-below graph (W, F′); combining properties (7)–(11) we have that |E|/|V| ≤ |E′|/|V| ≤ |F′|/|W| ≤ Ψ_P-dim(Π_x(F)), as required. [sent-334, score-0.235]
83 The remaining arguments from the k = 1 case of [4, 3] now imply the multiclass mistake bound. [sent-338, score-0.292]
84 The multiclass one-inclusion prediction strategy satisfies M̂_{Q_{G,F},F}(n) ≤ Ψ_P-dim(F)/n. [sent-344, score-0.229]
85 5.1 A lower bound. We now show that the preceding multiclass mistake bound is optimal to within a O(log k) factor, noting that Ψ_N is smaller than Ψ_P by at most such a factor [1, Theorem 10]. [sent-346, score-0.48]
86 Consider any deterministic or randomized prediction strategy Q and any F ⊆ {0, …, k}^X. [sent-353, score-0.315]
87 Proof: Following [2], we use the probabilistic method to prove the existence of a target in F for which prediction under a distribution P supported by a ΨN -shattered subset is hard. [sent-358, score-0.144]
88 We first choose a set Z = {z_1, …, z_d} Ψ_N-shattered by F and then a subset F_Z ⊆ F of 2^d functions that Ψ_N-shatters Z. [sent-363, score-0.084]
89 For any f ∈ F_Z and x ∈ Z^n with x_n ≠ x_i for all i ∈ [n − 1], exactly half of the functions in F_Z consistent with sam((x_1, …, x_{n−1}), f) output some label i ∈ {0, …, k} on x_n, [sent-366, score-0.287]
90 and the remaining half output some j ∈ {0, …, k} with j ≠ i, so any prediction of f(x_n) errs with probability 1/2 over a uniform random choice of target from F_Z. [sent-372, score-0.169]
91 6 Conclusions and open problems. In this paper we have developed new shifting machinery and tightened the binary one-inclusion mistake bound from d/n to D_n^d/n (⌈D_n^d⌉/n for the deterministic strategy), representing a solid improvement for d ≈ n. [sent-388, score-0.603]
92 We have described the multiclass generalization of the prediction learning model and derived a mistake bound for the multiclass one-inclusion prediction strategy that improves on previous PAC-based expected risk bounds by O(log n) and that is within O(log k) of optimal. [sent-389, score-0.953]
93 Here, shifting with invariance to the shattering of a single set was described; however, we are aware of invariance to more complex shatterings. [sent-390, score-0.389]
94 Another serious application of shatter-invariant shifting, to appear in a sequel to this paper, is to the study of the cubical structure of maximum and maximal classes with connections to the compressibility conjecture of [7]. [sent-391, score-0.093]
95 While Theorem 4.4 resolves one conjecture of Kuzmin & Warmuth [5], the remainder of the conjectured correctness proof for the Peeling compression scheme is known to be false [8]. [sent-393, score-0.215]
96 An open question is whether Theorem 4.4 can be extended over subgroups G ⊂ S_n to gain tighter density bounds. [sent-395, score-0.09]
97 Just as the S_n-invariant V_n^d is the maximizer of density among all closed-below V ⊆ V_n^d, there exist G-invariant families that maximize the density over all of their sub-families. [sent-396, score-0.255]
98 While a general Ψ_G-based bound would allow direct comparison with the PAC-based expected risk bound, it should also be noted that Ψ_P and Ψ_G are in fact incomparable—neither Ψ_G ≤ Ψ_P nor Ψ_P ≤ Ψ_G holds for all classes [1, Theorem 1]. [sent-399, score-0.183]
99 : A general lower bound on the number of examples needed for learning. [sent-419, score-0.094]
100 : The one-inclusion graph algorithm is near optimal for the prediction model of learning. [sent-435, score-0.204]
wordName wordTfidf (topN-words)
[('ss', 0.436), ('vc', 0.34), ('dens', 0.33), ('shifting', 0.275), ('dn', 0.231), ('mistake', 0.196), ('vn', 0.187), ('xn', 0.149), ('vs', 0.132), ('shatters', 0.126), ('strategy', 0.12), ('sam', 0.118), ('prediction', 0.109), ('lemma', 0.107), ('shifted', 0.107), ('warmuth', 0.105), ('sn', 0.104), ('sst', 0.103), ('haussler', 0.1), ('multiclass', 0.096), ('graph', 0.095), ('bound', 0.094), ('density', 0.09), ('kuzmin', 0.09), ('xg', 0.083), ('fz', 0.082), ('shifts', 0.077), ('families', 0.075), ('family', 0.073), ('hypergraph', 0.072), ('sauer', 0.072), ('rubinstein', 0.065), ('risk', 0.064), ('vertices', 0.063), ('vspace', 0.062), ('shift', 0.059), ('theorem', 0.057), ('sup', 0.055), ('witnesses', 0.054), ('shattering', 0.054), ('conjecture', 0.052), ('compression', 0.052), ('symmetrization', 0.049), ('peeling', 0.049), ('zd', 0.049), ('randomized', 0.048), ('pollard', 0.046), ('littlestone', 0.046), ('bounds', 0.044), ('proof', 0.043), ('concept', 0.043), ('compressibility', 0.041), ('eunif', 0.041), ('resolves', 0.041), ('vt', 0.04), ('iv', 0.039), ('bounding', 0.039), ('deterministic', 0.038), ('nition', 0.038), ('hyperedge', 0.036), ('sth', 0.036), ('prp', 0.036), ('subset', 0.035), ('berkeley', 0.034), ('vg', 0.033), ('valiant', 0.031), ('invariance', 0.03), ('equals', 0.03), ('iff', 0.03), ('nth', 0.029), ('correctness', 0.027), ('positively', 0.027), ('exists', 0.027), ('bartlett', 0.026), ('proving', 0.026), ('amongst', 0.026), ('generalizes', 0.026), ('embedded', 0.025), ('inclusion', 0.025), ('invariant', 0.025), ('expected', 0.025), ('vi', 0.024), ('combinatorial', 0.024), ('implies', 0.022), ('induces', 0.022), ('induced', 0.022), ('monotonic', 0.022), ('meeting', 0.022), ('proofs', 0.022), ('coordinate', 0.022), ('permutation', 0.021), ('relating', 0.021), ('strictly', 0.021), ('preserve', 0.021), ('log', 0.021), ('edge', 0.021), ('counting', 0.02), ('replacing', 0.02), ('half', 0.02), ('division', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.16171485 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
3 0.12591363 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
4 0.11169393 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
5 0.105534 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.090283722 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
7 0.084687829 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
8 0.078499615 57 nips-2006-Conditional mean field
9 0.066148333 109 nips-2006-Learnability and the doubling dimension
10 0.063558914 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
11 0.060111824 14 nips-2006-A Small World Threshold for Economic Network Formation
12 0.059451155 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
13 0.059343934 77 nips-2006-Fast Computation of Graph Kernels
14 0.053041078 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
15 0.051893786 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
16 0.051455557 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
17 0.050882641 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
18 0.050643593 35 nips-2006-Approximate inference using planar graph decomposition
19 0.05058828 156 nips-2006-Ordinal Regression by Extended Binary Classification
20 0.050023962 21 nips-2006-AdaBoost is Consistent
topicId topicWeight
[(0, -0.152), (1, 0.057), (2, -0.122), (3, 0.0), (4, 0.037), (5, 0.087), (6, 0.009), (7, 0.114), (8, -0.053), (9, -0.032), (10, 0.127), (11, -0.104), (12, 0.057), (13, -0.018), (14, -0.065), (15, -0.026), (16, 0.015), (17, 0.029), (18, -0.016), (19, -0.168), (20, -0.096), (21, 0.102), (22, 0.064), (23, -0.054), (24, 0.048), (25, -0.087), (26, 0.017), (27, 0.121), (28, 0.078), (29, 0.154), (30, 0.052), (31, 0.056), (32, 0.007), (33, -0.101), (34, 0.033), (35, -0.042), (36, -0.051), (37, -0.06), (38, 0.06), (39, -0.15), (40, 0.066), (41, -0.068), (42, -0.104), (43, -0.104), (44, 0.05), (45, -0.077), (46, 0.048), (47, -0.093), (48, -0.076), (49, -0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.96702778 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.67559808 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
Author: Martin J. Wainwright, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1 regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an 1 -constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to establish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observations n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency. Keywords: Graphical models; Markov random fields; structure learning; 1 -regularization; model selection; convex risk minimization; high-dimensional asymptotics; concentration. 1
3 0.59108853 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
4 0.5187856 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1
5 0.51279509 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
6 0.49763775 116 nips-2006-Learning from Multiple Sources
7 0.43268961 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
8 0.42975593 14 nips-2006-A Small World Threshold for Economic Network Formation
9 0.41576511 21 nips-2006-AdaBoost is Consistent
10 0.3809565 77 nips-2006-Fast Computation of Graph Kernels
11 0.37978017 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
12 0.37722114 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.33380598 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
14 0.32451406 193 nips-2006-Tighter PAC-Bayes Bounds
15 0.32246703 35 nips-2006-Approximate inference using planar graph decomposition
16 0.31955007 146 nips-2006-No-regret Algorithms for Online Convex Programs
17 0.313214 128 nips-2006-Manifold Denoising
18 0.29883093 155 nips-2006-Optimal Single-Class Classification Strategies
19 0.29873472 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.29311505 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
topicId topicWeight
[(1, 0.053), (7, 0.055), (9, 0.546), (22, 0.04), (44, 0.082), (57, 0.034), (65, 0.04), (69, 0.018), (87, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.93924087 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
2 0.9298166 197 nips-2006-Uncertainty, phase and oscillatory hippocampal recall
Author: Máté Lengyel, Peter Dayan
Abstract: Many neural areas, notably, the hippocampus, show structured, dynamical, population behavior such as coordinated oscillations. It has long been observed that such oscillations provide a substrate for representing analog information in the firing phases of neurons relative to the underlying population rhythm. However, it has become increasingly clear that it is essential for neural populations to represent uncertainty about the information they capture, and the substantial recent work on neural codes for uncertainty has omitted any analysis of oscillatory systems. Here, we observe that, since neurons in an oscillatory network need not only fire once in each cycle (or even at all), uncertainty about the analog quantities each neuron represents by its firing phase might naturally be reported through the degree of concentration of the spikes that it fires. We apply this theory to memory in a model of oscillatory associative recall in hippocampal area CA3. Although it is not well treated in the literature, representing and manipulating uncertainty is fundamental to competent memory; our theory enables us to view CA3 as an effective uncertainty-aware, retrieval system. 1
3 0.9219709 149 nips-2006-Nonnegative Sparse PCA
Author: Ron Zass, Amnon Shashua
Abstract: We describe a nonnegative variant of the ”Sparse PCA” problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates — a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets. 1
4 0.81166881 7 nips-2006-A Local Learning Approach for Clustering
Author: Mingrui Wu, Bernhard Schölkopf
Abstract: We present a local learning approach for clustering. The basic idea is that a good clustering result should have the property that the cluster label of each data point can be well predicted based on its neighboring data and their cluster labels, using current supervised learning methods. An optimization problem is formulated such that its solution has the above property. Relaxation and eigen-decomposition are applied to solve this optimization problem. We also briefly investigate the parameter selection issue and provide a simple parameter selection method for the proposed algorithm. Experimental results are provided to validate the effectiveness of the proposed approach. 1
5 0.5459199 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.54424363 158 nips-2006-PG-means: learning the number of clusters in data
7 0.51681298 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
8 0.50997543 80 nips-2006-Fundamental Limitations of Spectral Clustering
9 0.50582916 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
10 0.49199998 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
11 0.48089412 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
12 0.45711982 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
13 0.45643002 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
14 0.45150244 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
15 0.45125693 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity
16 0.44594851 65 nips-2006-Denoising and Dimension Reduction in Feature Space
17 0.4427487 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
18 0.44228545 181 nips-2006-Stability of $K$-Means Clustering
19 0.42376474 163 nips-2006-Prediction on a Graph with a Perceptron
20 0.42033356 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds