nips nips2010 nips2010-223 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem.
Reference: text
sentIndex sentText sentNum sentScore
1 Rates of convergence for the cluster tree Kamalika Chaudhuri UC San Diego kchaudhuri@ucsd.edu [sent-1, score-0.424]
2 Abstract For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. [sent-4, score-0.527]
3 The set of all high-density clusters forms a hierarchy called the cluster tree of f . [sent-5, score-0.62]
4 We present a procedure for estimating the cluster tree given samples from f . [sent-6, score-0.389]
5 If an algorithm is given samples X1 , . . . , Xn from distribution f and is told to find k clusters, what do these clusters reveal about f ? [sent-15, score-0.168]
6 Pollard [8] proved a basic consistency result: if the algorithm always finds the global minimum of the k-means cost function (which is NP-hard, see Theorem 3 of [3]), then as n → ∞, the clustering converges to the globally optimal k-means solution for f . [sent-16, score-0.213]
7 Here we follow some early work on clustering (for instance, [5]) by associating clusters with high-density connected regions. [sent-20, score-0.41]
8 Specifically, a cluster of density f is any connected component of {x : f (x) ≥ λ}, for any λ > 0. [sent-21, score-0.527]
9 The collection of all such clusters forms an (infinite) hierarchy called the cluster tree (Figure 1). [sent-22, score-0.62]
10 Are there hierarchical clustering algorithms which converge to the cluster tree? [sent-23, score-0.337]
11 Previous theory work [5, 7] has provided weak consistency results for the single-linkage clustering algorithm, while other work [13] has suggested ways to overcome the deficiencies of this algorithm by making it more robust, but without proofs of convergence. [sent-24, score-0.181]
12 We establish the finite-sample rate of convergence of our algorithm (Theorem 6); the centerpiece of our argument is a result on continuum percolation (Theorem 11). [sent-26, score-0.163]
13 We also give a lower bound on the problem of cluster tree estimation (Theorem 12), which matches our upper bound in its dependence on most of the parameters of interest. [sent-27, score-0.389]
14 We exclusively consider Euclidean distance on X , denoted ∥ · ∥; let B(x, r) be the closed ball of radius r around x. [sent-29, score-0.232]
15 Figure 1: A probability density f on R, and three of its clusters: C1 , C2 , and C3 . [sent-31, score-0.131]
16 2.1 The cluster tree. We start with notions of connectivity. [sent-33, score-0.389]
17 If x = P (0) and y = P (1) for some path P in S, we say that x and y are connected in S. [sent-35, score-0.191]
18 This relation – “connected in S” – is an equivalence relation that partitions S into its connected components. [sent-36, score-0.164]
19 We say S ⊂ X is connected if it has a single connected component. [sent-37, score-0.355]
20 The cluster tree is a hierarchy each of whose levels is a partition of a subset of X , which we will occasionally call a subpartition of X . [sent-38, score-0.51]
21 Definition 1 For any f : X → R, the cluster tree of f is a function Cf : R → Π(X ) given by Cf (λ) = connected components of {x ∈ X : f (x) ≥ λ}. [sent-40, score-0.553]
22 Any element of Cf (λ), for any λ, is called a cluster of f . [sent-41, score-0.232]
23 For any λ, Cf (λ) is a set of disjoint clusters of X . [sent-42, score-0.226]
24 We will sometimes deal with the restriction of the cluster tree to a finite set of points x1 , . . . , xn . [sent-49, score-0.48]
25 Formally, the restriction of a subpartition C ∈ Π(X ) to these points is defined to be C[x1 , . . . , xn ] = {C ∩ {x1 , . . . , xn } : C ∈ C}. [sent-53, score-0.149]
26 Likewise, the restriction of the cluster tree is Cf [x1 , . . . , xn ], given by Cf [x1 , . . . , xn ](λ) = Cf (λ)[x1 , . . . , xn ]. [sent-60, score-0.432]
27 2.2 Notion of convergence and previous work. Suppose a sample Xn ⊂ X of size n is used to construct a tree Cn that is an estimate of Cf . [sent-75, score-0.217]
28 Hartigan [5] provided a very natural notion of consistency for this setting. [sent-76, score-0.141]
29 Definition 3 For any sets A, A′ ⊂ X , let An (resp. A′n ) denote the smallest cluster of Cn containing A ∩ Xn (resp. A′ ∩ Xn ). [sent-77, score-0.257]
30 We say Cn is consistent if, whenever A and A′ are different connected components of {x : f (x) ≥ λ} (for some λ > 0), P(An is disjoint from A′n ) → 1 as n → ∞. [sent-78, score-0.308]
31 It is well known that if Xn is used to build a uniformly consistent density estimate fn (that is, supx |fn (x) − f (x)| → 0), then the cluster tree Cfn is consistent; see the appendix for details. [sent-79, score-0.743]
32 The big problem is that Cfn is not easy to compute for typical density estimates fn : imagine, for instance, how one might go about trying to find level sets of a mixture of Gaussians! [sent-80, score-0.271]
33 Figure 2: A probability density f , and the restriction of Cf to a finite set of eight points. [sent-81, score-0.174]
34 Wong and Lane [14] have an efficient procedure that tries to approximate Cfn when fn is a k-nearest neighbor density estimate, but they have not shown that it preserves the consistency property of Cfn . [sent-82, score-0.377]
35 There is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single linkage (or Kruskal’s algorithm); see the appendix for pseudocode. [sent-83, score-0.442]
36 But Hartigan also demonstrates, by a lovely reduction to continuum percolation, that this consistency fails in higher dimension d ≥ 2. [sent-85, score-0.173]
37 The problem is the requirement that A ∩ Xn ⊂ An : by the time the clusters are large enough that one of them contains all of A, there is a reasonable chance that this cluster will be so big as to also contain part of A′ . [sent-86, score-0.429]
38 With this insight, Hartigan defines a weaker notion of fractional consistency, under which An (resp. A′n ) need not contain all of A ∩ Xn (resp. A′ ∩ Xn ), but merely a sizeable chunk of it – and ought to be very close (at distance → 0 as n → ∞) to the remainder. [sent-87, score-0.16]
39 He then shows that single linkage has this weaker consistency property for any pair A, A′ for which the ratio of inf{f (x) : x ∈ A ∪ A′ } to sup{inf{f (x) : x ∈ P } : paths P from A to A′ } is sufficiently large. [sent-88, score-0.287]
40 More recent work by Penrose [7] closes the gap and shows fractional consistency whenever this ratio is > 1. [sent-89, score-0.171]
41 A more robust version of single linkage has been proposed by Wishart [13]: when connecting points at distance r from each other, only consider points that have at least k neighbors within distance r (for some k > 2). [sent-90, score-0.484]
42 Thus initially, when r is small, only the regions of highest density are available for linkage, while the rest of the data set is ignored. [sent-91, score-0.131]
43 Stuetzle and Nugent [12] have an appealing top-down scheme for estimating the cluster tree, along with a post-processing step (called runt pruning) that helps identify modes of the distribution. [sent-95, score-0.275]
44 Several recent papers [6, 10, 9, 11] have considered the problem of recovering the connected components of {x : f (x) ≥ λ} for a user-specified λ: the flat version of our problem. [sent-97, score-0.164]
45 In particular, the algorithm of [6] is intuitively similar to ours, though they use a single graph in which each point is connected to its k nearest neighbors, whereas we have a hierarchy of graphs in which each point is connected to other points at distance ≤ r (for various r). [sent-98, score-0.624]
46 Interestingly, k-nn graphs are valuable for flat clustering because they can adapt to clusters of different scales (different average interpoint distances). [sent-99, score-0.28]
47 For each xi set rk (xi ) = inf{r : B(xi , r) contains k data points}. [sent-106, score-0.209]
48 We provide finite-sample convergence rates for all 1 ≤ α ≤ 2, and we can achieve k ∼ d log n, which we conjecture to be the best possible, if α > √2. [sent-120, score-0.133]
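Only the first step of the Figure 3 procedure survives in the extraction above (the definition of rk in the preceding sentence), so the following is a hedged sketch of the kind of construction the surrounding sentences describe: a point becomes active once rk (xi ) ≤ r, active points within distance αr are joined by an edge, and the connected components of the resulting graph Gr are reported as the clusters at scale r. The function names knn_radii and graph_components (and the brute-force distance computations) are illustrative choices, not taken from the paper.

```python
import numpy as np

def knn_radii(X, k):
    """r_k(x_i) = inf{r : B(x_i, r) contains k data points}: the distance from
    x_i to its k-th closest sample point, counting x_i itself (assumes k <= n)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return np.sort(D, axis=1)[:, k - 1]

def graph_components(X, r, rk, alpha):
    """Connected components of G_r: vertices are the 'active' points {x_i : r_k(x_i) <= r},
    edges join active points at distance <= alpha * r."""
    active = np.flatnonzero(rk <= r)
    Xa = X[active]
    D = np.sqrt(((Xa[:, None, :] - Xa[None, :, :]) ** 2).sum(axis=-1))

    parent = list(range(len(active)))          # union-find over the active points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            if D[i, j] <= alpha * r:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    comps = {}
    for local, original in enumerate(active):
        comps.setdefault(find(local), []).append(int(original))
    return list(comps.values())

# Illustrative sweep: trace the empirical hierarchy by growing r over the observed
# activation radii (a fuller sweep would also include pairwise distances divided by alpha).
# X = np.random.randn(500, 2); rk = knn_radii(X, k=10)
# hierarchy = {float(r): graph_components(X, r, rk, alpha=np.sqrt(2)) for r in np.unique(rk)}
```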
49 3.1 A notion of cluster salience. Suppose density f is supported on some subset X of Rd . [sent-124, score-0.401]
50 But the more interesting question is, what clusters will be identified from a finite sample? [sent-126, score-0.168]
51 The first consideration is that a cluster is hard to identify if it contains a thin “bridge” that would make it look disconnected in a small sample. [sent-128, score-0.287]
52 Second, the ease of distinguishing two clusters A and A′ depends inevitably upon the separation between them. [sent-132, score-0.218]
53 A ∩ Xn and A′ ∩ Xn are each individually connected in Gr(λ) . [sent-149, score-0.164]
54 The two parts of this theorem – separation and connectedness – are proved in Sections 3.3 and 3.4, respectively. [sent-150, score-0.175]
55 We mention in passing that this finite-sample result implies consistency (Definition 3): as n → ∞, take kn = (d log n)/ǫ2n with any schedule of (ǫn : n = 1, 2, . . .) that decreases to zero sufficiently slowly. [sent-153, score-0.195]
56 Under mild conditions, any two connected components A, A′ of {f ≥ λ} are (σ, ǫ)-separated for some σ, ǫ > 0 (see appendix); thus they will get distinguished for sufficiently large n. [sent-157, score-0.206]
57 3.3 Analysis: separation. The cluster tree algorithm depends heavily on the radii rk (x): the distance within which x’s nearest k neighbors lie (including x itself). [sent-159, score-0.673]
58 To show that rk (x) is meaningful, we need to establish that the mass of this ball under density f is also, very approximately, k/n. [sent-161, score-0.36]
59 Lemma 8 Pick 0 < r < 2σ/(α + 2) such that vd rd λ ≥ k/n + (Cδ /n)√(kd log n) and vd rd λ(1 − ǫ) < k/n − (Cδ /n)√(kd log n) (recall that vd is the volume of the unit ball in Rd ). [sent-169, score-1.88]
60 Proof: For (1), any point x ∈ (Aσ−r ∪ A′σ−r ) has f (B(x, r)) ≥ vd rd λ; and thus, by Lemma 7, has at least k neighbors within radius r. [sent-174, score-0.762]
61 Likewise, any point x ∈ Sσ−r has f (B(x, r)) < vd rd λ(1 − ǫ); and thus, by Lemma 7, has strictly fewer than k neighbors within distance r. [sent-175, score-0.733]
62 For (2), since points in Sσ−r are absent from Gr , any path from A to A′ in that graph must have an edge across Sσ−r . [sent-176, score-0.145]
63 Definition 9 Define r(λ) to be the value of r for which vd rd λ = k/n + (Cδ /n)√(kd log n). [sent-178, score-0.725]
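To make Definition 9 concrete, here is a small helper that inverts the defining equation for r(λ). It is only a sketch: the constant Cδ comes from the (omitted) concentration lemma and is treated as a given input, and the name r_of_lambda is illustrative.

```python
import math

def r_of_lambda(lmbda, k, n, d, C_delta):
    """Solve v_d * r^d * lambda = k/n + (C_delta/n) * sqrt(k*d*log n) for r (Definition 9).
    C_delta is taken as given; v_d is the volume of the unit ball in R^d."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    rhs = k / n + (C_delta / n) * math.sqrt(k * d * math.log(n))
    return (rhs / (v_d * lmbda)) ** (1.0 / d)
```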
64 Figure 4: Left: P is a path from x to x′ , and π(xi ) is the point furthest along the path that is within distance r of xi . [sent-180, score-0.865]
65 Right: The next point, xi+1 ∈ Xn , is chosen from a slab of B(π(xi ), r) that is perpendicular to xi − π(xi ) and has width 2ζr/√d. [sent-181, score-0.398]
66 3.4 Analysis: connectedness. We need to show that points in A (and similarly A′ ) are connected in Gr(λ) . [sent-183, score-0.27]
67 Then with probability ≥ 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ 2σ/(2 + α) and the conditions of Lemma 8 hold, and vd rd λ ≥ (2/α)d · Cδ d log n/n. [sent-186, score-0.695]
68 Then, with probability > 1 − δ, A ∩ Xn is connected in Gr whenever r ≤ σ/2 and the conditions of Lemma 8 hold, and vd rd λ ≥ 8Cδ · d log n/n. [sent-191, score-0.695]
69 We now also require a more complicated class G, each element of which is the intersection of an open ball and a slab defined by two parallel hyperplanes. [sent-193, score-0.34]
70 We will describe any such set as “the slab of B(µ, r) in direction u”. [sent-195, score-0.203]
71 A simple calculation (see Lemma 4 of [4]) shows that the volume of this slab is at least a ζ/4 fraction of that of B(x, r). [sent-196, score-0.253]
72 Thus, if the slab lies entirely in Aσ , its probability mass is at least (ζ/4)vd rd λ. [sent-197, score-0.468]
73 By uniform convergence over G (which has VC dimension 2d), we can then conclude (as in Lemma 7) that if (ζ/4)vd rd λ ≥ (2Cδ d log n)/n, then with probability at least 1 − δ, every such slab within A contains at least one data point. [sent-198, score-0.505]
74 We will exhibit a sequence of data points x0 = x, x1 , x2 , . . . , ending in x′ , such that for every i, point xi is active in Gr and ∥xi − xi+1 ∥ ≤ αr. [sent-203, score-0.145]
75 This will confirm that x is connected to x′ in Gr . [sent-204, score-0.164]
76 Define π(y) = P (sup N (y)), where N (y) = {t ∈ [0, 1] : ∥P (t) − y∥ ≤ r}; this is the furthest point along the path within distance r of y (Figure 4, left). [sent-209, score-0.303]
77 • By construction, xi is within distance r of path P and hence N (xi ) is nonempty. [sent-217, score-0.341]
78 • Let B be the open ball of radius r around π(xi ). [sent-218, score-0.205]
79 The slab of B in direction xi − π(xi ) must contain a data point; this is xi+1 (Figure 4, right). [sent-219, score-0.358]
80 The process eventually stops because each π(xi+1 ) is strictly further along path P than π(xi ); formally, P −1 (π(xi+1 )) > P −1 (π(xi )). [sent-220, score-0.14]
81 This is because ∥xi+1 − π(xi )∥ < r, so by continuity of the function P , there are points further along the path (beyond π(xi )) whose distance to xi+1 is still < r. [sent-221, score-0.252]
82 Each xi lies in Ar ⊆ Aσ−r and is thus active in Gr (Lemma 8). [sent-227, score-0.187]
83 Finally, the distance between successive points is: ∥xi − xi+1 ∥2 = ∥xi − π(xi ) + π(xi ) − xi+1 ∥2 = ∥xi − π(xi )∥2 + ∥π(xi ) − xi+1 ∥2 + 2(xi − π(xi )) · (π(xi ) − xi+1 ) ≤ 2r2 + 2ζr2 /√d ≤ α2 r2 , where the second-last inequality comes from the definition of the slab. [sent-228, score-0.577]
84 The relationship that defines r(λ) (Definition 9) then translates into vd rd λ = (k/n)(1 + ǫ/2). [sent-230, score-0.438]
85 This shows that clusters at density level λ emerge when the growing radius r of the cluster tree algorithm reaches roughly (k/(λvd n))1/d . [sent-231, score-0.756]
86 In order for (σ, ǫ)-separated clusters to be distinguished, we need this radius to be at most σ/2; this is what yields the final lower bound on λ. [sent-232, score-0.236]
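Spelled out, the rearrangement behind this last remark (implicit in the two preceding sentences rather than stated there, and holding only up to the constants hidden in "roughly") is:

\[
\left(\frac{k}{\lambda v_d n}\right)^{1/d} \le \frac{\sigma}{2}
\quad\Longleftrightarrow\quad
\lambda \ge \frac{2^d \, k}{v_d \, \sigma^d \, n}.
\]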
87 4 Lower bound. We have shown that the algorithm of Figure 3 distinguishes pairs of clusters that are (σ, ǫ)-separated. [sent-233, score-0.168]
88 The number of samples it requires to capture clusters at density ≥ λ is, by Theorem 6, O( (d/(vd (σ/2)d λǫ2 )) log (d/(vd (σ/2)d λǫ2 )) ). We’ll now show that this dependence on σ, λ, and ǫ is optimal. [sent-234, score-1.234]
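For a rough sense of how this upper bound scales, the quantity inside the O(·) can be evaluated directly. The sketch below ignores the absolute constant hidden by the O(·), which is not recoverable from the statement above; upper_bound_scaling is an illustrative name.

```python
import math

def upper_bound_scaling(sigma, lmbda, eps, d):
    """x * log(x) with x = d / (v_d * (sigma/2)^d * lambda * eps^2): the quantity inside
    the O(.) of the sample-size upper bound; absolute constants are ignored."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    x = d / (v_d * (sigma / 2) ** d * lmbda * eps ** 2)
    return x * math.log(x)
```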
89 Then there exist: an input space X ⊂ Rd ; a finite family of densities Θ = {θi } on X ; subsets Ai , A′i , Si ⊂ X such that Ai and A′i are (σ, ǫ)-separated by Si for density θi , and inf x∈Ai,σ ∪A′i,σ θi (x) ≥ λ, with the following additional property. [sent-237, score-0.272]
90 Consider any algorithm that is given n samples Xn from some θi ∈ Θ and, with probability at least 1/2, outputs a tree in which the smallest cluster containing Ai ∩ Xn is disjoint from the smallest cluster containing A′i ∩ Xn . [sent-241, score-0.754]
91 Then n = Ω( 1/(vd σ d λǫ2 d1/2 ) · log (1/(vd σ d λ)) ). [sent-242, score-0.935]
92 X is made up of two disjoint regions: a cylinder X0 , and an additional region X1 whose sole purpose is as a repository for excess probability mass. [sent-244, score-0.199]
93 Let Bd−1 be the unit ball in Rd−1 , and let σBd−1 be this same ball scaled to have radius σ. [sent-245, score-0.268]
94 The cylinder X0 stretches along the x1 -axis; its cross-section is σBd−1 and its length is 4(c + 1)σ for some c > 1 to be specified: X0 = [0, 4(c + 1)σ] × σBd−1 . [sent-246, score-0.159]
95 Since the cross-sectional area of the cylinder is vd−1 σ d−1 , the total mass here is λvd−1 σ d (4(c + 1) − 2). [sent-253, score-0.191]
96 [Figure: density profile along the cylinder: λ(1 − ǫ) on a section of width 2σ, λ elsewhere, plus a point mass of 1/2c.] For any i ≠ j, the densities θi and θj differ only on the cylindrical sections (4σi + σ, 4σi + 3σ) × σBd−1 and (4σj + σ, 4σj + 3σ) × σBd−1 , which are disjoint and each have volume 2vd−1 σ d . [sent-262, score-0.508]
97 Now define the clusters and separators as follows: for each 1 ≤ i ≤ c − 1, • Ai is the line segment [σ, 4σi] along the x1 -axis, • A′i is the line segment [4σ(i + 1), 4(c + 1)σ − σ] along the x1 -axis, and • Si = {4σi + 2σ} × σBd−1 is the cross-section of the cylinder at location 4σi + 2σ. [sent-265, score-0.418]
98 It can be checked that Ai and A′i are (σ, ǫ)-separated by Si in density θi . [sent-267, score-0.131]
99 Fast rates for plug-in estimators of density level sets. [sent-319, score-0.17]
100 A generalized single linkage method for estimating the cluster tree of a density. [sent-335, score-0.549]
wordName wordTfidf (topN-words)
[('vd', 0.438), ('gr', 0.249), ('cf', 0.239), ('cluster', 0.232), ('xn', 0.23), ('slab', 0.203), ('bd', 0.19), ('clusters', 0.168), ('connected', 0.164), ('linkage', 0.16), ('tree', 0.157), ('xi', 0.155), ('rd', 0.133), ('density', 0.131), ('cfn', 0.116), ('cylinder', 0.116), ('resp', 0.116), ('fn', 0.111), ('consistency', 0.103), ('ball', 0.1), ('path', 0.097), ('kd', 0.095), ('roof', 0.088), ('wishart', 0.083), ('lemma', 0.082), ('clustering', 0.078), ('hartigan', 0.076), ('inf', 0.076), ('mass', 0.075), ('fano', 0.07), ('continuum', 0.07), ('radius', 0.068), ('ai', 0.067), ('densities', 0.065), ('distance', 0.064), ('hierarchy', 0.063), ('dasgupta', 0.062), ('log', 0.059), ('disjoint', 0.058), ('connectedness', 0.058), ('percolation', 0.058), ('subpartition', 0.058), ('disconnected', 0.055), ('rk', 0.054), ('pick', 0.054), ('furthest', 0.051), ('stuetzle', 0.051), ('separation', 0.05), ('neighbors', 0.05), ('appendix', 0.05), ('points', 0.048), ('si', 0.047), ('cn', 0.046), ('restriction', 0.043), ('along', 0.043), ('distinguished', 0.042), ('nearest', 0.041), ('ll', 0.04), ('width', 0.04), ('rates', 0.039), ('notion', 0.038), ('open', 0.037), ('wong', 0.037), ('vc', 0.037), ('supx', 0.037), ('nition', 0.035), ('convergence', 0.035), ('theorem', 0.035), ('fractional', 0.034), ('graphs', 0.034), ('whenever', 0.034), ('uc', 0.033), ('kn', 0.033), ('neighbor', 0.032), ('proved', 0.032), ('lies', 0.032), ('diego', 0.031), ('likewise', 0.03), ('big', 0.029), ('hierarchical', 0.027), ('say', 0.027), ('suppose', 0.026), ('consistent', 0.025), ('buffer', 0.025), ('penrose', 0.025), ('masses', 0.025), ('kamalika', 0.025), ('sole', 0.025), ('assouad', 0.025), ('hausdorff', 0.025), ('kruskal', 0.025), ('rinaldo', 0.025), ('least', 0.025), ('smallest', 0.025), ('sample', 0.025), ('within', 0.025), ('volume', 0.025), ('weaker', 0.024), ('segment', 0.024), ('point', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 223 nips-2010-Rates of convergence for the cluster tree
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1
2 0.16641653 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
Author: Hairong Liu, Longin J. Latecki, Shuicheng Yan
Abstract: In this paper, we regard clustering as ensembles of k-ary affinity relations and clusters correspond to subsets of objects with maximal average affinity relations. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogenous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k ≥ 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.
3 0.15249744 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
Author: Sofia Mosci, Silvia Villa, Alessandro Verri, Lorenzo Rosasco
Abstract: We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in [12], where the group lasso penalty is generalized to overlapping groups of variables. While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows to deal with high dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations. 1
4 0.13695034 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
5 0.12000661 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
6 0.11881509 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
7 0.10970324 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
8 0.10604835 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
9 0.10469947 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
10 0.096573636 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
11 0.095486306 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
12 0.093573317 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
13 0.092419505 155 nips-2010-Learning the context of a category
14 0.089708276 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
15 0.087163568 63 nips-2010-Distributed Dual Averaging In Networks
16 0.08147829 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
17 0.080152981 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
18 0.080080397 268 nips-2010-The Neural Costs of Optimal Control
19 0.079181015 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
20 0.077702552 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
topicId topicWeight
[(0, 0.214), (1, 0.05), (2, 0.124), (3, 0.067), (4, -0.033), (5, 0.002), (6, -0.031), (7, -0.037), (8, 0.146), (9, 0.029), (10, 0.023), (11, -0.095), (12, -0.106), (13, -0.153), (14, 0.177), (15, -0.158), (16, 0.161), (17, 0.052), (18, -0.013), (19, -0.101), (20, 0.027), (21, -0.091), (22, 0.043), (23, 0.031), (24, -0.076), (25, -0.012), (26, 0.022), (27, 0.016), (28, -0.069), (29, -0.004), (30, -0.039), (31, -0.003), (32, 0.036), (33, 0.09), (34, -0.028), (35, 0.001), (36, 0.029), (37, 0.031), (38, -0.026), (39, 0.089), (40, 0.032), (41, -0.041), (42, -0.072), (43, 0.008), (44, -0.065), (45, 0.044), (46, -0.101), (47, -0.055), (48, 0.06), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9647277 223 nips-2010-Rates of convergence for the cluster tree
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1
2 0.6945594 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
3 0.6838246 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
4 0.66093117 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
Author: Hairong Liu, Longin J. Latecki, Shuicheng Yan
Abstract: In this paper, we regard clustering as ensembles of k-ary affinity relations and clusters correspond to subsets of objects with maximal average affinity relations. The average affinity relation of a cluster is relaxed and well approximated by a constrained homogenous function. We present an efficient procedure to solve this optimization problem, and show that the underlying clusters can be robustly revealed by using priors systematically constructed from the data. Our method can automatically select some points to form clusters, leaving other points un-grouped; thus it is inherently robust to large numbers of outliers, which has seriously limited the applicability of classical methods. Our method also provides a unified solution to clustering from k-ary affinity relations with k ≥ 2, that is, it applies to both graph-based and hypergraph-based clustering problems. Both theoretical analysis and experimental results show the superiority of our method over classical solutions to the clustering problem, especially when there exists a large number of outliers.
5 0.56512356 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
6 0.54210871 220 nips-2010-Random Projection Trees Revisited
7 0.52485704 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
8 0.51232839 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
9 0.49726504 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
10 0.48878905 155 nips-2010-Learning the context of a category
12 0.47122309 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
13 0.46364498 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
14 0.45091638 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
15 0.44807005 221 nips-2010-Random Projections for $k$-means Clustering
16 0.43291804 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
17 0.42757672 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
18 0.42625746 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
19 0.42175549 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
20 0.41555572 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
topicId topicWeight
[(13, 0.028), (27, 0.045), (30, 0.081), (35, 0.011), (45, 0.141), (50, 0.02), (52, 0.028), (60, 0.496), (77, 0.045), (90, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.89028496 223 nips-2010-Rates of convergence for the cluster tree
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1
2 0.84440166 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan
Abstract: We propose a new LP relaxation for obtaining the MAP assignment of a binary MRF with pairwise potentials. Our relaxation is derived from reducing the MAP assignment problem to an instance of a recently proposed Bipartite Multi-cut problem where the LP relaxation is guaranteed to provide an O(log k) approximation where k is the number of vertices adjacent to non-submodular edges in the MRF. We then propose a combinatorial algorithm to efficiently solve the LP and also provide a lower bound by concurrently solving its dual to within an approximation. The algorithm is up to an order of magnitude faster and provides better MAP scores and bounds than the state of the art message passing algorithm of [1] that tightens the local marginal polytope with third-order marginal constraints. 1
3 0.83876175 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
Author: Tobias Glasmachers
Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1
4 0.78592753 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
5 0.72515124 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
6 0.64902371 229 nips-2010-Reward Design via Online Gradient Ascent
7 0.52573037 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
8 0.52028662 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
9 0.50929248 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
10 0.50152802 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
11 0.49952608 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
12 0.49655128 4 nips-2010-A Computational Decision Theory for Interactive Assistants
13 0.49454504 102 nips-2010-Generalized roof duality and bisubmodular functions
14 0.49309382 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
15 0.4921242 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
16 0.49051678 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
17 0.48586711 220 nips-2010-Random Projection Trees Revisited
18 0.48327574 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
19 0.48326093 222 nips-2010-Random Walk Approach to Regret Minimization
20 0.48163873 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates