nips nips2011 nips2011-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. [sent-8, score-0.442]
2 Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. [sent-10, score-0.365]
3 1 Introduction We design improved algorithms for Euclidean k-means in the streaming model. [sent-13, score-0.297]
4 Our goal is to select k points in this space to designate as facilities (sometimes called centers or means); the overall cost of the solution is the sum of the squared distances from each point to its nearest facility. [sent-15, score-0.304]
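As a point of reference for this objective, here is a minimal sketch (not from the paper; the function name and array layout are our own) of the Θ(nk) cost computation that the abstract's o(nk) running time is measured against:

```python
import numpy as np

def kmeans_cost(points, facilities):
    """Sum of squared distances from each point to its nearest facility.

    points: (n, d) array; facilities: (k, d) array.  This brute-force
    evaluation takes Theta(nk) distance computations.
    """
    diffs = points[:, None, :] - facilities[None, :, :]   # shape (n, k, d)
    sq_dists = np.einsum('nkd,nkd->nk', diffs, diffs)     # squared distances, shape (n, k)
    return float(sq_dists.min(axis=1).sum())
```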
5 In the streaming model, we require that the point set be read sequentially, and that our algorithm stores very few points at any given time. [sent-17, score-0.412]
6 Many problems which are easy to solve in the standard batch-processing model require more complex techniques in the streaming model (a survey of streaming results is available [3]); nonetheless there are a number of existing streaming approximations for Euclidean k-means. [sent-18, score-0.891]
7 We present a new algorithm for the problem based on [9] with several significant improvements; we are able to prove a faster worst-case running time and a better approximation factor. [sent-19, score-0.228]
8 Many of the applications for k-means have experienced a large growth in data that has overtaken the amount of memory typically available to a computer. [sent-24, score-0.221]
9 This is expressed in the streaming model, where an algorithm must make one (or very few) passes through the data, reflecting cases where random access to the data is unavailable, such as a very large file on a hard disk. [sent-25, score-0.392]
10 They "guess" the cost of the optimum, then run the online facility location algorithm of [24] until either the total cost of the solution exceeds a constant times the guess or the total number of facilities exceeds some computed value κ. [sent-28, score-1.364]
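The online facility location subroutine of [24] invoked here has a very short core; the following is a hedged sketch in our own notation (squared distances via a user-supplied dist2, facility cost f), omitting the bookkeeping the full algorithm needs:

```python
import random

def ofl_step(x, facilities, f, dist2):
    """One online facility location step: open x as a new facility with
    probability min(dist^2 / f, 1); otherwise assign x to its nearest facility."""
    if facilities:
        nearest = min(facilities, key=lambda c: dist2(x, c))
        d2 = dist2(x, nearest)
    else:
        nearest, d2 = None, float('inf')
    if random.random() < min(d2 / f, 1.0):
        facilities.append(x)   # x becomes a facility and serves itself at cost 0
        return x
    return nearest             # x is served by its nearest existing facility
```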
11 They then declare the end of a phase, increase the guess, consolidate the facilities via matching, and continue with the next point. [sent-29, score-0.415]
12 When the stream has been exhausted, the algorithm has some κ facilities, which are then consolidated down to k. [sent-30, score-0.235]
13 They then run a ball k-means step (similar to [25]) by maintaining samples of the points assigned to each facility and moving the facilities to the centers of mass of these samples. [sent-31, score-1.026]
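The ball k-means step mentioned above can be sketched as follows (a simplification of ours: we cluster one batch of points rather than the per-facility samples maintained in the stream, and the one-third radius rule follows [25]; all names are illustrative and k ≥ 2 is assumed):

```python
import numpy as np

def ball_kmeans_step(points, centers):
    """One ball k-means pass: each center moves to the mean of only those assigned
    points lying within one third of the distance to the nearest other center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
    assign = d2.argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        others = np.delete(centers, j, axis=0)                       # assumes k >= 2
        radius = np.sqrt(((others - centers[j]) ** 2).sum(-1)).min() / 3.0
        members = points[assign == j]
        in_ball = members[np.sqrt(((members - centers[j]) ** 2).sum(-1)) <= radius]
        if len(in_ball) > 0:
            new_centers[j] = in_ball.mean(axis=0)                    # move to the ball's center of mass
    return new_centers
```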
14 Assuming σ-separability (see Section 2 for the definition), they use ball k-means to improve the approximation factor to 1 + O(σ²). [sent-34, score-0.132]
15 The approximation factor is in the hundreds, and the O(k log n) memory requirement has sufficiently high constants that there are actually more than n facilities for many of the data sets analyzed in previous papers. [sent-36, score-0.818]
16 We improve the manner by which the algorithm determines a better facility cost as the stream is processed, removing unnecessary checks and allowing the user to parametrize what remains. [sent-40, score-0.871]
17 We remove the end-of-phase condition based on the total cost, ending phases only when the number of facilities exceeds κ. [sent-43, score-0.449]
18 We also simplify the transition between phases, observing that it's quite simple to bound the number of phases by log OPT (where OPT is the optimum k-means cost), and that in practice this number of phases is usually quite a bit less than n. [sent-45, score-0.243]
19 Our proof is based on a much tighter bound on the cost incurred per phase, along with a more flexible definition of the "critical phase" by which the algorithm should terminate. [sent-47, score-0.202]
20 For appropriately chosen constants our approximation factor will be roughly 17, substantially less than the factor claimed in [9] prior to the ball k-means step. [sent-49, score-0.245]
21 In addition, we apply approximate nearest-neighbor algorithms to compute the facility assignment of each point. [sent-50, score-0.5]
22 The running time of our algorithm is dominated by repeated nearest-neighbor calculations, and an appropriate technique can change our running time from Θ(nk log n) to Θ(n(log k + log log n)), an improvement for most values of k. [sent-51, score-0.231]
23 Note that our final running time is actually faster than the Θ(nk) time needed to compute the k-means cost of a given set of facilities! [sent-53, score-0.319]
24 This allows us to compare our algorithm to previous [4, 2] streaming k-means results. [sent-55, score-0.333]
25 These become the new facilities for the next phase, and the process repeats until it is stable. [sent-60, score-0.38]
26 Unfortunately, Lloyd's algorithm has no provable approximation bound, and arbitrarily bad examples exist. [sent-61, score-0.129]
27 They defined the notion of σ-separability, where the input to k-means is said to be σ-separable if reducing the number of facilities from k to k - 1 would increase the cost of the optimum solution by a factor of 1/σ². They designed an algorithm with approximation ratio 1 + O(σ²). [sent-68, score-0.794]
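For reference, the separability condition described above can be written as follows (a reconstruction in the style of Ostrovsky et al.; the exact constant conventions differ between papers), where Δ_k(X) denotes the optimum k-means cost of X:

```latex
X \text{ is } \sigma\text{-separable}
\;\iff\;
\Delta_k(X) \le \sigma^2 \, \Delta_{k-1}(X)
\quad\text{(equivalently, } \Delta_{k-1}(X)/\Delta_k(X) \ge 1/\sigma^2 \text{).}
```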
28 There are two basic approaches to the streaming version of the k-means problem. [sent-70, score-0.297]
29 Our approach is based on solving k-means as we go (thus at each point in the algorithm, our memory contains a current set of facilities). [sent-71, score-0.221]
30 Their work was based on guessing a lower bound on the optimum k-median cost and running O(log n) parallel versions of the online facility location algorithm of Meyerson [24] with facility cost based on the guessed lower bound. [sent-75, score-1.589]
31 When these parallel calls exceeded the approximation bounds, they would be terminated and the guessed lower bound on the optimum k-median cost would increase. [sent-76, score-0.312]
32 The recent paper of Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku [9] extended the result of [11] to k-means and improved the space bound to O(k log n) by proving high-probability bounds on the performance of online facility location. [sent-77, score-0.564]
33 This result also added a ball k-means step (as in [25]) to substantially improve the approximation factor under the assumption that the original data was σ-separable. [sent-78, score-0.177]
34 Another recent result for streaming k-means, due to Ailon, Jaiswal, and Monteleoni [4], is based on a divide and conquer approach, similar to the k-median algorithm of Guha, Meyerson, Mishra, Motwani, and O'Callaghan [16]. [sent-79, score-0.385]
35 Their experiment showed that this algorithm is an improvement over an online variant of Lloyd's algorithm and was comparable to the batch version of Lloyd's. [sent-81, score-0.167]
36 The other approach to streaming k-means is based on coresets: selecting a weighted subset of the original input points such that any k-means solution on the subset has roughly the same cost as on the original point set. [sent-82, score-0.513]
37 At any point in the algorithm, the memory should contain a weighted representative sample of the points. [sent-83, score-0.221]
38 2 Algorithm and Theory Both our algorithm and that of [9] are based on the online facility location algorithm of [24]. [sent-87, score-0.655]
39 For the facility location problem, the number of clusters is not part of the input (as it is for k-means), but rather a facility cost is given; an algorithm to solve this problem may have as many clusters as it desires in its output, simply by denoting some point as a facility. [sent-88, score-1.22]
40 The solution cost is then the sum of the resulting k-means cost ("service cost") and the total paid for facilities. [sent-89, score-0.305]
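In symbols, the facility location objective just described is (our notation: X is the point set, F the chosen facilities, and f the facility cost):

```latex
\mathrm{cost}(F) \;=\;
\underbrace{\sum_{x \in X} \min_{c \in F} \lVert x - c \rVert^2}_{\text{service cost}}
\;+\;
\underbrace{f \cdot |F|}_{\text{facility cost}}
```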
41 Our algorithm runs the online facility location algorithm of [24] with a small facility cost until we have more than κ ∈ Θ(k log n) facilities. [sent-90, score-1.32]
42 It then increases the facility cost, re-evaluates the current facilities, and continues with the stream. [sent-91, score-0.5]
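A compact sketch of this phase structure is below; the growth factor `beta` for the facility cost, the initial cost `f0`, and the way old facilities are re-inserted as weighted points are simplifications and assumptions of ours, not the paper's exact procedure (which, for instance, repeats the re-evaluation if κ is exceeded again).

```python
import math
import random

def streaming_phase_kmeans(stream, k, n, dist2, f0=1.0, beta=10.0):
    """Run online facility location over the stream in phases: once more than
    kappa = Theta(k log n) facilities are open, raise the facility cost and
    re-insert the current (weighted) facilities, then continue with the stream."""
    kappa = int(k * math.log(n)) + 1
    f = f0
    facilities = []                                  # list of [point, weight]

    def insert(p, w):
        if facilities:
            j = min(range(len(facilities)), key=lambda i: dist2(p, facilities[i][0]))
            d2 = w * dist2(p, facilities[j][0])
        else:
            j, d2 = None, float('inf')
        if random.random() < min(d2 / f, 1.0):
            facilities.append([p, w])                # open a new facility at p
        else:
            facilities[j][1] += w                    # p is served by (and adds weight to) facility j

    for x in stream:
        insert(x, 1)
        if len(facilities) > kappa:                  # end of phase
            old, facilities, f = facilities, [], f * beta
            for q, w in old:                         # re-evaluate old facilities at the new cost
                insert(q, w)

    return facilities      # roughly kappa weighted facilities, to be consolidated down to k
```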
43 We ignore the overall service cost in determining when to end a phase and raise our facility cost f. [sent-95, score-0.909]
44 Further, the number of facilities which must open to end a phase can be any κ ∈ Θ(k log n); the constants do not depend directly on the competitive ratio of online facility location (as they did in [9]). [sent-96, score-1.136]
45 Our constant is substantially smaller than those implicit in [9], with most of the loss occurring in the final non-streaming k-means algorithm to consolidate κ means down to k. [sent-99, score-0.216]
46 Suppose that our algorithm completes the data stream when the facility cost is f. [sent-102, score-0.871]
47 Then the overall solution prior to the final re-clustering has expected service cost at most O(κf), and the probability of being within 1 + ε of the expected service cost is at least 1 - 1/poly(n). [sent-103, score-0.551]
48 With probability at least 1 - 1/poly(n), the algorithm will either halt with f ≤ Θ(C*/κ)β, where C* is the optimum k-means cost, or it will halt within one phase of exceeding this value. [sent-105, score-0.246]
49 In practice, we would expect the performance of online facility location to be substantially better than worst-case (in fact, if the ordering of points in the stream is non-adversarial there is a proof to this effect in [24]); in addition the assumption was made that distances add (i. [sent-108, score-0.874]
50 We also assumed that using more than k facilities does not substantially help the optimum service cost (also unlikely to be true for real data). [sent-111, score-0.71]
51 Combining these, it would be unsurprising if our service cost was actually better than optimum at the end of the data stream (of course, we used many more facilities than optimum, so it is not precisely a fair comparison). [sent-112, score-0.864]
52 4 We note that if we run the streaming part of the algorithm NI times in parallel, we can take the solution with the smallest final facility cost. [sent-116, score-0.994]
53 Increasing NI can increase both the memory and running time requirements. [sent-122, score-0.525]
54 A theoretically sound approach involves mapping these means back to randomly selected points from the original set (these can be maintained in a streaming manner) and then approximating k-means on κ points using a non-streaming algorithm. [sent-124, score-0.391]
55 The overall approximation ratio will be twice the ratio established by our algorithm (we lose a factor of two by mapping back to the original points) plus the approximation ratio for the non-streaming algorithm. [sent-125, score-0.314]
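Reading that sentence as a formula (our notation: ρ for the ratio established by the streaming stage and α for the ratio of the non-streaming k-means algorithm used at the end):

```latex
\text{overall approximation ratio} \;\le\; 2\rho + \alpha
```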
56 This requires as many as κ distance computations; there are a number of results enabling fast computation of approximate nearest neighbors and applying these results will improve our running time. [sent-129, score-0.141]
57 We store our facilities sorted by their inner product with w. [sent-137, score-0.38]
58 When a new point x arrives, instead of taking O(κ) time to determine its (exact) nearest neighbor, we use O(log κ) time to find the two facilities between which x falls in this sorted order. [sent-138, score-0.438]
59 We determine the (exact) closer of these two facilities; this determines the value of δ in lines 5 and 17 and the "closest" facility in lines 9 and 21. [sent-140, score-0.5]
60 If our approximate nearest neighbor computation finds a facility whose distance is at most v times the distance to the closest facility in expectation, then the approximation ratio increases by a constant factor. [sent-142, score-1.241]
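A minimal sketch of this projection-based candidate search follows; the class and its details (a single random direction w, a sorted Python list with bisect for the O(log κ) lookup) are our assumptions, a balanced search tree would be needed to make insertions logarithmic as well, and at least one facility must have been added before querying.

```python
import bisect
import numpy as np

class ProjectedFacilityIndex:
    """Facilities kept sorted by their inner product with a random direction w;
    a query point is compared exactly only against the two facilities that
    bracket it in this ordering."""

    def __init__(self, dim, seed=0):
        self.w = np.random.default_rng(seed).standard_normal(dim)
        self.keys = []    # sorted inner products <w, c>
        self.facs = []    # facilities, stored in the same order as keys

    def add(self, c):
        key = float(self.w @ c)
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)      # O(kappa) list insert; a balanced tree would be O(log kappa)
        self.facs.insert(i, np.asarray(c))

    def approx_nearest(self, x):
        key = float(self.w @ x)
        i = bisect.bisect_left(self.keys, key)                    # O(log kappa) search
        cand = [self.facs[j] for j in (i - 1, i) if 0 <= j < len(self.facs)]
        return min(cand, key=lambda c: float(((np.asarray(x) - c) ** 2).sum()))
```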
61 We selected data sets which have been used previously to demonstrate streaming algorithms. [sent-147, score-0.297]
62 The main motivation for streaming is very large data sets, so we are more interested in sets that might be difficult to fit in main memory, and we focused on the largest examples. [sent-149, score-0.553]
63 We took the best observed cost for each value of k, and found the four values of k minimizing the ratio of k-means cost to (k - 1)-means cost. [sent-159, score-0.314]
64 Instead, we ran a modified version of our algorithm; at the end of a phase, it adjusts the facility cost and restarts the stream. [sent-161, score-0.675]
65 First, the memory is not configurable, so it does not fit into the common baseline that we will define shortly. [sent-166, score-0.221]
66 Second, the memory requirements and runtime, while asymptotically nice, have large leading constants that cause it to be impractical. In fact, it was an attempt to implement this algorithm that initially motivated the work on this paper. [sent-167, score-0.295]
67 1 Implementation Discussion The divide and conquer ("D&C") algorithm [4] can use its available memory in two possible ways. [sent-169, score-0.309]
68 First, it can use the entire amount to read from the stream, writing the results of computing their 3k log k means to disk; when the stream is exhausted, this file is treated as a stream, until an iteration produces a file that fits entirely into main memory. [sent-170, score-0.378]
69 Alternately, the available memory could be partitioned into layers; the first layer would be filled by reading from the stream, and the weighted facilities produced would be stored in the second. [sent-171, score-0.601]
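A rough sketch of the first (spill-to-disk) variant described above, with a black-box `batch_cluster(points, m)` that returns m weighted representatives; the in-memory list standing in for the on-disk file and the parameter names are our assumptions, and num_reps is assumed smaller than batch_size.

```python
from itertools import islice

def divide_and_conquer(stream, k, batch_size, num_reps, batch_cluster):
    """Read weighted points batch by batch, reduce each batch to num_reps weighted
    representatives (the paper cites 3k log k), spill them, and repeat over the
    spilled 'file' until one batch remains."""
    level = ((x, 1) for x in stream)                 # stream of (point, weight) pairs
    while True:
        spilled = []                                 # stands in for the on-disk file
        it = iter(level)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            spilled.extend(batch_cluster(batch, num_reps))
        if len(spilled) <= batch_size:
            return batch_cluster(spilled, k)         # final consolidation to k means
        level = spilled
```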
70 Upon completion of the stream, any remaining points are gathered and clustered to produce k final means. [sent-173, score-0.147]
71 When larger amounts of memory are available, the latter method is preferred. [sent-174, score-0.221]
72 As our goal is to judge streaming algorithms under low memory conditions, we used the first approach, which is more fitting to such a constraint. [sent-176, score-0.518]
73 9 GHz and with 6 GB of main memory (although nowhere near the entirety of this was used by any algorithm). [sent-179, score-0.221]
74 With all algorithms, the reported cost is determined by taking the resulting k facilities and computing the k-means cost across the entire dataset. [sent-181, score-0.652]
75 The time to compute this cost is not included in the reported running times of the algorithms. [sent-182, score-0.219]
76 Instead of just comparing for the same dataset and cluster count, we further constrained each to use the same amount of memory (in terms of number of points stored in random access). [sent-186, score-0.268]
77 The memory constraints were chosen to reflect the usage of small amounts of memory that are close to the algorithms' designers' specifications, where possible. [sent-187, score-0.442]
78 Ailon et al. [4] suggest √(nk) memory for the batch process; this memory availability is marked in the charts by an asterisk. [sent-188, score-0.502]
79 The suggestion from [2] for a coreset of size 200k was not run for all algorithms, as the amount of memory necessary for computing a coreset of this size is much larger than the other cases, and our goal is to compare the algorithms at a small memory limit. [sent-189, score-0.712]
80 This does produce a drop in solution quality compared to running the algorithm at their suggested parameters, although their approach remains competitive. [sent-190, score-0.152]
81 Finally, our algorithm suggests memory of κ = k log n or a small constant times the same. [sent-191, score-0.286]
82 In each case, the memory constraint dictates the parameters; for the divide and conquer algorithm, this is simply the batch size. [sent-192, score-0.333]
83 Our algorithm is a little more parametrizable; when M memory is available, we allowed κ = M/5 and each facility to have four samples. [sent-194, score-0.757]
84 See …edu/~shindler/ to access code for our algorithms. Figure 1: Census Data, k=8, cost. Figure 2: Census Data, k=8, time. [sent-198, score-0.254]
85 Figure 3: Census Data, k=12, cost. Figure 4: Census Data, k=12, time. [sent-199, score-0.136]
86 Figure 5: BigCross Data, k=13, cost. Figure 6: BigCross Data, k=13, time. Figure 7: BigCross Data, k=24, cost. Figure 8: BigCross Data, k=24, time. [sent-202, score-0.272]
87 Furthermore, our algorithm stands to gain the most by improved solutions to batch k-means, due to the better representative sample present after the stream is processed. [sent-205, score-0.295]
88 The prohibitively high running time of the divide-and-conquer algorithm [4] is due to the many repeated instances of running their k-means# algorithm on each batch of the given size. [sent-206, score-0.298]
89 The decline in accuracy for StreamKM++ at very low memory can be partially explained by the Θ(k² log⁸ n) points' worth of memory needed for a strong guarantee in previous theory work [12]. [sent-213, score-0.442]
90 However, the fact that the algorithm is able to achieve a good approximation in practice while using far less than that amount of memory suggests that improved provable bounds for coreset algorithms may be on the horizon. [sent-214, score-0.471]
91 We should note that the performance of the algorithm declines sharply as the gap between the available memory and the authors' specification grows, but it gains accuracy as the memory grows. [sent-215, score-0.478]
92 The final approximation ratios can be described as α(1 + ε), where α is the loss from the final batch algorithm. [sent-217, score-0.321]
93 The coreset ε is a direct function of the memory allowed to the algorithm, and can be made arbitrarily small. [sent-218, score-0.342]
94 However, the memory needed to provably reduce ε to a small constant is quite substantial, and while StreamKM++ does produce a good resulting clustering, it is not immediately clear that the discovery of better batch k-means algorithms would improve their solution quality. [sent-219, score-0.314]
95 Our algorithm's ε represents the ratio of the cost of our κ-means solution to the cost of the optimum k-means solution. [sent-220, score-0.423]
96 For our algorithm, the observed value of 1 + ε has been typically between 1 and 3, whereas the D&C approach did not yield one better than 24, and was high (low thousands) for the very low memory conditions. [sent-222, score-0.221]
97 The coreset algorithm was the worst, with even the best values in the mid ten figures (tens to hundreds of billions). [sent-223, score-0.157]
98 The low ratio for our algorithm also suggests that our κ facilities are a good sketch of the overall data, and thus our observed accuracy can be expected to improve as more accurate batch k-means algorithms are discovered. [sent-224, score-0.518]
99 Local search heuristic for k-median and facility location problems. [sent-254, score-0.548]
100 On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. [sent-266, score-0.129]
wordName wordTfidf (topN-words)
[('facility', 0.5), ('facilities', 0.38), ('streaming', 0.297), ('memory', 0.221), ('stream', 0.199), ('meyerson', 0.157), ('bigcross', 0.137), ('cost', 0.136), ('coreset', 0.121), ('lloyd', 0.121), ('shindler', 0.118), ('streamkm', 0.118), ('census', 0.111), ('final', 0.1), ('running', 0.083), ('mishra', 0.079), ('ostrovsky', 0.079), ('sohler', 0.079), ('optimum', 0.076), ('christian', 0.073), ('service', 0.073), ('phases', 0.069), ('guha', 0.069), ('motwani', 0.069), ('adam', 0.067), ('nk', 0.067), ('clustering', 0.066), ('phase', 0.064), ('coresets', 0.063), ('approximation', 0.061), ('batch', 0.06), ('file', 0.059), ('liadan', 0.059), ('sufficiently', 0.059), ('nearest', 0.058), ('ailon', 0.052), ('conquer', 0.052), ('exhausted', 0.052), ('location', 0.048), ('stoc', 0.048), ('pol', 0.048), ('significant', 0.048), ('points', 0.047), ('neighbor', 0.046), ('substantially', 0.045), ('ann', 0.042), ('ratio', 0.042), ('ball', 0.041), ('ackermann', 0.039), ('alenex', 0.039), ('badoiu', 0.039), ('guessed', 0.039), ('iki', 0.039), ('jaiswal', 0.039), ('lammersen', 0.039), ('martens', 0.039), ('minyek', 0.039), ('modified', 0.039), ('rabani', 0.039), ('rafail', 0.039), ('raupach', 0.039), ('roytman', 0.039), ('sariel', 0.039), ('schulman', 0.039), ('setk', 0.039), ('sudipto', 0.039), ('unread', 0.039), ('vassilvitskii', 0.039), ('constants', 0.038), ('arthur', 0.037), ('algorithm', 0.036), ('online', 0.035), ('scg', 0.035), ('consolidate', 0.035), ('difficult', 0.035), ('halt', 0.035), ('kanungo', 0.035), ('netanyahu', 0.035), ('nina', 0.035), ('piatko', 0.035), ('rajeev', 0.035), ('silverman', 0.035), ('soda', 0.034), ('closest', 0.034), ('solution', 0.033), ('guess', 0.032), ('read', 0.032), ('provable', 0.032), ('focs', 0.032), ('braverman', 0.032), ('charikar', 0.032), ('mount', 0.032), ('logn', 0.032), ('piotr', 0.032), ('factor', 0.03), ('definition', 0.03), ('centers', 0.03), ('log', 0.029), ('run', 0.028), ('streams', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
2 0.2073745 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
3 0.068067014 231 nips-2011-Randomized Algorithms for Comparison-based Search
Author: Dominique Tschopp, Suhas Diggavi, Payam Delgosha, Soheil Mohajer
Abstract: This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D which defines approximate triangle n inequalities on ranks. We present a lower bound of Ω(D log D + D2 ) average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst case behavior. We develop 3 a randomized scheme for NN retrieval in O(D3 log2 n + D log2 n log log nD ) 3 questions. The learning requires asking O(nD3 log2 n + D log2 n log log nD ) questions and O(n log2 n/ log(2D)) bits to store.
4 0.056444407 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.04579027 54 nips-2011-Co-regularized Multi-view Spectral Clustering
Author: Abhishek Kumar, Piyush Rai, Hal Daume
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
6 0.045277052 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
7 0.045054857 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
8 0.043952897 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
9 0.043421425 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
10 0.043262489 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
11 0.042475782 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
12 0.041677542 198 nips-2011-On U-processes and clustering performance
13 0.041619502 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
14 0.041337717 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
15 0.038711213 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
16 0.037564643 64 nips-2011-Convergent Bounds on the Euclidean Distance
17 0.036556214 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
18 0.036508992 29 nips-2011-Algorithms and hardness results for parallel large margin learning
19 0.035690676 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
20 0.034760311 202 nips-2011-On the Universality of Online Mirror Descent
topicId topicWeight
[(0, 0.124), (1, -0.007), (2, -0.023), (3, -0.014), (4, -0.002), (5, 0.01), (6, -0.026), (7, -0.006), (8, -0.027), (9, -0.039), (10, 0.018), (11, 0.054), (12, -0.007), (13, -0.093), (14, 0.108), (15, 0.044), (16, -0.041), (17, 0.037), (18, -0.023), (19, -0.01), (20, 0.037), (21, 0.017), (22, 0.021), (23, -0.024), (24, -0.013), (25, 0.014), (26, 0.039), (27, 0.096), (28, 0.088), (29, 0.008), (30, -0.046), (31, -0.035), (32, 0.005), (33, 0.061), (34, -0.053), (35, 0.054), (36, 0.068), (37, -0.006), (38, 0.014), (39, -0.015), (40, -0.04), (41, -0.059), (42, -0.041), (43, 0.03), (44, 0.072), (45, -0.027), (46, -0.032), (47, 0.107), (48, 0.092), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.89641428 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
2 0.79829937 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
3 0.61116964 64 nips-2011-Convergent Bounds on the Euclidean Distance
Author: Yoonho Hwang, Hee-kap Ahn
Abstract: Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of vectors in V . For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V . Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides upper and lower bounds of Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined further in such a way to converge monotonically to the exact Euclidean distance within d refinement steps. An analysis on a random sequence of refinement steps shows that the MS-distance provides very tight bounds in only a few refinement steps. The MS-distance can be used to various applications where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches. 1
4 0.52961403 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
Author: Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari
Abstract: We present a computationally efficient technique to compute the distance of highdimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. 1
5 0.51296741 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
Author: Kumar Sricharan, Alfred O. Hero
Abstract: Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BPkNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1OkNNG detectors. We show that BP-kNNG is asymptotically consistent in recovering the p-value of each test point. Experimental results are given that illustrate the superior performance of BP-kNNG as compared to the L1O-kNNG and other state of the art anomaly detection schemes.
6 0.48101076 186 nips-2011-Noise Thresholds for Spectral Clustering
7 0.47273663 198 nips-2011-On U-processes and clustering performance
8 0.43519107 157 nips-2011-Learning to Search Efficiently in High Dimensions
9 0.43456945 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
10 0.43401706 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
11 0.43229377 231 nips-2011-Randomized Algorithms for Comparison-based Search
12 0.4293901 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
13 0.4292219 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies
14 0.40651762 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
15 0.40406778 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
16 0.4034273 299 nips-2011-Variance Penalizing AdaBoost
17 0.40113002 263 nips-2011-Sparse Manifold Clustering and Embedding
18 0.39877281 111 nips-2011-Hashing Algorithms for Large-Scale Learning
19 0.39220074 20 nips-2011-Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
20 0.38895255 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
topicId topicWeight
[(0, 0.023), (4, 0.055), (20, 0.022), (26, 0.035), (28, 0.342), (31, 0.092), (33, 0.02), (43, 0.046), (45, 0.101), (57, 0.066), (74, 0.035), (83, 0.029), (99, 0.048)]
simIndex simValue paperId paperTitle
1 0.76627922 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
Author: Liang Xiong, Barnabás Póczos, Jeff G. Schneider
Abstract: An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies. We evaluate the effectiveness of FGM on both synthetic and real data sets including images and turbulence data, and show that it is superior to existing approaches in detecting group anomalies. 1
same-paper 2 0.70518655 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
3 0.63718116 226 nips-2011-Projection onto A Nonnegative Max-Heap
Author: Jun Liu, Liang Sun, Jieping Ye
Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative maxheap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal roottree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p2 ) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1
4 0.63509184 156 nips-2011-Learning to Learn with Compound HD Models
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
5 0.52226716 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua
Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1
6 0.45936885 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
7 0.45718923 64 nips-2011-Convergent Bounds on the Euclidean Distance
8 0.45680201 231 nips-2011-Randomized Algorithms for Comparison-based Search
9 0.45577666 180 nips-2011-Multiple Instance Filtering
10 0.45513472 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
11 0.45510191 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
12 0.45504987 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
13 0.45412719 229 nips-2011-Query-Aware MCMC
14 0.45358446 219 nips-2011-Predicting response time and error rates in visual search
15 0.45235828 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
16 0.45183963 32 nips-2011-An Empirical Evaluation of Thompson Sampling
17 0.45161551 66 nips-2011-Crowdclustering
18 0.45154944 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
19 0.45049012 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.44956499 30 nips-2011-Algorithms for Hyper-Parameter Optimization