jmlr jmlr2013 jmlr2013-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Reza Bosagh Zadeh, Ashish Goel
Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com. Keywords: cosine, Jaccard, overlap, dice, similarity, MapReduce, dimension independent
Reference: text
sentIndex sentText sentNum sentScore
1 We study Cosine, Dice, Overlap, and the Jaccard similarity measures. [sent-5, score-0.225]
2 For Jaccard similarity we include an improved version of MinHash. [sent-6, score-0.225]
3 Introduction Computing similarity between all pairs of vectors in large-scale data sets is a challenge. [sent-12, score-0.326]
4 Consider the problem of finding all pairs of similarities between D indicator (0/1 entries) vectors, each of dimension N. [sent-18, score-0.211]
5 In particular we focus on cosine similarities between all pairs of D vectors in R^N. [sent-19, score-0.302]
6 For example, typical values to compute similarities between all pairs of a subset of Twitter users can be: N = 10^9 (the universe of Twitter users), D = 10^7 (a subset of Twitter users), L = 1000. [sent-21, score-0.226]
7 In particular, in addition to correctness, we prove that to estimate similarities above ε, the shuffle size of the DISCO scheme is only O(DL log(D)/ε), with no dependence on the dimension N, hence the name. [sent-34, score-0.186]
8 This means as long as there are enough mappers to read the data, one can use the DISCO sampling scheme to make the shuffle size tractable. [sent-35, score-0.178]
9 We have also used the scheme to find highly similar pairs of words, representing each word by the indicator vector that signals in which tweets the word appears. [sent-38, score-0.335]
10 Our sampling scheme can be used to implement many other similarity measures. [sent-40, score-0.262]
11 We focus on four similarity measures: Cosine, Dice, Overlap, and the Jaccard similarity measures. [sent-41, score-0.45]
12 For Jaccard similarity we present an improved version of the well known MinHash scheme (Broder, 1997). [sent-42, score-0.262]
13 Third, we prove results on highly similar pairs, since common applications require thresholding the similarity score with a high threshold value. [sent-49, score-0.247]
14 For such applications of similarity, DISCO is particularly helpful since higher similarity pairs are estimated with provably better accuracy. [sent-51, score-0.326]
15 We are interested in similarity scores between pairs of words in a dictionary containing D words {w1, ..., wD}. [sent-81, score-0.454]
16 The number of documents in which two words wi and w j co-occur is denoted #(wi , w j ). [sent-85, score-0.226]
17 The number of documents in which a single word wi occurs is denoted #(wi ). [sent-86, score-0.254]
18 To each word in the dictionary, an N-dimensional indicator vector is assigned, indicating in which documents the word occurs. [sent-87, score-0.206]
19 We focus on 5 different similarity measures, including cosine similarity, which is very popular and produces high quality results across different domains (Chien and Immorlica, 2005; Chuang and Chien, 2005; Sahami and Heilman, 2006; Spertus et al. [sent-91, score-0.342]
20 Cosine similarity is simply the normalized dot product: #(x, y) / (√#(x) √#(y)), where #(x) = ∑_{i=1}^N x[i] and #(x, y) = ∑_{i=1}^N x[i] y[i]. [sent-93, score-0.278]
21 In addition to cosine similarity, we consider many variations of similarity scores that use the dot product. [sent-94, score-0.419]
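For concreteness, here is a minimal Python sketch (ours, not from the paper) of the four similarity measures computed from the counts #(x), #(y), and #(x, y); the function and argument names are illustrative only.

```python
import math

def cosine(n_x, n_y, n_xy):
    """Cosine similarity from occurrence counts: #(x,y) / (sqrt(#(x)) * sqrt(#(y)))."""
    return n_xy / (math.sqrt(n_x) * math.sqrt(n_y))

def dice(n_x, n_y, n_xy):
    """Dice similarity: 2 #(x,y) / (#(x) + #(y))."""
    return 2.0 * n_xy / (n_x + n_y)

def overlap(n_x, n_y, n_xy):
    """Overlap similarity: #(x,y) / min(#(x), #(y))."""
    return n_xy / min(n_x, n_y)

def jaccard(n_x, n_y, n_xy):
    """Jaccard similarity: #(x,y) / (#(x) + #(y) - #(x,y))."""
    return n_xy / (n_x + n_y - n_xy)
```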
22 We focus on shuffle size because after one trivially reads in the data via mappers, the shuffle phase is the bottleneck since our mappers and reducers are all linear in their input size. [sent-100, score-0.23]
23 We assume that the dictionary can fit into memory but that the number of dimensions (documents) is so large that many machines will be needed to even hold the documents on disk. [sent-113, score-0.182]
24 We are interested in several similarity measures outlined in Table 1. [sent-120, score-0.225]
25 Our goal is to compute similarity scores between all pairs of words, and prove accuracy results for pairs which have similarity above ε. [sent-122, score-0.676]
26 The naive approach is to first compute the dot product between all pairs of words in a Map-Reduce (Dean and Ghemawat, 2008) style framework. [sent-123, score-0.259]
27 Mappers act on each document, and emit key-value pairs of the form (wi , w j ) → 1. [sent-124, score-0.242]
28 The difficulty of computing the similarity scores lies almost entirely in computing #(wi , w j ). [sent-127, score-0.249]
29 Instead of emitting every pair, only some pairs are output, and instead of computing intermediary dot products, we directly compute the similarity score. [sent-133, score-0.401]
30 Related Work Previously the all-pairs similarity search problem has been studied (Broder et al., 1997). [sent-143, score-0.225]
31 There are many papers discussing the all pairs similarity computation problem. [sent-150, score-0.326]
32 The all-pairs similarity search problem has also been addressed in the database community, where it is known as the similarity join problem (Arasu et al., 2006). [sent-154, score-0.45]
33 In Baraglia et al. (2010), the authors consider the problem of finding highly similar pairs of documents in MapReduce. [sent-160, score-0.213]
34 In particular, there are two problems with pairwise similarity computation: large computation time with respect to the dimension, which we address, and large computation time with respect to the number of points. [sent-168, score-0.225]
35 A 2009 study proposed a highly scalable term similarity algorithm, implemented in the MapReduce framework, and deployed it on a Web crawl of over 200 billion words to compute pairwise similarities between terms. [sent-170, score-0.389]
36 Algorithms and Methods The Naive algorithm for computing similarities first computes dot products, then simply divides by whatever is necessary to obtain a similarity score. [sent-179, score-0.362]
37 Reduce using the NaiveReducer (Algorithm 2). Algorithm 1 NaiveMapper(t): for all pairs (w1, w2) in t do emit ((w1, w2) → 1) end for. Algorithm 2 NaiveReducer((w1, w2), r1, ..., rR): [sent-184, score-0.242]
38 a = ∑_{i=1}^R r_i; output a / (√#(w1) √#(w2)). The above steps will compute all dot products, which will then be scaled by appropriate factors for each of the similarity scores. [sent-187, score-0.278]
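The naive mapper and reducer can be sketched in Python roughly as follows. This is an illustrative stand-in, not the paper's MapReduce code: the `emit` callback, the in-memory `run_naive` driver, and the `counts` dictionary mapping each word to #(w) are our own scaffolding, and the reducer scaling shown is for cosine similarity.

```python
import math
from collections import defaultdict
from itertools import combinations

def naive_mapper(doc_words, emit):
    # Algorithm 1 (sketch): for every pair of distinct words co-occurring in a
    # document t, emit ((w1, w2) -> 1).
    for w1, w2 in combinations(sorted(set(doc_words)), 2):
        emit((w1, w2), 1)

def naive_reducer(pair, values, counts):
    # Algorithm 2 (sketch): sum the emissions to get the dot product #(w1, w2),
    # then divide by sqrt(#(w1)) * sqrt(#(w2)) to obtain cosine similarity.
    w1, w2 = pair
    a = sum(values)
    return a / (math.sqrt(counts[w1]) * math.sqrt(counts[w2]))

def run_naive(documents, counts):
    # Tiny in-memory stand-in for the shuffle phase: group emitted values by key,
    # then reduce each group. `counts` maps each word w to #(w).
    grouped = defaultdict(list)
    for doc in documents:
        naive_mapper(doc, lambda k, v: grouped[k].append(v))
    return {pair: naive_reducer(pair, vals, counts) for pair, vals in grouped.items()}
```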
39 Instead of using the Naive algorithm, we replace its mapper with Algorithm 3 and its reducer with Algorithm 4, to directly compute the actual similarity score rather than the dot products. [sent-188, score-0.397]
40 The following sections detail how to obtain dimensionality independence for each of the similarity scores. [sent-189, score-0.225]
41 Algorithm 3 DISCOMapper(t): for all pairs (w1, w2) in t do emit using a custom Emit function end for. Algorithm 4 DISCOReducer((w1, w2), r1, ..., rR). [sent-190, score-0.242]
42 Algorithm 5 CosineSampleEmit(w1, w2) - p/ε is the oversampling parameter: with probability (p/ε) · 1/(√#(w1) √#(w2)), emit ((w1, w2) → 1). Note that the more popular a word is, the less likely it is to be output. [sent-195, score-0.225]
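A hedged sketch of the sampled emit and a matching reducer is below. The emit probability (p/ε)/(√#(w1) √#(w2)), capped at 1, is our reading of the listing above; the reducer simply rescales the sampled count, which matches the stated expectation argument only when no probabilities are capped. Names such as `counts` and `p_over_eps` are our own.

```python
import math
import random

def cosine_sample_emit(w1, w2, counts, p_over_eps, emit):
    # Algorithm 5 (sketch): emit ((w1, w2) -> 1) with probability
    # (p / eps) / (sqrt(#(w1)) * sqrt(#(w2))), capped at 1 in this sketch.
    # More popular words are emitted with lower probability.
    prob = min(1.0, p_over_eps / (math.sqrt(counts[w1]) * math.sqrt(counts[w2])))
    if random.random() < prob:
        emit((w1, w2), 1)

def disco_cosine_reducer(values, p_over_eps):
    # Sketch of the matching reducer: the sampled count divided by the
    # oversampling parameter estimates cosine similarity in expectation
    # (exactly unbiased only when no emit probability is capped at 1).
    return sum(values) / p_over_eps
```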
43 Since the CosineSampleEmit algorithm is only guaranteed to produce the correct similarity score in expectation, we must show that the expected value is highly likely to be obtained. [sent-199, score-0.247]
44 In Theorem 2 we prove that any algorithm that purports to accurately calculate highly similar pairs must at least output them; since there can be at least DL such pairs, any algorithm that is accurate on highly similar pairs must have shuffle size at least DL. [sent-215, score-0.246]
45 In each group, it is trivial to check that all pairs of words have similarity exactly 1. [sent-226, score-0.356]
46 There are (L choose 2) pairs in each group and there are D/L groups, making for a total of (D/L) · (L choose 2) = Ω(DL) pairs with similarity 1, and thus also at least ε. [sent-227, score-0.427]
47 Similar corrections have to be made for the other similarity scores (Dice and Overlap, but not MinHash), so we do not repeat this point. [sent-232, score-0.249]
48 2 Jaccard Similarity Traditionally MinHash (Broder, 1997) is used to compute Jaccard similarity scores between all pairs in a dictionary. [sent-239, score-0.35]
49 Let h(t) be a hash function that maps documents to distinct numbers in [0, 1], and for any word w define g(w) (called the MinHash of w) to be the minimum value of h(t) over all t that contain w. [sent-241, score-0.314]
50 Then g(w1 ) = g(w2 ) exactly when the minimum hash value of the union with size #(w1 ) + #(w2 ) − #(w1 , w2 ) lies in the intersection with size #(w1 , w2 ). [sent-242, score-0.244]
51 Thus Pr[g(w1) = g(w2)] = #(w1, w2) / (#(w1) + #(w2) − #(w1, w2)) = Jac(w1, w2). Therefore the indicator random variable that is 1 when g(w1) = g(w2) has expectation equal to the Jaccard similarity between the two words. [sent-243, score-0.225]
52 The idea of the MinHash scheme is to reduce the variance by averaging together k of these variables constructed in the same way with k different hash functions. [sent-245, score-0.203]
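The plain MinHash estimator described here can be sketched as follows; the hash-function construction is purely illustrative and is not the hash family used in the paper.

```python
import random

def make_hashes(k, seed=0):
    # k pseudo-random hash functions mapping document ids to [0, 1);
    # purely illustrative, not the paper's hash family.
    salts = [random.Random(seed + j).random() for j in range(k)]
    return [lambda t, s=s: (hash((s, t)) % (2 ** 31)) / float(2 ** 31) for s in salts]

def minhash_signature(docs_containing_w, hashes):
    # g_j(w): the minimum hash value over all documents t that contain word w.
    return [min(h(t) for t in docs_containing_w) for h in hashes]

def jaccard_estimate(sig1, sig2):
    # Estimate Jac(w1, w2) as the fraction of hash functions on which
    # the two MinHash values collide.
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / float(len(sig1))
```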
53 a similarity score that is above ε has a relative error of no more than δ. [sent-253, score-0.225]
54 Now that MinHash values for all hash functions are available, for each pair we can simply compute the number of hash collisions and divide by the total number of hash functions. [sent-272, score-0.746]
55 Algorithm 7 MinHashSampleMap(t, w1, ..., wL): for i = 1 to L do; for j = 1 to k do; if h_j(t) ≤ c log(Dk)/#(w_i) then emit ((w_i, j) → h_j(t)) end if; end for; end for. Recall that a document has at most L words. [sent-275, score-0.219]
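A rough Python rendering of Algorithm 7 and the standard MinReducer it feeds, under the assumptions named in the comments (in particular c = 3, a `counts` dictionary holding #(w), and a list of hash functions as above); the parameter names are ours.

```python
import math

def minhash_sample_map(doc_id, doc_words, counts, hashes, c, D, emit):
    # Algorithm 7 (sketch): for each word in the document and each hash
    # function, emit the hash value only if it falls below c*log(D*k)/#(w).
    # With c = 3 this keeps the true minimum with high probability while
    # shrinking the shuffle size.
    k = len(hashes)
    threshold_num = c * math.log(D * k)
    for w in doc_words:
        for j, h in enumerate(hashes):
            if h(doc_id) <= threshold_num / counts[w]:
                emit((w, j), h(doc_id))

def min_reducer(values):
    # Standard MinReducer: g_j(w) is the smallest emitted hash value.
    return min(values)
```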
56 After the map phase, for each of the k hash functions, the standard MinReducer is used, which will compute g j (w). [sent-277, score-0.187]
57 We now prove that MinHashSampleMap will with high probability Emit the minimum hash value for a given word w and hash function h, thus ensuring the steps following MinHashSampleMap will be unaffected. [sent-281, score-0.39]
58 If the hash functions h1, ..., hk map documents to [0, 1] uniformly at random and c = 3, then with probability at least 1 − 1/(Dk)^2, for all words w and hash functions h ∈ {h1, ..., hk}, [sent-285, score-0.307]
59 MinHashSampleMap will emit the document that realizes g(w). [sent-288, score-0.241]
60 Proof Fix a word w and hash function h, and let z = min_{t : w ∈ t} h(t). [sent-289, score-0.224]
61 Now the probability that MinHashSampleMap will not emit the document that realizes g(w) is Pr[z > c log(Dk)/#(w)] = (1 − c log(Dk)/#(w))^{#(w)} ≤ e^{−c log(Dk)} = 1/(Dk)^c. Thus for a single w and h we have shown that MinHashSampleMap will w.h.p. emit the document that realizes g(w). [sent-290, score-0.241]
62 To show the same result for all hash functions and words in the dictionary, set c = 3 and use a union bound to get a (1/(Dk))^2 bound on the probability of error for any of the words w1, ..., wD and hash functions h1, ..., hk. [sent-294, score-0.196]
63 For the other similarity measures, combining the Naive mappers can only bring down the shuffle size to O(mD^2), which DISCO improves upon asymptotically by obtaining a bound of O(DL log(D)/ε) without even combining, so combining will help even more. [sent-315, score-0.366]
64 In this setting, data is streamed dimension-by-dimension through a single machine that can at any time answer queries of the form “what is the similarity between points x and y considering all the input so far?” [sent-319, score-0.249]
65 Our algorithm will be very similar to the mapreduce setup, but in place of emitting pairs to be shuffled for a reduce phase, we instead insert them into a hash map H, keyed by pairs of words, with each entry holding a bag of emissions. [sent-323, score-0.72]
66 There are two operations to be described: the update that occurs when a new dimension (document) arrives, and the constant time algorithm used to answer similarity queries. [sent-328, score-0.251]
67 Let the query be for the similarity between words x and y. [sent-338, score-0.281]
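As a rough illustration of the streaming variant described here, the sketch below keeps running word counts, inserts sampled emissions into a hash map H keyed by word pairs, and answers queries from H in constant time. It is a simplification under our own assumptions: it reuses the cosine emit probability with the counts current at emission time and does not reproduce the exact re-weighting used in the paper's analysis.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

class StreamingCosine:
    # Simplified streaming sketch: documents arrive one at a time, sampled
    # emissions go into a hash map H keyed by word pairs, and queries are
    # answered from H in constant time.
    def __init__(self, p_over_eps):
        self.p_over_eps = p_over_eps
        self.counts = defaultdict(int)   # running #(w)
        self.H = defaultdict(list)       # bag of emissions per pair

    def update(self, doc_words):
        words = sorted(set(doc_words))
        for w in words:
            self.counts[w] += 1
        for x, y in combinations(words, 2):
            prob = min(1.0, self.p_over_eps /
                       (math.sqrt(self.counts[x]) * math.sqrt(self.counts[y])))
            if random.random() < prob:
                self.H[(x, y)].append(1)

    def query(self, x, y):
        key = (x, y) if x < y else (y, x)
        return len(self.H[key]) / self.p_over_eps
```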
68 We show that the memory used by H when our algorithm is run on C′ with the all-knowing counters dominates (in expectation) the size of H for the original data set in the streaming model, thus achieving the claimed bound in the current theorem statement. [sent-367, score-0.179]
69 Using the same analysis used in Theorem 2, the shuffle size for C′ is at most O(DL lg(N) log(D)/ε) and therefore so is the size of H for the all-knowing algorithm run on C′ , and, by the analysis above, so is the hash map size for the original data set C. [sent-374, score-0.304]
70 Correctness and Shuffle Size Proofs for other Similarity Measures We now give consideration to similarity scores other than cosine similarity, along with proofs of their shuffle size. [sent-376, score-0.366]
71 1 Overlap Similarity Overlap similarity follows the same pattern we used for cosine similarity; thus we only explain the parts that are different. [sent-378, score-0.342]
72 Algorithm 8 OverlapSampleEmit(w1, w2) - p/ε is the oversampling parameter: with probability (p/ε) · 1/min(#(w1), #(w2)), emit ((w1, w2) → 1). The correctness proof is nearly identical to that of cosine similarity so we do not restate it. [sent-380, score-0.509]
73 2 Dice Similarity Dice similarity follows the same pattern as we used for cosine similarity, thus we only explain the parts that are different. [sent-387, score-0.342]
74 Algorithm 9 DiceSampleEmit(w1, w2) - p/ε is the oversampling parameter: with probability (p/ε) · 2/(#(w1) + #(w2)), emit ((w1, w2) → 1). The correctness proof is nearly identical to cosine similarity so we do not restate it. [sent-389, score-0.509]
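For completeness, hedged sketches of the Overlap and Dice emit functions, mirroring the cosine version above (again capping the probability at 1 and treating `counts` as the precomputed #(w) values; the names are ours):

```python
import random

def overlap_sample_emit(w1, w2, counts, p_over_eps, emit):
    # Algorithm 8 (sketch): emit with probability (p/eps) / min(#(w1), #(w2)).
    prob = min(1.0, p_over_eps / min(counts[w1], counts[w2]))
    if random.random() < prob:
        emit((w1, w2), 1)

def dice_sample_emit(w1, w2, counts, p_over_eps, emit):
    # Algorithm 9 (sketch): emit with probability (p/eps) * 2 / (#(w1) + #(w2)).
    prob = min(1.0, p_over_eps * 2.0 / (counts[w1] + counts[w2]))
    if random.random() < prob:
        emit((w1, w2), 1)
```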
75 1 Similar Users Consider the problem of finding all pairs of similarities between a subset of D = 10^7 Twitter users. [sent-410, score-0.348]
76 The number of dimensions N = 198,134,530 in our data is equal to the number of tweets, and each tweet is a document with size at most 140 characters, providing a small upper bound for L. [sent-424, score-0.204]
77 We ran experiments with D = 10^6, but cannot report true error since finding the true cosine similarities is too computationally intensive. [sent-432, score-0.201]
78 Figure 1: Average error for all pairs with similarity ≥ ε (x-axis: similarity threshold). [sent-454, score-0.551]
79 4 Error vs Similarity Magnitude All of our theorems report better accuracy for pairs that have higher similarity than otherwise. [sent-459, score-0.326]
80 To see this empirically, we plot the average error of all pairs that have true similarity above ε. [sent-460, score-0.326]
81 Note that the reason large portions of the error are constant in these plots is that there are very few pairs with very high similarities, and therefore the error remains constant while ε lies between two such very high similarity values. [sent-462, score-0.326]
82 Although we use the MapReduce (Dean and Ghemawat, 2008) and Streaming computation models to discuss shuffle size and memory, the sampling strategy we use can be generalized to other computational models (figure panel: DISCO Cosine shuffle size vs accuracy tradeoff). [sent-468, score-0.222]
83 When the true similarity is zero, DISCO always also returns zero, so we always get those pairs right. [sent-478, score-0.326]
84 It remains to estimate those pairs with similarity > 0, and that is the average relative error for those pairs that we report here. [sent-479, score-0.427]
85 Figure 3: Average error for all pairs with similarity ≥ ε (x-axis: similarity threshold). [sent-493, score-0.551]
86 (Figure panel: DISCO Dice shuffle size vs accuracy tradeoff.) [sent-502, score-0.183]
87 Figure 4: As p/ε increases, shuffle size increases and error decreases (axes: log(p/ε) vs. avg relative err and DISCO Shuffle / Naive Shuffle). [sent-503, score-0.379]
88 Figure 5: Average error for all pairs with similarity ≥ ε (x-axis: similarity threshold). [sent-512, score-0.551]
89 (Figure panel: DISCO Overlap shuffle size vs accuracy tradeoff.) [sent-521, score-0.183]
90 Figure 7: Average error for all pairs with similarity ≥ ε (x-axis: similarity threshold). [sent-543, score-0.551]
91 MinHash Jaccard similarity error decreases for more similar pairs. [sent-544, score-0.225]
92 Keyword generation for search engine advertising using semantic similarity between terms. [sent-549, score-0.246]
93 Figure 8: Average error for all pairs with similarity ≥ ε (x-axis: similarity threshold). [sent-579, score-0.551]
94 DISCO MinHash Jaccard similarity error decreases for more similar pairs. [sent-580, score-0.225]
95 Finding associations and computing similarity via biased pair sampling. [sent-611, score-0.249]
96 A primitive operator for similarity joins in data cleaning. [sent-623, score-0.225]
97 Semantic similarity between search engine queries using temporal correlation. [sent-629, score-0.27]
98 Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. [sent-705, score-0.303]
99 A web-based kernel function for measuring the similarity of short text snippets. [sent-737, score-0.225]
100 Evaluating similarity measures: a large-scale study in the orkut social network. [sent-750, score-0.252]
wordName wordTfidf (topN-words)
[('shuf', 0.472), ('disco', 0.367), ('mapreduce', 0.285), ('similarity', 0.225), ('minhash', 0.224), ('hash', 0.166), ('twitter', 0.163), ('emit', 0.141), ('dice', 0.122), ('cosine', 0.117), ('dl', 0.115), ('adeh', 0.112), ('oel', 0.112), ('osagh', 0.112), ('shuffle', 0.112), ('wi', 0.106), ('jac', 0.102), ('mappers', 0.102), ('ndependent', 0.102), ('pairs', 0.101), ('jaccard', 0.096), ('avg', 0.096), ('lg', 0.096), ('cosinesampleemit', 0.092), ('documents', 0.09), ('imension', 0.087), ('imilarity', 0.087), ('streaming', 0.087), ('similarities', 0.084), ('document', 0.078), ('naive', 0.075), ('err', 0.074), ('omputation', 0.072), ('minhashsamplemap', 0.071), ('goel', 0.061), ('hadoop', 0.061), ('reducers', 0.061), ('tweets', 0.061), ('pr', 0.058), ('dk', 0.058), ('word', 0.058), ('broder', 0.055), ('dot', 0.053), ('dean', 0.052), ('ghemawat', 0.051), ('chernoff', 0.05), ('cos', 0.047), ('overlap', 0.046), ('dictionary', 0.044), ('reducer', 0.044), ('users', 0.041), ('minhashmap', 0.041), ('sahami', 0.041), ('zadeh', 0.041), ('emissions', 0.039), ('size', 0.039), ('scheme', 0.037), ('emission', 0.036), ('production', 0.036), ('chien', 0.035), ('stanford', 0.035), ('tradeoff', 0.032), ('log', 0.032), ('counters', 0.031), ('dicesampleemit', 0.031), ('munagala', 0.031), ('olston', 0.031), ('overlapsampleemit', 0.031), ('pig', 0.031), ('spertus', 0.031), ('occurrences', 0.03), ('words', 0.03), ('vldb', 0.029), ('coin', 0.029), ('keyword', 0.029), ('phase', 0.028), ('social', 0.027), ('acm', 0.027), ('dimensions', 0.026), ('indyk', 0.026), ('oversampling', 0.026), ('dimension', 0.026), ('query', 0.026), ('magnitudes', 0.026), ('web', 0.025), ('bag', 0.024), ('scores', 0.024), ('pair', 0.024), ('queries', 0.024), ('memory', 0.022), ('xy', 0.022), ('sigmod', 0.022), ('realizes', 0.022), ('emitting', 0.022), ('highly', 0.022), ('engine', 0.021), ('map', 0.021), ('google', 0.02), ('arasu', 0.02), ('baraglia', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 33 jmlr-2013-Dimension Independent Similarity Computation
Author: Reza Bosagh Zadeh, Ashish Goel
Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com. Keywords: cosine, Jaccard, overlap, dice, similarity, MapReduce, dimension independent
2 0.15347983 78 jmlr-2013-On the Learnability of Shuffle Ideals
Author: Dana Angluin, James Aspnes, Sarah Eisenstat, Aryeh Kontorovich
Abstract: PAC learning of unrestricted regular languages is long known to be a difficult problem. The class of shuffle ideals is a very restricted subclass of regular languages, where the shuffle ideal generated by a string u is the collection of all strings containing u as a subsequence. This fundamental language family is of theoretical interest in its own right and provides the building blocks for other important language families. Despite its apparent simplicity, the class of shuffle ideals appears quite difficult to learn. In particular, just as for unrestricted regular languages, the class is not properly PAC learnable in polynomial time if RP = NP, and PAC learning the class improperly in polynomial time would imply polynomial time algorithms for certain fundamental problems in cryptography. In the positive direction, we give an efficient algorithm for properly learning shuffle ideals in the statistical query (and therefore also PAC) model under the uniform distribution. Keywords: PAC learning, statistical queries, regular languages, deterministic finite automata, shuffle ideals, subsequences
3 0.064688526 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
Author: Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi
Abstract: Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly takes document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches. Keywords: online learning, clickthrough data, diversity, multi-armed bandits, contextual bandits, regret, metric spaces
4 0.036313608 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
Author: David Picard, Nicolas Thome, Matthieu Cord
Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision
5 0.032873806 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
Author: Gang Niu, Bo Dai, Lin Shang, Masashi Sugiyama
Abstract: The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are more preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: A soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semi-definite programming, respectively. The proposed MVC is theoretically advantageous for three reasons. The optimization involved in hardlabel MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Secondly, the soft-label MVC method pos∗. A preliminary and shorter version has appeared in Proceedings of 14th International Conference on Artificial Intelligence and Statistics (Niu et al., 2011). The preliminary work was done when GN was studying at Department of Computer Science and Technology, Nanjing University, and BD was studying at Institute of Automation, Chinese Academy of Sciences. A Matlab implementation of maximum volume clustering is available from http://sugiyama-www.cs.titech.ac.jp/∼gang/software.html. c 2013 Gang Niu, Bo Dai, Lin Shang and Masashi Sugiyama. N IU , DAI , S HANG AND S UGIYAMA sesses a clustering error bound. Thirdly, MVC includes the optimization problems of a spectral clustering, two relaxed k-means clustering and an information-maximization clustering as special limit cases when its regularization parameter goes to infinity. Experiments on several artificial and benchmark data sets demonstrate that the proposed MVC compares favorably with state-of-the-art clustering methods. Keywords: discriminative clustering, large volume principle, sequential quadratic programming, semi-definite programming, finite sample stability, clustering error
6 0.030848062 108 jmlr-2013-Stochastic Variational Inference
7 0.029315701 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
8 0.02809526 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
9 0.027767105 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
10 0.025413176 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
11 0.025384204 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
12 0.025306942 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
13 0.024144558 104 jmlr-2013-Sparse Single-Index Model
14 0.023855893 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
15 0.023336748 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
16 0.022954751 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
17 0.022460477 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
18 0.022184387 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
19 0.021785302 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
20 0.021692254 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
topicId topicWeight
[(0, -0.118), (1, 0.017), (2, -0.026), (3, 0.012), (4, -0.034), (5, 0.03), (6, 0.046), (7, 0.016), (8, -0.125), (9, 0.108), (10, 0.012), (11, -0.285), (12, -0.036), (13, 0.242), (14, -0.171), (15, 0.055), (16, -0.057), (17, -0.101), (18, -0.063), (19, 0.002), (20, -0.059), (21, 0.008), (22, 0.013), (23, 0.026), (24, -0.107), (25, 0.084), (26, 0.086), (27, 0.105), (28, 0.015), (29, 0.051), (30, 0.052), (31, -0.062), (32, 0.017), (33, -0.025), (34, -0.118), (35, 0.104), (36, -0.019), (37, 0.226), (38, 0.164), (39, 0.026), (40, -0.136), (41, 0.142), (42, 0.067), (43, 0.14), (44, 0.099), (45, 0.034), (46, -0.09), (47, 0.012), (48, 0.069), (49, -0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.95313001 33 jmlr-2013-Dimension Independent Similarity Computation
Author: Reza Bosagh Zadeh, Ashish Goel
Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com. Keywords: cosine, Jaccard, overlap, dice, similarity, MapReduce, dimension independent
2 0.62925398 78 jmlr-2013-On the Learnability of Shuffle Ideals
Author: Dana Angluin, James Aspnes, Sarah Eisenstat, Aryeh Kontorovich
Abstract: PAC learning of unrestricted regular languages is long known to be a difficult problem. The class of shuffle ideals is a very restricted subclass of regular languages, where the shuffle ideal generated by a string u is the collection of all strings containing u as a subsequence. This fundamental language family is of theoretical interest in its own right and provides the building blocks for other important language families. Despite its apparent simplicity, the class of shuffle ideals appears quite difficult to learn. In particular, just as for unrestricted regular languages, the class is not properly PAC learnable in polynomial time if RP = NP, and PAC learning the class improperly in polynomial time would imply polynomial time algorithms for certain fundamental problems in cryptography. In the positive direction, we give an efficient algorithm for properly learning shuffle ideals in the statistical query (and therefore also PAC) model under the uniform distribution. Keywords: PAC learning, statistical queries, regular languages, deterministic finite automata, shuffle ideals, subsequences
3 0.40664271 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
Author: Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi
Abstract: Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly takes document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches. Keywords: online learning, clickthrough data, diversity, multi-armed bandits, contextual bandits, regret, metric spaces
4 0.29095247 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown
Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound
6 0.25589946 44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
7 0.24867234 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
8 0.2460828 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars
9 0.21470508 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
10 0.20658676 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
11 0.18823783 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
12 0.18612225 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
13 0.18194391 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
14 0.16878781 21 jmlr-2013-Classifier Selection using the Predicate Depth
15 0.16809152 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
16 0.15983617 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
17 0.15942138 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
18 0.15701333 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
19 0.1470414 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library
20 0.14039326 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
topicId topicWeight
[(0, 0.03), (5, 0.069), (6, 0.037), (10, 0.052), (20, 0.019), (23, 0.033), (53, 0.025), (68, 0.027), (70, 0.032), (75, 0.032), (78, 0.494), (85, 0.017), (87, 0.019), (89, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.6834802 33 jmlr-2013-Dimension Independent Similarity Computation
Author: Reza Bosagh Zadeh, Ashish Goel
Abstract: We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension, meaning that apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large scale experiments using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com. Keywords: cosine, Jaccard, overlap, dice, similarity, MapReduce, dimension independent
2 0.61144662 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
Author: Cynthia Rudin, Benjamin Letham, David Madigan
Abstract: We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer’s online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis. Keywords: statistical learning theory, algorithmic stability, association rules, sequence prediction, associative classification c 2013 Cynthia Rudin, Benjamin Letham and David Madigan. RUDIN , L E
3 0.21502106 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
Author: Partha Niyogi
Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates
4 0.21392466 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
5 0.21336022 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
6 0.21207412 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.21196614 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
10 0.20718093 120 jmlr-2013-Variational Algorithms for Marginal MAP
11 0.2069733 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
12 0.20667487 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
13 0.20627818 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
14 0.20622124 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
15 0.20616895 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
16 0.20607005 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
17 0.20580381 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
18 0.20549071 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
19 0.20548378 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
20 0.20537403 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning