nips nips2012 nips2012-338 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 il Abstract We introduce a new discrepancy score between two distributions that gives an indication on their similarity. [sent-7, score-0.216]
2 The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. [sent-9, score-0.187]
3 We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. [sent-11, score-0.177]
4 We also compare the score’s capacity to detect similarity with that of other known measures on real data. [sent-13, score-0.198]
5 1 Introduction The question of similarity between two sets of examples is common to many fields, including statistics, data mining, machine learning and computer vision. [sent-14, score-0.174]
6 The main focus of this work is providing a similarity score and a corresponding statistical procedure that gives one possible answer to this question. [sent-18, score-0.266]
7 Discrepancy between distributions has been studied for decades, and a wide variety of distance scores have been proposed. [sent-19, score-0.146]
8 However, not all proposed scores can be used for testing similarity. [sent-20, score-0.075]
9 The main difficulty is that most scores have not been designed for statistical testing of similarity but of equality, known as the Two-Sample Problem (TSP). [sent-21, score-0.27]
10 Formally, let P and Q be the generating distributions of the data; the TSP tests the null hypothesis H0 : P = Q against the general alternative H1 : P ≠ Q. [sent-22, score-0.243]
11 However, sometimes, as in domain adaptation (DA), the interesting question is with regard to similarity rather than equality. [sent-24, score-0.174]
12 By design, most equality tests may not be transformed to test similarity; see Section 3 for a review of representative works. [sent-25, score-0.113]
13 In this work, we quantify similarity using a new score, the Perturbed Variation (PV). [sent-26, score-0.174]
14 We propose that similarity is related to some predefined value of permitted variations. [sent-27, score-0.174]
15 Consider the gait of two male subjects as an example. [sent-28, score-0.238]
16 Put more generally, similarity depends on what “small changes” are in a given application, and implies that similarity is domain specific. [sent-31, score-0.348]
17 The PV, as hinted by its name, measures the discrepancy between two distributions while allowing for some perturbation of each distribution; that is, it allows small differences between the distributions. [sent-32, score-0.193]
18 Let P and Q be two distributions on a Banach space X , and let M (P, Q) be the set of all joint distributions on X × X with marginals P and Q. [sent-39, score-0.132]
19 Put into words, Equation (1) defines the joint distribution µ that couples the two distributions such that the probability of the event of a pair (X, Y) ∼ µ being at a distance greater than ε is minimized. [sent-42, score-0.111]
20 The solution to (1) is a special case of the classical mass transport problem of Monge [1] and its version by Kantorovich: inf_{µ∈M(P,Q)} ∫_{X×X} c(x, y) dµ(x, y), where c : X × X → R is a measurable cost function. [sent-43, score-0.088]
21 With this cost, (1) may be rewritten as an infimum over µ ∈ M(P, Q); the probability µ(y|x) defines the transportation plan of x to y. [sent-46, score-0.165]
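Based on the verbal description above, a possible reconstruction of definition (1) and its transport form (our notation, inferred from the surrounding sentences rather than copied from the paper) is:

\[
\mathrm{PV}(P, Q, \epsilon, d) \;=\; \inf_{\mu \in M(P,Q)} \Pr_{(X,Y)\sim\mu}\!\left[\, d(X, Y) > \epsilon \,\right]
\;=\; \inf_{\mu \in M(P,Q)} \int_{\mathcal{X}\times\mathcal{X}} \mathbf{1}\{ d(x, y) > \epsilon \}\, d\mu(x, y),
\]

that is, the Monge-Kantorovich problem with the zero-one cost c(x, y) = 1{d(x, y) > ε}.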
22 The PV optimal transportation plan is obtained by perturbing the mass of each point x in its ε-neighborhood so that it redistributes to the distribution of Q. [sent-47, score-0.181]
23 These small perturbations do not add any cost, while transportation of mass to further areas is equally costly. [sent-48, score-0.166]
24 Despite this limitation, this cost function fully quantifies the intuition that small variations should not be penalized when similarity is considered. [sent-51, score-0.174]
25 In this sense, similarity is not unique by definition, as more than one distribution can be similar to a reference distribution. [sent-52, score-0.194]
26 This formulation argues that any transportation plan, even to a close neighbor, is costly. [sent-54, score-0.098]
27 As explained, it relaxes the sensitivity of the TV; however, it does not “over optimize” the transportation plan. [sent-61, score-0.098]
28 Define the neighborhood of ai as ng(ai, ε) = {z ; d(z, ai) ≤ ε}. [sent-78, score-0.084]
29 The PV(µ1, µ2, ε, d) between the two distributions is: min_{wi≥0, vj≥0, Zij≥0} (1/2) Σ_{i=1}^N wi + (1/2) Σ_{j=1}^N vj (2), s.t. Σ_{aj∈ng(ai,ε)} Zij + wi = µ1(ai) ∀i, [sent-79, score-0.258]
30 Σ_{ai∈ng(aj,ε)} Zij + vj = µ2(aj) ∀j, and Zij = 0 ∀(i, j) ∉ ng(ai, ε). [sent-81, score-0.13]
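As an illustration, here is a minimal linear-programming sketch of (2), assuming scipy is available; the helper name discrete_pv and the variable ordering are our own choices rather than the paper's.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def discrete_pv(points, mu1, mu2, eps):
    # points: (N, d) support points a_1..a_N; mu1, mu2: length-N probability vectors.
    N = len(points)
    neighbors = cdist(points, points) <= eps               # ng(a_i, eps) as a boolean mask
    pairs = [(i, j) for i in range(N) for j in range(N) if neighbors[i, j]]
    n_pairs = len(pairs)
    # Decision variables: [Z_ij for allowed pairs, w_1..w_N, v_1..v_N], all >= 0.
    c = np.concatenate([np.zeros(n_pairs), 0.5 * np.ones(N), 0.5 * np.ones(N)])
    A_eq = np.zeros((2 * N, n_pairs + 2 * N))
    b_eq = np.concatenate([np.asarray(mu1, float), np.asarray(mu2, float)])
    for k, (i, j) in enumerate(pairs):
        A_eq[i, k] = 1.0                                    # sum_j Z_ij + w_i = mu1(a_i)
        A_eq[N + j, k] = 1.0                                # sum_i Z_ij + v_j = mu2(a_j)
    A_eq[np.arange(N), n_pairs + np.arange(N)] = 1.0        # coefficients of w_i
    A_eq[N + np.arange(N), n_pairs + N + np.arange(N)] = 1.0  # coefficients of v_j
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

For example, two unit point masses at 0 and 1 (points = [[0.0], [1.0]], µ1 = [1, 0], µ2 = [0, 1]) give a PV of 1 for ε = 0.5 and 0 for ε = 1.5.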
31 Each row in the matrix Z ∈ R^{N×N} corresponds to a point mass in µ1, and each column to a point mass in µ2. [sent-82, score-0.082]
32 For each i, Z(i, :) is zero in columns corresponding to non-neighboring elements, and non-zero only for columns j for which transportation between µ1(ai) and µ2(aj) is performed. [sent-83, score-0.098]
33 The discrepancies between the distributions are depicted by the scalars wi and vj that count the “leftover” mass in µ1(ai) and µ2(aj). [sent-84, score-0.181]
34 The objective is to minimize these discrepancies, therefore matrix Z describes the optimal transportation plan constrained to �-perturbations. [sent-85, score-0.14]
35 Given samples S1 = {x1, . . . , xn} and S2 = {y1, . . . , ym}, generated by distributions P and Q respectively, the empirical PV(S1, S2, ε, d) is: min_{wi≥0, vj≥0, Zij≥0} (1/(2n)) Σ_{i=1}^n wi + (1/(2m)) Σ_{j=1}^m vj (3), s.t. Σ_{yj∈ng(xi,ε)} Zij + wi = 1 ∀i and Σ_{xi∈ng(yj,ε)} Zij + vj = 1 ∀j, [sent-96, score-0.345]
36 Zij = 0 ∀(i, j) ∉ ng(xi, ε), where Z ∈ R^{n×m}. [sent-98, score-0.088]
37 This solution may be found by solving the optimal assignment on an appropriate bipartite graph [3]. [sent-103, score-0.081]
38 Each vertex xi is connected with edge weight zero to yj ∈ ng(xi) and with weight ∞ to yj ∉ ng(xi). [sent-125, score-0.132]
39 To make the graph complete, assign zero cost edges between all vertices xi and wk for k ≠ i (and vertices yj and vk for k ≠ j). [sent-127, score-0.185]
40 We note that the Earth Mover Distance (EMD) [4], a sampled version of the transportation problem, is also formulated by a linear program that may be solved by optimal assignment. [sent-128, score-0.098]
41 In contrast, graph G, which describes the PV, is a simple bipartite graph for which maximum cardinality matching, a much simpler problem, can be applied to find the optimal assignment. [sent-130, score-0.107]
42 To find the optimal assignment, first solve the maximum matching on the partial graph between vertices xi , yj that have zero weight edges (corresponding to neighboring vertices). [sent-131, score-0.173]
43 Then, assign vertices xi and yj for which a match was not found with wi and vj respectively; see Algorithm 1 and Figure 1 for an illustration of a matching. [sent-132, score-0.221]
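A minimal sketch of this matching-based estimate of the empirical PV in (3), assuming scipy and a Euclidean distance; the function name empirical_pv is ours, and the reduction to maximum cardinality matching follows the description above.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching
from scipy.spatial.distance import cdist

def empirical_pv(S1, S2, eps):
    # S1: (n, d) sample from P; S2: (m, d) sample from Q.
    n, m = len(S1), len(S2)
    # Bipartite neighbor graph: edge (i, j) iff d(x_i, y_j) <= eps.
    adjacency = csr_matrix(cdist(S1, S2) <= eps)
    match = maximum_bipartite_matching(adjacency, perm_type='column')
    matched = int(np.count_nonzero(match >= 0))            # size of a maximum matching
    # Unmatched samples are exactly those assigned w_i = 1 or v_j = 1 in (3).
    return 0.5 * ((n - matched) / n + (m - matched) / m)

Maximum cardinality matching suffices here because every matched pair lowers the objective in (3) by the same amount, 1/(2n) + 1/(2m), so minimizing the leftover mass is equivalent to maximizing the number of matched neighbor pairs.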
44 Let k be the average number of neighbors of a sample; then the average number of edges in the bipartite graph G is |E| = n × k. [sent-137, score-0.088]
45 3 Related Work Many scores have been defined for testing discrepancy between distributions. [sent-139, score-0.154]
46 Among these tests is the well known Kolmogorov-Smirnov test (for one dimensional distributions), and its generalization to higher dimensions by minimal spanning trees [6]. [sent-144, score-0.092]
47 A different statistic is defined by the portion of k-nearest neighbors of each sample that belongs to different distributions; larger portions mean the distributions are closer [7]. [sent-145, score-0.145]
48 These scores are well known in the statistical literature but cannot be easily changed to test similarity, as their analysis relies on testing equality. [sent-146, score-0.125]
49 In both cases, the distance is not estimated directly on the samples, but on a higher level partition of the space: histogram bins or signatures (cluster centers). [sent-149, score-0.126]
50 It is impractical to use the EMD to estimate the Wasserstein metric between the continuous distributions, as convergence would require the number of bins to be exponentially dependent on the dimension. [sent-150, score-0.08]
51 It is possible to consider the PV as a refinement of the EMD notion of similarity; instead of clustering the data to signatures and moving the signatures, it perturbs each sample. [sent-154, score-0.079]
52 In this manner, it captures a finer notion of similarity better suited for statistical testing. [sent-155, score-0.229]
53 Figure 2: Two distributions on R: the PV captures the perceptual similarity of (a),(b) against the dissimilarity in (c). [sent-176, score-0.306]
54 Therefore, instead of the TV, it may be interesting to consider the L1 as a similarity distance on the measures after discretization. [sent-197, score-0.243]
55 Namely, the problem is the choice of a single partition to measure similarity of a reference distribution to multiple distributions, while choosing multiple partitions would make the distances incomparable. [sent-200, score-0.226]
56 The last group of statistics consists of scores established in machine learning: the dA distance presented by Kifer et al. [sent-202, score-0.08]
57 that is based on the maximum discrepancy on a chosen subset of the support [8], and Maximum Mean Discrepancy (MMD) by Gretton et al. [sent-203, score-0.079]
58 , which define discrepancy after embedding the distributions into a Reproducing Kernel Hilbert Space (RKHS) [9]. [sent-204, score-0.145]
59 These scores have corresponding statistical tests for the TSP; however, since their analysis is based on finite convergence bounds, in principle they may be modified to test similarity. [sent-205, score-0.171]
60 The MMD captures the distance between the samples in some RKHS. [sent-207, score-0.11]
61 The MMD may be used to define a similarity test, yet this would require defining two parameters, σ and the similarity rate, whose dependency is not intuitive. [sent-208, score-0.348]
62 Namely, for any similarity rate the result of the test is highly dependent on the choice of σ, but it is not clear how this choice should be made. [sent-209, score-0.203]
63 Let S1 = {x1, . . . , xn} and S2 = {y1, . . . , ym} ∈ Rd be two samples generated by distributions P and Q, respectively. [sent-224, score-0.102]
64 However, intuitively, slow convergence is not always the case, for example when the support of the distributions lies in a lower dimensional manifold of the space. [sent-272, score-0.089]
65 5 Statistical Inference We construct two types of complementary procedures for hypothesis testing of similarity and dissimilarity. [sent-276, score-0.299]
66 In the first type of procedures, given 0 ≤ θ < 1, we distinguish between the null hypothesis H0^(1) : PV(P, Q, ε, d) ≤ θ, which implies similarity, and the alternative hypothesis H1^(1) : PV(P, Q, ε, d) > θ. [sent-277, score-0.167]
67 In the second type of procedures, we test whether two distributions are similar. [sent-280, score-0.095]
68 Note that there is no equivalent of this form for the TSP; therefore we cannot infer similarity using the TSP test, but only reject equality. [sent-282, score-0.174]
69 Our hypothesis tests are based on the finite sample analysis presented in Section 4; see Appendix A. [sent-283, score-0.139]
70 The idea of bootstrapping for estimating CIs is based on a two-step procedure: approximation of the sampling distribution of the statistic by resampling with replacement from the initial sample – the bootstrap stage – followed by computation of the CI based on the resulting distribution. [sent-286, score-0.114]
71 Using the CI, a hypothesis test may be formed: the null H0^(1) is rejected with significance α if the range [0, θ] ⊄ [CI_low, CI_high]. [sent-289, score-0.196]
72 Also, for the second test, we apply the principle of CI inclusion [11], which states that if [CI_low, CI_high] ⊂ [0, θ], dissimilarity is rejected and similarity deduced. [sent-290, score-0.227]
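A rough sketch of such a bootstrap test, assuming the empirical_pv helper sketched earlier; it uses a plain percentile interval and a simplified rejection rule for the first test, whereas the procedure described here is built on BCa intervals, so this is an illustration rather than the exact test.

import numpy as np

def pv_bootstrap_test(S1, S2, eps, theta, alpha=0.05, n_boot=1000, seed=None):
    rng = np.random.default_rng(seed)
    S1, S2 = np.asarray(S1), np.asarray(S2)
    stats = []
    for _ in range(n_boot):
        # Bootstrap stage: resample each data set with replacement.
        b1 = S1[rng.integers(0, len(S1), len(S1))]
        b2 = S2[rng.integers(0, len(S2), len(S2))]
        stats.append(empirical_pv(b1, b2, eps))
    ci_low, ci_high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    # Simplified rule for the first test: reject H0: PV <= theta if the whole CI lies above theta.
    reject_similarity = ci_low > theta
    # CI inclusion for the second test: deduce similarity if the CI falls inside [0, theta].
    similarity_deduced = ci_high <= theta
    return reject_similarity, similarity_deduced, (ci_low, ci_high)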
73 2 The two procedures are distinct, as, in general, lacking evidence to reject similarity is not sufficient to infer dissimilarity, and vice versa. [sent-291, score-0.206]
74 For this purpose, we apply significance testing for similarity on two univariate uniform distributions: P ∼ U[0, 1] and Q ∼ U[Δ(ε), 1 + Δ(ε)], where Δ(ε) is a varying size of perturbation. [sent-328, score-0.214]
75 For each value of ε, we test the null hypothesis H0 : PV(P, Q, ε) = 0 for ten equally spaced values of Δ(ε) in the range [0, 2ε]. [sent-335, score-0.143]
76 In this manner, we test the ability of the PV to detect similarity for different sizes of perturbations. [sent-336, score-0.203]
77 The percentage of times the null hypothesis was falsely rejected, i.e., the empirical Type I error, was also measured. [sent-337, score-0.114]
78 The percentage of times the null hypothesis was correctly rejected, the power of the test, was estimated as a function of the sample size and averaged over 500 repetitions. [sent-341, score-0.159]
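For concreteness, a rough sketch of this uniform-shift experiment, reusing the helpers sketched earlier; the sample size, number of repetitions and bootstrap settings below are placeholders and not the settings used in the paper.

import numpy as np

def estimated_power(eps, delta, n=200, reps=100, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        sample_p = rng.uniform(0.0, 1.0, size=(n, 1))             # P ~ U[0, 1]
        sample_q = rng.uniform(delta, 1.0 + delta, size=(n, 1))   # Q ~ U[delta, 1 + delta]
        reject, _, _ = pv_bootstrap_test(sample_p, sample_q, eps, theta=0.0,
                                         n_boot=200, seed=int(rng.integers(1 << 31)))
        rejections += bool(reject)
    return rejections / reps   # fraction of runs in which H0: PV(P, Q, eps) = 0 was rejected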
79 Also, when finer perturbations need to be detected, more samples are needed to gain statistical power. [sent-345, score-0.079]
80 Note that, given a sufficient sample size, any statistic for the TSP would have rejected similarity for any Δ > 0. [sent-349, score-0.277]
81 2 Comparing Distance Measures Next, we test the ability of the PV to measure similarity on real data. [sent-351, score-0.203]
82 To this end, we test the ranking performance of the PV score against other known distributional distances. [sent-352, score-0.125]
83 We compare the PV to the multivariate extension of the Wald-Wolfowitz score of Friedman & Rafsky (FR) [6] , Schilling’s nearest neighbors score (KNN) [7], and the Maximum Mean Discrepancy score of Gretton et al. [sent-353, score-0.242]
84 We rank similarity for the applications of video retrieval and gait recognition. [sent-355, score-0.366]
85 For each 1 ≤ i ≤ r of these observations, define ri ∈ [1, T − 1] as its similarity rank, where T is the total number of observations. [sent-358, score-0.174]
86 Twenty unique videos were selected as query videos, each of which has one similar clip in the set. 3 Note that the statistical tests of these measures test equality while the PV tests similarity, and therefore our experiments are not of statistical power but of ranking similarity. [sent-364, score-0.494]
87 Even in the case of distances that may be transformed to test similarity, like the MMD, there is no known mapping between the PV notion of similarity and other forms of similarity. [sent-365, score-0.206]
88 As a result, there is no basis on which to compare which similarity test has better performance. [sent-366, score-0.203]
89 We computed the similarity rate for each query video to all videos in the set, and ranked the position of each video. [sent-403, score-0.234]
90 We also tested gait similarity of female and male subjects; same-gender samples are assumed to be similar. [sent-407, score-0.494]
91 We used gait data that was recorded by a mobile phone, available at [13]. [sent-408, score-0.186]
92 The comparison was done by ranking by gender the 39 samples with respect to a reference walk. [sent-415, score-0.096]
93 This information provides a reasonable explanation of the PV results, as it appears that a subject from the male group may have a gait that is as dissimilar to the gait of a female subject as it is to that of another male. [sent-425, score-0.432]
94 In the female group the subjects are more similar and therefore the precision is higher. [sent-426, score-0.097]
95 7 Discussion We proposed a new score that measures the similarity between two multivariate distributions, and assigns to it a value in the range [0,1]. [sent-427, score-0.269]
96 Although it is not a metric, our experiments show that it captures the distance between similar distributions as well as well-known distributional distances do. [sent-432, score-0.145]
97 Based on this analysis we provide hypothesis tests that give statistical significance to the resulting score. [sent-434, score-0.137]
98 In addition, the PV has an intuitive interpretation that makes it an attractive score for meaningful statistical testing of similarity. [sent-436, score-0.153]
99 A metric for distributions with applications to image databases. [sent-458, score-0.092]
100 Distribution-based similarity measures for multi-dimensional point set retrieval applications. [sent-512, score-0.198]
wordName wordTfidf (topN-words)
[('pv', 0.848), ('similarity', 0.174), ('gait', 0.163), ('mmd', 0.125), ('tsp', 0.114), ('transportation', 0.098), ('discrepancy', 0.079), ('emd', 0.075), ('bca', 0.072), ('score', 0.071), ('yj', 0.066), ('distributions', 0.066), ('tv', 0.064), ('ci', 0.064), ('tests', 0.063), ('null', 0.061), ('knn', 0.06), ('wasserstein', 0.058), ('zij', 0.057), ('male', 0.055), ('hypothesis', 0.053), ('rejected', 0.053), ('female', 0.051), ('vj', 0.051), ('signatures', 0.05), ('contrarily', 0.049), ('fr', 0.048), ('ng', 0.048), ('wi', 0.047), ('distance', 0.045), ('perturbed', 0.043), ('unmatched', 0.043), ('plan', 0.042), ('ai', 0.042), ('mass', 0.041), ('da', 0.041), ('testing', 0.04), ('aj', 0.037), ('ym', 0.036), ('vertices', 0.035), ('scores', 0.035), ('captures', 0.034), ('bootstrap', 0.034), ('monge', 0.033), ('distances', 0.032), ('perceptual', 0.032), ('bipartite', 0.032), ('procedures', 0.032), ('videos', 0.031), ('samples', 0.031), ('bins', 0.031), ('bootstrapping', 0.03), ('video', 0.029), ('neighbors', 0.029), ('test', 0.029), ('perturbs', 0.029), ('kifer', 0.029), ('gretton', 0.028), ('variation', 0.028), ('graph', 0.027), ('perturbations', 0.027), ('statistic', 0.027), ('qc', 0.027), ('discrepancies', 0.027), ('kantorovich', 0.027), ('cance', 0.026), ('metric', 0.026), ('precision', 0.026), ('inf', 0.025), ('qb', 0.025), ('shie', 0.025), ('ranking', 0.025), ('israel', 0.024), ('perturbation', 0.024), ('measures', 0.024), ('matching', 0.023), ('sample', 0.023), ('mobile', 0.023), ('qa', 0.023), ('phone', 0.023), ('convergence', 0.023), ('power', 0.022), ('pc', 0.022), ('pb', 0.022), ('transport', 0.022), ('xi', 0.022), ('assignment', 0.022), ('cardinality', 0.021), ('equality', 0.021), ('technion', 0.021), ('mannor', 0.021), ('brightness', 0.021), ('statistical', 0.021), ('intuitive', 0.021), ('gender', 0.02), ('des', 0.02), ('subjects', 0.02), ('chapter', 0.02), ('reference', 0.02), ('sw', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
2 0.16156597 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. 1
3 0.12564354 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
4 0.1168644 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
Author: Yung-kyun Noh, Frank Park, Daniel D. Lee
Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria. 1
5 0.085796878 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain
Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
6 0.078814417 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
7 0.066574112 330 nips-2012-Supervised Learning with Similarity Functions
8 0.061953761 148 nips-2012-Hamming Distance Metric Learning
9 0.057652656 242 nips-2012-Non-linear Metric Learning
10 0.056551132 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
11 0.055696838 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
12 0.055111229 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
13 0.051356751 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
14 0.050907481 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
15 0.050607838 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
16 0.048407089 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
17 0.047200419 32 nips-2012-Active Comparison of Prediction Models
18 0.046676964 165 nips-2012-Iterative ranking from pair-wise comparisons
19 0.045864381 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking
20 0.040216547 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
topicId topicWeight
[(0, 0.131), (1, 0.025), (2, 0.007), (3, -0.06), (4, 0.033), (5, -0.008), (6, 0.006), (7, 0.067), (8, -0.008), (9, 0.051), (10, 0.012), (11, -0.079), (12, 0.01), (13, -0.039), (14, -0.013), (15, -0.003), (16, -0.008), (17, -0.003), (18, 0.015), (19, 0.023), (20, 0.037), (21, -0.072), (22, 0.051), (23, -0.004), (24, 0.044), (25, 0.028), (26, 0.032), (27, 0.061), (28, -0.004), (29, -0.016), (30, 0.071), (31, -0.032), (32, -0.004), (33, -0.016), (34, 0.028), (35, 0.021), (36, 0.061), (37, 0.01), (38, 0.015), (39, 0.065), (40, -0.032), (41, -0.085), (42, -0.014), (43, 0.054), (44, -0.021), (45, 0.089), (46, 0.063), (47, 0.048), (48, -0.061), (49, 0.121)]
simIndex simValue paperId paperTitle
same-paper 1 0.90374237 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
2 0.60177684 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain
Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
3 0.58687598 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
4 0.54045039 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
5 0.52959543 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski
Abstract: Diversified ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to find the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.
6 0.52450055 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
7 0.52294248 242 nips-2012-Non-linear Metric Learning
9 0.50984097 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
10 0.50755417 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
11 0.50103205 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
12 0.49143857 148 nips-2012-Hamming Distance Metric Learning
13 0.48930591 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
14 0.46712232 196 nips-2012-Learning with Partially Absorbing Random Walks
15 0.46701187 39 nips-2012-Analog readout for optical reservoir computers
16 0.4628768 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking
17 0.45102957 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
18 0.44315565 179 nips-2012-Learning Manifolds with K-Means and K-Flats
19 0.44110575 222 nips-2012-Multi-Task Averaging
20 0.43454394 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis
topicId topicWeight
[(0, 0.032), (21, 0.023), (38, 0.117), (39, 0.012), (42, 0.046), (54, 0.015), (55, 0.028), (59, 0.207), (74, 0.063), (76, 0.199), (80, 0.081), (84, 0.029), (92, 0.032)]
simIndex simValue paperId paperTitle
1 0.90433872 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks
Author: Qirong Ho, Junming Yin, Eric P. Xing
Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N²) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is Θ(Σi Di²) (where Di is the degree of vertex i), which is much smaller than N² for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280,000-node network, which is infeasible for network models with Ω(N²) inference cost. 1
2 0.86233908 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
Author: Jun Wang, Alexandros Kalousis, Adam Woznica
Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1
3 0.85903698 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
Author: Daniel Zoran, Yair Weiss
Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of number of mixture components including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models. 1 GMMs and natural image statistics models Many models for the statistics of natural image patches have been suggested in recent years. Finding good models for natural images is important to many different research areas - computer vision, biological vision and neuroscience among others. Recently, there has been a growing interest in comparing different aspects of models for natural images such as log-likelihood and multi-information reduction performance, and much progress has been achieved [1,2, 3,4,5, 6]. Out of these results there is one which is particularly interesting: simple, unconstrained Gaussian Mixture Models (GMMs) with a relatively small number of mixture components learned from image patches are extraordinarily good in modeling image statistics [6, 4]. This is a surprising result due to the simplicity of GMMs and their ubiquity. Another surprising aspect of this result is that many of the current models may be thought of as GMMs with an exponential or infinite number of components, having different constraints on the covariance structure of the mixture components. In this work we study the nature of GMMs learned from natural image patches. We start with a thorough comparison to some popular and cutting edge image models. We show that indeed, GMMs are excellent performers in modeling natural image patches. We then analyze what properties of natural images these GMMs capture, their dependence on the number of components in the mixture and their relation to the structure of the world around us. Finally, we show that the learned GMM suggests a strong connection between natural image statistics and a simple variant of the dead leaves model [7, 8] , explicitly modeling occlusions and explaining some of the success of GMMs in modeling natural images. 1 3.5 .,...- ••.......-.-.. -..---'-. 1 ~~6\8161·· -.. .-.. --...--.-- ---..-.- -. --------------MII+··+ilIl ..... .. . . ~ '[25 . . . ---- ] B'II 1_ -- ~2 ;t:: fI 1 - --- ,---- ._.. : 61.5 ..... '
same-paper 4 0.84457159 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
5 0.83969277 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1
6 0.83682275 294 nips-2012-Repulsive Mixtures
7 0.80603606 298 nips-2012-Scalable Inference of Overlapping Communities
8 0.77400595 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
9 0.77058285 188 nips-2012-Learning from Distributions via Support Measure Machines
10 0.76727974 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
11 0.76666987 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC
12 0.76543057 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
13 0.76527518 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
14 0.76354527 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
15 0.76349491 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
16 0.76278692 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
17 0.76198816 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
18 0.76179492 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
19 0.76141775 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
20 0.76067495 277 nips-2012-Probabilistic Low-Rank Subspace Clustering