nips nips2012 nips2012-338 knowledge-graph by maker-knowledge-mining

338 nips-2012-The Perturbed Variation

Source: pdf

Author: Maayan Harel, Shie Mannor

Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two ﬁnite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best ﬁt each other. The score is deﬁned between distributions, and can be efﬁciently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 il Abstract We introduce a new discrepancy score between two distributions that gives an indication on their similarity. [sent-7, score-0.216]

2 The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best ﬁt each other. [sent-9, score-0.187]

3 We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. [sent-11, score-0.177]

4 We also compare the score’s capacity to detect similarity with that of other known measures on real data. [sent-13, score-0.198]

5 1 Introduction The question of similarity between two sets of examples is common to many ﬁelds, including statistics, data mining, machine learning and computer vision. [sent-14, score-0.174]

6 The main focus of this work is providing a similarity score and a corresponding statistical procedure that gives one possible answer to this question. [sent-18, score-0.266]

7 Discrepancy between distributions has been studied for decades, and a wide variety of distance scores have been proposed. [sent-19, score-0.146]

8 However, not all proposed scores can be used for testing similarity. [sent-20, score-0.075]

9 The main difﬁculty is that most scores have not been designed for statistical testing of similarity but equality, known as the Two-Sample Problem (TSP). [sent-21, score-0.27]

10 Formally, let P and Q be the generating distributions of the data; the TSP tests the null hypothesis H0 : P = Q against the general alternative H1 : P �= Q. [sent-22, score-0.243]

11 However, sometimes, like in DA, the interesting question is with regards to similarity rather than equality. [sent-24, score-0.174]

12 By design, most equality tests may not be transformed to test similarity; see Section 3 for a review of representative works. [sent-25, score-0.113]

13 In this work, we quantify similarity using a new score, the Perturbed Variation (PV). [sent-26, score-0.174]

14 We propose that similarity is related to some predeﬁned value of permitted variations. [sent-27, score-0.174]

15 Consider the gait of two male subjects as an example. [sent-28, score-0.238]

16 Put more generally, similarity depends on what “small changes” are in a given application, and implies that similarity is domain speciﬁc. [sent-31, score-0.348]

17 The PV, as hinted by its name, measures the discrepancy between two distributions while allowing for some perturbation of each distribution; that is, it allows small differences between the distributions. [sent-32, score-0.193]

18 Let P and Q be two distributions on a Banach space X , and let M (P, Q) be the set of all joint distributions on X × X with marginals P and Q. [sent-39, score-0.132]

19 Put into words, Equation (1) deﬁnes the joint distribution µ that couples the two distributions such that the probability of the event of a pair (X, Y ) ∼ µ being within a distance grater than � is minimized. [sent-42, score-0.111]

20 The solution to (1) is a special case of the classical mass transport problem of Monge [1] and its � version by Kantorovich: inf µ∈M (P,Q) X ×X c(x, y)dµ(x, y), where c : X ×X → R is a measurable cost function. [sent-43, score-0.088]

21 The probability µ(y|x) deﬁnes the and may be rewritten as inf µ transportation plan of x to y. [sent-46, score-0.165]

22 The PV optimal transportation plan is obtained by perturbing the mass of each point x in its � neighborhood so that it redistributes to the distribution of Q. [sent-47, score-0.181]

23 These small perturbations do not add any cost, while transportation of mass to further areas is equally costly. [sent-48, score-0.166]

24 Despite this limitation, this cost function fully quantiﬁes the intuition that small variations should not be penalized when similarity is considered. [sent-51, score-0.174]

25 In this sense, similarity is not unique by deﬁnition, as more than one distribution can be similar to a reference distribution. [sent-52, score-0.194]

26 This formulation argues that any transportation plan, even to a close neighbor, is costly. [sent-54, score-0.098]

27 As explained, it relaxes the sensitivity of the TV; however, it does not “over optimize” the transportation plan. [sent-61, score-0.098]

28 Deﬁne the neighborhood of ai as ng(ai , �) = {z ; d(z, ai ) ≤ �}. [sent-78, score-0.084]

29 The PV(µ1 , µ2 , �, d) between the two distributions is: N N 1� 1� wi + vj (2) min wi ≥0,vi ≥0,Zij ≥0 2 2 j=1 i=1 � Zij + wi = µ1 (ai ), ∀i s. [sent-79, score-0.258]

30 aj ∈ng(ai ,�) � ai ∈ng(aj ,�) Zij + vj = µ2 (aj ), ∀j Zij = 0 , ∀(i, j) �∈ ng(ai , �). [sent-81, score-0.13]

31 Each row in the matrix Z ∈ RN ×N corresponds to a point mass in µ1 , and each column to a point mass in µ2 . [sent-82, score-0.082]

32 For each i, Z(i, :) is zero in columns corresponding to non neighboring elements, and non-zero only for columns j for which transportation between µ2 (aj ) → µ1 (ai ) is performed. [sent-83, score-0.098]

33 The discrepancies between the distributions are depicted by the scalars wi and vi that count the “leftover” mass in µ1 (ai ) and µ2 (aj ). [sent-84, score-0.181]

34 The objective is to minimize these discrepancies, therefore matrix Z describes the optimal transportation plan constrained to �-perturbations. [sent-85, score-0.14]

35 , ym }, generated by distributions P and Q respectively, � PV(S1 , S2 , �, d) is: n m 1 � 1 � wi + vj (3) min wi ≥0,vi ≥0,Zij ≥0 2n 2m j=1 i=1 � � Zij + wi = 1, Zij + vj = 1, ∀i, j s. [sent-96, score-0.345]

36 yj ∈ng(xi ,�) Zij = 0 , xi ∈ng(yj ,�) ∀(i, j) �∈ ng(xi , �), where Z ∈ R . [sent-98, score-0.088]

37 This solution may be found by solving the optimal assignment on an appropriate bipartite graph [3]. [sent-103, score-0.081]

38 n m with edge weight zero to yj ∈ ng(xi ) and with weight ∞ to yj �∈ ng(xi ). [sent-125, score-0.132]

39 To make the graph complete, assign zero cost edges between all vertices xi and wk for k �= i (and vertices yj and vk for k �= j). [sent-127, score-0.185]

40 We note that the Earth Mover Distance (EMD) [4], a sampled version of the transportation problem, is also formulated by a linear program that may be solved by optimal assignment. [sent-128, score-0.098]

41 Contrarily, graph G, which describes PV, is a simple bipartite graph for which maximum cardinality matching, a much simpler problem, can be applied to ﬁnd the optimal assignment. [sent-130, score-0.107]

42 To ﬁnd the optimal assignment, ﬁrst solve the maximum matching on the partial graph between vertices xi , yj that have zero weight edges (corresponding to neighboring vertices). [sent-131, score-0.173]

43 Then, assign vertices xi and yj for whom a match was not found with wi and vj respectively; see Algorithm 1 and Figure 1 for an illustration of a matching. [sent-132, score-0.221]

44 Let k be the average number of neighbors of a sample, then the average number of edges in the ˆ ˆ bipartite graph G is |E| = n × k. [sent-137, score-0.088]

45 3 Related Work Many scores have been deﬁned for testing discrepancy between distributions. [sent-139, score-0.154]

46 Among these tests is the well known Kolmogorov-Smirnov test (for one dimensional distributions), and its generalization to higher dimensions by minimal spanning trees [6]. [sent-144, score-0.092]

47 A different statistic is deﬁned by the portion of k-nearest neighbors of each sample that belongs to different distributions; larger portions mean the distributions are closer [7]. [sent-145, score-0.145]

48 These scores are well known in the statistical literature but cannot be easily changed to test similarity, as their analysis relies on testing equality. [sent-146, score-0.125]

49 In both cases, the distance is not estimated directly on the samples, but on a higher level partition of the space: histogram bins or signatures (cluster centers). [sent-149, score-0.126]

50 It is impractical to use the EMD to estimate the Wasserstein metric between the continuous distributions, as convergence would require the number of bins to be exponentially dependent on the dimension. [sent-150, score-0.08]

51 It is possible to consider the PV as a reﬁnement of the EMD notion of similarity; instead of clustering the data to signatures and moving the signatures, it perturbs each sample. [sent-154, score-0.079]

52 In this manner, it captures a ﬁner notion of similarity better suited for statistical testing. [sent-155, score-0.229]

53 1) = 1 Figure 2: Two distributions on R: The PV captures the perceptual similarity of (a),(b) against the disimilarity in (c). [sent-176, score-0.306]

54 Therefore, instead of the TV, it may be interesting to consider the L1 as a similarity distance on the measures after discretization. [sent-197, score-0.243]

55 Namely, the problem is the choice of a single partition to measure similarity of a reference distribution to multiple distributions, while choosing multiple partitions would make the distances incomparable. [sent-200, score-0.226]

56 The last group of statistics are scores established in machine learning: the dA distance presented by Kifer et al. [sent-202, score-0.08]

57 that is based on the maximum discrepancy on a chosen subset of the support [8], and Maximum Mean Discrepancy (MMD) by Gretton et al. [sent-203, score-0.079]

58 , which deﬁne discrepancy after embeddings the distributions to a Reproducing Kernel Hilbert Space (RKHS)[9]. [sent-204, score-0.145]

59 These scores have corresponding statistical tests for the TSP; however, since their analysis is based on ﬁnite convergence bounds, in principle they may be modiﬁed to test similarity. [sent-205, score-0.171]

60 The MMD captures the distance between the samples in some RKHS. [sent-207, score-0.11]

61 The MMD may be used to deﬁne a similarity test, yet this would require deﬁning two parameters, σ and the similarity rate, whose dependency is not intuitive. [sent-208, score-0.348]

62 Namely, for any similarity rate the result of the test is highly dependent on the choice of σ, but it is not clear how it should be made. [sent-209, score-0.203]

63 , ym } ∈ Rd generated by distributions P and Q, respectively. [sent-224, score-0.102]

64 However, intuitively, slow convergence is not always the case, for example when the support of the distributions lies in a lower dimensional manifold of the space. [sent-272, score-0.089]

65 5 Statistical Inference We construct two types of complementary procedures for hypothesis testing of similarity and dissimilarity2 . [sent-276, score-0.299]

66 In the ﬁrst type of procedures, given 0 ≤ θ < 1, we distinguish between the null (1) hypothesis H0 : PV(P, Q, �, d) ≤ θ, which implies similarity, and the alternative hypothesis (1) H1 : PV(P, Q, �, d) > θ. [sent-277, score-0.167]

67 In the second type of procedures, we test whether two distributions are similar. [sent-280, score-0.095]

68 Note that there isn’t an equivalent of this form for the TSP, therefore we can not infer similarity using the TSP test, but only reject equality. [sent-282, score-0.174]

69 Our hypothesis tests are based on the ﬁnite sample analysis presented in Section 4; see Appendix A. [sent-283, score-0.139]

70 The idea of bootstrapping for estimating CIs is based on a two step procedure: approximation of the sampling distribution of the statistic by resampling with replacement from the initial sample – the bootstrap stage – following, a computation of the CI based on the resulting distribution. [sent-286, score-0.114]

71 Using the CI, a hypothesis test may be formed: (1) the null H0 is rejected with signiﬁcance α if the range [0, θ] �⊂ [CI, CI]. [sent-289, score-0.196]

72 Also, for the second test, we apply the principle of CI inclusion [11], which states that if [CI, CI] ⊂ [0, θ], dissimilarity is rejected and similarity deduced. [sent-290, score-0.227]

73 2 The two procedures are distinct, as, in general, lacking evidence to reject similarity is not sufﬁcient to infer dissimilarity, and vice versa. [sent-291, score-0.206]

74 For this purpose, we apply signiﬁcance testing for similarity on two univariate uniform distributions: P ∼ U [0, 1] and Q ∼ U [Δ(�), 1 + Δ(�)], where Δ(�) is a varying size of perturbation. [sent-328, score-0.214]

75 For each value �� , we test the null hypothesis H0 : P V (P, Q, �� ) = 0 for ten equally � � spaced values of Δ(� ) in the range [0, 2� ]. [sent-335, score-0.143]

76 In this manner, we test the ability of the PV to detect similarity for different sizes of perturbations. [sent-336, score-0.203]

77 The percentage of times the null hypothesis was falsely rejected, i. [sent-337, score-0.114]

78 The percentage of times the null hypothesis was correctly rejected, the power of the test, was estimated as a function of the sample size and averaged over 500 repetitions. [sent-341, score-0.159]

79 Also, when ﬁner perturbations need to be detected, more samples are needed to gain statistical power. [sent-345, score-0.079]

80 Note that, given a sufﬁcient sample size, any statistic for the TSP would have rejected similarity for any Δ > 0. [sent-349, score-0.277]

81 2 Comparing Distance Measures Next, we test the ability of the PV to measure similarity on real data. [sent-351, score-0.203]

82 To this end, we test the ranking performance of the PV score against other known distributional distances. [sent-352, score-0.125]

83 We compare the PV to the multivariate extension of the Wald-Wolfowitz score of Friedman & Rafsky (FR) [6] , Schilling’s nearest neighbors score (KNN) [7], and the Maximum Mean Discrepancy score of Gretton et al. [sent-353, score-0.242]

84 We rank similarity for the applications of video retrieval and gait recognition. [sent-355, score-0.366]

85 For each similarity rank, where T is the total 1 ≤ i ≤ r of these observations, deﬁne ri ∈ [1, T − 1] as its � number of observations. [sent-358, score-0.174]

86 Twenty unique videos were selected as query videos, each of which has one similar clip in 3 Note that the statistical tests of these measures test equality while the PV tests similarity and therefore our experiments are not of statistical power but of ranking similarity. [sent-364, score-0.494]

87 Even in the case of the distances that may be transformed for similarity, like the MMD, there is no known function between the PV similarity to other forms of similarity. [sent-365, score-0.206]

88 As a result, there is no basis on which to compare which similarity test has better performance. [sent-366, score-0.203]

89 We computed the similarity rate for each query video to all videos in the set, and ranked the position of each video. [sent-403, score-0.234]

90 We also tested gait similarity of female and male subjects; same gender samples are assumed similar. [sent-407, score-0.494]

91 We used gait data that was recorded by a mobile phone, available at [13]. [sent-408, score-0.186]

92 The comparison was done by ranking by gender the 39 samples with respect to a reference walk. [sent-415, score-0.096]

93 This information provides a reasonable explanation to the PV results, as it appears that a subject from the male group may have a gait that is as dissimilar to the gait of a female subject as it is to a different male. [sent-425, score-0.432]

94 In the female group the subjects are more similar and therefore the precision is higher. [sent-426, score-0.097]

95 7 Discussion We proposed a new score that measures the similarity between two multivariate distributions, and assigns to it a value in the range [0,1]. [sent-427, score-0.269]

96 Although it is not a metric, our experiments show that it captures the distance between similar distributions as well as well known distributional distances. [sent-432, score-0.145]

97 Based on this analysis we provide hypothesis tests that give statistical signiﬁcance to the resulting score. [sent-434, score-0.137]

98 In addition, the PV has an intuitive interpretation that makes it an attractive score for a meaningful statistical testing of similarity. [sent-436, score-0.153]

99 A metric for distributions with applications to image databases. [sent-458, score-0.092]

100 Distribution-based similarity measures for multi-dimensional point set retrieval applications. [sent-512, score-0.198]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pv', 0.848), ('similarity', 0.174), ('gait', 0.163), ('mmd', 0.125), ('tsp', 0.114), ('transportation', 0.098), ('discrepancy', 0.079), ('emd', 0.075), ('bca', 0.072), ('score', 0.071), ('yj', 0.066), ('distributions', 0.066), ('tv', 0.064), ('ci', 0.064), ('tests', 0.063), ('null', 0.061), ('knn', 0.06), ('wasserstein', 0.058), ('zij', 0.057), ('male', 0.055), ('hypothesis', 0.053), ('rejected', 0.053), ('female', 0.051), ('vj', 0.051), ('signatures', 0.05), ('contrarily', 0.049), ('fr', 0.048), ('ng', 0.048), ('wi', 0.047), ('distance', 0.045), ('perturbed', 0.043), ('unmatched', 0.043), ('plan', 0.042), ('ai', 0.042), ('mass', 0.041), ('da', 0.041), ('testing', 0.04), ('aj', 0.037), ('ym', 0.036), ('vertices', 0.035), ('scores', 0.035), ('captures', 0.034), ('bootstrap', 0.034), ('monge', 0.033), ('distances', 0.032), ('perceptual', 0.032), ('bipartite', 0.032), ('procedures', 0.032), ('videos', 0.031), ('samples', 0.031), ('bins', 0.031), ('bootstrapping', 0.03), ('video', 0.029), ('neighbors', 0.029), ('test', 0.029), ('perturbs', 0.029), ('kifer', 0.029), ('gretton', 0.028), ('variation', 0.028), ('graph', 0.027), ('perturbations', 0.027), ('statistic', 0.027), ('qc', 0.027), ('discrepancies', 0.027), ('kantorovich', 0.027), ('cance', 0.026), ('metric', 0.026), ('precision', 0.026), ('inf', 0.025), ('qb', 0.025), ('shie', 0.025), ('ranking', 0.025), ('israel', 0.024), ('perturbation', 0.024), ('measures', 0.024), ('matching', 0.023), ('sample', 0.023), ('mobile', 0.023), ('qa', 0.023), ('phone', 0.023), ('convergence', 0.023), ('power', 0.022), ('pc', 0.022), ('pb', 0.022), ('transport', 0.022), ('xi', 0.022), ('assignment', 0.022), ('cardinality', 0.021), ('equality', 0.021), ('technion', 0.021), ('mannor', 0.021), ('brightness', 0.021), ('statistical', 0.021), ('intuitive', 0.021), ('gender', 0.02), ('des', 0.02), ('subjects', 0.02), ('chapter', 0.02), ('reference', 0.02), ('sw', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 338 nips-2012-The Perturbed Variation

Author: Maayan Harel, Shie Mannor

2 0.16156597 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi

Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their ﬁnite-sample convergence rates. We show that if pairs of function values are available, algorithms that √ gradient estimates based on random perturbations suffer a factor use of at most d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problemdependent quantities: they cannot be improved by more than constant factors. 1

3 0.12564354 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur

Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.

4 0.1168644 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

Author: Yung-kyun Noh, Frank Park, Daniel D. Lee

Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classiﬁcation. We show that conventional k-nearest neighbor classiﬁcation can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classiﬁcation. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose ﬁve different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classiﬁcation criteria. 1

5 0.085796878 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to deﬁne an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by deﬁning the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difﬁculty in learning the pairwise similarity measure is that there is a signiﬁcant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difﬁculty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efﬁciency. 1

6 0.078814417 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

7 0.066574112 330 nips-2012-Supervised Learning with Similarity Functions

8 0.061953761 148 nips-2012-Hamming Distance Metric Learning

9 0.057652656 242 nips-2012-Non-linear Metric Learning

10 0.056551132 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

11 0.055696838 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

12 0.055111229 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

13 0.051356751 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

14 0.050907481 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

15 0.050607838 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

16 0.048407089 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models

17 0.047200419 32 nips-2012-Active Comparison of Prediction Models

18 0.046676964 165 nips-2012-Iterative ranking from pair-wise comparisons

19 0.045864381 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

20 0.040216547 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.131), (1, 0.025), (2, 0.007), (3, -0.06), (4, 0.033), (5, -0.008), (6, 0.006), (7, 0.067), (8, -0.008), (9, 0.051), (10, 0.012), (11, -0.079), (12, 0.01), (13, -0.039), (14, -0.013), (15, -0.003), (16, -0.008), (17, -0.003), (18, 0.015), (19, 0.023), (20, 0.037), (21, -0.072), (22, 0.051), (23, -0.004), (24, 0.044), (25, 0.028), (26, 0.032), (27, 0.061), (28, -0.004), (29, -0.016), (30, 0.071), (31, -0.032), (32, -0.004), (33, -0.016), (34, 0.028), (35, 0.021), (36, 0.061), (37, 0.01), (38, 0.015), (39, 0.065), (40, -0.032), (41, -0.085), (42, -0.014), (43, 0.054), (44, -0.021), (45, 0.089), (46, 0.063), (47, 0.048), (48, -0.061), (49, 0.121)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90374237 338 nips-2012-The Perturbed Variation

Author: Maayan Harel, Shie Mannor

2 0.60177684 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

3 0.58687598 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indeﬁnite) similarity function between data points. Existing work on learning with indeﬁnite kernels has concentrated solely on binary/multiclass classiﬁcation problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classiﬁcation. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efﬁcient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness deﬁnition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves signiﬁcantly higher accuracies than the baseline methods while offering reduced computational costs. 1

4 0.54045039 145 nips-2012-Gradient Weights help Nonparametric Regressors

Author: Samory Kpotufe, Abdeslam Boularias

Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efﬁcient way to signiﬁcantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efﬁciently learned online. 1

5 0.52959543 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski

Abstract: Diversiﬁed ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to ﬁnd the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.

6 0.52450055 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

7 0.52294248 242 nips-2012-Non-linear Metric Learning

8 0.5158174 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

9 0.50984097 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

10 0.50755417 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

11 0.50103205 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics

12 0.49143857 148 nips-2012-Hamming Distance Metric Learning

13 0.48930591 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

14 0.46712232 196 nips-2012-Learning with Partially Absorbing Random Walks

15 0.46701187 39 nips-2012-Analog readout for optical reservoir computers

16 0.4628768 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

17 0.45102957 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

18 0.44315565 179 nips-2012-Learning Manifolds with K-Means and K-Flats

19 0.44110575 222 nips-2012-Multi-Task Averaging

20 0.43454394 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (21, 0.023), (38, 0.117), (39, 0.012), (42, 0.046), (54, 0.015), (55, 0.028), (59, 0.207), (74, 0.063), (76, 0.199), (80, 0.081), (84, 0.029), (92, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90433872 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

Author: Qirong Ho, Junming Yin, Eric P. Xing

Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N 2 ) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is 2 Θ( i Di ) (where Di is the degree of vertex i), which is much smaller than N 2 for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280, 000-node network, which is infeasible for network models with Ω(N 2 ) inference cost. 1

2 0.86233908 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classiﬁcation. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased ﬂexibility its downside is the considerable risk of overﬁtting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics deﬁned on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classiﬁcation problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a signiﬁcant manner. 1

3 0.85903698 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

Author: Daniel Zoran, Yair Weiss

Abstract: Simple Gaussian Mixture Models (GMMs) learned from pixels of natural image patches have been recently shown to be surprisingly strong performers in modeling the statistics of natural images. Here we provide an in depth analysis of this simple yet rich model. We show that such a GMM model is able to compete with even the most successful models of natural images in log likelihood scores, denoising performance and sample quality. We provide an analysis of what such a model learns from natural images as a function of number of mixture components including covariance structure, contrast variation and intricate structures such as textures, boundaries and more. Finally, we show that the salient properties of the GMM learned from natural images can be derived from a simplified Dead Leaves model which explicitly models occlusion, explaining its surprising success relative to other models. 1 GMMs and natural image statistics models Many models for the statistics of natural image patches have been suggested in recent years. Finding good models for natural images is important to many different research areas - computer vision, biological vision and neuroscience among others. Recently, there has been a growing interest in comparing different aspects of models for natural images such as log-likelihood and multi-information reduction performance, and much progress has been achieved [1,2, 3,4,5, 6]. Out of these results there is one which is particularly interesting: simple, unconstrained Gaussian Mixture Models (GMMs) with a relatively small number of mixture components learned from image patches are extraordinarily good in modeling image statistics [6, 4]. This is a surprising result due to the simplicity of GMMs and their ubiquity. Another surprising aspect of this result is that many of the current models may be thought of as GMMs with an exponential or infinite number of components, having different constraints on the covariance structure of the mixture components. In this work we study the nature of GMMs learned from natural image patches. We start with a thorough comparison to some popular and cutting edge image models. We show that indeed, GMMs are excellent performers in modeling natural image patches. We then analyze what properties of natural images these GMMs capture, their dependence on the number of components in the mixture and their relation to the structure of the world around us. Finally, we show that the learned GMM suggests a strong connection between natural image statistics and a simple variant of the dead leaves model [7, 8] , explicitly modeling occlusions and explaining some of the success of GMMs in modeling natural images. 1 3.5 .,...- ••.......-.-.. -..---'-. 1 ~~6\8161·· -.. .-.. --...--.-- ---..-.- -. --------------MII+··+ilIl ..... .. . . ~ '[25 . . . ---- ] B'II 1_ -- ~2 ;t:: fI 1 - --- ,---- ._.. : 61.5 ..... '

same-paper 4 0.84457159 338 nips-2012-The Perturbed Variation

Author: Maayan Harel, Shie Mannor

5 0.83969277 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams

Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes bruteforce search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expertlevel performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks. 1

6 0.83682275 294 nips-2012-Repulsive Mixtures

7 0.80603606 298 nips-2012-Scalable Inference of Overlapping Communities

8 0.77400595 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

9 0.77058285 188 nips-2012-Learning from Distributions via Support Measure Machines

10 0.76727974 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

11 0.76666987 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

12 0.76543057 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

13 0.76527518 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

14 0.76354527 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

15 0.76349491 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

16 0.76278692 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

17 0.76198816 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

18 0.76179492 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

19 0.76141775 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

20 0.76067495 277 nips-2012-Probabilistic Low-Rank Subspace Clustering