nips nips2012 nips2012-330 knowledge-graph by maker-knowledge-mining

330 nips-2012-Supervised Learning with Similarity Functions


Source: pdf

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. [sent-5, score-0.343]

2 We give a “goodness” criterion for similarity functions w.r.t. [sent-8, score-0.352]

3 a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. [sent-11, score-0.679]

4 We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. [sent-12, score-0.789]

5 Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. [sent-13, score-0.745]

6 1 Introduction: The goal of this paper is to develop an extended framework for supervised learning with similarity functions. [sent-15, score-0.343]

7 However, these algorithms typically require the similarity function to be a positive semi-definite (PSD) function, which can be a limiting factor for several applications. [sent-17, score-0.283]

8 The reasons are: 1) Mercer's condition is a formal statement that is hard to verify, 2) several natural notions of similarity that arise in practical scenarios are not PSD, and 3) it is not clear why an artificial constraint like PSD-ness should limit the usability of a kernel. [sent-18, score-0.330]

9 Several recent papers have demonstrated that indefinite similarity functions can indeed be successfully used for learning [2, 3, 4, 5]. [sent-19, score-0.283]

10 A notable exception is the line of work by [6, 7, 8] that defines a goodness criterion for a similarity function and then provides an algorithm that can exploit this goodness criterion to obtain provably accurate classifiers. [sent-21, score-0.865]

11 Using this view, we propose a generic goodness definition that also admits the goodness definition of [6] for classification as a special case. [sent-24, score-0.524]

12 Furthermore, our definition can be seen as imposing the existence of a smooth function over a generic space defined by similarity functions, rather than over a Hilbert space as required by typical goodness definitions of PSD kernels. [sent-25, score-0.531]

13 To this end, we consider three specific problems: a) regression, b) ordinal regression, and c) ranking. [sent-28, score-0.285]

14 Moreover, by adapting a general framework given by [9], we show that these guarantees do not require the goodness definition to be overly restrictive: our definitions admit all good PSD kernels as well. [sent-30, score-0.413]

15 For the problem of real-valued regression, we additionally provide a goodness definition that captures the intuition that usually, only a small number of landmarks are influential w.r.t. [sent-31, score-0.551]

16 However, to recover these landmarks, the uniform sampling technique would require sampling a large number of landmarks thus increasing the training/test time of the predictor. [sent-35, score-0.33]

17 We address this issue by applying a sparse vector recovery algorithm given by [10] and show that the resulting sparse predictor still has bounded generalization error. [sent-36, score-0.436]

18 We also address an important issue faced by algorithms that use landmarking as a feature construction step, viz. [6, 7, 8], namely that they typically assume separate landmark and training sets for ease of analysis. [sent-37, score-0.380]

19 Related Work: Existing approaches to extend kernel learning algorithms to indefinite kernels can be classified into three broad categories: a) those that use indefinite kernels directly with existing kernel learning algorithms, resulting in non-convex formulations [2, 3]. [sent-43, score-0.396]

20 Moreover, any domain knowledge stored in the original kernel is lost due to these task oblivious operations and consequently, no generalization guarantees can be given. [sent-46, score-0.305]

21 These models are able to use the indefinite kernel directly with existing PSD kernel learning techniques; all the while retaining the ability to give generalization bounds that quantitatively parallel those of PSD kernel learning models. [sent-49, score-0.552]

22 Y that restricts its interaction with data points to computing similarity values given by K. [sent-53, score-0.311]

23 Now, if the similarity function K is not discriminative enough for the given task then we cannot hope to construct a predictor out of it that enjoys good generalization properties. [sent-54, score-0.615]

24 Hence, it is natural to define the “goodness” of a given similarity function with respect to the learning task at hand. [sent-55, score-0.348]

25 Y over some distribution D, a similarity function K : X × X → [sent-58, score-0.283]

26 The above definition is inspired by the definition of a “good” similarity function with respect to classification tasks given in [6]. [sent-61, score-0.349]

27 Y over a distribution D, an (ε0, B)-good similarity function K, labeled training points sampled from D: T = {(x_1, y_1), . . . [sent-63, score-0.351]

28 R with bounded true loss over D. 1: Sample d unlabeled landmarks from D: L = {x_1^l, . . . [sent-69, score-0.508]

29 , x_d^l} // Else subsample d landmarks from T (see Appendix B for details). 2: Φ_L : x ↦ [sent-72, score-0.330]

30 Similar to [6], the above definition calls a similarity function K “good” if the target value y(x) of a given point x can be approximated in terms of (a weighted combination of) the target values of the K-“neighbors” of x. [sent-78, score-0.331]

31 Moreover, it defines goodness in terms of violations, a non-convex loss function. [sent-81, score-0.322]

32 Y over some distribution D, a similarity function K is said to be (ε0, B)-good with respect to a loss function ℓ_S : ℝ × Y → [sent-85, score-0.444]

33 Having given this definition, we must make sure that “good” similarity functions allow the construction of effective predictors (Utility property). [sent-95, score-0.321]

34 A similarity function K is said to be ε0-useful w.r.t. [sent-100, score-0.334]

35 Here, the ε0 term captures the misfit or the bias of the similarity function with respect to the learning problem. [sent-107, score-0.319]

36 All our utility guarantees proceed by first using unlabeled samples as landmarks to construct a landmarked space. [sent-109, score-0.77]

37 Next, using the goodness definition, we show the existence of a good linear predictor in the landmarked space. [sent-110, score-0.722]

38 This guarantee is obtained in two steps as outlined in Algorithm 1: first of all, we choose d unlabeled landmark points and construct a map Φ : X → [sent-111, score-0.298]

39 ℝ^d (see Step 1 of Algorithm 1) and show that there exists a linear predictor over ℝ^d that closely approximates the predictor f used in Definition 2 (see Lemma 15 in Appendix A). [sent-112, score-0.402]

40 In the second step, we learn a predictor (over the landmarked space) using ERM over a fresh labeled training set (see Step 3 of Algorithm 1). [sent-113, score-0.468]

41 We then use individual task-specific arguments and Rademacher average-based generalization bounds [13] thus proving the utility of the similarity function. [sent-114, score-0.502]
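
To make this pipeline concrete, here is a minimal sketch of the landmarking approach described above, assuming a user-supplied similarity function K(x, y). The use of scikit-learn's LinearSVR as the ε-insensitive ERM step and the particular defaults (d = 50 landmarks, ε = 0.1) are illustrative assumptions, not the paper's exact implementation (which uses LIBLINEAR).

```python
import numpy as np
from sklearn.svm import LinearSVR  # linear ERM with epsilon-insensitive loss (stand-in for LIBLINEAR)

def landmark_map(X, landmarks, K):
    """Step 2: map each x to the landmarked space (K(x, l_1), ..., K(x, l_d))."""
    return np.array([[K(x, l) for l in landmarks] for x in X], dtype=float)

def regland_fit(X, y, K, d=50, eps=0.1, seed=0):
    """Step 1: pick d landmarks (subsampled from the training set); Step 3: ERM over the landmarked space."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(d, len(X)), replace=False)
    landmarks = [X[i] for i in idx]
    w = LinearSVR(epsilon=eps, C=1.0).fit(landmark_map(X, landmarks, K), y)
    predict = lambda X_new: w.predict(landmark_map(X_new, landmarks, K))
    return predict, landmarks
```

Nothing in this sketch requires K to be PSD; for instance, a Manhattan-style similarity such as K = lambda a, b: -np.abs(a - b).sum() can be plugged in directly.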

42 The notion of a good PSD kernel for us will be one that corresponds to a prevalent large margin technique for the given problem. [sent-117, score-0.29]

43 3 Applications: We will now instantiate the general learning model described above to real-valued regression, ordinal regression and ranking by providing utility and admissibility guarantees. [sent-126, score-0.793]

44 In the following we shall present algorithms for performing real-valued regression using non-PSD similarity measures. [sent-130, score-0.495]

45 The above loss function automatically gives us notions of good kernels and similarity functions by appealing to Definitions 4 and 2 respectively. [sent-133, score-0.506]

46 We can thus reduce the problem of real-valued regression to that of a linear regression problem in the landmarked space. [sent-137, score-0.666]

47 Combining this with specific arguments for the ε-insensitive loss, we can prove generalization guarantees and hence utility guarantees for the similarity function. [sent-141, score-0.625]
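
For concreteness, the ε-insensitive loss referred to here is, in its standard form (as used in support vector regression; the paper's exact normalization is not shown in this summary):

```latex
\ell_{\epsilon}(\hat{y}, y) \;=\; \max\bigl(0,\ |\hat{y} - y| - \epsilon\bigr)
```

Since |ŷ − y| ≤ ℓ_ε(ŷ, y) + ε pointwise, a predictor with small ε-insensitive loss also has small absolute loss up to an additive ε, which is how the (ε0 + ε)-usefulness with respect to absolute loss in the statement below arises.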

48 Every similarity function that is (ε0, B)-good for a regression problem with respect to the ε-insensitive loss function ℓ_ε(·, ·) is (ε0 + ε)-useful with respect to absolute loss, as well as (Bε0 + Bε)-useful with respect to mean squared error. [sent-143, score-0.870]

49 Moreover, both the dimensionality of the landmarked space as well as the labeled sample complexity can be bounded by O((B²/ε²) log(1/δ)). [sent-144, score-0.347]

50 Every PSD kernel that is (ε0, γ)-good for a regression problem is, for any ε1 > 0, (ε0 + ε1, O(1/(γ ε1²)))-good as a similarity function as well. [sent-146, score-0.637]

51 Moreover, for any ε1 < 1/2 and any γ < 1, there exists a regression instance and a corresponding kernel that is (0, γ)-good for the [sent-147, score-0.384]

52 regression problem but only (ε1, B)-good as a similarity function for B = Ω(1/(γ⁴ ε1²)). [sent-148, score-0.495]

53 3.2 Sparse regression models: An artifact of a random choice of landmarks is that very few of them might turn out to be “informative” with respect to the prediction problem at hand. [sent-149, score-0.551]

54 If the relative abundance of such nodes is low then random selection would compel us to choose a large number of landmarks before enough “informative” ones have been collected. [sent-151, score-0.303]

55 Thus, the ability to prune away irrelevant landmarks would speed up training and test routines. [sent-153, score-0.303]

56 In contrast, we guarantee that our predictor will select a small number of landmarks while incurring bounded generalization error. [sent-155, score-0.659]

57 A similarity function K is said to be (ε0, B, τ)-good for a real-valued regression problem y : X → [sent-158, score-0.546]

58 While the motivation behind [15] was to give an improved admissibility result for binary classification, we squarely focus on the utility guarantees, with the aim of accelerating our learning algorithms via landmark pruning. [sent-167, score-0.449]

59 Given a similarity function that is (ε0, B, τ)-good for a regression problem, there exists a randomized map Ψ : X → [sent-171, score-0.525]

60 The number of landmarks required here is an Ω(1/τ) fraction greater than that required by Theorem 5. [sent-176, score-0.303]

61 Moreover, this utility can be achieved by an O(τ/ε1)-sparse predictor on the landmarked space. [sent-183, score-0.590]

62 Every PSD kernel that is (ε0, γ)-good for a regression problem is, for any ε1 > 0, (ε0 + ε1, O(1/(γ ε1²)), 1)-good as a similarity function as well. [sent-192, score-0.637]

63 3.3 Ordinal Regression: The problem of ordinal regression requires an accurate prediction of (discrete) labels coming from a finite ordered set [r] = {1, 2, . . . , r}. [sent-194, score-0.497]

64 A natural and rather tempting way to solve this problem is to relax the problem to real-valued regression and threshold the output of the learned real-valued predictor using predefined thresholds b_1, . . . , b_{r−1}. [sent-200, score-0.450]

65 Thus, if we define the γ-margin loss function [·]_γ (a suitably scaled version of the well-known hinge loss), we can define our goodness criterion as follows: Definition 12. [sent-216, score-0.439]

66 A similarity function K is said to be (ε0, B)-good for an ordinal regression problem y : X → [sent-217, score-0.831]

67 Let K be a similarity function that is (ε0, B)-good for an ordinal regression problem with respect to suitably spaced thresholds and the γ-margin loss. [sent-232, score-0.868]

68 Then K is ε0-useful with respect to ordinal regression error (absolute loss). [sent-234, score-0.567]

69 We can bound both the dimensionality of the landmarked space as well as the labeled sample complexity by O((B²/ε²) log(1/δ)). [sent-236, score-0.306]

70 Notice that for ε0 < 1 and large enough d, n, we can ensure that the ordinal regression error rate is also bounded above by 1, since the supremum of the margin loss over x ∈ [0, 1] and γ > 0 is 1. [sent-237, score-0.596]

71 This is in contrast with the direct reduction to real-valued regression, which has ordinal regression error rate bounded below by 1. [sent-238, score-0.808]
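
A minimal sketch of the thresholding reduction discussed above: a real-valued score (e.g., from the landmarking predictor) is binned using fixed thresholds b_1 < . . . < b_{r−1}. The equally spaced thresholds and the particular hinge form of the margin loss below are illustrative assumptions; ORLand as analyzed in the paper trains against its own margin-based goodness criterion rather than this plain reduction.

```python
import numpy as np

def spaced_thresholds(r, lo=0.0, hi=1.0):
    """r - 1 equally spaced thresholds b_1 < ... < b_{r-1} splitting [lo, hi] into r bins."""
    return np.linspace(lo, hi, r + 1)[1:-1]

def to_ordinal_label(score, thresholds):
    """Predicted label = 1 + number of thresholds the real-valued score exceeds (labels in 1..r)."""
    return 1 + int(np.searchsorted(thresholds, score, side="right"))

def margin_hinge(margin, gamma):
    """One common gamma-margin hinge, [1 - margin/gamma]_+ ; used here only as an illustrative stand-in."""
    return max(0.0, 1.0 - margin / gamma)
```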

72 We can show that our definition of a good similarity function admits all good PSD kernels as well. [sent-240, score-0.459]

73 The kernel goodness criterion we adopt corresponds to the large margin framework proposed by [16]. [sent-241, score-0.476]

74 Every PSD kernel that is (ε0, γ)-good for an ordinal regression problem is also (ε0 + ε1, O(γ1²/(γ² ε1²)))-good as a similarity function with respect to the γ1-margin loss, for any γ1, ε1 > 0. [sent-245, score-1.032]

75 Moreover, for any ε1 < γ1/2, there exists an ordinal regression instance and a corresponding kernel that is (0, γ)-good for the ordinal regression problem but only (ε1, B)-good as a similarity function with respect to the γ1-margin loss function for B = Ω(γ1²/(γ⁶ ε1²)). [sent-246, score-1.559]

76 (a) Mean squared error for landmarking (RegLand), sparse landmarking (RegLand-Sp) and kernel regression (KR) (b) Avg. [sent-247, score-0.901]

77 absolute error for landmarking (ORLand) and kernel regression (KR) on ordinal regression datasets. Figure 1: Performance of landmarking algorithms with increasing number of landmarks on real-valued regression (Figure 1a) and ordinal regression (Figure 1b) datasets. [sent-248, score-2.472]

78 Datasets (Table 1): Abalone [18] (N = 4177, d = 8), Bodyfat [19] (N = 252, d = 14), CAHousing [19] (N = 20640, d = 8), CPUData [20] (N = 8192, d = 12), PumaDyn-8 [20] (N = 8192, d = 8), PumaDyn-32 [20] (N = 8192, d = 32); columns report KR and Land-Sp under the Sigmoid and Manhattan kernels. [sent-249, score-0.284]

79 (a) Mean squared error for real-valued regression.

80 (b) Mean absolute error for ordinal regression. Table 1: Performance of landmarking-based algorithms (with 50 landmarks) vs.

81 Due to lack of space we refer the reader to Appendix F for a discussion on ranking models that includes utility and admissibility guarantees with respect to the popular NDCG loss. [sent-349, score-0.43]

82 4 Experimental Results: In this section we present an empirical evaluation of our learning models for the problems of real-valued regression and ordinal regression on benchmark datasets taken from a variety of sources [18, 19, 20]. [sent-350, score-0.806]

83 In all cases, we compare our algorithms against kernel regression (KR), a well known technique [21] for non-linear regression, whose predictor is of the form ŷ(x) = Σ_i y(x_i) K(x, x_i) / Σ_i K(x, x_i). [sent-351, score-0.567]

84 We selected KR as the baseline as it is a popular regression method that does not require similarity functions to be PSD. [sent-354, score-0.53]

85 For ordinal regression problems, we rounded off the result of the KR predictor to get a discrete label. [sent-355, score-0.683]
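
A sketch of the kernel regression (KR) baseline as described above: a Nadaraya–Watson style weighted average of training labels, with rounding for ordinal tasks. The clipping of the rounded output to {1, ..., r} is an added safeguard, not something stated in the summary.

```python
import numpy as np

def kr_predict(x, X_train, y_train, K):
    """KR predictor: sum_i y(x_i) K(x, x_i) / sum_i K(x, x_i) (assumes the weights do not sum to zero)."""
    w = np.array([K(x, xi) for xi in X_train], dtype=float)
    return float(np.dot(w, y_train) / np.sum(w))

def kr_ordinal_predict(x, X_train, y_train, K, r):
    """Round the real-valued KR output to the nearest discrete label in {1, ..., r}."""
    return int(np.clip(round(kr_predict(x, X_train, y_train, K)), 1, r))
```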

86 For RegLand, we constructed the landmarked space as specified in Algorithm 1 and learned a linear predictor using the LIBLINEAR package [14] that minimizes the ε-insensitive loss. [sent-360, score-0.428]

87 In the second algorithm (RegLand-Sp), we used the sparse learning algorithm of [10] on the landmarked space to learn the best predictor for a given sparsity level. [sent-361, score-0.471]
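
The summary does not spell out the sparse recovery algorithm of [10], so the sketch below substitutes an L1-penalized linear fit (scikit-learn's Lasso) as a stand-in to convey the idea behind RegLand-Sp: over-sample landmarks, then keep only those whose learned weights are nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso

def regland_sp_fit(X, y, K, d=300, alpha=0.01, seed=0):
    """Learn a sparse weight vector over the landmarked space; prune landmarks with zero weight."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(d, len(X)), replace=False)
    landmarks = [X[i] for i in idx]
    Z = np.array([[K(x, l) for l in landmarks] for x in X], dtype=float)
    model = Lasso(alpha=alpha).fit(Z, y)
    keep = np.flatnonzero(model.coef_)
    pruned = [landmarks[i] for i in keep]   # only these need similarity evaluations at test time
    w, b = model.coef_[keep], model.intercept_
    def predict(X_new):
        Z_new = np.array([[K(x, l) for l in pruned] for x in X_new], dtype=float)
        return Z_new @ w + b
    return predict, pruned
```

The reduced test-time cost reported for the sparse landmark selection comes precisely from the number of retained landmarks being much smaller than d.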

88 Moreover, the Manhattan kernel seems to match or outperform the Sigmoid kernel on all the datasets. [sent-373, score-0.284]

89 Of the six datasets considered here, the two Wine datasets are ordinal regression datasets where the quality of the wine is to be predicted on a scale from 1 to 10. [sent-378, score-0.64]

90 The remaining four datasets are regression datasets whose labels were subjected to equi-frequency binning to obtain ordinal regression datasets [16]. [sent-379, score-0.823]
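
A minimal sketch of the equi-frequency binning step mentioned above (the exact procedure used in [16] may differ in how ties and bin edges are handled):

```python
import numpy as np

def equifrequency_bin(y, r):
    """Map real-valued targets to ordinal labels 1..r using (roughly) equal-frequency bins."""
    edges = np.quantile(y, np.linspace(0.0, 1.0, r + 1)[1:-1])   # r - 1 interior quantile cut points
    return 1 + np.searchsorted(edges, y, side="right")
```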

91 Figure 1b compares ORLand with KR as the number of landmarks increases. [sent-381, score-0.303]

92 The Sigmoid kernel seems to outperform the Manhattan kernel on a couple of datasets. [sent-384, score-0.284]
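
The two non-PSD similarity measures referred to in these experiments are, in their common forms, a sigmoid (tanh) kernel and a Manhattan (negative L1 distance) similarity; the parameter values a and b below are placeholders, since the summary does not give the hyperparameters actually used.

```python
import numpy as np

def sigmoid_similarity(x, z, a=1.0, b=0.0):
    """Sigmoid 'kernel' tanh(a <x, z> + b); indefinite (non-PSD) in general."""
    return float(np.tanh(a * np.dot(x, z) + b))

def manhattan_similarity(x, z):
    """Manhattan similarity: negative L1 distance, a natural non-PSD similarity measure."""
    return -float(np.sum(np.abs(np.asarray(x) - np.asarray(z))))
```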

93 5 Conclusion: In this work we considered the general problem of supervised learning using non-PSD similarity functions. [sent-386, score-0.343]

94 We provided a goodness criterion for similarity functions w.r.t. [sent-387, score-0.574]

95 We then focused on the problem of identifying influential landmarks with the aim of learning sparse predictors. [sent-393, score-0.346]

96 We presented a model that formalized the intuition that typically only a small fraction of landmarks is influential for a given learning problem. [sent-394, score-0.303]

97 We adapted existing sparse vector recovery algorithms within our model to learn provably sparse predictors with bounded generalization error. [sent-395, score-0.288]

98 Finally, we empirically evaluated our learning algorithms on benchmark regression and ordinal regression tasks. [sent-396, score-0.736]

99 In all cases, our learning methods, especially the sparse recovery algorithm, consistently outperformed the kernel regression baseline. [sent-397, score-0.425]

100 An interesting direction for future research would be learning good similarity functions à la metric learning or kernel learning. [sent-398, score-0.471]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('landmarks', 0.303), ('ordinal', 0.285), ('similarity', 0.283), ('goodness', 0.248), ('landmarked', 0.242), ('psd', 0.241), ('landmarking', 0.22), ('kr', 0.216), ('regression', 0.212), ('inde', 0.202), ('predictor', 0.186), ('landmark', 0.16), ('orland', 0.154), ('admissibility', 0.144), ('kernel', 0.142), ('utility', 0.119), ('manhattan', 0.112), ('sigmoid', 0.105), ('nition', 0.1), ('regland', 0.088), ('loss', 0.074), ('appendix', 0.073), ('jw', 0.071), ('generalization', 0.071), ('mse', 0.067), ('hw', 0.067), ('absolute', 0.065), ('bounded', 0.065), ('guarantees', 0.063), ('india', 0.061), ('supervised', 0.06), ('kernels', 0.056), ('thresholds', 0.052), ('said', 0.051), ('nathan', 0.05), ('accuracies', 0.047), ('notions', 0.047), ('good', 0.046), ('annual', 0.045), ('cpudata', 0.044), ('kar', 0.044), ('purushottam', 0.044), ('yihua', 0.044), ('criterion', 0.043), ('margin', 0.043), ('sparse', 0.043), ('uential', 0.043), ('unlabeled', 0.043), ('moreover', 0.042), ('labeled', 0.04), ('avrim', 0.039), ('exc', 0.039), ('datasets', 0.038), ('predictors', 0.038), ('rkhs', 0.037), ('respect', 0.036), ('bi', 0.036), ('maya', 0.036), ('weighing', 0.036), ('baseline', 0.035), ('reader', 0.035), ('classi', 0.034), ('guarantee', 0.034), ('error', 0.034), ('prateek', 0.034), ('ranking', 0.033), ('outlined', 0.033), ('prevalent', 0.032), ('realvalued', 0.032), ('parentheses', 0.031), ('hk', 0.03), ('tasks', 0.03), ('nitions', 0.03), ('informative', 0.03), ('squared', 0.03), ('exists', 0.03), ('wine', 0.029), ('task', 0.029), ('bounds', 0.029), ('ali', 0.028), ('erm', 0.028), ('balcan', 0.028), ('points', 0.028), ('recovery', 0.028), ('admits', 0.028), ('technique', 0.027), ('microsoft', 0.027), ('benchmark', 0.027), ('nite', 0.026), ('give', 0.026), ('prove', 0.026), ('theorem', 0.025), ('shai', 0.025), ('dimensionality', 0.024), ('insensitive', 0.024), ('remedy', 0.024), ('srebro', 0.024), ('cheng', 0.024), ('target', 0.024), ('xl', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

2 0.18723412 40 nips-2012-Analyzing 3D Objects in Cluttered Images

Author: Mohsen Hejrati, Deva Ramanan

Abstract: We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset. 1

3 0.18213364 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams

Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1

4 0.12039782 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

5 0.11399014 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain

Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1

6 0.11350474 16 nips-2012-A Polynomial-time Form of Robust Regression

7 0.10202589 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

8 0.09540505 145 nips-2012-Gradient Weights help Nonparametric Regressors

9 0.094763666 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

10 0.094690017 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

11 0.093954489 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

12 0.092724457 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

13 0.091731429 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

14 0.089177415 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

15 0.07808686 231 nips-2012-Multiple Operator-valued Kernel Learning

16 0.076546095 188 nips-2012-Learning from Distributions via Support Measure Machines

17 0.07266625 271 nips-2012-Pointwise Tracking the Optimal Regression Function

18 0.071371913 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

19 0.069502249 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

20 0.069180429 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, 0.038), (2, 0.008), (3, -0.081), (4, 0.113), (5, -0.044), (6, 0.004), (7, 0.18), (8, -0.022), (9, -0.04), (10, -0.003), (11, -0.024), (12, 0.037), (13, 0.007), (14, 0.078), (15, -0.019), (16, 0.059), (17, 0.043), (18, -0.042), (19, 0.014), (20, 0.039), (21, -0.013), (22, 0.041), (23, -0.002), (24, 0.02), (25, 0.011), (26, 0.056), (27, 0.039), (28, -0.064), (29, -0.027), (30, -0.007), (31, -0.19), (32, 0.047), (33, -0.024), (34, 0.057), (35, -0.013), (36, -0.083), (37, -0.052), (38, -0.001), (39, 0.036), (40, -0.05), (41, 0.005), (42, 0.026), (43, -0.047), (44, -0.093), (45, 0.178), (46, 0.094), (47, 0.111), (48, -0.006), (49, 0.116)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9328627 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

2 0.70652664 145 nips-2012-Gradient Weights help Nonparametric Regressors

Author: Samory Kpotufe, Abdeslam Boularias

Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1

3 0.65874016 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou

Abstract: Both random Fourier features and the Nystr¨ m method have been successfully o applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nystr¨ m method are randomly sampled from the training examples and are o therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nystr¨ m method can yield impressively better generalization error bound than o random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets. 1

4 0.64986587 338 nips-2012-The Perturbed Variation

Author: Maayan Harel, Shie Mannor

Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1

5 0.63310033 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng

Abstract: We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of regression function, based on the recent development of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or type of variables, and uses computationally simple eigendecomposition, thus applicable to large data sets. Second, in combination of a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models. 1

6 0.6230402 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

7 0.59248137 231 nips-2012-Multiple Operator-valued Kernel Learning

8 0.58869743 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

9 0.58344144 264 nips-2012-Optimal kernel choice for large-scale two-sample tests

10 0.58128494 167 nips-2012-Kernel Hyperalignment

11 0.57049584 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

12 0.55245721 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

13 0.54523385 221 nips-2012-Multi-Stage Multi-Task Feature Learning

14 0.53040552 188 nips-2012-Learning from Distributions via Support Measure Machines

15 0.52427423 16 nips-2012-A Polynomial-time Form of Robust Regression

16 0.52212334 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

17 0.51020384 271 nips-2012-Pointwise Tracking the Optimal Regression Function

18 0.50392348 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

19 0.50180572 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

20 0.49532643 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.033), (21, 0.018), (38, 0.158), (39, 0.012), (42, 0.042), (45, 0.204), (53, 0.01), (54, 0.012), (55, 0.025), (74, 0.077), (76, 0.149), (80, 0.135), (92, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94133359 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

Author: Amy Greenwald, Jiacui Li, Eric Sodomka

Abstract: In many large economic markets, goods are sold through sequential auctions. Examples include eBay, online ad auctions, wireless spectrum auctions, and the Dutch flower auctions. In this paper, we combine methods from game theory and decision theory to search for approximate equilibria in sequential auction domains, in which bidders do not know their opponents’ values for goods, bidders only partially observe the actions of their opponents’, and bidders demand multiple goods. We restrict attention to two-phased strategies: first predict (i.e., learn); second, optimize. We use best-reply dynamics [4] for prediction (i.e., to predict other bidders’ strategies), and then assuming fixed other-bidder strategies, we estimate and solve the ensuing Markov decision processes (MDP) [18] for optimization. We exploit auction properties to represent the MDP in a more compact state space, and we use Monte Carlo simulation to make estimating the MDP tractable. We show how equilibria found using our search procedure compare to known equilibria for simpler auction domains, and we approximate an equilibrium for a more complex auction domain where analytical solutions are unknown. 1

2 0.8717134 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

Author: Qixia Jiang, Jun Zhu, Maosong Sun, Eric P. Xing

Abstract: An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihoodbased supervised topic models, of which posterior inference can be carried out using the Bayes’ rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limited the choice of inference algorithms to be based on variational approximation with strict mean field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

same-paper 3 0.85816693 330 nips-2012-Supervised Learning with Similarity Functions

Author: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1

4 0.85526603 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

5 0.85398453 292 nips-2012-Regularized Off-Policy TD-Learning

Author: Bo Liu, Sridhar Mahadevan, Ji Liu

Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1

6 0.79402101 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

7 0.7922138 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.79193985 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.78898776 168 nips-2012-Kernel Latent SVM for Visual Recognition

10 0.78804141 200 nips-2012-Local Supervised Learning through Space Partitioning

11 0.78770632 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.78740829 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

13 0.78724164 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

14 0.78664595 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.78582823 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

16 0.78562647 65 nips-2012-Cardinality Restricted Boltzmann Machines

17 0.78444958 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.78246623 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

19 0.78246015 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

20 0.78233492 16 nips-2012-A Polynomial-time Form of Robust Regression