nips nips2010 nips2010-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? [sent-6, score-0.184]
2 We present a framework that simultaneously clusters the data and trains a discriminative classifier. [sent-7, score-0.187]
3 RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. [sent-9, score-0.184]
4 The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. [sent-10, score-0.133]
5 In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. [sent-11, score-0.131]
6 A great number of clustering principles have been proposed, most of which can be described as either generative or discriminative in nature. [sent-15, score-0.183]
7 Generative clustering algorithms provide constructive definitions of categories in terms of their geometric properties in a feature space or as statistical processes for generating data. [sent-16, score-0.157]
8 In order for generative clustering to be practical, restrictive assumptions must be made about the underlying category definitions. [sent-18, score-0.189]
9 Rather than modeling categories explicitly, discriminative clustering techniques represent the boundaries or distinctions between categories. [sent-19, score-0.27]
10 Spectral graph partitioning [1] and maximum margin clustering [2] are example discriminative clustering methods. [sent-21, score-0.282]
11 A disadvantage of existing discriminative approaches is that they lack a probabilistic foundation, making them potentially unsuitable in applications that require reasoning under uncertainty or in data exploration. [sent-22, score-0.119]
12 We propose a principled probabilistic approach to discriminative clustering, by formalizing the problem as unsupervised learning of a conditional probabilistic model. [sent-23, score-0.299]
13 We extend the approach of [4] in order to learn probabilistic classifiers that are appropriate for multi-class discriminative clustering, as explained in Section 2. [sent-25, score-0.119]
14 We identify two fundamental, competing quantities, class balance and class separation, and develop an information theoretic objective function which trades off these quantities. [sent-26, score-0.184]
15 Our approach corresponds to maximizing mutual information between the empirical distribution on the inputs and the induced label distribution, regularized by a complexity penalty. [sent-27, score-0.149]
16 In summary, our contribution is RIM, a probabilistic framework for discriminative clustering with a number of attractive properties. [sent-29, score-0.218]
17 Thanks to its probabilistic formulation, RIM is flexible: it is compatible with diverse likelihood functions and allows specification of prior assumptions about expected class proportions. [sent-30, score-0.127]
18 Our approach is to construct a functional F (p(y|x, W); X, λ) which evaluates the suitability of p(y|x, W) as a discriminative clustering model. [sent-42, score-0.201]
19 We then use standard discriminative classifiers such as logistic regression for p(y|x, W), and maximize the resulting function F (W; X, λ) over the parameters W. [sent-43, score-0.125]
20 The first is that the discriminative model’s decision boundaries should not be located in regions of the input space that are densely populated with datapoints. [sent-46, score-0.136]
21 Grandvalet & Bengio [3] show that a conditional entropy term $-\frac{1}{N}\sum_i H\{p(y|x_i, W)\}$ very effectively captures the cluster assumption when training probabilistic classifiers with partial labels. [sent-48, score-0.235]
22 However, in the case of fully unsupervised learning this term alone is not enough to ensure sensible solutions, because conditional entropy may be reduced by simply removing decision boundaries, and so unlabeled categories tend to be removed. [sent-49, score-0.371]
23 We illustrate this in Figure 1 (left) with an example using the multilogit regression classifier as the conditional model p(y|x, W), which we will develop in Section 3. [sent-50, score-0.181]
24 In order to avoid degenerate solutions, we incorporate the notion of class balance: we prefer configurations in which category labels are assigned evenly across the dataset. [sent-51, score-0.161]
25 A natural way to encode our preference towards class balance is to use the entropy $H\{\hat{p}(y; W)\}$, because it is maximized when the labels are uniformly distributed. [sent-53, score-0.216]
26 Combining the two terms, we arrive at $I_W\{y; x\} = H\{\hat{p}(y; W)\} - \frac{1}{N}\sum_i H\{p(y|x_i, W)\}$ (1), which is the empirical estimate of the mutual information between x and y under the conditional model p(y|x, W). [sent-54, score-0.121]
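To make Eq. 1 concrete, the sketch below computes the empirical mutual information estimate from an N x K matrix of conditional probabilities. This is an illustrative Python/NumPy sketch, not the authors' code; the function and variable names (`empirical_mutual_information`, `P`) are assumptions.

```python
import numpy as np

def empirical_mutual_information(P):
    """Empirical estimate of I_W{y; x} (Eq. 1) from an N x K matrix P with
    P[i, k] = p(y = k | x_i, W).  Illustrative sketch only."""
    eps = 1e-12                                            # guards log(0)
    p_hat = P.mean(axis=0)                                 # induced label marginal p_hat(y; W)
    h_marginal = -np.sum(p_hat * np.log(p_hat + eps))      # class-balance term H{p_hat(y; W)}
    h_conditional = -np.mean(np.sum(P * np.log(P + eps), axis=1))  # (1/N) sum_i H{p(y|x_i, W)}
    return h_marginal - h_conditional
```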
27 However, they note that IW {y; x} may be trivially maximized by a conditional model that classifies each data point xi into its own category yi , and that classifiers trained with this objective tend to fragment the data into a large number of categories, see Figure 1 (center). [sent-57, score-0.196]
28 This term penalizes conditional models with complex decision boundaries in order to yield sensible clustering solutions. [sent-59, score-0.192]
29 While we motivated this objective with notions of class balance and separation, our approach may be interpreted as learning a conditional distribution for y that preserves information from the data set, subject to a complexity penalty. [sent-61, score-0.202]
30 Figure 1: Example unsupervised multilogit regression solutions on a simple dataset with three clusters. [sent-78, score-0.236]
31 The top and bottom rows show the category label $\arg\max_y p(y|x, W)$ and the conditional entropy $H\{p(y|x, W)\}$ at each point x, respectively. [sent-79, score-0.217]
32 We find that both class balance and regularization terms are necessary to learn unsupervised classifiers suitable for multi-class clustering. [sent-80, score-0.209]
33 Specifically, if K is the maximum number of classes, we choose $p(y = k|x, W) \propto \exp(w_k^T x + b_k)$ and $R(W; \lambda) = \lambda \sum_k w_k^T w_k$, (3) where the set of parameters $W = \{w_1, \dots, b_K\}$ [sent-83, score-0.477]
34 consists of weight vectors $w_k$ and bias values $b_k$ for each class k. [sent-89, score-0.365]
35 Each weight vector $w_k \in \mathbb{R}^D$ is D-dimensional with components $w_{kd}$. [sent-90, score-0.328]
36 For clarity, we define $p_{ki} \equiv p(y = k|x_i, W)$ and $\hat{p}_k \equiv \hat{p}(y = k; W)$. [sent-96, score-0.244]
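As a concrete reading of Eq. 3, the sketch below implements the multilogit conditional model and the regularized objective F(W; X, λ), reusing `empirical_mutual_information` from the earlier sketch. This is a hedged illustration, not the authors' implementation; the names `multilogit_probs` and `rim_objective` are assumptions.

```python
import numpy as np

def multilogit_probs(X, W, b):
    """p(y = k | x_i, W) ∝ exp(w_k^T x_i + b_k) for an N x D data matrix X,
    K x D weight matrix W and length-K bias vector b (Eq. 3)."""
    logits = X @ W.T + b                          # N x K
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def rim_objective(X, W, b, lam):
    """F(W; X, lambda) = I_W{y; x} - lambda * sum_k w_k^T w_k, to be maximized."""
    P = multilogit_probs(X, W, b)
    return empirical_mutual_information(P) - lam * np.sum(W * W)
```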
37 The partial derivatives are $\frac{\partial F}{\partial w_{kd}} = \frac{1}{N}\sum_{ic} \frac{\partial p_{ci}}{\partial w_{kd}} \log\frac{p_{ci}}{\hat{p}_c} - 2\lambda w_{kd}$ and $\frac{\partial F}{\partial b_k} = \frac{1}{N}\sum_{ic} \frac{\partial p_{ci}}{\partial b_k} \log\frac{p_{ci}}{\hat{p}_c}$. (4) [sent-97, score-0.629]
38 Naive computation of the gradient requires $O(NK^2D)$, since there are $K(D+1)$ parameters and each derivative requires a sum over $NK$ terms. [sent-98, score-0.174]
39 However, the conditional probability derivatives for multi-logit regression take the form $\frac{\partial p_{ci}}{\partial w_{kd}} = (\delta_{kc} - p_{ci})\,p_{ki}\,x_{id}$ and $\frac{\partial p_{ci}}{\partial b_k} = (\delta_{kc} - p_{ci})\,p_{ki}$, where $\delta_{kc}$ is equal to one when indices k and c are equal, and zero otherwise. [sent-99, score-0.722]
40 The quantity $\sum_c p_{ci} \log\frac{p_{ci}}{\hat{p}_c}$ may be computed once per data point; the above gradients are used in the L-BFGS [6] quasi-Newton optimization algorithm. [sent-102, score-0.645]
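A vectorized sketch of the gradient in Eq. 4, using the softmax derivative identity quoted above. It reuses `multilogit_probs` from the earlier sketch; the outputs can be flattened and handed, with the sign flipped, to scipy.optimize.minimize with method='L-BFGS-B', since SciPy minimizes. This is an illustration under the stated formulas, not the authors' code.

```python
import numpy as np

def rim_gradient(X, W, b, lam):
    """Gradient of F (Eq. 4) for linear multilogit regression; a sketch."""
    N = X.shape[0]
    P = multilogit_probs(X, W, b)                  # N x K
    eps = 1e-12
    p_hat = P.mean(axis=0)                         # length K
    L = np.log(P + eps) - np.log(p_hat + eps)      # log(p_ci / p_hat_c)
    s = np.sum(P * L, axis=1, keepdims=True)       # sum_c p_ci log(p_ci / p_hat_c)
    A = P * (L - s)                                # A[i,k] = p_ki (log(p_ki/p_hat_k) - sum_c p_ci log(p_ci/p_hat_c))
    dW = A.T @ X / N - 2.0 * lam * W               # dF/dW, shape K x D
    db = A.mean(axis=0)                            # dF/db, shape K
    return dW, db
```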
41 The algorithm is initialized with 50 category weight vectors wk . [sent-113, score-0.285]
42 The negative bias terms of the unpopulated categories drive the unpopulated class probabilities pk towards zero. [sent-115, score-0.313]
43 The corresponding weight vectors $w_k$ have norms near zero. [sent-116, score-0.215]
44 When applied to multilogit regression, the objective function F(W; X, λ) is non-concave. [sent-117, score-0.142]
45 Setting the gradient (Eq. 5) equal to zero yields the following condition at stationary points of F: $w_k = \sum_i \alpha_{ki} x_i$, (6) where we have defined $\alpha_{ki} \equiv \frac{1}{2\lambda N}\left(p_{ki}\log\frac{p_{ki}}{\hat{p}_k} - p_{ki}\sum_c p_{ci}\log\frac{p_{ci}}{\hat{p}_c}\right)$. (7) [sent-123, score-1.252]
46 The L2 regularizing function R(W; λ) in Eq. [sent-124, score-0.12]
47 3 is additively composed of penalty terms associated with each category: $w_k^T w_k = \sum_{ij} \alpha_{ki}\alpha_{kj}\, x_i^T x_j$. [sent-125, score-0.412]
48 It is instructive to observe the limiting behavior of the penalty term $w_k^T w_k$ when datapoints are not assigned to category k; that is, when $\hat{p}_k = \frac{1}{N}\sum_i p_{ki} \to 0$. [sent-126, score-0.743]
49 This implies that pki → 0 for all i, and therefore αki → 0 for all i. [sent-127, score-0.184]
50 This means that the regularizing function does not penalize unpopulated categories. [sent-129, score-0.117]
51 We find empirically that when we initialize with a large number of category weights wk , many decay away depending on the value of λ. [sent-130, score-0.267]
52 The bias terms bk are unpenalized and are adjusted during optimization to drive the class probabilities pk arbitrarily close to zero for unpopulated classes. [sent-133, score-0.321]
53 We first oversegment the data into a large number of clusters (using k-means or other suitable algorithm) and train a supervised multi-logit classifier using these cluster labels. [sent-136, score-0.179]
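A sketch of the initialization heuristic described above, assuming scikit-learn for the k-means oversegmentation and the supervised multi-logit fit. The helper name `initialize_rim` and the default of 50 initial clusters (matching the 50 category weight vectors mentioned earlier) are illustrative choices, not the authors' exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def initialize_rim(X, n_init_clusters=50, seed=0):
    """Oversegment with k-means, then fit a supervised multi-logit classifier
    to the k-means labels; its weights and biases seed the RIM optimization."""
    km = KMeans(n_clusters=n_init_clusters, random_state=seed).fit(X)
    clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
    clf.fit(X, km.labels_)
    return clf.coef_.copy(), clf.intercept_.copy()   # K x D weights, length-K biases
```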
54 We use this insight as justification to define explicit coefficients $\alpha_{ki}$ and enforce the constraint $w_k = \sum_i \alpha_{ki} x_i$ during optimization. [sent-141, score-0.224]
55 Substituting this equation into the multilogit regression conditional likelihood allows replacement of all inner products $w_k^T x$ with $\sum_i \alpha_{ki} K(x_i, x)$, where K is a positive definite kernel function that evaluates the inner product $x_i^T x$. [sent-142, score-0.396]
56 The conditional model now has the form $p(y = k|x, \alpha, b) \propto \exp\left(\sum_i \alpha_{ki} K(x_i, x) + b_k\right)$. [sent-143, score-0.128]
57 Substituting the constraint into the regularizing function $\sum_k w_k^T w_k$ yields a natural replacement of $w_k^T w_k$ by the Reproducing Kernel Hilbert Space (RKHS) norm of the function $\sum_i \alpha_{ki} K(x_i, \cdot)$: $R(\alpha) = \sum_k \sum_{ij} \alpha_{ki}\alpha_{kj} K(x_i, x_j)$. (8) [sent-144, score-0.841]
58 We use the L-BFGS algorithm to optimize the kernelized algorithm over the coefficients $\alpha_{ki}$ and biases $b_k$. [sent-145, score-0.164]
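A sketch of the kernelized conditional model and the RKHS penalty of Eq. 8, operating on a precomputed N x N Gram matrix; function names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def kernel_probs(Kmat, alpha, b):
    """p(y = k | x_i) ∝ exp(sum_j alpha_kj K(x_j, x_i) + b_k), with Kmat the
    N x N Gram matrix, alpha a K x N coefficient matrix and b length-K biases."""
    logits = alpha @ Kmat + b[:, None]            # K x N
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return (P / P.sum(axis=0, keepdims=True)).T   # N x K

def rkhs_penalty(Kmat, alpha):
    """R(alpha) = sum_k sum_ij alpha_ki alpha_kj K(x_i, x_j)  (Eq. 8)."""
    return np.sum((alpha @ Kmat) * alpha)
```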
59 The partial derivatives for the kernel coefficients are $\frac{\partial F}{\partial \alpha_{kj}} = \frac{1}{N}\sum_i K(x_j, x_i)\, p_{ki}\left(\log\frac{p_{ki}}{\hat{p}_k} - \sum_c p_{ci}\log\frac{p_{ci}}{\hat{p}_c}\right) - 2\lambda \sum_i \alpha_{ki} K(x_j, x_i)$, and the derivatives for the biases are unchanged. [sent-146, score-1.021]
60 Kernelized unsupervised multilogit regression exhibits the same model selection behavior as the linear algorithm. [sent-148, score-0.236]
61 Semi-supervised Classification: In semi-supervised classification, we assume that there are unlabeled examples $X_U = \{x^U_1, \cdots\}$ as well as labeled examples $X_L = \{x^L_1, \cdots, x^L_M\}$ with labels $Y = \{y_1, \cdots, y_M\}$. [sent-151, score-0.129]
62 We use (Eq. 1) to define the relationship between unlabeled points and the model parameters, but we incorporate an additional parameter τ which will define the tradeoff between labeled and unlabeled examples. [sent-153, score-0.185]
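One plausible reading of the semi-supervised objective sketched above: the unsupervised RIM score on the unlabeled points plus a τ-weighted conditional log-likelihood of the labeled points. The exact weighting and normalization used by the authors are not given here, so this form is an assumption; it reuses `multilogit_probs` and `empirical_mutual_information` from earlier sketches.

```python
import numpy as np

def semisup_objective(XU, XL, yL, W, b, lam, tau):
    """Hypothetical semi-supervised RIM score; yL holds integer labels for XL."""
    PU = multilogit_probs(XU, W, b)
    unsup = empirical_mutual_information(PU) - lam * np.sum(W * W)
    PL = multilogit_probs(XL, W, b)
    loglik = np.mean(np.log(PL[np.arange(len(yL)), yL] + 1e-12))
    return unsup + tau * loglik                   # tau trades off labeled vs. unlabeled terms
```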
63 Our approach is related to the objective in [3], which does not contain the class balance term $H(\hat{p}(y; W))$. [sent-155, score-0.138]
64 In the following, we show how RIM allows flexible expression of prior assumptions about non-uniform class label proportions. [sent-159, score-0.139]
65 We can capture prior beliefs about the average label distribution by substituting a reference distribution D(y; γ) in place of U (γ is a parameter that may be fixed or optimized during learning). [sent-163, score-0.135]
66 [7] also use relative entropy as a means of enforcing prior beliefs, although not with respect to class distributions in multi-class classification problems. [sent-164, score-0.127]
67 This construction may be used in a clustering task in which we believe that the cluster sizes obey a power law distribution as, for example, considered by [8] who use the Pitman-Yor process for nonparametric language modeling. [sent-165, score-0.175]
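The substitution of a reference distribution D(y; γ) for the uniform distribution can be read as replacing the entropy term with a negative relative entropy (up to a constant). A small sketch with illustrative names; the power-law case would simply pass a suitable reference vector as `ref`.

```python
import numpy as np

def balance_term(P, ref=None):
    """Class-balance term: entropy H{p_hat(y; W)} when ref is None, otherwise
    -KL(p_hat || D) for a reference distribution D(y; gamma) given as `ref`
    (equal to the entropy up to a constant when ref is uniform)."""
    eps = 1e-12
    p_hat = P.mean(axis=0)
    if ref is None:
        return -np.sum(p_hat * np.log(p_hat + eps))
    return -np.sum(p_hat * (np.log(p_hat + eps) - np.log(np.asarray(ref) + eps)))
```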
68 Figure 3: Unsupervised Clustering: Adjusted Rand Index (relative to ground truth) versus number of clusters. [sent-187, score-0.144]
69 We compare RIM against the spectral clustering (SC) algorithm of [1], the fast maximum margin clustering (MMC) algorithm of [9], and kernelized k-means [10]. [sent-193, score-0.279]
70 We evaluate unsupervised clustering performance in terms of how well the discovered clusters reflect known ground truth labels of the dataset. [sent-197, score-0.46]
71 We report the Adjusted Rand Index (ARI) [11] between an inferred clustering and the ground truth categories. [sent-198, score-0.183]
72 We evaluated a number of other measures for comparing clusterings to ground truth including mutual information, normalized mutual information [12], and cluster impurity [13]. [sent-200, score-0.351]
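A sketch of the evaluation protocol described above, assuming scikit-learn's implementations of the Adjusted Rand Index [11] and normalized mutual information [12]; predicted labels are taken as the arg-max of the learned conditional model.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(P, true_labels):
    """Score an inferred clustering (N x K membership probabilities P) against
    ground-truth labels; returns (ARI, NMI)."""
    pred = np.argmax(P, axis=1)
    return (adjusted_rand_score(true_labels, pred),
            normalized_mutual_info_score(true_labels, pred))
```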
73 We test the algorithms on an image clustering task with 350 images from four Caltech-256 [14] categories (Faces-Easy, Motorbikes, Airplanes, T-Shirt) for a total of N = 1400 images. [sent-207, score-0.179]
74 Overall, the clusterings that best match ground truth are given by RIM when it discovers four clusters. [sent-212, score-0.147]
75 RIM outperforms kernelized k-means when discovering between 4 and 8 clusters. [sent-214, score-0.123]
76 Figure 4 shows example images taken from clusters discovered by RIM. [sent-216, score-0.172]
77 We further test RIM’s unsupervised learning performance on two molecular graph datasets. [sent-220, score-0.119]
78 We find that of all methods, RIM produces the clusterings that are nearest to ground truth (when discovering 2 clusters). [sent-229, score-0.248]
79 Figure 4: Left: Randomly chosen example images from clusters discovered by unsupervised RIM on Caltech Image. [sent-236, score-0.309]
80 Right: the waveform with the most uncertain cluster membership according to the classifier learned by RIM. [sent-239, score-0.175]
81 RIM has the advantage over k-means when discovering a small number of clusters and is comparable at other settings. [sent-242, score-0.125]
82 The waveforms are composed of 38 samples from each of the four electrodes and are the output of a neural spike detector which aligns signal peaks to the 13th sample; see the average waveform in Figure 5 (left). [sent-247, score-0.208]
83 We set λ to N and RIM discovers 33 clusters and finishes in 12 minutes. [sent-250, score-0.127]
84 The top row consists of cluster member waveforms superimposed on each other, with the cluster’s mean waveform plotted in red. [sent-253, score-0.263]
85 Taken as a whole, the clusters give an idea of the typical waveform patterns. [sent-255, score-0.202]
86 The bottom row shows the learned classifier’s discriminative weights wk for each category, which can be used to gain a sense for how the cluster’s members differ from the dataset mean waveform. [sent-256, score-0.308]
87 We can use the probabilistic classifier learned by RIM to discover atypical waveforms by ranking them according to their conditional entropy H{p(y|xi , W)}. [sent-257, score-0.2]
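A sketch of the ranking step described above: atypical points are those whose cluster membership has high conditional entropy under the learned classifier. Names are illustrative.

```python
import numpy as np

def rank_by_uncertainty(P):
    """Return indices of data points sorted by H{p(y|x_i, W)}, most uncertain first."""
    eps = 1e-12
    H = -np.sum(P * np.log(P + eps), axis=1)
    return np.argsort(-H)
```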
88 Figure 5 (right) shows the waveform whose cluster membership is most uncertain. [sent-258, score-0.175]
89 Figure 6: Two clusters discovered by RIM on the Tetrode data set. [sent-260, score-0.15]
90 Top row: Superimposed waveform members, with cluster mean in red. [sent-261, score-0.175]
91 Bottom row: The discriminative category weights wk associated with each cluster. [sent-262, score-0.351]
92 The methods were trained using both unlabeled and labeled examples, and classification performance is assessed on the unlabeled portion. [sent-266, score-0.167]
93 The method of [20] aims to maximize a similarity measure computed between members within the same cluster while penalizing the mutual information between the cluster label y and the input x. [sent-277, score-0.302]
94 [22] also view clustering as maximization of the dependence between the input variable and output label variable. [sent-280, score-0.188]
95 Like our approach, the works of [2] and [21] use notions of class balance, separation, and regularization to learn unsupervised discriminative classifiers. [sent-283, score-0.23]
96 Another semi-supervised method [23] uses mutual information as a regularizing term to be minimized, in contrast to ours which attempts to maximize mutual information. [sent-291, score-0.205]
97 new class label values) not captured by the labeled data, since it is incomplete. [sent-295, score-0.13]
98 8 Conclusions We considered the problem of learning a probabilistic discriminative classifier from an unlabeled data set. [sent-296, score-0.184]
99 Our approach consists of optimizing an intuitive information theoretic objective function that incorporates class separation, class balance and classifier complexity, which may be interpreted as maximizing the mutual information between the empirical input and implied label distributions. [sent-298, score-0.326]
100 In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. [sent-302, score-0.131]
wordName wordTfidf (topN-words)
[('rim', 0.724), ('pci', 0.289), ('wk', 0.197), ('pki', 0.184), ('mmc', 0.155), ('multilogit', 0.113), ('wkd', 0.113), ('ki', 0.109), ('clusters', 0.103), ('unsupervised', 0.1), ('waveform', 0.099), ('clustering', 0.099), ('sc', 0.099), ('iw', 0.091), ('discriminative', 0.084), ('bk', 0.083), ('kernelized', 0.081), ('cluster', 0.076), ('mutual', 0.076), ('category', 0.07), ('pc', 0.067), ('classi', 0.066), ('unlabeled', 0.065), ('waveforms', 0.065), ('tetrode', 0.064), ('unpopulated', 0.064), ('balance', 0.063), ('grandvalet', 0.061), ('pk', 0.06), ('categories', 0.058), ('entropy', 0.055), ('rand', 0.055), ('regularizing', 0.053), ('caltech', 0.053), ('bridle', 0.052), ('xid', 0.049), ('adjusted', 0.047), ('discovered', 0.047), ('label', 0.047), ('sweep', 0.046), ('class', 0.046), ('conditional', 0.045), ('er', 0.044), ('truth', 0.043), ('maximization', 0.042), ('ground', 0.041), ('beliefs', 0.039), ('kj', 0.039), ('xl', 0.039), ('clusterings', 0.039), ('labeled', 0.037), ('ari', 0.036), ('probabilistic', 0.035), ('datapoints', 0.035), ('kc', 0.033), ('gomes', 0.032), ('instantiate', 0.032), ('seperation', 0.032), ('index', 0.032), ('exible', 0.031), ('objective', 0.029), ('ers', 0.029), ('boundaries', 0.029), ('subtree', 0.029), ('lossy', 0.028), ('separation', 0.028), ('labels', 0.027), ('members', 0.027), ('xu', 0.027), ('derivatives', 0.027), ('xi', 0.027), ('electrodes', 0.026), ('regularized', 0.026), ('prior', 0.026), ('maximized', 0.025), ('discovers', 0.024), ('partial', 0.024), ('substituting', 0.023), ('superimposed', 0.023), ('populated', 0.023), ('regression', 0.023), ('stationary', 0.022), ('discovering', 0.022), ('bengio', 0.022), ('images', 0.022), ('bias', 0.021), ('assumptions', 0.02), ('minutes', 0.02), ('outperforms', 0.02), ('kl', 0.02), ('ic', 0.02), ('interpreted', 0.019), ('sensible', 0.019), ('molecular', 0.019), ('perona', 0.018), ('incorporate', 0.018), ('weight', 0.018), ('composed', 0.018), ('logistic', 0.018), ('evaluates', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
2 0.11076254 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε-accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations. 1
3 0.10771903 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.089757331 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
5 0.07593789 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
6 0.073580451 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
7 0.072218873 261 nips-2010-Supervised Clustering
8 0.071525343 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
9 0.063944593 23 nips-2010-Active Instance Sampling via Matrix Partition
10 0.062005252 223 nips-2010-Rates of convergence for the cluster tree
11 0.060102139 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.057700019 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
13 0.056058411 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
14 0.053791422 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
15 0.053730831 177 nips-2010-Multitask Learning without Label Correspondences
16 0.053719178 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
17 0.051981736 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
18 0.050566074 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
19 0.049385469 280 nips-2010-Unsupervised Kernel Dimension Reduction
20 0.048400991 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
topicId topicWeight
[(0, 0.152), (1, 0.057), (2, 0.021), (3, -0.054), (4, 0.007), (5, 0.021), (6, 0.008), (7, -0.039), (8, 0.063), (9, -0.038), (10, 0.01), (11, -0.007), (12, 0.072), (13, -0.085), (14, 0.136), (15, -0.051), (16, -0.012), (17, 0.078), (18, 0.003), (19, -0.045), (20, 0.035), (21, -0.043), (22, -0.003), (23, 0.064), (24, -0.058), (25, -0.026), (26, -0.009), (27, 0.029), (28, -0.005), (29, 0.029), (30, 0.046), (31, -0.003), (32, -0.056), (33, 0.001), (34, -0.068), (35, 0.021), (36, -0.011), (37, 0.037), (38, -0.02), (39, 0.036), (40, -0.012), (41, 0.011), (42, 0.023), (43, -0.028), (44, 0.038), (45, 0.11), (46, 0.014), (47, 0.028), (48, 0.037), (49, -0.117)]
simIndex simValue paperId paperTitle
same-paper 1 0.9036836 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
2 0.65628999 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
3 0.62448007 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.61596811 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
5 0.60361135 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
6 0.60270041 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
7 0.55907261 67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
8 0.52553076 223 nips-2010-Rates of convergence for the cluster tree
9 0.521263 228 nips-2010-Reverse Multi-Label Learning
10 0.51732254 2 nips-2010-A Bayesian Approach to Concept Drift
11 0.48458377 177 nips-2010-Multitask Learning without Label Correspondences
12 0.46358499 138 nips-2010-Large Margin Multi-Task Metric Learning
13 0.44061312 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
14 0.43661568 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
15 0.43357956 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
16 0.42793071 24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification
17 0.42073286 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
18 0.41871306 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
19 0.41864333 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
20 0.41737735 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
topicId topicWeight
[(13, 0.029), (17, 0.013), (27, 0.054), (30, 0.047), (35, 0.026), (45, 0.187), (50, 0.041), (52, 0.021), (60, 0.412), (77, 0.028), (78, 0.012), (90, 0.046)]
simIndex simValue paperId paperTitle
1 0.93663758 223 nips-2010-Rates of convergence for the cluster tree
Author: Kamalika Chaudhuri, Sanjoy Dasgupta
Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1
2 0.90788352 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan
Abstract: We propose a new LP relaxation for obtaining the MAP assignment of a binary MRF with pairwise potentials. Our relaxation is derived from reducing the MAP assignment problem to an instance of a recently proposed Bipartite Multi-cut problem where the LP relaxation is guaranteed to provide an O(log k) approximation where k is the number of vertices adjacent to non-submodular edges in the MRF. We then propose a combinatorial algorithm to efficiently solve the LP and also provide a lower bound by concurrently solving its dual to within an approximation. The algorithm is up to an order of magnitude faster and provides better MAP scores and bounds than the state of the art message passing algorithm of [1] that tightens the local marginal polytope with third-order marginal constraints. 1
3 0.90353668 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
Author: Tobias Glasmachers
Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1
4 0.86828554 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
same-paper 5 0.82225907 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
6 0.7515536 229 nips-2010-Reward Design via Online Gradient Ascent
7 0.64085376 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
8 0.63269734 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
9 0.6098966 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
10 0.60935879 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
11 0.60457498 138 nips-2010-Large Margin Multi-Task Metric Learning
12 0.60329169 102 nips-2010-Generalized roof duality and bisubmodular functions
13 0.60209531 4 nips-2010-A Computational Decision Theory for Interactive Assistants
14 0.60204834 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.60190707 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
16 0.60033768 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
17 0.59830272 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
18 0.59638566 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
19 0.59534812 287 nips-2010-Worst-Case Linear Discriminant Analysis
20 0.59229028 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods