nips nips2010 nips2010-228 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Multi-label classification is the task of predicting potentially multiple labels for a given instance. [sent-7, score-0.071]
2 In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. [sent-9, score-0.179]
3 By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. [sent-10, score-0.175]
4 We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance. [sent-11, score-0.222]
5 As diverse as the applications, however, are the evaluation measures used to assess the performance of different methods. [sent-17, score-0.059]
6 In web search, on the other hand, precision is also important, so the F1 -score, which is the harmonic mean of precision and recall, might be more appropriate. [sent-20, score-0.09]
7 In this paper we present a method for MLC which is able to optimise appropriate surrogates for a variety of performance measures. [sent-21, score-0.173]
8 This means that the objective function being optimised by the method is tailored to the performance measure on which we want to do well in our specific application. [sent-22, score-0.12]
9 This is in contrast particularly with probabilistic approaches, which typically aim for maximisation of likelihood scores rather than the performance measure used to assess the quality of the results. [sent-23, score-0.152]
10 In addition, the method is based on well-understood facts from the domain of structured output learning, which gives us theoretical guarantees regarding the accuracy of the results obtained. [sent-24, score-0.072]
11 An interesting aspect of the method is that we are only able to optimise the desired performance measures because we formulate the prediction problem in a reverse manner, in the spirit of [8]. [sent-26, score-0.393]
12 We pose the prediction problem as predicting sets of instances given the labels. [sent-27, score-0.16]
13 When this insight is fit into max-margin structured output methods, we obtain surrogate losses for the most widely used performance measures for multi-label classification. [sent-28, score-0.131]
14 We perform experiments against state-of-the-art methods in five publicly available benchmark datasets for MLC, and the proposed approach is the best performing overall. [sent-29, score-0.122]
15 Instead, we focus particularly on some state-of-the-art approaches that have been tested on publicly available benchmark datasets for MLC, which facilitates a fair comparison against our method. [sent-32, score-0.122]
16 A straightforward way to deal with multiple labels is to solve a binary classification problem for each one of them, treating them independently. [sent-33, score-0.071]
17 Since the order of the classifiers in the chain is arbitrary, the authors also propose an ensemble method – Ensemble of Classifier Chains (ECC) – where several random chains are combined with a voting scheme. [sent-36, score-0.308]
18 Probabilistic Classifier Chains (PCC) [1] extends CC to the probabilistic setting, with EPCC [1] being its corresponding ensemble method. [sent-37, score-0.135]
19 Another way of working with multiple labels is to consider each possible set of labels as a class, thus encoding the problem as single-label classification. [sent-38, score-0.142]
20 RAndom K-labELsets (RAKEL) [10] deals with that by proposing an ensemble of classifiers, each one taking a small random subset of the labels and learning a single-label classifier for the prediction of each element in the power set of this subset. [sent-40, score-0.243]
21 Other proposed ensemble methods are Ensemble of Binary Method (EBM) [4], which applies a simple voting scheme to a set of BM classifiers, and Ensemble of Pruned Sets (EPS) [11], which combines a set of Pruned Sets (PS) classifiers. [sent-41, score-0.138]
22 PS is essentially a problem transformation method that maps sets of labels to single labels while pruning away infrequently occurring sets. [sent-42, score-0.142]
23 Canonical Correlation Analysis (CCA) [3] exploits label relatedness by using a probabilistic interpretation of CCA as a dimensionality reduction technique and applying it to learn useful predictive features for multi-label learning. [sent-43, score-0.151]
24 Meta Stacking (MS) [12] also exploits label relatedness by combining text features and features indicating relationships between classes in a discriminative framework. [sent-44, score-0.149]
25 In [13] the author proposes a smooth but non-concave relaxation of the F-measure for binary classification problems using a logistic regression classifier, and optimisation is performed by taking the maximum across several runs of BFGS starting from random initial values. [sent-46, score-0.205]
26 In [14] the author proposes a method for optimising multivariate performance measures in a general setting in which the loss function is not assumed to be additive in either the instances or the labels. [sent-47, score-0.363]
27 The method also consists of optimising a convex relaxation of the derived losses. [sent-48, score-0.127]
28 The key difference of our method is that we have a specialised convex relaxation for the case in which the loss does not decompose over the instances, but does decompose over the labels. [sent-49, score-0.156]
29 2 The Model Let the input x ∈ X denote a label, e.g., a tag of an image. [sent-50, score-0.079]
30 The output y ∈ Y denotes a set of instances, e.g., the instances annotated with that label. [sent-52, score-0.038]
31 Let N = |X| be the number of labels and V be the number of instances. [sent-55, score-0.071]
32 For example if N = 5 the second label is denoted as x = [0 1 0 0 0]. [sent-59, score-0.079]
33 An output instance y is encoded as y ∈ {0, 1}^V (Y := {0, 1}^V), with y_i = 1 iff instance i was annotated with the label encoded by x. [sent-60, score-0.264]
34 For example if V = 10 and only instances 1 and 3 are annotated with label 2, then the y corresponding to x = [0 1 0 0 0] is y = [1 0 1 0 0 0 0 0 0 0]. [sent-61, score-0.168]
35 We assume a given training set {(x^n, y^n)}_{n=1}^N, where {x^n}_{n=1}^N comprises the entirety of labels available ({x^n}_{n=1}^N = X) and {y^n}_{n=1}^N represents the sets of instances associated with those labels. [sent-62, score-0.16]
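As a concrete illustration of this reverse encoding, the sketch below builds the one-hot label vectors x and the instance-indicator vectors y; the toy sizes, the annotation list and the helper name are our own assumptions, not part of the paper.

```python
import numpy as np

# Minimal sketch of the reverse encoding: x is a one-hot label in {0,1}^N,
# y is an instance-set indicator in {0,1}^V (toy sizes and data are assumed).
N, V = 5, 10

x = np.zeros(N, dtype=int); x[1] = 1        # x = [0 1 0 0 0]: the second label
y = np.zeros(V, dtype=int); y[[0, 2]] = 1   # instances 1 and 3 carry label 2

def build_reverse_training_set(annotations, N, V):
    """annotations[v] is the (assumed) list of label indices of instance v."""
    X = np.eye(N, dtype=int)                # one x^n per label
    Y = np.zeros((N, V), dtype=int)
    for v, labels in enumerate(annotations):
        for n in labels:
            Y[n, v] = 1                     # instance v is annotated with label n
    return X, Y
```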
36 2.1 Loss Functions The reason for this reverse prediction is the following: most widely accepted performance measures target information retrieval (IR) applications – that is, given a label we want to find a set of relevant instances. [sent-67, score-0.299]
37 As a consequence, the measures are averaged over the set of possible labels. [sent-68, score-0.059]
38 This is the case for, in particular, Macro-precision, Macro-recall, Macro-F_β and Hamming loss [10]: Macro-precision = \frac{1}{N} \sum_{n=1}^N p(y^n, \bar{y}^n), Macro-recall = \frac{1}{N} \sum_{n=1}^N r(y^n, \bar{y}^n). (Macro-F_1 is the particular case of Macro-F_β when β equals 1.) [sent-69, score-0.122]
39 Macro-F_β = \frac{1}{N} \sum_{n=1}^N \frac{(1+β^2)\, p(y^n, \bar{y}^n)\, r(y^n, \bar{y}^n)}{β^2 p(y^n, \bar{y}^n) + r(y^n, \bar{y}^n)}, Hamming loss = \frac{1}{N} \sum_{n=1}^N h(y^n, \bar{y}^n), where h(y, \bar{y}) = \frac{y^T 1 + \bar{y}^T 1 - 2 y^T \bar{y}}{V}, p(y, \bar{y}) = \frac{y^T \bar{y}}{\bar{y}^T \bar{y}}, r(y, \bar{y}) = \frac{y^T \bar{y}}{y^T y}. [sent-71, score-0.122]
40 Here, \bar{y}^n is our prediction for input label n, and y^n the corresponding ground-truth. [sent-72, score-0.15]
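For reference, a short sketch of how the four macro-averaged quantities above can be computed from binary matrices; the function name, the epsilon guard for empty denominators, and the (N, V) layout are our assumptions.

```python
import numpy as np

def macro_scores(Y_true, Y_pred, beta=1.0, eps=1e-12):
    """Macro precision, recall, F_beta and Hamming loss, averaged over labels.

    Y_true, Y_pred: (N, V) binary arrays; row n holds y^n and the prediction.
    eps guards empty denominators (a detail the text does not specify).
    """
    tp = (Y_true * Y_pred).sum(axis=1)            # y^T ybar per label
    p = tp / (Y_pred.sum(axis=1) + eps)           # precision: tp / |prediction|
    r = tp / (Y_true.sum(axis=1) + eps)           # recall:    tp / |ground truth|
    fbeta = (1 + beta**2) * p * r / (beta**2 * p + r + eps)
    hamming = (Y_true != Y_pred).mean(axis=1)     # (y^T1 + ybar^T1 - 2 y^T ybar) / V
    return p.mean(), r.mean(), fbeta.mean(), hamming.mean()
```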
41 Since these measures average over the labels, in order to optimise them we need to average over the labels as well, and this happens naturally in a setting in which the empirical risk is additive in the labels. [sent-73, score-0.335]
42 Instead of maximising a performance measure, we frame the problem as minimising a loss function associated with the performance measure. [sent-74, score-0.173]
43 We assume a known loss function ∆ : Y × Y → R+ which assigns a non-negative number to every possible pair of outputs. [sent-75, score-0.122]
44 This loss function represents how much we want to penalise a prediction \bar{y} when the correct prediction is y. [sent-76, score-0.264]
45 For Macro-F_β this gives the loss Δ(y, \bar{y}) = 1 - \frac{(1+β^2)\, y^T \bar{y}}{β^2 y^T y + \bar{y}^T \bar{y}} (2). Features and Parameterization. Our next assumption is that the prediction for a given input x returns the maximiser(s) of a linear score in the model parameter vector θ, i.e. [sent-82, score-0.137]
46 a prediction is given by \bar{y} such that \bar{y} ∈ argmax_{y ∈ Y} ⟨φ(x, y), θ⟩ (3). [sent-84, score-0.108]
47 Here we assume that φ(x, y) is linearly composed of features of the instances encoded in each y_v, v = 1, …, V. [sent-85, score-0.204]
48 The map φ(x, y) will be the zero vector whenever y_v = 0 for all v. [sent-89, score-0.073]
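A minimal sketch of one parameterization consistent with this description: instance features z_v combined with the one-hot label x by an outer product, summed over the instances selected by y. The outer-product form and the names Z and joint_feature_map are our assumptions; the extracted text only states that φ(x, y) is linear in the selected instances.

```python
import numpy as np

def joint_feature_map(x, y, Z):
    """phi(x, y) = sum_v y_v * (x outer z_v) -- an assumed instantiation.

    x: (N,) one-hot label, y: (V,) instance indicators, Z: (V, D) instance features.
    Returns the zero vector when y selects no instance.
    """
    selected = Z[y.astype(bool)].sum(axis=0) if y.any() else np.zeros(Z.shape[1])
    return np.outer(x, selected).ravel()

def linear_score(x, y, Z, theta):
    """The linear score <phi(x, y), theta> used in the prediction rule (3)."""
    return joint_feature_map(x, y, Z) @ theta
```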
49 This yields the learning problem θ^* ∈ argmin_θ \frac{1}{N} \sum_{n=1}^N Δ(\bar{y}(x^n; θ), y^n) + \frac{λ}{2} \|θ\|^2 (4). In other words, we want to find a model that minimises the average prediction loss in the training set plus a quadratic regulariser that penalises complex solutions (the parameter λ determines the tradeoff between data fitting and good generalisation). [sent-97, score-0.193]
50 2 The Hamming loss also averages over the instances so it can be optimised in the ‘normal’ (not reverse) direction as well. [sent-99, score-0.298]
51 3 Optimisation 3.1 Convex Relaxation The optimisation problem (4) is non-convex. [sent-101, score-0.171]
52 Even more critical, the loss is a piecewise constant function of θ. [sent-102, score-0.122]
53 A similar problem occurs when one aims at optimising a 0/1 loss in binary classification; in that case, a typical workaround consists of minimising a surrogate convex loss function which upper bounds the 0/1 loss, for example the hinge loss, which gives rise to the support vector machine. [sent-103, score-0.388]
54 Here we use an analogous approach, notably popularised in [16], which optimises a convex upper bound on the structured loss of (4). [sent-104, score-0.156]
55 The resulting optimisation problem is [θ^*, ξ^*] = argmin_{θ, ξ} \frac{1}{N} \sum_{n=1}^N ξ_n + \frac{λ}{2} \|θ\|^2 (5) s.t. [sent-105, score-0.208]
56 ⟨φ(x^n, y^n), θ⟩ - ⟨φ(x^n, y), θ⟩ ≥ Δ(y, y^n) - ξ_n for all n and all y ∈ Y, with ξ_n ≥ 0 (6). It is easy to see that ξ^*_n upper bounds Δ(\bar{y}^*_n, y^n) (and therefore the objective in (5) upper bounds that of (4) for the optimal solution). [sent-108, score-0.165]
57 Second, the left hand side of the inequality for y = \bar{y}^n must be non-positive from the definition of \bar{y}^n in equation (3). [sent-111, score-0.042]
58 The constraints (6) basically enforce a loss-sensitive margin: θ is learned so that mispredictions y that incur some loss end up with a score ⟨φ(x^n, y), θ⟩ that is smaller than the score ⟨φ(x^n, y^n), θ⟩ of the correct prediction y^n by a margin equal to that loss (minus slack ξ_n). [sent-113, score-0.658]
59 The formulation is a generalisation of support vector machines for the case in which there are an exponential number of classes y. [sent-114, score-0.062]
60 The optimisation problem (5) has N|Y| = N·2^V constraints. [sent-117, score-0.171]
61 Here we resort to a constraint generation strategy, which consists of starting with no constraints and iteratively adding the most violated constraint for the current solution of the optimisation problem. [sent-119, score-0.359]
62 The key problem that needs to be solved at each iteration is constraint generation, i.e. [sent-121, score-0.051]
63 to find the maximiser of the margin violation ξ_n: \bar{y}^*_n ∈ argmax_{y ∈ Y} [Δ(y, y^n) + ⟨φ(x^n, y), θ⟩] (7). [sent-123, score-0.26]
64 The difficulty in solving the above optimisation problem depends on the choice of φ(x, y) and Δ. [sent-124, score-0.171]
65 Algorithm 1 (Reverse Multi-Label Learning) takes the training set {(x^n, y^n)}_{n=1}^N, λ and β as input, initialises θ_1 = 0 and, until convergence, computes for each n = 1, …, N the most violated constraint \bar{y}^*_n (naïvely via Algorithm 2), returning θ. [sent-131, score-0.165]
66 Algorithm 2 loops over the candidate cardinalities, finds the top k entries of z_n in O(V) time, sets CURRENT = max_{y ∈ Y_k} ⟨y, z_n⟩, updates MAX and \bar{y}^*_n whenever CURRENT > MAX, and finally returns \bar{y}^*_n. We now investigate how to solve (8) for a fixed θ. [sent-134, score-0.676]
67 A simple algorithm can be obtained by first noticing that z_n depends on y only through the number of its nonzero elements. [sent-137, score-0.173]
68 Then the objective in (8), if the maximisation is instead restricted to the domain Y_k, is effectively linear in y, since z_n in this case is a constant with respect to y. [sent-141, score-0.224]
69 Therefore we can solve separately for each Yk by finding the top k entries in zn . [sent-145, score-0.173]
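A sketch of this naïve constraint-generation routine for the Macro-F_β loss: for each candidate cardinality k, z_n is fixed, so the best y with k ones is the indicator of the top-k entries of z_n. The per-instance score vector `scores` (the linear part of ⟨φ(x^n, y), θ⟩) and the handling of the empty corner case are assumptions; we also use an O(V log V) argsort instead of the O(V) top-k selection mentioned in the text.

```python
import numpy as np

def most_violated_y(scores, y_true, beta=1.0):
    """Naive loss-augmented argmax for the Macro-F_beta loss (eq. (7))."""
    V = len(scores)
    best_val, best_y = -np.inf, None
    for k in range(V + 1):                      # number of selected instances
        denom = beta**2 * y_true.sum() + k
        c = (1 + beta**2) / denom if denom > 0 else 0.0
        z = scores - c * y_true                 # z_n is constant once k is fixed
        y = np.zeros(V, dtype=int)
        if k > 0:
            y[np.argsort(-z)[:k]] = 1           # top-k entries of z_n
        val = 1.0 - c * float(y @ y_true) + float(y @ scores)  # Delta + <phi, theta>
        if val > best_val:
            best_val, best_y = val, y
    return best_y, best_val
```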
70 Algorithm 1 describes in detail the optimisation, as solved by BMRM [18], and Algorithm 2 shows the naïve constraint generation routine. [sent-148, score-0.137]
71 The BMRM solver requires both the value of the objective function for the slack corresponding to the most violated constraint and its gradient. [sent-149, score-0.097]
72 Prediction at test time (eq. (3)) has the same form as the problem of constraint generation (eq. [sent-154, score-0.137]
73 (7)), the only difference being that z_n = Ψθ (i.e., the loss term of (7) is absent). [sent-155, score-0.173]
74 Since z_n is then a constant vector, the solution \bar{y}_n for (7) is the vector that indicates the positive entries of z_n, which can be efficiently found in O(V). [sent-159, score-0.511]
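A corresponding sketch of test-time prediction: once z is a fixed vector, maximising ⟨y, z⟩ over y ∈ {0,1}^V simply switches on its positive entries. The reshaping of θ into per-label weight vectors follows the outer-product feature map assumed earlier, not a form stated in the text.

```python
import numpy as np

def predict_instance_set(x, Z, theta):
    """Prediction (3) under the assumed feature map: indicator of positive scores."""
    N, D = len(x), Z.shape[1]
    W = theta.reshape(N, D)          # per-label weight vectors (follows our assumption)
    z = Z @ W[int(x.argmax())]       # per-instance scores for the active label
    return (z > 0).astype(int)       # \bar{y}: positive entries of z, found in O(V)
```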
75 Table 1: Evaluation scores and corresponding losses Δ(y, \bar{y}): macro-F_β: 1 - \frac{(1+β^2)\, y^T \bar{y}}{β^2 y^T y + \bar{y}^T \bar{y}}; macro-precision: 1 - \frac{y^T \bar{y}}{\bar{y}^T \bar{y}}; macro-recall: 1 - \frac{y^T \bar{y}}{y^T y}; Hamming loss: \frac{y^T 1 + \bar{y}^T 1 - 2 y^T \bar{y}}{V}. [sent-161, score-0.255]
76 #train/#test denotes the number of observations used for training and testing respectively; N is the number of labels and D the dimensionality of the features. [sent-163, score-0.071]
77 We can, however, optimise other scores, in particular the popular Hamming loss – Table 1 shows the corresponding losses, which we then plug into the formulation. [sent-165, score-0.33]
78 Note that for the Hamming loss and macro-recall the denominator is constant, and therefore it is not necessary to solve (8) multiple times as described earlier, which makes constraint generation as fast as test-time prediction. [sent-167, score-0.33]
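For the Hamming loss the constant denominator makes the loss-augmented objective linear in y, so the most violated constraint is found elementwise in one O(V) pass, as the sketch below illustrates (the variable names mirror the Macro-F_β sketch above and are assumptions).

```python
import numpy as np

def most_violated_y_hamming(scores, y_true):
    """Loss-augmented argmax for the Hamming loss: a single elementwise pass."""
    V = len(y_true)
    z = scores + (1.0 - 2.0 * y_true) / V   # per-instance gain of setting y_v = 1
    return (z > 0).astype(int)              # same form as test-time prediction
```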
79 These scores were chosen because macro-Fβ is a generalisation of the most relevant scores, and the Hamming loss is a generic, popular score in the multi-label classification literature. [sent-170, score-0.352]
80 Datasets We used 5 publicly available multi-label datasets: yeast, scene, medical, enron and emotions. [sent-171, score-0.198]
81 We selected these datasets because they cover a variety of application domains – biology, image, text and music – and there are published results of competing methods on them for some of the popular evaluation measures for MLC (macro-F1 and Hamming loss). [sent-172, score-0.35]
82 Comparison to published results on Macro-F1 In our first set of experiments we compared our model to published results on the Macro-F1 score. [sent-178, score-0.17]
83 We strived to make our comparison as broad as possible, but we limited ourselves to methods with published results on public datasets, where the experimental setting was described in enough detail to allow us to make a fair comparison. [sent-179, score-0.117]
84 We can see that our model has the best performance in yeast, medical and enron. [sent-182, score-0.089]
85 In scene it doesn't perform as well – we suspect this is related to the label cardinality of this dataset: almost all instances have just one label, making this essentially equivalent to a multiclass dataset. [sent-195, score-0.237]
86 Comparison to published results on Hamming Loss To illustrate the flexibility of our model we also evaluated it on the Hamming loss. [sent-196, score-0.085]
87 Here, we compared our model to classifier chains [4] (CC), probabilistic classifier chains [1] (PCC), ensembles of classifier chains [4] (ECC) and ensembled probabilistic classifier chains [1] (EPCC). [sent-197, score-0.822]
88 These are the methods for which we could find Hamming loss results on publicly available data. [sent-198, score-0.18]
89 Results on Fβ One strength of our method is that it can be optimised for the specific measure we are interested in. [sent-201, score-0.087]
90 In Macro-Fβ , for example, β is a trade-off between precision and recall: when β → 0 we recover precision, and when β → ∞ we get recall. [sent-202, score-0.045]
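As a quick numerical check of this β behaviour with the macro_scores sketch given earlier (toy data assumed): a small β essentially reproduces macro-precision and a large β macro-recall.

```python
import numpy as np

# Toy check (assumed data): beta -> 0 recovers precision, beta -> inf recovers recall.
Y_true = np.array([[1, 0, 1, 0], [0, 1, 1, 1]])
Y_pred = np.array([[1, 1, 0, 0], [0, 1, 1, 0]])

p, r, f_small, _ = macro_scores(Y_true, Y_pred, beta=1e-3)
_, _, f_large, _ = macro_scores(Y_true, Y_pred, beta=1e3)
print(abs(f_small - p) < 1e-3, abs(f_large - r) < 1e-3)   # both True
```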
91 Unlike other methods, given a desired precision/recall trade-off encoded in a choice of β, we can optimise our model so that it achieves the best performance on Macro-Fβ. [sent-203, score-0.215]
92 In this case, however, we could not find published results to compare to, so we used Mulan, an open-source library for learning from multi-label datasets, to train three models: BM [9], RAKEL [10] and ML-KNN [20]. [sent-205, score-0.127]
93 We can see that BM tends to prioritize recall (right side of the plot), while ML-KNN and RAKEL give more emphasis to precision (left side). [sent-215, score-0.087]
94 In both scene and yeast it dominates the right side while is still competitive on the left side. [sent-217, score-0.229]
95 And in the other three datasets – medical, enron and emotions – it practically dominates over the entire range of β. [sent-218, score-0.305]
96 5 Conclusion and Future Work We presented a new approach to multi-label learning which consists of predicting sets of instances from the labels. [sent-219, score-0.089]
97 This apparently unintuitive approach is in fact natural since, once the problem is viewed from this perspective, many popular performance measures admit convex relaxations that can be directly and efficiently optimised with existing methods. [sent-220, score-0.262]
98 The method leverages existing tools from structured output learning, which gives us certain theoretical guarantees. [sent-222, score-0.072]
99 A simple version of constraint generation is presented for small problems, but we also developed a scalable, fast version for dealing with large datasets. [sent-223, score-0.137]
100 Large margin methods for structured and interdependent output variables. [sent-432, score-0.072]
wordName wordTfidf (topN-words)
[('rakel', 0.346), ('hamming', 0.223), ('bm', 0.223), ('zn', 0.173), ('mlc', 0.173), ('optimise', 0.173), ('optimisation', 0.171), ('chains', 0.17), ('yn', 0.165), ('yt', 0.162), ('ecc', 0.144), ('enron', 0.14), ('macro', 0.127), ('loss', 0.122), ('yeast', 0.118), ('cc', 0.114), ('classi', 0.112), ('xn', 0.105), ('cca', 0.103), ('ensemble', 0.101), ('emotions', 0.101), ('knn', 0.095), ('optimising', 0.093), ('reverse', 0.09), ('instances', 0.089), ('medical', 0.089), ('optimised', 0.087), ('ebm', 0.087), ('epcc', 0.087), ('ioannis', 0.087), ('mlknn', 0.087), ('generation', 0.086), ('published', 0.085), ('yk', 0.085), ('er', 0.084), ('ml', 0.081), ('label', 0.079), ('pcc', 0.076), ('eps', 0.076), ('bmrm', 0.076), ('ensembles', 0.074), ('pruned', 0.073), ('yv', 0.073), ('labels', 0.071), ('prediction', 0.071), ('ty', 0.07), ('scene', 0.069), ('scores', 0.067), ('score', 0.066), ('australian', 0.065), ('datasets', 0.064), ('generalisation', 0.062), ('nicta', 0.062), ('argmaxy', 0.059), ('measures', 0.059), ('publicly', 0.058), ('grigorios', 0.058), ('maximiser', 0.058), ('pfahringer', 0.058), ('na', 0.055), ('ve', 0.052), ('constraint', 0.051), ('maximisation', 0.051), ('jesse', 0.051), ('minimising', 0.051), ('tsoumakas', 0.051), ('relaxations', 0.049), ('slack', 0.046), ('precision', 0.045), ('meta', 0.044), ('encoded', 0.042), ('library', 0.042), ('side', 0.042), ('canberra', 0.041), ('music', 0.039), ('relatedness', 0.038), ('multilabel', 0.038), ('output', 0.038), ('argmax', 0.037), ('argmin', 0.037), ('voting', 0.037), ('ms', 0.036), ('biology', 0.036), ('competing', 0.036), ('ranging', 0.035), ('popular', 0.035), ('australia', 0.034), ('relaxation', 0.034), ('structured', 0.034), ('probabilistic', 0.034), ('tailored', 0.033), ('bundle', 0.033), ('stacking', 0.033), ('public', 0.032), ('sm', 0.032), ('risk', 0.032), ('text', 0.032), ('admit', 0.032), ('ps', 0.032), ('jason', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 228 nips-2010-Reverse Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.
2 0.13421389 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
Author: Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen
Abstract: In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many subtasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a ‘black-box’ interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification. 1
3 0.12993856 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.12965347 290 nips-2010-t-logistic regression
Author: Nan Ding, S.v.n. Vishwanathan
Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1
5 0.10265654 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
6 0.096841201 282 nips-2010-Variable margin losses for classifier design
7 0.093387671 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
8 0.090615585 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
9 0.090347268 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
10 0.089036152 235 nips-2010-Self-Paced Learning for Latent Variable Models
11 0.087713115 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
12 0.087659776 151 nips-2010-Learning from Candidate Labeling Sets
13 0.087411933 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
14 0.085443966 284 nips-2010-Variational bounds for mixed-data factor analysis
15 0.083058856 138 nips-2010-Large Margin Multi-Task Metric Learning
16 0.081274405 243 nips-2010-Smoothness, Low Noise and Fast Rates
17 0.079795077 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
18 0.079594716 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
19 0.07904537 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
20 0.077672809 192 nips-2010-Online Classification with Specificity Constraints
topicId topicWeight
[(0, 0.225), (1, 0.083), (2, 0.049), (3, -0.126), (4, 0.001), (5, 0.056), (6, -0.07), (7, -0.047), (8, -0.005), (9, -0.051), (10, -0.049), (11, 0.047), (12, 0.146), (13, 0.091), (14, -0.025), (15, 0.047), (16, -0.004), (17, 0.068), (18, 0.035), (19, -0.043), (20, 0.002), (21, -0.022), (22, 0.014), (23, 0.002), (24, -0.01), (25, 0.023), (26, 0.104), (27, 0.1), (28, 0.012), (29, -0.034), (30, -0.027), (31, 0.015), (32, -0.032), (33, 0.043), (34, -0.002), (35, 0.047), (36, 0.025), (37, 0.008), (38, -0.097), (39, 0.103), (40, 0.029), (41, 0.044), (42, -0.017), (43, -0.08), (44, 0.102), (45, 0.039), (46, -0.044), (47, 0.024), (48, -0.044), (49, -0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.92914546 228 nips-2010-Reverse Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.
2 0.75843817 290 nips-2010-t-logistic regression
Author: Nan Ding, S.v.n. Vishwanathan
Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1
3 0.66403627 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.60979074 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
Author: Andreas Krause, Pietro Perona, Ryan G. Gomes
Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1
5 0.60654736 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
Author: Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen
Abstract: In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many subtasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a ‘black-box’ interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification. 1
6 0.5930146 269 nips-2010-Throttling Poisson Processes
7 0.59004885 61 nips-2010-Direct Loss Minimization for Structured Prediction
8 0.56312585 151 nips-2010-Learning from Candidate Labeling Sets
9 0.56295455 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
10 0.56052804 235 nips-2010-Self-Paced Learning for Latent Variable Models
11 0.55606222 282 nips-2010-Variable margin losses for classifier design
12 0.54839128 177 nips-2010-Multitask Learning without Label Correspondences
13 0.54253089 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
14 0.54242283 15 nips-2010-A Theory of Multiclass Boosting
15 0.53576064 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
16 0.53255379 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
17 0.52261281 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
18 0.52096111 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
19 0.52073103 99 nips-2010-Gated Softmax Classification
20 0.51359719 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
topicId topicWeight
[(13, 0.059), (17, 0.428), (27, 0.043), (30, 0.052), (35, 0.015), (45, 0.177), (50, 0.023), (52, 0.02), (60, 0.034), (77, 0.027), (78, 0.029), (90, 0.037)]
simIndex simValue paperId paperTitle
1 0.74374807 91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS
Author: Jean-philippe Vert, Kevin Bleakley
Abstract: We present a fast algorithm for the detection of multiple change-points when each is frequently shared by members of a set of co-occurring one-dimensional signals. We give conditions on consistency of the method when the number of signals increases, and provide empirical evidence to support the consistency results. 1
same-paper 2 0.74105436 228 nips-2010-Reverse Multi-Label Learning
Author: James Petterson, Tibério S. Caetano
Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.
3 0.73756588 53 nips-2010-Copula Bayesian Networks
Author: Gal Elidan
Abstract: We present the Copula Bayesian Network model for representing multivariate continuous distributions, while taking advantage of the relative ease of estimating univariate distributions. Using a novel copula-based reparameterization of a conditional density, joined with a graph that encodes independencies, our model offers great flexibility in modeling high-dimensional densities, while maintaining control over the form of the univariate marginals. We demonstrate the advantage of our framework for generalization over standard Bayesian networks as well as tree structured copula models for varied real-life domains that are of substantially higher dimension than those typically considered in the copula literature. 1
4 0.68614948 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
Author: Koray Kavukcuoglu, Pierre Sermanet, Y-lan Boureau, Karol Gregor, Michael Mathieu, Yann L. Cun
Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redudancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasisparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multistage convolutional network architecture improves performance on a number of visual recognition and detection tasks. 1
5 0.53183818 96 nips-2010-Fractionally Predictive Spiking Neurons
Author: Jaldert Rombouts, Sander M. Bohte
Abstract: Recent experimental work has suggested that the neural firing rate can be interpreted as a fractional derivative, at least when signal variation induces neural adaptation. Here, we show that the actual neural spike-train itself can be considered as the fractional derivative, provided that the neural signal is approximated by a sum of power-law kernels. A simple standard thresholding spiking neuron suffices to carry out such an approximation, given a suitable refractory response. Empirically, we find that the online approximation of signals with a sum of powerlaw kernels is beneficial for encoding signals with slowly varying components, like long-memory self-similar signals. For such signals, the online power-law kernel approximation typically required less than half the number of spikes for similar SNR as compared to sums of similar but exponentially decaying kernels. As power-law kernels can be accurately approximated using sums or cascades of weighted exponentials, we demonstrate that the corresponding decoding of spiketrains by a receiving neuron allows for natural and transparent temporal signal filtering by tuning the weights of the decoding kernel. 1
6 0.49394575 271 nips-2010-Tiled convolutional neural networks
7 0.49189746 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
8 0.49162543 54 nips-2010-Copula Processes
9 0.48449045 224 nips-2010-Regularized estimation of image statistics by Score Matching
10 0.48145694 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
11 0.47990263 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
12 0.47724473 257 nips-2010-Structured Determinantal Point Processes
13 0.47581795 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
14 0.47216853 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
15 0.47043487 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
16 0.46986806 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
17 0.46978849 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
18 0.46863738 36 nips-2010-Avoiding False Positive in Multi-Instance Learning
19 0.46829489 155 nips-2010-Learning the context of a category
20 0.46805301 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression