nips nips2012 nips2012-130 knowledge-graph by maker-knowledge-mining

130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification


Source: pdf

Author: Yao-nan Chen, Hsuan-tien Lin

Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. [sent-7, score-0.076]

2 Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. [sent-8, score-0.293]

3 In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. [sent-9, score-0.112]

4 The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. [sent-10, score-0.177]

5 In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. [sent-12, score-0.06]

6 In contrast to the multiclass problem, which associates only a single label to each instance, the multi-label classification problem allows multiple labels for each instance. [sent-15, score-0.104]

7 Label space dimension reduction (LSDR) is a new paradigm in multi-label classification [4, 5]. [sent-18, score-0.076]

8 By viewing the set of multiple labels as a high-dimensional vector in some label space, LSDR approaches use certain assumed or observed properties of the vectors to “compress” them. [sent-19, score-0.118]

9 For instance, a representative LSDR approach is the principal label space transformation [PLST; 5]. [sent-22, score-0.176]

10 PLST takes advantage of the key linear correlations between labels to build a small number of regression tasks. [sent-23, score-0.088]

11 LSDR approaches are homologous to the feature space dimension reduction (FSDR) approaches and share similar advantages: saving computational power and storage without much loss of prediction accuracy and improving performance by removing irrelevant, redundant, or noisy information [7]. [sent-24, score-0.176]

12 Unsupervised FSDR considers only feature information during reduction, while supervised FSDR considers the additional label information. [sent-26, score-0.153]

13 A typical instance of unsupervised FSDR is principal component analysis [PCA; 8]. [sent-27, score-0.055]

14 On the other hand, the supervised FSDR approaches include supervised principal component analysis [9], sliced inverse regression [10], and kernel dimension reduction [11]. [sent-29, score-0.278]

15 In particular, for multi-label classification, a leading supervised FSDR approach is canonical correlation analysis [CCA; 6, 12], which is based on linear projections in both the feature space and the label space. [sent-30, score-0.184]

16 In general, well-tuned supervised FSDR approaches can perform better than unsupervised ones because of the additional label information. [sent-31, score-0.118]

17 PLST can be viewed as the counterpart of PCA in the label space [5] and is feature-unaware. [sent-32, score-0.096]

18 That is, it considers only the label information during reduction. [sent-33, score-0.099]

19 Motivated by the superiority of supervised FSDR over unsupervised approaches, we are interested in studying feature-aware LSDR: LSDR that considers feature information. [sent-34, score-0.054]

20 In this paper, we propose a novel feature-aware LSDR approach, conditional principal label space transformation (CPLST). [sent-35, score-0.189]

21 We derive CPLST by minimizing an upper bound of the popular Hamming loss and show that CPLST can be accomplished by a simple use of singular value decomposition. [sent-37, score-0.053]

22 Moreover, CPLST can be flexibly extended by the kernel trick with suitable regularization, thereby allowing the use of sophisticated feature information to assist LSDR. [sent-38, score-0.077]

23 The experimental results on real-world datasets confirm that CPLST can reduce the number of learning tasks without loss of prediction performance. [sent-39, score-0.074]

24 2 Label Space Dimension Reduction The multi-label classification problem aims at finding a classifier from the input vector x to a label set Y, where x ∈ R^d and Y ⊆ {1, 2, . . . , K}. [sent-45, score-0.081]

25 The label set Y is often conveniently represented as a label vector, y ∈ {0, 1}^K, where y[k] = 1 if and only if k ∈ Y. [sent-49, score-0.162]
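
As a small illustration of this representation, the sketch below converts a label set into the corresponding binary label vector; the function name and the toy label set are ours, not from the paper.

```python
import numpy as np

def to_label_vector(label_set, K):
    """Represent a label set Y ⊆ {1, ..., K} as y ∈ {0, 1}^K with y[k] = 1 iff k ∈ Y."""
    y = np.zeros(K, dtype=int)
    y[[k - 1 for k in label_set]] = 1  # labels are 1-indexed in the text
    return y

print(to_label_vector({1, 3}, K=5))  # -> [1 0 1 0 0]
```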

26 BR decomposes the original dataset D into K binary classification datasets, D_k = {(x_n, y_n[k])}_{n=1}^N, and learns K independent binary classifiers, each of which is learned from D_k and is responsible for predicting whether the label set Y includes label k. [sent-53, score-0.185]
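
The following numpy sketch makes the BR decomposition concrete: it splits a hypothetical multi-label dataset into K binary problems and fits one classifier per label; plain least squares with a 0.5 threshold stands in for whatever binary learner BR would actually use, so treat the details as an assumption rather than the paper's setup.

```python
import numpy as np

# Hypothetical multi-label data: N instances, d features, K labels.
rng = np.random.default_rng(0)
N, d, K = 100, 5, 4
X = rng.normal(size=(N, d))
Y = (rng.random((N, K)) < 0.3).astype(float)   # Y[n, k] = y_n[k] ∈ {0, 1}

# Binary relevance: one independent binary dataset D_k = {(x_n, y_n[k])} per label.
W = np.zeros((d, K))
for k in range(K):
    # Least squares as a stand-in binary learner for label k.
    W[:, k] = np.linalg.lstsq(X, Y[:, k], rcond=None)[0]

# Predict label k by thresholding the k-th linear score.
Y_pred = (X @ W > 0.5).astype(float)
print("training Hamming loss:", np.mean(Y_pred != Y))
```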

27 Facing the above challenges, LSDR (Label Space Dimension Reduction) offers a potential solution by compressing the K-dimensional label space before learning. [sent-56, score-0.096]

28 LSDR transforms D into M datasets, where D_m = {(x_n, t_n[m])}_{n=1}^N, m = 1, 2, . . . [sent-57, score-0.051]

29 . . . , M, and M ≪ K, such that the multi-label classification problem can be tackled efficiently without significant loss of prediction performance. [sent-60, score-0.059]

30 For instance, compressive sensing [CS; 4], a precursor of LSDR, is based on the assumption that the label set vector y is sparse (i.e., has few nonzero entries). [sent-62, score-0.128]

31 CS transforms the original multi-label classification problem into M regression tasks with D_m = {(x_n, t_n[m])}_{n=1}^N, where t_n[m] = v_m^T y_n. [sent-65, score-0.2]

32 After obtaining a multi-output regressor r(x) for predicting the code vector t, CS decodes r(x) to the optimal label set vector by solving an optimization problem for each input instance x under the sparsity assumption, which can be time-consuming. [sent-66, score-0.118]
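
To make the CS encoding step concrete, here is a short numpy sketch that compresses the K-dimensional label vectors with random projection directions v_m, i.e. t_n[m] = v_m^T y_n; the per-instance sparse-recovery decoding mentioned above is deliberately left out, and all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 100, 20, 8                           # instances, labels, compressed dimension
Y = (rng.random((N, K)) < 0.1).astype(float)   # sparse label set vectors y_n

# Random projection directions v_1, ..., v_M, stored as the rows of V.
V = rng.normal(size=(M, K)) / np.sqrt(M)

# CS encoding: t_n[m] = v_m^T y_n, i.e. T = Y V^T gives the N x M code vectors.
T = Y @ V.T

# A multi-output regressor r(x) would then be trained to predict T from X;
# decoding r(x) back to a label set requires solving a sparse optimization
# (e.g. L1 minimization) per test instance, which is omitted in this sketch.
print(T.shape)   # (100, 8)
```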

33 1 Principal Label Space Transformation Principal label space transformation [PLST; 5] is another approach to LSDR. [sent-68, score-0.121]

34 PLST first shifts each label set vector y to z = y − ȳ, where ȳ = (1/N) Σ_{n=1}^N y_n is the estimated mean of the label set vectors. [sent-69, score-0.185]

35 Unlike CS, however, PLST takes the principal directions v_m (to be introduced next) rather than random ones, and does not need to solve an optimization problem during decoding. [sent-71, score-0.104]

36 In particular, PLST considers only a matrix V with orthogonal rows, and decodes r(x) to the predicted labels by h(x) = round(V^T r(x) + ȳ), which is called round-based decoding. [sent-72, score-0.081]

37 The first part is ‖r(X) − ZV^T‖_F^2, which represents the prediction error from the regressor r(x_n) to the desired code vectors t_n. [sent-76, score-0.118]

38 The second part is ‖Z − ZV^T V‖_F^2, which stands for the encoding error for projecting z_n into the closest vector in span{v_1, · · · , v_M}, which is V^T t_n. [sent-77, score-0.141]
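
Putting the two parts together, the bound referred to as (1) can be restated as below; the constant c absorbs the scaling (for example the 1/(NK) normalization of the Hamming loss), so this is a hedged paraphrase rather than the paper's exact constant.

```latex
\mathrm{HammingLoss}(h)\;\le\;
c\Big(\underbrace{\|r(X)-ZV^{\top}\|_F^{2}}_{\text{prediction error}}
\;+\;\underbrace{\|Z-ZV^{\top}V\|_F^{2}}_{\text{encoding error}}\Big)
```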

39 PLST is derived by minimizing the encoding error [5] and finds the optimal M by K matrix V by applying the singular value decomposition to Z and taking the M right-singular vectors v_m that correspond to the M largest singular values. [sent-78, score-0.164]

40 The M right-singular vectors are called the principal directions for representing zn . [sent-79, score-0.089]
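
A minimal sketch of PLST as described above, assuming synthetic data and plain least squares as the multi-output regressor: shift the labels, take the top-M right-singular vectors of Z, regress onto the codes, and apply round-based decoding.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, K, M = 200, 10, 14, 6
X = rng.normal(size=(N, d))
Y = (rng.random((N, K)) < 0.3).astype(float)

# Shift each label vector by the estimated mean: z_n = y_n - ybar.
ybar = Y.mean(axis=0)
Z = Y - ybar

# Principal directions: top-M right-singular vectors of Z (rows of V, VV^T = I).
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
V = Vt[:M]                                  # M x K

# Encode t_n = V z_n and learn a multi-output linear regressor.
T = Z @ V.T                                 # N x M code vectors
W = np.linalg.lstsq(X, T, rcond=None)[0]    # d x M least-squares coefficients

# Round-based decoding: h(x) = round(V^T r(x) + ybar), clipped to {0, 1}.
Y_hat = np.clip(np.round(X @ W @ V + ybar), 0, 1)
print("training Hamming loss:", np.mean(Y_hat != Y))
```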

41 2 Canonical Correlation Analysis A related technique that we will consider in this paper is canonical correlation analysis [CCA; 6], a well-known statistical technique for analyzing the linear relationship between two multidimensional variables. [sent-85, score-0.052]

42 When CCA is considered in the context of multi-label classification, X is the matrix that contains the mean-shifted x_n^T as rows and Z is the shifted label matrix that contains the mean-shifted y_n^T as rows. [sent-90, score-0.15]

43 Traditionally, CCA is used as a supervised FSDR approach that discards W_z and uses only W_x to project features onto a lower-dimensional space before learning with binary relevance [12, 17]. [sent-91, score-0.052]

44 In particular, CCA is equivalent to first seeking projection directions W_z of Z, and then performing a multi-output linear regression from x_n to W_z z_n, under the constraints W_x^T X^T X W_x = I, to obtain W_x. [sent-93, score-0.122]

45 CCA is equivalent to finding a projection that minimizes the squared prediction error under the constraints W_x^T X^T X W_x = W_z^T Z^T Z W_z = I. [sent-98, score-0.082]
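
For reference, a standard way to compute such CCA directions from the mean-shifted X and Z is through the SVD of the whitened cross-covariance; the sketch below follows that generic recipe (the small ridge term eps is our own numerical-stability assumption and is not part of the paper).

```python
import numpy as np

def cca_directions(X, Z, M, eps=1e-8):
    """Return W_x (d x M) and W_z (K x M) with, up to the eps ridge,
    W_x^T X^T X W_x = W_z^T Z^T Z W_z = I; assumes M <= min(d, K)."""
    Cxx = X.T @ X + eps * np.eye(X.shape[1])
    Czz = Z.T @ Z + eps * np.eye(Z.shape[1])
    Cxz = X.T @ Z
    # Whiten both sides via Cholesky factors, then SVD the whitened cross-covariance.
    Lx, Lz = np.linalg.cholesky(Cxx), np.linalg.cholesky(Czz)
    U, _, Vt = np.linalg.svd(np.linalg.solve(Lx, Cxz) @ np.linalg.inv(Lz).T)
    Wx = np.linalg.solve(Lx.T, U[:, :M])
    Wz = np.linalg.solve(Lz.T, Vt[:M].T)
    return Wx, Wz
```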

46 Then, using the Hamming loss bound (1), when V = W_z and r(X) = XW_x, OCCA minimizes ‖r(X) − ZW_z^T‖_F^2 in (1) with the hope that the Hamming loss is also minimized. [sent-101, score-0.084]

47 In other words, OCCA favors orthogonal directions V that are “easy to learn” (of low prediction error) in terms of linear regression. [sent-102, score-0.069]

48 For every fixed W_z = V in (3), the optimization problem for W_x is simply a linear regression from X to ZV^T. [sent-103, score-0.065]

49 subject to VV^T = I. (4) The matrix H = XX† is called the hat matrix for linear regression [19]. [sent-106, score-0.104]

50 1 Conditional Principal Label Space Transformation From the previous discussions, OCCA captures the input-output relation to minimize the prediction error in bound (1) with the “easy” directions. [sent-109, score-0.074]

51 In contrast, PLST minimizes the encoding error in bound (1) with the “principal” directions. [sent-110, score-0.118]

52 We begin by continuing our derivation of OCCA, which obtains r(x) by a linear regression from X to ZV^T. [sent-112, score-0.065]

53 Such a matrix V minimizes the prediction error term and the encoding error term simultaneously. [sent-115, score-0.179]

54 The resulting algorithm is called conditional principal label space transformation (CPLST), as shown in Algorithm 1. [sent-116, score-0.189]

55 3: Encode {(x_n, y_n)}_{n=1}^N to {(x_n, t_n)}_{n=1}^N, where t_n = V_M z_n. [sent-123, score-0.115]

56 4: Learn a multi-dimension regressor r(x) from {(x_n, t_n)}_{n=1}^N. [sent-124, score-0.057]
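
The sketch below fills in Algorithm 1 with concrete numpy steps. It assumes, based on the hat-matrix discussion above, that the conditional principal directions are the top-M eigenvectors of Z^T H Z with H = XX†; this is our reading of the text, so treat it as a sketch rather than the paper's exact procedure. The regression step reuses plain least squares, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, K, M = 200, 10, 14, 6
X = rng.normal(size=(N, d))
Y = (rng.random((N, K)) < 0.3).astype(float)

# Mean-shift the labels and form the hat matrix H = X X^dagger.
ybar = Y.mean(axis=0)
Z = Y - ybar
H = X @ np.linalg.pinv(X)

# Conditional principal directions: top-M eigenvectors of Z^T H Z (rows of V).
evals, evecs = np.linalg.eigh(Z.T @ H @ Z)     # eigenvalues in ascending order
V = evecs[:, ::-1][:, :M].T                    # M x K with orthonormal rows

# Steps 3-5 of Algorithm 1: encode, learn a multi-output regressor, decode.
T = Z @ V.T
W = np.linalg.lstsq(X, T, rcond=None)[0]
Y_hat = np.clip(np.round(X @ W @ V + ybar), 0, 1)
print("training Hamming loss:", np.mean(Y_hat != Y))
```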

57 CPLST balances the prediction error with the encoding error and is closely related to bound (1). [sent-126, score-0.158]

58 PLST focuses on the encoding error and does not consider the features during LSDR, i.e., it is feature-unaware. [sent-131, score-0.084]

59 In contrast to PLST, the two feature-aware approaches OCCA and CPLST must calculate the matrix H and are thus slower than PLST if the dimension d of the input space is large. [sent-136, score-0.073]

60 2 Kernelization and Regularization Kernelization—extending a linear model to a nonlinear one using the kernel trick [21]—and regularization are two important techniques in machine learning. [sent-138, score-0.06]

61 So far, we have derived CPLST by using linear regression as the underlying multi-output regression method. [sent-142, score-0.119]

62 Next, we replace linear regression by its kernelized form with ℓ2 regularization, kernel ridge regression [22], as the underlying regression algorithm. [sent-143, score-0.275]

63 Kernel ridge regression considers a feature mapping Φ : X → F before performing regularized linear regression. [sent-144, score-0.145]

64 Therefore, a high or even an infinite dimensional feature transform can be used to assist LSDR in kernel-CPLST through a suitable kernel function. [sent-150, score-0.065]

65 Because kernel-CPLST, like kernel ridge regression itself, needs to invert an N by N matrix, we can only afford to conduct a fair comparison using mid-sized datasets. [sent-154, score-0.138]
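
As a rough illustration of the kernel ridge regression step that kernel-CPLST relies on, and of why an N by N system has to be solved, here is a hedged sketch with a Gaussian kernel; the encoding matrix and the decoding step are left out, and the helper names are ours.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X_train, T, gamma, lam):
    """Multi-output kernel ridge regression: alpha = (K + lam*I)^{-1} T."""
    Kmat = gaussian_kernel(X_train, X_train, gamma)
    return np.linalg.solve(Kmat + lam * np.eye(len(X_train)), T)   # N x N solve

def kernel_ridge_predict(X_train, alpha, X_test, gamma):
    """r(x) for test points; round-based decoding would follow as in the linear case."""
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```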

66 PBR randomly selects M labels from the original label set during training and only learns those M binary classifiers for prediction. [sent-181, score-0.104]
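
A quick sketch of the PBR baseline as described: keep a random subset of M labels, run binary relevance on those, and predict 0 for everything else; the least-squares learner and the data are again placeholders of our own.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, K, M = 200, 10, 14, 6
X = rng.normal(size=(N, d))
Y = (rng.random((N, K)) < 0.3).astype(float)

# Partial binary relevance: randomly select M of the K labels to keep.
keep = rng.choice(K, size=M, replace=False)

# Binary relevance on the kept labels only (least squares as a stand-in learner).
W = np.linalg.lstsq(X, Y[:, keep], rcond=None)[0]

# Labels that were dropped are always predicted as 0.
Y_hat = np.zeros_like(Y)
Y_hat[:, keep] = (X @ W > 0.5).astype(float)
print("training Hamming loss:", np.mean(Y_hat != Y))
```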

67 The yeast dataset reveals clear differences between the four LSDR approaches and is hence taken for presentation here, while similar differences have been observed on other datasets as well. [sent-185, score-0.072]

68 To understand the cause of the different performance, we plot the (test) encoding error ‖Z − ZV^T V‖_F^2, the prediction error ‖XW^T − ZV^T‖_F^2, and the loss bound (1) in Figure 1. [sent-190, score-0.183]

69 Figure 1(b) shows the encoding error on the test set, which matches the design of PLST. [sent-191, score-0.084]

70 Regardless of the approaches used, the encoding error decreases to 0 when using all 14 dimensions because the {vm }’s can span the whole label space. [sent-192, score-0.179]

71 As expected, PLST achieves the lowest encoding error across every number of dimensions. [sent-193, score-0.096]

72 CPLST partially minimizes the encoding error in its objective function, and hence also achieves a decent encoding error. [sent-194, score-0.162]

73 On the other hand, OCCA is blind to and hence worst at the encoding error. [sent-195, score-0.057]

74 In particular, its encoding error is even worse than that of the baseline PBR. [sent-196, score-0.084]

75 Figure 1(c) shows the prediction error ‖XW^T − ZV^T‖_F^2 on the test set, which matches the design of OCCA. [sent-197, score-0.061]

76 First, OCCA indeed achieves the lowest prediction error across all numbers of dimensions. [sent-198, score-0.073]

77 PLST, which is blind to the prediction error, reaches the highest prediction error, and is even worse than PBR. [sent-199, score-0.068]

78 The results further reveal the trade-off between the encoding error and the prediction error: a more efficient encoding of the label space is harder to predict. [sent-200, score-0.271]

79 PLST takes the more efficient encoding to the extreme, and results in worse prediction error; OCCA, on the other hand, is better in terms of the prediction error, but leads to the least efficient encoding. [sent-201, score-0.125]

80 Figure 1(d) shows the scaled upper bound (1) of the Hamming loss, which equals the sum of the encoding error and the prediction error. [sent-202, score-0.131]

81 Table 3: Test Hamming loss of PLST and CPLST with linear regression on the bibtex, corel5k, emotions, enron, genbase, medical, scene, and yeast datasets at M = 20%K and other label percentages (numeric entries omitted). [sent-205, score-0.29]

82 The test Hamming loss achieved by PLST and CPLST on other datasets with different percentages of used labels is reported in Table 3. [sent-527, score-0.063]

83 These results demonstrate that LSDR approaches, like their feature space dimension reduction counterparts, can potentially help resolve the issue of overfitting. [sent-531, score-0.089]

84 2 Coupling Label Space Dimension Reduction with the M5P Decision Tree CPLST is designed by assuming a specific regression method. [sent-533, score-0.054]

85 Next, we demonstrate that the input-output relationship captured by CPLST is not restricted to coupling with linear regression, but can also be effective for other regression methods in the learning stage (step 4 of Algorithm 1). [sent-534, score-0.077]

86 The results demonstrate that the captured input-output relation is also effective for regression methods other than linear regression. [sent-541, score-0.065]

87 (Results within one standard error of the lower one are in bold.) We use a Gaussian kernel during LSDR and take kernel ridge regression with the same kernel and the same regularization parameter as the underlying multi-output regression method. [sent-704, score-0.268]

88 We also couple PLST with kernel ridge regression for a fair comparison. [sent-705, score-0.15]

89 We select the Gaussian kernel parameter γ and the regularization parameter λ with a grid search on (log2 λ, log2 γ), using 5-fold cross-validation with the sum of the Hamming loss across all dimensions as the selection criterion. [sent-706, score-0.074]
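
The parameter selection described here amounts to a grid search over (log2 λ, log2 γ) with 5-fold cross-validation scored by the summed Hamming loss; the sketch below shows that loop, where fit_and_score is a hypothetical helper (it would train the LSDR method with kernel ridge regression on one split) and the grid ranges would come from the experimental setup, not from this sketch.

```python
import numpy as np
from itertools import product

def select_kernel_params(X, Y, log2_lams, log2_gammas, fit_and_score, n_folds=5, seed=0):
    """Pick (lam, gamma) minimizing the Hamming loss summed over folds and dimensions.

    fit_and_score(X_tr, Y_tr, X_va, Y_va, lam, gamma) is a hypothetical helper that
    trains the LSDR method with kernel ridge regression and returns the validation
    Hamming loss summed across all numbers of reduced dimensions.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_score, best_params = np.inf, None
    for lg_lam, lg_gam in product(log2_lams, log2_gammas):
        lam, gamma = 2.0 ** lg_lam, 2.0 ** lg_gam
        score = 0.0
        for i in range(n_folds):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            score += fit_and_score(X[tr], Y[tr], X[va], Y[va], lam, gamma)
        if score < best_score:
            best_score, best_params = score, (lam, gamma)
    return best_params
```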

90 When coupled with kernel ridge regression, the comparison between PLST and kernel-CPLST in terms of the Hamming loss is shown in Table 5. [sent-708, score-0.126]

91 The results provide practical insights on the two types of label correlation [14]: unconditional correlation (feature-unaware) and conditional correlation (feature-aware). [sent-714, score-0.179]

92 5 Conclusion In this paper, we studied feature-aware label space dimension reduction (LSDR) approaches, which utilize the feature information during LSDR and can be viewed as the counterpart of supervised feature space dimension reduction. [sent-717, score-0.252]

93 We proposed a novel feature-aware LSDR algorithm, conditional principal label space transformation (CPLST), which utilizes the key conditional correlations for dimension reduction. [sent-718, score-0.233]

94 CPLST enjoys a theoretical guarantee on balancing the prediction error and the encoding error when minimizing the Hamming loss bound. [sent-719, score-0.17]

95 The experimental results demonstrated that CPLST is the best among the LSDR approaches when coupled with linear regression or kernel ridge regression. [sent-722, score-0.18]

96 Moreover, the input-output relation captured by CPLST can be utilized by regression methods other than linear regression. [sent-724, score-0.065]

97 Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. [sent-777, score-0.109]

98 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-788, score-0.088]

99 On label dependence and loss minimization in multi-label classification. [sent-808, score-0.106]

100 Feature-aware label space dimension reduction for multi-label classification problem. [sent-870, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('plst', 0.593), ('cplst', 0.559), ('lsdr', 0.351), ('zvt', 0.217), ('occa', 0.175), ('wz', 0.133), ('cca', 0.109), ('fsdr', 0.1), ('wx', 0.09), ('pbr', 0.084), ('label', 0.081), ('hamming', 0.068), ('zwz', 0.058), ('encoding', 0.057), ('principal', 0.055), ('regression', 0.054), ('xwx', 0.05), ('ridge', 0.049), ('zt', 0.048), ('vvt', 0.048), ('yeast', 0.043), ('vm', 0.037), ('tn', 0.035), ('kernel', 0.035), ('xwt', 0.034), ('prediction', 0.034), ('bibtex', 0.033), ('enron', 0.032), ('taiwan', 0.032), ('dimension', 0.031), ('reduction', 0.03), ('emotions', 0.029), ('kernelization', 0.029), ('error', 0.027), ('classi', 0.027), ('transformation', 0.025), ('loss', 0.025), ('genbase', 0.025), ('vzt', 0.025), ('xn', 0.023), ('labels', 0.023), ('yn', 0.023), ('supervised', 0.023), ('zn', 0.022), ('regressor', 0.022), ('unconditional', 0.022), ('scene', 0.021), ('correlation', 0.021), ('minimizes', 0.021), ('rows', 0.02), ('vz', 0.02), ('canonical', 0.02), ('compressive', 0.019), ('tsoumakas', 0.018), ('zv', 0.018), ('subsection', 0.018), ('considers', 0.018), ('hz', 0.018), ('kernelized', 0.018), ('assist', 0.017), ('coupled', 0.017), ('medical', 0.017), ('decoding', 0.017), ('mulan', 0.017), ('qkkq', 0.017), ('weka', 0.017), ('zvtv', 0.017), ('transforms', 0.016), ('space', 0.015), ('vt', 0.015), ('sensing', 0.015), ('singular', 0.015), ('cx', 0.015), ('decodes', 0.015), ('elisseeff', 0.015), ('datasets', 0.015), ('cs', 0.014), ('relevance', 0.014), ('regularization', 0.014), ('approaches', 0.014), ('katakis', 0.014), ('br', 0.014), ('feature', 0.013), ('bound', 0.013), ('matrix', 0.013), ('conditional', 0.013), ('cation', 0.013), ('sliced', 0.013), ('tai', 0.013), ('hat', 0.013), ('precursor', 0.013), ('directions', 0.012), ('cz', 0.012), ('couple', 0.012), ('lowest', 0.012), ('orthogonal', 0.012), ('coupling', 0.012), ('sophisticated', 0.012), ('bold', 0.011), ('linear', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

Author: Yao-nan Chen, Hsuan-tien Lin

Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1

2 0.05348397 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

3 0.046400797 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

4 0.04268105 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning

Author: Liping Liu, Thomas G. Dietterich

Abstract: In the superset label learning problem (SLL), each training instance provides a set of candidate labels of which one is the true label of the instance. As in ordinary regression, the candidate label set is a noisy version of the true label. In this work, we solve the problem by maximizing the likelihood of the candidate label sets of training instances. We propose a probabilistic model, the Logistic StickBreaking Conditional Multinomial Model (LSB-CMM), to do the job. The LSBCMM is derived from the logistic stick-breaking process. It first maps data points to mixture components and then assigns to each mixture component a label drawn from a component-specific multinomial distribution. The mixture components can capture underlying structure in the data, which is very useful when the model is weakly supervised. This advantage comes at little cost, since the model introduces few additional parameters. Experimental tests on several real-world problems with superset labels show results that are competitive or superior to the state of the art. The discovered underlying structures also provide improved explanations of the classification predictions. 1

5 0.037363671 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

6 0.03583923 71 nips-2012-Co-Regularized Hashing for Multimodal Data

7 0.035421483 200 nips-2012-Local Supervised Learning through Space Partitioning

8 0.035149977 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

9 0.032291777 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

10 0.030772712 330 nips-2012-Supervised Learning with Similarity Functions

11 0.029503202 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.02870333 280 nips-2012-Proper losses for learning from partial labels

13 0.028586708 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

14 0.028312843 271 nips-2012-Pointwise Tracking the Optimal Regression Function

15 0.027683655 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

16 0.026438667 187 nips-2012-Learning curves for multi-task Gaussian process regression

17 0.026164122 231 nips-2012-Multiple Operator-valued Kernel Learning

18 0.02592266 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

19 0.025771478 16 nips-2012-A Polynomial-time Form of Robust Regression

20 0.025348308 305 nips-2012-Selective Labeling via Error Bound Minimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.072), (1, 0.019), (2, -0.0), (3, -0.008), (4, 0.026), (5, -0.002), (6, 0.025), (7, 0.068), (8, -0.015), (9, -0.016), (10, 0.01), (11, 0.011), (12, 0.013), (13, 0.028), (14, -0.0), (15, 0.006), (16, 0.016), (17, 0.006), (18, 0.012), (19, -0.004), (20, -0.019), (21, 0.033), (22, -0.022), (23, 0.018), (24, 0.039), (25, -0.029), (26, -0.016), (27, -0.01), (28, -0.028), (29, 0.017), (30, -0.04), (31, 0.003), (32, 0.029), (33, 0.002), (34, -0.061), (35, -0.027), (36, -0.01), (37, -0.042), (38, -0.035), (39, -0.03), (40, 0.009), (41, -0.023), (42, 0.013), (43, -0.025), (44, -0.032), (45, -0.033), (46, 0.004), (47, -0.017), (48, -0.005), (49, 0.067)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8259179 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

Author: Yao-nan Chen, Hsuan-tien Lin

Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1

2 0.61443532 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

3 0.53767163 280 nips-2012-Proper losses for learning from partial labels

Author: Jesús Cid-sueiro

Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1

4 0.5298717 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams

Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1

5 0.52118903 225 nips-2012-Multi-task Vector Field Learning

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

6 0.51987445 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

7 0.51634258 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities

8 0.51519567 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

9 0.51232129 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

10 0.50733262 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

11 0.50487453 198 nips-2012-Learning with Target Prior

12 0.50269186 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

13 0.48873162 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

14 0.48660165 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

15 0.48032904 200 nips-2012-Local Supervised Learning through Space Partitioning

16 0.4790892 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

17 0.47211802 330 nips-2012-Supervised Learning with Similarity Functions

18 0.47080588 361 nips-2012-Volume Regularization for Binary Classification

19 0.46890569 289 nips-2012-Recognizing Activities by Attribute Dynamics

20 0.46566534 50 nips-2012-Bandit Algorithms boost Brain Computer Interfaces for motor-task selection of a brain-controlled button


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (18, 0.373), (21, 0.02), (38, 0.087), (39, 0.016), (42, 0.042), (54, 0.018), (55, 0.027), (74, 0.039), (76, 0.08), (80, 0.098), (92, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.74942052 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt

Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1

same-paper 2 0.69043648 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

Author: Yao-nan Chen, Hsuan-tien Lin

Abstract: Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets. 1

3 0.40871924 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian random variable with unknown covariance Σ: We give an algorithm that provable recovers A and Σ up to an additive and whose running time and sample complexity are polynomial in n and 1/ . To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of A one by one via local search. 1

4 0.40868506 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

5 0.40785727 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

6 0.40648198 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

7 0.40627712 65 nips-2012-Cardinality Restricted Boltzmann Machines

8 0.40582365 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.40525419 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

10 0.40462449 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

11 0.40437868 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

12 0.40397537 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.40327108 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

14 0.40301815 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

15 0.40285626 168 nips-2012-Kernel Latent SVM for Visual Recognition

16 0.40247524 292 nips-2012-Regularized Off-Policy TD-Learning

17 0.4021154 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.401849 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

19 0.40094119 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

20 0.40085196 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)