nips nips2010 nips2010-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. [sent-3, score-0.801]
2 In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. [sent-4, score-0.847]
3 In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. [sent-5, score-0.765]
4 Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. [sent-7, score-0.463]
5 The embedding and the combination parameters are learned jointly on the final classification objective. [sent-8, score-0.326]
6 This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. [sent-9, score-0.678]
7 1 Introduction Many applications require classification techniques dealing with train and/or test instances with missing features: e.g. [sent-10, score-0.753]
8 a churn predictor might deal with incomplete log features for new customers, a spam filter might be trained from data originating from servers storing different features, a face detector might deal with images for which high resolution cues are corrupted. [sent-12, score-0.345]
9 In this work, we address a learning setting in which the missing features are either missing at random [6], i.e. [sent-13, score-1.329]
10 deletion due to corruption or noise, or structurally missing [4], i.e. [sent-15, score-0.697]
11 some features do not make sense for some examples, e.g. [sent-17, score-0.221]
12 We do not consider setups in which the features are maliciously deleted to fool the classifier [5]. [sent-20, score-0.404]
13 Techniques for dealing with incomplete data fall mainly into two categories: techniques which impute the missing features and techniques considering an instance-specific subspace. [sent-21, score-0.901]
14 In this case, the data instances are viewed as feature vectors in a high-dimensional space and the classifier is a function from this space into the discrete set of classes. [sent-23, score-0.289]
15 Prior to classification, the missing vector components need to be imputed. [sent-24, score-0.554]
16 Early imputation approaches fill any missing value with a constant, zero or the average of the feature over the observed cases [18]. [sent-25, score-0.887]
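As a concrete illustration of these imputation baselines, the sketch below fills each missing entry with the feature's mean over the observed cases, falling back to zero when a feature is never observed; encoding missing values as NaN and the function name are our assumptions, not part of the paper.

```python
import numpy as np

def mean_impute(X):
    """Fill missing entries of X (marked with np.nan) with each feature's observed mean."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        observed = ~np.isnan(X[:, j])
        fill = X[observed, j].mean() if observed.any() else 0.0  # zero fallback
        X[~observed, j] = fill
    return X
```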
17 Along this line, more complex strategies based on generative models have been used to fill missing features according to the most likely value given the observed features. [sent-27, score-0.857]
18 Building upon this generative model strategy, several approaches have considered integrating out the missing values, either by integrating the loss [2] or the decision function [22]. [sent-29, score-0.629]
19 [Figure 1: example instances with Feature B, C, or D missing; each observed (feature, value) pair is mapped through p(f, v).] [sent-34, score-1.662]
20 Each pair is mapped into an embedding space, then the embedded vectors are combined into a single vector (either linearly with mean or non-linearly with max). [sent-43, score-0.497]
21 Our learning procedure learns both the embedding space and the linear classifier jointly. [sent-45, score-0.326]
22 In this work, we propose a novel strategy called feature set embedding. [sent-53, score-0.259]
23 Contrary to previous work, we do not consider instances as vectors from a given feature space. [sent-54, score-0.255]
24 For that purpose, we introduce a model which maps each (feature, value) pair onto an embedding space and combines the embedded pairs into a single vector before applying a linear classifier, see Figure 1. [sent-56, score-0.463]
25 The embedding space mapping and the linear classifier are jointly learned to maximize the conditional probability of the label given the observed input. [sent-57, score-0.326]
26 Contrary to previous work, this set embedding framework naturally handles incomplete data without modeling the missing feature distribution or considering an instance-specific decision function. [sent-58, score-1.172]
27 Compared to other work on learning from sets, our approach is original as it proposes to learn to embed set elements and to classify sets as a single optimization problem, while prior strategies learn their decision function considering a fixed mapping from sets into a feature space [12, 3]. [sent-59, score-0.386]
28 At the lower level, (feature, value) pairs are individually mapped into an embedding space of dimension m: given an example X = {(f_i, v_i), i = 1, ..., |X|}, a function p predicts an embedding vector p_i = p(f_i, v_i) ∈ R^m for each feature value pair (f_i, v_i). [sent-79, score-1.158]
29 At the upper level, the embedded vectors are combined to make the class prediction: a function h takes the set of embedded vectors {p_i, i = 1, ..., |X|} and predicts a vector of confidence values h({p_i}) ∈ R^k in which the correct class should be assigned the highest value. [sent-80, score-0.301]
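A minimal sketch of this upper level, assuming the embedded pairs are already stacked in a matrix: the set is pooled with mean or max and a linear classifier V ∈ R^{k×m} produces one confidence per class. The function and argument names are ours, not the paper's.

```python
import numpy as np

def fse_confidences(P, V, combine="max"):
    """Upper level h: P has shape (|X|, m), one embedded vector per observed
    (feature, value) pair; V has shape (k, m). Returns k class confidences."""
    pooled = P.max(axis=0) if combine == "max" else P.mean(axis=0)  # combine the set
    return V @ pooled  # linear classifier on the pooled representation
```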
30 1 Feature Embedding Feature embedding offers great flexibility. [sent-85, score-0.326]
31 For discrete features, the simplest embedding strategy learns a distinct parameter vector for each (f, v) pair, i.e. [sent-87, score-0.457]
32 When the feature values are continuous, we adopt a similar strategy and define p(f, v) = W [L_f^(a); v L_f^(b)] (3), where L_f^(a) ∈ R^(l^(a)) and L_f^(b) ∈ R^(l^(b)) with l^(a) + l^(b) = l, and where L_f^(a) informs about the presence of feature f, while v L_f^(b) informs about its value. [sent-93, score-0.534]
33 When the dataset contains a mix of continuous and discrete features, both embedding approaches can be used jointly. [sent-97, score-0.36]
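The two parameterizations above can be written compactly as below: a per-(feature, value) lookup for discrete features and the presence/value split of Eq. (3) for continuous ones. Writing the discrete case as W applied to a learned vector follows the style of Eq. (3) and is an assumption, as is storing the lookup tables as a dictionary and per-feature arrays.

```python
import numpy as np

def embed_discrete(f, v, L, W):
    """Assumed discrete case: one learned l-dimensional vector per (feature, value) pair,
    projected into the embedding space by W of shape (m, l)."""
    return W @ L[(f, v)]

def embed_continuous(f, v, L_a, L_b, W):
    """Continuous case of Eq. (3): L_a[f] encodes the presence of feature f,
    v * L_b[f] encodes its value; W has shape (m, l_a + l_b)."""
    return W @ np.concatenate([L_a[f], v * L_b[f]])
```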
34 Feature embedding is hence a versatile strategy; the practitioner defines the model parameterization according to the nature of the features, and the learned parameters L and W encode the correlation between features. [sent-98, score-0.326]
35 2 Classifying from an Embedded Feature Set The second level of our architecture h considers the set of embedded features and predicts a vector of confidence values. [sent-100, score-0.413]
36 Hence, in this case, the dimension of the embedding space m bounds the rank of the matrix V W . [sent-106, score-0.37]
37 In the specific case where features are continuous and no presence information is provided, i.e. [sent-108, score-0.254]
38 L_{f,v} = v L_f^(b), our model is equivalent to a classical linear classifier operating on feature vectors when all features are present, i.e. [sent-110, score-0.427]
39 Intuitively, each dimension in the embedding space provides a meta-feature describing each (feature, value) pair; the max operator then outputs the best meta-feature match over the set of (feature, value) pairs, performing a kind of soft-OR, i.e. [sent-124, score-0.358]
40 3 Experiments Our experiments consider different setups: features missing at train and test time, features missing only at train time, features missing only at test time. [sent-143, score-2.625]
41 1 Missing Features at Train and Test Time The setup in which features are missing at train and test time is relevant to applications suffering sensor failure or communication errors. [sent-147, score-0.984]
42 It is also relevant to applications in which some features are structurally missing. [sent-148, score-0.453]
43 [Table 1: Dataset Statistics — columns: train set, test set, # eval., missing features (%), continuous (c) or discrete (d). Missing (%) and type per dataset: sick 90 c, pima 90 c, hepatitis 90 c, echo 90 c, hypo 90 c, MNIST-5-vs-6 25 d, Cars 62 d, USPS 85 c, Physics 85 c, Mine 26 c, MNIST-miss-test† 0 to 99 d, MNIST-full 0 to 87 d. Features missing only at training time for USPS, Physics and Mine.] [sent-151, score-0.662]
44 For each dataset, 90% of the features are removed at random. [sent-162, score-0.221]
45 Contrary to UCI, the deleted features have some structure; for each example, a square area covering 25% of the image surface is removed at random. [sent-164, score-0.275]
46 This task presents a problem where some features are structurally missing. [sent-166, score-0.302]
47 For each example, regions of interest corresponding to potential car parts are detected, and features are extracted for each region. [sent-167, score-0.256]
48 Hence, at most 1900 = 19 × 10 × 10 features are provided for each image. [sent-171, score-0.221]
49 These baselines are all variants of Support Vector Machines (SVMs), suitable for the missing feature problem. [sent-174, score-0.716]
50 Flag relies on the Zero imputation but complements the examples with binary features indicating whether each feature was observed or imputed. [sent-176, score-0.595]
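For reference, the Flag baseline can be sketched as follows under the same NaN encoding of missing values assumed earlier: zero-impute and append one binary observed/imputed indicator per feature.

```python
import numpy as np

def flag_representation(X):
    """Zero imputation plus binary observed/imputed indicators, doubling the dimension."""
    observed = ~np.isnan(X)                      # (n, d) indicator of observed entries
    return np.hstack([np.where(observed, X, 0.0), observed.astype(float)])
```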
51 Finally, geom is a subspace-based strategy [4]; for each example, a classifier in the subspace corresponding to the observed features is considered. [sent-177, score-0.427]
52 For each experiment, the hyperparameters of our model l, m and the number of training iterations are validated by first training the model on 4/5 of the training data and assessing it on the remainder of the training data. [sent-179, score-0.373]
53 In the case of structurally missing features, the car experiment shows a substantial advantage for FSE over the second best approach geom, which was specifically introduced for this kind of setup. [sent-185, score-0.67]
54 We therefore solely validate non-linear FSE in the following: For feature embedding of continuous data, feature presence information has proven to be useful in all cases, i.e. [sent-201, score-0.72]
55 For feature embedding of discrete data, sharing parameters across different values of the same feature, i.e. [sent-205, score-0.574]
56 We also relied on sharing parameters across different features with the same value, i.e. [sent-209, score-0.273]
57 gray levels for MNIST and region features for cars. [sent-214, score-0.254]
58 For the hyperparameters (l, m) of our model, we observed that the main control on our model capacity is the embedding size m. [sent-215, score-0.395]
59 2 Missing Features at Train Time The setup presenting missing features at training time is relevant to applications which rely on different sources for training. [sent-219, score-0.908]
60 At test time, however, the feature detector can be designed to collect the complete feature set. [sent-221, score-0.373]
61 The training set is degraded and 85% of the features are missing. [sent-225, score-0.328]
62 Again, the training set is degraded and 85% of the features are missing. [sent-227, score-0.328]
63 The third set considers the problem of detecting land-mines from 4 types of sensors, where each sensor provides a different set of features or views. [sent-228, score-0.288]
64 In this case, for each instance, whole views are considered missing during training. [sent-229, score-0.554]
65 Infinite imputation is a general technique proposed for the case where features are missing at train time. [sent-232, score-1.047]
66 Instead of pretraining the distribution governing the missing values with a generative objective, infinite imputation proposes to train the imputation model and the final classifier in a joint optimization framework [6]. [sent-233, score-0.92]
67 In this context, we consider an SVM with an RBF kernel as the classifier and three alternative imputation models, Mean, GMM and MeanFeat, the last of which corresponds to mean imputation in the feature space. [sent-234, score-0.428]
68 In this case, features are highly correlated and GMM imputation yields a challenging baseline. [sent-239, score-0.392]
69 [Figure 2: Results for MNIST-miss-test (12 binary problems with features missing at test time only): error rates for all models as a function of the number of missing features (0 to 750).] [sent-241, score-1.599]
70 In this case, feature correlation is low and GMM imputation yields the worst performance, while our model brings a strong improvement. [sent-242, score-0.333]
71 3 Missing Features at Test Time The setup presenting missing features at test time considers applications in which the training data have been produced with more care than the test data. [sent-244, score-1.073]
72 Both strategies propose to learn a classifier which avoids assigning high weight to a small subset of features, hence limiting the impact of the deletion of some features at test time. [sent-247, score-0.381]
73 Since no features are missing at train time, we adapt our training procedure to take into account the mismatched conditions between train and test. [sent-257, score-1.051]
74 Figure 2 plots the error rate as a function of the number of missing features. [sent-260, score-0.554]
75 FSE has a clear advantage in most settings: it achieves a lower error rate than Globerson & Roweis [10] in all cases, while it is better than Dekel & Shamir [5] as soon as the number of missing features is above 50, i.e. [sent-261, score-0.775]
76 In fact, we observe that FSE is very robust to feature deletion; its error rate remains below 20% for up to 700 missing features, i.e. [sent-264, score-0.937]
77 On the other hand, the alternative strategies report performance close to random when the number of missing features reaches 150, i.e. [sent-267, score-0.858]
78 features are intentionally deleted to fool the classifier; that setting is beyond the scope of this work. [sent-272, score-0.325]
79 These setups proposed small training sets motivated by the training cost of the compared alternatives (see Table 1). [sent-275, score-0.321]
80 All conditions are considered: features missing at training time, at testing time, and at both times. [sent-277, score-0.849]
81 100, 200, 500 and 784 features, which approximately correspond to 90, 60, 35 and 0% missing. [Table 4: Error Rate (%), 10-class MNIST-full experiments, broken down by the number of train and test features.] [sent-280, score-0.876]
82 all training features are available but the training procedure randomly hides some features each time it examines an example. [sent-305, score-0.59]
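A sketch of this train-time hiding, assuming an example is stored as a list of (feature, value) pairs and the number of features to keep is chosen to mimic the expected test-time conditions; the function name and the keep-count parameter are ours.

```python
import numpy as np

def hide_features(pairs, n_keep, rng):
    """Randomly keep n_keep of the observed (feature, value) pairs for one training example."""
    idx = rng.permutation(len(pairs))[:n_keep]
    return [pairs[i] for i in idx]

# e.g. hide_features(example_pairs, n_keep=300, rng=np.random.default_rng(0))
```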
83 when facing a test problem with 300 available features, the model trained with 300 features is better than the models trained with 100, 500 or 784 features. [sent-310, score-0.319]
84 We also observe that models facing fewer features at test time than at train time yield poor performance, while the models trained with few features yield satisfying performance when facing more features. [sent-312, score-0.69]
85 This seems to suggest that training with missing features yields more robust models, as it prevents the decision function from relying solely on a few specific features that might be corrupted. [sent-313, score-1.149]
86 In other words, training with missing features seems to achieve a similar goal as L∞ regularization [5]. [sent-314, score-0.849]
87 In fact, the results obtained with no missing features (1. [sent-317, score-0.775]
88 The regularization effect of missing training features could be related to noise injection techniques for regularization [21, 11]. [sent-322, score-0.887]
89 4 Conclusions This paper introduces Feature Set Embedding for the problem of classification with missing features. [sent-323, score-0.554]
90 Our approach deviates from the standard classification paradigm: instead of considering examples as feature vectors, we consider examples as sets of (feature, value) pairs which handle the missing feature problem more naturally. [sent-324, score-0.973]
91 In order to classify sets, we propose a new strategy relying on two levels of modeling. [sent-325, score-0.242]
92 At the first level, each (feature, value) pair is mapped onto an embedding space. [sent-326, score-0.365]
93 Our training algorithm then relies on stochastic gradient ascent to jointly learn the embedding space and the final linear decision function. [sent-328, score-0.518]
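To make the joint optimization concrete, here is a sketch of one stochastic gradient step on -log p(y | observed pairs) for the continuous embedding with mean pooling; max pooling and discrete features would change the backward pass. All names, shapes and the learning rate are assumptions, and a real implementation would more likely rely on automatic differentiation.

```python
import numpy as np

def sgd_step(pairs, y, params, lr=0.1):
    """One gradient step maximizing log p(y | observed pairs) for mean-pooled FSE."""
    La, Lb, W, V = params["La"], params["Lb"], params["W"], params["V"]
    la = La.shape[1]
    feats = np.array([f for f, _ in pairs])
    vals = np.array([v for _, v in pairs], dtype=float)
    # Forward pass: embed each pair, mean-pool, score, softmax.
    E = np.hstack([La[feats], vals[:, None] * Lb[feats]])   # (n, la+lb) pair descriptors
    P = E @ W.T                                             # (n, m) embedded pairs
    z = P.mean(axis=0)                                      # (m,) pooled representation
    s = V @ z                                               # (k,) class confidences
    prob = np.exp(s - s.max()); prob /= prob.sum()          # softmax over classes
    # Backward pass: gradients of the negative log-likelihood.
    ds = prob.copy(); ds[y] -= 1.0                          # d(-log prob[y]) / ds
    dV = np.outer(ds, z)
    dz = V.T @ ds
    dP = np.tile(dz / len(pairs), (len(pairs), 1))          # mean pooling spreads dz evenly
    dW = dP.T @ E
    dE = dP @ W
    # Gradient ascent on the log-likelihood = descent on the negative log-likelihood.
    V -= lr * dV
    W -= lr * dW
    for i, f in enumerate(feats):
        La[f] -= lr * dE[i, :la]
        Lb[f] -= lr * vals[i] * dE[i, la:]
    return -np.log(prob[y])
```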
94 First, sets are conceptually better suited than vectors for dealing with missing values. [sent-330, score-0.644]
95 Second, embedding (feature, value) pairs offers a flexible framework which easily allows encoding prior knowledge about the features. [sent-331, score-0.375]
96 From a broader perspective, the flexible feature embedding framework could go beyond the missing feature application. [sent-333, score-1.204]
97 the embedding vector of the temperature features in a weather prediction system could be computed from the locations of their sensors. [sent-336, score-0.547]
98 It also enables the design of a system in which new sensors are added without requiring full model re-training; in this case, the model could be quickly adapted by only updating the embedding vectors corresponding to the new sensors. [sent-337, score-0.402]
99 Also, our approach of relying on feature sets offers interesting opportunities for feature selection and adversarial feature deletion. [sent-338, score-0.603]
100 A second order cone programming formulation for classifying missing data. [sent-351, score-0.554]
wordName wordTfidf (topN-words)
[('missing', 0.554), ('embedding', 0.326), ('fse', 0.264), ('features', 0.221), ('imputation', 0.171), ('feature', 0.162), ('gmm', 0.126), ('lf', 0.126), ('train', 0.101), ('lfi', 0.1), ('strategy', 0.097), ('usps', 0.089), ('embedded', 0.088), ('incomplete', 0.088), ('classi', 0.084), ('structurally', 0.081), ('setups', 0.079), ('er', 0.078), ('fi', 0.076), ('geom', 0.075), ('vfi', 0.075), ('vlf', 0.075), ('training', 0.074), ('vi', 0.073), ('rl', 0.072), ('relying', 0.071), ('considers', 0.067), ('physics', 0.066), ('hepatitis', 0.066), ('validation', 0.063), ('deletion', 0.062), ('imputations', 0.061), ('setup', 0.059), ('shamir', 0.057), ('deleted', 0.054), ('sharing', 0.052), ('dekel', 0.052), ('fool', 0.05), ('hypo', 0.05), ('meanfeat', 0.05), ('facing', 0.049), ('knn', 0.049), ('cars', 0.049), ('instances', 0.049), ('mnist', 0.049), ('strategies', 0.049), ('test', 0.049), ('pairs', 0.049), ('alternatives', 0.048), ('discriminating', 0.046), ('sets', 0.046), ('rank', 0.044), ('dtrain', 0.044), ('mines', 0.044), ('dick', 0.044), ('sick', 0.044), ('vectors', 0.044), ('globerson', 0.043), ('decision', 0.042), ('validated', 0.041), ('classify', 0.041), ('uci', 0.041), ('relies', 0.041), ('informs', 0.04), ('orr', 0.04), ('handwritten', 0.04), ('mapped', 0.039), ('injection', 0.038), ('pima', 0.038), ('nec', 0.038), ('lv', 0.038), ('parameterizing', 0.038), ('impute', 0.038), ('predicts', 0.037), ('solely', 0.037), ('hyperparameters', 0.036), ('liao', 0.036), ('heitz', 0.036), ('originating', 0.036), ('elidan', 0.036), ('chechik', 0.036), ('car', 0.035), ('ascent', 0.035), ('reports', 0.035), ('contrary', 0.035), ('iain', 0.034), ('echo', 0.034), ('discrete', 0.034), ('subspace', 0.034), ('alternative', 0.034), ('generative', 0.033), ('levels', 0.033), ('degraded', 0.033), ('customers', 0.033), ('presence', 0.033), ('capacity', 0.033), ('table', 0.032), ('outputs', 0.032), ('sensors', 0.032), ('labs', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
2 0.24102053 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu
Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1
3 0.19803558 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.13333285 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
5 0.11060884 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging, problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
6 0.10368912 162 nips-2010-Link Discovery using Graph Feature Tracking
7 0.10070478 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
8 0.095354363 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
9 0.094823554 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
10 0.093620487 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
11 0.089552522 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.085986905 99 nips-2010-Gated Softmax Classification
13 0.08477769 280 nips-2010-Unsupervised Kernel Dimension Reduction
14 0.076792046 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
15 0.076317109 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
16 0.074721865 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
17 0.074327119 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
18 0.07417652 124 nips-2010-Inductive Regularized Learning of Kernel Functions
19 0.072981156 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
20 0.071204573 138 nips-2010-Large Margin Multi-Task Metric Learning
topicId topicWeight
[(0, 0.228), (1, 0.071), (2, -0.006), (3, -0.11), (4, 0.029), (5, -0.002), (6, 0.012), (7, -0.047), (8, -0.102), (9, -0.081), (10, 0.001), (11, 0.054), (12, 0.158), (13, 0.018), (14, 0.09), (15, -0.061), (16, -0.041), (17, -0.015), (18, -0.018), (19, -0.049), (20, -0.084), (21, 0.063), (22, 0.009), (23, -0.235), (24, 0.154), (25, 0.099), (26, 0.054), (27, 0.1), (28, -0.109), (29, -0.037), (30, 0.049), (31, -0.058), (32, -0.136), (33, 0.023), (34, 0.043), (35, -0.065), (36, 0.122), (37, 0.017), (38, 0.118), (39, 0.081), (40, -0.039), (41, -0.025), (42, -0.034), (43, -0.019), (44, 0.041), (45, -0.097), (46, 0.05), (47, -0.035), (48, 0.037), (49, 0.093)]
simIndex simValue paperId paperTitle
same-paper 1 0.95380604 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
2 0.82690978 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu
Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1
3 0.63409865 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.55569464 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
Author: Umar Syed, Ben Taskar
Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1
5 0.53739572 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
Author: Rob Fergus, George Williams, Ian Spiro, Christoph Bregler, Graham W. Taylor
Abstract: This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that realworld performance can be improved through the use of synthetic data. 1
6 0.53126574 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
7 0.51651818 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
8 0.51318496 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
9 0.49824932 162 nips-2010-Link Discovery using Graph Feature Tracking
10 0.4872562 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
11 0.48335084 99 nips-2010-Gated Softmax Classification
12 0.47991613 212 nips-2010-Predictive State Temporal Difference Learning
13 0.4774248 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
14 0.47470042 144 nips-2010-Learning Efficient Markov Networks
15 0.47267693 228 nips-2010-Reverse Multi-Label Learning
16 0.46903092 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
17 0.46070239 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
18 0.45929748 280 nips-2010-Unsupervised Kernel Dimension Reduction
19 0.45577615 284 nips-2010-Variational bounds for mixed-data factor analysis
20 0.43183559 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
topicId topicWeight
[(13, 0.113), (17, 0.027), (27, 0.067), (30, 0.049), (35, 0.014), (45, 0.241), (50, 0.06), (52, 0.037), (60, 0.032), (64, 0.179), (77, 0.035), (78, 0.021), (90, 0.06)]
simIndex simValue paperId paperTitle
1 0.91620189 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant. 1
same-paper 2 0.85930395 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1
3 0.83608979 262 nips-2010-Switched Latent Force Models for Movement Segmentation
Author: Mauricio Alvarez, Jan R. Peters, Neil D. Lawrence, Bernhard Schölkopf
Abstract: Latent force models encode the interaction between multiple related dynamical systems in the form of a kernel or covariance function. Each variable to be modeled is represented as the output of a differential equation and each differential equation is driven by a weighted sum of latent functions with uncertainty given by a Gaussian process prior. In this paper we consider employing the latent force model framework for the problem of determining robot motor primitives. To deal with discontinuities in the dynamical systems or the latent driving force we introduce an extension of the basic latent force model, that switches between different latent functions and potentially different dynamical systems. This creates a versatile representation for robot movements that can capture discrete changes and non-linearities in the dynamics. We give illustrative examples on both synthetic data and for striking movements recorded using a Barrett WAM robot as haptic input device. Our inspiration is robot motor primitives, but we expect our model to have wide application for dynamical systems including models for human motion capture data and systems biology. 1
4 0.83416808 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
5 0.83108103 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation. 1
6 0.82968754 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
8 0.82876247 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
9 0.82787412 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.82606328 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.82597858 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.82576889 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
13 0.82336825 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
14 0.82266104 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
15 0.82121402 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
16 0.82073265 261 nips-2010-Supervised Clustering
17 0.82066482 217 nips-2010-Probabilistic Multi-Task Feature Selection
18 0.82014602 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
19 0.8195076 182 nips-2010-New Adaptive Algorithms for Online Classification
20 0.81855547 63 nips-2010-Distributed Dual Averaging In Networks