jmlr jmlr2008 jmlr2008-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
Reference: text
sentIndex sentText sentNum sentScore
School of Computer Science and Engineering, Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel. Editor: Isabelle Guyon. Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. [sent-9, score-0.456]
2 This approach enables prediction of the quality of features without measuring their value on the training instances. [sent-12, score-0.621]
3 We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. [sent-13, score-0.667]
4 Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). [sent-16, score-0.498]
5 Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. [sent-17, score-0.629]
6 Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. [sent-20, score-0.456]
7 In feature extraction the original input features (for example, pixels) are used to generate new, more complicated features (for example logical AND of sets of 3 binary pixels). [sent-23, score-0.841]
8 In the most common selection paradigm an evaluation function is used to assign scores to subsets of features and a search algorithm is used to search for a subset with a high score. [sent-26, score-0.475]
One very common such method is Infogain (Quinlan, 1990), which ranks features according to the mutual information between each feature and the labels. [sent-34, score-0.454]
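As a rough illustration of this ranking criterion, the sketch below scores features by their empirical mutual information with the labels. The function names and the NumPy-based estimation are our choices, not code from the paper.

```python
import numpy as np

def infogain(feature, labels):
    """Empirical mutual information I(feature; label) for a discrete-valued feature."""
    mi = 0.0
    for f_val in np.unique(feature):
        p_f = np.mean(feature == f_val)
        for y_val in np.unique(labels):
            p_y = np.mean(labels == y_val)
            p_fy = np.mean((feature == f_val) & (labels == y_val))
            if p_fy > 0:
                mi += p_fy * np.log2(p_fy / (p_f * p_y))
    return mi

def rank_by_infogain(X, y):
    """Rank feature indices by mutual information with the labels (highest first)."""
    scores = np.array([infogain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-scores), scores
```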
In each round it measures the quality of the candidate features by training an SVM, and eliminates the features with the lowest weights. [sent-38, score-0.891]
11 Classical methods of feature selection tell us which features are better. [sent-41, score-0.545]
12 However, they do not tell us what characterizes these features or how to judge new features which were not measured in the training data. [sent-42, score-0.704]
13 The MF-PFS algorithm uses predicted quality to select new good features, while eliminating many low-quality features without measuring them. [sent-51, score-0.632]
14 In the context of object recognition there is an advantage in finding one set of features (referred to as a universal dictionary) that is sufficient for recognition of most kinds of objects. [sent-53, score-0.569]
15 Here we show what characterizes good universal features and demonstrate that their quality can be predicted accurately by their meta-features. [sent-56, score-0.579]
16 The ability to predict feature quality is also a very valuable tool for feature extraction, where the learner has to decide which potential complex features have a good chance of being the most useful. [sent-57, score-0.804]
17 For this task we derive a selection algorithm (called Mufasa) that uses meta-features to explore a huge number of candidate features efficiently. [sent-58, score-0.482]
quality of individual unseen features and how this ability can be combined with RFE. [sent-65, score-0.565]
19 (2003) used meta-features of words for text classification when there are features (words) that are unseen in the training set, but appear in the test set. [sent-84, score-0.433]
20 Generalization from observed (training) features to unobserved features is discussed by Krupka and Tishby (2008). [sent-87, score-0.656]
21 Other ideas using feature properties to produce or select good features can be found in the literature and have been employed in various applications. [sent-93, score-0.523]
22 Kadous and Sammut (2005) used property-based clustering of features for handwritten Chinese recognition and other applications. [sent-103, score-0.438]
23 The standard task of feature selection is to select a subset of features that enables good prediction of the label. [sent-109, score-0.652]
24 We can also consider the instances as abstract entities in space S and think of the features as measurements on the instances. [sent-111, score-0.4]
We also use F to denote a set of features and S_F^m to denote the training set restricted to F; that is, each instance is described only by the features in F. [sent-115, score-0.704]
26 The ability to predict the quality of features without measuring them is advantageous for many applications. [sent-129, score-0.572]
27 We can directly measure the quality (that is, usefulness) of these features using the training set. [sent-132, score-0.529]
28 Based on the quality of these features, our goal is to predict the quality of all features, including features that were not part of the training set. [sent-133, score-0.719]
We use a regression algorithm to learn a mapping from meta-feature values to quality: $\hat{Q} = \mathrm{regalg}(X_{MF}, Y_{MF})$. The quality can be based on any kind of standard evaluation function that uses the labeled training set to evaluate features (for example, Infogain or the square weights assigned by linear SVM). [sent-140, score-0.626]
30 Now we have a new supervised learning problem, with the original features as instances, the meta-features as features and YMF as the (continuous) target label. [sent-142, score-0.656]
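A minimal sketch of this learning step, assuming a matrix X_mf of meta-feature values (one row per measured feature) and a vector y_mf of measured qualities. Ridge regression stands in for regalg here; it is our choice, not necessarily the paper's.

```python
import numpy as np
from sklearn.linear_model import Ridge

def learn_quality_predictor(X_mf, y_mf):
    """Fit Q-hat: a map from meta-feature vectors to (continuous) feature quality."""
    model = Ridge(alpha=1.0)
    model.fit(X_mf, y_mf)
    return model

def predict_quality(model, X_mf_new):
    """Predict the quality of features that were never measured on the training instances."""
    return model.predict(X_mf_new)
```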
31 That is, the generalization ability to new features can be derived using standard generalization bounds for regression learning. [sent-147, score-0.421]
In SVM-RFE you start by training an SVM using all the features, then eliminate the ones with the smallest squared weights in the resulting linear classifier, and repeat the same procedure with the remaining features until the set of selected features is small enough. [sent-155, score-0.775]
33 The reason that features are eliminated iteratively and not in one step is that the weights given by SVM to a feature depend on the set of features that was used for training. [sent-156, score-0.809]
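For reference, here is a compact sketch of the standard SVM-RFE baseline described above; the elimination fraction and the use of scikit-learn's LinearSVC are our assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_select, drop_frac=0.5):
    """Iteratively train a linear SVM and drop the features with the smallest squared weights."""
    active = np.arange(X.shape[1])
    while len(active) > n_select:
        clf = LinearSVC(C=1.0).fit(X[:, active], y)
        scores = np.sum(clf.coef_ ** 2, axis=0)          # squared weight per surviving feature
        n_keep = max(n_select, int(len(active) * drop_frac))
        active = active[np.argsort(-scores)[:n_keep]]    # keep the highest-weight features
    return active
```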
34 The main idea is to run SVM on only a small (random) subset of the features, and then use the assigned weights for these features to predict the quality of all candidate features using their meta-features (Algorithm 1). [sent-162, score-0.907]
35 Thus, it is extremely valuable in a situation where there are a large number of candidate features and the cost of measuring each feature is very high. [sent-166, score-0.542]
while |F| > n: (a) Select a set F0 of αn random features out of F, measure them, and produce a training set of features S_{F0}^m. [sent-171, score-0.704]
(c) Use $\hat{Q}$ to estimate the quality of all the features in F. [sent-175, score-0.481]
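Putting these steps together, the following is a hedged sketch of the MF-PFS loop outlined in Algorithm 1: measure only a random subset of the surviving features, learn the meta-feature-to-quality map from their SVM-derived qualities, and eliminate low-predicted-quality features without measuring them. The parameter names and the measure_features callback are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

def mf_pfs(candidates, meta, measure_features, y, n_select, alpha=2.0, drop_frac=0.5):
    """candidates: indices of candidate features; meta: meta-feature matrix, one row per candidate.
    measure_features(idx) returns the training-set values of the requested features
    (shape [n_instances, len(idx)]); y are the training labels."""
    F = np.asarray(candidates)
    while len(F) > n_select:
        # (a) measure only a random subset of roughly alpha * n features
        k = min(len(F), int(alpha * n_select))
        subset = np.random.choice(F, size=k, replace=False)
        X_sub = measure_features(subset)
        # (b) quality of measured features = squared linear-SVM weights,
        #     then learn a map from meta-features to quality (ridge regression here)
        clf = LinearSVC(C=1.0).fit(X_sub, y)
        quality = np.sum(clf.coef_ ** 2, axis=0)
        q_hat = Ridge(alpha=1.0).fit(meta[subset], quality)
        # (c) predict the quality of every surviving candidate and drop the worst,
        #     without ever measuring the dropped features
        pred = q_hat.predict(meta[F])
        n_keep = max(n_select, int(len(F) * drop_frac))
        F = F[np.argsort(-pred)[:n_keep]]
    return F
```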
38 In Serre’s paper the features are selected randomly by choosing random patches from a data set of natural images (or the training set itself). [sent-214, score-0.562]
39 One method to improve the selected set of features is to use SVM-RFE to select the best features from a larger set of randomly selected candidate features. [sent-217, score-0.813]
40 In the following we show that by using these meta-features we can predict the quality of new features with high accuracy, and hence we can drop bad features without measuring their values on the images. [sent-232, score-0.9]
41 Alternatively, it can improve the classification accuracy since we are able to select from a large set of features in a reasonable training time (using feature selection by MF-PFS). [sent-234, score-0.628]
42 Based on this quality definition, Figure 2 presents an example of prototypes of “good” and “bad” features of different sizes. [sent-240, score-0.526]
43 In Figure 3 we explore the quality of features as a function of two meta-features: the patch size and the standard deviation (std). [sent-241, score-0.66]
44 This observation is notable since it explains the existence of a universal set of features (prototypes) that enables recognition of most objects, regardless of whether the prototypes were taken from pictures that contain the relevant objects or not. [sent-258, score-0.516]
(2005) found that a set of features (prototypes) taken randomly from natural images constitutes such a universal set; however, they did not characterize which features are good. [sent-260, score-0.773]
46 However, in their work the features are object fragments, and hence their quality is dependent on the specific training set and not on general statistical properties as we found in this work. [sent-268, score-0.596]
The features in this setting are very expensive to compute, and thus standard selection methods cannot consider many candidate features. [sent-272, score-0.453]
48 04 0 50 100 150 200 250 300 350 400 Number of top ranked features Figure 4: Predicting the quality of patches. [sent-330, score-0.481]
49 The first was a standard RFE which starts with all the N = 10n features and selects n features using 6 elimination steps (referred to as RFEall). [sent-362, score-0.656]
50 1n features that were selected randomly from the N = 10n features (referred to as RFEsmall). [sent-364, score-0.7]
51 As a baseline we also compared it to random selection of the n features as done in Serre et al. [sent-367, score-0.445]
When RFE measures the same number of features (RFEsmall), it needs to select twice as many features to achieve the same classification accuracy as MF-PFS. [sent-375, score-0.735]
53 Here we show how the low dimensional representation of features by a relatively small number of meta-features enables efficient selection even when the number of potential features is very large or even infinite and the evaluation function is expensive (for example, the wrapper model). [sent-380, score-0.815]
54 5 0 200 400 600 800 1000 1200 Number of selected feature (n) Figure 5: Applying SVM with different feature selection methods for object recognition. [sent-406, score-0.454]
55 When the number of selected features (n) is not too small, our meta-feature based selection (MF-PFS) achieves the same accuracy as RFEall which measures 5 times more features at training time. [sent-407, score-0.839]
Note that Mufasa does not use explicit prediction of the quality of unseen features as we did in MF-PFS, but it is clear that it cannot work unless the meta-features are informative about the quality. [sent-435, score-0.538]
57 Namely, Mufasa can only work if the likelihood of drawing a good set of features from p (v|u) is some continuous function of u; that is, a small change in u results in a small change in the chance of drawing a good set of features. [sent-436, score-0.396]
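A hedged sketch of a Mufasa-style guided search: hill-climb in the low-dimensional meta-feature space, at each step drawing a concrete feature set consistent with the current meta-feature values u and scoring it by cross-validation. The neighbor-proposal and scoring details are our assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def mufasa_like_search(u0, draw_features, propose_neighbor, y, n_steps=50):
    """u0: initial meta-feature values; draw_features(u) returns a feature matrix X drawn
    according to u; propose_neighbor(u) returns a small random perturbation of u."""
    u_best, q_best = u0, -np.inf
    u = u0
    for _ in range(n_steps):
        X = draw_features(u)
        # 2-fold CV of a linear SVM estimates the quality of the drawn feature set
        q = cross_val_score(LinearSVC(C=1.0), X, y, cv=2).mean()
        if q > q_best:
            u_best, q_best = u, q
        u = propose_neighbor(u_best)   # explore around the best meta-feature values so far
    return u_best, q_best
```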
58 For feature selection it is possible to cluster the features based on the similarity of meta-features, and then randomly select features per cluster. [sent-448, score-0.908]
We demonstrate the ability of Mufasa to efficiently select good features in the presence of a huge number of candidate (extracted) features on a handwritten digit recognition problem. [sent-453, score-0.955]
60 Here, however, we use a variant of this graph which replaces the number of selected features by the total cost of computing the selected features. [sent-489, score-0.416]
Thus a selection algorithm is restricted by a budget that the total cost of the selected set of features cannot exceed, rather than by an allowed number of selected features. [sent-501, score-0.62]
62 We chose specific features, given a value of meta-features, by re-drawing features randomly from a uniform distribution over the features that satisfied the given value of the meta-features until the full allowed budget was used up. [sent-505, score-0.769]
63 We used 2-fold cross validation of the linear multi-class SVM (Shalev-Shwartz and Singer, 2006; Crammer, 2003) to check the quality of the set of selected features in each step. [sent-506, score-0.525]
We first drew features randomly using a budget that was 50 times larger, then we sorted them by Infogain (Quinlan, 1990) normalized by the cost (that is, the value of Infogain divided by the cost of the feature). [sent-509, score-0.441]
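A small sketch of this cost-normalized ranking under a budget; scikit-learn's mutual_info_classif stands in for Infogain, and the greedy fill-until-budget loop is our reading of the procedure.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_within_budget(X, y, costs, budget):
    """Greedily pick features ranked by Infogain normalized by cost, until the budget is spent."""
    scores = mutual_info_classif(X, y)           # stands in for Infogain
    order = np.argsort(-(scores / costs))        # highest information-per-unit-cost first
    chosen, spent = [], 0.0
    for j in order:
        if spent + costs[j] <= budget:
            chosen.append(int(j))
            spent += costs[j]
    return chosen
```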
As a sanity check, we also compared the results to those obtained by doing 50 steps of choosing features randomly within the allowed budget; that is, over all possible values of the meta-features. [sent-519, score-0.441]
As shown in Figure 7b, when the budget is very limited, it is better to take more cheap features rather than fewer, more expensive shift-invariant features. [sent-532, score-0.485]
67 We assume that the classifier that is going to use the selected features is chosen from a hypothesis class Hc of real valued functions and the classification is made by taking the sign. [sent-553, score-0.407]
From Theorem 5 we know that $\mathrm{er}_D(h_c, F_i) \le \widehat{\mathrm{er}}_S^{\,\gamma}(h_c, F_i) + \sqrt{\frac{2}{m}\left(d \ln\frac{34em}{d}\log(578m) + \ln\frac{8}{\gamma\,\delta_{F_i}}\right)}$, where $\mathrm{er}_D(h_c, F_i)$ denotes the generalization error of the selected hypothesis for the fixed set of features $F_i$. [sent-616, score-0.591]
Nevertheless, it can select a good set of features out of $O(2^J)$ candidate sets. [sent-624, score-0.431]
70 2 VC-dimension of Joint Feature Selection and Classification In the previous section we presented an analysis which assumes that the selection of features is made using Mufasa. [sent-632, score-0.419]
h_s defines which features are selected as follows: f is selected ⇐⇒ h_s(u(f)) = 1, where, as usual, u(f) is the value of the meta-features for feature f. [sent-635, score-0.888]
Given the values of the meta-features of all the features together with h_s, we get a single feature selection hypothesis. [sent-636, score-0.776]
Since we are interested in selecting exactly n features (n is predefined), we use only the subset of H_s consisting of functions that imply the selection of n features. [sent-638, score-0.419]
Lemma 7: Let H_fs be a class of possible selection schemes for selecting n features out of N, and let H_c be a class of classifiers over R^n. [sent-646, score-0.419]
If d_c ≥ 11, then the VC-dim of the combined problem (that is, choosing (h_fs, h_c) ∈ H_fs × H_c) is bounded by (d_c + log|H_fs| + 1) log d_c. [sent-648, score-1.049]
Theorem 8: Let H_s be a class of mappings from the meta-feature space (R^k) to {0, 1}, let H_fs be the induced class of feature selection schemes for selecting n out of N features, and let H_c be a class of classifiers over R^n. [sent-654, score-0.599]
If d_c ≥ 11, then the VC-dim of the joint class H_fs × H_c is upper bounded as follows: $\mathrm{VCdim}(H_{fs} \times H_c) \le \left(d_c + d_s \log\frac{eN}{d_s} + 1\right)\log d_c$, where d_s is the VC-dim of H_s. [sent-656, score-1.243]
First, note that if we do not use meta-features, but consider all the possible ways to select n out of N features, the above bound is replaced by $\left(d_c + \log\binom{N}{n} + 1\right)\log d_c$, (1) which is very large for reasonable values of N and n. [sent-659, score-1.173]
Assuming that both H_s and H_c are classes of linear classifiers on R^k and R^n respectively, then d_s = k + 1 and d_c = n + 1, and we get that the VC-dimension of the combined problem of selection and classification is upper bounded by O((n + k log N) log n). [sent-661, score-0.603]
If H_c is a class of linear classifiers, but we allow any selection of n features, the bound is (by substituting in Equation 1): O((n + n log N) log n), which is much larger if k ≪ n. [sent-662, score-0.555]
Assuming that the meta-features are binary and H_s is the class of all possible functions from meta-features to {0, 1}, then d_s = 2^k and the bound is O((d_c + 2^k log N) log d_c), which is still much better than the bound in Equation 1 if k ≪ log n. [sent-666, score-0.935]
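To make the comparison explicit, the bounds discussed in this section can be collected in one place. This is a LaTeX restatement of the quantities above; the asymptotic forms follow from substituting d_s and d_c as stated.

```latex
% VC-dimension bounds for the joint selection + classification problem
% Unrestricted selection of n out of N features (Equation 1):
\left( d_c + \log\binom{N}{n} + 1 \right) \log d_c
% Meta-feature-based selection (Theorem 8), with d_s = \mathrm{VCdim}(H_s):
\left( d_c + d_s \log\frac{eN}{d_s} + 1 \right) \log d_c
% Linear H_s and H_c (d_s = k+1, d_c = n+1):
O\!\left( (n + k \log N) \log n \right)
% Binary meta-features with all functions H_s (d_s = 2^k):
O\!\left( (d_c + 2^{k} \log N) \log d_c \right)
```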
82 One might claim that in addition to the standard hard task of finding a good representation of the instances, now we also have to find a good representation of the features by meta-features. [sent-669, score-0.396]
83 It gives us a systematic way to incorporate prior knowledge about the features in the feature selection and extraction process. [sent-671, score-0.604]
84 The fact that the number of meta-features is typically significantly smaller than the number of features makes it easy to understand the results of the feature selection process. [sent-673, score-0.545]
85 It is easier to guess which properties might be indicative of feature quality than to guess which exact features are good. [sent-674, score-0.607]
86 The (x, y)-locations of features are good indicators of their quality, but features from similar positions tend to be redundant. [sent-699, score-0.69]
87 This approach can be used for predicting the quality of features without measuring them even on a single instance. [sent-704, score-0.535]
88 The results of random selection of features are also presented. [sent-710, score-0.419]
89 We suggest exploring for new good features by assessing features with meta-feature values similar to those of known good features. [sent-711, score-0.724]
90 The first algorithm is MF-PFS, which estimates the quality of individual features and obviates the need to calculate them on the instances. [sent-713, score-0.481]
91 Mufasa is very helpful in feature extraction, where the number of potential features is huge. [sent-716, score-0.454]
92 In the context of object recognition we showed that the feature (patch) quality can be predicted by its general statistical properties which are not dependent on the objects we are trying to recognize. [sent-718, score-0.443]
93 This result supports the existence of a universal set of features (universal-dictionary) that can be used for recognition of most objects. [sent-719, score-0.433]
We also showed that when the selection of features is based on meta-features, it is possible to derive better generalization bounds on the combined problem of selection and classification. [sent-721, score-0.654]
95 Our search for good features is computationally efficient and has good generalization properties because we do not examine each individual feature. [sent-723, score-0.458]
96 In addition to the applications presented here which involve predicting the quality of unseen features, the meta-features framework can also be used to improve estimation of the quality of features that we do see in the training set. [sent-735, score-0.739]
A Proof for Lemma 7. Lemma 7: Let H_fs be a class of possible selection schemes for selecting n features out of N, and let H_c be a class of classifiers over R^n. [sent-739, score-0.419]
If d_c ≥ 11, then the VC-dim of the combined problem (that is, choosing (h_fs, h_c) ∈ H_fs × H_c) is bounded by (d_c + log|H_fs| + 1) log d_c. [sent-741, score-1.049]
Proof: For a given set of selected features, the possible number of classifications of m instances is upper bounded by $(em/d_c)^{d_c}$ (see Kearns and Vazirani, 1994, pp. [sent-742, score-0.759]
The argument proceeds through a chain of inequalities (steps (2)–(6)) showing that this count is smaller than $2^m$ for m equal to the claimed bound; the last step uses the identity $a^{\log b} = b^{\log a}$ for all $a, b > 1$. Therefore, $(d_c + \log|H_{fs}| + 1)\log d_c$ is an upper bound on the VC-dim of the combined learning problem. [sent-745, score-1.174]
wordName wordTfidf (topN-words)
[('mufasa', 0.429), ('dc', 0.337), ('features', 0.328), ('hc', 0.242), ('navot', 0.206), ('hs', 0.173), ('mf', 0.162), ('patch', 0.155), ('quality', 0.153), ('elect', 0.151), ('serre', 0.139), ('ishby', 0.138), ('rupka', 0.138), ('feature', 0.126), ('budget', 0.113), ('patches', 0.106), ('infogain', 0.093), ('eatures', 0.092), ('selection', 0.091), ('rfeall', 0.081), ('rfesmall', 0.081), ('krupka', 0.079), ('erd', 0.079), ('svm', 0.077), ('ln', 0.075), ('featquality', 0.07), ('regalg', 0.07), ('shiftinvlen', 0.07), ('ymf', 0.07), ('recognition', 0.069), ('rfe', 0.069), ('hypotheses', 0.068), ('object', 0.067), ('pixels', 0.061), ('sm', 0.059), ('extraction', 0.059), ('metafeatures', 0.058), ('digit', 0.057), ('unseen', 0.057), ('im', 0.057), ('mappings', 0.054), ('measuring', 0.054), ('mnist', 0.053), ('log', 0.053), ('tishby', 0.053), ('inputs', 0.051), ('earning', 0.05), ('training', 0.048), ('xmf', 0.046), ('prototypes', 0.045), ('shift', 0.044), ('selected', 0.044), ('dictionary', 0.042), ('ds', 0.042), ('rk', 0.042), ('instances', 0.041), ('handwritten', 0.041), ('guided', 0.04), ('enables', 0.038), ('predict', 0.037), ('images', 0.036), ('universal', 0.036), ('simard', 0.035), ('select', 0.035), ('fbest', 0.035), ('hfs', 0.035), ('metafeature', 0.035), ('qbest', 0.035), ('ubest', 0.035), ('hypothesis', 0.035), ('candidate', 0.034), ('good', 0.034), ('classi', 0.034), ('generalization', 0.034), ('redundancy', 0.032), ('measurements', 0.031), ('prototype', 0.03), ('wrapper', 0.03), ('lecun', 0.03), ('bound', 0.03), ('std', 0.029), ('huge', 0.029), ('image', 0.029), ('guyon', 0.028), ('predicted', 0.028), ('scatter', 0.028), ('search', 0.028), ('combined', 0.027), ('lemma', 0.027), ('weights', 0.027), ('rand', 0.026), ('eyal', 0.026), ('gabor', 0.026), ('baseline', 0.026), ('fi', 0.026), ('fj', 0.025), ('message', 0.025), ('bounds', 0.025), ('deviation', 0.024), ('ath', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
2 0.1427746 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.10685157 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
4 0.072883487 87 jmlr-2008-Stationary Features and Cat Detection
Author: François Fleuret, Donald Geman
Abstract: Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classiÄ?Ĺš ers. The efÄ?Ĺš ciency of this strategy for a complex pose, that is, for Ä?Ĺš ne-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: (1) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; (2) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efÄ?Ĺš cient, one-shot learning of pose-speciÄ?Ĺš c classiÄ?Ĺš ers. To avoid expensive scene processing, we arrange these classiÄ?Ĺš ers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-Ä?Ĺš ne search, this allows for efÄ?Ĺš cient processing. The hierarchy is then â€?foldedâ€? for training: all the classiÄ?Ĺš ers at each level are derived from one base predictor learned from all the data. The hierarchy is â€?unfoldedâ€? for testing: parsing a scene amounts to examining increasingly Ä?Ĺš ner object descriptions only when there is sufÄ?Ĺš cient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes. Keywords: supervised learning, computer vision, image interpretation, cats, stationary features, hierarchical search
5 0.068164855 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
6 0.067920454 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
7 0.067686915 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
8 0.065229058 52 jmlr-2008-Learning from Multiple Sources
9 0.062202375 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
10 0.058759212 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
11 0.057240933 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
12 0.049436882 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.046805419 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
14 0.046599016 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
15 0.045931265 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
16 0.04466182 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
17 0.043554682 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
18 0.043287959 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
19 0.042558808 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
20 0.041991524 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
topicId topicWeight
[(0, 0.228), (1, -0.065), (2, 0.017), (3, -0.031), (4, -0.143), (5, 0.132), (6, 0.179), (7, -0.003), (8, 0.105), (9, 0.002), (10, -0.089), (11, 0.078), (12, 0.104), (13, 0.014), (14, -0.04), (15, 0.009), (16, -0.162), (17, 0.153), (18, -0.07), (19, 0.256), (20, 0.314), (21, -0.016), (22, 0.0), (23, 0.096), (24, -0.032), (25, -0.186), (26, -0.023), (27, -0.038), (28, -0.062), (29, 0.079), (30, 0.073), (31, 0.082), (32, 0.004), (33, 0.061), (34, 0.041), (35, 0.007), (36, -0.076), (37, 0.058), (38, 0.052), (39, -0.112), (40, -0.061), (41, 0.042), (42, -0.133), (43, -0.06), (44, -0.017), (45, 0.085), (46, 0.142), (47, -0.009), (48, 0.053), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.9660362 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
2 0.68860507 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.67892283 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
4 0.4567019 87 jmlr-2008-Stationary Features and Cat Detection
Author: François Fleuret, Donald Geman
Abstract: Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classiÄ?Ĺš ers. The efÄ?Ĺš ciency of this strategy for a complex pose, that is, for Ä?Ĺš ne-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: (1) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; (2) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efÄ?Ĺš cient, one-shot learning of pose-speciÄ?Ĺš c classiÄ?Ĺš ers. To avoid expensive scene processing, we arrange these classiÄ?Ĺš ers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-Ä?Ĺš ne search, this allows for efÄ?Ĺš cient processing. The hierarchy is then â€?foldedâ€? for training: all the classiÄ?Ĺš ers at each level are derived from one base predictor learned from all the data. The hierarchy is â€?unfoldedâ€? for testing: parsing a scene amounts to examining increasingly Ä?Ĺš ner object descriptions only when there is sufÄ?Ĺš cient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes. Keywords: supervised learning, computer vision, image interpretation, cats, stationary features, hierarchical search
5 0.43289343 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
6 0.34404859 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
7 0.33326924 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
8 0.32336989 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
9 0.31468087 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
10 0.31296778 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
11 0.29467753 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.29093903 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
13 0.27962494 52 jmlr-2008-Learning from Multiple Sources
14 0.24044599 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
15 0.23994322 22 jmlr-2008-Closed Sets for Labeled Data
16 0.23580837 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
17 0.22978856 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.21943106 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
20 0.21868151 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
topicId topicWeight
[(0, 0.02), (1, 0.01), (5, 0.021), (31, 0.013), (40, 0.562), (54, 0.031), (58, 0.022), (66, 0.035), (76, 0.025), (78, 0.012), (88, 0.072), (92, 0.046), (94, 0.042), (99, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.97105843 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
2 0.9487313 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
3 0.59085739 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
4 0.56919765 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
5 0.55873042 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
6 0.52677697 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
7 0.52258211 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
8 0.5157432 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
9 0.50060171 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
10 0.49854338 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.49243173 57 jmlr-2008-Manifold Learning: The Price of Normalization
12 0.49234447 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
13 0.48887324 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
14 0.48607966 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
15 0.48504052 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
16 0.47431195 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
17 0.47380215 22 jmlr-2008-Closed Sets for Labeled Data
18 0.46886271 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
19 0.46362096 86 jmlr-2008-SimpleMKL
20 0.46266744 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction