jmlr jmlr2008 jmlr2008-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pathways
Reference: text
sentIndex sentText sentNum sentScore
1 The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. [sent-12, score-0.694]
2 We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. [sent-14, score-0.651]
3 We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. [sent-15, score-0.277]
4 More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. [sent-19, score-1.423]
5 Keywords: max margin, missing features, network reconstruction, metabolic pathways. [sent-21, score-0.698]
6 Introduction: In the traditional formulation of supervised learning, data instances are viewed as vectors of features in some high-dimensional space. [sent-25, score-0.309]
7 However, in many real-world tasks, data instances have a complex pattern of missing features. [sent-26, score-0.527]
8 Sometimes features are missing due to measurement noise or corruption, but in other cases samples have varying subsets of observable features due to the inherent properties of the instances. [sent-27, score-0.842]
9 In visual object recognition, an object is often classified using a set of image patches corresponding to parts of the object (e. [sent-29, score-0.338]
10 , 2005) or predict networks of chemical reactions involving enzymes and other metabolic compounds (Kharchenko et al. [sent-36, score-0.608]
11 Common methods for classification with missing features assume that the features exist but their values are unknown. [sent-41, score-0.842]
12 First the values of the missing attributes are filled in during a preprocessing phase, followed by the application of a standard classifier to the completed data (Little and Rubin, 1987; Roth, 1994). [sent-43, score-0.46]
13 A second family of imputation methods builds probabilistic generative models of the features (with or without the label) using raw maximum likelihood or algorithms such as expectation maximization (EM) to find the most probable completion (Ghahramani and Jordan, 1994). [sent-46, score-0.319]
14 These methods have been shown to work very well for missing at random (MAR) data settings, because they assume that the missing features are generated by the same model that generates the observed features. [sent-48, score-1.111]
15 The distribution over missing features can also be approximated using multiple imputations, where several missing values are filled in. [sent-49, score-0.92]
16 More importantly, they may produce meaningless completions for features that are structurally absent, which will likely decrease classification performance. [sent-55, score-0.319]
17 The approach we propose here aims to classify data instances directly without filling hypothetical missing values, and is therefore different from imputation. [sent-67, score-0.574]
18 To formulate the classification problem we go back to the geometric intuitions underlying max-margin classification: We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace. [sent-69, score-0.482]
19 Although the intuitions underlying our method arise from cases with non-existing features, we show that the performance of our methods is not inferior to that of simple imputation methods, even when values are missing at random. [sent-73, score-0.588]
20 We then evaluate our approach on the task of predicting missing enzymes in genetic pathways and detecting automobiles in natural images. [sent-74, score-0.753]
21 In both of these domains, features may be inherently absent due to some or all of the mechanisms described above. [sent-75, score-0.263]
22 In this example, the margin of a sample that only has one feature (the x dimension) is measured both in the higher dimensional space (ρ2 ) and the lower one (ρ1 ). [sent-90, score-0.203]
23 Consider now learning such a classifier in the presence of missing data. [sent-102, score-0.46]
24 At first glance, it may appear that since the x’s only affect the optimization through inner products with w, missing features can merely be skipped (or equivalently, replaced with zeros), thereby preserving the values of the inner product. [sent-103, score-0.688]
25 The reason is illustrated in Figure 1, where a single sample in $\mathbb{R}^2$ has one existing and one missing feature. [sent-103, score-0.46]
26 Due to the missing feature, measuring the margin in the full space (ρ2) underestimates the correct geometric margin of the sample in the subspace (ρ1). [sent-106, score-0.899]
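To make the size of the underestimate concrete, here is a small hypothetical example (the numbers are for illustration only and are not taken from the paper). Take $w = (1, 2)$ and a positive sample whose only existing feature is $x_1 = 1$:
$$
\rho_2 = \frac{\langle w, (1, 0) \rangle}{\|w\|} = \frac{1}{\sqrt{5}} \approx 0.45,
\qquad
\rho_1 = \frac{w_1 x_1}{|w_1|} = 1 .
$$
Zero-filling the absent coordinate thus reports less than half of the sample's true margin in its own one-dimensional subspace.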
27 In the case of missing features, each instance resides in a subspace of the joint feature space; therefore the margin $\frac{y_i\, w^\top x_i}{\|w\|}$, which measures the distance to a hyperplane in the full joint space, is not well defined. [sent-110, score-0.887]
28 To address this problem we treat the margin of each instance in its own relevant subspace. [sent-111, score-0.205]
29 We define the instance margin ρ_i(w) for the ith instance as $\rho_i(w) = \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$ (2), where $w^{(i)}$ is a vector obtained by taking the entries of $w$ that are relevant for $x_i$, namely those for which the sample $x_i$ has valid features. [sent-112, score-0.449]
30 We now consider our new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$, and arrive at a new optimization problem for the missing-features case: $\max_w \min_i \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$ (3). [sent-114, score-0.472]
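A minimal numpy sketch of these two quantities (illustrative only; the variable names and the NaN convention for absent features are our own, not the paper's):

```python
import numpy as np

def instance_margins(w, X, y):
    """rho_i(w) of Equation (2): each margin is measured in the instance's own subspace."""
    margins = []
    for xi, yi in zip(X, y):
        valid = ~np.isnan(xi)                 # features that actually exist for this instance
        w_i = w[valid]                        # entries of w relevant to x_i
        margins.append(yi * np.dot(w_i, xi[valid]) / np.linalg.norm(w_i))
    return np.array(margins)

def geometric_margin(w, X, y):
    """rho(w) = min_i rho_i(w); Equation (3) maximizes this quantity over w."""
    return instance_margins(w, X, y).min()
```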
31 To solve this optimization problem, consider first the case where no features are missing, as in the standard SVM. [sent-115, score-0.228]
32 There, maximizing the margin $\rho_{full}(w)$, $\max_w \rho_{full}(w) = \max_w \min_i \frac{y_i\, w^\top x_i}{\|w\|}$ (4), is transformed into the quadratic programming problem of Equation (1) in two steps. [sent-116, score-0.445]
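For reference, the standard argument runs as follows (textbook SVM material, paraphrased rather than quoted from the paper): the objective is invariant to rescaling $w$, so the worst-case numerator can be fixed to one and the norm minimized,
$$
\min_i y_i\, w^\top x_i = 1
\quad\Longrightarrow\quad
\max_w \rho_{full}(w) \;\;\equiv\;\; \min_w \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 \;\; \forall i,
$$
which is exactly the quadratic program of Equation (1).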
33 Unfortunately, in the case of missing features, since different margin terms are normalized by different norms $\|w^{(i)}\|$ in Equation (3), the denominator can no longer be taken out of the minimization. [sent-122, score-0.712]
34 The next section discusses how the geometric margin of Equation (3) can be optimized. [sent-124, score-0.277]
35 Algorithms: We investigate three approaches for maximizing the geometric margin of Equation (3). [sent-126, score-0.277]
36 We keep the instance-specific numerators of the instance margin and approximate only the denominator $\|w^{(i)}\|$ by a common term that does not depend on the instance. [sent-180, score-0.249]
37 However, even with many missing features, the denominator will also be a good approximation of the denominator in Equation (4) if all the norm terms are equal. [sent-185, score-0.548]
38 Equation (8) therefore translates into $\min_{w,s} \frac{1}{2}\|w\|^2$ s.t. $\frac{1}{s_i}\, y_i\, w^{(i)} x_i \ge 1\ \forall i$, $s_i = \|w^{(i)}\| / \|w\|\ \forall i$. [sent-207, score-0.273]
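One operational reading of this formulation is an alternating scheme: fix the scale factors $s_i = \|w^{(i)}\|/\|w\|$ from the current $w$, solve the resulting standard QP, and repeat. The sketch below follows that reading; it assumes cvxpy and is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def iterative_absent_feature_svm(X, y, n_iters=10):
    """X: (n, d) array with NaN marking absent features; y: labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    valid = ~np.isnan(X)                  # which features exist for each instance
    X0 = np.nan_to_num(X)                 # absent entries become 0, so X0 @ w equals w^(i).x_i
    s = np.ones(n)                        # initial per-instance scale factors
    w_val = np.zeros(d)
    for _ in range(n_iters):
        w = cp.Variable(d)
        # constraints of Equation (9) with s fixed: y_i * w^(i).x_i >= s_i for every instance
        constraints = [cp.multiply(y, X0 @ w) >= s]
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        problem.solve()
        w_val = w.value
        # re-estimate s_i = ||w^(i)|| / ||w|| from the new w
        norms = np.array([np.linalg.norm(w_val[valid[i]]) for i in range(n)])
        s = norms / (np.linalg.norm(w_val) + 1e-12)
    return w_val
```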
39 Because the dual objective of Equation (10) depends only on the inner products of the data instances, we can consider the features to lie in a space of much higher dimension than is feasible in the primal formulation of Equation (9). [sent-265, score-0.242]
40 Importantly, kernels in this formulation operate over vectors with missing features, and we therefore have to develop kernels that handle this case correctly. [sent-275, score-0.623]
41 Fortunately, for some standard kernels, there is an easy procedure to construct a modified kernel that takes such missing values into account. [sent-276, score-0.46]
42 A kernel in this class can be easily transformed into a kernel that handles missing features, by considering only features that are valid for both samples, and skipping the rest. [sent-279, score-0.718]
43 This is not equivalent to ignoring missing features (as in filling with zeros), since the algorithm applied to Equation (11) using this kernel already corrects for the missing information by taking into account the $s_i$ terms. [sent-280, score-1.111]
44 For instance, for a polynomial kernel $K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d$, define the modified kernel $K'(x_i, x_j) = (\langle x_i, x_j \rangle_F + 1)^d$, with the inner product calculated over valid features: $\langle x_i, x_j \rangle_F = \sum_{k:\, f_k \in F_i \cap F_j} x_{ik}\, x_{jk}$. [sent-281, score-0.347]
45 It is straightforward to see that $K'$ is a kernel, since instances with missing values can be filled with zeros: formally, define $\hat{x} = h(x)$, where $h: \{\mathbb{R}, \text{NaN}\}^d \to \mathbb{R}^d$ replaces invalid entries (missing features) with zeros. [sent-282, score-0.595]
46 Then we have $K'(x_i, x_j) = K(\hat{x}_i, \hat{x}_j)$, simply because multiplying by zero is equivalent to skipping the missing values. [sent-283, score-0.527]
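A small sketch of this construction for the polynomial kernel (illustrative, not the paper's code; NaN marks absent features). The second function is the zero-filling view that makes the kernel property obvious, and the two functions return the same value:

```python
import numpy as np

def poly_kernel_missing(xi, xj, degree=3):
    """K'(x_i, x_j) = (<x_i, x_j>_F + 1)^d, summing only over features valid in both instances."""
    both_valid = ~np.isnan(xi) & ~np.isnan(xj)
    dot = float(np.dot(xi[both_valid], xj[both_valid]))
    return (dot + 1.0) ** degree

def poly_kernel_zero_filled(xi, xj, degree=3):
    """Equivalent form: apply the standard kernel to h(x), which replaces absent entries by zeros."""
    return (float(np.dot(np.nan_to_num(xi), np.nan_to_num(xj))) + 1.0) ** degree
```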
47 First, as a sanity check, we explore performance when features are missing at random. [sent-287, score-0.651]
48 Second, we study a visual object recognition application where some features are missing because they cannot be located in the image. [sent-289, score-0.78]
49 Finally, we apply our methods to a problem of biological network completion, where the missingness pattern of the features is determined by the known structure of the network. [sent-290, score-0.281]
50 For all applications, we compare the two variants of our method with five common approaches for filling missing features. [sent-291, score-0.46]
51 Additional features (“flags”) were added, explicitly denoting whether a feature is missing for a given instance. [sent-297, score-0.692]
52 To minimize the number of added flags, they were shared between features in the following way: all features that had the same pattern of missingness across samples were assigned a single flag. [sent-298, score-0.511]
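A rough sketch of this shared-flag construction (function and variable names are illustrative; zero-filling the feature values is our assumption for the sketch):

```python
import numpy as np

def add_shared_flags(X):
    """Append one indicator column per group of features sharing the same missingness pattern."""
    missing = np.isnan(X)
    groups = {}
    for j in range(X.shape[1]):
        groups.setdefault(missing[:, j].tobytes(), []).append(j)   # same pattern -> same group
    flags = np.column_stack([missing[:, cols[0]].astype(float)
                             for cols in groups.values()])
    return np.hstack([np.nan_to_num(X), flags])
```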
53 A Gaussian mixture model is learned by iterating between (1) learning a GMM model of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample. [sent-305, score-0.46]
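A hedged sketch of this refilling loop, assuming scikit-learn's GaussianMixture and column-mean initialization (the number of components and rounds are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(X, n_components=5, n_rounds=5):
    """Alternate between fitting a GMM on the filled data and re-filling the missing entries."""
    missing = np.isnan(X)
    X_fill = np.where(missing, np.nanmean(X, axis=0), X)     # start from column means
    for _ in range(n_rounds):
        gmm = GaussianMixture(n_components=n_components).fit(X_fill)
        resp = gmm.predict_proba(X_fill)                      # posterior cluster responsibilities
        weighted_means = resp @ gmm.means_                    # per-sample weighted cluster mean
        X_fill = np.where(missing, weighted_means, X)         # only missing entries are replaced
    return X_fill
```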
54 Missing at Random: As a first sanity check, we evaluate performance when feature values are missing at random. [sent-321, score-0.501]
55 In practice, features are often structured such that validity is determined at the level of groups of features, that is, all features in a group are either valid or invalid together. [sent-325, score-0.416]
56 This happens, for example, if a part of an object is missing in an image, since all its corresponding features are invalid. [sent-326, score-0.738]
57 We evaluated the performance in a setting where we can control the mechanism that governs missing features. [sent-329, score-0.46]
58 To make the effect of missing features prominent, we removed 90% of the features of each sample at random and applied the competing methods described above for each of the tasks. [sent-335, score-0.692]
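The removal scheme itself is simple; a sketch under the stated 90% setting (the seed and names are illustrative):

```python
import numpy as np

def remove_features_at_random(X, frac_missing=0.9, seed=0):
    """Mark a random fraction of each sample's feature values as missing (NaN)."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    n, d = X.shape
    k = int(round(frac_missing * d))
    for i in range(n):
        X_miss[i, rng.choice(d, size=k, replace=False)] = np.nan
    return X_miss
```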
59 Table 1: Mean accuracies (± standard errors over five random partitions, each separating the data into five cross-validation sets) for classification of UCI data sets with randomly missing values. [sent-438, score-0.46]
60 Visual Object Recognition: We now turn to evaluating our methods in an application where instances have structurally missing features. [sent-451, score-0.617]
61 We consider a visual object recognition task, where we attempt to determine if an object from a certain class (automobiles) is present in a given input image. [sent-452, score-0.216]
62 These patches typically include several candidates for any object part, and some images may have more candidates for a given part than others. [sent-458, score-0.2]
63 In this case, we consider the corresponding features to be structurally missing. [sent-460, score-0.281]
64 Our object model contains a set of “landmarks”, defining the outline of an object, for which we find several matches in a given image (see Elidan et al. [sent-466, score-0.212]
65 For each candidate, we compute the first 10 principal component coefficients of the image patch and concatenate these descriptors to form the image feature vector with 1900 features per image. [sent-474, score-0.345]
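An illustrative sketch of this feature construction (the slot layout, helper names, and the use of scikit-learn's PCA are assumptions, not the authors' pipeline); landmark candidates that are never matched leave their 10-coefficient block structurally absent:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed setup: a PCA fitted on training patches, e.g. pca = PCA(n_components=10).fit(train_patches)

def image_feature_vector(candidate_patches, pca, n_slots, n_coeffs=10):
    """candidate_patches: dict slot_index -> flattened patch; unmatched slots keep NaN blocks."""
    vec = np.full(n_slots * n_coeffs, np.nan)        # 1900 features when n_slots = 190
    for slot, patch in candidate_patches.items():
        coeffs = pca.transform(patch[None, :])[0, :n_coeffs]
        vec[slot * n_coeffs:(slot + 1) * n_coeffs] = coeffs
    return vec
```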
66 Figure 5: Classification accuracy of the different methods (geom, avg, |w|, zero, mean, EM, flag, kNN) for object recognition in natural images. [sent-481, score-0.323]
67 Because we only include “good” matches in the feature vector, we would expect a classifier learned using our method to outperform methods that do not handle these missing features as well. [sent-489, score-0.781]
68 In the first case, all landmark matches seem to be good, while in the second, all matches appear to be incorrect. [sent-492, score-0.255]
69 This analysis provides strong evidence that our method is leveraging a proper handling of missing features to obtain a boost in performance for this application. [sent-495, score-0.694]
70 (a) and (b) Five of the candidate matches miss the car, leading to irrelevant features for this instance. [sent-501, score-0.316]
71 (c) All candidate matches successfully match the car, making all features valid. [sent-503, score-0.316]
72 Metabolic Pathway Reconstruction: As a final application, we consider the problem of predicting missing enzymes in metabolic pathways, a long-standing and important challenge in computational biology (Vert and Yamanishi, 2004; Kharchenko et al. [sent-511, score-0.839]
73 Instances in this task have missing features due to the structure of the biochemical network. [sent-513, score-0.651]
74 Cells use a complex network of chemical reactions to produce their chemical building blocks (Figure 7). [sent-515, score-0.243]
75 For example, in Figure 7 the enzyme ARO7 catalyzes a reaction that turns chorismate into prephenate. [sent-517, score-0.237]
76 For many reactions, the enzyme responsible for their catalysis is unknown, making it an important computational task to predict the identity of such missing enzymes. [sent-518, score-0.646]
77 , ARO7), but in some cases these enzymes are unknown. [sent-525, score-0.2]
78 As a result, neighboring enzyme pairs have non-trivial correlations over their features that reflect their functional relations. [sent-527, score-0.377]
79 For example, an enzyme in a linear chain depends on the outputs of the preceding reaction as its own inputs. [sent-529, score-0.237]
80 Hence, the genes that code for these enzymes are expected to be co-expressed (Kharchenko et al. [sent-530, score-0.2]
81 On the other hand, enzymes in forking motifs (using a shared input compound, but creating different outputs) often have anti-correlated expression profiles (Ihmels et al. [sent-533, score-0.2]
82 Each enzyme is represented using a vector of features that measure its relatedness to each of its different neighbors, across different data types. [sent-536, score-0.416]
83 A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors. [sent-537, score-0.811]
84 For example, the enzyme PHA2, which transforms prephenate to phenylpyruvate in Figure 7, does not have a neighbor of type fork, and therefore all features assigned to such a neighbor are absent in the representation of the reaction “Prephenate → Phenylpyruvate”. [sent-538, score-0.534]
85 We used the metabolic network of budding yeast (S. cerevisiae). [sent-539, score-0.213]
86 We used three types of data for enzyme attributes: a compendium of gene expression assays, the protein domain content of enzymes, and the cellular localization of proteins. [sent-543, score-0.29]
87 All similarity features were concatenated, yielding a feature vector of length ∼ 3900. [sent-546, score-0.232]
88 As explained above, features in this vector are often missing since some enzymes do not have all types of neighbors. [sent-547, score-0.851]
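An illustrative sketch of this representation (the neighbor-type names and block layout are assumptions made for the example, not taken from the paper): one block of similarity features per neighbor type, with a missing neighbor type leaving its entire block structurally absent.

```python
import numpy as np

NEIGHBOR_TYPES = ["chain_predecessor", "chain_successor", "fork", "funnel"]   # assumed labels

def enzyme_feature_vector(neighbor_similarities, block_size):
    """neighbor_similarities: dict neighbor_type -> 1-D block of similarity features."""
    blocks = []
    for t in NEIGHBOR_TYPES:
        if t in neighbor_similarities:
            blocks.append(np.asarray(neighbor_similarities[t], dtype=float))
        else:
            blocks.append(np.full(block_size, np.nan))   # structurally absent block
    return np.concatenate(blocks)
```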
89 We created positive examples from the reactions with known enzymes (∼520 reactions), and negative examples by plugging random impostor genes into each neighborhood. [sent-548, score-0.329]
90 The geometric margin approach achieves significantly better performance in this task. [sent-553, score-0.277]
91 Discussion: We presented a novel method for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain. [sent-558, score-1.111]
92 Instead of completing missing features in a preprocessing phase, we developed a max-margin learning objective based on a geometric interpretation of the margin when different instances essentially lie in different spaces. [sent-559, score-0.995]
93 The classification task is to identify if a candidate enzyme is in the right “neighborhood”. [sent-570, score-0.222]
94 which approximates the instance-specific geometric margins, did not perform better than simple imputation methods. [sent-572, score-0.243]
95 The standard treatment of missing features is based on the concept that missing features exist but are unobserved due to noise or insufficient measurements. [sent-577, score-1.302]
96 While imputation methods may be impractical in such a setting, approaches based on skipping unknown entries could reduce the computation costs significantly, by operating on known entries only. [sent-585, score-0.263]
97 Interestingly, the problem of classifying with missing features is related to another problem, where individual reliability measures are available for features at each instance separately. [sent-586, score-0.885]
98 This problem can be viewed in the same framework described in this paper: the geometric margin must be defined separately for each instance, since the different noise levels distort the relative scale of the coordinates of each instance. [sent-589, score-0.363]
99 Relative to this setting, the completely missing and fully valid features discussed in this paper are extreme points on the spectrum of reliability. [sent-590, score-0.651]
100 Second order cone programming approaches for handling missing and uncertain data. [sent-735, score-0.558]
wordName wordTfidf (topN-words)
[('missing', 0.46), ('enzymes', 0.2), ('features', 0.191), ('enzyme', 0.186), ('metabolic', 0.179), ('bbeel', 0.168), ('bsent', 0.168), ('eitz', 0.168), ('hechik', 0.168), ('margin', 0.162), ('lidan', 0.143), ('reactions', 0.129), ('imputation', 0.128), ('wxi', 0.118), ('geometric', 0.115), ('lassification', 0.102), ('eatures', 0.102), ('geom', 0.101), ('kharchenko', 0.101), ('stanford', 0.096), ('ax', 0.092), ('structurally', 0.09), ('missingness', 0.09), ('matches', 0.089), ('object', 0.087), ('socp', 0.086), ('knn', 0.084), ('si', 0.079), ('gal', 0.077), ('landmark', 0.077), ('absent', 0.072), ('lling', 0.072), ('ags', 0.071), ('elidan', 0.07), ('koller', 0.069), ('separable', 0.068), ('skipping', 0.067), ('instances', 0.067), ('yi', 0.063), ('equation', 0.061), ('car', 0.059), ('avg', 0.059), ('pathways', 0.059), ('chemical', 0.057), ('kernels', 0.056), ('lled', 0.056), ('cone', 0.055), ('classi', 0.052), ('xi', 0.052), ('formulation', 0.051), ('localization', 0.051), ('reaction', 0.051), ('vert', 0.051), ('ull', 0.051), ('yamanishi', 0.051), ('mnist', 0.048), ('hypothetical', 0.047), ('norms', 0.046), ('qp', 0.046), ('denominator', 0.044), ('instance', 0.043), ('heitz', 0.043), ('chechik', 0.043), ('berg', 0.043), ('compounds', 0.043), ('grauman', 0.043), ('handling', 0.043), ('em', 0.043), ('recognition', 0.042), ('margins', 0.041), ('patch', 0.041), ('patches', 0.041), ('feature', 0.041), ('cellular', 0.039), ('across', 0.039), ('meaningless', 0.038), ('iterative', 0.038), ('optimization', 0.037), ('fi', 0.037), ('mini', 0.037), ('convex', 0.037), ('candidate', 0.036), ('image', 0.036), ('candidates', 0.036), ('entries', 0.034), ('automobiles', 0.034), ('budding', 0.034), ('flag', 0.034), ('forks', 0.034), ('forster', 0.034), ('funnels', 0.034), ('geremy', 0.034), ('huh', 0.034), ('hulo', 0.034), ('ihmels', 0.034), ('invalid', 0.034), ('jaimovich', 0.034), ('neville', 0.034), ('nonexisting', 0.034), ('phenylpyruvate', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
2 0.15268147 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.1427746 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
5 0.076987863 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
Author: Giorgio Corani, Marco Zaffalon
Abstract: In this paper, the naive credal classifier, which is a set-valued counterpart of naive Bayes, is extended to a general and flexible treatment of incomplete data, yielding a new classifier called naive credal classifier 2 (NCC2). The new classifier delivers classifications that are reliable even in the presence of small sample sizes and missing values. Extensive empirical evaluations show that, by issuing set-valued classifications, NCC2 is able to isolate and properly deal with instances that are hard to classify (on which naive Bayes accuracy drops considerably), and to perform as well as naive Bayes on the other instances. The experiments point to a general problem: they show that with missing values, empirical evaluations may not reliably estimate the accuracy of a traditional classifier, such as naive Bayes. This phenomenon adds even more value to the robust approach to classification implemented by NCC2. Keywords: naive Bayes, naive credal classifier, imprecise probabilities, missing values, conservative inference rule, missing at random
6 0.07389532 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.067713514 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.067274235 87 jmlr-2008-Stationary Features and Cat Detection
9 0.064627565 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.064010836 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
11 0.058678299 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
12 0.057238258 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.056517143 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
14 0.053408086 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.053267468 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
16 0.052437175 96 jmlr-2008-Visualizing Data using t-SNE
17 0.052129768 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.047369815 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.046495724 52 jmlr-2008-Learning from Multiple Sources
20 0.046454579 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
topicId topicWeight
[(0, 0.27), (1, -0.074), (2, 0.107), (3, 0.041), (4, -0.23), (5, 0.167), (6, 0.264), (7, 0.108), (8, 0.0), (9, 0.041), (10, 0.001), (11, -0.125), (12, 0.142), (13, 0.011), (14, 0.063), (15, -0.01), (16, -0.148), (17, 0.079), (18, -0.048), (19, 0.163), (20, 0.133), (21, 0.024), (22, 0.047), (23, 0.015), (24, 0.098), (25, -0.087), (26, -0.023), (27, -0.092), (28, 0.069), (29, 0.018), (30, -0.03), (31, -0.044), (32, -0.008), (33, 0.073), (34, -0.019), (35, -0.032), (36, -0.032), (37, -0.026), (38, -0.168), (39, 0.042), (40, -0.015), (41, 0.056), (42, -0.017), (43, -0.021), (44, -0.092), (45, -0.025), (46, 0.003), (47, -0.169), (48, 0.045), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.95120662 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
2 0.76229042 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.68298352 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
5 0.40761581 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
6 0.38194221 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
7 0.37949568 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
8 0.35295931 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
9 0.34222648 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.314881 87 jmlr-2008-Stationary Features and Cat Detection
11 0.2905612 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
12 0.28603399 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
13 0.28222212 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
14 0.27237988 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
15 0.26415247 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
16 0.25908405 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.25699142 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
18 0.25129408 86 jmlr-2008-SimpleMKL
19 0.24568723 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
20 0.24374723 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
topicId topicWeight
[(0, 0.032), (1, 0.013), (5, 0.023), (21, 0.316), (31, 0.015), (40, 0.078), (54, 0.045), (58, 0.04), (66, 0.072), (76, 0.038), (78, 0.015), (86, 0.01), (88, 0.08), (92, 0.034), (94, 0.056), (99, 0.038)]
simIndex simValue paperId paperTitle
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
same-paper 2 0.76768941 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.44193935 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.43897742 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.43366107 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
6 0.42653549 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
7 0.42588094 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.42567375 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
9 0.4249008 86 jmlr-2008-SimpleMKL
10 0.42471454 56 jmlr-2008-Magic Moments for Structured Output Prediction
11 0.42347869 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
12 0.42295113 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
13 0.42157382 83 jmlr-2008-Robust Submodular Observation Selection
14 0.41536725 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.41515911 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
16 0.41504619 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.41161382 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.40797976 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
19 0.40752026 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
20 0.40612066 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes