jmlr jmlr2008 jmlr2008-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pathways
Reference: text
sentIndex sentText sentNum sentScore
1 The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. [sent-12, score-0.694]
2 We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. [sent-14, score-0.651]
3 We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. [sent-15, score-0.277]
4 More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. [sent-19, score-1.423]
5 Keywords: max margin, missing features, network reconstruction, metabolic pathways. [sent-21, score-0.698]
6 Introduction: In the traditional formulation of supervised learning, data instances are viewed as vectors of features in some high-dimensional space. [sent-25, score-0.309]
7 However, in many real-world tasks, data instances have a complex pattern of missing features. [sent-26, score-0.527]
8 Sometimes features are missing due to measurement noise or corruption, but in other cases samples have varying subsets of observable features due to the inherent properties of the instances. [sent-27, score-0.842]
9 In visual object recognition, an object is often classified using a set of image patches corresponding to parts of the object (e. [sent-29, score-0.338]
10 , 2005) or predict networks of chemical reactions involving enzymes and other metabolic compounds (Kharchenko et al. [sent-36, score-0.608]
11 Common methods for classification with missing features assume that the features exist but their values are unknown. [sent-41, score-0.842]
12 First the values of the missing attributes are filled in during a preprocessing phase, followed by the application of a standard classifier to the completed data (Little and Rubin, 1987; Roth, 1994). [sent-43, score-0.46]
13 A second family of imputation methods builds probabilistic generative models of the features (with or without the label) using raw maximum likelihood or algorithms such as expectation maximization (EM) to find the most probable completion (Ghahramani and Jordan, 1994). [sent-46, score-0.319]
14 These methods have been shown to work very well for missing at random (MAR) data settings, because they assume that the missing features are generated by the same model that generates the observed features. [sent-48, score-1.111]
15 The distribution over missing features can also be approximated using multiple imputations, where several missing values are filled in. [sent-49, score-0.92]
16 More importantly, they may produce meaningless completions for features that are structurally absent, which will likely decrease classification performance. [sent-55, score-0.319]
17 The approach we propose here aims to classify data instances directly without filling hypothetical missing values, and is therefore different from imputation. [sent-67, score-0.574]
18 To formulate the classification problem we go back to the geometric intuitions underlying max-margin classification: We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace. [sent-69, score-0.482]
19 Although the intuitions underlying our method arise from cases with non-existing features, we show that the performance of our methods is not inferior to that of simple imputation methods, even when values are missing at random. [sent-73, score-0.588]
20 We then evaluate our approach on the task of predicting missing enzymes in genetic pathways and detecting automobiles in natural images. [sent-74, score-0.753]
21 In both of these domains, features may be inherently absent due to some or all of the mechanisms described above. [sent-75, score-0.263]
22 In this example, the margin of a sample that only has one feature (the x dimension) is measured both in the higher dimensional space (ρ2 ) and the lower one (ρ1 ). [sent-90, score-0.203]
23 Consider now learning such a classifier in the presence of missing data. [sent-102, score-0.46]
24 At first glance, it may appear that since the x’s only affect the optimization through inner products with w, missing features can merely be skipped (or equivalently, replaced with zeros), thereby preserving the values of the inner product. [sent-103, score-0.688]
25 The reason is illustrated in Figure 1, where a single sample in $\mathbb{R}^2$ has one existing and one missing feature. [sent-103, score-0.46]
26 Due to the missing feature, measuring the margin in the full space (ρ2) underestimates the correct geometric margin of the sample in the subspace (ρ1). [sent-106, score-0.899]
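To make the size of the underestimate concrete, here is a small hypothetical example (the numbers are for illustration only and are not taken from the paper). Take $w = (1, 2)$ and a positive sample whose only existing feature is $x_1 = 1$:
$$
\rho_2 = \frac{\langle w, (1, 0) \rangle}{\|w\|} = \frac{1}{\sqrt{5}} \approx 0.45,
\qquad
\rho_1 = \frac{w_1 x_1}{|w_1|} = 1 .
$$
Zero-filling the absent coordinate thus reports less than half of the sample's true margin in its own one-dimensional subspace.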
27 In the case of missing features, each instance resides in a subspace of the joint feature space; therefore the margin $\frac{y_i\, w^\top x_i}{\|w\|}$, which measures the distance to a hyperplane in the full joint space, is not well defined. [sent-110, score-0.887]
28 To address this problem we treat the margin of each instance in its own relevant subspace. [sent-111, score-0.205]
29 We define the instance margin ρ_i(w) for the ith instance as $\rho_i(w) = \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$ (2), where $w^{(i)}$ is a vector obtained by taking the entries of $w$ that are relevant for $x_i$, namely those for which the sample $x_i$ has valid features. [sent-112, score-0.449]
30 We now consider our new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$, and arrive at a new optimization problem for the missing-features case: $\max_w \min_i \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$ (3). [sent-114, score-0.472]
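A minimal numpy sketch of these two quantities (illustrative only; the variable names and the NaN convention for absent features are our own, not the paper's):

```python
import numpy as np

def instance_margins(w, X, y):
    """rho_i(w) of Equation (2): each margin is measured in the instance's own subspace."""
    margins = []
    for xi, yi in zip(X, y):
        valid = ~np.isnan(xi)                 # features that actually exist for this instance
        w_i = w[valid]                        # entries of w relevant to x_i
        margins.append(yi * np.dot(w_i, xi[valid]) / np.linalg.norm(w_i))
    return np.array(margins)

def geometric_margin(w, X, y):
    """rho(w) = min_i rho_i(w); Equation (3) maximizes this quantity over w."""
    return instance_margins(w, X, y).min()
```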
31 To solve this optimization problem, consider first the case where no features are missing, as in the standard SVM. [sent-115, score-0.228]
32 There, maximizing the margin $\rho_{full}(w)$, $\max_w \rho_{full}(w) = \max_w \min_i \frac{y_i\, w^\top x_i}{\|w\|}$ (4), is transformed into the quadratic programming problem of Equation (1) in two steps. [sent-116, score-0.445]
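For reference, the standard argument runs as follows (textbook SVM material, paraphrased rather than quoted from the paper): the objective is invariant to rescaling $w$, so the worst-case numerator can be fixed to one and the norm minimized,
$$
\min_i y_i\, w^\top x_i = 1
\quad\Longrightarrow\quad
\max_w \rho_{full}(w) \;\;\equiv\;\; \min_w \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 \;\; \forall i,
$$
which is exactly the quadratic program of Equation (1).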
33 Unfortunately, in the case of missing features, since different margin terms are normalized by different norms $\|w^{(i)}\|$ in Equation (3), the denominator can no longer be taken out of the minimization. [sent-122, score-0.712]
34 The next section discusses how the geometric margin of Equation (3) can be optimized. [sent-124, score-0.277]
35 Algorithms: We investigate three approaches for maximizing the geometric margin of Equation (3). [sent-126, score-0.277]
36 We keep the instance-specific numerators of the instance margin and approximate only the denominator $\|w^{(i)}\|$ by a common term that does not depend on the instance. [sent-180, score-0.249]
37 However, even with many missing features, the denominator will also be a good approximation of the denominator in Equation (4) if all the norm terms are equal. [sent-185, score-0.548]
38 Equation (8) therefore translates into $\min_{w,s} \frac{1}{2}\|w\|^2$ s.t. $\frac{1}{s_i}\, y_i\, w^{(i)} x_i \ge 1\ \forall i$, $s_i = \|w^{(i)}\| / \|w\|\ \forall i$. [sent-207, score-0.273]
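One operational reading of this formulation is an alternating scheme: fix the scale factors $s_i = \|w^{(i)}\|/\|w\|$ from the current $w$, solve the resulting standard QP, and repeat. The sketch below follows that reading; it assumes cvxpy and is an illustration of the idea, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def iterative_absent_feature_svm(X, y, n_iters=10):
    """X: (n, d) array with NaN marking absent features; y: labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    valid = ~np.isnan(X)                  # which features exist for each instance
    X0 = np.nan_to_num(X)                 # absent entries become 0, so X0 @ w equals w^(i).x_i
    s = np.ones(n)                        # initial per-instance scale factors
    w_val = np.zeros(d)
    for _ in range(n_iters):
        w = cp.Variable(d)
        # constraints of Equation (9) with s fixed: y_i * w^(i).x_i >= s_i for every instance
        constraints = [cp.multiply(y, X0 @ w) >= s]
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        problem.solve()
        w_val = w.value
        # re-estimate s_i = ||w^(i)|| / ||w|| from the new w
        norms = np.array([np.linalg.norm(w_val[valid[i]]) for i in range(n)])
        s = norms / (np.linalg.norm(w_val) + 1e-12)
    return w_val
```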
39 Because the dual objective of Equation (10) depends only on the inner products of the data instances, we can consider the features to lie in a space of much higher dimension than is feasible in the primal formulation of Equation (9). [sent-265, score-0.242]
40 Importantly, kernels in this formulation operate over vectors with missing features, and we therefore have to develop kernels that handle this case correctly. [sent-275, score-0.623]
41 Fortunately, for some standard kernels, there is an easy procedure to construct a modified kernel that takes such missing values into account. [sent-276, score-0.46]
42 A kernel in this class can be easily transformed into a kernel that handles missing features, by considering only features that are valid for both samples, and skipping the rest. [sent-279, score-0.718]
43 This is not equivalent to ignoring missing features (as in filling with zeros), since the algorithm applied to Equation (11) using this kernel already corrects for the missing information by taking into account the $s_i$ terms. [sent-280, score-1.111]
44 For instance, for a polynomial kernel $K(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^d$, define the modified kernel $K'(x_i, x_j) = (\langle x_i, x_j \rangle_F + 1)^d$, with the inner product calculated over valid features: $\langle x_i, x_j \rangle_F = \sum_{k:\, f_k \in F_i \cap F_j} x_{ik}\, x_{jk}$. [sent-281, score-0.347]
45 It is straightforward to see that $K'$ is a kernel, since instances with missing values can be filled with zeros: formally, define $\hat{x} = h(x)$, where $h: \{\mathbb{R}, \text{NaN}\}^d \to \mathbb{R}^d$ replaces invalid entries (missing features) with zeros. [sent-282, score-0.595]
46 Then we have $K'(x_i, x_j) = K(\hat{x}_i, \hat{x}_j)$, simply because multiplying by zero is equivalent to skipping the missing values. [sent-283, score-0.527]
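A small sketch of this construction for the polynomial kernel (illustrative, not the paper's code; NaN marks absent features). The second function is the zero-filling view that makes the kernel property obvious, and the two functions return the same value:

```python
import numpy as np

def poly_kernel_missing(xi, xj, degree=3):
    """K'(x_i, x_j) = (<x_i, x_j>_F + 1)^d, summing only over features valid in both instances."""
    both_valid = ~np.isnan(xi) & ~np.isnan(xj)
    dot = float(np.dot(xi[both_valid], xj[both_valid]))
    return (dot + 1.0) ** degree

def poly_kernel_zero_filled(xi, xj, degree=3):
    """Equivalent form: apply the standard kernel to h(x), which replaces absent entries by zeros."""
    return (float(np.dot(np.nan_to_num(xi), np.nan_to_num(xj))) + 1.0) ** degree
```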
47 First, as a sanity check, we explore performance when features are missing at random. [sent-287, score-0.651]
48 Second, we study a visual object recognition application where some features are missing because they cannot be located in the image. [sent-289, score-0.78]
49 Finally, we apply our methods to a problem of biological network completion, where the missingness pattern of the features is determined by the known structure of the network. [sent-290, score-0.281]
50 For all applications, we compare the two variants of our method with five common approaches for filling missing features. [sent-291, score-0.46]
51 Additional features (“flags”) were added, explicitly denoting whether a feature is missing for a given instance. [sent-297, score-0.692]
52 To minimize the number of added flags, they were shared between features in the following way: all features that had the same pattern of missingness across samples were assigned a single flag. [sent-298, score-0.511]
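A rough sketch of this shared-flag construction (function and variable names are illustrative; zero-filling the feature values is our assumption for the sketch):

```python
import numpy as np

def add_shared_flags(X):
    """Append one indicator column per group of features sharing the same missingness pattern."""
    missing = np.isnan(X)
    groups = {}
    for j in range(X.shape[1]):
        groups.setdefault(missing[:, j].tobytes(), []).append(j)   # same pattern -> same group
    flags = np.column_stack([missing[:, cols[0]].astype(float)
                             for cols in groups.values()])
    return np.hstack([np.nan_to_num(X), flags])
```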
53 A Gaussian mixture model is learned by iterating between (1) learning a GMM model of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample. [sent-305, score-0.46]
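A hedged sketch of this refilling loop, assuming scikit-learn's GaussianMixture and column-mean initialization (the number of components and rounds are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(X, n_components=5, n_rounds=5):
    """Alternate between fitting a GMM on the filled data and re-filling the missing entries."""
    missing = np.isnan(X)
    X_fill = np.where(missing, np.nanmean(X, axis=0), X)     # start from column means
    for _ in range(n_rounds):
        gmm = GaussianMixture(n_components=n_components).fit(X_fill)
        resp = gmm.predict_proba(X_fill)                      # posterior cluster responsibilities
        weighted_means = resp @ gmm.means_                    # per-sample weighted cluster mean
        X_fill = np.where(missing, weighted_means, X)         # only missing entries are replaced
    return X_fill
```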
54 Missing at Random: As a first sanity check, we evaluate performance when feature values are missing at random. [sent-321, score-0.501]
55 In practice, features are often structured such that validity is determined at the level of groups of features, that is, all features in a group are either valid or invalid together. [sent-325, score-0.416]
56 This happens, for example, if a part of an object is missing in an image, since all its corresponding features are invalid. [sent-326, score-0.738]
57 We evaluated the performance in a setting where we can control the mechanism that governs missing features. [sent-329, score-0.46]
58 To make the effect of missing features prominent, we removed 90% of the features of each sample at random and applied the competing methods described above for each of the tasks. [sent-335, score-0.692]
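The removal scheme itself is simple; a sketch under the stated 90% setting (the seed and names are illustrative):

```python
import numpy as np

def remove_features_at_random(X, frac_missing=0.9, seed=0):
    """Mark a random fraction of each sample's feature values as missing (NaN)."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    n, d = X.shape
    k = int(round(frac_missing * d))
    for i in range(n):
        X_miss[i, rng.choice(d, size=k, replace=False)] = np.nan
    return X_miss
```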
59 Table 1: Mean accuracies (± standard errors over five random partitions, each separating the data into five cross-validation sets) for classification of UCI data sets with randomly missing values. [sent-438, score-0.46]
60 Visual Object Recognition: We now turn to evaluating our methods in an application where instances have structurally missing features. [sent-451, score-0.617]
61 We consider a visual object recognition task, where we attempt to determine if an object from a certain class (automobiles) is present in a given input image. [sent-452, score-0.216]
62 These patches typically include several candidates for any object part, and some images may have more candidates for a given part than others. [sent-458, score-0.2]
63 In this case, we consider the corresponding features to be structurally missing. [sent-460, score-0.281]
64 Our object model contains a set of “landmarks”, defining the outline of an object, for which we find several matches in a given image (see Elidan et al. [sent-466, score-0.212]
65 For each candidate, we compute the first 10 principal component coefficients of the image patch and concatenate these descriptors to form the image feature vector with 1900 features per image. [sent-474, score-0.345]
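An illustrative sketch of this feature construction (the slot layout, helper names, and the use of scikit-learn's PCA are assumptions, not the authors' pipeline); landmark candidates that are never matched leave their 10-coefficient block structurally absent:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed setup: a PCA fitted on training patches, e.g. pca = PCA(n_components=10).fit(train_patches)

def image_feature_vector(candidate_patches, pca, n_slots, n_coeffs=10):
    """candidate_patches: dict slot_index -> flattened patch; unmatched slots keep NaN blocks."""
    vec = np.full(n_slots * n_coeffs, np.nan)        # 1900 features when n_slots = 190
    for slot, patch in candidate_patches.items():
        coeffs = pca.transform(patch[None, :])[0, :n_coeffs]
        vec[slot * n_coeffs:(slot + 1) * n_coeffs] = coeffs
    return vec
```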
66 Figure 5: Classification accuracy of the different methods (geom, avg, |w|, zero, mean, EM, flag, kNN) for object recognition in natural images. [sent-481, score-0.323]
67 Because we only include “good” matches in the feature vector, we would expect a classifier learned using our method to outperform methods that do not handle these missing features as well. [sent-489, score-0.781]
68 In the first case, all landmark matches seem to be good, while in the second, all matches appear to be incorrect. [sent-492, score-0.255]
69 This analysis provides strong evidence that our method is leveraging a proper handling of missing features to obtain a boost in performance for this application. [sent-495, score-0.694]
70 (a) and (b) Five of the candidate matches miss the car, leading to irrelevant features for this instance. [sent-501, score-0.316]
71 (c) All candidate matches successfully match the car, making all features valid. [sent-503, score-0.316]
72 Metabolic Pathway Reconstruction: As a final application, we consider the problem of predicting missing enzymes in metabolic pathways, a long-standing and important challenge in computational biology (Vert and Yamanishi, 2004; Kharchenko et al. [sent-511, score-0.839]
73 Instances in this task have missing features due to the structure of the biochemical network. [sent-513, score-0.651]
74 Cells use a complex network of chemical reactions to produce their chemical building blocks (Figure 7). [sent-515, score-0.243]
75 For example, in Figure 7 the enzyme ARO7 catalyzes a reaction that turns chorismate into prephenate. [sent-517, score-0.237]
76 For many reactions, the enzyme responsible for their catalysis is unknown, making it an important computational task to predict the identity of such missing enzymes. [sent-518, score-0.646]
77 , ARO7), but in some cases these enzymes are unknown. [sent-525, score-0.2]
78 As a result, neighboring enzyme pairs have non-trivial correlations over their features that reflect their functional relations. [sent-527, score-0.377]
79 For example, an enzyme in a linear chain depends on the outputs of the preceding reaction as its own inputs. [sent-529, score-0.237]
80 Hence, the genes that code for these enzymes are expected to be co-expressed (Kharchenko et al. [sent-530, score-0.2]
81 On the other hand, enzymes in forking motifs (using a shared input compound, but creating different outputs) often have anti-correlated expression profiles (Ihmels et al. [sent-533, score-0.2]
82 Each enzyme is represented using a vector of features that measure its relatedness to each of its different neighbors, across different data types. [sent-536, score-0.416]
83 A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors. [sent-537, score-0.811]
84 For example, the enzyme PHA2, which transforms prephenate to phenylpyruvate in Figure 7, does not have a neighbor of type fork, and therefore all features assigned to such a neighbor are absent in the representation of the reaction “Prephenate → Phenylpyruvate”. [sent-538, score-0.534]
85 We used the metabolic network of budding yeast (S. cerevisiae). [sent-539, score-0.213]
86 We used three types of data for enzyme attributes: a compendium of gene expression assays, the protein domain content of enzymes, and the cellular localization of proteins. [sent-543, score-0.29]
87 All similarity features were concatenated, yielding a feature vector of length ∼ 3900. [sent-546, score-0.232]
88 As explained above, features in this vector are often missing since some enzymes do not have all types of neighbors. [sent-547, score-0.851]
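An illustrative sketch of this representation (the neighbor-type names and block layout are assumptions made for the example, not taken from the paper): one block of similarity features per neighbor type, with a missing neighbor type leaving its entire block structurally absent.

```python
import numpy as np

NEIGHBOR_TYPES = ["chain_predecessor", "chain_successor", "fork", "funnel"]   # assumed labels

def enzyme_feature_vector(neighbor_similarities, block_size):
    """neighbor_similarities: dict neighbor_type -> 1-D block of similarity features."""
    blocks = []
    for t in NEIGHBOR_TYPES:
        if t in neighbor_similarities:
            blocks.append(np.asarray(neighbor_similarities[t], dtype=float))
        else:
            blocks.append(np.full(block_size, np.nan))   # structurally absent block
    return np.concatenate(blocks)
```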
89 We created positive examples from the reactions with known enzymes (∼520 reactions), and negative examples by plugging random impostor genes into each neighborhood. [sent-548, score-0.329]
90 The geometric margin approach achieves significantly better performance in this task. [sent-553, score-0.277]
91 Discussion: We presented a novel method for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain. [sent-558, score-1.111]
92 Instead of completing missing features in a preprocessing phase, we developed a max-margin learning objective based on a geometric interpretation of the margin when different instances essentially lie in different spaces. [sent-559, score-0.995]
93 The classification task is to identify if a candidate enzyme is in the right “neighborhood”. [sent-570, score-0.222]
94 which approximates the instance-specific geometric margins, did not perform better than simple imputation methods. [sent-572, score-0.243]
95 The standard treatment of missing features is based on the concept that missing features exist but are unobserved due to noise or insufficient measurements. [sent-577, score-1.302]
96 While imputation methods may be impractical in such a setting, approaches based on skipping unknown entries could reduce the computation costs significantly, by operating on known entries only. [sent-585, score-0.263]
97 Interestingly, the problem of classifying with missing features is related to another problem, where individual reliability measures are available for features at each instance separately. [sent-586, score-0.885]
98 This problem can be viewed in the same framework described in this paper: the geometric margin must be defined separately for each instance, since the different noise levels distort the relative scale of the coordinates of each instance. [sent-589, score-0.363]
99 Relative to this setting, the completely missing and fully valid features discussed in this paper are extreme points on the spectrum of reliability. [sent-590, score-0.651]
100 Second order cone programming approaches for handling missing and uncertain data. [sent-735, score-0.558]
wordName wordTfidf (topN-words)
[('missing', 0.46), ('enzymes', 0.2), ('features', 0.191), ('enzyme', 0.186), ('metabolic', 0.179), ('bbeel', 0.168), ('bsent', 0.168), ('eitz', 0.168), ('hechik', 0.168), ('margin', 0.162), ('lidan', 0.143), ('reactions', 0.129), ('imputation', 0.128), ('wxi', 0.118), ('geometric', 0.115), ('lassification', 0.102), ('eatures', 0.102), ('geom', 0.101), ('kharchenko', 0.101), ('stanford', 0.096), ('ax', 0.092), ('structurally', 0.09), ('missingness', 0.09), ('matches', 0.089), ('object', 0.087), ('socp', 0.086), ('knn', 0.084), ('si', 0.079), ('gal', 0.077), ('landmark', 0.077), ('absent', 0.072), ('lling', 0.072), ('ags', 0.071), ('elidan', 0.07), ('koller', 0.069), ('separable', 0.068), ('skipping', 0.067), ('instances', 0.067), ('yi', 0.063), ('equation', 0.061), ('car', 0.059), ('avg', 0.059), ('pathways', 0.059), ('chemical', 0.057), ('kernels', 0.056), ('lled', 0.056), ('cone', 0.055), ('classi', 0.052), ('xi', 0.052), ('formulation', 0.051), ('localization', 0.051), ('reaction', 0.051), ('vert', 0.051), ('ull', 0.051), ('yamanishi', 0.051), ('mnist', 0.048), ('hypothetical', 0.047), ('norms', 0.046), ('qp', 0.046), ('denominator', 0.044), ('instance', 0.043), ('heitz', 0.043), ('chechik', 0.043), ('berg', 0.043), ('compounds', 0.043), ('grauman', 0.043), ('handling', 0.043), ('em', 0.043), ('recognition', 0.042), ('margins', 0.041), ('patch', 0.041), ('patches', 0.041), ('feature', 0.041), ('cellular', 0.039), ('across', 0.039), ('meaningless', 0.038), ('iterative', 0.038), ('optimization', 0.037), ('fi', 0.037), ('mini', 0.037), ('convex', 0.037), ('candidate', 0.036), ('image', 0.036), ('candidates', 0.036), ('entries', 0.034), ('automobiles', 0.034), ('budding', 0.034), ('flag', 0.034), ('forks', 0.034), ('forster', 0.034), ('funnels', 0.034), ('geremy', 0.034), ('huh', 0.034), ('hulo', 0.034), ('ihmels', 0.034), ('invalid', 0.034), ('jaimovich', 0.034), ('neville', 0.034), ('nonexisting', 0.034), ('phenylpyruvate', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
2 0.15268147 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.1427746 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
5 0.076987863 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
Author: Giorgio Corani, Marco Zaffalon
Abstract: In this paper, the naive credal classifier, which is a set-valued counterpart of naive Bayes, is extended to a general and flexible treatment of incomplete data, yielding a new classifier called naive credal classifier 2 (NCC2). The new classifier delivers classifications that are reliable even in the presence of small sample sizes and missing values. Extensive empirical evaluations show that, by issuing set-valued classifications, NCC2 is able to isolate and properly deal with instances that are hard to classify (on which naive Bayes accuracy drops considerably), and to perform as well as naive Bayes on the other instances. The experiments point to a general problem: they show that with missing values, empirical evaluations may not reliably estimate the accuracy of a traditional classifier, such as naive Bayes. This phenomenon adds even more value to the robust approach to classification implemented by NCC2. Keywords: naive Bayes, naive credal classifier, imprecise probabilities, missing values, conservative inference rule, missing at random
6 0.07389532 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.067713514 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.067274235 87 jmlr-2008-Stationary Features and Cat Detection
9 0.064627565 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.064010836 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
11 0.058678299 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
12 0.057238258 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.056517143 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
14 0.053408086 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
15 0.053267468 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
16 0.052437175 96 jmlr-2008-Visualizing Data using t-SNE
17 0.052129768 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.047369815 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.046495724 52 jmlr-2008-Learning from Multiple Sources
20 0.046454579 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
topicId topicWeight
[(0, 0.27), (1, -0.074), (2, 0.107), (3, 0.041), (4, -0.23), (5, 0.167), (6, 0.264), (7, 0.108), (8, 0.0), (9, 0.041), (10, 0.001), (11, -0.125), (12, 0.142), (13, 0.011), (14, 0.063), (15, -0.01), (16, -0.148), (17, 0.079), (18, -0.048), (19, 0.163), (20, 0.133), (21, 0.024), (22, 0.047), (23, 0.015), (24, 0.098), (25, -0.087), (26, -0.023), (27, -0.092), (28, 0.069), (29, 0.018), (30, -0.03), (31, -0.044), (32, -0.008), (33, 0.073), (34, -0.019), (35, -0.032), (36, -0.032), (37, -0.026), (38, -0.168), (39, 0.042), (40, -0.015), (41, 0.056), (42, -0.017), (43, -0.021), (44, -0.092), (45, -0.025), (46, 0.003), (47, -0.169), (48, 0.045), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.95120662 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
2 0.76229042 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter
Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
3 0.68298352 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
Author: Giorgio Corani, Marco Zaffalon
Abstract: JNCC2 implements the naive credal classifier 2 (NCC2). This is an extension of naive Bayes to imprecise probabilities that aims at delivering robust classifications also when dealing with small or incomplete data sets. Robustness is achieved by delivering set-valued classifications (that is, returning multiple classes) on the instances for which (i) the learning set is not informative enough to smooth the effect of choice of the prior density or (ii) the uncertainty arising from missing data prevents the reliable indication of a single class. JNCC2 is released under the GNU GPL license. Keywords: imprecise probabilities, missing data, naive Bayes, naive credal classifier 2, Java
5 0.40761581 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
6 0.38194221 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2
7 0.37949568 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
8 0.35295931 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
9 0.34222648 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.314881 87 jmlr-2008-Stationary Features and Cat Detection
11 0.2905612 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
12 0.28603399 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
13 0.28222212 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
14 0.27237988 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
15 0.26415247 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
16 0.25908405 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.25699142 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
18 0.25129408 86 jmlr-2008-SimpleMKL
19 0.24568723 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
20 0.24374723 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
topicId topicWeight
[(0, 0.032), (1, 0.013), (5, 0.023), (21, 0.316), (31, 0.015), (40, 0.078), (54, 0.045), (58, 0.04), (66, 0.072), (76, 0.038), (78, 0.015), (86, 0.01), (88, 0.08), (92, 0.034), (94, 0.056), (99, 0.038)]
simIndex simValue paperId paperTitle
Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy
Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics
same-paper 2 0.76768941 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
3 0.44193935 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.43897742 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.43366107 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
6 0.42653549 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
7 0.42588094 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.42567375 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
9 0.4249008 86 jmlr-2008-SimpleMKL
10 0.42471454 56 jmlr-2008-Magic Moments for Structured Output Prediction
11 0.42347869 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
12 0.42295113 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
13 0.42157382 83 jmlr-2008-Robust Submodular Observation Selection
14 0.41536725 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
15 0.41515911 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
16 0.41504619 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.41161382 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.40797976 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
19 0.40752026 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
20 0.40612066 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes