nips nips2006 nips2006-130 knowledge-graph by maker-knowledge-mining

130 nips-2006-Max-margin classification of incomplete data


Source: pdf

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. [sent-3, score-0.479]

2 The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. [sent-4, score-1.247]

3 In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. [sent-5, score-0.712]

4 More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. [sent-8, score-1.464]

5 We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. [sent-9, score-0.265]

6 1 Introduction In the traditional formulation of supervised learning, data instances are viewed as vectors of features in some high-dimensional space. [sent-10, score-0.277]

7 However, in many real-world tasks, data instances have a complex pattern of missing features. [sent-11, score-0.564]

8 While features may sometimes be missing due to measurement noise or corruption, different samples often have different sets of observable features due to inherent properties of the instances. [sent-12, score-0.86]

9 In analyzing genomic data, we may wish to predict the edges in networks of interacting proteins or chemical reactions [9, 15]. [sent-17, score-0.231]

10 When a page has no such parents, however, this feature is meaningless and should be considered structurally absent. [sent-20, score-0.28]

11 The common approach for classification with missing features is imputation, a two-phase procedure where the values of the missing attributes are first filled in during a preprocessing phase, after which a standard classifier is applied to the completed data [10]. [sent-21, score-1.29]

12 In common imputation practice, missing attributes in continuous data are often filled with zeros, with the average over all data instances, or with values derived from the k nearest neighbors (kNN) of each instance to find a plausible value for its missing features. [sent-23, score-1.089]
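
To make these baselines concrete, here is a minimal sketch (Python with NumPy; the array names are illustrative) of the two simplest fill-in strategies, zero imputation and per-feature mean imputation, with missing entries assumed to be encoded as NaN.

```python
import numpy as np

def zero_impute(X):
    """Replace every missing (NaN) entry with zero."""
    X_filled = X.copy()
    X_filled[np.isnan(X_filled)] = 0.0
    return X_filled

def mean_impute(X):
    """Replace every missing (NaN) entry with the mean of its feature,
    computed over the instances where that feature is observed."""
    X_filled = X.copy()
    col_means = np.nanmean(X_filled, axis=0)            # per-feature means, ignoring NaNs
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = col_means[cols]
    return X_filled

# Toy example with missing entries
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 0.5,    np.nan],
              [np.nan, 1.5, 4.0]])
print(mean_impute(X))
```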

13 A second family of imputation methods builds probabilistic generative models of the features using raw maximum likelihood or algorithms such as expectation maximization (EM) [4]. [sent-24, score-0.301]

14 These methods work very well for MAR data settings, because they assume that the missing features are generated by the same model that generates the observed features. [sent-26, score-0.669]

15 More importantly, they will produce meaningless completions for features that are structurally absent. [sent-28, score-0.411]

16 , body parts, and architectural aspects), in which filling missing values (e. [sent-33, score-0.478]

17 As a result, for structurally absent features, it would be useful if we could avoid unnecessary prediction of hypothetical undefined values, and classify instances directly. [sent-36, score-0.299]

18 We approach this problem directly from the geometric interpretation of the classification task as finding a separating hyperplane in the feature space. [sent-37, score-0.223]

19 We view instances with different feature sets as lying in subspaces of the full feature space, and suggest a modified optimization objective within the framework of support vector machines (SVMs), that explicitly considers the subspace of each instance. [sent-38, score-0.317]

20 These approaches may be viewed as model-free methods for handling missing data in the cases where the MAR assumption fails to hold. [sent-41, score-0.508]

21 We evaluate the performance of our approach in two real world applications: prediction of missing enzymes in a metabolic network, and automobile detection in natural images. [sent-42, score-1.006]

22 In both tasks, features may be inherently absent due to the mechanisms described above, and our methods are found superior to other simple imputation methods. [sent-43, score-0.392]

23 Each sample xi is characterized by a subset of features F(xi ), from a full set F of size d. [sent-48, score-0.279]

24 A sample $x_i$ with partially valid features can be viewed as embedded in the relevant subspace $\mathbb{R}^{|F(x_i)|} \subseteq \mathbb{R}^d$. [sent-50, score-0.376]
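
A convenient way to encode this subspace view in code is to keep every instance as a length-d vector with NaN marking absent features, and to derive the validity set F(x_i) from it; the following is a small sketch under that assumption (names are illustrative).

```python
import numpy as np

def valid_mask(x):
    """Boolean mask of the features present for instance x, i.e. the index set F(x)."""
    return ~np.isnan(x)

def project_to_subspace(x):
    """Return only the valid coordinates of x, i.e. x viewed as a point in
    R^{|F(x)|} rather than R^d."""
    return x[valid_mask(x)]

x = np.array([0.7, np.nan, -1.2, np.nan])
print(valid_mask(x))            # [ True False  True False]
print(project_to_subspace(x))   # [ 0.7 -1.2]
```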

25 Importantly, since instances share features, the learned classifier must be consistent across instances, assigning the same weight to a given feature in different samples, even if those instance do not lie in the same subspace. [sent-52, score-0.24]

26 Figure 1: The margin is incorrectly scaled when a sample that has missing features is treated as if the missing features have a value of zero. [sent-61, score-1.488]

27 In this example, the margin of a sample that only has one feature (the x dimension) is measured both in the higher dimensional space (ρ2 ) and the lower one (ρ1 ). [sent-62, score-0.21]

28 If all features are assumed to exist, and we give the missing feature (along the y axis) a value of zero, the margin ρ2 measured in the higher dimensional space is shorter than the margin measured in the relevant subspace ρ1. [sent-63, score-1.071]
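
The effect described in this caption is easy to reproduce numerically; the sketch below (with made-up numbers) compares the margin computed in the full space after zero-filling against the margin computed in the instance's own subspace.

```python
import numpy as np

w = np.array([0.6, 0.8])           # a unit-norm separating direction in R^2
x = np.array([1.0, np.nan])        # instance with only the first feature observed
y = 1.0

valid = ~np.isnan(x)
x_zero = np.where(valid, x, 0.0)   # treat the missing feature as zero

rho_full = y * np.dot(w, x_zero) / np.linalg.norm(w)                  # rho_2 in Fig. 1
rho_sub  = y * np.dot(w[valid], x[valid]) / np.linalg.norm(w[valid])  # rho_1 in Fig. 1

print(rho_full, rho_sub)   # rho_full < rho_sub: the full-space margin is underestimated
```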

29 Consider now learning such a classifier in the presence of missing data. [sent-64, score-0.478]

30 At first glance, it may appear that since the x’s only affect the optimization through inner products with w, missing features can merely be skipped (or equivalently, replaced with zeros), thereby preserving the values of the inner product. [sent-65, score-0.793]

31 This is illustrated in Fig. 1, where a single sample in $\mathbb{R}^2$ has one valid and one missing feature. [sent-68, score-0.533]

32 Due to the missing feature, measuring the margin in the full space (ρ2) underestimates the correct geometric margin of the sample in the valid space (ρ1). [sent-69, score-0.93]

33 This is different from the case where the feature exists but is unknown, in which the sample’s margin could be either over- or under-estimated. [sent-70, score-0.21]

34 3 Geometric interpretation The derivation of the SVM classifier [14] is motivated by the goal of finding a hyperplane that maximally separates the positive examples from the negative, as measured by the geometric margin $\rho(w) = \min_i \frac{y_i\, w x_i}{\|w\|}$. [sent-74, score-0.582]

35 The task of maximizing the margin ρ(w), $\max_w \rho(w) = \max_w \min_i \frac{y_i\, w x_i}{\|w\|}$ (2), is transformed into the quadratic programming problem of Eq. [sent-75, score-0.445]

36 First, $\frac{1}{\|w\|}$ is taken out of the minimization, yielding $\max_w \frac{1}{\|w\|}\left(\min_i y_i\, w x_i\right)$. [sent-77, score-0.325]

37 In the case of missing features, this derivation no longer optimizes the correct geometrical margin (Fig. [sent-84, score-0.628]

38 To address this problem, we treat the margin of each instance in its own subspace, by defining the instance margin for the i-th instance as $\rho_i(w) = \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$, where $\|w^{(i)}\|^2 = \sum_{k: f_k \in F(x_i)} w_k^2$. [sent-86, score-0.669]

39 The geometric margin is, as before, the minimum over all instance margins, yielding a new optimization problem $\max_w \min_i \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}$. [sent-87, score-0.586]
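
A direct transcription of these instance margins, again assuming NaN encodes a structurally absent feature and that the variable names are illustrative, might look as follows.

```python
import numpy as np

def instance_margins(w, X, y):
    """Per-instance margins rho_i(w) = y_i <w, x_i> / ||w^(i)||, where the inner
    product and the norm use only the features valid for x_i."""
    margins = []
    for x_i, y_i in zip(X, y):
        valid = ~np.isnan(x_i)
        w_i = w[valid]                               # w restricted to F(x_i)
        margins.append(y_i * np.dot(w_i, x_i[valid]) / np.linalg.norm(w_i))
    return np.array(margins)

def geometric_margin(w, X, y):
    """The geometric margin: the minimum over all instance margins."""
    return instance_margins(w, X, y).min()
```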

40 Unfortunately, extending this formulation to the non-separable case while preserving the geometric margin interpretation makes the problem non-convex (this is discussed elsewhere). [sent-99, score-0.278]

41 (3) as $\max_w \min_i \frac{y_i\, w x_i}{s_i \|w\|} = \max_w \frac{1}{\|w\|} \min_i \frac{y_i\, w x_i}{s_i}$, with $s_i = \frac{\|w^{(i)}\|}{\|w\|}$. [sent-103, score-0.814]

42 We can use a faster iterative algorithm based on the fact that the problem is a QP for any given set of $s_i$'s, iterating between (1) solving a QP for w given the $s_i$'s, and (2) using the resulting w to calculate new $s_i$'s. [sent-120, score-0.327]
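
A minimal sketch of this alternation is given below, using the CVXPY modeling library for the inner QP. The exact soft-margin form of the constraints (a scaled hinge constraint $y_i\, w x_i / s_i \ge 1 - \xi_i$), the use of CVXPY, and all variable names are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np
import cvxpy as cp

def iterative_geom_svm(X, y, C=1.0, n_iters=10):
    """Alternate between (1) solving a soft-margin QP for w with the per-instance
    scalings s_i held fixed, and (2) recomputing s_i = ||w^(i)|| / ||w|| from the
    new w.  Missing features in X are encoded as NaN; y holds labels in {-1, +1}."""
    valid = ~np.isnan(X)                      # validity masks F(x_i)
    Xz = np.where(valid, X, 0.0)              # zero-filled copy: inner products skip missing features
    n, d = X.shape
    s = np.ones(n)                            # initial scalings

    for _ in range(n_iters):
        w = cp.Variable(d)
        xi = cp.Variable(n, nonneg=True)
        margins = cp.multiply(y / s, Xz @ w)  # y_i <w, x_i> / s_i over valid features
        prob = cp.Problem(
            cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
            [margins >= 1 - xi])
        prob.solve()

        w_val = w.value
        # step (2): new scalings from the solved w
        s = np.sqrt((valid * w_val ** 2).sum(axis=1)) / np.linalg.norm(w_val)
        s = np.maximum(s, 1e-8)               # guard against instances with tiny ||w^(i)||
    return w_val, s
```

The guard on s_i simply avoids division by near-zero norms for instances whose valid features all carry negligible weight; it is an implementation choice, not part of the formulation.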

43 The dual of the above QP can be kernelized as in a standard SVM, yielding $\max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j \frac{y_i y_j}{s_i s_j} K'(x_i, x_j)$ s.t. [sent-130, score-0.297]

44 Kernels in this formulation operate over vectors with missing features, hence we have to develop kernels that handle them correctly. [sent-135, score-0.515]

45 In this case there is an easy procedure to construct a modified kernel that takes such missing values into account. [sent-137, score-0.478]

46 For example, for a polynomial kernel $K(x_i, x_j) = (\langle x_i, x_j\rangle + 1)^d$, define $K'(x_i, x_j) = (\langle x_i, x_j\rangle_F + 1)^d$, with the inner product calculated over valid features, $\langle x_i, x_j\rangle_F = \sum_{k: f_k \in F(x_j) \cap F(x_i)} x_{ik}\, x_{jk}$. [sent-138, score-1.033]
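
A sketch of this modified polynomial kernel, computing the inner product only over features that are valid in both instances (NaN marks absence; the function names are illustrative):

```python
import numpy as np

def poly_kernel_missing(x_i, x_j, degree=2):
    """Polynomial kernel K'(x_i, x_j) = (<x_i, x_j>_F + 1)^degree, where the inner
    product runs only over features valid in both x_i and x_j."""
    shared = ~np.isnan(x_i) & ~np.isnan(x_j)      # F(x_i) ∩ F(x_j)
    return (np.dot(x_i[shared], x_j[shared]) + 1.0) ** degree

def gram_matrix(X, degree=2):
    """Kernel matrix over a data set with missing (NaN) entries."""
    n = len(X)
    K = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            K[a, b] = poly_kernel_missing(X[a], X[b], degree)
    return K
```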

47 (a) An easy instance where all local features are approximately in agreement. [sent-146, score-0.285]

48 (b) A hard instance where local features are divided into two distinct groups. [sent-147, score-0.285]

49 First, as a sanity check, we explored performance when features are missing at random, in a series of five standard UCI benchmarks, and also in a large digit recognition task using MNIST data. [sent-152, score-0.704]

50 Second, we study a visual object recognition application where some features are missing because they cannot be located in the image. [sent-155, score-0.743]

51 Finally, we apply our methods to a problem of biological network completion, where the missingness pattern of the features is determined by the known structure of the network. [sent-156, score-0.298]

52 For all applications, we compare our iterative algorithm with five common approaches for completing missing features. [sent-157, score-0.57]

53 To reduce the number of added features, we added a single flag for each group of features that were valid or invalid together across all instances. [sent-163, score-0.284]
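
A sketch of this flag-augmentation baseline, with one binary indicator appended per group of features that are valid or invalid together (the group definitions here are purely illustrative):

```python
import numpy as np

def add_validity_flags(X, groups):
    """Append one binary flag per feature group: 1 if the group's features are
    observed for that instance, 0 if they are structurally absent.
    `groups` is a list of index arrays, one per group of features that are valid
    or invalid together.  Missing entries of X are NaN and are zero-filled."""
    flags = np.column_stack(
        [(~np.isnan(X[:, g])).all(axis=1).astype(float) for g in groups])
    X_filled = np.where(np.isnan(X), 0.0, X)
    return np.hstack([X_filled, flags])

# Example grouping: features 0-2 belong to one landmark, features 3-4 to another
groups = [np.array([0, 1, 2]), np.array([3, 4])]
```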

54 For example, in the vision application, all features of a landmark candidate are grouped together since they are all invalid if the match is wrong (see below). [sent-164, score-0.449]

55 kNN: Missing features were set to the mean value obtained from the K nearest neighbor instances; the neighborhood was measured using Euclidean distance in the subspace relevant to each pair of samples, the number of neighbors was varied as K = 3, 5, 10, 20, and the best result is the one reported. [sent-166, score-0.279]
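
A sketch of this kNN completion, where the distance between two instances is computed only over the features they both observe (the parameter names and the handling of ties are illustrative choices):

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each missing entry with the mean of that feature among the k nearest
    neighbours, where distance is Euclidean over the features shared by the pair."""
    X_filled = X.copy()
    n = len(X)
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])   # subspace relevant to this pair
            if shared.any():
                dists[j] = np.linalg.norm(X[i, shared] - X[j, shared])
        neighbours = np.argsort(dists)[:k]
        for f in np.where(miss)[0]:
            vals = X[neighbours, f]
            vals = vals[~np.isnan(vals)]                 # neighbours may also miss this feature
            if len(vals) > 0:
                X_filled[i, f] = vals.mean()
    return X_filled
```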

56 A Gaussian mixture model is learned by iterating between (1) learning a GMM model of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample. [sent-169, score-0.478]
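
A sketch of this refilling loop, here built on scikit-learn's GaussianMixture; the mean-imputation initialization and the number of refill iterations are assumptions, and the model used in the paper may differ in its details.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(X, n_components=3, n_iters=5):
    """Iterate between (1) fitting a Gaussian mixture to the currently filled data
    and (2) re-filling missing entries with the cluster means, weighted by the
    posterior responsibility of each cluster for the instance."""
    miss = np.isnan(X)
    X_filled = np.where(miss, np.nanmean(X, axis=0), X)   # start from mean imputation
    for _ in range(n_iters):
        gmm = GaussianMixture(n_components=n_components).fit(X_filled)
        resp = gmm.predict_proba(X_filled)                 # (n, n_components) posteriors
        expected = resp @ gmm.means_                       # posterior-weighted cluster means
        X_filled = np.where(miss, expected, X)
    return X_filled
```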

57 1 Visual object recognition We now consider a visual object recognition task where instances have structurally missing features. [sent-178, score-0.901]

58 For example, the trunk of a car may not be found in a picture of a hatch-back car, hence all its corresponding features are considered to be structurally missing from that image. [sent-184, score-0.915]

59 If the number of patches for a given landmark is less than ten, we consider the rest to be structurally absent. [sent-195, score-0.307]

60 Fig. 2a shows the top 10 matches for the front windshield landmark for a representative "easy" test instance where all local features are approximately in agreement. [sent-203, score-0.491]

61 Fig. 2b shows a representative "hard" test instance where local features cluster into two different groups. [sent-206, score-0.285]

62 In this case, the cluster of bad matches was automatically excluded yielding missing features, and our geometric approach was the only method able to classify the instance correctly. [sent-207, score-0.767]

63 2 Metabolic pathway reconstruction As a final application, we consider the problem of predicting missing enzymes in metabolic pathways, a long-standing and important challenge in computational biology [15, 9]. [sent-209, score-1.004]

64 Instances in this task have missing features due to the structure of the biochemical network. [sent-210, score-0.704]

65 Cells use a complex network of chemical reactions to produce their chemical building blocks (Fig. [sent-211, score-0.326]

66 Each reaction transforms a set of molecular compounds (called substrates) into another set of molecules (products), and requires the presence of an enzyme to catalyze the reaction. [sent-213, score-0.286]

67 It is often unknown which enzyme catalyzes a given reaction, and it is desirable to predict the identity of such missing enzymes computationally. [sent-214, score-0.917]

68 Our approach for predicting missing enzymes is based on the observation that enzymes in local network neighborhoods usually participate in related functions. [sent-215, score-1.045]

69 As a result, neighboring enzyme pairs have non-trivial correlations over their features that reflect their functional relations. [sent-216, score-0.367]

70 Importantly, different types of neighborhood relations between enzyme pairs lead to different relations of their properties. [sent-217, score-0.292]

71 For example, an enzyme in a linear chain depends on the preceding enzyme's product as its substrate. [sent-218, score-0.352]

72 On the other hand, enzymes in forking motifs (same substrate, different products) often have anti-correlated expression profiles [7]. [sent-220, score-0.298]

73 Each enzyme is represented as a vector of features that measure its relatedness to each of its neighbors. [sent-222, score-0.367]

74 A feature vector has structurally missing entries if the enzyme does not have all types of neighbors. [sent-223, score-0.868]

75 The example in Fig. 3 does not have a neighbor of type fork, and therefore all features assigned to such a neighbor are absent in the representation of the reaction "Prephenate → Phenylpyruvate". [sent-225, score-0.292]

76 Figure 3: Left: A fragment of the full metabolic pathway network in S. [sent-235, score-0.304]

77 Chemical reactions (arrows) transform a set of molecular compounds into other compounds. [sent-237, score-0.212]

78 , ARO7), but in some cases these enzymes are unknown. [sent-241, score-0.263]

79 The network imposes various neighborhood relations between enzymes assigned to reactions, like linear chains (ARO7, PHA2), forks (TRP2, ARO7) and funnels (ARO9, PHA2). Top Right: Classification accuracy for compared methods. [sent-242, score-0.467]

80 The classification task is to identify if a candidate enzyme is in the right “neighborhood”. [sent-243, score-0.262]

81 S. cerevisiae, as reconstructed by Palsson and colleagues [3], after removing 14 metabolic currencies and reactions with unknown enzymes, leaving 1265 directed reactions. [sent-247, score-0.398]

82 Each domain k contributed one feature, the point-wise symmetric DKL measure $x_i(k)\log\frac{x_i(k)}{(x_j(k)+x_i(k))/2} + x_j(k)\log\frac{x_j(k)}{(x_j(k)+x_i(k))/2}$. [sent-251, score-0.386]
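
A sketch of this per-domain feature; the small constant eps is added here only to keep the logarithms finite when a value is zero, which is an implementation choice rather than part of the formula.

```python
import numpy as np

def symmetric_dkl(xi_k, xj_k, eps=1e-12):
    """Point-wise symmetric KL term for domain k:
       x_i(k) * log(x_i(k)/m) + x_j(k) * log(x_j(k)/m),  with m = (x_i(k)+x_j(k))/2."""
    m = (xi_k + xj_k) / 2.0
    return (xi_k * np.log((xi_k + eps) / (m + eps))
            + xj_k * np.log((xj_k + eps) / (m + eps)))
```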

83 Pathway reconstruction requires that we rank candidate enzymes by their potential to match a reaction. [sent-254, score-0.352]

84 We created a set of positive examples from the reactions with known enzymes (∼ 520 reactions), and also created negative examples by plugging impostor genes into ‘wrong’ neighborhoods. [sent-256, score-0.44]

85 The geometric margin approach achieves significantly better performance in this task. [sent-259, score-0.247]

86 Finally, the resulting classifier is used for predicting missing enzymes, by ranking all candidate enzymes according to their match to a given neighborhood. [sent-263, score-0.83]

87 Evaluating the quality of ranking on known enzymes (cross validation), shows that it significantly outperforms previous approaches [9] (not shown here due to space limitations). [sent-264, score-0.263]

88 We attribute this to the ability of the current approach to preserve different types of network-neighbors as separate features in spite of creating missing values. [sent-265, score-0.669]

89 5 Discussion We presented a novel method for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain. [sent-266, score-1.147]

90 Instead of completing missing features as a preprocessing phase, we developed a max-margin learning objective based on a geometric interpretation of the margin when different instances essentially lie in different spaces. [sent-267, score-1.155]

91 Using two challenging real-life problems, we showed that our method is significantly superior when the pattern of missing features has structure. [sent-268, score-0.669]

92 The standard treatment of missing features is based on the concept that missing features exist but are unobserved. [sent-269, score-1.338]

93 This assumption underlies the approach of completing features before the data is used in classification. [sent-270, score-0.25]

94 This paper focuses on a different scenario, in which features are inherently absent. [sent-271, score-0.223]

95 In fact, by completing features that are not supposed to be part of an instance, we may be confusing the learning algorithm by presenting it with a problem that may be harder than the one we actually need to solve. [sent-273, score-0.25]

96 Interestingly, the problem of classifying with missing features is related to another problem, where individual reliability measures are available for features at each instance separately. [sent-274, score-0.954]

97 This problem can be viewed in the same framework described in this paper: the geometric margin must be defined separately for each instance since the different noise levels distort the relative scale of each coordinate of each instance separately. [sent-277, score-0.466]

98 Relative to this setting, the completely missing and fully valid features discussed in this paper are extreme points on the spectrum of reliability. [sent-278, score-0.724]

99 Principles of transcriptional control in the metabolic network of saccharomyces cerevisiae. [sent-337, score-0.306]

100 Filling gaps in a metabolic network using expression information. [sent-347, score-0.262]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('missing', 0.478), ('erent', 0.279), ('enzymes', 0.263), ('metabolic', 0.221), ('features', 0.191), ('di', 0.186), ('reactions', 0.177), ('enzyme', 0.176), ('wxi', 0.173), ('structurally', 0.154), ('margin', 0.15), ('imputation', 0.11), ('si', 0.098), ('geometric', 0.097), ('landmark', 0.096), ('instance', 0.094), ('classi', 0.088), ('xi', 0.088), ('yi', 0.087), ('instances', 0.086), ('knn', 0.084), ('xj', 0.08), ('object', 0.074), ('separable', 0.067), ('agg', 0.066), ('flag', 0.066), ('geom', 0.066), ('meaningless', 0.066), ('missingness', 0.066), ('matches', 0.066), ('feature', 0.06), ('validation', 0.059), ('completing', 0.059), ('absent', 0.059), ('patches', 0.057), ('valid', 0.055), ('chemical', 0.054), ('landmarks', 0.053), ('svm', 0.052), ('cross', 0.052), ('candidate', 0.051), ('car', 0.048), ('er', 0.048), ('lled', 0.047), ('gal', 0.046), ('neighborhood', 0.046), ('automobile', 0.044), ('buildings', 0.044), ('cerevisiae', 0.044), ('forks', 0.044), ('palsson', 0.044), ('saccharomyces', 0.044), ('socp', 0.044), ('trunk', 0.044), ('windshield', 0.044), ('cone', 0.044), ('lling', 0.044), ('mar', 0.044), ('mini', 0.044), ('incomplete', 0.043), ('inner', 0.043), ('reaction', 0.042), ('contributed', 0.042), ('pathway', 0.042), ('subspace', 0.042), ('network', 0.041), ('attributes', 0.039), ('stanford', 0.039), ('funnels', 0.038), ('xnew', 0.038), ('invalid', 0.038), ('constrained', 0.038), ('optimization', 0.038), ('phase', 0.038), ('match', 0.038), ('qp', 0.038), ('kernels', 0.037), ('wrong', 0.035), ('task', 0.035), ('compounds', 0.035), ('motifs', 0.035), ('relations', 0.035), ('completed', 0.034), ('importantly', 0.034), ('dence', 0.033), ('molecules', 0.033), ('maxw', 0.033), ('classification', 0.033), ('iterative', 0.033), ('preprocessing', 0.032), ('rd', 0.032), ('yielding', 0.032), ('fk', 0.032), ('inherently', 0.032), ('levels', 0.031), ('interpretation', 0.031), ('cellular', 0.031), ('objective', 0.031), ('localization', 0.03), ('handling', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 130 nips-2006-Max-margin classification of incomplete data

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1

2 0.13892318 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

3 0.11944308 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

4 0.11182925 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo

Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1

5 0.10927062 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

Author: Edward Meeds, Zoubin Ghahramani, Radford M. Neal, Sam T. Roweis

Abstract: We introduce binary matrix factorization, a novel model for unsupervised matrix decomposition. The decomposition is learned by fitting a non-parametric Bayesian probabilistic model with binary latent variables to a matrix of dyadic data. Unlike bi-clustering models, which assign each row or column to a single cluster based on a categorical hidden feature, our binary feature model reflects the prior belief that items and attributes can be associated with more than one latent cluster at a time. We provide simple learning and inference rules for this new model and show how to extend it to an infinite model in which the number of features is not a priori fixed but is allowed to grow with the size of the data. 1

6 0.1066303 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

7 0.10662203 24 nips-2006-Aggregating Classification Accuracy across Time: Application to Single Trial EEG

8 0.10341387 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions

9 0.10065663 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data

10 0.099854082 186 nips-2006-Support Vector Machines on a Budget

11 0.094906606 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

12 0.092736997 105 nips-2006-Large Margin Component Analysis

13 0.08692386 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

14 0.078128189 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

15 0.077793345 185 nips-2006-Subordinate class recognition using relational object models

16 0.076933771 128 nips-2006-Manifold Denoising

17 0.075970732 50 nips-2006-Chained Boosting

18 0.07581649 179 nips-2006-Sparse Representation for Signal Classification

19 0.074283518 53 nips-2006-Combining causal and similarity-based reasoning

20 0.074276082 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.265), (1, 0.076), (2, 0.095), (3, -0.002), (4, -0.007), (5, 0.028), (6, -0.122), (7, -0.009), (8, 0.045), (9, -0.036), (10, -0.136), (11, 0.036), (12, -0.068), (13, 0.017), (14, -0.002), (15, 0.036), (16, 0.052), (17, 0.01), (18, 0.005), (19, 0.042), (20, 0.127), (21, 0.0), (22, 0.136), (23, -0.061), (24, 0.073), (25, 0.017), (26, -0.053), (27, 0.038), (28, 0.011), (29, -0.11), (30, 0.15), (31, 0.054), (32, -0.006), (33, 0.01), (34, 0.084), (35, -0.035), (36, -0.005), (37, 0.001), (38, 0.019), (39, -0.032), (40, -0.059), (41, 0.056), (42, 0.093), (43, -0.132), (44, -0.043), (45, 0.217), (46, -0.207), (47, 0.195), (48, 0.005), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95765698 130 nips-2006-Max-margin classification of incomplete data

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1

2 0.64658952 105 nips-2006-Large Margin Component Analysis

Author: Lorenzo Torresani, Kuang-chih Lee

Abstract: Metric learning has been shown to significantly improve the accuracy of k-nearest neighbor (kNN) classification. In problems involving thousands of features, distance learning algorithms cannot be used due to overfitting and high computational complexity. In such cases, previous work has relied on a two-step solution: first apply dimensionality reduction methods to the data, and then learn a metric in the resulting low-dimensional subspace. In this paper we show that better classification performance can be achieved by unifying the objectives of dimensionality reduction and metric learning. We propose a method that solves for the low-dimensional projection of the inputs, which minimizes a metric objective aimed at separating points in different classes by a large margin. This projection is defined by a significantly smaller number of parameters than metrics learned in input space, and thus our optimization reduces the risks of overfitting. Theory and results are presented for both a linear and a kernelized version of the algorithm. Overall, we achieve classification rates similar, and in several cases superior, to those of support vector machines. 1
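
To make the unified objective concrete, here is one naive gradient step on a large-margin metric criterion over a low-rank projection L, in the spirit of the abstract; the target-neighbor choice, the fixed margin, and plain batch gradient descent are simplifying assumptions rather than the algorithm proposed in the paper.

import numpy as np

def lmca_step(L, X, y, lr=0.01, margin=1.0):
    # One naive gradient step on a large-margin metric objective over a
    # low-rank projection L (shape d_low x d_high); distances between points
    # are measured as ||L x_i - L x_j||^2 after projection.
    n = X.shape[0]
    Z = X @ L.T                               # project all points to d_low dimensions
    grad = np.zeros_like(L)
    loss = 0.0
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        if len(same) == 0 or len(diff) == 0:
            continue
        # nearest same-class point in the projected space acts as the target neighbor
        j = same[np.argmin(((Z[same] - Z[i]) ** 2).sum(axis=1))]
        d_ij = ((Z[i] - Z[j]) ** 2).sum()
        loss += d_ij                          # pull the target neighbor closer
        grad += 2.0 * np.outer(Z[i] - Z[j], X[i] - X[j])
        for k in diff:                        # push differently-labeled impostors out
            hinge = margin + d_ij - ((Z[i] - Z[k]) ** 2).sum()
            if hinge > 0.0:
                loss += hinge
                grad += 2.0 * np.outer(Z[i] - Z[j], X[i] - X[j])
                grad -= 2.0 * np.outer(Z[i] - Z[k], X[i] - X[k])
    return L - lr * grad, loss

Starting from a small random L (for example 0.1 * np.random.randn(2, X.shape[1])) and iterating this step, kNN would then be run on the projected data X @ L.T.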

3 0.61807424 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

Author: Murat Dundar, Balaji Krishnapuram, R. B. Rao, Glenn M. Fung

Abstract: Many computer aided diagnosis (CAD) problems can be best modelled as a multiple-instance learning (MIL) problem with unbalanced data: i.e., the training data typically consists of a few positive bags, and a very large number of negative instances. Existing MIL algorithms are much too computationally expensive for these datasets. We describe CH, a framework for learning a Convex Hull representation of multiple instances that is significantly faster than existing MIL algorithms. Our CH framework applies to any standard hyperplane-based learning algorithm, and for some algorithms, is guaranteed to find the global optimal solution. Experimental studies on two different CAD applications further demonstrate that the proposed algorithm significantly improves diagnostic accuracy when compared to both MIL and traditional classifiers. Although not designed for standard MIL problems (which have both positive and negative bags and relatively balanced datasets), comparisons against other MIL methods on benchmark problems also indicate that the proposed method is competitive with the state-of-the-art.
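
One way to picture the convex-hull representation: each positive bag is summarized by a single point inside its convex hull, chosen to score well under the current hyperplane, and the classifier is then retrained on these representatives together with the negative instances. The softmax weighting below, and the alternating retraining it implies, are illustrative stand-ins rather than the CH algorithm described in the paper.

import numpy as np

def bag_representatives(bags, w, b, temp=1.0):
    # Each positive bag is replaced by one point inside its convex hull: a
    # softmax over the current classifier's scores chooses the convex weights.
    # bags: list of (n_i, d) arrays of instances; w, b: current linear classifier.
    reps = []
    for B in bags:
        s = B @ w + b
        lam = np.exp(temp * (s - s.max()))
        lam /= lam.sum()                  # weights lie on the probability simplex
        reps.append(lam @ B)              # convex combination of the bag's instances
    return np.array(reps)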

4 0.58497494 179 nips-2006-Sparse Representation for Signal Classification

Author: Ke Huang, Selin Aviyente

Abstract: In this paper, application of sparse representation (factorization) of signals over an overcomplete basis (dictionary) for signal classification is discussed. Searching for the sparse representation of a signal over an overcomplete dictionary is achieved by optimizing an objective function that includes two terms: one that measures the signal reconstruction error and another that measures the sparsity. This objective function works well in applications where signals need to be reconstructed, like coding and denoising. On the other hand, discriminative methods, such as linear discriminant analysis (LDA), are better suited for classification tasks. However, discriminative methods are usually sensitive to corruption in signals because they lack crucial properties for signal reconstruction. In this paper, we present a theoretical framework for signal classification with sparse representation. The approach combines the discrimination power of the discriminative methods with the reconstruction property and the sparsity of the sparse representation that enables one to deal with signal corruptions: noise, missing data and outliers. The proposed approach is therefore capable of robust classification with a sparse representation of signals. The theoretical results are demonstrated with signal classification tasks, showing that the proposed approach outperforms the standard discriminative methods and the standard sparse representation in the case of corrupted signals. 1
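
For reference, a bare-bones reconstruction-only variant of this idea: compute a sparse code of a test signal over a dictionary whose columns are labeled training signals, then pick the class whose own atoms reconstruct the signal best. The discriminative term that the paper adds is omitted here, and the penalty lam and the iteration count are arbitrary placeholders.

import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    # Sparse code: minimize 0.5 * ||x - D a||^2 + lam * ||a||_1
    # by iterative soft-thresholding (ISTA).
    a = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        a = a - step * (D.T @ (D @ a - x))        # gradient step on the quadratic term
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)   # soft threshold
    return a

def classify_by_reconstruction(D, labels, x, lam=0.1):
    # Assign x to the class whose own columns of D best reconstruct it
    # from the shared sparse code.
    a = ista(D, x, lam)
    errors = {}
    for c in np.unique(labels):
        mask = labels == c
        errors[c] = np.linalg.norm(x - D[:, mask] @ a[mask])
    return min(errors, key=errors.get)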

5 0.57575077 24 nips-2006-Aggregating Classification Accuracy across Time: Application to Single Trial EEG

Author: Steven Lemm, Christin Schäfer, Gabriel Curio

Abstract: We present a method for binary on-line classification of triggered but temporally blurred events that are embedded in noisy time series in the context of on-line discrimination between left and right imaginary hand-movement. In particular the goal of the binary classification problem is to obtain the decision, as fast and as reliably as possible from the recorded EEG single trials. To provide a probabilistic decision at every time-point t the presented method gathers information from two distinct sequences of features across time. In order to incorporate decisions from prior time-points we suggest an appropriate weighting scheme, that emphasizes time instances, providing a higher discriminatory power between the instantaneous class distributions of each feature, where the discriminatory power is quantified in terms of the Bayes error of misclassification. The effectiveness of this procedure is verified by its successful application in the 3rd BCI competition. Disclosure of the data after the competition revealed this approach to be superior with single trial error rates as low as 10.7, 11.5 and 16.7% for the three different subjects under study. 1
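
The aggregation across time can be sketched as a weighted accumulation of instantaneous evidence. The weights below are left abstract (the paper derives them from an estimate of the Bayes error at each time point); everything else in this snippet is generic and assumed.

import numpy as np

def aggregate_decision(probs, weights):
    # probs:   (T,) instantaneous posteriors P(class = 1 | x_t)
    # weights: (T,) nonnegative weights, larger where the two classes are
    #          easier to tell apart at that time point
    probs = np.clip(probs, 1e-6, 1.0 - 1e-6)
    logodds = np.log(probs / (1.0 - probs))
    cum = np.cumsum(weights * logodds)            # accumulated weighted evidence
    return 1.0 / (1.0 + np.exp(-cum))             # probability of class 1 at every t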

6 0.54566783 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

7 0.53449392 186 nips-2006-Support Vector Machines on a Budget

8 0.5308935 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.5051567 185 nips-2006-Subordinate class recognition using relational object models

10 0.4880904 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection

11 0.44863155 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

12 0.44840097 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions

13 0.44305149 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

14 0.43622869 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

15 0.43411183 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

16 0.43270782 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

17 0.42878538 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

18 0.42860255 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees

19 0.41403425 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

20 0.4135657 53 nips-2006-Combining causal and similarity-based reasoning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.125), (3, 0.021), (7, 0.075), (9, 0.045), (20, 0.023), (22, 0.078), (44, 0.056), (52, 0.011), (57, 0.111), (65, 0.064), (69, 0.03), (71, 0.014), (86, 0.247), (90, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94181597 166 nips-2006-Recursive Attribute Factoring

Author: Deepak Verma, Karl Pfleger, David Tax

Abstract: Clustering, or factoring, of a document collection attempts to “explain” each observed document in terms of one or a small number of inferred prototypes. Prior work demonstrated that when links exist between documents in the corpus (as is the case with a collection of web pages or scientific papers), building a joint model of document contents and connections produces a better model than that built from contents or connections alone. Many problems arise when trying to apply these joint models to corpora at the scale of the World Wide Web, however; one of these is that the sheer overhead of representing a feature space on the order of billions of dimensions becomes impractical. We address this problem with a simple representational shift inspired by probabilistic relational models: instead of representing document linkage in terms of the identities of linking documents, we represent it by the explicit and inferred attributes of the linking documents. Several surprising results come with this shift: in addition to being computationally more tractable, the new model produces factors that more cleanly decompose the document collection. We discuss several variations on this model and show how some can be seen as exact generalizations of the PageRank algorithm. 1
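
The representational shift is easy to state in code: rather than a binary feature per linking document, each document receives an aggregate of the inferred attributes of its in-links. The helper below is illustrative only; the aggregation by mean and all names are assumptions, not the paper's exact construction.

import numpy as np

def augment_with_link_attributes(term_counts, in_links, doc_topics):
    # term_counts: (n_docs, n_terms) bag-of-words features
    # in_links:    in_links[d] = list of indices of documents linking to d
    # doc_topics:  (n_docs, n_topics) current inferred attributes of each document
    n_docs, n_topics = doc_topics.shape
    link_feats = np.zeros((n_docs, n_topics))
    for d, parents in enumerate(in_links):
        if parents:                                # documents with no in-links keep zeros
            link_feats[d] = doc_topics[parents].mean(axis=0)
    return np.hstack([term_counts, link_feats])    # identity-free linkage features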

same-paper 2 0.82428998 130 nips-2006-Max-margin classification of incomplete data

Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller

Abstract: We consider the problem of learning classifiers for structurally incomplete data, where some objects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired objective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex objective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. 1

3 0.65559006 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo

Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature of Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
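
To convey the flavor of combining mixture learning with a spatial prior for segmentation, here is one iteration of a deliberately crude scheme: Gaussian-mixture responsibilities over pixel intensities, smoothed spatially before the M-step. The smoothing is a generic stand-in, not the paper's penalized-logistic formulation with its closed-form E-step, and all parameter names are assumed.

import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_em_iteration(img, means, variances, weights, sigma=2.0):
    # img: (h, w) grayscale intensities; means, variances, weights: length-K arrays.
    K = len(means)
    h, w = img.shape
    resp = np.empty((K, h, w))
    for k in range(K):                             # E-step: Gaussian responsibilities
        resp[k] = weights[k] * np.exp(-(img - means[k]) ** 2 / (2.0 * variances[k])) \
                  / np.sqrt(2.0 * np.pi * variances[k])
    resp /= resp.sum(axis=0, keepdims=True) + 1e-12
    for k in range(K):                             # crude spatial prior: smooth posteriors
        resp[k] = gaussian_filter(resp[k], sigma)
    resp /= resp.sum(axis=0, keepdims=True) + 1e-12
    for k in range(K):                             # M-step on the smoothed responsibilities
        nk = resp[k].sum()
        means[k] = (resp[k] * img).sum() / nk
        variances[k] = (resp[k] * (img - means[k]) ** 2).sum() / nk + 1e-6
        weights[k] = nk / (h * w)
    return means, variances, weights, resp

A segmentation is read off as resp.argmax(axis=0) after a few such iterations.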

4 0.65201658 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou

Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
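
The simplest member of the family of smooth surrogates for per-label error is the expected Hamming loss under the model's marginals; the snippet below computes it from precomputed CRF marginals. Obtaining those marginals and their gradients is the substance of the paper and is not shown, and the paper's own approximation is sharper than this plain expectation.

import numpy as np

def expected_hamming_loss(marginals, y_true):
    # marginals: (T, K) array with marginals[t, k] = P(y_t = k | x) under the CRF
    # y_true:    (T,) integer array of gold labels
    T = len(y_true)
    return T - marginals[np.arange(T), y_true].sum()   # expected number of label errors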

5 0.65200818 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf

Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1
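
Learning a saliency model directly from eye movement data reduces to a patch-level binary classification problem; the helper below assembles such a training set (patches at fixations as positives, random patches as negatives). Patch size, the negative sampler, and the downstream classifier are arbitrary choices made for illustration, not the paper's exact setup.

import numpy as np

def fixation_dataset(image, fixations, patch=13, n_neg=300, seed=0):
    # image:     (h, w) or (h, w, c) array
    # fixations: iterable of integer (row, col) fixation coordinates
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    r = patch // 2
    def crop(row, col):
        return image[row - r:row + r + 1, col - r:col + r + 1].ravel()
    pos = [crop(row, col) for row, col in fixations
           if r <= row < h - r and r <= col < w - r]   # fixated patches -> positives
    neg = [crop(rng.integers(r, h - r), rng.integers(r, w - r))
           for _ in range(n_neg)]                      # random patches -> negatives
    X = np.asarray(pos + neg, dtype=float)
    y = np.array([1] * len(pos) + [0] * len(neg))
    return X, y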

6 0.65183085 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

7 0.6516971 65 nips-2006-Denoising and Dimension Reduction in Feature Space

8 0.64933234 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

9 0.64658266 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

10 0.64484423 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

11 0.64425212 175 nips-2006-Simplifying Mixture Models through Function Approximation

12 0.64368975 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

13 0.64349371 20 nips-2006-Active learning for misspecified generalized linear models

14 0.64304954 79 nips-2006-Fast Iterative Kernel PCA

15 0.64297712 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

16 0.64215356 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

17 0.64197963 34 nips-2006-Approximate Correspondences in High Dimensions

18 0.6413222 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

19 0.64127678 5 nips-2006-A Kernel Method for the Two-Sample-Problem

20 0.6412605 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation