jmlr jmlr2008 jmlr2008-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
Reference: text
sentIndex sentText sentNum sentScore
1 Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. [sent-9, score-0.78]
2 We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. [sent-11, score-0.949]
3 The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. [sent-12, score-0.664]
4 Often, it is desirable to have the clusters match some labels that are unknown to the clustering algorithm. [sent-16, score-0.553]
5 Even worse, the clustering quality depends on the specific choice of the unobserved labels. [sent-20, score-0.686]
6 For example, good document clustering with respect to topics is very different from clustering with respect to authors. [sent-21, score-0.808]
7 In our setting, instead of attempting to cluster by some arbitrary labels, we try to predict unobserved features from observed ones. [sent-22, score-0.572]
8 For example, when clustering fruits based on their observed features such as shape, color and size, the target of clustering is to match unobserved features such as nutritional value or toxicity. [sent-24, score-1.427]
9 When clustering users based on their movie ratings, the target of clustering is to match ratings of movies that were not rated, or not even created as yet. [sent-25, score-1.072]
10 The goal of the clustering algorithm is to assign a class label ti to each instance, such that the expected mutual information between the class labels and a randomly selected unobserved feature is maximized. [sent-33, score-0.825]
11 The clustering algorithm has access only to the observed features of m instances. [sent-37, score-0.567]
12 In other words, can we find the clustering that is most likely to match a randomly selected unobserved feature? [sent-44, score-0.722]
13 We show that for any clustering algorithm, the average performance of the clustering with respect to the observed and unobserved features is similar. [sent-46, score-1.24]
14 Hence we can indirectly optimize clustering performance with respect to unobserved features by analogy with generalization in supervised learning. [sent-47, score-0.86]
15 We are interested in cases where this newly selected feature is most likely to be one of the unobserved features, and therefore we use the term unobserved information. [sent-60, score-0.602]
16 We use our framework to analyze clustering by the maximum likelihood multinomial mixture model (also called Naive Bayes Mixture Model, see Figure 2 and Section 2. [sent-71, score-0.522]
17 This clustering assumes a generative model of the data, where the instances are assumed to be sampled independently from a mixture of distributions, and for each such distribution all features are independent. [sent-73, score-0.651]
18 We show that this clustering achieves nearly the best possible clustering in terms of information on unobserved features. [sent-76, score-1.08]
19 We show that the k-means clustering algorithm (Lloyd, 1957; MacQueen, 1967) not only minimizes the observed intra-cluster variance, but also minimizes the unobserved intra-cluster variance, that is, the variance of unobserved features within each cluster. [sent-79, score-1.157]
20 The clustering algorithm clusters the instances into k clusters. [sent-116, score-0.606]
21 We define the quality of clustering with respect to a single feature, q, as I (t(Z); xq [Z]), that is, the empirical mutual information between the cluster labels and the feature. [sent-124, score-0.819]
22 Obviously, all clusters will be homogeneous with respect to all features, but this clustering is pointless. [sent-128, score-0.633]
23 Definition 1 The average observed information of a clustering t and the observed features is denoted by Ĩob(t, q̃) and defined by Ĩob(t, q̃) = (1/n) ∑_{i=1}^{n} I(t(Z); xqi[Z]). [sent-130, score-0.698]
24 The expected unobserved information of a clustering is denoted by Iun(t) and defined by Iun(t) = Eq∼D I(t(Z); xq[Z]). [sent-131, score-0.936]
25 In the special case where the distribution D is uniform and L ≫ n2, Iun can also be written as the average mutual information between the cluster label and the set of unobserved features; that is, Iun ≈ (1/(L−n)) ∑_{q ∉ {q1,...,qn}} I(t(Z); xq[Z]). [sent-134, score-0.571]
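As a concrete reading of these definitions (not code from the paper), the following minimal Python sketch computes the empirical mutual information between cluster labels and a single feature, averages it over the observed columns to estimate Ĩob, and uses held-out columns as a plug-in estimate of Iun. All data and names (mi, avg_info, X_obs, X_unobs) are hypothetical toy choices.

```python
import numpy as np

def mi(a, b):
    """Empirical mutual information (in bits) between two discrete sequences,
    computed as I(A;B) = H(A) + H(B) - H(A,B)."""
    a, b = np.asarray(a), np.asarray(b)
    def H(z):
        p = np.unique(z, return_counts=True)[1] / len(z)
        return -(p * np.log2(p)).sum()
    return H(a) + H(b) - H(a * (b.max() + 1) + b)   # joint encoded as one integer

def avg_info(t, X):
    """Average of I(t(Z); x_q[Z]) over the feature columns of X (rows = instances)."""
    return float(np.mean([mi(t, X[:, j]) for j in range(X.shape[1])]))

# Toy usage: Iob over the observed columns, plus a plug-in estimate of Iun that
# uses held-out columns in place of a fresh feature drawn from D.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))     # 200 instances, 50 binary features
X_obs, X_unobs = X[:, :30], X[:, 30:]      # the first 30 columns play the observed role
t = rng.integers(0, 3, size=200)           # some clustering into k = 3 clusters
print("Iob ~", avg_info(t, X_obs))
print("Iun (held-out estimate) ~", avg_info(t, X_unobs))
```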
26 The goal of the clustering algorithm is to cluster the instances into k clusters that maximize the unobserved information, Iun . [sent-139, score-1.017]
27 Definition 5 Observed information maximization algorithm: Let IobMax be any clustering algorithm that, based on the values of the observed features, selects a clustering topt,ob : [m] → [k] having the maximum possible value of Ĩob, that is, topt,ob = arg max_{t:[m]→[k]} Ĩob(t, q̃). [sent-183, score-0.846]
28 Let Ĩob,k be the average observed information of this clustering and Ĩun,k be the expected unobserved information of this clustering, that is, Ĩob,k(q̃) = Ĩob(topt,ob, q̃) and Ĩun,k(q̃) = Iun(topt,ob). [sent-184, score-0.752]
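A literal, exponential-time reading of Definition 5 can be sketched as below; it assumes the mi/avg_info helpers from the previous sketch are in scope and is only meant to make topt,ob concrete, not to be a practical algorithm.

```python
import itertools
import numpy as np

# Assumes the mi / avg_info helpers from the previous sketch are defined.
def iobmax_bruteforce(X_obs, k):
    """Exhaustive IobMax: return the assignment t: [m] -> [k] with the largest
    average observed information Iob.  The k**m search is only feasible for
    very small m."""
    m = X_obs.shape[0]
    best_t, best_val = None, -np.inf
    for t in itertools.product(range(k), repeat=m):
        val = avg_info(np.asarray(t), X_obs)
        if val > best_val:
            best_t, best_val = np.asarray(t), val
    return best_t, best_val

# Tiny usage: m = 8 instances, n = 5 observed binary features, k = 2 clusters.
rng = np.random.default_rng(1)
t_opt, iob_k = iobmax_bruteforce(rng.integers(0, 2, size=(8, 5)), k=2)
```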
29 Proof We now define a bad clustering as a clustering whose expected unobserved information satisfies Iun ≤ I∗un,k − ε. [sent-190, score-1.099]
30 Hence the probability that a bad clustering has a higher average observed information than the best clustering is upper bounded as in Theorem 6. [sent-193, score-0.882]
31 As a result of this theorem, when n is large enough, even an algorithm that knows the value of all features (observed and unobserved) cannot find a clustering which is significantly better than the clustering found by the IobMax algorithm. [sent-195, score-0.891]
32 Informally, this theorem means that for a large number of features we can find a clustering that is informative on unobserved features. [sent-197, score-0.822]
33 For example, clustering users based on similar ratings of current movies is likely to match future movies as well (see Section 4). [sent-198, score-0.7]
34 If the number of randomly observed features is large enough, we can find a clustering rule with two clusters such that Ĩob ≅ 1/2. [sent-213, score-0.72]
35 For this clustering rule Iun ≅ 1/2, since half of the unobserved features match this clustering (all features with an even index). [sent-217, score-1.299]
36 More specifically, if we use two clusters, where the clustering is determined by one of the observed features (i. [sent-226, score-0.567]
37 Based on the generalization theorem, we now suggest a qualitative explanation of why clustering into bananas and oranges provides relatively high information on unobserved features, while clustering based on position (e. [sent-233, score-1.156]
38 By contrast, a clustering rule which puts all items that appeared in our right visual field in one cluster, and the others in a second cluster, has much smaller Iob (since it does not match many observed features), and indeed it is not predictive about unobserved features. [sent-238, score-0.753]
39 Example 4 As a negative example, if the types of the observed features and of the target unobserved features are very different, our assumptions do not hold. [sent-239, score-0.584]
40 Although the basic assumptions of the multinomial mixture model are very different from ours, Theorem 7 tells us that this method of clustering generalizes well to unobserved features. [sent-252, score-0.77]
41 In the following theorem we analyze the feature generalization properties of soft clustering by the multinomial mixture model. [sent-267, score-0.63]
42 Theorem 7 Let Iob,ML,k be the observed information of the clustering achieved by the maximum likelihood solution of a multinomial mixture model for k clusters. [sent-269, score-0.574]
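For illustration only, here is a minimal EM sketch for a naive-Bayes mixture with binary features, a Bernoulli special case of the multinomial mixture discussed here; it is a generic implementation under that simplifying assumption, not the authors' code, and every name in it is hypothetical.

```python
import numpy as np

def em_naive_bayes_mixture(X, k, n_iter=100, seed=0):
    """EM for a naive-Bayes (Bernoulli) mixture over binary features.
    Returns mixing weights, per-cluster feature probabilities, and soft
    assignments (responsibilities)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    pi = np.full(k, 1.0 / k)                      # mixing weights P(cluster)
    theta = rng.uniform(0.25, 0.75, size=(k, n))  # P(x_q = 1 | cluster)
    for _ in range(n_iter):
        # E-step: log-joint of each instance with each cluster, then normalize per row.
        log_p = (np.log(pi)[None, :]
                 + X @ np.log(theta).T
                 + (1.0 - X) @ np.log(1.0 - theta).T)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate from soft counts, with light smoothing to keep
        # probabilities strictly inside (0, 1).
        nk = resp.sum(axis=0)
        pi = (nk + 1e-3) / (m + k * 1e-3)
        theta = (resp.T @ X + 1.0) / (nk[:, None] + 2.0)
    return pi, theta, resp

# Usage: hard labels t (as used in the theorems) are the argmax responsibilities.
rng = np.random.default_rng(2)
X_obs = rng.integers(0, 2, size=(300, 40)).astype(float)
pi, theta, resp = em_naive_bayes_mixture(X_obs, k=4)
t = resp.argmax(axis=1)
```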
43 Our setup assumes that the instances are fixed and the observed features are randomly selected, and we try to maximize information on the unobserved features. [sent-281, score-0.595]
44 From our analysis, this is nearly the best clustering method for preserving information on the label, assuming that the label is yet another feature that happened to be unobserved in some instances. [sent-292, score-0.726]
45 The instances are denoted by {x[1], . . . , x[m]}, and the clustering algorithm clusters these instances into k clusters. [sent-310, score-0.606]
46 In our setup, however, we assume that the clustering algorithm has access only to the observed features over the m instances. [sent-317, score-0.567]
47 The goal of clustering is to achieve minimum intra-cluster variance of the unobserved features. [sent-318, score-0.697]
48 In our setup, the goal of the clustering algorithm is to create clusters with minimal unobserved intra-cluster variance (Dun). [sent-342, score-0.829]
49 We now use it to find a clustering that minimizes the expected unobserved intra-cluster variance, using only the observed features. [sent-371, score-0.752]
50 Let C1opt,un, . . . , Ckopt,un be the clustering that achieves the minimum unobserved intra-cluster variance under the constraint α({C1, . . . , Ck}) ≥ αc. [sent-375, score-0.697]
51 Let Ĉ1opt, . . . , Ĉkopt be the clustering with the minimum observed intra-cluster variance, under the same constraint. [sent-391, score-0.652]
52 Proof We now define a bad clustering as a clustering whose expected unobserved intra-cluster variance satisfies Dun > Dopt + ε. [sent-414, score-1.123]
53 Hence the probability that any of the bad clusterings has a lower observed intra-cluster variance than the best clustering is upper bounded by δ. [sent-420, score-0.617]
54 However, Theorem 12 means that an algorithm that selects the clustering with the minimum observed intra-cluster variance indirectly finds the clustering with nearly minimum unobserved intra-cluster variance. [sent-423, score-1.187]
55 If the intra-cluster variance of a new, previously unobserved movie is small, then we can estimate the rating of one user from the average ratings of other users in the same cluster. [sent-430, score-0.526]
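A minimal sketch of the prediction rule implied here, under the assumption that missing ratings are marked with NaN; the function name and the NaN convention are illustrative, not from the paper.

```python
import numpy as np

def predict_by_cluster_mean(t, R_new):
    """Predict ratings of movies that were *not* used for clustering by the mean
    rating of each such movie within the user's cluster.  R_new is a
    (users x new_movies) array with np.nan marking missing ratings."""
    t = np.asarray(t)
    pred = np.full(R_new.shape, np.nan)
    for c in np.unique(t):
        rows = (t == c)
        # nanmean returns NaN (with a warning) for a movie nobody in the cluster rated yet.
        pred[rows] = np.nanmean(R_new[rows], axis=0)
    return pred

# Usage sketch: t comes from clustering users on the observed movies; prediction
# error can then be measured on the held-out (unobserved) movies, for example:
# rmse = np.sqrt(np.nanmean((predict_by_cluster_mean(t, R_new) - R_new) ** 2))
```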
56 Similarly, for distance-based clustering we use k-means to examine the behavior of the observed and unobserved intra-cluster variances (see Definitions 8, 9). [sent-437, score-0.739]
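The following sketch illustrates this kind of check using scikit-learn's KMeans on toy data: cluster on the observed columns, then compare the observed and unobserved intra-cluster variances. The intra_cluster_variance helper is one plausible plug-in quantity; the paper's exact normalization in Definitions 8 and 9 may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def intra_cluster_variance(t, X):
    """Mean squared deviation of feature values from their cluster mean,
    used here for both the observed (Dob) and unobserved (Dun) columns."""
    t = np.asarray(t)
    total, count = 0.0, 0
    for c in np.unique(t):
        Xc = X[t == c]
        total += ((Xc - Xc.mean(axis=0)) ** 2).sum()
        count += Xc.size
    return total / count

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 60))
X_obs, X_unobs = X[:, :40], X[:, 40:]          # random split of the feature columns
t = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_obs)
print("Dob (observed)   ~", intra_cluster_variance(t, X_obs))
print("Dun (unobserved) ~", intra_cluster_variance(t, X_unobs))
```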
57 Clustering methods are used for collaborative filtering by clustering users based on the similarity of their ratings (see, for example, Marlin, 2004; Ungar and Foster, 1998). [sent-443, score-0.556]
58 In our context, the goal of the clustering is to maximize the information between the clusters and unobserved features, that is, movies that have not yet been rated by any of the users. [sent-447, score-0.936]
59 We use a subset of the movies from group “A” as observed features and all movies from group “B” as the unobserved features. [sent-472, score-0.636]
60 Since the observed features are selected at random, the statistics of missing values of the observed and unobserved features are the same. [sent-486, score-0.651]
61 Greedy IobMax Algorithm. For information-based clustering, we cluster the users using a simple greedy clustering algorithm (see Algorithm 1). [sent-494, score-0.539]
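A generic greedy sketch of this idea is shown below: start from a random assignment and repeatedly move each instance to the label that most increases the average observed information. It assumes the mi/avg_info helpers from the earlier sketch are in scope; the paper's Algorithm 1 may differ in its move set and stopping rule.

```python
import numpy as np

# Assumes the mi / avg_info helpers from the earlier information sketch are defined.
def greedy_iobmax(X_obs, k, n_sweeps=5, seed=0):
    """Greedy coordinate-ascent surrogate for IobMax: sweep over instances and
    reassign each one to the cluster label with the highest resulting Iob."""
    rng = np.random.default_rng(seed)
    m = X_obs.shape[0]
    t = rng.integers(0, k, size=m)
    for _ in range(n_sweeps):
        for i in range(m):
            scores = []
            for c in range(k):
                t[i] = c
                scores.append(avg_info(t, X_obs))
            t[i] = int(np.argmax(scores))
    return t
```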
62 Figure 3: Feature generalization as a function of the number of training features (movies) and the number of clusters; panels (e) and (f) show results for a fixed number of features (n=1200 and n=100) against the number of clusters (k). [sent-530, score-0.63]
63 (a), (b) and (e) show the observed and unobserved information for various numbers of features and clusters (high is good). [sent-531, score-0.592]
64 In this case the algorithm directly tries to find the clustering that maximizes the mean mutual information on features from group “B”. [sent-550, score-0.565]
65 On the other hand, as the number of observed features increases, the cluster variable, T = t(Z), captures the structure of the distribution (users’ tastes), and hence contains more information on unobserved features. [sent-558, score-0.572]
66 Informally, the achievability theorem (Theorem 6) tells us that for a large enough number of observed features, even though our clustering algorithm is based only on observed features, it can achieve nearly the best possible clustering in terms of I∗un. [sent-561, score-0.593]
67 This can be seen in Figures 3 (a,b), where Iun approaches I∗un, which is the unobserved information of the best clustering (Definition 4). [sent-562, score-0.673]
68 Again, for a small number of features (n) the clustering overfits the observed features, that is, Dob is relatively low but Dun is large. [sent-568, score-0.59]
69 However, for large n, Dun and Dob approach each other, and both approach the unobserved intra-cluster variance of the best possible clustering (Dopt), as expected. [sent-569, score-0.808]
70 Words and Documents. In this section we repeat the information-based clustering experiment, but this time for document clustering with words as features. [sent-590, score-0.831]
71 We show how a clustering that is based on a subset of the words (the observed words) is also informative about the unobserved words. [sent-591, score-0.696]
72 This helps us understand how the clustering learned from the observed words matches the unobserved words. [sent-652, score-0.762]
73 We can see, for example, that although the word “player” is not part of the inputs to the clustering algorithm, it appears much more frequently in the first cluster than in the other clusters. [sent-653, score-0.519]
74 This clustering reveals the hidden topics of the documents (sports, computers and religion), and these topics contain information on the unobserved words. [sent-656, score-0.69]
75 We see that generalization to unobserved features can be explained from the standpoint of a generative model (a hidden variable that represents the topics of the documents) or from a statistical point of view (the relationship between observed and unobserved information). [sent-657, score-0.79]
76 They do this by masking some features in the input data, that is, making them unobserved features, and training classifiers to predict these unobserved features from the observed features. [sent-664, score-0.854]
77 An example of such a property is the stability of the clustering with respect to the sampling process; for example, the clusters do not change significantly if we add some data points to our sample. [sent-678, score-0.522]
78 The idea of generalization to unobserved features by clustering was first presented in a short version of this paper (Krupka and Tishby, 2005). [sent-689, score-0.817]
79 Discussion. We introduce a new learning paradigm: clustering based on observed features that generalizes to unobserved features. [sent-691, score-0.85]
80 Our main results include two theorems that tell us how, without knowing the value of the unobserved features, one can estimate and maximize information between the clusters and the unobserved features. [sent-692, score-0.732]
81 Despite the very different assumptions of these models, we show that clustering by multinomial mixture models is nearly optimal in terms of maximizing information on unobserved features. [sent-697, score-0.802]
82 Moreover, we expect this unobserved information to be similar to the information the clustering gives us about the observed features. [sent-709, score-0.739]
83 Analogously to what we show for information-based clustering and multinomial mixture models, we show that this optimization goal of k-means is also optimal in terms of generalization to unobserved features. [sent-712, score-0.803]
84 Note that a contrary assumption to random selection would be that given two instances {x[1], x[2]}, there is a correlation between the distance |xq[1] − xq[2]| of a feature and the probability of observing this feature; for example, the probability of observing features that are similar is higher. [sent-714, score-0.703]
85 The difference between the observed and unobserved information is large only for a small portion of all possible partitions into observed and unobserved features. [sent-720, score-0.698]
86 The value of clustering which preserves information on unobserved features is that it enables us to learn new—previously unobserved—attributes from a small number of examples. [sent-722, score-0.784]
87 Suppose that after clustering fruits based on their observed features (Example 3), we eat a chinaberry and thus, [sent-723, score-0.616]
88 This is different from clustering that merely describes the observed measurements, and supports the rationale for defining the quality of clustering by its predictivity on unobserved features. [sent-730, score-1.142]
89 In this paper we proposed a clustering based on a subset of features, and analyzed the information that the clustering yielded on features outside this subset. [sent-750, score-0.891]
90 However, a general framework for generalization to both unobserved features and unobserved instances is still lacking. [sent-778, score-0.794]
91 Theorem 3—Proof outline: For the given m instances and any clustering t, draw uniformly and independently m instances (repeats allowed). [sent-796, score-0.558]
92 Since I(t(Z); xq[Z]) = −H(t(Z), xq[Z]) + H(t(Z)) + H(xq[Z]), we can upper bound the bias between the actual and the empirical estimate of the mutual information as follows: Eq̃1,. [sent-828, score-0.553]
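A tiny numeric check of this entropy identity on toy data (all names hypothetical): the mutual information computed directly from the empirical joint distribution equals H(t) + H(xq) − H(t, xq).

```python
import numpy as np

def entropy(z):
    """Empirical entropy (bits) of a discrete sample."""
    p = np.unique(z, return_counts=True)[1] / len(z)
    return -(p * np.log2(p)).sum()

def mi_from_joint(a, b):
    """I(A;B) computed directly from the empirical joint distribution."""
    a, b = np.asarray(a), np.asarray(b)
    total = 0.0
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                total += p_ab * np.log2(p_ab / (np.mean(a == av) * np.mean(b == bv)))
    return total

rng = np.random.default_rng(4)
t = rng.integers(0, 3, size=1000)               # cluster labels t(Z)
x = (t + rng.integers(0, 2, size=1000)) % 3     # a feature correlated with t
joint = t * (x.max() + 1) + x                   # unique integer code for each (t, x) pair
assert np.isclose(mi_from_joint(t, x),
                  entropy(t) + entropy(x) - entropy(joint))
```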
93 In terms of the distributions P(t(Z), xq[Z]), assigning a soft clustering to an instance can be approximated by a second empirical distribution, P̂, achieved by duplicating each of the instances and then using hard clustering. [sent-854, score-0.678]
94 Using hard clustering on the 100× larger set of instances, one can approximate any soft clustering of the original set with a quantization of P̂(T|X) in steps of 1/100. [sent-856, score-0.816]
95 Now we show that for any soft clustering of an instance, we can find a hard clustering of the same instance that has the same or a higher value of Iob (without changing the cluster identity of other instances). [sent-858, score-0.944]
96 This is enough to show that soft clustering is not required to achieve the maximum value of Iob , since any soft clustering can be replaced by hard clustering instance by instance. [sent-859, score-1.258]
97 where P̃i is created by keeping the same soft clustering of instances {1, . [sent-871, score-0.51]
98 . . . , m}, and replacing the soft clustering of the jth instance by a hard clustering t(j) = i. [sent-877, score-0.845]
99 In other words, we can replace the soft clustering of any instance j by a hard clustering without decreasing Iob . [sent-886, score-0.832]
100 The quality of the clustering with respect to a single variable, Xq , is defined by a (weighted) average distance of all pairs of instances within the same cluster (large distance means lower quality). [sent-908, score-0.627]
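For concreteness, one plausible plug-in version of this pairwise quality measure is sketched below; the weighting scheme and function name are assumptions, and the paper's exact definition may use a different normalization.

```python
import numpy as np

def intra_cluster_pairwise_distance(t, x, weights=None):
    """Weighted average distance |x_q[i] - x_q[j]| over all pairs of instances
    that share a cluster; larger values mean lower clustering quality."""
    t = np.asarray(t)
    x = np.asarray(x, dtype=float)
    m = len(t)
    w = np.ones(m) if weights is None else np.asarray(weights, dtype=float)
    num = den = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            if t[i] == t[j]:
                pair_w = w[i] * w[j]
                num += pair_w * abs(x[i] - x[j])
                den += pair_w
    return num / den if den > 0 else 0.0
```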
wordName wordTfidf (topN-words)
[('iob', 0.458), ('iun', 0.458), ('clustering', 0.39), ('unobserved', 0.283), ('xq', 0.236), ('dob', 0.207), ('dun', 0.185), ('cr', 0.146), ('clusters', 0.132), ('cluster', 0.112), ('features', 0.111), ('movies', 0.088), ('instances', 0.084), ('ratings', 0.083), ('iobmax', 0.082), ('nobserved', 0.082), ('ishby', 0.074), ('rupka', 0.074), ('eneralization', 0.069), ('observed', 0.066), ('xqi', 0.065), ('pr', 0.062), ('nbad', 0.06), ('movie', 0.057), ('clusterings', 0.056), ('dopt', 0.055), ('opt', 0.053), ('mixture', 0.052), ('mutual', 0.051), ('eatures', 0.05), ('fruits', 0.049), ('minr', 0.049), ('collaborative', 0.046), ('jl', 0.046), ('multinomial', 0.045), ('un', 0.045), ('lz', 0.044), ('xqn', 0.044), ('krupka', 0.042), ('rating', 0.042), ('theorem', 0.038), ('users', 0.037), ('soft', 0.036), ('eq', 0.035), ('generalization', 0.033), ('oranges', 0.033), ('tishby', 0.029), ('nb', 0.028), ('iq', 0.028), ('document', 0.028), ('bananas', 0.027), ('rated', 0.027), ('supervised', 0.026), ('variance', 0.024), ('words', 0.023), ('rth', 0.023), ('ob', 0.023), ('bad', 0.023), ('log', 0.022), ('feature', 0.022), ('likelihood', 0.021), ('unlabeled', 0.021), ('randomly', 0.021), ('dq', 0.019), ('lem', 0.019), ('theorems', 0.018), ('ltering', 0.018), ('bottleneck', 0.018), ('nearly', 0.017), ('sup', 0.017), ('word', 0.017), ('indirectly', 0.017), ('documents', 0.017), ('bound', 0.017), ('labels', 0.017), ('achievability', 0.016), ('hmle', 0.016), ('vitamin', 0.016), ('instance', 0.016), ('alphabet', 0.016), ('maximize', 0.016), ('globerson', 0.015), ('elidan', 0.015), ('maximizing', 0.015), ('xl', 0.014), ('distance', 0.014), ('match', 0.014), ('erm', 0.014), ('generative', 0.014), ('label', 0.014), ('analyze', 0.014), ('lemma', 0.014), ('denoted', 0.014), ('selected', 0.014), ('upper', 0.013), ('expected', 0.013), ('jth', 0.013), ('quality', 0.013), ('maximizes', 0.013), ('target', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
2 0.10685157 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
3 0.064010836 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
4 0.055232011 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
Author: Jorge Jambeiro Filho, Jacques Wainer
Abstract: We replaced the conditional probability tables of Bayesian network nodes whose parents have high cardinality with a multilevel empirical hierarchical Bayesian model called hierarchical pattern Bayes (HPB). The resulting Bayesian networks achieved significant performance improvements over Bayesian networks with the same structure and traditional conditional probability tables, over Bayesian networks with simpler structures like naïve Bayes and tree augmented naïve Bayes, over Bayesian networks where traditional conditional probability tables were substituted by noisy-OR gates, default tables, decision trees and decision graphs, and over Bayesian networks constructed after a cardinality reduction preprocessing phase using the agglomerative information bottleneck method. Our main tests took place in important fraud detection domains, which are characterized by the presence of high cardinality attributes and by the existence of relevant interactions among them. Other tests, over UCI data sets, show that HPB may have a quite wide applicability. Keywords: probabilistic reasoning, Bayesian networks, smoothing, hierarchical Bayes, empirical Bayes
5 0.042679992 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks. Keywords: hierarchical Bayes, latent variables, mean-field approximation, statistical network analysis, social networks, protein interaction networks
6 0.039128821 52 jmlr-2008-Learning from Multiple Sources
7 0.034743212 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
8 0.032925967 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
9 0.026802994 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
10 0.024117842 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
11 0.023262778 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
12 0.022692688 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2 (Machine Learning Open Source Software Paper)
13 0.022639276 87 jmlr-2008-Stationary Features and Cat Detection
14 0.022334343 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
15 0.021267535 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
16 0.020988885 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
17 0.020564033 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
18 0.020459788 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
19 0.020359024 96 jmlr-2008-Visualizing Data using t-SNE
20 0.020258659 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
topicId topicWeight
[(0, 0.113), (1, -0.033), (2, 0.0), (3, -0.016), (4, -0.117), (5, 0.102), (6, 0.12), (7, 0.006), (8, 0.071), (9, -0.026), (10, -0.076), (11, 0.057), (12, 0.023), (13, -0.001), (14, 0.04), (15, 0.015), (16, -0.122), (17, 0.04), (18, -0.056), (19, 0.067), (20, 0.329), (21, -0.051), (22, 0.028), (23, 0.142), (24, -0.256), (25, -0.221), (26, -0.058), (27, -0.033), (28, -0.217), (29, -0.005), (30, 0.077), (31, -0.078), (32, -0.259), (33, 0.217), (34, -0.107), (35, 0.096), (36, 0.124), (37, -0.126), (38, 0.143), (39, 0.005), (40, -0.175), (41, -0.094), (42, -0.015), (43, 0.045), (44, -0.087), (45, 0.101), (46, -0.054), (47, 0.127), (48, -0.111), (49, -0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.96248364 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
2 0.51671576 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
3 0.30573693 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
4 0.3045001 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
Author: Jorge Jambeiro Filho, Jacques Wainer
Abstract: We replaced the conditional probability tables of Bayesian network nodes whose parents have high cardinality with a multilevel empirical hierarchical Bayesian model called hierarchical pattern Bayes (HPB). The resulting Bayesian networks achieved significant performance improvements over Bayesian networks with the same structure and traditional conditional probability tables, over Bayesian networks with simpler structures like naïve Bayes and tree augmented naïve Bayes, over Bayesian networks where traditional conditional probability tables were substituted by noisy-OR gates, default tables, decision trees and decision graphs, and over Bayesian networks constructed after a cardinality reduction preprocessing phase using the agglomerative information bottleneck method. Our main tests took place in important fraud detection domains, which are characterized by the presence of high cardinality attributes and by the existence of relevant interactions among them. Other tests, over UCI data sets, show that HPB may have a quite wide applicability. Keywords: probabilistic reasoning, Bayesian networks, smoothing, hierarchical Bayes, empirical Bayes
5 0.29141843 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
Author: Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, Eric P. Xing
Abstract: Consider data consisting of pairwise measurements, such as presence or absence of links between pairs of objects. These data arise, for instance, in the analysis of protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing pairwise measurements with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. Here we introduce a class of variance allocation models for pairwise measurements: mixed membership stochastic blockmodels. These models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks. Keywords: hierarchical Bayes, latent variables, mean-field approximation, statistical network analysis, social networks, protein interaction networks
6 0.16317324 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
7 0.16171138 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
8 0.15786518 52 jmlr-2008-Learning from Multiple Sources
9 0.15077481 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
10 0.14025357 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
11 0.13149239 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
12 0.12963101 22 jmlr-2008-Closed Sets for Labeled Data
13 0.12400664 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
14 0.12206975 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
15 0.11438855 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
16 0.10986046 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.10671503 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
19 0.1027382 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
20 0.10248981 83 jmlr-2008-Robust Submodular Observation Selection
topicId topicWeight
[(0, 0.02), (5, 0.011), (40, 0.095), (54, 0.021), (58, 0.023), (60, 0.01), (66, 0.036), (76, 0.524), (88, 0.038), (92, 0.046), (94, 0.032), (99, 0.018)]
simIndex simValue paperId paperTitle
1 0.90704608 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
Author: Eric Perrier, Seiya Imoto, Satoru Miyano
Abstract: Classical approaches used to learn Bayesian network structure from data have disadvantages in terms of complexity and lower accuracy of their results. However, a recent empirical study has shown that a hybrid algorithm improves sensitively accuracy and speed: it learns a skeleton with an independency test (IT) approach and constrains on the directed acyclic graphs (DAG) considered during the search-and-score phase. Subsequently, we theorize the structural constraint by introducing the concept of super-structure S, which is an undirected graph that restricts the search to networks whose skeleton is a subgraph of S. We develop a super-structure constrained optimal search (COS): its time complexity is upper bounded by O(γm n ), where γm < 2 depends on the maximal degree m of S. Empirically, complexity depends on the average degree m and sparse structures ˜ allow larger graphs to be calculated. Our algorithm is faster than an optimal search by several orders and even finds more accurate results when given a sound super-structure. Practically, S can be approximated by IT approaches; significance level of the tests controls its sparseness, enabling to control the trade-off between speed and accuracy. For incomplete super-structures, a greedily post-processed version (COS+) still enables to significantly outperform other heuristic searches. Keywords: subset Bayesian networks, structure learning, optimal search, super-structure, connected
same-paper 2 0.88323396 38 jmlr-2008-Generalization from Observed to Unobserved Features by Clustering
Author: Eyal Krupka, Naftali Tishby
Abstract: We argue that when objects are characterized by many attributes, clustering them on the basis of a random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extends analogous results from the supervised learning setting. We use our framework to analyze generalization to unobserved features of two well-known clustering algorithms: k-means and the maximum likelihood multinomial mixture model. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes and document clustering with words as attributes. Keywords: clustering, unobserved features, learning theory, generalization in clustering, information bottleneck
3 0.86435425 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
Author: Xing Sun, Andrew B. Nobel
Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance
4 0.52966166 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
5 0.42795548 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
6 0.40021759 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
7 0.39965346 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
8 0.39053717 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents
9 0.38189772 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
10 0.38081807 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
11 0.3699359 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
13 0.36128119 22 jmlr-2008-Closed Sets for Labeled Data
14 0.35929424 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters
15 0.34892583 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.34751853 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.34373459 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
18 0.34117192 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.34107745 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
20 0.33942065 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model