nips nips2009 nips2009-260 knowledge-graph by maker-knowledge-mining

260 nips-2009-Zero-shot Learning with Semantic Output Codes

Source: pdf

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classiﬁer f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we deﬁne the notion of a semantic output code classiﬁer (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classiﬁer and study its theoretical properties in a PAC framework, showing conditions under which the classiﬁer can accurately predict novel classes. As a case study, we build a SOC classiﬁer for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the problem of zero-shot learning, where the goal is to learn a classiﬁer f : X → Y that must predict novel values of Y that were omitted from the training set. [sent-12, score-0.317]

2 To achieve this, we deﬁne the notion of a semantic output code classiﬁer (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. [sent-13, score-1.828]

3 As a case study, we build a SOC classiﬁer for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words. [sent-15, score-0.59]

4 Another example is in neural activity decoding, where the goal is to determine the word or object a person is thinking about by observing an image of that person’s neural activity. [sent-21, score-0.571]

5 It is intractable to collect neural training images for every possible word in English, so to build a practical neural decoder we must have a way to extrapolate to recognizing words beyond those in the training set. [sent-22, score-0.731]

6 Speech recognition systems succeed by leveraging a relatively small set of phoneme 1 recognizers in conjunction with a large knowledge base representing words as combinations of phonemes. [sent-26, score-0.323]

7 The general question this paper asks is: Given a semantic encoding of a large set of concept classes, can we build a classiﬁer to recognize classes that were omitted from the training set? [sent-29, score-1.068]

8 We show it is possible to build a classiﬁer that can recognize words a person is thinking about, even without training examples for those particular words. [sent-31, score-0.486]

9 (2008) on zero-data learning has shown the ability to predict novel classes of digits that were omitted from a training set. [sent-35, score-0.373]

10 In computer vision, techniques for sharing features across object classes have been investigated (Torralba & Murphy, 2007; Bart & Ullman, 2005) but relatively little work has focused on recognizing entirely novel classes, with the exception of Lampert et al. [sent-36, score-0.29]

11 (2008) has shown the ability to decode (from visual cortex activity) which novel visual scenes a person is viewing from a large set of possible images, but without recognizing the image content per se. [sent-40, score-0.324]

12 They use semantic features derived from corpus statistics to generate a neural activity pattern for any noun in English. [sent-42, score-0.886]

13 In our work, by contrast, we focus on word decoding, where given a novel neural image, we wish to predict the word from a large set of possible words. [sent-43, score-0.63]

14 We also consider semantic features that are derived from human labeling in addition to corpus statistics. [sent-44, score-0.742]

15 Further, we introduce a formalism for a zero-shot learner and provide theoretical guarantees on its ability to recognize novel classes omitted from a training set. [sent-45, score-0.466]

16 2 Classiﬁcation with Semantic Knowledge In this section we formalize the notion of a zero-shot learner that uses semantic knowledge to extrapolate to novel classes. [sent-46, score-0.923]

17 While a zero-shot learner could take many forms, we present one such model that utilizes an intermediate set of features derived from a semantic knowledge base. [sent-47, score-0.836]

18 Intuitively, our goal is to treat each class not as simply an atomic label, but instead represent it using a vector of semantic features characterizing a large number of possible classes. [sent-48, score-0.794]

19 Our models will learn the relationship between input data and the semantic features. [sent-49, score-0.657]

20 Given new input data, the models will predict a set of semantic features corresponding to that input, and then ﬁnd the class in the knowledge base that best matches that set of predicted features. [sent-51, score-1.122]

21 Signiﬁcantly, this procedure will even work for input data from a novel class if that class is included in the semantic knowledge base (i. [sent-52, score-1.063]

22 even if no input space representation is available for the class, but a feature encoding of it exists in the semantic knowledge base). [sent-54, score-0.877]

23 Semantic Feature Space A semantic feature space of p dimensions is a metric space in which each of the p dimensions encodes the value of a semantic property. [sent-57, score-1.465]

24 2 As an example, consider a semantic space for describing high-level properties of animals. [sent-59, score-0.657]

25 In this semantic feature space, the prototypical concept of dog might be represented as the point {1, 1, 0, 1, 0}. [sent-66, score-0.848]

26 Semantic Knowledge Base A semantic knowledge base K of M examples is a collection of pairs {f, y}1:M such that f ∈ F p is a point in a p dimensional semantic space F p and y ∈ Y is a class label from a set Y . [sent-68, score-1.63]

27 We assume a one-to-one encoding between class labels and points in the semantic feature space. [sent-69, score-0.864]

28 A knowledge base of animals would contain the semantic encoding and label for many animals. [sent-70, score-1.019]

29 For example, we may imagine some raw-input features from a digital image of a dog ﬁrst mapped into the semantic encoding of a dog described earlier, which is then mapped to the class label dog. [sent-73, score-1.273]

30 As a result, our class labels can be thought of as a semantic output code, similar in spirit to the errorcorrecting output codes of Dietterich and Bakiri (1995). [sent-74, score-0.853]

31 Typically, M >> N , meaning that data in semantic space is available for many more class labels than in the raw-input space. [sent-77, score-0.709]

32 Thus, A semantic output code classiﬁer can be useful when the knowledge base K covers more of the possible values for Y than are covered by the input data D. [sent-78, score-0.999]

33 To learn the mapping S, the classiﬁer ﬁrst builds a new set of N examples {x, f }1:N by replacing each y with the respective semantic encoding f according to its knowledge base K. [sent-79, score-0.984]

34 The intuition behind using this two-stage process is that the classiﬁer may be able to learn the relationship between the raw-input space and the individual dimensions of the semantic feature space from a relatively small number of training examples in the input space. [sent-80, score-0.834]

35 When a new example is presented, the classiﬁer will make a prediction about its semantic encoding using the learned S map. [sent-81, score-0.823]

36 Even when a new example belongs to a class that did not appear in the training set D, if the prediction produced by the S map is close to the true encoding of that class, then the L map will have a reasonable chance of recovering the correct label. [sent-82, score-0.447]

37 As a concrete example, if the model can predict the object has fur and a tail, it would have a good chance of recovering the class label dog, even without having seen images of dogs during training. [sent-83, score-0.42]

38 In short: By using a rich semantic encoding of the classes, the classiﬁer may be able to extrapolate and recognize novel classes. [sent-84, score-1.019]

39 3 Theoretical Analysis In this section we consider theoretical properties of a semantic output code classiﬁer that determine its ability to recognize instances of novel classes. [sent-85, score-1.018]

40 In other words, we will address the question: Under what conditions will the semantic output code classiﬁer recognize examples from classes omitted from its training set? [sent-86, score-1.128]

41 3 In answering this question, our goal is to obtain a PAC-style bound: we want to know how much error can be tolerated in the prediction of the semantic properties while still recovering the novel class with high probability. [sent-87, score-0.972]

42 The idea is that if the ﬁrst stage S(·) of the classiﬁer can predict the semantic properties well, then the second stage L(·) will have a good chance of recovering the correct label for instances from novel classes. [sent-89, score-1.112]

43 As a ﬁrst step towards a general theory of zero-shot learning, we will consider one instantiation of a semantic output code classiﬁer. [sent-90, score-0.815]

44 We will assume that semantic features are binary labels, the ﬁrst stage S(·) is a collection of PAC-learnable linear classiﬁers (one classiﬁer per feature), and the second stage L(·) is a 1-nearest neighbor classiﬁer using the Hamming distance metric. [sent-91, score-0.93]

45 We ﬁrst want to bound the amount of error we can tolerate given a prediction of semantic features. [sent-94, score-0.748]

46 To ﬁnd this bound, we deﬁne F to be the distribution in semantic feature space of points from the knowledge base K. [sent-95, score-0.891]

47 Clearly points (classes) in semantic space may not be equidistant from each other. [sent-96, score-0.657]

48 A point might be far from others, which would allow more room for error in the prediction of semantic features for this point, while maintaining the ability to recover its unique identity (label). [sent-97, score-0.833]

49 Conversely, a point close to others in semantic space will have lower tolerance of error. [sent-98, score-0.657]

50 In short, the tolerance for error is relative to a particular point in relation to other points in semantic space. [sent-99, score-0.687]

51 Let d(q, q ) be the distance between the prediction q and some other point q representing a class in the semantic space. [sent-101, score-0.804]

52 Now suppose that we deﬁne τq to be the distance a prediction q for raw-input example x is from the true semantic encoding of the class to which x belongs. [sent-104, score-0.939]

53 We assume in this analysis q max that we have p binary semantic features and a Hamming distance metric, so τq deﬁnes the total 4 number of mistakes we can make predicting the binary features. [sent-111, score-0.932]

54 Note with our assumptions, each semantic feature is PAC-learnable using a linear classiﬁer from a d dimensional raw input space. [sent-112, score-0.707]

55 To simplify the analysis, we will treat each of the p semantic features as independently learned. [sent-113, score-0.742]

56 probability of the classiﬁer making a mistake) of each of the p learned hypotheses is , then the expected number of mistakes over the p semantic features will be max max max τq if we set = τq /p. [sent-116, score-0.925]

57 The binomial CDF yields max the probability of making at most τq mistakes total, and the (1 − γ) term above speciﬁes the probability of recovering the true class if a maximum of this many mistakes were made. [sent-120, score-0.327]

58 4 Case Study: Neural Decoding of Novel Thoughts In this section we empirically evaluate a semantic output code classiﬁer on a neural decoding task. [sent-125, score-0.916]

59 The objective is to decode novel words a person is thinking about from fMRI images of the person’s neural activity, without including example fMRI images of those words during training. [sent-126, score-0.745]

60 This dataset contains the neural activity observed from nine human participants while viewing 60 different concrete words (5 examples from 12 different categories). [sent-129, score-0.487]

61 Each participant was shown a word and a small line drawing of the concrete object the word represents. [sent-131, score-0.518]

62 The participants were asked to think about the properties of these objects for several seconds while images of their brain activity were recorded. [sent-132, score-0.288]

63 1 The VC dimension of linear classiﬁers in d dimensions is d + 1 5 In the language of the semantic output code classiﬁer, this dataset represents the collection D of raw-input space examples. [sent-138, score-0.849]

64 We also collected two semantic knowledge bases for these 60 words. [sent-139, score-0.722]

65 In the ﬁrst semantic knowledge base, corpus5000, each word is represented as a co-occurrence vector with the 5000 most frequent words from the Google Trillion-Word-Corpus2 . [sent-140, score-1.069]

66 The second semantic knowledge base, human218, was created using the Mechanical Turk human computation service from Amazon. [sent-141, score-0.722]

67 There were 218 semantic features collected for the 60 words, and the questions were selected to reﬂect psychological conjectures about neural activity encoding. [sent-143, score-0.93]

68 2 Model In our experiments, we use multiple output linear regression to learn the S(·) map of the semantic output code classiﬁer. [sent-150, score-0.887]

69 Let X ∈ N ∗d be a training set of fMRI examples where each row is the image for a particular word and d is the number of dimensions of the fMRI image. [sent-151, score-0.385]

70 Let Y ∈ N ∗p be a matrix of semantic features for those words (obtained from the knowledge base K) where p is the number of semantic features for that word (e. [sent-153, score-2.015]

71 Given a novel fMRI image x, we can obtain a prediction f of the semantic features for this image by multiplying the image by the weights: f=x·W For the second stage of the semantic output code classiﬁer, L(·), we simply use a 1-nearest neighbor classiﬁer. [sent-159, score-1.997]

72 In other words, L(f) will take the prediction of features and return the closest point in a given knowledge base according the Euclidean distance (L2 ) metric. [sent-160, score-0.414]

73 Speciﬁcally, we trained the model in Equation 3 to learn the mapping between 58 fMRI images and the semantic features for their respective words. [sent-166, score-0.805]

74 For the ﬁrst held out image, we applied the learned weight matrix to obtain a prediction of the semantic features, and then we used a 1-nearest neighbor classiﬁer to compare the vector of predictions to the true semantic encodings of the two held-out words. [sent-167, score-1.473]

75 The label was chosen by selecting the word with the encoding closest to the prediction for the fMRI image. [sent-168, score-0.466]

76 Table 1 shows the results for two different semantic feature encodings. [sent-172, score-0.707]

77 We see that the human218 semantic features signiﬁcantly outperformed the corpus5000 features, with mean accuracies over the nine participants of 80. [sent-173, score-0.896]

78 But for both feature sets, we see that it is possible to discriminate between two novel classes for each of the nine participants. [sent-176, score-0.317]

79 Figure 1: Ten semantic features from the human218 knowledge base for the words bear and dog. [sent-221, score-1.174]

80 The true encoding is shown along with the predicted encoding when fMRI images for bear and dog were left out of the training set. [sent-222, score-0.687]

81 Figure 1 shows ten semantic questions (features) from the human218 dataset. [sent-225, score-0.701]

82 The graph shows the true values along with the predicted feature values for both bear and dog when trained on the other 58 words. [sent-226, score-0.409]

83 For both of these novel words, the features predicted from the neural data were closest to the true word. [sent-230, score-0.393]

84 Given the success of the semantic output code classiﬁer at discriminating between the brain images for two novel words, we now consider the much harder problem of discriminating a novel word from a large set of candidate words. [sent-233, score-1.457]

85 To test this ability, we performed a leave-one-out-cross-validation, where we trained using Equation 3 on images and semantic features for 59 words. [sent-234, score-0.805]

86 The ﬁrst was mri60 which is the collection of all 60 concrete nouns for which we collected fMRI data, including the 59 training words and the single held out word. [sent-237, score-0.298]

87 On 12 of 540 total presentations of the mri60 words (60 presentations for each of nine participants), the human218 features predicted the single held-out word above all 59 other words in its training set. [sent-243, score-0.764]

88 While just a bit above chance level (9/540), the fact that the model ever chooses the held-out word over all the training words is noteworthy since the model is undoubtedly biased towards predicting feature values similar to the words on which it was trained. [sent-244, score-0.695]

89 On the noun940 words, the model predicted the correct word from the set of 941 alternatives a total of 26 times for the human218 features and 22 times for the corpus5000 features. [sent-245, score-0.401]

90 Table 2: The top ﬁve predicted words for a novel fMRI image taken for the word in bold (all fMRI images taken from participant P1). [sent-248, score-0.707]

91 The number in the parentheses contains the rank of the correct word selected from 941 concrete nouns in English. [sent-249, score-0.403]

92 The chance accuracy of predicting a word correctly is only 0. [sent-251, score-0.312]

93 The prediction of words in the categories animals, body parts, foods, tools, and vehicles typically perform well, while the words in the categories furniture, man-made items, and insects often perform poorly. [sent-256, score-0.375]

94 Even when the correct word is not the closest match, the words that best match the predicted features are often very similar to the held-out word. [sent-257, score-0.59]

95 Table 2 shows the top ﬁve predicted words for eight different held-out fMRI images for participant P1 (i. [sent-258, score-0.331]

96 the 5 closest words in the set of 941 to the predicted vector of semantic features). [sent-260, score-0.925]

97 5 Conclusion We presented a formalism for a zero-shot learning algorithm known as the semantic output code classiﬁer. [sent-261, score-0.859]

98 This classiﬁer can predict novel classes that were omitted from a training set by leveraging a semantic knowledge base that encodes features common to both the novel classes and the training set. [sent-262, score-1.561]

99 We demonstrated this semantic output code classiﬁer on the task of neural decoding using semantic knowledge bases derived from both human labeling and corpus statistics. [sent-264, score-1.638]

100 We have shown that training images of brain activity are not required for every word we would like a classiﬁer to recognize. [sent-266, score-0.496]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('semantic', 0.657), ('word', 0.208), ('fmri', 0.207), ('er', 0.185), ('dog', 0.141), ('words', 0.139), ('gq', 0.133), ('mitchell', 0.121), ('classi', 0.12), ('base', 0.119), ('novel', 0.118), ('activity', 0.113), ('bear', 0.109), ('encoding', 0.105), ('code', 0.086), ('features', 0.085), ('recognize', 0.085), ('omitted', 0.079), ('predicted', 0.079), ('mistakes', 0.078), ('output', 0.072), ('person', 0.071), ('decoding', 0.07), ('neighbor', 0.068), ('thinking', 0.067), ('pac', 0.067), ('knowledge', 0.065), ('predict', 0.065), ('images', 0.063), ('truck', 0.062), ('rank', 0.062), ('chance', 0.061), ('prediction', 0.061), ('nine', 0.059), ('brain', 0.057), ('classes', 0.056), ('participants', 0.055), ('training', 0.055), ('mq', 0.054), ('recovering', 0.054), ('decode', 0.054), ('extrapolate', 0.054), ('concrete', 0.052), ('median', 0.052), ('predictors', 0.052), ('nouns', 0.052), ('class', 0.052), ('participant', 0.05), ('ciaccia', 0.05), ('closest', 0.05), ('image', 0.05), ('feature', 0.05), ('questions', 0.044), ('formalism', 0.044), ('stage', 0.043), ('predicting', 0.043), ('label', 0.042), ('bakiri', 0.041), ('binocdf', 0.041), ('celery', 0.041), ('farhadi', 0.041), ('plaut', 0.041), ('screwdriver', 0.041), ('snodgrass', 0.041), ('waibel', 0.041), ('rq', 0.04), ('accuracies', 0.04), ('discriminating', 0.039), ('examples', 0.038), ('vehicles', 0.036), ('patella', 0.036), ('familiarity', 0.036), ('bus', 0.036), ('ers', 0.035), ('max', 0.035), ('distance', 0.034), ('dimensions', 0.034), ('discriminate', 0.034), ('pittsburgh', 0.034), ('lampert', 0.033), ('factory', 0.033), ('decoder', 0.033), ('turk', 0.033), ('encodes', 0.033), ('nearest', 0.032), ('animals', 0.031), ('recognizing', 0.031), ('build', 0.031), ('kay', 0.031), ('ullman', 0.031), ('bart', 0.031), ('dogs', 0.031), ('mechanical', 0.031), ('foot', 0.031), ('neural', 0.031), ('speech', 0.03), ('error', 0.03), ('true', 0.03), ('learner', 0.029), ('correct', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 260 nips-2009-Zero-shot Learning with Semantic Output Codes

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

2 0.32777968 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei

Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject ﬁrst studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle ﬁlter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1

3 0.18494762 96 nips-2009-Filtering Abstract Senses From Image Search Results

Author: Kate Saenko, Trevor Darrell

Abstract: We propose an unsupervised method that, given a word, automatically selects non-abstract senses of that word from an online ontology and generates images depicting the corresponding entities. When faced with the task of learning a visual model based only on the name of an object, a common approach is to ﬁnd images on the web that are associated with the object name and train a visual classiﬁer from the search result. As words are generally polysemous, this approach can lead to relatively noisy models if many examples due to outlier senses are added to the model. We argue that images associated with an abstract word sense should be excluded when training a visual classiﬁer to learn a model of a physical object. While image clustering can group together visually coherent sets of returned images, it can be difﬁcult to distinguish whether an image cluster relates to a desired object or to an abstract sense of the word. We propose a method that uses both image features and the text associated with the images to relate latent topics to particular senses. Our model does not require any human supervision, and takes as input only the name of an object category. We show results of retrieving concrete-sense images in two available multimodal, multi-sense databases, as well as experiment with object classiﬁers trained on concrete-sense images returned by our method for a set of ten common ofﬁce objects. 1

4 0.14953814 190 nips-2009-Polynomial Semantic Indexing

Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

Abstract: We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or documentdocument pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods. 1

5 0.12416276 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efﬁcient boosting algorithm that uses additional weights/priors on the base classiﬁers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classiﬁers. The algorithm’s primary advantage is in applications where the trained classiﬁer is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

6 0.10923301 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

7 0.10631792 211 nips-2009-Segmenting Scenes by Matching Image Composites

8 0.10350317 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

9 0.098671973 201 nips-2009-Region-based Segmentation and Object Detection

10 0.098238774 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

11 0.097294003 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

12 0.090245888 157 nips-2009-Multi-Label Prediction via Compressed Sensing

13 0.087529011 112 nips-2009-Human Rademacher Complexity

14 0.078780137 71 nips-2009-Distribution-Calibrated Hierarchical Classification

15 0.076057121 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

16 0.075327173 70 nips-2009-Discriminative Network Models of Schizophrenia

17 0.075109638 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

18 0.074109927 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

19 0.072223239 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

20 0.071974523 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.23), (1, -0.139), (2, -0.153), (3, -0.061), (4, 0.015), (5, 0.084), (6, -0.158), (7, -0.094), (8, -0.094), (9, 0.239), (10, 0.017), (11, -0.03), (12, 0.022), (13, 0.031), (14, 0.09), (15, -0.04), (16, 0.03), (17, 0.064), (18, -0.037), (19, -0.01), (20, 0.119), (21, 0.014), (22, 0.017), (23, 0.039), (24, 0.006), (25, 0.058), (26, -0.002), (27, 0.033), (28, -0.051), (29, -0.068), (30, -0.101), (31, 0.011), (32, 0.06), (33, 0.079), (34, 0.02), (35, -0.146), (36, 0.153), (37, 0.023), (38, -0.083), (39, -0.058), (40, -0.042), (41, 0.137), (42, 0.055), (43, -0.011), (44, 0.188), (45, 0.145), (46, -0.042), (47, -0.096), (48, 0.061), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96737826 260 nips-2009-Zero-shot Learning with Semantic Output Codes

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

2 0.79090995 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei

3 0.63484842 96 nips-2009-Filtering Abstract Senses From Image Search Results

Author: Kate Saenko, Trevor Darrell

4 0.57328647 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory

Author: Harold Pashler, Nicholas Cepeda, Robert Lindsey, Ed Vul, Michael C. Mozer

Abstract: When individuals learn facts (e.g., foreign language vocabulary) over multiple study sessions, the temporal spacing of study has a signiﬁcant impact on memory retention. Behavioral experiments have shown a nonmonotonic relationship between spacing and retention: short or long intervals between study sessions yield lower cued-recall accuracy than intermediate intervals. Appropriate spacing of study can double retention on educationally relevant time scales. We introduce a Multiscale Context Model (MCM) that is able to predict the inﬂuence of a particular study schedule on retention for speciﬁc material. MCM’s prediction is based on empirical data characterizing forgetting of the material following a single study session. MCM is a synthesis of two existing memory models (Staddon, Chelaru, & Higa, 2002; Raaijmakers, 2003). On the surface, these models are unrelated and incompatible, but we show they share a core feature that allows them to be integrated. MCM can determine study schedules that maximize the durability of learning, and has implications for education and training. MCM can be cast either as a neural network with inputs that ﬂuctuate over time, or as a cascade of leaky integrators. MCM is intriguingly similar to a Bayesian multiscale model of memory (Kording, Tenenbaum, & Shadmehr, 2007), yet MCM is better able to account for human declarative memory. 1

5 0.55889249 190 nips-2009-Polynomial Semantic Indexing

Author: Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

6 0.55545431 112 nips-2009-Human Rademacher Complexity

7 0.52229905 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

8 0.48460874 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

9 0.45547131 71 nips-2009-Distribution-Calibrated Hierarchical Classification

10 0.43965694 47 nips-2009-Boosting with Spatial Regularization

11 0.42082453 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

12 0.41888231 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

13 0.39932221 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

14 0.39647797 157 nips-2009-Multi-Label Prediction via Compressed Sensing

15 0.39452806 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

16 0.39243406 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information

17 0.38611886 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

18 0.37727904 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

19 0.37640676 237 nips-2009-Subject independent EEG-based BCI decoding

20 0.37633154 25 nips-2009-Adaptive Design Optimization in Experiments with People

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.174), (24, 0.047), (25, 0.077), (35, 0.069), (36, 0.131), (39, 0.064), (55, 0.012), (58, 0.084), (61, 0.012), (71, 0.136), (86, 0.085), (91, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91831899 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

Author: Dimitris Margaritis

Abstract: In this paper we address the problem of provably correct feature selection in arbitrary domains. An optimal solution to the problem is a Markov boundary, which is a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain. While numerous algorithms for this problem have been proposed, their theoretical correctness and practical behavior under arbitrary probability distributions is unclear. We address this by introducing the Markov Boundary Theorem that precisely characterizes the properties of an ideal Markov boundary, and use it to develop algorithms that learn a more general boundary that can capture complex interactions that only appear when the values of multiple features are considered together. We introduce two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and show that they perform well on artiﬁcial as well as benchmark and real-world data sets. Throughout the paper we make minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, which gives these algorithms universal applicability. 1 Introduction and Motivation The problem of feature selection has a long history due to its signiﬁcance in a wide range of important problems, from early ones like pattern recognition to recent ones such as text categorization, gene expression analysis and others. In such domains, using all available features may be prohibitively expensive, unnecessarily wasteful, and may lead to poor generalization performance, especially in the presence of irrelevant or redundant features. Thus, selecting a subset of features of the domain for use in subsequent application of machine learning algorithms has become a standard preprocessing step. A typical task of these algorithms is learning a classiﬁer: Given a number of input features and a quantity of interest, called the target variable, choose a member of a family of classiﬁers that can predict the target variable’s value as well as possible. Another task is understanding the domain and the quantities that interact with the target quantity. Many algorithms have been proposed for feature selection. Unfortunately, little attention has been paid to the issue of their behavior under a variety of application domains that can be encountered in practice. In particular, it is known that many can fail under certain probability distributions such as ones that contain a (near) parity function [1], which contain interactions that only appear when the values of multiple features are considered together. There is therefore an acute need for algorithms that are widely applicable and can be theoretically proven to work under any probability distribution. In this paper we present two such algorithms, an exact and a more practical randomized approximate one. We use the observation (ﬁrst made in Koller and Sahami [2]) that an optimal solution to the problem is a Markov boundary, deﬁned to be a minimal set of features that make the probability distribution of a target variable conditionally invariant to the state of all other features in the domain (a more precise deﬁnition is given later in Section 3) and present a family of algorithms for learning 1 the Markov boundary of a target variable in arbitrary domains. We ﬁrst introduce a theorem that exactly characterizes the minimal set of features necessary for probabilistically isolating a variable, and then relax this deﬁnition to derive a family of algorithms that learn a parameterized approximation of the ideal boundary that are provably correct under a minimal set of assumptions, including a set of axioms that hold for any probability distribution. In the following section we present work on feature selection, followed by notation and deﬁnitions in Section 3. We subsequently introduce an important theorem and the aforementioned parameterized family of algorithms in Sections 4 and 5 respectively, including a practical anytime version. We evaluate these algorithms in Section 6 and conclude in Section 7. 2 Related Work Numerous algorithms have been proposed for feature selection. At the highest level algorithms can be classiﬁed as ﬁlter, wrapper, or embedded methods. Filter methods work without consulting the classiﬁer (if any) that will make use of their output i.e., the resulting set of selected features. They therefore have typically wider applicability since they are not tied to any particular classiﬁer family. In contrast, wrappers make the classiﬁer an integral part of their operation, repeatedly invoking it to evaluate each of a sequence of feature subsets, and selecting the subset that results in minimum estimated classiﬁcation error (for that particular classiﬁer). Finally, embedded algorithms are classiﬁer-learning algorithms that perform feature selection implicitly during their operation e.g., decision tree learners. Early work was motivated by the problem of pattern recognition which inherently contains a large number of features (pixels, regions, signal responses at multiple frequencies etc.). Narendra and Fukunaga [3] ﬁrst cast feature selection as a problem of maximization of an objective function over the set of features to use, and proposed a number of search approaches including forward selection and backward elimination. Later work by machine learning researchers includes the FOCUS algorithm of Almuallim and Dietterich [4], which is a ﬁlter method for deterministic, noise-free domains. The RELIEF algorithm [5] instead uses a randomized selection of data points to update a weight assigned to each feature, selecting the features whose weight exceeds a given threshold. A large number of additional algorithms have appeared in the literature, too many to list here—good surveys are included in Dash and Liu [6]; Guyon and Elisseeff [1]; Liu and Motoda [7]. An important concept for feature subset selection is relevance. Several notions of relevance are discussed in a number of important papers such as Blum and Langley [8]; Kohavi and John [9]. The argument that the problem of feature selection can be cast as the problem of Markov blanket discovery was ﬁrst made convincingly in Koller and Sahami [2], who also presented an algorithm for learning an approximate Markov blanket using mutual information. Other algorithms include the GS algorithm [10], originally developed for learning of the structure of a Bayesian network of a domain, and extensions to it [11] including the recent MMMB algorithm [12]. Meinshausen and B¨ hlmann [13] u recently proposed an optimal theoretical solution to the problem of learning the neighborhood of a Markov network when the distribution of the domain can be assumed to be a multidimensional Gaussian i.e., linear relations among features with Gaussian noise. This assumption implies that the Composition axiom holds in the domain (see Pearl [14] for a deﬁnition of Composition); the difference with our work is that we address here the problem in general domains where it may not necessarily hold. 3 Notation and Preliminaries In this section we present notation, fundamental deﬁnitions and axioms that will be subsequently used in the rest of the paper. We use the term “feature” and “variable” interchangeably, and denote variables by capital letters (X, Y etc.) and sets of variables by bold letters (S, T etc.). We denote the set of all variables/features in the domain (the “universe”) by U. All algorithms presented are independence-based, learning the Markov boundary of a given target variable using the truth value of a number of conditional independence statements. The use of conditional independence for feature selection subsumes many other criteria proposed in the literature. In particular, the use of classiﬁcation accuracy of the target variable can be seen as a special case of testing for its conditional independence with some of its predictor variables (conditional on the subset selected at any given moment). A beneﬁt of using conditional independence is that, while classiﬁcation error estimates depend on the classiﬁer family used, conditional independence does not. In addition, algorithms utilizing conditional independence for feature selection are applicable to all domain types, 2 e.g., discrete, ordinal, continuous with non-linear or arbitrary non-degenerate associations or mixed domains, as long as a reliable estimate of probabilistic independence is available. We denote probabilistic independence by the symbol “ ⊥ ” i.e., (X⊥ Y | Z) denotes the fact ⊥ ⊥ that the variables in set X are (jointly) conditionally independent from those in set Y given the values of the variables in set Z; (X⊥ Y | Z) denotes their conditional dependence. We assume ⊥ the existence of a probabilistic independence query oracle that is available to answer any query of the form (X, Y | Z), corresponding to the question “Is the set of variables in X independent of the variables in Y given the value of the variables in Z?” (This is similar to the approach of learning from statistical queries of Kearns and Vazirani [15].) In practice however, such an oracle does not exist, but can be approximated by a statistical independence test on a data set. Many tests of independence have appeared and studied extensively in the statistical literature over the last century; in this work we use the χ2 (chi-square) test of independence [16]. A Markov blanket of variable X is a set of variables such that, after ﬁxing (by “knowing”) the value of all of its members, the set of remaining variables in the domain, taken together as a single setvalued variable, are statistically independent of X. More precisely, we have the following deﬁnition. Deﬁnition 1. A set of variables S ⊆ U is called a Markov blanket of variable X if and only if (X⊥ U − S − {X} | S). ⊥ Intuitively, a Markov blanket S of X captures all the information in the remaining domain variables U − S − {X} that can affect the probability distribution of X, making their value redundant as far as X is concerned (given S). The blanket therefore captures the essence of the feature selection problem for target variable X: By completely “shielding” X, a Markov blanket precludes the existence of any possible information about X that can come from variables not in the blanket, making it an ideal solution to the feature selection problem. A minimal Markov blanket is called a Markov boundary. Deﬁnition 2. A set of variables S ⊆ U − {X} is called a Markov boundary of variable X if it is a minimal Markov blanket of X i.e., none of its proper subsets is a Markov blanket. Pearl [14] proved that that the axioms of Symmetry, Decomposition, Weak Union, and Intersection are sufﬁcient to guarantee a unique Markov boundary. These are shown below together with the axiom of Contraction. (Symmetry) (Decomposition) (Weak Union) (Contraction) (Intersection) (X⊥ Y | Z) =⇒ (Y⊥ X | Z) ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z) ∧ (X⊥ W | Z) ⊥ ⊥ ⊥ (X⊥ Y ∪ W | Z) =⇒ (X⊥ Y | Z ∪ W) ⊥ ⊥ (X⊥ Y | Z) ∧ (X⊥ W | Y ∪ Z) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (X⊥ Y | Z ∪ W) ∧ (X⊥ W | Z ∪ Y) =⇒ (X⊥ Y ∪ W | Z) ⊥ ⊥ ⊥ (1) The Symmetry, Decomposition, Contraction and Weak Union axioms are very general: they are necessary axioms for the probabilistic deﬁnition of independence i.e., they hold in every probability distribution, as their proofs are based on the axioms of probability theory. Intersection is not universal but it holds in distributions that are positive, i.e., any value combination of the domain variables has a non-zero probability of occurring. 4 The Markov Boundary Theorem According to Deﬁnition 2, a Markov boundary is a minimal Markov blanket. We ﬁrst introduce a theorem that provides an alternative, equivalent deﬁnition of the concept of Markov boundary that we will relax later in the paper to produce a more general boundary deﬁnition. Theorem 1 (Markov Boundary Theorem). Assuming that the Decomposition and Contraction axioms hold, S ⊆ U − {X} is a Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X}, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ (2) A detailed proof cannot be included here due to space constraints but a proof sketch appears in Appendix A. According to the above theorem, a Markov boundary S partitions the powerset of U − {X} into two parts: (a) set P1 that contains all subsets of U − S, and (b) set P2 containing the remaining subsets. All sets in P1 are conditionally independent of X, and all sets in P2 are conditionally dependent with X. Intuitively, the two directions of the logical equivalence relation of Eq. (2) correspond to the concept of Markov blanket and its minimality i.e., the equation ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X⊥ T | S − T) ⊥ 3 Algorithm 1 The abstract GS(m) (X) algorithm. Returns an m-Markov boundary of X. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: S←∅ /* Growing phase. */ for all Y ⊆ U − S − {X} such that 1 ≤ |Y| ≤ m do if (X ⊥ Y | S) then ⊥ S←S∪Y goto line 3 /* Restart loop. */ /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 8 /* Restart loop. */ return S or, equivalently, (∀ T ⊆ U − S − {X}, (X⊥ T | S)) (as T and S are disjoint) corresponds to ⊥ the deﬁnition of Markov blanket, as it includes T = U − S − {X}. In the opposite direction, the contrapositive form is ∀ T ⊆ U − {X}, T ⊆ U − S =⇒ (X ⊥ T | S − T) . ⊥ This corresponds to the concept of minimality of the Markov boundary: It states that all sets that contain a part of S cannot be independent of X given the remainder of S. Informally, this is because if there existed some set T that contained a non-empty subset T′ of S such that (X⊥ T | S − T), ⊥ then one would be able to shrink S by T′ (by the property of Contraction) and therefore S would not be minimal (more details in Appendix A). 5 A Family of Algorithms for Arbitrary Domains Theorem 1 deﬁnes conditions that precisely characterize a Markov boundary and thus can be thought of as an alternative deﬁnition of a boundary. By relaxing these conditions we can produce a more general deﬁnition. In particular, an m-Markov boundary is deﬁned as follows. Deﬁnition 3. A set of variables S ⊆ U − {X} of a domain U is called an m-Markov boundary of variable X ∈ U if and only if ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − S ⇐⇒ (X⊥ T | S − T) . ⊥ We call the parameter m of an m-Markov boundary the Markov boundary margin. Intuitively, an m-boundary S guarantees that (a) all subsets of its complement (excluding X) of size m or smaller are independent of X given S, and (b) all sets T of size m or smaller that are not subsets of its complement are dependent of X given the part of S that is not contained in T. This deﬁnition is a special case of the properties of a boundary stated in Theorem 1, with each set T mentioned in the theorem now restricted to having size m or smaller. For m = n − 1, where n = |U |, the condition |T| ≤ m is always satisﬁed and can be omitted; in this case the deﬁnition of an (n − 1)-Markov boundary results in exactly Eq. (2) of Theorem 1. We now present an algorithm called GS(m) , shown in Algorithm 1, that provably correctly learns an m-boundary of a target variable X. GS(m) operates in two phases, a growing and a shrinking phase (hence the acronym). During the growing phase it examines sets of variables of size up to m, where m is a user-speciﬁed parameter. During the shrinking phase, single variables are examined for conditional independence and possible removal from S (examining sets in the shrinking phase is not necessary for provably correct operation—see Appendix B). The orders of examination of the sets for possible addition and deletion from the candidate boundary are left intentionally unspeciﬁed in Algorithm 1—one can therefore view it as an abstract representative of a family of algorithms, with each member specifying one such ordering. All members of this family are m-correct, as the proof of correctness does not depend on the ordering. In practice numerous choices for the ordering exist; one possibility is to examine the sets in the growing phase in order of increasing set size and, for each such size, in order of decreasing conditional mutual information I(X, Y, S) between X and Y given S. The rationale for this heuristic choice is that (usually) tests with smaller conditional sets tend to be more reliable, and sorting by mutual information tends to lessen the chance of adding false members of the Markov boundary. We used this implementation in all our experiments, presented later in Section 6. Intuitively, the margin represents a trade-off between sample and computational complexity and completeness: For m = n − 1 = |U| − 1, the algorithm returns a Markov boundary in unrestricted 4 Algorithm 2 The RGS(m,k) (X) algorithm, a randomized anytime version of the GS(m) algorithm, utilizing k random subsets for the growing phase. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: S←∅ /* Growing phase. */ repeat Schanged ← false Y ← subset of U − S − {X} of size 1 ≤ |Y| ≤ m of maximum dependence out of k random subsets if (X ⊥ Y | S) then ⊥ S←S∪Y Schanged ← true until Schanged = false /* Shrinking phase. */ for all Y ∈ S do if (X⊥ Y | S − {Y }) then ⊥ S ← S − {Y } goto line 11 /* Restart loop. */ return S (arbitrary) domains. For 1 ≤ m < n − 1, GS(m) may recover the correct boundary depending on characteristics of the domain. For example, it will recover the correct boundary in domains containing embedded parity functions such that the number of variables involved in every k-bit parity function is m + 1 or less i.e., if k ≤ m + 1 (parity functions are corner cases in the space of probability distributions that are known to be hard to learn [17]). The proof of m-correctness of GS(m) is included in Appendix B. Note that it is based on Theorem 1 and the universal axioms of Eqs. (1) only i.e., Intersection is not needed, and thus it is widely applicable (to any domain). A Practical Randomized Anytime Version While GS(m) is provably correct even in difﬁcult domains such as those that contain parity functions, it may be impractical with a large number of features as its asymptotic complexity is O(nm ). We therefore also we here provide a more practical randomized version called RGS(m,k) (Randomized GS(m) ), shown in Algorithm 2. The RGS(m,k) algorithm has an additional parameter k that limits its computational requirements: instead of exhaustively examining all possible subsets of (U −S−{X}) (as GS(m) does), it instead samples k subsets from the set of all possible subsets of (U − S − {X}), where k is user-speciﬁed. It is therefore a randomized algorithm that becomes equivalent to GS(m) given a large enough k. Many possibilities for the method of random selection of the subsets exist; in our experiments we select a subset Y = {Yi } (1 ≤ |Y| ≤ m) with probability proportional |Y| to i=1 (1/p(X, Yi | S)), where p(X, Yi | S) is the p-value of the corresponding (univariate) test between X and Yi given S, which has a low computational cost. The RGS(m,k) algorithm is useful in situations where the amount of time to produce an answer may be limited and/or the limit unknown beforehand: it is easy to show that the growing phase of GS(m) produces an an upper-bound of the m-boundary of X. Therefore, the RGS(m,k) algorithm, if interrupted, will return an approximation of this upper bound. Moreover, if there exists time for the shrinking phase to be executed (which conducts a number of tests linear in n and is thus fast), extraneous variables will be removed and a minimal blanket (boundary) approximation will be returned. These features make it an anytime algorithm, which is a more appropriate choice for situations where critical events may occur that require the interruption of computation, e.g., during the planning phase of a robot, which may be interrupted at any time due to an urgent external event that requires a decision to be made based on the present state’s feature values. 6 Experiments We evaluated the GS(m) and the RGS(m,k) algorithms on synthetic as well as real-world and benchmark data sets. We ﬁrst systematically examined the performance on the task of recovering near-parity functions, which are known to be hard to learn [17]. We compared GS(m) and RGS(m,k) with respect to accuracy of recovery of the original boundary as well as computational cost. We generated domains of sizes ranging from 10 to 100 variables, of which 4 variables (X1 to X4 ) were related through a near-parity relation with bit probability 0.60 and various degrees of noise. The remaining independent variables (X5 to Xn ) act as “dis5 GS(1) GS(3) RGS(1, 1000) 0 0.05 RGS(3, 1000) Relieved, threshold = 0.001 Relieved, threshold = 0.03 0.1 0.15 0.2 0.25 Noise probability 0.3 0.35 0.4 Probabilistic isolation performance of GS(m) and RELIEVED Probabilistic isolation performance of GS(m) and RGS(m ,k) Real-world and benchmark data sets 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) GS 0.6 0.8 1 Real-world and benchmark data sets RGS(m = 3, k = 300) average isolation measure F1 measure 50 variables, true Markov boundary size = 3 Bernoulli probability = 0.6, 1000 data points RELIEVED(threshold = 0.03) average isolation measure F1 measure of GS(m ), RGS(m, k ) and RELIEVED vs. noise level 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Data set Balance scale Balloons Car evaluation Credit screening Monks Nursery Tic-tac-toe Breast cancer Chess Audiology 0.8 0.6 0.4 0.2 0 0 0.2 0.4 (m = 3) average isolation measure GS 0.6 0.8 1 average isolation measure Figure 2: Left: F1 measure of GS(m) , RGS(m,k) and RELIEVED under increasing amounts of noise. Middle: Probabilistic isolation performance comparison between GS(3) and RELIEVED on real-world and benchmark data sets. Right: Same for GS(3) and RGS(3,1000) . tractors” and had randomly assigned probabilities i.e., the correct boundary of X1 is B1 = {X2 , X3 , X4 }. In such domains, learning the boundary of X1 is difﬁcult because of the large number of distractors and because each Xi ∈ B1 is independent of X1 given any proper subset of B1 − {Xi } (they only become dependent when including all of them in the conditioning set). To measure an algorithm’s feature selection performance, acF -measure of GS and RGS vs. domain size curacy (fraction of variables correctly included or excluded) is inappropriate as the accuracy of trivial algorithms such as returning the empty set will tend to 1 as n increases. Precision and recall are therefore more appropriate, with precision deﬁned as the fraction of features returned that are in the correct boundary (3 features for X1 ), and recall as the fraction of the features present in the correct boundary that are returned by the algorithm. A convenient and frequently used Number of domain variables measure that combines precision and recall is the F1 meaRunning time of GS and RGS vs. domain size sure, deﬁned as the harmonic mean of precision and recall [18]. In Fig. 1 (top) we report 95% conﬁdence intervals for the F1 measure and execution time of GS(m) (margins m = 1 to 3) and RGS(m,k) (margins 1 to 3 and k = 1000 random subsets), using 20 data sets containing 10 to 100 variables, with the target variable X1 was perturbed (inverted) by noise with 10% probability. As can be seen, the RGS(m,k) and GS(m) using the same value for margin perform comparably Number of domain variables with respect to F1 , up to their 95% conﬁdence intervals. With Figure 1: GS(m) and RGS(m,k) per(m,k) respect to execution time however RGS exhibits much formance with respect to domain greater scalability (Fig. 1 bottom, log scale); for example, it size (number of variables). Top: F1 executes in about 10 seconds on average in domains containmeasure, reﬂecting accuracy. Bot(m) ing 100 variables, while GS executes in 1,000 seconds on tom: Execution time in seconds (log average for this domain size. scale). We also compared GS(m) and RGS(m,k) to RELIEF [5], a well-known algorithm for feature selection that is known to be able to recover parity functions in certain cases [5]. RELIEF learns a weight for each variable and compares it to a threshold τ to decide on its inclusion in the set of relevant variables. As it has been reported [9] that RELIEF can exhibit large variance due to randomization that is necessary only for very large data sets, we instead used a deterministic variant called RELIEVED [9], whose behavior corresponds to RELIEF at the limit of inﬁnite execution time. We calculated the F1 measure for GS(m) , RGS(m,k) and RELIEVED in the presence of varying amounts of noise, with noise probability ranging from 0 (no noise) to 0.4. We used domains containing 50 variables, as GS(m) becomes computationally demanding in larger domains. In Figure 2 (left) we show the performance of GS(m) and RGS(m,k) for m equal to 1 and 3, k = 1000 and RELIEVED for thresholds τ = 0.01 and 0.03 for various amounts of noise on the target variable. Again, each experiment was repeated 20 times to generate 95% conﬁdence intervals. We can observe that even though m = 1 (equivalent to the GS algorithm) performs poorly, increasing the margin m makes it more likely to recover the correct Markov boundary, and GS(3) (m = 3) recovers the exact blanket even with few (1,000) data points. RELIEVED does comparably to GS(3) for little noise and for a large threshold, (m ) (m, k ) 1 True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 1 0.9 GS(1) GS(2) GS(3) RGS(1, 1000) (2, 1000) RGS (3, 1000) RGS 0.8 F1-measure 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10 20 30 40 50 60 (m ) 70 80 90 100 90 100 (m, k ) True Markov boundary size = 3, 1000 data points Bernoulli probability = 0.6, noise probability = 0.1 10000 GS(1) GS(2) GS(3) Execution time (sec) 1000 RGS(1, 1000) RGS(2, 1000) RGS(3, 1000) 100 10 1 0.1 0.01 10 6 20 30 40 50 60 70 80 but appears to deteriorate for more noisy domains. As we can see it is difﬁcult to choose the “right” threshold for RELIEVED—better performing τ at low noise can become worse in noisy environments; in particular, small τ tend to include irrelevant variables while large τ tend to miss actual members. We also evaluated GS(m) , RGS(m,k) , and RELIEVED on benchmark and real-world data sets from the UCI Machine Learning repository. As the true Markov boundary for these is impossible to know, we used as performance measure a measure of probabilistic isolation by the Markov boundary returned of subsets outside the boundary. For each domain variable X, we measured the independence of subsets Y of size 1, 2 and 3 given the blanket S of X returned by GS(3) and RELIEVED for τ = 0.03 (as this value seemed to do better in the previous set of experiments), as measured by the average p-value of the χ2 test between X and Y given S (with p-values of 0 and 1 indicating ideal dependence and independence, respectively). Due to the large number of subsets outside the boundary when the boundary is small, we limited the estimation of isolation performance to 2,000 subsets per variable. We plot the results in Figure 2 (middle and right). Each point represents a variable in the corresponding data set. Points under the diagonal indicate better probabilistic isolation performance for that variable for GS(3) compared to RELIEVED (middle plot) or to RGS(3,1000) (right plot). To obtain a statistically signiﬁcant comparison, we used the non-parametric Wilcoxon paired signed-rank test, which indicated that GS(3) RGS(3,1000) are statistically equivalent to each other, while both outperformed RELIEVED at the 99.99% signiﬁcance level (α < 10−7 ). 7 Conclusion In this paper we presented algorithms for the problem of feature selection in unrestricted (arbitrary distribution) domains that may contain complex interactions that only appear when the values of multiple features are considered together. We introduced two algorithms: an exact, provably correct one as well a more practical randomized anytime version, and evaluated them on on artiﬁcial, benchmark and real-world data, demonstrating that they perform well, even in the presence of noise. We also introduced the Markov Boundary Theorem that precisely characterizes the properties of a boundary, and used it to prove m-correctness of the exact family of algorithms presented. We made minimal assumptions that consist of only a general set of axioms that hold for every probability distribution, giving our algorithms universal applicability. Appendix A: Proof sketch of the Markov Boundary Theorem Proof sketch. (=⇒ direction) We need to prove that if S is a Markov boundary of X then (a) for every set T ⊆ U − S − {X}, (X⊥ T | S − T), and (b) for every set T′ ⊆ U − S that does not ⊥ contain X, (X ⊥ T′ | S − T′ ). Case (a) is immediate from the deﬁnition of the boundary and the ⊥ Decomposition theorem. Case (b) can be proven by contradiction: Assuming the independence of T′ that contains a non-empty part T′ in S and a part T′ in U − S, we get (from Decomposition) 1 2 (X⊥ T′ | S − T′ ). We can then use Contraction to show that the set S − T′ satisﬁes the inde⊥ 1 1 1 pendence property of a Markov boundary, i.e., that (X⊥ U − (S − T′ ) − {X} | S − T′ ), which ⊥ 1 1 contradicts the assumption that S is a boundary (and thus minimal). (⇐= direction) We need to prove that if Eq. (2) holds, then S is a minimal Markov blanket. The proof that S is a blanket is immediate. We can prove minimality by contradiction: Assume S = S1 ∪ S2 with S1 a blanket and S2 = ∅ i.e., S1 is a blanket strictly smaller than S. Then (X⊥ S2 | ⊥ S1 ) = (X⊥ S2 | S − S2 ). However, since S2 ⊆ U − S, from Eq. (2) we get (X ⊥ S2 | S − S2 ), ⊥ ⊥ which is a contradiction. Appendix B: Proof of m-Correctness of GS(m) Let the value of the set S at the end of the growing phase be SG , its value at the end of the shrinking phase SS , and their difference S∆ = SG − SS . The following two observations are immediate. Observation 1. For every Y ⊆ U − SG − {X} such that 1 ≤ |Y| ≤ m, (X⊥ Y | SG ). ⊥ Observation 2. For every Y ∈ SS , (X ⊥ Y | SS − {Y }). ⊥ Lemma 2. Consider variables Y1 , Y2 , . . . , Yt for some t ≥ 1 and let Y = {Yj }t . Assuming that j=1 Contraction holds, if (X⊥ Yi | S − {Yj }i ) for all i = 1, . . . , t, then (X⊥ Y | S − Y). ⊥ ⊥ j=1 Proof. By induction on Yj , j = 1, 2, . . . , t, using Contraction to decrease the conditioning set S t down to S − {Yj }i j=1 for all i = 1, 2, . . . , t. Since Y = {Yj }j=1 , we immediately obtain the desired relation (X⊥ Y | S − Y). ⊥ 7 Lemma 2 can be used to show that the variables found individually independent of X during the shrinking phase are actually jointly independent of X, given the ﬁnal set SS . Let S∆ = {Y1 , Y2 , . . . , Yt } be the set of variables removed (in that order) from SG to form the ﬁnal set SS i.e., S∆ = SG − SS . Using the above lemma, the following is immediate. Corollary 3. Assuming that the Contraction axiom holds, (X⊥ S∆ | SS ). ⊥ Lemma 4. If the Contraction, Decomposition and Weak Union axioms hold, then for every set T ⊆ U − SG − {X} such that (X⊥ T | SG ), ⊥ (X⊥ T ∪ (SG − SS ) | SS ). ⊥ (3) Furthermore SS is minimal i.e., there does not exist a subset of SS for which Eq. (3) is true. Proof. From Corollary 3, (X⊥ S∆ | SS ). Also, by the hypothesis, (X⊥ T | SG ) = (X⊥ T | ⊥ ⊥ ⊥ SS ∪ S∆ ), where S∆ = SG − SS as usual. From these two relations and Contraction we obtain (X⊥ T ∪ S∆ | SS ). ⊥ To prove minimality, let us assume that SS = ∅ (if SS = ∅ then it is already minimal). We prove by contradiction: Assume that there exists a set S′ ⊂ SS such that (X⊥ T ∪ (SG − S′ ) | S′ ). Let ⊥ W = SS − S′ = ∅. Note that W and S′ are disjoint. We have that SS ⊆ SS ∪ S∆ =⇒ SS − S′ ⊆ SS ∪ S∆ − S′ ⊆ T ∪ (SS ∪ S∆ − S′ ) =⇒ W ⊆ T ∪ (SS ∪ S∆ − S′ ) = T ∪ (SG − S′ ) • Since (X⊥ T ∪ (SG − S′ ) | S′ ) and W ⊆ T ∪ (SS ∪ S∆ − S′ ), from Decomposition we ⊥ get (X⊥ W | S′ ). ⊥ • From (X⊥ W | S′ ) and Weak Union we have that for every Y ∈ W, (X⊥ Y | S′ ∪ ⊥ ⊥ (W − {Y })). • Since S′ and W are disjoint and since Y ∈ W, Y ∈ S′ . Applying the set equality (A − B) ∪ C = (A ∪ B) − (A − C) to S′ ∪ (W − {Y }) we obtain S′ ∪ W − ({Y } − S′ ) = SS − {Y }. • Therefore, ∀ Y ∈ W, (X⊥ Y | SS − {Y }). ⊥ However, at the end of the shrinking phase, all variables Y in SS (and therefore in W, as W ⊆ SS ) have been evaluated for independence and found dependent (Observation 2). Thus, since W = ∅, there exists at least one Y such that (X ⊥ Y | SS − {Y }), producing a contradiction. ⊥ Theorem 5. Assuming that the Contraction, Decomposition, and Weak Union axioms hold, Algorithm 1 is m-correct with respect to X. Proof. We use the Markov Boundary Theorem. We ﬁrst prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X⊥ T | SS − T) ⊥ or, equivalently, ∀ T ⊆ U − SS − {X} such that |T| ≤ m, (X⊥ T | SS ). ⊥ Since U − SS − {X} = S∆ ∪ (U − SG − {X}), S∆ and U − SG − {X} are disjoint, there are three kinds of sets of size m or less to consider: (i) all sets T ⊆ S∆ , (ii) all sets T ⊆ U − SG − {X}, and (iii) all sets (if any) T = T′ ∪ T′′ , T′ ∩ T′′ = ∅, that have a non-empty part T′ ⊆ S∆ and a non-empty part T′′ ⊆ U − SG − {X}. (i) From Corollary 3, (X⊥ S∆ | SS ). Therefore, from Decomposition, for any set T ⊆ S∆ , ⊥ (X⊥ T | SS ). ⊥ (ii) By Observation 1, for every set T ⊆ U − SG − {X} such that |T| ≤ m, (X⊥ T | SG ). ⊥ By Lemma 4 we get (X⊥ T ∪ S∆ | SS ), from which we obtain (X⊥ T | SS ) by ⊥ ⊥ Decomposition. (iii) Since |T| ≤ m, we have that |T′′ | ≤ m. Since T′′ ⊆ U − SG − {X}, by Observation 1, (X⊥ T′′ | SG ). Therefore, by Lemma 4, (X⊥ T′′ ∪ S∆ | SS ). Since T′ ⊆ S∆ ⇒ ⊥ ⊥ T′′ ∪ T′ ⊆ T′′ ∪ S∆ , by Decomposition to obtain (X⊥ T′′ ∪ T′ | SS ) = (X⊥ T | SS ). ⊥ ⊥ To complete the proof we need to prove that ∀ T ⊆ U − {X} such that |T| ≤ m, T ⊆ U − SS =⇒ (X ⊥ T | SS − T) . ⊥ Let T = T1 ∪ T2 , with T1 ⊆ SS and T2 ⊆ U − SS . Since T ⊆ U − SS , T1 contains at least one variable Y ∈ SS . From Observation 2, (X ⊥ Y | SS − {Y }). From this and (the contrapositive of) ⊥ Weak Union, we get (X ⊥ {Y } ∪ (T1 − {Y }) | SS − {Y } − (T1 − {Y })) = (X ⊥ T1 | SS − T1 ). ⊥ ⊥ From (the contrapositive of) Decomposition we get (X ⊥ T1 ∪ T2 | SS − T1 ) = (X ⊥ T | ⊥ ⊥ SS − T1 ), which is equal to (X ⊥ T | SS − T1 − T2 ) = (X ⊥ T | SS − T) as SS and T2 are ⊥ ⊥ disjoint. 8 References [1] Isabelle Guyon and Andr´ Elisseeff. An introduction to variable and feature selection. Journal e of Machine Learning Research, 3:1157–1182, 2003. [2] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Tenth International Conference on Machine Learning (ICML), pages 284–292, 1996. [3] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917–922, 1977. [4] H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), 1991. [5] K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on the Americal Association for Artiﬁcal Intelligence (AAAI), pages 129–134, 1992. [6] M. Dash and H. Liu. Feature selection for classiﬁcation. Intelligent Data Analysis, 1(3): 131–156, 1997. [7] Huan Liu and Hiroshi Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective, volume 453 of The Springer International Series in Engineering and Computer Science. 1998. [8] Avrim Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artiﬁcial Intelligence, 97(1-2):245–271, 1997. [9] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artiﬁcial Intelligence, 97 (1-2):273–324, 1997. [10] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12 (NIPS), 2000. [11] I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, 2003. [12] I. Tsamardinos, C. Aliferis, and A. Statnikov. Time and sample efﬁcient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 673–678, 2003. [13] N. Meinshausen and P. B¨ hlmann. High-dimensional graphs and variable selection with the u Lasso. Annals of Statistics, 34:1436–1462, 2006. [14] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988. [15] Michael Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. [16] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 1990. [17] M. Kearns. Efﬁcient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998. [18] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 1979. 9

2 0.8895238 149 nips-2009-Maximin affinity learning of image segmentation

Author: Kevin Briggman, Winfried Denk, Sebastian Seung, Moritz N. Helmstaedter, Srinivas C. Turaga

Abstract: Images can be segmented by ﬁrst using a classiﬁer to predict an afﬁnity graph that reﬂects the degree to which image pixels must be grouped together and then partitioning the graph to yield a segmentation. Machine learning has been applied to the afﬁnity classiﬁer to produce afﬁnity graphs that are good in the sense of minimizing edge misclassiﬁcation rates. However, this error measure is only indirectly related to the quality of segmentations produced by ultimately partitioning the afﬁnity graph. We present the ﬁrst machine learning algorithm for training a classiﬁer to produce afﬁnity graphs that are good in the sense of producing segmentations that directly minimize the Rand index, a well known segmentation performance measure. The Rand index measures segmentation performance by quantifying the classiﬁcation of the connectivity of image pixel pairs after segmentation. By using the simple graph partitioning algorithm of ﬁnding the connected components of the thresholded afﬁnity graph, we are able to train an afﬁnity classiﬁer to directly minimize the Rand index of segmentations resulting from the graph partitioning. Our learning algorithm corresponds to the learning of maximin afﬁnities between image pixel pairs, which are predictive of the pixel-pair connectivity. 1

same-paper 3 0.87430423 260 nips-2009-Zero-shot Learning with Semantic Output Codes

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

4 0.80276114 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

Author: Francois Caron, Arnaud Doucet

Abstract: Over recent years Dirichlet processes and the associated Chinese restaurant process (CRP) have found many applications in clustering while the Indian buffet process (IBP) is increasingly used to describe latent feature models. These models are attractive because they ensure exchangeability (over samples). We propose here extensions of these models where the dependency between samples is given by a known decomposable graph. These models have appealing properties and can be easily learned using Monte Carlo techniques. 1 Motivation The CRP and IBP have found numerous applications in machine learning over recent years [5, 10]. We consider here the case where the data we are interested in are ‘locally’ dependent; these dependencies being represented by a known graph G where each data point/object is associated to a vertex. These local dependencies can correspond to any conceptual or real (e.g. space, time) metric. For example, in the context of clustering, we might want to propose a prior distribution on partitions enforcing that data which are ‘close’ in the graph are more likely to be in the same cluster. Similarly, in the context of latent feature models, we might be interested in a prior distribution on features enforcing that data which are ‘close’ in the graph are more likely to possess similar features. The ‘standard’ CRP and IBP correspond to the case where the graph G is complete; that is it is fully connected. In this paper, we generalize the CRP and IBP to decomposable graphs. The resulting generalized versions of the CRP and IBP enjoy attractive properties. Each clique of the graph follows marginally a CRP or an IBP process and explicit expressions for the joint prior distribution on the graph is available. It makes it easy to learn those models using straightforward generalizations of Markov chain Monte Carlo (MCMC) or Sequential Monte Carlo (SMC) algorithms proposed to perform inference for the CRP and IBP [5, 10, 14]. The rest of the paper is organized as follows. In Section 2, we review the popular Dirichlet multinomial allocation model and the Dirichlet Process (DP) partition distribution. We propose an extension of these two models to decomposable graphical models. In Section 3 we discuss nonparametric latent feature models, reviewing brieﬂy the construction in [5] and extending it to decomposable graphs. We demonstrate these models in Section 4 on two applications: an alternative to the hierarchical DP model [12] and a time-varying matrix factorization problem. 2 Prior distributions for partitions on decomposable graphs Assume we have n observations. When performing clustering, we associate to each of this observation an allocation variable zi ∈ [K] = {1, . . . , K}. Let Πn be the partition of [n] = {1, . . . , n} deﬁned by the equivalence relation i ↔ j ⇔ zi = zj . The resulting partition Πn = {A1 , . . . , An(Πn ) } 1 is an unordered collection of disjoint non-empty subsets Aj of [n], j = 1, . . . , n(Πn ), where ∪j Aj = [n] and n(Πn ) is the number of subsets for partition Πn . We also denote by Pn be the set of all partitions of [n] and let nj , j = 1, . . . , n(Πn ), be the size of the subset Aj . Each allocation variable zi is associated to a vertex/site of an undirected graph G, which is assumed to be known. In the standard case where the graph G is complete, we ﬁrst review brieﬂy here two popular prior distributions on z1:n , equivalently on Πn . We then extend these models to undirected decomposable graphs; see [2, 8] for an introduction to decomposable graphs. Finally we brieﬂy discuss the directed case. Note that the models proposed here are completely different from the hyper multinomial-Dirichlet in [2] and its recent DP extension [6]. 2.1 Dirichlet multinomial allocation model and DP partition distribution Assume for the time being that K is ﬁnite. When the graph is complete, a popular choice for the allocation variables is to consider a Dirichlet multinomial allocation model [11] θ θ , . . . , ), zi |π ∼ π (1) K K where D is the standard Dirichlet distribution and θ > 0. Integrating out π, we obtain the following Dirichlet multinomial prior distribution π ∼ D( Pr(z1:n ) = K j=1 Γ(θ) Γ(nj + θ K) (2) θ Γ(θ + n)Γ( K )K and then, using the straightforward equality Pr(Πn ) = PK where PK = {Πn ∈ Pn |n(Πn ) ≤ K}, we obtain K! (K−n(Πn ))! Pr(z1:n ) valid for for all Πn ∈ n(Π ) Pr(Πn ) = θ Γ(θ) j=1n Γ(nj + K ) K! . θ (K − n(Πn ))! Γ(θ + n)Γ( K )n(Πn ) (3) DP may be seen as a generalization of the Dirichlet multinomial model when the number of components K → ∞; see for example [10]. In this case the distribution over the partition Πn of [n] is given by [11] n(Π ) θn(Πn ) j=1n Γ(nj ) . (4) Pr(Πn ) = n i=1 (θ + i − 1) Let Π−k = {A1,−k , . . . , An(Π−k ),−k } be the partition induced by removing item k to Πn and nj,−k be the size of cluster j for j = 1, . . . , n(Π−k ). It follows from (4) that an item k is assigned to an existing cluster j, j = 1, . . . , n(Π−k ), with probability proportional to nj,−k / (n − 1 + θ) and forms a new cluster with probability θ/ (n − 1 + θ). This property is the basis of the CRP. We now extend the Dirichlet multinomial allocation and the DP partition distribution models to decomposable graphs. 2.2 Markov combination of Dirichlet multinomial and DP partition distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. It can be easily checked that if the marginal distribution of zC for each clique C ∈ C is deﬁned by (2) then these distributions are consistent as they yield the same distribution (2) over the separators. Therefore, the unique Markov distribution over G with Dirichlet multinomial distribution over the cliques is deﬁned by [8] Pr(zC ) S∈S Pr(zS ) C∈C Pr(z1:n ) = (5) where for each complete set B ⊆ G, we have Pr(zB ) given by (2). It follows that we have for any Πn ∈ PK Γ(θ) K! Pr(Πn ) = (K − n(Πn ))! C∈C Γ(θ) S∈S 2 K j=1 θ Γ(nj,C + K ) θ Γ(θ+nC )Γ( K )K K j=1 θ Γ(nj,S + K ) θ Γ(θ+nS )Γ( K )K (6) where for each complete set B ⊆ G, nj,B is the number of items associated to cluster j, j = 1, . . . , K in B and nB is the total number of items in B. Within each complete set B, the allocation variables deﬁne a partition distributed according to the Dirichlet-multinomial distribution. We now extend this approach to DP partition distributions; that is we derive a joint distribution over Πn such that the distribution of ΠB over each complete set B of the graph is given by (4) with θ > 0. Such a distribution satisﬁes the consistency condition over the separators as the restriction of any partition distributed according to (4) still follows (4) [7]. G Proposition. Let Pn be the set of partitions Πn ∈ Pn such that for each decomposition A, B, and any (i, j) ∈ A × B, i ↔ j ⇒ ∃k ∈ A ∩ B such that k ↔ i ↔ j. As K → ∞, the prior distribution G over partitions (6) is given for each Πn ∈ Pn by Pr(Πn ) = θn(Πn ) n(ΠC ) Γ(nj,C ) j=1 nC i=1 (θ+i−1) n(ΠS ) Γ(nj,S ) j=1 nS (θ+i−1) i=1 C∈C S∈S (7) where n(ΠB ) is the number of clusters in the complete set B. Proof. From (6), we have θ n(ΠC ) K(K − 1) . . . (K − n(Πn ) + 1) Pr(Πn ) = K C∈C n(ΠC )− S∈S n(ΠS ) C∈C θ n(ΠS ) S∈S n(ΠC ) θ Γ(nj,C + K ) j=1 nC (θ+i−1) i=1 n(ΠS ) θ Γ(nj,S + K ) j=1 nS (θ+i−1) i=1 Thus when K → ∞, we obtain (7) if n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) and 0 otherwise. We have n(Πn ) ≤ C∈C n(ΠC ) − S∈S n(ΠS ) for any Πn ∈ Pn and the subset of Pn verifying G n(Πn ) = C∈C n(ΠC ) − S∈S n(ΠS ) corresponds to the set Pn . Example. Let the notation i ∼ j (resp. i j) indicates an edge (resp. no edge) between two sites. Let n = 3 and G be the decomposable graph deﬁned by the relations 1 ∼ 2, 2 ∼ 3 and 1 3. G The set P3 is then equal to {{{1, 2, 3}}; {{1, 2}, {3}}; {{1}, {2, 3}}; {{1}, {2}, {3}}}. Note that G the partition {{1, 3}, {2}} does not belong to P3 . Indeed, as there is no edge between 1 and 3, they cannot be in the same cluster if 2 is in another cluster. The cliques are C1 = {1, 2} and C2 = {2, 3} Pr(ΠC1 ) Pr(ΠC2 ) hence we can and the separator is S2 = {2}. The distribution is given by Pr(Π3 ) = Pr(ΠS ) 2 check that we obtain Pr({1, 2, 3}) = (θ + 1)−2 , Pr({1, 2}, {3}) = Pr({1, 2}, {3}) = θ(θ + 1)−2 and Pr({1}, {2}, {3}) = θ2 (θ + 1)−2 . Let now deﬁne the full conditional distributions. Based on (7) the conditional assignment of an item k is proportional to the conditional over the cliques divided by the conditional over the separators. G Let denote G−k the undirected graph obtained by removing vertex k from G. Suppose that Πn ∈ Pn . G−k If Π−k ∈ Pn−1 , then do not change the value of item k. Otherwise, item k is assigned to cluster j / where j = 1, . . . , n(Π−k ) with probability proportional to {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (8) and to a new cluster with probability proportional to θ, where n−k,j,C is the number of items in the set C \ {k} belonging to cluster j. The updating process is illustrated by the Chinese wedding party process1 in Fig. 1. The results of this section can be extended to the Pitman-Yor process, and more generally to species sampling models. Example (continuing). Given Π−2 = {A1 = {1}, A2 = {3}}, we have −1 Pr( item 2 assigned to A1 = {1}| Π−2 ) = Pr( item 2 assigned to A2 = {3}| Π−2 ) = (θ + 2) −1 and Pr( item 2 assigned to new cluster A3 | Π−2 ) = θ (θ + 2) . Given Π−2 = {A1 = {1, 3}}, item 2 is assigned to A1 with probability 1. 1 Note that this representation describes the full conditionals while the CRP represents the sequential updat- ing. 3 (a) (b) (d) (c) (e) Figure 1: Chinese wedding party. Consider a group of n guests attending a wedding party. Each of the n guests may belong to one or several cliques, i.e. maximal groups of people such that everybody knows everybody. The belonging of each guest to the different cliques is represented by color patches on the ﬁgures, and the graphical representation of the relationship between the guests is represented by the graphical model (e). (a) Suppose that the guests are already seated such that two guests cannot be together at the same table is they are not part of the same clique, or if there does not exist a group of other guests such that they are related (“Any friend of yours is a friend of mine”). (b) The guest number k leaves his table and either (c) joins a table where there are guests from the same clique as him, with probability proportional to the product of the number of guests from each clique over the product of the number of guests belonging to several cliques on that table or (d) he joins a new table with probability proportional to θ. 2.3 Monte Carlo inference 2.3.1 MCMC algorithm Using the full conditionals, a single site Gibbs sampler can easily be designed to approximate the posterior distribution Pr(Πn |z1:n ). Given a partition Πn , an item k is taken out of the partition. If G−k Π−k ∈ Pn−1 , item k keeps the same value. Otherwise, the item will be assigned to a cluster j, / j = 1, . . . , n(Π−k ), with probability proportional to p(z{k}∪Aj,−k ) × p(zAj,−k ) {C∈C|n−k,j,C >0} n−k,j,C {S∈S|n−k,j,S >0} n−k,j,S (9) and the item will be assigned to a new cluster with probability proportional to p(z{k} ) × θ. Similarly to [3], we can also deﬁne a procedure to sample from p(θ|n(Πn ) = k)). We assume that θ ∼ G(a, b) and use p auxiliary variables x1 , . . . , xp . The procedure is as follows. • For j = 1, . . . , p, sample xj |k, θ ∼ Beta(θ + nSj , nCj − nSj ) • Sample θ|k, x1:p ∼ G(a + k, b − j log xj ) 2.3.2 Sequential Monte Carlo We have so far only treated the case of an undirected decomposable graph G. We can formulate a sequential updating rule for the corresponding perfect directed version D of G. Indeed, let (a1 , . . . a|V | ) be a perfect ordering and pa(ak ) be the set of parents of ak which is by deﬁnition complete. Let Πk−1 = {A1,k−1 , . . . , An(Πk−1 ),k−1 } denote the partition of the ﬁrst k−1 vertices a1:k−1 and let nj,pa(ak ) be the number of elements with value j in the set pa(ak ), j = 1, . . . , n(Πk−1 ). Then the vertex ak joins the set j with probability nj,pa(ak ) / θ + cluster with probability θ/ θ + q q nq,pa(ak ) and creates a new nq,pa(ak ) . One can then design a particle ﬁlter/SMC method in a similar fashion as [4]. Consider a set of (i) (i) (i) (i) N N particles Πk−1 with weights wk−1 ∝ Pr(Πk−1 , z1:k−1 ) ( i=1 wk−1 = 1) that approximate (i) the posterior distribution Pr(Πk−1 |z1:k−1 ). For each particle i, there are n(Πk−1 ) + 1 possible 4 (i,j) allocations for component ak . We denote Πk the partition obtained by associating component ak (i,j) to cluster j. The weight associated to Πk is given by  nj,pa(ak ) (i)  if j = 1, . . . , n(Πk−1 ) θ+ q nq,pa(ak ) (i,j) (i) p(z{ak }∪Aj,k−1 ) wk−1 = wk−1 × (10) (i) θ  θ+ n p(zAj,k−1 ) if j = n(Πk−1 ) + 1 q q,pa(ak ) (i,j) Then we can perform a deterministic resampling step by keeping the N particles Πk with highest (i,j) (i) (i) weights wk−1 . Let Πk be the resampled particles and wk the associated normalized weights. 3 Prior distributions for inﬁnite binary matrices on decomposable graphs Assume we have n objects; each of these objects being associated to the vertex of a graph G. To K each object is associated a K-dimensional binary vector zn = (zn,1 , . . . , zn,K ) ∈ {0, 1} where zn,i = 1 if object n possesses feature i and zn,i = 0 otherwise. These vectors zt form a binary n × K matrix denoted Z1:n . We denote by ξ1:n the associated equivalence class of left-ordered matrices and let EK be the set of left-ordered matrices with at most K features. In the standard case where the graph G is complete, we review brieﬂy here two popular prior distributions on Z1:n , equivalently on ξ1:n : the Beta-Bernoulli model and the IBP [5]. We then extend these models to undirected decomposable graphs. This can be used for example to deﬁne a time-varying IBP as illustrated in Section 4. 3.1 Beta-Bernoulli and IBP distributions The Beta-Bernoulli distribution over the allocation Z1:n is K Pr(Z1:n ) = α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) α K Γ(nj j=1 (11) where nj is the number of objects having feature j. It follows that Pr(ξ1:n ) = K K! 2n −1 h=0 α K Γ(nj α + K )Γ(n − nj + 1) α Γ(n + 1 + K ) Kh ! j=1 (12) where Kh is the number of features possessing the history h (see [5] for details). The nonparametric model is obtained by taking the limit when K → ∞ Pr(ξ1:n ) = αK K+ + 2n −1 h=1 Kh ! exp(−αHn ) where K + is the total number of features and Hn = 3.2 (n − nj )!(nj − 1)! n! j=1 n 1 k=1 k . (13) The IBP follows from (13). Markov combination of Beta-Bernoulli and IBP distributions Let G be a decomposable undirected graph, C = {C1 , . . . , Cp } a perfect ordering of the cliques and S = {S2 , . . . , Cp } the associated separators. As in the Dirichlet-multinomial case, it is easily seen that if for each clique C ∈ C, the marginal distribution is deﬁned by (11), then these distributions are consistent as they yield the same distribution (11) over the separators. Therefore, the unique Markov distribution over G with Beta-Bernoulli distribution over the cliques is deﬁned by [8] Pr(ZC ) S∈S Pr(ZS ) C∈C Pr(Z1:n ) = (14) where Pr(ZB ) given by (11) for each complete set B ⊆ G. The prior over ξ1:n is thus given, for ξ1:n ∈ EK , by Pr(ξ1:n ) = K! 2n −1 h=0 Kh ! α K α Γ(nj,C + K )Γ(nC −nj,C +1) α Γ(nC +1+ K ) α α Γ(nj,S + K )Γ(nS −nj,S +1) K K α j=1 Γ(nS +1+ K ) K j=1 C∈C S∈S 5 (15) where for each complete set B ⊆ G, nj,B is the number of items having feature j, j = 1, . . . , K in the set B and nB is the whole set of objects in set B. Taking the limit when K → ∞, we obtain after a few calculations Pr(ξ1:n ) = α + K[n] exp [−α ( C HnC − 2n −1 h=1 Kh ! HnS )] × C∈C + KC (nC −nj,C )!(nj,C −1)! j=1 nC ! S∈S S + KS (nS −nj,S )!(nj,S −1)! j=1 nS ! + + + + if K[n] = C KC − S KS and 0 otherwise, where KB is the number of different features possessed by objects in B. G Let En be the subset of En such that for each decomposition A, B and any (u, v) ∈ A × B: {u and v possess feature j} ⇒ ∃k ∈ A ∩ B such that {k possesses feature j}. Let ξ−k be the left-ordered + matrix obtained by removing object k from ξn and K−k be the total number of different features in G−k + ξ−k . For each feature j = 1, . . . , K−k , if ξ−k ∈ En−1 then we have   b C∈C nj,C if i = 1 S∈C nj,S Pr(ξk,j = i) = (16)  b C∈C (nC −nj,C ) if i = 0 (nS −nj,S ) S∈C nS where b is the appropriate normalizing constant then the customer k tries Poisson α {S∈S|k∈S} nC {C∈C|k∈C} new dishes. We can easily generalize this construction to a directed version D of G using arguments similar to those presented in Section 2; see Section 4 for an application to time-varying matrix factorization. 4 4.1 Applications Sharing clusters among relative groups: An alternative to HDP Consider that we are given d groups with nj data yi,j in each group, i = 1, . . . , nj , j = 1, . . . , d. We consider latent cluster variables zi,j that deﬁne the partition of the data. We will use alternatively the notation θi,j = Uzi,j in the following. Hierarchical Dirichlet Process [12] (HDP) is a very popular model for sharing clusters among related groups. It is based on a hierarchy of DPs G0 ∼ DP (γ, H), Gj |G0 ∼ DP (α, G0 ) j = 1, . . . d θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj . Under conjugacy assumptions, G0 , Gj and U can be integrated out and we can approximate the marginal posterior of (zi,j ) given y = (yi,j ) with Gibbs sampling using the Chinese restaurant franchise to sample from the full conditional p(zi,j |z−{i,j} , y). Using the graph formulation deﬁned in Section 2, we propose an alternative to HDP. Let θ0,1 , . . . , θ0,N be N auxiliary variables belonging to what we call group 0. We deﬁne each clique Cj (j = 1, . . . , d) to be composed of elements from group j and elements from group 0. This deﬁnes a decomposable graphical model whose separator is given by the elements of group 0. We can rewrite the model in a way quite similar to HDP G0 ∼ DP (α, H), θ0,i |G0 ∼ G0 i = 1, ..., N α α Gj |θ0,1 , . . . , θ0,N ∼ DP (α + N, α+N H + α+N θi,j |Gj ∼ Gj , yi,j |θi,j ∼ f (θi,j ) i = 1, . . . , nj N i=1 δθ0,i ) j = 1, . . . d, N For any subset A and j = k ∈ {1, . . . , p} we have corr(Gj (A), Gk (A)) = α+N . Again, under conjugacy conditions, we can integrate out G0 , Gj and U and approximate the marginal posterior distribution over the partition using the Chinese wedding party process deﬁned in Section 2. Note that for latent variables zi,j , j = 1, . . . , d, associated to data, this is the usual CRP update. As in HDP, multiple layers can be added to the model. Figures 2 (a) and (b) resp. give the graphical DP alternative to HDP and 2-layer HDP. 6 z0 root z0 root corpora docs z1 z2 z1 z2 z3 z1,1 z1,2 z2,1 z2,2 z2,3 docs (a) Graphical DP alternative to HDP (b) Graphical DP alternative to 2-layer HDP Figure 2: Hierarchical Graphs of dependency with (a) one layer and (b) two layers of hierarchy. If N = 0, then Gj ∼ DP (α, H) for all j and this is equivalent to setting γ → ∞ in HDP. If N → ∞ then Gj = G0 for all j, G0 ∼ DP (α, H). This is equivalent to setting α → ∞ in the HDP. One interesting feature of the model is that, contrary to HDP, the marginal distribution of Gj at any layer of the tree is DP (α, H). As a consequence, the total number of clusters scales logarithmically (as in the usual DP) with the size of each group, whereas it scales doubly logarithmically in HDP. Contrary to HDP, there are at most N clusters shared between different groups. Our model is in that sense reminiscent of [9] where only a limited number of clusters can be shared. Note however that contrary to [9] we have a simple CRP-like process. The proposed methodology can be straightforwardly extended to the inﬁnite HMM [12]. The main issue of the proposed model is the setting of the number N of auxiliary parameters. Another issue is that to achieve high correlation, we need a large number of auxiliary variables. Nonetheless, the computational time used to sample from auxiliary variables is negligible compared to the time used for latent variables associated to data. Moreover, it can be easily parallelized. The model proposed offers a far richer framework and ensures that at each level of the tree, the marginal distribution of the partition is given by a DP partition model. 4.2 Time-varying matrix factorization Let X1:n be an observed matrix of dimension n × D. We want to ﬁnd a representation of this matrix in terms of two latent matrices Z1:n of dimension n × K and Y of dimension K × D. Here Z1:n 2 is a binary matrix whereas Y is a matrix of latent features. By assuming that Y ∼ N 0, σY IK×D and 2 X1:n = Z1:n Y + σX εn where εn ∼ N 0, σX In×D , we obtain p(X1:n |Z1:n ) ∝ −D/2 2 2 + Z+T Z+ + σX /σY IKn 1:n 1:n + (n−Kn )D σX exp − + Kn D σY 2 2 + where Σ−1 = I − Z+ Z+T Z+ + σX /σY IKn n 1:n 1:n 1:n −1 1 T −1 2 tr X1:n Σn X1:n 2σX (17) + Z+T , Kn the number of non-zero columns of 1:n + Z1:n and Z+ is the ﬁrst Kn columns of Z1:n . To avoid having to set K, [5, 14] assume that Z1:n 1:n follows an IBP. The resulting posterior distribution p(Z1:n |X1:n ) can be estimated through MCMC [5] or SMC [14]. We consider here a different model where the object Xt is assumed to arrive at time index t and we want a prior distribution on Z1:n ensuring that objects close in time are more likely to possess similar features. To achieve this, we consider the simple directed graphical model D of Fig. 3 where the site numbering corresponds to a time index in that case and a perfect numbering of D is (1, 2, . . .). The set of parents pa(t) is composed of the r preceding sites {{t − r}, . . . , {t − 1}}. The time-varying IBP to sample from p(Z1:n ) associated to this directed graph follows from (16) and proceeds as follows. At time t = 1 + new new • Sample K1 ∼Poisson(α), set z1,i = 1 for i = 1, ..., K1 and set K1 = Knew . At times t = 2, . . . , r n + new ∼Poisson( α ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( 1:t−1,k ) and Kt t t 7 ? ? - t−r - t−r+1 - . . . - t−1 - t - t+1 6 6 Figure 3: Directed graph. At times t = r + 1, . . . , n n + α new ∼Poisson( r+1 ). • For k = 1, . . . Kt , sample zt,k ∼ Ber( t−r:t−1,k ) and Kt r+1 + Here Kt is the total number of features appearing from time max(1, t − r) to t − 1 and nt−r:t−1,k the restriction of n1:t−1 to the r last customers. Using (17) and the prior distribution of Z1:n which can be sampled using the time-varying IBP described above, we can easily design an SMC method to sample from p(Z1:n |X1:n ). We do not detail it here. Note that contrary to [14], our algorithm does not require inverting a matrix whose dimension grows linearly with the size of the data but only a matrix of dimension r × r. In order to illustrate the model and SMC algorithm, we create 200 6 × 6 images using a ground truth Y consisting of 4 different 6 × 6 latent images. The 200 × 4 binary matrix was generated from Pr(zt,k = 1) = πt,k , where πt = ( .6 .5 0 0 ) if t = 1, . . . , 30, πt = ( .4 .8 .4 0 ) if t = 31, . . . , 50 and πt = ( 0 .3 .6 .6 ) if t = 51, . . . , 200. The order of the model is set to r = 50. The feature occurences Z1:n and true features Y and their estimates are represented in Figure 4. Two spurious features are detected by the model (features 2 and 5 on Fig. 3(c)) but quickly discarded (Fig. 4(d)). The algorithm is able to correctly estimate the varying prior occurences of the features over time. Feature1 Feature2 Feature1 Feature2 Feature3 20 20 40 40 60 60 Feature4 80 100 Feature4 Feature5 Feature6 Time Feature3 Time 80 100 120 120 140 140 160 160 180 200 180 1 2 3 200 4 Feature (a) 1 2 3 4 5 6 Feature (b) (c) (d) Figure 4: (a) True features, (b) True features occurences, (c) MAP estimate ZM AP and (d) associated E[Y|ZM AP ] t=20 t=50 t=20 t=50 t=100 t=200 t=100 t=200 (a) (b) Figure 5: (a) E[Xt |πt , Y] and (b) E[Xt |X1:t−1 ] at t = 20, 50, 100, 200. 5 Related work and Discussion The ﬁxed-lag version of the time-varying DP of Caron et al. [1] is a special case of the proposed model when G is given by Fig. 3. The bivariate DP of Walker and Muliere [13] is also a special case when G has only two cliques. In this paper, we have assumed that the structure of the graph was known beforehand and we have shown that many ﬂexible models arise from this framework. It would be interesting in the future to investigate the case where the graphical structure is unknown and must be estimated from the data. Acknowledgment The authors thank the reviewers for their comments that helped to improve the writing of the paper. 8 References [1] F. Caron, M. Davy, and A. Doucet. Generalized Polya urn for time-varying Dirichlet process mixtures. In Uncertainty in Artiﬁcial Intelligence, 2007. [2] A.P. Dawid and S.L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21:1272–1317, 1993. [3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995. [4] P. Fearnhead. Particle ﬁlters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21, 2004. [5] T.L. Grifﬁths and Z. Ghahramani. Inﬁnite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2006. [6] D. Heinz. Building hyper dirichlet processes for graphical models. Electonic Journal of Statistics, 3:290–315, 2009. [7] J.F.C. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London, 361:1–20, 1978. [8] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996. [9] P. M¨ ller, F. Quintana, and G. Rosner. A method for combining inference across related nonu parametric Bayesian models. Journal of the Royal Statistical Society B, 66:735–749, 2004. [10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000. [11] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability theory and related ﬁelds, 102:145–158, 1995. [12] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006. [13] S. Walker and P. Muliere. A bivariate Dirichlet process. Statistics and Probability Letters, 64:1–7, 2003. [14] F. Wood and T.L. Grifﬁths. Particle ﬁltering for nonparametric Bayesian matrix factorization. In Advances in Neural Information Processing Systems, 2007. 9

5 0.80232364 56 nips-2009-Conditional Neural Fields

Author: Jian Peng, Liefeng Bo, Jinbo Xu

Abstract: Conditional random ﬁelds (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs signiﬁcantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.

6 0.79575968 204 nips-2009-Replicated Softmax: an Undirected Topic Model

7 0.79302114 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.79285973 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

9 0.79267985 226 nips-2009-Spatial Normalized Gamma Processes

10 0.79265112 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

11 0.79010546 97 nips-2009-Free energy score space

12 0.78497195 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

13 0.78379077 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

14 0.78342307 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

15 0.78341395 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

16 0.78328776 129 nips-2009-Learning a Small Mixture of Trees

17 0.78260809 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

18 0.78229231 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

19 0.78137153 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

20 0.78135669 96 nips-2009-Filtering Abstract Senses From Image Search Results