nips nips2006 nips2006-115 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, Joshua B. Tenenbaum
Abstract: The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. We show that our model discovers interpretable structure in several real-world data sets.
Reference: text
sentIndex sentText sentNum sentScore
1 Learning annotated hierarchies from relational data Daniel M. Roy [sent-1, score-0.568]
2 Abstract The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. [sent-6, score-0.511]
3 Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. [sent-7, score-1.067]
4 We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. [sent-8, score-0.98]
5 1 Introduction Researchers in AI and cognitive science [1, 7] have proposed that hierarchies are useful for representing and reasoning about the objects in many real-world domains. [sent-10, score-0.46]
6 One of the reasons that hierarchies are valuable is that they compactly specify categories at many levels of resolution, each node representing the category of objects at the leaves below the node. [sent-11, score-0.968]
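The idea that each node stands for the category of objects at the leaves below it can be made concrete in a few lines. A minimal sketch (the nested-tuple tree encoding and the department names are illustrative assumptions, not taken from the paper):

```python
def leaves(node):
    """Return the set of leaf objects below `node`.
    A tree is either a leaf (a string) or a tuple of subtrees."""
    if isinstance(node, str):
        return {node}
    out = set()
    for child in node:
        out |= leaves(child)
    return out

# A toy hierarchy loosely in the spirit of Figure 1a: faculty splits into
# professors and assistant professors; staff is its own branch.
employees = (("professor", "assistant_professor"), "staff")
faculty = employees[0]

print(sorted(leaves(employees)))  # ['assistant_professor', 'professor', 'staff']
print(sorted(leaves(faculty)))    # ['assistant_professor', 'professor']
```

Any internal node's category is just the union of its children's categories, which is all the recursion computes.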
7 Consider, for example, the simple hierarchy shown in Figure 1a, which picks out five categories relevant to a typical university department: employees, staff, faculty, professors, and assistant professors. [sent-12, score-0.507]
8 Each of the five categories will account for some aspects of the data, but different categories will be needed for understanding different features and relations. [sent-14, score-0.532]
9 “Faculty,” for example, is the single most useful category for describing the employees that publish papers (Figure 1b), but three categories may be needed to describe the social interactions among the employees (Figure 1c). [sent-15, score-0.688]
10 In order to understand the structure of the department, it is important not only to understand the hierarchical organization of the employees, but to understand which levels in the hierarchy are appropriate for describing each feature and each relation. [sent-16, score-0.44]
11 Suppose, then, that an annotated hierarchy is a hierarchy along with a specification of the categories in the hierarchy that are relevant to each feature and relation. [sent-17, score-1.19]
12 The idea of an annotated hierarchy is one of the oldest proposals in cognitive science, and researchers including Collins and Quillian [1] and Keil [7] have argued that semantic knowledge is organized into representations with this form. [sent-18, score-0.484]
13 First, annotated hierarchies are usually hand-engineered, and there are few proposals describing how they might be learned from data. [sent-20, score-0.461]
14 Second, annotated hierarchies typically capture knowledge only about the features of objects: relations between objects are rarely considered. [sent-21, score-0.923]
15 We address both problems by defining a generative model for objects, features, relations, and hierarchies, and showing how it can be used to recover an annotated hierarchy from raw data. [sent-22, score-0.494]
16 Our generative model for feature data assumes that the objects are located at the leaves of a rooted tree, and that each feature is generated from a partition of the objects “consistent” with the hierarchy. [sent-23, score-0.95]
17 A tree-consistent partition (henceforth, t-c partition) of the objects is a partition of the objects into disjoint categories. [sent-24, score-0.844]
18 That is, each class in the partition is exactly the set of leaves descending from some node in the tree. [sent-26, score-0.378]
19 Therefore, a t-c partition can be uniquely encoded as the set of these nodes whose leaf descendants comprise the classes (Figure 1a,b). [sent-27, score-0.304]
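The encoding just described, a set of nodes whose leaf sets are disjoint and jointly cover the objects, is easy to verify mechanically. A sketch under the same illustrative nested-tuple tree encoding:

```python
def leaves(node):
    """Leaf objects below `node`: a tree is a string leaf or a tuple of subtrees."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def is_tc_partition(nodes, objects):
    """True iff the leaf sets of `nodes` are pairwise disjoint and cover `objects`."""
    covered = set()
    for n in nodes:
        ls = leaves(n)
        if covered & ls:      # classes overlap: not a partition
            return False
        covered |= ls
    return covered == objects

tree = (("a", "b"), ("c", "d"))
objs = {"a", "b", "c", "d"}
print(is_tc_partition([tree], objs))              # True: the root alone is one class
print(is_tc_partition([tree[0], tree[1]], objs))  # True: cut into the two subtrees
print(is_tc_partition([tree[0]], objs))           # False: "c" and "d" are uncovered
```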
20 Each class in each partition is labeled with the corresponding node in the tree. [sent-30, score-0.306]
21 (c) Three binary relations, each of which is associated with a different t-c partition of the set of object pairs. [sent-31, score-0.297]
22 We assume that the features of objects in different classes are independent, but that objects in the same class tend to have similar features. [sent-35, score-0.49]
23 Therefore, finding the categories in the tree most relevant to a feature can be formalized as finding the simplest t-c partition that best accounts for the distribution of the feature (Figure 1b). [sent-36, score-0.786]
24 We define an annotated hierarchy as a hierarchy together with a t-c partition for each feature. [sent-37, score-0.882]
25 Although most discussions of annotated hierarchies focus on features, much of the data available to human learners comes in the form of relations. [sent-38, score-0.46]
26 Understanding the structure of social groups, for instance, involves inferences about relations like admires(·, ·), friend-of (·, ·) and brother-of (·, ·). [sent-39, score-0.332]
27 As in the feature case, our generative model for relational data assumes that each (binary) relation is generated from a t-c partition of the set of all pairs of objects. [sent-40, score-0.645]
28 Each class in a t-c partition now corresponds to a pair of categories (i.e., a block n × m of object pairs). [sent-41, score-0.498]
29 As in the feature case, finding the categories in the tree most relevant to a relation can be formalized as finding the t-c partition that best accounts for the distribution of the relation. [sent-44, score-0.857]
30 The t-c partition for each relation can be viewed as an additional annotation of the tree. [sent-45, score-0.347]
31 Roughly speaking, the best hierarchy will then be the one that provides the best categories with which to summarize all the features and relations. [sent-47, score-0.53]
32 Our first analysis suggests that the model recovers coherent domains given objects and features from several domains (animals, foods, tools and vehicles). [sent-50, score-0.387]
33 Next we show that the model discovers interpretable structure in kinship data, and in data representing relationships between ontological kinds. [sent-51, score-0.581]
34 2 A generative model for features and relations Our approach is organized around a generative model for feature data and relational data. [sent-52, score-0.609]
35 For simplicity, we present our model for feature and relational data separately, focusing on the case where we have a single binary feature or a single binary relation. [sent-53, score-0.345]
36 After presenting our generative model, we describe how it can be used to recover annotated hierarchies from data. [sent-54, score-0.474]
37 We begin with the case of a single binary feature and define a joint distribution over three entities: a rooted, weighted, binary tree T with O objects at the leaves; a t-c partition of the objects; and feature observations, d. [sent-55, score-0.837]
38 We will identify each node with the category of objects descending from it. [sent-60, score-0.458]
39 We denote the data for all objects in the category n as dn . [sent-61, score-0.415]
40 In Figure 1b, three t-c partitions associated with the hierarchy are represented and each class in each partition is labeled with the corresponding category. [sent-63, score-0.612]
41 Sample a tree T from a uniform distribution over rooted binary trees with O leaves (each leaf will represent an object and there are O objects). [sent-65, score-0.574]
42 For each category n, sample its weight, wn, according to an exponential distribution with parameter λ, i.e., with density p(wn) = λ exp(−λwn). [sent-68, score-0.31]
43 For each category n ∈ πf , sample θn ∼ Beta(γf , γf ), where θn is the probability that objects in category n exhibit the feature f . [sent-81, score-0.584]
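The sampling steps above can be sketched as follows. The excerpt does not spell out how the weights wn determine the choice of t-c partition, so the sketch substitutes a fixed per-node stopping probability as a stand-in for that process; the Beta draw follows the text, and the Bernoulli step mirrors the relational case described later.

```python
import random

def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def sample_tc_partition(node, stop=0.5):
    """Sample a t-c partition of the leaves below `node`: with probability
    `stop` (a stand-in for the weight-dependent process in the paper) keep
    the whole subtree as one class, otherwise recurse into the children."""
    if isinstance(node, str) or random.random() < stop:
        return [node]
    parts = []
    for child in node:
        parts += sample_tc_partition(child, stop)
    return parts

def sample_feature(partition, gamma=1.0):
    """For each class n, draw theta_n ~ Beta(gamma, gamma); each object in
    the class then exhibits the feature with probability theta_n."""
    d = {}
    for cls in partition:
        theta = random.betavariate(gamma, gamma)
        for obj in leaves(cls):
            d[obj] = int(random.random() < theta)
    return d

random.seed(0)
tree = (("a", "b"), ("c", "d"))
pi = sample_tc_partition(tree)
d = sample_feature(pi)
```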
44 Consider now the case where we have a single binary relation defined over all ordered pairs of objects {(oi , oj )}. [sent-85, score-0.564]
45 In the relational case, our joint distribution is defined over a rooted, weighted, binary tree; a t-c partition of ordered pairs of objects; and observed, relational data represented as a matrix D where Di,j = 1 if the relation holds between oi and oj . [sent-86, score-0.981]
46 Given a pair of categories (ni , mj ), let ni × mj be the set of all pairs of objects (oi , oj ) such that oi is an object in the category ni and oj is an object in the category mj . [sent-87, score-1.593]
47 For pairs of objects, a t-c partition, π, is a set of pairs of categories {(n1, m1), (n2, m2), ..., (nk, mk)} [sent-88, score-0.3]
48 such that, for every pair of objects (oi, oj), there exists exactly one pair (n, m) ∈ π such that (oi, oj) ∈ n × m. [sent-91, score-0.799]
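The defining condition, each ordered pair of objects falling in exactly one block n × m, can be checked directly. A sketch (the staff/faculty-style names and tuple trees are illustrative stand-ins):

```python
def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def is_relational_tc_partition(blocks, objects):
    """True iff every ordered pair (oi, oj) of objects lies in exactly one
    block n x m, for (n, m) in `blocks`."""
    count = {}
    for n, m in blocks:
        for oi in leaves(n):
            for oj in leaves(m):
                count[(oi, oj)] = count.get((oi, oj), 0) + 1
    return all(count.get((oi, oj), 0) == 1 for oi in objects for oj in objects)

S = ("s1", "s2")  # a staff-like category
F = ("f1", "f2")  # a faculty-like category
objs = {"s1", "s2", "f1", "f2"}
blocks = [(S, S), (S, F), (F, S), (F, F)]  # cf. the works-with partition in Figure 1c
print(is_relational_tc_partition(blocks, objs))      # True
print(is_relational_tc_partition(blocks[:3], objs))  # False: (F, F) pairs uncovered
```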
49 Each t-c partition now splits the matrix into contiguous, rectangular blocks (see Figure 1c, where each rectangular block is labeled with its category pair). [sent-93, score-0.369]
50 Intuitively, if both categories in a pair (n, m) have large weight, the process is more likely to group all pairs of objects in n × m into a single class. [sent-106, score-0.529]
51 In Figure 1c, the t-c partition for the works with relation is {(S, S), (S, F ), (F, S), (F, F )}. [sent-107, score-0.347]
52 For each pair of categories (n, m) ∈ πr , sample θn,m ∼ Beta(γr , γr ), where θn,m is the probability that the relation holds between any pair of objects in n × m. [sent-109, score-0.683]
53 For each pair of objects (oi, oj), sample the relation Di,j ∼ Bernoulli(θn,m), where (n, m) ∈ πr and (oi, oj) ∈ n × m. [sent-112, score-0.698]
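The last two sampling steps can be sketched as follows, taking the block partition as given (the tuple-tree encoding and object names are illustrative assumptions):

```python
import random

def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def sample_relation(blocks, objects, gamma=1.0):
    """For each block (n, m), draw theta ~ Beta(gamma, gamma); then set
    D[i][j] ~ Bernoulli(theta) for every ordered pair (oi, oj) in n x m."""
    idx = {o: k for k, o in enumerate(objects)}
    D = [[0] * len(objects) for _ in objects]
    for n, m in blocks:
        theta = random.betavariate(gamma, gamma)
        for oi in leaves(n):
            for oj in leaves(m):
                D[idx[oi]][idx[oj]] = int(random.random() < theta)
    return D

random.seed(1)
S, F = ("s1", "s2"), ("f1", "f2")
objects = ["s1", "s2", "f1", "f2"]
D = sample_relation([(S, S), (S, F), (F, S), (F, F)], objects)
```

Because each block shares a single θ, the sampled matrix tends to be homogeneous within each rectangular block, which is what makes the block structure recoverable from data.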
54 For data sets with multiple relations and features, we assume that all relations and features are conditionally independent given the weighted tree T . [sent-114, score-0.731]
55 1 Inference Given observations of features and relations, we can use the generative model to ask various questions about the latent hierarchy and its annotations. [sent-116, score-0.353]
56 We start by determining the posterior distribution on the weighted tree topologies, (T, w), given data D = ({d(f)}f=1..F, {D(r)}r=1..R) over O objects, F features and R relations, and hyperparameters λ and γ = ({γf}f=1..F, {γr}r=1..R). [sent-117, score-0.51]
57 Because the classes are independent, P(d(f) | π, γf) = ∏n∈π P(dn(f) | n ∈ π, γf), where Mf(n) = P(dn(f) | n ∈ π, γf) is the marginal likelihood for dn(f), the features for objects in category n. [sent-120, score-0.487]
58 Now the possible partitions of the objects in category n are the trivial one-class partition {n}, together with every t-c partition of the objects below n1 paired with every t-c partition below n2. [sent-130, score-0.487]
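The recursion described in the text, with a t-c partition below n being either the single class covering n or a partition below n1 crossed with one below n2, can be written as a short dynamic program. Two ingredients here are our own stand-ins: a fixed stopping prior `stop` replaces the paper's weight-dependent prior on partitions, while the Beta-Bernoulli marginal uses the Beta(γ, γ) prior from the text.

```python
from math import lgamma, exp

def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def beta_fn(a, b):
    """The Beta function B(a, b), via log-gamma for numerical stability."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def marginal(node, data, gamma=1.0):
    """Beta-Bernoulli marginal likelihood M_f(n) for the class at `node`."""
    objs = leaves(node)
    ones = sum(data[o] for o in objs)
    return beta_fn(gamma + ones, gamma + len(objs) - ones) / beta_fn(gamma, gamma)

def T(node, data, gamma=1.0, stop=0.5):
    """Likelihood of the data below `node`, summed over t-c partitions:
    either stop here (one class covers the subtree) or recurse into both
    children and combine their partitions."""
    if isinstance(node, str):
        return marginal(node, data, gamma)
    left, right = node
    return (stop * marginal(node, data, gamma)
            + (1 - stop) * T(left, data, gamma, stop) * T(right, data, gamma, stop))

data = {"a": 1, "b": 1, "c": 0, "d": 0}
tree = (("a", "b"), ("c", "d"))
print(T(tree, data))  # sums over all five t-c partitions of this four-leaf tree
```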
59 For the relational case, we describe a dynamic program Tr(n, m) that calculates P(D(r)n,m | T, w, γr), the probability of all relations between objects in n × m, conditioned on the tree, having marginalized out the t-c partitions and relation probabilities. [sent-133, score-0.882]
60 We find the MAP tree first, rather than jointly optimizing for both the topology and partitions, because marginalizing over the t-c partitions produces more robust trees; marginalization has a (Bayesian) “Occam’s razor” effect and helps avoid overfitting. [sent-140, score-0.388]
61 MAP t-c partitions can be computed by a straightforward modification of the above dynamic programs, replacing sums with max operations and maintaining a list of nodes representing the MAP t-c partition at each node in the tree. [sent-141, score-0.477]
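This sum-to-max modification can be sketched directly: replace the sum over partitions with a max and keep track of the winning node set at each step. As before, a fixed stopping prior `stop` stands in for the paper's weight-dependent prior, and the Beta(γ, γ) marginal matches the text.

```python
from math import lgamma, exp

def leaves(node):
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(c) for c in node))

def beta_fn(a, b):
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def marginal(node, data, gamma=1.0):
    objs = leaves(node)
    ones = sum(data[o] for o in objs)
    return beta_fn(gamma + ones, gamma + len(objs) - ones) / beta_fn(gamma, gamma)

def map_partition(node, data, gamma=1.0, stop=0.5):
    """Max-product version of the sum over t-c partitions: return
    (score, list of nodes forming the MAP t-c partition below `node`)."""
    if isinstance(node, str):
        return marginal(node, data, gamma), [node]
    left, right = node
    here = stop * marginal(node, data, gamma)
    ls, lp = map_partition(left, data, gamma, stop)
    rs, rp = map_partition(right, data, gamma, stop)
    split = (1 - stop) * ls * rs
    return (here, [node]) if here >= split else (split, lp + rp)

# All-ones on the left subtree, all-zeros on the right: the MAP partition
# should cut the tree into its two internally coherent subtrees.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
data = {o: (1 if o in "abcd" else 0) for o in "abcdefgh"}
score, parts = map_partition(tree, data)
print(parts == [tree[0], tree[1]])  # True
```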
62 We take γ → 0 (the asymptotically least informative prior) and report the (empirical) MAP tree and MAP t-c partitions conditioned on the tree. [sent-146, score-0.388]
63 Hierarchical clustering [4] has been successfully used for analyzing both biological data [18] and psychological data, but cannot learn the annotated hierarchies that we consider. [sent-165, score-0.421]
64 Bayesian hierarchical clustering (BHC) [6] is a recent alternative which constructs a tree as a byproduct of approximate inference in a flat clustering model, but lacks any notion of annotations. [sent-166, score-0.281]
65 Our model adds an intervening layer of abstraction by assuming that partitions are generated by a stochastic process over a tree, and that features are generated from these partitions. [sent-170, score-0.277]
66 By introducing a partition for each feature, we gain the ability to annotate a hierarchy with the levels most relevant to each feature. [sent-171, score-0.441]
67 There are several methods for discovering hierarchical structure in relational data [5, 13], but none of these methods provides a general purpose solution to the problem we consider. [sent-172, score-0.322]
68 Most of these methods take a single relation as input, and assume that the hierarchy captures an underlying community structure: in other words, objects that are often paired in the input are assumed to lie nearby in the tree. [sent-173, score-0.571]
69 Our approach handles multiple relations simultaneously, and allows a more flexible mapping between each relation and the underlying hierarchy. [sent-174, score-0.355]
70 Different relations may depend on very different regions of the hierarchy, and some relations may establish connections between categories that are quite distant in the tree (see Figure 4). [sent-175, score-0.889]
71 The IRM handles multiple relations simultaneously, and does not assume that each relation has underlying community structure. [sent-178, score-0.355]
72 The IRM, however, does not discover hierarchical structure; instead it partitions the objects into a set of non-overlapping categories. [sent-179, score-0.492]
73 Our relational model is an extension of the blockmodel that discovers a nested set of categories as well as which categories are useful for understanding each relation in the data set. [sent-180, score-0.941]
74 We analyzed a subset of the full data set including 60 common objects and the 100 features most commonly listed for these objects. [sent-184, score-0.281]
75 The model discovers the four domains as well as superordinate categories (e.g., …). [sent-187, score-0.446]
76 Figure 2 also shows MAP partitions for 10 … (Footnote 1: Note that some of the features are noisy — according to these data, onions are not edible, since none of the participants chose to list this feature for onion.) [sent-192, score-0.343]
77 Figure 3: MAP tree recovered from 49 relations between entities in a biomedical data set. (Panel labels: Chemicals, affects, process of, causes, causes (IRM).) [sent-196, score-0.584]
78 The Infinite Relational Model (IRM) does not capture the appropriate structure in the relation causes because it does not model the latent hierarchy, instead choosing a single partition to describe the structure across all relations. [sent-200, score-0.475]
79 By organizing the 60 objects into domains and identifying a subset of features that are associated with each domain, our model begins to suggest how infants may parse their environment into coherent domains of objects and features. [sent-206, score-0.596]
80 The full data set includes 135 entities and 49 binary relations, where the entities are ontological categories like ‘Sign or Symptom’, ‘Cell’, and ‘Disease or Syndrome,’ and the relations include verbs like causes, analyzes and affects. [sent-209, score-0.712]
81 Individuals have been labelled with their age, gender and kinship section (e.g., …). [sent-212, score-0.281]
82 MAP partitions are shown for four representative relations: the model discovers that different relations depend on the tree in very different ways; hierarchical structure allows for a compact representation (c. [sent-215, score-0.881]
83 The MAP tree is an ontology that captures several natural groupings, including a category for “living things” (plant, bird, animal and mammal) and a category for “chemical substances” (amino acid, lipid, antibiotic, enzyme, etc.). [sent-218, score-0.611]
84 Note that the MAP partitions for causes and analyzes are rather different: one of the reasons why discovering separate t-c partitions for each relation is important is that different relations can depend on very different parts of an ontology. [sent-222, score-0.85]
85 Our third application is inspired by the problem children face when learning the kinship structure of their social group. [sent-223, score-0.439]
86 This problem is especially acute for children growing up in Australian tribes, which have kinship systems that are more complicated in many ways than Western kinship systems, but which nevertheless display some striking regularities. [sent-224, score-0.609]
87 Denham [3] collected a large data set by asking 104 tribe members to provide kinship terms for each other. [sent-226, score-0.393]
88 More than one kinship term may describe the relationship between a pair of individuals — since the data set includes only one term per pair, some of the zeros in each matrix represent missing data rather than relationships that do not hold. [sent-228, score-0.406]
89 The Alyawarra tribe is divided into four kinship sections, and these sections are fundamental to the social structure of the tribe. [sent-230, score-0.488]
90 Whether a kinship term applies between a pair of individuals depends on their sections, ages and genders [3, 8]. [sent-232, score-0.443]
91 The MAP tree divides the individuals perfectly according to kinship section, and discovers additional structure within each section. [sent-234, score-0.776]
92 The MAP partitions for each relation indicate that different relations depend very differently on the structure of the tree. [sent-236, score-0.571]
93 Adiadya refers to a younger member of one’s own kinship section. [sent-237, score-0.281]
94 The MAP partition for this relation contains fine-level structure only along the diagonal, indicating that the model has discovered that the term only applies between individuals from the same kinship section. [sent-238, score-0.743]
95 In some places the MAP partitions appear to overfit the data: the partition for Umbaidya, for example, appears to capture some of the noise in this relation. [sent-241, score-0.384]
96 4 Conclusions We developed a probabilistic model that assumes that features and relations are generated over an annotated hierarchy, and showed how this model can be used to recover annotated hierarchies from raw data. [sent-243, score-0.927]
97 A hierarchy specifies a set of categories, and annotations indicate which of these categories are important for understanding specific features and relations. [sent-246, score-0.53]
98 For example, the kinship data we analyzed may be well described by three sets of overlapping categories where each individual belongs to a kinship section, a gender, and an age group. [sent-248, score-0.831]
99 We have already extended the model to handle continuous data and can imagine other extensions, including higher-order relations, multiple trees, and relations between distinct sets of objects (e. [sent-249, score-0.43]
100 We have shown that formalizing the intuition behind annotated hierarchies in terms of a prior on trees and partitions and a noise-robust likelihood enabled us to discover interesting structure in real-world data. [sent-255, score-0.727]
wordName wordTfidf (topN-words)
[('kinship', 0.281), ('categories', 0.23), ('hierarchy', 0.228), ('relations', 0.221), ('tree', 0.217), ('annotated', 0.213), ('partition', 0.213), ('objects', 0.209), ('hierarchies', 0.208), ('partitions', 0.171), ('discovers', 0.163), ('category', 0.156), ('wn', 0.154), ('oj', 0.15), ('relational', 0.147), ('relation', 0.134), ('tf', 0.134), ('irm', 0.131), ('oi', 0.119), ('employees', 0.098), ('node', 0.093), ('leaf', 0.091), ('map', 0.09), ('mf', 0.087), ('tr', 0.078), ('wm', 0.077), ('alyawarra', 0.075), ('features', 0.072), ('leaves', 0.072), ('individuals', 0.07), ('rooted', 0.068), ('social', 0.066), ('discovering', 0.066), ('hierarchical', 0.064), ('feature', 0.063), ('entities', 0.062), ('nk', 0.06), ('ni', 0.057), ('members', 0.056), ('juicy', 0.056), ('ontological', 0.056), ('professors', 0.056), ('tribe', 0.056), ('umbaidya', 0.056), ('pair', 0.055), ('domains', 0.053), ('generative', 0.053), ('vehicles', 0.052), ('animals', 0.05), ('dn', 0.05), ('ontology', 0.049), ('assistant', 0.049), ('staff', 0.049), ('discover', 0.048), ('object', 0.048), ('living', 0.047), ('children', 0.047), ('parent', 0.046), ('things', 0.046), ('structure', 0.045), ('analyzes', 0.045), ('cognitive', 0.043), ('causes', 0.042), ('trees', 0.042), ('customers', 0.041), ('mj', 0.041), ('sections', 0.04), ('mk', 0.04), ('describing', 0.04), ('kemp', 0.039), ('age', 0.039), ('faculty', 0.039), ('learners', 0.039), ('cause', 0.038), ('abnormalities', 0.037), ('abnormality', 0.037), ('anowadya', 0.037), ('antibiotic', 0.037), ('blockmodel', 0.037), ('chemicals', 0.037), ('chisel', 0.037), ('cree', 0.037), ('descendant', 0.037), ('fruits', 0.037), ('genders', 0.037), ('keil', 0.037), ('lemon', 0.037), ('lipid', 0.037), ('onions', 0.037), ('symptom', 0.037), ('syndrome', 0.037), ('binary', 0.036), ('interpretable', 0.036), ('subtree', 0.036), ('pairs', 0.035), ('beta', 0.034), ('stochastic', 0.034), ('mr', 0.033), ('animal', 0.033), ('mammal', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 115 nips-2006-Learning annotated hierarchies from relational data
Author: Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, Joshua B. Tenenbaum
Abstract: The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. We show that our model discovers interpretable structure in several real-world data sets.
2 0.14323731 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
Author: Marco Cuturi, Kenji Fukumizu
Abstract: We propose a family of kernels for structured objects which is based on the bag-ofcomponents paradigm. However, rather than decomposing each complex object into the single histogram of its components, we use for each object a family of nested histograms, where each histogram in this hierarchy describes the object seen from an increasingly granular perspective. We use this hierarchy of histograms to define elementary kernels which can detect coarse and fine similarities between the objects. We compute through an efficient averaging trick a mixture of such specific kernels, to propose a final kernel value which weights efficiently local and global matches. We propose experimental results on an image retrieval experiment which show that this mixture is an effective template procedure to be used with kernels on histograms.
3 0.1410215 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Author: Frank Moosmann, Bill Triggs, Frederic Jurie
Abstract: Some of the most effective recent methods for content-based image classification work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating histograms of the resulting “visual word” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks. 1
4 0.13972116 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
Author: Joseph Turian, Benjamin Wellington, I. D. Melamed
Abstract: Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with treestructured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method’s versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar. 1
5 0.13591589 185 nips-2006-Subordinate class recognition using relational object models
Author: Aharon B. Hillel, Daphna Weinshall
Abstract: We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like ’jagged wheels’). Accordingly, we suggest a two-stage algorithm: First, a relational part based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model’s part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two stage algorithm where classification is based on a model of the sub-ordinate class. 1
6 0.11141638 69 nips-2006-Distributed Inference in Dynamical Systems
7 0.11042324 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models
8 0.10028908 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
9 0.09759213 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
10 0.090627238 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
11 0.082033537 53 nips-2006-Combining causal and similarity-based reasoning
12 0.081854016 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
13 0.079678431 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
14 0.078619815 41 nips-2006-Bayesian Ensemble Learning
15 0.077683337 35 nips-2006-Approximate inference using planar graph decomposition
16 0.074357815 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
17 0.072782569 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.070107982 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
19 0.069392197 55 nips-2006-Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
20 0.068418369 169 nips-2006-Relational Learning with Gaussian Processes
topicId topicWeight
[(0, -0.198), (1, 0.05), (2, 0.088), (3, -0.097), (4, 0.107), (5, 0.015), (6, -0.018), (7, -0.047), (8, 0.05), (9, -0.253), (10, -0.023), (11, 0.104), (12, 0.039), (13, -0.003), (14, -0.042), (15, -0.117), (16, 0.06), (17, 0.121), (18, -0.074), (19, 0.043), (20, -0.276), (21, 0.009), (22, -0.046), (23, 0.004), (24, -0.023), (25, 0.114), (26, 0.04), (27, 0.014), (28, 0.026), (29, 0.025), (30, 0.108), (31, 0.072), (32, 0.123), (33, -0.055), (34, 0.114), (35, -0.013), (36, 0.071), (37, -0.051), (38, -0.021), (39, 0.026), (40, 0.258), (41, 0.021), (42, 0.027), (43, 0.047), (44, -0.003), (45, 0.036), (46, -0.067), (47, 0.195), (48, -0.061), (49, -0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.96616954 115 nips-2006-Learning annotated hierarchies from relational data
2 0.59536344 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models
Author: Michael C. Mozer, Michael Shettel, Michael P. Holmes
Abstract: Categorization is a central activity of human cognition. When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Specifically, when experimental subjects are shown an exemplar of some target category, the category prototype appears to be pulled toward the exemplar, and the prototypes of all nontarget categories appear to be pushed away. These push and pull effects diminish with experience, and likely reflect long-term learning of category boundaries. We propose and evaluate four principled probabilistic (Bayesian) accounts of context effects in categorization. In all four accounts, the probability of an exemplar given a category is encoded as a Gaussian density in feature space, and categorization involves computing category posteriors given an exemplar. The models differ in how the uncertainty distribution of category prototypes is represented (localist or distributed), and how it is updated following each experience (using a maximum likelihood gradient ascent, or a Kalman filter update). We find that the distributed maximum-likelihood model can explain the key experimental phenomena. Further, the model predicts other phenomena that were confirmed via reanalysis of the experimental data. Categorization is a key cognitive activity. We continually make decisions about characteristics of objects and individuals: Is the fruit ripe? Does your friend seem unhappy? Is your car tire flat? When an individual is asked to categorize a sequence of items, context effects arise: categorization of one item influences category decisions for subsequent items. Intuitive naturalistic scenarios in which context effects occur are easy to imagine. For example, if one lifts a medium-weight object after lifting a light-weight or heavy-weight object, the medium weight feels heavier following the light weight than following the heavy weight. 
Although the object-contrast effect might be due to fatigue of sensory-motor systems, many context effects in categorization are purely cognitive and cannot easily be attributed to neural habituation. For example, if you are reviewing a set of conference papers, and the first three in the set are dreadful, then even a mediocre paper seems like it might be above threshold for acceptance. Another example of a category boundary shift due to context is the following. Suppose you move from San Diego to Pittsburgh and notice that your neighbors repeatedly describe muggy, somewhat overcast days as ”lovely.” Eventually, your notion of what constitutes a lovely day accommodates to your new surroundings. As we describe shortly, experimental studies have shown a fundamental link between context effects in categorization and long-term learning of category boundaries. We believe that context effects can be viewed as a reflection of a trial-to-trial learning, and the cumulative effect of these trial-to-trial modulations corresponds to what we classically consider to be category learning. Consequently, any compelling model of category learning should also be capable of explaining context effects. 1 Experimental Studies of Context Effects in Categorization Consider a set of stimuli that vary along a single continuous dimension. Throughout this paper, we use as an illustration circles of varying diameters, and assume four categories of circles defined ranges of diameters; call them A, B, C, and D, in order from smallest to largest diameter. In a classification paradigm, experimental subjects are given an exemplar drawn from one category and are asked to respond with the correct category label (Zotov, Jones, & Mewhort, 2003). After making their response, subjects receive feedback as to the correct label, which we’ll refer to as the target. 
In a production paradigm, subjects are given a target category label and asked to produce an exemplar of that category, e.g., using a computer mouse to indicate the circle diameter (Jones & Mewhort, 2003). Once a response is made, subjects receive feedback as to the correct or true category label for the exemplar they produced. Neither the classification task nor the production task has inherent sequential structure, because the order of trials is random in both experiments. The production task provides direct information about the subjects' internal representations, because subjects are producing exemplars that they consider to be prototypes of a category, whereas the classification task requires indirect inferences to be made about internal representations from reaction time and accuracy data. Nonetheless, the findings in the production and classification tasks mirror one another nicely, providing converging evidence as to the nature of learning. The production task reveals how mental representations shift as a function of trial-to-trial sequences, and these shifts cause the sequential pattern of errors and response times typically observed in the classification task. We focus on the production task in this paper because it provides a richer source of data; however, we address the classification task with our models as well. Figure 1 provides a schematic depiction of the key sequential effects in categorization. The horizontal line represents the stimulus dimension, e.g., circle diameter. The dimension is cut into four regions labeled with the corresponding category. The category center, which we'll refer to as the prototype, is indicated by a vertical dashed line. The long solid vertical line marks the current exemplar—whether it is an exemplar presented to subjects in the classification task or an exemplar generated by subjects in the production task.
Following an experimental trial with this exemplar, category prototypes appear to shift: the target-category prototype moves toward the exemplar, which we refer to as a pull effect, and all nontarget-category prototypes move away from the exemplar, which we refer to as a push effect. Push and pull effects are assessed in the production task by examining the exemplar produced on the following trial, and in the categorization task by examining the likelihood of an error response near category boundaries. The set of phenomena to be explained are as follows, described in terms of the production task. All numerical results referred to are from Jones and Mewhort (2003). This experiment consisted of 12 blocks of 40 trials, with each category label given as target 10 times within a block.

• Within-category pull: When a target category is repeated on successive trials, the exemplar generated on the second trial moves toward the exemplar generated on the first trial, with respect to the true category prototype. Across the experiment, a correlation coefficient of 0.524 is obtained, and remains fairly constant over trials.

• Between-category push: When the target category changes from one trial to the next, the exemplar generated on the second trial moves away from the exemplar generated on the first trial (or equivalently, from the prototype of the target category on the first trial). Figure 2a summarizes the sequential push effects from Jones and Mewhort. The diameter of the circle produced on trial t is plotted as a function of the target category on trial t − 1, with one line for each of the four trial-t targets. The mean diameter for each target category is subtracted out, so the absolute vertical offset of each line is unimportant.
The main feature of the data to note is that all four curves have a negative slope, which has the following meaning: the smaller that target t − 1 is (i.e., the further to the left on the x axis in Figure 1), the larger the response to target t is (further to the right in Figure 1), and vice versa, reflecting a push away from target t − 1. Interestingly and importantly, the magnitude of the push increases with the ordinal distance between targets t − 1 and t. Figure 2a is based on data from only eight subjects and is therefore noisy, though the effect is statistically reliable. As further evidence, Figure 2b shows data from a categorization task (Zotov et al., 2003), where the y-axis is a different dependent measure, but the negative slope has the same interpretation as in Figure 2a.

[Figure 1: Schematic depiction of sequential effects in categorization. The stimulus dimension is divided into four regions, A–D; dashed vertical lines mark the category prototypes, and a long solid vertical line marks the current exemplar.]

[Figure 2: Push effect data: response bias from (a) the production task of Jones and Mewhort (2003), (b) the classification task of Zotov et al. (2003), and (c)–(f) the models proposed in this paper: (c) KFU-local, (d) KFU-distrib, (e) MLGA-local, (f) MLGA-distrib. The y axis is the deviation of the response from the mean, as a proportion of the total category width. The response to category A is solid red, B is dashed magenta, C is dash-dotted blue, and D is dotted green.]
• Push and pull effects are not solely a consequence of errors or experimenter feedback. In quantitative estimation of push and pull effects, trial t is included in the data only if the response on trial t − 1 is correct. Thus, the effects follow trials in which no error feedback is given to the subjects, and therefore the adjustments are not due to explicit error correction.

• Push and pull effects diminish over the course of the experiment. The magnitude of push effects can be measured by the slope of the regression lines fit to the data in Figure 2a. The slopes get shallower over successive trial blocks. The magnitude of pull effects can be measured by the standard deviation (SD) of the produced exemplars, which also decreases over successive trial blocks.

• Accuracy increases steadily over the course of the experiment, from 78% correct responses in the first block to 91% in the final block. This improvement occurs despite the fact that error feedback is relatively infrequent and becomes even less frequent as performance improves.

2 Four Models

In this paper, we explore four probabilistic (Bayesian) models to explain the data described in the previous section. The key phenomenon to explain turns out to be the push effect, for which three of the four models fail to account. Modelers typically discard the models that they reject, and present only their pet model. In this work, we find it useful to report on the rejected models for three reasons. First, they help to set up and motivate the one successful model. Second, they include several obvious candidates, and we therefore have the imperative to address them. Third, in order to evaluate a model that can explain certain data, one needs to know the degree to which the data constrain the space of models.
If many models exist that are consistent with the data, one has little reason to prefer our pet candidate. Underlying all of the models is a generative probabilistic framework in which a category i is represented by a prototype value, d_i, on the dimension that discriminates among the categories. In the example used throughout this paper, the dimension is the diameter of a circle (hence the notation d for the prototype). An exemplar, E, of category i is drawn from a Gaussian distribution with mean d_i and variance v_i, denoted E ∼ N(d_i, v_i). Category learning involves determining d ≡ {d_i}. In this work, we assume that the {v_i} are fixed and given. Because d is unknown at the start of the experiment, it is treated as the value of a random vector, D ≡ {D_i}. Figure 3a shows a simple graphical model representing the generative framework, in which E is the exemplar and C the category label. To formalize our discussion so far, we adopt the following notation:

P(E | C = c, D = d) ∼ N(h_c d, v_c),   (1)

where, for the time being, h_c is a unary vector all of whose elements are zero except for element c, which has value 1. (Subscripts may indicate either an index over elements of a vector, or an index over vectors. Boldface is used for vectors and matrices.)

[Figure 3: (a) Graphical model depicting selection of an exemplar, E, of a category, C, based on the prototype vector, D; (b) dynamic version of the model indexed by trials, t.]

We assume that the prototype representation, D, is multivariate Gaussian, D ∼ N(Ψ, Σ), where Ψ and Σ encode knowledge—and uncertainty in the knowledge—of the category prototype structure. Given this formulation, the uncertainty in D can be integrated out:

P(E | C) ∼ N(h_c Ψ, h_c Σ h_c^T + v_c).   (2)

For the categorization task, a category label can be assigned by evaluating the category posterior, P(C|E), via Bayes rule, Equation 1, and the category priors, P(C).
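As a concrete illustration of Equations 1 and 2 and the Bayes-rule classification step, here is a minimal numerical sketch (not the authors' code; the function name, prototype values, and priors are ours) using a localist one-hot h_c:

```python
import numpy as np

def category_posterior(e, psi, sigma, v, priors):
    """Posterior P(C|E=e) with prototype uncertainty D ~ N(psi, sigma)
    integrated out: P(E|C=c) = N(h_c psi, h_c Sigma h_c^T + v_c)."""
    n = len(psi)
    like = np.empty(n)
    for c in range(n):
        h = np.zeros(n)
        h[c] = 1.0                          # localist selector vector
        mean = h @ psi
        var = h @ sigma @ h + v[c]          # marginal exemplar variance
        like[c] = np.exp(-0.5 * (e - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = like * priors
    return post / post.sum()

# illustrative setup: four categories with prototypes at 1.5, 2.5, 3.5, 4.5
psi = np.array([1.5, 2.5, 3.5, 4.5])
p = category_posterior(2.6, psi, 0.01 * np.eye(4), np.full(4, 0.4 ** 2),
                       np.full(4, 0.25))
# an exemplar of 2.6 lies in B's range, so category B gets most of the mass
```

The sketch integrates out the prototype uncertainty exactly, as in Equation 2, because a Gaussian prior on the mean of a Gaussian likelihood yields a Gaussian marginal.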
In this framework, learning takes place via trial-to-trial adaptation of the category prototype distribution, D. In Figure 3b, we add the subscript t to each random variable to denote the trial, yielding a dynamic graphical model for the sequential updating of the prototype vector, D_t. (The reader should be attentive to the fact that we use subscripted indices to denote both trials and category labels. We generally use the index t to denote trial, and c or i to denote a category label.) The goal of our modeling work is to show that the sequential updating process leads to context effects, such as the push and pull effects discussed earlier. We propose four alternative models to explore within this framework. The four models are obtained via the Cartesian product of two binary choices: the learning rule and the prototype representation.

2.1 Learning rule

The first learning rule, maximum likelihood gradient ascent (MLGA), attempts to adjust the prototype representation so as to maximize the log posterior of the category given the exemplar. (The category, C = c, is the true label associated with the exemplar, i.e., either the target label the subject was asked to produce, or—if an error was made—the actual category label the subject did produce.) Gradient ascent is performed in all parameters of Ψ and Σ:

Δψ_i = ε_ψ ∂ log P(c|e) / ∂ψ_i   and   Δσ_ij = ε_σ ∂ log P(c|e) / ∂σ_ij,   (3)

where ε_ψ and ε_σ are step sizes. To ensure that Σ remains a covariance matrix, constrained gradient steps are applied. The constraints are: (1) diagonal terms are nonnegative, i.e., σ_i^2 ≥ 0; (2) off-diagonal terms are symmetric, i.e., σ_ij = σ_ji; and (3) the matrix remains positive definite, ensured by −1 ≤ σ_ij / (σ_i σ_j) ≤ 1.

The second learning rule, a Kalman filter update (KFU), reestimates the uncertainty distribution of the prototypes given evidence provided by the current exemplar and category label.
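One measurement update of this kind can be sketched as follows (a generic Kalman measurement update with identity dynamics and no process noise; `kfu_step` is an illustrative name, not from the paper). With a localist one-hot h_c, only the target prototype's belief changes:

```python
import numpy as np

def kfu_step(e, h, psi, sigma, v):
    """Kalman measurement update of the prototype belief D ~ N(psi, sigma)
    given scalar exemplar e, measurement vector h, and noise variance v."""
    s = h @ sigma @ h + v                        # innovation variance
    k = sigma @ h / s                            # Kalman gain
    psi_new = psi + k * (e - h @ psi)            # shift means toward evidence
    sigma_new = sigma - np.outer(k, h) @ sigma   # (I - k h^T) Sigma
    return psi_new, sigma_new

# illustrative localist example: exemplar 2.8 with target category B
psi = np.array([1.5, 2.5, 3.5, 4.5])
h_b = np.array([0.0, 1.0, 0.0, 0.0])
psi2, sig2 = kfu_step(2.8, h_b, psi, 0.01 * np.eye(4), 0.4 ** 2)
# only the B prototype moves; overall uncertainty shrinks
```

With a diagonal Σ and one-hot h, the gain vector k is zero everywhere except the target entry, which is the algebraic reason (previewed here, discussed below) that the localist KFU model cannot produce a push effect.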
To draw the correspondence between our framework and a Kalman filter: the exemplar is a scalar measurement that pops out of the filter, the category prototypes are the hidden state of the filter, the measurement noise is v_c, and the linear mapping from state to measurement is achieved by h_c. Technically, the model is a measurement-switched Kalman filter, where the switching is determined by the category label c, i.e., the measurement function, h_c, and noise, v_c, are conditioned on c. The Kalman filter also allows temporal dynamics via the update equation, d_t = A d_{t−1}, as well as internal process noise, whose covariance matrix is often denoted Q in standard Kalman filter notation. We investigated the choice of A and Q, but because they did not impact the qualitative outcome of the simulations, we used A = I and Q = 0. Given the correspondence we've established, the KFU equations—which specify Ψ_{t+1} and Σ_{t+1} as a function of c_t, e_t, Ψ_t, and Σ_t—can be found in an introductory text (e.g., Maybeck, 1979).

[Figure 4: Change to each category prototype following a trial of a given category. Solid (open) bars indicate trials in which the exemplar is larger (smaller) than the prototype.]

2.2 Representation of the prototype

The prototype representation that we described is localist: there is a one-to-one correspondence between the prototype for each category i and the random variable D_i. To select the appropriate prototype given a current category c, we defined the unary vector h_c and applied h_c as a linear transform on D. The identical operations can be performed in conjunction with a distributed representation of the prototype. But we step back momentarily to motivate the distributed representation.
The localist representation suffers from a key weakness: it does not exploit interrelatedness constraints on category structure. The task given to experimental subjects specifies that there are four categories, and they have an ordering; the circle diameters associated with category A are smaller than the diameters associated with B, etc. Consequently, d_A < d_B < d_C < d_D. One might make a further assumption that the category prototypes are equally spaced. Exploiting these two sources of domain knowledge leads to the distributed representation of category structure. A simple sort of distributed representation involves defining the prototype for category i not as d_i but as a linear function of an underlying two-dimensional state-space representation of structure. In this state space, d_1 indicates the distance between categories and d_2 an offset for all categories. This representation of state can be achieved by applying Equation 1 and defining h_c = (n_c, 1), where n_c is the ordinal position of the category (n_A = 1, n_B = 2, etc.). We augment this representation with a bit of redundancy by incorporating not only the ordinal positions but also the reverse ordinal positions; this addition yields a symmetry in the representation between the two ends of the ordinal category scale. As a result of this augmentation, d becomes a three-dimensional state space, and h_c = (n_c, N + 1 − n_c, 1), where N is the number of categories. To summarize, both the localist and distributed representations posit the existence of a hidden state space—unknown at the start of learning—that specifies category prototypes. The localist model assumes one dimension in the state space per prototype, whereas the distributed model assumes fewer dimensions in the state space—three, in our proposal—than there are prototypes, and computes the prototype location as a function of the state.
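The distributed coding can be illustrated in a few lines (a sketch with our own names and numbers): with a state vector d = (spacing, 0, offset), the linear map h_c · d reproduces equally spaced prototypes.

```python
import numpy as np

def h_distrib(c, N):
    """Distributed measurement vector for category c (0-indexed) among N
    ordered categories: ordinal position, reverse ordinal position, constant."""
    n = c + 1                              # n_A = 1, n_B = 2, ...
    return np.array([n, N + 1 - n, 1.0])

# state: inter-category spacing 1.0, no reverse-ordinal weight, offset 0.5
d = np.array([1.0, 0.0, 0.5])
protos = [h_distrib(c, 4) @ d for c in range(4)]
# protos -> [1.5, 2.5, 3.5, 4.5]
```

Because every prototype is a function of the same three state variables, an update driven by one category's exemplar necessarily moves the other prototypes too, which is the property the localist representation lacks.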
Both localist and distributed representations assume a fixed, known {h_c} that specifies the interpretation of the state space, or, in the case of the distributed model, the subject's domain knowledge about category structure.

3 Simulation Methodology

We defined a one-dimensional feature space in which categories A–D corresponded to the ranges [1, 2), [2, 3), [3, 4), and [4, 5), respectively. In the human experiment, responses were considered incorrect if they were smaller than A or larger than D; we call these two cases out-of-bounds-low (OOBL) and out-of-bounds-high (OOBH). OOBL and OOBH were treated as two additional categories, resulting in 6 categories altogether for the simulation. Subjects and the model were never asked to produce exemplars of OOBL or OOBH, but feedback was given if a response fell into these categories. As in the human experiment, our simulation involved 480 trials. We performed 100 replications of each simulation with identical initial conditions but different trial sequences, and averaged results over replications. All prototypes were initialized to have the same mean, 3.0, at the start of the simulation. Because subjects had some initial practice on the task before the start of the experimental trials, we provided the models with 12 initial trials of a categorization (not production) task, two for each of the 6 categories. (For the MLGA models, it was necessary to use a large step size on these trials to move the prototypes to roughly the correct neighborhood.) To perform the production task, the models must generate an exemplar given a category. It seems natural to draw an exemplar from the distribution in Equation 2 for P(E|C). However, this distribution reflects the full range of exemplars that lie within the category boundaries, and presumably in the production task, subjects attempt to produce a prototypical exemplar.
Consequently, we exclude the intrinsic category variance, v_c, from Equation 2 in generating exemplars, leaving variance only via uncertainty about the prototype. Each model involved selection of various parameters and initial conditions. We searched the parameter space by hand, attempting to find parameters that satisfied basic properties of the data: the accuracy and response variance in the first and second halves of the experiment. We report only parameters for the one model that was successful, MLGA-Distrib: ε_ψ = 0.0075; ε_σ = 1.5 × 10^−6 for off-diagonal terms and 1.5 × 10^−7 for diagonal terms (the gradient for the diagonal terms was relatively steep); Σ_0 = 0.01 I; and for all categories c, v_c = 0.4^2.

4 Results

4.1 Push effect

The phenomenon that most clearly distinguishes the models is the push effect. The push effect is manifested in sequential-dependency functions, which plot the (relative) response on trial t as a function of trial t − 1. As we explained using Figures 2a,b, the signature of the push effect is a negatively sloped line for each of the different trial-t target categories. The sequential-dependency functions for the four models are presented in Figures 2c–f. KFU-Local (Figure 2c) produces a flat line, indicating no push whatsoever. The explanation for this result is straightforward: the Kalman filter update alters only the variable that is responsible for the measurement (exemplar) obtained on that trial. That variable is the prototype of the target class c, D_c. We thought the lack of an interaction among the category prototypes might be overcome with KFU-Distrib, because with a distributed prototype representation, all of the state variables jointly determine the target category prototype. However, our intuition turned out to be incorrect. We experimented with many different representations and parameter settings, but KFU-Distrib consistently obtained flat or shallow positively sloped lines (Figure 2d).
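A minimal numerical sketch of the MLGA update of Equation 3 follows (a simplified version under assumptions of our own: prototype means only, localist representation, diagonal Σ, uniform priors, and a central-difference numerical gradient in place of the analytic one; `mlga_step` is an illustrative name). Even this stripped-down version shows the pull of the target prototype toward the exemplar and the push of the others away:

```python
import numpy as np

def log_posterior(e, c, psi, sigma, v):
    """log P(c|e) with prototype uncertainty integrated out
    (localist h_c, diagonal Sigma, uniform category priors)."""
    n = len(psi)
    logp = np.empty(n)
    for i in range(n):
        var = sigma[i, i] + v[i]           # marginal exemplar variance
        logp[i] = -0.5 * np.log(2 * np.pi * var) - 0.5 * (e - psi[i]) ** 2 / var
    return logp[c] - np.logaddexp.reduce(logp)

def mlga_step(e, c, psi, sigma, v, eps=0.0075):
    """One gradient-ascent step on the prototype means only
    (the full model also adapts Sigma)."""
    grad, h = np.zeros_like(psi), 1e-6
    for i in range(len(psi)):
        up, dn = psi.copy(), psi.copy()
        up[i] += h
        dn[i] -= h
        grad[i] = (log_posterior(e, c, up, sigma, v)
                   - log_posterior(e, c, dn, sigma, v)) / (2 * h)
    return psi + eps * grad

psi = np.array([1.5, 2.5, 3.5, 4.5])
psi2 = mlga_step(2.8, 1, psi, 0.01 * np.eye(4), np.full(4, 0.4 ** 2))
# target B is pulled toward the exemplar; A, C, D are pushed away from it
```

Note that in this localist form the push on a distant category is tiny, because that category carries almost no posterior mass at the exemplar, which is exactly the limitation of MLGA-Local described next.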
MLGA-Local (Figure 2e) obtains a push effect for neighboring classes, but not distant classes. For example, examining the dashed magenta line, note that B is pushed away by A and C, but is not affected by D. MLGA-Local maximizes the likelihood of the target category both by pulling the class-conditional density of the target category toward the exemplar and by pushing the class-conditional densities of the other categories away from the exemplar. However, if a category has little probability mass at the location of the exemplar, the increase in likelihood that results from pushing it further away is negligible, and consequently, so is the push effect. MLGA-Distrib obtains a lovely result (Figure 2f)—a negatively sloped line, diagnostic of the push effect. The effect magnitude matches that in the human data (Figure 2a), and captures the key property that the push effect increases with the ordinal distance of the categories. We did not build a mechanism into MLGA-Distrib to produce the push effect; it is somewhat of an emergent property of the model. The state representation of MLGA-Distrib has three components: d_1, the weight of the ordinal position of a category prototype, d_2, the weight of the reverse ordinal position, and d_3, an offset. The last term, d_3, cannot be responsible for a push effect, because it shifts all prototypes equally, and therefore can only produce a flat sequential dependency function. Figure 4 helps provide an intuition for how d_1 and d_2 work together to produce the push effect. Each graph shows the average movement of the category prototype (units on the y-axis are arbitrary) observed on trial t, for each of the four categories, following presentation of a given category on trial t − 1. Positive values on the y axis indicate increases in the prototype (movement to the right in Figure 1), and negative values decreases.
Each solid vertical bar represents the movement of a given category prototype following a trial in which the exemplar is larger than its current prototype; each open vertical bar represents movement when the exemplar is to the left of its prototype. Notice that all category prototypes get larger or smaller on a given trial. But over the course of the experiment, the exemplar should be larger than the prototype as often as it is smaller, and the two shifts should sum together and partially cancel out. The result is the value indicated by the small horizontal bar along each line. The balance between the shifts in the two directions exactly corresponds to the push effect. Thus, the model produces a push-effect graph, but it is not truly producing a push effect as was originally conceived by the experimentalists. We are currently considering empirical consequences of this simulation result. Figure 5 shows a trial-by-trial trace from MLGA-Distrib.

Figure 5: Trial-by-trial trace of MLGA-Distrib.
(a) exemplars generated on one run of the simulation; (b) the mean and (c) variance of the class prototype distribution for the 6 classes on one run; (d) mean proportion correct over 100 replications of the simulation; (e) push and pull effects, as measured by changes to the prototype means: the upper (green) curve is the pull of the target prototype mean toward the exemplar, and the lower (red) curve is the push of the nontarget prototype means away from the exemplar, over 100 replications; (f) category posterior of the generated exemplar over 100 replications, reflecting gradient ascent in the posterior.

4.2 Other phenomena accounted for

MLGA-Distrib captures the other phenomena we listed at the outset of this paper. Like all of the other models, MLGA-Distrib readily produces a pull effect, which is shown in the movement of category prototypes in Figure 5e. More observably, a pull effect is manifested in that responses on two successive trials of the same category are positively correlated: when trial t − 1 is to the left of the true category prototype, trial t is likely to be to the left as well. In the human data, the correlation coefficient over the experiment is 0.524; in the model, the coefficient is 0.496. The explanation for the pull effect is apparent: moving the category prototype to the exemplar increases the category likelihood. Although many learning effects in humans are based on error feedback, the experimental studies showed that push and pull effects occur even in the absence of errors, as they do in MLGA-Distrib. The model simply assumes that the target category it used to generate an exemplar is the correct category when no feedback to the contrary is provided. As long as the likelihood gradient is nonzero, category prototypes will be shifted. Pull and push effects shrink over the course of the experiment in human studies, as they do in the simulation.
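The two effect measures used here, the within-category pull correlation (0.524 in humans, 0.496 in the model) and the sequential-dependency regression slope, can be computed with short sketches like the following (our own illustrative measurement code, not the authors'; the synthetic data are constructed only to exercise the functions):

```python
import numpy as np

def pull_correlation(targets, responses, means):
    """Correlation of response deviations on successive same-target
    trials; a positive value is the signature of the pull effect."""
    dev = responses - means[targets]
    same = targets[1:] == targets[:-1]          # repeated-target trial pairs
    return float(np.corrcoef(dev[:-1][same], dev[1:][same])[0, 1])

def push_slope(targets, responses, means):
    """OLS slope of the trial-t response deviation against the trial t-1
    target's ordinal position; a negative slope indicates push."""
    x = targets[:-1].astype(float)              # previous target position
    y = responses[1:] - means[targets[1:]]      # deviation from category mean
    x = x - x.mean()
    return float(x @ (y - y.mean()) / (x @ x))

means = np.array([1.5, 2.5, 3.5, 4.5])

# synthetic sequence with a built-in push of -0.05 per ordinal step
targets = np.tile(np.arange(4), 10)
responses = means[targets].copy()
responses[1:] -= 0.05 * targets[:-1]
slope = push_slope(targets, responses, means)   # recovers roughly -0.05

# perfectly repeated deviations on same-target pairs give r near 1
t2 = np.array([0, 0, 1, 1, 2, 2, 3, 3])
d2 = np.array([0.1, 0.1, -0.2, -0.2, 0.05, 0.05, -0.1, -0.1])
r = pull_correlation(t2, means[t2] + d2, means)
```

Applied to model output, `push_slope` corresponds to the regression slopes reported for Figure 2f, and `pull_correlation` to the 0.496 figure above.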
Figure 5e shows a reduction in both pull and push, as measured by the shift of the prototype means toward or away from the exemplar. We measured the slope of MLGA-Distrib's push function (Figure 2f) for trials in the first and second half of the simulation. The slope dropped from −0.042 to −0.025, as one would expect from Figure 5e. (These slopes are obtained by combining responses from 100 replications of the simulation. Consequently, each point on the push function was an average over 6000 trials, and therefore the regression slopes are highly reliable.) A quantitative, observable measure of pull is the standard deviation (SD) of responses. As push and pull effects diminish, SDs should decrease. In human subjects, the response SDs in the first and second half of the experiment are 0.43 and 0.33, respectively. In the simulation, the response SDs are 0.51 and 0.38. This shrinkage reflects the fact that the model is approaching a local optimum in log likelihood, causing gradients—and learning steps—to become smaller. Not all model parameter settings lead to shrinkage; as in any gradient-based algorithm, step sizes that are too large do not lead to convergence. However, such parameter settings make little sense in the context of the learning objective.

4.3 Model predictions

MLGA-Distrib produces greater pull of the target category toward the exemplar than push of the neighboring categories away from the exemplar. In the simulation, the magnitude of the target pull—measured by the movement of the prototype mean—is 0.105, contrasted with the neighbor push, which is 0.017. After observing this robust result in the simulation, we found pertinent experimental data. Using the categorization paradigm, Zotov et al.
(2003) found that if the exemplar on trial t is near a category border, subjects are more likely to produce an error if the category on trial t − 1 is repeated (i.e., a pull effect just took place) than if the previous trial is of the neighboring category (i.e., a push effect), even when the distance between exemplars on t − 1 and t is matched. The greater probability of error translates to a greater magnitude of pull than push. The experimental studies noted a phenomenon termed snap back. If the same target category is presented on successive trials, and an error is made on the first trial, subjects perform very accurately on the second trial, i.e., they generate an exemplar near the true category prototype. It appears as if subjects, realizing they have been slacking, reawaken and snap the category prototype back to where it belongs. We tested the model, but observed a sort of anti-snap-back: if the model made an error on the first trial, the mean deviation was larger—not smaller—on the second trial: 0.40 versus 0.32. Thus, MLGA-Distrib fails to explain this phenomenon. However, the phenomenon is not inconsistent with the model. One might suppose that on an error trial, subjects become more attentive, and increased attention might correspond to a larger learning rate on an error trial, which should yield a more accurate response on the following trial. McLaren et al. (1995) studied a phenomenon in humans known as peak shift, in which subjects are trained to categorize unidimensional stimuli into one of two categories. Subjects are faster and more accurate when presented with exemplars far from the category boundary than those near the boundary. In fact, they respond more efficiently to far exemplars than they do to the category prototype. The results are characterized in terms of the prototype of one category being pushed away from the prototype of the other category.
It seems straightforward to explain these data in MLGA-Distrib as a type of long-term push effect.

5 Related Work and Conclusions

Stewart, Brown, and Chater (2002) proposed an account of categorization context effects in which responses are based solely on the relative difference between the previous and present exemplars. No representation of the category prototype is maintained. However, classification based solely on relative difference cannot account for the diminution of bias effects as a function of experience. A long-term stable prototype representation, of the sort incorporated into our models, seems necessary. We considered four models in our investigation, and the fact that only one accounts for the experimental data suggests that the data are nontrivial. All four models have principled theoretical underpinnings, and the space they define may suggest other elegant frameworks for understanding mechanisms of category learning. The successful model, MLGA-Distrib, offers a deep insight into understanding multiple-category domains: category structure must be considered. MLGA-Distrib exploits knowledge available to subjects performing the task concerning the ordinal relationships among categories. A model without this knowledge, MLGA-Local, fails to explain the data. Thus, the interrelatedness of categories appears to provide a source of constraint that individuals use in learning about the structure of the world.

Acknowledgments

This research was supported by NSF BCS 0339103 and NSF CSE-SMA 0509521. Support for the second author comes from an NSERC fellowship.

References

Jones, M. N., & Mewhort, D. J. K. (2003). Sequential contrast and assimilation effects in categorization of perceptual stimuli. Poster presented at the 44th Meeting of the Psychonomic Society, Vancouver, B.C.

Maybeck, P. S. (1979). Stochastic models, estimation, and control, Volume I. Academic Press.

McLaren, I. P. L., et al. (1995). Prototype effects and peak shift in categorization.
JEP:LMC, 21, 662–673.

Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in categorization of simple perceptual stimuli. JEP:LMC, 28, 3–11.

Zotov, V., Jones, M. N., & Mewhort, D. J. K. (2003). Trial-to-trial representation shifts in categorization. Poster presented at the 13th Meeting of the Canadian Society for Brain, Behaviour, and Cognitive Science, Hamilton, Ontario.
3 0.5297609 185 nips-2006-Subordinate class recognition using relational object models
Author: Aharon B. Hillel, Daphna Weinshall
Abstract: We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like ’jagged wheels’). Accordingly, we suggest a two-stage algorithm: First, a relational part based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model’s part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two stage algorithm where classification is based on a model of the sub-ordinate class. 1
4 0.49543709 41 nips-2006-Bayesian Ensemble Learning
Author: Hugh A. Chipman, Edward I. George, Robert E. Mcculloch
Abstract: We develop a Bayesian “sum-of-trees” model, named BART, where each tree is constrained by a prior to be a weak learner. Fitting and inference are accomplished via an iterative backfitting MCMC algorithm. This model is motivated by ensemble methods in general, and boosting algorithms in particular. Like boosting, each weak learner (i.e., each weak tree) contributes a small amount to the overall model. However, our procedure is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full and accurate assessment of uncertainty in model predictions, while remaining highly competitive in terms of predictive accuracy. 1
5 0.48177972 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Author: Frank Moosmann, Bill Triggs, Frederic Jurie
Abstract: Some of the most effective recent methods for content-based image classification work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating histograms of the resulting “visual word” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing, and good resistance to background clutter in several state-of-the-art image classification tasks.
6 0.45388487 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
7 0.43976 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
8 0.43708566 69 nips-2006-Distributed Inference in Dynamical Systems
9 0.43037605 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
10 0.42943332 53 nips-2006-Combining causal and similarity-based reasoning
11 0.4234674 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models
12 0.39378539 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction
13 0.37588912 170 nips-2006-Robotic Grasping of Novel Objects
14 0.36094201 169 nips-2006-Relational Learning with Gaussian Processes
15 0.3523446 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments
16 0.33437854 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
17 0.33248419 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
18 0.33058158 139 nips-2006-Multi-dynamic Bayesian Networks
19 0.30302614 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
20 0.2990694 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
topicId topicWeight
[(1, 0.082), (3, 0.021), (7, 0.065), (9, 0.026), (12, 0.011), (20, 0.052), (22, 0.092), (23, 0.275), (44, 0.071), (57, 0.065), (65, 0.091), (69, 0.03), (90, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.80675405 115 nips-2006-Learning annotated hierarchies from relational data
Author: Daniel M. Roy, Charles Kemp, Vikash K. Mansinghka, Joshua B. Tenenbaum
Abstract: The objects in many real-world domains can be organized into hierarchies, where each internal node picks out a category of objects. Given a collection of features and relations defined over a set of objects, an annotated hierarchy includes a specification of the categories that are most useful for describing each individual feature and relation. We define a generative model for annotated hierarchies and the features and relations that they describe, and develop a Markov chain Monte Carlo scheme for learning annotated hierarchies. We show that our model discovers interpretable structure in several real-world data sets.
2 0.55263495 61 nips-2006-Convex Repeated Games and Fenchel Duality
Author: Shai Shalev-shwartz, Yoram Singer
Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can all be derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization.

1 Introduction and Problem Setting

Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two-player game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector w_t from a convex set S. Next, the second player responds with a convex function g_t : S → R. Finally, the first player suffers an instantaneous loss g_t(w_t). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, Σ_t g_t(w_t). To motivate this rather abstract setting, let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector x_t, and is required to provide an answer for this question. For example, x_t can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on a hypothesis, h_t : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1}, where +1 stands for a spam email and −1 stands for a benign one.
After predicting an answer, the learner receives the correct answer for the question, denoted y_t, and suffers loss according to a loss function ℓ(h_t, (x_t, y_t)). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {h_w : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {h_w(x) = sign(⟨w, x⟩) : w ∈ R^n}. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector w_t and its hypothesis is h_{w_t}. Next, we note that once the environment chooses a question–answer pair (x_t, y_t), the loss function becomes a function over the hypothesis space or, equivalently, over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector w_t ∈ S, which defines a hypothesis h_{w_t} to be used for prediction. Then, the environment chooses a question–answer pair (x_t, y_t), which induces the following loss function over the set of parameter vectors: g_t(w) = ℓ(h_w, (x_t, y_t)). Finally, the learner suffers the loss g_t(w_t) = ℓ(h_{w_t}, (x_t, y_t)). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, (1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(u). Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form: ∀u ∈ S, (1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(u) ≤ (f(u) + L)/√T, (1) where f : S → R and L ∈ R_+.
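The reduction above can be sketched in code. The following is a minimal illustration, not the paper's template algorithm: each revealed pair (x_t, y_t) induces a hinge loss g_t (one concrete choice of ℓ), and the learner plays w_t and takes a subgradient step. All function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def online_learning_as_game(stream, n, eta=0.5):
    """Online binary classification cast as a convex repeated game: the
    learner's move on round t is w_t; revealing (x_t, y_t) induces the
    convex loss g_t(w) = [1 - y_t <w, x_t>]_+ (hinge loss), and the learner
    suffers g_t(w_t).  The subgradient step below is one simple strategy
    for the first player, not the paper's general template."""
    w = [0.0] * n
    total_loss = 0.0
    for x_t, y_t in stream:
        loss = max(0.0, 1.0 - y_t * dot(w, x_t))   # instantaneous loss g_t(w_t)
        total_loss += loss
        if loss > 0.0:                             # a subgradient of the hinge loss
            w = [wi + eta * y_t * xi for wi, xi in zip(w, x_t)]
    return w, total_loss

# A linearly separable toy stream: the cumulative loss stops growing after
# the first two rounds, so the average loss (and the regret) vanishes with T.
data = [([1.0, 0.0], 1.0), ([-1.0, 0.0], -1.0)] * 20
w, total = online_learning_as_game(iter(data), n=2)
```

On this particular stream the learner pays hinge loss only on the first two rounds, after which w = [1.0, 0.0] classifies every example with margin 1.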
Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g_1, ..., g_T. We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows: (1/T) Σ_{t=1}^T g_t(w_t) ≤ inf_{u∈S} [ (1/T) Σ_{t=1}^T g_t(u) + (f(u) + L)/√T ]. (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f. Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight, after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem which can be optimized incrementally as the game progresses. In order to derive explicit quantitative regret bounds we make immediate use of the fact that the dual objective lower-bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie together the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning, and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16].

2 Mathematical Background

We denote scalars with lower case letters (e.g., x and w) and vectors with bold face letters (e.g., x and w). The inner product between vectors x and w is denoted by ⟨x, w⟩. Sets are designated by upper case letters (e.g., S). The set of non-negative real numbers is denoted by R_+. For any k ≥ 1, the set of integers {1, ..., k} is denoted by [k]. A norm of a vector x is denoted by ‖x‖. The dual norm is defined as ‖λ‖* = sup{⟨x, λ⟩ : ‖x‖ ≤ 1}. For example, the Euclidean norm, ‖x‖₂ = ⟨x, x⟩^{1/2}, is dual to itself, and the ℓ₁ norm, ‖x‖₁ = Σ_i |x_i|, is dual to the ℓ∞ norm, ‖x‖∞ = max_i |x_i|. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1; for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w_1, w_2 in S, the entire line segment between w_1 and w_2 is also within S. That is, for any α ∈ [0, 1] we have that α w_1 + (1 − α) w_2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f(w) ≤ α} is closed and convex.
The Fenchel conjugate of a function f : S → R is defined as f*(θ) = sup_{w∈S} [ ⟨w, θ⟩ − f(w) ]. If f is closed and convex then the Fenchel conjugate of f* is f itself. The Fenchel–Young inequality states that for any w and θ we have f(w) + f*(θ) ≥ ⟨w, θ⟩. A vector λ is a sub-gradient of a function f at w if for all w′ ∈ S we have f(w′) − f(w) ≥ ⟨w′ − w, λ⟩. The differential set of f at w, denoted ∂f(w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f(w) consists of a single vector, which amounts to the gradient of f at w and is denoted by ∇f(w). Sub-gradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f(w) then the Fenchel–Young inequality holds with equality.

Lemma 1: Let f be a closed and convex function and let ∂f(w′) be its differential set at w′. Then, for all λ′ ∈ ∂f(w′) we have f(w′) + f*(λ′) = ⟨λ′, w′⟩.

A continuous function f is σ-strongly convex over a convex set S with respect to a norm ‖·‖ if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have f(α v + (1 − α) u) ≤ α f(v) + (1 − α) f(u) − (σ/2) α (1 − α) ‖v − u‖². (3) Strongly convex functions play an important role in our analysis, primarily due to the following lemma.

Lemma 2: Let ‖·‖ be a norm over R^n and let ‖·‖* be its dual norm. Let f be a σ-strongly convex function on S and let f* be its Fenchel conjugate. Then, f* is differentiable with ∇f*(θ) = argmax_{x∈S} [ ⟨θ, x⟩ − f(x) ]. Furthermore, for any θ, λ ∈ R^n we have f*(θ + λ) − f*(θ) ≤ ⟨∇f*(θ), λ⟩ + ‖λ‖*² / (2σ).

Two notable examples of strongly convex functions which we use are as follows.

Example 1: The function f(w) = (1/2)‖w‖₂² is 1-strongly convex over S = R^n with respect to the ℓ₂ norm. Its conjugate function is f*(θ) = (1/2)‖θ‖₂².

Example 2: The function f(w) = Σ_{i=1}^n w_i log(w_i / (1/n)) is 1-strongly convex over the probability simplex, S = {w ∈ R_+^n : ‖w‖₁ = 1}, with respect to the ℓ₁ norm.
Its conjugate function is f*(θ) = log( (1/n) Σ_{i=1}^n exp(θ_i) ).

3 Generalized Fenchel Duality

In this section we derive our main analysis tool. We start by considering the following optimization problem: inf_{w∈S} [ c f(w) + Σ_{t=1}^T g_t(w) ], where c is a non-negative scalar. An equivalent problem is inf_{w_0,w_1,...,w_T} c f(w_0) + Σ_{t=1}^T g_t(w_t) s.t. w_0 ∈ S and ∀t ∈ [T], w_t = w_0. Introducing T vectors λ_1, ..., λ_T, where each λ_t ∈ R^n is a vector of Lagrange multipliers for the equality constraint w_t = w_0, we obtain the following Lagrangian: L(w_0, w_1, ..., w_T, λ_1, ..., λ_T) = c f(w_0) + Σ_{t=1}^T g_t(w_t) + Σ_{t=1}^T ⟨λ_t, w_0 − w_t⟩. The dual problem is the task of maximizing the following dual objective value: D(λ_1, ..., λ_T) = inf_{w_0∈S, w_1,...,w_T} L(w_0, w_1, ..., w_T, λ_1, ..., λ_T) = −c sup_{w_0∈S} [ ⟨w_0, −(1/c) Σ_{t=1}^T λ_t⟩ − f(w_0) ] − Σ_{t=1}^T sup_{w_t} [ ⟨w_t, λ_t⟩ − g_t(w_t) ] = −c f*( −(1/c) Σ_{t=1}^T λ_t ) − Σ_{t=1}^T g_t*(λ_t), where, following the exposition of Sec. 2, f*, g_1*, ..., g_T* are the Fenchel conjugate functions of f, g_1, ..., g_T. Therefore, the generalized Fenchel dual problem is sup_{λ_1,...,λ_T} [ −c f*( −(1/c) Σ_{t=1}^T λ_t ) − Σ_{t=1}^T g_t*(λ_t) ]. (4) Note that when T = 1 and c = 1, the above duality is the so-called Fenchel duality.

4 A Template Learning Algorithm for Convex Repeated Games

In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player, which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows: Σ_{t=1}^T g_t(w_t) − c L ≤ inf_{u∈S} [ c f(u) + Σ_{t=1}^T g_t(u) ], (5) where c = √T. Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5).
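Weak duality — the fact that any value of the dual objective D(λ_1, ..., λ_T) lower-bounds the primal optimum — can be checked numerically in one dimension. The choices below are illustrative, not the paper's setup: f(w) = w²/2 and g_t(w) = (w − a_t)²/2, for which the conjugates f*(θ) = θ²/2 and g_t*(λ) = a_t λ + λ²/2 are known in closed form.

```python
import random

def primal(w, c, a):
    """c*f(w) + sum_t g_t(w) with f(w) = w^2/2 and g_t(w) = (w - a_t)^2/2."""
    return 0.5 * c * w * w + sum(0.5 * (w - at) ** 2 for at in a)

def dual(lams, c, a):
    """D(lam_1..lam_T) = -c f*(-(1/c) sum_t lam_t) - sum_t g_t*(lam_t),
    with f*(th) = th^2/2 and g_t*(lam) = a_t*lam + lam^2/2."""
    s = sum(lams)
    return -0.5 * s * s / c - sum(at * l + 0.5 * l * l for at, l in zip(a, lams))

a, c = [1.0, -2.0, 3.0], 2.0
w_star = sum(a) / (c + len(a))   # closed-form minimizer of the quadratic primal
p_star = primal(w_star, c, a)

# Every dual value, at arbitrary multipliers, stays below the primal optimum.
random.seed(1)
for _ in range(1000):
    lams = [random.uniform(-5.0, 5.0) for _ in a]
    assert dual(lams, c, a) <= p_star + 1e-9
```

The closed-form minimizer follows from setting the derivative c·w + Σ_t (w − a_t) to zero, which gives w* = (Σ_t a_t)/(c + T).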
Our construction is based on the weak duality theorem, stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ_t^1 = 0 for all t. We assume that inf_w f(w) = 0 and, for all t, inf_w g_t(w) = 0, which imply that D(λ_1^1, ..., λ_T^1) = 0. We assume in addition that f is σ-strongly convex. Therefore, based on Lemma 2, the function f* is differentiable. At trial t, P1 uses for prediction the vector w_t = ∇f*( −(1/c) Σ_{i=1}^T λ_i^t ). (6) After predicting w_t, P1 receives the function g_t and suffers the loss g_t(w_t). Then, P1 updates the dual variables as follows. Denote by ∂_t the differential set of g_t at w_t, that is, ∂_t = {λ : ∀w ∈ S, g_t(w) − g_t(w_t) ≥ ⟨λ, w − w_t⟩}. (7) The new dual variables (λ_1^{t+1}, ..., λ_T^{t+1}) are set to be any set of vectors which satisfy the following two conditions: (i) ∃λ′ ∈ ∂_t s.t. D(λ_1^{t+1}, ..., λ_T^{t+1}) ≥ D(λ_1^t, ..., λ_{t−1}^t, λ′, λ_{t+1}^t, ..., λ_T^t); (ii) ∀i > t, λ_i^{t+1} = 0. (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss g_t(w_t). The second condition ensures that we can actually calculate the dual at trial t without any knowledge of the yet to be seen loss functions g_{t+1}, ..., g_T. We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ′ ∈ ∂_t and sets λ_i^{t+1} = λ′ if i = t and λ_i^{t+1} = λ_i^t otherwise. (9) The second update defines (λ_1^{t+1}, ..., λ_T^{t+1}) = argmax_{λ_1,...,λ_T} D(λ_1, ..., λ_T) s.t. ∀i ≠ t, λ_i = λ_i^t. (10)
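With the Euclidean complexity f(w) = ‖w‖²/2 of Example 1, ∇f* is the identity, so the simple update (9) turns the template into online (sub)gradient descent with step size 1/c. The following one-dimensional sketch assumes exactly that specialization; it is not the general template.

```python
def template_game(losses, grads, c):
    """Dual-update template specialized to f(w) = w^2/2 (so grad f* is the
    identity).  Under update (9) the dual variables are just the chosen
    subgradients lambda_t, and the primal move becomes
        w_t = -(1/c) * (lambda_1 + ... + lambda_{t-1}),
    i.e. online (sub)gradient descent with step 1/c.  A 1-d sketch:
    losses[t] is g_t and grads[t] returns a subgradient of g_t."""
    lam_sum = 0.0                  # running sum of the dual variables
    w_hist, total = [], 0.0
    for g_t, dg_t in zip(losses, grads):
        w_t = -lam_sum / c         # w_t = grad f*( -(1/c) sum_i lambda_i^t )
        total += g_t(w_t)          # suffer the loss g_t(w_t)
        lam_sum += dg_t(w_t)       # update (9): lambda_t = subgradient at w_t
        w_hist.append(w_t)
    return w_hist, total

# Quadratic losses g_t(w) = (w - 1)^2 / 2 pull the iterates toward 1:
# w_{t+1} = w_t - (1/c)(w_t - 1) converges geometrically to 1.
T, c = 50, 5.0
ws, total = template_game([lambda w: 0.5 * (w - 1.0) ** 2] * T,
                          [lambda w: w - 1.0] * T, c)
```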
5 Analysis

In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma, which gives upper and lower bounds for the final value of the dual objective function.

Lemma 3: Let f be a σ-strongly convex function with respect to a norm ‖·‖ over a set S, and assume that min_{w∈S} f(w) = 0. Let g_1, ..., g_T be a sequence of convex and closed functions such that inf_w g_t(w) = 0 for all t ∈ [T]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g_1, ..., g_T. Let w_1, ..., w_T be the sequence of primal vectors that the algorithm generates and λ_1^{T+1}, ..., λ_T^{T+1} be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ_1, ..., λ_T, where λ_t ∈ ∂_t for all t, such that Σ_{t=1}^T g_t(w_t) − (1/(2σc)) Σ_{t=1}^T ‖λ_t‖*² ≤ D(λ_1^{T+1}, ..., λ_T^{T+1}) ≤ inf_{w∈S} [ c f(w) + Σ_{t=1}^T g_t(w) ].

Proof: The second inequality follows directly from the weak duality theorem. Turning to the left-most inequality, denote Δ_t = D(λ_1^{t+1}, ..., λ_T^{t+1}) − D(λ_1^t, ..., λ_T^t) and note that D(λ_1^{T+1}, ..., λ_T^{T+1}) can be rewritten as D(λ_1^{T+1}, ..., λ_T^{T+1}) = Σ_{t=1}^T Δ_t + D(λ_1^1, ..., λ_T^1) = Σ_{t=1}^T Δ_t, (11) where the last equality follows from the fact that f*(0) = g_1*(0) = ... = g_T*(0) = 0. The definition of the update implies that Δ_t ≥ D(λ_1^t, ..., λ_{t−1}^t, λ_t, 0, ..., 0) − D(λ_1^t, ..., λ_{t−1}^t, 0, 0, ..., 0) for some subgradient λ_t ∈ ∂_t. Denoting θ_t = −(1/c) Σ_{j=1}^{t−1} λ_j, we now rewrite the lower bound on Δ_t as Δ_t ≥ −c ( f*(θ_t − λ_t/c) − f*(θ_t) ) − g_t*(λ_t). Using Lemma 2 and the definition of w_t we get that Δ_t ≥ ⟨w_t, λ_t⟩ − g_t*(λ_t) − (1/(2σc)) ‖λ_t‖*². (12)
Since λ_t ∈ ∂_t and since we assume that g_t is closed and convex, we can apply Lemma 1 to get that ⟨w_t, λ_t⟩ − g_t*(λ_t) = g_t(w_t). Plugging this equality into Eq. (12) and summing over t, we obtain that Σ_{t=1}^T Δ_t ≥ Σ_{t=1}^T g_t(w_t) − (1/(2σc)) Σ_{t=1}^T ‖λ_t‖*². Combining the above inequality with Eq. (11) concludes our proof.

The following regret bound follows as a direct corollary of Lemma 3.

Theorem 1: Under the same conditions of Lemma 3, denote L = (1/T) Σ_{t=1}^T ‖λ_t‖*². Then, for all w ∈ S we have (1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(w) ≤ c f(w)/T + L/(2σc). In particular, if c = √T, we obtain the bound (1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(w) ≤ (f(w) + L/(2σ))/√T.

6 Application to Online Learning

In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question–answer pair. In instance ranking, the question is encoded by a matrix X_t of dimension k_t × n and the answer is a vector y_t ∈ R^{k_t}. The semantics of y_t is as follows. For any pair (i, j), if y_{t,i} > y_{t,j} then we say that y_t ranks the i'th row of X_t ahead of the j'th row of X_t. We also interpret y_{t,i} − y_{t,j} as the confidence with which the i'th row should be ranked ahead of the j'th row. For example, each row of X_t encompasses a representation of a movie while y_{t,i} is the movie's rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined based on a weight vector w_t ∈ R^n and are defined to be ŷ_t = X_t w_t. Finally, let us define two loss functions for ranking, both of which generalize the hinge-loss used in binary classification problems.
Denote by E_t the set {(i, j) : y_{t,i} > y_{t,j}}. For all (i, j) ∈ E_t we define a pair-based hinge-loss ℓ_{i,j}(w; (X_t, y_t)) = [(y_{t,i} − y_{t,j}) − ⟨w, x_{t,i} − x_{t,j}⟩]_+, where [a]_+ = max{a, 0} and x_{t,i}, x_{t,j} are respectively the i'th and j'th rows of X_t. Note that ℓ_{i,j} is zero if w ranks x_{t,i} higher than x_{t,j} with a sufficient confidence. Ideally, we would like ℓ_{i,j}(w_t; (X_t, y_t)) to be zero for all (i, j) ∈ E_t. If this is not the case, we are being penalized according to some combination of the pair-based losses ℓ_{i,j}. For example, we can set ℓ(w; (X_t, y_t)) to be the average over the pair losses, ℓ_avg(w; (X_t, y_t)) = (1/|E_t|) Σ_{(i,j)∈E_t} ℓ_{i,j}(w; (X_t, y_t)). This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, ℓ_max(w; (X_t, y_t)) = max_{(i,j)∈E_t} ℓ_{i,j}(w; (X_t, y_t)). We can apply our algorithmic framework given in Sec. 4 to ranking, using for g_t(w) either ℓ_avg(w; (X_t, y_t)) or ℓ_max(w; (X_t, y_t)). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well.

Theorem 2: Let f be a σ-strongly convex function over S with respect to a norm ‖·‖. Denote by L_t the maximum over (i, j) ∈ E_t of ‖x_{t,i} − x_{t,j}‖*². Then, for both g_t(w) = ℓ_avg(w; (X_t, y_t)) and g_t(w) = ℓ_max(w; (X_t, y_t)), the following regret bound holds: ∀u ∈ S, (1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(u) ≤ ( f(u) + (1/T) Σ_{t=1}^T L_t/(2σ) ) / √T.

7 The Boosting Game

In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses, whose performances are just slightly better than random guessing, to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x_1, y_1), ..., (x_m, y_m)} where for all i ∈ [m], x_i is taken from an instance domain X, and y_i is a binary label, y_i ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted w_t, over the set of examples. Then, the booster passes the training set along with the distribution w_t to the weak learner. The weak learner is assumed to return a hypothesis h_t : X → {+1, −1} whose average error is slightly smaller than 1/2. That is, there exists a constant γ > 0 such that ε_t := Σ_{i=1}^m w_{t,i} (1 − y_i h_t(x_i))/2 ≤ 1/2 − γ. (13) The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so-called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α_1, ..., α_T, and is defined to be the sign of h_f(x) where h_f(x) = Σ_{t=1}^T α_t h_t(x). The coefficients α_1, ..., α_T are determined by the booster. In AdaBoost, the initial distribution is set to be the uniform distribution, w_1 = (1/m, ..., 1/m). At iteration t, the value of α_t is set to be (1/2) log((1 − ε_t)/ε_t). The distribution is updated by the rule w_{t+1,i} = w_{t,i} exp(−α_t y_i h_t(x_i))/Z_t, where Z_t is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2γ²T). Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that h_f errs on an example (x, y) iff y h_f(x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y h_f(x)), is a smooth upper bound of the zero-one error, which equals 1 if y h_f(x) ≤ 0 and 0 otherwise.
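The AdaBoost recipe just described (uniform w_1, coefficients α_t = (1/2) log((1 − ε_t)/ε_t), exponential reweighting with normalizer Z_t) can be sketched in a few lines. Decision stumps h(x) = sign(x − s) over a 1-d input stand in for the weak learner here; they are an illustrative choice, not part of the paper.

```python
import math

def adaboost(X, y, stump_thresholds, T):
    """Minimal AdaBoost sketch.  Each round: pick the stump with smallest
    weighted error eps_t, set alpha_t = (1/2) log((1 - eps_t)/eps_t), and
    reweight w_{t+1,i} proportional to w_{t,i} exp(-alpha_t y_i h_t(x_i))."""
    m = len(X)
    w = [1.0 / m] * m                       # uniform initial distribution w_1
    ensemble = []                           # pairs (alpha_t, threshold s_t)
    for _ in range(T):
        best = None
        for s in stump_thresholds:          # exhaustive "weak learner"
            h = [1.0 if x > s else -1.0 for x in X]
            eps = sum(wi for wi, hi, yi in zip(w, h, y) if hi != yi)
            if best is None or eps < best[0]:
                best = (eps, s, h)
        eps, s, h = best
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)   # guard log of 0 or inf
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, s))
        w = [wi * math.exp(-alpha * yi * hi) for wi, yi, hi in zip(w, y, h)]
        z = sum(w)                          # the normalizer Z_t
        w = [wi / z for wi in w]
    def strong(x):                          # h_f(x) = sum_t alpha_t h_t(x)
        return sum(a * (1.0 if x > s else -1.0) for a, s in ensemble)
    return strong

# Toy 1-d data, perfectly split by the stump at threshold 1.5.
clf = adaboost([0.0, 1.0, 2.0, 3.0], [-1.0, -1.0, 1.0, 1.0],
               stump_thresholds=[0.5, 1.5, 2.5], T=3)
```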
Thus, we can restate the goal of boosting as minimizing the average exp-loss of h_f over the training set with respect to the variables α_1, ..., α_T. To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, max_{α_1,...,α_T} −(1/m) Σ_{i=1}^m exp( −y_i Σ_{t=1}^T α_t h_t(x_i) ). (14) In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α_1, ..., α_T. This view of boosting enables the hypotheses returned by the weak learner to be general functions into the reals, h_t : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors w_t ∈ R^m which assign weights to examples, and the weights {α_t : t ∈ [T]} over weak-hypotheses. In the terminology used throughout this paper, the weights w_t ∈ R^m are primal vectors while (as we show in the sequel) each weight α_t of the hypothesis h_t is related to a dual vector λ_t. In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game; thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14), let us first denote by v_t the vector in R^m whose i'th element is v_{t,i} = y_i h_t(x_i). For all t, we set g_t to be the function g_t(w) = [⟨w, v_t⟩]_+. Intuitively, g_t penalizes vectors w which assign large weights to examples which are predicted accurately, that is y_i h_t(x_i) > 0. In particular, if h_t(x_i) ∈ {+1, −1} and w_t is a distribution over the m examples (as is the case in AdaBoost), g_t(w_t) reduces to 1 − 2ε_t (see Eq. (13)).
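The reduction of g_t(w) to 1 − 2ε_t follows from ⟨w, v_t⟩ = Σ_i w_i y_i h_t(x_i) = (1 − ε_t) − ε_t when h_t is ±1-valued and w is a distribution. A quick deterministic check, with toy numbers of my own choosing:

```python
# Check the identity <w, v_t> = 1 - 2*eps_t for a distribution w and a
# +/-1-valued hypothesis, where v_{t,i} = y_i h_t(x_i).
y = [1.0, 1.0, -1.0, -1.0]
h = [1.0, -1.0, -1.0, 1.0]      # this hypothesis errs on examples 2 and 4
w = [0.4, 0.3, 0.2, 0.1]        # an arbitrary distribution over examples

v = [yi * hi for yi, hi in zip(y, h)]                      # v = (1, -1, 1, -1)
inner = sum(wi * vi for wi, vi in zip(w, v))               # <w, v_t>
eps = sum(wi * (1.0 - vi) / 2.0 for wi, vi in zip(w, v))   # weighted error eps_t
g = max(0.0, inner)                                        # g_t(w) = [<w, v_t>]_+

assert abs(inner - (1.0 - 2.0 * eps)) < 1e-12
```

Here ε_t = 0.4 < 1/2, so the hinge [·]_+ is inactive and g_t(w) equals 1 − 2ε_t = 0.2 exactly, matching the weak-learnability regime of Eq. (13).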
In this case, minimizing g_t is equivalent to maximizing the error of the individual hypothesis h_t over the examples. Consider the problem of minimizing c f(w) + Σ_{t=1}^T g_t(w), where f(w) is the relative entropy given in Example 2 and c = 1/(2γ) (see Eq. (13)). To derive its Fenchel dual, we note that g_t*(λ_t) = 0 if there exists β_t ∈ [0, 1] such that λ_t = β_t v_t, and otherwise g_t*(λ_t) = ∞ (see [16]). In addition, let us define α_t = 2γβ_t. Since our goal is to maximize the dual, we can restrict λ_t to take the form λ_t = β_t v_t = (α_t/(2γ)) v_t, and get that D(λ_1, ..., λ_T) = −c f*( −(1/c) Σ_{t=1}^T β_t v_t ) = −(1/(2γ)) log( (1/m) Σ_{i=1}^m e^{−Σ_{t=1}^T α_t y_i h_t(x_i)} ). (15) Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples whose relative entropy to the uniform distribution is as small as possible, while the correlation of the distribution with each v_t is as small as possible. Since the correlation of w with v_t is inversely proportional to the error of h_t with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded to in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game with the parameters α_t, where α_t ∈ [0, 2γ], and underscore that in our case λ_t = (α_t/(2γ)) v_t. At the beginning of the game the booster sets all dual variables to be zero, ∀t α_t = 0. At trial t of the boosting game, the booster first constructs a primal weight vector w_t ∈ R^m, which assigns importance weights to the examples in the training set. The primal vector w_t is constructed as in Eq. (6), that is, w_t = ∇f*(θ_t), where θ_t = −Σ_i α_i v_i.
Then, the weak learner responds by presenting the loss function g_t(w) = [⟨w, v_t⟩]_+. Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of h_t is {+1, −1} then the update given in Eq. (10) is equivalent to the update α_t = min{2γ, (1/2) log((1 − ε_t)/ε_t)}. We have thus obtained a variant of AdaBoost in which the weights α_t are capped above by 2γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions g_t. We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 hold, and therefore the left-hand side inequality given in Lemma 3 tells us that Σ_{t=1}^T g_t(w_t) − (1/(2c)) Σ_{t=1}^T ‖λ_t‖_∞² ≤ D(λ_1^{T+1}, ..., λ_T^{T+1}). The definition of g_t and the weak learnability assumption given in Eq. (13) imply that ⟨w_t, v_t⟩ ≥ 2γ for all t. Thus, g_t(w_t) = ⟨w_t, v_t⟩ ≥ 2γ, which also implies that λ_t = v_t. Recall that v_{t,i} = y_i h_t(x_i). Assuming that the range of h_t is [+1, −1] we get that ‖λ_t‖_∞ ≤ 1. Combining all the above with the left-hand side inequality given in Lemma 3 we get that 2Tγ − T/(2c) ≤ D(λ_1^{T+1}, ..., λ_T^{T+1}). Using the definition of D (see Eq. (15)), the value c = 1/(2γ), and rearranging terms, we recover the original bound for AdaBoost: (1/m) Σ_{i=1}^m e^{−y_i Σ_{t=1}^T α_t h_t(x_i)} ≤ e^{−2γ²T}.

8 Related Work and Discussion

We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is threefold, as we now briefly discuss.
First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is a voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], in the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework by setting f(w) = (1/2)‖w‖². Parallel and independent to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors have derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscore an interesting perspective on both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17], for two reasons.
(i) The use of Lagrange duality in [17] is somewhat restricted, while the notion of generalized Fenchel duality is more appropriate for the general and broader problems we consider in this paper. (ii) The strong convexity property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of this work that we did not pursue here due to lack of space. For instance, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities.

References
[1] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002.
[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7, Mar 2006.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[7] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[9] G. Gordon. Regret bounds for prediction problems. In COLT, 1999.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006.
[11] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
[12] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3), 2001.
[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006.
[17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999.
[19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
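As noted in the discussion above, the algorithms of [19] are recovered from the framework by choosing $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$, which yields plain online gradient descent. A minimal sketch of that special case (illustrative only; it uses a fixed step size `eta` and omits any projection step the full algorithm may require):

```python
def online_gradient_descent(grads, eta, dim):
    """Online gradient descent: the special case f(w) = 0.5*||w||^2 of the
    primal-dual framework, giving the update w_{t+1} = w_t - eta * g_t.
    `grads` is the sequence of (sub)gradients revealed online; returns the
    full sequence of iterates, starting from w_1 = 0."""
    w = [0.0] * dim
    iterates = [list(w)]
    for g in grads:
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        iterates.append(list(w))
    return iterates
```

Other choices of a strongly convex $f$ (e.g., the entropic regularizer) yield EG-style multiplicative updates instead, which is the sense in which the framework unifies these algorithms.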
3 0.54901373 203 nips-2006-Implicit Online Learning with Kernels
Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan
Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data.
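The abstract does not spell out ILK's update, but the general idea of an implicit (proximal) step, $\mathbf{w}_{t+1} = \arg\min_{\mathbf{v}} \frac{1}{2}\|\mathbf{v} - \mathbf{w}_t\|^2 + \eta\, \ell_t(\mathbf{v})$, has a closed form for squared loss in a plain Euclidean space. The sketch below is that linear special case, an assumption made purely for illustration rather than the paper's kernelized algorithm:

```python
def implicit_sq_loss_step(w, x, y, eta):
    """One implicit (proximal) online step for the squared loss
    l(v) = 0.5 * (<v, x> - y)^2:
        w_new = argmin_v 0.5*||v - w||^2 + eta * l(v)
    Solving the first-order condition gives the closed form below: the
    residual is damped by 1 / (1 + eta * ||x||^2), so the step never
    overshoots, even for arbitrarily large eta."""
    r = sum(wi * xi for wi, xi in zip(w, x)) - y           # current residual
    scale = eta * r / (1.0 + eta * sum(xi * xi for xi in x))
    return [wi - scale * xi for wi, xi in zip(w, x)]
```

This built-in damping, rather than an explicit gradient step whose stability depends on the step size, is what "implicit update" refers to in this family of methods.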
4 0.54893023 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA.
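The key idea named in the abstract, using the reciprocal of the current eigenvalue estimate as the gain, can be illustrated in the linear (non-kernel) setting with Oja's rule for the leading principal component. The sketch below is a loose linear analogue under that assumption, not the authors' kernelized algorithm; the EMA constant 0.01 and base rate `eta0` are illustrative choices:

```python
def oja_recip_gain(samples, dim, passes=200, eta0=0.05):
    """Oja's rule for the leading principal component, with the gain set to
    eta0 / (running eigenvalue estimate). This loosely mirrors the KHA/et
    idea of scaling steps by reciprocal eigenvalue estimates, so that the
    effective step size adapts to the variance along the learned direction."""
    w = [1.0 / dim ** 0.5] * dim
    lam = 1.0                                 # running eigenvalue estimate
    for _ in range(passes):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))   # projection onto w
            lam = 0.99 * lam + 0.01 * y * y            # EMA of projected variance
            eta = eta0 / max(lam, 1e-8)                # reciprocal-eigenvalue gain
            w = [wi + eta * (y * xi - y * y * wi)      # Oja update (self-normalizing)
                 for wi, xi in zip(w, x)]
    return w
```

On data whose variance is concentrated along one axis, the iterate converges to a unit vector along that axis, with the gain settling near eta0 divided by the top eigenvalue.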
5 0.54682207 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise.
6 0.54298902 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
7 0.54200077 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.54191631 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
9 0.54014719 20 nips-2006-Active learning for misspecified generalized linear models
10 0.5400964 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
11 0.53934312 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
12 0.53917122 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
13 0.53886521 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
14 0.53471375 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.53446639 41 nips-2006-Bayesian Ensemble Learning
16 0.53406602 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
17 0.53332108 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
18 0.53185654 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements
19 0.53097475 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
20 0.53060246 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition