nips nips2002 nips2002-40 knowledge-graph by maker-knowledge-mining

40 nips-2002-Bayesian Models of Inductive Generalization


Source: pdf

Author: Neville E. Sanjana, Joshua B. Tenenbaum

Abstract: We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam’s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. [sent-4, score-0.565]

2 We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam’s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. [sent-5, score-0.383]

3 We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out. [sent-6, score-0.248]

4 1 Introduction The problem of inductive reasoning — in particular, how we can generalize after seeing only one or a few specific examples of a novel concept — has troubled philosophers, psychologists, and computer scientists since the early days of their disciplines. [sent-7, score-0.434]

5 Computational approaches to inductive generalization range from simple heuristics based on similarity matching to complex statistical models [5]. [sent-8, score-0.494]

6 Based on two classic data sets from the literature and one more comprehensive data set that we have collected, we will argue for models based on a rational Bayesian learning framework [10]. [sent-10, score-0.175]

7 We also confront an issue that has often been side-stepped in previous models of concept learning: the origin of the learner’s hypothesis space. [sent-11, score-0.305]

8 We present a simple, unsupervised clustering method for creating hypothesis spaces that, when applied to human similarity judgments and embedded in our Bayesian framework, consistently outperforms the best alternative models of inductive reasoning based on similarity-matching heuristics. [sent-12, score-0.931]

9 We focus on two related inductive generalization tasks introduced in [6], which involve reasoning about the properties of animals. [sent-13, score-0.39]

10 The first task is to judge the strength of a generalization from one or more specific kinds of mammals to a different kind of mammal: given that animals of one or more given kinds have a certain property, how likely is it that an animal of a different kind also has that property? [sent-14, score-0.857]

11 The property is always a blank predicate, such as “is susceptible to the disease blicketitis”, about which nothing is known outside of the given examples. [sent-16, score-0.29]

12 Working with blank predicates ensures that people’s inductions are driven by their deep knowledge about the general features of animals rather than the details they might or might not know about any one particular property. [sent-17, score-0.256]

13 Stimuli are typically presented in the form of an argument from premises (examples) to conclusion (the generalization test item), as in “Chimps are susceptible to the disease blicketitis; therefore, horses are susceptible to the disease blicketitis.” [sent-18, score-0.598]

14 and subjects are asked to judge the strength of the argument — the likelihood that the conclusion (below the line) is true given that the premises (above the line) are true. [sent-21, score-0.366]

15 One data set contains human judgments for the relative strengths of 36 specific inferences, each with a different pair of mammals given as examples (premises) but the same test species, horses. [sent-29, score-0.616]

16 The other set contains judgments of argument strength for 45 general inferences, each with a different triplet of mammals given as examples and the same test category, all mammals. [sent-30, score-0.63]

17 Osherson et al. also published subjects’ judgments of similarity for all 45 pairs of the 10 mammals used in their generalization experiments, which they (and we) use to build models of generalization. [sent-32, score-0.728]

18 The two factors that determine the strength of an inductive generalization in Osherson et al. [sent-34, score-0.337]

19 ’s model [6] are (i) similarity of the animals in the premise(s) to those in the conclusion, and (ii) coverage, defined as the similarity of the animals in the premise(s) to the larger taxonomic category of mammals, including all specific animal types in this domain. [sent-35, score-1.09]

20 To see the importance of the coverage factor, compare the following two inductive generalizations. [sent-36, score-0.259]

21 The chance that horses can get a disease given that we know chimps and squirrels can get that disease seems higher than if we know only that chimps and gorillas can get the disease. [sent-37, score-1.258]

22 Yet simple similarity favors the latter generalization: horses are judged to be more similar to gorillas than to chimps, and much more similar to either primate species than to squirrels. [sent-38, score-0.622]

23 Similarity and coverage factors are mixed linearly to predict the strength of a generalization. [sent-40, score-0.148]

24 Mathematically, the prediction is given by α · sim(T, X) + (1 − α) · sim(M, X), where X is the set of examples (premises), T is the test set (conclusion), M is the set of all mammals, α is a free parameter, and sim(T, X) is a setwise similarity metric defined to be the sum of each element of T's maximal similarity to the elements of X: sim(T, X) = Σ_{t ∈ T} max_{x ∈ X} s(t, x). [sent-41, score-0.774]

25 For the specific arguments, the test set has just one element, horse, so sim(T, X) is just the maximum similarity of horses to the example animal types in X. [sent-42, score-0.602]
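
As a concrete illustration of the similarity-coverage computation described above, here is a minimal Python sketch. The function and variable names (`setwise_max_sim`, `similarity_coverage`, the `sim` dictionary of pairwise similarity judgments, `alpha`) are illustrative assumptions, not identifiers from the paper; the formula follows the reconstruction given above.

```python
def setwise_max_sim(test_items, examples, sim):
    """Setwise similarity: sum, over test items, of each item's maximal
    pairwise similarity to any of the example animals."""
    return sum(max(sim[t][x] for x in examples) for t in test_items)

def similarity_coverage(test_items, examples, all_mammals, sim, alpha):
    """Linearly mix raw similarity to the conclusion with coverage of the
    superordinate category (here, all mammals), weighted by alpha."""
    similarity = setwise_max_sim(test_items, examples, sim)
    coverage = setwise_max_sim(all_mammals, examples, sim)
    return alpha * similarity + (1 - alpha) * coverage
```

For a specific argument the test set contains only horse, so the similarity term reduces to the single maximum mentioned above.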

26 Summed similarity has more traditionally been used to model human concept learning, and also has a rational interpretation in terms of nonparametric density estimation, but Osherson et al. [sent-46, score-0.436]

27 Sloman [8] developed a feature-based model that encodes the shared features between the premise set and the conclusion set as weights in a neural network. [sent-51, score-0.127]

28 Despite some psychological plausibility, this model consistently fit the two data sets significantly worse than the max-similarity model. [sent-52, score-0.126]

29 Heit [3] outlines a Bayesian framework that provides qualitative explanations of various inductive reasoning phenomena from [6]. [sent-53, score-0.248]

30 His model does not constrain the learner’s hypothesis space, nor does it embody a generative model of the data, so its predictions depend strictly on well-chosen prior probabilities. [sent-54, score-0.253]

31 Formally, for the specific inference task, we observe positive examples X of the concept C and want to compute the probability that a particular test stimulus y belongs to the concept given the observed examples: p(y ∈ C | X). [sent-57, score-0.476]

32 These generalization probabilities are computed by averaging the predictions of a set of hypotheses weighted by their posterior probabilities: p(y ∈ C | X) = Σ_{h ∈ H} p(y ∈ C | h) p(h | X). (1) [sent-58, score-0.389]

33 The analogous quantity for the general inference task is p(Y ⊆ C | X) = Σ_{h ∈ H} p(Y ⊆ C | h) p(h | X). (2) A crucial component in modeling both tasks is the structure of the learner’s hypothesis space H. [sent-61, score-0.219]
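
A minimal sketch of Equations 1 and 2 as reconstructed above, assuming hypotheses are represented as frozensets of animal types and that `prior` and `likelihood` are callables supplied elsewhere (all names here are illustrative):

```python
def posterior_weights(examples, hypotheses, prior, likelihood):
    """Unnormalized posterior p(h | X) proportional to p(X | h) p(h)."""
    return {h: prior(h) * likelihood(examples, h) for h in hypotheses}

def p_specific(test_item, examples, hypotheses, prior, likelihood):
    """Equation (1): probability that a single test item falls under the concept."""
    w = posterior_weights(examples, hypotheses, prior, likelihood)
    z = sum(w.values())
    return sum(wt for h, wt in w.items() if test_item in h) / z

def p_general(test_set, examples, hypotheses, prior, likelihood):
    """Equation (2): probability that a whole test category falls under the concept."""
    w = posterior_weights(examples, hypotheses, prior, likelihood)
    z = sum(w.values())
    return sum(wt for h, wt in w.items() if set(test_set) <= h) / z
```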

34 1 Hypothesis space Elements of the hypothesis space represent natural subsets of the objects in the domain — subsets likely to be the extension of some novel property or concept. [sent-63, score-0.169]

35 Our goal in building up this hypothesis space is to capture as many hypotheses as possible that people might employ in concept learning, using a procedure that is ideally automatic and unsupervised. [sent-64, score-0.392]

36 One natural way to begin is to identify hypotheses with the clusters returned by a clustering algorithm [11][7]. [sent-65, score-0.42]

37 Here, hierarchical clustering seems particularly appropriate, as people across cultures appear to organize their concepts of biological species in a hierarchical taxonomic structure [1]. [sent-66, score-0.553]

38 We applied four standard agglomerative clustering algorithms [2] (single-link, complete-link, average-link, and centroid) to subjects’ similarity judgments for all pairs of 10 animals given in [6]. [sent-67, score-0.518]

39 We define the base set of clusters to consist of all 19 clusters in this tree. [sent-69, score-0.451]
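
One way to realize this construction — a sketch under stated assumptions, not the authors' exact procedure — is with SciPy's agglomerative clustering: convert the similarity judgments to distances, build a tree, and take the leaf set under every node as a base cluster. The similarity matrix below is a placeholder, and the last two animal names are stand-ins to round out the ten mammals.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

animals = ["horse", "cow", "elephant", "chimp", "gorilla",
           "mouse", "squirrel", "dolphin", "seal", "rhino"]  # last two: stand-ins
S = np.eye(len(animals))            # stand-in for the human similarity matrix
D = squareform(1.0 - S)             # similarity -> condensed distance vector

Z = linkage(D, method="average")    # also try 'single', 'complete', 'centroid'

def clusters_of(node):
    """Collect the leaf set under every node: 2*10 - 1 = 19 base clusters."""
    out = [frozenset(animals[i] for i in node.pre_order())]
    if not node.is_leaf():
        out += clusters_of(node.get_left()) + clusters_of(node.get_right())
    return out

base = set(clusters_of(to_tree(Z)))  # the taxonomic layer of hypotheses
```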

40 The most straightforward way to define a hypothesis space for Bayesian concept learning is to take the hypotheses to be exactly the base clusters; each hypothesis consists of one base cluster. [sent-70, score-0.531]

41 Hypotheses pick out subsets of stimuli — candidate extensions of the concept — and p(y ∈ C | h) is just 1 or 0 depending on whether the test stimulus y falls under the subset h. [sent-72, score-0.215]

42 In the general inference task, we are interested in computing the probability that a whole test category Y falls under the concept, as in Equation 2. The taxonomic hypothesis space alone, however, is not sufficient. [sent-73, score-0.269]

43 Each node in the tree corresponds to one hypothesis in the taxonomic hypothesis space. [sent-75, score-0.592]

44 Intuitively, generalization to horses seems stronger given that cows and squirrels can get the disease than given that chimps and squirrels can get the disease, yet the taxonomic hypotheses consistent with the example sets {cow, squirrel} and {chimp, squirrel} are the same. [sent-76, score-1.062]

45 Bayesian generalization with a purely taxonomic hypothesis space essentially depends only on the least similar example (here, squirrel), ignoring more fine-grained similarity structure, such as that one example in the set {cow, squirrel} is very similar to the target horse even if the other is not. [sent-77, score-1.0]

46 This sense of fine-grained similarity has a clear objective basis in biology, because a single property can apply to more than one taxonomic cluster, either by chance or through convergent evolution. [sent-78, score-0.464]

47 Thus we consider richer hypothesis subspaces: a second layer consisting of all pairs of taxonomic clusters (i. [sent-80, score-0.639]

48 e., all unions of two clusters from Figure 1, except those already included in the base layer), and a third layer consisting of all triples of taxonomic clusters (except those included in lower layers). [sent-82, score-0.714]

49 Our total hypothesis space is then the union of these three layers. [sent-84, score-0.247]
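
Continuing the sketch above, the richer layers can be built from unions of the base clusters (using the `base` set from the clustering sketch; all names are illustrative):

```python
from itertools import combinations

def union_layer(base, k, already):
    """All unions of k base clusters not already present in lower layers."""
    layer = set()
    for combo in combinations(base, k):
        h = frozenset().union(*combo)
        if h not in already:
            layer.add(h)
    return layer

H1 = set(base)                      # single taxonomic clusters
H2 = union_layer(base, 2, H1)       # pairs of clusters
H3 = union_layer(base, 3, H1 | H2)  # triples of clusters
H = H1 | H2 | H3                    # total hypothesis space
```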

50 The notion that the hypothesis space of candidate concepts might correspond to the power set of the base clusters, rather than just single clusters, is broadly applicable beyond the domain of biological properties. [sent-85, score-0.36]

51 If the base system of clusters is sufficiently fine-grained, this framework can parameterize any logically possible concept. [sent-86, score-0.269]

52 2 The Bayesian Occam’s razor: balancing priors and likelihoods. Given this hypothesis space, Bayesian generalization then requires assigning a prior p(h) and likelihood p(X | h) for each hypothesis h. [sent-89, score-0.504]

53 Let m be the number of base clusters, and let h be a hypothesis in the k-th layer of the hypothesis space, corresponding to a union of k base clusters. [sent-90, score-0.56]

54 For appropriate settings of its parameter, the prior instantiates a preference for simpler hypotheses — that is, hypotheses consisting of fewer disjoint clusters (smaller k). [sent-96, score-0.605]

55 More complex hypotheses receive exponentially lower probability under the prior, and the penalty for complexity increases as this parameter becomes smaller. [sent-97, score-0.179]

56 We are currently exploring a more sophisticated domain-specific prior for taxonomic clusters defined by a stochastic mutation process over the branches of the tree. [sent-99, score-0.436]

57 Following [10], the likelihood is calculated by assuming that the examples are a random sample (with replacement) of instances from the concept to be learned. [sent-100, score-0.22]

58 Let n be the number of examples, and let the size |h| of each hypothesis be simply the number of animal types it contains. [sent-101, score-0.304]

59 Then p(X | h) = (1/|h|)^n if h includes all examples in X, and 0 if it does not (Equation 4), assigning greater likelihood to smaller hypotheses, by a factor that increases exponentially as the number of consistent examples observed increases. [sent-103, score-0.342]

60 The prior favors hypotheses consisting of few clusters, while the likelihood favors hypotheses consisting of small clusters. [sent-105, score-0.546]
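
A sketch of how the prior and the size-principle likelihood (Equation 4, as reconstructed above) might look for the layered hypothesis space built earlier. The complexity parameter `phi` and the layer bookkeeping are illustrative assumptions, not the paper's exact parameterization.

```python
# Record which layer each hypothesis came from (built in the sketch above).
layer_of = {h: 1 for h in H1}
layer_of.update({h: 2 for h in H2})
layer_of.update({h: 3 for h in H3})

def prior(h, phi=0.5):
    """Occam prior: exponentially lower probability for hypotheses built from
    more disjoint clusters; smaller phi means a stronger complexity penalty."""
    return phi ** layer_of[h]

def likelihood(examples, h):
    """Size principle: zero unless h covers all the examples, otherwise
    (1/|h|)**n, favoring smaller hypotheses more strongly as n grows."""
    if not set(examples) <= h:
        return 0.0
    return (1.0 / len(h)) ** len(examples)
```

Plugging these into the generalization functions sketched earlier gives the full model: the prior pulls toward few clusters, the likelihood toward small ones, and the data decide the trade-off.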

61 For any set of examples, we can always cover them under a single cluster if we make the cluster large enough, and we can always cover them with a hypothesis of minimal size (i. [sent-107, score-0.341]

62 e., including no other animals beyond the examples) if we use only singleton clusters and let the number of clusters equal the number of examples. [sent-109, score-0.557]

63 Both tasks drew their stimuli from the same set of 10 mammals shown in Figure 1. [sent-113, score-0.357]

64 Each data set (including the set of similarity judgments used to construct the models) came from a different group of subjects. [sent-114, score-0.302]

65 Our models of the probability of generalization for specific and general arguments are given by Equations 1 and 2, respectively, letting X be the example set that varies from trial to trial and letting the fixed test category be horses or all mammals, respectively. [sent-115, score-0.52]

66 Osherson et al.’s subjects did not provide an explicit judgment of generalization for each example set, but only a relative ranking of the strengths of all arguments in the general or specific sets. [sent-117, score-0.318]

67 Figure 3 shows the (rank) predictions of three models — Bayesian, max-similarity, and sum-similarity — versus human subjects’ (rank) confirmation judgments on the general (row 1) and specific (row 2) induction tasks from [6]. [sent-119, score-0.392]

68 Each model had one free parameter (the prior’s complexity parameter in the Bayesian model, the linear mixing weight in the similarity models), which was tuned to the single value that maximized rank-order correlation between model and data jointly over both data sets. [sent-120, score-0.177]

69 The sum-similarity model is far worse than the other two — it is actually negatively correlated with the data on the general task — while max-similarity consistently scores slightly worse than the Bayesian model. [sent-122, score-0.169]

70 In both data sets 1 and 2, the number of examples was constant across all trials; we expected that varying the number of examples would cause difficulty for the max-similarity model because it is not explicitly sensitive to this factor. [sent-125, score-0.29]

71 For this purpose, we included five three-premise arguments, each with three examples of the same animal species (e. [sent-126, score-0.363]

72 , chimp, chimp, chimp ), and five one-premise arguments with the same five animals (e. [sent-128, score-0.561]

73 We also included three-premise arguments where all examples were drawn from a low-level cluster of species in Figure 1 (e. [sent-131, score-0.352]

74 Because of the increasing preference for smaller hypotheses as more examples are observed, Bayes will in general make very different predictions in these three cases, but max-similarity will not. [sent-134, score-0.438]
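
The size principle makes this predicted difference easy to see with toy numbers (purely illustrative; not the paper's actual cluster sizes): compare a narrow consistent hypothesis containing 2 animal types against a broad one containing all 10.

```python
# With one 'chimp' example both hypotheses are consistent and the likelihood
# ratio favoring the narrow one is (1/2) / (1/10) = 5.  With three identical
# 'chimp' examples it grows to (1/2)**3 / (1/10)**3 = 125, so the posterior
# concentrates on narrow hypotheses and generalization to horses drops --
# whereas max-similarity, blind to the number of examples, predicts no change.
for n in (1, 3):
    ratio = (1 / 2) ** n / (1 / 10) ** n
    print(f"n={n}: likelihood ratio (narrow / broad) = {ratio:.0f}")
```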

75 We also changed the judgment task and cover story slightly, to match more closely the natural problem of inductive learning from randomly sampled examples. [sent-136, score-0.269]

76 Subjects were told that they were training to be veterinarians, by observing examples of particular animals that had been diagnosed with novel diseases. [sent-137, score-0.242]

77 They were required to judge the probability that horses could get the same disease given the examples observed. [sent-138, score-0.633]

78 This cover story made it clear to subjects that when multiple examples of the same animal type were presented, these instances referred to distinct individual animals. [sent-139, score-0.382]

79 Figure 3 (row 3) shows the model’s predicted generalization probabilities along with the data from our experiment: mean ratings of generalization from 24 subjects on 28 example sets, using varying numbers of examples and the same test species (horses) across all arguments. [sent-140, score-0.604]

80 All three models fit best at different parameter values than in data sets 1 and 2, perhaps due to the task differences or the greater range of stimuli here. [sent-142, score-0.131]

81 Figure 2: Human generalization to the conclusion category horse when given one or three examples of a single premise type. [sent-149, score-0.605]

82 (Figure 2 premise categories: cow, chimp, mouse, dolphin, elephant.) Again, the max-similarity model comes close to the performance of the Bayesian model, but it is inconsistent with several qualitative trends in the data. [sent-154, score-0.678]

83 Most notably, we found a difference between generalization from one example and generalization from three examples of the same kind, in the direction predicted by our Bayesian model. [sent-155, score-0.396]

84 Generalization to the test category of horses was greater from singleton examples (e. [sent-156, score-0.531]

85 , chimp ) than from three examples of the same kind (e. [sent-158, score-0.549]

86 This effect was relatively small, but it was observed for all five animal types tested and was statistically significant in a 2 × 5 (number of examples × animal type) ANOVA. [sent-161, score-0.351]

87 It is also of interest to ask whether these models are sufficiently robust as to make reasonable predictions across all three experiments using a single parameter setting, or to make good predictions on held-out data when their free parameter is tuned on the remaining data. [sent-163, score-0.256]

88 More importantly, our Bayesian approach has a principled rational foundation, and we have introduced a framework for unsupervised construction of hypothesis spaces that could be applied in many other domains. [sent-168, score-0.28]

89 In contrast, the similarity-based approach requires arbitrary assumptions about the form of the similarity measure: it must include both “similarity” and “coverage” terms, and it must be based on max-similarity rather than sum-similarity. [sent-169, score-0.177]

90 These choices have no a priori justification and run counter to how similarity models have been applied in other domains, leading us to conclude that rational statistical principles offer the best hope for explaining how people can generalize so well from so little data. [sent-170, score-0.368]

91 Still, the consistently good performance of the max-similarity model raises an important question for future study: whether a relatively small number of simple heuristics might provide the algorithmic machinery implementing approximate rational inference in the brain. [sent-171, score-0.157]

92 We would also like to understand how people’s subjective hypothesis spaces have their origin in the objective structure of their environment. [sent-172, score-0.198]

93 Two plausible sources for the taxonomic hypothesis space used here can both be ruled out. [sent-173, score-0.423]

94 The actual biological taxonomy for these 10 animals, based on their evolutionary history, looks quite different from the subjective taxonomy used here. [sent-174, score-0.118]

95 Substituting the true taxonomic clusters from biology for the base clusters of our model’s hypothesis space leads to dramatically worse predictions of people’s generalization behavior. [sent-175, score-1.129]

96 Taxonomies constructed from linguistic co-occurrences, by applying the same agglomerative clustering algorithms to similarity scores output from the LSA algorithm [4], also lead to much worse predictions. [sent-176, score-0.342]

97 Perhaps only by combining several such sources of information, weighted appropriately, can we reproduce the taxonomy constructed here from people’s similarity judgments. [sent-181, score-0.236]

98 We do not offer a solution here, but merely point to this question as perhaps the most salient open problem in trying to understand the computational basis of human inductive inference. [sent-183, score-0.232]

99 Figure 3: Model predictions plotted against human confirmation scores. [sent-278, score-0.187]

100 Each row is a different inductive generalization experiment, where n indicates the number of examples (premises) in the stimuli. [sent-280, score-0.401]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chimp', 0.359), ('mammals', 0.27), ('horses', 0.254), ('taxonomic', 0.254), ('disease', 0.188), ('clusters', 0.182), ('hypotheses', 0.179), ('similarity', 0.177), ('hypothesis', 0.169), ('inductive', 0.161), ('osherson', 0.157), ('squirrel', 0.157), ('chimps', 0.135), ('animals', 0.128), ('bayesian', 0.127), ('generalization', 0.126), ('judgments', 0.125), ('horse', 0.117), ('examples', 0.114), ('premises', 0.112), ('squirrels', 0.112), ('concept', 0.106), ('animal', 0.102), ('coverage', 0.098), ('category', 0.091), ('premise', 0.089), ('subjects', 0.088), ('base', 0.087), ('species', 0.086), ('predictions', 0.084), ('rational', 0.082), ('people', 0.079), ('arguments', 0.074), ('cow', 0.071), ('human', 0.071), ('gorilla', 0.067), ('mammal', 0.067), ('susceptible', 0.063), ('favors', 0.06), ('clustering', 0.059), ('taxonomy', 0.059), ('cows', 0.059), ('occam', 0.059), ('reasoning', 0.053), ('tasks', 0.05), ('strength', 0.05), ('union', 0.048), ('concepts', 0.047), ('cluster', 0.047), ('consistently', 0.047), ('kind', 0.046), ('worse', 0.045), ('dolphin', 0.045), ('elephant', 0.045), ('gorillas', 0.045), ('rmation', 0.045), ('sanjana', 0.045), ('judge', 0.043), ('cognitive', 0.041), ('likelihoods', 0.04), ('blank', 0.039), ('razor', 0.039), ('story', 0.039), ('cover', 0.039), ('conclusion', 0.038), ('stimuli', 0.037), ('falls', 0.036), ('learner', 0.036), ('test', 0.036), ('exempli', 0.036), ('singleton', 0.036), ('speci', 0.035), ('tenenbaum', 0.035), ('argument', 0.035), ('qualitative', 0.034), ('sets', 0.034), ('ve', 0.034), ('get', 0.034), ('consisting', 0.034), ('ict', 0.033), ('mouse', 0.033), ('chance', 0.033), ('know', 0.033), ('types', 0.033), ('scores', 0.032), ('af', 0.032), ('induction', 0.032), ('included', 0.031), ('preference', 0.031), ('three', 0.03), ('judgment', 0.03), ('inferences', 0.03), ('models', 0.03), ('comprehensive', 0.029), ('agglomerative', 0.029), ('beyond', 0.029), ('spaces', 0.029), ('might', 0.028), ('bayes', 0.028), ('across', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 40 nips-2002-Bayesian Models of Inductive Generalization

Author: Neville E. Sanjana, Joshua B. Tenenbaum

Abstract: We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam’s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.

2 0.15697044 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

Author: David Fass, Jacob Feldman

Abstract: We present an account of human concept learning-that is, learning of categories from examples-based on the principle of minimum description length (MDL). In support of this theory, we tested a wide range of two-dimensional concept types, including both regular (simple) and highly irregular (complex) structures, and found the MDL theory to give a good account of subjects' performance. This suggests that the intrinsic complexity of a concept (that is, its description -length) systematically influences its leamability. 1- The Structure of Categories A number of different principles have been advanced to explain the manner in which humans learn to categorize objects. It has been variously suggested that the underlying principle might be the similarity structure of objects [1], the manipulability of decision bound~ aries [2], or Bayesian inference [3][4]. While many of these theories are mathematically well-grounded and have been successful in explaining a range of experimental findings, they have commonly only been tested on a narrow collection of concept types similar to the simple unimodal categories of Figure 1(a-e). (a) (b) (c) (d) (e) Figure 1: Categories similar to those previously studied. Lines represent contours of equal probability. All except (e) are unimodal. ~http://ruccs.rutgers.edu/~jacob/feldman.html Moreover, in the scarce research that has ventured to look beyond simple category types, the goal has largely been to investigate categorization performance for isolated irregular distributions, rather than to present a survey of performance across a range of interesting distributions. For example, Nosofsky has previously examined the

3 0.14411598 198 nips-2002-Theory-Based Causal Inference

Author: Joshua B. Tenenbaum, Thomas L. Griffiths

Abstract: People routinely make sophisticated causal inferences unconsciously, effortlessly, and from very little data – often from just one or a few observations. We argue that these inferences can be explained as Bayesian computations over a hypothesis space of causal graphical models, shaped by strong top-down prior knowledge in the form of intuitive theories. We present two case studies of our approach, including quantitative models of human causal judgments and brief comparisons with traditional bottom-up models of inference.

4 0.091173694 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

Author: Ron Meir, Tong Zhang

Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if hj (x) are non-polynomial ridge functions, the composite predictor f corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely j qj hj = Eh∼Q h. A main feature of this work is the establishment of data-dependent bounds on L(Eh∼Q h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error Eh∼Q L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading since an optimal Bayesian method (in the decision theoretic framework outline above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated, and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {hj }∞ , and a probability distribution {qj } over j=1 ∞ H. Introduce the vector notation k=1 qk hk (x) = q h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis fq (x) = q h(x). 
In order to constrain the class of mixtures used in constructing the mixture q h we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define for any positive A, ΩA = {q ∈ S : g(q) ≤ A} ; FA = fq : fq (x) = q h(x) : q ∈ ΩA , (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q h we define the loss by L(q h) = Eµ (y, (q h)(x)) and the n ˆ empirical loss incurred on the sample by L(q h) = (1/n) i=1 (yi , (q h)(xi )). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {νj }∞ , which may j=1 incorporate our prior information about he problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q ν) ; qj log(qj /νj ). D(q ν) = j Let F be a class of real-valued functions, and denote by σi independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as 1 ˆ Rn (F) = Eσ sup n f ∈F n σi f (xi ) |S . i=1 ˆ The expectation of Rn (F) with respect to S will be denoted by Rn (F). We note ˆ n (F) is concentrated around its mean value Rn (F) (e.g., Thm. 8 in [1]). We that R quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x1 , x2 , . . . , xn } ∈ X be a sequence of points generated independently at random according to a probability distribution P , and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M . Then for all f ∈ F with probability at least 1 − δ Eφ(f (x)) − 1 n n φ(f (xi )) ≤ 4κRn (F) + M i=1 log(1/δ) . 2n An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function be bounded by M , and assume that it is Lipschitz with constant κ. Then for all q ∈ ΩA with probability at least 1 − δ log(1/δ) . 2n ˆ L(q h) ≤ L(q h) + 4κRn (FA ) + M Next, we bound the empirical Rademacher average of FA using g(q) = D(q ν). Lemma 3.2 The empirical Rademacher complexity of FA is upper bounded as follows: 2A n ˆ Rn (FA ) ≤ 1 n sup j n hj (xi )2 . i=1 Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U , and set its dual s(z) = supu∈U u z − p(u) . It is known that s(z) is also convex. Setting u = q and p(q) = j qj log(qj /νj ) we find that s(v) = log j νj ezj . From the definition of s(z) it follows that for any q ∈ S, q z≤ qj log(qj /νj ) + log ν j ez j . j j Since z is arbitrary, we set z = (λ/n) i σi h(xi ) and conclude that for q ∈ ΩA and any λ > 0   n   1 λ 1 sup A + log νj exp σi q h(xi ) ≤ σi hj (xi ) .  n i=1 λ n i q∈ΩA j Taking the expectation with respect to σ, and using 2 Eσ {exp ( i σi ai )} ≤ exp i ai /2 , we have that  λ 1 ˆ νj exp A + Eσ log σi hj (xi ) Rn (FA ) ≤ λ n j ≤ ≤ = i 1 λ A + sup log Eσ exp 1 λ A + sup log exp j j A λ + 2 sup λ 2n j λ2 n2 λ n i σi hj (xi ) the Chernoff bound    (Jensen) i hj (xi )2 2 (Chernoff) hj (xi )2 . i Minimizing the r.h.s. with respect to λ, we obtain the desired result. Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. 
Theorem 2 Let S = {(x1 , y1 ), . . . , (xn , yn )} be a sample of i.i.d. points each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set FA to be the class defined in (3) with g(q) = D(q ν). Set ∆H = (1/n)Eµ supj 1−δ n i=1 hj (xi )2 1/2 . Then for any q ∈ ΩA with probability at least ˆ L(q h) ≤ L(q h) + 4κ∆H 2A +M n log(1/δ) . 2n Note that if hj are uniformly bounded, hj ≤ c, then ∆H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {Ai , pi } be a set of positive numbers such that i pi = 1. Then for all Ai and q ∈ ΩAi with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + 4κ∆H 2Ai +M n log(1/pi δ) . 2n Note that the only distinction with Theorem 2 is the extra factor of log pi which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H , M ) × 130D(q ν) + log(1/δ) . n (4) Proof sketch Pick Ai = 2i and pi = 1/i(i + 1), i = 1, 2, . . . (note that i pi = 1). For each q, let i(q) be the smallest index for which Ai(q) ≥ D(q ν) implying that log(1/pi(q) ) ≤ 2 log log2 (4D(q ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is ˆ Eh∼Q L(h), namely the average empirical error of the base classifiers h. In our case ˆ the corresponding term is L(Eh∼Q h), namely the empirical error of the average hypothesis. Since Eh∼Q h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q∗ be the optimal distribution minimizing L(q h), which can only be computed if the underlying distribution µ(x, y) is known. Consider an ˆ algorithm which, based only on the data, selects a distribution q by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(ˆ h) − L(q∗ h). For q lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard Lp norms q p . In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = supq∈ΩA q z − g(q) . Repeating n 1 the arguments of Section 3 we have for any λ > 0 that n i=1 σi q h(xi ) ≤ 1 λ i σi h(xi ) , which implies that λ A+s n 1 ˆ Rn (FA ) ≤ inf λ≥0 λ λ n A + Eσ s σi h(xi ) . 
(5) i n Assume that s(z) is second order differentiable, and that for any h = i=1 σi h(xi ) 1 2 (s(h + ∆h) + s(h − ∆h)) − s(h) ≤ u(∆h). Then, assuming that s(0) = 0, it is easy to show by induction that n n Eσ s (λ/n) i=1 σi h(xi ) ≤ u((λ/n)h(xi )). (6) i=1 In the remainder of the section we focus on the the case of regularization based on the Lp norm. Consider p and q such that 1/q + 1/p = 1, p ∈ (1, ∞), and let p = max(p, 2) and q = min(q, 2). Note that if p ≤ 2 then q ≥ 2, q = p = 2 and if p > 2 1 then q < 2, q = q, p = p. Consider p-norm regularization g(q) = p q p , in which p 1 case s(z) = q z q . The Rademacher averaging result for p-norm regularization q is known in the Geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khinchtine’s inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that s(z) = Substituting in (5) we have 1 ˆ Rn (FA ) ≤ inf λ≥0 λ A+ λ n q−1 q where Cq = ((q − 1)/q ) 1/q q 1 q z q q n h(xi ) q q = i=1 implies u(h(x)) ≤ Cq A1/p n1/p 1 n q−1 q h(x) q q . 1/q n h(xi ) q q i=1 . Combining this result with the methods described in Section 3, we establish a bound for regularization based on the Lp norm. Assume that h(xi ) q is finite for all i, and set ∆H,q = E (1/n) n i=1 h(xi ) q q 1/q . Theorem 4 Let the conditions of Theorem 3 hold and set g(q) = (1, ∞). Then for all q ∈ S, with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H,q , M ) × O q p + n1/p log log( q p 1 p q p p , p ∈ + 3) + log(1/δ) n where O(·) hides a universal constant that depends only on p. 5 Discussion We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel based methods. Acknowledgments We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion. References [1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224–240, 2001. [2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002. [3] O. Bousquet and A. Chapelle. Stability and generalization. J. Machine Learning Research, 2:499–526, 2002. [4] R. Herbrich and T. Graepel. A pac-bayesian margin bound for linear classifiers; why svms work. In Advances in Neural Information Processing Systems 13, pages 224–230, Cambridge, MA, 2001. MIT Press. 
[5] V. Koltchinksii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statis., 30(1), 2002. [6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceeding of the Eighteenth International Conference on Machine Learning, pages 290–297, 2001. [7] D. A. McAllester. Some pac-bayesian theorems. In Proceedings of the eleventh Annual conference on Computational learning theory, pages 230–234, New York, 1998. ACM Press. [8] D. A. McAllester. PAC-bayesian model averaging. In Proceedings of the twelfth Annual conference on Computational learning theory, New York, 1999. ACM Press. [9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994. [10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, N.J., 1970. [11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE trans. Inf. Theory, 44:1926– 1940, 1998. [12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271–2284, 1999. [13] T. Zhang. Generalization performance of some learning problems in hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press. [14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

5 0.087415881 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1

6 0.080169424 188 nips-2002-Stability-Based Model Selection

7 0.076839149 75 nips-2002-Dynamical Causal Learning

8 0.067871854 83 nips-2002-Extracting Relevant Structures with Side Information

9 0.06669192 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

10 0.063723922 163 nips-2002-Prediction and Semantic Association

11 0.063185476 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

12 0.061908606 53 nips-2002-Clustering with the Fisher Score

13 0.06166916 90 nips-2002-Feature Selection in Mixture-Based Clustering

14 0.060331028 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

15 0.059605036 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

16 0.058885224 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

17 0.055255141 125 nips-2002-Learning Semantic Similarity

18 0.054869372 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

19 0.054569706 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

20 0.053493492 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.173), (1, -0.029), (2, 0.027), (3, 0.011), (4, -0.042), (5, 0.051), (6, -0.114), (7, -0.107), (8, -0.066), (9, 0.002), (10, -0.004), (11, -0.014), (12, 0.11), (13, 0.053), (14, -0.096), (15, -0.091), (16, -0.003), (17, -0.105), (18, 0.09), (19, -0.186), (20, -0.096), (21, -0.175), (22, 0.082), (23, 0.021), (24, 0.047), (25, 0.054), (26, -0.1), (27, 0.009), (28, -0.018), (29, 0.074), (30, 0.002), (31, 0.042), (32, -0.014), (33, 0.213), (34, 0.106), (35, 0.156), (36, 0.054), (37, -0.127), (38, 0.017), (39, -0.042), (40, -0.027), (41, 0.084), (42, -0.177), (43, 0.006), (44, 0.031), (45, 0.009), (46, -0.07), (47, 0.013), (48, -0.035), (49, 0.119)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95413035 40 nips-2002-Bayesian Models of Inductive Generalization

Author: Neville E. Sanjana, Joshua B. Tenenbaum

Abstract: We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam’s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.

2 0.79981333 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

Author: David Fass, Jacob Feldman

Abstract: We present an account of human concept learning-that is, learning of categories from examples-based on the principle of minimum description length (MDL). In support of this theory, we tested a wide range of two-dimensional concept types, including both regular (simple) and highly irregular (complex) structures, and found the MDL theory to give a good account of subjects' performance. This suggests that the intrinsic complexity of a concept (that is, its description -length) systematically influences its leamability. 1- The Structure of Categories A number of different principles have been advanced to explain the manner in which humans learn to categorize objects. It has been variously suggested that the underlying principle might be the similarity structure of objects [1], the manipulability of decision bound~ aries [2], or Bayesian inference [3][4]. While many of these theories are mathematically well-grounded and have been successful in explaining a range of experimental findings, they have commonly only been tested on a narrow collection of concept types similar to the simple unimodal categories of Figure 1(a-e). (a) (b) (c) (d) (e) Figure 1: Categories similar to those previously studied. Lines represent contours of equal probability. All except (e) are unimodal. ~http://ruccs.rutgers.edu/~jacob/feldman.html Moreover, in the scarce research that has ventured to look beyond simple category types, the goal has largely been to investigate categorization performance for isolated irregular distributions, rather than to present a survey of performance across a range of interesting distributions. For example, Nosofsky has previously examined the

3 0.56984955 198 nips-2002-Theory-Based Causal Inference

Author: Joshua B. Tenenbaum, Thomas L. Griffiths

Abstract: People routinely make sophisticated causal inferences unconsciously, effortlessly, and from very little data – often from just one or a few observations. We argue that these inferences can be explained as Bayesian computations over a hypothesis space of causal graphical models, shaped by strong top-down prior knowledge in the form of intuitive theories. We present two case studies of our approach, including quantitative models of human causal judgments and brief comparisons with traditional bottom-up models of inference.

4 0.54804063 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology

Author: Matthew G. Snover, Michael R. Brent

Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.

5 0.44434401 75 nips-2002-Dynamical Causal Learning

Author: David Danks, Thomas L. Griffiths, Joshua B. Tenenbaum

Abstract: Current psychological theories of human causal learning and judgment focus primarily on long-run predictions: two by estimating parameters of a causal Bayes nets (though for different parameterizations), and a third through structural learning. This paper focuses on people's short-run behavior by examining dynamical versions of these three theories, and comparing their predictions to a real-world dataset. 1

6 0.43165419 188 nips-2002-Stability-Based Model Selection

7 0.37375611 107 nips-2002-Identity Uncertainty and Citation Matching

8 0.37036449 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

9 0.35780317 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text

10 0.35112324 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

11 0.34801546 27 nips-2002-An Impossibility Theorem for Clustering

12 0.34272292 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

13 0.33828703 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis

14 0.31396514 18 nips-2002-Adaptation and Unsupervised Learning

15 0.29522857 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains

16 0.29373693 83 nips-2002-Extracting Relevant Structures with Side Information

17 0.28953049 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures

18 0.28634185 163 nips-2002-Prediction and Semantic Association

19 0.27355236 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior

20 0.27193779 167 nips-2002-Rational Kernels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.012), (14, 0.021), (22, 0.277), (23, 0.014), (42, 0.069), (54, 0.12), (55, 0.033), (57, 0.02), (67, 0.025), (68, 0.019), (74, 0.104), (87, 0.065), (92, 0.028), (98, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79604727 40 nips-2002-Bayesian Models of Inductive Generalization

Author: Neville E. Sanjana, Joshua B. Tenenbaum

Abstract: We argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam’s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.

2 0.77277023 152 nips-2002-Nash Propagation for Loopy Graphical Games

Author: Luis E. Ortiz, Michael Kearns

Abstract: We introduce NashProp, an iterative and local message-passing algorithm for computing Nash equilibria in multi-player games represented by arbitrary undirected graphs. We provide a formal analysis and experimental evidence demonstrating that NashProp performs well on large graphical games with many loops, often converging in just a dozen iterations on graphs with hundreds of nodes. NashProp generalizes the tree algorithm of (Kearns et al. 2001), and can be viewed as similar in spirit to belief propagation in probabilistic inference, and thus complements the recent work of (Vickrey and Koller 2002), who explored a junction tree approach. Thus, as for probabilistic inference, we have at least two promising general-purpose approaches to equilibria computation in graphs.

3 0.60299134 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

Author: Harald Steck, Tommi S. Jaakkola

Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as

4 0.58679897 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

5 0.58555388 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

Author: Max Welling, Simon Osindero, Geoffrey E. Hinton

Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.

6 0.58524483 124 nips-2002-Learning Graphical Models with Mercer Kernels

7 0.58512694 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

8 0.58399725 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

9 0.58359295 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

10 0.57911736 135 nips-2002-Learning with Multiple Labels

11 0.57852066 98 nips-2002-Going Metric: Denoising Pairwise Data

12 0.5775882 188 nips-2002-Stability-Based Model Selection

13 0.57697296 2 nips-2002-A Bilinear Model for Sparse Coding

14 0.57555693 3 nips-2002-A Convergent Form of Approximate Policy Iteration

15 0.57453299 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

16 0.57409549 10 nips-2002-A Model for Learning Variance Components of Natural Images

17 0.5737763 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

18 0.57299572 163 nips-2002-Prediction and Semantic Association

19 0.57245839 27 nips-2002-An Impossibility Theorem for Clustering

20 0.57089448 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs