
149 nips-2001-Probabilistic Abstraction Hierarchies


Source: pdf

Author: Eran Segal, Daphne Koller, Dirk Ormoneit

Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. [sent-10, score-0.689]

2 In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. [sent-11, score-0.332]

3 The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. [sent-12, score-0.531]

4 A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. [sent-13, score-0.368]

5 We present experimental results on synthetic data, and on real data in the domains of gene expression and text. [sent-15, score-0.256]

6 1 Introduction Many domains are naturally associated with a hierarchical taxonomy, in the form of a tree, where instances that are close to each other in the tree are assumed to be more “similar” than instances that are further away. [sent-16, score-0.548]

7 In biological systems, for example, creating a taxonomy of the instances is often one of the first steps in understanding the system. [sent-17, score-0.268]

8 In particular, much of the work on analyzing gene expression data [3] has focused on creating gene hierarchies. [sent-18, score-0.35]

9 Similarly, in text domains, creating a hierarchy of documents is a common task [12, 7]. [sent-19, score-0.254]

10 In many of these applications, the hierarchy is unknown; indeed, discovering the hierarchy is often a key part of the analysis. [sent-20, score-0.402]

11 Thus, domain knowledge about the type of distribution from which data instances are sampled is rarely used in the formation of the hierarchy. [sent-24, score-0.157]

12 In this paper, we present probabilistic abstraction hierarchies (PAH), a probabilistically principled general framework for learning abstraction hierarchies from data which overcomes these difficulties. [sent-25, score-0.994]

13 We use a Bayesian approach, where the different models correspond to different abstraction hierarchies. [sent-26, score-0.225]

14 The prior is designed to enforce our intuitions about taxonomies: nearby classes have similar data distributions. [sent-27, score-0.215]

15 More specifically, a model in a PAH is a tree, where each node in the tree is associated with a class-specific probabilistic model (CPM). [sent-28, score-0.299]

16 Data is generated only at the leaves of the tree, so that a model basically defines a mixture distribution whose components are the CPMs at the leaves of the tree. [sent-29, score-0.24]

17 The CPMs at the internal nodes are used to define the prior over models: We prefer models where the CPM at a child node is close to the CPM at its parent, relative to some distance function between CPMs. [sent-30, score-0.203]

18 We present a novel algorithm for learning the model parameters and the tree structure in this framework. [sent-33, score-0.325]

19 Our algorithm is based on the structural EM (SEM) approach of [4], but utilizes “global” optimization steps for learning the best possible hierarchy and CPM parameters (see also [5, 13] for similar global optimization steps within SEM). [sent-34, score-0.497]

20 (3) It utilizes global optimization steps, which tend to avoid local maxima and help make the model less sensitive to noise. [sent-39, score-0.183]

21 (4) The abstraction hierarchy tends to pull the parameters of one model closer to those of nearby ones, which naturally leads to a form of parameter smoothing or shrinkage [12]. [sent-40, score-0.568]

22 We present experiments for PAH on synthetic data and on two real data sets: gene expression and text. [sent-41, score-0.28]

23 Our results show that the PAH approach produces hierarchies that are more robust to noise in the data, and that the learned hierarchies generalize better to test data than those produced by hierarchical agglomerative clustering. [sent-42, score-0.713]

24 Each data instance is sampled independently by first selecting one of the classes according to a multinomial distribution, and then randomly selecting the data instance itself from the CPM of the chosen class. [sent-49, score-0.223]

25 In this paper, we propose a model where the different classes are related to each other via an abstraction hierarchy, such that classes that are “nearby” in the hierarchy have similar probabilistic models. [sent-51, score-0.537]

26 A probabilistic abstraction hierarchy (PAH) is a tree with nodes and undirected edges, such that the tree has exactly one leaf per class. [sent-53, score-0.905]

27 We also have a multinomial distribution over the leaves, with its own set of parameters. [sent-55, score-0.175]

28 As discussed above, we assume that data is generated only from the leaves of the tree. [sent-60, score-0.171]

29 Given a PAH, a data element, and a value for the multinomial parameters, we define the likelihood of the element as a mixture over the leaves, where the mixing weights come from the multinomial distribution over the leaves and each component is the conditional density of the data item given the CPM at the corresponding leaf. [sent-62, score-0.304]
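
The symbols in this sentence were lost in extraction; as a hedged reconstruction in generic notation (T for the tree, x for a data item, theta for the leaf multinomial, and CPM_l for the class model at leaf l are all supplied here, not taken from the extracted text), the mixture likelihood would read:

    \[ P(x \mid T, \theta) \;=\; \sum_{l \in \mathrm{Leaves}(T)} \theta_l \, P(x \mid \mathrm{CPM}_l) \]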

30 (b) Two different weight-preserving transformations for a tree with 4 leaves. [sent-65, score-0.364]

31 As we mentioned, the role of the internal nodes in the tree is to enforce an intuitive interpretation of the model as an abstraction hierarchy, by enforcing similarity between CPMs at nearby leaves. [sent-67, score-0.657]

32 We achieve this goal by defining a prior distribution over abstraction hierarchies that penalizes the distance between neighboring CPMs using a distance function. [sent-68, score-0.361]

33 Given a set of data instances, our goal is to find a PAH that maximizes the posterior probability of the hierarchy given the data or, equivalently, the joint likelihood of model and data. [sent-73, score-0.157]

34 By maximizing this expression, we are trading off the fit of the mixture model over the leaves to the data against the desire to generate a hierarchy in which nearby models are similar. [sent-74, score-0.495]
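
To make the tradeoff concrete, here is a minimal Python sketch of such a penalized score, not the authors' code: the data log-likelihood under the leaf mixture minus a coefficient times the sum of distances between neighboring CPMs. All names (pah_score, cpm_loglik, cpm_distance, beta) are hypothetical.

    import math

    def pah_score(data, tree_edges, leaf_cpms, leaf_weights,
                  all_cpms, cpm_loglik, cpm_distance, beta):
        """Hedged sketch of log P(D | T) plus the edge-penalty prior, up to a constant."""
        # Fit term: mixture over the leaves only (data is generated at the leaves).
        loglik = 0.0
        for x in data:
            per_leaf = [math.log(leaf_weights[l]) + cpm_loglik(leaf_cpms[l], x)
                        for l in leaf_cpms]
            m = max(per_leaf)  # log-sum-exp for numerical stability
            loglik += m + math.log(sum(math.exp(v - m) for v in per_leaf))
        # Prior term: penalize the distance between CPMs at neighboring nodes.
        penalty = sum(cpm_distance(all_cpms[u], all_cpms[v]) for u, v in tree_edges)
        return loglik - beta * penalty

Setting beta near zero leaves only the mixture fit; a large beta strongly favors hierarchies whose neighboring CPMs are close.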

35 Fig. 1(a) illustrates a typical PAH with Gaussian CPM distributions, where a CPM close to the leaves of the tree is more specialized and thus has fairly peaked distributions. [sent-76, score-0.384]

36 This learning task is fairly complex, as many aspects are unknown: the structure of the tree, the CPMs at its nodes, the multinomial parameters, and the assignment of the data instances to the leaves of the tree. [sent-80, score-0.634]

37 We first discuss the case of complete data, where for each data instance we are given the leaf from which it was generated. [sent-89, score-0.153]

38 For this case, we show how to learn the structure of the tree and the setting of the CPM and multinomial parameters. [sent-90, score-0.297]

39 This problem, of constructing a tree over a set of points that is not fixed, is very closely related to the Steiner tree problem [10], virtually all of whose variants are NP-hard. [sent-91, score-0.512]

40 We propose a heuristic approach that decouples the joint optimization problem into two subproblems: optimizing the CPM parameters given the tree structure, and learning a tree structure given a set of CPMs. [sent-92, score-0.578]
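
A minimal outline of that decoupling for the complete-data case (a sketch under assumed interfaces; fit_cpms_given_tree, learn_tree_given_cpms, and score are hypothetical stand-ins for the two subproblems and the score sketched above):

    def learn_pah_complete_data(data, assignments, n_iters,
                                fit_cpms_given_tree, learn_tree_given_cpms, score):
        """Alternate between the two subproblems until the score stops improving."""
        tree, cpms = None, None          # tree=None: start from the unconnected leaf nodes
        best = float("-inf")
        for _ in range(n_iters):
            # Subproblem 1: optimize CPM parameters given the current tree structure.
            cpms = fit_cpms_given_tree(data, assignments, tree)
            # Subproblem 2: learn a low-weight tree structure given the current CPMs.
            tree = learn_tree_given_cpms(cpms)
            current = score(data, assignments, tree, cpms)
            if current <= best:          # no improvement over the previous iteration
                break
            best = current
        return tree, cpms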

41 Somewhat surprisingly, we show that our careful choice of additive prior allows each of these subproblems to be tackled very effectively using global optimization techniques. [sent-93, score-0.135]

42 Thus, assume that we are given both the structure of the tree and the assignment of each data instance to one of the leaves. [sent-95, score-0.38]

43 An examination of (1) shows that the optimization of each CPM involves only the data cases assigned to it (if it is a leaf) and the parameters of the CPMs that are its neighbors in the tree, thereby simplifying the computation substantially. [sent-104, score-0.132]

44 We now turn our attention to the second subproblem, of learning the structure of the tree given the learned CPMs. [sent-105, score-0.323]

45 We first consider an empty tree containing only the (unconnected) leaf nodes, and find the optimal parameter settings for each leaf CPM as described above. [sent-106, score-0.482]

46 Given this initial set of CPMs for the leaf nodes, the algorithm tries to learn a good tree structure relative to these CPMs. [sent-108, score-0.464]

47 The goal is to find the lowest-weight tree, subject to the restriction that the tree structure must keep the same set of leaves. [sent-109, score-0.396]

48 The penalty of the tree can be measured via the sum of the edge weights; this problem is also a variant of the Steiner tree problem. [sent-111, score-0.488]

49 As a heuristic substitute, we follow the lines of [5] and use a minimum spanning tree (MST) algorithm for constructing low-weight trees. [sent-112, score-0.296]
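
For illustration, a self-contained sketch of the MST step, assuming a precomputed symmetric matrix of pairwise CPM distances (an assumption; the exact edge weights and the post-processing of internal nodes are described in the surrounding sentences):

    def minimum_spanning_tree(dist):
        """Prim's algorithm. dist is a symmetric n x n matrix of CPM distances.
        Returns a list of (i, j) edges forming an MST over the n nodes."""
        n = len(dist)
        in_tree = [False] * n
        in_tree[0] = True
        edges = []
        for _ in range(n - 1):
            best = None  # (weight, node in tree, node outside tree)
            for i in range(n):
                if not in_tree[i]:
                    continue
                for j in range(n):
                    if in_tree[j]:
                        continue
                    if best is None or dist[i][j] < best[0]:
                        best = (dist[i][j], i, j)
            _, i, j = best
            in_tree[j] = True
            edges.append((i, j))
        return edges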

50 At each iteration, the algorithm starts out with a tree over some set of nodes. [sent-113, score-0.354]

51 For all nodes that end up as internal nodes, we perform the same transformation described above. [sent-119, score-0.133]

52 The trees are evaluated relative to our score, and the highest-scoring tree is kept. [sent-124, score-0.317]

53 The tree just constructed has a new set of CPMs, so we can repeat this process. [sent-125, score-0.269]

54 To detect termination, the algorithm also keeps the tree from the previous iteration, and terminates when the score of all trees in the newly constructed pool is lower than the score of the best tree from the previous iteration. [sent-126, score-0.662]

55 Finally, we address the fact that the data we have is incomplete, in that the assignments of data instances to classes are not determined. [sent-127, score-0.247]

56 We address the problem of incomplete data using the standard Expectation Maximization (EM) algorithm [2] and the structural EM algorithm [4] which extends EM to the problem of model selection. [sent-128, score-0.164]

57 In our case, the M-step is precisely the algorithm for complete data described above, but using a soft assignment of data instances to nodes in the tree. [sent-132, score-0.347]
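
A minimal sketch of the corresponding E-step, computing soft assignments (responsibilities) of leaves for each instance; cpm_loglik and leaf_weights are the same hypothetical quantities used in the score sketch above:

    import math

    def soft_assignments(data, leaf_cpms, leaf_weights, cpm_loglik):
        """E-step: posterior probability of each leaf for each data instance."""
        resp = []
        for x in data:
            logs = {l: math.log(leaf_weights[l]) + cpm_loglik(leaf_cpms[l], x)
                    for l in leaf_cpms}
            m = max(logs.values())
            Z = sum(math.exp(v - m) for v in logs.values())
            resp.append({l: math.exp(v - m) / Z for l, v in logs.items()})
        # The M-step then reuses the complete-data algorithm with these weights.
        return resp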

58 (Figure 2: Abstraction Hierarchy Learning Algorithm.) We focus our experimental results on genomic expression data, although we also provide some results on a text dataset. [sent-158, score-0.166]

59 In gene expression data, the level of mRNA transcript of every gene in the cell is measured simultaneously, using DNA microarray technology. [sent-159, score-0.294]

60 This genomic expression data provides researchers with much insight towards understanding the overall cellular behavior. [sent-160, score-0.212]

61 The most commonly used method for analyzing this data is clustering, a process which identifies clusters of genes that share similar expression patterns. [sent-161, score-0.337]

62 The most popular clustering method for genomic expression data to date is hierarchical agglomerative clustering (HAC) [3], which builds a hierarchy among the genes by iteratively merging the closest genes relative to some distance metric. [sent-166, score-1.059]
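
For reference, a HAC baseline of this kind can be sketched with standard tools; the linkage method, distance metric, and placeholder data below are illustrative choices, not taken from the paper:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # expression: (n_genes, n_experiments) matrix; random placeholder data here.
    expression = np.random.rand(80, 100)
    # Average linkage with correlation distance is a common (assumed) choice
    # for expression data; the paper's exact metric may differ.
    Z = linkage(expression, method='average', metric='correlation')
    # Z encodes the binary merge tree; internal nodes correspond to merged gene sets.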

63 (Note that in HAC the metric is used as the distance between data cases, whereas in our algorithm it is used as the distance between models.) [sent-168, score-0.175]

64 To do so, we create CPMs from the genes that HAC assigned to each internal node. [sent-170, score-0.253]

65 In both PAH and HAC, we then assign each gene (in the training set or the test set) to the hierarchy by choosing the best (highest likelihood) CPM among all the nodes in the tree (including internal nodes) and recording the probability that this CPM assigns to the gene. [sent-171, score-0.714]

66 A good algorithm for learning abstraction hierarchies should recover the true hierarchy as well as possible. [sent-173, score-0.673]

67 To test this, we generated a synthetic data set, and measured the ability of each method to recover the distances between pairs of instances (genes) in the generating model, where distance here is the length of the path between two genes in the hierarchy. [sent-174, score-0.499]

68 We generated the data set by sampling from the leaves of a PAH; to make the data realistic, we sampled from a PAH that we learned from a real gene expression data set. [sent-175, score-0.498]

69 We generated data for 80 (imaginary) genes and 100 experiments, for a total of 8000 measurements. [sent-177, score-0.254]

70 For robustness, we generated 5 different such data sets and ran PAH and HAC for each data set. [sent-178, score-0.134]

71 We used the correlation and the error between the pairwise distances in the original and the learned tree as measures of similarity. [sent-179, score-0.343]
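
A sketch of how such a comparison could be computed (illustrative only; it assumes both trees are given as edge lists sharing the same leaf labels):

    import numpy as np
    from collections import deque

    def pairwise_path_lengths(edges, leaves):
        """Unweighted path length between every pair of leaves in a tree (edge list)."""
        adj = {}
        for u, v in edges:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        dist = {}
        for s in leaves:
            d = {s: 0}
            q = deque([s])
            while q:
                u = q.popleft()
                for w in adj.get(u, []):
                    if w not in d:
                        d[w] = d[u] + 1
                        q.append(w)
            for t in leaves:
                dist[(s, t)] = d[t]
        return dist

    def compare_trees(true_edges, learned_edges, leaves):
        d1 = pairwise_path_lengths(true_edges, leaves)
        d2 = pairwise_path_lengths(learned_edges, leaves)
        pairs = [(s, t) for i, s in enumerate(leaves) for t in leaves[i + 1:]]
        a = np.array([d1[p] for p in pairs], dtype=float)
        b = np.array([d2[p] for p in pairs], dtype=float)
        corr = np.corrcoef(a, b)[0, 1]   # correlation of pairwise distances
        err = np.mean(np.abs(a - b))     # mean absolute distance error
        return corr, err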

72 These results show that PAH recovers an abstraction hierarchy much better than HAC. [sent-182, score-0.404]

73 We selected 953 genes with significant changes in expression, using their full set of 93 experiments. [sent-186, score-0.203]

74 For PAH we also used different settings of the coefficient of the penalty term, which explores performance over the range from only fitting the data (small coefficient) to greatly favoring hierarchies in which nearby models are similar (large coefficient). [sent-188, score-0.415]

75 In both cases, we learned a model using training data, and evaluated the log-likelihood of test instances as described above. [sent-189, score-0.218]

76 The results in Fig. 3(a) clearly show that PAH generalizes much better to previously unobserved data than HAC, and that PAH works best at some tradeoff between fitting the data and generating a hierarchy in which nearby models are similar. [sent-191, score-0.458]

77 Our goal in constructing a hierarchy is to extract meaningful biological conclusions from the data. [sent-193, score-0.259]

78 Thus, we want genes that are assigned to nearby nodes in the tree to also be close together in hierarchies learned from perturbed data sets. [sent-196, score-0.785]

79 We tested robustness to noise by learning a model from the original data set and from perturbed data sets in which we permuted a varying percentage of the expression measurements. [sent-197, score-0.263]

80 We then compared the distances (the path length in the tree) between the nodes assigned to every pair of genes in trees learned from the original data and trees learned from perturbed data sets. [sent-198, score-0.685]

81 The results in Fig. 3(b) demonstrate that PAH preserves the pairwise distances extremely well even when a substantial fraction of the data is perturbed (and performs reasonably well under heavier permutation), while HAC deteriorates completely when a portion of the data is permuted. [sent-200, score-0.191]

82 A second important test is robustness to our particular choice of training data: a particular training set reflects only a subset of the experiments that we could have performed. [sent-217, score-0.132]

83 In this experiment, we used the Yeast Compendium data of [9], which measures the expression profiles triggered by specific gene mutations. [sent-218, score-0.229]

84 We selected 450 genes and all 298 arrays, focusing on genes that changed significantly. [sent-219, score-0.406]

85 We then placed both training and test genes within the hierarchy. [sent-221, score-0.268]

86 For each data set, every pair of genes either appears together in the training set, appears together in the test set, or does not appear together. [sent-222, score-0.319]

87 We compared, for each pair of genes, their distances in training sets in which they appear together and their distances in test sets in which they appear together. [sent-225, score-0.169]

88 By contrast, the hierarchies constructed by HAC are much less consistent. [sent-229, score-0.266]

89 PAH is very consistent in its classification into the hierarchy, even for test instances, ones not used to construct the hierarchy. [sent-231, score-0.145]

90 By contrast, HAC places test instances in very different configurations in different trees, reducing our confidence in the biological validity of the learned structure. [sent-233, score-0.226]

91 To get qualitative insight into the hierarchies produced, we ran PAH on 350 documents from the Probabilistic Methods category in the Cora dataset (cora. [sent-235, score-0.273]

92 5 Discussion We presented probabilistic abstraction hierarchies, a general framework for learning abstraction hierarchies from data, which relates different classes in the hierarchy by a tree whose nodes correspond to class-specific probability models (CPMs). [sent-245, score-1.29]

93 We utilize a Bayesian approach, where the prior favors hierarchies in which nearby classes have similar data distributions, by penalizing the distance between neighboring CPMs. [sent-246, score-0.504]

94 A unique feature of PAH is the use of global optimization steps for constructing the hierarchy and for finding the optimal setting of the entire set of parameters. [sent-247, score-0.34]

95 This feature differentiates us from many other approaches that build hierarchies by local improvements of the objective function or approaches that optimize a fixed hierarchy [7]. [sent-248, score-0.473]

96 The global optimization steps help in avoiding local maxima and in reducing sensitivity to noise. [sent-249, score-0.185]

97 Our approach leads naturally to a form of parameter smoothing, and provides much better generalization for test data and robustness to noise than other clustering approaches. [sent-250, score-0.213]

98 The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. [sent-300, score-0.268]

99 The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. [sent-305, score-0.268]

100 Improving text classification by shrinkage in a hierarchy of classes. [sent-336, score-0.248]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pah', 0.488), ('cpms', 0.337), ('cpm', 0.305), ('hac', 0.289), ('tree', 0.244), ('hierarchies', 0.241), ('genes', 0.203), ('abstraction', 0.203), ('hierarchy', 0.201), ('leaves', 0.12), ('instances', 0.106), ('nearby', 0.101), ('gene', 0.095), ('ec', 0.089), ('expression', 0.083), ('nodes', 0.082), ('leaf', 0.078), ('taxonomy', 0.07), ('mst', 0.064), ('clustering', 0.06), ('genomic', 0.056), ('probabilistic', 0.055), ('distances', 0.052), ('agglomerative', 0.051), ('data', 0.051), ('distance', 0.048), ('eran', 0.048), ('segal', 0.048), ('sem', 0.048), ('steiner', 0.048), ('learned', 0.047), ('global', 0.046), ('trees', 0.046), ('hierarchical', 0.043), ('yeast', 0.042), ('robustness', 0.041), ('em', 0.039), ('classes', 0.039), ('maxima', 0.039), ('test', 0.039), ('convexity', 0.038), ('perturbed', 0.037), ('kl', 0.037), ('optimization', 0.037), ('biological', 0.034), ('multinomial', 0.034), ('structural', 0.033), ('ran', 0.032), ('unobserved', 0.032), ('autoclass', 0.032), ('causat', 0.032), ('compendium', 0.032), ('cora', 0.032), ('influenc', 0.032), ('gd', 0.032), ('structure', 0.032), ('steps', 0.032), ('local', 0.031), ('utilizes', 0.03), ('wh', 0.03), ('de', 0.03), ('assignment', 0.029), ('stanford', 0.029), ('recomb', 0.028), ('subproblems', 0.028), ('algorithm', 0.028), ('nes', 0.027), ('domains', 0.027), ('text', 0.027), ('internal', 0.027), ('score', 0.027), ('creating', 0.026), ('training', 0.026), ('constructed', 0.025), ('instance', 0.024), ('constructing', 0.024), ('transformation', 0.024), ('incomplete', 0.024), ('ne', 0.024), ('prior', 0.024), ('ormoneit', 0.024), ('pro', 0.023), ('assigned', 0.023), ('cellular', 0.022), ('models', 0.022), ('db', 0.022), ('naturally', 0.022), ('likelihood', 0.021), ('cell', 0.021), ('organized', 0.021), ('causal', 0.021), ('biology', 0.021), ('item', 0.021), ('pool', 0.021), ('bayesian', 0.021), ('parameters', 0.021), ('shrinkage', 0.02), ('koller', 0.02), ('peaked', 0.02), ('nd', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 149 nips-2001-Probabilistic Abstraction Hierarchies

Author: Eran Segal, Daphne Koller, Dirk Ormoneit

Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.

2 0.10750828 56 nips-2001-Convolution Kernels for Natural Language

Author: Michael Collins, Nigel Duffy

Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.

3 0.10207732 190 nips-2001-Thin Junction Trees

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

4 0.087386675 115 nips-2001-Linear-time inference in Hierarchical HMMs

Author: Kevin P. Murphy, Mark A. Paskin

Abstract: The hierarchical hidden Markov model (HHMM) is a generalization of the hidden Markov model (HMM) that models sequences with structure at many length/time scales [FST98]. Unfortunately, the original inference algorithm is rather complicated, and takes time cubic in the length of the sequence, making it impractical for many domains. In this paper, we show how HHMMs are a special kind of dynamic Bayesian network (DBN), and thereby derive a much simpler inference algorithm, which only takes linear time. Furthermore, by drawing the connection between HHMMs and DBNs, we enable the application of many standard approximation techniques to further speed up inference.

5 0.085807301 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding

Author: Alberto Paccanaro, Geoffrey E. Hinton

Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding: Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts ”Alberto has-father Pietro” and ”Pietro has-brother Giovanni”, we would like to be able to infer ”Alberto has-uncle Giovanni”. Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is to come up with one of its elements, given the other two. Here we shall always try to complete the third element of the triplets. LRE will then represent each concept in the data as a learned vector.

6 0.082188018 22 nips-2001-A kernel method for multi-labelled classification

7 0.076205678 169 nips-2001-Small-World Phenomena and the Dynamics of Information

8 0.071454093 118 nips-2001-Matching Free Trees with Replicator Equations

9 0.071199857 105 nips-2001-Kernel Machines and Boolean Functions

10 0.067831457 84 nips-2001-Global Coordination of Local Linear Models

11 0.056954145 30 nips-2001-Agglomerative Multivariate Information Bottleneck

12 0.054433562 43 nips-2001-Bayesian time series classification

13 0.053638931 185 nips-2001-The Method of Quantum Clustering

14 0.05011585 139 nips-2001-Online Learning with Kernels

15 0.048430104 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

16 0.0474842 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

17 0.047087763 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

18 0.044498667 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model

19 0.044239834 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

20 0.043719139 53 nips-2001-Constructing Distributed Representations Using Additive Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.16), (1, 0.024), (2, 0.003), (3, -0.023), (4, -0.055), (5, -0.119), (6, -0.081), (7, -0.024), (8, -0.118), (9, -0.016), (10, -0.008), (11, -0.113), (12, 0.016), (13, -0.006), (14, 0.018), (15, -0.095), (16, -0.124), (17, 0.092), (18, -0.015), (19, -0.011), (20, 0.02), (21, 0.032), (22, 0.086), (23, 0.013), (24, 0.016), (25, 0.04), (26, 0.036), (27, 0.063), (28, 0.115), (29, -0.007), (30, 0.083), (31, -0.078), (32, 0.018), (33, -0.032), (34, -0.095), (35, -0.159), (36, -0.064), (37, -0.043), (38, 0.02), (39, 0.065), (40, -0.102), (41, 0.099), (42, 0.003), (43, 0.029), (44, -0.073), (45, 0.015), (46, 0.075), (47, 0.102), (48, 0.235), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92234868 149 nips-2001-Probabilistic Abstraction Hierarchies

Author: Eran Segal, Daphne Koller, Dirk Ormoneit

Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.

2 0.65470517 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding

Author: Alberto Paccanaro, Geoffrey E. Hinton

Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding: Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts ”Alberto has-father Pietro” and ”Pietro has-brother Giovanni”, we would like to be able to infer ”Alberto has-uncle Giovanni”. Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is to come up with one of its elements, given the other two. Here we shall always try to complete the third element of the triplets. LRE will then represent each concept in the data as a learned vector.

3 0.5315541 190 nips-2001-Thin Junction Trees

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

4 0.52099556 169 nips-2001-Small-World Phenomena and the Dynamics of Information

Author: Jon M. Kleinberg

Abstract: unkown-abstract

5 0.50374907 56 nips-2001-Convolution Kernels for Natural Language

Author: Michael Collins, Nigel Duffy

Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.

6 0.47439009 118 nips-2001-Matching Free Trees with Replicator Equations

7 0.45894426 93 nips-2001-Incremental A*

8 0.45790651 53 nips-2001-Constructing Distributed Representations Using Additive Clustering

9 0.41850641 115 nips-2001-Linear-time inference in Hierarchical HMMs

10 0.40702423 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

11 0.40692618 84 nips-2001-Global Coordination of Local Linear Models

12 0.38269961 22 nips-2001-A kernel method for multi-labelled classification

13 0.3803477 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

14 0.37986392 185 nips-2001-The Method of Quantum Clustering

15 0.36872214 30 nips-2001-Agglomerative Multivariate Information Bottleneck

16 0.3660751 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

17 0.36438698 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes

18 0.36163944 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model

19 0.34235194 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

20 0.3272863 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.051), (10, 0.011), (14, 0.024), (15, 0.012), (17, 0.011), (19, 0.028), (27, 0.092), (30, 0.34), (38, 0.012), (58, 0.022), (59, 0.037), (72, 0.073), (79, 0.047), (83, 0.033), (91, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98183888 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets

Author: S. Parveen, P. Green

Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectraltemporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.

2 0.97662294 33 nips-2001-An Efficient Clustering Algorithm Using Stochastic Association Model and Its Implementation Using Nanostructures

Author: Takashi Morie, Tomohiro Matsuura, Makoto Nagata, Atsushi Iwata

Abstract: This paper describes a clustering algorithm for vector quantizers using a “stochastic association model”. It offers a new simple and powerful softmax adaptation rule. The adaptation process is the same as the on-line K-means clustering method except for adding random fluctuation in the distortion error evaluation process. Simulation results demonstrate that the new algorithm can achieve efficient adaptation as high as the “neural gas” algorithm, which is reported as one of the most efficient clustering methods. It is a key to add uncorrelated random fluctuation in the similarity evaluation process for each reference vector. For hardware implementation of this process, we propose a nanostructure, whose operation is described by a single-electron circuit. It positively uses fluctuation in quantum mechanical tunneling processes.

3 0.97274256 163 nips-2001-Risk Sensitive Particle Filters

Author: Sebastian Thrun, John Langford, Vandi Verma

Abstract: We propose a new particle filter that incorporates a model of costs when generating particles. The approach is motivated by the observation that the costs of accidentally not tracking hypotheses might be significant in some areas of state space, and next to irrelevant in others. By incorporating a cost model into particle filtering, states that are more critical to the system performance are more likely to be tracked. Automatic calculation of the cost model is implemented using an MDP value function calculation that estimates the value of tracking a particular state. Experiments in two mobile robot domains illustrate the appropriateness of the approach.

4 0.96813893 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

Author: Xiaohui Xie, Martin A. Giese

Abstract: Asymmetric lateral connections are one possible mechanism that can account for the direction selectivity of cortical neurons. We present a mathematical analysis for a class of these models. Contrasting with earlier theoretical work that has relied on methods from linear systems theory, we study the network’s nonlinear dynamic properties that arise when the threshold nonlinearity of the neurons is taken into account. We show that such networks have stimulus-locked traveling pulse solutions that are appropriate for modeling the responses of direction selective cortical neurons. In addition, our analysis shows that outside a certain regime of stimulus speeds the stability of this solutions breaks down giving rise to another class of solutions that are characterized by specific spatiotemporal periodicity. This predicts that if direction selectivity in the cortex is mainly achieved by asymmetric lateral connections lurching activity waves might be observable in ensembles of direction selective cortical neurons within appropriate regimes of the stimulus speed.

5 0.96809304 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

Author: B. Zadrozny

Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.

6 0.95678294 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model

same-paper 7 0.90016443 149 nips-2001-Probabilistic Abstraction Hierarchies

8 0.86724216 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes

9 0.86110502 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

10 0.84780753 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity

11 0.81602752 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.79024553 46 nips-2001-Categorization by Learning and Combining Object Parts

13 0.77768916 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

14 0.7733804 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

15 0.77060258 60 nips-2001-Discriminative Direction for Kernel Classifiers

16 0.76770955 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

17 0.76357883 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

18 0.75930357 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

19 0.75145602 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

20 0.74812365 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field