nips nips2002 nips2002-142 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Noam Slonim, Yair Weiss
Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution p(x, y), this method constructs a new variable T that defines partitions over the values of X that are informative about Y. Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the two problems are strongly related. In fact, for uniform input distribution over X or for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems induces a solution for the other.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. [sent-4, score-0.136]
2 Given a joint distribution p(x, y), this method constructs a new variable T that defines partitions over the values of X that are informative about Y. [sent-5, score-0.063]
3 Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. [sent-6, score-0.189]
4 We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. [sent-8, score-0.294]
5 We show that under this mapping the two problems are strongly related. [sent-9, score-0.162]
6 In fact, for uniform input distribution over X or for large sample size, the problems are mathematically equivalent. [sent-10, score-0.103]
7 Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. [sent-11, score-0.236]
8 Moreover, the values of the functionals at the fixed points are equal under simple transformations. [sent-12, score-0.112]
9 As a result, in these cases, every algorithm that solves one of the problems, induces a solution for the other. [sent-13, score-0.182]
10 1 Introduction Unsupervised clustering is a central paradigm in data analysis. [sent-14, score-0.087]
11 In this approach, given the joint distribution p(x, y), one looks for a compact representation of X which preserves as much information as possible about Y (see [2] for a detailed discussion). [sent-18, score-0.063]
12 In [1] it is argued that both the compactness of the representation and the preserved relevant information are naturally measured by mutual information, hence the above principle can be formulated as a trade-off between these quantities. [sent-22, score-0.061]
13 The IB problem can be stated as finding a (stochastic) mapping q(t|x) such that the IB-functional L = I(T;X) − βI(T;Y) is minimized, where β is a positive Lagrange multiplier that determines the trade-off between compression and precision. [sent-26, score-0.162]
14 This gives a (formal) solution without any assumption about the origin of the joint distribution; the relevant quantities are the mutual information values I(T;X) and I(T;Y). [sent-28, score-0.713]
15 The standard statistical approach to clustering is mixture modeling. [sent-40, score-0.144]
16 These posterior probabilities define a “soft” clustering of the values of X. [sent-45, score-0.087]
17 In the information-theoretic approach no assumption is made regarding how the data was generated but we assume that the joint distribution is known exactly. [sent-47, score-0.063]
18 In spite of these conceptual differences, we show that under a proper choice of the generative model, these two problems are strongly related. [sent-51, score-0.054]
19 (a.k.a. the one-sided [4] or the asymmetric clustering model [5]), and provide a simple “mapping” between the concepts of one problem and those of the other. [sent-54, score-0.087]
20 Using this mapping we show that in general, searching for a solution of one problem induces a search in the solution space of the other. [sent-55, score-0.323]
21 Furthermore, for uniform input distribution or for large sample sizes, we show that the problems are mathematically equivalent. [sent-56, score-0.103]
22 Hence, in these cases, any algorithm which solves one problem induces a solution for the other. [sent-57, score-0.146]
23 In the IB framework, one is given as input a joint distribution p(x, y); given this distribution, a compressed representation T of X is introduced through the stochastic mapping q(t|x). [sent-60, score-0.261]
24 The joint distribution over X, Y, and T is then defined through the IB Markovian independence relation: q(x, y, t) = p(x, y) q(t|x). [sent-67, score-0.274]
25 Specifically, every choice of q(t|x) defines a specific joint probability q(x, y, t). [sent-68, score-0.097]
26 The marginals q(t) = Σ_x p(x) q(t|x) and q(y|t) = (1/q(t)) Σ_x p(x, y) q(t|x) follow from this joint. [sent-75, score-0.175]
27 The formal conditions for a minimum of (1) are the self-consistent equations (2): q(t|x) ∝ q(t) exp(−β D_KL[p(y|x) || q(y|t)]). [sent-84, score-0.608]
28 In principle every (normalized) choice of q(t|x) is possible; given p(x, y), the choice that minimizes (1) must satisfy (2). [sent-90, score-0.135]
29 Iterating over these equations defines an iterative algorithm that is guaranteed to converge to a (local) fixed point of (1) [1]. [sent-94, score-0.134]
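As a concrete illustration of these updates, the sketch below implements one generic iIB step from the self-consistent equations; it is not the authors' code, and the array names p_xy, q_t_given_x and beta are notation assumed here (zero-probability entries are handled only minimally, and p(x) > 0 is assumed).

```python
import numpy as np

def iib_step(q_t_given_x, p_xy, beta):
    """One self-consistent IB update (a minimal sketch, not the paper's implementation).

    p_xy:        |X| x |Y| joint distribution p(x, y), with p(x) > 0 for every x
    q_t_given_x: |X| x |T| stochastic mapping q(t|x)
    beta:        positive trade-off parameter
    """
    p_x = p_xy.sum(axis=1)                                  # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]                       # rows p(y|x)

    q_t = p_x @ q_t_given_x                                 # q(t) = sum_x p(x) q(t|x)
    q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]     # q(y|t)

    # D_KL[p(y|x) || q(y|t)] for every (x, t) pair; 0 * log(0/q) terms are treated as 0
    ratio = np.where(p_y_given_x[:, None, :] > 0,
                     p_y_given_x[:, None, :] / q_y_given_t[None, :, :], 1.0)
    dkl = (p_y_given_x[:, None, :] * np.log(ratio)).sum(axis=2)

    new_q = q_t[None, :] * np.exp(-beta * dkl)              # q(t|x) ∝ q(t) exp(-beta * D_KL)
    return new_q / new_q.sum(axis=1, keepdims=True)
```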
30 3 Short review of ML for mixture models In a multinomial mixture model, we assume that y takes on discrete values and is sampled from a multinomial distribution indexed by the (hidden) label of x. [sent-98, score-0.289]
31 In the one-sided clustering model [4][5] we further assume that there can be multiple observations corresponding to a single x, but that they are all sampled from the same multinomial distribution. [sent-99, score-0.162]
32 The generative process is: for each x, choose a unique label by sampling from the prior over labels; then, for each observation, choose x by sampling from its marginal and choose y by sampling from the multinomial component selected by x's label. [sent-131, score-0.396]
33 A standard algorithm for this purpose is the EM algorithm [6]. [sent-148, score-0.077]
34 The updates are given in Eqs. (6)-(8), where the remaining quantities are simply normalization factors. [sent-171, score-0.396]
35 The last step is given by Eq. (9); iterating over these EM steps is guaranteed to converge to a local fixed point of the likelihood. [sent-172, score-0.061]
36 Moreover, every fixed point of the likelihood defines a fixed point of this algorithm. [sent-173, score-0.157]
37 Since this functional is bounded (under mild conditions), the EM algorithm will converge to a local fixed point of it, which corresponds to a fixed point of the likelihood. [sent-183, score-0.126]
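For concreteness, here is a minimal sketch of one EM iteration for this multinomial mixture, written from the standard form of the model rather than from the garbled Eqs. (6)-(9) above; the count matrix n_xy, the mixing weights pi and the components b are assumed names, and the uniform weighting of the x values in the M-step for pi is one common choice, not necessarily the paper's.

```python
import numpy as np

def em_step(n_xy, pi, b):
    """One EM iteration for the multinomial mixture (a sketch, not the authors' code).

    n_xy: |X| x |Y| matrix of observed counts n(x, y)
    pi:   length-|T| mixing weights
    b:    |T| x |Y| multinomial components, assumed strictly positive
    """
    # E-step: responsibilities w(t|x) proportional to pi(t) * prod_y b(y|t)^n(x,y)
    log_w = np.log(pi)[None, :] + n_xy @ np.log(b).T        # |X| x |T|
    log_w -= log_w.max(axis=1, keepdims=True)               # numerical stability
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities
    pi_new = w.mean(axis=0)
    b_new = w.T @ n_xy                                      # |T| x |Y| expected counts
    b_new /= b_new.sum(axis=1, keepdims=True)
    return w, pi_new, b_new
```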
38 4 The ML ⇔ IB mapping As already mentioned, the IB problem and the ML problem stem from different motivations and involve different “settings”. [sent-187, score-0.162]
39 Here, we define this mapping to achieve two goals. [sent-189, score-0.162]
40 The first is theoretically motivated: using the mapping we show a mathematical equivalence between the two problems. [sent-190, score-0.223]
41 A natural mapping would be to identify each distribution with its corresponding one. [sent-192, score-0.187]
42 If we directly map each ML distribution to its IB counterpart, obviously there is no guarantee that the IB Markovian independence relation will hold once we complete the mapping through Eq. [sent-195, score-0.078]
43 Specifically, using this relation to extract the remaining distributions may result in a different prior than the one obtained by simply mapping it directly. [sent-197, score-0.07]
44 However, we notice that once we have defined the first two distributions, the other distributions can be extracted by performing the IB-step defined in Eq. [sent-198, score-0.082]
45 Although in principle there are no “consistency” problems when mapping directly in the opposite direction, we know that once we have defined the cluster assignments, we can extract the remaining parameters by a simple M-step. [sent-202, score-0.22]
46 This leads to the mapping defined in Eq. (12). [sent-220, score-0.135]
47 Therefore, to summarize, we define the mapping by identifying each distribution with its corresponding one and setting β proportional to the sample size, [sent-222, score-0.175]
48 where the proportionality factor is a positive (scaling) constant and the mapping is completed by performing an IB-step or an M-step according to the mapping direction. [sent-224, score-0.354]
49 Notice that under this mapping, every search in the solution space of the IB problem induces a search in the solution space of the ML problem, and vice versa (see Figure 2). [sent-225, score-0.276]
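The exact normalizations of Eq. (12) are lost in this extraction, so the sketch below only illustrates the structure of the mapping as described in the text: the responsibilities are identified with the stochastic mapping q(t|x), the trade-off parameter is tied to a scaled sample size, and the remaining distributions are completed by an IB-step in one direction and an M-step in the other. The function names, the scaling r, and the choice of the empirical prior over x are assumptions made for illustration.

```python
import numpy as np

def ml_to_ib(w, n_xy, r=1.0):
    """Map an ML configuration to an IB one (structure only; a hedged sketch of the mapping)."""
    p_xy = n_xy / n_xy.sum()
    p_x = p_xy.sum(axis=1)                                  # empirical prior over x (an assumption)
    q_t_given_x = w                                         # identify q(t|x) with the responsibilities
    # complete the mapping with an IB-step
    q_t = p_x @ q_t_given_x
    q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]
    beta = r * n_xy.sum(axis=1).mean()                      # beta tied to the (scaled) sample size
    return q_t_given_x, q_t, q_y_given_t, beta

def ib_to_ml(q_t_given_x, n_xy):
    """Map an IB configuration to an ML one, completing it with an M-step."""
    w = q_t_given_x                                         # identify the responsibilities with q(t|x)
    pi = w.mean(axis=0)
    b = w.T @ n_xy
    b /= b.sum(axis=1, keepdims=True)
    return w, pi, b
```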
50 In particular, when the prior over X is uniform (i.e., when the per-x counts are constant), the mapping is equivalent to a direct mapping of each distribution to its corresponding one. [sent-229, score-0.403]
51 This observation is a direct result of the fact that if X is uniformly distributed, then the IB-step defined in Eq. [sent-232, score-0.116]
52 2 When X is uniformly distributed, the EM algorithm is equivalent to the IB iterative optimization algorithm under the mapping with β set to the sample size. [sent-236, score-0.335]
53 Again, this observation is a direct result of the equivalence of the IB-step and the M-step for a uniform prior over X. [sent-237, score-0.165]
54 Additionally, we notice that in this case the corresponding quantities coincide, hence Eq. [sent-238, score-0.052]
55 It is important to emphasize, though, that this equivalence holds only for a specific choice of β. [sent-241, score-0.084]
56 While clearly the IB iterative algorithm (and problem) are meaningful for any value of β, there is no such freedom (for good or worse) in the ML setting, and the exponential factor in EM must equal the (fixed) sample counts. [sent-242, score-0.073]
57 1 When X is uniformly distributed and β equals the sample size, all the fixed points of the likelihood are mapped to all the fixed points of the IB-functional. [sent-247, score-0.198]
58 2 When X is uniformly distributed, every algorithm which finds a fixed point of the likelihood induces a fixed point of the IB-functional with β equal to the sample size, and vice versa. [sent-251, score-0.346]
59 When the algorithm finds several fixed points, the solution that maximizes the likelihood is mapped to the one that minimizes the IB-functional. [sent-252, score-0.164]
60 We assume that we are given observations with a constant number of samples per x value, and parameters that define a fixed point of the likelihood. [sent-256, score-0.083]
61 As a result, this is also a fixed point of the EM algorithm (where the remaining distribution is defined through the corresponding EM step). [sent-257, score-0.065]
62 From claim 2 it follows that this fixed point is mapped to a fixed point of the IB-functional with β equal to the sample size, as required. [sent-259, score-0.072]
64 We emphasize again that this equivalence is for a specific value of β. Corollary 5. [sent-292, score-0.085]
65 3 When X is uniformly distributed and β equals the sample size, the likelihood increases iff the IB-functional decreases. [sent-293, score-0.146]
66 Multiplying both sides by the appropriate factor and using this relation, we find Eq. (14). [sent-297, score-0.199]
67 This corollary is a direct result of the above proof, which showed the equivalence of the free energy of the model and the IB-functional (up to linear transformations). [sent-302, score-0.179]
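Under the assumptions of the uniform-prior claims (equal per-x counts n, hence a uniform empirical prior over x, and β = n), the equivalence can be checked numerically with the sketches above (iib_step, em_step, ib_to_ml); the toy dimensions, the random data, and the smoothing constant are arbitrary choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, T, n = 20, 10, 4, 200
# equal number of observations per x, i.e. a uniform empirical prior over x
n_xy = np.vstack([rng.multinomial(n, rng.dirichlet(np.ones(Y))) for _ in range(X)])

pi0 = np.full(T, 1.0 / T)
b0 = rng.dirichlet(np.ones(Y), size=T)
w = em_step(n_xy, pi0, b0)[0]                     # responsibilities at the current point

# one iIB step at beta = n, starting from q(t|x) = w(t|x)
q_ib = iib_step(w, n_xy / n_xy.sum(), beta=n)

# one EM step from the same point: M-step (via the mapping), then an E-step
_, pi1, b1 = ib_to_ml(w, n_xy)
b1 += 1e-12                                       # avoid log(0) for unseen y values
b1 /= b1.sum(axis=1, keepdims=True)
q_em = em_step(n_xy, pi1, b1)[0]

print(np.abs(q_ib - q_em).max())                  # ~0 when p(x) is uniform and beta = n
```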
68 The previous claims dealt with the special case of a uniform prior over X. [sent-303, score-0.102]
69 The following claims provide similar results for the general case, when the sample sizes are large enough. [sent-304, score-0.063]
70 4 For large enough sample sizes (or per-x counts), all the fixed points of the likelihood [sent-305, score-0.072]
71 are mapped to all the fixed points of the IB-functional, and vice versa. [sent-307, score-0.151]
72 5 When the sample sizes are large enough, every algorithm which finds a fixed point of the likelihood induces a fixed point of the IB-functional (with the corresponding β), and vice versa. [sent-309, score-0.295]
73 When the algorithm finds several different fixed points, the solution that maximizes the likelihood is mapped to the solution that minimizes the IB-functional. [sent-310, score-0.183]
74 It is also important to keep in mind that in many clustering applications, a uniform prior over X is “forced” during pre-processing to avoid undesirable bias. [sent-312, score-0.15]
75 Using Eq. (6) we extract the remaining distributions, ending up with a fixed point of the EM algorithm. [sent-335, score-0.095]
76 Therefore, the mapping becomes deterministic: each x is effectively assigned to a single cluster. [sent-337, score-0.162]
77 Using Eq. (17) and the mapping (including the IB-step), it is easy to verify that we get the same distributions back (but not if the prior over X is non-uniform). [sent-347, score-0.584]
78 Since this is the case, it follows that the mapping we try to update will remain deterministic. [sent-350, score-0.192]
79 Therefore, we are at a fixed point of the IB iterative algorithm, and hence at a fixed point of the IB-functional, as required. [sent-363, score-0.122]
80 Using the mapping and similar algebra as above, we find that the likelihood increases iff the IB-functional decreases, as in Eq. (20). [sent-368, score-0.218]
81 Hence, every algorithm that decreases one of the functionals also decreases the other. [sent-370, score-0.096]
82 Using the mapping, we notice again that this holds at the fixed point. [sent-373, score-0.09]
83 Yet, roughly speaking, we notice that the sample size for which the above claims (approximately) hold is related to the “amount of uniformity” in the prior over X. [sent-377, score-0.115]
84 Specifically, a crucial step in the above proof assumed that each per-x sample size is large enough such that the cluster assignment becomes deterministic. [sent-378, score-0.049]
85 Thus, every sequence of distributions produced by an algorithm (e.g., EM) in the “ML space” can be mapped to a sequence of probabilities in the “IB space”, and vice versa. [sent-388, score-0.151]
86 In this data we randomly choose only a small fraction of the word occurrences for every document, ending up with a much smaller sample size. [sent-393, score-0.065]
87 For each algorithm (for the iterative IB (iIB) algorithm we took the β prescribed by the mapping) we used the mapping to calculate both functionals during the process. [sent-395, score-0.262]
88 For example, for iIB, after each iteration we mapped from the IB space to the ML space (including the M-step); we used several different initializations for each dataset. [sent-397, score-0.072]
89 In these runs we found that usually both algorithms improved both functionals monotonically. [sent-401, score-0.103]
90 Comparing the functionals during the process, we see that for the smaller sample size the differences are indeed more evident (Figure 1). [sent-402, score-0.076]
91 Comparing the final values of the functionals (after a fixed number of iterations, which typically yielded convergence), we see that in some of the runs iIB converged to a smaller value of the IB-functional than EM did. [sent-403, score-0.135]
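For reference, the two monitored quantities can be written down directly. The helpers below build on the earlier sketches (same assumed names) and compute the IB-functional induced by a mapping q(t|x) and the parameter-dependent part of the mixture log-likelihood, so either algorithm's progress can be tracked in both “spaces” through the mapping.

```python
import numpy as np

def ib_functional(q_t_given_x, p_xy, beta):
    """L_IB = I(T;X) - beta * I(T;Y) for the joint induced by q(t|x) and p(x, y)."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    q_t = p_x @ q_t_given_x
    q_xt = p_x[:, None] * q_t_given_x                       # joint q(x, t)
    q_ty = q_t_given_x.T @ p_xy                             # joint q(t, y)
    with np.errstate(divide='ignore', invalid='ignore'):
        i_tx = np.nansum(q_xt * np.log(q_t_given_x / q_t[None, :]))
        i_ty = np.nansum(q_ty * np.log(q_ty / (q_t[:, None] * p_y[None, :])))
    return i_tx - beta * i_ty

def log_likelihood(n_xy, pi, b):
    """Mixture log-likelihood up to an additive constant (the multinomial
    coefficients), which does not depend on the parameters."""
    log_mix = np.log(pi)[None, :] + n_xy @ np.log(b).T      # |X| x |T|
    m = log_mix.max(axis=1, keepdims=True)
    return float((m + np.log(np.exp(log_mix - m).sum(axis=1, keepdims=True))).sum())
```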
92 7 Discussion While we have shown that the ML and IB approaches are equivalent under certain conditions, it is important to keep in mind the different assumptions the two approaches make regarding the joint distribution over X and Y. [sent-407, score-0.109]
93 The mixture model (1) assumes that Y is independent of X given the cluster label, and (2) assumes that each conditional p(y|x) is one of a small number of possible conditional distributions. [sent-408, score-0.105]
94 Therefore, the solution space is the family of distributions for which this relation holds and whose marginal distribution is consistent with the input. [sent-423, score-0.134]
95 In this formulation the IB problem is related to minimizing a KL divergence to the family of distributions for which the mixture model assumption holds. [sent-425, score-0.057]
96 On the other hand, while solving the ML problem, one assumes an “ideal” world and tries to minimize the KL divergence with respect to the given marginal distribution. [sent-431, score-0.049]
97 Our theoretical analysis shows that under the mapping, these two procedures are in some cases equivalent (see Figure 2). [sent-432, score-0.072]
98 In the IB framework, for large enough β, the quality of a given solution is measured by the preserved relevant information [1]. [sent-436, score-0.066]
99 We have shown that for the multinomial mixture model, ML and IB are equivalent in some cases. [sent-444, score-0.154]
100 A natural question is to look for further generative models that can be mapped to multivariate IB problems, and we are working in this direction. [sent-447, score-0.132]
wordName wordTfidf (topN-words)
[('ib', 0.608), ('sr', 0.396), ('ml', 0.3), ('pr', 0.175), ('iib', 0.168), ('mapping', 0.162), ('ur', 0.153), ('fp', 0.117), ('em', 0.117), ('xed', 0.101), ('hc', 0.099), ('clustering', 0.087), ('hp', 0.085), ('vice', 0.079), ('induces', 0.077), ('functionals', 0.076), ('multinomial', 0.075), ('efp', 0.072), ('mapped', 0.072), ('yw', 0.067), ('lib', 0.063), ('claims', 0.063), ('equivalence', 0.061), ('corollary', 0.061), ('nds', 0.059), ('cally', 0.059), ('eg', 0.058), ('mixture', 0.057), ('tishby', 0.053), ('notice', 0.052), ('uniformly', 0.051), ('kl', 0.051), ('bottleneck', 0.049), ('qeg', 0.048), ('slonim', 0.048), ('fc', 0.048), ('markovian', 0.047), ('nes', 0.047), ('iterative', 0.046), ('likelihood', 0.045), ('de', 0.043), ('solution', 0.042), ('relation', 0.042), ('dkl', 0.042), ('distributed', 0.039), ('uniform', 0.039), ('mathematically', 0.039), ('xf', 0.038), ('joint', 0.038), ('point', 0.038), ('independence', 0.036), ('points', 0.036), ('every', 0.036), ('compressed', 0.036), ('friedman', 0.035), ('puzicha', 0.034), ('observation', 0.033), ('decreases', 0.033), ('documents', 0.032), ('speci', 0.032), ('moreover', 0.032), ('direct', 0.032), ('converged', 0.032), ('generative', 0.031), ('nonetheless', 0.031), ('compactness', 0.031), ('try', 0.03), ('principle', 0.03), ('ned', 0.03), ('performing', 0.03), ('ending', 0.029), ('multivariate', 0.029), ('iterating', 0.028), ('ss', 0.028), ('ne', 0.028), ('extract', 0.028), ('hebrew', 0.027), ('algorithm', 0.027), ('runs', 0.027), ('verify', 0.026), ('proof', 0.025), ('tries', 0.025), ('distribution', 0.025), ('marginal', 0.025), ('ning', 0.024), ('emphasize', 0.024), ('assumes', 0.024), ('hofmann', 0.024), ('mind', 0.024), ('sides', 0.024), ('enough', 0.024), ('choice', 0.023), ('phenomenon', 0.023), ('converge', 0.023), ('purpose', 0.023), ('minimizes', 0.023), ('iff', 0.023), ('opposite', 0.023), ('additionally', 0.022), ('equivalent', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 142 nips-2002-Maximum Likelihood and the Information Bottleneck
2 0.18769373 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
3 0.095531434 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
4 0.083054587 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
5 0.078542501 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
6 0.077744164 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
7 0.07057374 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
8 0.068838045 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
9 0.064420611 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
10 0.059870269 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
11 0.058766577 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
12 0.053707991 53 nips-2002-Clustering with the Fisher Score
13 0.053322636 161 nips-2002-PAC-Bayes & Margins
14 0.053223878 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
15 0.052516282 115 nips-2002-Informed Projections
16 0.050539762 36 nips-2002-Automatic Alignment of Local Representations
17 0.050108679 124 nips-2002-Learning Graphical Models with Mercer Kernels
18 0.050017864 106 nips-2002-Hyperkernels
19 0.049608115 188 nips-2002-Stability-Based Model Selection
20 0.04958237 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
topicId topicWeight
[(0, -0.166), (1, -0.059), (2, -0.022), (3, 0.011), (4, -0.081), (5, 0.064), (6, -0.111), (7, -0.107), (8, -0.073), (9, 0.06), (10, 0.052), (11, -0.057), (12, -0.023), (13, -0.028), (14, -0.022), (15, -0.016), (16, 0.003), (17, -0.112), (18, -0.039), (19, 0.03), (20, 0.041), (21, -0.003), (22, -0.023), (23, -0.017), (24, -0.033), (25, 0.047), (26, 0.059), (27, -0.094), (28, 0.104), (29, 0.0), (30, -0.021), (31, -0.093), (32, 0.195), (33, -0.089), (34, -0.047), (35, -0.026), (36, 0.002), (37, 0.105), (38, 0.085), (39, 0.014), (40, -0.101), (41, -0.006), (42, 0.05), (43, 0.04), (44, -0.143), (45, -0.1), (46, -0.135), (47, -0.111), (48, 0.122), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.95344466 142 nips-2002-Maximum Likelihood and the Information Bottleneck
2 0.61675084 83 nips-2002-Extracting Relevant Structures with Side Information
3 0.52731341 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
4 0.52230805 90 nips-2002-Feature Selection in Mixture-Based Clustering
5 0.49288929 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
Author: Kenji Fukumizu, Shotaro Akaho, Shun-ichi Amari
Abstract: We show the existence of critical points as lines for the likelihood function of mixture-type models. They are given by embedding of a critical point for models with less components. A sufficient condition that the critical line gives local maxima or saddle points is also derived. Based on this fact, a component-split method is proposed for a mixture of Gaussian components, and its effectiveness is verified through experiments. 1
6 0.44780475 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
7 0.41556263 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
8 0.38772133 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
9 0.38661122 27 nips-2002-An Impossibility Theorem for Clustering
10 0.37315348 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
11 0.36977771 188 nips-2002-Stability-Based Model Selection
12 0.36937436 89 nips-2002-Feature Selection by Maximum Marginal Diversity
13 0.36566123 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
14 0.35596502 87 nips-2002-Fast Transformation-Invariant Factor Analysis
15 0.34672651 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
16 0.32940477 124 nips-2002-Learning Graphical Models with Mercer Kernels
17 0.32723734 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
18 0.31822285 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
19 0.31467596 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
20 0.30418468 178 nips-2002-Robust Novelty Detection with Single-Class MPM
topicId topicWeight
[(14, 0.031), (23, 0.012), (42, 0.067), (54, 0.141), (55, 0.04), (57, 0.017), (67, 0.342), (68, 0.014), (74, 0.113), (92, 0.04), (98, 0.089)]
simIndex simValue paperId paperTitle
1 0.95353866 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons
Author: Aaron J. Gruber, Sara A. Solla, James C. Houk
Abstract: Single unit activity in the striatum of awake monkeys shows a marked dependence on the expected reward that a behavior will elicit. We present a computational model of spiny neurons, the principal neurons of the striatum, to assess the hypothesis that direct neuromodulatory effects of dopamine through the activation of D 1 receptors mediate the reward dependency of spiny neuron activity. Dopamine release results in the amplification of key ion currents, leading to the emergence of bistability, which not only modulates the peak firing rate but also introduces a temporal and state dependence of the model's response, thus improving the detectability of temporally correlated inputs. 1
2 0.9056561 107 nips-2002-Identity Uncertainty and Citation Matching
Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. 
EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. 
Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 
Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes.

Assuming (for now) that the value of C1.parse is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11.surname would be set to Lashkari, and D12.fnames would be set to (Max).) The remaining basic attributes — those of the Paper and Author instances — represent the "true" attributes of those objects, and their values are unobserved.

The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3.

Figure 3: The Bayesian network equivalent to our RPM, assuming C1 = C2.

3 IDENTITY UNCERTAINTY

In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2, there is uncertainty as to whether the corresponding papers P1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1) to objects in the world.
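As an illustration (ours, not the paper's code), an extended possible world can be represented directly; the class names and terms follow the example above, while the data structures and numbers are assumptions of the sketch:

from dataclasses import dataclass, field
from typing import Dict

# An extended possible world: per-class object counts n, an identity clustering
# iota mapping terms such as "P1" to objects, and the objects' basic attributes.

@dataclass
class ExtendedWorld:
    n: Dict[str, int]                     # class name -> number of objects
    iota: Dict[str, Dict[str, int]]       # class name -> {term -> object index}
    attributes: Dict[int, dict] = field(default_factory=dict)

def co_refer(world: ExtendedWorld, cls: str, term_a: str, term_b: str) -> bool:
    # Two terms co-refer iff iota maps them to the same object.
    return world.iota[cls][term_a] == world.iota[cls][term_b]

w = ExtendedWorld(n={"Paper": 1000, "Author": 2000},
                  iota={"Paper": {"P1": 17, "P2": 17}})   # here P1 and P2 co-refer
assert co_refer(w, "Paper", "P1", "P2")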
We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1, P2}}; if they do not co-refer, then ι is {{P1}, {P2}}. We define a probability model for the space of extended possible worlds by specifying the prior P(n) and the conditional distribution P(ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P(ι) = ∏_{C∈C} P(ι_C). We then distinguish two cases:
• For some classes (such as the citations themselves), the unique names assumption remains appropriate. Thus, we define P(ι_Citation) to assign a probability of 1.0 to the one assignment where each citation object is unique.
• For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P(n) using a high-variance log-normal distribution. (Other models are possible; for example, in situations where objects appear and disappear, P(ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6].) Then we make appropriate uniformity assumptions to construct P(ι_C). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, "a priori" means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι_{C,k,m} that maps k named instances to m distinct objects, when C contains n objects, is given by
P(ι_{C,k,m}) = n! / ((n − m)! · n^k).
When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term). Therefore, we never have to create them, or worry about their attribute values.

Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies.

4 The Probability Model

We will now fill in the details of the conditional probability models. Our priors over the "true" attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a 0.9 chance of being drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file.
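The sketch below (ours, not from the paper) illustrates two of the quantities just defined: the uniform identity prior P(ι_{C,k,m}) and the mixed census/bigram prior over names. The 0.9/0.1 mixture weights come from the text; the census table and bigram generator are stand-ins.

import random

def identity_prior(n: int, k: int, m: int) -> float:
    # P(iota_{C,k,m}) = n! / ((n - m)! * n**k): k named instances map to a
    # particular set of m distinct objects when each term independently picks
    # one of n objects uniformly at random.
    falling = 1
    for i in range(m):
        falling *= n - i          # n!/(n-m)! as a falling factorial
    return falling / n ** k

# With n = 1000 papers and the two terms P1, P2:
p_same     = identity_prior(1000, k=2, m=1)   # P1, P2 co-refer:  0.001
p_distinct = identity_prior(1000, k=2, m=2)   # P1, P2 distinct:  0.999

def sample_name(census: dict, bigram_sample) -> str:
    # Prior over Author.surname (and fnames): 0.9 census draw, 0.1 bigram draw.
    if random.random() < 0.9:
        names, counts = zip(*census.items())
        return random.choices(names, weights=counts)[0]
    return bigram_sample()        # bigram_sample() generates a new string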
The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType (publication types range over {article, conference paper, book, thesis, and tech report}) are derived from our hand-parsed file. The conditional distributions of the "observed" variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM.

Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between). We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file.

5 INFERENCE

With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set (this sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential). Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo.

5.1 MARKOV CHAIN MONTE CARLO

MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis–Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x′ is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x′|x). The probability that the move to x′ is actually made is the acceptance probability, defined as
α(x′|x) = min(1, [π(x′) q(x|x′)] / [π(x) q(x′|x)]).
Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables.
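For concreteness, here is a minimal, generic sketch of one M-H transition (our own illustration, working in log space; the target log_pi and the proposal function are assumed given and are not the citation-specific ones described below):

import math
import random

def mh_step(x, log_pi, propose):
    # One Metropolis-Hastings transition. propose(x) returns a candidate x_new
    # together with log q(x_new|x) and log q(x|x_new); log_pi is the
    # (unnormalized) log target density.
    x_new, log_q_fwd, log_q_rev = propose(x)
    log_alpha = min(0.0, log_pi(x_new) + log_q_rev - log_pi(x) - log_q_fwd)
    if random.random() < math.exp(log_alpha):
        return x_new              # move accepted
    return x                      # move rejected: the chain stays where it is

# A factored q simply means picking one of several such propose functions
# (e.g., one that perturbs n, one that modifies iota) at each step.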
In those situations, the variables that are not changed by the transition cancel in the ratio π(x′)/π(x), so the required calculation can be quite simple.

5.2 THE CITATION-MATCHING ALGORITHM

The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product of P(n), P(ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ι_C works as follows:
Select two clusters a1, a2 ∈ ι_C (if the same cluster is picked twice, it will probably be split).
Create two empty clusters b1 and b2.
Place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2.
Propose ι′_C = (ι_C − {a1, a2}) ∪ {b1, b2}.
Given a proposed ι′_C, suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as "Simth" and "Smith". Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, be careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10^−6, the lists are usually quite short. (It would also be possible to sample directly from a model such as a hierarchical HMM.) The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out.

5.3 SCALING UP

One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r. — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2.
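Below is a rough sketch of this canopy preprocessing step (our own simplification of McCallum et al [9]; the word-overlap score and the assignment of 0.2/0.5 to the loose/tight thresholds are assumptions, since the text does not spell them out). The runtime proposal that uses these canopies is described just below.

import random

def word_overlap(a: str, b: str) -> float:
    # Cheap similarity: fraction of shared words (a stand-in for the
    # word-matching metric mentioned in the text).
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def build_canopies(citations, loose=0.2, tight=0.5):
    # Canopy clustering: each canopy is seeded by a randomly chosen citation;
    # everything at least loose-similar to the seed joins the canopy, and
    # anything at least tight-similar may no longer seed a new canopy.
    # Canopies overlap, so a citation can belong to several of them.
    remaining = list(range(len(citations)))
    canopies = []
    while remaining:
        seed = random.choice(remaining)
        canopies.append([i for i in range(len(citations))
                         if word_overlap(citations[seed], citations[i]) >= loose])
        remaining = [i for i in remaining
                     if word_overlap(citations[seed], citations[i]) < tight]
    return canopies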
Then, at runtime, our q(x′|x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x′) is calculated by summing over all the canopies which contain any of the elements of the two clusters.)

6 EXPERIMENTAL RESULTS

We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their "phrases" by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Table 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching.

Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples.

Dataset        Size                        Phrase matching   RPM + MCMC
Face           349 citations, 242 papers   94%               97%
Reinforcement  406 citations, 148 papers   79%               94%
Reasoning      514 citations, 296 papers   86%               96%
Constraint     295 citations, 199 papers   89%               93%

We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors' research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data.

References
[1] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999.
[2] I. Fellegi and A. Sunter. A theory for record linkage. JASA, 1969.
[3] W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000.
[4] Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988.
[5] I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid's multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994.
[6] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999.
[7] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A Markov chain Monte Carlo approach. In NIPS-00, 2000.
[8] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. In AAAI, 1993.
[9] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000.
[10] H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001.
[11] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.
[12] A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
3 0.89444917 18 nips-2002-Adaptation and Unsupervised Learning
Author: Peter Dayan, Maneesh Sahani, Gregoire Deback
Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.
same-paper 4 0.83568054 142 nips-2002-Maximum Likelihood and the Information Bottleneck
Author: Noam Slonim, Yair Weiss
Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution, this method constructs a new variable that defines partitions over the values of one of the original variables that are informative about the other. Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the two problems are strongly related. In fact, for uniform input distribution or for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems induces a solution for the other.
5 0.59477627 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
Author: Peter Dayan, Angela J. Yu
Abstract: Inference and adaptation in noisy and changing, rich sensory environments are rife with a variety of specific sorts of variability. Experimental and theoretical studies suggest that these different forms of variability play different behavioral, neural and computational roles, and may be reported by different (notably neuromodulatory) systems. Here, we refine our previous theory of acetylcholine’s role in cortical inference in the (oxymoronic) terms of expected uncertainty, and advocate a theory for norepinephrine in terms of unexpected uncertainty. We suggest that norepinephrine reports the radical divergence of bottom-up inputs from prevailing top-down interpretations, to influence inference and plasticity. We illustrate this proposal using an adaptive factor analysis model.
6 0.58083224 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
7 0.58067274 27 nips-2002-An Impossibility Theorem for Clustering
8 0.5788182 83 nips-2002-Extracting Relevant Structures with Side Information
9 0.57600772 190 nips-2002-Stochastic Neighbor Embedding
10 0.57305193 3 nips-2002-A Convergent Form of Approximate Policy Iteration
11 0.57027334 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
12 0.56978697 150 nips-2002-Multiple Cause Vector Quantization
13 0.56946218 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
14 0.56717443 10 nips-2002-A Model for Learning Variance Components of Natural Images
15 0.5635196 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
16 0.56317627 163 nips-2002-Prediction and Semantic Association
17 0.56121874 40 nips-2002-Bayesian Models of Inductive Generalization
18 0.56084043 205 nips-2002-Value-Directed Compression of POMDPs
19 0.55960453 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
20 0.55859995 137 nips-2002-Location Estimation with a Differential Update Network