nips nips2002 nips2002-107 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. In those situations, the variables that are not changed by the transition cancel in the ratio π(x )/π(x), so the required calculation can be quite simple. 4 Publication types range over {article, conference paper, book, thesis, and tech report} This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential. 5 5.2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10−6 , the lists are usually quite short. 7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out. 5.3 SCALING UP One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. Then, at runtime, our q(x |x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x ) is calculated by summing over all the canopies which contain any of the elements of the two clusters.) 6 EXPERIMENTAL RESULTS We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also 6 7 Note that if the same cluster is picked twice, it will probably be split. It would also be possible to sample directly from a model such as a hierarchical HMM Face Reinforcement Reasoning Constraint 349 citations, 242 papers 406 citations, 148 papers 514 citations, 296 papers 295 citations, 199 papers Phrase matching 94% 79% 86% 89% RPM + MCMC 97% 94% 96% 93% Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples. implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Figure 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching. We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999. I. Fellegi and A. Sunter. A theory for record linkage. In JASA, 1969. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000. Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988. I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994. H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999. F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A markov chain monte carlo approach. In NIPS-00, 2000. E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. AAAI, 1993. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000. H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001. A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000. A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall, London, 1996.
Reference: text
sentIndex sentText sentNum sentScore
1 It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. [sent-4, score-0.128]
2 In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. [sent-6, score-0.711]
3 Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. [sent-7, score-0.856]
4 Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. [sent-8, score-0.303]
5 Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. [sent-9, score-0.183]
6 Results on several citation data sets show that the method outperforms current algorithms for citation matching. [sent-10, score-0.824]
7 The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. [sent-11, score-0.642]
8 1 Such systems process a large number of scientific publications to extract their citation lists. [sent-13, score-0.412]
9 By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. [sent-14, score-0.341]
10 This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. [sent-15, score-0.443]
11 Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. [sent-17, score-0.164]
12 For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. [sent-18, score-0.326]
13 Figure 1: Two citations that probably refer to the same paper. [sent-32, score-0.326]
14 Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. [sent-35, score-0.166]
15 EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. [sent-37, score-0.106]
16 Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. [sent-38, score-0.104]
17 A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. [sent-40, score-0.165]
18 They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. [sent-41, score-0.116]
19 Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. [sent-42, score-0.16]
20 While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. [sent-43, score-0.104]
21 Citeseer is the best-known example of work on citation matching [1]. [sent-49, score-0.469]
22 The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). [sent-50, score-0.443]
23 McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). [sent-51, score-0.155]
24 In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. [sent-53, score-0.307]
25 The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. [sent-56, score-0.425]
26 We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. [sent-59, score-0.155]
27 The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. [sent-60, score-0.299]
28 For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. [sent-63, score-0.299]
29 The errors made by our system point to some interesting unmodeled aspects of the citation process. [sent-64, score-0.412]
30 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. [sent-65, score-0.123]
31 Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. [sent-66, score-0.243]
32 1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. [sent-68, score-0.109]
33 • A set I of named instances denoting objects, each an instance of one class. [sent-69, score-0.177]
34 • A set A of complex attributes denoting functional relations. [sent-70, score-0.243]
35 Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. [sent-71, score-0.148]
36 • A set B of simple attributes denoting functions. [sent-72, score-0.243]
37 Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. [sent-73, score-0.148]
38 P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . [sent-75, score-0.28]
39 • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. [sent-81, score-0.208]
40 We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. [sent-82, score-0.209]
41 We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i. [sent-83, score-0.24]
42 2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. [sent-90, score-0.598]
43 There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. [sent-91, score-0.265]
44 Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. [sent-92, score-0.534]
45 , for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. [sent-95, score-0.612]
46 ”, and set its paper attribute to a newly created Paper instance, which we will call P1 . [sent-100, score-0.12]
47 (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. [sent-107, score-0.449]
48 parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. [sent-109, score-0.242]
49 is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. [sent-114, score-0.246]
50 The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. [sent-120, score-0.511]
51 The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. [sent-121, score-0.217]
52 In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. [sent-123, score-0.283]
53 It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. [sent-124, score-0.408]
54 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. [sent-127, score-0.299]
55 Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. [sent-128, score-0.468]
56 Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. [sent-179, score-0.104]
57 We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). [sent-183, score-0.104]
58 We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. [sent-188, score-0.426]
59 0 to the one assignment where each citation object is unique. [sent-190, score-0.437]
60 • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. [sent-191, score-0.148]
61 Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. [sent-193, score-0.169]
62 With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! [sent-195, score-0.158]
63 nk When n > m, the world contains objects unreferenced by any of the terms. [sent-197, score-0.128]
64 However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term. [sent-198, score-0.423]
65 ) Therefore, we never have to create them, or worry about their attribute values. [sent-199, score-0.12]
66 Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. [sent-200, score-0.276]
67 For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. [sent-202, score-0.286]
68 Similarly, we could allow ι to depend on the attribute values—e. [sent-206, score-0.12]
69 , the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. [sent-208, score-0.299]
70 Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A. [sent-210, score-0.209]
71 We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. [sent-213, score-0.651]
72 9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. [sent-217, score-0.128]
73 parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. [sent-235, score-0.824]
74 It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. [sent-236, score-0.102]
75 First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. [sent-239, score-0.989]
76 text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between. [sent-242, score-0.311]
77 obsTitle, while its author segment must match the combined values of the simple attributes of Citation. [sent-248, score-0.453]
78 text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. [sent-251, score-0.193]
79 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. [sent-253, score-0.115]
80 π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. [sent-264, score-0.142]
81 2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. [sent-269, score-0.411]
82 Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. [sent-270, score-0.234]
83 Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. [sent-271, score-0.277]
84 The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. [sent-274, score-0.408]
85 The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u. [sent-275, score-0.186]
86 into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e. [sent-278, score-0.209]
87 , guessing the true surname of the author currently known both as “Simth” and “Smith”. [sent-280, score-0.432]
88 Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. [sent-281, score-0.154]
89 In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u. [sent-291, score-0.166]
90 Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. [sent-295, score-0.155]
91 We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. [sent-296, score-0.179]
92 (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations. [sent-306, score-0.37]
93 The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples. [sent-309, score-0.516]
94 implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. [sent-310, score-0.54]
95 (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words. [sent-311, score-0.465]
96 We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. [sent-314, score-0.299]
97 Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. [sent-316, score-0.299]
98 This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. [sent-317, score-0.366]
99 Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. [sent-318, score-0.321]
100 Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data. [sent-319, score-0.242]
wordName wordTfidf (topN-words)
[('fnames', 0.509), ('citation', 0.412), ('citations', 0.299), ('surname', 0.263), ('attributes', 0.209), ('author', 0.169), ('rpm', 0.14), ('objects', 0.128), ('citeseer', 0.122), ('attribute', 0.12), ('title', 0.117), ('rpms', 0.105), ('names', 0.096), ('papers', 0.095), ('identity', 0.09), ('proposals', 0.084), ('relational', 0.078), ('authors', 0.077), ('uncertainty', 0.074), ('chain', 0.071), ('clusters', 0.071), ('authorascited', 0.07), ('lashkari', 0.07), ('worlds', 0.07), ('bigram', 0.061), ('pasula', 0.061), ('mcmc', 0.058), ('matching', 0.057), ('publication', 0.056), ('records', 0.056), ('instances', 0.056), ('linkage', 0.053), ('metral', 0.053), ('obstitle', 0.053), ('statements', 0.053), ('russell', 0.049), ('parse', 0.047), ('corruption', 0.046), ('al', 0.044), ('instance', 0.044), ('named', 0.043), ('cited', 0.042), ('clustering', 0.04), ('segment', 0.04), ('monte', 0.039), ('dom', 0.039), ('phrases', 0.039), ('language', 0.039), ('markov', 0.037), ('record', 0.037), ('basic', 0.037), ('phrase', 0.037), ('handled', 0.036), ('text', 0.036), ('ll', 0.035), ('carlo', 0.035), ('greedy', 0.035), ('canopies', 0.035), ('cardinalities', 0.035), ('marthi', 0.035), ('milch', 0.035), ('obsauthors', 0.035), ('pubtype', 0.035), ('surnames', 0.035), ('mccallum', 0.035), ('match', 0.035), ('probability', 0.034), ('denoting', 0.034), ('metric', 0.033), ('reasoning', 0.033), ('association', 0.032), ('classes', 0.031), ('semantics', 0.031), ('agents', 0.031), ('bibliography', 0.031), ('erroneous', 0.031), ('declarative', 0.031), ('inference', 0.03), ('tracking', 0.029), ('sub', 0.028), ('parses', 0.028), ('census', 0.028), ('segmentations', 0.028), ('punctuation', 0.028), ('preprocess', 0.028), ('domain', 0.028), ('bayesian', 0.027), ('specify', 0.027), ('probably', 0.027), ('uniformity', 0.026), ('keeps', 0.026), ('acceptance', 0.026), ('abbreviated', 0.026), ('mappings', 0.026), ('assignment', 0.025), ('segments', 0.025), ('network', 0.025), ('collaborative', 0.025), ('runtime', 0.025), ('rectangles', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 107 nips-2002-Identity Uncertainty and Citation Matching
Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. In those situations, the variables that are not changed by the transition cancel in the ratio π(x )/π(x), so the required calculation can be quite simple. 4 Publication types range over {article, conference paper, book, thesis, and tech report} This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential. 5 5.2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10−6 , the lists are usually quite short. 7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out. 5.3 SCALING UP One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. Then, at runtime, our q(x |x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x ) is calculated by summing over all the canopies which contain any of the elements of the two clusters.) 6 EXPERIMENTAL RESULTS We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also 6 7 Note that if the same cluster is picked twice, it will probably be split. It would also be possible to sample directly from a model such as a hierarchical HMM Face Reinforcement Reasoning Constraint 349 citations, 242 papers 406 citations, 148 papers 514 citations, 296 papers 295 citations, 199 papers Phrase matching 94% 79% 86% 89% RPM + MCMC 97% 94% 96% 93% Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples. implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Figure 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching. We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999. I. Fellegi and A. Sunter. A theory for record linkage. In JASA, 1969. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000. Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988. I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994. H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999. F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A markov chain monte carlo approach. In NIPS-00, 2000. E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. AAAI, 1993. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000. H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001. A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000. A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall, London, 1996.
2 0.091855668 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
Author: Dan Pelleg, Andrew W. Moore
Abstract: We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance. 1
3 0.070127502 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
4 0.069401488 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
Author: Christopher Williams, Michalis K. Titsias
Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.
5 0.068159774 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
Author: Dmitry Y. Pavlov, David M. Pennock
Abstract: We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic— conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the apriori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
6 0.054019488 27 nips-2002-An Impossibility Theorem for Clustering
7 0.051642582 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
8 0.051463723 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
9 0.050143674 163 nips-2002-Prediction and Semantic Association
10 0.049656659 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
11 0.048458476 190 nips-2002-Stochastic Neighbor Embedding
12 0.048103545 41 nips-2002-Bayesian Monte Carlo
13 0.046058562 40 nips-2002-Bayesian Models of Inductive Generalization
14 0.044124693 83 nips-2002-Extracting Relevant Structures with Side Information
15 0.043133792 98 nips-2002-Going Metric: Denoising Pairwise Data
16 0.042048611 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
17 0.041399594 53 nips-2002-Clustering with the Fisher Score
18 0.041254018 90 nips-2002-Feature Selection in Mixture-Based Clustering
19 0.04108008 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
20 0.040921409 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations
topicId topicWeight
[(0, -0.133), (1, -0.03), (2, -0.028), (3, 0.034), (4, -0.053), (5, 0.048), (6, -0.055), (7, -0.088), (8, -0.028), (9, 0.023), (10, 0.021), (11, 0.027), (12, -0.013), (13, 0.037), (14, -0.075), (15, -0.045), (16, 0.015), (17, -0.029), (18, 0.043), (19, -0.041), (20, -0.046), (21, 0.029), (22, 0.006), (23, 0.095), (24, -0.041), (25, -0.099), (26, 0.016), (27, -0.008), (28, -0.057), (29, 0.081), (30, 0.044), (31, 0.108), (32, -0.007), (33, 0.132), (34, 0.0), (35, -0.109), (36, -0.031), (37, 0.02), (38, 0.025), (39, 0.032), (40, -0.026), (41, 0.053), (42, 0.074), (43, 0.047), (44, 0.03), (45, 0.055), (46, -0.096), (47, 0.028), (48, 0.01), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.91848731 107 nips-2002-Identity Uncertainty and Citation Matching
Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. In those situations, the variables that are not changed by the transition cancel in the ratio π(x )/π(x), so the required calculation can be quite simple. 4 Publication types range over {article, conference paper, book, thesis, and tech report} This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential. 5 5.2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10−6 , the lists are usually quite short. 7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out. 5.3 SCALING UP One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. Then, at runtime, our q(x |x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x ) is calculated by summing over all the canopies which contain any of the elements of the two clusters.) 6 EXPERIMENTAL RESULTS We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also 6 7 Note that if the same cluster is picked twice, it will probably be split. It would also be possible to sample directly from a model such as a hierarchical HMM Face Reinforcement Reasoning Constraint 349 citations, 242 papers 406 citations, 148 papers 514 citations, 296 papers 295 citations, 199 papers Phrase matching 94% 79% 86% 89% RPM + MCMC 97% 94% 96% 93% Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples. implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Figure 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching. We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999. I. Fellegi and A. Sunter. A theory for record linkage. In JASA, 1969. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000. Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988. I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994. H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999. F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A markov chain monte carlo approach. In NIPS-00, 2000. E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. AAAI, 1993. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000. H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001. A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000. A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall, London, 1996.
2 0.58447903 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
Author: Sergey Kirshner, Igor V. Cadez, Padhraic Smyth, Chandrika Kamath
Abstract: We describe the application of probabilistic model-based learning to the problem of automatically identifying classes of galaxies, based on both morphological and pixel intensity characteristics. The EM algorithm can be used to learn how to spatially orient a set of galaxies so that they are geometrically aligned. We augment this “ordering-model” with a mixture model on objects, and demonstrate how classes of galaxies can be learned in an unsupervised manner using a two-level EM algorithm. The resulting models provide highly accurate classi£cation of galaxies in cross-validation experiments. 1 Introduction and Background The £eld of astronomy is increasingly data-driven as new observing instruments permit the rapid collection of massive archives of sky image data. In this paper we investigate the problem of identifying bent-double radio galaxies in the FIRST (Faint Images of the Radio Sky at Twenty-cm) Survey data set [1]. FIRST produces large numbers of radio images of the deep sky using the Very Large Array at the National Radio Astronomy Observatory. It is scheduled to cover more that 10,000 square degrees of the northern and southern caps (skies). Of particular scienti£c interest to astronomers is the identi£cation and cataloging of sky objects with a “bent-double” morphology, indicating clusters of galaxies ([8], see Figure 1). Due to the very large number of observed deep-sky radio sources, (on the order of 106 so far) it is infeasible for the astronomers to label all of them manually. The data from the FIRST Survey (http://sundog.stsci.edu/) is available in both raw image format and in the form of a catalog of features that have been automatically derived from the raw images by an image analysis program [8]. Each entry corresponds to a single detectable “blob” of bright intensity relative to the sky background: these entries are called Figure 1: 4 examples of radio-source galaxy images. The two on the left are labelled as “bent-doubles” and the two on the right are not. The con£gurations on the left have more “bend” and symmetry than the two non-bent-doubles on the right. components. The “blob” of intensities for each component is £tted with an ellipse. The ellipses and intensities for each component are described by a set of estimated features such as sky position of the centers (RA (right ascension) and Dec (declination)), peak density ¤ux and integrated ¤ux, root mean square noise in pixel intensities, lengths of the major and minor axes, and the position angle of the major axis of the ellipse counterclockwise from the north. The goal is to £nd sets of components that are spatially close and that resemble a bent-double. In the results in this paper we focus on candidate sets of components that have been detected by an existing spatial clustering algorithm [3] where each set consists of three components from the catalog (three ellipses). As of the year 2000, the catalog contained over 15,000 three-component con£gurations and over 600,000 con£gurations total. The set which we use to build and evaluate our models consists of a total of 128 examples of bent-double galaxies and 22 examples of non-bent-double con£gurations. A con£guration is labelled as a bent-double if two out of three astronomers agree to label it as such. Note that the visual identi£cation process is the bottleneck in the process since it requires signi£cant time and effort from the scientists, and is subjective and error-prone, motivating the creation of automated methods for identifying bent-doubles. Three-component bent-double con£gurations typically consist of a center or “core” component and two other side components called “lobes”. Previous work on automated classi£cation of three-component candidate sets has focused on the use of decision-tree classi£ers using a variety of geometric and image intensity features [3]. One of the limitations of the decision-tree approach is its relative in¤exibility in handling uncertainty about the object being classi£ed, e.g., the identi£cation of which of the three components should be treated as the core of a candidate object. A bigger limitation is the £xed size of the feature vector. A primary motivation for the development of a probabilistic approach is to provide a framework that can handle uncertainties in a ¤exible coherent manner. 2 Learning to Match Orderings using the EM Algorithm We denote a three-component con£guration by C = (c 1 , c2 , c3 ), where the ci ’s are the components (or “blobs”) described in the previous section. Each component c x is represented as a feature vector, where the speci£c features will be de£ned later. Our approach focuses on building a probabilistic model for bent-doubles: p (C) = p (c1 , c2 , c3 ), the likelihood of the observed ci under a bent-double model where we implicitly condition (for now) on the class “bent-double.” By looking at examples of bent-double galaxies and by talking to the scientists studying them, we have been able to establish a number of potentially useful characteristics of the components, the primary one being geometric symmetry. In bent-doubles, two of the components will look close to being mirror images of one another with respect to a line through the third component. We will call mirror-image components lobe compo- core core 1 lobe 2 2 3 lobe 2 lobe 1 lobe 1 component 2 lobe 1 component 3 lobe 2 lobe 1 4 core lobe 1 5 lobe 2 lobe 2 6 core core component 1 core lobe 2 lobe 1 Figure 2: Possible orderings for a hypothetical bent-double. A good choice of ordering would be either 1 or 2. nents, and the other one the core component. It also appears that non-bent-doubles either don’t exhibit such symmetry, or the angle formed at the core component is too straight— the con£guration is not “bent” enough. Once the core component is identi£ed, we can calculate symmetry-based features. However, identifying the most plausible core component requires either an additional algorithm or human expertise. In our approach we use a probabilistic framework that averages over different possible orderings weighted by their probability given the data. In order to de£ne the features, we £rst need to determine the mapping of the components to labels “core”, “lobe 1”, and “lobe 2” (c, l1 , and l2 for short). We will call such a mapping an ordering. Figure 2 shows an example of possible orderings for a con£guration. We can number the orderings 1, . . . , 6. We can then write 6 p (C) = p (cc , cl1 , cl2 |Ω = k) p (Ω = k) , (1) k=1 i.e., a mixture over all possible orientations. Each ordering is assumed a priori to be equally 1 likely, i.e., p(Ω = k) = 6 . Intuitively, for a con£guration that clearly looks like a bentdouble the terms in the mixture corresponding to the correct ordering would dominate, while the other orderings would have much lower probability. We represent each component cx by M features (we used M = 3). Note that the features can only be calculated conditioned on a particular mapping since they rely on properties of the (assumed) core and lobe components. We denote by fmk (C) the values corresponding to the mth feature for con£guration C under the ordering Ω = k, and by f mkj (C) we denote the feature value of component j: fmk (C) = (fmk1 (C) , . . . , fmkBm (C)) (in our case, Bm = 3 is the number of components). Conditioned on a particular mapping Ω = k, where x ∈ {c, l1 , l2 } and c,l1 ,l2 are de£ned in a cyclical order, our features are de£ned as: • f1k (C) : Log-transformed angle, the angle formed at the center of the component (a vertex of the con£guration) mapped to label x; |center of x to center of next(x)| • f2k (C) : Logarithms of side ratios, center of x to center of prev(x) ; | | peak ¤ux of next(x) • f3k (C) : Logarithms of intensity ratios, peak ¤ux of prev(x) , and so (C|Ω = k) = (f1k (C) , f2k (C) f3k (C)) for a 9-dimensional feature vector in total. Other features are of course also possible. For our purposes in this paper this particular set appears to capture the more obvious visual properties of bent-double galaxies. For a set D = {d1 , . . . , dN } of con£gurations, under an i.i.d. assumption for con£gurations, we can write the likelihood as N K P (D) = P (Ωi = k) P (f1k (di ) , . . . , fM k (di )) , i=1 k=1 where Ωi is the ordering for con£guration d i . While in the general case one can model P (f1k (di ) , . . . , fM k (di )) as a full joint distribution, for the results reported in this paper we make a number of simplifying assumptions, motivated by the fact that we have relatively little labelled training data available for model building. First, we assume that the fmk (di ) are conditionally independent. Second, we are also able to reduce the number of components for each fmk (di ) by noting functional dependencies. For example, given two angles of a triangle, we can uniquely determine the third one. We also assume that the remaining components for each feature are conditionally independent. Under these assumptions the multivariate joint distribution P (f1k (di ) , . . . , fM k (di )) is factored into a product of simple distributions, which (for the purposes of this paper) we model using Gaussians. If we know for every training example which component should be mapped to label c, we can then unambiguously estimate the parameters for each of these distributions. In practice, however, the identity of the core component is unknown for each object. Thus, we use the EM algorithm to automatically estimate the parameters of the above model. We begin by randomly assigning an ordering to each object. For each subsequent iteration the E-step consists of estimating a probability distribution over possible orderings for each object, and the M-step estimates the parameters of the feature-distributions using the probabilistic ordering information from the E-step. In practice we have found that the algorithm converges relatively quickly (in 20 to 30 iterations) on both simulated and real data. It is somewhat surprising that this algorithm can reliably “learn” how to align a set of objects, without using any explicit objective function for alignment, but instead based on the fact that feature values for certain orderings exhibit a certain self-consistency relative to the model. Intuitively it is this self-consistency that leads to higher-likelihood solutions and that allows EM to effectively align the objects by maximizing the likelihood. After the model has been estimated, the likelihood of new objects can also be calculated under the model, where the likelihood now averages over all possible orderings weighted by their probability given the observed features. The problem described above is a speci£c instance of a more general feature unscrambling problem. In our case, we assume that con£gurations of three 3-dimensional components (i.e. 3 features) each are generated by some distribution. Once the objects are generated, the orders of their components are permuted or scrambled. The task is then to simultaneously learn the parameters of the original distributions and the scrambling for each object. In the more general form, each con£guration consists of L M -dimensional con£gurations. Since there are L! possible orderings of L components, the problem becomes computationally intractable if L is large. One solution is to restrict the types of possible scrambles (to cyclic shifts for example). 3 Automatic Galaxy Classi£cation We used the algorithm described in the previous section to estimate the parameters of features and orderings of the bent-double class from labelled training data and then to rank candidate objects according to their likelihood under the model. We used leave-one-out cross-validation to test the classi£cation ability of this supervised model, where for each of the 150 examples we build a model using the positive examples from the set of 149 “other” examples, and then score the “left-out” example with this model. The examples are then sorted in decreasing order by their likelihood score (averaging over different possi- 1 0.9 True positive rate 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False positive rate Figure 3: ROC plot for a model using angle, ratio of sides, and ratio of intensities, as features, and learned using ordering-EM with labelled data. ble orderings) and the results are analyzed using a receiver operating characteristic (ROC) methodology. We use AROC , the area under the curve, as a measure of goodness of the model, where a perfect model would have AROC = 1 and random performance corresponds to AROC = 0.5. The supervised model, using EM for learning ordering models, has a cross-validated AROC score of 0.9336 (Figure 3) and appears to be quite useful at detecting bent-double galaxies. 4 Model-Based Galaxy Clustering A useful technique in understanding astronomical image data is to cluster image objects based on their morphological and intensity properties. For example, consider how one might cluster the image objects in Figure 1 into clusters, where we have features on angles, intensities, and so forth. Just as with classi£cation, clustering of the objects is impeded by not knowing which of the “blobs” corresponds to the true “core” component. From a probabilistic viewpoint, clustering can be treated as introducing another level of hidden variables, namely the unknown class (or cluster) identity of each object. We can generalize the EM algorithm for orderings (Section 2) to handle this additional hidden level. The model is now a mixture of clusters where each cluster is modelled as a mixture of orderings. This leads to a more complex two-level EM algorithm than that presented in Section 2, where at the inner-level the algorithm is learning how to orient the objects, and at the outer level the algorithm is learning how to group the objects into C classes. Space does not permit a detailed presentation of this algorithm—however, the derivation is straightforward and produces intuitive update rules such as: µcmj = ˆ 1 ˆ N P (cl = c|Θ) N K P (cli = c|Ωi = k, D, Θ) P (Ωi = k|D, Θ) fmkj (di ) i=1 k=1 where µcmj is the mean for the cth cluster (1 ≤ c ≤ C), the mth feature (1 ≤ m ≤ M ), and the jth component of fmk (di ), and Ωi = k corresponds to ordering k for the ith object. We applied this algorithm to the data set of 150 sky objects, where unlike the results in Section 3, the algorithm now had no access to the class labels. We used the Gaussian conditional-independence model as before, and grouped the data into K = 2 clusters. Figures 4 and 5 show the highest likelihood objects, out of 150 total objects, under the Bent−double Bent−double Bent−double Bent−double Bent−double Bent−double Bent−double Bent−double Figure 4: The 8 objects with the highest likelihood conditioned on the model for the larger of the two clusters learned by the unsupervised algorithm. Bent−double Non−bent−double Non−bent−double Non−bent−double Non−bent−double Non−bent−double Bent−double Non−bent−double Figure 5: The 8 objects with the highest likelihood conditioned on the model for the smaller of the two clusters learned by the unsupervised algorithm. 150 Unsupervised Rank bent−doubles non−bent−doubles 100 50 0 0 50 100 150 Supervised Rank Figure 6: A scatter plot of the ranking from the unsupervised model versus that of the supervised model. models for the larger cluster and smaller cluster respectively. The larger cluster is clearly a bent-double cluster: 89 of the 150 objects are more likely to belong to this cluster under the model and 88 out of the 89 objects in this cluster have the bent-double label. In other words, the unsupervised algorithm has discovered a cluster that corresponds to “strong examples” of bent-doubles relative to the particular feature-space and model. In fact the non-bentdouble that is assigned to this group may well have been mislabelled (image not shown here). The objects in Figure 5 are clearly inconsistent with the general visual pattern of bent-doubles and this cluster consists of a mixture of non-bent-double and “weaker” bentdouble galaxies. The objects in Figures 5 that are labelled as bent-doubles seem quite atypical compared to the bent-doubles in Figure 4. A natural hypothesis is that cluster 1 (88 bent-doubles) in the unsupervised model is in fact very similar to the supervised model learned using the labelled set of 128 bent-doubles in Section 3. Indeed the parameters of the two Gaussian models agree quite closely and the similarity of the two models is illustrated clearly in Figure 6 where we plot the likelihoodbased ranks of the unsupervised model versus those of the supervised model. Both models are in close agreement and both are clearly performing well in terms of separating the objects in terms of their class labels. 5 Related Work and Future Directions A related earlier paper is Kirshner et al [6] where we presented a heuristic algorithm for solving the orientation problem for galaxies. The generalization to an EM framework in this paper is new, as is the two-level EM algorithm for clustering objects in an unsupervised manner. There is a substantial body of work in computer vision on solving a variety of different object matching problems using probabilistic techniques—see Mjolsness [7] for early ideas and Chui et al. [2] for a recent application in medical imaging. Our work here differs in a number of respects. One important difference is that we use EM to learn a model for the simultaneous correspondence of N objects, using both geometric and intensity-based features, whereas prior work in vision has primarily focused on matching one object to another (essentially the N = 2 case). An exception is the recent work of Frey and Jojic [4, 5] who used a similar EM-based approach to simultaneously cluster images and estimate a variety of local spatial deformations. The work described in this paper can be viewed as an extension and application of this general methodology to a real-world problem in galaxy classi£cation. Earlier work on bent-double galaxy classi£cation used decision tree classi£ers based on a variety of geometric and intensity-based features [3]. In future work we plan to compare the performance of this decision tree approach with the probabilistic model-based approach proposed in this paper. The model-based approach has some inherent advantages over a decision-tree model for these types of problems. For example, it can directly handle objects in the catalog with only 2 blobs or with 4 or more blobs by integrating over missing intensities and over missing correspondence information using mixture models that allow for missing or extra “blobs”. Being able to classify such con£gurations automatically is of signi£cant interest to the astronomers. Acknowledgments This work was performed under a sub-contract from the ASCI Scienti£c Data Management Project of the Lawrence Livermore National Laboratory. The work of S. Kirshner and P. Smyth was also supported by research grants from NSF (award IRI-9703120), the Jet Propulsion Laboratory, IBM Research, and Microsoft Research. I. Cadez was supported by a Microsoft Graduate Fellowship. The work of C. Kamath was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48. We gratefully acknowledge our FIRST collaborators, in particular, Robert H. Becker for sharing his expertise on the subject. References [1] R. H. Becker, R. L. White, and D. J. Helfand. The FIRST Survey: Faint Images of the Radio Sky at Twenty-cm. Astrophysical Journal, 450:559, 1995. [2] H. Chui, L. Win, R. Schultz, J. S. Duncan, and A. Rangarajan. A uni£ed feature registration method for brain mapping. In Proceedings of Information Processing in Medical Imaging, pages 300–314. Springer-Verlag, 2001. [3] I. K. Fodor, E. Cant´ -Paz, C. Kamath, and N. A. Tang. Finding bent-double radio u galaxies: A case study in data mining. In Proceedings of the Interface: Computer Science and Statistics Symposium, volume 33, 2000. [4] B. J. Frey and N. Jojic. Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999. [5] N. Jojic and B. J. Frey. Topographic transformation as a discrete latent variable. In Advances in Neural Information Processing Systems 12. MIT Press, 2000. ´ [6] S. Kirshner, I. V. Cadez, P. Smyth, C. Kamath, and E. Cantu-Paz. Probabilistic modelbased detection of bent-double radio galaxies. In Proceedings 16th International Conference on Pattern Recognition, volume 2, pages 499–502, 2002. [7] E. Mjolsness. Bayesian inference on visual grammars by neural networks that optimize. Technical Report YALEU/DCS/TR-854, Department of Computer Science, Yale University, May 1991. [8] R. L. White, R. H. Becker, D. J. Helfand, and M. D. Gregg. A catalog of 1.4 GHz radio sources from the FIRST Survey. Astrophysical Journal, 475:479, 1997.
3 0.56847632 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations
Author: Elzbieta Pekalska, David Tax, Robert Duin
Abstract: Problems in which abnormal or novel situations should be detected can be approached by describing the domain of the class of typical examples. These applications come from the areas of machine diagnostics, fault detection, illness identification or, in principle, refer to any problem where little knowledge is available outside the typical class. In this paper we explain why proximities are natural representations for domain descriptors and we propose a simple one-class classifier for dissimilarity representations. By the use of linear programming an efficient one-class description can be found, based on a small number of prototype objects. This classifier can be made (1) more robust by transforming the dissimilarities and (2) cheaper to compute by using a reduced representation set. Finally, a comparison to a comparable one-class classifier by Campbell and Bennett is given.
4 0.56413615 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
Author: Christopher Williams, Michalis K. Titsias
Abstract: We consider data which are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.
5 0.50154936 190 nips-2002-Stochastic Neighbor Embedding
Author: Geoffrey E. Hinton, Sam T. Roweis
Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.
6 0.45924914 174 nips-2002-Regularized Greedy Importance Sampling
7 0.44456029 27 nips-2002-An Impossibility Theorem for Clustering
8 0.43989274 98 nips-2002-Going Metric: Denoising Pairwise Data
9 0.41920987 41 nips-2002-Bayesian Monte Carlo
10 0.41425055 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
11 0.40712479 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
12 0.40368327 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
13 0.40027952 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
14 0.38058844 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
15 0.37767524 40 nips-2002-Bayesian Models of Inductive Generalization
16 0.35788104 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
17 0.34751141 163 nips-2002-Prediction and Semantic Association
18 0.34613553 150 nips-2002-Multiple Cause Vector Quantization
19 0.34331658 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
20 0.34108049 188 nips-2002-Stability-Based Model Selection
topicId topicWeight
[(11, 0.045), (14, 0.017), (23, 0.01), (41, 0.019), (42, 0.061), (54, 0.098), (55, 0.034), (57, 0.012), (67, 0.427), (68, 0.019), (74, 0.067), (92, 0.033), (98, 0.072)]
simIndex simValue paperId paperTitle
1 0.90162772 71 nips-2002-Dopamine Induced Bistability Enhances Signal Processing in Spiny Neurons
Author: Aaron J. Gruber, Sara A. Solla, James C. Houk
Abstract: Single unit activity in the striatum of awake monkeys shows a marked dependence on the expected reward that a behavior will elicit. We present a computational model of spiny neurons, the principal neurons of the striatum, to assess the hypothesis that direct neuromodulatory effects of dopamine through the activation of D 1 receptors mediate the reward dependency of spiny neuron activity. Dopamine release results in the amplification of key ion currents, leading to the emergence of bistability, which not only modulates the peak firing rate but also introduces a temporal and state dependence of the model's response, thus improving the detectability of temporally correlated inputs. 1
same-paper 2 0.81625861 107 nips-2002-Identity Uncertainty and Citation Matching
Author: Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser
Abstract: Identity uncertainty is a pervasive problem in real-world data analysis. It arises whenever objects are not labeled with unique identifiers or when those identifiers may not be perceived perfectly. In such cases, two observations may or may not correspond to the same object. In this paper, we consider the problem in the context of citation matching—the problem of deciding which citations correspond to the same publication. Our approach is based on the use of a relational probability model to define a generative model for the domain, including models of author and title corruption and a probabilistic citation grammar. Identity uncertainty is handled by extending standard models to incorporate probabilities over the possible mappings between terms in the language and objects in the domain. Inference is based on Markov chain Monte Carlo, augmented with specific methods for generating efficient proposals when the domain contains many objects. Results on several citation data sets show that the method outperforms current algorithms for citation matching. The declarative, relational nature of the model also means that our algorithm can determine object characteristics such as author names by combining multiple citations of multiple papers. 1 INTRODUCTION Citation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such systems process a large number of scientific publications to extract their citation lists. By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of “paper” entities linked by the “cites(p 1 , p2 )” relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of “raw” perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems. A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer—which is specifically designed to overcome such problems—currently lists more than 100 distinct AI textbooks published by Russell 1 See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex. [Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994. Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994. Figure 1: Two citations that probably refer to the same paper. and Norvig on or around 1995, from roughly 1000 citations. Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases. For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes’ rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide for a principled way to combine matched records. A richer probability model is developed by Cohen et al [3], who model the database as a combination of some “original” records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups. Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7]. Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics. Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5). With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language—an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9]. The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. Section 6 evaluates the performance of four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author’s name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process. 2 RPMs Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations. 2.1 Basic RPMs At its most basic, an RPM, as defined by Koller et al [12], consists of • A set C of classes denoting sets of objects, related by subclass/superclass relations. • A set I of named instances denoting objects, each an instance of one class. • A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C. • A set B of simple attributes denoting functions. Each simple attribute B has a domain type Dom[B] ∈ C and a range V al[B]. • A set of conditional probability models P (B|P a[B]) for the simple attributes. P a[B] is the set of B’s parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1 . · · · .An .B , where B is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed. • A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class. We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, . . . , max #[A]}. 2.2 RPMs for citations Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes, the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1 , set its text attribute to the string “Metral M. ...August 1994.”, and set its paper attribute to a newly created Paper instance, which we will call P1 . We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11 , D12 , and D13 ) to fill the P1 .obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A 11 , A12 , and A13 ) to fill both the P1 .authors[i] and the D1i .author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper. 2 ) Assuming (for now) that the value of C1 .parse 2 Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements. A11 Author A12 surname #(fnames) fnames A13 A21 D11 AuthorAsCited surname #(fnames) fnames author A22 A23 D12 D13 D21 D22 Paper D23 Citation #(authors) authors title publication type P1 P2 #(obsAuthors) obsAuthors obsTitle parse C1 C2 text paper Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes. is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11 .surname would be set to Lashkari, and D12 .fnames would be set to (Max)). The remaining basic attributes — those of the Paper and Author instances — represent the “true” attributes of those objects, and their values are unobserved. The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty. Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3. 3 IDENTITY UNCERTAINTY In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2 , there is uncertainty as to whether the corresponding papers P 1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; A11. surname D12. #(fnames) D12. surname A11. fnames D11. #(fnames) D12. fnames A21. #(fnames) A13. surname A12. fnames A21. fnames A13. fnames A13. #(fnames) D13. surname D11. fnames D11. surname D13. #(fnames) C1. #(authors) P1. title C1. text P1. pubtype C1. obsTitle A21. surname A23. surname A22. fnames D22. #(fnames) D12. surname D21. #(fnames) D22. fnames A23. fnames A23. #(fnames) D23. surname D21. fnames D13. fnames C1. parse A22. #(fnames) A22. surname A12. #(fnames) A12. surname A11. #(fnames) D23. fnames D21. surname D23. #(fnames) C2. #(authors) P2. title C2. parse C2. text C2. obsTitle P2. pubtype Figure 3: The Bayesian network equivalent to our RPM, assuming C 1 = C2 . if they are distinct, there will be two sets. Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1 ) to objects in the world. We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1 , P2 }}; if they do not co-refer, then ι is {{P1 }, {P2 }}. We define a probability model for the space of extended possible worlds by specifying the prior P (n) and the conditional distribution P (ι|n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P (ι) = C∈C P (ιC ). We then distinguish two cases: • For some classes (such as the citations themselves), the unique names assumptions remains appropriate. Thus, we define P (ιCitation ) to assign a probability of 1.0 to the one assignment where each citation object is unique. • For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P (n) using a high-variance log-normal distribution. 3 Then we make appropriate uniformity assumptions to construct P (ιC ). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, “a priori” means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable. With these assumptions, the probability of an assignment ι C,k,m that maps k named instances to m distinct objects, when C contains n objects, is given by 1 n! P (ιC,k,m ) = (n − m)! nk When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term.) Therefore, we never have to create them, or worry about their attribute values. Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values—e.g., the frequency of citations to a given paper might depend on the fame of the authors—provided we did not introduce cyclic dependencies. 4 The Probability Model We will now fill in the details of the conditional probability models. Our priors over the “true” attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string. More specifically, the values of Author.fnames and Author.surname are modeled as having a a 0.9 chance of being 3 Other models are possible; for example, in situations where objects appear and disappear, P (ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6]. drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the “observed” variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels where each letter, or word, has a small probability of being deleted, or, alternatively, changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM. Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two different components. First of all, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between.) We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file. 5 INFERENCE With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set. 5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo. 5.1 MARKOV CHAIN MONTE CARLO MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as a stationary distribution. There are several ways of building up an appropriate Markov chain. In the Metropolis– Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x |x). The probability that the move to x is actually made is the acceptance probability, defined as α(x |x) = min 1, π(x )q(x|x ) . π(x)q(x |x) Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. In those situations, the variables that are not changed by the transition cancel in the ratio π(x )/π(x), so the required calculation can be quite simple. 4 Publication types range over {article, conference paper, book, thesis, and tech report} This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential. 5 5.2 THE CITATION-MATCHING ALGORITHM The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since the ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product pf P (n), P (ι), and the probability of the hidden attributes of the network. Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then, values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ιC works as follows: Select two clusters a1 , a2 ∈ ιC 6 Create two empty clusters b1 and b2 place each instance i ∈ a1 ∪ a2 u.a.r. into b1 or b2 Propose ιC = ιC − {a1, a2} ∪ {b1, b2} Given a proposed ιC , suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as “Simth” and “Smith”. Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, careful to account for this hybrid approach in our acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10−6 , the lists are usually quite short. 7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out. 5.3 SCALING UP One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach — one where the two clusters a1 and a2 are picked u.a.r — is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances. We have implemented an approach based on the efficient clustering algorithm of McCallum et al [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies, or smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. Then, at runtime, our q(x |x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x|x ) is calculated by summing over all the canopies which contain any of the elements of the two clusters.) 6 EXPERIMENTAL RESULTS We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.) We have also 6 7 Note that if the same cluster is picked twice, it will probably be split. It would also be possible to sample directly from a model such as a hierarchical HMM Face Reinforcement Reasoning Constraint 349 citations, 242 papers 406 citations, 148 papers 514 citations, 296 papers 295 citations, 199 papers Phrase matching 94% 79% 86% 89% RPM + MCMC 97% 94% 96% 93% Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples. implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degrees to which the words and phrases of any two citations overlap. (They obtain their “phrases” by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Figure 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching. We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list, which could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors’ research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999. I. Fellegi and A. Sunter. A theory for record linkage. In JASA, 1969. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000. Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988. I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994. H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In IJCAI-99, 1999. F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A markov chain monte carlo approach. In NIPS-00, 2000. E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. AAAI, 1993. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000. H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001. A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000. A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall, London, 1996.
3 0.80917209 18 nips-2002-Adaptation and Unsupervised Learning
Author: Peter Dayan, Maneesh Sahani, Gregoire Deback
Abstract: Adaptation is a ubiquitous neural and psychological phenomenon, with a wealth of instantiations and implications. Although a basic form of plasticity, it has, bar some notable exceptions, attracted computational theory of only one main variety. In this paper, we study adaptation from the perspective of factor analysis, a paradigmatic technique of unsupervised learning. We use factor analysis to re-interpret a standard view of adaptation, and apply our new model to some recent data on adaptation in the domain of face discrimination.
4 0.70013702 142 nips-2002-Maximum Likelihood and the Information Bottleneck
Author: Noam Slonim, Yair Weiss
Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution , this method constructs a new variable that defines partitions over the values of that are informative about . Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related ? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the or problems are strongly related. In fact, for uniform input distribution over for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems, induces a solution for the other. ©§ ¥£ ¨¦¤¢
5 0.45266244 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
Author: Peter Dayan, Angela J. Yu
Abstract: Inference and adaptation in noisy and changing, rich sensory environments are rife with a variety of specific sorts of variability. Experimental and theoretical studies suggest that these different forms of variability play different behavioral, neural and computational roles, and may be reported by different (notably neuromodulatory) systems. Here, we refine our previous theory of acetylcholine’s role in cortical inference in the (oxymoronic) terms of expected uncertainty, and advocate a theory for norepinephrine in terms of unexpected uncertainty. We suggest that norepinephrine reports the radical divergence of bottom-up inputs from prevailing top-down interpretations, to influence inference and plasticity. We illustrate this proposal using an adaptive factor analysis model.
6 0.41869596 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
7 0.41427162 163 nips-2002-Prediction and Semantic Association
8 0.41224289 176 nips-2002-Replay, Repair and Consolidation
9 0.40827098 190 nips-2002-Stochastic Neighbor Embedding
10 0.40700984 199 nips-2002-Timing and Partial Observability in the Dopamine System
11 0.40375721 3 nips-2002-A Convergent Form of Approximate Policy Iteration
12 0.40371478 27 nips-2002-An Impossibility Theorem for Clustering
13 0.40352011 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain
14 0.40215728 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model
15 0.39837787 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
16 0.3983579 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
17 0.39715064 76 nips-2002-Dynamical Constraints on Computing with Spike Timing in the Cortex
18 0.39692432 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
19 0.39588404 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
20 0.39550191 205 nips-2002-Value-Directed Compression of POMDPs