nips nips2002 nips2002-192 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper presents two new formulations of multiple-instance learning as a maximum margin problem. [sent-3, score-0.117]
2 The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. [sent-4, score-0.112]
3 Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. [sent-5, score-0.371]
4 We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. [sent-6, score-0.239]
5 1 Introduction Multiple-instance learning (MIL) [4] is a generalization of supervised classification in which training class labels are associated with sets of patterns, or bags, instead of individual patterns. [sent-7, score-0.323]
6 While every pattern may possess an associated true label, it is assumed that pattern labels are only indirectly accessible through labels attached to bags. [sent-8, score-0.638]
7 The law of inheritance is such that a set receives a particular label, if at least one of the patterns in the set possesses the label. [sent-9, score-0.191]
8 In the important case of binary classification, this implies that a bag is "positive" if at least one of its member patterns is a positive example. [sent-10, score-0.665]
9 MIL differs from the general set-learning problem in that the set-level classifier is by design induced by a pattern-level classifier. [sent-11, score-0.123]
10 Hence the key challenge in MIL is to cope with the ambiguity of not knowing which of the patterns in a positive bag are the actual positive examples and which ones are not. [sent-12, score-0.841]
11 One prominent application is the classification of molecules in the context of drug design [4]. [sent-14, score-0.304]
12 Here, each molecule is represented by a bag of possible conformations. [sent-15, score-0.559]
13 The efficacy of a molecule can be tested experimentally, but there is no way to control for individual conformations. [sent-16, score-0.167]
14 A second application is in image indexing for content-based image retrieval. [sent-17, score-0.252]
15 Here, an image can be viewed as a bag of local image patches [9] or image regions. [sent-18, score-0.725]
16 Since annotating whole images is far less time-consuming than marking relevant image regions, the ability to deal with this type of weakly annotated data is very desirable. [sent-19, score-0.386]
17 Usually, documents which contain a relevant passage are considered to be relevant with respect to a particular category or topic, yet class labels are rarely available on the passage level and are most commonly associated with the document as a whole. [sent-21, score-0.556]
18 Formally, all of the above applications share the same type of label ambiguity which in our opinion makes a strong argument in favor of the relevance of the MIL setting. [sent-22, score-0.355]
19 We present two approaches to modify and extend Support Vector Machines (SVMs) to deal with MIL problems. [sent-23, score-0.036]
20 The first approach explicitly treats the pattern labels as unobserved integer variables, subjected to constraints defined by the (positive) bag labels. [sent-24, score-0.912]
21 The goal then is to maximize the usual pattern margin, or soft-margin, jointly over hidden label variables and a linear (or kernelized) discriminant function. [sent-25, score-0.327]
22 The second approach generalizes the notion of a margin to bags and aims at maximizing the bag margin directly. [sent-26, score-1.062]
23 The latter seems most appropriate in cases where we mainly care about classifying new test bags, while the first approach seems preferable whenever the goal is to derive an accurate pattern-level classifier. [sent-27, score-0.169]
24 In the case of singleton bags, both methods are identical and reduce to the standard soft-margin SVM formulation. [sent-28, score-0.046]
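Rows 19-24 describe the two formulations only verbally. As a concrete formalization consistent with that description (a sketch in the notation introduced in rows 29-37 below, not quoted from the paper), the pattern-level formulation is a mixed integer quadratic program that optimizes a soft-margin hyperplane jointly over the hidden pattern labels, with the bag labels entering only through constraints on those labels:

```latex
% Sketch of the pattern-level (first) formulation as a mixed integer QP;
% notation as in Section 2 below, label constraints as in Eq. (1).
\begin{aligned}
\min_{\{y_i\}}\;\min_{w,\,b,\,\xi}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2} + C \sum_i \xi_i \\
\text{s.t.}\quad
  & y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0,\qquad y_i \in \{-1,+1\}, \\
  & y_i = -1 \;\;\text{for all } i \in I \text{ with } Y_I = -1,
    \qquad \sum_{i \in I}\tfrac{y_i + 1}{2} \ge 1 \;\;\text{for all } I \text{ with } Y_I = 1.
\end{aligned}
```

For the second formulation, a natural way to read "maximizing the bag margin directly" is one slack variable per bag together with the constraint $Y_I \,\max_{i \in I}\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_I$. With singleton bags both versions visibly collapse to the standard soft-margin SVM, as noted in row 24.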
25 Previous approaches (e.g., [8, 12]) have focused on specially tailored machine learning algorithms that do not compare favorably in the limiting case of the standard classification setting. [sent-33, score-0.304]
26 More recently, a kernel-based approach has been suggested which derives MI-kernels on bags from a given kernel defined on the pattern-level [5]. [sent-35, score-0.444]
27 While the MI-kernel approach treats the MIL problem merely as a representational problem, we strongly believe that a deeper conceptual modification of SVMs as outlined in this paper is necessary. [sent-36, score-0.229]
28 However, we share the ultimate goal with [5], which is to make state-of-the-art kernel-based classification methods available for multiple-instance learning. [sent-37, score-0.308]
29 2 Multiple-Instance Learning In statistical pattern recognition, it is usually assumed that a training set of labeled patterns is available, where each pair $(x_i, y_i) \in \mathbb{R}^d \times \mathcal{Y}$ has been generated independently from an unknown distribution. [sent-38, score-0.232]
30 Multiple-instance learning (MIL) generalizes this problem by making significantly weaker assumptions about the labeling information. [sent-44, score-0.147]
31 Patterns are grouped into bags and a label is attached to each bag and not to every pattern. [sent-45, score-1.115]
32 More formally, given is a set of input patterns $x_1, \ldots, x_n$ grouped into bags $B_I = \{x_i : i \in I\}$ for given index sets $I$, together with a label $Y_I \in \{-1, 1\}$ attached to each bag. [sent-46, score-0.119]
33 These labels are interpreted in the following way: if $Y_I = -1$, then $y_i = -1$ for all $i \in I$, i.e., the bag contains no positive example. [sent-57, score-0.133]
34 If, on the other hand, $Y_I = 1$, then at least one pattern $x_i \in B_I$ is a positive example of the underlying concept. [sent-60, score-0.151]
35 Notice that the information provided by the label is asymmetric in the sense that a negative bag label induces a unique label for every pattern in a bag, while a positive label does not. [sent-61, score-1.303]
36 In general, the relation between pattern labels $y_i$ and bag labels $Y_I$ can be expressed compactly as $Y_I = \max_{i \in I} y_i$, or alternatively as the set of linear constraints $\sum_{i \in I} \frac{y_i + 1}{2} \ge 1$ for all $I$ such that $Y_I = 1$. (1) [sent-62, score-0.819]
37 Finally, let us call a discriminant function $f : \mathcal{X} \to \mathbb{R}$ MI-separating with respect to a multiple-instance data set if $\operatorname{sgn} \max_{i \in I} f(x_i) = Y_I$ holds for all bags $B_I$. [sent-67, score-0.048] [sent-68, score-0.046]
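The compact relation in row 36, constraint (1), and the MI-separation criterion in row 37 suggest a simple optimization scheme for the pattern-level formulation, which the abstract above only says is "solved heuristically". The sketch below is one plausible alternating heuristic under that reading, not the paper's exact procedure: pattern labels inside positive bags are treated as hidden variables, re-imputed from the current classifier, and forced to satisfy constraint (1); a bag is then classified by $\operatorname{sgn} \max_{i \in I} f(x_i)$. The helper names (`mi_style_svm`, `predict_bags`) and the use of scikit-learn's `SVC` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC  # any soft-margin SVM trainer would do here

def mi_style_svm(bags, bag_labels, kernel="rbf", C=1.0, max_iter=20):
    """Alternating heuristic for the pattern-level formulation (a sketch).

    bags: list of (n_i, d) arrays of patterns; bag_labels: array of +1/-1.
    Patterns in negative bags keep the label -1; labels of patterns in
    positive bags are hidden variables, re-imputed from the classifier and
    constrained so that each positive bag has at least one positive pattern.
    """
    X = np.vstack(bags)
    idx = np.cumsum([0] + [len(b) for b in bags])          # bag boundaries in X
    y = np.concatenate([np.full(len(b), Y) for b, Y in zip(bags, bag_labels)])

    for _ in range(max_iter):
        clf = SVC(kernel=kernel, C=C).fit(X, y)            # train on current labels
        f = clf.decision_function(X)
        y_new = y.copy()
        for I, Y_I in enumerate(bag_labels):
            if Y_I != 1:
                continue                                    # negative bags: y_i stays -1
            lo, hi = idx[I], idx[I + 1]
            y_new[lo:hi] = np.where(f[lo:hi] >= 0, 1, -1)   # impute hidden labels
            if np.all(y_new[lo:hi] == -1):
                # enforce constraint (1): at least one positive per positive bag
                y_new[lo + np.argmax(f[lo:hi])] = 1
        if np.array_equal(y_new, y):
            break                                           # imputed labels stable
        y = y_new
    return clf

def predict_bags(clf, bags):
    """A bag is positive iff its maximally scoring pattern is positive,
    i.e. the MI-separation criterion sgn max_{i in I} f(x_i)."""
    return np.array([1 if clf.decision_function(b).max() >= 0 else -1
                     for b in bags])
```

In the singleton-bag case the imputation step never changes a label, so the loop reduces to a single standard SVM fit, which matches the observation in row 24.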
wordName wordTfidf (topN-words)
[('mil', 0.506), ('bag', 0.443), ('bags', 0.369), ('yi', 0.264), ('label', 0.161), ('classification', 0.152), ('labels', 0.133), ('patterns', 0.119), ('maxiei', 0.116), ('molecule', 0.116), ('image', 0.094), ('passage', 0.092), ('treats', 0.086), ('margin', 0.083), ('pattern', 0.078), ('attached', 0.074), ('positive', 0.073), ('bi', 0.07), ('ambiguity', 0.068), ('grouped', 0.068), ('indexing', 0.064), ('classifier', 0.062), ('svms', 0.061), ('xi', 0.052), ('generalizes', 0.051), ('tailored', 0.051), ('efficacy', 0.051), ('annotating', 0.051), ('ioannis', 0.051), ('opinion', 0.051), ('integer', 0.05), ('document', 0.048), ('discriminant', 0.048), ('singleton', 0.046), ('subjected', 0.046), ('sgn', 0.046), ('share', 0.046), ('defined', 0.044), ('relevant', 0.043), ('marking', 0.043), ('stuart', 0.043), ('molecules', 0.043), ('modification', 0.043), ('annotated', 0.043), ('drug', 0.043), ('vi', 0.043), ('formally', 0.041), ('multi', 0.041), ('possesses', 0.041), ('providence', 0.041), ('deeper', 0.041), ('goal', 0.04), ('consuming', 0.039), ('specially', 0.039), ('svm', 0.038), ('associated', 0.038), ('accessible', 0.037), ('kernelized', 0.037), ('weakly', 0.037), ('deal', 0.036), ('possess', 0.035), ('notable', 0.035), ('ultimate', 0.035), ('available', 0.035), ('formulations', 0.034), ('prominent', 0.034), ('labeling', 0.034), ('asymmetric', 0.034), ('automated', 0.033), ('significantly', 0.033), ('aims', 0.033), ('knowing', 0.033), ('preferable', 0.033), ('design', 0.032), ('machines', 0.032), ('compactly', 0.032), ('favorably', 0.032), ('indirectly', 0.032), ('rarely', 0.032), ('brown', 0.032), ('cope', 0.032), ('care', 0.032), ('unobserved', 0.032), ('dominated', 0.032), ('seems', 0.032), ('induces', 0.031), ('law', 0.031), ('mixed', 0.031), ('derives', 0.031), ('programs', 0.031), ('induce', 0.031), ('limiting', 0.03), ('outlined', 0.03), ('exception', 0.03), ('member', 0.03), ('merely', 0.029), ('weaker', 0.029), ('differs', 0.029), ('topic', 0.029), ('favor', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
2 0.22266807 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
3 0.12706454 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
4 0.099851221 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
5 0.089077316 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee
Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.
6 0.086252399 120 nips-2002-Kernel Design Using Boosting
7 0.078545935 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
8 0.07177119 119 nips-2002-Kernel Dependency Estimation
9 0.070021264 161 nips-2002-PAC-Bayes & Margins
10 0.065990187 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
11 0.06546393 45 nips-2002-Boosted Dyadic Kernel Discriminants
12 0.062720299 165 nips-2002-Ranking with Large Margin Principle: Two Approaches
13 0.062650926 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
14 0.062403917 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
15 0.061330285 140 nips-2002-Margin Analysis of the LVQ Algorithm
16 0.060541183 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
17 0.059213512 10 nips-2002-A Model for Learning Variance Components of Natural Images
18 0.05778411 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.055204537 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
20 0.054966342 39 nips-2002-Bayesian Image Super-Resolution
topicId topicWeight
[(0, -0.152), (1, -0.083), (2, 0.053), (3, -0.004), (4, -0.022), (5, 0.015), (6, 0.061), (7, -0.021), (8, 0.043), (9, -0.115), (10, -0.03), (11, -0.101), (12, -0.044), (13, -0.109), (14, -0.11), (15, 0.002), (16, 0.069), (17, 0.105), (18, 0.138), (19, 0.157), (20, 0.054), (21, -0.033), (22, 0.099), (23, -0.07), (24, 0.21), (25, -0.167), (26, 0.046), (27, -0.115), (28, -0.101), (29, -0.148), (30, 0.015), (31, -0.069), (32, -0.016), (33, -0.07), (34, 0.036), (35, -0.067), (36, 0.003), (37, -0.171), (38, -0.17), (39, -0.087), (40, -0.131), (41, -0.048), (42, -0.081), (43, 0.099), (44, -0.006), (45, -0.03), (46, -0.078), (47, -0.073), (48, -0.045), (49, 0.13)]
simIndex simValue paperId paperTitle
same-paper 1 0.97713476 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
2 0.78743857 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
3 0.42649361 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
4 0.4096514 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik
Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1
5 0.39136279 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
6 0.38700736 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
7 0.34198397 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
8 0.30802915 125 nips-2002-Learning Semantic Similarity
9 0.28049433 6 nips-2002-A Formulation for Minimax Probability Machine Regression
10 0.2771897 120 nips-2002-Kernel Design Using Boosting
11 0.26483101 176 nips-2002-Replay, Repair and Consolidation
12 0.26310813 119 nips-2002-Kernel Dependency Estimation
13 0.25819641 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
14 0.25781205 188 nips-2002-Stability-Based Model Selection
15 0.25527677 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
16 0.24522126 163 nips-2002-Prediction and Semantic Association
17 0.24290137 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
18 0.23564874 194 nips-2002-The Decision List Machine
19 0.22978486 85 nips-2002-Fast Kernels for String and Tree Matching
20 0.225108 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
topicId topicWeight
[(23, 0.016), (42, 0.076), (54, 0.133), (55, 0.03), (57, 0.023), (67, 0.01), (68, 0.016), (74, 0.126), (92, 0.079), (94, 0.294), (98, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.81973535 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
2 0.72721803 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
Author: Elad Schneidman, William Bialek, Michael Ii
Abstract: A population of neurons typically exhibits a broad diversity of responses to sensory inputs. The intuitive notion of functional classification is that cells can be clustered so that most of the diversity is captured by the identity of the clusters rather than by individuals within clusters. We show how this intuition can be made precise using information theory, without any need to introduce a metric on the space of stimuli or responses. Applied to the retinal ganglion cells of the salamander, this approach recovers classical results, but also provides clear evidence for subclasses beyond those identified previously. Further, we find that each of the ganglion cells is functionally unique, and that even within the same subclass only a few spikes are needed to reliably distinguish between cells. 1
3 0.59347534 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1
4 0.59206378 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently, the Fisher score (or the Fisher kernel) has increasingly been used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score, showing that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. We therefore develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
5 0.59162974 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
6 0.5916214 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
7 0.59148329 27 nips-2002-An Impossibility Theorem for Clustering
8 0.59058326 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
9 0.58970553 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
10 0.58874184 152 nips-2002-Nash Propagation for Loopy Graphical Games
11 0.58792132 124 nips-2002-Learning Graphical Models with Mercer Kernels
12 0.58755076 3 nips-2002-A Convergent Form of Approximate Policy Iteration
13 0.58694685 74 nips-2002-Dynamic Structure Super-Resolution
14 0.58590025 89 nips-2002-Feature Selection by Maximum Marginal Diversity
15 0.5857237 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
16 0.5856725 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
17 0.58458686 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
18 0.58320653 135 nips-2002-Learning with Multiple Labels
19 0.5811578 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
20 0.58100688 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting