nips nips2002 nips2002-37 knowledge-graph by maker-knowledge-mining

37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond


Source: pdf

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.

1 Automatic Derivation of Statistical Algorithms

Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way.

The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data, and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally:
1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures).
2. The existence of common representations (i.e., graphical models [3, 13] and program schemas).
3. The formalization of schema applicability constraints as guards.

The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance.
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas.

2 Combining Schema-based Synthesis and Bayesian Networks

Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The listing below shows the specification of the mixture of Gaussians example used throughout this paper (see footnote 2). Note the constraint that the sum of the class probabilities must equal one (line 8), along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.

     1  model mog as ’Mixture of Gaussians’;
     2  const int n_points as ’nr. of data points’
     3    with 0 < n_points;
     4  const int n_classes := 3 as ’nr. classes’
     5    with 0 < n_classes
     6    with n_classes << n_points;
     7  double phi(1..n_classes) as ’weights’
     8    with 1 = sum(I := 1..n_classes, phi(I));
     9  double mu(1..n_classes);
     9  double sigma(1..n_classes);
    10  int c(1..n_points) as ’class labels’;
    11  c ~ disc(vec(I := 1..n_classes, phi(I)));
    12  data double x(1..n_points) as ’data’;
    13  x(I) ~ gauss(mu(c(I)), sigma(c(I)));
    14  max pr(x | phi, mu, sigma) wrt phi, mu, sigma;

Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal.

Footnote 1: Schema guards vary widely; for example, compare the Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which requires both Jacobian and Hessian), and EM and its variational extension [6] (which require a latent-variable structure model).

Footnote 2: Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d.

Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules, which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.
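
For orientation, the objective in line 14 of the specification above is the marginal likelihood of the data, obtained by summing out the class labels c. The restatement below uses standard mixture-of-Gaussians notation rather than the paper's own symbols, with N = n_points and K = n_classes; sigma_k denotes the spread parameter passed to gauss, without committing to standard deviation versus variance:

    p(x \mid \phi, \mu, \sigma) \;=\; \prod_{i=1}^{N} \; \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x_i ; \mu_k, \sigma_k)
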
Schema-based synthesis. The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM.

First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example.

[Figure: the initial Bayesian network for the running example, with parameter nodes phi, mu, and sigma, the discrete class variable c, and the Gaussian data variable x, drawn inside plates of size n_classes and n_points.]

Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework, which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria.

Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node:

Lemma 1. Let two sets of variables over a Bayesian network be given. Then certain conditions on descendents and parents hold in the corresponding dependency graph iff the conditional probability of one set given the other factors into the product of the component probabilities at the individual nodes.
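
The independence tests just mentioned rest on ancestral sets: a set of nodes together with everything from which it can be reached along directed edges. The sketch below is an illustration only; the graph encoding, function name, and use of strings are invented here and are not AUTO BAYES's internal representation.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Illustration only, not AUTO BAYES internals: a DAG stored as a
    // child -> parents map, and the ancestral set of a query set of nodes.
    using Graph = std::map<std::string, std::vector<std::string>>;

    std::set<std::string> ancestralSet(const Graph& parentsOf,
                                       const std::set<std::string>& query) {
        std::set<std::string> result(query);                    // the set itself ...
        std::vector<std::string> stack(query.begin(), query.end());
        while (!stack.empty()) {                                 // ... plus all ancestors
            std::string v = stack.back();
            stack.pop_back();
            auto it = parentsOf.find(v);
            if (it == parentsOf.end()) continue;                 // root node, no parents
            for (const std::string& p : it->second)
                if (result.insert(p).second)                     // newly discovered ancestor
                    stack.push_back(p);
        }
        return result;
    }

    int main() {
        // The running example's network: c depends on phi; x depends on c, mu, sigma.
        Graph parentsOf = {{"c", {"phi"}}, {"x", {"c", "mu", "sigma"}}};
        for (const std::string& v : ancestralSet(parentsOf, {"x"}))
            std::cout << v << ' ';                               // prints: c mu phi sigma x
        std::cout << '\n';
    }

On the running example this returns {c, mu, phi, sigma, x} for the query set {x}; the set-separation test then works with such ancestral sets rather than with d-separation directly.
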

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This enables the systematic derivation of algorithms customized for different models. [sent-11, score-0.082]

2 We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. [sent-14, score-0.258]

3 We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. [sent-16, score-0.493]

4 It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. [sent-18, score-0.239]

5 The system in this paper answers the algorithm question given that the user has chosen a model for the data, and continues through to implementation. [sent-29, score-0.06]

6 (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). [sent-34, score-0.079]

7 The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. [sent-42, score-0.074]

8 Externally, AUTO BAYES has the look and feel of a compiler. [sent-46, score-0.079]

9 Users specify their model of interest in a high-level specification language (as opposed to a programming language). [sent-48, score-0.079]

10 Note the constraint that the sum of the class probabilities must equal one (line 8) along with others (lines 3 and 5) that make optimization of the model well-defined. [sent-63, score-0.06]

11 Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. [sent-66, score-0.327]

12 The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters phi, mu, and sigma. [sent-67, score-0.155]

13 1 model mog as ’Mixture of Gaussians’; Computational logic and theorem proving. [sent-69, score-0.105]

14 Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. [sent-70, score-0.105]

15 In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. [sent-72, score-0.076]

16 Schema guards vary widely; for example, compare the Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which requires both Jacobian and Hessian), and EM and its variational extension [6] (which require a latent-variable structure model). [sent-73, score-0.083]

17 The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. [sent-81, score-0.081]

18 General symbolic differentiation and expression simplification are capabilities fundamental to our approach. [sent-82, score-0.21]

19 AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task. [sent-83, score-0.066]
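
To make rewrite-based symbolic manipulation concrete, here is a toy, hand-written sketch in C++ (the paper's generated-code target). The single rule, the expression encoding, and the names are invented for illustration; this is not AUTO BAYES's engine.

    #include <iostream>
    #include <memory>
    #include <string>

    // Expression tree: leaves are symbols, inner nodes are "*", "+", or "log".
    struct Expr {
        std::string op;             // "sym", "*", "+", "log"
        std::string name;           // only used when op == "sym"
        std::shared_ptr<Expr> a, b; // children (b unused for "log")
    };
    using P = std::shared_ptr<Expr>;

    P sym(const std::string& n) { return std::make_shared<Expr>(Expr{"sym", n, nullptr, nullptr}); }
    P node(const std::string& op, P a, P b = nullptr) { return std::make_shared<Expr>(Expr{op, "", a, b}); }

    // Single rewrite rule, applied bottom-up: log(A * B) -> log(A) + log(B)
    P rewrite(P e) {
        if (!e || e->op == "sym") return e;
        e->a = rewrite(e->a);
        e->b = rewrite(e->b);
        if (e->op == "log" && e->a->op == "*")
            return node("+", node("log", e->a->a), node("log", e->a->b));
        return e;
    }

    std::string show(const P& e) {
        if (e->op == "sym") return e->name;
        if (e->op == "log") return "log(" + show(e->a) + ")";
        return "(" + show(e->a) + " " + e->op + " " + show(e->b) + ")";
    }

    int main() {
        P e = node("log", node("*", sym("phi_k"), sym("gauss_k")));
        std::cout << show(rewrite(e)) << '\n';   // prints (log(phi_k) + log(gauss_k))
    }

A real engine carries many such rules (for simplification, differentiation, and so on) and applies them to a fixed point.
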

20 The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. [sent-85, score-0.081]

21 To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. [sent-86, score-0.104]

22 A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. [sent-88, score-0.733]
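
A schema can be pictured as a guarded code template: a precondition that must hold for the schema to apply, plus a generic fragment that is instantiated when it does. The toy sketch below uses invented types, guards, and fragments and is not the AUTO BAYES representation; it only shows the shape of that pairing.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy sketch: a schema pairs an applicability precondition (guard)
    // with a generic code fragment to instantiate.
    struct Problem { bool hasLatentVariable; bool closedFormAvailable; };

    struct Schema {
        std::string name;
        std::function<bool(const Problem&)> guard;                // precondition
        std::function<std::string(const Problem&)> instantiate;   // code template
    };

    int main() {
        std::vector<Schema> schemas = {
            {"em",
             [](const Problem& p) { return p.hasLatentVariable; },
             [](const Problem&) { return std::string("while (!converged) { /* M-step */ /* E-step */ }"); }},
            {"closed-form",
             [](const Problem& p) { return p.closedFormAvailable && !p.hasLatentVariable; },
             [](const Problem&) { return std::string("theta = solve_symbolically(data);"); }},
        };
        Problem mixtureOfGaussians{true, false};
        for (const Schema& s : schemas)
            if (s.guard(mixtureOfGaussians))                      // guard decides applicability
                std::cout << s.name << ": " << s.instantiate(mixtureOfGaussians) << '\n';
    }
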

23 The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. [sent-89, score-0.164]

24 A first-order logic representation of Bayesian networks was developed by Haddawy [7]. [sent-91, score-0.069]

25 Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]. [sent-98, score-0.072]

26 Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. [sent-107, score-0.177]

27 Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: Lemma 1. [sent-108, score-0.114]

28 Then the descendents and parents conditions hold in the corresponding dependency graph iff the conditional probability factors into a product of the component probabilities at the individual nodes. [sent-110, score-0.222]


30 How can probabilities not satisfying these conditions be converted to symbolic expressions? [sent-114, score-0.222]

31 While many general schemes for inference on networks exist, our principal hurdle is the need to perform this over symbolic expressions incorporating real and integer variables from disparate real or infinite-discrete distributions. [sent-115, score-0.231]

32 descendents holds and ancestors is independent of given iff there exists a set of variables such that Lemma 1 holds if we replace by . [sent-124, score-0.179]

33 Moreover, the unique minimal set satisfying these conditions is given by the ancestors. [sent-125, score-0.446]

34 while Lemma 3 lets us evaluate a probability by a summation and a ratio. [sent-132, score-0.091]

35 Since the lemmas also show minimality of the sets, they also give the minimal conditions under which a probability can be evaluated by discrete summation without integration. [sent-135, score-0.085]

36 These inference lemmas are operationalized as network decomposition schemas. [sent-136, score-0.162]

37 Internally, our system uses three conceptually different levels of representation. [sent-139, score-0.06]

38 They are processed via methods for Bayesian network decomposition or match with core algorithms such as EM. [sent-141, score-0.196]

39 Formulae are introduced when probabilities of the form of a variable conditioned on its parents are detected, either in the initial network, or after the application of network decompositions. [sent-142, score-0.173]

40 General probabilities are decomposed into sums and products of the respective atomic probabilities. [sent-146, score-0.145]

41 Formulae are ready for immediate optimization using symbolic or numeric methods but sometimes they can be decomposed further into independent subproblems. [sent-147, score-0.255]

42 Finally, we use imperative intermediate code as the lowest level to represent both program fragments within the schemas as well as the completely constructed programs. [sent-148, score-0.484]

43 Decomposition of a problem into independent subproblems is always done. [sent-152, score-0.098]

44 Decomposition of probabilities is driven by the Bayesian network; we have a separate system for handling decomposition of formulae. [sent-153, score-0.188]

45 , the problem “optimize for ” is transformed into a for-loop over subproblems “optimize for . [sent-156, score-0.098]

46 The statistical algorithm schemas currently implemented include EM, k-means, and discrete model selection. [sent-164, score-0.201]

47 Adding a Gibbs sampling schema would yield functionality comparable to that of BUGS [14]. [sent-165, score-0.2]

48 Usually, the schemas require a particular form of the probabilities involved; they are thus tightly coupled to the decomposition and simplification transformations. [sent-166, score-0.285]

49 For example, EM is a way of dealing with the situation where Lemma 2 applies but where the hidden variable is indexed identically to the data. [sent-167, score-0.08]

50 From the intermediate code, code in a particular target language may be generated. [sent-169, score-0.246]

51 During this code-generation phase, most of the vector and matrix expressions are converted into forloops, and various code optimizations are performed which are impossible for a standard compiler. [sent-171, score-0.317]
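
As an illustration of this phase (hand-written, not AUTO BAYES output), a summation expression over the data, such as the mixture model's log-likelihood, would be lowered to nested for-loops in the C++ target roughly as follows; the function name and example data are invented.

    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // How an expression like
    //   sum(I := 1..n_points, log(sum(K := 1..n_classes,
    //        phi(K) * gauss(x(I), mu(K), sigma(K)))))
    // ends up as nested loops in the generated target language.
    double logLikelihood(const std::vector<double>& x,
                         const std::vector<double>& phi,
                         const std::vector<double>& mu,
                         const std::vector<double>& sigma) {
        const double PI = 3.14159265358979323846;
        double ll = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {       // outer sum over data points
            double mix = 0.0;
            for (std::size_t k = 0; k < phi.size(); ++k) { // inner sum over classes
                double d = (x[i] - mu[k]) / sigma[k];
                mix += phi[k] * std::exp(-0.5 * d * d) / (sigma[k] * std::sqrt(2.0 * PI));
            }
            ll += std::log(mix);
        }
        return ll;
    }

    int main() {
        std::vector<double> x = {-2.1, -1.8, 0.2, 3.1};
        std::vector<double> phi = {0.5, 0.3, 0.2}, mu = {-2.0, 0.0, 3.0}, sigma = {0.5, 1.0, 0.8};
        std::cout << logLikelihood(x, phi, mu, sigma) << '\n';
    }
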

52 Our tool not only generates efficient code, but also highly readable, documented programs: model- and algorithm-specific comments are generated automatically during the synthesis phase. [sent-172, score-0.141]

53 A generated HTML software design document with navigation capabilities facilitates code understanding and reading. [sent-175, score-0.324]

54 AUTO BAYES also automatically generates a program for sampling from the specified model, so that closed-loop testing with synthetic data of the assumed distributions can be done. [sent-176, score-0.119]
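
A sampler for the mixture-of-Gaussians specification needs little more than a discrete draw for the class label followed by a Gaussian draw for the data point. The sketch below is hand-written for illustration, with fixed example parameters; it is not the generated program.

    #include <iostream>
    #include <random>
    #include <vector>

    // Illustrative sampler for the running example: class weights phi,
    // per-class mu and sigma; not AUTO BAYES output.
    int main() {
        std::vector<double> phi   = {0.5, 0.3, 0.2};
        std::vector<double> mu    = {-2.0, 0.0, 3.0};
        std::vector<double> sigma = {0.5, 1.0, 0.8};
        std::mt19937 rng(42);
        std::discrete_distribution<int> pickClass(phi.begin(), phi.end());
        const int n_points = 10;
        for (int i = 0; i < n_points; ++i) {
            int c = pickClass(rng);                                   // c ~ disc(phi)
            std::normal_distribution<double> gauss(mu[c], sigma[c]);  // x ~ gauss(mu(c), sigma(c))
            std::cout << c << " " << gauss(rng) << "\n";
        }
    }
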

55 From the model, the underlying Bayesian network is derived and represented internally as a directed graph. [sent-183, score-0.109]

56 The system attempts to decompose the optimization goal into independent parts, but finds that it cannot. [sent-187, score-0.103]

57 However, it then finds that the probability in the initial optimization statement matches the conditions of Lemma 2 and that the network describes a latent variable model. [sent-188, score-0.093]

58 System invokes the abstract EM schema. [sent-190, score-0.327]

59 The syntactic structure of the current subproblem must match the first argument of the schema (a while-converging loop whose body maximizes Pr wrt the parameters in the M-step and calculates Pr in the E-step); if additional applicability constraints (not shown here) hold, this schema is executed. [sent-195, score-0.622]

60 It constructs a piece of code which is returned in the variable . [sent-196, score-0.201]

61 This code fragment can contain recursive calls to other schemas, which return code for subproblems that is then inserted into the schema; an example is converging, a generic convergence criterion here imposed over the variables. [sent-197, score-0.693]

62 Note that the schema actually implements an ME-algorithm (i.e., the M-step is performed before the E-step). [sent-198, score-0.2]
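
Putting these pieces together for the running example, the instantiated schema amounts to the familiar EM iteration for a one-dimensional mixture of Gaussians, written here by hand as a compact sketch in the M-then-E order noted above. It is not AUTO BAYES's generated C++; the data, initialization, and fixed iteration count are ad hoc.

    #include <cmath>
    #include <iostream>
    #include <random>
    #include <vector>

    // Hand-written sketch of the (M,E) iteration for a 1D mixture of Gaussians
    // with parameters phi, mu, sigma and responsibilities q[i][k].
    int main() {
        const int K = 3, iters = 50;
        // toy data drawn from three Gaussians
        std::vector<double> x;
        std::mt19937 rng(0);
        std::normal_distribution<double> g1(-2, 0.5), g2(0, 1.0), g3(3, 0.8);
        for (int i = 0; i < 100; ++i) { x.push_back(g1(rng)); x.push_back(g2(rng)); x.push_back(g3(rng)); }
        const int N = (int)x.size();

        // start from random normalized responsibilities (initial "E" values)
        std::vector<std::vector<double>> q(N, std::vector<double>(K));
        std::uniform_real_distribution<double> u(0.1, 1.0);
        for (int i = 0; i < N; ++i) {
            double s = 0;
            for (int k = 0; k < K; ++k) { q[i][k] = u(rng); s += q[i][k]; }
            for (int k = 0; k < K; ++k) q[i][k] /= s;
        }

        std::vector<double> phi(K), mu(K), sigma(K);
        const double PI = 3.14159265358979323846;
        for (int it = 0; it < iters; ++it) {
            // M-step: closed-form maximizers given the responsibilities
            for (int k = 0; k < K; ++k) {
                double nk = 0, sx = 0;
                for (int i = 0; i < N; ++i) { nk += q[i][k]; sx += q[i][k] * x[i]; }
                phi[k] = nk / N;
                mu[k]  = sx / nk;
                double sv = 0;
                for (int i = 0; i < N; ++i) sv += q[i][k] * (x[i] - mu[k]) * (x[i] - mu[k]);
                sigma[k] = std::sqrt(sv / nk);
            }
            // E-step: q[i][k] = Pr(c_i = k | x_i, phi, mu, sigma)
            for (int i = 0; i < N; ++i) {
                double s = 0;
                for (int k = 0; k < K; ++k) {
                    double d = (x[i] - mu[k]) / sigma[k];
                    q[i][k] = phi[k] * std::exp(-0.5 * d * d) / (sigma[k] * std::sqrt(2 * PI));
                    s += q[i][k];
                }
                for (int k = 0; k < K; ++k) q[i][k] /= s;
            }
        }
        for (int k = 0; k < K; ++k)
            std::cout << "class " << k << ": phi=" << phi[k] << " mu=" << mu[k] << " sigma=" << sigma[k] << "\n";
    }
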

63 This gives us the partial program shown in the internal pseudocode. [sent-215, score-0.081]


65 AUTO BAYES is recursively called with the new goal Pr wrt . [sent-225, score-0.269]

66 Now, the Bayesian network decomposition schema applies, revealing that the complete-data probability factors into a part depending only on mu and sigma and a part depending only on phi, so the optimization problem can be decomposed into two optimization subproblems: max Pr(x | c, mu, sigma) wrt mu, sigma and max Pr(c | phi) wrt phi. [sent-226, score-1.762]
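
In standard notation for the running example (with the latent class labels c made explicit; this restatement is added for orientation and is not quoted from the paper), the split corresponds to the factorization

    \Pr(x, c \mid \phi, \mu, \sigma) \;=\; \Pr(x \mid c, \mu, \sigma)\,\Pr(c \mid \phi)

so the maximization over (phi, mu, sigma) separates into a maximization over (mu, sigma) of the first factor and a maximization over phi of the second.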


68 The first subgoal from the decomposition schema, max Pr(x | c, mu, sigma) wrt mu, sigma, can be unrolled over the independent and identically distributed vector x using an index decomposition schema which moves expressions out of loops (sums or products) when they are not dependent on the loop index. [sent-243, score-0.914]

69 Since x and c are co-indexed, unrolling proceeds over both (also independent and identically distributed) vectors in parallel. [sent-244, score-0.365]

70 Because the strictly monotone log function can first be applied to the objective function of the maximization, the product over data points becomes a sum of log terms. [sent-251, score-0.327]

71 Another application of index decomposition allows solution for the two scalars. [sent-252, score-0.071]

72 Gaussian elimination is then used to solve this subproblem analytically, yielding the sequence of solution expressions. [sent-253, score-0.111]
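
In standard EM notation (writing q_{ik} for the current expected membership of point i in class k; this is textbook algebra added for orientation, not quoted from the paper), the solution expressions for the running example are the familiar weighted averages

    \mu_k \;=\; \frac{\sum_i q_{ik}\, x_i}{\sum_i q_{ik}}, \qquad
    \sigma_k^2 \;=\; \frac{\sum_i q_{ik}\,(x_i - \mu_k)^2}{\sum_i q_{ik}}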

73 The second subgoal, max Pr(c | phi) wrt phi, can be unrolled over the i.i.d. vector c. [sent-256, score-0.314]

74 This in turn results in two subproblems, one for a single instance and one for the multiplier, which are both solved symbolically. [sent-261, score-0.135]
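
The multiplier referred to here is the Lagrange multiplier for the constraint in line 8 of the specification (the weights must sum to one). In the same notation as above, and again as an orientation rather than a quotation, the subproblem and its symbolic solution are

    \max_{\phi}\; \sum_k \Big(\sum_i q_{ik}\Big) \log \phi_k
    \;\;\text{subject to}\;\; \sum_k \phi_k = 1
    \;\;\Longrightarrow\;\; \phi_k = \frac{1}{N}\sum_i q_{ik}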

75 During the synthesis process, AUTO BAYES accumulates a number of constraints which have to hold to ensure proper operation of the code (e. [sent-266, score-0.269]

76 For such constraints, AUTO BAYES automatically inserts run-time checks into the code. [sent-271, score-0.077]

77 Thus, optimizations well beyond the capability of a regular compiler can be done. [sent-275, score-0.145]

78 System translates pseudocode to real code in the desired language. [sent-277, score-0.201]

79 Finally, AUTO BAYES converts the intermediate code into code of the desired target system. [sent-278, score-0.447]

80 The source code contains thorough comments detailing the mathematics implemented. [sent-279, score-0.236]

81 A regular compiler containing generic performance optimizations not repeated by AUTO BAYES turns the code into an executable program. [sent-280, score-0.385]

82 A program for sampling from a mixture of Gaussians is also produced for testing purposes. [sent-281, score-0.135]

83 5 Range of Capabilities Here, we discuss 18 examples which have been successfully handled by AUTO BAYES, ranging from simple textbook examples to sophisticated EM models and recent multinomial versions of PCA. [sent-282, score-0.182]

84 For each entry, the table below gives a brief description, the number of lines of the specification and synthesized C++ code (loc), and the runtime to generate the code (in secs.). [sent-283, score-0.402]

85 Simple textbook examples, like a Gaussian with a simple prior, a Gaussian with an inverse gamma prior, or a Gaussian with a conjugate prior, have closed-form solutions. [sent-288, score-0.13]

86 The symbolic system of AUTO BAYES can actually find these solutions and thus generate short and efficient code. [sent-289, score-0.225]

87 A wide range of Gaussian mixture models can be handled by AUTO BAYES, ranging from the simple 1D and 2D (diagonal covariance) cases, and versions with (conjugate) priors on the mean, to 1D models for multi-dimensional classes or variance. [sent-299, score-0.089]

88 Finally, there is a two-level hierarchical mixture-of-Gaussians model which is solved by a nested instantiation of EM [15]. [sent-307, score-0.108]

89 We represented regression with Gaussian error and Legendre polynomials with full conjugate priors allowing smoothing [10]. [sent-311, score-0.087]

90 Two versions of this were then done: robust linear regression replaces the Gaussian error with a mixture of two Gaussians (one broad, one peaked) both centered at zero. [sent-312, score-0.145]

91 Trajectory clustering replaces the single regression curve by a mixture of several curves [5]. [sent-313, score-0.145]

92 AUTO BAYES currently lacks variational support, yet it manages to combine a k-means style outer loop on the component proportions with an EM-style inner loop on the hidden counts, producing the original algorithm of Hofmann, Lee and Seung, and others [4]. [sent-317, score-0.206]

93 The table's entries include a Gaussian Bayes classifier, Gaussian mixtures (2D with diagonal covariance, 1D with priors), an exponential mixture, a Cauchy/Poisson mixture, multinomial PCA, and k-means. [sent-326, score-0.225]

94 Code libraries are common in statistics and learning, but they lack the high level of automation achievable only by deep symbolic reasoning. [sent-340, score-0.165]

95 The Bayes Net Toolbox [12] is a Matlab library which allows users to program in models but does not derive algorithms or generate code. [sent-341, score-0.156]

96 The BUGS system [14] also allows users to program in models but is specialized for Gibbs sampling. [sent-342, score-0.221]

97 Even with only the few elements implemented so far, we showed that algorithms approaching research-level results [4, 5, 10, 15] can be automatically derived. [sent-353, score-0.076]

98 As more distributions, optimization methods and generalized learning algorithms are added to the system, an exponentially growing number of complex new algorithms become possible, including non-trivial variants which may challenge any single researcher’s particular algorithm design expertise. [sent-354, score-0.157]

99 We have already begun work on generalizing the EM schema to continuous hidden variables, as well as adding schemas for variational methods, fast kd-tree and N-body algorithms, MCMC, and temporal models. [sent-357, score-0.399]

100 BUGS: A program to perform Bayesian inference using Gibbs sampling. [sent-455, score-0.081]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('auto', 0.563), ('wrt', 0.233), ('bayes', 0.204), ('code', 0.201), ('schema', 0.2), ('symbolic', 0.165), ('schemas', 0.157), ('pr', 0.155), ('em', 0.123), ('wc', 0.117), ('subproblems', 0.098), ('bayesian', 0.097), ('max', 0.094), ('descendents', 0.09), ('ancestors', 0.089), ('textbook', 0.089), ('wx', 0.086), ('program', 0.081), ('lemma', 0.079), ('mix', 0.075), ('gauss', 0.072), ('decomposition', 0.071), ('logic', 0.069), ('synthesis', 0.068), ('loc', 0.068), ('phi', 0.068), ('expressions', 0.066), ('parents', 0.066), ('system', 0.06), ('double', 0.06), ('loop', 0.06), ('bugs', 0.059), ('compiler', 0.059), ('internally', 0.059), ('multinomial', 0.058), ('probabilities', 0.057), ('nested', 0.054), ('formulae', 0.054), ('mixture', 0.054), ('optimizations', 0.05), ('toolbox', 0.05), ('applicability', 0.05), ('network', 0.05), ('gaussian', 0.048), ('lets', 0.047), ('decomposed', 0.047), ('speci', 0.046), ('regression', 0.046), ('matlab', 0.045), ('cation', 0.045), ('buntine', 0.045), ('functor', 0.045), ('preconditions', 0.045), ('researcher', 0.045), ('sigma', 0.045), ('subgoal', 0.045), ('subproblem', 0.045), ('vec', 0.045), ('optimize', 0.045), ('replaces', 0.045), ('capabilities', 0.045), ('intermediate', 0.045), ('summation', 0.044), ('derivation', 0.044), ('currently', 0.044), ('specialized', 0.043), ('int', 0.043), ('optimization', 0.043), ('indexed', 0.042), ('variational', 0.042), ('conjugate', 0.041), ('gibbs', 0.041), ('converging', 0.041), ('lemmas', 0.041), ('atomic', 0.041), ('software', 0.04), ('prover', 0.039), ('mathematica', 0.039), ('assertions', 0.039), ('checks', 0.039), ('disc', 0.039), ('executable', 0.039), ('identically', 0.038), ('automatically', 0.038), ('algorithms', 0.038), ('design', 0.038), ('users', 0.037), ('slight', 0.037), ('multiplier', 0.037), ('core', 0.037), ('theorem', 0.036), ('generic', 0.036), ('recursively', 0.036), ('unrolled', 0.036), ('axioms', 0.036), ('const', 0.036), ('beyond', 0.036), ('gaussians', 0.036), ('ranging', 0.035), ('comments', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray


2 0.15244241 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.

3 0.09788426 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

4 0.097139537 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

Author: Harald Steck, Tommi S. Jaakkola

Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as

5 0.093110263 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

Author: Ron Meir, Tong Zhang

Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting. 1 Introduction and Motivation The standard approach to Computational Learning Theory is usually formulated within the so-called frequentist approach to Statistics. Within this paradigm one is interested in constructing an estimator, based on a finite sample, which possesses a small loss (generalization error). While many algorithms have been constructed and analyzed within this context, it is not clear how these approaches relate to standard optimality criteria within the frequentist framework. Two classic optimality criteria within the latter approach are the minimax and admissibility criteria, which characterize optimality of estimators in a rigorous and precise fashion [9]. Except in some special cases [12], it is not known whether any of the approaches used within the Learning community lead to optimality in either of the above senses of the word. On the other hand, it is known that under certain regularity conditions, Bayesian estimators lead to either minimax or admissible estimators, and thus to well-defined optimality in the classical (frequentist) sense. In fact, it can be shown that Bayes estimators are essentially the only estimators which can achieve optimality in the above senses [9]. This optimality feature provides strong motivation for the study of Bayesian approaches in a frequentist setting. While Bayesian approaches have been widely studied, there have not been generally applicable bounds in the frequentist framework. Recently, several approaches have attempted to address this problem. In this paper we establish finite sample datadependent bounds for Bayesian mixture methods, which together with the above optimality properties suggest that these approaches should become more widely used. Consider the problem of supervised learning where we attempt to construct an estimator based on a finite sample of pairs of examples S = {(x1 , y1 ), . . . , (xn , yn )}, each drawn independently according to an unknown distribution µ(x, y). Let A be a learning algorithm which, based on the sample S, constructs a hypothesis (estimator) h from some set of hypotheses H. Denoting by (y, h(x)) the instantaneous loss of the hypothesis h, we wish to assess the true loss L(h) = Eµ (y, h(x)) where the expectation is taken with respect to µ. In particular, the objective is to provide data-dependent bounds of the following form. For any h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ, L(h) ≤ Λ(h, S) + ∆(h, S, δ), (1) where Λ(h, S) is some empirical assessment of the true loss, and ∆(h, S, δ) is a complexity term. 
For example, in the classic Vapnik-Chervonenkis framework, Λ(h, S) n is the empirical error (1/n) i=1 (yi , h(xi )) and ∆(h, S, δ) depends on the VCdimension of H but is independent of both the hypothesis h and the sample S. By algorithm and data-dependent bounds we mean bounds where the complexity term depends on both the hypothesis (chosen by the algorithm A) and the sample S. 2 A Decision Theoretic Bayesian Framework Consider a decision theoretic setting where we define the sample dependent loss of an algorithm A by R(µ, A, S) = Eµ (y, A(x, S)). Let θµ be the optimal predictor for y, namely the function minimizing Eµ { (y, φ(x))} over φ. It is clear that the best algorithm A (Bayes algorithm) is the one that always return θµ , assuming µ is known. We are interested in the expected loss of an algorithm averaged over samples S: R(µ, A) = ES R(µ, A, S) = R(µ, A, S)dµ(S), where the expectation is taken with respect to the sample S drawn i.i.d. from the probability measure µ. If we consider a family of measures µ, which possesses some underlying prior distribution π(µ), then we can construct the averaged risk function with respect to the prior as, r(π, A) = Eπ R(µ, A) = where dπ(µ|S) = dµ(S)dπ(µ) dµ(S)dπ(µ) dµ(S)dπ(µ) R(µ, A, S)dπ(µ|S), is the posterior distribution on the µ family, which µ induces a posterior distribution on the sample space as πS = Eπ(µ|S) µ. An algorithm minimizing the Bayes risk r(π, A) is referred to as a Bayes algorithm. In fact, for a given prior, and a given sample S, the optimal algorithm should return the Bayes optimal predictor with respect to the posterior measure πS . For many important practical problems, the optimal Bayes predictor is a linear functional of the underlying probability measure. For example, if the loss function is quadratic, namely (y, A(x)) = (y −A(x))2 , then the optimal Bayes predictor θµ (x) is the conditional mean of y, namely Eµ [y|x]. For binary classification problems, we can let the predictor be the conditional probability θµ (x) = µ(y = 1|x) (the optimal classification decision rule then corresponds to a test of whether θµ (x) > 0.5), which is also a linear functional of µ. Clearly if the Bayes predictor is a linear functional of the probability measure, then the optimal Bayes algorithm with respect to the prior π is given by AB (x, S) = θµ (x)dπ(µ|S) = µ θ (x)dµ(S)dπ(µ) µ µ µ dµ(S)dπ(µ) . (2) In this case, an optimal Bayesian algorithm can be regarded as the predictor constructed by averaging over all predictors with respect to a data-dependent posterior π(µ|S). We refer to such methods as Bayesian mixture methods. While the Bayes estimator AB (x, S) is optimal with respect to the Bayes risk r(π, A), it can be shown, that under appropriate conditions (and an appropriate prior) it is also a minimax and admissible estimator [9]. In general, θµ is unknown. Rather we may have some prior information about possible models for θµ . In view of (2) we consider a hypothesis space H, and an algorithm based on a mixture of hypotheses h ∈ H. This should be contrasted with classical approaches where an algorithm selects a single hypothesis h form H. For simplicity, we consider a countable hypothesis space H = {h1 , h2 , . . .}; the general case will be deferred to the full paper. Let q = {qj }∞ be a probability j=1 vector, namely qj ≥ 0 and j qj = 1, and construct the composite predictor by fq (x) = j qj hj (x). Observe that in general fq (x) may be a great deal more complex that any single hypothesis hj . 
For example, if hj (x) are non-polynomial ridge functions, the composite predictor f corresponds to a two-layer neural network with universal approximation power. We denote by Q the probability distribution defined by q, namely j qj hj = Eh∼Q h. A main feature of this work is the establishment of data-dependent bounds on L(Eh∼Q h), the loss of the Bayes mixture algorithm. There has been a flurry of recent activity concerning data-dependent bounds (a non-exhaustive list includes [2, 3, 5, 11, 13]). In a related vein, McAllester [7] provided a data-dependent bound for the so-called Gibbs algorithm, which selects a hypothesis at random from H based on the posterior distribution π(h|S). Essentially, this result provides a bound on the average error Eh∼Q L(h) rather than a bound on the error of the averaged hypothesis. Later, Langford et al. [6] extended this result to a mixture of classifiers using a margin-based loss function. A more general result can also be obtained using the covering number approach described in [14]. Finally, Herbrich and Graepel [4] showed that under certain conditions the bounds for the Gibbs classifier can be extended to a Bayesian mixture classifier. However, their bound contained an explicit dependence on the dimension (see Thm. 3 in [4]). Although the approach pioneered by McAllester came to be known as PAC-Bayes, this term is somewhat misleading since an optimal Bayesian method (in the decision theoretic framework outline above) does not average over loss functions but rather over hypotheses. In this regard, the learning behavior of a true Bayesian method is not addressed in the PAC-Bayes analysis. In this paper, we would like to narrow the discrepancy by analyzing Bayesian mixture methods, where we consider a predictor that is the average of a family of predictors with respect to a data-dependent posterior distribution. Bayesian mixtures can often be regarded as a good approximation to a true optimal Bayesian method. In fact, we have shown above that they are equivalent for many important practical problems. Therefore the main contribution of the present work is the extension of the above mentioned results in PAC-Bayes analysis to a rather unified setting for Bayesian mixture methods, where different regularization criteria may be incorporated, and their effect on the performance easily assessed. Furthermore, it is also essential that the bounds obtained are dimension-independent, since otherwise they yield useless results when applied to kernel-based methods, which often map the input space into a space of very high dimensionality. Similar results can also be obtained using the covering number analysis in [14]. However the approach presented in the current paper, which relies on the direct computation of the Rademacher complexity, is more direct and gives better bounds. The analysis is also easier to generalize than the corresponding covering number approach. Moreover, our analysis applies directly to other non-Bayesian mixture approaches such as Bagging and Boosting. Before moving to the derivation of our bounds, we formalize our approach. Consider a countable hypothesis space H = {hj }∞ , and a probability distribution {qj } over j=1 ∞ H. Introduce the vector notation k=1 qk hk (x) = q h(x). A learning algorithm within the Bayesian mixture framework uses the sample S to select a distribution Q over H and then constructs a mixture hypothesis fq (x) = q h(x). 
In order to constrain the class of mixtures used in constructing the mixture q h we impose constraints on the mixture vector q. Let g(q) be a non-negative convex function of q and define for any positive A, ΩA = {q ∈ S : g(q) ≤ A} ; FA = fq : fq (x) = q h(x) : q ∈ ΩA , (3) where S denotes the probability simplex. In subsequent sections we will consider different choices for g(q), which essentially acts as a regularization term. Finally, for any mixture q h we define the loss by L(q h) = Eµ (y, (q h)(x)) and the n ˆ empirical loss incurred on the sample by L(q h) = (1/n) i=1 (yi , (q h)(xi )). 3 A Mixture Algorithm with an Entropic Constraint In this section we consider an entropic constraint, which penalizes weights deviating significantly from some prior probability distribution ν = {νj }∞ , which may j=1 incorporate our prior information about he problem. The weights q themselves are chosen by the algorithm based on the data. In particular, in this section we set g(q) to be the Kullback-Leibler divergence of q from ν, g(q) = D(q ν) ; qj log(qj /νj ). D(q ν) = j Let F be a class of real-valued functions, and denote by σi independent Bernoulli random variables assuming the values ±1 with equal probability. We define the data-dependent Rademacher complexity of F as 1 ˆ Rn (F) = Eσ sup n f ∈F n σi f (xi ) |S . i=1 ˆ The expectation of Rn (F) with respect to S will be denoted by Rn (F). We note ˆ n (F) is concentrated around its mean value Rn (F) (e.g., Thm. 8 in [1]). We that R quote a slightly adapted result from [5]. Theorem 1 (Adapted from Theorem 1 in [5]) Let {x1 , x2 , . . . , xn } ∈ X be a sequence of points generated independently at random according to a probability distribution P , and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that φ◦f is uniformly bounded by a constant M . Then for all f ∈ F with probability at least 1 − δ Eφ(f (x)) − 1 n n φ(f (xi )) ≤ 4κRn (F) + M i=1 log(1/δ) . 2n An immediate consequence of Theorem 1 is the following. Lemma 3.1 Let the loss function be bounded by M , and assume that it is Lipschitz with constant κ. Then for all q ∈ ΩA with probability at least 1 − δ log(1/δ) . 2n ˆ L(q h) ≤ L(q h) + 4κRn (FA ) + M Next, we bound the empirical Rademacher average of FA using g(q) = D(q ν). Lemma 3.2 The empirical Rademacher complexity of FA is upper bounded as follows: 2A n ˆ Rn (FA ) ≤ 1 n sup j n hj (xi )2 . i=1 Proof: We first recall a few facts from the theory of convex duality [10]. Let p(u) be a convex function over a domain U , and set its dual s(z) = supu∈U u z − p(u) . It is known that s(z) is also convex. Setting u = q and p(q) = j qj log(qj /νj ) we find that s(v) = log j νj ezj . From the definition of s(z) it follows that for any q ∈ S, q z≤ qj log(qj /νj ) + log ν j ez j . j j Since z is arbitrary, we set z = (λ/n) i σi h(xi ) and conclude that for q ∈ ΩA and any λ > 0   n   1 λ 1 sup A + log νj exp σi q h(xi ) ≤ σi hj (xi ) .  n i=1 λ n i q∈ΩA j Taking the expectation with respect to σ, and using 2 Eσ {exp ( i σi ai )} ≤ exp i ai /2 , we have that  λ 1 ˆ νj exp A + Eσ log σi hj (xi ) Rn (FA ) ≤ λ n j ≤ ≤ = i 1 λ A + sup log Eσ exp 1 λ A + sup log exp j j A λ + 2 sup λ 2n j λ2 n2 λ n i σi hj (xi ) the Chernoff bound    (Jensen) i hj (xi )2 2 (Chernoff) hj (xi )2 . i Minimizing the r.h.s. with respect to λ, we obtain the desired result. Combining Lemmas 3.1 and 3.2 yields our basic bound, where κ and M are defined in Lemma 3.1. 
Theorem 2 Let S = {(x1 , y1 ), . . . , (xn , yn )} be a sample of i.i.d. points each drawn according to a distribution µ(x, y). Let H be a countable hypothesis class, and set FA to be the class defined in (3) with g(q) = D(q ν). Set ∆H = (1/n)Eµ supj 1−δ n i=1 hj (xi )2 1/2 . Then for any q ∈ ΩA with probability at least ˆ L(q h) ≤ L(q h) + 4κ∆H 2A +M n log(1/δ) . 2n Note that if hj are uniformly bounded, hj ≤ c, then ∆H ≤ c. Theorem 2 holds for a fixed value of A. Using the so-called multiple testing Lemma (e.g. [11]) we obtain: Corollary 3.1 Let the assumptions of Theorem 2 hold, and let {Ai , pi } be a set of positive numbers such that i pi = 1. Then for all Ai and q ∈ ΩAi with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + 4κ∆H 2Ai +M n log(1/pi δ) . 2n Note that the only distinction with Theorem 2 is the extra factor of log pi which is the price paid for the uniformity of the bound. Finally, we present a data-dependent bound of the form (1). Theorem 3 Let the assumptions of Theorem 2 hold. Then for all q ∈ S with probability at least 1 − δ, ˆ L(q h) ≤ L(q h) + max(κ∆H , M ) × 130D(q ν) + log(1/δ) . n (4) Proof sketch Pick Ai = 2i and pi = 1/i(i + 1), i = 1, 2, . . . (note that i pi = 1). For each q, let i(q) be the smallest index for which Ai(q) ≥ D(q ν) implying that log(1/pi(q) ) ≤ 2 log log2 (4D(q ν)). A few lines of algebra, to be presented in the full paper, yield the desired result. The results of Theorem 3 can be compared to those derived by McAllester [8] for the randomized Gibbs procedure. In the latter case, the first term on the r.h.s. is ˆ Eh∼Q L(h), namely the average empirical error of the base classifiers h. In our case ˆ the corresponding term is L(Eh∼Q h), namely the empirical error of the average hypothesis. Since Eh∼Q h is potentially much more complex than any single h ∈ H, we expect that the empirical term in (4) is much smaller than the corresponding term in [8]. Moreover, the complexity term we obtain is in fact tighter than the corresponding term in [8] by a logarithmic factor in n (although the logarithmic factor in [8] could probably be eliminated). We thus expect that Bayesian mixture approach advocated here leads to better performance guarantees. Finally, we comment that Theorem 3 can be used to obtain so-called oracle inequalities. In particular, let q∗ be the optimal distribution minimizing L(q h), which can only be computed if the underlying distribution µ(x, y) is known. Consider an ˆ algorithm which, based only on the data, selects a distribution q by minimizing the r.h.s. of (4), with the implicit constants appropriately specified. Then, using standard approaches (e.g. [2]) we can obtain a bound on L(ˆ h) − L(q∗ h). For q lack of space, we defer the derivation of the precise bound to the full paper. 4 General Data-Dependent Bounds for Bayesian Mixtures The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions g(q). Some possible choices for g(q) besides the Kullback-Leibler divergence are the standard Lp norms q p . In order to proceed along the lines of Section 3, we let s(z) be the convex function associated with g(q), namely s(z) = supq∈ΩA q z − g(q) . Repeating n 1 the arguments of Section 3 we have for any λ > 0 that n i=1 σi q h(xi ) ≤ 1 λ i σi h(xi ) , which implies that λ A+s n 1 ˆ Rn (FA ) ≤ inf λ≥0 λ λ n A + Eσ s σi h(xi ) . 
4 General Data-Dependent Bounds for Bayesian Mixtures

The Kullback-Leibler divergence is but one way to incorporate prior information. In this section we extend the results to general convex regularization functions $g(q)$. Some possible choices for $g(q)$ besides the Kullback-Leibler divergence are the standard $L_p$ norms $\|q\|_p$.

In order to proceed along the lines of Section 3, we let $s(z)$ be the convex function associated with $g(q)$, namely $s(z) = \sup_{q \in \Omega_A} [q^\top z - g(q)]$. Repeating the arguments of Section 3 we have for any $\lambda > 0$ that $\frac{1}{n} \sum_{i=1}^n \sigma_i\, q \cdot h(x_i) \le \frac{1}{\lambda} \big[ A + s\big( \frac{\lambda}{n} \sum_i \sigma_i h(x_i) \big) \big]$, which implies that

$$\hat{R}_n(\mathcal{F}_A) \le \inf_{\lambda \ge 0} \frac{1}{\lambda} \Big[ A + E_\sigma\, s\Big( \frac{\lambda}{n} \sum_i \sigma_i h(x_i) \Big) \Big]. \qquad (5)$$

Assume that $s(z)$ is second order differentiable, and that for any $h = \sum_{i=1}^n \sigma_i h(x_i)$

$$\tfrac{1}{2}\big( s(h + \Delta h) + s(h - \Delta h) \big) - s(h) \le u(\Delta h).$$

Then, assuming that $s(0) = 0$, it is easy to show by induction that

$$E_\sigma\, s\Big( \frac{\lambda}{n} \sum_{i=1}^n \sigma_i h(x_i) \Big) \le \sum_{i=1}^n u\big( (\lambda/n)\, h(x_i) \big). \qquad (6)$$

In the remainder of the section we focus on the case of regularization based on the $L_p$ norm. Consider $p$ and $q$ such that $1/q + 1/p = 1$, $p \in (1, \infty)$, and let $p' = \max(p, 2)$ and $q' = \min(q, 2)$. Note that if $p \le 2$ then $q \ge 2$, $q' = p' = 2$, and if $p > 2$ then $q < 2$, $q' = q$, $p' = p$. Consider $p$-norm regularization $g(q) = \frac{1}{p} \|q\|_p^p$, in which case $s(z) = \frac{1}{q} \|z\|_q^q$. The Rademacher averaging result for $p$-norm regularization is known in the geometric theory of Banach spaces (type structure of the Banach space), and it also follows from Khintchine's inequality. We show that it can be easily obtained in our framework. In this case, it is easy to see that $s(z) = \frac{1}{q'} \|z\|_{q'}^{q'}$ implies $u(h(x)) \le \frac{q'-1}{q'} \|h(x)\|_{q'}^{q'}$. Substituting in (5) we have

$$\hat{R}_n(\mathcal{F}_A) \le \inf_{\lambda \ge 0} \frac{1}{\lambda} \Big[ A + \frac{q'-1}{q'} \Big( \frac{\lambda}{n} \Big)^{q'} \sum_{i=1}^n \|h(x_i)\|_{q'}^{q'} \Big] = C_{q'}\, \frac{A^{1/p'}}{n^{1/p'}} \Big( \frac{1}{n} \sum_{i=1}^n \|h(x_i)\|_{q'}^{q'} \Big)^{1/q'},$$

where $C_{q'}$ is a constant that depends only on $q'$, obtained from the minimization over $\lambda$. Combining this result with the methods described in Section 3, we establish a bound for regularization based on the $L_p$ norm. Assume that $\|h(x_i)\|_{q'}$ is finite for all $i$, and set $\Delta_{\mathcal{H},q'} = E \big[ \big( \frac{1}{n} \sum_{i=1}^n \|h(x_i)\|_{q'}^{q'} \big)^{1/q'} \big]$.

Theorem 4 Let the conditions of Theorem 3 hold and set $g(q) = \frac{1}{p} \|q\|_p^p$, $p \in (1, \infty)$. Then for all $q \in \mathcal{S}$, with probability at least $1 - \delta$,

$$L(q \cdot h) \le \hat{L}(q \cdot h) + \max(\kappa \Delta_{\mathcal{H},q'}, M) \times O\!\left( \frac{\|q\|_p}{n^{1/p'}} + \sqrt{\frac{\log\log(\|q\|_p + 3) + \log(1/\delta)}{n}} \right),$$

where $O(\cdot)$ hides a universal constant that depends only on $p$.
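The substitution $s(z) = \frac{1}{q}\|z\|_q^q$ used above is the classical (unconstrained) Fenchel conjugate of $u \mapsto \frac{1}{p}\|u\|_p^p$ for conjugate exponents $1/p + 1/q = 1$, attained at $u_j = \mathrm{sign}(z_j)|z_j|^{q-1}$, and it upper-bounds the constrained supremum appearing in (5). The sketch below is not from the paper; the test vector $z$ and the exponent $p$ are arbitrary, and it simply checks the conjugacy identity numerically.

    import numpy as np

    rng = np.random.default_rng(1)
    p = 3.0
    q = p / (p - 1.0)                       # conjugate exponent, 1/p + 1/q = 1
    z = rng.normal(size=6)                  # arbitrary test vector

    def objective(u, z, p):
        """u . z - (1/p) * ||u||_p^p, the function whose supremum over u defines s(z)."""
        return float(u @ z - (np.abs(u) ** p).sum() / p)

    s_closed = (np.abs(z) ** q).sum() / q            # claimed conjugate: (1/q) ||z||_q^q
    u_star = np.sign(z) * np.abs(z) ** (q - 1.0)     # stationary point of the concave objective

    # The closed form matches the objective at u_star and dominates random candidates.
    assert abs(objective(u_star, z, p) - s_closed) < 1e-10
    candidates = rng.normal(scale=2.0, size=(20000, z.size))
    assert max(objective(u, z, p) for u in candidates) <= s_closed + 1e-10
    print(f"(1/q)||z||_q^q = {s_closed:.6f} equals sup_u [u.z - (1/p)||u||_p^p]")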
5 Discussion

We have introduced and analyzed a class of regularized Bayesian mixture approaches, which construct complex composite estimators by combining hypotheses from some underlying hypothesis class using data-dependent weights. Such weighted averaging approaches have been used extensively within the Bayesian framework, as well as in more recent approaches such as Bagging and Boosting. While Bayesian methods are known, under favorable conditions, to lead to optimal estimators in a frequentist setting, their performance in agnostic settings, where no reliable assumptions can be made concerning the data generating mechanism, has not been well understood. Our data-dependent bounds allow the utilization of Bayesian mixture models in general settings, while at the same time taking advantage of the benefits of the Bayesian approach in terms of incorporation of prior knowledge. The bounds established, being independent of the cardinality of the underlying hypothesis space, can be directly applied to kernel based methods.

Acknowledgments

We thank Shimon Benjo for helpful discussions. The research of R.M. is partially supported by the fund for promotion of research at the Technion and by the Ollendorff foundation of the Electrical Engineering department at the Technion.

References

[1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 224-240, 2001.
[2] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
[4] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: why SVMs work. In Advances in Neural Information Processing Systems 13, pages 224-230, Cambridge, MA, 2001. MIT Press.
[5] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 2002.
[6] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 290-297, 2001.
[7] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, New York, 1998. ACM Press.
[8] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, New York, 1999. ACM Press.
[9] C. P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994.
[10] R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[11] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44:1926-1940, 1998.
[12] Y. Yang. Minimax nonparametric classification - part I: rates of convergence. IEEE Trans. Inf. Theory, 45(7):2271-2284, 1999.
[13] T. Zhang. Generalization performance of some learning problems in Hilbert functional space. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2001. MIT Press.
[14] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.

6 0.08176998 42 nips-2002-Bias-Optimal Incremental Problem Solving

7 0.079900056 110 nips-2002-Incremental Gaussian Processes

8 0.078542501 142 nips-2002-Maximum Likelihood and the Information Bottleneck

9 0.077779233 90 nips-2002-Feature Selection in Mixture-Based Clustering

10 0.077334106 124 nips-2002-Learning Graphical Models with Mercer Kernels

11 0.07728482 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

12 0.076683633 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

13 0.072938398 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables

14 0.064865202 10 nips-2002-A Model for Learning Variance Components of Natural Images

15 0.062811702 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

16 0.062543795 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

17 0.061091036 135 nips-2002-Learning with Multiple Labels

18 0.060803469 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

19 0.059340861 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

20 0.05726802 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, -0.056), (2, -0.045), (3, 0.032), (4, 0.008), (5, 0.092), (6, -0.117), (7, -0.016), (8, -0.02), (9, 0.001), (10, 0.073), (11, -0.072), (12, 0.071), (13, -0.06), (14, -0.057), (15, -0.031), (16, 0.045), (17, 0.029), (18, 0.015), (19, -0.053), (20, 0.027), (21, -0.066), (22, -0.105), (23, 0.024), (24, -0.007), (25, -0.059), (26, 0.004), (27, 0.02), (28, 0.064), (29, 0.166), (30, -0.061), (31, -0.0), (32, 0.107), (33, -0.095), (34, -0.081), (35, 0.075), (36, 0.047), (37, 0.141), (38, 0.115), (39, -0.031), (40, -0.161), (41, -0.026), (42, 0.006), (43, 0.031), (44, -0.169), (45, 0.032), (46, 0.047), (47, 0.095), (48, 0.099), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94730854 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas.

2 Combining Schema-based Synthesis and Bayesian Networks

Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The listing below shows the specification of the mixture of Gaussians example used throughout this paper (Footnote 2). Note the constraint that the sum of the class probabilities must equal one (line 8) along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters $\phi$, $\mu$, and $\sigma$. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.

     1 model mog as 'Mixture of Gaussians';
     2 const int n_points as 'nr. of data points'
     3   with 0 < n_points;
     4 const int n_classes := 3 as 'nr. classes'
     5   with 0 < n_classes
     6   with n_classes << n_points;
     7 double phi(1..n_classes) as 'weights'
     8   with 1 = sum(I := 1..n_classes, phi(I));
     9 double mu(1..n_classes); double sigma(1..n_classes);
    10 int c(1..n_points) as 'class labels';
    11 c ~ disc(vec(I := 1..n_classes, phi(I)));
    12 data double x(1..n_points) as 'data';
    13 x(I) ~ gauss(mu(c(I)), sigma(c(I)));
    14 max pr(x | phi, mu, sigma) wrt phi, mu, sigma;

Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal.

Footnote 1: Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which requires both Jacobian and Hessian), and EM and its variational extension [6] (which require a latent-variable structure model).

Footnote 2: Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d.

Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules, which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.

Schema-based synthesis. The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM.

First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example, with nodes $\mu$, $\sigma$, $\phi$, $c$, and $x$, the distributions gauss and discrete, and plates of size N_classes and N_points. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node:

Lemma 1. Let $U$, $V$ be sets of variables over a Bayesian network with $U \cap V = \emptyset$. Then $\mathrm{descendents}(U) \cap V = \emptyset$ and $\mathrm{parents}(U) \subseteq U \cup V$ hold in the corresponding dependency graph iff the following probability statement holds:

$$P(U \mid V) = \prod_{u \in U} P(u \mid \mathrm{parents}(u)).$$
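A minimal sketch of the graph-side conditions in Lemma 1, applied to the mixture-of-Gaussians network described above (nodes phi, mu, sigma, c, x). The dictionary encoding and helper functions are illustrative stand-ins rather than AUTO BAYES internals, and each vector is collapsed to a single node, ignoring the indexed, plate-based representation discussed in the text.

    # Dependency graph of the mixture-of-Gaussians model: phi -> c -> x, mu -> x, sigma -> x.
    PARENTS = {
        "phi":   set(),
        "mu":    set(),
        "sigma": set(),
        "c":     {"phi"},
        "x":     {"c", "mu", "sigma"},
    }

    def descendants(nodes):
        """All strict descendants of the given node set."""
        children = {n: {m for m, ps in PARENTS.items() if n in ps} for n in PARENTS}
        seen, frontier = set(), set(nodes)
        while frontier:
            frontier = set().union(*(children[n] for n in frontier)) - seen
            seen |= frontier
        return seen - set(nodes)

    def factorizes(U, V):
        """Graph-side test of Lemma 1: P(U | V) = prod_{u in U} P(u | parents(u)) iff
        no node of V is a descendant of U and every parent of U lies inside U or V."""
        U, V = set(U), set(V)
        parents_of_U = set().union(*(PARENTS[u] for u in U)) - U
        return descendants(U).isdisjoint(V) and parents_of_U <= U | V

    # P(x, c | phi, mu, sigma) = P(x | c, mu, sigma) * P(c | phi): conditions hold.
    print(factorizes({"x", "c"}, {"phi", "mu", "sigma"}))   # True
    # P(phi | c) does not reduce to P(phi): c is a descendant of phi.
    print(factorizes({"phi"}, {"c"}))                       # False
    # P(x | phi, mu, sigma): the parent c of x is neither in U nor in V.
    print(factorizes({"x"}, {"phi", "mu", "sigma"}))        # False

The first query mirrors the factorization an EM derivation needs, P(x, c | phi, mu, sigma) = P(x | c, mu, sigma) P(c | phi); the other two fail the graph conditions, so no such product decomposition can be read off the network.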

2 0.68677896 42 nips-2002-Bias-Optimal Incremental Problem Solving

Author: Jürgen Schmidhuber

Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n disk Towers of Hanoi tasks (minimal solution size $2^n - 1$). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (LSEARCH) View the n-th binary string as a potential program for a universal Turing machine. Given some problem, for all n do: every $2^n$ steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution.

3 0.60281032 142 nips-2002-Maximum Likelihood and the Information Bottleneck

Author: Noam Slonim, Yair Weiss

Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution p(X, Y), this method constructs a new variable T that defines partitions over the values of X that are informative about Y. Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the two problems are strongly related. In fact, for uniform input distribution over X or for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems induces a solution for the other.

4 0.60160112 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

Author: Kenji Fukumizu, Shotaro Akaho, Shun-ichi Amari

Abstract: We show the existence of critical points as lines for the likelihood function of mixture-type models. They are given by embedding of a critical point for models with fewer components. A sufficient condition that the critical line gives local maxima or saddle points is also derived. Based on this fact, a component-split method is proposed for a mixture of Gaussian components, and its effectiveness is verified through experiments. 1

5 0.5284121 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.

6 0.4794606 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

7 0.47769883 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

8 0.45605692 150 nips-2002-Multiple Cause Vector Quantization

9 0.42931315 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

10 0.42772573 110 nips-2002-Incremental Gaussian Processes

11 0.4245989 178 nips-2002-Robust Novelty Detection with Single-Class MPM

12 0.4197619 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

13 0.40514913 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

14 0.39655313 87 nips-2002-Fast Transformation-Invariant Factor Analysis

15 0.3914237 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences

16 0.37814999 6 nips-2002-A Formulation for Minimax Probability Machine Regression

17 0.37002891 107 nips-2002-Identity Uncertainty and Citation Matching

18 0.36777315 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables

19 0.3659417 124 nips-2002-Learning Graphical Models with Mercer Kernels

20 0.36440444 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.03), (14, 0.016), (23, 0.014), (41, 0.012), (42, 0.055), (54, 0.116), (55, 0.046), (57, 0.019), (67, 0.036), (68, 0.033), (73, 0.01), (74, 0.088), (83, 0.022), (87, 0.02), (92, 0.148), (95, 0.112), (98, 0.102)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90359044 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond


2 0.86965722 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages

Author: Dörthe Malzahn, Manfred Opper

Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.

3 0.85967398 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

4 0.85759193 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

Author: Clayton Scott, Robert Nowak

Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.

5 0.83050632 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities

Author: Sumio Watanabe, Shun-ichi Amari

Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exists a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1

6 0.82633913 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

7 0.80092025 27 nips-2002-An Impossibility Theorem for Clustering

8 0.79717827 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule

9 0.7907213 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

10 0.78494185 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

11 0.78177637 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

12 0.77870274 161 nips-2002-PAC-Bayes & Margins

13 0.77840018 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model

14 0.77564514 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

15 0.7751416 10 nips-2002-A Model for Learning Variance Components of Natural Images

16 0.77460754 53 nips-2002-Clustering with the Fisher Score

17 0.77196693 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

18 0.77158773 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

19 0.77010703 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

20 0.76766247 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers