nips nips2002 nips2002-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (L SEARCH) View the n-th binary string as a potential program for a universal Turing machine. Given some problem, for all n do: every 2^n steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution.
Reference: text
sentIndex sentText sentNum sentScore
1 Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). [sent-4, score-0.435]
2 No tested program gets more runtime than its probability times the total search time. [sent-5, score-0.344]
3 In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). [sent-6, score-0.367]
4 It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. [sent-7, score-0.184]
5 1 (L SEARCH) View the n-th binary string as a potential program for a universal Turing machine. [sent-9, score-0.227]
6 Given some problem, for all n do: every 2^n steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution. [sent-10, score-0.695]
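A minimal sketch of this time-sharing idea follows, assuming a simple exponential schedule; the scheduling constant and the step/is_solved callbacks are placeholders, not the paper's exact scheme:

```python
# Hedged sketch of LSEARCH-style time sharing: candidate i receives an
# exponentially decaying share of all executed instructions.
from itertools import count

def lsearch(step, is_solved, max_rounds=10**6):
    """step(i) advances candidate program i by one instruction and returns its
    output (or None); is_solved(output) checks a candidate solution."""
    credit = []                              # fractional execution credit per candidate
    for t in count(1):
        if t > max_rounds:
            return None
        for i in range(len(credit)):
            credit[i] += 2.0 ** -(i + 1)     # candidate i earns credit at rate 2^-(i+1)
            if credit[i] >= 1.0:             # enough credit: execute one instruction
                credit[i] -= 1.0
                out = step(i)
                if out is not None and is_solved(out):
                    return i, out
        credit.append(0.0)                   # admit a new candidate each round
```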
7 Recently Hutter developed a more complex asymptotically optimal search algorithm for all well-defined problems [3]. [sent-19, score-0.146]
8 H SEARCH (for Hutter Search) cleverly allocates part of the total search time for searching the space of proofs to find provably correct candidate programs with provable upper runtime bounds, and at any given time focuses resources on those programs with the currently best proven time bounds. [sent-20, score-0.454]
9 In this paper we will use basic concepts of optimal search to construct an optimal incremental problem solver that at any given time may exploit experience collected in previous searches for solutions to earlier tasks, to minimize the constants ignored by nonincremental H SEARCH and L SEARCH. [sent-24, score-0.4]
10 Some strings represent possible internal states of a computer; other strings represent code or programs for manipulating states. [sent-75, score-0.359]
11 We focus on states whose components are integers and on a given set of instructions of some programming language (that is, substrings within states may also encode programs). [sent-76, score-0.345]
12 Let a variable denote the current state of each task, with its i-th component stored on a computation tape (think of a separate tape for each task). [sent-78, score-0.194]
13 For convenience we combine current state and current code in a single address space, introducing negative and positive addresses: negative addresses refer to components of the current state, positive addresses to components of the current code. [sent-79, score-0.368]
14 In particular, the current instruction pointer ip(r) can be found at a (possibly variable) address. [sent-81, score-0.64]
15 This variable distribution will be used to select a new instruction in case ip(r) points to the current topmost address right after the end of the current code. [sent-83, score-0.774]
16 Once chosen, the code bias will remain unchangeable forever — it is a (possibly empty) sequence of programs, some of them prewired by the user, others frozen after previous successful searches for solutions to previous tasks. [sent-85, score-0.479]
17 Given the tasks, the goal is to solve all of them by a program that appropriately uses or extends the current code. [sent-86, score-0.546]
18 We will do this in a bias-optimal fashion, that is, no solution candidate will get much more search time than it deserves, given some initial probabilistic bias on program space: Definition 2. [sent-87, score-0.343]
19 A searcher is n-bias-optimal (n >= 1) if for any maximal total search time T_max it is guaranteed to solve any problem if that problem has a solution p whose runtime t(p) satisfies t(p) <= P(p) T_max / n, where P(p) is p's probability under the bias. [sent-89, score-0.221]
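Written out as a displayed formula, with symbol names reconstructed from context rather than copied from the paper (T_max for the total search time, P(p|r) for the bias, t(p,r) for the runtime), the definition reads roughly:

```latex
% Hedged reconstruction of the bias-optimality definition; symbol names are assumed.
\textbf{Definition ($n$-bias-optimality).}
A searcher is $n$-bias-optimal ($n \ge 1$) if, for any maximal total search time
$T_{\max} > 0$, it is guaranteed to solve any problem $r$ that has a solution $p$ with
\[
  t(p, r) \;\le\; \frac{P(p \mid r)\, T_{\max}}{n},
\]
where $P(\cdot \mid r)$ is the initial bias on program space and $t(p, r)$ is $p$'s runtime on $r$.
A $1$-bias-optimal searcher wastes essentially no search time: each candidate $p$ consumes
at most a $P(p \mid r)$ fraction of the total budget.
```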
20 Optimal backtracking requires that any prolongation of some prefix by some token gets immediately executed. [sent-99, score-0.154]
21 Here the expression “ring” indicates that the tasks are ordered in cyclic fashion; a counter denotes the number of tasks in the ring. [sent-102, score-0.24]
22 Given a global search time limit, we Try to solve all tasks in the ring, by using existing frozen code and/or by discovering an appropriate prolongation of it: [sent-103, score-0.612]
23 W HILE there are unsolved tasks and time is left and the instruction pointer is valid and the instruction is valid and no halt condition is encountered (e. [sent-139, score-0.964]
24 E LSE set Done T RUE; (all tasks solved; new code frozen, if any). [sent-145, score-0.298]
25 I F Done is FALSE and there is some yet untested token (this means an online request for prolongation of the current prefix through a new token): W HILE Done is FALSE and there is some yet untried token as value for the next code cell, set that token and call Try on the prolonged prefix with a time budget scaled by the token's probability according to the current search distribution, setting Done to the result. [sent-157, score-0.18]
26 Use a stack to efficiently restore only those task components that were changed, together with the original instruction pointer and original search distribution. [sent-166, score-0.272]
27 It is important that instructions whose runtimes are not known in advance can be interrupted by Try at any time. [sent-168, score-0.309]
28 Essentially, Try conducts a depth-first search in program space, where the branches of the search tree are program prefixes, and backtracking is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total time limit. [sent-169, score-0.842]
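A minimal sketch of such a prefix-tree search with probability-proportional budgets follows; every name in it (run_prefix, tokens, token_prob) is an illustrative placeholder rather than the paper's actual Try procedure:

```python
# Hedged sketch of a bias-optimal depth-first search over program prefixes.
def try_prefix(prefix, tasks, prefix_prob, total_limit,
               run_prefix, tokens, token_prob):
    """Depth-first search over program prefixes: a prefix never gets more time
    than prefix_prob * total_limit on the current tasks."""
    budget = prefix_prob * total_limit
    used, solved_all = run_prefix(prefix, tasks, budget)   # interruptible execution
    if solved_all:
        return prefix
    if used >= budget:        # runtime exceeded probability times time limit: backtrack
        return None
    # Otherwise the prefix requests a prolongation: try each possible next token,
    # handing it a share of the search proportional to its probability.
    for tok in tokens:
        result = try_prefix(prefix + [tok], tasks,
                            prefix_prob * token_prob(tok, prefix),
                            total_limit, run_prefix, tokens, token_prob)
        if result is not None:
            return result
    return None
```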
29 Now suppose there is an ordered sequence of tasks, where each new task may or may not depend on solutions for earlier tasks. For instance, a task may be to find a [sent-180, score-0.194]
30 faster way through a maze than the one found during the search for a solution to the previous task; we are searching for a single program solving all tasks encountered so far (see [9] for variants of this setup). [sent-182, score-0.484]
31 Inductively suppose we have solved the first n tasks through programs stored below some frozen address, and that the most recently found program, starting at a later start address, actually solves all of them, possibly using information conveyed by earlier programs. [sent-183, score-0.664]
32 To find a program solving the first n+1 tasks, OOPS invokes Try as follows (using set notation for the task ring): [sent-184, score-0.174]
33 In particular, the start address does not increase as long as new tasks can be solved by prolonging the existing code. [sent-201, score-0.254]
34 A bit of thought shows that it is impossible for the most recent code starting at that address to request any additional tokens that could harm its performance on previous tasks. [sent-203, score-0.21]
35 We already inductively know that all of its prolongations will solve all tasks up to the current one. [sent-204, score-0.224]
36 As the start address increases for the i-th time, the i-th frozen program is defined as the code starting at the address’s old value and ending right before its new value. [sent-207, score-0.196]
37 To see this, consider a program p solving the current problem within t(p) steps, given the current code bias and start address. [sent-215, score-0.448]
38 A bias-optimal solver would solve this problem within at most t(p)/P(p) steps, where P(p) is p's probability under the bias. [sent-217, score-0.155]
39 Our only bias shifts are due to freezing programs once they have solved a problem. [sent-223, score-0.231]
40 That is, unlike the learning rate-based bias shifts of A DAPTIVE L SEARCH [10], those of O OPS do not reduce probabilities of programs that were meaningful and executable before the addition of any new code. [sent-224, score-0.199]
41 Only formerly meaningless, interrupted programs (those trying to access code for earlier solutions when there weren’t any) may suddenly become prolongable and successful, once some solutions to earlier tasks have been stored. [sent-225, score-0.603]
42 For instance, the new code may be rather short and likely because it uses information conveyed by earlier found programs stored below it. [sent-228, score-0.217]
43 Or maybe the new code is a short and fast program that copies an earlier solution into the state, then modifies the copy just a little bit to obtain an edited variant, then successfully applies that variant to the new task. [sent-232, score-0.223]
44 If the earlier solution is not many times faster than the edited variant, then OOPS will in general suffer from a much smaller constant slowdown factor than L SEARCH, reflecting the extent to which solutions to successive tasks do share useful mutual information. [sent-233, score-0.193]
45 As we are solving more and more tasks, thus collecting and freezing more and more code, it will generally become harder and harder to identify and address and copy-edit particular useful code segments within the earlier solutions. [sent-240, score-0.311]
46 As a consequence we expect that much of the knowledge embodied by certain frozen programs actually will be about how to access and edit and use programs previously stored below them. [sent-241, score-0.209]
47 3 A Particular Initial Programming Language The efficient search and backtracking mechanism described in section 2. [sent-242, score-0.192]
48 1 is not aware of the nature of the particular programming language given by the set of initial instructions for modifying states. [sent-243, score-0.345]
49 For the experiments we wrote an interpreter for an exemplary, stack-based, universal programming language inspired by F ORTH [8], whose disciples praise its beauty and the compactness of its programs. [sent-245, score-0.196]
50 Illegal use of any instruction will cause the currently tested program prefix to halt. [sent-247, score-0.587]
51 In particular, it is illegal to set variables (such as stack pointers or instruction pointers) to values outside their prewired ranges, or to pop empty stacks, or to divide by 0, or to call nonexistent functions, or to change probabilities . [sent-248, score-0.693]
52 We defined 68 instructions, such as oldq(n) for calling the n-th previously found program, or getq(n) for making a copy of it on stack ds (e. [sent-252, score-0.401]
53 g., to edit it with additional instructions). [sent-254, score-0.212]
54 Lack of space prohibits explaining all instructions (see [9]) — we have to limit ourselves to the few appearing in solutions found in the experiments, using readable names instead of their numbers: Instruction c1() returns constant 1. [sent-255, score-0.334]
55 1), executing one instruction at a time; the instruction ret() causes a return to the address of the next instruction right after the calling instruction. [sent-261, score-1.346]
56 Given input arguments on ds, instruction defnp() pushes onto ds the beginning of a definition of a procedure with these inputs; this procedure returns if its topmost input is 0, otherwise decrements it. [sent-262, score-0.913]
57 callp() pushes onto ds code for a call of the most recently defined function / procedure. [sent-263, score-0.548]
58 Both defnp and callp also push code for making a fresh copy of the inputs of the most recently defined code, expected on top of ds. [sent-264, score-0.482]
59 endnp() pushes code for returning from the current call, then calls the code generated so far on stack ds above the inputs, applying the code to a copy of the inputs on top of ds. [sent-265, score-1.224]
60 We also boost the probabilities of the simple arithmetic instructions by2 and dec, such that the system can easily create other integers from the probable ones (e. [sent-272, score-0.277]
61 , code sequence (c3 by2 by2 dec) will return integer 11). [sent-274, score-0.178]
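As a sanity check of that arithmetic, here is a throwaway stack interpreter for just those three instructions; their semantics (push 3, double the top, decrement the top) are inferred from the text, not taken from the paper's interpreter:

```python
# Toy data stack illustrating the quoted sequence (c3 by2 by2 dec) -> 11.
ds = []
ops = {
    "c3":  lambda s: s.append(3),            # push constant 3
    "by2": lambda s: s.append(s.pop() * 2),  # double the topmost element
    "dec": lambda s: s.append(s.pop() - 1),  # decrement the topmost element
}
for instr in ("c3", "by2", "by2", "dec"):
    ops[instr](ds)
print(ds[-1])   # 3 -> 6 -> 12 -> 11
```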
62 Moving some peg’s top disk to the top of another (possibly empty) peg, one disk at a time, but never placing a larger disk onto a smaller one, transfer all disks to the third peg. [sent-277, score-0.39]
63 Untrained humans find it hard to solve instances with more than a few disks. [sent-280, score-0.149]
64 Anderson [1] applied traditional reinforcement learning methods and was able to solve instances up to 3 disks, solvable within at most 7 moves. [sent-281, score-0.149]
65 Langley [5] used learning production systems and was able to solve Hanoi instances up to 5 disks, solvable within at most 31 moves. [sent-282, score-0.149]
66 They also fail to solve Hanoi problem instances with larger numbers of disks, due to the exploding search space (Jana Koehler, IBM Research, personal communication, 2002). [sent-284, score-0.295]
67 Therefore, in principle it should be able to solve arbitrary instances by discovering the problem’s elegant recursive solution: given n and three pegs (source peg, auxiliary peg, destination peg), define a procedure that exits for n = 0 and otherwise recursively moves the top n-1 disks to the auxiliary peg, moves the largest disk to the destination peg, and recursively moves the n-1 disks from the auxiliary peg to the destination. [sent-286, score-0.267]
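Written as ordinary code, that recursion looks as follows; this is the textbook Hanoi recursion the sentence describes, not the FORTH-like program OOPS discovers, and the peg labels and move callback are placeholders:

```python
# Classic recursive Towers of Hanoi: move n disks from source to destination via auxiliary.
def hanoi(n, source, auxiliary, destination, move=print):
    if n == 0:                                            # procedure simply exits for n = 0
        return
    hanoi(n - 1, source, destination, auxiliary, move)    # clear the top n-1 disks
    move((source, destination))                           # move the largest disk
    hanoi(n - 1, auxiliary, source, destination, move)    # put the n-1 disks back on top

hanoi(3, "S", "A", "D")    # prints the 2**3 - 1 = 7 moves of a 3-disk instance
```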
68 The n-th task is to solve all Hanoi instances up to instance n. [sent-289, score-0.193]
69 We represent the environment for task n on the n-th task tape, allocating dynamic addresses for each peg to store its current disk positions and a pointer to its top disk (0 if there isn’t any). [sent-290, score-0.478]
70 That is, given an instance of size n, we push the corresponding values onto ds. [sent-292, score-0.269]
71 We add three instructions to the 68 instructions of our F ORTH-like programming language: mvdsk() assumes that the three pegs are represented by the first three elements on ds above the current base pointer, and moves a disk from the source peg to the destination peg. [sent-294, score-1.253]
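A rough model of this task environment, with representation details (one list per peg, a simple legality check) chosen for illustration rather than taken from the paper:

```python
# Illustrative Hanoi environment: one list of disk sizes per peg, largest at the bottom.
# The representation and the mvdsk signature are assumptions for this sketch.
def make_instance(n):
    return {"S": list(range(n, 0, -1)), "A": [], "D": []}   # all n disks start on peg S

def mvdsk(pegs, source, destination):
    """Move the top disk from source to destination; illegal moves halt the 'program'."""
    if not pegs[source]:
        raise RuntimeError("illegal move: source peg empty")
    disk = pegs[source][-1]
    if pegs[destination] and pegs[destination][-1] < disk:
        raise RuntimeError("illegal move: larger disk onto smaller")
    pegs[destination].append(pegs[source].pop())

pegs = make_instance(3)
mvdsk(pegs, "S", "D")       # a legal first move of the 3-disk instance
```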
72 Illegal moves cause the current program prefix to halt. [sent-296, score-0.173]
73 5 GHz) the system was not able to solve instances involving more than 3 disks. [sent-299, score-0.149]
74 For this purpose we used a seemingly unrelated symmetry problem based on the context free language 1^n 2^n: given input n on the data stack ds, the goal is to place symbols on the auxiliary stack Ds such that the topmost 2n elements are n 1’s followed by n 2’s. [sent-301, score-0.52]
75 We add two more instructions to the initial programming language: instruction 1toD() pushes 1 onto Ds, instruction 2toD() pushes 2. [sent-302, score-1.366]
76 Now we have a total of five task-specific instructions (including those for Hanoi), with instruction numbers 1, 2, 3, 4, 5, for 1toD, 2toD, mvdsk, xSA, xAD, respectively. [sent-303, score-0.658]
77 So we first boost (Section 3) instructions c1, c2 for the first training phase, where the n-th task is to solve all symmetry problem instances up to instance n. [sent-304, score-0.541]
78 Then we undo the symmetry-specific boosts of c1, c2 and boost instead the Hanoi-specific “instruction number pushers” for the subsequent training phase, where the n-th task (again for increasing n) is to solve all Hanoi instances up to instance n. [sent-305, score-0.231]
79 3 days, OOPS found and froze code solving the symmetry problem. [sent-309, score-0.286]
80 Within 2 more days it also found a universal Hanoi solver, exploiting the benefits of incremental learning ignored by nonincremental H SEARCH and L SEARCH. [sent-310, score-0.243]
81 In what follows we will transform integer sequences discovered by OOPS back into readable programs (to fully understand them, however, one needs to know all side effects, and which instruction has got which number). [sent-312, score-0.558]
82 For the symmetry problem, within less than a second, OOPS found silly but working code for the first instance. [sent-313, score-0.249]
83 Within less than 1 hour it had solved the 2nd, 3rd, 4th, and 5th instances, by always simply prolonging the previous code without changing the start address. [sent-314, score-0.312]
84 The code found so far was inelegant: (defnp 2toD grt c2 c2 endnp boostq delD delD bsf 2toD fromD delD delD delD fromD bsf by2 bsf by2 fromD delD delD fromD cpnb bsf). [sent-315, score-0.625]
85 3 days, OOPS had created and tested a new, elegant, recursive program (no prolongation of the previous one) with a new increased start address , solving all instances up to 6: (defnp c1 calltp c2 endnp). [sent-318, score-0.552]
86 That is, it was cheaper to solve all instances up to 6 by discovering and applying this new program to all instances so far, than just prolonging old code on instance 6 only. [sent-319, score-0.616]
87 In fact, the program turns out to be a universal symmetry problem solver. [sent-320, score-0.298]
88 On the stack, it constructs a 1-argument procedure that returns nothing if its input argument is 0, otherwise calls the instruction 1toD whose code is 1, then calls itself with a decremented input argument, then calls 2toD whose code is 2, then returns. [sent-321, score-1.113]
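The described behavior corresponds to the following recursion, a reconstruction of what the sentence says the program does rather than a transcription of the FORTH-like code:

```python
# Recursive procedure equivalent to the described symmetry solver: for input n it
# emits n 1's followed by n 2's onto the auxiliary stack Ds.
def sym(n, Ds):
    if n == 0:            # returns nothing if its input argument is 0
        return
    Ds.append(1)          # calls 1toD, whose code is 1
    sym(n - 1, Ds)        # calls itself with a decremented input argument
    Ds.append(2)          # calls 2toD, whose code is 2

Ds = []
sym(3, Ds)
print(Ds)                 # [1, 1, 1, 2, 2, 2]
```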
89 Using this program, within an additional 20 milliseconds, OOPS had also solved the remaining 24 symmetry tasks up to instance 30. [sent-322, score-0.223]
90 1 ms later it had found trivial code for the first Hanoi instance: (mvdsk). [sent-324, score-0.178]
91 After a day or so it had found fresh yet bizarre code (new start address) for the next instance: (c4 c3 cpn c4 by2 c3 by2 exec). [sent-325, score-0.338]
92 Finally, after 3 days it had found fresh code (new start address) for the following instance: (c3 dec boostq defnp c4 calltp c3 c5 calltp endnp). [sent-326, score-0.74]
93 Therefore, within 1 additional day OOPS was able to solve the remaining 27 tasks for instances up to 30, reusing the same program again and again. [sent-328, score-0.453]
94 Recall that the optimal solution for n disks takes 2^n - 1 mvdsk operations, and that for each mvdsk several other instructions need to be executed as well! [sent-329, score-0.326]
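For reference, the minimal move count follows from the standard recurrence (a textbook fact, stated here for scale):

```latex
% Minimal number of moves M(n) for the n-disk Towers of Hanoi.
\[
  M(0) = 0, \qquad M(n) = 2\,M(n-1) + 1 \;\Longrightarrow\; M(n) = 2^{n} - 1,
\]
\[
  \text{e.g. } M(30) = 2^{30} - 1 = 1{,}073{,}741{,}823 \text{ mvdsk operations.}
\]
```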
95 The final Hanoi solution profits from the earlier recursive solution to the symmetry problem. [sent-330, score-0.148]
96 ) by exploiting previous code: Instruction c3 pushes 3; dec decrements this; boostq takes the result 2 as an argument and thus boosts the probabilities of all components of the 2nd frozen program, which happens to be the previously found universal symmetry solver. [sent-334, score-0.507]
97 On the other hand, the probability of the essential Hanoi code (defnp c4 calltp c3 c5 calltp endnp), given nothing, is only a tiny value, which explains why it was not quickly found without the help of an easier task. [sent-338, score-0.41]
98 So in this particular setup the incremental training due to the simple recursion for the symmetry problem indeed provided useful training for the more complex Hanoi recursion, speeding up the search by a factor of roughly 1000. [sent-339, score-0.281]
99 Still, some programs do run for a long time; the longest measured runtime exceeded 30 billion steps. [sent-343, score-0.169]
100 , we could set to zero the initial probabilities of most of the 73 initial instructions (most are unnecessary for our two problem classes), and then solve all tasks more quickly (at the expense of obtaining a nonuniversal initial programming language). [sent-349, score-0.481]
wordName wordTfidf (topN-words)
[('instruction', 0.419), ('oops', 0.304), ('hanoi', 0.26), ('instructions', 0.239), ('ds', 0.199), ('pre', 0.189), ('code', 0.178), ('stack', 0.172), ('search', 0.146), ('peg', 0.145), ('programs', 0.139), ('program', 0.137), ('pointer', 0.126), ('tasks', 0.12), ('calltp', 0.116), ('defnp', 0.116), ('deld', 0.116), ('xes', 0.115), ('try', 0.109), ('endnp', 0.101), ('pushes', 0.101), ('calls', 0.091), ('universal', 0.09), ('mvdsk', 0.087), ('solver', 0.08), ('dx', 0.078), ('disk', 0.077), ('solve', 0.075), ('instances', 0.074), ('boostq', 0.072), ('bsf', 0.072), ('fromd', 0.072), ('frozen', 0.072), ('symmetry', 0.071), ('returns', 0.065), ('incremental', 0.064), ('bias', 0.06), ('address', 0.059), ('language', 0.059), ('prolongation', 0.058), ('dec', 0.058), ('tape', 0.057), ('token', 0.05), ('programming', 0.047), ('days', 0.046), ('backtracking', 0.046), ('copy', 0.046), ('topmost', 0.046), ('top', 0.045), ('task', 0.044), ('illegal', 0.043), ('nonincremental', 0.043), ('pegs', 0.043), ('prolonging', 0.043), ('slowdown', 0.043), ('xad', 0.043), ('decrements', 0.043), ('strings', 0.042), ('stored', 0.041), ('copies', 0.04), ('onto', 0.04), ('possibly', 0.04), ('recursive', 0.04), ('boost', 0.038), ('fresh', 0.038), ('rue', 0.038), ('runtimes', 0.038), ('xsa', 0.038), ('earlier', 0.037), ('solving', 0.037), ('current', 0.036), ('discovering', 0.035), ('day', 0.034), ('levin', 0.034), ('false', 0.032), ('tokens', 0.032), ('interrupted', 0.032), ('solved', 0.032), ('tested', 0.031), ('solutions', 0.03), ('runtime', 0.03), ('push', 0.03), ('calling', 0.03), ('call', 0.03), ('empty', 0.029), ('allocating', 0.029), ('callp', 0.029), ('callstack', 0.029), ('cpn', 0.029), ('cpnb', 0.029), ('disks', 0.029), ('edit', 0.029), ('exec', 0.029), ('grt', 0.029), ('hile', 0.029), ('hutter', 0.029), ('idsia', 0.029), ('inductively', 0.029), ('metasearching', 0.029), ('multitasking', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 42 nips-2002-Bias-Optimal Incremental Problem Solving
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (L SEARCH) View the n-th binary string as a potential program for a universal Turing machine. Given some problem, for all n do: every 2^n steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution.
2 0.12932242 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
3 0.085818179 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
4 0.08176998 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡ £ £ £ §¤¢ £ © ¨ ¦ ¥ © ¡ ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤ ¡ parents B % % 9 C0A@ ! 9 @8 § ¥ ¢ 2 ' % % 310 parents ©¢ £ ¡ ! ' % #!
5 0.067849338 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
6 0.056679245 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
7 0.049072307 185 nips-2002-Speeding up the Parti-Game Algorithm
8 0.046952147 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
9 0.042335071 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
10 0.038911391 46 nips-2002-Boosting Density Estimation
11 0.038227595 119 nips-2002-Kernel Dependency Estimation
12 0.037448674 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
13 0.036761124 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
14 0.03604586 85 nips-2002-Fast Kernels for String and Tree Matching
15 0.03564113 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
16 0.035431471 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
17 0.035401627 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
18 0.034506097 120 nips-2002-Kernel Design Using Boosting
19 0.033748347 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
20 0.033433016 10 nips-2002-A Model for Learning Variance Components of Natural Images
topicId topicWeight
[(0, -0.123), (1, -0.02), (2, -0.033), (3, -0.03), (4, 0.013), (5, 0.054), (6, -0.005), (7, -0.045), (8, -0.01), (9, -0.025), (10, -0.032), (11, -0.025), (12, -0.0), (13, -0.052), (14, -0.06), (15, 0.051), (16, 0.08), (17, 0.045), (18, -0.045), (19, -0.13), (20, -0.008), (21, -0.046), (22, -0.139), (23, 0.03), (24, 0.028), (25, -0.133), (26, 0.078), (27, -0.088), (28, 0.1), (29, 0.135), (30, 0.059), (31, 0.095), (32, 0.107), (33, -0.062), (34, -0.033), (35, 0.117), (36, 0.077), (37, 0.049), (38, 0.093), (39, -0.063), (40, -0.17), (41, 0.029), (42, 0.026), (43, -0.083), (44, 0.021), (45, 0.054), (46, 0.099), (47, 0.026), (48, -0.049), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.95939118 42 nips-2002-Bias-Optimal Incremental Problem Solving
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (L SEARCH) View the n-th binary string as a potential program for a universal Turing machine. Given some problem, for all n do: every 2^n steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution.
2 0.57164735 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡ £ £ £ §¤¢ £ © ¨ ¦ ¥ © ¡ ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤ ¡ parents B % % 9 C0A@ ! 9 @8 § ¥ ¢ 2 ' % % 310 parents ©¢ £ ¡ ! ' % #!
3 0.55300856 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
4 0.53930259 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
5 0.51101428 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
Author: Willem H. Zuidema
Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -
6 0.45046407 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
7 0.44058731 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
8 0.40787682 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
9 0.3800838 185 nips-2002-Speeding up the Parti-Game Algorithm
10 0.36355534 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
11 0.30894086 176 nips-2002-Replay, Repair and Consolidation
12 0.28537682 174 nips-2002-Regularized Greedy Importance Sampling
13 0.2842609 178 nips-2002-Robust Novelty Detection with Single-Class MPM
14 0.28412575 85 nips-2002-Fast Kernels for String and Tree Matching
15 0.26670501 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
16 0.26256248 128 nips-2002-Learning a Forward Model of a Reflex
17 0.26256174 142 nips-2002-Maximum Likelihood and the Information Bottleneck
19 0.25388393 107 nips-2002-Identity Uncertainty and Citation Matching
20 0.24961388 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
topicId topicWeight
[(3, 0.014), (11, 0.01), (23, 0.01), (42, 0.064), (43, 0.385), (54, 0.123), (55, 0.046), (57, 0.014), (67, 0.023), (68, 0.018), (73, 0.01), (74, 0.066), (92, 0.034), (98, 0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.77093172 42 nips-2002-Bias-Optimal Incremental Problem Solving
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (L SEARCH) View the n-th binary string as a potential program for a universal Turing machine. Given some problem, for all n do: every 2^n steps on average execute (if possible) one instruction of the n-th program candidate, until one of the programs has computed a solution.
2 0.60674542 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
Author: Nicholas P. Hughes, David Lowe
Abstract: We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SS TRESS measure) gives rise to an annular structure when the input data is drawn from a highdimensional isotropic distribution, and we provide a theoretical justification for this observation.
3 0.42638013 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
4 0.42524299 10 nips-2002-A Model for Learning Variance Components of Natural Images
Author: Yan Karklin, Michael S. Lewicki
Abstract: We present a hierarchical Bayesian model for learning efficient codes of higher-order structure in natural images. The model, a non-linear generalization of independent component analysis, replaces the standard assumption of independence for the joint distribution of coefficients with a distribution that is adapted to the variance structure of the coefficients of an efficient image basis. This offers a novel description of higherorder image structure and provides a way to learn coarse-coded, sparsedistributed representations of abstract image properties such as object location, scale, and texture.
5 0.42507005 53 nips-2002-Clustering with the Fisher Score
Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller
Abstract: Recently the Fisher score (or the Fisher kernel) is increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of loglikelihood of a probabilistic model. This paper gives a theoretical analysis about how class information is preserved in the space of the Fisher score, which turns out that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. So we will develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).
6 0.42353597 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
7 0.42347914 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
8 0.42330366 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
9 0.42265835 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
10 0.42175549 3 nips-2002-A Convergent Form of Approximate Policy Iteration
11 0.42172453 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
12 0.42144334 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
13 0.42087281 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
14 0.42084169 190 nips-2002-Stochastic Neighbor Embedding
15 0.42063597 27 nips-2002-An Impossibility Theorem for Clustering
16 0.42035025 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
17 0.41999876 119 nips-2002-Kernel Dependency Estimation
18 0.41986394 46 nips-2002-Boosting Density Estimation
19 0.4195908 2 nips-2002-A Bilinear Model for Sparse Coding
20 0.41948509 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs