nips nips2001 nips2001-99 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. [sent-5, score-0.281]
2 Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. [sent-6, score-0.223]
3 We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. [sent-7, score-0.317]
4 Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. [sent-8, score-0.552]
5 1 Introduction An important aspect of decision theory is multi-way pattern classification whereby one must determine the class for a given data vector $x$ that minimizes the overall risk: $\hat{c} = \operatorname{argmin}_{c} \sum_{c'} \lambda(c, c')\, p(c' \mid x)$, [sent-11, score-0.095]
6 where $\lambda(c, c')$ is the loss in choosing $c$ when the true class is $c'$. [sent-12, score-0.136]
7 For the 0/1-loss function, it is optimal to simply use the posterior probability to determine the optimal class: [sent-14, score-0.117]
8 $\hat{c} = \operatorname{argmax}_{c} p(c \mid x)$. This procedure may equivalently be specified using a tournament style game-playing strategy. [sent-15, score-0.372]
9 In this case, there is an implicit class ordering $c_1, c_2, \ldots, c_M$, and a class-pair ($i$ and $j$) scoring function $L_{ij}(x) = \log \frac{p(x \mid c_i)}{p(x \mid c_j)}$ for an unknown sample $x$. [sent-16, score-0.170]
10 Of course, the order of the classes does not matter, as the same winner is found for all permutations. [sent-23, score-0.268]
11 In any event, this style of classification can be seen as a transitive game [5] between players who correspond to the individual classes. [sent-24, score-0.330]
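For concreteness, the following is a minimal sketch (not the authors' code) of how the argmax rule and the pairwise "tournament" view coincide when the pairwise score is the plain log-likelihood ratio; equal class priors and the `logliks` list of per-class log-likelihood callables are assumptions of this illustration.

```python
import numpy as np

def classify_argmax(x, logliks):
    # 0/1-loss Bayes rule with equal class priors: pick the class whose
    # (estimated) log-likelihood of x is largest.
    return int(np.argmax([ll(x) for ll in logliks]))

def classify_sequential(x, logliks, order=None):
    # Equivalent pairwise ("tournament") form: the current champion i plays
    # challenger j via the log-likelihood-ratio score L_ij(x); a positive
    # score keeps the champion, a negative score dethrones it.
    order = list(range(len(logliks))) if order is None else list(order)
    champ = order[0]
    for j in order[1:]:
        if logliks[champ](x) - logliks[j](x) < 0:
            champ = j
    return champ
```

With this score alone the game is transitive, so any presentation order returns the same winner as the argmax rule.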
12 We also show how, under certain assumptions, the term can be seen as an optimal correction between the estimated model likelihood ratio and the true likelihood ratio, and gain further intuition by examining the case when the likelihoods are Gaussians. [sent-27, score-0.479]
13 Furthermore, we observe that the new strategy leads to an intransitive game [5], and we investigate several strategies for playing such games. [sent-28, score-0.573]
14 Finally, we consider intransitivity as a confidence measure, and investigate an iterative approach to further improve the correction term. [sent-30, score-0.224]
15 Section 2.1 then reports experimental results which show significant improvements where the likelihoods are hidden Markov models trained on speech data. [sent-34, score-0.154]
16 Section 3 then recasts the procedure as intransitive games, and evaluates a variety of game playing strategies yielding further (small) error reductions. [sent-35, score-0.561]
17 Section 4 explores an iterative strategy for improving our technique, and finally Section 5 concludes and discusses future work. [sent-38, score-0.107]
18 For our purposes, we are interested in the KL-divergence between class-conditional likelihoods, where $i$ denotes the class number: $D(i \| j) = \int p(x \mid c_i) \log \frac{p(x \mid c_i)}{p(x \mid c_j)}\, dx$. [sent-43, score-0.155]
19 Intuitively, when $D(i \| j)$ is small, samples of class $i$ are more likely to be erroneously classified as class $j$ than when it is large. Comparing $D(i \| j)$ and $D(j \| i)$ should tell us which of $i$ and $j$ is more likely to have its samples mis-classified by the other model. [sent-51, score-0.222]
20 Therefore, the difference $D(i \| j) - D(j \| i)$, when positive, indicates that samples of class $j$ are more likely to be mis-classified as class $i$ than samples of class $i$ are to be mis-classified as class $j$ (and vice-versa when the difference is negative). [sent-52, score-0.496]
21 In other words, $i$ "steals" from $j$ more than $j$ steals from $i$ when the difference is positive, thereby suggesting that class $j$ should receive aid in this case. [sent-53, score-0.162]
22 The difference can thus be seen as an empirical (i.e., based on the data) "bias" indicating which class should receive favor over the other. [sent-56, score-0.145]
23 The correction term $B_{ij}$ is an estimate of this bias. The likelihood ratio is adjusted in favor of $j$ when the difference is positive, and in favor of $i$ when it is negative. [sent-58, score-0.225]
24 We then use the corrected score $L_{ij}(x) + B_{ij}$, and when it is positive, choose class $i$. [sent-59, score-0.095]
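A hedged sketch of the corrected pairwise score follows: the exact form and scaling of the correction are not recoverable from this summary, so the antisymmetric combination of the two estimated divergences below (the sign convention, the 1/2 scale, and the Monte Carlo `kl_estimate` helper) are assumptions made for illustration only.

```python
import numpy as np

def kl_estimate(samples_i, loglik_i, loglik_j):
    # Estimate of D(i || j) "measured modulo the true distribution": the average,
    # over samples believed to come from class i, of the estimated log ratio.
    return float(np.mean([loglik_i(x) - loglik_j(x) for x in samples_i]))

def correction(samples_i, samples_j, loglik_i, loglik_j):
    # Antisymmetric bias term; with this (assumed) sign it aids class j when
    # D(i||j) exceeds D(j||i), i.e. when i "steals" more from j than vice versa.
    D_ij = kl_estimate(samples_i, loglik_i, loglik_j)
    D_ji = kl_estimate(samples_j, loglik_j, loglik_i)
    return 0.5 * (D_ji - D_ij)

def corrected_score(x, loglik_i, loglik_j, B_ij):
    # L_ij(x) + B_ij: when positive choose class i, when negative choose class j.
    return loglik_i(x) - loglik_j(x) + B_ij
```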
25 The above intuition does not explain why such a correction factor should be used, since using the true likelihoods $p(x \mid c_i)$ along with $p(x \mid c_j)$ is already optimal. [sent-60, score-0.101]
26 In practice, however, we do not have access to the true likelihood ratios but instead to an approximation that has been estimated from training data. [sent-61, score-0.158]
27 Let $\tilde{D}(i \| j)$ be the modified KL-divergence between the class-conditional models, measured modulo the true distribution, and let $\tilde{B}_{ij}$ be the corresponding correction term. [sent-70, score-0.136]
28 While KL-divergence is not symmetric in general, we can see that if this holds (or is approximately true for a given problem) then the remaining correction is exactly $B_{ij}$, yielding the rule in Equation 1. [sent-99, score-0.142]
29 Table 1: Word error rates (WER) for likelihood ratio and augmented likelihood ratio based classification for various numbers of classes (VOCAB SIZE: 75, 150, 300, 600). [sent-113, score-0.355]
30 This implies that for Gaussians with equal covariance matrices the symmetry condition above holds, and our correction term is optimal. [sent-116, score-0.134]
31 Moreover, in the case of unequal covariances, the sign of the correction term implies that it penalizes the class that has the larger covariance. [sent-118, score-0.124]
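For intuition, here is a small numerical check (an illustration, not the paper's derivation) that the KL-divergence between two 1-D Gaussians is symmetric when the variances are equal, so the antisymmetric correction vanishes, while unequal variances break the symmetry.

```python
import numpy as np

def kl_gauss_1d(mu1, var1, mu2, var2):
    # Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for 1-D Gaussians.
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Equal variances: the two divergences coincide, so D(i||j) - D(j||i) = 0.
print(kl_gauss_1d(0.0, 1.0, 2.0, 1.0), kl_gauss_1d(2.0, 1.0, 0.0, 1.0))  # 2.0 2.0

# Unequal variances: the symmetry is broken and the difference is nonzero.
print(kl_gauss_1d(0.0, 1.0, 2.0, 4.0), kl_gauss_1d(2.0, 4.0, 0.0, 1.0))
```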
32 2.1 Results We tried this method (assuming that the divergences are computed from the estimated models) on a medium vocabulary speech recognition task. [sent-120, score-0.205]
33 The task we chose is NYNEX PHONEBOOK[4], an isolated word speech corpus. [sent-122, score-0.098]
34 Also, the data set we used (PHONEBOOK) uses different classes for the training and test sets. [sent-128, score-0.099]
35 Of course, using the true test labels in method 3 would be the ideal measure of the degree of confusion between models, but these are of course not available (see Figure 2, however, showing the results of a cheating experiment). [sent-133, score-0.229]
36 Table 2: The WER under different tournament strategies. [sent-160, score-0.427]
37 The first column shows the vocabulary size of the system (identical to the number of classes; see footnote 5). [sent-164, score-0.116]
38 3 Playing Games We may view either $L_{ij}(x)$ or the corrected score $L_{ij}(x) + B_{ij}$ as providing a score of class $i$ over $j$ — when positive, class $i$ wins, and when negative, class $j$ wins. [sent-168, score-0.285]
39 Players pair together and play each other, and the winner goes on to play another match with a different player. [sent-170, score-0.197]
40 The strategy leading to table 1 required a particular class presentation order — in that case the order was just the numeric ordering of the arbitrarily assigned integer classes (corresponding to words in this case). [sent-171, score-0.321]
41 Of course, when $L_{ij}$ alone is used, the order of the comparisons does not matter, leading to a transitive game [5] (the order of player pairings does not change the final winner). [sent-172, score-0.268]
42 The quantity $L_{ij}(x) + B_{ij}$, however, is not guaranteed to be transitive, and when used in a tournament it results in what is called an intransitive game [5]. [sent-173, score-0.547]
43 This means, for example, that class $i$ might win over class $j$, who might win over class $k$, who then might win over class $i$. [sent-174, score-0.147]
44 In an intransitive game, the graph contains directed cycles. [sent-176, score-0.247]
45 There has been very little research on intransitive game strategies — there are in fact a number of philosophical issues relating to whether such games are valid or truly exist. [sent-177, score-0.484]
46 Nevertheless, we derived a number of tournament strategies for playing such intransitive games and evaluated their performance in the following. [sent-178, score-0.763]
47 Broadly, there are two tournament types that we considered. [sent-179, score-0.350]
48 Given a particular ordering of the classes $c_1, c_2, \ldots, c_M$, we define a sequential tournament when $c_1$ plays $c_2$, the winner plays $c_3$, the winner plays $c_4$, and so on. [sent-180, score-1.064]
49 We also define a tree-based tournament when $c_1$ plays $c_2$, $c_3$ plays $c_4$, and so on. [sent-181, score-0.460]
50 The tree-based tournament is then applied recursively on the resulting winners until a final winner is found. [sent-182, score-0.629]
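A sketch of the two tournament types follows, assuming a pairwise function `score(i, j, x)` (positive means class i beats class j, e.g. the corrected score above); this is an illustrative implementation, not the authors' code.

```python
def sequential_tournament(order, score, x):
    # c1 plays c2, the winner plays c3, the winner plays c4, and so on.
    champ = order[0]
    for challenger in order[1:]:
        if score(champ, challenger, x) < 0:
            champ = challenger
    return champ

def tree_tournament(order, score, x):
    # c1 plays c2, c3 plays c4, ...; applied recursively to the winners.
    players = list(order)
    while len(players) > 1:
        nxt = []
        for k in range(0, len(players) - 1, 2):
            i, j = players[k], players[k + 1]
            nxt.append(i if score(i, j, x) >= 0 else j)
        if len(players) % 2 == 1:   # an odd player out gets a bye
            nxt.append(players[-1])
        players = nxt
    return players[0]
```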
51 Based on the above, we investigated several intransitive game playing strategies. [sent-183, score-0.426]
52 For RAND1, we just choose a single random tournament order in a sequential tournament. [sent-184, score-0.41]
53 The ultimate winner is taken to be the player who wins the most tournaments. [sent-186, score-0.256]
54 The third strategy plays 1000 rather than 500 tournaments. [sent-187, score-0.125]
55 The final strategy is inspired by world-cup soccer tournaments: given a randomly generated permutation, the class sequence is divided into groups. (Footnote 5: The 75-word case is an average result of 8 experiments, the 150-word case is an average of 4 cases, and the 300-word case is an average of 2 cases.) [sent-188, score-0.165]
56 There are 7291 separate test samples in the 600-word case, and on average about 911 samples per 75-word test case. [sent-189, score-0.12]
57 We pick the winner of each group using a sequential tournament (the “regionals”). [sent-217, score-0.585]
58 Then a tree-based tournament is used on the group winners. [sent-218, score-0.35]
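A sketch of the randomized strategies (RAND-style repetition with a winner histogram, and a world-cup style grouping), reusing the tournament helpers from the previous sketch; the group size, seeding, and tie-breaking details below are assumptions.

```python
import random
from collections import Counter

def rand_tournaments(classes, score, x, n_trials=500, seed=0):
    # Play many sequential tournaments with random orders and take the class
    # that wins most often (the peak of the winner histogram).
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_trials):
        order = classes[:]
        rng.shuffle(order)
        wins[sequential_tournament(order, score, x)] += 1
    return wins.most_common(1)[0][0], wins

def world_cup(classes, score, x, group_size=4, seed=0):
    # Split a random permutation into groups, pick each group winner with a
    # sequential tournament (the "regionals"), then run a tree-based
    # tournament on the group winners.
    rng = random.Random(seed)
    order = classes[:]
    rng.shuffle(order)
    groups = [order[k:k + group_size] for k in range(0, len(order), group_size)]
    winners = [sequential_tournament(g, score, x) for g in groups]
    return tree_tournament(winners, score, x)
```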
59 As can be seen, the results get slightly better (particularly with a larger number of classes) as the number of tournaments increases. [sent-220, score-0.144]
60 Finally, the single world-cup strategy does surprisingly well for the larger class sizes. [sent-221, score-0.263]
61 Note that the improvements are statistically significant over the baseline (0.002 using a difference of proportions significance test) [sent-222, score-0.108]
62 and the improvements are more dramatic for increasing vocabulary size. [sent-223, score-0.195]
63 Furthermore, it appears that the larger vocabulary sizes benefit more from the larger number (1000 rather than 500) of random tournaments. [sent-224, score-0.138]
64 Figure 1: 75-word vocabulary case (y-axis: probability of error (%); left x-axis: length of cycle; right x-axis: number of cycles detected). [sent-225, score-0.617]
65 Left: probability of error given that there exists a cycle of at least the given length (a cycle length of one means no cycle found). [sent-226, score-0.574]
66 Right: probability of error given that at least the given number of cycles exist. [sent-227, score-0.214]
67 3.1 Empirical Analysis In order to better understand our results, this section analyzes the 500 and 1000 random tournament strategies described above. [sent-229, score-0.449]
68 Each set of random tournaments produces a set of winners which may be described by a histogram. [sent-230, score-0.248]
69 The table indicates that there is typically only one winner, since the (normalized) winner-histogram peak is approximately 1 and the variances are small. [sent-235, score-0.263]
70 This shows further that the winner is typically not in a cycle, as the existence of a directed cycle in the tournament graph would probably lead to different winners for each random tournament. [sent-236, score-0.908]
71 The relationship between properties of cycles and WER is explored below. [sent-237, score-0.18]
72 When the tournament is intransitive (and therefore the graph possesses a cycle), our second analysis shows that the probability of error tends to increase. [sent-238, score-0.605]
73 This is shown in Figure 1, which shows that the error probability increases both as the detected cycle length and the number of detected cycles increase. [sent-239, score-0.403]
74 Table 4: WER results using two strategies (skip and break) that utilize information about cycles in the tournament graphs, compared to the baseline. [sent-259, score-0.662]
75 The two cycle-count columns show the number of cycles detected relative to the number of samples in each case. [sent-260, score-0.285]
77 As an attempt at the latter, we evaluated two very simple heuristics that try to eliminate cycles as detected during classification. [sent-263, score-0.253]
78 In the first method (skip), we run a sequential tournament (using a random class ordering) until either a clear winner is found (a transitive game), or a cycle is detected. [sent-264, score-0.969]
79 If a cycle is detected, we select two players not in the cycle, effectively jumping out of the cycle, and continue playing until the end of the class ordering. [sent-265, score-0.444]
80 If a winner cannot be determined (because there are too few players remaining), we back off and use the baseline scoring to select the winner. [sent-266, score-0.265]
81 In a second method (break), if a cycle is detected, we eliminate the class having the smallest likelihood from that cycle, and then continue playing as before. [sent-267, score-0.453]
82 Neither method detects all the cycles in the graph (their number can be exponentially large). [sent-268, score-0.226]
83 Because the tournament strategy is coupled with cycle detection, the cycles detected are different in each case (the second method detecting fewer cycles presumably because the eliminated class is in multiple cycles). [sent-270, score-1.15]
84 In any case, it is apparent that further work is needed to investigate the relationship between the existence and properties of cycles and methods to utilize this information. [sent-271, score-0.207]
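As one possible direction, here is a sketch of using intransitivity as a confidence signal: count directed 3-cycles in the pairwise score graph over a set of candidate classes (enumerating all cycles is infeasible since their number can be exponentially large). Restricting attention to 3-cycles and to a small candidate set are assumptions of this illustration, not the paper's procedure.

```python
from itertools import combinations

def count_three_cycles(candidates, score, x):
    # A directed 3-cycle a -> b -> c -> a means the pairwise comparisons are
    # intransitive; since more (and longer) cycles were observed to accompany a
    # higher chance of error, the count can serve as a low-confidence flag.
    beats = {(i, j): score(i, j, x) > 0 for i, j in combinations(candidates, 2)}
    wins = lambda a, b: beats[(a, b)] if (a, b) in beats else not beats[(b, a)]
    cycles = 0
    for a, b, c in combinations(candidates, 3):
        if (wins(a, b) and wins(b, c) and wins(c, a)) or \
           (wins(b, a) and wins(c, b) and wins(a, c)):
            cycles += 1
    return cycles
```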
85 We would expect that using the true answers to determine the KL-divergence would improve our results further. [sent-273, score-0.174]
86 The top horizontal lines in Figure 2 show the original baseline results, and the bottom lines show the results using the true answers (a cheating experiment) to determine the KL-divergence. [sent-274, score-0.347]
87 Note also that the relative improvement stays about constant with increasing vocabulary size. [sent-276, score-0.116]
88 This further indicates that an iterative strategy for determining the KL-divergence might further improve our results. [sent-277, score-0.131]
89 In this case, the uncorrected likelihood-ratio classifier is used to determine the answers from which the first set of KL-divergences used in the correction is computed. [sent-278, score-0.131]
90 The remaining plots in Figure 2 show the results of this strategy for the 500 and 1000 random trials cases (i.e., [sent-280, score-0.117]
91 the answers used to compute the KL-divergences in each case are obtained from the previous set of random tournaments using the histogram peak procedure described earlier). [sent-282, score-0.297]
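A sketch of this iterative strategy: classify with the current scores, use the hypothesized labels to re-estimate the divergences and hence the correction terms, and repeat. The `classify(x, B)` callable, the fixed iteration count, and the sign/scale of the correction matrix are assumptions carried over from the earlier sketches.

```python
import numpy as np
from collections import defaultdict

def iterate_corrections(data, logliks, classify, n_iters=5):
    """data: list of samples; logliks[i](x): estimated log-likelihood of class i;
    classify(x, B): hypothesized class label given the correction matrix B."""
    n = len(logliks)
    B = np.zeros((n, n))                      # start from the uncorrected ratio
    for _ in range(n_iters):
        # 1) hypothesize labels with the current correction terms
        groups = defaultdict(list)
        for x in data:
            groups[classify(x, B)].append(x)
        # 2) re-estimate D(i||j) from the hypothesized groups and rebuild B
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i != j and groups[i]:
                    D[i, j] = np.mean([logliks[i](x) - logliks[j](x) for x in groups[i]])
        B = 0.5 * (D.T - D)                   # same assumed sign and scale as before
    return B
```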
92 Rather surprisingly, the results show that iterating in this fashion does not appreciably influence the results. (Footnote 6: Note that this shows a lower bound on the number of cycles detected.) [sent-283, score-0.180]
93 This is saying that if we find, for example, four or more cycles then the chance of error is high. [sent-284, score-0.214]
94 Figure 2: Baseline using likelihood ratio (top lines), cheating results using correct answers for KL-divergence (bottom lines), and the iterative determination of KL-distance using hypothesized answers from the previous iteration (middle lines). (Panels per vocabulary size, including 300 and 600 classes; x-axis: number of iterations; legend: baseline, cheating, 500 trials, 1000 trials.) [sent-298, score-0.573]
97 It is the case, however, that as the number of random tournaments increases, the results become closer to the ideal as the vocabulary size increases. [sent-300, score-0.282]
98 5 Discussion and Conclusion We have introduced a correction term to the likelihood ratio classification method that is justified by the difference between the estimated and true class conditional probabilities. [sent-302, score-0.482]
99 The correction term is an estimate of the classification bias that would optimally compensate for these differences. [sent-303, score-0.252]
100 The presence of the correction term makes the class comparisons intransitive, and we introduce several tournament-like strategies to compensate. [sent-304, score-0.274]
wordName wordTfidf (topN-words)
[('eb', 0.433), ('fb', 0.411), ('tournament', 0.35), ('wer', 0.247), ('winner', 0.197), ('intransitive', 0.197), ('cycle', 0.18), ('cycles', 0.18), ('game', 0.15), ('tournaments', 0.144), ('vocabulary', 0.116), ('classi', 0.109), ('answers', 0.109), ('correction', 0.101), ('class', 0.095), ('cheating', 0.082), ('phonebook', 0.082), ('winners', 0.082), ('playing', 0.079), ('strategies', 0.077), ('detected', 0.073), ('classes', 0.071), ('ratio', 0.07), ('strategy', 0.07), ('cation', 0.068), ('players', 0.068), ('transitive', 0.065), ('hypothesized', 0.062), ('intransitivity', 0.062), ('vocab', 0.062), ('games', 0.06), ('likelihoods', 0.06), ('word', 0.057), ('likelihood', 0.055), ('plays', 0.055), ('baseline', 0.055), ('improvements', 0.053), ('favor', 0.05), ('win', 0.049), ('ordering', 0.046), ('skip', 0.046), ('eca', 0.046), ('sq', 0.045), ('true', 0.041), ('cup', 0.041), ('steals', 0.041), ('speech', 0.041), ('table', 0.039), ('estimated', 0.039), ('sequential', 0.038), ('db', 0.037), ('iterative', 0.037), ('error', 0.034), ('signi', 0.034), ('break', 0.033), ('term', 0.033), ('wins', 0.033), ('samples', 0.032), ('ed', 0.032), ('phone', 0.03), ('lines', 0.03), ('densities', 0.029), ('confusion', 0.029), ('scoring', 0.029), ('correcting', 0.029), ('penalizes', 0.029), ('test', 0.028), ('determination', 0.027), ('var', 0.027), ('course', 0.027), ('con', 0.027), ('existence', 0.027), ('variances', 0.027), ('directed', 0.026), ('player', 0.026), ('medium', 0.026), ('justi', 0.026), ('difference', 0.026), ('trials', 0.025), ('record', 0.025), ('seattle', 0.025), ('seen', 0.025), ('positive', 0.024), ('improve', 0.024), ('graph', 0.024), ('evaluates', 0.024), ('ratios', 0.023), ('matter', 0.023), ('bias', 0.023), ('cant', 0.023), ('style', 0.022), ('continue', 0.022), ('washington', 0.022), ('iterations', 0.022), ('method', 0.022), ('hmms', 0.022), ('posterior', 0.022), ('compute', 0.022), ('random', 0.022), ('nd', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
2 0.10342961 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
Author: Yu-Han Chang, Leslie Pack Kaelbling
Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.
3 0.089174375 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
4 0.080513507 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1
5 0.076658241 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).
6 0.076474741 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
7 0.07031095 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
8 0.069210641 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
9 0.066249631 43 nips-2001-Bayesian time series classification
10 0.064970277 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
11 0.064372666 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
12 0.063430078 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
13 0.062895536 46 nips-2001-Categorization by Learning and Combining Object Parts
14 0.06285499 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
15 0.061636504 8 nips-2001-A General Greedy Approximation Algorithm with Applications
16 0.058912687 144 nips-2001-Partially labeled classification with Markov random walks
17 0.058016967 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
18 0.056810126 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
19 0.056116275 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
20 0.055146504 172 nips-2001-Speech Recognition using SVMs
topicId topicWeight
[(0, -0.168), (1, 0.032), (2, 0.006), (3, 0.065), (4, -0.083), (5, -0.02), (6, 0.01), (7, -0.04), (8, 0.022), (9, -0.032), (10, 0.036), (11, -0.1), (12, -0.06), (13, 0.002), (14, -0.098), (15, -0.013), (16, -0.026), (17, 0.03), (18, -0.071), (19, -0.039), (20, -0.186), (21, 0.041), (22, 0.004), (23, 0.133), (24, 0.037), (25, 0.003), (26, -0.207), (27, 0.07), (28, 0.019), (29, 0.027), (30, -0.034), (31, -0.005), (32, -0.013), (33, -0.001), (34, 0.056), (35, 0.002), (36, 0.099), (37, 0.014), (38, 0.112), (39, 0.101), (40, 0.024), (41, -0.102), (42, -0.05), (43, 0.091), (44, 0.012), (45, 0.018), (46, 0.016), (47, -0.049), (48, -0.049), (49, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.91702521 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
2 0.59399837 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
Author: Yu-Han Chang, Leslie Pack Kaelbling
Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.
3 0.53508687 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
Author: Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data. 1
4 0.49285412 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh
Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1
5 0.4361549 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
Author: Shie Mannor, Nahum Shimkin
Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1
6 0.39498729 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
7 0.39129362 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
8 0.38180754 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
9 0.3671571 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
10 0.34630367 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
11 0.33948496 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
12 0.3380394 118 nips-2001-Matching Free Trees with Replicator Equations
13 0.3357957 43 nips-2001-Bayesian time series classification
14 0.33463699 60 nips-2001-Discriminative Direction for Kernel Classifiers
15 0.33314148 144 nips-2001-Partially labeled classification with Markov random walks
16 0.32938719 190 nips-2001-Thin Junction Trees
17 0.32127318 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
18 0.32000139 61 nips-2001-Distribution of Mutual Information
19 0.31465834 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation
20 0.31333762 168 nips-2001-Sequential Noise Compensation by Sequential Monte Carlo Method
topicId topicWeight
[(14, 0.016), (17, 0.014), (19, 0.018), (27, 0.088), (30, 0.065), (59, 0.023), (72, 0.522), (79, 0.038), (83, 0.028), (91, 0.095)]
simIndex simValue paperId paperTitle
1 0.98475683 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
2 0.95576501 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
same-paper 3 0.94744694 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
4 0.92857897 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in the input space.
5 0.82362866 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.76668036 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
7 0.71755958 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.70012051 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
9 0.69403499 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
10 0.69129926 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
11 0.68509477 1 nips-2001-(Not) Bounding the True Error
12 0.66691834 79 nips-2001-Gaussian Process Regression with Mismatched Models
13 0.65748262 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
14 0.64832526 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
15 0.64639628 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
16 0.64363056 155 nips-2001-Quantizing Density Estimators
17 0.64274454 61 nips-2001-Distribution of Mutual Information
18 0.63328385 25 nips-2001-Active Learning in the Drug Discovery Process
19 0.63324988 128 nips-2001-Multiagent Planning with Factored MDPs
20 0.62763315 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines