nips nips2003 nips2003-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Weng Department of Statistics National Chengchi University Taipei 116, Taiwan Abstract Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. [sent-2, score-0.156]
2 This paper presents two approaches for obtaining class probabilities. [sent-3, score-0.044]
3 We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. [sent-5, score-0.115]
4 In this paper we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons. [sent-8, score-0.12]
5 A common way to combine pairwise comparisons is by voting [6, 2]. [sent-9, score-0.181]
6 It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. [sent-10, score-0.122]
7 Though the voting procedure requires just pairwise decisions, it only predicts a class label. [sent-11, score-0.2]
8 As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates by combining the pairwise class probabilities. [sent-13, score-0.209]
9 Given the observation x and the class label y, we assume that the estimated pairwise class probabilities rij of µij = p(y = i | y = i or j, x) are available. [sent-14, score-0.815]
10 Then, the goal is to estimate {pi}_{i=1}^k, where pi = p(y = i | x), i = 1, . . . , k. [sent-16, score-0.508]
11 We propose to obtain an approximate solution to an identity, and then select the label with the highest estimated class probability. [sent-20, score-0.101]
12 The existence of the solution is guaranteed by the theory of finite Markov chains. [sent-21, score-0.067]
13 Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3]. [sent-25, score-0.093]
14 Section 5 presents the relationship among the four methods through their corresponding optimization formulas. [sent-29, score-0.051]
15 2 Review of Two Methods Let rij be the estimates of µij = pi /(pi + pj ). [sent-40, score-1.357]
16 The voting rule [6, 2] is δV = argmax_i [ Σ_{j:j≠i} I{rij > rji} ]. (1) [sent-41, score-0.222]
17 A simple estimate of probabilities can be derived as p^v_i = 2 Σ_{j:j≠i} I{rij > rji} / (k(k − 1)). [sent-42, score-0.035]
18 The authors of [3] suggest another method to estimate class probabilities, and they claim that the resulting classification rule can outperform δV in some situations. [sent-43, score-0.1]
19 Their approach is based on the minimization of the Kullback-Leibler (KL) distance between rij and µij: l(p) = Σ_{i≠j} nij rij log(rij/µij), (2) [sent-44, score-1.751]
20 where Σ_{i=1}^k pi = 1, pi > 0, i = 1, . . . , k, and nij is the number of instances in class i or j. [sent-47, score-0.114]
21 [3] proposes an iterative procedure to find the minimum of (2). [sent-49, score-0.04]
22 If rij > 0, ∀i ≠ j, the existence of a unique global minimum of (2) has been proved in [5] and references therein. [sent-50, score-0.735]
23 Then the resulting classification rule is δHT(x) = argmax_i [p∗_i]. [sent-52, score-0.148]
24 It is shown in Theorem 1 of [3] that p∗_i > p∗_j if and only if p̃i > p̃j, where p̃j = 2 Σ_{s:s≠j} rjs / (k(k − 1)); (3) that is, the p̃i are in the same order as the p∗_i. [sent-53, score-1.498]
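For concreteness, the following sketch computes the two existing estimates reviewed in this section from a matrix r with r[i, j] holding rij: the voting proportions of (1) and the quantity p̃ of (3). NumPy is assumed and the helper names are illustrative, not taken from [3] or [6, 2]; note that p̃ orders the classes the same way as the exact KL minimizer p∗ but is not p∗ itself.

```python
import numpy as np

def voting_estimate(r):
    """p^v_i = 2 * sum_{j != i} I{r_ij > r_ji} / (k(k - 1)), as in (1)."""
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)              # pairwise wins of each class
    return 2.0 * wins / (k * (k - 1))

def ht_estimate(r):
    """p~_i = 2 * sum_{j != i} r_ij / (k(k - 1)), the approximation in (3)."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)                # ignore the meaningless diagonal entries
    return 2.0 * off.sum(axis=1) / (k * (k - 1))
```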
25 In fact, as pointed out by [3], p̃ can be derived as an approximation to the identity (4) by replacing pi + pj with 2/k, and µij with rij. [sent-55, score-1.369]
26 pi = Σ_{j:j≠i} ((pi + pj)/(k − 1)) (pi/(pi + pj)) = Σ_{j:j≠i} ((pi + pj)/(k − 1)) µij. (4) 3 Our First Approach Note that δHT is essentially argmax_i [p̃i], and p̃ is an approximate solution to (4). [sent-56, score-2.692]
27 Instead of replacing pi + pj by 2/k, in this section we propose to solve the system: pi = Σ_{j:j≠i} ((pi + pj)/(k − 1)) rij, ∀i, subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (5) [sent-57, score-2.78]
28 Let p̄ denote the solution to (5). [sent-58, score-0.038]
29 Then the resulting decision rule is δ1 = argmax_i [p̄i]. [sent-59, score-0.167]
30 To solve (5), we rewrite it as Qp = p, Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i, where Qij = rij/(k − 1) if i ≠ j and Qii = Σ_{s:s≠i} ris/(k − 1). (6) [sent-62, score-1.7]
31 Moreover, if rij > 0 for all i ≠ j, then Qij > 0, which implies this Markov chain is irreducible and aperiodic. [sent-70, score-0.61]
32 These conditions guarantee the existence of a unique stationary probability and that all states are positive recurrent. [sent-71, score-0.069]
33 Hence, we have the following theorem: Theorem 1 If rij > 0, ∀i ≠ j, then (6) has a unique solution p with 0 < pi < 1, ∀i. [sent-72, score-1.196]
34 With Theorem 1 and some further analyses, if we remove the constraint pi ≥ 0, ∀i, the linear system with k + 1 equations still has the same unique solution. [sent-73, score-0.548]
35 On the other hand, as the stationary solution of a Markov chain can be derived as the limit of the n-step transition probability matrix Q^n, we can solve for p by repeatedly multiplying Q^T with any initial vector. [sent-76, score-0.056]
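A minimal sketch of this first approach, assuming NumPy and the convention that Q oriented as in (6) has unit column sums, so that p ← Qp is the power iteration for the stationary probability (under the transposed convention one would iterate with Q.T instead); the function name is illustrative.

```python
import numpy as np

def markov_coupling(r, tol=1e-10, max_iter=10000):
    """First approach (delta_1): fixed point Qp = p of (6) by power iteration."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)
    Q = off / (k - 1.0)                                # Q_ij = r_ij / (k - 1) for i != j
    np.fill_diagonal(Q, off.sum(axis=1) / (k - 1.0))   # Q_ii = sum_{s != i} r_is / (k - 1)
    p = np.full(k, 1.0 / k)                            # any initial probability vector
    for _ in range(max_iter):
        p_new = Q @ p                                  # columns of Q sum to one, so p_new stays on the simplex
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```

Equivalently, one can solve the (k + 1)-equation linear system mentioned above in a single direct solve.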
36 The following arguments show that the solution to (5) is a global minimum of a meaningful optimization problem. [sent-78, score-0.114]
37 To begin, we express (5) as Σ_{j:j≠i} rji pi − Σ_{j:j≠i} rij pj = 0, i = 1, . . . , k, [sent-79, score-1.735]
38 using the property that rij + rji = 1, ∀i ≠ j. [sent-82, score-1.005]
39 Then the solution to (5) is in fact the global minimum of the following problem: min_p Σ_{i=1}^k (Σ_{j:j≠i} rji pi − Σ_{j:j≠i} rij pj)², subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (7) [sent-83, score-2.855]
40 4 Our Second Approach Note that both approaches in Sections 2 and 3 involve solving optimization problems using relations like pi/(pi + pj) ≈ rij or Σ_{j:j≠i} rji pi ≈ Σ_{j:j≠i} rij pj. [sent-85, score-3.114]
41 Motivated by (7), we suggest another optimization formulation as follows: min_p (1/2) Σ_{i=1}^k Σ_{j:j≠i} (rji pi − rij pj)², subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (8) [sent-86, score-2.402]
42 In related work, [12] proposes to solve a linear system consisting of Σ_{i=1}^k pi = 1 and any k − 1 equations of the form rji pi = rij pj. [sent-87, score-2.28]
43 In fact, as (8) considers all rij pj − rji pi , not just k − 1 of them, it can be viewed as an improved version of [12]. [sent-89, score-1.735]
44 We then define the classification rule as δ2 = argmax_i [p†_i]. [sent-91, score-0.148]
45 Since (7) has a unique solution, which can be obtained by solving a simple linear system, it is desirable to see whether the minimization problem (8) has these nice properties. [sent-92, score-0.078]
46 The following theorem shows that the nonnegative constraints in (8) are redundant. [sent-94, score-0.061]
47 Note that we can rewrite the objective function of (8) as min_p (1/2) p^T Q p, (9) where Qij = Σ_{s:s≠i} rsi² if i = j and Qij = −rji rij if i ≠ j. [sent-96, score-1.088]
48 Therefore, without the constraints pi ≥ 0, ∀i, (9) is a linearly constrained convex quadratic programming problem. [sent-98, score-0.508]
49 Consequently, a point p is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar b such that [Q e; e^T 0] [p; b] = [0; 1]. (10) [sent-99, score-0.039]
50 Here e is the vector of all ones and b is the Lagrange multiplier of the equality constraint Σ_{i=1}^k pi = 1. [sent-100, score-0.508]
51 Thus, the solution of (8) can be obtained by solving the simple linear system (10). [sent-101, score-0.058]
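For illustration, a sketch (NumPy assumed, names illustrative) that builds Q as in (9) and solves the KKT system (10) directly; since the nonnegativity constraints are redundant, as noted above, the solve already yields the minimizer p† of (8).

```python
import numpy as np

def solve_kkt(r):
    """Second approach (delta_2): solve the (k+1) x (k+1) linear system (10)."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)
    Q = -off.T * off                               # Q_ij = -r_ji * r_ij for i != j
    np.fill_diagonal(Q, (off ** 2).sum(axis=0))    # Q_ii = sum_{s != i} r_si^2
    A = np.zeros((k + 1, k + 1))                   # [[Q, e], [e^T, 0]]
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0                                   # right-hand side [0; 1]
    sol = np.linalg.solve(A, rhs)
    return sol[:k]                                 # p; sol[k] is the multiplier b
```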
52 The existence of a unique solution is guaranteed by the invertibility of the matrix of (10). [sent-102, score-0.107]
53 The following theorem shows that Q is positive definite (PD) under quite general conditions. [sent-104, score-0.06]
54 (The exceptional case in Theorem 3 involves the relation rsi rsj ris = rji rjs rij.) In addition to direct methods, next we propose a simple iterative method for solving (10): Algorithm 1. 1. Start with an initial p satisfying Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. [sent-109, score-1.177]
55 2. Repeat (t = 1, . . . , k, 1, . . .): pt ← (1/Qtt) [−Σ_{j:j≠t} Qtj pj + p^T Q p], (11) then normalize p, (12) until (10) is satisfied. [sent-118, score-0.24]
56 Theorem 4 If rsj > 0, ∀s ≠ j, and {p^i}_{i=1}^∞ is the sequence generated by Algorithm 1, then any convergent sub-sequence goes to a global minimum of (8). [sent-119, score-0.077]
57 As Theorem 3 indicates that in general Q is positive definite, the sequence {p^i}_{i=1}^∞ from Algorithm 1 usually globally converges to the unique minimum of (8). [sent-120, score-0.061]
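The update (11)-(12) can be transcribed directly as below (NumPy assumed); this plain version recomputes Qp and p^T Q p at every step, whereas an efficient implementation, such as the one distributed with LIBSVM, maintains them incrementally.

```python
import numpy as np

def algorithm1(Q, tol=1e-10, max_iter=1000):
    """Iterative method for (10): cycle over t with update (11), then normalize as in (12)."""
    k = Q.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        Qp = Q @ p
        pQp = p @ Qp
        # (10) holds on the simplex when every component of Qp equals p^T Q p
        if np.abs(Qp - pQp).max() < tol:
            break
        for t in range(k):
            Qp = Q @ p                       # recomputed for clarity; an efficient version updates these in O(k)
            pQp = p @ Qp
            # (11): p_t <- (1/Q_tt) [ -sum_{j != t} Q_tj p_j + p^T Q p ]
            p[t] = (pQp - (Qp[t] - Q[t, t] * p[t])) / Q[t, t]
            p /= p.sum()                     # normalization step (12)
    return p
```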
58 (16) [ i=1 j:j=i k k min p i=1 j:j=i k k δV : min p 1 1 − pi )]2 , k 2 (rij pj − rji pi )2 , min p δ2 : (rij i=1 j:j=i Note that (13) can be easily verified, and that (14) and (15) have been explained in Sections 3 and 4. [sent-122, score-1.714]
59 For (16), its solution is pi = c / Σ_{j:j≠i} I{rji > rij}, where c is the normalizing constant;∗ and therefore, argmax_i [pi] is the same as (1). [sent-123, score-0.66]
60 Clearly, (13) can be obtained from (14) by letting pj ≈ 1/k, ∀j and rji ≈ 1/2, ∀i, j. [sent-124, score-0.635]
61 Similarly, (16) is from (15) by taking the extreme values of rij : 0 or 1. [sent-126, score-0.628]
62 As a result, (16) may enlarge the differences between pi . [sent-127, score-0.528]
63 Next, compared with (15), (14) may tend to underestimate the differences between the pi's. [sent-128, score-0.045]
64 The reason is that in (14) the differences between rij pj and rji pi are summed before squaring, so they can cancel one another first. [sent-129, score-1.751]
65 Thus, conceptually, (13) and (16) are more extreme: the former tends to underestimate the differences between the pi's, while the latter tends to overestimate them. [sent-130, score-0.571]
66 These arguments will be supported by simulated and real data in the next section. [sent-131, score-0.044]
67 6 Experiments 6.1 Simple Simulated Examples [3] designs a simple experiment in which all the pi's are fairly close and their method δHT outperforms the voting strategy δV. [sent-133, score-0.598]
68 rij = pi/(pi + pj) + 0.1 zij if i > j, (17) and rij = 1 − rji if i < j, (18) where zij are standard normal variates. [sent-140, score-2.139]
69 Since rij are required to be within (0, 1), we truncate rij at ε below and at 1 − ε above, for a small ε > 0. [sent-141, score-1.22]
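A sketch of the simulation in (17)-(18), NumPy assumed; the noise level 0.1 follows (17), while the truncation level eps is left as a parameter since the exact ε is not reproduced in this extract.

```python
import numpy as np

def simulate_r(p, noise=0.1, eps=1e-3, rng=None):
    """Generate pairwise estimates rij as in (17)-(18) from true class probabilities p."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(p)
    r = np.full((k, k), 0.5)
    for i in range(k):
        for j in range(i):                          # i > j, as in (17)
            rij = p[i] / (p[i] + p[j]) + noise * rng.standard_normal()
            rij = float(np.clip(rij, eps, 1.0 - eps))   # truncate into (0, 1)
            r[i, j] = rij
            r[j, i] = 1.0 - rij                     # (18)
    return r
```

The four decision rules sketched earlier can then be compared on r = simulate_r(p_true) for various k.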
70 In this example, class 1 has the highest probability and hence is the correct class. [sent-143, score-0.063]
71 Figure 1 shows accuracy rates for each of the four methods when k = 3, 5, 8, 10, 12, 15, 20. [sent-144, score-0.089]
72 Note that in this experiment all classes are quite competitive, so, when using δV, sometimes the highest vote occurs at two or more classes. ∗ For I to be well defined, we consider rij ≠ rji, which is generally true. [sent-146, score-1.068]
73 In addition, if there is an i for which Σ_{j:j≠i} I{rji > rij} = 0, an optimal solution of (16) is pi = 1, and pj = 0, ∀j ≠ i. [sent-147, score-0.768]
[Figure 1: accuracy rates of the four methods versus log k; panel (b): unbalanced pi.] [sent-175, score-0.536]
75 We handle this problem by randomly selecting one class from the ties. [sent-178, score-0.044]
76 Another explanation is that the rij here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. [sent-180, score-0.664]
77 Since δHT relies on the approximation pi + pj ≈ 2/k, this rule may suffer some losses if the class probabilities are not highly balanced. [sent-182, score-0.843]
78 To examine this point, we consider the following two sets of class probabilities: (1) We let k1 = k/2 if k is even, and (k + 1)/2 if k is odd; then we define p1 = 0. [sent-183, score-0.044]
79 After setting pi , we define the pairwise comparisons rij as in (17)-(18). [sent-206, score-1.225]
80 The accuracy rates are shown in Figures 1(b) and 1(c). [sent-208, score-0.057]
81 As expected, δHT is quite sensitive to the imbalance of pi . [sent-210, score-0.526]
82 The situation is much worse in Figure 1(c) because the approximation pi + pj ≈ 2/k is more seriously violated, especially when k is large. [sent-211, score-0.746]
83 In summary, δ1 and δ2 are less sensitive to pi, and their overall performance is fairly stable. [sent-212, score-0.524]
84 From thousands of instances in each dataset, we select 300 and 500 as our training and testing sets. [sent-222, score-0.092]
85 At each (C, γ), four folds are sequentially used as the training set while one fold is used as the validation set. [sent-233, score-0.252]
86 Table 1: Testing errors (in percentage) by four methods; each row reports the testing errors based on a pair of the training and testing sets, and the mean and std (standard deviation) are from five 5-fold cross-validation procedures to select the best (C, γ). [sent-234, score-0.062]
87 [Table 1 rows: satimage (k = 6), segment (k = 7), USPS (k = 10), MNIST (k = 10), letter (k = 26); columns give the mean and std of the testing error for δHT, δ1, δ2, and δV.] [sent-235, score-0.127]
89 The training of the four folds consists of k(k − 1)/2 binary SVMs. [sent-436, score-0.095]
90 Then, for each validation instance†, we apply the four methods to obtain classification decisions. [sent-438, score-0.063]
91 After the cross-validation is done, each rule obtains its best (C, γ). [sent-440, score-0.034]
92 Next, in the same way as (19), the decision values from the training data are employed to find rij. [sent-442, score-0.656]
93 Then, the testing data are classified using each of the four rules. [sent-443, score-0.076]
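As a rough sketch of this pairwise protocol, assuming scikit-learn; the sigmoid map below is only a stand-in for (19), which is not reproduced in this extract, and all names and parameters are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def pairwise_r(X_train, y_train, x, C=1.0, gamma=1.0):
    """Train k(k-1)/2 binary RBF SVMs and build the rij matrix for one instance x.

    (For clarity this retrains per query; in practice the binary SVMs are trained once.)"""
    classes = np.unique(y_train)
    k = len(classes)
    r = np.full((k, k), 0.5)
    for i, j in combinations(range(k), 2):
        mask = np.isin(y_train, [classes[i], classes[j]])
        clf = SVC(C=C, gamma=gamma, kernel="rbf")
        clf.fit(X_train[mask], (y_train[mask] == classes[i]).astype(int))
        f = clf.decision_function(x.reshape(1, -1))[0]
        r[i, j] = 1.0 / (1.0 + np.exp(-f))       # placeholder for (19)
        r[j, i] = 1.0 - r[i, j]
    return r                                      # feed into the four decision rules sketched earlier
```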
94 Due to the randomness of separating training data into five folds for finding the best (C, γ), we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. [sent-444, score-0.131]
95 Moreover, as the selection of the 300 training and 500 testing instances from a larger dataset is also random, we generate five such pairs. [sent-445, score-0.092]
96 In Table 1, each row reports the testing error based on a pair of the training and testing sets. [sent-446, score-0.152]
97 The results show that when the number of classes k is small, the four methods perform similarly; however, for problems with larger k, δHT is less competitive. [sent-447, score-0.058]
98 7 Discussions and Conclusions As the minimization of the KL distance is a well known criterion, some may wonder why the performance of δHT is not quite satisfactory in some of the examples. [sent-459, score-0.036]
99 One possible explanation is that here the KL distance is derived under the assumptions that nij rij ∼ Bin(nij, µij) and that the rij are independent; however, as pointed out in [3], neither of the assumptions holds in the classification problem. [sent-460, score-1.323]
100 Single-layer learning revisited: a stepwise procedure for building and training a neural network. [sent-505, score-0.043]
wordName wordTfidf (topN-words)
[('rij', 0.61), ('pi', 0.508), ('rji', 0.395), ('pj', 0.222), ('ht', 0.148), ('argmaxi', 0.114), ('pairwise', 0.082), ('voting', 0.074), ('qij', 0.069), ('std', 0.062), ('classi', 0.061), ('ij', 0.058), ('taiwan', 0.055), ('qp', 0.055), ('nij', 0.049), ('mnist', 0.045), ('class', 0.044), ('testing', 0.044), ('theorem', 0.042), ('unique', 0.04), ('solution', 0.038), ('knerr', 0.038), ('personnaz', 0.038), ('ris', 0.038), ('rjs', 0.038), ('rsi', 0.038), ('rsj', 0.038), ('taipei', 0.038), ('lin', 0.037), ('marked', 0.037), ('folds', 0.036), ('probabilities', 0.035), ('rule', 0.034), ('satimage', 0.033), ('four', 0.032), ('letter', 0.032), ('coupling', 0.031), ('validation', 0.031), ('existence', 0.029), ('pointed', 0.029), ('rates', 0.029), ('ve', 0.029), ('accuracy', 0.028), ('sections', 0.028), ('equalities', 0.028), ('libsvm', 0.028), ('platt', 0.028), ('unbalanced', 0.028), ('min', 0.027), ('kl', 0.027), ('training', 0.027), ('simulated', 0.026), ('pd', 0.026), ('http', 0.026), ('classes', 0.026), ('ers', 0.025), ('comparisons', 0.025), ('underestimate', 0.025), ('cation', 0.025), ('library', 0.024), ('randomness', 0.024), ('authors', 0.022), ('scenarios', 0.022), ('conceptually', 0.022), ('chain', 0.022), ('minimum', 0.021), ('instances', 0.021), ('usps', 0.021), ('national', 0.02), ('solving', 0.02), ('differences', 0.02), ('stable', 0.019), ('proposes', 0.019), ('optimization', 0.019), ('available', 0.019), ('decision', 0.019), ('agree', 0.019), ('reports', 0.019), ('nonnegative', 0.019), ('highest', 0.019), ('arguments', 0.018), ('pt', 0.018), ('extreme', 0.018), ('minimization', 0.018), ('rewrite', 0.018), ('letting', 0.018), ('global', 0.018), ('pair', 0.018), ('quite', 0.018), ('solve', 0.018), ('estimates', 0.017), ('revisited', 0.016), ('reexamine', 0.016), ('seriously', 0.016), ('stepwise', 0.016), ('zij', 0.016), ('canceled', 0.016), ('annals', 0.016), ('explanation', 0.016), ('fairly', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
2 0.14342649 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
3 0.13652255 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
Author: Arnab Nilim, Laurent El Ghaoui
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
4 0.12159494 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.10846373 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
6 0.10328972 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
7 0.092194483 193 nips-2003-Variational Linear Response
8 0.091118783 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
9 0.080613762 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
10 0.073440537 41 nips-2003-Boosting versus Covering
11 0.066731542 100 nips-2003-Laplace Propagation
12 0.052426364 124 nips-2003-Max-Margin Markov Networks
13 0.049657129 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
14 0.046406467 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
15 0.042213507 170 nips-2003-Self-calibrating Probability Forecasting
16 0.04204154 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
17 0.041100949 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
18 0.038495481 73 nips-2003-Feature Selection in Clustering Problems
19 0.037212521 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
20 0.036266301 113 nips-2003-Learning with Local and Global Consistency
topicId topicWeight
[(0, -0.143), (1, -0.027), (2, -0.056), (3, -0.006), (4, 0.109), (5, -0.04), (6, -0.001), (7, -0.043), (8, -0.112), (9, -0.006), (10, -0.065), (11, -0.142), (12, 0.007), (13, 0.007), (14, 0.113), (15, -0.047), (16, 0.031), (17, 0.057), (18, -0.157), (19, -0.061), (20, 0.14), (21, -0.106), (22, -0.092), (23, 0.103), (24, -0.132), (25, 0.228), (26, -0.065), (27, -0.087), (28, 0.044), (29, -0.086), (30, -0.033), (31, -0.002), (32, -0.031), (33, -0.076), (34, -0.017), (35, -0.053), (36, 0.061), (37, -0.154), (38, 0.081), (39, -0.125), (40, -0.116), (41, 0.037), (42, 0.133), (43, 0.026), (44, 0.201), (45, 0.049), (46, -0.017), (47, 0.034), (48, -0.074), (49, 0.197)]
simIndex simValue paperId paperTitle
same-paper 1 0.96523339 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
2 0.56627232 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
Author: Haifeng Li, Tao Jiang, Keshu Zhang
Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1
3 0.48888522 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
Author: Arnab Nilim, Laurent El Ghaoui
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
4 0.46798369 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
5 0.41442475 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
6 0.40681162 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
7 0.38527629 100 nips-2003-Laplace Propagation
8 0.32792249 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
9 0.29863882 170 nips-2003-Self-calibrating Probability Forecasting
10 0.2905139 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
11 0.29036522 117 nips-2003-Linear Response for Approximate Inference
12 0.2717945 193 nips-2003-Variational Linear Response
13 0.26756418 3 nips-2003-AUC Optimization vs. Error Rate Minimization
14 0.26537302 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
15 0.25932086 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
16 0.25229335 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.24703182 187 nips-2003-Training a Quantum Neural Network
18 0.24138093 123 nips-2003-Markov Models for Automated ECG Interval Analysis
19 0.22714193 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
20 0.22568649 41 nips-2003-Boosting versus Covering
topicId topicWeight
[(0, 0.048), (5, 0.278), (11, 0.038), (35, 0.069), (53, 0.08), (71, 0.113), (76, 0.05), (82, 0.011), (85, 0.08), (91, 0.074), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.8571921 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
Author: Benjamin M. Marlin
Abstract: In this paper we present a generative latent variable model for rating-based collaborative filtering called the User Rating Profile model (URP). The generative process which underlies URP is designed to produce complete user rating profiles, an assignment of one rating to each item for each user. Our model represents each user as a mixture of user attitudes, and the mixing proportions are distributed according to a Dirichlet random variable. The rating for each item is generated by selecting a user attitude for the item, and then selecting a rating according to the preference pattern associated with that attitude. URP is related to several models including a multinomial mixture model, the aspect model [7], and LDA [1], but has clear advantages over each. 1
same-paper 2 0.78049856 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
3 0.66365457 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
4 0.57188964 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.56698287 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
6 0.56620169 122 nips-2003-Margin Maximizing Loss Functions
7 0.56227684 100 nips-2003-Laplace Propagation
8 0.56134254 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
9 0.56033707 41 nips-2003-Boosting versus Covering
10 0.55937111 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
11 0.55784631 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
12 0.55773276 145 nips-2003-Online Classification on a Budget
13 0.5557518 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.55231225 158 nips-2003-Policy Search by Dynamic Programming
15 0.55209786 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.55100256 3 nips-2003-AUC Optimization vs. Error Rate Minimization
17 0.55086285 126 nips-2003-Measure Based Regularization
18 0.55041081 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
19 0.55007434 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.54933685 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images