nips nips2003 nips2003-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Weng Department of Statistics National Chengchi University Taipei 116, Taiwan Abstract Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. [sent-2, score-0.156]
2 This paper presents two approaches for obtaining class probabilities. [sent-3, score-0.044]
3 We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. [sent-5, score-0.115]
4 In this paper we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons. [sent-8, score-0.12]
5 A common way to combine pairwise comparisons is by voting [6, 2]. [sent-9, score-0.181]
6 It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. [sent-10, score-0.122]
7 Though the voting procedure requires just pairwise decisions, it only predicts a class label. [sent-11, score-0.2]
8 As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates by combining the pairwise class probabilities. [sent-13, score-0.209]
9 Given the observation x and the class label y, we assume that the estimated pairwise class probabilities rij of µij = p(y = i | y = i or j, x) are available. [sent-14, score-0.815]
10 Then, the goal is to estimate {pi}_{i=1}^k, where pi = p(y = i | x), i = 1, . . . , k. [sent-16, score-0.508]
11 We propose to obtain an approximate solution to an identity, and then select the label with the highest estimated class probability. [sent-20, score-0.101]
12 The existence of the solution is guaranteed by the theory of finite Markov chains. [sent-21, score-0.067]
13 Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3]. [sent-25, score-0.093]
14 Section 5 presents the relationship among the four methods through their corresponding optimization formulas. [sent-29, score-0.051]
15 2 Review of Two Methods Let rij be the estimates of µij = pi /(pi + pj ). [sent-40, score-1.357]
16 The voting rule [6, 2] is δV = argmax_i [ Σ_{j:j≠i} I{rij > rji} ]. (1) [sent-41, score-0.222]
17 A simple estimate of probabilities can be derived as p^v_i = 2 Σ_{j:j≠i} I{rij > rji} / (k(k − 1)). [sent-42, score-0.035]
18 The authors of [3] suggest another method to estimate class probabilities, and they claim that the resulting classification rule can outperform δV in some situations. [sent-43, score-0.1]
19 Their approach is based on the minimization of the Kullback-Leibler (KL) distance between rij and µij: l(p) = Σ_{i≠j} nij rij log(rij/µij), (2) [sent-44, score-1.751]
20 where Σ_{i=1}^k pi = 1, pi > 0, i = 1, . . . , k, and nij is the number of instances in class i or j. [sent-47, score-0.114]
21 [3] proposes an iterative procedure to find the minimum of (2). [sent-49, score-0.04]
22 If rij > 0, ∀i ≠ j, the existence of a unique global minimum of (2) has been proved in [5] and references therein. [sent-50, score-0.735]
23 Then the resulting classification rule is δHT(x) = argmax_i [p∗_i]. [sent-52, score-0.148]
24 It is shown in Theorem 1 of [3] that p∗_i > p∗_j if and only if p̃i > p̃j, where p̃j = 2 Σ_{s:s≠j} rjs / (k(k − 1)); (3) that is, the p̃i are in the same order as the p∗_i. [sent-53, score-1.498]
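For concreteness, the following sketch computes the two existing estimates reviewed in this section from a matrix r with r[i, j] holding rij: the voting proportions of (1) and the quantity p̃ of (3). NumPy is assumed and the helper names are illustrative, not taken from [3] or [6, 2]; note that p̃ orders the classes the same way as the exact KL minimizer p∗ but is not p∗ itself.

```python
import numpy as np

def voting_estimate(r):
    """p^v_i = 2 * sum_{j != i} I{r_ij > r_ji} / (k(k - 1)), as in (1)."""
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)              # pairwise wins of each class
    return 2.0 * wins / (k * (k - 1))

def ht_estimate(r):
    """p~_i = 2 * sum_{j != i} r_ij / (k(k - 1)), the approximation in (3)."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)                # ignore the meaningless diagonal entries
    return 2.0 * off.sum(axis=1) / (k * (k - 1))
```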
25 In fact, as pointed out by [3], p̃ can be derived as an approximation to the identity (4) by replacing pi + pj with 2/k, and µij with rij. [sent-55, score-1.369]
26 pi = Σ_{j:j≠i} ((pi + pj)/(k − 1)) (pi/(pi + pj)) = Σ_{j:j≠i} ((pi + pj)/(k − 1)) µij. (4) 3 Our First Approach Note that δHT is essentially argmax_i [p̃i], and p̃ is an approximate solution to (4). [sent-56, score-2.692]
27 Instead of replacing pi + pj by 2/k, in this section we propose to solve the system: pi = Σ_{j:j≠i} ((pi + pj)/(k − 1)) rij, ∀i, subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (5) [sent-57, score-2.78]
28 Let p̄ denote the solution to (5). [sent-58, score-0.038]
29 Then the resulting decision rule is δ1 = argmax_i [p̄i]. [sent-59, score-0.167]
30 To solve (5), we rewrite it as Qp = p, Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i, where Qij = rij/(k − 1) if i ≠ j and Qii = Σ_{s:s≠i} ris/(k − 1). (6) [sent-62, score-1.7]
31 Moreover, if rij > 0 for all i ≠ j, then Qij > 0, which implies this Markov chain is irreducible and aperiodic. [sent-70, score-0.61]
32 These conditions guarantee the existence of a unique stationary probability and that all states are positive recurrent. [sent-71, score-0.069]
33 Hence, we have the following theorem: Theorem 1 If rij > 0, ∀i ≠ j, then (6) has a unique solution p with 0 < pi < 1, ∀i. [sent-72, score-1.196]
34 With Theorem 1 and some further analyses, if we remove the constraint pi ≥ 0, ∀i, the linear system with k + 1 equations still has the same unique solution. [sent-73, score-0.548]
35 On the other hand, as the stationary solution of a Markov chain can be derived as the limit of the n-step transition probability matrix Q^n, we can solve for p by repeatedly multiplying Q^T with any initial vector. [sent-76, score-0.056]
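A minimal sketch of this first approach, assuming NumPy and the convention that Q oriented as in (6) has unit column sums, so that p ← Qp is the power iteration for the stationary probability (under the transposed convention one would iterate with Q.T instead); the function name is illustrative.

```python
import numpy as np

def markov_coupling(r, tol=1e-10, max_iter=10000):
    """First approach (delta_1): fixed point Qp = p of (6) by power iteration."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)
    Q = off / (k - 1.0)                                # Q_ij = r_ij / (k - 1) for i != j
    np.fill_diagonal(Q, off.sum(axis=1) / (k - 1.0))   # Q_ii = sum_{s != i} r_is / (k - 1)
    p = np.full(k, 1.0 / k)                            # any initial probability vector
    for _ in range(max_iter):
        p_new = Q @ p                                  # columns of Q sum to one, so p_new stays on the simplex
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
```

Equivalently, one can solve the (k + 1)-equation linear system mentioned above in a single direct solve.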
36 The following arguments show that the solution to (5) is a global minimum of a meaningful optimization problem. [sent-78, score-0.114]
37 To begin, we express (5) as Σ_{j:j≠i} rji pi − Σ_{j:j≠i} rij pj = 0, i = 1, . . . , k, [sent-79, score-1.735]
38 using the property that rij + rji = 1, ∀i ≠ j. [sent-82, score-1.005]
39 Then the solution to (5) is in fact the global minimum of the following problem: min_p Σ_{i=1}^k (Σ_{j:j≠i} rji pi − Σ_{j:j≠i} rij pj)², subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (7) [sent-83, score-2.855]
40 4 Our Second Approach Note that both approaches in Sections 2 and 3 involve solving optimization problems using relations like pi/(pi + pj) ≈ rij or Σ_{j:j≠i} rji pi ≈ Σ_{j:j≠i} rij pj. [sent-85, score-3.114]
41 Motivated by (7), we suggest another optimization formulation as follows: min_p (1/2) Σ_{i=1}^k Σ_{j:j≠i} (rji pi − rij pj)², subject to Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. (8) [sent-86, score-2.402]
42 In related work, [12] proposes to solve a linear system consisting of Σ_{i=1}^k pi = 1 and any k − 1 equations of the form rji pi = rij pj. [sent-87, score-2.28]
43 In fact, as (8) considers all rij pj − rji pi , not just k − 1 of them, it can be viewed as an improved version of [12]. [sent-89, score-1.735]
44 We then define the classification rule as δ2 = argmax_i [p†_i]. [sent-91, score-0.148]
45 Since (7) has a unique solution, which can be obtained by solving a simple linear system, it is desirable to see whether the minimization problem (8) has these nice properties. [sent-92, score-0.078]
46 The following theorem shows that the nonnegative constraints in (8) are redundant. [sent-94, score-0.061]
47 Note that we can rewrite the objective function of (8) as min_p (1/2) p^T Q p, (9) where Qij = Σ_{s:s≠i} rsi² if i = j and Qij = −rji rij if i ≠ j. [sent-96, score-1.088]
48 Therefore, without the constraints pi ≥ 0, ∀i, (9) is a linearly constrained convex quadratic programming problem. [sent-98, score-0.508]
49 Consequently, a point p is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar b such that [Q e; e^T 0] [p; b] = [0; 1]. (10) [sent-99, score-0.039]
50 Here e is the vector of all ones and b is the Lagrange multiplier of the equality constraint Σ_{i=1}^k pi = 1. [sent-100, score-0.508]
51 Thus, the solution of (8) can be obtained by solving the simple linear system (10). [sent-101, score-0.058]
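For illustration, a sketch (NumPy assumed, names illustrative) that builds Q as in (9) and solves the KKT system (10) directly; since the nonnegativity constraints are redundant, as noted above, the solve already yields the minimizer p† of (8).

```python
import numpy as np

def solve_kkt(r):
    """Second approach (delta_2): solve the (k+1) x (k+1) linear system (10)."""
    k = r.shape[0]
    off = r.copy()
    np.fill_diagonal(off, 0.0)
    Q = -off.T * off                               # Q_ij = -r_ji * r_ij for i != j
    np.fill_diagonal(Q, (off ** 2).sum(axis=0))    # Q_ii = sum_{s != i} r_si^2
    A = np.zeros((k + 1, k + 1))                   # [[Q, e], [e^T, 0]]
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0                                   # right-hand side [0; 1]
    sol = np.linalg.solve(A, rhs)
    return sol[:k]                                 # p; sol[k] is the multiplier b
```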
52 The existence of a unique solution is guaranteed by the invertibility of the matrix of (10). [sent-102, score-0.107]
53 The following theorem shows that Q is positive definite (PD) under quite general conditions. [sent-104, score-0.06]
54 (The exceptional case in Theorem 3 involves the relation rsi rsj ris = rji rjs rij.) In addition to direct methods, next we propose a simple iterative method for solving (10): Algorithm 1. 1. Start with an initial p satisfying Σ_{i=1}^k pi = 1, pi ≥ 0, ∀i. [sent-109, score-1.177]
55 2. Repeat (t = 1, . . . , k, 1, . . .): pt ← (1/Qtt) [−Σ_{j:j≠t} Qtj pj + p^T Q p], (11) then normalize p, (12) until (10) is satisfied. [sent-118, score-0.24]
56 Theorem 4 If rsj > 0, ∀s ≠ j, and {p^i}_{i=1}^∞ is the sequence generated by Algorithm 1, then any convergent sub-sequence goes to a global minimum of (8). [sent-119, score-0.077]
57 As Theorem 3 indicates that in general Q is positive definite, the sequence {p^i}_{i=1}^∞ from Algorithm 1 usually globally converges to the unique minimum of (8). [sent-120, score-0.061]
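The update (11)-(12) can be transcribed directly as below (NumPy assumed); this plain version recomputes Qp and p^T Q p at every step, whereas an efficient implementation, such as the one distributed with LIBSVM, maintains them incrementally.

```python
import numpy as np

def algorithm1(Q, tol=1e-10, max_iter=1000):
    """Iterative method for (10): cycle over t with update (11), then normalize as in (12)."""
    k = Q.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        Qp = Q @ p
        pQp = p @ Qp
        # (10) holds on the simplex when every component of Qp equals p^T Q p
        if np.abs(Qp - pQp).max() < tol:
            break
        for t in range(k):
            Qp = Q @ p                       # recomputed for clarity; an efficient version updates these in O(k)
            pQp = p @ Qp
            # (11): p_t <- (1/Q_tt) [ -sum_{j != t} Q_tj p_j + p^T Q p ]
            p[t] = (pQp - (Qp[t] - Q[t, t] * p[t])) / Q[t, t]
            p /= p.sum()                     # normalization step (12)
    return p
```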
58 (16) [ i=1 j:j=i k k min p i=1 j:j=i k k δV : min p 1 1 − pi )]2 , k 2 (rij pj − rji pi )2 , min p δ2 : (rij i=1 j:j=i Note that (13) can be easily verified, and that (14) and (15) have been explained in Sections 3 and 4. [sent-122, score-1.714]
59 For (16), its solution is pi = c / Σ_{j:j≠i} I{rji > rij}, where c is the normalizing constant;∗ and therefore, argmax_i [pi] is the same as (1). [sent-123, score-0.66]
60 Clearly, (13) can be obtained from (14) by letting pj ≈ 1/k, ∀j and rji ≈ 1/2, ∀i, j. [sent-124, score-0.635]
61 Similarly, (16) is from (15) by taking the extreme values of rij : 0 or 1. [sent-126, score-0.628]
62 As a result, (16) may enlarge the differences between pi . [sent-127, score-0.528]
63 Next, compared with (15), (14) may tend to underestimate the differences between the pi's. [sent-128, score-0.045]
64 The reason is that in (14) the differences between rij pj and rji pi are summed before squaring, so they can cancel one another first. [sent-129, score-1.751]
65 Thus, conceptually, (13) and (16) are more extreme: the former tends to underestimate the differences between the pi's, while the latter tends to overestimate them. [sent-130, score-0.571]
66 These arguments will be supported by simulated and real data in the next section. [sent-131, score-0.044]
67 6 Experiments 6.1 Simple Simulated Examples [3] designs a simple experiment in which all the pi's are fairly close and their method δHT outperforms the voting strategy δV. [sent-133, score-0.598]
68 rij = pi/(pi + pj) + 0.1 zij if i > j, (17) and rij = 1 − rji if i < j, (18) where zij are standard normal variates. [sent-140, score-2.139]
69 Since rij are required to be within (0, 1), we truncate rij at ε below and at 1 − ε above, for a small ε > 0. [sent-141, score-1.22]
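A sketch of the simulation in (17)-(18), NumPy assumed; the noise level 0.1 follows (17), while the truncation level eps is left as a parameter since the exact ε is not reproduced in this extract.

```python
import numpy as np

def simulate_r(p, noise=0.1, eps=1e-3, rng=None):
    """Generate pairwise estimates rij as in (17)-(18) from true class probabilities p."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(p)
    r = np.full((k, k), 0.5)
    for i in range(k):
        for j in range(i):                          # i > j, as in (17)
            rij = p[i] / (p[i] + p[j]) + noise * rng.standard_normal()
            rij = float(np.clip(rij, eps, 1.0 - eps))   # truncate into (0, 1)
            r[i, j] = rij
            r[j, i] = 1.0 - rij                     # (18)
    return r
```

The four decision rules sketched earlier can then be compared on r = simulate_r(p_true) for various k.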
70 In this example, class 1 has the highest probability and hence is the correct class. [sent-143, score-0.063]
71 Figure 1 shows accuracy rates for each of the four methods when k = 3, 5, 8, 10, 12, 15, 20. [sent-144, score-0.089]
72 Note that in this experiment all classes are quite competitive, so, when using δV, sometimes the highest vote occurs at two or more classes. ∗ For I to be well defined, we consider rij ≠ rji, which is generally true. [sent-146, score-1.068]
73 In addition, if there is an i for which Σ_{j:j≠i} I{rji > rij} = 0, an optimal solution of (16) is pi = 1, and pj = 0, ∀j ≠ i. [sent-147, score-0.768]
[Figure 1: accuracy rates of the four methods versus log k; panel (b): unbalanced pi.] [sent-175, score-0.536]
75 We handle this problem by randomly selecting one class from the ties. [sent-178, score-0.044]
76 Another explanation is that the rij here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. [sent-180, score-0.664]
77 Since δHT relies on the approximation pi + pj ≈ 2/k, this rule may suffer some losses if the class probabilities are not highly balanced. [sent-182, score-0.843]
78 To examine this point, we consider the following two sets of class probabilities: (1) We let k1 = k/2 if k is even, and (k + 1)/2 if k is odd; then we define p1 = 0. [sent-183, score-0.044]
79 After setting pi , we define the pairwise comparisons rij as in (17)-(18). [sent-206, score-1.225]
80 The accuracy rates are shown in Figures 1(b) and 1(c). [sent-208, score-0.057]
81 As expected, δHT is quite sensitive to the imbalance of pi . [sent-210, score-0.526]
82 The situation is much worse in Figure 1(c) because the approximation pi + pj ≈ 2/k is more seriously violated, especially when k is large. [sent-211, score-0.746]
83 In summary, δ1 and δ2 are less sensitive to pi, and their overall performance is fairly stable. [sent-212, score-0.524]
84 From thousands of instances in each dataset, we select 300 and 500 as our training and testing sets. [sent-222, score-0.092]
85 At each (C, γ), four folds are sequentially used as the training set while one fold is used as the validation set. [sent-233, score-0.252]
86 Table 1: Testing errors (in percentage) by four methods; each row reports the testing errors based on a pair of the training and testing sets, and the mean and std (standard deviation) are from five 5-fold cross-validation procedures to select the best (C, γ). [sent-234, score-0.062]
87 [Table 1 rows: satimage (k = 6), segment (k = 7), USPS (k = 10), MNIST (k = 10), letter (k = 26); columns give the mean and std of the testing error for δHT, δ1, δ2, and δV.] [sent-235, score-0.127]
89 The training of the four folds consists of k(k − 1)/2 binary SVMs. [sent-436, score-0.095]
90 Then, for each validation instance†, we apply the four methods to obtain classification decisions. [sent-438, score-0.063]
91 After the cross-validation is done, each rule obtains its best (C, γ). [sent-440, score-0.034]
92 Next, in the same way as (19), the decision values from the training data are employed to find rij. [sent-442, score-0.656]
93 Then, the testing data are classified using each of the four rules. [sent-443, score-0.076]
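As a rough sketch of this pairwise protocol, assuming scikit-learn; the sigmoid map below is only a stand-in for (19), which is not reproduced in this extract, and all names and parameters are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def pairwise_r(X_train, y_train, x, C=1.0, gamma=1.0):
    """Train k(k-1)/2 binary RBF SVMs and build the rij matrix for one instance x.

    (For clarity this retrains per query; in practice the binary SVMs are trained once.)"""
    classes = np.unique(y_train)
    k = len(classes)
    r = np.full((k, k), 0.5)
    for i, j in combinations(range(k), 2):
        mask = np.isin(y_train, [classes[i], classes[j]])
        clf = SVC(C=C, gamma=gamma, kernel="rbf")
        clf.fit(X_train[mask], (y_train[mask] == classes[i]).astype(int))
        f = clf.decision_function(x.reshape(1, -1))[0]
        r[i, j] = 1.0 / (1.0 + np.exp(-f))       # placeholder for (19)
        r[j, i] = 1.0 - r[i, j]
    return r                                      # feed into the four decision rules sketched earlier
```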
94 Due to the randomness of separating training data into five folds for finding the best (C, γ), we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. [sent-444, score-0.131]
95 Moreover, as the selection of the 300 training and 500 testing instances from a larger dataset is also random, we generate five such pairs. [sent-445, score-0.092]
96 In Table 1, each row reports the testing error based on a pair of the training and testing sets. [sent-446, score-0.152]
97 The results show that when the number of classes k is small, the four methods perform similarly; however, for problems with larger k, δHT is less competitive. [sent-447, score-0.058]
98 7 Discussions and Conclusions As the minimization of the KL distance is a well known criterion, some may wonder why the performance of δHT is not quite satisfactory in some of the examples. [sent-459, score-0.036]
99 One possible explanation is that here the KL distance is derived under the assumptions that nij rij ∼ Bin(nij, µij) and that the rij are independent; however, as pointed out in [3], neither of the assumptions holds in the classification problem. [sent-460, score-1.323]
100 Single-layer learning revisited: a stepwise procedure for building and training a neural network. [sent-505, score-0.043]
wordName wordTfidf (topN-words)
[('rij', 0.61), ('pi', 0.508), ('rji', 0.395), ('pj', 0.222), ('ht', 0.148), ('argmaxi', 0.114), ('pairwise', 0.082), ('voting', 0.074), ('qij', 0.069), ('std', 0.062), ('classi', 0.061), ('ij', 0.058), ('taiwan', 0.055), ('qp', 0.055), ('nij', 0.049), ('mnist', 0.045), ('class', 0.044), ('testing', 0.044), ('theorem', 0.042), ('unique', 0.04), ('solution', 0.038), ('knerr', 0.038), ('personnaz', 0.038), ('ris', 0.038), ('rjs', 0.038), ('rsi', 0.038), ('rsj', 0.038), ('taipei', 0.038), ('lin', 0.037), ('marked', 0.037), ('folds', 0.036), ('probabilities', 0.035), ('rule', 0.034), ('satimage', 0.033), ('four', 0.032), ('letter', 0.032), ('coupling', 0.031), ('validation', 0.031), ('existence', 0.029), ('pointed', 0.029), ('rates', 0.029), ('ve', 0.029), ('accuracy', 0.028), ('sections', 0.028), ('equalities', 0.028), ('libsvm', 0.028), ('platt', 0.028), ('unbalanced', 0.028), ('min', 0.027), ('kl', 0.027), ('training', 0.027), ('simulated', 0.026), ('pd', 0.026), ('http', 0.026), ('classes', 0.026), ('ers', 0.025), ('comparisons', 0.025), ('underestimate', 0.025), ('cation', 0.025), ('library', 0.024), ('randomness', 0.024), ('authors', 0.022), ('scenarios', 0.022), ('conceptually', 0.022), ('chain', 0.022), ('minimum', 0.021), ('instances', 0.021), ('usps', 0.021), ('national', 0.02), ('solving', 0.02), ('differences', 0.02), ('stable', 0.019), ('proposes', 0.019), ('optimization', 0.019), ('available', 0.019), ('decision', 0.019), ('agree', 0.019), ('reports', 0.019), ('nonnegative', 0.019), ('highest', 0.019), ('arguments', 0.018), ('pt', 0.018), ('extreme', 0.018), ('minimization', 0.018), ('rewrite', 0.018), ('letting', 0.018), ('global', 0.018), ('pair', 0.018), ('quite', 0.018), ('solve', 0.018), ('estimates', 0.017), ('revisited', 0.016), ('reexamine', 0.016), ('seriously', 0.016), ('stepwise', 0.016), ('zij', 0.016), ('canceled', 0.016), ('annals', 0.016), ('explanation', 0.016), ('fairly', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
2 0.14342649 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
3 0.13652255 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
Author: Arnab Nilim, Laurent El Ghaoui
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
4 0.12159494 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.10846373 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
6 0.10328972 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
7 0.092194483 193 nips-2003-Variational Linear Response
8 0.091118783 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
9 0.080613762 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
10 0.073440537 41 nips-2003-Boosting versus Covering
11 0.066731542 100 nips-2003-Laplace Propagation
12 0.052426364 124 nips-2003-Max-Margin Markov Networks
13 0.049657129 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
14 0.046406467 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
15 0.042213507 170 nips-2003-Self-calibrating Probability Forecasting
16 0.04204154 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
17 0.041100949 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
18 0.038495481 73 nips-2003-Feature Selection in Clustering Problems
19 0.037212521 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
20 0.036266301 113 nips-2003-Learning with Local and Global Consistency
topicId topicWeight
[(0, -0.143), (1, -0.027), (2, -0.056), (3, -0.006), (4, 0.109), (5, -0.04), (6, -0.001), (7, -0.043), (8, -0.112), (9, -0.006), (10, -0.065), (11, -0.142), (12, 0.007), (13, 0.007), (14, 0.113), (15, -0.047), (16, 0.031), (17, 0.057), (18, -0.157), (19, -0.061), (20, 0.14), (21, -0.106), (22, -0.092), (23, 0.103), (24, -0.132), (25, 0.228), (26, -0.065), (27, -0.087), (28, 0.044), (29, -0.086), (30, -0.033), (31, -0.002), (32, -0.031), (33, -0.076), (34, -0.017), (35, -0.053), (36, 0.061), (37, -0.154), (38, 0.081), (39, -0.125), (40, -0.116), (41, 0.037), (42, 0.133), (43, 0.026), (44, 0.201), (45, 0.049), (46, -0.017), (47, 0.034), (48, -0.074), (49, 0.197)]
simIndex simValue paperId paperTitle
same-paper 1 0.96523339 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
2 0.56627232 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
Author: Haifeng Li, Tao Jiang, Keshu Zhang
Abstract: A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminate analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix Sw . Such a dependence is a major drawback of LDA especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable. 1
3 0.48888522 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
Author: Arnab Nilim, Laurent El Ghaoui
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to realworld problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
4 0.46798369 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
5 0.41442475 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
6 0.40681162 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
7 0.38527629 100 nips-2003-Laplace Propagation
8 0.32792249 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
9 0.29863882 170 nips-2003-Self-calibrating Probability Forecasting
10 0.2905139 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
11 0.29036522 117 nips-2003-Linear Response for Approximate Inference
12 0.2717945 193 nips-2003-Variational Linear Response
13 0.26756418 3 nips-2003-AUC Optimization vs. Error Rate Minimization
14 0.26537302 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
15 0.25932086 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
16 0.25229335 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.24703182 187 nips-2003-Training a Quantum Neural Network
18 0.24138093 123 nips-2003-Markov Models for Automated ECG Interval Analysis
19 0.22714193 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
20 0.22568649 41 nips-2003-Boosting versus Covering
topicId topicWeight
[(0, 0.048), (5, 0.278), (11, 0.038), (35, 0.069), (53, 0.08), (71, 0.113), (76, 0.05), (82, 0.011), (85, 0.08), (91, 0.074), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.8571921 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
Author: Benjamin M. Marlin
Abstract: In this paper we present a generative latent variable model for rating-based collaborative filtering called the User Rating Profile model (URP). The generative process which underlies URP is designed to produce complete user rating profiles, an assignment of one rating to each item for each user. Our model represents each user as a mixture of user attitudes, and the mixing proportions are distributed according to a Dirichlet random variable. The rating for each item is generated by selecting a user attitude for the item, and then selecting a rating according to the preference pattern associated with that attitude. URP is related to several models including a multinomial mixture model, the aspect model [7], and LDA [1], but has clear advantages over each. 1
same-paper 2 0.78049856 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
3 0.66365457 73 nips-2003-Feature Selection in Clustering Problems
Author: Volker Roth, Tilman Lange
Abstract: A novel approach to combining clustering and feature selection is presented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features. 1
4 0.57188964 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.56698287 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
6 0.56620169 122 nips-2003-Margin Maximizing Loss Functions
7 0.56227684 100 nips-2003-Laplace Propagation
8 0.56134254 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
9 0.56033707 41 nips-2003-Boosting versus Covering
10 0.55937111 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
11 0.55784631 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
12 0.55773276 145 nips-2003-Online Classification on a Budget
13 0.5557518 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.55231225 158 nips-2003-Policy Search by Dynamic Programming
15 0.55209786 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.55100256 3 nips-2003-AUC Optimization vs. Error Rate Minimization
17 0.55086285 126 nips-2003-Measure Based Regularization
18 0.55041081 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
19 0.55007434 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
20 0.54933685 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images