nips nips2004 nips2004-4 knowledge-graph by maker-knowledge-mining

4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill


Source: pdf

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Weng Department of Statistics National Chengchi University Taipei 116, Taiwan Abstract The Bradley-Terry model for paired comparison has been popular in many areas. [sent-2, score-0.084]

2 We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. [sent-3, score-0.319]

3 We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. [sent-4, score-0.101]

4 A useful application to multi-class probability estimates using error-correcting codes is demonstrated. [sent-5, score-0.082]

5 1 Introduction The Bradley-Terry model [2] for paired comparisons has been broadly applied in many areas such as statistics, sports, and machine learning. [sent-6, score-0.084]

6 It considers the model P(individual i beats individual j) = π_i / (π_i + π_j), (1) where π_i is the overall skill of the ith individual. [sent-7, score-0.215]

7 Given k individuals and r_ij as the number of times that i beats j, approximate skills p_i can be found by minimizing the negative log-likelihood of the model (1): min_p l(p) = −∑_{i<j} [ r_ij log(p_i/(p_i + p_j)) + r_ji log(p_j/(p_i + p_j)) ], subject to p_j ≥ 0, j = 1, . . . , k, and ∑_{j=1}^k p_j = 1. (2) [sent-8, score-1.161]

8 For the index s to be updated and j = 1, . . . , k, define p_j^{t,n} ≡ (∑_{i:i≠s} r_si) / (∑_{i:i≠s} (r_si + r_is)/(p_s^t + p_i^t)) if j = s, and p_j^{t,n} ≡ p_j^t if j ≠ s, (3) followed by normalizing p^{t,n} so that its components sum to one. [sent-21, score-0.968]
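
The update rule (3) is simple to implement. The following Python sketch is our own illustrative implementation of Algorithm 1 (not code from the paper); the function and variable names are assumptions made for the example.

```python
import numpy as np

def bradley_terry(R, iters=1000, tol=1e-10):
    """Estimate individual skills p from a pairwise win matrix R,
    where R[i, j] is the number of times i beats j (sketch of Algorithm 1)."""
    k = R.shape[0]
    p = np.full(k, 1.0 / k)                              # start from uniform skills
    for _ in range(iters):
        p_prev = p.copy()
        for s in range(k):
            wins = np.delete(R[s], s).sum()              # sum_{i != s} r_si
            games = np.delete(R[s] + R[:, s], s)         # r_si + r_is for i != s
            pair_sums = np.delete(p[s] + p, s)           # p_s + p_i for i != s
            p[s] = wins / np.sum(games / pair_sums)      # update rule (3)
            p = p / p.sum()                              # renormalize so the skills sum to one
        if np.abs(p - p_prev).max() < tol:
            break
    return p

# Toy example with three players; stronger players receive larger skill estimates.
R = np.array([[0., 8., 6.],
              [2., 0., 5.],
              [4., 5., 0.]])
print(bradley_terry(R))
```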

9 If rij > 0, ∀i, j, Algorithm 1 globally converges to the unique minimum of (2). [sent-29, score-0.115]

10 Several machine learning works have used the Bradley-Terry model; one application is to obtain multi-class probability estimates from pairwise coupling [8]. [sent-31, score-0.12]

11 For any data instance x, if n_ij is the number of training data in the ith or jth class, and r_ij ≈ n_ij P(x in class i | x in class i or j) is available, solving (2) gives an estimate of P(x in class i), i = 1, . . . , k. [sent-32, score-0.319]

12 [13] tried to extend this algorithm to other multi-class settings such as “one-against-the-rest” or “error-correcting coding,” but did not provide a convergence proof. [sent-36, score-0.043]

13 In Section 5.2 we show that the algorithm proposed in [13] indeed has some convergence problems. [sent-38, score-0.043]

14 In this paper, we propose a generalized Bradley-Terry model where each comparison is between two disjoint subsets of subjects. [sent-39, score-0.1]

15 Then from the results of team competitions, we can approximate the skill of each individual. [sent-40, score-0.149]

16 For example, from records of tennis or badminton doubles (or singles and doubles combined), we may obtain the rank of all individuals. [sent-42, score-0.086]

17 A useful application in machine learning is multi-class probability estimates using error-correcting codes. [sent-43, score-0.063]

18 We then introduce a simple iterative method to solve the generalized model with a convergence proof. [sent-44, score-0.12]

19 Experiments on multi-class probability estimates demonstrate the viability of the proposed model and algorithm. [sent-45, score-0.063]

20 Due to space limitation, we omit all proofs in this paper. [sent-46, score-0.042]

21 2 Generalized Bradley-Terry Model We propose a generalized Bradley-Terry model where, using team competition results, we can approximate individual skill levels. [sent-47, score-0.279]

22 Two disjoint subsets I_i^+ and I_i^- form teams for games, and r_i ≥ 0 (r_i' ≥ 0) is the number of times that I_i^+ beats I_i^- (I_i^- beats I_i^+). [sent-52, score-0.594]

23 The settings of the original pairwise model, i = 1, . . . , k(k − 1)/2, are as follows (columns I_i^+, I_i^-, r_i, r_i'): the first comparison has I_i^+ = {1}, I_i^- = {2}, r_i = r_12, r_i' = r_21, and the remaining rows run over all other pairs. [sent-64, score-0.754]

24 The difficulty of solving (4) compared with (2) is that now l(p) is expressed in terms of q_i, q_i^+, q_i^-, but the real variable is p. [sent-77, score-1.026]

25 The original Bradley-Terry model is a special case of other statistical models such as log-linear or generalized linear models, so methods other than Algorithm 1 (e.g., [sent-78, score-0.055]

26 iterative scaling and iterative weighted least squares) can also be used. [sent-80, score-0.044]

27 , k, define P ′ P ri ri − +  i:s∈Ii qt,+ + i:s∈Ii qt,− t  i i pj if j = s, P ri +r ′ i (5) pt,n ≡ j i:s∈Ii qt  i  t pj if j = s. [sent-100, score-1.686]

28 For the multiplicative factor in (5) to be well defined (i.e., for its denominator to be nonzero), further conditions are required. [sent-112, score-0.075]

29 In each iteration only one component (i.e., p_s^t) is modified while the others remain the same. [sent-118, score-0.356]

30 It is motivated from using a descent direction to strictly decrease l(p): if ∂l(p^t)/∂p_s ≠ 0, then ∂l(p^t)/∂p_s · (p_s^{t,n} − p_s^t) = −(∂l(p^t)/∂p_s)^2 · p_s^t / (∑_{i:s∈I_i} (r_i + r_i')/q_i^t) < 0, (6) where ∂l(p)/∂p_s = −∑_{i:s∈I_i^+} r_i/q_i^+ − ∑_{i:s∈I_i^-} r_i'/q_i^- + ∑_{i:s∈I_i} (r_i + r_i')/q_i. [sent-119, score-3.667]

31 Thus, p_s^{t,n} − p_s^t is a descent direction in optimization, since a sufficiently small step along this direction guarantees the strict decrease of the function value. [sent-120, score-1.09]

32 Since now we take the whole direction without searching for the step size, more effort is needed to prove the strict decrease in Lemma 1. [sent-121, score-0.088]

33 Lemma 1: If p_s^t > 0, where s is the index to be updated, and ∂l(p^t)/∂p_s ≠ 0, then l(p^{t+1}) < l(p^t). [sent-123, score-0.356]

34 If we apply the update rule (5) to the pairwise model, then [ (∑_{i:i≠s} r_si/p_s^t) / (∑_{i:i≠s} (r_si + r_is)/(p_s^t + p_i^t)) ] p_s^t = (∑_{i:i≠s} r_si) / (∑_{i:i≠s} (r_si + r_is)/(p_s^t + p_i^t)), so (5) goes back to (3). [sent-124, score-2.318]

35 If p satisfies ∂l(p)/∂p_j = 0, j = 1, . . . , k, and the constraints of (4), it is a stationary point of (4)¹. [sent-128, score-0.035]

36 We will prove that Algorithm 2 converges to such a point. [sent-129, score-0.047]

37 ¹ A stationary point means a Karush-Kuhn-Tucker (KKT) point for constrained optimization problems like (2) and (4). [sent-130, score-0.035]

38 If ∂l(p^t)/∂p_j = 0 for all j = 1, . . . , k, then a stationary point of (4) is already obtained. [sent-135, score-0.035]

39 Since the sequence {p^t}_{t=0}^∞ lies in the compact (i.e., closed and bounded) set {p | 0 ≤ p_j ≤ 1, ∑_{j=1}^k p_j = 1}, it has at least one convergent subsequence. [sent-139, score-0.547]

40 To prove the convergence of a fixed-point type algorithm, we need that if p_s^* > 0 and ∂l(p^*)/∂p_s ≠ 0, then from p^* we can use (5) to update it to p^{*,n} ≠ p^*. [sent-145, score-0.098]

41 Assumption 1: ∪_{i∈A} I_i = {1, . . . , k}, where A = {i | (I_i^+ = {j}, r_i > 0) or (I_i^- = {j}, r_i' > 0) for some j}. [sent-153, score-0.754]

42 That is, each individual forms a winning (losing) team in some competitions which together involve all subjects. [sent-154, score-0.134]

43 An issue left in Section 2 is whether the multiplicative factor in (5) is well defined. [sent-155, score-0.075]

44 If the initial p_j^0 > 0, j = 1, . . . , k, one can show by induction that p_j^t > 0 for all t, and hence the denominator of (5) is never zero: if p_j^t > 0, Assumption 1 implies that ∑_{i:j∈I_i^+} r_i/q_i^{t,+} or ∑_{i:j∈I_i^-} r_i'/q_i^{t,-} is positive. [sent-159, score-1.132]

45 Thus, both the numerator and the denominator of the multiplicative factor are positive, and so is p_j^{t+1}. [sent-160, score-0.495]

46 If r_ij > 0 for all i, j, the original Bradley-Terry model satisfies Assumption 1. [sent-161, score-0.094]

47 Whether or not the model satisfies the assumption, an easy way to fulfill it is to add an additional term −µ ∑_{s=1}^k log( p_s / ∑_{j=1}^k p_j ) (7) to l(p), where µ is a small positive number. [sent-162, score-0.673]

48 As ∑_{j=1}^k p_j = 1 is one of the constraints, (7) reduces to −µ ∑_{s=1}^k log p_s, which is a barrier term in optimization to ensure that p_s does not go to zero. [sent-167, score-1.019]

49 The property p_s^* > 0 and the convergence of Algorithm 2 are given in Theorem 1: Theorem 1: Under Assumption 1, any convergent point p^* of Algorithm 2 satisfies p_s^* > 0, s = 1, . . . , k. [sent-168, score-0.086]

50 4 Asymptotic Distribution of the Maximum Likelihood Estimator: For the standard Bradley-Terry model, the asymptotic distribution of the MLE (i.e., the minimizer of (2)) has been studied. [sent-172, score-0.029]

51 In this section, we discuss the asymptotic distribution for the proposed estimator. [sent-175, score-0.029]

52 To work on the real probability π, we define q̄_i ≡ ∑_{j∈I_i} π_j, q̄_i^+ ≡ ∑_{j∈I_i^+} π_j, q̄_i^- ≡ ∑_{j∈I_i^-} π_j, and consider n_i ≡ r_i + r_i' as a constant. [sent-176, score-1.914]

53 Note that r_i ∼ BIN(n_i, q̄_i^+/q̄_i) is a random variable representing the number of times that I_i^+ beats I_i^- in n_i competitions. [sent-177, score-0.952]
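
This generative view is easy to simulate; the small sketch below (our own code and names, not the paper's) draws one team-comparison outcome from the binomial model above.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.3, 0.2, 0.1])        # true skills pi_j, summing to one

def simulate_comparison(I_plus, I_minus, n_i):
    """Draw r_i ~ BIN(n_i, q_i^+ / q_i) for one comparison and return (r_i, r_i')."""
    q_plus = pi[list(I_plus)].sum()
    q_minus = pi[list(I_minus)].sum()
    r_i = rng.binomial(n_i, q_plus / (q_plus + q_minus))
    return r_i, n_i - r_i

print(simulate_comparison({0, 1}, {2, 3}, n_i=20))
```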

54 If r_i is independent of r_j for all i ≠ j, then √n (p_1 − π_1), . . . [sent-182, score-0.42]

55 . . . , √n (p_{k−1} − π_{k−1}) have for large samples the multivariate normal distribution with zero means and dispersion matrix [λ'_st]^{−1}, where λ'_st = λ_st − λ_sk − λ_tk + λ_kk, s, t = 1, . . . , k − 1. [sent-185, score-0.034]

56 5 Application to Multi-class Probability Estimates: Many classification methods are two-class based approaches and there are different ways to extend them to multi-class cases. [sent-189, score-0.055]

57 Most existing studies focus on predicting class labels but not probability estimates. [sent-190, score-0.044]

58 In this section, we discuss how the generalized Bradley-Terry model can be applied to multi-class probability estimates. [sent-191, score-0.074]

59 Error-correcting coding [7, 1] is a general method to construct binary classifiers and combine them for multi-class prediction. [sent-192, score-0.03]

60 It suggests some ways to construct I_i^+ and I_i^-; both are subsets of {1, . . . , k}. [sent-193, score-0.045]

61 Then one trains a binary model using data from classes in I_i^+ (I_i^-) as positive (negative). [sent-197, score-0.069]

62 Given n_i, the number of training data with classes in I_i = I_i^+ ∪ I_i^-, we assume here that for any data x, r_i ≈ n_i P(x in classes of I_i^+ | x in classes of I_i^+ or I_i^-) (8) is available, and the task is to approximate P(x in class s), s = 1, . . . , k. [sent-199, score-0.845]
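
The quantities in (8) feed directly into the generalized model; the sketch below shows one way to wire the pieces together (our own illustrative code: the predict_proba interface, the code_columns structure, and the solver argument, for example the generalized_bradley_terry sketch given earlier, are assumptions and not the paper's API).

```python
def multiclass_probabilities(x, code_columns, classifiers, n, k, solver):
    """code_columns: list of (I_plus, I_minus) pairs of class-index sets, one per binary problem.
    classifiers[i].predict_proba(x) is assumed to approximate
    P(x in classes of I_plus_i | x in classes of I_i), and n[i] is the number of
    training points whose class lies in I_i = I_plus_i union I_minus_i."""
    comparisons = []
    for (I_plus, I_minus), clf, n_i in zip(code_columns, classifiers, n):
        prob_plus = clf.predict_proba(x)               # the approximation used in (8)
        comparisons.append((I_plus, I_minus,
                            n_i * prob_plus,           # r_i
                            n_i * (1.0 - prob_plus)))  # r_i'
    return solver(comparisons, k)                      # estimates of P(x in class s), s = 1, ..., k
```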

63 In the rest of this section we discuss the special case “one-against-the rest” and the earlier results in [13]. [sent-203, score-0.041]

64 Now n_1 = · · · = n_m = the total number of training data, so the solution to (4) is not affected by n_i. [sent-226, score-0.163]

65 As ∂l(p)/∂p_s = 0 becomes r_s/p_s + ∑_{j:j≠s} r_j'/(1 − p_j) = k, we have r_1/p_1 − (1 − r_1)/(1 − p_1) = · · · = r_k/p_k − (1 − r_k)/(1 − p_k) = k − ∑_{j=1}^k r_j'/(1 − p_j) = δ, where δ is a constant. [sent-228, score-1.14]

66 These equalities provide another way to solve for p, and p_s = ((1 + δ) − √((1 + δ)^2 − 4 r_s δ)) / (2δ). [sent-229, score-0.407]

67 By solving ∑_{s=1}^m p_s = 1, we obtain δ and the optimal p. [sent-231, score-0.393]
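
A sketch of this closed-form route (our own code, not the paper's): for the one-against-the-rest case, δ can be found by a one-dimensional bisection, assuming here that ∑_s r_s > 1 so that the required δ is positive (the general case also allows δ ≤ 0).

```python
import numpy as np

def one_vs_rest_probabilities(r, lo=1e-9, hi=1e9, iters=200):
    """Given r[s] approximating P(x in class s | ...), find delta with sum_s p_s(delta) = 1,
    where p_s = ((1 + delta) - sqrt((1 + delta)**2 - 4 r_s delta)) / (2 delta)."""
    r = np.asarray(r, dtype=float)

    def p_of(delta):
        return ((1 + delta) - np.sqrt((1 + delta) ** 2 - 4 * r * delta)) / (2 * delta)

    for _ in range(iters):            # sum_s p_s(delta) decreases in delta on (0, infinity)
        mid = np.sqrt(lo * hi)        # bisection on a log scale
        if p_of(mid).sum() > 1.0:
            lo = mid                  # the sum is too large, so delta must grow
        else:
            hi = mid
    return p_of(np.sqrt(lo * hi))

r = np.array([0.7, 0.4, 0.3])          # one-vs-rest outputs whose sum exceeds one
p = one_vs_rest_probabilities(r)
print(p, p.sum())                      # p sums to (approximately) one
```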

68 From the formula for p_s, if δ > 0, a larger p_s implies a smaller (1 + δ)^2 − 4 r_s δ and hence a larger r_s. [sent-232, score-0.789]

69 5.2 The Approach in [13] for Error-Correcting Codes: [13] was the first attempt to address probability estimates using general error-correcting codes. [sent-242, score-0.063]

70 By considering the same optimization problem (4), it proposes a heuristic update rule p_s^{t,n} ≡ [ (∑_{i:s∈I_i^+} r_i + ∑_{i:s∈I_i^-} r_i') / (∑_{i:s∈I_i^+} n_i q_i^{t,+}/q_i^t + ∑_{i:s∈I_i^-} n_i q_i^{t,-}/q_i^t) ] p_s^t, (9) but does not provide a convergence proof. [sent-243, score-0.319]
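
For comparison with (5), the multiplicative factor of (9) can be written out directly; a short sketch in our own notation follows (as the discussion below shows, iterating this rule need not converge).

```python
def heuristic_factor(s, comparisons, n, p):
    """Multiplicative factor of the heuristic rule (9) for component s.
    comparisons: list of (I_plus, I_minus, r, r_prime); n[i] plays the role of n_i."""
    num = den = 0.0
    for i, (I_plus, I_minus, r, r_prime) in enumerate(comparisons):
        q_plus = sum(p[j] for j in I_plus)
        q_minus = sum(p[j] for j in I_minus)
        q = q_plus + q_minus
        if s in I_plus:
            num += r
            den += n[i] * q_plus / q
        elif s in I_minus:
            num += r_prime
            den += n[i] * q_minus / q
    return num / den                   # rule (9) sets p_s <- factor * p_s
```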

71 For a fixed-point update, we expect that at the optimum, the multiplicative factor in (9) is one. [sent-244, score-0.075]

72 However, unlike (5), when the factor is one, (9) does not relate to ∂l(p)/∂ps = 0. [sent-245, score-0.036]

73 Taking the “one-against-the-rest” approach, if we keep ∑_{i=1}^k p_i^t = 1 and assume n_i = 1, then r_i + r_i' = 1 and the factor in the update rule (9) is (r_s + ∑_{i:i≠s} r_i') / (p_s^t + ∑_{i:i≠s} (1 − p_i^t)) = (k − 1 + 2 r_s − ∑_{i=1}^k r_i) / (k − 2 + 2 p_s^t). [sent-247, score-2.122]

74 If the algorithm converges and the factor approaches one, then p_s = (1 + 2 r_s − ∑_{i=1}^k r_i)/2, but these values may not satisfy ∑_{s=1}^k p_s = 1. [sent-248, score-1.172]

75 Therefore, if in the algorithm we keep ∑_{i=1}^k p_i^t = 1 as [13] did, the factor may not approach one and the algorithm does not converge. [sent-249, score-0.392]

76 As q_i = 1, the condition that the factor equals one can be written as a linear equation in p. [sent-254, score-0.362]

77 Together with ∑_{i=1}^k p_i = 1, there is an over-determined linear system (i.e., more equations than variables). [sent-255, score-0.107]

78 We then generate r_i by adding some noise to q_i^+/q_i: r_i = min(max(ǫ, (q_i^+/q_i)(1 + 0. [sent-284, score-1.08]

79 Here ǫ = 10^{-7} is used so that all r_i, r_i' are positive. [sent-287, score-0.057]

80 We consider the four encodings used in [1] to generate I_i. [sent-288, score-0.103]

81 Following [1], we repeat this procedure 100 times and select the one whose [10 log2 k] splits have the smallest distance. [sent-301, score-0.086]

82 Similar to “dense,” we repeat the procedure 100 times to find a good coding. [sent-308, score-0.037]

83 Figure 1 shows averaged accuracy rates over 500 replicates for each of the four methods when k = 2^2, 2^3, . . . [sent-309, score-0.057]

84 “1vs1” is good for (a) and (b), but suffers some losses in (c), where the class probabilities are highly unbalanced. [sent-313, score-0.076]

85 We also analyze the (relative) mean square error (MSE) in Figure 2: MSE = (1/500) ∑_{j=1}^{500} [ ∑_{i=1}^k (p̂_i^j − p_i)^2 / ∑_{i=1}^k p_i^2 ], (10) where p̂^j is the probability estimate obtained in the jth of the 500 replicates. [sent-318, score-0.511]
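
The relative MSE (10) is straightforward to compute; a small sketch under our own naming, where p_hat holds one estimate per replicate.

```python
import numpy as np

def relative_mse(p_hat, p_true):
    """p_hat: (n_replicates, k) array of probability estimates; p_true: length-k truth.
    Returns the (relative) MSE of equation (10)."""
    per_replicate = ((p_hat - p_true) ** 2).sum(axis=1) / (p_true ** 2).sum()
    return per_replicate.mean()

p_true = np.array([0.5, 0.3, 0.2])
p_hat = p_true + 0.01 * np.random.default_rng(1).standard_normal((500, 3))
print(relative_mse(p_hat, p_true))
```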

86 Note that in Figure 2(a), as p (and p̂^j) are balanced, ∑_{i=1}^k (p̂_i^j − p_i)^2 is small. [sent-320, score-0.359]

87 Hence, all approaches have small MSE though some have poor accuracy. [sent-321, score-0.107]

88 Figure 1: Accuracy (y-axis) versus log_2 k (x-axis) in panels (a)–(c) for the four encodings: “1vs1” (dashed line, square), “1vsrest” (solid line, cross), “dense” (dotted line, circle), “sparse” (dashdot line, asterisk). [sent-335, score-0.086]

89 20 training/testing splits are generated and the testing error rates are averaged. [sent-364, score-0.072]

90 Since these multi-class problems are reasonably balanced, we set n_i = 1 in (8). [sent-373, score-0.163]

91 Since there are no probability values available for these problems, we compare the accuracy by predicting the label with the largest probability estimate. [sent-374, score-0.067]

92 The purpose here is to compare the four probability estimates but not to check the difference from existing multiclass classification techniques. [sent-375, score-0.136]

93 We consider support vector machines (SVM) [4] with the RBF kernel as the binary classifier. [sent-376, score-0.048]

94 Moreover, “1vs1” suffers some losses when k is larger (e. [sent-385, score-0.051]

95 In terms of the computational time, because the number of binary problems for “dense” and “sparse” ([10 log2 k] and [15 log2 k], respectively) is larger than k, and each binary problem involves many classes of data (all and one half), their training time is longer than “1vs1” and “1vsrest. [sent-388, score-0.099]

96 Note that though “1vs1” solves k(k − 1)/2 binary problems, it is efficient as each binary problem involves only two classes of data. [sent-390, score-0.069]

97 Table 1: Average of 20 test errors (in percentage) by four encodings (lowest boldfaced), under two settings (300 training/500 testing and 800 training/1,000 testing); columns: Problem, k, 1vs1, 1vsrest, dense, sparse for each setting. [sent-391, score-0.369]

98 In summary, we propose a generalized Bradley-Terry model which gives individual skill from group competition results. [sent-447, score-0.24]

99 Reducing multiclass to binary: a unifying approach for margin classifiers. [sent-455, score-0.045]

100 Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung [The calculation of tournament results as a maximum problem of probability theory]. [sent-557, score-0.067]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ii', 0.44), ('ri', 0.377), ('ps', 0.369), ('pt', 0.356), ('qi', 0.326), ('pj', 0.252), ('ni', 0.163), ('rsi', 0.108), ('pi', 0.107), ('mse', 0.106), ('rij', 0.094), ('skill', 0.091), ('taiwan', 0.087), ('beats', 0.086), ('paired', 0.084), ('dense', 0.078), ('encodings', 0.075), ('team', 0.058), ('generalized', 0.055), ('qt', 0.051), ('rs', 0.051), ('splits', 0.049), ('rk', 0.046), ('multiclass', 0.045), ('estimates', 0.044), ('doubles', 0.043), ('ris', 0.043), ('taipei', 0.043), ('convergent', 0.043), ('denominator', 0.043), ('rj', 0.043), ('convergence', 0.043), ('rest', 0.041), ('lin', 0.04), ('multiplicative', 0.039), ('classes', 0.039), ('pk', 0.038), ('competitions', 0.038), ('equalities', 0.038), ('nij', 0.038), ('individual', 0.038), ('repeat', 0.037), ('competition', 0.037), ('factor', 0.036), ('stationary', 0.035), ('pairwise', 0.034), ('st', 0.034), ('satis', 0.033), ('mod', 0.032), ('sparse', 0.032), ('binary', 0.03), ('asymptotic', 0.029), ('accuracy', 0.029), ('update', 0.029), ('individuals', 0.029), ('losses', 0.029), ('log', 0.029), ('four', 0.028), ('jth', 0.026), ('libsvm', 0.026), ('editors', 0.026), ('prove', 0.026), ('competitive', 0.026), ('kkt', 0.026), ('theorem', 0.025), ('class', 0.025), ('subsets', 0.024), ('balanced', 0.024), ('der', 0.024), ('letter', 0.024), ('obtains', 0.024), ('solving', 0.024), ('assumption', 0.023), ('testing', 0.023), ('coupling', 0.023), ('dietterich', 0.023), ('normalize', 0.022), ('omit', 0.022), ('suffers', 0.022), ('iterative', 0.022), ('national', 0.022), ('disjoint', 0.021), ('strict', 0.021), ('ways', 0.021), ('converges', 0.021), ('decrease', 0.021), ('classi', 0.02), ('direction', 0.02), ('proofs', 0.02), ('ma', 0.02), ('codes', 0.019), ('library', 0.019), ('limitation', 0.019), ('group', 0.019), ('probability', 0.019), ('allwein', 0.019), ('biometrika', 0.019), ('dashdot', 0.019), ('ein', 0.019), ('machines', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

2 0.31659204 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

3 0.093170002 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

Author: Yuanqing Lin, Daniel D. Lee

Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1 -norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.

4 0.092194758 138 nips-2004-Online Bounds for Bayesian Algorithms

Author: Sham M. Kakade, Andrew Y. Ng

Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1

5 0.091271363 160 nips-2004-Seeing through water

Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai

Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1

6 0.080381334 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

7 0.07696256 17 nips-2004-Adaptive Manifold Learning

8 0.073631294 130 nips-2004-Newscast EM

9 0.06563329 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

10 0.060369231 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

11 0.059393164 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

12 0.058640979 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

13 0.05391651 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

14 0.051219933 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

15 0.047661174 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

16 0.047237221 64 nips-2004-Experts in a Markov Decision Process

17 0.045572579 102 nips-2004-Learning first-order Markov models for control

18 0.043799356 82 nips-2004-Incremental Algorithms for Hierarchical Classification

19 0.042449765 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.042201322 100 nips-2004-Learning Preferences for Multiclass Problems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.15), (1, 0.033), (2, 0.088), (3, 0.037), (4, 0.032), (5, -0.065), (6, 0.017), (7, 0.038), (8, 0.079), (9, 0.078), (10, 0.029), (11, 0.019), (12, 0.126), (13, -0.108), (14, -0.033), (15, -0.111), (16, -0.087), (17, -0.199), (18, 0.056), (19, 0.215), (20, 0.202), (21, 0.072), (22, -0.207), (23, -0.218), (24, -0.046), (25, -0.105), (26, -0.161), (27, 0.19), (28, -0.12), (29, 0.031), (30, 0.045), (31, 0.102), (32, -0.098), (33, -0.149), (34, -0.014), (35, 0.115), (36, 0.001), (37, 0.029), (38, -0.071), (39, -0.125), (40, 0.03), (41, -0.063), (42, -0.083), (43, 0.048), (44, -0.043), (45, -0.075), (46, 0.038), (47, 0.056), (48, -0.02), (49, -0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98317635 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

2 0.82300806 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

Author: Philip M. Long, Xinyu Wu

Abstract: We establish a mistake bound for an ensemble method for classification based on maximizing the entropy of voting weights subject to margin constraints. The bound is the same as a general bound proved for the Weighted Majority Algorithm, and similar to bounds for other variants of Winnow. We prove a more refined bound that leads to a nearly optimal algorithm for learning disjunctions, again, based on the maximum entropy principle. We describe a simplification of the on-line maximum entropy method in which, after each iteration, the margin constraints are replaced with a single linear inequality. The simplified algorithm, which takes a similar form to Winnow, achieves the same mistake bounds. 1

3 0.46058649 138 nips-2004-Online Bounds for Bayesian Algorithms

Author: Sham M. Kakade, Andrew Y. Ng

Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1

4 0.35112163 130 nips-2004-Newscast EM

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1

5 0.26134148 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation

Author: Yuanqing Lin, Daniel D. Lee

Abstract: Bayesian Regularization and Nonnegative Deconvolution (BRAND) is proposed for estimating time delays of acoustic signals in reverberant environments. Sparsity of the nonnegative filter coefficients is enforced using an L1 -norm regularization. A probabilistic generative model is used to simultaneously estimate the regularization parameters and filter coefficients from the signal data. Iterative update rules are derived under a Bayesian framework using the Expectation-Maximization procedure. The resulting time delay estimation algorithm is demonstrated on noisy acoustic data.

6 0.25980252 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

7 0.25249779 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

8 0.24622841 160 nips-2004-Seeing through water

9 0.2402959 17 nips-2004-Adaptive Manifold Learning

10 0.23846227 102 nips-2004-Learning first-order Markov models for control

11 0.22968727 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

12 0.21115793 57 nips-2004-Economic Properties of Social Networks

13 0.20724137 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

14 0.20239688 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

15 0.19892749 34 nips-2004-Breaking SVM Complexity with Cross-Training

16 0.19017951 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

17 0.18909095 29 nips-2004-Beat Tracking the Graphical Model Way

18 0.18087006 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

19 0.18014953 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

20 0.17659657 82 nips-2004-Incremental Algorithms for Hierarchical Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.142), (15, 0.173), (26, 0.081), (31, 0.021), (33, 0.141), (35, 0.035), (39, 0.022), (50, 0.025), (97, 0.239)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8393181 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng

Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1

2 0.7607137 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

Author: Dori Peleg, Ron Meir

Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1

3 0.73529416 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

4 0.73444039 28 nips-2004-Bayesian inference in spiking neurons

Author: Sophie Deneve

Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation.

5 0.72827417 178 nips-2004-Support Vector Classification with Input Data Uncertainty

Author: Jinbo Bi, Tong Zhang

Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1

6 0.72747719 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

7 0.72427082 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

8 0.72186393 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

9 0.71968967 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

10 0.7192446 148 nips-2004-Probabilistic Computation in Spiking Populations

11 0.71830559 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

12 0.71796936 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

13 0.7168687 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

14 0.71563148 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

15 0.71494555 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

16 0.71464872 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

17 0.7146256 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

18 0.71375364 70 nips-2004-Following Curved Regularized Optimization Solution Paths

19 0.71364939 116 nips-2004-Message Errors in Belief Propagation

20 0.71235752 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications