nips nips2004 nips2004-206 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using far fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. [sent-7, score-0.498]
2 Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. [sent-11, score-0.168]
3 We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using far fewer labels. [sent-12, score-0.58]
4 We complement our theoretical findings with an empirical comparison on two text categorization tasks. [sent-13, score-0.084]
5 The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. [sent-14, score-0.752]
6 1 Introduction In this paper, we consider learning binary classification tasks with partially labelled data via selective sampling. [sent-15, score-0.321]
7 A selective sampling algorithm (see, e.g., [3, 12, 7] and references therein) is an on-line learning algorithm that receives a sequence of unlabelled instances, and decides whether or not to query the label of the current instance based on the instances and labels observed so far. [sent-18, score-0.396]
8 The idea is to let the algorithm determine which labels are most useful to its inference mechanism, so that redundant examples can be discarded on the fly and labels can be saved. [sent-19, score-0.23]
9 The overall goal of selective sampling is to fit real-world scenarios where labels are scarce or expensive. [sent-20, score-0.547]
10 As a by-now classical example, in a web-searching task, collecting web pages is a fairly automated process, but assigning them a label (a set of topics) often requires time-consuming and costly human expertise. [sent-21, score-0.115]
11 In these cases, it is clearly important to devise learning algorithms having the ability to exploit the label information as much as possible. [sent-22, score-0.183]
12 Furthermore, when we consider kernel-based algorithms [23, 9, 21], saving labels directly implies saving support vectors in the currently built hypothesis, which, in turn, implies saving running time in both training and test phases. [sent-23, score-0.244]
13 In this paper we present a worst-case analysis of two Perceptron-like selective sampling algorithms. [sent-29, score-0.456]
14 Our analysis relies on and contributes to a well-established way of studying linear-threshold algorithms within the mistake bound model of on-line learning (e.g., [18, 15, 11, 13, 14, 5]). [sent-30, score-0.234]
15 In a sense, this line of research complements earlier work on selective sampling [7], where a second-order algorithm was analyzed under precise stochastic assumptions about the way data are generated. [sent-34, score-0.507]
16 This is exactly what we face in this paper: we avoid any assumption whatsoever on the data-generating process, but we are still able to prove meaningful statements about the label efficiency features of our algorithms. [sent-35, score-0.115]
17 These experiments confirm our theoretical results, and show the effectiveness of our margin-based label selection rule. [sent-37, score-0.137]
18 We say that S is linearly separable if there exists a vector u ∈ Rn such that y_t u⊤x_t > 0 for t = 1, . . . , T. [sent-43, score-0.759]
19 We consider the following selective sampling variant of a standard on-line learning model. [sent-47, score-0.456]
20 In the generic trial t the algorithm receives instance x_t from the environment, outputs a prediction ŷ_t ∈ {−1, +1} about the label y_t associated with x_t, and decides whether or not to query the label y_t. [sent-52, score-2.354]
21 No matter what the algorithm decides, we say that the algorithm has made a prediction mistake if ŷ_t ≠ y_t. [sent-53, score-0.968]
22 We measure the performance of the algorithm by the total number of mistakes it makes on S (including the trials where the true label remains hidden). [sent-54, score-0.281]
23 Given a comparison class of predictors, the goal of the algorithm is to bound the amount by which this total number of mistakes differs, on an arbitrary sequence S, from some measure of the performance of the best predictor in hindsight within the comparison class. [sent-55, score-0.293]
24 Since we are dealing with (zero-threshold) linear-threshold algorithms, it is natural to take the comparison class to be the set of all (zero-threshold) linear-threshold predictors, i.e., predictors of the form x ↦ sgn(u⊤x) for u ∈ Rn. [sent-56, score-0.097]
25 Given a margin value γ > 0, we measure the performance of u on S by its cumulative hinge loss [11, 13], ∑_{t=1}^T D_γ(u; (x_t, y_t)), where D_γ(u; (x_t, y_t)) = max{0, γ − y_t u⊤x_t}. [sent-59, score-1.64]
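To make the hinge loss concrete, here is a minimal transcription in Python; this is an illustrative sketch (the function name and the use of numpy are our own choices — the paper only defines the formula above):

import numpy as np

def hinge_loss(u, x, y, gamma):
    # D_gamma(u; (x, y)) = max{0, gamma - y * u^T x}
    return max(0.0, gamma - y * float(np.dot(u, x)))

In particular, if S is linearly separable at margin γ (that is, y_t u⊤x_t ≥ γ for all t), every term of the cumulative hinge loss vanishes.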
26 Broadly speaking, the goal of the selective sampling algorithm is to achieve the best bound on the number of mistakes with as few queried labels as possible. [sent-60, score-0.838]
27 As in [6], our algorithms exploit a margin-based randomized rule to decide which labels to query. [sent-61, score-0.209]
28 Thus, our mistake bounds are actually worst-case over the training sequence and average-case over the internal randomization of the algorithms. [sent-62, score-0.178]
29 3 The algorithms and their analysis As a simple example, we start by turning the classical Perceptron algorithm [20] into a worst-case selective sampling algorithm. [sent-67, score-0.552]
30 The algorithm, described in Figure 1, has a real parameter b > 0. (The cumulative hinge loss measures to what extent the hyperplane u separates S at margin γ.) [sent-68, score-0.138]
31 ALGORITHM Selective sampling Perceptron algorithm Parameter b > 0. [sent-70, score-0.164]
32 Get instance vector x_t ∈ Rn and set r_t = v_{k−1}⊤ x̂_t, with x̂_t = x_t/||x_t||; 2. predict with ŷ_t = sgn(r_t) ∈ {−1, +1}; 3. draw a Bernoulli random variable Z_t ∈ {0, 1} of parameter b/(b + |r_t|); 4. [sent-76, score-1.861]
33 if Z_t = 1 then: (a) ask for label y_t ∈ {−1, +1}; (b) if ŷ_t ≠ y_t then update as follows: v_k = v_{k−1} + y_t x̂_t, k ← k + 1. [sent-79, score-2.093]
34 Figure 1: The selective sampling (first-order) Perceptron algorithm. [sent-80, score-0.456]
35 In each trial t the algorithm observes an instance vector x_t ∈ Rn and predicts the binary label ŷ_t through the sign of the margin value r_t = v_{k−1}⊤ x̂_t. [sent-83, score-1.74]
36 Then the algorithm decides whether to ask for the label yt through a simple randomized rule: a coin with bias b/(b + |rt |) is flipped; if the coin turns up heads (Zt = 1 in Figure 1) then the label yt is revealed. [sent-84, score-1.348]
37 Moreover, on a prediction mistake (ŷ_t ≠ y_t) the algorithm updates vector v_k according to the usual Perceptron additive rule. [sent-85, score-0.534]
38 On the other hand, if either the coin turns up tails or ŷ_t = y_t, no update takes place. [sent-86, score-0.863]
39 Thus, at the end of trial t, subscript k counts the number of updates made so far (plus one). [sent-88, score-0.079]
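The following Python sketch mirrors Figure 1. It is an illustration under our own interface assumptions: the stream of instances and the label_fn query oracle are hypothetical names, not part of the paper.

import numpy as np

def selective_perceptron(stream, b, n, seed=0):
    # stream yields (x, label_fn) pairs; calling label_fn() counts as one query.
    rng = np.random.default_rng(seed)
    v = np.zeros(n)                          # v_{k-1}, initialized to zero
    queried = updates = 0
    for x, label_fn in stream:
        x_hat = x / np.linalg.norm(x)        # normalize the instance
        r = float(v @ x_hat)                 # margin r_t = v_{k-1}^T x_hat_t
        y_pred = 1 if r >= 0 else -1         # predict via the sign of r_t
        if rng.random() < b / (b + abs(r)):  # coin with bias b/(b + |r_t|): Z_t = 1
            queried += 1
            y = label_fn()                   # ask for the true label y_t
            if y != y_pred:                  # observed mistake: additive update
                v += y * x_hat
                updates += 1
    return v, queried, updates

Note that only mistakes on queried trials trigger an update, while the analysis charges the algorithm for all mistakes, including the unobserved ones.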
40 In the following theorem we prove that our selective sampling version of the Perceptron algorithm can achieve, in expectation, the same mistake bound as the standard Perceptron’s using fewer labels. [sent-89, score-0.746]
41 Theorem 1 Let S = ((x_1, y_1), . . . , (x_T, y_T)) ∈ (Rn × {−1, +1})T be any sequence of examples and U_T be the (random) set of update trials for the algorithm in Figure 1 [sent-94, score-0.148]
42 (i.e., the set of trials t ≤ T such that ŷ_t ≠ y_t and Z_t = 1). [sent-95, score-0.81]
43 Then the expected number of mistakes made by the algorithm in Figure 1 is upper bounded by inf_{γ>0} inf_{u∈Rn} { ((2b+1)/(2b)) E[ ∑_{t∈U_T} (1/γ) D_γ(u; (x̂_t, y_t)) ] + ((2b+1)²/(8b)) ||u||²/γ² }. [sent-96, score-0.425]
44 The expected number of labels queried by the algorithm is equal to ∑_{t=1}^T E[ b/(b + |r_t|) ]. [sent-97, score-0.608]
45 Let M_t be the Bernoulli variable which is one iff ŷ_t ≠ y_t, and denote by k(t) the value of the update counter k in trial t just before the update k ← k + 1. [sent-99, score-0.884]
46 Then one can verify by direct inspection that choosing r_t = v_{k(t−1)}⊤ x̂_t (as in Figure 1) yields y_t u⊤x̂_t − y_t r_t = (1/2)||u − v_{k(t−1)}||² − (1/2)||u − v_{k(t)}||² + (1/2)||v_{k(t−1)} − v_{k(t)}||², holding for any u ∈ Rn. [sent-102, score-2.227]
47 Hence we conclude that the equality M_t Z_t (y_t u⊤x̂_t − y_t r_t) = (1/2)||u − v_{k(t−1)}||² − (1/2)||u − v_{k(t)}||² + (1/2)||v_{k(t−1)} − v_{k(t)}||² actually holds for all trials t. [sent-104, score-1.52]
48 We sum this equality over t = 1, . . . , T, while observing that M_t Z_t = 1 implies both ||v_{k(t−1)} − v_{k(t)}|| = 1 and y_t r_t ≤ 0. [sent-108, score-0.738]
49 This yields ∑_{t=1}^T M_t Z_t (y_t u⊤x̂_t + |r_t| − 1/2) ≤ (1/2)||u||². (1) Now, since the previous inequality holds for any comparison vector u ∈ Rn, we stretch u to ((b + 1/2)/γ) u, γ > 0 being a free parameter. [sent-110, score-0.829]
50 Then, by the very definition of D_γ(u; (x̂_t, y_t)), ((b + 1/2)/γ) y_t u⊤x̂_t ≥ ((b + 1/2)/γ) (γ − D_γ(u; (x̂_t, y_t))) for t = 1, . . . , T and all γ > 0. [sent-111, score-1.146]
51 Plugging into (1) and rearranging, ∑_t M_t Z_t (b + |r_t|) ≤ (b + 1/2) ∑_{t∈U_T} (1/γ) D_γ(u; (x̂_t, y_t)) + ((2b+1)²/(8γ²)) ||u||². (2) [sent-112, score-0.382]
52 ALGORITHM Selective sampling second-order Perceptron algorithm Parameter b > 0. [sent-113, score-0.164]
53 Get x_t ∈ Rn and set r_t = v_{k−1}⊤ (A_{k−1} + x̂_t x̂_t⊤)⁻¹ x̂_t, with x̂_t = x_t/||x_t||; 2. predict with ŷ_t = sgn(r_t) ∈ {−1, +1}; [sent-119, score-2.595]
54 3. draw a Bernoulli random variable Z_t ∈ {0, 1} of parameter b / (b + |r_t| + (1/2) r_t² (1 + x̂_t⊤ A_{k−1}⁻¹ x̂_t)); (3) 4. [sent-121, score-1.114]
55 if Z_t = 1 then: (a) ask for label y_t ∈ {−1, +1}; (b) if ŷ_t ≠ y_t then update as follows: v_k = v_{k−1} + y_t x̂_t, A_k = A_{k−1} + x̂_t x̂_t⊤, k ← k + 1. [sent-122, score-2.847]
56 Figure 2: The selective sampling second-order Perceptron algorithm. [sent-123, score-0.456]
57 The value of E[∑_{t=1}^T Z_t] (the expected number of queried labels) trivially follows from E[∑_{t=1}^T Z_t] = ∑_{t=1}^T E[ E[Z_t | Z_1, . . . , Z_{t−1}] ]. [sent-136, score-0.106]
58 We now consider the selective sampling version of the second-order Perceptron algorithm, as defined in [5]. [sent-140, score-0.456]
59 The algorithm predicts through the sign of the margin quantity r_t = v_{k−1}⊤ (A_{k−1} + x̂_t x̂_t⊤)⁻¹ x̂_t, and decides whether to ask for the label y_t through a randomized rule similar to the one in Figure 1. [sent-143, score-2.201]
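A matching sketch of the Figure 2 variant, under the same hypothetical interface as before; for clarity it recomputes matrix inverses directly rather than maintaining A⁻¹ incrementally (the paper does not prescribe an implementation either way):

import numpy as np

def selective_second_order_perceptron(stream, b, n, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros(n)
    A = np.eye(n)                              # A_{k-1}, initialized to the identity
    for x, label_fn in stream:
        x_hat = x / np.linalg.norm(x)
        r = float(v @ np.linalg.inv(A + np.outer(x_hat, x_hat)) @ x_hat)
        y_pred = 1 if r >= 0 else -1
        q = float(x_hat @ np.linalg.inv(A) @ x_hat)      # x_hat^T A_{k-1}^{-1} x_hat
        p = b / (b + abs(r) + 0.5 * r * r * (1.0 + q))   # query probability, rule (3)
        if rng.random() < p:                   # Z_t = 1: query the label
            y = label_fn()
            if y != y_pred:                    # observed mistake: second-order update
                v += y * x_hat
                A += np.outer(x_hat, x_hat)
    return v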
60 Again, the comparison between the second-order Perceptron’s bound and the one contained in Theorem 2 reveals that the selective sampling algorithm can achieve, in expectation, the same mistake bound (see Remark 1) using fewer labels. [sent-147, score-0.812]
61 Theorem 2 Using the notation of Theorem 1, the expected number of mistakes made by the algorithm in Figure 2 is upper bounded by inf_{γ>0} inf_{u∈Rn} { E[ ∑_{t∈U_T} (1/γ) D_γ(u; (x̂_t, y_t)) ] + (b/(2γ²)) u⊤E[A_{k(T)}]u + (1/(2b)) ∑_{i=1}^n E[ln(1 + λ_i)] }, [sent-148, score-0.858]
62 where λ_1, . . . , λ_n are the eigenvalues of the (random) correlation matrix ∑_{t∈U_T} x̂_t x̂_t⊤ and A_{k(T)} = I + ∑_{t∈U_T} x̂_t x̂_t⊤ (thus 1 + λ_i is the i-th eigenvalue of A_{k(T)}). [sent-151, score-1.508]
63 The expected number of labels queried by the algorithm is equal to ∑_{t=1}^T E[ b / (b + |r_t| + (1/2) r_t² (1 + x̂_t⊤ A_{k−1}⁻¹ x̂_t)) ]. [sent-152, score-0.936]
64 In addition to the notation given there, we define U_t as the set of update trials up to time t, i.e., U_t = {s ≤ t : M_s Z_s = 1}, and set R_t(u) = (1/2)||u||² + (1/2) ∑_{s∈U_t} (y_s − u⊤x̂_s)². [sent-156, score-0.078]
65 If trial t is such that M_t Z_t = 1, one can verify that (1/2)(y_t − r_t)² + (1/2) r_t² x̂_t⊤ A_{k(t)−1}⁻¹ x̂_t = inf_{u∈Rn} R_t(u) − inf_{u∈Rn} R_{t−1}(u) + (1/2) x̂_t⊤ A_{k(t)}⁻¹ x̂_t. (4) On the other hand, if trial t is such that M_t Z_t = 0 we have U_t = U_{t−1}, thus inf_{u∈Rn} R_{t−1}(u) = inf_{u∈Rn} R_t(u). [sent-160, score-0.314]
66 Hence the equality (1/2) M_t Z_t [ (y_t − r_t)² + r_t² x̂_t⊤ A_{k(t)−1}⁻¹ x̂_t ] = inf_{u∈Rn} R_t(u) − inf_{u∈Rn} R_{t−1}(u) + (1/2) M_t Z_t x̂_t⊤ A_{k(t)}⁻¹ x̂_t (5) holds for all trials t. [sent-161, score-2.478]
67 We sum (5) over t = 1, . . . , T, and observe that by definition R_T(u) = (1/2)||u||² + (1/2) ∑_{t=1}^T M_t Z_t (y_t − u⊤x̂_t)² and R_0(u) = (1/2)||u||² (thus inf_{u∈Rn} R_0(u) = 0). [sent-165, score-0.129]
68 After some manipulation one can see that (5) implies ∑_{t=1}^T M_t Z_t (y_t u⊤x̂_t + |r_t| + (1/2) r_t² (1 + x̂_t⊤ A_{k(t)−1}⁻¹ x̂_t)) ≤ (1/2) u⊤A_{k(T)}u + (1/2) ∑_{t=1}^T M_t Z_t x̂_t⊤ A_{k(t)}⁻¹ x̂_t, (6) holding for any u ∈ Rn. [sent-166, score-2.623]
69 First, as in [4, 10, 5], we upper bound the quadratic terms x̂_t⊤ A_{k(t)}⁻¹ x̂_t by ln( det(A_{k(t)}) / det(A_{k(t)−1}) ), where det denotes the determinant. [sent-168, score-0.874]
70 This gives ∑_{t=1}^T (1/2) M_t Z_t x̂_t⊤ A_{k(t)}⁻¹ x̂_t ≤ (1/2) ln( det(A_{k(T)}) / det(A_0) ) = ∑_{i=1}^n (1/2) ln(1 + λ_i). [sent-169, score-0.856]
71 Second, as in the proof of Theorem 1, we stretch the comparison vector u ∈ Rn to (b/γ) u and introduce hinge loss terms. [sent-170, score-0.167]
72 We obtain: ∑_{t=1}^T M_t Z_t (b + |r_t| + (1/2) r_t² (1 + x̂_t⊤ A_{k(t)−1}⁻¹ x̂_t)) ≤ b ∑_{t∈U_T} (1/γ) D_γ(u; (x̂_t, y_t)) + (b²/(2γ²)) u⊤A_{k(T)}u + (1/2) ∑_{i=1}^n ln(1 + λ_i). (7) [sent-171, score-1.151]
73 The bounds on E[∑_{t=1}^T M_t] and E[∑_{t=1}^T Z_t] then follow as in the proof of Theorem 1. [sent-172, score-0.433]
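The log-determinant step used above can be sanity-checked numerically. The snippet below is our own check (not from the paper): along a random sequence of rank-one updates it verifies x̂⊤A_{k(t)}⁻¹x̂ ≤ ln( det(A_{k(t)}) / det(A_{k(t)−1}) ):

import numpy as np

rng = np.random.default_rng(0)
A = np.eye(5)                               # A_0 = I
for _ in range(50):
    x = rng.normal(size=5)
    x /= np.linalg.norm(x)                  # unit-norm instance x_hat
    A_new = A + np.outer(x, x)              # rank-one update A_{k(t)} = A_{k(t)-1} + x x^T
    lhs = float(x @ np.linalg.inv(A_new) @ x)
    rhs = float(np.log(np.linalg.det(A_new) / np.linalg.det(A)))
    assert lhs <= rhs + 1e-12               # quadratic term <= log-det ratio
    A = A_new

The inequality holds because x̂⊤A_{k(t)}⁻¹x̂ = 1 − det(A_{k(t)−1})/det(A_{k(t)}) and 1 − 1/z ≤ ln z for z ≥ 1.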
74 Remark 1 Let us set for brevity D̂_γ(u; S) = E[ ∑_{t∈U_T} (1/γ) D_γ(u; (x̂_t, y_t)) ]. [sent-175, score-0.382]
75 Choosing b = (1/2) √(1 + 4γ² D̂_γ(u; S)/||u||²) in Theorem 1 gives the following bound on the expected number of mistakes: inf_{u∈Rn} { D̂_γ(u; S) + ||u||²/(2γ²) + (||u||/γ) √( D̂_γ(u; S) + ||u||²/(4γ²) ) }. (8) [sent-176, score-0.222]
76 This is an expectation version of the mistake bound for the standard (first-order) Perceptron algorithm [14]. [sent-177, score-0.248]
77 Notice that, in the special case when the data are linearly separable with margin γ*, the optimal tuning simplifies to b = 1/2 and yields the familiar Perceptron bound ||u||²/(γ*)². [sent-178, score-0.157]
78 On the other hand, if we set b = γ √( ∑_{i=1}^n E[ln(1 + λ_i)] / (u⊤E[A_{k(T)}]u) ) in Theorem 2, we are led to the bound inf_{u∈Rn} { D̂_γ(u; S) + (1/γ) √( (u⊤E[A_{k(T)}]u) ∑_{i=1}^n E[ln(1 + λ_i)] ) }. (9) [sent-179, score-0.314]
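The tuning behind (9) is just the minimizer of the last two terms of Theorem 2's bound. Writing C = u⊤E[A_{k(T)}]u and L = ∑_{i=1}^n E[ln(1 + λ_i)] (shorthands of ours), the function b ↦ (b/(2γ²)) C + L/(2b) is minimized at b = γ √(L/C), where it takes the value √(CL)/γ, which is exactly the second term of (9).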
79 (Clearly, this tuning relies on information not available ahead of time, since it depends on the whole sequence of examples.) [sent-180, score-0.085]
80 Bound (9) is an expectation version of the mistake bound for the (deterministic) second-order Perceptron algorithm, as proven in [5]. [sent-182, score-0.219]
81 In fact, the set of update trials UT is on average significantly smaller than the one for the deterministic algorithms. [sent-184, score-0.102]
82 This tends to shrink the three terms D̂_γ(u; S), u⊤E[A_{k(T)}]u, and ∑_{i=1}^n E[ln(1 + λ_i)], the main ingredients of the selective sampling bounds. [sent-185, score-0.539]
83 The second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05. [sent-193, score-0.098]
84 In the first experiment we compared the selective sampling algorithms in Figures 1 and 2 (for different values of b) with the standard second-order Perceptron algorithm (requesting all labels). [sent-199, score-0.527]
85 Such a comparison was devoted to studying the extent to which a reduced number of label requests might lead to performance degradation. [sent-200, score-0.177]
86 That is, we fixed a few values for parameter b, ran the selective sampling algorithm in Figure 2, and computed the fraction p̂ of labels requested over the training set. [sent-203, score-0.603]
87 We then ran a second-order selective sampling algorithm with (constant) label request probability equal to p̂ (independent of t). [sent-205, score-0.644]
88 The aim of this experiment was to investigate the effectiveness of a margin-based selective sampling criterion, as opposed to a random one. [sent-206, score-0.456]
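In terms of the Python sketches above, the only line that changes between the two query rules is the coin bias; p_hat denotes the label fraction measured in the first run (the variable name is ours):

# 2ND-ORDER:            query = rng.random() < b / (b + abs(r) + 0.5 * r * r * (1.0 + q))
# 2ND-ORDER-FIXED-BIAS: query = rng.random() < p_hat   # constant, margin-independent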
89 The standard second-order algorithm is denoted by 2ND-ORDER-ALL-LABELS, the selective sampling algorithms in Figures 1 and 2 are denoted by 1ST-ORDER and 2ND-ORDER, respectively, whereas the second-order algorithm with constant label request is denoted by 2ND-ORDER-FIXED-BIAS. [sent-209, score-0.715]
90 As evinced by Figure 3(a), there is a range of values for parameter b that makes 2ND-ORDER achieve almost the same performance as 2ND-ORDER-ALL-LABELS, but with a substantial reduction in the total number of queried labels. [sent-210, score-0.129]
91 We then compared the resulting label request rates and found 2ND-ORDER clearly the best among the three algorithms (its instantaneous label rate after 40,000 examples is less than 19%). [sent-212, score-0.412]
92 Finally, in Figure 3(d) we report a direct macroaveraged F-measure comparison. (We omitted to report on the first-order algorithms 1ST-ORDER-ALL-LABELS and 1ST-ORDER since they are always outperformed by their corresponding second-order algorithms.) [sent-215, score-0.12]
93 (Notice that the figures plot instantaneous label rates; hence the overall fraction of queried labels is obtained by integration.) [sent-216, score-0.338]
94 5 Conclusions and open problems We have introduced new Perceptron-like selective sampling algorithms for learning linear-threshold functions. [sent-284, score-0.554]
95 We analyzed these algorithms in a worst-case on-line learning setting, providing bounds on both the expected number of mistakes and the expected number of labels requested. [sent-285, score-0.327]
96 Our theoretical investigation naturally arises from the traditional way margin-based algorithms are analyzed in the mistake bound model of on-line learning [18, 15, 11, 13, 14, 5]. [sent-286, score-0.298]
97 This investigation suggests that our worst-case selective sampling algorithms can achieve on average the same accuracy as that of their more standard relatives, while allowing substantial label savings. [sent-287, score-0.653]
98 (1) Our selective sampling algorithms depend on a scale parameter b that has a significant influence on their practical performance. [sent-290, score-0.525]
99 (2) Theorems 1 and 2 do not make any explicit statement about the number of weight updates/support vectors computed by our selective sampling algorithms. [sent-292, score-0.456]
100 We would like to see a theoretical argument that enables us to combine the bound on the number of mistakes with that on the number of labels, giving rise to a meaningful upper bound on the number of updates. [sent-293, score-0.251]
wordName wordTfidf (topN-words)
[('yt', 0.382), ('xt', 0.377), ('rt', 0.333), ('selective', 0.321), ('zt', 0.318), ('mt', 0.24), ('perceptron', 0.174), ('ut', 0.143), ('sampling', 0.135), ('ak', 0.134), ('rn', 0.132), ('inf', 0.129), ('mistake', 0.123), ('label', 0.115), ('gentile', 0.097), ('labels', 0.091), ('fixed', 0.091), ('mistakes', 0.091), ('queried', 0.082), ('decides', 0.069), ('littlestone', 0.069), ('bound', 0.069), ('det', 0.065), ('trial', 0.056), ('linearthreshold', 0.056), ('ln', 0.051), ('margin', 0.051), ('bias', 0.05), ('query', 0.05), ('instantaneous', 0.05), ('randomized', 0.05), ('coin', 0.048), ('ohsumed', 0.048), ('colt', 0.047), ('trials', 0.046), ('request', 0.044), ('theorem', 0.044), ('hinge', 0.042), ('algorithms', 0.042), ('comparison', 0.041), ('ask', 0.041), ('macroaveraged', 0.037), ('milan', 0.037), ('subtree', 0.037), ('saving', 0.037), ('tuning', 0.037), ('remark', 0.036), ('therein', 0.033), ('bernoulli', 0.033), ('bounds', 0.033), ('shrink', 0.032), ('apple', 0.032), ('zaniboni', 0.032), ('lnai', 0.032), ('update', 0.032), ('proof', 0.031), ('nd', 0.03), ('warmuth', 0.03), ('stretch', 0.029), ('rearranging', 0.029), ('conconi', 0.029), ('algorithm', 0.029), ('figures', 0.029), ('order', 0.029), ('url', 0.027), ('reuters', 0.027), ('helmbold', 0.027), ('largely', 0.027), ('parameter', 0.027), ('expectation', 0.027), ('exploit', 0.026), ('ahead', 0.026), ('fewer', 0.025), ('turning', 0.025), ('corpus', 0.024), ('categories', 0.024), ('loss', 0.024), ('expected', 0.024), ('deterministic', 0.024), ('observing', 0.023), ('holding', 0.023), ('frequent', 0.023), ('made', 0.023), ('theoretical', 0.022), ('analyzed', 0.022), ('sequence', 0.022), ('extent', 0.021), ('sgn', 0.021), ('categorization', 0.021), ('morgan', 0.021), ('theorems', 0.02), ('predictors', 0.02), ('achieve', 0.02), ('investigation', 0.02), ('reproducing', 0.02), ('choosing', 0.02), ('instance', 0.02), ('cristianini', 0.019), ('turns', 0.019), ('examples', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
2 0.3469075 48 nips-2004-Convergence and No-Regret in Multiagent Learning
Author: Michael Bowling
Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1
3 0.28089184 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
4 0.21638048 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
5 0.18725747 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
Author: Tim K. Marks, J. C. Roddey, Javier R. Movellan, John R. Hershey
Abstract: We present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace's method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. We demonstrate an application of the system to 3D non-rigid face tracking.
6 0.18032306 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
7 0.16498226 148 nips-2004-Probabilistic Computation in Spiking Populations
8 0.14856914 183 nips-2004-Temporal-Difference Networks
9 0.13517433 175 nips-2004-Stable adaptive control with online learning
10 0.13397653 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
11 0.12909274 82 nips-2004-Incremental Algorithms for Hierarchical Classification
12 0.12664242 64 nips-2004-Experts in a Markov Decision Process
13 0.11280005 116 nips-2004-Message Errors in Belief Propagation
14 0.10861409 185 nips-2004-The Convergence of Contrastive Divergences
15 0.10505424 83 nips-2004-Incremental Learning for Visual Tracking
16 0.10037282 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks
17 0.10034269 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
18 0.1001756 136 nips-2004-On Semi-Supervised Classification
19 0.0996328 23 nips-2004-Analysis of a greedy active learning strategy
20 0.091990642 28 nips-2004-Bayesian inference in spiking neurons
topicId topicWeight
[(0, -0.23), (1, -0.062), (2, 0.382), (3, 0.099), (4, 0.247), (5, -0.297), (6, -0.01), (7, 0.025), (8, 0.085), (9, -0.03), (10, -0.102), (11, 0.096), (12, -0.12), (13, -0.164), (14, -0.023), (15, 0.071), (16, 0.061), (17, 0.053), (18, 0.026), (19, -0.065), (20, -0.074), (21, 0.028), (22, 0.092), (23, -0.029), (24, 0.066), (25, 0.031), (26, -0.114), (27, -0.002), (28, -0.028), (29, 0.023), (30, 0.024), (31, -0.024), (32, -0.009), (33, -0.026), (34, -0.059), (35, -0.055), (36, -0.031), (37, -0.011), (38, -0.004), (39, 0.067), (40, -0.008), (41, 0.064), (42, 0.084), (43, -0.052), (44, -0.046), (45, 0.01), (46, -0.079), (47, -0.016), (48, -0.032), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.98253036 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
2 0.7998116 48 nips-2004-Convergence and No-Regret in Multiagent Learning
3 0.68052435 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
4 0.57730335 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
5 0.5136764 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
Author: Shyam Visweswaran, Gregory F. Cooper
Abstract: Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. On experimental evaluation, this algorithm shows superior performance over model selection. We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data.
ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment.

The rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison.

2 Decision Theoretic Framework

We use the following notation. Capital letters like X, Z denote random variables and the corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and the subscript t denotes a generic test instance that is not in D.

We now characterize population-wide and instance-specific model selection in decision theoretic terms. Given training data D and a separate generic test instance t, the Bayes optimal prediction for Z_t is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows:

P(Z_t \mid \mathbf{X}_t, D) = \int P(Z_t \mid \mathbf{X}_t, M) \, P(M \mid D) \, dM .   (1)

The optimal population-wide model for predicting Z_t is as follows:

\max_M \sum_{\mathbf{X}_t} U\big[ P(Z_t \mid \mathbf{X}_t, D), \; P(Z_t \mid \mathbf{X}_t, M) \big] \, P(\mathbf{X} \mid D) ,   (2)

where the function U gives the utility of approximating the Bayes optimal estimate P(Z_t | X_t, D) with the estimate P(Z_t | X_t, M) obtained from model M. The term P(X | D) is given by:

P(\mathbf{X} \mid D) = \int P(\mathbf{X} \mid M) \, P(M \mid D) \, dM .   (3)

The optimal instance-specific model for predicting Z_t is as follows:

\max_M U\big[ P(Z_t \mid \mathbf{X}_t = \mathbf{x}_t, D), \; P(Z_t \mid \mathbf{X}_t = \mathbf{x}_t, M) \big] ,   (4)

where x_t are the values of the attributes of the test instance X_t for which we want to predict Z_t. The Bayes optimal estimate P(Z_t | X_t = x_t, D) in Equation 4 is derived using Equation 1, for the special case in which X_t = x_t.

The difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance X_t = x_t. For predicting Z_t in a given instance X_t = x_t, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. This observation provides support for developing instance-specific models. Equations 2 and 4 represent theoretical ideals for population-wide and instance-specific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. Ideal Bayesian model averaging is given by Equation 1.
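As a small illustration of the contrast between Equations 1 and 4, the following Python sketch enumerates a finite, made-up model space; the posteriors, per-instance predictions, and squared-error utility are assumptions chosen for the example, not quantities from the paper:

```python
# Three hypothetical models with made-up posteriors P(M | D) (summing to one)
# and made-up predictions P(Z_t = 1 | x_t, M) for two test instances.
models = {
    "M1": {"posterior": 0.5, "pred": {"x_a": 0.9, "x_b": 0.2}},
    "M2": {"posterior": 0.3, "pred": {"x_a": 0.4, "x_b": 0.8}},
    "M3": {"posterior": 0.2, "pred": {"x_a": 0.1, "x_b": 0.5}},
}

def bayes_optimal(x):
    # Equation 1 over a finite model space: posterior-weighted averaging.
    return sum(m["posterior"] * m["pred"][x] for m in models.values())

def utility(p_star, p_model):
    # One possible choice of U: negative squared distance from the optimum.
    return -(p_star - p_model) ** 2

def instance_specific_model(x):
    # Equation 4: the model with the highest utility for this one instance.
    p_star = bayes_optimal(x)
    return max(models, key=lambda m: utility(p_star, models[m]["pred"][x]))

for x in ("x_a", "x_b"):
    print(x, round(bayes_optimal(x), 3), instance_specific_model(x))
```

The printout shows that different test instances can be best served by different models, which is precisely the freedom that a single population-wide model gives up.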
Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models.

3 Instance-Specific Algorithm

We present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Z_t. A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail.

3.1 Bayesian Networks

A Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, Θ_G). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. Θ_G represents the parameterization of the model.

In a Bayesian network M, the immediate predecessors of a node X_i in X are called the parents of X_i and the successors, both immediate and remote, of X_i in X are called the descendants of X_i. The immediate successors of X_i are called the children of X_i. For each node X_i there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization Θ_G, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: a node (representing a variable) is independent of its nondescendants given just its parents. According to this Markov condition, the joint probability distribution on the model variables X = (X_1, X_2, ..., X_n) can be factored as follows:

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)) ,   (5)

where parents(X_i) denotes the set of nodes that are the parents of X_i. If X_i has no parents, then the set parents(X_i) is empty and P(X_i | parents(X_i)) is just P(X_i).

3.2 ISA Models

The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X either has an outgoing arc into the target node, Z, or receives an arc from Z. That is, each node is either a parent or a child of Z.
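To connect this restricted model class with the factorization in Equation 5, the sketch below evaluates the joint distribution of a small hypothetical network of this form (one antecedent node X1 and two consequent nodes X2 and X3) from its local conditional distributions; the structure and all probabilities are invented for illustration:

```python
import itertools

# A hypothetical network in the restricted class: X1 is an antecedent node
# (a parent of Z and of every consequent node); X2 and X3 are consequent
# nodes with parents {Z, X1}. All probabilities are made up.
parents = {"X1": (), "Z": ("X1",), "X2": ("Z", "X1"), "X3": ("Z", "X1")}
p_one = {  # P(node = 1 | parent_state); P(node = 0 | ...) is the complement
    "X1": {(): 0.5},
    "Z": {(0,): 0.3, (1,): 0.8},
    "X2": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9},
    "X3": {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.7, (1, 1): 0.8},
}

def joint(assignment):
    # Equation 5: the joint factors into local terms P(X_i | parents(X_i)).
    prob = 1.0
    for node, pa in parents.items():
        state = tuple(assignment[n] for n in pa)
        p1 = p_one[node][state]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

# Sanity check: the factored joint sums to one over all 2^4 assignments.
names = ("X1", "Z", "X2", "X3")
print(sum(joint(dict(zip(names, v)))
          for v in itertools.product((0, 1), repeat=4)))  # -> 1.0
```

The sanity check at the end confirms that the product of local terms defines a proper joint distribution.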
Thus, X is partitioned into two sets: the first containing nodes (X_1, ..., X_j in Figure 1), each of which is a parent of Z and of every node in the second set, and the second containing nodes (X_{j+1}, ..., X_k in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Z_t is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent.

[Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes, of which X_1, ..., X_j are instantiated to values x_1, ..., x_j in x_t. X_1, ..., X_j are present in the antecedent of the LBR rule and Z, X_{j+1}, ..., X_k (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition.]

3.3 Model Averaging

For Bayesian networks, Equation 1 can be evaluated as follows:

P(Z_t \mid \mathbf{x}_t, D) = \sum_M P(Z_t \mid \mathbf{x}_t, M) \, P(M \mid D) ,   (6)

with M being a Bayesian network comprised of structure G and parameters Θ_G. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks, where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. Then, with selective model averaging, P(Z_t | x_t, D) is estimated as:

P(Z_t \mid \mathbf{x}_t, D) \cong \frac{\sum_{M \in R} P(Z_t \mid \mathbf{x}_t, M) \, P(M \mid D)}{\sum_{M \in R} P(M \mid D)} .   (7)

Assuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation:

P(Z_t \mid \mathbf{x}_t, D) \cong \frac{\sum_{M \in R} P(Z_t \mid \mathbf{x}_t, M) \, P(D \mid M)}{\sum_{M \in R} P(D \mid M)} .   (8)

The (unconditional) marginal likelihood P(D | M) in Equation 8 is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model's fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution, which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as:

\mathrm{score}(D, M) = \prod_{p=1}^{d} P(z_p \mid \mathbf{x}_1, \ldots, \mathbf{x}_p, z_1, \ldots, z_{p-1}, M) .   (9)

This score is computed in a predictive and sequential fashion: for the p-th training instance, the probability of predicting the observed value z_p for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values x_p of the attributes in the p-th instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4].
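A minimal sketch of the prequential score of Equation 9, and of the score-weighted averaging of Equation 8, might look as follows; the two toy models and the four-instance dataset are assumptions for illustration, not part of ISA's actual implementation:

```python
def conditional_score(data, predict):
    """Equation 9: the sequential product of P(z_p | x_1..x_p, z_1..z_{p-1}, M).
    `predict(history, x)` returns a dict of class probabilities under model M,
    with parameters estimated from the (x, z) pairs seen so far."""
    score, history = 1.0, []
    for x, z in data:
        score *= predict(history, x)[z]
        history.append((x, z))
    return score

def averaged_posterior(models, data, x, z):
    # Equation 8, with score(D, M) standing in for the marginal likelihood.
    weights = [conditional_score(data, m) for m in models]
    preds = [m(data, x)[z] for m in models]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Two toy "models": Laplace-smoothed class frequencies, and a uniform guesser.
def frequency_model(history, x):
    ones = sum(1 for _, z in history if z == 1)
    p1 = (ones + 1) / (len(history) + 2)
    return {1: p1, 0: 1.0 - p1}

def uniform_model(history, x):
    return {0: 0.5, 1: 0.5}

data = [("a", 1), ("b", 1), ("a", 0), ("b", 1)]
print(conditional_score(data, frequency_model))   # prequential score, Eq. 9
print(averaged_posterior([frequency_model, uniform_model], data, "a", 1))
```

Note how the score rewards models whose sequential predictions assign high probability to the observed labels, which is exactly the conditional emphasis argued for above.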
The parameters of the Bayesian network M, used in the above computations, are defined as follows:

P(X_i = k \mid \mathrm{parents}(X_i) = j) \equiv \theta_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}} ,   (10)

where (i) N_{ijk} is the number of instances in the training dataset D where variable X_i has value k and the parents of X_i are in state j, (ii) N_{ij} = \sum_k N_{ijk}, (iii) α_{ijk} is a parameter prior that can be interpreted as the belief equivalent of having previously observed α_{ijk} instances in which variable X_i has value k and the parents of X_i are in state j, and (iv) α_{ij} = \sum_k α_{ijk}.

3.4 Model Search

We use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence x_t in the test instance while searching for models that have high scores as given by Equation 9. This is followed by the second phase, which searches for models having the greatest impact on the prediction of Z_t for the test instance, as we formalize below.

The first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node's value is not missing in the test instance. The newly derived models are added to a priority queue, Q. During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. The first phase terminates after a user-specified number of models have accumulated in R.

The second phase searches for models that change the current model-averaged estimate of P(Z_t | x_t, D) the most. The idea here is to find viable competing models for making this posterior probability prediction. When no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Z_t | x_t, D); more change is better. In particular, M* is the model in Q that maximizes the following function:

f(R, M^*) = g(R) - g(R \cup \{M^*\}) ,   (11)

where, for a set of models S, the function g(S) computes the approximate model-averaged prediction for Z_t as follows:

g(S) = \frac{\sum_{M \in S} P(Z_t \mid \mathbf{x}_t, M) \, \mathrm{score}(D, M)}{\sum_{M \in S} \mathrm{score}(D, M)} .   (12)

The second phase terminates when no new model can be found that has a value (as given by Equation 11) greater than a user-specified minimum threshold T. The final distribution of Z_t is then computed from the models in R using Equation 8.

4 Evaluation

We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java.
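Before turning to the results, the quantities just defined are compact enough to sketch directly: the smoothed parameter estimate of Equation 10 and the phase-2 ranking functions of Equations 11 and 12. Representing a model as a (prediction, score) pair is our simplification for the example:

```python
def theta(n_ijk, n_ij, alpha_ijk=1.0, alpha_ij=2.0):
    # Equation 10 with the uniform alpha_ijk = 1 priors used in Section 4
    # (alpha_ij defaults to 2.0, i.e. a binary-valued X_i).
    return (n_ijk + alpha_ijk) / (n_ij + alpha_ij)

def g(S):
    # Equation 12: score-weighted model-averaged prediction for Z_t.
    # Each model is simplified to a (P(Z_t = 1 | x_t, M), score(D, M)) pair.
    total = sum(score for _, score in S)
    return sum(pred * score for pred, score in S) / total

def f(R, m_star):
    # Equation 11 as written in the paper: the change in the averaged
    # prediction caused by adding M* to the selected set R.
    return g(R) - g(R + [m_star])

R = [(0.80, 0.06), (0.75, 0.03)]
candidate = (0.20, 0.04)
print(theta(n_ijk=3, n_ij=10))   # (3 + 1) / (10 + 2) = 0.333...
print(g(R), f(R, candidate))     # phase 2 ranks candidates by this change
```

In the actual search, f would be evaluated for every model in the priority queue Q, and the maximizer moved into R until no candidate exceeds the threshold T.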
The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase 2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all α_{ijk} were set to 1. All error rates were obtained by averaging the results from two runs of stratified 10-fold cross-validation (20 trials in total), similar to the procedure used by Zheng and Webb. Since both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy-based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error rate statistics obtained from the 20 trials.

The results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR.

Table 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A "-" indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate. A "+" indicates that the ISA error rate is statistically significantly higher.

Dataset            Size   Classes   Num. Attrib.   Nom. Attrib.    SB      LBR     ISA
Annealing           898      6          6              32           1.9     3.5     2.7
Audiology           226     24          0              69          29.6    29.4    30.9
Breast (W)          699      2          9               0           2.9+    2.8+    3.7
Chess (KR-KP)      3169      2          0              36          12.1     3.0     1.1
Credit (A)          690      2          6               9          13.8    14.0    13.9
Echocardiogram      131      2          6               1          33.2    34.0    35.9
Glass               214      6          9               0          26.9    27.8    29.0
Heart (C)           303      2         13               0          16.2    16.2    17.5
Hepatitis           155      2          6              13          14.2-   14.2-   11.3
Horse colic         368      2          7              15          20.2    16.0    17.8
House votes 84      435      2          0              16           5.1    10.1     7.0
Hypothyroid        3163      2          7              18           0.9     0.9     1.4
Iris                150      3          4               0           6.0     6.0     5.3
Labor                57      2          8               8           8.8     6.1     7.0
LED 24              200     10          0              24          40.5    40.5    40.3
Liver disorders     345      2          6               0          36.8    36.8    36.8
Lung cancer          32      3          0              56          56.3    56.3    56.3
Lymphography        148      4          0              18          15.5-   15.5-   13.2
Pima                768      2          8               0          21.8    22.0    22.3
Postoperative        90      3          1               7          33.3    33.3    33.3
Primary tumor       339     22          0              17          54.4    53.5    54.2
Promoters           106      2          0              57           7.5     7.5     7.5
Solar flare        1389      2          0              10          20.2    18.3+   19.4
Sonar               208      2         60               0          15.4    15.6    15.9
Soybean             683     19          0              35           7.1     7.2     7.9
Splice junction    3177      3          0              60           4.7     4.3     4.4
Tic-Tac-Toe         958      2          0               9          30.3-   13.7-   10.3
Wine                178      3         13               0           1.1     1.1     1.1
Zoo                 101      7          0              16           6.4     8.4     8.4

With respect to computation times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM.

5 Conclusions and Future Research

We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR; this supports the view that additional predictive improvement can be obtained by performing instance-specific model averaging rather than just instance-specific model selection.
In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search, as well as more general model spaces such as unrestricted Bayesian networks. The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation, and in the future we plan to explore other models within this framework. Our ultimate interest is to apply these instance-specific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care.

Acknowledgments

This work was supported by grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh's Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments.

References

[1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84.
[2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417.
[3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
[4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann.
[5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA.
[6] Fayyad, U.M. and Irani, K.B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann.
6 0.49845454 175 nips-2004-Stable adaptive control with online learning
7 0.47382292 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem
8 0.45887628 185 nips-2004-The Convergence of Contrastive Divergences
9 0.44940829 183 nips-2004-Temporal-Difference Networks
10 0.41366377 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
11 0.40662432 136 nips-2004-On Semi-Supervised Classification
12 0.37207124 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters
13 0.37143415 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
14 0.36495969 33 nips-2004-Brain Inspired Reinforcement Learning
15 0.36396125 64 nips-2004-Experts in a Markov Decision Process
16 0.36331052 82 nips-2004-Incremental Algorithms for Hierarchical Classification
17 0.34292707 28 nips-2004-Bayesian inference in spiking neurons
18 0.3428371 116 nips-2004-Message Errors in Belief Propagation
19 0.32225081 138 nips-2004-Online Bounds for Bayesian Algorithms
20 0.31589386 148 nips-2004-Probabilistic Computation in Spiking Populations
topicId topicWeight
[(13, 0.11), (15, 0.123), (25, 0.14), (26, 0.093), (31, 0.016), (32, 0.032), (33, 0.184), (35, 0.016), (39, 0.032), (50, 0.072), (51, 0.038), (63, 0.02), (94, 0.019)]
simIndex simValue paperId paperTitle
1 0.94479114 52 nips-2004-Discrete profile alignment via constrained information bottleneck
Author: Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin
Abstract: Amino acid profiles, which capture position-specific mutation probabilities, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol comparisons, making profiles impractical for many large datasets. Furthermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Moreover, alignments between these sequences, while nearly as accurate as the full profile-profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise alignment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete encoding can expand the range of sequence problems to which profile information can be applied. 1
2 0.9447071 109 nips-2004-Mass Meta-analysis in Talairach Space
Author: Finn Å. Nielsen
Abstract: We provide a method for mass meta-analysis in a neuroinformatics database containing stereotaxic Talairach coordinates from neuroimaging experiments. Database labels are used to group the individual experiments, e.g., according to cognitive function, and the consistent patterns of the experiments within the groups are determined. The method voxelizes each group of experiments via a kernel density estimation, forming probability density volumes. The values in the probability density volumes are compared to null-hypothesis distributions generated by resamplings from the entire unlabeled set of experiments, and the distances to the null hypotheses are used to sort the voxels across groups of experiments. This allows for mass meta-analysis, with the construction of a list with the most prominent associations between brain areas and group labels. Furthermore, the method can be used for functional labeling of voxels. 1
3 0.94295597 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
Author: K. Wong, S. W. Lim, Z. Gao
Abstract: We consider multi-agent systems whose agents compete for resources by striving to be in the minority group. The agents adapt to the environment by reinforcement learning of the preferences of the policies they hold. Diversity of preferences of policies is introduced by adding random biases to the initial cumulative payoffs of their policies. We explain and provide evidence that agent cooperation becomes increasingly important when diversity increases. Analyses of these mechanisms yield excellent agreement with simulations over nine decades of data. 1
same-paper 4 0.93757272 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni
Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1
5 0.90797275 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang
Abstract: In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and a general English language model prior as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1
6 0.87137002 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
7 0.86103833 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.86057144 69 nips-2004-Fast Rates to Bayes for Kernel Machines
9 0.85107046 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
10 0.84899378 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
11 0.84418947 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
12 0.84409738 23 nips-2004-Analysis of a greedy active learning strategy
13 0.84386134 124 nips-2004-Multiple Alignment of Continuous Time Series
14 0.84381801 131 nips-2004-Non-Local Manifold Tangent Learning
15 0.84353596 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
16 0.84303606 103 nips-2004-Limits of Spectral Clustering
17 0.84253401 82 nips-2004-Incremental Algorithms for Hierarchical Classification
18 0.83917487 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
19 0.83837992 116 nips-2004-Message Errors in Belief Propagation
20 0.83723456 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination