nips nips2005 nips2005-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequence of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequence of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. [sent-3, score-0.969]
2 We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. [sent-5, score-0.679]
3 This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model. [sent-6, score-0.916]
4 1 Introduction There are many striking similarities between results in the standard batch learning setting, where labeled examples are assumed to be drawn independently from some distribution, and the more difficult online setting, where labeled examples arrive in an arbitrary sequence. [sent-7, score-0.902]
5 Moreover, there are simple procedures that convert any online learning algorithm to an equally good batch learning algorithm [8]. [sent-8, score-0.701]
6 It is well-known that the online setting is strictly harder than the batch setting, even for the simple one-dimensional class of threshold functions on the interval [0, 1]. [sent-10, score-0.616]
7 Hence, we consider the online transductive model of Ben-David, Kushilevitz, and Mansour [2]. [sent-11, score-0.5]
8 In this model, an arbitrary but unknown sequence of n examples (x1 , y1 ), . . . , (xn , yn ) is fixed. [sent-12, score-0.148]
9 The set of unlabeled examples is then presented to the learner, Σ = {xi |1 ≤ i ≤ n}. [sent-16, score-0.187]
10 The examples are then revealed, in an online manner, to the learner, for i = 1, 2, . . . , n. [sent-17, score-0.319]
11 The learner observes example xi (along with all previous labeled examples (x1 , y1 ), . . . , [sent-21, score-0.320]
12 (xi−1 , yi−1 ) and the unlabeled example set Σ) and must predict yi . [sent-24, score-0.134]
13 After this occurs, the learner compares its number of mistakes to the minimum number of mistakes made by any function in a target class F of functions f : X → {−1, 1} (such as linear threshold functions). [sent-26, score-1.456]
14 Note that our results are in this type of agnostic model [7], where we allow for arbitrary labels, unlike the realizable setting, i.e., where some f ∈ F is assumed to label all examples correctly. [sent-27, score-0.212]
15 With this simple transductive knowledge of what unlabeled examples are to come, one can use existing expert algorithms to inefficiently learn any class of finite VC dimension, similar to the batch setting. [sent-30, score-0.782]
16 How does one use unlabeled examples efficiently to guarantee good online performance? [sent-31, score-0.412]
17 Our efficient algorithm A2 converts a proper1 batch algorithm to a proper online algorithm (both in the agnostic setting). [sent-32, score-0.901]
18 It then “hallucinates” random examples by taking some number of unlabeled examples and labeling them randomly. [sent-34, score-0.281]
19 It appends these examples to those observed so far and predicts according to the batch algorithm that finds the hypothesis of minimum empirical error on the combined data. [sent-35, score-0.642]
20 The idea of “hallucinating” and optimizing has been used for designing efficient online algorithms [6, 5, 1, 10, 4] in situations where exponential weighting schemes were inefficient. [sent-36, score-0.225]
21 In the context of transductive learning, it seems to be a natural way to try to use the unlabeled examples in conjunction with a batch learner. [sent-38, score-0.782]
22 Let #mistakes(f, σn ) denote the number of mistakes of a function f ∈ F on a particular sequence σn ∈ (X × {−1, 1})^n , and #mistakes(A, σn ) denote the same quantity for a transductive online learning algorithm A. [sent-39, score-1.265]
23 There is an efficient randomized transductive online algorithm A2 such that, for any n > 1 and σn ∈ (X × {−1, 1})^n , E[#mistakes(A2 , σn )] ≤ minf ∈F #mistakes(f, σn ) + 2. [sent-43, score-0.821]
24 The algorithm is computationally efficient in the sense that it runs in time poly(n), given an efficient proper batch learning algorithm. [sent-45, score-0.567]
25 Consequently, the classes of functions learnable in the batch and transductive online settings are the same. [sent-48, score-0.921]
26 The classes of functions properly learnable by computationally efficient algorithms in the proper batch and transductive online settings are identical, as well. [sent-49, score-1.12]
27 In addition to the new algorithm, this is interesting because it helps justify a long line of work suggesting that whatever can be done in a batch setting can also be done online. [sent-50, score-0.366]
28 Our result is surprising in light of earlier work by Blum showing that a slightly different online model is harder than its batch analog for computational reasons and not information-theoretic reasons [3]. [sent-51, score-0.545]
29 In Section 2, we define the transductive online model. [sent-52, score-0.5]
30 In Section 5, we discuss open problems such as extending the results to improper learning and the efficient realizable case. [sent-57, score-0.134]
31 2 Models and definitions The transductive online model considered by Ben-David, Kushilevitz, and Mansour [2] consists of an instance space X and a label set Y, which we will always take to be binary, Y = {−1, 1}. [sent-58, score-0.522]
32 An arbitrary n > 0 and an arbitrary sequence of labeled examples (x1 , y1 ), . . . , (xn , yn ) is fixed. [sent-59, score-0.232]
33 For notational convenience, we define σi to be the subsequence of the first i examples (x1 , y1 ), . . . (Footnote 1: A proper learning algorithm is one that always outputs a hypothesis h ∈ F .) [sent-64, score-0.316]
34 , (xi , yi ), and Σ to be the set of all unlabeled examples in σn , Σ = {xi | i ∈ {1, 2, . . . , n}}. [sent-68, score-0.228]
35 The number of mistakes of A on the sequence σn = (x1 , y1 ), . . . , [sent-73, score-0.697]
36 (xn , yn ) is #mistakes(A, σn ) = |{i | A(n, Σ, xi , σi−1 ) ≠ yi }|. [sent-76, score-0.155]
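To make the bookkeeping concrete, the following short Python sketch evaluates this mistake count for a transductive online learner; the callable interface A(n, Sigma, x_i, sigma_prev) and all names here are illustrative assumptions, not anything specified in the paper.

    def count_mistakes(A, sigma_n):
        # #mistakes(A, sigma_n) = |{ i : A(n, Sigma, x_i, sigma_{i-1}) != y_i }|
        n = len(sigma_n)
        Sigma = {x for (x, _) in sigma_n}               # unlabeled examples, revealed up front
        mistakes = 0
        for i, (x_i, y_i) in enumerate(sigma_n):
            prediction = A(n, Sigma, x_i, sigma_n[:i])  # learner sees all earlier labeled pairs
            if prediction != y_i:
                mistakes += 1
        return mistakes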
37 Formally, paralleling agnostic learning [7],2 we define an efficient transductive online learner A for class F to be one for which the learning algorithm runs in time poly(n) and achieves, for any ε > 0, E[#mistakes(A, σn )] ≤ minf ∈F #mistakes(f, σn ) + εn, for n = poly(1/ε). [sent-79, score-0.969]
38 2.1 Proper learning Proper batch learning requires one to output a hypothesis h ∈ F. [sent-81, score-0.448]
39 An efficient proper batch learning algorithm for F is a batch learning algorithm B that, given any ε > 0, with n = poly(1/ε) many examples from any distribution D, outputs an h ∈ F of expected error E[PrD [h(x) ≠ y]] ≤ minf ∈F PrD [f (x) ≠ y] + ε and runs in time poly(n). [sent-82, score-1.309]
40 Any efficient proper batch learning algorithm B can be converted into an efficient empirical error minimizer M that, for any n, given any data set σn ∈ (X × Y)n , outputs an f ∈ F of minimal empirical error on σn . [sent-84, score-0.881]
41 If we run B only on σn , it is not guaranteed to output a hypothesis of minimum empirical error. [sent-86, score-0.149]
42 If B indeed returns a hypothesis h whose empirical error exceeds that of the best f ∈ F by less than 1/n, it must be a hypothesis of minimum empirical error on σn . [sent-88, score-0.237]
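The following is a minimal sketch of how such a conversion from B to M might look; it is an illustration only and rests on several assumptions: B takes an accuracy parameter eps and an i.i.d. sample, samples are drawn from the uniform (empirical) distribution over σn, and eps is set below 1/(2n) so that, as noted above, any hypothesis within less than 1/n of optimal empirical error is an exact empirical minimizer. The repetition count and sample size are placeholders, not the paper's construction.

    import random

    def empirical_error(h, data):
        # fraction of examples in data mislabeled by h
        return sum(1 for (x, y) in data if h(x) != y) / len(data)

    def make_empirical_minimizer(B, trials=10):
        def M(sigma_n):
            n = len(sigma_n)
            eps = 1.0 / (2 * n)
            best_h, best_err = None, float("inf")
            for _ in range(trials):
                # feed B an i.i.d. sample from the empirical distribution over sigma_n
                sample = [random.choice(sigma_n) for _ in range(100 * n)]  # placeholder for poly(1/eps)
                h = B(eps, sample)
                err = empirical_error(h, sigma_n)
                if err < best_err:
                    best_h, best_err = h, err
            return best_h
        return M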
43 To define proper learning in an online setting, it is helpful to think of the following alternative definition of transductive online learning. [sent-92, score-0.906]
44 After the ith hypothesis hi is output, the example (xi , yi ) is revealed, and it is clear whether the learner made an error. [sent-97, score-0.29]
45 Formally, the (possibly randomized) algorithm A still takes as input n, Σ, and σi−1 (but no longer xi ), and outputs hi : X → {−1, 1}; it errs if hi (xi ) ≠ yi . [sent-98, score-1.274]
46 (Footnote 2: It is more common in online learning to bound the total number of mistakes of an online algorithm on an arbitrary sequence. [sent-99, score-0.346]
47 We bound its error rate, as is usual for batch learning.) [sent-101, score-0.378]
48 To see that this model is equivalent to the previous definition, note that any algorithm A that outputs hypotheses hi can be used to make predictions hi (xi ) on example i (it errs if hi (xi ) ≠ yi ). [sent-102, score-0.413]
49 This is because A can be viewed as outputting hi : X → {−1, 1}, where the function hi is defined by setting hi (x) equal to the prediction of algorithm A on the sequence σi−1 followed by the example x, for each x ∈ X . [sent-104, score-0.395]
50 A (possibly randomized) transductive online algorithm in this model is defined to be proper for a family of functions F if it always outputs hi ∈ F . [sent-108, score-0.846]
51 3 Warmup: the realizable case In this section, we consider the realizable special case in which there is some f ∈ F which correctly labels all examples. [sent-109, score-0.214]
52 Say there are at most L different ways to label the examples in Σ according to functions f ∈ F, so 1 ≤ L ≤ 2^|Σ| . [sent-116, score-0.141]
53 In the transductive online model, L is determined by Σ and F only. [sent-117, score-0.5]
54 Hence, as long as prediction occurs only on examples x ∈ Σ, there are effectively only L different functions in F that matter, and we can thus pick L such functions that give rise to the L different labelings. [sent-118, score-0.184]
55 On the ith example, one could simply take a majority vote of fj (xi ) over the labelings fj consistent with the examples so far (the so-called halving algorithm), and this would easily ensure at most log2 (L) mistakes, because each mistake eliminates at least half of the consistent labelings. [sent-119, score-0.402]
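A compact sketch of this majority-vote (halving) rule, assuming the L labelings are represented as Python callables returning +1 or −1:

    def halving_predict(consistent_fs, x_i):
        # majority vote over the labelings still consistent with the revealed labels
        votes = sum(f(x_i) for f in consistent_fs)
        return 1 if votes >= 0 else -1

    def halving_update(consistent_fs, x_i, y_i):
        # each mistake removes at least half of consistent_fs, giving <= log2(L) mistakes
        return [f for f in consistent_fs if f(x_i) == y_i]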
56 One can also use the following proper learning algorithm. [sent-120, score-0.181]
57 Proper transductive online learning algorithm in the realizable case: • Preprocessing: Given the set of unlabeled examples Σ, take L functions f1 , f2 , . . . , fL realizing the L distinct labelings of Σ. [sent-121, score-0.887]
58 The above randomized proper learning algorithm makes at most d log(n) mistakes in expectation on any sequence of examples of length n ≥ 2, provided that there is some mistake-free f ∈ F. [sent-129, score-1.073]
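The prediction step of this algorithm is omitted in this extract; the proof below suggests that, on each example, the learner predicts with a single labeling fj drawn uniformly at random from those still consistent with the revealed labels. A sketch under that assumption:

    import random

    def random_consistent_predict(consistent_fs, x_i):
        # proper prediction: one hypothesis chosen uniformly from the consistent labelings,
        # so the chance of a mistake on example i is (V_{i-1} - V_i) / V_{i-1}
        f = random.choice(consistent_fs)
        return f(x_i)

    # after (x_i, y_i) is revealed, prune the consistent set exactly as in halving_update above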
59 Let Vi be the number of labelings fj consistent with the first i examples, so that L = V0 ≥ V1 ≥ · · · ≥ Vn ≥ 1 and L ≤ n^d , by Sauer's lemma [11] for n ≥ 2, where d is the VC dimension of F. [sent-131, score-0.286]
60 Observe that the number of consistent labelings that make a mistake on the ith example is exactly Vi−1 − Vi . [sent-132, score-0.165]
61 Hence, the total expected number of mistakes is Σ_{i=1}^{n} (Vi−1 − Vi )/Vi−1 ≤ Σ_{i=1}^{n} ( 1/Vi−1 + 1/(Vi−1 − 1) + · · · ). [sent-133, score-0.689]
62 More formally, take L functions with the following properties: for each pair 1 ≤ j, k ≤ L with j ≠ k, there exists x ∈ Σ such that fj (x) ≠ fk (x), and for every f ∈ F , there exists a 1 ≤ j ≤ L with f (x) = fj (x) for all x ∈ Σ. [sent-137, score-0.179]
63 Note that this closely matches what one achieves in the batch setting. [sent-139, score-0.350]
64 As in the batch setting, no better bound can be given, up to a constant factor. [sent-140, score-0.339]
65 4 General setting We now consider the more difficult unrealizable setting where we have an unconstrained sequence of examples (though we still work in a transductive setting). [sent-141, score-0.525]
66 We begin by presenting a known (inefficient) extension of the halving algorithm from the previous section that works in the agnostic (unrealizable) setting and is similar to the previous algorithm. [sent-142, score-0.197]
67 Inefficient proper transductive online learning algorithm A1 : • Preprocessing: Given the set of unlabeled examples Σ, take L functions f1 , f2 , . . . , fL realizing the L distinct labelings of Σ, as in the previous section. [sent-143, score-0.934]
68 • Update: for each j for which fj (xi ) ≠ yi , reduce wj : wj := wj · (1 − √((log L)/n)). [sent-155, score-0.266]
69 Using an analysis very similar to that of Weighted Majority [9], one can show that, for any n > 1 and sequence of examples σn ∈ (X × {−1, 1})^n , E[#mistakes(A1 , σn )] ≤ minf ∈F #mistakes(f, σn ) + 2√(dn log n), where d is the VC dimension of F. [sent-156, score-0.425]
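A sketch of A1 as a randomized weighted-majority scheme over the L labelings. The extract omits A1's prediction rule and its exact learning rate, so both are assumptions here: predictions are made by sampling fj with probability proportional to wj, and the penalty factor mirrors the update step shown above.

    import math
    import random

    def a1_predict(fs, weights, x_i):
        # sample a labeling with probability proportional to its weight and follow it
        f = random.choices(fs, weights=weights, k=1)[0]
        return f(x_i)

    def a1_update(fs, weights, x_i, y_i, n):
        # multiplicative penalty for every labeling that errs on (x_i, y_i)
        eta = math.sqrt(math.log(len(fs)) / n)
        return [w * (1 - eta) if f(x_i) != y_i else w
                for f, w in zip(fs, weights)]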
70 4.1 Efficient algorithm We can only hope to get an efficient proper online algorithm when there is an efficient proper batch algorithm. [sent-159, score-0.935]
71 As discussed in Section 2.1, this means that there is a batch algorithm M that, given any data set, efficiently finds a hypothesis h ∈ F of minimum empirical error. [sent-161, score-0.491]
72 (In fact, most proper learning algorithms work this way to begin with.) [sent-162, score-0.181]
73 Efficient transductive online learning algorithm A2 : • Preprocessing: Given the set of unlabeled examples Σ, create a hallucinated data set τ as follows. [sent-164, score-0.809]
74 For each example x ∈ Σ, choose an integer rx uniformly at random such that −n^{1/4} ≤ rx ≤ n^{1/4} . [sent-166, score-0.152]
75 • To predict on xi : output hypothesis M (τ σi−1 ) ∈ F, where τ σi−1 is the concatenation of the hallucinated examples and the observed labeled examples so far. [sent-169, score-0.467]
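A sketch of A2's two steps. The extract does not spell out how the random integers rx turn into labeled examples; the identity used in Lemma 2 below, #mistakes(f, τ) = |τ|/2 − (1/2) Σ f(xi) ri, is consistent with adding |rx| copies of x labeled sign(rx), which is what this sketch assumes. M denotes the empirical error minimizer obtained from the proper batch learner as in Section 2.1.

    import math
    import random

    def hallucinate(Sigma, n):
        # for each unlabeled x, draw an integer r_x uniformly in [-n^(1/4), n^(1/4)]
        # and (assumed) append |r_x| copies of x labeled sign(r_x)
        bound = int(math.floor(n ** 0.25))
        tau = []
        for x in Sigma:
            r = random.randint(-bound, bound)
            if r != 0:
                tau.extend([(x, 1 if r > 0 else -1)] * abs(r))
        return tau

    def a2_predict(M, tau, sigma_prev, x_i):
        # predict with an empirical minimizer of the hallucinated plus observed labeled data
        h = M(tau + list(sigma_prev))
        return h(x_i)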
76 Let A′2 denote the hypothetical algorithm that, for each i, predicts f (xi ) where f ∈ F is any empirical minimizer on the concatenated data τ σi (note that τ σi includes the current labeled example). [sent-177, score-0.242]
77 Then the total number of mistakes of A′2 satisfies #mistakes(A′2 , σn ) ≤ minf ∈F #mistakes(f, τ σn ) − minf ∈F #mistakes(f, τ ). [sent-180, score-1.129]
78 The above lemma means that if one could use the hypothetical “be the leader” algorithm then one would make no more mistakes than the best f ∈ F . [sent-185, score-0.842]
79 The first inequality above holds by induction hypothesis, and the second follows from the fact that gt is an empirical minimizer of τ σt . [sent-207, score-0.259]
80 The total number of mistakes of the hypothetical algorithm proposed in the lemma is Σ_{i=1}^{n} mi , which gives the lemma by rearranging (1) for t = n. [sent-209, score-0.981]
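The inequality chain (1) referred to above is not reproduced in this extract. Under the standard be-the-leader argument, and consistent with the two facts just cited, it presumably takes the following form, where mt is the indicator that gt, an empirical minimizer of τσt, errs on (xt, yt):

    \sum_{i=1}^{t} m_i
    \;\le\; \Big(\min_{f\in\mathcal{F}} \#\mathrm{mistakes}(f,\tau\sigma_{t-1})
                 - \min_{f\in\mathcal{F}} \#\mathrm{mistakes}(f,\tau)\Big) + m_t
    \;\le\; \min_{f\in\mathcal{F}} \#\mathrm{mistakes}(f,\tau\sigma_{t})
            - \min_{f\in\mathcal{F}} \#\mathrm{mistakes}(f,\tau), \tag{1}

    \text{since } \min_{f} \#\mathrm{mistakes}(f,\tau\sigma_{t-1}) + m_t
    \le \#\mathrm{mistakes}(g_t,\tau\sigma_{t-1}) + m_t
    = \#\mathrm{mistakes}(g_t,\tau\sigma_{t})
    = \min_{f} \#\mathrm{mistakes}(f,\tau\sigma_{t}).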
81 For any σn , Eτ [minf ∈F #mistakes(f, τ σn )] ≤ Eτ [|τ |/2] + minf ∈F #mistakes(f, σn ). [sent-211, score-0.22]
82 For the first part of the lemma, let g = M (σn ) be an empirical minimizer on σn . [sent-215, score-0.17]
83 The last inequality holds because each example in τ is equally likely to have a ± label, so the expected number of mistakes of any fixed g ∈ F on τ is E[|τ |/2]. [sent-217, score-0.708]
84 For the second part of the lemma, observe that we can write the number of mistakes of f on τ as #mistakes(f, τ ) = |τ |/2 − (1/2) Σ_{i=1}^{n} f (xi ) ri . [sent-219, score-0.669]
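For reference, a short derivation of this identity under the copy construction assumed above (|ri| copies of xi labeled sign(ri)); this is a reconstruction, not text from the paper:

    \#\mathrm{mistakes}(f,\tau)
    = \sum_{i=1}^{n} |r_i|\,\mathbf{1}\{f(x_i) \neq \mathrm{sign}(r_i)\}
    = \sum_{i=1}^{n} \frac{|r_i| - f(x_i)\,r_i}{2}
    = \frac{|\tau|}{2} - \frac{1}{2}\sum_{i=1}^{n} f(x_i)\,r_i,

    \text{because } f(x_i)\,r_i = |r_i| \text{ when } f(x_i)=\mathrm{sign}(r_i)
    \text{ and } -|r_i| \text{ otherwise, and } |\tau| = \textstyle\sum_i |r_i|.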
85 For any σn , for any i, with probability at least 1 − n^{−1/4} over τ , M (τ σi−1 ) is an empirical minimizer of τ σi . [sent-234, score-0.170]
86 If all f ∈ F predict the same sign f (xi ), then the sets of empirical minimizers of τ σi−1 and τ σi are equal and the lemma holds trivially. [sent-240, score-0.166]
87 For any sequence π ∈ (X × Y)∗ , define, s+ (π) = minf ∈F+ #mistakes(f, π) and s− (π) = minf ∈F− #mistakes(f, π). [sent-241, score-0.468]
88 If they are equal, then an empirical minimizer can lie in either F+ or F− . [sent-244, score-0.170]
89 It is now clear that if M (τ σi−1 ) is not also an empirical minimizer of τ σi then s+ (τ σi−1 ) = s− (τ σi−1 ). [sent-249, score-0.17]
90 If we fix σn and the random choices rx for each x ∈ Σ \ {xi }, as we increase or decrease ri by 1, ∆ correspondingly increases or decreases by 1. [sent-251, score-0.155]
91 Combining Lemmas 1 and 2, if on each period i we used any minimizer of empirical error on the data τ σi , we would have a total number of mistakes of at most minf ∈F #mistakes(f, σn ) + 1. [sent-255, score-1.105]
92 Then, its total number of mistakes can be at most p larger than this bound. [sent-258, score-0.689]
93 By Lemma 3, the expected number p of periods i in which an empirical minimizer of τ σi is not used is ≤ n^{3/4} . [sent-259, score-0.170]
94 Hence, the expected total number of mistakes of A2 is at most, Eτ [#mistakes(A2 , σn )] ≤ minf ∈F #mistakes(f, σn ) + 1. [sent-260, score-0.909]
95 The above algorithm is still costly in the sense that we must re-run the batch error minimizer for each prediction we would like to make. [sent-264, score-0.519]
96 5 Conclusions and open problems We have given an algorithm for learning in the transductive online setting and established several equivalences between efficient proper batch and transductive online learnability. [sent-269, score-1.588]
97 Hence, it is an open question whether efficient learnability in the batch and transductive online settings is the same in the realizable case. [sent-271, score-0.956]
98 In addition, our computationally efficient algorithm requires polynomially more examples than its inefficient counterpart. [sent-272, score-0.16]
99 It would be nice to have the best of both worlds, namely a computationally efficient algorithm that achieves a number of mistakes that is at most O(√(dn log n)). [sent-273, score-0.844]
100 Additionally, it would be nice to remove the restriction to proper algorithms. [sent-274, score-0.173]
wordName wordTfidf (topN-words)
[('mistakes', 0.669), ('batch', 0.32), ('transductive', 0.275), ('online', 0.225), ('minf', 0.22), ('proper', 0.154), ('minimizer', 0.113), ('realizable', 0.107), ('vc', 0.096), ('examples', 0.094), ('unlabeled', 0.093), ('xi', 0.093), ('hi', 0.087), ('lemma', 0.082), ('agnostic', 0.079), ('labelings', 0.079), ('ri', 0.079), ('fj', 0.077), ('learnable', 0.076), ('rx', 0.076), ('learner', 0.075), ('hallucination', 0.072), ('vi', 0.07), ('poly', 0.067), ('inef', 0.066), ('randomized', 0.06), ('labeled', 0.058), ('empirical', 0.057), ('hypothesis', 0.055), ('hallucinated', 0.054), ('gt', 0.051), ('hypothetical', 0.05), ('fix', 0.05), ('ef', 0.049), ('setting', 0.046), ('leader', 0.043), ('yi', 0.041), ('algorithm', 0.041), ('log', 0.04), ('kalai', 0.04), ('outputs', 0.039), ('mi', 0.037), ('wj', 0.036), ('kushilevitz', 0.036), ('maxf', 0.036), ('prd', 0.036), ('sauer', 0.036), ('technological', 0.036), ('toyota', 0.036), ('unrealizable', 0.036), ('blum', 0.036), ('cient', 0.032), ('ith', 0.032), ('errs', 0.031), ('halving', 0.031), ('wlog', 0.031), ('predicts', 0.031), ('achieves', 0.03), ('learnability', 0.029), ('mistake', 0.029), ('sham', 0.029), ('fl', 0.029), ('mansour', 0.029), ('revealed', 0.028), ('sequence', 0.028), ('majority', 0.027), ('learning', 0.027), ('minimizers', 0.027), ('xn', 0.027), ('error', 0.026), ('gi', 0.026), ('arbitrary', 0.026), ('computationally', 0.025), ('il', 0.025), ('coin', 0.025), ('functions', 0.025), ('consistent', 0.025), ('wl', 0.024), ('vn', 0.024), ('preprocessing', 0.023), ('remark', 0.023), ('chicago', 0.023), ('dimension', 0.023), ('mt', 0.022), ('label', 0.022), ('rise', 0.021), ('yn', 0.021), ('converted', 0.021), ('equally', 0.02), ('total', 0.02), ('properly', 0.02), ('dn', 0.02), ('ine', 0.02), ('nice', 0.019), ('induction', 0.019), ('prediction', 0.019), ('inequality', 0.019), ('output', 0.019), ('bounds', 0.019), ('minimum', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
2 0.1811436 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
3 0.16877653 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
4 0.11947069 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
5 0.093673781 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
6 0.085372284 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.080795974 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
8 0.076529749 148 nips-2005-Online Discovery and Learning of Predictive State Representations
9 0.076144606 105 nips-2005-Large-Scale Multiclass Transduction
10 0.065956526 117 nips-2005-Learning from Data of Variable Quality
11 0.063796893 95 nips-2005-Improved risk tail bounds for on-line algorithms
12 0.059441339 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
13 0.058233153 50 nips-2005-Convex Neural Networks
14 0.056307249 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
15 0.052326765 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
16 0.050479393 47 nips-2005-Consistency of one-class SVM and related algorithms
17 0.048263464 85 nips-2005-Generalization to Unseen Cases
18 0.047905415 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.047286961 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
20 0.044189285 160 nips-2005-Query by Committee Made Real
topicId topicWeight
[(0, 0.15), (1, 0.075), (2, -0.021), (3, -0.102), (4, 0.094), (5, 0.132), (6, 0.036), (7, 0.063), (8, -0.075), (9, 0.079), (10, -0.108), (11, 0.191), (12, 0.033), (13, -0.093), (14, -0.006), (15, -0.151), (16, -0.061), (17, 0.02), (18, -0.014), (19, 0.059), (20, -0.076), (21, 0.094), (22, 0.033), (23, 0.036), (24, 0.203), (25, -0.062), (26, -0.064), (27, 0.032), (28, 0.038), (29, 0.055), (30, -0.027), (31, -0.151), (32, -0.034), (33, -0.044), (34, 0.131), (35, -0.025), (36, 0.055), (37, -0.005), (38, -0.011), (39, 0.05), (40, -0.034), (41, -0.05), (42, 0.033), (43, 0.003), (44, 0.047), (45, 0.045), (46, -0.022), (47, 0.109), (48, -0.074), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.94426286 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
2 0.87224203 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
3 0.75715911 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
4 0.56544685 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
5 0.50087243 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
6 0.41196492 117 nips-2005-Learning from Data of Variable Quality
7 0.39347097 148 nips-2005-Online Discovery and Learning of Predictive State Representations
8 0.38126829 85 nips-2005-Generalization to Unseen Cases
9 0.3587271 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
10 0.34409374 105 nips-2005-Large-Scale Multiclass Transduction
11 0.3417336 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
12 0.32976061 160 nips-2005-Query by Committee Made Real
13 0.32246396 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
14 0.31925857 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
15 0.28762186 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
16 0.27025557 62 nips-2005-Efficient Estimation of OOMs
17 0.26275018 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
18 0.26183677 192 nips-2005-The Information-Form Data Association Filter
19 0.26091027 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
20 0.26015693 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
topicId topicWeight
[(3, 0.093), (10, 0.023), (11, 0.01), (27, 0.021), (31, 0.019), (34, 0.084), (55, 0.024), (69, 0.024), (73, 0.03), (88, 0.058), (91, 0.503)]
simIndex simValue paperId paperTitle
1 0.97782904 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks
Author: Ricky Der, Daniel D. Lee
Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.
2 0.96960717 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf
Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.
3 0.9328444 118 nips-2005-Learning in Silicon: Timing is Everything
Author: John V. Arthur, Kwabena Boahen
Abstract: We describe a neuromorphic chip that uses binary synapses with spike timing-dependent plasticity (STDP) to learn stimulated patterns of activity and to compensate for variability in excitability. Specifically, STDP preferentially potentiates (turns on) synapses that project from excitable neurons, which spike early, to lethargic neurons, which spike late. The additional excitatory synaptic current makes lethargic neurons spike earlier, thereby causing neurons that belong to the same pattern to spike in synchrony. Once learned, an entire pattern can be recalled by stimulating a subset. 1 Variability in Neural Systems Evidence suggests precise spike timing is important in neural coding, specifically, in the hippocampus. The hippocampus uses timing in the spike activity of place cells (in addition to rate) to encode location in space [1]. Place cells employ a phase code: the timing at which a neuron spikes relative to the phase of the inhibitory theta rhythm (5-12Hz) conveys information. As an animal approaches a place cell’s preferred location, the place cell not only increases its spike rate, but also spikes at earlier phases in the theta cycle. To implement a phase code, the theta rhythm is thought to prevent spiking until the input synaptic current exceeds the sum of the neuron threshold and the decreasing inhibition on the downward phase of the cycle [2]. However, even with identical inputs and common theta inhibition, neurons do not spike in synchrony. Variability in excitability spreads the activity in phase. Lethargic neurons (such as those with high thresholds) spike late in the theta cycle, since their input exceeds the sum of the neuron threshold and theta inhibition only after the theta inhibition has had time to decrease. Conversely, excitable neurons (such as those with low thresholds) spike early in the theta cycle. Consequently, variability in excitability translates into variability in timing. We hypothesize that the hippocampus achieves its precise spike timing (about 10ms) through plasticity enhanced phase-coding (PEP). The source of hippocampal timing precision in the presence of variability (and noise) remains unexplained. Synaptic plasticity can compensate for variability in excitability if it increases excitatory synaptic input to neurons in inverse proportion to their excitabilities. Recasting this in a phase-coding framework, we desire a learning rule that increases excitatory synaptic input to neurons directly related to their phases. Neurons that lag require additional synaptic input, whereas neurons that lead 120µm 190µm A B Figure 1: STDP Chip. A The chip has a 16-by-16 array of microcircuits; one microcircuit includes four principal neurons, each with 21 STDP circuits. B The STDP Chip is embedded in a circuit board including DACs, a CPLD, a RAM chip, and a USB chip, which communicates with a PC. require none. The spike timing-dependent plasticity (STDP) observed in the hippocampus satisfies this requirement [3]. It requires repeated pre-before-post spike pairings (within a time window) to potentiate and repeated post-before-pre pairings to depress a synapse. Here we validate our hypothesis with a model implemented in silicon, where variability is as ubiquitous as it is in biology [4]. Section 2 presents our silicon system, including the STDP Chip. Section 3 describes and characterizes the STDP circuit. Section 4 demonstrates that PEP compensates for variability and provides evidence that STDP is the compensation mechanism. 
Section 5 explores a desirable consequence of PEP: unconventional associative pattern recall. Section 6 discusses the implications of the PEP model, including its benefits and applications in the engineering of neuromorphic systems and in the study of neurobiology. 2 Silicon System We have designed, submitted, and tested a silicon implementation of PEP. The STDP Chip was fabricated through MOSIS in a 1P5M 0.25µm CMOS process, with just under 750,000 transistors in just over 10mm2 of area. It has a 32 by 32 array of excitatory principal neurons commingled with a 16 by 16 array of inhibitory interneurons that are not used here (Figure 1A). Each principal neuron has 21 STDP synapses. The address-event representation (AER) [5] is used to transmit spikes off chip and to receive afferent and recurrent spike input. To configure the STDP Chip as a recurrent network, we embedded it in a circuit board (Figure 1B). The board has five primary components: a CPLD (complex programmable logic device), the STDP Chip, a RAM chip, a USB interface chip, and DACs (digital-to-analog converters). The central component in the system is the CPLD. The CPLD handles AER traffic, mediates communication between devices, and implements recurrent connections by accessing a lookup table, stored in the RAM chip. The USB interface chip provides a bidirectional link with a PC. The DACs control the analog biases in the system, including the leak current, which the PC varies in real-time to create the global inhibitory theta rhythm. The principal neuron consists of a refractory period and calcium-dependent potassium circuit (RCK), a synapse circuit, and a soma circuit (Figure 2A). RCK and the synapse are ISOMA Soma Synapse STDP Presyn. Spike PE LPF A Presyn. Spike Raster AH 0 0.1 Spike probability RCK Postsyn. Spike B 0.05 0.1 0.05 0.1 0.08 0.06 0.04 0.02 0 0 Time(s) Figure 2: Principal neuron. A A simplified schematic is shown, including: the synapse, refractory and calcium-dependent potassium channel (RCK), soma, and axon-hillock (AH) circuits, plus their constituent elements, the pulse extender (PE) and the low-pass filter (LPF). B Spikes (dots) from 81 principal neurons are temporally dispersed, when excited by poisson-like inputs (58Hz) and inhibited by the common 8.3Hz theta rhythm (solid line). The histogram includes spikes from five theta cycles. composed of two reusable blocks: the low-pass filter (LPF) and the pulse extender (PE). The soma is a modified version of the LPF, which receives additional input from an axonhillock circuit (AH). RCK is inhibitory to the neuron. It consists of a PE, which models calcium influx during a spike, and a LPF, which models calcium buffering. When AH fires a spike, a packet of charge is dumped onto a capacitor in the PE. The PE’s output activates until the charge decays away, which takes a few milliseconds. Also, while the PE is active, charge accumulates on the LPF’s capacitor, lowering the LPF’s output voltage. Once the PE deactivates, this charge leaks away as well, but this takes tens of milliseconds because the leak is smaller. The PE’s and the LPF’s inhibitory effects on the soma are both described below in terms of the sum (ISHUNT ) of the currents their output voltages produce in pMOS transistors whose sources are at Vdd (see Figure 2A). Note that, in the absence of spikes, these currents decay exponentially, with a time-constant determined by their respective leaks. The synapse circuit is excitatory to the neuron. 
It is composed of a PE, which represents the neurotransmitter released into the synaptic cleft, and a LPF, which represents the bound neurotransmitter. The synapse circuit is similar to RCK in structure but differs in function: It is activated not by the principal neuron itself but by the STDP circuits (or directly by afferent spikes that bypass these circuits, i.e., fixed synapses). The synapse’s effect on the soma is also described below in terms of the current (ISYN ) its output voltage produces in a pMOS transistor whose source is at Vdd. The soma circuit is a leaky integrator. It receives excitation from the synapse circuit and shunting inhibition from RCK and has a leak current as well. Its temporal behavior is described by: τ dISOMA ISYN I0 + ISOMA = dt ISHUNT where ISOMA is the current the capacitor’s voltage produces in a pMOS transistor whose source is at Vdd (see Figure 2A). ISHUNT is the sum of the leak, refractory, and calciumdependent potassium currents. These currents also determine the time constant: τ = C Ut κISHUNT , where I0 and κ are transistor parameters and Ut is the thermal voltage. STDP circuit ~LTP SRAM Presynaptic spike A ~LTD Inverse number of pairings Integrator Decay Postsynaptic spike Potentiation 0.1 0.05 0 0.05 0.1 Depression -80 -40 0 Presynaptic spike Postsynaptic spike 40 Spike timing: t pre - t post (ms) 80 B Figure 3: STDP circuit design and characterization. A The circuit is composed of three subcircuits: decay, integrator, and SRAM. B The circuit potentiates when the presynaptic spike precedes the postsynaptic spike and depresses when the postsynaptic spike precedes the presynaptic spike. The soma circuit is connected to an AH, the locus of spike generation. The AH consists of model voltage-dependent sodium and potassium channel populations (modified from [6] by Kai Hynna). It initiates the AER signaling process required to send a spike off chip. To characterize principal neuron variability, we excited 81 neurons with poisson-like 58Hz spike trains (Figure 2B). We made these spike trains poisson-like by starting with a regular 200Hz spike train and dropping spikes randomly, with probability of 0.71. Thus spikes were delivered to neurons that won the coin toss in synchrony every 5ms. However, neurons did not lock onto the input synchrony due to filtering by the synaptic time constant (see Figure 2B). They also received a common inhibitory input at the theta frequency (8.3Hz), via their leak current. Each neuron was prevented from firing more than one spike in a theta cycle by its model calcium-dependent potassium channel population. The principal neurons’ spike times were variable. To quantify the spike variability, we used timing precision, which we define as twice the standard deviation of spike times accumulated from five theta cycles. With an input rate of 58Hz the timing precision was 34ms. 3 STDP Circuit The STDP circuit (related to [7]-[8]), for which the STDP Chip is named, is the most abundant, with 21,504 copies on the chip. This circuit is built from three subcircuits: decay, integrator, and SRAM (Figure 3A). The decay and integrator are used to implement potentiation, and depression, in a symmetric fashion. The SRAM holds the current binary state of the synapse, either potentiated or depressed. For potentiation, the decay remembers the last presynaptic spike. Its capacitor is charged when that spike occurs and discharges linearly thereafter. 
A postsynaptic spike samples the charge remaining on the capacitor, passes it through an exponential function, and dumps the resultant charge into the integrator. This charge decays linearly thereafter. At the time of the postsynaptic spike, the SRAM, a cross-coupled inverter pair, reads the voltage on the integrator’s capacitor. If it exceeds a threshold, the SRAM switches state from depressed to potentiated (∼LTD goes high and ∼LTP goes low). The depression side of the STDP circuit is exactly symmetric, except that it responds to postsynaptic activation followed by presynaptic activation and switches the SRAM’s state from potentiated to depressed (∼LTP goes high and ∼LTD goes low). When the SRAM is in the potentiated state, the presynaptic 50 After STDP 83 92 100 Timing precision(ms) Before STDP 75 B Before STDP After STDP 40 30 20 10 0 50 60 70 80 90 Input rate(Hz) 100 50 58 67 text A 0.2 0.4 Time(s) 0.6 0.2 0.4 Time(s) 0.6 C Figure 4: Plasticity enhanced phase-coding. A Spike rasters of 81 neurons (9 by 9 cluster) display synchrony over a two-fold range of input rates after STDP. B The degree of enhancement is quantified by timing precision. C Each neuron (center box) sends synapses to (dark gray) and receives synapses from (light gray) twenty-one randomly chosen neighbors up to five nodes away (black indicates both connections). spike activates the principal neuron’s synapse; otherwise the spike has no effect. We characterized the STDP circuit by activating a plastic synapse and a fixed synapse– which elicits a spike at different relative times. We repeated this pairing at 16Hz. We counted the number of pairings required to potentiate (or depress) the synapse. Based on this count, we calculated the efficacy of each pairing as the inverse number of pairings required (Figure 3B). For example, if twenty pairings were required to potentiate the synapse, the efficacy of that pre-before-post time-interval was one twentieth. The efficacy of both potentiation and depression are fit by exponentials with time constants of 11.4ms and 94.9ms, respectively. This behavior is similar to that observed in the hippocampus: potentiation has a shorter time constant and higher maximum efficacy than depression [3]. 4 Recurrent Network We carried out an experiment designed to test the STDP circuit’s ability to compensate for variability in spike timing through PEP. Each neuron received recurrent connections from 21 randomly selected neurons within an 11 by 11 neighborhood centered on itself (see Figure 4C). Conversely, it made recurrent connections to randomly chosen neurons within the same neighborhood. These connections were mediated by STDP circuits, initialized to the depressed state. We chose a 9 by 9 cluster of neurons and delivered spikes at a mean rate of 50 to 100Hz to each one (dropping spikes with a probability of 0.75 to 0.5 from a regular 200Hz train) and provided common theta inhibition as before. We compared the variability in spike timing after five seconds of learning with the initial distribution. Phase coding was enhanced after STDP (Figure 4A). Before STDP, spike timing among neurons was highly variable (except for the very highest input rate). After STDP, variability was virtually eliminated (except for the very lowest input rate). Initially, the variability, characterized by timing precision, was inversely related to the input rate, decreasing from 34 to 13ms. After five seconds of STDP, variability decreased and was largely independent of input rate, remaining below 11ms. 
Potentiated synapses 25 A Synaptic state after STDP 20 15 10 5 0 B 50 100 150 200 Spiking order 250 Figure 5: Compensating for variability. A Some synapses (dots) become potentiated (light) while others remain depressed (dark) after STDP. B The number of potentiated synapses neurons make (pluses) and receive (circles) is negatively (r = -0.71) and positively (r = 0.76) correlated to their rank in the spiking order, respectively. Comparing the number of potentiated synapses each neuron made or received with its excitability confirmed the PEP hypothesis (i.e., leading neurons provide additional synaptic current to lagging neurons via potentiated recurrent synapses). In this experiment, to eliminate variability due to noise (as opposed to excitability), we provided a 17 by 17 cluster of neurons with a regular 200Hz excitatory input. Theta inhibition was present as before and all synapses were initialized to the depressed state. After 10 seconds of STDP, a large fraction of the synapses were potentiated (Figure 5A). When the number of potentiated synapses each neuron made or received was plotted versus its rank in spiking order (Figure 5B), a clear correlation emerged (r = -0.71 or 0.76, respectively). As expected, neurons that spiked early made more and received fewer potentiated synapses. In contrast, neurons that spiked late made fewer and received more potentiated synapses. 5 Pattern Completion After STDP, we found that the network could recall an entire pattern given a subset, thus the same mechanisms that compensated for variability and noise could also compensate for lack of information. We chose a 9 by 9 cluster of neurons as our pattern and delivered a poisson-like spike train with mean rate of 67Hz to each one as in the first experiment. Theta inhibition was present as before and all synapses were initialized to the depressed state. Before STDP, we stimulated a subset of the pattern and only neurons in that subset spiked (Figure 6A). After five seconds of STDP, we stimulated the same subset again. This time they recruited spikes from other neurons in the pattern, completing it (Figure 6B). Upon varying the fraction of the pattern presented, we found that the fraction recalled increased faster than the fraction presented. We selected subsets of the original pattern randomly, varying the fraction of neurons chosen from 0.1 to 1.0 (ten trials for each). We classified neurons as active if they spiked in the two second period over which we recorded. Thus, we characterized PEP’s pattern-recall performance as a function of the probability that the pattern in question’s neurons are activated (Figure 6C). At a fraction of 0.50 presented, nearly all of the neurons in the pattern are consistently activated (0.91±0.06), showing robust pattern completion. We fitted the recall performance with a sigmoid that reached 0.50 recall fraction with an input fraction of 0.30. No spurious neurons were activated during any trials. Rate(Hz) Rate(Hz) 8 7 7 6 6 5 5 0.6 0.4 2 0.2 0 0 3 3 2 1 1 A 0.8 4 4 Network activity before STDP 1 Fraction of pattern actived 8 0 B Network activity after STDP C 0 0.2 0.4 0.6 0.8 Fraction of pattern stimulated 1 Figure 6: Associative recall. A Before STDP, half of the neurons in a pattern are stimulated; only they are activated. B After STDP, half of the neurons in a pattern are stimulated, and all are activated. C The fraction of the pattern activated grows faster than the fraction stimulated. 
6 Discussion Our results demonstrate that PEP successfully compensates for graded variations in our silicon recurrent network using binary (on–off) synapses (in contrast with [8], where weights are graded). While our chip results are encouraging, variability was not eliminated in every case. In the case of the lowest input (50Hz), we see virtually no change (Figure 4A). We suspect the timing remains imprecise because, with such low input, neurons do not spike every theta cycle and, consequently, provide fewer opportunities for the STDP synapses to potentiate. This shortfall illustrates the system’s limits; it can only compensate for variability within certain bounds, and only for activity appropriate to the PEP model. As expected, STDP is the mechanism responsible for PEP. STDP potentiated recurrent synapses from leading neurons to lagging neurons, reducing the disparity among the diverse population of neurons. Even though the STDP circuits are themselves variable, with different efficacies and time constants, when using timing the sign of the weight-change is always correct (data not shown). For this reason, we chose STDP over other more physiological implementations of plasticity, such as membrane-voltage-dependent plasticity (MVDP), which has the capability to learn with graded voltage signals [9], such as those found in active dendrites, providing more computational power [10]. Previously, we investigated a MVDP circuit, which modeled a voltage-dependent NMDAreceptor-gated synapse [11]. It potentiated when the calcium current analog exceeded a threshold, which was designed to occur only during a dendritic action potential. This circuit produced behavior similar to STDP, implying it could be used in PEP. However, it was sensitive to variability in the NMDA and potentiation thresholds, causing a fraction of the population to potentiate anytime the synapse received an input and another fraction to never potentiate, rendering both subpopulations useless. Therefore, the simpler, less biophysical STDP circuit won out over the MVDP circuit: In our system timing is everything. Associative storage and recall naturally emerge in the PEP network when synapses between neurons coactivated by a pattern are potentiated. These synapses allow neurons to recruit their peers when a subset of the pattern is presented, thereby completing the pattern. However, this form of pattern storage and completion differs from Hopfield’s attractor model [12] . Rather than forming symmetric, recurrent neuronal circuits, our recurrent network forms asymmetric circuits in which neurons make connections exclusively to less excitable neurons in the pattern. In both the poisson-like and regular cases (Figures 4 & 5), only about six percent of potentiated connections were reciprocated, as expected by chance. We plan to investigate the storage capacity of this asymmetric form of associative memory. Our system lends itself to modeling brain regions that use precise spike timing, such as the hippocampus. We plan to extend the work presented to store and recall sequences of patterns, as the hippocampus is hypothesized to do. Place cells that represent different locations spike at different phases of the theta cycle, in relation to the distance to their preferred locations. This sequential spiking will allow us to link patterns representing different locations in the order those locations are visited, thereby realizing episodic memory. 
We propose PEP as a candidate neural mechanism for information coding and storage in the hippocampal system. Observations from the CA1 region of the hippocampus suggest that basal dendrites (which primarily receive excitation from recurrent connections) support submillisecond timing precision, consistent with PEP [13]. We have shown, in a silicon model, PEP’s ability to exploit such fast recurrent connections to sharpen timing precision as well as to associatively store and recall patterns. Acknowledgments We thank Joe Lin for assistance with chip generation. The Office of Naval Research funded this work (Award No. N000140210468). References [1] O’Keefe J. & Recce M.L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus 3(3):317-330. [2] Mehta M.R., Lee A.K. & Wilson M.A. (2002) Role of experience and oscillations in transforming a rate code into a temporal code. Nature 417(6890):741-746. [3] Bi G.Q. & Wang H.X. (2002) Temporal asymmetry in spike timing-dependent synaptic plasticity. Physiology & Behavior 77:551-555. [4] Rodriguez-Vazquez, A., Linan, G., Espejo S. & Dominguez-Castro R. (2003) Mismatch-induced trade-offs and scalability of analog preprocessing visual microprocessor chips. Analog Integrated Circuits and Signal Processing 37:73-83. [5] Boahen K.A. (2000) Point-to-point connectivity between neuromorphic chips using address events. IEEE Transactions on Circuits and Systems II 47:416-434. [6] Culurciello E.R., Etienne-Cummings R. & Boahen K.A. (2003) A biomorphic digital image sensor. IEEE Journal of Solid State Circuits 38:281-294. [7] Bofill A., Murray A.F & Thompson D.P. (2005) Citcuits for VLSI Implementation of Temporally Asymmetric Hebbian Learning. In: Advances in Neural Information Processing Systems 14, MIT Press, 2002. [8] Cameron K., Boonsobhak V., Murray A. & Renshaw D. (2005) Spike timing dependent plasticity (STDP) can ameliorate process variations in neuromorphic VLSI. IEEE Transactions on Neural Networks 16(6):1626-1627. [9] Chicca E., Badoni D., Dante V., D’Andreagiovanni M., Salina G., Carota L., Fusi S. & Del Giudice P. (2003) A VLSI recurrent network of integrate-and-fire neurons connected by plastic synapses with long-term memory. IEEE Transaction on Neural Networks 14(5):1297-1307. [10] Poirazi P., & Mel B.W. (2001) Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron 29(3)779-796. [11] Arthur J.V. & Boahen K. (2004) Recurrently connected silicon neurons with active dendrites for one-shot learning. In: IEEE International Joint Conference on Neural Networks 3, pp.1699-1704. [12] Hopfield J.J. (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science 81(10):3088-3092. [13] Ariav G., Polsky A. & Schiller J. (2003) Submillisecond precision of the input-output transformation function mediated by fast sodium dendritic spikes in basal dendrites of CA1 pyramidal neurons. Journal of Neuroscience 23(21):7750-7758.
same-paper 4 0.92466748 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequences of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
5 0.69627577 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
6 0.6585474 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
7 0.61385715 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
8 0.61140788 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.59901065 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
10 0.5959447 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
11 0.59407163 54 nips-2005-Data-Driven Online to Batch Conversions
12 0.57239699 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models
13 0.56218272 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.56110203 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
15 0.55473387 181 nips-2005-Spiking Inputs to a Winner-take-all Network
16 0.5520308 112 nips-2005-Learning Minimum Volume Sets
17 0.54853928 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
18 0.54773724 52 nips-2005-Correlated Topic Models
19 0.54384434 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
20 0.54005218 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches