nips nips2005 nips2005-54 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques, which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
Reference: text
sentIndex sentText sentNum sentScore
1 The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques, which build a new batch algorithm from an existing online algorithm. [sent-6, score-1.347]
2 We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. [sent-7, score-1.333]
3 We then build upon these data-independent conversions to derive and analyze data-driven conversions. [sent-8, score-0.348]
4 Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. [sent-9, score-0.61]
5 We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions. [sent-10, score-0.324]
6 In the batch setting, instances are drawn from a domain X and are associated with target values from a target set Y. [sent-12, score-0.308]
7 Concretely, the learning algorithm is confined to a predetermined set of candidate hypotheses H, where each hypothesis h ∈ H is a mapping from X to Y, and the algorithm must select a “good” hypothesis from H. [sent-16, score-0.561]
8 The quality of different hypotheses in H is evaluated with respect to a loss function ℓ, where ℓ(y, y ′ ) is interpreted as the penalty for predicting the target value y ′ when the correct target is y. [sent-17, score-0.361]
9 The risk of a hypothesis h is defined as its expected loss, Risk(h) = E [ℓ(y, h(x))], where the expectation is over the distribution from which examples are drawn. (1) The goal of a batch learning algorithm is to use the training set to find a hypothesis that does well on average, or more formally, to find h ∈ H with a small risk. [sent-26, score-0.426]
10 In contrast to the batch learning setting, online learning takes place in a sequence of rounds. [sent-27, score-0.52]
11 Finally, the online algorithm may use the newly obtained example (xt , yt ) to improve its prediction strategy, namely to replace ht−1 with a new hypothesis ht . [sent-31, score-0.509]
12 An online algorithm is therefore defined by its default hypothesis h0 and the update rule it uses to define new hypotheses. [sent-33, score-0.499]
13 The cumulative loss suffered on a sequence of rounds is the sum of instantaneous losses suffered on each one of the rounds in the sequence. [sent-34, score-0.318]
14 The goal of the online algorithm is simply to suffer a small cumulative loss on the sequence of examples it is given, and examples that are not in this sequence are entirely irrelevant. [sent-36, score-0.424]
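The online protocol described in the preceding sentences can be summarized with a short sketch. The OnlineAlgorithm interface used here (predict, update, current_hypothesis) is a hypothetical one introduced only for illustration, not an API from the paper; the loop simply records the hypothesis sequence h0, ..., hm and accumulates the instantaneous losses.

```python
def run_online(algo, examples, loss):
    """Run the online rounds; return the hypothesis sequence and cumulative loss.

    `algo` is assumed to expose predict(x), update(x, y) and current_hypothesis().
    """
    hypotheses = [algo.current_hypothesis()]   # h_0, the default hypothesis
    cumulative_loss = 0.0
    for x_t, y_t in examples:
        y_hat = algo.predict(x_t)              # prediction of h_{t-1} on x_t
        cumulative_loss += loss(y_t, y_hat)    # instantaneous loss on round t
        algo.update(x_t, y_t)                  # possibly replace h_{t-1} with h_t
        hypotheses.append(algo.current_hypothesis())
    return hypotheses, cumulative_loss
```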
15 Throughout this paper, we assume that we have access to a good online learning algorithm A for the task at hand. [sent-37, score-0.28]
16 A simple and powerful way to achieve this is to use an online-to-batch conversion technique. [sent-41, score-0.637]
17 Several different online-to-batch conversion techniques have been developed over the years. [sent-43, score-0.662]
18 Littlestone and Warmuth [11] introduced an explicit relation between compression and learnability, which immediately lent itself to a conversion technique for classification algorithms. [sent-44, score-0.653]
19 Gallant [7] presented the Pocket algorithm, a conversion of Rosenblatt’s online Perceptron to the batch setting. [sent-45, score-1.115]
20 Littlestone [10] presented the Cross-Validation conversion which was further developed by Cesa-Bianchi, Conconi and Gentile [2]. [sent-46, score-0.637]
21 As A performs the m online rounds, it generates a sequence of online hypotheses which it uses to make predictions on each round. [sent-51, score-0.775]
22 This sequence includes the default hypothesis h0 and the m hypotheses h1 , . [sent-52, score-0.47]
23 The aforementioned techniques all share a common property: they all choose h, the output of the batch algorithm B, to be one of the online hypotheses h0 , . [sent-56, score-0.748]
24 The conversion strategies in this family also begin by using A to generate the sequence of online hypotheses. [sent-61, score-1.005]
25 Another characteristic shared by these three conversions is that the training data does not play a part in determining how the online hypotheses are combined. [sent-63, score-0.829]
26 In this paper, we build on the foundations of these data-independent conversions, and define conversion techniques that explicitly use the training data to derive the batch algorithm from the online algorithm. [sent-69, score-1.216]
27 2 we review the data-independent conversion techniques from [8, 6, 2] and give a simple unified analysis for all three conversions. [sent-73, score-0.662]
28 4, we compare the different conversion techniques on several benchmark datasets and show that our data-driven approach outperforms the existing data-independent approach. [sent-78, score-0.692]
29 2 Voting, Averaging, and Sampling The first conversion we discuss is the voting conversion [6], which applies to problems where the target space Y is discrete (and relatively small), such as classification problems. [sent-79, score-1.463]
30 The conversion presents the training set (x1 , y1 ), . [sent-80, score-0.671]
31 , (xm , ym ) to the online algorithm A, which generates the sequence of online hypotheses, h0 , . [sent-83, score-0.62]
32 The conversion then outputs the hypothesis hV , which is defined as follows: given an input x ∈ X , each online hypothesis hi casts a vote, hi (x), and hV outputs the target value that receives the highest number of votes. [sent-87, score-1.381]
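A minimal sketch of this voting rule, assuming each online hypothesis is a callable mapping an input to a discrete label (ties are broken arbitrarily):

```python
from collections import Counter

def voting_hypothesis(hypotheses, x):
    """h_V(x): every online hypothesis h_0, ..., h_m casts the vote h_i(x);
    return the label receiving the most votes (ties broken arbitrarily)."""
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]
```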
33 The second conversion is the averaging conversion [2] which applies to problems where Y is a convex set. [sent-89, score-1.371]
34 For example, this conversion is applicable to margin-based online classifiers or to regression problems where, in both cases, Y = R. [sent-90, score-0.899]
35 This conversion also begins by using A to generate h0 , . . . [sent-91, score-0.66]
36 Then the batch hypothesis hA is defined to be hA (x) = (1/(m+1)) Σ_{i=0}^{m} hi (x). [sent-95, score-0.502]
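A correspondingly small sketch of the averaging conversion, assuming the hypotheses output real values (or elements of any convex set that supports addition and scaling):

```python
def averaging_hypothesis(hypotheses, x):
    """h_A(x) = (1/(m+1)) * sum_{i=0}^{m} h_i(x)."""
    return sum(h(x) for h in hypotheses) / len(hypotheses)
```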
37 The third and last conversion discussed here is the sampling conversion [8]. [sent-96, score-1.29]
38 This conversion is the most general and is applicable to any learning problem; however, this generality comes at a price. [sent-97, score-0.637]
39 Again, this conversion begins by applying A to the training set and obtaining the sequence of online hypotheses. [sent-100, score-0.975]
40 We now describe a simple generalization of these three conversion techniques. [sent-108, score-0.655]
41 It is reasonable to assume that some of the online hypotheses generated by A are better than others. [sent-109, score-0.488]
42 For instance, the default hypothesis h0 is determined without observing even a single training example. [sent-110, score-0.27]
43 This raises the question of whether it is possible to isolate the “best” online hypotheses and use only them to define the batch hypothesis. [sent-111, score-0.687]
44 Now define hV_I (x) to be the hypothesis which performs voting as described above, with the single difference that only the members of {hi : i ∈ I} participate in the vote. [sent-116, score-0.309]
45 The data-independent conversions presented in the beginning of this section are obtained by setting I = [m]. [sent-118, score-0.34]
46 Our idea is to use the training data to find a set I which induces the batch hypotheses hV_I , hA_I , and hS_I with the smallest risk. [sent-119, score-0.459]
47 L(J) is the empirical evaluation of the loss of the hypotheses indexed by J. [sent-126, score-0.291]
48 We would like to find a set J for which L(J) is small since we expect that a small empirical loss of the online hypotheses indicates a low risk of the batch hypothesis. [sent-127, score-0.806]
49 Second, we would like |J| to be large so that the presence of a few bad online hypotheses in J will not have a devastating effect on the performance of the batch hypothesis. [sent-128, score-0.704]
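One way to picture this trade-off is the sketch below. It assumes the per-hypothesis losses were already recorded during the online run, and the square-root penalty is only an illustrative stand-in for the data-dependent term derived in the paper's lemmas, not the exact bound.

```python
import math

def empirical_loss(J, per_hypothesis_loss):
    """L(J): average of the recorded losses of the hypotheses indexed by J."""
    return sum(per_hypothesis_loss[i] for i in J) / len(J)

def bound(J, per_hypothesis_loss, delta=0.05):
    """Illustrative data-dependent bound: small L(J) and large |J| are both rewarded.
    The sqrt penalty is an assumption standing in for the paper's actual term."""
    return (empirical_loss(J, per_hypothesis_loss)
            + math.sqrt(math.log(1.0 / delta) / (2 * len(J))))

def best_subset(candidate_sets, per_hypothesis_loss):
    """Pick the candidate index set J that minimizes the bound."""
    return min(candidate_sets, key=lambda J: bound(J, per_hypothesis_loss))
```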
50 The first lemma relates the risk of the functions hV_J , hA_J , and hS_J with the average risk of the hypotheses indexed by J. [sent-136, score-0.415]
51 , (xm , ym ) be a sequence of examples which is presented to the online algorithm A and let h0 , . [sent-141, score-0.358]
52 , hm be the resulting sequence of online hypotheses. [sent-144, score-0.487]
53 Beginning with the voting conversion, recall that the loss function being used is the 0-1 loss, namely there is a single correct prediction which incurs a loss of 0 and every other prediction incurs a loss of 1. [sent-150, score-0.369]
54 Therefore, if ℓ(y, hV_J (x)) = 1 then at least half of the hypotheses in {hi }i∈J make incorrect predictions and (|J|/2) ≤ Σ_{i∈J} ℓ(y, hi (x)). [sent-152, score-0.337]
55 Finally, hS_J chooses its outcome by randomly choosing a hypothesis in {hi : i ∈ J}, where the probability of choosing each hypothesis in this set equals 1/|J|. [sent-159, score-0.37]
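A minimal sketch of this sampling hypothesis, drawing uniformly from the hypotheses indexed by J:

```python
import random

def sampling_hypothesis(hypotheses, J, x):
    """h_S_J(x): draw an index from J uniformly at random and apply that hypothesis."""
    i = random.choice(list(J))
    return hypotheses[i](x)
```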
56 The next lemma relates the average risk of the hypotheses indexed by J with the empirical performance of these hypotheses, L(J). [sent-163, score-0.356]
57 , hm be the sequence of online hypotheses generated by A while observing this sequence of examples. [sent-173, score-0.589]
58 For concreteness, we now focus on the averaging conversion and note that the analyses of the other two conversion strategies are virtually identical. [sent-178, score-1.373]
59 3 Concrete Data-Driven Conversions In this section we build on the ideas of the previous section and derive three concrete data-driven conversion techniques. [sent-185, score-0.73]
60 Suffix Conversion: An intuitive argument against selecting I = [m], as done by the data-independent conversions, is that many online algorithms tend to generate bad hypotheses during the first few rounds of learning. [sent-186, score-0.627]
61 As previously noted, the default hypothesis h0 is determined without observing any training data, and we should expect the first few online hypotheses to be inferior to those that are generated further along. [sent-187, score-0.777]
62 Li [9] proposed this idea in the context of the voting conversion and gave a heuristic criterion for choosing a. [sent-192, score-0.805]
63 In this conversion we define I to be the set of all suffixes of [m − 1]. [sent-194, score-0.637]
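For the suffix conversion, the candidate family I is simply the set of all suffixes. The sketch below enumerates them and reuses the hypothetical best_subset helper from the earlier sketch.

```python
def suffix_candidates(m):
    """All suffixes {a, a+1, ..., m-1} of [m-1], one candidate set per starting index a."""
    return [list(range(a, m)) for a in range(m)]

# Example usage, reusing the earlier sketch:
# best_J = best_subset(suffix_candidates(m), per_hypothesis_loss)
```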
64 A similar problem arises in the online learning setting, where the goal is to construct online algorithms in which each online hypothesis hi is a kernel-based function defined by at most B vectors. [sent-210, score-1.072]
65 Although each individual online hypothesis is defined by at most B vectors, hA is defined by the union of these sets, which can be much larger than B. [sent-213, score-0.42]
66 To convert a budget-constrained online algorithm A into a budget-constrained batch algorithm, we make an additional assumption on the update strategy employed by A. [sent-214, score-0.496]
67 We assume that whenever A updates its online hypothesis, it adds a single new support pattern into the set used to represent the kernel hypothesis, and possibly removes some other pattern from this set. [sent-215, score-0.32]
68 , b} for some integers 0 ≤ a < b < m, and A updates its hypothesis k times during rounds a + 1 through b, then hA_I is defined by at most B + k support patterns. [sent-220, score-0.288]
69 Tree-Based Conversion: A drawback of the suffix conversion is that it must be performed in two consecutive stages. [sent-233, score-0.324]
70 Therefore, the memory requirements of this conversion grow linearly with m. [sent-239, score-0.35]
71 We now present a conversion that can sidestep this problem by interleaving the conversion with the online hypothesis generation. [sent-240, score-1.694]
72 This conversion slightly deviates from the general framework described in the previous section: instead of predefining a set of candidates I, we construct the optimal subset I in a recursive manner. [sent-241, score-0.653]
73 [Ja,b is defined recursively, per Eq. (3).] Finally, define I = J0,m−1 and output the batch hypothesis hA_I . [sent-247, score-0.374]
74 When m is not a power of 2, we can pad the sequence of online hypotheses with virtual hypotheses, each of which attains an infinite loss. [sent-252, score-0.513]
75 This conversion can be performed in parallel with the online rounds since on round t we already have all of the information required to calculate Ja,b for all b < t. [sent-253, score-1.012]
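The recursion can be sketched as follows. Since the exact rule of Eq. (3) is not reproduced above, the merge step used here — compare the two child sets and their union by the data-dependent bound — is an assumption consistent with the surrounding description, not a verbatim transcription of the paper; it reuses the hypothetical bound helper from the earlier sketch.

```python
def tree_subset(a, b, per_hypothesis_loss):
    """Recursively build J_{a,b} over the index range [a, b].

    Assumed merge rule: keep whichever of the two child sets or their union
    minimizes the bound (a stand-in for Eq. (3))."""
    if a == b:
        return [a]
    c = (a + b) // 2
    left = tree_subset(a, c, per_hypothesis_loss)
    right = tree_subset(c + 1, b, per_hypothesis_loss)
    return min([left, right, left + right],
               key=lambda J: bound(J, per_hypothesis_loss))

# I = tree_subset(0, m - 1, per_hypothesis_loss); then output the averaged hypothesis over I.
```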
76 , hm are linear hypotheses and we use the averaging technique, the implementation of the tree-based conversion becomes memory efficient. [sent-257, score-1.133]
77 Specifically, assume that each hi takes the form hi (x) = wi · x where wi is a vector of weights in Rn . [sent-258, score-0.288]
78 In this case, storing an online hypothesis hi is equivalent to storing its weight vector wi . [sent-259, score-0.666]
79 Hence, once we calculate Ja,b we can discard the original online hypotheses ha , . [sent-261, score-0.675]
80 It can be verified using an inductive argument that the overall memory utilization of this conversion is O(log(m)), which is significantly less than the O(m) space required by the suffix conversion. [sent-267, score-0.684]
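For linear hypotheses hi (x) = wi · x, the memory saving can be sketched as below: once Ja,b is fixed, only the sum of the selected weight vectors (and how many were summed) needs to be kept, so the originals can be discarded.

```python
import numpy as np

def collapse_range(J, weight_vectors):
    """Replace the individual weight vectors indexed by J with their sum and count;
    the original w_a, ..., w_b can then be discarded."""
    summed = np.sum([weight_vectors[i] for i in J], axis=0)
    return summed, len(J)

# The final hypothesis is h_A(x) = (total_sum / total_count) . x, assembled from the
# O(log m) collapsed partial sums kept along the recursion.
```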
81 Each bar shows the difference between the error percentages of a data-driven conversion (suffix (S), interval (I) or tree-based (T)) and of the data-independent conversion. [sent-270, score-0.707]
82 the Passive-Aggressive (PA) algorithm [3] as the online algorithm. [sent-272, score-0.28]
83 The PA algorithm is a kernel-based large-margin online classifier. [sent-273, score-0.28]
84 To apply the averaging conversion, Y must be a convex set. [sent-276, score-0.734]
85 We applied the data-independent averaging and voting conversions, as well as the three data-driven variants of these conversions (6 data-driven conversions in all). [sent-289, score-0.877]
86 The interval conversion was set to choose an interval containing 500 updates. [sent-290, score-0.703]
87 Additionally, we evaluated the test error of the last hypothesis generated by the online algorithm, hm . [sent-292, score-0.652]
88 It is common malpractice amongst practitioners to use hm as if it were a batch hypothesis, instead of using an online-to-batch conversion. [sent-293, score-0.399]
89 As a byproduct of our experiments, we show that hm performs significantly worse than any of the conversion techniques discussed in this paper. [sent-294, score-0.845]
90 We would like to emphasize that our goal was not to achieve state-of-the-art results on these datasets but rather to compare the different conversion strategies on the same sequence of hypotheses. [sent-296, score-0.749]
91 The results for the different variants of the averaging conversion are depicted in Fig. [sent-298, score-0.715]
92 Results are given for the last online hypothesis (hm ), the data-independent averaging and voting conversions, and their suffix variants. [sent-381, score-0.665]
93 For each dataset and each training-set size, we present a bar-plot which represents by how much each of the data-driven averaging conversions improves over the data-independent averaging conversion. [sent-383, score-0.48]
94 For instance, the left bar in each plot shows the difference between the test errors of the suffix conversion and the data-independent conversion. [sent-384, score-0.637]
95 The results clearly indicate that the suffix and tree-based conversions consistently improve over the data-independent conversion. [sent-386, score-0.324]
96 The interval conversion does not improve as much and occasionally even loses to the data-independent conversion. [sent-387, score-0.661]
97 Due to the lack of space, we omit a similar figure for the voting conversion and merely note that the plots are very similar to the ones in Fig. [sent-389, score-0.788]
98 As a reference, we also give the results obtained by the last hypothesis generated by the online algorithm. [sent-392, score-0.453]
99 In all of the experiments, the data-driven conversion outperforms the data-independent conversion. [sent-393, score-0.637]
100 In general, averaging exhibits better results than voting, while the last online hypothesis is almost always inferior to all of the online-to-batch conversions. [sent-394, score-0.533]
wordName wordTfidf (topN-words)
[('conversion', 0.637), ('conversions', 0.324), ('online', 0.262), ('batch', 0.216), ('hypotheses', 0.209), ('riskd', 0.201), ('ha', 0.184), ('hm', 0.183), ('hypothesis', 0.158), ('voting', 0.151), ('hs', 0.151), ('hv', 0.148), ('hi', 0.128), ('averaging', 0.078), ('rounds', 0.068), ('default', 0.061), ('loss', 0.06), ('risk', 0.059), ('isolet', 0.054), ('ht', 0.053), ('storing', 0.051), ('mnist', 0.051), ('usps', 0.049), ('lemma', 0.048), ('datadriven', 0.046), ('letter', 0.044), ('sequence', 0.042), ('dekel', 0.04), ('suffered', 0.04), ('target', 0.038), ('conconi', 0.037), ('littlestone', 0.037), ('ym', 0.036), ('training', 0.034), ('datasets', 0.03), ('multiclass', 0.029), ('xm', 0.029), ('claim', 0.028), ('dataindependent', 0.027), ('helmbold', 0.027), ('memory', 0.026), ('round', 0.025), ('techniques', 0.025), ('budget', 0.024), ('concretely', 0.024), ('gentile', 0.024), ('minj', 0.024), ('formally', 0.024), ('perceptron', 0.024), ('build', 0.024), ('interval', 0.024), ('concrete', 0.023), ('support', 0.023), ('suf', 0.023), ('warmuth', 0.023), ('generate', 0.023), ('indexed', 0.022), ('integers', 0.022), ('argument', 0.021), ('strategies', 0.021), ('calculate', 0.02), ('linearity', 0.02), ('outcome', 0.02), ('family', 0.02), ('predicts', 0.02), ('emphasize', 0.019), ('crammer', 0.019), ('incurs', 0.019), ('inferior', 0.019), ('convex', 0.019), ('freund', 0.018), ('relates', 0.018), ('yt', 0.018), ('algorithm', 0.018), ('generalization', 0.018), ('kernel', 0.018), ('classi', 0.018), ('choose', 0.018), ('price', 0.018), ('prede', 0.018), ('generated', 0.017), ('pa', 0.017), ('choosing', 0.017), ('observing', 0.017), ('bad', 0.017), ('updates', 0.017), ('uni', 0.016), ('candidates', 0.016), ('compression', 0.016), ('hj', 0.016), ('subsets', 0.016), ('last', 0.016), ('wi', 0.016), ('concepts', 0.016), ('beginning', 0.016), ('bounds', 0.016), ('evaluated', 0.016), ('bound', 0.016), ('ne', 0.016), ('instances', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques, which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
2 0.1811436 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequence of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
3 0.16874868 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
4 0.16353661 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
5 0.14814641 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
6 0.090813294 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
7 0.089667976 148 nips-2005-Online Discovery and Learning of Predictive State Representations
8 0.084095761 50 nips-2005-Convex Neural Networks
9 0.057866815 160 nips-2005-Query by Committee Made Real
10 0.054440133 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.052679252 58 nips-2005-Divergences, surrogate loss functions and experimental design
12 0.04746661 85 nips-2005-Generalization to Unseen Cases
13 0.046011072 192 nips-2005-The Information-Form Data Association Filter
14 0.041590761 47 nips-2005-Consistency of one-class SVM and related algorithms
15 0.039865088 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
16 0.038932558 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes
17 0.03853704 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
18 0.038471073 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
19 0.037564173 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
20 0.037176959 117 nips-2005-Learning from Data of Variable Quality
topicId topicWeight
[(0, 0.14), (1, 0.054), (2, -0.005), (3, -0.079), (4, 0.123), (5, 0.155), (6, 0.003), (7, 0.07), (8, -0.132), (9, 0.103), (10, -0.146), (11, 0.263), (12, 0.012), (13, -0.141), (14, 0.046), (15, -0.176), (16, -0.132), (17, -0.021), (18, -0.067), (19, 0.014), (20, -0.096), (21, 0.062), (22, 0.013), (23, -0.007), (24, 0.102), (25, -0.036), (26, -0.08), (27, 0.024), (28, 0.086), (29, 0.061), (30, -0.022), (31, -0.139), (32, -0.045), (33, -0.054), (34, 0.044), (35, -0.081), (36, 0.023), (37, -0.001), (38, -0.046), (39, 0.063), (40, -0.064), (41, -0.048), (42, 0.04), (43, 0.096), (44, 0.048), (45, 0.043), (46, -0.057), (47, 0.041), (48, -0.006), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.96005321 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques, which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
2 0.8299678 76 nips-2005-From Batch to Transductive Online Learning
Author: Sham Kakade, Adam Tauman Kalai
Abstract: It is well-known that everything that is learnable in the difficult online setting, where an arbitrary sequence of examples must be labeled one at a time, is also learnable in the batch setting, where examples are drawn independently from a distribution. We show a result in the opposite direction. We give an efficient conversion algorithm from batch to online that is transductive: it uses future unlabeled data. This demonstrates the equivalence between what is properly and efficiently learnable in a batch model and a transductive online model.
3 0.77967668 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.
4 0.64254338 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
5 0.57133889 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
6 0.44876981 148 nips-2005-Online Discovery and Learning of Predictive State Representations
7 0.38663357 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
8 0.37355685 160 nips-2005-Query by Committee Made Real
9 0.35653368 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
10 0.31233847 117 nips-2005-Learning from Data of Variable Quality
11 0.30826256 85 nips-2005-Generalization to Unseen Cases
12 0.26550856 50 nips-2005-Convex Neural Networks
13 0.25454247 62 nips-2005-Efficient Estimation of OOMs
14 0.25228286 112 nips-2005-Learning Minimum Volume Sets
15 0.24993244 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
16 0.23472531 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
17 0.22408392 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
18 0.22345529 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
19 0.21764797 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
20 0.21185416 74 nips-2005-Faster Rates in Regression via Active Learning
topicId topicWeight
[(3, 0.067), (10, 0.029), (27, 0.017), (31, 0.027), (34, 0.09), (39, 0.023), (47, 0.302), (55, 0.05), (69, 0.046), (73, 0.031), (88, 0.086), (91, 0.112)]
simIndex simValue paperId paperTitle
1 0.89875418 25 nips-2005-An aVLSI Cricket Ear Model
Author: Andre V. Schaik, Richard Reeve, Craig Jin, Tara Hamilton
Abstract: Female crickets can locate males by phonotaxis to the mating song they produce. The behaviour and underlying physiology has been studied in some depth showing that the cricket auditory system solves this complex problem in a unique manner. We present an analogue very large scale integrated (aVLSI) circuit model of this process and show that results from testing the circuit agree with simulation and what is known from the behaviour and physiology of the cricket auditory system. The aVLSI circuitry is now being extended to use on a robot along with previously modelled neural circuitry to better understand the complete sensorimotor pathway. 1 In trod u ction Understanding how insects carry out complex sensorimotor tasks can help in the design of simple sensory and robotic systems. Often insect sensors have evolved into intricate filters matched to extract highly specific data from the environment which solves a particular problem directly with little or no need for further processing [1]. Examples include head stabilisation in the fly, which uses vision amongst other senses to estimate self-rotation and thus to stabilise its head in flight, and phonotaxis in the cricket. Because of the narrowness of the cricket body (only a few millimetres), the Interaural Time Difference (ITD) for sounds arriving at the two sides of the head is very small (10–20µs). Even with the tympanal membranes (eardrums) located, as they are, on the forelegs of the cricket, the ITD only reaches about 40µs, which is too low to detect directly from timings of neural spikes. Because the wavelength of the cricket calling song is significantly greater than the width of the cricket body the Interaural Intensity Difference (IID) is also very low. In the absence of ITD or IID information, the cricket uses phase to determine direction. This is possible because the male cricket produces an almost pure tone for its calling song. * School of Electrical and Information Engineering, Institute of Perception, Action and Behaviour. + Figure 1: The cricket auditory system. Four acoustic inputs channel sounds directly or through tracheal tubes onto two tympanal membranes. Sound from contralateral inputs has to pass a (double) central membrane (the medial septum), inducing a phase delay and reduction in gain. The sound transmission from the contralateral tympanum is very weak, making each eardrum effectively a 3 input system. The physics of the cricket auditory system is well understood [2]; the system (see Figure 1) uses a pair of sound receivers with four acoustic inputs, two on the forelegs, which are the external surfaces of the tympana, and two on the body, the prothoracic or acoustic spiracles [3]. The connecting tracheal tubes are such that interference occurs as sounds travel inside the cricket, producing a directional response at the tympana to frequencies near to that of the calling song. The amplitude of vibration of the tympana, and hence the firing rate of the auditory afferent neurons attached to them, vary as a sound source is moved around the cricket and the sounds from the different inputs move in and out of phase. The outputs of the two tympana match when the sound is straight ahead, and the inputs are bilaterally symmetric with respect to the sound source. However, when sound at the calling song frequency is off-centre the phase of signals on the closer side comes better into alignment, and the signal increases on that side, and conversely decreases on the other. 
It is that crossover of tympanal vibration amplitudes which allows the cricket to track a sound source (see Figure 6 for example). A simplified version of the auditory system using only two acoustic inputs was implemented in hardware [4], and a simple 8-neuron network was all that was required to then direct a robot to carry out phonotaxis towards a species-specific calling song [5]. A simple simulator was also created to model the behaviour of the auditory system of Figure 1 at different frequencies [6]. Data from Michelsen et al. [2] (Figures 5 and 6) were digitised, and used together with average and “typical” values from the paper to choose gains and delays for the simulation. Figure 2 shows the model of the internal auditory system of the cricket from sound arriving at the acoustic inputs through to transmission down auditory receptor fibres. The simulator implements this model up to the summing of the delayed inputs, as well as modelling the external sound transmission. Results from the simulator were used to check the directionality of the system at different frequencies, and to gain a better understanding of its response. It was impractical to check the effect of leg movements or of complex sounds in the simulator due to the necessity of simulating the sound production and transmission. An aVLSI chip was designed to implement the same model, both allowing more complex experiments, such as leg movements to be run, and experiments to be run in the real world. Figure 2: A model of the auditory system of the cricket, used to build the simulator and the aVLSI implementation (shown in boxes). These experiments with the simulator and the circuits are being published in [6] and the reader is referred to those papers for more details. In the present paper we present the details of the circuits used for the aVLSI implementation. 2 Circuits The chip, implementing the aVLSI box in Figure 2, comprises two all-pass delay filters, three gain circuits, a second-order narrow-band band-pass filter, a first-order wide-band band-pass filter, a first-order high-pass filter, as well as supporting circuitry (including reference voltages, currents, etc.). A single aVLSI chip (MOSIS tiny-chip) thus includes half the necessary circuitry to model the complete auditory system of a cricket. The complete model of the auditory system can be obtained by using two appropriately connected chips. Only two all-pass delay filters need to be implemented instead of three as suggested by Figure 2, because it is only the relative delay between the three pathways arriving at the one summing node that counts. The delay circuits were implemented with fully-differential gm-C filters. In order to extend the frequency range of the delay, a first-order all-pass delay circuit was cascaded with a second-order all-pass delay circuit. The resulting addition of the first-order delay and the second-order delay allowed for an approximately flat delay response for a wider bandwidth as the decreased delay around the corner frequency of the first-order filter cancelled with the increased delay of the second-order filter around its resonant frequency. Figure 3 shows the first- and second-order sections of the all-pass delay circuit. Two of these circuits were used and, based on data presented in [2], were designed with delays of 28µs and 62µs, by way of bias current manipulation. The operational transconductance amplifier (OTA) in figure 3 is a standard OTA which includes the common-mode feedback necessary for fully differential designs. 
The buffers (Figure 3) are simple, cascoded differential pairs. V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ Figure 3: The first-order all-pass delay circuit (left) and the second-order all-pass delay (right). The differential output of the delay circuits is converted into a current which is multiplied by a variable gain implemented as shown in Figure 4. The gain cell includes a differential pair with source degeneration via transistors N4 and N5. The source degeneration improves the linearity of the current. The three gain cells implemented on the aVLSI have default gains of 2, 3 and 0.91 which are set by holding the default input high and appropriately ratioing the bias currents through the value of vbiasp. To correct any on-chip mismatches and/or explore other gain configurations a current splitter cell [7] (p-splitter, figure 4) allows the gain to be programmed by digital means post fabrication. The current splitter takes an input current (Ibias, figure 4) and divides it into branches which recursively halve the current, i.e., the first branch gives ½ Ibias, the second branch ¼ Ibias, the third branch 1/8 Ibias and so on. These currents can be used together with digitally controlled switches as a Digital-to-Analogue converter. By holding default low and setting C5:C0 appropriately, any gain – from 4 to 0.125 – can be set. To save on output pins the program bits (C5:C0) for each of the three gain cells are set via a single 18-bit shift register in bit-serial fashion. Summing the output of the three gain circuits in the current domain simply involves connecting three wires together. Therefore, a natural option for the filters that follow is to use current domain filters. In our case we have chosen to implement log-domain filters using MOS transistors operating in weak inversion. Figure 5 shows the basic building blocks for the filters – the Tau Cell [8] and the multiplier cell – and block diagrams showing how these blocks were connected to create the necessary filtering blocks. The Tau Cell is a log-domain filter which has the firstorder response: I out 1 , = I in sτ + 1 where τ = nC aVT Ia and n = the slope factor, VT = thermal voltage, Ca = capacitance, and Ia = bias current. In figure 5, the input currents to the Tau Cell, Imult and A*Ia, are only used when building a second-order filter. The multiplier cell is simply a translinear loop where: I out1 ∗ I mult = I out 2 ∗ AI a or Imult = AIaIout2/Iout1. The configurations of the Tau Cell to get particular responses are covered in [8] along with the corresponding equations. The high frequency filter of Figure 2 is implemented by the high-pass filter in Figure 5 with a corner frequency of 17kHz. The low frequency filter, however, is divided into two parts since the biological filter’s response (see for example Figure 3A in [9]) separates well into a narrow second-order band-pass filter with a 10kHz resonant frequency and a wide band-pass filter made from a first-order high-pass filter with a 3kHz corner frequency followed by a first-order low-pass filter with a 12kHz corner frequency. These filters are then added together to reproduce the biological filter. The filters’ responses can be adjusted post fabrication via their bias currents. This allows for compensation due to processing and matching errors. Figure 4: The Gain Cell above is used to convert the differential voltage input from the delay cells into a single-ended current output. The gain of each cell is controllable via a programmable current cell (p_splitter). 
An on-chip bias generator [7] was used to create all the necessary current biases on the chip. All the main blocks (delays, gain cells and filters), however, can have their on-chip bias currents overridden through external pins on the chip. The chip was fabricated using the MOSIS AMI 1.6µm technology and designed using the Cadence Custom IC Design Tools (5.0.33). 3 Methods The chip was tested using sound generated on a computer and played through a soundcard to the chip. Responses from the chip were recorded by an oscilloscope, and uploaded back to the computer on completion. Given that the output from the chip and the gain circuits is a current, an external current-sense circuit built with discrete components was used to enable the output to be probed by the oscilloscope. Figure 5: The circuit diagrams for the log-domain filter building blocks – The Tau Cell and The Multiplier – along with the block diagrams for the three filters used in the aVLSI model. Initial experiments were performed to tune the delays and gains. After that, recordings were taken of the directional frequency responses. Sounds were generated by computer for each chip input to simulate moving the forelegs by delaying the sound by the appropriate amount of time; this was a much simpler solution than using microphones and moving them using motors. 4 Results The aVLSI chip was tested to measure its gains and delays, which were successfully tuned to the appropriate values. The chip was then compared with the simulation to check that it was faithfully modelling the system. A result of this test at 4kHz (approximately the cricket calling-song frequency) is shown in Figure 6. Apart from a drop in amplitude of the signal, the response of the circuit was very similar to that of the simulator. The differences were expected because the aVLSI circuit has to deal with real-world noise, whereas the simulated version has perfect signals. Examples of the gain versus frequency response of the two log-domain band-pass filters are shown in Figure 7. Note that the narrow-band filter peaks at 6kHz, which is significantly above the mating song frequency of the cricket which is around 4.5kHz. This is not a mistake, but is observed in real crickets as well. As stated in the introduction, a range of further testing results with both the circuit and the simulator are being published in [6]. 5 D i s c u s s i on The aVLSI auditory sensor in this research models the hearing of the field cricket Gryllus bimaculatus. It is a more faithful model of the cricket auditory system than was previously built in [4], reproducing all the acoustic inputs, as well as the responses to frequencies of both the co specific calling song and bat echolocation chirps. It also generates outputs corresponding to the two sets of behaviourally relevant auditory receptor fibres. Results showed that it matched the biological data well, though there were some inconsistencies due to an error in the specification that will be addressed in a future iteration of the design. A more complete implementation across all frequencies was impractical because of complexity and size issues as well as serving no clear behavioural purpose. Figure 6: Vibration amplitude of the left (dotted) and right (solid) virtual tympana measured in decibels in response to a 4kHz tone in simulation (left) and on the aVLSI chip (right). The plot shows the amplitude of the tympanal responses as the sound source is rotated around the cricket. 
Figure 7: Frequency-Gain curves for the narrow-band and wide-band bandpass filters. The long-term aim of this work is to better understand simple sensorimotor control loops in crickets and other insects. The next step is to mount this circuitry on a robot to carry out behavioural experiments, which we will compare with existing and new behavioural data (such as that in [10]). This will allow us to refine our models of the neural circuitry involved. Modelling the sensory afferent neurons in hardware is necessary in order to reduce processor load on our robot, so the next revision will include these either onboard, or on a companion chip as we have done before [11]. We will also move both sides of the auditory system onto a single chip to conserve space on the robot. It is our belief and experience that, as a result of this intelligent pre-processing carried out at the sensor level, the neural circuits necessary to accurately model the behaviour will remain simple. Acknowledgments The authors thank the Institute of Neuromorphic Engineering and the UK Biotechnology and Biological Sciences Research Council for funding the research in this paper. References [1] R. Wehner. Matched filters – neural models of the external world. J Comp Physiol A, 161: 511–531, 1987. [2] A. Michelsen, A. V. Popov, and B. Lewis. Physics of directional hearing in the cricket Gryllus bimaculatus. Journal of Comparative Physiology A, 175:153–164, 1994. [3] A. Michelsen. The tuned cricket. News Physiol. Sci., 13:32–38, 1998. [4] H. H. Lund, B. Webb, and J. Hallam. A robot attracted to the cricket species Gryllus bimaculatus. In P. Husbands and I. Harvey, editors, Proceedings of 4th European Conference on Artificial Life, pages 246–255. MIT Press/Bradford Books, MA., 1997. [5] R Reeve and B. Webb. New neural circuits for robot phonotaxis. Phil. Trans. R. Soc. Lond. A, 361:2245–2266, August 2003. [6] R. Reeve, A. van Schaik, C. Jin, T. Hamilton, B. Torben-Nielsen and B. Webb Directional hearing in a silicon cricket. Biosystems, (in revision), 2005b [7] T. Delbrück and A. van Schaik, Bias Current Generators with Wide Dynamic Range, Analog Integrated Circuits and Signal Processing 42(2), 2005 [8] A. van Schaik and C. Jin, The Tau Cell: A New Method for the Implementation of Arbitrary Differential Equations, IEEE International Symposium on Circuits and Systems (ISCAS) 2003 [9] Kazuo Imaizumi and Gerald S. Pollack. Neural coding of sound frequency by cricket auditory receptors. The Journal of Neuroscience, 19(4):1508– 1516, 1999. [10] Berthold Hedwig and James F.A. Poulet. Complex auditory behaviour emerges from simple reactive steering. Nature, 430:781–785, 2004. [11] R. Reeve, B. Webb, A. Horchler, G. Indiveri, and R. Quinn. New technologies for testing a model of cricket phonotaxis on an outdoor robot platform. Robotics and Autonomous Systems, 51(1):41-54, 2005.
same-paper 2 0.75825113 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversion techniques, which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing data-dependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
3 0.74461019 98 nips-2005-Infinite latent feature models and the Indian buffet process
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
4 0.52835798 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning. 1
5 0.51488316 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
6 0.51482427 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
7 0.51433933 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
8 0.51408446 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.50906456 30 nips-2005-Assessing Approximations for Gaussian Process Classification
10 0.50775152 76 nips-2005-From Batch to Transductive Online Learning
11 0.50398064 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
12 0.50274336 74 nips-2005-Faster Rates in Regression via Active Learning
13 0.50207376 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
14 0.50133103 58 nips-2005-Divergences, surrogate loss functions and experimental design
15 0.50098228 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
16 0.50074518 112 nips-2005-Learning Minimum Volume Sets
17 0.50058484 50 nips-2005-Convex Neural Networks
18 0.50027043 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
19 0.49768993 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
20 0.49702433 205 nips-2005-Worst-Case Bounds for Gaussian Process Models