nips nips2010 nips2010-274 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
Reference: text
sentIndex sentText sentNum sentScore
1 On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. [sent-8, score-0.506]
2 We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. [sent-11, score-0.265]
3 We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin. [sent-12, score-0.518]
4 The algorithm is not allowed to make any mistakes; still, it learns from those examples on which it answers ⊥. [sent-16, score-0.337]
5 It is worth mentioning that the idea of forcing the algorithm to say “I don’t know” instead of making a mistake has also appeared in earlier work such as [RS88], and referred to as reliable learning. [sent-19, score-0.54]
6 Or, in the other direction, our aim is to produce algorithms that can make substantially fewer mistakes than in the standard Mistake-Bound model, by trading off some of those for (presumably less costly) don’t-know predictions. [sent-26, score-0.591]
7 † 1 We show that if only one mistake is allowed, that number can be reduced to 2|H|. [sent-31, score-0.394]
8 Furthermore, we show that the problem is equivalent to the famous egg-dropping puzzle, defined formally in 1 Section 2, hence getting bound (k + 1)H k+1 when k mistakes are allowed. [sent-32, score-0.55]
9 Our algorithm does not in general run in polynomial time in the description length of the target function since its running time depends on |H|; however, we propose polynomial versions of our algorithm for two important classes: monotone disjunctions and linear separators. [sent-33, score-0.629]
10 Allowing the algorithm to make mistakes in the KWIK model is equivalent to allowing the algorithm to say “I don’t know” in the Mistake-bound model introduced in [Lit88]. [sent-34, score-0.757]
11 In fact, one way of looking at the algorithms presented in section 3 is that we want to decrease the number of mistakes in Mistake-bound model by allowing the algorithm to say ⊥. [sent-35, score-0.723]
12 Then in section 2, we describe how would the bounds on the number of ⊥s change if we allow a few mistakes in KWIK model. [sent-38, score-0.532]
13 For a given integer k, we want to design an algorithm such that for any sequence of examples, the number of times M that it makes a mistake is not more than k, and the number of times I that it answers ⊥ is minimized. [sent-45, score-0.613]
14 Class H can be learned in Mistake-bound model with mistake bound of only 1. [sent-58, score-0.438]
15 Since the algorithm does not know the answer until it has seen its first positive example, it must keep answering ⊥ on all examples that it has not seen yet. [sent-62, score-0.46]
16 Therefore, in the worst case, it answers ⊥ and finds out that the answer is − on all the first 2n − 1 examples that it sees. [sent-63, score-0.366]
17 Algorithm 1 Enumeration The algorithm looks at all the functions in class H; if they all agree on the label of the current example x ∈ X, the algorithm outputs that label, otherwise the algorithm outputs ⊥. [sent-69, score-0.243]
18 Note that at least one function gets removed from H each time that algorithm answers ⊥; therefore, the algorithm finds the target function with at most |H|−1 number of ⊥’s. [sent-71, score-0.319]
19 Of course, this case is not so interesting since k = 1 is the mistake bound for the class. [sent-79, score-0.438]
20 Our main interest is the case that k > 0 and yet is much smaller than the number of mistakes needed to learn in the pure Mistake Bound model. [sent-80, score-0.506]
21 We saw, in Algorithm 1, how to learn a concept class H with no mistakes and with O(|H|) number of ⊥’s. [sent-81, score-0.602]
22 In general, if k mistakes are allowed, there is an algorithm that can learn any class H with at most (k + 1)|H|1/k+1 number of ⊥’s. [sent-88, score-0.585]
23 If the population of the minority is < |H|, we predict the label that the majority gives on the example; however, if the population of the minority is ≥ |H|, we say ⊥. [sent-93, score-0.338]
24 If we make a mistake in some step, the size of the version space will reduce to < |H|. [sent-95, score-0.444]
25 Furthermore, note that before making any mistake, we remove at least |H| of the functions from the pool each time we answer ⊥. [sent-97, score-0.377]
26 √ The answer to this puzzle is 2n up to an additive constant. [sent-105, score-0.296]
27 The ⊥ minimization problem when k mistakes are allowed is clearly related to the egg game puzzle when the building has |H| floors and there are k + 1 eggs available. [sent-107, score-0.923]
28 If 3 the minority population is > s, we answer ⊥, otherwise, we answer the majority prediction. [sent-112, score-0.503]
29 At the end of each step, we remove the functions that made a mistake in the last step from P . [sent-113, score-0.417]
30 Whenever we k−i make a mistake, we update s = |P | k+1−i , where i is the number of mistakes we have made so far. [sent-114, score-0.556]
31 Proposition 1 Algorithm 2 learns a concept class H with at most k mistakes and (k + 1)|H|1/k+1 “I don’t know”s. [sent-115, score-0.632]
32 Hence, using induction, we k can argue that after the first mistake, the learning can be done with k − 1 mistakes and k(|H| k+1 )1/k “I don’t know”s. [sent-117, score-0.506]
33 3 The Mistake Bound Model with “I don’t know” predictions We can look at the problem from another perspective: instead of adding mistakes to the KWIK framework, we can add “I don’t know” to the Mistake Bound model. [sent-123, score-0.555]
34 Therefore, in this section, we try to improve over optimal mistake bounds by allowing the algorithm to use a modest number of ⊥’s, and in general to consider the tradeoff between the number of mistakes and the number of ⊥’s. [sent-125, score-1.009]
35 Note that an algorithm can always replace its ⊥’s with random +’s and −’s, therefore, we must expect that decreasing the number of mistakes by one requires increasing the number of ⊥’s by at least one. [sent-126, score-0.595]
36 A monotone disjunction is a disjunction in which no literal appears negated, that is, a function of the form f (x1 , . [sent-129, score-0.246]
37 We know that this class can be learned with at most n mistakes in Mistake-bound Model [Lit88] where n is the total number of variables. [sent-137, score-0.709]
38 This class is particularly interesting because results derived about monotone disjunctions can be applied to other classes as well, such as general disjunctions, conjunctions, and k-DNF formulas. [sent-138, score-0.399]
39 We are interested in decreasing the number of mistakes at the cost of having (hopefully few) ⊥’s. [sent-139, score-0.506]
40 In fact, for the case of i = 2, we are trading off each mistake for four “I don’t know”s. [sent-144, score-0.429]
41 The pool P + is the set of variables that we know must exist in the target function; the pool P − is the set of the 4 variables that we know cannot exist in the target function. [sent-153, score-0.598]
42 Every time we answer ⊥, we move S ∩ P to P + or P − depending on the correct label of the example. [sent-169, score-0.306]
43 Proposition 2 Algorithm 3 learns the class of Monotone Disjunctions with at most M ≤ n/2 mistakes and n − 2M number of ⊥s. [sent-170, score-0.573]
44 Proof: If we make a mistake, it must be the case that the answer had been negative while we answered positive; for this to happen, we must have |S ∩ P | ≥ 2. [sent-171, score-0.294]
45 Note that the sum of its ⊥s and its mistakes is never more than n. [sent-179, score-0.506]
46 More precisely, if the cost of making a mistake is 1 and the cost of saying ⊥ is < 1, the worst-case cost of this algorithm is strictly smaller than n. [sent-180, score-0.548]
47 Next we present an algorithm for decreasing the number of mistakes to n/3. [sent-181, score-0.548]
48 We have another pool P ′ which consists of pairs of variables such that for each pair we know at least one of the variables belongs to the target function. [sent-183, score-0.346]
49 Proposition 3 Algorithm 4 learns the class of Monotone Disjunction with at most M ≤ n/3 mistake and 3n/2 − 3M number of ⊥’s. [sent-202, score-0.461]
50 Proof: If |S ∩ P | ≥ 3 and we make a mistake on S, then the size of P will be reduced by at least 3, and the size of P − will increase by at least 3. [sent-203, score-0.538]
51 If |S ∩ (P ∪ P ′ )| ≥ 2 and |S ∩ P ′ | ≥ 1 and we make a mistake on S, then at least two variables will be moved from (P ′ ∪ P ) to P − , and at least one variable will be moved from P ′ to P + (since whenever a variable moves from P ′ to P − , the other variable in its pair should move to P + ). [sent-204, score-0.751]
52 If |S ∩ P | = 0 and |S ∩ P ′ | = 1, we answer ⊥; however, after knowing the correct label, S ∩ P ′ will be moved to P + or P − . [sent-208, score-0.286]
53 2 Learning Linear Separator Functions In this section, we analyze how we can use ⊥ predictions to decrease the number of mistakes for efficiently learning linear separators with margin γ. [sent-216, score-0.674]
54 Below, we show how to formulate the problem with a Linear Program to bound the number of mistakes using some “I don’t know” answers. [sent-221, score-0.55]
55 These points arrive one at a time and we have to answer when a point arrives. [sent-223, score-0.254]
56 The objective is to make a small number of mistakes and some “I don’t know” answers to find a separation vector w such that w · xi is positive if and only if xi is a + point. [sent-224, score-0.753]
57 w · xi > 0 If xi is a + instance, and w · xi ≤ 0 If xi is a − instance Note that there are d variables which are the coordinates of vector w, and there are n linear constraints one per input point. [sent-226, score-0.242]
58 Clearly we do not know which points are the + points, so we can not write this linear program explicitly and solve it. [sent-227, score-0.391]
59 In order to make the analysis easier and bound the core (the set of feasible solutions of the linear program), we can assume that √ √ the coordinates of the vector w are always in range [−1 − γ/ d, 1 + γ/ d]. [sent-230, score-0.396]
60 The core of the linear program is the set of vectors w in [−1 − √ √ √ γ/ d, 1 + γ/ d]d at the beginning. [sent-235, score-0.366]
61 So we have a core (feasible set) of volume (2 + 2γ/ d)d at first. [sent-236, score-0.289]
62 If we add any of these two constraints to the linear program, we obtain a more restricted linear program with a core of lesser volume. [sent-242, score-0.434]
63 So we obtain one LP for each of these two possibilities, and the sum of the volumes of the cores of these two linear programs is equal to the volume of the core of our current linear program. [sent-243, score-0.477]
64 If the volume of the linear program for the + case is larger than the − case, we answer +. [sent-245, score-0.502]
65 Otherwise we have made a mistake, but the volume of the core of our linear program is halved. [sent-247, score-0.482]
66 we answer − when the larger volume is for − case. [sent-250, score-0.309]
67 First of all, we have to find a way to compute the volume of the core of a linear program. [sent-252, score-0.33]
68 6 In fact computing the volume of a linear program is #P -hard [DF88]. [sent-254, score-0.309]
69 There exists a randomized polynomial time algorithm that approximates the volume of the core of a linear program with (1 + ǫ) approximation [DFK91], i. [sent-255, score-0.586]
70 But note that we really do not need to know the volumes of these linear programs. [sent-260, score-0.279]
71 We just need to know whether the volume of the linear program of the + case is larger or the − case is larger or if they are approximately equal. [sent-261, score-0.475]
72 Lovasz and Vempala present a faster polynomial time algorithm for sampling a uniformly random point in the core of a linear program in [LV06]. [sent-262, score-0.47]
73 One way to estimate the relative volumes of both sides is to sample a uniformly random point from the core of our current linear program (without taking into account the new arrived point), and see if the sampled point is in the + side or the − side. [sent-263, score-0.438]
74 If we sample a sufficiently large number of points (here 2 log(n)/ǫ2 is large enough), and if the majority of them are in the + (−) side, we can say that the volume of the linear program for + (−) case is at least a 1 − ǫ fraction of our current linear program 2 with high probability. [sent-264, score-0.708]
75 So we can answer based on the majority of these sampled points, and if we make a mistake, we know that the volume of the core of the linear program is multiplied by at most 1 1 − ( 1 − ǫ) = 2 + ǫ. [sent-265, score-1.003]
76 We sample points from the core of this linear program, and based on the majority of them we answer + or − for this new example. [sent-268, score-0.49]
77 1 Lemma 4 With high probability (1− nΩ(1) ), for every mistake we make in our algorithm, the volume of the core of the linear program decreases by a factor of ( 1 + ǫ). [sent-270, score-0.948]
78 2 Proof: Without loss of generality, assume that we answered +, but the correct answer was −. [sent-271, score-0.269]
79 Since there are n examples arriving, we can use the union bound to bound the probability of failure on any of these rounds: the probability that the volume of the core of our linear program is not multiplied by 1 1 at most 2 + ǫ on any mistakes is at most n × n−2/(1−ǫ) = n1/(1−ǫ) . [sent-276, score-1.169]
80 Therefore with high probability 1 (at least 1 − n1/(1−ǫ) ), for every mistake we make, the volume of the core is multiplied by at most 1 2 2 + ǫ. [sent-277, score-0.791]
81 Now we show that the core of the linear program after adding all n constraints (the constraints of the variables) should have a decent volume in terms of γ. [sent-278, score-0.536]
82 ∗ ·x Lemma 5 If there is a unit-length separator vector w∗ with minx∈S w|x| = γ, the core of the √ complete linear program after adding all n constraints of the points has volume at least (γ/ d)d . [sent-279, score-0.637]
83 We also know ·x i |− ∗ ′ that the coordinates of w + w are in range (−1 − γ/ d, 1 + γ/ d) because w∗ has unit length (so √ √ all its coordinates are between −1 and 1), and the coordinates of w′ are in range (−γ/ d, γ/ d). [sent-289, score-0.352]
84 We conclude that the volume of the core is at all least (2γ/ d)d . [sent-291, score-0.336]
85 Theorem 6 The √ total number of mistakes in the above algorithm is not more than √ d (2+2γ/ d)d d) √ √ log2/(1+ǫ) (2γ/ d)d = log2/(1+ǫ) (1+γ/ d)d = O(d(log d + log 1/γ)). [sent-293, score-0.577]
86 Define Y1 to be (2+2γ/ d)d which is the volume of the √ core at the beginning before adding any of the constraints of the points. [sent-297, score-0.316]
87 Define Y2 to be (2γ/ d)d which is a lower bound for the volume of the core after adding all the constraints of the points. [sent-298, score-0.36]
88 Let V, V1 , and V2 be the volumes of the cores of the current linear program, the linear program with the additional constraint that the new point is a + point, and the linear program with the additional constraint that the new point is a − point respectively. [sent-303, score-0.533]
89 In this case, even if we make a mistake the volume of the core is divided by at least C, and by definition of C, this can not happen more than logC R = k times. [sent-305, score-0.81]
90 If V1 /V and V2 /V are both greater than 1/C, we answer “I don’t know”, and we know that the volume of the core is multiplied by at most 1 − 1/C. [sent-307, score-0.709]
91 Since we just need to estimate the ratios V1 /V and V2 /V , and in fact we want to see if any of them is smaller than 1/C or not, we can simply sample points from the core of our current linear program. [sent-308, score-0.282]
92 For example if we sample 16C log n points, and there are at most 8 log n + points among them, we can say that V1 /V 2 1 is at most 1/C with probability at least 1 − e−64 log n/32 log n = 1 − n2 . [sent-310, score-0.271]
93 But if there are at least 8 log n + points, and 8 log n − points among the samples, we can say that both V1 /V and V2 /V are 1 at least 8C with high probability using Chernoff bounds. [sent-311, score-0.26]
94 If we make a mistake in this algorithm, the volume of the core is divided by at least C, so we do not make more than k mistakes. [sent-312, score-0.83]
95 We also know that for each “I don’t know” answer the volume of the 1 core is multiplied by at most 1 − 8C , so after 8C “I don’t know” answers the volume of the core is multiplied by at most 1/e. [sent-313, score-1.2]
96 Theorem 7 For any k > 0, we can learn a linear separator of margin γ in ℜd using the above algorithm with k mistakes and O(R1/k × log R) “I don’t know” answers, where R is equal to √ (1+γ/ d)d √ . [sent-316, score-0.667]
97 From one perspective, we are allowing the algorithm to make mistakes in the KWIK model. [sent-318, score-0.639]
98 We showed, using a version-space algorithm and through a reduction to the egg-game puzzle, that allowing a few mistakes in the KWIK model can significantly decrease the number of don’tknow predictions. [sent-319, score-0.611]
99 This can be particularly useful if don’t-know predictions are cheaper than mistakes and we can trade off some number of mistakes for a not-too-much-larger number of “I don’t know”s. [sent-321, score-1.113]
100 We gave polynomial-time algorithms that effectively reduce the number of mistakes in the mistakebound model using don’t-know predictions for two concept classes: monotone disjunctions and linear separators with a margin. [sent-322, score-1.083]
wordName wordTfidf (topN-words)
[('mistakes', 0.506), ('mistake', 0.394), ('kwik', 0.394), ('disjunctions', 0.205), ('answer', 0.193), ('core', 0.173), ('know', 0.166), ('program', 0.152), ('answers', 0.141), ('monotone', 0.128), ('egg', 0.117), ('eggs', 0.117), ('volume', 0.116), ('puzzle', 0.103), ('pool', 0.086), ('saying', 0.084), ('say', 0.076), ('volumes', 0.072), ('moved', 0.068), ('coordinates', 0.062), ('polynomial', 0.062), ('multiplied', 0.061), ('concept', 0.059), ('disjunction', 0.059), ('label', 0.057), ('separators', 0.056), ('answered', 0.051), ('arrives', 0.051), ('majority', 0.051), ('make', 0.05), ('separator', 0.049), ('predictions', 0.049), ('sure', 0.049), ('gf', 0.047), ('least', 0.047), ('target', 0.047), ('bound', 0.044), ('enumeration', 0.044), ('minority', 0.044), ('algorithm', 0.042), ('allowed', 0.042), ('hi', 0.042), ('allowing', 0.041), ('linear', 0.041), ('diuk', 0.039), ('mistakebound', 0.039), ('walsh', 0.039), ('wsdl', 0.039), ('game', 0.038), ('class', 0.037), ('knows', 0.036), ('want', 0.036), ('dropped', 0.035), ('trading', 0.035), ('morteza', 0.034), ('cores', 0.034), ('intern', 0.034), ('examples', 0.032), ('points', 0.032), ('move', 0.031), ('reinforcement', 0.03), ('learns', 0.03), ('happen', 0.03), ('pools', 0.029), ('conjunctions', 0.029), ('dyer', 0.029), ('cheaper', 0.029), ('classes', 0.029), ('arrive', 0.029), ('log', 0.029), ('xi', 0.028), ('chernoff', 0.028), ('oor', 0.028), ('avrim', 0.028), ('exceed', 0.028), ('making', 0.028), ('constraints', 0.027), ('answering', 0.027), ('cmu', 0.027), ('bounds', 0.026), ('break', 0.026), ('feasible', 0.026), ('correct', 0.025), ('adversary', 0.025), ('minx', 0.024), ('proof', 0.024), ('concepts', 0.023), ('functions', 0.023), ('moves', 0.023), ('whenever', 0.023), ('trade', 0.023), ('decrease', 0.022), ('initially', 0.022), ('predict', 0.022), ('littman', 0.022), ('possibilities', 0.022), ('population', 0.022), ('decreases', 0.022), ('kearns', 0.021), ('therefore', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
2 0.21562022 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
3 0.074475721 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Author: Ling Huang, Jinzhu Jia, Bin Yu, Byung-gon Chun, Petros Maniatis, Mayur Naik
Abstract: Predicting the execution time of computer programs is an important but challenging problem in the community of computer systems. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program’s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
4 0.071466558 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
5 0.060842164 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
6 0.058023032 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
7 0.05405046 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
8 0.051290739 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
9 0.049915921 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
10 0.045902524 192 nips-2010-Online Classification with Specificity Constraints
11 0.045702796 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
12 0.044676661 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
13 0.041825756 226 nips-2010-Repeated Games against Budgeted Adversaries
14 0.041221805 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
15 0.0382073 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
16 0.037993144 223 nips-2010-Rates of convergence for the cluster tree
17 0.037710495 222 nips-2010-Random Walk Approach to Regret Minimization
18 0.03770496 243 nips-2010-Smoothness, Low Noise and Fast Rates
19 0.034304265 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
20 0.033799425 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
topicId topicWeight
[(0, 0.12), (1, 0.003), (2, 0.068), (3, -0.001), (4, -0.006), (5, 0.04), (6, -0.059), (7, -0.024), (8, 0.01), (9, 0.068), (10, 0.045), (11, -0.012), (12, 0.001), (13, -0.002), (14, 0.014), (15, 0.032), (16, -0.013), (17, -0.07), (18, -0.026), (19, -0.043), (20, 0.017), (21, 0.064), (22, 0.093), (23, 0.033), (24, -0.089), (25, -0.029), (26, -0.004), (27, 0.05), (28, -0.091), (29, 0.03), (30, 0.089), (31, 0.039), (32, -0.016), (33, 0.061), (34, -0.066), (35, 0.076), (36, 0.032), (37, 0.102), (38, 0.154), (39, -0.01), (40, -0.085), (41, 0.016), (42, -0.003), (43, 0.054), (44, -0.016), (45, 0.063), (46, -0.064), (47, -0.095), (48, -0.086), (49, -0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.88995123 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
2 0.75564152 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
3 0.61390424 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
4 0.5340609 188 nips-2010-On Herding and the Perceptron Cycling Theorem
Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling
Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1
5 0.46821156 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
6 0.43153098 2 nips-2010-A Bayesian Approach to Concept Drift
7 0.43103254 47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
8 0.4127754 15 nips-2010-A Theory of Multiclass Boosting
9 0.40073398 261 nips-2010-Supervised Clustering
10 0.36863264 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
11 0.34277365 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
12 0.3402006 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
13 0.33928466 226 nips-2010-Repeated Games against Budgeted Adversaries
14 0.33320293 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
15 0.32841673 223 nips-2010-Rates of convergence for the cluster tree
16 0.3241339 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
17 0.31632408 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
18 0.31366718 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
19 0.31058964 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
20 0.30426705 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
topicId topicWeight
[(13, 0.061), (27, 0.033), (30, 0.126), (39, 0.011), (45, 0.196), (50, 0.024), (52, 0.018), (60, 0.038), (77, 0.045), (78, 0.037), (87, 0.284), (90, 0.028)]
simIndex simValue paperId paperTitle
1 0.78856725 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
same-paper 2 0.78006768 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
3 0.77149606 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider multivariate regression problems involving high-dimensional predictor and response spaces. To efficiently address such problems, we propose a variable selection method, Multivariate Group Orthogonal Matching Pursuit, which extends the standard Orthogonal Matching Pursuit technique. This extension accounts for arbitrary sparsity patterns induced by domain-specific groupings over both input and output variables, while also taking advantage of the correlation that may exist between the multiple outputs. Within this framework, we then formulate the problem of inferring causal relationships over a collection of high-dimensional time series variables. When applied to time-evolving social media content, our models yield a new family of causality-based influence measures that may be seen as an alternative to the classic PageRank algorithm traditionally applied to hyperlink graphs. Theoretical guarantees, extensive simulations and empirical studies confirm the generality and value of our framework.
4 0.65850943 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
5 0.65141457 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
6 0.64789265 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
7 0.64789259 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
8 0.64365333 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
9 0.64132327 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
10 0.63954073 220 nips-2010-Random Projection Trees Revisited
11 0.63868165 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
12 0.63737285 243 nips-2010-Smoothness, Low Noise and Fast Rates
13 0.63588214 27 nips-2010-Agnostic Active Learning Without Constraints
14 0.63469464 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
15 0.63398421 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
16 0.62952226 134 nips-2010-LSTD with Random Projections
17 0.62851322 155 nips-2010-Learning the context of a category
18 0.62826443 63 nips-2010-Distributed Dual Averaging In Networks
19 0.62582308 280 nips-2010-Unsupervised Kernel Dimension Reduction
20 0.6248517 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks