nips nips2008 nips2008-123 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/(2(3+α)), where N denotes the number of queried labels and α > 0 is the exponent in the low noise condition. For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. [sent-7, score-0.532]
2 Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. [sent-8, score-1.058]
3 Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/(2(3+α)), where N denotes the number of queried labels and α > 0 is the exponent in the low noise condition. [sent-9, score-1.175]
4 For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. [sent-11, score-0.352]
5 Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. [sent-12, score-0.578]
6 1 Introduction In the standard online learning protocol for binary classification the learner receives a sequence of instances generated by an unknown source. [sent-13, score-0.272]
7 Each time a new instance is received the learner predicts its binary label, and is then given the true label of the current instance before the next instance is observed. [sent-14, score-0.348]
8 In order to address this problem, a variant of online learning, known as selective sampling, has been proposed. [sent-17, score-0.504]
9 In this modified protocol the true label of the current instance is never revealed unless the learner decides to issue an explicit query. [sent-18, score-0.301]
10 A natural sampling strategy is one that tries to identify the labels that are likely to be useful to the algorithm, and then queries only those. [sent-20, score-0.414]
11 In [10] this approach was employed to define a selective sampling rule that queries a new label whenever the margin of the current instance, with respect to the current linear hypothesis, is smaller (in magnitude) than an adaptively adjusted threshold. [sent-23, score-0.91]
12 Although this selective sampling algorithm is efficient, and has simple variants working quite well in practice, the rate of convergence to the Bayes risk was never assessed in terms of natural distributional parameters, thus preventing a full understanding of the properties of this algorithm. [sent-25, score-0.742]
13 (ii) Under the same low noise condition, we prove that the RLS-based selective sampling rule of [10] converges to the Bayes risk at rate O(n −(1+α)/(3+α)), with labels being queried at rate O(n −α/(2+α)). [sent-27, score-1.204]
14 (iii) We perform experiments on a real-world medium-size dataset showing that variants of our mistake-driven sampler compare favorably with other selective samplers proposed in the literature, like the ones in [11, 16, 20]. [sent-31, score-0.61]
15 For example, in Angluin’s adversarial learning with queries (see [1] for a survey), the goal is to identify an unknown boolean function f from a given class, and the learner can query the labels (i.e., the values of f ) of arbitrarily chosen instances. [sent-34, score-0.464]
16 They prove risk bounds in terms of nonparametric characterizations of both the regularity of the Bayes decision boundary and the behavior of the noise rate in its proximity. [sent-39, score-0.24]
17 In fact, a large statistical literature on adaptive sampling and sequential hypothesis testing exists (see for instance the detailed description in [9]) which is concerned with problems that share similarities with active learning. [sent-40, score-0.245]
18 The idea of querying small margin instances when learning linear classifiers has been explored several times in different active learning contexts. [sent-41, score-0.328]
19 A landmark result in the selective sampling protocol is the query-by-committee algorithm of Freund, Seung, Shamir and Tishby [17]. [sent-46, score-0.607]
20 In the realizable (noise-free) case, and under strong distributional assumptions, this algorithm is shown to require exponentially fewer labels than instances when learning linear classifiers (see also [18] for a more practical implementation). [sent-47, score-0.286]
21 In the general statistical learning case, under no assumptions on the joint distribution of label and instances, selective sampling bears no such exponential advantage. [sent-49, score-0.624]
22 For instance, Kääriäinen shows that, in order to approach the risk of the best linear classifier f ∗ within error ε, at least Ω((η/ε)2 ) labels are needed, where η is the risk of f ∗ . [sent-50, score-0.29]
23 General selective sampling strategies for the nonrealizable case have been proposed in [3, 4, 15]. [sent-52, score-0.532]
24 2 Learning protocol and data model We consider the following online selective sampling protocol. [sent-54, score-0.648]
25 At each step t = 1, 2, . . . the sampling algorithm (or selective sampler) receives an instance xt ∈ Rd and outputs a binary prediction for the associated label yt ∈ {−1, +1}. [sent-58, score-1.156]
26 After each prediction, the algorithm has the option of “sampling” (issuing a query) in order to receive the label yt . [sent-59, score-0.316]
27 After seeing the label yt , the algorithm can choose whether or not to update its internal state using the new information encoded by (xt , yt ). [sent-61, score-0.54]
28 Following [10], we assume that labels yt are generated according to the following simple linear noise model: there exists a fixed and unknown vector u ∈ Rd , with Euclidean norm ∥u∥ = 1, such that E[Yt | X t = xt ] = u⊤ xt for all t ≥ 1. [sent-66, score-0.668]
29 Hence X t = xt has label 1 with probability (1 + u⊤ xt )/2 ∈ [0, 1]. [sent-67, score-0.34]
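The label model above is easy to simulate. The following sketch (ours, not from the paper; the dimension, the sample size, and the helper name sample_stream are illustrative choices) draws instances from the unit ball and labels them with P(Yt = 1 | Xt = xt) = (1 + u⊤xt)/2.

```python
import numpy as np

def sample_stream(u, n, rng):
    """Draw n instances from the unit ball in R^d and label them
    according to the linear noise model E[Y | X = x] = u.x."""
    d = u.shape[0]
    g = rng.standard_normal((n, d))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)   # random directions
    x *= rng.uniform(size=(n, 1)) ** (1.0 / d)         # scale into the ball
    p_pos = (1.0 + x @ u) / 2.0                        # P(Y = +1 | X = x)
    y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
    return x, y

rng = np.random.default_rng(0)
u = np.zeros(10)
u[0] = 1.0                                             # ||u|| = 1
X, Y = sample_stream(u, 1000, rng)
```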
30 The instantaneous regret R(f ) is the excess risk of SGN(f ) w.r.t. the Bayes risk. [sent-81, score-0.34]
31 Let Rt−1 (ft ) be the instantaneous conditional regret Rt−1 (ft ) = Pt−1 (Yt ft (X t ) < 0) − Pt−1 (Yt f ∗ (X t ) < 0). [sent-101, score-0.37]
32 Our goal is to bound the expected cumulative regret E[R0 (f1 ) + R1 (f2 ) + · · · + Rn−1 (fn )] as a function of n and other relevant quantities. [sent-102, score-0.318]
33 Observe that, although the learner’s predictions can only depend on the queried examples, the regret is computed over all time steps, including the ones when the selective sampler did not issue a query. [sent-103, score-1.087]
34 In order to model the distribution of the instances around the hyperplane u⊤ x = 0, we use the Mammen-Tsybakov low noise condition [24]: there exist c > 0 and α ≥ 0 such that P(|f ∗ (X 1 )| < ε) ≤ c εα for all ε > 0 (1). [sent-104, score-0.238]
35 When the noise exponent α is 0 the low noise condition becomes vacuous. [sent-105, score-0.278]
36 Our wt is an RLS estimator defined over the set of previously queried examples. [sent-113, score-0.366]
37 More precisely, let Nt be the number of queried examples during the first t time steps, St−1 = [x′1 , . . . , x′Nt−1 ]⊤ be the matrix of the queried instances up to time t − 1, and yt−1 = (y′1 , . . . , y′Nt−1 ) be the vector of the corresponding labels. [sent-114, score-0.296] [sent-117, score-0.335]
39 Then the RLS estimator is defined by wt = (I + S⊤t−1 St−1 + xt x⊤t )−1 S⊤t−1 yt−1 , (2) where I is the d × d identity matrix. [sent-121, score-0.238]
40 Note that wt depends on the current instance xt . [sent-122, score-0.251]
41 The RLS estimator (2) can be stored in space Θ(d2 ), which we need for the inverse of I + S⊤t−1 St−1 + xt x⊤t . [sent-135, score-0.29]
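As a concrete illustration of (2), here is a minimal sketch (ours; the function names rls_margin and rls_store are not from the paper) that computes the margin ∆t = w⊤t xt by solving the d × d regularized system directly. The incremental Θ(d²)-space inverse update hinted at in the text is omitted for simplicity, so this is a plain restatement of (2) rather than the paper’s optimized implementation.

```python
import numpy as np

def rls_margin(A, b, x_t):
    """Margin w_t^T x_t of the RLS estimator (2).

    A   : d x d accumulated matrix  I + S^T S  over the stored examples
    b   : d-dim accumulated vector  S^T y      over the stored examples
    x_t : current instance (it enters the regularizer, as in (2))
    """
    w_t = np.linalg.solve(A + np.outer(x_t, x_t), b)
    return float(w_t @ x_t)

def rls_store(A, b, x, y):
    """Rank-one update of (A, b) after storing the example (x, y)."""
    A = A + np.outer(x, x)
    b = b + y * x
    return A, b
```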
42 Our first result establishes a regret bound for the fully supervised algorithm, [sent-141, score-0.422]
43 i.e., the algorithm that predicts using RLS as in (2), queries the label of every instance, and stores all examples. [sent-143, score-0.275]
44 This result is the baseline against which we measure the performance of our selective sampling algorithm. [sent-144, score-0.532]
45 Then the expected cumulative regret after n steps of the fully supervised algorithm based on (2) is bounded by E[(4c(1 + ln|I + S⊤n Sn |))^((1+α)/(2+α))] n^(1/(2+α)) ≤ (4c(1 + Σ_{i=1}^d ln(1 + nλi )))^((1+α)/(2+α)) n^(1/(2+α)) = O(n^(1/(2+α))). [sent-150, score-0.596]
46 This, in turn, is bounded from above by (d ln n)^((1+α)/(2+α)) n^(1/(2+α)). [sent-151, score-0.234]
47 When α = 0 (corresponding to a vacuous noise condition) the bound of Theorem 1 reduces to O(√(d n ln n)). [sent-156, score-0.4]
48 When α → ∞ (corresponding to a hard margin condition) the bound gives the logarithmic behavior O(d ln n). [sent-157, score-0.399]
49 Notice that Σ_{i=1}^d ln(1 + nλi ) is substantially smaller than d ln n whenever the spectrum of E[X 1 X⊤1 ] is rapidly decreasing. [sent-158, score-0.267]
50 Predict the label yt ∈ {−1, 1} with SGN(w⊤t xt ), where wt is as in (2). [sent-170, score-0.506]
51 If N ≤ ρt then query label yt and store (xt , yt ); else, if ∆t² ≤ (128 ln t)/(λN ), schedule the storage of (xt+1 , yt+1 ). [sent-172, score-0.893]
52 If (xt , yt ) is scheduled to be stored, then increment N and update wt using (xt+1 , yt+1 ). [sent-174, score-0.29]
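Putting the pieces together, the loop below is a minimal sketch of the sampler of Figure 1, reusing rls_margin and rls_store from the sketch above. It follows the reconstruction of step 3 given earlier (transient stage of length ρt, then the margin test ∆t² ≤ (128 ln t)/(λNt−1) that schedules the next label); the exact bookkeeping of Nt−1 and the variable names are our own simplifications, and λ is assumed known.

```python
import numpy as np

def selective_sampler(stream, d, lam):
    """Run a sketch of the Figure 1 sampler on an iterable of (x_t, y_t) pairs.

    lam : smallest nonzero eigenvalue of E[X X^T] (assumed known here)
    """
    A, b = np.eye(d), np.zeros(d)        # I + S^T S   and   S^T y
    n_stored, scheduled = 0, False
    predictions = []
    for t, (x, y) in enumerate(stream, start=1):
        margin = rls_margin(A, b, x)     # Delta_t = w_t^T x_t
        predictions.append(1 if margin >= 0 else -1)
        rho_t = (16.0 / lam ** 2) * max(d, np.log(t))
        if n_stored <= rho_t or scheduled:
            # transient stage, or a query scheduled at the previous step:
            # query y_t and store (x_t, y_t)
            A, b = rls_store(A, b, x, y)
            n_stored += 1
            scheduled = False
        elif margin ** 2 <= 128.0 * np.log(t) / (lam * n_stored):
            scheduled = True             # query the label of the next instance
    return predictions, n_stored
```

For example, feeding it the stream produced by sample_stream above returns the sequence of predictions together with the number of stored (queried) examples, which should grow much more slowly than the number of rounds.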
53 A reference closer to our paper is Ying and Zhou [26], where the authors prove bounds for online linear classification using the low noise condition (1), though under different distributional assumptions. [sent-179, score-0.277]
54 Our second result establishes a new regret bound, under low noise conditions, for the selective sampler introduced in [10]. [sent-180, score-0.913]
55 This variant, described in Figure 1, queries all labels (and stores all examples) during an initial stage of length at least (16 d)/λ², where λ denotes the smallest nonzero eigenvalue of the process covariance matrix E[X 1 X⊤1 ]. [sent-181, score-0.344]
56 When this transient regime is over, the sampler issues a query at time t based on both the query counter Nt−1 and the margin ∆t . [sent-182, score-0.508]
57 Specifically, if evidence is collected that the number Nt−1 of stored examples is smaller than our current estimate of 1/∆t², that is, if ∆t² ≤ (128 ln t)/(λNt−1 ), then we query (and store) the label of the next instance xt+1 . [sent-183, score-0.668]
58 This additional information is needed because, unlike the fully supervised classifier of Theorem 1, the selective sampler queries labels at random steps. [sent-185, score-0.921]
59 This prevents us from bounding the sum of conditional variances of the involved RLS estimator through ln|I + S⊤n Sn |, as we can do when proving Theorem 1 (see below). [sent-186, score-0.282]
60 In this respect, our algorithm is more properly an adaptive sampler, rather than a selective sampler. [sent-191, score-0.422]
61 Such a rule provides that, when a small margin is detected, a query be issued (and the next example be stored) only if SGN(∆t ) ≠ yt , i.e., only when the current prediction is mistaken. [sent-193, score-0.502]
62 It is easy to adapt our analysis to obtain even for this algorithm the same regret bound as the one established in Theorem 2. [sent-197, score-0.318]
63 However, in this case we can only give guarantees on the expected number of stored examples (which can indeed be much smaller than the actual number of queried labels). [sent-198, score-0.414]
64 Theorem 2 Assume the low noise condition (1) holds with unknown exponent α ≥ 0 and assume the selective sampler of Figure 1 is run with ρt = (16/λ²) max{d, ln t}. [sent-199, score-0.987]
65 Then, after n steps, the expected cumulative regret is bounded by O( (d + ln n)/λ² + ((ln n)/λ)^((1+α)/(3+α)) n^(2/(3+α)) ), whereas the expected number of queried labels (including the stored ones) is bounded by O( (d + ln n)/λ² + ((ln n)/λ)^(α/(2+α)) n^(2/(2+α)) ). [sent-200, score-1.69]
66 In particular, we show that our selective sampler is able to adaptively estimate the number of queries needed to ensure a 1/t increase of the regret when a query is not issued at time t. [sent-205, score-1.157]
67 As expected, when we compare our semi-supervised selective sampler (Theorem 2) to the fully supervised “yardstick” (Theorem 1), we see that the per-step regret of the former vanishes at a significantly slower rate than the latter, i.e., n −(1+α)/(3+α) versus n −(1+α)/(2+α). [sent-206, score-0.958]
68 Note, however, that the per-step regret of the semi-supervised algorithm vanishes faster than its fully-supervised counterpart when both regrets are expressed in terms of the number N of issued queries. [sent-210, score-0.312]
69 In the hard margin case α → ∞, for instance, both algorithms have a per-step regret of order (ln n)/n. [sent-212, score-0.258]
70 However, since the semi-supervised algorithm makes only N = O(ln n) queries, we have that, as a function of N , the per-step regret of the semi-supervised algorithm is of order N/e^N , whereas the fully supervised one has only (ln N )/N . [sent-213, score-0.362]
71 When α = 0 (vacuous noise conditions), the per-step regret rates in terms of N become (excluding logarithmic factors) of order N −1/3 in the semi-supervised case and of order N −1/2 in the fully supervised case. [sent-215, score-0.432]
72 In order to find this critical value we write the rates of the per-step regret for 0 ≤ α < ∞ in terms of N , obtaining N −(1+α)(2+α)/(2(3+α)) (semi-supervised algorithm) and N −(1+α)/(2+α) (fully supervised algorithm). [sent-217, score-0.315]
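The critical value quoted in the abstract can be checked with a few lines of arithmetic (ours, purely illustrative): setting the two exponents equal gives (2+α)² = 2(3+α), i.e. α² + 2α − 2 = 0, whose positive root is √3 − 1 ≈ 0.73.

```python
import numpy as np

def semi_exponent(a):    # per-step regret exponent in N, semi-supervised
    return (1 + a) * (2 + a) / (2 * (3 + a))

def full_exponent(a):    # per-step regret exponent in N, fully supervised
    return (1 + a) / (2 + a)

a_star = np.sqrt(3) - 1                                # crossover, about 0.732
print(semi_exponent(a_star), full_exponent(a_star))    # equal at a_star
for a in (0.0, 0.5, 1.0, 5.0):
    # the semi-supervised exponent is larger (faster rate) only once a > a_star
    print(a, semi_exponent(a) > full_exponent(a))
```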
73 This indicates that selective sampling is advantageous when the noise level (as modeled by the MammenTsybakov condition) is not too high. [sent-219, score-0.602]
74 It turns out this is a fixable artifact of our analysis, rather than an intrinsic limitation of the selective sampling scheme in Figure 1. [sent-221, score-0.532]
75 The proof proceeds by relating the classification regret to the square loss regret via a comparison theorem. [sent-224, score-0.516]
76 The square loss regret is then controlled by applying a known pointwise bound. [sent-225, score-0.258]
77 Next, we take the bound just obtained and apply Jensen’s inequality twice, first to the concave function (·)^((1+α)/(2+α)) of a real argument, and then to the concave function ln|·| of a (positive definite) matrix argument. [sent-237, score-0.294]
78 We aim at bounding from above the cumulative regret Σ_{t=1}^n [P(Yt ∆t < 0) − P(Yt f ∗ (X t ) < 0)], where ∆t = w⊤t X t is the algorithm’s margin and f ∗ (X t ) = u⊤X t is the Bayes margin; according to our probabilistic model, this can be shown to be at most c n ε^(1+α) + Σ_{t=1}^n P(∆t f ∗ (X t ) ≤ 0, |f ∗ (X t )| ≥ ε). [sent-241, score-0.258]
79 The last sum is upper bounded by Σ_{t=1}^n P(Nt−1 ≤ ρt ) (I), plus Σ_{t=1}^n P(∆t² ≤ (128 ln t)/(λNt−1 ), Nt−1 > ρt , |f ∗ (X t )| ≥ ε) (II), plus Σ_{t=1}^n P(∆t f ∗ (X t ) ≤ 0, ∆t² > (128 ln t)/(λNt−1 ), Nt−1 > ρt ) (III), [sent-242, score-0.468]
80 where: (I) are the initial time steps; (II) are the time steps on which we trigger the query of the next label (because ∆t² is smaller than the threshold at time t); (III) are the steps that do not trigger any queries at all. [sent-243, score-0.423]
81 Note that (III) bounds the regret over non-sampled examples. [sent-244, score-0.295]
82 To bound (II) and (III) we need to exploit the fact that the subsequence of stored instances and labels is a sequence of i.i.d. random variables. [sent-247, score-0.387]
83 By suitably combining these concentration results we can bound term (II) by O((d + ln n)/λ² + (ln n)/(λε²)) and term (III) by O(ln n). [sent-256, score-0.327]
84 Putting together, and choosing ε of the order of ((ln n)/n)^(1/(3+α)), gives the desired regret bound. [sent-257, score-0.492]
85 The bound on the number of queried labels is obtained in a similar way. [sent-258, score-0.438]
86 In this paper, however, we decided to stick to the simpler analysis leading to Theorem 2, since the bounds resulting from a more refined analysis would be harder to read, and would somehow obscure the behavior of the regret and of the sampling rate as functions of n. [sent-263, score-0.456]
87 4 Experimental analysis In evaluating the empirical performance of our selective sampling algorithm, we consider two additional variants obtained by slightly modifying Step 4 in Figure 1. [sent-264, score-0.565]
88 The first variant (which we just call SS, Selective Sampler) queries the current label instead of the next one. [sent-265, score-0.279]
89 The second variant is a mistake-driven version (referred to as SSMD, Selective Sampling Mistake Driven) that queries the current label (and stores the corresponding example) only if the label gets mispredicted. [sent-267, score-0.408]
90 For clarity, the algorithm in Figure 1 will then be called SSNL (Selective Sampling Next Label) since it queries the next label whenever a small margin is observed. [sent-268, score-0.343]
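To make the difference between the variants concrete, the fragment below gives one plausible reading of the three query rules inside the per-round loop (the paper provides no pseudocode for the variants, so the function names and the exact bookkeeping are hypothetical); margin, n_stored, lam and t are as in the earlier sketch.

```python
import numpy as np

def small_margin(margin, n_stored, lam, t):
    # same threshold as in Figure 1
    return margin ** 2 <= 128.0 * np.log(t) / (lam * max(n_stored, 1))

# SSNL (Figure 1): small margin -> schedule a query for the NEXT label.
# SS             : small margin -> query the CURRENT label and store it.
# SSMD           : small margin -> query the CURRENT label, but store the
#                  example only if the revealed label was mispredicted.
def ssmd_round(margin, y_queried, n_stored, lam, t):
    queried = small_margin(margin, n_stored, lam, t)
    store = queried and (1 if margin >= 0 else -1) != y_queried
    return queried, store
```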
91 The online categorization of excerpts from a newswire feed is a realistic learning problem for selective sampling algorithms since a newswire feed consists of a large amount of uncategorized data with a high labeling cost. [sent-272, score-0.697]
92 It is reasonable to assume that in a selective sampling setup we are interested in the performance achieved when the fraction of queried labels stays below some threshold, say 10%. [sent-284, score-0.98]
93 Unsurprisingly, as the number of queried labels gets larger, SSMD, SOLE and SOP exhibit similar behaviors. [sent-286, score-0.378]
94 Moreover, the less than ideal plot of SSNL seems to confirm the intuition that querying small margin instances ... [sent-287, score-0.254]
95 Figure 2: Average F-measure obtained by different algorithms after 40,000 examples, as a function of the number of queried labels. [sent-337, score-0.252]
96 [Figure 3 legend: number of stored examples (normalized); norm of the SVM weight vector (normalized); F-measure; fraction of positive examples; fraction of queried labels.] [sent-341, score-0.584]
97 Right: F-measure achieved on the different binary classification tasks compared to the number of positive examples in each topic, and to the fraction of queried labels (including the stored ones). [sent-350, score-0.61]
98 In our selective sampling framework it is important to investigate how harder problems influence the sampling rate of an algorithm and, for each binary problem, to assess the impact of the number of positive examples on F-measure performance. [sent-358, score-0.737]
99 Here we focus on SSMD since it is arguably the best candidate, among our selective samplers, for application to real-world problems. [sent-360, score-0.422]
100 (i.e., on each individual topic, including the infrequent ones), and the corresponding levels of F-measure and queried labels (right). [sent-363, score-0.425]
wordName wordTfidf (topN-words)
[('selective', 0.422), ('ssmd', 0.269), ('regret', 0.258), ('queried', 0.252), ('ln', 0.234), ('yt', 0.224), ('nt', 0.196), ('rls', 0.147), ('queries', 0.146), ('labels', 0.126), ('xt', 0.124), ('sampler', 0.123), ('query', 0.119), ('stored', 0.118), ('ft', 0.112), ('sampling', 0.11), ('margin', 0.105), ('dkmperc', 0.104), ('label', 0.092), ('sn', 0.087), ('ssnl', 0.083), ('instances', 0.083), ('risk', 0.082), ('classi', 0.076), ('protocol', 0.075), ('active', 0.074), ('learner', 0.073), ('sop', 0.072), ('noise', 0.07), ('fraction', 0.07), ('querying', 0.066), ('sole', 0.066), ('wt', 0.066), ('rd', 0.063), ('newswire', 0.062), ('instance', 0.061), ('measurable', 0.061), ('bound', 0.06), ('bayes', 0.057), ('supervised', 0.057), ('sgn', 0.055), ('topics', 0.055), ('issued', 0.054), ('exponent', 0.053), ('theorem', 0.051), ('rate', 0.051), ('atlas', 0.05), ('estimator', 0.048), ('infrequent', 0.047), ('fully', 0.047), ('condition', 0.045), ('examples', 0.044), ('dasgupta', 0.044), ('distributional', 0.044), ('italy', 0.044), ('cohn', 0.044), ('st', 0.044), ('agnostic', 0.042), ('transient', 0.042), ('castro', 0.042), ('azoury', 0.041), ('cavallanti', 0.041), ('dkm', 0.041), ('dsi', 0.041), ('milano', 0.041), ('perc', 0.041), ('iii', 0.041), ('variant', 0.041), ('online', 0.041), ('low', 0.04), ('stores', 0.037), ('cristianini', 0.037), ('ers', 0.037), ('bounds', 0.037), ('ari', 0.036), ('degli', 0.036), ('monteleoni', 0.036), ('parametrize', 0.036), ('studi', 0.036), ('vacuous', 0.036), ('svm', 0.036), ('eigenvalue', 0.035), ('adaptively', 0.035), ('universit', 0.034), ('perceptron', 0.034), ('pt', 0.033), ('spectrum', 0.033), ('storage', 0.033), ('campbell', 0.033), ('trigger', 0.033), ('realizable', 0.033), ('reuters', 0.033), ('ying', 0.033), ('committee', 0.033), ('shamir', 0.033), ('documents', 0.033), ('concentration', 0.033), ('variants', 0.033), ('frequent', 0.032), ('ones', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1
2 0.25092933 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
3 0.17530698 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
4 0.1714666 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
5 0.16557804 73 nips-2008-Estimating Robust Query Models with Convex Optimization
Author: Kevyn Collins-thompson
Abstract: Query expansion is a long-studied approach for improving retrieval effectiveness by enhancing the user’s original query with additional related words. Current algorithms for automatic query expansion can often improve retrieval accuracy on average, but are not robust: that is, they are highly unstable and have poor worst-case performance for individual queries. To address this problem, we introduce a novel formulation of query expansion as a convex optimization problem over a word graph. The model combines initial weights from a baseline feedback algorithm with edge weights based on word similarity, and integrates simple constraints to enforce set-based criteria such as aspect balance, aspect coverage, and term centrality. Results across multiple standard test collections show consistent and significant reductions in the number and magnitude of expansion failures, while retaining the strong positive gains of the baseline algorithm. Our approach does not assume a particular retrieval model, making it applicable to a broad class of existing expansion algorithms. 1
6 0.16365252 170 nips-2008-Online Optimization in X-Armed Bandits
7 0.13609488 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
8 0.13561347 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
9 0.12846683 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
10 0.1266965 15 nips-2008-Adaptive Martingale Boosting
11 0.1208652 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
12 0.11524442 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
13 0.10788127 101 nips-2008-Human Active Learning
14 0.10603213 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
15 0.10502633 214 nips-2008-Sparse Online Learning via Truncated Gradient
16 0.10109226 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
17 0.099834003 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
18 0.099137828 168 nips-2008-Online Metric Learning and Fast Similarity Search
19 0.094251245 206 nips-2008-Sequential effects: Superstition or rational behavior?
20 0.093158647 171 nips-2008-Online Prediction on Large Diameter Graphs
topicId topicWeight
[(0, -0.265), (1, 0.029), (2, -0.196), (3, -0.07), (4, -0.274), (5, 0.16), (6, -0.037), (7, 0.119), (8, 0.106), (9, -0.06), (10, -0.068), (11, -0.003), (12, -0.004), (13, -0.057), (14, 0.095), (15, -0.077), (16, -0.014), (17, 0.062), (18, 0.079), (19, -0.068), (20, -0.015), (21, -0.011), (22, 0.102), (23, 0.043), (24, -0.043), (25, -0.076), (26, 0.039), (27, -0.077), (28, 0.16), (29, -0.011), (30, 0.058), (31, 0.07), (32, 0.11), (33, -0.078), (34, -0.039), (35, 0.102), (36, -0.005), (37, 0.029), (38, 0.04), (39, -0.057), (40, 0.084), (41, -0.028), (42, -0.039), (43, 0.076), (44, 0.079), (45, -0.058), (46, 0.048), (47, -0.009), (48, 0.047), (49, -0.141)]
simIndex simValue paperId paperTitle
same-paper 1 0.95554781 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1
2 0.73810244 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
Author: Sham M. Kakade, Ambuj Tewari
Abstract: This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. As a corollary, we characterize the convergence rate of P EGASOS (with high probability), a recently proposed method for solving the SVM optimization problem. 1
3 0.64765161 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
Author: Zhihua Zhang, Michael I. Jordan, Dit-Yan Yeung
Abstract: Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as “Silverman’s g-prior.” We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion. 1
4 0.56922591 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
Author: Matthew Streeter, Daniel Golovin
Abstract: We present an algorithm for solving a broad class of online resource allocation problems. Our online algorithm can be applied in environments where abstract jobs arrive one at a time, and one can complete the jobs by investing time in a number of abstract activities, according to some schedule. We assume that the fraction of jobs completed by a schedule is a monotone, submodular function of a set of pairs (v, τ ), where τ is the time invested in activity v. Under this assumption, our online algorithm performs near-optimally according to two natural metrics: (i) the fraction of jobs completed within time T , for some fixed deadline T > 0, and (ii) the average time required to complete each job. We evaluate our algorithm experimentally by using it to learn, online, a schedule for allocating CPU time among solvers entered in the 2007 SAT solver competition. 1
5 0.54206991 15 nips-2008-Adaptive Martingale Boosting
Author: Phil Long, Rocco Servedio
Abstract: In recent work Long and Servedio [LS05] presented a “martingale boosting” algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. [LS05] showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. We present an adaptive variant of the martingale boosting algorithm. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original [LS05] algorithm, such as random classification noise tolerance, and has other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidencerated weak hypotheses that output real values rather than Boolean predictions.
6 0.53442472 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
7 0.51681274 168 nips-2008-Online Metric Learning and Fast Similarity Search
8 0.50514728 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
9 0.47785404 170 nips-2008-Online Optimization in X-Armed Bandits
10 0.46431753 73 nips-2008-Estimating Robust Query Models with Convex Optimization
11 0.46082684 112 nips-2008-Kernel Measures of Independence for non-iid Data
12 0.44128323 101 nips-2008-Human Active Learning
13 0.44056961 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
14 0.43199345 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
15 0.42307559 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
16 0.3551684 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
17 0.35333377 85 nips-2008-Fast Rates for Regularized Objectives
18 0.34759191 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
19 0.34158716 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
20 0.33648038 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
topicId topicWeight
[(6, 0.108), (7, 0.071), (12, 0.04), (28, 0.19), (57, 0.048), (59, 0.02), (63, 0.033), (64, 0.264), (71, 0.036), (77, 0.042), (83, 0.065)]
simIndex simValue paperId paperTitle
1 0.88139838 105 nips-2008-Improving on Expectation Propagation
Author: Manfred Opper, Ulrich Paquet, Ole Winther
Abstract: A series of corrections is developed for the fixed points of Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unrealiable results.
same-paper 2 0.84309769 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1
3 0.83965492 108 nips-2008-Integrating Locally Learned Causal Structures with Overlapping Variables
Author: David Danks, Clark Glymour, Robert E. Tillman
Abstract: In many domains, data are distributed among datasets that share only some variables; other recorded variables may occur in only one dataset. While there are asymptotically correct, informative algorithms for discovering causal relationships from a single dataset, even with missing values and hidden variables, there have been no such reliable procedures for distributed data with overlapping variables. We present a novel, asymptotically correct procedure that discovers a minimal equivalence class of causal DAG structures using local independence information from distributed data of this form and evaluate its performance using synthetic and real-world data against causal discovery algorithms for single datasets and applying Structural EM, a heuristic DAG structure learning procedure for data with missing values, to the concatenated data.
4 0.80881953 151 nips-2008-Non-parametric Regression Between Manifolds
Author: Florian Steinke, Matthias Hein
Abstract: This paper discusses non-parametric regression between Riemannian manifolds. This learning problem arises frequently in many application areas ranging from signal processing, computer vision, over robotics to computer graphics. We present a new algorithmic scheme for the solution of this general learning problem based on regularized empirical risk minimization. The regularization functional takes into account the geometry of input and output manifold, and we show that it implements a prior which is particularly natural. Moreover, we demonstrate that our algorithm performs well in a difficult surface registration problem. 1
5 0.68625271 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
6 0.68019682 62 nips-2008-Differentiable Sparse Coding
7 0.67986643 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
8 0.67869377 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.67780733 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
10 0.67740136 196 nips-2008-Relative Margin Machines
11 0.67704678 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
12 0.67619681 85 nips-2008-Fast Rates for Regularized Objectives
13 0.67509794 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
14 0.67426991 143 nips-2008-Multi-label Multiple Kernel Learning
15 0.67388332 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
16 0.67387164 239 nips-2008-Tighter Bounds for Structured Estimation
17 0.67304409 75 nips-2008-Estimating vector fields using sparse basis field expansions
18 0.6723119 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
19 0.67194086 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
20 0.671718 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization