jmlr jmlr2008 jmlr2008-52 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA. Editor: Peter Bartlett. Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. [sent-7, score-0.456]
2 Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. [sent-8, score-0.588]
3 Introduction We introduce and analyze a theoretical model for the problem of learning from multiple sources of “nearby” data. [sent-14, score-0.426]
4 In our model there are K unknown data sources, with source i generating a distinct sample Si of ni observations. [sent-20, score-0.427]
5 We assume we are given only the samples S_i, and a disparity matrix D whose entry D(i, j) bounds the difference between source i and source j. [sent-21, score-0.414]
6 Our framework includes settings in which the sources produce data for classification, regression, and density estimation (and more generally any additive-loss learning problem obeying certain conditions). [sent-29, score-0.386]
7 Our main result is a general theorem establishing a bound on the expected loss incurred by using all data sources within a given disparity of the target source. [sent-30, score-0.974]
8 Our bound clearly expresses a trade-off between three quantities: the sample size used (which increases as we include data from more distant models), a weighted average of the disparities of the sources whose data is used, and a model complexity term. [sent-32, score-0.678]
9 It can be applied to any learning setting in which the underlying loss function obeys an approximate triangle inequality, and in which the class of hypothesis models under consideration obeys uniform convergence of empirical estimates of loss to expectations. [sent-33, score-0.564]
10 Uniform convergence bounds for the settings we consider may be obtained via standard data-independent model complexity measures such as VC dimension and pseudo-dimension, or via more recent measures such as Rademacher complexity. [sent-36, score-0.215]
11 (2006) examines the considerably more limited problem of learning a model when all data sources are corrupted versions of a single, fixed source, for instance when each data source provides noisy samples of a fixed binary function, but with varying levels of noise. [sent-38, score-0.603]
12 In the current work, the labels on each source may be entirely unrelated to those on the other sources except as constrained by the bounds on disparities, requiring us to develop new techniques. [sent-39, score-0.366]
13 (2007) study the related problem of training classifiers using multiple sources of data drawn from different underlying domains but labeled using identical or similar labeling functions. [sent-41, score-0.483]
14 In Section 3 we provide our main result, which is a general bound on the expected loss incurred by using all data within a given disparity of a target source. [sent-45, score-0.522]
15 In each case, we are interested in the expected loss of a model g_2 on target g_1, e(g_1, g_2) = E_{z∼P_{g_1}}[L(g_2, z)]. [sent-62, score-0.22]
16 In our multiple source model, we are presented with K distinct, mutually independent samples or sources of data S_1, ..., S_K. [sent-64, score-0.603]
17 Each source S_i contains n_i observations that are generated from a fixed and unknown model f_i, and D satisfies max(e(f_i, f_j), e(f_j, f_i)) ≤ D(i, j). [sent-68, score-0.649]
18 Our goal is to decide which sources S j to use in order to learn the best approximation (in terms of expected loss) to each f i . [sent-70, score-0.468]
19 Thus without loss of generality let us suppose that we are given sources S_1, ..., S_K ordered by increasing disparity from the target. [sent-72, score-0.535]
20 In all cases we will analyze, for any k ≤ K, the hypothesis ĥ_k minimizing the empirical loss ê_k(h) on the first k sources S_1, ..., S_k. [sent-88, score-0.779]
21 That is, ĥ_k = argmin_{h∈H} ê_k(h) = argmin_{h∈H} (1/n_{1:k}) ∑_{j=1}^{k} ∑_{i=1}^{n_j} L(h, z_{ij}), where n_{1:k} = n_1 + · · · + n_k. [sent-91, score-0.424]
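To make the estimator concrete, here is a minimal Python sketch (helper names are illustrative, not from the paper) that pools the first k sources and returns the hypothesis minimizing the pooled empirical loss over a finite candidate set.

```python
# A minimal sketch of the pooled empirical-loss minimizer ĥ_k. `sources` is a
# list of samples S_1, ..., S_K (each a list of examples z), `hypotheses` is a
# finite candidate set, and L(h, z) is the per-example loss.
def empirical_loss(h, pooled, L):
    return sum(L(h, z) for z in pooled) / len(pooled)

def erm_first_k(sources, k, hypotheses, L):
    pooled = [z for S in sources[:k] for z in S]   # the n_{1:k} pooled examples
    return min(hypotheses, key=lambda h: empirical_loss(h, pooled, L))
```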
22 We also denote the expected error of a function h with respect to the first k sources of data as e_k(h) = ∑_{i=1}^{k} (n_i/n_{1:k}) e(f_i, h). [sent-92, score-0.883]
23 General Theory for the Multiple Source Problem In this section we provide the first of our main results: a general bound on the expected loss of the model minimizing the empirical loss on the nearest k sources. [sent-94, score-0.378]
24 Optimization of this bound leads to a recommended set of sources to incorporate when learning f = f_1. [sent-95, score-0.529]
25 The key ingredients needed to apply this bound are an approximate triangle inequality and a uniform convergence bound, which we define below. [sent-96, score-0.432]
26 Definition 1 For α ≥ 1, we say that the α-triangle inequality holds for a class of models F and expected loss function e if for all g_1, g_2, g_3 ∈ F we have e(g_1, g_2) ≤ α(e(g_1, g_3) + e(g_3, g_2)). [sent-98, score-0.264]
27 Our results will require only that the unknown source models f_1, ..., f_K and the hypothesis models in H obey the α-triangle inequality. [sent-102, score-0.188]
28 Definition 2 A uniform convergence bound for a hypothesis space H and loss function L is a bound that states that for any 0 < δ < 1, with probability at least 1 − δ, for any h ∈ H, |ê(h) − e(h)| ≤ β(n, δ), where ê(h) = (1/n) ∑_{i=1}^{n} L(h, z_i) for n observations z_1, ..., z_n. [sent-107, score-0.547]
29 This definition simply asserts that for every model in H , its empirical loss on a sample of size n and the expectation of this loss will be “close” when β(n, δ) is small. [sent-118, score-0.265]
30 Our bounds will be derived from the rich literature on uniform convergence. [sent-120, score-0.185]
31 Theorem 3 Let e be the expected loss function for loss L , and let F be a class of models for which the α-triangle inequality holds with respect to e. [sent-124, score-0.413]
32 Let H ⊆ F be a class of hypothesis models for which there is a uniform convergence bound β for L . [sent-125, score-0.316]
33 Then for any δ > 0, with probability at least 1 − δ, for all k ∈ {1, ..., K}, e(f, ĥ_k) ≤ α² min_{h∈H} e(f, h) + (α + α²) ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + 2α β(n_{1:k}, δ/2K). [sent-133, score-0.398]
34 The first term in the bound is simply the approximation error, the residual loss that we incur by limiting our hypothesis model to fall in the restricted class H . [sent-135, score-0.258]
35 We expect this term to decrease with added sources due to the increased sample size. [sent-139, score-0.386]
36 Theorem 3 and this optimization make the implicit assumption that the best subset of sources to use will be a prefix of the sources—that is, that we should not “skip” a nearby source in favor of more distant ones. [sent-142, score-0.645]
37 This assumption will be true for typical data-independent uniform convergence bounds such as those based on the VC dimension, and will be true on average for data-dependent bounds, where we expect uniform convergence bounds to improve with increased sample size. [sent-143, score-0.386]
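As described above, a recommended prefix length k can be obtained by evaluating the bound of Theorem 3 for every k and keeping the minimizer. The sketch below is a hypothetical helper, not the authors' code; it assumes the disparities ε_i, the sample sizes n_i (ordered by disparity), and a uniform convergence function β(n, δ) are given, and it drops the k-independent approximation-error term α² min_h e(f, h).

```python
# Evaluate the k-dependent part of the Theorem 3 bound for each prefix length
# and return the length that minimizes it.
def prefix_bound_terms(eps, ns, alpha, beta, delta):
    K = len(eps)
    terms = []
    for k in range(1, K + 1):
        n_tot = sum(ns[:k])
        weighted_disparity = sum(n * e for n, e in zip(ns[:k], eps[:k])) / n_tot
        terms.append((alpha + alpha ** 2) * weighted_disparity
                     + 2 * alpha * beta(n_tot, delta / (2 * K)))
    return terms

def recommended_k(eps, ns, alpha, beta, delta=0.05):
    terms = prefix_bound_terms(eps, ns, alpha, beta, delta)
    return 1 + min(range(len(terms)), key=terms.__getitem__)
```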
38 By the α-triangle inequality, for any h ∈ H and each i ∈ {1, ..., k}, (n_i/n_{1:k}) e(f, h) ≤ (n_i/n_{1:k}) (α e(f, f_i) + α e(f_i, h)). [sent-151, score-0.902]
39 Summing over i ∈ {1, ..., k}, we find e(f, h) ≤ ∑_{i=1}^{k} (n_i/n_{1:k}) (α e(f, f_i) + α e(f_i, h)) = α ∑_{i=1}^{k} (n_i/n_{1:k}) e(f, f_i) + α ∑_{i=1}^{k} (n_i/n_{1:k}) e(f_i, h) ≤ α ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + α e_k(h). [sent-155, score-1.804]
40 In the second step we have broken up the summation, and the final inequality uses the fact that e(f, f_i) ≤ ε_i and the definition of e_k(h). [sent-157, score-0.334]
41 Notice that the first summation is a weighted average of the expected loss of each f i , while the second summation is the expected loss of h on the data. [sent-158, score-0.386]
42 Using the uniform convergence bound, we may assert that with high probability e_k(h) ≤ ê_k(h) + β(n_{1:k}, δ/2K), and with high probability ê_k(ĥ_k) = min_{h∈H} ê_k(h) ≤ min_{h∈H} { ∑_{i=1}^{k} (n_i/n_{1:k}) e(f_i, h) } + β(n_{1:k}, δ/2K). [sent-159, score-0.822]
43 The loss function L(h, ⟨x, y⟩) is defined as 0 if y = h(x) and 1 otherwise, and the corresponding expected loss is e(g_1, g_2) = E_{⟨x,y⟩∼P_{g_1}}[L(g_2, ⟨x, y⟩)] = Pr_{x∼P}[g_1(x) ≠ g_2(x)]. [sent-167, score-0.268]
44 For 0/1 loss it is well-known and easy to see that the (standard) 1-triangle inequality holds. [sent-168, score-0.188]
45 The following function β is a uniform convergence bound for H and L when n ≥ d/2: β(n, δ) = √( 8 (d ln(2en/d) + ln(4/δ)) / n ). [sent-171, score-0.249]
46 The proof is analogous to the standard proof of uniform convergence using the VC dimension (see, for example, Chapters 2–4 of Anthony and Bartlett (1999)), requiring only minor modifications to the symmetrization argument to handle the fact that the samples need not be identically distributed. [sent-172, score-0.245]
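The VC-based bound just stated is straightforward to compute; the helper below is a sketch using the formula as reconstructed above, and can be passed as the β argument of the prefix-selection sketch following Theorem 3, for example via `lambda n, delta: beta_vc(n, delta, d)` for a class of VC dimension d.

```python
import math

# β(n, δ) = sqrt( 8 * (d * ln(2en/d) + ln(4/δ)) / n ), stated for n >= d/2.
def beta_vc(n, delta, d):
    assert n >= d / 2, "bound stated only for n >= d/2"
    return math.sqrt(8 * (d * math.log(2 * math.e * n / d) + math.log(4 / delta)) / n)
```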
47 Then Pr[ | (1/m) ∑_{i=1}^{m} f_i(x_i) − (1/m) ∑_{i=1}^{m} E_{x∼D_i}[f_i(x)] | ≥ ε ] ≤ 2 exp( −2ε²m² / ∑_{i=1}^{m} (b_i − a_i)² ), where the probability is over the sequence of values x_i distributed according to D_i for all i = 1, ..., m. [sent-175, score-0.431]
48 Then for any δ > 0, with probability at least 1 − δ, for all k ∈ {1, ..., K}, e(f, ĥ_k) ≤ min_{h∈H} e(f, h) + 2 ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + √( 32 (d ln(2e n_{1:k}/d) + ln(8K/δ)) / n_{1:k} ). [sent-186, score-0.792]
49 (Figure caption fragment: the point marked “MAX DATA” in the left panel; the resulting source sizes for each model are shown in the bar plot in the right panel, where the origin (0, 0) is in the near corner, (1, 1) is in the far corner, and the source sizes clearly peak near the marked point.) [sent-218, score-0.258]
50 For every function fi we used Theorem 6 to find the best sources j to be used to estimate its parameters. [sent-221, score-0.568]
51 We start by deriving bounds for settings in which generic Lipschitz loss functions are used, and then discuss specific applications to classification and to regression with squared loss. [sent-235, score-0.322]
52 The empirical Rademacher complexity of H on a sample x_1, ..., x_n is defined as R̂_n(H) = E[ sup_{h∈H} (2/n) ∑_{i=1}^{n} σ_i h(x_i) ], where the expectation is taken with respect to independent uniform {±1}-valued random variables σ_1, ..., σ_n. [sent-240, score-0.248]
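For a finite hypothesis set, the empirical Rademacher complexity defined above can be approximated by Monte Carlo over the random signs; the sketch below uses illustrative names and is not taken from the paper.

```python
import random

# Monte Carlo estimate of R̂_n(H) = E_σ[ sup_{h∈H} (2/n) Σ_i σ_i h(x_i) ] for a
# finite set `hypotheses` of ±1-valued functions evaluated on the sample `xs`.
def empirical_rademacher(hypotheses, xs, trials=200, rng=random):
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max((2.0 / n) * sum(s * h(x) for s, x in zip(sigma, xs))
                     for h in hypotheses)
    return total / trials
```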
53 Lemma 8 below gives a uniform convergence bound for any loss function with a corresponding Lipschitz cost function. [sent-262, score-0.36]
54 Let H be a class of functions from X to Y and let {⟨x_i, y_i⟩}_{i=1}^{n} be sampled independently according to some probability distribution P. [sent-270, score-0.21]
55 2 Application to Classification Using Rademacher Complexity Theorem 9 below follows from the application of Theorem 3 using the 1-triangle inequality and an application of Lemma 8 with 1 if ya ≤ 0, φ(y, a) = 1 − ya if 0 < ya ≤ 1, 0 if ya > 1. [sent-273, score-0.365]
56 Theorem 9 Let F be a set of functions from an input set X into {−1, 1} and let R_{n_{1:k}}(H) be the Rademacher complexity of H ⊆ F on the first k sources of data. [sent-275, score-0.505]
57 Then for any δ > 0, with probability at least 1 − δ, for all k ∈ {1, ..., K}, e(f, ĥ_k) ≤ min_{h∈H} e(f, h) + 2 ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + 2 √( 2 ln(4K/δ) / n_{1:k} ) + 4 R_{n_{1:k}}(H). [sent-284, score-0.398]
58 Similarly to the VC-based bound given in Theorem 6, as k increases and more sources of data are combined, the second term will grow while the third will shrink. [sent-286, score-0.496]
59 The behavior of the final term R_{n_{1:k}}(H), however, is less predictable and may grow or shrink as more sources of data are combined. [sent-287, score-0.386]
60 Our loss function is L(h, ⟨x, y⟩) = (y − h(x))², and the expected loss is thus e(g_1, g_2) = E_{⟨x,y⟩∼P_{g_1}}[L(g_2, ⟨x, y⟩)] = E_{x∼P}[(g_1(x) − g_2(x))²]. [sent-295, score-0.268]
61 Lemma 10 Given any three functions g_1, g_2, g_3 : X → R, a fixed and unknown distribution P on the inputs X, and the expected loss e(g_1, g_2) = E_{x∼P}[(g_1(x) − g_2(x))²], we have e(g_1, g_2) ≤ 2 (e(g_1, g_3) + e(g_3, g_2)). [sent-298, score-0.222]
62 We can derive a uniform convergence bound for squared loss using Rademacher complexity as long as the region Y is bounded. [sent-300, score-0.444]
63 The following function β is a uniform convergence bound for H and L: β(n, δ) = 8B R_n(H) + 4B² √( 2 ln(2/δ) / n ). [sent-302, score-0.249]
64 Applying Lemma 8 gives a uniform convergence bound of (2/B) R_n(H) + √( 2 ln(2/δ) / n ) for the loss rescaled by 1/(4B²); scaling by 4B² yields the bound for L. [sent-308, score-0.359]
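A small helper for the squared-loss uniform convergence bound stated above (a sketch under the assumption that outputs lie in [−B, B]; `rademacher_n` stands for R_n(H), e.g., the Monte Carlo estimate sketched earlier):

```python
import math

# β(n, δ) = 8·B·R_n(H) + 4·B²·sqrt(2·ln(2/δ)/n) for squared loss on [-B, B].
def beta_squared_loss(n, delta, B, rademacher_n):
    return 8 * B * rademacher_n + 4 * B ** 2 * math.sqrt(2 * math.log(2 / delta) / n)
```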
65 Then for any δ > 0, with probability at least 1 − δ, for all k ∈ {1, ..., K}, e(f, ĥ_k) ≤ 4 min_{h∈H} e(f, h) + 6 ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + 32B R_{n_{1:k}}(H) + 16B² √( 2 ln(4K/δ) / n_{1:k} ). [sent-320, score-0.398]
66 This lemma immediately allows us to replace R_n(H) with the data-dependent quantity R̂_n(H) in any of the bounds above for only a small penalty. [sent-325, score-0.202]
67 While the use of data-dependent complexity measures can be expected to yield more accurate bounds and thus better decisions about the number k ∗ of sources to use, it is not without its costs in comparison to the more standard data-independent approaches. [sent-326, score-0.585]
68 By Hoeffding’s inequality, with probability 1 − δ′, |ê(f_i, f_j) − e(f_i, f_j)| ≤ √( ln(2/δ′) / (2m) ), where m is the number of commonly labeled examples used in the comparison. [sent-336, score-0.364]
69 Using the lemma, we set the upper bound on the mutual error e(f_i, f_j) between the pair of functions f_i and f_j to be D(i, j) = ê(f_i, f_j) + ε. [sent-346, score-0.568]
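A sketch of this plug-in construction of the disparity matrix for classification, where ê(f_i, f_j) is the observed disagreement rate on m commonly labeled examples and ε is a Hoeffding-style confidence width (the exact constants used in the paper may differ):

```python
import math

# Estimate D(i, j) = ê(f_i, f_j) + ε from the labels the two sources assign to
# the same m inputs (e.g., commonly rated movies).
def estimated_disparity(labels_i, labels_j, delta=0.05):
    m = len(labels_i)
    disagreement = sum(a != b for a, b in zip(labels_i, labels_j)) / m
    eps = math.sqrt(math.log(2 / delta) / (2 * m))
    return disagreement + eps
```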
70 Thus many fewer labeled examples are required to estimate the disparity matrix than to actually learn the best function in the class. [sent-349, score-0.258]
71 These commonly rated movies can be used to determine how similar each pair of users is, while ratings of additional movies can be reserved to learn the prediction functions. [sent-353, score-0.253]
72 Estimating the Parameters of a Distribution. We now proceed with the study of the related problem of estimating the unknown parameters of a distribution from multiple sources of data. [sent-355, score-0.455]
73 As in the previous sections, we provide a bound on the diversity of an estimator based on the first k sources from the target. [sent-356, score-0.553]
74 Up until this point, we have measured the diversity between two functions by using the expected value of a loss function. [sent-357, score-0.25]
75 Example 1 We wish to estimate the bias θ of a coin given K sources of training observations N_1, ..., N_K. [sent-362, score-0.514]
76 Each source N_k contains n_k outcomes of flips of a coin with bias θ_k. [sent-366, score-0.343]
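Example 1 in code (a sketch; here the prefix length k is an input, and the text below explains how to choose it by minimizing the corresponding bound):

```python
# Estimate the target coin's bias θ by pooling the 0/1 flip outcomes of the k
# nearest sources N_1, ..., N_k.
def pooled_bias_estimate(sources, k):
    flips = [x for N in sources[:k] for x in N]
    return sum(flips) / len(flips)
```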
77 Example 2 We wish to estimate the probability Θ^{(p)} of a die falling on its pth side (out of D possible outcomes) given K sources of training observations N_1, ..., N_K. [sent-369, score-0.491]
78 Each source N_k contains n_k outcomes generated using a die with parameters Θ_k. [sent-373, score-0.32]
79 In our setting, we wish to estimate the parameters Θ of a parametric distribution Pr[X|Θ] given K sources of training observations N_1, ..., N_K. [sent-388, score-0.46]
80 Each source N_k contains n_k outcomes from a distribution with parameters Θ_k, that is, Pr[X|Θ_k]. [sent-392, score-0.289]
81 By Hoeffding’s inequality, for each parameter p we have Pr[ |E[Θ̂^{(p)}] − Θ̂^{(p)}| ≥ √( B² ln(2D/δ) / (2n) ) ] ≤ δ/D. We can use the union bound to bound this difference for all D parameters at once and get Pr[ ∃p : |E[Θ̂^{(p)}] − Θ̂^{(p)}| ≥ √( B² ln(2D/δ) / (2n) ) ] ≤ ∑_{p=1}^{D} δ/D = δ. [sent-405, score-0.22]
82 We define the estimator using the first k sources to be Θ̂_k = (1/n_{1:k}) ∑_{i=1}^{k} ∑_{X∈N_i} Ψ(X), where as before n_{1:k} = ∑_{i=1}^{k} n_i. [sent-425, score-0.655]
83 We denote the expectation of this estimate by Θ̄_k = E[Θ̂_k] = (1/n_{1:k}) ∑_{i=1}^{k} n_i Θ_i. [sent-426, score-0.312]
84 We now bound the deviation of the estimate Θ̂_k from the true set of parameters Θ using the expectation Θ̄_k: ‖Θ − Θ̂_k‖_∞ = ‖Θ − Θ̄_k + Θ̄_k − Θ̂_k‖_∞ ≤ ‖Θ − Θ̄_k‖_∞ + ‖Θ̄_k − Θ̂_k‖_∞ ≤ ∑_{i=1}^{k} (n_i/n_{1:k}) ‖Θ − Θ_i‖_∞ + ‖Θ̄_k − Θ̂_k‖_∞ ≤ ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + ‖Θ̄_k − Θ̂_k‖_∞. [sent-427, score-0.648]
85 Then for any δ > 0, with probability ≥ 1 − δ we have ‖Θ − Θ̂_k‖_∞ ≤ ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + √( 4B² ln(2DK/δ) / (2n_{1:k}) ) simultaneously for all k = 1, ..., K. [sent-466, score-0.269]
86 Given the K sources of data we simply compute the bounds provided by these theorems for each prefix of the sources of length k and select the subset of sources that yields the smallest bound. [sent-471, score-1.266]
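The selection rule just described can be written directly as code; the sketch below evaluates the Theorem 16 bound (with the constants as given above) for each prefix length and keeps the minimizer. Here `eps` are the parameter-distance bounds ε_i, `ns` the source sizes n_i, D the number of parameters, and B the assumed bound on the components of Ψ; all names are illustrative.

```python
import math

def theorem16_bound(eps, ns, B, D, delta, k):
    K = len(ns)
    n_tot = sum(ns[:k])
    weighted = sum(n * e for n, e in zip(ns[:k], eps[:k])) / n_tot
    return weighted + math.sqrt(4 * B ** 2 * math.log(2 * D * K / delta) / (2 * n_tot))

def select_prefix(eps, ns, B, D, delta=0.05):
    K = len(ns)
    return min(range(1, K + 1), key=lambda k: theorem16_bound(eps, ns, B, D, delta, k))
```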
87 That bound has the same form as the bound given here in Theorem 16 but with better constants. [sent-474, score-0.22]
88 Synthetic Simulations. In this section, we illustrate the bounds of our main theorem through a simple synthetic simulation. [sent-476, score-0.212]
89 To predict the optimal set of training data sources to use for each model, we calculated an approximation of the multiple-source VC bound for classification. [sent-487, score-0.496]
90 It is well known that the constants in the VC-based uniform convergence bounds are not tight. [sent-488, score-0.247]
91 Thus for the purpose of illustrating how these bounds might be used in practice, we have chosen to show approximations of our bounds with a variety of constants. [sent-489, score-0.216]
92 In particular, we have chosen to approximate the bound with 2 ∑_{i=1}^{k} (n_i/n_{1:k}) ε_i + C √( (d ln(2e n_{1:k}/d) + ln(8K/δ)) / n_{1:k} ) with δ = 0. [sent-490, score-0.619]
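A sketch of this approximation as used in the simulations (the square-root form follows Theorem 6; C is the tunable constant, and the plots described below use C = 1/4, 1/2, and 1/√2):

```python
import math

# Approximate multiple-source VC bound for a prefix of length k.
def approx_vc_bound(eps, ns, d, K, delta, C, k):
    n_tot = sum(ns[:k])
    disparity_term = 2 * sum(n * e for n, e in zip(ns[:k], eps[:k])) / n_tot
    complexity_term = C * math.sqrt(
        (d * math.log(2 * math.e * n_tot / d) + math.log(8 * K / delta)) / n_tot)
    return disparity_term + complexity_term
```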
93 On the x axis is the number of data sources used in training. [sent-496, score-0.386]
94 Dashed red curves show our multiple source error bound with C set to 1/4 in the lowest curve, 1/2 in the middle curve, and 1/√2 in the highest curve. [sent-499, score-0.313]
95 When too few sources are used, there is not enough data available to learn a 15-dimensional function. [sent-502, score-0.422]
96 When too many sources are used, the labels on the training data often will not correspond to the labels that would have been assigned by the target function. [sent-503, score-0.449]
97 Although the VC bounds remain loose even after constants have been dropped, the bounds tend to maintain the appropriate shape and thus predict the optimal set of sources quite well. [sent-505, score-0.665]
98 In both cases, although the bounds are loose, they can still prove useful in determining the optimal set of sources to consider. [sent-508, score-0.494]
99 For each i ∈ {1, ..., n}, let ⟨x_i, y_i⟩ be the ith training instance, distributed according to P_i, and let ⟨x_i′, y_i′⟩ be independent random variables drawn according to P_i. [sent-540, score-0.281]
100 When only one instance ⟨x_i, y_i⟩ changes, the sup term can change by at most 2/n. [sent-542, score-0.21]
wordName wordTfidf (topN-words)
[('sources', 0.386), ('ni', 0.269), ('rademacher', 0.263), ('earns', 0.213), ('ortman', 0.213), ('rammer', 0.213), ('ln', 0.197), ('ources', 0.189), ('fi', 0.182), ('disparity', 0.165), ('hk', 0.129), ('source', 0.129), ('vc', 0.118), ('crammer', 0.117), ('ek', 0.116), ('nk', 0.115), ('ultiple', 0.115), ('loss', 0.111), ('lipschitz', 0.11), ('bound', 0.11), ('bounds', 0.108), ('lemma', 0.094), ('fk', 0.082), ('pr', 0.081), ('ex', 0.077), ('inequality', 0.077), ('uniform', 0.077), ('sup', 0.074), ('hoeffding', 0.072), ('ya', 0.072), ('disparities', 0.071), ('triangle', 0.07), ('yi', 0.069), ('xi', 0.067), ('distant', 0.066), ('mcdiarmid', 0.066), ('theorem', 0.066), ('nearby', 0.064), ('target', 0.063), ('earning', 0.063), ('bartlett', 0.063), ('convergence', 0.062), ('upenn', 0.06), ('users', 0.06), ('diversity', 0.057), ('labeled', 0.057), ('mendelson', 0.054), ('pre', 0.054), ('coin', 0.054), ('koltchinskii', 0.054), ('xn', 0.054), ('rn', 0.048), ('samples', 0.048), ('jennifer', 0.047), ('kearns', 0.047), ('expected', 0.046), ('movies', 0.046), ('complexity', 0.045), ('outcomes', 0.045), ('panel', 0.045), ('classi', 0.044), ('ers', 0.044), ('expectation', 0.043), ('observations', 0.04), ('koby', 0.04), ('cis', 0.04), ('blitzer', 0.04), ('corrupted', 0.04), ('multiple', 0.04), ('squared', 0.039), ('synthetic', 0.038), ('let', 0.038), ('hypothesis', 0.037), ('functions', 0.036), ('learn', 0.036), ('ratings', 0.036), ('ingredients', 0.036), ('summation', 0.036), ('zn', 0.035), ('curves', 0.034), ('wish', 0.034), ('recommended', 0.033), ('obeys', 0.033), ('anthony', 0.033), ('loose', 0.032), ('argmin', 0.032), ('shape', 0.031), ('thirty', 0.031), ('die', 0.031), ('models', 0.03), ('user', 0.03), ('proof', 0.029), ('commonly', 0.029), ('unknown', 0.029), ('sk', 0.029), ('disagreement', 0.029), ('aggregate', 0.029), ('regression', 0.028), ('incurred', 0.027), ('alternate', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
2 0.17014936 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
3 0.14393555 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
4 0.11148667 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
5 0.10889544 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
6 0.093511559 56 jmlr-2008-Magic Moments for Structured Output Prediction
7 0.092316508 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
8 0.07847622 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
9 0.076154806 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
10 0.071850114 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
11 0.07022161 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
12 0.068699017 83 jmlr-2008-Robust Submodular Observation Selection
13 0.065229058 54 jmlr-2008-Learning to Select Features using their Properties
14 0.064669549 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
15 0.062650807 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
16 0.059857339 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
18 0.055234812 9 jmlr-2008-Active Learning by Spherical Subdivision
19 0.05366857 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
20 0.052166678 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
topicId topicWeight
[(0, 0.3), (1, -0.116), (2, -0.146), (3, -0.108), (4, 0.057), (5, 0.067), (6, 0.055), (7, -0.298), (8, 0.094), (9, -0.105), (10, 0.153), (11, 0.062), (12, 0.073), (13, -0.007), (14, -0.053), (15, -0.034), (16, -0.142), (17, 0.079), (18, -0.034), (19, -0.022), (20, 0.043), (21, -0.047), (22, -0.028), (23, -0.084), (24, 0.016), (25, 0.148), (26, -0.134), (27, 0.163), (28, 0.017), (29, -0.011), (30, -0.055), (31, 0.037), (32, -0.006), (33, -0.094), (34, -0.042), (35, 0.01), (36, 0.02), (37, -0.111), (38, 0.11), (39, -0.111), (40, -0.064), (41, 0.008), (42, -0.109), (43, 0.029), (44, -0.054), (45, 0.105), (46, -0.15), (47, -0.001), (48, 0.025), (49, 0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.94138557 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
2 0.5711112 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
Author: Hannes Nickisch, Carl Edward Rasmussen
Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC
3 0.56912589 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
4 0.4970322 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
5 0.46991685 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
6 0.44404572 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix
7 0.42168257 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
8 0.36846304 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
9 0.36315057 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
10 0.35572454 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
11 0.35379124 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
12 0.33955294 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
14 0.29057652 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
15 0.2822113 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.28045949 83 jmlr-2008-Robust Submodular Observation Selection
17 0.27745548 9 jmlr-2008-Active Learning by Spherical Subdivision
18 0.2767629 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
19 0.27519855 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
20 0.26909769 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
topicId topicWeight
[(0, 0.024), (40, 0.03), (54, 0.033), (58, 0.032), (66, 0.056), (76, 0.034), (88, 0.067), (92, 0.579), (94, 0.039), (99, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.96723878 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
2 0.95261872 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
3 0.67324996 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
4 0.62014586 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
Author: Eric Bax
Abstract: This paper develops bounds on out-of-sample error rates for support vector machines (SVMs). The bounds are based on the numbers of support vectors in the SVMs rather than on VC dimension. The bounds developed here improve on support vector counting bounds derived using Littlestone and Warmuth’s compression-based bounding technique. Keywords: compression, error bound, support vector machine, nearly uniform
5 0.59816873 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
7 0.56499994 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
8 0.54700959 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
9 0.48964784 56 jmlr-2008-Magic Moments for Structured Output Prediction
10 0.47903806 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.4702796 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
12 0.45515698 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
13 0.44942918 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
14 0.44702157 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
15 0.44564253 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
16 0.44447467 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
17 0.44394073 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
18 0.44106662 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
19 0.44101158 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
20 0.43925166 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension