jmlr jmlr2013 jmlr2013-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou
Abstract: We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi’s entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi’s entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms. Keywords: minimum error entropy, learning theory, Rényi’s entropy, empirical risk minimization, approximation error
Reference: text
sentIndex sentText sentNum sentScore
1 This learning algorithm involves a Parzen windowing scaling parameter. [sent-10, score-0.185]
2 Novel analysis is carried out for the generalization error associated with Rényi’s entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. [sent-13, score-0.341]
3 An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms. [sent-14, score-0.126]
4 Keywords: minimum error entropy, learning theory, Rényi’s entropy, empirical risk minimization, approximation error 1. [sent-15, score-0.101]
5 Minimum error entropy (MEE) is a principle of information theoretical learning and provides a family of supervised learning algorithms. [sent-21, score-0.155]
6 The idea of MEE is to extract from data as much information as possible about the data generating systems by minimizing error entropies in various ways. [sent-24, score-0.075]
7 In supervised learning our target is to predict the response variable Y from the explanatory variable X. [sent-27, score-0.093]
8 Then the random variable E becomes the error variable E = Y − f (X) when a predictor f (X) is used and the MEE principle aims at searching for a predictor f (X) that contains the most information of the response variable by minimizing information entropies of the error variable E = Y − f (X). [sent-28, score-0.251]
9 The least squares method minimizes the variance of the error variable E and is perfect to deal with problems involving Gaussian noise (such as some from linear signal processing). [sent-31, score-0.084]
10 For such problems, MEE might still perform very well in principle since moments of all orders of the error variable are taken into account by entropies. [sent-33, score-0.078]
11 Here we only consider Rényi’s entropy of order α = 2: $H_R(E) = H_{R,2}(E) = -\log \int_{\mathbb{R}} (p_E(e))^2\, de$. [sent-34, score-0.127]
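As an illustration of the quantity just defined, here is a minimal numerical sketch (not from the paper) of the Parzen-windowing plug-in estimate of the order-2 Rényi entropy of a sample of error values; the function name and the choice G(t) = exp(−t) are assumptions made for the example.

```python
import numpy as np

def empirical_renyi2_entropy(errors, h, G=lambda t: np.exp(-t)):
    """Plug-in estimate of H_{R,2} by Parzen windowing with scaling parameter h.

    Computes -log( (1/(m^2 h)) * sum_{i,j} G( (e_i - e_j)^2 / (2 h^2) ) ),
    the empirical quantity whose minimization defines the MEE algorithm below.
    """
    e = np.asarray(errors, dtype=float)
    m = e.shape[0]
    pairwise_sq_diff = (e[:, None] - e[None, :]) ** 2   # (e_i - e_j)^2 for all pairs
    information_potential = G(pairwise_sq_diff / (2.0 * h ** 2)).sum() / (m ** 2 * h)
    return -np.log(information_potential)

# example: entropy estimate for Gaussian errors with a large scaling parameter
print(empirical_renyi2_entropy(np.random.default_rng(0).normal(size=100), h=10.0))
```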
12 Our analysis does not apply to Rényi’s entropy of order α ≠ 2. [sent-35, score-0.095]
13 In most real applications, neither the explanatory variable X nor the response variable Y is explicitly known. [sent-36, score-0.093]
14 Instead, in supervised learning, a sample $z = \{(x_i, y_i)\}_{i=1}^{m}$ is available which reflects the distribution of the explanatory variable X and the functional relation between X and the response variable Y . [sent-37, score-0.12]
15 A typical choice for the windowing function G(t) = exp{−t} corresponds to Gaussian windowing. [sent-39, score-0.163]
16 Though the MEE principle has been proposed for a decade and MEE algorithms have been shown to be effective in various applications, its theoretical foundation for mathematical error analysis is not well understood yet. [sent-42, score-0.06]
17 However, it is well known that the convergence of Parzen windowing requires h to converge to 0. [sent-45, score-0.163]
18 Another technical barrier for mathematical analysis of MEE algorithms for regression is the possibility that the regression function may not be a minimizer of the associated generalization error, as described in detail in Section 3 below. [sent-47, score-0.094]
19 In the sequel of this paper, we consider an MEE learning algorithm that minimizes the empirical Rényi’s entropy $H_R$ and focus on the regression problem. [sent-50, score-0.115]
20 The minimization of empirical Rényi’s entropy cannot be done over all possible measurable functions, which would lead to overfitting. [sent-56, score-0.095]
21 Then the MEE learning algorithm associated with $\mathcal{H}$ is defined by $f_z = \arg\min_{f \in \mathcal{H}} \Big\{ -\log \frac{1}{m^2 h} \sum_{i=1}^{m} \sum_{j=1}^{m} G\Big( \frac{[(y_i - f(x_i)) - (y_j - f(x_j))]^2}{2h^2} \Big) \Big\}$. (1) [sent-62, score-0.657]
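The sketch below shows one way the minimization in (1) could be carried out over a small linear hypothesis space; the feature map, the optimizer (scipy.optimize.minimize), the toy data, and all names are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def mee_objective(w, X, y, h, G=lambda t: np.exp(-t)):
    """Empirical MEE objective of algorithm (1) for a linear predictor f(x) = x . w."""
    e = y - X @ w                                        # error sample e_i = y_i - f(x_i)
    m = e.shape[0]
    pairwise_sq_diff = (e[:, None] - e[None, :]) ** 2
    return -np.log(G(pairwise_sq_diff / (2.0 * h ** 2)).sum() / (m ** 2 * h))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=3, size=200)  # heavy-tailed noise
h = 5.0                                                  # large scaling parameter, as in the analysis
res = minimize(mee_objective, x0=np.zeros(3), args=(X, y, h))
f_z = lambda X_new: X_new @ res.x
# note: the objective is invariant to adding a constant to f, which is why a
# constant adjustment of f_z is discussed later in the paper
```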
22 Its compactness ensures the existence of a minimizer fz . [sent-64, score-0.654]
23 The main purpose of this paper is to verify this observation in the ERM setting and show that fz with a suitable constant adjustment approximates the regression function well with confidence. [sent-69, score-0.667]
24 Note that the requirement of a constant adjustment is natural because any translate fz + c of a solution fz to (1) with a constant c ∈ R is another solution to (1). [sent-70, score-1.272]
25 So our consistency result for MEE algorithm (1) will be stated in terms of the variance var[ fz (X)− fρ (X)] of the error function fz − fρ . [sent-71, score-1.311]
26 Here we use var to denote the variance of a random variable. [sent-72, score-0.208]
27 (2) We also assume that the windowing function G satisfies $G \in C^2[0, \infty)$, $G'_+(0) = -1$, and $C_G := \sup_{t \in (0,\infty)} \big\{ |(1 + t)G'(t)| + |(1 + t)G''(t)| \big\} < \infty$. [sent-76, score-0.212]
28 (3) The special example G(t) = exp{−t} for the Gaussian windowing satisfies (3). [sent-77, score-0.163]
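A quick explicit check of this claim, assuming condition (3) takes the summed form displayed above:

```latex
\text{For } G(t)=e^{-t}:\quad G'(t)=-e^{-t},\quad G''(t)=e^{-t},\quad G'_+(0)=-1,\\
|(1+t)G'(t)|+|(1+t)G''(t)| = 2(1+t)e^{-t},\qquad
\tfrac{d}{dt}\big[(1+t)e^{-t}\big] = -t\,e^{-t}\le 0 \quad (t\ge 0),\\
\text{so } C_G=\sup_{t\in(0,\infty)} 2(1+t)e^{-t} = 2 < \infty .
```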
29 Note that (2) obviously holds when |Y | ≤ M almost surely for some constant M > 0, in which case we shall denote q∗ = 2. [sent-81, score-0.087]
30 Our consistency result, to be proved in Section 5, asserts that when h and m are large enough, the error var[ fz (X) − fρ (X)] of MEE algorithm (1) can be arbitrarily close to the approximation error (Smale and Zhou, 2003) of the hypothesis space H with respect to the regression function fρ . [sent-82, score-0.795]
31 Theorem 3 Under assumptions (2) and (3), for any 0 < ε ≤ 1 and 0 < δ < 1, there exist $h_{\varepsilon,\delta} \ge 1$ and $m_{\varepsilon,\delta}(h) \ge 1$ both depending on $\mathcal{H}$, G, ρ, ε, δ such that for $h \ge h_{\varepsilon,\delta}$ and $m \ge m_{\varepsilon,\delta}(h)$, with confidence 1 − δ, we have $\mathrm{var}[f_z(X) - f_\rho(X)] \le D_{\mathcal{H}}(f_\rho) + \varepsilon$, where $D_{\mathcal{H}}(f_\rho) := \min_{f \in \mathcal{H}} \mathrm{var}[f(X) - f_\rho(X)]$ is the approximation error. [sent-84, score-0.625]
32 (4) Our convergence rates will be stated in terms of the approximation error and the capacity of the hypothesis space H measured by covering numbers in this paper. [sent-85, score-0.145]
33 Definition 4 For ε > 0, the covering number N (H , ε) is defined to be the smallest integer l ∈ N such that there exist l disks in C(X ) with radius ε and centers in H covering the set H . [sent-86, score-0.114]
34 (5) The behavior (5) of the covering numbers is typical in learning theory. [sent-88, score-0.057]
35 We remark that empirical covering numbers might be used together with concentration inequalities to provide sharper error estimates. [sent-91, score-0.093]
36 This is however beyond our scope and for simplicity we adopt the covering number in C(X ) throughout this paper. [sent-92, score-0.057]
37 Theorem 5 Assume (2), (3) and covering number condition (5) for some p > 0. [sent-94, score-0.057]
38 Then for any 0 < η ≤ 1 and 0 < δ < 1, with confidence 1 − δ we have $\mathrm{var}[f_z(X) - f_\rho(X)] \le C_{\mathcal{H}}\, \eta^{(2-q)/2} \big( h^{-\min\{q-2,2\}} + h\, m^{-\frac{1}{1+p}} \big) \log\frac{2}{\delta} + (1 + \eta) D_{\mathcal{H}}(f_\rho)$. (6) [sent-95, score-0.657]
39 If |Y | ≤ M almost surely for some M > 0, then with confidence 1 − δ we have $\mathrm{var}[f_z(X) - f_\rho(X)] \le \frac{C_{\mathcal{H}}}{\eta} \big( h^{-2} + m^{-\frac{1}{1+p}} \big) \log\frac{2}{\delta} + (1 + \eta) D_{\mathcal{H}}(f_\rho)$. (7) [sent-96, score-0.744]
40 Remark 6 In Theorem 5, we use a parameter η > 0 in error bounds (6) and (7) to show that the bounds consist of two terms, one of which is essentially the approximation error DH ( fρ ) since η can be arbitrarily small. [sent-98, score-0.103]
41 If moment condition (2) with q ≥ 4 is satisfied and η = 1, then by taking $h = m^{\frac{1}{3(1+p)}}$, (6) becomes $\mathrm{var}[f_z(X) - f_\rho(X)] \le 2C_{\mathcal{H}}\, m^{-\frac{2}{3(1+p)}} \log\frac{2}{\delta} + 2D_{\mathcal{H}}(f_\rho)$. (8) [sent-100, score-0.657]
42 If |Y | ≤ M almost surely, then by taking $h = m^{\frac{1}{2(1+p)}}$ and η = 1, error bound (7) becomes $\mathrm{var}[f_z(X) - f_\rho(X)] \le 2C_{\mathcal{H}}\, m^{-\frac{1}{1+p}} \log\frac{2}{\delta} + 2D_{\mathcal{H}}(f_\rho)$. [sent-101, score-0.719]
43 This gap is caused by the Parzen windowing process for which our method does not lead to better estimates when q > 4. [sent-104, score-0.163]
44 Note that the result in Theorem 5 does not guarantee that fz itself approximates fρ well when the bounds are small. [sent-106, score-0.625]
45 Theoretically the best constant is $E[f_z(X) - f_\rho(X)]$. [sent-108, score-0.625]
46 In practice it is usually approximated by the sample mean $\frac{1}{m} \sum_{i=1}^{m} (f_z(x_i) - y_i)$ in the case of uniformly bounded noise and the approximation can be easily handled. [sent-109, score-0.668]
47 Theorem 8 Assume $E[|Y|^2] < \infty$ and covering number condition (5) for some p > 0. [sent-112, score-0.057]
48 Replacing the mean $E[f_z(X) - f_\rho(X)]$ by the quantity (10), we define an estimator of $f_\rho$ as $\tilde{f}_z = f_z - \frac{1}{m} \sum_{i=1}^{m} \big( f_z(x_i) - \pi_{\sqrt{m}}(y_i) \big)$. [sent-114, score-2.5]
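A minimal sketch of this constant adjustment; here f_z stands for any minimizer produced by (1), the truncation level √m follows the displayed formula, and the helper names are hypothetical.

```python
import numpy as np

def adjust_constant(f_z, X_sample, y_sample):
    """Return x -> f_z(x) - (1/m) * sum_i ( f_z(x_i) - pi_sqrt(m)(y_i) ),
    where pi_M projects y onto [-M, M] and M = sqrt(m)."""
    y = np.asarray(y_sample, dtype=float)
    m = y.shape[0]
    M = np.sqrt(m)
    y_truncated = np.clip(y, -M, M)                 # the projection pi_{sqrt(m)}
    c = np.mean(f_z(X_sample) - y_truncated)        # estimated systematic offset of f_z
    return lambda X_new: f_z(X_new) - c
```

Applied to the minimizer from the earlier sketch, adjust_constant(f_z, X, y) yields the kind of adjusted estimator whose error the convergence rates below refer to.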
49 Under conditions (2) and (3), by taking $h = m^{\frac{1}{(1+p)\min\{q-1,3\}}}$, we have with confidence 1 − δ, $\|\tilde{f}_z - f_\rho\|^2_{L^2_{\rho_X}} \le (C'_{\mathcal{H}} + 2C_{\mathcal{H}})\, m^{-\frac{\min\{q-2,2\}}{2(1+p)\min\{q-1,3\}}} \log\frac{2}{\delta}$. [sent-121, score-0.657]
50 If |Y | ≤ M almost surely, then by taking $h = m^{\frac{1}{2(1+p)}}$, we have with confidence 1 − δ, $\|\tilde{f}_z - f_\rho\|^2_{L^2_{\rho_X}} \le (C'_{\mathcal{H}} + 2C_{\mathcal{H}})\, m^{-\frac{1}{2(1+p)}} \log\frac{2}{\delta}$. [sent-122, score-0.683]
51 This corollary states that $\tilde{f}_z$ can approximate the regression function very well. [sent-123, score-0.645]
52 Then the power exponent of the following convergence rate can be arbitrarily close to 1/3 when $E[|Y|^4] < \infty$, and 1/2 when |Y | ≤ M almost surely. [sent-127, score-0.058]
53 If $E[|Y|^4] < \infty$, then by taking $h = m^{\frac{1}{3(1+n/s)}}$, we have with confidence 1 − δ, $\|\tilde{f}_z - f_\rho\|^2_{L^2_{\rho_X}} \le C_{s,n,\rho}\, R^{\frac{n}{2(s+n)}}\, m^{-\frac{1}{3(1+n/s)}} \log\frac{2}{\delta}$. [sent-130, score-0.657]
54 If |Y | ≤ M almost surely, then by taking $h = m^{\frac{1}{2(1+n/s)}}$, with confidence 1 − δ, $\|\tilde{f}_z - f_\rho\|^2_{L^2_{\rho_X}} \le C_{s,n,\rho}\, R^{\frac{n}{2(s+n)}}\, m^{-\frac{1}{2+2n/s}} \log\frac{2}{\delta}$. [sent-131, score-0.683]
55 Compared to the analysis of least squares methods, our consistency results for the MEE algorithm require a weaker condition by allowing heavy tailed noise, while the convergence rates are comparable but slightly worse than the optimal one $O(m^{-\frac{1}{2+n/s}})$. [sent-133, score-0.097]
56 , 2010) and verified mathematically for Shannon’s entropy in Chen and Principe (2012) that the regression function fρ may not be a minimizer of E (h) . [sent-140, score-0.144]
57 It is totally different from the classical least squares generalization error $\mathcal{E}^{ls}(f) = \int_Z (f(x) - y)^2\, d\rho$ which satisfies a nice identity $\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}} \ge 0$. [sent-141, score-0.15]
58 This barrier leads to three technical difficulties in our error analysis which will be overcome by our novel approaches making full use of the special feature that the MEE scaling parameter h is large in this paper. [sent-142, score-0.1]
59 1 Approximation of Information Error The first technical difficulty we meet in our mathematical analysis for MEE algorithm (1) is the varying form depending on the windowing function G. [sent-144, score-0.163]
60 This is achieved by showing that E (h) is closely related to the following symmetrized least squares error which has appeared in the literature of ranking algorithms (Clemencon et al. [sent-146, score-0.126]
61 Definition 10 The symmetrized least squares error is defined on the space $L^2_{\rho_X}$ by $\mathcal{E}^{sls}(f) = \int_Z \int_Z \big[ (y - f(x)) - (y' - f(x')) \big]^2\, d\rho(x, y)\, d\rho(x', y')$, $f \in L^2_{\rho_X}$. [sent-148, score-0.13]
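For intuition, a sketch of a plug-in (pairwise) estimate of this quantity from a sample; the code is illustrative only and not part of the paper's algorithm.

```python
import numpy as np

def empirical_sls_error(f, X, y):
    """Plug-in estimate of E^{sls}(f): the average of
    [(y_i - f(x_i)) - (y_j - f(x_j))]^2 over all pairs (i, j);
    the i == j terms contribute zero."""
    e = np.asarray(y, dtype=float) - f(X)           # residuals under the candidate f
    m = e.shape[0]
    return ((e[:, None] - e[None, :]) ** 2).sum() / (m * m)
```

One way to see the connection with least squares is that E^{sls}(f) equals twice the variance of the error variable Y − f(X).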
62 Since $\mathcal{H}$ is a compact subset of C(X ), we know that the number $\sup_{f \in \mathcal{H}} \|f\|_\infty$ is finite. [sent-157, score-0.067]
63 These possible candidates for the target function are defined as $f_{\mathcal{H}} := \arg\min_{f \in \mathcal{H}} \mathcal{E}^{(h)}(f)$ and $f_{\mathrm{approx}} := \arg\min_{f \in \mathcal{H}} \mathrm{var}[f(X) - f_\rho(X)]$. [sent-168, score-0.218]
64 Theorem 14 Under assumptions (2) and (3), we have $\mathcal{E}^{(h)}(f_{\mathrm{approx}}) \le \mathcal{E}^{(h)}(f_{\mathcal{H}}) + 2C''_{\mathcal{H}} h^{-q^*}$ and $\mathrm{var}[f_{\mathcal{H}}(X) - f_\rho(X)] \le \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + 2C''_{\mathcal{H}} h^{-q^*}$. [sent-170, score-0.436]
65 Lemma 15 Under assumptions (2) and (3), we have $\mathrm{var}[f_z(X) - f_\rho(X)] \le \mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) + \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + 2C''_{\mathcal{H}} h^{-q^*}$. (14) [sent-174, score-1.468]
66 Proof By Theorem 13, $\mathrm{var}[f_z(X) - f_\rho(X)] \le \mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_\rho) + C''_{\mathcal{H}} h^{-q^*} \le \mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) + \mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) + C''_{\mathcal{H}} h^{-q^*}$. [sent-175, score-1.875]
67 Since $f_{\mathrm{approx}} \in \mathcal{H}$, the definition of $f_{\mathcal{H}}$ tells us that $\mathcal{E}^{(h)}(f_{\mathcal{H}}) - \mathcal{E}^{(h)}(f_\rho) \le \mathcal{E}^{(h)}(f_{\mathrm{approx}}) - \mathcal{E}^{(h)}(f_\rho)$. [sent-176, score-0.436]
68 Applying Theorem 13 to the above bound implies $\mathrm{var}[f_z(X) - f_\rho(X)] \le \mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) + \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + 2C''_{\mathcal{H}} h^{-q^*}$. [sent-177, score-1.468]
69 In error decomposition (14) for MEE learning algorithm (1), the first term on the right side is the sample error, the second term $\mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)]$ is the approximation error, while the last extra term $2C''_{\mathcal{H}} h^{-q^*}$ is caused by the Parzen windowing and is small when h is large. [sent-180, score-0.433]
70 The quantity $\mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}})$ of the sample error term will be bounded in the following discussion. [sent-181, score-0.661]
71 Then $\mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) = E[V_{f_z}] - E[V_{f_{\mathcal{H}}}] = \{E[V_{f_z}] - V_{f_z}\} + \{V_{f_z} - V_{f_{\mathcal{H}}}\} + \{V_{f_{\mathcal{H}}} - E[V_{f_{\mathcal{H}}}]\}$. [sent-185, score-3.125]
72 Hence $\mathcal{E}^{(h)}(f_z) - \mathcal{E}^{(h)}(f_{\mathcal{H}}) \le \{E[V_{f_z}] - V_{f_z}\} + \{V_{f_{\mathcal{H}}} - E[V_{f_{\mathcal{H}}}]\}$. (16) [sent-187, score-1.875]
73 Sample Error Estimates In this section, we follow (16) and estimate the sample error by a uniform ratio probability inequality based on the following Hoeffding’s probability inequality for U-statistics (Hoeffding, 1963). [sent-193, score-0.078]
74 Lemma 17 If U is a symmetric real-valued function on Z × Z satisfying $a \le U(z, z') \le b$ almost surely and $\mathrm{var}[U] = \sigma^2$, then for any ε > 0, $\mathrm{Prob}\Big\{ \Big| \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \ne i} U(z_i, z_j) - E[U] \Big| \ge \varepsilon \Big\} \le 2\exp\Big\{ -\frac{(m-1)\varepsilon^2}{4\sigma^2 + \frac{4}{3}(b-a)\varepsilon} \Big\}$. [sent-194, score-0.102]
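For concreteness, a small sketch of the order-two U-statistic that Lemma 17 concerns, i.e., the average of U(z_i, z_j) over all pairs with i ≠ j; the kernel used in the example is a placeholder.

```python
import numpy as np

def u_statistic(kernel, z):
    """Average of kernel(z_i, z_j) over all ordered pairs with i != j."""
    m = len(z)
    total = sum(kernel(z[i], z[j]) for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))

# example: with kernel(a, b) = 0.5 * (a - b)^2 the U-statistic is an unbiased
# estimate of the variance of the sampling distribution
sample = np.random.default_rng(1).normal(size=500)
print(u_statistic(lambda a, b: 0.5 * (a - b) ** 2, sample))   # close to 1.0
```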
75 (18) Denote a constant AH depending on ρ, G, q and H as 2 AH = 9 · 28CG sup f − fρ f ∈H 4 q ∞ 2 (E[|Y |q ]) q + fρ 387 2 ∞+ sup f − fρ f ∈H 2 ∞ . [sent-197, score-0.098]
76 (b) If |Y | ≤ M almost surely for some constant M > 0, then we see from (21) that almost surely $|U_f(z, z')| \le 4\|G''\|_\infty \big( M + \|f_\rho\|_\infty + \|f - f_\rho\|_\infty \big) \big| (f(x) - f_\rho(x)) - (f(x') - f_\rho(x')) \big|$. [sent-215, score-0.174]
77 Then we have Prob V f − E[V f ] > 4ε2/q (q−2)/q f ∈H (E[V f ] + 2ε) sup ≤ 2N H, (m − 1)ε ε exp − 4CG h A′′ h H where A′′ is the constant given by H ′′ A′′ = 4AH (CH )−2/q + 12CG sup f − fρ H ∞. [sent-222, score-0.113]
78 f ∈H If |Y | ≤ M almost surely for some constant M > 0, then we have Prob sup f ∈H V f − E[V f ] E[V f ] + 2ε √ >4 ε ≤ 2N H, ε 2A′ H exp − where A′′ is the constant given by H A′′ = 8A′ + 6A′ sup f − fρ H H H f ∈H 389 ∞. [sent-223, score-0.2]
79 These in connection with Lemma 16 tell us that V f − E[V f ] > 4ε2/q (E[V f ] + 2ε)(q−2)/q Thus by taking { f j }N to be an j=1 N H, ε 4CG h net of the set H with N being the covering number , we find Prob ≤ ε 4CG h V f j − E[V f j ] > ε2/q . [sent-225, score-0.057]
80 This together with the notation ≤ sup f ∈H f − fρ ∞ gives the first desired bound, ′′ ′′ where we have observed that ε ≥ CH h and h ≥ 1 imply ε−2/q ≤ (CH )−2/q h. [sent-239, score-0.064]
81 If |Y | ≤ M almost surely for some constant M > 0, then we follow the same line as in our above proof. [sent-240, score-0.087]
82 For m ∈ N, 0 < δ < 1, let $\varepsilon_{m,\delta}$ be the smallest positive solution to the inequality $\log \mathcal{N}\big(\mathcal{H}, \frac{\varepsilon}{2A'_{\mathcal{H}}}\big) - \frac{(m-1)\varepsilon}{A''_{\mathcal{H}}} \le \log\frac{\delta}{2}$. [sent-246, score-0.085]
83 Under assumptions (2) and (3), we have with confidence of 1 − δ, var[ fz (X) − fρ (X)] ≤ (1 + η)var[ fapprox (X) − fρ (X)] + 12 2 + 24 q−2 2 η 2−q 2 ∗ ′′ (hεm,δ + 2CH h−q ). [sent-248, score-0.843]
84 If |Y | ≤ M almost surely for some M > 0, then with confidence of 1 − δ, we have var[ fz (X) − fρ (X)] ≤ (1 + η)var[ fapprox (X) − fρ (X)] + 278 ′′ (ε + 2CH h−2 ). [sent-249, score-0.93]
85 Then by Lemma 19, we know that with confidence 1 − δ, there holds $\sup_{f \in \mathcal{H}} \frac{|V_f - E[V_f]|}{(E[V_f] + 2\varepsilon_{m,\delta,h})^{\tau}} \le 4\varepsilon_{m,\delta,h}^{1-\tau}$, which implies $\{E[V_{f_z}] - V_{f_z}\} + \{V_{f_{\mathcal{H}}} - E[V_{f_{\mathcal{H}}}]\} \le 4\varepsilon_{m,\delta,h}^{1-\tau} (E[V_{f_z}] + 2\varepsilon_{m,\delta,h})^{\tau} + 4\varepsilon_{m,\delta,h}^{1-\tau} (E[V_{f_{\mathcal{H}}}] + 2\varepsilon_{m,\delta,h})^{\tau}$. [sent-251, score-1.942]
86 This together with Lemma 15 and (16) yields $\mathrm{var}[f_z(X) - f_\rho(X)] \le 4S + 16\varepsilon_{m,\delta,h} + \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + 2C''_{\mathcal{H}} h^{-q^*}$, (23) where $S := \varepsilon_{m,\delta,h}^{1-\tau} (E[V_{f_z}])^{\tau} + \varepsilon_{m,\delta,h}^{1-\tau} (E[V_{f_{\mathcal{H}}}])^{\tau} = \big(\tfrac{24}{\eta}\big)^{\tau} \varepsilon_{m,\delta,h}^{1-\tau} \big( \tfrac{\eta}{24} E[V_{f_z}] \big)^{\tau} + \big(\tfrac{12}{\eta}\big)^{\tau} \varepsilon_{m,\delta,h}^{1-\tau} \big( \tfrac{\eta}{12} E[V_{f_{\mathcal{H}}}] \big)^{\tau}$. [sent-252, score-2.093]
87 Now we apply Young's inequality $a \cdot b \le (1-\tau) a^{1/(1-\tau)} + \tau b^{1/\tau}$ for $a, b \ge 0$, and find $S \le \big(\tfrac{24}{\eta}\big)^{\tau/(1-\tau)} \varepsilon_{m,\delta,h} + \tfrac{\eta}{24} E[V_{f_z}] + \big(\tfrac{12}{\eta}\big)^{\tau/(1-\tau)} \varepsilon_{m,\delta,h} + \tfrac{\eta}{12} E[V_{f_{\mathcal{H}}}]$. [sent-253, score-0.646]
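One way to spell out this step for the first term of S (the second term is analogous, with 12 in place of 24):

```latex
\varepsilon_{m,\delta,h}^{1-\tau}\big(E[V_{f_z}]\big)^{\tau}
 = \underbrace{\Big(\tfrac{24}{\eta}\Big)^{\tau}\varepsilon_{m,\delta,h}^{1-\tau}}_{a}\cdot
   \underbrace{\Big(\tfrac{\eta}{24}\,E[V_{f_z}]\Big)^{\tau}}_{b}
 \;\le\; (1-\tau)\Big(\tfrac{24}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}
   + \tau\,\tfrac{\eta}{24}\,E[V_{f_z}]
 \;\le\; \Big(\tfrac{24}{\eta}\Big)^{\tau/(1-\tau)}\varepsilon_{m,\delta,h}
   + \tfrac{\eta}{24}\,E[V_{f_z}].
```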
88 Combining this with (23), Theorem 13 and the identity $E[V_f] = \mathcal{E}^{(h)}(f) - \mathcal{E}^{(h)}(f_\rho)$ gives $\mathrm{var}[f_z(X) - f_\rho(X)] \le \tfrac{\eta}{6} \mathrm{var}[f_z(X) - f_\rho(X)] + \big(1 + \tfrac{\eta}{3}\big) \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + S'$, where $S' := \big(16 + 8(24/\eta)^{\tau/(1-\tau)}\big) \varepsilon_{m,\delta,h} + 3C''_{\mathcal{H}} h^{-q^*}$. [sent-254, score-1.468]
89 Since $1/(1 - \tfrac{\eta}{6}) \le 1 + \tfrac{\eta}{3}$ and $(1 + \tfrac{\eta}{3})^2 \le 1 + \eta$, we see that $\mathrm{var}[f_z(X) - f_\rho(X)] \le (1 + \eta)\, \mathrm{var}[f_{\mathrm{approx}}(X) - f_\rho(X)] + \tfrac{4}{3} S'$. [sent-255, score-0.843]
90 1 Proof of Theorem 3 Recall DH ( fρ ) = var[ fapprox (X) − fρ (X)]. [sent-260, score-0.218]
91 We choose mε,δ (h) = hA′′ H ε log N H, ε 2hA′ H δ 2 − log + 1. [sent-265, score-0.064]
92 By covering number condition (5), we know that εm,δ is bounded by εm,δ , the smallest positive solution to the inequality Ap 2A′ H ε p − δ (m − 1)ε ≤ log . [sent-271, score-0.128]
93 If |Y | ≤ M almost surely for some M > 0, then the second part of Proposition 20 proves (7) with the constant CH given by CH = 278 2A′′ + 2A p A′′ (2A′ ) p H H H This completes the proof of Theorem 5. [sent-276, score-0.087]
94 m i=1 Prob 1 m ∑ f (xi ) − π√m (yi ) − E[ f (X) − π√m (Y )] > ε m i=1 f ∈H It follows that sup ≤ Prob ≤ ∑ Prob sup j=1,. [sent-285, score-0.098]
95 Conclusion and Discussion In this paper we have proved the consistency of an MEE algorithm associated with Rényi’s entropy of order 2 by letting the scaling parameter h in the kernel density estimator tend to infinity at an appropriate rate. [sent-300, score-0.156]
96 However, the motivation of the MEE principle is to minimize error entropies approximately, and requires small h for the kernel density estimator to converge to the true probability density function. [sent-302, score-0.113]
97 Can one carry out error analysis for the MEE algorithm if Shannon’s entropy or Rényi’s entropy of order α ≠ 2 is used? [sent-307, score-0.226]
98 Note that MEE algorithm (1) essentially minimizes the empirical version of the information error which, according to our study in Section 2, differs from the symmetrized least squares error used in some ranking algorithms by an extra term which vanishes when h → ∞. [sent-311, score-0.162]
99 Some further results on the minimum error entropy estimation. [sent-338, score-0.131]
100 Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. [sent-362, score-0.131]
wordName wordTfidf (topN-words)
[('fz', 0.625), ('mee', 0.49), ('fh', 0.224), ('fapprox', 0.218), ('var', 0.208), ('ch', 0.181), ('windowing', 0.163), ('prob', 0.109), ('nyi', 0.109), ('entropy', 0.095), ('inimum', 0.093), ('ntropy', 0.093), ('rror', 0.093), ('principe', 0.084), ('hou', 0.077), ('pe', 0.076), ('parzen', 0.075), ('hr', 0.062), ('surely', 0.061), ('covering', 0.057), ('erdogmus', 0.056), ('dh', 0.052), ('fan', 0.05), ('sup', 0.049), ('erm', 0.048), ('zhou', 0.046), ('dence', 0.042), ('cucker', 0.042), ('shannon', 0.039), ('explanatory', 0.039), ('entropies', 0.039), ('sls', 0.037), ('ug', 0.037), ('hong', 0.036), ('kong', 0.036), ('error', 0.036), ('ah', 0.034), ('ranking', 0.033), ('tg', 0.033), ('smale', 0.032), ('log', 0.032), ('anthony', 0.031), ('squares', 0.03), ('minimizer', 0.029), ('ls', 0.028), ('cityu', 0.028), ('clemencon', 0.028), ('symmetrized', 0.027), ('hs', 0.027), ('yi', 0.027), ('almost', 0.026), ('barrier', 0.025), ('consistency', 0.025), ('ei', 0.025), ('principle', 0.024), ('china', 0.024), ('tailed', 0.023), ('lemma', 0.022), ('hypothesis', 0.022), ('scaling', 0.022), ('silva', 0.022), ('chee', 0.022), ('kowloon', 0.022), ('tat', 0.022), ('wuhan', 0.022), ('adjustment', 0.022), ('sobolev', 0.022), ('yd', 0.022), ('con', 0.021), ('inequality', 0.021), ('regression', 0.02), ('qiang', 0.019), ('hu', 0.019), ('heavy', 0.019), ('response', 0.018), ('variable', 0.018), ('know', 0.018), ('overcome', 0.017), ('power', 0.017), ('agarwal', 0.017), ('xi', 0.017), ('ting', 0.017), ('theorem', 0.016), ('approximation', 0.016), ('cg', 0.015), ('novelty', 0.015), ('arbitrarily', 0.015), ('desired', 0.015), ('exp', 0.015), ('dt', 0.015), ('reproducing', 0.015), ('blind', 0.014), ('kernel', 0.014), ('capacity', 0.014), ('culty', 0.014), ('risk', 0.013), ('predictor', 0.013), ('mh', 0.013), ('haussler', 0.013), ('jun', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
2 0.050692745 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels
Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
4 0.037098352 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 0.035096217 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown
Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound
6 0.028861649 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
7 0.027592329 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
8 0.027084591 104 jmlr-2013-Sparse Single-Index Model
9 0.025432711 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
10 0.024438147 76 jmlr-2013-Nonparametric Sparsity and Regularization
11 0.024225475 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
12 0.023636503 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
13 0.023265375 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
14 0.023135703 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
15 0.023000326 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
16 0.021335008 68 jmlr-2013-Machine Learning with Operational Costs
17 0.020775817 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
18 0.020446103 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
19 0.019634215 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
20 0.01953331 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
topicId topicWeight
[(0, -0.102), (1, 0.03), (2, 0.037), (3, 0.029), (4, 0.01), (5, -0.039), (6, 0.002), (7, 0.014), (8, 0.046), (9, -0.044), (10, -0.041), (11, -0.034), (12, 0.064), (13, -0.043), (14, -0.039), (15, 0.073), (16, -0.066), (17, -0.11), (18, -0.053), (19, 0.103), (20, -0.034), (21, -0.058), (22, 0.066), (23, -0.048), (24, 0.09), (25, 0.041), (26, 0.089), (27, -0.08), (28, -0.006), (29, -0.016), (30, -0.195), (31, -0.127), (32, 0.02), (33, 0.184), (34, 0.091), (35, 0.057), (36, 0.057), (37, -0.129), (38, 0.205), (39, 0.115), (40, 0.25), (41, -0.129), (42, 0.069), (43, -0.143), (44, -0.197), (45, -0.006), (46, -0.289), (47, -0.132), (48, -0.076), (49, 0.205)]
simIndex simValue paperId paperTitle
same-paper 1 0.93401992 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
2 0.45605239 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
5 0.28903788 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
6 0.22968616 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
7 0.21830557 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
8 0.20722462 104 jmlr-2013-Sparse Single-Index Model
9 0.20423166 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
10 0.20197235 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
11 0.1970727 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
12 0.17760995 90 jmlr-2013-Quasi-Newton Method: A New Direction
13 0.17611966 76 jmlr-2013-Nonparametric Sparsity and Regularization
14 0.17129874 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
15 0.16461311 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
16 0.16058455 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
17 0.15922816 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
18 0.15103833 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
19 0.14932033 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
20 0.14762855 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
topicId topicWeight
[(0, 0.011), (5, 0.101), (6, 0.015), (10, 0.051), (20, 0.014), (23, 0.034), (68, 0.022), (70, 0.048), (75, 0.03), (85, 0.038), (87, 0.026), (88, 0.472), (89, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.69261205 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
2 0.57297707 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
3 0.27425519 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
4 0.27315027 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
5 0.268049 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
6 0.26729366 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
7 0.26715308 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
9 0.26136485 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
10 0.26082474 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
11 0.26074463 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.26030338 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
13 0.26013479 73 jmlr-2013-Multicategory Large-Margin Unified Machines
15 0.25971246 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
16 0.25970215 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
17 0.25895414 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
18 0.25758442 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.25715005 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
20 0.25703862 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components