jmlr jmlr2009 jmlr2009-18 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
Reference: text
sentIndex sentText sentNum sentScore
1 Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. [sent-14, score-0.185]
2 …definition of support vector machines whether they behave locally or not: The separating hyperplane is determined based on the entire training set Sn, and furthermore does not depend on the specific… [sent-44, score-0.184]
3 Thus, on an intuitive level we might think that some consistent methods behave locally and some do not. [sent-46, score-0.242]
4 This intuition also appears relevant when we consider regression: The k-NN regression estimator appears to behave locally, while on the other hand support vector regression (see, e.g., …) [sent-47, score-0.159]
5 …fixed harmonic functions (Lugosi and Zeger, 1995); this method appears to not behave locally, both because the harmonic functions are non-local and because the coefficients… [sent-53, score-0.163]
6 Despite the intuition that some consistent methods might not behave locally, we will see that in fact all of them necessarily behave in that manner. [sent-55, score-0.269]
7 What we will see is that all other consistent methods must also behave locally. [sent-58, score-0.158]
8 …must behave locally if they are consistent, again, despite their being defined… [sent-63, score-0.163]
9 …define Sn(x, r) = {(xi, yi) ∈ Sn : ||xi − x|| ≤ r}; then a localizable method has the property that f(Sn, x) ≈ f(Sn(x, r), x) for some small r > 0 (the formal definition… [sent-82, score-0.276]
10 In other words, a localizable method is one that behaves similarly to a localization of itself. [sent-86, score-0.258]
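A minimal Python sketch (my own illustration, not code from the paper) may help make this concrete: it builds the local training set Sn(x, r), applies a black-box method to both the full and the local sample, and averages the gap over query points, roughly mimicking the distance between f and its localization. The 1-nearest-neighbour predictor, the data, and all names here are illustrative assumptions.
```python
import numpy as np

def nn_predict(X, y, x):
    """A black-box method f(S, x): 1-nearest-neighbour prediction (0 on an empty set)."""
    if len(X) == 0:
        return 0.0
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

def local_training_set(X, y, x, r):
    """S_n(x, r) = {(x_i, y_i) in S_n : ||x_i - x|| <= r}."""
    mask = np.linalg.norm(X - x, axis=1) <= r
    return X[mask], y[mask]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(5000)

r, queries = 0.1, rng.uniform(-1, 1, size=(200, 2))
gaps = [abs(nn_predict(X, y, q) - nn_predict(*local_training_set(X, y, q, r), q))
        for q in queries]
print(np.mean(gaps))  # average |f(S_n, x) - f(S_n(x, r), x)| over query points
```
For 1-NN this gap is essentially zero once r exceeds the typical nearest-neighbour distance; the nontrivial claim sketched in this paper is that every consistent method, however globally it is defined, must behave this way asymptotically for suitable radii.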
11 The main motivation for us is that, as we will see later, consistency implies a form of localizability. [sent-89, score-0.128]
12 …defined methods like k-NN behave locally, but also all other consistent methods as well. [sent-96, score-0.158]
13 Previous work related to localizability has been done in the context of learning methods that work by minimizing a loss function: We can 'localize' the loss function by re-weighting it so that close-by points are more influential… [sent-108, score-0.149]
14 Another difference is that we consider consistency in the sense of asymptotically arriving at the lowest possible loss achievable by any measurable function, and not in the sense of minimizing the loss within a set of fi… [sent-115, score-0.222]
15 (2006), in which it was shown that kernel machines behave locally, in the sense of requiring a large number of examples in order to learn complex functions (because each local area must be learned separately). [sent-118, score-0.173]
16 We now sketch our main result, which is that consistency is logically equivalent to the combination of two properties (which will be given later, in Definition… [sent-122, score-0.158]
17 It will be easy to see that the UAL and WCM properties are 'independent' in the sense that neither implies the other, and therefore we can see consistency as comprised of two independent aspects, which might be presented as Consistency ⇔ UAL + WCM. [sent-124, score-0.19]
18 This can be seen as describing consistency in terms of local (UAL) and global (WCM) aspects (WCM is ‘global’ in the sense of only comparing scalar values averaged over x). [sent-126, score-0.23]
19 …first of which has already been mentioned—that all consistent methods must behave locally. [sent-128, score-0.158]
20 It turns out that a method consistent on P1 must behave locally on that set (just as with the set of all distributions), but the same is not true for P2 , where a consistent method does not necessarily behave locally. [sent-141, score-0.384]
21 In Section 5 we present our main result, the equivalence of consistency to the combination of UAL and WCM. [sent-151, score-0.128]
22 …defined as a sequence of measurable functions {fk}k∈N, where each fk is a function on training sets of size k, that is, fk : (X × Y)^k × X → Y. [sent-176, score-0.129]
23 Finally, recall that we are concerned with consistent methods, that is, methods that behave similarly to f∗ in the limit, and that f∗ is bounded. [sent-187, score-0.158]
24 …defined consistency on a single distribution, and then consistency in general as the property of being consistent on all distributions. [sent-208, score-0.334]
25 This convention will be used for consistency as well as for UAL and other properties. [sent-212, score-0.128]
26 That is, f and g are mutually consistent if they behave asymptotically similarly (as n → ∞) according to the distance metric Dn. [sent-224, score-0.192]
27 Thus, mutual consistency is a natural extension of consistency. [sent-227, score-0.156]
28 For any training set S and r ≥ 0, we call S(x, r) = {(xi, yi) ∈ S : ||xi − x|| ≤ r} a local training set for x within S, of radius r. [sent-231, score-0.146]
29 For example, if f is a linear regression estimator then we can see f |r as performing local linear regression (Cleveland and Loader, 1995). [sent-238, score-0.128]
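As a hedged sketch of f|r when f is least-squares linear regression (illustrative code and names of my own, not the paper's): the localization simply refits the regression on the local training set S(x, r), which is ordinary local linear regression.
```python
import numpy as np

def linear_fit_predict(X, y, x):
    """f(S, x): least-squares fit with intercept on S, evaluated at x."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.append(x, 1.0) @ coef)

def localized(f, X, y, x, r):
    """f|_r(S, x): apply f to the local training set S(x, r) only."""
    mask = np.linalg.norm(X - x, axis=1) <= r
    return f(X[mask], y[mask], x)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(5000, 1))
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(5000)

x0 = np.array([0.7])
print(linear_fit_predict(X, y, x0))                     # global linear fit at x0
print(localized(linear_fit_predict, X, y, x0, r=0.05))  # local linear regression at x0
```
With a quadratic target the two predictions differ noticeably, and that gap is exactly what the localizability question is about.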
30 …define one form of local behavior: Call a method f localizable on a distribution P iff there exists a local version f|{Rk} of f with which f is mutually consistent, that is, Dn( f|{Rk}, f ) → 0. [sent-244, score-0.41]
31 Call a method f locally consistent on a distribution P iff there exists a local version f|{Rk} of f which is consistent, that is, Ln( f|{Rk} ) = Dn( f|{Rk}, f∗ ) → 0. [sent-251, score-0.223]
32 Are all consistent learning methods localizable and locally consistent? [sent-253, score-0.369]
33 Thus, it seems reasonable to expect a localization of a consistent method to be consistent as well, and if this were true then it might have useful practical applications, as mentioned in the introduction. [sent-255, score-0.162]
34 Yet this intuition turns out to be false, and a similar failure occurs for localizability: Proposition 1 A learning method exists which is consistent but neither localizable nor locally consistent. [sent-256, score-0.383]
35 That is, the issue is that we have required merely that the functions fk be measurable, and it turns out that without additional assumptions they can behave erratically in a manner that renders f not locally consistent. [sent-258, score-0.216]
36 It is then possible to show that the ‘problematic area’ near Ex can be large enough so that all local versions of the method behave poorly, but small enough so that the original method is consistent. [sent-261, score-0.176]
37 R(T) contains sequences of radii of local training sets; we require that Rn ↓ 0, as we are interested in behavior on local training sets with radius descending to 0, that is, that become truly 'local' asymptotically. [sent-296, score-0.225]
38 Since Q(Rn) · Rn = o(Rn), the smoothing is done on radii much smaller (asymptotically negligibly small) than the radii Rn of the local training sets, and therefore this is a minor operation. [sent-298, score-0.166]
39 In conclusion, a smoothed local version is similar to a local version, but adds an averaging operation on small radii. [sent-299, score-0.188]
40 In essence, a UAL learning method is one for which, for any choice of T, all large-enough choices of Q and {Rk} are suitable in order to get similar behavior between f and a smoothed local version of f. [sent-307, score-0.158]
41 A UALC method is one whose smoothed local versions are consistent for any large-enough choice of Q and {Rk}. [sent-325, score-0.204]
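The precise definitions of the smoothed local version, UAL and UALC are only partially reproduced in the fragments above, so the following Python sketch is just one plausible reading, under the assumption that 'smoothing' means averaging the localized predictions over query points drawn from a ball of radius q·r around x, with q·r much smaller than r (all names and the sampling scheme are my own guesses, not the paper's construction).
```python
import numpy as np

def localized_predict(f, X, y, x, r):
    """f|_r(S, x): apply the base method f to the local training set S(x, r)."""
    mask = np.linalg.norm(X - x, axis=1) <= r
    return f(X[mask], y[mask], x)

def smoothed_local_predict(f, X, y, x, r, q, n_draws=200, rng=None):
    """A guess at a smoothed local version: average f|_r over query points drawn
    uniformly from the small ball of radius q*r around x (q*r = o(r))."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(x)
    preds = []
    for _ in range(n_draws):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        rad = q * r * rng.uniform() ** (1.0 / d)   # uniform radius inside the small ball
        preds.append(localized_predict(f, X, y, x + rad * u, r))
    return float(np.mean(preds))

def window_mean(X, y, x):
    """Illustrative base method: mean of the y values it is shown (0 if none)."""
    return float(np.mean(y)) if len(y) else 0.0

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(3000, 1))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(3000)
print(smoothed_local_predict(window_mean, X, y, np.array([0.5]), r=0.1, q=0.05))
```
UALC then asks that such smoothed localized predictors approach f∗ for any large-enough choice of Q and {Rk}; the paper's definitions give the exact quantifiers.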
42 fy(S, x) = (1/|S|) ∑(xi,yi)∈S yi, f0(S, x) = 0 (3). fy (called thus because it considers only the y values) is UALC since a local version of it is simply a kernel estimator, using the 'window kernel' k(x) = 1{||x|| ≤ 1}, and thus consistent, for any {Rk} fulfilling… [sent-330, score-0.128]
43 …first main result is that consistency is equivalent to the combination of UAL and UALC: Theorem 4 A learning method is consistent iff it is both UAL and UALC. [sent-338, score-0.219]
44 …first version of the global property of interest to us: We say that f is consistent in mean iff ∀P, lim |En(f) − E(f∗)| = lim |MADn(f) − MAD(f∗)| = 0. [sent-365, score-0.323]
45 By consistency the RHS of (5) converges to 0, and therefore so does the LHS. [sent-369, score-0.145]
46 It turns out that a weaker property than consistency in mean is sufficient… [sent-375, score-0.16]
47 …first that consistency in mean immediately implies WCM, using H(t) = t. [sent-386, score-0.145]
48 Second, recall the example hinted at in the introduction: fy(S, x) = (1/n) ∑i yi is clearly not consistent, since it ignores the xs, nor is it consistent in mean, since MADn(fy) = ESn,x | fy(Sn, x) − En(fy) | = Ey1,…,yn | (1/n) ∑i yi − En(fy) | → 0. [sent-387, score-0.342]
49 However, fy is WCM, since En(fy) → E(f∗) and, as just mentioned, fy's MAD converges to 0. [sent-391, score-0.209]
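A quick numerical check of this example is sketched below (my own code; the data-generating choice f∗(x) = sin(3x) with x uniform on [−1, 1] is an arbitrary assumption, made so that E(f∗) = 0): the global mean En(fy) tracks E(f∗) and MADn(fy) shrinks with n, even though fy ignores x entirely and is therefore far from f∗ pointwise.
```python
import numpy as np

rng = np.random.default_rng(2)

def f_star(x):
    return np.sin(3 * x)   # true regression function; E f*(x) = 0 for x ~ U[-1, 1]

def draw_sample(n):
    x = rng.uniform(-1, 1, size=n)
    return x, f_star(x) + 0.1 * rng.standard_normal(n)

for n in (100, 1000, 10000):
    # f_y(S_n, x) = mean of the y's, the same value for every x
    fy_vals = np.array([draw_sample(n)[1].mean() for _ in range(200)])
    E_n = fy_vals.mean()                    # estimate of E_n(f_y)
    MAD_n = np.abs(fy_vals - E_n).mean()    # estimate of MAD_n(f_y); the average over x is vacuous
    print(n, round(float(E_n), 4), round(float(MAD_n), 4))
# E_n(f_y) approaches E(f*) = 0 and MAD_n(f_y) -> 0, so f_y is WCM;
# yet f_y(S_n, x) stays near 0 while f*(x) = sin(3x) varies with x, so f_y is not consistent.
```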
50 Main Result In this section we present our main result, the logical equivalence of consistency to the combination of the UAL and WCM properties. [sent-395, score-0.128]
51 Note that we already saw that consistency implies WCM (since consistency implies consistency in mean, which implies WCM), and that UAL combined with UALC implies consistency (shown immediately after the statement of Theorem 4). [sent-399, score-0.512]
52 Instead of proving directly that consistency implies UAL and UALC, we will prove that WCM implies UALC, and hence that consistency implies UALC. [sent-400, score-0.256]
53 That is, consistency is equivalent to the combination of UAL and UALC; consistency implies WCM; and WCM implies UALC. [sent-403, score-0.256]
54 The expected loss of a smoothed local version can then be seen to be approximately equal to a mean of expected losses over Px,r for various x. [sent-408, score-0.156]
55 Localizable Sets of Distributions Thus far we have been concerned with consistency in the sense of the set of all (appropriately bounded) distributions; this is a very general case. [sent-416, score-0.142]
56 Definition 8 We call a set of distributions P localizable iff P ∈ P =⇒ ∀r ≥ 0, x ∈ suppX(P): Px,r ∈ P. [sent-424, score-0.293]
57 That is, a localizable set of distributions contains all conditionings of its distributions onto small balls. [sent-425, score-0.292]
58 Simple inspection of the proof of Lemma 6 reveals that it holds true on localizable sets of distributions, and not just on the set of all distributions, simply because we apply the WCM property of f only on distributions Px,r . [sent-426, score-0.28]
59 …specifically, Theorem 4 and Corollary 7—apply to localizable sets in general, and not just to the set of all distributions bounded by some constant M > 0 (which is just one type of example of a localizable set). [sent-428, score-0.503]
60 Note that the requirement that a set of distributions be localizable is in a sense the minimal requirement we would expect, since if f is consistent on a distribution P but not on some Px,r then there is no reason to expect f to be UALC. [sent-429, score-0.342]
61 It is clearly a setup that implicitly works on a localizable set, since Px,r retains the property that y = f∗(x) + ε. [sent-438, score-0.253]
62 Note, however, that if the density of the marginal distribution on X is assumed to be bounded then this is no longer a localizable set. [sent-441, score-0.238]
63 Note, however, that we assume f (S, x) ∈ {−1, +1}, and smoothed local versions do not have this property; for them the equality between the 0-1 and L1 losses is not valid. [sent-447, score-0.141]
64 …final, somewhat more complex example in order to show how localizable sets can be present even in settings where we might not expect them. [sent-457, score-0.254]
65 The assumption of x = φ(z) (and that φ is smooth) is a nontrivial requirement; however, it is easy to see that if we consider the set of all φ and z then this set is a localizable set of distributions. [sent-469, score-0.238]
66 Note that such distributions form a localizable set, and thus we might expect our results to apply to them. [sent-479, score-0.281]
67 A change needs to be made to smoothed local versions to ensure that they remain classifiers… [sent-509, score-0.141]
68 When we talk of smoothed local versions in the context of classification… [sent-513, score-0.158]
69 Let fker be a consistent Nadaraya–Watson kernel estimator using the window kernel. [sent-521, score-0.164]
70 We will now use our results on regression for estimators fc in order to arrive at conclusions for classification… [sent-528, score-0.151]
71 Note that |fc(S, x)| ≤ 1 (since fker uses the window kernel, it is bounded by the largest |yi| in the training set, which is 1). [sent-530, score-0.22]
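The regression-to-classification bridge described here can be sketched as follows (illustrative code with invented names; the window-kernel Nadaraya–Watson estimator matches the description above, and turning it into a classifier by taking the sign is the standard plug-in construction): with ±1 labels, fker averages the labels inside a ball of radius h around x, so its value lies in [−1, 1], and its sign is a classifier.
```python
import numpy as np

def f_ker(X, y, x, h):
    """Nadaraya-Watson estimate with the window kernel k(u) = 1{||u|| <= 1}:
    the average of the y_i with ||x_i - x|| <= h (0 if the window is empty)."""
    mask = np.linalg.norm(X - x, axis=1) <= h
    return float(np.mean(y[mask])) if mask.any() else 0.0

def classify(X, y, x, h):
    """Plug-in classifier: the sign of the kernel regression estimate (ties sent to +1)."""
    return 1 if f_ker(X, y, x, h) >= 0 else -1

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(4000, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}
flip = rng.uniform(size=4000) < 0.1          # 10% label noise
y = np.where(flip, -y, y)

x0 = np.array([0.4, 0.3])
print(f_ker(X, y, x0, h=0.15), classify(X, y, x0, h=0.15))  # value in [-1, 1] and its sign
```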
72 fc is UALC ⇒ Dn( f̄c|{Rk},{Q(Rk)}, fc̃|{Rk},{Q(Rk)} ) → 0 for large-enough Q, {Rk}, in the sense appearing in the definition… [sent-542, score-0.128]
73 …first part of the lemma indicates that the C-mutual consistency of any c, d is linked to the mutual consistency of fc, fd; the second does the same for C-consistency and consistency. [sent-545, score-0.43]
74 The third part of the lemma shows that if fc is UALC then smoothed local versions of fc are asymptotically equivalent to fc̃, where c̃ is a smoothed local version in the sense of classification… [sent-546, score-0.541]
75 …of c (in other words, we can either smooth fc or fi… [sent-547, score-0.131]
76 Assume that c is C-consistent on some localizable set P. [sent-552, score-0.238]
77 Since the entire argument was for an arbitrary localizable set P, we conclude that (Theorem 10) a C-consistent classifier… [sent-558, score-0.238]
78 Concluding Remarks Our analysis of consistency has led to the following result: Consistent learning methods must have two properties: [sent-566, score-0.128]
79 …first, that they behave locally (UAL); second, that their mean must not be far from estimating the true mean (WCM). [sent-567, score-0.197]
80 Thus, we can see consistency as comprised of two aspects, one local and one global. [sent-570, score-0.21]
81 Note also that the global property, WCM, is a trivial consequence of consistency, and therefore what is worth noting about this result is that consistency implies local behavior. [sent-571, score-0.216]
82 What does seem to be the crux of the matter is the requirement to perform well on all distributions, or more generally on a localizable set; note that the term ‘localizable’ here is a giveaway. [sent-574, score-0.252]
83 Indeed, if we have a method that is consistent on a non-localizable set of distributions, it may not behave locally. [sent-575, score-0.158]
84 Then, while g is consistent on the set of distributions with Ey = 0 (using the consistency of f on all distributions), g does not behave locally. [sent-584, score-0.313]
85 …fulfills Ey = 0, and g is consistent on it, but g is neither UAL nor UALC, since smoothed local versions of g in fact return values tending towards 0. [sent-587, score-0.218]
86 Thus, in summary, the fact that the set of all distributions is localizable is what causes consistency to imply local behavior. [sent-588, score-0.457]
87 , 2006), then we must do away with consistency on the set of all distributions and instead talk about consistency on a more limited set, one which is not localizable. [sent-592, score-0.283]
88 One such direction is to apply our results towards proving the consistency of learning methods not yet known to be consistent. [sent-597, score-0.128]
89 While not necessarily a simple property to show, it may in some cases be easier than proving consistency directly. [sent-600, score-0.143]
90 Since a consistent method is necessarily UALC, we can consider using a smoothed local version of the original method: if chosen appropriately, it is also consistent. [sent-602, score-0.187]
91 That is, local versions might behave similarly enough to smoothed local versions for practical purposes. [sent-611, score-0.333]
92 …first show that f is consistent; then we will show that f is neither locally consistent nor localizable. [sent-623, score-0.145]
93 Hence x will tend to be in the area on which f is forced to return 0, making f neither locally consistent nor localizable. [sent-628, score-0.145]
94 Because of these two facts, we get ESn [ µ( BE(Sn),Q(Sn) ∩ {xi : (xi, yi) ∈ Sn}^C ) ] → 0, hence once more f is consistent by the consistency of g, by a similar argument as before. [sent-641, score-0.231]
95 Thus, f|{Rk} is not consistent, that is, f is not locally consistent (note that this holds even by a relatively large constant factor). [sent-700, score-0.131]
96 Having shown that f is consistent but not locally consistent, we now show that in addition f cannot be localizable. [sent-701, score-0.131]
97 By Lemma 14, we know that for almost every x, lim r→0 µ(Bx,r)/µ(Bx,qr) = lim r→0 (cx r^d)/(cx q^d r^d) = 1/q^d (11), and using Lemma 12 we can consider the MAD, as follows. [sent-737, score-0.206]
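For completeness, the standard justification of (11) can be written out as follows (my reconstruction, assuming µ(Bx,r) = cx r^d (1+o(1)) for some constant 0 < cx < ∞, which holds for almost every x when µ has a density, by the Lebesgue differentiation theorem):
```latex
% Assume \mu(B_{x,r}) = c_x r^d\,(1+o(1)) as r \to 0, with 0 < c_x < \infty. Then
\lim_{r\to 0}\frac{\mu(B_{x,r})}{\mu(B_{x,qr})}
   = \lim_{r\to 0}\frac{c_x r^d\,(1+o(1))}{c_x (qr)^d\,(1+o(1))}
   = \frac{1}{q^d}.
```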
98 Consequently, due to (10) we have lim C̄∗(r, q) ≤ lim C∗(r, q) = 0. [sent-740, score-0.176]
99 …fixed r, q > 0, by the consistency of fker on Px,r (using the fact that m → ∞ a.s.)… [sent-823, score-0.198]
100 Distribution-free consistency results in nonparametric discrimination and regression function estimation. [sent-882, score-0.168]
wordName wordTfidf (topN-words)
[('sn', 0.481), ('esn', 0.4), ('ual', 0.295), ('localizable', 0.238), ('wcm', 0.224), ('nrn', 0.21), ('rk', 0.205), ('ualc', 0.196), ('ex', 0.188), ('dn', 0.168), ('tk', 0.158), ('consistency', 0.128), ('localizability', 0.119), ('fc', 0.114), ('mad', 0.112), ('akai', 0.105), ('itov', 0.105), ('ocalizability', 0.098), ('behave', 0.095), ('rn', 0.091), ('lim', 0.088), ('onsistency', 0.083), ('fker', 0.07), ('locally', 0.068), ('local', 0.064), ('fy', 0.064), ('consistent', 0.063), ('smoothed', 0.06), ('madn', 0.049), ('en', 0.046), ('ful', 0.042), ('suppx', 0.042), ('em', 0.04), ('ey', 0.037), ('zakai', 0.035), ('ln', 0.032), ('fk', 0.03), ('devroye', 0.03), ('localize', 0.03), ('mutual', 0.028), ('gyor', 0.028), ('lling', 0.028), ('iff', 0.028), ('distributions', 0.027), ('smoothing', 0.025), ('jerusalem', 0.024), ('global', 0.024), ('regression', 0.024), ('nitions', 0.023), ('boundedness', 0.023), ('yi', 0.023), ('manner', 0.023), ('radii', 0.021), ('supn', 0.021), ('training', 0.021), ('sup', 0.021), ('ev', 0.02), ('localization', 0.02), ('sign', 0.019), ('xi', 0.019), ('measurable', 0.018), ('asymptotically', 0.018), ('alon', 0.018), ('comprised', 0.018), ('ritov', 0.018), ('behavior', 0.017), ('versions', 0.017), ('converges', 0.017), ('radius', 0.017), ('classi', 0.017), ('mean', 0.017), ('smooth', 0.017), ('get', 0.017), ('estimator', 0.016), ('nonparametric', 0.016), ('lemma', 0.016), ('fd', 0.016), ('cleveland', 0.016), ('logically', 0.016), ('might', 0.016), ('log', 0.016), ('mutually', 0.016), ('cx', 0.015), ('loss', 0.015), ('write', 0.015), ('window', 0.015), ('property', 0.015), ('sense', 0.014), ('minor', 0.014), ('behaving', 0.014), ('exc', 0.014), ('huji', 0.014), ('kouropteva', 0.014), ('llable', 0.014), ('loader', 0.014), ('mads', 0.014), ('matter', 0.014), ('sketch', 0.014), ('neither', 0.014), ('estimators', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
2 0.10522851 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario. Keywords: identity management, permutations, approximate inference, group theoretical methods, sensor networks
3 0.092302226 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
Abstract: We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε²/log(n/ε)). (The best previous result was Ω(ε/(n log(n/ε))^{1/4}).) • Learning origin-centered halfspaces with respect to any isotropic log-concave distribution on Rn with malicious noise rate η = Ω(ε³/log²(n/ε)). This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε³/log(1/ε)). In the adversarial label noise setting (or agnostic model), labels can be noisy, but not example points themselves. Previous results could handle η = Ω(ε) but had running time exponential in an unspecified function of 1/ε. Our analysis crucially exploits both concentration and anti-concentration properties of isotropic log-concave distributions. Our algorithms combine an iterative outlier removal procedure using Principal Component Analysis together with "smooth" boosting. Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers
4 0.080979809 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
5 0.066515863 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
6 0.065112539 16 jmlr-2009-Classification with Gaussians and Convex Loss
7 0.063980907 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
8 0.057226814 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
9 0.053616632 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
10 0.041708514 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
11 0.040875968 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
12 0.034754686 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
13 0.030746995 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
14 0.030744521 90 jmlr-2009-Structure Spaces
15 0.028905965 81 jmlr-2009-Robust Process Discovery with Artificial Negative Events (Special Topic on Mining and Learning with Graphs and Relations)
16 0.028522484 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
17 0.028186021 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
18 0.027095271 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
20 0.024954269 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
topicId topicWeight
[(0, 0.141), (1, -0.033), (2, 0.049), (3, 0.011), (4, -0.191), (5, 0.087), (6, -0.104), (7, -0.008), (8, -0.002), (9, 0.14), (10, 0.057), (11, -0.098), (12, -0.276), (13, 0.144), (14, -0.078), (15, -0.134), (16, -0.067), (17, 0.156), (18, 0.26), (19, -0.048), (20, 0.157), (21, 0.075), (22, -0.114), (23, -0.1), (24, -0.201), (25, 0.105), (26, 0.16), (27, -0.088), (28, -0.039), (29, -0.023), (30, 0.077), (31, -0.05), (32, 0.059), (33, 0.015), (34, -0.05), (35, 0.067), (36, 0.042), (37, 0.048), (38, 0.088), (39, 0.046), (40, 0.19), (41, 0.183), (42, -0.095), (43, -0.017), (44, 0.0), (45, 0.084), (46, 0.081), (47, 0.06), (48, 0.098), (49, -0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.96408975 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
2 0.51837987 46 jmlr-2009-Learning Halfspaces with Malicious Noise
Author: Adam R. Klivans, Philip M. Long, Rocco A. Servedio
Abstract: We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions. We give poly(n, 1/ε)-time algorithms for solving the following problems to accuracy ε: • Learning origin-centered halfspaces in Rn with respect to the uniform distribution on the unit ball with malicious noise rate η = Ω(ε²/log(n/ε)). (The best previous result was Ω(ε/(n log(n/ε))^{1/4}).) • Learning origin-centered halfspaces with respect to any isotropic log-concave distribution on Rn with malicious noise rate η = Ω(ε³/log²(n/ε)). This is the first efficient algorithm for learning under isotropic log-concave distributions in the presence of malicious noise. We also give a poly(n, 1/ε)-time algorithm for learning origin-centered halfspaces under any isotropic log-concave distribution on Rn in the presence of adversarial label noise at rate η = Ω(ε³/log(1/ε)). In the adversarial label noise setting (or agnostic model), labels can be noisy, but not example points themselves. Previous results could handle η = Ω(ε) but had running time exponential in an unspecified function of 1/ε. Our analysis crucially exploits both concentration and anti-concentration properties of isotropic log-concave distributions. Our algorithms combine an iterative outlier removal procedure using Principal Component Analysis together with "smooth" boosting. Keywords: PAC learning, noise tolerance, malicious noise, agnostic learning, label noise, halfspace learning, linear classifiers
3 0.46678275 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario. Keywords: identity management, permutations, approximate inference, group theoretical methods, sensor networks
4 0.38727459 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
5 0.32791698 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
6 0.27601916 16 jmlr-2009-Classification with Gaussians and Convex Loss
7 0.25850117 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
8 0.21414596 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
9 0.15414536 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
10 0.14318916 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.1412845 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
12 0.13934256 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
13 0.13542277 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
15 0.12675363 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
16 0.12218656 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
17 0.12089624 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
18 0.11284587 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
19 0.10894699 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
20 0.10565619 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
topicId topicWeight
[(3, 0.432), (8, 0.017), (11, 0.018), (19, 0.027), (38, 0.026), (47, 0.027), (52, 0.035), (55, 0.062), (58, 0.033), (66, 0.107), (68, 0.015), (90, 0.058), (96, 0.025)]
simIndex simValue paperId paperTitle
1 0.67970997 34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn²) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
same-paper 2 0.64950556 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method's global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
3 0.31816888 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
4 0.3151722 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
5 0.30870208 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
6 0.30704388 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
8 0.30563632 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
9 0.30464199 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
10 0.30435735 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
11 0.30325976 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
12 0.30315465 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
13 0.30315152 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
14 0.30224046 78 jmlr-2009-Refinement of Reproducing Kernels
15 0.30203933 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
16 0.30079734 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence
17 0.30065101 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
18 0.30051675 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
19 0.29888737 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
20 0.29814839 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability