nips nips2004 nips2004-137 knowledge-graph by maker-knowledge-mining

137 nips-2004-On the Adaptive Properties of Decision Trees


Source: pdf

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. [sent-4, score-0.695]

2 In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. [sent-5, score-1.229]

3 The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. [sent-6, score-0.313]

4 1 Introduction: This paper presents three adaptivity properties of decision trees that lead to faster rates of convergence for a broad range of pattern classification problems. [sent-7, score-0.793]

5 These properties are: Noise Adaptivity: Decision trees can automatically adapt to the (unknown) regularity of the excess risk function in the neighborhood of the Bayes decision boundary. [sent-8, score-0.648]

6 The regularity is quantified by a condition similar to Tsybakov’s noise condition [1]. [sent-9, score-0.296]

7 Manifold Focus: When the distribution of features happens to have support on a lower dimensional manifold, decision trees can automatically detect and adapt their structure to the manifold. [sent-10, score-0.579]

8 If some features are irrelevant (i.e., independent of the class labels), then decision trees can automatically ignore these features. [sent-14, score-0.474]

9 Each of the above properties can be formalized and translated into a class of distributions with known minimax rates of convergence. [sent-16, score-0.466]

10 We show that dyadic decision trees achieve the (minimax) optimal rate (to within a log factor) without needing to know the specific parameters defining the class. [sent-18, score-1.128]

11 Such trees are constructed by minimizing a complexity penalized empirical risk over an appropriate family of dyadic partitions. [sent-19, score-0.891]

12 This bound in turn leads to an oracle inequality from which the optimal rates are derived. [sent-21, score-0.454]

13 The restriction to dyadic splits is necessary to achieve a computationally tractable classifier. [sent-23, score-0.543]

14 The same rates may be achieved by more general tree classifiers, but these require searches over prohibitively large families of partitions. [sent-25, score-0.26]

15 Dyadic decision trees are thus preferred because they are simultaneously implementable, analyzable, and sufficiently flexible to achieve optimal rates. [sent-26, score-0.509]

16 The Bayes decision boundary, denoted ∂G∗ , is the topological boundary of the Bayes decision set G∗ = {x | f ∗ (x) = 1}. [sent-40, score-0.41]
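
As a concrete illustration of these objects (my own example, not from the paper): with η(x) = P(Y = 1 | X = x), the Bayes classifier is f∗(x) = 1{η(x) ≥ 1/2}, and the Python sketch below evaluates it on a small grid for an arbitrary illustrative choice of η; the 1-region of the output is G∗ and its topological boundary is ∂G∗.

    # Minimal sketch (illustrative eta, not from the paper): evaluate the Bayes
    # classifier f*(x) = 1{eta(x) >= 1/2} on a grid; the 1-region is G*.
    import numpy as np

    def eta(x):                               # x has shape (..., 2), coordinates in [0, 1]
        return 0.5 + 0.4 * (x[..., 1] - np.sin(np.pi * x[..., 0]) / 2)

    def f_star(x):                            # Bayes classifier
        return (eta(x) >= 0.5).astype(int)

    xs = np.linspace(0.0, 1.0, 6)
    grid = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1)
    print(f_star(grid))                       # the boundary of the 1-region approximates dG*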

17 Yang [3] shows that for η(x) in certain smoothness classes minimax optimal rates are achieved by appropriate plug-in density estimates. [sent-45, score-0.488]

18 Faster rates are then possible, although existing optimal classifiers typically rely on ε-nets or otherwise non-implementable methods. [sent-47, score-0.206]

19 Other authors have derived rates of convergence for existing practical classifiers, but these rates are suboptimal in the minimax sense considered here [6–8]. [sent-49, score-0.667]

20 Our contribution is to demonstrate practical classifiers that adaptively attain minimax optimal rates for some of Tsybakov’s and other classes. [sent-50, score-0.54]

21 2 Dyadic Decision Trees: A dyadic decision tree (DDT) is a decision tree that divides the input space by means of axis-orthogonal dyadic splits. [sent-51, score-1.454]

22 More precisely, a dyadic decision tree T is specified by assigning an integer s(v) ∈ {1, . . . , d} to each internal node v of T (corresponding to the coordinate/attribute that is split at that node), and a binary label 0 or 1 to each leaf node. [sent-52, score-0.793] [sent-55, score-0.163]

24 Each node of a DDT is associated with a cell according to the following rules: (1) The root node is associated with [0, 1]^d; (2) If v is an internal node associated with the cell A, then the children of v are associated with A_{s(v),1} and A_{s(v),2}. [sent-59, score-0.269]
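
A minimal sketch of the cell rule just described (the class and function names below are illustrative, not the paper's):

    # A cell is a list of (lo, hi) intervals, one per coordinate; a dyadic split on
    # coordinate s replaces the s-th interval by its left or right half, giving the
    # children cells A_{s,1} and A_{s,2}.
    def split_cell(cell, s):
        lo, hi = cell[s]
        mid = (lo + hi) / 2.0
        left = cell[:s] + [(lo, mid)] + cell[s + 1:]
        right = cell[:s] + [(mid, hi)] + cell[s + 1:]
        return left, right

    class DDTNode:
        """Internal nodes carry the split coordinate s(v); leaves carry a 0/1 label."""
        def __init__(self, cell, split_coord=None, label=None):
            self.cell = cell                  # hyperrectangle associated with this node
            self.split_coord = split_coord    # s(v) for internal nodes, None for leaves
            self.label = label                # 0 or 1 for leaves
            self.left = self.right = None

    root = DDTNode([(0.0, 1.0), (0.0, 1.0)], split_coord=0)        # root cell [0,1]^2
    left_cell, right_cell = split_cell(root.cell, root.split_coord)
    print(left_cell, right_cell)              # [(0.0, 0.5), (0.0, 1.0)] and [(0.5, 1.0), (0.0, 1.0)]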

25 Define T to be the collection of all DDTs and A to be the collection of all cells corresponding to nodes of trees in T. Figure 1: A dyadic decision tree (right) with the associated recursive dyadic partition (left) when d = 2. [sent-65, score-1.372]

26 Each internal node of the tree is labeled with an integer from 1 to d indicating the coordinate being split at that node. [sent-66, score-0.29]

27 Let M be a dyadic integer, that is, M = 2^L for some nonnegative integer L. [sent-69, score-0.541]

28 Define T_M to be the collection of all DDTs such that no terminal cell has a sidelength smaller than 2^−L. [sent-70, score-0.15]

29 For all of our theorems on rates of convergence below we have L = O(log n), in which case the computational cost is O(nd(log n)^{d+1}). [sent-75, score-0.229]
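
For intuition about the size of the family being searched, the brute-force count below enumerates the dyadic cells of [0,1]^d with sidelengths at least 2^−L, assuming (as Figure 1 suggests) that node cells are products of dyadic intervals; it is only for intuition and is not the dynamic program behind the quoted computational cost.

    # Count the cells of resolution at most L: each coordinate contributes
    # 2^(L+1) - 1 dyadic intervals, and cells are products of such intervals.
    def dyadic_intervals(L):
        return [(k * 2.0 ** -j, (k + 1) * 2.0 ** -j) for j in range(L + 1) for k in range(2 ** j)]

    def num_dyadic_cells(d, L):
        return len(dyadic_intervals(L)) ** d

    print(num_dyadic_cells(d=2, L=3))         # (2^4 - 1)^2 = 225 candidate cells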

30 3 Generalization Error Bounds for Trees: In this section we state a uniform error bound and an oracle inequality for DDTs. [sent-76, score-0.282]

31 The bounding techniques are quite general and can be extended to larger (even uncountable) families of trees using VC theory, but for the sake of simplicity we confine the discussion to DDTs. [sent-78, score-0.243]

32 Further define p̃_A = 4 max(p̂_A, (⟦A⟧ log 2 + log n)/n) and p̄_A = 4 max(p_A, (⟦A⟧ log 2 + log n)/(2n)). [sent-85, score-0.312]

33 Define the data-dependent penalty Φ_n(T) = Σ_{A∈π(T)} √( 2 p̃_A (⟦A⟧ log 2 + log(2n)) / n ). (3) [sent-88, score-0.157]

34 Traditional error bounds for trees involve a penalty proportional to |T| log n/n, where |T| denotes the number of leaves in T (see [12] or the “naive” bound in [2]). [sent-92, score-0.595]

35 The penalty in (2) assigns a different weight to each leaf of the tree depending on both the depth of the leaf and the fraction of data reaching the leaf. [sent-93, score-0.381]

36 For example, we may bound p̃_A by p̄_A with high probability, and if X has a bounded density, then p̄_A decays like max{2^−j, log n/n}, where j is the depth of A. [sent-95, score-0.212]

37 The upshot is that the penalty Φ_n(T) favors unbalanced trees. [sent-97, score-0.136]

38 Intuitively, if two trees have the same size and empirical error, minimizing the penalized empirical risk with this new penalty will select the tree that is more unbalanced, whereas a traditional penalty based only on tree size would not distinguish the two. [sent-98, score-0.752]
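
A small numerical check of this point, using the reconstructed penalty (3) above with ⟦A⟧ taken to be the depth of the leaf A (both the exact constants and this reading of ⟦A⟧ are my assumptions): two four-leaf trees with identical empirical error receive different penalties, and the unbalanced one is cheaper.

    # Compare Phi_n(T) for a balanced and an unbalanced four-leaf tree under uniform X,
    # where p_hat_A (the fraction of the n samples in A) equals the volume of A.
    import numpy as np

    def leaf_term(p_hat, depth, n):
        p_tilde = 4.0 * max(p_hat, (depth * np.log(2) + np.log(n)) / n)
        return np.sqrt(2.0 * p_tilde * (depth * np.log(2) + np.log(2 * n)) / n)

    def Phi_n(leaves, n):                     # leaves: list of (p_hat_A, depth_A) pairs
        return sum(leaf_term(p, j, n) for p, j in leaves)

    n = 1000
    balanced = [(0.25, 2)] * 4                                    # leaf depths 2, 2, 2, 2
    unbalanced = [(0.5, 1), (0.25, 2), (0.125, 3), (0.125, 3)]    # leaf depths 1, 2, 3, 3
    print(Phi_n(balanced, n), Phi_n(unbalanced, n))               # ~0.536 vs ~0.513
    # A penalty proportional to |T| log n / n would assign both trees the same value.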

39 This has advantages for classification because unbalanced trees are what we expect when approximating a lower dimensional decision boundary. [sent-99, score-0.524]

40 Spatial decomposition allows the introduction of local probabilities pA to offset the complexity of each leaf node A. [sent-101, score-0.164]

41 The uniform error bound of Theorem 1 can be converted (using standard techniques) into an oracle inequality that is the key to deriving rates of convergence for DDTs. [sent-103, score-0.511]

42 The use of Φ̄_n instead of Φ_n in the oracle bound facilitates rate of convergence analysis. [sent-109, score-0.319]

43 The oracle inequality essentially says that Tn performs nearly as well as the DDT chosen by an oracle to minimize R(T ) − R∗ . [sent-110, score-0.307]

44 The classes are indexed by the smoothness γ of the Bayes decision boundary ∂G∗ and a parameter κ that quantifies how “noisy” the distribution is near ∂G∗ . [sent-113, score-0.334]

45 We write a_n ≲ b_n when a_n = O(b_n) and a_n ≍ b_n if both a_n ≲ b_n and b_n ≲ a_n. [sent-114, score-0.356]

46 For a constant c1 > 0, define Σ(γ, c1), the class of functions with Hölder regularity γ, to be the collection of all b such that |b(s′) − p_{b,s}(s′)| ≤ c1 |s − s′|^γ for all s, s′ ∈ [0, 1]^{d−1}. [sent-117, score-0.136]

47 Using Tsybakov’s terminology, the Bayes decision set G∗ is a boundary fragment of smoothness γ if G∗ = epi(b) for some b ∈ Σ(γ, c1 ). [sent-118, score-0.39]

48 In other words, for a boundary fragment, the last coordinate of ∂G∗ is a Hölder-γ function of the first d − 1 coordinates. [sent-120, score-0.144]

49 Tsybakov also introduces a condition that characterizes the level of “noise” near ∂G∗ in terms of a noise exponent κ, 1 ≤ κ ≤ ∞. [sent-121, score-0.263]

50 A distribution satisfies Tsybakov’s noise condition with noise exponent κ and constant c2 if P_X(∆(f, f∗)) ≤ c2 (R(f) − R∗)^{1/κ} for all f. (5) [sent-124, score-0.326]

51 The case κ = 1 is the “low noise” case and corresponds to a jump of η(x) at the Bayes decision boundary. [sent-125, score-0.163]
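
A small illustrative computation (my own one-dimensional example, with X uniform on [0,1] and η crossing 1/2 polynomially) recovers the exponent 1/κ in (5) as a log-log slope:

    # X uniform on [0,1], eta(x) = 1/2 + 0.5*sign(x-1/2)*|x-1/2|**(1/beta), and threshold
    # classifiers f_s(x) = 1{x > s}: the slope of log P_X(Delta(f_s, f*)) against
    # log(R(f_s) - R*) is 1/kappa = beta/(beta + 1).
    import numpy as np

    beta = 2.0                                # here kappa = 1 + 1/beta = 1.5
    s = np.linspace(0.51, 0.7, 50)            # thresholds on one side of the boundary 1/2
    p_delta = s - 0.5                         # P_X of the symmetric difference Delta(f_s, f*)
    excess = (s - 0.5) ** (1 + 1 / beta) / (1 + 1 / beta)   # R(f_s) - R* in closed form

    slope = np.polyfit(np.log(excess), np.log(p_delta), 1)[0]
    print(slope, beta / (beta + 1))           # both are approximately 0.667 = 1/kappa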

52 Define the class F = F(γ, κ) = F(γ, κ, c0, c1, c2) to be the collection of distributions of Z = (X, Y) such that: 0A for all measurable A ⊆ [0, 1]^d, P_X(A) ≤ c0 λ(A); 1A G∗ is a boundary fragment defined by b ∈ Σ(γ, c1). [sent-128, score-0.281]

53 2A The margin condition is satisfied with noise exponent κ and constant c2 . [sent-129, score-0.23]

54 Introducing the parameter ρ = (d − 1)/γ, Tsybakov [1] proved the lower bound inf_{fn} sup E_{Z^n}{R(fn)} − R∗ ≳ n^{−κ/(2κ+ρ−1)}. (6) [sent-130, score-0.245]

55 Theoretical rules that achieve this lower bound are studied by [1, 4, 5, 13]. [sent-132, score-0.198]

56 Therefore, the minimax rate for F(γ, κ) can be no faster than n−1/(1+ρ) , which is the lower bound for F(γ, 1). [sent-138, score-0.479]

57 In light of the above, Tsybakov’s noise condition does not improve the learning situation when ρ > 1. [sent-139, score-0.164]

58 To achieve rates faster than n−1/(1+ρ) when ρ > 1, clearly an alternate assumption must be made. [sent-140, score-0.325]

59 If the right-hand side of (6) is any indication, then the distributions responsible for slower rates are those with small κ. [sent-141, score-0.207]

60 Thus, it would seem that we need a noise assumption that excludes those “low noise” distributions with small κ that cause slower rates when ρ > 1. [sent-142, score-0.351]
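
A quick tabulation of the exponent in the lower bound (6) makes this concrete (the two values of ρ are arbitrary illustrative choices):

    # Tabulate kappa/(2*kappa + rho - 1).  For rho < 1 the exponent shrinks as kappa grows
    # (more noise, slower rate); for rho > 1 the smallest exponents come from small kappa,
    # i.e. the "low noise" distributions, which is what motivates excluding them.
    def rate_exponent(kappa, rho):
        return kappa / (2 * kappa + rho - 1)

    for rho in (0.5, 2.0):
        exps = [round(rate_exponent(k, rho), 3) for k in (1, 2, 4, 1000)]
        print("rho =", rho, " exponents for kappa = 1, 2, 4, ~inf:", exps)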

61 Since recursive dyadic partitions can well-approximate G∗ with smoothness γ ≤ 1, we are in the regime of ρ ≥ (d − 1)/γ ≥ 1. [sent-143, score-0.529]

62 As motivated above, faster rates in this situation require an assumption that excludes low noise levels. [sent-144, score-0.401]

63 Because of limited space we are unable to fully present the modified noise condition, and we simply write: 2B Low noise levels are excluded, as defined in [11]. [sent-147, score-0.192]

64 According to the results in the next section, these lower bounds are tight to within a log factor for ρ > 1. [sent-150, score-0.157]

65 5 Adaptive Rates for Dyadic Decision Trees: All of our rate of convergence proofs use the oracle inequality in the same basic way. [sent-151, score-0.337]

66 First form a “regular” dyadic partition (the exact construction will depend on the specific problem) into cells of sidelength 1/m = 2^−K, for a certain K ≤ L. [sent-154, score-0.635]
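
For concreteness, a sketch of this first step is given below; it simply lists the m^d hypercubes of the regular dyadic partition.

    # The regular dyadic partition of [0,1]^d into m^d cells of sidelength 1/m = 2^-K.
    from itertools import product

    def regular_dyadic_partition(d, K):
        m = 2 ** K
        edges = [(k / m, (k + 1) / m) for k in range(m)]
        return [list(cell) for cell in product(edges, repeat=d)]

    cells = regular_dyadic_partition(d=2, K=2)
    print(len(cells), cells[0])               # 16 cells; each is a list of d dyadic intervals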

67 This example reveals how the noise exponent enters the picture to affect the approximation error. [sent-158, score-0.162]

68 1 Noise Adaptive Classification: Dyadic decision trees, selected according to the penalized empirical risk criterion discussed earlier, adapt to the unknown noise level to achieve faster rates as stated in Theorem 3 below. [sent-161, score-0.801]

69 , Lipschitz decision boundaries (the case γ = 1 is discussed in Section 5. [sent-164, score-0.163]

70 In an effort to be more general and practical, we replace the boundary fragment condition 1A with a less restrictive assumption. [sent-168, score-0.241]

71 Tsybakov and van de Geer [5] assume the Bayes decision set G∗ is a boundary fragment, meaning it is known a priori that (a) one coordinate of ∂G∗ is a function of the others, (b) that coordinate is known, and (c) class 1 corresponds to the region above ∂G∗. [sent-169, score-0.398]

72 The following condition includes all piecewise Lipschitz decision boundaries, and allows ∂G∗ to have arbitrary orientation and G∗ to have multiple connected components. [sent-170, score-0.231]

73 Let Pm denote the regular partition of [0, 1]^d into hypercubes of sidelength 1/m where m is a dyadic integer (i.e., m = 2^K for some nonnegative integer K). [sent-171, score-0.693]

74 A distribution of Z satisfies the box-counting assumption with constant c1 > 0 if 1B For all dyadic integers m, ∂G∗ intersects at most c1 m^{d−1} of the m^d cells in Pm. [sent-174, score-0.617]
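
As a quick empirical check of 1B on a concrete boundary (a circle of radius 0.3 in [0,1]^2, my own choice of example), the number of cells of Pm met by the boundary grows roughly linearly in m, i.e. like m^{d−1}:

    # Approximate count of the cells of P_m hit by a circular decision boundary in [0,1]^2.
    import numpy as np

    def boxes_hit_by_circle(m, center=(0.5, 0.5), radius=0.3):
        t = np.linspace(0.0, 2 * np.pi, 200 * m, endpoint=False)    # dense boundary sample
        pts = np.stack([center[0] + radius * np.cos(t),
                        center[1] + radius * np.sin(t)], axis=1)
        cells = np.unique(np.floor(pts * m).astype(int), axis=0)    # distinct cells touched
        return len(cells)

    for m in (8, 16, 32, 64):
        hits = boxes_hit_by_circle(m)
        print(m, hits, hits / m)              # the ratio hits/m stays roughly constant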

75 Condition 1A (γ = 1) implies 1B (with a different c1), so the minimax rate under 0A, 1B, and 2B is no faster than n^{−κ/(2κ+d−2)}. [sent-175, score-0.38]

76 Then sup E_{Z^n}{R(Tn)} − R∗ ≲ (log n/n)^{κ/(2κ+d−2)}. (7) [sent-179, score-0.173]

77 The complexity regularized DDT is adaptive in the sense that the noise exponent κ and constants c0, c1, c2 need not be known. [sent-181, score-0.242]

78 When this happens, dyadic decision trees automatically adapt to achieve faster rates of convergence. [sent-185, score-1.318]

79 The boundedness and regularity assumptions for a d′-dimensional manifold are given by 0B For all dyadic integers m and all A ∈ Pm, P_X(A) ≤ c0 m^{−d′}. [sent-188, score-0.741]

80 1C For all dyadic integers m, ∂G∗ passes through at most c1 m^{d′−1} of the m^d hypercubes in Pm. [sent-189, score-0.655]
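
A toy check of 0B (my own example: X uniform on the diagonal segment of [0,1]^2, so the support is a one-dimensional manifold) shows the largest cell mass decaying like 1/m rather than 1/m^2:

    # Maximum cell mass over P_m when X is supported on the diagonal {(t, t) : t in [0, 1]}.
    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 1.0, size=200_000)
    X = np.column_stack([t, t])                        # samples on a 1-dimensional manifold

    for m in (4, 8, 16, 32):
        idx = np.minimum((X * m).astype(int), m - 1)   # cell index of each sample in P_m
        counts = np.zeros((m, m))
        np.add.at(counts, (idx[:, 0], idx[:, 1]), 1.0)
        max_mass = counts.max() / len(X)
        print(m, round(max_mass, 4), round(max_mass * m, 3))    # max cell mass ~ 1/m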

81 The minimax rate under these assumptions is n^{−1/d′}. [sent-190, score-0.294]

82 Then X lives on a d′-dimensional manifold, and clearly there can be no classifier achieving a rate faster than n^{−1/d′} uniformly over all such X, as this would lead to a classifier outperforming the minimax rate for X′. [sent-201, score-0.476]

83 As the following theorem shows, DDTs can achieve this rate to within a log factor. [sent-202, score-0.249]

84 By an argument like that in the previous section, the minimax rate under this assumption (and 0A and 1B) can be seen to be n^{−1/d′}. [sent-211, score-0.294]

85 Once again, DDTs can achieve this rate to within a log factor. [sent-212, score-0.212]

86 (9) where the sup is over all distributions with relevant data dimension d′ and such that 0A and 1B hold. [sent-217, score-0.164]

87 1 For simplicity, we eliminate the margin assumption here and in subsequent sections, although it could be easily incorporated to yield faster adaptive rates. [sent-219, score-0.135]

88 In [10] we show that DDTs with polynomial classifiers decorating the leaves can achieve faster rates for γ > 1. [sent-222, score-0.37]

89 Combined with the analysis here, these rates can approach n−1 under appropriate noise assumptions. [sent-223, score-0.267]

90 Unfortunately, the rates we obtain are suboptimal and the classifiers are not practical. [sent-224, score-0.171]

91 For γ ≤ 1, free DDTs adaptively attain the minimax rate (within a log factor) of n−γ/(γ+d−1) . [sent-225, score-0.439]

92 Finding practical classifiers that adapt to the optimal rate for γ > 1 is a current line of research. [sent-227, score-0.215]

93 6 Conclusion: Dyadic decision trees adapt to a variation of Tsybakov’s noise condition, data manifold dimension and the number of relevant features to achieve minimax optimal rates of convergence (to within a log factor). [sent-228, score-1.342]

94 DDTs are constructed by a computationally efficient penalized empirical risk minimization procedure based on a novel, spatially adaptive, data-dependent penalty. [sent-229, score-0.142]

95 Although we consider each condition separately so as to simplify the discussion, the conditions can be combined to yield a rate of (log n/n)^{κ/(2κ+d∗−2)}, where d∗ is the dimension of the manifold supporting the relevant features. [sent-230, score-0.261]

96 McAllester, “Generalization bounds for decision trees,” in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, N. [sent-241, score-0.211]

97 Steinwart, “Fast rates for support vector machines,” Los Alamos National Laboratory, Tech. [sent-284, score-0.171]

98 Rozenholc, “Oracle bounds and exact algorithm for dyadic classification trees,” in Learning Theory: 17th Annual Conference on Learning Theory, COLT 2004, J. [sent-290, score-0.523]

99 Nowak, “Near-minimax optimal classification with dyadic classification trees,” in Advances in Neural Information Processing Systems 16, S. [sent-297, score-0.51]

100 [11] ——, “Minimax optimal classification with dyadic decision trees,” Rice University, Tech. [sent-302, score-0.673]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dyadic', 0.475), ('tsybakov', 0.284), ('trees', 0.243), ('ddts', 0.24), ('minimax', 0.228), ('tn', 0.19), ('rates', 0.171), ('pa', 0.171), ('decision', 0.163), ('ddt', 0.146), ('oracle', 0.127), ('classi', 0.104), ('noise', 0.096), ('ez', 0.095), ('sup', 0.095), ('manifold', 0.094), ('fragment', 0.089), ('bn', 0.089), ('tree', 0.089), ('leaf', 0.088), ('faster', 0.086), ('boundary', 0.084), ('ers', 0.081), ('penalty', 0.079), ('log', 0.078), ('pz', 0.076), ('adapt', 0.075), ('adaptivity', 0.072), ('nowak', 0.072), ('sidelength', 0.072), ('bayes', 0.071), ('bound', 0.068), ('condition', 0.068), ('achieve', 0.068), ('rate', 0.066), ('risk', 0.066), ('exponent', 0.066), ('integer', 0.066), ('pm', 0.064), ('regularity', 0.064), ('bf', 0.063), ('px', 0.061), ('coordinate', 0.06), ('convergence', 0.058), ('unbalanced', 0.057), ('smoothness', 0.054), ('inequality', 0.053), ('fn', 0.051), ('cells', 0.05), ('blanchard', 0.05), ('adaptive', 0.049), ('bounds', 0.048), ('cscott', 0.048), ('epi', 0.048), ('excludes', 0.048), ('geer', 0.048), ('ndld', 0.048), ('md', 0.046), ('integers', 0.046), ('node', 0.045), ('penalized', 0.045), ('leaves', 0.045), ('analyzable', 0.042), ('mammen', 0.042), ('hypercubes', 0.042), ('hyperrectangles', 0.042), ('collection', 0.041), ('er', 0.041), ('practical', 0.039), ('rice', 0.038), ('tm', 0.038), ('partition', 0.038), ('theorem', 0.037), ('depth', 0.037), ('pn', 0.037), ('cell', 0.037), ('automatically', 0.037), ('distributions', 0.036), ('mcallester', 0.035), ('adaptively', 0.035), ('optimal', 0.035), ('error', 0.034), ('irrelevant', 0.034), ('near', 0.033), ('relevant', 0.033), ('proofs', 0.033), ('attain', 0.032), ('scott', 0.032), ('boundedness', 0.032), ('complexity', 0.031), ('empirical', 0.031), ('class', 0.031), ('lower', 0.031), ('rules', 0.031), ('cation', 0.031), ('let', 0.03), ('root', 0.03), ('internal', 0.03), ('dimensional', 0.03), ('decays', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

2 0.28744105 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

3 0.10048234 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

4 0.093324602 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

5 0.091278151 70 nips-2004-Following Curved Regularized Optimization Solution Paths

Author: Saharon Rosset

Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.

6 0.083808713 23 nips-2004-Analysis of a greedy active learning strategy

7 0.081541903 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

8 0.079548791 131 nips-2004-Non-Local Manifold Tangent Learning

9 0.074442267 32 nips-2004-Boosting on Manifolds: Adaptive Regularization of Base Classifiers

10 0.073880069 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

11 0.072383985 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

12 0.071600854 34 nips-2004-Breaking SVM Complexity with Cross-Training

13 0.069399945 19 nips-2004-An Application of Boosting to Graph Classification

14 0.067607693 82 nips-2004-Incremental Algorithms for Hierarchical Classification

15 0.066996306 166 nips-2004-Semi-supervised Learning via Gaussian Processes

16 0.066697665 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

17 0.066397704 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities

18 0.064672567 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

19 0.063466325 164 nips-2004-Semi-supervised Learning by Entropy Minimization

20 0.059036259 136 nips-2004-On Semi-Supervised Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, 0.061), (2, 0.03), (3, 0.131), (4, -0.003), (5, 0.08), (6, 0.1), (7, 0.027), (8, 0.037), (9, -0.024), (10, -0.08), (11, -0.056), (12, 0.094), (13, -0.101), (14, -0.18), (15, -0.228), (16, 0.095), (17, 0.041), (18, -0.084), (19, 0.128), (20, -0.141), (21, 0.009), (22, 0.065), (23, 0.014), (24, -0.056), (25, 0.163), (26, 0.051), (27, -0.068), (28, 0.056), (29, -0.034), (30, -0.058), (31, 0.098), (32, -0.05), (33, 0.009), (34, 0.144), (35, 0.137), (36, -0.02), (37, -0.13), (38, 0.057), (39, -0.02), (40, -0.012), (41, 0.061), (42, 0.039), (43, 0.014), (44, 0.05), (45, 0.055), (46, -0.003), (47, 0.077), (48, 0.005), (49, 0.098)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95546865 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

2 0.8422538 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

3 0.72326279 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

4 0.48690748 139 nips-2004-Optimal Aggregation of Classifiers and Boosting Maps in Functional Magnetic Resonance Imaging

Author: Vladimir Koltchinskii, Manel Martínez-ramón, Stefan Posse

Abstract: We study a method of optimal data-driven aggregation of classifiers in a convex combination and establish tight upper bounds on its excess risk with respect to a convex loss function under the assumption that the solution of optimal aggregation problem is sparse. We use a boosting type algorithm of optimal aggregation to develop aggregate classifiers of activation patterns in fMRI based on locally trained SVM classifiers. The aggregation coefficients are then used to design a ”boosting map” of the brain needed to identify the regions with most significant impact on classification. 1

5 0.4823485 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

Author: Lorenzo Rosasco, Andrea Caponnetto, Ernesto D. Vito, Francesca Odone, Umberto D. Giovannini

Abstract: Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless by now there was no formal evidence neither that learning from examples could be seen as an inverse problem nor that theoretical results in learning theory could be independently derived using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem in the language of regularization theory and show that consistency results and optimal regularization parameter choice can be derived by the discretization of the corresponding inverse problem. 1

6 0.40101519 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

7 0.38761523 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

8 0.33645892 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters

9 0.319745 138 nips-2004-Online Bounds for Bayesian Algorithms

10 0.31962255 23 nips-2004-Analysis of a greedy active learning strategy

11 0.31951809 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

12 0.31443805 82 nips-2004-Incremental Algorithms for Hierarchical Classification

13 0.31409958 136 nips-2004-On Semi-Supervised Classification

14 0.31395781 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

15 0.31260452 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

16 0.31211221 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

17 0.30764508 131 nips-2004-Non-Local Manifold Tangent Learning

18 0.30526665 34 nips-2004-Breaking SVM Complexity with Cross-Training

19 0.3019985 54 nips-2004-Distributed Information Regularization on Graphs

20 0.28934205 19 nips-2004-An Application of Boosting to Graph Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.084), (15, 0.081), (26, 0.056), (31, 0.419), (32, 0.014), (33, 0.177), (39, 0.021), (50, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87564242 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

same-paper 2 0.87545586 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

3 0.81414682 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton

Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

4 0.76038951 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

Author: Anton Schwaighofer, Volker Tresp, Kai Yu

Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1

5 0.60407466 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.60187924 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

7 0.59601313 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

8 0.57684237 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

9 0.57641172 158 nips-2004-Sampling Methods for Unsupervised Learning

10 0.57591403 87 nips-2004-Integrating Topics and Syntax

11 0.5707292 23 nips-2004-Analysis of a greedy active learning strategy

12 0.56709254 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

13 0.56687337 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

14 0.56409478 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

15 0.56296343 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

16 0.56049061 33 nips-2004-Brain Inspired Reinforcement Learning

17 0.5602718 77 nips-2004-Hierarchical Clustering of a Mixture Model

18 0.55407393 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

19 0.55220866 177 nips-2004-Supervised Graph Inference

20 0.55165011 166 nips-2004-Semi-supervised Learning via Gaussian Processes