nips nips2010 nips2010-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Multivariate Dyadic Regression Trees for Sparse Learning Problems Han Liu and Xi Chen School of Computer Science, Carnegie Mellon University Pittsburgh, PA 15213 Abstract We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). [sent-1, score-0.487]
2 Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. [sent-2, score-0.403]
3 Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. [sent-3, score-0.322]
4 Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. [sent-4, score-0.081]
5 predicting multi-channel signals within a time frame, predicting concentrations of several chemical constituents using the mass spectra of a sample, or predicting expression levels of many genes using a common set of phenotype variables. [sent-9, score-0.087]
6 These problems can be naturally formulated in terms of multivariate regression. [sent-10, score-0.087]
7 Moreover, we denote the jth dimension of y by yj = (yj^1, . . . , yj^n). [sent-18, score-0.103]
8 Without loss of generality, we assume X = [0, 1]^d and the true model on yj is: yj^i = fj(xi) + ϵj^i, i = 1, . . . , n, (1) [sent-25, score-0.201]
9 where fj : R^d → R is a smooth function. [sent-28, score-0.129]
10 This is a general setting of nonparametric multivariate regression. [sent-38, score-0.142]
11 If f1, . . . , fp lie in a d-dimensional Sobolev ball with order α and radius C, the best convergence rate for the minimax risk is p · n^{−2α/(2α+d)}. [sent-43, score-0.183]
12 However, in many real world applications, the true regression function f may depend only on a small set of variables. [sent-45, score-0.088]
13 If S has been given, the minimax lower bound can be improved to p · n^{−2α/(2α+r)}, which is the best possible rate that can be expected. [sent-53, score-0.096]
14 For sparse learning problems, our task is to develop an estimator that adaptively achieves this faster rate of convergence without knowing S in advance. [sent-54, score-0.097]
15 Previous research on these problems can be roughly divided into three categories: (i) parametric linear models, (ii) nonparametric additive models, and (iii) nonparametric tree models. [sent-55, score-0.308]
16 Recently, significant progress has been made on inferring nonparametric additive models with joint sparsity constraints [7, 10]. [sent-59, score-0.144]
17 For additive models, each fj(x) is assumed to have an additive form: fj(x) = ∑_{k=1}^d fjk(xk). [sent-60, score-0.378]
18 A family of more flexible nonparametric methods is based on tree models. [sent-62, score-0.193]
19 One of the most popular tree methods is the classification and regression tree (CART) [2]. [sent-63, score-0.364]
20 It first grows a full tree by orthogonally splitting the axes at locally optimal splitting points, then prunes back the full tree to form a subtree. [sent-64, score-0.328]
21 In contrast to CART, dyadic decision trees (DDTs) are restricted to only axis-orthogonal dyadic splits, i.e., [sent-66, score-0.456]
22 each dimension can only be split at its midpoint. [sent-68, score-0.092]
23 For a broad range of classification problems, [15] showed that DDTs using a special penalty can attain a nearly optimal rate of convergence in a minimax sense. [sent-69, score-0.201]
24 [1] proposed a dynamic programming algorithm for constructing DDTs when the penalty term has an additive form, i.e., [sent-70, score-0.104]
25 the penalty of the tree can be written as the sum of penalties on all terminal nodes. [sent-72, score-0.37]
26 Though intensively studied for classification problems, the dyadic decision tree idea has not drawn much attention in the regression settings. [sent-73, score-0.425]
27 One of the closest results we are aware of is [4], in which a single response dyadic regression procedure is considered for non-sparse learning problems. [sent-74, score-0.287]
28 Another interesting tree model, “Bayesian Additive Regression Trees (BART)”, is proposed under a Bayesian framework [6]; it is essentially a “sum-of-trees” model. [sent-75, score-0.138]
29 Most of the existing work adopts the number of terminal nodes as the penalty. [sent-76, score-0.215]
30 Such a penalty cannot lead to sparse models, since a tree with a small number of terminal nodes might still involve too many variables. [sent-77, score-0.461]
31 To obtain sparse models, we propose a new nonparametric method based on multivariate dyadic regression trees (MDRTs). [sent-78, score-0.526]
32 Our contributions are two-fold: (i) Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rate of convergence for the class of (α, C)-smooth functions. [sent-81, score-0.294]
33 (ii) Empirically, to avoid computationally prohibitive exhaustive search in high dimensions, we propose a two-stage greedy algorithm and its randomized version that achieve good performance in both function estimation and variable selection. [sent-82, score-0.115]
34 Note that our theory and algorithm can be straightforwardly adapted to the univariate sparse regression problem, which is a special case of the multivariate one. [sent-83, score-0.214]
35 To the best of our knowledge, this is the first time such a sparsity-inducing penalty has been incorporated into tree models for solving sparse regression problems. [sent-84, score-0.309]
36 An MDRT T is a multivariate regression tree that recursively divides the input space X by means of axis-orthogonal dyadic splits. [sent-92, score-0.512]
37 If a node is associated to the cell B = ∏_{j=1}^d [aj, bj], after being dyadically split on dimension k, the two children are associated to the subcells B^{k,1} and B^{k,2}: B^{k,1} = {xi ∈ B | xi^k ≤ (ak + bk)/2} and B^{k,2} = B \ B^{k,1}. [sent-95, score-0.369]
38 k 2 The set of terminal nodes of a MDRT T is denoted as term(T ). [sent-96, score-0.215]
39 Let Bt be the cell in X induced by a terminal node t, the partition induced by term(T ) can be denoted as π(T ) = {Bt |t ∈ term(T )}. [sent-97, score-0.311]
40 For each terminal node t, we can fit a multivariate m-th order polynomial regression on data points falling in Bt. [sent-98, score-0.529]
41 Instead of using all covariates, such a polynomial regression is only fitted on a set of active variables, which is denoted as A(t). [sent-99, score-0.154]
42 For each node b ∈ T (not necessarily a terminal node), A(b) can be an arbitrary subset of {1, . . . , d}. [sent-100, score-0.282]
43 If a node is dyadically split perpendicular to the axis k, k must belong to the active sets of its two children. [sent-104, score-0.247]
44 For any node b, let par(b) be its parent node; then A(par(b)) ⊂ A(b). [sent-106, score-0.094]
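The cell and active-set bookkeeping described above can be made concrete with a short sketch. This is our own illustration in Python rather than code from the paper; the names Cell and dyadic_split are hypothetical.

```python
import numpy as np

class Cell:
    """A hyperrectangle B = prod_j [a_j, b_j] together with its active set A(t)."""
    def __init__(self, lower, upper, active=()):
        self.lower = np.asarray(lower, dtype=float)   # (a_1, ..., a_d)
        self.upper = np.asarray(upper, dtype=float)   # (b_1, ..., b_d)
        self.active = set(active)                     # A(t): indices of active variables

def dyadic_split(cell, k):
    """Dyadically split B on dimension k at its midpoint, returning (B^{k,1}, B^{k,2}).
    The split dimension is added to both children's active sets, which keeps
    A(par(b)) a subset of A(b) as required above."""
    mid = 0.5 * (cell.lower[k] + cell.upper[k])
    child_active = cell.active | {k}
    left = Cell(cell.lower.copy(), cell.upper.copy(), child_active)
    left.upper[k] = mid      # B^{k,1} = {x in B : x_k <= (a_k + b_k) / 2}
    right = Cell(cell.lower.copy(), cell.upper.copy(), child_active)
    right.lower[k] = mid     # B^{k,2} = B \ B^{k,1}
    return left, right
```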
45 For an MDRT T, we define FT^m to be the class of p-valued measurable m-th order polynomials corresponding to π(T). [sent-107, score-0.08]
46 Furthermore, for a dyadic integer N = 2^L, let TN be the collection of all MDRTs such that no terminal cell has a side length smaller than 2^{−L}. [sent-108, score-0.416]
47 The latter, which penalizes the number of terminal nodes |π(T)|, has been commonly adopted in the existing tree literature. [sent-116, score-0.353]
48 In the next section, we will show that this sparsity-inducing term is derived by bounding the VC-dimension of the underlying subgraph of regression functions. [sent-119, score-0.088]
49 To evaluate the algorithm performance, we use the L2-risk with respect to the Lebesgue measure µ(·), which is defined as R(f̂, f) = E ∑_{j=1}^p ∫_X |f̂j(x) − fj(x)|^2 dµ(x), where f̂ is the function estimate constructed from n observed samples. [sent-123, score-0.129]
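Since µ is the Lebesgue measure on X = [0, 1]^d, this risk can be approximated by plain Monte Carlo integration on synthetic problems where the true f is known. The routine below is our own evaluation sketch, not part of the paper; f_hat and f_true are assumed to be vectorized callables returning (n, p) arrays.

```python
import numpy as np

def l2_risk_mc(f_hat, f_true, d, n_mc=100_000, seed=0):
    """Monte Carlo estimate of sum_j int_X |f_hat_j(x) - f_j(x)|^2 dmu(x),
    where mu is the Lebesgue (uniform) measure on [0, 1]^d."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_mc, d))
    diff = f_hat(x) - f_true(x)                  # shape (n_mc, p)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```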
50 2 of [9] shows that the lower minimax rate of convergence for class D(α, C) is exactly the same as that for class of d-dimensional Sobolev ball with order α and radius C. [sent-154, score-0.166]
51 Therefore, the lower minimax rate of convergence is p · n^{−2α/(2α+d)}. [sent-160, score-0.12]
52 Similarly, if the problem is jointly sparse with the index set S and r = |S| ≪ d, the best rate of convergence can be improved to p · n^{−2α/(2α+r)} when S is given. [sent-161, score-0.097]
53 For Gaussian random variables, this assumption easily holds with γ = O(√(log n)), which only contributes a logarithmic term to the final rate of convergence. [sent-171, score-0.082]
54 The next assumption specifies the scaling of the relevant dimension r and ambient dimension d with respect to the sample size n. [sent-172, score-0.159]
55 On the other hand, the ambient dimension d can increase exponentially fast with the sample size, which is a realistic scaling for high dimensional settings. [sent-177, score-0.114]
56 Remark 1 As discussed in Proposition 1, the obtained rate of convergence in (5) is nearly optimal up to a logarithmic term. [sent-180, score-0.119]
57 Remark 2 Since the estimator defined in (2) does not need to know the smoothness α and the sparsity level r in advance, MDRTs are simultaneously adaptive to the unknown smoothness and sparsity level. [sent-181, score-0.171]
58 Our analysis closely follows the least squares regression analysis in [9] and some specific coding scheme of trees in [15]. [sent-183, score-0.177]
59 Let ST^m be the class of scalar-valued measurable m-th order polynomials corresponding to π(T), and let GT^m be the class of all subgraphs of functions of ST^m, i.e. [sent-186, score-0.103]
60 Let f = (f1, . . . , fp) be the true regression function; there exists a set of piecewise m-th order polynomials h1, . . . , hp that approximate it. [sent-200, score-0.235]
61 Sketch of Proof: This is a standard approximation result using multivariate piecewise polynomials. [sent-207, score-0.114]
62 The main idea is based on a multivariate Taylor expansion of the function fj at a given point x0 . [sent-208, score-0.216]
63 First, we define R(g, f) = ∑_{j=1}^p ∫_X |gj(x) − fj(x)|^2 dµ(x). Lemma 3 [9] Choose pen(T) = (γ^4/n)(log(120eγ^4 n) VGT^m + [[T]] log 2 / 2) for some prefix code [[T]] > 0 satisfying ∑_{T∈TN} 2^{−[[T]]} ≤ 1. [sent-214, score-0.153]
64 To make MDRTs scalable to high dimensional massive datasets, using ideas similar to CARTs, we propose a two-stage procedure: (1) we grow a full tree in a greedy manner; (2) we prune back the full tree to form the final tree. [sent-224, score-0.338]
65 Given an MDRT T, denote the corresponding multivariate m-th order polynomial fit on π(T) by fT^m = {ftm}_{t∈π(T)}, where ftm is the m-th order polynomial regression fit on the partition Bt. [sent-226, score-0.313]
66 For each xi falling in Bt, let ftm(xi, A(t)) be the predicted function value for xi. [sent-227, score-0.192]
67 We denote the local squared error (LSE) on node t by Rm(t, A(t)): Rm(t, A(t)) = (1/n) ∑_{xi∈Bt} ∥yi − ftm(xi, A(t))∥^2. [sent-228, score-0.164]
68 The total MSE of the tree R(T) can then be computed by the following equation: R(T) = ∑_{t∈π(T)} Rm(t, A(t)). (11) [sent-230, score-0.138]
69 Our goal is to find the tree structure, together with the polynomial regression on each terminal node, that minimizes the total cost. [sent-232, score-0.542]
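A rough sketch of how Rm(t, A(t)) and a penalized total cost could be computed with ordinary least squares on the active variables. The helper names are ours; Y is assumed to be an (n, p) response matrix, in_cell a boolean mask selecting the points in Bt, and the penalty in total_cost is only a placeholder with hypothetical weights, since the exact form of pen(T) in (3) is not reproduced above.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x_active, m):
    """Design matrix of all monomials of total degree <= m in the active coordinates."""
    n = x_active.shape[0]
    cols = [np.ones(n)]                                   # intercept
    for deg in range(1, m + 1):
        for idx in combinations_with_replacement(range(x_active.shape[1]), deg):
            cols.append(np.prod(x_active[:, list(idx)], axis=1))
    return np.column_stack(cols)

def local_squared_error(X, Y, in_cell, active, m, n_total):
    """Rm(t, A(t)) = (1/n) * sum_{x_i in B_t} ||y_i - ftm(x_i, A(t))||^2,
    with ftm the least-squares polynomial fit on the cell's active variables."""
    Phi = poly_features(X[in_cell][:, sorted(active)], m)
    coef, *_ = np.linalg.lstsq(Phi, Y[in_cell], rcond=None)  # tolerant of rank deficiency
    resid = Y[in_cell] - Phi @ coef
    return float(np.sum(resid ** 2)) / n_total

def total_cost(lse_per_leaf, n_leaves, n_used_vars, lam1=0.01, lam2=0.01):
    """R(T) plus a placeholder pen(T): lam1 * (#variables used anywhere in T)
    + lam2 * |pi(T)|; both weights and this exact form are assumptions."""
    return sum(lse_per_leaf) + lam1 * n_used_vars + lam2 * n_leaves
```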
70 The first stage is tree growing, in which a terminal node t is first selected in each step. [sent-233, score-0.448]
71 We then perform one of two actions, a1 or a2: a1: adding another dimension k ∉ A(t) to A(t) and refitting the regression model on all data points falling in Bt; a2: dyadically splitting t perpendicular to a dimension k ∈ A(t). [sent-234, score-0.382]
72 In each tree growing step, we need to decide which action to perform. [sent-235, score-0.204]
73 For action a2, let sl(t^(k)) be the side length of Bt on dimension k ∈ A(t). [sent-237, score-0.108]
74 If sl(t^(k)) > 2^{−L}, dimension k of Bt can then be dyadically split. [sent-238, score-0.137]
75 In this case, let tL^(k) and tR^(k) be the left and right children of node t. [sent-239, score-0.094]
76 For each terminal node t, we greedily perform the action a∗ on the dimension k∗, which are determined by (a∗, k∗) = argmax_{a, k∈{1,...,d}} ∆Ra^m(t, k). (14) [sent-241, score-0.39]
77 In the high dimensional setting, the above greedy procedure may not lead to the optimal tree, since successive locally optimal splits cannot guarantee the global optimum. [sent-245, score-0.224]
78 Once an irrelevant dimension has been added or split, the greedy procedure can never fix the mistake. [sent-246, score-0.145]
79 Instead of greedily performing the action on the dimension that leads to the maximum drop in LSE, we randomly choose which action to perform according to a multinomial distribution. [sent-248, score-0.172]
80 The action a∗ is then performed on the dimension k ∗ . [sent-251, score-0.108]
81 In general, when the randomized scheme is adopted, we need to repeat our algorithm many times to pick the best tree. [sent-252, score-0.136]
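The selection step of the growing stage might then look as follows. This is a sketch under our own simplifications: delta_lse is assumed to return ∆Ra^m(t, k), the drop in LSE from performing action a on dimension k; node is a Cell-like object as sketched earlier; and the proportional-to-drop weights in the randomized branch are an assumption, since the exact multinomial probabilities are not reproduced above.

```python
import numpy as np

def candidate_moves(node, d, L):
    """Enumerate (action, dimension) pairs: a1 adds a dimension k not in A(t);
    a2 splits a dimension k in A(t) whose side length still exceeds 2**(-L)."""
    moves = []
    for k in range(d):
        if k not in node.active:
            moves.append(("a1", k))
        elif node.upper[k] - node.lower[k] > 2.0 ** (-L):
            moves.append(("a2", k))
    return moves

def choose_move(node, d, L, delta_lse, randomized=False, seed=None):
    """Greedy: take the argmax of the LSE drop, as in (14).
    Randomized: sample a move with probability proportional to its clipped drop."""
    moves = candidate_moves(node, d, L)
    if not moves:
        return None                                # nothing left to grow
    drops = np.array([delta_lse(node, a, k) for a, k in moves])
    if not randomized:
        return moves[int(np.argmax(drops))]
    w = np.clip(drops, 1e-12, None)                # keep the probabilities positive
    rng = np.random.default_rng(seed)
    return moves[rng.choice(len(moves), p=w / w.sum())]
```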
82 For each step, we either merge a pair of terminal nodes or remove a variable from the active set of a terminal node, such that the resulting tree has the smaller cost. [sent-254, score-0.667]
83 We repeat this process until the tree becomes a single root node with an empty active set. [sent-255, score-0.264]
84 The tree with the minimum cost in this process is returned as the final tree. [sent-256, score-0.138]
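One way to organize this pruning stage is sketched below; prune_moves, apply_move, and cost are hypothetical helpers that enumerate the legal moves (merging a sibling pair of terminal nodes or removing a variable from a terminal node's active set), apply one, and recompute R(T) + pen(T). The paper's actual pseudocode is given in its appendix.

```python
def prune(tree, cost, prune_moves, apply_move):
    """Repeatedly apply the prune move whose resulting tree has the smallest cost,
    keeping the best tree seen on the way down to a single root node with an
    empty active set."""
    best_tree, best_cost = tree, cost(tree)
    current = tree
    while True:
        moves = prune_moves(current)
        if not moves:                              # reduced to a bare root node
            break
        candidates = [apply_move(current, mv) for mv in moves]
        costs = [cost(t) for t in candidates]
        i = min(range(len(costs)), key=costs.__getitem__)
        current = candidates[i]
        if costs[i] < best_cost:
            best_tree, best_cost = current, costs[i]
    return best_tree
```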
85 The pseudocode for the growing stage and the cost-complexity pruning stage is presented in the Appendix. [sent-257, score-0.081]
86 Moreover, to avoid a cell with too few data points, we pre-define a quantity nmax . [sent-258, score-0.108]
87 Let n(t) be the number of data points falling into Bt; if n(t) ≤ nmax, Bt will no longer be split. [sent-259, score-0.079]
88 In addition, whenever we perform the m-th order polynomial regression on the active set of a node, we need to make sure it is not rank deficient. [sent-261, score-0.154]
89 For the randomized scheme, we run 50 random trials and pick the minimum-cost tree. [sent-264, score-0.105]
90 As for CART, we adopt the MATLAB package from [12], which fits a piecewise constant on each terminal node with the cost-complexity criterion C(T) = R(T) + ρ(p/n)|π(T)|, where ρ is a tuning parameter playing the same role as λ in (3). [sent-265, score-0.309]
91 For more detailed experiment protocols, we set nmax = 5 and L = 6. [sent-283, score-0.079]
92 The tree with the minimum MSE on the validation set is then picked as the best tree. [sent-285, score-0.138]
93 For criterion (i), if the variables involved in the best tree are exactly the first four variables, the variable selection task for this design is deemed successful. [sent-286, score-0.168]
94 Moreover, the performance of the randomized scheme is slightly better than that of its deterministic version in variable selection. [sent-295, score-0.106]
95 nmax is set to be 5 for the first dataset and 20 for the latter two. [sent-415, score-0.079]
96 Moreover, for the randomized scheme, we run 50 random trials and pick the minimum-cost tree. [sent-417, score-0.105]
97 Moreover, the randomized scheme does improve performance compared to its deterministic counterpart. [sent-449, score-0.106]
98 6 Conclusions We propose a novel sparse learning method based on multivariate dyadic regression trees (MDRTs). [sent-459, score-0.471]
99 Our approach adopts a new sparsity-inducing penalty that simultaneously conducts function estimation and variable selection. [sent-460, score-0.095]
100 To the best of our knowledge, it is the first time that such a penalty is introduced in the tree literature for high dimensional sparse learning problems. [sent-462, score-0.243]
wordName wordTfidf (topN-words)
[('mdrt', 0.674), ('mdrts', 0.337), ('dyadic', 0.199), ('cart', 0.199), ('terminal', 0.188), ('tree', 0.138), ('pen', 0.135), ('fj', 0.129), ('bt', 0.122), ('ddts', 0.119), ('node', 0.094), ('regression', 0.088), ('multivariate', 0.087), ('nmax', 0.079), ('rt', 0.078), ('randomized', 0.075), ('ftm', 0.07), ('dyadically', 0.07), ('housing', 0.07), ('lse', 0.07), ('dimension', 0.067), ('nt', 0.067), ('fp', 0.063), ('xs', 0.062), ('minimax', 0.062), ('additive', 0.06), ('vgt', 0.06), ('mse', 0.059), ('rm', 0.059), ('trees', 0.058), ('polynomials', 0.057), ('nonparametric', 0.055), ('han', 0.052), ('penalty', 0.044), ('xi', 0.042), ('gt', 0.041), ('action', 0.041), ('greedy', 0.04), ('carts', 0.04), ('chem', 0.04), ('jerome', 0.039), ('sparse', 0.039), ('tn', 0.039), ('ft', 0.038), ('irrelevant', 0.038), ('falling', 0.038), ('ga', 0.038), ('nearly', 0.037), ('yj', 0.036), ('rate', 0.034), ('polynomial', 0.034), ('lemma', 0.032), ('active', 0.032), ('leo', 0.032), ('bart', 0.032), ('scheme', 0.031), ('smoothness', 0.03), ('selection', 0.03), ('xk', 0.03), ('sobolev', 0.03), ('inf', 0.03), ('pick', 0.03), ('predicting', 0.029), ('simultaneously', 0.029), ('cell', 0.029), ('sparsity', 0.029), ('breiman', 0.028), ('worthwhile', 0.028), ('larry', 0.028), ('stage', 0.028), ('nodes', 0.027), ('designs', 0.027), ('sl', 0.027), ('st', 0.027), ('synthetic', 0.027), ('piecewise', 0.027), ('remark', 0.026), ('perpendicular', 0.026), ('tl', 0.026), ('liu', 0.026), ('exible', 0.026), ('splitting', 0.026), ('involve', 0.025), ('ambient', 0.025), ('split', 0.025), ('constants', 0.025), ('growing', 0.025), ('splits', 0.024), ('convergence', 0.024), ('estimator', 0.024), ('logarithmic', 0.024), ('log', 0.024), ('ra', 0.024), ('par', 0.024), ('generic', 0.024), ('drop', 0.023), ('class', 0.023), ('tk', 0.023), ('dimensional', 0.022), ('conduct', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
2 0.18821783 108 nips-2010-Graph-Valued Regression
Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty
Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1
3 0.12305571 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.08470694 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
5 0.082701668 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
6 0.081245176 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
7 0.073982835 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
8 0.065769948 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
9 0.064617284 223 nips-2010-Rates of convergence for the cluster tree
10 0.063176662 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
11 0.061648302 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.060451739 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.060114421 144 nips-2010-Learning Efficient Markov Networks
14 0.056406241 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
15 0.055767011 233 nips-2010-Scrambled Objects for Least-Squares Regression
16 0.054899503 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
17 0.054073215 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
18 0.052311637 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
19 0.050782867 63 nips-2010-Distributed Dual Averaging In Networks
20 0.050160564 150 nips-2010-Learning concept graphs from text with stick-breaking priors
topicId topicWeight
[(0, 0.176), (1, 0.019), (2, 0.086), (3, 0.051), (4, -0.028), (5, -0.014), (6, -0.055), (7, 0.03), (8, -0.013), (9, 0.046), (10, -0.036), (11, 0.035), (12, -0.113), (13, -0.005), (14, 0.017), (15, -0.081), (16, 0.005), (17, -0.076), (18, -0.069), (19, -0.023), (20, -0.082), (21, -0.053), (22, -0.013), (23, -0.058), (24, -0.034), (25, 0.02), (26, -0.054), (27, 0.125), (28, 0.024), (29, -0.033), (30, 0.066), (31, 0.05), (32, -0.058), (33, 0.074), (34, 0.043), (35, 0.066), (36, 0.081), (37, -0.171), (38, 0.016), (39, -0.024), (40, 0.083), (41, 0.037), (42, -0.046), (43, 0.013), (44, -0.055), (45, -0.06), (46, 0.004), (47, -0.025), (48, -0.0), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.90476608 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
2 0.76948351 108 nips-2010-Graph-Valued Regression
Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty
Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1
3 0.59612787 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.55801958 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
5 0.54554331 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.515728 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
7 0.51101154 220 nips-2010-Random Projection Trees Revisited
8 0.49278089 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
9 0.48715365 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.48397222 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
11 0.46852896 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
12 0.441017 66 nips-2010-Double Q-learning
13 0.4305256 203 nips-2010-Parametric Bandits: The Generalized Linear Case
14 0.42889312 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
15 0.42869988 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
16 0.40403447 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
17 0.40378097 19 nips-2010-A rational decision making framework for inhibitory control
18 0.40082359 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
19 0.39678785 233 nips-2010-Scrambled Objects for Least-Squares Regression
20 0.39606553 172 nips-2010-Multi-Stage Dantzig Selector
topicId topicWeight
[(13, 0.048), (27, 0.049), (30, 0.087), (35, 0.018), (45, 0.219), (50, 0.054), (52, 0.025), (60, 0.028), (77, 0.031), (78, 0.01), (90, 0.343)]
simIndex simValue paperId paperTitle
1 0.95025045 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
2 0.93144405 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
Author: Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen
Abstract: In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many subtasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a ‘black-box’ interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification. 1
3 0.91685772 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
4 0.91144967 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
Author: Jun Zhu, Li-jia Li, Li Fei-fei, Eric P. Xing
Abstract: Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.
same-paper 5 0.85305232 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
6 0.79726571 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
8 0.76990193 175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers
9 0.74099821 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
10 0.7328071 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
11 0.7295146 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
12 0.72355866 282 nips-2010-Variable margin losses for classifier design
13 0.71700579 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
14 0.71521789 117 nips-2010-Identifying graph-structured activation patterns in networks
15 0.70366079 243 nips-2010-Smoothness, Low Noise and Fast Rates
16 0.70292044 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
17 0.70216256 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
18 0.69937575 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
19 0.69618803 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
20 0.69071323 220 nips-2010-Random Projection Trees Revisited