nips nips2010 nips2010-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Multivariate Dyadic Regression Trees for Sparse Learning Problems Han Liu and Xi Chen School of Computer Science, Carnegie Mellon University Pittsburgh, PA 15213 Abstract We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). [sent-1, score-0.487]
2 Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. [sent-2, score-0.403]
3 Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. [sent-3, score-0.322]
4 Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. [sent-4, score-0.081]
5 predicting multi-channel signals within a time frame, predicting concentrations of several chemical constituents using the mass spectra of a sample, or predicting expression levels of many genes using a common set of phenotype variables. [sent-9, score-0.087]
6 These problems can be naturally formulated in terms of multivariate regression. [sent-10, score-0.087]
7 Moreover, we denote the jth dimension of y by yj = (yj^1, . . . , yj^n). [sent-18, score-0.103]
8 Without loss of generality, we assume X = [0, 1]^d and the true model on yj is: yj^i = fj(xi) + ϵj^i, i = 1, . . . , n, (1) [sent-25, score-0.201]
9 where fj : R^d → R is a smooth function. [sent-28, score-0.129]
10 This is a general setting of nonparametric multivariate regression. [sent-38, score-0.142]
11 If f1, . . . , fp lie in a d-dimensional Sobolev ball with order α and radius C, the best convergence rate for the minimax risk is p · n^{−2α/(2α+d)}. [sent-43, score-0.183]
12 However, in many real world applications, the true regression function f may depend only on a small set of variables. [sent-45, score-0.088]
13 If S has been given, the minimax lower bound can be improved to p · n^{−2α/(2α+r)}, which is the best possible rate that can be expected. [sent-53, score-0.096]
14 For sparse learning problems, our task is to develop an estimator that adaptively achieves this faster rate of convergence without knowing S in advance. [sent-54, score-0.097]
15 Previous research on these problems can be roughly divided into three categories: (i) parametric linear models, (ii) nonparametric additive models, and (iii) nonparametric tree models. [sent-55, score-0.308]
16 Recently, significant progress has been made on inferring nonparametric additive models with joint sparsity constraints [7, 10]. [sent-59, score-0.144]
17 For additive models, each fj(x) is assumed to have an additive form: fj(x) = ∑_{k=1}^d fjk(xk). [sent-60, score-0.378]
18 A family of more flexible nonparametric methods is based on tree models. [sent-62, score-0.193]
19 One of the most popular tree methods is the classification and regression tree (CART) [2]. [sent-63, score-0.364]
20 It first grows a full tree by orthogonally splitting the axes at locally optimal splitting points, then prunes back the full tree to form a subtree. [sent-64, score-0.328]
21 In contrast to CART, dyadic decision trees (DDTs) are restricted to only axis-orthogonal dyadic splits, i.e., [sent-66, score-0.456]
22 each dimension can only be split at its midpoint. [sent-68, score-0.092]
23 For a broad range of classification problems, [15] showed that DDTs using a special penalty can attain a nearly optimal rate of convergence in a minimax sense. [sent-69, score-0.201]
24 [1] proposed a dynamic programming algorithm for constructing DDTs when the penalty term has an additive form, i.e., [sent-70, score-0.104]
25 the penalty of the tree can be written as the sum of penalties on all terminal nodes. [sent-72, score-0.37]
26 Though intensively studied for classification problems, the dyadic decision tree idea has not drawn much attention in the regression settings. [sent-73, score-0.425]
27 One of the closest results we are aware of is [4], in which a single response dyadic regression procedure is considered for non-sparse learning problems. [sent-74, score-0.287]
28 Another interesting tree model, “Bayesian Additive Regression Trees (BART)”, is proposed under a Bayesian framework [6]; it is essentially a “sum-of-trees” model. [sent-75, score-0.138]
29 Most of the existing work adopts the number of terminal nodes as the penalty. [sent-76, score-0.215]
30 Such a penalty cannot lead to sparse models, since a tree with a small number of terminal nodes might still involve too many variables. [sent-77, score-0.461]
31 To obtain sparse models, we propose a new nonparametric method based on multivariate dyadic regression trees (MDRTs). [sent-78, score-0.526]
32 Our contributions are two-fold: (i) Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rate of convergence for the class of (α, C)-smooth functions. [sent-81, score-0.294]
33 (ii) Empirically, to avoid computationally prohibitive exhaustive search in high dimensions, we propose a two-stage greedy algorithm and its randomized version that achieve good performance in both function estimation and variable selection. [sent-82, score-0.115]
34 Note that our theory and algorithm can be straightforwardly adapted to the univariate sparse regression problem, which is a special case of the multivariate one. [sent-83, score-0.214]
35 To the best of our knowledge, this is the first time such a sparsity-inducing penalty has been incorporated into tree models for solving sparse regression problems. [sent-84, score-0.309]
36 An MDRT T is a multivariate regression tree that recursively divides the input space X by means of axis-orthogonal dyadic splits. [sent-92, score-0.512]
37 If a node is associated to the cell B = ∏_{j=1}^d [aj, bj], after being dyadically split on dimension k, the two children are associated to the subcells B^{k,1} and B^{k,2}: B^{k,1} = {xi ∈ B | xi^k ≤ (ak + bk)/2} and B^{k,2} = B \ B^{k,1}. [sent-95, score-0.369]
38 k 2 The set of terminal nodes of a MDRT T is denoted as term(T ). [sent-96, score-0.215]
39 Let Bt be the cell in X induced by a terminal node t, the partition induced by term(T ) can be denoted as π(T ) = {Bt |t ∈ term(T )}. [sent-97, score-0.311]
40 For each terminal node t, we can fit a multivariate m-th order polynomial regression on data points falling in Bt. [sent-98, score-0.529]
41 Instead of using all covariates, such a polynomial regression is only fitted on a set of active variables, which is denoted as A(t). [sent-99, score-0.154]
42 For each node b ∈ T (not necessarily a terminal node), A(b) can be an arbitrary subset of {1, . . . , d}. [sent-100, score-0.282]
43 If a node is dyadically split perpendicular to the axis k, k must belong to the active sets of its two children. [sent-104, score-0.247]
44 For any node b, let par(b) be its parent node; then A(par(b)) ⊂ A(b). [sent-106, score-0.094]
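The cell and active-set bookkeeping described above can be made concrete with a short sketch. This is our own illustration in Python rather than code from the paper; the names Cell and dyadic_split are hypothetical.

```python
import numpy as np

class Cell:
    """A hyperrectangle B = prod_j [a_j, b_j] together with its active set A(t)."""
    def __init__(self, lower, upper, active=()):
        self.lower = np.asarray(lower, dtype=float)   # (a_1, ..., a_d)
        self.upper = np.asarray(upper, dtype=float)   # (b_1, ..., b_d)
        self.active = set(active)                     # A(t): indices of active variables

def dyadic_split(cell, k):
    """Dyadically split B on dimension k at its midpoint, returning (B^{k,1}, B^{k,2}).
    The split dimension is added to both children's active sets, which keeps
    A(par(b)) a subset of A(b) as required above."""
    mid = 0.5 * (cell.lower[k] + cell.upper[k])
    child_active = cell.active | {k}
    left = Cell(cell.lower.copy(), cell.upper.copy(), child_active)
    left.upper[k] = mid      # B^{k,1} = {x in B : x_k <= (a_k + b_k) / 2}
    right = Cell(cell.lower.copy(), cell.upper.copy(), child_active)
    right.lower[k] = mid     # B^{k,2} = B \ B^{k,1}
    return left, right
```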
45 For an MDRT T, we define FT^m to be the class of p-valued measurable m-th order polynomials corresponding to π(T). [sent-107, score-0.08]
46 Furthermore, for a dyadic integer N = 2^L, let TN be the collection of all MDRTs such that no terminal cell has a side length smaller than 2^{−L}. [sent-108, score-0.416]
47 The latter, which penalizes the number of terminal nodes |π(T)|, has been commonly adopted in the existing tree literature. [sent-116, score-0.353]
48 In the next section, we will show that this sparsity-inducing term is derived by bounding the VC-dimension of the underlying subgraph of regression functions. [sent-119, score-0.088]
49 To evaluate the algorithm performance, we use the L2-risk with respect to the Lebesgue measure µ(·), which is defined as R(f̂, f) = E ∑_{j=1}^p ∫_X |f̂j(x) − fj(x)|^2 dµ(x), where f̂ is the function estimate constructed from n observed samples. [sent-123, score-0.129]
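Since µ is the Lebesgue measure on X = [0, 1]^d, this risk can be approximated by plain Monte Carlo integration on synthetic problems where the true f is known. The routine below is our own evaluation sketch, not part of the paper; f_hat and f_true are assumed to be vectorized callables returning (n, p) arrays.

```python
import numpy as np

def l2_risk_mc(f_hat, f_true, d, n_mc=100_000, seed=0):
    """Monte Carlo estimate of sum_j int_X |f_hat_j(x) - f_j(x)|^2 dmu(x),
    where mu is the Lebesgue (uniform) measure on [0, 1]^d."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_mc, d))
    diff = f_hat(x) - f_true(x)                  # shape (n_mc, p)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```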
50 2 of [9] shows that the lower minimax rate of convergence for class D(α, C) is exactly the same as that for class of d-dimensional Sobolev ball with order α and radius C. [sent-154, score-0.166]
51 Therefore, the lower minimax rate of convergence is p · n^{−2α/(2α+d)}. [sent-160, score-0.12]
52 Similarly, if the problem is jointly sparse with the index set S and r = |S| ≪ d, the best rate of convergence can be improved to p · n^{−2α/(2α+r)} when S is given. [sent-161, score-0.097]
53 For Gaussian random variables, this assumption easily holds with γ = O(√(log n)), which only contributes a logarithmic term to the final rate of convergence. [sent-171, score-0.082]
54 The next assumption specifies the scaling of the relevant dimension r and ambient dimension d with respect to the sample size n. [sent-172, score-0.159]
55 On the other hand, the ambient dimension d can increase exponentially fast with the sample size, which is a realistic scaling for high dimensional settings. [sent-177, score-0.114]
56 Remark 1 As discussed in Proposition 1, the obtained rate of convergence in (5) is nearly optimal up to a logarithmic term. [sent-180, score-0.119]
57 Remark 2 Since the estimator defined in (2) does not need to know the smoothness α and the sparsity level r in advance, MDRTs are simultaneously adaptive to the unknown smoothness and sparsity level. [sent-181, score-0.171]
58 Our analysis closely follows the least squares regression analysis in [9] and some specific coding scheme of trees in [15]. [sent-183, score-0.177]
59 Let ST^m be the class of scalar-valued measurable m-th order polynomials corresponding to π(T), and let GT^m be the class of all subgraphs of functions of ST^m, i.e. [sent-186, score-0.103]
60 Let f = (f1, . . . , fp) be the true regression function; there exists a set of piecewise m-th order polynomials h1, . . . , hp that approximate it. [sent-200, score-0.235]
61 Sketch of Proof: This is a standard approximation result using multivariate piecewise polynomials. [sent-207, score-0.114]
62 The main idea is based on a multivariate Taylor expansion of the function fj at a given point x0 . [sent-208, score-0.216]
63 First, we define R(g, f) = ∑_{j=1}^p ∫_X |gj(x) − fj(x)|^2 dµ(x). Lemma 3 [9] Choose pen(T) = (γ^4/n)(log(120eγ^4 n) VGT^m + [[T]] log 2 / 2) for some prefix code [[T]] > 0 satisfying ∑_{T∈TN} 2^{−[[T]]} ≤ 1. [sent-214, score-0.153]
64 To make MDRTs scalable to high dimensional massive datasets, using ideas similar to CARTs, we propose a two-stage procedure: (1) we grow a full tree in a greedy manner; (2) we prune back the full tree to form the final tree. [sent-224, score-0.338]
65 Given an MDRT T, denote the corresponding multivariate m-th order polynomial fit on π(T) by fT^m = {ftm}_{t∈π(T)}, where ftm is the m-th order polynomial regression fit on the partition Bt. [sent-226, score-0.313]
66 For each xi falling in Bt, let ftm(xi, A(t)) be the predicted function value for xi. [sent-227, score-0.192]
67 We denote the local squared error (LSE) on node t by Rm(t, A(t)): Rm(t, A(t)) = (1/n) ∑_{xi∈Bt} ∥yi − ftm(xi, A(t))∥^2. [sent-228, score-0.164]
68 The total MSE of the tree R(T) can then be computed by the following equation: R(T) = ∑_{t∈π(T)} Rm(t, A(t)). (11) [sent-230, score-0.138]
69 Our goal is to find the tree structure, together with the polynomial regression on each terminal node, that minimizes the total cost. [sent-232, score-0.542]
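A rough sketch of how Rm(t, A(t)) and a penalized total cost could be computed with ordinary least squares on the active variables. The helper names are ours; Y is assumed to be an (n, p) response matrix, in_cell a boolean mask selecting the points in Bt, and the penalty in total_cost is only a placeholder with hypothetical weights, since the exact form of pen(T) in (3) is not reproduced above.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x_active, m):
    """Design matrix of all monomials of total degree <= m in the active coordinates."""
    n = x_active.shape[0]
    cols = [np.ones(n)]                                   # intercept
    for deg in range(1, m + 1):
        for idx in combinations_with_replacement(range(x_active.shape[1]), deg):
            cols.append(np.prod(x_active[:, list(idx)], axis=1))
    return np.column_stack(cols)

def local_squared_error(X, Y, in_cell, active, m, n_total):
    """Rm(t, A(t)) = (1/n) * sum_{x_i in B_t} ||y_i - ftm(x_i, A(t))||^2,
    with ftm the least-squares polynomial fit on the cell's active variables."""
    Phi = poly_features(X[in_cell][:, sorted(active)], m)
    coef, *_ = np.linalg.lstsq(Phi, Y[in_cell], rcond=None)  # tolerant of rank deficiency
    resid = Y[in_cell] - Phi @ coef
    return float(np.sum(resid ** 2)) / n_total

def total_cost(lse_per_leaf, n_leaves, n_used_vars, lam1=0.01, lam2=0.01):
    """R(T) plus a placeholder pen(T): lam1 * (#variables used anywhere in T)
    + lam2 * |pi(T)|; both weights and this exact form are assumptions."""
    return sum(lse_per_leaf) + lam1 * n_used_vars + lam2 * n_leaves
```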
70 The first stage is tree growing, in which a terminal node t is first selected in each step. [sent-233, score-0.448]
71 We then perform one of two actions, a1 or a2: a1: adding another dimension k ∉ A(t) to A(t) and refitting the regression model on all data points falling in Bt; a2: dyadically splitting t perpendicular to a dimension k ∈ A(t). [sent-234, score-0.382]
72 In each tree growing step, we need to decide which action to perform. [sent-235, score-0.204]
73 For action a2, let sl(t^(k)) be the side length of Bt on dimension k ∈ A(t). [sent-237, score-0.108]
74 If sl(t^(k)) > 2^{−L}, dimension k of Bt can then be dyadically split. [sent-238, score-0.137]
75 In this case, let tL^(k) and tR^(k) be the left and right children of node t. [sent-239, score-0.094]
76 For each terminal node t, we greedily perform the action a∗ on the dimension k∗, which are determined by (a∗, k∗) = argmax_{a, k∈{1,...,d}} ∆Ra^m(t, k). (14) [sent-241, score-0.39]
77 In the high dimensional setting, the above greedy procedure may not lead to the optimal tree, since successive locally optimal splits cannot guarantee the global optimum. [sent-245, score-0.224]
78 Once an irrelevant dimension has been added or split, the greedy procedure can never fix the mistake. [sent-246, score-0.145]
79 Instead of greedily performing the action on the dimension that leads to the maximum drop in LSE, we randomly choose which action to perform according to a multinomial distribution. [sent-248, score-0.172]
80 The action a∗ is then performed on the dimension k ∗ . [sent-251, score-0.108]
81 In general, when the randomized scheme is adopted, we need to repeat our algorithm many times to pick the best tree. [sent-252, score-0.136]
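The selection step of the growing stage might then look as follows. This is a sketch under our own simplifications: delta_lse is assumed to return ∆Ra^m(t, k), the drop in LSE from performing action a on dimension k; node is a Cell-like object as sketched earlier; and the proportional-to-drop weights in the randomized branch are an assumption, since the exact multinomial probabilities are not reproduced above.

```python
import numpy as np

def candidate_moves(node, d, L):
    """Enumerate (action, dimension) pairs: a1 adds a dimension k not in A(t);
    a2 splits a dimension k in A(t) whose side length still exceeds 2**(-L)."""
    moves = []
    for k in range(d):
        if k not in node.active:
            moves.append(("a1", k))
        elif node.upper[k] - node.lower[k] > 2.0 ** (-L):
            moves.append(("a2", k))
    return moves

def choose_move(node, d, L, delta_lse, randomized=False, seed=None):
    """Greedy: take the argmax of the LSE drop, as in (14).
    Randomized: sample a move with probability proportional to its clipped drop."""
    moves = candidate_moves(node, d, L)
    if not moves:
        return None                                # nothing left to grow
    drops = np.array([delta_lse(node, a, k) for a, k in moves])
    if not randomized:
        return moves[int(np.argmax(drops))]
    w = np.clip(drops, 1e-12, None)                # keep the probabilities positive
    rng = np.random.default_rng(seed)
    return moves[rng.choice(len(moves), p=w / w.sum())]
```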
82 For each step, we either merge a pair of terminal nodes or remove a variable from the active set of a terminal node, such that the resulting tree has the smaller cost. [sent-254, score-0.667]
83 We repeat this process until the tree becomes a single root node with an empty active set. [sent-255, score-0.264]
84 The tree with the minimum cost in this process is returned as the final tree. [sent-256, score-0.138]
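One way to organize this pruning stage is sketched below; prune_moves, apply_move, and cost are hypothetical helpers that enumerate the legal moves (merging a sibling pair of terminal nodes or removing a variable from a terminal node's active set), apply one, and recompute R(T) + pen(T). The paper's actual pseudocode is given in its appendix.

```python
def prune(tree, cost, prune_moves, apply_move):
    """Repeatedly apply the prune move whose resulting tree has the smallest cost,
    keeping the best tree seen on the way down to a single root node with an
    empty active set."""
    best_tree, best_cost = tree, cost(tree)
    current = tree
    while True:
        moves = prune_moves(current)
        if not moves:                              # reduced to a bare root node
            break
        candidates = [apply_move(current, mv) for mv in moves]
        costs = [cost(t) for t in candidates]
        i = min(range(len(costs)), key=costs.__getitem__)
        current = candidates[i]
        if costs[i] < best_cost:
            best_tree, best_cost = current, costs[i]
    return best_tree
```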
85 The pseudocode for the growing stage and the cost-complexity pruning stage is presented in the Appendix. [sent-257, score-0.081]
86 Moreover, to avoid a cell with too few data points, we pre-define a quantity nmax . [sent-258, score-0.108]
87 Let n(t) be the number of data points falling into Bt; if n(t) ≤ nmax, Bt will no longer be split. [sent-259, score-0.079]
88 In addition, whenever we perform the m-th order polynomial regression on the active set of a node, we need to make sure it is not rank deficient. [sent-261, score-0.154]
89 For the randomized scheme, we run 50 random trials and pick the minimum-cost tree. [sent-264, score-0.105]
90 As for CART, we adopt the MATLAB package from [12], which fits a piecewise constant on each terminal node with the cost-complexity criterion C(T) = R(T) + ρ(p/n)|π(T)|, where ρ is a tuning parameter playing the same role as λ in (3). [sent-265, score-0.309]
91 For more detailed experiment protocols, we set nmax = 5 and L = 6. [sent-283, score-0.079]
92 The tree with the minimum MSE on the validation set is then picked as the best tree. [sent-285, score-0.138]
93 For criterion (i), if the variables involved in the best tree are exactly the first four variables, the variable selection task for this design is deemed successful. [sent-286, score-0.168]
94 Moreover, the performance of the randomized scheme is slightly better than that of its deterministic version in variable selection. [sent-295, score-0.106]
95 nmax is set to be 5 for the first dataset and 20 for the latter two. [sent-415, score-0.079]
96 Moreover, for the randomized scheme, we run 50 random trials and pick the minimum-cost tree. [sent-417, score-0.105]
97 Moreover, the randomized scheme does improve performance compared to its deterministic counterpart. [sent-449, score-0.106]
98 6 Conclusions We propose a novel sparse learning method based on multivariate dyadic regression trees (MDRTs). [sent-459, score-0.471]
99 Our approach adopts a new sparsity-inducing penalty that simultaneously conducts function estimation and variable selection. [sent-460, score-0.095]
100 To the best of our knowledge, it is the first time that such a penalty is introduced in the tree literature for high dimensional sparse learning problems. [sent-462, score-0.243]
wordName wordTfidf (topN-words)
[('mdrt', 0.674), ('mdrts', 0.337), ('dyadic', 0.199), ('cart', 0.199), ('terminal', 0.188), ('tree', 0.138), ('pen', 0.135), ('fj', 0.129), ('bt', 0.122), ('ddts', 0.119), ('node', 0.094), ('regression', 0.088), ('multivariate', 0.087), ('nmax', 0.079), ('rt', 0.078), ('randomized', 0.075), ('ftm', 0.07), ('dyadically', 0.07), ('housing', 0.07), ('lse', 0.07), ('dimension', 0.067), ('nt', 0.067), ('fp', 0.063), ('xs', 0.062), ('minimax', 0.062), ('additive', 0.06), ('vgt', 0.06), ('mse', 0.059), ('rm', 0.059), ('trees', 0.058), ('polynomials', 0.057), ('nonparametric', 0.055), ('han', 0.052), ('penalty', 0.044), ('xi', 0.042), ('gt', 0.041), ('action', 0.041), ('greedy', 0.04), ('carts', 0.04), ('chem', 0.04), ('jerome', 0.039), ('sparse', 0.039), ('tn', 0.039), ('ft', 0.038), ('irrelevant', 0.038), ('falling', 0.038), ('ga', 0.038), ('nearly', 0.037), ('yj', 0.036), ('rate', 0.034), ('polynomial', 0.034), ('lemma', 0.032), ('active', 0.032), ('leo', 0.032), ('bart', 0.032), ('scheme', 0.031), ('smoothness', 0.03), ('selection', 0.03), ('xk', 0.03), ('sobolev', 0.03), ('inf', 0.03), ('pick', 0.03), ('predicting', 0.029), ('simultaneously', 0.029), ('cell', 0.029), ('sparsity', 0.029), ('breiman', 0.028), ('worthwhile', 0.028), ('larry', 0.028), ('stage', 0.028), ('nodes', 0.027), ('designs', 0.027), ('sl', 0.027), ('st', 0.027), ('synthetic', 0.027), ('piecewise', 0.027), ('remark', 0.026), ('perpendicular', 0.026), ('tl', 0.026), ('liu', 0.026), ('exible', 0.026), ('splitting', 0.026), ('involve', 0.025), ('ambient', 0.025), ('split', 0.025), ('constants', 0.025), ('growing', 0.025), ('splits', 0.024), ('convergence', 0.024), ('estimator', 0.024), ('logarithmic', 0.024), ('log', 0.024), ('ra', 0.024), ('par', 0.024), ('generic', 0.024), ('drop', 0.023), ('class', 0.023), ('tk', 0.023), ('dimensional', 0.022), ('conduct', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
2 0.18821783 108 nips-2010-Graph-Valued Regression
Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty
Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1
3 0.12305571 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.08470694 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
5 0.082701668 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
6 0.081245176 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
7 0.073982835 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
8 0.065769948 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
9 0.064617284 223 nips-2010-Rates of convergence for the cluster tree
10 0.063176662 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
11 0.061648302 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.060451739 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.060114421 144 nips-2010-Learning Efficient Markov Networks
14 0.056406241 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
15 0.055767011 233 nips-2010-Scrambled Objects for Least-Squares Regression
16 0.054899503 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
17 0.054073215 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
18 0.052311637 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
19 0.050782867 63 nips-2010-Distributed Dual Averaging In Networks
20 0.050160564 150 nips-2010-Learning concept graphs from text with stick-breaking priors
topicId topicWeight
[(0, 0.176), (1, 0.019), (2, 0.086), (3, 0.051), (4, -0.028), (5, -0.014), (6, -0.055), (7, 0.03), (8, -0.013), (9, 0.046), (10, -0.036), (11, 0.035), (12, -0.113), (13, -0.005), (14, 0.017), (15, -0.081), (16, 0.005), (17, -0.076), (18, -0.069), (19, -0.023), (20, -0.082), (21, -0.053), (22, -0.013), (23, -0.058), (24, -0.034), (25, 0.02), (26, -0.054), (27, 0.125), (28, 0.024), (29, -0.033), (30, 0.066), (31, 0.05), (32, -0.058), (33, 0.074), (34, 0.043), (35, 0.066), (36, 0.081), (37, -0.171), (38, 0.016), (39, -0.024), (40, 0.083), (41, 0.037), (42, -0.046), (43, 0.013), (44, -0.055), (45, -0.06), (46, 0.004), (47, -0.025), (48, -0.0), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.90476608 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
2 0.76948351 108 nips-2010-Graph-Valued Regression
Author: Han Liu, Xi Chen, Larry Wasserman, John D. Lafferty
Abstract: Undirected graphical models encode in a graph G the dependency structure of a random vector Y . In many applications, it is of interest to model Y given another random vector X as input. We refer to the problem of estimating the graph G(x) of Y conditioned on X = x as “graph-valued regression”. In this paper, we propose a semiparametric method for estimating G(x) that builds a tree on the X space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method “Graph-optimized CART”, or GoCART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data. 1
3 0.59612787 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
Author: Samy Bengio, Jason Weston, David Grangier
Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster. 1
4 0.55801958 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
5 0.54554331 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
6 0.515728 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
7 0.51101154 220 nips-2010-Random Projection Trees Revisited
8 0.49278089 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
9 0.48715365 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.48397222 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
11 0.46852896 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
12 0.441017 66 nips-2010-Double Q-learning
13 0.4305256 203 nips-2010-Parametric Bandits: The Generalized Linear Case
14 0.42889312 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
15 0.42869988 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
16 0.40403447 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
17 0.40378097 19 nips-2010-A rational decision making framework for inhibitory control
18 0.40082359 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
19 0.39678785 233 nips-2010-Scrambled Objects for Least-Squares Regression
20 0.39606553 172 nips-2010-Multi-Stage Dantzig Selector
topicId topicWeight
[(13, 0.048), (27, 0.049), (30, 0.087), (35, 0.018), (45, 0.219), (50, 0.054), (52, 0.025), (60, 0.028), (77, 0.031), (78, 0.01), (90, 0.343)]
simIndex simValue paperId paperTitle
1 0.95025045 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
2 0.93144405 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
Author: Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen
Abstract: In many machine learning domains (such as scene understanding), several related sub-tasks (such as scene categorization, depth estimation, object detection) operate on the same raw data and provide correlated outputs. Each of these tasks is often notoriously hard, and state-of-the-art classifiers already exist for many subtasks. It is desirable to have an algorithm that can capture such correlation without requiring to make any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), that maximizes the joint likelihood of the sub-tasks, while requiring only a ‘black-box’ interface to the original classifier for each sub-task. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to provide earlier classifiers information about what error modes to focus on. We show that our method significantly improves performance in all the sub-tasks in two different domains: (i) scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling and saliency detection, and (ii) robotic grasping, where we consider grasp point detection and object classification. 1
3 0.91685772 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
4 0.91144967 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
Author: Jun Zhu, Li-jia Li, Li Fei-fei, Eric P. Xing
Abstract: Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.
same-paper 5 0.85305232 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets. 1
6 0.79726571 127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
8 0.76990193 175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers
9 0.74099821 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
10 0.7328071 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
11 0.7295146 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
12 0.72355866 282 nips-2010-Variable margin losses for classifier design
13 0.71700579 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
14 0.71521789 117 nips-2010-Identifying graph-structured activation patterns in networks
15 0.70366079 243 nips-2010-Smoothness, Low Noise and Fast Rates
16 0.70292044 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach
17 0.70216256 79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
18 0.69937575 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
19 0.69618803 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
20 0.69071323 220 nips-2010-Random Projection Trees Revisited