jmlr jmlr2011 jmlr2011-52 knowledge-graph by maker-knowledge-mining

52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership


Source: pdf

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. [sent-10, score-0.24]

2 A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. [sent-11, score-0.152]

3 For instance, in one of the central problems in modern biomedical research—gene function prediction, biological functions of genes are often organized by a hierarchical annotation system such as MIPS (the Munich Information Center for Protein Sequences, Mewes et al. [sent-17, score-0.182]

4 MIPS is structured hierarchically, with upper-level functional categories describing more general information concerning biological functions of genes, while low-level ones refer to more specific and detailed functional categories. [sent-20, score-0.165]

5 To predict unknown gene functions, a gene is classified, through some predictors, into one or more gene functional categories in the hierarchy of MIPS, forming novel hypotheses for confirmatory biological experiments (Hughes et al. [sent-22, score-0.446]

6 multiclass classification where class membership is mutually exclusive for all classes. [sent-26, score-0.223]

7 The primary objective of hierarchical classification is to leverage inter-class relationships to improve upon multiclass classification that ignores such dependencies, known as flat classification. [sent-27, score-0.183]

8 To achieve the desired objective, this paper develops a large margin approach for single-path or partial-path hierarchical classification with hierarchy defined by a tree. [sent-29, score-0.311]

9 One major challenge is how to incorporate a loosely defined hierarchical structure into classification to achieve higher generalization performance, which, otherwise, is impossible for flat classification, especially in a high-dimensional situation. [sent-31, score-0.146]

10 The second is the sequential approach, where a multiclass classifier is trained locally at each parent node of the hierarchy. [sent-38, score-0.158]

11 The third is the promising structured approach, which recognizes the importance of a hierarchical structure in classification. [sent-43, score-0.151]

12 (2006) developed sequential training based SVM with certain hierarchical loss functions. [sent-48, score-0.15]

13 Despite progress, issues remain with respect to how to fully take into account a hierarchical structure and to what role the hierarchy plays. [sent-55, score-0.2]

14 To meet the challenge, this article develops a large margin method for hierarchical classification, based on a new concept of structured functional and geometric margins defined for each node of the hierarchy, which differs from the concept of the loss-weighted margins in structured prediction. [sent-56, score-0.549]

15 In conclusion, both the numerical and theoretical results suggest that a tree hierarchical structure has been incorporated into classification for generalization. [sent-68, score-0.166]

16 Before proceeding, we introduce some notation for a tree H with k leaves and (K − k) non-leaf nodes, where a non-leaf node is an ancestor of a leaf node. [sent-79, score-0.219]
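
To make the tree notation of sentence 16 concrete, the following minimal Python sketch (illustrative only, not the authors' code; the parent-map encoding and the helper names chi, sib and anc are assumptions) stores a hierarchy H as a map from each non-root node to its parent and derives the children, sibling and non-root ancestor sets used in the sentences below. The example tree is the four-leaf, depth-2 tree discussed in sentences 23-24 and 53.

    # Sketch: a tree hierarchy H stored as a parent map; node 0 is the root
    # (not a class), and the K non-root nodes are the class labels.
    from collections import defaultdict

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}   # depth-2 tree of Figure 2(c)

    def chi(parent):
        """Children of every node, including the root 0."""
        kids = defaultdict(list)
        for node, par in parent.items():
            kids[par].append(node)
        return kids

    def sib(j, parent, kids):
        """Siblings of node j (same parent, j excluded)."""
        return [t for t in kids[parent[j]] if t != j]

    def anc(j, parent):
        """Non-root ancestors of node j."""
        out = []
        while parent[j] != 0:
            j = parent[j]
            out.append(j)
        return out

    kids = chi(parent)
    leaves = [j for j in parent if not kids[j]]          # k = 4 leaves
    print(leaves, sib(4, parent, kids), anc(4, parent))  # -> [1, 2, 3, 4] [3] [6]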

17 From the top to the bottom, we go through each node j and assign x to the child l = argmaxt∈chi( j) ft (x) having the highest value among the ft 's for t ∈ chi( j) when j ∉ L , and assign x to j otherwise. [sent-88, score-0.148]

18 This top-down rule is sequential, and yields mutually exclusive membership for sibling classes. [sent-89, score-0.222]
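
A minimal sketch of the top-down rule d H described in sentences 17-18, under the assumption that the rule descends from the root to a leaf by repeatedly picking the child with the largest decision-function value (an illustration, not the authors' implementation):

    # Sketch of the top-down rule d^H: start at the root and, while the current
    # node is not a leaf, move to the child with the largest decision value.
    from collections import defaultdict

    def top_down(f, parent, root=0):
        kids = defaultdict(list)
        for node, par in parent.items():
            kids[par].append(node)
        j = root
        while kids[j]:                                # j has children, i.e. not a leaf
            j = max(kids[j], key=lambda t: f[t])      # argmax_{t in chi(j)} f_t(x)
        return j

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}     # tree of Figure 2(c)
    f = {1: -0.2, 2: 0.1, 3: 0.4, 4: 0.9, 5: -0.3, 6: 0.3}
    print(top_down(f, parent))                        # -> 4

On the Figure 2(c) tree this yields the single path 0 → 6 → 4.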

19 Finally, a classifier is constructed through d H (·) to have small generalization error El0−1 (Y, d H (f (X))), with l0−1 (Y, d H (f (X))) = I(Y ≠ d H (f (X))) the 0-1 hierarchical loss. [sent-93, score-0.146]

20 Proposed Method In the existing literature on hierarchical classification, the margins are defined by the conventional unstructured margins for multiclass classification, for instance, the loss-weighted hierarchical SVM of Cai and Hofmann (2004), denoted as HSVMc . [sent-95, score-0.43]

21 In what follows, we propose a new framework using a given hierarchy to define margins, leading to a reduced number of pairwise comparisons for hierarchical classification. [sent-97, score-0.2]

22 1 Margins with Respect to H. We first explore a connection between classification and function comparisons, based on which the concept of generalized functional margins with respect to a hierarchy is introduced. [sent-99, score-0.186]

23 Consider leaf node 4 in the tree H described in Figure 2 (c). [sent-102, score-0.201]

24 There f4 − f3 and f6 − f5 need to be compared against 0 to classify at node 4 through the top-down rule, that is, whether min( f4 − f3 , f6 − f5 ) is less than 0 or not, which leads to our margin definition for (x, y = 4): U(f (x), y = 4) = min( f4 − f3 , f6 − f5 ). [sent-103, score-0.16]

25 This set compares any class t, for t being y or any of its ancestors, against the sibling classes defined by sib(t), permitting hierarchical classification at any location of H and generating a single path or partial path from the root to the node corresponding to class y. [sent-105, score-0.258]

26 For classification evaluation, we define the generalized functional margin with respect to H for (x, y) as umin (f (x), y) = min{uy, j : uy, j ∈ U(f (x), y)}. [sent-106, score-0.419]
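
Sentences 23-26 define the comparison set U(f(x), y) and the generalized functional margin umin as the smallest difference ft − fs over t ranging over y and its non-root ancestors and s ∈ sib(t). A hedged sketch (helper names and the tree encoding are assumptions):

    # Sketch of U(f(x), y) and umin(f(x), y): compare y and each of its
    # non-root ancestors t against the siblings sib(t).
    from collections import defaultdict

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}     # tree of Figure 2(c)

    def margin_set(f, y, parent):
        kids = defaultdict(list)
        for node, par in parent.items():
            kids[par].append(node)
        U = []
        t = y
        while t != 0:                                 # t runs over {y} and anc(y)
            for s in kids[parent[t]]:
                if s != t:                            # s in sib(t)
                    U.append(f[t] - f[s])
            t = parent[t]
        return U

    def u_min(f, y, parent):
        return min(margin_set(f, y, parent))

    f = {1: -0.2, 2: 0.1, 3: 0.4, 4: 0.9, 5: -0.3, 6: 0.3}
    print(margin_set(f, 4, parent))   # [f4 - f3, f6 - f5] = [0.5, 0.6], as in sentence 24
    print(u_min(f, 4, parent))        # 0.5

Here umin(f(x), 4) = 0.5 > 0 and the top-down rule of the previous sketch indeed returns class 4, illustrating the equivalence between classification errors and umin < 0 stated in Lemma 1 (sentence 31).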

27 In light of the result of Lemma 1, this quantity is directly related to the generalization error, which summarizes the overall error in hierarchical classification as the 0-1 loss in binary classification. [sent-107, score-0.169]

28 That is, a classification error occurs if and only if umin (f (x), y) < 0. [sent-108, score-0.274]

29 Moreover, this definition reduces to that of multiclass margin classification of Liu and Shen (2006) when no hierarchical structure is imposed. [sent-109, score-0.275]

30 Lemma 1 establishes a key connection between the generalization error and our definition of umin ( f (X),Y ). [sent-111, score-0.293]

31 Lemma 1 With I(·) denoting the indicator function and l0−1 the 0-1 loss in hierarchical classification, GE(d) = El0−1 (Y, d(X)) ≡ EI(Y ≠ d(X)) = EI(umin ( f (X),Y ) < 0). [sent-112, score-0.15]

32 This lemma says that a classification error occurs for decision function f and an observation (x, y), if and only if the functional margin umin (f (x), y) is negative. [sent-113, score-0.419]

33 For this reason, we replace I(·) by a surrogate loss v(·) to use the existing two-class surrogate losses in hierarchical classification. [sent-118, score-0.261]

34 Given functional margin u = umin (f (x), y), we say that a loss v(·) is a margin loss if it can be written as a function of u. [sent-120, score-0.557]

35 Most importantly, v(umin (f (x), y)) yields Fisher-consistency in hierarchical classification, which constitutes a basis of studying the generalization error in Section 5. [sent-122, score-0.146]

36 Note that in the two-class case a number of margin losses have been proposed. [sent-123, score-0.149]

37 Convex margin losses are the hinge loss v(u) = (1 − u)+ for SVM and the logistic loss v(u) = log(1 + e−u ) for logistic regression (Zhu and Hastie, 2005). [sent-124, score-0.195]

38 Nonconvex large margin losses include, for example, the ψ-loss v(u) = ψ(u) for ψ-learning, with ψ(u) = 1 − sign(u) if u ≥ 1 or u < 0 and ψ(u) = 1 − u otherwise, where sign(u) = I(u > 0) (Shen et al. [sent-125, score-0.149]
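
The surrogate losses listed in sentences 36-38 can be written directly as functions of u = umin(f(x), y) (sentence 34). A small sketch, with the ψ-loss following the piecewise definition given in sentence 38 and sign(u) = I(u > 0); treat this as an illustration of those statements, not the paper's code:

    import math

    def hinge(u):                     # SVM loss, sentence 37
        return max(0.0, 1.0 - u)

    def logistic(u):                  # logistic-regression loss, sentence 37
        return math.log(1.0 + math.exp(-u))

    def psi(u):                       # psi-learning loss as stated in sentence 38
        sign = 1.0 if u > 0 else 0.0  # sign(u) = I(u > 0)
        if u >= 1.0 or u < 0.0:
            return 1.0 - sign         # 0 for u >= 1, 1 for u < 0
        return 1.0 - u                # 1 - u on [0, 1)

    for u in (-0.5, 0.0, 0.5, 1.5):
        print(u, hinge(u), round(logistic(u), 3), psi(u))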

39 Note that the classifier dˆ defined by (1) reduces to that of multiclass margin classification of Liu and Shen (2006) when no hierarchical structure is specified. [sent-131, score-0.275]

40 For hierarchical classification, (1) yields different classifiers with different choices of margin loss v(·). [sent-139, score-0.242]
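
The display (1) itself is not reproduced in the extracted sentences. As a rough, hedged sketch, assume (1) takes the usual regularized form C ∑i v(umin(f(xi), yi)) + (1/2) ∑j ||wj||^2 with linear fj(x) = wj · x + bj; the code below evaluates such an objective with the hinge surrogate, for illustration only (the exact form in the paper may differ).

    # ASSUMED objective in the spirit of (1): hinge surrogate of umin plus an
    # L2 penalty on the per-node weight vectors. Illustration only.
    import numpy as np
    from collections import defaultdict

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}           # tree of Figure 2(c)

    def u_min(f, y, parent):
        kids = defaultdict(list)
        for node, par in parent.items():
            kids[par].append(node)
        u, t = float("inf"), y
        while t != 0:                                       # t runs over {y} and anc(y)
            for s in kids[parent[t]]:
                if s != t:                                  # s in sib(t)
                    u = min(u, f[t] - f[s])
            t = parent[t]
        return u

    def objective(W, b, X, Y, parent, C=1.0):
        """C * sum_i hinge(umin(f(x_i), y_i)) + 0.5 * sum_j ||w_j||^2 (assumed form)."""
        risk = 0.0
        for x, y in zip(X, Y):
            f = {j: float(W[j] @ x + b[j]) for j in parent}
            risk += max(0.0, 1.0 - u_min(f, y, parent))     # hinge surrogate of umin
        penalty = 0.5 * sum(float(W[j] @ W[j]) for j in parent)
        return C * risk + penalty

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                             # 5 toy observations, 3 predictors
    Y = [1, 2, 3, 4, 4]                                     # leaf class labels
    W = {j: rng.normal(size=3) for j in parent}
    b = {j: 0.0 for j in parent}
    print(objective(W, b, X, Y, parent))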

41 Specifically, (1) covers multiclass SVM and ψ-learning of Liu and Shen (2006), with equal cost when all the leaf nodes share the same parent—the root, which are called SVM and ψ-learning in what follows. [sent-140, score-0.188]

42 Figure 1: Plot of generalized geometric margin with respect to H in (b), defined by a tree in (a); panel (a) shows the hierarchical structure and panel (b) the geometric margin. [sent-158, score-0.245]

43 When the top-down rule is specified by H , umin (f (x), y) captures the relations through (1). [sent-161, score-0.304]

44 This aspect will be confirmed by the numerical results in Section 4, and by a comparison of the generalization errors between hierarchical SVM (HSVM) and hierarchical ψ-learning (HPSI) against their flat counterparts—SVM and ψ-learning in Section 5. [sent-163, score-0.302]

45 5 Evaluation Losses and Test Errors with Respect to Hierarchy In hierarchical classification, three types of losses have been proposed for measuring a classifier’s performance with respect to H , as a generalization of the 0-1 loss in two-class classification. [sent-191, score-0.226]

46 Two common choices of the c j 's have been suggested, leading to the subtree-based H-loss lsub and the siblings-based H-loss lsib : c j = |sub( j)|/K; j = 1, . . . [sent-203, score-0.327]
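
Of the evaluation losses in sentences 45-46, only the subtree weights cj = |sub(j)|/K are spelled out in the extracted text. The sketch below pairs those weights with the standard H-loss form of Cesa-Bianchi et al. (2006), in which a node is charged cj when its 0/1 label is predicted incorrectly while all of its ancestors are predicted correctly; treat that pairing as an assumption rather than the paper's exact definition of lsub.

    # Hedged sketch of an H-loss evaluation with subtree weights c_j = |sub(j)|/K.
    from collections import defaultdict

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}     # tree of Figure 2(c)

    def subtree_sizes(parent):
        sizes = defaultdict(int)
        for j in parent:
            t = j
            while t != 0:
                sizes[t] += 1                         # j lies in sub(t), t included
                t = parent[t]
        return sizes

    def path_labels(y, parent):
        """0/1 labels over all nodes: 1 on the path from the root to y."""
        lab = {j: 0 for j in parent}
        t = y
        while t != 0:
            lab[t] = 1
            t = parent[t]
        return lab

    def h_loss(y_true, y_pred, parent, c):
        lt, lp = path_labels(y_true, parent), path_labels(y_pred, parent)
        loss = 0.0
        for j in parent:
            anc_ok, t = True, parent[j]
            while t != 0:
                anc_ok = anc_ok and (lt[t] == lp[t])
                t = parent[t]
            if anc_ok and lt[j] != lp[j]:
                loss += c[j]
        return loss

    K = len(parent)                                   # 6 non-root nodes
    c_sub = {j: s / K for j, s in subtree_sizes(parent).items()}   # lsub-style weights
    print(h_loss(4, 3, parent, c_sub))                # sibling leaves confused: 2/6
    print(h_loss(4, 1, parent, c_sub))                # error at the top level: 1.0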

47 1, described by a four-leaf-node asymmetric tree in (a), and complete binary trees with depth p = 3 and k = 2^p = 8 leaf nodes in (b) and with depth p = 2 and k = 4 leaf nodes in (c), respectively. [sent-229, score-0.303]

48 The testing errors are computed under the l0−1 , l∆ , lsib and lsub . [sent-430, score-0.319]

49 The testing errors are computed under the l0−1 , l∆ , lsib and lsub . [sent-633, score-0.319]

50 As suggested in Table 2, HSVM and HPSI outperform the three competitors under l0−1 , l∆ , lsib and lsub in the linear case, whereas HSVM performs slightly worse than SHSVM in the Gaussian case. [sent-637, score-0.341]

51 First, the five classifiers perform similarly under l∆ , lsib and lsub . [sent-641, score-0.29]

52 This is because all the eight leaf node classes are at level 3 of the hierarchy, resulting in a similar structure under these evaluation losses. [sent-642, score-0.181]

53 This generates a complete binary tree of depth 2, where nodes 1 and 2 are children of node 5, and nodes 3 and 4 are children of node 6, see Figure 2 (c). [sent-655, score-0.325]

54 As effective means, gene function prediction is performed through known gene functions and gene expression profiles of both annotated and unannotated genes. [sent-685, score-0.315]

55 By learning the patterns of expression profiles, a gene with unknown functions can be classified into existing functional categories, as well as newly created functional categories. [sent-688, score-0.201]

56 The test errors are computed under the l0−1 , l∆ , lsib and lsub . [sent-1038, score-0.319]

57 The test errors are computed under the l0−1 , l∆ , lsib and lsub . [sent-1394, score-0.319]

58 To battle the curse of dimensionality, a structured approach needs to be used with built-in biological knowledge presented in the form of an annotation system such as MIPS, where a flat approach does not perform better than a classifier that uses a correct hierarchical structure. [sent-1403, score-0.151]

59 The problem of gene function prediction is an ideal test case for hierarchical classification, where accuracy of prediction is key. [sent-1405, score-0.222]

60 In the literature, recalibration and combination of different large margin methods, including sequential HSVM and loss scaled SVM, were used in gene function prediction, see, for example, Obozinski et al. [sent-1406, score-0.21]

61 Of particular consideration is prediction of functional categories of unknown genes within two major branches of MIPS, composed of two functional categories at the highest level: "cell cycle and DNA processing" and "transcription", and their corresponding offspring. [sent-1415, score-0.257]

62 Within these two major branches, we have n = 1103 annotated genes together with p = 300 expressions for each gene and a tree hierarchy of K = 23 functional categories, see Figure 3 for a display of the tree hierarchy. [sent-1416, score-0.384]

63 To place the problem of gene function prediction into our framework of hierarchical classification, we may create a new separate category for all genes that are annotated with a common set of categories. [sent-1419, score-0.33]

64 We now perform some simulations to gain insight into HSVM and HPSI for gene function prediction before applying them to predict new gene functions with respect to MIPS. [sent-1423, score-0.19]

65 As suggested by Table 5, besides similar results in F1-score, HSVM and HPSI outperform SVM and HSVMc under l0−1 , l∆ , lsib and lsub , in both the linear and Gaussian kernel cases. [sent-1428, score-0.165]

66 Note that l∆ and lsub penalize misclassification more at relevant nodes at lower levels in deep branches, whereas lsib only does so at upper levels. [sent-1436, score-0.328]

67 We are now ready to apply our method to the real challenge of predicting unknown gene functional categories that had not been annotated in the 2005 version of MIPS. [sent-1439, score-0.213]

68 Note that a blank node represents its parent itself; for instance, the left blank node at level 3 indicates functional category 03. [sent-1450, score-0.246]

69 For example, gene "YGR054w" is not annotated in the 2005 version of MIPS, and is predicted to belong to functional categories along the path "Protein synthesis" → "Ribosome biogenesis" → "Ribosomal proteins" by HPSI. [sent-1459, score-0.213]

70 This section develops an asymptotic theory to quantify the generalization accuracy of the proposed hierarchical large margin classifier d H (fˆ) defined by (2) for a general loss, where a bound on the error of d H (fˆ) is derived. [sent-1463, score-0.28]

71 In hierarchical classification with k leaf and (K − k) non-leaf node classes, the Bayes decision function vector f¯ is a decision function vector yielding the Bayes classifier under d H , that is, d H (f¯ (x)) = d¯ (x). [sent-1615, score-0.289]

72 Theorem 2 Under Assumptions A-C in Appendix A, for any large margin hierarchical classifier dH (fˆ) defined by (1), there exists a constant c6 > 0 such that for any x ≥ 1, P(e(fˆ, f¯ ) ≥ c1 xδ2α ) ≤ 3. [sent-1622, score-0.219]

73 By comparison, with V induced by a margin loss v, F V (t) in multiclass classification is usually larger than its counterpart in hierarchical classification. [sent-1626, score-0.298]

74 This is because V is structured in that functional margin umin (f (X),Y ) involves a smaller number of pairwise comparisons in hierarchical classification. [sent-1627, score-0.57]

75 In contrast, any two classes need to be compared in multiclass classification without such a hierarchy (Liu and Shen, 2006). [sent-1629, score-0.148]

76 Lemma 2 Let H be a tree hierarchy with K non-root nodes including k leaf nodes. [sent-1632, score-0.244]

77 2 Bayes Classifier and Fisher-consistency To compare different losses for the purpose of hierarchical classification, we introduce a new concept called “Fisher-consistency” with respect to H . [sent-1635, score-0.184]

78 Before proceeding, we define the Bayes rule in Lemma 3 for K-class classification with non-exclusive membership, where only k < K classes have mutually exclusive membership, determining the class membership of the other K − k non-exclusive classes. [sent-1636, score-0.216]

79 Lemma 3 In K-class classification with non-exclusive membership, assume that the k mutually exclusive membership classes uniquely determine the membership of the other K − k non-exclusive classes. [sent-1637, score-0.246]

80 That is, for any t ∈ E and t˜ ∉ E, either {Y = t˜} ⊇ {Y = t} or {Y = t˜} ⊆ {Y = t}, where E is the set of k mutually exclusive membership classes. [sent-1638, score-0.167]

81 Definition 1 In hierarchical classification, denote by L the set of classes corresponding to the leaf nodes in a tree. [sent-1641, score-0.278]

82 With L being a set of mutually exclusive membership classes, a loss l(·, ·) is said to be Fisher-consistent with respect to H if a global minimizer of El(Y, f (X)) over all possible f (x) is f¯ . [sent-1642, score-0.19]

83 Lemma 4 Loss l0−1 is Fisher-consistent with respect to H ; so is l∆ in the presence of a dominating leaf node class, that is, a class such that for any x ∈ S there exists a leaf node class j such that P(Y = j|X = x) > 1/2. [sent-1643, score-0.324]

84 3 Theoretical Examples Consider hierarchical classification with H defined by a complete binary tree with depth p. [sent-1647, score-0.166]

85 For any leaf node ji ; i = 1, · · · , k, when X(1) ∈ [(i − 1)/k, i/k), P(Y = ji |X) = (k − 1)/k, and P(Y = j|X) = 1/[k(k − 1)] for j ≠ ji . [sent-1652, score-0.182]

86 For any non-leaf node ji ; i = k + 1, · · · , K, P(Y = ji |X) = ∑t∈sub( ji )∩L P(Y = t|X). [sent-1653, score-0.209]
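
Sentences 85-86 specify the class-conditional model of the theoretical example; a small simulation sketch follows (an illustration only: X here stands for the first coordinate X(1), and the leaf indexing ji = i is an assumption).

    # Sketch of the class-conditional model in sentences 85-86: k leaf classes,
    # X(1) uniform on [0, 1); given X(1) in [(i-1)/k, i/k), leaf j_i has
    # probability (k-1)/k and every other leaf 1/[k(k-1)].
    import random

    def sample(k, rng=random.Random(0)):
        x1 = rng.random()
        i = min(int(x1 * k) + 1, k)            # favoured leaf j_i for this bin of X(1)
        probs = [(k - 1) / k if j == i else 1.0 / (k * (k - 1)) for j in range(1, k + 1)]
        y = rng.choices(range(1, k + 1), weights=probs)[0]
        return x1, y

    k = 4
    draws = [sample(k) for _ in range(20000)]
    hits = sum(1 for x1, y in draws if y == min(int(x1 * k) + 1, k))
    print(hits / len(draws))                   # close to (k - 1)/k = 0.75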

87 For non-leaf nodes, let it be the maximum over the leaf nodes in the subtree, that is, f¯ji (x) = max{t∈sub( ji )∩L } f¯t ; i = k + 1, · · · , K. [sent-1655, score-0.179]
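
A direct sketch of sentence 87: the Bayes decision value of a non-leaf node is the maximum of the leaf values over its subtree (tree encoding as in the earlier sketches; illustration only).

    # Extend leaf decision values to non-leaf nodes by taking the maximum over
    # the leaf nodes in each subtree, as stated in sentence 87.
    from collections import defaultdict

    parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 0, 6: 0}     # tree of Figure 2(c)

    def extend_to_non_leaves(f_leaf, parent):
        kids = defaultdict(list)
        for node, par in parent.items():
            kids[par].append(node)
        f = dict(f_leaf)

        def value(j):
            if not kids[j]:                           # leaf node
                return f_leaf[j]
            v = max(value(t) for t in kids[j])        # max over the subtree's leaves
            f[j] = v
            return v

        for j in parent:
            if kids[j]:
                value(j)
        return f

    f_leaf = {1: 0.1, 2: -0.4, 3: 0.2, 4: 0.7}
    print(extend_to_non_leaves(f_leaf, parent))       # adds f_5 = 0.1, f_6 = 0.7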

88 In this case, the hierarchy enables a reduction of the number of comparisons. Note that H is a flat tree with only one layer, that is, all the leaf nodes are direct offspring of the root node 0, which means that |chi(0)| = k. [sent-1667, score-0.331]

89 Discussion This paper proposed a novel large margin method for single-path or partial-path hierarchical classification with mutually exclusive membership at the same level of a hierarchy. [sent-1695, score-0.386]

90 By integrating the hierarchical structure into classification, the classification accuracy, or the generalization error defined by hierarchical losses, has been improved over its flat counterpart, as suggested by our theoretical and numerical analyses. [sent-1698, score-0.273]

91 On one hand, given each non-leaf node j, there is only one path from the root to the node j, but there are additional |chi( j)| − 1 paths from the root to its children. [sent-1738, score-0.193]

92 Proof of Lemma 3: Without loss of generality, assume that the membership is mutually exclusive for the first k classes. [sent-1746, score-0.19]

93 For all other leaf nodes j ≠ jˆ, there exist ja ∈ anc( j) and jˆa ∈ anc( jˆ) such that ja ∈ sib( jˆa ). [sent-1770, score-0.224]

94 Then umin (f (x), j) ≤ fˆja (x) − fˆjˆa (x) ≤ −umin (f (x), jˆ) = −u, by the fact that umin (f (x), jˆ) ≤ −( fˆja (x) − fˆjˆa (x)). [sent-1771, score-0.579]

95 Now we prove that equality in umin (f (x), j) ≤ −u can hold, through construction of an f ′ with f ′jˆ (x) − f ′j (x) = u, umin (f ′ (x), j) = −u for 1 ≤ j ≤ k with j ≠ jˆ, and f ′j (x) = 0 for all j ∉ sib ◦ anc( jˆ). [sent-1772, score-0.508]

96 By construction, umin (f ′ (x), jˆ) = u. [sent-1773, score-0.274]

97 Moreover, for the Bayes rule d¯ (x), if jˆ ≠ d¯ (x), we construct f ∗ such that umin (f ∗ (x), d¯ (x)) = u and umin (f ∗ (x), j) = −u for any leaf node j ≠ d¯ (x), similarly as above. [sent-1776, score-0.274]

98 For the case of u = 0, it can be shown that umin (fˆ(x), j) = 0, ∀ j = 1, · · · , k, and fˆj (x) = 0, ∀ j = 1, · · · , K, which reduces to the trivial case fˆ(x) = 0. [sent-1780, score-0.274]

99 Predicting gene function in a hierarchical context with an ensemble of classifiers. [sent-1888, score-0.222]

100 Weighted true path rule: a multilabel hierarchical algorithm for gene function prediction. [sent-2044, score-0.222]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hsvm', 0.451), ('hpsi', 0.443), ('umin', 0.274), ('sib', 0.234), ('chi', 0.171), ('shsvm', 0.169), ('hsvmc', 0.153), ('anc', 0.145), ('lsib', 0.145), ('lsub', 0.145), ('yes', 0.141), ('mips', 0.129), ('hierarchical', 0.127), ('hen', 0.113), ('embership', 0.105), ('utually', 0.105), ('xclusive', 0.105), ('pan', 0.103), ('shen', 0.098), ('gene', 0.095), ('leaf', 0.094), ('svm', 0.092), ('margin', 0.092), ('classi', 0.091), ('lass', 0.08), ('ierarchical', 0.08), ('arge', 0.074), ('argin', 0.074), ('hb', 0.074), ('hierarchy', 0.073), ('lassification', 0.069), ('node', 0.068), ('exclusive', 0.062), ('wang', 0.061), ('margins', 0.06), ('membership', 0.06), ('yi', 0.058), ('losses', 0.057), ('multiclass', 0.056), ('genes', 0.055), ('functional', 0.053), ('competitors', 0.051), ('bayes', 0.048), ('hughes', 0.048), ('ji', 0.047), ('mutually', 0.045), ('ev', 0.043), ('ft', 0.04), ('er', 0.04), ('tree', 0.039), ('sub', 0.039), ('nodes', 0.038), ('siblings', 0.037), ('liu', 0.037), ('categories', 0.035), ('dc', 0.035), ('parent', 0.034), ('gu', 0.033), ('rousu', 0.032), ('valentini', 0.032), ('gl', 0.032), ('ja', 0.031), ('annotated', 0.03), ('rule', 0.03), ('errors', 0.029), ('mrna', 0.027), ('uy', 0.027), ('surrogate', 0.027), ('parenthesis', 0.026), ('branches', 0.026), ('ge', 0.026), ('maxt', 0.025), ('sibling', 0.025), ('par', 0.025), ('kn', 0.024), ('ju', 0.024), ('umn', 0.024), ('structured', 0.024), ('wt', 0.024), ('synthesis', 0.023), ('loss', 0.023), ('category', 0.023), ('dekel', 0.023), ('pro', 0.022), ('geometric', 0.022), ('xi', 0.021), ('lh', 0.021), ('replications', 0.021), ('ers', 0.021), ('shahbaba', 0.021), ('tsochantaridis', 0.021), ('boser', 0.021), ('kernel', 0.02), ('classes', 0.019), ('generalization', 0.019), ('root', 0.019), ('paths', 0.019), ('develops', 0.019), ('ntest', 0.018), ('ancestor', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

2 0.065335698 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.058312118 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY

4 0.056162819 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

5 0.054510839 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

6 0.054342356 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

7 0.045638502 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

8 0.03885033 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

9 0.037961043 58 jmlr-2011-Learning from Partial Labels

10 0.03574441 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

11 0.034813002 104 jmlr-2011-X-Armed Bandits

12 0.034305997 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.033296425 66 jmlr-2011-Multiple Kernel Learning Algorithms

14 0.032732777 105 jmlr-2011-lp-Norm Multiple Kernel Learning

15 0.032351322 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

16 0.029845716 35 jmlr-2011-Forest Density Estimation

17 0.029638965 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

18 0.028825428 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

19 0.02754459 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

20 0.027244858 14 jmlr-2011-Better Algorithms for Benign Bandits


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.156), (1, -0.002), (2, 0.027), (3, -0.032), (4, 0.015), (5, -0.008), (6, -0.001), (7, 0.005), (8, 0.003), (9, 0.017), (10, -0.171), (11, 0.042), (12, -0.018), (13, -0.036), (14, -0.072), (15, 0.057), (16, -0.234), (17, -0.152), (18, -0.0), (19, -0.043), (20, -0.065), (21, 0.08), (22, 0.018), (23, 0.256), (24, -0.02), (25, 0.103), (26, -0.13), (27, 0.136), (28, 0.137), (29, -0.028), (30, 0.03), (31, 0.028), (32, -0.056), (33, -0.173), (34, -0.003), (35, -0.106), (36, -0.04), (37, -0.173), (38, 0.129), (39, -0.168), (40, 0.043), (41, -0.042), (42, -0.084), (43, 0.241), (44, -0.017), (45, 0.059), (46, -0.137), (47, -0.005), (48, 0.094), (49, -0.134)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93163931 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

2 0.58776963 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros

Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy

3 0.46329921 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

4 0.38637993 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng

Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly ©2011 Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou and Jufu Feng. WANG , S UGIYAMA , J ING , YANG , Z HOU AND F ENG sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier

5 0.37125194 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

6 0.33838075 54 jmlr-2011-Learning Latent Tree Graphical Models

7 0.30274469 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.27310216 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

9 0.25555804 58 jmlr-2011-Learning from Partial Labels

10 0.24587508 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

11 0.23442341 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

12 0.22101679 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

13 0.20960239 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

14 0.19781932 28 jmlr-2011-Double Updating Online Learning

15 0.19583385 22 jmlr-2011-Differentially Private Empirical Risk Minimization

16 0.19540516 104 jmlr-2011-X-Armed Bandits

17 0.18351929 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

18 0.18056743 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

19 0.17931831 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

20 0.15838356 12 jmlr-2011-Bayesian Co-Training


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.054), (6, 0.441), (9, 0.047), (10, 0.024), (24, 0.037), (31, 0.091), (32, 0.035), (41, 0.025), (60, 0.012), (71, 0.013), (73, 0.022), (78, 0.056), (90, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74048173 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

Author: Huixin Wang, Xiaotong Shen, Wei Pan

Abstract: In hierarchical classification, class labels are structured, that is each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the interclass relationship to enhance the performance of flat classification, which ignores such dependency. This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized margins with respect to hierarchy. For implementation, we consider support vector machines and ψ-learning. Numerical and theoretical analyses suggest that the proposed method achieves the desired objective and compares favorably against strong competitors in the literature, including its flat counterparts. Finally, an application to gene function prediction is discussed. Keywords: difference convex programming, gene function annotation, margins, multi-class classification, structured learning

2 0.70861953 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

Author: Aad van der Vaart, Harry van Zanten

Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Mat´ rn and squared exponential kernels. For these priors the risk, and hence the e information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Mat´ rn kernel, squared e exponential kernel

3 0.35057747 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

4 0.35004368 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

5 0.32392189 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

6 0.31957024 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

7 0.31759056 91 jmlr-2011-The Sample Complexity of Dictionary Learning

8 0.31543696 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

9 0.31528589 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

10 0.31452739 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

11 0.31250587 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

12 0.30629802 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors

13 0.30508411 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

14 0.3037166 105 jmlr-2011-lp-Norm Multiple Kernel Learning

15 0.29691237 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

16 0.29660562 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

17 0.28763774 27 jmlr-2011-Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets

18 0.28607765 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

19 0.28337464 59 jmlr-2011-Learning with Structured Sparsity

20 0.28194854 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms