nips nips2003 nips2003-172 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. [sent-4, score-0.354]
2 The tree (or a distribution over trees) may be inferred using the unlabeled data. [sent-5, score-0.388]
3 A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. [sent-6, score-0.54]
4 1 Introduction People have remarkable abilities to learn concepts from very limited data, often just one or a few labeled examples per class. [sent-8, score-0.151]
5 Algorithms for semi-supervised learning try to match this ability by extracting strong inductive biases from a much larger sample of unlabeled data. [sent-9, score-0.15]
6 A general strategy is to assume some latent structure T that underlies both the label vector Y to be learned and the observed features X of the full data (unlabeled and labeled; see Figure 1). [sent-10, score-0.216]
7 The structure T is assumed to be a low-dimensional manifold, whose topology is approximated by a sparse neighborhood graph defined over the data points (based on Euclidean distance between feature vectors in the X matrix). [sent-14, score-0.083]
8 The label vector Y is assumed to be smooth with respect to T ; [1] implements this smoothness assumption by defining a Gaussian field over all complete labelings Y of the neighborhood graph that expects neighbors to have the same label. [sent-15, score-0.193]
9 The graphical model in Figure 1 suggests a more general strategy for exploiting other kinds of latent structure T , not just low-dimensional manifolds. [sent-19, score-0.113]
10 In particular, trees arise prominently in both natural and human-generated domains. [sent-20, score-0.171]
11 Here we describe an approach to semi-supervised learning based on mapping the data onto the leaf nodes of a rooted (and typically ultrametric) tree T . [sent-23, score-0.268]
12 The label vector Y is generated from a stochastic mutation process operating over branches of T . [sent-24, score-0.485]
13 Tree T can be inferred from unlabeled data using either bottom-up methods (agglomerative clustering) or more complex probabilistic methods. [sent-25, score-0.164]
14 The mutation process defines a prior over all possible labelings of the unlabeled data, favoring those that maximize a tree-specific notion of “smoothness”. [sent-26, score-0.352]
15 Figure 1: A general approach to semi-supervised learning. X is an observed object-feature matrix, Y the hidden vector of true labels for these objects, and Yobs a sparse vector of observed labels. [sent-27, score-0.103]
16 The unlabeled data in X assist in inferring Y by allowing us to infer some latent structure T that is assumed to generate both X and Y. [sent-28, score-0.269]
18 Each of the 32 objects in this dataset has two continuous features (x and y coordinates); X is a 32-by-2 matrix. [sent-31, score-0.106]
19 The shading in part (b) represents a probabilistic inference about Y : the darker an object’s node in the tree, the more likely that its label is positive. [sent-33, score-0.097]
20 TBB classifies unlabeled data by integrating over all possible labelings of the domain that are consistent with the observed labels Yobs , and is thus an instance of optimal Bayesian concept learning [6]. [sent-34, score-0.315]
21 Typically, optimal Bayes is of theoretical interest only [7], because the sum over labelings is in general intractable and it is difficult to specify sufficiently powerful and noise-resistant priors for real-world domains. [sent-35, score-0.101]
22 Here, a prior defined in terms of a tree-based mutation process makes the approach efficient and empirically successful. [sent-36, score-0.386]
23 The next section describes TBB, as well as a simple heuristic method, Tree Nearest Neighbor (TNN), which we show approximates TBB in the limit of high mutation rate. [sent-37, score-0.332]
24 (a) We observe a set of unlabeled objects (small points) with some latent hierarchical structure (gray ellipses) along with two positive and two negative examples of a new concept (black and white circles). [sent-40, score-0.377]
25 (b) Inferring the latent tree, and treating the concept as generated from a mutation process on the tree, we can probabilistically classify the unlabeled objects. [sent-41, score-0.603]
26 We choose a label yi for unlabeled object xi by computing p(yi = 1|Yobs , X) and thresholding at 0.5. [sent-43, score-0.301]
27 Ideally we would sum over all possible latent trees T : p(yi = 1|Yobs , X) = Σ_T p(yi = 1|Yobs , T ) p(T |Yobs , X)   (1) First we consider p(yi = 1|Yobs , T ) and the classification of object xi given a particular tree T . [sent-46, score-0.479]
28 Section 2.2 discusses p(T |Yobs , X), the inference of tree T , and approaches to approximating the sum over trees in Equation 1. [sent-48, score-0.372]
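To make Equation 1 concrete, here is a minimal sketch (in Python with numpy) of averaging per-tree predictions over a small candidate set of trees with unnormalized posterior weights. The function and argument names are hypothetical illustrations, not the paper's code, and the candidate set stands in for the intractable sum over all trees.

```python
import numpy as np

def classify_object(x_index, candidate_trees, tree_log_weights, per_tree_classifier, y_obs):
    """Approximate p(y_i = 1 | Yobs, X) by a weighted average over candidate trees.

    candidate_trees     : list of tree objects (e.g. from clustering or MCMC)
    tree_log_weights    : unnormalized log p(T | Yobs, X), one per candidate tree
    per_tree_classifier : callable (tree, y_obs, x_index) -> p(y_i = 1 | Yobs, T)
    y_obs               : the sparse set of observed labels
    """
    log_w = np.asarray(tree_log_weights, dtype=float)
    w = np.exp(log_w - log_w.max())            # subtract max for numerical stability
    w /= w.sum()                               # normalize p(T | Yobs, X) over the candidates
    per_tree = np.array([per_tree_classifier(T, y_obs, x_index) for T in candidate_trees])
    p_positive = float(w @ per_tree)           # Equation 1, restricted to the candidate set
    return p_positive, (1 if p_positive > 0.5 else -1)   # threshold at 0.5
```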
29 Typical simplifying assumptions are that the labeled objects were chosen randomly from all objects in the domain, and that all observations are free of noise. [sent-50, score-0.205]
30 The likelihood can also be modified to handle additional complexities, such as learning from labeled examples of just a single class, or learning in the presence of label noise. [sent-60, score-0.188]
31 Our experiments explore both random and retrospective sampling, but the algorithm we implement is strictly correct only for noise-free learning under random sampling. [sent-62, score-0.116]
32 1 Bayesian classification with a mutation model In many tree-structured domains it is natural to think of features arising from a history of stochastic events or mutations. [sent-64, score-0.386]
33 We develop a mutation model that induces a sensible “smoothness” prior p(Y |T ) and enables efficient computation of Equation 5 via belief propagation on a Bayes net. [sent-65, score-0.383]
34 The model combines aspects of several previous proposals for probabilistic learning with trees [8, 9, 10]. [sent-66, score-0.169]
35 Suppose that L is defined at every point along every branch, not just at the leaf nodes where the data points lie. [sent-68, score-0.061]
36 Imagine L spreading out over the tree from root to leaves — it starts out at the root with some value and could switch values at any point along any branch. [sent-69, score-0.333]
37 Whenever a branch splits, both lower branches inherit the value of L at the point immediately before the split. [sent-70, score-0.138]
38 Transitions between states of L are modeled using a continuous-time Markov chain with infinitesimal matrix Q = [−λ, λ; λ, −λ]. The free parameter, λ, will be called the mutation rate. [sent-71, score-0.351]
39 Note that the mutation process is symmetric: mutations from -1 to 1 are just as likely as mutations in the other direction. [sent-72, score-0.509]
40 Other models of mutation could be substituted if desired. [sent-73, score-0.332]
41 Transition probabilities along a branch of length t are given by: e^(Qt) = [(1 + e^(−2λt))/2, (1 − e^(−2λt))/2; (1 − e^(−2λt))/2, (1 + e^(−2λt))/2]   (6) That is, the probability that a parent and child separated by a branch of length t have different values of L is (1 − e^(−2λt))/2. [sent-75, score-0.21]
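As an illustration of this mutation model, the following sketch (Python, using numpy and scipy) builds the rate matrix Q and the closed-form branch transition matrix of Equation 6. It is the standard symmetric two-state chain rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

def transition_matrix(lam, t):
    """Closed-form e^(Qt) for the symmetric two-state chain with mutation rate lam."""
    stay = 0.5 * (1.0 + np.exp(-2.0 * lam * t))   # P(child label equals parent label)
    flip = 0.5 * (1.0 - np.exp(-2.0 * lam * t))   # P(label flips somewhere along the branch)
    return np.array([[stay, flip],
                     [flip, stay]])

# Sanity check against the matrix exponential of Q = [[-lam, lam], [lam, -lam]].
lam, t = 0.7, 1.3
Q = np.array([[-lam, lam], [lam, -lam]])
assert np.allclose(expm(Q * t), transition_matrix(lam, t))
```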
42 This mutation process induces a prior p(Y |T ) equal to the probability of generating the label vector Y over leaves of T under the mutation process. [sent-76, score-0.84]
43 The resulting distribution favors labelings that are “smooth” with respect to T . [sent-77, score-0.124]
44 Regardless of λ, it is always more likely for L to stay the same than to switch its value along a branch. [sent-78, score-0.063]
45 Thus labelings that do not require very many mutations are preferred, and the two hypotheses that assign the same label to all leaf nodes receive the most weight. [sent-79, score-0.283]
46 Because mutations are more likely to occur along longer branches, the prior also favors hypotheses in which label changes occur between clusters (where branches tend to be longer) rather than within clusters (where branches tend to be shorter). [sent-80, score-0.395]
47 The independence assumptions implicit in the mutation model allow the right side of Equation 5 to be computed efficiently. [sent-81, score-0.348]
48 Evaluating Equation 5 now reduces to a standard problem of inference in a Bayes net – we clamp the nodes in Yobs to their observed values, and compute the posterior marginal probability at node yi . [sent-84, score-0.118]
49 The tree structure makes this computation efficient and allows specially tuned inference algorithms, as in [9]. [sent-85, score-0.253]
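A hedged sketch of that computation: an upward (Felsenstein-style pruning) pass over a rooted tree, with the observed leaves clamped and the query leaf clamped to each value in turn. The dict-based tree representation, the uniform prior over the root's label, and the function names are assumptions made for illustration; `transition_matrix` is reused from the sketch above, and labels are encoded as 0 (for −1) and 1 (for +1).

```python
import numpy as np

def upward_likelihoods(tree, node, clamped, lam):
    """Return a length-2 vector: p(clamped leaves below `node` | node state = s) for s in {0, 1}.

    tree    : dict mapping node -> list of (child, branch_length); leaves map to [].
    clamped : dict mapping some leaves to 0 or 1.
    """
    children = tree[node]
    if not children:                               # leaf node
        if node in clamped:
            msg = np.zeros(2)
            msg[clamped[node]] = 1.0
            return msg
        return np.ones(2)                          # unobserved leaf: its label is summed out
    msg = np.ones(2)
    for child, t in children:
        P = transition_matrix(lam, t)              # branch transition probabilities (Equation 6)
        msg *= P @ upward_likelihoods(tree, child, clamped, lam)
    return msg

def p_label_given_tree(tree, root, y_obs, query_leaf, lam):
    """p(y_query = +1 | Yobs, T): clamp the query leaf to each state and normalize."""
    joint = np.zeros(2)
    for s in (0, 1):
        clamped = dict(y_obs)
        clamped[query_leaf] = s
        joint[s] = 0.5 * upward_likelihoods(tree, root, clamped, lam).sum()  # uniform root prior
    return joint[1] / joint.sum()
```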
50 2 A distribution over trees We now consider p(T |Yobs , X), the second component of Equation 1. [sent-87, score-0.148]
51 Using Bayes’ theorem: p(T |Yobs , X) ∝ p(Yobs , X|T )p(T ) (7) We assume that each discrete feature in X is generated independently over T according to the mutation model just outlined. [sent-88, score-0.332]
52 Continuous features can be handled by an analogous stochastic diffusion process in a continuous space (see for example [11]). [sent-89, score-0.086]
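Because each discrete feature column is assumed to be an independent draw from the same mutation process, the likelihood in Equation 7 factors over feature columns. A short sketch of the feature term, reusing `upward_likelihoods` from above (binary features encoded as 0/1; the uniform root prior is again an assumption):

```python
import numpy as np

def log_likelihood_of_features(tree, root, X_binary, leaf_order, lam):
    """log p(X | T): each 0/1 feature column is an independent draw from the mutation process."""
    total = 0.0
    for j in range(X_binary.shape[1]):
        clamped = {leaf: int(X_binary[i, j]) for i, leaf in enumerate(leaf_order)}
        total += np.log(0.5 * upward_likelihoods(tree, root, clamped, lam).sum())
    return total
```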
53 To finish the theoretical development of the model it remains only to specify p(T ), a prior over tree structures. [sent-91, score-0.258]
54 3 Approximating the sum over trees The sum over trees in Equation 1 is intractable for datasets of even moderate size. [sent-95, score-0.357]
55 Markov Chain Monte Carlo (MCMC) techniques have been used to approximate similar sums over trees in Bayesian phylogenetics [12]. [sent-97, score-0.181]
56 Section 3.1 follows a simpler approach: we assume that most of the probability p(T |Yobs , X) is concentrated on or near the most probable tree T ∗ and approximate Equation 1 as p(yi = 1|Yobs , T ∗ ). [sent-101, score-0.224]
57 The tree T ∗ can be estimated using more or less sophisticated means. [sent-102, score-0.224]
58 In Section 3.2 we compare this greedy method to the best tree found in our MCMC runs. [sent-106, score-0.224]
59 Note that we ignore Yobs when building T ∗ , because we run many trials on each dataset and do not want to compute a new tree for each value of Yobs . [sent-107, score-0.265]
60 Since our data include many features and few labeled objects, the contribution of Yobs is likely to be negligible. [sent-108, score-0.143]
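As one concrete way of obtaining a single tree T∗ from the unlabeled feature matrix, the sketch below uses average-link agglomerative clustering from scipy, in the spirit of the bottom-up tree construction mentioned earlier (the "agglom" variant later in the paper uses average-link clustering). The branch-length convention and output format are assumptions of this sketch, not the authors' exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def build_tree_star(X):
    """Estimate a single tree T* from unlabeled data by average-link clustering.

    Returns (tree, root, leaf_order) in the dict format used by the earlier sketches:
    node -> list of (child, branch_length); the leaves are the row indices of X.
    """
    Z = linkage(X, method="average", metric="euclidean")
    root_node = to_tree(Z)
    tree = {}

    def visit(node):
        if node.is_leaf():
            tree[node.id] = []
            return
        for child in (node.left, node.right):
            visit(child)
        # Branch length = difference in merge heights; the result is ultrametric by construction.
        tree[node.id] = [(c.id, max(node.dist - c.dist, 1e-6)) for c in (node.left, node.right)]

    visit(root_node)
    return tree, root_node.id, list(range(X.shape[0]))
```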
61 4 Tree Nearest Neighbor (TNN) A Bayesian formulation based on the mutation process provides a principled approach to learning with trees, but there are simpler algorithms that instantiate similar intuitions. [sent-110, score-0.352]
62 For instance, we could build a one-nearest-neighbor classifier using the metric of distance in the tree T (with ties resolved randomly). [sent-111, score-0.224]
63 Theorem 1 For each ultrametric tree T , there is a λ0 such that TNN and TBB produce identical classifications for all examples with a unique nearest neighbor when λ > λ0 . [sent-114, score-0.396]
64 The value chosen for yi will depend on all the labels in Yobs , but the influence of any single label decreases with distance in the tree from yi . [sent-121, score-0.473]
65 Once λ becomes sufficiently high it can be shown that yi is always determined uniquely by the closest labeled example in the tree. [sent-122, score-0.161]
66 Given this equivalence between the algorithms, TNN is the method of choice when a high mutation rate is indicated. [sent-123, score-0.332]
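A sketch of Tree Nearest Neighbor under the same dict-based tree representation: each unlabeled leaf takes the label of the labeled leaf closest to it in path distance through the tree, with ties broken at random. The helper names and the brute-force distance computation are illustrative only.

```python
import random

def leaf_path_distances(tree):
    """Pairwise path lengths between leaves of a tree in node -> [(child, length)] format."""
    parent = {child: (node, t) for node, children in tree.items() for child, t in children}
    leaves = [n for n, children in tree.items() if not children]

    def branches_to_root(leaf):
        path, node = {}, leaf
        while node in parent:
            p, t = parent[node]
            path[node] = t                      # length of the branch directly above `node`
            node = p
        return path

    paths = {leaf: branches_to_root(leaf) for leaf in leaves}
    dist = {}
    for a in leaves:
        for b in leaves:
            shared = set(paths[a]) & set(paths[b])     # branches above the common ancestor
            dist[a, b] = (sum(paths[a].values()) + sum(paths[b].values())
                          - 2 * sum(paths[a][n] for n in shared))
    return dist

def tnn_classify(tree, y_obs):
    """Label every unlabeled leaf by its nearest labeled leaf in tree path distance."""
    dist = leaf_path_distances(tree)
    labeled = list(y_obs)
    predictions = {}
    for leaf in {a for a, _ in dist} - set(labeled):
        best = min(dist[leaf, l] for l in labeled)
        nearest = [l for l in labeled if dist[leaf, l] == best]
        predictions[leaf] = y_obs[random.choice(nearest)]   # ties resolved randomly
    return predictions
```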
67 1 Experiments Trees versus Manifolds We compared TBB and TNN with the Laplacian method of Belkin and Niyogi [4], an approach that effectively assumes a latent manifold structure T . [sent-129, score-0.164]
68 We also ran generic one-nearest neighbor (NN) as a baseline. [sent-130, score-0.063]
69 The best performing method on a given dataset should be the algorithm that assumes the right latent structure for that domain. [sent-131, score-0.137]
70 The taxonomic datasets were expected to have a tree-like structure. [sent-133, score-0.14]
71 Since these taxonomic sets do not include class labels, we chose features at random to stand in for the class label. [sent-138, score-0.161]
72 The molecular biology sets were taken from the UCI repository. [sent-140, score-0.08]
73 The objects in both sets are strings of DNA, and tree structures might also be appropriate here since these strings arose through evolution. [sent-141, score-0.369]
74 The manifold sets arose from human motor behaviors, and were therefore expected to have a low-dimensional manifold structure. [sent-142, score-0.154]
75 Our experiments focused on learning from very small labeled sets. [sent-144, score-0.087]
76 The number of labeled examples was always set to a small multiple (m = 1, 2, 3, 5, or 10) of the total number of classes. [sent-145, score-0.116]
77 The algorithms were compared under random and retrospective sampling, and training sets were always sampled with replacement. [sent-146, score-0.135]
78 TBB outperforms the other algorithms across the four taxonomic sets (only Beetles and Crustaceans shown), but the differences between TBB and Nearest Neighbor are rather small. [sent-150, score-0.098]
79 As expected, this pattern is reversed on the Digits set, but it is encouraging that the tree-based methods can still improve on Nearest Neighbor even for datasets that are not normally associated with trees. [sent-152, score-0.061]
80 More dramatic differences between the algorithms appear under retrospective sampling (Figure 3b). [sent-154, score-0.163]
81 There is a clear advantage here for TBB on the taxonomic sets. [sent-155, score-0.079]
82 TBB fares better than the other algorithms when the class proportions in the training set do not match the proportions in the population, and it turns out that many of the features in the taxonomic datasets are unbalanced. [sent-156, score-0.225]
83 Since the other datasets have classes of approximately equal size, the results for retrospective sampling are similar to those for random sampling. [sent-157, score-0.224]
84 2 MCMC over trees Figure 3 shows that TBB can perform well on real-world datasets using only a single tree. [sent-160, score-0.209]
85 Each dataset was based on a “true” tree T0 , with objects at the leaves of T0 . [sent-168, score-0.332]
86 Each object was represented by a vector of 20 binary features generated by a mutation process over T0 , with high λ. [sent-169, score-0.406]
87 For each dataset, we created 20 test concepts from the same mutation process. [sent-171, score-0.367]
88 The algorithms saw m labeled examples of each test concept and had to infer the labels of the remaining objects. [sent-172, score-0.204]
89 This experiment was repeated for 10 random trees T0 . [sent-173, score-0.148]
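A sketch of how such data could be simulated: draw a label at the root, propagate it down the tree using the branch flip probabilities of Equation 6, and repeat independently for each binary feature (and likewise for each test concept). The commented usage, the 0/1 label encoding, and the choice of random generator are assumptions; generation of the random tree T0 itself is omitted here.

```python
import numpy as np

def sample_leaf_labels(tree, root, lam, rng):
    """Simulate the two-state mutation process from the root down; return {leaf: 0 or 1}."""
    states, labels, stack = {root: int(rng.integers(2))}, {}, [root]
    while stack:
        node = stack.pop()
        children = tree[node]
        if not children:
            labels[node] = states[node]
            continue
        for child, t in children:
            flip = 0.5 * (1.0 - np.exp(-2.0 * lam * t))   # Equation 6 flip probability
            states[child] = states[node] ^ int(rng.random() < flip)
            stack.append(child)
    return labels

# Hypothetical usage with a tree `tree0` rooted at `root0` in the node -> [(child, length)] format:
# rng = np.random.default_rng(0)
# feature_draws = [sample_leaf_labels(tree0, root0, lam=2.0, rng=rng) for _ in range(20)]
# X = np.array([[draw[leaf] for draw in feature_draws] for leaf in leaf_order])  # 20 binary features
```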
90 Our MCMC approach was inspired by an algorithm for reconstruction of phylogenetic trees [12], which uses Metropolis-Hastings over tree topologies with two kinds of proposals: local (nearest neighbor interchange) and global (subtree pruning and regrafting). [sent-177, score-0.561]
91 Figure 4: Error rates on sparse artificial data as a function of the number of labels observed. [sent-178, score-0.063]
92 Unlike the previous section, none of the trees considered (including the true tree T0 ) was ultrametric. [sent-179, score-0.372]
93 Instead, each branch in each tree was assigned a fixed length. [sent-180, score-0.301]
94 This meant that any two trees with the same hierarchical structure were identical, and we did not have to store trees with the same topology but different branch lengths. [sent-181, score-0.433]
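For illustration, here is one way the local (nearest neighbor interchange) proposal could look on a rooted tree in the dict format used above. This is only a sketch of the proposal step under stated assumptions (a binary rooted tree, with branch lengths traveling with the swapped subtrees); a full sampler would add the Metropolis-Hastings accept/reject decision based on the tree posterior, and this is not the authors' implementation.

```python
import random

def nni_proposal(tree, root, rng=random):
    """Propose a nearest-neighbor interchange: swap one child of an internal node with its sibling."""
    new_tree = {node: list(children) for node, children in tree.items()}
    parent = {child: node for node, children in new_tree.items() for child, _ in children}
    internal = [n for n, children in new_tree.items() if children and n != root]
    c = rng.choice(internal)                    # internal node whose neighborhood is rearranged
    p = parent[c]
    sib_idx = next(i for i, (n, _) in enumerate(new_tree[p]) if n != c)
    child_idx = rng.randrange(len(new_tree[c]))
    # Swap the (subtree, branch length) entries; each subtree keeps the branch above it.
    new_tree[p][sib_idx], new_tree[c][child_idx] = new_tree[c][child_idx], new_tree[p][sib_idx]
    return new_tree
```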
95 Four versions of TBB are shown: “ideal” uses the true tree T0 , “MCMC” uses model averaging over a distribution of trees, “modal” uses the single most likely tree in the distribution, and “agglom” uses a tree built by average-link clustering. [sent-183, score-0.781]
96 The ideal learner beats all others: the true tree is impossible to identify from such sparse data, so the other variants must rely on imperfect tree estimates. [sent-184, score-0.3]
97 Using MCMC over trees brings TBB substantially closer to the ideal than simpler alternatives that ignore the tree structure (NN) or consider only a single tree (modal, agglom). [sent-185, score-0.671]
98 4 Conclusion We have shown how to make optimal Bayesian concept learning tractable in a semi-supervised setting by assuming a latent tree structure that can be inferred from the unlabeled data and defining a prior for concepts based on a mutation process over the tree. [sent-186, score-0.957]
99 Inferring the nature of the latent structure T – rather than assuming a manifold structure or a tree structure – is a particularly interesting problem. [sent-188, score-0.446]
100 Learning from labeled and unlabeled data using graph mincuts. [sent-206, score-0.219]
wordName wordTfidf (topN-words)
[('yobs', 0.615), ('tbb', 0.449), ('mutation', 0.332), ('tree', 0.224), ('tnn', 0.15), ('trees', 0.148), ('unlabeled', 0.132), ('retrospective', 0.116), ('labelings', 0.101), ('labeled', 0.087), ('latent', 0.084), ('taxonomic', 0.079), ('mcmc', 0.078), ('branch', 0.077), ('yi', 0.074), ('label', 0.072), ('mutations', 0.066), ('neighbor', 0.063), ('branches', 0.061), ('datasets', 0.061), ('beetles', 0.058), ('manifold', 0.051), ('objects', 0.051), ('agglom', 0.05), ('crustaceans', 0.05), ('splice', 0.05), ('vowels', 0.05), ('bayesian', 0.05), ('bayes', 0.049), ('sampling', 0.047), ('nearest', 0.047), ('gene', 0.047), ('nn', 0.044), ('modal', 0.043), ('digits', 0.042), ('laplacian', 0.042), ('biology', 0.035), ('concepts', 0.035), ('concept', 0.035), ('diffusion', 0.035), ('prior', 0.034), ('leaves', 0.033), ('arose', 0.033), ('phylogenetics', 0.033), ('promoter', 0.033), ('ultrametric', 0.033), ('inferred', 0.032), ('topology', 0.031), ('features', 0.031), ('ideal', 0.029), ('labels', 0.029), ('agglomerative', 0.029), ('kemp', 0.029), ('attened', 0.029), ('structure', 0.029), ('examples', 0.029), ('net', 0.026), ('leaf', 0.026), ('molecular', 0.026), ('likely', 0.025), ('disease', 0.024), ('beats', 0.024), ('complexities', 0.024), ('saw', 0.024), ('dataset', 0.024), ('inferring', 0.024), ('favors', 0.023), ('domains', 0.023), ('strongly', 0.023), ('sparse', 0.023), ('object', 0.023), ('classi', 0.022), ('equation', 0.022), ('uses', 0.021), ('proposals', 0.021), ('switch', 0.021), ('strings', 0.021), ('process', 0.02), ('smoothness', 0.02), ('parent', 0.02), ('belkin', 0.02), ('dirichlet', 0.019), ('manifolds', 0.019), ('proportions', 0.019), ('chain', 0.019), ('sets', 0.019), ('child', 0.019), ('root', 0.019), ('consistent', 0.018), ('clusters', 0.018), ('nodes', 0.018), ('inductive', 0.018), ('uci', 0.018), ('along', 0.017), ('ignore', 0.017), ('induces', 0.017), ('suf', 0.016), ('volume', 0.016), ('class', 0.016), ('assumptions', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
2 0.1341922 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
3 0.093682118 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
4 0.08901035 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
5 0.088718504 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
6 0.087768383 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
7 0.071904644 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
8 0.064560674 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
9 0.063855976 169 nips-2003-Sample Propagation
10 0.061376099 124 nips-2003-Max-Margin Markov Networks
11 0.060666207 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
12 0.058894377 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
13 0.054666657 73 nips-2003-Feature Selection in Clustering Problems
14 0.052456658 126 nips-2003-Measure Based Regularization
15 0.050522637 46 nips-2003-Clustering with the Connectivity Kernel
16 0.05024628 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
17 0.047188804 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
18 0.046215128 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
19 0.045689695 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.044630878 111 nips-2003-Learning the k in k-means
topicId topicWeight
[(0, -0.157), (1, -0.071), (2, -0.054), (3, 0.044), (4, -0.032), (5, -0.039), (6, -0.057), (7, 0.02), (8, 0.015), (9, 0.029), (10, 0.058), (11, 0.045), (12, 0.018), (13, -0.04), (14, 0.088), (15, 0.011), (16, -0.074), (17, 0.085), (18, 0.12), (19, 0.003), (20, -0.057), (21, 0.069), (22, -0.047), (23, 0.084), (24, 0.11), (25, 0.003), (26, 0.1), (27, 0.064), (28, 0.158), (29, -0.082), (30, -0.087), (31, -0.011), (32, 0.095), (33, 0.148), (34, 0.019), (35, -0.028), (36, -0.027), (37, 0.071), (38, -0.168), (39, -0.093), (40, 0.002), (41, 0.06), (42, 0.048), (43, -0.055), (44, 0.097), (45, -0.075), (46, -0.082), (47, 0.026), (48, 0.019), (49, -0.228)]
simIndex simValue paperId paperTitle
same-paper 1 0.93317699 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
2 0.5849539 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 106 dimensions and 105 records, and show non-trivial speedups while giving exact answers. 1
3 0.56906873 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
4 0.52513081 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
Author: David Kauchak, Sanjoy Dasgupta
Abstract: We describe a procedure which finds a hierarchical clustering by hillclimbing. The cost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. We show these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms. 1
5 0.51508474 113 nips-2003-Learning with Local and Global Consistency
Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf
Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1
6 0.44218156 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process
7 0.422198 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach
8 0.38228017 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
9 0.37517259 189 nips-2003-Tree-structured Approximations by Expectation Propagation
10 0.37258732 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
11 0.37219822 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
12 0.35670209 130 nips-2003-Model Uncertainty in Classical Conditioning
13 0.32713896 126 nips-2003-Measure Based Regularization
14 0.3181693 3 nips-2003-AUC Optimization vs. Error Rate Minimization
15 0.30252248 120 nips-2003-Locality Preserving Projections
16 0.2817803 169 nips-2003-Sample Propagation
17 0.28067824 124 nips-2003-Max-Margin Markov Networks
18 0.27804831 51 nips-2003-Design of Experiments via Information Theory
19 0.27802804 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
20 0.2561422 111 nips-2003-Learning the k in k-means
topicId topicWeight
[(0, 0.063), (11, 0.045), (29, 0.027), (30, 0.017), (33, 0.277), (35, 0.048), (53, 0.08), (69, 0.027), (71, 0.069), (76, 0.04), (78, 0.015), (85, 0.098), (91, 0.08), (99, 0.013)]
simIndex simValue paperId paperTitle
1 0.83159244 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
same-paper 2 0.78156835 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
3 0.76358581 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
4 0.68287528 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. 
In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 2 Definition and properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1X the indicator function of a set X. Lemma 1 ([8]) Let c be a fixed classifier. Let x1 , . . . , xm be the output of c on the positive examples and y1 , . . . , yn its output on the negative examples. Then, the AUC, A, associated to c is given by: m n i=1 j=1 1xi >yj (1) A= mn that is the value of the Wilcoxon-Mann-Whitney statistic [8]. Proof. The proof is based on the observation that the AUC value is exactly the probability P (X > Y ) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8]. Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC. 2 An attempt in that direction was made by [15], but, unfortunately, the authors’ analysis and the result are both wrong. Threshold θ k − x Positive examples x Negative examples n − x Negative examples m − (k − x) Positive examples Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. 3 The Expected Value of the AUC In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. 
Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate. Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 3 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts: A1 + A2 + A3 A= , with (2) mn A1 = the sum over all pairs (xi , yj ) with xi and yj in distinct regions; A2 = the sum over all pairs (xi , yj ) with xi and yj in the region above the threshold; A3 = the sum over all pairs (xi , yj ) with xi and yj in the region below the threshold. The first term, A1 , is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A1 is given by: A1 = (m − (k − x))(n − x) (3) To compute A2 , we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α1 < . . . < αx denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A2 , x A2 = (N − αi ) − (x − i) (4) i=1 where the first term N − αi represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α1 < . . . < αk−x denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A3 is given by: x A3 = (N − αj ) − (x − j) (5) j=1 with N = n − x + (k − x) and x = k − x. Combining the expressions of A1 , A2 , and A3 leads to: A= A1 + A2 + A3 (k − 2x)2 + k ( =1+ − mn 2mn x i=1 αi + mn x j=1 αj ) (6) Lemma 2 For a fixed x, the average value of the AUC A is given by: < A >x = 1 − x n + k−x m 2 (7) x Proof. The proof is based on the computation of the average values of i=1 αi and x j=1 αj for a given x. We start by computing the average value < αi >x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α1 . . . αi−1 and αi+1 . . . αx , when the value of αi is fixed at say αi = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions before αi and N − (x − i) above. There are l − 1 possible positions for α1 . . . αi−1 and N − l possible positions for αi+1 . . . αx . Since the total number of ways of choosing the x positions for α1 . . . 
αx out of N is N , the average value < αi >x is: x N −(x−i) l=i < αi >x = l l−1 i−1 N −l x−i (8) N x Thus, x < αi >x = x i=1 i=1 Using the classical identity: x < αi >x = N −(x−i) l−1 l i−1 l=i N x u p1 +p2 =p p1 N l=1 l N −1 x−1 N x i=1 N −l x−i v p2 = = N l=1 = u+v p N (N + 1) 2 x l−1 i=1 i−1 N x l N −l x−i (9) , we can write: N −1 x−1 N x = x(N + 1) 2 (10) Similarly, we have: x < αj >x = j=1 x Replacing < i=1 αi >x and < Eq. (10) and Eq. (11) leads to: x j=1 x (N + 1) 2 (11) αj >x in Eq. (6) by the expressions given by (k − 2x)2 + k − x(N + 1) − x (N + 1) =1− 2mn which ends the proof of the lemma. < A >x = 1 + x n + k−x m 2 (12) Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes. Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classifications with k errors is given by: < A >= 1 − k (n − m)2 (m + n + 1) − m+n 4mn k−1 m+n x=0 x k m+n+1 x=0 x k − m+n (13) Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are N possible ways of x choosing the positions of the x misclassified negative examples, and similarly N possible x ways of choosing the positions of the x = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by: < A >= k N x=0 x N x (1 − k N x=0 x N x k−x x n+ m 2 ) (14) r=0.05 r=0.01 r=0.1 r=0.25 0.0 0.1 0.2 r=0.5 0.3 Error rate 0.4 0.5 .00 .05 .10 .15 .20 .25 0.5 0.6 0.7 0.8 0.9 1.0 Mean value of the AUC Relative standard deviation r=0.01 r=0.05 r=0.1 0.0 0.1 r=0.25 0.2 0.3 Error rate r=0.5 0.4 0.5 Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a fixed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m. This expression can be simplified into Eq. (13)3 using the following novel identities: k X N x x=0 k X N x x x=0 ! N x ! ! N x ! = = ! k X n+m+1 x x=0 (15) ! k X (k − x)(m − n) + k n + m + 1 2 x x=0 (16) that we obtained by using Zeilberger’s algorithm4 and numerous combinatorial ’tricks’. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n = m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k = 0. 4 The Variance of the AUC 2 Let D = mn + (k−2x) +k , a = i=1 αi , a = j=1 αj , and α = a + a . Then, by 2 Eq. (6), mnA = D − α. 
Thus, the variance of the AUC, σ 2 (A), is given by: (mn)2 σ 2 (A) x x = < (D − α)2 − (< D > − < α >)2 > = < D2 > − < D >2 + < α2 > − < α >2 −2(< αD > − < α >< D >) (17) As before, to compute the average of a term X over all classifications, we can first determine its average < X >x for a fixed x, and then use the function F defined by: F (Y ) = k N N x=0 x x k N N x=0 x x Y (18) and < X >= F (< X >x ). A crucial step in computing the exact value of the variance of x the AUC is to determine the value of the terms of the type < a2 >x =< ( i=1 αi )2 >x . 3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors’ analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false. 4 We thank Neil Sloane for having pointed us to Zeilberger’s algorithm and Maple package. x Lemma 3 For a fixed x, the average of ( i=1 αi )2 is given by: x(N + 1) < a2 > x = (3N x + 2x + N ) 12 (19) Proof. By definition of a, < a2 >x = b + 2c with: x x α2 >x i b =< c =< αi αj >x (20) 1≤i
5 0.55951571 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
6 0.54728311 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
7 0.54585892 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
8 0.54314476 113 nips-2003-Learning with Local and Global Consistency
9 0.54238701 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
10 0.54210031 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
11 0.5416882 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
12 0.54133725 78 nips-2003-Gaussian Processes in Reinforcement Learning
13 0.54063636 124 nips-2003-Max-Margin Markov Networks
14 0.53784966 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
15 0.53703594 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
16 0.53656757 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.53299546 126 nips-2003-Measure Based Regularization
18 0.53281122 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
19 0.53256315 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
20 0.53251576 41 nips-2003-Boosting versus Covering