nips nips2007 nips2007-69 knowledge-graph by maker-knowledge-mining

69 nips-2007-Discriminative Batch Mode Active Learning

Source: pdf

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classiﬁer. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classiﬁcation performance of the target classiﬁer, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 ca Abstract Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classiﬁer. [sent-3, score-0.879]

2 Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. [sent-4, score-1.003]

3 Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. [sent-5, score-1.777]

4 In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. [sent-6, score-1.629]

5 The optimization is formulated to maximize the discriminative classiﬁcation performance of the target classiﬁer, while also taking the unlabeled data into account. [sent-7, score-0.641]

6 Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. [sent-9, score-1.273]

7 In many circumstances, unlabeled instances are easy to obtain, while labeling is expensive or time consuming. [sent-11, score-0.75]

8 Randomly selecting unlabeled instances for labeling is inefﬁcient in many situations, since non-informative or redundant instances might be selected. [sent-13, score-1.132]

9 Given a large pool of unlabeled instances, active learning provides a way to iteratively select the most informative unlabeled instances—the queries—to label. [sent-17, score-1.171]

10 This is the typical setting of poolbased active learning. [sent-18, score-0.322]

11 Most active learning approaches, however, have focused on selecting only one unlabeled instance at one time, while retraining the classiﬁer on each iteration. [sent-19, score-0.95]

12 Furthermore, if a parallel labeling system is available, a single instance selection system can make wasteful use of the resource. [sent-21, score-0.28]

13 Thus, a batch mode active learning strategy that selects multiple instances each time is more appropriate under these circumstances. [sent-22, score-1.372]

14 Note that simply using a single instance selection strategy to select more than one unlabeled instance in each iteration does not work well, since it fails to take the information overlap between the multiple instances into account. [sent-23, score-1.233]

15 Principles for batch mode active learning need to be developed to address the multi-instance selection speciﬁcally. [sent-24, score-0.976]

16 In fact, a few batch mode active learning approaches have been proposed recently [2, 8, 9, 17, 19]. [sent-25, score-0.874]

17 However, most extend existing single instance selection strategies into multi-instance selection simply by using a heuristic score or greedy procedure to ensure both the instance diversity and informativeness. [sent-26, score-0.643]

18 In this paper, we propose a new discriminative batch mode active learning strategy that exploits information from an unlabeled set to attempt to learn a good classiﬁer directly. [sent-27, score-1.511]

19 We deﬁne a good classiﬁer to be one that obtains high likelihood on the labeled training instances and low uncertainty on labels of the unlabeled instances. [sent-28, score-0.96]

20 We therefore formulate the instance selection problem as an optimization problem with respect to auxiliary instance selection variables, taking a combination of discriminative classiﬁcation performance and label uncertainty as the objective function. [sent-29, score-0.847]

21 The instance selection variables we introduce can be interpreted as indicating self-supervised, optimistic guesses for the labels of the selected unlabeled instances. [sent-32, score-0.851]

22 A concern about the instance selection process, therefore, is that some information in the unlabeled data that is inconsistent with the true classiﬁcation partition might mislead instance selection. [sent-33, score-0.773]

23 Fortunately, the active learning method can immediately tell whether it has been misled, by comparing the true labels with its optimized guesses. [sent-34, score-0.443]

24 Therefore, one can then adjust the active selection strategy to avoid such over-ﬁtting in the next iteration, whenever a mismatch between the labeled and unlabeled data has been detected. [sent-35, score-1.009]

25 An empirical study on UCI datasets shows that the proposed batch mode active learning method is more effective than some current state-of-the-art batch mode active learning algorithms. [sent-36, score-1.795]

26 2 Related Work Many researchers have addressed the active learning problem in a variety of ways. [sent-37, score-0.352]

27 Most have focused on selecting a single most informative unlabeled instance to label at a time. [sent-38, score-0.641]

28 Many such approaches therefore make myopic decisions based solely on the current learned classiﬁer, and select the unlabeled instance for which there is the greatest uncertainty. [sent-39, score-0.587]

29 [10] chooses the unlabeled instance with conditional probability closest to 0. [sent-40, score-0.484]

30 [3, 18] suggest choosing the instance closest to the classiﬁcation boundary, where [18] analyzes this active learning strategy as a version space reduction process. [sent-43, score-0.538]

31 Approaches that exploit unlabeled data to provide complementary information for active learning have also been proposed. [sent-44, score-0.704]

32 [4, 20] exploit unlabeled data by using the prior density p(x) as uncertainty weights. [sent-45, score-0.375]

33 [16] selects the instance that optimizes the expected generalization error over the unlabeled data. [sent-46, score-0.576]

34 [11] uses an EM approach to integrate information from unlabeled data. [sent-47, score-0.352]

35 [13, 22] consider combining active learning with semi-supervised learning. [sent-48, score-0.352]

36 [14] presents a mathematical model that explicitly combines clustering and active learning. [sent-49, score-0.322]

37 [7] presents a discriminative approach that implicitly exploits the clustering information contained in the unlabeled data by considering optimistic labelings. [sent-50, score-0.715]

38 Since single instance selection strategies require tedious retraining with each instance labeled (and, moreover, since they cannot take advantage of parallel labeling systems), many batch mode active learning methods have recently been proposed. [sent-51, score-1.519]

39 [2, 17, 19] extend single instance selection strategies that use support vector machines. [sent-52, score-0.269]

40 [2] takes the diversity of the selected instances into account, in addition to individual informativeness. [sent-53, score-0.447]

41 [19] proposes a representative sampling approach that selects the cluster centers of the instances lying within the margin of a support vector machine. [sent-54, score-0.469]

42 [8, 9] choose multiple instances that efﬁciently reduce the Fisher information. [sent-55, score-0.352]

43 Overall, these approaches use a variety of heuristics to guide the instance selection process, where the selected batch should be informative about the classiﬁcation model while being diverse enough so that their information overlap is minimized. [sent-56, score-0.676]

44 Instead of using heuristic measures, in this paper, we formulate batch mode active learning as an optimization problem that aims to learn a good classiﬁer directly. [sent-57, score-0.987]

45 Our optimization selects the best set of unlabeled instances and their labels to produce a classiﬁer that attains maximum likelihood on labels of the labeled instances while attaining minimum uncertainty on labels of the unlabeled instances. [sent-58, score-2.007]

46 Our proposed approach provides an example of exploiting optimization techniques in batch model active learning research, much like other areas of machine learning where optimization techniques have been widely applied [1]. [sent-61, score-0.846]

47 Given a test instance x, binary logistic regression models the conditional probability of the class label y ∈ {+1, −1} by p(y|x, w) = 1 1 + exp(−yw x) where w is the model parameter. [sent-64, score-0.263]

48 , minimizing the logloss of the training instances min w log(1 + exp(−yi w xi )) + i∈L λ w w 2 (1) where L indexes the training instances, and λ w w is a regularization term introduced to avoid 2 over-ﬁtting problems. [sent-68, score-0.394]

49 4 Discriminative Batch Mode Active Learning For active learning, one typically encounters a small number of labeled instances and a large number of unlabeled instances. [sent-71, score-1.168]

50 Instance selection strategies based only on the labeled data therefore ignore potentially useful information embodied in the unlabeled instances. [sent-72, score-0.631]

51 In this section, we present a new discriminative batch mode active learning algorithm for binary classiﬁcation that exploits information in the unlabeled instances. [sent-73, score-1.457]

52 The proposed approach is discriminative in the sense that (1) it selects a batch of instances by optimizing a discriminative classiﬁcation model; and (2) it selects instances by considering the best discriminative conﬁguration of their labels leading to the best classiﬁer. [sent-74, score-1.887]

53 Unlike other batch mode active learning methods, which identify the most informative batch of instances using heuristic measures, our approach aims to identify the batch of instances that directly optimizes classiﬁcation performance. [sent-75, score-2.32]

54 1 Optimization Problem An optimal active learning strategy selects a set of instances to label that leads to learning the best classiﬁer. [sent-77, score-0.933]

55 With unlabeled data being available, semi-supervised learning methods have been proposed that train by simultaneously maximizing the likelihood of labeled instances and minimizing the uncertainty of the labels for unlabeled instances [6]. [sent-80, score-1.694]

56 The new active learning approach we propose is motivated by this semi-supervised learning principle. [sent-82, score-0.382]

57 We propose to select a batch of m unlabeled instances, S, to label in each iteration from the total unlabeled set U , with the goal of maximizing the objective (2). [sent-83, score-1.232]

58 In practice, however it is problematic to use the f (S) score directly to guide instance selection: the labels for instances S are not known when the selection is conducted. [sent-85, score-0.748]

59 Instead, we propose an optimistic strategy: use the best f (S) score that the batch of unlabeled instances S can achieve over all possible label conﬁgurations. [sent-87, score-1.258]

60 This optimistic scoring function can be written as t log P (yi |xi , wt+1 ) − α f (S) = max yS i∈Lt ∪S H(y|xj , wt+1 ) (4) j∈U t \S Thus the problem becomes how to select a set of instances S that achieves the best optimistic f (S) score deﬁned in (4). [sent-88, score-0.729]

61 Although this problem can be solved using an exhaustive search on all size m subsets, S, of the unlabeled set U , it is intractable to do so in practice since the search space is exponentially large. [sent-89, score-0.431]

62 Explicit heuristic search approaches seeking a local optima do not exist either, since it is hard to deﬁne an efﬁcient set of operators that can transfer from one position to another one within the search space while guaranteeing improvements to the optimistic score. [sent-90, score-0.307]

63 Instead, in this paper we propose to approach the problem by formulating optimistic batch mode active learning as an explicit mathematical optimization. [sent-91, score-1.006]

64 Given the labeled set L t and unlabeled set U t after iteration t, the task in iteration t + 1 is to select a size m subset S from U t that achieves the best score deﬁned in (4). [sent-92, score-0.708]

65 To do so, we ﬁrst introduce a set of {0, 1}-valued instance selection variables µ. [sent-93, score-0.234]

66 In particular, µ is a |U t | × 2 sized indicator matrix, where each row vector µj corresponds to the two possible labels {+1, −1} of the jth instance in U t . [sent-94, score-0.247]

67 Then the optimistic instance selection for iteration t + 1 can be formulated as the following optimization problem max µ s. [sent-95, score-0.523]

68 Note that, the selection variables µ not only choose instances from U t , but also select labels for the selected instances. [sent-98, score-0.634]

69 Solving this optimization yields the optimal µ for instance selection in iteration t + 1. [sent-99, score-0.365]

70 Such a local optimization approach iteratively updates µ to improve the objective (15), and stops when a local maximum is reached. [sent-110, score-0.223]

71 3 Adjustment Strategy In the discriminative optimization problem formulated in Section 4. [sent-129, score-0.289]

72 1, the µ variables are used to optimistically select both instances and their labels, with the goal of achieving the best classiﬁcation model according to the objective (5). [sent-130, score-0.472]

73 However, when the labeled set is small and the discriminative partition (clustering) information contained in the large unlabeled set is inconsistent with the true classiﬁcation, the labels optimistically guessed for the selected instances through µ might not match the underlying true labels. [sent-131, score-1.31]

74 Furthermore, the unlabeled data might continue to mislead the next instance selection iteration. [sent-133, score-0.619]

75 Fortunately, we can immediately identify when the process has been misled once the true labels for the selected instances have been obtained. [sent-134, score-0.534]

76 If the true labels are different from the labels guessed by the optimization, we need to make an adjustment for the next instance selection iteration. [sent-135, score-0.551]

77 Note that the being-misled problem is caused by the unlabeled data, which affects the target classiﬁcation t+1 model through the term β j∈U t vj µj . [sent-137, score-0.413]

78 Speciﬁcally, at the end of each iteration t, we obtain the true labels y S for the ˆ selected instances S, and compare them with our guessed labels yS indicated by µ∗ . [sent-139, score-0.72]

79 If they are consistent, we will set β = 1, which means we trust the partition information from the unlabeled data as same as the label information in the labeled data for building the classiﬁcation model. [sent-140, score-0.547]

80 If ˆ yS = yS , apparently we should reduce the β value, that is, reducing the inﬂuence of the unlabeled data for the next selection iteration t + 1. [sent-141, score-0.516]

81 Note that, if we reduce β to zero, our optimization problem will be exactly equivalent to picking the most uncertain instance (when m = 1). [sent-145, score-0.239]

82 The UCI datasets we used include (we show the name, followed by the number of instances and the number of attributes): Australian(690;14), Cleve(303;13), Corral(128;6), Crx(690;15), Flare(1066;10), Glass2(163;9), Heart(270;13), Hepatitis(155;20) and Vote(435;15). [sent-147, score-0.399]

83 We consider a hard case of active learning, where only a few labeled instances are given at the start. [sent-148, score-0.816]

84 In each experiment, we start with four randomly selected labeled instances, two in each class. [sent-149, score-0.184]

85 We then randomly select 2/3 of the remaining instances as the unlabeled set, using the remaining instances for testing. [sent-150, score-1.103]

86 All the algorithms start with the same initial labeled set, unlabeled set and testing set. [sent-151, score-0.494]

87 For a ﬁxed batch size m, each algorithm repeatedly select m instances to label each time. [sent-152, score-0.778]

88 Moreover, Discriminative also apparently outperforms the other two batch mode algorithms, svmD and Fisher, on ﬁve datasets—Australian, Cleve, Flare, Heart and Hepatitis, and reaches a tie on two datasets—Crx and Vote. [sent-156, score-0.55]

89 The myopic most uncertain selection method, MostUncertain, shows an overall inferior performance to Discriminative on Australian, Cleve, Crx, Heart and Hepatitis, and achieves a tie on Flare and Vote. [sent-157, score-0.224]

90 These empirical results suggest that selecting unlabeled instances through optimizing the classiﬁcation model directly would obtain more relevant and informative instances, comparing with using heuristic scores to guide the selection. [sent-216, score-0.852]

91 Although the original optimization problem formulated is NP-hard, a relaxed local optimization method that leads to a local optimal solution still works effectively. [sent-217, score-0.281]

92 6 Conclusion In this paper, we proposed a discriminative batch mode active learning approach that exploits information in unlabeled data and selects a batch of instances by optimizing the target classiﬁcation model. [sent-218, score-2.227]

93 Although the proposed technique could be overly optimistic about the information presented by the unlabeled set, and consequently be misled, this problem can be identiﬁed immediately after obtaining the true labels. [sent-219, score-0.484]

94 Experimental results on UCI datasets show that this approach is generally more effective comparing with other batch mode active learning methods, a random sampling method, and a myopic non-batch mode active learning method. [sent-221, score-1.55]

95 Incorporating diversity in active learning with support vector machines. [sent-230, score-0.405]

96 Batch mode active learning and its application to medical image classiﬁcation. [sent-273, score-0.548]

97 Employing EM in pool-based active learning for text classiﬁcation. [sent-283, score-0.376]

98 Toward optimal active learning through sampling estimation of error reduction. [sent-312, score-0.377]

99 Support vector machine active learning with applications to text classiﬁcation. [sent-322, score-0.376]

100 Combining active learning and semi-supervised learning using gaussian ﬁelds and harmonic functions. [sent-346, score-0.382]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unlabeled', 0.352), ('instances', 0.352), ('batch', 0.326), ('active', 0.322), ('mostuncertain', 0.208), ('svmd', 0.208), ('mode', 0.196), ('discriminative', 0.194), ('wt', 0.186), ('labeled', 0.142), ('lt', 0.134), ('optimistic', 0.132), ('instance', 0.132), ('fisher', 0.128), ('classi', 0.118), ('selection', 0.102), ('ys', 0.098), ('hk', 0.093), ('selects', 0.092), ('labels', 0.091), ('guessed', 0.082), ('xj', 0.077), ('cleve', 0.076), ('crx', 0.076), ('flare', 0.076), ('optimization', 0.069), ('hepatitis', 0.066), ('iteration', 0.062), ('vj', 0.061), ('uci', 0.059), ('er', 0.058), ('corral', 0.057), ('retraining', 0.056), ('myopic', 0.056), ('logistic', 0.055), ('strategy', 0.054), ('label', 0.053), ('diversity', 0.053), ('adjustment', 0.053), ('misled', 0.049), ('vec', 0.048), ('datasets', 0.047), ('select', 0.047), ('informative', 0.046), ('labeling', 0.046), ('local', 0.046), ('heuristic', 0.044), ('score', 0.043), ('australian', 0.043), ('heart', 0.043), ('indexes', 0.042), ('selected', 0.042), ('proceedings', 0.041), ('international', 0.041), ('objective', 0.04), ('yk', 0.039), ('uncertain', 0.038), ('cation', 0.038), ('adjust', 0.037), ('exploits', 0.037), ('conference', 0.035), ('strategies', 0.035), ('hessian', 0.035), ('fk', 0.034), ('mislead', 0.033), ('guo', 0.033), ('yuhong', 0.033), ('hoi', 0.033), ('optimistically', 0.033), ('seeking', 0.033), ('yi', 0.03), ('jin', 0.03), ('dale', 0.03), ('learning', 0.03), ('selecting', 0.03), ('taylor', 0.029), ('focused', 0.028), ('zhu', 0.028), ('guide', 0.028), ('tie', 0.028), ('exhaustive', 0.027), ('accuracy', 0.027), ('trained', 0.027), ('committee', 0.026), ('search', 0.026), ('formulated', 0.026), ('integer', 0.026), ('sampling', 0.025), ('relaxed', 0.025), ('continuous', 0.024), ('sized', 0.024), ('text', 0.024), ('regression', 0.023), ('log', 0.023), ('uncertainty', 0.023), ('inconsistent', 0.022), ('xs', 0.022), ('vote', 0.022), ('selective', 0.022), ('iteratively', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

2 0.24162343 136 nips-2007-Multiple-Instance Active Learning

Author: Burr Settles, Mark Craven, Soumya Ray

Abstract: We present a framework for active learning in the multiple-instance (MI) setting. In an MI learning problem, instances are naturally organized into bags and it is the bags, instead of individual instances, that are labeled for training. MI learners assume that every instance in a bag labeled negative is actually negative, whereas at least one instance in a bag labeled positive is actually positive. We consider the particular case in which an MI learner is allowed to selectively query unlabeled instances from positive bags. This approach is well motivated in domains in which it is inexpensive to acquire bag labels and possible, but expensive, to acquire instance labels. We describe a method for learning from labels at mixed levels of granularity, and introduce two active query selection strategies motivated by the MI setting. Our experiments show that learning from instance labels can signiﬁcantly improve performance of a basic MI learning algorithm in two multiple-instance domains: content-based image retrieval and text classiﬁcation. 1

3 0.1689188 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received signiﬁcant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identiﬁable mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identiﬁable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identiﬁable mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

4 0.15883887 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our ﬁrst result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1

5 0.14983004 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1

6 0.14009002 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

7 0.13699286 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

8 0.12185959 110 nips-2007-Learning Bounds for Domain Adaptation

9 0.1155507 175 nips-2007-Semi-Supervised Multitask Learning

10 0.11513191 32 nips-2007-Bayesian Co-Training

11 0.1135979 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

12 0.10898753 15 nips-2007-A general agnostic active learning algorithm

13 0.090023458 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

14 0.089633018 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets

15 0.0883549 199 nips-2007-The Price of Bandit Information for Online Optimization

16 0.087979145 40 nips-2007-Bundle Methods for Machine Learning

17 0.08640781 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables

18 0.084085897 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

19 0.077982329 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

20 0.076961443 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.23), (1, 0.028), (2, -0.161), (3, 0.147), (4, 0.05), (5, 0.139), (6, 0.086), (7, -0.121), (8, 0.162), (9, 0.043), (10, 0.049), (11, -0.031), (12, -0.18), (13, -0.139), (14, 0.033), (15, 0.172), (16, -0.079), (17, 0.018), (18, 0.186), (19, -0.01), (20, 0.064), (21, -0.142), (22, 0.208), (23, -0.087), (24, 0.067), (25, 0.07), (26, 0.002), (27, -0.038), (28, 0.007), (29, -0.084), (30, -0.111), (31, -0.09), (32, 0.042), (33, -0.077), (34, -0.008), (35, -0.067), (36, 0.137), (37, 0.094), (38, -0.047), (39, 0.097), (40, 0.078), (41, -0.058), (42, 0.043), (43, 0.029), (44, 0.062), (45, -0.059), (46, 0.03), (47, -0.015), (48, -0.016), (49, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97607005 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

2 0.80351865 136 nips-2007-Multiple-Instance Active Learning

Author: Burr Settles, Mark Craven, Soumya Ray

3 0.74065745 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

4 0.62228513 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

5 0.59496206 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

6 0.56582588 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

7 0.54720259 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

8 0.53130752 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

9 0.45232826 175 nips-2007-Semi-Supervised Multitask Learning

10 0.44949481 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

11 0.43058607 32 nips-2007-Bayesian Co-Training

12 0.41746986 19 nips-2007-Active Preference Learning with Discrete Choice Data

13 0.41509891 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

14 0.39836463 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

15 0.39057651 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

16 0.38831985 110 nips-2007-Learning Bounds for Domain Adaptation

17 0.37555054 40 nips-2007-Bundle Methods for Machine Learning

18 0.36199951 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

19 0.34562317 187 nips-2007-Structured Learning with Approximate Inference

20 0.34548587 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.045), (11, 0.245), (13, 0.037), (16, 0.025), (18, 0.013), (21, 0.141), (31, 0.027), (34, 0.029), (35, 0.017), (47, 0.1), (83, 0.148), (85, 0.023), (90, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85839128 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classiﬁcation model is built upon this graphical representation and evaluated with several empirical studies. 1 Contribution Prediction problems, such as classiﬁcation, can be easier when class labels share a sort of relational dependency that is not accounted by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classiﬁers that uses information concerning the relational structure of the problem. Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., “student page” or “not a student page”). Features such as the word distribution within the body of each page can be used to predict each webpage’s class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2 , this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]: ∗ Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk F1 F2 F 3 C1 C2 C3 Here Fi are the features of page Wi , and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 −C2 ) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a ﬁxed input feature set {F1 , F2 , F3 }, are as follows: C1 is marginally dependent on C3 , but conditionally independent given C2 . Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2 . This can happen if the existence of a relation (Ci , Cj ) corresponds to the existence of hidden common causes generating this pair of random variables. Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M ), Sony (S) and Philips (P ). The task is a regression task where we want to predict the stock market price of each company given its proﬁtability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well S and P (due to the TV set market). M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock εm εs εp εm εs εp (a) (b) (c) Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common causes relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics. It is expected that several market factors that affect stock prices are unaccounted by the predictor variable Past Year Proﬁt. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft’s and Sony’s stock. Another hidden common cause would be a high price for Sony’s consoles. Assume here that these factors have no effect on Philips’ stock value. A depiction of several hidden common causes that correpond to the relations Competitor(M, S) and Competitor(S, P ) is given in Figure 1(a) as unlabeled gray vertices. Consider a linear regression model for this setup. We assume that for each object Oi ∈ {M, S, P }, the stock price Oi .Stock, centered at the mean, is given by Oi .Stock = β × Oi .P rof it + where each i i (1) is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of m and s , σms . That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp . Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12]. A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that m and p are marginally dependent, which does not correspond to our assumptions1 . Moreover, the model in Figure 1(b) states that once we observe Sony’s stock price, Philip’s stocks (and proﬁt) should have a non-zero association with Microsoft’s proﬁt: this follows from a extension of d-separation to DMGs [9]. This is expected from the assumptions (Philip’s stocks should tell us something about Microsoft’s once we know Sony’s, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice. This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random ﬁelds [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure. X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 (a) (b) X1 X3 X5 Y1 Y3 Y5 (c) Figure 2: (a) A conditional random ﬁeld (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a). To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi , yj are statistically associated when some chain of relationships exists between yi and yj , then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediated points between two classes labels yi and yj is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints. In this paper, we introduce a non-parametric classiﬁcation model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describes the model and contrasts it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classiﬁcation are described in Section 4. 2 Model Chu et al. [2] describe an approach for Gaussian process classiﬁcation using relational information, which we review and compare to our proposed model. Previous approach: relational Gaussian processes through indicators − For each point x in the input space X , there is a corresponding function value fx . Given observed input points x1 , x2 , . . . , xn , a Gaussian process prior over f = [f1 , f2 , . . . , fn ]T has the shape P(f ) = 1 (2π)n/2 |Σ|1/2 1 exp − f T Σ−1 f 2 (2) 1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random ﬁelds) corresponds to a zero entry in the inverse covariance matrix, where in the DMG it corresponds to a zero in the covariance matrix [9]. X1 X2 f1 X3 X2 X3 X1 X2 X3 f3 f2 ξ 12 X1 f1 f2 f3 f1 f2 f3 Y2 Y3 ξ 23 Y1 Y2 Y3 Y1 Y2 Y3 Y1 ε1 ε2 ε3 ε1 ε2 ε3 ε1 (a) ζ1 (b) ζ2 ε2 ζ3 ε3 (c) Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of other two datapoints. Dependencies between f1 , f2 and f3 are given by a Gaussian process prior and not represented in the picture. Indicators ξij are known and set to 1; (b) The extra associations that arise by conditioning on ξ = 1 can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11]. where the ijth entry of Σ is given by a Mercer kernel function K(xi , xj ) [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. Let ξij be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points (xi , xj ): they are an encoding of the given relational structure. A model for P (ξij = 1|fi , fj ) is deﬁned. This evidence is incorporated into the Gaussian process by conditioning on all indicators ξij that are positive. Essentially, the idea boils down to using P(f |ξ = 1) as the prior for a Gaussian process classiﬁer. Figure 3(a) illustrates a problem with datapoints {(x1 , y1 ), (x2 , y2 ), (x3 , y3 )}. Gray vertices represent unobserved variables. Each yi is a binary random variable, with conditional probability given by P(yi = 1|fi ) = Φ(fi /σ) (3) where Φ(·) is the standard normal cumulative function and σ is a hyperparameter. This can be interpreted as the cumulative distribution of fi + i , where fi is given and i is a normal random variable with zero mean and variance σ 2 . In the example of Figure 3(a), one has two relations: (x1 , x2 ), (x2 , x3 ). This information is incorporated by conditioning on the evidence (ξ12 = 1, ξ23 = 1). Observed points (x1 , y1 ), (x2 , y2 ) form the training set. The prediction task is to estimate y3 . Notice that ξ12 is not used to predict y3 : the Markov blanket for f3 includes (f1 , f2 , ξ23 , y3 , 3 ) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2 . Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as √ (4) P(yi = 1|gi , vi ) = Φ(gi / vi ) where gi = fi + ζi . As before, Equation (4) can be interpreted as the cumulative distribution of 2 gi + i , with i as a normal random variable with zero mean and variance vi = σ 2 − σζi , the last term being the variance of ζi . That is, we break the original error term as i = ζi + i , where i and j are independent for all i = j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ . The key aspect in our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference of our model 3(c) to a standard Markov network models 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both and ζ terms), following the motivation presented in Section 1. 2 In the ﬁgure, we are not representing explicitly that f1 , f2 and f3 are not independent (the prior covariance matrix Σ is complete). The ﬁgure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize. As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x): P(g) = 1 (2π)n/2 |R|1/2 1 exp − g R−1 g 2 (5) where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi , xj ). 3 Parametrizing a mixed graph model for relational classiﬁcation For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulﬁll two desiderata: (i). it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive deﬁnite; (ii). it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as ﬂexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contain unobserved and observed label nodes. 3.1 Method I The ﬁrst method is an automated method to relax some of the independence constraints, while guaranteeing positive-deﬁniteness, and a parameterization that depends on a single scalar ρ. This allows for more efﬁcient inference and is done as follows: 1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ 2. Set U0 to be the number of cliques in Gζ where yi and yj appear together; ij 3. Set U0 to be the number of cliques containing yi , plus a small constant ∆; ii 4. Set U to be the corresponding correlation matrix obtained by intepreting U0 as a covariance matrix and rescaling it; Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive deﬁnite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance ∆. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ . In practice, this method cannot be used as is since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we ﬁrst triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3. 3.2 Method II The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with lots of marginal independence constraints being violated. In this case, this will often result in a poor prediction performance. A cheap alternative approach is not generating cliques, and instead 3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel. 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 100 100 10 20 30 40 50 60 (a) 70 80 90 100 10 20 30 40 50 60 70 80 90 (b) 100 10 20 30 40 50 60 70 80 90 100 (c) Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations. getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi , yj ) connected by an edge, it will be the case that U0 = 1, while U0 can be as large as n. This means that the resulting correlation in Uij can be ij ii close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model. 3.3 Algorithm Recall that our model is a Gaussian process classiﬁer with error terms i of variance σ such that i = ζi + i . Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix: Σ = (1 − ρ)I + ρU (6) where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σ . The usefulness of separating as and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classiﬁer. Instead of approximating the posterior of f , we approximate the posterior density P(g|D), D = {(x1 , y1 ), . . . , (xn , yn )} being ˜ the given training data. The approximate posterior has the form Q(g) ∝ P(g) i ti (gi ) where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as deﬁned in the previous section. Since the covariance matrix Σ is diagonal, the true likelihood of y given g factorizes n over each datapoint: P(y|g) = i=1 P(yi |gi ), and standard EP algorithms for Gaussian process classiﬁcation can be used [8] (with the variance given by Σ instead of Σ , and kernel matrix R instead of K). The ﬁnal algorithm deﬁnes a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modiﬁcation of code written for EP-based Gaussian process classiﬁers4 . 4 Results We now compare three different methods in relational classiﬁcation tasks. We will compare a standard Gaussian process classiﬁer (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set ∆ = 10−4 and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . , 1.0} maximizing the approximate EP marginal likelihood5. 4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/∼silva. For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m src/rcm/rcm.html 5 Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. “n” denotes the number of papers in one class. “Citations” denotes the citation count within the two paper classes. Group n Citations GPC GPC with Citations XGP 5vs1 346/488 2466 0.905 ± 0.031 0.891 ± 0.022 0.945 ± 0.053 5vs2 346/619 3417 0.900 ± 0.032 0.905 ± 0.044 0.933 ± 0.059 5vs3 346/1376 3905 0.863 ± 0.040 0.893 ± 0.017 0.883 ± 0.013 5vs4 346/646 2858 0.916 ± 0.030 0.887 ± 0.018 0.951 ± 0.042 5vs6 346/281 1968 0.887 ± 0.054 0.843 ± 0.076 0.955 ± 0.041 5vs7 346/529 2948 0.869 ± 0.045 0.867 ± 0.041 0.926 ± 0.076 4.1 Political books We consider ﬁrst a simple classiﬁcation problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words in the Amazon.com front page for that particular book. The choice of books, labels, and relationships are given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/ mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by a same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship. We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as trainining data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was of 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a std. deviation of 0.02). 4.2 Cora The Cora collection [6] contains over 50,000 computer science research papers including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of “term frequency” components scaled by “inverse document frequency”, and then normalized to unity length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as optimal as RGP (whose AUC curves are very close to 1), since the population of links tends to be better separated between different classes (but this is also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of similar qualitative nature and not displayed here. 4.3 WebKB The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and “other”. The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student and a course page are associated because that person’s interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W . However, observing the type of page W might create Table 2: Comparison of the three algorithms on the task “other” vs. “not-other” in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. Mean and standard deviation of AUC results are reported. University Numbers Other or Not Other All Link GPC RGP XGP Cornell 617 865 13177 0.708 ± 0.021 0.884 ± 0.025 0.917 ± 0.022 Texas 571 827 16090 0.799 ± 0.021 0.906 ± 0.026 0.949 ± 0.015 Washington 939 1205 15388 0.782 ± 0.023 0.877 ± 0.024 0.923 ± 0.016 Wisconsin 942 1263 21594 0.839 ± 0.014 0.899 ± 0.015 0.941 ± 0.018 the association. We compare how the three algorithms perform when trying to predict if a webpage is of class “other” or not (the other classiﬁcations are easier, with smaller differences. Results are omitted for space purposes). The proportion of “other” to non-“other” is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a speciﬁc university, and the remaining is used for test. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives signiﬁcant improvements over RGP in all four universities. 5 Conclusion We introduced a new family of relational classiﬁers by extending a classical statistical model [12] to non-parametric relational classiﬁcation. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached by Markov network structures. Several interesting problems can be treated in the future. It is clear that there are many different ways by which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, approximations through matrix factorizations and graph cuts are only a few among many alternatives that can be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships. Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database. References [1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004. [2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006. [3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI’98, pages 509–516, 1998. [4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random ﬁelds: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001. [5] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. [7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001. [8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002. [10] P. Sen and L. Getoor. Link-based classiﬁcation. Report CS-TR-4858, University of Maryland, 2007. [11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006. [12] A. Zellner. An efﬁcient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.

same-paper 2 0.79908442 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

3 0.70158964 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efﬁcient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

4 0.67469335 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classiﬁcation that enables the recovery of image features that explain human subjects’ performance on different visual classiﬁcation tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classiﬁer operating on raw image pixels. Instead, it represents classiﬁcation as the combination of multiple feature detectors. This approach extracts more information about human visual classiﬁcation than previous methods and provides a foundation for further exploration. 1

5 0.66875422 19 nips-2007-Active Preference Learning with Discrete Choice Data

Author: Brochu Eric, Nando D. Freitas, Abhijeet Ghosh

Abstract: We propose an active learning algorithm that learns a continuous valuation model from discrete preferences. The algorithm automatically decides what items are best presented to an individual in order to ﬁnd the item that they value highly in as few trials as possible, and exploits quirks of human psychology to minimize time and cognitive burden. To do this, our algorithm maximizes the expected improvement at each query without accurately modelling the entire valuation surface, which would be needlessly expensive. The problem is particularly difﬁcult because the space of choices is inﬁnite. We demonstrate the effectiveness of the new algorithm compared to related active learning methods. We also embed the algorithm within a decision making tool for assisting digital artists in rendering materials. The tool ﬁnds the best parameters while minimizing the number of queries. 1

6 0.66686261 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

7 0.66665232 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

8 0.66488028 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.66427684 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

10 0.66326237 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

11 0.6621545 86 nips-2007-Exponential Family Predictive Representations of State

12 0.66205239 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

13 0.66199994 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

14 0.65745193 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

15 0.65734881 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

16 0.6571666 158 nips-2007-Probabilistic Matrix Factorization

17 0.65642202 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

18 0.65421426 175 nips-2007-Semi-Supervised Multitask Learning

19 0.65389508 156 nips-2007-Predictive Matrix-Variate t Models

20 0.65344125 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models