nips nips2007 nips2007-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can still use a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning method is more effective than current state-of-the-art batch mode active learning algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. [sent-3, score-0.879]
2 Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. [sent-4, score-1.003]
3 Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. [sent-5, score-1.777]
4 In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. [sent-6, score-1.629]
5 The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. [sent-7, score-0.641]
6 Our empirical studies on UCI datasets show that the proposed active learning method is more effective than current state-of-the-art batch mode active learning algorithms. [sent-9, score-1.273]
7 In many circumstances, unlabeled instances are easy to obtain, while labeling is expensive or time consuming. [sent-11, score-0.75]
8 Randomly selecting unlabeled instances for labeling is inefficient in many situations, since non-informative or redundant instances might be selected. [sent-13, score-1.132]
9 Given a large pool of unlabeled instances, active learning provides a way to iteratively select the most informative unlabeled instances—the queries—to label. [sent-17, score-1.171]
10 This is the typical setting of pool-based active learning. [sent-18, score-0.322]
11 Most active learning approaches, however, have focused on selecting only one unlabeled instance at a time, while retraining the classifier in each iteration. [sent-19, score-0.95]
12 Furthermore, if a parallel labeling system is available, a single instance selection system can make wasteful use of the resource. [sent-21, score-0.28]
13 Thus, a batch mode active learning strategy that selects multiple instances each time is more appropriate under these circumstances. [sent-22, score-1.372]
14 Note that simply using a single instance selection strategy to select more than one unlabeled instance in each iteration does not work well, since it fails to take the information overlap between the multiple instances into account. [sent-23, score-1.233]
15 Principles for batch mode active learning need to be developed to address the multi-instance selection specifically. [sent-24, score-0.976]
16 In fact, a few batch mode active learning approaches have been proposed recently [2, 8, 9, 17, 19]. [sent-25, score-0.874]
17 However, most extend existing single instance selection strategies into multi-instance selection simply by using a heuristic score or greedy procedure to ensure both the instance diversity and informativeness. [sent-26, score-0.643]
18 In this paper, we propose a new discriminative batch mode active learning strategy that exploits information from an unlabeled set to attempt to learn a good classifier directly. [sent-27, score-1.511]
19 We define a good classifier to be one that obtains high likelihood on the labeled training instances and low uncertainty on labels of the unlabeled instances. [sent-28, score-0.96]
20 We therefore formulate the instance selection problem as an optimization problem with respect to auxiliary instance selection variables, taking a combination of discriminative classification performance and label uncertainty as the objective function. [sent-29, score-0.847]
21 The instance selection variables we introduce can be interpreted as indicating self-supervised, optimistic guesses for the labels of the selected unlabeled instances. [sent-32, score-0.851]
22 A concern about the instance selection process, therefore, is that some information in the unlabeled data that is inconsistent with the true classification partition might mislead instance selection. [sent-33, score-0.773]
23 Fortunately, the active learning method can immediately tell whether it has been misled, by comparing the true labels with its optimized guesses. [sent-34, score-0.443]
24 Therefore, one can then adjust the active selection strategy to avoid such over-fitting in the next iteration, whenever a mismatch between the labeled and unlabeled data has been detected. [sent-35, score-1.009]
25 An empirical study on UCI datasets shows that the proposed batch mode active learning method is more effective than some current state-of-the-art batch mode active learning algorithms. [sent-36, score-1.795]
26 2 Related Work Many researchers have addressed the active learning problem in a variety of ways. [sent-37, score-0.352]
27 Most have focused on selecting a single most informative unlabeled instance to label at a time. [sent-38, score-0.641]
28 Many such approaches therefore make myopic decisions based solely on the current learned classifier, and select the unlabeled instance for which there is the greatest uncertainty. [sent-39, score-0.587]
29 [10] chooses the unlabeled instance with conditional probability closest to 0.5. [sent-40, score-0.484]
30 [3, 18] suggest choosing the instance closest to the classification boundary, where [18] analyzes this active learning strategy as a version space reduction process. [sent-43, score-0.538]
31 Approaches that exploit unlabeled data to provide complementary information for active learning have also been proposed. [sent-44, score-0.704]
32 [4, 20] exploit unlabeled data by using the prior density p(x) as uncertainty weights. [sent-45, score-0.375]
33 [16] selects the instance that optimizes the expected generalization error over the unlabeled data. [sent-46, score-0.576]
34 [11] uses an EM approach to integrate information from unlabeled data. [sent-47, score-0.352]
35 [13, 22] consider combining active learning with semi-supervised learning. [sent-48, score-0.352]
36 [14] presents a mathematical model that explicitly combines clustering and active learning. [sent-49, score-0.322]
37 [7] presents a discriminative approach that implicitly exploits the clustering information contained in the unlabeled data by considering optimistic labelings. [sent-50, score-0.715]
38 Since single instance selection strategies require tedious retraining with each instance labeled (and, moreover, since they cannot take advantage of parallel labeling systems), many batch mode active learning methods have recently been proposed. [sent-51, score-1.519]
39 [2, 17, 19] extend single instance selection strategies that use support vector machines. [sent-52, score-0.269]
40 [2] takes the diversity of the selected instances into account, in addition to individual informativeness. [sent-53, score-0.447]
41 [19] proposes a representative sampling approach that selects the cluster centers of the instances lying within the margin of a support vector machine. [sent-54, score-0.469]
42 [8, 9] choose multiple instances that efficiently reduce the Fisher information. [sent-55, score-0.352]
43 Overall, these approaches use a variety of heuristics to guide the instance selection process, where the selected batch should be informative about the classification model while being diverse enough so that their information overlap is minimized. [sent-56, score-0.676]
44 Instead of using heuristic measures, in this paper, we formulate batch mode active learning as an optimization problem that aims to learn a good classifier directly. [sent-57, score-0.987]
45 Our optimization selects the best set of unlabeled instances and their labels to produce a classifier that attains maximum likelihood on labels of the labeled instances while attaining minimum uncertainty on labels of the unlabeled instances. [sent-58, score-2.007]
46 Our proposed approach provides an example of exploiting optimization techniques in batch mode active learning research, much like other areas of machine learning where optimization techniques have been widely applied [1]. [sent-61, score-0.846]
47 Given a test instance $\mathbf{x}$, binary logistic regression models the conditional probability of the class label $y \in \{+1, -1\}$ by $p(y \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^\top \mathbf{x})}$, where $\mathbf{w}$ is the model parameter vector. [sent-64, score-0.263]
48 The model parameters are estimated by minimizing the regularized log-loss of the training instances, $\min_{\mathbf{w}} \sum_{i \in L} \log(1 + \exp(-y_i \mathbf{w}^\top \mathbf{x}_i)) + \frac{\lambda}{2}\mathbf{w}^\top \mathbf{w}$ (1), where $L$ indexes the training instances and $\frac{\lambda}{2}\mathbf{w}^\top \mathbf{w}$ is a regularization term introduced to avoid over-fitting. [sent-68, score-0.394]
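As a concrete illustration of the logistic model and equation (1), here is a minimal sketch of regularized binary logistic regression trained with a quasi-Newton optimizer (scipy's L-BFGS-B); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(X, y, lam=1.0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) w^T w, i.e. equation (1).
    Labels y are expected in {+1, -1}."""
    n, d = X.shape

    def objective(w):
        margins = y * (X @ w)                     # y_i * w^T x_i
        loss = np.logaddexp(0.0, -margins).sum()  # stable log(1 + exp(-margin))
        return loss + 0.5 * lam * (w @ w)

    def gradient(w):
        margins = y * (X @ w)
        coef = -y / (1.0 + np.exp(margins))       # d loss_i / d(w^T x_i)
        return X.T @ coef + lam * w

    res = minimize(objective, np.zeros(d), jac=gradient, method="L-BFGS-B")
    return res.x

def prob_positive(w, X):
    """P(y = +1 | x, w) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```

With labels in {+1, -1}, `fit_logistic` returns the weight vector and `prob_positive` gives the conditional class probabilities used by the selection criteria below.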
49 4 Discriminative Batch Mode Active Learning. For active learning, one typically encounters a small number of labeled instances and a large number of unlabeled instances. [sent-71, score-1.168]
50 Instance selection strategies based only on the labeled data therefore ignore potentially useful information embodied in the unlabeled instances. [sent-72, score-0.631]
51 In this section, we present a new discriminative batch mode active learning algorithm for binary classification that exploits information in the unlabeled instances. [sent-73, score-1.457]
52 The proposed approach is discriminative in the sense that (1) it selects a batch of instances by optimizing a discriminative classification model; and (2) it selects instances by considering the best discriminative configuration of their labels leading to the best classifier. [sent-74, score-1.887]
53 Unlike other batch mode active learning methods, which identify the most informative batch of instances using heuristic measures, our approach aims to identify the batch of instances that directly optimizes classification performance. [sent-75, score-2.32]
54 4.1 Optimization Problem. An optimal active learning strategy selects a set of instances to label that leads to learning the best classifier. [sent-77, score-0.933]
55 With unlabeled data being available, semi-supervised learning methods have been proposed that train by simultaneously maximizing the likelihood of labeled instances and minimizing the uncertainty of the labels for unlabeled instances [6]. [sent-80, score-1.694]
56 The new active learning approach we propose is motivated by this semi-supervised learning principle. [sent-82, score-0.382]
57 We propose to select a batch of m unlabeled instances, S, to label in each iteration from the total unlabeled set U , with the goal of maximizing the objective (2). [sent-83, score-1.232]
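Objective (2) itself is not reproduced in this summary; based on the surrounding description (high likelihood on the labeled instances, low label entropy on the unlabeled instances) and on the optimistic score (4) below, a plausible form is the following. The exact expression in the paper, e.g. any regularization term on $\mathbf{w}$, may differ.

```latex
f(\mathbf{w}) \;=\; \sum_{i \in L} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})
   \;-\; \alpha \sum_{j \in U} H(y \mid \mathbf{x}_j, \mathbf{w})
```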
58 In practice, however, it is problematic to use the f(S) score directly to guide instance selection: the labels of the instances in S are not known when the selection is conducted. [sent-85, score-0.28]
59 Instead, we propose an optimistic strategy: use the best f (S) score that the batch of unlabeled instances S can achieve over all possible label configurations. [sent-87, score-1.258]
60 This optimistic scoring function can be written as $f(S) = \max_{\mathbf{y}_S} \sum_{i \in L^t \cup S} \log P(y_i \mid \mathbf{x}_i, \mathbf{w}^{t+1}) - \alpha \sum_{j \in U^t \setminus S} H(y \mid \mathbf{x}_j, \mathbf{w}^{t+1})$ (4). Thus the problem becomes how to select a set of instances S that achieves the best optimistic f(S) score defined in (4). [sent-88, score-0.729]
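For a small candidate batch, the optimistic score (4) can be evaluated exactly by enumerating all 2^m label configurations and retraining the classifier for each one, which also makes concrete why searching over batches is intractable. The sketch below uses scikit-learn's logistic regression as the base classifier; the value of alpha and the helper names are illustrative assumptions, and the labeled set is assumed to contain both classes.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

def binary_entropy(p_pos):
    """H(y | x, w) for a vector of P(y = +1 | x, w) values."""
    p = np.clip(p_pos, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def optimistic_score(X_lab, y_lab, X_unlab, S_idx, alpha=0.1):
    """Brute-force f(S) of equation (4): maximize over all label guesses y_S the
    log-likelihood on L ∪ S minus alpha times the entropy on the remaining unlabeled points."""
    S_idx = np.asarray(S_idx)
    rest = np.setdiff1d(np.arange(len(X_unlab)), S_idx)
    best = -np.inf
    for y_S in product([-1, 1], repeat=len(S_idx)):         # 2^m label configurations
        X_aug = np.vstack([X_lab, X_unlab[S_idx]])
        y_aug = np.concatenate([y_lab, np.asarray(y_S)])
        clf = LogisticRegression(C=1.0).fit(X_aug, y_aug)   # w^{t+1} retrained on L^t ∪ S
        proba = clf.predict_proba(X_aug)                     # columns follow clf.classes_ = [-1, +1]
        cols = (y_aug == 1).astype(int)
        loglik = np.log(np.clip(proba[np.arange(len(y_aug)), cols], 1e-12, None)).sum()
        entropy = binary_entropy(clf.predict_proba(X_unlab[rest])[:, 1]).sum()
        best = max(best, loglik - alpha * entropy)
    return best
```

Selecting the actual batch would still require comparing this score across all size-m subsets S of U, which is what the continuous relaxation described next avoids.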
61 Although this problem can be solved using an exhaustive search on all size m subsets, S, of the unlabeled set U , it is intractable to do so in practice since the search space is exponentially large. [sent-89, score-0.431]
62 Explicit heuristic search approaches seeking a local optimum do not exist either, since it is hard to define an efficient set of operators that can move from one position to another within the search space while guaranteeing improvement in the optimistic score. [sent-90, score-0.307]
63 Instead, in this paper we propose to approach the problem by formulating optimistic batch mode active learning as an explicit mathematical optimization. [sent-91, score-1.006]
64 Given the labeled set $L^t$ and unlabeled set $U^t$ after iteration t, the task in iteration t + 1 is to select a size-m subset S from $U^t$ that achieves the best score defined in (4). [sent-92, score-0.708]
65 To do so, we first introduce a set of {0, 1}-valued instance selection variables µ. [sent-93, score-0.234]
66 In particular, µ is a $|U^t| \times 2$ indicator matrix, where each row vector $\mu_j$ corresponds to the two possible labels {+1, −1} of the jth instance in $U^t$. [sent-94, score-0.247]
67 Then the optimistic instance selection for iteration t + 1 can be formulated as a constrained optimization problem over µ, maximizing the optimistic score subject to the selection constraints on µ (equation (5)). [sent-95, score-0.523]
68 Note that the selection variables µ not only choose instances from $U^t$, but also select labels for the selected instances. [sent-98, score-0.634]
69 Solving this optimization yields the optimal µ for instance selection in iteration t + 1. [sent-99, score-0.365]
70 The relaxed continuous optimization over µ is solved with a quasi-Newton local search: this local optimization approach iteratively updates µ to improve the objective (15), and stops when a local maximum is reached. [sent-110, score-0.223]
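Since equations (5)-(15) are not reproduced in this summary, the following is only a rough sketch of the general idea under stated assumptions: the {0,1} entries of µ are relaxed to [0,1], the total selected mass is softly constrained to the batch size m and each row to at most one unit, the inner classifier is refit by µ-weighted maximum likelihood for every candidate µ, and the outer box-constrained problem is handled with scipy's L-BFGS-B quasi-Newton routine using numerical gradients. The paper's actual objective, constraints, and quasi-Newton manipulation differ in their details.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

def negative_relaxed_score(mu_flat, X_lab, y_lab, X_unlab, m,
                           alpha=0.1, beta=1.0, penalty=10.0):
    """Relaxed selection objective for a soft selection matrix mu (|U| x 2):
    column 0 ~ guessing label -1, column 1 ~ guessing label +1."""
    mu = mu_flat.reshape(-1, 2)
    n_u = len(X_unlab)
    # Inner step: refit the classifier, weighting each guessed label of an unlabeled point by beta*mu.
    X_aug = np.vstack([X_lab, X_unlab, X_unlab])
    y_aug = np.concatenate([y_lab, -np.ones(n_u), np.ones(n_u)])
    w_aug = np.concatenate([np.ones(len(y_lab)), beta * mu[:, 0], beta * mu[:, 1]]) + 1e-8
    clf = LogisticRegression(C=1.0).fit(X_aug, y_aug, sample_weight=w_aug)
    proba_lab = clf.predict_proba(X_lab)[np.arange(len(y_lab)), (y_lab == 1).astype(int)]
    proba_u = np.clip(clf.predict_proba(X_unlab), 1e-12, 1 - 1e-12)  # columns: [P(-1), P(+1)]
    score = np.log(np.clip(proba_lab, 1e-12, None)).sum()
    score += beta * (mu * np.log(proba_u)).sum()                     # guessed-label log-likelihood
    ent = -(proba_u * np.log(proba_u)).sum(axis=1)
    score -= alpha * (np.clip(1.0 - mu.sum(axis=1), 0.0, 1.0) * ent).sum()
    score -= penalty * (mu.sum() - m) ** 2                           # soft batch-size constraint
    score -= penalty * (np.clip(mu.sum(axis=1) - 1.0, 0.0, None) ** 2).sum()
    return -score

def select_batch(X_lab, y_lab, X_unlab, m):
    """Quasi-Newton (L-BFGS-B) local search over the relaxed mu, then round to a batch of size m."""
    n_u = len(X_unlab)
    mu0 = np.full(n_u * 2, m / (2.0 * n_u))
    res = minimize(negative_relaxed_score, mu0, args=(X_lab, y_lab, X_unlab, m),
                   method="L-BFGS-B", bounds=[(0.0, 1.0)] * (n_u * 2),
                   options={"maxiter": 25})
    mu = res.x.reshape(-1, 2)
    chosen, guesses = [], []
    for k in np.argsort(-mu, axis=None):          # largest entries first (C-order flattening)
        j, col = divmod(int(k), 2)
        if j not in chosen:
            chosen.append(j)
            guesses.append(-1 if col == 0 else 1)
        if len(chosen) == m:
            break
    return np.array(chosen), np.array(guesses)    # unlabeled indices and their optimistic labels
```

Because no analytic gradient is supplied, L-BFGS-B falls back on finite differences and refits the inner classifier many times, so this sketch is only practical for toy-sized pools; a real implementation would derive the gradient of the relaxed objective with respect to µ.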
71 4.3 Adjustment Strategy. [sent-129, score-0.289]
72 In the discriminative optimization problem formulated in Section 4.1, the µ variables are used to optimistically select both instances and their labels, with the goal of achieving the best classification model according to the objective (5). [sent-130, score-0.472]
73 However, when the labeled set is small and the discriminative partition (clustering) information contained in the large unlabeled set is inconsistent with the true classification, the labels optimistically guessed for the selected instances through µ might not match the underlying true labels. [sent-131, score-1.31]
74 Furthermore, the unlabeled data might continue to mislead the next instance selection iteration. [sent-133, score-0.619]
75 Fortunately, we can immediately identify when the process has been misled once the true labels for the selected instances have been obtained. [sent-134, score-0.534]
76 If the true labels are different from the labels guessed by the optimization, we need to make an adjustment for the next instance selection iteration. [sent-135, score-0.551]
77 Note that the problem of being misled is caused by the unlabeled data, which affects the target classification model through the term $\beta \sum_{j \in U^t} v_j^{t+1} \mu_j$. [sent-137, score-0.413]
78 Specifically, at the end of each iteration t, we obtain the true labels $\mathbf{y}_S$ for the selected instances S, and compare them with our guessed labels $\hat{\mathbf{y}}_S$ indicated by $\mu^*$. [sent-139, score-0.72]
79 If they are consistent, we will set β = 1, which means we trust the partition information from the unlabeled data as much as the label information in the labeled data for building the classification model. [sent-140, score-0.547]
80 If $\hat{\mathbf{y}}_S \neq \mathbf{y}_S$, we should clearly reduce the β value, that is, reduce the influence of the unlabeled data in the next selection iteration t + 1. [sent-141, score-0.516]
81 Note that, if we reduce β to zero, our optimization problem will be exactly equivalent to picking the most uncertain instance (when m = 1). [sent-145, score-0.239]
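A minimal version of this feedback rule is sketched below; the exact reduction schedule is not given in this summary, so the halving step is only an assumed choice.

```python
import numpy as np

def adjust_beta(beta, y_true_S, y_guessed_S, min_beta=0.0):
    """Update the weight beta on the unlabeled-data term after the queried labels arrive.
    Matching guesses restore full trust (beta = 1); mismatches shrink beta, and beta = 0
    (with m = 1) reduces the selection to picking the most uncertain instance."""
    if np.array_equal(np.asarray(y_true_S), np.asarray(y_guessed_S)):
        return 1.0
    return max(min_beta, beta / 2.0)   # halving is an illustrative schedule
```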
82 The UCI datasets we used include (we show the name, followed by the number of instances and the number of attributes): Australian(690;14), Cleve(303;13), Corral(128;6), Crx(690;15), Flare(1066;10), Glass2(163;9), Heart(270;13), Hepatitis(155;20) and Vote(435;15). [sent-147, score-0.399]
83 We consider a hard case of active learning, where only a few labeled instances are given at the start. [sent-148, score-0.816]
84 In each experiment, we start with four randomly selected labeled instances, two in each class. [sent-149, score-0.184]
85 We then randomly select 2/3 of the remaining instances as the unlabeled set, using the remaining instances for testing. [sent-150, score-1.103]
86 All the algorithms start with the same initial labeled set, unlabeled set and testing set. [sent-151, score-0.494]
87 For a fixed batch size m, each algorithm repeatedly selects m instances to label in each iteration. [sent-152, score-0.778]
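The evaluation protocol described above amounts to the following loop, sketched here with scikit-learn utilities; `select_fn` stands for any of the compared strategies (random sampling is shown as an example) and is assumed to return indices into the current unlabeled pool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def random_batch(X_lab, y_lab, X_pool, m, seed=0):
    """Random-sampling baseline: pick m pool indices uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(X_pool), size=min(m, len(X_pool)), replace=False)

def run_trial(X, y, select_fn, m=5, n_rounds=10, seed=0):
    """One trial: 4 seed labels (2 per class), 2/3 of the rest as the unlabeled pool,
    the remaining 1/3 held out for testing. Labels are assumed to be in {+1, -1}."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    labeled = np.concatenate([rng.choice(pos, 2, replace=False),
                              rng.choice(neg, 2, replace=False)])
    rest = rng.permutation(np.setdiff1d(np.arange(len(y)), labeled))
    n_pool = (2 * len(rest)) // 3
    pool, test = list(rest[:n_pool]), rest[n_pool:]

    accuracies = []
    for _ in range(n_rounds):
        clf = LogisticRegression(C=1.0).fit(X[labeled], y[labeled])
        accuracies.append(accuracy_score(y[test], clf.predict(X[test])))
        if len(pool) == 0:
            break
        batch = select_fn(X[labeled], y[labeled], X[pool], m)   # indices into the pool
        queried = [pool[i] for i in batch]                      # reveal their true labels
        labeled = np.concatenate([labeled, queried])
        pool = [p for p in pool if p not in set(queried)]
    return accuracies
```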
88 Moreover, Discriminative also apparently outperforms the other two batch mode algorithms, svmD and Fisher, on five datasets—Australian, Cleve, Flare, Heart and Hepatitis, and reaches a tie on two datasets—Crx and Vote. [sent-156, score-0.55]
89 The myopic most uncertain selection method, MostUncertain, shows an overall inferior performance to Discriminative on Australian, Cleve, Crx, Heart and Hepatitis, and achieves a tie on Flare and Vote. [sent-157, score-0.224]
90 These empirical results suggest that selecting unlabeled instances by directly optimizing the classification model yields more relevant and informative instances than using heuristic scores to guide the selection. [sent-216, score-0.852]
91 Although the original optimization problem is NP-hard, a relaxed local optimization method that leads to a locally optimal solution still works effectively. [sent-217, score-0.281]
92 6 Conclusion. In this paper, we proposed a discriminative batch mode active learning approach that exploits information in unlabeled data and selects a batch of instances by optimizing the target classification model. [sent-218, score-2.227]
93 Although the proposed technique could be overly optimistic about the information presented by the unlabeled set, and consequently be misled, this problem can be identified immediately after obtaining the true labels. [sent-219, score-0.484]
94 Experimental results on UCI datasets show that this approach is generally more effective compared with other batch mode active learning methods, a random sampling method, and a myopic non-batch mode active learning method. [sent-221, score-1.55]
95 Incorporating diversity in active learning with support vector machines. [sent-230, score-0.405]
96 Batch mode active learning and its application to medical image classification. [sent-273, score-0.548]
97 Employing EM in pool-based active learning for text classification. [sent-283, score-0.376]
98 Toward optimal active learning through sampling estimation of error reduction. [sent-312, score-0.377]
99 Support vector machine active learning with applications to text classification. [sent-322, score-0.376]
100 Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. [sent-346, score-0.382]
wordName wordTfidf (topN-words)
[('unlabeled', 0.352), ('instances', 0.352), ('batch', 0.326), ('active', 0.322), ('mostuncertain', 0.208), ('svmd', 0.208), ('mode', 0.196), ('discriminative', 0.194), ('wt', 0.186), ('labeled', 0.142), ('lt', 0.134), ('optimistic', 0.132), ('instance', 0.132), ('fisher', 0.128), ('classi', 0.118), ('selection', 0.102), ('ys', 0.098), ('hk', 0.093), ('selects', 0.092), ('labels', 0.091), ('guessed', 0.082), ('xj', 0.077), ('cleve', 0.076), ('crx', 0.076), ('flare', 0.076), ('optimization', 0.069), ('hepatitis', 0.066), ('iteration', 0.062), ('vj', 0.061), ('uci', 0.059), ('er', 0.058), ('corral', 0.057), ('retraining', 0.056), ('myopic', 0.056), ('logistic', 0.055), ('strategy', 0.054), ('label', 0.053), ('diversity', 0.053), ('adjustment', 0.053), ('misled', 0.049), ('vec', 0.048), ('datasets', 0.047), ('select', 0.047), ('informative', 0.046), ('labeling', 0.046), ('local', 0.046), ('heuristic', 0.044), ('score', 0.043), ('australian', 0.043), ('heart', 0.043), ('indexes', 0.042), ('selected', 0.042), ('proceedings', 0.041), ('international', 0.041), ('objective', 0.04), ('yk', 0.039), ('uncertain', 0.038), ('cation', 0.038), ('adjust', 0.037), ('exploits', 0.037), ('conference', 0.035), ('strategies', 0.035), ('hessian', 0.035), ('fk', 0.034), ('mislead', 0.033), ('guo', 0.033), ('yuhong', 0.033), ('hoi', 0.033), ('optimistically', 0.033), ('seeking', 0.033), ('yi', 0.03), ('jin', 0.03), ('dale', 0.03), ('learning', 0.03), ('selecting', 0.03), ('taylor', 0.029), ('focused', 0.028), ('zhu', 0.028), ('guide', 0.028), ('tie', 0.028), ('exhaustive', 0.027), ('accuracy', 0.027), ('trained', 0.027), ('committee', 0.026), ('search', 0.026), ('formulated', 0.026), ('integer', 0.026), ('sampling', 0.025), ('relaxed', 0.025), ('continuous', 0.024), ('sized', 0.024), ('text', 0.024), ('regression', 0.023), ('log', 0.023), ('uncertainty', 0.023), ('inconsistent', 0.022), ('xs', 0.022), ('vote', 0.022), ('selective', 0.022), ('iteratively', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 69 nips-2007-Discriminative Batch Mode Active Learning
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can still use a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning method is more effective than current state-of-the-art batch mode active learning algorithms. 1
2 0.24162343 136 nips-2007-Multiple-Instance Active Learning
Author: Burr Settles, Mark Craven, Soumya Ray
Abstract: We present a framework for active learning in the multiple-instance (MI) setting. In an MI learning problem, instances are naturally organized into bags and it is the bags, instead of individual instances, that are labeled for training. MI learners assume that every instance in a bag labeled negative is actually negative, whereas at least one instance in a bag labeled positive is actually positive. We consider the particular case in which an MI learner is allowed to selectively query unlabeled instances from positive bags. This approach is well motivated in domains in which it is inexpensive to acquire bag labels and possible, but expensive, to acquire instance labels. We describe a method for learning from labels at mixed levels of granularity, and introduce two active query selection strategies motivated by the MI setting. Our experiments show that learning from instance labels can significantly improve performance of a basic MI learning algorithm in two multiple-instance domains: content-based image retrieval and text classification. 1
3 0.1689188 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
Author: Kaushik Sinha, Mikhail Belkin
Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.
4 0.15883887 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
Author: Larry Wasserman, John D. Lafferty
Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1
5 0.14983004 166 nips-2007-Regularized Boost for Semi-Supervised Learning
Author: Ke Chen, Shihai Wang
Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1
6 0.14009002 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
7 0.13699286 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
8 0.12185959 110 nips-2007-Learning Bounds for Domain Adaptation
9 0.1155507 175 nips-2007-Semi-Supervised Multitask Learning
10 0.11513191 32 nips-2007-Bayesian Co-Training
11 0.1135979 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
12 0.10898753 15 nips-2007-A general agnostic active learning algorithm
13 0.090023458 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
14 0.089633018 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
15 0.0883549 199 nips-2007-The Price of Bandit Information for Online Optimization
16 0.087979145 40 nips-2007-Bundle Methods for Machine Learning
17 0.08640781 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
18 0.084085897 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
19 0.077982329 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
20 0.076961443 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
topicId topicWeight
[(0, -0.23), (1, 0.028), (2, -0.161), (3, 0.147), (4, 0.05), (5, 0.139), (6, 0.086), (7, -0.121), (8, 0.162), (9, 0.043), (10, 0.049), (11, -0.031), (12, -0.18), (13, -0.139), (14, 0.033), (15, 0.172), (16, -0.079), (17, 0.018), (18, 0.186), (19, -0.01), (20, 0.064), (21, -0.142), (22, 0.208), (23, -0.087), (24, 0.067), (25, 0.07), (26, 0.002), (27, -0.038), (28, 0.007), (29, -0.084), (30, -0.111), (31, -0.09), (32, 0.042), (33, -0.077), (34, -0.008), (35, -0.067), (36, 0.137), (37, 0.094), (38, -0.047), (39, 0.097), (40, 0.078), (41, -0.058), (42, 0.043), (43, 0.029), (44, 0.062), (45, -0.059), (46, 0.03), (47, -0.015), (48, -0.016), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.97607005 69 nips-2007-Discriminative Batch Mode Active Learning
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can still use a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning method is more effective than current state-of-the-art batch mode active learning algorithms. 1
2 0.80351865 136 nips-2007-Multiple-Instance Active Learning
Author: Burr Settles, Mark Craven, Soumya Ray
Abstract: We present a framework for active learning in the multiple-instance (MI) setting. In an MI learning problem, instances are naturally organized into bags and it is the bags, instead of individual instances, that are labeled for training. MI learners assume that every instance in a bag labeled negative is actually negative, whereas at least one instance in a bag labeled positive is actually positive. We consider the particular case in which an MI learner is allowed to selectively query unlabeled instances from positive bags. This approach is well motivated in domains in which it is inexpensive to acquire bag labels and possible, but expensive, to acquire instance labels. We describe a method for learning from labels at mixed levels of granularity, and introduce two active query selection strategies motivated by the MI setting. Our experiments show that learning from instance labels can significantly improve performance of a basic MI learning algorithm in two multiple-instance domains: content-based image retrieval and text classification. 1
3 0.74065745 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
Author: Kaushik Sinha, Mikhail Belkin
Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.
4 0.62228513 15 nips-2007-A general agnostic active learning algorithm
Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu
Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1
5 0.59496206 166 nips-2007-Regularized Boost for Semi-Supervised Learning
Author: Ke Chen, Shihai Wang
Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1
6 0.56582588 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
7 0.54720259 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
8 0.53130752 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
9 0.45232826 175 nips-2007-Semi-Supervised Multitask Learning
10 0.44949481 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
11 0.43058607 32 nips-2007-Bayesian Co-Training
12 0.41746986 19 nips-2007-Active Preference Learning with Discrete Choice Data
13 0.41509891 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
14 0.39836463 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
15 0.39057651 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
16 0.38831985 110 nips-2007-Learning Bounds for Domain Adaptation
17 0.37555054 40 nips-2007-Bundle Methods for Machine Learning
18 0.36199951 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
19 0.34562317 187 nips-2007-Structured Learning with Approximate Inference
20 0.34548587 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
topicId topicWeight
[(5, 0.045), (11, 0.245), (13, 0.037), (16, 0.025), (18, 0.013), (21, 0.141), (31, 0.027), (34, 0.029), (35, 0.017), (47, 0.1), (83, 0.148), (85, 0.023), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.85839128 97 nips-2007-Hidden Common Cause Relations in Relational Learning
Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani
Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies. 1 Contribution Prediction problems, such as classification, can be easier when class labels share a sort of relational dependency that is not accounted by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classifiers that uses information concerning the relational structure of the problem. Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., “student page” or “not a student page”). Features such as the word distribution within the body of each page can be used to predict each webpage’s class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2 , this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]: ∗ Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk F1 F2 F 3 C1 C2 C3 Here Fi are the features of page Wi , and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 −C2 ) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a fixed input feature set {F1 , F2 , F3 }, are as follows: C1 is marginally dependent on C3 , but conditionally independent given C2 . Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2 . This can happen if the existence of a relation (Ci , Cj ) corresponds to the existence of hidden common causes generating this pair of random variables. Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M ), Sony (S) and Philips (P ). The task is a regression task where we want to predict the stock market price of each company given its profitability from last year. 
The given relationships are that M and S are direct competitors (due to the videogame console market), as well S and P (due to the TV set market). M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Profit S.Profit P.Profit M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock M.Stock S.Stock P.Stock εm εs εp εm εs εp (a) (b) (c) Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common causes relationships by using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics. It is expected that several market factors that affect stock prices are unaccounted by the predictor variable Past Year Profit. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft’s and Sony’s stock. Another hidden common cause would be a high price for Sony’s consoles. Assume here that these factors have no effect on Philips’ stock value. A depiction of several hidden common causes that correpond to the relations Competitor(M, S) and Competitor(S, P ) is given in Figure 1(a) as unlabeled gray vertices. Consider a linear regression model for this setup. We assume that for each object Oi ∈ {M, S, P }, the stock price Oi .Stock, centered at the mean, is given by Oi .Stock = β × Oi .P rof it + where each i i (1) is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of m and s , σms . That is, unlike in standard directed Gaussian models, σms is allowed to be non-zero. The same holds for σsp . Covariances of error terms of unrelated objects should be zero (σmp = 0). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12]. A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that m and p are marginally dependent, which does not correspond to our assumptions1 . Moreover, the model in Figure 1(b) states that once we observe Sony’s stock price, Philip’s stocks (and profit) should have a non-zero association with Microsoft’s profit: this follows from a extension of d-separation to DMGs [9]. This is expected from the assumptions (Philip’s stocks should tell us something about Microsoft’s once we know Sony’s, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice. This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random fields [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure. 
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 (a) (b) X1 X3 X5 Y1 Y3 Y5 (c) Figure 2: (a) A conditional random field (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a). To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels yi , yj are statistically associated when some chain of relationships exists between yi and yj , then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted by the dependence semantics of the directed mixed graph representation. The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediated points between two classes labels yi and yj is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints. In this paper, we introduce a non-parametric classification model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describes the model and contrasts it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classification are described in Section 4. 2 Model Chu et al. [2] describe an approach for Gaussian process classification using relational information, which we review and compare to our proposed model. Previous approach: relational Gaussian processes through indicators − For each point x in the input space X , there is a corresponding function value fx . Given observed input points x1 , x2 , . . . , xn , a Gaussian process prior over f = [f1 , f2 , . . . , fn ]T has the shape P(f ) = 1 (2π)n/2 |Σ|1/2 1 exp − f T Σ−1 f 2 (2) 1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random fields) corresponds to a zero entry in the inverse covariance matrix, where in the DMG it corresponds to a zero in the covariance matrix [9]. X1 X2 f1 X3 X2 X3 X1 X2 X3 f3 f2 ξ 12 X1 f1 f2 f3 f1 f2 f3 Y2 Y3 ξ 23 Y1 Y2 Y3 Y1 Y2 Y3 Y1 ε1 ε2 ε3 ε1 ε2 ε3 ε1 (a) ζ1 (b) ζ2 ε2 ζ3 ε3 (c) Figure 3: (a) A prediction problem where y3 is unknown and the training set is composed of other two datapoints. Dependencies between f1 , f2 and f3 are given by a Gaussian process prior and not represented in the picture. Indicators ξij are known and set to 1; (b) The extra associations that arise by conditioning on ξ = 1 can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11]. where the ijth entry of Σ is given by a Mercer kernel function K(xi , xj ) [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. 
Let ξij be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points (xi , xj ): they are an encoding of the given relational structure. A model for P (ξij = 1|fi , fj ) is defined. This evidence is incorporated into the Gaussian process by conditioning on all indicators ξij that are positive. Essentially, the idea boils down to using P(f |ξ = 1) as the prior for a Gaussian process classifier. Figure 3(a) illustrates a problem with datapoints {(x1 , y1 ), (x2 , y2 ), (x3 , y3 )}. Gray vertices represent unobserved variables. Each yi is a binary random variable, with conditional probability given by P(yi = 1|fi ) = Φ(fi /σ) (3) where Φ(·) is the standard normal cumulative function and σ is a hyperparameter. This can be interpreted as the cumulative distribution of fi + i , where fi is given and i is a normal random variable with zero mean and variance σ 2 . In the example of Figure 3(a), one has two relations: (x1 , x2 ), (x2 , x3 ). This information is incorporated by conditioning on the evidence (ξ12 = 1, ξ23 = 1). Observed points (x1 , y1 ), (x2 , y2 ) form the training set. The prediction task is to estimate y3 . Notice that ξ12 is not used to predict y3 : the Markov blanket for f3 includes (f1 , f2 , ξ23 , y3 , 3 ) and the input features. Essentially, conditioning on ξ = 1 corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9]2 . Our approach: mixed graph relational model − Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of yi as √ (4) P(yi = 1|gi , vi ) = Φ(gi / vi ) where gi = fi + ζi . As before, Equation (4) can be interpreted as the cumulative distribution of 2 gi + i , with i as a normal random variable with zero mean and variance vi = σ 2 − σζi , the last term being the variance of ζi . That is, we break the original error term as i = ζi + i , where i and j are independent for all i = j. Random vector ζ is a multivariate normal with zero mean and covariance matrix Σζ . The key aspect in our model is that the covariance of ζi and ζj is non-zero only if objects i and j are related (that is, bi-directed edge yi ↔ yj is in the relational graph). Parameterizing Σζ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference of our model 3(c) to a standard Markov network models 3(b) is that now the Markov blanket for f3 includes error terms for all variables (both and ζ terms), following the motivation presented in Section 1. 2 In the figure, we are not representing explicitly that f1 , f2 and f3 are not independent (the prior covariance matrix Σ is complete). The figure is meant as a representation of the extra associations that arise when conditioning on ξ = 1, and the way such associations factorize. As before, the prior for f in our setup is the Gaussian process prior (2). This means that g has the following Gaussian process prior (implicitly conditioned on x): P(g) = 1 (2π)n/2 |R|1/2 1 exp − g R−1 g 2 (5) where R = K + Σζ is the covariance matrix of g = f + ζ, with Kij = K(xi , xj ). 3 Parametrizing a mixed graph model for relational classification For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of Σζ has to fulfill two desiderata: (i). 
it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive definite; (ii). it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as flexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contain unobserved and observed label nodes. 3.1 Method I The first method is an automated method to relax some of the independence constraints, while guaranteeing positive-definiteness, and a parameterization that depends on a single scalar ρ. This allows for more efficient inference and is done as follows: 1. Let Gζ be the corresponding bi-directed subgraph of our original mixed graph, and let U0 be a matrix with n × n entries, n being the number of nodes in Gζ 2. Set U0 to be the number of cliques in Gζ where yi and yj appear together; ij 3. Set U0 to be the number of cliques containing yi , plus a small constant ∆; ii 4. Set U to be the corresponding correlation matrix obtained by intepreting U0 as a covariance matrix and rescaling it; Finally, set Σζ = ρU, where ρ ∈ [0, 1] is a given hyperparameter. Matrix U is always guaranteed to be positive definite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node yi is given by the sum of its parents plus an independent error term of variance ∆. Marginal independencies are respected, since independent random variables will never be in a same clique in Gζ . In practice, this method cannot be used as is since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we first triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation3. 3.2 Method II The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with lots of marginal independence constraints being violated. In this case, this will often result in a poor prediction performance. A cheap alternative approach is not generating cliques, and instead 3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel. 10 10 10 20 20 20 30 30 30 40 40 40 50 50 50 60 60 60 70 70 70 80 80 80 90 90 90 100 100 100 10 20 30 40 50 60 (a) 70 80 90 100 10 20 30 40 50 60 70 80 90 (b) 100 10 20 30 40 50 60 70 80 90 100 (c) Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. 
(c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations. getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge yi ↔ yj instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs (yi , yj ) connected by an edge, it will be the case that U0 = 1, while U0 can be as large as n. This means that the resulting correlation in Uij can be ij ii close to zero even if yi and yj are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model. 3.3 Algorithm Recall that our model is a Gaussian process classifier with error terms i of variance σ such that i = ζi + i . Without loss of generality, we will assume that σ = 1. This results in the following parameterization of the full error covariance matrix: Σ = (1 − ρ)I + ρU (6) where I is an n × n identity matrix. Matrix (1 − ρ)I corresponds to the covariance matrix Σ . The usefulness of separating as and ζ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classifier. Instead of approximating the posterior of f , we approximate the posterior density P(g|D), D = {(x1 , y1 ), . . . , (xn , yn )} being ˜ the given training data. The approximate posterior has the form Q(g) ∝ P(g) i ti (gi ) where P(g) is the Gaussian process prior with kernel matrix R = K + Σζ as defined in the previous section. Since the covariance matrix Σ is diagonal, the true likelihood of y given g factorizes n over each datapoint: P(y|g) = i=1 P(yi |gi ), and standard EP algorithms for Gaussian process classification can be used [8] (with the variance given by Σ instead of Σ , and kernel matrix R instead of K). The final algorithm defines a whole new class of relational models, depends on a single hyperparameter ρ which can be optimized by grid search in [0, 1], and requires virtually no modification of code written for EP-based Gaussian process classifiers4 . 4 Results We now compare three different methods in relational classification tasks. We will compare a standard Gaussian process classifier (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel K(x, z) = x · z is used, as described by [2]. We set ∆ = 10−4 and the hyperparameter ρ is found by a grid search in the space {0.1, 0.2, 0.3, . . . , 1.0} maximizing the approximate EP marginal likelihood5. 4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/∼silva. For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/∼burkardt/m src/rcm/rcm.html 5 Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. “n” denotes the number of papers in one class. “Citations” denotes the citation count within the two paper classes. 
Table 1: The averaged AUC scores for citation prediction on test cases of the Cora database, recorded along with standard deviations over 100 trials. "n" denotes the number of papers in each of the two classes. "Citations" denotes the citation count within the two paper classes.
Group   n          Citations   GPC             GPC with Citations   XGP
5vs1    346/488    2466        0.905 ± 0.031   0.891 ± 0.022        0.945 ± 0.053
5vs2    346/619    3417        0.900 ± 0.032   0.905 ± 0.044        0.933 ± 0.059
5vs3    346/1376   3905        0.863 ± 0.040   0.893 ± 0.017        0.883 ± 0.013
5vs4    346/646    2858        0.916 ± 0.030   0.887 ± 0.018        0.951 ± 0.042
5vs6    346/281    1968        0.887 ± 0.054   0.843 ± 0.076        0.955 ± 0.041
5vs7    346/529    2948        0.869 ± 0.045   0.867 ± 0.041        0.926 ± 0.076
4.1 Political books
We first consider a simple classification problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words on the Amazon.com front page for that particular book. The choice of books, labels, and relationships is given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/∼mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/∼silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books that are frequently purchased together by the same customer. Notice that this is an easy problem, where labels are strongly associated if they share a relationship. We performed the evaluation by sampling 100 times from the original pool of books, assigning half of them as training data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix. Figure 4(b) depicts the corresponding U0 matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives better performance here (Method I was better in the next two experiments). The AUC result for GPC was 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a standard deviation of 0.02).
4.2 Cora
The Cora collection [6] contains over 50,000 computer science research papers, including bibliographic citations. We used a subset in our experiment: 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of "term frequency" components scaled by "inverse document frequency" and then normalized to unit length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times. We use the fact that the database is composed of fairly specialized papers to illustrate when XGP might not be as effective as RGP (whose AUC curves are very close to 1): the population of links tends to be better separated between the different classes (but this also means that the task is fairly easy, and the differences disappear rapidly with increasing sample sizes). The fact that there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of a similar qualitative nature and are not displayed here.
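The evaluation protocol used in Sections 4.1 and 4.2, repeated random train/test splits scored by AUC, can be sketched as follows. This is an illustrative outline rather than the authors' evaluation code: fit_and_score is a hypothetical stand-in for training any of the three classifiers and returning real-valued scores on the test points.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def repeated_split_auc(y, fit_and_score, train_fraction=0.5,
                       n_repeats=100, seed=0):
    """Mean and standard deviation of AUC over repeated random splits.

    y : numpy array of binary labels for the whole pool of instances.
    fit_and_score : callable taking (train_idx, test_idx) and returning
        real-valued scores for the test instances.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        n_train = int(train_fraction * n)
        train_idx, test_idx = perm[:n_train], perm[n_train:]
        scores = fit_and_score(train_idx, test_idx)
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs)), float(np.std(aucs))
```

With train_fraction=0.5 this mirrors the political-books protocol above; the Cora experiments instead train on roughly 1% of the labelled samples per class.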
4.3 WebKB
The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3]. Each webpage belongs to one of 7 categories: student, professor, course, project, staff, department and "other". The relations come from actual links in the webpages. There is a relatively high heterogeneity of link types in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student page and a course page are associated because that person's interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W. However, observing the type of page W might create the association. We compare how the three algorithms perform when trying to predict whether a webpage is of class "other" or not (the other classification tasks are easier, with smaller differences; results are omitted for space reasons). The proportion of "other" to non-"other" pages is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples as [2], where 10% of the whole data is sampled from the pool for a specific university and the remainder is used for testing. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC, and XGP gives significant improvements over RGP in all four universities.
Table 2: Comparison of the three algorithms on the task "other" vs. "not-other" in the WebKB domain. Results for GPC and RGP are taken from [2]. The same partitions of training and test data are used to generate the results for XGP. Means and standard deviations of AUC are reported.
University    Other   All    Links   GPC             RGP             XGP
Cornell       617     865    13177   0.708 ± 0.021   0.884 ± 0.025   0.917 ± 0.022
Texas         571     827    16090   0.799 ± 0.021   0.906 ± 0.026   0.949 ± 0.015
Washington    939     1205   15388   0.782 ± 0.023   0.877 ± 0.024   0.923 ± 0.016
Wisconsin     942     1263   21594   0.839 ± 0.014   0.899 ± 0.015   0.941 ± 0.018
5 Conclusion
We introduced a new family of relational classifiers by extending a classical statistical model [12] to non-parametric relational classification. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena our approach postulates can sometimes improve prediction performance in problems traditionally approached with Markov network structures. Several interesting problems can be treated in the future. It is clear that there are many different ways in which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, and approximations through matrix factorizations or graph cuts, are only a few of the many alternatives that could be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships.
Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database.
References
[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004.
[2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI'98, pages 509–516, 1998.
[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001.
[5] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000.
[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001.
[8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002.
[10] P. Sen and L. Getoor. Link-based classification. Report CS-TR-4858, University of Maryland, 2007.
[11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006.
[12] A. Zellner. An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.
same-paper 2 0.79908442 69 nips-2007-Discriminative Batch Mode Active Learning
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1
3 0.70158964 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
4 0.67469335 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
Author: Michael Ross, Andrew Cohen
Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1
5 0.66875422 19 nips-2007-Active Preference Learning with Discrete Choice Data
Author: Brochu Eric, Nando D. Freitas, Abhijeet Ghosh
Abstract: We propose an active learning algorithm that learns a continuous valuation model from discrete preferences. The algorithm automatically decides what items are best presented to an individual in order to find the item that they value highly in as few trials as possible, and exploits quirks of human psychology to minimize time and cognitive burden. To do this, our algorithm maximizes the expected improvement at each query without accurately modelling the entire valuation surface, which would be needlessly expensive. The problem is particularly difficult because the space of choices is infinite. We demonstrate the effectiveness of the new algorithm compared to related active learning methods. We also embed the algorithm within a decision making tool for assisting digital artists in rendering materials. The tool finds the best parameters while minimizing the number of queries. 1
6 0.66686261 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
7 0.66665232 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
8 0.66488028 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
9 0.66427684 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
10 0.66326237 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
11 0.6621545 86 nips-2007-Exponential Family Predictive Representations of State
12 0.66205239 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
13 0.66199994 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
14 0.65745193 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
15 0.65734881 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
16 0.6571666 158 nips-2007-Probabilistic Matrix Factorization
17 0.65642202 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
18 0.65421426 175 nips-2007-Semi-Supervised Multitask Learning
19 0.65389508 156 nips-2007-Predictive Matrix-Variate t Models
20 0.65344125 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models