nips nips2012 nips2012-305 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding
Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. [sent-9, score-0.126]
2 This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. [sent-10, score-0.214]
3 We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). [sent-11, score-0.309]
4 In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. [sent-12, score-0.444]
5 Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. [sent-13, score-0.086]
6 1 Introduction The performance of (semi-)supervised learning methods typically depends on the amount of labeled data. [sent-15, score-0.084]
7 Roughly speaking, the more the labeled data, the better the learning performance will be. [sent-16, score-0.084]
8 However, in many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. [sent-17, score-0.126]
9 To overcome this problem, active learning [9, 10] was proposed, which iteratively queries the oracle (labeler) to obtain the labels at new data points. [sent-18, score-0.168]
10 Representative methods include support vector machine (SVM) active learning [19, 18], agnostic active learning [2, 5, 14], etc. [sent-19, score-0.259]
11 Due to the close interaction between the learner and the oracle, active learning can be advantageous for achieving better learning performance. [sent-20, score-0.187]
12 For example, when one turns to Amazon Mechanical Turk to label data, the interaction between the learner and the labeling workers is very limited. [sent-22, score-0.237]
13 Therefore, standard active learning is not very practical in this case. [sent-23, score-0.105]
14 Another potential solution to the label deficiency problem is semi-supervised learning [7, 22, 21, 4], which aims at combining a small number of labeled data and a large amount of unlabeled data to improve the learning performance. [sent-24, score-0.22]
15 In a typical setting of semi-supervised learning, a small set of labeled data is assumed to be given at hand or randomly generated in practice. [sent-25, score-0.106]
16 However, randomly selecting (uniformly sampling) data points to label is unwise because not all the data points are equally informative. [sent-26, score-0.22]
17 It is desirable to obtain a labeled subset which is most beneficial for semi-supervised learning. [sent-27, score-0.084]
18 In this paper, based on the above motivation, we investigate a problem as follows: given a fixed label budget, how to select a subset of data points to label such that the learning performance is optimized. [sent-28, score-0.214]
19 We refer to this problem as selective labeling, in contrast to conventional random labeling. [sent-29, score-0.137]
20 To achieve the goal of selective labeling, it is crucial to consider the out-of-sample error of a specific learner. [sent-30, score-0.178]
21 We derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, which suggests to select the data points to label by minimizing this upper bound. [sent-36, score-0.444]
22 The resulting selective labeling method is a combinatorial optimization problem. [sent-37, score-0.272]
23 In order to optimize it effectively and efficiently, we relax it into a continuous optimization problem, and solve it by a projected gradient descent algorithm followed by discretization. [sent-38, score-0.105]
24 In Section 2, we briefly review manifold regularization and LapRLS. [sent-41, score-0.108]
25 In Section 3, we derive an out-of-sample error bound for LapRLS on subsampled data, and present a selective labeling criterion by minimizing this bound, followed by its optimization algorithm. [sent-42, score-0.51]
26 In order to estimate and preserve the geometrical and topological properties of the data, LapRLS [4] assumes that if two data points xi and xj are close in the intrinsic geometry of the data distribution, the labels of these two points are also close to each other. [sent-50, score-0.172]
27 Let f(x) be a function that maps the original data point x in a compact submanifold M to R; we use ||f||_M^2 = ∫_{x∈M} ||∇_M f||^2 dx to measure the smoothness of f along the geodesics in the intrinsic manifold of the data, where ∇_M f is the gradient of f along the manifold M. [sent-51, score-0.216]
28 Recent study on spectral graph theory [8] has demonstrated that ||f||_M^2 can be discretely approximated through a nearest neighbor graph on a set of data points. [sent-52, score-0.102]
29 where f = [f1, . . . , fn]^T, D is a diagonal matrix, called the degree matrix, with D_ii = Σ_{j=1}^n W_ij, and L = D − W is the combinatorial graph Laplacian [8]. [sent-56, score-0.068]
30 Intuitively, the regularization incurs a heavy penalty if neighboring points xi and xj are mapped far apart. [sent-59, score-0.087]
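As a concrete illustration of the discrete manifold-regularization term described above, a minimal numpy sketch follows. The binary 0/1 edge weights, the Euclidean neighborhood, and the function name are illustrative assumptions, not the paper's exact settings (the experiments later use a 5-NN graph with cosine similarity).

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Build a symmetric k-NN graph on the columns of X (d x n) and
    return the combinatorial graph Laplacian L = D - W."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist2[i])[:k]] = 1.0              # binary edge weights
    W = np.maximum(W, W.T)                                # symmetrize the graph
    D = np.diag(W.sum(axis=1))                            # degree matrix
    return D - W

# For function values f on the graph nodes, the discrete smoothness penalty is
#   f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2,
# which approximates ||f||_M^2 = integral over M of ||grad_M f||^2 dx.
```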
31 Based on manifold regularization, LapRLS solves the following optimization problem: arg min_w ||X^T w − y||_2^2 + (λ_A/2)||w||_2^2 + (λ_I/2) w^T X L X^T w, (2) where λ_A, λ_I > 0 are positive regularization parameters and X = [x1, . . . , xn] is the data matrix. [sent-60, score-0.159]
32 y = [y1, . . . , yn]^T is the response vector, ||w||_2^2 is the ℓ2 regularization of the linear function, and w^T X L X^T w is the manifold regularization of f(x) = w^T x. [sent-66, score-0.151]
33 When λI = 0, LapRLS reduces to ridge regression [15]. [sent-67, score-0.092]
34 Eq. (2) is a supervised version of LapRLS, because only labeled data are used in the manifold regularization. [sent-70, score-0.191]
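The supervised LapRLS problem in Eq. (2) has a closed-form solution; the sketch below assumes the constant factors from the 1/2 scalings are absorbed into λ_A and λ_I, so that the system matrix matches the one appearing in the error bounds of the next section. The helper names are hypothetical.

```python
import numpy as np

def laprls_fit(X, y, Lap, lam_A=0.1, lam_I=0.1):
    """Closed-form supervised LapRLS for the linear model f(x) = w^T x.
    X is d x m (labeled points as columns), y has length m, and Lap is the
    m x m graph Laplacian on those points.  With lam_I = 0 this reduces to
    ridge regression.  Constant factors from the 1/2 scalings in Eq. (2)
    are assumed to be absorbed into lam_A and lam_I."""
    d = X.shape[0]
    A = X @ X.T + lam_A * np.eye(d) + lam_I * (X @ Lap @ X.T)
    return np.linalg.solve(A, X @ np.asarray(y, dtype=float))

def laprls_predict(w, X_test):
    """Predict responses for test points stored as columns of X_test."""
    return X_test.T @ w
```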
35 3 The Proposed Method 3.1 Problem Formulation The generic problem of selective labeling is as follows. [sent-73, score-0.244]
36 Given the data points {x1, . . . , xn}, namely the pool of candidate data points, our goal is to find a subsample L ⊂ {1, . . . , n}. [sent-77, score-0.248]
37 To derive a selective labeling approach for LapRLS, we first derive an out-of-sample error bound of LapRLS. [sent-81, score-0.384]
38 In this case, the approximation error vanishes and the excess error equals the estimation error. [sent-101, score-0.082]
39 Note that this assumption can be relaxed with more effort, in which case a similar error bound to the one below can be derived. [sent-102, score-0.118]
40 In selective labeling, we are interested in estimating w∗ using LapRLS in Eq. [sent-105, score-0.137]
41 Denote the subsample of X by XL , the subsample of y by yL , and the subsample of ϵ by ϵL . [sent-110, score-0.357]
42 In the following, we will present a deterministic out-of-sample error bound for LapRLS trained on the subsampled data, which is among the main contributions of this paper. [sent-112, score-0.195]
43 Given any fixed V and X = [x1, . . . , xn], and a subsample L of X, the expected error of LapRLS trained on L in predicting the true response V^T w∗ is upper bounded as E||V^T ŵ_L − V^T w∗||_2^2 ≤ (B + σ^2) tr(V^T (X_L X_L^T + λ_A I + λ_I X_L L_L X_L^T)^{−1} V). [sent-120, score-0.229]
44 Similarly, the second term can be bounded by A2 ≤ σ^2 tr(V^T (X_L X_L^T + M_L)^{−1} X_L X_L^T (X_L X_L^T + M_L)^{−1} V) ≤ σ^2 tr(V^T (X_L X_L^T + M_L)^{−1} V), (9) where the first inequality uses E[ϵ_L ϵ_L^T] ⪯ σ^2 I, which becomes an equality if the ϵ_i are independent and identically distributed (i.i.d.). [sent-129, score-0.202]
45 For any fixed X, and a subsample L of X, the expected error of LapRLS trained on L in estimating the true weight vector w∗ is upper bounded as E||ŵ_L − w∗||_2^2 ≤ (B + σ^2) tr((X_L X_L^T + λ_A I + λ_I X_L L_L X_L^T)^{−1}). (10) The proof of this theorem follows derivations similar to those of Theorem 2. [sent-138, score-0.231]
46 3.3 The Criterion of Selective Labeling From Theorem 2, we can see that given a subsample L of X, the expected prediction error of LapRLS on V is upper bounded by the trace expression in Theorem 2. [sent-140, score-0.18]
47 More importantly, the error bound derived in this paper is deterministic, which is unlike those probabilistic error bounds derived based on Rademacher complexity [3] or algorithmic stability [6]. [sent-146, score-0.137]
48 Since those probabilistic bounds hold for an i.i.d. sample rather than a particular sample, they cannot provide a criterion to choose a subsample set for labeling, due to the correlation between the pool of candidate points and the i.i.d. sample. [sent-150, score-0.431]
49 In contrast, the deterministic error bound does not suffer from this problem. [sent-154, score-0.127]
50 Therefore, it provides a natural criterion for selective labeling. [sent-155, score-0.198]
51 In detail, given a pool of candidate data points, i.e., X = [x1, . . . , xn], [sent-156, score-0.102]
52 we select the subsample L ⊂ {1, . . . , n} by minimizing the following objective function: arg min_{L⊂{1,...,n}} tr(X^T (X_L X_L^T + λ_I X_L L_L X_L^T + λ_A I)^{−1} X). (11) [sent-161, score-0.174]
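For illustration, the criterion in Eq. (11) can be evaluated directly for a candidate label set; the sketch below assumes L_L is the principal submatrix of the graph Laplacian indexed by the subsample. The greedy loop is only a naive stand-in to make the criterion concrete, not the relaxation-based optimizer the paper actually proposes.

```python
import numpy as np

def bound_criterion(X, Lap, idx, lam_A=0.1, lam_I=0.1):
    """Value of the selective-labeling objective in Eq. (11) for a candidate
    label set idx.  X is d x n, Lap is the n x n graph Laplacian, and L_L is
    taken to be the principal submatrix of Lap indexed by idx."""
    d = X.shape[0]
    XL = X[:, idx]
    LL = Lap[np.ix_(idx, idx)]
    M = XL @ XL.T + lam_I * (XL @ LL @ XL.T) + lam_A * np.eye(d)
    return np.trace(X.T @ np.linalg.solve(M, X))

def greedy_select(X, Lap, budget, lam_A=0.1, lam_I=0.1):
    """Naive greedy minimization of Eq. (11), shown only for intuition;
    the paper relaxes the combinatorial problem and uses projected
    gradient descent instead."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(budget):
        best = min(remaining, key=lambda j: bound_criterion(
            X, Lap, chosen + [j], lam_A, lam_I))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```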
53 Eq. (11) can be equivalently reformulated as arg min_{S∈S2} tr(X^T (X S S^T X^T + λ_I X S S^T L S S^T X^T + λ_A I)^{−1} X) = arg min_{S∈S2} tr(X^T (X S S^T L′ S S^T X^T + λ_A I)^{−1} X), (15) where L′ = I + λ_I L. [sent-176, score-0.266]
54 Then we solve the following continuous optimization problem: arg min_{S∈S3} tr(X^T (X S S^T L′ S S^T X^T + λ_A I)^{−1} X). (17) [sent-180, score-0.133]
55 We derive a projected gradient descent algorithm to find a local optimum of Eq. (17). [sent-181, score-0.102]
56 Since S^T S = I, we introduce a Lagrange multiplier Λ ∈ R^{l×l}; thus the Lagrangian function is L(S) = tr(X^T (X S S^T L′ S S^T X^T + λ_A I)^{−1} X) + tr(Λ(S^T S − I)). [sent-184, score-0.193]
57 Thus we can use projected gradient descent to find a locally optimal solution of Eq. (17). [sent-193, score-0.08]
58 In order to determine which l data points to select, we need to project S∗ into S1 . [sent-200, score-0.086]
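A rough sketch of this relaxation-plus-rounding pipeline is given below. It assumes S3 means matrices with orthonormal columns (the constraint S^T S = I appearing in the Lagrangian), uses a gradient obtained by standard matrix calculus, projects with the polar factor of a thin SVD, and rounds S∗ by keeping the rows with the largest norms; the paper's exact constraint set, step-size rule, and projection onto S1 may differ, so treat this as an illustrative sketch rather than the paper's algorithm.

```python
import numpy as np

def relaxed_objective(S, X, Lp, lam_A):
    """tr(X^T A^{-1} X) with A = X S S^T L' S S^T X^T + lam_A I, as in Eq. (17)."""
    B = S @ S.T
    A = X @ B @ Lp @ B @ X.T + lam_A * np.eye(X.shape[0])
    return np.trace(X.T @ np.linalg.solve(A, X))

def relaxed_gradient(S, X, Lp, lam_A):
    """Matrix-calculus gradient of the objective above:
    -2 (L' B C + C B L') S, with B = S S^T and C = X^T A^{-1} X X^T A^{-1} X."""
    B = S @ S.T
    A = X @ B @ Lp @ B @ X.T + lam_A * np.eye(X.shape[0])
    P = X.T @ np.linalg.solve(A, X)        # P = X^T A^{-1} X
    C = P @ P
    G = Lp @ B @ C + C @ B @ Lp
    return -2.0 * (G @ S)

def project_orthonormal(S):
    """Nearest matrix with orthonormal columns (polar factor via thin SVD)."""
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return U @ Vt

def select_by_relaxation(X, Lap, budget, lam_A=0.1, lam_I=0.1,
                         step=1e-3, iters=200, seed=0):
    """Projected gradient descent on the relaxed problem, then a simple rounding."""
    n = X.shape[1]
    Lp = np.eye(n) + lam_I * Lap                        # L' = I + lam_I * L
    rng = np.random.default_rng(seed)
    S = project_orthonormal(rng.standard_normal((n, budget)))
    for _ in range(iters):
        S = project_orthonormal(S - step * relaxed_gradient(S, X, Lp, lam_A))
    # One simple rounding back toward S1: keep the points with the largest row norms.
    return np.argsort(-np.linalg.norm(S, axis=1))[:budget]
```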
59 4 Related Work We notice that our proposed method shares a similar spirit with optimal experimental design in statistics [1, 20, 16], whose intent is to select the most informative data points to learn a function that has minimum variance of estimation or minimum variance of prediction. [sent-202, score-0.184]
60 In particular, for ridge regression, it optimizes the following criterion: arg min_{L⊂{1,...,n}} tr((X_L X_L^T + λ_A I)^{−1}). (22) [sent-204, score-0.2]
61 TED selects the samples that minimize the expected predictive variance of ridge regression on the data: arg min_{L⊂{1,...,n}} tr(X^T (X_L X_L^T + λ_A I)^{−1} X). (23) [sent-212, score-0.289]
62 Some literature also calls it active learning, but our understanding is that there is no adaptive interaction between the learner and the oracle within optimal experimental design. [sent-217, score-0.208]
63 Therefore, it is better to call it nonadaptive active learning. [sent-218, score-0.105]
64 Although TED is motivated by minimizing the variance of the prediction, it is interesting to see that the above criterion coincides with minimizing the out-of-sample error bound in Theorem 2 with λ_I = 0. [sent-219, score-0.272]
65 The reason is that for ridge regression, the upper bounds of the bias and variance terms share a common factor tr(X^T (X_L X_L^T + λ_A I)^{−1} X). [sent-220, score-0.202]
66 This is a very important observation because it explains why TED performs very well even though its criterion is minimizing the variance of the prediction. [sent-221, score-0.16]
67 [16] proposed Laplacian Optimal Design (LOD), which selects data points that minimize the expected predictive variance of Laplacian regularized least squares [4] on the data: arg min_{L⊂{1,...,n}} tr(X^T (λ_I X L X^T + X_L X_L^T + λ_A I)^{−1} X), (24) [sent-224, score-0.338]
68 where the graph Laplacian L is computed on all the data points in the pool, i.e., on X. [sent-227, score-0.126]
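For comparison, the three related-work criteria in Eqs. (22)-(24) can be written side by side; the sketch below is a naive numpy transcription with illustrative parameter and function names.

```python
import numpy as np

def aod_criterion(X, idx, lam_A=0.1):
    """A-optimal experimental design for ridge regression, Eq. (22)."""
    XL = X[:, idx]
    return np.trace(np.linalg.inv(XL @ XL.T + lam_A * np.eye(X.shape[0])))

def ted_criterion(X, idx, lam_A=0.1):
    """Transductive experimental design, Eq. (23); it coincides with the
    out-of-sample bound criterion of ridge regression (lam_I = 0)."""
    XL = X[:, idx]
    M = XL @ XL.T + lam_A * np.eye(X.shape[0])
    return np.trace(X.T @ np.linalg.solve(M, X))

def lod_criterion(X, Lap, idx, lam_A=0.1, lam_I=0.1):
    """Laplacian optimal design, Eq. (24): the term X L X^T uses the full
    pool and stays fixed, so only X_L X_L^T changes with the selected set."""
    XL = X[:, idx]
    M = lam_I * (X @ Lap @ X.T) + XL @ XL.T + lam_A * np.eye(X.shape[0])
    return np.trace(X.T @ np.linalg.solve(M, X))
```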
69 LOD selects the points through the term X_L X_L^T while leaving the graph Laplacian term X L X^T fixed. [sent-230, score-0.135]
70 However, our method selects the points through X_L X_L^T as well as through the graph Laplacian term, i.e., X_L L_L X_L^T. [sent-231, score-0.135]
71 Yet it has been well solved by the projected gradient descent algorithm derived in the previous section. [sent-239, score-0.08]
72 5.1 Compared Methods To demonstrate the effectiveness of our proposed method, we compare it with the following baseline approaches: Random Sampling (Random) uniformly selects data points from the pool as training data. [sent-245, score-0.172]
73 Transductive Experiment Design (TED) is proposed in [20], which is the state-of-the-art (non-adaptive) active learning method. [sent-249, score-0.105]
74 Laplacian Optimal Design (LOD) [16] is an extension of TED, which incorporates the manifold structure of the data. [sent-251, score-0.085]
75 Then we construct a 5-NN graph and use the cosine distance to measure the similarity between data points throughout our experiments. [sent-256, score-0.126]
76 Note that the problem setting of our study is to select a batch of data points to label without training a classifier. [sent-257, score-0.166]
77 Therefore, we do not compare our method with typical active learning methods such as SVM active learning [19, 18] and agnostic active learning [2]. [sent-258, score-0.364]
78 After selecting the data points by the above methods, we train a LapRLS [4] as the learner to do classification. [sent-259, score-0.142]
79 As can be seen, the data points selected by AOD are concentrated on the inner circle (belonging to one class), and therefore cannot be used to train a classifier. [sent-268, score-0.136]
80 The data points selected by TED, LOD and Bound are distributed on both the inner and outer circles. [sent-269, score-0.156]
81 Figure 1: Selected points (the red marks) on the two circles dataset by (a) AOD; (b) TED; (c) LOD; and (d) Bound. [sent-317, score-0.113]
82 Furthermore, the 8 data points selected by Bound are uniformly distributed on the two circles, four from the inner circle, and the other four from the outer circle, which can better represent the original data. [sent-319, score-0.125]
83 wdbc is the Wisconsin Diagnostic Breast Cancer data set, which is from the UCI machine learning repository. [sent-322, score-0.117]
84 It aims at predicting whether a breast cancer is benign or malignant based on digitized images. [sent-323, score-0.068]
85 For each data set, we randomly select 20% of the data as a held-out set for model selection, and the remaining 80% as the working set. [sent-332, score-0.098]
86 In order to randomize the experiments, in each run we restrict the training data (the pool of candidate data points) to a random sample of 50% of the working set (which accounts for 40% of the total data). [sent-333, score-0.088]
87 Once the labeled data are selected, we train a semi-supervised version of LapRLS, which uses both labeled and unlabeled data (all the training data) for manifold regularization. [sent-335, score-0.32]
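A sketch of this semi-supervised training step is given below, reusing the helper sketches from earlier; as before, constant factors are assumed to be absorbed into the regularization parameters, and the function names are hypothetical.

```python
import numpy as np

def semi_laprls_fit(X_all, Lap_all, labeled_idx, y_labeled, lam_A=0.1, lam_I=0.1):
    """Semi-supervised LapRLS: the squared loss uses only the labeled points,
    while the manifold term w^T X L X^T w uses the graph Laplacian over all
    training points (labeled + unlabeled).  Constant factors are assumed to
    be absorbed into lam_A and lam_I."""
    d = X_all.shape[0]
    XL = X_all[:, labeled_idx]
    A = XL @ XL.T + lam_A * np.eye(d) + lam_I * (X_all @ Lap_all @ X_all.T)
    return np.linalg.solve(A, XL @ np.asarray(y_labeled, dtype=float))

# Toy end-to-end pipeline, reusing the sketches above:
#   Lap = knn_graph_laplacian(X_train, k=5)
#   idx = greedy_select(X_train, Lap, budget=10)     # or select_by_relaxation(...)
#   w   = semi_laprls_fit(X_train, Lap, idx, y_train[idx])
#   y_hat = np.sign(X_test.T @ w)
```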
88 For the wdbc dataset, the chosen parameters are λA = 0. [sent-341, score-0.095]
89 , 20} points to label, for ORL, we incrementally choose {80, 90, . [sent-353, score-0.087]
90 , 150} points for labeling, and for Isolet1, we choose {30, 40, . [sent-356, score-0.064]
91 In all subfigures, the x-axis represents the number of labeled points, while the y-axis is the averaged classification accuracy on the test data over 10 runs. [sent-362, score-0.106]
92 In order to show some concrete results, we also list the accuracy and running time (in seconds) of all the compared methods on the three datasets with 2, 80 and 30 labeled data points, respectively, in the table below. [sent-363, score-0.192]
93 [Table: classification accuracy (Acc) and running time of the compared methods on wdbc (2 labeled), ORL (80 labeled) and Isolet1 (30 labeled).] [sent-372, score-0.095]
94 We observe that the proposed selective labeling method greatly outperforms the other methods in most cases. [sent-417, score-0.244]
95 The reason is that minimizing the variance of the model parameters does not guarantee the quality of predictions on the data. [sent-419, score-0.074]
96 As we mentioned before, the criterion of TED coincides with minimizing the out-of-sample error bound of ridge regression. [sent-421, score-0.265]
97 The superior performance of our method is attributed to its theoretical foundation, which guarantees that the learner (LapRLS) can achieve small error on the test data. [sent-425, score-0.097]
98 Therefore, we also compare different methods using ridge regression (RR) as the learner. [sent-428, score-0.092]
99 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-464, score-0.107]
100 Support vector machine active learning with applications to text classification. [sent-559, score-0.105]
wordName wordTfidf (topN-words)
[('laprls', 0.566), ('xl', 0.383), ('ted', 0.239), ('lod', 0.238), ('xt', 0.224), ('aod', 0.215), ('vt', 0.144), ('ml', 0.138), ('selective', 0.137), ('subsample', 0.119), ('xlxt', 0.117), ('xsst', 0.117), ('laplacian', 0.11), ('labeling', 0.107), ('active', 0.105), ('sst', 0.103), ('orl', 0.095), ('wdbc', 0.095), ('manifold', 0.085), ('labeled', 0.084), ('tr', 0.082), ('st', 0.074), ('ridge', 0.067), ('ll', 0.067), ('points', 0.064), ('criterion', 0.061), ('yl', 0.057), ('learner', 0.056), ('bound', 0.055), ('pool', 0.055), ('wl', 0.054), ('agnostic', 0.049), ('label', 0.048), ('subsampled', 0.046), ('design', 0.046), ('error', 0.041), ('minimizing', 0.041), ('graph', 0.04), ('acc', 0.039), ('btr', 0.039), ('bxsst', 0.039), ('projected', 0.037), ('wt', 0.035), ('variance', 0.033), ('arg', 0.033), ('select', 0.032), ('circles', 0.031), ('squares', 0.031), ('deterministic', 0.031), ('circle', 0.031), ('selects', 0.031), ('xs', 0.029), ('derivations', 0.029), ('multiplier', 0.029), ('cohn', 0.028), ('speakers', 0.028), ('combinatorial', 0.028), ('lagrange', 0.027), ('xn', 0.027), ('facial', 0.026), ('interaction', 0.026), ('explains', 0.025), ('candidate', 0.025), ('breast', 0.025), ('transductive', 0.025), ('regression', 0.025), ('relax', 0.025), ('gradient', 0.024), ('regularized', 0.024), ('beygelzimer', 0.024), ('regularization', 0.023), ('discretization', 0.023), ('incrementally', 0.023), ('identity', 0.023), ('unlabeled', 0.023), ('nonnegative', 0.023), ('derivative', 0.023), ('trained', 0.022), ('cancer', 0.022), ('tong', 0.022), ('bousquet', 0.022), ('datasets', 0.022), ('data', 0.022), ('derive', 0.022), ('oracle', 0.021), ('aims', 0.021), ('letter', 0.021), ('rademacher', 0.021), ('overcome', 0.02), ('outer', 0.02), ('upper', 0.02), ('yn', 0.02), ('wij', 0.02), ('acquisition', 0.02), ('selected', 0.019), ('equality', 0.019), ('descent', 0.019), ('min', 0.018), ('dataset', 0.018), ('benchmark', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 305 nips-2012-Selective Labeling via Error Bound Minimization
Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding
Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
2 0.17195731 324 nips-2012-Stochastic Gradient Descent with Only One Projection
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1
3 0.1437079 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
Author: Marc Deisenroth, Shakir Mohamed
Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1
4 0.13530903 225 nips-2012-Multi-task Vector Field Learning
Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He
Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1
5 0.11072936 32 nips-2012-Active Comparison of Prediction Models
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1
6 0.11021511 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
7 0.10756842 293 nips-2012-Relax and Randomize : From Value to Algorithms
8 0.10210232 271 nips-2012-Pointwise Tracking the Optimal Regression Function
9 0.089304924 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
10 0.08526919 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
11 0.077948913 292 nips-2012-Regularized Off-Policy TD-Learning
12 0.065568618 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
13 0.064327218 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes
14 0.062653206 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation
15 0.062109973 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.060957566 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
17 0.060868524 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
18 0.059606548 147 nips-2012-Graphical Models via Generalized Linear Models
19 0.059476621 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
20 0.05902376 187 nips-2012-Learning curves for multi-task Gaussian process regression
topicId topicWeight
[(0, 0.171), (1, 0.011), (2, 0.076), (3, 0.082), (4, 0.077), (5, -0.029), (6, 0.003), (7, -0.0), (8, -0.043), (9, -0.083), (10, -0.01), (11, -0.072), (12, -0.038), (13, 0.058), (14, 0.002), (15, -0.004), (16, -0.02), (17, 0.039), (18, 0.041), (19, 0.066), (20, 0.009), (21, -0.025), (22, -0.037), (23, -0.069), (24, 0.024), (25, -0.026), (26, 0.045), (27, 0.079), (28, -0.006), (29, -0.134), (30, -0.195), (31, -0.09), (32, 0.019), (33, -0.038), (34, -0.177), (35, -0.052), (36, -0.026), (37, -0.012), (38, 0.082), (39, -0.023), (40, -0.058), (41, -0.052), (42, -0.052), (43, 0.002), (44, 0.08), (45, -0.078), (46, -0.002), (47, 0.028), (48, -0.078), (49, -0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.9377861 305 nips-2012-Selective Labeling via Error Bound Minimization
Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding
Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
2 0.67927551 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
Author: Marc Deisenroth, Shakir Mohamed
Abstract: Rich and complex time-series data, such as those generated from engineering systems, financial markets, videos, or neural recordings are now a common feature of modern data analysis. Explaining the phenomena underlying these diverse data sets requires flexible and accurate models. In this paper, we promote Gaussian process dynamical systems as a rich model class that is appropriate for such an analysis. We present a new approximate message-passing algorithm for Bayesian state estimation and inference in Gaussian process dynamical systems, a nonparametric probabilistic generalization of commonly used state-space models. We derive our message-passing algorithm using Expectation Propagation and provide a unifying perspective on message passing in general state-space models. We show that existing Gaussian filters and smoothers appear as special cases within our inference framework, and that these existing approaches can be improved upon using iterated message passing. Using both synthetic and real-world data, we demonstrate that iterated message passing can improve inference in a wide range of tasks in Bayesian state estimation, thus leading to improved predictions and more effective decision making. 1
3 0.65273118 324 nips-2012-Stochastic Gradient Descent with Only One Projection
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1
4 0.60912931 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
Author: Toke Hansen, Michael W. Mahoney
Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1
5 0.57224488 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
6 0.56836694 32 nips-2012-Active Comparison of Prediction Models
7 0.56822354 225 nips-2012-Multi-task Vector Field Learning
8 0.54288042 293 nips-2012-Relax and Randomize : From Value to Algorithms
9 0.51690143 292 nips-2012-Regularized Off-Policy TD-Learning
10 0.49885789 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
11 0.48270717 271 nips-2012-Pointwise Tracking the Optimal Regression Function
12 0.47164893 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
13 0.44828999 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
14 0.4349426 11 nips-2012-A Marginalized Particle Gaussian Process Regression
15 0.43330243 168 nips-2012-Kernel Latent SVM for Visual Recognition
16 0.42706317 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
17 0.41854343 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
18 0.41725516 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
19 0.41468224 221 nips-2012-Multi-Stage Multi-Task Feature Learning
20 0.41089281 34 nips-2012-Active Learning of Multi-Index Function Models
topicId topicWeight
[(0, 0.044), (11, 0.286), (17, 0.029), (21, 0.02), (38, 0.136), (42, 0.019), (53, 0.016), (54, 0.04), (55, 0.015), (74, 0.035), (76, 0.123), (80, 0.116), (92, 0.032)]
simIndex simValue paperId paperTitle
1 0.85155946 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
Author: Jaldert Rombouts, Pieter Roelfsema, Sander M. Bohte
Abstract: A key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: by learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity [1]. It is however not well known how such neurons acquire these task-relevant working memories. Here we introduce a biologically plausible learning scheme grounded in Reinforcement Learning (RL) theory [2] that explains how neurons become selective for relevant information by trial and error learning. The model has memory units which learn useful internal state representations to solve working memory tasks by transforming partially observable Markov decision problems (POMDP) into MDPs. We propose that synaptic plasticity is guided by a combination of attentional feedback signals from the action selection stage to earlier processing levels and a globally released neuromodulatory signal. Feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. The neuromodulatory signal interacts with tagged synapses to determine the sign and strength of plasticity. The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) learn to optimally integrate probabilistic evidence for perceptual decision making [5, 6]. 1
2 0.80780762 225 nips-2012-Multi-task Vector Field Learning
Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He
Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1
Author: Hua Wang, Feiping Nie, Heng Huang, Jingwen Yan, Sungeun Kim, Shannon Risacher, Andrew Saykin, Li Shen
Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. Regression analysis has been studied to relate neuroimaging measures to cognitive status. However, whether these measures have further predictive power to infer a trajectory of cognitive performance over time is still an under-explored but important topic in AD research. We propose a novel high-order multi-task learning model to address this issue. The proposed model explores the temporal correlations existing in imaging and cognitive data by structured sparsity-inducing norms. The sparsity of the model enables the selection of a small number of imaging measures while maintaining high prediction accuracy. The empirical studies, using the longitudinal imaging and cognitive data of the ADNI cohort, have yielded promising results.
same-paper 4 0.76117539 305 nips-2012-Selective Labeling via Error Bound Minimization
Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding
Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
5 0.74021351 327 nips-2012-Structured Learning of Gaussian Graphical Models
Author: Karthik Mohan, Mike Chung, Seungyeop Han, Daniela Witten, Su-in Lee, Maryam Fazel
Abstract: We consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the perturbed-node joint graphical lasso, a convex optimization problem that is based upon the use of a row-column overlap norm penalty. We then solve the convex problem using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data. 1
6 0.62935328 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
7 0.61837244 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
8 0.61682057 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
9 0.61564487 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
10 0.6146909 160 nips-2012-Imitation Learning by Coaching
11 0.61443824 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
12 0.61437476 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
13 0.61374503 292 nips-2012-Regularized Off-Policy TD-Learning
14 0.61371499 200 nips-2012-Local Supervised Learning through Space Partitioning
15 0.61309141 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
16 0.61228138 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
17 0.61204726 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
18 0.61115104 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
19 0.61101335 65 nips-2012-Cardinality Restricted Boltzmann Machines
20 0.61100507 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders