nips nips2009 nips2009-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. [sent-7, score-0.963]
2 In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. [sent-8, score-0.912]
3 We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. [sent-9, score-0.363]
4 Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. [sent-10, score-1.527]
5 However, supervised machine learning techniques require large quantities of data be manually labeled so that automatic learning algorithms can build sophisticated models. [sent-14, score-0.164]
6 The challenge is to find ways to exploit the large quantity of unlabeled data and turn it into a resource that can improve the performance of supervised machine learning algorithms. [sent-16, score-0.28]
7 Most of these semi-supervised learning algorithms are applicable only to multiclass classification problems [1, 10, 32], with very few exceptions that develop discriminative models suitable for structured prediction [2, 9, 16, 20, 21, 22]. [sent-19, score-0.142]
8 In this paper, we propose an information theoretic approach for semi-supervised learning of conditional random fields (CRFs) [19], where we use the mutual information between the empirical distribution of unlabeled data and the discriminative model as a data-dependent regularized prior. [sent-20, score-0.723]
9 [16] have proposed a similar information theoretic approach that used the conditional entropy of their discriminative models on unlabeled data as a data-dependent regularization term to obtain very encouraging results. [sent-22, score-0.843]
10 As far as we know, there is no formal principled explanation for the validity of this minimum conditional entropy approach. [sent-24, score-0.425]
11 1 distortion theory framework which is well-known in information theory [14]. [sent-26, score-0.353]
12 Both works are discriminative models and do indeed use mutual information concepts. [sent-29, score-0.269]
13 As a result, their model can be trained by optimizing a convex objective function through a variant of Blahut-Arimoto alternating minimization algorithm, whereas our model is more complex and the objective function becomes non-convex. [sent-33, score-0.131]
14 In particular, training a simple chain structured CRF model [19] in our framework turns out to be intractable even if using Blahut-Arimoto’s type of alternating minimization algorithm. [sent-34, score-0.218]
15 We develop a convergent variational approach to approximately solve this problem. [sent-35, score-0.128]
16 Formally speaking, the notion of compression is quantified by the mutual information between T and X while the informativeness is quantified by the mutual information between T and Y . [sent-39, score-0.482]
17 The objective is to maximize both the joint likelihood on labeled data and the mutual information between the hidden variables and the observations on unlabeled data for a generative model. [sent-42, score-0.712]
18 It is equivalent to minimizing conditional entropy of a generative HMM for the part of unlabeled data. [sent-43, score-0.644]
19 The maximum mutual information of a generative HMM was originally proposed by Bahl et al. [sent-44, score-0.266]
20 , one HMM for each word string), and the point-wise mutual information between the choice of HMM and the observation sequence is maximized. [sent-47, score-0.273]
21 It is equivalent to maximizing the conditional likelihood of a word string given observation sequence to improve the discrimination across different models [18]. [sent-48, score-0.229]
22 [4] proposed a discriminative learning algorithm for generative HMMs of training utterances in speech recognition. [sent-50, score-0.153]
23 In the following, we first motivate our rate distortion approach for semi-supervised CRFs as a data compression scheme and formulate the semi-supervised learning paradigm as a classic rate distortion problem. [sent-51, score-0.968]
24 We then analyze the tractability of the framework for structured prediction and present a convergent variational learning algorithm to defy the combinatorial explosion of terms in the sum over label configurations. [sent-52, score-0.339]
25 Finally we demonstrate encouraging results with two real-world problems to show the effectiveness of the proposed approach: text categorization as a multi-class classification problem and hand-written character recognition as a sequence labeling problem. [sent-53, score-0.264]
26 2 Rate distortion formulation Let X be a random variable over data sequences to be labeled, and Y be a random variable over corresponding label sequences. [sent-55, score-0.412]
27 Our goal is to learn such a model from the combined set of labeled and unlabeled examples, Dl ∪ Du . [sent-58, score-0.352]
28 The standard supervised training procedure for CRFs is based on minimizing the negative log conditional likelihood of the labeled examples in $D_l$: $CL(\theta) = -\sum_{i=1}^{N} \log p_\theta(y^{(i)}|x^{(i)}) + \lambda U(\theta)$ (1), where $U(\theta)$ can be any standard regularizer on $\theta$, e. [sent-60, score-0.457]
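To make objective (1) concrete, here is a minimal sketch for the simplest case of a multi-class log-linear model rather than a chain CRF. The function names, the use of NumPy, and the choice $U(\theta) = \frac{1}{2}\|\theta\|^2$ are assumptions made for illustration; the source only says that U can be any standard regularizer.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with the usual max-shift for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=1, keepdims=True)

def conditional_nll(theta, X, y, lam):
    """CL(theta) = -sum_i log p_theta(y_i | x_i) + lam * U(theta).

    theta: (n_features, n_classes) weights (hypothetical parameterization)
    X:     (N, n_features) labeled inputs
    y:     (N,) integer class labels
    """
    probs = softmax(X @ theta)                          # p_theta(y | x), shape (N, K)
    log_lik = np.log(probs[np.arange(len(y)), y]).sum()
    U = 0.5 * np.sum(theta ** 2)                        # assumed l2 regularizer
    return -log_lik + lam * U
```

With all-zero weights every class has probability 1/K, so the objective reduces to N log K, which is a convenient sanity check.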
29 [16] proposed a semi-supervised learning algorithm that exploits a form of minimum conditional entropy regularization on the unlabeled data. [sent-67, score-0.748]
30 The tradeoff parameters λ and γ control the influences of U (θ) and the unlabeled data, respectively. [sent-69, score-0.258]
31 Here we use $\tilde p_l(x, y)$ to denote the empirical distribution of both X and Y on labeled data $D_l$, $\tilde p_l(x)$ to denote the empirical distribution of X on labeled data $D_l$, and $\tilde p_u(x)$ to denote the empirical distribution of X on unlabeled data $D_u$. [sent-71, score-1.609]
32 Rather than using minimum conditional entropy as a regularization term on unlabeled data, we use minimum mutual information on unlabeled data. [sent-73, score-1.296]
33 This approach has a nice and strong information theoretic interpretation by rate distortion theory. [sent-74, score-0.479]
34 We define the marginal distribution $p_\theta(y)$ of our discriminative model on unlabeled data $D_u$ to be $p_\theta(y) = \sum_{x \in D_u} \tilde p_u(x) p_\theta(y|x)$ over the input data x. [sent-75, score-0.845]
35 Thus in rate distortion terminology, the empirical distribution of unlabeled data $\tilde p_u(x)$ corresponds to the input distribution, the model $p_\theta(y|x)$ corresponds to the probabilistic mapping from X to Y, and $p_\theta(y)$ corresponds to the output distribution of Y. [sent-77, score-1.189]
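The three ingredients named above (input distribution, probabilistic mapping, output distribution) are all that is needed to evaluate the mutual information when the label set is small enough to enumerate. The following is a minimal sketch under that assumption; the function name and the array layout are illustrative and not taken from the paper.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I = sum_x p(x) sum_y p(y|x) log( p(y|x) / p(y) ), in nats.

    p_x:         (n_x,) input distribution, e.g. the empirical p_u(x)
    p_y_given_x: (n_x, n_y) rows are the model's p_theta(y | x)
    """
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    p_y = p_x @ p_y_given_x                    # output distribution p_theta(y)
    joint = p_x[:, None] * p_y_given_x         # p(x) p(y|x)
    indep = p_x[:, None] * p_y[None, :]        # p(x) p(y)
    mask = joint > 0                           # treat 0 log 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))
```

A mapping that ignores x gives zero mutual information, while a deterministic one-to-one mapping on two equally likely inputs gives log 2 nats.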
36 Our proposed rate distortion approach for semi-supervised CRFs optimizes the following constrained optimization problem, $\min_\theta I(\tilde p_u(x), p_\theta(y|x))$ s. [sent-78, score-1.002]
37 $D(\tilde p_l(x, y), \tilde p_l(x) p_\theta(y|x)) + \lambda U(\theta) \le d$ (4) The rationale for this formulation can be seen from an information-theoretic perspective using rate distortion theory [14]. [sent-80, score-0.994]
38 Thus mutual information is the minimum information rate and is used as a good metric for clustering [26, 27]. [sent-88, score-0.354]
39 The true distribution of X should be used to compute the mutual information. [sent-89, score-0.208]
40 Since it is unknown, we use its empirical distribution on the unlabeled data set $D_u$ and the mutual information $I(\tilde p_u(x), p_\theta(y|x))$ instead. [sent-90, score-0.971]
41 However, information rate alone is not enough to characterize a good representation since the rate can always be reduced by throwing away many features in the probabilistic mapping. [sent-91, score-0.146]
42 Therefore we need an additional constraint provided through a distortion function which is presumed to be small for good representations. [sent-93, score-0.353]
43 Apparently there is a tradeoff between minimum representation and maximum distortion. [sent-94, score-0.128]
44 Since the joint distribution gives the distribution for the pair of X and its representation Y, we choose the log likelihood ratio, $\log \frac{p(x,y)}{p(x) p_\theta(y|x)}$, plus a regularized complexity term of $\theta$, $\lambda U(\theta)$, as the distortion function. [sent-95, score-0.543]
45 Thus the expected distortion is the non-negative term $D(p(x, y), p(x) p_\theta(y|x)) + \lambda U(\theta)$. [sent-96, score-0.386]
46 There is a monotonic tradeoff between the rate of the compression and the expected distortion: the larger the rate, the smaller is the achievable distortion. [sent-99, score-0.163]
47 Given a distortion measure between X and Y on the labeled data set Dl , what is the minimum rate description required to achieve a particular distortion on the unlabeled data set Du ? [sent-100, score-1.246]
48 The equivalence between the two problems can be verified using convex analysis [8] by noting that the Lagrangian for the constrained optimization (4) is exactly the objective in the optimization (5) (plus a constant that does not depend on θ), where κ is the Lagrange multiplier. [sent-105, score-0.142]
49 Unfortunately (4) is not a convex optimization problem, because its objective $I(\tilde p_u(x), p_\theta(y|x))$ is not convex. [sent-107, score-0.582]
50 This can be verified using the same argument as in the minimum conditional entropy regularization case [15, 16]. [sent-108, score-0.514]
51 Moreover there are generally local minima in (5) or (6) due to the non-convexity of its mutual information regularization term. [sent-111, score-0.297]
52 Another training method for semi-supervised CRFs is the maximum entropy approach, maximizing conditional entropy (minimizing negative conditional entropy) over unlabeled data $D_u$ subject to the constraint on labeled data $D_l$, $\min_\theta \left( -\sum_{x \in D_u} \tilde p_u(x) H(p_\theta(y|x)) \right)$ s. [sent-112, score-1.661]
53 Again minimizing (8) is not exactly equivalent to (7); however, it is not essential to motivate the optimization criterion. [sent-115, score-0.096]
54 When comparing maximum entropy approach with minimum conditional entropy approach, there is only a sign change on conditional entropy term. [sent-116, score-1.036]
55 For non-parametric models, using the analysis developed in [5, 6, 7, 25], it can be shown that maximum conditional entropy approach is equivalent to rate distortion approach when we compress code vectors in a mass constrained scheme [25]. [sent-117, score-0.841]
56 The difference between our rate distortion approach for semi-supervised CRFs (6) and the minimum conditional entropy regularized semi-supervised CRFs (2) is not only on the different sign of conditional entropy on unlabeled data but also the additional term – entropy of pθ (y) on unlabeled data. [sent-119, score-1.975]
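The comparison in the preceding sentence follows from the standard decomposition of mutual information into an output entropy and a conditional entropy. Written in the notation used here (a worked identity, not a formula copied from the paper):

```latex
\begin{aligned}
I\big(\tilde p_u(x), p_\theta(y|x)\big)
  &= H\big(p_\theta(y)\big) - \sum_{x \in D_u} \tilde p_u(x)\, H\big(p_\theta(y|x)\big), \\
H\big(p_\theta(y)\big)
  &= -\sum_{y} p_\theta(y) \log p_\theta(y),
  \qquad p_\theta(y) = \sum_{x \in D_u} \tilde p_u(x)\, p_\theta(y|x).
\end{aligned}
```

The second term is the conditional entropy that the minimum and maximum conditional entropy methods regularize (with opposite signs); the first term couples all unlabeled examples through $p_\theta(y)$, which is where the sum over label configurations becomes expensive for structured outputs.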
57 It is this term that makes direct computation of the derivative of the objective for the rate distortion approach for semi-supervised CRFs intractable. [sent-120, score-0.537]
58 These make the computation of the derivative intractable even for a simple chain structured CRF. [sent-122, score-0.179]
59 An alternative way to solve (6) is to use the famous algorithm for the computation of the rate distortion function established by Blahut [6] and Arimoto [3]. [sent-123, score-0.426]
60 However as illustrated in the following, this approach is still intractable for structured prediction in our case. [sent-125, score-0.106]
61 First, we assign the initial CRF model to be the optimal solution of the supervised CRF on labeled data and denote it as $p_{\theta^{(0)}}(y|x)$. [sent-127, score-0.164]
62 Then we define $r^{(0)}(y)$ and in general $r^{(t)}(y)$ for $t \ge 1$ by $r^{(t)}(y) = \sum_{x \in D_u} \tilde p_u(x) p_{\theta^{(t)}}(y|x)$ (9). In order to define $p_{\theta^{(1)}}(y|x)$ and in general $p_{\theta^{(t)}}(y|x)$, we need to find the $p_\theta(y|x)$ which minimizes g for a given r(y). [sent-128, score-0.531]
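For intuition about the alternating scheme, the classic Blahut-Arimoto iteration for an unstructured rate distortion problem over small finite alphabets can be sketched as follows. This is the textbook algorithm with a free conditional distribution and a given distortion matrix, not the CRF-parameterized variant discussed here; the function name, the distortion matrix `dist` and the trade-off `beta` are assumptions for illustration.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """Alternating minimization of I(X;Y) + beta * E[d(X,Y)] over finite alphabets.

    p_x:  (n_x,) input distribution
    dist: (n_x, n_y) distortion d(x, y)
    beta: non-negative trade-off between rate and distortion
    """
    p_x = np.asarray(p_x, dtype=float)
    dist = np.asarray(dist, dtype=float)
    n_y = dist.shape[1]
    r = np.full(n_y, 1.0 / n_y)                    # initial output marginal r^(0)(y)
    for _ in range(n_iter):
        # Step 1: best conditional q(y|x) for the current marginal r(y).
        q = r[None, :] * np.exp(-beta * dist)
        q /= q.sum(axis=1, keepdims=True)
        # Step 2: best marginal for the current conditional, the analogue of Eq. (9).
        r = p_x @ q
    return q, r
```

In this unstructured setting both steps have closed forms. The difficulty described in the text arises because the conditional must stay inside the CRF family and the output space is exponentially large, so neither step is this cheap.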
63 This makes the computation of the derivative in the alternating minimization algorithm intractable. [sent-134, score-0.095]
64 3 A variational training procedure: In this section, we derive a convergent variational algorithm to train rate distortion based semi-supervised CRFs for sequence labeling. [sent-135, score-0.717]
65 The basic idea of convexity-based variational inference is to make use of Jensen’s inequality to obtain an adjustable upper bound on the objective function [17]. [sent-136, score-0.115]
66 The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible upper bound. [sent-138, score-0.113]
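The Jensen step can be checked numerically on a small enumerable problem. The sketch below assumes a variational distribution q(x) over the unlabeled points and strictly positive model probabilities; it writes $p_\theta(y) = \sum_x q(x)\,[\tilde p_u(x) p_\theta(y|x)/q(x)]$ and moves the logarithm inside the sum, which yields an upper bound on the output entropy. The function names are hypothetical, and the exact bound used in the paper may differ in detail.

```python
import numpy as np

def marginal_entropy(p_u, p_y_given_x):
    # H(p_theta(y)) with p_theta(y) = sum_x p_u(x) p_theta(y|x)
    p_y = p_u @ p_y_given_x
    return float(-np.sum(p_y * np.log(p_y)))

def jensen_upper_bound(p_u, p_y_given_x, q):
    """-sum_y p(y) sum_l q(x_l) log( p_u(x_l) p(y|x_l) / q(x_l) ) >= H(p(y))."""
    # Requires strictly positive p_u, p_y_given_x and q.
    log_terms = np.log(p_u[:, None] * p_y_given_x / q[:, None])   # (n_x, n_y)
    inner = q @ log_terms                                         # (n_y,)
    p_y = p_u @ p_y_given_x
    return float(-np.sum(p_y * inner))
```

Because the logarithm is concave, the bound holds for any valid q; the variational procedure then adjusts q to make the bound as tight as possible.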
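67 $-\sum_{y} \sum_{x \in D_u} \tilde p_u(x) p_\theta(y|x) \log\Big( \sum_{x \in D_u} q(x) \frac{\tilde p_u(x) p_\theta(y|x)}{q(x)} \Big) \le -\sum_{j=N+1}^{M} \sum_{y} \tilde p_u(x^{(j)}) p_\theta(y|x^{(j)}) \Big[ \sum_{l=N+1}^{M} q(x^{(l)}) \log \frac{\tilde p_u(x^{(l)}) p_\theta(y|x^{(l)})}{q(x^{(l)})} \Big]$ Thus the desideratum of finding a tight upper bound of $RL_{MI}(\theta)$ in Eq. [sent-141, score-2.148]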
68 Without loss of generality, we assume all the unlabeled data are of equal lengths in the sequence labeling case. [sent-151, score-0.389]
69 Nevertheless, similar to [21], we compute this term as follows: we first define pairwise subsequence constrained entropy on (x(j) , x(l) ) (as opposed to the subsequence constrained entropy defined in [21]) as: σ Hjl (y−(a. [sent-155, score-0.633]
70 If we have $H_{jl}$ for all (a, b), then the term $\sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)}) f_k(x^{(j)}, y)$ can be easily computed. [sent-174, score-0.091]
71 n |yb , x(j) , x(l) ) β α Given Hjl (·) and Hjl (·), any sequence entropy can be computed in constant time [21]. [sent-197, score-0.269]
72 i |yi+1 , x(j) , x(l) ) $= \sum_{y_i} p_\theta(y_i|y_{i+1}, x^{(j)}) \log p_\theta(y_i|y_{i+1}, x^{(l)}) + \sum_{y_i} p_\theta(y_i|y_{i+1}, x^{(j)})$ Hαjl (y1. [sent-200, score-0.112]
73 The purpose of the first task is to show the effectiveness of rate distortion approach over minimum and maximum conditional entropy approaches when no approximation is needed in training. [sent-208, score-0.882]
74 In the second task, a variational method has to be used to train semi-supervised chain structured CRFs. [sent-209, score-0.191]
75 We demonstrate the effectiveness of the rate distortion approach over minimum and maximum conditional entropy approaches even when an approximation is used during training. [sent-210, score-0.882]
76 For each label, we rank words based on their mutual information with that label (whether it predicts label 1 or 0). [sent-215, score-0.304]
77 For each problem, we select 15% of the training data, almost 150 instances, as the labeled training data and select the unlabeled data from the remaining data. [sent-217, score-0.442]
78 We vary the ratio between the amount of unlabeled and labeled data, repeat the experiments ten times with different randomly selected labeled and unlabeled training data, and report the mean and standard deviation over different trials. [sent-222, score-0.789]
79 For each run, we initialize the model parameter for mutual information (MI) regularization and maximum/minimum conditional entropy (CE) regularization using the parameter learned from a l2 -regularized logistic regression classifier. [sent-223, score-0.738]
80 Figure 1 shows the classification accuracies of these four regularization methods versus the ratio between the amount of unlabeled and labeled data on different classification problems. [sent-224, score-0.523]
81 We can see that mutual information regularization outperforms the other three regularization schemes. [sent-225, score-0.407]
82 In most cases, maximum CE regularization outperforms minimum CE regularization and the baseline (logistic regression with l2 regularization) which uses only the labeled data. [sent-226, score-0.421]
83 [Plots omitted; x-axis: ratio unlabel/label] Figure 1: Results on five different binary classification problems in text categorization (left to right): comp. [sent-275, score-0.186]
84 [Plots omitted; x-axis: ratio unlabel/label] Figure 2: Results on hand-written character recognition: (left) sequence labeling; (right) multi-class classification. [sent-311, score-0.24]
85 2 Hand-written character recognition Our dataset for hand-written character recognition contains ∼6000 handwritten words with average length of ∼8 characters. [sent-313, score-0.232]
86 Each word was divided into characters, and each character was resized to a 16 × 8 binary image. [sent-314, score-0.101]
87 We choose ∼600 words as labeled data, ∼600 words as validation data, and ∼2000 words as test data. [sent-315, score-0.178]
88 Similar to text categorization, we vary the ratio between the amount of unlabeled and labeled data, and report the mean and standard deviation of classification accuracies over several trials. [sent-316, score-0.441]
89 We use a chain structured graph to model hand-written character recognition as a sequence labeling problem, similar to [29]. [sent-317, score-0.314]
90 Since the unlabeled data may have different lengths, we modify the mutual information as $I = \sum_{\ell} I_\ell$, where $I_\ell$ is the mutual information computed on all the unlabeled data with length $\ell$. [sent-318, score-0.926]
91 We compare our approach (MI) with other regularizations (maximum/minimum conditional entropy, l2 ). [sent-319, score-0.149]
92 As a sanity check, we have also tried solving hand-written character recognition as a multi-class classification problem, i. [sent-322, score-0.106]
93 We can see that MI regularization outperforms maxCE, minCE and l2 regularizations in both multi-class and sequence labeling cases. [sent-327, score-0.229]
94 There are significant gains in the structured learning compared with the standard multi-class classification setting. [sent-328, score-0.081]
95 The proposed approach is motivated by the rate distortion framework in information theory and utilizes the mutual information on the unlabeled data as a regularization term, to be more precise a data dependent prior. [sent-330, score-0.999]
96 Maximum mutual information estimation of hidden Markov model parameters for speech recognition. [sent-352, score-0.275]
97 Semi-supervised conditional random fields for improved sequence segmentation and labeling. [sent-413, score-0.165]
98 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-430, score-0.094]
99 Efficient computation of entropy gradient for semi-supervised conditional random fields. [sent-443, score-0.352]
100 Generalized expectation criteria for semi-supervised learning of conditional random fields. [sent-448, score-0.124]
wordName wordTfidf (topN-words)
[('pu', 0.508), ('distortion', 0.353), ('pl', 0.284), ('unlabeled', 0.234), ('entropy', 0.228), ('mutual', 0.208), ('hjl', 0.207), ('crfs', 0.174), ('maxce', 0.15), ('mince', 0.15), ('du', 0.135), ('conditional', 0.124), ('labeled', 0.118), ('dl', 0.115), ('corduneanu', 0.099), ('regularization', 0.089), ('structured', 0.081), ('mi', 0.081), ('character', 0.077), ('variational', 0.077), ('jiao', 0.075), ('rlmi', 0.075), ('minimum', 0.073), ('rate', 0.073), ('crf', 0.071), ('compression', 0.066), ('ratio', 0.061), ('discriminative', 0.061), ('log', 0.058), ('bahl', 0.056), ('mmihmm', 0.056), ('yi', 0.054), ('theoretic', 0.053), ('labeling', 0.053), ('hmm', 0.052), ('convergent', 0.051), ('ib', 0.045), ('oliver', 0.045), ('vs', 0.044), ('grandvalet', 0.042), ('speech', 0.041), ('sequence', 0.041), ('subsequence', 0.04), ('derivative', 0.04), ('lengths', 0.04), ('objective', 0.038), ('label', 0.038), ('defy', 0.038), ('rlmince', 0.038), ('optimization', 0.036), ('categorization', 0.036), ('classi', 0.035), ('chain', 0.033), ('explosion', 0.033), ('scholk', 0.033), ('term', 0.033), ('constrained', 0.032), ('minimizing', 0.031), ('maximum', 0.031), ('garg', 0.03), ('fk', 0.03), ('bottleneck', 0.03), ('alternating', 0.029), ('wang', 0.029), ('motivate', 0.029), ('recognition', 0.029), ('text', 0.028), ('pf', 0.027), ('generative', 0.027), ('minimization', 0.026), ('hidden', 0.026), ('mann', 0.025), ('degeneracy', 0.025), ('regularizations', 0.025), ('tishby', 0.025), ('elds', 0.025), ('supervised', 0.025), ('intractable', 0.025), ('word', 0.024), ('tradeoff', 0.024), ('training', 0.024), ('enhance', 0.024), ('characters', 0.024), ('jensen', 0.024), ('minimizes', 0.023), ('regularized', 0.022), ('data', 0.021), ('string', 0.021), ('semisupervised', 0.021), ('tractability', 0.021), ('outperforms', 0.021), ('ce', 0.021), ('jaakkola', 0.02), ('dynamic', 0.02), ('words', 0.02), ('accuracy', 0.02), ('cation', 0.019), ('likelihood', 0.019), ('nips', 0.019), ('parametric', 
0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
2 0.13397507 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in $\mathbb{R}^d$, $d \ge 2$, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup. In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain $\Omega \subset \mathbb{R}^d$ with a smooth boundary. Let $y : \Omega \to Y$ be the unknown function we wish to estimate. In case of regression $Y = \mathbb{R}$ whereas in binary classification $Y = \{-1, 1\}$.
The standard (transductive) semi-supervised learning problem is formulated as follows: Given l labeled points, $(x_1, y_1), \ldots, (x_l, y_l)$, with $y_i = y(x_i)$, and u unlabeled points $x_{l+1}, \ldots, x_{l+u}$, with all points $x_i$ sampled i.i.d. from p(x), the goal is to construct an estimate of $y(x_{l+i})$ for any unlabeled point $x_{l+i}$, utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and $u \to \infty$. 2 SSL with Graph Laplacian Regularization. We first consider the following graph-based approach formulated by Zhu et al. [15]: $\hat y(x) = \arg\min_{y} I_n(y)$ subject to $y(x_i) = y_i$, $i = 1, \ldots, l$ (1), where $I_n(y) = \frac{1}{n^2} \sum_{i,j} W_{i,j} (y(x_i) - y(x_j))^2$ (2) is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W. This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points $x_i \in \mathbb{R}^d$ are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: $W_{i,j} = G\left(\frac{\|x_i - x_j\|}{\sigma}\right)$ (3) where $G : \mathbb{R}^+ \to \mathbb{R}^+$ is some function with an adequately fast decay. Popular choices are the Gaussian filter $G(z) = e^{-z^2/2}$ or the ε-neighborhood graph obtained by the step filter $G(z) = \mathbf{1}_{z<1}$.
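Formulation (1) to (3) with hard label constraints has a closed-form solution on a finite sample: setting the gradient of $I_n$ with respect to the unlabeled values to zero gives a linear system in the graph Laplacian. The sketch below assumes the Gaussian filter and a dense NumPy solve; the function name and the convention that the labeled points come first are illustrative choices, not details from the paper.

```python
import numpy as np

def laplacian_ssl(X, y_labeled, sigma):
    """Minimize sum_ij W_ij (y_i - y_j)^2 with the first l values fixed to y_labeled.

    X:         (n, d) all points, labeled points first
    y_labeled: (l,) labels of the first l points
    sigma:     bandwidth of the Gaussian filter
    Returns the predictions on the unlabeled points and the similarity matrix W.
    """
    y_l = np.asarray(y_labeled, dtype=float)
    l = len(y_l)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))     # G(z) = exp(-z^2 / 2), z = dist / sigma
    np.fill_diagonal(W, 0.0)                       # self-similarity does not affect I_n
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian D - W
    # Stationarity on the unlabeled block: L_uu y_u = -L_ul y_l
    y_u = np.linalg.solve(L[l:, l:], -L[l:, :l] @ y_l)
    return y_u, W
```

The solution is harmonic: each unlabeled prediction equals the W-weighted average of all other predictions, so it always lies between the smallest and largest label.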
For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term $I_n(y)$, e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term. As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): $\lim_{n \to \infty} I_n(y) = I^{(\sigma)}(y) = \int_\Omega \int_\Omega G\left(\frac{\|x - x'\|}{\sigma}\right) (y(x) - y(x'))^2 p(x) p(x')\, dx\, dx'$ (4). In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as $n \to \infty$. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term $I_n(y)$, at least when $\sigma \to 0$ slowly enough (see footnote 1), namely when $\sigma = \omega(\sqrt[d]{\log n / n})$. If $\sigma \to 0$ as $n \to \infty$, and as long as $n\sigma^d / \log n \to \infty$, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: $\lim_{n \to \infty} \frac{d}{C\sigma^{d+2}} I_n(y) = \lim_{\sigma \to 0} \frac{d}{C\sigma^{d+2}} I^{(\sigma)}(y) = J(y) = \int_\Omega \|\nabla y(x)\|^2 p(x)^2\, dx$ (5) where $C = \int_{\mathbb{R}^d} \|z\|^2 G(\|z\|)\, dz$, and assuming $0 < C < \infty$ (which is the case for both the Gaussian and the step filters). This energy functional J(f) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: $\hat y(x) = \arg\min_{y} J(y)$ subject to $y(x_i) = y_i$, $i = 1, \ldots, l$ (6). Footnote 1: When $\sigma = o(\sqrt[d]{1/n})$ then all non-diagonal weights $W_{i,j}$ vanish (points no longer have any “close by” neighbors).
We are not aware of an analysis covering the regime where σ decays roughly as $\sqrt[d]{1/n}$, but would be surprised if a qualitatively different meaningful limit is reached. 3 Graph Laplacian Regularization in $\mathbb{R}^1$. We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and $x \in \mathbb{R}$. We first consider the situation where the support of p(x) is a continuous interval $\Omega = [a, b] \subset \mathbb{R}$ (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order $a \le x_1 < x_2 < \cdots < x_l \le b$. Applying the theory of variational calculus, the solution $\hat y(x)$ satisfies inside each interval $(x_i, x_{i+1})$ the Euler-Lagrange equation $\frac{d}{dx}\left[ p^2(x) \frac{dy}{dx} \right] = 0$. Performing two integrations and enforcing the constraints at the labeled points yields $y(x) = y_i + \frac{\int_{x_i}^{x} 1/p^2(t)\, dt}{\int_{x_i}^{x_{i+1}} 1/p^2(t)\, dt} (y_{i+1} - y_i)$ for $x_i \le x \le x_{i+1}$ (7), with $y(x) = y_1$ for $a \le x \le x_1$ and $y(x) = y_l$ for $x_l \le x \le b$. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: Theorem 1. Let p(x) be a smooth density on $\Omega = [a, b] \subset \mathbb{R}$ such that $A_p = \frac{1}{4} \int_a^b 1/p^2(t)\, dt < \infty$. Then, J(f) can be written as a squared semi-norm $J(f) = \|f\|^2_{K_p}$ induced by the kernel $K_p(x, x') = A_p - \frac{1}{2} \left| \int_x^{x'} \frac{1}{p^2(t)}\, dt \right|$ (8) with a null-space of all constant functions. That is, $\|f\|_{K_p}$ is the norm of the projection of f onto the RKHS induced by $K_p$.
If p(x) is supported on several disjoint intervals, $\Omega = \cup_i [a_i, b_i]$, then J(f) can be written as a squared semi-norm induced by the kernel $K_p(x, x') = \frac{1}{4} \int_{a_i}^{b_i} \frac{dt}{p^2(t)} - \frac{1}{2} \left| \int_x^{x'} \frac{dt}{p^2(t)} \right|$ if $x, x' \in [a_i, b_i]$, and $K_p(x, x') = 0$ if $x \in [a_i, b_i]$, $x' \in [a_j, b_j]$, $i \ne j$ (9), with a null-space spanned by indicator functions $\mathbf{1}_{[a_i, b_i]}(x)$ on the connected components of Ω. Proof. For any $f(x) = \sum_i \alpha_i K_p(x, x_i)$ in the RKHS induced by $K_p$: $J(f) = \int \left( \frac{df}{dx} \right)^2 p^2(x)\, dx = \sum_{i,j} \alpha_i \alpha_j J_{ij}$ (10) where $J_{ij} = \int \frac{d}{dx} K_p(x, x_i) \frac{d}{dx} K_p(x, x_j)\, p^2(x)\, dx$. When $x_i$ and $x_j$ are in different connected components of Ω, the gradients of $K_p(\cdot, x_i)$ and $K_p(\cdot, x_j)$ are never non-zero together and $J_{ij} = 0 = K_p(x_i, x_j)$. When they are in the same connected component [a, b], and assuming w.l.o.g. $a \le x_i \le x_j \le b$: $J_{ij} = \frac{1}{4} \left[ \int_a^{x_i} \frac{1}{p^2(t)}\, dt + \int_{x_i}^{x_j} \frac{-1}{p^2(t)}\, dt + \int_{x_j}^{b} \frac{1}{p^2(t)}\, dt \right] = \frac{1}{4} \int_a^b \frac{1}{p^2(t)}\, dt - \frac{1}{2} \int_{x_i}^{x_j} \frac{1}{p^2(t)}\, dt = K_p(x_i, x_j)$. Substituting $J_{ij} = K_p(x_i, x_j)$ into (10) yields $J(f) = \sum_{i,j} \alpha_i \alpha_j K_p(x_i, x_j) = \|f\|^2_{K_p}$ (11). Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: $y(x) = \sum_{j=1}^{l} \alpha_j K_p(x, x_j) + \sum_i \beta_i \mathbf{1}_{[a_i, b_i]}(x)$, where i ranges over the connected components $[a_i, b_i]$ of Ω, and we have: $J(y) = \sum_{i,j=1}^{l} \alpha_i \alpha_j K_p(x_i, x_j)$ (12). Viewing the regularizer as $\|y\|^2_{K_p}$ suggests understanding (6), and so also its empirical approximation (1), by interpreting $K_p(x, x')$ as a density-based “similarity measure” between x and x'. This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which $1/p^2(x)$ is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region.
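The one-dimensional solution (7) is easy to evaluate numerically: between two consecutive labeled points the prediction moves from one label to the next in proportion to the accumulated integral of $1/p^2$. The following is a minimal sketch that approximates the integrals with a trapezoid rule on a uniform grid; the function name, the grid size, and the requirement that `density` accepts a NumPy array and returns strictly positive values are assumptions for illustration.

```python
import numpy as np

def ssl_1d(x_query, x_lab, y_lab, density, grid_size=2001):
    """Evaluate Eq. (7) at the query points.

    x_lab:   sorted labeled locations x_1 < ... < x_l
    y_lab:   labels y_1, ..., y_l
    density: callable returning p(t) > 0 for an array of points t
    """
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    out = np.empty_like(x_query)
    for k, x in enumerate(x_query):
        if x <= x_lab[0]:
            out[k] = y_lab[0]                      # constant left of the first label
        elif x >= x_lab[-1]:
            out[k] = y_lab[-1]                     # constant right of the last label
        else:
            i = np.searchsorted(x_lab, x, side="right") - 1
            t = np.linspace(x_lab[i], x_lab[i + 1], grid_size)
            w = 1.0 / density(t) ** 2
            # Cumulative trapezoid integral of 1/p^2 starting at x_i.
            steps = 0.5 * (w[1:] + w[:-1]) * np.diff(t)
            cum = np.concatenate([[0.0], np.cumsum(steps)])
            frac = np.interp(x, t, cum) / cum[-1]
            out[k] = y_lab[i] + frac * (y_lab[i + 1] - y_lab[i])
    return out
```

For a uniform density this reduces to linear interpolation between the labels, while a density that dips between two labeled points concentrates most of the label change inside the dip.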
Furthermore, there is no dependence between points in disjoint components separated by zero density regions.

4 Graph Laplacian Regularization in Higher Dimensions

The analysis of the previous section seems promising, as it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is no longer the case. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points.

Theorem 2. Let $p(x)$ be a smooth density over $\mathbb{R}^d$, $d \ge 2$, bounded from above by some constant $p_{\max}$, and let $(x_1, y_1), \ldots, (x_l, y_l)$ be any (non-repeating) set of labeled examples. There exist continuous functions $y_\epsilon(x)$, for any $\epsilon > 0$, all satisfying the constraints $y_\epsilon(x_j) = y_j$, $j = 1, \ldots, l$, such that $J(y_\epsilon) \to 0$ as $\epsilon \to 0$ but $y_\epsilon(x) \to 0$ as $\epsilon \to 0$ for all $x \ne x_j$, $j = 1, \ldots, l$.

Proof. We present a detailed proof for the case of $l = 2$ labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at $x_0 = 0$ with $y(x_0) = 0$ and the second labeled point is at $x_1$ with $\|x_1\| = 1$ and $y(x_1) = 1$. In addition, we assume that the ball $B_1(0)$ of radius one centered around the origin is contained in $\Omega = \{x \in \mathbb{R}^d \mid p(x) > 0\}$.

We first consider the case $d > 2$. Here, for any $\epsilon > 0$, consider the function

$$y_\epsilon(x) = \min\left(\frac{\|x\|}{\epsilon},\, 1\right)$$

which indeed satisfies the two constraints $y_\epsilon(x_i) = y_i$, $i = 0, 1$. Then,

$$J(y_\epsilon) = \int_{B_\epsilon(0)} \frac{p^2(x)}{\epsilon^2}\,dx \le \frac{p_{\max}^2}{\epsilon^2}\int_{B_\epsilon(0)} dx = p_{\max}^2 V_d\, \epsilon^{d-2} \tag{13}$$

where $V_d$ is the volume of a unit ball in $\mathbb{R}^d$. Hence, the sequence of functions $y_\epsilon(x)$ satisfies the constraints, but for $d > 2$, $\inf_\epsilon J(y_\epsilon) = 0$.

For $d = 2$, a more extreme example is necessary: consider the functions

$$y_\epsilon(x) = \log\left(\frac{\|x\|^2 + \epsilon}{\epsilon}\right) \Big/ \log\left(\frac{1 + \epsilon}{\epsilon}\right) \quad \text{for } \|x\| \le 1$$

and $y_\epsilon(x) = 1$ for $\|x\| > 1$.
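As a numerical aside on the two constructions just defined, the sketch below evaluates the right-hand side of (13) for $d > 2$ and an upper bound on the energy of the logarithmic functions for $d = 2$. It assumes only that $p(x) \le p_{\max}$ on the unit ball; the function names and the midpoint quadrature are choices made for this sketch, and $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ is the standard formula for the volume of the unit ball.

```python
import math


def unit_ball_volume(d):
    """Volume V_d of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def spike_energy_bound(d, eps, p_max=1.0):
    """Right-hand side of (13): p_max^2 * V_d * eps^(d-2), intended for d > 2."""
    return p_max ** 2 * unit_ball_volume(d) * eps ** (d - 2)


def radial_integral(eps, steps=200000):
    """Midpoint rule for the integral over [0, 1] of r^2 / (r^2 + eps)^2 * 2*pi*r dr."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += r ** 2 / (r ** 2 + eps) ** 2 * 2.0 * math.pi * r
    return total * h


def log_spike_energy_bound(eps, p_max=1.0):
    """Upper bound on J(y_eps) for d = 2, using |grad y_eps|^2 and p <= p_max."""
    log_term = math.log((1.0 + eps) / eps)
    return 4.0 * p_max ** 2 * radial_integral(eps) / log_term ** 2
```

Calling `spike_energy_bound(3, eps)` for decreasing `eps` shows the linear decay in three dimensions, whereas `log_spike_energy_bound(eps)` decays only like the inverse of a logarithm, which is why the two-dimensional case needs the more extreme construction.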
These functions satisfy the two constraints $y_\epsilon(x_i) = y_i$, $i = 0, 1$ and:

$$J(y_\epsilon) = \frac{4}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2} \int_{B_1(0)} \frac{\|x\|^2}{(\|x\|^2 + \epsilon)^2}\, p^2(x)\,dx \le \frac{4 p_{\max}^2}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2} \int_0^1 \frac{r^2}{(r^2 + \epsilon)^2}\, 2\pi r\,dr \le \frac{4\pi p_{\max}^2}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2} \log\left(\frac{1+\epsilon}{\epsilon}\right) = \frac{4\pi p_{\max}^2}{\log\left(\frac{1+\epsilon}{\epsilon}\right)} \to 0 \quad \text{as } \epsilon \to 0.$$

The implication of Theorem 2 is that regardless of the values at the labeled points, as $u \to \infty$, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel $K_p$, as in the one-dimensional case, is not possible.

Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in

$$\hat{y} = \arg\min_{y(x)} \frac{1}{l}\sum_{j=1}^{l} (y(x_j) - y_j)^2 + \gamma I_n(y).$$

In the limit of infinite unlabeled data, functions of the form $y_\epsilon(x)$ above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term $J(y)$ to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points.

4.1 Numerical Example

We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density $p(x)$ in $\mathbb{R}^2$, which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at $(4, 0)$. We sample a total of $n = 3000$ points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with $\sigma = 0.4$.

Figure 1 depicts the predictor $\hat{y}(x)$ obtained from (1). In fact, two different predictors are shown, obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution $\hat{y}(x)$ of (1) satisfies:

$$\hat{y}(x_i) = \sum_{j=1}^{n} W_{ij}\, \hat{y}(x_j) \Big/ \sum_{j=1}^{n} W_{ij} \quad \text{on all unlabeled points } i = l + 1, \ldots, l + u. \tag{14}$$

Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB's backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for $\hat{y}(x_i)$, which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on $x_1, \ldots, x_l$. This is the method used in the bottom panels of Figure 1.

As predicted, $\hat{y}(x)$ is almost constant for almost all unlabeled points. Although all values are very close to zero, thresholding at the "right" threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known a priori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%.

The differences between the two numerical methods for solving (1) also point to another problem with the ill-posedness of the limit problem: the solution is numerically very unstable.

A more quantitative evaluation, which also validates that the effect in Figure 1 is not a result of choosing a "wrong" bandwidth $\sigma$, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error and the classification error of the resulting predictor $\hat{y}(x)$. We plot the classification error both when a threshold of zero is used (i.e. the class is determined by $\mathrm{sign}(\hat{y}(x))$) and with the ideal threshold minimizing the test error. For each unlabeled sample size, we choose the bandwidth $\sigma$ yielding the best test performance (this is a "cheating" approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter.

Figure 1 (panel titles: DIRECT INVERSION, SIGN ERROR: 45%; POWER METHOD, SIGN ERR: 17.1): Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations; the solution has not yet converged, despite small residuals of the order of $2 \cdot 10^{-4}$.

Figure 2 (panel titles: SQUARED ERROR, 0-1 ERROR (THRESHOLD=0), 0-1 ERROR (IDEAL THRESHOLD), each paired with OPTIMAL BANDWIDTH): Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threshold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted.

4.2 Probabilistic Interpretation, Exit and Hitting Times

As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let $x(t)$ denote a random walk on the graph with transition matrix $M = D^{-1} W$ where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Then, for the binary classification case with $y_i = \pm 1$ we have [15]:

$$\hat{y}(x_i) = 2 \Pr\left[\, x(t) \text{ hits a point labeled } +1 \text{ before hitting a point labeled } -1 \,\middle|\, x(0) = x_i \right] - 1.$$

We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and $\sigma \to 0$, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before -1 will become uniform across the entire graph, independent of the starting location $x_i$, yielding a flat predictor.

5 Keeping $\sigma$ Finite

At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit $\sigma \to 0$. One possible objection is that there is an intrinsic characteristic scale for the data $\sigma_0$ where (with high probability) all points at a distance $\|x_i - x_j\| < \sigma_0$ have the same label.
If this is the case, then it may not necessarily make sense to take values of $\sigma < \sigma_0$ in constructing $W$. However, keeping $\sigma$ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function $y(x)$ which is zero everywhere except at the labeled points, where $y(x_j) = y_j$. With a finite number of labeled points of measure zero, $I^{(\sigma)}(y) = 0$ in any dimension and for any fixed $\sigma > 0$. While this limiting function is discontinuous, it is also possible to construct a sequence of continuous functions $y_\epsilon$ that all satisfy the constraints and for which $I^{(\sigma)}(y_\epsilon) \to 0$ as $\epsilon \to 0$.

This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at $x = 4$, with one Gaussian labeled -1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed $\sigma$, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as $u \to \infty$.

Figure 3 (panels: 50 points, 500 points, 3500 points; axes: x, y): Minimizer of (1) for a 1-d problem with a fixed $\sigma = 0.4$, two labeled points and an increasing number of unlabeled points.

6 Fourier-Eigenvector Based Methods

Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3].
Instead of using the Laplacian as a regularizer, constraining candidate predictors $y(x)$ non-parametrically to those with small $I_n(y)$ values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian. The similarity matrix $W$ is computed as before, and the Graph Laplacian matrix $L = D - W$ is considered (recall $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$). Only predictors

$$\hat{y}(x) = \sum_{j=1}^{p} a_j e_j \tag{15}$$

spanned by the first $p$ eigenvectors $e_1, \ldots, e_p$ of $L$ (with smallest eigenvalues) are considered. The coefficients $a_j$ are chosen by minimizing a loss function on the labeled data, e.g. the squared loss:

$$(\hat{a}_1, \ldots, \hat{a}_p) = \arg\min \sum_{j=1}^{l} (y_j - \hat{y}(x_j))^2. \tag{16}$$

Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit $u \to \infty$. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4].

Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that 'the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis'. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions.
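The eigenvector method (15)–(16) described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the Gaussian weight $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, the function name `laplacian_eigenvector_fit` and the toy two-cluster data are all choices made for this sketch.

```python
import numpy as np


def laplacian_eigenvector_fit(X, labeled_idx, y_labeled, sigma, p):
    """Least-squares fit (16) in the span (15) of the p smoothest eigenvectors of L."""
    # Pairwise squared distances and Gaussian similarities (assumed kernel form).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Unnormalized graph Laplacian L = D - W with D_ii = sum_j W_ij.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order, so the first p columns
    # are the eigenvectors with the smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    E = vecs[:, :p]
    # Squared-loss fit of the coefficients a_j on the labeled points only.
    a, *_ = np.linalg.lstsq(E[labeled_idx], y_labeled, rcond=None)
    return E @ a


# Toy data (hypothetical): two well-separated 1-D clusters, one label each.
X = np.concatenate([np.linspace(-1.0, 1.0, 30), np.linspace(3.0, 5.0, 30)]).reshape(-1, 1)
labeled_idx = np.array([15, 45])
y_labeled = np.array([-1.0, 1.0])
y_hat = laplacian_eigenvector_fit(X, labeled_idx, y_labeled, sigma=0.4, p=2)
```

On this toy data the two leading eigenvectors are close to piecewise constant on the clusters, so `y_hat` is close to -1 on the first cluster and +1 on the second, which matches the low density separation view of why the method works.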
Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well the unknown classification function can be approximated by a truncated expansion with relatively few eigenvectors.

We illustrate this issue with the following three-dimensional example: Let $p(x)$ denote the uniform density in the box $[0, 1] \times [0, 0.8] \times [0, 0.6]$, where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, $y_1(x) = 1_{x_1 > 0.5}$, $y_2(x) = 1_{x_1 > x_2/0.8}$ and $y_3(x) = 1_{x_2/0.8 > x_3/0.6}$. Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors $p$ is set to $p = l/5$ as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of $p$ selected on the test set (i.e. a "cheating" choice representing the best generalization error this method can achieve).

Figure 4 (panel titles: p = #labeled points/5, optimal p, 20 labeled points, Approx. Error; axes: Prediction Error (%), # labeled points, # eigenvectors): Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors.

The reason for this behavior is that $y_2(x)$ and even more so $y_3(x)$ cannot be as easily approximated by the very few leading eigenfunctions. Even though they seem "simple" and "smooth", they are significantly more complicated than $y_1(x)$ in terms of the measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form $\psi_{i,j,k}(x) = \cos(i\pi x_1) \cos(j\pi x_2/0.8) \cos(k\pi x_3/0.6)$, making it hard to represent simple decision boundaries which are not axis-aligned.

7 Discussion

Our results show that a popular SSL method, the Laplacian Regularization method (1), is not well-behaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons.

First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the numbers of both labeled and unlabeled points grow while $l/u$ is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the "just right" number of unlabeled points without any theoretical guidance.

Second, in our experiments we noticed that although the predictor $\hat{y}(x)$ becomes extremely flat, in binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it.
Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression.

The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least $\frac{d+1}{2}$ derivatives in $\mathbb{R}^d$. In the context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix, with $L^{\frac{d+1}{2}}$ (matrix to the $\frac{d+1}{2}$ power). In the infinite unlabeled data limit, this corresponds to regularizing all order-$\frac{d+1}{2}$ (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather than ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL.

A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the "smoothness" of $y(x)$ relative to $p(x)$ remain unclear.

Acknowledgments

The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06).

References

[1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975.
[2] A. Azran, The rendezvous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007.
[3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003.
[4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007.
[5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006.
[6] Y. Bengio, O. Delalleau, N. Le Roux, Label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Schölkopf and Zien, editors, MIT Press, 2006.
[7] O. Bousquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004.
[8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006.
[9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008.
[10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008.
[11] M. Meila, J. Shi, A random walks view of spectral segmentation, AI and Statistics, 2001.
[12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006.
[13] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, 2002.
[14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, NIPS, vol. 16, 2004.
[15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003.
3 0.10489064 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.
4 0.099372596 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling
Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu
Abstract: Dependencies among neighbouring labels in a sequence are an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective.
5 0.094514333 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function $f^* : \mathbb{R}^p \to \mathbb{R}$ that has an additive decomposition of the form $f^*(X_1, \ldots, X_p) = \sum_{j \in S} h_j^*(X_j)$, where each component function $h_j^*$ lies in some class $\mathcal{H}$ of "smooth" functions, and $S \subset \{1, \ldots, p\}$ is an unknown subset with cardinality $s = |S|$. Given $n$ i.i.d. observations of $f^*(X)$ corrupted with additive white Gaussian noise where the covariate vectors $(X_1, X_2, X_3, \ldots, X_p)$ are drawn with i.i.d. components from some distribution $P$, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-$L^2(P)$ error. Our main result is a lower bound on the minimax rate that scales as $\max\left(\frac{s \log(p/s)}{n},\ s\,\epsilon_n^2(\mathcal{H})\right)$. The first term reflects the sample size required for performing subset selection, and is independent of the function class $\mathcal{H}$. The second term $s\,\epsilon_n^2(\mathcal{H})$ is an $s$-dimensional estimation term corresponding to the sample size required for estimating a sum of $s$ univariate functions, each chosen from the function class $\mathcal{H}$. It depends linearly on the sparsity index $s$ but is independent of the global dimension $p$. As a special case, if $\mathcal{H}$ corresponds to functions that are $m$-times differentiable (an $m$th-order Sobolev space), then the $s$-dimensional estimation term takes the form $s\,\epsilon_n^2(\mathcal{H}) \asymp s\,n^{-2m/(2m+1)}$. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
6 0.089513525 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
7 0.089088857 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis
8 0.087276176 2 nips-2009-3D Object Recognition with Deep Belief Nets
9 0.086337328 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
10 0.074211039 97 nips-2009-Free energy score space
11 0.072261982 72 nips-2009-Distribution Matching for Transduction
12 0.071030445 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
13 0.07089179 122 nips-2009-Label Selection on Graphs
14 0.067932539 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
15 0.065423764 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
16 0.06430126 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
17 0.062729269 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
18 0.061231215 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
19 0.058264006 157 nips-2009-Multi-Label Prediction via Compressed Sensing
20 0.05543793 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
topicId topicWeight
[(0, -0.176), (1, 0.029), (2, -0.036), (3, 0.06), (4, -0.029), (5, -0.014), (6, -0.036), (7, -0.021), (8, -0.099), (9, 0.1), (10, -0.077), (11, 0.099), (12, -0.077), (13, -0.12), (14, 0.007), (15, -0.007), (16, 0.01), (17, -0.106), (18, -0.024), (19, -0.027), (20, -0.001), (21, -0.122), (22, -0.064), (23, 0.049), (24, -0.009), (25, -0.039), (26, 0.04), (27, 0.06), (28, 0.058), (29, 0.032), (30, -0.013), (31, 0.064), (32, -0.124), (33, 0.092), (34, -0.031), (35, -0.004), (36, -0.079), (37, 0.073), (38, -0.066), (39, -0.005), (40, -0.03), (41, -0.0), (42, 0.08), (43, 0.003), (44, -0.066), (45, 0.066), (46, 0.025), (47, 0.043), (48, 0.061), (49, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.95041186 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
2 0.64650965 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.
Author: Natasha Singh-miller, Michael Collins
Abstract: We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, $P(y|a)$, when the number of labels $y$ is large and the labels share some underlying structure. We propose a method for learning label embeddings (similar to error-correcting output codes (ECOCs)) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate significant improvements in terms of word error rate (WER) on a lecture recognition task over a state-of-the-art baseline GMM model.
4 0.58922106 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
Author: Amarnag Subramanya, Jeff A. Bilmes
Abstract: We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to outperform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph.
5 0.57080126 97 nips-2009-Free energy score space
Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic
Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.
6 0.56283271 56 nips-2009-Conditional Neural Fields
7 0.55039215 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
8 0.54124969 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling
9 0.53801763 2 nips-2009-3D Object Recognition with Deep Belief Nets
10 0.52588999 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
11 0.51699537 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning
12 0.51330358 72 nips-2009-Distribution Matching for Transduction
13 0.50777704 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
14 0.50676578 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
15 0.47732171 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
16 0.4770025 71 nips-2009-Distribution-Calibrated Hierarchical Classification
17 0.45519876 122 nips-2009-Label Selection on Graphs
18 0.4507111 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
19 0.4472675 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
20 0.44653311 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
topicId topicWeight
[(24, 0.471), (25, 0.034), (35, 0.038), (36, 0.136), (39, 0.034), (58, 0.043), (61, 0.011), (71, 0.056), (81, 0.016), (86, 0.059), (91, 0.011)]
simIndex simValue paperId paperTitle
1 0.97486728 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function $f^* : \mathbb{R}^p \to \mathbb{R}$ that has an additive decomposition of the form $f^*(X_1, \ldots, X_p) = \sum_{j \in S} h_j^*(X_j)$, where each component function $h_j^*$ lies in some class $\mathcal{H}$ of “smooth” functions, and $S \subset \{1, \ldots, p\}$ is an unknown subset with cardinality $s = |S|$. Given $n$ i.i.d. observations of $f^*(X)$ corrupted with additive white Gaussian noise where the covariate vectors $(X_1, X_2, X_3, \ldots, X_p)$ are drawn with i.i.d. components from some distribution $\mathbb{P}$, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-$L^2(\mathbb{P})$ error. Our main result is a lower bound on the minimax rate that scales as $\max\left\{ \frac{s \log(p/s)}{n},\; s\,\epsilon_n^2(\mathcal{H}) \right\}$. The first term reflects the sample size required for performing subset selection, and is independent of the function class $\mathcal{H}$. The second term $s\,\epsilon_n^2(\mathcal{H})$ is an $s$-dimensional estimation term corresponding to the sample size required for estimating a sum of $s$ univariate functions, each chosen from the function class $\mathcal{H}$. It depends linearly on the sparsity index $s$ but is independent of the global dimension $p$. As a special case, if $\mathcal{H}$ corresponds to functions that are $m$-times differentiable (an $m$th-order Sobolev space), then the $s$-dimensional estimation term takes the form $s\,\epsilon_n^2(\mathcal{H}) \asymp s\, n^{-2m/(2m+1)}$. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
2 0.9695164 181 nips-2009-Online Learning of Assignments
Author: Matthew Streeter, Daniel Golovin, Andreas Krause
Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades.
3 0.96459866 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e., the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. In this paper, we propose intuitively reasonable sufficient conditions under which an agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves a polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
4 0.951446 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within $\epsilon$ absolute error of the optimal game-theoretic solution, in time polynomial in $1/\epsilon$. Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.
same-paper 5 0.93626094 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
6 0.80823445 45 nips-2009-Beyond Convexity: Online Submodular Minimization
7 0.80487359 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
8 0.75672865 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
9 0.74911612 161 nips-2009-Nash Equilibria of Static Prediction Games
10 0.7263822 239 nips-2009-Submodularity Cuts and Applications
11 0.71594518 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
12 0.68104613 14 nips-2009-A Parameter-free Hedging Algorithm
13 0.68053436 122 nips-2009-Label Selection on Graphs
14 0.67788053 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
15 0.67404866 178 nips-2009-On Stochastic and Worst-case Models for Investing
16 0.66855621 94 nips-2009-Fast Learning from Non-i.i.d. Observations
17 0.65967023 55 nips-2009-Compressed Least-Squares Regression
18 0.64956844 232 nips-2009-Strategy Grafting in Extensive Games
19 0.64881337 24 nips-2009-Adapting to the Shifting Intent of Search Queries
20 0.64723456 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers