nips nips2007 nips2007-186 knowledge-graph by maker-knowledge-mining

186 nips-2007-Statistical Analysis of Semi-Supervised Regression


Source: pdf

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. [sent-5, score-0.431]

2 In this paper we study semi-supervised learning from the viewpoint of minimax theory. [sent-7, score-0.206]

3 Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. [sent-8, score-0.404]

4 Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. [sent-9, score-0.679]

5 We then develop several new approaches that provably lead to improved performance. [sent-10, score-0.148]

6 The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. [sent-11, score-0.206]

7 1 Introduction. Suppose that we have labeled data L = {(X_1, Y_1), ..., (X_n, Y_n)}. [sent-12, score-0.144]

8 Ordinary regression and classification techniques use L to predict Y from X . [sent-19, score-0.152]

9 Semi-supervised methods also use the unlabeled data U in an attempt to improve the predictions. [sent-20, score-0.287]

10 Semi-Supervised Smoothness Assumption (SSS): The regression function m(x) = E(Y | X = x) is very smooth where the density p(x) of X is large. [sent-22, score-0.216]

11 In particular, we explore precise formulations of SSS, which is motivated by the intuition that high density level sets correspond to clusters of similar objects, but as stated above is quite vague. [sent-26, score-0.165]

12 Under the manifold assumption M, the semi-supervised smoothness assumption SSS is superfluous. [sent-30, score-0.558]

13 This point was made heuristically by Bickel and Li (2006), but we show that in fact ordinary regression methods are automatically adaptive if the distribution of X concentrates on a manifold. [sent-31, score-0.301]

14 Without the manifold assumption M, the semi-supervised smoothness assumption SSS as usually defined is too weak, and current methods don’t lead to improved inferences. [sent-33, score-0.678]

15 In particular, methods that use regularization based on graph Laplacians do not achieve faster rates of convergence. [sent-34, score-0.15]

16 Assuming specific conditions that relate m and p, we develop new semi-supervised methods that lead to improved estimation. [sent-36, score-0.12]

17 In particular, we propose estimators that reduce bias by estimating the Hessian of the regression function, improve the choice of bandwidths using unlabeled data, and estimate the regression function on level sets. [sent-37, score-0.847]

18 The focus of the paper is on a theoretical analysis of semi-supervised regression techniques, rather than the development of practical new algorithms and techniques. [sent-38, score-0.152]

19 Our intent is to bring the statistical perspective of minimax analysis to bear on the problem, in order to study the interplay between the labeled sample size and the unlabeled sample size, and between the regression function and the data density. [sent-40, score-0.789]

20 By studying simplified versions of the problem, our analysis suggests how precise formulations of assumptions M and SSS can be made and exploited to lead to improved estimators. [sent-41, score-0.232]

21 The labeled data are L = {(X_i, Y_i) : R_i = 1} and the unlabeled data are U = {(X_i, Y_i) : R_i = 0}. [sent-46, score-0.431]

22 For convenience, assume that the data are labeled so that R_i = 1 for i = 1, ..., n. [sent-47, score-0.144]

23 Thus, the labeled sample size is n, and the unlabeled sample size is u = N − n. [sent-54, score-0.431]

24 Let p(x) be the density of X and let m(x) = E(Y | X = x) denote the regression function. [sent-55, score-0.262]

25 The missing at random assumption R ⊥ Y | X is crucial, although this point is rarely emphasized in the machine learning literature. [sent-59, score-0.106]

26 It is clear that without some further conditions, the unlabeled data are useless. [sent-60, score-0.287]

27 The key assumption we need is that there is some correspondence between the shape of the regression function m and the shape of the data density p. [sent-61, score-0.418]

28 We will use minimax theory to judge the quality of an estimator. [sent-62, score-0.206]

29 Let R denote a class of regression functions and let F denote a class of density functions. [sent-63, score-0.262]

30 In the classical setting, we observe labeled data (X_1, Y_1), ..., (X_n, Y_n). [sent-64, score-0.144]

31 The pointwise minimax risk, or mean squared error (MSE), is defined by $R_n(x) = \inf_{\hat m_n} \sup_{m \in \mathcal{R},\, p \in \mathcal{F}} \mathbb{E}\big(\hat m_n(x) - m(x)\big)^2$ (1), where the infimum is over all estimators. [sent-68, score-0.298]

32 The global minimax risk is defined by $R_n = \inf_{\hat m_n} \sup_{m \in \mathcal{R},\, p \in \mathcal{F}} \mathbb{E} \int \big(\hat m_n(x) - m(x)\big)^2 \, dx$ (2). [sent-69, score-0.426]

33 A typical assumption is that R is the Sobolev space of order two, meaning essentially that m has smooth second derivatives. [sent-70, score-0.106]

34 The minimax rate is achieved by kernel estimators and local polynomial estimators. [sent-72, score-0.471]

35 In particular, for kernel estimators, if we use a product kernel with common bandwidth h_n for each variable, choosing $h_n \sim n^{-1/(4+D)}$ yields the minimax rate. (We write $a_n \asymp b_n$ to mean that $a_n/b_n$ is bounded away from 0 and infinity for large n.) [sent-73, score-0.540]
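The rate quoted here follows from the standard bias-variance tradeoff for a D-dimensional kernel smoother; the display below is a textbook reminder, not taken from the paper:

$$
\mathrm{MSE}(h) \;\asymp\; h^4 + \frac{1}{n h^D},
\qquad h_n \asymp n^{-1/(4+D)}
\;\Longrightarrow\;
\mathrm{MSE}(h_n) \asymp n^{-4/(4+D)}.
$$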

36 Fan (1993) shows that the local linear estimator is asymptotically minimax for this class. [sent-78, score-0.457]

37 This estimator is given by $\hat m_n(x) = \hat a_0$, where $(\hat a_0, \hat a_1)$ minimizes $\sum_{i=1}^{n} \big(Y_i - a_0 - a_1^T(X_i - x)\big)^2\, K\big(H^{-1/2}(X_i - x)\big)$, where K is a symmetric kernel and H is a matrix of bandwidths. [sent-79, score-0.160]
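A minimal numpy sketch of a local linear fit of this form (an illustrative implementation, not the authors' code; the Gaussian kernel choice and all function and variable names are my own):

```python
import numpy as np

def local_linear(x0, X, Y, H):
    """Local linear estimate at x0: m_hat(x0) = a0_hat, where (a0, a1)
    minimize sum_i (Y_i - a0 - a1^T (X_i - x0))^2 * K(H^{-1/2}(X_i - x0)).
    X: (n, D) covariates, Y: (n,) responses, H: (D, D) bandwidth matrix.
    A Gaussian kernel is used here purely for illustration."""
    Z = X - x0                                     # centered covariates X_i - x0
    Hinv = np.linalg.inv(H)
    # radial Gaussian kernel: weight depends on (X_i - x0)^T H^{-1} (X_i - x0)
    w = np.exp(-0.5 * np.einsum("ij,jk,ik->i", Z, Hinv, Z))
    A = np.hstack([np.ones((len(Y), 1)), Z])       # design matrix [1, X_i - x0]
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)
    return coef[0]                                 # intercept a0_hat = m_hat(x0)

# toy usage on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
print(local_linear(np.zeros(2), X, Y, 0.1 * np.eye(2)))
```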

38 This result is important to what follows, because it suggests that if the Hessian H_m of the regression function is related to the Hessian H_p of the data density, one may be able to estimate the optimal bandwidth matrix from unlabeled data in order to reduce the risk. [sent-82, score-0.683]

39 Suppose X ∈ R^D has support on a manifold M with dimension d < D. [sent-86, score-0.269]

40 Let m_h be the local linear estimator with diagonal bandwidth matrix H = h^2 I. [sent-87, score-0.495]

41 Then Bickel and Li show that the bias and variance are $b(x) = h^2 J_1(x)\,(1 + o_P(1))$ and $v(x) = \dfrac{J_2(x)}{n h^d}\,(1 + o_P(1))$ (7) for some functions J_1 and J_2. [sent-88, score-0.342]

42 Choosing $h_n \asymp n^{-1/(4+d)}$ yields a risk of order $n^{-4/(4+d)}$, which is the optimal rate for data that lie on a manifold of dimension d. [sent-89, score-0.313]

43 Finally, choose the bandwidth h ∈ B that minimizes a local cross-validation score. [sent-97, score-0.327]

44 Construct m_h for h ∈ H using the data in I_0, and estimate the risk from I_1 by setting $\hat R(h) = |I_1|^{-1} \sum_{i \in I_1} \big(Y_i - \hat m_h(X_i)\big)^2$. [sent-105, score-0.128]
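A sketch of this split-and-validate bandwidth search (illustrative only; the helper names are hypothetical, and `fit` can be any pointwise regression routine such as the local linear sketch above):

```python
import numpy as np

def select_bandwidth(X, Y, grid, fit, split=0.5, seed=0):
    """Split the labeled data into I0 (fitting) and I1 (validation); for each
    h in `grid`, fit m_hat_h on I0 and score it on I1 via
    R_hat(h) = |I1|^{-1} * sum_{i in I1} (Y_i - m_hat_h(X_i))^2,
    then return the h with the smallest estimated risk.  `fit(x0, X, Y, H)`
    is any pointwise regressor, e.g. the local_linear sketch above."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    idx = rng.permutation(n)
    I0, I1 = idx[: int(split * n)], idx[int(split * n):]
    risks = []
    for h in grid:
        H = (h ** 2) * np.eye(D)                    # diagonal bandwidth matrix h^2 I
        preds = np.array([fit(x, X[I0], Y[I0], H) for x in X[I1]])
        risks.append(np.mean((Y[I1] - preds) ** 2))
    return grid[int(np.argmin(risks))]
```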

45 Suppose that the data density p(x) is supported on a manifold of dimension d ≥ 4. [sent-111, score-0.333]

46 The risk is, up to a constant, $R(h) = \mathbb{E}\big(Y - \hat m_h(X)\big)^2$, where (X, Y) is a new pair and Y = m(X) + ε. [sent-115, score-0.128]

47 Then, except on a set of probability tending to 0, $R(\hat h) \le \hat R(\hat h) + \sqrt{\tfrac{C \log n}{n}} \le \hat R(h^*) + \sqrt{\tfrac{C \log n}{n}}$ (12) $\le R(h^*) + 2\sqrt{\tfrac{C \log n}{n}} = O\big(n^{-4/(4+d)}\big) + 2\sqrt{\tfrac{C \log n}{n}} = O\big(n^{-4/(4+d)}\big)$ (13), where we used the assumption d ≥ 4 in the last equality. [sent-122, score-0.210]

48 We conclude that ordinary regression methods are automatically adaptive, and achieve the low-dimensional minimax rate if the distribution of X concentrates on a manifold; there is no need for semi-supervised methods in this case. [sent-124, score-0.503]

49 Nevertheless, the shape of the data density p(x) might provide information about the regression function m(x), in which case the unlabeled data are informative. [sent-127, score-0.551]

50 Several recent methods for semi-supervised learning attempt to exploit the smoothness assumption SSS using regularization operators defined with respect to graph Laplacians (Zhu et al. [sent-128, score-0.368]

51 In the approach of Zhu et al. (2003), the locally constant estimate m(x) is formed using not only the labeled data, but also the estimates at the unlabeled points. [sent-136, score-0.431]

52 The semi-supervised regression estimate is then (m(X_1), m(X_2), ...). [sent-144, score-0.152]

53 Thus, the estimates are coupled, unlike the standard kernel regression estimate (14) where the estimate at each point x can be formed independently, given the labeled data. [sent-151, score-0.401]

54 The estimator can be written in closed form as a linear smoother m = C^{-1} B Y = G Y, where m = (m(X_{n+1}), ..., m(X_{n+u}))^T. [sent-152, score-0.229]

55 Here m is the vector of estimates over the unlabeled test points, and Y = (Y_1, ..., Y_n)^T is the vector of labeled responses. [sent-155, score-0.287]

56 The (N−n)×(N−n) matrix C and the (N−n)×n matrix B denote blocks of the combinatorial Laplacian Δ on the data graph, corresponding to the labeled and unlabeled data: $\Delta = \begin{pmatrix} A & B^T \\ B & C \end{pmatrix}$ (16), where the Laplacian has entries $\Delta_{ij} = \sum_k K_h(X_i, X_k)$ if i = j and $\Delta_{ij} = -K_h(X_i, X_j)$ otherwise (17). [sent-159, score-0.486]

57 This estimator assumes the noise is zero, since m(X_i) = Y_i for i = 1, ..., n. [sent-161, score-0.195]
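A small numpy sketch of this harmonic-function smoother (illustrative only; sign conventions for the off-diagonal blocks differ across write-ups, so rather than hard-coding the closed form above, the sketch solves the harmonic condition on the unlabeled nodes directly, using a Gaussian kernel and hypothetical function names):

```python
import numpy as np

def harmonic_smoother(XL, YL, XU, h):
    """Harmonic-function estimate at the unlabeled points: build the
    combinatorial graph Laplacian from a Gaussian kernel over all points,
    clamp the fit to Y on the labeled nodes, and solve the harmonic
    condition (zero Laplacian) on the unlabeled nodes."""
    X = np.vstack([XL, XU])
    n = len(XL)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * h ** 2))                 # kernel weights K_h(X_i, X_j)
    np.fill_diagonal(W, 0.0)
    Lap = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian
    C = Lap[n:, n:]                                # unlabeled-unlabeled block
    B = Lap[n:, :n]                                # unlabeled-labeled block
    # harmonic condition on the unlabeled rows: B @ YL + C @ mU = 0
    return np.linalg.solve(C, -B @ YL)
```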

58 To work in the standard model Y = m(X) + ε, the natural extension of the harmonic function approach is manifold regularization (Belkin et al.). [sent-165, score-0.436]

59 Here the estimator is chosen to minimize the regularized empirical risk functional $R_\gamma(m) = \sum_{i=1}^{N} \sum_{j=1}^{n} K_H(X_i, X_j)\big(Y_j - m(X_i)\big)^2 + \gamma \sum_{i=1}^{N} \sum_{j=1}^{N} K_H(X_i, X_j)\big(m(X_j) - m(X_i)\big)^2$ (18), where H is a matrix of bandwidths and $K_H(X_i, X_j) = K(H^{-1/2}(X_i - X_j))$. [sent-168, score-0.416]

60 When γ = 0 the standard kernel smoother is obtained. [sent-169, score-0.139]

61 The regularization term is $J(m) \equiv \sum_{i=1}^{N} \sum_{j=1}^{N} K_H(X_i, X_j)\big(m(X_j) - m(X_i)\big)^2 = 2\, m^T \Delta\, m$ (19), where Δ is the combinatorial Laplacian associated with K_H. [sent-170, score-0.141]

62 This regularization term is motivated by the semi-supervised smoothness assumption—it favors functions m for which m(X i ) is close to m(X j ) when X i and X j are similar, according to the kernel function. [sent-171, score-0.323]

63 The name manifold regularization is justified by the fact that $\tfrac{1}{2} J(m) \to \int_M \|\nabla m(x)\|^2 \, dM_x$, the energy of m over the manifold. [sent-172, score-0.242]

64 For an appropriate choice of γ , minimizing the functional (18) can be expected to give essentially the same results as the harmonic function approach that minimizes (15). [sent-175, score-0.117]
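Because (18) is quadratic in the vector of fitted values, it can be minimized in closed form. The sketch below does this under the reading of (18) given above (kernel-weighted fit against the labeled responses plus the penalty 2γ m^T Δ m from (19)); the exact weighting scheme and all names are assumptions of this sketch, not the paper's specification:

```python
import numpy as np

def manifold_regularized_fit(XL, YL, XU, h, gamma):
    """Closed-form minimizer of a functional shaped like (18) over the
    vector of fitted values m in R^N: a kernel-weighted fit of m(X_i) to
    the labeled Y_j, plus gamma times the Laplacian penalty 2 m^T Delta m
    from (19).  Returns fitted values at all N = n + u points."""
    X = np.vstack([XL, XU])
    n = len(XL)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * h ** 2))                 # K_H(X_i, X_j), Gaussian kernel
    Lap = np.diag(K.sum(axis=1)) - K               # combinatorial Laplacian of K
    DL = np.diag(K[:, :n].sum(axis=1))             # diag of sum_{j<=n} K_ij
    rhs = K[:, :n] @ YL                            # sum_{j<=n} K_ij * Y_j
    # setting the gradient to zero gives (DL + 2*gamma*Lap) m = rhs
    m = np.linalg.solve(DL + 2 * gamma * Lap, rhs)
    return m                                       # gamma = 0: standard kernel smoother
```

With gamma = 0 this collapses to a Nadaraya-Watson style kernel smoother, matching the remark above that γ = 0 recovers the standard kernel smoother.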

65 Let $m_{H,\gamma}$ minimize (18), and let $\mathcal{L}_{p,H}$ be the differential operator defined by $\mathcal{L}_{p,H} f(x) = \tfrac{1}{2}\,\mathrm{trace}\big(H_f(x)\, H\big) + \nabla p(x)^T H\, \nabla f(x) / p(x)$. Then the asymptotic MSE of $m_{H,\gamma}(x)$ is $M = \dfrac{c_1\, \mu\, \sigma^2}{n(\mu+\gamma)\, p(x)\, |H|^{1/2}}$ plus a squared bias term built from $\big(I - \tfrac{\gamma}{\mu}\,\mathcal{L}_{p,H}\big)$ and $\mathcal{L}_{p,H}\, m$. [sent-178, score-0.120]

66 Note that the bias of the standard kernel estimator, in the notation of this theorem, is $b(x) = c_2\, \mathcal{L}_{p,H}\, m(x)$, and the variance is $V(x) = c_1 / \big(n\, p(x)\, |H|^{1/2}\big)$. [sent-180, score-0.243]

67 It follows from this theorem that $M = \widetilde M + o(\mathrm{tr}(H))$, where $\widetilde M$ is the usual MSE for a kernel estimator. [sent-182, score-0.182]

68 The implication of this theorem is that the estimator that uses Laplacian regularization has the same rate of convergence as the usual kernel estimator, and thus the unlabeled data have not improved the estimator asymptotically. [sent-185, score-1.062]

69 5 Semi-Supervised Methods With Improved Rates. The previous result is negative, in the sense that it shows unlabeled data do not help to improve the rate of convergence. [sent-186, score-0.331]

70 This is because the bias and variance of a manifold regularized kernel estimator are of the same order in H as the bias and variance of standard kernel regression. [sent-187, score-0.923]

71 We now demonstrate how improved rates of convergence can be obtained by formulating and exploiting appropriate SSS assumptions. [sent-188, score-0.166]

72 We describe three different approaches: semi-supervised bias reduction, improved bandwidth selection, and averaging over level sets. [sent-189, score-0.471]

73 5.1 Semi-Supervised Bias Reduction. We first show a positive result by formulating an SSS assumption that links the shape of p to the shape of m by positing a relationship between the Hessian H_m of m and the Hessian H_p of p. [sent-191, score-0.263]

74 First note that the variance of the estimator m_n, conditional on X_1, ..., X_n. [sent-200, score-0.234]

75 Now, the term $\tfrac{1}{2}\mu_2(K)\,\mathrm{tr}\big(H_m(x)\, H\big)$ is precisely the bias of the local linear estimator, under the SSS assumption that H_p(x) = H_m(x). [sent-210, score-0.288]
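One way to operationalize this bias correction (a sketch only: it takes the SSS relation H_p = H_m literally, uses mu2(K) = 1 as for a standard Gaussian kernel, estimates the density Hessian by finite differences on a KDE built from all available X's, and all helper names are hypothetical):

```python
import numpy as np

def kde(x, X, s):
    """Gaussian kernel density estimate at x with scalar bandwidth s."""
    D = X.shape[1]
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * s ** 2))) / ((2 * np.pi) ** (D / 2) * s ** D)

def hessian_fd(f, x, eps=1e-3):
    """Numerical Hessian of a scalar function f at x (central differences)."""
    D = len(x)
    Hf = np.zeros((D, D))
    for a in range(D):
        for b in range(D):
            ea, eb = np.eye(D)[a] * eps, np.eye(D)[b] * eps
            Hf[a, b] = (f(x + ea + eb) - f(x + ea - eb)
                        - f(x - ea + eb) + f(x - ea - eb)) / (4 * eps ** 2)
    return Hf

def bias_corrected(x0, XL, YL, X_all, H, s, smoother):
    """Subtract the estimated leading bias 0.5 * mu2(K) * tr(H_m(x0) H),
    with H_m replaced by the density Hessian H_p (the SSS assumption
    H_p = H_m) estimated from all covariates X_all.  `smoother` is any
    kernel or local linear fitter; mu2(K) = 1 for a standard Gaussian kernel."""
    Hp = hessian_fd(lambda z: kde(z, X_all, s), np.asarray(x0, dtype=float))
    return smoother(x0, XL, YL, H) - 0.5 * np.trace(Hp @ H)
```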

76 The result now follows from the fact that the next term in the bias of the local linear estimator is of order O(tr(H )4 ). [sent-212, score-0.377]

77 When p is estimated from the data, the risk will be inflated by N −4/(4+D) assuming standard smoothness assumptions on p. [sent-214, score-0.273]

78 This term will not dominate the improved rate n −(4+4 )/(4+4 +D) as long as N > n . [sent-215, score-0.143]

79 The assumption that H_m = H_p can be replaced by the more realistic assumption that H_m = g(p; β) for some parameterized family of functions g(·; β). [sent-216, score-0.212]

80 5.2 Improved Bandwidth Selection. Let $\hat H$ be the bandwidth estimated using the labeled data. [sent-220, score-0.388]

81 We will now show how a bandwidth $\hat H^*$ can be estimated using the labeled and unlabeled data together, such that, under appropriate assumptions, $\limsup_{n\to\infty} \dfrac{|R(\hat H^*) - R(H^*)|}{|R(\hat H) - R(H^*)|} = 0$, where $H^* = \arg\min_H R(H)$ (23). [sent-221, score-0.810]

82 Therefore, the unlabeled data allow us to construct an estimator that gets closer to the oracle risk. [sent-222, score-0.512]

83 Let $m_H$ denote the local linear estimator with bandwidth $H \in \mathbb{R}$, $H > 0$. [sent-226, score-0.495]

84 To use the unlabeled data, note that the optimal (global) bandwidth is $H^* = \big(c_2 B / (4 n c_1 A)\big)^{1/5}$, where $A = \int m''(x)^2\, dx$ and $B = \int dx / p(x)$. [sent-227, score-0.531]

85 Let $\hat p(x)$ be the kernel density estimator of p using $X_1, \ldots, X_N$. [sent-228, score-0.364]
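A sketch of the resulting plug-in rule in one dimension (illustrative only: the roughness A = ∫ m''(x)^2 dx is taken as a pilot estimate supplied by the caller, B = ∫ dx/p(x) is approximated on a user-supplied grid with a Gaussian KDE over all available X's, and the function name is hypothetical):

```python
import numpy as np

def plugin_bandwidth(X_all, A_hat, c1, c2, n, grid, s):
    """Plug-in global bandwidth H* = (c2 * B / (4 * n * c1 * A))^{1/5} for
    one-dimensional regression.  B = int dx / p(x) is approximated on the
    user-supplied grid with a Gaussian KDE built from X_all (labeled plus
    unlabeled covariates); A_hat is a pilot estimate of int m''(x)^2 dx,
    and c1, c2 are the kernel constants from the risk expansion."""
    def p_hat(x):
        return np.mean(np.exp(-(X_all - x) ** 2 / (2 * s ** 2))) / (np.sqrt(2 * np.pi) * s)
    p_vals = np.array([max(p_hat(x), 1e-12) for x in grid])   # guard against p ~ 0
    B_hat = np.sum((1.0 / p_vals)[:-1] * np.diff(grid))       # Riemann sum of dx / p(x)
    return (c2 * B_hat / (4 * n * c1 * A_hat)) ** 0.2
```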

86 The risk is $R(H) = c_1 H^4 \int \big(m''(x)\big)^2\, dx + \dfrac{c_2}{nH} \int \dfrac{dx}{p(x)} + o\Big(\dfrac{1}{nH}\Big)$ (25). [sent-239, score-0.128]

87 The oracle bandwidth is $H^* = c_3 / n^{1/5}$, and then $R(H^*) = O(n^{-4/5})$. [sent-240, score-0.274]

88 Now let $\hat H$ be the bandwidth estimated by cross-validation. [sent-241, score-0.290]

89 If $N/n^{D/4} \to \infty$ and $\hat\theta - \theta = O_P(N^{-\beta})$ for some $\beta > 0$, then $\limsup_{n\to\infty} |R(\hat H^*) - R(H^*)| \,/\, |R(\hat H) - R(H^*)| = 0$. [sent-254, score-0.109]

90 For N large we can replace L with $\hat L = \{x : \hat p(x) > \lambda\}$ with small loss in accuracy, where $\hat p$ is an estimate of p using the unlabeled data; see Rigollet (2006) for details. [sent-264, score-0.287]

91 If the regression function is slowly varying over this set, the risk should be small. [sent-267, score-0.280]

92 A similar estimator is considered by Cortes and Mohri (2006), but they do not provide estimates of the risk. [sent-268, score-0.195]

93 The risk of m(x) for $x \in L \cap C_j$ is bounded by $O\Big(\dfrac{1}{n\pi_j}\Big) + O\big(\delta_j^2 \xi_j^2\big)$, where $\delta_j = \sup_{x \in C_j} \|\nabla m(x)\|$, $\xi_j = \mathrm{diameter}(C_j)$ and $\pi_j = P(X \in C_j)$. [sent-270, score-0.154]
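A sketch of a level-set averaging estimator of this flavor (illustrative only, not the paper's construction: the level set comes from a Gaussian KDE over all X's, the clusters C_j are single-linkage connected components at a user-chosen distance, and all names and thresholds are hypothetical):

```python
import numpy as np

def level_set_regression(x0, XL, YL, XU, lam, s, link=0.2):
    """Estimate m(x0) by averaging the labeled Y_i over the cluster C_j of
    the estimated high-density level set {x : p_hat(x) > lam} that contains
    x0.  p_hat is a Gaussian KDE over all X's; the clusters are the
    single-linkage connected components of the level-set points at
    distance `link`.  Falls back to the global mean if the set is empty."""
    X = np.vstack([XL, XU])
    D = X.shape[1]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    p_hat = np.exp(-d2 / (2 * s ** 2)).mean(axis=1) / ((2 * np.pi) ** (D / 2) * s ** D)
    keep = np.where(p_hat > lam)[0]                # indices in the estimated level set
    if len(keep) == 0:
        return YL.mean()
    labels = {int(i): int(i) for i in keep}        # crude union-find by relabelling
    for a in labels:
        for b in labels:
            if d2[a, b] < link ** 2 and labels[a] != labels[b]:
                old, new = labels[b], labels[a]
                for k in labels:
                    if labels[k] == old:
                        labels[k] = new
    nearest = keep[int(np.argmin(np.sum((X[keep] - x0) ** 2, axis=1)))]
    root = labels[int(nearest)]
    members = [i for i in labels if labels[i] == root and i < len(XL)]
    return YL[members].mean() if members else YL.mean()
```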

94 Now $m(X_i) - m(x) = (X_i - x)^T \nabla m(u_i)$ for some $u_i$ between x and $X_i$. [sent-278, score-0.114]

95 Hence, $|m(X_i) - m(x)| \le \|X_i - x\|\, \sup_{x \in C_j} \|\nabla m(x)\|$, and so the bias is bounded by $\delta_j \xi_j$. [sent-279, score-0.125]

96 We have indicated some new approaches to understanding and exploiting the relationship between the labeled and unlabeled data. [sent-286, score-0.431]

97 Of course, we make no claim that these are the only ways of incorporating unlabeled data. [sent-287, score-0.287]

98 But our results indicate that decoupling the manifold assumption and the semi-supervised smoothness assumption is crucial to clarifying the problem. [sent-288, score-0.586]

99 The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-307, score-0.431]

100 Asymptotic comparison of (partial) cross-validation, gcv and randomized gcv in nonparametric regression. [sent-323, score-0.128]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sss', 0.448), ('unlabeled', 0.287), ('hm', 0.255), ('bandwidth', 0.244), ('manifold', 0.242), ('minimax', 0.206), ('estimator', 0.195), ('regression', 0.152), ('labeled', 0.144), ('risk', 0.128), ('ri', 0.127), ('bickel', 0.123), ('mse', 0.119), ('assumption', 0.106), ('kernel', 0.105), ('smoothness', 0.104), ('yi', 0.102), ('bias', 0.099), ('iyogi', 0.096), ('laplacians', 0.096), ('hessian', 0.089), ('regularization', 0.087), ('rigollet', 0.084), ('tr', 0.077), ('nh', 0.076), ('laplacian', 0.074), ('improved', 0.072), ('bandwidths', 0.067), ('elkin', 0.064), ('gcv', 0.064), ('ickel', 0.064), ('indhwani', 0.064), ('density', 0.064), ('harmonic', 0.064), ('zhu', 0.064), ('sup', 0.063), ('suppose', 0.061), ('estimators', 0.06), ('li', 0.059), ('kj', 0.057), ('belkin', 0.057), ('local', 0.056), ('girard', 0.056), ('hu', 0.056), ('ordinary', 0.053), ('fan', 0.051), ('larry', 0.051), ('yn', 0.049), ('lead', 0.048), ('shape', 0.048), ('asymptotic', 0.048), ('supx', 0.048), ('concentrates', 0.048), ('heuristically', 0.048), ('invoke', 0.048), ('lim', 0.046), ('let', 0.046), ('theorem', 0.045), ('kondor', 0.045), ('rate', 0.044), ('rn', 0.043), ('op', 0.043), ('et', 0.043), ('assumptions', 0.041), ('pittsburgh', 0.039), ('variance', 0.039), ('formulations', 0.037), ('classi', 0.036), ('rates', 0.035), ('smoother', 0.034), ('carnegie', 0.034), ('mellon', 0.034), ('precise', 0.034), ('formulating', 0.033), ('usual', 0.032), ('lafferty', 0.032), ('var', 0.031), ('oracle', 0.03), ('papers', 0.03), ('level', 0.03), ('pa', 0.029), ('inf', 0.029), ('provably', 0.028), ('graph', 0.028), ('positing', 0.028), ('ated', 0.028), ('decoupling', 0.028), ('mator', 0.028), ('sang', 0.028), ('castelli', 0.028), ('bernoulli', 0.028), ('dimension', 0.027), ('minimizes', 0.027), ('combinatorial', 0.027), ('term', 0.027), ('log', 0.026), ('averaging', 0.026), ('minimize', 0.026), ('bounded', 0.026), ('appropriate', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

2 0.17201467 107 nips-2007-Iterative Non-linear Dimensionality Reduction with Manifold Sculpting

Author: Michael Gashler, Dan Ventura, Tony Martinez

Abstract: Many algorithms have been recently developed for reducing dimensionality by projecting data onto an intrinsic non-linear manifold. Unfortunately, existing algorithms often lose significant precision in this transformation. Manifold Sculpting is a new algorithm that iteratively reduces dimensionality by simulating surface tension in local neighborhoods. We present several experiments that show Manifold Sculpting yields more accurate results than existing algorithms with both generated and natural data-sets. Manifold Sculpting is also able to benefit from prior dimensionality reduction efforts.

3 0.16731267 161 nips-2007-Random Projections for Manifold Learning

Author: Chinmay Hegde, Michael Wakin, Richard Baraniuk

Abstract: We propose a novel method for linear dimensionality reduction of manifold modeled data. First, we show that with a small number M of random projections of sample points in RN belonging to an unknown K-dimensional Euclidean manifold, the intrinsic dimension (ID) of the sample set can be estimated to high accuracy. Second, we rigorously prove that using only this set of random projections, we can estimate the structure of the underlying manifold. In both cases, the number of random projections required is linear in K and logarithmic in N , meaning that K < M ≪ N . To handle practical situations, we develop a greedy algorithm to estimate the smallest size of the projection space required to perform manifold learning. Our method is particularly relevant in distributed sensing systems and leads to significant potential savings in data acquisition, storage and transmission costs.

4 0.16572024 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, when, for example, a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

5 0.16546449 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1

6 0.15883887 69 nips-2007-Discriminative Batch Mode Active Learning

7 0.1536445 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

8 0.12690622 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

9 0.11069868 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

10 0.1044338 32 nips-2007-Bayesian Co-Training

11 0.10033339 175 nips-2007-Semi-Supervised Multitask Learning

12 0.092473619 99 nips-2007-Hierarchical Penalization

13 0.089198619 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

14 0.087969199 110 nips-2007-Learning Bounds for Domain Adaptation

15 0.086918809 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

16 0.079365216 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

17 0.07934998 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

18 0.079076268 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

19 0.078559071 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

20 0.076588131 160 nips-2007-Random Features for Large-Scale Kernel Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.237), (1, 0.053), (2, -0.147), (3, 0.18), (4, -0.013), (5, 0.113), (6, -0.012), (7, -0.082), (8, 0.016), (9, 0.209), (10, 0.022), (11, 0.179), (12, -0.1), (13, -0.032), (14, 0.169), (15, 0.022), (16, -0.006), (17, 0.089), (18, 0.12), (19, -0.007), (20, 0.05), (21, -0.115), (22, 0.031), (23, -0.127), (24, 0.146), (25, 0.097), (26, 0.074), (27, -0.021), (28, 0.074), (29, -0.003), (30, 0.06), (31, 0.098), (32, 0.009), (33, 0.079), (34, 0.089), (35, -0.047), (36, 0.006), (37, 0.028), (38, 0.087), (39, -0.023), (40, -0.038), (41, 0.006), (42, 0.026), (43, -0.018), (44, 0.041), (45, 0.002), (46, 0.007), (47, 0.025), (48, -0.004), (49, -0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95915329 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

2 0.75674886 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, when, for example, a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

3 0.61293048 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

Author: Ping Li, Trevor J. Hastie

Abstract: Many tasks (e.g., clustering) in machine learning only require the lα distances instead of the original data. For dimension reductions in the lα norm (0 < α ≤ 2), the method of stable random projections can efficiently compute the lα distances in massive datasets (e.g., the Web or massive data streams) in one pass of the data. The estimation task for stable random projections has been an interesting topic. We propose a simple estimator based on the fractional power of the samples (projected data), which is surprisingly near-optimal in terms of the asymptotic variance. In fact, it achieves the Cramér-Rao bound when α = 2 and α = 0+. This new result will be useful when applying stable random projections to distance-based clustering, classifications, kernels, massive data streams etc.

4 0.57938868 161 nips-2007-Random Projections for Manifold Learning

Author: Chinmay Hegde, Michael Wakin, Richard Baraniuk

Abstract: We propose a novel method for linear dimensionality reduction of manifold modeled data. First, we show that with a small number M of random projections of sample points in RN belonging to an unknown K-dimensional Euclidean manifold, the intrinsic dimension (ID) of the sample set can be estimated to high accuracy. Second, we rigorously prove that using only this set of random projections, we can estimate the structure of the underlying manifold. In both cases, the number of random projections required is linear in K and logarithmic in N , meaning that K < M ≪ N . To handle practical situations, we develop a greedy algorithm to estimate the smallest size of the projection space required to perform manifold learning. Our method is particularly relevant in distributed sensing systems and leads to significant potential savings in data acquisition, storage and transmission costs.

5 0.55584419 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors, illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches. 1

6 0.55489576 107 nips-2007-Iterative Non-linear Dimensionality Reduction with Manifold Sculpting

7 0.54462039 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

8 0.54226589 166 nips-2007-Regularized Boost for Semi-Supervised Learning

9 0.53084338 69 nips-2007-Discriminative Batch Mode Active Learning

10 0.47230893 43 nips-2007-Catching Change-points with Lasso

11 0.47048762 175 nips-2007-Semi-Supervised Multitask Learning

12 0.46623135 179 nips-2007-SpAM: Sparse Additive Models

13 0.45119119 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

14 0.44279248 53 nips-2007-Compressed Regression

15 0.42630735 15 nips-2007-A general agnostic active learning algorithm

16 0.42047915 32 nips-2007-Bayesian Co-Training

17 0.41974303 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

18 0.39075959 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

19 0.37503615 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

20 0.37104511 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.234), (5, 0.03), (13, 0.035), (16, 0.039), (18, 0.011), (19, 0.024), (21, 0.105), (31, 0.011), (34, 0.031), (35, 0.056), (47, 0.079), (49, 0.029), (83, 0.145), (85, 0.02), (87, 0.014), (90, 0.067)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86103964 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

2 0.85071343 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright

Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) often provides accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so-called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points.

same-paper 3 0.77414703 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

4 0.66899997 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1

5 0.6679554 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. The experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.

6 0.66194695 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

7 0.65898621 175 nips-2007-Semi-Supervised Multitask Learning

8 0.65795577 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

9 0.65601909 16 nips-2007-A learning framework for nearest neighbor search

10 0.6558767 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

11 0.65461481 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

12 0.65446949 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

13 0.65336615 69 nips-2007-Discriminative Batch Mode Active Learning

14 0.65303564 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

15 0.65205318 24 nips-2007-An Analysis of Inference with the Universum

16 0.65188533 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

17 0.65158248 63 nips-2007-Convex Relaxations of Latent Variable Training

18 0.65085727 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

19 0.65082163 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

20 0.6498397 158 nips-2007-Probabilistic Matrix Factorization