nips nips2009 nips2009-34 knowledge-graph by maker-knowledge-mining

34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Source: pdf

Author: Manqi Zhao, Venkatesh Saligrama

Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the speciﬁed false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efﬁcient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artiﬁcial and real data sets in high dimensional feature spaces.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. [sent-5, score-1.401]

2 Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. [sent-6, score-0.63]

3 The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the speciﬁed false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. [sent-7, score-2.097]

4 1 Introduction Anomaly detection involves detecting statistically signiﬁcant deviations of test data from nominal distribution. [sent-11, score-0.719]

5 In typical applications the nominal distribution is unknown and generally cannot be reliably estimated from nominal training data due to a combination of factors such as limited data size and high dimensionality. [sent-12, score-1.064]

6 We propose an adaptive non-parametric method for anomaly detection based on score functions that maps data samples to the interval [0, 1]. [sent-13, score-0.765]

7 Our score function is derived from a K-nearest neighbor graph (K-NNG) on n-point nominal data. [sent-14, score-0.715]

8 Anomaly is declared whenever the score of a test sample falls below α (the desired false alarm error). [sent-15, score-0.608]

9 In statistical hypothesis testing, p-value is any transformation of the feature space to the interval [0, 1] that induces a uniform distribution on the nominal data. [sent-17, score-0.576]

10 When test samples with p-values smaller than α are declared as anomalies, false alarm error is less than α. [sent-18, score-0.463]

11 Our notion provides a characterization of the optimal anomaly detector, in that, it is uniformly most powerful for a speciﬁed false alarm level for the case when the anomaly density is a mixture of the nominal and a known density. [sent-20, score-2.081]

12 We show that our score function is asymptotically consistent, namely, it converges to our multivariate p-value as data length approaches inﬁnity. [sent-21, score-0.2]

13 It is also referred to as novelty detection [1, 2], outlier detection [3], one-class classiﬁcation [4, 5] and single-class classiﬁcation [6] in the literature. [sent-23, score-0.31]

14 Approaches to anomaly detection can be grouped into several categories. [sent-24, score-0.623]

15 In parametric approaches [7] the nominal densities are assumed to come from a parameterized family and generalized likelihood ratio tests are used for detecting deviations from nominal. [sent-25, score-0.541]

16 A K-nearest neighbor 1 (K-NN) anomaly detection approach is presented in [3, 8]. [sent-27, score-0.67]

17 There an anomaly is declared whenever the distance to the K-th nearest neighbor of the test sample falls outside a threshold. [sent-28, score-0.782]

18 In comparison our anomaly detector utilizes the global information available from the entire K-NN graph to detect deviations from the nominal. [sent-29, score-0.581]

19 Learning theoretic approaches attempt to ﬁnd decision regions, based on nominal data, that separate nominal instances from their outliers. [sent-31, score-1.065]

20 While these approaches provide impressive computationally efﬁcient solutions on real data, it is generally difﬁcult to precisely relate tuning parameter choices to desired false alarm probability. [sent-36, score-0.367]

21 They approximate (in appropriate function classes) level sets of the unknown nominal multivariate density from training samples. [sent-38, score-0.715]

22 Related work by Hero [13] based on geometric entropic minimization (GEM) detects outliers by comparing test samples to the most concentrated subset of points in the training sample. [sent-39, score-0.132]

23 This most concentrated set is the K-point minimum spanning tree(MST) for n-point nominal data and converges asymptotically to the minimum entropy set (which is also the MV set). [sent-40, score-0.646]

24 We develop score functions on K-NNG which turn out to be the empirical estimates of the volume of the MV sets containing the test point. [sent-44, score-0.245]

25 Yet our algorithms lead to statistically optimal solutions with the ability to control false alarm and miss error probabilities. [sent-47, score-0.346]

26 The main features of our anomaly detector are summarized. [sent-48, score-0.532]

27 (2) Like [12] our algorithm is provably optimal in that it is uniformly most powerful for the speciﬁed false alarm level, α, for the case that the anomaly density is a mixture of the nominal and any other density (not necessarily uniform). [sent-50, score-1.588]

28 Furthermore, our algorithm adapts to the inherent manifold structure or local dimensionality of the nominal density. [sent-52, score-0.54]

29 Let S = {x1 , x2 , · · · , xn } be the nominal training set of size n belonging to the unit cube [0, 1]d . [sent-56, score-0.575]

30 Our task is to declare whether the test point is consistent with nominal data or deviates from the nominal data. [sent-58, score-1.197]

31 If the test point is an anomaly it is assumed to come from a mixture of nominal distribution underlying the training data and another known density (see Section 3). [sent-59, score-1.204]

32 The geodesic distance is deﬁned as the shortest distance on the manifold. [sent-64, score-0.153]

33 Once a distance function is deﬁned our next step is to form a K nearest neighbor graph (K-NNG) or alternatively an neighbor graph ( -NG). [sent-69, score-0.226]

34 We then sort the K nearest distances for each xi in increasing order di,i1 ≤ · · · ≤ di,iK and denote RS (xi ) = di,iK , that is, the distance from xi to its K-th nearest neighbor. [sent-71, score-0.285]

35 For the simple case when the anomalous density is an arbitrary mixture of nominal and uniform density1 we consider the following two score functions associated with the two graphs K-NNG and -NNG respectively. [sent-74, score-0.909]

36 The score functions map the test data η to the interval [0, 1]. [sent-75, score-0.192]

37 05), we declare η to be anomalous if pˆ (η), p (η) ≤ α. [sent-80, score-0.207]

38 The score function K-LPE (or -LPE) measures the relative concentration of point η compared to the training set. [sent-83, score-0.144]

39 Section 3 establishes that the scores for nominally generated data is asymptotically uniformly distributed in [0, 1]. [sent-84, score-0.14]

40 Hence when scores below level α are declared as anomalous the false alarm error is smaller than α asymptotically (since the integral of a uniform distribution from 0 to α is α). [sent-86, score-0.725]

41 05 Bivariate Gaussian mixture distribution 4 4 3 3 2 2 1 1 empirical distribution of the scoring function K−LPE 5 0 −1 −2 −2 −3 −4 −5 −5 8 −3 −4 nominal data anomaly data 10 0 −1 12 empirical density 5 level set at α=0. [sent-88, score-1.272]

42 05 6 4 2 labeled as anomaly labeled as nominal −6 −6 −4 −2 0 2 4 −6 −6 −4 −2 0 2 4 0 0 0. [sent-89, score-1.015]

43 8 1 Figure 1: Left: Level sets of the nominal bivariate Gaussian mixture distribution used to illustrate the KLPE algorithm. [sent-94, score-0.628]

44 Middle: Results of K-LPE with K = 6 and Euclidean distance metric for m = 150 test points drawn from a equal mixture of 2D uniform and the (nominal) bivariate distributions. [sent-95, score-0.289]

45 Scores for the test points are based on 200 nominal training samples. [sent-96, score-0.625]

46 The dotted contour corresponds to the exact bivariate Gaussian density level set at level α = 0. [sent-99, score-0.269]

47 Right: The empirical distribution of the test point scores associated with the bivariate Gaussian appear to be uniform while scores for the test points drawn from 2D uniform distribution cluster around zero. [sent-101, score-0.425]

48 Figure 1 illustrates the use of K-LPE algorithm for anomaly detection when the nominal data is a 2D Gaussian mixture. [sent-102, score-1.143]

49 The middle panel of ﬁgure 1 shows the detection results based on K-LPE are consistent with the theoretical contour for signiﬁcance level α = 0. [sent-103, score-0.196]

50 The right panel of ﬁgure 1 shows the empirical distribution (derived from the kernel density estimation) of the score function K-LPE for the nominal (solid blue) and the anomaly (dashed red) data. [sent-105, score-1.242]

51 We can see that the curve for the nominal data is approximately uniform in the interval [0, 1] and the curve for the anomaly data has a peak at 0. [sent-106, score-1.169]

52 As seen from the ﬁgure (right) small changes in α lead to small changes in actual false alarm and miss levels. [sent-111, score-0.386]

53 1 1 When the mixing density is not uniform but, say f1 , the score functions must be modiﬁed to pK (η) = n ˆ and p (η) = ˆ 1 n n i=1 I NS (η) N (x ) ≥ S i f1 (η) f1 (xi ) for the two graphs K-NNG and -NNG respectively. [sent-112, score-0.264]

54 3 n i=1 1 I R (η)f (η) ≤ R (x 1 (x ) 1 S S i )f1 i To summarize the above discussion, our LPE algorithm has three steps: (1) Inputs: Signiﬁcance level α, distance metric (Euclidean, geodesic, weighted etc. [sent-113, score-0.134]

55 (2) Score computation: Construct K-NNG (or -NG) based on dij and compute the score function K-LPE from Equation 1 (or -LPE from Equation 2). [sent-115, score-0.158]

56 Finally, computing the score for each test data requires O(nd+n) operations(given that RS (·), NS (·) have already been computed). [sent-119, score-0.17]

57 Remark: LPE is fundamentally different from non-parametric density estimation or level set estimation schemes (e. [sent-120, score-0.138]

58 These turn out to be sufﬁcient statistics for optimal anomaly detection. [sent-126, score-0.495]

59 3 Theory: Consistency of LPE A statistical framework for the anomaly detection problem is presented in this section. [sent-127, score-0.623]

60 We establish that anomaly detection is equivalent to thresholding p-values for multivariate data. [sent-128, score-0.656]

61 We will then show that the score functions developed in the previous section is an asymptotically consistent estimator of the p-values. [sent-129, score-0.167]

62 Consequently, it will follow that the strategy of declaring an anomaly when a test sample has a low score is asymptotically optimal. [sent-130, score-0.712]

63 Assume that the data belongs to the d-dimensional unit cube [0, 1]d and the nominal data is sampled from a multivariate density f0 (x) supported on the d-dimensional unit cube [0, 1]d . [sent-131, score-0.685]

64 Anomaly detection can be formulated as a composite hypothesis testing problem. [sent-132, score-0.148]

65 Suppose test data, η comes from a mixture distribution, namely, f (η) = (1 − π)f0 (η) + πf1 (η) where f1 (η) is a mixing density supported on [0, 1]d . [sent-133, score-0.185]

66 Anomaly detection involves testing the nominal hypotheses H0 : π = 0 versus the alternative (anomaly) H1 : π > 0. [sent-134, score-0.668]

67 The goal is to maximize the detection power subject to false alarm level α, namely, P(declare H1 | H0 ) ≤ α. [sent-135, score-0.52]

68 Let P0 be the nominal probability measure and f1 (·) be P0 measurable. [sent-137, score-0.52]

69 When the mixing density is uniform, namely, f1 (η) = U (η) where U (η) is uniform over [0, 1]d , note that Ωα = {η | p(η) ≥ α} is a density level set at level α. [sent-146, score-0.33]

70 It is well known (see [12]) that such a density level set is equivalent to a minimum volume set of level α. [sent-147, score-0.27]

71 The minimum volume set at level α is known to be the uniformly most powerful decision region for testing H0 : π = 0 versus the alternative H1 : π > 0 (see [13, 12]). [sent-148, score-0.216]

72 The uniformly most powerful test for testing H0 : π = 0 versus the alternative (anomaly) H1 : π > 0 at a prescribed level α of signiﬁcance P(declare H1 | H0 ) ≤ α is: φ(η) = H1 , p(η) ≤ α H0 , otherwise 4 Proof. [sent-151, score-0.202]

73 First, measure theoretic arguments are used to establish p(X) as a random variable over [0, 1] under both nominal and anomalous distributions. [sent-153, score-0.645]

74 , distributed with nominal density it follows that the random variable p(X) ∼ d d U [0, 1]. [sent-156, score-0.59]

75 Consequently, the uniformly most powerful test for a signiﬁcance level α is to declare p-values smaller than α as anomalies. [sent-158, score-0.289]

76 We divide training set S into two parts: S = S1 ∩ S2 = {x0 , x1 , · · · , xm } ∩ {xm+1 , · · · , x2m } 1 We modify -LPE to p (η) = m xi ∈S1 I{NS2 (η)≥NS1 (xi )} (or K-LPE to pK (η) ˆ ˆ 1 I{RS2 (η)≤RS1 (xi )} ). [sent-163, score-0.143]

77 By interchanging the expectation with the summation, ES1 [ˆ (η)] p = ES1 = 1 m 1 m I{NS2 (η)≥NS1 (xi )} xi ∈S1 Exi ES1 \xi I{NS2 (η)≥NS1 (xi )} xi ∈S1 = Ex1 [PS1 \x1 (NS2 (η) ≥ NS1 (x1 ))] 5 where the last inequality follows from the symmetric structure of {x0 , x1 , · · · , xm }. [sent-189, score-0.195]

78 1 in [18], sup x0 ,··· ,xm ,xi |F (x0 , · · · , xi , · · · , xm ) − F (x0 , · · · , xi , · · · , xn )| ≤ Kγd /m (9) Then the lemma directly follows from applying McDiarmid’s inequality. [sent-216, score-0.257]

79 When the support of f0 lies on a lower dimensional manifold (say d < d) adopting the geodesic metric leads to faster convergence. [sent-219, score-0.148]

80 We randomly pick 109 points with “+1” label and regard them as the nominal training data. [sent-222, score-0.61]

81 The test data comprises of 108 “+1” points and 183 “−1” points (ground truth) and the algorithm is supposed to predict “+1” data as nominal and “−1” data as anomalous. [sent-223, score-0.654]

82 Scores computed for test set using Equation 1 is oblivious to true f1 density (“−1” labels). [sent-224, score-0.12]

83 To control false alarm at level α, points with score smaller than α are predicted as anomaly. [sent-226, score-0.543]

84 Empirical false alarm and true positives are computed from ground truth. [sent-227, score-0.378]

85 We see that for a relatively small training set of size 160 the average empirical ROC curve is very close to the clairvoyant ROC curve. [sent-245, score-0.188]

86 If there are more than 2 labels in the data set, we artiﬁcially regard points with one particular label as nominal and regard the points with other labels as anomalous. [sent-247, score-0.652]

87 For example, for the USPS dataset, we regard instances of digit 0 as nominal and instances of digits 1, · · · , 9 as anomaly. [sent-248, score-0.555]

88 The data points are normalized to be within [0, 1]d and we 7 banana data set 2D Gaussian mixture 0. [sent-249, score-0.138]

89 The test set is a mixture of 20 nominal points and 158 anomaly points. [sent-292, score-1.141]

90 The test set is a mixture of 50 nominal points and 126 anomaly points. [sent-294, score-1.141]

91 The test set is a mixture of 367 nominal points and 33 anomaly points. [sent-296, score-1.141]

92 5 Conclusion In this paper, we proposed a novel non-parametric adaptive anomaly detection algorithm which leads to a computationally efﬁcient solution with provable optimality guarantees. [sent-358, score-0.645]

93 Our algorithm takes a K-nearest neighbor graph as an input and produces a score for each test point. [sent-359, score-0.245]

94 Scores turn out to be empirical estimates of the volume of minimum volume level sets containing the test point. [sent-360, score-0.257]

95 While minimum volume level sets provide an optimal characterization for anomaly detection, they are high dimensional quantities and generally difﬁcult to reliably compute in high dimensional feature spaces. [sent-361, score-0.703]

96 Nevertheless, a sufﬁcient statistic for optimal tradeoff between false alarms and misses is the volume of the MV set itself, which is a real number. [sent-362, score-0.215]

97 By computing score functions we avoid computing high dimensional quantities and still ensure optimal control of false alarms and misses. [sent-363, score-0.335]

98 Jin, “A new local distance-based outlier detection approach for scattered real-world data,” March 2009, arXiv:0903. [sent-401, score-0.128]

99 Jordan, “Robust novelty detection with single-class MPM,” in Neural Information Processing Systems Conference, vol. [sent-426, score-0.182]

100 Hero, “Geometric entropy minimization(GEM) for anomaly detection and localization,” in Neural Information Processing Systems Conference, vol. [sent-436, score-0.623]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nominal', 0.52), ('anomaly', 0.495), ('lpe', 0.357), ('roc', 0.224), ('alarm', 0.178), ('false', 0.146), ('detection', 0.128), ('score', 0.12), ('ns', 0.112), ('declare', 0.107), ('anomalous', 0.1), ('declared', 0.089), ('clairvoyant', 0.078), ('xi', 0.076), ('rs', 0.076), ('density', 0.07), ('level', 0.068), ('geodesic', 0.067), ('bivariate', 0.063), ('scores', 0.063), ('usps', 0.063), ('banana', 0.062), ('lemma', 0.062), ('pk', 0.059), ('positives', 0.054), ('mv', 0.054), ('novelty', 0.054), ('boston', 0.051), ('test', 0.05), ('curve', 0.049), ('asymptotically', 0.047), ('neighbor', 0.047), ('mq', 0.047), ('mixture', 0.045), ('ionosphere', 0.043), ('xm', 0.043), ('distance', 0.043), ('cance', 0.041), ('wine', 0.041), ('gem', 0.038), ('dij', 0.038), ('volume', 0.038), ('dimensional', 0.038), ('empirical', 0.037), ('detector', 0.037), ('mvol', 0.036), ('tax', 0.036), ('regard', 0.035), ('uniform', 0.034), ('powerful', 0.034), ('vol', 0.034), ('nearest', 0.033), ('smoothness', 0.033), ('multivariate', 0.033), ('lm', 0.032), ('um', 0.032), ('cube', 0.031), ('alarms', 0.031), ('hero', 0.031), ('points', 0.031), ('uniformly', 0.03), ('picking', 0.03), ('binomial', 0.029), ('ece', 0.029), ('anomalies', 0.029), ('vert', 0.029), ('graph', 0.028), ('concentrated', 0.027), ('chernoff', 0.027), ('minimum', 0.026), ('dt', 0.026), ('curves', 0.025), ('theoretic', 0.025), ('falls', 0.025), ('bandwidth', 0.025), ('training', 0.024), ('namely', 0.024), ('distances', 0.024), ('svm', 0.024), ('relate', 0.023), ('mcdiarmid', 0.023), ('metric', 0.023), ('scott', 0.022), ('interval', 0.022), ('supposed', 0.022), ('provable', 0.022), ('miss', 0.022), ('operations', 0.021), ('nowak', 0.021), ('deviations', 0.021), ('graphs', 0.02), ('changes', 0.02), ('silva', 0.02), ('mixing', 0.02), ('testing', 0.02), ('tuning', 0.02), ('langford', 0.02), ('equation', 0.02), ('manifold', 0.02), ('accomplished', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

2 0.13587742 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

Abstract: We propose new methodologies to detect anomalies in discrete-time processes taking values in a probability space. These methods are based on the inference of functionals whose evaluations on successive states visited by the process are stationary and have low autocorrelations. Deviations from this behavior are used to ﬂag anomalies. The candidate functionals are estimated in a subspace of a reproducing kernel Hilbert space associated with the original probability space considered. We provide experimental results on simulated datasets which show that these techniques compare favorably with other algorithms.

3 0.13567089 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is ﬁrst learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the ﬁrst stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efﬁciency of the method. 1

4 0.069517538 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

Author: Maxim Raginsky, Svetlana Lazebnik

Abstract: This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.

5 0.060512044 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a ﬁxed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is ﬁxed and the number of unlabeled points goes to inﬁnity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually inﬁnite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the inﬁnite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the inﬁnite-unlabeled-data limit is also a necessary ﬁrst step to studying the convergence of the ﬁnite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classiﬁcation Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is ﬁxed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We ﬁrst consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a ﬁxed graph, over a ﬁnite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay ﬁlter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian ﬁlter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step ﬁlter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held ﬁxed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well deﬁned limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step ﬁlters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions speciﬁed by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We ﬁrst consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be inﬁnite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisﬁes inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the inﬁmum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the ﬁrst labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We ﬁrst consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisﬁes the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of inﬁnite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian ﬁlter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisﬁes: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very ﬂat predictor is still problematic even from a classiﬁcation perspective. First, it is not possible to obtain a meaningful conﬁdence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classiﬁcation error of the resulting predictor y (x). We plot the classiﬁcation error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a ﬂat predictor. Using a threshold of zero leads to an increase in the classiﬁcation error, possibly due to numerical instability. Interestingly, although the predictors become very ﬂat, the classiﬁcation error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classiﬁcation according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classiﬁcation error with a threshold of zero (center) and minimal classiﬁcation error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classiﬁcation performance is achieved with a signiﬁcantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even ﬂatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classiﬁcation case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The ﬁrst is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a ﬁnite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to inﬁnity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a ﬂat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ ﬁnite while taking the number of unlabeled points to inﬁnity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a ﬁnite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a ﬁxed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any ﬁxed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a ﬁxed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes ﬂatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the ﬁrst few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the ﬁrst p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefﬁcients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classiﬁcation function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufﬁcient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any signiﬁcant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classiﬁcation function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best ﬁt when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are signiﬁcantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of inﬁnite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is ﬁxed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not ﬁnd this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely ﬂat, in ˆ binary tasks, it is still typically possible to ﬁnd a threshold leading to a good classiﬁcation performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with ﬂat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the ﬁrst order gradient is not a sufﬁcient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the inﬁnite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classiﬁcation, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian ﬁelds and harmonic functions, ICML, 2003. 9

6 0.059536237 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

7 0.057496123 97 nips-2009-Free energy score space

8 0.056505859 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

9 0.053671151 72 nips-2009-Distribution Matching for Transduction

10 0.052121319 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

11 0.048196804 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

12 0.045359824 133 nips-2009-Learning models of object structure

13 0.044858113 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

14 0.04464199 201 nips-2009-Region-based Segmentation and Object Detection

15 0.044298388 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

16 0.044110276 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

17 0.04196411 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

18 0.041945014 238 nips-2009-Submanifold density estimation

19 0.04120649 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

20 0.040268384 236 nips-2009-Structured output regression for detection with partial truncation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.15), (1, 0.052), (2, -0.022), (3, 0.049), (4, -0.018), (5, -0.01), (6, -0.019), (7, 0.044), (8, -0.038), (9, -0.005), (10, -0.016), (11, 0.012), (12, 0.051), (13, -0.038), (14, -0.012), (15, -0.041), (16, 0.013), (17, -0.016), (18, 0.062), (19, -0.042), (20, -0.06), (21, 0.129), (22, -0.027), (23, -0.023), (24, -0.124), (25, 0.051), (26, 0.036), (27, 0.027), (28, 0.034), (29, 0.071), (30, 0.026), (31, 0.047), (32, 0.146), (33, 0.087), (34, 0.049), (35, -0.018), (36, 0.007), (37, -0.176), (38, 0.043), (39, -0.088), (40, -0.108), (41, -0.078), (42, 0.156), (43, -0.066), (44, 0.034), (45, -0.031), (46, -0.141), (47, -0.003), (48, -0.002), (49, 0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90816772 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

2 0.78080404 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

3 0.65923721 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

Author: Marco Cuturi, Jean-philippe Vert, Alexandre D'aspremont

4 0.4154436 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in ﬁnding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavytailed embedding similarity functions. With this generalization, we are presented with two difﬁculties. The ﬁrst is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this ﬁnding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a ﬁxed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

5 0.40413305 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufﬁciently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difﬁculty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an inﬁnite weighted sum of χ2 random variables. Prior ﬁnite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and ﬁtting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artiﬁcial example, high dimensional multivariate data, and text.

6 0.40243089 248 nips-2009-Toward Provably Correct Feature Selection in Arbitrary Domains

7 0.40173551 238 nips-2009-Submanifold density estimation

8 0.38825291 97 nips-2009-Free energy score space

9 0.36917225 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

10 0.36308545 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

11 0.36192623 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

12 0.35318282 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

13 0.33382174 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

14 0.332993 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

15 0.31496 90 nips-2009-Factor Modeling for Advertisement Targeting

16 0.31355068 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

17 0.31176043 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

18 0.30995163 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

19 0.30607548 256 nips-2009-Which graphical models are difficult to learn?

20 0.30492154 182 nips-2009-Optimal Scoring for Unsupervised Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.052), (25, 0.076), (35, 0.448), (36, 0.094), (39, 0.038), (58, 0.069), (61, 0.013), (71, 0.033), (81, 0.02), (86, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98703277 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

Author: Jacob Goldberger, Amir Leshem

Abstract: This paper proposes a new algorithm for the linear least squares problem where the unknown variables are constrained to be in a ﬁnite set. The factor graph that corresponds to this problem is very loopy; in fact, it is a complete graph. Hence, applying the Belief Propagation (BP) algorithm yields very poor results. The algorithm described here is based on an optimal tree approximation of the Gaussian density of the unconstrained linear system. It is shown that even though the approximation is not directly applied to the exact discrete distribution, applying the BP algorithm to the modiﬁed factor graph outperforms current methods in terms of both performance and complexity. The improved performance of the proposed algorithm is demonstrated on the problem of MIMO detection.

2 0.95479316 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identiﬁability and predictive densities to perform the model comparison. We require identiﬁability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artiﬁcial and biological data showing that our approach outperform a number of state of the art methods. 1

3 0.94804829 236 nips-2009-Structured output regression for detection with partial truncation

Author: Andrea Vedaldi, Andrew Zisserman

Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efﬁcient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term signiﬁcantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efﬁciently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most signiﬁcantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with signiﬁcant truncations. 1

4 0.933972 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably ﬁnds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem. 1 Motivation Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have signiﬁcant limitations. Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a ﬁnite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution. Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2 -based bounds [14], they require values that are difﬁcult to obtain. Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent 1 efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1 -norm approximation v of the optimal value func˜ tion v ∗ up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7]. To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require signiﬁcant domain knowledge [20]. The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss. We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. The new method uniﬁes policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss. We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies–an essential component for achieving robustness–is beyond the scope of this paper, but the issue is brieﬂy discussed in Section 6. Complete proofs of the theorems can be found in [19]. 2 Solving MDPs using ALP In this section, we formally deﬁne MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation. A Markov Decision Process is a tuple (S, A, P, r, α), where S is the ﬁnite set of states, A is the ﬁnite set of actions. P : S × S × A → [0, 1] is the transition function, where P (s , s, a) represents the probability of transiting to state s from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the inﬁnite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s1 , s2 , . . . , sn . Then, Pa and ra are used to denote the probabilistic transition matrix and reward for action a. The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that for all s ∈ S, a∈A π(s, a) = 1. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by Pπ and rπ . The value function update for a policy π is denoted by Lπ , and the Bellman operator is denoted by L. That is: Lπ v = Pπ v + rπ Lv = max Lπ v. π∈Π The optimal value function, denoted v ∗ , satisﬁes v ∗ = Lv ∗ . We focus on linear value function approximation for discounted inﬁnite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we deﬁne a row-vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M , and the value function v are represented as:   − φ(s1 ) −   M = − φ(s2 ) − v = M x. . . . Deﬁnition 1. A value function, v, is representable if v ∈ M ⊆ R|S| , where M = colspan (M ), and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as: K = {v ∈ R|S| v ≥ Lv}. 2 Notice that the optimal value function v ∗ is transitive-feasible, and M is a linear space. Also, all the inequalities are element-wise. Because the new formulation is related to ALP, we introduce it ﬁrst. It is well known that an inﬁnite horizon discounted MDP problem may be formulated in terms of solving the following linear program: minimize v c(s)v(s) s∈S v(s) − γ s.t. P (s , s, a)v(s ) ≥ r(s, a) ∀(s, a) ∈ (S, A) (1) s ∈S We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one. That is, s∈S c(s) = 1. The linear program in Eq. (1) is often too large to be solved precisely, so it is approximated to get an approximate linear program by assuming that v ∈ M [8], as follows: minimize cT v x Av ≥ b s.t. (2) v∈M The constraint v ∈ M denotes the approximation. To actually solve this linear program, the value function is represented as v = M x. In the remainder of the paper, we assume that 1 ∈ M to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, v , satisﬁes that v ≥ v ∗ . Then, the objective of Eq. (2) represents the minimization of v − v ∗ 1,c , ˜ ˜ ˜ where · 1,c is a c-weighted L1 norm [7]. The ultimate goal of the optimization is not to obtain a good value function v , but a good policy. ˜ The quality of the policy, typically chosen to be greedy with respect to v , depends non-trivially on ˜ the approximate value function. The ABP formulation will minimize policy loss by minimizing L˜ − v ∞ , which bounds the policy loss as follows. v ˜ Theorem 2 (e.g. [25]). Let v be an arbitrary value function, and let v be the value of the greedy ˜ ˆ policy with respect to v . Then: ˜ 2 v∗ − v ∞ ≤ ˆ L˜ − v ∞ , v ˜ 1−γ In addition, if v ≥ L˜, the policy loss is smallest for the greedy policy. ˜ v Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that o(s, a) → N maps a state and an action to its position. The policies are represented as θ ∈ R|S|×|A| , and we use the shorthand notation θ(s, a) = θ(o(s, a)). Remark 3. The corresponding π and θ are denoted as π θ and θπ and satisfy: π θ (s, a) = θπ (s, a). We will also consider approximations of the policies in the policy-space, generated by columns of a matrix N . A policy is representable when π ∈ N , where N = colspan (N ). 3 Approximate Bilinear Programs This section shows how to formulate minv∈M Lv − v ∞ as a separable bilinear program. Bilinear programs are a generalization of linear programs with an additional bilinear term in the objective function. A separable bilinear program consists of two linear programs with independent constraints and are fairly easy to solve and analyze. Deﬁnition 4 (Separable Bilinear Program). A separable bilinear program in the normal form is deﬁned as follows: T T minimize f (w, x, y, z) = sT w + r1 x + xT Cy + r2 y + sT z 1 2 w,x y,z s.t. A1 x + B1 w = b1 A2 y + B2 z = b2 w, x ≥ 0 y, z ≥ 0 3 (3) We separate the variables using a vertical line and the constraints using different columns to emphasize the separable nature of the bilinear program. In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs. An approximate bilinear program can now be formulated as follows. minimize θT λ + λ θ λ,λ ,v Bθ = 1 z = Av − b s.t. θ≥0 z≥0 (4) λ+λ1≥z λ≥0 θ∈N v∈M All variables are vectors except λ , which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is deﬁned for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are deﬁned for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints: θ(s, a) = 1 ∀s ∈ S. a∈A As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus when constraints are omitted, also the corresponding elements of z and θ are omitted. To simplify the notation, the value function approximation in this problem is denoted only implicitly by v ∈ M, and the policy approximation is denoted by θ ∈ N . In an actual implementation, the optimization variables would be x, y using the relationships v = M x and θ = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility. The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We ﬁrst analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation. ˜˜ ˜ ˜ Theorem 5. b Let (θ, v , λ, λ ) be an optimal solution of Eq. (4) and assume that 1 ∈ M. Then: ˜ ˜ ˜ θT λ + λ = L˜ − v v ˜ ∞ ≤ min v∈K∩M Lv − v ∞ ≤ 2 min Lv − v v∈M ∞ ≤ 2(1 + γ) min v − v ∗ v∈M ∞. ˜ In addition, π θ minimizes the Bellman residual with regard to v , and its value function v satisﬁes: ˜ ˆ 2 min Lv − v ∞ . v − v∗ ∞ ≤ ˆ 1 − γ v∈M The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefﬁcient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant vector 1 from v to balance the lower bounds ˜ on the Bellman residual error with the upper ones. This modiﬁed approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever v ∗ ∈ M, both ABP and ALP will return the optimal value function. The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ . The theorem then follows using techniques similar to those used for approximate linear programs [7]. 4 Algorithm 1: Iterative algorithm for solving Eq. (3) (x0 , w0 ) ← random ; (y0 , z0 ) ← arg miny,z f (w0 , x0 , y, z) ; i←1; while yi−1 = yi or xi−1 = xi do (yi , zi ) ← arg min{y,z A2 y+B2 z=b2 y,z≥0} f (wi−1 , xi−1 , y, z) ; (xi , wi ) ← arg min{x,w A1 x+B1 w=b1 x,w≥0} f (w, x, yi , zi ) ; i←i+1 return f (wi , xi , yi , zi ) 4 Solving Bilinear Programs In this section we describe simple methods for solving ABPs. We ﬁrst describe optimal methods, which have exponential complexity, and then discuss some approximation strategies. Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the ﬁnite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in ﬁnite time. The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves. Theorem 6. b Determining minv∈K∩M Lv − v ∞ < is NP-complete for the full constraint representation, 0 < γ < 1, and a given > 0. In addition, the problem remains NP-complete when 1 ∈ M, and therefore minv∈M Lv − v ∞ < is also NP-complete. As the theorem states, the value function approximation does not become computationally simpler even when 1 ∈ M – a universal assumption in the paper. Notice that ALP can determine whether minv∈K∩M Lv − v ∞ = 0 in polynomial time. The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values. Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efﬁciently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper. The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f (w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to ﬁnd an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition. Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a ﬁxed point of the algorithm, and the objective value monotonically improves during execution. 5 The proof is based on the ﬁnite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge. In the context of MDPs, Algorithm 1 can be further reﬁned. For example, the constraint v ∈ M in Eq. (4) serves mostly to simplify the bilinear program and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions. Proposition 8. Let v1 and v2 be feasible value functions in Eq. (4). Then the value function ˜ ˜ v (s) = min{˜1 (s), v2 (s)} is also feasible in Eq. (4). Therefore v ≥ v ∗ and v ∗ − v ∞ ≤ ˜ v ˜ ˜ ˜ min { v ∗ − v1 ∞ , v ∗ − v2 ∞ }. ˜ ˜ The proof of the proposition is based on Jensen’s inequality and can be found in [19]. Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain v also increases the potential sampling error. ˜ 5 Relationship to ALP and API In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API. ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to L1 norm, which does not guarantee small policy loss, 2) ALP’s solution quality depends signiﬁcantly on the heuristically-chosen objective function c in Eq. (2) [7], and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP, however, is the higher computational complexity. Both the ﬁrst and the second issues in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to signiﬁcant improvements in speciﬁc domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments. The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is deﬁned using the L1 norm on the states, and the constraints are deﬁned using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and objective function. The objective function of ABP is then bounded below by 0 for an arbitrarily small number of samples. ABP can also improve on API with L∞ minimization (L∞ -API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with the L2 norm minimization [14], such as LSPI [12]. L∞ -API is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. (I − γPπ )v + 1φ ≥ rπ −(I − γPπ )v + 1φ ≥ −rπ (5) v∈M Here I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L∞ -API, and this analysis is beyond the scope of this paper. 6 Algorithm 2: Approximate policy iteration, where f (π) denotes a custom value function approximation for the policy π. π0 , k ← rand, 1 ; while πk = πk−1 do vk ← f (πk−1 ) ; ˜ πk (s) ← arg maxa∈A r(s, a) + γ s ∈S P (s , s, a)˜k (s) ∀s ∈ S ; v k ←k+1 We propose Optimistic Approximate Policy Iteration (OAPI), a modiﬁcation of API. OAPI is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. Av ≥ b (≡ (I − γPπ )v ≥ rπ ∀π ∈ Π) −(I − γPπ )v + 1φ ≥ −rπ (6) v∈M In fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4) with ﬁxed θ. Then, using Proposition 7, we get the following corollary. Corollary 9. Optimistic approximate policy iteration converges in ﬁnite time. In addition, the Bellman residual of the generated value functions monotonically decreases. OAPI differs from L∞ -API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then it minimizes φ. L∞ -API constrains the Bellman residuals by φ from both above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual. L∞ -API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞ -API, as the following theorem states. Theorem 10. b Assume that L∞ -API converges to a policy π and a value function v that both φ satisfy: φ = v − Lπ v ∞ = v − Lv ∞ . Then v = v + 1−γ 1 is feasible in Eq. (4), and it is a ﬁxed ˜ point of OAPI. In addition, the greedy policies with respect to v and v are identical. ˜ The proof is based on two facts. First, v is feasible with respect to the constraints in Eq. (4). The ˜ Bellman residual changes for all the policies identically, since a constant vector is added. Second, because Lπ is greedy with respect to v , we have that v ≥ Lπ v ≥ L˜. The value function v is ˜ ˜ ˜ v ˜ therefore transitive-feasible. The full proof can be found in [19]. To summarize, OAPI guarantees convergence, while matching the performance of L∞ -API. The convergence of OAPI is achieved because given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞ -API does not minimize the Bellman residual, and therefore L∞ -API does not always reduce it. Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1 -norm approximation of a single iteration of OAPI. L∞ -API, on the other hand, performs many such ALP-like iterations. 6 Empirical Evaluation As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem. We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate 7 (a) L∞ error of the Bellman residual Features 100 144 OAPI 0.21 (0.23) 0.13 (0.1) ALP 13. (13.) 3.6 (4.3) LSPI 9. (14.) 3.9 (7.7) API 0.46 (0.08) 0.86 (1.18) (b) L2 error of the Bellman residual Features 100 144 OAPI 0.2 (0.3) 0.1 (1.9) ALP 9.5 (18.) 0.3 (0.4) LSPI 1.2 (1.5) 0.9 (0.1) API 0.04 (0.01) 0.08 (0.08) Table 1: Bellman residual of the ﬁnal value function. The values are averages over 5 executions, with the standard deviations shown in parentheses. them. This issue can be addressed in at least two ways. First, a small randomly-selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞ -API and ALP [10], and can be easily extended to OAPI and ABP. In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it ﬁrst needs to back up to an opposite hill to gain sufﬁcient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99. The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly-spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with the number of the approximation features 100 and 144, which corresponds to the number of linear segments. The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are depicted in Table 1. The results are shown for both L∞ norm and uniformly-weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform signiﬁcantly better given a ﬁner approximation. 7 Conclusion and Future Work We proposed and analyzed approximate bilinear programming, a new value-function approximation method, which provably minimizes the L∞ Bellman residual. ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since ﬁnding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves the objective value function. While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice. A similar formulation based on L2 norm minimization may be more practical. We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely-applicable reinforcement learning algorithms. Acknowledgements This work was supported by the Air Force Ofﬁce of Scientiﬁc Research under Grant No. FA955008-1-0171. We also thank the anonymous reviewers for their useful comments. 8 References [1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006. [2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004. [3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992. [4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997. [5] Guiuseppe Calaﬁore and M.C. Campi. Uncertain convex programs: Randomized solutions and conﬁdence levels. Mathematical Programming, Series A, 102:25–46, 2005. [6] Alberto Carpara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming Series A, 118:75–108, 2009. [7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002. [8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003. [9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004. [10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efﬁcient solution algorithms for factored MDPs. Journal of Artiﬁcial Intelligence Research, 19:399–468, 2003. [11] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer, 1996. [12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. [13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995. [14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003. [15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artiﬁcial Intelligence, pages 750–755, 2007. [16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artiﬁcial Intelligence, pages 1997–2002, 2007. [17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artiﬁcial Intelligence Research, 35:235–274, 2009. [18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009. [19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009. [20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007. [21] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005. [22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexiﬁcation. Journal of Artiﬁcial Intelligence Research, 21:63–100, 2004. [23] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998. [24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006. [25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994. 9

same-paper 5 0.89882016 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

6 0.87502867 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

7 0.68751085 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

8 0.65543401 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

9 0.6450358 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

10 0.64023501 187 nips-2009-Particle-based Variational Inference for Continuous Systems

11 0.63327408 100 nips-2009-Gaussian process regression with Student-t likelihood

12 0.63315094 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

13 0.63312954 113 nips-2009-Improving Existing Fault Recovery Policies

14 0.62063074 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

15 0.6176964 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

16 0.61369759 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

17 0.60894442 3 nips-2009-AUC optimization and the two-sample problem

18 0.60126728 48 nips-2009-Bootstrapping from Game Tree Search

19 0.60094154 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

20 0.59624219 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling