nips nips2008 nips2008-193 knowledge-graph by maker-knowledge-mining

193 nips-2008-Regularized Co-Clustering with Dual Supervision


Source: pdf

Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic

Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. [sent-3, score-0.31]

2 A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). [sent-4, score-0.279]

3 In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. [sent-5, score-0.556]

4 In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. [sent-6, score-0.53]

5 1 Introduction Consider the setting where we are given large amounts of unlabeled data together with dual supervision in the form of a few labeled examples as well as a few labeled features, and the goal is to estimate an unknown classification function. [sent-9, score-0.653]

6 They typically implement the cluster assumption [3] by learning decision boundaries such that unlabeled points belonging to the same cluster are given the same label, and empirical loss over labeled examples is concurrently minimized. [sent-16, score-0.373]

7 In situations where the classes are predominantly supported on unknown subsets of similar features, it is clear that feature supervision can potentially illuminate the true cluster structure inherent in the unlabeled examples over which the cluster assumption ought to be enforced. [sent-17, score-0.484]

8 Even when feature supervision is not available, there is ample empirical evidence in numerous recent papers in the co-clustering literature (see e.g. [sent-18, score-0.253]

9 [5, 1] and references therein), suggesting that the clustering of columns (features) of a data matrix can lead to massive improvements in the quality of row (examples) clustering. [sent-20, score-0.306]

10 An intuitive explanation is that column clustering enforces a form of dimensional reduction or implicit regularization that is responsible for performance enhancements observed in many applications such as text clustering, microarray data analysis and video content mining [1]. [sent-21, score-0.242]

11 In this paper, we utilize data-dependent co-clustering regularizers for semi-supervised learning in the presence of partial dual supervision. [sent-22, score-0.1]

12 Our starting point is the spectral bipartite graph partitioning approach of [5], which we briefly review in Section 2.1. [sent-23, score-0.453]

13 This approach effectively applies spectral clustering on a graph representation of the data matrix and is also intimately related to Singular Value Decomposition. [sent-25, score-0.324]

14 In Section 2.2 we review an equivalence between this approach and a matrix approximation objective function that is minimized under orthogonality constraints [6]. [sent-27, score-0.153]

15 By dropping the orthogonality constraints but imposing non-negativity constraints, one is led to a large family of co-clustering algorithms that arise from the non-negative matrix factorization literature. [sent-28, score-0.161]

16 Based on the algorithmic intuitions embodied in the algorithms above, we develop two semisupervised classification algorithms that extend the spectral bipartite graph partitioning approach and the matrix approximation approach respectively. [sent-29, score-0.555]

17 We start with Reproducing Kernel Hilbert Spaces (RKHSs) defined over both row and column spaces. [sent-30, score-0.244]

18 In the first algorithm, we directly adopt graph Laplacian regularizers constructed from the bipartite graph of [5] and include them as row and column smoothing terms in the standard regularization objective function. [sent-32, score-0.752]

19 This approach may be viewed as a modification of the Manifold Regularization framework [2] where we now jointly learn row and column classification functions. [sent-34, score-0.244]

20 In the second algorithm proposed in this paper, we instead add a (non-convex) matrix approximation term to the objective function, which is then minimized using a block-coordinate descent procedure. [sent-35, score-0.115]

21 Unlike their unsupervised counterparts, our methods support dual supervision and naturally possess an out-of-sample extension. [sent-36, score-0.092]

22 2 Co-Clustering Algorithms Let X denote the data matrix with n data points and d features. [sent-38, score-0.077]

23 The methods that we discuss in this section output a row partition function πr : {1, . . . , n} → {1, . . . , mr} and a column partition function πc : {1, . . . , d} → {1, . . . , mc} that give cluster assignments to row and column indices respectively. [sent-39, score-0.592]

24 Here, mr is the desired number of row clusters and mc is the desired number of column clusters. [sent-40, score-0.688]

25 Below, by xi we mean the ith example (row) and by fj we mean the jth column (feature) in the data matrix. [sent-41, score-0.119]

26 1 Bipartite Graph Partitioning In the co-clustering technique introduced by [5], the data matrix is modeled as a bipartite graph with examples (rows) as one set of nodes and features (columns) as another. [sent-43, score-0.453]

27 This bi-partite graph is undirected and there are no inter-example or inter-feature edges. [sent-45, score-0.089]

28 The adjacency matrix, W, and the normalized Laplacian [4], M, of this graph are given by W = [[0, X], [Xᵀ, 0]] and M = I − D^{−1/2} W D^{−1/2} (1), where D is the diagonal degree matrix defined by Dii = Σj Wij and I is the (n + d) × (n + d) identity matrix. [sent-46, score-0.166]
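A minimal numpy sketch of Eq. (1) may make the construction concrete. It is our illustration rather than code from the paper; the helper name bipartite_laplacian is hypothetical and X is assumed to be a dense nonnegative matrix with n rows (examples) and d columns (features).

```python
# Sketch (ours): bipartite adjacency W and normalized Laplacian M of Eq. (1).
import numpy as np

def bipartite_laplacian(X):
    n, d = X.shape
    # Rows and columns of X become the two node sets of the bipartite graph.
    W = np.block([[np.zeros((n, n)), X],
                  [X.T, np.zeros((d, d))]])          # (n + d) x (n + d)
    deg = np.maximum(W.sum(axis=1), 1e-12)            # degrees D_ii = sum_j W_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    M = np.eye(n + d) - D_inv_sqrt @ W @ D_inv_sqrt   # M = I - D^{-1/2} W D^{-1/2}
    return W, M
```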

29 Guided by the premise that column clustering induces row clustering while row clustering induces column clustering, [5] propose to find an optimal partitioning of the nodes of the bipartite graph. [sent-47, score-0.974]

30 This method is restricted to obtaining co-clusterings where mr = mc = m. [sent-48, score-0.444]

31 The m-partitioning is obtained by minimizing the relaxation of the normalized cut objective function using standard spectral clustering techniques. [sent-49, score-0.196]

32 This reduces to first constructing a spectral representation of rows and columns given by the smallest eigenvectors of M, and then performing standard k-means clustering on this representation, to finally obtain the partition functions πr , πc . [sent-50, score-0.273]
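The co-clustering step itself can be sketched as follows; this is our illustration, with M the normalized Laplacian of Eq. (1), n the number of rows of X, and m the common number of row and column clusters. The function name and the k-means settings are assumptions, not details from the paper.

```python
# Sketch (ours): embed rows and columns with the smallest eigenvectors of M,
# then run k-means jointly on the (n + d) embedded points.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_copartition(M, n, m):
    _, vecs = eigh(M)                      # eigenvalues returned in ascending order
    Z = vecs[:, 1:m + 1]                   # skip the (near-)trivial smallest eigenvector
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(Z)
    return labels[:n], labels[n:]          # (pi_r over rows, pi_c over columns)
```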

33 It can be shown that the spectral representation used in this algorithm is related to the singular vectors of a normalized version of X. [sent-52, score-0.088]

34 2 Matrix Approximation Formulation In [6] it is shown that the bipartite spectral graph partitioning is closely related to solving the following matrix approximation problem: (Fr⋆, Fc⋆) = argmin_{FrᵀFr = I, FcᵀFc = I} ‖X − FrFcᵀ‖_fro, where Fr is an n × m matrix and Fc is a d × m matrix. [sent-54, score-0.649]

35 Once the minimization is performed, πr(i) = argmaxj (Fr⋆)ij and πc(i) = argmaxj (Fc⋆)ij. [sent-55, score-0.078]

36 In a non-negative matrix factorization approach, the orthogonality constraints are dropped to make the optimization easier, while non-negativity constraints Fr, Fc ≥ 0 are introduced with the goal of lending better interpretability to the solutions. [sent-56, score-0.231]
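A hedged sketch of this relaxation, using scikit-learn's generic NMF solver rather than any update rules from the co-clustering literature; X is assumed nonnegative and the function name is ours.

```python
# Sketch (ours): two-factor nonnegative approximation X ~ Fr Fc^T, with
# cluster assignments read off by row-wise argmax.
import numpy as np
from sklearn.decomposition import NMF

def nmf_coclustering(X, m):
    model = NMF(n_components=m, init="nndsvda", max_iter=500)
    Fr = model.fit_transform(X)        # n x m row factors
    Fc = model.components_.T           # d x m column factors
    return Fr.argmax(axis=1), Fc.argmax(axis=1)   # (pi_r, pi_c)
```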

37 In Section 3.2 we consider a 3-factor non-negative matrix approximation to incorporate unequal values of mr and mc, and to improve the quality of the approximation. [sent-60, score-0.521]

38 We consider column values f for each feature to be a data point in C ⊂ ℜn . [sent-63, score-0.148]

39 Our goal is to learn partition functions defined over the entire row and column spaces (as opposed to matrix indices), i.e., πr : R → {1, . . . , mr} and πc : C → {1, . . . , mc}. [sent-64, score-0.352]

40 For this purpose, let us introduce kr : R × R → ℜ to be the row kernel that defines an associated RKHS Hr. [sent-67, score-0.435]

41 Similarly, kc : C × C → ℜ denotes the column kernel whose associated RKHS is Hc . [sent-68, score-0.326]

42 Consider a simultaneous assignment of rows into mr classes and columns into mc classes. [sent-70, score-0.528]

43 For any data point x, denote Fr(x) = [fr^1(x) · · · fr^mr(x)]ᵀ ∈ ℜ^mr to be a vector whose elements are soft class assignments, where fr^j ∈ Hr for all j. [sent-71, score-0.645]

44 For the given n data points, denote Fr to be the n × mr class assignment matrix. [sent-72, score-0.337]

45 Correspondingly, Fc (f ) is defined for any feature f ∈ C, and Fc denotes the associated column class assignment matrix. [sent-73, score-0.179]

46 Additionally, we are given dual supervision in the form of label matrices Yr ∈ ℜ^{n×mr} and Yc ∈ ℜ^{d×mc}, where Yrij = 1 specifies that the ith example is labeled with class j (similarly for the feature label matrix Yc). [sent-74, score-0.561]

47 Unlabeled points have all-zero rows, and the row sums are therefore 0. [sent-76, score-0.125]

48 Let Jr (Jc ) denote a diagonal matrix of size n×n (d×d) whose diagonal entry is 1 for labeled examples (features) and 0 otherwise. [sent-77, score-0.231]
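As a concrete illustration of this encoding (our sketch, with hypothetical helper names), the row-side matrices Yr and Jr can be built as follows; the column-side Yc and Jc are constructed analogously from the labeled features.

```python
# Sketch (ours): one-hot label matrix Y and diagonal labeled-indicator J.
import numpy as np

def encode_partial_labels(n, m, labeled_idx, labeled_classes):
    """labeled_idx[i] is an example index, labeled_classes[i] its class in 0..m-1."""
    Y = np.zeros((n, m))
    J = np.zeros((n, n))
    for idx, cls in zip(labeled_idx, labeled_classes):
        Y[idx, cls] = 1.0        # one-hot row for a labeled example
        J[idx, idx] = 1.0        # diagonal entry 1 marks the example as labeled
    return Y, J                  # unlabeled examples keep all-zero rows in Y
```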

49 The middle two terms measure squared loss on labeled data. [sent-82, score-0.11]

50 The final term measures the smoothness of the row and column indicator functions with respect to the bipartite graph introduced in Section 2.1. [sent-83, score-0.534]

51 Clearly, by the Representer Theorem the solution has the form fr^j(x) = Σ_{i=1}^{n} αij kr(x, xi), 1 ≤ j ≤ mr, and fc^j(f) = Σ_{i=1}^{d} βij kc(f, fi), 1 ≤ j ≤ mc (3). Let α, β denote the corresponding optimal expansion coefficient matrices. [sent-87, score-1.537]

52 The partition functions are then defined by πr(x) = argmax_{1≤j≤mr} Σ_{i=1}^{n} αij kr(x, xi) and πc(f) = argmax_{1≤j≤mc} Σ_{i=1}^{d} βij kc(f, fi) (5). [sent-90, score-0.548]
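A small sketch of how Eqns (3) and (5) are evaluated on the row side once the coefficients are known. This is our illustration: the Gaussian (RBF) kernel, the function names, and the argument conventions are assumptions rather than details taken from the paper; the column side is symmetric with beta and kc.

```python
# Sketch (ours): evaluate F_r(x) = [sum_i alpha_ij k_r(x, x_i)]_j and take argmax.
import numpy as np

def rbf_kernel_vec(x, Xtrain, sigma):
    sq = ((Xtrain - x) ** 2).sum(axis=1)
    return np.exp(-sq / (2.0 * sigma ** 2))      # k_r(x, x_i) for all training rows

def predict_row_class(x, Xtrain, alpha, sigma):
    k = rbf_kernel_vec(x, Xtrain, sigma)         # shape (n,)
    scores = k @ alpha                           # F_r(x), shape (mr,), Eq. (3)
    return int(np.argmax(scores))                # pi_r(x), Eq. (5)
```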

53 This approach is closely related to the Manifold Regularization framework of [2], and may be viewed as a modification of the Laplacian Regularized Least Squares (LAPRLS) algorithm, which uses a Euclidean nearest neighbor row similarity graph to capture the manifold structure in the data. [sent-93, score-0.265]

54 One can also use a large family of graph regularizers derived from the graph Laplacian [3]. [sent-95, score-0.216]
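For contrast with the bipartite regularizer, a typical LAPRLS-style row-similarity Laplacian can be sketched as follows; this is our illustration, and the neighborhood size and helper name are assumptions rather than settings from the paper.

```python
# Sketch (ours): normalized Laplacian of a Euclidean k-nearest-neighbor row graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_row_laplacian(X, k=10):
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                          # symmetrize the kNN graph
    deg = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(X.shape[0]) - D_inv_sqrt @ W.toarray() @ D_inv_sqrt
```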

55 Alternatively, one can solve the linear system [10]: (Bᵀ ⊗ A + Dᵀ ⊗ C) vec(X) = vec(E), where ⊗ denotes the Kronecker product and vec(X) vectorizes X in a column-oriented way (it behaves like the Matlab operator X(:)). [sent-107, score-0.119]

56 Thus, the solutions to Eqns (9,10) are as follows: [Imr ⊗ (γr In + Jr Kr) + µ Zc ⊗ Kr] vec(α) = vec(Jr Yr + µ X Kc β Qᵀ) (12) and [Imc ⊗ (γr Id + Jc Kc) + µ Zr ⊗ Kc] vec(β) = vec(Jc Yc + µ Xᵀ Kr α Q) (13). These linear systems are of size nmr × nmr and dmc × dmc respectively. [sent-108, score-0.148]
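A literal numpy sketch of the first of these solves, Eq. (12), treating beta as fixed; it is our illustration, with shapes following the notation above (alpha is n x mr, beta is d x mc, Q is mr x mc, and Zc is assumed mr x mr). Forming the Kronecker products explicitly is only practical for small n and mr.

```python
# Sketch (ours): solve [I_mr (x) (gamma_r I_n + Jr Kr) + mu Zc (x) Kr] vec(alpha)
#                   = vec(Jr Yr + mu X Kc beta Q^T)   for alpha.
import numpy as np

def solve_for_alpha(Kr, Kc, Jr, Yr, Zc, Q, X, beta, gamma_r, mu):
    n, mr = Yr.shape
    A = np.kron(np.eye(mr), gamma_r * np.eye(n) + Jr @ Kr) + mu * np.kron(Zc, Kr)
    rhs = Jr @ Yr + mu * X @ Kc @ beta @ Q.T
    vec_alpha = np.linalg.solve(A, rhs.reshape(-1, order="F"))   # column-major vec(.)
    return vec_alpha.reshape(n, mr, order="F")
```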

57 The goal is to observe whether dual supervision, particularly along features, can help improve classification performance. [sent-119, score-0.453]

58 We also examine whether joint RKHS regularization as formulated in our algorithms (abbreviated MR for the manifold regularization based method of Section 3.1, and MA for the matrix approximation based method of Section 3.2), along both rows and columns, leads to good quality out-of-sample prediction. [sent-121, score-0.084]

59 In the experiments below, the performance of RLS and LAPRLS is optimized on the unlabeled set over a grid of hyperparameters. [sent-122, score-0.135]

60 These were set to 2^k σ0r and 2^k σ0c respectively, where σ0r, σ0c are the (1/m)-quantiles of pairwise Euclidean distances among rows and columns respectively for an m-class problem, and k is tuned over {−2, −1, 0, 1, 2} to optimize 3-fold cross-validation performance of fully supervised RLS. [sent-124, score-0.115]
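A sketch of this width heuristic as we read it; the quantile convention and the use of scipy's pdist are our assumptions.

```python
# Sketch (ours): sigma = 2**k * sigma0, with sigma0 the (1/m)-quantile of
# pairwise Euclidean distances (use X for rows, X.T for columns).
import numpy as np
from scipy.spatial.distance import pdist

def kernel_width(X, m, k=0):
    sigma0 = np.quantile(pdist(X, metric="euclidean"), 1.0 / m)
    return (2.0 ** k) * sigma0
```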

61 Given a column partitioning πc, consider the transformation T(x) = ( Σ_{i : πc(i)=1} xi / |{i : πc(i)=1}| , Σ_{i : πc(i)=−1} xi / |{i : πc(i)=−1}| ), which maps examples in ℜ^100 to the plane ℜ^2 by composing, for each column partition, a single feature whose value equals the mean of all features in that partition. [sent-135, score-0.309]
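The toy transformation can be written in a few lines (our sketch; names are ours and pi_c is assumed to take values +1 and -1 over the 100 features).

```python
# Sketch (ours): average the features inside each column cluster.
import numpy as np

def T(x, pi_c):
    pos, neg = (pi_c == 1), (pi_c == -1)
    return np.array([x[pos].mean(), x[neg].mean()])   # maps R^100 to R^2
```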

62 For the correct column partitioning, πc(i) = 1 for 1 ≤ i ≤ 50 and πc(i) = −1 for 50 < i ≤ 100, the examples under the action of T are shown in Figure 1 (left). [sent-136, score-0.163]

63 In Figure 1 (right), we plot the learning curves of various algorithms with respect to increasing number of row and column labels. [sent-139, score-0.244]

64 On this dataset, co-clustering techniques (BIPARTITE, NMF) perform fairly well, and even significantly better than RLS, which has an optimized F-measure of 67% with 25 row labels. [sent-140, score-0.125]

65 With increasing numbers of column labels, the learning curves of MR and MA steadily lift, eventually outperforming the unsupervised techniques. [sent-141, score-0.149]

66 Figure 1: left: Examples in the toy dataset under the transformation defined by the correct column partitioning. [sent-147, score-0.184]

67 right: Performance comparison – the number of column labels used is marked. [sent-148, score-0.179]

68 Finally, we restricted attention to 631 words with highest overall mutual information and 2000 documents that belong to the following 5 classes: comp. [sent-175, score-0.078]

69 Since the mideast newsgroup accounted for more than half the vocabulary, we used the class normalization prescribed in [11] to handle the imbalance in the labeled data. [sent-185, score-0.141]

70 In each run, we randomly split the documents into training and test sets, in the ratio 1 : 3. [sent-187, score-0.08]

71 The training set is then further split into labeled and unlabeled sets by randomly selecting 75 labeled documents. [sent-188, score-0.388]
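A sketch of this splitting protocol (our illustration; the seed and helper name are ours).

```python
# Sketch (ours): 1:3 train/test split of 2000 documents, then 75 labeled
# documents chosen at random inside the training set.
import numpy as np

def split_documents(n_docs=2000, n_labeled=75, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_docs)
    n_train = n_docs // 4                     # training : test = 1 : 3
    train, test = perm[:n_train], perm[n_train:]
    labeled = rng.choice(train, size=n_labeled, replace=False)
    unlabeled = np.setdiff1d(train, labeled)
    return labeled, unlabeled, test
```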

72 We observe that even without any word supervision MR outperforms all the baseline approaches: unsupervised co-clustering with BIPARTITE and NMF, standard RLS that only uses labeled documents, and also LAPRLS which uses a graph Laplacian based on document similarity for semisupervised learning. [sent-195, score-0.501]

73 This validates the effectiveness of the bipartite document and word graph regularizer. [sent-196, score-0.345]

74 As the amount of word supervision increases, the performance of both MR and MA improves gracefully. [sent-197, score-0.247]

75 In Figure 2 we show the top unlabeled words. [sent-205, score-0.196]

76 Table 1: Performance on 5-Newsgroups Dataset with 75 row labels. Panels (a) and (b) report F-measure on the unlabeled set and the test set for BIPARTITE, NMF, RLS, and LAPRLS. [sent-209, score-0.185]

77 Panel (c) reports F-measure on the unlabeled set for MR and MA as the number of column labels (# col labs) is varied over 0, 100, 200, 350, and 500. [sent-221, score-0.122]

78 Words in Figure 2 are sorted for each class by the real-valued prediction score assigned by MR (in one run trained with 100 labeled words). [sent-261, score-0.17]

79 Figure 2: Top unlabeled words categorized by MR. [sent-263, score-0.196]

80 The dataset is composed of 1169 projects tracked by the Integrated Technology Services division of IBM. [sent-272, score-0.085]

81 These projects need to be categorized into 8 predefined product categories within IBM’s Server Services product line, with the eventual goal of performing various follow-up business analyses at the granularity of categories. [sent-273, score-0.11]

82 Domain experts validated project (row) labels and additionally provided category labels for 25 features deemed to be important skills for delivering projects in the corresponding category. [sent-278, score-0.275]

83 By demonstrating our algorithms on this dataset, we are able to validate a general methodology with which to approach project categorization across all service product lines (SPLs) on a regular basis. [sent-279, score-0.081]

84 The amount of dual supervision available in other SPLs is indeed severely limited as both the project categories and skill definitions are constantly evolving due to the highly dynamic business environment. [sent-280, score-0.327]

85 In each run, we randomly split the projects into training and test sets, in the ratio 3 : 1. [sent-282, score-0.078]

86 The training set is then further split into labeled and unlabeled sets by randomly selecting 30 labeled projects. [sent-283, score-0.388]

87 We experimented with increasing number of randomly chosen column labels, from none to all 25 available labels. [sent-284, score-0.119]

88 We observe that BIPARTITE performs significantly better than NMF on this dataset, and is competitive with supervised RLS performance that relies only on labeled data. [sent-291, score-0.11]

89 We find that MR outperforms all approaches significantly even with very few column labels. [sent-293, score-0.119]

90 Table 2: Performance on IBM Project Categorization Dataset with 30 row labels. Panels (a) and (b) report F-measure on the unlabeled set and the test set for BIPARTITE, NMF, RLS, and LAPRLS. [sent-296, score-0.185]

91 Panel (c) reports F-measure on the unlabeled set for MR and MA as the number of column labels (# col labs) is varied over 0, 5, 10, 15, and 25. [sent-308, score-0.122]

92 Conclusion We have developed semi-supervised kernel methods that support partial supervision along both dimensions of the data. [sent-348, score-0.192]

93 Empirical studies show promising results and highlight the previously untapped benefits of feature supervision in semi-supervised settings. [sent-349, score-0.221]

94 For an application of closely related algorithms to blog sentiment classification, we point the reader to [14]. [sent-350, score-0.133]

95 For recent work on text categorization with labeled features instead of labeled examples, see [8]. [sent-351, score-0.305]

96 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-365, score-0.245]

97 Co-clustering documents and words using bipartite spectral graph partitioning. [sent-379, score-0.456]

98 On the equivalence of nonnegative matrix factorization and spectral clustering. [sent-386, score-0.242]

99 The relationships among various nonnegative matrix factorization methods for clustering. [sent-427, score-0.154]

100 Document clustering using word clusters via the information bottleneck method. [sent-437, score-0.125]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kr', 0.31), ('fr', 0.307), ('mr', 0.306), ('fc', 0.269), ('kc', 0.207), ('bipartite', 0.201), ('supervision', 0.192), ('yr', 0.173), ('yc', 0.163), ('jr', 0.163), ('laprls', 0.155), ('vec', 0.147), ('jc', 0.142), ('rls', 0.14), ('mc', 0.138), ('unlabeled', 0.135), ('hr', 0.126), ('row', 0.125), ('column', 0.119), ('nmf', 0.118), ('labeled', 0.11), ('hc', 0.095), ('graph', 0.089), ('spectral', 0.088), ('laplacian', 0.08), ('matrix', 0.077), ('partitioning', 0.075), ('tr', 0.07), ('sentiment', 0.07), ('clustering', 0.07), ('col', 0.069), ('blog', 0.063), ('dual', 0.062), ('labels', 0.06), ('posts', 0.059), ('xkc', 0.059), ('word', 0.055), ('labs', 0.053), ('zc', 0.053), ('regularization', 0.053), ('manifold', 0.051), ('rows', 0.05), ('rkhs', 0.05), ('ma', 0.048), ('qt', 0.048), ('documents', 0.047), ('factorization', 0.046), ('projects', 0.045), ('examples', 0.044), ('zr', 0.044), ('categorization', 0.043), ('features', 0.042), ('ibm', 0.042), ('ro', 0.042), ('cluster', 0.042), ('dataset', 0.04), ('argmaxj', 0.039), ('dmc', 0.039), ('eqns', 0.039), ('imr', 0.039), ('qfc', 0.039), ('spls', 0.039), ('sylvester', 0.039), ('orthogonality', 0.038), ('regularizers', 0.038), ('objective', 0.038), ('project', 0.038), ('id', 0.037), ('kdd', 0.036), ('representer', 0.036), ('ij', 0.035), ('nmr', 0.035), ('rkhss', 0.035), ('business', 0.035), ('columns', 0.034), ('split', 0.033), ('numerous', 0.032), ('sindhwani', 0.032), ('partition', 0.031), ('class', 0.031), ('nonnegative', 0.031), ('argmin', 0.031), ('words', 0.031), ('unsupervised', 0.03), ('ding', 0.03), ('icdm', 0.03), ('categorized', 0.03), ('rec', 0.03), ('skills', 0.03), ('score', 0.029), ('feature', 0.029), ('outer', 0.029), ('services', 0.028), ('graphics', 0.026), ('preprocessed', 0.026), ('cg', 0.026), ('toy', 0.025), ('semisupervised', 0.025), ('regularized', 0.023), ('precision', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 193 nips-2008-Regularized Co-Clustering with Dual Supervision

Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic

Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1

2 0.20845911 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables

Author: Neil D. Lawrence, Magnus Rattray, Michalis K. Titsias

Abstract: Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation. 1

3 0.14274596 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

4 0.13593483 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

Author: Haixuan Yang, Irwin King, Michael Lyu

Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1

5 0.12657103 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Author: Liu Yang, Rong Jin, Rahul Sukthankar

Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

6 0.10704511 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

7 0.089849263 112 nips-2008-Kernel Measures of Independence for non-iid Data

8 0.086882301 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

9 0.086458057 117 nips-2008-Learning Taxonomies by Dependence Maximization

10 0.086329646 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.085472018 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

12 0.083908305 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

13 0.08301957 62 nips-2008-Differentiable Sparse Coding

14 0.078997366 218 nips-2008-Spectral Clustering with Perturbed Data

15 0.078229114 143 nips-2008-Multi-label Multiple Kernel Learning

16 0.077459112 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

17 0.076206528 85 nips-2008-Fast Rates for Regularized Objectives

18 0.074540809 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

19 0.073640823 113 nips-2008-Kernelized Sorting

20 0.070777297 225 nips-2008-Supervised Bipartite Graph Inference


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.211), (1, -0.112), (2, -0.051), (3, 0.014), (4, 0.03), (5, -0.032), (6, 0.092), (7, 0.05), (8, -0.049), (9, -0.102), (10, 0.093), (11, -0.069), (12, -0.137), (13, -0.094), (14, -0.021), (15, 0.055), (16, -0.145), (17, 0.16), (18, 0.09), (19, -0.052), (20, -0.044), (21, 0.028), (22, -0.063), (23, -0.037), (24, 0.012), (25, -0.026), (26, -0.107), (27, 0.058), (28, -0.056), (29, -0.045), (30, -0.034), (31, 0.025), (32, 0.012), (33, -0.041), (34, -0.025), (35, -0.082), (36, 0.126), (37, 0.033), (38, -0.089), (39, 0.069), (40, -0.009), (41, 0.083), (42, -0.087), (43, -0.021), (44, 0.079), (45, -0.049), (46, 0.069), (47, 0.132), (48, -0.053), (49, -0.269)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93240255 193 nips-2008-Regularized Co-Clustering with Dual Supervision

Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic

Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1

2 0.58967727 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Author: Liu Yang, Rong Jin, Rahul Sukthankar

Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

3 0.5135873 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

4 0.51290983 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification

Author: Doug Downey, Oren Etzioni

Abstract: Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction—where the probability of class membership increases monotonically with the MF’s value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic “20 Newsgroups” data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data. 1

5 0.48382533 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider

Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.

6 0.47256517 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

7 0.46823895 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

8 0.46411034 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables

9 0.43986076 225 nips-2008-Supervised Bipartite Graph Inference

10 0.43379271 218 nips-2008-Spectral Clustering with Perturbed Data

11 0.41268563 126 nips-2008-Localized Sliced Inverse Regression

12 0.39874795 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

13 0.39851606 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

14 0.39252189 117 nips-2008-Learning Taxonomies by Dependence Maximization

15 0.37302235 113 nips-2008-Kernelized Sorting

16 0.3614797 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

17 0.35263276 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

18 0.33385524 48 nips-2008-Clustering via LP-based Stabilities

19 0.3263112 107 nips-2008-Influence of graph construction on graph-based clustering measures

20 0.31501621 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.01), (6, 0.037), (7, 0.066), (12, 0.438), (28, 0.132), (57, 0.036), (63, 0.021), (77, 0.055), (78, 0.011), (83, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95188689 157 nips-2008-Nonrigid Structure from Motion in Trajectory Space

Author: Ijaz Akhter, Yaser Sheikh, Sohaib Khan, Takeo Kanade

Abstract: Existing approaches to nonrigid structure from motion assume that the instantaneous 3D shape of a deforming object is a linear combination of basis shapes, which have to be estimated anew for each video sequence. In contrast, we propose that the evolving 3D structure be described by a linear combination of basis trajectories. The principal advantage of this approach is that we do not need to estimate any basis vectors during computation. We show that generic bases over trajectories, such as the Discrete Cosine Transform (DCT) basis, can be used to compactly describe most real motions. This results in a significant reduction in unknowns, and corresponding stability in estimation. We report empirical performance, quantitatively using motion capture data, and qualitatively on several video sequences exhibiting nonrigid motions including piece-wise rigid motion, partially nonrigid motion (such as a facial expression), and highly nonrigid motion (such as a person dancing). 1

2 0.90881127 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters

Author: Roger P. Levy, Florencia Reali, Thomas L. Griffiths

Abstract: Language comprehension in humans is significantly constrained by memory, yet rapid, highly incremental, and capable of utilizing a wide range of contextual information to resolve ambiguity and form expectations about future input. In contrast, most of the leading psycholinguistic models and fielded algorithms for natural language parsing are non-incremental, have run time superlinear in input length, and/or enforce structural locality constraints on probabilistic dependencies between events. We present a new limited-memory model of sentence comprehension which involves an adaptation of the particle filter, a sequential Monte Carlo method, to the problem of incremental parsing. We show that this model can reproduce classic results in online sentence comprehension, and that it naturally provides the first rational account of an outstanding problem in psycholinguistics, in which the preferred alternative in a syntactic ambiguity seems to grow more attractive over time even in the absence of strong disambiguating information. 1

3 0.86438143 207 nips-2008-Shape-Based Object Localization for Descriptive Classification

Author: Geremy Heitz, Gal Elidan, Benjamin Packer, Daphne Koller

Abstract: Discriminative tasks, including object categorization and detection, are central components of high-level computer vision. Sometimes, however, we are interested in more refined aspects of the object in an image, such as pose or particular regions. In this paper we develop a method (LOOPS) for learning a shape and image feature model that can be trained on a particular object class, and used to outline instances of the class in novel images. Furthermore, while the training data consists of uncorresponded outlines, the resulting LOOPS model contains a set of landmark points that appear consistently across instances, and can be accurately localized in an image. Our model achieves state-of-the-art results in precisely outlining objects that exhibit large deformations and articulations in cluttered natural images. These localizations can then be used to address a range of tasks, including descriptive classification, search, and clustering. 1

same-paper 4 0.85588551 193 nips-2008-Regularized Co-Clustering with Dual Supervision

Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic

Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1

5 0.8403272 170 nips-2008-Online Optimization in X-Armed Bandits

Author: Sébastien Bubeck, Gilles Stoltz, Csaba Szepesvári, Rémi Munos

Abstract: We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space. We constrain the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n, i.e., the rate of the growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider. 1
[7], our work represents an improvement in the fact that just like Auer et al. [2] we make use of the local properties of the mean-payoff function around the maxima only, and not a global property, such as Lipschitzness in √ the whole space. This allows us to obtain a regret which scales as O( n) 1 when e.g. the space is the unit hypercube and the mean-payoff function is locally H¨ lder with known exponent in the neighborhood of any o maxima (which are in finite number) and bounded away from the maxima outside of these neighborhoods. Thus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality of the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic factors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense. Besides these theoretical advances the algorithm is anytime and easy to implement. Since it is based on ideas that have proved to be efficient, we expect it to perform well in practice and to make a significant impact on how on-line global optimization is performed. 2 Problem setup, notation We consider a topological space X , whose elements will be referred to as arms. A decision maker “pulls” the arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm chosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so as to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic environments. Such an environment M associates to each arm x ∈ X a distribution Mx on the real line. The support of these distributions is assumed to be uniformly bounded with a known bound. For the sake of simplicity, we assume this bound is 1. We denote by f (x) the expectation of Mx , which is assumed to be measurable (all measurability concepts are with respect to the Borel-algebra over X ). The function f : X → R thus defined is called the mean-payoff function. When in round n the decision maker pulls arm Xn ∈ X , he receives a reward Yn drawn from MXn , independently of the past arm choices and rewards. A pulling strategy of a decision maker is determined by a sequence ϕ = (ϕn )n≥1 of measurable mappings, n−1 where each ϕn maps the history space Hn = X × [0, 1] to the space of probability measures over X . By convention, ϕ1 does not take any argument. A strategy is deterministic if for every n the range of ϕn contains only Dirac distributions. According to the process that was already informally described, a pulling strategy ϕ and an environment M jointly determine a random process (X1 , Y1 , X2 , Y2 , . . .) in the following way: In round one, the decision maker draws an arm X1 at random from ϕ1 and gets a payoff Y1 drawn from MX1 . In round n ≥ 2, first, Xn is drawn at random according to ϕn (X1 , Y1 , . . . , Xn−1 , Yn−1 ), but otherwise independently of the past. Then the decision maker gets a rewards Yn drawn from MXn , independently of all other random variables in the past given Xn . Let f ∗ = supx∈X f (x) be the maximal expected payoff. The cumulative regret of a pulling strategy in n n environment M is Rn = n f ∗ − t=1 Yt , and the cumulative pseudo-regret is Rn = n f ∗ − t=1 f (Xt ). 1 We write un = O(vu ) when un = O(vn ) up to a logarithmic factor. 2 In the sequel, we restrict our attention to the expected regret E [Rn ], which in fact equals E[Rn ], as can be seen by the application of the tower rule. 
3 3.1 The Hierarchical Optimistic Optimization (HOO) strategy Trees of coverings We first introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input. Definition 1 (Tree of coverings). A tree of coverings is a family of measurable subsets (Ph,i )1≤i≤2h , h≥0 of X such that for all fixed integer h ≥ 0, the covering ∪1≤i≤2h Ph,i = X holds. Moreover, the elements of the covering are obtained recursively: each subset Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i . A tree of coverings can be represented, as the name suggests, by a binary tree T . The whole domain X = P0,1 corresponds to the root of the tree and Ph,i corresponds to the i–th node of depth h, which will be referred to as node (h, i) in the sequel. The fact that each Ph,i is covered by the two subsets Ph+1,2i−1 and Ph+1,2i corresponds to the childhood relationship in the tree. Although the definition allows the childregions of a node to cover a larger part of the space, typically the size of the regions shrinks as depth h increases (cf. Assumption 1). Remark 1. Our algorithm will instantiate the nodes of the tree on an ”as needed” basis, one by one. In fact, at any round n it will only need n nodes connected to the root. 3.2 Statement of the HOO strategy The algorithm picks at each round a node in the infinite tree T as follows. In the first round, it chooses the root node (0, 1). Now, consider round n + 1 with n ≥ 1. Let us denote by Tn the set of nodes that have been picked in previous rounds and by Sn the nodes which are not in Tn but whose parent is. The algorithm picks at round n + 1 a node (Hn+1 , In+1 ) ∈ Sn according to the deterministic rule that will be described below. After selecting the node, the algorithm further chooses an arm Xn+1 ∈ PHn+1 ,In+1 . This selection can be stochastic or deterministic. We do not put any further restriction on it. The algorithm then gets a reward Yn+1 as described above and the procedure goes on: (Hn+1 , In+1 ) is added to Tn to form Tn+1 and the children of (Hn+1 , In+1 ) are added to Sn to give rise to Sn+1 . Let us now turn to how (Hn+1 , In+1 ) is selected. Along with the nodes the algorithm stores what we call B–values. The node (Hn+1 , In+1 ) ∈ Sn to expand at round n + 1 is picked by following a path from the root to a node in Sn , where at each node along the path the child with the larger B–value is selected (ties are broken arbitrarily). In order to define a node’s B–value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let n Nh,i (n) = I{(Ht ,It )∈C(h,i)} t=1 be the number of times the node (h, i) was visited. A given node (h, i) is always picked at most once, but since its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently, 1 ≤ Nh,i (n) ≤ n for all nodes (h, i) ∈ Tn . Let µh,i (n) be the empirical average of the rewards received for the time-points when the path followed by the algorithm went through (h, i): n 1 µh,i (n) = Yt I{(Ht ,It )∈C(h,i)} . Nh,i (n) t=1 The corresponding upper confidence bound is by definition Uh,i (n) = µh,i (n) + 3 2 ln n + ν 1 ρh , Nh,i (n) where 0 < ρ < 1 and ν1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see Assumption 1). For nodes not in Tn , by convention, Uh,i (n) = +∞. Now, for a node (h, i) in Sn , we define its B–value to be Bh,i (n) = +∞. The B–values for nodes in Tn are given by Bh,i (n) = min Uh,i (n), max Bh+1,2i−1 (n), Bh+1,2i (n) . 
Note that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in PHt ,It ). Its total space requirement is linear in n while total running time at round n is at most quadratic in n, though we conjecture that it is O(n log n) on average. 4 Assumptions made on the model and statement of the main result We suppose that X is equipped with a dissimilarity , that is a non-negative mapping : X 2 → R satisfying (x, x) = 0. The diameter (with respect to ) of a subset A of X is given by diam A = supx,y∈A (x, y). Given the dissimilarity , the “open” ball with radius ε > 0 and center c ∈ X is B(c, ε) = { x ∈ X : (c, x) < ε } (we do not require the topology induced by to be related to the topology of X .) In what follows when we refer to an (open) ball, we refer to the ball defined with respect to . The dissimilarity will be used to capture the smoothness of the mean-payoff function. The decision maker chooses and the tree of coverings. The following assumption relates this choice to the parameters ρ and ν1 of the algorithm: Assumption 1. There exist ρ < 1 and ν1 , ν2 > 0 such that for all integers h ≥ 0 and all i = 1, . . . , 2h , the diameter of Ph,i is bounded by ν1 ρh , and Ph,i contains an open ball Ph,i of radius ν2 ρh . For a given h, the Ph,i are disjoint for 1 ≤ i ≤ 2h . Remark 2. A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles. They can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along their longest side, in an axis parallel manner; if all sides are equal, we split them along the√ axis. In first this example, if X = [0, 1]D and (x, y) = x − y α then we can take ρ = 2−α/D , ν1 = ( D/2)α and ν2 = 1/8α . The next assumption concerns the environment. Definition 2. We say that f is weakly Lipschitz with respect to if for all x, y ∈ X , f ∗ − f (y) ≤ f ∗ − f (x) + max f ∗ − f (x), (x, y) . (1) Note that weak Lipschitzness is satisfied whenever f is 1–Lipschitz, i.e., for all x, y ∈ X , one has |f (x) − f (y)| ≤ (x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1–Lipschitzness at any maxima. Indeed, at an optimal arm x∗ (i.e., such that f (x∗ ) = f ∗ ), (1) rewrites to f (x∗ ) − f (y) ≤ (x∗ , y). However, weak Lipschitzness does not constraint the growth of the loss in the vicinity of other points. Further, weak Lipschitzness, unlike Lipschitzness, does not constraint the local decrease of the loss at any point. Thus, weak-Lipschitzness is a property that lies somewhere between a growth condition on the loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is defined with respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example, f (x) = 1 − x2 is weak Lipschitz with the dissimilarity (x, y) = c(x − y)2 for some appropriate constant c. Assumption 2. The mean-payoff function f is weakly Lipschitz. ∗ ∗ Let fh,i = supx∈Ph,i f (x) and ∆h,i = f ∗ − fh,i be the suboptimality of node (h, i). We say that def a node (h, i) is optimal (respectively, suboptimal) if ∆h,i = 0 (respectively, ∆h,i > 0). Let Xε = { x ∈ X : f (x) ≥ f ∗ − ε } be the set of ε-optimal arms. The following result follows from the definitions; a proof can be found in the appendix. 4 Lemma 1. Let Assumption 1 and 2 hold. If the suboptimality ∆h,i of a region is bounded by cν1 ρh for some c > 0, then all arms in Ph,i are max{2c, c + 1}ν1 ρh -optimal. 
The last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of a continuum-armed bandit algorithm should depend on how fast the volume of the set of ε-optimal arms shrinks as ε → 0. Here, we capture this by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection between these concepts, as well as with the zooming dimension defined by Kleinberg et al. [7], will be further discussed in Section 7.

Define the packing number P(X, ℓ, ε) to be the size of the largest packing of X with disjoint open balls of radius ε with respect to the dissimilarity ℓ. (Note that packing numbers are sometimes defined as the largest packing with disjoint open balls of radius ε/2, or via ε-nets.) We now define the near-optimality dimension, which characterizes the size of the sets X_ε in terms of ε, and then state our main result.

Definition 3. For c > 0 and ε_0 > 0, the (c, ε_0)-near-optimality dimension of f with respect to ℓ equals

\inf \Big\{ d \in [0, +\infty) : \exists\, C \text{ s.t. } \forall \varepsilon \le \varepsilon_0,\; \mathcal{P}\big(X_{c\varepsilon}, \ell, \varepsilon\big) \le C\, \varepsilon^{-d} \Big\}   (2)

(with the usual convention that inf ∅ = +∞); a small worked instance of this definition is given just before the first step of the proof below.

Theorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4ν_1/ν_2, ν_2)-near-optimality dimension of the considered environment is d < +∞. Then, for any d' > d there exists a constant C(d') such that for all n ≥ 1,

\mathbb{E}\, R_n \;\le\; C(d')\, n^{(d'+1)/(d'+2)} \big(\ln n\big)^{1/(d'+2)} .

Further, if the near-optimality dimension is achieved, i.e., the infimum is achieved in (2), then the result also holds for d' = d.

Remark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima. In fact, at the price of increased constants, the result continues to hold if there exists ε > 0 such that (1) holds for any x, y ∈ X_ε. To show this we only need to carefully adapt the steps of the proof below. We omit the details from this extended abstract.

5 Analysis of the regret and proof of the main result

We first state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely on concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us fix some path (0, 1), (1, i^*_1), ..., (h, i^*_h), ..., of optimal nodes, starting from the root.

Lemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that (k, i^*_k) is on the path from the root to (h, i). Then, for any integer u ≥ 1, we have

\mathbb{E}\, N_{h,i}(n) \;\le\; u + \sum_{t=u+1}^{n} \mathbb{P}\Big\{ \big[ N_{h,i}(t) > u \text{ and } U_{h,i}(t) > f^* \big] \text{ or } U_{s,i^*_s}(t) \le f^* \text{ for some } s \in \{k+1, \dots, t-1\} \Big\} .

Lemma 3. Let Assumptions 1 and 2 hold. Then, for all optimal nodes (h, i) and for all integers n ≥ 1,

\mathbb{P}\big\{ U_{h,i}(n) \le f^* \big\} \;\le\; n^{-3} .

Lemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t ≤ n, for all suboptimal nodes (h, i) such that Δ_{h,i} > ν_1 ρ^h, and for all integers u ≥ 1 such that u ≥ (8 ln n)/(Δ_{h,i} − ν_1 ρ^h)², one has

\mathbb{P}\big\{ U_{h,i}(t) > f^* \text{ and } N_{h,i}(t) > u \big\} \;\le\; t\, n^{-4} .

Taking u as the integer part of (8 ln n)/(Δ_{h,i} − ν_1 ρ^h)², and combining the results of Lemmas 2, 3, and 4 with a union bound, leads to the following key result.

Lemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that Δ_{h,i} > ν_1 ρ^h, we have, for all n ≥ 1,

\mathbb{E}\, N_{h,i}(n) \;\le\; \frac{8 \ln n}{(\Delta_{h,i} - \nu_1 \rho^h)^2} + \frac{2}{n} .

We are now ready to prove Theorem 1.

Proof. For the sake of simplicity we assume that the infimum in the definition of the near-optimality dimension is achieved. To obtain the result in the general case one only needs to replace d by d' > d in the proof below.
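Before going through the steps of the proof, it may help to see Definition 3 at work on the one-dimensional example used earlier to illustrate weak Lipschitzness. The computation below is our own illustration, not part of the paper's argument; we write b for the constant of that dissimilarity to avoid a clash with the c of Definition 3.

\text{Take } X = [0,1],\quad f(x) = 1 - x^2,\quad \ell(x,y) = b\,(x-y)^2, \quad\text{so that } f^* = 1 \text{ and } X_{c\varepsilon} = \{x : x^2 \le c\varepsilon\} = \big[0, \sqrt{c\varepsilon}\,\big].

\text{An } \ell\text{-ball of radius } \varepsilon \text{ is a Euclidean interval of radius } \sqrt{\varepsilon/b}, \text{ so for every } c > 0,

\mathcal{P}\big(X_{c\varepsilon}, \ell, \varepsilon\big) \;\le\; \frac{\sqrt{c\varepsilon}}{2\sqrt{\varepsilon/b}} + 1 \;=\; \frac{\sqrt{bc}}{2} + 1 \;=:\; C, \qquad \text{uniformly in } \varepsilon .

\text{Hence the near-optimality dimension is } d = 0 \text{ (and it is achieved), and Theorem 1 gives } \mathbb{E}\,R_n = O\big(\sqrt{n \ln n}\,\big).

This matches the discussion in Section 7, where payoffs that are locally quadratic around their maximum are shown to have near-optimality dimension 0 with respect to the dissimilarity ℓ_α with α = 2.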
First step. For all h = 1, 2, ..., denote by I_h the nodes at depth h that are 2ν_1 ρ^h-optimal, i.e., the nodes (h, i) such that f^*_{h,i} ≥ f^* − 2ν_1 ρ^h. Then, I is the union of these sets of nodes. Further, let J be the set of nodes that are not in I but whose parent is. We then denote by J_h the nodes in J that are located at depth h in the tree. Lemma 5 bounds the expected number of times each node (h, i) ∈ J_h is visited. Since Δ_{h,i} > 2ν_1 ρ^h, we get

\mathbb{E}\, N_{h,i}(n) \;\le\; \frac{8 \ln n}{\nu_1^2 \rho^{2h}} + \frac{2}{n} .

Second step. We bound here the cardinality |I_h|, h > 0. If (h, i) ∈ I_h, then since Δ_{h,i} ≤ 2ν_1 ρ^h, by Lemma 1, P_{h,i} ⊂ X_{4ν_1 ρ^h}. Since by Assumption 1 the sets P°_{h,i}, for (h, i) ∈ I_h, contain disjoint balls of radius ν_2 ρ^h, we have that

|I_h| \;\le\; \mathcal{P}\Big( \textstyle\bigcup_{(h,i)\in I_h} P_{h,i},\; \ell,\; \nu_2 \rho^h \Big) \;\le\; \mathcal{P}\Big( X_{(4\nu_1/\nu_2)\,\nu_2\rho^h},\; \ell,\; \nu_2\rho^h \Big) \;\le\; C \big( \nu_2 \rho^h \big)^{-d} ,

where we used the assumption that d is the (4ν_1/ν_2, ν_2)-near-optimality dimension of f (and C is the constant introduced in the definition of the near-optimality dimension).

Third step. Choose η > 0 and let H be the smallest integer such that ρ^H ≤ η. We partition the infinite tree T into three sets of nodes, T = T_1 ∪ T_2 ∪ T_3, where T_1 contains the nodes of I_H and their descendants, T_2 = ∪_{0≤h<H} I_h, and T_3 collects the remaining nodes. The regret incurred on each of these sets is bounded using the first two steps; the announced bound then follows by summing the contributions (using ρ < 1) and by optimizing over ρ^H (the worst value being ρ^H ∼ (n/ln n)^{−1/(d+2)}).

6 Minimax optimality

The packing dimension of a set X is the smallest d such that there exists a constant k such that for all ε > 0, P(X, ℓ, ε) ≤ k ε^{−d}. For instance, compact subsets of R^d (with non-empty interior) have a packing dimension of d whenever ℓ is a norm. If X has a packing dimension of d, then all environments have a near-optimality dimension less than d. The proof of the main theorem indicates that the constant C(d) only depends on d, k (of the definition of the packing dimension), ν_1, ν_2, and ρ, but not on the environment as long as it is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form O(n^{(d+1)/(d+2)}). In fact, this bound can be shown to be optimal, as illustrated by the theorem below, whose assumptions are satisfied by, e.g., compact subsets of R^d when ℓ is some norm of R^d. The proof can be found in the appendix.

Theorem 2. If X is such that there exists c > 0 with P(X, ℓ, ε) ≥ c ε^{−d} ≥ 2 for all ε ≤ 1/4, then for all n ≥ 4^{d−1} c / ln(4/3), all strategies ϕ are bound to suffer a regret of at least

\sup \mathbb{E}\, R_n(\varphi) \;\ge\; \frac{1}{4} \left( \frac{1}{4}\,\frac{c}{4 \ln(4/3)} \right)^{2/(d+2)} n^{(d+1)/(d+2)} ,

where the supremum is taken over all environments with weakly Lipschitz payoff functions.
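As a quick sanity check of Theorem 2 (our own illustration, under the convention that a packing only requires the centers of the disjoint balls to lie in X), consider the unit interval with the usual distance:

\text{For } X = [0,1] \text{ and } \ell(x,y) = |x-y|, \text{ the centers } 0,\, 2\varepsilon,\, 4\varepsilon, \dots \text{ give } \mathcal{P}(X, \ell, \varepsilon) \;\ge\; 1 + \lfloor 1/(2\varepsilon) \rfloor \;\ge\; \tfrac{1}{2}\,\varepsilon^{-1},

\text{so the condition of Theorem 2 holds with } c = \tfrac{1}{2} \text{ and } d = 1 \ \big(\text{indeed } \tfrac{1}{2}\varepsilon^{-1} \ge 2 \text{ for } \varepsilon \le 1/4\big),

\text{and the resulting lower bound is of order } n^{2/3}, \text{ matching the distribution-free upper bound } O\big(n^{(d+1)/(d+2)}\big) \text{ for } d = 1.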
7 Discussion

Several works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and provided upper and lower bounds on the regret for given classes of environments. Cope [3] derived a regret of O(√n) for compact and convex subsets of R^d and a mean-payoff function with a unique minimum and second-order smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are Hölder with degree 0 < α ≤ 1; the derived regret is Θ(n^{(α+1)/(α+2)}). Auer et al. [2] extended the analysis to classes of functions with only a local Hölder assumption around the maximum (with possibly higher smoothness degree α ∈ [0, ∞)), and derived the regret Θ(n^{(1+α−αβ)/(1+2α−αβ)}), where β is such that the Lebesgue measure of the set of ε-optimal states is O(ε^β). Another setting is that of Kleinberg et al. [7], who considered a metric space (X, ℓ) and assumed that f is Lipschitz w.r.t. ℓ. The obtained regret is O(n^{(d+1)/(d+2)}), where d is the zooming dimension (defined similarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the sets X_ε \ X_{ε/2}). When (X, ℓ) is a metric space, covering and packing numbers are equivalent, and we may prove that the zooming dimension and the near-optimality dimension are equal.

Our main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker than the global Lipschitz assumption of [7], enables our algorithm to work better in some common situations, such as when the mean-payoff function has a local smoothness whose order is larger than one. In order to relate all these results, let us consider a specific example: let X = [0, 1]^D and assume that the mean-reward function f is locally equivalent to a Hölder function with degree α ∈ [0, ∞) around any maximum x^* of f (the number of maxima is assumed to be finite):

f(x^*) - f(x) \;=\; \Theta\big( \|x - x^*\|^\alpha \big) \quad \text{as } x \to x^* .   (3)

This means that there exist c_1, c_2, ε_0 > 0 such that for all x with ||x − x^*|| ≤ ε_0,

c_1 \|x - x^*\|^\alpha \;\le\; f(x^*) - f(x) \;\le\; c_2 \|x - x^*\|^\alpha .

Under this assumption, the result of Auer et al. [2] shows that for D = 1, the regret is Θ(√n) (since here β = 1/α). Our result allows us to extend the √n regret rate to any dimension D. Indeed, if we choose our dissimilarity measure to be ℓ_α(x, y) = ||x − y||^α, we may prove that f satisfies a locally weak-Lipschitz condition (as defined in Remark 3) and that the near-optimality dimension is 0. Thus our regret is O(√n), i.e., the rate is independent of the dimension D.

In comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they cannot use ℓ_α when α > 1: a function globally Lipschitz with respect to ℓ_α is essentially constant, and moreover ℓ_α does not define a metric for α > 1. If one resorts to the Euclidean metric to fulfill their requirement that f be Lipschitz w.r.t. the metric, then the zooming dimension becomes D(α − 1)/α, while the regret becomes O(n^{(D(α−1)+α)/(D(α−1)+2α)}), which is strictly worse than O(√n) and in fact becomes close to the slow rate O(n^{(D+1)/(D+2)}) when α is large. Nevertheless, in the case of α ≤ 1 they get the same regret rate.

In contrast, our result shows that under very weak constraints on the mean-payoff function, and if the local behavior of the function around its maximum (or its finite number of maxima) is known, then global optimization suffers a regret of order O(√n), independent of the space dimension. As an interesting side note, let us also remark that our results allow different smoothness orders along different dimensions, i.e., heterogeneous smoothness spaces.
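The comparison above can be made concrete by tabulating the exponents of n quoted in this section. The short Python script below is our own illustration; it prints, for a few values of D and α ≥ 1, the zooming-dimension exponent (D(α−1)+α)/(D(α−1)+2α) next to the exponent 1/2 obtained here and the uniform rate (D+1)/(D+2).

def zooming_exponent(D, alpha):
    # exponent of n in the regret O(n^{(D(alpha-1)+alpha)/(D(alpha-1)+2 alpha)}) quoted above
    num = D * (alpha - 1.0) + alpha
    return num / (num + alpha)

if __name__ == "__main__":
    print("  D  alpha  zooming    HOO  uniform")
    for D in (1, 2, 10):
        for alpha in (1.0, 2.0, 4.0):
            print(f"{D:3d}  {alpha:5.1f}    {zooming_exponent(D, alpha):.3f}  0.500    {(D + 1) / (D + 2):.3f}")

As α grows, the zooming exponent approaches (D + 1)/(D + 2), while at α = 1 it coincides with 1/2, which is exactly the behaviour described in the paragraph above.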
References

[1] R. Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33:1926–1951, 1995.

[2] P. Auer, R. Ortner, and Cs. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In 20th Conference on Learning Theory, pages 454–468, 2007.

[3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action spaces. Preprint, 2004.

[4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007.

[5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.

[6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 18, 2004.

[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[8] L. Kocsis and Cs. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference on Machine Learning, pages 282–293, 2006.

6 0.57580888 95 nips-2008-Grouping Contours Via a Related Image

7 0.54602468 136 nips-2008-Model selection and velocity estimation using novel priors for motion patterns

8 0.5449844 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

9 0.53890949 229 nips-2008-Syntactic Topic Models

10 0.52686995 119 nips-2008-Learning a discriminative hidden part model for human action recognition

11 0.52562428 4 nips-2008-A Scalable Hierarchical Distributed Language Model

12 0.52483445 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits

13 0.51311195 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

14 0.51297629 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

15 0.51179183 99 nips-2008-High-dimensional support union recovery in multivariate regression

16 0.51040757 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

17 0.50888264 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words

18 0.50825596 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

19 0.50812429 66 nips-2008-Dynamic visual attention: searching for coding length increments

20 0.50798553 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs