jmlr jmlr2013 jmlr2013-99 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
Reference: text
sentIndex sentText sentNum sentScore
1 Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. [sent-10, score-0.629]
2 Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. [sent-14, score-0.527]
3 To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. [sent-15, score-0.34]
4 Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. [sent-17, score-0.447]
5 Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy Max-Cut 1. [sent-20, score-0.412]
6 Introduction In many real applications, labeled samples are scarce but unlabeled samples are abundant. [sent-21, score-0.34]
7 Paradigms that consider both labeled and unlabeled data, that is, semi-supervised learning (SSL) methods, have ... [sent-22, score-0.34]
8 While many semi-supervised learning approaches estimate a smooth function over labeled and unlabeled examples, this article presents a novel approach which emphasizes a bivariate optimization problem over the classification function and the labels. [sent-24, score-0.47]
9 Graph-based semi-supervised learning (GSSL) treats both labeled and unlabeled samples from a data set as vertices in a graph and builds pairwise edges between these vertices which are weighted by the affinity between the corresponding samples. [sent-41, score-0.77]
10 The small portion of vertices with labels is then used by SSL methods to perform graph partitioning or information propagation to predict labels for the unlabeled vertices. [sent-42, score-0.816]
11 For instance, the graph mincuts approach formulates the label prediction as a graph cut problem (Blum and Chawla, 2001; Blum et al. [sent-43, score-0.712]
12 The weighted graph and the optimal function ultimately propagate label information from labeled data to unlabeled data to produce transductive predictions. [sent-47, score-0.754]
13 In particular, both the graph construction methodology and the label initialization conditions can significantly impact prediction accuracy (Jebara et al. [sent-61, score-0.433]
14 Even if one assumes that the graph structures used in the above methods faithfully describe the data manifold, GSSL algorithms may still be misled by problems in the label information. [sent-65, score-0.359]
15 Figure 3 depicts several cases where the label information leads to invalid graph transduction solutions for all the aforementioned algorithms. [sent-66, score-0.464]
16 In the proposed greedy solution, initial labels simply act as initial values of the graph cut which is incrementally refined until convergence. [sent-71, score-0.598]
17 During each iteration of the greedy search, the optimal unlabeled vertex is assigned to the labeled subset with minimum connectivity to maximally preserve cross-subset edge weight. [sent-72, score-0.707]
18 Finally, an overall cut is produced after placing the unlabeled vertices into one of the label sets. [sent-73, score-0.551]
19 It is then straightforward to obtain the final label prediction from the graph cut result. [sent-74, score-0.503]
20 Note that this greedy gradient Max-Cut solution is equivalent to alternating between minimization of the cost over the label matrix and minimization of the cost over the prediction function. [sent-75, score-0.582]
21 Moreover, to alleviate dependencies on the initialization of the cut (the given labels), a re-weighting of the connectivity between unlabeled vertices and labeled subsets is proposed. [sent-76, score-0.636]
22 We demonstrate that the greedy gradient Max-Cut based graph transduction produces significantly better performance on both artificial and real data sets. [sent-78, score-0.459]
23 In Section 3, we present our bivariate graph transduction framework, followed by the theoretical proof of its equivalence with the constrained Max-Cut problem in Section 4. [sent-81, score-0.418]
24 Background and Open Issues In this section, we introduce some notation and then revisit two critical components of graph-based SSL: graph construction and label propagation. [sent-87, score-0.388]
25 Subsequently, we discuss some challenging issues such as SSL’s sensitivity to graph construction and label initialization. [sent-88, score-0.388]
26 Given are the labeled inputs Xl = {x1 , . . . , xl } with cardinality |Xl | = l and the set of unlabeled inputs Xu = {xl+1 , . . . , xn }. [sent-130, score-0.376]
27 The labeled set Xl is associated with labels Zl = {z1 , · · · , zl }, where zi ∈ {1, · · · , c}, i = 1, 2, · · · , l. [sent-134, score-0.348]
28 In this article, assume the undirected graph converted from the data X is represented by G = {X, E}, where the set of vertices is X = {xi } and the set of edges is E = {ei j }. [sent-147, score-0.361]
29 The graph Laplacian is defined as ∆ = D − W and the normalized graph Laplacian is L = D^{-1/2} ∆ D^{-1/2} = I − D^{-1/2} W D^{-1/2}, where D is the diagonal degree matrix with d_ii = ∑_j w_ij. [sent-152, score-0.418]
30 The graph Laplacian and its normalized version can be viewed as operators on the space of functions f which can be used to define a regularization measure of smoothness over strongly-connected regions in a graph (Chung and Biggs, 1997). [sent-153, score-0.447]
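For illustration, these definitions translate directly into a few lines of numpy; the sketch below is my own (the function name, the epsilon guard for isolated vertices, and the choice of numpy are assumptions, not details from the paper):

    import numpy as np

    def laplacians(W):
        """Unnormalized and normalized graph Laplacians of a symmetric weight matrix W."""
        d = W.sum(axis=1)                                   # vertex degrees d_ii = sum_j w_ij
        Delta = np.diag(d) - W                              # unnormalized Laplacian, Delta = D - W
        inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))      # guard against isolated vertices
        L = np.eye(W.shape[0]) - (inv_sqrt[:, None] * W) * inv_sqrt[None, :]   # I - D^{-1/2} W D^{-1/2}
        return Delta, L

The smoothness regularizer used in the cost functions discussed below is then tr(F^T L F) for a candidate prediction matrix F.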
31 Finally, the label information is formulated as a label matrix Y = {y_ij} ∈ B^{n×c}, where y_ij = 1 if sample xi is associated with label j for j ∈ {1, 2, · · · , c}, that is, zi = j, and y_ij = 0 otherwise. [sent-155, score-0.58]
32 Most of the GSSL methods then use the graph quantity W as well as the known labels to recover a continuous classification function F ∈ Rn×c by minimizing a predefined cost on the graph. [sent-159, score-0.378]
33 Given input data X with cardinality |X| = l + u, graph construction produces a graph G consisting of n = l + u vertices where each vertex is associated with the sample xi . [sent-169, score-0.687]
34 Sparsification via Neighborhood Methods: There are two typical ways to build a neighborhood graph: the ε-neighborhood graph connecting samples within a distance of ε, and the kNN (k-nearest-neighbors) graph connecting the k closest samples. [sent-181, score-0.443]
35 More specifically, the k-nearest neighbor graph is a graph in which two vertices xi and xj are connected by an edge if the distance H_ij between xi and xj is less than or equal to the k-th smallest of the distances from xi to the other samples in X. [sent-257, score-0.677]
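A small sketch of kNN sparsification (assuming Euclidean distances and union symmetrization, which are common choices rather than details taken from this text):

    import numpy as np

    def knn_adjacency(X, k):
        """Binary kNN adjacency B: b_ij = 1 if x_j is one of the k closest samples to x_i."""
        n = X.shape[0]
        H = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances H_ij
        B = np.zeros((n, n), dtype=int)
        for i in range(n):
            neighbors = [j for j in np.argsort(H[i]) if j != i][:k]
            B[i, neighbors] = 1
        return np.maximum(B, B.T)   # union symmetrization

Note that the union symmetrization does not guarantee exactly k edges per vertex; enforcing an exact per-vertex degree is what the b-matching alternative discussed next provides.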
36 In other words, each vertex in the b-matched graph has exactly b edges connecting it to other vertices. [sent-285, score-0.357]
37 Graph Edge Re-Weighting: Once a graph has been sparsified and a binary matrix B is computed and used to delete unwanted edges, several procedures can then be used to update the weights in the matrix K to produce a final set of edge weights W. [sent-296, score-0.421]
38 Binary Weighting: The simplest approach for building the weighted graph is the binary weighting approach, where all the linked edges in the graph are given the weight 1 and the edge weights of disconnected vertices are given the weight 0. [sent-299, score-0.753]
39 However, this uniform weight on graph edges can be sensitive, particularly if some of the graph vertices were improperly connected by the sparsification procedure (either the neighborhood based procedures or the b-matching procedure). [sent-301, score-0.598]
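A hedged sketch of the two weighting choices (the Gaussian-kernel re-weighting shown here is one common alternative to binary weighting, not necessarily the exact scheme used in the paper):

    import numpy as np

    def edge_weights(B, H, sigma=None):
        """Turn a binary sparsification B and pairwise distances H into edge weights W.
        sigma=None gives binary weighting (W = B); otherwise the retained edges get
        Gaussian-kernel weights."""
        if sigma is None:
            return B.astype(float)                    # binary weighting: kept edges weigh 1
        K = np.exp(-H**2 / (2.0 * sigma**2))          # kernel similarities
        return B * K                                  # edges removed by sparsification stay 0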
40 This final step in the graph construction procedure ensures that the unlabeled data X has now been converted into a graph G with a weighted sparse undirected adjacency matrix W. [sent-307, score-0.711]
41 Given this graph and some initial label information Yl , any of the current popular algorithms for graph based SSL can be used to solve the labeling problem. [sent-308, score-0.666]
42 3 Univariate Graph Regularization Framework Given the constructed graph G = {X, E}, whose geometric structure is represented by the weight matrix W, the label inference task is to diffuse the known labels Zl to all the unlabeled vertices Xu in the graph and estimate Ẑu. [sent-310, score-1.058]
43 The cost function typically involves a trade-off between the smoothness of the function over the graph of both labeled and unlabeled data (consistency of the predictions on closely connected vertices) and the accuracy of the function at fitting the label information on the labeled vertices. [sent-314, score-0.9]
44 , 2004) fall into this category, as does our previous method of graph transduction via alternating minimization (Wang et al. [sent-317, score-0.41]
45 The first term ||F||²_G represents function smoothness over the graph G and ||F − Y||² measures the empirical loss on the given labeled samples. [sent-323, score-0.37]
46 Meanwhile, the harmonic function F minimizing the above cost also satisfies two conditions: ∂Q/∂Fu = (∆F)u = 0 and Fl = Yl, where Fl, Fu are the function values of f(·) over the labeled and unlabeled vertices, that is, Fl = f(Xl), Fu = f(Xu), and F = [Fl Fu]^T. [sent-329, score-0.38]
47 The first equation above sets the derivative of the objective function to zero on the unlabeled data and the second equation clamps the function values on the labeled data to the given labels Yl. [sent-330, score-0.41]
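These two conditions yield the standard closed-form harmonic solution Fu = −∆uu^{-1} ∆ul Yl (the Gaussian-fields / harmonic-function result); a sketch, assuming the l labeled vertices are indexed first:

    import numpy as np

    def harmonic_solution(Delta, Yl, l):
        """Solve (Delta F)_u = 0 with F_l clamped to Yl (labeled vertices assumed to come first)."""
        Delta_uu = Delta[l:, l:]
        Delta_ul = Delta[l:, :l]
        Fu = np.linalg.solve(Delta_uu, -Delta_ul @ Yl)   # F_u = -Delta_uu^{-1} Delta_ul Y_l
        return np.vstack([Yl, Fu])                       # F = [F_l; F_u]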
48 4 Open Issues Existing graph-based SSL methods hinge on having good label information and an appropriately constructed graph (Wang et al. [sent-334, score-0.359]
49 In addition, the label propagation procedure can easily be misled if there is excessive noise or there are outliers in the initial labeled set. [sent-338, score-0.36]
50 We next discuss some open issues which occur often in graph construction and label propagation, two critical components of all GSSL algorithms. [sent-342, score-0.388]
51 Particularly challenging conditions are shown in (a) where an uninformative label on an outlier sample is the only negative label (denoted by a black circle) and in (g) where imbalanced labeling is involved. [sent-383, score-0.414]
52 Even if the graph is perfectly constructed from the data, problematic initial labels in practical situations can easily deteriorate the performance of SSL prediction. [sent-388, score-0.369]
53 Figure 3 provides examples depicting imbalanced and noisy labels that lead to invalid graph transduction solutions for all the aforementioned algorithms. [sent-389, score-0.49]
54 The leading SSL methods classify the majority of unlabeled nodes in the graph as positive (Figure 3(b)-Figure 3(e)). [sent-392, score-0.417]
55 In addition, when the graph contains background noise and makes class manifolds non-separable (as in Figure 1(b)), these existing graph transduction approaches fail to output reasonable classification results. [sent-412, score-0.523]
56 Since the univariate framework treats the initial label information as a constant, we propose a novel bivariate optimization framework that explicitly optimizes over both the classification function F and the binary label matrix Y: (F∗, Y∗) = arg min_{F ∈ R^{n×c}, Y ∈ B^{n×c}} Q(F, Y) s.t. ∑_{j=1}^c y_ij = 1, and y_ij = 1 if xi ∈ Xl with zi = j. [sent-413, score-0.53]
57 For a labeled sample xi ∈ Xl, y_ij = 1 if zi = j, and the constraint ∑_{j=1}^c y_ij = 1 indicates that this is a single-label prediction problem. [sent-417, score-0.426]
58 Because the graph is often symmetric, it is easy to show that the graph Laplacian L and the propagation matrix P are both symmetric. [sent-426, score-0.496]
59 However, this may produce biased classification results in practice since, at each iteration, the class with more labels will be preferred and will propagate more quickly to the unlabeled examples. [sent-437, score-0.366]
60 To compensate for this bias during label propagation, we propose using a normalized label variable Ỹ = ΛY for computing the cost function in Equation (6) as Q = (1/2) tr( F^T L F + µ (F − Ỹ)^T (F − Ỹ) ) = (1/2) tr( F^T L F + µ (F − ΛY)^T (F − ΛY) ). [sent-439, score-0.46]
61 Using the normalized label matrix Ỹ in the bivariate formulation allows labeled nodes with high degrees to contribute more during the label propagation process. [sent-443, score-0.614]
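A sketch of evaluating this cost for a candidate pair (F, Y); the particular degree-based, per-class normalization Λ used below (labeled vertices weighted by degree, each class column renormalized) is one plausible reading of the description and is an assumption, not a formula quoted from the paper:

    import numpy as np

    def bivariate_cost(L, F, Y, degrees, mu):
        """Q(F, Y) = 1/2 tr( F^T L F + mu (F - Y_tilde)^T (F - Y_tilde) ),
        with Y_tilde = Lambda Y an assumed degree-based per-class normalization."""
        class_mass = np.maximum(degrees @ Y, 1e-12)        # per-class degree mass of labeled vertices
        Y_tilde = (degrees[:, None] * Y) / class_mass      # unlabeled rows of Y are zero and stay zero
        R = F - Y_tilde
        return 0.5 * np.trace(F.T @ L @ F + mu * (R.T @ R))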
62 We propose a straightforward way to guarantee convergence and avoid backtracking or unstable oscillation in the greedy propagation process: once an unlabeled point has been labeled, its labeling can no longer be changed. [sent-466, score-0.421]
63 Although the original objective is formed in a bivariate manner in Equation 6, the above alternating optimization procedure drives the prediction of new labels without explicitly calculating F∗ as is done in other graph transduction methods like LGC and GFHF. [sent-483, score-0.653]
64 Once the first few iterations are completed, the new labels are added and the standard propagation step can be used to predict the optimal F∗ as indicated in Equation (11) over the whole graph in one step. [sent-492, score-0.385]
65 The details of the algorithm, namely graph transduction via alternating minimization, can be found in Wang et al. [sent-493, score-0.375]
66 In the following section, we provide a greedy Max-Cut based solution, which essentially interprets the above alternating minimization procedure from a graph cut view. [sent-495, score-0.503]
67 Greedy Max-Cut for Semi-Supervised Learning In this section, we introduce a connection between the proposed bivariate graph transduction framework and the well-known maximum cut problem. [sent-497, score-0.517]
68 Since y = {y1 , y2 , · · · , yn } is a binary vector, each setting of y partitions the vertex set VA in the graph GA into two disjoint subsets (S1 , S2 ). [sent-520, score-0.353]
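For a binary assignment y, the weight of the induced cut can be written as cut(S1, S2) = ∑_{i,j} a_ij y_i (1 − y_j); a two-line sketch (the function name and vector notation are my own):

    import numpy as np

    def cut_value(A, y):
        """Total weight of edges crossing the partition induced by the binary vector y
        (for a symmetric A, each crossing edge is counted once)."""
        y = np.asarray(y, dtype=float)
        return float(y @ A @ (1.0 - y))   # sum_{i,j} a_ij * y_i * (1 - y_j)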
69 However, in graph based semi-supervised learning, the variable y is partially specified by the initial label values. [sent-524, score-0.39]
70 Instead, here we use a greedy gradient based strategy to find local optima by assigning each unlabeled vertex to the label set with minimum connectivity to maximize cross-set edge weights iteratively. [sent-538, score-0.795]
71 The greedy Max-Cut algorithm randomly selects unlabeled vertices and places each of them into the appropriate class subset depending on the edges between this unlabeled vertex and the vertices in the labeled subset. [sent-539, score-0.983]
72 Define the connectivity between an unlabeled vertex xi and a labeled subset Sj as c_ij = ∑_{m=1}^n a_im y_mj = A_i· Y_·j. [sent-541, score-0.618]
73 Intuitively, ci j represents the sum of edge weights between vertex xi and label set S j given the graph GA with edge weights A. [sent-546, score-0.682]
74 Based on this definition, a straightforward local search for the maximum cut involves placing each unlabeled vertex xi ∈ Xu in the labeled subset S j with minimum connectivity ci j to maximize the cross-set edge weights as shown in Algorithm (1). [sent-547, score-0.791]
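Such a local search can be sketched as follows (assumptions: A is the symmetric edge-weight matrix of G_A and Y holds the current one-hot label assignments with zero rows for unassigned vertices; the exact bookkeeping of Algorithm 1 may differ):

    import numpy as np

    def local_search_random_order(A, Y, unlabeled):
        """Visit unlabeled vertices in random order; place each in the label set
        with minimum connectivity, which greedily maximizes the cross-set cut."""
        Y = Y.copy()
        for i in np.random.permutation(list(unlabeled)):
            c_i = A[i] @ Y                 # c_ij = sum_m a_im y_mj for each label set j
            j = int(np.argmin(c_i))        # minimum-connectivity set
            Y[i, :] = 0
            Y[i, j] = 1
        return Y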
75 According to the definition in Equation (15), the initialized labels determine the connectivity between unlabeled vertices and labeled subsets. [sent-550, score-0.666]
76 If the computed connectivity is negative, the above random search will prefer assigning unlabeled vertices to the label set with the most labeled vertices which results in biased partitioning. [sent-551, score-0.81]
77 Such biased partitioning also occurs in minimum cut problems over an undirected graph with positive weights (Shi and Malik, 2000). [sent-552, score-0.386]
78 Alternatively, the labeled vertices may be outliers and lie in regions of the graph with low density. [sent-555, score-0.435]
79 Furthermore, the algorithm’s random selection of an unlabeled vertex results in unstable predictions since the chosen unlabeled vertex xi could have equally low connectivity to multiple label subsets S j. [sent-558, score-0.93]
80 Finally, to handle any instability due to the random search algorithm, we propose a greedy gradient search approach where the most beneficial vertex is assigned to the label set with minimum connectivity. [sent-566, score-0.41]
81 In other words, we first compute the connectivity matrix C = {c_ij} ∈ R^{n×c} that gives the connectivity between all unlabeled vertices and the existing label sets: C = AΛY. [sent-567, score-0.689]
82 Here (i∗, j∗) = arg min_{i, j: xi ∈ Xu} c_ij, which means that the unlabeled vertex xi∗ has the least connectivity with label set Sj∗. [sent-569, score-0.576]
83 Then, we update the labeled set S j∗ by adding vertex xi∗ as one greedy step to maximize the cross-set edge weights. [sent-570, score-0.419]
84 This greedy search can be repeated until all the unlabeled vertices are assigned to labeled sets. [sent-571, score-0.533]
85 In each iteration of the greedy cut process, the weighted connectivity of all unlabeled vertices to labeled sets is re-computed. [sent-572, score-0.735]
86 Then the vertex with minimum connectivity is placed in the proper labeled set. [sent-573, score-0.35]
87 We name the algorithm greedy gradient Max-Cut (GGMC) since, in the greedy step, the unlabeled vertices are assigned labels in a manner that reduces the value of Q along the direction of steepest descent. [sent-577, score-0.675]
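The full greedy gradient loop can be sketched as below; the degree-based normalization Λ is the same assumed form as earlier, and ties in the minimum are broken arbitrarily:

    import numpy as np

    def ggmc(A, Y0, degrees, labeled):
        """Greedy gradient Max-Cut sketch: repeatedly fix the (vertex, class) pair
        with globally minimum weighted connectivity C = A Lambda Y."""
        Y = Y0.astype(float).copy()
        unlabeled = sorted(set(range(A.shape[0])) - set(labeled))
        while unlabeled:
            class_mass = np.maximum(degrees @ Y, 1e-12)
            Y_tilde = (degrees[:, None] * Y) / class_mass   # Lambda Y (assumed normalization)
            C = A @ Y_tilde                                  # weighted connectivity matrix
            sub = C[unlabeled]                               # only unassigned vertices compete
            r, j_star = np.unravel_index(np.argmin(sub), sub.shape)
            i_star = unlabeled[r]
            Y[i_star, :] = 0
            Y[i_star, j_star] = 1                            # label fixed, never revisited
            unlabeled.pop(r)
        return Y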
88 , 2004), our GGMC algorithm tends to generate more natural graph cuts and avoid biased solutions since it uses a weighted connectivity matrix. [sent-581, score-0.365]
89 3 Complexity and Speed Up Assume the graph has n = |X| vertices and a subset Xl with l = |Xl | labeled vertices (where l ≪ n). [sent-584, score-0.529]
90 For example, the computation of the connectivity in Equation (16) can be done incrementally after assigning each new unlabeled vertex to a certain label set. [sent-589, score-0.576]
91 Assume in the t’th iteration the connectivity is Ct and an unlabeled vertex xi with degree di is assigned to the labeled set S j . [sent-591, score-0.589]
92 Clearly, for all remaining unlabeled vertices, the connectivity to the labeled sets remains unchanged except for the j’th labeled set. [sent-592, score-0.575]
93 where d_{Sj}^{t+1} = d_{Sj}^t + di is the sum of the degrees of the labeled vertices after assigning xi to the labeled set Sj. [sent-597, score-0.389]
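In code, the update touches only column j of the connectivity matrix (again assuming the degree-normalized connectivity sketched above; with unnormalized connectivity the update would simply add the new vertex's edge weights):

    import numpy as np

    def update_connectivity(C, A, d_Sj_old, d_i, i, j):
        """After assigning vertex i (degree d_i) to set S_j, rescale the old class mass
        and fold in the newly labeled vertex's degree-weighted edges; other columns are unchanged."""
        d_Sj_new = d_Sj_old + d_i
        C = C.copy()
        C[:, j] = (C[:, j] * d_Sj_old + A[:, i] * d_i) / d_Sj_new
        return C, d_Sj_new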
94 As in Section 2, any real implementation of graph-based SSL needs a graph construction algorithm that builds a graph from the training data X. [sent-612, score-0.447]
95 Figure 4: Experimental results on the noisy two-moon data set simulating different graph construction approaches and label conditions. [sent-648, score-0.388]
96 Clearly, GGMC is insensitive to the imbalance since it computes a per-class label weight normalization which compensates directly for differences in label proportions. [sent-701, score-0.382]
97 The previous graph construction methodology was followed and algorithm performance was evaluated using the average error rate across 100 random folds with the number of initial labels varying from 60 to 100. [sent-782, score-0.423]
98 This greedy gradient Max-Cut (GGMC) solution presents a different interpretation for the alternating minimization procedure from a graph cut view. [sent-800, score-0.549]
99 Multi-Class Case as a Max K-Cut Problem Here, we show that K-class bivariate graph transduction is equivalent to a Max K-Cut problem. [sent-815, score-0.418]
100 Learning from labeled and unlabeled data using graph mincuts. [sent-871, score-0.549]
wordName wordTfidf (topN-words)
[('ggmc', 0.419), ('lgc', 0.305), ('ssl', 0.261), ('gfhf', 0.244), ('laprls', 0.236), ('graph', 0.209), ('lapsvm', 0.209), ('unlabeled', 0.208), ('xl', 0.168), ('label', 0.15), ('labeled', 0.132), ('ebara', 0.131), ('labels', 0.129), ('vertex', 0.115), ('gssl', 0.105), ('emi', 0.105), ('transduction', 0.105), ('bivariate', 0.104), ('connectivity', 0.103), ('cut', 0.099), ('greedy', 0.099), ('vertices', 0.094), ('reedy', 0.094), ('xu', 0.088), ('zl', 0.087), ('upervised', 0.087), ('sing', 0.073), ('blum', 0.071), ('jebara', 0.07), ('wang', 0.067), ('labeling', 0.067), ('hang', 0.063), ('alternating', 0.061), ('tr', 0.06), ('ax', 0.057), ('sparsi', 0.055), ('transductive', 0.055), ('imbalance', 0.054), ('lf', 0.052), ('sindhwani', 0.052), ('chapelle', 0.052), ('edge', 0.05), ('weighting', 0.049), ('ga', 0.048), ('imbalanced', 0.047), ('chawla', 0.047), ('propagation', 0.047), ('bi', 0.046), ('zhu', 0.046), ('gradient', 0.046), ('prediction', 0.045), ('graphs', 0.044), ('knn', 0.044), ('earning', 0.042), ('ut', 0.041), ('ay', 0.041), ('zhou', 0.041), ('cost', 0.04), ('columbia', 0.036), ('minimization', 0.035), ('bip', 0.035), ('goemans', 0.035), ('univariate', 0.035), ('yi', 0.034), ('edges', 0.033), ('diffusion', 0.032), ('xi', 0.031), ('initial', 0.031), ('matrix', 0.031), ('belkin', 0.03), ('figures', 0.03), ('biased', 0.029), ('construction', 0.029), ('animal', 0.029), ('ci', 0.029), ('smoothness', 0.029), ('binary', 0.029), ('bn', 0.028), ('weight', 0.028), ('zu', 0.027), ('article', 0.026), ('aee', 0.026), ('hfgf', 0.026), ('qsmooth', 0.026), ('unevenly', 0.026), ('equation', 0.026), ('th', 0.025), ('rate', 0.025), ('undirected', 0.025), ('py', 0.025), ('neighborhood', 0.025), ('ds', 0.024), ('weights', 0.024), ('cuts', 0.024), ('neighbors', 0.024), ('yk', 0.023), ('update', 0.023), ('tsvm', 0.022), ('bii', 0.022), ('neighbor', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
2 0.11027399 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
3 0.097870983 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
4 0.091341257 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
Author: Matthew J. Urry, Peter Sollich
Abstract: We consider learning on graphs, guided by kernels that encode similarity between vertices. Our focus is on random walk kernels, the analogues of squared exponential kernels in Euclidean spaces. We show that on large, locally treelike graphs these have some counter-intuitive properties, specifically in the limit of large kernel lengthscales. We consider using these kernels as covariance functions of Gaussian processes. In this situation one typically scales the prior globally to normalise the average of the prior variance across vertices. We demonstrate that, in contrast to the Euclidean case, this generically leads to significant variation in the prior variance across vertices, which is undesirable from a probabilistic modelling point of view. We suggest the random walk kernel should be normalised locally, so that each vertex has the same prior variance, and analyse the consequences of this by studying learning curves for Gaussian process regression. Numerical calculations as well as novel theoretical predictions for the learning curves using belief propagation show that one obtains distinctly different probabilistic models depending on the choice of normalisation. Our method for predicting the learning curves using belief propagation is significantly more accurate than previous approximations and should become exact in the limit of large random graphs. Keywords: Gaussian process, generalisation error, learning curve, cavity method, belief propagation, graph, random walk kernel
5 0.090806633 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
Author: Julien Mairal, Bin Yu
Abstract: We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs. Keywords: convex and non-convex optimization, network flow optimization, graph sparsity
6 0.082470715 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
7 0.081797495 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
9 0.078295283 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
10 0.077714287 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
11 0.061928492 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
12 0.054357603 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
13 0.051155835 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
14 0.050428994 8 jmlr-2013-A Theory of Multiclass Boosting
15 0.048507821 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
16 0.047988661 68 jmlr-2013-Machine Learning with Operational Costs
17 0.04670674 41 jmlr-2013-Experiment Selection for Causal Discovery
18 0.046117332 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
19 0.046001021 90 jmlr-2013-Quasi-Newton Method: A New Direction
20 0.044295274 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
topicId topicWeight
[(0, -0.234), (1, 0.073), (2, -0.019), (3, -0.031), (4, 0.122), (5, 0.224), (6, 0.003), (7, -0.005), (8, -0.084), (9, -0.21), (10, 0.074), (11, 0.075), (12, -0.028), (13, 0.063), (14, 0.106), (15, 0.246), (16, 0.066), (17, 0.152), (18, -0.006), (19, -0.047), (20, -0.132), (21, -0.014), (22, -0.001), (23, 0.057), (24, 0.073), (25, -0.039), (26, -0.126), (27, -0.025), (28, -0.027), (29, -0.052), (30, -0.103), (31, 0.082), (32, -0.019), (33, -0.111), (34, -0.074), (35, -0.109), (36, -0.113), (37, -0.041), (38, 0.031), (39, -0.048), (40, -0.039), (41, 0.04), (42, 0.01), (43, 0.038), (44, -0.003), (45, -0.028), (46, -0.075), (47, -0.022), (48, -0.062), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94398469 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
2 0.6122517 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
3 0.59143114 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
4 0.5838449 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
Author: Mark Vere Culp, Kenneth Joseph Ryan
Abstract: The cluster assumption had a significant impact on the reasoning behind semi-supervised classification methods in graph-based learning. The literature includes numerous applications where harmonic functions provided estimates that conformed to data satisfying this well-known assumption, but the relationship between this assumption and harmonic functions is not as well-understood theoretically. We investigate these matters from the perspective of supervised kernel classification and provide concrete answers to two fundamental questions. (i) Under what conditions do semisupervised harmonic approaches satisfy this assumption? (ii) If such an assumption is satisfied then why precisely would an observation sacrifice its own supervised estimate in favor of the cluster? First, a harmonic function is guaranteed to assign labels to data in harmony with the cluster assumption if a specific condition on the boundary of the harmonic function is satisfied. Second, it is shown that any harmonic function estimate within the interior is a probability weighted average of supervised estimates, where the weight is focused on supervised kernel estimates near labeled cases. We demonstrate that the uniqueness criterion for harmonic estimators is sensitive when the graph is sparse or the size of the boundary is relatively small. This sets the stage for a third contribution, a new regularized joint harmonic function for semi-supervised learning based on a joint optimization criterion. Mathematical properties of this estimator, such as its uniqueness even when the graph is sparse or the size of the boundary is relatively small, are proven. A main selling point is its ability to operate in circumstances where the cluster assumption may not be fully satisfied on real data by compromising between the purely harmonic and purely supervised estimators. The competitive stature of the new regularized joint harmonic approach is established. Keywords: harmonic function, joint training, cluster assumption, s
5 0.47476241 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi
Abstract: In this paper we establish that the Lov´ sz ϑ function on a graph can be restated as a kernel learning a problem. We introduce the notion of SVM − ϑ graphs, on which Lov´ sz ϑ function can be apa proximated well by a Support vector machine (SVM). We show that Erd¨ s-R´ nyi random G(n, p) o e log4 n np graphs are SVM − ϑ graphs for n ≤ p < 1. Even if we embed a large clique of size Θ 1−p in a G(n, p) graph the resultant graph still remains a SVM − ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings which extends the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph
6 0.46732858 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
7 0.46241319 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
8 0.45465276 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
9 0.43881536 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
11 0.35703355 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
12 0.31295025 41 jmlr-2013-Experiment Selection for Causal Discovery
13 0.31069693 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
15 0.27454612 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.2714214 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
17 0.26720712 68 jmlr-2013-Machine Learning with Operational Costs
18 0.26322889 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
19 0.26246145 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
20 0.24929054 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
topicId topicWeight
[(0, 0.029), (5, 0.156), (6, 0.072), (10, 0.099), (20, 0.025), (23, 0.025), (44, 0.013), (68, 0.039), (70, 0.019), (71, 0.01), (75, 0.048), (85, 0.019), (87, 0.022), (90, 0.326)]
simIndex simValue paperId paperTitle
1 0.93752146 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics
Author: Hervé Frezza-Buet, Matthieu Geist
Abstract: This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good fit between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example. Keywords: reinforcement learning, C++, generic programming 1. C++ Genericity for Fitting the Mathematics of Reinforcement Learning Reinforcement learning (RL) is a field of machine learning that benefits from a rigorous mathematical formalism, as shown for example by Bertsekas (1995). Although this formalism is well accepted in the field, its translation into efficient computer science tools has surprisingly not led to any standard yet, as mentioned by Kovacs and Egginton (2011). The claim of this paper is that genericity enables a natural expression of the mathematics of RL. The rllib (2011) library implements this idea in the C++ language, where genericity relies on templates. Templates automate the re-writing of some generic code involving user types, offering a strong type checking at compile time that improves the code safety. Using the rllib templates requires that the user-defined types fit some documented concepts. For example, some class C defining an agent should be designed so that C::state type is the type used for the states, C::action type is the type used for the actions, and the method C::action type policy(const C::state type& s) const is implemented, in order to compute the action to be performed in a given state. This concept definition specifies what is required for an agent mathematically. Note that C does not need to inherit from any kind of abstract rl::Agent class to be used by the rllib tools. It can be directly provided as a type argument to any rllib template requiring an argument fitting the concept of an agent, so that the re-written code actually compiles. 2. A Short Example Let us consider the following toy-example. The state space contains values from 0 to 9, and actions consist in increasing or decreasing the value. When the value gets out of bounds, a reward is returned ∗. Also at UMI 2958 Georgia Tech / CNRS, 2-3, rue Marconi, 57070 Metz, France. c 2013 Herv´ Frezza-Buet and Matthieu Geist. e F REZZA -B UET AND G EIST (-1 for bound 0, 1 for bound 9). Otherwise, a null reward is returned. Let us define this problem and run Sarsa. First, a simulator class fitting the concept Simulator described in the documentation is needed. c l a s s Sim { / / Our s i m u l a t o r c l a s s . No i n h e r i t a n c e r e q u i r e d . private : i n t c u r r e n t ; double r ; public : typedef int phase type ; typedef int observation type ; t y p e d e f enum { up , down} a c t i o n t y p e ; t y p e d e f d o u b l e r e w a r d t y p e ; Sim ( v o i d ) : c u r r e n t ( 0 ) , r ( 0 ) {} v o i d s e t P h a s e ( c o n s t p h a s e t y p e &s; ) { c u r r e n t = s %10;} c o n s t o b s e r v a t i o n t y p e& s e n s e ( v o i d ) c o n s t { r e t u r n c u r r e n t ; } reward type reward ( void ) const { return r ;} v o i d t i m e S t e p ( c o n s t a c t i o n t y p e &a; ) { i f ( a == up ) c u r r e n t ++; e l s e c u r r e n t −−; i f ( c u r r e n t < 0 ) r =−1; e l s e i f ( c u r r e n t > 9 ) r = 1 ; e l s e r = 0 ; i f ( r ! 
= 0 ) throw r l : : e x c e p t i o n : : T e r m i n a l ( ” Out o f r a n g e ” ) ; } }; Following the concept requirements, the class Sim naturally implements a sensor method sense that provides an observation from the current phase of the controlled dynamical system, and a method timeStep that computes a transition consecutive to some action. Note the use of exceptions for terminal states. For the sake of simplicity in further code, the following is added. typedef typedef typedef typedef typedef Sim : : p h a s e t y p e Sim : : a c t i o n t y p e r l : : I t e r a t o r t y p e d e f r l : : a g e n t : : o n l i n e : : E p s i l o n G r e e d y Critic ; ArgmaxCritic ; TestAgent ; LearnAgent ; The rllib expresses that Sarsa provides a critic, offering a Q-function. As actions are discrete, the best action (i.e., argmaxa∈A Q(s, a)) can be found by considering all the actions sequentially. This is what ArgmaxCritic offers thanks to the action enumerator Aenum, in order to define greedy and ε-greedy agents. The main function then only consists in running episodes with the appropriate agents. i n t main ( i n t a r g c , char ∗ a r g v [ ] ) { Sim simulator ; Transition transition ; ArgmaxCritic c r i t i c ; LearnAgent learner ( critic ); TestAgent tester ( critic ); A a; S s; int episode , length , s t e p =0; // // // // // // // T h i s i s what t h e a g e n t c o n t r o l s . T h i s i s some s , a , r , s ’ , a ’ d a t a . T h i s c o m p u t e s Q and argmax a Q( s , a ) . SARSA u s e s t h i s a g e n t t o l e a r n t h e p o l i c y . This behaves according to the c r i t i c . Some a c t i o n . Some s t a t e . f o r ( e p i s o d e = 0 ; e p i s o d e < 1 0 0 0 0 ; ++ e p i s o d e ) { / / L e a r n i n g p h a s e s i m u l a t o r . setPhase ( rand ()%10); r l : : episode : : sa : : r u n a n d l e a r n ( simulator , l e a r n e r , t r a n s i t i o n , 0 , length ) ; } try { / / T e s t phase simulator . setPhase (0); while ( true ) { s = simulator . sense ( ) ; a = t e s t e r . policy ( s ) ; s t e p ++; s i m u l a t o r . t i m e S t e p ( a ) ; } } c a t c h ( r l : : e x c e p t i o n : : T e r m i n a l e ) { s t d : : c o u t << s t e p << ” s t e p s . ” << s t d : : e n d l ; } return 0 ; / / t h e message p r i n t e d i s ‘ ‘10 s t e p s . ’ ’ } 3. Features of the Library Using the library requires to define the features that are specific to the problem (the simulator and the Q-function architecture in our example) from scratch, but with the help of concepts. Then, the specific features can be handled by generic code provided by the library to implement RL techniques with value function estimation. 627 F REZZA -B UET AND G EIST Currently, Q-learing, Sarsa, KTD-Q, LSTD, and policy iteration are available, as well as a multi-layer perceptron architecture. Moreover, some benchmark problems (i.e., simulators) are also provided: the mountain car, the cliff walking, the inverted pendulum and the Boyan chain. Extending the library with new algorithms is allowed, since it consists in defining new templates. This is a bit more technical than only using the existing algorithms, but the structure of existing concepts helps, since it reflects the mathematics of RL. For example, concepts like Feature, for linear approaches mainly (i.e., Q(s, a) = θT ϕ(s, a)) and Architecture (i.e., Q(s, a) = fθ (s, a) for more general approximation) orient the design toward functional approaches of RL. 
The algorithms implemented so far rely on the GNU Scientific Library (see GSL, 2011) for linear algebra computation, so the GPL licence of GSL propagates to the rllib. 4. Conclusion The rllib relies only on the C++ standard and the availability of the GSL on the system. It offers state-action function approximation tools for applying RL to real problems, as well as a design that fits the mathematics. The latter allows for extensions, but is also compliant with pedagogical purpose. The design of the rllib aims at allowing the user to build (using C++ programming) its own experiment, using several algorithms, several agents, on-line or batch learning, and so on. Actually, the difficult part of RL is the algorithms themselves, not the script-like part of the experiment where things are put together (see the main function in our example). With a framework, in the sense of Kovacs and Egginton (2011), the experiment is not directly accessible to the user programs, since it is handled by some libraries in order to offer graphical interface or analyzing tools. The user code is then called by the framework when required. We advocate that allowing the user to call the rllib functionality at his/her convenience provides an open and extensible access to RL for students, researchers and engineers. Last, the rllib fits the requirements expressed by Kovacs and Egginton (2011, Section 4.3): support of good scientific research, formulation compliant with the domain, allowing for any kind of agents and any kind of approximators, interoperability of components (the Q function of the example can be used for different algorithms and agents), maximization of run-time speed (use of C++ and templates that inline massively the code), open source, etc. Extensions of rllib can be considered, for example for handling POMDPs, and contributions of users are expected. The use of templates is unfortunately unfamiliar to many programmers, but the effort is worth it, since it brings the code at the level of the mathematical formalism, increasing readability (by a rational use of typedefs) and reducing bugs. Even if the approach is dramatically different from existing frameworks, wrappings with frameworks can be considered in further development. References Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd (20052007) edition, 1995. GSL, 2011. http://http://www.gnu.org/software/gsl. Tim Kovacs and Robert Egginton. On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning, 84:7–49, 2011. rllib, 2011. http://ims.metz.supelec.fr/spip.php?article122. 628
same-paper 2 0.7707696 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
3 0.55956042 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
Author: Yu-Feng Li, Ivor W. Tsang, James T. Kwok, Zhi-Hua Zhou
Abstract: In this paper, we study the problem of learning from weakly labeled data, where labels of the training examples are incomplete. This includes, for example, (i) semi-supervised learning where labels are partially known; (ii) multi-instance learning where labels are implicitly known; and (iii) clustering where labels are completely unknown. Unlike supervised learning, learning with weak labels involves a difficult Mixed-Integer Programming (MIP) problem. Therefore, it can suffer from poor scalability and may also get stuck in local minimum. In this paper, we focus on SVMs and propose the W ELL SVM via a novel label generation strategy. This leads to a convex relaxation of the original MIP, which is at least as tight as existing convex Semi-Definite Programming (SDP) relaxations. Moreover, the W ELL SVM can be solved via a sequence of SVM subproblems that are much more scalable than previous convex SDP relaxations. Experiments on three weakly labeled learning tasks, namely, (i) semi-supervised learning; (ii) multi-instance learning for locating regions of interest in content-based information retrieval; and (iii) clustering, clearly demonstrate improved performance, and W ELL SVM is also readily applicable on large data sets. Keywords: weakly labeled data, semi-supervised learning, multi-instance learning, clustering, cutting plane, convex relaxation
4 0.5436272 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER
5 0.52217853 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
9 0.51364416 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
10 0.51364011 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.5106355 41 jmlr-2013-Experiment Selection for Causal Discovery
12 0.51025021 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
13 0.51000828 120 jmlr-2013-Variational Algorithms for Marginal MAP
14 0.5086289 59 jmlr-2013-Large-scale SVD and Manifold Learning
15 0.50825125 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
16 0.5079003 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
17 0.50710225 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
18 0.50693733 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
19 0.505961 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
20 0.50533736 22 jmlr-2013-Classifying With Confidence From Incomplete Information