nips nips2012 nips2012-186 knowledge-graph by maker-knowledge-mining

186 nips-2012-Learning as MAP Inference in Discrete Graphical Models


Source: pdf

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a new formulation for binary classification. [sent-11, score-0.154]

2 We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. [sent-13, score-0.556]

3 By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. [sent-14, score-0.168]

4 Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. [sent-16, score-0.325]

5 1 Introduction A large fraction of the machine learning community concerns itself with the formulation of a learning problem as a single, well-defined optimization problem. [sent-18, score-0.174]

6 Among these optimization-based frameworks for learning, two paradigms stand out: the one based on convex formulations (such as SVMs) and the one based on non-convex formulations (such as deep belief networks). [sent-20, score-0.314]

7 The main argument in favor of convex formulations is that we can effectively decouple modeling from optimization, which has substantial theoretical and practical benefits. [sent-21, score-0.297]

8 Coming from the other end, the main argument for non-convexity is that a convex formulation very often fails to capture fundamental properties of a real problem (e.g. [sent-23, score-0.249]

9 see [1, 2] for examples of some fundamental limitations of convex loss functions). [sent-25, score-0.25]

10 The motivation for this paper starts from the observation that the above tension is not really between convexity and non-convexity, but between convexity and continuous non-convexity. [sent-26, score-0.158]

11 On the contrary, unless P=NP there is no general tool to efficiently optimize discrete functions. [sent-29, score-0.25]

12 We suspect this is one of the reasons why machine learning has traditionally been formulated in terms of continuous optimization: it is indeed convenient to compute gradients or subgradients and delegate optimization to some off-the-shelf gradient-based algorithm. [sent-30, score-0.287]

13 The formulation we introduce in this paper is non-convex, but discrete rather than continuous. [sent-31, score-0.361]

14 By being non-convex we will attempt to capture some of the expressive power of continuous non-convex formulations (such as robustness to labeling noise), and by being discrete we will retain the ability of convex formulations to provide theoretical guarantees in optimization. [sent-32, score-0.724]

15 There are highly non-trivial classes of non-convex discrete functions defined over exponentially large discrete spaces which can be optimized efficiently. [sent-33, score-0.551]

16 Discrete functions factored over cliques of low-treewidth graphs can be optimized efficiently via dynamic programming [3]. [sent-35, score-0.327]

17 Arbitrary submodular functions can be minimized in polynomial time [4]. [sent-36, score-0.115]

18 Particular submodular functions can be optimized very efficiently using max-flow algorithms [5]. [sent-37, score-0.115]
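For readers who want to see the max-flow route concretely, here is a minimal sketch (our own toy construction and numbers, not code or notation from the paper): a small binary pairwise energy with non-negative Potts terms is reduced to an s-t min-cut, solved with scipy.sparse.csgraph.maximum_flow, and checked against brute force.

```python
# Minimal sketch (our own toy construction, not code from the paper): minimize a
# binary pairwise energy  E(x) = sum_i u_i(x_i) + sum_(i,j) w_ij * [x_i != x_j]
# with w_ij >= 0 by reducing it to an s-t min-cut and calling a max-flow solver.
import itertools
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow

u = np.array([[0, 3], [2, 0], [1, 1]])            # unary costs u[i, label], non-negative ints
w = {(0, 1): 2, (1, 2): 1}                        # Potts weights w_ij >= 0

n = u.shape[0]
s, t = n, n + 1                                   # extra source and sink nodes
cap = np.zeros((n + 2, n + 2), dtype=np.int32)    # integer capacities for scipy
for i in range(n):
    cap[s, i] = u[i, 1]                           # paid when x_i = 1 (i ends on the sink side)
    cap[i, t] = u[i, 0]                           # paid when x_i = 0 (i ends on the source side)
for (i, j), wij in w.items():
    cap[i, j] += wij                              # one direction is cut exactly when x_i != x_j
    cap[j, i] += wij

min_cut = maximum_flow(csr_matrix(cap), s, t).flow_value  # max-flow = min-cut = min energy

def energy(x):
    return sum(u[i, x[i]] for i in range(n)) + \
           sum(wij * (x[i] != x[j]) for (i, j), wij in w.items())

brute = min(energy(x) for x in itertools.product((0, 1), repeat=n))
print(min_cut, brute)                             # both are 3 on this toy instance
```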

19 Discrete functions defined over other particular classes of graphs also have polynomial-time algorithms (planar graphs [6], perfect graphs [7]). [sent-38, score-0.381]

20 And of course although many discrete optimization problems are NP-hard, several have efficient constant-factor approximations [8]. [sent-39, score-0.313]

21 Although all these discrete approaches have been widely used for solving inference problems in machine learning settings, we argue in this paper that they should also be used to solve estimation problems, or learning per se. [sent-41, score-0.334]

22 The discrete approach does pose several new questions though, which we list at the end. [sent-42, score-0.25]

23 These features essentially guide the assumptions behind our framework. [sent-47, score-0.105]

24 Option to decouple modeling from optimization: As discussed in the introduction, this is the great appeal of convex formulations, and we would like to retain it. [sent-48, score-0.143]

25 For instance, in our framework the user has the option of reducing the learning algorithm to a series of matrix multiplications and lookup operations, while having a precise estimate of the total runtime of the algorithm and retaining good performance. [sent-56, score-0.425]

26 Robustness to label noise: SVMs, as well as boosting and logistic regression, are considered state-of-the-art estimators for binary classifiers. [sent-57, score-0.294]

27 However, when label noise is present, convex loss functions inflict an arbitrarily large penalty on misclassifications because they are unbounded. [sent-59, score-0.455]

28 In other words, in high label noise settings these convex loss functions become poor proxies for the 0/1 loss (the loss we really care about). [sent-60, score-0.961]

29 This fundamental limitation of convex loss functions is well understood theoretically [1]. [sent-61, score-0.301]

30 The fact that the loss function of interest is itself discrete is indeed a hint that maybe we should investigate discrete surrogates rather than continuous surrogates for the 0/1 loss: optimizing discrete functions over continuous spaces is hard, but not necessarily over discrete spaces. [sent-62, score-1.813]

31 However, the assumptions required for ℓ1-regularized models to actually be good proxies for the support cardinality function (the ℓ0 pseudo-norm) are very strong and rarely met in practice [10]. [sent-67, score-0.194]

32 Ideally we would like to regularize with ℓ0 directly; maybe this suggests the possibility of exploring an inherently discrete formulation? [sent-70, score-0.308]

33 However we go beyond the Naive Bayes assumption by constructing graphs that model dependencies between variables. [sent-75, score-0.11]

34 By varying the properties of these graphs we can trade-off model complexity and optimization efficiency in a straightforward manner. [sent-76, score-0.173]

35 f : X → Y is a member of a given class of predictors parameterized by θ, Θ is a continuous space such as a Hilbert space, and ℓ as well as Ω are continuous and convex functions of θ. [sent-78, score-0.428]

36 ℓ is a loss function which enforces a penalty whenever f(x^n) ≠ y^n, and therefore the first term in (1) measures the total loss incurred by predictor f on the training sample {x^n, y^n} under parameterization θ. [sent-79, score-0.373]

37 This formulation is used for regression (Y continuous) as well as classification and structured prediction (Y discrete). [sent-82, score-0.178]

38 Logistic Regression, Regularized Least-Squares Regression, SVMs, CRFs, structured SVMs, Lasso, Group Lasso and a variety of other estimators are all instances of (1) for particular choices of ℓ, f, Θ and Ω. [sent-83, score-0.127]

39 The formulation in (1) is a very general formulation for machine learning under the i.i.d. assumption. [sent-84, score-0.222]

40 In this paper we study problem (1) under the assumption that the parameter space Θ is discrete and finite, focusing on binary classification, when Y = {−1, 1}. [sent-88, score-0.293]

41 4 Formulation Our formulation departs from the one in (1) in two ways. [sent-89, score-0.111]

42 The first assumption is that both the loss ℓ and the regularizer Ω are additive over low-dimensional functions defined by a graph G = (V, E), i.e., [sent-90, score-0.425]

43 ℓ(y, f(x; θ)) = Σ_{c∈C} ℓ_c(y, f_c(x; θ_c))  (2)  and  Ω(θ) = Σ_{c∈C′} Ω_c(θ_c)  (3), where C ∪ C′ is the set of maximal cliques in G. [sent-92, score-0.35]

44 ℓ_c is a low-dimensional discrete surrogate for ℓ, and f_c is a low-dimensional predictor, both to be defined below. [sent-104, score-0.434]

45 Note that in general two parameter subvectors θci and θcj are not independent since the cliques ci and cj can overlap. [sent-105, score-0.166]

46 Indeed, one of the key reasons sustaining the power of this formulation is that all θc are coupled either directly or indirectly through the connected graph G = (V, E). [sent-106, score-0.372]

47 The second assumption is that Θ is discrete and therefore the vector θ = (θ_1, ... [sent-107, score-0.25]

48 ..., θ_D) is discrete in the sense that each θ_i is only allowed to take on finitely many values, including the value 0 (this will be important when we discuss regularization). [sent-110, score-0.25]

49 For simplicity of exposition let’s assume that the number of discrete values (bins) for each θi is the same: B. [sent-111, score-0.25]
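As a small illustrative sketch of this binning step (the grid below is our own choice; the paper does not prescribe a particular one), each parameter can be restricted to B evenly spaced values that always include 0:

```python
# Minimal sketch of the discretization step (our own binning choice, not the
# paper's exact grid): each parameter theta_i may take one of B values, 0 included.
import numpy as np

def parameter_bins(B, radius=1.0):
    """Return B candidate values for a single parameter, always containing 0."""
    assert B % 2 == 1, "an odd B keeps the grid symmetric around 0"
    return np.linspace(-radius, radius, B)

bins = parameter_bins(B=5)       # array([-1., -0.5, 0., 0.5, 1.])
print(bins)
```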

50 This often provides improved performance for our model due to the spreading of higher-order dependencies over lower-order cliques (when mapping to a higher-dimensional space) and is also motivated by a theoretical argument (section 6). [sent-117, score-0.21]

51 We will assume a standard linear predictor of the kind f_c(x; θ) = argmax_{y∈{−1,1}} y⟨x_c, θ_c⟩ = sign⟨x_c, θ_c⟩  (4). In other words, we have a linear classifier that only considers the features in clique c. [sent-120, score-0.573]

52 In other words, the 0/1 loss constrained to the discretized subspace defined by clique c can be exactly and efficiently computed (for small cliques). [sent-122, score-0.302]
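To make the construction concrete, the following sketch tabulates such a clique-restricted 0/1 loss for a pairwise clique by brute force over the B × B discretized values of (θ_a, θ_b); the toy data and all names are our own, so this is an illustration of the idea rather than the authors' code.

```python
# Sketch: tabulate the clique-restricted 0/1 loss over all B x B discretized
# values of a pairwise clique c = (a, b).  Illustrative names and toy data only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # toy training features
y = np.sign(X[:, 0] + 0.5 * X[:, 1])          # toy labels in {-1, +1}
y[y == 0] = 1

bins = np.linspace(-1.0, 1.0, 5)              # B = 5 candidate values per parameter

def clique_loss_table(X, y, a, b, bins):
    """psi[p, q] = # of training points misclassified by sign(bins[p]*x_a + bins[q]*x_b)."""
    B = len(bins)
    psi = np.zeros((B, B), dtype=int)
    for p in range(B):
        for q in range(B):
            scores = bins[p] * X[:, a] + bins[q] * X[:, b]
            pred = np.where(scores >= 0, 1, -1)
            psi[p, q] = int(np.sum(pred != y))
    return psi

psi_01 = clique_loss_table(X, y, a=0, b=1, bins=bins)
print(psi_01)
```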

53 One critical technical issue is that linear predictors of the kind argmax_y ⟨φ(x, y), θ⟩ are insensitive to scalings of θ [14]. [sent-124, score-0.136]

54 Therefore, the loss will be such that ℓ(y, f(x; αθ)) = ℓ(y, f(x; θ)) for α ≠ 0. [sent-125, score-0.156]

55 This means that any regularizer that depends on the scale of θ is not meaningful here. (Footnote 1: For notational simplicity we assume an offset parameter is already included in θ_c and a corresponding entry of 1 is appended to the vector x_c.) [sent-126, score-0.153]

56 In other words, in such a discrete setting we need a scale-invariant regularizer, such as the ℓ0 pseudo-norm. [sent-128, score-0.25]

57 Again, we can trade off model simplicity and optimization efficiency by controlling the size of the maximal clique in C′. [sent-133, score-0.209]

58 The critical observation now is that (7) is a MAP inference problem in a discrete graphical model with clique set C, high-order clique potentials ψ_c(θ_c) and unary potentials φ_i(θ_i) [15]. [sent-136, score-0.743]
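A corresponding sketch of the unary potentials (again, names and the value of λ are illustrative assumptions of ours): the scale-invariant ℓ0 regularizer simply charges λ for every parameter assigned a non-zero bin, and the MAP objective is the sum of these unary tables and the clique loss tables.

```python
# Sketch: unary potentials phi_i(theta_i) implementing the l0 pseudo-norm
# (a constant charge for any non-zero bin); names are illustrative.
import numpy as np

bins = np.linspace(-1.0, 1.0, 5)
lam = 3.0                                   # regularization strength per non-zero parameter

def l0_unary_potential(bins, lam):
    """phi[p] = lam if bins[p] != 0 else 0 -- a scale-invariant penalty."""
    return np.where(np.isclose(bins, 0.0), 0.0, lam)

phi = l0_unary_potential(bins, lam)
print(phi)                                  # [3. 3. 0. 3. 3.]
# The MAP objective is then  sum_c psi_c(theta_c) + sum_i phi_i(theta_i),
# minimized over the discrete assignment of bins to parameters.
```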

59 Therefore we can resort to the vast literature on inference in graphical models to find exact or approximate solutions for (7). [sent-137, score-0.155]

60 For example, if G = (V, E) is a tree, then (7) can be solved exactly and efficiently using a dynamic programming algorithm that only requires matrix-vector multiplications in the (min, +) semiring, in addition to elementary lookup operations [3]. [sent-138, score-0.273]

61 For more general graphs the problem (7) can become NP-hard, but even in that case there are several principled approaches that often find excellent solutions, such as those based on linear programming relaxations [9] for tightly outer-bounding the marginal polytope [16]. [sent-139, score-0.164]
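As a toy illustration of what such an LP relaxation looks like (our own construction and potentials, solved here with scipy.optimize.linprog rather than a specialized solver), the sketch below relaxes MAP on a frustrated triangle over the local polytope; the bound is deliberately loose on this instance, which shows that the relaxation is an outer bound.

```python
# Sketch of a local-polytope LP relaxation of MAP for a tiny pairwise model
# (a frustrated triangle of binary variables).  Toy potentials and all names
# are our own; the point is only to illustrate the outer bound.
import itertools
import numpy as np
from scipy.optimize import linprog

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
theta_n = np.zeros((3, 2))                                        # unary costs theta_i(x_i)
theta_e = {e: np.array([[1.0, 0.0], [0.0, 1.0]]) for e in edges}  # cost 1 if endpoints agree

def var_n(i, a): return 2 * i + a                            # index of mu_i(a)
def var_e(k, a, b): return 6 + 4 * k + 2 * a + b             # index of mu_edge_k(a, b)

n_vars = 6 + 4 * len(edges)
c = np.zeros(n_vars)
for i in nodes:
    for a in (0, 1):
        c[var_n(i, a)] = theta_n[i, a]
for k, e in enumerate(edges):
    for a in (0, 1):
        for b in (0, 1):
            c[var_e(k, a, b)] = theta_e[e][a, b]

A_eq, b_eq = [], []
for i in nodes:                                              # node marginals sum to 1
    row = np.zeros(n_vars); row[var_n(i, 0)] = row[var_n(i, 1)] = 1
    A_eq.append(row); b_eq.append(1.0)
for k, (i, j) in enumerate(edges):                           # edge marginals agree with nodes
    for a in (0, 1):
        row = np.zeros(n_vars)
        row[var_e(k, a, 0)] = row[var_e(k, a, 1)] = 1
        row[var_n(i, a)] = -1
        A_eq.append(row); b_eq.append(0.0)
    for b in (0, 1):
        row = np.zeros(n_vars)
        row[var_e(k, 0, b)] = row[var_e(k, 1, b)] = 1
        row[var_n(j, b)] = -1
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, 1)] * n_vars, method="highs")

exact = min(sum(theta_n[i, x[i]] for i in nodes) +
            sum(theta_e[e][x[e[0]], x[e[1]]] for e in edges)
            for x in itertools.product((0, 1), repeat=3))
print(res.fun, exact)   # 0.0 vs 1.0: the relaxation is a (here loose) lower bound
```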

62 In a similar spirit to our approach, it also addresses the problem of estimating linear binary classifiers in a discrete formulation. [sent-142, score-0.293]

63 However, instead of composing low-dimensional discrete surrogates of the 0/1 loss as we do, it instead uses a fully connected factor graph and performs inference by estimating the mean of the max-marginals rather than MAP. [sent-143, score-0.897]

64 Inference is approached using message-passing, which for the fully connected graph reduces to an intractable knapsack problem. [sent-144, score-0.272]

65 Our assumptions involve three approximations of the problem of 0/1 loss minimization. [sent-149, score-0.213]

66 Second, the computation of low-dimensional proxies for the 0/1 loss rather than attacking the 0/1 loss directly in the resulting discrete space. [sent-151, score-0.699]

67 Finally, the use of a graph G = (V, E) which in general will be sparse. [sent-152, score-0.132]

68 Therefore there are good reasons to believe that indeed, for linear predictors, increasing binning has diminishing returns, and after only a moderate number of bins little further improvement can be obtained. [sent-161, score-0.202]

69 Low-dimensional proxies for the 0/1 loss: This assumption can be justified using recent results stating that the margin is well preserved under random projections to low-dimensional subspaces [18, 19]. [sent-164, score-0.401]

70 For instance, Theorem 6 in [19] shows that the margin is preserved with high probability for embeddings with dimension only logarithmic in the sample size (a result similar in spirit to the Johnson-Lindenstrauss Lemma [20]). [sent-165, score-0.119]
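The flavour of such results is easy to check empirically; the short sketch below (dimensions, toy data and the 0.5 margin threshold are our own choices) projects a linearly separable dataset with a Gaussian random matrix and compares normalized margins before and after.

```python
# Quick empirical check (illustrative only) that margins survive a Gaussian
# random projection to a much lower dimension.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 1000, 40, 500                      # ambient dim, projected dim, sample size
w = rng.normal(size=d)
w /= np.linalg.norm(w)
X = rng.normal(size=(n, d))
y = np.sign(X @ w)
keep = np.abs(X @ w) > 0.5                   # keep only points with a clear margin
X, y = X[keep], y[keep]

margin_before = np.min(y * (X @ w)) / np.linalg.norm(w)

R = rng.normal(size=(d, k)) / np.sqrt(k)     # random projection
Xp, wp = X @ R, R.T @ w
margin_after = np.min(y * (Xp @ wp)) / np.linalg.norm(wp)

print(margin_before, margin_after)           # typically of the same order
```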

71 In our case, we have a graph with the nodes representing different features, and this can be seen as a patching of low-dimensional subspaces, where each subspace is defined by one of the cliques in the graph. [sent-173, score-0.298]

72 Rather, we show that even with random subgraphs, and in particular subgraphs as simple as chains, we can obtain models that have high accuracy and remarkable robustness to high degrees of label noise. [sent-175, score-0.203]

73 The second observation is that nothing prevents us from using quite dense graphs and seeking approximate rather than exact MAP inference, say through LP relaxations [9]. [sent-176, score-0.255]

74 For both algorithms, the only hyperparameter is the tradeoff between the loss and the regularization term. [sent-183, score-0.156]

75 The number of bins used for discretization may affect the accuracy of DISCRETE. [sent-185, score-0.183]

76 In the first experiment, we test the robustness of different methods to increasing label noise. [sent-189, score-0.152]

77 For DISCRETE, we used as the graph G a random chain. [sent-193, score-0.132]

78 In this case, optimization is straightforward via a Viterbi algorithm: a sequence of matrix-vector multiplications in the (min, +) semiring with trivial bookkeeping and subsequent lookup, which will run in O(B²D) since we have B states per variable and D variables. [sent-196, score-0.286]
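For completeness, here is a self-contained sketch of that dynamic program (toy cost tables and illustrative names, not the authors' code): given unary cost tables and pairwise cost tables along a chain of D variables with B bins each, a forward (min, +) pass plus backtracking returns the minimizing bin assignment in O(B²D).

```python
# Sketch of the (min, +) Viterbi pass for a chain-structured MAP problem:
# D discretized parameters, B bins each, unary costs `phi` and pairwise costs
# `psi` along the chain.  Toy tables; illustrative names only.
import numpy as np

rng = np.random.default_rng(0)
D, B = 6, 5
phi = rng.integers(0, 10, size=(D, B)).astype(float)        # phi[i, p]: unary cost
psi = rng.integers(0, 10, size=(D - 1, B, B)).astype(float) # psi[i, p, q]: edge (i, i+1)

def chain_map(phi, psi):
    """Return (min cost, argmin bin assignment) for a chain-factored cost,
    via a forward (min, +) recursion and backtracking; runs in O(B^2 D)."""
    D, B = phi.shape
    m = phi[0].copy()                       # m[p]: best cost of a prefix ending in bin p
    back = np.zeros((D, B), dtype=int)
    for i in range(1, D):
        # cand[p, q]: best prefix ending at i-1 in bin p, plus edge and unary at i in bin q
        cand = m[:, None] + psi[i - 1] + phi[i][None, :]
        back[i] = np.argmin(cand, axis=0)
        m = np.min(cand, axis=0)
    best = np.zeros(D, dtype=int)
    best[-1] = int(np.argmin(m))
    for i in range(D - 1, 0, -1):
        best[i - 1] = back[i, best[i]]
    return float(np.min(m)), best

cost, assignment = chain_map(phi, psi)
print(cost, assignment)
```

Swapping (min, +) for (max, ×) in the same recursion gives the usual max-product message passing; the (min, +) form used here directly minimizes the summed potentials.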

79 , when the hinge loss is a good proxy for the 0/1 loss). [sent-201, score-0.205]

80 However, as soon as a significant amount of label noise is present, SVM degrades substantially while DISCRETE remains remarkably stable, delivering high accuracy even after flipping labels with 40% probability. [sent-202, score-0.154]

81 Note in particular how this differs from continuous optimization settings in which the analysis is in terms of rate of convergence rather than the precise number of discrete operations performed. [sent-204, score-0.458]

82 A natural question to ask is therefore how more complex graph topologies would perform. [sent-209, score-0.132]

83 a random junction tree with cliques {i, i + 1, i + 2} and a random k-regular graph, where k is set such that the resulting graph has 10% of the possible edges. [sent-212, score-0.298]

84 For the 2-chain, the optimization algorithm is exact inference via (min, +) message-passing, just as the Viterbi algorithm, but now applied to a larger clique, which increases the memory and runtime cost by a factor of O(B). [sent-213, score-0.279]

85 In our experiments we used the approximate inference algorithm from [22], which optimally and efficiently solves an LP relaxation via the alternating direction method of multipliers (ADMM) [23]. [sent-215, score-0.129]

86 Table 1: Datasets used for the experiments in Figure 1
             GISETTE   MNIST   A2A     USPS   ISOLET   ACOUSTIC
# Train      6000      10205   2265    950    480      19705
# Test       1000      1134    30296   237    120      78823
# Features   5000      784     123     256    617      50
Table 2: Error rates of different methods for binary classification, without label noise. [sent-216, score-0.136]

87 In this setting, the hinge loss used by SVM is an excellent proxy for the 0/1 loss. [sent-217, score-0.205]

88 In section 6 we only sketched the reasons why we pursued the assumptions laid out in this paper. [sent-241, score-0.133]

89 The extension to multi-class and structured prediction, as well as other learning settings is an open problem. [sent-247, score-0.124]

90 The problem of selecting a graph topology in an informative way is highly relevant and is left open. [sent-253, score-0.132]

91 For example B-matching can be used to generate an informative regular graph [24]. [sent-254, score-0.132]

92 It would be interesting to consider nonparametric models, where the discretization occurs at the level of parameters associated with each training instance (as in the dual formulation of SVMs). [sent-259, score-0.223]

93 9 Conclusion We presented a discrete formulation for learning linear binary classifiers. [sent-260, score-0.404]

94 Parameters associated with features of the linear model are discretized into bins, and low-dimensional discrete surrogates of the 0/1 loss restricted to small groups of features are constructed. [sent-261, score-0.675]

95 Servedio, “Random classification noise defeats all convex potential boosters,” Machine Learning, vol. [sent-279, score-0.155]

96 Zabih, “What energy functions can be minimized via graph cuts? [sent-305, score-0.183]

97 Jaakkola, “Approximate inference using planar graph decomposition,” in Advances in Neural Information Processing Systems 19 (B. [sent-312, score-0.269]

98 Bach, “Structured sparsity-inducing norms through submodular functions,” in NIPS, pp. [sent-351, score-0.106]

99 van den Hengel, “Is margin preserved after random projection? [sent-389, score-0.119]

100 Globerson, “An alternating direction method for dual MAP LP relaxation,” in Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II, ECML PKDD’11, (Berlin, Heidelberg), pp. [sent-403, score-0.168]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('discrete', 0.25), ('fc', 0.184), ('surrogates', 0.173), ('cliques', 0.166), ('loss', 0.156), ('clique', 0.146), ('svms', 0.14), ('proxies', 0.137), ('runtime', 0.132), ('graph', 0.132), ('option', 0.124), ('discretization', 0.112), ('formulation', 0.111), ('formulations', 0.11), ('graphs', 0.11), ('lp', 0.101), ('continuous', 0.101), ('erent', 0.095), ('convex', 0.094), ('label', 0.093), ('ectively', 0.092), ('gisette', 0.092), ('sydney', 0.087), ('knapsack', 0.087), ('multiplications', 0.087), ('svm', 0.086), ('regularizer', 0.086), ('inference', 0.084), ('noiseless', 0.083), ('lookup', 0.082), ('predictors', 0.081), ('reasons', 0.076), ('matrixvector', 0.075), ('potetz', 0.075), ('predictability', 0.075), ('acoustic', 0.073), ('bins', 0.071), ('graphical', 0.071), ('usps', 0.069), ('australia', 0.069), ('xc', 0.067), ('isolet', 0.067), ('map', 0.067), ('structured', 0.067), ('servedio', 0.067), ('preserved', 0.066), ('submodular', 0.064), ('optimization', 0.063), ('classi', 0.063), ('canberra', 0.061), ('semiring', 0.061), ('singletons', 0.061), ('noise', 0.061), ('di', 0.061), ('predictor', 0.061), ('elementary', 0.06), ('estimators', 0.06), ('robustness', 0.059), ('maybe', 0.058), ('open', 0.057), ('really', 0.057), ('assumptions', 0.057), ('mnist', 0.056), ('scalings', 0.055), ('binning', 0.055), ('subspaces', 0.055), ('ect', 0.055), ('relaxations', 0.054), ('connected', 0.053), ('chains', 0.053), ('margin', 0.053), ('planar', 0.053), ('viterbi', 0.053), ('functions', 0.051), ('boosting', 0.051), ('crfs', 0.051), ('jebara', 0.051), ('subgraphs', 0.051), ('composing', 0.049), ('decouple', 0.049), ('hinge', 0.049), ('features', 0.048), ('australian', 0.047), ('subgradients', 0.047), ('logistic', 0.047), ('optimizer', 0.046), ('nothing', 0.046), ('globerson', 0.046), ('unary', 0.046), ('relaxation', 0.045), ('randomization', 0.045), ('prevents', 0.045), ('sparsity', 0.044), ('argument', 0.044), ('operations', 0.044), ('ict', 0.044), ('xn', 0.043), ('binary', 0.043), ('norms', 0.042), ('regularizers', 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

2 0.17077862 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lovász ϑ function of a graph, a fundamental tool in combinatorial optimization and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lovász ϑ function is equivalent to a kernel learning problem related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a planted clique of size Θ(√n) in a random graph G(n, 1/2). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM − ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1

3 0.15867367 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Aaron Defazio, Tibério S. Caetano

Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.

4 0.15682855 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

Author: Po-ling Loh, Martin J. Wainwright

Abstract: We investigate a curious relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. Based on our population-level results, we show how the graphical Lasso may be used to recover the edge structure of certain classes of discrete graphical models, and present simulations to verify our theoretical results. 1

5 0.1504382 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

6 0.14682251 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

7 0.14677058 16 nips-2012-A Polynomial-time Form of Robust Regression

8 0.1417647 227 nips-2012-Multiclass Learning with Simplex Coding

9 0.14113028 204 nips-2012-MAP Inference in Chains using Column Generation

10 0.13021903 200 nips-2012-Local Supervised Learning through Space Partitioning

11 0.12820809 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

12 0.12353658 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

13 0.11800414 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

14 0.11605185 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

15 0.10809717 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

16 0.10377329 361 nips-2012-Volume Regularization for Binary Classification

17 0.10096862 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

18 0.099266768 304 nips-2012-Selecting Diverse Features via Spectral Regularization

19 0.098969363 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

20 0.098142132 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.309), (1, 0.06), (2, 0.057), (3, -0.118), (4, 0.066), (5, 0.012), (6, -0.009), (7, 0.039), (8, -0.165), (9, 0.003), (10, 0.024), (11, 0.049), (12, 0.073), (13, 0.153), (14, 0.103), (15, -0.079), (16, -0.059), (17, -0.081), (18, -0.036), (19, 0.115), (20, 0.084), (21, 0.024), (22, -0.076), (23, 0.051), (24, -0.035), (25, 0.045), (26, -0.096), (27, -0.044), (28, -0.044), (29, -0.038), (30, -0.004), (31, 0.046), (32, -0.039), (33, 0.08), (34, 0.025), (35, -0.046), (36, 0.096), (37, 0.112), (38, 0.058), (39, -0.023), (40, 0.006), (41, 0.027), (42, 0.01), (43, -0.029), (44, 0.045), (45, 0.053), (46, -0.054), (47, 0.035), (48, -0.021), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96016699 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

2 0.74613833 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

3 0.73918951 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lovász ϑ function of a graph, a fundamental tool in combinatorial optimization and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lovász ϑ function is equivalent to a kernel learning problem related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a planted clique of size Θ(√n) in a random graph G(n, 1/2). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM − ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1

4 0.72213286 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. 1

5 0.65233541 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

Author: Abner Guzmán-rivera, Dhruv Batra, Pushmeet Kohli

Abstract: We address the problem of generating multiple hypotheses for structured prediction tasks that involve interaction with users or successive components in a cascaded architecture. Given a set of multiple hypotheses, such components/users typically have the ability to retrieve the best (or approximately the best) solution in this set. The standard approach for handling such a scenario is to first learn a single-output model and then produce M -Best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we learn to produce multiple outputs by formulating this task as a multiple-output structured-output prediction problem with a loss-function that effectively captures the setup of the problem. We present a max-margin formulation that minimizes an upper-bound on this lossfunction. Experimental results on image segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this type of scenario and leads to substantial improvements in prediction accuracy. 1

6 0.65066671 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

7 0.63991916 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

8 0.63611728 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

9 0.62951535 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

10 0.62708443 197 nips-2012-Learning with Recursive Perceptual Representations

11 0.62434381 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

12 0.61067462 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

13 0.60060817 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.59642541 279 nips-2012-Projection Retrieval for Classification

15 0.59577578 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

16 0.57534236 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

17 0.57155579 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data

18 0.57036167 16 nips-2012-A Polynomial-time Form of Robust Regression

19 0.56749499 217 nips-2012-Mixability in Statistical Learning

20 0.56565922 280 nips-2012-Proper losses for learning from partial labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.045), (21, 0.036), (27, 0.18), (38, 0.17), (39, 0.012), (42, 0.039), (54, 0.036), (55, 0.021), (74, 0.07), (76, 0.131), (80, 0.126), (92, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9697746 39 nips-2012-Analog readout for optical reservoir computers

Author: Anteo Smerieri, François Duport, Yvon Paquot, Benjamin Schrauwen, Marc Haelterman, Serge Massar

Abstract: Reservoir computing is a new, powerful and flexible machine learning technique that is easily implemented in hardware. Recently, by using a timemultiplexed architecture, hardware reservoir computers have reached performance comparable to digital implementations. Operating speeds allowing for real time information operation have been reached using optoelectronic systems. At present the main performance bottleneck is the readout layer which uses slow, digital postprocessing. We have designed an analog readout suitable for time-multiplexed optoelectronic reservoir computers, capable of working in real time. The readout has been built and tested experimentally on a standard benchmark task. Its performance is better than non-reservoir methods, with ample room for further improvement. The present work thereby overcomes one of the major limitations for the future development of hardware reservoir computers. 1

2 0.92696089 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions

Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson

Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1

3 0.91498554 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

Author: Liva Ralaivola

Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1

same-paper 4 0.89353585 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or on non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

5 0.86556089 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

6 0.82420319 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.8212983 348 nips-2012-Tractable Objectives for Robust Policy Optimization

8 0.81508601 65 nips-2012-Cardinality Restricted Boltzmann Machines

9 0.81331027 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.8126238 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

11 0.81223744 292 nips-2012-Regularized Off-Policy TD-Learning

12 0.81201649 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

13 0.81157267 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

14 0.80989015 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

15 0.80970359 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

16 0.80931377 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

17 0.80845529 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

18 0.80809832 227 nips-2012-Multiclass Learning with Simplex Coding

19 0.80747092 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

20 0.80744219 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes