jmlr jmlr2013 jmlr2013-21 knowledge-graph by maker-knowledge-mining

21 jmlr-2013-Classifier Selection using the Predicate Depth


Source: pdf

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 To answer this question we extend the notions of depth and the multivariate median, that are commonly used in multivariate statistics (Liu et al. [sent-44, score-0.454]

2 The depth function measures the centrality of a point in a sample or a distribution. [sent-46, score-0.481]

3 For example, if Q is a probability measure over R^d, the Tukey depth for a point x ∈ R^d, also known as the half-space depth (Tukey, 1975), is defined as TukeyDepth_Q(x) = inf_{H: x ∈ H, H a half-space} Q(H). (2) [sent-47, score-0.957]

4 That is, the depth of a point x is the minimal measure of a half-space that contains it. [sent-50, score-0.486]
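
As a concrete illustration (a minimal sketch, not code from the paper): in one dimension every half-space is a ray, so the empirical Tukey depth of a point is simply the smaller of its two tail fractions.

```python
# Minimal illustration (not the paper's code): empirical half-space depth
# in one dimension, where every half-space containing x is a ray.
def tukey_depth_1d(x, samples):
    n = len(samples)
    left = sum(1 for s in samples if s <= x) / n   # mass of the ray (-inf, x]
    right = sum(1 for s in samples if s >= x) / n  # mass of the ray [x, +inf)
    return min(left, right)

print(tukey_depth_1d(0.0, [-2.0, -1.0, 0.0, 1.0, 3.0]))  # prints 0.6
```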

5 2 The Tukey depth also has a minimum entropy interpretation: each hyperplane containing x defines a Bernoulli distribution by splitting the distribution Q in two. [sent-51, score-0.473]

6 The Tukey depth is then the probability mass of the halfspace on the side of the hyperplane with the lowest such mass. [sent-53, score-0.5]

7 The depth function is thus a measure of centrality. [sent-54, score-0.474]

8 In this work we extend Tukey’s definition beyond half spaces and define a depth for any hypothesis class which we call the predicate depth. [sent-57, score-0.726]

9 Hence, the median predicate hypothesis, or predicate median, has the best generalization guarantee. [sent-59, score-0.473]

10 We present algorithms for approximating the predicate depth and the predicate median. [sent-60, score-0.819]

11 Since the Tukey depth is a special case of the predicate depth, our algorithms provide polynomial approximations to the Tukey depth and the Tukey median as well. [sent-61, score-1.187]

12 The depth of the function f on the instance x with respect to the measure Q. [sent-78, score-0.474]

13 The depth of the function f with respect to the measure Q. [sent-79, score-0.474]

14 The empirical depth of f on the instance x with respect to the sample T. The empirical depth of f with respect to the samples T and S. [sent-81, score-0.949]

15 Table 1: A summary of the notation used in this work. In this special case, the average hypothesis has a depth of at least 1/e, independent of the dimension. [sent-84, score-0.535]

16 We address both the issues of approximating depth and of approximating the deepest hypothesis, that is, the predicate median. [sent-89, score-0.687]

17 The Predicate Depth: Definitions and Properties. In this study, unlike Tukey, who used the depth function on the instance space, we view the depth function as operating on the dual space, that is, the space of classification functions. [sent-92, score-0.908]

18 The depth function measures the agreement of the function f with the weighted majority vote on x. [sent-94, score-0.495]

19 The predicate depth of f on the instance x ∈ X with respect to Q is defined as D_Q(f | x) = Pr_{g∼Q}[g(x) = f(x)]. [sent-97, score-0.631]

20 The predicate depth of f with respect to Q is defined as D_Q(f) = inf_{x ∈ X} D_Q(f | x) = inf_{x ∈ X} Pr_{g∼Q}[g(x) = f(x)]. [sent-98, score-0.661]

21 The δ-insensitive depth of f with respect to Q and µ is defined as D_Q^{δ,µ}(f) = sup_{X' ⊆ X, µ(X') ≤ δ} inf_{x ∈ X \ X'} D_Q(f | x). [sent-108, score-0.484]

22 The δ-insensitive depth function relaxes the infimum in the depth definition. [sent-109, score-0.908]

23 Instead of requiring that the function f always have a large agreement in the class F , the δ-insensitive depth makes this requirement on all but a set of the instances with probability mass δ. [sent-110, score-0.507]
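
One plausible way to make this relaxation operational (a hedged sketch, not the authors' code) is to replace the infimum over all instances with the empirical δ-quantile of the per-instance depths D_Q(f | x_i):

```python
import math

def delta_insensitive_depth(per_instance_depths, delta):
    """Approximate the delta-insensitive depth from sampled instances.

    per_instance_depths: values D_Q(f | x_i) for sampled instances x_i.
    delta: fraction of (worst) instances that may be ignored before taking
    the minimum, mirroring the excluded set X' of mass at most delta.
    """
    vals = sorted(per_instance_depths)
    k = math.floor(delta * len(vals))  # how many of the smallest values to skip
    return vals[k]

# Ignoring the worst 10% of instances lifts the depth from 0.05 to 0.40 here.
print(delta_insensitive_depth(
    [0.05, 0.40, 0.55, 0.60, 0.62, 0.70, 0.75, 0.80, 0.90, 0.95], 0.1))
```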

24 Theorem 4 bounds the ratio of the generalization error of the Gibbs classifier to the generalization error of a given classifier as a function of the depth of the given classifier. [sent-127, score-0.498]

25 By definition, the depth of this classifier is at least one half; thus Theorem 4 recovers the well-known result that the generalization error of the Bayes optimal classifier is at most twice as large as the generalization error of the Gibbs classifier. [sent-129, score-0.488]

26 The generalization bounds theorem (Theorem 5) shows that if a deep function exists, then it is expected to generalize well, provided that the PAC-Bayes bound for Q is sufficiently smaller than the depth of f . [sent-142, score-0.536]

27 Our goal is to show that deep functions exist and that the Tukey depth is a special case of the predicate depth. [sent-148, score-0.666]

28 The Mean Voter Theorem of Caplin and Nalebuff (1991) shows that if f is the center of gravity of Q then for every x, DQ ( f | x) ≥ 1/e and thus the center of gravity of Q has a depth of at least 1/e (e is Euler’s number). [sent-160, score-0.511]

29 Hence, in all these cases, the median, that is the deepest point, has a depth of at least 1/e. [sent-164, score-0.488]

30 We now turn to show that the Tukey depth is a special case of the predicate depth. [sent-169, score-0.631]

31 The following shows that for any f ∈ F , the Tukey-depth of f and the predicate depth of f are the same. [sent-173, score-0.631]

32 The Tukey depth of f is the infimum measure of half-spaces that contain f: TukeyDepth_Q(f) = inf_{H: f ∈ H} Q(H) = inf_{x: f(x)=1} Q({g : g(x) = 1}) = inf_{x: f(x)=1} D_Q(f | x) ≥ D_Q(f). [sent-182, score-0.519]

33 Hence, the Tukey depth cannot be larger than the predicate depth. [sent-183, score-0.631]

34 Next we show that the Tukey depth cannot be smaller than the predicate depth: if it were, there would exist a half-space H whose measure is smaller than the predicate depth. [sent-184, score-2.378]

35 Therefore, the Tukey depth can be neither smaller nor larger than the predicate depth and so the two must be equal. [sent-187, score-1.085]

36 If d is the depth of the median for Q then breakdown(median, Q) ≥ (d − p)/2. [sent-210, score-0.652]

37 The Maximum A Posteriori Estimator. So far, we have introduced the predicate depth and median and we have analyzed their properties. [sent-274, score-0.733]

38 Inputs: a sample { f_1, …, f_n } such that f_j ∈ F, and a function f. Output: D̂_S^T(f), an approximation for the depth of f. [sent-333, score-0.512]

39 In this section we focus on algorithms that measure the depth of functions. [sent-347, score-0.474]

40 The main results are an efficient algorithm for approximating the depth uniformly over the entire function class and an algorithm for approximating the median. [sent-348, score-0.49]

41 We suggest a straightforward method to measure the depth of a function. [sent-349, score-0.474]

42 The depth on a point x_i is estimated by counting the fraction of the functions f_1, …, f_n that agree with f on x_i. [sent-363, score-0.498]

43 Since samples are used to estimate depth, we call the value returned by this algorithm, D̂_S^T(f), the empirical depth of f. [sent-367, score-0.49]
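
Operationally, the depth-measure procedure can be sketched as follows (an illustration of the idea, not the authors' implementation): given classifiers f_1, …, f_n sampled from Q and instances x_1, …, x_u sampled from the data distribution, the empirical depth of f is the minimum, over the sampled instances, of the fraction of sampled classifiers that agree with f.

```python
def empirical_predicate_depth(f, function_sample, instance_sample):
    """Estimate the predicate depth of classifier f from samples.

    function_sample: classifiers f_1, ..., f_n drawn (approximately) from Q.
    instance_sample: instances x_1, ..., x_u drawn from the data distribution.
    Returns the minimum over instances of the fraction of sampled classifiers
    that agree with f, i.e., the empirical depth of f.
    """
    depth = 1.0
    for x in instance_sample:
        agree = sum(1 for g in function_sample if g(x) == f(x))
        depth = min(depth, agree / len(function_sample))
    return depth

# Toy usage with threshold classifiers on the real line (hypothetical example).
fs = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
xs = [-2.0, -0.25, 0.75, 2.0]
print(empirical_predicate_depth(fs[2], fs, xs))  # prints 0.6
```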

44 The following theorem shows that if the x_i's are sampled from the underlying distribution over X, and the f_j's are sampled from Q, then the empirical depth is a good estimator of the true depth. [sent-369, score-0.54]

45 Theorem 14 (Uniform convergence of depth). Let Q be a probability measure on F and let µ be a probability measure on X. [sent-373, score-0.543]

46 For every f we have that D̂_S^T(f) ≥ D_Q(f) ≥ D̂_S^T(f) − ε, where D̂_S^T(f) is the empirical depth computed by the depth-measure algorithm. [sent-431, score-0.964]

47 From the definition of the depth on the point x_i, it follows that D_Q(f^y | x_i) = D_Q(f^{y′} | x_i) if and only if f^y(x_i) = f^{y′}(x_i). [sent-436, score-0.562]

48 We have seen that if the samples S and T are large enough then with high probability the estimated depth is accurate uniformly for all functions f ∈ F . [sent-451, score-0.471]

49 A function that has large empirical depth will agree with the majority vote on these points. [sent-457, score-0.508]

50 Outputs: a function f ∈ F which approximates the predicate median, together with its depth estimation D̂_S^T(f). [sent-467, score-0.641]

51 If i* = 1, return f and depth D = 1 − q_1; else return f and depth D = q_{i*−1}. [sent-487, score-0.93]

52 Even if f disagrees with the majority vote on some instances, the empirical depth will be higher if these points are such that the majority vote on them wins by a small margin. [sent-488, score-0.508]

53 The algorithm will always terminate and return a function f ∈ F and an empirical depth D. [sent-506, score-0.478]

54 Lemma 20: The MA algorithm will always return a hypothesis f and a depth D̂. Proof: It is sufficient to show that the binary search will always find i* ≤ u. [sent-516, score-0.546]

55 Lemma 21: Let f̂ be the hypothesis that MA returned and let D̂ be the depth returned. [sent-522, score-0.573]

56 D̂_S^T(g), the estimated depth of g, is a function of Y(g), given by D̂_S^T(g) = min( min_{i ∈ Y(g)} (1 − q_i), min_{i ∉ Y(g)} q_i ). [sent-525, score-0.526]

57 In the proof of Lemma 21 we have seen that the empirical depth of a function is a function of the set of points on which it agrees with the majority vote. [sent-545, score-0.502]

58 If i∗ = 1 then the empirical depth of f is the maximum possible. [sent-548, score-0.467]
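
As a simplified stand-in for the MA procedure described above (which runs a binary search over the sorted majority-vote margins q_i), the sketch below simply scores every candidate hypothesis by its empirical depth and returns the deepest one; this is a naive illustration of what finding the predicate median means operationally, not the paper's algorithm.

```python
def approx_predicate_median(candidates, function_sample, instance_sample):
    """Naive stand-in for the MA algorithm (illustration only): return the
    candidate hypothesis with the largest empirical predicate depth."""
    def emp_depth(f):
        return min(sum(g(x) == f(x) for g in function_sample) / len(function_sample)
                   for x in instance_sample)
    return max(candidates, key=emp_depth)

# Toy usage with threshold classifiers on the real line (hypothetical example).
fs = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
xs = [-2.0, -0.25, 0.75, 2.0]
deepest = approx_predicate_median(fs, fs, xs)
print(deepest(-0.25), deepest(2.0))  # the deepest candidate thresholds at 0 here
```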

59 A closer look at the MA algorithm and the depth estimation algorithm reveals that these algorithms use the sample of functions in order to estimate the marginal Q[Y = 1 | X = x] = Pr_{g∼Q}[g(x) = 1]. [sent-579, score-0.469]

60 Note that computing D̂_T(f | x) such that it is close to D_Q(f | x) is sufficient both for estimating the depth using the depth-measure algorithm (Algorithm 1) and for finding the approximated median using the MA algorithm (Algorithm 2). [sent-589, score-1.03]

61 Therefore, in this section we will focus only on computing the empirical conditional depth D̂_T(f | x). [sent-590, score-0.467]

62 Earlier we showed how to cast the Tukey depth as a special case of the predicate depth. [sent-604, score-0.631]

63 We can use this reduction to apply Algorithm 1 and Algorithm 2 to compute the Tukey depth and approximate the Tukey median, respectively. [sent-605, score-0.556]

64 Algorithm 3 shows how to use these samples to estimate the Tukey depth of a point f ∈ Rd . [sent-613, score-0.466]

65 However, for the special case of the linear functions we study here, it is possible to find the minimal depth over all possible selections of x_i^θ once x_i^v is fixed. [sent-621, score-0.493]

66 Inputs: a sample { f_1, …, f_n } such that f_j ∈ R^d, and a point f ∈ R^d. Output: D̂_S^T(f), an approximation for the depth of f. [sent-638, score-0.524]
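
In the same spirit, the sketch below (an illustration, not the authors' Algorithm 3) estimates the Tukey depth of a point f ∈ R^d with respect to a sample of points drawn from Q: each random unit direction plays the role of an instance, and along a fixed direction the minimal-mass half-space containing f is found by taking the smaller tail of the projected sample at f, which mirrors minimizing over the threshold x^θ once the direction x^v is fixed. The minimum over sampled directions upper-bounds the true depth.

```python
import numpy as np

def projected_tukey_depth(f, points, num_directions=200, seed=0):
    """Hypothetical sketch: estimate (from above) the Tukey depth of point f
    with respect to the empirical measure on `points` by projecting onto
    random unit directions and taking the smaller tail mass at f in each."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    f = np.asarray(f, dtype=float)
    d = points.shape[1]
    depth = 1.0
    for _ in range(num_directions):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        proj = points @ v
        fv = float(f @ v)
        depth = min(depth,
                    float(np.mean(proj >= fv)),  # mass of {g : g.v >= f.v}
                    float(np.mean(proj <= fv)))  # mass of {g : g.v <= f.v}
    return depth

# Toy usage: the mean of a Gaussian cloud should come out comparatively deep.
pts = np.random.default_rng(1).normal(size=(500, 3))
print(projected_tukey_depth(pts.mean(axis=0), pts))
```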

67 Outputs: a point f ∈ R^d and its depth estimation D̂_S^T(f). [sent-650, score-0.466]

68 Therefore, we can start by seeking f with the highest possible depth in every direction. [sent-669, score-0.477]

69 The definition of depth used therein is closer in spirit to the simplicial depth in the multivariate case (Liu, 1990). [sent-682, score-0.908]

70 For a point x in R^d and a measure ν over R^d × R, the depth of x with respect to ν is defined to be D_ν(x) = (1 + sup_{‖u‖=1} φ(u, x, ν))^{−1}, where φ(u, x, ν) ≡ |u·x − µ(ν|_x)| / σ(ν|_x), µ is a measure of dislocation, and σ is a measure of scaling. [sent-685, score-0.541]

71 The functional depth we present in this work can be presented in a similar form by defining, for a function f, D(f, ν) = (1 + sup_{x∈X} φ(f, x, ν))^{−1}, where φ(f, x, ν) = |f(x) − E_{g∼ν}[g(x)]|. [sent-686, score-0.469]

72 Fraiman and Muniz (2001) introduced an extension of univariate depth to function spaces. [sent-687, score-0.467]

73 For a real function f , the depth of f is defined to be Ex [D ( f (x))] where D (·) is the univariate depth function. [sent-688, score-0.921]

74 At inference time, the depth of a point x is computed with respect to each of the samples. [sent-697, score-0.466]

75 The regression-based depth function is defined to be ∆(α, β) = (π_+ / n_+) Σ_{i: y_i = 1} I[α·x_i + β > 0] + (π_− / n_−) Σ_{i: y_i = −1} I[α·x_i + β < 0], where π_+ and π_− are positive scalars that sum to one. [sent-708, score-0.518]
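
A direct transcription of this formula into Python (a hedged sketch with made-up toy data; π_+ and π_− are left as free parameters that default to 1/2):

```python
def regression_based_depth(alpha, beta, xs, ys, pi_plus=0.5, pi_minus=0.5):
    """Depth of the linear rule sign(alpha . x + beta), computed as the weighted
    fractions of positive points classified positive and negative points
    classified negative, following the displayed formula."""
    def dot(a, x):
        return sum(ai * xi for ai, xi in zip(a, x))
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    hit_pos = sum(1 for x in pos if dot(alpha, x) + beta > 0)
    hit_neg = sum(1 for x in neg if dot(alpha, x) + beta < 0)
    return pi_plus * hit_pos / len(pos) + pi_minus * hit_neg / len(neg)

# Toy usage in two dimensions (hypothetical data).
xs = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.5), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
print(regression_based_depth((1.0, 0.0), 0.0, xs, ys))  # prints 1.0
```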

76 The authors showed that as the sample size goes to infinity, the maximal depth classifier is the optimal linear classifier. [sent-710, score-0.469]

77 (2004) used the Tukey depth to analyze the generalization performance of the Bayes Point Machine (Herbrich et al. [sent-714, score-0.471]

78 However, the definition of the Bayes depth therein compares the generalization error of a hypothesis to the Bayes classifier in a way that does not allow the use of the PAC-Bayes theory to bound the gap between the empirical error and the generalization error. [sent-717, score-0.582]

79 We say that the function f ∈ F has depth zero (“non-fit” in Rousseeuw and Hubert, 1999) if there exists g ∈ F that is strictly better than f on every point in S. [sent-723, score-0.5]

80 An Axiomatic Definition of Depth. Most of the applications of depth for classification define the depth of a classifier with respect to a given sample. [sent-732, score-0.908]

81 However, in this study we define the depth of a function with respect to a probability measure over the function class. [sent-735, score-0.491]

82 In our setting, we require that if there is a symmetry group acting on the hypothesis class then the same symmetry group acts on the depth function too. [sent-739, score-0.549]

83 The final requirement is that the depth function is not trivial, that is, that the depth is not a constant value. [sent-743, score-0.908]

84 It is a simple exercise to verify that the predicate depth meets all of the above requirements. [sent-755, score-0.631]

85 Methods for Computing the Tukey Median. Part of the contribution of this work is the proposal of algorithms for approximating the predicate depth and the predicate median. [sent-757, score-0.819]

86 The Tukey depth is a special case of the predicate depth and therefore we survey the existing literature for computing the Tukey median here. [sent-758, score-1.187]

87 The empirical Tukey depth is the Tukey depth function when it is applied to an empirical measure sampled from the true distribution of interest. [sent-763, score-0.954]

88 He showed that the empirical depth converges uniformly to the true depth with probability one. [sent-764, score-0.938]

89 They proposed picking k random directions and computing the univariate depth of each candidate point for each of the k directions. [sent-767, score-0.479]

90 They defined the random Tukey depth for a given point to be the minimum univariate depth of this point with respect to the k random directions. [sent-768, score-0.945]

91 Note that the empirical depth of Massé (2002) and the random Tukey depth of Cuesta-Albertos and Nieto-Reyes (2008) are different quantities. [sent-771, score-0.921]

92 In the empirical depth, when evaluating the depth of a point x, one considers every possible hyperplane and evaluates the measure of the corresponding half-space using only a sample. [sent-772, score-0.541]

93 (1996) presented an algorithm for finding a point with depth Ω (1/d 2 ) in polynomial time. [sent-783, score-0.466]

94 When the distribution is logconcave, there exists a point with depth 1/e, independent of the dimension (Caplin and Nalebuff, 1991). [sent-785, score-0.477]

95 Moreover, for any distribution there is a point with a depth of at least 1/(d+1) (Carathéodory's theorem). [sent-786, score-0.466]

96 Using the same posterior in Theorem 5 will yield inferior results, since there is an extra penalty factor of 2 due to the 1/2 depth of the center of the Gaussian. [sent-810, score-0.49]

97 To address this challenge, we defined a depth function for classifiers, the predicate depth, and showed that the generalization of a classifier is tied to its predicate depth. [sent-816, score-0.825]

98 We analyzed the breakdown properties of the median and showed it is related to the depth as well. [sent-818, score-0.652]

99 We presented efficient algorithms for uniformly measuring the predicate depth and for finding the predicate median. [sent-821, score-0.808]

100 Since the Tukey depth is a special case of the depth presented here, it also follows that the Tukey depth and the Tukey median can be approximated in polynomial time by our algorithms. [sent-822, score-1.464]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dq', 0.754), ('depth', 0.454), ('tukey', 0.292), ('predicate', 0.177), ('median', 0.102), ('breakdown', 0.096), ('hypothesis', 0.081), ('ds', 0.081), ('ilad', 0.079), ('urges', 0.079), ('epth', 0.067), ('redicate', 0.067), ('est', 0.067), ('fn', 0.058), ('pr', 0.04), ('dt', 0.04), ('xiv', 0.039), ('posterior', 0.036), ('qi', 0.036), ('deep', 0.035), ('germain', 0.034), ('zh', 0.034), ('deepest', 0.034), ('eg', 0.034), ('er', 0.032), ('xi', 0.032), ('xu', 0.032), ('classi', 0.031), ('gibbs', 0.031), ('bayes', 0.03), ('map', 0.03), ('vc', 0.029), ('rd', 0.027), ('ma', 0.027), ('ghosh', 0.024), ('maximizer', 0.024), ('returned', 0.023), ('every', 0.023), ('ambroladze', 0.022), ('fnk', 0.022), ('zuo', 0.022), ('estimator', 0.021), ('vote', 0.021), ('theorem', 0.02), ('measure', 0.02), ('majority', 0.02), ('hyperplane', 0.019), ('rs', 0.019), ('probability', 0.017), ('posteriori', 0.017), ('caplin', 0.017), ('fraiman', 0.017), ('gravity', 0.017), ('hampel', 0.017), ('tukeydepthq', 0.017), ('generalization', 0.017), ('belief', 0.016), ('inf', 0.015), ('sample', 0.015), ('sup', 0.015), ('agrees', 0.015), ('let', 0.015), ('yi', 0.014), ('foreach', 0.014), ('class', 0.014), ('hypotheses', 0.013), ('univariate', 0.013), ('ser', 0.013), ('disagrees', 0.013), ('rousseeuw', 0.013), ('empirical', 0.013), ('instances', 0.012), ('point', 0.012), ('xv', 0.012), ('approximating', 0.011), ('billor', 0.011), ('cuevas', 0.011), ('medians', 0.011), ('muniz', 0.011), ('nalebuff', 0.011), ('prg', 0.011), ('stimator', 0.011), ('voter', 0.011), ('welzl', 0.011), ('return', 0.011), ('nition', 0.011), ('qn', 0.011), ('exists', 0.011), ('chaudhuri', 0.011), ('kl', 0.011), ('bounds', 0.01), ('nding', 0.01), ('haussler', 0.01), ('oracle', 0.01), ('select', 0.01), ('ex', 0.01), ('mum', 0.01), ('approximates', 0.01), ('mass', 0.01), ('clarkson', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 21 jmlr-2013-Classifier Selection using the Predicate Depth

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

2 0.049448848 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

3 0.045578156 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone

Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction

4 0.043095108 78 jmlr-2013-On the Learnability of Shuffle Ideals

Author: Dana Angluin, James Aspnes, Sarah Eisenstat, Aryeh Kontorovich

Abstract: PAC learning of unrestricted regular languages is long known to be a difficult problem. The class of shuffle ideals is a very restricted subclass of regular languages, where the shuffle ideal generated by a string u is the collection of all strings containing u as a subsequence. This fundamental language family is of theoretical interest in its own right and provides the building blocks for other important language families. Despite its apparent simplicity, the class of shuffle ideals appears quite difficult to learn. In particular, just as for unrestricted regular languages, the class is not properly PAC learnable in polynomial time if RP = NP, and PAC learning the class improperly in polynomial time would imply polynomial time algorithms for certain fundamental problems in cryptography. In the positive direction, we give an efficient algorithm for properly learning shuffle ideals in the statistical query (and therefore also PAC) model under the uniform distribution. Keywords: PAC learning, statistical queries, regular languages, deterministic finite automata, shuffle ideals, subsequences

5 0.041745268 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

6 0.041401796 104 jmlr-2013-Sparse Single-Index Model

7 0.032012071 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

8 0.029657673 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

9 0.026552552 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

10 0.024859926 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

11 0.023173966 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

12 0.02172377 90 jmlr-2013-Quasi-Newton Method: A New Direction

13 0.020505905 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections

14 0.01884977 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

15 0.018673841 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

16 0.018575458 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

17 0.018336762 8 jmlr-2013-A Theory of Multiclass Boosting

18 0.018022528 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

19 0.017766515 68 jmlr-2013-Machine Learning with Operational Costs

20 0.017668841 120 jmlr-2013-Variational Algorithms for Marginal MAP


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.103), (1, 0.026), (2, -0.043), (3, 0.006), (4, -0.005), (5, -0.063), (6, 0.043), (7, -0.012), (8, -0.041), (9, 0.016), (10, -0.055), (11, -0.022), (12, 0.013), (13, 0.025), (14, -0.065), (15, 0.037), (16, -0.03), (17, -0.045), (18, -0.035), (19, 0.039), (20, -0.088), (21, -0.054), (22, -0.024), (23, 0.009), (24, 0.103), (25, -0.112), (26, 0.024), (27, -0.011), (28, -0.11), (29, 0.072), (30, -0.183), (31, -0.037), (32, -0.039), (33, 0.029), (34, -0.214), (35, -0.114), (36, -0.201), (37, 0.072), (38, -0.077), (39, -0.079), (40, 0.295), (41, 0.142), (42, -0.222), (43, 0.315), (44, -0.128), (45, -0.067), (46, 0.107), (47, 0.226), (48, 0.258), (49, -0.142)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95170331 21 jmlr-2013-Classifier Selection using the Predicate Depth

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

2 0.34088659 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

3 0.3001551 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

4 0.26776347 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

5 0.21691096 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

6 0.21286532 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming

7 0.19757317 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

8 0.18401678 78 jmlr-2013-On the Learnability of Shuffle Ideals

9 0.18135735 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

10 0.18055539 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

11 0.1801942 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

12 0.17616121 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

13 0.17248413 33 jmlr-2013-Dimension Independent Similarity Computation

14 0.15951592 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

15 0.15744303 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

16 0.14757425 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

17 0.14623678 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

18 0.14433402 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

19 0.13920242 90 jmlr-2013-Quasi-Newton Method: A New Direction

20 0.12829219 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (5, 0.101), (6, 0.02), (10, 0.053), (23, 0.049), (68, 0.013), (70, 0.03), (75, 0.467), (85, 0.053), (87, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98738706 45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

Author: Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, Aki Vehtari

Abstract: The GPstuff toolbox is a versatile collection of Gaussian process models and computational tools required for Bayesian inference. The tools include, among others, various inference methods, sparse approximations and model assessment methods. Keywords: Gaussian process, Bayesian hierarchical model, nonparametric Bayes

2 0.96236128 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

Author: Lisha Chen, Andreas Buja

Abstract: Multidimensional scaling (MDS) is the art of reconstructing pointsets (embeddings) from pairwise distance data, and as such it is at the basis of several approaches to nonlinear dimension reduction and manifold learning. At present, MDS lacks a unifying methodology as it consists of a discrete collection of proposals that differ in their optimization criteria, called “stress functions”. To correct this situation we propose (1) to embed many of the extant stress functions in a parametric family of stress functions, and (2) to replace the ad hoc choice among discrete proposals with a principled parameter selection method. This methodology yields the following benefits and problem solutions: (a) It provides guidance in tailoring stress functions to a given data situation, responding to the fact that no single stress function dominates all others across all data situations; (b) the methodology enriches the supply of available stress functions; (c) it helps our understanding of stress functions by replacing the comparison of discrete proposals with a characterization of the effect of parameters on embeddings; (d) it builds a bridge to graph drawing, which is the related but not identical art of constructing embeddings from graphs. Keywords: multidimensional scaling, force-directed layout, cluster analysis, clustering strength, unsupervised learning, Box-Cox transformations

3 0.90597206 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen

Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization

4 0.87909883 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

Author: Wei Pan, Xiaotong Shen, Binghui Liu

Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)

same-paper 5 0.85379076 21 jmlr-2013-Classifier Selection using the Predicate Depth

Author: Ran Gilad-Bachrach, Christopher J.C. Burges

Abstract: Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. We present an alternative approach by defining a median of the distribution, which we show is both more robust, and has good generalization guarantees. We present algorithms to approximate this median. One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define: however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work. Keywords: classification, estimation, median, Tukey depth

6 0.68513221 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

7 0.60839486 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

8 0.58666247 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

9 0.55847907 120 jmlr-2013-Variational Algorithms for Marginal MAP

10 0.53197527 86 jmlr-2013-Parallel Vector Field Embedding

11 0.51643473 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

12 0.51610744 108 jmlr-2013-Stochastic Variational Inference

13 0.51508403 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

14 0.50910419 32 jmlr-2013-Differential Privacy for Functions and Functional Data

15 0.49748996 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

16 0.49308893 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

17 0.49230152 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

18 0.48963937 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

19 0.48824978 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

20 0.4829616 22 jmlr-2013-Classifying With Confidence From Incomplete Information