nips nips2010 nips2010-290 knowledge-graph by maker-knowledge-mining

290 nips-2010-t-logistic regression


Source: pdf

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. [sent-7, score-0.465]

2 This gives rise to a regularized risk minimization problem with a non-convex loss function. [sent-8, score-0.254]

3 Because of the nature of the loss function, our algorithm is tolerant to label noise. [sent-10, score-0.268]

4 Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. [sent-11, score-0.189]

5 1 Introduction Many machine learning algorithms minimize a regularized risk [1]: J(θ) = Ω(θ) + Remp(θ), where Remp(θ) = (1/m) ∑_{i=1}^{m} l(xi, yi, θ). (1) [sent-13, score-0.205]

6 Here, Ω is a regularizer which penalizes complex θ; and Remp, the empirical risk, is obtained by averaging the loss l over the training dataset {(x1, y1), …, (xm, ym)}. [sent-14, score-0.165]

7 If we define the margin of a training example (x, y) as u(x, y, θ) := y ⟨φ(x), θ⟩, then many popular loss functions for binary classification can be written as functions of the margin. [sent-19, score-0.223]

8 The 0-1 loss is non-convex and difficult to handle; it has been shown that it is NP-hard to even approximately minimize the regularized risk with the 0-1 loss [2]. [sent-22, score-0.422]

9 Therefore, other loss functions can be viewed as convex proxies of the 0-1 loss. [sent-23, score-0.265]

10 The hinge loss leads to support vector machines (SVMs), the exponential loss is used in AdaBoost, and logistic regression uses the logistic loss. [sent-24, score-0.957]
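
A minimal numerical sketch (Python/NumPy) of these margin-based losses is given below; the exact scaling constants of equations (2)-(5) are not reproduced in this page, so the conventions used here (unit-margin hinge, unscaled exponential and logistic losses) are assumptions made only for illustration.

```python
import numpy as np

def zero_one_loss(u):
    # 0-1 loss of the margin u = y <phi(x), theta>
    return (u <= 0).astype(float)

def hinge_loss(u):
    # hinge loss, as used by SVMs (unit-margin convention assumed)
    return np.maximum(0.0, 1.0 - u)

def exponential_loss(u):
    # exponential loss, as used by AdaBoost
    return np.exp(-u)

def logistic_loss(u):
    # logistic loss log(1 + exp(-u)), computed stably via logaddexp
    return np.logaddexp(0.0, -u)

u = np.linspace(-4.0, 4.0, 9)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("exp", exponential_loss), ("logistic", logistic_loss)]:
    print(f"{name:8s}", np.round(fn(u), 3))
```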

11 Convexity is a very attractive property because it ensures that the regularized risk minimization problem has a unique global optimum [3]. [sent-25, score-0.153]

12 However, as was recently shown by Long and Servedio [4], learning algorithms based on convex loss functions are not robust to noise. [sent-26, score-0.3]

13 Intuitively, convex loss functions grow at least linearly with slope |l′(0)| for u ∈ (−∞, 0), which introduces an overwhelming impact from data points with u ≪ 0. [sent-27, score-0.29]

14 There has been some recent and some not-so-recent work on using non-convex loss functions to alleviate the above problem. [sent-28, score-0.157]

15 Although the analysis of [4] is carried out in the context of boosting, we believe the results hold for a larger class of algorithms which minimize a regularized risk with a convex loss function. [sent-31, score-0.399]

16 By extending logistic regression from the exponential family to the t-exponential family, a natural extension of the exponential family of distributions studied in statistical physics [6–10], we obtain the t-logistic regression algorithm. [sent-33, score-1.059]

17 Furthermore, we show that a simple block coordinate descent scheme can be used to solve the resultant regularized risk minimization problem. [sent-34, score-0.349]

18 Analysis of this procedure also intuitively explains why t-logistic regression is able to handle label noise. [sent-35, score-0.488]

19 Figure 1: Some commonly used loss functions for binary classification. The hinge, exponential, and logistic losses are convex upper bounds of the 0-1 loss. [sent-36, score-0.157]

20 Our paper is structured as follows: in section 2 we briefly review logistic regression, especially in the context of exponential families. [sent-38, score-0.803]

21 In section 3, we review t-exponential families, which form the basis for our proposed t-logistic regression algorithm introduced in section 4. [sent-40, score-0.15]

22 In section 5 we utilize ideas from convex multiplicative programming to design an optimization strategy. [sent-41, score-0.156]

23 2 Logistic Regression Since we build upon the probabilistic underpinnings of logistic regression, we briefly review some salient concepts. [sent-44, score-0.244]

24 Given a family of conditional distributions parameterized by θ, using Bayes rule, and making a standard iid assumption about the data allows us to write p(θ | X, Y) = p(θ) ∏_{i=1}^{m} p(yi | xi; θ) / p(Y | X) ∝ p(θ) ∏_{i=1}^{m} p(yi | xi; θ), (6) where p(Y | X) is clearly independent of θ. [sent-50, score-0.208]

25 To model p(yi | xi; θ), consider the conditional exponential family of distributions p(y | x; θ) = exp(⟨φ(x, y), θ⟩ − g(θ | x)), (7) with the log-partition function g(θ | x) given by g(θ | x) = log(exp(⟨φ(x, +1), θ⟩) + exp(⟨φ(x, −1), θ⟩)). (8) [sent-51, score-0.361]

26 If we choose the feature map φ(x, y) = (y/2) φ(x) and denote u = y ⟨φ(x), θ⟩, then it is easy to see that p(y | x; θ) is the logistic function p(y | x; θ) = exp(u/2) / (exp(u/2) + exp(−u/2)) = 1 / (1 + exp(−u)). (9) [sent-52, score-0.244]
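
A small sketch of the conditional model in (7)-(9), with φ(x) taken to be the identity map (an assumption made only for illustration): the two class probabilities sum to one and p(y | x; θ) reduces to the familiar sigmoid of the margin.

```python
import numpy as np

def p_logistic(y, x, theta):
    # p(y | x; theta) = exp(u/2) / (exp(u/2) + exp(-u/2)) = 1 / (1 + exp(-u)),
    # where u = y * <phi(x), theta> and phi is the identity map here.
    u = y * np.dot(x, theta)
    return 1.0 / (1.0 + np.exp(-u))

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
print(p_logistic(+1, x, theta), p_logistic(-1, x, theta))
print(p_logistic(+1, x, theta) + p_logistic(-1, x, theta))  # sums to 1
```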

27 Logistic regression computes a maximum a-posteriori (MAP) estimate for θ by minimizing (10) as a function of θ. [sent-55, score-0.15]

28 Comparing (1) and (10) it is easy to see that the regularizer employed in logistic regression is (λ/2) ‖θ‖², while the loss function is the negative log-likelihood − log p(y | x; θ), which thanks to (9) can be identified with the logistic loss (5). [sent-56, score-0.9]

29 3 t-Exponential Family of Distributions In this section we will look at generalizations of the log and exp functions which were first introduced in statistical physics [6–9]. [sent-57, score-0.227]

30 The t-exponential function expt for 0 < t < 2 is defined as follows: expt(x) := exp(x) if t = 1, and expt(x) := [1 + (1 − t) x]_+^{1/(1−t)} otherwise, where [·]_+ denotes the positive part. (11) [sent-60, score-1.02]
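
A sketch of expt as defined in (11). The positive-part clamp on the base is an assumption adopted here so that expt stays non-negative for 1 < t < 2, consistent with the properties listed in the following sentences.

```python
import numpy as np

def exp_t(x, t):
    # t-exponential (11): exp(x) for t = 1, otherwise [1 + (1 - t) x]_+^(1/(1-t)).
    # The clamp to the positive part is assumed so that exp_t stays non-negative.
    x = np.asarray(x, dtype=float)
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

print(exp_t(0.0, t=1.5))                     # 1.0, since exp_t(0) = 1
print(exp_t(-5.0, t=1.5), np.exp(-5.0))      # heavier left tail than exp for 1 < t < 2
print(exp_t([-1.0, 0.0, 1.0], t=1.0))        # t = 1 recovers the usual exp
```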

31 Clearly, expt generalizes the usual exp function, which is recovered in the limit as t → 1. [sent-63, score-0.61]

32 Furthermore, many familiar properties of exp are preserved: expt functions are convex, non-decreasing, non-negative and satisfy expt (0) = 1 [9]. [sent-64, score-1.121]

33 But expt does not preserve one very important property of exp, namely expt(a + b) ≠ expt(a) · expt(b). [sent-65, score-2.04]

34 One can also define the inverse of expt, namely logt, as logt(x) := log(x) if t = 1, and logt(x) := (x^{1−t} − 1)/(1 − t) otherwise. (12) [sent-66, score-1.076]
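
A companion sketch of logt from (12), together with a check that it inverts expt (on the region where the positive-part clamp in expt is inactive) and that expt(a + b) ≠ expt(a) · expt(b) in general.

```python
import numpy as np

def exp_t(x, t):
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def log_t(x, t):
    # t-logarithm (12): log(x) for t = 1, otherwise (x^(1-t) - 1) / (1 - t).
    x = np.asarray(x, dtype=float)
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

t = 1.3
a = np.array([-0.5, 0.0, 2.0])
print(log_t(exp_t(a, t), t))                                # recovers a
print(exp_t(1.0 + 2.0, t), exp_t(1.0, t) * exp_t(2.0, t))   # not equal for t != 1
```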

35 From Figure 2, it is clear that expt decays towards 0 more slowly than the exp function for 1 < t < 2. [sent-68, score-0.585]

36 Figure 2: Left: expt and Middle: logt for various values of t indicated. [sent-77, score-1.095]

37 The right figure depicts the t-logistic loss functions for different values of t. [sent-78, score-0.157]

38 When t = 1, we recover the logistic loss. Analogous to the exponential family of distributions, the t-exponential family of distributions is defined as [9, 13]: p(x; θ) := expt(⟨φ(x), θ⟩ − gt(θ)). (13) [sent-79, score-1.465]

39 A prominent member of the t-exponential family is the Student's-t distribution [14]. [sent-80, score-0.16]

40 Just like in the exponential family case, the log-partition function gt ensures that p(x; θ) is normalized. [sent-81, score-0.484]

41 However, no closed form solution exists for computing gt exactly in general. [sent-82, score-0.32]

42 A closely related distribution, which often appears when working with t-exponential families, is the so-called escort distribution [9, 13]: qt(x; θ) := p(x; θ)^t / Z(θ), (14) [sent-83, score-0.237]

43 where Z(θ) = ∫ p(x; θ)^t dx is the normalizing constant which ensures that the escort distribution integrates to 1. Although gt(θ) is not the cumulant function of the t-exponential family, it still preserves convexity. [sent-84, score-0.468]

44 In addition, it is very close to being a moment generating function: ∇θ gt(θ) = E_{qt(x; θ)}[φ(x)]. [sent-85, score-0.296]

45 8 in Sears [13] and a version specialized to the generalized exponential families appears as Proposition 5. [sent-88, score-0.179]

46 The main difference from ∇θ g(θ) of the normal exponential family is that now ∇θ gt(θ) is equal to the expectation of φ(x) under the escort distribution qt(x; θ) instead of under p(x; θ). [sent-90, score-0.655]

47 Even though no closed form solution exists, one can compute gt given θ and x efficiently using numerical techniques. [sent-92, score-0.32]
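
One simple way to compute gt(θ | x) numerically in the binary conditional case is bisection on the normalization condition expt(⟨φ(x, +1), θ⟩ − g) + expt(⟨φ(x, −1), θ⟩ − g) = 1. The sketch below assumes this two-class setting and 1 < t < 2; the text above leaves the numerical method unspecified, so this is only one possible implementation.

```python
import numpy as np

def exp_t(x, t):
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def g_t_binary(a_pos, a_neg, t, iters=80):
    # Solve exp_t(a_pos - g) + exp_t(a_neg - g) = 1 for g by bisection,
    # where a_pos = <phi(x,+1), theta> and a_neg = <phi(x,-1), theta>.
    # The left-hand side is non-increasing in g, so a bracket plus bisection suffices.
    lhs = lambda g: exp_t(a_pos - g, t) + exp_t(a_neg - g, t)
    lo = max(a_pos, a_neg)          # lhs(lo) >= exp_t(0) = 1
    hi = lo + 1.0
    while lhs(hi) > 1.0:            # grow the bracket until lhs(hi) <= 1
        hi += hi - lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

g = g_t_binary(0.7, -0.7, t=1.5)
print(g, exp_t(0.7 - g, 1.5) + exp_t(-0.7 - g, 1.5))  # second value is ~1
```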

48 If we select t satisfying −(v + 1)/2 = 1/(1 − t) and denote Ψ = (Γ((v + 1)/2) / (√(vπ) Γ(v/2) σ^{1/2}))^{−2/(v+1)}, then by some simple but tedious calculation (included in the supplementary material) St(x | µ, σ, v) = expt(−λ̃ (x − µ)²/2 − g̃t), (19) where λ̃ = 2Ψ / ((t − 1) v σ) and g̃t = (Ψ − 1)/(t − 1). (20) [sent-95, score-0.75]

49 Here, the degrees of freedom of the Student's-t distribution are chosen such that it also belongs to the expt family, which in turn yields v = (3 − t)/(t − 1). [sent-97, score-0.51]
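
The identification of the Student's-t density with an expt member can be checked numerically. The sketch below follows the reconstruction of (19)-(20) above (Ψ, λ̃ and g̃t as written there are a rederivation rather than a verbatim quote from the paper), and compares against scipy's Student-t density with scale √σ.

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.special import gammaln

def exp_t(x, t):
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def student_as_exp_t(x, mu, sigma, t):
    # Degrees of freedom chosen so the Student's-t lies in the exp_t family.
    v = (3.0 - t) / (t - 1.0)
    # Psi = K^(1-t) with K the usual Student's-t normalizer; lambda~ and g~_t as in (19)-(20).
    logK = gammaln((v + 1.0) / 2.0) - gammaln(v / 2.0) - 0.5 * np.log(v * np.pi * sigma)
    Psi = np.exp((1.0 - t) * logK)
    lam = 2.0 * Psi / ((t - 1.0) * v * sigma)
    g = (Psi - 1.0) / (t - 1.0)
    return exp_t(-lam * (x - mu) ** 2 / 2.0 - g, t)

x = np.linspace(-3.0, 3.0, 7)
t_, mu, sigma = 1.5, 0.0, 2.0
v = (3.0 - t_) / (t_ - 1.0)
print(np.allclose(student_as_exp_t(x, mu, sigma, t_),
                  student_t.pdf(x, df=v, loc=mu, scale=np.sqrt(sigma))))
```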

50 As before, if we let φ(x, y) = (y/2) φ(x) and plot the negative log-likelihood − log p(y | x; θ), then we no longer obtain a convex loss function (see Figure 2). [sent-100, score-0.262]

51 Using (13), (18), and (11), we can further write Ĵ(θ) ∝ ∏_{j=1}^{d} [1 + (1 − t)(−λ̃ θj²/2 − g̃t)] · ∏_{i=1}^{m} [1 + (1 − t)(⟨(yi/2) φ(xi), θ⟩ − gt(θ | xi))] + const. [sent-104, score-0.324]

52 = ∏_{j=1}^{d} rj(θ) · ∏_{i=1}^{m} li(θ) + const. (22) [sent-106, score-0.378]

53 Since t > 1, it is easy to see that rj(θ) > 0 is a convex function of θ. [sent-109, score-0.149]

54 On the other hand, since gt(θ | xi) is convex in θ and t > 1, it follows that li(θ) > 0 is also a convex function of θ. [sent-110, score-0.512]

55 5 Convex Multiplicative Programming In convex multiplicative programming [18] we are interested in the following optimization problem: min_θ P(θ) := ∏_{n=1}^{N} zn(θ) s.t. θ ∈ R^d, (23) [sent-113, score-0.454]

56 where zn(θ) are positive convex functions. [sent-115, score-0.406]

57 Clearly, (22) can be identified with (23) by setting N = d + m and identifying zn(θ) = rn(θ) for n = 1, …, d and zn+d(θ) = ln(θ) for n = 1, …, m. [sent-116, score-0.298]

58 min_{ξ ≥ 0} min_θ MP(θ, ξ) := ∑_{n=1}^{N} ξn zn(θ) s.t. ∏_{n=1}^{N} ξn ≥ 1. (24) The optimization problem in (24) is very reminiscent of logistic regression. [sent-128, score-0.244]

59 In logistic regression, ln(θ) = −⟨(yn/2) φ(xn), θ⟩ + g(θ | xn), while here ln(θ) = 1 + (1 − t)(⟨(yn/2) φ(xn), θ⟩ − gt(θ | xn)). [sent-129, score-0.766]

60 The key difference is that in t-logistic regression each data point xn has a weight (or influence) ξn associated with it. [sent-130, score-0.223]

61 ξ-Step: Assume that θ is fixed, and denote z̃n = zn(θ) to rewrite (24) as: min_{ξ ≥ 0} ∑_{n=1}^{N} ξn z̃n s.t. ∏_{n=1}^{N} ξn ≥ 1. (25) [sent-135, score-0.919]

62 Since the objective function is linear in ξ and the feasible region is a convex set, (25) is a convex optimization problem. [sent-138, score-0.242]

63 By introducing a non-negative Lagrange multiplier γ ≥ 0, the partial Lagrangian and its gradient with respect to ξn′ can be written as L(ξ, γ) = ∑_{n=1}^{N} ξn z̃n + γ · (1 − ∏_{n=1}^{N} ξn), (26) and ∂L(ξ, γ)/∂ξn′ = z̃n′ − γ ∏_{n ≠ n′} ξn. (27) [sent-139, score-0.298]

64 Setting this gradient to zero gives ξn′ = γ / z̃n′, and the constraint ∏_{n=1}^{N} ξn = 1 at the optimum then yields γ = (∏_{n=1}^{N} z̃n)^{1/N}. (28) Recall that ξn in (24) is the weight (or influence) of each term zn(θ). [sent-153, score-0.298]

65 The above analysis shows that γ = zn (θ)ξn remains constant for all n. [sent-154, score-0.298]

66 If zn(θ) becomes very large then its influence ξn is reduced. [sent-155, score-0.298]

67 Therefore, points with very large loss have their influence capped and this makes the algorithm robust to outliers. [sent-156, score-0.166]
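
A compact sketch of the resulting ξ-step: with the constraint ∏ ξn = 1 active at the optimum, each weight is the geometric mean of the z̃n divided by z̃n, so a term with a very large loss automatically receives a small weight. The influence-capping behaviour described above is visible directly; the specific numbers below are only illustrative.

```python
import numpy as np

def xi_step(z):
    # Closed-form xi-step sketch: xi_n = geometric_mean(z) / z_n,
    # so that prod(xi) = 1 and xi_n * z_n = gamma is constant across n.
    z = np.asarray(z, dtype=float)
    gamma = np.exp(np.mean(np.log(z)))   # geometric mean of the z_n
    return gamma / z

z = np.array([0.5, 1.0, 2.0, 50.0])      # the last term mimics a point with a very large loss
xi = xi_step(z)
print(np.round(xi, 3))                   # the large-loss term gets a small weight
print(np.prod(xi), np.round(xi * z, 3))  # prod(xi) = 1 and xi_n * z_n is constant
```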

68 This step is essentially the same as in logistic regression, except that each component has a weight ξn here. [sent-158, score-0.244]

69 This requires us to compute the gradient ∇θ zn(θ): ∇θ zn(θ) = ∇θ rn(θ) = (t − 1) λ̃ θn · en for n = 1, …, d, [sent-164, score-0.996]

70 and ∇θ zn+d(θ) = ∇θ ln(θ) = (1 − t)((yn/2) φ(xn) − ∇θ gt(θ | xn)) = (1 − t)((yn/2) φ(xn) − E_{qt(yn | xn; θ)}[(yn/2) φ(xn)]) for n = 1, …, m, where en denotes the d-dimensional vector with one at the n-th coordinate and zeros elsewhere (the n-th unit vector). [sent-167, score-0.19]

71 Here qt(y | x; θ) is the escort distribution of p(y | x; θ) as defined in (16). [sent-168, score-0.201]
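
A sketch of the gradient of a single data term ln(θ), combining the escort expectation with the numerical gt solver sketched earlier; φ is taken to be the identity map and the two-class setting with labels ±1 is assumed.

```python
import numpy as np

def exp_t(x, t):
    if t == 1.0:
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def g_t_binary(a_pos, a_neg, t, iters=80):
    # bisection solver for exp_t(a_pos - g) + exp_t(a_neg - g) = 1 (as sketched earlier)
    lhs = lambda g: exp_t(a_pos - g, t) + exp_t(a_neg - g, t)
    lo, hi = max(a_pos, a_neg), max(a_pos, a_neg) + 1.0
    while lhs(hi) > 1.0:
        hi += hi - lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

def grad_l_n(x_n, y_n, theta, t):
    # Gradient of l_n(theta) = 1 + (1 - t)(<(y_n/2) phi(x_n), theta> - g_t(theta | x_n)),
    # using nabla g_t = escort expectation of (y/2) phi(x_n); phi = identity (assumption).
    a = 0.5 * np.dot(x_n, theta)
    g = g_t_binary(a, -a, t)
    p_pos, p_neg = exp_t(a - g, t), exp_t(-a - g, t)
    q_pos = p_pos ** t / (p_pos ** t + p_neg ** t)   # escort probability q_t(+1 | x; theta)
    escort_mean = (2.0 * q_pos - 1.0) * 0.5 * x_n    # E_q[(y/2) phi(x_n)]
    return (1.0 - t) * (0.5 * y_n * x_n - escort_mean)

print(grad_l_n(np.array([1.0, -0.5]), +1, np.zeros(2), t=1.5))
```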

72 3) How much overhead does t-logistic regression incur as compared to logistic regression? [sent-176, score-0.394]

73 The Long-Servedio dataset is an artificially constructed dataset to show that algorithms which minimize a differentiable convex loss are not tolerant to label noise (Long and Servedio [4]). [sent-180, score-0.511]

74 The Mease-Wyner dataset is another synthetic dataset to test the effect of label noise. [sent-191, score-0.153]

75 Our comparators are logistic regression, linear SVMs, and an algorithm (the probit) which employs the probit loss, L(u) = 1 − erf(2u), used in BrownBoost/RobustBoost [5]. [sent-201, score-0.462]

76 L-BFGS is also used to train logistic regression and the probit loss based algorithms. [sent-203, score-0.718]

77 Label noise is added by randomly choosing 10% of the labels in the training set and flipping them; each dataset is tested with and without label noise. [sent-204, score-0.158]
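
A small sketch of the label-noise protocol described here, assuming ±1 labels and a fixed random seed for reproducibility:

```python
import numpy as np

def flip_labels(y, frac=0.10, seed=0):
    # Randomly flip a fraction of the +1/-1 training labels to simulate label noise.
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    idx = rng.choice(len(y_noisy), size=int(frac * len(y_noisy)), replace=False)
    y_noisy[idx] = -y_noisy[idx]
    return y_noisy

y = np.ones(20, dtype=int)
print(flip_labels(y))   # exactly 2 of the 20 labels are flipped
```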

78 The optimal parameters, namely λ for t-logistic and logistic regression and C for SVMs, are chosen by performing a grid search over the parameter space {2^{−7}, 2^{−6}, …}. [sent-206, score-0.394]
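
The grid search can be reproduced with any standard cross-validation utility; the sketch below uses scikit-learn's logistic-regression baseline on synthetic data. The upper end of the 2^{−7}, 2^{−6}, … grid is truncated in the text above, so the range used here (up to 2^{7}) is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Powers-of-two grid over the regularization parameter (C is the inverse
# regularization strength in scikit-learn).
param_grid = {"C": 2.0 ** np.arange(-7, 8)}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```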

79 In the latter case, the test error of t-logistic regression is very similar to that of logistic regression and linear SVM (with 0% test error). We also experimented with RampSVM [20]; however, its results are worse than those of the other algorithms. [sent-215, score-0.594]

80 Figure 3: The test error rate of various algorithms on six datasets (left to right, top: Long-Servedio, Mease-Wyner, Mushroom; bottom: USPS-N, Adult, Web) with and without 10% label noise. [sent-253, score-0.363]

81 The blue (light) bars denote results on the clean datasets, while the magenta (dark) bars are the results with label noise added. [sent-255, score-0.292]

82 When label noise is added, t-logistic regression (especially with t = 1. [sent-258, score-0.274]

83 One can clearly observe that the ξ of the noisy data is much smaller than that of the clean data, which indicates that the algorithm is able to effectively identify these points and cap their influence. [sent-265, score-0.166]

84 From left to right, the first spike corresponds to the noisy large margin examples, the second spike represents the noisy pullers, the third spike denotes the clean pullers, while the rightmost spike corresponds to the clean large margin examples. [sent-267, score-0.568]

85 Clearly, the noisy large margin examples and the noisy pullers are assigned a low value of ξ, thus capping their influence and leading to perfect classification of the test set. [sent-268, score-0.323]

86 On the other hand, logistic regression is unable to discriminate between clean and noisy training samples, which leads to bad performance on noisy datasets. [sent-269, score-0.594]

87 In a nutshell, t-logistic regression takes longer to train than either logistic regression or the probit. [sent-271, score-0.567]

88 First, there is no closed form expression for gt (θ | x). [sent-273, score-0.32]

89 When it does converge, it is often faster than the convex multiplicative programming algorithm. [sent-277, score-0.156]

90 Just like logistic regression, which uses a convex loss and hence converges to the same solution independent of the initialization, the solution obtained (footnote: we provide the significance test results in Table 2 of the supplementary material) [sent-284, score-0.741]

91 Figure 4: The distribution of ξ obtained after training t-logistic regression with t = 1. [sent-321, score-0.15]

92 by t-logistic regression seems fairly independent of the initial value of θ. [sent-327, score-0.173]

93 On the other hand, the performance of the probit fluctuates widely with different initial values of θ. [sent-328, score-0.193]

94 7 Discussion and Outlook In this paper, we generalize logistic regression to t-logistic regression by using the t-exponential family. [sent-351, score-0.544]

95 In the Long-Servedio experiment, if the label noise is increased significantly beyond 10%, the performance of t-logistic regression may degrade (see Fig. [sent-355, score-0.274]

96 It will be interesting to investigate if t-logistic regression can be married with graphical models to yield t-conditional random fields. [sent-358, score-0.15]

97 We will also focus on better numerical techniques to accelerate the θ-step, especially a faster way to compute gt . [sent-359, score-0.296]

98 Generalized thermostatistics based on deformed exponential and logarithmic functions. [sent-402, score-0.16]

99 Estimators, escort probabilities, and φ-exponential families in statistical physics. [sent-410, score-0.213]

100 An outer approximation method for minimizing the product of several convex functions on a convex set. [sent-450, score-0.242]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('expt', 0.51), ('zn', 0.298), ('gt', 0.296), ('logt', 0.283), ('logistic', 0.244), ('probit', 0.193), ('regression', 0.15), ('escort', 0.142), ('mushroom', 0.142), ('testerror', 0.142), ('loss', 0.131), ('student', 0.118), ('pullers', 0.113), ('convex', 0.108), ('family', 0.101), ('label', 0.094), ('supplementary', 0.083), ('clean', 0.08), ('exp', 0.075), ('physica', 0.074), ('xn', 0.073), ('families', 0.071), ('risk', 0.071), ('adult', 0.067), ('remp', 0.064), ('noisy', 0.06), ('qt', 0.059), ('exponential', 0.057), ('eqt', 0.057), ('servedio', 0.057), ('thermostatistics', 0.057), ('regularized', 0.052), ('spike', 0.052), ('datasets', 0.051), ('kuno', 0.05), ('uence', 0.048), ('multiplicative', 0.048), ('coordinate', 0.046), ('outlook', 0.046), ('deformed', 0.046), ('yi', 0.045), ('hinge', 0.044), ('tolerant', 0.043), ('rj', 0.041), ('logarithms', 0.041), ('margin', 0.04), ('ln', 0.04), ('minimize', 0.037), ('cyan', 0.036), ('libsvm', 0.036), ('svm', 0.036), ('robust', 0.035), ('dataset', 0.034), ('coordinates', 0.034), ('convexity', 0.034), ('bundle', 0.033), ('isotropic', 0.033), ('dark', 0.031), ('en', 0.031), ('svms', 0.03), ('answer', 0.03), ('noise', 0.03), ('ensures', 0.03), ('st', 0.029), ('springer', 0.028), ('xi', 0.028), ('generalized', 0.027), ('bar', 0.027), ('functions', 0.026), ('ym', 0.026), ('boosting', 0.026), ('clearly', 0.026), ('objective', 0.026), ('descent', 0.025), ('distributions', 0.025), ('usual', 0.025), ('questions', 0.025), ('xm', 0.025), ('rewrite', 0.025), ('test', 0.025), ('physics', 0.025), ('phil', 0.025), ('jylanki', 0.025), ('vanhatalo', 0.025), ('timothy', 0.025), ('capping', 0.025), ('erf', 0.025), ('rocco', 0.025), ('roderick', 0.025), ('overwhelming', 0.025), ('empirically', 0.025), ('block', 0.024), ('appears', 0.024), ('plotted', 0.024), ('closed', 0.024), ('fairly', 0.023), ('monotonically', 0.023), ('frequency', 0.023), ('longer', 0.023), ('rmly', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 290 nips-2010-t-logistic regression

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

2 0.14267041 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

Author: Min Yang, Linli Xu, Martha White, Dale Schuurmans, Yao-liang Yu

Abstract: Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of “loss clipping” can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems. 1

3 0.1381449 282 nips-2010-Variable margin losses for classifier design

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1

4 0.13673285 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

Author: Armand Joulin, Jean Ponce, Francis R. Bach

Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1

5 0.12965347 228 nips-2010-Reverse Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several stateof-the-art methods, showing excellent performance. 1

6 0.11218139 284 nips-2010-Variational bounds for mixed-data factor analysis

7 0.10625445 243 nips-2010-Smoothness, Low Noise and Fast Rates

8 0.088671163 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

9 0.087395489 99 nips-2010-Gated Softmax Classification

10 0.084840305 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

11 0.083108127 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

12 0.07475885 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

13 0.070952386 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

14 0.070045061 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

15 0.068899669 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

16 0.067341313 192 nips-2010-Online Classification with Specificity Constraints

17 0.065727778 269 nips-2010-Throttling Poisson Processes

18 0.065308943 217 nips-2010-Probabilistic Multi-Task Feature Selection

19 0.063955322 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio

20 0.063409761 235 nips-2010-Self-Paced Learning for Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.205), (1, 0.053), (2, 0.102), (3, -0.026), (4, 0.039), (5, 0.08), (6, -0.089), (7, 0.004), (8, -0.043), (9, -0.01), (10, -0.038), (11, -0.007), (12, 0.1), (13, 0.069), (14, -0.039), (15, 0.06), (16, 0.021), (17, 0.122), (18, 0.071), (19, -0.108), (20, 0.037), (21, -0.091), (22, -0.005), (23, -0.052), (24, 0.045), (25, 0.071), (26, 0.048), (27, 0.182), (28, 0.089), (29, 0.05), (30, -0.042), (31, -0.009), (32, 0.002), (33, 0.02), (34, 0.009), (35, 0.04), (36, -0.021), (37, -0.06), (38, -0.069), (39, 0.086), (40, 0.045), (41, 0.089), (42, -0.002), (43, -0.032), (44, 0.055), (45, 0.038), (46, -0.028), (47, 0.01), (48, 0.054), (49, -0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93994117 290 nips-2010-t-logistic regression

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

2 0.77716327 228 nips-2010-Reverse Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several stateof-the-art methods, showing excellent performance. 1

3 0.76234752 282 nips-2010-Variable margin losses for classifier design

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1

4 0.74557668 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

Author: Min Yang, Linli Xu, Martha White, Dale Schuurmans, Yao-liang Yu

Abstract: Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of “loss clipping” can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems. 1

5 0.70920295 269 nips-2010-Throttling Poisson Processes

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

6 0.63921785 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

7 0.60539114 243 nips-2010-Smoothness, Low Noise and Fast Rates

8 0.52633303 284 nips-2010-Variational bounds for mixed-data factor analysis

9 0.52571958 61 nips-2010-Direct Loss Minimization for Structured Prediction

10 0.52306783 191 nips-2010-On the Theory of Learnining with Privileged Information

11 0.52254647 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs

12 0.51780999 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates

13 0.50794822 15 nips-2010-A Theory of Multiclass Boosting

14 0.50434166 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

15 0.50358033 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

16 0.50296175 99 nips-2010-Gated Softmax Classification

17 0.48644033 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

18 0.48602238 202 nips-2010-Parallelized Stochastic Gradient Descent

19 0.48024911 192 nips-2010-Online Classification with Specificity Constraints

20 0.47823387 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.059), (17, 0.018), (27, 0.051), (30, 0.083), (35, 0.022), (45, 0.213), (49, 0.186), (50, 0.047), (52, 0.05), (53, 0.016), (60, 0.044), (77, 0.052), (78, 0.024), (90, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90747201 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

Author: Barnabás Póczos, Csaba Szepesvári, David Tax

Abstract: We present simple and computationally efficient nonparametric estimators of R´ nyi entropy and mutual information based on an i.i.d. sample drawn from an e unknown, absolutely continuous distribution over Rd . The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the ‘generalized nearest-neighbor’ graph of the sample and the empirical copula of the sample respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter of which under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis. 1

2 0.8676573 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

Author: Hongbo Zhou, Qiang Cheng

Abstract: Regularization technique has become a principled tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give mathematically exact definition for a novel representation called sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also provide some generalization bounds in a classification setting. 1

3 0.8651697 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures

Author: Kamiya Motwani, Nagesh Adluru, Chris Hinrichs, Andrew Alexander, Vikas Singh

Abstract: We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many Neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group level differences as a function of disease in a set of images. Typically, interactive expert guided segmentation has been the method of choice for such applications, but this is tedious for large datasets common today. To address this problem, we endow an image segmentation algorithm with “advice” encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region – as a histogram over a bag of ‘words’ (e.g., suitable feature descriptors). Now, given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain specific constraints for Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in Neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm. 1

same-paper 4 0.85674518 290 nips-2010-t-logistic regression

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

5 0.80312979 117 nips-2010-Identifying graph-structured activation patterns in networks

Author: James Sharpnack, Aarti Singh

Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1

6 0.79917377 265 nips-2010-The LASSO risk: asymptotic results and real world examples

7 0.79863292 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

8 0.79678357 148 nips-2010-Learning Networks of Stochastic Differential Equations

9 0.7963475 63 nips-2010-Distributed Dual Averaging In Networks

10 0.79609537 243 nips-2010-Smoothness, Low Noise and Fast Rates

11 0.79597372 222 nips-2010-Random Walk Approach to Regret Minimization

12 0.79476416 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

13 0.79452205 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

14 0.79401273 280 nips-2010-Unsupervised Kernel Dimension Reduction

15 0.79384893 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

16 0.79287678 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

17 0.79185921 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

18 0.79169196 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

19 0.79166651 282 nips-2010-Variable margin losses for classifier design

20 0.78992468 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models