nips nips2008 nips2008-239 knowledge-graph by maker-knowledge-mining

239 nips-2008-Tighter Bounds for Structured Estimation


Source: pdf

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. [sent-10, score-0.996]

2 While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. [sent-11, score-0.272]

3 We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. [sent-12, score-1.004]

4 On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. [sent-14, score-0.661]

5 At the heart of those methods is an inverse optimization problem, namely that of finding a function f(x, y) such that the prediction y*, which maximizes f(x, y*) for a given x, also minimizes some loss ∆(y, y*) on a training set. [sent-16, score-0.322]

6 Y can represent a rich class of possible data structures, ranging from binary sequences (tagging), to permutations (matching and ranking), to alignments (sequence matching), to path plans [15]. [sent-18, score-0.155]

7 To make such inherently discontinuous and nonconvex optimization problems tractable, one applies a convex upper bound on the incurred loss. [sent-19, score-0.777]

8 This has two benefits: firstly, the problem has no local minima, and secondly, the optimization problem is continuous and piecewise differentiable, which allows for effective optimization [17, 19, 20]. [sent-20, score-0.132]

9 This setting, however, exhibits a significant problem: the looseness of the convex upper bounds can sometimes lead to poor accuracy. [sent-21, score-0.409]

10 For binary classification, [2] proposed to switch from the hinge loss, a convex upper bound, to a tighter nonconvex upper bound, namely the ramp loss. [sent-22, score-1.248]

11 The resulting optimization uses the convex-concave procedure of [22], which is well known in optimization as the DC-programming method [9]. [sent-24, score-0.132]

12 We extend the notion of ramp loss to structured estimation. [sent-25, score-0.807]

13 We show that with some minor modifications, the DC algorithms used in the binary case carry over to the structured setting. [sent-26, score-0.249]

14 Unlike the binary case, however, we observe that for structured prediction problems with noisy data, DC programming can lead to improved accuracy in practice. [sent-27, score-0.354]

15 Finally, denote by ∆ : Y × Y → [0, ∞) a loss function which maps pairs of labels to nonnegative numbers. [sent-41, score-0.266]

16 ∆(y, y′) = ‖y − y′‖₁ or considerably more complicated loss functions, e.g. [sent-44, score-0.221]

17 We want to find f such that, for y*(x, f) := argmax_{y′} f(x, y′)  (1), the loss ∆(y, y*(x, f)) is minimized: given X and Y we want to minimize the regularized risk R_reg[f, X, Y] := (1/m) Σ_{i=1}^m ∆(y_i, y*(x_i, f)) + λΩ[f]  (2). [sent-47, score-0.258]
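
For readability, the two objects in this entry can be typeset as follows (a reconstruction from the fragments above; the variable name y′ under the argmax is not taken verbatim from the paper):

```latex
y^*(x, f) := \operatorname*{argmax}_{y'} f(x, y') \qquad (1)

R_{\mathrm{reg}}[f, X, Y] := \frac{1}{m} \sum_{i=1}^{m} \Delta\bigl(y_i, y^*(x_i, f)\bigr) + \lambda \Omega[f] \qquad (2)
```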

18 Since (2) is notoriously hard to minimize, several convex upper bounds have been proposed to make ∆(y_i, y*(x_i, f)) tractable in f. [sent-49, score-0.409]

19 In regular convex structured estimation, l(x, y, y, f ) is used. [sent-55, score-0.427]

20 This also shows why both formulations lead to convex upper bounds of the loss. [sent-57, score-0.409]

21 The loss bound (3) of Lemma 1 suffers from a significant problem: for large values of f the loss may grow without bound, provided that the estimate is incorrect. [sent-62, score-0.602]

22 This is not desirable since in this setting even a single observation may completely ruin the quality of the convex upper bound on the misclassification error. [sent-63, score-0.517]

23 Another case where the convex upper bound is not desirable is the following: imagine that there are a lot of y which are as good as the label in the training set; this happens frequently in ranking where there are ties between the optimal permutations. [sent-64, score-0.703]

24 Then one can replace y by any element of Yopt in the bound of Lemma 1. [sent-66, score-0.16]

25 Minimization over y′ ∈ Y_opt leads to a tighter non-convex upper bound: l(x, y, y, f) ≥ inf_{y′∈Y_opt} sup_{y″} [β(x, y′, y″, f) + ∆(y′, y″)] ≥ ∆(y, y*(x, f)). [sent-67, score-0.333]

26 In the case of binary classification, [2] proposed the following non-convex loss that can be minimized using DC programming: l(x, y, f) := min(1, max(0, 1 − y f(x))) = max(0, 1 − y f(x)) − max(0, −y f(x))  (4). [sent-68, score-0.423]
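
To make the difference-of-hinges identity in (4) concrete, here is a small numpy sketch (the labels and margins below are made-up values used purely for illustration): it evaluates the ramp loss both as the truncated hinge min(1, max(0, 1 − y f(x))) and as the difference of two hinge terms, and the two agree.

```python
import numpy as np

def hinge(z):
    # elementwise max(0, z)
    return np.maximum(0.0, z)

def ramp_loss(y, fx):
    # ramp loss written as a difference of two convex terms, as in (4)
    return hinge(1.0 - y * fx) - hinge(-y * fx)

y = np.array([+1, +1, -1, -1])
fx = np.array([2.0, -3.0, 0.4, -0.2])          # hypothetical margins f(x)

print(ramp_loss(y, fx))                        # [0.  1.  1.  0.8]
print(np.minimum(1.0, hinge(1.0 - y * fx)))    # identical: the truncated form
```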

27 We see that (4) is the difference between a soft-margin loss and a hinge loss. [sent-69, score-0.257]

28 That is, the difference between a loss using a large margin related quantity and one using simply the violation of the margin. [sent-70, score-0.317]

29 The intuition for extending this to structured losses is that the generalized hinge loss underestimates the actual loss whereas the soft margin loss overestimates the actual loss. [sent-72, score-1.112]

30 Then, under the assumptions of Lemma 1, the following bound holds: ∆(y, ȳ(x, y, f)) ≥ l(x, y, f) ≥ ∆(y, y*(x, f))  (7). This loss is a difference between two convex functions, hence it may be (approximately) minimized by a DC programming procedure. [sent-75, score-0.66]

31 Moreover, it is easy to see that for Γ(η) = 1, f(x, y) = (1/2) y f(x), and y ∈ {±1} we recover the ramp loss of (4). [sent-76, score-0.654]

32 Note that this nonconvex upper bound is not likely to be Bayes consistent. [sent-86, score-0.505]

33 However, it will generate solutions which have a smaller model complexity since it is never larger than the convex upper bound on the loss, hence the regularizer on f plays a more important role in regularized risk minimization. [sent-87, score-0.517]

34 For a function f(x) = f_cave(x) + f_vex(x) which can be expressed as the sum of a convex f_vex and a concave f_cave, we can find a convex upper bound by f_cave(x₀) + ⟨x − x₀, f′_cave(x₀)⟩ + f_vex(x). [sent-90, score-1.513]

35 This follows from the first-order Taylor expansion of the concave part fcave at the current value of x. [sent-91, score-0.166]

36 Subsequently, this upper bound is minimized, a new Taylor approximation is computed, and the procedure is repeated. [sent-92, score-0.311]
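
As a minimal numeric illustration of this relinearize-and-minimize loop, the sketch below runs the DC procedure on a hypothetical one-dimensional objective f(x) = f_vex(x) + f_cave(x), with f_vex(x) = x² and f_cave(x) = −log(1 + eˣ); both the choice of functions and the closed-form inner minimizer are assumptions made here for illustration only, not anything prescribed by the paper.

```python
import numpy as np

def f_vex(x):  return x ** 2                     # convex part
def f_cave(x): return -np.log1p(np.exp(x))       # concave part (negative softplus)
def f(x):      return f_vex(x) + f_cave(x)

x = 0.0
for t in range(6):
    slope = -1.0 / (1.0 + np.exp(-x))            # f_cave'(x) at the current iterate
    # minimize the convex upper bound f_vex(z) + f_cave(x) + slope * (z - x);
    # here d/dz [z^2 + slope * z] = 0 gives z = -slope / 2 in closed form
    x = -slope / 2.0
    print(t, round(x, 4), round(f(x), 4))        # the objective decreases at every step
```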

37 We now proceed to deriving an explicit instantiation for structured estimation. [sent-94, score-0.255]

38 Algorithm 1 (Structured Estimation with Tighter Bounds): using the loss of Lemma 1, initialize f = argmin_f Σ_{i=1}^m l(x_i, y_i, y_i, f) + λΩ[f]; then repeat: compute ỹ_i := ȳ(x_i, y_i, f) for all i. [sent-96, score-0.441]

39 Using the tightened loss bound, recompute f = argmin_f Σ_{i=1}^m l(x_i, y_i, ỹ_i, f) + λΩ[f], until converged. Denote by k the kernel associated with H, defined on (X × Y) × (X × Y). [sent-97, score-0.517]
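
The following self-contained sketch is one possible toy instantiation of this loop (Algorithm 1): a tiny multiclass problem treated as structured estimation with a linear model, a 0/1 structured loss, an enumerable label space, and a plain subgradient solver for the inner convex problem. All names, data, and solver choices here are assumptions for illustration; the paper instead plugs in structured solvers such as BMRM or SVMStruct, and its ȳ may include additional rescaling, whereas below ỹ is simply the argmax of f.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 4, 10, 200

def phi(x, y):
    """Joint feature map: copy x into the block belonging to label y."""
    out = np.zeros(n_classes * dim)
    out[y * dim:(y + 1) * dim] = x
    return out

def delta(y_true, y_pred):
    """Structured loss; here just the 0/1 loss for simplicity."""
    return 0.0 if y_true == y_pred else 1.0

# synthetic data with 20% shuffled labels (label noise)
W_true = rng.normal(size=(n_classes, dim))
X = rng.normal(size=(n, dim))
Y = np.argmax(X @ W_true.T, axis=1)
Y[: n // 5] = rng.integers(n_classes, size=n // 5)

def decode(w, x):
    """y*(x, f): the label maximizing f(x, y) = <w, phi(x, y)>."""
    return int(np.argmax([w @ phi(x, c) for c in range(n_classes)]))

def solve_convex(w0, Y_tilde, lam=0.01, steps=300, lr=0.5):
    """Subgradient descent on the tightened convex objective
    (1/n) sum_i [max_y (<w, phi(x_i, y)> + delta(y_i, y)) - <w, phi(x_i, ytilde_i)>] + lam ||w||^2."""
    w = w0.copy()
    for t in range(steps):
        g = 2 * lam * w
        for x, y, yt in zip(X, Y, Y_tilde):
            scores = [w @ phi(x, c) + delta(y, c) for c in range(n_classes)]
            y_aug = int(np.argmax(scores))        # loss-augmented argmax
            g += (phi(x, y_aug) - phi(x, yt)) / n
        w -= lr / np.sqrt(t + 1) * g
    return w

# Outer DC loop: start from the usual convex bound (ytilde_i = y_i), then
# repeatedly relinearize the concave part at the current maximizer.
w = np.zeros(n_classes * dim)
Y_tilde = Y.copy()
for outer in range(5):
    w = solve_convex(w, Y_tilde)
    Y_tilde = np.array([decode(w, x) for x in X])  # ytilde_i for the next round

print("training error:", np.mean([decode(w, x) != y for x, y in zip(X, Y)]))
```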

40 Likewise, we may perform the linearization in (6) as follows: − sup_{y′} β(x, y, y′, f) ≤ −β(x, y, ỹ, f). In other words, we use the rescaled estimate ỹ to provide an upper bound on the concave part of the loss function. [sent-99, score-0.686]

41 Experiments, Multiclass Classification: in this experiment, we investigate the performance of the convex and ramp loss versions of Winner-Takes-All multiclass classification [1] when the training data is noisy. [sent-105, score-0.857]

42 Table 1 shows the results (average accuracy ± standard deviation) on several datasets with different percentages of labels shuffled. [sent-109, score-0.166]

43 It can be seen that ramp loss outperforms the convex upper bound when the datasets are noisy. [sent-112, score-1.137]

44 For clean data the convex upper bound is slightly superior, albeit not in a statistically significant fashion. [sent-113, score-0.546]

45 This supports our conjecture that, compared to the convex upper bound, the ramp loss is more robust on noisy datasets. [sent-114, score-0.943]

46 They compared several methods showing that optimizing the Normalized Discounted Cumulative Gains (NDCG) score using a form of structured estimation yields best performance. [sent-117, score-0.258]

47 In this experiment, we perform ranking experiments with the OHSUMED dataset which is publicly available [13]. [sent-119, score-0.155]

48 We first carried out the structured output training algorithm which optimizes the convex upper bound of NDCG as described in [21]. [sent-121, score-0.738]

49 The convex upper bounds led to the … [entry truncated; the remainder is spilled content from Table 1, whose rows compare the convex upper bound and the ramp loss on the DNA, LETTER, SATIMAGE, SEGMENT, SHUTTLE, and USPS datasets at several label-noise percentages]. [sent-123, score-5.161]

50 Table 1: Average accuracy for multiclass classification using the convex upper bound and the ramp loss. [sent-195, score-0.982]

51 51 Figure 1: NDCG comparison against ranking SVM and RankBoost. [sent-200, score-0.155]

52 In the context of web page ranking an improvement of 0. [sent-203, score-0.187]

53 As a result, there is no w that learns the data well, and for each w the associated max_{y′} [f(x, y′) − f(x, y) + ∆(y, y′)] causes either the first part or the second part of the loss to be big, such that the total value of the loss function always exceeds max_{y′} ∆(y, y′). [sent-216, score-0.442]

54 We compared the results of our method and two standard methods for ranking: ranking SVM [10, 8] and RankBoost [6] (the baselines for OHSUMED are shown in [13]) and used NDCG as the performance criterion. [sent-218, score-0.155]

55 Structured classification: we also assessed the performance of the algorithm on two different structured classification tasks for computational biology, namely protein sequence alignment and RNA secondary structure prediction. [sent-223, score-0.817]

56 Protein sequence alignment is the problem of comparing the amino acid sequences corresponding to two different proteins in order to identify regions of the sequences which have common ancestry or biological function. [sent-224, score-0.535]

57 In the pairwise sequence alignment task, the elements of the input space X consist of pairs of amino acid sequences, represented as strings of approximately 100-1000 characters. [entry truncated; the remainder is spilled header content from Table 2, comparing the CRF, convex, and ramp loss methods by percent identity range]. [sent-225, score-1.191]

58 Table 2: Protein pairwise sequence alignment results, stratified by reference alignment percentage identity. [sent-240, score-0.537]

59 The second through fifth columns refer to the four non-overlapping reference alignment percentage identity ranges described in the text, and the sixth column corresponds to overall results, pooled across all four subsets. [sent-241, score-0.406]

60 Each non-bolded value represents the average test set recall for a particular algorithm on alignments from the corresponding subset. [sent-242, score-0.258]

61 The output space Y contains candidate alignments which identify the corresponding positions in the two sequences which are hypothesized to be evolutionarily related. [sent-280, score-0.127]

62 For each inner optimization step, we used a fast-converging subgradientbased optimization algorithm with an adaptive Polyak-like step size [23]. [sent-282, score-0.132]

63 We performed two-fold cross-validation over a collection of 1785 pairs of structurally aligned protein domains [14]. [sent-283, score-0.158]

64 The percentage identity of a reference alignment is defined as the proportion of aligned residue pairs corresponding to identical amino acids. [sent-286, score-0.425]
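
A literal reading of this definition, as a tiny sketch (the function name and toy sequences are hypothetical):

```python
def percent_identity(aligned_pairs, seq_a, seq_b):
    """Fraction of aligned residue pairs whose amino acids are identical.
    aligned_pairs: list of (i, j) index pairs from the reference alignment."""
    if not aligned_pairs:
        return 0.0
    matches = sum(seq_a[i] == seq_b[j] for i, j in aligned_pairs)
    return matches / len(aligned_pairs)

# hypothetical toy example: two of the three aligned pairs are identical residues
print(percent_identity([(0, 0), (1, 1), (2, 3)], "MKV", "MRAV"))  # 0.666...
```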

65 We partitioned the alignments in the testing collection into four subsets based on percent identity (0-10%, 11-20%, 21-30%, and 31+%) and show the recall of the algorithm for each subset in addition to overall recall (see Table 2). [sent-287, score-0.205]

66 We note that the accuracy differences are most pronounced at the low percentage identity ranges, the ‘twilight zone’ regime where better alignment accuracy has far-reaching consequences in many other computational biology applications [16]. [sent-289, score-0.317]

67 RNA secondary structure prediction: Ribonucleic acid (RNA) refers to a class of long linear polymers composed of four different types of nucleotides (A, C, G, U). [sent-290, score-0.41]

68 Nucleotides within a single RNA molecule base-pair with each other, giving rise to a pattern of base-pairing known as the RNA’s secondary structure. [sent-291, score-0.262]

69 In the RNA secondary structure prediction problem, we are given an RNA sequence (a string of approximately 20-500 characters) and are asked to predict the secondary structure that the RNA molecule will form in vivo. [sent-292, score-0.597]

70 Conceptually, an RNA secondary structure can be thought of as a set of unordered pairs of nucleotide indices, where each pair designates two … (Footnote 1: We note that the results here are based on using the Viterbi algorithm for parsing, which differs from the inference method used in [3].) [sent-293, score-0.223]

71 Figure 2 (a)-(c): Tightness of the nonconvex bound. [sent-295, score-0.194]

72 Figures (a) and (b) show the value of the nonconvex loss, the convex loss and the actual loss as a function of the number of iterations when minimizing the nonconvex upper bound. [sent-296, score-1.219]

73 At each relinearization, which occurs every 1000 iterations, the nonconvex upper bound decreases. [sent-297, score-0.505]

74 Note that the convex upper bound increases in the process as the convex and nonconvex bounds diverge further from each other. [sent-298, score-1.077]

75 Figure (c) shows the tightness of the final nonconvex bound at the end of optimization for different values of the regularization parameter λ. [sent-300, score-0.49]

76 nucleotides in the RNA molecule which base-pair with each other. [sent-301, score-0.146]

77 Following convention, we take the structured output space Y to be the set of all possible pseudoknot-free structures. [sent-302, score-0.221]

78 We used a max-margin model for secondary structure prediction. [sent-303, score-0.223]

79 As our loss function, we used ∆(y, y′) = 1 − recall (where recall is the proportion of base-pairs in the reference structure y that are recovered in the predicted structure y′). [sent-305, score-0.457]
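
A minimal sketch of this loss on sets of base-pairs (the function name and toy structures are hypothetical; in the paper the reference structures come from the RFAM database):

```python
def recall_loss(y_ref, y_pred):
    """Delta(y, y') = 1 - recall: the fraction of reference base-pairs
    that the predicted structure fails to recover. Structures are given
    as collections of unordered (i, j) nucleotide index pairs."""
    ref = {frozenset(p) for p in y_ref}
    pred = {frozenset(p) for p in y_pred}
    if not ref:
        return 0.0
    return 1.0 - len(ref & pred) / len(ref)

# hypothetical structures: 2 of the 3 reference base-pairs are recovered
print(recall_loss([(1, 20), (2, 19), (3, 18)], [(1, 20), (3, 18), (5, 9)]))  # 0.333...
```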

80 To test the algorithm, we performed two-fold cross-validation over a large collection of 1359 RNA sequences with known secondary structures from the RFAM database (release 8. [sent-307, score-0.262]

81 We evaluated the methods using two standard metrics for RNA secondary structure prediction accuracy known as sensitivity and selectivity (which are the equivalent of recall and precision, respectively, for this domain). [sent-309, score-0.458]

82 For reporting, we binned the sequences in the test collection by length into four ranges (1-50, 51-100, 101-200, 201+ nucleotides), and evaluated the sensitivity and selectivity of the algorithm for each subset in addition to overall accuracy (see Table 3). [sent-310, score-0.286]

83 This is likely because we opted for a loss function that penalizes “false negative” base-pairings but not “false positives”, since our main interest is in identifying correct base-pairings (a harder task than predicting only a small number of high-confidence base-pairings). [sent-313, score-0.221]

84 An alternative loss function that chooses a different balance between penalizing false positives and false negatives would achieve a different trade-off of sensitivity and selectivity. [sent-314, score-0.33]

85 Tightness of the bound: We generated plots of the convex, nonconvex, and actual losses (which correspond to l(x, y, y, f), l(x, y, f), and ∆(y, y*(x, f)), respectively, from Lemma 2) over the course of optimization for our RNA folding task (see Figure 2). [sent-315, score-0.176]

86 From Figures 2a and 2b, we see that the nonconvex loss provides a much tighter upper bound on the actual loss function. [sent-316, score-1.096]

87 Figure 2c shows that the tightness of the bound decreases for increasing regularization parameters λ. [sent-317, score-0.23]

88 In summary, our bound leads to improvements whenever there is a large number of instances (x, y) which cannot be classified perfectly. [sent-318, score-0.16]

89 This is not surprising as for “clean” datasets even the convex upper bound vanishes when no margin errors are encountered. [sent-319, score-0.647]

90 Hence noticeable improvements can be gained mainly in the structured output setting rather than in binary classification. [sent-320, score-0.249]

91 Summary and Discussion: We proposed a simple modification of the convex upper bound of the loss in structured estimation which can be used to obtain tighter bounds on sophisticated loss functions. [sent-322, score-1.386]

92 The advantage of our approach is that it requires next to no modification of existing optimization algorithms but rather repeated invocation of a structured estimation solver such as SVMStruct, BMRM, or Pegasos. [sent-323, score-0.363]

93 In several applications our approach outperforms the convex upper bounds. [sent-324, score-0.357]

94 This can be seen for multiclass classification, for ranking, where we encountered underfitting and undesirable trivial solutions for the convex upper bound, and in the context of sequence alignment, where in particular for hard-to-align observations significant gains can be found. [sent-325, score-0.915]

95 From this experimental study, it seems that the tighter non-convex upper bound is useful in two scenarios: when the labels are noisy and when for each example there is a large set of labels which are (almost) as good as the label in the training set. [sent-326, score-0.518]

96 Future work includes studying other types of structured estimation problems such as the ones encountered in NLP to check if our new upper bound can also be useful for these problems. [sent-327, score-0.569]

97 MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. [sent-429, score-0.251]

98 A scalable modular convex solver for regularized risk minimization. [sent-469, score-0.245]

99 Large margin methods for structured and interdependent output variables. [sent-477, score-0.317]

100 Cofi rank - maximum margin matrix factorization for collaborative ranking. [sent-488, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rna', 0.376), ('ramp', 0.365), ('structured', 0.221), ('loss', 0.221), ('convex', 0.206), ('alignment', 0.203), ('nonconvex', 0.194), ('secondary', 0.194), ('ndcg', 0.167), ('bound', 0.16), ('ranking', 0.155), ('upper', 0.151), ('fcave', 0.13), ('protein', 0.122), ('tighter', 0.117), ('yopt', 0.104), ('margin', 0.096), ('dc', 0.093), ('fvex', 0.078), ('nucleotides', 0.078), ('acid', 0.074), ('amino', 0.074), ('crf', 0.072), ('tightness', 0.07), ('molecule', 0.068), ('sequences', 0.068), ('yf', 0.068), ('optimization', 0.066), ('selectivity', 0.065), ('sup', 0.065), ('multiclass', 0.065), ('strati', 0.063), ('lemma', 0.061), ('alignments', 0.059), ('shuf', 0.059), ('recall', 0.055), ('rescaled', 0.053), ('argminf', 0.052), ('percentages', 0.052), ('rankboost', 0.052), ('rfam', 0.052), ('supy', 0.052), ('twilight', 0.052), ('bounds', 0.052), ('fth', 0.049), ('sequence', 0.048), ('undesirable', 0.047), ('subgradient', 0.047), ('pooled', 0.047), ('folding', 0.046), ('satimage', 0.046), ('labels', 0.045), ('sensitivity', 0.045), ('percentage', 0.044), ('le', 0.043), ('yi', 0.042), ('zone', 0.042), ('shuttle', 0.042), ('gains', 0.04), ('classi', 0.04), ('parentheses', 0.039), ('maximizer', 0.039), ('ohsumed', 0.039), ('joachims', 0.039), ('solver', 0.039), ('reference', 0.039), ('minimized', 0.038), ('ranges', 0.037), ('argmax', 0.037), ('estimation', 0.037), ('teo', 0.037), ('hinge', 0.036), ('overall', 0.036), ('aligned', 0.036), ('concave', 0.036), ('prediction', 0.035), ('usps', 0.035), ('programming', 0.035), ('accuracy', 0.035), ('datasets', 0.034), ('acids', 0.034), ('instantiation', 0.034), ('losses', 0.032), ('rank', 0.032), ('actual', 0.032), ('web', 0.032), ('false', 0.032), ('ties', 0.031), ('modi', 0.031), ('dna', 0.03), ('monotonic', 0.03), ('parsing', 0.03), ('cation', 0.03), ('clean', 0.029), ('truncation', 0.029), ('structure', 0.029), ('proportion', 0.029), ('collaborative', 0.029), ('binary', 0.028), ('letter', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 239 nips-2008-Tighter Bounds for Structured Estimation

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

2 0.1645274 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

Author: Tao Qin, Tie-yan Liu, Xu-dong Zhang, De-sheng Wang, Hang Li

Abstract: This paper studies global ranking problem by learning to rank methods. Conventional learning to rank methods are usually designed for ‘local ranking’, in the sense that the ranking model is defined on a single object, for example, a document in information retrieval. For many applications, this is a very loose approximation. Relations always exist between objects and it is better to define the ranking model as a function on all the objects to be ranked (i.e., the relations are also included). This paper refers to the problem as global ranking and proposes employing a Continuous Conditional Random Fields (CRF) for conducting the learning task. The Continuous CRF model is defined as a conditional probability distribution over ranking scores of objects conditioned on the objects. It can naturally represent the content information of objects as well as the relation information between objects, necessary for global ranking. Taking two specific information retrieval tasks as examples, the paper shows how the Continuous CRF method can perform global ranking better than baselines.

3 0.13448852 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari

Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1

4 0.13370547 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

Author: Alexandre Bouchard-côté, Dan Klein, Michael I. Jordan

Abstract: Accurate and efficient inference in evolutionary trees is a central problem in computational biology. While classical treatments have made unrealistic site independence assumptions, ignoring insertions and deletions, realistic approaches require tracking insertions and deletions along the phylogenetic tree—a challenging and unsolved computational problem. We propose a new ancestry resampling procedure for inference in evolutionary trees. We evaluate our method in two problem domains—multiple sequence alignment and reconstruction of ancestral sequences—and show substantial improvement over the current state of the art. 1

5 0.12300853 224 nips-2008-Structured ranking learning using cumulative distribution networks

Author: Jim C. Huang, Brendan J. Frey

Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1

6 0.12026633 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

7 0.10651966 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

8 0.10635205 113 nips-2008-Kernelized Sorting

9 0.1055593 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

10 0.10484795 228 nips-2008-Support Vector Machines with a Reject Option

11 0.10323198 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

12 0.10004342 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

13 0.097439133 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking

14 0.092805482 85 nips-2008-Fast Rates for Regularized Objectives

15 0.091426611 40 nips-2008-Bounds on marginal probability distributions

16 0.090835035 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

17 0.08742775 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

18 0.081484646 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

19 0.080223218 196 nips-2008-Relative Margin Machines

20 0.077027172 238 nips-2008-Theory of matching pursuit


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.233), (1, -0.061), (2, -0.159), (3, 0.03), (4, -0.09), (5, 0.011), (6, 0.095), (7, -0.069), (8, 0.033), (9, 0.035), (10, -0.033), (11, 0.113), (12, 0.021), (13, 0.107), (14, 0.107), (15, 0.079), (16, 0.108), (17, -0.047), (18, 0.024), (19, 0.168), (20, -0.013), (21, 0.039), (22, -0.139), (23, -0.062), (24, 0.017), (25, 0.057), (26, -0.22), (27, 0.067), (28, -0.087), (29, -0.033), (30, -0.047), (31, -0.143), (32, -0.002), (33, 0.03), (34, -0.041), (35, -0.018), (36, -0.09), (37, 0.051), (38, 0.006), (39, 0.01), (40, -0.041), (41, 0.081), (42, 0.049), (43, -0.004), (44, 0.126), (45, 0.072), (46, 0.009), (47, -0.055), (48, -0.008), (49, -0.104)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95041323 239 nips-2008-Tighter Bounds for Structured Estimation

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

2 0.6393227 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

Author: Paolo Frasconi, Andrea Passerini

Abstract: Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75%/46% correct ligand-ion assignments, which improves to 88%/88% in the setting where the metal binding state is known. 1

3 0.63146859 85 nips-2008-Fast Rates for Regularized Objectives

Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro

Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1

4 0.5993557 224 nips-2008-Structured ranking learning using cumulative distribution networks

Author: Jim C. Huang, Brendan J. Frey

Abstract: Ranking is at the heart of many information retrieval applications. Unlike standard regression or classification in which we predict outputs independently, in ranking we are interested in predicting structured outputs so that misranking one object can significantly affect whether we correctly rank the other objects. In practice, the problem of ranking involves a large number of objects to be ranked and either approximate structured prediction methods are required, or assumptions of independence between object scores must be made in order to make the problem tractable. We present a probabilistic method for learning to rank using the graphical modelling framework of cumulative distribution networks (CDNs), where we can take into account the structure inherent to the problem of ranking by modelling the joint cumulative distribution functions (CDFs) over multiple pairwise preferences. We apply our framework to the problem of document retrieval in the case of the OHSUMED benchmark dataset. We will show that the RankNet, ListNet and ListMLE probabilistic models can be viewed as particular instances of CDNs and that our proposed framework allows for the exploration of a broad class of flexible structured loss functionals for learning to rank. 1

5 0.5833562 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1

6 0.58140093 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

7 0.57274324 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking

8 0.53804982 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

9 0.53058046 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

10 0.52582431 185 nips-2008-Privacy-preserving logistic regression

11 0.51710105 228 nips-2008-Support Vector Machines with a Reject Option

12 0.50197357 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

13 0.49193987 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks

14 0.49082878 196 nips-2008-Relative Margin Machines

15 0.45067075 113 nips-2008-Kernelized Sorting

16 0.45057115 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees

17 0.433364 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

18 0.42492086 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

19 0.40999308 78 nips-2008-Exact Convex Confidence-Weighted Learning

20 0.40600246 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.199), (6, 0.101), (7, 0.084), (12, 0.028), (15, 0.017), (28, 0.171), (57, 0.054), (59, 0.015), (63, 0.03), (71, 0.036), (77, 0.034), (78, 0.01), (83, 0.104), (87, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94078732 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

Author: Massih Amini, Nicolas Usunier, François Laviolette

Abstract: We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associate Gibbs classifier. The bound is tight when so is the Gibbs’s bound and when the errors of the majority vote classifier is concentrated on a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search the decision boundary on low density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples for whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the set of unlabeled training examples that have their margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually. 1

same-paper 2 0.84896475 239 nips-2008-Tighter Bounds for Structured Estimation

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

3 0.76604551 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

4 0.76397002 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Author: Tong Zhang

Abstract: Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory. 1

5 0.76309514 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

6 0.76296198 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

7 0.76187289 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

8 0.76015139 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

9 0.75778532 62 nips-2008-Differentiable Sparse Coding

10 0.75691563 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

11 0.75634754 143 nips-2008-Multi-label Multiple Kernel Learning

12 0.75620103 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

13 0.75355774 196 nips-2008-Relative Margin Machines

14 0.7516197 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

15 0.75089562 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

16 0.75039274 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

17 0.74821913 75 nips-2008-Estimating vector fields using sparse basis field expansions

18 0.74532425 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

19 0.74528486 226 nips-2008-Supervised Dictionary Learning

20 0.74526745 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization