nips nips2002 nips2002-19 knowledge-graph by maker-knowledge-mining

19 nips-2002-Adapting Codes and Embeddings for Polychotomies


Source: pdf

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. [sent-9, score-0.398]

2 Differently from many previous approaches we learn the code as well as the embedding function. [sent-11, score-0.76]

3 We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. [sent-12, score-0.335]

4 To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting. [sent-13, score-0.311]

5 However, when comparing the reported performance of these systems with the de-facto standard of using two-class techniques in combination with simple, fixed output codes to solve multi-class problems, they often lack in terms of performance, ease of optimization, and/or run-time behavior. [sent-19, score-0.389]

6 On the other hand, many methods have been proposed to apply binary classifiers to multiclass problems, such as Error Correcting Output Codes (ECOC) [6, 1], Pairwise Coupling [9], or by simply reducing the problem of discriminating classes to “one vs. [sent-20, score-0.439]

7 These problems can, in reverse order of their appearance, be understood as more and more refined variants of a multi-variate regression, i. [sent-26, score-0.171]

8 However, most previous work is either focused on finding a good embedding given a fixed code or just optimizing the code, given a fixed embedding (cf. [sent-29, score-1.057]

9 One could try to cast this as a standard multi-class problem by assigning each training sequence to the structure ranked highest. [sent-37, score-0.158]

10 But then, there will be classes to which only very few or no sequences are assigned and one can obviously hardly learn using traditional techniques. [sent-38, score-0.145]

11 The machine we propose is (at least in principle) able to solve problems like this by reflecting relations between classes in the way the code book is constructed and at the same time trying to find an embedding of the data space into the code space that allows for a good discrimination. [sent-39, score-1.806]

12 Then we discuss the approaches of [4] and [21] and propose to learn the code book. [sent-41, score-0.56]

13 In Section 3 we propose a rather general idea to solve resulting multi-class problems using two-class classifiers. [sent-42, score-0.219]

14 for multi-class problems where denotes the number of classes, or for a ranking problem), and let be a training sample of size , with . [sent-47, score-0.24]

15 Output Coding It is well known (see [6, 1] and references therein) that multi-class problems can be solved by decomposing a polychotomy into dichotomies and solving these separately using a two-class technique. [sent-50, score-0.198]

16 This can be understood as assigning to each class a binary string of length which is called a code word. [sent-51, score-0.758]

17 Now each of the columns of this matrix defines a partitioning of classes into two subsets, forming binary problems for which a classifier is trained. [sent-53, score-0.324]

18 Evaluation is done by computing the output of all learned functions, forming a new bit-string, and then choosing the class such that some distance measure between this string and the corresponding row of the code matrix is minimal, usually the Hamming distance. [sent-54, score-0.745]
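As a minimal illustration of this decoding step (our own sketch, not code from the paper; the code matrix and classifier outputs below are made up):

```python
import numpy as np

def ecoc_decode(outputs, code_matrix):
    """Pick the class whose code word is closest in Hamming distance
    to the vector of binary classifier outputs (entries in {-1, +1})."""
    distances = np.sum(code_matrix != outputs, axis=1)  # disagreements per class
    return int(np.argmin(distances))

# Hypothetical 4-class problem with a 6-bit code matrix
code_matrix = np.array([[+1, +1, +1, -1, -1, -1],
                        [+1, -1, -1, +1, +1, -1],
                        [-1, +1, -1, +1, -1, +1],
                        [-1, -1, +1, -1, +1, +1]])
outputs = np.array([+1, -1, -1, +1, -1, -1])  # one of the six classifiers erred
print(ecoc_decode(outputs, code_matrix))      # -> 1, despite the single bit error
```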

19 Since the codes for each class must be unique, there are (for ) possible code matrices to choose from. [sent-56, score-0.852]

20 One possibility is to choose the codes to be error-correcting (ECOC) [6]. [sent-57, score-0.279]

21 large Hamming distance between the code words, such that one still gets the correct decision even if a few of the classifiers err. [sent-60, score-0.637]

22 However, finding the code that minimizes the training error is NP-complete, even for fixed binary classifiers [4]. [sent-61, score-0.62]

23 Furthermore, errors committed by the binary classifiers are not necessarily independent, significantly reducing the effective number of wrong bits that one can handle [18, 19]. [sent-62, score-0.112]

24 Nonetheless ECOC has proven useful and algorithms for finding a good code (and partly also finding the corresponding classifiers) have been proposed. We could also use ternary codes, i. [sent-63, score-0.516]

25 Notably, most practical approaches suggest dropping the requirement of binary codes, and instead propose using continuous ones. [sent-70, score-0.117]

26 Hamming) distance to their appropriated code words can be related to a large margin classifier, beginning with binary classification. [sent-73, score-0.936]

27 1 Large Margins Dichotomies Here a large margin classifier is defined as a mapping with the property that , or more specifically with , where is some positive constant [20]. [sent-75, score-0.218]

28 Here is a regularization term, is a regularization constant and denotes the class of functions under consideration. [sent-78, score-0.185]

29 In other words, we can express the margin as the difference between the distance of from the target and the target . [sent-80, score-0.373]

30 Polychotomies While this insight by itself is not particularly useful, it paves the way for an extension of the notion of the margin to multi-class problems: denote by a distance measure and by , ( is the length of the code) target vectors corresponding to class . [sent-81, score-0.424]

31 We obtain accordingly the following optimization problem: minimize the regularizer plus the weighted sum of the slack variables, subject to the constraint that for each example the distance of its embedding to every incorrect code word exceeds the distance to the correct code word by at least the reference margin minus that example's slack. [sent-84, score-0.173]

32 For the time being we chose as a reference margin — an adaptive means of choosing the reference margin can be implemented using the ν-trick, which leads to an easier-to-control regularization parameter [16]. [sent-86, score-0.5]

33 However, one can show that only and related functions will lead to a convex constraint on : is convex in is symmetric. [sent-90, score-0.279]

34 Lemma 1 implies that any distance function other than the ones described above will lead to optimization problems with potentially many local minima, which is not desirable. [sent-95, score-0.357]

35 However, for quadratic distances we will get a convex optimization problem (assuming suitable ). [sent-96, score-0.269]

36 Then there are ways to efficiently solve (3). [sent-97, score-0.148]

37 We obtain (4); note that if the code words have the same length, the difference of the projections of the embedded example onto different code words determines the margin. [sent-99, score-0.231]
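To make the quadratic-distance case concrete, here is a hedged sketch (our own illustration; the code book and embedded point are arbitrary): with d(f, c) = ||f − c||², the margin of an example is the smallest difference between its distance to any incorrect code word and its distance to the correct one, and for code words of equal length this reduces to differences of inner products.

```python
import numpy as np

def multiclass_margin(fx, codes, y):
    """Margin of one embedded example fx (shape (d,)) with respect to the
    code words codes (shape (k, d)) under the squared Euclidean distance:
    min over incorrect classes r of d(fx, c_r) - d(fx, c_y)."""
    d = np.sum((codes - fx) ** 2, axis=1)  # squared distances to all code words
    wrong = np.delete(d, y)                # distances to the incorrect classes
    return float(np.min(wrong) - d[y])     # positive iff fx is closest to c_y

# Hypothetical 3-class code book in a 2-dimensional code space
codes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
fx = np.array([0.8, 0.1])                  # embedded example, assumed class 0
print(multiclass_margin(fx, codes, y=0))   # positive -> correctly classified with a margin
```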

38 We will indeed later consider a more convenient case: , which will lead to linear constraints only and allows us to use standard optimization packages. [sent-100, score-0.251]

39 This means that during optimization we are trying to find functions , , from an dimensional subspace. [sent-104, score-0.134]

40 In other words, we choose the subspace and perform regularization by allowing only a smaller class of functions. [sent-105, score-0.16]

41 3 Discussion and Relation to Previous Approaches Note that for we have that (4) is equal and hence the problem of multi-class classification reverts to the problem of solving binary classification problems of one vs. [sent-108, score-0.274]

42 Here, the function is held fixed and the code is optimized. [sent-112, score-0.55]

43 In their approach, the code is described as a vector in a kernel feature space and one obtains in fact an optimization problem very similar to the one in [21] and (3) (again, the slack variables are defined slightly differently). [sent-113, score-0.684]

44 The resulting optimization problem turns out to be convex, but with the drawback that one can either not fully optimize the code vectors or not guarantee that they are well separated. [sent-115, score-0.771]

45 Since these approaches were motivated by different ideas (one optimizing the code, the other optimizing the embedding), this shows that the role of the code and the embedding function is interchangeable if the function or the code, respectively, is fixed. [sent-116, score-0.866]

46 Our approach allows arbitrary codes for which a function is learned. [sent-117, score-0.271]

47 The position of the code words (=“class centers”) determines the function . [sent-119, score-0.568]

48 The position of the centers relative to each other may reflect relationships between the classes (e. [sent-120, score-0.103]

49 classes “black” & “white” and “white” & “grey” are close). [sent-122, score-0.103]

50 The spatial organization of the code book vectors reflects the organization of classes in the space. [sent-124, score-0.908]

51 4 Learning Code & Embedding This leaves us with the question of how to determine a “good” code and a suitable . [sent-126, score-0.516]

52 As we can see from (4), for fixed the constraints are linear in and vice versa, yet we have non-convex constraints if both and are variable. [sent-127, score-0.154]

53 Finding the global optimum is therefore computationally infeasible when optimizing and simultaneously (furthermore note that any rotation applied to and will leave the margin invariant, which shows the presence of local minima due to equivalent codes). [sent-128, score-0.315]

54 Instead, we propose the following method: for a fixed code optimize over the embedding, and subsequently, for a fixed embedding, optimize over the code, possibly repeating the process. [sent-129, score-0.668]
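A hedged sketch of this alternating scheme (the two inner solvers fit_embedding and fit_codes are placeholder names standing in for the convex sub-problems, not routines from the paper):

```python
def learn_code_and_embedding(X, y, init_codes, fit_embedding, fit_codes, n_rounds=5):
    """Alternate between the two convex sub-problems: with the code book held
    fixed, fit the embedding; with the embedding held fixed, re-fit the codes."""
    codes, f = init_codes, None
    for _ in range(n_rounds):
        f = fit_embedding(X, y, codes)  # codes fixed -> convex in the embedding
        codes = fit_codes(X, y, f)      # embedding fixed -> convex in the codes
    return codes, f
```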

55 Both steps separately can be performed fairly efficiently (since the optimization problems are convex; cf. [sent-133, score-0.215]

56 We now show how a code maximizing the margin can be found. [sent-136, score-0.734]

57 To avoid a trivial solution (we may virtually increase the margin by rescaling all by some constant), we add to the objective function. [sent-137, score-0.218]

58 It can be shown that one does not need an additional regularization constant in front of this term if the distance is linear in both arguments. [sent-138, score-0.141]

59 One can show that the probability that there exist two such vectors (out of ) that have a smaller distance than is bounded by (proof given in the full paper). [sent-144, score-0.11]

60 Hence, with probability greater than the random code vectors have distances greater than from each other. [sent-145, score-0.549]
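As a hedged illustration of such a random initialization (our own sketch; the class count and code length are arbitrary), one can draw unit-norm code vectors and check that their minimum pairwise distance stays away from zero:

```python
import numpy as np

def random_code_book(n_classes, code_length, seed=0):
    """Draw unit-norm random code vectors and report their minimum
    pairwise distance as a sanity check on the separation argument."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n_classes, code_length))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    min_dist = dists[~np.eye(n_classes, dtype=bool)].min()
    return C, float(min_dist)

codes, d_min = random_code_book(n_classes=10, code_length=32)
print(round(d_min, 3))  # typically well above zero when code_length >> log(n_classes)
```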

61 3 Column Generation for Finding the Embedding There are several ways to set up and optimize the resulting optimization problem (3). [sent-148, score-0.222]

62 For instance in [21, 4] the class of functions is the set of hyperplanes in some kernel feature space and the regularizer is the sum of the -norms of the hyperplane normal vectors. [sent-149, score-0.1]

63 The idea of column generation is to start with a restricted master problem, namely without the variables (i. [sent-159, score-0.212]

64 Then one solves the corresponding dual problem (7) and then finds the hypothesis that corresponds to a violated constraint (and also one primal variable). [sent-161, score-0.294]

65 This hypothesis is included in the optimization problem; one re-solves and finds the next violated constraint. [sent-162, score-0.291]
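A hedged sketch of this loop (the two callables are placeholders: one stands for solving the restricted dual, the other for running the weighted two-class base learner and reporting the largest constraint violation):

```python
def column_generation(solve_restricted_dual, find_violating_hypothesis,
                      tol=1e-6, max_iters=200):
    """Generic column generation loop: solve the restricted master problem,
    add the hypothesis of a violated constraint, and re-solve until no
    constraint is violated (then the dual problem is optimal)."""
    hypotheses = []
    for _ in range(max_iters):
        duals = solve_restricted_dual(hypotheses)        # weights on the examples
        h, violation = find_violating_hypothesis(duals)  # base learner call
        if violation <= tol:                             # no violated constraint left
            break
        hypotheses.append(h)                             # add the new column, re-solve
    return hypotheses
```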

66 By using the described column generation technique one can, however, find the solution of this semi-infinite programming problem [13]. [sent-166, score-0.246]

67 Maximizing (8) with respect to is easy for a given : , one chooses ; if , then and for for one chooses the minimizing unit vector. [sent-172, score-0.098]

68 Only if one cannot find a hypothesis that violates a constraint does one employ the more sophisticated techniques suggested in [15]. [sent-176, score-0.11]

69 If there is no hypothesis left that corresponds to a violated constraint, the dual optimization problem is optimal. [sent-177, score-0.384]

70 Then we can use another learning algorithm for the problem of finding a hypothesis that minimizes or approximately minimizes the training error of a weighted training set (rewrite (8)). [sent-179, score-0.096]

71 Following the ideas in [14] one can show that there is a close relationship between our technique using the trivial code and the multi-class boosting algorithms as e. [sent-181, score-0.557]

72 1 A first Experiment In a preliminary set of experiments we use two benchmark data sets from the UCI benchmark repository: glass and iris. [sent-185, score-0.137]

73 We used our column generation strategy as described in Section 3 in conjunction with the code optimization problem to solve the combined optimization problem to find the code and the embedding. [sent-186, score-1.592]

74 As base learning algorithm we chose decision trees (C4. [sent-191, score-0.095]

75 5) which we only use as a two-class classifier in our column generation algorithm. [sent-192, score-0.171]

76 We also computed the test error of multiclass decision trees and obtained error. [sent-195, score-0.285]

77 On the iris data we could achieve an error rate of and could slightly improve the result of decision trees ( ). [sent-197, score-0.095]

78 We conjecture that this is due to the properties of decision trees which have problems generating smooth boundaries not aligned with coordinate axes. [sent-199, score-0.176]

79 It is particularly interesting to find practical examples where a non-trivial choice of the code (via optimization) helps simplify the embedding and finally leads to additional improvements. [sent-201, score-0.76]

80 Preliminary results indicate that one can achieve considerable improvements when adapting codes and embeddings [3]. [sent-203, score-0.314]

81 Shown is the decision boundary and the confidence for assigning a sample to the upper left class. [sent-205, score-0.098]

82 Instead, we used (9) with the information that each example, besides belonging to its own class with confidence two, also belongs to the other classes with confidence one iff its distance to the respective center is less than one. [sent-207, score-0.237]

83 We have the sets , which contain all pairs of “relations” between contains all pairs of positive and negative classes of the positive classes. [sent-217, score-0.103]

84 In this formulation one tries to find a code and an embedding , such that for each example the output wrt. [sent-222, score-0.889]

85 each class this example has a relation with, reflects the order of these relations (i. [sent-223, score-0.095]

86 Furthermore, the program tries to achieve a “large margin” between relevant and irrelevant classes for each sample. [sent-226, score-0.136]

87 Optimization of (9) is analogous to the column generation approach discussed in Section 3. [sent-228, score-0.171]

88 Connection to Ranking Techniques Ordinal regression through large margins [10] can be seen as an extreme case of (9), where we have as many classes as observations, and each pair of observations has to satisfy a ranking relation , if is to be preferred to . [sent-231, score-0.302]

89 This formulation can of course also be understood as a special case of multi-dimensional regression. [sent-232, score-0.093]

90 #" 4 #  @ ¨ $' 4 "   @ ' # "   ¦ 4 " 5 Conclusion We proposed an algorithm to simultaneously optimize output codes and the embedding of the sample into the code book space building upon the notion of large margins. [sent-233, score-1.295]

91 Furthermore, we have shown that only quadratic and related distance measures in the code book space will lead to convex constraints and hence convex optimization problems whenever either the code or the embedding is held fixed. [sent-234, score-2.07]

92 This is desirable since at least for these sub-problems there exist fairly efficient techniques to compute these (of course the combined optimization problem of finding the code and the embedding is not convex and has local minima). [sent-235, score-1.064]

93 We proposed a column generation technique for solving the embedding optimization problems. [sent-236, score-0.642]

94 Finally we proposed a technique along the same lines that should be favorable when dealing with many classes or even empty classes. [sent-238, score-0.144]

95 Future work will concentrate on finding more efficient algorithms to solve the optimization problem and on more carefully evaluating their performance. [sent-239, score-0.221]

96 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-249, score-0.408]

97 On the learnability and design of output codes for multiclass problems. [sent-271, score-0.491]

98 Constraint classification: A new approach to multiclass classification and ranking. [sent-301, score-0.19]

99 Sparse regression ensembles in infinite and finite hypothesis spaces. [sent-349, score-0.108]

100 Stochastic organization of output codes in multiclass learning problems. [sent-392, score-0.529]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('code', 0.516), ('embedding', 0.244), ('codes', 0.24), ('margin', 0.218), ('multiclass', 0.19), ('book', 0.18), ('da', 0.155), ('classi', 0.153), ('optimization', 0.134), ('ranking', 0.128), ('generation', 0.105), ('classes', 0.103), ('convex', 0.101), ('hc', 0.095), ('nding', 0.082), ('violated', 0.082), ('problems', 0.081), ('distance', 0.077), ('ecoc', 0.077), ('hh', 0.076), ('hypothesis', 0.075), ('binary', 0.073), ('ers', 0.067), ('column', 0.066), ('demiriz', 0.065), ('dichotomies', 0.065), ('polychotomies', 0.065), ('regularization', 0.064), ('tsch', 0.061), ('output', 0.061), ('dual', 0.059), ('mika', 0.059), ('understood', 0.058), ('dence', 0.058), ('class', 0.057), ('hamming', 0.057), ('ordinal', 0.056), ('assigning', 0.054), ('optimize', 0.054), ('optimizing', 0.053), ('solve', 0.053), ('constraints', 0.053), ('words', 0.052), ('solving', 0.052), ('trees', 0.051), ('chooses', 0.049), ('colt', 0.045), ('minima', 0.044), ('francisco', 0.044), ('constraint', 0.044), ('propose', 0.044), ('decision', 0.044), ('embeddings', 0.043), ('ranks', 0.043), ('regularizer', 0.043), ('proof', 0.042), ('assigned', 0.042), ('idea', 0.041), ('technique', 0.041), ('nips', 0.04), ('nd', 0.04), ('nds', 0.04), ('target', 0.039), ('fh', 0.039), ('ranked', 0.039), ('williamson', 0.039), ('reducing', 0.039), ('cation', 0.039), ('choose', 0.039), ('relations', 0.038), ('hp', 0.038), ('thousand', 0.038), ('formulations', 0.038), ('margins', 0.038), ('lemma', 0.038), ('organization', 0.038), ('smola', 0.037), ('xed', 0.036), ('preliminary', 0.036), ('rank', 0.036), ('glass', 0.035), ('techniques', 0.035), ('formulation', 0.035), ('forming', 0.034), ('held', 0.034), ('morgan', 0.034), ('san', 0.034), ('problem', 0.034), ('benchmark', 0.033), ('partitioning', 0.033), ('tries', 0.033), ('er', 0.033), ('regression', 0.033), ('vectors', 0.033), ('lead', 0.033), ('implies', 0.032), ('variants', 0.032), ('allows', 0.031), ('training', 0.031), ('adapting', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

2 0.36089295 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

Author: Ofer Dekel, Yoram Singer

Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

3 0.2537753 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

4 0.22251518 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

5 0.1761978 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

6 0.16781457 58 nips-2002-Conditional Models on the Ranking Poset

7 0.15244241 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.14698999 156 nips-2002-On the Complexity of Learning the Kernel Matrix

9 0.14095148 140 nips-2002-Margin Analysis of the LVQ Algorithm

10 0.13742982 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

11 0.12932242 42 nips-2002-Bias-Optimal Incremental Problem Solving

12 0.12839547 161 nips-2002-PAC-Bayes & Margins

13 0.1254918 10 nips-2002-A Model for Learning Variance Components of Natural Images

14 0.11854792 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

15 0.11380477 98 nips-2002-Going Metric: Denoising Pairwise Data

16 0.11366791 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

17 0.11080026 92 nips-2002-FloatBoost Learning for Classification

18 0.11013524 45 nips-2002-Boosted Dyadic Kernel Discriminants

19 0.093644932 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

20 0.092466101 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.317), (1, -0.168), (2, 0.07), (3, -0.059), (4, 0.257), (5, -0.03), (6, -0.046), (7, -0.17), (8, -0.072), (9, 0.007), (10, -0.011), (11, -0.072), (12, -0.071), (13, -0.245), (14, 0.021), (15, 0.162), (16, 0.138), (17, 0.154), (18, -0.088), (19, -0.116), (20, 0.018), (21, -0.179), (22, -0.092), (23, 0.146), (24, 0.051), (25, -0.038), (26, 0.121), (27, 0.089), (28, 0.032), (29, 0.166), (30, 0.053), (31, -0.033), (32, 0.096), (33, -0.037), (34, -0.009), (35, 0.028), (36, 0.01), (37, 0.024), (38, 0.099), (39, 0.022), (40, -0.056), (41, 0.014), (42, 0.001), (43, -0.03), (44, -0.041), (45, -0.029), (46, 0.088), (47, 0.037), (48, -0.029), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97359103 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

2 0.84749973 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

Author: Ofer Dekel, Yoram Singer

Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

3 0.7580182 58 nips-2002-Conditional Models on the Ranking Poset

Author: Guy Lebanon, John D. Lafferty

Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.

4 0.64336711 42 nips-2002-Bias-Optimal Incremental Problem Solving

Author: Jürgen Schmidhuber

Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary disk Towers of Hanoi tasks (minimal solution size ). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (LSEARCH) View the -th binary string as a potential program for a universal Turing machine. Given some problem, for all do: every steps on average execute (if possible) one instruction of the -th program candidate, until one of the programs has computed a solution.

5 0.64283806 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

6 0.57331395 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

7 0.54565299 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.50322491 140 nips-2002-Margin Analysis of the LVQ Algorithm

9 0.44761527 67 nips-2002-Discriminative Binaural Sound Localization

10 0.44019449 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

11 0.43031281 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

12 0.40517834 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

13 0.39350146 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

14 0.38448688 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

15 0.3796806 45 nips-2002-Boosted Dyadic Kernel Discriminants

16 0.37417647 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

17 0.35574597 6 nips-2002-A Formulation for Minimax Probability Machine Regression

18 0.35285532 92 nips-2002-FloatBoost Learning for Classification

19 0.34989157 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis

20 0.3484256 190 nips-2002-Stochastic Neighbor Embedding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.013), (23, 0.014), (42, 0.066), (52, 0.031), (54, 0.151), (55, 0.339), (57, 0.01), (67, 0.028), (68, 0.027), (74, 0.08), (92, 0.05), (98, 0.122)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98539311 118 nips-2002-Kernel-Based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images

Author: Alistair Bray, Dominique Martinez

Abstract: In Slow Feature Analysis (SFA [1]), it has been demonstrated that high-order invariant properties can be extracted by projecting inputs into a nonlinear space and computing the slowest changing features in this space; this has been proposed as a simple general model for learning nonlinear invariances in the visual system. However, this method is highly constrained by the curse of dimensionality which limits it to simple theoretical simulations. This paper demonstrates that by using a different but closely-related objective function for extracting slowly varying features ([2, 3]), and then exploiting the kernel trick, this curse can be avoided. Using this new method we show that both the complex cell properties of translation invariance and disparity coding can be learnt simultaneously from natural images when complex cells are driven by simple cells also learnt from the image. The notion of maximising an objective function based upon the temporal predictability of output has been progressively applied in modelling the development of invariances in the visual system. F6ldiak used it indirectly via a Hebbian trace rule for modelling the development of translation invariance in complex cells [4] (closely related to many other models [5,6,7]); this rule has been used to maximise invariance as one component of a hierarchical system for object and face recognition [8]. On the other hand, similar functions have been maximised directly in networks for extracting linear [2] and nonlinear [9, 1] visual invariances. Direct maximisation of such functions have recently been used to model complex cells [10] and as an alternative to maximising sparseness/independence in modelling simple cells [11]. Slow Feature Analysis [1] combines many of the best properties of these methods to provide a good general nonlinear model. That is, it uses an objective function that minimises the first-order temporal derivative of the outputs; it provides a closedform solution which maximises this function by projecting inputs into a nonlinear http://www.loria.fr/equipes/cortex/ space; it exploits sphering (or PCA-whitening) of the data to ensure that all outputs have unit variance and are uncorrelated. However, the method suffers from the curse of dimensionality in that the nonlinear feature space soon becomes very large as the input dimension grows, and yet this feature space must be represented explicitly in order for the essential sphering to occur. The alternative that we propose here is to use the objective function of Stone [2, 9], that maximises output variance over a long period whilst minimising variance over a shorter period; in the linear case, this can be implemented by a biologically plausible mixture of Hebbian and anti-Hebbian learning on the same synapses [2]. In recent work, Stone has proposed a closed-form solution for maximising this function in the linear domain of blind source separation that does not involve data-sphering. This paper describes how this method can be kernelised. The use of the

same-paper 2 0.9576506 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

3 0.94860464 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement

Author: Emanuel Todorov, Michael I. Jordan

Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.

4 0.93556982 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting

Author: Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, Roderick Murray-Smith

Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. -step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form , the prediction of at time is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.

5 0.81325638 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

Author: Jarmo Hurri, Aapo Hyvärinen

Abstract: We show that two important properties of the primary visual cortex emerge when the principle of temporal coherence is applied to natural image sequences. The properties are simple-cell-like receptive fields and complex-cell-like pooling of simple cell outputs, which emerge when we apply two different approaches to temporal coherence. In the first approach we extract receptive fields whose outputs are as temporally coherent as possible. This approach yields simple-cell-like receptive fields (oriented, localized, multiscale). Thus, temporal coherence is an alternative to sparse coding in modeling the emergence of simple cell receptive fields. The second approach is based on a two-layer statistical generative model of natural image sequences. In addition to modeling the temporal coherence of individual simple cells, this model includes inter-cell temporal dependencies. Estimation of this model from natural data yields both simple-cell-like receptive fields, and complex-cell-like pooling of simple cell outputs. In this completely unsupervised learning, both layers of the generative model are estimated simultaneously from scratch. This is a significant improvement on earlier statistical models of early vision, where only one layer has been learned, and others have been fixed a priori.

6 0.79122025 10 nips-2002-A Model for Learning Variance Components of Natural Images

7 0.77504373 2 nips-2002-A Bilinear Model for Sparse Coding

8 0.77077579 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

9 0.72935867 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

10 0.72225398 119 nips-2002-Kernel Dependency Estimation

11 0.72029084 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

12 0.71353638 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

13 0.70942128 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

14 0.70501596 106 nips-2002-Hyperkernels

15 0.70360088 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

16 0.69652951 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives

17 0.69519019 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

18 0.69245976 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

19 0.6919772 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

20 0.69175643 110 nips-2002-Incremental Gaussian Processes