nips nips2002 nips2002-19 knowledge-graph by maker-knowledge-mining

19 nips-2002-Adapting Codes and Embeddings for Polychotomies


Source: pdf

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. [sent-9, score-0.398]

2 Differently from many previous approaches we learn the code as well as the embedding function. [sent-11, score-0.76]

3 We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. [sent-12, score-0.335]

4 To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting. [sent-13, score-0.311]

5 However, when comparing the reported performance of these systems with the de-facto standard of using two-class techniques in combination with simple, fixed output codes to solve multi-class problems, they often lack in terms of performance, ease of optimization, and/or run-time behavior. [sent-19, score-0.389]

6 On the other hand, many methods have been proposed to apply binary classifiers to multiclass problems, such as Error Correcting Output Codes (ECOC) [6, 1], Pairwise Coupling [9], or by simply reducing the problem of discriminating classes to “one vs. [sent-20, score-0.439]

7 These problems can, in reverse order of their appearance, be understood as more and more refined variants of a multi-variate regression, i. [sent-26, score-0.171]

8 However, most previous work is either focused on finding a good embedding given a fixed code or just optimizing the code, given a fixed embedding (cf. [sent-29, score-1.057]

9 One could try to cast this as a standard multi-class problem by assigning each training sequence to the structure ranked highest. [sent-37, score-0.158]

10 But then, there will be classes to which only very few or no sequences are assigned and one can obviously hardly learn using traditional techniques. [sent-38, score-0.145]

11 The machine we propose is (at least in principle) able to solve problems like this by reflecting relations between classes in the way the code book is constructed and at the same time trying to find an embedding of the data space into the code space that allows for a good discrimination. [sent-39, score-1.806]

12 Then we discuss the approaches of [4] and [21] and propose to learn the code book. [sent-41, score-0.56]

13 In Section 3 we propose a rather general idea to solve resulting multi-class problems using two-class classifiers. [sent-42, score-0.219]

14 for multi-class problems where denotes the number of classes, or for a ranking problem), and let be a training sample of size , with . [sent-47, score-0.24]

15 Output Coding It is well known (see [6, 1] and references therein) that multi-class problems can be solved by decomposing a polychotomy into dichotomies and solving these separately using a two-class technique. [sent-50, score-0.198]

16 This can be understood as assigning to each class a binary string of length which is called a code word. [sent-51, score-0.758]

17 Now each of the columns of this matrix defines a partitioning of classes into two subsets, forming binary problems for which a classifier is trained. [sent-53, score-0.324]

18 Evaluation is done by computing the output of all learned functions, forming a new bit-string, and then choosing the class such that some distance measure between this string and the corresponding row of the code matrix is minimal, usually the Hamming distance. [sent-54, score-0.745]
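As a minimal illustration of this decoding step (our own sketch, not code from the paper; the code matrix and classifier outputs below are made up):

```python
import numpy as np

def ecoc_decode(outputs, code_matrix):
    """Pick the class whose code word is closest in Hamming distance
    to the vector of binary classifier outputs (entries in {-1, +1})."""
    distances = np.sum(code_matrix != outputs, axis=1)  # disagreements per class
    return int(np.argmin(distances))

# Hypothetical 4-class problem with a 6-bit code matrix
code_matrix = np.array([[+1, +1, +1, -1, -1, -1],
                        [+1, -1, -1, +1, +1, -1],
                        [-1, +1, -1, +1, -1, +1],
                        [-1, -1, +1, -1, +1, +1]])
outputs = np.array([+1, -1, -1, +1, -1, -1])  # one of the six classifiers erred
print(ecoc_decode(outputs, code_matrix))      # -> 1, despite the single bit error
```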

19 Since the codes for each class must be unique, there are (for ) possible code matrices to choose from. [sent-56, score-0.852]

20 One possibility is to choose the codes to be error-correcting (ECOC) [6]. [sent-57, score-0.279]

21 large Hamming distance between the code words, such that one still gets the correct decision even if a few of the classifiers err. [sent-60, score-0.637]

22 However, finding the code that minimizes the training error is NP-complete, even for fixed binary classifiers [4]. [sent-61, score-0.62]

23 Furthermore, errors committed by the binary classifiers are not necessarily independent, significantly reducing the effective number of wrong bits that one can handle [18, 19]. [sent-62, score-0.112]

24 Nonetheless ECOC has proven useful and algorithms for finding a good code (and partly also finding the corresponding classifiers) have been proposed. We could also use ternary codes, i. [sent-63, score-0.516]

25 Notably, most practical approaches suggest dropping the requirement of binary codes, and instead propose using continuous ones. [sent-70, score-0.117]

26 Hamming) distance to their appropriated code words can be related to a large margin classifier, beginning with binary classification. [sent-73, score-0.936]

27 1 Large Margins Dichotomies Here a large margin classifier is defined as a mapping with the property that , or more specifically with , where is some positive constant [20]. [sent-75, score-0.218]

28 Here is a regularization term, is a regularization constant and denotes the class of functions under consideration. [sent-78, score-0.185]

29 In other words, we can express the margin as the difference between the distance of from the target and the target . [sent-80, score-0.373]

30 Polychotomies While this insight by itself is not particularly useful, it paves the way for an extension of the notion of the margin to multi-class problems: denote by a distance measure and by , ( is the length of the code) target vectors corresponding to class . [sent-81, score-0.424]

31 We obtain accordingly the following optimization problem: minimize the regularizer plus the weighted sum of the slack variables, subject to the constraint that for each example the distance of its embedding to every incorrect code word exceeds the distance to the correct code word by at least the reference margin minus that example's slack. [sent-84, score-0.173]

32 For the time being we chose as a reference margin — an adaptive means of choosing the reference margin can be implemented using the ν-trick, which leads to an easier-to-control regularization parameter [16]. [sent-86, score-0.5]

33 However, one can show that only and related functions will lead to a convex constraint on : is convex in is symmetric. [sent-90, score-0.279]

34 Lemma 1 implies that any distance function other than the ones described above will lead to optimization problems with potentially many local minima, which is not desirable. [sent-95, score-0.357]

35 However, for quadratic distances we will get a convex optimization problem (assuming suitable ). [sent-96, score-0.269]

36 Then there are ways to efficiently solve (3). [sent-97, score-0.148]

37 We obtain (4); note that if the code words have the same length, the difference of the projections of the embedded example onto different code words determines the margin. [sent-99, score-0.231]
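To make the quadratic-distance case concrete, here is a hedged sketch (our own illustration; the code book and embedded point are arbitrary): with d(f, c) = ||f − c||², the margin of an example is the smallest difference between its distance to any incorrect code word and its distance to the correct one, and for code words of equal length this reduces to differences of inner products.

```python
import numpy as np

def multiclass_margin(fx, codes, y):
    """Margin of one embedded example fx (shape (d,)) with respect to the
    code words codes (shape (k, d)) under the squared Euclidean distance:
    min over incorrect classes r of d(fx, c_r) - d(fx, c_y)."""
    d = np.sum((codes - fx) ** 2, axis=1)  # squared distances to all code words
    wrong = np.delete(d, y)                # distances to the incorrect classes
    return float(np.min(wrong) - d[y])     # positive iff fx is closest to c_y

# Hypothetical 3-class code book in a 2-dimensional code space
codes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
fx = np.array([0.8, 0.1])                  # embedded example, assumed class 0
print(multiclass_margin(fx, codes, y=0))   # positive -> correctly classified with a margin
```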

38 We will indeed later consider a more convenient case: , which will lead to linear constraints only and allows us to use standard optimization packages. [sent-100, score-0.251]

39 This means that during optimization we are trying to find functions , , from an dimensional subspace. [sent-104, score-0.134]

40 In other words, we choose the subspace and perform regularization by allowing only a smaller class of functions. [sent-105, score-0.16]

41 3 Discussion and Relation to Previous Approaches Note that for we have that (4) is equal and hence the problem of multi-class classification reverts to the problem of solving binary classification problems of one vs. [sent-108, score-0.274]

42 Here, the function is held fixed and the code is optimized. [sent-112, score-0.55]

43 In their approach, the code is described as a vector in a kernel feature space and one obtains in fact an optimization problem very similar to the one in [21] and (3) (again, the slack variables are defined slightly differently). [sent-113, score-0.684]

44 The resulting optimization problem turns out to be convex, but with the drawback that one can either not fully optimize the code vectors or not guarantee that they are well separated. [sent-115, score-0.771]

45 Since these approaches were motivated by different ideas (one optimizing the code, the other optimizing the embedding), this shows that the role of the code and the embedding function is interchangeable if the function or the code, respectively, is fixed. [sent-116, score-0.866]

46 Our approach allows arbitrary codes for which a function is learned. [sent-117, score-0.271]

47 The position of the code words (=“class centers”) determines the function . [sent-119, score-0.568]

48 The position of the centers relative to each other may reflect relationships between the classes (e. [sent-120, score-0.103]

49 classes “black” & “white” and “white” & “grey” are close). [sent-122, score-0.103]

50 The spatial organization of the code book vectors reflects the organization of classes in the space. [sent-124, score-0.908]

51 4 Learning Code & Embedding This leaves us with the question of how to determine a “good” code and a suitable . [sent-126, score-0.516]

52 As we can see from (4), for fixed the constraints are linear in and vice versa, yet we have non-convex constraints if both and are variable. [sent-127, score-0.154]

53 Finding the global optimum is therefore computationally infeasible when optimizing and simultaneously (furthermore note that any rotation applied to and will leave the margin invariant, which shows the presence of local minima due to equivalent codes). [sent-128, score-0.315]

54 Instead, we propose the following method: for a fixed code optimize over the embedding, and subsequently, for a fixed embedding, optimize over the code, possibly repeating the process. [sent-129, score-0.668]
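A hedged sketch of this alternating scheme (the two inner solvers fit_embedding and fit_codes are placeholder names standing in for the convex sub-problems, not routines from the paper):

```python
def learn_code_and_embedding(X, y, init_codes, fit_embedding, fit_codes, n_rounds=5):
    """Alternate between the two convex sub-problems: with the code book held
    fixed, fit the embedding; with the embedding held fixed, re-fit the codes."""
    codes, f = init_codes, None
    for _ in range(n_rounds):
        f = fit_embedding(X, y, codes)  # codes fixed -> convex in the embedding
        codes = fit_codes(X, y, f)      # embedding fixed -> convex in the codes
    return codes, f
```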

55 Both steps separately can be performed fairly efficiently (since the optimization problems are convex; cf. [sent-133, score-0.215]

56 We now show how a code maximizing the margin can be found. [sent-136, score-0.734]

57 To avoid a trivial solution (we may virtually increase the margin by rescaling all by some constant), we add to the objective function. [sent-137, score-0.218]

58 It can be shown that one does not need an additional regularization constant in front of this term if the distance is linear in both arguments. [sent-138, score-0.141]

59 One can show that the probability that there exist two such vectors (out of ) that have a smaller distance than is bounded by (proof given in the full paper). [sent-144, score-0.11]

60 Hence, with probability greater than the random code vectors have distances greater than from each other. [sent-145, score-0.549]
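As a hedged illustration of such a random initialization (our own sketch; the class count and code length are arbitrary), one can draw unit-norm code vectors and check that their minimum pairwise distance stays away from zero:

```python
import numpy as np

def random_code_book(n_classes, code_length, seed=0):
    """Draw unit-norm random code vectors and report their minimum
    pairwise distance as a sanity check on the separation argument."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n_classes, code_length))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    min_dist = dists[~np.eye(n_classes, dtype=bool)].min()
    return C, float(min_dist)

codes, d_min = random_code_book(n_classes=10, code_length=32)
print(round(d_min, 3))  # typically well above zero when code_length >> log(n_classes)
```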

61 3 Column Generation for Finding the Embedding There are several ways to set up and optimize the resulting optimization problem (3). [sent-148, score-0.222]

62 For instance in [21, 4] the class of functions is the set of hyperplanes in some kernel feature space and the regularizer is the sum of the -norms of the hyperplane normal vectors. [sent-149, score-0.1]

63 The idea of column generation is to start with a restricted master problem, namely without the variables (i. [sent-159, score-0.212]

64 Then one solves the corresponding dual problem (7) and then finds the hypothesis that corresponds to a violated constraint (and also one primal variable). [sent-161, score-0.294]

65 This hypothesis is included in the optimization problem; one re-solves and finds the next violated constraint. [sent-162, score-0.291]
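A hedged sketch of this loop (the two callables are placeholders: one stands for solving the restricted dual, the other for running the weighted two-class base learner and reporting the largest constraint violation):

```python
def column_generation(solve_restricted_dual, find_violating_hypothesis,
                      tol=1e-6, max_iters=200):
    """Generic column generation loop: solve the restricted master problem,
    add the hypothesis of a violated constraint, and re-solve until no
    constraint is violated (then the dual problem is optimal)."""
    hypotheses = []
    for _ in range(max_iters):
        duals = solve_restricted_dual(hypotheses)        # weights on the examples
        h, violation = find_violating_hypothesis(duals)  # base learner call
        if violation <= tol:                             # no violated constraint left
            break
        hypotheses.append(h)                             # add the new column, re-solve
    return hypotheses
```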

66 By using the described column generation technique one can, however, find the solution of this semi-infinite programming problem [13]. [sent-166, score-0.246]

67 Maximizing (8) with respect to is easy for a given : , one chooses ; if , then and for for one chooses the minimizing unit vector. [sent-172, score-0.098]

68 Only if one cannot find a hypothesis that violates a constraint does one employ the more sophisticated techniques suggested in [15]. [sent-176, score-0.11]

69 If there is no hypothesis left that corresponds to a violated constraint, the dual optimization problem is optimal. [sent-177, score-0.384]

70 Then we can use another learning algorithm for the problem of finding a hypothesis that minimizes or approximately minimizes the training error of a weighted training set (rewrite (8)). [sent-179, score-0.096]

71 Following the ideas in [14] one can show that there is a close relationship between our technique using the trivial code and the multi-class boosting algorithms as e. [sent-181, score-0.557]

72 1 A first Experiment In a preliminary set of experiments we use two benchmark data sets from the UCI benchmark repository: glass and iris. [sent-185, score-0.137]

73 We used our column generation strategy as described in Section 3 in conjunction with the code optimization problem to solve the combined optimization problem to find the code and the embedding. [sent-186, score-1.592]

74 As base learning algorithm we chose decision trees (C4. [sent-191, score-0.095]

75 5) which we only use as a two-class classifier in our column generation algorithm. [sent-192, score-0.171]

76 We also computed the test error of multiclass decision trees and obtained error. [sent-195, score-0.285]

77 On the iris data we could achieve an error rate of and could slightly improve the result of decision trees ( ). [sent-197, score-0.095]

78 We conjecture that this is due to the properties of decision trees which have problems generating smooth boundaries not aligned with coordinate axes. [sent-199, score-0.176]

79 It is particularly interesting to find practical examples where a non-trivial choice of the code (via optimization) helps simplify the embedding and finally leads to additional improvements. [sent-201, score-0.76]

80 Preliminary results indicate that one can achieve considerable improvements when adapting codes and embeddings [3]. [sent-203, score-0.314]

81 Shown is the decision boundary and the confidence for assigning a sample to the upper left class. [sent-205, score-0.098]

82 Instead, we used (9) with the information that each example, besides belonging to its own class with confidence two, also belongs to the other classes with confidence one iff its distance to the respective center is less than one. [sent-207, score-0.237]

83 We have the sets , which contain all pairs of “relations” between contains all pairs of positive and negative classes of the positive classes. [sent-217, score-0.103]

84 In this formulation one tries to find a code and an embedding , such that for each example the output wrt. [sent-222, score-0.889]

85 each class this example has a relation with, reflects the order of these relations (i. [sent-223, score-0.095]

86 Furthermore, the program tries to achieve a “large margin” between relevant and irrelevant classes for each sample. [sent-226, score-0.136]

87 Optimization of (9) is analogous to the column generation approach discussed in Section 3. [sent-228, score-0.171]

88 Connection to Ranking Techniques Ordinal regression through large margins [10] can be seen as an extreme case of (9), where we have as many classes as observations, and each pair of observations has to satisfy a ranking relation , if is to be preferred to . [sent-231, score-0.302]

89 This formulation can of course also be understood as a special case of multi-dimensional regression. [sent-232, score-0.093]

90 #" 4 #  @ ¨ $' 4 "   @ ' # "   ¦ 4 " 5 Conclusion We proposed an algorithm to simultaneously optimize output codes and the embedding of the sample into the code book space building upon the notion of large margins. [sent-233, score-1.295]

91 Furthermore, we have shown that only quadratic and related distance measures in the code book space will lead to convex constraints and hence convex optimization problems whenever either the code or the embedding is held fixed. [sent-234, score-2.07]

92 This is desirable since at least for these sub-problems there exist fairly efficient techniques to compute these (of course the combined optimization problem of finding the code and the embedding is not convex and has local minima). [sent-235, score-1.064]

93 We proposed a column generation technique for solving the embedding optimization problems. [sent-236, score-0.642]

94 Finally we proposed a technique along the same lines that should be favorable when dealing with many classes or even empty classes. [sent-238, score-0.144]

95 Future work will concentrate on finding more efficient algorithms to solve the optimization problem and on more carefully evaluating their performance. [sent-239, score-0.221]

96 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-249, score-0.408]

97 On the learnability and design of output codes for multiclass problems. [sent-271, score-0.491]

98 Constraint classification: A new approach to multiclass classification and ranking. [sent-301, score-0.19]

99 Sparse regression ensembles in infinite and finite hypothesis spaces. [sent-349, score-0.108]

100 Stochastic organization of output codes in multiclass learning problems. [sent-392, score-0.529]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('code', 0.516), ('embedding', 0.244), ('codes', 0.24), ('margin', 0.218), ('multiclass', 0.19), ('book', 0.18), ('da', 0.155), ('classi', 0.153), ('optimization', 0.134), ('ranking', 0.128), ('generation', 0.105), ('classes', 0.103), ('convex', 0.101), ('hc', 0.095), ('nding', 0.082), ('violated', 0.082), ('problems', 0.081), ('distance', 0.077), ('ecoc', 0.077), ('hh', 0.076), ('hypothesis', 0.075), ('binary', 0.073), ('ers', 0.067), ('column', 0.066), ('demiriz', 0.065), ('dichotomies', 0.065), ('polychotomies', 0.065), ('regularization', 0.064), ('tsch', 0.061), ('output', 0.061), ('dual', 0.059), ('mika', 0.059), ('understood', 0.058), ('dence', 0.058), ('class', 0.057), ('hamming', 0.057), ('ordinal', 0.056), ('assigning', 0.054), ('optimize', 0.054), ('optimizing', 0.053), ('solve', 0.053), ('constraints', 0.053), ('words', 0.052), ('solving', 0.052), ('trees', 0.051), ('chooses', 0.049), ('colt', 0.045), ('minima', 0.044), ('francisco', 0.044), ('constraint', 0.044), ('propose', 0.044), ('decision', 0.044), ('embeddings', 0.043), ('ranks', 0.043), ('regularizer', 0.043), ('proof', 0.042), ('assigned', 0.042), ('idea', 0.041), ('technique', 0.041), ('nips', 0.04), ('nd', 0.04), ('nds', 0.04), ('target', 0.039), ('fh', 0.039), ('ranked', 0.039), ('williamson', 0.039), ('reducing', 0.039), ('cation', 0.039), ('choose', 0.039), ('relations', 0.038), ('hp', 0.038), ('thousand', 0.038), ('formulations', 0.038), ('margins', 0.038), ('lemma', 0.038), ('organization', 0.038), ('smola', 0.037), ('xed', 0.036), ('preliminary', 0.036), ('rank', 0.036), ('glass', 0.035), ('techniques', 0.035), ('formulation', 0.035), ('forming', 0.034), ('held', 0.034), ('morgan', 0.034), ('san', 0.034), ('problem', 0.034), ('benchmark', 0.033), ('partitioning', 0.033), ('tries', 0.033), ('er', 0.033), ('regression', 0.033), ('vectors', 0.033), ('lead', 0.033), ('implies', 0.032), ('variants', 0.032), ('allows', 0.031), ('training', 0.031), ('adapting', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

2 0.36089295 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

Author: Ofer Dekel, Yoram Singer

Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

3 0.2537753 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

4 0.22251518 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

5 0.1761978 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

6 0.16781457 58 nips-2002-Conditional Models on the Ranking Poset

7 0.15244241 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.14698999 156 nips-2002-On the Complexity of Learning the Kernel Matrix

9 0.14095148 140 nips-2002-Margin Analysis of the LVQ Algorithm

10 0.13742982 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

11 0.12932242 42 nips-2002-Bias-Optimal Incremental Problem Solving

12 0.12839547 161 nips-2002-PAC-Bayes & Margins

13 0.1254918 10 nips-2002-A Model for Learning Variance Components of Natural Images

14 0.11854792 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

15 0.11380477 98 nips-2002-Going Metric: Denoising Pairwise Data

16 0.11366791 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

17 0.11080026 92 nips-2002-FloatBoost Learning for Classification

18 0.11013524 45 nips-2002-Boosted Dyadic Kernel Discriminants

19 0.093644932 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

20 0.092466101 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.317), (1, -0.168), (2, 0.07), (3, -0.059), (4, 0.257), (5, -0.03), (6, -0.046), (7, -0.17), (8, -0.072), (9, 0.007), (10, -0.011), (11, -0.072), (12, -0.071), (13, -0.245), (14, 0.021), (15, 0.162), (16, 0.138), (17, 0.154), (18, -0.088), (19, -0.116), (20, 0.018), (21, -0.179), (22, -0.092), (23, 0.146), (24, 0.051), (25, -0.038), (26, 0.121), (27, 0.089), (28, 0.032), (29, 0.166), (30, 0.053), (31, -0.033), (32, 0.096), (33, -0.037), (34, -0.009), (35, 0.028), (36, 0.01), (37, 0.024), (38, 0.099), (39, 0.022), (40, -0.056), (41, 0.014), (42, 0.001), (43, -0.03), (44, -0.041), (45, -0.029), (46, 0.088), (47, 0.037), (48, -0.029), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97359103 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

2 0.84749973 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

Author: Ofer Dekel, Yoram Singer

Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

3 0.7580182 58 nips-2002-Conditional Models on the Ranking Poset

Author: Guy Lebanon, John D. Lafferty

Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.

4 0.64336711 42 nips-2002-Bias-Optimal Incremental Problem Solving

Author: Jürgen Schmidhuber

Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary disk Towers of Hanoi tasks (minimal solution size ). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: Method 1.1 (LSEARCH) View the -th binary string as a potential program for a universal Turing machine. Given some problem, for all do: every steps on average execute (if possible) one instruction of the -th program candidate, until one of the programs has computed a solution.

5 0.64283806 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

6 0.57331395 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

7 0.54565299 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

8 0.50322491 140 nips-2002-Margin Analysis of the LVQ Algorithm

9 0.44761527 67 nips-2002-Discriminative Binaural Sound Localization

10 0.44019449 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

11 0.43031281 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

12 0.40517834 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

13 0.39350146 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

14 0.38448688 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

15 0.3796806 45 nips-2002-Boosted Dyadic Kernel Discriminants

16 0.37417647 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

17 0.35574597 6 nips-2002-A Formulation for Minimax Probability Machine Regression

18 0.35285532 92 nips-2002-FloatBoost Learning for Classification

19 0.34989157 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis

20 0.3484256 190 nips-2002-Stochastic Neighbor Embedding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.013), (23, 0.014), (42, 0.066), (52, 0.031), (54, 0.151), (55, 0.339), (57, 0.01), (67, 0.028), (68, 0.027), (74, 0.08), (92, 0.05), (98, 0.122)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98539311 118 nips-2002-Kernel-Based Extraction of Slow Features: Complex Cells Learn Disparity and Translation Invariance from Natural Images

Author: Alistair Bray, Dominique Martinez

Abstract: In Slow Feature Analysis (SFA [1]), it has been demonstrated that high-order invariant properties can be extracted by projecting inputs into a nonlinear space and computing the slowest changing features in this space; this has been proposed as a simple general model for learning nonlinear invariances in the visual system. However, this method is highly constrained by the curse of dimensionality which limits it to simple theoretical simulations. This paper demonstrates that by using a different but closely-related objective function for extracting slowly varying features ([2, 3]), and then exploiting the kernel trick, this curse can be avoided. Using this new method we show that both the complex cell properties of translation invariance and disparity coding can be learnt simultaneously from natural images when complex cells are driven by simple cells also learnt from the image. The notion of maximising an objective function based upon the temporal predictability of output has been progressively applied in modelling the development of invariances in the visual system. F6ldiak used it indirectly via a Hebbian trace rule for modelling the development of translation invariance in complex cells [4] (closely related to many other models [5,6,7]); this rule has been used to maximise invariance as one component of a hierarchical system for object and face recognition [8]. On the other hand, similar functions have been maximised directly in networks for extracting linear [2] and nonlinear [9, 1] visual invariances. Direct maximisation of such functions have recently been used to model complex cells [10] and as an alternative to maximising sparseness/independence in modelling simple cells [11]. Slow Feature Analysis [1] combines many of the best properties of these methods to provide a good general nonlinear model. That is, it uses an objective function that minimises the first-order temporal derivative of the outputs; it provides a closedform solution which maximises this function by projecting inputs into a nonlinear http://www.loria.fr/equipes/cortex/ space; it exploits sphering (or PCA-whitening) of the data to ensure that all outputs have unit variance and are uncorrelated. However, the method suffers from the curse of dimensionality in that the nonlinear feature space soon becomes very large as the input dimension grows, and yet this feature space must be represented explicitly in order for the essential sphering to occur. The alternative that we propose here is to use the objective function of Stone [2, 9], that maximises output variance over a long period whilst minimising variance over a shorter period; in the linear case, this can be implemented by a biologically plausible mixture of Hebbian and anti-Hebbian learning on the same synapses [2]. In recent work, Stone has proposed a closed-form solution for maximising this function in the linear domain of blind source separation that does not involve data-sphering. This paper describes how this method can be kernelised. The use of the

same-paper 2 0.9576506 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola

Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.

3 0.94860464 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement

Author: Emanuel Todorov, Michael I. Jordan

Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.

4 0.93556982 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting

Author: Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero Candela, Roderick Murray-Smith

Abstract: We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. -step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form , the prediction of at time is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.

5 0.81325638 193 nips-2002-Temporal Coherence, Natural Image Sequences, and the Visual Cortex

Author: Jarmo Hurri, Aapo Hyvärinen

Abstract: We show that two important properties of the primary visual cortex emerge when the principle of temporal coherence is applied to natural image sequences. The properties are simple-cell-like receptive fields and complex-cell-like pooling of simple cell outputs, which emerge when we apply two different approaches to temporal coherence. In the first approach we extract receptive fields whose outputs are as temporally coherent as possible. This approach yields simple-cell-like receptive fields (oriented, localized, multiscale). Thus, temporal coherence is an alternative to sparse coding in modeling the emergence of simple cell receptive fields. The second approach is based on a two-layer statistical generative model of natural image sequences. In addition to modeling the temporal coherence of individual simple cells, this model includes inter-cell temporal dependencies. Estimation of this model from natural data yields both simple-cell-like receptive fields, and complex-cell-like pooling of simple cell outputs. In this completely unsupervised learning, both layers of the generative model are estimated simultaneously from scratch. This is a significant improvement on earlier statistical models of early vision, where only one layer has been learned, and others have been fixed a priori.

6 0.79122025 10 nips-2002-A Model for Learning Variance Components of Natural Images

7 0.77504373 2 nips-2002-A Bilinear Model for Sparse Coding

8 0.77077579 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

9 0.72935867 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

10 0.72225398 119 nips-2002-Kernel Dependency Estimation

11 0.72029084 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

12 0.71353638 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

13 0.70942128 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

14 0.70501596 106 nips-2002-Hyperkernels

15 0.70360088 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex

16 0.69652951 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives

17 0.69519019 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

18 0.69245976 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

19 0.6919772 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

20 0.69175643 110 nips-2002-Incremental Gaussian Processes