nips nips2000 nips2000-68 knowledge-graph by maker-knowledge-mining

68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation


Source: pdf

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. [sent-4, score-1.056]

2 Previous research on output coding has employed, almost solely, predefined discrete codes. [sent-5, score-0.397]

3 We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. [sent-6, score-0.702]

4 The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. [sent-7, score-0.238]

5 We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. [sent-8, score-0.312]

6 The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes. [sent-9, score-0.652]

7 1 Introduction The problem of multiclass categorization is about assigning labels to instances where the labels are drawn from some finite set. [sent-10, score-0.815]

8 Many machine learning problems include a multiclass categorization component in them. [sent-11, score-0.716]

9 There are many algorithms for the binary class problem, where there are only two possible labels, such as SVM [17], CART [4] and C4.5. [sent-13, score-0.31]

10 Some of them can be extended to handle multiclass problems. [sent-15, score-0.535]

11 An alternative and general approach is to reduce a multiclass problem to multiple binary problems. [sent-16, score-0.812]

12 In [9] Dietterich and Bakiri described a method for reducing multiclass problems to multiple binary problems based on error correcting output codes (ECOC). [sent-17, score-1.482]

13 In the training stage, a set of binary classifiers is constructed, where each classifier is trained to distinguish between two disjoint subsets of the labels. [sent-19, score-0.635]

14 In the classification stage, each of the trained binary classifiers is applied to test instances and a voting scheme is used to decide on the label. [sent-20, score-0.635]
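As a rough illustration of these two stages, the following sketch (my own helper names train_ecoc, predict_ecoc and fit_linear; the toy least-squares learner merely stands in for SVM, CART or C4.5, and this is not the authors' implementation) trains one binary classifier per column of a discrete code matrix and decodes by inner-product voting:

```python
import numpy as np

def train_ecoc(X, y, M, fit_binary):
    """Train one binary classifier per column of the code matrix M.

    M has shape (k, l) with entries in {-1, +1}; row r is the code word of class r,
    and column t splits the k classes into two disjoint subsets (+1 / -1)."""
    classifiers = []
    for t in range(M.shape[1]):
        binary_labels = M[y, t]              # relabel each example via its class's entry in column t
        classifiers.append(fit_binary(X, binary_labels))
    return classifiers

def predict_ecoc(X, M, classifiers):
    """Decode by voting: pick the row of M with the largest inner product with the predictions."""
    H = np.column_stack([np.sign(h(X)) for h in classifiers])   # (n, l) matrix of +/-1 votes
    return np.argmax(H @ M.T, axis=1)                           # class index per instance

def fit_linear(X, z):
    """Toy binary learner: least-squares hyperplane, returning a real-valued predictor."""
    A = np.c_[X, np.ones(len(X))]
    w = np.linalg.lstsq(A, z, rcond=None)[0]
    return lambda Xnew: np.c_[Xnew, np.ones(len(Xnew))] @ w
```

For discrete codes and +/-1 predictions, the inner-product decoding above coincides with Hamming-distance voting, as noted later in the summary.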

15 Experimental work has shown that the output coding approach can improve performance in a wide range of problems such as text classification [3], text to speech synthesis [8], cloud classification [1] and others [9, 10, 15]. [sent-21, score-0.569]

16 The performance of output coding was also analyzed in statistics and learning theoretic contexts [11, 12, 16, 2]. [sent-22, score-0.275]

17 Most previous work on output coding has concentrated on the problem of solving multiclass problems using predefined output codes, independently of the specific application and the learning algorithm used to construct the binary classifiers. [sent-23, score-1.367]

18 Furthermore, the "decoding" scheme assigns the same weight to each learned binary classifier, regardless of its performance. [sent-24, score-0.277]

19 Last, the induced binary problems are treated as separate problems and are learned independently. [sent-25, score-0.442]

20 Thus, there might be strong statistical correlations between the resulting classifiers, especially when the induced binary problems are similar. [sent-26, score-0.452]

21 These problems call for an improved output coding scheme. [sent-27, score-0.352]

22 In a recent theoretical work [7] we suggested a relaxation of discrete output codes to continuous codes where each entry of the code matrix is a real number. [sent-28, score-1.56]

23 As in discrete codes, each column of the code matrix defines a partition of the set of the labels into two subsets which are labeled positive (+) and negative (-). [sent-29, score-0.572]

24 The sign of each entry in the code matrix determines the subset association (+ or -) and magnitude corresponds to the confidence in this association. [sent-30, score-0.466]

25 In this paper we discuss the use of continuous codes for multiclass problems using a two-phase approach. [sent-31, score-1.097]

26 First, we create a binary output code matrix that is used to train binary classifiers in the same way suggested by Dietterich and Bakiri. [sent-32, score-1.246]

27 Given the trained classifiers and some training data we look for a more suitable continuous code by casting the problem as a constrained optimization problem. [sent-33, score-0.762]

28 We then replace the original binary code with the improved continuous code and proceed analogously to classify new test instances. [sent-34, score-1.261]

29 An important property of our algorithm is that the resulting continuous code can be expressed as a linear combination of a subset of the training patterns. [sent-35, score-0.601]

30 Since classification of new instances is performed using scalar products between the predictions vector of the binary classifiers and the rows of the code matrix, we can exploit this particular form of the code matrix and use kernels [17] to construct high dimensional product spaces. [sent-36, score-1.32]

31 This approach enables an efficient and simple way to take into account correlations between the different binary classifiers. [sent-37, score-0.316]

32 In the next section we formally describe the framework that uses output coding for multiclass problems. [sent-39, score-0.825]

33 In Sec. 3 we describe our algorithm for designing a continuous code from a set of binary classifiers. [sent-41, score-0.852]

34 2 Multiclass learning using output coding Let S = {(x_1, y_1), ..., (x_m, y_m)} be a set of m training examples where each instance x_i belongs to a domain X. [sent-45, score-0.391]

35 A multiclass classifier is a function H : X -> Y that maps an instance x into an element y of Y. [sent-50, score-0.665]

36 In this work we focus on a framework that uses output codes to build multiclass classifiers from binary classifiers. [sent-51, score-1.443]

37 A binary output code M is a matrix of size k x l over {-1, +1} where each row of M corresponds to a class y in Y. [sent-52, score-0.896]

38 This set is fed as training data to a learning algorithm that finds a binary classifier. [sent-59, score-0.414]

39 In this work we assume that each binary classifier h_t is of the form h_t : X -> R. This reduction yields l different binary classifiers h_1, ..., h_l. [sent-60, score-0.983]

40 We denote the vector of predictions of these classifiers on an instance x by h(x) = (h_1(x), ..., h_l(x)). [sent-64, score-0.225]

41 The higher the value of K(h(x), M_r), the more confident we are that r is the correct label of x according to the set of classifiers h. [sent-71, score-0.291]

42 Note that this notion of similarity holds for both discrete and continuous matrices. [sent-72, score-0.342]

43 It is easy to verify that when both the output code and the binary classifiers are over {-1, +1} this choice of K is equivalent to picking the row of M which attains the minimal Hamming distance to h(x). [sent-75, score-0.986]
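A quick numerical check of this equivalence, using made-up vectors (not taken from the paper): for vectors over {-1, +1} of length l, the inner product equals l minus twice the Hamming distance, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 8
h = rng.choice([-1, 1], size=l)           # predictions of the l binary classifiers
M = rng.choice([-1, 1], size=(4, l))      # a discrete code matrix with k = 4 rows

inner = M @ h                             # inner product with each code word
hamming = (M != h).sum(axis=1)            # Hamming distance to each code word

assert np.array_equal(inner, l - 2 * hamming)   # identity for +/-1 vectors
assert inner.argmax() == hamming.argmin()       # same winning row
```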

44 To summarize, the learning algorithm receives a training set S, a discrete output code (matrix) of size k x l, and has access to a binary learning algorithm, denoted L. [sent-76, score-0.994]

45 The learning algorithm L is called l times, once for each induced binary problem. [sent-77, score-0.39]

46 The result of this process is a set of binary classifiers h(x) = (h_1(x), ..., h_l(x)). [sent-78, score-0.467]

47 These classifiers are fed, together with the original labels y_1, ..., y_m, [sent-82, score-0.263]

48 to our second stage of the learning algorithm, which learns a continuous code. [sent-85, score-0.276]

49 This continuous code is then used to classify new instances by choosing the class which corresponds to the row with the largest inner-product. [sent-86, score-0.693]

50 ..., h_l(x), and the output unit predicts the final class by choosing the label r which maximizes K(h(x), M_r). [sent-91, score-0.317]

51 3 Finding an improved continuous code We now describe our method for finding a continuous code that improves on a given ensemble of binary classifiers h. [sent-92, score-1.555]

52 We would like to note that we do not need to know the code that was originally used to train the binary classifiers. [sent-93, score-0.591]

53 The approach we take is to cast the code design problem as a constrained optimization problem. [sent-96, score-0.391]

54 The multiclass empirical error is given by an average of indicator terms, where [π] is equal to 1 if the predicate π holds and 0 otherwise. [sent-97, score-0.579]
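The displayed formula itself did not survive the extraction; a standard indicator form consistent with this sentence (my reconstruction, the paper's exact notation may differ) is:

```latex
\varepsilon_S(H) \;=\; \frac{1}{m}\sum_{i=1}^{m}\big[\![\, H(x_i) \neq y_i \,]\!\big]
```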

55 Borrowing the idea of soft margins [6] we replace the 0-1 multiclass error with the piecewise-linear bound sum_{i=1}^{m} [ max_r { h(x_i) · M_r + 1 - δ_{y_i,r} } - h(x_i) · M_{y_i} ]. (1) [sent-98, score-0.61]

56 Put another way, the correct label should have a confidence value that is larger by at least one than any of the confidences for the rest of the labels. [sent-106, score-0.186]

57 Otherwise, we suffer a loss which is linearly proportional to the difference between the confidence of the correct label and the maximum among the confidences of the other labels. [sent-107, score-0.186]
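A short sketch of this bound in code (h_preds holds the prediction vectors h(x_i) as rows, M is the code matrix, y the true labels; the function name and array layout are my own assumptions):

```python
import numpy as np

def multiclass_hinge_bound(h_preds, M, y):
    """Piecewise-linear bound of Eq. (1):
    sum_i [ max_r ( h(x_i).M_r + 1 - delta_{y_i,r} ) - h(x_i).M_{y_i} ]."""
    scores = h_preds @ M.T                    # (m, k): confidence K(h(x_i), M_r) for every class r
    m = scores.shape[0]
    shifted = scores + 1.0                    # add the required margin of one ...
    shifted[np.arange(m), y] -= 1.0           # ... except for the correct class (the delta term)
    return float(np.sum(shifted.max(axis=1) - scores[np.arange(m), y]))
```

The summand is zero exactly when the correct class's confidence beats every other class's by at least one, and otherwise grows linearly with the violation, matching the description above.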

58 Define the l2-norm of a code M to be the l2-norm of the vector represented by the concatenation of M's rows, ||M||_2 = ||(M_1, ..., M_k)||_2. [sent-108, score-0.314]

59 We now cast the problem of finding a good code which minimizes the bound of Eq. (1). [sent-112, score-0.357]

60 Since the continuous output code is constructed from the support patterns, we call our algorithm SPOC for Support Patterns Output Coding. [sent-125, score-0.709]

61 (4) we obtain the kernel-based classification rule H(x) of Eq. (5). The ability to use kernels as a means of calculating inner-products enables a simple and efficient way to take into account correlations between the binary classifiers. [sent-135, score-0.455]

62 For instance, a second-order polynomial kernel of the form (1 + u·v)^2 corresponds to a transformation to a feature space that includes all the products of pairs of binary classifiers. [sent-136, score-0.307]
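The way such a kernel enters the classification rule can be sketched as follows: since each row of the learned continuous code is a linear combination of prediction vectors of (support) training patterns, decoding only needs inner products between prediction vectors, and these can be replaced by a kernel. The coefficient array eta and the helper names below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def poly_kernel(u, v, degree=2):
    """Second-order polynomial kernel on prediction vectors: (1 + u.v)^degree."""
    return (1.0 + u @ v) ** degree

def kernel_decode(h_x, h_train, eta, kernel=poly_kernel):
    """Kernelized classification rule (sketch).

    h_x:     prediction vector h(x) of a new instance, shape (l,)
    h_train: prediction vectors of the m training patterns, shape (m, l)
    eta:     coefficients expressing each code row as a combination of
             training patterns, shape (m, k) -- hypothetical dual variables."""
    k_vals = np.array([kernel(h_x, h_i) for h_i in h_train])   # (m,) kernel evaluations
    scores = eta.T @ k_vals                                     # (k,) one score per class
    return int(np.argmax(scores))
```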

63 Therefore, the relaxation of discrete codes to continuous codes offers a partial remedy by assigning a different importance weight to each binary classifier while taking into account the statistical correlations between the binary classifiers. [sent-137, score-1.739]

64 4 Experiments In this section we describe experiments we performed comparing discrete and continuous output codes. [sent-138, score-0.494]

65 We selected eight multiclass datasets, seven from the UCI repository. [sent-139, score-0.566]

66 When a test set was provided we used the original split into training and test sets; otherwise we used 5-fold cross validation for evaluating the test error. [sent-141, score-0.228]

67 We are in the process of performing experiments with the complete datasets and other datasets using a subset of the kernels. [sent-143, score-0.185]

68 (Table of datasets: satimage, shuttle, mnist, isolet, letter, vowel, glass, soybean.) [sent-152, score-0.257]

69 For a classification problem with k classes we set the random code to have about 10 log2(k) columns. [sent-158, score-0.257]

70 We then set each entry in the matrix defining the code to be -1 or +1 uniformly at random. [sent-159, score-0.4]
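A minimal sketch of this random-code construction (my own helper, not the authors' code):

```python
import numpy as np

def random_code(k, seed=None):
    """Random discrete code: about 10*log2(k) columns, each entry -1 or +1 uniformly at random."""
    rng = np.random.default_rng(seed)
    l = int(np.ceil(10 * np.log2(k)))
    return rng.choice([-1, 1], size=(k, l))

M = random_code(10, seed=0)   # e.g. a 10-class problem gives a 10 x 34 code matrix
```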

71 We used SVM as the base binary learning algorithm in two different modes: in the first mode we used the margin of the support vector machine classifier as its real-valued prediction. [sent-160, score-0.585]

72 That is, each binary classifier h_t is of the form h_t(x) = w·x + b, where w and b are the parameters of the separating hyperplane. [sent-161, score-0.421]

73 Thus, each binary classifier h_t in this case is of the form h_t : X -> {-1, +1}. [sent-163, score-0.633]

74 For brevity, we refer to these classifiers as thresholded-SVMs. [sent-164, score-0.19]
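In code, the two SVM modes differ only in whether the real-valued output is thresholded; a sketch assuming w and b are the hyperplane parameters returned by the SVM trainer:

```python
import numpy as np

def svm_margin(w, b, X):
    """First mode: real-valued prediction h_t(x) = w.x + b."""
    return X @ w + b

def svm_thresholded(w, b, X):
    """Second mode (thresholded-SVM): prediction restricted to {-1, +1}."""
    return np.where(X @ w + b >= 0.0, 1, -1)
```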

75 We would like to note in passing that this setting is by no means superficial as there are learning algorithms, such as RIPPER [5], that build classifiers of this type. [sent-165, score-0.221]

76 Each plot shows the relative test error difference between discrete and continuous codes. [sent-169, score-0.403]

77 Formally, the height of each bar is proportional to (E_d - E_c) / E_d, where E_d (E_c) is the test error when using a discrete (continuous) code. [sent-170, score-0.221]

78 For each problem there are three bars, one for each type of code (one-against-all, BCH, and random). [sent-171, score-0.314]

79 The left plots correspond to the results obtained using thresholded-SVM as the base binary classifier and the right plots show the results using the real-valued predictions. [sent-173, score-0.557]

80 In general, the continuous output code relaxation indeed results in an improved performance over the original discrete output codes. [sent-175, score-1.095]

81 The most significant improvements are achieved with thresholded-SVM as the base binary classifiers. [sent-176, score-0.384]

82 Impressive improvements are achieved for data sets with a large number of training examples per class, shuttle being a notable example. [sent-179, score-0.168]

83 For this dataset the test error is reduced from an average of over 3% when using a discrete code to an average test error significantly lower than 1% for continuous codes. [sent-180, score-0.89]

84 In contrast, for the soybean dataset, which contains 307 training examples and 19 classes, none of the kernels achieved any improvement, and often resulted in an increase in the test error. [sent-183, score-0.186]

85 Examining the training error reveals that the greater the decrease in the training error due to the continuous code relaxation, the larger the increase in the corresponding test error. [sent-184, score-0.862]

86 Figure 1: Comparison of the performance of discrete and continuous output codes using SVM (right figures) and thresholded-SVM (left figures) as the base learner for three different families of codes. [sent-197, score-0.986]

87 The top figures show the relative change in test error for the best-performing kernel and the bottom figures show the relative change in test error averaged across seven different kernels. [sent-198, score-0.452]

88 Using the same setting described above we ran SPOC with random codes of lengths 5 through 35 for the vowel dataset and lengths 15 through 50 for the letter dataset. [sent-200, score-0.4]

89 In Figure 2 we show the test error rate as a function of the code length with SVM as the base binary learner. [sent-201, score-0.766]

90 (Similar results were obtained using thresholded-SVM as the base binary classifiers. [sent-202, score-0.346]

91 ) For the letter dataset we see consistent and significant improvements of the continuous codes over the discrete ones, whereas for the vowel dataset there is a major improvement for short codes that decays with the code's length. [sent-203, score-1.049]

92 Therefore, since continuous codes can achieve performance comparable to much longer discrete codes, they may serve as a viable alternative to discrete codes when computational power is limited or for classification tasks on large datasets. [sent-204, score-1.381]

93 5 Discussion In this paper we described and experimented with an algorithm for continuous relaxation of output codes for multiclass categorization problems. [sent-205, score-1.366]

94 The algorithm appears to be especially useful when the codes are short. [sent-206, score-0.352]

95 An interesting question is whether the proposed approach can be generalized by calling the algorithm successively on the previous code it improved. [sent-207, score-0.347]

96 Another viable direction is to try to combine the algorithm with other schemes for reducing [sent-208, score-0.168]

97 Figure 2: Comparison of the performance of discrete random codes and their continuous relaxation as a function of the code length. [sent-220, score-1.033]

98 multiclass problems to multiple binary problems such as tree-based codes and directed acyclic graphs [13]. [sent-221, score-1.218]

99 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-236, score-0.535]

100 On the learnability and design of output codes for multiclass problems. [sent-255, score-0.976]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('multiclass', 0.535), ('code', 0.314), ('codes', 0.29), ('binary', 0.277), ('classifiers', 0.19), ('continuous', 0.182), ('output', 0.151), ('relaxation', 0.132), ('spoc', 0.122), ('discrete', 0.115), ('ht', 0.106), ('mr', 0.105), ('label', 0.101), ('classifier', 0.095), ('coding', 0.093), ('kernels', 0.082), ('labels', 0.073), ('soybean', 0.073), ('yoram', 0.073), ('dietterich', 0.07), ('base', 0.069), ('dataset', 0.067), ('test', 0.062), ('datasets', 0.061), ('problems', 0.058), ('classification', 0.057), ('vowel', 0.057), ('row', 0.054), ('text', 0.052), ('improved', 0.05), ('entry', 0.049), ('machine', 0.049), ('induced', 0.049), ('instances', 0.049), ('bakiri', 0.049), ('bch', 0.049), ('cloud', 0.049), ('confidences', 0.049), ('crammer', 0.049), ('fonn', 0.049), ('ghulum', 0.049), ('isolel', 0.049), ('jrt', 0.049), ('koby', 0.049), ('lell', 0.049), ('shuttle', 0.049), ('figures', 0.047), ('describe', 0.046), ('similarity', 0.045), ('dual', 0.045), ('letter', 0.045), ('error', 0.044), ('plots', 0.043), ('ran', 0.043), ('categorization', 0.043), ('cast', 0.043), ('training', 0.042), ('viable', 0.042), ('assigning', 0.042), ('argm', 0.042), ('yi', 0.04), ('correlations', 0.039), ('examples', 0.039), ('improvements', 0.038), ('robert', 0.038), ('predefined', 0.038), ('hl', 0.038), ('xi', 0.037), ('matrix', 0.037), ('svm', 0.037), ('li', 0.037), ('reducing', 0.036), ('confidence', 0.036), ('ed', 0.036), ('instance', 0.035), ('optimization', 0.034), ('algorithm', 0.033), ('column', 0.033), ('performing', 0.033), ('schapire', 0.033), ('mnist', 0.033), ('rand', 0.033), ('vladimir', 0.033), ('correcting', 0.033), ('class', 0.033), ('hi', 0.032), ('discuss', 0.032), ('fed', 0.031), ('seven', 0.031), ('classify', 0.031), ('disjoint', 0.031), ('learning', 0.031), ('replace', 0.031), ('mode', 0.031), ('subset', 0.03), ('stage', 0.03), ('correspond', 0.03), ('support', 0.029), ('especially', 0.029), ('mercer', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

2 0.1631591 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

Author: Renato Vicente, David Saad, Yoshiyuki Kabashima

Abstract: We analyze Gallager codes by employing a simple mean-field approximation that distorts the model geometry and preserves important interactions between sites. The method naturally recovers the probability propagation decoding algorithm as an extremization of a proper free-energy. We find a thermodynamic phase transition that coincides with information theoretical upper-bounds and explain the practical code performance in terms of the free-energy landscape.

3 0.13477273 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

Author: Shie Mannor, Ron Meir

Abstract: The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established. 1

4 0.12591006 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

5 0.12249698 75 nips-2000-Large Scale Bayes Point Machines

Author: Ralf Herbrich, Thore Graepel

Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires o (m 2 ) of memory and 0 (N . m2 ) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1

6 0.11839787 133 nips-2000-The Kernel Gibbs Sampler

7 0.10373721 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

8 0.10231984 58 nips-2000-From Margin to Sparsity

9 0.1016113 22 nips-2000-Algorithms for Non-negative Matrix Factorization

10 0.10077059 138 nips-2000-The Use of Classifiers in Sequential Inference

11 0.092823751 4 nips-2000-A Linear Programming Approach to Novelty Detection

12 0.089413561 130 nips-2000-Text Classification using String Kernels

13 0.079602025 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

14 0.077566549 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

15 0.077404998 74 nips-2000-Kernel Expansions with Unlabeled Examples

16 0.075621977 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

17 0.072648346 109 nips-2000-Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features

18 0.072442904 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

19 0.071289122 18 nips-2000-Active Support Vector Machine Classification

20 0.070009373 134 nips-2000-The Kernel Trick for Distances


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.249), (1, 0.123), (2, -0.031), (3, 0.009), (4, -0.038), (5, -0.041), (6, 0.042), (7, -0.004), (8, 0.051), (9, 0.151), (10, -0.039), (11, -0.036), (12, 0.02), (13, 0.093), (14, -0.131), (15, 0.101), (16, -0.142), (17, -0.014), (18, 0.234), (19, -0.259), (20, -0.049), (21, 0.028), (22, 0.076), (23, -0.056), (24, 0.016), (25, -0.06), (26, 0.05), (27, 0.083), (28, 0.119), (29, -0.1), (30, -0.014), (31, 0.086), (32, 0.045), (33, 0.244), (34, -0.039), (35, -0.131), (36, 0.122), (37, -0.005), (38, -0.012), (39, -0.086), (40, -0.0), (41, -0.006), (42, 0.062), (43, 0.084), (44, 0.029), (45, 0.14), (46, 0.048), (47, -0.08), (48, -0.057), (49, -0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96127164 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

2 0.59521127 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

Author: Renato Vicente, David Saad, Yoshiyuki Kabashima

Abstract: We analyze Gallager codes by employing a simple mean-field approximation that distorts the model geometry and preserves important interactions between sites. The method naturally recovers the probability propagation decoding algorithm as an extremization of a proper free-energy. We find a thermodynamic phase transition that coincides with information theoretical upper-bounds and explain the practical code performance in terms of the free-energy landscape.

3 0.54402691 22 nips-2000-Algorithms for Non-negative Matrix Factorization

Author: Daniel D. Lee, H. Sebastian Seung

Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the ExpectationMaximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

4 0.50132275 109 nips-2000-Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features

Author: Penio S. Penev

Abstract: Low-dimensional representations are key to solving problems in highlevel vision, such as face compression and recognition. Factorial coding strategies for reducing the redundancy present in natural images on the basis of their second-order statistics have been successful in accounting for both psychophysical and neurophysiological properties of early vision. Class-specific representations are presumably formed later, at the higher-level stages of cortical processing. Here we show that when retinotopic factorial codes are derived for ensembles of natural objects, such as human faces, not only redundancy, but also dimensionality is reduced. We also show that objects are built from parts in a non-Gaussian fashion which allows these local-feature codes to have dimensionalities that are substantially lower than the respective Nyquist sampling rates.

5 0.49207136 138 nips-2000-The Use of Classifiers in Sequential Inference

Author: Vasin Punyakanok, Dan Roth

Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

6 0.43002093 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators

7 0.38706461 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

8 0.36589399 75 nips-2000-Large Scale Bayes Point Machines

9 0.34639314 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems

10 0.34128448 133 nips-2000-The Kernel Gibbs Sampler

11 0.33688176 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

12 0.32359308 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

13 0.3193706 74 nips-2000-Kernel Expansions with Unlabeled Examples

14 0.29763648 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

15 0.28654107 58 nips-2000-From Margin to Sparsity

16 0.27955666 4 nips-2000-A Linear Programming Approach to Novelty Detection

17 0.27035969 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

18 0.26577231 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration

19 0.26367393 18 nips-2000-Active Support Vector Machine Classification

20 0.25220507 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.031), (17, 0.105), (32, 0.015), (33, 0.047), (54, 0.012), (55, 0.022), (62, 0.024), (65, 0.38), (67, 0.059), (75, 0.022), (76, 0.046), (79, 0.02), (81, 0.023), (90, 0.078), (91, 0.015), (97, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97509116 33 nips-2000-Combining ICA and Top-Down Attention for Robust Speech Recognition

Author: Un-Min Bae, Soo-Young Lee

Abstract: We present an algorithm which compensates for the mismatches between characteristics of real-world problems and assumptions of independent component analysis algorithm. To provide additional information to the ICA network, we incorporate top-down selective attention. An MLP classifier is added to the separated signal channel and the error of the classifier is backpropagated to the ICA network. This backpropagation process results in estimation of expected ICA output signal for the top-down attention. Then, the unmixing matrix is retrained according to a new cost function representing the backpropagated error as well as independence. It modifies the density of recovered signals to the density appropriate for classification. For noisy speech signal recorded in real environments, the algorithm improved the recognition performance and showed robustness against parametric changes. 1

2 0.92563349 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation

Author: Brendan J. Frey, Anitha Kannan

Abstract: One way to approximate inference in richly-connected graphical models is to apply the sum-product algorithm (a.k.a. probability propagation algorithm), while ignoring the fact that the graph has cycles. The sum-product algorithm can be directly applied in Gaussian networks and in graphs for coding, but for many conditional probability functions - including the sigmoid function - direct application of the sum-product algorithm is not possible. We introduce

same-paper 3 0.89664149 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

4 0.5320335 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

Author: Renato Vicente, David Saad, Yoshiyuki Kabashima

Abstract: We analyze Gallager codes by employing a simple mean-field approximation that distorts the model geometry and preserves important interactions between sites. The method naturally recovers the probability propagation decoding algorithm as an extremization of a proper free-energy. We find a thermodynamic phase transition that coincides with information theoretical upper-bounds and explain the practical code performance in terms of the free-energy landscape.

5 0.50407058 36 nips-2000-Constrained Independent Component Analysis

Author: Wei Lu, Jagath C. Rajapakse

Abstract: The paper presents a novel technique of constrained independent component analysis (CICA) to introduce constraints into the classical ICA and solve the constrained optimization problem by using Lagrange multiplier methods. This paper shows that CICA can be used to order the resulted independent components in a specific manner and normalize the demixing matrix in the signal separation procedure. It can systematically eliminate the ICA's indeterminacy on permutation and dilation. The experiments demonstrate the use of CICA in ordering of independent components while providing normalized demixing processes. Keywords: Independent component analysis, constrained independent component analysis, constrained optimization, Lagrange multiplier methods 1

6 0.48392606 62 nips-2000-Generalized Belief Propagation

7 0.46792966 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

8 0.45887727 80 nips-2000-Learning Switching Linear Models of Human Motion

9 0.45585951 118 nips-2000-Smart Vision Chip Fabricated Using Three Dimensional Integration Technology

10 0.44993353 74 nips-2000-Kernel Expansions with Unlabeled Examples

11 0.44987127 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

12 0.44798264 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

13 0.44655108 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

14 0.4448725 52 nips-2000-Fast Training of Support Vector Classifiers

15 0.44345498 111 nips-2000-Regularized Winnow Methods

16 0.44235432 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

17 0.44168624 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

18 0.44083843 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

19 0.44020513 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

20 0.43583018 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks