
49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method


Source: pdf

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Faculdade Ciências Universidade Porto, Rua do Campo Alegre, 687, 4169-007 Porto, Portugal. Editor: Ralf Herbrich. Abstract: Classification of ordinal data is one of the most important tasks of relation learning. [sent-8, score-0.445]

2 Generalization bounds of the proposed ordinal classifier are also provided. [sent-12, score-0.403]

3 Keywords: classification, ordinal data, support vector machines, neural networks. [sent-14, score-0.403]

4 Finally, the presence or absence of a “natural” order among classes will separate nominal from ordinal problems. [sent-18, score-0.505]

5 Although two-class and nominal data classification problems have been thoroughly analysed in the literature, the ordinal sibling has not received nearly as much attention yet. [sent-19, score-0.479]

6 Conventional methods for nominal classes or for regression problems could be employed to solve ordinal data problems. [sent-25, score-0.547]

7 Although the ordinal formulation seems conceptually simpler than the nominal one, difficulties to incorporate in the algorithms this [sent-27, score-0.488]

8 piece of additional information (the order) may explain the widespread use of conventional methods to tackle the ordinal data problem. [sent-30, score-0.476]

9 This work addresses this void by introducing in Section 2 the data replication method, a nonparametric procedure for the classification of ordinal data. [sent-31, score-0.774]

10 Finally, the generic version of the data replication method is presented, allowing partial constraints on variables. [sent-35, score-0.399]

11 The section is concluded with a reinterpretation of the neural network model as a generalization of the ordinal logistic regression model. [sent-38, score-0.435]

12 As the intersection point of two boundaries would indicate an example with three or more classes equally probable—not plausible with ordinal classes—this strategy imposes a sensible restriction. [sent-58, score-0.58]

13 By avoiding the intersection of any two boundaries, this model tries to capture the essence of the ordinal data problem. [sent-61, score-0.445]

14 , $b_{K-1}$, such that the feature space is divided into $K$ regions by the decision boundaries $w^t x + b_r = 0$, $r = 1, \ldots$ [sent-68, score-0.397]

15 Given that there are many good two-class learning algorithms, it is tempting to reduce this ordinal formulation to a two-class problem. [sent-72, score-0.454]

16 Data Replication Method: the Linear Case. Before moving to the formal presentation of the data replication method, it is instructive to motivate the method by considering a hypothetical, simplified scenario, with five classes in $\mathbb{R}^2$. [sent-75, score-0.439]

17 To start the presentation of the data replication method let us consider an even more simplified toy example with just three classes, as depicted in Figure 2(a). [sent-103, score-0.418]

18 It is clear now that the principle behind the replication method is to have a replica of the original data set for each boundary. [sent-157, score-0.494]
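
The replica principle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `replicate` is hypothetical, every class is used in every replica (the setting s = K − 1), replica q is extended with (K − 2) extra components that are zero except for a value h in position q − 1, and the binary label is C2 (coded 1) when the class number exceeds q and C1 (coded 0) otherwise.

```python
def replicate(x, label, num_classes, h=1.0):
    """Return the (K - 1) binary replicas of one ordinal example.

    x: list of p feature values; label: class number in 1..K.
    Each replica is a pair (extended_vector, binary_label).
    """
    replicas = []
    for q in range(1, num_classes):              # one replica per boundary q = 1..K-1
        extension = [0.0] * (num_classes - 2)    # the first replica keeps all zeros
        if q >= 2:
            extension[q - 2] = h                 # mark which boundary this replica serves
        binary_label = 1 if label > q else 0     # 1 stands for C2, 0 for C1
        replicas.append((list(x) + extension, binary_label))
    return replicas


# Three-class toy case: a class-2 point is C2 for the first boundary, C1 for the second.
print(replicate([0.5, 1.5], 2, 3))
```

Any standard binary learner can then be trained on the union of the replicas of all training points.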

19 After the above exposition on a toy model, we will now formally describe a general K-class classifier for ordinal data classification. [sent-222, score-0.492]

20 The reduction technique presented here uses a binary classifier to make multiclass ordinal predictions. [sent-298, score-0.448]

21 Instead of resorting to multiple binary classifiers to make predictions in the ordinal problem (as is common in reduction techniques from multiclass to binary problems), the data replication method uses a single binary classifier to classify (K − 1) dependent replicas of a test point. [sent-299, score-1.024]
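
A small sketch of how one binary classifier could label the (K − 1) replicas of a test point. The decoding rule used here (predicted class = 1 + number of replicas labelled C2) is an assumption of this example, and `toy_classifier` with its weights is purely hypothetical, chosen for p = 1 and K = 3.

```python
def predict_ordinal(x, binary_classifier, num_classes, h=1.0):
    """Classify the (K - 1) replicas of x with one binary classifier.

    binary_classifier maps an extended vector to 0 (C1) or 1 (C2).
    Assumed decoding: class = 1 + number of replicas labelled C2.
    """
    count_c2 = 0
    for q in range(1, num_classes):
        extension = [0.0] * (num_classes - 2)
        if q >= 2:
            extension[q - 2] = h
        if binary_classifier(list(x) + extension) == 1:
            count_c2 += 1
    return 1 + count_c2


def toy_classifier(x_bar):
    """Hypothetical linear classifier in the extended space (p = 1, K = 3)."""
    w_bar = [1.0, -1.0]   # one weight for the feature, one for the extra component
    b = -1.0
    score = sum(w * v for w, v in zip(w_bar, x_bar)) + b
    return 1 if score > 0 else 0


for value in (0.5, 1.5, 2.5):
    print(value, predict_ordinal([value], toy_classifier, 3))
```

With h = 1 this toy classifier induces the two parallel boundaries x − 1 = 0 and x − 2 = 0 in the original space, so the three test values fall in classes 1, 2 and 3 respectively.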

22 A pertinent question is how the performance of the binary classifier translates into the performance on the ordinal problem. [sent-300, score-0.448]

23 In Appendix A, a bound on the generalization error of the ordinal data classifier is expressed as a function of the error of the binary classifier. [sent-301, score-0.49]

24 It is not trivial to see how to keep them well ordered with this standard data replication method, for general $s$. [sent-303, score-0.431]

25 Homogeneous Data Replication Method. With the data replication method just presented, the boundary in the extended space $\bar{w}^t \bar{x} + b = 0$ corresponds in the original space to the $(K - 1)$ boundaries $w^t x + b_i$, with $b_1 = b$, $b_i = h\, w_{p+i-1} + b_1$, $i = 2, \ldots$ [sent-307, score-1.106]
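
The correspondence between the single extended-space boundary and the (K − 1) original-space thresholds can be checked numerically. The helper below is a hypothetical sketch that applies the relation b_1 = b, b_i = h w_{p+i-1} + b_1 quoted above, with the 1-based index of the formula shifted to Python's 0-based lists; the weight values are made up for illustration.

```python
def thresholds_from_extended(w_bar, b, p, num_classes, h=1.0):
    """Recover b_1..b_{K-1} from an extended-space hyperplane (w_bar, b).

    w_bar has p + K - 2 components; b_1 = b and b_i = h * w_{p+i-1} + b_1.
    """
    thresholds = [b]
    for i in range(2, num_classes):
        # w_{p+i-1} in the 1-based notation is w_bar[p + i - 2] here.
        thresholds.append(h * w_bar[p + i - 2] + b)
    return thresholds


# p = 2 features, K = 5 classes, so three extra components in w_bar.
print(thresholds_from_extended([0.3, 0.7, -1.0, -2.5, -4.0], 0.5, 2, 5, h=2.0))
```

All boundaries share the same first p weights, which is why they are parallel and differ only in their thresholds.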

26 Note that the homogeneous extended data set has dimension p + K − 1, as opposed to (p + K − 2) in the standard formulation of the data replication method. [sent-381, score-0.515]

27 Therefore, some adaptation is required before applying existing linear binary classifiers to the homogeneous data replication method. [sent-388, score-0.467]

28 Homogeneous Data Replication Method with Explicit Constraints on the Thresholds. Unless one sets s = K − 1, the data replication method does not enforce ordered thresholds (we will return to this point later, when mapping to SVMs and neural networks). [sent-389, score-0.5]

29 3 Data Replication Method—the Nonlinear Case Previously, the data replication method was considered as a design methodology of a linear classifier for ordinal data. [sent-403, score-0.774]

30 Inspired by the data replication method just presented, we now look for generic boundaries that are level curves of some nonlinear, real-valued function G(x) defined in the feature space. [sent-407, score-0.48]

31 Once again, the search for nonintersecting, nonlinear boundaries can be carried out in the extended space of the data replication method. [sent-414, score-0.523]

32 First, extend and modify the feature space to a binary problem, as dictated by the data replication method. [sent-415, score-0.416]

33 The nonlinear extension of the homogeneous data replication method follows the same rationale as the standard formulation. [sent-445, score-0.465]

34 Finally, the enforcement of ordered thresholds with the introduction of additional training points is still valid in the nonlinear case, as $G([0_p;\, u_{i+1} - u_i]) = G(0_p) + w^t (u_{i+1} - u_i) = h\, w_{p+i+1} - h\, w_{p+i} = b_{i+1} - b_i$. [sent-447, score-0.552]

35 4 A General Framework As presented so far, the data replication method allows only searching for parallel hyperplanes (level curves in the nonlinear case) boundaries. [sent-449, score-0.459]

36 In the quest for an extension allowing more loosely coupled boundaries, let us start by reviewing the method for ordinal data by Frank and Hall (2001). [sent-451, score-0.445]

37 The Method of Frank and Hall. Frank and Hall (2001) proposed to use (K − 1) standard binary classifiers to address the K-class ordinal data problem. [sent-454, score-0.49]

38 Toward that end, the training of the i-th classifier is performed by converting the ordinal data set with classes $C_1, \ldots$ [sent-455, score-0.513]
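
A hedged sketch of the Frank and Hall decomposition. The extracted sentence is truncated, so the details below are assumptions based on the usual description of that method: the i-th binary problem asks whether the class number is greater than i, and class probabilities are recovered by differencing the (K − 1) estimates. Both function names are hypothetical.

```python
def binary_targets(labels, num_classes):
    """targets[i - 1][n] is 1 if labels[n] > i, else 0, for i = 1..K-1."""
    return [[1 if y > i else 0 for y in labels] for i in range(1, num_classes)]


def class_probabilities(p_greater):
    """Combine estimates of P(class > i | x), i = 1..K-1, into P(class = k | x)."""
    num_classes = len(p_greater) + 1
    probs = [1.0 - p_greater[0]]                          # P(class = 1)
    for k in range(2, num_classes):
        probs.append(p_greater[k - 2] - p_greater[k - 1])  # P(class = k)
    probs.append(p_greater[-1])                           # P(class = K)
    return probs


print(binary_targets([1, 2, 3, 2], 3))
print(class_probabilities([0.75, 0.25]))
```

Because the (K − 1) classifiers are trained independently, nothing forces their estimates to be decreasing in i, so a difference can come out negative; this is the kind of loosely coupled behaviour that the nonintersecting-boundary constraint rules out.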

39 A Parameterized Family of Classifiers. Thus far, nonintersecting boundaries have been motivated as the best way to capture the ordinal relation among classes. [sent-481, score-0.594]

40 That may be too restrictive a condition for problems where some features are not related to the ordinal property. [sent-482, score-0.403]

41 Without further information, it is inadvisable to draw any ordinal information from them. [sent-484, score-0.403]

42 We suggest a generalization of the data replication method where the enforcement of nonintersecting boundaries is restricted to the first j features only, while the last p − j features enjoy the independence materialized in Frank and Hall's method. [sent-486, score-0.59]

43 Towards that end we start by showing how the independent boundaries approach of Frank and Hall can be subsumed in the data replication framework. [sent-487, score-0.48]

44 (1) but instead using the following rule: $[x_i^{(1)}; 0_2; 0] \in C_1$, $[x_i^{(2)}; 0_2; 0], [x_i^{(3)}; 0_2; 0] \in C_2$, $[0_2; x_i^{(1)}; h], [0_2; x_i^{(2)}; h] \in C_1$, $[0_2; x_i^{(3)}; h] \in C_2$, where $0_2$ is the sequence of 2 zeros. [sent-497, score-0.42]

45 This general formulation of the data replication method allows the enforcement of only the amount of knowledge (constraints) that is effectively known a priori, building the right amount of parsimony into the model (see the pasture production experiment). [sent-547, score-0.549]

46 , $b_{K-1}$, such that the feature space is divided into $K$ regions by the decision boundaries $w^t x + b_r = 0$, $r = 1, \ldots$ [sent-575, score-0.397]

47 However, instead of using only the two closest classes in the constraints of a hyperplane, which is more appropriate for the loss function $l_{0-1}()$, we adopt a formulation that better captures the performance of a classifier for ordinal data. [sent-623, score-0.55]

48 In fact, as pointed out by Chu and Keerthi (2005), the ordinal inequalities on the thresholds $-b_1 \le -b_2 \le \ldots$ [sent-632, score-0.472]

49 Only under the setting $s = K - 1$ are the ordinal inequalities on the thresholds automatically satisfied (Chu and Keerthi, 2005). [sent-643, score-0.472]

50 But because $\bar{w}^t [x_i; 0] = w^t x_i$ and $\bar{w}^t [x_i; h] = w^t x_i + w_3 h$, and renaming $b$ to $b_1$ and $b + w_3 h$ to $b_2$, the formulation above simplifies to $\min_{w, b_1, b_2} \frac{1}{2} w^t w + \frac{1}{2} \frac{(b_2 - b_1)^2}{h^2}$ s.t. [sent-654, score-1.307]

51 $-(w^t x_i^{(1)} + b_1) \ge +1$, $+(w^t x_i^{(2)} + b_1) \ge +1$, $+(w^t x_i^{(3)} + b_1) \ge +1$, $-(w^t x_i^{(1)} + b_2) \ge +1$, $-(w^t x_i^{(2)} + b_2) \ge +1$, $+(w^t x_i^{(3)} + b_2) \ge +1$. [sent-656, score-0.42]

52 (5) for ordinal data previously introduced, with K = 3, s = K − 1 = 2, and a slightly modified objective function by the introduction of a regularization member, proportional to the distance between the hyperplanes. [sent-658, score-0.445]

53 the classification of ordinal data as a standard SVM problem and to remove the ambiguity in the solution by the introduction of a regularization term in the objective function. [sent-668, score-0.445]

54 The insight gained from studying the toy example paves the way for the formal presentation of the instantiation of the data replication method in SVMs. [sent-669, score-0.449]

55 This formulation for the high-dimensional data set matches the proposed formulation for ordinal data up to an additional regularization member in the objective function. [sent-677, score-0.589]

56 To instantiate the homogeneous data replication method in support vector machines, some popular algorithm for binary SVMs, such as the SMO algorithm, must be adapted for outputting a so3. [sent-684, score-0.467]

57 Figure 9: oSVM interpretation of an ordinal multiclass problem as a two-class problem. [Diagram labels: data, data extension, kernel K(x, y), kernel modification, data in, kernel definition, two-class SVM algorithm.] [sent-693, score-0.529]

58 Summarizing, the nonlinear ordinal problem can be solved by extending the feature set and modifying the kernel function, as represented diagrammatically in Figure 9. [sent-700, score-0.446]
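
One way the kernel modification could look is sketched below. This is an assumption of the example rather than something stated in the extracted text: the extended kernel is taken to be a base kernel on the original p features plus a linear kernel on the extra replica components, which matches the additive form G(x) + w^t u seen earlier. The names `rbf` and `extended_kernel` and the RBF choice are illustrative only.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on plain Python lists."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def extended_kernel(x_bar, y_bar, p, base_kernel=rbf):
    """Assumed extended-space kernel: base kernel on the first p features
    plus a linear kernel on the extra replica components."""
    x, ext_x = x_bar[:p], x_bar[p:]
    y, ext_y = y_bar[:p], y_bar[p:]
    return base_kernel(x, y) + sum(a * b for a, b in zip(ext_x, ext_y))


# Same original point, replicas for different boundaries (h = 2, K = 3).
print(extended_kernel([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], p=2))
print(extended_kernel([0.0, 0.0, 2.0], [0.0, 0.0, 2.0], p=2))
```

A sum of two valid kernels is itself a valid kernel, so a function of this shape can be handed to any two-class SVM solver that accepts a user-defined kernel.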

59 Mapping the Data Replication Method to NNs. When the nonlinear data replication method was formulated, the real-valued function G(x) was defined arbitrarily. [sent-730, score-0.414]

60 Setting G(x) as the output of a neural network, a flexible architecture for ordinal data can be devised, as represented diagrammatically in Figure 10. [sent-732, score-0.445]

61 The mapping of the homogeneous data replication method to neural networks is easily realized. [sent-741, score-0.422]

62 [Figure residue: neural network diagram. Labels: bias, inputs x1, xp and x p+1, activation functions f1 and fN, binary classifier.] [sent-745, score-0.387]

63 Ordinal Logistic Regression Model. Here we provide a probabilistic interpretation for the ordinal neural network model just introduced. [sent-749, score-0.435]

64 The traditional statistical approach for ordinal classification models the cumulative class probability $P_k = p(C \le k \mid x)$ by $\mathrm{logit}(P_k) = \Phi_k - G(x) \Leftrightarrow P_k = \mathrm{logsig}(\Phi_k - G(x))$, $k = 1, \ldots$ [sent-750, score-0.403]
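
The cumulative model above can be turned into per-class probabilities by differencing. The sketch below assumes that logsig is the logistic sigmoid, that the thresholds are supplied in increasing order, and that the last cumulative probability equals 1; the function name `class_probs` and the example thresholds are hypothetical.

```python
import math


def logsig(z):
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def class_probs(g_x, phis):
    """Per-class probabilities from P_k = logsig(Phi_k - G(x)).

    g_x: value of G(x); phis: increasing thresholds Phi_1..Phi_{K-1}.
    """
    cumulative = [logsig(phi - g_x) for phi in phis] + [1.0]  # P_K = 1 (assumed)
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


print(class_probs(0.0, [-1.0, 1.0]))
```

Increasing G(x) lowers every cumulative probability at once, shifting probability mass towards the higher classes, while the ordered thresholds keep each per-class probability non-negative.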

65 Summation. The data replication method has some advantages over standard algorithms presented in the literature for the classification of ordinal data: • It has an interesting and intuitive geometric interpretation. [sent-770, score-0.774]

66 It provides a new conceptual framework integrating disparate algorithms for ordinal data classification: Chu and Keerthi (2005) algorithm, Frank and Hall (2001), ordinal logistic regression. [sent-771, score-0.848]

67 • Even the SVM instantiation of the data replication method possesses an advantage over the algorithm presented in Chu and Keerthi (2005): the latter misses the explicit inclusion of a regularization term in the objective function, leading to ambiguity in the solution. [sent-773, score-0.402]

68 The data replication method incorporates naturally a regularization term; the unique regularization term allows interpreting the optimization problem as a single binary SVM in an extended space. [sent-774, score-0.416]

69 Experimental Methodology In the following sections, experimental results are provided for several models based on SVMs and NNs, when applied to diverse data sets, ranging from synthetic to real ordinal data, and to a problem of feature selection. [sent-776, score-0.445]

70 Here, the set of models under comparison is presented and different assessment criteria for ordinal data classifiers are examined. [sent-777, score-0.445]

71 To test the hypothesis that methods specifically targeted for ordinal data improve the performance of a standard classifier, we tested a conventional feed forward network, fully connected, with a single hidden layer, trained with the special activation function softmax. [sent-780, score-0.564]

72 • Pairwise NN (pNN): Frank and Hall (2001) introduced a simple algorithm that enables standard classification algorithms to exploit the ordering information in ordinal prediction problems. [sent-781, score-0.403]

73 First, the data is transformed from a K-class ordinal problem to (K − 1) binary problems. [sent-782, score-0.49]

74 • Costa (1996), following a probabilistic approach, proposes a neural network architecture (iNN) that exploits the ordinal nature of the data, by defining the classification task on a suitable space through a “partitive approach”. [sent-784, score-0.435]

75 A feedforward neural network with (K − 1) outputs is proposed to solve a K-class ordinal problem. [sent-785, score-0.435]

76 • Regression model (rNN): as stated in the introduction, regression models can be applied to solve the classification of ordinal data. [sent-787, score-0.403]

77 • Proposed ordinal method (oNN), based on the standard data extension technique, as previously introduced. [sent-803, score-0.403]

78 • Proposed ordinal method (oSVM), based on the standard data extension technique, as previously introduced. [sent-819, score-0.445]

79 However, as already expressed, losses that increase with the absolute difference between the class numbers capture better the fundamental structure of the ordinal problem. [sent-827, score-0.403]

80 $\tau_b = \frac{\text{concordant} - \text{discordant}}{\sqrt{(\text{concordant} + \text{discordant} + e_t)(\text{concordant} + \text{discordant} + e_p)}}$. Although insensitive to the number assigned to each class, both $r_s$ and $\tau_b$ are in fact more appropriate for pairwise ranking than for ordinal regression, due to their failure to detect bias errors. [sent-851, score-0.672]

81 For this reason, we shall restrict ourselves in the following to presenting only the results for the MAD criterion, possibly the most meaningful criterion for the ordinal regression problem. [sent-879, score-0.403]
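
The contrast between a rank-based coefficient and MAD can be seen on a tiny example. The sketch below assumes that MAD denotes the mean absolute deviation between true and predicted class numbers, and uses the standard Kendall tau-b definition, counting pairs tied in only one of the two sequences as e_t and e_p; the data are made up.

```python
import math


def mad(true, pred):
    """Mean absolute deviation between true and predicted class numbers."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def kendall_tau_b(true, pred):
    """Kendall's tau-b from concordant, discordant and singly tied pairs."""
    conc = disc = ties_true = ties_pred = 0
    n = len(true)
    for i in range(n):
        for j in range(i + 1, n):
            dt = true[i] - true[j]
            dp = pred[i] - pred[j]
            if dt == 0 and dp == 0:
                continue                  # tied in both sequences: not counted
            if dt == 0:
                ties_true += 1            # tied only in the true classes
            elif dp == 0:
                ties_pred += 1            # tied only in the predictions
            elif dt * dp > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + ties_true) * (conc + disc + ties_pred))
    return (conc - disc) / denom if denom > 0 else 0.0


true = [1, 2, 3, 4]
shifted = [2, 3, 4, 5]                    # every prediction is one class too high
print(kendall_tau_b(true, shifted), mad(true, shifted))
```

The shifted predictions preserve the ordering perfectly, so tau-b reports perfect agreement, whereas MAD registers the systematic one-class error; this is the bias-error blind spot described above.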

82 2 Discussion The main assertion concerns the superiority of all algorithms specific to ordinal data over conventional methods, both for support vector machines and neural networks. [sent-987, score-0.476]

83 Classifying Real Ordinal Data In this section, we continue the experimental study by applying the algorithms considered to the classification of real ordinal data, namely to solving problems of prediction of pasture production and employee selection. [sent-1123, score-0.502]

84 We start by observing that conventional methods performed as well as ordinal methods. [sent-1216, score-0.434]

85 We were led to the suggestion that some of the features may not properly reflect the ordinal relation among classes. [sent-1217, score-0.403]

86 In this experiment, as well as in all the previous ones, the SVM instantiation of the data replication method exhibits a performance nearly independent of the s parameter. [sent-1229, score-0.402]

87 Predicting the Gleason score from the gene expression data is thus a typical ordinal classification problem, already addressed in Chu and Ghahramani (2005) using Gaussian processes. [sent-1323, score-0.475]

88 [Table residue: rows of "n genes" counts (30 28 26 25; 26 25 24 23; 24 21 19 17; 19 17 16 15; 16 15 14 13) followed by a MAD row whose values are truncated.] [sent-1377, score-0.59]

89 Conclusion This study focuses on the application of machine learning methods, and in particular of neural networks and support vector machines, to the problem of classifying ordinal data. [sent-1405, score-0.403]

90 A novel approach to train learning algorithms for ordinal data was presented. [sent-1406, score-0.445]

91 The idea is to reduce the problem to the standard two-class setting, using the so called data replication method, a nonparametric procedure for the classification of ordinal categorical data. [sent-1407, score-0.774]

92 Two standard methods for the classification of ordinal categorical data were unified under this framework, the minimum margin principle (Shashua and Levin, 2002) and the generic approach by Frank and Hall (2001). [sent-1409, score-0.445]

93 The study compares the results of the proposed model with conventional learning algorithms for nominal classes and with models proposed in the literature specifically for ordinal data. [sent-1411, score-0.536]

94 Although it is usually assumed that learning in a higher dimension becomes a harder problem, the performance of the data replication method does not seem to be affected, probably due to the dependence among the data replicas. [sent-1414, score-0.413]

95 The data replication method is parameterised by h (and C); because it may be difficult and time consuming to choose the best value for h, it would be interesting to study possible ways to automatically set this parameter, probably as a function of the data and C. [sent-1415, score-0.413]

96 Although the data replication method was designed for ordinal classes, nothing impedes its application to nominal classes. [sent-1417, score-0.808]

97 In this appendix we derive a margin-based bound on the generalization error of the proposed ordinal classifier. [sent-1422, score-0.403]

98 Modelling ordinal relations with SVMs: an application to objective aesthetic evaluation of breast cancer conservative treatment. [sent-1494, score-0.403]

99 Probabilistic interpretation of feedforward network outputs, with relationships to statistical prediction of ordinal quantities. [sent-1509, score-0.435]

100 Regression models for ordinal data: a machine learning approach. [sent-1534, score-0.403]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ordinal', 0.403), ('replication', 0.329), ('mad', 0.273), ('osvm', 0.264), ('wt', 0.244), ('rdinal', 0.173), ('ardoso', 0.164), ('eplication', 0.164), ('lassify', 0.164), ('osta', 0.164), ('onn', 0.127), ('replica', 0.123), ('genes', 0.118), ('replicas', 0.115), ('da', 0.114), ('boundaries', 0.109), ('cx', 0.108), ('chu', 0.1), ('ethod', 0.099), ('err', 0.098), ('fn', 0.094), ('csvm', 0.091), ('psvm', 0.091), ('rsvm', 0.091), ('activation', 0.088), ('logsig', 0.082), ('nonintersecting', 0.082), ('ranking', 0.081), ('xi', 0.07), ('bi', 0.069), ('ck', 0.069), ('thresholds', 0.069), ('classes', 0.068), ('pasture', 0.064), ('shashua', 0.062), ('setups', 0.062), ('frank', 0.06), ('ordered', 0.06), ('svms', 0.057), ('mathieson', 0.055), ('pnn', 0.055), ('rnn', 0.055), ('bq', 0.054), ('earning', 0.052), ('formulation', 0.051), ('homogeneous', 0.051), ('cardoso', 0.05), ('toy', 0.047), ('predicted', 0.047), ('svm', 0.047), ('sgn', 0.047), ('concordant', 0.045), ('inn', 0.045), ('nns', 0.045), ('porto', 0.045), ('hyperplanes', 0.045), ('binary', 0.045), ('levin', 0.044), ('br', 0.044), ('nonlinear', 0.043), ('data', 0.042), ('er', 0.041), ('hyperplane', 0.041), ('ui', 0.039), ('hall', 0.039), ('cnn', 0.038), ('prostate', 0.038), ('discordant', 0.036), ('fertiliser', 0.036), ('remp', 0.036), ('xmin', 0.036), ('classi', 0.036), ('discriminating', 0.035), ('production', 0.035), ('sx', 0.034), ('keerthi', 0.034), ('nominal', 0.034), ('bk', 0.034), ('sec', 0.033), ('ci', 0.033), ('layer', 0.033), ('network', 0.032), ('conventional', 0.031), ('instantiation', 0.031), ('min', 0.031), ('gene', 0.03), ('constraints', 0.028), ('enforcement', 0.028), ('neurons', 0.028), ('cxi', 0.027), ('esl', 0.027), ('jaime', 0.027), ('kendall', 0.027), ('mer', 0.027), ('penalise', 0.027), ('xmax', 0.027), ('replicated', 0.026), ('bias', 0.026), ('eq', 0.025), ('kp', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

2 0.090592936 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

3 0.072901443 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

Author: Aggelos Chariatis

Abstract: The experimental investigation on the efficient learning of highly non-linear problems by online training, using ordinary feed forward neural networks and stochastic gradient descent on the errors computed by back-propagation, gives evidence that the most crucial factors for efficient training are the hidden units’ differentiation, the attenuation of the hidden units’ interference and the selective attention on the parts of the problems where the approximation error remains high. In this report, we present global and local selective attention techniques and a new hybrid activation function that enables the hidden units to acquire individual receptive fields which may be global or local depending on the problem’s local complexities. The presented techniques enable very efficient training on complex classification problems with embedded subproblems.
Keywords: neural networks, online training, selective attention, activation functions, receptive fields
1. Framework
Online supervised learning is in many cases the only practical way of learning. This includes situations where the problem size is very big, or situations where we have a non-recurring stream of input vectors that are unavailable before training begins. We examine online supervised learning using a particular class of adaptive models, the very popular feed forward neural networks, trained with stochastic gradient descent on the errors computed by back-propagation. In order to easily visualize the online training dynamics of highly complex non linear problems, we are experimenting on 2:η:1 networks where the input is a point in a two dimensional image and the output is the value of the pixel at the corresponding input position. This framework allows the creation of very complex non-linear problems, just by hand-drawing the problem on a bitmap and presenting it to the network. Most problems’ images in this report are 256 × 256 pixels in size, producing in total 65536 different samples each one.
Classification and regression problems can be modeled as black & white and gray scale images respectively. In this report we only examine training on classification problems. However, since mixed problems are possible, we are only interested in techniques that can be applied to both classification and regression. The target of this investigation is online training where the input is not known in advance, so the input samples are treated as random and non-recurring vectors from the input space and are discarded after being used. We select and train on random samples until the average classification or RMS error is acceptable. Since both the number of training exemplars and the complexity of the underlying function are assumed unknown, we require our training mechanism to have “initial state invariance” as a fundamental property. Thus we deliberately exclude from our arsenal any training techniques that require a schedule to be decided ahead of training. Ideally we would like the training mechanism to be totally invariant to the initial training parameters and network state.
This report is organized as follows: Sections 2 and 3 describe techniques for global and local selective attention. Section 4 is devoted to acceleration of training. In Section 5 we present experimental results and in Section 6 we discuss the presented techniques and give some directions for future research. Finally, Appendix A contains a description of the notations that have been used. In Figure 1 you can see some examples of problems that can be learned very efficiently using the techniques that are presented in the following sections.
Figure 1 (panels a to e): Examples of complex non-linear problems that can be learned efficiently.
2. Global Selective Attention - Dynamic Training Set Evolution
Consider the two problems depicted in Figure 2.
Clearly, both problems are of approximately equal complexity, since they encapsulate the same image in a different scale and position within the input space. We would like to have a mechanism that will make the network capable of learning both problems at about the same speed.
Figure 2 (panels a and b): Approximately equal complexity problems.
Intuitively, the samples on the boundaries, which are the samples on positions with the highest contrast, are those that determine the complexity of each problem. During training, these samples have the property that they produce the highest errors. We thus need a method that will focus attention on samples with high error relative to the rest. Previous work on such global selective attention has been published by many authors (Munro, 1987; Yu and Simmons, 1990; Bakker, 1992, 1993; Schapire, 1999; Zhong and Ghosh, 2000). Of particular interest are the various boosting algorithms, such as AdaBoost (Schapire, 1999), which work by placing more emphasis on training samples that the system is currently getting wrong. Unfortunately, the most successful of these algorithms require a predefined set of samples on which training will be performed, something that is excluded from our scenario. Nevertheless, in a less constrained scenario, boosting can be applied on top of our techniques as a meta-learning algorithm, using our techniques as the base-learning algorithm. A simple method that can provide such an adaptive selective attention mechanism, by keeping an exponential trace of the average error in the training set, is described in Algorithm 1.
ē ← 0
Repeat
    Pick a random sample
    Evaluate the error e for the sample
    If e > 0.5 ē
        ē ← e α + ē (1 − α)
        Train
    End
Until a stopping criterion is satisfied
In this report’s context, error evaluation and train are defined as:
Error Evaluation: Computation of the output values by forward propagating the activations from the input to the output layer for a single sample, plus computation of the output errors. The sample’s error e is set to the quadratic mean (RMS) of the output units’ errors.
Train: Back-propagation of the output errors to the hidden layer and immediate weights’ adjustment.
Algorithm 1: The dynamic training set evolution algorithm.
The algorithm evaluates the errors of all samples, but trains only for samples with error greater than half the average error of the current training set. Training is initially performed for all samples, but gradually, it is concentrated on the samples at the problem’s boundaries. When the error for these samples is reduced, other previously excluded samples enter the training set. Thus, samples enter and leave the training set automatically, with a tendency to train on samples with high error. The magnitude of the constant α that determines the time scale of the exponential trace is problem specific, but in all experiments in this report it was kept fixed to $10^{-4}$. The fraction of 0.5 was determined experimentally to give a good balance between sample selectivity and training set size. If it is close to 0 then we train for almost all samples. If it is close to 1 then we are at risk of making the training set starve from samples. Of course, one can choose to vary it dynamically in order to have a fixed percentage of samples in the training set, or, to not allow the training percentage to fall below a pre-specified limit. Figure 3 shows the training set evolution for the two-spirals problem (in Figure 1a) in various training stages. The network topology was 2:64:1.
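
The selection loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the report's code: the callables `error_fn` and `train_fn` are hypothetical placeholders for the network's error evaluation and weight update, and the stopping criterion is simply assumed to be the end of a finite sample stream.

```python
def dynamic_training_set(samples, error_fn, train_fn, alpha=1e-4, fraction=0.5):
    """Train only on samples whose error exceeds a fraction of the running average.

    samples: iterable of samples (exhausting it is the assumed stopping criterion).
    error_fn(sample) -> non-negative error; train_fn(sample) performs one update.
    Returns the final running average error and the number of training steps.
    """
    avg_error = 0.0
    trained = 0
    for sample in samples:
        e = error_fn(sample)
        if e > fraction * avg_error:
            # Exponential trace of the error over the samples actually trained on.
            avg_error = e * alpha + avg_error * (1.0 - alpha)
            train_fn(sample)
            trained += 1
    return avg_error, trained


# Toy run: each "sample" is its own error, and training just records the sample.
seen = []
print(dynamic_training_set([1.0, 0.1, 1.0], lambda s: s, seen.append, alpha=0.5))
print(seen)
```

In the toy run the low-error sample is skipped because its error falls below half the running average, which is the mechanism that lets samples enter and leave the training set automatically.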
Figure 3: Training set evolution for the two-spirals problem. Under each image you can see the stage of training in trains, error-evaluations and the percentage of samples for which training is performed at the corresponding stage. Panel labels: 10000-11840-93%, 20000-25755-76%, 40000-66656-41%, 60000-142789-22%, 90000-296659-18%.
You can see the training set forming gradually and tracing the problem boundaries where the error is the highest. One could argue that such a process may be very sensitive to outliers. Experiments have shown that this does not happen. The algorithm does not try to recognize the outliers, but at least, adjusts naturally by not allowing the training set size to shrink. So, in the presence of heavy noise, the algorithm becomes ineffective, but does not introduce any additional harm. Figure 4 shows the two-spirals problem distorted by dynamic noise and the corresponding training set after 90000 trains with 64 hidden units. You can see that the algorithm tolerates noise by not allowing the training set size to shrink. It is also interesting that at noise levels as high as 30% the algorithm can still exclude large areas of the input space from training.
Figure 4: Top row shows the model with a visualization of the applied dynamic noise. Bottom row shows the corresponding training sets after 90000 trains. Under each pair of images you can see the percentage of noise distortion applied to the original input and the percentage of samples for which training is performed. Panel labels: 10%-42%, 20%-62%, 30%-75%, 50%-93%, 70%-99%.
3. Local Selective Attention - Receptive Fields
Having established a global method to focus attention on the important parts of a problem, we now come to address the main issue, which is the network training. Let us first discuss the roles of the hidden and output layers in a feed forward neural network with a single hidden layer and without shortcut input-to-output connections.
The hidden layer is responsible for transforming a non linear input-to-output mapping into a non linear input-to-hidden layer mapping that can be mapped linearly to the output. The output layer is responsible for learning a linear hidden-to-output mapping (which is an easy job), but most importantly, it must provide to the hidden layer the error gradient information that will be used for the error credit assignment problem. In this respect, it becomes apparent that all hidden units should receive the most accurate error information possible. That is why we must train all hidden-to-output connections and back propagate the error through all these connections.

This is not the case for the hidden layer. Consider, for a classification problem, how the hidden units with sigmoidal activations partition the input space into sub areas. By adjusting the input-to-hidden weights and biases, each hidden unit develops a hyperplane that bi-partitions the input space in the most useful sense. We would like to limit the number of hyperplanes in order to reduce the system’s available degrees of freedom and obtain better generalization capabilities. At the same time, we would like to use them thoroughly in order to optimize the input-output approximation. This can be done by arranging the hyperplanes to touch the problem’s boundaries at regular intervals dictated by the boundary curvature, as shown in Figure 5a. Figure 5b shows a suboptimal placement of the hyperplanes which causes a waste of resources.

Each hidden unit must be differentiated from the others and ideally not interfere with the subproblems that the other units are trying to solve. Suppose that two hidden units are governed by the same, or nearly the same, parameters. How can we differentiate them? There are many possibilities.

Figure 5: Optimal (a) vs. suboptimal (b) hyperplanes.
One possibility would be to simply throw one unit away and make the output weight of the other equal to the sum of the two original output weights. That would leave the function unchanged. However, identifying these similar units during training is not easy computationally. In addition, we would have to figure out a method to compute the best initial placement for the hyperplane of the new unit that would replace the one that was thrown away.

Another possibility would be to add noise to the weight updates, gradually reduced with a simulated annealing schedule which would have to be decided before training begins. Unfortunately, the loss of initial state invariance would complicate training for unknown complex non linear problems.

To our thinking, it is much better to embed constraints into the system, so that it will not be possible for two hidden units to develop the same hyperplane. Two computationally efficient techniques to embed such constraints are described in Sections 3.1 and 3.2.

Many other authors have also examined methods for local selective attention. For the related discussions see Huang and Huang (1990), Ahmad and Omohundro (1990), Baluja and Pomerleau (1995), Flake (1998), Duch et al. (1998), and Phillips and Noelle (2004).

3.1 Fixed Cascaded Inhibitory Connections

A problem with the hidden units of conventional feed forward networks is that they are all fed with the same inputs and back propagated errors, and that they operate without knowing of each other’s existence. So, nothing prevents them from behaving identically. This lack of communication between hidden units has been addressed by researchers through hidden unit lateral connections. Agyepong and Kothari (1997) use unidirectional lateral interconnections between adjacent hidden layer units, claiming that these connections facilitate the controlled assignment of role and specialization of the hidden units.
Kothari and Ensley (1998) use Gaussian lateral connections which enable the hidden decision boundaries to be global in nature but also be able to represent local structure. Numerous neural network algorithms employ bidirectional lateral inhibitory connections in order to generate competition between the hidden units. In an interesting variation described by Spratling and Johnson (2004), competition is provided by each hidden unit blocking its preferred inputs from activating other units. We use a single hidden layer where the hidden units are considered sequenced. Each hidden unit is connected to all succeeding hidden units with a fixed connection with weight set to minus one. The hidden units get differentiated, because they receive different inputs, they produce different activations and they get back different error information. Another benefit is that they can generate higher order feature detectors, that is, the resulting hidden hyperplanes are no longer strictly linear, but they may also be curved. Considering the fixed value, -1 is used just to avoid a multiplication. Values from -0.5 to -2 give good results as well. As it is shown in Section 5.1.1, the fixed cascaded inhibitory connections are very effective at reducing a problem’s asymptotic residual error. This should be attributed to both of their abilities, to generate higher order feature detectors and to hasten the hidden units’ symmetry breaking. These connections can be implemented very efficiently with just one subtraction per hidden unit for both hidden activation and hidden error computation. In addition, the disturbance to the parallelism of the backpropagation algorithm is minimal. Most operations on the hidden units can still be done in parallel and only the final computations must be performed sequentially. We include the algorithms for the hidden activation and error computations as examples of sequential implementations. 
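The sequential computation with fixed -1 cascaded connections can be sketched as follows. This is a minimal illustration, not the author's code: it assumes that any bias is folded into the input vector as a constant component, that `r` holds the output error signals and `U[j]` the weights from hidden unit j to the outputs, and it uses the Elliott function as an example activation. The function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def elliott(n: float) -> float:
    """Example sigmoid-like activation without exponentials."""
    return n / (1.0 + abs(n))


def elliott_prime(n: float) -> float:
    """Derivative of the Elliott function with respect to n."""
    return 1.0 / (1.0 + abs(n)) ** 2


def cascaded_forward(
    x: Sequence[float],
    W: Sequence[Sequence[float]],
    f: Callable[[float], float] = elliott,
) -> Tuple[List[float], List[float]]:
    """Net inputs and activations; each unit inhibits all succeeding units."""
    delta = 0.0  # running sum of -h_k over the preceding units
    n: List[float] = []
    h: List[float] = []
    for w_j in W:
        n_j = delta + sum(xi * wi for xi, wi in zip(x, w_j))
        h_j = f(n_j)
        delta -= h_j  # one subtraction per hidden unit
        n.append(n_j)
        h.append(h_j)
    return n, h


def cascaded_backward(
    r: Sequence[float],
    U: Sequence[Sequence[float]],
    n: Sequence[float],
    f_prime: Callable[[float], float] = elliott_prime,
) -> List[float]:
    """Hidden error signals, computed from the last unit back to the first."""
    delta = 0.0  # running sum of -g_m over the succeeding units
    g = [0.0] * len(n)
    for j in reversed(range(len(n))):
        e_j = delta + sum(rk * uk for rk, uk in zip(r, U[j]))
        g[j] = e_j * f_prime(n[j])
        delta -= g[j]
    return g
```

Two units with identical incoming weights already produce different activations in this sketch, because the second unit receives the negated activation of the first as an extra input.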
These changes can be very easily incorporated into conventional neural network code.

    Hidden Activations
    $\delta \leftarrow 0$
    For $j \leftarrow 1 \ldots \eta$
        $n_j \leftarrow \delta + x \cdot w_j$
        $h_j \leftarrow f(n_j)$
        $\delta \leftarrow \delta - h_j$
    End

    Hidden Error Signals
    $\delta \leftarrow 0$
    For $j \leftarrow \eta \ldots 1$
        $e_j \leftarrow \delta + r \cdot u_j$
        $g_j \leftarrow e_j \, f'(n_j)$
        $\delta \leftarrow \delta - g_j$
    End

Algorithm 2: Hidden unit activation and error computation with Fixed Cascaded -1 Connections.

3.2 Selective Training of the Hidden Units

The hidden units’ differentiation can be further magnified if each unit is not trained on all samples, but only on the samples for which it receives a high error. We train all output units, but only the hidden units for which the error signal is higher than the RMS of the error signals of all hidden units. Typically about 10% of the hidden units are trained on each sample during early training, and the percentage falls to as little as 2% when the network is close to the solution.

This is intuitively justified by the observation that at the beginning of training the hidden units are largely undifferentiated and receive high error signals from the whole input space. At the final stage of training, the hidden hyperplanes’ linear soft decision boundaries are combined by the output layer to define arbitrarily shaped decision boundaries. For $\mu$ input dimensions, from 1 up to $\mu$ units can define an open sub-region and $\mu + 1$ units are enough to define a closed convex region. With such high level constructs, each sample may be discriminated from the rest with very few hidden units. These are the units that receive the highest error signal for the sample.

Experiments on various problems have shown that training on a fraction of the hidden units was always better (with respect to the number of trains to convergence) than training all or just one hidden unit. It seems that training only one hidden unit on each sample is not sufficient for some problems (Thornton, 1992). Measurements for one of these experiments are reported in Section 5.1.1.
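The selection rule of Section 3.2 can be illustrated with a short sketch. It assumes that "higher than the RMS" is applied to the magnitude of each hidden error signal and that a plain gradient step is used for the selected units; both the helper names and the update form are hypothetical.

```python
import math
from typing import List, Sequence


def select_hidden_units(g: Sequence[float]) -> List[int]:
    """Indices of hidden units whose |error signal| exceeds the RMS of all signals."""
    rms = math.sqrt(sum(v * v for v in g) / len(g))
    return [j for j, v in enumerate(g) if abs(v) > rms]


def update_selected_units(
    W: List[List[float]],     # W[j]: incoming weights of hidden unit j
    x: Sequence[float],       # inputs seen by the hidden layer
    g: Sequence[float],       # hidden error signals (assumed dE/dn_j)
    learning_rate: float,
) -> List[int]:
    """Apply a plain gradient step to the selected units only; return their indices."""
    selected = select_hidden_units(g)
    for j in selected:
        W[j] = [w - learning_rate * g[j] * xi for w, xi in zip(W[j], x)]
    return selected
```

Note that if all error signals have exactly the same magnitude, none exceeds the RMS and no hidden unit is updated in this sketch; a real implementation might want to handle that corner case explicitly.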
In addition to the convergence acceleration, the combined effect of training a fraction of the hidden units on a fraction of the samples gives big savings in CPU usage per sample as well. This sparseness of training with respect to evaluation provides further opportunities for speedup, as discussed in Section 4.

3.3 Centering On The Input Space

It is a well known recommendation (Schraudolph, 1998a,b; LeCun et al., 1998) that the input values should be normalized to have zero mean and unit standard deviation over each input dimension. This is achieved by subtracting the mean from each input value and dividing by the standard deviation. For some problems, like the one in Figure 2b, the center of the input space is not equal to the center of the problem. When the input is not known in advance, the latter must be computed adaptively. Moreover, since the hidden units are trained on different input samples, we should compute for each hidden unit its own mean and standard deviation over each input dimension. For the connection between hidden unit $j$ and input unit $i$ we can adaptively compute the approximate mean $m_{ji}$ and standard deviation $s_{ji}$ over the inputs that train the hidden unit, using either exponential traces:

    $m_{ji}(t) \leftarrow \beta\, x_i + (1 - \beta)\, m_{ji}(t-1)$,
    $q_{ji}(t) \leftarrow \beta\, x_i^2 + (1 - \beta)\, q_{ji}(t-1)$,
    $s_{ji}(t) \leftarrow \left( q_{ji}(t) - m_{ji}(t)^2 \right)^{1/2}$,

or perturbated calculations:

    $m_{ji}(t) \leftarrow m_{ji}(t-1) + \beta\, (x_i - m_{ji}(t-1))$,
    $v_{ji}(t) \leftarrow v_{ji}(t-1) + \beta \left[ (x_i - m_{ji}(t))\,(x_i - m_{ji}(t-1)) - v_{ji}(t-1) \right]$,
    $s_{ji}(t) \leftarrow v_{ji}(t)^{1/2}$,

where $\beta$ is a constant that determines the time scale of exponential averaging, vector $x$ holds the input values, matrix $Q$ holds the means of the squared input values and matrix $V$ holds the variances. The means and standard deviations of a hidden unit’s input connections are updated only when the hidden unit is trained. The result of this treatment is that each hidden unit is centered on a different part of the input space.
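Both update rules for a single connection (hidden unit j, input i) can be sketched as below. The function names are hypothetical, and the clamping of small negative values before the square root is an added safeguard against rounding, not something stated in the text.

```python
import math
from typing import Tuple


def trace_update(m: float, q: float, x: float, beta: float) -> Tuple[float, float, float]:
    """Exponential traces of the mean m and the mean square q; returns (m, q, s)."""
    m_new = beta * x + (1.0 - beta) * m
    q_new = beta * x * x + (1.0 - beta) * q
    # Clamp at zero: rounding can make q - m^2 slightly negative (assumption).
    s = math.sqrt(max(q_new - m_new * m_new, 0.0))
    return m_new, q_new, s


def perturbed_update(m: float, v: float, x: float, beta: float) -> Tuple[float, float, float]:
    """Incremental update of the mean m and the variance v; returns (m, v, s)."""
    m_new = m + beta * (x - m)
    v_new = v + beta * ((x - m_new) * (x - m) - v)
    s = math.sqrt(max(v_new, 0.0))
    return m_new, v_new, s
```

In the scheme described above, these updates would be applied to the connections of a hidden unit only on the samples for which that unit is trained, so each unit accumulates statistics of its own part of the input space.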
This center is indirectly affected by the error that the inputs produce on the hidden unit. The magnitude of the constant $\beta$ is problem specific, but in all experiments in this report it was kept fixed to $10^{-3}$. This constant must be selected large enough, so that the centers will rapidly move to their optimum locations, and small enough, so that the hidden units will see a relatively static view of the input space and the gradient descent algorithm will not be confused. As the hidden units jitter around their centers, we effectively train them on slightly shifted views of the input space, something that can assist generalization. We get something analogous to training with jitter (Reed et al., 1995), at no extra cost.

In Figure 6, the squares show where each hidden unit is centered. You can see that most are centered on the problem boundaries at regular intervals. The crosses show the standard deviations. In some directions the standard deviations are very small, which results in very high normalized input values, causing the hidden units to act as threshold units along those directions. The sloped lines show the hyperplane distance from the center and the slope. These are computed for display purposes, from their theoretical formulas for a conventional network, without considering the effect of the cascaded connections. For some units the hyperplanes shown are not exactly on the boundaries. This is because the fixed cascaded connections cause the hidden units to be not exactly linear discriminants. In the last picture you can see the decision surface of a hidden unit which is a bit curved and coincides with the class boundary, although its calculated hyperplane is not on the boundary.

An observant reader may also notice that the hyperplane distances from the centers are very small, which implies that the corresponding biases are small as well. On the contrary, if all hidden units were centered on the center of the image, we would have the following problem.
The hyperplanes of some hidden units must be positioned on the outer parts of the image. For this to happen, these units would have to develop large biases relative to the weights. This would cause their activations to have small variances. These small variances might need to be compensated by large output weights and biases, which would saturate the output units and in addition ill-condition the problem.

One may wonder if the hidden biases are still necessary. Since the centers are individually set, it may seem at first that they are not. However, the centers are not trained through error backpropagation, and the hyperplanes do not necessarily pass over them. The role of the biases is to drive the hyperplanes to the correct location and thus pull the centers in the corresponding direction.

The individual centering of the hidden units based on the samples’ positions is feasible because we train only on samples with high errors and only the hidden units with high errors. By ignoring the small errors, we effectively position the center of each hidden unit near the center of mass of the high errors that it receives. However, this centering technique can still be used even if one chooses to train on all samples and all hidden units. Then, the statistics interval should be differentiated for each hidden unit and be recomputed for each sample in proportion to the normalized absolute error that each hidden unit receives. A way to do it is to set the effective statistics interval for hidden unit $j$ and sample $s$ to:

    $\beta \, \dfrac{|e_{j,s}|}{\overline{|e_j|}}$

where $\beta$ is the global statistics interval, $e_{j,s}$ is the hidden unit’s backpropagated error for the sample and $\overline{|e_j|}$ is the mean of the absolute backpropagated errors that the hidden unit receives, measured via an exponential trace. The denominator acts as a normalizer, which makes the hidden unit’s mobility independent of the average magnitude of the errors.

Figure 6: Hidden unit centers, standard deviations, hyperplanes, global and local training sets and a hidden unit’s output. The images were captured at the final stage of training, of the problem in Figure 1a with 64 hidden units.

Centering on other factors has been extensively investigated by Schraudolph (1998a,b). These techniques can provide further convergence acceleration, but we chose not to use them because of the additional computational overhead that they require.

3.4 A Hybrid Activation Function

As shown in Section 5, the aforementioned techniques enable successful training on some difficult problems like those in Figures 1a and 1b. However, if the problem contains subproblems, or put another way, if the problem generates more than one cluster of high error density, the centering mechanism does not manage to drive the hidden unit centers to the most suitable locations. The centers are attracted by the larger subproblem or get stuck in areas between the subproblems, as shown in Figure 7.

Figure 7: Model, training set, and inadequate centering.

We need a mechanism that can force a hidden unit to get out of balanced but suboptimal positions. It would be nice if this mechanism could also allow the centers to migrate to various points in the input space as the need arises. It has been found that both of these requirements are fulfilled by a new hybrid activation function.

Sigmoid activations have the property that they produce hyperplanes that separate the input space globally. Our intention is to use a sigmoid-like hidden activation function, because it can provide global separability, and at the same time reduce the activation value towards zero on inputs which are not important to a hidden unit.
The Gaussian function is commonly used within radial basis function (RBF) neural networks (Broomhead and Lowe, 1988). When this function is applied to the distance of a sample to the unit’s center, it produces a local response which is stronger near the center. We can then enclose the sigmoidal activation within a Gaussian envelope, by multiplying the activation with a value between 0 and 1, which is provided by applying the Gaussian function to the distance that is measured in the normalized input space.

When the number of input dimensions is large, the distance metric that must be used is not an obvious choice. Table 1 contains the distance metrics that we have considered. The most suitable distance metric seems to depend on the distribution of the samples that train the hidden units.

    Metric              Distance
    Euclidean           $\sqrt{\sum_{i=1}^{\mu} x_i^2}$
    Euclidean Scaled    $\sqrt{\frac{1}{\mu} \sum_{i=1}^{\mu} x_i^2}$
    Manhattan           $\sum_{i=1}^{\mu} |x_i|$
    Manhattan Scaled    $\frac{1}{\mu} \sum_{i=1}^{\mu} |x_i|$
    Chebyshev           $\max_i |x_i|$

Table 1: Various distance metrics that have been considered for the hybrid activation function.

In particular, if the samples follow a uniform distribution over a hypercube, then the Euclidean distance has the disturbing property that the average distance grows larger as the number of input dimensions increases, and consequently the corresponding average Gaussian response decreases towards zero. As suggested by Hegland and Pestov (1999), we can make the average distance to the center independent of the input dimensions by measuring it in the normalized input space and then dividing it by the square root of the input’s dimensionality. The same problem occurs for the Manhattan distance, which has been claimed to be a better metric in high dimensions (Aggarwal et al., 2001). We can normalize this distance by dividing it by the input’s dimensionality.
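The five metrics of Table 1 are simple to state in code. The sketch below is illustrative only; the function names are hypothetical, and `z` is assumed to be the vector of normalized input values relative to the unit's center.

```python
import math
from typing import Sequence


def euclidean(z: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in z))


def euclidean_scaled(z: Sequence[float]) -> float:
    """Euclidean distance divided by the square root of the dimensionality."""
    return math.sqrt(sum(v * v for v in z) / len(z))


def manhattan(z: Sequence[float]) -> float:
    return sum(abs(v) for v in z)


def manhattan_scaled(z: Sequence[float]) -> float:
    """Manhattan distance divided by the dimensionality."""
    return sum(abs(v) for v in z) / len(z)


def chebyshev(z: Sequence[float]) -> float:
    return max(abs(v) for v in z)
```

For a point on a corner of the hypercube such as `[1.0] * 16`, the plain Euclidean distance is 4 while the scaled Euclidean and Chebyshev distances are both 1. For a point on an axis such as `[1.0, 0.0, 0.0, 0.0]`, the scaled Euclidean distance drops to 0.5 while the Chebyshev distance stays at 1, which illustrates the dependence on the sample distribution.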
A problem that appears for both of the above rescaled distance metrics is that for the samples that are near the axes the distances will be very much attenuated and the corresponding Gaussian responses will be close to one, something that will make the Gaussian envelopes ineffective. A more suitable metric for this type of distribution is the Chebyshev metric, whose average magnitude is independent of the dimensions. However, for reasons analogous to those mentioned above, this metric is not the most suitable if the distribution of the samples is spherical. In that case, the Euclidean distance does not need any rescaling and is the most natural distance measure. We can obtain spherical distributions by adaptively whitening them. As Plumbley (1993) and Laheld and Cardoso (1994) independently proposed, the whitening matrix $Z$ can be adaptively computed as:

    $Z_{t+1} = Z_t - \lambda \left( z_t z_t^T - I \right) Z_t$

where $\lambda$ is the learning rate parameter, $z_t = Z_t x_t$ is the whitened vector and $x_t$ is the input vector. However, we would need too many additional parameters to do it individually for each subset of samples on which each hidden unit is trained.

For the above reasons (and for lack of a justified alternative), in the implementation of these techniques we typically use the Euclidean metric when the number of input dimensions is up to three and the Chebyshev metric in all other cases. We have also replaced the usual tanh (sigmoidal) and Gaussian (bell-like) functions by similar functions which do not involve exponentials (Elliott, 1993).

For each hidden unit $j$ we first compute the net-input $n_j$ to the hidden unit (that is, the weighted distance of the sample to the hyperplane), as the inner product of normalized inputs and weights plus the bias:

    $z_{ji} = \dfrac{x_i - m_{ji}}{s_{ji}}$,
    $n_j = z_j \cdot w_j$.

We then compute the sample’s distance $d_j$ to the center of the unit, which is measured in the normalized input space:

    $d_j = \lVert z_j \rVert$.
Finally, we compute the activation $h_j$ as:

    $a_j = \mathrm{Elliott}(n_j) = \dfrac{n_j}{1 + |n_j|}$,
    $b_j = \mathrm{bell}(d_j) = \dfrac{1}{1 + d_j^2}$,
    $h_j = a_j \, b_j$.

Since $d_j$ is not a function of $w_j$, we treat $b_j$ as a constant for the calculation of the activation derivative with respect to $n_j$, which becomes:

    $\dfrac{\partial h_j}{\partial n_j} = b_j \, (1 - |a_j|)^2$.

The hybrid activation function, which by definition may only be used for hidden units connected to the input layer, enables these units to acquire selective attention capabilities on the input space. Each hidden unit may have a global or local receptive field on each input dimension. The size of this dimensional receptive field depends on the standard deviation which is computed for the corresponding dimension. This activation makes balanced positions between subproblems unstable. As soon as the center is changed by a small amount, it will be attracted by the nearest subproblem. This is because the unit’s activation and the corresponding error will be increased for samples towards the nearest subproblem and decreased in the other direction. Hidden units can still be centered between subproblems, but only if their movement in either direction causes a large error for samples in the opposite direction, that is, if they are absolutely necessary at their current position.

Additionally, if a unit is centered near a subproblem that produces low errors and the unit is not necessary in that area, then it may migrate to other areas that still have high errors. This unit center migration has been observed in all experiments on complex problems. This may be due to the non-linear response of the bell function, and its long tails which keep the activation above zero for all input samples.

Figure 8: Model, evaluation, training set, hidden unit centers and two hidden unit outputs showing the effect of the hybrid activation function. The images were captured at the final stage of training, of the problem in Figure 1d with 700 hidden units.
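The hybrid activation of one hidden unit and its derivative can be sketched as follows. This is a minimal illustration under assumptions: the function name and argument layout are hypothetical, the Euclidean metric is used by default, and the optional `bias` argument follows the prose description of the net input rather than a displayed formula.

```python
import math
from typing import Callable, Sequence, Tuple


def euclidean(z: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in z))


def hybrid_activation(
    x: Sequence[float],   # raw input values
    m: Sequence[float],   # per-connection means of this hidden unit
    s: Sequence[float],   # per-connection standard deviations (assumed > 0)
    w: Sequence[float],   # input-to-hidden weights of this unit
    bias: float = 0.0,    # assumed additive bias, see lead-in
    metric: Callable[[Sequence[float]], float] = euclidean,
) -> Tuple[float, float]:
    """Return (h, dh/dn) for one hidden unit."""
    z = [(xi - mi) / si for xi, mi, si in zip(x, m, s)]  # normalized inputs
    n = sum(zi * wi for zi, wi in zip(z, w)) + bias       # net input
    d = metric(z)                                         # distance to center
    a = n / (1.0 + abs(n))                                # Elliott sigmoid
    b = 1.0 / (1.0 + d * d)                               # bell-shaped envelope
    h = a * b
    dh_dn = b * (1.0 - abs(a)) ** 2                       # b treated as constant
    return h, dh_dn
```

For an input at unit normalized distance with net input 1, the sketch gives a sigmoid value of 0.5, an envelope value of 0.5 and therefore an activation of 0.25; farther samples are attenuated more strongly by the envelope.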
In Figure 8 you can see a complex problem with 9 clusters of high errors. The hidden units place their centers on all clusters and are able to solve the problem. In the last two images, you can see the effect of the hybrid activation function, which attenuates the activation at points far from the center with respect to the standard deviation on each dimension. One unit develops a nearly circular local receptive field and another develops an elongated ellipsoidal receptive field. The latter provides almost global separation in the vertical direction and becomes a useful discriminant for two of the subproblems.

One may find similarities between this hybrid activation function and the Square-MLP architecture described by Flake (1998). The latter partially implements higher order neurons by doubling the number of input units and setting the new input values equal to the squares of the original inputs. This architecture enables the hidden units to form local features of various shapes, but not the locally constrained sigmoid formed by our proposal. In contrast, the hybrid activation function does not need any additional parameters beyond those that are already used for centering, and it has the additional benefit, which is realized by the local receptive fields in conjunction with the small biases and the symmetric sigmoid, that the hidden activations will have a mean close to zero. As discussed by Schraudolph (1998a,b) and LeCun et al. (1998), this is very beneficial for the output layer training.

However, there is still room for improvement. As was also observed by Flake (1998), the orientations of the receptive field ellipses are always in the direction of one of the input axes. This limitation is expected to hinder performance when training hidden units which have sloped hyperplanes. Figure 9 shows a complex problem at the middle of training.
Units with sloped hyperplanes are trained on samples whose input values are highly correlated. This can slow down learning by itself, but in addition, the standard deviations cannot get sufficiently small and as a result the receptive field cannot be sufficiently shrunk in the direction perpendicular to the hyperplane. As a result the hidden unit’s activation unnecessarily interferes with the activations of nearby units. Although it may be possible to address the correlation problem with a more sophisticated training method that uses second order gradient information, like Stochastic Meta Descent (Schraudolph, 1999, 2002), the orientations of the receptive fields will still be limited. In Section 6.2 we discuss possible directions for further research that may circumvent this limitation.

Figure 9: Evaluation and global and local training sets during middle training for the problem in Figure 1b. It can be seen that a hidden unit with a sloped hyperplane is trained on samples with highly correlated input values. Samples that are separated by horizontal or vertical hyperplanes are easier to learn.

4. Further Speedups

In this section we first describe an implementation technique that reduces the computational requirements of the error evaluation phase and then we give references to methods that have been proposed by other authors for the acceleration of the training phase.

4.1 Evaluation Speedup

Two of the discussed techniques, training only on samples with high errors and training only the hidden units with high error, make error evaluation the most processing-demanding phase in the solution of a given problem. In addition, some other techniques, like board game learning through temporal difference methods, require many evaluations to be performed before each train. We can speed up evaluation by the following observation: for many problems, only part of the input is changed on successive samples.
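One way to exploit this observation is to cache the weighted input sums of the hidden units and to adjust them only for the inputs that changed. The sketch below is a simplified illustration with hypothetical names: it caches only the plain inner products x·w_j of a conventional hidden layer (the cascaded term and the per-unit normalization of Section 3 are left out), and repeated incremental updates may accumulate small rounding differences compared with a full recomputation.

```python
from typing import Dict, List, Sequence


class CachedNetInputs:
    """Keeps x and the inner products x . w_j, updating only what changed."""

    def __init__(self, W: Sequence[Sequence[float]], x: Sequence[float]) -> None:
        self.W = [list(w_j) for w_j in W]   # W[j][i]: weight from input i to unit j
        self.x = list(x)
        # Full evaluation, done once.
        self.n: List[float] = [
            sum(wi * xi for wi, xi in zip(w_j, self.x)) for w_j in self.W
        ]

    def update(self, changes: Dict[int, float]) -> List[float]:
        """Apply {input index: new value} and return the refreshed inner products."""
        for i, new_value in changes.items():
            diff = new_value - self.x[i]
            if diff != 0.0:
                for j, w_j in enumerate(self.W):
                    self.n[j] += w_j[i] * diff
                self.x[i] = new_value
        return self.n
```

The cost of an update is proportional to the number of changed inputs times the number of hidden units, instead of the full number of inputs times the number of hidden units. If the weights are modified by a train, the cache has to be rebuilt or corrected accordingly.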
For example, for a backgammon program with 200 input units (with raw board data and not any additional features encoded), very few inputs will change on successive positions. Even on two dimensional problems such as images, we can arrange to train on samples selected by random changes on the X and Y dimensions alternatively. This process of only resampling one coordinate at a time is also known as “Gibbs sampling” and it nicely generalises to more than two coordinates (Geman and Geman, 1984). Thus, we can keep in memory all intermediate results from the evaluation, and recalculate only for the inputs that have changed. This implementation technique requires more storage, especially for high dimensional inputs. Fortunately, storage is not an issue on modern hardware. 4.2 Training Speedup Many authors have proposed methods for speeding-up online training by using second order gradient information in order to dynamically vary either the learning rate or the momentum (see LeCun et al., 1993; Leen and Orr, 1993; Murata et al., 1996; Harmon and Baird, 1996; Orr and Leen, 1996; Almeida et al., 1997; Amari, 1998; Schraudolph, 1998c, 1999, 2002; Graepel and Schraudolph, 2002). As it is shown in the next section, our techniques enable standard stochastic gradient descent with momentum to efficiently solve all the highly non-linear problems that have been investigated. However, the additional speed up that an accelerating algorithm can give is a nice thing to have. Moreover, these accelerating algorithms automatically reduce the learning rate when we are close to a solution (by sensing the oscillations in the error gradient) something that we should do through annealing if we wanted the best possible solution. We use the Incremental Delta-Delta (IDD) accelerating algorithm (Harmon and Baird, 1996), an incremental nonlinear extension to Jacobs’ (1988) Delta-Delta algorithm, because of its simplicity and relatively small processing requirements. 
IDD computes an individual learning rate $\lambda$ for each weight $w$ as:

    $\lambda(t) = e^{\xi(t)}$,
    $\xi(t+1) = \xi(t) + \theta \, \dfrac{\Delta w(t+1) \, \Delta w(t)}{\lambda(t)}$,

where $\theta$ is the meta-learning rate which we typically set to 0.1.

5. Experimental Results

In order to measure the effectiveness of the described techniques on various classes of problems, we performed several experiments. Each experiment was replicated 10 times with different random initial weights using matched random seeds, and the means and standard deviations of the results were plotted in the corresponding figures. For the experiments we used a single hidden layer, the cross entropy error function, the logistic or softmax activation function for the output units and the Elliott or hybrid activation function for the hidden units. Output to hidden layer weights and biases were initialized to zero. Hidden to input layer weights were initialized to random numbers from a normal distribution and then rescaled so that the incoming weights to each hidden unit had norm unity. Hidden unit biases were initialized to a uniform random number between zero and one.

The curves in the figures are labelled with a combination of the following letters which indicate the techniques that were applied:

B – Adjust weights using stochastic gradient descent with momentum 0.9 and fixed learning rate $0.1/\sqrt{c}$, where $c$ is the number of incoming connections to the unit.
A – Adjust weights using IDD with meta-learning rate 0.1 and initial learning rate $1/\sqrt{c}$, where $c$ is as above.
L – Use fixed cascaded inhibitory connections as described in Section 3.1.
S – Skip weights adjustment for samples with low error as described in Section 2.
U – Skip weights adjustment for hidden units with low error as described in Section 3.2.
C – Use individual means and stdevs for each hidden to input connection as described in Section 3.3.
H – Use the hybrid activation function as described in Section 3.4.
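The per-weight learning rate adaptation used by technique A can be sketched as below. This is an illustrative reading of the IDD formulas of Section 4.2, not a reference implementation: the class name is hypothetical, momentum is left out, and the sketch assumes that each weight change is the negative gradient scaled by the current learning rate.

```python
import math


class IDDWeight:
    """A single weight with its own IDD-adapted learning rate (hypothetical helper)."""

    def __init__(self, value: float, initial_rate: float, theta: float = 0.1) -> None:
        self.value = value
        self.xi = math.log(initial_rate)  # learning rate is exp(xi), always positive
        self.theta = theta                # meta-learning rate
        self.prev_dw = 0.0                # previous weight change

    @property
    def rate(self) -> float:
        return math.exp(self.xi)

    def step(self, gradient: float) -> float:
        """Apply one update for the given error gradient and return the weight change."""
        lam = self.rate
        dw = -lam * gradient              # assumed plain gradient step
        # Successive changes with the same sign raise the rate, opposite signs lower it.
        self.xi += self.theta * dw * self.prev_dw / lam
        self.value += dw
        self.prev_dw = dw
        return dw
```

Because the rate is stored as an exponent, it stays positive, and oscillating gradients near a solution shrink it automatically, in line with the behaviour described in Section 4.2.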
For the ‘B’ training method we deliberately avoided an annealing schedule for the learning rate, since this would destroy the initial state invariance of our techniques. Instead, we used a fixed small learning rate which we compensated with a large momentum. For the ‘A’ method, we used a small meta-learning rate, to avoid instabilities due to the high non-linearities of the examined problems. It is important to note that for both training methods the learning parameters were fixed to the above values and not optimized for each individual problem.

For the ‘C’ technique, the centers of the hidden units were initially set to the center of the input space and the standard deviations were set to one third of the distance between the extreme values of each dimension. When the technique was not used, a global preprocessing was applied which normalized the input samples to have zero mean and unit standard deviation.

5.1 Two Input Dimensions

In this section we give experimental results for the class of problems that we have mainly examined, that is, problems with two input dimensions and one output dimension, for which we have dense and noiseless training samples from the whole input space. In the figures, we measure the average classification error with respect to the stage of training. The classification error was averaged via an exponential trace with time scale $10^{-4}$.

5.1.1 Comparison of Technique Combinations

For these experiments we used the two-spirals problem shown in Figures 1a, 3, 4 and 6. We chose this problem as a non trivial representative of the class of problems that during early training generate a single cluster of high error density. The goal of this experiment is to measure the effectiveness of various technique combinations and then to measure how well the best technique combination scales with the size of the hidden layer.
Figures 10 and 11 show the average classification error with respect to the number of evaluated samples and processing cycles respectively for 13 technique combinations. For these experiments we used 64 hidden units. The standard deviations were not plotted in order to keep the figures uncluttered. Figure 10 has also been split into Figures 12 and 13 in order to show the related error bars.

Comparing the curves B vs. BL and BS vs. BLS in Figures 10 and 11, we can see that the fixed cascaded inhibitory connections reduce the asymptotic residual error by more than half. This also applies, but to a lesser degree, when we skip weight updates for hidden units with low errors (B vs. BU, BS vs. BSU). When the two are used in combination, we can see a speed-up of convergence but the asymptotic error is only marginally further improved (BLU and BLSU).

In Figure 11, it can be seen that skipping samples with low errors can speed up convergence and reduce the asymptotic error as well (BLU vs. BLSU). This is a very intriguing result, in the sense that it implies that the system can learn faster and better by throwing away information.

Both Figures 10 and 11 show the BLUCH curve to diverge. Considering the success of the BLSUCH curve, we might infer that skipping samples is necessary for the hybrid activation. However, the real problem, which was found by viewing the dynamics of training, is that the centering mechanism does not work correctly when we train on all samples. A possible remedy may be to modify the statistics interval which is used for centering, as described at the end of Section 3.3.

BLSUC vs. BLSU shows that centering further reduces the remaining asymptotic error by half and converges much faster as well. Comparing curve BLSUCH vs. BLSUC, we see that the hybrid activation function does better, but only marginally. This was expected since this problem has a single region of interest, so the ability of H to focus on multiple regions simultaneously is not exercised.
This is the reason for the additional experiments in Section 5.1.2. BLSUCH and ALSUCH were the most successful technique combinations, with the latter being a little faster. Nevertheless, it is very impressive that standard stochastic gradient descent with momentum can approach the best asymptotic error in less than a second when using a modern 3.2 GHz processor.

Figure 14 shows the average classification error with respect to the number of evaluated samples, for the ALSUCH technique combination and various hidden layer sizes. It can be seen that the asymptotic error is almost inversely proportional to the number of hidden units. This is a good indication that our techniques use the available resources efficiently. It is also interesting that the convergence rates to the corresponding asymptotic errors are quite fast and about the same for all hidden layer sizes.

Figure 10: Average classification error vs. number of evaluated samples for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The standard deviations have been omitted for clarity. (Curves: B, BU, BL, BLU, BLUC, BLUCH, BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH.)

Figure 11: Average classification error vs. Intel IA32 CPU cycles in billions, for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The horizontal scale also corresponds to seconds when run on a 1 GHz processor. The standard deviations have been omitted for clarity. (Curves: as in Figure 10.)

Figure 12: Part of Figure 10 showing error bars for technique combinations which employ S. (Curves: BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH.)

Figure 13: Part of Figure 10 showing error bars for technique combinations which do not employ S. (Curves: B, BU, BL, BLU, BLUC, BLUCH.)

Figure 14: Average classification error vs. number of evaluated samples for various hidden layer sizes, while training the problem in Figure 1a with the ALSUCH technique combination. (Curves: 32, 48, 64, 96, 128 and 256 hidden units.)

Figure 15: Average classification error vs. number of evaluated samples for the ALSUCH and ALSUC technique combinations, while training the problem in Figure 1c with 100 hidden units. The dashed lines show the minimum and maximum observed values.

5.1.2 Hybrid vs. Conventional Activation

For these experiments we used the two-dimensional problem depicted in Figures 1c and 7. We chose this problem as a representative of the class of problems that during early training generate small clusters of high error density of various sizes. For this kind of problem we typically obtain very small residuals for the classification error, although the problem may not have been learned. This is because we measure the error on the whole input space, and for these problems most of the input space is trivial to learn. The problem’s complexities are confined to very small areas. The dynamic training set evolution algorithm is able to locate these areas, but we need many more sample presentations, since most of the samples are not used for training.
The goal of this experiment is to measure the effectiveness of the hybrid activation function at coping with the varying sizes of the subproblems. For these experiments we used 100 hidden units. Figure 15 shows that the ALSUCH technique, which employs the hybrid activation function, halved the asymptotic error relative to the ALSUC technique. As all of the visual inspections revealed, one of which is reproduced in Figure 16, the difference in the residual errors of the two curves is due to the insufficient approximation of the smaller subproblem by the ALSUC technique.

Figure 16: ALSUCH vs. ALSUC approximations for a problem with two sub-problems. (Panels: Model, ALSUCH, ALSUC.)

5.2 Higher Input and Output Dimensions

In order to evaluate our techniques on a problem with higher input and output dimensions, we selected a standard benchmark, the Letter Recognition database from the UCI Machine Learning Repository (Newman et al., 1998). This database consists of 20000 samples that use 16 integer attributes to classify the 26 letters of the English alphabet. This problem is characterized by a medium input dimensionality and a large output dimensionality. The latter makes it a very challenging problem for any classifier. This problem differs from those on which we have experimented so far, in that we do not have the whole input space at our disposal for training. We must train on a limited number of samples and then test the system’s generalization abilities on a separate test set. Although we have not taken any special measures to assist generalization, the experimental results indicate that our techniques have the inherent ability to generalize well when given noiseless exemplars. An observation that applies to this problem is that the IDD accelerated training method could not do better than standard stochastic gradient descent with momentum.
Thus, we report results using the BLSUCH technique combination, which is computationally more efficient than the ALSUCH technique. For this experiment, which involves more than two output classes, we used the softmax activation function at the output layer.

Table 2 contains previously published results showing the classification accuracy of various classifiers. The most successful of them were the AdaBoosted versions of the C4.5 decision-tree algorithm and of a feed-forward neural network with two hidden layers. Both classifier ensembles required quite a lot of machines in order to achieve that high accuracy.

Classifier                                     Test Error %   Reference
Naive Bayesian classifier                      25.3           Ting and Zheng (1999)
AdaBoost on Naive Bayesian classifier          24.1           Ting and Zheng (1999)
Holland-style adaptive classifier              17.3           Frey and Slate (1991)
C4.5                                           13.8           Freund and Schapire (1996)
AdaBoost on C4.5 (100 machines)                3.3            Freund and Schapire (1996)
AdaBoost on C4.5 (1000 machines)               3.1            Schapire et al. (1997)
CART                                           12.4           Breiman (1996)
AdaBoost on CART (50 machines)                 3.4            Breiman (1996)
16-70-50-26 MLP (500 online epochs)            6.2            Schwenk and Bengio (1998)
AdaBoost on 16-70-50-26 MLP (20 machines)      2.0            Schwenk and Bengio (1998)
AdaBoost on 16-70-50-26 MLP (100 machines)     1.5            Schwenk and Bengio (2000)
Nearest Neighbor                               4.3            Fogarty (1992)

Table 2: A compilation of previously reported best error rates on the test set for the UCI Letters Recognition Database.

Figure 17 shows the average error reduction with respect to the number of online epochs, for the BLSUCH technique combination and various hidden layer sizes. As suggested in the database’s documentation, we used the first 16000 samples for training and for measuring the training accuracy, and the remaining 4000 samples to measure the predictive accuracy. The solid and dashed curves show the test and training set errors respectively.
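The layer-size notation used in Table 2 (for example 16-70-50-26) fixes the number of trainable parameters of a multilayer perceptron. The sketch below assumes fully connected layers with one bias per non-input unit. This is an assumption, although it reproduces the weight totals quoted in this section for the 16-70-50-26, 16-125-26 and 16-1000-26 networks. The name `mlp_weight_count` is illustrative only.

```python
def mlp_weight_count(layer_sizes):
    """Count weights of a fully connected MLP, including one bias per unit.

    `layer_sizes` lists the units per layer, input layer first,
    e.g. [16, 70, 50, 26] for a 16-70-50-26 network.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # Each unit of the next layer has n_in incoming weights plus a bias.
        total += (n_in + 1) * n_out
    return total


if __name__ == "__main__":
    print(mlp_weight_count([16, 70, 50, 26]))   # 6066
    print(mlp_weight_count([16, 125, 26]))      # 5401
    print(mlp_weight_count([16, 1000, 26]))     # 43026
```

Multiplying the single-network count by the ensemble size gives the totals for the boosted networks, for example 20 × 6066 = 121320.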
Similarly to ensemble methods, we can observe two interesting phenomena which both seem to contradict Occam’s razor. The first observation is that the test error stabilizes or continues to decrease slightly even after the training error has been zeroed. What is really happening is that the RMS error for the training set (which is related to the confidence of classification) continues to decrease even after the classification error has been zeroed, something that is also beneficial for the test set’s classification error. The second observation is that increasing the network’s capacity does not lead to overfitting. Although the training set error can be zeroed with just 125 hidden units, increasing the number of hidden units reduces the residual test error as well. We attribute this phenomenon to the conjecture that the hidden units’ differentiation results in a smoother approximation (as suggested by Figure 5 and the related discussion).

Figure 17: Average error reduction vs. number of online epochs for various hidden layer sizes, while training on the UCI Letters Recognition Database with the BLSUCH technique combination. The solid and dashed curves show the test and training set errors respectively. The standard deviations for the training set errors have been omitted for clarity. The embedded table contains the minimum observed errors across all trials and epochs, and the average errors across all trials at epoch 100. (Curves: train and test for 125, 250, 500 and 1000 hidden units.)

Hidden units   Test error % (min)   Test error % (avg at end)
125            4.0                  4.6
250            2.8                  3.2
500            2.3                  2.6
1000           2.1                  2.4

Comparing our results with those in Table 2, we can also observe the following:

The 16-125-26 MLP (5401 weights) reached a 4.6% misclassification error on average, which is 26% better than the 6.2% of the 16-70-50-26 MLP (6066 weights), despite the fact that it had fewer weights, a simpler architecture with only one hidden layer, and was trained for far fewer online epochs. It is indicative that the asymptotic residual classification error on the test set was reached in about 30 online epochs.

The 16-1000-26 MLP (43026 weights) reached a 2.4% misclassification error on average, which is the third best published result following the AdaBoosted 16-70-50-26 MLPs with 20 and 100 machines (121320 and 606600 weights respectively). The lowest observed classification error was 2.1% and was reached in one of the 10 runs at the 80th epoch.

It must be stressed that the above results were obtained without any optimization of the learning rate, without a learning rate annealing schedule and within a far shorter training time. All MLPs with 250 hidden units and above gave results which put them at the top of the list of non-ensemble techniques, and they even outperformed AdaBoost on C4.5 with 100 machines. Similarly to Figure 14, we also see that the convergence rates to the corresponding asymptotic errors on the test set are quite fast and about the same for all hidden layer sizes.

6. Discussion and Future Research

We have presented global and local selective attention techniques that can help neural network training to concentrate on the difficult parts of complex non-linear problems. A new hybrid activation function has also been presented that enables the hidden units to acquire individual receptive fields in the input space. These individual receptive fields may be global or local depending on the problem’s local complexities. The success of the new activation function is due to the fact that it depends on two distances. The first is the weighted distance of a sample to the hidden unit’s hyperplane. The second is the distance to the hidden unit’s center. We need both distances; neither of them is sufficient on its own. The first helps us discriminate and the second helps us localize.
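The two-distance idea can be sketched as follows. This is only an illustration under stated assumptions, not the activation defined earlier in the report: it assumes tanh as the squashing function f, a Gaussian attenuation b_j = exp(−d_j²/2), and a product h_j = a_j · b_j. The symbol names follow Appendix A, while the function name `hybrid_activation` and the explicit `bias` parameter are hypothetical.

```python
import math


def hybrid_activation(x, w, bias, mean, std):
    """Illustrative hybrid activation of one hidden unit j.

    Assumptions (not confirmed by the report): f = tanh, Gaussian
    attenuation, and h_j = a_j * b_j.
    """
    # z_ji: z-score of each input as seen by this hidden unit.
    z = [(xi - mi) / si for xi, mi, si in zip(x, mean, std)]
    # n_j: net-input, proportional to the signed distance to the hyperplane.
    n = sum(wi * zi for wi, zi in zip(w, z)) + bias
    # a_j: discriminating part, driven by the hyperplane distance.
    a = math.tanh(n)
    # d_j: distance of the sample to the unit's center (in z-score units).
    d = math.sqrt(sum(zi * zi for zi in z))
    # b_j: localizing part, decays as the sample moves away from the center.
    b = math.exp(-0.5 * d * d)
    # h_j: the unit responds only where both parts are significant.
    return a * b


if __name__ == "__main__":
    center, spread = [0.0, 0.0], [1.0, 1.0]
    print(hybrid_activation([0.5, 0.5], [1.0, 1.0], 0.0, center, spread))
    print(hybrid_activation([10.0, 10.0], [1.0, 1.0], 0.0, center, spread))
```

In this sketch a sample far from the unit’s center produces a near-zero output regardless of which side of the hyperplane it lies on, which is one way to read the statement that the first distance discriminates and the second localizes.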
The dynamic training set evolution algorithm locates the sub-areas of the input space where the problem resides. The fixed cascaded inhibitory connections and the selective training of a subset of the hidden units on each sample force the hidden units to differentiate and attack different subproblems. The individual centering of the hidden units at different points in the input space adaptively conditions the network to the problem’s local structures and enables each hidden unit to solve a well-conditioned subproblem. In coordination with the above, the hidden units’ limited receptive fields allow training to follow a divide-and-conquer paradigm where each hidden unit only solves a local subproblem. The solutions to the subproblems are then combined by the output layer to give a solution to the original problem.

In the reported experiments we initialized the hidden weights and biases so that the hidden hyperplanes would cover the whole input space at random positions and orientations. The initial norm of the weights was also adjusted so that the net-input to each hidden unit would fall in the transition between the linear and non-linear range of the activation function. These specific initializations were necessary for standard backpropagation. In contrast, we have found that the combined techniques are insensitive to the initial weights and biases, as long as their values are small. We have repeated the experiments with hidden biases set to zero and hidden weight norms set to 10^-3, and the results were equivalent to those reported in Section 5. However, the choice of the best initial learning rate is still problem specific.

An additional and important characteristic of these techniques is that training of the hidden layer does not depend solely on gradient information. Gradient-based techniques can only perform local optimization by locating a local minimum of the error function when the system is already in the basin of attraction of that minimum.
Stochastic training has the potential to escape from a shallow basin, but only when the basin is not very wide. Once there, the system cannot escape towards a different basin with a lower minimum. In contrast, in our model some of the hidden layer’s free parameters (the weights) are trained through gradient descent on the error, whereas others (the means and standard deviations) are “trained” from the statistical properties of the back-propagated errors. Each hidden unit places its center near the center of mass of the error that it receives and limits its visibility to the area of the input space where it produces a significant error. This model makes the hidden error landscape change constantly. We conjecture that during training, paths connecting the various error basins are continuously emerging and vanishing. As a result, the system can explore much more of the solution space. It is indicative that in all the reported experiments, all trials converged to a solution with more or less the same residual error irrespective of the initial network state.

The combination of the presented techniques enables very fast training on complex classification problems with embedded subproblems. By focusing on the problem’s details and efficiently utilizing the available resources, they make feasible the solution of very difficult problems (like the one in Figure 1e), provided that an adequate number of hidden units has been used. Although other machine learning techniques can do the same, to our knowledge this is the first report that this can be done using ordinary feed-forward neural networks and backpropagation, in an online, adaptive and memory-less scenario, where the input exemplars are unknown before training and discarded after being used. In the following we discuss some areas that deserve further investigation.
6.1 Generalization and Regression

For the classes of problems that were investigated, we had noiseless exemplars and the whole input space at our disposal for training, so there was no danger of overfitting. Thus, we did not use any mechanism to assist generalization. This does not mean, of course, that the network just stored the input-output mapping, as a lookup table would do. By putting constraints on the positions and orientations of the hidden unit hyperplanes and by limiting their receptive fields, we reduced the system’s available degrees of freedom, and the network arranged its resources in a way that achieves the best possible approximation of the input-output mapping. The experiments on the Letter Recognition Database showed remarkable generalization capabilities. However, when we train on noisy samples or when the number of training samples is small with respect to the size and complexity of the input space, there is a danger of overfitting. It remains to be examined how the described techniques are affected by methods that avoid overfitting, such as training with jitter, error regularization, target smoothing and sigmoid gain attenuation (Reed et al., 1995). This consideration also applies to regression problems, which usually require smoother approximations. Although early experiments give evidence that the presented techniques can be applied to regression problems as well, we feel that some smoothing technique must be included in the training framework.

6.2 Receptive Fields Limited Orientations

As noted in Section 3.4, the orientations of the receptive field ellipses are limited to the direction of one of the input axes. This hinders training performance by not allowing the receptive fields to be adequately shrunk in the direction perpendicular to the hyperplane. In addition, hidden units with sloped hyperplanes are trained on highly correlated input values. These problems are expected to be exacerbated in high dimensional input spaces.
We would cure both of these problems simultaneously if we could individually transform the input for each hidden unit through adaptive whitening, or if we could present to each hidden unit a rotated view of the input space, such that one of the axes is perpendicular to the hyperplane and the rest are parallel to it. Unfortunately, both of the above transformations would require too many additional parameters. An approximation (for 2-dimensional problems) that we are currently investigating is the following: For each input vector we compute K vectors rotated around the center of the input space with successive angle increments equal to π/(2K). Our purpose is to obtain uniform rotations between 0 and π/4. Every few hundred training steps, we reassign to each hidden unit the most appropriate input representation and adjust the affected parameters (weights, means and standard deviations). The results are promising.

6.3 Dynamic Cascaded Inhibitory Connections

Regarding the fixed cascaded inhibitory connections, it must be examined whether it is better to make the strength of the connections dynamic. Minus one is adequate when the weights are small. However, as the weights get larger, the inhibitory connections become less and less effective at differentiating the hidden units. We can try to make them relative to each hidden unit’s average absolute net-input or, alternatively, to make them trainable. It has been observed that increasing the strength of these connections enables the hidden units to generate more curved discriminant functions, which is very beneficial for some problems.

6.4 Miscellaneous

More experiments need to be done in order to evaluate the effectiveness of the hybrid activation function on highly non-linear problems in many dimensions.
High dimensional input spaces have a multitude of disturbing properties in regard to distance and density metrics, which may affect the hybrid activation in yet unknown ways. Last, we must devise a training mechanism that will be invariant to the initial learning rate and that will automatically vary the number of hidden units as each problem requires.

Acknowledgments

I would like to thank all participants in my threads in usenet comp.ai.neural-nets for their fruitful comments on early presentations of the subjects in this report. Special thanks to Aleks Jakulin for his support and ideas on further research that can make these results even better, and to Greg Heath for bringing to my attention the perturbated forms for the calculation of sliding window statistics. I also thank the area editor Léon Bottou and the anonymous reviewers for their valuable comments and for helping me to bring this report into shape for publication.

Appendix A. Notational Conventions

The following list contains the meanings of the symbols that have been used in this report. Symbols with subscripts are used either as scalars or as vectors and matrices when the subscripts are omitted. For example, w_ji is a single weight, w_j is a weight vector and W is a weight matrix.

α – A constant that determines the time scale of the exponential trace of the average training-set error within the dynamic training set evolution algorithm.
β – A constant that determines the time scale of the exponential trace of the input means and standard deviations.
δ – An accumulator for the efficient implementation of the fixed cascaded inhibitory connections.
η – The number of hidden units.
µ – The number of input units.
f – The hidden units’ squashing function.
i – Index enumerating the input units.
j – Index enumerating the hidden units.
k – Index enumerating the output units.
a_j – The hidden unit’s activation computed from the sample’s weighted distance to the hidden unit’s hyperplane.
b_j – The hidden unit’s activation attenuation computed from the sample’s distance to the hidden unit’s center.
d_j – The sample’s distance to the hidden unit’s center.
e_j – The hidden unit’s accumulated back-propagated errors.
g_j – The hidden unit’s error signal f(n_j) e_j.
h_j – The hidden unit’s activation.
m_ji – The mean of the values received by hidden unit j from input unit i.
n_j – The net-input to the hidden unit.
q_ji – The mean of the squared values received by hidden unit j from input unit i.
r_k – The error of output unit k.
s_ji – The standard deviation of the values received by hidden unit j from input unit i.
u_jk – The weight of the connection from hidden unit j to output unit k.
v_ji – The variance of the values received by hidden unit j from input unit i.
w_ji – The weight of the connection from hidden unit j to input unit i.
x_i – The value of input unit i.
z_ji – The normalized input value received by hidden unit j from input unit i. It is currently computed as the z-score of the input value. A better alternative would be to compute the vector z_j by multiplying the input vector x with a whitening matrix Z_j.

References

C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In J. Van den Bussche and V. Vianu, editors, Proceedings of the 8th International Conference on Database Theory (ICDT), volume 1973 of Lecture Notes in Computer Science, pages 420–434. Springer, 2001.
K. Agyepong and R. Kothari. Controlling hidden layer capacity through lateral connections. Neural Computation, 9(6):1381–1402, 1997.
S. Ahmad and S. Omohundro. A network for extracting the locations of point clusters using selective attention. In Proceedings of the 12th Annual Conference of the Cognitive Science Society, MIT, 1990.
L. B. Almeida, T. Langlois, and J. D. Amaral. On-line step size adaptation. Technical Report INESC RT07/97, INESC/IST, Rua Alves Redol 1000 Lisbon, Portugal, 1997.
S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
P. Bakker. Don’t care margins help backpropagation learn exceptions. In A. Adams and L. Sterling, editors, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 139–144, 1992.
P. Bakker. Exception learning by backpropagation: A new error function. In P. Leong and M. Jabri, editors, Proceedings of the 4th Australian Conference on Neural Networks, pages 118–121, 1993.
S. Baluja and D. Pomerleau. Using the representation in a neural network’s hidden layer for task-specific focus of attention. In IJCAI, pages 133–141, 1995.
L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996.
D. S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2(3):321–355, 1988.
W. Duch, K. Grudzinski, and G. H. F. Diercksen. Minimal distance neural methods. In World Congress of Computational Intelligence, pages 1299–1304, 1998.
D. L. Elliott. A better activation function for artificial neural networks. Technical Report TR 93-8, The Institute for Systems Research, University of Maryland, College Park, MD, 1993.
G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 145–163. Springer, 1998.
T. C. Fogarty. Technical note: First nearest neighbor classification on Frey and Slate’s letter recognition problem. Machine Learning, 9(4):387–388, 1992.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.
P. W. Frey and D. J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6:161–182, 1991.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
T. Graepel and N. N. Schraudolph. Stable adaptive momentum for rapid online learning in nonlinear systems. In J. R. Dorronsoro, editor, Proceedings of the International Conference on Artificial Neural Networks (ICANN), volume 2415 of Lecture Notes in Computer Science, pages 450–455. Springer, 2002.
M. Harmon and L. Baird. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-6543, 1996.
M. Hegland and V. Pestov. Additive models in high dimensions. Computing Research Repository (CoRR), cs/9912020, 1999.
S. C. Huang and Y. F. Huang. Learning algorithms for perceptrons using back propagation with selective updates. IEEE Control Systems Magazine, pages 56–61, April 1990.
R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.
R. Kothari and D. Ensley. Decision boundary and generalization performance of feed-forward networks with Gaussian lateral connections. In S. K. Rogers, D. B. Fogel, J. C. Bezdek, and B. Bosacchi, editors, Applications and Science of Computational Intelligence, SPIE Proceedings, volume 3390, pages 314–321, 1998.
B. Laheld and J. F. Cardoso. Adaptive source separation with uniform performance. In Proc. EUSIPCO, pages 183–186, September 1994.
Y. LeCun, P. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian’s eigenvectors. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 156–163. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Mueller. Efficient backprop. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer, 1998.
T. K. Leen and G. B. Orr. Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Proceedings of the 7th NIPS Conference (NIPS), Advances in Neural Information Processing Systems 6, pages 477–484. Morgan Kaufmann, 1993.
P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.
N. Murata, K. Müller, A. Ziehe, and S. Amari. Adaptive on-line learning in changing environments. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 599–605. MIT Press, 1996.
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
G. B. Orr and T. K. Leen. Using curvature information for fast stochastic search. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 606–612. MIT Press, 1996.
J. L. Phillips and D. C. Noelle. Reinforcement learning of dimensional attention for categorization. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society, 2004.
M. Plumbley. A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proc. IEE Conf. on Artificial Neural Networks, Brighton, UK, pages 86–90, 1993.
R. Reed, R. J. Marks, and S. Oh. Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter. IEEE Transactions on Neural Networks, 6(3):529–538, 1995.
R. E. Schapire. A brief introduction to boosting. In T. Dean, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), pages 1401–1406. Morgan Kaufmann, 1999.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, editor, Proceedings of the 14th International Conference on Machine Learning (ICML), pages 322–330. Morgan Kaufmann, 1997.
N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
N. N. Schraudolph. Centering neural network gradient factors. In G. B. Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 207–226. Springer, 1998a.
N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical Report IDSIA-33-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 1998b.
N. N. Schraudolph. Online local gain adaptation for multi-layer perceptrons. Technical Report IDSIA-09-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Galleria 2, CH-6928 Manno, Switzerland, 1998c.
N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In ICANN, pages 569–574. IEE, London, 1999.
H. Schwenk and Y. Bengio. Boosting neural networks. Neural Computation, 12(8):1869–1887, 2000.
H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks for character recognition. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998.
M. W. Spratling and M. H. Johnson. Neural coding strategies and mechanisms of competition. Cognitive Systems Research, 5(2):93–117, 2004.
C. Thornton. The howl effect in dynamic-network learning. In Proceedings of the International Conference on Artificial Neural Networks, pages 211–214, 1992.
K. M. Ting and Z. Zheng. Improving the performance of boosting for naive Bayesian classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 296–305, 1999.
Y. H. Yu and R. F. Simmons. Descending epsilon in back-propagation: A technique for better generalization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 3, pages 167–172, 1990.
S. Zhong and J. Ghosh. Decision boundary focused neural network classifier. In Intelligent Engineering Systems Through Artificial Neural Networks (ANNIE). ASME Press, 2000.

4 0.069768578 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

Author: Carine Hue, Marc Boullé

Abstract: In this paper, we consider the supervised learning task which consists in predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally on the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing numerical variables and grouping categorical ones, and we derive an analytical criterion to select the partition with the highest posterior probability. We show how these partitions can be used to build univariate predictors and multivariate ones under a naive Bayes assumption. We also propose a new evaluation criterion for probabilistic rank estimators. Based on the logarithmic score, we show that this criterion has the advantage of being bounded from below, which is not the case for the logarithmic score computed for probabilistic value estimators. A first set of experiments on synthetic data shows the good properties of the proposed criterion and of our partitioning approach. A second set of experiments on real data shows competitive performance of the univariate and selective naive Bayes rank estimators projected on the value range compared to methods submitted to a recent challenge on probabilistic metric regression tasks. Our approach is applicable to all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors, as it automatically detects the variables which contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the criterion selection is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting. Keywords: rank regression, probabilistic approach, 2D partitioning, non parametric estimation, Bayesian model selection

5 0.068437621 10 jmlr-2007-An Interior-Point Method for Large-Scale ℓ1-Regularized Logistic Regression

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.

6 0.067738406 52 jmlr-2007-Margin Trees for High-dimensional Classification

7 0.061191399 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

8 0.061043508 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

9 0.051301256 23 jmlr-2007-Concave Learners for Rankboost

10 0.048248105 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

11 0.042099237 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

12 0.041626152 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

13 0.040061872 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

14 0.038862407 70 jmlr-2007-Ranking the Best Instances

15 0.037342452 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

16 0.037204348 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

17 0.037148617 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

18 0.035861418 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

19 0.035108518 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

20 0.032131862 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.218), (1, 0.069), (2, -0.026), (3, 0.049), (4, 0.062), (5, 0.019), (6, -0.21), (7, -0.041), (8, -0.082), (9, 0.07), (10, 0.019), (11, -0.045), (12, 0.101), (13, 0.073), (14, 0.142), (15, -0.049), (16, -0.2), (17, 0.109), (18, -0.015), (19, -0.002), (20, -0.061), (21, -0.022), (22, 0.053), (23, 0.112), (24, 0.277), (25, 0.149), (26, -0.029), (27, -0.236), (28, -0.018), (29, -0.011), (30, 0.144), (31, -0.039), (32, -0.12), (33, -0.124), (34, 0.097), (35, -0.234), (36, -0.165), (37, -0.05), (38, -0.052), (39, 0.003), (40, 0.085), (41, -0.113), (42, 0.109), (43, 0.005), (44, 0.043), (45, 0.077), (46, -0.044), (47, 0.048), (48, 0.13), (49, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92593992 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks

2 0.43276936 23 jmlr-2007-Concave Learners for Rankboost

Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang

Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. 
While the inputs and outputs of this problem are ranks, the outputs can be simplified to return only the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. © 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang.

2. Rankboost

The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances, which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as $f_j(x)$, the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as $f_1, \ldots, f_n$, where $f_j(x') < f_j(x'')$ implies that instance $x'$ has a better rank than instance $x''$ under ranking $f_j$. For example, in some of the experiments we present, each $f_j$ is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, $H(x)$, which defines a linear ordering on the instances, that is, $H(x') < H(x'')$ iff $x'$ has a better rank than $x''$.
In rankboost,

$$H(x) = \sum_{t=1}^{T} w_t h_t(x), \qquad (1)$$

a weighted sum of weak ranking learners, where the $h_t(x)$'s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, $\Phi : X \times X \to \mathbb{R}$, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here $\Phi(x', x'') > 0$ means that $x'$ should be better ranked than $x''$ for a given query. It is an anti-symmetric function, that is, $\Phi(x', x'') = -\Phi(x'', x')$ and $\Phi(x, x) = 0$, which avoids loops where two instances should both be ranked better than the other. Also $\Phi(x', x'') = 0$ when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings:

$$D(x', x'') = c \cdot \max\left(0, \Phi(x', x'')\right), \qquad (2)$$

where $c = \left[\sum_{x', x''} \max\left(0, \Phi(x', x'')\right)\right]^{-1}$. The algorithm itself is very similar to adaboost (Freund and Schapire, 1996).
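The initial weighting of Eq. 2 can be sketched in a few lines. This is only an illustration: the function name `initial_pair_weights`, the matrix layout (row index for x', column index for x'') and the three-instance favor matrix are assumptions made for the example, not part of the paper.

```python
import numpy as np

def initial_pair_weights(favor):
    """Eq. 2: D(x', x'') = c * max(0, favor[x', x'']), with c chosen so D sums to 1.

    favor[i, j] > 0 is read as "instance i should be ranked better than instance j"
    (an assumed matrix layout for this sketch).
    """
    favor = np.asarray(favor, dtype=float)
    positive = np.maximum(favor, 0.0)      # keep only the favored orderings
    total = positive.sum()
    if total == 0.0:
        raise ValueError("the favor function expresses no preference")
    return positive / total                # c = 1 / total

# Hypothetical example: instance 0 is the correct identity and should be
# ranked better than instances 1 and 2; no favor between 1 and 2.
favor = np.array([[ 0, 1, 1],
                  [-1, 0, 0],
                  [-1, 0, 0]])
D = initial_pair_weights(favor)
```

Only the two favored pairs receive weight, so the boosting iterations never penalize the relative order of the two incorrect identities.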
At each iteration the algorithm selects the weak learner that best maximizes a rank function's utility, r(h) (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). The weight function D is then adjusted to reflect the impact of the weak learner.

Algorithm 1: rankboost algorithm for generating a ranking function.
Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm.
Output: ranking function H(x) (Eq. 1)
Initialize $D^{(1)} = D$ (Eq. 2)
for t = 1 ... T do
    Find weak learner $h_t$ that maximizes $r(h) = \sum_{x', x''} D^{(t)}(x', x'')\,(h(x'') - h(x'))$ (Eq. 4).
    Choose $w_t = 0.5 \ln\left((1 + r(h_t)) / (1 - r(h_t))\right)$ (Eq. 5).
    Update: $D^{(t+1)}(x', x'') = D^{(t)}(x', x'') \exp\left(w_t (h_t(x') - h_t(x''))\right) / Z_t$, where $Z_t$ is chosen to make $\sum_{x', x''} D^{(t+1)}(x', x'') = 1$.
end for

Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs:

$$\sum_{x', x''} D(x', x'')\, I\left[H(x'') \le H(x')\right],$$

where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form:

$$h(x) = \begin{cases} 1 & \text{if } f_j(x) > \theta, \\ 0 & \text{if } f_j(x) \le \theta, \\ q_{def} & \text{if } f_j(x) \text{ is undefined.} \end{cases} \qquad (3)$$

Each binary weak learner, h, operates on a single $f_j$ (ranking feature), giving a binary output of 0 or 1 depending on whether the instance's rank is smaller than a learned parameter, θ. Note that the function h(x) in (3), which is dependent on only one $f_j(x)$ with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified $q_{def}$ value.

3.
Rankboost, the Weak Learner and Regularization

While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error:

$$\sum_{i=1}^{N} d^{(t)}(i)\, I\left[y_i \ne h(z_i)\right],$$

where $y_i$ is the class label, I is the indicator function and $d^{(t)}$ is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function:

$$r = r(h) = \sum_{x', x''} D^{(t)}(x', x'')\,(h(x'') - h(x')), \qquad (4)$$

where $D^{(t)}$ is the weight function over pairs of instances. As Eq. 4 contains the term $h(x'') - h(x')$, short of linear learners this equation cannot be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm's ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al.
(2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001). 4. Regularizing Ranking Functions with Consistency in Preference and Confidence In this paper, as Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y j = f j (x), for that instance, so that H(x) = G ( f1 (x), . . . , fn (x)), for certain functions G (y1 , . . . , yn ) = G(y). Here, we assume that the f j (x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. 
(2003) make a strong case for conversion of raw measures to relative orderings (rankings) over combining measures directly, arguing that it is more general and avoids the semantics of particular measures. As the $y_j$'s are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning case our motivations are different. Fairness is not the goal; the goal is to improve the accuracy or performance of the ranking function. Thus, regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy. Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if $a_j \le b_j$, $j = 1, \ldots, n$, then $G(a) \le G(b)$. Monotonicity was enforced by requiring that the coefficients, $w_t$ in Eq. 1, combining the binary weak learners (Eq. 3) were cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits. In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts.

4.1 Grocer Example

A grocer needs to make 2 decisions, to decide between stocking oat bran vs. granola and to decide between stocking turnips vs. radishes.
The grocer asks her consultants to express their opinion about stocking each item, and based on their responses makes her 2 decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, that is, they all prefer granola over oat bran in the first problem. Let's assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision. Let's say that the same consultants that liked granola more also liked radishes more (and the same ones that like oat bran more like turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants' relative confidence. The above properties of preference and confidence imply that preference should be applied consistently across different problems and confidence should reflect the relative confidences of the individual consultants.

4.2 The General Principle of Consistency in Preference and Confidence

To generalize the above grocer's problem consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference.
The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and the following condition holds for any pair of decision problems.

Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants in the sense that A is the (nonempty) set of all consultants whose preference is identical to that of the combiner. Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then, the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem.

Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, $B = A^c$. If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below.
Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering). In particular, the level of confidence between different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable.

4.3 Confidence for Ranks and Ranking Functions

In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence for low ranks than high ranks. For example, we would expect the difference between ranks of 1 and 3 to represent a more significant distinction on the part of a ranker than would, for example, the difference between ranks 34 and 36, which may be almost arbitrarily assigned. Specifically we make the following definition:

Definition 2. Preference and confidence for univariate rank values: For any pair of rank values $\{r, r'\} \subset \mathbb{R}$ with $r < r'$, by monotonicity r is preferred. For any two pairs of rank values $\{r, r'\}$ and $\{r'', r'''\}$ with $r < r'$ and $r'' \le r'''$, the confidence level for $\{r, r'\}$ is higher than that of $\{r'', r'''\}$ if $r \le r''$ and $r''' - r'' \le r' - r$, with at least one strict inequality. The pair $\{r, r'\}$ provides no preference if $r = r'$. Likewise, if either $r' - r = r''' - r'' = 0$ or $r - r'' = r' - r''' = 0$, the two pairs $\{r, r'\}$ and $\{r'', r'''\}$ are defined to have the same level of confidence.

In this definition confidence between pairs of ranks can only be compared if the pair with the lowest rank has a gap at least as large as the other pair. Thus, we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and applies only a partial ordering on rank pairs.
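The partial ordering of Definition 2 can be made concrete with a small comparison routine. This is a sketch under an assumption: it reads the definition as "the pair whose better rank is at least as good and whose gap is at least as large has the higher confidence", which is how the explanatory paragraph after the definition describes it. The function name `compare_confidence` and its return values are hypothetical.

```python
def compare_confidence(pair_a, pair_b):
    """Compare the confidence of two rank pairs under the assumed reading of Definition 2.

    Returns "higher", "lower" or "same" for pair_a relative to pair_b,
    or None when the two pairs are not comparable (the ordering is partial).
    """
    low_a, high_a = sorted(pair_a)
    low_b, high_b = sorted(pair_b)
    gap_a, gap_b = high_a - low_a, high_b - low_b

    if gap_a == 0 and gap_b == 0:
        return "same"        # neither pair expresses a preference
    if low_a == low_b and gap_a == gap_b:
        return "same"        # identical pairs
    if low_a <= low_b and gap_a >= gap_b:
        return "higher"      # better low rank and at least as large a gap
    if low_b <= low_a and gap_b >= gap_a:
        return "lower"
    return None              # e.g. better low rank but smaller gap

# The example from the text: ranks 1 vs. 3 compared with ranks 34 vs. 36.
example = compare_confidence((1, 3), (34, 36))
```

A pair such as (1, 2) against (3, 10) returns None: the first pair has the better low rank but the smaller gap, so the definition leaves the two unordered.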
As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility. As the rank function, G, is a univariate function, we apply the same definitions of preference and confidence to it. That is, for a ranking function G and for any fixed rank vectors, $y$ and $y'$, the preferred rank vector is the one with smaller G, that is, $y$ is preferred iff $G(y) < G(y')$. For confidence again we need to consider pairs of rank vectors, $\{y, y'\}$ and $\{y'', y'''\}$, with $G(y) \le G(y')$ and $G(y'') \le G(y''')$. If $G(y) \le G(y'')$ and $G(y') - G(y) \ge G(y''') - G(y'')$ then we say that the first decision, between $y$ and $y'$, shows equal or more confidence than the second decision between $y''$ and $y'''$. This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if in the second decision the confidence of the agreeing rank values are comparable and increase and the confidence of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides.

4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks

As described, the consistency property is over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors $y$, $y'$ and $y''$. We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.

Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in individual components of y and is jointly concave in y.
Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If $y \le y'$ component wise, then the individual components of y all agree in their preference to y, so that $G(y) \le G(y')$ by the consistency of G in preference. This implies the monotonicity of G in y. We pick three points $y$, $y - a$ and $y + a$, such that all points are rank vectors. Consider the pair of binary comparison problems, $y$ vs. $y + a$ and $y - a$ vs. $y$. Assume that $G(y) < G(y + a)$. These three points are comparable by Definition 2 ($r' - r = r''' - r''$ for all components in the rank vectors). Since the agreeing rankers, $A = \{j : y_j < y_j + a_j\}$, in the $y$ vs. $y + a$ comparison are greater than in the $y - a$ vs. $y$ comparison, and the disagreeing ranks, outside A, are smaller, then by the consistency of the combiner in preference and confidence (Definition 1), $G(y - a) \le G(y)$ and $G(y) - G(y - a) \ge G(y + a) - G(y)$. A function G is concave in a convex domain iff $2G(y) \ge G(y - a) + G(y + a)$ for every y and a with $y \pm a$ in its domain. To verify this inequality for a particular y and a, we must have $G(y + a) > G(y)$, $G(y - a) > G(y)$ or the third case of $G(y) \ge \max\{G(y + a), G(y - a)\}$. We have already proven that the consistency properties of preference and confidence imply $G(y) - G(y - a) \ge G(y + a) - G(y)$ in the first case. By symmetry this requirement also must hold for $-a$, so that $G(y) - G(y + a) \ge G(y - a) - G(y)$ in the second case. This completes the necessity part of the proof since $2G(y) \ge G(y - a) + G(y + a)$ automatically holds in the third case.

Proof of Sufficiency: Assume the ranking function G is nondecreasing in individual components of y and jointly concave in y.
We need to prove consistency in preference and confidence for a pair of decision problems involving three rank vectors $y$, $y'$ and $y''$. Without loss of generality, suppose that the first problem is to choose between $y$ and $y'$ and the second problem is to choose between $y$ and $y''$. We break the proof up by the results of the comparison between $y$ vs. $y'$ and the relative location of the third ranking vector $y''$ that satisfies the preference and confidence requirements. If $G(y) = G(y') = G(y'')$, the combiner has equal confidence in the two decision problems by Definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where $G(y) \ne G(y')$. Figure 1 illustrates three of these cases two dimensionally, where there is a single agreeing component A, the y-axis, and a single disagreeing component B, the x-axis.

[Figure 1 (image not recovered). Three panels with horizontal axis B and vertical axis A, both from 0 to 10, titled: 1) $G(y) < G(y')$, $y_A \le y''_A$; 2) $G(y) < G(y')$, $y_A \ge y''_A$; 4) $G(y) > G(y')$, $y_A \ge y''_A$.]

Figure 1: Consider two decision problems in two dimensions involving three points, $y$, $y'$ and $y''$, where the first decision problem is to choose between $y$ and $y'$ and the second problem is to choose between $y$ and $y''$. The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and relative location of $y''$. The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides.

Let A be the set of agreeing indices, $A = \{j : \mathrm{sgn}(y'_j - y_j) = \mathrm{sgn}(G(y') - G(y)) \ne 0\}$, and $B = A^c$ the set of disagreeing indices. We use the notation $y_A = (y_j, j \in A)$ to describe corresponding subvectors. Also, when used, vector inequalities are component wise.
Case 1: $G(y) < G(y')$ and $y_A \le y''_A$. In the first decision problem, as G agrees with A and disagrees with B,

$$y_A < y'_A, \qquad y_B \ge y'_B.$$

We also know that $A \ne \emptyset$ since otherwise $y \ge y'$ and therefore by monotonicity $G(y) \ge G(y')$, which is a contradiction. For the second decision problem we consider the values of $y''$ which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence or switching sides along the B indices. As $y_A \le y''_A$, either by the confidence relationship or by switching sides (see Figure 1) these $y''$ values satisfy

$$y''_A - y_A \ge y'_A - y_A, \qquad y_B - y''_B \le y_B - y'_B,$$

which implies $y' \le y''$, and therefore $G(y') \le G(y'')$ by monotonicity. Thus, $G(y) < G(y') \le G(y'')$, which implies that in the second decision problem of choosing between $y$ and $y''$, the preference of G is the same as that of A (i.e., $y$ since $y_A \le y''_A$) and the confidence of G is at least as high as the first decision problem.

Case 2: $G(y) < G(y')$ and $y_A \ge y''_A$. Since in this case $y_A \ge y''_A$, then either by the confidence relationship or by switching sides $y''$ satisfies

$$y_A - y''_A \ge y'_A - y_A, \qquad y''_B - y_B \le y_B - y'_B.$$

This implies that $y' + y'' \le 2y$, which means that $G(y') + G(y'') \le 2G\left((y' + y'')/2\right) \le 2G(y)$ by the concavity and monotonicity of G. Thus, $G(y) - G(y'') \ge G(y') - G(y)$ and since $G(y') > G(y)$, we have that $G(y) - G(y'') > 0$ and thus $G(y) > G(y'')$. Therefore we see that the preference in G for the two decision problems is the same as A (with $y$ in the first problem and with $y''$ in the second problem) and the confidence is no smaller for the second comparison.

Case 3: $G(y) > G(y')$ and $y_A \le y''_A$. Since $y'$ is preferred in the first decision problem we have

$$y_A > y'_A, \qquad y_B \le y'_B.$$

For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks in the second problem have to be smaller than the smaller ranks in the first problem.
But with $y_A \ge y'_A$ and $y_A \le y''_A$ that can only be if $y_A = y'_A$, which leads to $y \le y'$, and by monotonicity implies $G(y) \le G(y')$. However that is a contradiction to $G(y) > G(y')$, and therefore this is not a viable case for comparing confidence.

Case 4: $G(y) > G(y')$ and $y_A \ge y''_A$. Since in this case $y_A \ge y''_A$, then either by the confidence relationship or by switching sides $y''$ satisfies

$$y_A - y''_A \ge y_A - y'_A, \qquad y''_B - y_B \le y'_B - y_B,$$

which means that $y'' \le y'$. Thus, by the monotonicity of G, $G(y'') \le G(y')$. Since $G(y') < G(y)$ the preference in the second decision problem is also with A (i.e., $y''$) and the confidence is at least as high as the first decision problem.

4.5 Applying Regularization to Rankboost

It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost $H(x) = \sum_t w_t h_t(x)$ (Eq. 1). To make H an increasing and concave function of constituent rankings $y_j = f_j(x)$ we need to constrain the weak ranking learners. If the learners themselves are monotonically increasing and concave functions of $y_j$, then linearly combining them with positive weights will give an H that is also an increasing and concave function of $y_j$. In this paper, we apply the “third method” (Freund et al., 2003) to setting a $w_t$ weight value. That is, weak learners are selected on their ability to maximize r from Eq. 4 and then

$$w_t = 0.5 \ln\left((1 + r_{max}) / (1 - r_{max})\right). \qquad (5)$$

Therefore, using monotonic and concave weak learners we select only ones that rankboost gives a positive r value to, which renders a positive $w_t$ weight. If no r values are positive the rankboost algorithm stops. We mention that a ranking function can be construed as the negative of a score function, that is, $G(y) = -S(y)$. For score functions these regularization properties become monotonically decreasing and convex (Melnik et al., 2004).

5.
Minimum Weighted Group Ranks

The functional structure of the new learner we propose is

$$h(y) = \min\{\gamma_1 y_1, \ldots, \gamma_n y_n, 1\}, \qquad (6)$$

where $y = (y_1, \ldots, y_n) = (f_1(x), \ldots, f_n(x))$ is the vector of ranking features, and the $\gamma_j$ are learned positive coefficients. Note that the function's range is in [0, 1] due to the 1 term in the min function.

Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function in $y$. Thus if these learners are linearly combined with positive weights, the resulting ranking function, $H$, will have three-point consistency in confidence and preference.

To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992) that assigns combined ranks as the best rank each class receives from any ranker. This is a confidence based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, it becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy.

5.1 Learning

At each boosting iteration, rather than selecting from all possible weak learners of form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let $F = \{f_1(x), \ldots, f_n(x)\}$ be the set of ranking features. Recall that in rankboost $H(x) = \sum_t w_t h_t(x)$, where $h_t(x) = \min\{\gamma_1^{(t)} f_1(x), \ldots, \gamma_n^{(t)} f_n(x), 1\}$ and the $\gamma_j^{(t)}$ are learned coefficients. We set

$$H_t = \left\{ h(x) \;\middle|\; h(x) = \min\{\gamma_1^{(s)} f_1(x), \ldots, \gamma_n^{(s)} f_n(x)\},\; s \le t \right\}$$

and select $h_{t+1}(x)$ from weak learners of the form

$$h_{new}(x) = \min\{\alpha h(x), \beta f(x), 1\} \qquad (7)$$

with $h(x) \in H_t$ and $f(x) \in F$. This learner can be rewritten in the form of Eq. 6 with the $\gamma_j^{(t+1)}$ derived from the learned $\alpha$, $\beta$, $h(x)$ and $f(x)$.
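As an illustration of Eqs. 6 and 7, the following minimal sketch evaluates an MWGR learner and shows how a learner of the form of Eq. 7 folds back into the coefficient form of Eq. 6. The function names `mwgr` and `extend`, the dictionary representation of the coefficients (a feature missing from the dictionary places no constraint on the minimum) and the example numbers are assumptions made for this sketch, not part of the paper.

```python
from typing import Dict, Sequence


def mwgr(y: Sequence[float], gammas: Dict[int, float]) -> float:
    """Evaluate h(y) = min(gamma_j * y_j over the listed j, 1), as in Eq. 6."""
    return min([1.0] + [g * y[j] for j, g in gammas.items()])


def extend(gammas: Dict[int, float], alpha: float, j: int, beta: float) -> Dict[int, float]:
    """Coefficients of min(alpha * h, beta * f_j, 1), rewritten in the form of Eq. 6."""
    # Scaling a minimum by a positive alpha scales every coefficient.
    new = {k: alpha * g for k, g in gammas.items()}
    # If f_j already appears in h, the smaller coefficient determines the minimum.
    new[j] = min(new.get(j, float("inf")), beta)
    return new


y = [0.2, 0.5, 0.9]   # hypothetical ranks from three rankers, scaled for illustration
h = {0: 2.0}          # an existing learner: min(2.0 * y_0)
h_new = extend(h, alpha=0.5, j=1, beta=0.6)
print(mwgr(y, h), mwgr(y, h_new))
```

Keeping the coefficients in a dictionary means that only the features actually used by a learner need to be stored, which matches the way learners are grown one feature at a time in Eq. 7.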
Thus, at each iteration we look at combinations of the features and existing learners. As discussed in the next section, we can either consider all such combinations, or use a heuristic selection strategy.

We propose to optimize $\alpha$ and $\beta$ separately, in stages. That is, given a value for one of the variables we optimize the other. As $\alpha$ and $\beta$ are symmetric we show how to optimize $\beta$ given $\alpha$. We need to find a value of $\beta$ that maximizes $r$ in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as

$$r = \sum_{x_0, x_1} D^{(t)}(x_0, x_1)\,\bigl(h(x_1) - h(x_0)\bigr) = \sum_x \pi(x)\, h(x),$$

where $\pi(x) = \sum_{x'} \bigl( D^{(t)}(x', x) - D^{(t)}(x, x') \bigr)$. Given the form of Eq. 7 we can write $r$ as a function of $\beta$,

$$r(\beta) = \sum_{\beta f(x) \le \min(\alpha h(x), 1)} (\beta f(x) - 1)\,\pi(x) \;+\; \sum_{\alpha h(x) < \min(\beta f(x), 1)} (\alpha h(x) - 1)\,\pi(x). \qquad (8)$$

Subtracting 1 inside both sums leaves $r$ unchanged because $\sum_x \pi(x) = 0$, and it removes the instances for which the minimum in Eq. 7 is attained by the constant 1.

... $\alpha h(x) < 1$, then

$O(\beta_{l+1}) = O(\beta_l) - \pi(x_l)$,
$P(\beta_{l+1}) = P(\beta_l) - f(x_l)\pi(x_l)$,
$Q(\beta_{l+1}) = Q(\beta_l) + W\pi(x_l)$,
$R(\beta_{l+1}) = R(\beta_l) + W\alpha h(x_l)\pi(x_l)$.

Combining these formulas gives algorithm 2 for optimizing $\beta$.

Algorithm 2 Algorithm for optimizing $\beta$
Given $\alpha$, $h(x) \in H_t$, $f(x) \in F$ and the training instances.
For all $x$'s generate and sort the set of candidate $\beta$s, $B$, such that $\beta_1 \le \beta_2 \le \cdots \le \beta_{|B|}$ and $\beta_l = \min(\alpha h(x_l), 1) / f(x_l)$.
  $O = \sum_{x_l} \pi(x_l)$
  $P = \sum_{x_l} f(x_l)\pi(x_l)$
  $Q = 0$
  $R = 0$
  $r_{best} = 0$
  for $l = 1 \ldots |B|$ do
    $r = \beta_l P - O + R - Q$
    if $r > r_{best}$ then
      $r_{best} = r$
    end if
    $O = O - \pi(x_l)$, $P = P - f(x_l)\pi(x_l)$
    if $\alpha h(x_l) < 1$ then
      $Q = Q + \pi(x_l)$, $R = R + \alpha h(x_l)\pi(x_l)$
    end if
  end for

5.2 Heuristics for Learner Selection

If at each boosting iteration we select from all combinations of $h(x)$ and $f(x)$ we end up with an $O(T^2)$ algorithm, where $T$ is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed sized pool of previously trained learners $h(x)$ and features $f(x)$, where at each iteration the pool is randomly chosen from the full $H_t$ and $F$.
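The search over candidate values of β can be sketched as follows. This is a deliberately simple exhaustive variant, not the running-sum procedure of algorithm 2: it recomputes r = Σ π(x) min(αh(x), βf(x), 1) from scratch at each candidate breakpoint, which costs quadratic time but is easy to check. It relies on the fact that r is piecewise linear in β with breakpoints at min(αh(x), 1)/f(x). The function names and the toy data are assumptions for this sketch.

```python
from typing import List, Sequence, Tuple


def r_value(beta: float, alpha: float, f: Sequence[float],
            h: Sequence[float], pi: Sequence[float]) -> float:
    """r = sum_x pi(x) * min(alpha * h(x), beta * f(x), 1)."""
    return sum(p * min(alpha * hx, beta * fx, 1.0) for fx, hx, p in zip(f, h, pi))


def optimize_beta(alpha: float, f: Sequence[float], h: Sequence[float],
                  pi: Sequence[float]) -> Tuple[float, float]:
    """Return (beta, r) maximizing r over the candidate breakpoints.

    The search starts from r_best = 0, so a beta is only returned
    with a positive r; otherwise (0.0, 0.0) is returned.
    """
    candidates: List[float] = sorted(
        min(alpha * hx, 1.0) / fx for fx, hx in zip(f, h) if fx > 0
    )
    best_beta, best_r = 0.0, 0.0
    for beta in candidates:
        r = r_value(beta, alpha, f, h, pi)
        if r > best_r:
            best_beta, best_r = beta, r
    return best_beta, best_r


# Hypothetical toy data: three instances, with pi summing to zero.
f = [0.1, 0.5, 0.9]
h = [0.2, 0.4, 1.0]
pi = [-0.5, 0.1, 0.4]
beta, r = optimize_beta(1.0, f, h, pi)
print(beta, r)
```

The same routine can be reused to optimize α for a fixed β by swapping the roles of `f` and `h`, since the two coefficients enter Eq. 7 symmetrically.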
To improve performance, instead of using a uniform sampling distribution we can apply a heuristic to focus the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for $r$ to be large the terms $f(x)\pi(x)$ and $h(x)\pi(x)$ need to be large. We can consider $s_f = \sum_x f(x)\pi(x)$ and $s_h = \sum_x h(x)\pi(x)$ as indicators of how well these components work with $\pi(x)$. Thus, we might expect larger $r$ values to occur when these two score values are larger. Of course, we are discounting interactions, which is the reason for the combination.

Using these score values, we can order all $h(x)$ and $f(x)$ separately, and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting method. Thus, for example, all $f \in F$ are sorted by their score and are assigned a number based on the rank of these scores, (maxrank − rank)/maxrank, that gives each $f$ an equally sized bin in the range between 0 and 1. Given a random number $\xi \sim U(0, 1)$, we calculate $\xi^p$ and select the $f$ that corresponds to the bin this number falls into. Here $p < 1$ can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the $p$ parameter in one of our experiments.

Figure 2 (training error on Dup II; y-axis: average rank of correct class; x-axis: iteration number; curves for p = 1, 1/2 and 1/4): These plots show convergence of training error for 3 values of the heuristic selection pressure value p. These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen the heuristic improves the convergence rate of the training error.

6. Face Recognition Experiments

We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner.
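The polynomial selection heuristic of Section 5.2 can be sketched as below. The sketch assumes that a higher score is better, that the candidates are laid out in ascending score order so that the best candidate owns the bin closest to 1, and that ties are broken by the sort order; the function name `select_by_pressure` and the example scores are hypothetical.

```python
import random
from typing import Optional, Sequence


def select_by_pressure(scores: Sequence[float], p: float,
                       rng: Optional[random.Random] = None) -> int:
    """Pick an index, favouring high scores when p < 1 (p = 1 is uniform)."""
    rng = rng or random.Random()
    n = len(scores)
    # Ascending order: bin 0 covers [0, 1/n) and belongs to the lowest score,
    # the last bin covers [(n-1)/n, 1) and belongs to the highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    xi = rng.random() ** p          # for p < 1 this pushes xi towards 1
    bin_index = min(int(xi * n), n - 1)
    return order[bin_index]


rng = random.Random(0)
scores = [0.1, 0.9, 0.5, 0.3]       # hypothetical s_f values for four features
counts = [0, 0, 0, 0]
for _ in range(5000):
    counts[select_by_pressure(scores, p=0.5, rng=rng)] += 1
print(counts)
```

Under these assumptions the best of n candidates is drawn with probability 1 − ((n − 1)/n)^(1/p), so lowering p increases the pressure towards high-scoring candidates while still leaving every candidate a non-zero chance.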
Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order or ranking to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy.

6.1 Algorithm Methods

We consider a data set $I$ of face images (queries) to train on. For each query image, $i$ in $I$, we need to rank all $u \in U$ faces in the gallery. In rankboost the favor function, $\Phi$, that specifies output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, $i$, and the item to be ranked, $u$, as an instance, $x \equiv (i, u)$. As such, $f_j(x) = f_j((i, u))$ is the rank assigned to identity $u$ for query image $i$ by recognition algorithm $j$.

Figure 3 (error convergence; y-axis: average rank of correct class; x-axis: iteration number; curves: train error on dup II, test error on dup I): This plot is typical of the convergence behavior of rankboost with MWGR on the FERET data. Both training and test errors tended to converge within 10-30 iterations of boosting with no significant post-convergence divergence.

As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let $u^*$ be the correct identity for training image $i$, then $\Phi((i, u), (i, u^*)) = +1$ and $\Phi((i, u^*), (i, u)) = -1$ for all $u \ne u^*$, setting all remaining elements of $\Phi(x_0, x_1) = 0$. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor.
Note that since there is no favor interaction between queries (different $i$'s), $\Phi(x_0, x_1)$ is effectively a function of three variables, $(i, u_0, u_1)$.

Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of $f(x)$ and $h(x)$, with a selection pressure of $p = 0.5$. For each candidate, first $\beta$ was optimized with $\alpha = 1$, then $\alpha$ was optimized using the optimized $\beta$. The candidate with the most positive $r$ value was always selected. This is summarized in algorithm 3.

Algorithm 3 Algorithm for selecting learner from pool.
Given poolsize and selection pressure.
Calculate $s_{f_j}$ and $s_h$ for all rank features and existing learners.
  for $k = 1 \ldots$ poolsize do
    Generate $d_f \sim U(0, 1)$ and $d_h \sim U(0, 1)$
    Select which $h = \min\{\alpha h(x), \beta f(x), 1\}$ to try, using $s_{f_j}$, $s_h$, $d_f$, $d_h$.
    Setting $\alpha = 1$, optimize $\beta$ (algorithm 2)
    Keeping the optimized $\beta$, optimize $\alpha$ (algorithm 2)
    Get $r$ of learner with this $\alpha$, $\beta$
    if $r > 0$ and $r > r_{best}$ then
      $r_{best} = r$, $h_{best} = h$, $\tilde{h}_{best} = \min\{\alpha h(x), \beta f(x)\}$
    end if
  end for
  $h_t = h_{best}$, $H_{t+1} = H_t \cup \{\tilde{h}_{best}\}$

6.2 Experimental Setup

FERET (Phillips et al., 2000) was a government sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images which consists of pictures taken the same day with different facial expressions; the fafc data set of 194 images that contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images that has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study where dup II was also a subset of dup I.

The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principal components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose to use the ANM algorithm which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm's performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland's 1997 test submission (UMD) and the University of Southern California's 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997).
The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II, were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets.

We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all 4 FERET data sets and randomly shuffled them together. We call this the homogeneous data set as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes.

For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, where we get combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets.

To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks. All ranks returned by a boosted combiner for the correct class above 30 were truncated to 30.

In evaluating the performance of the combiners not all the test data are equally useful. We consider the following two cases as non-informative. When the two best face recognizers, UMD and USC, both give the correct class a rank of 1 there is very little reason for the combined rank to be different.
Also when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30) it makes little sense to compare between the combiners. The testing data was filtered to remove these cases, and the results are presented without them.

Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992).

We present two sets of experiments: the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers.

For all experiments we measure the average rank of the correct class for both learners:

$$A = \frac{1}{N} \sum_{i=1}^{N} \min\{\mathrm{Rank}(x_i^*), 30\},$$

where $\mathrm{Rank}(x_i^*)$ is the rank of the correct class for query $i$, and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference.
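The evaluation quantities above can be sketched with the Python standard library as follows. The function names and the example ranks are assumptions for illustration. Only the paired t statistic is computed; turning it into a one-sided p-value requires the Student t distribution with N − 1 degrees of freedom, which a statistics package would supply.

```python
import math
import statistics
from typing import Sequence


def truncated_average_rank(ranks: Sequence[float], cap: float = 30) -> float:
    """A = (1/N) * sum_i min(Rank(x_i*), cap)."""
    return sum(min(r, cap) for r in ranks) / len(ranks)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a_i - b_i (sample standard deviation)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical ranks of the correct class for four queries under two combiners.
binary_ranks = [3, 5, 8, 10]
mwgr_ranks = [2, 3, 7, 6]
print(truncated_average_rank([1, 5, 40, 100]))        # ranks above 30 count as 30
print(paired_t_statistic(binary_ranks, mwgr_ranks))   # positive favours the second combiner
```

Since a lower rank is better, a positive mean difference (first combiner minus second) indicates that the second combiner ranks the correct class higher on average.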
6.3 Results

In the experiments with the homogeneous data sets, combining all classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and p-value of 0.039, where combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with standard deviation of 3.6, a paired t-test statistic of 0.68 and p-value of 0.24.

         ANM    ARL    EFAVG  EFML1  EFML2  EXCA   MSU    RUT    UMD    USC
dup i    9.52   16.85  15.02  18.23  15.85  11.9   17.64  17.87  13.21  12.44
dup ii   18.14  16.96  20.61  22.21  20.33  16.28  23.44  16.48  13.74  6.85
fafb     10.88  8.94   14.43  16.23  12.21  11.67  6.81   12.93  4.57   5.84
fafc     19.71  28.02  26.17  17.39  19.78  20.30  14.27  22.79  8.96   5.17

Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems.

Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in columns bin mean and mwgr mean with the average rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two no improvement experiments do not have significant p-values.

Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight or nine of those 11 have significant p-values. The one no improvement experiment does not have a significant p-value.

It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some cases we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents.
All sets of experiments, homogeneous data, heterogeneous data sets, combining 3 select recognizers and combining all 10 recognizers at once yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests.

test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
dup i     dup ii     7.19      5.96       1.22       5.22      .7e-4
dup i     fafb       7.36      6.21       1.14       5.15      .001
dup i     fafc       8.31      6.27       2.04       5.51      .1e-6
dup ii    dup i      5.25      5.38       -.12       4.0       .658
dup ii    fafb       5.15      4.4        0.74       4.61      .018
dup ii    fafc       5.19      4.95       0.23       5.1       .275
fafb      dup i      2.62      2.22       0.39       3.09      .115
fafb      dup ii     3.38      2.60       0.78       3.79      .029
fafb      fafc       2.60      2.28       0.32       2.81      .142
fafc      dup i      2.85      2.78       0.06       3.33      .424
fafc      dup ii     2.40      2.43       -.03       2.27      .537
fafc      fafb       2.56      2.03       0.52       2.57      .028

Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets.

test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
dup i     dup ii     6.73      5.89       0.84       4.79      .009
dup i     fafb       8.06      7.09       0.96       6.01      .014
dup i     fafc       6.68      4.87       1.81       5.24      .5e-6
dup ii    dup i      5.75      5.67       0.08       4.61      .408
dup ii    fafb       6.24      5.47       0.76       4.22      .009
dup ii    fafc       5.86      5.31       0.55       4.87      .074
fafb      dup i      2.67      2.07       0.59       2.97      .033
fafb      dup ii     3.56      2.45       1.11       4.51      .012
fafb      fafc       2.68      2.17       0.51       2.03      .01
fafc      dup i      3.36      2.51       0.85       3.23      .007
fafc      dup ii     3.17      2.95       0.22       3.95      .3
fafc      fafb       2.22      2.23       -0.01      2.08      .52

Table 3: Results of combining all 10 classifiers using individual FERET data sets.

7. Information Retrieval Experiments

The annual Text REtrieval Conference (TREC) generates high-quality retrieval results of different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task that uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003) we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwal, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc, ricMM, jscbtawtl4, apl10wd.

Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents, which hopefully has greater precision than the individual systems. For the 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of system results range between 662 and 2664.

JuruFull     0.759    hum01tdlx    0.760    UniNEn7d   0.763
iit01m       0.762    apl10wd      0.735    jscbtawt14 0.761
csiro0mwa1   0.721    kuadhoc2001  0.727    flabxtd    0.741
ok10wtnd1    0.745    fub01be2     0.734    ricMM      0.765

Table 4: Normalized mean average precision for each constituent.

7.1 Methods

In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents. Thus, we can set the favor function similarly to the way it was set in the face recognition classification task. Consider a data set with $I$ queries to train on. For each query, $i$ in $I$, we need to rank all $u \in U$ documents. An instance in this case is a pair, $x \equiv (i, u)$. For each $u^*$ a relevant document and $u$ an irrelevant document for query $i$, we set $\Phi((i, u), (i, u^*)) = +1$ and $\Phi((i, u^*), (i, u)) = -1$, setting all remaining elements of $\Phi(x_0, x_1) = 0$. Thus, all relevant documents are given a positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents, are given zero favor.
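The favor function just described can be sketched as a sparse mapping in which only the non-zero entries are stored. The function name `favor_function`, the nested-dictionary input format and the toy relevance judgments are assumptions made for this sketch.

```python
from typing import Dict, Hashable, Tuple

Instance = Tuple[Hashable, Hashable]   # x = (query i, document u)


def favor_function(
    relevance: Dict[Hashable, Dict[Hashable, bool]]
) -> Dict[Tuple[Instance, Instance], int]:
    """Non-zero entries of Phi; every pair not stored has favor 0."""
    phi: Dict[Tuple[Instance, Instance], int] = {}
    for i, docs in relevance.items():
        relevant = [u for u, is_rel in docs.items() if is_rel]
        irrelevant = [u for u, is_rel in docs.items() if not is_rel]
        for u_star in relevant:
            for u in irrelevant:
                phi[((i, u), (i, u_star))] = +1
                phi[((i, u_star), (i, u))] = -1
    return phi


# Hypothetical judgments: query q1 has one relevant and two irrelevant documents.
relevance = {"q1": {"a": True, "b": False, "c": False}, "q2": {"d": True}}
phi = favor_function(relevance)
print(len(phi))
```

Pairs of documents from different queries, and pairs with equal relevance, never appear in the mapping, which reflects the zero favor assigned to them in the text.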
As the task has only 50 queries, rather than separating the data into train and test, we opted to do a cross validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank position in the list. It is described by the following equation:

$$\mathrm{AveP} = \frac{\sum_{r=1}^{N} \mathrm{Prec}(r) \times \mathrm{rel}(r)}{\text{number of relevant documents}},$$

where $r$ is the rank, $N$ is the number of documents retrieved, $\mathrm{rel}()$ is a binary function of the relevance of the document at rank $r$, and $\mathrm{Prec}$ is the precision ([number of relevant documents retrieved] / [number of documents retrieved]) at a given cut-off rank. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank.

As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4), MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using algorithm 3, with a selection pressure of $p = 0.5$.

7.2 Results

As seen in Figure 4, the MWGR learner converges in significantly fewer iterations than the binary learner. This could possibly be attributed to the fact that the MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so the MWGR can better accommodate the variety in the 1000 ranks being considered.

The normalized mean average precision for the MWGR after 100 iterations was 0.8537 and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4.
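The average precision measure defined in Section 7.1 can be sketched as follows. The function name and the example relevance list are assumptions. The sketch computes plain AveP for a single query and does not model the normalization mentioned in the text, in which unretrieved relevant documents are assigned a constant rank.

```python
from typing import Sequence


def average_precision(rel: Sequence[bool], num_relevant: int) -> float:
    """AveP = sum_r Prec(r) * rel(r) / (number of relevant documents).

    rel[r-1] says whether the document retrieved at rank r is relevant.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for r, is_relevant in enumerate(rel, start=1):
        if is_relevant:
            hits += 1
            total += hits / r      # Prec(r): relevant retrieved so far / r
    return total / num_relevant


# Hypothetical result list: relevant documents at ranks 1 and 3, with
# three relevant documents in total (one was never retrieved).
print(average_precision([True, False, True, False], num_relevant=3))
```

The mean average precision reported in the experiments would then be the mean of this quantity over the queries of a fold.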
Both weak learners had a performance rate of change of approximately $3 \times 10^{-5}$ on their final iteration (better for MWGR). A paired t-test on the cross validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR.

Figure 4 (cross validation performance; y-axis: mean average precision; x-axis: iteration number; curves: MWGR, Binary): The cross validation mean average precision score of the two weak learners, MWGR and binary, as a function of boosting iteration number.

8. Discussion

The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how the structure of ranks can be correctly exploited to maximize performance.

The semi-parametric nature of rankboost offers the possibility to generate arbitrarily flexible ranking functions. But as observed this flexibility comes at a cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and as such, how they can be combined.

We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show a consistency in how they treat the preference and relative confidence exhibited by constituents. We prove that under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function.
We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR), is that its parameters can be individually optimized readily with respect to the rankboost criteria, allowing it to be tested on real-world data.

In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR procedure. We also conducted experiments on heterogeneous data, testing the robustness or adaptivity of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner.

Our second experiment was on an Information Retrieval task taken from the TREC conference. In this task we see MWGR converges in significantly fewer iterations and generates statistically significant improved performance.

Final Words

Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition.

Acknowledgments

This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions which improved the presentation of the results.

Appendix A.
In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions. For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function   G(y) < G(y )   yB − y B ≤ 0 < y A − y A  zA ≤ yA , yA − yA ≤ zA − zA ⇒ G(z ) − G(z) > G(y ) − G(y)   yB ≤ z B , zB − z B ≤ y B − y B (11) where y and y are the rank vectors from the first decision problem and z and z are the rank vectors from the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say about it though is that as the constraints are linear, in the same way that concavity and monotonicity in the weak learner conferred the same properties to the ranking function, a weak learner that satisfies Eq. 11 will also confer those properties to the ranking function that uses it with positive weights. References K. Arrow. Social Choice and Individual Values. Wiley, 1951. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001. 811 M ELNIK , VARDI AND Z HANG Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996. T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992. T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto (Eds.), editors, Structured Document Image Analysis, pages 188–202. 
Springer-Verlag, Heidelberg, 1992. J. Kittler and F. Roli, editors. Multiple Classifier Systems, Lecture Notes in Computer Science 1857, 2000. Springer. R. Meir and G. Ratsch. Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, chapter An introduction to boosting and leveraging, pages 119–184. Springer, 2003. O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004. H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:1090–1104, 2000. M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM. G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983. E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov. L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775–779, 1997. W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng. Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. Springer-Verlag, Berlin, 1998.

3 0.40633684 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

Author: Aggelos Chariatis

Abstract: The experimental investigation on the efficient learning of highly non-linear problems by online training, using ordinary feed forward neural networks and stochastic gradient descent on the errors computed by back-propagation, gives evidence that the most crucial factors for efficient training are the hidden units' differentiation, the attenuation of the hidden units' interference and the selective attention on the parts of the problems where the approximation error remains high. In this report, we present global and local selective attention techniques and a new hybrid activation function that enables the hidden units to acquire individual receptive fields which may be global or local depending on the problem's local complexities. The presented techniques enable very efficient training on complex classification problems with embedded subproblems. Keywords: neural networks, online training, selective attention, activation functions, receptive fields

1. Framework

Online supervised learning is in many cases the only practical way of learning. This includes situations where the problem size is very big, or situations where we have a non-recurring stream of input vectors that are unavailable before training begins. We examine online supervised learning using a particular class of adaptive models, the very popular feed forward neural networks, trained with stochastic gradient descent on the errors computed by back-propagation.

In order to easily visualize the online training dynamics of highly complex non-linear problems, we experiment on 2:η:1 networks where the input is a point in a two dimensional image and the output is the value of the pixel at the corresponding input position. This framework allows the creation of very complex non-linear problems, just by hand-drawing the problem on a bitmap and presenting it to the network. Most problems' images in this report are 256 × 256 pixels in size, giving 65536 different samples per problem.
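The bitmap framework described above can be sketched as an endless stream of training samples. In this minimal sketch the bitmap is a list of rows of pixel values, and the scaling of pixel coordinates to the unit square is an assumption made here (the text does not say how inputs are encoded); the function name `sample_stream` is likewise hypothetical.

```python
import itertools
import random
from typing import Iterator, List, Optional, Tuple

Sample = Tuple[Tuple[float, float], int]   # ((x, y), pixel value)


def sample_stream(bitmap: List[List[int]],
                  rng: Optional[random.Random] = None) -> Iterator[Sample]:
    """Yield random (position, pixel value) pairs forever, one at a time."""
    rng = rng or random.Random()
    height, width = len(bitmap), len(bitmap[0])
    while True:
        row = rng.randrange(height)
        col = rng.randrange(width)
        # Assumed input encoding: coordinates scaled into [0, 1).
        yield (col / width, row / height), bitmap[row][col]


# A hypothetical 2 x 2 black-and-white "problem" (1 = white, 0 = black).
bitmap = [[0, 1],
          [1, 0]]
for (x, y), target in itertools.islice(sample_stream(bitmap, random.Random(0)), 5):
    print(x, y, target)
```

Because each sample is drawn, used and discarded, the consumer never needs the whole training set in memory, which mirrors the online setting the section describes.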
Classification and regression problems can be modeled as black & white and gray scale images respectively. In this report we only examine training on classification problems. However, since mixed problems are possible, we are only interested in techniques that can be applied to both classification and regression. The target of this investigation is online training where the input is not known in advance, so the input samples are treated as random and non-recurring vectors from the input space and are discarded after being used. We select and train on random samples until the average classification or RMS error is acceptable. Since both the number of training exemplars and the complexity of the underlying function are assumed unknown, we require our training mechanism to have “initial state invariance” as a fundamental property. Thus we deliberately exclude from our arsenal any training techniques that require a schedule to be decided ahead of training. Ideally, we would like the training mechanism to be totally invariant to the initial training parameters and network state.

© 2007 Aggelos Chariatis.

This report is organized as follows: Sections 2 and 3 describe techniques for global and local selective attention. Section 4 is devoted to acceleration of training. In Section 5 we present experimental results and in Section 6 we discuss the presented techniques and give some directions for future research. Finally, Appendix A contains a description of the notations that have been used. Figure 1 shows some examples of problems that can be learned very efficiently using the techniques that are presented in the following sections.

Figure 1: Examples of complex non-linear problems that can be learned efficiently (panels a to e).

2. Global Selective Attention - Dynamic Training Set Evolution

Consider the two problems depicted in Figure 2.
Clearly, both problems are of approximately equal complexity, since they encapsulate the same image at a different scale and position within the input space. We would like to have a mechanism that will make the network capable of learning both problems at about the same speed.

Figure 2: Approximately equal complexity problems (panels a and b).

Intuitively, the samples on the boundaries, which are the samples at positions with the highest contrast, are those that determine the complexity of each problem. During training, these samples have the property that they produce the highest errors. We thus need a method that will focus attention on samples with high error relative to the rest. Previous work on such global selective attention has been published by many authors (Munro, 1987; Yu and Simmons, 1990; Bakker, 1992, 1993; Schapire, 1999; Zhong and Ghosh, 2000).

Very Fast Online Learning of Highly Non Linear Problems

Of particular interest are the various boosting algorithms, such as AdaBoost (Schapire, 1999), which work by placing more emphasis on training samples that the system is currently getting wrong. Unfortunately, the most successful of these algorithms require a predefined set of samples on which training will be performed, something that is excluded from our scenario. Nevertheless, in a less constrained scenario, boosting can be applied on top of our techniques as a meta-learning algorithm, using our techniques as the base-learning algorithm. A simple method that can provide such an adaptive selective attention mechanism, by keeping an exponential trace of the average error in the training set, is described in Algorithm 1.
    $\bar{e} \leftarrow 0$
    Repeat
        Pick a random sample
        Evaluate the error $e$ for the sample
        If $e > 0.5\,\bar{e}$
            $\bar{e} \leftarrow e\,\alpha + \bar{e}\,(1 - \alpha)$
            Train
        End
    Until a stopping criterion is satisfied

In this report’s context, error evaluation and train are defined as:

Error Evaluation: Computation of the output values by forward propagating the activations from the input to the output layer for a single sample, plus computation of the output errors. The sample’s error $e$ is set to the quadratic mean (RMS) of the output units’ errors.

Train: Back-propagation of the output errors to the hidden layer and immediate adjustment of the weights.

Algorithm 1: The dynamic training set evolution algorithm.

The algorithm evaluates the errors of all samples, but trains only for samples with error greater than half the average error of the current training set. Training is initially performed for all samples, but gradually it is concentrated on the samples at the problem’s boundaries. When the error for these samples is reduced, other previously excluded samples enter the training set. Thus, samples enter and leave the training set automatically, with a tendency to train on samples with high error. The magnitude of the constant $\alpha$ that determines the time scale of the exponential trace is problem specific, but in all experiments in this report it was kept fixed to $10^{-4}$. The fraction of 0.5 was determined experimentally to give a good balance between sample selectivity and training set size. If it is close to 0 then we train for almost all samples. If it is close to 1 then we are at risk of starving the training set of samples. Of course, one can choose to vary it dynamically in order to have a fixed percentage of samples in the training set, or to not allow the training percentage to fall below a pre-specified limit. Figure 3 shows the training set evolution for the two-spirals problem (in Figure 1a) in various training stages. The network topology was 2:64:1.
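The loop of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal sketch: the callables `pick_sample`, `evaluate` and `train` are hypothetical placeholders for the network code, and a fixed evaluation budget stands in for the unspecified stopping criterion.

```python
def dynamic_training_set(pick_sample, evaluate, train, n_evaluations,
                         alpha=1e-4, fraction=0.5):
    """Train only on samples whose error exceeds a fraction of the traced error.

    pick_sample, evaluate and train are hypothetical callables supplied by
    the caller. Assumption: the loop stops after n_evaluations evaluations.
    Returns the number of trains performed and the final error trace.
    """
    e_bar = 0.0          # exponential trace of the training-set error
    n_trains = 0
    for _ in range(n_evaluations):
        sample = pick_sample()
        e = evaluate(sample)                 # RMS error of the output units
        if e > fraction * e_bar:
            # The trace is updated only for samples that are trained on.
            e_bar = e * alpha + e_bar * (1.0 - alpha)
            train(sample)
            n_trains += 1
    return n_trains, e_bar
```

Because the trace starts at zero, every sample with a non-zero error is trained on at first, and the selection only tightens as the trace builds up.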
You can see the training set forming gradually and tracing the problem boundaries where the error is the highest. One could argue that such a process may be very sensitive to outliers. Experiments have shown that this does not happen. The algorithm does not try to recognize the outliers, but at least adjusts naturally by not allowing the training set size to shrink. So, in the presence of heavy noise, the algorithm becomes ineffective, but does not introduce any additional harm. Figure 4 shows the two-spirals problem distorted by dynamic noise and the corresponding training set after 90000 trains with 64 hidden units. You can see that the algorithm tolerates noise by not allowing the training set size to shrink. It is also interesting that at noise levels as high as 30% the algorithm can still exclude large areas of the input space from training.

Figure 3: Training set evolution for the two-spirals problem. Under each image you can see the stage of training in trains, error-evaluations and the percentage of samples for which training is performed at the corresponding stage. Panel labels: 10000-11840-93%, 20000-25755-76%, 40000-66656-41%, 60000-142789-22%, 90000-296659-18%.

Figure 4: Top row shows the model with a visualization of the applied dynamic noise. Bottom row shows the corresponding training sets after 90000 trains. Under each pair of images you can see the percentage of noise distortion applied to the original input and the percentage of samples for which training is performed. Panel labels: 10%-42%, 20%-62%, 30%-75%, 50%-93%, 70%-99%.

3. Local Selective Attention - Receptive Fields

Having established a global method to focus attention on the important parts of a problem, we now come to address the main issue, which is the network training. Let us first discuss the roles of the hidden and output layers in a feed forward neural network with a single hidden layer and without shortcut input-to-output connections.
The hidden layer is responsible for transforming a non linear input-to-output mapping into a non linear input-to-hidden layer mapping that can be mapped linearly to the output. The output layer is responsible for learning a linear hidden-to-output mapping (which is an easy job), but most importantly, it must provide to the hidden layer error gradient information that will be used for the error credit assignment problem. In this respect, it becomes apparent that all hidden units should receive the most accurate error information possible. That is why we must train all hidden-to-output connections and back propagate the error through all these connections. This is not the case for the hidden layer. Consider, for a classification problem, how the hidden units with sigmoidal activations partition the input space into sub areas. By adjusting the input-to-hidden weights and biases, each hidden unit develops a hyperplane that bi-partitions the input space in the most useful sense. We would like to limit the number of hyperplanes in order to reduce the system’s available degrees of freedom and obtain better generalization capabilities. At the same time, we would like to thoroughly use them in order to optimize the input output approximation. This can be done by arranging the hyperplanes to touch the problem’s boundaries at regular intervals dictated by the boundary curvature, as shown in Figure 5a. Figure 5b shows a suboptimal placement of the hyperplanes which causes a waste of resources. Each hidden unit must be differentiated from the others and ideally not interfere with the subproblems that the other units are trying to solve. Suppose that two hidden units are governed by the same, or nearly the same, parameters. How can we differentiate them? There are many possibilities.

Figure 5: Optimal (a) vs. suboptimal (b) hyperplanes.
One could be to just throw one unit away and make the output weight of the other equal to the sum of the two original output weights. That would leave the function unchanged. However, identifying these similar units during training is not easy computationally. In addition, we would have to figure out a method that would compute the best initial placement for the hyperplane of the new unit that would substitute the one that was thrown away. Another possibility would be to add noise in the weight updates, gradually reduced with a simulated annealing schedule which should be decided before training begins. Unfortunately, the loss of initial state invariance would complicate training for unknown complex non linear problems. To our thinking, it is much better to embed constraints into the system, so that it will not be possible for two hidden units to develop the same hyperplane. Two computationally efficient techniques to embed such constraints are described in Sections 3.1 and 3.2. Many other authors have also examined methods for local selective attention. For the related discussions see Huang and Huang (1990), Ahmad and Omohundro (1990), Baluja and Pomerleau (1995), Flake (1998), Duch et al. (1998), and Phillips and Noelle (2004).

3.1 Fixed Cascaded Inhibitory Connections

A problem with the hidden units of conventional feed forward networks is that they are all fed with the same inputs and back propagated errors and that they operate without knowing of each other’s existence. So, nothing prevents them from behaving identically. This lack of communication between hidden units has been addressed by researchers through hidden unit lateral connections. Agyepong and Kothari (1997) use unidirectional lateral interconnections between adjacent hidden layer units, claiming that these connections facilitate the controlled assignment of role and specialization of the hidden units.
Kothari and Ensley (1998) use Gaussian lateral connections which enable the hidden decision boundaries to be global in nature but also be able to represent local structure. Numerous neural network algorithms employ bidirectional lateral inhibitory connections in order to generate competition between the hidden units. In an interesting variation described by Spratling and Johnson (2004), competition is provided by each hidden unit blocking its preferred inputs from activating other units. We use a single hidden layer where the hidden units are considered sequenced. Each hidden unit is connected to all succeeding hidden units with a fixed connection with weight set to minus one. The hidden units get differentiated, because they receive different inputs, they produce different activations and they get back different error information. Another benefit is that they can generate higher order feature detectors, that is, the resulting hidden hyperplanes are no longer strictly linear, but they may also be curved. Considering the fixed value, -1 is used just to avoid a multiplication. Values from -0.5 to -2 give good results as well. As it is shown in Section 5.1.1, the fixed cascaded inhibitory connections are very effective at reducing a problem’s asymptotic residual error. This should be attributed to both of their abilities, to generate higher order feature detectors and to hasten the hidden units’ symmetry breaking. These connections can be implemented very efficiently with just one subtraction per hidden unit for both hidden activation and hidden error computation. In addition, the disturbance to the parallelism of the backpropagation algorithm is minimal. Most operations on the hidden units can still be done in parallel and only the final computations must be performed sequentially. We include the algorithms for the hidden activation and error computations as examples of sequential implementations. 
These changes can be very easily incorporated into conventional neural network code.

Hidden Activations:
    $\delta \leftarrow 0$
    For $j \leftarrow 1 \ldots \eta$
        $n_j \leftarrow \delta + x \cdot w_j$
        $h_j \leftarrow f(n_j)$
        $\delta \leftarrow \delta - h_j$
    End

Hidden Error Signals:
    $\delta \leftarrow 0$
    For $j \leftarrow \eta \ldots 1$
        $e_j \leftarrow \delta + r \cdot u_j$
        $g_j \leftarrow e_j\, f'(n_j)$
        $\delta \leftarrow \delta - g_j$
    End

Algorithm 2: Hidden unit activation and error computation with Fixed Cascaded -1 Connections.

3.2 Selective Training of the Hidden Units

The hidden units’ differentiation can be further magnified if each unit is not trained on all samples, but only on the samples for which it receives a high error. We train all output units, but only the hidden units for which the error signal is higher than the RMS of the error signals of all hidden units. Typically about 10% of the hidden units are trained on each sample during early training and the percentage falls to about 2% when the network is close to the solution. This is intuitively justified by the observation that at the beginning of training the hidden units are largely undifferentiated and receive high error signals from the whole input space. At the final stage of training, the hidden hyperplanes’ linear soft decision boundaries are combined by the output layer to define arbitrarily shaped decision boundaries. For µ input dimensions, from 1 up to µ units can define an open sub-region and µ + 1 units are enough to define a closed convex region. With such high level constructs, each sample may be discriminated from the rest with very few hidden units. These are the units that receive the highest error signal for the sample. Experiments on various problems have shown that training on a fraction of the hidden units is always better (with respect to the number of trains to convergence) than training all or just one hidden unit. It seems that training only one hidden unit on each sample is not sufficient for some problems (Thornton, 1992). Measurements for one of these experiments are reported in Section 5.1.1.
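The two loops of Algorithm 2 and the selection rule of Section 3.2 can be sketched with NumPy as follows. This is a sketch under stated assumptions: biases are omitted, the Elliott function is used for $f$, the selection compares the absolute error signal with the RMS, and all function names are hypothetical.

```python
import numpy as np

def elliott(n):
    return n / (1.0 + np.abs(n))

def elliott_prime(n):
    return 1.0 / (1.0 + np.abs(n)) ** 2

def hidden_forward(x, W):
    """x: (mu,) inputs; W: (eta, mu) input-to-hidden weights (biases omitted)."""
    eta = W.shape[0]
    n = np.empty(eta)
    h = np.empty(eta)
    delta = 0.0                      # minus the sum of preceding activations
    for j in range(eta):
        n[j] = delta + x @ W[j]
        h[j] = elliott(n[j])
        delta -= h[j]                # fixed -1 connection to later units
    return n, h

def hidden_backward(r, U, n):
    """r: (k,) output error signals; U: (eta, k), row j holds u_j."""
    eta = U.shape[0]
    g = np.empty(eta)
    delta = 0.0                      # minus the sum of later error signals
    for j in reversed(range(eta)):
        e_j = delta + r @ U[j]
        g[j] = e_j * elliott_prime(n[j])
        delta -= g[j]
    return g

def units_to_train(g):
    """Boolean mask of hidden units whose error signal exceeds the RMS.
    Assumption: magnitudes are compared, since signals can be negative."""
    rms = np.sqrt(np.mean(g ** 2))
    return np.abs(g) > rms
```

Only the running value `delta` makes the loops sequential; the dot products `x @ W[j]` and `r @ U[j]` could still be computed for all units in parallel beforehand.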
In addition to the convergence acceleration, the combined effect of training a fraction of the hidden units on a fraction of the samples gives big savings in CPU usage per sample as well. This sparseness of training with respect to evaluation provides further opportunities for speedup, as discussed in Section 4.

3.3 Centering On The Input Space

It is a well known recommendation (Schraudolph, 1998a,b; LeCun et al., 1998) that the input values should be normalized to have zero mean and unit standard deviation over each input dimension. This is achieved by subtracting from each input value the mean and dividing by the standard deviation. For some problems, like the one in Figure 2b, the center of the input space is not equal to the center of the problem. When the input is not known in advance, the latter must be computed adaptively. Moreover, since the hidden units are trained on different input samples, we should compute for each hidden unit its own mean and standard deviation over each input dimension. For the connection between hidden unit $j$ and input unit $i$ we can adaptively compute the approximate mean $m_{ji}$ and standard deviation $s_{ji}$ over the inputs that train the hidden unit, using either exponential traces:

$$m_{ji}(t) \leftarrow \beta\, x_i + (1 - \beta)\, m_{ji}(t-1),$$
$$q_{ji}(t) \leftarrow \beta\, x_i^2 + (1 - \beta)\, q_{ji}(t-1),$$
$$s_{ji}(t) \leftarrow \left(q_{ji}(t) - m_{ji}(t)^2\right)^{1/2},$$

or perturbed calculations:

$$m_{ji}(t) \leftarrow m_{ji}(t-1) + \beta \left(x_i - m_{ji}(t-1)\right),$$
$$v_{ji}(t) \leftarrow v_{ji}(t-1) + \beta \left[\left(x_i - m_{ji}(t)\right)\left(x_i - m_{ji}(t-1)\right) - v_{ji}(t-1)\right],$$
$$s_{ji}(t) \leftarrow v_{ji}(t)^{1/2},$$

where $\beta$ is a constant that determines the time scale of exponential averaging, vector $x$ holds the input values, matrix $Q$ holds the means of the squared input values and matrix $V$ holds the variances. The means and standard deviations of a hidden unit’s input connections are updated only when the hidden unit is trained. The result of this treatment is that each hidden unit is centered on a different part of the input space.
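The exponential-trace variant of these statistics can be sketched as a small NumPy class. The class name `InputStats`, the initial values (zero mean, unit mean square) and the small floor `eps` under the square root are assumptions made for the example; the report does not state them.

```python
import numpy as np

class InputStats:
    """Per-connection mean and standard deviation via exponential traces."""

    def __init__(self, eta, mu, beta=1e-3):
        self.beta = beta
        self.m = np.zeros((eta, mu))   # means m_ji (assumed to start at 0)
        self.q = np.ones((eta, mu))    # means of squares q_ji (assumed 1)

    def update(self, x, trained):
        """x: (mu,) input; trained: (eta,) boolean mask of trained units.
        Statistics change only for the hidden units that are trained."""
        b = self.beta
        self.m[trained] = b * x + (1.0 - b) * self.m[trained]
        self.q[trained] = b * x ** 2 + (1.0 - b) * self.q[trained]

    def std(self, eps=1e-8):
        # eps guards against a zero or slightly negative variance estimate.
        return np.sqrt(np.maximum(self.q - self.m ** 2, eps))

    def normalize(self, x):
        """Return z with z[j, i] = (x[i] - m[j, i]) / s[j, i]."""
        return (x - self.m) / self.std()
```

With the mask produced by the hidden unit selection, a call such as `stats.update(x, mask)` moves only the centers of the units that were actually trained on the sample.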
This center is indirectly affected by the error that the inputs produce on the hidden unit. The magnitude of the constant β is problem specific, but in all experiments in this report it was kept fixed to 10−3 . This constant must be selected large enough, so that the centers will rapidly move to their optimum locations, and small enough, so that the hidden units will see a relatively static view of the input space and the gradient descent algorithm will not be confused. As the hidden units jitter around their centers, we effectively train them on slightly shifted views of the input space, something that can assist generalization. We get something analogous to training with jitter (Reed et al., 1995), at no extra cost. In Figure 6, the squares show where each hidden unit is centered. You can see that most are centered on the problem boundaries at regular intervals. The crosses show the standard deviations. On some directions the standard deviations are very small, which results in very high normalized input values, causing the hidden units to act as threshold units at those directions. The sloped lines show the hyperplane distance from center and the slope. These are computed for display purposes, from their theoretical formulas for a conventional network, without considering the effect of the cascaded connections. For some units the hyperplanes shown are not exactly on the boundaries. This is because of the fixed cascaded connections that cause the hidden units to be not exactly linear discriminants. In the last picture you can see the decision surface of a hidden unit which is a bit curved and coincides with the class boundary although its calculated hyperplane is not on the boundary. An observant reader may also notice that the hyperplane distances from the centers are very small, which implies that the corresponding biases are small as well. On the contrary, if all hidden units were centered on the center of the image, we would have the following problem. 
The hyperplanes of some hidden units must be positioned on the outer parts of the image. For this to happen, these units should develop large biases relative to the weights. This would give their activations small variances. These small variances might need to be compensated by large output weights and biases, which would saturate the output units and in addition ill-condition the problems. One may wonder if the hidden biases are still necessary. Since the centers are individually set, it may seem at first that they are not. However, the centers are not trained through error backpropagation, and the hyperplanes do not necessarily pass over them. The biases’ role is to drive the hyperplanes to the correct location and thus pull the centers in the corresponding direction. The individual centering of the hidden units based on the samples’ positions is feasible, because we train only on samples with high errors and only the hidden units with high errors. By ignoring the small errors, we effectively position the center of each hidden unit near the center of mass of the high errors that it receives.

Figure 6: Hidden unit centers, standard deviations, hyperplanes, global and local training sets and a hidden unit’s output. The images were captured at the final stage of training, of the problem in Figure 1a with 64 hidden units.

However, this centering technique can still be used even if one chooses to train on all samples and all hidden units. Then, the statistics interval should be differentiated for each hidden unit and be recomputed for each sample relative to the normalized absolute error that each hidden unit receives. A way to do it is to set the effective statistics interval for hidden unit j
and sample s to:

$$\beta\, \frac{|e_{j,s}|}{\overline{|e_j|}}$$

where $\beta$ is the global statistics interval, $e_{j,s}$ is the hidden unit’s backpropagated error for the sample and $\overline{|e_j|}$ is the mean of the absolute backpropagated errors that the hidden unit receives, measured via an exponential trace. The denominator acts as a normalizer, which makes the hidden unit’s mobility independent of the average magnitude of the errors. Centering on other factors has been extensively investigated by Schraudolph (1998a,b). These techniques can provide further convergence acceleration, but we chose not to use them because of the additional computational overhead that they require.

3.4 A Hybrid Activation Function

As it is shown in Section 5, the aforementioned techniques enable successful training on some difficult problems like those in Figures 1a and 1b. However, if the problem contains subproblems, or put in another way, if the problem generates more than one cluster of high error density, the centering mechanism does not manage to drive the hidden unit centers to the most suitable locations. The centers are attracted by the larger subproblem or get stuck in areas between the subproblems, as shown in Figure 7.

Figure 7: Model, training set, and inadequate centering.

We need a mechanism that can force a hidden unit to get out of balanced but suboptimal positions. It would be nice if this mechanism could also allow the centers to migrate to various points in the input space as the need arises. It has been found that both of these requirements are fulfilled by a new hybrid activation function. Sigmoid activations have the property that they produce hyperplanes that separate the input space globally. Our intention is to use a sigmoid like hidden activation function, because it can provide global separability, and at the same time, reduce the activation value towards zero on inputs which are not important to a hidden unit.
The Gaussian function is commonly used within radial basis function (RBF) neural networks (Broomhead and Lowe, 1988). When this function is applied to the distance of a sample to the unit’s center, it produces a local response which is stronger near the center. We can then enclose the sigmoidal activation within a Gaussian envelope, by multiplying the activation with a value between 0 and 1, which is provided by applying the Gaussian function to the distance that is measured in the normalized input space. When the number of input dimensions is large, the distance metric that must be used is not an obvious choice. Table 1 contains the distance metrics that we have considered. The most suitable distance metric seems to depend on the distribution of the samples that train the hidden units.

    Metric               Formula
    Euclidean            $\sqrt{\sum_{i=1}^{\mu} x_i^2}$
    Euclidean Scaled     $\sqrt{\frac{1}{\mu} \sum_{i=1}^{\mu} x_i^2}$
    Manhattan            $\sum_{i=1}^{\mu} |x_i|$
    Manhattan Scaled     $\frac{1}{\mu} \sum_{i=1}^{\mu} |x_i|$
    Chebyshev            $\max_i |x_i|$

Table 1: Various distance metrics that have been considered for the hybrid activation function.

In particular, if the samples follow a uniform distribution over a hypercube, then the Euclidean distance has the disturbing property that the average distance grows larger as the number of input dimensions increases and consequently the corresponding average Gaussian response decreases towards zero. As suggested by Hegland and Pestov (1999), we can make the average distance to center independent of the input dimensions, by measuring it in the normalized input space and then dividing it by the square root of the input’s dimensionality. The same problem occurs for the Manhattan distance which has been claimed to be a better metric in high dimensions (Aggarwal et al., 2001). We can normalize this distance by dividing it by the input’s dimensionality.
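The five metrics of Table 1 can be written directly with NumPy. The function names are hypothetical and the vector `z` is assumed to hold the already normalized inputs of one hidden unit.

```python
import numpy as np

def euclidean(z):
    return float(np.sqrt(np.sum(z ** 2)))

def euclidean_scaled(z):
    # Euclidean distance divided by the square root of the dimensionality.
    return float(np.sqrt(np.mean(z ** 2)))

def manhattan(z):
    return float(np.sum(np.abs(z)))

def manhattan_scaled(z):
    # Manhattan distance divided by the dimensionality.
    return float(np.mean(np.abs(z)))

def chebyshev(z):
    return float(np.max(np.abs(z)))

z = np.array([3.0, -4.0])
distances = {f.__name__: f(z) for f in
             (euclidean, euclidean_scaled, manhattan, manhattan_scaled, chebyshev)}
```

Writing the scaled variants with `np.mean` makes the division by the dimensionality explicit without a separate parameter.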
A problem that appears for both of the above rescaled distance metrics is that for the samples that are near the axes the distances will be very much attenuated and the corresponding Gaussian responses will be close to one, something that will make the Gaussian envelopes ineffective. A more suitable metric for this type of distribution is the Chebyshev metric whose average magnitude is independent of the dimensions. However, for reasons analogous to those mentioned above, this metric is not the most suitable if the distribution of the samples is spherical. In that case, the Euclidean distance does not need any rescaling and is the most natural distance measure. We can obtain spherical distributions by adaptively whitening them. As Plumbley (1993) and Laheld and Cardoso (1994) independently proposed, the whitening matrix $Z$ can be adaptively computed as:

$$Z_{t+1} = Z_t - \lambda \left(z_t z_t^{T} - I\right) Z_t$$

where $\lambda$ is the learning rate parameter, $z_t = Z_t x_t$ is the whitened vector and $x_t$ is the input vector. However, we would need too many additional parameters to do it individually for each subset of samples on which each hidden unit is trained. For the above reasons (and because of lack of a justified alternative), in the implementation of these techniques we typically use the Euclidean metric when the number of input dimensions is up to three and the Chebyshev metric in all other cases. We have also replaced the usual tanh (sigmoidal) and Gaussian (bell-like) functions by similar functions which do not involve exponentials (Elliott, 1993). For each hidden unit $j$ we first compute the net-input $n_j$ to the hidden unit (that is, the weighted distance of the sample to the hyperplane), as the inner product of normalized inputs and weights plus the bias:

$$z_{ji} = \frac{x_i - m_{ji}}{s_{ji}}, \qquad n_j = z_j \cdot w_j.$$

We then compute the sample’s distance $d_j$ to the center of the unit, which is measured in the normalized input space:

$$d_j = \lVert z_j \rVert.$$
Finally, we compute the activation $h_j$ as:

$$a_j = \mathrm{Elliott}(n_j) = \frac{n_j}{1 + |n_j|}, \qquad b_j = \mathrm{bell}(d_j) = \frac{1}{1 + d_j^2}, \qquad h_j = a_j\, b_j.$$

Since $d_j$ is not a function of $w_j$, we treat $b_j$ as a constant for the calculation of the activation derivative with respect to $n_j$, which becomes:

$$\frac{\partial h_j}{\partial n_j} = b_j \left(1 - |a_j|\right)^2.$$

The hybrid activation function, which by definition may only be used for hidden units connected to the input layer, enables these units to acquire selective attention capabilities on the input space. Each hidden unit may have a global or local receptive field on each input dimension. The size of this dimensional receptive field depends on the standard deviation which is computed for the corresponding dimension. This activation makes balanced positions between subproblems unstable. As soon as the center is changed by a small amount, it will be attracted by the nearest subproblem. This is because the unit’s activation and the corresponding error will be increased for samples towards the nearest subproblem and decreased in the other direction. Hidden units can still be centered between subproblems but only if their movement in either direction causes a large error for samples in the opposite direction, that is, if they are absolutely necessary at their current position. Additionally, if a unit is centered near a subproblem that produces low errors and the unit is not necessary in that area, then it may migrate to other areas that still have high errors. This unit center migration has been observed in all experiments on complex problems. This may be due to the non-linear response of the bell function, and its long tails which keep the activation above zero for all input samples.

Figure 8: Model, evaluation, training set, hidden unit centers and two hidden unit outputs showing the effect of the hybrid activation function. The images were captured at the final stage of training, of the problem in Figure 1d with 700 hidden units.
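The hybrid activation of one hidden unit can be sketched as below. It assumes the Elliott function in its usual form $n/(1+|n|)$, passes the bias as a separate argument, and follows the metric choice stated in the text (Euclidean for up to three input dimensions, Chebyshev otherwise); the function name and signature are hypothetical.

```python
import numpy as np

def hybrid_activation(x, m, s, w, bias=0.0):
    """Return (h, dh_dn) for a single hidden unit.

    x: (mu,) raw inputs; m, s: (mu,) per-connection mean and std;
    w: (mu,) input weights. Assumption: bias is a separate scalar.
    """
    z = (x - m) / s                          # normalized inputs
    n = float(z @ w + bias)                  # net input
    if z.size <= 3:
        d = float(np.sqrt(np.sum(z ** 2)))   # Euclidean distance to center
    else:
        d = float(np.max(np.abs(z)))         # Chebyshev distance to center
    a = n / (1.0 + abs(n))                   # Elliott sigmoid
    b = 1.0 / (1.0 + d ** 2)                 # bell-shaped envelope
    h = a * b
    dh_dn = b * (1.0 - abs(a)) ** 2          # b treated as a constant
    return h, dh_dn

h, dh_dn = hybrid_activation(np.array([1.0, 0.0]), np.zeros(2),
                             np.ones(2), np.array([1.0, 0.0]))
```

A small standard deviation in `s` along one dimension inflates `z` there, which shrinks the envelope `b` quickly and gives the unit a narrow receptive field in that direction.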
In Figure 8 you can see a complex problem with 9 clusters of high errors. The hidden units place their centers on all clusters and are able to solve the problem. In the last two images, you can see the effect of the hybrid activation function which attenuates the activation at points far from the center with respect to the standard deviation on each dimension. One unit develops a near circular local receptive field and another develops an elongated ellipsoidal receptive field. The latter provides almost global separation in the vertical direction and becomes a useful discriminant for two of the subproblems. One may find similarities between this hybrid activation function and the Square-MLP architecture described by Flake (1998). The latter partially implements higher order neurons by duplicating the number of input units and setting the new input values equal to the squares of the original inputs. This architecture enables the hidden units to form local features of various shapes, but not the locally constrained sigmoid formed by our proposal. In contrast, the hybrid activation function does not need any additional parameters beyond those that are already used for centering and it has the additional benefit, which is realized by the local receptive fields in conjunction with the small biases and the symmetric sigmoid, that the hidden activations will have a mean close to zero. As discussed by Schraudolph (1998a,b) and LeCun et al. (1998), this is very beneficial for the output layer training. However, there is still room for improvement. As was also observed by Flake (1998), the orientations of the receptive field ellipses are always in the direction of one of the input axes. This limitation is expected to hinder performance when training hidden units which have sloped hyperplanes. Figure 9 shows a complex problem in the middle of training.
Units with sloped hyperplanes are trained on samples whose input values are highly correlated. This can slow down learning by itself, but in addition, the standard deviations cannot get sufficiently small and as a result the receptive field cannot be sufficiently shrunk in the direction perpendicular to the hyperplane. As a result the hidden unit’s activation unnecessarily interferes with the activations of nearby units. Although it may be possible to address the correlation problem with a more sophisticated training method that uses second order gradient information, like Stochastic Meta Descent (Schraudolph, 1999, 2002), the orientations of the receptive fields will still be limited. In Section 6.2 we discuss possible directions for further research that may circumvent this limitation.

Figure 9: Evaluation and global and local training sets during middle training for the problem in Figure 1b. It can be seen that a hidden unit with a sloped hyperplane is trained on samples with highly correlated input values. Samples that are separated by horizontal or vertical hyperplanes are easier to learn.

4. Further Speedups

In this section we first describe an implementation technique that reduces the computational requirements of the error evaluation phase and then we give references to methods that have been proposed by other authors for the acceleration of the training phase.

4.1 Evaluation Speedup

Two of the discussed techniques, training only for samples with high errors, and then training only the hidden units with high error, make the error-evaluation phase the most processing-demanding phase for the solution of a given problem. In addition, some other techniques, like board game learning through temporal difference methods, require many evaluations to be performed before each train. We can speed up evaluation by the following observation: For many problems, only part of the input is changed on successive samples.
For example, for a backgammon program with 200 input units (with raw board data and no additional encoded features), very few inputs will change on successive positions. Even on two dimensional problems such as images, we can arrange to train on samples selected by random changes on the X and Y dimensions alternately. This process of resampling only one coordinate at a time is also known as “Gibbs sampling” and it generalises nicely to more than two coordinates (Geman and Geman, 1984). Thus, we can keep in memory all intermediate results from the evaluation, and recalculate only for the inputs that have changed. This implementation technique requires more storage, especially for high dimensional inputs. Fortunately, storage is not an issue on modern hardware.

4.2 Training Speedup

Many authors have proposed methods for speeding up online training by using second order gradient information in order to dynamically vary either the learning rate or the momentum (see LeCun et al., 1993; Leen and Orr, 1993; Murata et al., 1996; Harmon and Baird, 1996; Orr and Leen, 1996; Almeida et al., 1997; Amari, 1998; Schraudolph, 1998c, 1999, 2002; Graepel and Schraudolph, 2002). As shown in the next section, our techniques enable standard stochastic gradient descent with momentum to efficiently solve all the highly non-linear problems that have been investigated. However, the additional speedup that an accelerating algorithm can give is welcome. Moreover, these accelerating algorithms automatically reduce the learning rate when we are close to a solution (by sensing the oscillations in the error gradient), something that we would otherwise have to do through annealing if we wanted the best possible solution. We use the Incremental Delta-Delta (IDD) accelerating algorithm (Harmon and Baird, 1996), an incremental nonlinear extension to Jacobs’ (1988) Delta-Delta algorithm, because of its simplicity and relatively small processing requirements.
IDD computes an individual learning rate $\lambda$ for each weight $w$ as:

$$\lambda(t) = e^{\xi(t)}, \qquad \xi(t+1) = \xi(t) + \theta\, \frac{\Delta w(t+1)}{\lambda(t)}\, \Delta w(t),$$

where $\theta$ is the meta-learning rate, which we typically set to 0.1.

5. Experimental Results

In order to measure the effectiveness of the described techniques on various classes of problems, we performed several experiments. Each experiment was replicated 10 times with different random initial weights using matched random seeds and the means and standard deviations of the results were plotted in the corresponding figures. For the experiments we used a single hidden layer, the cross entropy error function, the logistic or softmax activation function for the output units and the Elliott or hybrid activation function for the hidden units. Output to hidden layer weights and biases were initialized to zero. Hidden to input layer weights were initialized to random numbers from a normal distribution and then rescaled so that the incoming weights to each hidden unit had norm unity. Hidden unit biases were initialized to a uniform random number between zero and one.

The curves in the figures are labelled with a combination of the following letters which indicate the techniques that were applied:

B – Adjust weights using stochastic gradient descent with momentum 0.9 and fixed learning rate $0.1/\sqrt{c}$, where $c$ is the number of incoming connections to the unit.
A – Adjust weights using IDD with meta-learning rate 0.1 and initial learning rate $1/\sqrt{c}$, where $c$ is as above.
L – Use fixed cascaded inhibitory connections as described in Section 3.1.
S – Skip weights adjustment for samples with low error as described in Section 2.
U – Skip weights adjustment for hidden units with low error as described in Section 3.2.
C – Use individual means and stdevs for each hidden to input connection as described in Section 3.3.
H – Use the hybrid activation function as described in Section 3.4.
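The IDD rule behind the ‘A’ method can be illustrated for a single weight with the following Python sketch. The function names are hypothetical, and the sketch assumes that the weight change is a plain gradient step, $\Delta w(t+1) = -\lambda(t)\, g(t+1)$, without momentum; the source does not spell out this detail.

```python
import math
from typing import Tuple


def initial_xi(c: int) -> float:
    """Log of the initial learning rate 1/sqrt(c) for c incoming connections."""
    return -0.5 * math.log(c)


def idd_step(w: float, xi: float, grad: float, prev_dw: float,
             theta: float = 0.1) -> Tuple[float, float, float]:
    """One IDD-style update for a single weight.

    Returns (new_w, new_xi, dw). Assumes dw(t+1) = -lambda(t) * grad,
    which is an illustrative choice, not confirmed by the text.
    """
    lam = math.exp(xi)                           # lambda(t) = e^{xi(t)}
    dw = -lam * grad                             # assumed plain gradient step
    new_xi = xi + theta * (dw / lam) * prev_dw   # xi(t+1) from the rule above
    return w + dw, new_xi, dw
```

When two successive weight changes have the same sign, $\xi$ grows and the learning rate increases; when they alternate in sign, as happens when the weight oscillates around a minimum, $\xi$ shrinks and the learning rate decreases. Because the rate is stored as an exponent, it always stays positive.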
For the ‘B’ training method we deliberately avoided an annealing schedule for the learning rate, since this would destroy the initial state invariance of our techniques. Instead, we used a fixed small learning rate which we compensated with a large momentum. For the ‘A’ method, we used a small meta-learning rate, to avoid instabilities due to the high non-linearities of the examined problems. It is important to note that for both training methods the learning parameters were fixed to the above values and not optimized for each individual problem. For the ‘C’ technique, the centers of the hidden units were initially set to the center of the input space and the standard deviations were set to one third of the distance between the extreme values of each dimension. When the technique was not used, a global preprocessing was applied which normalized the input samples to have zero mean and unit standard deviation.

5.1 Two Input Dimensions

In this section we give experimental results for the class of problems that we have mainly examined, that is, problems in two input dimensions and one output dimension, for which we have dense and noiseless training samples from the whole input space. In the figures, we measure the average classification error with respect to the stage of training. The classification error was averaged via an exponential trace with time scale $10^{-4}$.

5.1.1 Comparison of Technique Combinations

For these experiments we used the two-spirals problem shown in Figures 1a, 3, 4 and 6. We chose this problem as a non-trivial representative of the class of problems that during early training generate a single cluster of high error density. The goal of this experiment is to measure the effectiveness of various technique combinations and then to measure how well the best technique combination scales with the size of the hidden layer.
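The initialization choices described at the start of Section 5 (unit-norm hidden weights, uniform biases, the fixed learning rate of the ‘B’ method and the initial centers and standard deviations of the ‘C’ technique) can be sketched as follows. This is a minimal Python illustration; all function names are hypothetical, and whether the connection count $c$ includes the bias is not stated in the text.

```python
import math
import random
from typing import List, Sequence, Tuple


def init_hidden_layer(n_in: int, n_hidden: int,
                      rng: random.Random) -> Tuple[List[List[float]], List[float]]:
    """Gaussian input weights rescaled to unit norm, biases uniform in [0, 1]."""
    weights = []
    for _ in range(n_hidden):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        norm = math.sqrt(sum(v * v for v in row))
        weights.append([v / norm for v in row])   # incoming weights have norm 1
    biases = [rng.uniform(0.0, 1.0) for _ in range(n_hidden)]
    return weights, biases


def init_centering(lows: Sequence[float],
                   highs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Initial center (middle of the range) and stdev (one third of the range)."""
    means = [(lo + hi) / 2.0 for lo, hi in zip(lows, highs)]
    stdevs = [(hi - lo) / 3.0 for lo, hi in zip(lows, highs)]
    return means, stdevs


def fixed_learning_rate(c: int, base: float = 0.1) -> float:
    """Learning rate base / sqrt(c) for a unit with c incoming connections."""
    return base / math.sqrt(c)
```

For instance, `init_centering([-1.0, 0.0], [1.0, 6.0])` places the initial center at (0, 3) with standard deviations 2/3 and 2, and `fixed_learning_rate(4)` gives 0.05.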
Figures 10 and 11 show the average classification error with respect to the number of evaluated samples and processing cycles respectively for 13 technique combinations. For these experiments we used 64 hidden units. The standard deviations were not plotted in order to keep the figures uncluttered. Figure 10 has also been split into Figures 12 and 13 in order to show the related error bars.

Comparing the curves B vs. BL and BS vs. BLS in Figures 10 and 11, we can see that the fixed cascaded inhibitory connections reduce the asymptotic residual error by more than half. This also applies, but to a lesser degree, when we skip weight updates for hidden units with low errors (B vs. BU, BS vs. BSU). When the two are used in combination, we can see a speedup of convergence but the asymptotic error is only marginally further improved (BLU and BLSU). In Figure 11, it can be seen that skipping samples with low errors can speed up convergence and reduce the asymptotic error as well (BLU vs. BLSU). This is a very intriguing result, in the sense that it implies that the system can learn faster and better by throwing away information.

Both Figures 10 and 11 show the BLUCH curve diverging. Considering the success of the BLSUCH curve, one might infer that skipping samples is necessary for the hybrid activation. However, the real problem, which was identified by viewing the dynamics of training, is that the centering mechanism does not work correctly when we train on all samples. A possible remedy may be to modify the statistics interval which is used for centering, as described at the end of Section 3.3. BLSUC vs. BLSU shows that centering halves the remaining asymptotic error and converges much faster as well. Comparing curve BLSUCH vs. BLSUC, we see that the hybrid activation function does better, but only marginally. This was expected since this problem has a single region of interest, so the ability of H to focus on multiple regions simultaneously is not exercised.
This is the reason for the additional experiments in Section 5.1.2. BLSUCH and ALSUCH were the most successful technique combinations, with the latter being a little faster. Nevertheless, it is very impressive that standard stochastic gradient descent with momentum can approach the best asymptotic error in less than a second on a modern 3.2 GHz processor.

Figure 14 shows the average classification error with respect to the number of evaluated samples, for the ALSUCH technique combination and various hidden layer sizes. It can be seen that the asymptotic error is almost inversely proportional to the number of hidden units. This is a good indication that our techniques use the available resources efficiently. It is also interesting that the convergence rates to the corresponding asymptotic errors are quite fast and about the same for all hidden layer sizes.

Figure 10: Average classification error vs. number of evaluated samples for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The standard deviations have been omitted for clarity. [Plot omitted. Curves: B, BU, BL, BLU, BLUC, BLUCH, BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH. Vertical axis: 0.00 to 0.35. Horizontal axis: 0 to 1,000,000 evaluated samples.]

Figure 11: Average classification error vs. Intel IA32 CPU cycles in billions, for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The horizontal scale also corresponds to seconds when run on a 1 GHz processor. The standard deviations have been omitted for clarity. [Plot omitted. Same curves as Figure 10. Vertical axis: 0.00 to 0.35. Horizontal axis: 0 to 10 billion cycles.]

Figure 12: Part of Figure 10 showing error bars for technique combinations which employ S. [Plot omitted. Curves: BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH.]

Figure 13: Part of Figure 10 showing error bars for technique combinations which do not employ S. [Plot omitted. Curves: B, BU, BL, BLU, BLUC, BLUCH.]

Figure 14: Average classification error vs. number of evaluated samples for various hidden layer sizes, while training the problem in Figure 1a with the ALSUCH technique combination. [Plot omitted. Curves: 32, 48, 64, 96, 128 and 256 hidden units. Vertical axis: 0.00 to 0.20. Horizontal axis: 0 to 1,000,000 evaluated samples.]

5.1.2 Hybrid vs. Conventional Activation

For these experiments we used the two dimensional problem depicted in Figures 1c and 7. We chose this problem as a representative of the class of problems that during early training generate small clusters of high error density of various sizes. For this kind of problem we typically obtain very small residuals for the classification error, although the problem may not have been learned. This is because we measure the error on the whole input space and for these problems most of the input space is trivial to learn. The problem’s complexities are confined to very small areas. The dynamic training set evolution algorithm is able to locate these areas, but we need many more sample presentations, since most of the samples are not used for training.

Figure 15: Average classification error vs. number of evaluated samples for the ALSUCH and ALSUC technique combinations, while training the problem in Figure 1c with 100 hidden units. The dashed lines show the minimum and maximum observed values. [Plot omitted. Vertical axis: 0.00 to 0.04. Horizontal axis: 0 to 3,000,000 evaluated samples.]
The goal of this experiment is to measure the effectiveness of the hybrid activation function at coping with the varying sizes of the subproblems. For these experiments we used 100 hidden units. Figure 15 shows that the ALSUCH technique, which employs the hybrid activation function, halved the asymptotic error with respect to the ALSUC technique. As all of the visual inspections revealed, one of which is reproduced in Figure 16, the difference in the residual errors of the two curves is due to the insufficient approximation of the smaller subproblem by the ALSUC technique.

Figure 16: ALSUCH vs. ALSUC approximations for a problem with two sub-problems. [Image omitted. Panels: Model, ALSUCH, ALSUC.]

5.2 Higher Input and Output Dimensions

In order to evaluate our techniques on a problem with higher input and output dimensions, we selected a standard benchmark, the Letter recognition database from the UCI Machine Learning Repository (Newman et al., 1998). This database consists of 20000 samples that use 16 integer attributes to classify the 26 letters of the English alphabet. This problem is characterized by a medium input dimensionality and a large output dimensionality. The latter makes it a very challenging problem for any classifier.

This problem differs from those on which we have experimented so far, in that we do not have the whole input space at our disposal for training. We must train on a limited number of samples and then test the system’s generalization abilities on a separate test set. Although we have not taken any special measures to assist generalization, the experimental results indicate that our techniques have the inherent ability to generalize well, when given noiseless exemplars. An observation that applies to this problem is that the IDD accelerated training method could not do better than standard stochastic gradient descent with momentum.
Thus, we report results using the BLSUCH technique combination, which is computationally more efficient than the ALSUCH technique. For this experiment, which involves more than two output classes, we used the softmax activation function at the output layer.

Table 2 contains previously published results showing the classification accuracy of various classifiers. The most successful of them were the AdaBoosted versions of the C4.5 decision-tree algorithm and of a feed forward neural network with two hidden layers. Both classifier ensembles required quite a lot of machines in order to achieve that high accuracy.

| Classifier | Test Error % | Reference |
|---|---|---|
| Naive Bayesian classifier | 25.3 | Ting and Zheng (1999) |
| AdaBoost on Naive Bayesian classifier | 24.1 | Ting and Zheng (1999) |
| Holland-style adaptive classifier | 17.3 | Frey and Slate (1991) |
| C4.5 | 13.8 | Freund and Schapire (1996) |
| AdaBoost on C4.5 (100 machines) | 3.3 | Freund and Schapire (1996) |
| AdaBoost on C4.5 (1000 machines) | 3.1 | Schapire et al. (1997) |
| CART | 12.4 | Breiman (1996) |
| AdaBoost on CART (50 machines) | 3.4 | Breiman (1996) |
| 16-70-50-26 MLP (500 online epochs) | 6.2 | Schwenk and Bengio (1998) |
| AdaBoost on 16-70-50-26 MLP (20 machines) | 2.0 | Schwenk and Bengio (1998) |
| AdaBoost on 16-70-50-26 MLP (100 machines) | 1.5 | Schwenk and Bengio (2000) |
| Nearest Neighbor | 4.3 | Fogarty (1992) |

Table 2: A compilation of previously reported best error rates on the test set for the UCI Letters Recognition Database.

Figure 17 shows the average error reduction with respect to the number of online epochs, for the BLSUCH technique combination and various hidden layer sizes. As suggested in the database’s documentation, we used the first 16000 samples for training and for measuring the training accuracy and the remaining 4000 samples to measure the predictive accuracy. The solid and dashed curves show the test and training set errors respectively.
Similarly to ensemble methods, we can observe two interesting phenomena which both seem to contradict the Occam’s razor principle. The first observation is that the test error stabilizes or continues to slightly decrease even after the training error has been zeroed. What is really happening is that the RMS error for the training set (which is related to the confidence of classification) continues to decrease even after the classification error has been zeroed, something that is also beneficial for the test set’s classification error. The second observation is that increasing the network’s capacity does not lead to overfitting. Although the training set error can be zeroed with just 125 hidden units, increasing the number of hidden units reduces the residual test error as well. We attribute this phenomenon to the conjecture that the hidden units’ differentiation results in a smoother approximation (as suggested by Figure 5 and the related discussion).

Figure 17: Average error reduction vs. number of online epochs for various hidden layer sizes, while training on the UCI Letters Recognition Database with the BLSUCH technique combination. The solid and dashed curves show the test and training set errors respectively. The standard deviations for the training set errors have been omitted for clarity. The embedded table contains the minimum observed errors across all trials and epochs, and the average errors across all trials at epoch 100. [Plot omitted. Curves: train and test errors for 125, 250, 500 and 1000 hidden units. Horizontal axis: 0 to 100 online epochs.]

| Hidden units | Min test error % | Avg test error % at end |
|---|---|---|
| 125 | 4.0 | 4.6 |
| 250 | 2.8 | 3.2 |
| 500 | 2.3 | 2.6 |
| 1000 | 2.1 | 2.4 |

Comparing our results with those in Table 2, we can also observe the following: The 16-125-26 MLP (5401 weights) reached a 4.6% misclassification error on average, which is 26% better than the 6.2% of the 16-70-50-26 MLP (6066 weights), despite the fact that it had fewer weights, a simpler
architecture with only one hidden layer, and was trained for far fewer online epochs. It is indicative that the asymptotic residual classification error on the test set was reached in about 30 online epochs. The 16-1000-26 MLP (43026 weights) reached a 2.4% misclassification error on average, which is the third best published result following the AdaBoosted 16-70-50-26 MLPs with 20 and 100 machines (121320 and 606600 weights respectively). The lowest observed classification error was 2.1% and was reached in one of the 10 runs at the 80th epoch. It must be stressed that the above results were obtained without any optimization of the learning rate, without a learning rate annealing schedule and within a far shorter training time. All MLPs with 250 hidden units and above gave results which put them at the top of the list of non-ensemble techniques and they even outperformed AdaBoost on C4.5 with 100 machines. Similarly to Figure 14, we also see that the convergence rates to the corresponding asymptotic errors on the test set are quite fast and about the same for all hidden layer sizes.

6. Discussion and Future Research

We have presented global and local selective attention techniques that can help neural network training to concentrate on the difficult parts of complex non-linear problems. A new hybrid activation function has also been presented that enables the hidden units to acquire individual receptive fields in the input space. These individual receptive fields may be global or local depending on the problem’s local complexities. The success of the new activation function is due to the fact that it depends on two distances. The first is the weighted distance of a sample to the hidden unit’s hyperplane. The second is the distance to the hidden unit’s center. We need both distances; neither of them is sufficient on its own. The first helps us discriminate and the second helps us localize.
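The interplay of the two distances can be illustrated with a small Python sketch. It is not the formula of Section 3.4, which is not reproduced here: the Gaussian attenuation and the product combination are assumptions made for illustration, and the function names are hypothetical. Only the Elliott squashing function $n/(1+|n|)$ is taken from the literature (Elliott, 1993).

```python
import math
from typing import Sequence


def elliott(n: float) -> float:
    """Elliott squashing function, a symmetric sigmoid with range (-1, 1)."""
    return n / (1.0 + abs(n))


def hybrid_activation(x: Sequence[float], w: Sequence[float], bias: float,
                      mean: Sequence[float], stdev: Sequence[float]) -> float:
    """Sigmoid of the hyperplane distance, attenuated away from the center.

    ASSUMPTIONS (illustrative only): Gaussian attenuation exp(-d^2 / 2)
    and a plain product a * b of the two factors.
    """
    # z-scores of the inputs relative to the unit's own center and spread
    z = [(xi - m) / s for xi, m, s in zip(x, mean, stdev)]
    n = bias + sum(wi * zi for wi, zi in zip(w, z))   # weighted hyperplane distance
    a = elliott(n)                                    # discriminates
    d_sq = sum(zi * zi for zi in z)                   # squared distance to center
    b = math.exp(-0.5 * d_sq)                         # localizes (assumed form)
    return a * b
```

Under these assumptions, a sample one standard deviation from the center along the weight direction still produces a clear response, while a sample ten standard deviations away along another axis is almost completely suppressed even though its distance to the hyperplane is the same.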
The dynamic training set evolution algorithm locates the sub-areas of the input space where the problem resides. The fixed cascaded inhibitory connections and the selective training of a subset of the hidden units on each sample force the hidden units to differentiate and attack different subproblems. The individual centering of the hidden units at different points in the input space adaptively conditions the network to the problem’s local structures and enables each hidden unit to solve a well-conditioned subproblem. In coordination with the above, the hidden units’ limited receptive fields allow training to follow a divide and conquer paradigm where each hidden unit only solves a local subproblem. The solutions to the subproblems are then combined by the output layer to give a solution to the original problem.

In the reported experiments we initialized the hidden weights and biases so that the hidden hyperplanes would cover the whole input space at random positions and orientations. The initial norm of the weights was also adjusted so that the net-input to each hidden unit would fall in the transition between the linear and non-linear range of the activation function. These specific initializations were necessary for standard backpropagation. In contrast, we have found that the combined techniques are insensitive to the initial weights and biases, as long as their values are small. We have repeated the experiments with hidden biases set to zero and hidden weight norms set to $10^{-3}$ and the results were equivalent to those reported in Section 5. However, the choice of the best initial learning rate is still problem specific.

An additional and important characteristic of these techniques is that training of the hidden layer does not depend solely on gradient information. Gradient based techniques can only perform local optimization by locating a local minimum of the error function when the system is already in the basin of attraction of that minimum.
Stochastic training has the potential to escape from a shallow basin, but only when the basin is not very wide. Once there, the system cannot escape towards a different basin with a lower minimum. In contrast, in our model some of the hidden layer’s free parameters (the weights) are trained through gradient descent on the error, whereas others (the means and standard deviations) are “trained” from the statistical properties of the back-propagated errors. Each hidden unit places its center near the center of mass of the error that it receives and limits its visibility only to the area of the input space where it produces a significant error. This model makes the hidden error landscape change constantly. We conjecture that during training, paths connecting the various error basins are continuously emerging and vanishing. As a result the system can explore much more of the solution space. It is indicative that in all the reported experiments, all trials converged to a solution with more or less the same residual error irrespective of the initial network state.

The combination of the presented techniques enables very fast training on complex classification problems with embedded subproblems. By focusing on the problem’s details and efficiently utilizing the available resources, they make feasible the solution of very difficult problems (like the one in Figure 1e), provided that an adequate number of hidden units has been used. Although other machine learning techniques can do the same, to our knowledge this is the first report that this can be done using ordinary feed forward neural networks and backpropagation, in an online, adaptive and memory-less scenario, where the input exemplars are unknown before training and discarded after being used. In the following we discuss some areas that deserve further investigation.
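The statistics-driven part of the model, in which the means and standard deviations are tracked rather than trained by gradient descent, can be sketched with the quantities listed in Appendix A: the trace constant β, the mean m, the mean of squares q, the variance v, the standard deviation s and the z-score z. The Python class below is a hypothetical illustration for a single connection. How samples are selected or weighted by the back-propagated error is described in Section 3.3 and is not reproduced here, so the sketch simply assumes that `update` is called for the samples on which the unit is trained.

```python
import math


class RunningStats:
    """Exponential traces of the mean and mean square of one input value.

    Hypothetical helper: beta is the trace constant, m the mean,
    q the mean of the squared values, v = q - m^2 the variance.
    """

    def __init__(self, mean: float, stdev: float, beta: float) -> None:
        self.beta = beta
        self.m = mean
        self.q = stdev * stdev + mean * mean   # so that q - m^2 = stdev^2

    def update(self, x: float) -> None:
        """Move both traces a fraction beta towards the new sample."""
        self.m += self.beta * (x - self.m)
        self.q += self.beta * (x * x - self.q)

    def stdev(self, floor: float = 1e-12) -> float:
        """Standard deviation s, with a small floor against rounding errors."""
        return math.sqrt(max(self.q - self.m * self.m, floor))

    def zscore(self, x: float) -> float:
        """Normalized input z = (x - m) / s."""
        return (x - self.m) / self.stdev()
```

With a small β the center drifts slowly towards the region where the samples come from, and the standard deviation shrinks when those samples are tightly clustered, which narrows the receptive field around them.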
6.1 Generalization and Regression

For the classes of problems that were investigated, we had noiseless exemplars and the whole input space at our disposal for training, so there was no danger of overfitting. Thus, we did not use any mechanism to assist generalization. This does not mean of course that the network just stored the input-output mapping, as a lookup table would do. By putting constraints on the positions and orientations of the hidden unit hyperplanes and by limiting their receptive fields, we reduced the system’s available degrees of freedom, and the network arranged its resources so as to achieve the best possible input-output mapping approximation. The experiments on the Letter Recognition Database showed remarkable generalization capabilities. However, when we train on noisy samples or when the number of training samples is small with respect to the size and complexity of the input space, there is a danger of overfitting. It remains to be examined how the described techniques are affected by methods that avoid overfitting, such as training with jitter, error regularization, target smoothing and sigmoid gain attenuation (Reed et al., 1995). This consideration also applies to regression problems, which usually require smoother approximations. Although early experiments give evidence that the presented techniques can be applied to regression problems as well, we feel that some smoothing technique must be included in the training framework.

6.2 Receptive Fields Limited Orientations

As noted in Section 3.4, the orientations of the receptive field ellipses are limited to the direction of one of the input axes. This hinders training performance by not allowing the receptive fields to be adequately shrunk in the direction perpendicular to the hyperplane. In addition, hidden units with sloped hyperplanes are trained on highly correlated input values. These problems are expected to be exacerbated in high dimensional input spaces.
We would cure both of these problems simultaneously if we could individually transform the input for each hidden unit through adaptive whitening, or if we could present to each hidden unit a rotated view of the input space, such that one of the axes is perpendicular to the hyperplane and the rest are parallel to it. Unfortunately, both of the above transformations would require too many additional parameters. An approximation (for 2 dimensional problems) that we are currently investigating is the following: For each input vector we compute K vectors rotated around the center of the input space with successive angle increments equal to π/(2K). Our purpose is to obtain uniform rotations between 0 and π/4. Every few hundred training steps, we reassign to each hidden unit the most appropriate input representation and adjust the affected parameters (weights, means and stdevs). The results are promising.

6.3 Dynamic Cascaded Inhibitory Connections

Regarding the fixed cascaded inhibitory connections, it must be examined whether it is better to make the strength of the connections dynamic. Minus one is adequate when the weights are small. However, as the weights get larger, the inhibitory connections become less and less effective at differentiating the hidden units. We can try to make them relative to each hidden unit’s average absolute net-input or alternatively to make them trainable. It has been observed that increasing the strength of these connections enables the hidden units to generate more curved discriminant functions, which is very beneficial for some problems.

6.4 Miscellaneous

More experiments need to be done in order to evaluate the effectiveness of the hybrid activation function on highly non-linear problems in many dimensions.
High dimensional input spaces have a multitude of disturbing properties with regard to distance and density metrics, which may affect the hybrid activation in yet unknown ways. Finally, we must devise a training mechanism that is invariant to the initial learning rate and that automatically varies the number of hidden units as each problem requires.

Acknowledgments

I would like to thank all participants in my threads in usenet comp.ai.neural-nets, for their fruitful comments on early presentations of the subjects in this report. Special thanks to Aleks Jakulin for his support and ideas on further research that can make these results even better and to Greg Heath for bringing to my attention the perturbated forms for the calculation of sliding window statistics. I also thank the area editor Léon Bottou and the anonymous reviewers for their valuable comments and for helping me to bring this report into shape for publication.

Appendix A. Notational Conventions

The following list contains the meanings of the symbols that have been used in this report. Symbols with subscripts are used either as scalars or as vectors and matrices when the subscripts are omitted. For example, $w_{ji}$ is a single weight, $w_j$ is a weight vector and $W$ is a weight matrix.

$\alpha$ – A constant that determines the time scale of the exponential trace of the average training-set error within the dynamic training set evolution algorithm.
$\beta$ – A constant that determines the time scale of the exponential trace of the input means and standard deviations.
$\delta$ – An accumulator for the efficient implementation of the fixed cascaded inhibitory connections.
$\eta$ – The number of hidden units.
$\mu$ – The number of input units.
$f$ – The hidden units’ squashing function.
$i$ – Index enumerating the input units.
$j$ – Index enumerating the hidden units.
$k$ – Index enumerating the output units.
$a_j$ – The hidden unit’s activation computed from the sample’s weighted distance to the hidden unit’s hyperplane.
$b_j$ – The hidden unit’s activation attenuation computed from the sample’s distance to the hidden unit’s center.
$d_j$ – The sample’s distance to the hidden unit’s center.
$e_j$ – The hidden unit’s accumulated back propagated errors.
$g_j$ – The hidden unit’s error signal $f'(n_j)\, e_j$.
$h_j$ – The hidden unit’s activation.
$m_{ji}$ – The mean of the values received by hidden unit $j$ from input unit $i$.
$n_j$ – The net-input to the hidden unit.
$q_{ji}$ – The mean of the squared values received by hidden unit $j$ from input unit $i$.
$r_k$ – The error of output unit $k$.
$s_{ji}$ – The standard deviation of the values received by hidden unit $j$ from input unit $i$.
$u_{jk}$ – The weight of the connection from hidden unit $j$ to output unit $k$.
$v_{ji}$ – The variance of the values received by hidden unit $j$ from input unit $i$.
$w_{ji}$ – The weight of the connection from hidden unit $j$ to input unit $i$.
$x_i$ – The value of input unit $i$.
$z_{ji}$ – The normalized input value received by hidden unit $j$ from input unit $i$. It is currently computed as the z-score of the input value. A better alternative would be to compute the vector $z_j$ by multiplying the input vector $x$ with a whitening matrix $Z_j$.

References

C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In J. Van den Bussche and V. Vianu, editors, Proceedings of the 8th International Conference on Database Theory (ICDT), volume 1973 of Lecture Notes in Computer Science, pages 420–434. Springer, 2001.

K. Agyepong and R. Kothari. Controlling hidden layer capacity through lateral connections. Neural Computation, 9(6):1381–1402, 1997.

S. Ahmad and S. Omohundro. A network for extracting the locations of point clusters using selective attention. In Proceedings of the 12th Annual Conference of the Cognitive Science Society, MIT, 1990.

L. B. Almeida, T. Langlois, and J. D. Amaral. On-line step size adaptation. Technical Report INESC RT07/97, INESC/IST, Rua Alves Redol 1000 Lisbon, Portugal, 1997.

S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

P. Bakker. Don’t care margins help backpropagation learn exceptions. In A. Adams and L. Sterling, editors, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 139–144, 1992.

P. Bakker. Exception learning by backpropagation: A new error function. In P. Leong and M. Jabri, editors, Proceedings of the 4th Australian Conference on Neural Networks, pages 118–121, 1993.

S. Baluja and D. Pomerleau. Using the representation in a neural network’s hidden layer for task-specific focus of attention. In IJCAI, pages 133–141, 1995.

L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996.

D. S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2(3):321–355, 1988.

W. Duch, K. Grudzinski, and G. H. F. Diercksen. Minimal distance neural methods. In World Congress of Computational Intelligence, pages 1299–1304, 1998.

D. L. Elliott. A better activation function for artificial neural networks. Technical Report TR 93-8, The Institute for Systems Research, University of Maryland, College Park, MD, 1993.

G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 145–163. Springer, 1998.

T. C. Fogarty. Technical note: First nearest neighbor classification on Frey and Slate’s letter recognition problem. Machine Learning, 9(4):387–388, 1992.

Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.

P. W. Frey and D. J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6:161–182, 1991.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

T. Graepel and N. N. Schraudolph. Stable adaptive momentum for rapid online learning in nonlinear systems. In J. R. Dorronsoro, editor, Proceedings of the International Conference on Artificial Neural Networks (ICANN), volume 2415 of Lecture Notes in Computer Science, pages 450–455. Springer, 2002.

M. Harmon and L. Baird. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-6543, 1996.

M. Hegland and V. Pestov. Additive models in high dimensions. Computing Research Repository (CoRR), cs/9912020, 1999.

S. C. Huang and Y. F. Huang. Learning algorithms for perceptrons using back propagation with selective updates. IEEE Control Systems Magazine, pages 56–61, April 1990.

R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.

R. Kothari and D. Ensley. Decision boundary and generalization performance of feed-forward networks with Gaussian lateral connections. In S. K. Rogers, D. B. Fogel, J. C. Bezdek, and B. Bosacchi, editors, Applications and Science of Computational Intelligence, SPIE Proceedings, volume 3390, pages 314–321, 1998.

B. Laheld and J. F. Cardoso. Adaptive source separation with uniform performance. In Proc. EUSIPCO, pages 183–186, September 1994.

Y. LeCun, P. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian’s eigenvectors. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 156–163. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer, 1998.

T. K. Leen and G. B. Orr. Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Proceedings of the 7th NIPS Conference (NIPS), Advances in Neural Information Processing Systems 6, pages 477–484. Morgan Kaufmann, 1993.

P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.

N. Murata, K. Müller, A. Ziehe, and S. Amari. Adaptive on-line learning in changing environments. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 599–605. MIT Press, 1996.

D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.

G. B. Orr and T. K. Leen. Using curvature information for fast stochastic search. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 606–612. MIT Press, 1996.

J. L. Phillips and D. C. Noelle. Reinforcement learning of dimensional attention for categorization. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society, 2004.

M. Plumbley. A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proc. IEE Conf. on Artificial Neural Networks, Brighton, UK, pages 86–90, 1993.

R. Reed, R. J. Marks, and S. Oh. Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter. IEEE Transactions on Neural Networks, 6(3):529–538, 1995.

R. E. Schapire. A brief introduction to boosting. In T.
Dean, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), pages 1401–1406. Morgan Kaufmann, 1999. R. E. Schapire, Y. Freund, P. Barlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, editor, Proceedings of the 14th International Conference on Machine Learning (ICML), pages 322–330. Morgan Kaufmann, 1997. N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002. ¨ N. N. Schraudolph. Centering neural network gradient factors. In G. B. Orr and K. R. M uller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 207–226. Springer, 1998a. N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical Report IDSIA-33-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 1998b. N. N. Schraudolph. Online local gain adaptation for multi-layer perceptrons. Technical Report IDSIA-09-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Galleria 2, CH-6928 Manno, Switzerland, 1998c. N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In ICANN, pages 569–574. IEE, London, 1999. H. Schwenk and Y. Bengio. Boosting neural networks. Neural Computation, 12(8):1869–1887, 2000. H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks for character recognition. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998. M. W. Spratling and M. H. Johnson. Neural coding strategies and mechanisms of competition. Cognitive Systems Research, 5(2):93–117, 2004. C. Thornton. The howl effect in dynamic-network learning. In Proceedings of the International Conference on Artificial Neural Networks, pages 211–214, 1992. K. M. Ting and Z. Zheng. 
Improving the performance of boosting for naive bayesian classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 296–305, 1999. Y. H. Yu and R. F. Simmons. Descending epsilon in back-propagation: A technique for better generalization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 3, pages 167–172, 1990. S. Zhong and J. Ghosh. Decision boundary focused neural network classifier. In Intelligent Engineering Systems Through Artificial Neural Networks (ANNIE). ASME Press, 2000. 2045
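The notation list above names a mean, a mean of squares, a variance, a standard deviation, and a z-score for the values one hidden unit receives from one input unit. A minimal sketch of how these quantities relate, assuming plain batch statistics (the source does not say how the running statistics are maintained, and the function names and the `eps` guard are hypothetical), could look like this:

```python
import math

def input_statistics(values):
    """Return (m, q, v, s): mean, mean of squares, variance, standard deviation."""
    n = len(values)
    m = sum(values) / n                    # mean of the received values
    q = sum(x * x for x in values) / n     # mean of the squared values
    v = max(q - m * m, 0.0)                # variance via E[x^2] - E[x]^2, clipped at 0
    s = math.sqrt(v)                       # standard deviation
    return m, q, v, s

def z_score(x, m, s, eps=1e-12):
    """Normalize one input value; eps (an assumption) guards against s == 0."""
    return (x - m) / max(s, eps)

m, q, v, s = input_statistics([1.0, 2.0, 3.0])
z = z_score(3.0, m, s)
```

The variance is computed from the two running means, which is why both the mean and the mean of squares appear in the notation.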

4 0.39914382 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

Author: Nicolás García-Pedrajas, César García-Osorio, Colin Fyfe

Abstract: In this paper we propose a novel approach for ensemble construction based on the use of nonlinear projections to achieve both accuracy and diversity of individual classifiers. The proposed approach combines the philosophy of boosting, putting more effort on difficult instances, with the basis of the random subspace method. Our main contribution is that instead of using a random subspace, we construct a projection taking into account the instances which have posed most difficulties to previous classifiers. In this way, consecutive nonlinear projections are created by a neural network trained using only incorrectly classified instances. The feature subspace induced by the hidden layer of this network is used as the input space to a new classifier. The method is compared with bagging and boosting techniques, showing an improved performance on a large set of 44 problems from the UCI Machine Learning Repository. An additional study showed that the proposed approach is less sensitive to noise in the data than boosting methods. Keywords: classifier ensembles, boosting, neural networks, nonlinear projections

5 0.37109187 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression

Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd

Abstract: Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, ℓ1 regularization, regularization path, interior-point methods.
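To make the optimization target concrete, here is a minimal sketch of an ℓ1-regularized logistic regression objective: the average logistic loss plus a penalty on the absolute values of the weights. It only evaluates the objective and is not the interior-point solver the abstract describes. The function name, the labels in {-1, +1}, the unpenalized intercept `v`, and the exact scaling of the loss are assumptions, not details taken from the abstract.

```python
import math

def l1_logreg_objective(w, v, examples, labels, lam):
    """Average logistic loss plus lam * ||w||_1 (intercept v is not penalized)."""
    total = 0.0
    for a, b in zip(examples, labels):
        margin = b * (sum(wi * ai for wi, ai in zip(w, a)) + v)
        # Numerically stable evaluation of log(1 + exp(-margin)).
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(examples) + lam * sum(abs(wi) for wi in w)

examples = [[1.0, 2.0], [3.0, 4.0]]
labels = [1, -1]
at_zero = l1_logreg_objective([0.0, 0.0], 0.0, examples, labels, lam=0.1)
penalty_gap = (l1_logreg_objective([1.0, -2.0], 0.0, examples, labels, lam=0.5)
               - l1_logreg_objective([1.0, -2.0], 0.0, examples, labels, lam=0.0))
```

Because the penalty grows with every nonzero weight, larger values of `lam` push more weights to exactly zero, which is what makes the formulation useful for feature selection.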

6 0.34077069 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)

7 0.34003437 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

8 0.33102739 52 jmlr-2007-Margin Trees for High-dimensional Classification

9 0.30601731 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)

10 0.2385179 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

11 0.20523676 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

12 0.20269273 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

13 0.19896759 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms

14 0.19834261 37 jmlr-2007-Gini Support Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

15 0.19264932 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

16 0.18927018 15 jmlr-2007-Bilinear Discriminant Component Analysis

17 0.18441233 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

18 0.17573756 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

19 0.17464696 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

20 0.1745854 44 jmlr-2007-Large Margin Semi-supervised Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.02), (8, 0.014), (10, 0.018), (12, 0.044), (15, 0.029), (28, 0.053), (40, 0.034), (45, 0.023), (48, 0.031), (57, 0.372), (60, 0.056), (80, 0.023), (85, 0.073), (98, 0.12)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69877285 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks
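The reduction from ordered classes to a two-class problem can be illustrated with a small sketch. It assumes a simplified construction: each sample is copied once per class boundary, extended with a few extra coordinates that identify the boundary, and labeled by whether its class lies above that boundary. The function names, the constant `h`, and the decoding rule are hypothetical choices for illustration; the abstract does not spell out the details of the method.

```python
def replicate(samples, num_classes, h=1.0):
    """Turn (features, label) pairs with label in 1..num_classes into binary data."""
    replicated = []
    for features, label in samples:
        for q in range(1, num_classes):          # one replica per class boundary
            extension = [0.0] * (num_classes - 2)
            if q >= 2:
                extension[q - 2] = h             # mark which boundary this replica is for
            binary_label = 1 if label > q else 0  # "is the class above boundary q?"
            replicated.append((list(features) + extension, binary_label))
    return replicated

def decode(binary_decisions):
    """Recover an ordinal class from the per-boundary binary decisions."""
    return 1 + sum(binary_decisions)

binary_data = replicate([([0.5], 2)], num_classes=3)
predicted_class = decode([1, 0])
```

Any standard binary learner can then be trained on `binary_data`; a single model shared across all boundaries is what keeps the resulting decision regions ordered in this sketch.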

2 0.4048509 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
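As a rough illustration of local smoothing, the sketch below computes a word histogram centred at a location `mu` in the document, weighting each word by how close it sits to `mu`. Sweeping `mu` from 0 to 1 traces a curve of histograms, each of which sums to one. The Gaussian kernel, the bandwidth `sigma`, the mapping of word positions into [0, 1], and the function name are all assumptions made for illustration, not the authors' exact construction.

```python
import math

def local_histogram(words, mu, sigma=0.2):
    """Kernel-weighted word histogram around relative document position mu."""
    n = len(words)
    positions = [(i + 0.5) / n for i in range(n)]   # word positions mapped into (0, 1)
    weights = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in positions]
    total = sum(weights)
    hist = {}
    for word, weight in zip(words, weights):
        hist[word] = hist.get(word, 0.0) + weight / total  # normalized: sums to 1
    return hist

doc = ["a", "a", "b", "b"]
start = local_histogram(doc, mu=0.0)
end = local_histogram(doc, mu=1.0)
```

A plain bag of words would give the same histogram at every location; here the histogram near the start is dominated by "a" and the one near the end by "b", so the ordering of the words is retained.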

3 0.40408367 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

Author: Rie Johnson, Tong Zhang

Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
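For reference, the standard degree-based normalization mentioned in the abstract is commonly written as L = I - D^(-1/2) W D^(-1/2), where W is the symmetric edge-weight matrix and D the diagonal degree matrix. The sketch below computes it for a small dense graph. It shows only this standard scheme, not the alternative normalization the paper proposes, and the handling of isolated nodes is a convention chosen here.

```python
import math

def normalized_laplacian(W):
    """Degree-normalized graph Laplacian I - D^(-1/2) W D^(-1/2) for a dense matrix W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            if degrees[i] > 0 and degrees[j] > 0:
                L[i][j] = identity - W[i][j] / math.sqrt(degrees[i] * degrees[j])
            else:
                # Convention assumed here: isolated nodes keep the identity entry.
                L[i][j] = identity
    return L

W = [[0.0, 1.0], [1.0, 0.0]]   # two nodes joined by one edge of weight 1
L = normalized_laplacian(W)
```

Each edge weight is scaled by the degrees of both endpoints, so the normalization factor depends on the individual node degrees.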

4 0.39905992 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér

Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL-OPTIMAL) vs. finding all features relevant to the target variable (ALL-RELEVANT). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL-RELEVANT is much harder than MINIMAL-OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics

5 0.39754704 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness-of-fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

6 0.39505923 37 jmlr-2007-Gini Support Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

7 0.39502236 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

8 0.39323753 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

9 0.39269787 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

10 0.39043814 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions (Special Topic on the Conference on Learning Theory 2005)

11 0.3896659 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

12 0.389292 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters (Special Topic on Model Selection)

13 0.38894939 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

14 0.38741654 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

15 0.38731384 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

16 0.38729852 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

17 0.38685179 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

18 0.38613403 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

19 0.3855148 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

20 0.38548455 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels