jmlr jmlr2007 jmlr2007-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram
Abstract: Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP. Keywords: classification, hierarchical Bayesian models, Dirichlet process
Reference: text
sentIndex sentText sentNum sentScore
1 In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. [sent-11, score-0.278]
2 First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. [sent-13, score-0.436]
3 In remote sensing, for example, one may have multiple sets of data, each collected at a particular geographical location; rather than designing individual classifiers for each of these sensing tasks, it is desirable to share data across tasks to enhance overall sensing performance. [sent-19, score-0.306]
4 MTL is distinct from standard STL in two principal respects: (i) the tasks are not identical, thus simply pooling them and treating them as a single task is not appropriate; and (ii) some of the classification © 2007 Ya Xue, Xuejun Liao, Lawrence Carin and Balaji Krishnapuram. [sent-22, score-0.376]
5 tasks may be highly correlated (dependent on each other), but the strategy of isolating each task and learning the corresponding classifier independently does not exploit the potential information one may acquire from other classification tasks. [sent-23, score-0.313]
6 The fact that some of the classification tasks are correlated (dependent) implies that what is learned from one task is transferable to another. [sent-24, score-0.351]
7 By learning the classifiers in parallel under a unified representation, the transferability of expertise between tasks is exploited to the benefit of all. [sent-25, score-0.28]
8 Typical approaches to information transfer among tasks include: sharing hidden nodes in neural networks (Baxter, 1995, 2000; Caruana, 1997); placing a common prior in hierarchical Bayesian models (Yu et al. [sent-30, score-0.421]
9 Instead of giving an overall evaluation, our objective is to learn multiple tasks jointly, either to improve the learning performance (i. [sent-38, score-0.28]
10 Hierarchical Bayesian models provide the flexibility to model both the individuality of tasks (experiments), and the correlations between tasks. [sent-48, score-0.263]
11 On the layer above, tasks are connected together via a common prior placed on those parameters. [sent-51, score-0.326]
12 The hierarchical model can achieve efficient information-sharing between tasks for the following reason. [sent-52, score-0.324]
13 Learning of the common prior is also a part of the training process, and data from all tasks contribute to learning the common prior, thus making it possible to transfer information between tasks (via sufficient statistics). [sent-53, score-0.598]
14 As a result, the estimation of a classifier (task) is affected by both its own training data and by data from the other tasks related through the common prior. [sent-55, score-0.271]
15 Information is transferred between tasks by learning those hyper-parameters using data from all tasks. [sent-57, score-0.24]
16 In this paper, we provide such a nonparametric hierarchical Bayesian model for jointly learning multiple logistic regression classifiers. [sent-59, score-0.176]
17 Further, the model parameters of individual tasks may have high complexity, and therefore no appropriate parametric form can be found easily. [sent-61, score-0.263]
18 Therefore, the DP is rich enough to model the parameters of individual tasks with arbitrarily high complexity, and flexible enough to fit them well without any assumption about the functional form of the prior distribution. [sent-66, score-0.326]
19 2 Identifying the Extent of Similarities between Tasks in MTL. A common assumption in the previous literature on MTL work is that all tasks are (equally) related to each other, but recently there have been a few investigations concerning the extent of relatedness between tasks. [sent-69, score-0.268]
20 An ideal MTL algorithm should be able to automatically identify similarities between tasks and only allow similar tasks to share data or information. [sent-70, score-0.48]
21 First, we define two classification tasks as similar when the two classification boundaries are close, that is, when the weight vectors of two classifiers are similar. [sent-77, score-0.24]
22 Note that this is different from some previous work such as Caruana (1997) where two tasks are defined to be similar if they use the same features to make their decision. [sent-78, score-0.24]
23 Secondly, the property that the distributions drawn from a Dirichlet process are discrete with probability one introduces questions, because it implies that we cluster identical tasks instead of similar tasks. [sent-79, score-0.24]
24 Therefore, the model parameters learned for any task are the result of the tradeoff between sharing with other tasks and retaining the individuality of the current task. [sent-84, score-0.407]
25 This gives similar tasks the chance to share the same model parameters. [sent-85, score-0.263]
26 We hypothesize that this result is attributable to the increase in model complexity, since the DPM prior introduces one more layer in the hierarchical model than the DP prior. [sent-90, score-0.193]
27 The SMTL problem is addressed in Section 3, which describes the proposed Bayesian hierarchical framework and presents a variational inference algorithm. [sent-120, score-0.22]
28 Experimental results are reported in Section 5, demonstrating the application of the proposed models to a landmine detection and an art image retrieval problem. [sent-123, score-0.281]
29 The mathematical representation of the DP model is $w_m \mid G \sim G$, $G \sim \mathrm{DP}(\alpha, G_0)$. [sent-128, score-0.224]
30 Integrating out $G$, the conditional distribution of $w_m$, given observations of the other $M-1$ values $w_{-m} = \{w_1, \cdots, w_{m-1}, w_{m+1}, \cdots, w_M\}$, is $$p(w_m \mid w_{-m}, \alpha, G_0) = \frac{\alpha}{M-1+\alpha} G_0 + \frac{1}{M-1+\alpha} \sum_{j=1, j \neq m}^{M} \delta_{w_j}, \tag{1}$$ where $\delta_{w_j}$ is the distribution concentrated at the single point $w_j$. [sent-133, score-0.201]
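The conditional in (1) can be read as a sampling rule: with probability α/(M−1+α) a fresh value is drawn from G0, and otherwise one of the other M−1 values is copied uniformly at random. The following is a minimal sketch of that rule, assuming a standard normal G0 purely for illustration; the function name `sample_w_given_others` and the helper `g0` are hypothetical and do not come from the paper.

```python
import numpy as np

def sample_w_given_others(w_others, alpha, draw_from_g0, rng):
    """Draw w_m from Eq. (1) given the other M-1 weight vectors.

    w_others: array of shape (M-1, d).
    draw_from_g0: callable returning a d-vector (stand-in for the base G0).
    """
    n_others = len(w_others)                        # M - 1
    p_new = alpha / (n_others + alpha)              # mass placed on G0
    if rng.random() < p_new:
        return draw_from_g0(rng)                    # new value from G0
    return w_others[rng.integers(n_others)].copy()  # reuse an existing w_j

rng = np.random.default_rng(0)
w_others = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
g0 = lambda r: r.standard_normal(2)                 # assumed G0 = N(0, I)
draws = [sample_w_given_others(w_others, 1.0, g0, rng) for _ in range(2000)]
# Fraction of draws that copy an existing value; should be near 3 / (3 + 1).
reused = np.mean([any(np.array_equal(d, w) for w in w_others) for d in draws])
```

Because existing values are reused with positive probability, repeated draws produce ties, which is the clustering behaviour the DP prior relies on.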
31 , $w_M$ and $n_{-m,k}$ denote the number of $w$'s equal to $w_k^*$, excluding $w_m$. [sent-140, score-0.201]
32 The model automatically identifies relatedness by task clustering with nonparametric methods. [sent-157, score-0.193]
33 A variational Bayesian (VB) approximation is used to learn the posterior distributions of the model parameters. [sent-158, score-0.244]
34 For task $m$ the conditional distribution of $y_{m,n}$ given $x_{m,n}$ is modeled via logistic regression as $$p(y_{m,n} \mid w_m, x_{m,n}) = \sigma(w_m^T x_{m,n})^{y_{m,n}} [1 - \sigma(w_m^T x_{m,n})]^{1-y_{m,n}}, \tag{4}$$ where $\sigma(x) = \frac{1}{1+\exp(-x)}$ and $w_m$ parameterizes the classifier for task $m$. [sent-167, score-0.374]
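As an illustration of (4), the log-likelihood of one task's labels can be evaluated as below. This is a sketch on made-up data, with labels assumed to lie in {0, 1}; the names `log_likelihood`, `X`, `y` and `w` are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    """Sum over n of log p(y_n | w, x_n) under Eq. (4); y_n in {0, 1}."""
    a = X @ w
    # log sigma(a) = -log(1 + exp(-a)); log(1 - sigma(a)) = -log(1 + exp(a))
    return float(np.sum(-y * np.logaddexp(0.0, -a) - (1 - y) * np.logaddexp(0.0, a)))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # toy inputs, one row per sample
y = np.array([1, 0, 1])
w = np.array([0.1, 0.8])
# Direct evaluation of the product form in Eq. (4), for comparison.
direct = np.sum(np.log(sigmoid(X @ w) ** y * (1 - sigmoid(X @ w)) ** (1 - y)))
```

The `logaddexp` form is used only for numerical stability; it evaluates the same quantity as the product form.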
35 The goal is to learn $\{w_m\}_{m=1}^{M}$ jointly, sharing information between tasks as appropriate, so that the resulting classifiers can accurately predict class labels for new test samples for tasks $m = 1, \cdots, M$. [sent-168, score-0.618]
36 $w_m = \prod_k (w_k^*)^{c_{m,k}}$, or in an equivalent form $w_m = \sum_k c_{m,k} w_k^*$, m = 1, . [sent-197, score-0.402]
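The indicator form above can be made concrete with a small example, assuming each task's indicator vector has exactly one nonzero entry. The arrays below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical example: K = 3 distinct classifiers w*_k in d = 2 dimensions.
W_star = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])
labels = np.array([0, 0, 2, 1, 2])   # cluster index of each of M = 5 tasks
C = np.eye(len(W_star))[labels]      # one-hot indicators c_{m,k}, shape (M, K)
W = C @ W_star                       # row m is w_m = sum_k c_{m,k} w*_k
n_k = C.sum(axis=0)                  # number of tasks assigned to each cluster
```

Tasks with the same indicator row share one weight vector exactly, so the rows of `W` contain at most K distinct classifiers.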
37 We refer to this as symmetric multi-task learning (SMTL) because all tasks are treated symmetrically; asymmetric multi-task learning (AMTL) is addressed in Section 4. [sent-206, score-0.269]
38 The VB method approximates the true posterior $p(Z \mid \{D_m\}_{m=1}^{M}, \Phi)$ by a variational distribution $q(Z)$. [sent-269, score-0.181]
39 To make the optimization problem tractable, it is assumed that the variational distribution q(Z) is sufficiently simple - fully factorized with each factorized component in the exponential family. [sent-271, score-0.197]
40 Jaakkola and Jordan (1997) present a method that uses an accurate variational transformation of $p(y \mid w, x)$ as follows $$p(y \mid w, x) \geq \sigma(\xi) \exp\left( \frac{(2y-1) w^T x - \xi}{2} + \rho(\xi) (x^T w w^T x - \xi^2) \right),$$ where $\rho(\xi) = \frac{\frac{1}{2} - \sigma(\xi)}{2\xi}$ and $\xi$ is a variational parameter. [sent-282, score-0.27]
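The bound can be checked numerically. The sketch below writes z for (2y − 1) wᵀx, so that p(y|w, x) = σ(z) and xᵀwwᵀx = z², and evaluates the gap on a grid; the grid and the value of ξ are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rho(xi):
    """rho(xi) = (1/2 - sigma(xi)) / (2 xi); negative for xi != 0."""
    return (0.5 - sigmoid(xi)) / (2.0 * xi)

def jj_lower_bound(z, xi):
    """Lower bound on sigma(z), with z = (2y - 1) w^T x and z^2 = x^T w w^T x."""
    return sigmoid(xi) * np.exp((z - xi) / 2.0 + rho(xi) * (z ** 2 - xi ** 2))

z = np.linspace(-6.0, 6.0, 121)
xi = 2.0
gap = sigmoid(z) - jj_lower_bound(z, xi)             # nonnegative if the bound holds
tight = jj_lower_bound(np.array([xi, -xi]), xi)      # bound evaluated at z = +/- xi
```

The bound touches σ(z) at z = ±ξ. Since its exponent is quadratic in w, combining it with a normal prior keeps the posterior in normal form.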
41 The posterior p(w|x, y, µ0 , Σ0 ) remains the normal form with this variational approximation. [sent-284, score-0.181]
42 In the SMTL-1 model, the factorized variational distribution is specified as $$q(Z) = \Big[\prod_{m=1}^{M} q_{c_m}(c_m)\Big] \cdot \Big[\prod_{k=1}^{K} q_{v_k}(v_k)\Big] \cdot q_\alpha(\alpha) \cdot \Big[\prod_{k=1}^{K} q_{w_k^*}(w_k^*)\Big],$$ where • $q_{c_m}(c_m)$ is a multinomial distribution, $c_m \sim M_K(1; \phi_{m,1},$ . [sent-300, score-0.466]
43 Similarly, in the SMTL-2 model, the factorized variational distribution is specified as $$q(Z) = \Big[\prod_{m=1}^{M} q_{c_m}(c_m)\Big] \cdot \Big[\prod_{k=1}^{K} q_{v_k}(v_k)\Big] \cdot q_\alpha(\alpha) \cdot \Big[\prod_{k=1}^{K} q_{w_k^*}(w_k^*)\Big] \cdot q_{\mu,\{\lambda_j\}_{j=1}^{d}}(\mu, \{\lambda_j\}_{j=1}^{d}).$$ [sent-317, score-0.308]
44 Each factor in the factorized variational distribution and ξ, the variational parameter of the sigmoid function, are re-estimated iteratively conditioning on the current estimate of all the others, assuring the lower bound of the log likelihood increases monotonically until it converges. [sent-321, score-0.301]
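45 Integrating (8) over the variational distributions $q_{w_k^*}(w_k^*)$ and $q_{c_m}(c_m)$ yields $$p(y_{m,\star} = 1 \mid \{\{\phi_{m,k}\}_{k=1}^{K}\}_{m=1}^{M}, \{\theta_k\}_{k=1}^{K}, \{\Gamma_k\}_{k=1}^{K}, x_{m,\star}) = \sum_{k=1}^{K} \phi_{m,k} \int \sigma(w_k^{*T} x_{m,\star}) N_d(\theta_k, \Gamma_k) \, dw_k^*. \tag{9}$$ [sent-325, score-0.216]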
46 method, that is, we randomly draw $10^4$ samples of $w_k^*$ from the normal distribution $N_d(\theta_k, \Gamma_k)$, substitute the samples into the logistic function $\sigma(w_k^{*T} x_{m,\star})$ and take the average of the function values. [sent-348, score-0.202]
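The sampling procedure just described can be sketched as follows. The variational parameters in the example are invented for illustration, and the function name `predict_prob`, the default sample count and the seed are assumptions rather than details taken from the paper.

```python
import numpy as np

def predict_prob(x, phi_m, thetas, gammas, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the class-1 probability for one test point x.

    phi_m: (K,) cluster weights for task m; thetas: (K, d) means;
    gammas: (K, d, d) covariances of the normal factors over w*_k.
    """
    rng = np.random.default_rng(seed)
    prob = 0.0
    for phi, theta, gamma in zip(phi_m, thetas, gammas):
        w = rng.multivariate_normal(theta, gamma, size=n_samples)  # draws of w*_k
        prob += phi * np.mean(1.0 / (1.0 + np.exp(-(w @ x))))      # average sigmoid
    return prob

x = np.array([1.0, -0.5])
phi_m = np.array([0.7, 0.3])
thetas = np.zeros((2, 2))                        # zero means: answer is 0.5 by symmetry
gammas = np.stack([np.eye(2), 2.0 * np.eye(2)])
p = predict_prob(x, phi_m, thetas, gammas)
```

With zero-mean components the estimate should sit close to 0.5, which gives a quick sanity check on an implementation.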
47 One option is to re-run the SMTL algorithm on all M + 1 tasks, but this requires storage of the data from all previous tasks and may be computationally prohibitive for real-time applications. [sent-356, score-0.24]
48 In fact, re-running SMTL on all tasks is in many situations not necessary, as the previous M tasks may simply represent the history and re-estimating them is uninteresting. [sent-357, score-0.48]
49 In these cases, we need only concentrate on learning the new task, treating the previous M tasks as the background tasks, from which relevant information is transferred to the new task. [sent-358, score-0.24]
50 Since task M + 1 is treated differently from tasks 1 through M, this is referred to as asymmetric MTL, or AMTL. [sent-361, score-0.342]
51 1 Prior Learned from Previous Tasks. From (2), we know that the conditional distribution of the classifier $w_{M+1}$ given $\alpha$, $G_0$ and the other $M$ classifiers is $$p(w_{M+1} \mid w_1, \cdots, w_M, \alpha, G_0) = \frac{\alpha}{M+\alpha} G_0 + \frac{1}{M+\alpha} \sum_{k=1}^{K} n_k \delta_{w_k^*}, \tag{11}$$ where $n_k = \sum_{m=1}^{M} c_{m,k}$ is the number of $w_m$ which are equal to $w_k^*$. [sent-363, score-0.201]
52 Assume the SMTL-1 model has been applied to $M$ previous tasks and thus the variational distributions $q_{c_m}(c_m)$, $q_\alpha(\alpha)$ and $q_{w_k^*}(w_k^*)$ have been optimized. [sent-364, score-0.479]
53 Therefore, by taking it as a prior for wM+1 , information learned from previous tasks can be transferred to the new task. [sent-369, score-0.341]
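The mixture weights in (11) follow directly from the cluster counts. The sketch below assumes hard assignments for the previous tasks, which is a simplification: the paper works with variational distributions over the indicators. The function name and the example assignments are hypothetical.

```python
import numpy as np

def new_task_prior_weights(C, alpha):
    """Mixture weights in Eq. (11): (weight on G0, weights on each w*_k)."""
    M = C.shape[0]
    n_k = C.sum(axis=0)                  # n_k = sum_m c_{m,k}
    return alpha / (M + alpha), n_k / (M + alpha)

C = np.eye(3)[[0, 0, 1, 2, 2, 2]]        # hypothetical assignments of M = 6 tasks
w_g0, w_clusters = new_task_prior_weights(C, alpha=2.0)
```

Clusters that contain more previous tasks receive more prior mass, while the weight on G0 leaves room for the new task to differ from all of them.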
54 Note in Section 4 we limit the discussion on learning from previous tasks to the SMTL-1 model, for which the parameters of G0 are given; the approach developed in this section can be extended to the SMTL-2 model by substituting the expectations on the parameters of G0 into (12). [sent-371, score-0.263]
55 Considering that the computational burden is small in the AMTL case (we only deal with data from task M + 1), we appeal to MCMC methods and develop a Metropolis-Hastings algorithm to draw samples from the posterior (Robert and Casella, 2004; Neal, 1998). [sent-374, score-0.229]
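For orientation, a generic random-walk Metropolis-Hastings sampler for a logistic-regression posterior is sketched below. It is not the construction used in the paper: the isotropic Gaussian-mixture prior, the proposal step size and all names are assumptions made so that the example is self-contained.

```python
import numpy as np

def log_post(w, X, y, weights, means, var):
    """Unnormalised log posterior: logistic likelihood plus an isotropic
    Gaussian-mixture prior (assumed form; all components share variance var)."""
    a = X @ w
    loglik = np.sum(-y * np.logaddexp(0.0, -a) - (1 - y) * np.logaddexp(0.0, a))
    d2 = np.sum((means - w) ** 2, axis=1)              # squared distance to each mean
    logprior = np.logaddexp.reduce(np.log(weights) - 0.5 * d2 / var)
    return loglik + logprior

def metropolis_hastings(X, y, weights, means, var, n_iter=5000, step=0.3, seed=0):
    rng = np.random.default_rng(seed)
    w = means[np.argmax(weights)].copy()               # start at the heaviest component
    lp = log_post(w, X, y, weights, means, var)
    samples = np.empty((n_iter, len(w)))
    for t in range(n_iter):
        prop = w + step * rng.standard_normal(len(w))  # symmetric random-walk proposal
        lp_prop = log_post(prop, X, y, weights, means, var)
        if np.log(rng.random()) < lp_prop - lp:        # accept with prob min(1, ratio)
            w, lp = prop, lp_prop
        samples[t] = w                                 # a rejection repeats the current w
    return samples

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 2))                       # synthetic data for the new task
w_true = np.array([2.0, -1.0])
y = (rng.random(40) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
weights = np.array([0.5, 0.5])
means = np.array([[2.0, -1.0], [-2.0, 1.0]])           # made-up prior components
samples = metropolis_hastings(X, y, weights, means, var=0.25)
post_mean = samples[1000:].mean(axis=0)                # discard the first 1000 as burn-in
```

Only the data of the new task enter the likelihood, so the cost per iteration does not grow with the number of previous tasks.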
56 They treat $\hat{w}_m$ as the maximum a posteriori (MAP) estimate of task $m$. [sent-404, score-0.274]
57 This approximation is inappropriate if the training samples in each previous task are not sufficient, so that each task cannot learn an accurate classifier by only using the corresponding user’s profile. [sent-408, score-0.282]
58 Experiments and Results Analysis. An empirical study of the proposed methods is conducted on two real applications: (i) a landmine detection problem, and (ii) an art image retrieval problem. [sent-423, score-0.281]
59 Data from 29 tasks are collected from various landmine fields. [sent-426, score-0.402]
60 The landmine detection problem is modeled as a binary classification problem. [sent-430, score-0.191]
61 We treat classification of each data set as a learning task and evaluate the proposed SMTL and AMTL methods on this landmine detection problem. [sent-432, score-0.264]
62 Thus we expect that there are approximately two clusters of tasks corresponding to two classes of ground surface conditions. [sent-434, score-0.275]
63 We first evaluate the SMTL models using data sets 1-10 and 16-24; next, for the AMTL setting, these data are treated as previous tasks and data sets 11-15 and 25-29 are treated as new observed tasks. [sent-435, score-0.24]
64 1, and (iv) simply pooling the data in all tasks and then learning a single classifier with the variational approach as for (iii). [sent-441, score-0.438]
65 Figure 4: Average AUC on 19 tasks in the landmine detection problem (x-axis: number of training samples for each task). [sent-472, score-0.431]
66 settings: (a)(c)(e) learned by using the SMTL-1 model with 20, 100 and 300 training samples per task; (b)(d)(f) learned by using the SMTL-2 model with 20, 100 and 300 training samples per task. [sent-473, score-0.314]
67 The training samples are so few that although both SMTL methods find that all tasks are similar, they cannot identify the extent of similarity between tasks (see (a) and (b) in Fig. [sent-481, score-0.576]
68 When there are a few training samples available, for example, 100 per task, the simple pooling method does not improve further as more training samples are pooled together, because it ignores the statistical differences between tasks. [sent-486, score-0.255]
69 The number of training samples is large enough for each task to learn well by itself, so the curve for the STL method approaches the curves for the SMTL methods and exceeds the curve for the simple pooling. [sent-493, score-0.209]
70 in (e) and (f) that the 19 tasks are roughly grouped into two clusters, which agree with the ground truth discussed above. [sent-496, score-0.24]
71 1 are treated as previous tasks and data sets 11-15 and 25-29 as the new observed tasks. [sent-501, score-0.24]
72 AMTL-1: First we apply the SMTL-1 model to those previous tasks and learn the variational parameters Ω, and then run the AMTL-1 algorithm on each new task. [sent-503, score-0.438]
73 The parameters τ1 and τ2 in the AMTL-1 model can be estimated from previous tasks, while α0 in the AMTL-2 model is a predefined parameter, which represents a prior belief about the relatedness between the new task and previous tasks. [sent-514, score-0.21]
74 Two factors affecting learning performance are considered: (i) number of training samples per previous task, used for learning the prior of wM+1 , and (ii) number of training samples per new task. [sent-519, score-0.255]
75 The AMTL-1 approach efficiently transfers information from previous tasks to the new task. [sent-527, score-0.24]
76 The SMTL method can learn an informative prior for the new task, with only 40 training samples for each previous task (see (a) in Fig. [sent-528, score-0.272]
77 The performance changes very little as the number of training samples in the new task increases, because we use a linear classifier and thus the matching classifier can be found with even a single pair of “1” and “0” labeled samples. [sent-531, score-0.194]
78 Panel captions: … = 0; 160 training samples for each previous task. (d) $\alpha_0 = \tau_1/\tau_2$, where $\tau_1$ and $\tau_2$ are estimated from previous tasks; 160 training samples for each previous task. [sent-559, score-0.192]
79 Panel captions: … = 0; 40 training samples for each previous task. (b) $\alpha_0 = \tau_1/\tau_2$, where $\tau_1$ and $\tau_2$ are estimated from previous tasks; 40 training samples for each previous task. [sent-576, score-0.192]
80 Plot legend: AMTL-1, AMTL-2, STL; x-axis: number of training samples for new task. Panel captions: … = 0; all samples in previous tasks used for training. (f) $\alpha_0 = \tau_1/\tau_2$, where $\tau_1$ and $\tau_2$ are estimated from previous tasks; all samples in previous tasks used for training. [sent-589, score-0.61]
81 Figure 6: Average AUC on 10 new tasks in the landmine detection problem. [sent-590, score-0.431]
82 Next, we recover G0 by using the parameters τ1 and τ2 estimated from previous tasks in the AMTL-1 model and setting α0 = τ1 in the AMTL-2 model (see (b)(d)(f)). [sent-593, score-0.286]
83 In contrast, the AMTL-1 approach benefits from previous tasks so that incorporation of G0 has a relatively small effect on its performance. [sent-597, score-0.24]
84 The painting image data differ from the landmine data in two respects. [sent-608, score-0.191]
85 Second, because user preferences are very diverse, the clustering structure of tasks is expected to be more complex than that of the landmine detection tasks. [sent-610, score-0.499]
86 Then we take these as previous tasks and those 50 users who rate between 50 and 100 images are treated as new tasks in the AMTL experiment. [sent-612, score-0.48]
87 The SMTL-2 model is also excluded, because the feature dimension (275) is high relative to the number of training samples for each task, so that it is hard to get an accurate estimate of the variational distribution of $\lambda_j$ in (7), which is the precision on each feature dimension. [sent-619, score-0.254]
88 Similar to the landmine experiments, the performance is measured by the average AUC on all tasks and evaluated at 10, 20, 30, 40 or 50 randomly selected training samples for each task. [sent-621, score-0.498]
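The evaluation metric can be sketched as follows: AUC is computed per task as the fraction of positive-negative pairs ranked correctly, and then averaged over tasks. The scores and labels are toy values, and the function names are illustrative.

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]       # all positive-negative score differences
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def average_auc(task_scores, task_labels):
    """Unweighted mean of the per-task AUC values."""
    return float(np.mean([auc(s, l) for s, l in zip(task_scores, task_labels)]))

scores = [np.array([0.9, 0.8, 0.3, 0.1]), np.array([0.6, 0.4, 0.7, 0.2])]
labels = [np.array([1, 1, 0, 0]), np.array([1, 0, 0, 1])]
avg = average_auc(scores, labels)            # task AUCs are 1.0 and 0.25
```

Averaging per-task AUC gives every task equal weight regardless of how many test samples it has.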
89 Figure 7: Average AUC on 68 tasks in the art image retrieval problem (x-axis: number of training samples for each task). [sent-639, score-0.33]
90 Figure 8: Number of clusters among 68 tasks in the art image retrieval problem (x-axis: number of training samples for each task). [sent-640, score-0.365]
91 As mentioned above, the tasks in the art image database are more diverse than those in the landmine data sets. [sent-641, score-0.467]
92 To get a clear view of this, we observe the number of clusters among 68 tasks instead of the between-task similarity matrix. [sent-642, score-0.275]
93 When the training size is small, the algorithm finds only weak similarity between tasks and most of the tasks learn by themselves; therefore the SMTL-1 method works similarly to single-task learning. [sent-647, score-0.551]
94 Figure 9: Average AUC on 50 new tasks in the art image retrieval problem (x-axis: number of training samples for each new task). [sent-658, score-0.33]
95 MTL classification performance has been presented on two data sets: (i) a landmine sensing problem based on measured data, and (ii) an art-preference database. [sent-672, score-0.195]
96 Concerning (i), the MTL formulation yielded a clear indication of how the data from the multiple tasks clustered into related physical phenomena. [sent-673, score-0.27]
97 In such a heterogeneous MTL setting, involving tasks characterized by labeled and unlabeled data, it may be possible to label the new data under test (from the new task) without requiring any associated labeled data. [sent-685, score-0.315]
98 In this context one may also consider an active-learning setting, in which labels are acquired selectively from the new task of interest, such that after active learning all tasks have labeled data and the MTL formulations presented here may be applied directly. [sent-686, score-0.361]
99 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-732, score-0.265]
100 Learning multiple related tasks using latent independent component analysis. [sent-952, score-0.24]
wordName wordTfidf (topN-words)
[('smtl', 0.435), ('amtl', 0.354), ('tasks', 0.24), ('dp', 0.224), ('mtl', 0.223), ('stl', 0.213), ('wm', 0.201), ('landmine', 0.162), ('riors', 0.152), ('arin', 0.142), ('iao', 0.142), ('rishnapuram', 0.142), ('dirichlet', 0.135), ('variational', 0.135), ('irichlet', 0.129), ('landminefield', 0.121), ('auc', 0.11), ('ue', 0.107), ('rocess', 0.107), ('ulti', 0.107), ('dm', 0.099), ('xm', 0.098), ('vk', 0.094), ('nm', 0.085), ('nsam', 0.081), ('qcm', 0.081), ('cm', 0.077), ('lassification', 0.076), ('task', 0.073), ('ishwaran', 0.071), ('qw', 0.069), ('ym', 0.065), ('samples', 0.065), ('prior', 0.063), ('pooling', 0.063), ('qvk', 0.061), ('hierarchical', 0.061), ('vb', 0.056), ('yu', 0.051), ('mukhopadhyay', 0.051), ('bayesian', 0.049), ('posterior', 0.046), ('mcmc', 0.046), ('draw', 0.045), ('earning', 0.045), ('ratings', 0.043), ('nonparametric', 0.042), ('user', 0.041), ('transferability', 0.04), ('learn', 0.04), ('learned', 0.038), ('art', 0.036), ('clusters', 0.035), ('hinton', 0.035), ('wt', 0.034), ('sharing', 0.033), ('ga', 0.033), ('sensing', 0.033), ('base', 0.032), ('training', 0.031), ('gamma', 0.031), ('gelfand', 0.031), ('duke', 0.031), ('beta', 0.031), ('factorized', 0.031), ('balaji', 0.03), ('formulation', 0.03), ('ii', 0.029), ('asymmetric', 0.029), ('detection', 0.029), ('image', 0.029), ('tresp', 0.028), ('truncation', 0.028), ('eq', 0.028), ('relatedness', 0.028), ('logistic', 0.027), ('clustering', 0.027), ('em', 0.026), ('meta', 0.026), ('beal', 0.026), ('dpm', 0.026), ('index', 0.025), ('unlabeled', 0.025), ('retrieval', 0.025), ('labeled', 0.025), ('gibbs', 0.025), ('classi', 0.025), ('inference', 0.024), ('jaakkola', 0.024), ('jordan', 0.024), ('transfer', 0.024), ('jointly', 0.023), ('integral', 0.023), ('acquired', 0.023), ('mallick', 0.023), ('layer', 0.023), ('model', 0.023), ('diagram', 0.022), ('ghahramani', 0.022), ('blei', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
2 0.081031635 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
3 0.067505121 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.
4 0.058250204 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron
5 0.056719687 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
6 0.048297901 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
7 0.045191593 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
8 0.0420742 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.041675672 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
10 0.040574227 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
11 0.039631203 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
12 0.034977566 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
13 0.034752786 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
14 0.034157515 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
15 0.032747667 15 jmlr-2007-Bilinear Discriminant Component Analysis
16 0.032734934 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
17 0.031119002 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
18 0.030999633 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
19 0.030876983 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
20 0.029991167 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
topicId topicWeight
[(0, 0.188), (1, 0.068), (2, -0.005), (3, 0.044), (4, -0.024), (5, -0.031), (6, 0.002), (7, -0.128), (8, -0.049), (9, -0.087), (10, -0.083), (11, -0.2), (12, 0.027), (13, 0.032), (14, -0.027), (15, -0.018), (16, -0.178), (17, 0.083), (18, -0.038), (19, 0.006), (20, 0.115), (21, 0.004), (22, -0.067), (23, -0.074), (24, -0.108), (25, 0.088), (26, 0.282), (27, 0.041), (28, 0.237), (29, -0.227), (30, -0.086), (31, 0.117), (32, 0.078), (33, -0.048), (34, -0.046), (35, 0.082), (36, 0.012), (37, -0.048), (38, 0.079), (39, 0.198), (40, 0.178), (41, -0.028), (42, -0.072), (43, -0.296), (44, -0.016), (45, 0.122), (46, -0.342), (47, 0.014), (48, -0.242), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.95222807 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
Author: Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram
Abstract: Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP. Keywords: classification, hierarchical Bayesian models, Dirichlet process
2 0.42178002 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
3 0.29649135 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and Lévy processes, with the Dirichlet processes being a special case of the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, Lévy processes, Dirichlet processes, integral operator, Gaussian processes. © 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert.
4 0.26481462 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron
5 0.25051057 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
6 0.22784635 70 jmlr-2007-Ranking the Best Instances
7 0.17537069 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
8 0.1708466 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics
9 0.16749856 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
10 0.16138476 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
11 0.16117069 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
12 0.15378989 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
13 0.15358052 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models
14 0.14926228 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
15 0.14806899 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
16 0.14064877 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
17 0.13804986 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
18 0.1370409 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
19 0.12522802 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
20 0.12367135 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
topicId topicWeight
[(4, 0.012), (8, 0.047), (10, 0.012), (12, 0.019), (15, 0.02), (28, 0.031), (40, 0.026), (45, 0.016), (48, 0.027), (60, 0.019), (77, 0.012), (80, 0.01), (85, 0.584), (98, 0.085)]
simIndex simValue paperId paperTitle
1 0.95869488 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
Author: Aggelos Chariatis
Abstract: The experimental investigation on the efficient learning of highly non-linear problems by online training, using ordinary feed forward neural networks and stochastic gradient descent on the errors computed by back-propagation, gives evidence that the most crucial factors for efficient training are the hidden units’ differentiation, the attenuation of the hidden units’ interference and the selective attention on the parts of the problems where the approximation error remains high. In this report, we present global and local selective attention techniques and a new hybrid activation function that enables the hidden units to acquire individual receptive fields which may be global or local depending on the problem’s local complexities. The presented techniques enable very efficient training on complex classification problems with embedded subproblems. Keywords: neural networks, online training, selective attention, activation functions, receptive fields
1. Framework
Online supervised learning is in many cases the only practical way of learning. This includes situations where the problem size is very big, or situations where we have a non-recurring stream of input vectors that are unavailable before training begins. We examine online supervised learning using a particular class of adaptive models, the very popular feed forward neural networks, trained with stochastic gradient descent on the errors computed by back-propagation. In order to easily visualize the online training dynamics of highly complex non-linear problems, we are experimenting on 2:η:1 networks where the input is a point in a two dimensional image and the output is the value of the pixel at the corresponding input position. This framework allows the creation of very complex non-linear problems, just by hand-drawing the problem on a bitmap and presenting it to the network. Most problems’ images in this report are 256 × 256 pixels in size, each producing 65536 different samples in total.
Classification and regression problems can be modeled as black-and-white and gray-scale images, respectively. In this report we only examine training on classification problems. However, since mixed problems are possible, we are only interested in techniques that can be applied to both classification and regression. The target of this investigation is online training where the input is not known in advance, so the input samples are treated as random and non-recurring vectors from the input space and are discarded after being used. We select and train on random samples until the average classification or RMS error is acceptable. Since both the number of training exemplars and the complexity of the underlying function are assumed unknown, we require our training mechanism to have “initial state invariance” as a fundamental property. Thus we deliberately exclude from our arsenal any training techniques that require a schedule to be decided ahead of training. Ideally, we would like the training mechanism to be totally invariant to the initial training parameters and network state.
© 2007 Aggelos Chariatis.
This report is organized as follows: Sections 2 and 3 describe techniques for global and local selective attention. Section 4 is devoted to acceleration of training. In Section 5 we present experimental results and in Section 6 we discuss the presented techniques and give some directions for future research. Finally, Appendix A contains a description of the notations that have been used. Figure 1 shows some examples of problems that can be learned very efficiently using the techniques presented in the following sections.
Figure 1 (panels a to e): Examples of complex non-linear problems that can be learned efficiently.
2. Global Selective Attention - Dynamic Training Set Evolution
Consider the two problems depicted in Figure 2.
Clearly, both problems are of approximately equal complexity, since they encapsulate the same image at a different scale and position within the input space. We would like to have a mechanism that will make the network capable of learning both problems at about the same speed.
Figure 2 (panels a and b): Approximately equal complexity problems.
Intuitively, the samples on the boundaries, which are the samples on positions with the highest contrast, are those that determine the complexity of each problem. During training, these samples have the property that they produce the highest errors. We thus need a method that will focus attention on samples with high error relative to the rest. Previous work on such global selective attention has been published by many authors (Munro, 1987; Yu and Simmons, 1990; Bakker, 1992, 1993; Schapire, 1999; Zhong and Ghosh, 2000). Of particular interest are the various boosting algorithms, such as AdaBoost (Schapire, 1999), which work by placing more emphasis on training samples that the system is currently getting wrong. Unfortunately, the most successful of these algorithms require a predefined set of samples on which training will be performed, something that is excluded from our scenario. Nevertheless, in a less constrained scenario, boosting can be applied on top of our techniques as a meta-learning algorithm, using our techniques as the base-learning algorithm. A simple method that can provide such an adaptive selective attention mechanism, by keeping an exponential trace of the average error in the training set, is described in Algorithm 1.
$\bar{e} \leftarrow 0$
Repeat
    Pick a random sample
    Evaluate the error $e$ for the sample
    If $e > 0.5\,\bar{e}$
        $\bar{e} \leftarrow e\,\alpha + \bar{e}\,(1 - \alpha)$
        Train
    End
Until a stopping criterion is satisfied
In this report’s context, error evaluation and train are defined as:
Error Evaluation: Computation of the output values by forward propagating the activations from the input to the output layer for a single sample, plus computation of the output errors. The sample’s error $e$ is set to the quadratic mean (RMS) of the output units’ errors.
Train: Back-propagation of the output errors to the hidden layer and immediate adjustment of the weights.
Algorithm 1: The dynamic training set evolution algorithm.
The algorithm evaluates the errors of all samples, but trains only on samples with error greater than half the average error of the current training set. Training is initially performed for all samples, but gradually it is concentrated on the samples at the problem’s boundaries. When the error for these samples is reduced, other previously excluded samples enter the training set. Thus, samples enter and leave the training set automatically, with a tendency to train on samples with high error. The magnitude of the constant $\alpha$ that determines the time scale of the exponential trace is problem specific, but in all experiments in this report it was kept fixed at $10^{-4}$. The fraction of 0.5 was determined experimentally to give a good balance between sample selectivity and training set size. If it is close to 0, then we train on almost all samples. If it is close to 1, then we risk starving the training set of samples. Of course, one can choose to vary it dynamically in order to keep a fixed percentage of samples in the training set, or to prevent the training percentage from falling below a pre-specified limit. Figure 3 shows the training set evolution for the two-spirals problem (in Figure 1a) in various training stages. The network topology was 2:64:1.
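The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the callables `sample_stream`, `evaluate_error` and `train` are hypothetical placeholders for the network code, and the fixed evaluation budget stands in for the unspecified stopping criterion.

```python
def evolve_training_set(sample_stream, evaluate_error, train,
                        alpha=1e-4, fraction=0.5, max_evaluations=100_000):
    """Sketch of dynamic training set evolution (Algorithm 1).

    sample_stream, evaluate_error and train are hypothetical callables;
    max_evaluations is an assumed stand-in for the stopping criterion.
    """
    avg_error = 0.0   # exponential trace of the error over trained samples
    trains = 0
    for _ in range(max_evaluations):
        sample = sample_stream()          # pick a random sample
        e = evaluate_error(sample)        # forward pass, RMS output error
        if e > fraction * avg_error:      # sample is in the training set
            avg_error = alpha * e + (1.0 - alpha) * avg_error
            train(sample)                 # back-propagate and adjust weights
            trains += 1
    return avg_error, trains
```

Note that every sample is evaluated, but only samples whose error exceeds the threshold update the trace and trigger training, which is why the number of error evaluations grows faster than the number of trains.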
You can see the training set forming gradually and tracing the problem boundaries where the error is the highest. One could argue that such a process may be very sensitive to outliers. Experiments have shown that this does not happen. The algorithm does not try to recognize the outliers, but at least adjusts naturally by not allowing the training set size to shrink. So, in the presence of heavy noise, the algorithm becomes ineffective, but does not introduce any additional harm. Figure 4 shows the two-spirals problem distorted by dynamic noise and the corresponding training set after 90000 trains with 64 hidden units. You can see that the algorithm tolerates noise by not allowing the training set size to shrink. It is also interesting that at noise levels as high as 30% the algorithm can still exclude large areas of the input space from training.
Figure 3: Training set evolution for the two-spirals problem. Under each image you can see the stage of training in trains, error-evaluations and the percentage of samples for which training is performed at the corresponding stage. Panel labels: 10000-11840-93%, 20000-25755-76%, 40000-66656-41%, 60000-142789-22%, 90000-296659-18%.
Figure 4: Top row shows the model with a visualization of the applied dynamic noise. Bottom row shows the corresponding training sets after 90000 trains. Under each pair of images you can see the percentage of noise distortion applied to the original input and the percentage of samples for which training is performed. Panel labels: 10%-42%, 20%-62%, 30%-75%, 50%-93%, 70%-99%.
3. Local Selective Attention - Receptive Fields
Having established a global method to focus attention on the important parts of a problem, we now come to address the main issue, which is the network training. Let us first discuss the roles of the hidden and output layers in a feed forward neural network with a single hidden layer and without shortcut input-to-output connections.
The hidden layer is responsible for transforming a non-linear input-to-output mapping into a non-linear input-to-hidden layer mapping that can be mapped linearly to the output. The output layer is responsible for learning a linear hidden-to-output mapping (which is an easy job), but most importantly, it must provide to the hidden layer error gradient information that will be used for the error credit assignment problem. In this respect, it becomes apparent that all hidden units should receive the most accurate error information possible. That is why we must train all hidden-to-output connections and back-propagate the error through all these connections. This is not the case for the hidden layer. Consider, for a classification problem, how the hidden units with sigmoidal activations partition the input space into sub-areas. By adjusting the input-to-hidden weights and biases, each hidden unit develops a hyperplane that bi-partitions the input space in the most useful sense. We would like to limit the number of hyperplanes in order to reduce the system’s available degrees of freedom and obtain better generalization capabilities. At the same time, we would like to use them thoroughly in order to optimize the input-output approximation. This can be done by arranging the hyperplanes to touch the problem’s boundaries at regular intervals dictated by the boundary curvature, as shown in Figure 5a. Figure 5b shows a suboptimal placement of the hyperplanes which causes a waste of resources. Each hidden unit must be differentiated from the others and ideally not interfere with the subproblems that the other units are trying to solve. Suppose that two hidden units are governed by the same, or nearly the same, parameters. How can we differentiate them? There are many possibilities.
Figure 5 (panels a and b): Optimal vs. suboptimal hyperplanes.
One would be to simply discard one unit and set the output weight of the other to the sum of the two original output weights. That would leave the function unchanged. However, identifying these similar units during training is not easy computationally. In addition, we would have to figure out a method that would compute the best initial placement for the hyperplane of the new unit that would substitute the one that was thrown away. Another possibility would be to add noise to the weight updates, gradually reduced with a simulated annealing schedule which should be decided before training begins. Unfortunately, the loss of initial state invariance would complicate training for unknown complex non-linear problems. In our view, it is much better to embed constraints into the system, so that it will not be possible for two hidden units to develop the same hyperplane. Two computationally efficient techniques to embed such constraints are described in Sections 3.1 and 3.2. Many other authors have also examined methods for local selective attention. For the related discussions see Huang and Huang (1990), Ahmad and Omohundro (1990), Baluja and Pomerleau (1995), Flake (1998), Duch et al. (1998), and Phillips and Noelle (2004).
3.1 Fixed Cascaded Inhibitory Connections
A problem with the hidden units of conventional feed forward networks is that they are all fed with the same inputs and back-propagated errors and that they operate without knowing of each other’s existence. So, nothing prevents them from behaving identically. This lack of communication between hidden units has been addressed by researchers through hidden unit lateral connections. Agyepong and Kothari (1997) use unidirectional lateral interconnections between adjacent hidden layer units, claiming that these connections facilitate the controlled assignment of role and specialization of the hidden units.
Kothari and Ensley (1998) use Gaussian lateral connections which enable the hidden decision boundaries to be global in nature but also be able to represent local structure. Numerous neural network algorithms employ bidirectional lateral inhibitory connections in order to generate competition between the hidden units. In an interesting variation described by Spratling and Johnson (2004), competition is provided by each hidden unit blocking its preferred inputs from activating other units. We use a single hidden layer where the hidden units are considered sequenced. Each hidden unit is connected to all succeeding hidden units with a fixed connection with weight set to minus one. The hidden units get differentiated, because they receive different inputs, they produce different activations and they get back different error information. Another benefit is that they can generate higher order feature detectors, that is, the resulting hidden hyperplanes are no longer strictly linear, but they may also be curved. Considering the fixed value, -1 is used just to avoid a multiplication. Values from -0.5 to -2 give good results as well. As it is shown in Section 5.1.1, the fixed cascaded inhibitory connections are very effective at reducing a problem’s asymptotic residual error. This should be attributed to both of their abilities, to generate higher order feature detectors and to hasten the hidden units’ symmetry breaking. These connections can be implemented very efficiently with just one subtraction per hidden unit for both hidden activation and hidden error computation. In addition, the disturbance to the parallelism of the backpropagation algorithm is minimal. Most operations on the hidden units can still be done in parallel and only the final computations must be performed sequentially. We include the algorithms for the hidden activation and error computations as examples of sequential implementations. 
These changes can be very easily incorporated into conventional neural network code.
Hidden Activations
    $\delta \leftarrow 0$
    For $j \leftarrow 1 \ldots \eta$
        $n_j \leftarrow \delta + x \cdot w_j$
        $h_j \leftarrow f(n_j)$
        $\delta \leftarrow \delta - h_j$
    End
Hidden Error Signals
    $\delta \leftarrow 0$
    For $j \leftarrow \eta \ldots 1$
        $e_j \leftarrow \delta + r \cdot u_j$
        $g_j \leftarrow e_j \, f'(n_j)$
        $\delta \leftarrow \delta - g_j$
    End
Algorithm 2: Hidden unit activation and error computation with Fixed Cascaded -1 Connections.
3.2 Selective Training of the Hidden Units
The hidden units’ differentiation can be further magnified if each unit is not trained on all samples, but only on the samples for which it receives a high error. We train all output units, but only the hidden units for which the error signal is higher than the RMS of the error signals of all hidden units. Typically about 10% of the hidden units are trained on each sample during early training and the percentage falls to about 2% when the network is close to the solution. This is intuitively justified by the observation that at the beginning of training the hidden units are largely undifferentiated and receive high error signals from the whole input space. At the final stage of training, the hidden hyperplanes’ linear soft decision boundaries are combined by the output layer to define arbitrarily shaped decision boundaries. For µ input dimensions, from 1 up to µ units can define an open sub-region and µ + 1 units are enough to define a closed convex region. With such high level constructs, each sample may be discriminated from the rest with very few hidden units. These are the units that receive the highest error signal for the sample. Experiments on various problems have shown that training on a fraction of the hidden units is always better (with respect to the number of trains to convergence) than training all or just one hidden unit. It seems that training only one hidden unit on each sample is not sufficient for some problems (Thornton, 1992). Measurements for one of these experiments are reported in Section 5.1.1.
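The two loops of Algorithm 2 and the selection rule of Section 3.2 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, tanh is assumed as the activation f, biases are omitted for brevity, r is taken to be the vector of output error signals and U[j] the output weights of hidden unit j, and the selection compares the absolute error signal against the RMS.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cascaded_forward(x, W, f=math.tanh):
    """Hidden activations with fixed -1 links to all succeeding units.

    x: input vector; W: one weight vector per hidden unit (biases omitted,
    an assumption of this sketch).
    """
    n, h = [], []
    delta = 0.0                       # running sum of -h_k for k < j
    for w_j in W:
        n_j = delta + dot(x, w_j)
        h_j = f(n_j)
        n.append(n_j)
        h.append(h_j)
        delta -= h_j
    return n, h


def cascaded_backward(n, r, U, f_prime=lambda z: 1.0 - math.tanh(z) ** 2):
    """Hidden error signals, computed from the last unit to the first."""
    g = [0.0] * len(n)
    delta = 0.0                       # running sum of -g_k for k > j
    for j in reversed(range(len(n))):
        e_j = delta + dot(r, U[j])
        g[j] = e_j * f_prime(n[j])
        delta -= g[j]
    return g


def units_to_train(g):
    """Flag hidden units whose |error signal| exceeds the RMS over all units."""
    rms = math.sqrt(sum(v * v for v in g) / len(g))
    return [abs(v) > rms for v in g]
```

Each loop costs one extra subtraction per hidden unit, and only the accumulation of `delta` is inherently sequential; the dot products could still be computed in parallel beforehand.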
In addition to the convergence acceleration, the combined effect of training a fraction of the hidden units on a fraction of the samples gives big savings in CPU usage per sample as well. This sparseness of training with respect to evaluation provides further opportunities for speedup, as discussed in Section 4.
3.3 Centering On The Input Space
It is a well known recommendation (Schraudolph, 1998a,b; LeCun et al., 1998) that the input values should be normalized to have zero mean and unit standard deviation over each input dimension. This is achieved by subtracting from each input value the mean and dividing by the standard deviation. For some problems, like the one in Figure 2b, the center of the input space is not equal to the center of the problem. When the input is not known in advance, the latter must be computed adaptively. Moreover, since the hidden units are trained on different input samples, we should compute for each hidden unit its own mean and standard deviation over each input dimension. For the connection between hidden unit j and input unit i we can adaptively compute the approximate mean $m_{ji}$ and standard deviation $s_{ji}$ over the inputs that train the hidden unit, using either exponential traces:
$m_{ji}^{(t)} \leftarrow \beta\, x_i + (1 - \beta)\, m_{ji}^{(t-1)}$,
$q_{ji}^{(t)} \leftarrow \beta\, x_i^2 + (1 - \beta)\, q_{ji}^{(t-1)}$,
$s_{ji}^{(t)} \leftarrow \left( q_{ji}^{(t)} - \big(m_{ji}^{(t)}\big)^2 \right)^{1/2}$,
or perturbed calculations:
$m_{ji}^{(t)} \leftarrow m_{ji}^{(t-1)} + \beta\, \big(x_i - m_{ji}^{(t-1)}\big)$,
$v_{ji}^{(t)} \leftarrow v_{ji}^{(t-1)} + \beta \left[ \big(x_i - m_{ji}^{(t)}\big)\big(x_i - m_{ji}^{(t-1)}\big) - v_{ji}^{(t-1)} \right]$,
$s_{ji}^{(t)} \leftarrow \big(v_{ji}^{(t)}\big)^{1/2}$,
where β is a constant that determines the time scale of exponential averaging, vector x holds the input values, matrix Q holds the means of the squared input values and matrix V holds the variances. The means and standard deviations of a hidden unit’s input connections are updated only when the hidden unit is trained. The result of this treatment is that each hidden unit is centered on a different part of the input space.
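Both update schemes for the per-connection mean and standard deviation can be sketched for a single connection as below. This is an illustrative sketch only: the class name and attribute names are hypothetical, statistics start at zero (an assumption, since the initialization is not stated here), and the variance is clamped at zero before the square root to guard against floating-point round-off.

```python
import math


class InputStats:
    """Adaptive mean and standard deviation for one input connection.

    Tracks both variants side by side: exponential traces (m, q) and the
    incremental form (m_inc, v). Zero initialization is an assumption.
    """

    def __init__(self, beta=1e-3):
        self.beta = beta
        self.m = 0.0       # mean, exponential-trace variant
        self.q = 0.0       # mean of squared inputs
        self.m_inc = 0.0   # mean, incremental variant
        self.v = 0.0       # variance, incremental variant

    def update(self, x):
        b = self.beta
        # Exponential traces of x and x^2.
        self.m = b * x + (1.0 - b) * self.m
        self.q = b * x * x + (1.0 - b) * self.q
        # Incremental form: uses the mean before and after the update.
        m_old = self.m_inc
        self.m_inc = m_old + b * (x - m_old)
        self.v += b * ((x - self.m_inc) * (x - m_old) - self.v)

    def std_trace(self):
        return math.sqrt(max(self.q - self.m * self.m, 0.0))

    def std_incremental(self):
        return math.sqrt(max(self.v, 0.0))

    def normalize(self, x, eps=1e-12):
        # eps is a hypothetical guard against division by zero.
        return (x - self.m) / (self.std_trace() + eps)
```

In the scheme described in the text, `update` would be called for a hidden unit's connections only on samples for which that unit is actually trained, so each unit drifts toward the inputs that give it high error.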
This center is indirectly affected by the error that the inputs produce on the hidden unit. The magnitude of the constant β is problem specific, but in all experiments in this report it was kept fixed at $10^{-3}$. This constant must be selected large enough, so that the centers will rapidly move to their optimum locations, and small enough, so that the hidden units will see a relatively static view of the input space and the gradient descent algorithm will not be confused. As the hidden units jitter around their centers, we effectively train them on slightly shifted views of the input space, something that can assist generalization. We get something analogous to training with jitter (Reed et al., 1995), at no extra cost. In Figure 6, the squares show where each hidden unit is centered. You can see that most are centered on the problem boundaries at regular intervals. The crosses show the standard deviations. In some directions the standard deviations are very small, which results in very high normalized input values, causing the hidden units to act as threshold units in those directions. The sloped lines show the hyperplane distance from the center and the slope. These are computed for display purposes, from their theoretical formulas for a conventional network, without considering the effect of the cascaded connections. For some units the hyperplanes shown are not exactly on the boundaries. This is because the fixed cascaded connections cause the hidden units to be not exactly linear discriminants. In the last picture you can see the decision surface of a hidden unit, which is slightly curved and coincides with the class boundary although its calculated hyperplane is not on the boundary. An observant reader may also notice that the hyperplane distances from the centers are very small, which implies that the corresponding biases are small as well. In contrast, if all hidden units were centered on the center of the image, we would have the following problem.
The hyperplanes of some hidden units must be positioned on the outer parts of the image. For this to happen, these units should develop large biases with respect to the weights. This would make their activations have small variances. These small variances might need to be compensated by large output weights and biases, which would saturate the output units and in addition ill-condition the problems. One may wonder if the hidden biases are still necessary. Since the centers are individually set, it may seem at first that they are not. However, the centers are not trained through error back-propagation, and the hyperplanes do not necessarily pass over them. The biases’ role is to drive the hyperplanes to the correct location and thus pull the centers in the corresponding direction. The individual centering of the hidden units based on the samples’ positions is feasible, because we train only on samples with high errors and only the hidden units with high errors. By ignoring the small errors, we effectively position the center of each hidden unit near the center of mass of the high errors that it receives. However, this centering technique can still be used even if one chooses to train on all samples and all hidden units. Then, the statistics interval should be differentiated for each hidden unit and be recomputed for each sample relative to the normalized absolute error that each hidden unit receives. A way to do it is to set the effective statistics interval for hidden unit j and sample s to:
$\beta \, \dfrac{|e_{j,s}|}{\overline{|e_j|}}$
where β is the global statistics interval, $e_{j,s}$ is the hidden unit’s back-propagated error for the sample and $\overline{|e_j|}$ is the mean of the absolute back-propagated errors that the hidden unit receives, measured via an exponential trace. The denominator acts as a normalizer, which makes the hidden unit’s mobility independent of the average magnitude of the errors. Centering on other factors has been extensively investigated by Schraudolph (1998a,b). These techniques can provide further convergence acceleration, but we chose not to use them because of the additional computational overhead that they require.
Figure 6: Hidden unit centers, standard deviations, hyperplanes, global and local training sets and a hidden unit’s output. The images were captured at the final stage of training, of the problem in Figure 1a with 64 hidden units.
3.4 A Hybrid Activation Function
As shown in Section 5, the aforementioned techniques enable successful training on some difficult problems like those in Figures 1a and 1b. However, if the problem contains subproblems, or put another way, if the problem generates more than one cluster of high error density, the centering mechanism does not manage to drive the hidden unit centers to the most suitable locations. The centers are attracted by the larger subproblem or get stuck in areas between the subproblems, as shown in Figure 7.
Figure 7: Model, training set, and inadequate centering.
We need a mechanism that can force a hidden unit to get out of balanced but suboptimal positions. It would be nice if this mechanism could also allow the centers to migrate to various points in the input space as the need arises. It has been found that both of these requirements are fulfilled by a new hybrid activation function. Sigmoid activations have the property that they produce hyperplanes that separate the input space globally. Our intention is to use a sigmoid-like hidden activation function, because it can provide global separability, and at the same time, reduce the activation value towards zero on inputs which are not important to a hidden unit.
The Gaussian function is commonly used within radial basis function (RBF) neural networks (Broomhead and Lowe, 1988). When this function is applied to the distance of a sample to the unit's center, it produces a local response which is stronger near the center. We can then enclose the sigmoidal activation within a Gaussian envelope, by multiplying the activation with a value between 0 and 1, which is provided by applying the Gaussian function to the distance that is measured in the normalized input space. When the number of input dimensions is large, the distance metric that must be used is not an obvious choice. Table 1 contains the distance metrics that we have considered. The most suitable distance metric seems to depend on the distribution of the samples that train the hidden units.

Euclidean:         $\sqrt{\sum_{i=1}^{\mu} x_i^2}$
Euclidean Scaled:  $\sqrt{\frac{1}{\mu} \sum_{i=1}^{\mu} x_i^2}$
Manhattan:         $\sum_{i=1}^{\mu} |x_i|$
Manhattan Scaled:  $\frac{1}{\mu} \sum_{i=1}^{\mu} |x_i|$
Chebyshev:         $\max_i |x_i|$

Table 1: Various distance metrics that have been considered for the hybrid activation function.

In particular, if the samples follow a uniform distribution over a hypercube, then the Euclidean distance has the disturbing property that the average distance grows larger as the number of input dimensions increases and consequently the corresponding average Gaussian response decreases towards zero. As suggested by Hegland and Pestov (1999), we can make the average distance to center independent of the input dimensions, by measuring it in the normalized input space and then dividing it by the square root of the input's dimensionality. The same problem occurs for the Manhattan distance which has been claimed to be a better metric in high dimensions (Aggarwal et al., 2001). We can normalize this distance by dividing it by the input's dimensionality.
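The five metrics of Table 1 can be sketched in a few lines of plain Python. This is only an illustration: the function names are our own, and the argument `z` is assumed to be a sample already expressed in the normalized input space of one hidden unit.

```python
import math
from typing import Sequence


def euclidean(z: Sequence[float]) -> float:
    """Square root of the sum of squared coordinates."""
    return math.sqrt(sum(v * v for v in z))


def euclidean_scaled(z: Sequence[float]) -> float:
    """Euclidean distance divided by the square root of the dimensionality."""
    return euclidean(z) / math.sqrt(len(z))


def manhattan(z: Sequence[float]) -> float:
    """Sum of absolute coordinates."""
    return sum(abs(v) for v in z)


def manhattan_scaled(z: Sequence[float]) -> float:
    """Manhattan distance divided by the dimensionality."""
    return manhattan(z) / len(z)


def chebyshev(z: Sequence[float]) -> float:
    """Largest absolute coordinate."""
    return max(abs(v) for v in z)


if __name__ == "__main__":
    # A hypothetical sample lying on one axis of a 100-dimensional space.
    near_axis = [1.0] + [0.0] * 99
    for metric in (euclidean, euclidean_scaled, manhattan,
                   manhattan_scaled, chebyshev):
        print(metric.__name__, metric(near_axis))
```

Running the script on the hypothetical on-axis sample lets one compare how strongly each rescaled metric shrinks the distance relative to the unscaled and Chebyshev metrics.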
A problem that appears for both of the above rescaled distance metrics is that, for samples near the axes, the distances will be strongly attenuated and the corresponding Gaussian responses will be close to one, which makes the Gaussian envelopes ineffective. A more suitable metric for this type of distribution is the Chebyshev metric, whose average magnitude is independent of the number of dimensions. However, for reasons analogous to those mentioned above, this metric is not the most suitable if the distribution of the samples is spherical. In that case, the Euclidean distance does not need any rescaling and is the most natural distance measure. We can obtain spherical distributions by adaptively whitening the samples. As Plumbley (1993) and Laheld and Cardoso (1994) independently proposed, the whitening matrix Z can be adaptively computed as:

$$Z_{t+1} = Z_t - \lambda \left( z_t z_t^{T} - I \right) Z_t$$

where λ is the learning rate parameter, $z_t = Z_t x_t$ is the whitened vector and $x_t$ is the input vector. However, we would need too many additional parameters to do this individually for each subset of samples on which each hidden unit is trained. For the above reasons (and for lack of a justified alternative), in the implementation of these techniques we typically use the Euclidean metric when the number of input dimensions is up to three and the Chebyshev metric in all other cases. We have also replaced the usual tanh (sigmoidal) and Gaussian (bell-like) functions with similar functions which do not involve exponentials (Elliott, 1993). For each hidden unit j we first compute the net-input $n_j$ to the hidden unit (that is, the weighted distance of the sample to the hyperplane), as the inner product of normalized inputs and weights plus the bias:

$$z_{ji} = \frac{x_i - m_{ji}}{s_{ji}}, \qquad n_j = \mathbf{z}_j \cdot \mathbf{w}_j .$$

We then compute the sample's distance $d_j$ to the center of the unit, measured in the normalized input space:

$$d_j = \lVert \mathbf{z}_j \rVert .$$
Finally, we compute the activation $h_j$ as:

$$a_j = \mathrm{Elliott}(n_j) = \frac{n_j}{1 + |n_j|}, \qquad b_j = \mathrm{bell}(d_j) = \frac{1}{1 + d_j^2}, \qquad h_j = a_j b_j .$$

Since $d_j$ is not a function of $\mathbf{w}_j$, we treat $b_j$ as a constant for the calculation of the activation derivative with respect to $n_j$, which becomes:

$$\frac{\partial h_j}{\partial n_j} = b_j \left( 1 - |a_j| \right)^2 .$$

The hybrid activation function, which by definition may only be used for hidden units connected to the input layer, enables these units to acquire selective attention capabilities on the input space. Each hidden unit may have a global or local receptive field on each input dimension. The size of this dimensional receptive field depends on the standard deviation which is computed for the corresponding dimension. This activation makes balanced positions between subproblems unstable. As soon as the center is changed by a small amount, it will be attracted by the nearest subproblem. This is because the unit's activation and the corresponding error will be increased for samples towards the nearest subproblem and decreased in the other direction. Hidden units can still be centered between subproblems, but only if their movement in either direction causes a large error for samples in the opposite direction, that is, if they are absolutely necessary at their current position. Additionally, if a unit is centered near a subproblem that produces low errors and the unit is not necessary in that area, then it may migrate to other areas that still have high errors. This unit center migration has been observed in all experiments on complex problems. This may be due to the non-linear response of the bell function, and its long tails which keep the activation above zero for all input samples.

Figure 8: Model, evaluation, training set, hidden unit centers and two hidden unit outputs showing the effect of the hybrid activation function. The images were captured at the final stage of training, of the problem in Figure 1d with 700 hidden units.
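The three steps of the hybrid activation (normalize, measure net-input and distance, multiply the Elliott sigmoid by the bell envelope) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: the function and parameter names are hypothetical, the bias is added explicitly to the inner product as the prose describes, and the Euclidean norm is used for the distance, which the text recommends only for up to three input dimensions.

```python
import math
from typing import Sequence, Tuple


def elliott(n: float) -> float:
    """Elliott sigmoid: n / (1 + |n|), bounded in (-1, 1)."""
    return n / (1.0 + abs(n))


def bell(d: float) -> float:
    """Bell-shaped envelope: 1 / (1 + d^2), equal to 1 at the center."""
    return 1.0 / (1.0 + d * d)


def hybrid_activation(x: Sequence[float],
                      mean: Sequence[float],
                      std: Sequence[float],
                      w: Sequence[float],
                      bias: float = 0.0) -> Tuple[float, float]:
    """Return (h, dh/dn) for one hidden unit and one sample."""
    # Per-connection normalization with the unit's own means and stdevs.
    z = [(xi - mi) / si for xi, mi, si in zip(x, mean, std)]
    # Net-input: inner product of normalized inputs and weights, plus bias
    # (assumption: the bias is a separate additive term).
    n = sum(zi * wi for zi, wi in zip(z, w)) + bias
    # Distance to the unit's center in the normalized space (Euclidean here).
    d = math.sqrt(sum(zi * zi for zi in z))
    a = elliott(n)
    b = bell(d)
    h = a * b
    # b is treated as a constant; abs() keeps the slope valid for negative n.
    dh_dn = b * (1.0 - abs(a)) ** 2
    return h, dh_dn


if __name__ == "__main__":
    print(hybrid_activation([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]))
```

A sample located exactly at the unit's center with zero bias yields zero activation and unit slope, while samples far from the center are attenuated by the envelope regardless of their side of the hyperplane.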
In Figure 8 you can see a complex problem with 9 clusters of high errors. The hidden units place their centers on all clusters and are able to solve the problem. In the last two images, you can see the effect of the hybrid activation function, which attenuates the activation at points far from the center with respect to the standard deviation on each dimension. One unit develops a nearly circular local receptive field and another develops an elongated ellipsoidal receptive field. The latter provides almost global separation in the vertical direction and becomes a useful discriminant for two of the subproblems. One may find similarities between this hybrid activation function and the Square-MLP architecture described by Flake (1998). The latter partially implements higher order neurons by doubling the number of input units and setting the new input values equal to the squares of the original inputs. This architecture enables the hidden units to form local features of various shapes, but not the locally constrained sigmoid formed by our proposal. In contrast, the hybrid activation function does not need any additional parameters beyond those that are already used for centering, and it has the additional benefit, realized by the local receptive fields in conjunction with the small biases and the symmetric sigmoid, that the hidden activations will have a mean close to zero. As discussed by Schraudolph (1998a,b) and LeCun et al. (1998), this is very beneficial for the output layer training. However, there is still room for improvement. As was also observed by Flake (1998), the orientations of the receptive field ellipses are always in the direction of one of the input axes. This limitation is expected to hinder performance when training hidden units which have sloped hyperplanes. Figure 9 shows a complex problem in the middle of training.
Units with sloped hyperplanes are trained on samples whose input values are highly correlated. This can slow down learning by itself, but in addition, the standard deviations cannot get sufficiently small and as a result the receptive field cannot be sufficiently shrunk in the direction perpendicular to the hyperplane. Consequently, the hidden unit's activation unnecessarily interferes with the activations of nearby units. Although it may be possible to address the correlation problem with a more sophisticated training method that uses second order gradient information, like Stochastic Meta Descent (Schraudolph, 1999, 2002), the orientations of the receptive fields will still be limited. In Section 6.2 we discuss possible directions for further research that may circumvent this limitation.

Figure 9: Evaluation and global and local training sets during the middle of training for the problem in Figure 1b. It can be seen that a hidden unit with a sloped hyperplane is trained on samples with highly correlated input values. Samples that are separated by horizontal or vertical hyperplanes are easier to learn.

4. Further Speedups

In this section we first describe an implementation technique that reduces the computational requirements of the error evaluation phase and then we give references to methods that have been proposed by other authors for the acceleration of the training phase.

4.1 Evaluation Speedup

Two of the discussed techniques, training only on samples with high errors and then training only the hidden units with high error, make the error-evaluation phase the most processing-demanding phase in the solution of a given problem. In addition, some other techniques, like board game learning through temporal difference methods, require many evaluations to be performed before each training step. We can speed up evaluation by the following observation: For many problems, only part of the input is changed on successive samples.
For example, for a backgammon program with 200 input units (with raw board data and no additional encoded features), very few inputs will change on successive positions. Even on two dimensional problems such as images, we can arrange to train on samples selected by random changes on the X and Y dimensions alternately. This process of resampling only one coordinate at a time is also known as "Gibbs sampling" and it generalizes naturally to more than two coordinates (Geman and Geman, 1984). Thus, we can keep in memory all intermediate results from the evaluation, and recalculate only for the inputs that have changed. This implementation technique requires more storage, especially for high dimensional inputs. Fortunately, storage is not an issue on modern hardware.

4.2 Training Speedup

Many authors have proposed methods for speeding up online training by using second order gradient information in order to dynamically vary either the learning rate or the momentum (see LeCun et al., 1993; Leen and Orr, 1993; Murata et al., 1996; Harmon and Baird, 1996; Orr and Leen, 1996; Almeida et al., 1997; Amari, 1998; Schraudolph, 1998c, 1999, 2002; Graepel and Schraudolph, 2002). As shown in the next section, our techniques enable standard stochastic gradient descent with momentum to efficiently solve all the highly non-linear problems that have been investigated. However, the additional speedup that an accelerating algorithm can give is welcome. Moreover, these accelerating algorithms automatically reduce the learning rate when we are close to a solution (by sensing the oscillations in the error gradient), something that we would otherwise have to do through annealing if we wanted the best possible solution. We use the Incremental Delta-Delta (IDD) accelerating algorithm (Harmon and Baird, 1996), an incremental nonlinear extension to Jacobs' (1988) Delta-Delta algorithm, because of its simplicity and relatively small processing requirements.
IDD computes an individual learning rate λ for each weight w as:

$$\lambda(t) = e^{\xi(t)}, \qquad \xi(t+1) = \xi(t) + \theta \, \frac{\Delta w(t+1)}{\lambda(t)} \, \Delta w(t),$$

where θ is the meta-learning rate which we typically set to 0.1.

5. Experimental Results

In order to measure the effectiveness of the described techniques on various classes of problems, we performed several experiments. Each experiment was replicated 10 times with different random initial weights using matched random seeds and the means and standard deviations of the results were plotted in the corresponding figures. For the experiments we used a single hidden layer, the cross entropy error function, the logistic or softmax activation function for the output units and the Elliott or hybrid activation function for the hidden units. Output to hidden layer weights and biases were initialized to zero. Hidden to input layer weights were initialized to random numbers from a normal distribution and then rescaled so that the incoming weights to each hidden unit had norm unity. Hidden unit biases were initialized to a uniform random number between zero and one. The curves in the figures are labelled with a combination of the following letters which indicate the techniques that were applied:

B – Adjust weights using stochastic gradient descent with momentum 0.9 and fixed learning rate $0.1/\sqrt{c}$, where c is the number of incoming connections to the unit.
A – Adjust weights using IDD with meta-learning rate 0.1 and initial learning rate $1/\sqrt{c}$, where c is as above.
L – Use fixed cascaded inhibitory connections as described in Section 3.1.
S – Skip weights adjustment for samples with low error as described in Section 2.
U – Skip weights adjustment for hidden units with low error as described in Section 3.2.
C – Use individual means and stdevs for each hidden to input connection as described in Section 3.3.
H – Use the hybrid activation function as described in Section 3.4.
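As an aside on the 'A' method, the IDD learning-rate rule quoted in Section 4.2 can be sketched for a single weight as below. This is a hedged illustration, not the author's code: the class name and its fields are hypothetical, and the weight change is assumed to be a plain gradient step, Δw(t+1) = −λ(t)·gradient, which the text does not state (momentum is ignored here).

```python
import math


class IDDWeight:
    """One weight with its own learning rate lambda = exp(xi)."""

    def __init__(self, w: float, initial_rate: float, theta: float = 0.1):
        self.w = w
        self.xi = math.log(initial_rate)  # log of the learning rate
        self.theta = theta                # meta-learning rate
        self.prev_dw = 0.0                # previous weight change, dw(t)

    def rate(self) -> float:
        return math.exp(self.xi)

    def step(self, grad: float) -> float:
        """Apply one update given dE/dw and return the weight change."""
        lam = self.rate()
        dw = -lam * grad  # assumption: plain gradient step, no momentum
        # xi(t+1) = xi(t) + theta * dw(t+1) / lambda(t) * dw(t)
        self.xi += self.theta * (dw / lam) * self.prev_dw
        self.w += dw
        self.prev_dw = dw
        return dw


if __name__ == "__main__":
    c = 16  # hypothetical number of incoming connections
    unit = IDDWeight(w=0.5, initial_rate=1.0 / math.sqrt(c))
    for g in (1.0, 1.0, -1.0):
        unit.step(g)
        print(unit.w, unit.rate())
```

Under this reading, successive weight changes of the same sign raise the learning rate, and a sign flip (an oscillation in the gradient) lowers it.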
For the 'B' training method we deliberately avoided an annealing schedule for the learning rate, since this would destroy the initial state invariance of our techniques. Instead, we used a fixed small learning rate which we compensated with a large momentum. For the 'A' method, we used a small meta-learning rate, to avoid instabilities due to the high non-linearities of the examined problems. It is important to note that for both training methods the learning parameters were fixed to the above values and not optimized for each individual problem. For the 'C' technique, the centers of the hidden units were initially set to the center of the input space and the standard deviations were set to one third of the distance between the extreme values of each dimension. When the technique was not used, a global preprocessing was applied which normalized the input samples to have zero mean and unit standard deviation.

5.1 Two Input Dimensions

In this section we give experimental results for the class of problems that we have mainly examined, that is, problems in two input and one output dimensions, for which we have dense and noiseless training samples from the whole input space. In the figures, we measure the average classification error with respect to the stage of training. The classification error was averaged via an exponential trace with time scale $10^{-4}$.

5.1.1 Comparison of Technique Combinations

For these experiments we used the two-spirals problem shown in Figures 1a, 3, 4 and 6. We chose this problem as a non-trivial representative of the class of problems that during early training generate a single cluster of high error density. The goal of this experiment is to measure the effectiveness of various technique combinations and then to measure how well the best technique combination scales with the size of the hidden layer.
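The exponential trace used to average the classification error is not spelled out in the text. A common form, assumed here purely for illustration, is a running average in which each new observation moves the estimate by a fraction equal to the time scale; the class name and the interpretation of the time scale as that fraction are our own assumptions.

```python
class ExponentialTrace:
    """Running average: value <- value + alpha * (x - value)."""

    def __init__(self, time_scale: float, initial: float = 0.0):
        self.alpha = time_scale  # assumption: time scale acts as step size
        self.value = initial

    def update(self, x: float) -> float:
        """Fold one observation into the trace and return the new average."""
        self.value += self.alpha * (x - self.value)
        return self.value


if __name__ == "__main__":
    # Hypothetical stream of per-sample errors (1 = misclassified).
    trace = ExponentialTrace(time_scale=1e-4, initial=0.5)
    for err in (1.0, 0.0, 0.0, 1.0):
        print(trace.update(err))
```

With a value as small as $10^{-4}$, such a trace would effectively average over roughly the last ten thousand samples, which suits the sample counts shown on the horizontal axes of the figures.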
Figures 10 and 11 show the average classification error with respect to the number of evaluated samples and processing cycles respectively for 13 technique combinations. For these experiments we used 64 hidden units. The standard deviations were not plotted in order to keep the figures uncluttered. Figure 10 has also been split into Figures 12 and 13 in order to show the related error bars. Comparing the curves B vs. BL and BS vs. BLS in Figures 10 and 11, we can see that the fixed cascaded inhibitory connections reduce the asymptotic residual error by more than half. This also applies, but to a lesser degree, when we skip weight updates for hidden units with low errors (B vs. BU, BS vs. BSU). When the two are used in combination, we can see a speed-up of convergence but the asymptotic error is only marginally further improved (BLU and BLSU). In Figure 11, it can be seen that skipping samples with low errors can speed up convergence and reduce the asymptotic error as well (BLU vs. BLSU). This is a very intriguing result, in the sense that it implies that the system can learn faster and better by throwing away information. Both Figures 10 and 11 show that the BLUCH curve diverges. Considering the success of the BLSUCH curve, one might infer that skipping samples is necessary for the hybrid activation. However, the real problem, which we found by observing the dynamics of training, is that the centering mechanism does not work correctly when we train on all samples. A possible remedy may be to modify the statistics interval which is used for centering, as described at the end of Section 3.3. BLSUC vs. BLSU shows that centering further reduces the remaining asymptotic error by half and converges much faster as well. Comparing curve BLSUCH vs. BLSUC, we see that the hybrid activation function does better, but only marginally. This was expected since this problem has a single region of interest, so the ability of H to focus on multiple regions simultaneously is not exercised.
This is the reason for the additional experiments in Section 5.1.2. BLSUCH and ALSUCH were the most successful technique combinations, with the latter being a little faster. Nevertheless, it is very impressive that standard stochastic gradient descent with momentum can approach the best asymptotic error in less than a second, when using a modern 3.2 GHz processor. Figure 14 shows the average classification error with respect to the number of evaluated samples, for the ALSUCH technique combination and various hidden layer sizes. It can be seen that the asymptotic error is almost inversely proportional to the number of hidden units. This is a good indication that our techniques use the available resources efficiently. It is also interesting that the convergence rates to the corresponding asymptotic errors are quite fast and about the same for all hidden layer sizes.

Figure 10: Average classification error vs. number of evaluated samples for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The standard deviations have been omitted for clarity. Curves: B, BU, BL, BLU, BLUC, BLUCH, BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH.

Figure 11: Average classification error vs. Intel IA32 CPU cycles in billions, for various technique combinations, while training the problem in Figure 1a with 64 hidden units. The horizontal scale also corresponds to seconds when run on a 1 GHz processor. The standard deviations have been omitted for clarity. Curves: as in Figure 10.

Figure 12: Part of Figure 10 showing error bars for technique combinations which employ S. Curves: BS, BSU, BLS, BLSU, BLSUC, BLSUCH, ALSUCH.

Figure 13: Part of Figure 10 showing error bars for technique combinations which do not employ S. Curves: B, BU, BL, BLU, BLUC, BLUCH.

Figure 14: Average classification error vs. number of evaluated samples for various hidden layer sizes, while training the problem in Figure 1a with the ALSUCH technique combination. Curves: 32, 48, 64, 96, 128 and 256 hidden units.

5.1.2 Hybrid vs. Conventional Activation

For these experiments we used the two dimensional problem depicted in Figures 1c and 7. We chose this problem as a representative of the class of problems that during early training generate small clusters of high error density of various sizes. For this kind of problem we typically obtain very small residuals for the classification error, although the problem may not have been learned. This is because we measure the error on the whole input space, and for these problems most of the input space is trivial to learn. The problem's complexities are confined to very small areas. The dynamic training set evolution algorithm is able to locate these areas, but we need many more sample presentations, since most of the samples are not used for training.

Figure 15: Average classification error vs. number of evaluated samples for the ALSUCH and ALSUC technique combinations, while training the problem in Figure 1c with 100 hidden units. The dashed lines show the minimum and maximum observed values.
The goal of this experiment is to measure the effectiveness of the hybrid activation function at coping with the varying sizes of the subproblems. For these experiments we used 100 hidden units. Figure 15 shows that the ALSUCH technique, which employs the hybrid activation function, reduced the asymptotic error to half of that of the ALSUC technique. As all of the visual inspections revealed, one of which is reproduced in Figure 16, the difference in the residual errors of the two curves is due to the insufficient approximation of the smaller subproblem by the ALSUC technique.

Figure 16: ALSUCH vs. ALSUC approximations for a problem with two sub-problems. Panels: Model, ALSUCH, ALSUC.

5.2 Higher Input and Output Dimensions

In order to evaluate our techniques on a problem with higher input and output dimensions, we selected a standard benchmark, the Letter recognition database from the UCI Machine Learning Repository (Newman et al., 1998). This database consists of 20000 samples that use 16 integer attributes to classify the 26 letters of the English alphabet. This problem is characterized by a medium input dimensionality and a large output dimensionality. The latter makes it a very challenging problem for any classifier. This problem differs from those on which we have experimented so far, in that we do not have the whole input space at our disposal for training. We must train on a limited number of samples and then test the system's generalization abilities on a separate test set. Although we have not taken any special measures to assist generalization, the experimental results indicate that our techniques have the inherent ability to generalize well, when given noiseless exemplars. An observation that applies to this problem is that the IDD accelerated training method could not do better than standard stochastic gradient descent with momentum. Thus, we report results using the BLSUCH technique combination, which is computationally more efficient than the ALSUCH technique. For this experiment, which involves more than two output classes, we used the softmax activation function at the output layer. Table 2 contains previously published results showing the classification accuracy of various classifiers. The most successful of them were the AdaBoosted versions of the C4.5 decision-tree algorithm and of a feed forward neural network with two hidden layers. Both classifier ensembles required a large number of machines in order to achieve that high accuracy.

Classifier                                      Test Error %   Reference
Naive Bayesian classifier                       25.3           Ting and Zheng (1999)
AdaBoost on Naive Bayesian classifier           24.1           Ting and Zheng (1999)
Holland-style adaptive classifier               17.3           Frey and Slate (1991)
C4.5                                            13.8           Freund and Schapire (1996)
AdaBoost on C4.5 (100 machines)                 3.3            Freund and Schapire (1996)
AdaBoost on C4.5 (1000 machines)                3.1            Schapire et al. (1997)
CART                                            12.4           Breiman (1996)
AdaBoost on CART (50 machines)                  3.4            Breiman (1996)
16-70-50-26 MLP (500 online epochs)             6.2            Schwenk and Bengio (1998)
AdaBoost on 16-70-50-26 MLP (20 machines)       2.0            Schwenk and Bengio (1998)
AdaBoost on 16-70-50-26 MLP (100 machines)      1.5            Schwenk and Bengio (2000)
Nearest Neighbor                                4.3            Fogarty (1992)

Table 2: A compilation of previously reported best error rates on the test set for the UCI Letters Recognition Database.

Figure 17 shows the average error reduction with respect to the number of online epochs, for the BLSUCH technique combination and various hidden layer sizes. As suggested in the database's documentation, we used the first 16000 samples for training and for measuring the training accuracy and the remaining 4000 samples to measure the predictive accuracy. The solid and dashed curves show the test and training set errors respectively.
Similarly to ensemble methods, we can observe two interesting phenomena which both seem to contradict the Occam's razor principle. The first observation is that the test error stabilizes or continues to decrease slightly even after the training error has been zeroed. What is really happening is that the RMS error for the training set (which is related to the confidence of classification) continues to decrease even after the classification error has been zeroed, something that is also beneficial for the test set's classification error. The second observation is that increasing the network's capacity does not lead to overfitting. Although the training set error can be zeroed with just 125 hidden units, increasing the number of hidden units reduces the residual test error as well. We attribute this phenomenon to the conjecture that the hidden units' differentiation results in a smoother approximation (as suggested by Figure 5 and the related discussion). Comparing our results with those in Table 2, we can also observe the following: The 16-125-26 MLP (5401 weights) reached a 4.6% misclassification error on average, which is 26% better than the 6.2% of the 16-70-50-26 MLP (6066 weights), despite the fact that it had fewer weights, a simpler architecture with only one hidden layer, and was trained for far fewer online epochs. It is indicative that the asymptotic residual classification error on the test set was reached in about 30 online epochs. The 16-1000-26 MLP (43026 weights) reached a 2.4% misclassification error on average, which is the third best published result following the AdaBoosted 16-70-50-26 MLPs with 20 and 100 machines (121320 and 606600 weights respectively). The lowest observed classification error was 2.1% and was reached in one of the 10 runs at the 80th epoch. It must be stressed that the above results were obtained without any optimization of the learning rate, without a learning rate annealing schedule and within a far shorter training time. All MLPs with 250 hidden units and above gave results which put them at the top of the list of non-ensemble techniques, and they even outperformed AdaBoost on C4.5 with 100 machines. Similarly to Figure 14, we also see that the convergence rates to the corresponding asymptotic errors on the test set are quite fast and about the same for all hidden layer sizes.

Figure 17: Average error reduction vs. number of online epochs for various hidden layer sizes, while training on the UCI Letters Recognition Database with the BLSUCH technique combination. The solid and dashed curves show the test and training set errors respectively. The standard deviations for the training set errors have been omitted for clarity. The embedded table contains the minimum observed errors across all trials and epochs, and the average errors across all trials at epoch 100. Curves: train and test errors for 125, 250, 500 and 1000 hidden units.

Test error % (embedded table of Figure 17)
Units   Min   Avg at end
125     4.0   4.6
250     2.8   3.2
500     2.3   2.6
1000    2.1   2.4

6. Discussion and Future Research

We have presented global and local selective attention techniques that can help neural network training to concentrate on the difficult parts of complex non-linear problems. A new hybrid activation function has also been presented that enables the hidden units to acquire individual receptive fields in the input space. These individual receptive fields may be global or local depending on the problem's local complexities. The success of the new activation function is due to the fact that it depends on two distances. The first is the weighted distance of a sample to the hidden unit's hyperplane. The second is the distance to the hidden unit's center. We need both distances; neither of them alone is sufficient. The first helps us discriminate and the second helps us localize.
The dynamic training set evolution algorithm locates the sub-areas of the input space where the problem resides. The fixed cascaded inhibitory connections and the selective training of a subset of the hidden units on each sample force the hidden units to become differentiated and attack different subproblems. The individual centering of the hidden units at different points in the input space adaptively conditions the network to the problem's local structures and enables each hidden unit to solve a well-conditioned subproblem. In coordination with the above, the hidden units' limited receptive fields allow training to follow a divide and conquer paradigm where each hidden unit only solves a local subproblem. The solutions to the subproblems are then combined by the output layer to give a solution to the original problem. In the reported experiments we initialized the hidden weights and biases so that the hidden hyperplanes would cover the whole input space at random positions and orientations. The initial norm of the weights was also adjusted so that the net-input to each hidden unit would fall in the transition between the linear and non-linear range of the activation function. These specific initializations were necessary for standard backpropagation. In contrast, we have found that the combined techniques are insensitive to the initial weights and biases, as long as their values are small. We have repeated the experiments with hidden biases set to zero and hidden weight norms set to $10^{-3}$, and the results were equivalent to those reported in Section 5. However, the choice of the best initial learning rate is still problem specific. An additional and important characteristic of these techniques is that training of the hidden layer does not depend solely on gradient information. Gradient based techniques can only perform local optimization by locating a local minimum of the error function when the system is already in the basin of attraction of that minimum.
Stochastic training has the potential to escape from a shallow basin, but only when the basin is not very wide. Once there, the system cannot escape towards a different basin with a lower minimum. On the contrary, in our model some of the hidden layer’s free parameters (the weights) are trained through gradient descent on the error, whereas others (the means and standard deviations) are “trained” from the statistical properties of the back-propagated errors. Each hidden unit places its center near the center of mass of the error that it receives and limits its visibility to the area of the input space where it produces a significant error. This model makes the hidden error landscape change constantly. We conjecture that during training, paths connecting the various error basins are continuously emerging and vanishing. As a result the system can explore much more of the solution space. It is indicative that in all the reported experiments, all trials converged to a solution with more or less the same residual error irrespective of the initial network state. The combination of the presented techniques enables very fast training on complex classification problems with embedded subproblems. By focusing on the problem’s details and efficiently utilizing the available resources, they make feasible the solution of very difficult problems (like the one in Figure 1e), provided that an adequate number of hidden units has been used. Although other machine learning techniques can do the same, to our knowledge this is the first report that this can be done using ordinary feed forward neural networks and backpropagation, in an online, adaptive and memory-less scenario, where the input exemplars are unknown before training and discarded after being used. In the following we discuss some areas that deserve further investigation.
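The idea that a hidden unit's means and standard deviations follow the statistics of the errors it receives, rather than a gradient, can be illustrated with a small sketch. The symbols m, q, v, s and z mirror the quantities listed in Appendix A (mean, mean of squares, variance, standard deviation, z-score) and beta plays the role of the trace constant β, but the specific update rule, in particular scaling the step by the clipped error magnitude, is an assumption and not the paper's stated procedure. The class name `ErrorWeightedStats` is hypothetical.

```python
import math


class ErrorWeightedStats:
    """Hypothetical per-connection running statistics, pulled toward inputs
    on which the hidden unit receives a large back-propagated error."""

    def __init__(self, mean=0.0, mean_sq=1.0, beta=0.01):
        self.m = mean      # exponential trace of the input values
        self.q = mean_sq   # exponential trace of the squared input values
        self.beta = beta   # time-scale constant of the trace

    def update(self, x, error):
        # Assumption: the step grows with the error magnitude (clipped to 1),
        # so samples that produce no error leave the statistics untouched.
        rate = self.beta * min(1.0, abs(error))
        self.m += rate * (x - self.m)
        self.q += rate * (x * x - self.q)

    @property
    def std(self):
        # v = q - m^2, floored to keep the square root well defined.
        return math.sqrt(max(self.q - self.m ** 2, 1e-12))

    def zscore(self, x):
        # Normalized input as seen by the hidden unit.
        return (x - self.m) / self.std


stats = ErrorWeightedStats(beta=0.5)
stats.update(4.0, error=0.0)   # no error: the center does not move
stats.update(4.0, error=1.0)   # full error: the center moves halfway to 4.0
```

In this sketch the center drifts toward regions where the unit errs and stays put elsewhere, which is the qualitative behaviour described above; a faithful implementation would follow the update equations given in the paper's earlier sections.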
6.1 Generalization and Regression
For the classes of problems that were investigated, we had noiseless exemplars and the whole input space at our disposal for training, so there was no danger of overfitting. Thus, we did not use any mechanism to assist generalization. This does not mean, of course, that the network just stored the input-output mapping, as a lookup table would do. By putting constraints on the positions and orientations of the hidden unit hyperplanes and by limiting their receptive fields, we reduced the system’s available degrees of freedom, and the network arranged its resources so as to achieve the best possible approximation of the input-output mapping. The experiments on the Letter Recognition Database showed remarkable generalization capabilities. However, when we train on noisy samples, or when the number of training samples is small with respect to the size and complexity of the input space, there is a danger of overfitting. It remains to be examined how the described techniques are affected by methods that avoid overfitting, such as training with jitter, error regularization, target smoothing and sigmoid gain attenuation (Reed et al., 1995). This consideration also applies to regression problems, which usually require smoother approximations. Although early experiments give evidence that the presented techniques can be applied to regression problems as well, we feel that some smoothing technique must be included in the training framework.
6.2 Receptive Fields Limited Orientations
As noted in Section 3.4, the orientations of the receptive field ellipses are limited to the direction of one of the input axes. This hinders training performance by not allowing the receptive fields to be adequately shrunk in the direction perpendicular to the hyperplane. In addition, hidden units with sloped hyperplanes are trained on highly correlated input values. These problems are expected to be aggravated in high dimensional input spaces.
We would cure both of these problems simultaneously if we could individually transform the input for each hidden unit through adaptive whitening, or if we could present to each hidden unit a rotated view of the input space, such that one of the axes is perpendicular to the hyperplane and the rest are parallel to it. Unfortunately, both of the above transformations would require too many additional parameters. An approximation (for 2-dimensional problems) that we are currently investigating is the following: for each input vector we compute K vectors rotated around the center of the input space with successive angle increments equal to π/(2K). Our purpose is to obtain uniform rotations between 0 and π/4. Every few hundred training steps, we reassign to each hidden unit the most appropriate input representation and adjust the affected parameters (weights, means and standard deviations). The results are promising.
6.3 Dynamic Cascaded Inhibitory Connections
Regarding the fixed cascaded inhibitory connections, it must be examined whether it is better to make the strength of the connections dynamic. A strength of minus one is adequate when the weights are small. However, as the weights get larger, the inhibitory connections become less and less effective at differentiating the hidden units. We can try to make them relative to each hidden unit’s average absolute net-input or, alternatively, make them trainable. It has been observed that increasing the strength of these connections enables the hidden units to generate more curved discriminant functions, which is very beneficial for some problems.
6.4 Miscellaneous
More experiments need to be done in order to evaluate the effectiveness of the hybrid activation function on highly non-linear problems in many dimensions.
High dimensional input spaces have a multitude of disturbing properties with regard to distance and density metrics, which may affect the hybrid activation in as yet unknown ways. Last, we must devise a training mechanism that is invariant to the initial learning rate and that automatically varies the number of hidden units as each problem requires.
Acknowledgments
I would like to thank all participants in my threads in the Usenet group comp.ai.neural-nets for their fruitful comments on early presentations of the subjects in this report. Special thanks to Aleks Jakulin for his support and ideas on further research that can make these results even better, and to Greg Heath for bringing to my attention the perturbed forms for the calculation of sliding window statistics. I also thank the area editor Léon Bottou and the anonymous reviewers for their valuable comments and for helping me to bring this report into shape for publication.
Appendix A. Notational Conventions
The following list contains the meanings of the symbols that have been used in this report. Symbols with subscripts are used either as scalars or as vectors and matrices when the subscripts are omitted. For example, w_ji is a single weight, w_j is a weight vector and W is a weight matrix.
α – A constant that determines the time scale of the exponential trace of the average training-set error within the dynamic training set evolution algorithm.
β – A constant that determines the time scale of the exponential trace of the input means and standard deviations.
δ – An accumulator for the efficient implementation of the fixed cascaded inhibitory connections.
η – The number of hidden units.
µ – The number of input units.
f – The hidden units’ squashing function.
i – Index enumerating the input units.
j – Index enumerating the hidden units.
k – Index enumerating the output units.
a_j – The hidden unit’s activation computed from the sample’s weighted distance to the hidden unit’s hyperplane.
b_j – The hidden unit’s activation attenuation computed from the sample’s distance to the hidden unit’s center.
d_j – The sample’s distance to the hidden unit’s center.
e_j – The hidden unit’s accumulated back propagated errors.
g_j – The hidden unit’s error signal f(n_j) e_j.
h_j – The hidden unit’s activation.
m_ji – The mean of the values received by hidden unit j from input unit i.
n_j – The net-input to the hidden unit.
q_ji – The mean of the squared values received by hidden unit j from input unit i.
r_k – The error of output unit k.
s_ji – The standard deviation of the values received by hidden unit j from input unit i.
u_jk – The weight of the connection from hidden unit j to output unit k.
v_ji – The variance of the values received by hidden unit j from input unit i.
w_ji – The weight of the connection from hidden unit j to input unit i.
x_i – The value of input unit i.
z_ji – The normalized input value received by hidden unit j from input unit i. It is currently computed as the z-score of the input value. A better alternative would be to compute the vector z_j by multiplying the input vector x with a whitening matrix Z_j.
References
C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In J. Van den Bussche and V. Vianu, editors, Proceedings of the 8th International Conference on Database Theory (ICDT), volume 1973 of Lecture Notes in Computer Science, pages 420–434. Springer, 2001.
K. Agyepong and R. Kothari. Controlling hidden layer capacity through lateral connections. Neural Computation, 9(6):1381–1402, 1997.
S. Ahmad and S. Omohundro. A network for extracting the locations of point clusters using selective attention. In Proceedings of the 12th Annual Conference of the Cognitive Science Society, MIT, 1990.
L. B. Almeida, T. Langlois, and J. D. Amaral. On-line step size adaptation. Technical Report INESC RT07/97, INESC/IST, Rua Alves Redol 1000 Lisbon, Portugal, 1997.
S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
P. Bakker. Don’t care margins help backpropagation learn exceptions. In A. Adams and L. Sterling, editors, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pages 139–144, 1992.
P. Bakker. Exception learning by backpropagation: A new error function. In P. Leong and M. Jabri, editors, Proceedings of the 4th Australian Conference on Neural Networks, pages 118–121, 1993.
S. Baluja and D. Pomerleau. Using the representation in a neural network’s hidden layer for task-specific focus of attention. In IJCAI, pages 133–141, 1995.
L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996.
D. S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2(3):321–355, 1988.
W. Duch, K. Grudzinski, and G. H. F. Diercksen. Minimal distance neural methods. In World Congress of Computational Intelligence, pages 1299–1304, 1998.
D. L. Elliott. A better activation function for artificial neural networks. Technical Report TR 93-8, The Institute for Systems Research, University of Maryland, College Park, MD, 1993.
G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 145–163. Springer, 1998.
T. C. Fogarty. Technical note: First nearest neighbor classification on Frey and Slate’s letter recognition problem. Machine Learning, 9(4):387–388, 1992.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.
P. W. Frey and D. J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6:161–182, 1991.
S. Geman and D. Geman.
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
T. Graepel and N. N. Schraudolph. Stable adaptive momentum for rapid online learning in nonlinear systems. In J. R. Dorronsoro, editor, Proceedings of the International Conference on Artificial Neural Networks (ICANN), volume 2415 of Lecture Notes in Computer Science, pages 450–455. Springer, 2002.
M. Harmon and L. Baird. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-6543, 1996.
M. Hegland and V. Pestov. Additive models in high dimensions. Computing Research Repository (CoRR), cs/9912020, 1999.
S. C. Huang and Y. F. Huang. Learning algorithms for perceptrons using back propagation with selective updates. IEEE Control Systems Magazine, pages 56–61, April 1990.
R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.
R. Kothari and D. Ensley. Decision boundary and generalization performance of feed-forward networks with Gaussian lateral connections. In S. K. Rogers, D. B. Fogel, J. C. Bezdek, and B. Bosacchi, editors, Applications and Science of Computational Intelligence, SPIE Proceedings, volume 3390, pages 314–321, 1998.
B. Laheld and J. F. Cardoso. Adaptive source separation with uniform performance. In Proc. EUSIPCO, pages 183–186, September 1994.
Y. LeCun, P. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian’s eigenvectors. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 156–163. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. B. Orr and K.-R.
Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 9–50. Springer, 1998.
T. K. Leen and G. B. Orr. Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Proceedings of the 7th NIPS Conference (NIPS), Advances in Neural Information Processing Systems 6, pages 477–484. Morgan Kaufmann, 1993.
P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.
N. Murata, K. Müller, A. Ziehe, and S. Amari. Adaptive on-line learning in changing environments. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 599–605. MIT Press, 1996.
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
G. B. Orr and T. K. Leen. Using curvature information for fast stochastic search. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9 (NIPS), pages 606–612. MIT Press, 1996.
J. L. Phillips and D. C. Noelle. Reinforcement learning of dimensional attention for categorization. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society, 2004.
M. Plumbley. A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proc. IEE Conf. on Artificial Neural Networks, Brighton, UK, pages 86–90, 1993.
R. Reed, R. J. Marks, and S. Oh. Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter. IEEE Transactions on Neural Networks, 6(3):529–538, 1995.
R. E. Schapire. A brief introduction to boosting. In T.
Dean, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), pages 1401–1406. Morgan Kaufmann, 1999.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, editor, Proceedings of the 14th International Conference on Machine Learning (ICML), pages 322–330. Morgan Kaufmann, 1997.
N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
N. N. Schraudolph. Centering neural network gradient factors. In G. B. Orr and K. R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, pages 207–226. Springer, 1998a.
N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical Report IDSIA-33-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 1998b.
N. N. Schraudolph. Online local gain adaptation for multi-layer perceptrons. Technical Report IDSIA-09-98, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Galleria 2, CH-6928 Manno, Switzerland, 1998c.
N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In ICANN, pages 569–574. IEE, London, 1999.
H. Schwenk and Y. Bengio. Boosting neural networks. Neural Computation, 12(8):1869–1887, 2000.
H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks for character recognition. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998.
M. W. Spratling and M. H. Johnson. Neural coding strategies and mechanisms of competition. Cognitive Systems Research, 5(2):93–117, 2004.
C. Thornton. The howl effect in dynamic-network learning. In Proceedings of the International Conference on Artificial Neural Networks, pages 211–214, 1992.
K. M. Ting and Z. Zheng.
Improving the performance of boosting for naive Bayesian classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 296–305, 1999.
Y. H. Yu and R. F. Simmons. Descending epsilon in back-propagation: A technique for better generalization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 3, pages 167–172, 1990.
S. Zhong and J. Ghosh. Decision boundary focused neural network classifier. In Intelligent Engineering Systems Through Artificial Neural Networks (ANNIE). ASME Press, 2000.
2 0.93234044 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák
Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) is as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is proportional to $1/\sqrt{n}$ and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of $m \ll n$ time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem. © 2007 András György, Tamás Linder, Gábor Lugosi and György Ottucsák.
same-paper 3 0.92823476 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
Author: Ya Xue, Xuejun Liao, Lawrence Carin, Balaji Krishnapuram
Abstract: Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP. Keywords: classification, hierarchical Bayesian models, Dirichlet process
4 0.61325729 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh
Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling
7 0.52916819 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
8 0.52785873 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
9 0.52760166 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation
10 0.52379322 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
12 0.51460719 51 jmlr-2007-Loop Corrections for Approximate Inference on Factor Graphs
14 0.50043964 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
15 0.4988395 72 jmlr-2007-Relational Dependency Networks
16 0.49738467 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
17 0.49620378 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
18 0.4959355 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
19 0.48969036 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
20 0.48852307 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts