
68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Source: pdf

Author: Charlie Frogner, Avi Pfeffer

Abstract: Dynamic Bayesian networks are structured representations of stochastic processes. Despite their structure, exact inference in DBNs is generally intractable. One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. In addition to features computed over entire factors, for efficiency we explored scores computed over pairs of variables. We present search methods that use these features, pairwise and not, to find a factorization, and we compare their results on several datasets. Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. Moreover, tests on real DBNs show that automatic factorization can achieve significantly lower error in some cases.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. [sent-7, score-0.349]

2 In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. [sent-8, score-0.42]

3 We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. [sent-9, score-0.371]

4 An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. [sent-10, score-0.536]

5 Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. [sent-13, score-0.679]

6 Moreover, tests on real DBNs show that automatic factorization can achieve significantly lower error in some cases. [sent-14, score-0.291]

7 Factored inference approximates this joint distribution over all variables as the product of smaller distributions over groups of variables (factors) and in this way enables tractable inference for large, complex models. [sent-18, score-0.43]

8 Factored inference has generally been demonstrated for models that are factored by hand. [sent-20, score-0.343]

9 The quality of a factorization is defined by the amount of error incurred by repeatedly discarding the dependencies between factors and treating them as independent during inference. [sent-22, score-0.51]

10 For this purpose we have examined a range of features that can be computed from the specification of the DBN, based both on the underlying graph structure and on two essential conceptions of weak interaction between factors: the degree of separability [6] and mutual information. [sent-24, score-0.831]

11 We find that the mutual information between factors that is introduced by one step of belief state propagation is especially well-suited to the problem of finding a good factorization. [sent-26, score-0.6]

12 We compare several search methods for finding factors that allow for different tradeoffs between the efficiency and the quality of the factorization. [sent-28, score-0.292]

13 The fastest is a graph partitioning algorithm in which we find a k-way partition of a weighted graph with edge-weights being pairwise scores between variables. [sent-29, score-0.352]

14 Agglomerative clustering and local search methods use the higher-order scores computed between whole factors, and are hence slower while finding better factorizations. [sent-30, score-0.381]

15 Our results show that dynamic Bayesian networks can be decomposed efficiently and automatically, enabling wider applicability of factored inference. [sent-33, score-0.425]

16 Furthermore, tests on real DBNs show that using automatically found factors can in some cases yield significantly lower error than using factors found by hand. [sent-34, score-0.473]

17 [2 Background] A dynamic Bayesian network (DBN) [7, 8] represents a dynamic system consisting of some set of variables that co-evolve in discrete timesteps. [sent-35, score-0.289]

18 We denote the set of variables in the system by X, with the canonical variables being those that directly influence at least one variable in the next timestep. [sent-37, score-0.262]

19 We call the probability distribution over the possible states of the system at a given timestep the belief state. [sent-38, score-0.356]

20 We can hence represent this transition model as a Bayesian network containing the variables in X at timestep t, denoted Xt , and the variables in X at timestep t + 1, say Xt+1 – this is called a 2-TBN (for two-timeslice Bayesian network). [sent-40, score-0.526]

21 By inferring the belief state over Xt+1 from that over Xt , and conditioning on observations, we propagate the belief state through the system dynamics to the next timestep. [sent-41, score-0.474]

22 Note that, although each variable at t + 1 may only depend on a small subset of the variables at t, its state might be correlated implicitly with the state of any variable in the system, as the influence of any variable might propagate through intervening variables over multiple timesteps. [sent-43, score-0.355]

23 Boyen and Koller [3] find that, despite this fact, we can factor the system into components whose belief states are kept independently, and the error incurred by doing so remains bounded over the course of the process. [sent-45, score-0.339]

24 The BK algorithm hence approximates the belief state at a given timestep as the product of the local belief states for the factors (their marginal distributions), and does exact inference to propagate this approximate belief state to the next timestep. [sent-46, score-1.127]

25 Both the Factored Frontier, [4], and Factored Particle, [5], algorithms also rely on this idea of a factored belief state representation. [sent-47, score-0.495]

26 In [9] and [6], Pfeffer introduced conditions under which a single variable’s (or factor’s) marginal distribution will be propagated accurately through belief state propagation, in the BK algorithm. [sent-48, score-0.302]

27 The degree of separability is a property of a conditional probability distribution that describes the degree to which that distribution can be decomposed as the sum of simpler conditional distributions, each of which depends on only a subset of the conditioning variables. [sent-49, score-0.8]

28 We will say that the degree of separability is the maximum α such that there exist pX(Z|X), pX(Z|Y), and pXY(Z|XY) and γ that satisfy (1). [sent-52, score-0.496]

29 [10] analyzed the error introduced between the exact distribution and the factored distribution by just one step of belief propagation. [sent-56, score-0.561]

30 The authors noted that this error can be decomposed as the sum of conditional mutual information terms between variables in different factors and showed that each such term is bounded with respect to the mixing rate of the subsystem comprising the variables in that term. [sent-57, score-0.684]

31 Along with other heuristics, we examined two approaches to automatic factorization that seek directly to exploit the above results, labeled in-degree and out-degree in Table 1. [sent-59, score-0.293]

32 [3 Automatic factorization with pairwise scores] We first investigated a collection of features, computable from the specification of the DBN, that capture different types of pairwise dependencies between variables. [sent-60, score-0.59]

33 These features are based both on the 2-TBN graph structure and on two conceptions of interaction: the degree of separability and mutual information. [sent-61, score-0.789]

34 [Algorithm: Recursive min-cut] We use the following algorithm to find a factorization using only scores between pairs of variables. [sent-64, score-0.325]

35 We build an undirected graph over the canonical variables in the DBN, weighting each edge between two variables with their pairwise score. [sent-65, score-0.399]

36 An obvious algorithm for finding a partition that minimizes pairwise interactions between variables in different factors would be to compute a k-way min-cut, taking, say, the best-scoring such partition in which all factors are below a size limit. [sent-66, score-0.606]

37 Instead we find that a good factorization can be achieved by computing a recursive min-cut, recursing until all factors are smaller than the pre-defined maximum size. [sent-68, score-0.399]

38 For each factor that is too large, we search over the number of smaller factors, k, into which to divide the large factor, for each k computing the k-way min-cut factorization of the variables in the large factor. [sent-71, score-0.488]
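
The recursive splitting described in sentences 34 to 38 can be sketched roughly as follows. This is only an illustrative simplification, not the authors' implementation: it always splits into k = 2 parts instead of searching over k, and it finds the 2-way min-cut by exhaustive enumeration, which is only practical for a handful of variables. The names `cut_weight`, `min_bisection`, and `recursive_min_cut`, and the toy weights, are hypothetical.

```python
from itertools import combinations

def cut_weight(part_a, part_b, weights):
    """Total pairwise score crossing between two groups of variables."""
    return sum(weights.get(frozenset((a, b)), 0.0) for a in part_a for b in part_b)

def min_bisection(variables, weights):
    """Exhaustively find the 2-way split with the smallest crossing weight."""
    variables = sorted(variables)
    anchor, rest = variables[0], variables[1:]
    best = None
    # Keep `anchor` on the left so every split is enumerated exactly once;
    # sizes stop short of len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = {anchor, *chosen}
            right = set(variables) - left
            w = cut_weight(left, right, weights)
            if best is None or w < best[0]:
                best = (w, left, right)
    return best[1], best[2]

def recursive_min_cut(variables, weights, max_size):
    """Split recursively until every factor has at most max_size variables."""
    if len(variables) <= max_size:
        return [set(variables)]
    left, right = min_bisection(variables, weights)
    return (recursive_min_cut(left, weights, max_size)
            + recursive_min_cut(right, weights, max_size))

# Toy pairwise scores (assumed values): A-B and C-D interact strongly.
weights = {frozenset("AB"): 5.0, frozenset("CD"): 5.0,
           frozenset("AC"): 0.1, frozenset("BD"): 0.1}
factors = recursive_min_cut({"A", "B", "C", "D"}, weights, max_size=2)
```

On this toy graph the weakly coupled pairs end up in separate factors. A practical version would replace the exhaustive bisection with a proper k-way graph partitioner.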

39 [Pairwise scores: Graph structure] As a baseline in terms of speed and simplicity, we first investigated three types of pairwise graph relationships between variables that are indicative of different types of dependency. [sent-78, score-0.392]

40 Suppose that two variables at time t + 1, $X_{t+1}$ and $Y_{t+1}$, depend on some common parents $Z_t$. [sent-80, score-0.322]

41 The score between X and Y is the number of parents they share in the 2-TBN. [sent-82, score-0.305]
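
As a minimal illustration of the shared-parent score in sentence 41, the sketch below counts common parents for every pair of variables. The dictionary representation of the 2-TBN and the function name `shared_parent_scores` are assumptions made for this example only.

```python
from itertools import combinations

def shared_parent_scores(parents):
    """Map each unordered pair of variables to the number of 2-TBN parents they share."""
    scores = {}
    for x, y in combinations(sorted(parents), 2):
        scores[frozenset((x, y))] = len(set(parents[x]) & set(parents[y]))
    return scores

# Hypothetical 2-TBN: keys are variables at t+1, values are their parents at t.
parents = {"X": ["X", "Z"], "Y": ["Y", "Z"], "W": ["W"]}
scores = shared_parent_scores(parents)
```

The resulting dictionary can serve directly as edge weights for a graph-partitioning step.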

42 [Degree of separability] The degree of separability for a given factor’s conditional distribution in terms of the other factors gives a measure of how accurately the belief state for that factor will be propagated via that conditional distribution to the next timestep, in BK inference. [sent-91, score-1.522]

43 When a factor’s conditional distribution is highly separable in terms of the other factors, ignored dependencies between the other factors lead to relatively small errors in that factor’s marginal belief state after propagation. [sent-92, score-0.648]

44 We can hence use the degree of separability as an objective to be maximized: we want to find the factorization that yields the highest degree of separability for each factor’s conditional distribution. [sent-93, score-1.246]

45 Computing the degree of separability is a constrained optimization problem, and [12] gives an approximate method of solution. [sent-94, score-0.496]

46 For distributions over many variables the degree of separability is quite expensive to compute, as the number of variables in the optimization grows exponentially with the number of discrete variables in the input conditional distribution. [sent-95, score-0.843]

47 Computing the degree of separability for a small distribution is, however, reasonably efficient. [sent-96, score-0.527]

48 In adapting the degree of separability to a pairwise score for the min-cut algorithm, we took two approaches. [sent-97, score-0.738]

49 • Separability of the pair’s joint conditional distribution: We assign a score to the pair of canonical variables X and Y equal to the degree of separability for the joint conditional distribution $p(X_{t+1}, Y_{t+1} \mid \mathrm{Parents}(X_{t+1}) \cup \mathrm{Parents}(Y_{t+1}))$. [sent-98, score-0.977]

50 We want to maximize this value for variables that are joined in a factor, as a high degree of separability implies that the error of the factor marginal distribution after propagation in BK will be low. [sent-99, score-0.826]

51 Note that the degree of separability is defined in terms of groups of parent variables. [sent-100, score-0.555]

52 We compute the degree of separability for the above joint conditional distribution in terms of the parents taken separately. [sent-103, score-0.87]

53 • Non-separability between parents of a common child: If two parents are highly non-separable in a common child’s conditional distribution, then the child’s marginal distribution can be rendered inaccurate by placing these two parents in different components. [sent-104, score-0.822]

54 The strength of interaction between X and Y is defined to be the average degree of non-separability for each variable in $Z_{t+1}$ in terms of its parents taken separately. [sent-106, score-0.36]

55 The degree of non-separability is one minus the degree of separability. [sent-107, score-0.272]

56 [Mutual information] Whereas the degree of separability is a property of a single factor’s conditional distribution, the mutual information between two factors measures their joint dependencies. [sent-108, score-1.01]

57 All we are given is a DBN defining the conditional distribution over the next timeslice given the previous, and some initial distribution over the variables at time 1. [sent-110, score-0.27]

58 In order to obtain a suitable joint distribution over the variables at t + 1 we must assume a prior distribution over the variables at time t. [sent-111, score-0.324]

59 We then use this marginal to compute the mutual information between X and Y , thus estimating the degree of dependency between X and Y that results from one step of the process. [sent-114, score-0.399]

60 Again, we assume a prior distribution at time t and use this to obtain the joint distribution $p(Y_{t+1}, X_t)$, from which we can calculate their mutual information. [sent-116, score-0.325]

61 • Mutual information from the joint over both timeslices: We take into account all possible direct influences between X and Y, by computing the mutual information between the sets of variables $(X_t \cup X_{t+1})$ and $(Y_t \cup Y_{t+1})$. [sent-118, score-0.361]

62 As before, we assume a prior distribution at time t to compute a joint distribution $p((X_t \cup X_{t+1}) \cup (Y_t \cup Y_{t+1}))$, from which we can get the mutual information. [sent-119, score-0.325]

63 We can assume a uniform distribution, in which case the resulting mutual information values are exactly those introduced by one step of inference, as all variables are independent at time t. [sent-121, score-0.295]
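
A small sketch of the one-timestep mutual information score under a uniform prior, as described in sentences 58 to 63, might look like this. It assumes binary variables, no edges within the t+1 timeslice (so the two children are conditionally independent given the state at t), and conditional distributions supplied as plain Python callables. All names here are hypothetical.

```python
from itertools import product
from math import log2

def one_step_joint(cpd_x, cpd_y, n_parents):
    """Joint p(X_{t+1}, Y_{t+1}) under a uniform prior over binary states at time t.

    cpd_x(state) and cpd_y(state) return P(child = 1 | state at time t).
    """
    states = list(product((0, 1), repeat=n_parents))
    joint = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for s in states:
        px1, py1 = cpd_x(s), cpd_y(s)
        for a, b in joint:
            pa = px1 if a else 1.0 - px1
            pb = py1 if b else 1.0 - py1
            joint[(a, b)] += pa * pb / len(states)
    return joint

def mutual_information(joint):
    """Mutual information in bits between the two coordinates of a 2x2 joint."""
    px = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
    py = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}
    return sum(p * log2(p / (px[a] * py[b]))
               for (a, b), p in joint.items() if p > 0)

# Two toy cases: both children copy parent 0, or each copies a different parent.
copy_first = lambda s: float(s[0])
copy_second = lambda s: float(s[1])
mi_shared = mutual_information(one_step_joint(copy_first, copy_first, 2))
mi_separate = mutual_information(one_step_joint(copy_first, copy_second, 2))
```

When both children depend on the same parent, one step introduces a full bit of dependence; when their parents are disjoint and independent at time t, no dependence is introduced.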

64 [Empirical comparison] We compared the preceding pairwise scores by factoring randomly-generated DBNs, using the BK algorithm for belief state monitoring. [sent-126, score-0.471]

65 The first is the joint belief state error, which is the relative entropy between the product of the factor marginal belief states and the exact joint belief state. [sent-128, score-0.808]

66 The second is the average factor belief state error, which is the average over all factors of the relative entropy between each factor’s marginal distribution and the equivalent marginal distribution from the exact joint belief state. [sent-129, score-0.927]
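
The two error measures in sentences 65 and 66 can be illustrated with the sketch below. The direction of the relative entropy (exact distribution first) is an assumption, since the extracted text does not state it, and the dictionary-based representation and function names are hypothetical. The approximate marginals are assumed to be strictly positive wherever the exact distribution has mass.

```python
from math import log, prod

def marginal(joint, idx):
    """Marginalize a joint (dict: state tuple -> probability) onto the positions in idx."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def kl(p, q):
    """Relative entropy KL(p || q) in nats, assuming q > 0 wherever p > 0."""
    return sum(pv * log(pv / q[k]) for k, pv in p.items() if pv > 0)

def joint_error(exact, factors):
    """KL between the exact joint and the product of the approximate factor marginals."""
    approx = {s: prod(m[tuple(s[i] for i in idx)] for idx, m in factors) for s in exact}
    return kl(exact, approx)

def average_factor_error(exact, factors):
    """Mean over factors of the KL between exact and approximate factor marginals."""
    return sum(kl(marginal(exact, idx), m) for idx, m in factors) / len(factors)

# Toy belief state: two perfectly correlated binary variables, kept in separate factors.
exact = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
uniform = {(0,): 0.5, (1,): 0.5}
factors = [((0,), dict(uniform)), ((1,), dict(uniform))]
je = joint_error(exact, factors)
fe = average_factor_error(exact, factors)
```

In this toy case each factor marginal is exact, so the average factor error is zero, while the joint error is log 2 because the correlation between the factors is discarded.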

67 To generate context-specific independence, the variable’s parents were randomly permuted and between one half and all of the parents were chosen each to induce independence between the child variable and the parents lower in the tree, conditional upon one of its states. [sent-142, score-0.807]

68 These are suggested by Boyen and Koller as a means of controlling the mixing rate of factored inference, which is used to bound the error. [sent-145, score-0.29]

69 In all cases, the mutual-information based factorizations, and in particular the mutual information after one timestep, yielded lower error, both in the joint belief state and in the factor marginal belief states. [sent-146, score-0.798]

70 The degree of separability is apparently not well-adapted to a pairwise score, given that it is naturally defined in terms of an entire factor. [sent-147, score-0.608]

71 Two search algorithms allow us to use scores computed for whole factors, and to find better factors while sacrificing speed. [sent-149, score-0.448]

72 [Algorithms: Agglomerative clustering and local search] Agglomerative clustering begins with all canonical variables in separate factors, and at each step chooses a pair of factors to merge such that the score of the factorization is minimized. [sent-151, score-0.918]

73 As the factors being scored are always of relatively small size, agglomerative clustering allows us to use full-factor scores. [sent-154, score-0.593]
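
The agglomerative procedure in sentences 72 and 73 might be sketched as follows. The stopping rule (stop when no merge fits under a maximum factor size) and the stand-in score (total pairwise weight between variables in different factors) are assumptions for illustration; the paper's whole-factor scores would be passed in as `score_fn` instead. All names are hypothetical.

```python
from itertools import combinations

def cross_weight(factors, weights):
    """Stand-in score: total pairwise weight between variables in different factors."""
    return sum(weights.get(frozenset((a, b)), 0.0)
               for f, g in combinations(factors, 2) for a in f for b in g)

def agglomerative(variables, score_fn, max_size):
    """Greedily merge the pair of factors whose merge gives the lowest score."""
    factors = [frozenset([v]) for v in variables]
    while True:
        best = None
        for f, g in combinations(factors, 2):
            if len(f | g) > max_size:
                continue  # merged factor would exceed the size limit
            candidate = [h for h in factors if h not in (f, g)] + [f | g]
            s = score_fn(candidate)
            if best is None or s < best[0]:
                best = (s, candidate)
        if best is None:  # no admissible merge remains
            return factors
        factors = best[1]

# Toy pairwise weights (assumed values).
weights = {frozenset("AB"): 5.0, frozenset("CD"): 5.0,
           frozenset("AC"): 0.1, frozenset("BD"): 0.1}
clusters = agglomerative("ABCD", lambda f: cross_weight(f, weights), max_size=2)
```

Because only small merged factors are ever scored, an expensive whole-factor score stays affordable at each step.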

74 [Table 1: Random DBNs with pairwise scores (12 nodes). Columns: Joint KL, Factor KL. Rows: Out-degree; In-degree; Children of common parents; Parents of common children; Parent to child; Separability between parents; Separability of pairs of variables; Mut.] [sent-155, score-0.897]

75 Local search begins with some initial factorization and attempts to find a factorization of minimum score by iteratively modifying this factorization. [sent-198, score-0.577]

76 More specifically, from any given factorization, moves of the following three types are considered: create a new factor with a single node, move a single node from one factor into another, or swap a pair of nodes in different factors. [sent-199, score-0.424]

77 If there is no move that decreases the score (and so we have hit a local minimum), however, the factors are randomly re-initialized and the algorithm continues searching, terminating after a fixed number of iterations. [sent-202, score-0.328]

78 The factorization with the lowest score of all that were examined is returned. [sent-203, score-0.324]

79 As with agglomerative clustering, local search enables the use of full-factor scores. [sent-204, score-0.456]
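
The local search in sentences 75 to 79 can be sketched along these lines. Starting from singleton factors, taking the best-scoring move at each step, the form of the random re-initialization, and the stand-in `cross_weight` score are all assumptions made for this illustration, and every name is hypothetical.

```python
import random
from itertools import combinations

def cross_weight(factors, weights):
    """Stand-in score: total pairwise weight between variables in different factors."""
    return sum(weights.get(frozenset((a, b)), 0.0)
               for f, g in combinations(factors, 2) for a in f for b in g)

def neighbors(factors, max_size):
    """Yield factorizations one move away: split off a node, move a node, or swap two."""
    for i, f in enumerate(factors):
        for v in f:
            if len(f) > 1:  # create a new factor holding only v
                new = list(factors)
                new[i] = f - {v}
                yield new + [frozenset([v])]
            for j, g in enumerate(factors):
                if j != i and len(g) < max_size:  # move v from f into g
                    new = list(factors)
                    new[i], new[j] = f - {v}, g | {v}
                    yield [h for h in new if h]
    for (i, f), (j, g) in combinations(list(enumerate(factors)), 2):
        for v in f:
            for w in g:  # swap v and w between their factors
                new = list(factors)
                new[i], new[j] = (f - {v}) | {w}, (g - {w}) | {v}
                yield new

def local_search(variables, score_fn, max_size, iterations=20, seed=0):
    """Greedy descent with random restarts; returns the best (score, factorization) seen."""
    rng = random.Random(seed)
    current = [frozenset([v]) for v in variables]
    best_score, best = score_fn(current), current
    for _ in range(iterations):
        candidate = min(neighbors(current, max_size), key=score_fn, default=None)
        if candidate is not None and score_fn(candidate) < score_fn(current):
            current = candidate
        else:  # local minimum reached: re-initialize the factors at random
            order = list(variables)
            rng.shuffle(order)
            current = [frozenset(order[k:k + max_size])
                       for k in range(0, len(order), max_size)]
        if score_fn(current) < best_score:
            best_score, best = score_fn(current), current
    return best_score, best

weights = {frozenset("AB"): 5.0, frozenset("CD"): 5.0,
           frozenset("AC"): 0.1, frozenset("BD"): 0.1}
best_score, best = local_search("ABCD", lambda f: cross_weight(f, weights), max_size=2)
```

The best factorization over all iterations is returned, so restarts can only help; the cost is that every candidate move requires a fresh score evaluation.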

80 [Empirical comparison] We verified that the results for the pairwise scores extend to whole-factor scores on a dataset of 120 randomly-generated DBNs, each of which contained 8 binary-valued state variables. [sent-209, score-0.424]

81 We were significantly constrained in our choice of models by the complexity of computing the degree of separability for large distributions: even on these smaller models, doing agglomerative clustering with the degree of separability sometimes took over 2 hours and local search much longer. [sent-210, score-1.579]

82 We have therefore confined our comparison to agglomerative clustering on 8-variable models. [sent-211, score-0.395]

83 The mutual information after one timestep again produced the lowest error in both the factor marginal belief states and the joint belief state. [sent-213, score-0.897]

84 For the networks with large amounts of context-specific independence, the degree of separability was always close to one, and this might have hampered its effectiveness for clustering. [sent-214, score-0.526]

85 Interestingly, we see that agglomerative clustering can sometimes produce results that are worse than those for graph partitioning, although local search consistently outperforms the two. [sent-215, score-0.596]

86 This may be due to the fact that agglomerative clustering tends to produce smaller clusters than the divisive approach. [sent-216, score-0.395]

87 [Factoring real models] Boyen and Koller [3] demonstrated factored inference on two models that were factored by hand: the Bayesian Automated Taxi network and the water network. [sent-289, score-0.704]

88 In both cases automatic factorization recovered reasonable factorizations that performed better than those found manually. [sent-291, score-0.321]

89 Local search with factors of 5 or fewer variables yielded exactly the 5+5 clustering given in the paper. [sent-294, score-0.5]

90 Local search took about 300 seconds to complete, while agglomerative clustering took 138 seconds and graph min-cut 12 seconds. [sent-298, score-0.645]

91 It has 8 state variables and 4 observation variables (labeled A through H), and all variables are discrete with 3 or 4 states. [sent-300, score-0.358]

92 The agglomerative and local search algorithms yielded the same result ([A+B+C+E], [D+F+G+H]) and graph min-cut was only slightly different ([A+C+E], [D+F+G+H], [B]). [sent-301, score-0.542]

93 Local search took about one minute to complete, while agglomerative clustering took 30 seconds and graph min-cut 3 seconds. [sent-305, score-0.645]

94 These techniques attempt to minimize an objective score that captures the extent to which dependencies that are ignored by the factored approximation will lead to error. [sent-307, score-0.412]

95 The heuristics we examined are based both on the structure of the 2-TBN and on the concepts of degree of separability and mutual information. [sent-308, score-0.792]

96 The mutual information after one step of belief propagation ... [Table 3: Algorithm performance, 12-var.] [sent-309, score-0.338]

97 Recursive min-cut efficiently uses scores between pairs of variables, while agglomerative clustering and local search both use scores computed between whole factors; the latter two are slower but achieve better results. [sent-340, score-1.016]

98 Automatic factorization can extend the applicability of factored inference to larger models for which it is undesirable to find factors manually. [sent-341, score-0.82]

99 The factored frontier algorithm for approximate inference in DBNs. [sent-356, score-0.392]

100 Heuristics for automatically decomposing a dynamic Bayesian network for factored inference. [sent-382, score-0.42]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('separability', 0.36), ('agglomerative', 0.313), ('factored', 0.29), ('parents', 0.224), ('dbn', 0.214), ('dbns', 0.211), ('factorization', 0.201), ('factors', 0.198), ('mutual', 0.197), ('timestep', 0.151), ('belief', 0.141), ('degree', 0.136), ('timeslices', 0.133), ('scores', 0.124), ('xt', 0.119), ('avi', 0.114), ('pairwise', 0.112), ('xy', 0.109), ('yt', 0.107), ('variables', 0.098), ('factor', 0.095), ('search', 0.094), ('boyen', 0.082), ('clustering', 0.082), ('score', 0.081), ('engstatus', 0.076), ('ydot', 0.076), ('bk', 0.073), ('factorizations', 0.07), ('joint', 0.066), ('marginal', 0.066), ('dynamic', 0.065), ('state', 0.064), ('children', 0.061), ('graph', 0.058), ('heuristics', 0.057), ('frogner', 0.057), ('frontbackstatus', 0.057), ('fwdact', 0.057), ('harvard', 0.057), ('inlane', 0.057), ('latact', 0.057), ('leftclr', 0.057), ('timeslice', 0.057), ('child', 0.054), ('separable', 0.054), ('conditional', 0.053), ('inference', 0.053), ('px', 0.052), ('kl', 0.052), ('automatic', 0.05), ('frontier', 0.049), ('took', 0.049), ('local', 0.049), ('bayesian', 0.048), ('water', 0.043), ('pxy', 0.042), ('stopped', 0.042), ('examined', 0.042), ('dependencies', 0.041), ('error', 0.04), ('applicability', 0.04), ('zt', 0.04), ('arents', 0.038), ('bat', 0.038), ('conceptions', 0.038), ('determinism', 0.038), ('keiji', 0.038), ('merger', 0.038), ('pfeffer', 0.038), ('rightclr', 0.038), ('taxi', 0.038), ('undesireable', 0.038), ('xavier', 0.038), ('xdot', 0.038), ('automatically', 0.037), ('koller', 0.036), ('automated', 0.034), ('swap', 0.033), ('charlie', 0.033), ('system', 0.033), ('canonical', 0.033), ('whole', 0.032), ('groups', 0.031), ('distribution', 0.031), ('intended', 0.031), ('propagate', 0.031), ('kevin', 0.03), ('factoring', 0.03), ('networks', 0.03), ('incurred', 0.03), ('manual', 0.029), ('independence', 0.028), ('exact', 0.028), ('yielded', 0.028), ('parent', 0.028), ('daphne', 0.028), ('stuart', 0.028), ('yair', 0.028), ('network', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Author: Charlie Frogner, Avi Pfeffer

2 0.15301384 187 nips-2007-Structured Learning with Approximate Inference

Author: Alex Kulesza, Fernando Pereira

Abstract: In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that without understanding combinations of inference and learning, such as these, that are appropriately compatible, learning performance under approximate inference cannot be guaranteed.

3 0.13208544 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. [1 Introduction and preliminaries] The problem of on-line learning linear-threshold functions from labeled data is one which has spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: on the one hand, linear functions combine flexibility with analytical and computational tractability; on the other hand, online algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) has greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incremental learning of (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.): any list of references would not do justice to the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples.
We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where "higher" means here "higher than one", i.e., "higher than first-order" descent algorithms such as gradient descent or standard Perceptron-like algorithms). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this also has computational advantages, allowing us to formulate an efficient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. [Footnote ∗: The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reflects the authors' views.] The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: we compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well.
Among our experimental findings are the following: 1) Higher-order Perceptron always outperforms the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) when dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial $t = 1, 2, \ldots$ the prediction algorithm is given an instance vector in $\mathbb{R}^n$ (for simplicity, all vectors are normalized, i.e., $\|x_t\| = 1$, where $\|\cdot\|$ is the Euclidean norm unless otherwise specified), and then guesses the binary label $y_t \in \{-1, 1\}$ associated with $x_t$. We denote the algorithm's prediction by $\hat{y}_t \in \{-1, 1\}$. Then the true label $y_t$ is disclosed. In the case when $\hat{y}_t \neq y_t$ we say that the algorithm has made a prediction mistake. We call an example a pair $(x_t, y_t)$, and a sequence of examples $S$ any sequence $S = (x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$. In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors $u \in \{v \in \mathbb{R}^n : \|v\| = 1\}$. In this case, a common way of measuring the (relative) prediction performance of an algorithm $A$ is to compare the total number of mistakes of $A$ on $S$ to some measure of the linear separability of $S$. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) $D_\gamma(u; S)$ of $S$ w.r.t. a linear classifier $u$ at a given margin value $\gamma > 0$: $D_\gamma(u; S) = \sum_{t=1}^{T} \max\{0, \gamma - y_t u^\top x_t\}$ (observe that $D_\gamma(u; S)$ vanishes if and only if $u$ separates $S$ with margin at least $\gamma$). A mistake-driven algorithm $A$ is one which updates its internal state only upon mistakes.
One can therefore associate with the run of $A$ on $S$ a subsequence $M = M(S, A) \subseteq \{1, \ldots, T\}$ of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine $D_\gamma(u; S)$ so as to include only trials in $M$: $D_\gamma(u; S) = \sum_{t \in M} \max\{0, \gamma - y_t u^\top x_t\}$. This gives bounds on $A$'s performance relative to the best $u$ over a sequence of examples produced (or, actually, selected) by $A$ during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of $A$ on $S$ is contrasted to the cumulative hinge loss of the best $u$ on a transformed sequence $\tilde{S} = ((\tilde{x}_{i_1}, y_{i_1}), (\tilde{x}_{i_2}, y_{i_2}), \ldots, (\tilde{x}_{i_m}, y_{i_m}))$, where each instance $x_{i_k}$ gets transformed into $\tilde{x}_{i_k}$ through a mapping depending only on the past behavior of the algorithm (i.e., only on examples up to trial $t = i_{k-1}$). As we will see in Section 3, this new sequence $\tilde{S}$ tends to be "more separable" than the original sequence, in the sense that if $S$ is linearly separable with some margin, then the transformed sequence $\tilde{S}$ is likely to be separable with a larger margin. [2 The Higher-order Perceptron algorithm] The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters $\rho_1, \rho_2, \ldots$, and maintains a product matrix $B_k$ (initialized to the identity matrix $I$) and a sum vector $v_k$ (initialized to 0). Both of them are indexed by $k$, a counter storing the current number of mistakes (plus one). Upon receiving the $t$-th normalized instance vector $x_t \in \mathbb{R}^n$, the algorithm computes its binary prediction value $\hat{y}_t$ as the sign of the inner product between vector $B_{k-1} v_{k-1}$ and vector $B_{k-1} x_t$. If $\hat{y}_t \neq y_t$ then matrix $B_{k-1}$ is updated multiplicatively as $B_k = B_{k-1}(I - \rho_k x_t x_t^\top)$ while vector $v_{k-1}$ is updated additively through the standard Perceptron rule $v_k = v_{k-1} + y_t x_t$. The new matrix $B_k$ and the new vector $v_k$ will be used in the next trial.
If $\hat{y}_t = y_t$ no update is performed (hence the algorithm is mistake driven). Observe that setting $\rho_k = 0$ for all k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm's behavior be substantially different from Perceptron's) we need to ensure $\sum_{k=1}^{\infty} \rho_k = \infty$. In the sequel, our standard choice will be $\rho_k = c/k$, with $c \in (0, 1)$. See Sections 3 and 4.

Parameters: $\rho_1, \rho_2, \ldots \in [0, 1)$.
Initialization: $B_0 = I$; $v_0 = 0$; $k = 1$.
Repeat for $t = 1, 2, \ldots, T$:
  1. Get instance $x_t \in \mathbb{R}^n$, $\|x_t\| = 1$;
  2. Predict $\hat{y}_t = \mathrm{SGN}(w_{k-1}^\top x_t) \in \{-1, +1\}$, where $w_{k-1} = B_{k-1}^\top B_{k-1} v_{k-1}$;
  3. Get label $y_t \in \{-1, +1\}$;
  4. If $\hat{y}_t \neq y_t$ then: $v_k = v_{k-1} + y_t x_t$; $B_k = B_{k-1}(I - \rho_k x_t x_t^\top)$; $k \leftarrow k + 1$.
Figure 1: The Higher-order Perceptron algorithm (for p = 2).

Implementing Higher-order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits.

1) Primal version. We store and update an $n \times n$ matrix $A_k = B_k^\top B_k$ and an n-dimensional column vector $v_k$. Matrix $A_k$ is updated as $A_k = A_{k-1} - \rho A_{k-1} x x^\top - \rho x x^\top A_{k-1} + \rho^2 (x^\top A_{k-1} x) x x^\top$, taking $O(n^2)$ operations, while $v_k$ is updated as in Figure 1. Computing the algorithm's margin $v^\top A x$ can then be carried out in time quadratic in the dimension n of the input space.

2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by $X_k$ the $n \times k$ matrix whose columns are the n-dimensional instance vectors $x_1, \ldots, x_k$ where a mistake occurred so far, and by $y_k$ the k-dimensional column vector of the corresponding labels. We store and update the $k \times k$ matrix $D_k = [d^{(k)}_{i,j}]_{i,j=1}^{k}$, the $k \times k$ diagonal matrix $H_k = \mathrm{DIAG}\{h_k\}$, $h_k = (h^{(k)}_1, \ldots, h^{(k)}_k)^\top = X_k^\top X_k y_k$, and the k-dimensional column vector $g_k = y_k + D_k H_k 1_k$, $1_k$ being a vector of k ones.
If we interpret the primal matrix $A_k$ above as $A_k = I + \sum_{i,j=1}^{k} d^{(k)}_{i,j} x_i x_j^\top$, it is not hard to show that the margin value $w_{k-1}^\top x$ is equal to $g_{k-1}^\top X_{k-1}^\top x$, and can be computed through $O(k)$ extra inner products. Now, on the k-th mistake, vector g can be updated with $O(k^2)$ extra inner products by updating D and H in the following way. We let $D_0$ and $H_0$ be empty matrices. Then, given $D_{k-1}$ and $H_{k-1} = \mathrm{DIAG}\{h_{k-1}\}$, we have (footnote 1) $D_k = \begin{bmatrix} D_{k-1} & -\rho_k b_k \\ -\rho_k b_k^\top & d^{(k)}_{k,k} \end{bmatrix}$, where $b_k = D_{k-1} X_{k-1}^\top x_k$, and $d^{(k)}_{k,k} = \rho_k^2 x_k^\top X_{k-1} b_k - 2\rho_k + \rho_k^2$. On the other hand, $H_k = \mathrm{DIAG}\{h_{k-1} + y_k X_{k-1}^\top x_k,\; h^{(k)}_k\}$, with $h^{(k)}_k = y_{k-1}^\top X_{k-1}^\top x_k + y_k$. Observe that on trials when $\rho_k = 0$ matrix $D_{k-1}$ is padded with a zero row and a zero column. This amounts to saying that matrix $A_k = I + \sum_{i,j=1}^{k} d^{(k)}_{i,j} x_i x_j^\top$ is not updated, i.e., $A_k = A_{k-1}$. A closer look at the above update mechanism allows us to conclude that the overall number of extra inner products needed to compute $g_k$ is actually quadratic only in the number of past mistaken trials having $\rho_k > 0$. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4).

3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute $B_k z$ by unwrapping $B_k$ as in $B_k z = B_{k-1}(I - \rho x x^\top) z = B_{k-1} z'$, where vector $z' = z - \rho x x^\top z$ can be calculated in time $O(n)$. Thus computing the margin $v^\top B_{k-1}^\top B_{k-1} x$ actually takes $O(nk)$. Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]).
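As a rough illustration of the primal and implicit primal versions for p = 2, the following minimal sketch runs both on the same toy stream and lets one check that they produce the same margins. The class names, the helper `make_examples`, the convention SGN(0) = +1, and the toy data are assumptions made for this example only; the schedule ρ_k = c/k is the standard choice mentioned in the text.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sign(z):
    # Assumption: ties are broken towards +1 (the text does not fix SGN(0)).
    return 1 if z >= 0 else -1


class PrimalHOPerceptron:
    """Primal version: stores A = B^T B explicitly (O(n^2) per update)."""

    def __init__(self, n, c=0.4):
        self.n, self.c, self.k = n, c, 1
        self.A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        self.v = [0.0] * n

    def margin(self, x):
        # v^T A x
        return dot(self.v, [dot(row, x) for row in self.A])

    def update(self, x, y):
        rho = self.c / self.k
        Ax = [dot(row, x) for row in self.A]  # A is symmetric, so x^T A = (A x)^T
        xAx = dot(x, Ax)
        for i in range(self.n):
            for j in range(self.n):
                # A <- A - rho A x x^T - rho x x^T A + rho^2 (x^T A x) x x^T
                self.A[i][j] += (-rho * Ax[i] * x[j] - rho * x[i] * Ax[j]
                                 + rho * rho * xAx * x[i] * x[j])
        self.v = [vi + y * xi for vi, xi in zip(self.v, x)]
        self.k += 1


class ImplicitHOPerceptron:
    """Implicit primal version: stores the factors (rho_i, x_i) of B."""

    def __init__(self, n, c=0.4):
        self.c, self.k = c, 1
        self.factors = []
        self.v = [0.0] * n

    def apply_B(self, z):
        # B_k = (I - rho_1 x_1 x_1^T) ... (I - rho_k x_k x_k^T):
        # the most recent factor acts on z first.
        z = list(z)
        for rho, x in reversed(self.factors):
            s = rho * dot(x, z)
            z = [zi - s * xi for zi, xi in zip(z, x)]
        return z

    def margin(self, x):
        # (B v)^T (B x), computed in O(nk)
        return dot(self.apply_B(self.v), self.apply_B(x))

    def update(self, x, y):
        self.factors.append((self.c / self.k, list(x)))
        self.v = [vi + y * xi for vi, xi in zip(self.v, x)]
        self.k += 1


def run(learner, examples):
    """Mistake-driven loop of Figure 1; returns the number of mistakes."""
    mistakes = 0
    for x, y in examples:
        if sign(learner.margin(x)) != y:
            learner.update(x, y)
            mistakes += 1
    return mistakes


def make_examples(n, T, seed=0):
    """Hypothetical toy stream: unit-norm instances labeled by a random hyperplane."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    out = []
    for _ in range(T):
        x = [rng.gauss(0, 1) for _ in range(n)]
        norm = math.sqrt(dot(x, x))
        x = [xi / norm for xi in x]
        out.append((x, sign(dot(u, x))))
    return out
```

The explicit version pays O(n^2) per update regardless of the number of mistakes, whereas the implicit version pays O(nk) per margin computation, which is why the latter may be preferable when n is large and mistakes are few.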
We obtain a dual norms algorithm by introducing a norm parameter $p \geq 2$, and the associated gradient mapping (footnote 2) $g : \theta \in \mathbb{R}^n \to \nabla_\theta \|\theta\|_p^2 / 2 \in \mathbb{R}^n$. Then, in Figure 1, we normalize instance vectors $x_t$ w.r.t. the p-norm, we define $w_{k-1} = B_{k-1}^\top g(B_{k-1} v_{k-1})$, and generalize the matrix update as $B_k = B_{k-1}(I - \rho_k x_t g(x_t)^\top)$. As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the "second-order" information contained in the matrix $B_k$. One can easily see that the above-mentioned argument for computing the margin $g(B_{k-1} v_{k-1})^\top B_{k-1} x$ in time $O(nk)$ still holds.

Footnote 1: Observe that, by construction, $D_k$ is a symmetric matrix.
Footnote 2: This mapping has also been used in [12, 11]. Recall that setting $p = O(\log n)$ yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity.

3 Analysis

We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence $\tilde{S} = (B_0 x_{t(1)}, y_{t(1)}), (B_1 x_{t(2)}, y_{t(2)}), (B_2 x_{t(3)}, y_{t(3)}), \ldots$ (1), where $t(k)$ is the trial where the k-th mistake occurs, and $B_k$ is the k-th matrix produced by the algorithm. Observe that each feature vector $x_{t(k)}$ gets transformed by a matrix $B_{k-1}$ depending on past examples only. This is relevant to the argument that $\tilde{S}$ tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat "on-line structure" does not seem to be shared by other competing higher-order algorithms, such as the "ridge regression-like" algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when $p \geq 2$.

Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples $S = (x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$.
Let the sequence of parameters $\rho_k$ satisfy $0 \leq \rho_k \leq \frac{1 - c}{1 + |v_{k-1}^\top x_t|}$, where $x_t$ is the k-th mistaken instance vector, and $c \in (0, 1]$. Then the total number m of mistakes satisfies (footnote 3)

$$m \leq \alpha \frac{D_\gamma(u; \tilde{S}_c)}{\gamma} + \frac{\alpha^2}{2\gamma^2} + \frac{\alpha}{\gamma} \sqrt{\alpha \frac{D_\gamma(u; \tilde{S}_c)}{\gamma} + \frac{\alpha^2}{4\gamma^2}}, \qquad (2)$$

holding for any $\gamma > 0$ and any unit norm vector $u \in \mathbb{R}^n$, where $\alpha = \alpha(c) = (2 - c)/c$.

Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence $S = (x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ and let $M \subseteq \{1, 2, \ldots, T\}$ be the set of trials where the algorithm in Figure 1 made a mistake. Let $t = t(k)$ be the trial where the k-th mistake occurred. We study the evolution of $\|B_k v_k\|^2$ over mistaken trials. Notice that the matrix $B_k^\top B_k$ is positive semidefinite for any k. We can write

$$\|B_k v_k\|^2 = \|B_{k-1}(I - \rho_k x_t x_t^\top)(v_{k-1} + y_t x_t)\|^2$$ (from the update rules $v_k = v_{k-1} + y_t x_t$ and $B_k = B_{k-1}(I - \rho_k x_t x_t^\top)$)
$$= \|B_{k-1} v_{k-1} + y_t (1 - \rho_k y_t v_{k-1}^\top x_t - \rho_k) B_{k-1} x_t\|^2$$ (using $\|x_t\| = 1$)
$$= \|B_{k-1} v_{k-1}\|^2 + 2 y_t r_k v_{k-1}^\top B_{k-1}^\top B_{k-1} x_t + r_k^2 \|B_{k-1} x_t\|^2,$$

where we set for brevity $r_k = 1 - \rho_k y_t v_{k-1}^\top x_t - \rho_k$. We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure $r_k \geq 0$. Observe that when $y_t v_{k-1}^\top x_t \geq 0$ we have $r_k \geq 0$ if and only if $\rho_k \leq 1/(1 + y_t v_{k-1}^\top x_t)$. On the other hand, if $y_t v_{k-1}^\top x_t < 0$ then, in order for $r_k$ to be nonnegative, it suffices to pick $\rho_k \leq 1$. In both cases $\rho_k \leq (1 - c)/(1 + |v_{k-1}^\top x_t|)$ implies $r_k \geq c > 0$, and also $r_k^2 \leq (1 + \rho_k |v_{k-1}^\top x_t| - \rho_k)^2 \leq (2 - c)^2$. Now, using $y_t v_{k-1}^\top B_{k-1}^\top B_{k-1} x_t \leq 0$ (combined with $r_k \geq 0$), we conclude that $\|B_k v_k\|^2 - \|B_{k-1} v_{k-1}\|^2 \leq (2 - c)^2 \|B_{k-1} x_t\|^2 = (2 - c)^2 x_t^\top A_{k-1} x_t$, where we set $A_k = B_k^\top B_k$. A simple (and crude; see footnote 4) upper bound on the last term follows by observing that $\|x_t\| = 1$ implies $x_t^\top A_{k-1} x_t \leq \|A_{k-1}\|$, the spectral norm (largest eigenvalue) of $A_{k-1}$.
Since a factor matrix of the form $(I - \rho x x^\top)$ with $\rho \leq 1$ and $\|x\| = 1$ has spectral norm one, we have $x_t^\top A_{k-1} x_t \leq \|A_{k-1}\| \leq \prod_{i=1}^{k-1} \|I - \rho_i x_{t(i)} x_{t(i)}^\top\|^2 \leq 1$. Therefore, summing over $k = 1, \ldots, m = |M|$ (or, equivalently, over $t \in M$) and using $v_0 = 0$ yields the upper bound

$$\|B_m v_m\|^2 \leq (2 - c)^2 m. \qquad (3)$$

To find a lower bound of the left-hand side of (3), we first pick any unit norm vector $u \in \mathbb{R}^n$, and apply the standard Cauchy-Schwarz inequality: $\|B_m v_m\| \geq u^\top B_m v_m$. Then, we observe that for a generic trial $t = t(k)$ the update rule of our algorithm allows us to write $u^\top B_k v_k - u^\top B_{k-1} v_{k-1} = r_k y_t u^\top B_{k-1} x_t \geq r_k (\gamma - \max\{0, \gamma - y_t u^\top B_{k-1} x_t\})$, where the last inequality follows from $r_k \geq 0$ and holds for any margin value $\gamma > 0$. We sum the above over $k = 1, \ldots, m$ and exploit $c \leq r_k \leq 2 - c$ after rearranging terms. This gives $\|B_m v_m\| \geq u^\top B_m v_m \geq c \gamma m - (2 - c) D_\gamma(u; \tilde{S}_c)$. Combining with (3) and solving for m gives the claimed bound.

Footnote 3: The subscript c in $\tilde{S}_c$ emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have $\rho_k = 0$ for any k and $\alpha = 1$, thereby recovering the standard Perceptron bound for nonseparable sequences (see, e.g., [12]).
Footnote 4: A slightly more refined bound can be derived which depends on the trace of matrices $I - A_k$. Details will be given in the full version of this paper.

From the above result one can see that our algorithm might be viewed as a standard Perceptron algorithm operating on the transformed sequence $\tilde{S}_c$ in (1). We now give a qualitative argument, which is suggestive of the improved margin properties of $\tilde{S}_c$. Assume for simplicity that all examples $(x_t, y_t)$ in the original sequence are correctly classified by hyperplane u with the same margin $\gamma = y_t u^\top x_t > 0$, where $t = t(k)$. According to Theorem 1, the parameters $\rho_1, \rho_2, \ldots$ should be small positive numbers. Assume, again for simplicity, that all $\rho_k$ are set to the same small enough value $\rho > 0$.
Then, up to first order, matrix $B_k = \prod_{i=1}^{k} (I - \rho x_{t(i)} x_{t(i)}^\top)$ can be approximated as $B_k \simeq I - \rho \sum_{i=1}^{k} x_{t(i)} x_{t(i)}^\top$. Then, to the extent that the above approximation holds, we can write (footnote 5):

$$y_t u^\top B_{k-1} x_t = y_t u^\top \Big(I - \rho \sum_{i=1}^{k-1} x_{t(i)} x_{t(i)}^\top\Big) x_t = y_t u^\top \Big(I - \rho \sum_{i=1}^{k-1} y_{t(i)} x_{t(i)} \, y_{t(i)} x_{t(i)}^\top\Big) x_t$$
$$= y_t u^\top x_t - \rho \, y_t \sum_{i=1}^{k-1} y_{t(i)} u^\top x_{t(i)} \, y_{t(i)} x_{t(i)}^\top x_t = \gamma - \rho \, \gamma \, y_t v_{k-1}^\top x_t.$$

Now, $y_t v_{k-1}^\top x_t$ is the margin of the (first-order) Perceptron vector $v_{k-1}$ over a mistaken trial for the Higher-order Perceptron vector $w_{k-1}$. Since the two vectors $v_{k-1}$ and $w_{k-1}$ are correlated (recall that $v_{k-1}^\top w_{k-1} = v_{k-1}^\top B_{k-1}^\top B_{k-1} v_{k-1} = \|B_{k-1} v_{k-1}\|^2 \geq 0$) the mistaken condition $y_t w_{k-1}^\top x_t \leq 0$ is more likely to imply $y_t v_{k-1}^\top x_t \leq 0$ than the opposite. This tends to yield a margin larger than the original margin $\gamma$. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update $B_{k-1} \to B_k$ might be skipped (this is equivalent to setting $\rho_k = 0$), and Theorem 1 would still hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence.

4 Experiments

We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2.

Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks.
We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are:
i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11.
ii. The "Lymphoma" dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times.
On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used.

We also artificially generated two (moderately) sparse learning problems with margin $\gamma \geq 0.005$ at labeling noise levels $\eta = 0.0$ (linearly separable) and $\eta = 0.1$, respectively. The datasets have been generated at random by first generating two (normalized) target vectors $u \in \{-1, 0, +1\}^{500}$, where the first 50 components are selected independently at random in $\{-1, +1\}$ and the remaining 450 components are 0.

Footnote 5: Again, a similar argument holds in the more general setting $p \geq 2$. The reader should notice how important the dependence of $B_k$ on the past is to this argument.

Then we set $\eta = 0.0$ for the first target and $\eta = 0.1$ for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from $[-1, +1]^{500}$ and then normalized.
If $u \cdot x_t \geq \gamma$ then a +1 label is associated with $x_t$. If $u \cdot x_t \leq -\gamma$ then a −1 label is associated with $x_t$. The labels so obtained are flipped with probability $\eta$. If $|u \cdot x_t| < \gamma$ then $x_t$ is rejected and a new vector $x_t$ is drawn. We call the two datasets "Artificial 0.0" and "Artificial 0.1". We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used.

Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF-IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by "binarizing" consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent (footnote 6) categories, respectively, within the first 20,000 news stories of RCV1. We call these datasets $\mathrm{RCV1}_x$, where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions.

Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks: the USPS dataset and the MNIST dataset [16] and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set.
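Returning to the artificial datasets of Task 1, the generation procedure described there (sparse normalized target, instances drawn from $[-1, +1]^{500}$ and normalized, rejection inside the margin, label flipping with probability η) can be sketched as follows. This is only an illustrative sketch: the function names, the seeding scheme, and the use of Python's `random` module are assumptions, not details taken from the paper.

```python
import math
import random


def make_target(n=500, k=50, seed=0):
    """Normalized target: first k components in {-1, +1}, remaining n - k are 0."""
    rng = random.Random(seed)
    u = [rng.choice((-1.0, 1.0)) for _ in range(k)] + [0.0] * (n - k)
    norm = math.sqrt(k)  # Euclidean norm of a vector with k entries equal to +-1
    return [ui / norm for ui in u]


def sample_example(u, gamma, eta, rng):
    """Draw one (x, y) pair, rejecting instances with |u . x| < gamma."""
    n = len(u)
    while True:
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
        m = sum(ui * xi for ui, xi in zip(u, x))
        if abs(m) < gamma:
            continue  # inside the margin: reject and redraw
        y = 1 if m > 0 else -1
        if rng.random() < eta:
            y = -y  # labeling noise
        return x, y


def make_dataset(u, size, gamma=0.005, eta=0.0, seed=0):
    """Hypothetical helper: `size` examples at margin gamma and noise rate eta."""
    rng = random.Random(seed)
    return [sample_example(u, gamma, eta, rng) for _ in range(size)]
```

Under these assumptions, a call such as `make_dataset(make_target(), 1000, eta=0.1)` would play the role of the noisy training set, with a different seed used for the test set.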
The results in Table 3 only refer to the best parameter settings for each kernel.

Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higher-order Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was "implicit primal" when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set (footnote 7) $\rho_k = c/k$, with c = 0, 0.2, 0.4, 0.6, 0.8. Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron. On a mistaken round $t = t(k)$, this algorithm sets $\rho_k = c/k$ if $y_t v_{k-1}^\top x_t \geq 0$, and $\rho_k = 0$ otherwise (thus, when $y_t v_{k-1}^\top x_t < 0$ the matrix $B_{k-1}$ is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO ("First Order"), the Second-order algorithm is denoted by SO ("Second Order"), while the Higher-order algorithm with norm parameter p and $\rho_k = c/k$ is abbreviated as $\mathrm{HO}_p(c)$. Thus, for instance, FO = $\mathrm{HO}_2(0)$.

Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments $\mathrm{HO}_p(c)$ with c > 0 outperforms $\mathrm{HO}_p(0)$. On the other hand, the comparison $\mathrm{HO}_p(c)$ vs. SO depends on the specific task. On the DNA datasets, $\mathrm{HO}_p(c)$ with c > 0 is clearly superior in Breast. On Lymphoma, $\mathrm{HO}_p(c)$ gets worse as p increases.
This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, $\mathrm{HO}_2$ turns out to be only slightly worse than SO. On the artificial datasets $\mathrm{HO}_p(c)$ with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, $\mathrm{HO}_2$ tends to perform better than SO. On USPS, $\mathrm{HO}_2$ is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), and it is slightly inferior to SO when using polynomial kernels. The sparse version of $\mathrm{HO}_2$ cuts the matrix updates roughly by half, still maintaining a good performance. In all cases $\mathrm{HO}_2$ (either sparse or not) significantly outperforms FO.

Footnote 6: We did not use the most frequent category because of its significant overlap with the other ones.
Footnote 7: Notice that this setting fulfills the condition on $\rho_k$ stated in Theorem 1.

Table 1: Training and test error on the two datasets "Breast" and "Lymphoma". Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set. The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, $\mathrm{HO}_2$ differs from FO because of the c parameter. We emphasized the comparison $\mathrm{HO}_7(0)$ vs. $\mathrm{HO}_7(c)$ with best c among the tested values. According to the Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table.

                    FO      HO2     HO4     HO7(0)  HO7     HO10    SO
  Breast    TRAIN   45.2    21.7    24.5    47.4    24.5    32.4    29.6
            TEST    23.4%   16.4%   13.3%   15.7%   12.0%   13.5%   15.0%
  Lymphoma  TRAIN   22.1    19.6    18.9    23.0    20.0    23.1    19.3
            TEST    11.8%   10.0%   10.0%   11.5%   11.5%   11.9%   9.6%

[Figure 2 plots. Top panels: "Training updates vs training epochs" on Artificial 0.0 and on Artificial 0.1 (y-axis: # of training updates; x-axis: # of training epochs). Bottom panels: "Test error rates vs training epochs" on Artificial 0.0 and "Test error rates (minus 10%) vs training epochs" on Artificial 0.1 (y-axis: test error rates; x-axis: # of training epochs). Curves: FO = HO2(0.0), SO (a = 0.2), HO2(0.4), HO4(0.4), HO7(0.0), HO7(0.4).]

Figure 2: Experiments on the two artificial datasets (Artificial 0.0, on the left, and Artificial 0.1, on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial 0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms $\mathrm{HO}_p(0.0)$ and $\mathrm{HO}_p(c)$ with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former.

In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary classification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure.
Like other algorithms, $\mathrm{HO}_p$ can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]).

Table 2: Experimental results on the four binary classification tasks derived from RCV1. "Train" denotes the number of training corrections, while "Test" gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset ($\mathrm{RCV1}_x$, x = 70, 101, 4, 59) and phase (training or test).

              FO               HO2              SO
              TRAIN   TEST     TRAIN   TEST     TRAIN   TEST
  RCV1_70     993     7.20%    941     6.83%    880     6.95%
  RCV1_101    673     6.39%    665     5.81%    677     5.48%
  RCV1_4      803     6.14%    783     5.94%    819     6.05%
  RCV1_59     767     6.45%    762     6.04%    760     6.84%

Table 3: Experimental results on the OCR tasks. "Train" denotes the total number of training corrections, summed over the 10 categories, while "Test" denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of $\mathrm{HO}_2$ we also reported (in parentheses) the number of matrix updates during training. In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test).

                  FO               HO2              Sparse HO2             SO
                  TRAIN   TEST     TRAIN   TEST     TRAIN         TEST     TRAIN   TEST
  USPS   Gauss    1385    6.53%    945     4.76%    965 (440)     5.13%    1003    5.05%
  USPS   Poly     1609    7.37%    1090    5.71%    1081 (551)    5.52%    1054    5.53%
  MNIST  Gauss    5834    2.10%    5351    1.79%    5363 (2596)   1.81%    5684    1.82%
  MNIST  Poly     8148    3.04%    6404    2.27%    6476 (3311)   2.28%    6440    2.03%

References
[1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
[2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342.
[3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150.
[4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246.
[5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619.
[6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485.
[7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal of Computing, 34(3), 640–668.
[8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linear-threshold algorithms. JMLR, 7, 1205–1230.
[9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266.
[11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242.
[12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299.
[13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210.
[14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984.
[15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343.
[16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60.
[17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387.
[18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318.
[19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.
[20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004.
[21] A.B.J. Novikov (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622.
[22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/.
[23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437.
[24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press.
[25] V. Vovk (2001). Competitive on-line statistics. International Statistical Review, 69, 213–248.

4 0.109279 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time.

5 0.09679386 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

6 0.091773853 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

7 0.086580597 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

8 0.083150879 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

9 0.080769792 215 nips-2007-What makes some POMDP problems easy to approximate?

10 0.079943962 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

11 0.076758966 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

12 0.072417274 21 nips-2007-Adaptive Online Gradient Descent

13 0.07107763 213 nips-2007-Variational Inference for Diffusion Processes

14 0.067036405 30 nips-2007-Bayes-Adaptive POMDPs

15 0.066188954 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

16 0.063713424 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

17 0.063485689 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

18 0.059864566 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

19 0.059359223 63 nips-2007-Convex Relaxations of Latent Variable Training

20 0.058808059 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, -0.038), (2, -0.008), (3, -0.016), (4, -0.035), (5, -0.116), (6, -0.031), (7, 0.041), (8, 0.021), (9, -0.155), (10, 0.052), (11, 0.086), (12, 0.015), (13, 0.006), (14, -0.119), (15, 0.036), (16, -0.043), (17, -0.078), (18, 0.037), (19, -0.018), (20, 0.17), (21, -0.105), (22, -0.014), (23, 0.105), (24, -0.015), (25, 0.12), (26, 0.068), (27, 0.088), (28, 0.05), (29, -0.011), (30, 0.027), (31, 0.088), (32, -0.124), (33, -0.026), (34, -0.007), (35, -0.051), (36, -0.106), (37, 0.048), (38, 0.057), (39, 0.02), (40, 0.093), (41, -0.093), (42, -0.032), (43, -0.016), (44, -0.045), (45, -0.082), (46, -0.028), (47, -0.007), (48, 0.028), (49, -0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95362151 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Author: Charlie Frogner, Avi Pfeffer

2 0.63136905 187 nips-2007-Structured Learning with Approximate Inference

Author: Alex Kulesza, Fernando Pereira

3 0.61044598 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

4 0.50253451 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

5 0.48638555 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads.

6 0.46363783 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

7 0.45521697 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

8 0.45260027 215 nips-2007-What makes some POMDP problems easy to approximate?

9 0.42996117 30 nips-2007-Bayes-Adaptive POMDPs

10 0.41704088 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

11 0.41376692 97 nips-2007-Hidden Common Cause Relations in Relational Learning

12 0.41062707 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

13 0.40896896 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

14 0.40547743 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

15 0.38908267 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

16 0.38798752 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

17 0.3865681 213 nips-2007-Variational Inference for Diffusion Processes

18 0.38525343 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

19 0.38472119 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

20 0.35351771 86 nips-2007-Exponential Family Predictive Representations of State

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.54), (13, 0.03), (16, 0.016), (18, 0.019), (21, 0.04), (31, 0.013), (34, 0.037), (35, 0.017), (47, 0.056), (49, 0.011), (83, 0.079), (85, 0.024), (87, 0.012), (90, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90949166 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Author: Charlie Frogner, Avi Pfeffer

2 0.8470999 27 nips-2007-Anytime Induction of Cost-sensitive Trees

Author: Saher Esmeir, Shaul Markovitch

Abstract: Machine learning techniques are increasingly being used to produce a wide range of classifiers for complex real-world applications that involve nonuniform testing costs and misclassification costs. As the complexity of these applications grows, the management of resources during the learning and classification processes becomes a challenging task. In this work we introduce ACT (Anytime Cost-sensitive Trees), a novel framework for operating in such environments. ACT is an anytime algorithm that allows trading computation time for lower classification costs. It builds a tree top-down and exploits additional time resources to obtain better estimations for the utility of the different candidate splits. Using sampling techniques, ACT approximates for each candidate split the cost of the subtree under it and favors the one with a minimal cost. Due to its stochastic nature, ACT is expected to be able to escape local minima, into which greedy methods may be trapped. Experiments with a variety of datasets were conducted to compare the performance of ACT to that of the state-of-the-art cost-sensitive tree learners. The results show that for most domains ACT produces trees of significantly lower costs. ACT is also shown to exhibit good anytime behavior with diminishing returns.

3 0.84544981 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

Author: Pierre Garrigues, Bruno A. Olshausen

Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.

4 0.82959288 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis

Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.

5 0.49885401 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

Author: Matthias Bethge, Philipp Berens

Abstract: Maximum entropy analysis of binary variables provides an elegant way of studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data: the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that is not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible.

6 0.43812457 7 nips-2007-A Kernel Statistical Test of Independence

7 0.43504855 146 nips-2007-On higher-order perceptron algorithms

8 0.42769244 115 nips-2007-Learning the 2-D Topology of Images

9 0.42447358 96 nips-2007-Heterogeneous Component Analysis

10 0.42289868 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

11 0.41943523 187 nips-2007-Structured Learning with Approximate Inference

12 0.41523457 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation

13 0.40560839 49 nips-2007-Colored Maximum Variance Unfolding

14 0.40418327 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

15 0.40286589 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

16 0.40213156 171 nips-2007-Scan Strategies for Meteorological Radars

17 0.3993254 164 nips-2007-Receptive Fields without Spike-Triggering

18 0.39701647 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency

19 0.39585006 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

20 0.39210397 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence