nips nips2007 nips2007-28 knowledge-graph by maker-knowledge-mining

28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes


Source: pdf

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. [sent-3, score-0.586]

2 By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. [sent-4, score-0.332]

3 This information is put to use in an application to actively trade price spreads between commodity futures contracts. [sent-5, score-0.688]

4 The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. [sent-6, score-0.233]

5 1 Introduction Classical time-series forecasting models, such as ARMA models [6], assume that forecasting is performed at a fixed horizon, which is implicit in the model. [sent-7, score-0.826]

6 To forecast beyond the fixed horizon, it is necessary to iterate forecasts in a multi-step fashion. [sent-11, score-0.716]

7 These models are good at representing the short-term dynamics of the time series, but degrade rapidly when longer-term forecasts must be made, usually quickly converging to the unconditional expectation of the process after removal of the deterministic time trend. [sent-12, score-0.386]

8 This is a major issue in applications that require a forecast over a complete future trajectory, and not a single (or restricted) horizon. [sent-13, score-0.513]

9 These models are also constrained to deal with regularly-sampled data, and make it difficult to condition the time trend on explanatory variables, especially when iteration of short-term forecasts has to be performed. [sent-14, score-0.335]

10 To a large extent, the same problems are present with non-linear generalizations of such models, such as time-delay or recurrent neural networks [1], which simply allow the short-term dynamics to become nonlinear but leave open the question of forecasting complete future trajectories. [sent-15, score-0.443]

11 Since time is viewed as an independent variable, the approach can forecast at arbitrary horizons and handle irregularly-sampled data. [sent-19, score-0.561]

12 Furthermore, the question remains of how to integrate a progressively-revealed information set in order to make increasingly more precise forecasts of the same future trajectory. [sent-21, score-0.233]

13 To incorporate conditioning information, we consider here the output of a prediction to be a whole forecasting curve (as a function of t). [sent-22, score-0.413]

14 The motivation for this work comes from forecasting and actively trading price spreads between commodity futures contracts (see, e.g. [sent-23, score-1.338]

15 Since futures contracts expire and have a finite duration, this problem is characterized by the presence of a large number of separate historical time series, which all can be of relevance in forecasting a new time series. [sent-26, score-0.69]

16 Furthermore, conditioning information, in the form of macroeconomic variables, can be of importance, but exhibit the cumbersome property of being released periodically, with explanatory power that varies across the forecasting horizon. [sent-28, score-0.468]

17 A possible solution to this problem is to have multiple models for forecasting each time series, one for each time scale. [sent-30, score-0.507]

18 In addition, in order to measure the risk associated with a particular trade (buying at time t and selling at time t′), we need to estimate the covariance of the price predictions associated with these two points in the trajectory. [sent-32, score-0.522]

19 These considerations motivate the use of Gaussian processes, which naturally provide a covariance matrix between forecasts made at several points. [sent-33, score-0.327]

20 To tackle the challenging task of forecasting and trading spreads between commodity futures, we introduce here a form of functional data analysis in which the function to be forecast is indexed both by the date of availability of the information set and by the forecast horizon. [sent-34, score-1.998]

21 The predicted trajectory is thus represented as a functional object associated with a distribution, a Gaussian process, from which the risk of different trading decisions can readily be estimated. [sent-35, score-0.388]

22 This approach allows incorporating input variables that cannot be assumed to remain constant over the forecast horizon, like statistics of the short-term dynamics. [sent-36, score-0.544]

23 Previous Work Gaussian processes for time-series forecasting have been considered before. [sent-37, score-0.449]

24 Multi-step forecasts are explicitly tackled by [4], wherein uncertainty about the intermediate values is formally incorporated into the predictive distribution to obtain more realistic uncertainty bounds at longer horizons. [sent-38, score-0.273]

25 2 The Model We consider a set of N real time series $\{y^i_t\}$, $i = 1, \ldots, N$, each of length $M_i$. [sent-42, score-0.181]

26 In our application each i represents a different year, and the series is the sequence of commodity spread prices during the period where it is traded. [sent-49, score-0.52]

27 The lengths of all series are not necessarily identical, but we shall assume that the time periods spanned by the series are “comparable” (e.g. [sent-50, score-0.315]

28 the same range of days within a year if the series follow an annual cycle) so that knowledge from past series can be transferred to a new one to be forecast. [sent-52, score-0.492]

29 The forecasting problem is that given N observations from the complete series $i = 1, \ldots$ [sent-53, score-0.577]

30 , $M_N$, we want to extrapolate the last series until a predetermined endpoint, i.e. [sent-59, score-0.21]

31 , $M$, respectively over the forecasting horizon, the available series and the observations within a series. [sent-75, score-0.547]

32 The notation $x^i_{t|t_0}$ denotes a forecast of $x^i_t$ given information available at $t_0$. [sent-80, score-0.483]

33 To extrapolate over the unknown horizon, one simply evaluates f and g with the series identity index i set to N and the time index t within a series ranging over the elements of τ (forecasting period). [sent-83, score-0.452]

34 The principal difficulty with this method resides in handling the exogenous inputs $x^N_{t|t_0}$ over the forecasting period: the realizations of these variables, $x^N_t$, are not usually known at the time the forecast is made and must be extrapolated with some reasonableness. [sent-85, score-1.115]

35 For slow-moving variables that represent a “level” (as opposed to a “difference” or a “return”), one can conceivably keep their value constant to the last known realization across the forecasting period. [sent-86, score-0.476]

36 Augmenting the Functional Representation We propose in this paper to augment the functional representation with an additional input variable that expresses the time at which the forecast is being made, in addition to the time for which the forecast is made. [sent-90, score-1.206]

37 We shall denote the former the operation time and the latter the target time. [sent-91, score-0.164]

38 The distinction is as follows: operation time represents the time at which the other input variables are observed and the time at which, conceptually, a forecast of the entire future trajectory is performed. [sent-92, score-0.824]

39 In contrast, target time represents time at a point of the predicted target series (beyond operation time), given the information known at the operation time. [sent-93, score-0.462]

40 As previously, the time series index i remains part of the inputs. [sent-94, score-0.226]

41 In this framework, forecasting is performed by holding the time series index constant to N, the operation time constant to the time $M_N$ of the last observation, the other input variables constant to their last-observed values $x^N_{M_N}$, and varying the target time over the forecasting period τ. [sent-95, score-1.47]

42 It can be convenient to represent the target time as a positive delta ∆ from the operation time t0 . [sent-97, score-0.211]

43 (3), this yields the representation $\mathrm{E}[y^i_{t_0+\Delta} \mid I_{t_0}] = f(i, t_0, \Delta, x^i_{t_0})$ and $\mathrm{Cov}[y^i_{t_0+\Delta}, y^{i'}_{t'_0+\Delta'} \mid I_{t_0}] = g(i, t_0, \Delta, x^i_{t_0}, i', t'_0, \Delta', x^{i'}_{t'_0})$, (4) where we have assumed the operation time to coincide with the end of the information set. [sent-99, score-0.168]
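
As a concrete illustration of this augmented representation, the following sketch (a minimal example, not the authors' code; the helper names and the two-variable exogenous vector are assumptions) builds input rows of the form (series index i, operation time t0, horizon ∆, exogenous variables observed at t0). Forecasting then amounts to holding i = N, t0 = M_N and the exogenous values fixed at their last observations while varying ∆ over the forecasting period; the resulting matrix would be handed to a Gaussian process posterior to obtain both the mean curve f and the full covariance g between horizons.

```python
import numpy as np

def build_input(i, t0, delta, x_t0):
    """One augmented input row: series index, operation time, horizon, exogenous vars."""
    return np.concatenate(([i, t0, delta], np.atleast_1d(x_t0)))

def forecast_inputs(series_index, last_obs_time, x_last, horizons):
    """Prediction inputs: hold i, t0 and the exogenous values fixed, vary the horizon."""
    return np.vstack([build_input(series_index, last_obs_time, d, x_last)
                      for d in horizons])

# Hypothetical example: forecast series N from its last observation, 50 steps ahead.
N, M_N = 10, 120                    # assumed series count and last observation time
x_last = np.array([0.3, -1.2])      # assumed last-observed exogenous variables
X_star = forecast_inputs(N, M_N, x_last, horizons=np.arange(1, 51))
# X_star would be passed to the GP posterior, yielding the mean forecast and the
# full covariance matrix between all pairs of horizons.
```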

44 Moreover, from a given information set, nothing precludes forecasting the same trajectory from several operation times $t < t_0$, which can be used as a means of evaluating the stability of the obtained forecast. [sent-101, score-0.552]

45 In particular, the training set must contain sufficient information to represent the output variable for many combinations of operation and target times that can be provided as input. [sent-103, score-0.143]

46 The variables u and v are values in the augmented representation introduced previously, containing the three variables representing time (current time-series index or year, operation time, target time) as well as the additional explanatory variables. [sent-109, score-0.423]

47 The last term of the covariance function, the Kronecker delta, is used to induce an increased similarity among points that belong to the same time series (e. [sent-111, score-0.279]
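
A hedged sketch of how such a covariance can be assembled is given below; the rational-quadratic base kernel is an assumption suggested by the model names (AugRQ), and the hyperparameter names are illustrative rather than those used in the paper.

```python
import numpy as np

def rational_quadratic(u, v, lengthscales, alpha=1.0, sigma2=1.0):
    """Rational-quadratic kernel over the augmented inputs (time variables + exogenous)."""
    d2 = np.sum(((np.asarray(u) - np.asarray(v)) / lengthscales) ** 2)
    return sigma2 * (1.0 + d2 / (2.0 * alpha)) ** (-alpha)

def covariance(u, v, series_u, series_v, lengthscales, alpha, sigma2, delta_var):
    """Base kernel plus a Kronecker-delta term on the series index, which raises the
    similarity of points that belong to the same time series."""
    k = rational_quadratic(u, v, lengthscales, alpha, sigma2)
    if series_u == series_v:
        k += delta_var
    return k
```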

48 By allowing a series-specific average level to be maintained into the extrapolated portion, the presence of this term was found to bring better forecasting performance. [sent-114, score-0.444]

49 The forecasting task is to predict the complete future trajectory of each spread (taken individually), from 200 days before maturity until maturity. [sent-118, score-0.811]

50 Methodology Realized prices in the previous trading years are provided from 250 days to maturity, using data going back to 1989. [sent-119, score-0.407]

51 Within a given trading year, the time variables represent the number of calendar days to maturity of the near leg; since no data is observed on week-ends, training examples are sampled on an irregular time scale. [sent-121, score-0.659]

52 Performance evaluation proceeds through a sequential validation procedure [2]: within a trading year, we first train models 200 days before maturity and obtain a first forecast for the future price trajectory. [sent-122, score-1.136]

53 Proceeding sequentially, this operation is repeated for succeeding trading years. [sent-124, score-0.332]
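
The sequential validation loop can be sketched as follows. This is a toy walk-forward illustration, not the authors' code: the GP is replaced by a trivial last-value model, and the 200-day start and 25-step refit cadence are taken from the description above and Figure 1 as assumptions about the evaluation schedule.

```python
import numpy as np

def sequential_validation(prices, train_fn, forecast_fn,
                          first_forecast=200, refit_every=25):
    """Walk forward through one trading year of prices (the last entry being maturity),
    re-training and re-forecasting the remaining trajectory every `refit_every` steps."""
    errors = []
    t = len(prices) - first_forecast          # first forecast 200 days before maturity
    while t < len(prices):
        model = train_fn(prices[:t])          # fit on everything observed so far
        mean, _cov = forecast_fn(model, len(prices) - t)
        errors.append(np.mean((mean - prices[t:]) ** 2))
        t += refit_every
    return np.array(errors)

# Toy usage with a "last value carried forward" model standing in for the GP:
rng = np.random.default_rng(0)
toy_prices = np.cumsum(rng.normal(size=250))
train = lambda hist: hist[-1]
predict = lambda last, n_ahead: (np.full(n_ahead, last), None)
print(sequential_validation(toy_prices, train, predict))
```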

54 All forecasts are compared amongst models on squared-error and negative log-likelihood criteria (see “assessing significance”, below). [sent-125, score-0.233]

55 The price targets require additional treatment: since the price level of a spread can vary significantly from year to year, we normalize the price trajectories to start at zero at the start of every trading year, by subtracting the first price. [sent-127, score-1.16]

56 Furthermore, in order to get slightly better behaved optimization, we divide the price targets by their overall standard deviation. [sent-128, score-0.237]
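
A minimal sketch of this two-step target normalization (subtract each year's first price, then rescale by the overall standard deviation); the grouping of prices by trading year and the function name are assumptions.

```python
import numpy as np

def normalize_targets(prices_by_year):
    """Subtract each year's first price so trajectories start at zero, then divide by
    the overall standard deviation of the shifted targets (one scale for all years)."""
    shifted = [np.asarray(p, dtype=float) - p[0] for p in prices_by_year]
    scale = np.std(np.concatenate(shifted))
    return [s / scale for s in shifted], scale
```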

57 An example of the sequence of forecasts made by this model, repeated every 25 time steps, is shown in the upper panel of Figure 1. [sent-135, score-0.293]

58 Traditionally, intra-commodity spread positions are taken so as to match the number of contracts on both legs — the number of short contracts equals the number of long ones — not the dollar value of the long and short sides. [sent-142, score-0.388]

59 4 Figure 1: Top Panel: Illustration of multiple forecasts, repeated every 25 days, of the 1996 March–July Wheat spread (dashed lines); realized price is in gray. [sent-143, score-0.345]

60 Although the first forecast (smooth solid blue, with confidence bands) mistakes the overall price level, it approximately correctly identifies local price maxima and minima, which is sufficient for trading purposes. [sent-144, score-1.138]

61 Bottom Panel: Position taken by the trading model (in red: short, then neutral, then long), and cumulative profit of that trade (gray). [sent-145, score-0.354]

62 AugRQ/no-inp further removes the price inputs, leaving only the time-representation inputs. [sent-147, score-0.202]

63 Moreover, to quantify the performance gain of the augmented representation of time, the model StdRQ/no-inp implements a “standard time representation” that would likely be used in a functional data analysis model; as described in eq. [sent-148, score-0.217]

64 (3), this uses a single time variable instead of splitting the representation of time between the operation and target times. [sent-149, score-0.251]

65 We consider the scalar data-generating process $y_t = \phi\, y_{t-1} + \varepsilon_t$, $\varepsilon_t \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, (6) where the process $\{y_t\}$ has an unconditional mean of zero. [sent-155, score-0.176]

66 Given information available at time $t$, $I_t$, the $h$-step-ahead forecast from time $t$ under this model has conditional expectation and covariance (with the $h'$-step-ahead forecast) expressed as $\mathrm{E}[y_{t+h} \mid I_t] = \phi^h y_t$ and $\mathrm{Cov}[y_{t+h|t}, y_{t+h'|t} \mid I_t] = \sigma^2 \phi^{h+h'} \frac{1 - \phi^{-2\min(h,h')}}{\phi^2 - 1}$. [sent-156, score-0.835]
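
The sketch below evaluates these two expressions numerically for a vector of horizons; it is a direct transcription of the formulas above, with the function name and the stationarity assumption |φ| < 1 being mine.

```python
import numpy as np

def ar1_forecast_moments(y_t, phi, sigma2, horizons):
    """Conditional mean and covariance of h-step-ahead AR(1) forecasts:
    E[y_{t+h}|I_t] = phi^h * y_t and
    Cov = sigma2 * phi^(h+h') * (1 - phi^(-2*min(h,h'))) / (phi^2 - 1).
    Assumes a stationary process, |phi| < 1."""
    h = np.asarray(horizons, dtype=float)
    mean = phi ** h * y_t
    H, Hp = np.meshgrid(h, h, indexing="ij")
    cov = (sigma2 * phi ** (H + Hp)
           * (1.0 - phi ** (-2.0 * np.minimum(H, Hp))) / (phi ** 2 - 1.0))
    return mean, cov

# Example: forecast moments for horizons 1..10 given y_t = 1.5.
mean, cov = ar1_forecast_moments(1.5, phi=0.9, sigma2=0.04, horizons=np.arange(1, 11))
```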

67 Assessing Significance of Forecasting Performance Differences For each trajectory forecast, we measure the squared error (SE) made at each time-step along with the negative log-likelihood (NLL) of the realized price under the predictive distribution. [sent-157, score-0.318]

68 To account for differences in target variable distribution throughout the years, we normalize the SE by dividing it by the standard deviation of the test targets in a given year. [sent-158, score-0.152]

69 Due to the serial correlation it exhibits, the time series of performance differences (either SE or NLL) between two models cannot directly be subjected to a standard t-test of the null hypothesis of no difference in forecasting performance. [sent-160, score-0.622]

70 The well-known Diebold-Mariano test [3] corrects for this correlation structure in the case where a single time series of performance differences is available. [sent-161, score-0.235]

71 The sample variance of $\bar{d}$ is readily shown [3] to be $v_{\mathrm{DM}} = \widehat{\mathrm{Var}}[\bar{d}] = \frac{1}{M} \sum_{k=-K}^{K} \hat{\gamma}_k$. In our experiments, we estimate an independent empirical mean for each trading year, which is subtracted from the prices before proceeding with the analysis. [sent-165, score-0.314]
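
A minimal implementation of this variance correction and the resulting test statistic is sketched below; the final d̄/√v_DM form is the standard Diebold–Mariano construction and is assumed here rather than quoted from the paper.

```python
import numpy as np

def diebold_mariano(d, K):
    """Diebold-Mariano statistic for a series of loss differentials d (e.g. per-step
    SE or NLL differences between two models), correcting the variance of the mean
    for serial correlation up to lag K. Asymptotically standard normal under H0."""
    d = np.asarray(d, dtype=float)
    M = len(d)
    d_bar = d.mean()

    def gamma(k):
        k = abs(k)
        # empirical autocovariance of the loss-differential series at lag k
        return np.mean((d[k:] - d_bar) * (d[:M - k] - d_bar))

    v_dm = sum(gamma(k) for k in range(-K, K + 1)) / M
    return d_bar / np.sqrt(v_dm)
```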

72 Table 1: Forecast performance difference between AugRQ/all-inp and all other models, for the three spreads studied. [sent-166, score-0.146]

73 Due to the repeated forecasts made for the same time-step across several iterations of sequential validation, the error sequences are likely to be cross-correlated since they result from models estimated on strongly overlapping training sets. [sent-232, score-0.284]

74 Results Results of the forecasting performance difference between AugRQ/all-inp and all other models are shown in Table 1. [sent-236, score-0.413]

75 In particular, the augmented representation of time is shown to be of value (i.e. [sent-238, score-0.138]

76 Moreover, the Gaussian process is capable of making good use of the additional price and economic input variables, although not always with the traditionally accepted levels of significance. [sent-241, score-0.312]

77 4 Application: Trading a Portfolio of Spreads We applied this forecasting methodology based on an augmented representation of time to trading a portfolio of spreads. [sent-242, score-0.958]

78 Within a given trading year, we apply an information-ratio criterion to greedily determine the best trade into which to enter, based on the entire price forecast (until the end of the year) produced by the Gaussian process. [sent-243, score-1.039]

79 More specifically, let $\{p_t\}$ be the future prices forecast by the model at some operation time (presumably the time of the last available element in the training set). [sent-244, score-0.776]

80 The expected forecast dollar profit of buying at $t_1$ and selling at $t_2$ is simply given by $p_{t_2} - p_{t_1}$. [sent-245, score-0.616]

81 Table 2: Financial performance statistics for the 30-spread portfolio on the 1994–2007 (until April 30) period, and two disjoint sub-periods. [sent-249, score-0.156]

82 All returns are expressed in excess of the risk-free rate. [sent-250, score-0.154]

83 Skewness and excess kurtosis are on the monthly return distributions. [sent-252, score-0.192]

84 Figure 2: After a price trajectory forecast (in the top and left portions of the figure), all possible pairs of buy-day/sell-day are evaluated on a trade information ratio criterion, whose results are shown by the level plot. [sent-255, score-0.874]

85 The best trade is selected, here shorting 235 days before maturity with forecast price at a local maximum, and covering 100 days later at a local minimum. [sent-256, score-1.081]

86 The trade decision is made in one of two ways, depending on whether a position has already been opened: (i) When making a decision at time t0 , if a position has not yet been entered for the spread in a given trading year, eq. [sent-285, score-0.536]

87 An illustration of this criterion is given in Figure 2, which corresponds to the first decision made when trading the spread shown in Figure 1. [sent-287, score-0.386]

88 Simple additional filters are used to avoid entering marginal trades: we impose a trade duration of at least four days, a minimum forecast IR of 0.25, [sent-291, score-0.63]

89 and a forecast standard deviation of the price sequence of at least 0. [sent-292, score-0.685]
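
Putting these pieces together, the trade-selection step can be sketched as follows. This is an illustrative implementation, not the authors' code: the information ratio is taken as expected profit divided by its forecast standard deviation, with Var[p_t2 − p_t1] read off the GP covariance matrix, and only the duration and IR filters quoted above are applied.

```python
import numpy as np

def best_trade(mean, cov, min_ir=0.25, min_duration=4):
    """Evaluate every (entry-day, exit-day) pair of a forecast trajectory on a trade
    information-ratio criterion: (mean[t2] - mean[t1]) divided by the forecast std of
    that profit, Var = cov[t2,t2] + cov[t1,t1] - 2*cov[t1,t2]. Negative IRs
    correspond to short trades (sell at t1, buy back at t2)."""
    T = len(mean)
    best = None
    for t1 in range(T):
        for t2 in range(t1 + min_duration, T):
            profit = mean[t2] - mean[t1]
            var = cov[t2, t2] + cov[t1, t1] - 2.0 * cov[t1, t2]
            ir = profit / np.sqrt(max(var, 1e-12))
            if abs(ir) >= min_ir and (best is None or abs(ir) > abs(best[0])):
                best = (ir, t1, t2)
    return best   # (information ratio, entry index, exit index), or None if filtered out
```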

90 These thresholds have not been tuned extensively; they were used only to avoid trading on an approximately flat price forecast. [sent-294, score-0.453]

91 We applied these ideas to trading a portfolio of 30 spreads, selected among the following commodities: Cotton (2 spreads), Feeder Cattle (2), Gasoline (1), Lean Hogs (7), Live Cattle (1), Natural Gas (2), Soybean Meal (5), Soybeans (5), Wheat (5). [sent-295, score-0.407]

92 The spreads were selected on the basis of their good performance on the 1994–2002 period. [sent-296, score-0.146]

93 Transaction costs were assumed to be 5 basis points per spread leg traded. [sent-298, score-0.153]

94 Spreads were never traded later than 25 calendar days before maturity of the near leg. [sent-299, score-0.282]

95 In all sub-periods, but particularly since 2003, the portfolio exhibits a very favorable risk-return profile, including positive skewness and acceptable excess kurtosis. [sent-302, score-0.313]

96 Figure 3: Top Panel: cumulative excess return after transaction costs of a portfolio of 30 spreads traded according to the maximum information-ratio criterion; the bottom part plots the number of positions open at a time (right axis). [sent-311, score-0.531]

97 Bottom Panel: monthly portfolio relative excess returns; we observe the significant positive skewness in the distribution. [sent-312, score-0.36]

98 5 Future Work and Conclusions We introduced a flexible functional representation of time series, capable of making long-term forecasts from progressively-revealed information sets and of handling multiple irregularly-sampled series as training examples. [sent-313, score-0.591]

99 We demonstrated the approach on a challenging commodity spread trading application, making use of a Gaussian process’ ability to compute a complete covariance matrix between several test outputs. [sent-314, score-0.629]

100 Gaussian process priors with uncertain inputs – application to multiple-step ahead time series forecasting. [sent-340, score-0.293]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('forecast', 0.483), ('forecasting', 0.413), ('trading', 0.251), ('forecasts', 0.233), ('price', 0.202), ('portfolio', 0.156), ('spreads', 0.146), ('commodity', 0.143), ('series', 0.134), ('year', 0.131), ('ccc', 0.125), ('nll', 0.125), ('spread', 0.11), ('maturity', 0.107), ('trade', 0.103), ('futures', 0.094), ('days', 0.093), ('contracts', 0.089), ('drawdown', 0.089), ('soybeans', 0.089), ('yt', 0.089), ('excess', 0.085), ('operation', 0.081), ('functional', 0.079), ('skewness', 0.072), ('period', 0.07), ('covariance', 0.069), ('prices', 0.063), ('trajectory', 0.058), ('july', 0.056), ('mi', 0.055), ('economic', 0.055), ('explanatory', 0.055), ('calendar', 0.054), ('selling', 0.054), ('horizon', 0.053), ('mn', 0.053), ('augmented', 0.051), ('cov', 0.05), ('inputs', 0.048), ('time', 0.047), ('monthly', 0.047), ('extrapolate', 0.047), ('meal', 0.047), ('wheat', 0.047), ('index', 0.045), ('duration', 0.044), ('var', 0.043), ('buying', 0.043), ('leg', 0.043), ('soybean', 0.043), ('se', 0.042), ('returns', 0.041), ('gaussian', 0.04), ('wherein', 0.04), ('representation', 0.04), ('statistic', 0.037), ('processes', 0.036), ('target', 0.036), ('chapados', 0.036), ('dollar', 0.036), ('exogenous', 0.036), ('opened', 0.036), ('vdm', 0.036), ('ahead', 0.036), ('transaction', 0.036), ('targets', 0.035), ('panel', 0.035), ('variables', 0.034), ('return', 0.033), ('realized', 0.033), ('short', 0.032), ('handling', 0.032), ('fda', 0.031), ('cattle', 0.031), ('extrapolated', 0.031), ('extrapolation', 0.031), ('financial', 0.031), ('horizons', 0.031), ('iu', 0.031), ('unconditional', 0.031), ('complete', 0.03), ('pro', 0.029), ('last', 0.029), ('process', 0.028), ('traded', 0.028), ('revised', 0.028), ('ratio', 0.028), ('expressed', 0.028), ('differences', 0.028), ('input', 0.027), ('normalize', 0.027), ('month', 0.027), ('autoregressive', 0.027), ('kurtosis', 0.027), ('lag', 0.027), ('montr', 0.027), ('test', 0.026), ('training', 0.026), ('made', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

2 0.061501734 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

Author: Lars Buesing, Wolfgang Maass

Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1

3 0.056392513 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies. 1

4 0.052588273 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

5 0.052148011 135 nips-2007-Multi-task Gaussian Process Prediction

Author: Edwin V. Bonilla, Kian M. Chai, Christopher Williams

Abstract: In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a “free-form” covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets. 1

6 0.0507784 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

7 0.048447214 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

8 0.04683185 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

9 0.046550207 162 nips-2007-Random Sampling of States in Dynamic Programming

10 0.045801818 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.042667802 146 nips-2007-On higher-order perceptron algorithms

12 0.041129477 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation

13 0.040839069 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

14 0.040585391 163 nips-2007-Receding Horizon Differential Dynamic Programming

15 0.039149128 145 nips-2007-On Sparsity and Overcompleteness in Image Models

16 0.038426887 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

17 0.037408918 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

18 0.037152879 195 nips-2007-The Generalized FITC Approximation

19 0.034946926 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

20 0.034561109 110 nips-2007-Learning Bounds for Domain Adaptation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.139), (1, 0.0), (2, 0.015), (3, 0.017), (4, 0.006), (5, 0.01), (6, -0.022), (7, -0.024), (8, -0.058), (9, -0.051), (10, -0.001), (11, 0.023), (12, 0.023), (13, 0.022), (14, -0.025), (15, 0.01), (16, -0.028), (17, -0.007), (18, 0.014), (19, -0.04), (20, 0.046), (21, 0.035), (22, 0.072), (23, 0.027), (24, -0.024), (25, 0.028), (26, 0.068), (27, 0.042), (28, 0.005), (29, 0.04), (30, -0.041), (31, -0.083), (32, -0.023), (33, 0.019), (34, -0.027), (35, -0.024), (36, -0.005), (37, -0.024), (38, 0.031), (39, 0.04), (40, -0.007), (41, -0.061), (42, 0.008), (43, 0.087), (44, -0.015), (45, -0.023), (46, 0.066), (47, 0.039), (48, -0.033), (49, -0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89741051 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

2 0.60937756 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi

Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency.
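The paper's constraint-generation procedure works in a convex relaxation and adds stability constraints incrementally; as a hedged, much simpler baseline illustrating the underlying issue (a least-squares estimate of the dynamics can have spectral radius above one), the sketch below fits A by ordinary least squares, checks stability, and falls back to a crude rescaling. Names and the toy system are illustrative; this is not the authors' algorithm.

```python
import numpy as np

def fit_transition_matrix(states):
    """Least-squares estimate of A in x_{t+1} = A x_t; `states` has shape (T, d)."""
    X_now, X_next = states[:-1], states[1:]
    W, *_ = np.linalg.lstsq(X_now, X_next, rcond=None)   # x_{t+1}^T ≈ x_t^T W
    return W.T                                            # so x_{t+1} ≈ A x_t

def stabilize_naively(A, margin=1e-3):
    """If the spectral radius of A is >= 1, shrink A uniformly (crude baseline)."""
    rho = max(abs(np.linalg.eigvals(A)))
    return A * (1.0 - margin) / rho if rho >= 1.0 else A

# Toy usage: a slightly unstable, noisy 2-d rotation.
rng = np.random.default_rng(0)
angle = 0.1
A_true = 1.01 * np.array([[np.cos(angle), -np.sin(angle)],
                          [np.sin(angle),  np.cos(angle)]])
x = np.zeros((200, 2))
x[0] = [1.0, 0.0]
for t in range(199):
    x[t + 1] = A_true @ x[t] + 0.01 * rng.normal(size=2)

A_hat = stabilize_naively(fit_transition_matrix(x))
print(max(abs(np.linalg.eigvals(A_hat))))   # spectral radius now strictly below 1
```

Uniform shrinking like this can badly distort the dynamics, which is precisely why the paper's incremental constraint generation within a convex formulation is attractive.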

3 0.57928896 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray

Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for generating facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video and by conducting a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the generated sequences are comparable to real video sequences.

4 0.55135494 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the log-likelihood, which is very efficient and scalable. Experiments on a toy dataset and the EachMovie dataset show good predictive accuracy for the model.

5 0.54305863 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes

Author: Richard Turner, Maneesh Sahani

Abstract: Natural sounds are structured on many time-scales. A typical segment of speech, for example, contains features that span four orders of magnitude: Sentences (∼ 1 s); phonemes (∼ 10⁻¹ s); glottal pulses (∼ 10⁻² s); and formants (∼ 10⁻³ s). The auditory system uses information from each of these time-scales to solve complicated tasks such as auditory scene analysis [1]. One route toward understanding how auditory processing accomplishes this analysis is to build neuroscience-inspired algorithms which solve similar tasks and to compare the properties of these algorithms with properties of auditory processing. There is however a discord: Current machine-audition algorithms largely concentrate on the shorter time-scale structures in sounds, and the longer structures are ignored. The reason for this is two-fold. Firstly, it is a difficult technical problem to construct an algorithm that utilises both sorts of information. Secondly, it is computationally demanding to simultaneously process data both at high resolution (to extract short temporal information) and for long duration (to extract long temporal information). The contribution of this work is to develop a new statistical model for natural sounds that captures structure across a wide range of time-scales, and to provide efficient learning and inference algorithms. We demonstrate the success of this approach on a missing data task.

6 0.51874095 163 nips-2007-Receding Horizon Differential Dynamic Programming

7 0.51396614 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

8 0.47573596 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

9 0.47257748 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

10 0.46286154 96 nips-2007-Heterogeneous Component Analysis

11 0.45598754 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation

12 0.44444075 158 nips-2007-Probabilistic Matrix Factorization

13 0.44293192 135 nips-2007-Multi-task Gaussian Process Prediction

14 0.43009204 174 nips-2007-Selecting Observations against Adversarial Objectives

15 0.42585599 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

16 0.42435938 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

17 0.40901625 162 nips-2007-Random Sampling of States in Dynamic Programming

18 0.40740252 43 nips-2007-Catching Change-points with Lasso

19 0.40548435 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

20 0.40090081 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.297), (5, 0.036), (13, 0.032), (16, 0.06), (18, 0.013), (19, 0.012), (21, 0.081), (31, 0.025), (34, 0.013), (35, 0.021), (47, 0.094), (49, 0.022), (53, 0.016), (83, 0.083), (85, 0.035), (87, 0.012), (90, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76256812 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Author: Nicolas Chapados, Yoshua Bengio

Abstract: We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads. 1

2 0.70711613 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright

Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) often provides accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so-called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points.
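For readers who want the object being bounded spelled out, the following is a standard statement of the Bethe free energy for a pairwise MRF (our notation, and assuming locally consistent beliefs such as those at a BP fixed point), together with the inequality the abstract claims for attractive binary models:

```latex
% Pairwise MRF: p(x) \propto \prod_{i \in V} \psi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j).
% Bethe free energy of beliefs {b_i, b_{ij}} (d_i = degree of node i):
\begin{align*}
F_{\mathrm{Bethe}}(b)
  &= \sum_{(i,j) \in E} \sum_{x_i, x_j} b_{ij}(x_i, x_j)
       \ln \frac{b_{ij}(x_i, x_j)}{\psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, \psi_j(x_j)}
     \;-\; \sum_{i \in V} (d_i - 1) \sum_{x_i} b_i(x_i) \ln \frac{b_i(x_i)}{\psi_i(x_i)} .
\end{align*}
% The Bethe estimate of the partition function is \ln Z_{\mathrm{Bethe}} = -F_{\mathrm{Bethe}}(b);
% the paper's result is that for attractive binary models, Z \ge Z_{\mathrm{Bethe}} at any
% fixed point of loopy BP, i.e. the Bethe approximation lower bounds the true likelihood.
```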

3 0.58835745 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.
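As a concrete instance of "regularization using graph Laplacians" (a generic estimator of the kind analyzed, written in our notation rather than the paper's), one common semi-supervised regression method solves:

```latex
% Laplacian-regularized least squares over n labeled and m unlabeled inputs:
\begin{equation*}
\hat{\mathbf{f}} \;=\; \arg\min_{\mathbf{f} \in \mathbb{R}^{n+m}}
  \sum_{i=1}^{n} \bigl( y_i - f_i \bigr)^2 \;+\; \gamma\, \mathbf{f}^{\top} L \mathbf{f},
\qquad L = D - W,
\end{equation*}
% where W is a similarity (adjacency) matrix built from all n+m inputs, D is its
% degree matrix, and gamma > 0 controls the strength of the smoothness penalty.
% The unlabeled points influence the estimate only through L.
```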

4 0.49755585 195 nips-2007-The Generalized FITC Approximation

Author: Andrew Naish-guzman, Sean Holden

Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(NM²) training complexity, asymptotically the same as related sparse methods such as the informative vector machine [2], but one which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
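For context on what is being generalized, the standard FITC/SPGP prior approximation for GP regression with M pseudo-inputs can be written as follows (our notation; the paper's contribution is the extension of this construction to binary classification):

```latex
% FITC / SPGP approximation with inducing (pseudo-) inputs u and training inputs f:
\begin{align*}
Q_{ff} &= K_{fu} K_{uu}^{-1} K_{uf}, \\
p(\mathbf{f}) &\approx \mathcal{N}\bigl(\mathbf{0},\; Q_{ff} + \operatorname{diag}(K_{ff} - Q_{ff})\bigr), \\
\mathbf{y} &\sim \mathcal{N}\bigl(\mathbf{0},\; Q_{ff} + \operatorname{diag}(K_{ff} - Q_{ff}) + \sigma^2 I\bigr)
  \quad \text{(Gaussian likelihood case)}.
\end{align*}
% Because the non-diagonal part of this covariance has rank at most M, it can be
% inverted in O(N M^2) time via the Woodbury identity, which is the source of the
% quoted training complexity.
```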

5 0.49558738 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

Author: Jonathan W. Pillow, Peter E. Latham

Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
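The "point process encoding model" being extended is typically a coupled generalized linear model; a standard discrete-time form (our notation, given here only to fix ideas about what the hidden neurons add) is:

```latex
% Conditional intensity of neuron i, driven by the stimulus x(t) and by the recent
% spike histories y_j of all neurons (observed or hidden), with bin width \Delta:
\begin{align*}
\lambda_i(t) &= \exp\!\Bigl( \mathbf{k}_i^{\top} \mathbf{x}(t)
   + \sum_{j} \sum_{\tau = 1}^{T_h} h_{ij}(\tau)\, y_j(t - \tau) + \mu_i \Bigr), \\
\log p(\mathbf{y} \mid \mathbf{x}, \theta)
  &= \sum_{i, t} \Bigl( y_i(t) \log\bigl( \lambda_i(t)\, \Delta \bigr)
       - \lambda_i(t)\, \Delta \Bigr) + \text{const}.
\end{align*}
% When some neurons are unobserved, their spike trains y_j enter \lambda_i but are
% not observed, so the likelihood must marginalize over them; the abstract's
% variational EM / wake-sleep procedure approximates that marginalization.
```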

6 0.4921107 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

7 0.49052364 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

8 0.48985001 164 nips-2007-Receptive Fields without Spike-Triggering

9 0.48895377 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.48667043 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

11 0.48234102 100 nips-2007-Hippocampal Contributions to Control: The Third Way

12 0.48089373 156 nips-2007-Predictive Matrix-Variate t Models

13 0.48027962 174 nips-2007-Selecting Observations against Adversarial Objectives

14 0.47942317 158 nips-2007-Probabilistic Matrix Factorization

15 0.47823599 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

16 0.47821265 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

17 0.47808954 86 nips-2007-Exponential Family Predictive Representations of State

18 0.47708136 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

19 0.47642529 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

20 0.47363234 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods