jmlr jmlr2009 jmlr2009-42 knowledge-graph by maker-knowledge-mining

42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks


Source: pdf

Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia

Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Department of Computer Science and Operations Research, Université de Montréal, 2920 Chemin de la tour, suite 2194, Montreal, Qc, Canada H3A 1J4; François Bélisle; Claude Nadeau. [sent-5, score-0.178]

2 Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. [sent-15, score-0.177]

3 ... the price of stock options. [sent-22, score-0.143]

4 Based on the Black-Scholes formula, the price of a call stock option is monotonically increasing in both the “moneyness” and time to maturity of the option, and it is convex in the “moneyness”. [sent-23, score-0.383]

5 Comparative experiments on these new classes of functions were performed on stock option prices, showing improvements when using these new classes rather than ordinary feedforward neural networks. [sent-32, score-0.18]

6 Since the sigmoid h is monotonically increasing ($h'(s) = h(s)(1 - h(s)) > 0$), it is easy to force the first derivatives with respect to x to be positive by forcing the weights to be positive, for example with the exponential function: $\hat N_+ = \left\{ f(x) = e^{w_0} + \sum_{i=1}^{H} e^{w_i} \cdot h\!\left(b_i + \sum_j e^{v_{ij}} x_j\right) \right\}$. [sent-43, score-0.225]
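
A minimal numpy sketch of such a positive-weight network, assuming a single hidden layer of size H; the parameter names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def positive_weight_net(x, w0, w, b, V):
    """One-hidden-layer network that is non-decreasing in every input:
    all weights are obtained by exponentiating unconstrained parameters.
    Shapes: x (n,), w0 scalar, w (H,), b (H,), V (H, n)."""
    hidden = sigmoid(b + np.exp(V) @ x)      # positive input-to-hidden weights e^{v_ij}
    return np.exp(w0) + np.exp(w) @ hidden   # positive hidden-to-output weights e^{w_i}
```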

7 1 Universality for Functions with Positive Outputs. Using the softplus function introduced above, we define a new class of functions, all of which have positive outputs: $\hat N_{>0} = \{ f(x) = \zeta(g(x)),\ g(x) \in \hat N \}$. [sent-48, score-0.212]

8 Consider $g(x) = \zeta^{-1}(f(x)) = \ln(e^{f(x)} - 1)$, the inverse softplus transform of $f(x)$. [sent-51, score-0.212]
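
A small sketch of the softplus transform, its inverse, and the positive-output wrapper ζ(g(x)); the use of log1p/expm1 for numerical stability is an implementation choice, not taken from the paper.

```python
import numpy as np

def softplus(s):                  # zeta(s) = ln(1 + e^s) > 0
    return np.log1p(np.exp(s))

def inv_softplus(y):              # zeta^{-1}(y) = ln(e^y - 1), defined for y > 0
    return np.log(np.expm1(y))

def positive_output_net(g, x):
    """Wrap an ordinary network g so that the final output is strictly positive."""
    return softplus(g(x))
```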

9 Since $b - a \le \varepsilon$, we have $|\hat f(x) - f(x)| = \ln(1+e^b) - \ln(1+e^a) = \ln\!\left(1 + (e^b - e^a)/(1+e^a)\right) \le \ln\!\left(1 + (e^{\varepsilon} - 1)\,e^a/(1+e^a)\right) < \varepsilon$. [sent-56, score-0.196]

10 Thus, the use of the softplus function to transform the output of a regular one hidden layer artificial neural network ensures the positivity of the final output without hindering the universality property. [sent-57, score-0.391]

11 2 The Class $\hat N_{++}^{c,n}$. In this section, we use the softplus function in order to define a new class of functions with positive outputs, positive first derivatives w.r.t. [sent-59, score-0.282]

12 We now state the main universality theorem: Theorem 3. Within the set $\hat F_{++}^{c,n}$ of Lipschitz functions from $\mathbb{R}^n$ to $\mathbb{R}$ whose set of derivatives, as specified by Equation (6), are non-negative, the class $\hat N_{++}^{c,n}$ is a universal approximator. [sent-86, score-0.198]

13 Experiments with Artificial Data In this section, we present a series of controlled experiments in order to assess the potential improvements that can be gained from using the proposed architecture in cases where some derivatives of the target function are known to be positive. [sent-101, score-0.197]

14 The function we shall attempt to learn is $f(x) = \zeta(x_1)\,\zeta(x_2)\,\zeta(x_3)\,h(x_4)$, $y = f(x) + \xi$, where $\zeta(\cdot)$ is the softplus function defined above and $h(\cdot)$ is the sigmoid function. [sent-106, score-0.299]
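
A sketch of how such a synthetic data set could be generated; the sample size, input distribution and noise level below are placeholders, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(s):
    return np.log1p(np.exp(s))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def target(X):
    # f(x) = zeta(x1) * zeta(x2) * zeta(x3) * h(x4)
    return softplus(X[:, 0]) * softplus(X[:, 1]) * softplus(X[:, 2]) * sigmoid(X[:, 3])

X = rng.normal(size=(1000, 4))                  # placeholder sample size and input law
y = target(X) + 0.1 * rng.normal(size=1000)     # y = f(x) + xi, placeholder noise level
```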

15 The process was repeated for two architectures: the proposed architecture of products of softplus and sigmoid functions over input dimensions with constrained weights (CPSD) and regular unconstrained multi-layered perceptrons with a single hidden layer (UMLP). [sent-117, score-0.63]

16 In all cases, the bias and variance are lower for the proposed architecture than for a regular neural network architecture, which is the result we expected. [sent-139, score-0.184]

17 We conjecture that the bias reduction is a side effect of the bias-variance tradeoff performed by model selection on the validation set: to achieve a lower validation error, a larger bias is needed with the unconstrained artificial neural network. [sent-141, score-0.182]

18 A possible explanation is that the proposed architecture helps reduce the variance of the estimator. [sent-143, score-0.155]

19 Estimating Call Option Prices. An option is a contract between two parties that entitles the buyer to a claim at a future date T that depends on the future price $S_T$ of an underlying asset whose price at current time t is $S_t$. [sent-146, score-0.358]

20 In this paper we consider the very common European call options, in which the buyer (holder) of the option obtains the right to buy the asset at a fixed price K called the strike price. [sent-147, score-0.442]

21 Thus, if at maturity, the price of the asset ST is above the strike price K, the holder of the option can exercise his option and buy the asset at price K, then sell it back on the market at price ST , thus making a profit of ST − K. [sent-149, score-1.012]

22 If, on the other hand, the price of the asset at maturity ST is below the strike price K, then the holder of the option has no interest in exercising his option (and does not have to) and the option simply expires worthless and unexercised. [sent-150, score-0.838]

23 For this reason, the option is considered to be worth max(0, ST − K) at maturity and our goal is to estimate Ct , the value of that worth at current time t. [sent-151, score-0.182]

24 by letting our approximating function depend on the “moneyness” ratio (M = St /K) instead of the current asset price St and the strike price K independently. [sent-245, score-0.432]

25 We must then modify the target to be the price of the option divided by the strike price: Ct /K. [sent-246, score-0.287]
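
A sketch of the payoff and of the normalization just described; the function and variable names are hypothetical.

```python
import numpy as np

def call_payoff(S_T, K):
    """Worth of a European call at maturity: max(0, S_T - K)."""
    return np.maximum(0.0, S_T - K)

def to_learning_variables(S_t, K, C_t):
    """Normalized input and target: moneyness M = S_t/K and C_t/K."""
    return S_t / K, C_t / K
```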

26 Most of the research on call option modelling relies on strong parametric assumptions of the underlying asset price dynamics. [sent-247, score-0.365]

27 Any misspecification of the stochastic process for the asset price will lead to systematic mispricings for any option based on the asset (Hutchinson et al. [sent-248, score-0.443]

28 In particular, two assumptions on which this formula relies have been challenged by empirical evidence: the assumed lognormality of returns on the asset and the assumed constancy of volatility over time. [sent-251, score-0.177]

29 Analyzing the primary economic variables that influence the call option price, we note that the risk free interest rate (r) needs to be somehow extracted from the term structure of interest rates and the volatility (σ) needs to be forecasted. [sent-253, score-0.188]

30 The novelty of our approach is to account for properties of the call option function as stated in Equation (1). [sent-258, score-0.146]

31 Now even though we know that the call option function respects these properties, we do not know whether it also respects the additional cross-derivative properties of Equation (2). [sent-260, score-0.184]

32 Equations (8), (9) and (10) confirm that the Black-Scholes formula is in accordance with our prior knowledge of the call option function: all three. [sent-262, score-0.175]

33 Then again, it is known that the Black-Scholes formula does not adequately represent the market pricing of options, but it is considered as a useful guide for evaluating call option prices. [sent-272, score-0.289]
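
For reference, a sketch of the textbook Black-Scholes call price (r the risk-free rate, sigma the volatility, tau the time to maturity); this is the standard formula, not the estimator proposed in the paper.

```python
import numpy as np
from scipy.stats import norm

def black_scholes_call(S, K, tau, r, sigma):
    """Standard Black-Scholes price of a European call option."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return S * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2)
```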

34 So, we do not know if these constraints on the cross derivatives are present in the true price function. [sent-273, score-0.183]

35 Constraining the weights of Equation (13) through exponentiation leads to a different architecture we refer to as the CBS models. [sent-285, score-0.163]

36 The proposed architecture involves the product of softplus and sigmoid functions over input dimensions, hence the UPSD models and CPSD models labels for an unconstrained version of the proposed architecture and the proposed constrained architecture, respectively. [sent-287, score-0.649]
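
A sketch of one plausible form of such a constrained product unit: each hidden unit multiplies a softplus along the convex input (moneyness) by a sigmoid along the monotone input (time to maturity), with positivity enforced by exponentiating unconstrained parameters. The exact parameterization used in the paper may differ.

```python
import numpy as np

def softplus(s):
    return np.log1p(np.exp(s))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cpsd_forward(moneyness, maturity, params):
    """Constrained product of softplus (convex dim) and sigmoid (monotone dim).
    params holds unconstrained reals w0 (scalar) and w, a, b, c, d (each of size H);
    exponentiation keeps every weight positive."""
    w0, w, a, b, c, d = (params[k] for k in ("w0", "w", "a", "b", "c", "d"))
    units = softplus(np.exp(a) * moneyness + b) * sigmoid(np.exp(c) * maturity + d)
    return np.exp(w0) + np.exp(w) @ units
```

With moneyness as the convex coordinate and time to maturity as the monotone one, this output is positive, non-decreasing in both inputs and convex in moneyness, matching the constraints of Equation (1).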

37 Finally, we also tested another architecture derived from the proposed one by simply summing, instead of multiplying, softplus and sigmoid functions. [sent-288, score-0.426]

38 For that last architecture (with constrained weights), positivity, monotonicity and convexity properties are respected but in that case, cross-derivatives are all equal to zero. [sent-289, score-0.203]

39 The unconstrained and constrained architectures are labelled as USSD models and CSSD models, respectively. [sent-291, score-0.154]

40 We used European call option data from 1988 to 1993. [sent-292, score-0.146]

41 In each case, we used the first two quarters of 1988 as a training set (3434 examples), the third quarter as a validation set (1642 examples) for model selection and the fourth quarter as a test set (each with around 1500 examples) for final generalization error estimation. [sent-295, score-0.27]

42 In Tables 2 and 3, we present results for networks with unconstrained weights on the left-hand side, and weights constrained to positive and monotone functions through exponentiation of parameters on the right-hand side. [sent-296, score-0.177]

43 Forecasting Results As can be seen in tables 2 and 3, unconstrained architectures obtain better training, validation and testing (test 1) results but fail in the extra testing set (test 2). [sent-303, score-0.146]

44 The magnitude of the difference in performance between constrained and unconstrained architectures on the second test set led us to look even further into the future and test the selected models on data from later years. [sent-306, score-0.154]

45 Bottom: neural networks with an architecture resembling the Black-Scholes formula as defined in Equation (13). [sent-483, score-0.178]

46 In another series of experiments, we tested the unconstrained multi-layered perceptron against the proposed constrained products of softplus convex architecture using data from years 1988 through 1993 incl. [sent-488, score-0.465]

47 For each year, the first two quarters were used for training, the third quarter for model selection (validation) and the fourth quarter for testing. [sent-489, score-0.234]

48 Top: products of softplus along the convex axis with sigmoid along the monotone axis. [sent-639, score-0.378]

49 Bottom: the softplus and sigmoid functions are summed instead of being multiplied. [sent-640, score-0.299]

50 Conclusions Motivated by prior knowledge on the positivity of the derivatives of the function that gives the price of European options, we have introduced new classes of functions similar to multi-layer neural networks that have those properties. [sent-648, score-0.24]

51 Table 4: Comparison between a simple unconstrained multi-layered architecture (UMLP) and the proposed architecture (CPSD). [sent-687, score-0.306]

52 Data from the first two quarters of each year was used as training set, data from the third quarter was used for validation and the fourth quarter was used for testing. [sent-688, score-0.307]

53 in empirical tests of option pricing, we showed that the architecture from the proposed constrained classes usually generalizes better than a standard artificial neural network. [sent-690, score-0.289]

54 We simply consider sigmoidal and softplus functions that are greater than or equal, at every point, to their limit counterparts used in the first part. [sent-723, score-0.244]

55 Products of these softplus and sigmoidal functions are within $\hat N_{++}^{c,n}$. [sent-724, score-0.244]

56 This is done by setting the sigmoid and softplus parameter values appropriately. [sent-727, score-0.299]

57 Universality of approximation follows from (1) the capped difference, at gridpoints, between the functions obtained in the first and second parts, (2) the exact approximation obtained at gridpoints in the first part and (3) the bounded function variation between gridpoints. [sent-728, score-0.191]

58 The number of gridpoints is $N_1 + 1$ over the $x_1$ axis, $N_2 + 1$ over the $x_2$ axis, . . . [sent-736, score-0.191]

59 We define gridpoints a = (a1 , a2 , · · · , an ) and b = (b1 , b2 , · · · , bn ) as the innermost (closest to origin) and outermost corners of T , respectively. [sent-744, score-0.269]

60 Note that, for the remainder of the proof, notations fˆh , fh , gh , without any argument, refer to the functions themselves. [sent-775, score-0.179]

61 At each point along the grid, we add a term ($\hat g_h$, a function) to the current approximating function so that it becomes exact at point $p_h$: $\hat f_h = \hat g_h + \hat f_{h-1} = \sum_{k=0}^{h} \hat g_k$, where we have set $\hat g_0 = \hat f_0$. [sent-778, score-0.227]

62 The functions $\hat f_h$, $\hat g_h$ and $\hat f_{h-1}$ are defined over the whole domain, and the increment function $\hat g_h$ must be such that at point $p_h$ we have $\hat f_h(p_h) = f(p_h)$. [sent-779, score-0.673]

63 We compute the constant term $\delta_h$ as the difference between the value of the function evaluated at point $p_h$, $f(p_h)$, and the value of the currently accumulated approximating function at the same point, $\hat f_{h-1}(p_h)$: $\delta_h = f(p_h) - \hat f_{h-1}(p_h)$. [sent-780, score-0.407]

64 Now, the function $\hat g_h$ must not affect the value of the approximating function at gridpoints that have already been visited. [sent-781, score-0.39]

65 According to our sequencing of the gridpoints, this corresponds to having $\hat g_h(p_k) = 0$ for $0 < k < h$. [sent-782, score-0.155]

66 We define $\hat\beta_h(p_k) = \prod_{j=1}^{c} (p_k(j) - p_h(j) + L)_+ / L \cdot \prod_{j=c+1}^{n} \theta(p_k(j) - p_h(j))$, (15) where $p_k(j)$ is the $j$th coordinate of $p_k$ and similarly for $p_h$. [sent-784, score-1.251]

67 We can now define the incremental function as $\hat g_h(p) = \delta_h\,\hat\beta_h(p)$, (16) so that after all gridpoints have been visited, our final approximation is $\hat f_H(p) = \sum_{h=0}^{H} \hat g_h(p)$, with $f(p) = \hat f_H(p)$ for all gridpoints. [sent-787, score-0.501]
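
A sketch of this constructive step, following the reconstruction of Equations (15) and (16) above; the helper names are hypothetical and θ is taken as the step function with θ(0) = 1.

```python
import numpy as np

def beta_hat(p, p_h, L, c):
    """Basis of Equation (15): positive parts over the c convex coordinates,
    step functions over the remaining ones."""
    conv = np.prod(np.maximum(p[:c] - p_h[:c] + L, 0.0) / L)
    mono = np.prod((p[c:] - p_h[c:] >= 0.0).astype(float))
    return conv * mono

def fit_deltas(f, gridpoints, L, c):
    """Visit the gridpoints in sequence, setting
    delta_h = f(p_h) - f_hat_{h-1}(p_h) so the approximation becomes exact at p_h."""
    deltas = []
    for h, p_h in enumerate(gridpoints):
        approx = sum(d * beta_hat(p_h, p_k, L, c)
                     for d, p_k in zip(deltas, gridpoints[:h]))
        deltas.append(f(p_h) - approx)
    return deltas
```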

68 Let q1 and q2 be the innermost (closest to origin) and outermost gridpoints of q’s hypercube, respectively. [sent-790, score-0.269]

69 In order to do so, we will express the target function at gridpoint ph , f (ph ) in terms of the δk coefficients (0 < k ≤ h), then solve for δh and show that it is necessarily positive. [sent-794, score-0.363]

70 First, let $p_k(j) = a(j) + \iota_k(j)L$ and define $\iota_k = (i_k(1), i_k(2), \ldots$ [sent-795, score-0.291]

71 Conversely, $\hat g_k(p)$ can only be different from zero if $p_k(j) \le p(j)\ \forall j$ or, equivalently, if $i_k(j) \le i(j)\ \forall j$. [sent-800, score-0.319]

72 $\ldots, H\}$ as $Q_{h,l} = \{k : i_k(j) \le i_h(j) \text{ if } j \le l \text{ and } i_k(j) = i_h(j) \text{ if } j > l\}$. [sent-808, score-1.288]

73 In particular, $Q_{h,n} = \{k : i_k(j) \le i_h(j)\ \forall j\}$ and $Q_{h,0} = \{h\}$. [sent-809, score-0.644]

74 Thus, we have $f(p_h) = \hat f_H(p_h) = \sum_{k \in Q_{h,n}} \hat g_k(p_h) = \sum_{k \in Q_{h,n}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1)_+ = \sum_{k \in Q_{h,n}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1)$. [sent-810, score-0.448]

75 Using Equations (17), (18) and (19) we get: $\Delta_n f(p_h) = \sum_{k \in Q_{h,n}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1) - \sum_{k \in Q_{h_n,n}} \delta_k \prod_{j=1}^{c} (i_{h_n}(j) - i_k(j) + 1)$. [sent-814, score-0.42]

76 Since $i_{h_n}(j) = i_h(j)$ for $j \le c$, then $\Delta_n f(p_h) = \sum_{k \in Q_{h,n-1}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1)$. This process is repeated for non-convex dimensions $n-1, n-2, \ldots$ [sent-816, score-0.474]

77 $\Delta_n f(p_h) = \sum_{k \in Q_{h,c}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1)$, at which point we must consider differentiating with respect to the convex dimensions: $\Delta_c \ldots$ [sent-822, score-0.271]

78 $\Delta_n f(p_h) = \sum_{k \in Q_{h,c}} \delta_k \prod_{j=1}^{c} (i_h(j) - i_k(j) + 1) - \sum_{k \in Q_{h_c,c}} \delta_k \prod_{j=1}^{c} (i_{h_c}(j) - i_k(j) + 1) = \sum_{k \in Q_{h,c}} \delta_k \, (i_h(c) - i_k(c) + 1) \prod_{j=1}^{c-1} (i_h(j) - i_k(j) + 1) - \sum_{k \in Q_{h_c,c}} \delta_k \, (i_h(c) - i_k(c)) \prod_{j=1}^{c-1} (i_h(j) - i_k(j) + 1)$. [sent-825, score-1.26]

79 According to Equation (19), $Q_{h,c} \setminus Q_{h_c,c} = Q_{h,c-1}$ and by definition $i_k(c) - i_h(c) = 0\ \forall k \in Q_{h,c-1}$. [sent-826, score-0.644]

80 $\Delta_n f(p_h) = \sum_{k \in Q_{h,c}} \delta_k \prod_{j=1}^{c-1} (i_h(j) - i_k(j) + 1) - \sum_{k \in Q_{h_c,c}} \delta_k \prod_{j=1}^{c-1} (i_{h_c}(j) - i_k(j) + 1)$. [sent-834, score-0.42]

81 and since $i_{h_c}(j) = i_h(j)\ \forall j \le c - 1$, then $\Delta_c^2 \ldots$ [sent-836, score-0.466]

82 For gridpoints with either $i_h(j) = 1$ for any $j$ or with $i_h(j) = 2$ for any $j \le c$, solving for $\delta_h$ requires fewer than $n + c$ differentiations. [sent-860, score-1.059]

83 Since the positivity of the derivatives of f corresponding to these lower-order differentiations is covered by Equation (6), we also have that $\delta_h \ge 0$ for these gridpoints lying at or near some of the boundaries of T. [sent-861, score-0.296]

84 For the set $\hat N_{++}^{1,2}$, we have $f(p_h) = \sum_{k=1}^{H} \delta_k \cdot (p_h(1) - p_k(1) + L)_+ \cdot \theta(p_h(2) - p_k(2))$. [sent-873, score-0.162]

85 Applying this to the six gridpoints of Figure 3, we obtain $f(p_1) = \delta_1$, $f(p_2) = \delta_1 + \delta_2$, $f(p_3) = 2\delta_1 + \delta_3$, $f(p_4) = 2\delta_1 + 2\delta_2 + \delta_3 + \delta_4$, $f(p_5) = 3\delta_1 + 2\delta_3 + \delta_5$, $f(p_6) = 3\delta_1 + 3\delta_2 + 2\delta_3 + 2\delta_4 + \delta_5 + \delta_6$. [sent-874, score-0.191]
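
As a numerical check, the lower-triangular system above can be solved for the δ's; the gridpoint coordinates below (L = 1 on a 3 × 2 grid) are one assignment that reproduces exactly the coefficients listed in the preceding sentence, since Figure 3 is not reproduced here.

```python
import numpy as np

L = 1.0
# p1..p6 on a 3 x 2 grid over (x1, x2), x1 being the convex coordinate
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]], dtype=float)

def beta(p_h, p_k):
    # (p_h(1) - p_k(1) + L)_+ * theta(p_h(2) - p_k(2))
    return max(p_h[0] - p_k[0] + L, 0.0) * float(p_h[1] - p_k[1] >= 0.0)

B = np.array([[beta(ph, pk) for pk in grid] for ph in grid])   # lower triangular
f = lambda p: np.log1p(np.exp(p[0])) / (1.0 + np.exp(-p[1]))   # an example target
deltas = np.linalg.solve(B, np.array([f(p) for p in grid]))
assert np.allclose(B @ deltas, [f(p) for p in grid])           # exact at every gridpoint
```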

86 At each point $p_h$ along the grid, we add a term $\tilde g_h$ (a function) to the current approximating function $\tilde f_{h-1}$: $\tilde f_h = \tilde g_h + \tilde f_{h-1} = \sum_{k=1}^{h} \tilde g_k = \sum_{k=1}^{h} \delta_k \tilde\beta_k$, where the $\delta_k$ are kept equal to the ones found in Section A. [sent-898, score-0.745]

87 2 and we define the set of $\tilde\beta_k$ functions as a product of sigmoid and softplus functions, one for each input dimension: $\tilde\beta_h(p) = \prod_{j=1}^{n} \tilde\beta_{h,j}(p)$. [sent-899, score-0.299]

88 For each of the convex coordinates, we set: $\tilde\beta_{h,j}(p) = \frac{1}{\alpha}\,\zeta(\alpha \cdot (p(j) - p_h(j) + L))$. [sent-900, score-0.393]

89 Now, note that κ, the maximum of the difference between the softplus function of Equation (20) and the positive part function $\hat\beta_{h,j}(p) = (p(j) - p_h(j) + L)_+$, is attained for $p(j) = p_h(j) - L$, where the difference is $\ln 2/\alpha$. [sent-903, score-0.978]
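
A quick numerical check of this gap, using the 1/α scaling as reconstructed two sentences above; the values of α, L and the evaluation grid are arbitrary.

```python
import numpy as np

alpha, L, p_h = 10.0, 1.0, 0.0
p = np.linspace(p_h - 3.0, p_h + 3.0, 100001)
soft = np.log1p(np.exp(alpha * (p - p_h + L))) / alpha   # smoothed convex basis
hard = np.maximum(p - p_h + L, 0.0)                      # positive-part basis
print(np.max(soft - hard), np.log(2) / alpha)            # both close to ln(2)/alpha
```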

90 Let us now turn to the non-convex dimensions, where we set $\tilde\beta_{h,j}(p) = (1 + \kappa)\, h(\gamma \cdot p(j) + \eta)$ and add two constraints: $h(\gamma (p_h(j) - L) + \eta) = \kappa/(1+\kappa)$ (21) and $h(\gamma \cdot p_h(j) + \eta) = 1/(1+\kappa)$ (22). [sent-905, score-0.403]

91 For non-convex dimensions, we have $\hat\beta_{h,j}(p) = \theta(p(j) - p_h(j))$. [sent-907, score-0.363]

92 Thus, for values of $p(j)$ such that $p_h(j) - L < p(j) < p_h(j)$, we have a maximum difference $\tilde\beta_{h,j} - \hat\beta_{h,j}$ of 1. [sent-908, score-0.726]

93 In particular, the difference is bounded above by κ for all gridpoints and is zero for gridpoints with p( j) = ph ( j). [sent-910, score-0.745]

94 Our goal is to cap the difference between $\tilde g_h$ and $\hat g_h$ by $\varepsilon/2H$. [sent-913, score-0.31]

95 First, consider the case where $m_j > 0\ \forall j$: $\tilde g_h - \hat g_h \le \delta_h \Big( \prod_{j=1}^{n}(m_j + \kappa) - \prod_{j=1}^{n} m_j \Big) \le \delta_h \Big( \prod_{j=1}^{n} m_j (1 + \kappa) - \prod_{j=1}^{n} m_j \Big) = \delta_h \big( (1+\kappa)^n - 1 \big) \prod_{j=1}^{n} m_j \le \delta_h \big( (1+\kappa)^n - 1 \big) \prod_{j=1}^{n} (N_j + 1) \le \varepsilon \big( (1+\kappa)^n - 1 \big) H$. [sent-918, score-0.31]

96 In that case, we have $\tilde g_h - \hat g_h < \varepsilon/2H$ if $\kappa \le (1/(2H^2) + 1)^{1/n} - 1$. [sent-919, score-0.31]

97 Then, $\tilde g_h - \hat g_h \le \delta_h \Big( \prod_{j=1}^{n}(m_j + \kappa) - \prod_{j=1}^{n} m_j \Big) \le \delta_h\, \kappa^d \prod_{m_j \neq 0} (N_j + 1)(1 + \kappa) \le \delta_h\, \kappa^d (1+\kappa)^{n-d} H \le \varepsilon\, \kappa^d\, 2^{n-d} H \le \varepsilon\, \kappa\, 2^{n-1} H$ (d being the number of coordinates with $m_j = 0$), so that here, the bound on κ is: $\kappa \le \frac{1}{2^n H^2}$. [sent-922, score-0.31]

98 Thus, for any gridpoint, we have: $\tilde f_h - \hat f_h = \sum_{k=1}^{H} (\tilde g_k - \hat g_k) \le H \cdot \varepsilon/2H = \varepsilon/2$. [sent-925, score-0.31]

99 In the present subsection, we showed that the softplus and sigmoid parameters can be chosen such that $\hat f \le \tilde f \le \hat f + \varepsilon/2$ for any gridpoint. [sent-928, score-0.299]

100 Let $q_1$ and $q_2$ be the innermost and outermost gridpoints of q's encompassing hypercube of side length L. [sent-932, score-0.291]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ih', 0.434), ('ph', 0.363), ('cpsd', 0.286), ('umlp', 0.265), ('softplus', 0.212), ('ik', 0.21), ('gridpoints', 0.191), ('nadeau', 0.159), ('gh', 0.155), ('lisle', 0.148), ('ugas', 0.127), ('architecture', 0.127), ('option', 0.118), ('ncorporating', 0.117), ('unctional', 0.117), ('price', 0.113), ('engio', 0.108), ('arcia', 0.108), ('asset', 0.106), ('eural', 0.099), ('nowledge', 0.099), ('quarter', 0.099), ('pricing', 0.09), ('etworks', 0.089), ('sigmoid', 0.087), ('pk', 0.081), ('universality', 0.076), ('derivatives', 0.07), ('maturity', 0.064), ('eg', 0.059), ('units', 0.059), ('architectures', 0.058), ('strike', 0.056), ('montr', 0.054), ('grid', 0.053), ('dugas', 0.053), ('qhc', 0.053), ('unconstrained', 0.052), ('universal', 0.052), ('hidden', 0.046), ('approximating', 0.044), ('approximator', 0.044), ('constrained', 0.044), ('st', 0.043), ('moneyness', 0.042), ('outermost', 0.042), ('volatility', 0.042), ('dimensions', 0.04), ('ln', 0.04), ('derivative', 0.038), ('year', 0.037), ('options', 0.036), ('evi', 0.036), ('exponentiation', 0.036), ('innermost', 0.036), ('quarters', 0.036), ('xi', 0.036), ('validation', 0.036), ('positivity', 0.035), ('feedforward', 0.032), ('sigmoidal', 0.032), ('garcia', 0.032), ('qc', 0.032), ('cbs', 0.032), ('cirano', 0.032), ('cmlp', 0.032), ('cssd', 0.032), ('ewi', 0.032), ('holder', 0.032), ('ihc', 0.032), ('respected', 0.032), ('ubs', 0.032), ('differentiating', 0.031), ('al', 0.03), ('stock', 0.03), ('convex', 0.03), ('formula', 0.029), ('ea', 0.029), ('bias', 0.029), ('canada', 0.029), ('variance', 0.028), ('nn', 0.028), ('call', 0.028), ('gk', 0.028), ('train', 0.027), ('francois', 0.027), ('pnn', 0.027), ('axis', 0.026), ('constructive', 0.025), ('fh', 0.024), ('market', 0.024), ('nh', 0.024), ('monotone', 0.023), ('scan', 0.022), ('hypercube', 0.022), ('layer', 0.022), ('networks', 0.022), ('buyer', 0.021), ('claude', 0.021), ('delfour', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia

Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options

2 0.096398667 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

Author: Gilles Blanchard, Étienne Roquain

Abstract: In the context of multiple hypothesis testing, the proportion π0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficency is called adaptive. In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR. In a first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation results comparing these procedures and testing their robustness when the independence assumption is violated. The new proposed procedures appear competitive with existing ones. The overall best, though, is reported to be Storey’s estimator, albeit for a specific parameter setting that does not appear to have been considered before. In a second part, we propose adaptive versions of step-up procedures that have provably controlled FDR under positive dependence and unspecified dependence of the p-values, respectively. In the latter case, while simulations only show an improvement over non-adaptive procedures in limited situations, these are to our knowledge among the first theoretically founded adaptive multiple testing procedures that control the FDR when the p-values are not independent. Keywords: multiple testing, false discovery rate, adaptive procedure, positive regression dependence, p-values

3 0.091012262 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

Author: Cynthia Rudin

Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval

4 0.086745948 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

Author: Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, Pascal Lamblin

Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms. Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning

5 0.06694822 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve

6 0.032171536 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models    (Machine Learning Open Source Software Paper)

7 0.03126765 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

8 0.030352265 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

9 0.030215178 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

10 0.025541496 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

11 0.025449488 23 jmlr-2009-Discriminative Learning Under Covariate Shift

12 0.024344305 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

13 0.020961292 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

14 0.020546699 67 jmlr-2009-Online Learning with Sample Path Constraints

15 0.020453334 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

16 0.020313023 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

17 0.019653076 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

18 0.01923188 50 jmlr-2009-Learning When Concepts Abound

19 0.018953357 90 jmlr-2009-Structure Spaces

20 0.018580956 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.12), (1, -0.043), (2, 0.059), (3, 0.08), (4, 0.001), (5, 0.124), (6, 0.027), (7, 0.156), (8, -0.059), (9, -0.03), (10, -0.02), (11, 0.154), (12, 0.04), (13, -0.051), (14, -0.045), (15, 0.036), (16, -0.197), (17, 0.175), (18, -0.019), (19, -0.31), (20, 0.105), (21, -0.264), (22, 0.086), (23, -0.063), (24, 0.095), (25, -0.06), (26, -0.133), (27, -0.027), (28, 0.036), (29, 0.101), (30, 0.023), (31, -0.006), (32, 0.074), (33, 0.037), (34, -0.181), (35, 0.16), (36, -0.054), (37, -0.112), (38, -0.01), (39, -0.076), (40, -0.106), (41, -0.092), (42, 0.076), (43, -0.059), (44, -0.13), (45, -0.039), (46, 0.042), (47, -0.064), (48, 0.051), (49, 0.118)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95186669 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia

Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options

2 0.55098087 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

Author: Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, Pascal Lamblin

Abstract: Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization often appears to get stuck in poor solutions. Hinton et al. recently proposed a greedy layer-wise unsupervised learning procedure relying on the training algorithm of restricted Boltzmann machines (RBM) to initialize the parameters of a deep belief network (DBN), a generative model with many layers of hidden causal variables. This was followed by the proposal of another greedy layer-wise procedure, relying on the usage of autoassociator networks. In the context of the above optimization problem, we study these algorithms empirically to better understand their success. Our experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input. We also present a series of experiments aimed at evaluating the link between the performance of deep neural networks and practical aspects of their topology, for example, demonstrating cases where the addition of more depth helps. Finally, we empirically explore simple variants of these training algorithms, such as the use of different RBM input unit distributions, a simple way of combining gradient estimators to improve performance, as well as on-line versions of those algorithms. Keywords: artificial neural networks, deep belief networks, restricted Boltzmann machines, autoassociators, unsupervised learning

3 0.46014085 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

Author: Cynthia Rudin

Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval

4 0.24718884 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve

5 0.23337939 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

Author: Holger Höfling, Robert Tibshirani

Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudolikelihood criterion so that each additional iterations moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, Binary variables

6 0.23122978 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

7 0.21928999 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

8 0.20439202 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models    (Machine Learning Open Source Software Paper)

9 0.18762004 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

10 0.17907546 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

11 0.15686727 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression

12 0.14356089 16 jmlr-2009-Classification with Gaussians and Convex Loss

13 0.14228171 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

14 0.1417518 23 jmlr-2009-Discriminative Learning Under Covariate Shift

15 0.1392242 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

16 0.12702073 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

17 0.12250754 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

18 0.11788319 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

19 0.11372223 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

20 0.11233433 46 jmlr-2009-Learning Halfspaces with Malicious Noise


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.013), (11, 0.024), (13, 0.478), (19, 0.015), (21, 0.01), (26, 0.013), (38, 0.036), (47, 0.031), (52, 0.038), (55, 0.028), (58, 0.024), (66, 0.076), (68, 0.026), (87, 0.022), (90, 0.074), (96, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68801993 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia

Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz1 functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options

2 0.2493979 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

Author: John Duchi, Yoram Singer

Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization

3 0.24641366 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

Author: Halbert White, Karim Chalak

Abstract: Judea Pearl’s Causal Model is a rich framework that provides deep insight into the nature of causal relations. As yet, however, the Pearl Causal Model (PCM) has had a lesser impact on economics or econometrics than on other disciplines. This may be due in part to the fact that the PCM is not as well suited to analyzing structures that exhibit features of central interest to economists and econometricians: optimization, equilibrium, and learning. We offer the settable systems framework as an extension of the PCM that permits causal discourse in systems embodying optimization, equilibrium, and learning. Because these are common features of physical, natural, or social systems, our framework may prove generally useful for machine learning. Important features distinguishing the settable system framework from the PCM are its countable dimensionality and the use of partitioning and partition-specific response functions to accommodate the behavior of optimizing and interacting agents and to eliminate the requirement of a unique fixed point for the system. Refinements of the PCM include the settable systems treatment of attributes, the causal role of exogenous variables, and the dual role of variables as causes and responses. A series of closely related machine learning examples and examples from game theory and machine learning with feedback demonstrates some limitations of the PCM and motivates the distinguishing features of settable systems. Keywords: equations causal models, game theory, machine learning, recursive estimation, simultaneous

4 0.24596004 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

Author: Cynthia Rudin, Robert E. Schapire

Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve

5 0.24216168 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang

Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semiinner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-innerproducts, representer theorems

6 0.24151339 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

7 0.24088517 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

8 0.24032377 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

9 0.23941475 84 jmlr-2009-Scalable Collaborative Filtering Approaches for Large Recommender Systems    (Special Topic on Mining and Learning with Graphs and Relations)

10 0.23890446 41 jmlr-2009-Improving the Reliability of Causal Discovery from Small Data Sets Using Argumentation    (Special Topic on Causality)

11 0.23829125 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

12 0.2377753 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

13 0.23764941 18 jmlr-2009-Consistency and Localizability

14 0.23732635 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

15 0.23726422 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

16 0.23664916 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

17 0.23588504 38 jmlr-2009-Hash Kernels for Structured Data

18 0.2358353 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

19 0.23504508 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

20 0.23448193 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List