nips nips2009 nips2009-116 knowledge-graph by maker-knowledge-mining

116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization


Source: pdf

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Information-theoretic lower bounds on the oracle complexity of convex optimization Alekh Agarwal Computer Science Division UC Berkeley alekh@cs. [sent-1, score-1.175]

2 edu Abstract Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. [sent-10, score-0.519]

3 Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. [sent-11, score-0.401]

4 In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. [sent-12, score-1.09]

5 We improve upon known results and obtain tight minimax complexity estimates for various function classes. [sent-13, score-0.332]

6 We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. [sent-14, score-0.23]

7 For such problems, understanding the computational complexity of convex optimization is a key issue. [sent-17, score-0.496]

8 A large body of literature is devoted to obtaining rates of convergence of specific procedures for various classes of convex optimization problems. [sent-18, score-0.369]

9 Such analyses have been performed for many standard optimization alogrithms, among them gradient descent, mirror descent, interior point programming, and stochastic gradient descent, to name a few. [sent-20, score-0.346]

10 We refer the reader to standard texts on optimization (e. [sent-21, score-0.188]

11 On the other hand, there has been relatively little study of the inherent complexity of convex optimization problems. [sent-24, score-0.502]

12 One obstacle to a classical complexity-theoretic analysis, as the authors observed, was that of casting convex optimization problems in a Turing Machine model. [sent-26, score-0.369]

13 They avoided this problem by instead considering a natural oracle model of complexity in which at every round, the optimization procedure queries an oracle for certain information on the function being optimized. [sent-27, score-1.381]

14 Working within this framework, the authors obtained a series of lower bounds on the computational complexity of convex optimization 1 problems. [sent-28, score-0.639]

15 In this paper, we consider the computational complexity of stochastic convex optimization in the oracle model. [sent-30, score-1.09]

16 First, our lower bounds have an improved dependence on the dimension of the space. [sent-33, score-0.274]

17 In the context of statistical estimation, these bounds show how the difficulty of the estimation problem increases with the number of parameters. [sent-34, score-0.17]

18 For instance, they show that the optimal oracle complexity of statistical estimation with quadratic loss is significantly smaller than the corresponding complexity with absolute loss. [sent-36, score-0.818]

19 Our proofs exploit a new notion of the discrepancy between two functions that appears to be natural for optimization problems. [sent-37, score-0.315]

20 They are based on a reduction from a statistical parameter estimation problem to the stochastic optimization problem, and an application of information-theoretic lower bounds for the estimation problem. [sent-38, score-0.533]

21 2 Background and problem formulation In this section, we introduce background on the oracle model of complexity for convex optimization, and then define the oracles considered in this paper. [sent-39, score-1.001]

22 1 Convex optimization in the oracle model Convex optimization is the task of minimizing a convex function f over a convex set S ⊆ Rd . [sent-41, score-1.274]

23 Our primary focus in this paper is the following question: given any class of convex functions F, what is the minimum computational labor any such optimization method would expend for any function in F? [sent-44, score-0.488]

24 In order to address this question, we follow the approach of Nemirovski and Yudin [8], based on the oracle model of optimization. [sent-45, score-0.536]

25 More precisely, an oracle is a (possibly random) function φ : S → I that answers any query x ∈ S by returning an element φ(x) in an information set I. [sent-46, score-0.658]

26 The information set varies depending on the oracle; for instance, for an exact oracle of k th order, the answer to a query xt consists of f (xt ) and the first k derivatives of f at xt . [sent-47, score-1.003]

27 For the case of stochastic oracles studied in this paper, these values are corrupted with zero-mean noise with bounded variance. [sent-48, score-0.31]

28 Given some number of rounds T , an optimization method M designed to approximately minimize the convex function f over the convex set S proceeds as follows: at any given round t = 1, T , the method M queries at xt ∈ S, and the oracle reveals the information φ(xt , f ). [sent-49, score-1.449]

29 For a given oracle function φ, let MT denote the class of all optimization methods M that make T queries according to the procedure outlined above. [sent-51, score-0.814]

30 For any method M ∈ MT , we define its error on function f after T steps as (M, f, S, φ) := f (xT ) − inf f (x) = f (xT ) − f (x∗ ), (1) f x∈S where xT is the method’s query at time T . [sent-52, score-0.169]

31 2 Minimax error When the oracle is stochastic, the method’s query xT at time T is itself random, since it depends on the random answers provided by the oracle. [sent-55, score-0.687]

32 In this case, the optimization error (M, f, S, φ) is also a random variable. [sent-56, score-0.167]

33 Accordingly, for the case of stochastic oracles, we measure the accuracy in terms of the expected value Eφ [ (M, f, S, φ)], where the expectation is taken over the oracle randomness. [sent-57, score-0.626]

34 Given a class of functions F, and the class MT of optimization methods making T oracle queries, we can define the minimax error ∗ (F, S, φ) := inf sup Eφ [ (MT , f, S, φ)]. [sent-58, score-1.182]

35 It is worth noting that oracle complexity measures only the number of queries to the oracle—for instance, the number of (approximate) function or gradient evaluations. [sent-63, score-0.766]

36 However, it does not track computational cost within each component of the oracle query (e. [sent-64, score-0.609]

37 Recall that a subgradient of a convex function f is any vector v ∈ Rd such that f (y) ≥ f (x) + v (y − x). [sent-74, score-0.261]

38 For a convex set S, the radius of the largest inscribed ∞ ball is denoted as r∞ . [sent-82, score-0.373]

39 For a convex function f , its minimizer over a set S will be denoted as x∗ when S is obvious from f context. [sent-83, score-0.231]

40 3 Main results and their consequences With the setup of stochastic convex optimization in place, we are now in a position to state the main results of this paper. [sent-88, score-0.459]

41 In particular, we provide some tight lower bounds on the complexity of stochastic oracle optimization. [sent-89, score-0.98]

42 We begin by analyzing the minimax oracle complexity of optimization for the class of convex Lipschitz functions. [sent-90, score-1.217]

43 Recall that a function f : Rd → R is convex if for all x, y ∈ Rd and λ ∈ (0, 1), we have the inequality f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y). [sent-91, score-0.231]

44 Before stating the results, we note that scaling the Lipschitz constant scales minimax optimization error linearly. [sent-93, score-0.349]

45 Let F C be the class of all bounded convex 1-Lipschitz functions on Rd . [sent-97, score-0.431]

46 T (5) Remarks: This lower bound is tight in the minimax sense, since the method of stochastic gradient descent attains a matching upper bound for all stochastic first order oracles for any convex set S (see Chapter 5 of NY [8]). [sent-99, score-1.136]

47 Also, even though this lower bound requires the oracle to have only bounded variance, we will use an oracle based on Bernoulli random variables, which has all moments bounded. [sent-100, score-1.336]

48 The above lower bound is obtained by considering the worst case over all convex sets. [sent-103, score-0.366]

49 However, we expect optimization over a smaller convex set to be easier than over a large set. [sent-104, score-0.398]

50 Let F C be the class of all bounded convex 1-Lipschitz functions on Rd . [sent-107, score-0.431]

51 Let S be a convex set such that it contains an ∞ ball of radius r∞ and is contained in an ∞ ball of radius R∞ . [sent-108, score-0.558]

52 This lower bound for the case of the 2 ball is indeed tight, and is recovered by the stochastic gradient descent algorithm [8]. [sent-114, score-0.406]

53 Just as optimization over simpler sets gets easier, optimization over simple function classes should be easier too. [sent-115, score-0.305]

54 A natural function class that has been studied extensively in the context of better upper bounds is that of strongly convex functions. [sent-116, score-0.501]

55 For any given norm · on S, a function f is strongly convex with coefficient κ means that f (x) ≥ f (y) + f (y)T (x − y) + κ x − y 2 for all x, y ∈ S. [sent-117, score-0.287]

56 2 For this class of functions, we obtain a smaller lower bound on the minimax oracle complexity of optimization. [sent-118, score-0.983]

57 Let F S be the class of all bounded strongly convex and 1-Lipschitz functions on Rd . [sent-120, score-0.487]

58 T (7) Once again there is a matching upper bound using stochastic gradient descent for example, when the strong convexity is with respect to the 2 norm. [sent-122, score-0.304]

59 Let F S be the class of all bounded convex 1-Lipschitz functions on Rd . [sent-125, score-0.431]

60 Let S be a convex set such that it contains an ∞ ball of radius r∞ . [sent-126, score-0.373]

61 ∞ 1 In comparison, Nemirovski and Yudin [8] obtained a lower bound scaling as Ω √T for the class F C . [sent-128, score-0.199]

62 Their bound applies only to the class F C , and does not provide any dimension dependence, as opposed to the bounds provided here. [sent-129, score-0.275]

63 Obtaining the correct dependence yields tight minimax results, and allows us to highlight the dependence of bounds on the geometry of the set S. [sent-130, score-0.458]

64 We characterize the hardness of optimization in terms of a relatively easy to compute complexity measure. [sent-132, score-0.276]

65 As a result, our technique provides tight lower bounds for smaller function classes like strongly convex functions rather easily. [sent-133, score-0.601]

66 If the distribution were known, this is exactly the problem of computing the -accurate optimizer of a convex function, assuming the function class F is convex. [sent-137, score-0.295]

67 If indeed the computational model of the estimator were restricted to querying these values and gradients, then the lower bounds in the previous sections would apply. [sent-142, score-0.217]

68 Our bounds, then allow us to deduce the oracle complexity of statistical estimation problems in this realistic model. [sent-143, score-0.734]

69 In particular, a case of interest is when we fix a convex loss function and consider the worst oracle complexity over all possible distributions under which expectation is taken. [sent-144, score-0.889]

70 From our bounds, it is straightforward to deduce: • For the absolute loss (f (x), y) = |f (x) − y|, the oracle complexity of -accurate estimation over all possible distributions is Ω d/ 2 . [sent-145, score-0.723]

71 • For the quadratic loss (f (x), y) = (f (x) − y)2 , the oracle complexity of -accurate estimation over all possible distributions is Ω (d/ ). [sent-146, score-0.723]

72 In contrast, our results provide algorithm-independent lower bounds on the complexity of statistical estimation within the oracle model. [sent-149, score-0.871]

73 An interesting direction for future work is to broader the oracle model so as to more accurately reflect the computational trade-offs in learning and estimation problems, for instance by allowing a method to pay a higher price to query an oracle with lower variance. [sent-150, score-1.28]

74 1 High-level outline Our main idea is to embed the problem of estimating the parameter of a Bernoulli vector (alternatively, the biases of d coins) into a convex optimization problem. [sent-153, score-0.424]

75 For any given function class, we then construct a “difficult” subclass of functions parameterized by these hypercube vertices. [sent-155, score-0.337]

76 We then show that being able to optimize any function in this subclass requires estimating its hypercube vertex, that is, the corresponding biases of the d coins. [sent-156, score-0.282]

77 But the only information for this estimation would be from the coin toss outcomes revealed by the oracle in T queries. [sent-157, score-0.865]

78 With this set-up, we are able to apply the Fano lower bound for statistical estimation, as has been done in past work on nonparametric estimation (e. [sent-158, score-0.2]

79 Our first step is to construct a subclass of functions G ⊆ F that we use to derive lower bounds. [sent-163, score-0.317]

80 i=1 5 (8) In our proofs, the subclasses Gbase and G(δ) are chosen such that G(δ) ⊆ F, the functions fi+ , fi− are bounded over the convex set S with a Lipschitz constant independent of dimension d, and the minimizers xβ of gβ over Rd are contained in S for all β ∈ V. [sent-173, score-0.515]

81 In this step, we show that if a method can optimize over the subclass G(δ) up to a certain tolerance ψ(G(δ)), then it must be capable of identifying which function gα ∈ G(δ) was chosen. [sent-176, score-0.196]

82 Given a convex set S ⊆ Rd and two f functions f, g, we define ρ(f, g) = inf f (x) + g(x) − f (x∗ ) − g(x∗ ) . [sent-179, score-0.353]

83 g f Given the subclass G(δ), we quantify how densely it is packed with respect to the semimetric ρ using the quantity ψ(G(δ)) = min ρ(gα , gβ ), (10) α=β∈V which we also denote by ψ(δ) when the class G is clear from the context. [sent-181, score-0.228]

84 β Suppose that we choose some function gα∗ ∈ G(δ), and some method MT is allowed to make T queries to an oracle with information function φ(·, gα∗ ). [sent-190, score-0.612]

85 Recall the definition (2) of the minimax error in optimization: Lemma 2. [sent-192, score-0.182]

86 Suppose that some method MT has minimax optimization error upper bounded as ψ(δ) . [sent-193, score-0.446]

87 Given a method MT that satisfies the bound (11), we construct an estimator α(MT ) of the true vertex α∗ as follows. [sent-196, score-0.209]

88 We have thus shown that having a low minimax optimization error over G(δ) implies that the vertex α ∈ V can be identified. [sent-202, score-0.394]

89 We now demonstrate a stochastic first order oracle φ for which the samples {φ(x1 , gα ), . [sent-205, score-0.626]

90 In particular, we associate a coin with each dimension i ∈ {1, 2, . [sent-209, score-0.204]

91 , 1/2 + αd δ) | α ∈ V , (12) Given a particular function gα ∈ G(δ) (or equivalently, vertex α ∈ V), we consider the oracle φ that presents noisy value and gradient samples from gα according to the following prescription: • Pick an index it ∈ {1, . [sent-215, score-0.669]

92 t t By construction, the function value and gradient samples are unbiased estimates of those of gα ; moreover, the variance of the effective “noise” is bounded independently of d as long as the Lipschitz constant is independent of d since the function values and gradients are bounded on S. [sent-221, score-0.321]

93 Step IV: Lower bounds on coin-tossing Finally, we use information-theoretic methods to lower bound the probability of correctly estimating the true vertex α∗ ∈ V in our model. [sent-222, score-0.314]

94 Given an arbitrary vertex α∗ ∈ V, suppose that we toss a set of d coins with bias 1 ∗ ∗ θ∗ = ( 2 + α1 δ, . [sent-224, score-0.197]

95 , 1 + α2 δ) a total of T times, but that the outcome of only one coin chosen 2 uniformly at random is revealed at every round. [sent-227, score-0.218]

96 , d} be the variable indicating the coin revealed at time T , and let Xt ∈ {0, 1} denote its outcome. [sent-234, score-0.218]

97 7 Proof of Theorem 1: By the construction of our oracle, it is clear that, at each round, only one coin is revealed to the method MT . [sent-247, score-0.218]

98 (14) In order to obtain an upper bound on P[α(MT ) = α] using Lemma 2, we need to identify the subclass Gbase of F C . [sent-249, score-0.274]

99 Further, if the set is contained in a ball of radius 1 R∞ , then we need to scale the function with R∞ to keep the function values bounded. [sent-266, score-0.185]

100 The reader might suspect that the dimension dependence in our lower bound for strongly convex functions is not tight, due to the dependence of κ on the dimension d. [sent-278, score-0.725]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oracle', 0.536), ('mt', 0.333), ('convex', 0.231), ('xt', 0.197), ('fi', 0.173), ('subclass', 0.164), ('coin', 0.163), ('minimax', 0.153), ('oracles', 0.139), ('optimization', 0.138), ('nemirovski', 0.116), ('gbase', 0.115), ('rd', 0.112), ('bounds', 0.105), ('complexity', 0.095), ('hypercube', 0.09), ('stochastic', 0.09), ('proofs', 0.09), ('alekh', 0.087), ('yudin', 0.087), ('tight', 0.084), ('bounded', 0.081), ('ball', 0.077), ('sup', 0.076), ('queries', 0.076), ('vertex', 0.074), ('kl', 0.074), ('query', 0.073), ('lower', 0.07), ('inf', 0.067), ('lemma', 0.067), ('bernoulli', 0.066), ('radius', 0.065), ('estimation', 0.065), ('fano', 0.065), ('bound', 0.065), ('class', 0.064), ('yt', 0.06), ('gradient', 0.059), ('dependence', 0.058), ('hamming', 0.056), ('corollary', 0.056), ('strongly', 0.056), ('revealed', 0.055), ('lipschitz', 0.055), ('functions', 0.055), ('coins', 0.051), ('reader', 0.05), ('answers', 0.049), ('moments', 0.048), ('toss', 0.046), ('ny', 0.046), ('universal', 0.045), ('upper', 0.045), ('descent', 0.045), ('hardness', 0.043), ('contained', 0.043), ('uc', 0.042), ('estimator', 0.042), ('dimension', 0.041), ('packing', 0.041), ('round', 0.04), ('bit', 0.04), ('inherent', 0.038), ('notation', 0.038), ('deduce', 0.038), ('unbiased', 0.037), ('nesterov', 0.036), ('lectures', 0.035), ('minimizers', 0.035), ('theorem', 0.034), ('gradients', 0.034), ('consequently', 0.033), ('understanding', 0.032), ('discrepancy', 0.032), ('tolerance', 0.032), ('division', 0.032), ('berkeley', 0.03), ('award', 0.03), ('subgradient', 0.03), ('easier', 0.029), ('constant', 0.029), ('error', 0.029), ('proof', 0.028), ('biases', 0.028), ('resources', 0.028), ('chapter', 0.028), ('construct', 0.028), ('loss', 0.027), ('outline', 0.027), ('bartlett', 0.027), ('suppose', 0.026), ('calculation', 0.026), ('pradeepr', 0.025), ('wainwrig', 0.025), ('variates', 0.025), ('undertaken', 0.025), ('turing', 0.025), ('cam', 0.025), ('espaces', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

2 0.1886515 154 nips-2009-Modeling the spacing effect in sequential category learning

Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille

Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.

3 0.18852249 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form ∗ ∗ f ∗ (X1 , . . . , Xp ) = j∈S hj (Xj ), where each component function hj lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max s log(p/s) , s ǫ2 (H) . The first term reflects the sample size required for n n performing subset selection, and is independent of the function class H. The second term s ǫ2 (H) is an s-dimensional estimation term corresponding to the sample size required for n estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form sǫ2 (H) ≍ s n−2m/(2m+1) . Either of n the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

4 0.14440435 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand` se Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

5 0.13919339 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

6 0.12970157 178 nips-2009-On Stochastic and Worst-case Models for Investing

7 0.12785064 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

8 0.12680678 11 nips-2009-A General Projection Property for Distribution Families

9 0.12514625 220 nips-2009-Slow Learners are Fast

10 0.11996707 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

11 0.11366671 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

12 0.10771108 94 nips-2009-Fast Learning from Non-i.i.d. Observations

13 0.096848927 45 nips-2009-Beyond Convexity: Online Submodular Minimization

14 0.094480403 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

15 0.09381526 27 nips-2009-Adaptive Regularization of Weight Vectors

16 0.091703579 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

17 0.088656314 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

18 0.088007227 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

19 0.086490296 147 nips-2009-Matrix Completion from Noisy Entries

20 0.085462481 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.235), (1, 0.248), (2, 0.081), (3, -0.009), (4, 0.166), (5, 0.067), (6, 0.002), (7, -0.024), (8, 0.049), (9, 0.017), (10, 0.07), (11, -0.012), (12, 0.039), (13, 0.015), (14, 0.145), (15, -0.045), (16, 0.0), (17, 0.037), (18, 0.068), (19, 0.044), (20, 0.105), (21, 0.046), (22, -0.005), (23, 0.09), (24, -0.081), (25, -0.206), (26, 0.033), (27, -0.049), (28, -0.003), (29, -0.026), (30, 0.003), (31, -0.033), (32, -0.068), (33, -0.009), (34, -0.052), (35, -0.001), (36, -0.19), (37, 0.032), (38, -0.002), (39, -0.021), (40, -0.017), (41, -0.021), (42, -0.131), (43, 0.018), (44, 0.012), (45, 0.029), (46, -0.07), (47, 0.041), (48, -0.086), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96708196 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

2 0.74283361 11 nips-2009-A General Projection Property for Distribution Families

Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári

Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1

3 0.64004022 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f ∗ : Rp → R that has an additive decomposition of the form ∗ ∗ f ∗ (X1 , . . . , Xp ) = j∈S hj (Xj ), where each component function hj lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , ..., Xp ) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. Our main result is a lower bound on the minimax rate that scales as max s log(p/s) , s ǫ2 (H) . The first term reflects the sample size required for n n performing subset selection, and is independent of the function class H. The second term s ǫ2 (H) is an s-dimensional estimation term corresponding to the sample size required for n estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth -order Sobolev space), then the s-dimensional estimation term takes the form sǫ2 (H) ≍ s n−2m/(2m+1) . Either of n the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

4 0.59991878 178 nips-2009-On Stochastic and Worst-case Models for Investing

Author: Elad Hazan, Satyen Kale

Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1

5 0.58897114 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Cand` se Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

6 0.57892942 220 nips-2009-Slow Learners are Fast

7 0.57169306 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

8 0.54709232 154 nips-2009-Modeling the spacing effect in sequential category learning

9 0.53522319 94 nips-2009-Fast Learning from Non-i.i.d. Observations

10 0.53289884 27 nips-2009-Adaptive Regularization of Weight Vectors

11 0.50721866 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

12 0.50585878 180 nips-2009-On the Convergence of the Concave-Convex Procedure

13 0.49579367 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

14 0.47775444 33 nips-2009-Analysis of SVM with Indefinite Kernels

15 0.47095123 76 nips-2009-Efficient Learning using Forward-Backward Splitting

16 0.4691492 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

17 0.45656526 69 nips-2009-Discrete MDL Predicts in Total Variation

18 0.43644169 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

19 0.4321782 55 nips-2009-Compressed Least-Squares Regression

20 0.42841575 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.159), (25, 0.104), (35, 0.032), (36, 0.094), (39, 0.042), (51, 0.163), (58, 0.095), (61, 0.037), (71, 0.083), (81, 0.028), (86, 0.077), (91, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90403479 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

2 0.88623285 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

3 0.86692548 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

4 0.81799847 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

Abstract: Many researchers have suggested that the psychological complexity of a concept is related to the length of its representation in a language of thought. As yet, however, there are few concrete proposals about the nature of this language. This paper makes one such proposal: the language of thought allows first order quantification (quantification over objects) more readily than second-order quantification (quantification over features). To support this proposal we present behavioral results from a concept learning study inspired by the work of Shepard, Hovland and Jenkins. Humans can learn and think about many kinds of concepts, including natural kinds such as elephant and water and nominal kinds such as grandmother and prime number. Understanding the mental representations that support these abilities is a central challenge for cognitive science. This paper proposes that quantification plays a role in conceptual representation—for example, an animal X qualifies as a predator if there is some animal Y such that X hunts Y . The concepts we consider are much simpler than real-world examples such as predator, but even simple laboratory studies can provide important clues about the nature of mental representation. Our approach to mental representation is based on the language of thought hypothesis [1]. As pursued here, the hypothesis proposes that mental representations are constructed in a compositional language of some kind, and that the psychological complexity of a concept is closely related to the length of its representation in this language [2, 3, 4]. Following previous researchers [2, 4], we operationalize the psychological complexity of a concept in terms of the ease with which it is learned and remembered. Given these working assumptions, the remaining challenge is to specify the representational resources provided by the language of thought. Some previous studies have relied on propositional logic as a representation language [2, 5], but we believe that the resources of predicate logic are needed to capture the structure of many human concepts. In particular, we suggest that the language of thought can accommodate relations, functions, and quantification, and focus here on the role of quantification. Our primary proposal is that quantification is supported by the language of thought, but that quantification over objects is psychologically more natural than quantification over features. To test this idea we compare concept learning in two domains which are very similar except for one critical difference: one domain allows quantification over objects, and the other allows quantification over features. We consider several logical languages that can be used to formulate concepts in both domains, and find that learning times are best predicted by a language that supports quantification over objects but not features. Our work illustrates how theories of mental representation can be informed by comparing concept learning across two or more domains. Existing studies work with a range of domains, and it is useful to consider a “conceptual universe” that includes these possibilities along with many others that have not yet been studied. Table 1 charts a small fragment of this universe, and the penultimate column shows example stimuli that will be familiar from previous studies of concept learning. Previous studies have made important contributions by choosing a single domain in Table 1 and explaining 1 why some concepts within this domain are easier to learn than others [2, 4, 6, 7, 8, 9]. Comparisons across domains can also provide important information about learning and mental representation, and we illustrate this claim by comparing learning times across Domains 3 and 4. The next section introduces the conceptual universe in Table 1 in more detail. We then present a formal approach to concept learning that relies on a logical language and compare three candidate languages. Language OQ (for object quantification) supports quantification over objects but not features, language F Q (for feature quantification) supports quantification over features but not objects, and language OQ + F Q supports quantification over both objects and features. We use these languages to predict learning times across Domains 3 and 4, and present an experiment which suggests that language OQ comes closest to the language of thought. 1 The conceptual universe Table 1 provides an organizing framework for thinking about the many domains in which learning can occur. The table includes 8 domains, each of which is defined by specifying some number of objects, features, and relations, and by specifying the range of each feature and each relation. We refer to the elements in each domain as items, and the penultimate column of Table 1 shows items from each domain. The first row shows a domain commonly used by studies of Boolean concept learning. Each item in this domain includes a single object a and specifies whether that object has value v1 (small) or v2 (large) on feature F (size), value v3 (white) or v4 (gray) on feature G (color), and value v5 (vertical) or v6 (horizontal) on feature H (texture). Domain 2 also includes three features, but now each item includes three objects and each feature applies to only one of the objects. For example, feature H (texture) applies to only the third object in the domain (i.e. the third square on each card). Domain 3 is similar to Domain 1, but now the three features can be aligned— for any given item each feature will be absent (value 0) or present. The example in Table 1 uses three features (boundary, dots, and slash) that can each be added to an unadorned gray square. Domain 4 is similar to Domain 2, but again the feature values can be aligned, and the feature for each object will be absent (value 0) or present. Domains 5 and 6 are similar to domains 2 and 4 respectively, but each one includes relations rather than features. In Domain 6, for example, the relation R assigns value 0 (absent) or value 1 (present) to each undirected pair of objects. The first six domains in Table 1 are all variants of Domain 1, which is the domain typically used by studies of Boolean concept learning. Focusing on six related domains helps to establish some of the dimensions along which domains can differ, but the final two domains in Table 1 show some of the many alternative possibilities. Domain 7 includes two categorical features, each of which takes three rather than two values. Domain 8 is similar to Domain 6, but now the number of objects is 6 rather than 3 and relation R is directed rather than undirected. To mention just a handful of possibilities which do not appear in Table 1, domains may also have categorical features that are ordered (e.g. a size feature that takes values small, medium, and large), continuous valued features or relations, relations with more than two places, and objects that contain sub-objects or parts. Several learning problems can be formulated within any given domain. The most basic is to learn a single item—for example, a single item from Domain 8 [4]. A second problem is to learn a class of items—for example, a class that includes four of the items in Domain 1 and excludes the remaining four [6]. Learning an item class can be formalized as learning a unary predicate defined over items, and a natural extension is to consider predicates with two or more arguments. For example, problems of the form A is to B as C is to ? can be formulated as problems where the task is to learn a binary relation analogous(·, ·) given the single example analogous(A, B). Here, however, we focus on the task of learning item classes or unary predicates. Since we focus on the role of quantification, we will work with domains where quantification is appropriate. Quantification over objects is natural in cases like Domain 4 where the feature values for all objects can be aligned. Note, for example, that the statement “every object has its feature” picks out the final example item in Domain 4 but that no such statement is possible in Domain 2. Quantification over features is natural in cases like Domain 3 where the ranges of each feature can be aligned. For example, “object a has all three features” picks out the final example item in Domain 3 but no such statement is possible in Domain 1. We therefore focus on Domains 3 and 4, and explore the problem of learning item classes in each domain. 2 3 {a} {a, b, c} {a} {a, b, c} {a, b, c} {a, b, c} {a} {a, b, c, d, e, f } 1 2 3 4 5 6 7 8 R : O × O → {0, 1} — F : O → {v1 , v2 , v3 } G : O → {v4 , v5 , v6 } — R : O × O → {0, 1} R : (a, b) → {v1 , v2 } S : (a, c) → {v3 , v4 } T : (b, c) → {v5 , v6 } — — — — Relations — — Domain specification Features F : O → {v1 , v2 } G : O → {v3 , v4 } H : O → {v5 , v6 } F : a → {v1 , v2 } G : b → {v3 , v4 } H : c → {v5 , v6 } F : O → {0, v1 } G : O → {0, v2 } H : O → {0, v3 } F : a → {0, v1 } G : b → {0, v2 } H : c → {0, v3 } , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... , ... , Example Items , , , , , , , , , , , , , ... , [4] [8, 9] [13] [6] [12] [6] [2, 6, 7, 10, 11] Ref. Table 1: The conceptual universe. Eight domains are shown, and each one is defined by a set of objects, a set of features, and a set of relations. We call the members of each domain items, and an item is created by specifying the extension of each feature and relation in the domain. The six domains above the double lines are closely related to the work of Shepard et al. [6]. Each one includes eight items which differ along three dimensions. These dimensions, however, emerge from different underlying representations in the six cases. Objects O # (a) (b) 1 (I) 2 (II) 3 (III) 4 (III) 5 (IV) 6 (IV) 7 (V) 8 (V) 9 (V) 10 (VI) 111 110 101 011 100 010 001 000 Figure 1: (a) A stimulus lattice for domains (e.g. Domains 3, 4, and 6) that can be encoded as a triple of binary values where 0 represents “absent” and 1 represents “present.” (b) If the order of the values in the triple is not significant, there are 10 distinct ways to partition the lattice into two classes of four items. The SHJ type for each partition is shown in parentheses. Domains 3 and 4 both include 8 items each and we will consider classes that include exactly four of these items. Each item in these domains can be represented as a triple of binary values, where 0 indicates that a feature is absent and value 1 indicates that a feature is present. Each triple represents the values of the three features (Domain 3) or the feature values for the three objects (Domain 4). By representing each domain in this way, we have effectively adopted domain specifications that are simplifications of those shown in Table 1. Domain 3 is represented using three features of the form F, G, H : O → {0, 1}, and Domain 4 is represented using a single feature of the form F : O → {0, 1}. Simplifications of this kind are possible because the features in each domain can be aligned—notice that no corresponding simplifications are possible for Domains 1 and 2. The eight binary triples in each domain can be organized into the lattice shown in Figure 1a. Here we consider all ways to partition the vertices of the lattice into two groups of four. If partitions that differ only up to a permutation of the features (Domain 3) or objects (Domain 4) are grouped into equivalence classes, there are ten of these classes, and a representative of each is shown in Figure 1b. Previous researchers [6] have pointed out that the stimuli in Domain 1 can be organized into a cube similar to Figure 1a, and that there are six ways to partition these stimuli into two groups of four up to permutations of the features and permutations of the range of each feature. We refer to these equivalence classes as the six Shepard-Hovland-Jenkins types (or SHJ types), and each partition in Figure 1b is labeled with its corresponding SHJ type label. Note, for example, that partitions 3 and 4 are both examples of SHJ type III. For us, partitions 3 and 4 are distinct since items 000 (all absent) and 111 (all present) are uniquely identifiable, and partition 3 assigns these items to different classes but partition 4 does not. Previous researchers have considered differences between some of the first six domains in Table 1. Shepard et al. [6] ran experiments using compact stimuli (Domain 1) and distributed stimuli (Domains 2 and 4), and observed the same difficulty ranking of the six SHJ types in all cases. Their work, however, does not acknowledge that Domain 4 leads to 10 distinct types rather than 6, and therefore fails to address issues such as the relative complexities of concepts 5 and 6 in Figure 1. Social psychologists [13, 14] have studied Domain 6 and found that learning patterns depart from the standard SHJ order—in particular, that SHJ type VI (Concept 10 in Figure 1) is simpler than types III, IV and V. This finding has been used to support the claim that social learning relies on a domain-specific principle of structural balance [14]. We will see, however, that the relative simplicity of type VI in domains like 4 and 6 is consistent with a domain-general account based on representational economy. 2 A representation length approach to concept learning The conceptual universe in Table 1 calls for an account of learning that can apply across many domains. One candidate is the representation length approach, which proposes that concepts are mentally represented in a language of thought, and that the subjective complexity of a concept is 4 determined by the length of its representation in this language [4]. We consider the case where a concept corresponds to a class of items, and explore the idea that these concepts are mentally represented in a logical language. More formally, a concept is represented as a logical sentence, and the concept includes all models of this sentence, or all items that make the sentence true. The predictions of this representation length approach depend critically on the language chosen. Here we consider three languages—an object quantification language OQ that supports quantification over objects, a feature quantification language F Q that supports quantification over features, and a language OQ + F Q that supports quantification over both objects and features. Language OQ is based on a standard logical language known as predicate logic with equality. The language includes symbols representing objects (e.g. a and b), and features (e.g. F and G) and these symbols can be combined to create literals that indicate that an object does (Fa ) or does not have a certain feature (Fa ′ ). Literals can be combined using two connectives: AND (Fa Ga ) and OR (Fa + Ga ). The language includes two quantifiers—for all (∀) and there exists (∃)—and allows quantification over objects (e.g. ∀x Fx , where x is a variable that ranges over all objects in the domain). Finally, language OQ includes equality and inequality relations (= and =) which can be used to compare objects and object variables (e.g. =xa or =xy ). Table 2 shows several sentences formulated in language OQ. Suppose that the OQ complexity of each sentence is defined as the number of basic propositions it contains, where a basic proposition can be a positive or negative literal (Fa or Fa ′ ) or an equality or inequality statement (=xa or =xy ). Equivalently, the complexity of a sentence is the total number of ANDs plus the total number of ORs plus one. This measure is equivalent by design to Feldman’s [2] notion of Boolean complexity when applied to a sentence without quantification. The complexity values in Table 2 show minimal complexity values for each concept in Domains 3 and 4. Table 2 also shows a single sentence that achieves each of these complexity values, although some concepts admit multiple sentences of minimal complexity. The complexity values in Table 2 were computed using an “enumerate then combine” approach. We began by enumerating a set of sentences according to criteria described in the next paragraph. Each sentence has an extension that specifies which items in the domain are consistent with the sentence. Given the extensions of all sentences generated during the enumeration phase, the combination phase considered all possible ways to combine these extensions using conjunctions or disjunctions. The procedure terminated once extensions corresponding to all of the concepts in the domain had been found. Although the number of possible sentences grows rapidly as the complexity of these sentences increases, the number of extensions is fixed and relatively small (28 for domains of size 8). The combination phase is tractable since sentences with the same extension can be grouped into a single equivalence class. The enumeration phase considered all formulae which had at most two quantifiers and which had a complexity value lower than four. For example, this phase did not include the formula ∃x ∃y ∃z =yz F′ Fy Fz (too many quantifiers) or the formula ∀x ∃y =xy Fy (Fx + Gx + Hx ) (complexity x too high). Despite these restrictions, we believe that the complexity values in Table 2 are identical to the values that would be obtained if we had considered all possible sentences. Language F Q is similar to OQ but allows quantification over features rather than objects. For example, F Q includes the statement ∀Q Qa , where Q is a variable that ranges over all features in the domain. Language F Q also allows features and feature variables to be compared for equality or inequality (e.g. =QF or =QR ). Since F Q and OQ are closely related, it follows that the F Q complexity values for Domains 3 and 4 are identical to the OQ complexity values for Domains 4 and 3. For example, F Q can express concept 5 in Domain 3 as ∀Q ∃R =QR Ra . We can combine OQ and F Q to create a language OQ + F Q that allows quantification over both objects and features. Allowing both kinds of quantification leads to identical complexity values for Domains 3 and 4. Language OQ + F Q can express each of the formulae for Domain 4 in Table 2, and these formulae can be converted into corresponding formulae for Domain 3 by translating each instance of object quantification into an instance of feature quantification. Logicians distinguish between first-order logic, which allows quantification over objects but not predicates, and second-order logic, which allows quantification over objects and predicates. The difference between languages OQ and OQ + F Q is superficially similar to the difference between first-order and second-order logic, but does not cut to the heart of this matter. Since language 5 # 1 Domain 3 Domain 4 C 1 Ga C 1 Fb 2 Fa Ha + Fa Ha 4 Fa Fc + Fa Fc 4 3 Fa ′ Ga + Fa Ha 4 Fa ′ Fb + Fa Fc 4 4 Fa ′ Ga ′ + Fa Ha 4 Fa ′ Fb ′ + Fa Fc 4 5 Ga (Fa + Ha ) + Fa Ha 2 6 7 8 ′ ′ ′ ′ 5 ∀x ∃y =xy Fy ′ 5 ′ ′ 6 Ga (Fa + Ha ) + Fa Ha Ga (Fa + Ha ) + Fa Ga Ha 3 (∀x Fx ) + Fb ∃y Fy ′ ′ ′ (∀x Fx ) + Fb (Fa + Fc ) 4 ′ ′ ′ 6 ′ ′ 6 (∀x Fx ) + Fa (Fb + Fc ) 4 10 (∀x Fx ) + ∃y ∀z Fy (=zy +Fz ′ ) 4 Ha (Fa + Ga ) + Fa Ga Ha 9 Fa (Ga + Ha ) + Fa Ga Ha 10 Ga ′ (Fa Ha ′ + Fa ′ Ha ) + Ga (Fa ′ Ha ′ + Fa Ha ) ′ ′ ′ Fc (Fa + Fb ) + Fa Fb Fc ′ ′ 6 Table 2: Complexity values C and corresponding formulae for language OQ. Boolean complexity predicts complexity values for both domains that are identical to the OQ complexity values shown here for Domain 3. Language F Q predicts complexity values for Domains 3 and 4 that are identical to the OQ values for Domains 4 and 3 respectively. Language OQ + F Q predicts complexity values for both domains that are identical to the OQ complexity values for Domain 4. OQ + F Q only supports quantification over a pre-specified set of features, it is equivalent to a typed first order logic that includes types for objects and features [15]. Future studies, however, can explore the cognitive relevance of higher-order logic as developed by logicians. 3 Experiment Now that we have introduced languages OQ, F Q and OQ + F Q our theoretical proposals can be sharply formulated. We suggest that quantification over objects plays an important role in mental representations, and predict that OQ complexity will account better for human learning than Boolean complexity. We also propose that quantification over objects is more natural than quantification over features, and predict that OQ complexity will account better for human learning than both F Q complexity and OQ + F Q complexity. We tested these predictions by designing an experiment where participants learned concepts from Domains 3 and 4. Method. 20 adults participated for course credit. Each participant was assigned to Domain 3 or Domain 4 and learned all ten concepts from that domain. The items used for each domain were the cards shown in Table 1. Note, for example, that each Domain 3 card showed one square, and that each Domain 4 card showed three squares. These items are based on stimuli developed by Sakamoto and Love [12]. The experiment was carried out using a custom built graphical interface. For each learning problem in each domain, all eight items were simultaneously presented on the screen, and participants were able to drag them around and organize them however they liked. Each problem had three phases. During the learning phase, the four items belonging to the current concept had red boundaries, and the remaining four items had blue boundaries. During the memory phase, these colored boundaries were removed, and participants were asked to sort the items into the red group and the blue group. If they made an error they returned to the learning phase, and could retake the test whenever they were ready. During the description phase, participants were asked to provide a written description of the two groups of cards. The color assignments (red or blue) were randomized across participants— in other words, the “red groups” learned by some participants were identical to the “blue groups” learned by others. The order in which participants learned the 10 concepts was also randomized. Model predictions. The OQ complexity values for the ten concepts in each domain are shown in Table 2 and plotted in Figure 2a. The complexity values in Figure 2a have been normalized so that they sum to one within each domain, and the differences of these normalized scores are shown in the final row of Figure 2a. The two largest bars in the difference plot indicate that Concepts 10 and 5 are predicted to be easier to learn in Domain 4 than in Domain 3. Language OQ can express 6 OQ complexity Domain 3 a) Learning time b) 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 Difference Domain 4 0.2 0.2 0.1 1 2 3 4 5 6 7 8 9 10 0.1 0 0 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Figure 2: Normalized OQ complexity values and normalized learning times for the 10 concepts in Domains 3 and 4. statements like “either 1 or 3 objects have F ” (Concept 10 in Domain 4), or “2 or more objects have F ” (Concept 5 in Domain 4). Since quantification over features is not permitted, however, analogous statements (e.g. “object a has either 1 or 3 features”) cannot be formulated in Domain 3. Concept 10 corresponds to SHJ type VI, which often emerges as the most difficult concept in studies of Boolean concept learning. Our model therefore predicts that the standard ordering of the SHJ types will not apply in Domain 4. Our model also predicts that concepts assigned to the same SHJ type will have different complexities. In Domain 4 the model predicts that Concept 6 will be harder to learn than Concept 5 (both are examples of SHJ type IV), and that Concept 8 will be harder to learn than Concepts 7 or 9 (all three are examples of SHJ type V). Results. The computer interface recorded the amount of time participants spent on the learning phase for each concept. Domain 3 was a little more difficult than Domain 4 overall: on average, Domain 3 participants took 557 seconds and Domain 4 participants took 467 seconds to learn the 10 concepts. For all remaining analyses, we consider learning times that are normalized to sum to 1 for each participant. Figure 2b shows the mean values for these normalized times, and indicates the relative difficulties of the concepts within each condition. The difference plot in Figure 2b supports the two main predictions identified previously. Concepts 10 and 5 are the cases that differ most across the domains, and both concepts are easier to learn in Domain 3 than Domain 4. As predicted, Concept 5 is substantially easier than Concept 6 in Domain 4 even though both correspond to the same SHJ type. Concepts 7 through 9 also correspond to the same SHJ type, and the data for Domain 4 suggest that Concept 8 is the most difficult of the three, although the difference between Concepts 8 and 7 is not especially large. Four sets of complexity predictions are plotted against the human data in Figure 3. Boolean complexity and OQ complexity make identical predictions about Domain 3, and OQ complexity and OQ + F Q complexity make identical predictions about Domain 4. Only OQ complexity, however, accounts for the results observed in both domains. The concept descriptions generated by participants provide additional evidence that there are psychologically important differences between Domains 3 and 4. If the descriptions for concepts 5 and 10 are combined, 18 out of 20 responses in Domain 4 referred to quantification or counting. One representative description of Concept 5 stated that “red has multiple filled” and that “blue has one filled or none.” Only 3 of 20 responses in Domain 3 mentioned quantification. One representative description of Concept 5 stated that “red = multiple features” and that “blue = only one feature.” 7 r=0.84 0.2 r=0.84 0.2 r=0.26 0.2 r=0.26 0.2 Learning time (Domain 3) 0.1 0.1 0 (Domain 4) 0.2 r=0.27 0.2 Learning time 0.1 0.1 0 0.2 r=0.83 0.2 0.1 0.1 0 0.1 0.2 0 0.1 0.2 r=0.27 0.2 0.1 Boolean complexity 0.1 0.1 0.2 OQ complexity 0.1 0.2 r=0.83 0.2 0.1 0 0 0.1 0 0.1 0.2 F Q complexity 0 0.1 0.2 OQ + F Q complexity Figure 3: Normalized learning times for each domain plotted against normalized complexity values predicted by four languages: Boolean logic, OQ, F Q and OQ + F Q. These results suggest that people can count or quantify over features, but that it is psychologically more natural to quantify over objects rather than features. Although we have focused on three specific languages, the results in Figure 2b can be used to evaluate alternative proposals about the language of thought. One such alternative is an extension of Language OQ that allows feature values to be compared for equality. This extended language supports concise representations of Concept 2 in both Domain 3 (Fa = Ha ) and Domain 4 (Fa = Fc ), and predicts that Concept 2 will be easier to learn than all other concepts except Concept 1. Note, however, that this prediction is not compatible with the data in Figure 2b. Other languages might also be considered, but we know of no simple language that will account for our data better than OQ. 4 Conclusion Comparing concept learning across qualitatively different domains can provide valuable information about the nature of mental representation. We compared two domains that that are similar in many respects, but that differ according to whether they include a single object (Domain 3) or multiple objects (Domain 4). Quantification over objects is possible in Domain 4 but not Domain 3, and this difference helps to explain the different learning patterns we observed across the two domains. Our results suggest that concept representations can incorporate quantification, and that quantifying over objects is more natural than quantifying over features. The model predictions we reported are based on a language (OQ) that is a generic version of first order logic with equality. Our results therefore suggest that some of the languages commonly considered by logicians (e.g. first order logic with equality) may indeed capture some aspects of the “laws of thought” [16]. A simple language like OQ offers a convenient way to explore the role of quantification, but this language will need to be refined and extended in order to provide a more accurate account of mental representation. For example, a comprehensive account of the language of thought will need to support quantification over features in some cases, but might be formulated so that quantification over features is typically more costly than quantification over objects. Many possible representation languages can be imagined and a large amount of empirical data will be needed to identify the language that comes closest to the language of thought. Many relevant studies have already been conducted [2, 6, 8, 9, 13, 17], but there are vast regions of the conceptual universe (Table 1) that remain to be explored. Navigating this universe is likely to involve several challenges, but web-based experiments [18, 19] may allow it to be explored at a depth and scale that are currently unprecedented. Characterizing the language of thought is undoubtedly a long term project, but modern methods of data collection may support rapid progress towards this goal. Acknowledgments I thank Maureen Satyshur for running the experiment. This work was supported in part by NSF grant CDI-0835797. 8 References [1] J. A. Fodor. The language of thought. Harvard University Press, Cambridge, 1975. [2] J. Feldman. Minimization of Boolean complexity in human concept learning. Nature, 407: 630–633, 2000. [3] D. Fass and J. Feldman. Categorization under complexity: A unified MDL account of human learning of regular and irregular categories. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 35–34. MIT Press, Cambridge, MA, 2003. [4] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [5] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008. [6] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75(13), 1961. Whole No. 517. [7] R. M. Nosofsky, M. Gluck, T. J. Palmeri, S. C. McKinley, and P. Glauthier. Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory and Cognition, 22:352–369, 1994. [8] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1):43–58, 2002. [9] C. D. Aitkin and J. Feldman. Subjective complexity of categories defined over three-valued features. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 961–966. Psychology Press, New York, 2006. [10] F. Mathy and J. Bradmetz. A theory of the graceful complexification of concepts and their learnability. Current Psychology of Cognition, 22(1):41–82, 2004. [11] R. Vigo. A note on the complexity of Boolean concepts. Journal of Mathematical Psychology, 50:501–510, 2006. [12] Y. Sakamoto and B. C. Love. Schematic influences on category learning and recognition memory. Journal of Experimental Psychology: General, 133(4):534–553, 2004. [13] W. H. Crockett. Balance, agreement and positivity in the cognition of small social structures. In Advances in Experimental Social Psychology, Vol 15, pages 1–57. Academic Press, 1982. [14] N. B. Cottrell. Heider’s structural balance principle as a conceptual rule. Journal of Personality and Social Psychology, 31(4):713–720, 1975. [15] H. B. Enderton. A mathematical introduction to logic. Academic Press, New York, 1972. [16] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probabilities. 1854. [17] B. C. Love and A. B. Markman. The nonindependence of stimulus properties in human category learning. Memory and Cognition, 31(5):790–799, 2003. [18] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006. [19] R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008. 9

5 0.79308671 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

6 0.79288578 221 nips-2009-Solving Stochastic Games

7 0.79174125 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

8 0.79020452 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

9 0.78649402 45 nips-2009-Beyond Convexity: Online Submodular Minimization

10 0.78597593 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

11 0.78531039 55 nips-2009-Compressed Least-Squares Regression

12 0.7837652 94 nips-2009-Fast Learning from Non-i.i.d. Observations

13 0.78288388 161 nips-2009-Nash Equilibria of Static Prediction Games

14 0.78194356 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

15 0.77667427 14 nips-2009-A Parameter-free Hedging Algorithm

16 0.77386296 71 nips-2009-Distribution-Calibrated Hierarchical Classification

17 0.77365291 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

18 0.77344728 78 nips-2009-Efficient Moments-based Permutation Tests

19 0.77197933 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

20 0.77063602 181 nips-2009-Online Learning of Assignments