nips nips2009 nips2009-73 knowledge-graph by maker-knowledge-mining

73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization


Source: pdf

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. [sent-4, score-0.639]

2 In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. [sent-5, score-0.22]

3 Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. [sent-6, score-0.217]

4 1 Introduction In machine learning, online algorithms operate by repetitively drawing random examples, one at a time, and adjusting the learning variables using simple calculations that are usually based on the single example only. [sent-7, score-0.192]

5 The low computational complexity (per iteration) of online algorithms is often associated with their slow convergence and low accuracy in solving the underlying optimization problems. [sent-8, score-0.285]

6 As argued in [1, 2], the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms a favorite choice for solving largescale learning problems. [sent-9, score-0.276]

7 Nevertheless, traditional online algorithms, such as stochastic gradient descent (SGD), has limited capability of exploiting problem structure in solving regularized learning problems. [sent-10, score-0.398]

8 As a result, their low accuracy often makes it hard to obtain the desired regularization effects, e. [sent-11, score-0.099]

9 In this paper, we develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. [sent-14, score-0.639]

10 1 Regularized stochastic learning The regularized stochastic learning problems we consider are of the following form: { } đ? [sent-17, score-0.231]

11 ‘› is the optimization variable (called weights in many learning problems), đ? [sent-27, score-0.071]

12 ‘¤) is the indicator function of a closed convex set đ? [sent-100, score-0.055]

13 In this paper, we focus on online algorithms that process samples sequentially as they become available. [sent-109, score-0.176]

14 For solving the problem (1), the standard stochastic gradient descent (SGD) method takes the form đ? [sent-137, score-0.141]

15 The SGD method has been very popular in the machine learning community due to its capability of scaling with large data sets and good generalization performance observed in practice (e. [sent-154, score-0.065]

16 Nevertheless, a main drawback of the SGD method is its lack of capability in exploiting problem structure, especially for regularized learning problems. [sent-157, score-0.182]

17 As a result, their low accuracy (compared with interior-point method for batch optimization) often makes it hard to obtain the desired regularization effect. [sent-158, score-0.191]

18 An important example and motivation for this paper is â„“1 -regularized stochastic learning, where Ψ(đ? [sent-159, score-0.059]

19 œ†, the SGD method (2) usually does not generate sparse solutions because only in very rare cases two oat numbers add up to zero. [sent-164, score-0.087]

20 Various methods for rounding or truncating the solutions are proposed to generate sparse solutions (e. [sent-165, score-0.151]

21 Inspired by recently developed ďŹ rst-order methods for optimizing composite functions [6, 7, 8], the regularized dual averaging (RDA) method we develop exploits the full regularization structure at each online iteration. [sent-168, score-0.512]

22 In other words, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the whole regularization term, not just its subgradients. [sent-169, score-0.197]

23 For many practical learning problems, we actually are able to ďŹ nd a closed-form solution for the auxiliary optimization problem at each iteration. [sent-170, score-0.053]

24 ‘¤) is strongly convex, we have the better rate đ? [sent-178, score-0.046]

25 2 Regularized online optimization In online optimization (e. [sent-183, score-0.426]

26 The goal of an online algorithm is to ensure ∑đ? [sent-204, score-0.16]

27 The difference between these two cost is called the regret of the online algorithm. [sent-221, score-0.262]

28 Applications of online optimization include online prediction of time series and sequential investment (e. [sent-222, score-0.387]

29 In regularized online optimization, we add to each cost function a convex regularization function Ψ(đ? [sent-225, score-0.427]

30 =1 The RDA method we develop can also be used to solve the above regularized online optimization √ problem, and it has an đ? [sent-252, score-0.359]

31 However, the main advantage of the RDA method, compared with other online algorithms, is its explicit regularization effect at each iteration. [sent-259, score-0.259]

32 2 Algorithm 1 Regularized dual averaging (RDA) method input: ∙ a strongly convex function â„Ž(đ? [sent-260, score-0.241]

33 ‘¤ (4) ∙ a pre-determined nonnegative and nondecreasing sequence đ? [sent-269, score-0.103]

34 ‘Ą end for 2 (5) (6) Regularized dual averaging method In this section, we present the generic RDA method (Algorithm 1) for solving regularized stochastic learning and online optimization problems, and give some concrete examples. [sent-317, score-0.604]

35 The RDA method uses an auxiliary strongly convex function â„Ž(đ? [sent-326, score-0.134]

36 A function â„Ž is called strongly convex with respect to a norm âˆĽ â‹… âˆĽ if there exists a constant đ? [sent-328, score-0.101]

37 œŽ is called the convexity parameter, or the modulus of strong convexity. [sent-346, score-0.079]

38 In Algorithm 1, step 1 is to compute a subgradient of đ? [sent-350, score-0.055]

39 Step 2 is the online version of computing average gradient đ? [sent-355, score-0.182]

40 , it is not strongly convex), we can choose a parameter đ? [sent-367, score-0.046]

41 Here are some examples: ∙ Nesterov’s dual averaging method. [sent-382, score-0.107]

42 ‘¤) be the indicator{function of a close convex √ } set đ? [sent-384, score-0.055]

43 œŽ > 0, we can use √ nonnegative, any nondecreasing sequence {đ? [sent-447, score-0.07]

44 Regret bounds and convergence rates We ďŹ rst give bounds on the regret đ? [sent-550, score-0.181]

45 ‘¤) deďŹ ned in (3), when the RDA method is used for solving regularized online optimization problem. [sent-553, score-0.371]

46 œŽ is the convexity parameter of the regularization function Ψ(đ? [sent-582, score-0.142]

47 ‘Ą =1 is the input sequence to the RDA method, which is nonnegative and nondeđ? [sent-587, score-0.056]

48 ‘Ą ≼ 1, where âˆĽ â‹… âˆĽâˆ— is the dual norm of âˆĽ â‹… âˆĽ. [sent-636, score-0.059]

49 Here we give some direct consequences based on concrete choices of algorithmic parameters. [sent-647, score-0.034]

50 œŽ > 0, then any nonnegative, nondecreasing sequence that is dominated by ln đ? [sent-703, score-0.181]

51 When Algorithm 1 is used to solve regularized stochastic learning problems, we have the following: Theorem 2 Assume there exists an optimal solution đ? [sent-802, score-0.157]

52 4 Related work There have been several recent work that address online algorithms for regularized learning problems, especially with â„“1 -regularization; see, e. [sent-843, score-0.293]

53 In particular, a forwardbackward splitting method (FOBOS) is studied in [17] for solving the same problems we consider. [sent-846, score-0.091]

54 In an online setting, each iteration of the FOBOS method can be written as { } 1 2 đ? [sent-847, score-0.21]

55 The RDA method and FOBOS use very different weights on the regularization term Ψ(đ? [sent-870, score-0.15]

56 The TG method truncates the solutions obtained by the standard SGD method with an integer period đ? [sent-883, score-0.111]

57 œƒ = +∞, the TG method is the same as the FOBOS method (13). [sent-949, score-0.066]

58 Therefore, the RDA method can generate much more sparse solutions. [sent-963, score-0.057]

59 ‘‡ ÂŻ Figure 1: Sparsity patterns of the weight đ? [sent-991, score-0.045]

60 ‘‡ for classifying the digits 6 ÂŻ and 7 when varying the regularization parameter đ? [sent-995, score-0.173]

61 The background gray represents the value zero, bright spots represent positive values and dark spots represent negative values. [sent-998, score-0.062]

62 5 Computational experiments We provide computational experiments for the â„“1 -RDA method using the MNIST dataset of handwritten digits [18]. [sent-999, score-0.07]

63 Each of the 10 digits has roughly 6,000 training examples and 1,000 testing examples. [sent-1001, score-0.084]

64 In the experiments, we compare the â„“1 -RDA method (9) with the SGD method (2) and the TG/FOBOS method (14) with đ? [sent-1004, score-0.099]

65 These three online algorithms have similar convergence rate and the same order of computational complexity per iteration. [sent-1006, score-0.205]

66 We also compare them with the batch optimization approach, using an efďŹ cient interior-point method (IPM) developed by [19]. [sent-1007, score-0.145]

67 Each pair of digits have about 12,000 training examples and 2,000 testing examples. [sent-1008, score-0.084]

68 We use online algorithms to go through the (randomly permuted) data only once, therefore the algorithms stop at đ? [sent-1009, score-0.192]

69 œ† for the batch optimization case [19] is mostly in the range of 30 − 50 (beyond which the optimal weights are all zeros). [sent-1015, score-0.149]

70 The tradeoffs in choosing these parameters√ further investigated in [13]. [sent-1023, score-0.073]

71 For the SGD and TG methods, we use a are constant stepsize đ? [sent-1024, score-0.038]

72 ˇ, which gives the best convergence bound (12) 6 Right: đ? [sent-1033, score-0.048]

73 ‘Ą) for the three online algorithms (classifying 6 and 7). [sent-1047, score-0.176]

74 ‘‡ also gives the best convergence rate for the SGD method (e. [sent-1054, score-0.062]

75 ž = 10 for enhanced regularization effect, as suggested in [5]. [sent-1061, score-0.131]

76 Figure 1 shows the sparsity patterns of the solutions đ? [sent-1062, score-0.091]

77 The sparsity patterns obtained by the RDA method are most close to the batch optimization results solved by IPM, especially for larger đ? [sent-1072, score-0.225]

78 In order to count the NNZs in all other cases, we set a small threshold for rounding the weights to zero. [sent-1080, score-0.054]

79 Considering that the magnitudes of the largest weights in Figure 1 are mostly on the order of 10−3 , we set 10−5 as the threshold and veriďŹ ed that rounding elements less than 10−5 to zero does not affect the testing errors. [sent-1081, score-0.105]

80 It can be seen that the RDA method maintains a much more sparse đ? [sent-1085, score-0.057]

81 While the TG method generate more sparse solutions than the SGD method when đ? [sent-1088, score-0.12]

82 Figure 3 illustrates the tradeoffs between sparsity and testing error rates for classifying 6 and 7. [sent-1093, score-0.247]

83 Since the performance of the online algorithms vary when the training data are given in different permutations, we run them on 100 randomly permuted sequences of the same training set, and plot the means and standard deviations shown as error bars. [sent-1094, score-0.255]

84 For the SGD and TG methods, the testing error rates of đ? [sent-1095, score-0.096]

85 ‘‡ , even though the theorems only give performance bound for the averaged weight đ? [sent-1100, score-0.063]

86 ‘‡ obtained by SGD and TG ÂŻ ÂŻ have much smaller error rates than those of RDA and batch optimization, especially for larger đ? [sent-1105, score-0.142]

87 The explanation is that these lower error rates are obtained with much more nonzero features. [sent-1107, score-0.064]

88 For clarity of presentation, here we only plot results of the â„“1 -RDA method and batch optimization using IPM. [sent-1109, score-0.145]

89 We see that, overall, the solutions obtained by the â„“1 -RDA method demonstrate very similar tradeoffs between sparsity and testing error rates as rendered by the batch optimization solutions. [sent-1111, score-0.385]

90 ‘‡ ÂŻ Error rates (%) Last weight đ? [sent-1114, score-0.071]

91 Ϡ Figure 3: Tradeoffs between testing error rates and NNZs in solutions (for classifying 6 and 7). [sent-1156, score-0.163]

92 The images in the lower-left triangular area show sparsity patterns of đ? [sent-1205, score-0.083]

93 The plots in the upper-right triangular area show tradeoffs between sparsity and testing error rates, by varying đ? [sent-1211, score-0.186]

94 The solid circles and solid squares show error rates and NNZs in đ? [sent-1214, score-0.104]

95 The hollow circles and hollow squares show error rates and NNZs of đ? [sent-1217, score-0.21]

96 The vertical bars centered at hollow circles and squares show standard deviations by running on 100 random permutations of the training data. [sent-1220, score-0.146]

97 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-1248, score-0.096]

98 Online convex programming and generalized inďŹ nitesimal gradient ascent. [sent-1272, score-0.077]

99 Dual averaging method for regularized stochastic learning and online optimization. [sent-1290, score-0.398]

100 An interior-point method for large-scale â„“1 -regularized logistic regression. [sent-1336, score-0.049]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rda', 0.709), ('tg', 0.441), ('sgd', 0.277), ('nnzs', 0.195), ('online', 0.16), ('ipm', 0.124), ('regularization', 0.099), ('regularized', 0.098), ('ln', 0.094), ('fobos', 0.089), ('regret', 0.087), ('tradeoffs', 0.073), ('dom', 0.062), ('dual', 0.059), ('stochastic', 0.059), ('batch', 0.059), ('convex', 0.055), ('subgradient', 0.055), ('hollow', 0.053), ('optimization', 0.053), ('averaging', 0.048), ('nondecreasing', 0.047), ('rates', 0.046), ('strongly', 0.046), ('convexity', 0.043), ('sparsity', 0.041), ('stepsize', 0.038), ('digits', 0.037), ('classifying', 0.037), ('rounding', 0.036), ('modulus', 0.036), ('trnc', 0.035), ('nonnegative', 0.033), ('method', 0.033), ('enhanced', 0.032), ('capability', 0.032), ('testing', 0.032), ('catholic', 0.031), ('spots', 0.031), ('truncating', 0.031), ('solutions', 0.03), ('convergence', 0.029), ('solving', 0.027), ('louvain', 0.027), ('kl', 0.026), ('permuted', 0.025), ('mod', 0.025), ('weight', 0.025), ('bottou', 0.025), ('duchi', 0.024), ('sparse', 0.024), ('loss', 0.023), ('sequence', 0.023), ('gradient', 0.022), ('triangular', 0.022), ('circles', 0.022), ('arg', 0.022), ('deviations', 0.021), ('patterns', 0.02), ('mostly', 0.019), ('especially', 0.019), ('bound', 0.019), ('truncation', 0.019), ('give', 0.019), ('weights', 0.018), ('adjusted', 0.018), ('squares', 0.018), ('error', 0.018), ('big', 0.018), ('sign', 0.017), ('permutations', 0.017), ('iteration', 0.017), ('dominated', 0.017), ('algorithms', 0.016), ('logistic', 0.016), ('core', 0.016), ('whenever', 0.016), ('min', 0.016), ('juditsky', 0.016), ('carbonetto', 0.016), ('forwardbackward', 0.016), ('blowup', 0.016), ('oscillates', 0.016), ('banff', 0.016), ('repetitively', 0.016), ('truncate', 0.016), ('xiao', 0.016), ('cost', 0.015), ('develop', 0.015), ('icml', 0.015), ('examples', 0.015), ('siam', 0.015), ('vertical', 0.015), ('concrete', 0.015), ('period', 0.015), ('vary', 0.015), ('problems', 0.015), ('truncated', 0.015), ('technion', 0.014), ('investment', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

2 0.121245 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

3 0.092381746 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

4 0.074197993 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

5 0.073388159 14 nips-2009-A Parameter-free Hedging Algorithm

Author: Kamalika Chaudhuri, Yoav Freund, Daniel J. Hsu

Abstract: We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. We introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret. 1

6 0.07109873 220 nips-2009-Slow Learners are Fast

7 0.064714648 45 nips-2009-Beyond Convexity: Online Submodular Minimization

8 0.061907385 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

9 0.057634972 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

10 0.054578677 178 nips-2009-On Stochastic and Worst-case Models for Investing

11 0.053548668 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

12 0.050712124 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

13 0.049934901 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

14 0.047257517 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

15 0.038935982 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

16 0.038219128 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

17 0.036615215 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation

18 0.03597793 33 nips-2009-Analysis of SVM with Indefinite Kernels

19 0.035444088 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

20 0.035434525 94 nips-2009-Fast Learning from Non-i.i.d. Observations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.11), (1, 0.115), (2, 0.034), (3, -0.007), (4, 0.051), (5, 0.066), (6, -0.029), (7, -0.003), (8, 0.002), (9, 0.067), (10, -0.021), (11, -0.016), (12, -0.007), (13, 0.041), (14, -0.027), (15, 0.083), (16, 0.009), (17, -0.043), (18, -0.014), (19, -0.005), (20, -0.06), (21, -0.052), (22, -0.018), (23, -0.041), (24, 0.03), (25, 0.017), (26, -0.014), (27, 0.053), (28, 0.051), (29, 0.07), (30, -0.02), (31, -0.054), (32, 0.031), (33, 0.0), (34, -0.085), (35, 0.011), (36, -0.026), (37, -0.089), (38, -0.042), (39, 0.009), (40, -0.025), (41, -0.104), (42, -0.011), (43, -0.08), (44, -0.071), (45, 0.171), (46, -0.04), (47, -0.034), (48, -0.017), (49, 0.134)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90163964 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

2 0.75157332 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

3 0.60474658 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

4 0.57149118 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

5 0.50850266 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

6 0.48157564 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.47993833 14 nips-2009-A Parameter-free Hedging Algorithm

8 0.46295747 94 nips-2009-Fast Learning from Non-i.i.d. Observations

9 0.45296404 45 nips-2009-Beyond Convexity: Online Submodular Minimization

10 0.42926753 33 nips-2009-Analysis of SVM with Indefinite Kernels

11 0.40169808 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

12 0.37896475 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

13 0.36667821 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

14 0.35993558 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

15 0.35836238 24 nips-2009-Adapting to the Shifting Intent of Search Queries

16 0.35622457 223 nips-2009-Sparse Metric Learning via Smooth Optimization

17 0.35313022 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

18 0.34286201 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

19 0.33942637 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

20 0.33732641 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.073), (25, 0.043), (35, 0.046), (36, 0.134), (39, 0.036), (51, 0.324), (58, 0.093), (61, 0.016), (71, 0.036), (81, 0.019), (86, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.80961668 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

same-paper 2 0.73746914 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

3 0.66160882 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

Abstract: Many researchers have suggested that the psychological complexity of a concept is related to the length of its representation in a language of thought. As yet, however, there are few concrete proposals about the nature of this language. This paper makes one such proposal: the language of thought allows first order quantification (quantification over objects) more readily than second-order quantification (quantification over features). To support this proposal we present behavioral results from a concept learning study inspired by the work of Shepard, Hovland and Jenkins. Humans can learn and think about many kinds of concepts, including natural kinds such as elephant and water and nominal kinds such as grandmother and prime number. Understanding the mental representations that support these abilities is a central challenge for cognitive science. This paper proposes that quantification plays a role in conceptual representation—for example, an animal X qualifies as a predator if there is some animal Y such that X hunts Y . The concepts we consider are much simpler than real-world examples such as predator, but even simple laboratory studies can provide important clues about the nature of mental representation. Our approach to mental representation is based on the language of thought hypothesis [1]. As pursued here, the hypothesis proposes that mental representations are constructed in a compositional language of some kind, and that the psychological complexity of a concept is closely related to the length of its representation in this language [2, 3, 4]. Following previous researchers [2, 4], we operationalize the psychological complexity of a concept in terms of the ease with which it is learned and remembered. Given these working assumptions, the remaining challenge is to specify the representational resources provided by the language of thought. Some previous studies have relied on propositional logic as a representation language [2, 5], but we believe that the resources of predicate logic are needed to capture the structure of many human concepts. In particular, we suggest that the language of thought can accommodate relations, functions, and quantification, and focus here on the role of quantification. Our primary proposal is that quantification is supported by the language of thought, but that quantification over objects is psychologically more natural than quantification over features. To test this idea we compare concept learning in two domains which are very similar except for one critical difference: one domain allows quantification over objects, and the other allows quantification over features. We consider several logical languages that can be used to formulate concepts in both domains, and find that learning times are best predicted by a language that supports quantification over objects but not features. Our work illustrates how theories of mental representation can be informed by comparing concept learning across two or more domains. Existing studies work with a range of domains, and it is useful to consider a “conceptual universe” that includes these possibilities along with many others that have not yet been studied. Table 1 charts a small fragment of this universe, and the penultimate column shows example stimuli that will be familiar from previous studies of concept learning. Previous studies have made important contributions by choosing a single domain in Table 1 and explaining 1 why some concepts within this domain are easier to learn than others [2, 4, 6, 7, 8, 9]. Comparisons across domains can also provide important information about learning and mental representation, and we illustrate this claim by comparing learning times across Domains 3 and 4. The next section introduces the conceptual universe in Table 1 in more detail. We then present a formal approach to concept learning that relies on a logical language and compare three candidate languages. Language OQ (for object quantification) supports quantification over objects but not features, language F Q (for feature quantification) supports quantification over features but not objects, and language OQ + F Q supports quantification over both objects and features. We use these languages to predict learning times across Domains 3 and 4, and present an experiment which suggests that language OQ comes closest to the language of thought. 1 The conceptual universe Table 1 provides an organizing framework for thinking about the many domains in which learning can occur. The table includes 8 domains, each of which is defined by specifying some number of objects, features, and relations, and by specifying the range of each feature and each relation. We refer to the elements in each domain as items, and the penultimate column of Table 1 shows items from each domain. The first row shows a domain commonly used by studies of Boolean concept learning. Each item in this domain includes a single object a and specifies whether that object has value v1 (small) or v2 (large) on feature F (size), value v3 (white) or v4 (gray) on feature G (color), and value v5 (vertical) or v6 (horizontal) on feature H (texture). Domain 2 also includes three features, but now each item includes three objects and each feature applies to only one of the objects. For example, feature H (texture) applies to only the third object in the domain (i.e. the third square on each card). Domain 3 is similar to Domain 1, but now the three features can be aligned— for any given item each feature will be absent (value 0) or present. The example in Table 1 uses three features (boundary, dots, and slash) that can each be added to an unadorned gray square. Domain 4 is similar to Domain 2, but again the feature values can be aligned, and the feature for each object will be absent (value 0) or present. Domains 5 and 6 are similar to domains 2 and 4 respectively, but each one includes relations rather than features. In Domain 6, for example, the relation R assigns value 0 (absent) or value 1 (present) to each undirected pair of objects. The first six domains in Table 1 are all variants of Domain 1, which is the domain typically used by studies of Boolean concept learning. Focusing on six related domains helps to establish some of the dimensions along which domains can differ, but the final two domains in Table 1 show some of the many alternative possibilities. Domain 7 includes two categorical features, each of which takes three rather than two values. Domain 8 is similar to Domain 6, but now the number of objects is 6 rather than 3 and relation R is directed rather than undirected. To mention just a handful of possibilities which do not appear in Table 1, domains may also have categorical features that are ordered (e.g. a size feature that takes values small, medium, and large), continuous valued features or relations, relations with more than two places, and objects that contain sub-objects or parts. Several learning problems can be formulated within any given domain. The most basic is to learn a single item—for example, a single item from Domain 8 [4]. A second problem is to learn a class of items—for example, a class that includes four of the items in Domain 1 and excludes the remaining four [6]. Learning an item class can be formalized as learning a unary predicate defined over items, and a natural extension is to consider predicates with two or more arguments. For example, problems of the form A is to B as C is to ? can be formulated as problems where the task is to learn a binary relation analogous(·, ·) given the single example analogous(A, B). Here, however, we focus on the task of learning item classes or unary predicates. Since we focus on the role of quantification, we will work with domains where quantification is appropriate. Quantification over objects is natural in cases like Domain 4 where the feature values for all objects can be aligned. Note, for example, that the statement “every object has its feature” picks out the final example item in Domain 4 but that no such statement is possible in Domain 2. Quantification over features is natural in cases like Domain 3 where the ranges of each feature can be aligned. For example, “object a has all three features” picks out the final example item in Domain 3 but no such statement is possible in Domain 1. We therefore focus on Domains 3 and 4, and explore the problem of learning item classes in each domain. 2 3 {a} {a, b, c} {a} {a, b, c} {a, b, c} {a, b, c} {a} {a, b, c, d, e, f } 1 2 3 4 5 6 7 8 R : O × O → {0, 1} — F : O → {v1 , v2 , v3 } G : O → {v4 , v5 , v6 } — R : O × O → {0, 1} R : (a, b) → {v1 , v2 } S : (a, c) → {v3 , v4 } T : (b, c) → {v5 , v6 } — — — — Relations — — Domain specification Features F : O → {v1 , v2 } G : O → {v3 , v4 } H : O → {v5 , v6 } F : a → {v1 , v2 } G : b → {v3 , v4 } H : c → {v5 , v6 } F : O → {0, v1 } G : O → {0, v2 } H : O → {0, v3 } F : a → {0, v1 } G : b → {0, v2 } H : c → {0, v3 } , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... , ... , Example Items , , , , , , , , , , , , , ... , [4] [8, 9] [13] [6] [12] [6] [2, 6, 7, 10, 11] Ref. Table 1: The conceptual universe. Eight domains are shown, and each one is defined by a set of objects, a set of features, and a set of relations. We call the members of each domain items, and an item is created by specifying the extension of each feature and relation in the domain. The six domains above the double lines are closely related to the work of Shepard et al. [6]. Each one includes eight items which differ along three dimensions. These dimensions, however, emerge from different underlying representations in the six cases. Objects O # (a) (b) 1 (I) 2 (II) 3 (III) 4 (III) 5 (IV) 6 (IV) 7 (V) 8 (V) 9 (V) 10 (VI) 111 110 101 011 100 010 001 000 Figure 1: (a) A stimulus lattice for domains (e.g. Domains 3, 4, and 6) that can be encoded as a triple of binary values where 0 represents “absent” and 1 represents “present.” (b) If the order of the values in the triple is not significant, there are 10 distinct ways to partition the lattice into two classes of four items. The SHJ type for each partition is shown in parentheses. Domains 3 and 4 both include 8 items each and we will consider classes that include exactly four of these items. Each item in these domains can be represented as a triple of binary values, where 0 indicates that a feature is absent and value 1 indicates that a feature is present. Each triple represents the values of the three features (Domain 3) or the feature values for the three objects (Domain 4). By representing each domain in this way, we have effectively adopted domain specifications that are simplifications of those shown in Table 1. Domain 3 is represented using three features of the form F, G, H : O → {0, 1}, and Domain 4 is represented using a single feature of the form F : O → {0, 1}. Simplifications of this kind are possible because the features in each domain can be aligned—notice that no corresponding simplifications are possible for Domains 1 and 2. The eight binary triples in each domain can be organized into the lattice shown in Figure 1a. Here we consider all ways to partition the vertices of the lattice into two groups of four. If partitions that differ only up to a permutation of the features (Domain 3) or objects (Domain 4) are grouped into equivalence classes, there are ten of these classes, and a representative of each is shown in Figure 1b. Previous researchers [6] have pointed out that the stimuli in Domain 1 can be organized into a cube similar to Figure 1a, and that there are six ways to partition these stimuli into two groups of four up to permutations of the features and permutations of the range of each feature. We refer to these equivalence classes as the six Shepard-Hovland-Jenkins types (or SHJ types), and each partition in Figure 1b is labeled with its corresponding SHJ type label. Note, for example, that partitions 3 and 4 are both examples of SHJ type III. For us, partitions 3 and 4 are distinct since items 000 (all absent) and 111 (all present) are uniquely identifiable, and partition 3 assigns these items to different classes but partition 4 does not. Previous researchers have considered differences between some of the first six domains in Table 1. Shepard et al. [6] ran experiments using compact stimuli (Domain 1) and distributed stimuli (Domains 2 and 4), and observed the same difficulty ranking of the six SHJ types in all cases. Their work, however, does not acknowledge that Domain 4 leads to 10 distinct types rather than 6, and therefore fails to address issues such as the relative complexities of concepts 5 and 6 in Figure 1. Social psychologists [13, 14] have studied Domain 6 and found that learning patterns depart from the standard SHJ order—in particular, that SHJ type VI (Concept 10 in Figure 1) is simpler than types III, IV and V. This finding has been used to support the claim that social learning relies on a domain-specific principle of structural balance [14]. We will see, however, that the relative simplicity of type VI in domains like 4 and 6 is consistent with a domain-general account based on representational economy. 2 A representation length approach to concept learning The conceptual universe in Table 1 calls for an account of learning that can apply across many domains. One candidate is the representation length approach, which proposes that concepts are mentally represented in a language of thought, and that the subjective complexity of a concept is 4 determined by the length of its representation in this language [4]. We consider the case where a concept corresponds to a class of items, and explore the idea that these concepts are mentally represented in a logical language. More formally, a concept is represented as a logical sentence, and the concept includes all models of this sentence, or all items that make the sentence true. The predictions of this representation length approach depend critically on the language chosen. Here we consider three languages—an object quantification language OQ that supports quantification over objects, a feature quantification language F Q that supports quantification over features, and a language OQ + F Q that supports quantification over both objects and features. Language OQ is based on a standard logical language known as predicate logic with equality. The language includes symbols representing objects (e.g. a and b), and features (e.g. F and G) and these symbols can be combined to create literals that indicate that an object does (Fa ) or does not have a certain feature (Fa ′ ). Literals can be combined using two connectives: AND (Fa Ga ) and OR (Fa + Ga ). The language includes two quantifiers—for all (∀) and there exists (∃)—and allows quantification over objects (e.g. ∀x Fx , where x is a variable that ranges over all objects in the domain). Finally, language OQ includes equality and inequality relations (= and =) which can be used to compare objects and object variables (e.g. =xa or =xy ). Table 2 shows several sentences formulated in language OQ. Suppose that the OQ complexity of each sentence is defined as the number of basic propositions it contains, where a basic proposition can be a positive or negative literal (Fa or Fa ′ ) or an equality or inequality statement (=xa or =xy ). Equivalently, the complexity of a sentence is the total number of ANDs plus the total number of ORs plus one. This measure is equivalent by design to Feldman’s [2] notion of Boolean complexity when applied to a sentence without quantification. The complexity values in Table 2 show minimal complexity values for each concept in Domains 3 and 4. Table 2 also shows a single sentence that achieves each of these complexity values, although some concepts admit multiple sentences of minimal complexity. The complexity values in Table 2 were computed using an “enumerate then combine” approach. We began by enumerating a set of sentences according to criteria described in the next paragraph. Each sentence has an extension that specifies which items in the domain are consistent with the sentence. Given the extensions of all sentences generated during the enumeration phase, the combination phase considered all possible ways to combine these extensions using conjunctions or disjunctions. The procedure terminated once extensions corresponding to all of the concepts in the domain had been found. Although the number of possible sentences grows rapidly as the complexity of these sentences increases, the number of extensions is fixed and relatively small (28 for domains of size 8). The combination phase is tractable since sentences with the same extension can be grouped into a single equivalence class. The enumeration phase considered all formulae which had at most two quantifiers and which had a complexity value lower than four. For example, this phase did not include the formula ∃x ∃y ∃z =yz F′ Fy Fz (too many quantifiers) or the formula ∀x ∃y =xy Fy (Fx + Gx + Hx ) (complexity x too high). Despite these restrictions, we believe that the complexity values in Table 2 are identical to the values that would be obtained if we had considered all possible sentences. Language F Q is similar to OQ but allows quantification over features rather than objects. For example, F Q includes the statement ∀Q Qa , where Q is a variable that ranges over all features in the domain. Language F Q also allows features and feature variables to be compared for equality or inequality (e.g. =QF or =QR ). Since F Q and OQ are closely related, it follows that the F Q complexity values for Domains 3 and 4 are identical to the OQ complexity values for Domains 4 and 3. For example, F Q can express concept 5 in Domain 3 as ∀Q ∃R =QR Ra . We can combine OQ and F Q to create a language OQ + F Q that allows quantification over both objects and features. Allowing both kinds of quantification leads to identical complexity values for Domains 3 and 4. Language OQ + F Q can express each of the formulae for Domain 4 in Table 2, and these formulae can be converted into corresponding formulae for Domain 3 by translating each instance of object quantification into an instance of feature quantification. Logicians distinguish between first-order logic, which allows quantification over objects but not predicates, and second-order logic, which allows quantification over objects and predicates. The difference between languages OQ and OQ + F Q is superficially similar to the difference between first-order and second-order logic, but does not cut to the heart of this matter. Since language 5 # 1 Domain 3 Domain 4 C 1 Ga C 1 Fb 2 Fa Ha + Fa Ha 4 Fa Fc + Fa Fc 4 3 Fa ′ Ga + Fa Ha 4 Fa ′ Fb + Fa Fc 4 4 Fa ′ Ga ′ + Fa Ha 4 Fa ′ Fb ′ + Fa Fc 4 5 Ga (Fa + Ha ) + Fa Ha 2 6 7 8 ′ ′ ′ ′ 5 ∀x ∃y =xy Fy ′ 5 ′ ′ 6 Ga (Fa + Ha ) + Fa Ha Ga (Fa + Ha ) + Fa Ga Ha 3 (∀x Fx ) + Fb ∃y Fy ′ ′ ′ (∀x Fx ) + Fb (Fa + Fc ) 4 ′ ′ ′ 6 ′ ′ 6 (∀x Fx ) + Fa (Fb + Fc ) 4 10 (∀x Fx ) + ∃y ∀z Fy (=zy +Fz ′ ) 4 Ha (Fa + Ga ) + Fa Ga Ha 9 Fa (Ga + Ha ) + Fa Ga Ha 10 Ga ′ (Fa Ha ′ + Fa ′ Ha ) + Ga (Fa ′ Ha ′ + Fa Ha ) ′ ′ ′ Fc (Fa + Fb ) + Fa Fb Fc ′ ′ 6 Table 2: Complexity values C and corresponding formulae for language OQ. Boolean complexity predicts complexity values for both domains that are identical to the OQ complexity values shown here for Domain 3. Language F Q predicts complexity values for Domains 3 and 4 that are identical to the OQ values for Domains 4 and 3 respectively. Language OQ + F Q predicts complexity values for both domains that are identical to the OQ complexity values for Domain 4. OQ + F Q only supports quantification over a pre-specified set of features, it is equivalent to a typed first order logic that includes types for objects and features [15]. Future studies, however, can explore the cognitive relevance of higher-order logic as developed by logicians. 3 Experiment Now that we have introduced languages OQ, F Q and OQ + F Q our theoretical proposals can be sharply formulated. We suggest that quantification over objects plays an important role in mental representations, and predict that OQ complexity will account better for human learning than Boolean complexity. We also propose that quantification over objects is more natural than quantification over features, and predict that OQ complexity will account better for human learning than both F Q complexity and OQ + F Q complexity. We tested these predictions by designing an experiment where participants learned concepts from Domains 3 and 4. Method. 20 adults participated for course credit. Each participant was assigned to Domain 3 or Domain 4 and learned all ten concepts from that domain. The items used for each domain were the cards shown in Table 1. Note, for example, that each Domain 3 card showed one square, and that each Domain 4 card showed three squares. These items are based on stimuli developed by Sakamoto and Love [12]. The experiment was carried out using a custom built graphical interface. For each learning problem in each domain, all eight items were simultaneously presented on the screen, and participants were able to drag them around and organize them however they liked. Each problem had three phases. During the learning phase, the four items belonging to the current concept had red boundaries, and the remaining four items had blue boundaries. During the memory phase, these colored boundaries were removed, and participants were asked to sort the items into the red group and the blue group. If they made an error they returned to the learning phase, and could retake the test whenever they were ready. During the description phase, participants were asked to provide a written description of the two groups of cards. The color assignments (red or blue) were randomized across participants— in other words, the “red groups” learned by some participants were identical to the “blue groups” learned by others. The order in which participants learned the 10 concepts was also randomized. Model predictions. The OQ complexity values for the ten concepts in each domain are shown in Table 2 and plotted in Figure 2a. The complexity values in Figure 2a have been normalized so that they sum to one within each domain, and the differences of these normalized scores are shown in the final row of Figure 2a. The two largest bars in the difference plot indicate that Concepts 10 and 5 are predicted to be easier to learn in Domain 4 than in Domain 3. Language OQ can express 6 OQ complexity Domain 3 a) Learning time b) 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 Difference Domain 4 0.2 0.2 0.1 1 2 3 4 5 6 7 8 9 10 0.1 0 0 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Figure 2: Normalized OQ complexity values and normalized learning times for the 10 concepts in Domains 3 and 4. statements like “either 1 or 3 objects have F ” (Concept 10 in Domain 4), or “2 or more objects have F ” (Concept 5 in Domain 4). Since quantification over features is not permitted, however, analogous statements (e.g. “object a has either 1 or 3 features”) cannot be formulated in Domain 3. Concept 10 corresponds to SHJ type VI, which often emerges as the most difficult concept in studies of Boolean concept learning. Our model therefore predicts that the standard ordering of the SHJ types will not apply in Domain 4. Our model also predicts that concepts assigned to the same SHJ type will have different complexities. In Domain 4 the model predicts that Concept 6 will be harder to learn than Concept 5 (both are examples of SHJ type IV), and that Concept 8 will be harder to learn than Concepts 7 or 9 (all three are examples of SHJ type V). Results. The computer interface recorded the amount of time participants spent on the learning phase for each concept. Domain 3 was a little more difficult than Domain 4 overall: on average, Domain 3 participants took 557 seconds and Domain 4 participants took 467 seconds to learn the 10 concepts. For all remaining analyses, we consider learning times that are normalized to sum to 1 for each participant. Figure 2b shows the mean values for these normalized times, and indicates the relative difficulties of the concepts within each condition. The difference plot in Figure 2b supports the two main predictions identified previously. Concepts 10 and 5 are the cases that differ most across the domains, and both concepts are easier to learn in Domain 3 than Domain 4. As predicted, Concept 5 is substantially easier than Concept 6 in Domain 4 even though both correspond to the same SHJ type. Concepts 7 through 9 also correspond to the same SHJ type, and the data for Domain 4 suggest that Concept 8 is the most difficult of the three, although the difference between Concepts 8 and 7 is not especially large. Four sets of complexity predictions are plotted against the human data in Figure 3. Boolean complexity and OQ complexity make identical predictions about Domain 3, and OQ complexity and OQ + F Q complexity make identical predictions about Domain 4. Only OQ complexity, however, accounts for the results observed in both domains. The concept descriptions generated by participants provide additional evidence that there are psychologically important differences between Domains 3 and 4. If the descriptions for concepts 5 and 10 are combined, 18 out of 20 responses in Domain 4 referred to quantification or counting. One representative description of Concept 5 stated that “red has multiple filled” and that “blue has one filled or none.” Only 3 of 20 responses in Domain 3 mentioned quantification. One representative description of Concept 5 stated that “red = multiple features” and that “blue = only one feature.” 7 r=0.84 0.2 r=0.84 0.2 r=0.26 0.2 r=0.26 0.2 Learning time (Domain 3) 0.1 0.1 0 (Domain 4) 0.2 r=0.27 0.2 Learning time 0.1 0.1 0 0.2 r=0.83 0.2 0.1 0.1 0 0.1 0.2 0 0.1 0.2 r=0.27 0.2 0.1 Boolean complexity 0.1 0.1 0.2 OQ complexity 0.1 0.2 r=0.83 0.2 0.1 0 0 0.1 0 0.1 0.2 F Q complexity 0 0.1 0.2 OQ + F Q complexity Figure 3: Normalized learning times for each domain plotted against normalized complexity values predicted by four languages: Boolean logic, OQ, F Q and OQ + F Q. These results suggest that people can count or quantify over features, but that it is psychologically more natural to quantify over objects rather than features. Although we have focused on three specific languages, the results in Figure 2b can be used to evaluate alternative proposals about the language of thought. One such alternative is an extension of Language OQ that allows feature values to be compared for equality. This extended language supports concise representations of Concept 2 in both Domain 3 (Fa = Ha ) and Domain 4 (Fa = Fc ), and predicts that Concept 2 will be easier to learn than all other concepts except Concept 1. Note, however, that this prediction is not compatible with the data in Figure 2b. Other languages might also be considered, but we know of no simple language that will account for our data better than OQ. 4 Conclusion Comparing concept learning across qualitatively different domains can provide valuable information about the nature of mental representation. We compared two domains that that are similar in many respects, but that differ according to whether they include a single object (Domain 3) or multiple objects (Domain 4). Quantification over objects is possible in Domain 4 but not Domain 3, and this difference helps to explain the different learning patterns we observed across the two domains. Our results suggest that concept representations can incorporate quantification, and that quantifying over objects is more natural than quantifying over features. The model predictions we reported are based on a language (OQ) that is a generic version of first order logic with equality. Our results therefore suggest that some of the languages commonly considered by logicians (e.g. first order logic with equality) may indeed capture some aspects of the “laws of thought” [16]. A simple language like OQ offers a convenient way to explore the role of quantification, but this language will need to be refined and extended in order to provide a more accurate account of mental representation. For example, a comprehensive account of the language of thought will need to support quantification over features in some cases, but might be formulated so that quantification over features is typically more costly than quantification over objects. Many possible representation languages can be imagined and a large amount of empirical data will be needed to identify the language that comes closest to the language of thought. Many relevant studies have already been conducted [2, 6, 8, 9, 13, 17], but there are vast regions of the conceptual universe (Table 1) that remain to be explored. Navigating this universe is likely to involve several challenges, but web-based experiments [18, 19] may allow it to be explored at a depth and scale that are currently unprecedented. Characterizing the language of thought is undoubtedly a long term project, but modern methods of data collection may support rapid progress towards this goal. Acknowledgments I thank Maureen Satyshur for running the experiment. This work was supported in part by NSF grant CDI-0835797. 8 References [1] J. A. Fodor. The language of thought. Harvard University Press, Cambridge, 1975. [2] J. Feldman. Minimization of Boolean complexity in human concept learning. Nature, 407: 630–633, 2000. [3] D. Fass and J. Feldman. Categorization under complexity: A unified MDL account of human learning of regular and irregular categories. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 35–34. MIT Press, Cambridge, MA, 2003. [4] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [5] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008. [6] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75(13), 1961. Whole No. 517. [7] R. M. Nosofsky, M. Gluck, T. J. Palmeri, S. C. McKinley, and P. Glauthier. Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory and Cognition, 22:352–369, 1994. [8] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1):43–58, 2002. [9] C. D. Aitkin and J. Feldman. Subjective complexity of categories defined over three-valued features. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 961–966. Psychology Press, New York, 2006. [10] F. Mathy and J. Bradmetz. A theory of the graceful complexification of concepts and their learnability. Current Psychology of Cognition, 22(1):41–82, 2004. [11] R. Vigo. A note on the complexity of Boolean concepts. Journal of Mathematical Psychology, 50:501–510, 2006. [12] Y. Sakamoto and B. C. Love. Schematic influences on category learning and recognition memory. Journal of Experimental Psychology: General, 133(4):534–553, 2004. [13] W. H. Crockett. Balance, agreement and positivity in the cognition of small social structures. In Advances in Experimental Social Psychology, Vol 15, pages 1–57. Academic Press, 1982. [14] N. B. Cottrell. Heider’s structural balance principle as a conceptual rule. Journal of Personality and Social Psychology, 31(4):713–720, 1975. [15] H. B. Enderton. A mathematical introduction to logic. Academic Press, New York, 1972. [16] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probabilities. 1854. [17] B. C. Love and A. B. Markman. The nonindependence of stimulus properties in human category learning. Memory and Cognition, 31(5):790–799, 2003. [18] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006. [19] R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008. 9

4 0.59392786 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

5 0.5225603 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We 2 also show how to construct efficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

6 0.52028412 72 nips-2009-Distribution Matching for Transduction

7 0.51773578 239 nips-2009-Submodularity Cuts and Applications

8 0.5171572 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

9 0.51671261 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

10 0.5166229 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

11 0.51636726 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

12 0.51592624 129 nips-2009-Learning a Small Mixture of Trees

13 0.51574028 180 nips-2009-On the Convergence of the Concave-Convex Procedure

14 0.51487046 220 nips-2009-Slow Learners are Fast

15 0.51434809 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

16 0.51418465 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

17 0.51293325 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

18 0.51285785 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

19 0.51238751 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

20 0.51209253 128 nips-2009-Learning Non-Linear Combinations of Kernels