nips nips2011 nips2011-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
Reference: text
sentIndex sentText sentNum sentScore
1 il Abstract In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. [sent-7, score-0.078]
2 In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. [sent-9, score-0.295]
3 1 Introduction Semidefinite programming (SDP) has become a tool of great importance in optimization in the past years. [sent-11, score-0.078]
4 In the field of combinatorial optimization for example, numerous approximation algorithms have been discovered starting with Goemans and Williamson [1] and [2, 3, 4]. [sent-12, score-0.101]
5 In the field of machine learning solving semidefinite programs is at the heart of many learning tasks such as learning a distance metric [5], sparse PCA [6], multiple kernel learning [7], matrix completion [8], and more. [sent-13, score-0.148]
6 It is often the case in machine learning that the data is assumed no be noisy and thus when considering the underlying optimization problem, one can settle for an approximated solution rather then an exact one. [sent-14, score-0.172]
7 Moreover it is also common in such problems that the amounts of data are so large that fast approximation algorithms are preferable to exact generic solvers, such as interior-point methods, which have impractical running times and memory demands and are not scalable. [sent-15, score-0.118]
8 In the problem of learning a distance metric [5] one is given a set of points in Rn and similarity information in the form of pairs of points and a label indicating weather the two points are in the same class or not. [sent-16, score-0.063]
9 The goal is to learn a distance metric over Rn which respects this similarity information. [sent-17, score-0.095]
10 Learning such a metric is important for other learning tasks which rely on having a good metric over the input space, such as K-means, nearest-neighbours and kernel-based algorithms. [sent-19, score-0.126]
11 In this work we present the first approximation algorithm for general semidefinite programming which runs in time that is sublinear in the size of the input. [sent-20, score-0.229]
12 For the special case of learning a pseudo-distance metric, we present an even faster sublinear time algorithm. [sent-21, score-0.128]
13 Our algorithms are the fastest possible in terms of the number of constraints and the dimensionality, although slower than other methods in terms of the approximation guarantee. [sent-22, score-0.062]
14 1 Related Work Semidefinite programming is a notoriously difficult optimization formulation, and has attracted a host of attempts at fast approximation methods. [sent-24, score-0.14]
15 Various faster and more sophisticated approximate solvers followed [10, 11, 12], which feature near-linear running time albeit polynomial dependence on the approximation accuracy. [sent-26, score-0.162]
16 For the special case of covering an packing SDP problems, [13] and [14] respectively give approximation algorithms with a smaller dependency on the approximation parameter . [sent-27, score-0.16]
17 Our algorithms are based on the recent work of [15] which described sublinear algorithms for various machine learning optimization problems such has linear classification and minimum enclosing ball. [sent-28, score-0.167]
18 The notation X 0 states that the matrix X is positive semi definite, i. [sent-37, score-0.074]
19 For reasons that will be made clearer in the analysis, we will assume that for all i ∈ [m], Ai ≤ 1 The optimization problem (1) can be reduced to a feasibility problem by a standard reduction of performing a binary search over the value of the objective C◦X and adding an appropriate constraint. [sent-53, score-0.083]
20 Thus we will only consider the feasibility problem of finding a solution that satisfies all constraints. [sent-54, score-0.078]
21 The feasibility problem can be rewritten using the following min-max formulation max min Ai ◦ X X 0 i∈[m] (2) Clearly if the optimum value of (2) is non-negative, then a feasible solution exists and vice versa. [sent-55, score-0.198]
22 Denoting the optimum of (2) by σ, an additive approximation algorithm to (2) is an algorithm that produces a solution X such that X 0 and for all i ∈ [m], Ai ◦ X ≥ σ − . [sent-56, score-0.096]
23 We will be interested in a solution to (2) which lies in the bounded semidefinite cone K = {X|X 0, Tr(X) ≤ 1}. [sent-58, score-0.075]
24 The demand on a solution to (2) to have bounded trace is due to the observation that in case σ > 0, any solution needs to be bounded or else the products Ai ◦ X could be made to be arbitrarily large. [sent-59, score-0.125]
25 Learning distance pseudo metrics In the problem of learning a distance metric from examples, we are given a set triplets S = {{xi , xi , yi }}m such that xi , xi ∈ Rn and yi ∈ {−1, 1}. [sent-60, score-0.473]
26 A value i=1 yi = 1 indicates that the vectors xi , xi are in the same class and a value yi = −1 indicates that they are from different classes. [sent-61, score-0.31]
27 A reasonable demand from a ”good” pseudo metric is that it separates the examples 2 (assuming such a separation exists). [sent-66, score-0.155]
28 If both algorithms achieve sublinear regret, then this framework is known to approximate max-min problems such as (5), in case a feasible solution exists [16]. [sent-70, score-0.196]
29 The primal algorithm which controls X is a gradient ascent algorithm that given p adds to the current solution a vector in the direction of the gradient i∈[m] p(i)Ai . [sent-71, score-0.155]
30 The dual algorithm which controls p is a variant of the well known multiplicative (or exponential) update rule for online optimization over the simplex which updates the weight p(i) according to the product Ai ◦ X (line 11). [sent-73, score-0.125]
31 A problem that arises with this estimation procedure is that it might yield unbounded values which do not fit well with the multiplicative weights analysis. [sent-77, score-0.049]
32 Thus we use a clipping procedure clip(z, V ) ≡ min{V, max{−V, Z}} to bound these estimations in a certain range (line 10). [sent-78, score-0.06]
33 This constraint is enforced by performing a projection step onto the convex set K after each gradient improvement step of the primal online algorithm. [sent-81, score-0.209]
34 A projection of a matrix Y ∈ Rn×n onto K is given by Yp = arg minX∈K Y − X . [sent-82, score-0.21]
35 Unlike the algorithms in [15] that perform optimization over simple sets such as the euclidean unit ball which is trivial to project onto, projecting onto the bounded semidefinite cone is more complicated and usually requires to diagonalize the projected matrix (assuming it is symmetric). [sent-83, score-0.225]
36 Instead, we show that one can settle for an approximated projection which is faster to compute (line 4). [sent-84, score-0.17]
37 Such approximated projections could be computed by Hazan’s algorithm for offline optimization over the bounded semidefinite cone, presented in [12]. [sent-85, score-0.081]
38 Given a matrix Y ∈ Rn×n , > 0, let f (X) = − Y − X 2 and denote X ∗ = ˜ arg maxX∈K f (X). [sent-88, score-0.079]
39 5 We can now state the running time of our algorithm. [sent-91, score-0.056]
40 Algorithm SublinearSDP has running time O 3 m 2 + n2 5 . [sent-94, score-0.056]
41 2: Let T ← 602 −2 log m, Y1 ← 0n×n , w1 ← 1m , η ← 3: for t = 1 to T do w 4: pt ← wtt 1 , Xt ←ApproxProject(Yt , 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: log m T , P ← /2. [sent-96, score-0.22]
42 Yt+1 ← Yt + √1 Ait 2T Choose (jt , lt ) ∈ [n] × [n] by (jt , lt ) ← (j, l) w. [sent-101, score-0.11]
43 for i ∈ [m] do vt ← Ai (jt , lt ) Xt 2 /Xt (jt , lt ) ˜ vt (i) ←clip(˜t (i), 1/η) v wt+1 (i) ← wt (i)(1 − ηvt (i) + η 2 vt (i)2 ) end for end for 1 ¯ return X = T t Xt We also have the following lower bound. [sent-104, score-0.87]
44 Any algorithm which computes an -approximation with probability at least has running time Ω m 2 + n 2 2 2 3 to (2) . [sent-107, score-0.056]
45 Here it is important to note that our lower bound does not reflect the computational effort in computing a general solution that is positive semidefinite which is in fact the computational bottleneck of our algorithm (due to the use of the projection procedure). [sent-109, score-0.105]
46 Let 0 < η ∈ R, w1 ← 1m , and for t ≥ 1, pt ← wt / wt 1 , wt+1 ← wt (i)(1 − ηqt (i) + η 2 qt (i)2 ) The following lemma gives a bound on the regret of the MW algorithm, suitable for the case in which the losses are random variables with bounded variance. [sent-118, score-0.605]
47 The MW algorithm satisfies pt qt ≤ min t∈[T ] i∈[m] t∈[T ] log m 1 +η max{qt (i), − } + η η 2 pt qt t∈[t] The following lemma gives concentration bounds on our random variables from their expectations. [sent-121, score-0.629]
48 For 1/4 ≥ η ≥ (i) maxi∈[m] t∈[T ] [vt (i) log m T , with probability at least 1 − O(1/m), it holds that − Ai ◦ Xt ] ≤ 4ηT Ait ◦ Xt − (ii) t∈[T ] pt vt ≤ 8ηT t∈[T ] The following Lemma gives a regret bound on the lazy gradient ascent algorithm used in our algorithm (line 6). [sent-124, score-0.548]
49 , AT ∈ Rn×n such that for all i ∈ [m] Ai X0 = 0n×n and for all t ≥ 1 let Xt+1 = arg minX∈K X∈K t∈[T ] Aτ − X ≤ 1. [sent-132, score-0.043]
50 Let Then √ At ◦ Xt ≤ 2 2T At ◦ X − max t τ =1 √1 2T t∈[T ] We are now ready to state the main theorem and prove it. [sent-133, score-0.044]
51 With probability 1/2, the SublinearSDP algorithm returns an additive approximation to (5). [sent-136, score-0.062]
52 At first assume that the projection onto the set K in line 4 is an exact projection and not an ˜ approximation and denote by Xt the exact projection of Yt . [sent-138, score-0.384]
53 4 we have √ ˜ max Ai t ◦ X − Ait ◦ Xt ≤ 2 2T (6) x∈K t∈[T ] t∈[T ] By the law of cosines and lemma 3. [sent-140, score-0.149]
54 2, and using the clipping of vt (i) we have pt vt ≤ min i∈[i] t∈[T ] 2 pt vt vt (i) + (log m)/η + η t∈[t] t∈[T ] By Lemma 4. [sent-142, score-1.352]
55 at least 3/4, 2 pt vt ≤ 8T t∈[T ] 5 (9) Combined with lemma 4. [sent-145, score-0.501]
56 at least 3 4 1 − O( n ) ≥ 1 2 i∈[i] ≥ √ −(log m)/η − 8ηT − 4ηT + T σ − 2 2T − 8ηT − T ≥ Ai ◦ Xt min Tσ − √ log m − 20ηT − 2 2T − T η Dividing through by T and plugging in our choice for η and w. [sent-148, score-0.042]
57 vi A 0 Letting vi = and defining the set of matrices P = |A 0, A ≤ 1 , we 1 0 −1 can rewrite (10) in the following form. [sent-152, score-0.274]
58 max min −yi vi vi ◦ A A∈P p∈∆m (11) In what comes next, we use the notation Ai = −yi vi vi . [sent-153, score-0.634]
59 Since projecting a matrix onto the set P is as easy as projecting a matrix onto the set {A 0, A ≤ 1}, we assume for the simplicity of the presentation that the set on which we optimize is indeed P = {A 0, A ≤ 1}. [sent-154, score-0.324]
60 The gradient of yi vi vi ◦ A with respect to A is a symmetric rank one matrix and here we have the following useful fact that was previously stated in [18]. [sent-156, score-0.441]
61 If A ∈ Rn×n is positive semi definite, v ∈ Rn and α ∈ R then the matrix B = A + αvv has at most one negative eigenvalue. [sent-159, score-0.074]
62 The proof is due to the eigenvalue Interlacing Theorem (see [19] pp. [sent-160, score-0.099]
63 Since in practice computing eigenvalues fast, using the Power or Lanczos methods, can be done only up to a desired approximation, in fact the resulting projection Xt+1 might not be positive semidefinite. [sent-163, score-0.071]
64 Nevertheless, we show by care-full analysis that we can still settle for a single eigenvector computation in order to compute an approximated projection with the price that Xt+1 − 3 I. [sent-164, score-0.207]
65 The benefit is an algorithm with improved performance over the general SDP algorithm since far less eigenvalue computations are required than in Hazan’s algorithm. [sent-166, score-0.099]
66 The projection to the set P is carried out in lines 7-11. [sent-167, score-0.071]
67 In line 7 we check if Yt+1 has a negative eigenvalue and if so, we compute the corresponding eigenvector in line 8 and remove it in line 9. [sent-168, score-0.283]
68 In line 11 we normalize the l2 norm of the solution. [sent-169, score-0.049]
69 The procedure Sample(Ai , Xt ) will be detailed later on when we discuss the running time. [sent-170, score-0.056]
70 The following Lemma is a variant of Zinkevich’s Online Gradient Ascent algorithm [21] suitable for the use of approximated projections when Xt is not necessarily inside the set P. [sent-171, score-0.042]
71 Let X0 = 0n×n and for all t ≥ 0 let Yt+1 = Xt + ηAt , ˜ Xt+1 = arg min Yt+1 − X X∈P 6 Algorithm 2 SublinearPseudoMetric 1: Input: > 0, Ai = yi vi vi ∈ Rn×n for i ∈ [m]. [sent-178, score-0.449]
72 Then, for a proper choice of η it holds that, At ◦ Xt ≤ √ 2T + t∈[T ] 3 3/2 dT 2 The following lemma states the connection between the precision used in eigenvalues approximation in lines 7-8, and the quality of the approximated projection. [sent-185, score-0.241]
73 Assume that on each iteration t of the algorithm, the eigenvalue computation in ˜ line 7 is a δ = 4T d additive approximation of the smallest eigenvalue of Yt+1 and let Xt = 1. [sent-188, score-0.309]
74 Algorithm SublinearPseudoMetric computes an additive approximation to (11) w. [sent-192, score-0.062]
75 3 we have, At ◦ X − max X∈P Setting d = 2 P √ 3 T where t∈[T ] At ◦ Xt ≤ √ 2T + t∈[T ] 3 3/2 dT 2 is the same as in theorem 4. [sent-198, score-0.044]
76 5 yields, √ arg max At ◦ X − At ◦ Xt ≤ 2T + P X∈P t∈[T ] PT t∈[T ] The rest of the proof follows the same lines as theorem 4. [sent-199, score-0.087]
77 It is easily observed from the algorithm that for all t ∈ [T ], the matrix Xt can be represented as the sum of kt ≤ 2T symmetric rank-one matrices. [sent-202, score-0.106]
78 That is Xt is of the form Xt = i∈[kt ] αi zi zi , zi = 1 for all i. [sent-203, score-0.198]
79 Thus instead of computing Xt explicitly, we may represent it by the vectors zi and scalars αi . [sent-204, score-0.066]
80 Denote by α the vector of length kt in which the ith entry is just αi , for some iteration t ∈ [T ]. [sent-205, score-0.07]
81 The sampling procedure Sample(Ai , Xt ) in line 13, returns the value α2 2 Ai (j,l) α 2 k zk (j)zk (l)αk with probability α 2 · (zk (j)zk (l)) . [sent-207, score-0.179]
82 That is we first sample a vector zi according to 7 α and then we sample an entry (j, l) according to the chosen vector zi . [sent-208, score-0.132]
83 It is easily observed that vt (i) = Sample(Ai , Xt ) is an unbiased estimator of Ai ◦ Xt . [sent-209, score-0.265]
84 It also holds that: ˜ E[˜t (i)2 ] v = j∈[n],l∈[n],k∈[kt ] = kt α 2 Ai 2 2 αk Ai (j, l)2 α 4 2 (zk (j)zk (l)) · 2 α 2 (zk (j)zk (l))2 αk ˜ = O( −2 ) ˜ Thus taking vt (i) to be the average of O( −2 ) i. [sent-210, score-0.331]
85 We can now state the running time of the algorithm. [sent-213, score-0.056]
86 4, the required precision in eigenvalue approximation is O(1)T 2 . [sent-221, score-0.161]
87 Using the Lanczos method for eigenvalue approximation and the sparse representation of Xt de˜ scribed above, a single eigenvalue computation takes O(n −4. [sent-222, score-0.26]
88 Overall the running time on all iteration is as stated in the lemma. [sent-225, score-0.056]
89 6 Conclusions We have presented the first sublinear time algorithm for approximate semi-definite programming, a widely used optimization framework in machine learning. [sent-226, score-0.167]
90 The algorithm’s running time is optimal up to poly-logarithmic factors and its dependence on ε - the approximation guarantee. [sent-227, score-0.118]
91 The algorithm is based on the primal-dual approach of [15], and incorporates methods from previous SDP solvers [12]. [sent-228, score-0.044]
92 Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. [sent-236, score-0.096]
93 In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC ’04, pages 222–231, 2004. [sent-240, score-0.134]
94 O(sqrt(log n)) approximation algorithms for min uncut, min 2cnf deletion, and directed cut problems. [sent-242, score-0.18]
95 In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, STOC ’05, pages 573–581, 2005. [sent-243, score-0.134]
96 In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, STOC ’05, pages 553–562, 2005. [sent-247, score-0.134]
97 Distance metric learning, with application to clustering with side-information. [sent-252, score-0.063]
98 Efficient approximation algorithms for semidefinite programs arising from max cut and coloring. [sent-270, score-0.189]
99 In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, STOC ’96, pages 338–347, 1996. [sent-271, score-0.134]
100 In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, STOC ’07, pages 227–236, 2007. [sent-278, score-0.134]
wordName wordTfidf (topN-words)
[('xt', 0.426), ('ai', 0.344), ('ait', 0.256), ('semide', 0.247), ('yt', 0.239), ('vt', 0.229), ('pt', 0.167), ('sdp', 0.14), ('vi', 0.137), ('zk', 0.13), ('sublinear', 0.128), ('hazan', 0.114), ('lemma', 0.105), ('eigenvalue', 0.099), ('mw', 0.095), ('elad', 0.092), ('yi', 0.09), ('rn', 0.09), ('sanjeev', 0.08), ('sublinearpseudometric', 0.079), ('sublinearsdp', 0.079), ('qt', 0.074), ('wt', 0.073), ('projection', 0.071), ('arora', 0.07), ('kt', 0.07), ('da', 0.067), ('zi', 0.066), ('xi', 0.065), ('stoc', 0.064), ('avi', 0.064), ('clip', 0.064), ('nite', 0.064), ('metric', 0.063), ('approximation', 0.062), ('onto', 0.06), ('clipping', 0.06), ('symposium', 0.06), ('mini', 0.06), ('demand', 0.057), ('settle', 0.057), ('yit', 0.057), ('israel', 0.057), ('running', 0.056), ('jt', 0.056), ('lt', 0.055), ('garud', 0.053), ('iyengar', 0.053), ('phillips', 0.053), ('wtt', 0.053), ('programs', 0.049), ('multiplicative', 0.049), ('line', 0.049), ('projecting', 0.049), ('clifford', 0.046), ('goemans', 0.046), ('minp', 0.046), ('satyen', 0.046), ('vit', 0.046), ('minx', 0.045), ('max', 0.044), ('feasibility', 0.044), ('solvers', 0.044), ('arg', 0.043), ('annual', 0.043), ('kenneth', 0.043), ('latin', 0.043), ('min', 0.042), ('approximated', 0.042), ('gradient', 0.041), ('cone', 0.041), ('gert', 0.04), ('regret', 0.04), ('programming', 0.039), ('ascent', 0.039), ('optimization', 0.039), ('semi', 0.038), ('clarkson', 0.038), ('lanczos', 0.038), ('sdps', 0.038), ('eigenvector', 0.037), ('online', 0.037), ('unbiased', 0.036), ('klein', 0.036), ('packing', 0.036), ('matrix', 0.036), ('laurent', 0.035), ('pseudo', 0.035), ('david', 0.035), ('cut', 0.034), ('solution', 0.034), ('feasible', 0.034), ('presentation', 0.034), ('haifa', 0.033), ('technion', 0.033), ('holds', 0.032), ('respects', 0.032), ('maxx', 0.032), ('wishes', 0.031), ('pages', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
2 0.30983028 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.24625137 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1
4 0.24388166 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
5 0.23281616 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of N EWTRON is O(log T ) when α is a constant that does not vary with horizon T , and at most O(T 2/3 ) if α is allowed to increase to infinity √ with T . For α = O(log T ), the regret is bounded by O( T ), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
6 0.21773584 80 nips-2011-Efficient Online Learning via Randomized Rounding
7 0.20519039 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
8 0.17143978 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
9 0.16648848 21 nips-2011-Active Learning with a Drifting Distribution
10 0.15216275 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
11 0.12736042 162 nips-2011-Lower Bounds for Passive and Active Learning
12 0.11559536 215 nips-2011-Policy Gradient Coagent Networks
13 0.11466209 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
14 0.10229164 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
15 0.10056291 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
16 0.10048315 260 nips-2011-Sparse Features for PCA-Like Linear Regression
17 0.090754755 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
18 0.089561671 29 nips-2011-Algorithms and hardness results for parallel large margin learning
19 0.088932522 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
20 0.085238695 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
topicId topicWeight
[(0, 0.254), (1, -0.247), (2, -0.046), (3, -0.145), (4, 0.159), (5, 0.035), (6, 0.091), (7, -0.065), (8, 0.021), (9, -0.164), (10, 0.302), (11, -0.039), (12, 0.178), (13, 0.066), (14, -0.108), (15, 0.026), (16, 0.062), (17, 0.155), (18, 0.053), (19, -0.086), (20, 0.028), (21, 0.002), (22, 0.098), (23, -0.031), (24, -0.047), (25, -0.078), (26, 0.073), (27, -0.024), (28, 0.016), (29, -0.038), (30, -0.074), (31, 0.129), (32, 0.04), (33, 0.005), (34, -0.061), (35, 0.012), (36, -0.069), (37, 0.061), (38, 0.007), (39, -0.017), (40, 0.035), (41, -0.08), (42, -0.006), (43, -0.015), (44, 0.029), (45, -0.002), (46, -0.041), (47, 0.0), (48, -0.008), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.97075617 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
2 0.82858008 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
3 0.82540876 145 nips-2011-Learning Eigenvectors for Free
Author: Wouter M. Koolen, Wojciech Kotlowski, Manfred K. Warmuth
Abstract: We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu where u is a vector in Rn of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret. 1
4 0.78149134 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of N EWTRON is O(log T ) when α is a constant that does not vary with horizon T , and at most O(T 2/3 ) if α is allowed to increase to infinity √ with T . For α = O(log T ), the regret is bounded by O( T ), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
5 0.70606327 80 nips-2011-Efficient Online Learning via Randomized Rounding
Author: Nicolò Cesa-bianchi, Ohad Shamir
Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1
6 0.67877072 21 nips-2011-Active Learning with a Drifting Distribution
7 0.66379941 220 nips-2011-Prediction strategies without loss
8 0.62907779 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
9 0.54683745 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
10 0.50033849 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
11 0.49071658 162 nips-2011-Lower Bounds for Passive and Active Learning
12 0.48269951 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
13 0.46847028 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
14 0.42041579 139 nips-2011-Kernel Bayes' Rule
15 0.39372504 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
16 0.38889632 260 nips-2011-Sparse Features for PCA-Like Linear Regression
17 0.36434585 225 nips-2011-Probabilistic amplitude and frequency demodulation
18 0.35516983 272 nips-2011-Stochastic convex optimization with bandit feedback
19 0.34022763 4 nips-2011-A Convergence Analysis of Log-Linear Training
20 0.33941492 202 nips-2011-On the Universality of Online Mirror Descent
topicId topicWeight
[(0, 0.02), (4, 0.495), (20, 0.019), (26, 0.025), (31, 0.073), (43, 0.051), (45, 0.113), (57, 0.027), (74, 0.024), (83, 0.04), (99, 0.028)]
simIndex simValue paperId paperTitle
1 0.94830275 130 nips-2011-Inductive reasoning about chimeric creatures
Author: Charles Kemp
Abstract: Given one feature of a novel animal, humans readily make inferences about other features of the animal. For example, winged creatures often fly, and creatures that eat fish often live in the water. We explore the knowledge that supports these inferences and compare two approaches. The first approach proposes that humans rely on abstract representations of dependency relationships between features, and is formalized here as a graphical model. The second approach proposes that humans rely on specific knowledge of previously encountered animals, and is formalized here as a family of exemplar models. We evaluate these models using a task where participants reason about chimeras, or animals with pairs of features that have not previously been observed to co-occur. The results support the hypothesis that humans rely on explicit representations of relationships between features. Suppose that an eighteenth-century naturalist learns about a new kind of animal that has fur and a duck’s bill. Even though the naturalist has never encountered an animal with this pair of features, he should be able to make predictions about other features of the animal—for example, the animal could well live in water but probably does not have feathers. Although the platypus exists in reality, from a eighteenth-century perspective it qualifies as a chimera, or an animal that combines two or more features that have not previously been observed to co-occur. Here we describe a probabilistic account of inductive reasoning and use it to account for human inferences about chimeras. The inductive problems we consider are special cases of the more general problem in Figure 1a where a reasoner is given a partially observed matrix of animals by features then asked to infer the values of the missing entries. This general problem has been previously studied and is addressed by computational models of property induction, categorization, and generalization [1–7]. A challenge faced by all of these models is to capture the background knowledge that guides inductive inferences. Some accounts rely on similarity relationships between animals [6, 8], others rely on causal relationships between features [9, 10], and others incorporate relationships between animals and relationships between features [11]. We will evaluate graphical models that capture both kinds of relationships (Figure 1a), but will focus in particular on relationships between features. Psychologists have previously suggested that humans rely on explicit mental representations of relationships between features [12–16]. Often these representations are described as theories—for example, theories that specify a causal relationship between having wings and flying, or living in the sea and eating fish. Relationships between features may take several forms: for example, one feature may cause, enable, prevent, be inconsistent with, or be a special case of another feature. For simplicity, we will treat all of these relationships as instances of dependency relationships between features, and will capture them using an undirected graphical model. Previous studies have used graphical models to account for human inferences about features but typically these studies consider toy problems involving a handful of novel features such as “has gene X14” or “has enzyme Y132” [9, 11]. Participants might be told, for example, that gene X14 leads to the production of enzyme Y132, then asked to use this information when reasoning about novel animals. Here we explore whether a graphical model approach can account for inferences 1 (a) slow heavy flies (b) wings hippo 1 1 0 0 rhino 1 1 0 0 sparrow 0 0 1 1 robin 0 0 1 1 new ? ? 1 ? o Figure 1: Inductive reasoning about animals and features. (a) Inferences about the features of a new animal onew that flies may draw on similarity relationships between animals (the new animal is similar to sparrows and robins but not hippos and rhinos), and on dependency relationships between features (flying and having wings are linked). (b) A graph product produced by combining the two graph structures in (a). about familiar features. Working with familiar features raises a methodological challenge since participants have a substantial amount of knowledge about these features and can reason about them in multiple ways. Suppose, for example, that you learn that a novel animal can fly (Figure 1a). To conclude that the animal probably has wings, you might consult a mental representation similar to the graph at the top of Figure 1a that specifies a dependency relationship between flying and having wings. On the other hand, you might reach the same conclusion by thinking about flying creatures that you have previously encountered (e.g. sparrows and robins) and noticing that these creatures have wings. Since the same conclusion can be reached in two different ways, judgments about arguments of this kind provide little evidence about the mental representations involved. The challenge of working with familiar features directly motivates our focus on chimeras. Inferences about chimeras draw on rich background knowledge but require the reasoner to go beyond past experience in a fundamental way. For example, if you learn that an animal flies and has no legs, you cannot make predictions about the animal by thinking of flying, no-legged creatures that you have previously encountered. You may, however, still be able to infer that the novel animal has wings if you understand the relationship between flying and having wings. We propose that graphical models over features can help to explain how humans make inferences of this kind, and evaluate our approach by comparing it to a family of exemplar models. The next section introduces these models, and we then describe two experiments designed to distinguish between the models. 1 Reasoning about objects and features Our models make use of a binary matrix D where the rows {o1 , . . . , o129 } correspond to objects, and the columns {f 1 , . . . , f 56 } correspond to features. A subset of the objects is shown in Figure 2a, and the full set of features is shown in Figure 2b and its caption. Matrix D was extracted from the Leuven natural concept database [17], which includes 129 animals and 757 features in total. We chose a subset of these features that includes a mix of perceptual and behavioral features, and that includes many pairs of features that depend on each other. For example, animals that “live in water” typically “can swim,” and animals that have “no legs” cannot “jump far.” Matrix D can be used to formulate problems where a reasoner observes one or two features of a new object (i.e. animal o130 ) and must make inferences about the remaining features of the animal. The next two sections describe graphical models that can be used to address this problem. The first graphical model O captures relationships between objects, and the second model F captures relationships between features. We then discuss how these models can be combined, and introduce a family of exemplar-style models that will be compared with our graphical models. A graphical model over objects Many accounts of inductive reasoning focus on similarity relationships between objects [6, 8]. Here we describe a tree-structured graphical model O that captures these relationships. The tree was constructed from matrix D using average linkage clustering and the Jaccard similarity measure, and part of the resulting structure is shown in Figure 2a. The subtree in Figure 2a includes clusters 2 alligator caiman crocodile monitor lizard dinosaur blindworm boa cobra python snake viper chameleon iguana gecko lizard salamander frog toad tortoise turtle anchovy herring sardine cod sole salmon trout carp pike stickleback eel flatfish ray plaice piranha sperm whale squid swordfish goldfish dolphin orca whale shark bat fox wolf beaver hedgehog hamster squirrel mouse rabbit bison elephant hippopotamus rhinoceros lion tiger polar bear deer dromedary llama giraffe zebra kangaroo monkey cat dog cow horse donkey pig sheep (a) (b) can swim lives in water eats fish eats nuts eats grain eats grass has gills can jump far has two legs has no legs has six legs has four legs can fly can be ridden has sharp teeth nocturnal has wings strong predator can see in dark eats berries lives in the sea lives in the desert crawls lives in the woods has mane lives in trees can climb well lives underground has feathers has scales slow has fur heavy Figure 2: Graph structures used to define graphical models O and F. (a) A tree that captures similarity relationships between animals. The full tree includes 129 animals, and only part of the tree is shown here. The grey points along the branches indicate locations where a novel animal o130 could be attached to the tree. (b) A network capturing pairwise dependency relationships between features. The edges capture both positive and negative dependencies. All edges in the network are shown, and the network also includes 20 isolated nodes for the following features: is black, is blue, is green, is grey, is pink, is red, is white, is yellow, is a pet, has a beak, stings, stinks, has a long neck, has feelers, sucks blood, lays eggs, makes a web, has a hump, has a trunk, and is cold-blooded. corresponding to amphibians and reptiles, aquatic creatures, and land mammals, and the subtree omitted for space includes clusters for insects and birds. We assume that the features in matrix D (i.e. the columns) are generated independently over O: P (f i |O, π i , λi ). P (D|O, π, λ) = i i i i The distribution P (f |O, π , λ ) is based on the intuition that nearby nodes in O tend to have the same value of f i . Previous researchers [8, 18] have used a directed graphical model where the distribution at the root node is based on the baserate π i , and any other node v with parent u has the following conditional probability distribution: i P (v = 1|u) = π i + (1 − π i )e−λ l , if u = 1 i π i − π i e−λ l , if u = 0 (1) where l is the length of the branch joining node u to node v. The variability parameter λi captures the extent to which feature f i is expected to vary over the tree. Note, for example, that any node v must take the same value as its parent u when λ = 0. To avoid free parameters, the feature baserates π i and variability parameters λi are set to their maximum likelihood values given the observed values of the features {f i } in the data matrix D. The conditional distributions in Equation 1 induce a joint distribution over all of the nodes in graph O, and the distribution P (f i |O, π i , λi ) is computed by marginalizing out the values of the internal nodes. Although we described O as a directed graphical model, the model can be converted into an equivalent undirected model with a potential for each edge in the tree and a potential for the root node. Here we use the undirected version of the model, which is a natural counterpart to the undirected model F described in the next section. The full version of structure O in Figure 2a includes 129 familiar animals, and our task requires inferences about a novel animal o130 that must be slotted into the structure. Let D′ be an expanded version of D that includes a row for o130 , and let O′ be an expanded version of O that includes a node for o130 . The edges in Figure 2a are marked with evenly spaced gray points, and we use a 3 uniform prior P (O′ ) over all trees that can be created by attaching o130 to one of these points. Some of these trees have identical topologies, since some edges in Figure 2a have multiple gray points. Predictions about o130 can be computed using: P (D′ |D) = P (D′ |O′ , D)P (O′ |D) ∝ O′ P (D′ |O′ , D)P (D|O′ )P (O′ ). (2) O′ Equation 2 captures the basic intuition that the distribution of features for o130 is expected to be consistent with the distribution observed for previous animals. For example, if o130 is known to fly then the trees with high posterior probability P (O′ |D) will be those where o130 is near other flying creatures (Figure 1a), and since these creatures have wings Equation 2 predicts that o130 probably also has wings. As this example suggests, model O captures dependency relationships between features implicitly, and therefore stands in contrast to models like F that rely on explicit representations of relationships between features. A graphical model over features Model F is an undirected graphical model defined over features. The graph shown in Figure 2b was created by identifying pairs where one feature depends directly on another. The author and a research assistant both independently identified candidate sets of pairwise dependencies, and Figure 2b was created by merging these sets and reaching agreement about how to handle any discrepancies. As previous researchers have suggested [13, 15], feature dependencies can capture several kinds of relationships. For example, wings enable flying, living in the sea leads to eating fish, and having no legs rules out jumping far. We work with an undirected graph because some pairs of features depend on each other but there is no clear direction of causal influence. For example, there is clearly a dependency relationship between being nocturnal and seeing in the dark, but no obvious sense in which one of these features causes the other. We assume that the rows of the object-feature matrix D are generated independently from an undirected graphical model F defined over the feature structure in Figure 2b: P (oi |F). P (D|F) = i Model F includes potential functions for each node and for each edge in the graph. These potentials were learned from matrix D using the UGM toolbox for undirected graphical models [19]. The learned potentials capture both positive and negative relationships: for example, animals that live in the sea tend to eat fish, and tend not to eat berries. Some pairs of feature values never occur together in matrix D (there are no creatures that fly but do not have wings). We therefore chose to compute maximum a posteriori values of the potential functions rather than maximum likelihood values, and used a diffuse Gaussian prior with a variance of 100 on the entries in each potential. After learning the potentials for model F, we can make predictions about a new object o130 using the distribution P (o130 |F). For example, if o130 is known to fly (Figure 1a), model F predicts that o130 probably has wings because the learned potentials capture a positive dependency between flying and having wings. Combining object and feature relationships There are two simple ways to combine models O and F in order to develop an approach that incorporates both relationships between features and relationships between objects. The output combination model computes the predictions of both models in isolation, then combines these predictions using a weighted sum. The resulting model is similar to a mixture-of-experts model, and to avoid free parameters we use a mixing weight of 0.5. The structure combination model combines the graph structures used by the two models and relies on a set of potentials defined over the resulting graph product. An example of a graph product is shown in Figure 1b, and the potential functions for this graph are inherited from the component models in the natural way. Kemp et al. [11] use a similar approach to combine a functional causal model with an object model O, but note that our structure combination model uses an undirected model F rather than a functional causal model over features. Both combination models capture the intuition that inductive inferences rely on relationships between features and relationships between objects. The output combination model has the virtue of 4 simplicity, and the structure combination model is appealing because it relies on a single integrated representation that captures both relationships between features and relationships between objects. To preview our results, our data suggest that the combination models perform better overall than either O or F in isolation, and that both combination models perform about equally well. Exemplar models We will compare the family of graphical models already described with a family of exemplar models. The key difference between these model families is that the exemplar models do not rely on explicit representations of relationships between objects and relationships between features. Comparing the model families can therefore help to establish whether human inferences rely on representations of this sort. Consider first a problem where a reasoner must predict whether object o130 has feature k after observing that it has feature i. An exemplar model addresses the problem by retrieving all previouslyobserved objects with feature i and computing the proportion that have feature k: P (ok = 1|oi = 1) = |f k & f i | |f i | (3) where |f k | is the number of objects in matrix D that have feature k, and |f k & f i | is the number that have both feature k and feature i. Note that we have streamlined our notation by using ok instead of o130 to refer to the kth feature value for object o130 . k Suppose now that the reasoner observes that object o130 has features i and j. The natural generalization of Equation 3 is: P (ok = 1|oi = 1, oj = 1) = |f k & f i & f j | |f i & f j | (4) Because we focus on chimeras, |f i & f j | = 0 and Equation 4 is not well defined. We therefore evaluate an exemplar model that computes predictions for the two observed features separately then computes the weighted sum of these predictions: P (ok = 1|oi = 1, oj = 1) = wi |f k & f i | |f k & f j | + wj . i| |f |f j | (5) where the weights wi and wj must sum to one. We consider four ways in which the weights could be set. The first strategy sets wi = wj = 0.5. The second strategy sets wi ∝ |f i |, and is consistent with an approach where the reasoner retrieves all exemplars in D that are most similar to the novel animal and reports the proportion of these exemplars that have feature k. The third strategy sets wi ∝ |f1i | , and captures the idea that features should be weighted by their distinctiveness [20]. The final strategy sets weights according to the coherence of each feature [21]. A feature is coherent if objects with that feature tend to resemble each other overall, and we define the coherence of feature i as the expected Jaccard similarity between two randomly chosen objects from matrix D that both have feature i. Note that the final three strategies are all consistent with previous proposals from the psychological literature, and each one might be expected to perform well. Because exemplar models and prototype models are often compared, it is natural to consider a prototype model [22] as an additional baseline. A standard prototype model would partition the 129 animals into categories and would use summary statistics for these categories to make predictions about the novel animal o130 . We will not evaluate this model because it corresponds to a coarser version of model O, which organizes the animals into a hierarchy of categories. The key characteristic shared by both models is that they explicitly capture relationships between objects but not features. 2 Experiment 1: Chimeras Our first experiment explores how people make inferences about chimeras, or novel animals with features that have not previously been observed to co-occur. Inferences about chimeras raise challenges for exemplar models, and therefore help to establish whether humans rely on explicit representations of relationships between features. Each argument can be represented as f i , f j → f k 5 exemplar r = 0.42 7 feature F exemplar (wi = |f i |) (wi = 0.5) r = 0.44 7 object O r = 0.69 7 output combination r = 0.31 7 structure combination r = 0.59 7 r = 0.60 7 5 5 5 5 5 3 3 3 3 3 3 all 5 1 1 0 1 r = 0.06 7 conflict 0.5 1 1 0 0.5 1 r = 0.71 7 1 0 0.5 1 r = −0.02 7 1 0 0.5 1 r = 0.49 7 0 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.57 7 5 3 1 0 0.5 1 r = 0.51 7 edge 0.5 r = 0.17 7 1 1 0 0.5 1 r = 0.64 7 1 0 0.5 1 r = 0.83 7 1 0 0.5 1 r = 0.45 7 1 0 0.5 1 r = 0.76 7 0 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.79 7 5 3 1 1 0 0.5 1 r = 0.26 7 other 1 0 1 0 0.5 1 r = 0.25 7 1 0 0.5 1 r = 0.19 7 1 0 0.5 1 r = 0.25 7 1 0 0.5 1 r = 0.24 7 0 7 5 5 5 5 5 3 3 3 3 1 5 3 0.5 r = 0.33 3 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 0 0.5 1 Figure 3: Argument ratings for Experiment 1 plotted against the predictions of six models. The y-axis in each panel shows human ratings on a seven point scale, and the x-axis shows probabilities according to one of the models. Correlation coefficients are shown for each plot. where f i and f k are the premises (e.g. “has no legs” and “can fly”) and f k is the conclusion (e.g. “has wings”). We are especially interested in conflict cases where the premises f i and f j lead to opposite conclusions when taken individually: for example, most animals with no legs do not have wings, but most animals that fly do have wings. Our models that incorporate feature structure F can resolve this conflict since F includes a dependency between “wings” and “can fly” but not between “wings” and “has no legs.” Our models that do not include F cannot resolve the conflict and predict that humans will be uncertain about whether the novel animal has wings. Materials. The object-feature matrix D includes 447 feature pairs {f i , f j } such that none of the 129 animals has both f i and f j . We selected 40 pairs (see the supporting material) and created 400 arguments in total by choosing 10 conclusion features for each pair. The arguments can be assigned to three categories. Conflict cases are arguments f i , f j → f k such that the single-premise arguments f i → f k and f j → f k lead to incompatible predictions. For our purposes, two singlepremise arguments with the same conclusion are deemed incompatible if one leads to a probability greater than 0.9 according to Equation 3, and the other leads to a probability less than 0.1. Edge cases are arguments f i , f j → f k such that the feature network in Figure 2b includes an edge between f k and either f i or f j . Note that some arguments are both conflict cases and edge cases. All arguments that do not fall into either one of these categories will be referred to as other cases. The 400 arguments for the experiment include 154 conflict cases, 153 edge cases, and 120 other cases. 34 arguments are both conflict cases and edge cases. We chose these arguments based on three criteria. First, we avoided premise pairs that did not co-occur in matrix D but that co-occur in familiar animals that do not belong to D. For example, “is pink” and “has wings” do not co-occur in D but “flamingo” is a familiar animal that has both features. Second, we avoided premise pairs that specified two different numbers of legs—for example, {“has four legs,” “has six legs”}. Finally, we aimed to include roughly equal numbers of conflict cases, edge cases, and other cases. Method. 16 undergraduates participated for course credit. The experiment was carried out using a custom-built computer interface, and one argument was presented on screen at a time. Participants 6 rated the probability of the conclusion on seven point scale where the endpoints were labeled “very unlikely” and “very likely.” The ten arguments for each pair of premises were presented in a block, but the order of these blocks and the order of the arguments within these blocks were randomized across participants. Results. Figure 3 shows average human judgments plotted against the predictions of six models. The plots in the first row include all 400 arguments in the experiment, and the remaining rows show results for conflict cases, edge cases, and other cases. The previous section described four exemplar models, and the two shown in Figure 3 are the best performers overall. Even though the graphical models include more numerical parameters than the exemplar models, recall that these parameters are learned from matrix D rather than fit to the experimental data. Matrix D also serves as the basis for the exemplar models, which means that all of the models can be compared on equal terms. The first row of Figure 3 suggests that the three models which include feature structure F perform better than the alternatives. The output combination model is the worst of the three models that incorporate F, and the correlation achieved by this model is significantly greater than the correlation achieved by the best exemplar model (p < 0.001, using the Fisher transformation to convert correlation coefficients to z scores). Our data therefore suggest that explicit representations of relationships between features are needed to account for inductive inferences about chimeras. The model that includes the feature structure F alone performs better than the two models that combine F with the object structure O, which may not be surprising since Experiment 1 focuses specifically on novel animals that do not slot naturally into structure O. Rows two through four suggest that the conflict arguments in particular raise challenges for the models which do not include feature structure F. Since these conflict cases are arguments f i , f j → f k where f i → f k has strength greater than 0.9 and f j → f k has strength less than 0.1, the first exemplar model averages these strengths and assigns an overall strength of around 0.5 to each argument. The second exemplar model is better able to differentiate between the conflict arguments, but still performs substantially worse than the three models that include structure F. The exemplar models perform better on the edge arguments, but are outperformed by the models that include F. Finally, all models achieve roughly the same level of performance on the other arguments. Although the feature model F performs best overall, the predictions of this model still leave room for improvement. The two most obvious outliers in the third plot in the top row represent the arguments {is blue, lives in desert → lives in woods} and {is pink, lives in desert → lives in woods}. Our participants sensibly infer that any animal which lives in the desert cannot simultaneously live in the woods. In contrast, the Leuven database indicates that eight of the twelve animals that live in the desert also live in the woods, and the edge in Figure 2b between “lives in the desert” and “lives in the woods” therefore represents a positive dependency relationship according to model F. This discrepancy between model and participants reflects the fact that participants made inferences about individual animals but the Leuven database is based on features of animal categories. Note, for example, that any individual animal is unlikely to live in the desert and the woods, but that some animal categories (including snakes, salamanders, and lizards) are found in both environments. 3 Experiment 2: Single-premise arguments Our results so far suggest that inferences about chimeras rely on explicit representations of relationships between features but provide no evidence that relationships between objects are important. It would be a mistake, however, to conclude that relationships between objects play no role in inductive reasoning. Previous studies have used object structures like the example in Figure 2a to account for inferences about novel features [11]—for example, given that alligators have enzyme Y132 in their blood, it seems likely that crocodiles also have this enzyme. Inferences about novel objects can also draw on relationships between objects rather than relationships between features. For example, given that a novel animal has a beak you will probably predict that it has feathers, not because there is any direct dependency between these two features, but because the beaked animals that you know tend to have feathers. Our second experiment explores inferences of this kind. Materials and Method. 32 undergraduates participated for course credit. The task was identical to Experiment 1 with the following exceptions. Each two-premise argument f i , f j → f k from Experiment 1 was converted into two one-premise arguments f i → f k and f j → f k , and these 7 feature F exemplar r = 0.78 7 object O r = 0.54 7 output combination r = 0.75 7 structure combination r = 0.75 7 all 5 5 5 5 5 3 3 3 3 3 1 1 0 edge 0.5 1 r = 0.87 7 1 0 0.5 1 r = 0.87 7 1 0 0.5 1 r = 0.84 7 1 0 0.5 1 r = 0.86 7 0 5 5 5 3 3 3 1 5 3 0.5 r = 0.85 7 5 3 1 1 0 0.5 1 r = 0.79 7 other r = 0.77 7 1 0 0.5 1 r = 0.21 7 1 0 0.5 1 r = 0.74 7 1 0 0.5 1 r = 0.66 7 0 5 5 5 5 3 3 3 3 1 r = 0.73 7 5 0.5 3 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 1 0 0.5 1 0 0.5 1 Figure 4: Argument ratings and model predictions for Experiment 2. one-premise arguments were randomly assigned to two sets. 16 participants rated the 400 arguments in the first set, and the other 16 rated the 400 arguments in the second set. Results. Figure 4 shows average human ratings for the 800 arguments plotted against the predictions of five models. Unlike Figure 3, Figure 4 includes a single exemplar model since there is no need to consider different feature weightings in this case. Unlike Experiment 1, the feature model F performs worse than the other alternatives (p < 0.001 in all cases). Not surprisingly, this model performs relatively well for edge cases f j → f k where f j and f k are linked in Figure 2b, but the final row shows that the model performs poorly across the remaining set of arguments. Taken together, Experiments 1 and 2 suggest that relationships between objects and relationships between features are both needed to account for human inferences. Experiment 1 rules out an exemplar approach but models that combine graph structures over objects and features perform relatively well in both experiments. We considered two methods for combining these structures and both performed equally well. Combining the knowledge captured by these structures appears to be important, and future studies can explore in detail how humans achieve this combination. 4 Conclusion This paper proposed that graphical models are useful for capturing knowledge about animals and their features and showed that a graphical model over features can account for human inferences about chimeras. A family of exemplar models and a graphical model defined over objects were unable to account for our data, which suggests that humans rely on mental representations that explicitly capture dependency relationships between features. Psychologists have previously used graphical models to capture relationships between features, but our work is the first to focus on chimeras and to explore models defined over a large set of familiar features. Although a simple undirected model accounted relatively well for our data, this model is only a starting point. The model incorporates dependency relationships between features, but people know about many specific kinds of dependencies, including cases where one feature causes, enables, prevents, or is inconsistent with another. An undirected graph with only one class of edges cannot capture this knowledge in full, and richer representations will ultimately be needed in order to provide a more complete account of human reasoning. Acknowledgments I thank Madeleine Clute for assisting with this research. This work was supported in part by the Pittsburgh Life Sciences Greenhouse Opportunity Fund and by NSF grant CDI-0835797. 8 References [1] R. N. Shepard. Towards a universal law of generalization for psychological science. Science, 237:1317– 1323, 1987. [2] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409–429, 1991. [3] E. Heit. A Bayesian analysis of some forms of inductive reasoning. In M. Oaksford and N. Chater, editors, Rational models of cognition, pages 248–274. Oxford University Press, Oxford, 1998. [4] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001. [5] C. Kemp and J. B. Tenenbaum. Structured statistical models of inductive reasoning. Psychological Review, 116(1):20–58, 2009. [6] D. N. Osherson, E. E. Smith, O. Wilkie, A. Lopez, and E. Shafir. Category-based induction. Psychological Review, 97(2):185–200, 1990. [7] D. J. Navarro. Learning the context of a category. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1795–1803. 2010. [8] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum. Semi-supervised learning with trees. In Advances in Neural Information Processing Systems 16, pages 257–264. MIT Press, Cambridge, MA, 2004. [9] B. Rehder. A causal-model theory of conceptual representation and categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29:1141–1159, 2003. [10] B. Rehder and R. Burnett. Feature inference and the causal structure of categories. Cognitive Psychology, 50:264–314, 2005. [11] C. Kemp, P. Shafto, and J. B. Tenenbaum. An integrated account of generalization across objects and features. Cognitive Psychology, in press. [12] S. E. Barrett, H. Abdi, G. L. Murphy, and J. McCarthy Gallagher. Theory-based correlations and their role in children’s concepts. Child Development, 64:1595–1616, 1993. [13] S. A. Sloman, B. C. Love, and W. Ahn. Feature centrality and conceptual coherence. Cognitive Science, 22(2):189–228, 1998. [14] D. Yarlett and M. Ramscar. A quantitative model of counterfactual reasoning. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 123–130. MIT Press, Cambridge, MA, 2002. [15] W. Ahn, J. K. Marsh, C. C. Luhmann, and K. Lee. Effect of theory-based feature correlations on typicality judgments. Memory and Cognition, 30(1):107–118, 2002. [16] D. C. Meehan C. McNorgan, R. A. Kotack and K. McRae. Feature-feature causal relations and statistical co-occurrences in object concepts. Memory and Cognition, 35(3):418–431, 2007. [17] S. De Deyne, S. Verheyen, E. Ameel, W. Vanpaemel, M. J. Dry, W. Voorspoels, and G. Storms. Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods, 40(4):1030–1048, 2008. [18] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001. [19] M. Schmidt. UGM: A Matlab toolbox for probabilistic undirected graphical models. 2007. Available at http://people.cs.ubc.ca/∼schmidtm/Software/UGM.html. [20] L. J. Nelson and D. T. Miller. The distinctiveness effect in social categorization: you are what makes you unusual. Psychological Science, 6:246–249, 1995. [21] A. L. Patalano, S. Chin-Parker, and B. H. Ross. The importance of being coherent: category coherence, cross-classification and reasoning. Journal of memory and language, 54:407–424, 2006. [22] S. K. Reed. Pattern recognition and categorization. Cognitive Psychology, 3:393–407, 1972. 9
same-paper 2 0.92099357 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
3 0.91058773 193 nips-2011-Object Detection with Grammar Models
Author: Ross B. Girshick, Pedro F. Felzenszwalb, David A. McAllester
Abstract: Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data. 1
4 0.87216806 213 nips-2011-Phase transition in the family of p-resistances
Author: Morteza Alamgir, Ulrike V. Luxburg
Abstract: We study the family of p-resistances on graphs for p 1. This family generalizes the standard resistance distance. We prove that for any fixed graph, for p = 1 the p-resistance coincides with the shortest path distance, for p = 2 it coincides with the standard resistance distance, and for p ! 1 it converges to the inverse of the minimal s-t-cut in the graph. Secondly, we consider the special case of random geometric graphs (such as k-nearest neighbor graphs) when the number n of vertices in the graph tends to infinity. We prove that an interesting phase transition takes place. There exist two critical thresholds p⇤ and p⇤⇤ such that if p < p⇤ , then the p-resistance depends on meaningful global properties of the graph, whereas if p > p⇤⇤ , it only depends on trivial local quantities and does not convey any useful information. We can explicitly compute the critical values: p⇤ = 1 + 1/(d 1) and p⇤⇤ = 1 + 1/(d 2) where d is the dimension of the underlying space (we believe that the fact that there is a small gap between p⇤ and p⇤⇤ is an artifact of our proofs). We also relate our findings to Laplacian regularization and suggest to use q-Laplacians as regularizers, where q satisfies 1/p⇤ + 1/q = 1. 1
5 0.8393172 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
Author: Ichiro Takeuchi, Masashi Sugiyama
Abstract: We consider feature selection and weighting for nearest neighbor classifiers. A technical challenge in this scenario is how to cope with discrete update of nearest neighbors when the feature space metric is changed during the learning process. This issue, called the target neighbor change, was not properly addressed in the existing feature weighting and metric learning literature. In this paper, we propose a novel feature weighting algorithm that can exactly and efficiently keep track of the correct target neighbors via sequential quadratic programming. To the best of our knowledge, this is the first algorithm that guarantees the consistency between target neighbors and the feature space metric. We further show that the proposed algorithm can be naturally combined with regularization path tracking, allowing computationally efficient selection of the regularization parameter. We demonstrate the effectiveness of the proposed algorithm through experiments. 1
6 0.80314481 139 nips-2011-Kernel Bayes' Rule
7 0.69406003 127 nips-2011-Image Parsing with Stochastic Scene Grammar
8 0.63089395 64 nips-2011-Convergent Bounds on the Euclidean Distance
9 0.58642465 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
10 0.56939781 75 nips-2011-Dynamical segmentation of single trials from population neural data
11 0.56604517 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
12 0.55823863 154 nips-2011-Learning person-object interactions for action recognition in still images
13 0.55230951 231 nips-2011-Randomized Algorithms for Comparison-based Search
14 0.54641742 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
15 0.54429758 150 nips-2011-Learning a Distance Metric from a Network
16 0.54420996 168 nips-2011-Maximum Margin Multi-Instance Learning
17 0.54263949 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
18 0.51321888 303 nips-2011-Video Annotation and Tracking with Active Learning
19 0.51289058 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
20 0.51180208 29 nips-2011-Algorithms and hardness results for parallel large margin learning