nips nips2002 nips2002-30 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. [sent-9, score-0.367]
2 We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. [sent-10, score-0.625]
3 1 Introduction. This paper analyzes a class of optimization problems max_{q∈∆} (G(q) + βD(q)), (1) where ∆ is a linear constraint space, G and D are continuous, real-valued functions of q, smooth in the interior of ∆, and max_{q∈∆} G(q) is known. [sent-11, score-0.055]
4 Furthermore, G and D are invariant under the group of symmetries SN . [sent-12, score-0.053]
5 The following basic algorithm, various forms of which have appeared in [3, 4, 6, 7, 8], can be used to solve (1) for β = B. [sent-15, score-0.023]
6 Algorithm 1. Let q_0 be the maximizer of max_{q∈∆} G(q), (2) and let β_0 = 0. [sent-16, score-0.121]
7 Take q_{k+1}^{(0)} = q_k + η, where η is a small perturbation, as an initial guess for the solution q_{k+1} at β_{k+1}. [sent-22, score-0.506]
8 Optimization: solve max_{q∈∆} (G(q) + β_{k+1} D(q)) to get the maximizer q_{k+1}, using initial guess q_{k+1}^{(0)}. [sent-24, score-0.166]
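To make the steps concrete, here is a minimal Python sketch of the annealing loop in Algorithm 1. The solver maximize_F, the β increment d, the perturbation scale, and the stopping value B are illustrative assumptions of this sketch, not part of the original specification.

```python
import numpy as np

def algorithm_1(q0, maximize_F, d=0.1, B=2.0, seed=0):
    """Basic annealing loop: beta-step, perturbed initial guess, optimization.
    maximize_F(q_init, beta) is assumed to return a local maximizer of
    G(q) + beta * D(q) over the constraint space Delta."""
    rng = np.random.default_rng(seed)
    beta, q = 0.0, q0
    history = [(beta, q)]
    while beta < B:
        beta += d                                         # step 1: beta-step
        q_init = q + 1e-4 * rng.standard_normal(q.shape)  # step 2: perturbed initial guess
        q = maximize_F(q_init, beta)                      # step 3: optimization at beta
        history.append((beta, q))
    return history
```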
9 Specifically, we implement numerical continuation techniques [9, 10] to effect steps 1 and 2. [sent-26, score-0.181]
10 We show how to detect bifurcation and we rely on bifurcation theory with symmetries [11, 12, 13] to search for the desired solution branch. [sent-27, score-0.791]
11 This paper concludes with the improved Algorithm 6, which solves (1). [sent-28, score-0.02]
12 2 The cost functions The four problems we analyze are from Rate Distortion Theory [1, 2], Deterministic Annealing [3], Information Distortion [4, 5, 6] and the Information Bottleneck Method [7, 8]. [sent-29, score-0.017]
13 We discuss the explicit form of the cost function (i.e., of G and D) for each of these methods. [sent-30, score-0.016]
14 2.1 The distortion function D(q). Rate distortion theory is the information theoretic approach to the study of optimal source coding systems, including systems for quantization and data compression [2]. [sent-34, score-0.722]
15 q(y_N|y) is a stochastic map or quantization of Y into a representation Y_N [1, 2]. [sent-36, score-0.026]
16 A representation Y_N is optimal if there is a quantizer q*(y_N|y) such that D(q*) = min_{q∈∆} D(q). [sent-38, score-0.064]
17 In engineering and imaging applications, the distortion function is usually chosen as the mean squared error [1, 3, 14], D̂(Y, Y_N) = E_{y,y_N} d̂(y, y_N), where the pointwise distortion function d̂(y, y_N) is the Euclidean squared distance. [sent-39, score-0.728]
18 In [4, 5, 6], the information distortion measure D_I(Y, Y_N) := Σ_{y,y_N} p(y, y_N) KL(p(x|y_N) || p(x|y)) = I(X; Y) − I(X; Y_N) is used, where the Kullback-Leibler divergence KL is the pointwise distortion function. [sent-41, score-0.728]
19 Unlike the pointwise distortion functions usually investigated in information theory [1, 3], this one is nonlinear, it explicitly considers a third space, X, of inputs, and it depends on the quantizer q(y_N|y) through p(x|y_N) = Σ_y p(x|y) q(y_N|y) p(y) / p(y_N). [sent-42, score-0.484]
20 The only term in D_I which depends on the quantizer is I(X; Y_N), so we can replace D_I with the effective distortion D_eff(q) := I(X; Y_N). [sent-43, score-0.393]
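As an illustration, the sketch below evaluates D_eff(q) = I(X; Y_N). It assumes the quantizer is stored as an array q[y_N, y] = q(y_N|y) with columns summing to one and the joint input distribution as p_xy[x, y]; this layout and the numerical guard are choices of the example, not of the paper.

```python
import numpy as np

def effective_distortion(q, p_xy):
    """D_eff(q) = I(X; Y_N) for a quantizer q of shape (N, K) and joint p_xy of shape (|X|, K)."""
    p_x_yN = p_xy @ q.T                       # p(x, yN) = sum_y p(x, y) q(yN|y)
    p_x = p_xy.sum(axis=1)                    # p(x)
    p_yN = p_x_yN.sum(axis=0)                 # p(yN)
    ratio = p_x_yN / (np.outer(p_x, p_yN) + 1e-300)
    mask = p_x_yN > 0                         # only nonzero terms contribute
    return float(np.sum(p_x_yN[mask] * np.log(ratio[mask])))
```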
21 2.2 Rate Distortion. There are two related methods used to analyze communication systems at a distortion D(q) ≤ D_0 for some given D_0 ≥ 0 [1, 2, 3]. [sent-46, score-0.367]
22 In rate distortion theory [1, 2], the problem of finding a minimum rate at a given distortion is posed as a minimal information rate distortion problem: R(D_0) = min_{q(y_N|y)∈∆} I(Y; Y_N) subject to D(Y; Y_N) ≤ D_0. (4) [sent-47, score-1.122]
23 A similar exposition using the Deterministic Annealing approach [3] is a maximal entropy problem: max_{q(y_N|y)∈∆} H(Y_N|Y) subject to D(Y; Y_N) ≤ D_0. (5) [sent-49, score-0.031]
24 The justification for using (5) is Jaynes' maximum entropy principle [15]. [sent-50, score-0.031]
25 These formulations are related since I(Y ; YN ) = H(YN ) − H(YN |Y ). [sent-51, score-0.019]
26 In [4, 6], the neural coding problem is formulated as an entropy problem as in (5): max_{q(y_N|y)∈∆} H(Y_N|Y) subject to D_eff(q) ≥ I_0, (6) [sent-53, score-0.048]
27 which uses the nonlinear effective information distortion measure D_eff. [sent-54, score-0.329]
28 [7, 8] use the information distortion measure to pose an information rate distortion problem as in (4): min_{q(y_N|y)∈∆} I(Y; Y_N) subject to D_eff(q) ≥ I_0. (7) [sent-57, score-0.712]
29 Using the method of Lagrange multipliers, the rate distortion problems (4), (5), (6), (7) can be reformulated as finding the maxima of max_{q∈∆} F(q, β) = max_{q∈∆} [G(q) + βD(q)] (8) as in (1), where β = B. [sent-58, score-0.403]
30 For the maximal entropy problem (6), F(q, β) = H(Y_N|Y) + β D_eff(q), (9) and so G(q) from (1) is the conditional entropy H(Y_N|Y). [sent-59, score-0.062]
31 For the minimal information rate distortion problem (7), F(q, β) = −I(Y; Y_N) + β D_eff(q), (10) and so G(q) = −I(Y; Y_N). [sent-60, score-0.367]
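To make the relation F = G + βD_eff explicit, the following sketch writes down the two cost functions (9) and (10). It reuses effective_distortion from the previous snippet and the same array conventions; it is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def cond_entropy_YN_given_Y(q, p_xy):
    """H(Y_N|Y) = -sum_y p(y) sum_yN q(yN|y) log q(yN|y)."""
    p_y = p_xy.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        qlogq = np.where(q > 0, q * np.log(q), 0.0)
    return float(-np.sum(p_y * qlogq.sum(axis=0)))

def mutual_info_Y_YN(q, p_xy):
    """I(Y; Y_N) = H(Y_N) - H(Y_N|Y), with p(yN) = sum_y q(yN|y) p(y)."""
    p_y = p_xy.sum(axis=0)
    p_yN = q @ p_y
    with np.errstate(divide="ignore", invalid="ignore"):
        H_yN = float(-np.sum(np.where(p_yN > 0, p_yN * np.log(p_yN), 0.0)))
    return H_yN - cond_entropy_YN_given_Y(q, p_xy)

def F_entropy(q, beta, p_xy):
    """F(q, beta) = H(Y_N|Y) + beta * D_eff(q), as in (9)."""
    return cond_entropy_YN_given_Y(q, p_xy) + beta * effective_distortion(q, p_xy)

def F_rate_distortion(q, beta, p_xy):
    """F(q, beta) = -I(Y; Y_N) + beta * D_eff(q), as in (10)."""
    return -mutual_info_Y_YN(q, p_xy) + beta * effective_distortion(q, p_xy)
```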
32 In Rate Distortion Theory and the Information Bottleneck Method, one could be interested in solutions to (8) for finite B, which take into account a tradeoff between I(Y; Y_N) and D_eff. [sent-63, score-0.055]
33 Our analysis extends easily to similar formulations which use a norm-based distortion such as D̂(q), as in [3]. [sent-65, score-0.348]
34 3 Improving the algorithm. We now turn our attention back to Algorithm 1 and indicate how numerical continuation [9, 10] and bifurcation theory with symmetries [11, 12, 13] can improve upon the choice of the algorithm's parameters. [sent-66, score-0.572]
35 We begin by rewriting (8), now incorporating the Lagrange multipliers for the equality constraint Σ_{y_N} q(y_N|y_k) = 1 from (3), which must be satisfied for each y_k ∈ Y: L(q, λ, β) = F(q, β) + Σ_{y_k∈Y} λ_{y_k} (Σ_{y_N} q(y_N|y_k) − 1). (11) [sent-67, score-0.072]
36 There are optimization schemes, such as the Fixed Point [4, 6] and projected Augmented Lagrangian [6, 16] methods, which exploit the structure of (11) to find local solutions to (8) for step 3 of Algorithm 1. [sent-69, score-0.074]
37 3.1 Bifurcation structure of solutions. It has been observed that the solutions {q_k} undergo bifurcations or phase transitions [3, 4, 6, 7, 8]. [sent-71, score-0.182]
38 We wish to pose (8) as a dynamical system in order to study the bifurcation structure of local solutions for β ∈ [0, B]. [sent-72, score-0.44]
39 To this end, consider the equilibria of the flow (q̇, λ̇)^T = ∇_{q,λ} L(q, λ, β), (12) where an equilibrium (q*, λ*) satisfies ∇_{q,λ} L(q*, λ*, β) = 0 for some β. [sent-73, score-0.041]
40 Equilibria, (q ∗ , λ∗ ), of (12), for which ∆q F (q ∗ , β) is negative definite, are local solutions of (8) [16, 17]. [sent-75, score-0.055]
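Numerically, this condition can be checked from the eigenvalues of the Hessian; in the sketch below, hess_F is assumed to be supplied by the caller, for example analytically or by finite differences.

```python
import numpy as np

def is_local_solution(hess_F):
    """True when the n x n Hessian Delta_q F(q*, beta) is negative definite,
    i.e. when the equilibrium (q*, lambda*) of (12) is a local solution of (8)."""
    eigvals = np.linalg.eigvalsh(0.5 * (hess_F + hess_F.T))  # symmetrize for safety
    return bool(np.all(eigvals < 0))
```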
41 The Hessian ∆_{q,λ}L has a block structure (13): ∆_q F is block diagonal, with one K × K block for each class of Y_N, bordered by the Jacobian of the equality constraints. The kernel of ∆_{q,λ}L plays a pivotal role in determining the bifurcation structure of solutions to (8). [sent-85, score-0.448]
42 This is due to the fact that bifurcation of an equilibrium (q*, λ*) of (12) at β = β* happens when ker ∆_{q,λ}L(q*, λ*, β*) is nontrivial. [sent-86, score-0.452]
43 Furthermore, the bifurcating branches are tangent to certain linear subspaces of ker ∆q,λ L(q ∗ , λ∗ , β ∗ ) [12]. [sent-87, score-0.306]
44 3.2 Bifurcations with symmetry. Any solution q*(y_N|y) to (8) gives another equivalent solution simply by permuting the labels of the classes of Y_N. [sent-89, score-0.125]
45 Let S_N be the algebraic group of all permutations on N symbols [18, 19]. [sent-91, score-0.015]
46 We say that F(q, β) is S_N-invariant if F(q, β) = F(σ(q), β), where σ(q) denotes the action on q by permutation of the classes of Y_N as defined by any σ ∈ S_N [17]. [sent-92, score-0.017]
47 Now suppose that a solution q ∗ is fixed by all the elements of SM for M ≤ N . [sent-93, score-0.038]
48 Bifurcations at β = β ∗ in this scenario are called symmetry breaking if the bifurcating solutions are fixed (and only fixed) by subgroups of SM . [sent-94, score-0.278]
49 To determine where a bifurcation of a solution (q ∗ , λ∗ , β) occurs, one determines β for which ∆q F (q ∗ , β) has a nontrivial kernel. [sent-95, score-0.385]
50 At a bifurcation (q ∗ , λ∗ , β ∗ ) where q ∗ is fixed by SM for M ≤ N , ∆q F (q ∗ , β ∗ ) has M identical blocks. [sent-97, score-0.37]
51 The bifurcation is generic if each of the identical blocks has a single 0-eigenvector v, defined in (14), and the other blocks are nonsingular. [sent-98, score-0.481]
52 Thus, a generic bifurcation can be detected by looking for singularity of one of the K × K identical blocks of ∆_q F(q*, β). [sent-99, score-0.526]
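A sketch of this block-based test follows. It assumes q is vectorized class-by-class, so that the ν-th K × K diagonal block of ∆_q F corresponds to class ν of Y_N; the tolerance is an arbitrary choice of the example.

```python
import numpy as np

def block_is_singular(hess_F, N, nu=0, tol=1e-8):
    """Test the K x K diagonal block of the n x n Hessian Delta_q F that belongs
    to class nu of Y_N (n = N*K) for singularity via its smallest singular value."""
    n = hess_F.shape[0]
    K = n // N
    block = hess_F[nu * K:(nu + 1) * K, nu * K:(nu + 1) * K]
    smallest = np.linalg.svd(block, compute_uv=False).min()
    return smallest < tol, block
```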
53 We call the classes of YN which correspond to identical blocks unresolved classes. [sent-100, score-0.137]
54 The classes of YN that are not unresolved are called resolved classes. [sent-101, score-0.068]
55 The Equivariant Branching Lemma and the Smoller-Wasserman Theorem [12, 13] ascertain the existence of explicit bifurcating solutions in subspaces of ker ∆q,λ L(q ∗ , λ∗ , β ∗ ) which are fixed by special subgroups of SM [12, 13]. [sent-102, score-0.375]
56 Of particular interest are the bifurcating solutions in subspaces of ker ∆_{q,λ}L(q*, λ*, β*) of dimension 1, guaranteed by the following theorem. Theorem 2 [17]. Let (q*, λ*, β*) be a generic bifurcation of (12) which is fixed (and only fixed) by S_M, for 1 < M ≤ N. [sent-103, score-0.714]
57 Then, for small t, with β(t = 0) = β*, there exist M bifurcating solutions (q* + t u_m, λ*, β(t)), where 1 ≤ m ≤ M, (15) whose directions are given componentwise by [u_m]_ν = (M − 1)v if ν is the mth unresolved class of Y_N, [u_m]_ν = −v if ν is some other unresolved class of Y_N, and [u_m]_ν = 0 otherwise, (16) where v is defined as in (14). [sent-104, score-0.297]
58 Furthermore, each of these solutions is fixed by the symmetry group S_{M−1}. [sent-105, score-0.102]
59 For a bifurcation from the uniform quantizer q_{1/N}, which is identically 1/N, all of the classes of Y_N are unresolved. [sent-106, score-0.364]
60 In this case the bifurcating directions have the form u_m = (−v^T, . . . , −v^T, (N − 1)v^T, −v^T, . . . , −v^T, 0^T)^T, where (N − 1)v is in the mth component of u_m. [sent-113, score-0.028]
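Once v is known, the directions of (16) are easy to assemble. The sketch below uses the same class-by-class vectorization as the block extraction above, returns only the q-components of u_m (the λ-components are zero), and by default treats all N classes as unresolved, which is the uniform-quantizer case.

```python
import numpy as np

def bifurcating_direction(v, m, N, unresolved=None):
    """Build u_m of (16): (M-1)*v in the m-th unresolved class of Y_N,
    -v in every other unresolved class, and 0 in the resolved classes."""
    K = len(v)
    if unresolved is None:
        unresolved = list(range(N))        # uniform quantizer: all classes unresolved
    M = len(unresolved)
    u = np.zeros(N * K)
    for nu in unresolved:
        u[nu * K:(nu + 1) * K] = (M - 1) * v if nu == m else -v
    return u
```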
61 Relevant to the computationalist is that instead of looking for a bifurcation by checking singularity of the n × n Hessian ∆_q F(q*, β), one may look for singularity of one of the N identical K × K blocks, where K = n/N. [sent-114, score-0.49]
62 After bifurcation of a local solution to (8) has been detected at β = β*, knowledge of the bifurcating directions makes finding solutions of interest for β > β* much easier (see the discussion of branch switching below). [sent-115, score-0.638]
63 3.3 The subcritical bifurcation. In all problems under consideration, the solution for β = 0 is known. [sent-119, score-0.481]
64 Rose [3] was able to compute explicitly the critical value β ∗ where q0 loses stability for the Euclidean pointwise distortion function. [sent-122, score-0.439]
65 The solution q_0 = 1/N loses stability at β = β*, where 1/β* is the second-largest eigenvalue of a discrete Markov chain on vertices y ∈ Y with transition probabilities p(y_l → y_k) := Σ_i p(y_k|x_i) p(x_i|y_l). [sent-125, score-0.146]
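This criterion translates into a few lines of linear algebra. The sketch builds the transition matrix from a joint array p_xy[x, y] with strictly positive marginals; it is an illustration of the stated criterion, not the authors' code.

```python
import numpy as np

def critical_beta(p_xy):
    """beta* at which the uniform quantizer q = 1/N loses stability:
    1/beta* is the second-largest eigenvalue of the Markov chain on Y with
    transitions p(y_l -> y_k) = sum_i p(y_k|x_i) p(x_i|y_l)."""
    p_x = p_xy.sum(axis=1)                    # p(x), assumed strictly positive
    p_y = p_xy.sum(axis=0)                    # p(y), assumed strictly positive
    p_y_given_x = p_xy / p_x[:, None]         # p(y|x)
    p_x_given_y = p_xy / p_y[None, :]         # p(x|y)
    P = p_x_given_y.T @ p_y_given_x           # P[l, k] = sum_i p(x_i|y_l) p(y_k|x_i)
    lams = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    return 1.0 / lams[1]                      # second-largest eigenvalue of P
```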
66 Corollary 4. Bifurcation of the solution (q_{1/N}, β) in (9), (10) occurs at β ≥ 1. [sent-126, score-0.038]
67 The discriminant of the bifurcating branch (15) is defined as [17] ζ(q*, β*, u_m) = 3⟨u_m, ∂³_{q,λ}L(q*, λ*, β*)[u_m, E L⁻ E ∂³_{q,λ}L(q*, λ*, β*)[u_m, u_m]]⟩ − 3⟨u_m, ∂⁴_{q,λ}L(q*, λ*, β*)[u_m, u_m, u_m]⟩, where ⟨·, ·⟩ is the Euclidean inner product and ∂^k_{q,λ}L[·, . . . , ·] denotes the multilinear form of kth-order derivatives of L. [sent-127, score-0.218]
68 Theorem 5 [17]. If ζ(q*, β*, u_m) < 0, then the bifurcating branch (15) is subcritical (i.e., a first order phase transition). [sent-131, score-0.218]
69 For a data set with a joint probability distribution modelled by a mixture of four Gaussians as in [4], Theorem 5 predicts a subcritical bifurcation from (q_{1/N}, β* ≈ 1. [sent-135, score-0.443]
70 The existence of a subcritical bifurcation (a first order phase transition) is intriguing. [sent-137, score-0.489]
71 Figure: Subcritical bifurcating branch for F = H(Y_N|Y) + β I(X;Y_N) from the uniform solution q_{1/N} for N = 4. [sent-138, score-0.038]
72 Using this probability space, the equilibria of (12) for F as defined in (9) were found using Newton's method. [sent-151, score-0.041]
73 Depicted is the subcritical bifurcation from (q_{1/N}, β* ≈ 1. [sent-152, score-0.443]
74 In analogy to the rate distortion curve [2, 1], we can define an H-I curve for the problem (6): H(I_0) := max_{q∈∆, D_eff(q) ≥ I_0} H(Y_N|Y). [sent-154, score-0.403]
75 At such a point there is a Lagrange multiplier β such that ∇_{q,λ}L = 0 (compare with (11) and (12)) and this β solves problem (9). [sent-157, score-0.02]
76 Therefore, for each I ∈ (0, I_max), there is a corresponding β which solves problem (9). [sent-158, score-0.02]
77 The existence of a subcritical bifurcation in β implies that this correspondence is not monotone for small values of I. [sent-159, score-0.465]
78 4 Numerical Continuation. Numerical continuation methods efficiently analyze the solution behavior of dynamical systems such as (12) [9, 10]. [sent-161, score-0.188]
79 Continuation methods can speed up the search for the solution q_{k+1} at β_{k+1} in step 3 of Algorithm 1 by improving upon the perturbed choice q_{k+1}^{(0)} = q_k + η. [sent-162, score-0.463]
80 First, the vector (∂_β q_k^T  ∂_β λ_k^T)^T, which is tangent to the curve ∇_{q,λ}L(q, λ, β) = 0 at (q_k, λ_k, β_k), is computed by solving the matrix system ∆_{q,λ}L(q_k, λ_k, β_k) (∂_β q_k^T  ∂_β λ_k^T)^T = −∂_β ∇_{q,λ}L(q_k, λ_k, β_k). (17) [sent-163, score-0.453]
81 Now the initial guess in step 2 becomes q_{k+1}^{(0)} = q_k + d_k ∂_β q_k, where d_k = ∆s / sqrt(||∂_β q_k||² + ||∂_β λ_k||² + 1) for some ∆s > 0. [sent-164, score-1.127]
82 Furthermore, β_{k+1} in step 1 is found by using this same d_k. [sent-165, score-0.542]
83 This choice of d_k assures that a step of fixed length along (∂_β q_k^T  ∂_β λ_k^T)^T is taken for each k. [sent-166, score-0.542]
84 We use three different continuation methods which implement variations of this scheme: Parameter, Tangent and Pseudo Arc-Length [9, 17]. [sent-167, score-0.126]
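A minimal sketch of the tangent predictor (17) and of the step length d_k is given below; the Hessian of L and the β-derivative of its gradient are assumed to be provided by the caller, and the Parameter and Pseudo Arc-Length variants are not shown.

```python
import numpy as np

def tangent_step(hess_L, dbeta_grad_L, delta_s):
    """Solve (17) for the tangent t = (d_beta q, d_beta lambda) and compute
    d_k = delta_s / sqrt(||d_beta q||^2 + ||d_beta lambda||^2 + 1)."""
    t = np.linalg.solve(hess_L, -dbeta_grad_L)
    d_k = delta_s / np.sqrt(np.dot(t, t) + 1.0)
    return t, d_k

# usage sketch: beta_next = beta_k + d_k and q_init = q_k + d_k * t[:n_q],
# where n_q is the number of quantizer variables.
```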
85 These methods can greatly decrease the optimization iterations needed to find q_{k+1} from q_{k+1}^{(0)} in step 3. [sent-168, score-0.019]
86 The cost savings can be significant, especially when continuation is used in conjunction with a Newton type optimization scheme which explicitly uses the Hessian ∆q F (qk , βk ). [sent-169, score-0.13]
87 1 Branch switching. Suppose that a bifurcation of a solution q* of (8) has been detected at β*. [sent-173, score-0.416]
88 To proceed, one uses the explicit form of the bifurcating directions {u_m}_{m=1}^{M} from (16) to search for the bifurcating solution of interest, say q_{k+1}, whose existence is guaranteed by Theorem 2. [sent-174, score-0.41]
89 To do this, let u = u_m for some m ≤ M, then implement a branch switch [9]: q_{k+1}^{(0)} = q* + d_k · u. [sent-175, score-0.204]
90 4 A numerical algorithm. We conclude with a numerical algorithm to solve (1). [sent-176, score-0.133]
91 Algorithm 6. Let q_0 be the maximizer of max_{q∈∆} G(q) and let β_0 = 1. [sent-178, score-0.064]
92 Perform β-step: solve (17) for (∂_β q_k^T  ∂_β λ_k^T)^T and select β_{k+1} = β_k + d_k, where d_k = ∆s / sqrt(||∂_β q_k||² + ||∂_β λ_k||² + 1). [sent-184, score-0.682]
93 The initial guess for q_{k+1} at β_{k+1} is q_{k+1}^{(0)} = q_k + d_k · ∂_β q_k. [sent-187, score-1.01]
94 Optimization: solve max_{q∈∆} G(q) + β_{k+1} D(q) to get the maximizer q_{k+1}, using initial guess q_{k+1}^{(0)}. [sent-189, score-0.166]
95 Check for bifurcation: compare the sign of the determinant of an identical block of each of ∆_q[G(q_k) + β_k D(q_k)] and ∆_q[G(q_{k+1}) + β_{k+1} D(q_{k+1})]. [sent-192, score-0.023]
96 If a bifurcation is detected, then set q_{k+1}^{(0)} = q_k + d_k · u, where u is defined as in (16) for some m ≤ M, and repeat step 3. [sent-193, score-0.889]
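Putting the pieces together, the following sketch mirrors Algorithm 6 using the helpers from the earlier snippets (tangent_step, bifurcating_direction). The callables maximize_F and hessians, the step size ∆s, and the stopping value B are assumptions of this illustration, and the Lagrange multiplier update is omitted for brevity.

```python
import numpy as np

def algorithm_6(q0, lam0, maximize_F, hessians, delta_s=0.05, B=2.0):
    """maximize_F(q_init, beta): local maximizer of G(q) + beta*D(q) over Delta.
    hessians(q, lam, beta): returns (hess_L, dbeta_grad_L, hess_F)."""
    N, K = q0.shape
    q, lam, beta = q0, lam0, 1.0
    while beta < B:
        hess_L, dbeta_grad_L, hess_F = hessians(q, lam, beta)
        t, d_k = tangent_step(hess_L, dbeta_grad_L, delta_s)       # beta-step via (17)
        beta_new = beta + d_k
        q_init = (q.ravel() + d_k * t[:N * K]).reshape(N, K)       # continuation initial guess
        q_new = maximize_F(q_init, beta_new)                       # optimization step
        # bifurcation check: sign change of det of one K x K identical block of Delta_q F
        _, _, hess_F_new = hessians(q_new, lam, beta_new)
        if np.linalg.det(hess_F[:K, :K]) * np.linalg.det(hess_F_new[:K, :K]) < 0:
            w, V = np.linalg.eigh(hess_F[:K, :K])
            v = V[:, np.argmin(np.abs(w))]                         # approximate 0-eigenvector
            u = bifurcating_direction(v, m=0, N=N)                 # pick one branch direction
            q_new = maximize_F((q.ravel() + d_k * u).reshape(N, K), beta_new)  # branch switch
        q, beta = q_new, beta_new                                  # lam is held fixed in this sketch
    return q, beta
```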
97 Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. [sent-206, score-0.074]
98 Numerical analysis and control of bifurcation problems in finite dimensions. [sent-248, score-0.347]
wordName wordTfidf (topN-words)
[('yn', 0.598), ('qk', 0.425), ('bifurcation', 0.347), ('distortion', 0.329), ('bifurcating', 0.167), ('def', 0.123), ('dk', 0.117), ('maxq', 0.112), ('continuation', 0.111), ('subcritical', 0.096), ('sm', 0.084), ('montana', 0.08), ('pointwise', 0.07), ('gedeon', 0.064), ('ker', 0.064), ('minq', 0.064), ('quantizer', 0.064), ('maximizer', 0.064), ('parker', 0.056), ('numerical', 0.055), ('annealing', 0.055), ('solutions', 0.055), ('yk', 0.053), ('unresolved', 0.051), ('branch', 0.051), ('albert', 0.048), ('bifurcations', 0.048), ('imax', 0.048), ('hessian', 0.047), ('blocks', 0.046), ('guess', 0.043), ('dimitrov', 0.042), ('equilibria', 0.041), ('bottleneck', 0.04), ('sn', 0.039), ('rate', 0.038), ('symmetries', 0.038), ('singularity', 0.038), ('solution', 0.038), ('alexander', 0.038), ('max', 0.036), ('theorem', 0.035), ('doedel', 0.032), ('eusebius', 0.032), ('golubitsky', 0.032), ('tomas', 0.032), ('symmetry', 0.032), ('entropy', 0.031), ('detected', 0.031), ('tangent', 0.028), ('ik', 0.028), ('mth', 0.028), ('subspaces', 0.027), ('quantization', 0.026), ('loses', 0.025), ('subgroups', 0.024), ('jacobian', 0.024), ('phase', 0.024), ('deterministic', 0.023), ('lagrange', 0.023), ('justi', 0.023), ('identical', 0.023), ('solve', 0.023), ('naftali', 0.022), ('singularities', 0.022), ('existence', 0.022), ('dynamical', 0.022), ('looking', 0.022), ('theory', 0.021), ('yl', 0.021), ('let', 0.021), ('communication', 0.021), ('optima', 0.02), ('solves', 0.02), ('xed', 0.02), ('di', 0.02), ('euclidean', 0.02), ('iterate', 0.02), ('branches', 0.02), ('lagrangian', 0.02), ('generic', 0.019), ('optimization', 0.019), ('york', 0.019), ('formulations', 0.019), ('multipliers', 0.019), ('verlag', 0.018), ('newton', 0.018), ('tishby', 0.018), ('coding', 0.017), ('analyze', 0.017), ('classes', 0.017), ('explicit', 0.016), ('pose', 0.016), ('john', 0.016), ('stability', 0.015), ('group', 0.015), ('implement', 0.015), ('transition', 0.015), ('springer', 0.015), ('robert', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
2 0.25839064 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
Author: Sumio Watanabe, Shun-ichi Amari
Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exits a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1
4 0.11397744 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
5 0.051020771 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
Author: Ron Meir, Tong Zhang
Abstract: We consider Bayesian mixture approaches, where a predictor is constructed by forming a weighted average of hypotheses from some space of functions. While such procedures are known to lead to optimal predictors in several cases, where sufficiently accurate prior information is available, it has not been clear how they perform when some of the prior assumptions are violated. In this paper we establish data-dependent bounds for such procedures, extending previous randomized approaches such as the Gibbs algorithm to a fully Bayesian setting. The finite-sample guarantees established in this work enable the utilization of Bayesian mixture approaches in agnostic settings, where the usual assumptions of the Bayesian paradigm fail to hold. Moreover, the bounds derived can be directly applied to non-Bayesian mixture approaches such as Bagging and Boosting.
6 0.046108898 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
7 0.041582428 47 nips-2002-Branching Law for Axons
8 0.03712105 49 nips-2002-Charting a Manifold
9 0.036113303 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
10 0.032548625 194 nips-2002-The Decision List Machine
11 0.02921761 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
12 0.027435394 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
13 0.026102973 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
14 0.0258127 188 nips-2002-Stability-Based Model Selection
15 0.025716379 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
16 0.025201298 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
17 0.024982486 36 nips-2002-Automatic Alignment of Local Representations
18 0.024864823 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
19 0.024700584 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
20 0.024099939 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
topicId topicWeight
[(0, -0.088), (1, -0.018), (2, -0.02), (3, -0.006), (4, -0.015), (5, 0.036), (6, -0.068), (7, -0.049), (8, -0.043), (9, 0.018), (10, 0.044), (11, -0.067), (12, 0.007), (13, -0.055), (14, -0.002), (15, 0.009), (16, -0.024), (17, -0.144), (18, -0.205), (19, -0.064), (20, 0.119), (21, -0.066), (22, -0.134), (23, -0.368), (24, 0.208), (25, 0.149), (26, 0.084), (27, -0.084), (28, -0.072), (29, 0.055), (30, 0.262), (31, 0.044), (32, -0.116), (33, 0.019), (34, -0.007), (35, -0.144), (36, -0.0), (37, 0.011), (38, -0.014), (39, 0.102), (40, -0.031), (41, 0.003), (42, 0.028), (43, -0.006), (44, -0.055), (45, 0.152), (46, 0.031), (47, -0.005), (48, 0.005), (49, -0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.98366785 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
2 0.88476783 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
3 0.49841142 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
Author: Sumio Watanabe, Shun-ichi Amari
Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exits a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1
5 0.31802312 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution
Author: Matthias Bethge, David Rotermund, Klaus Pawelzik
Abstract: Here we derive optimal gain functions for minimum mean square reconstruction from neural rate responses subjected to Poisson noise. The shape of these functions strongly depends on the length T of the time window within which spikes are counted in order to estimate the underlying firing rate. A phase transition towards pure binary encoding occurs if the maximum mean spike count becomes smaller than approximately three provided the minimum firing rate is zero. For a particular function class, we were able to prove the existence of a second-order phase transition analytically. The critical decoding time window length obtained from the analytical derivation is in precise agreement with the numerical results. We conclude that under most circumstances relevant to information processing in the brain, rate coding can be better ascribed to a binary (low-entropy) code than to the other extreme of rich analog coding. 1 Optimal neuronal gain functions for short decoding time windows The use of action potentials (spikes) as a means of communication is the striking feature of neurons in the central nervous system. Since the discovery by Adrian [1] that action potentials are generated by sensory neurons with a frequency that is substantially determined by the stimulus, the idea of rate coding has become a prevalent paradigm in neuroscience [2]. In particular, today the coding properties of many neurons from various areas in the cortex have been characterized by tuning curves, which describe the average firing rate response as a function of certain stimulus parameters. This way of description is closely related to the idea of analog coding, which constitutes the basis for many neural network models. Reliabl v inference from the observed number of spikes about the underlying firing rate of a neuronal response, however, requires a sufficiently long time interval, while integration times of neurons in vivo [3] as well as reaction times of humans or animals when performing classification tasks [4, 5] are known to be rather short. Therefore, it is important to understand, how neural rate coding is affected by a limited time window available for decoding. While rate codes are usually characterized by tuning functions relating the intensity of the ,f * http://www.neuro.urn-bremen.dermbethge neuronal response to a particular stimulus parameter, the question, how relevant the idea of analog coding actually is does not depend on the particular entity represented by a neuron. Instead it suffices to determine the shape of the gain function, which displays the mean firing rate as a function of the actual analog signal to be sent to subsequent neurons. Here we seek for optimal gain functions that minimize the minimum average squared reconstruction error for a uniform source signal transmitted through a Poisson channel as a function of the maximum mean number of spikes. In formal terms, the issue is to optimally encode a real random variable x in the number of pulses emitted by a neuron within a certain time window. Thereby, x stands for the intended analog output of the neuron that shall be signaled to subsequent neurons. The latter, however, can only observe a number of spikes k integrated within a time interval of length T. The statistical dependency between x and k is specified by the assumption of Poisson noise p(kIJL(x)) = (JL~))k exp{ -JL(X)} , (1) and the choice of the gain function f(x), which together with T determines the mean spike count J.L(x) == T f(x) . 
An important additional constraint is the limited output range of the neuronal firing rate, which can be included by the requirement of a bounded gain function (fmin :::; f (x) :::; f max, VX). Since inhibition can reliably prevent a neuron from firing, we will here consider the case f min == 0 only. Instead of specifying f max, we impose a bound directly on the mean spike count (i.e. J.L(x) :::; /l), because f max constitutes a meaningful constraint only in conjunction with a fixed time window length T. As objective function we consider the minimum mean squared error (MMSE) with respect to Lebesgue measure for x E [0, 1], ~ 2 X _ E x2 _ E (i2 _ _ [jt( )] - [] [] - 3 X ~ (Xl (J01 xp(kIJL(x)) dx r J01p(kIJL(x)) dx' (2) where x(k) == E[xlk] denotes the mean square estimator, which is the conditional expectation (see e.g. [6]). 1.1 Tunings and errors As derived in [7] on the basis of Fisher information the optimal gain function for a single neuron in the asymptotic limit T -+ 00 has a parabolic shape: fasymp(x) == fmaxx2 . (3) For any finite /l, however, this gain function is not necessarily optimal, and in the limit T -+ 0, it is straight forward to show that the optimal tuning curve is a step function f step (xl'19) == fmax 8 (x - {)) , (4) where 8(z) denotes the Heaviside function that equals one, if z > 0 and zero if z < O. The optimal threshold 'l9(p,) of the step tuning curve depends on /l and can be determined analytically 11(-) =1_ It 3 - V8e-J.' +1 4(1 - e- il ) (5) as well as the corresponding MMSE [8]: 2 2[fste p] _ 1 ( 3'19 (p,) ) X - 12 1 - [(1 -11(p))(l - e-iL)]-1 - 1 . (6) 1 S +1 0.5 CJ;) o ........ '------'-----'---'---'--'~----'----'-- ~---'---'---'--'~ 10-1 ~---,.---,---.,...............---.----.---.---.-.......-.-.--.-~ ...............~ Figure 1: The upper panel shows a bifurcation plot for {}(Jt) - wand {}(Jt) + w of the optimal gain function in 51 as a function of {t illustrating the phase transition from binary to continuous encoding. The dotted line separates the regions before and after the phase transition in all three panels. Left from this line (i.e. for Jt < Jt C) the step function given by Eq. 4+5 is optimal. The middle panel shows the MMSE of this step function (dashed) and of the optimal gain function in 52 (solid), which becomes smaller than the first one after the phase transition. The relative deviation between the minimal errors of 51 and 52 (i.e. (X~l - X~2)/X~2) is displayed in the lower panel and has a maximum below 0.035. The binary shape for small {t and the continuous parabolic shape for large {t implies that there has to be a transition from discrete to analog encoding with increasing {to Unfortunately it is not possible to determine the optimal gain function within the set of all bounded functions B :== {fli : [0, 1] -+ [0, fmax]} and hence, one has to choose a certain parameterized function space 5 c B in advance that is feasible for the optimization. In [8], we investigated various such function-'spaces and for {t < 2.9, we did not find any gain function with an error smaller than the MMSE of the step function. Furthermore, we always observed a phase transition from binary to analog encoding at a critical {t C that depends only slightly on the function space. As one can see in Fig. 1 (upper) pc is approximately three. In this paper, we consider two function classes 51, 52, which both contain the binary gain function as well as the asymptotic optimal parabolic function as special cases. Furthermore 51 is a proper subset of 52. 
Our interest in S1 results from the fact that we can analyze the phase transition in this subset analytically, while S2 is the most general parameterization for which we have determined the optimal encoding numerically. The latter has six free parameters a ≤ b ≤ c ∈ [0, 1], f_mid ∈ (0, f_max), α, β ∈ [0, ∞), and the gain functions f_S2(x | a, b, c, f_mid, α, β) are defined piecewise on [0, 1] in terms of these parameters, starting from zero on [0, a).
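The excerpt does not reproduce the explicit piecewise expression for f_S2, so the sketch below substitutes a simple two-parameter stand-in family of our own, f(x | ϑ, γ) = ((x − ϑ)/(1 − ϑ))^γ for x ≥ ϑ and 0 otherwise, which contains a near-step (γ → 0) and the parabola (ϑ = 0, γ = 2) as special cases. It only illustrates the kind of computation described, optimizing over a parameterized family for each μ̄ and inspecting whether the optimum is binary-like or analog-like; it is not the paper's S1 or S2.

```python
# Illustration only: the exact S1/S2 parameterizations are not reproduced in
# this excerpt, so we optimize over a two-parameter stand-in family of our own,
#   f(x | theta, gamma) = ((x - theta) / (1 - theta))^gamma  for x >= theta, else 0,
# which contains a near-step (gamma -> 0) and the parabola (theta = 0, gamma = 2).
import numpy as np
from scipy.stats import poisson


def mmse(mu, x):
    """Eq. (2) on a grid; mu is the array of mean spike counts mu(x_i)."""
    k = np.arange(51)
    p = poisson.pmf(k[:, None], mu[None, :])
    num = (x[None, :] * p).mean(axis=1)
    den = p.mean(axis=1)
    good = den > 0
    return 1.0 / 3.0 - np.sum(num[good] ** 2 / den[good])


def best_in_family(mu_bar, x):
    """Grid search over (theta, gamma) for a given bound mu_bar on the mean spike count."""
    best = (np.inf, None, None)
    for theta in np.linspace(0.0, 0.9, 19):
        for gamma in np.concatenate(([0.01], np.linspace(0.25, 4.0, 16))):
            f = np.clip((x - theta) / (1.0 - theta), 0.0, 1.0) ** gamma
            err = mmse(mu_bar * f, x)
            if err < best[0]:
                best = (err, theta, gamma)
    return best


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 601)
    for mu_bar in (1.0, 2.0, 3.0, 5.0, 10.0):
        err, theta, gamma = best_in_family(mu_bar, x)
        print(f"mu_bar={mu_bar:5.1f}  best (theta, gamma)=({theta:.2f}, {gamma:.2f})  MMSE={err:.5f}")
```

For small μ̄ the grid search should favor γ near its lower limit (binary-like gains) and for larger μ̄ smoother gains; whether the crossover of this ad-hoc family falls exactly at μ̄_c ≈ 3 is not guaranteed.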
6 0.18486272 205 nips-2002-Value-Directed Compression of POMDPs
7 0.18364535 194 nips-2002-The Decision List Machine
8 0.16195092 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
9 0.15642488 142 nips-2002-Maximum Likelihood and the Information Bottleneck
10 0.14568985 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
11 0.14297804 49 nips-2002-Charting a Manifold
12 0.14268441 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques
13 0.13949475 42 nips-2002-Bias-Optimal Incremental Problem Solving
14 0.13818178 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
15 0.13366096 177 nips-2002-Retinal Processing Emulation in a Programmable 2-Layer Analog Array Processor CMOS Chip
16 0.12655601 128 nips-2002-Learning a Forward Model of a Reflex
17 0.12337204 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
18 0.12162083 47 nips-2002-Branching Law for Axons
19 0.11930075 114 nips-2002-Information Regularization with Partially Labeled Data
20 0.11909289 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
topicId topicWeight
[(11, 0.018), (14, 0.048), (23, 0.02), (42, 0.041), (54, 0.1), (55, 0.036), (57, 0.01), (67, 0.023), (68, 0.021), (71, 0.41), (74, 0.065), (92, 0.047), (98, 0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.81299591 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomá\v S. Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
2 0.56427705 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
3 0.48821753 106 nips-2002-Hyperkernels
Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola
Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.
4 0.43265525 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
5 0.35092625 96 nips-2002-Generalized² Linear² Models
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1
6 0.35004961 83 nips-2002-Extracting Relevant Structures with Side Information
7 0.3475706 150 nips-2002-Multiple Cause Vector Quantization
8 0.33891353 27 nips-2002-An Impossibility Theorem for Clustering
9 0.33643806 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
10 0.3363322 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
11 0.33321589 53 nips-2002-Clustering with the Fisher Score
12 0.33191383 190 nips-2002-Stochastic Neighbor Embedding
13 0.33166486 2 nips-2002-A Bilinear Model for Sparse Coding
14 0.33101425 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
15 0.33049768 10 nips-2002-A Model for Learning Variance Components of Natural Images
16 0.33014917 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
17 0.32960603 3 nips-2002-A Convergent Form of Approximate Policy Iteration
18 0.32952935 169 nips-2002-Real-Time Particle Filters
19 0.32892579 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.32876581 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy