jmlr jmlr2005 jmlr2005-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
Reference: text
sentIndex sentText sentNum sentScore
1 While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. [sent-14, score-0.296]
2 The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. [sent-15, score-0.481]
3 However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. [sent-16, score-0.188]
4 Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. [sent-18, score-0.223]
5 As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. [sent-19, score-0.142]
6 The positive parameter β determines the tradeoff between compression and preserved relevant information, as the Lagrange multiplier for the constrained optimization problem min_{p(t|x)} I(X;T) − β (I(T;Y) − const). [sent-34, score-0.233]
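For reference, the IB variational problem mentioned in the sentence above, together with its well-known self-consistent characterization of the optimizer, can be written as follows (standard IB notation, stated here for orientation; Z(x, β) is the usual normalizer and is not defined elsewhere on this page):

```latex
\min_{p(t|x)} \; \mathcal{L} \;=\; I(X;T) \;-\; \beta\, I(T;Y),
\qquad
p(t|x) \;=\; \frac{p(t)}{Z(x,\beta)}\,
  \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big[\,p(y|x)\,\big\|\,p(y|t)\,\big]\Big).
```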
7 IB achieves this using a single tradeoff parameter to represent the tradeoff between the complexity of the representation of X, measured by I(X; T ), and the accuracy of this representation, measured by I(T ;Y ). [sent-40, score-0.148]
8 As is well acknowledged now, source coding with distortion and channel coding with cost are dual problems (see for example Shannon, 1959; Pradhan et al. [sent-49, score-0.35]
9 However, its general information theoretic formulation is not restricted, either in terms of the nature of the variables X and Y or in terms of the compression variable T. [sent-55, score-0.127]
10 The optimal compression in the Gaussian information bottleneck (GIB) is defined in terms of the compression-relevance tradeoff (also known as the “Information Curve”, or “Accuracy-Complexity” tradeoff), determined by varying the parameter β. [sent-64, score-0.305]
11 The optimal solution turns out to be a noisy linear projection to a subspace whose dimensionality is determined by the parameter β. [sent-65, score-0.132]
12 The subspaces are spanned by the basis vectors obtained as in the well known canonical correlation analysis (CCA) (Hotelling, 1935), but the exact nature of the projection is determined in a unique way via the parameter β. [sent-66, score-0.214]
13 Specifically, as β increases, additional dimensions are added to the projection variable T , through a series of critical points (structural phase transitions), while at the same time the relative magnitude of each basis vector is rescaled. [sent-67, score-0.154]
14 The relations to canonical correlation analysis and coding with side-information are discussed in Section 9. [sent-80, score-0.226]
15 Let (X,Y ) be two jointly multivariate Gaussian variables of dimensions nx , ny and denote by Σx ,Σy the covariance matrices of X,Y and by Σxy their cross-covariance matrix. [sent-83, score-0.203]
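A minimal NumPy sketch (not from the paper; all names here are ad hoc) of how such a jointly Gaussian pair and its covariance blocks Σ_x, Σ_y, Σ_{xy} might be set up for experimenting with the quantities discussed below:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, n_samples = 4, 3, 20000
# Y is a noisy linear function of X, so (X, Y) is jointly Gaussian.
B = rng.standard_normal((ny, nx))
X = rng.standard_normal((n_samples, nx))
Y = X @ B.T + 0.5 * rng.standard_normal((n_samples, ny))

C = np.cov(np.hstack([X, Y]).T)        # (nx+ny) x (nx+ny) sample covariance
Sx, Sy = C[:nx, :nx], C[nx:, nx:]      # Sigma_x, Sigma_y
Sxy = C[:nx, nx:]                      # Sigma_xy (cross-covariance of X and Y)
```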
16 While the goal of GIB is to find the optimal projection parameters A, Σξ jointly, we show below that the problem factorizes such that the optimal projection A does not depend on the noise, which does not carry any information about Y . [sent-94, score-0.264]
17 The optimal projection T = AX + ξ for a given tradeoff parameter β is given by Σ_ξ = I_x and A = [0^T; ...; 0^T] for 0 ≤ β ≤ β_1^c; A = [α_1 v_1^T; 0^T; ...; 0^T] for β_1^c ≤ β ≤ β_2^c; A = [α_1 v_1^T; α_2 v_2^T; 0^T; ...; 0^T] for β_2^c ≤ β ≤ β_3^c; and so on. [sent-100, score-0.206]
18 Here {v_1^T, ..., v_{n_x}^T} are left eigenvectors of Σ_{x|y} Σ_x^{-1}, sorted by their corresponding ascending eigenvalues λ_1, λ_2, ... [sent-115, score-0.274]
19 ..., λ_{n_x}; β_i^c = 1/(1−λ_i) are the critical β values; α_i are coefficients defined by α_i ≡ sqrt[(β(1−λ_i) − 1)/(λ_i r_i)]; r_i ≡ v_i^T Σ_x v_i; 0^T is an n_x-dimensional row vector of zeros; and semicolons separate rows in the matrix A. [sent-118, score-0.35]
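As an illustration only (a sketch of the result stated in sentences 17-19 above, not the authors' code; gib_projection and all variable names are introduced here), the optimal A(β) can be assembled directly from an eigen-decomposition of Σ_{x|y} Σ_x^{-1}:

```python
import numpy as np

def gib_projection(Sx, Sy, Sxy, beta):
    """Optimal GIB projection A for T = A X + xi with Sigma_xi = I (sketch).

    Assumes the eigenvalues of Sigma_{x|y} Sigma_x^{-1} lie strictly in (0, 1).
    """
    Sx_given_y = Sx - Sxy @ np.linalg.inv(Sy) @ Sxy.T      # Schur complement
    M = Sx_given_y @ np.linalg.inv(Sx)                      # Sigma_{x|y} Sigma_x^{-1}
    lam, V = np.linalg.eig(M.T)                             # left eigenvectors of M
    lam, V = np.real(lam), np.real(V)
    order = np.argsort(lam)                                 # ascending eigenvalues
    lam, V = lam[order], V[:, order]
    rows = []
    for i, l in enumerate(lam):
        v = V[:, i] / np.linalg.norm(V[:, i])               # ||v_i|| = 1
        r = v @ Sx @ v                                      # r_i = v_i^T Sigma_x v_i
        if beta > 1.0 / (1.0 - l):                          # past the critical beta
            alpha = np.sqrt((beta * (1.0 - l) - 1.0) / (l * r))
            rows.append(alpha * v)
        else:
            rows.append(np.zeros_like(v))                   # eigenvector not yet used
    return np.vstack(rows)
```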
20 As β is increased, it goes through a series of critical points β_i^c, at each of which another eigenvector of Σ_{x|y} Σ_x^{-1} is added to A. [sent-120, score-0.159]
21 Figure 1 shows the target function L as a function of the (vector of length 2) projection A. [sent-125, score-0.132]
22 Therefore, for β = 15 (Figure 1A) the zero solution is optimal, but for β = 100 > βc (Figure 1B) the corresponding eigenvector is a feasible solution, and the target function manifold contains two mirror minima. [sent-128, score-0.137]
23 Deriving the Optimal Projection: We first rewrite the target function as L = I(X;T) − βI(T;Y) = h(T) − h(T|X) − βh(T) + βh(T|Y), where h is the (differential) entropy of a continuous variable, h(X) ≡ −∫ f(x) log f(x) dx. [sent-133, score-0.12]
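For T = AX + ξ with Σ_ξ = I, the differential entropies in the sentence above reduce to log-determinants, so that (up to additive constants and an overall positive factor, neither of which changes the minimizer; the scaling here is chosen to match the gradient expression quoted further below):

```latex
\mathcal{L}(A) \;\propto\; (1-\beta)\,\log\big|A\Sigma_x A^{T} + I\big|
\;+\; \beta\,\log\big|A\Sigma_{x|y} A^{T} + I\big| .
```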
24 Figure 1: The surface of the target function L, calculated numerically as a function of the optimization parameters in two illustrative examples with a scalar projection A : R^2 → R. [sent-144, score-0.157]
25 Now, the conditional covariance matrix Σx|y can be used to calculate the covariance of the conditional variable T |Y , using the Schur complement formula (see e. [sent-159, score-0.125]
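For completeness, the standard Schur-complement identities being referred to (stated here for reference, not quoted from the paper):

```latex
\Sigma_{x|y} \;=\; \Sigma_x - \Sigma_{xy}\,\Sigma_y^{-1}\,\Sigma_{yx},
\qquad
\Sigma_{t|y} \;=\; A\,\Sigma_{x|y}\,A^{T} + \Sigma_\xi .
```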
26 This allows us to simplify the calculations by replacing the noise covariance matrix Σ_ξ with the identity matrix I_d. [sent-166, score-0.127]
27 To identify the minimum of L we differentiate L with respect to the projection A, using the algebraic identity δ/δA log(|ACA^T|) = (ACA^T)^{-1} 2AC, which holds for any symmetric matrix C: δL/δA = (1 − β)(AΣ_x A^T + I_d)^{-1} 2AΣ_x + β(AΣ_{x|y} A^T + I_d)^{-1} 2AΣ_{x|y}. [sent-167, score-0.145]
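A small NumPy sketch (names are ad hoc) that checks the gradient quoted above against central finite differences, using the form of L implied by that gradient (no overall one-half factor); the stand-in covariance matrices are arbitrary positive definite matrices chosen only for the check:

```python
import numpy as np

def loss(A, Sx, Sc, beta):
    # L(A) = (1 - beta) log|A Sx A^T + I| + beta log|A Sc A^T + I|, Sc ~ Sigma_{x|y}
    nt = A.shape[0]
    return ((1 - beta) * np.linalg.slogdet(A @ Sx @ A.T + np.eye(nt))[1]
            + beta * np.linalg.slogdet(A @ Sc @ A.T + np.eye(nt))[1])

def grad(A, Sx, Sc, beta):
    nt = A.shape[0]
    return ((1 - beta) * np.linalg.inv(A @ Sx @ A.T + np.eye(nt)) @ (2 * A @ Sx)
            + beta * np.linalg.inv(A @ Sc @ A.T + np.eye(nt)) @ (2 * A @ Sc))

rng = np.random.default_rng(1)
Sx = np.eye(3) + 0.2 * np.ones((3, 3))   # arbitrary positive definite matrix
Sc = 0.5 * Sx                            # arbitrary stand-in for Sigma_{x|y}
A, beta, eps = rng.standard_normal((2, 3)), 10.0, 1e-6
num = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A); E[i, j] = eps
        num[i, j] = (loss(A + E, Sx, Sc, beta) - loss(A - E, Sx, Sc, beta)) / (2 * eps)
print(np.allclose(num, grad(A, Sx, Sc, beta), atol=1e-4))   # expected: True
```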
28 Scalar Projections: For clearer presentation of the general derivation, we begin with a sketch of the proof by focusing on the case where T is a scalar, that is, the optimal projection matrix A is now a single row vector. [sent-170, score-0.175]
29 This equation (Equation 8) is therefore an eigenvalue problem in which the eigenvalues depend on A. [sent-172, score-0.132]
30 Otherwise, A must be the eigenvector of Σ_{x|y} Σ_x^{-1}, with an eigenvalue λ = (β−1)/β · (AΣ_{x|y}A^T + 1)/(AΣ_x A^T + 1). To characterize the values of β for which the optimal solution does not degenerate, we find when the eigenvector solution is optimal. [sent-175, score-0.296]
31 For β ≥ β^c(λ) the weight of information preservation is large enough, and the optimal solution for A is an eigenvector of Σ_{x|y} Σ_x^{-1}. [sent-195, score-0.167]
32 For some β values, several eigenvectors can satisfy the condition for non-degenerate solutions of Equation (10). [sent-197, score-0.313]
33 We conclude that for scalar projections, A(β) = sqrt[(β(1−λ)−1)/(rλ)] v_1 for 0 < λ ≤ (β−1)/β, and A(β) = 0 for (β−1)/β ≤ λ ≤ 1 (Equation 11), where v_1 is the eigenvector of Σ_{x|y} Σ_x^{-1} with the smallest eigenvalue. [sent-200, score-0.159]
34 The High-Dimensional Case: We now return to the proof of the general, high-dimensional case, which follows the same lines as the scalar projection case. [sent-202, score-0.127]
35 This means that A should be spanned by up to n_t eigenvectors of Σ_{x|y} Σ_x^{-1}. [sent-205, score-0.194]
36 We can therefore represent the projection A as a mixture A = WV, where the rows of V are left normalized eigenvectors of Σ_{x|y} Σ_x^{-1} and W is a mixing matrix that weights these eigenvectors. [sent-206, score-0.39]
37 {λ_1, ..., λ_k} are k ≤ n_x eigenvalues of Σ_{x|y} Σ_x^{-1} with critical β values β_1^c, ..., β_k^c. [sent-218, score-0.231]
38 Taken together, these observations prove that for a given value of β, the optimal projection is obtained by taking all the eigenvectors whose eigenvalues λ_i satisfy β ≥ 1/(1−λ_i), and setting their norm according to A = WV with W determined as in Lemma 4. [sent-229, score-0.406]
39 The GIB Information Curve: The information bottleneck is targeted at characterizing the tradeoff between information preservation (accuracy of relevant predictions) and compression. [sent-234, score-0.252]
40 This curve is related to the rate-distortion function in lossy source coding, as well as to the achievability limit in source coding with side-information (Wyner, 1975; Cover and Thomas, 1991). [sent-236, score-0.285]
41 The analytic characterization of the Gaussian IB problem allows us to obtain a closed form expression for the information curve in terms of the relevant eigenvalues. [sent-240, score-0.18]
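One way to trace this curve numerically (a sketch, not the paper's closed-form expression; information_curve is a name introduced here): sweep β, use the coefficients α_i from the result in sentences 17-19, and exploit the fact that the v_i make AΣ_x A^T and AΣ_{x|y} A^T diagonal, so that α_i^2 r_i = (β(1−λ_i)−1)/λ_i and the r_i cancel out of both mutual informations.

```python
import numpy as np

def information_curve(lam, betas):
    """I(T;X) and I(T;Y) along the GIB solution path, from eigenvalues lam in (0, 1)."""
    lam = np.asarray(lam, dtype=float)
    I_TX, I_TY = [], []
    for beta in betas:
        active = beta > 1.0 / (1.0 - lam)                     # eigenvectors in use
        a2r = np.where(active, (beta * (1.0 - lam) - 1.0) / lam, 0.0)  # alpha_i^2 r_i
        itx = 0.5 * np.sum(np.log1p(a2r))                     # I(T;X)
        I_TX.append(itx)
        I_TY.append(itx - 0.5 * np.sum(np.log1p(a2r * lam)))  # I(T;Y)
    return np.array(I_TX), np.array(I_TY)

I_TX, I_TY = information_curve([0.1, 0.3, 0.6, 0.9], np.logspace(0.01, 3, 200))
```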
42 Figure 3 (axes: I(T;X) vs. I(T;Y)): GIB information curve obtained with four eigenvalues λ_i = 0. [sent-241, score-0.146]
43 For infinite β, the curve is saturated at the log of the determinant, ∑ log λ_i. [sent-247, score-0.182]
44 For comparison, information curves calculated with a smaller number of eigenvectors are also depicted (all curves calculated for β < 1000). [sent-248, score-0.194]
45 The GIB curve, illustrated in Figure 3, is continuous and smooth, but is built of several segments: as I(T;X) increases, additional eigenvectors are used in the projection. [sent-253, score-0.226]
46 The derivative of the curve, which is equal to β^{-1}, can easily be shown to be continuous and decreasing; therefore the information curve is concave everywhere, in agreement with the general concavity of the information curve in the discrete case (Wyner, 1975; Gilad-Bachrach et al. [sent-254, score-0.195]
47 Unlike the discrete case, where concavity proofs rely on the ability to use a large number of clusters, concavity is guaranteed here also for segments of the curve where the number of eigenvectors is limited a priori. [sent-256, score-0.256]
48 This algorithm can be interpreted as repeated projection of A_k on the matrix I − Σ_{y|x} Σ_x^{-1} (whose eigenvectors we seek), followed by scaling with βΣ_{ξ_{k+1}} Σ_{t_k|y}^{-1}. [sent-290, score-0.339]
49 It thus has a similar form to the power method for calculating the dominant eigenvectors of the matrix Σ_{y|x} Σ_x^{-1} (Demmel, 1997; Golub and Loan, 1989). [sent-291, score-0.237]
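For readers unfamiliar with it, the power-method skeleton referred to in the comparison above looks roughly as follows (an orthogonal-iteration variant; this is the generic numerical routine, not the GIB update itself, which additionally rescales by βΣ_ξ Σ_{t|y}^{-1} at every step, as described in sentence 48):

```python
import numpy as np

def dominant_eigenvectors(M, k, n_iter=500, seed=0):
    """Approximate the k dominant eigenvectors of M by repeated multiplication."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((M.shape[0], k)))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(M @ Q)[0]        # multiply, then re-orthonormalize
    return Q
```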
50 The norms of the other two eigenvectors converge to the correct values, which are given in Theorem 3. [sent-296, score-0.194]
51 Figure 4 (plotted against iterations): The norm of the projection on the four eigenvectors of Σ_{x|y} Σ_x^{-1}, as it evolves along the operation of the iterative algorithm. [sent-298, score-0.333]
52 The projection on the other eigenvectors also vanishes (not shown). [sent-300, score-0.296]
53 Relation to Other Works: The GIB solutions are related to studies of two main types: studies of eigenvalue-based co-projections, and information theoretic studies of continuous compression. [sent-307, score-0.152]
54 Canonical Correlation Analysis and Imax: The Gaussian information bottleneck projection derived above uses weighted eigenvectors of the matrix Σ_{x|y} Σ_x^{-1} = I − Σ_{xy} Σ_y^{-1} Σ_{yx} Σ_x^{-1}. [sent-310, score-0.453]
55 Such eigenvectors are also used in canonical correlation analysis (CCA) (Hotelling, 1935; Thompson, 1984; Borga, 2001), a method of descriptive statistics that finds linear relations between two variables. [sent-311, score-0.306]
56 Given two variables X,Y , CCA finds a set of basis vectors for each variable, such that the correlation coefficient between the projection of the variables on the basis vectors is maximized. [sent-312, score-0.162]
57 In other words, it finds the bases in which the correlation matrix is diagonal and the correlations on the diagonal are maximized. [sent-313, score-0.157]
58 The bases are the eigenvectors of the matrices Σ_y^{-1} Σ_{yx} Σ_x^{-1} Σ_{xy} and Σ_x^{-1} Σ_{xy} Σ_y^{-1} Σ_{yx}, and the square roots of their corresponding eigenvalues are the canonical correlation coefficients. [sent-314, score-0.418]
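A minimal sketch of the CCA eigenproblem just described, together with its link to the GIB matrix (names are ad hoc; only eigenvalues are compared here, since the scaling of the basis vectors is exactly where GIB and CCA differ, as discussed next):

```python
import numpy as np

def canonical_correlations(Sx, Sy, Sxy):
    """Canonical correlations rho_i and the CCA basis for X (sketch)."""
    Mx = np.linalg.inv(Sx) @ Sxy @ np.linalg.inv(Sy) @ Sxy.T   # Sx^-1 Sxy Sy^-1 Syx
    rho2, Wx = np.linalg.eig(Mx)
    rho2, Wx = np.real(rho2), np.real(Wx)
    order = np.argsort(-rho2)
    return np.sqrt(np.clip(rho2[order], 0.0, 1.0)), Wx[:, order]

# The GIB matrix Sigma_{x|y} Sigma_x^{-1} = I - Sxy Sy^{-1} Syx Sx^{-1} has
# eigenvalues lambda_i = 1 - rho_i^2, so its smallest eigenvalues correspond
# to the largest canonical correlations.
```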
59 First of all, GIB characterizes not only the eigenvectors but also their norm, in a way that depends on the trade-off parameter β. [sent-317, score-0.194]
60 Since CCA depends on the correlation coefficient between the compressed (projected) versions of X and Y , which is a normalized measure of correlation, it is invariant to a rescaling of the projection vectors. [sent-318, score-0.22]
61 It is therefore interesting that both GIB and CCA use the same eigenvectors for the projection of X. [sent-322, score-0.296]
62 Multiterminal Information Theory: The information bottleneck formalism was recently shown (Gilad-Bachrach et al. [sent-324, score-0.139]
63 , 2003) to be closely related to the problem of source coding with side information (Wyner, 1975). [sent-325, score-0.153]
64 The IB formalism, although coinciding with coding theorems in the discrete case, is more general in the sense that it reflects the tradeoff between compression and information preservation, and is not concerned with exact reconstruction. [sent-330, score-0.275]
65 While GIB relates to an eigenvalue problem of the form λA = AΣ_{x|y} Σ_x^{-1}, GIB with side information (GIBSI) requires solving a matrix equation of the form λA + λ_+ AΣ_{x|y+} Σ_x^{-1} = λ_− AΣ_{x|y−} Σ_x^{-1}, which is similar in form to a generalized eigenvalue problem. [sent-340, score-0.147]
66 However, unlike standard generalized eigenvalue problems, but as in the GIB case analyzed in this paper, the eigenvalues themselves depend on the projection A. [sent-341, score-0.234]
67 This is the approach taken in classical CCA, where the number of eigenvectors is determined according to a statistical significance test, and their weights are then set to √(1 − λ_i). [sent-346, score-0.194]
68 This expression is the correlation coefficient between the ith CCA projections on X and Y , and reflects the amount of correlation captured by the ith projection. [sent-347, score-0.147]
69 To illustrate the difference, consider the case where β = 1/(1−λ_3), so that only two eigenvectors are used by GIB. [sent-349, score-0.194]
70 Since only a few bits can be transmitted, the information has to be compressed in a relevant way, and the relative scaling of the different eigenvectors becomes important (as in transform coding; Goyal, 2001). [sent-357, score-0.4]
71 Since GIB weights the eigenvectors of the normalized cross correlation matrix in a different way than CCA, it may lead to a very different interpretation of the relative importance of factors in these studies. [sent-363, score-0.297]
72 Discussion: We applied the information bottleneck method to continuous jointly Gaussian variables X and Y, with a continuous representation of the compressed variable T. [sent-365, score-0.267]
73 We derived an analytic optimal solution as well as a new general algorithm for this problem (GIB) which is based solely on the spectral properties of the covariance matrices in the problem. [sent-366, score-0.149]
74 The solutions for GIB are characterized in terms of the trade-off parameter β between compression and preserved relevant information, and consist of eigenvectors of the matrix Σ_{x|y} Σ_x^{-1}, continuously adding up vectors as more complex models are allowed. [sent-367, score-0.396]
75 We provide an analytic characterization of the optimal tradeoff between the representation complexity and accuracy - the “information curve” - which relates the spectrum to relevant information in an intriguing manner. [sent-368, score-0.245]
76 Besides its clean analytic structure, GIB offers a way for analyzing empirical multivariate data when only its correlation matrices can be estimated. [sent-369, score-0.138]
77 In that case it extends and provides new information theoretic insight into the classical canonical correlation analysis. [sent-370, score-0.152]
78 While both mutual information values vary continuously on the smooth information curve, the dimensionality of the optimal projection T increases discontinuously through a cascade of structural (second order) phase transitions, and the optimal curve moves from one analytic segment to another. [sent-372, score-0.332]
79 This algorithm, similar to multi-eigenvector power methods, converges to a solution in which the number of eigenvectors is determined by the parameter β, in a way that emerges from the iterations rather than being defined a priori. [sent-376, score-0.194]
80 It is interesting to explore the effects of dimensionality changes in this more general framework, to study how they induce topological transitions in the related graphical models, as some edges of the graphs become important only beyond corresponding critical values of the tradeoff parameter β. [sent-389, score-0.159]
81 For every pair (A, Σ_ξ) of a projection A and a full rank covariance matrix Σ_ξ, there exists a matrix Ã such that L(Ã, I_d) = L(A, Σ_ξ), where I_d is the n_t × n_t identity matrix. [sent-398, score-0.266]
82 Denote the set of left normalized eigenvectors of Σ_{x|y} Σ_x^{-1} by v_i (||v_i|| = 1) and their corresponding eigenvalues by λ_i. [sent-403, score-0.274]
83 The matrix V Σ_x^{1/2} is the eigenvector matrix of Σ_x^{-1/2} Σ_{x|y} Σ_x^{-1/2}, since, using the fact that V Σ_{x|y} Σ_x^{-1} = DV, we have V Σ_x^{1/2} Σ_x^{-1/2} Σ_{x|y} Σ_x^{-1/2} = V Σ_{x|y} Σ_x^{-1/2} = V Σ_{x|y} Σ_x^{-1} Σ_x^{1/2} = DV Σ_x^{1/2}. [sent-419, score-0.193]
84 Optimal Eigenvector: For some β values, several eigenvectors can satisfy the conditions for non-degenerate solutions (Equation 10). [sent-424, score-0.313]
85 To identify the optimal eigenvector, we substitute the value of ||A||^2 from Equation (9), together with AΣ_{x|y}A^T = rλ||A||^2 and AΣ_x A^T = r||A||^2, into the target function L of Equation (6), and obtain L = (1 − β) log((1 − λ)(β − 1)) + β log(β(1 − λ)). [sent-425, score-0.176]
86 {λ_1, ..., λ_k} are k ≤ n_x eigenvalues of Σ_{x|y} Σ_x^{-1} with critical β values β_1^c, ..., β_k^c. [sent-439, score-0.231]
87 Proof: We write V Σ_{x|y} Σ_x^{-1} = DV, where D is a diagonal matrix whose elements are the corresponding eigenvalues, and denote by R the diagonal matrix whose ith element is r_i. [sent-445, score-0.205]
88 Note that W R W^T has left eigenvectors W^T with corresponding eigenvalues obtained from the diagonal matrix W^T W R. [sent-449, score-0.344]
89 Thus, if we substitute A into the target function in Equation (6), a similar calculation yields L = (1 − β) ∑_{i=1}^{n} log(||w_i^T||^2 r_i + 1) + β ∑_{i=1}^{n} log(||w_i^T||^2 r_i λ_i + 1) (26), where ||w_i^T||^2 is the ith element of the diagonal of W^T W. [sent-450, score-0.303]
90 We can therefore choose to take W to be the diagonal matrix which is the (matrix) square root of (25), W = ([β(I − D) − I](DR)^{-1})^{1/2} (27), which completes the proof of the full rank (k = n_x) case. [sent-452, score-0.206]
91 In the low rank (k < nx ) case W does not mix all the eigenvectors, but only k of them. [sent-453, score-0.136]
92 To prove the lemma for this case, we first show that any such low rank matrix is equivalent (in terms of the target function value) to a low rank matrix that has only k nonzero rows. [sent-454, score-0.237]
93 Consider an n_x × n_x matrix W of rank k < n_x, but without any zero rows. [sent-456, score-0.377]
94 Let U be the set of left eigenvectors of WW^T (that is, WW^T = UΛU^T). [sent-457, score-0.194]
95 Then, since WW^T is Hermitian, its eigenvectors are orthonormal; thus (UW)(UW)^T = Λ, and W̃ = UW is a matrix with k nonzero rows and n_x − k zero rows. [sent-458, score-0.409]
96 To complete the proof, note that the nonzero rows of W̃ also have n_x − k zero columns and thus define a square matrix of rank k, for which the proof of the full rank case applies, but this time by projecting to a dimension k instead of n_x. [sent-460, score-0.388]
97 The KL divergence can thus be rewritten as D_KL[p(y|x) || p(y|t_k)] = c(x) + ½ (µ_{y|x} − B_k t_k)^T Σ^{-1} (µ_{y|x} − B_k t_k). [sent-475, score-0.938]
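For reference, the underlying Gaussian KL identity (a standard result; in the expression above, every term that does not depend on t_k has been absorbed into c(x)):

```latex
D_{\mathrm{KL}}\big[\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big]
= \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\Sigma_1^{-1}\Sigma_0\big) - d
+ \log\tfrac{|\Sigma_1|}{|\Sigma_0|}
+ (\mu_1-\mu_0)^{T}\Sigma_1^{-1}(\mu_1-\mu_0)\Big].
```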
98 Duality between source coding and channel coding and its extension to the side information case. [sent-664, score-0.297]
99 On source coding with side information at the decoder. [sent-720, score-0.153]
100 The rate distortion function for source coding with side information at the decoder II: General sources. [sent-725, score-0.206]
wordName wordTfidf (topN-words)
[('tk', 0.469), ('gib', 0.366), ('ib', 0.322), ('eigenvectors', 0.194), ('eiss', 0.146), ('hechik', 0.146), ('ishby', 0.146), ('loberson', 0.146), ('id', 0.138), ('cca', 0.132), ('yx', 0.127), ('ytk', 0.122), ('coding', 0.114), ('tishby', 0.114), ('bottleneck', 0.114), ('ak', 0.113), ('ottleneck', 0.112), ('eigenvector', 0.107), ('projection', 0.102), ('nx', 0.099), ('rw', 0.092), ('xy', 0.091), ('nformation', 0.091), ('compression', 0.087), ('globerson', 0.085), ('eigenvalues', 0.08), ('tradeoff', 0.074), ('vt', 0.073), ('aussian', 0.073), ('drw', 0.073), ('degenerated', 0.072), ('imax', 0.072), ('curve', 0.066), ('ri', 0.065), ('uw', 0.061), ('wyner', 0.061), ('correlation', 0.06), ('compressed', 0.058), ('slonim', 0.058), ('log', 0.058), ('gaussian', 0.057), ('distortion', 0.053), ('critical', 0.052), ('canonical', 0.052), ('eigenvalue', 0.052), ('dkl', 0.051), ('ya', 0.051), ('ni', 0.049), ('borga', 0.049), ('wv', 0.049), ('non', 0.047), ('bk', 0.047), ('analytic', 0.046), ('matrix', 0.043), ('covariance', 0.041), ('sensors', 0.041), ('theoretic', 0.04), ('source', 0.039), ('slope', 0.038), ('preserved', 0.038), ('rank', 0.037), ('iterative', 0.037), ('pradhan', 0.037), ('zamir', 0.037), ('becker', 0.034), ('characterization', 0.034), ('relevant', 0.034), ('transitions', 0.033), ('matrices', 0.032), ('continuous', 0.032), ('mutual', 0.031), ('jointly', 0.031), ('cni', 0.031), ('ww', 0.031), ('concavity', 0.031), ('gal', 0.031), ('vancouver', 0.031), ('dv', 0.03), ('channel', 0.03), ('preservation', 0.03), ('yb', 0.03), ('target', 0.03), ('optimal', 0.03), ('berger', 0.028), ('intriguing', 0.027), ('lossy', 0.027), ('cascade', 0.027), ('chechik', 0.027), ('relevance', 0.027), ('projections', 0.027), ('diagonal', 0.027), ('rows', 0.026), ('formalism', 0.025), ('dx', 0.025), ('scalar', 0.025), ('vanishing', 0.025), ('mixing', 0.025), ('amir', 0.024), ('dimitrov', 0.024), ('friman', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 39 jmlr-2005-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
2 0.087308608 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 10^4 to 10^7 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple least-squares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
3 0.073786497 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
4 0.047910467 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
5 0.04738652 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
Author: Neil Lawrence
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation
6 0.04412584 20 jmlr-2005-Clustering with Bregman Divergences
7 0.043978076 36 jmlr-2005-Gaussian Processes for Ordinal Regression
8 0.043001156 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
9 0.039550517 41 jmlr-2005-Kernel Methods for Measuring Independence
10 0.036305219 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
11 0.035969526 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
12 0.03546603 25 jmlr-2005-Denoising Source Separation
13 0.034737498 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
14 0.034584198 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
15 0.033269089 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
16 0.033204865 64 jmlr-2005-Semigroup Kernels on Measures
17 0.031827047 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
18 0.031571638 71 jmlr-2005-Variational Message Passing
19 0.030089146 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
20 0.029861372 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
topicId topicWeight
[(0, 0.19), (1, 0.044), (2, -0.033), (3, -0.024), (4, 0.049), (5, 0.027), (6, -0.18), (7, 0.046), (8, 0.085), (9, 0.067), (10, 0.094), (11, 0.053), (12, 0.034), (13, 0.042), (14, -0.096), (15, 0.117), (16, -0.033), (17, -0.008), (18, -0.109), (19, -0.231), (20, 0.252), (21, -0.17), (22, -0.09), (23, 0.225), (24, -0.321), (25, 0.186), (26, -0.247), (27, -0.05), (28, -0.156), (29, 0.121), (30, -0.111), (31, 0.087), (32, 0.12), (33, 0.25), (34, 0.045), (35, 0.159), (36, -0.156), (37, 0.018), (38, -0.001), (39, -0.047), (40, -0.114), (41, 0.053), (42, 0.033), (43, 0.148), (44, -0.173), (45, -0.092), (46, 0.121), (47, -0.025), (48, 0.069), (49, -0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.97345328 39 jmlr-2005-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 10^4 to 10^7 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple least-squares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
3 0.2916868 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
Author: Gal Elidan, Nir Friedman
Abstract: A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in suboptimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets. Keywords: Bayesian networks, hidden variables, information bottleneck, continuation, variational methods
4 0.2244035 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis
5 0.1956096 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
6 0.18551844 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
7 0.17669322 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
8 0.16010311 25 jmlr-2005-Denoising Source Separation
9 0.14929187 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
10 0.13311121 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
11 0.12884749 61 jmlr-2005-Prioritization Methods for Accelerating MDP Solvers
12 0.12848754 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
13 0.125241 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
14 0.11967226 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
15 0.11933336 47 jmlr-2005-Learning from Examples as an Inverse Problem
16 0.1096273 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
17 0.10953869 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
18 0.1068949 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
19 0.1037469 64 jmlr-2005-Semigroup Kernels on Measures
20 0.1029438 20 jmlr-2005-Clustering with Bregman Divergences
topicId topicWeight
[(13, 0.024), (17, 0.047), (19, 0.047), (36, 0.042), (37, 0.035), (42, 0.019), (43, 0.027), (47, 0.012), (52, 0.122), (59, 0.017), (66, 0.319), (70, 0.034), (80, 0.014), (88, 0.109), (90, 0.027), (94, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.77287692 39 jmlr-2005-Information Bottleneck for Gaussian Variables
Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss
Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ_{x|y} Σ_x^{-1}, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis
2 0.48652491 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
3 0.48035964 36 jmlr-2005-Gaussian Processes for Ordinal Regression
Author: Wei Chu, Zoubin Ghahramani
Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection
4 0.47863182 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.47841302 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
6 0.47201341 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
7 0.46811572 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
8 0.46499798 64 jmlr-2005-Semigroup Kernels on Measures
9 0.46458039 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
10 0.46446866 44 jmlr-2005-Learning Module Networks
11 0.46375266 3 jmlr-2005-A Classification Framework for Anomaly Detection
12 0.45883542 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
13 0.45467108 20 jmlr-2005-Clustering with Bregman Divergences
14 0.45349565 11 jmlr-2005-Algorithmic Stability and Meta-Learning
15 0.45283425 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.45127508 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.44812155 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
18 0.44715068 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
19 0.44649944 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
20 0.44563511 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning