nips nips2004 nips2004-17 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha
Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets.
Reference: text
sentIndex sentText sentNum sentScore
1 Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. [sent-8, score-0.359]
2 We also illustrate the effectiveness of our methods on some synthetic data sets. [sent-10, score-0.023]
3 The proposed algorithms include Isomap [6], locally linear embedding (LLE) [3] and its variations, manifold charting [1], hessian LLE [2] and local tangent space alignment (LTSA) [7], and they have been successfully applied in several computer vision and pattern recognition problems. [sent-12, score-0.435]
4 We will discuss those two issues in the context of local tangent space alignment (LTSA) [7], a variation of locally linear embedding (LLE) [3] (see also [5],[1]). [sent-14, score-0.215]
5 We believe the basic ideas we proposed can be similarly applied to other manifold learning algorithms. [sent-15, score-0.186]
6 We first outline the basic steps of LTSA and illustrate its failure modes using two simple examples. [sent-16, score-0.023]
7 , xN ] with xi ∈ Rm , sampled (possibly with noise) from a d-dimensional manifold (d < m), LTSA proceeds in the following steps. [sent-20, score-0.374]
8 , xiki ] of its neighbors (ki nearest neighbors, for example). [sent-28, score-0.11]
9 Figure 1: The data sets (first column) and computed coordinates τi by LTSA vs. [sent-83, score-0.257]
10 Compute an orthonormal basis $Q_i$ for the d-dimensional tangent space of the manifold at $x_i$, and the orthogonal projection of each $x_{i_j}$ onto the tangent space: $\theta_j^{(i)} = Q_i^T (x_{i_j} - \bar{x}_i)$, where $\bar{x}_i$ is the mean of the neighbors. [sent-87, score-1.155]
11 Align the N local projections $\Theta_i = [\theta_1^{(i)}, \cdots, \theta_{k_i}^{(i)}]$, i = 1, . [sent-89, score-0.047]
12 Such an alignment is achieved by minimizing the global reconstruction error $\sum_i \|E_i\|_2^2 \equiv \sum_i \| T_i (I - \frac{1}{k_i} e e^T) - L_i \Theta_i \|_2^2$ (1. [sent-96, score-0.375]
13 , iki } determined by the neighborhood of each xi , and e is a vector of all ones. [sent-106, score-0.504]
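The local step described above (an orthonormal tangent basis plus projections of the centered neighbors) can be sketched with a plain SVD. This is a minimal illustration, not the authors' implementation: the function name `local_tangent_projection` and the column-wise data layout are assumptions made here.

```python
import numpy as np

def local_tangent_projection(X, neighbor_idx, d):
    """Hypothetical helper: X is (m, N) with points as columns.

    Returns an orthonormal basis Q (m, d) of the estimated tangent space,
    the local coordinates Theta (d, k) and the neighborhood mean x_bar (m, 1).
    """
    Xi = X[:, neighbor_idx]                    # m x k matrix of neighbors
    x_bar = Xi.mean(axis=1, keepdims=True)     # mean of the neighbors
    Xc = Xi - x_bar                            # centered neighborhood
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Q = U[:, :d]                               # d leading left singular vectors
    Theta = Q.T @ Xc                           # theta_j = Q^T (x_ij - x_bar)
    return Q, Theta, x_bar

# Usage: points on a straight line in R^3 are reproduced exactly with d = 1.
t = np.linspace(0.0, 1.0, 8)
X = np.outer(np.array([1.0, 2.0, 2.0]), t)
Q, Theta, x_bar = local_tangent_projection(X, np.arange(8), d=1)
residual = np.linalg.norm(X - x_bar - Q @ Theta)
```

For data lying exactly on a d-dimensional affine subspace the residual vanishes; on a curved manifold it grows with the neighborhood size, which is what the adaptive selection discussed later tries to control.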
14 Two strategies are commonly used for selecting the local neighborhood size $k_i$: one is the k-nearest-neighbor rule (k-NN with a constant k for all the sample points) and the other is the ε-neighborhood [3, 6]. [sent-107, score-0.613]
15 The effectiveness of manifold learning algorithms, including LTSA, depends on how nearby neighborhoods overlap with each other, and on the variation of the curvature of the manifold and its interplay with the sampling density [4]. [sent-108, score-0.666]
16 We sample data points from a half unit circle xi = [cos(ti ), sin(ti )]T , i = 1 . [sent-111, score-0.234]
17 It is easy to see that $t_i$ represents the arc length along the circle. [sent-115, score-0.182]
18 We choose ti ∈ [0, π] according to ti+1 − ti = 0. [sent-116, score-0.364]
19 001 + | cos(ti )|) starting at t1 = 0, and set N = 152 so that tN ≤ π and tN +1 > π. [sent-118, score-0.021]
20 The data set is generated as $x_i = [t_i, 10 e^{-t_i^2}]^T$, i = 1 . [sent-122, score-0.218]
21 , N, where ti ∈ [−6, 6] are uniformly distributed. [sent-125, score-0.182]
22 The curvature of the 1-D curve at parameter value t is given by $c_g(t) = \frac{20\,|1 - 2t^2|\, e^{-t^2}}{(1 + 400\, t^2 e^{-2t^2})^{3/2}}$, which changes from $\min_t c_g(t) = 0$ to $\max_t c_g(t) = 20$ over t ∈ [−6, 6]. [sent-126, score-0.273]
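The stated curvature range can be checked numerically. The sketch below assumes the curve is the graph of g(t) = 10 exp(−t²) and evaluates the standard planar-curve curvature |g''| / (1 + g'²)^{3/2} from the derivatives directly; the function name is made up for this illustration.

```python
import numpy as np

def curvature_gaussian_curve(t):
    """Curvature of the planar curve (t, 10 * exp(-t**2)), assumed form."""
    g1 = -20.0 * t * np.exp(-t**2)                     # g'(t)
    g2 = -20.0 * (1.0 - 2.0 * t**2) * np.exp(-t**2)    # g''(t)
    return np.abs(g2) / (1.0 + g1**2) ** 1.5

t = np.linspace(-6.0, 6.0, 120001)
c = curvature_gaussian_curve(t)
c_max, c_min = c.max(), c.min()
c_at_inflection = curvature_gaussian_curve(1.0 / np.sqrt(2.0))
```

The maximum is attained at t = 0 and the curvature vanishes at the inflection points t = ±1/√2, so a fixed neighborhood size has to cope with both nearly flat and sharply bent regions.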
23 In the first column of Figure 1, we plot these two data sets. [sent-131, score-0.059]
24 The coordinates computed by LTSA with constant k-neighborhoods are plotted against the centered arc-length coordinates for a selected range of k (ideally, the plots should display points on a straight line inclined at ±π/4). [sent-132, score-0.524]
25 2 Adaptive Neighborhood Selection In this section, we propose a neighborhood contraction and expansion algorithm for adaptively selecting ki at each sample point xi . [sent-133, score-0.822]
26 We assume that the data are generated from a parameterized manifold, xi = f (τi ), i = 1, . [sent-134, score-0.188]
27 If f is smooth enough, a first-order Taylor expansion at a fixed τ gives, for a neighboring $\bar\tau$, $f(\bar\tau) = f(\tau) + J_f(\tau) \cdot (\bar\tau - \tau) + \epsilon(\tau, \bar\tau)$, (2. [sent-138, score-0.035]
28 2) $x_{i_j} = x_i + J_f(\tau_i) \cdot (\tau_{i_j} - \tau_i) + \epsilon(\tau_i, \tau_{i_j})$. [sent-139, score-0.349]
29 3) where $J_f(\tau) \in R^{m \times d}$ is the Jacobian matrix of f at τ and $\epsilon(\tau, \bar\tau)$ represents the error term determined by the Hessian of f, $\|\epsilon(\tau, \bar\tau)\| \approx c_f(\tau) \|\bar\tau - \tau\|_2^2$, where $c_f(\tau) \ge 0$ represents the curvature of the manifold at τ. [sent-141, score-0.498]
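30 Setting $\tau = \tau_i$ and $\bar\tau = \tau_{i_j}$ gives the expansion of $x_{i_j} = f(\tau_{i_j})$ about $\tau_i$. A point $x_{i_j}$ can be regarded as a neighbor of $x_i$ with respect to the tangent space spanned by the columns of $J_f(\tau_i)$ if $\|\tau_{i_j} - \tau_i\|_2$ is small and $\|\epsilon(\tau_i, \tau_{i_j})\|_2 \ll \|J_f(\tau_i) \cdot (\tau_{i_j} - \tau_i)\|_2$. [sent-142, score-0.467]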
31 To get around this problem, consider an orthogonal basis matrix $Q_i$ of the tangent space spanned by the columns of $J_f(\tau_i)$, which can be approximately computed by the SVD of $X_i - \bar{x}_i e^T$, where $\bar{x}_i$ is the mean of the neighbors $x_{i_j} = f(\tau_{i_j})$, j = 1, . [sent-144, score-0.782]
32 Note that $\bar{x}_i = \frac{1}{k_i} \sum_{j=1}^{k_i} x_{i_j} = x_i + J_f(\tau_i) \cdot (\bar\tau_i - \tau_i) + \bar\epsilon_i$, where $\bar\epsilon_i$ is the mean of $\epsilon(\tau_i, \tau_{i_1})$, . [sent-148, score-1.107]
33 3) by the representation above yields $x_{i_j} = \bar{x}_i + J_f(\tau_i) \cdot (\tau_{i_j} - \bar\tau_i) + \epsilon_j^{(i)}$ with $\epsilon_j^{(i)} = \epsilon(\tau_i, \tau_{i_j}) - \bar\epsilon_i$. [sent-153, score-0.349]
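34 With $\theta_j^{(i)} = Q_i^T (x_{i_j} - \bar{x}_i)$, we have $x_{i_j} = \bar{x}_i + Q_i \theta_j^{(i)} + \epsilon_j^{(i)}$. Thus, $x_{i_j}$ can be selected as a neighbor of $x_i$ if the orthogonal projection $\theta_j^{(i)}$ is small and $\|\epsilon_j^{(i)}\|_2 = \|x_{i_j} - \bar{x}_i - Q_i \theta_j^{(i)}\|_2 \ll \|Q_i \theta_j^{(i)}\|_2 = \|\theta_j^{(i)}\|_2$. [sent-155, score-0.765]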
35 4) Assuming all the $x_{i_j}$ satisfy the above inequality, we should approximately have $\|(I - Q_i Q_i^T)(X_i - \bar{x}_i e^T)\|_F \le \eta \|Q_i^T (X_i - \bar{x}_i e^T)\|_F$ (2. [sent-157, score-0.161]
36 5) as a criterion for adaptive neighbor selection, starting with a K-NN at each sample point xi with a large enough initial K and deleting points one by one until (2. [sent-159, score-0.386]
37 This process will terminate when the neighborhood size equals $d + k_0$ for some small $k_0$ and (2. [sent-161, score-0.202]
38 In that case, we may need to reselect a k-NN that minimizes the ratio $\|(I - Q_i Q_i^T)(X_i - \bar{x}_i e^T)\|_F / \|Q_i^T (X_i - \bar{x}_i e^T)\|_F$ as the neighborhood set, as detailed below. [sent-163, score-0.184]
39 Determine the initial K and the K-NN neighborhood of $x_i$, ordered by non-decreasing distance to $x_i$, $X_i^{(K)} = [x_{i_1}$, . [sent-166, score-0.393]
40 Compute the orthogonal basis matrix $Q_i^{(k)}$, whose columns are the left singular vectors of $X_i^{(k)} - \bar{x}_i^{(k)} e^T$ corresponding to its d largest singular values. [sent-176, score-0.284]
41 If $\|X_i^{(k)} - \bar{x}_i^{(k)} e^T - Q_i^{(k)} \Theta_i^{(k)}\|_F < \eta \|\Theta_i^{(k)}\|_F$, then set $X_i = X_i^{(k)}$, $\Theta_i = \Theta_i^{(k)}$, and terminate. [sent-179, score-0.188]
42 If $k > d + k_0$, then delete the last column of $X_i^{(k)}$ to obtain $X_i^{(k-1)}$, set k := k − 1, and go to step C1; otherwise, go to step C4. [sent-181, score-0.07]
43 5), then the contracted neighborhood $X_i$ should be one that minimizes $\|X_i - \bar{x}_i e^T - Q_i \Theta_i\|_F / \|\Theta_i\|_F$. [sent-185, score-0.184]
44 Once the contraction step is done, we can still add back some of the unselected $x_{i_j}$ to increase the overlap of nearby neighborhoods while still keeping (2. [sent-186, score-0.293]
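The contraction loop can be sketched as follows. This is a hedged reading of the steps above, not the authors' code: the name `contract_neighborhood`, the default `k0=2`, and the choice to count the point itself as its own nearest neighbor are assumptions.

```python
import numpy as np

def contract_neighborhood(X, i, K, d, eta, k0=2):
    """Shrink the K-NN neighborhood of column i of X (m, N) one point at a time.

    Stops as soon as ||Xc - Q Theta||_F < eta * ||Theta||_F. If that never
    happens before the size reaches d + k0, the size with the smallest
    ratio is returned instead.
    """
    dist = np.linalg.norm(X - X[:, [i]], axis=0)
    order = np.argsort(dist)[:K]              # neighbors by non-decreasing distance
    best_ratio, best_k = np.inf, K
    k = K
    while True:
        Xk = X[:, order[:k]]
        Xc = Xk - Xk.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        Q = U[:, :d]
        Theta = Q.T @ Xc
        ratio = np.linalg.norm(Xc - Q @ Theta) / np.linalg.norm(Theta)
        if ratio < eta:
            return order[:k], ratio           # criterion satisfied
        if ratio < best_ratio:
            best_ratio, best_k = ratio, k
        if k > d + k0:
            k -= 1                            # drop the farthest remaining point
        else:
            return order[:best_k], best_ratio # fall back to the best ratio seen

# Usage on a half circle (curved) and on a straight line (flat).
t = np.linspace(0.0, np.pi, 40)
circle = np.vstack([np.cos(t), np.sin(t)])
idx_circle, r_circle = contract_neighborhood(circle, 20, K=20, d=1, eta=0.05)

s = np.linspace(0.0, 1.0, 30)
line = np.vstack([s, 2.0 * s])
idx_line, r_line = contract_neighborhood(line, 15, K=10, d=1, eta=0.05)
```

On the flat data the initial neighborhood already satisfies the criterion and is kept whole, while on the circle the neighborhood shrinks until the arc is short enough to look linear at tolerance `eta`.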
45 In fact, we can add $x_{i_j}$ if $\|x_{i_j} - \bar{x}_i - Q_i \theta_j^{(i)}\| \le \eta \|\theta_j^{(i)}\|$, as the following result shows (we refer to [8] for the proof). [sent-188, score-0.431]
46 Furthermore, we assume $\|x_{i_j} - \bar{x}_i - Q_i \theta_j^{(i)}\| \le \eta \|\theta_j^{(i)}\|$, j = k + 1, . [sent-195, score-0.082]
47 Denote by $\tilde{x}_i$ the column mean of the expanded matrix $\tilde{X}_i = [X_i, x_{i_{k+1}}$, . [sent-200, score-0.238]
48 Then for the left-singular vector matrix Qi corresponding to ˜ i − x i eT , the d largest singular values of X ˜ √ k+p p (i) ˜ ˜i ˜ ˜i ˜ θj 2 . [sent-204, score-0.061]
49 Set $k_i$ to be the number of columns of $X_i$ obtained by the neighborhood contraction step. [sent-208, score-0.539]
50 Denote by $J_i$ the index subset of j’s, $k_i < j \le K$, such that $\|(I - Q_i Q_i^T)(x_{i_j} - \bar{x}_i)\|_2 \le \|\theta_j^{(i)}\|_2$. [sent-214, score-0.473]
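The expansion test can be sketched in the same style. The helper name `expand_neighborhood` is hypothetical, and the `eta` argument is an assumption added so that one function covers both forms of the inequality that appear in the text (with and without the factor η); `eta=1.0` corresponds to the form without it.

```python
import numpy as np

def expand_neighborhood(X, selected, candidates, d, eta=1.0):
    """Add back candidates whose off-tangent residual is small.

    X is (m, N); `selected` are the column indices kept by contraction and
    `candidates` the previously discarded ones. The basis Q and the mean are
    those of the contracted neighborhood and are not recomputed.
    """
    Xs = X[:, selected]
    x_bar = Xs.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xs - x_bar, full_matrices=False)
    Q = U[:, :d]
    added = []
    for j in candidates:
        diff = X[:, [j]] - x_bar
        theta = Q.T @ diff                       # tangent coordinate of candidate
        residual = np.linalg.norm(diff - Q @ theta)
        if residual <= eta * np.linalg.norm(theta):
            added.append(j)
    return list(selected) + added

# Usage: ten points on a line plus one far off-line point (index 10).
t = np.arange(10.0)
X = np.hstack([np.vstack([t, np.zeros(10)]), np.array([[4.5], [50.0]])])
expanded = expand_neighborhood(X, [3, 4, 5, 6], [2, 7, 10], d=1)
```

Points that lie close to the local tangent space are re-admitted, which restores overlap between adjacent neighborhoods, while the off-manifold point is rejected.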
51 We construct the data points as xi = [sin(ti ), cos(ti ), 0. [sent-217, score-0.233]
52 , N, with ti ∈ [0, 4π] uniformly distributed, which is plotted in the top-left panel in Figure 2. [sent-221, score-0.208]
53 Figure 2: Plots of the data sets (top left), the computed coordinates τi by LTSA vs. [sent-253, score-0.257]
54 the centered arc-length coordinates (a ∼ c), the computed coordinates τi by LTSA with neighborhood contraction vs. the centered arc-length coordinates (e ∼ g), and the computed coordinates τi by LTSA with neighborhood contraction and expansion vs. [sent-254, score-1.581]
55 the centered arc-length coordinates (bottom left). LTSA with constant k-NN fails for any k: a small k leads to a lack of the necessary overlap among the neighborhoods, while for a large k the computed tangent space cannot represent the local geometry well. [sent-255, score-0.487]
56 In (a ∼ c) of Figure 2, we plot the coordinates computed by LTSA vs. [sent-256, score-0.284]
57 Contracting the neighborhoods without expansion also gives poor results, because the resulting neighborhoods are too small; see (e ∼ g) of Figure 2. [sent-257, score-0.118]
58 Panel (d) of Figure 2 gives an excellent result computed by LTSA with both neighborhood contraction and expansion. [sent-259, score-0.324]
59 We mention that our adaptive strategies also work well for noisy data sets; we refer readers to [8] for some examples. [sent-260, score-0.081]
60 3 Alignment incorporating variations of manifold curvature Let Xi = [xi1 , . [sent-261, score-0.355]
61 , xiki ] consists of the neighbors determined by the contraction and expansion steps in the above section. [sent-264, score-0.235]
62 1), we can show that the size of the error term $\|E_i\|_2$ depends on the size of the curvature of the manifold at the sample point $x_i$ [8]. [sent-266, score-0.557]
63 1) more uniform, we need to factor out the effect of the variations of the curvature. [sent-268, score-0.043]
64 To this end, we pose the following minimization problem, $\min_{T,\{L_i\}} \sum_i \frac{1}{k_i} \| (T_i (I - \frac{1}{k_i} e e^T) - L_i \Theta_i) D_i^{-1} \|_2^2$, (3. [sent-269, score-0.615]
65 , φ(θki )), and φ(θj ) is proportional to the curvature of the manifold at the parameter value θi , the computation of which will be discussed below. [sent-273, score-0.312]
66 For fixed T, the optimal $L_i$ is given by $L_i = T_i (I_{k_i} - \frac{1}{k_i} e e^T) \Theta_i^+ = T_i \Theta_i^+$. [sent-274, score-0.285]
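The closed form for the optimal local transformation can be checked numerically. The sketch below treats the fit as an ordinary least-squares (Frobenius-norm) problem without the weights, which is an assumption made here for simplicity; the second equality holds because the projections are centered, so Θ_i e = 0 and hence e^T Θ_i^+ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2, 7

Theta = rng.standard_normal((d, k))
Theta -= Theta.mean(axis=1, keepdims=True)   # centered projections: Theta @ e = 0
T_i = rng.standard_normal((d, k))            # arbitrary stand-in for T_i

e = np.ones((k, 1))
P = np.eye(k) - e @ e.T / k                  # centering projector I - e e^T / k
Theta_pinv = np.linalg.pinv(Theta)

L_full = T_i @ P @ Theta_pinv                # T_i (I - e e^T / k) Theta^+
L_short = T_i @ Theta_pinv                   # T_i Theta^+

# Independent check: solve min_L || T_i P - L Theta ||_F by least squares.
L_lstsq = np.linalg.lstsq(Theta.T, (T_i @ P).T, rcond=None)[0].T
```

All three matrices agree up to rounding, so the centering factor can be dropped once the local coordinates have zero mean.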
67 , Dn /kn )(SW )T , where W = (Iki − [sent-279, score-0.241]
68 1) shows that we can take $\phi_i(\theta_j^{(i)}) = \gamma + c_f(\tau_i) \|\theta_j^{(i)}\|_2^2$ with a small positive constant γ to ensure $\phi_i(\theta_j^{(i)}) > 0$, and $c_f(\tau_i) \ge 0$ represents the mean of the curvatures $c_f(\tau_i, \tau_{i_j})$ for all neighbors of $x_i$. Let $Q_i$ denote the orthonormal matrix of the left singular vectors of $X_i (I - \frac{1}{k_i} e e^T)$ corresponding to its d largest singular values. [sent-282, score-0.709]
69 We can approximately compute cf (τi ) as follows. [sent-283, score-0.084]
70 $c_f(\tau_i) \approx \frac{1}{k_i - 1} \sum_{\ell=2}^{k_i} \frac{\arccos(\sigma_{\min}(Q_i^T Q_{i_\ell}))}{\|\theta_\ell^{(i)}\|_2}$, [sent-284, score-0.654]
71 where $\sigma_{\min}(\cdot)$ is the smallest singular value of a matrix. [sent-285, score-0.068]
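The curvature estimate averages, over the neighbors, the largest principal angle between tangent spaces divided by the length of the local coordinate. A minimal sketch follows; the function name and argument layout are assumptions, and the caller is expected to pass only the neighbors other than the point itself (the terms ℓ = 2, ..., k_i).

```python
import numpy as np

def estimate_curvature(Q_i, Q_neighbors, Theta):
    """Q_i: (m, d) tangent basis at x_i; Q_neighbors: list of (m, d) bases at
    the neighbors; Theta: (d, len(Q_neighbors)) local coordinates of those
    neighbors. Returns the mean of angle / ||theta||."""
    ratios = []
    for Q_l, theta_l in zip(Q_neighbors, Theta.T):
        s_min = np.linalg.svd(Q_i.T @ Q_l, compute_uv=False).min()
        angle = np.arccos(np.clip(s_min, -1.0, 1.0))   # largest principal angle
        ratios.append(angle / np.linalg.norm(theta_l))
    return float(np.mean(ratios))

# Usage: unit circle with exact tangents; the true curvature is 1.
def point(t):
    return np.array([[np.cos(t)], [np.sin(t)]])

def tangent(t):
    return np.array([[-np.sin(t)], [np.cos(t)]])

t0 = 1.0
deltas = [-0.1, -0.05, 0.05, 0.1]
pts = np.hstack([point(t0)] + [point(t0 + dt) for dt in deltas])
x_bar = pts.mean(axis=1, keepdims=True)
Q0 = tangent(t0)
Theta = Q0.T @ (pts[:, 1:] - x_bar)          # coordinates of the four neighbors
c_hat = estimate_curvature(Q0, [tangent(t0 + dt) for dt in deltas], Theta)
```

For the circle each term equals |Δ| / |sin Δ|, so the estimate is close to 1 and approaches it as the neighborhood shrinks.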
72 Then the diagonal weights $\phi_i(\theta_j^{(i)})$ can be computed as $\phi_i(\theta_j^{(i)}) = \gamma + \frac{\|\theta_j^{(i)}\|_2^2}{k_i - 1} \sum_{\ell=2}^{k_i} \frac{\arccos(\sigma_{\min}(Q_i^T Q_{i_\ell}))}{\|\theta_\ell^{(i)}\|_2}$. [sent-286, score-0.62]
73 With the above preparation, we are now ready to present the adaptive LTSA algorithm. [sent-287, score-0.081]
74 , N, using the neighborhood contraction/expansion steps in Section 2. [sent-298, score-0.184]
75 Compute the truncated SVD, say $Q_i \Sigma_i V_i^T$, of $X_i (I - \frac{1}{k_i} e e^T)$ with d columns in both $Q_i$ and $V_i$, the projections $\theta_\ell^{(i)} = Q_i^T (x_{i_\ell} - \bar{x}_i)$ with the mean $\bar{x}_i$ of the neighbors, and denote $\Theta_i = [\theta_1^{(i)}$, . [sent-300, score-0.686]
76 , N , $c_i = \frac{1}{k_i - 1} \sum_{\ell=2}^{k_i} \frac{\arccos(\sigma_{\min}(Q_i^T Q_{i_\ell}))}{\|\theta_\ell^{(i)}\|_2}$, Step 4. [sent-309, score-0.57]
77 , N , set $W_i = I_{k_i} - [\frac{1}{\sqrt{k_i}} e, V_i][\frac{1}{\sqrt{k_i}} e, V_i]^T$, [sent-314, score-0.57]
78 $D_i = \gamma I + \mathrm{diag}(c_i \|\theta_1^{(i)}\|_2^2, \ldots, c_i \|\theta_{k_i}^{(i)}\|_2^2)$, where γ is a small constant number (usually we set γ = 1. [sent-317, score-0.285]
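The per-neighborhood matrices in this step can be sketched directly from their definitions. The function name `local_weights` is hypothetical, and `gamma` is left as a required argument because its typical value is cut off in the text; the value used in the usage lines is an arbitrary placeholder.

```python
import numpy as np

def local_weights(Xi, d, c_i, gamma):
    """Xi: (m, k) neighborhood matrix, c_i: curvature estimate, gamma > 0.

    Returns W_i = I - [e/sqrt(k), V_i][e/sqrt(k), V_i]^T and
    D_i = gamma * I + diag(c_i * ||theta_j||^2).
    """
    m, k = Xi.shape
    e = np.ones((k, 1))
    Xc = Xi @ (np.eye(k) - e @ e.T / k)          # centered neighborhood
    U, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Q, V = U[:, :d], Vt[:d].T                    # truncated SVD factors
    Theta = Q.T @ Xc                             # local coordinates
    G = np.hstack([e / np.sqrt(k), V])           # orthonormal columns
    W = np.eye(k) - G @ G.T
    D = gamma * np.eye(k) + np.diag(c_i * np.sum(Theta**2, axis=0))
    return W, D

# Usage with placeholder values for c_i and gamma.
rng = np.random.default_rng(1)
Xi = rng.standard_normal((3, 6))
W, D = local_weights(Xi, d=2, c_i=0.5, gamma=1e-3)
```

Because the rows of the centered matrix are orthogonal to e, the stacked matrix [e/√k, V_i] has orthonormal columns and W_i is an orthogonal projector that annihilates constant vectors.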
79 Compute the eigenvectors of B corresponding to its d + 1 smallest eigenvalues and pick the eigenvector matrix [u2 , . [sent-326, score-0.025]
80 , ud+1 ] corresponding to the 2nd through (d + 1)st smallest eigenvalues, and set T = [u2 , . [sent-329, score-0.043]
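The final eigenvector selection can be sketched for any symmetric alignment matrix B. How B is assembled is not reproduced here; the helper name and the row-wise layout of the returned coordinates (a transpose of the eigenvector matrix) are assumptions.

```python
import numpy as np

def embedding_from_alignment(B, d):
    """Return the d x N coordinates built from eigenvectors u_2, ..., u_{d+1}
    of a symmetric matrix B, ordered by increasing eigenvalue."""
    B = 0.5 * (B + B.T)                     # guard against rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(B)    # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1].T            # skip the smallest eigenpair

# Usage on a toy diagonal matrix whose eigenvectors are the unit vectors.
B_toy = np.diag([0.0, 1.0, 2.0, 3.0])
T_toy = embedding_from_alignment(B_toy, d=2)
```

The eigenvector of the smallest eigenvalue is skipped because, for an alignment matrix of this kind, it is the constant vector and carries no coordinate information.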
81 4 Experimental Results In this section, we present several numerical examples to illustrate the performance of the adaptive LTSA algorithm. [sent-333, score-0.104]
82 Figure 3: The computed coordinates τi by LTSA taking into account curvature and variable size of neighborhood. [sent-384, score-0.401]
83 First we apply the adaptive LTSA to the data sets shown in Examples 1 and 2. [sent-385, score-0.111]
84 Adaptive LTSA with different starting k’s works very well. [sent-386, score-0.021]
85 It shows that for these two data sets, the adaptive LTSA is not sensitive to the choice of the starting k or the variations in sampling densities and manifold curvatures. [sent-388, score-0.352]
86 In the left figure of Figure 4, we show there is a distortion between the computed coordinates by LTSA with the best-fit neighborhood size (bottom left) and the generating coordinates (r, t)T (top right). [sent-392, score-0.693]
87 In the right panel of the bottom row of the left figure of Figure 4, we plot the computed coordinates by the adaptive LTSA with initial neighborhood size k = 30. [sent-393, score-0.679]
88 (In fact, the adaptive LTSA is insensitive to k and we will get similar results with a larger or smaller initial k). [sent-394, score-0.102]
89 We can see that the computed coordinates by the adaptive LTSA can recover the generating coordinates well without much distortion. [sent-395, score-0.572]
90 Finally we applied both LTSA and the adaptive LTSA to a 2D manifold with 3 peaks embedded in a 100 dimensional space. [sent-396, score-0.31]
91 First we generate N = 2000 3D points, $y_i = (t_i, s_i, h(t_i, s_i))^T$, where $t_i$ and $s_i$ are randomly distributed in the interval [−1. [sent-398, score-0.236]
92 Then we embed the 3D points into a 100D space by $x_i^Q = Q y_i$, $x_i^H = H y_i$, where $Q \in R^{100 \times 3}$ is a random orthonormal matrix resulting in an orthogonal transformation and $H \in R^{100 \times 3}$ is a matrix with its singular values uniformly distributed in (0, 1) resulting in an affine transformation. [sent-401, score-0.17]
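The two embeddings can be sketched as follows. The height function h and the sampling interval are not recoverable from the text, so the 3D points here are uniform random stand-ins; building H from random orthonormal factors and uniformly drawn singular values is one possible construction, assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
Y = rng.uniform(-1.0, 1.0, size=(3, N))      # stand-in 3D points (columns)

# Q: 100 x 3 with orthonormal columns, from the QR factorization of a
# random Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((100, 3)))

# H: 100 x 3 with singular values drawn uniformly from (0, 1).
U, _ = np.linalg.qr(rng.standard_normal((100, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sigma = rng.uniform(0.0, 1.0, size=3)
H = U @ np.diag(sigma) @ V.T

XQ = Q @ Y                                   # distance-preserving embedding
XH = H @ Y                                   # distorting linear embedding
```

The orthonormal map preserves all pairwise distances, whereas H stretches the three directions unevenly, which is what makes the second data set harder for a method with a fixed neighborhood size.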
93 In the top row of the right figure of Figure 4, we plot the [sent-402, score-0.076]
94 Figure 4: Left figure: 3D swiss-roll and the generating coordinates (top row), computed 2D coordinates by LTSA with the best neighborhood size k = 15 (bottom left) and computed 2D coordinates by adaptive LTSA (bottom right). [sent-454, score-1.031]
95 Right figure: coordinates computed by LTSA for the orthogonally embedded 100D data set $\{x_i^Q\}$ (a) and the affinely embedded 100D data set $\{x_i^H\}$ (b), and the coordinates computed by the adaptive LTSA for $\{x_i^Q\}$ (c) and $\{x_i^H\}$ (d). [sent-455, score-0.681]
96 computed coordinates by LTSA for $x_i^Q$ (shown in (a)) and $x_i^H$ (shown in (b)) with best-fit neighborhood size k = 15. [sent-456, score-0.459]
97 In the bottom row of the right figure of Figure 4, we plot the computed coordinates by the adaptive LTSA for $x_i^Q$ (shown in (c)) and $x_i^H$ (shown in (d)) with initial neighborhood size k = 15. [sent-458, score-0.653]
98 It is clear that the adaptive LTSA gives a much better result. [sent-459, score-0.081]
99 Think globally, fit locally: unsupervised learning of nonlinear manifolds. [sent-477, score-0.031]
wordName wordTfidf (topN-words)
[('ltsa', 0.643), ('ki', 0.285), ('jf', 0.208), ('coordinates', 0.207), ('xi', 0.188), ('manifold', 0.186), ('neighborhood', 0.184), ('ti', 0.182), ('qt', 0.178), ('qi', 0.162), ('xij', 0.161), ('iki', 0.132), ('curvature', 0.126), ('contraction', 0.09), ('tangent', 0.089), ('cf', 0.084), ('ij', 0.082), ('adaptive', 0.081), ('xh', 0.066), ('xq', 0.066), ('neighborhoods', 0.063), ('eet', 0.06), ('xik', 0.06), ('arccos', 0.057), ('ocal', 0.057), ('xiki', 0.057), ('neighbors', 0.053), ('computed', 0.05), ('cg', 0.049), ('cos', 0.048), ('alignment', 0.048), ('interplay', 0.045), ('variations', 0.043), ('singular', 0.043), ('embedded', 0.043), ('ee', 0.042), ('sin', 0.042), ('et', 0.042), ('li', 0.04), ('contracting', 0.038), ('curvatures', 0.038), ('eighborhood', 0.038), ('ud', 0.038), ('lle', 0.038), ('di', 0.037), ('centered', 0.035), ('expansion', 0.035), ('orthogonal', 0.035), ('bottom', 0.034), ('manifolds', 0.033), ('diag', 0.033), ('zha', 0.033), ('hessian', 0.032), ('column', 0.032), ('orthonormal', 0.031), ('nonlinear', 0.031), ('row', 0.031), ('charting', 0.03), ('date', 0.03), ('neighbor', 0.029), ('gure', 0.029), ('issues', 0.028), ('locally', 0.028), ('generating', 0.027), ('plot', 0.027), ('panel', 0.026), ('projections', 0.025), ('points', 0.025), ('smallest', 0.025), ('sw', 0.025), ('align', 0.024), ('expanding', 0.023), ('tn', 0.023), ('illustrate', 0.023), ('min', 0.023), ('zhang', 0.023), ('minimization', 0.022), ('ji', 0.022), ('local', 0.022), ('initial', 0.021), ('ii', 0.021), ('starting', 0.021), ('vi', 0.021), ('sampling', 0.021), ('sample', 0.021), ('overlap', 0.021), ('svd', 0.02), ('pennsylvania', 0.02), ('sizes', 0.02), ('construct', 0.02), ('rm', 0.019), ('ei', 0.019), ('step', 0.019), ('adaptively', 0.019), ('size', 0.018), ('si', 0.018), ('top', 0.018), ('wang', 0.018), ('nearby', 0.018), ('matrix', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 17 nips-2004-Adaptive Manifold Learning
Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha
Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets. 1
2 0.23128845 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
3 0.14178987 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
4 0.11306422 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
5 0.07696256 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
Author: Tzu-kuo Huang, Chih-jen Lin, Ruby C. Weng
Abstract: The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated. 1
6 0.075738028 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
7 0.073374681 160 nips-2004-Seeing through water
8 0.072224528 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
9 0.067793429 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
10 0.062463775 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
11 0.061716393 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
12 0.059037782 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
13 0.056654371 92 nips-2004-Kernel Methods for Implicit Surface Modeling
14 0.055841517 161 nips-2004-Self-Tuning Spectral Clustering
15 0.055770282 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
16 0.053188577 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
17 0.052306779 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
18 0.051231924 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
19 0.048608001 130 nips-2004-Newscast EM
20 0.047934361 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
topicId topicWeight
[(0, -0.153), (1, 0.051), (2, -0.041), (3, 0.007), (4, -0.003), (5, -0.008), (6, -0.067), (7, -0.025), (8, -0.084), (9, 0.059), (10, 0.132), (11, -0.146), (12, 0.014), (13, -0.17), (14, -0.027), (15, 0.148), (16, -0.104), (17, -0.065), (18, 0.028), (19, 0.179), (20, -0.147), (21, 0.141), (22, -0.13), (23, -0.118), (24, 0.078), (25, 0.137), (26, 0.0), (27, 0.136), (28, -0.003), (29, -0.069), (30, -0.064), (31, -0.151), (32, 0.04), (33, 0.028), (34, 0.077), (35, 0.117), (36, 0.132), (37, -0.051), (38, -0.072), (39, -0.033), (40, 0.128), (41, 0.08), (42, -0.069), (43, 0.087), (44, 0.077), (45, -0.052), (46, -0.063), (47, 0.091), (48, -0.052), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.93393558 17 nips-2004-Adaptive Manifold Learning
Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha
Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets. 1
2 0.71140277 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
3 0.46051201 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
4 0.44666031 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
5 0.42489994 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
Author: Erik Aurell, Uri Gordon, Scott Kirkpatrick
Abstract: Survey propagation is a powerful technique from statistical physics that has been applied to solve the 3-SAT problem both in principle and in practice. We give, using only probability arguments, a common derivation of survey propagation, belief propagation and several interesting hybrid methods. We then present numerical experiments which use WSAT (a widely used random-walk based SAT solver) to quantify the complexity of the 3-SAT formulae as a function of their parameters, both as randomly generated and after simpli£cation, guided by survey propagation. Some properties of WSAT which have not previously been reported make it an ideal tool for this purpose – its mean cost is proportional to the number of variables in the formula (at a £xed ratio of clauses to variables) in the easy-SAT regime and slightly beyond, and its behavior in the hardSAT regime appears to re¤ect the underlying structure of the solution space that has been predicted by replica symmetry-breaking arguments. An analysis of the tradeoffs between the various methods of search for satisfying assignments shows WSAT to be far more powerful than has been appreciated, and suggests some interesting new directions for practical algorithm development. 1
6 0.38285547 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
7 0.38015568 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.3554807 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
9 0.32624105 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
10 0.31515354 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
11 0.29936776 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
12 0.29794785 130 nips-2004-Newscast EM
13 0.29463986 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
14 0.29343107 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
15 0.28677544 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
16 0.26956907 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps
17 0.26270974 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
18 0.25264463 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
19 0.2433064 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
20 0.23906542 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
topicId topicWeight
[(13, 0.135), (15, 0.149), (26, 0.041), (31, 0.015), (33, 0.123), (35, 0.015), (39, 0.019), (50, 0.042), (76, 0.013), (87, 0.349)]
simIndex simValue paperId paperTitle
same-paper 1 0.80416715 17 nips-2004-Adaptive Manifold Learning
Author: Jing Wang, Zhenyue Zhang, Hongyuan Zha
Abstract: Recently, there have been several advances in the machine learning and pattern recognition communities for developing manifold learning algorithms to construct nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the neighborhood sizes; and 2) better fitting the local geometric structure to account for the variations in the curvature of the manifold and its interplay with the sampling density of the data set. We also illustrate the effectiveness of our methods on some synthetic data sets. 1
2 0.79490453 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
Author: Hyun J. Park, Te W. Lee
Abstract: Capturing dependencies in images in an unsupervised manner is important for many image processing applications. We propose a new method for capturing nonlinear dependencies in images of natural scenes. This method is an extension of the linear Independent Component Analysis (ICA) method by building a hierarchical model based on ICA and mixture of Laplacian distribution. The model parameters are learned via an EM algorithm and it can accurately capture variance correlation and other high order structures in a simple manner. We visualize the learned variance structure and demonstrate applications to image segmentation and denoising. 1 Introduction Unsupervised learning has become an important tool for understanding biological information processing and building intelligent signal processing methods. Real biological systems however are much more robust and flexible than current artificial intelligence, mostly due to the much more efficient representations used in biological systems. Therefore, unsupervised learning algorithms that capture more sophisticated representations can provide a better understanding of neural information processing and also provide improved algorithms for signal processing applications. For example, independent component analysis (ICA) can learn representations similar to simple cell receptive fields in visual cortex [1] and is also applied for feature extraction, image segmentation and denoising [2,3]. ICA can approximate statistics of natural image patches by Eqs. (1) and (2), where X is the data and u is a source signal whose distribution is a product of sparse distributions like a generalized Laplacian distribution. $X = Au$ (1), $P(u) = \prod_i P(u_i)$ (2). But the representation learned by the ICA algorithm is relatively low-level. In biological systems there are more high-level representations such as contours, textures and objects, which are not well represented by the linear ICA model. 
ICA learns only linear dependency between pixels by finding strongly correlated linear axes. Therefore, the modeling capability of ICA is quite limited. Previous approaches showed that one can learn more sophisticated high-level representations by capturing nonlinear dependencies in a post-processing step after the ICA step [4,5,6,7,8]. These efforts have centered on variance correlation in natural images. After ICA, a source signal is not linearly predictable from others. However, given variance dependencies, a source signal is still ‘predictable’ in a nonlinear manner. It is not possible to de-correlate this variance dependency using a linear transformation. Several researchers have proposed extensions to capture the nonlinear dependencies. Portilla et al. used a Gaussian Scale Mixture (GSM) to model variance dependency in the wavelet domain. This model can learn variance correlation in the source prior and showed improvement in image denoising [4]. But in this model, dependency is defined only between a subset of wavelet coefficients. Hyvarinen and Hoyer suggested using a special variance-related distribution to model the variance-correlated source prior. This model can learn groupings of dependent sources (Subspace ICA) or topographic arrangements of correlated sources (Topographic ICA) [5,6]. Similarly, Welling et al. suggested a product-of-experts model where each expert represents a variance-correlated group [7]. The product form of the model enables applications to image denoising. But these models do not reveal higher-order structures explicitly. Our model is motivated by Lewicki and Karklin, who proposed a 2-stage model where the 1st stage is an ICA model (Eq. (3)) and the 2nd stage is a linear generative model where another source v generates the logarithmic variance for the 1st stage (Eq. (4)) [8].
$P(u \mid \lambda) = c \exp\left(-\left|u/\lambda\right|^{q}\right)$ (3)
$\log[\lambda] = Bv$ (4)
This model captures the variance dependency structure explicitly, but treating variance as an additional random variable introduces another level of complexity and requires several approximations. Thus, it is difficult to obtain a simple analytic PDF of the source signal u and to apply the model to image processing problems. We propose a hierarchical model based on ICA and a mixture of Laplacian distributions. Our model can be considered a simplification of the model in [8], obtained by constraining v to be a 0/1 random vector in which only one element can be 1. Our model is computationally simpler but can still capture variance dependency. Experiments show that our model can reveal higher-order structures similar to [8]. In addition, our model provides a simple parametric PDF of variance-correlated priors, which is an important advantage for adaptive signal processing. Utilizing this, we demonstrate simple applications to image segmentation and image denoising. Our model provides an improved statistical model for natural images and can be used for other applications including feature extraction, image coding, or learning even higher-order structures.
2 Modeling nonlinear dependencies
We propose a hierarchical or 2-stage model where the 1st stage is an ICA source signal model and the 2nd stage is modeled by a mixture model with different variances (figure 1). In natural images, the correlation of variance reflects different types of regularities in the real world. Such specialized regularities can be summarized as “context” information. To model the context-dependent variance correlation, we use mixture models where Laplacian distributions with different variances represent different contexts. For each image patch, a context variable Z “selects” which Laplacian distribution will represent the ICA source signal u. The Laplacian distributions have zero mean but different variances.
The advantage of the Laplacian distribution for modeling context is that we can model a sparse distribution using only one Laplacian distribution, whereas we need more than two Gaussian distributions to do the same thing. Also, conventional ICA is a special case of our model with one Laplacian. We define the mixture model and its learning algorithm in the next sections.
Figure 1: Proposed hierarchical model (1st stage is ICA generative model. 2nd stage is mixture of “context dependent” Laplacian distributions which model U. Z is a random variable that selects a Laplacian distribution that generates the given image patch)
2.1 Mixture of Laplacian Distribution
We define a PDF for a mixture of M-dimensional Laplacian distributions as Eq.(5), where N is the number of data samples and K is the number of mixtures.
$P(U \mid \Lambda, \Pi) = \prod_{n}^{N} P(\vec{u}_n \mid \Lambda, \Pi) = \prod_{n}^{N} \sum_{k}^{K} \pi_k P(\vec{u}_n \mid \vec{\lambda}_k) = \prod_{n}^{N} \sum_{k}^{K} \pi_k \prod_{m}^{M} \frac{1}{2\lambda_{k,m}} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}}\right)$ (5)
$\vec{u}_n = (u_{n,1}, u_{n,2}, \ldots, u_{n,M})$: n-th data sample, $U = (\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_N)$
$\vec{\lambda}_k = (\lambda_{k,1}, \lambda_{k,2}, \ldots, \lambda_{k,M})$: variance of the k-th Laplacian distribution, $\Lambda = (\vec{\lambda}_1, \vec{\lambda}_2, \ldots, \vec{\lambda}_K)$
$\pi_k$: probability of Laplacian distribution k, $\Pi = (\pi_1, \ldots, \pi_K)$ and $\sum_k \pi_k = 1$
It is not easy to maximize Eq.(5) directly, so we use the EM (expectation maximization) algorithm for parameter estimation. Here we introduce a new hidden context variable Z that represents which Laplacian k is responsible for a given data point. Assuming we know the hidden variable Z, we can write the likelihood of the data and Z as Eq.(6),
$P(U, Z \mid \Lambda, \Pi) = \prod_{n}^{N} P(\vec{u}_n, Z \mid \Lambda, \Pi) = \prod_{n}^{N} \prod_{k}^{K} (\pi_k)^{z_k^n} \prod_{m}^{M} \left(\frac{1}{2\lambda_{k,m}}\right)^{z_k^n} \exp\left(-z_k^n \frac{|u_{n,m}|}{\lambda_{k,m}}\right)$ (6)
$z_k^n$: hidden binary random variable, 1 if the n-th data sample is generated from the k-th Laplacian, 0 otherwise. ($Z = (z_k^n)$ and $\sum_k z_k^n = 1$ for all n = 1…N)
2.2 EM algorithm for learning the mixture model
The EM algorithm maximizes the log likelihood of data averaged over hidden variable Z.
The log likelihood and its expectation can be computed as in Eq.(7,8).
$\log P(U, Z \mid \Lambda, \Pi) = \sum_{n,k} \left[ z_k^n \log(\pi_k) + \sum_{m} z_k^n \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right]$ (7)
$E\{\log P(U, Z \mid \Lambda, \Pi)\} = \sum_{n,k} E\{z_k^n\} \left[ \log(\pi_k) + \sum_{m} \left( \log\frac{1}{2\lambda_{k,m}} - \frac{|u_{n,m}|}{\lambda_{k,m}} \right) \right]$ (8)
The expectation in Eq.(8) can be evaluated if we are given the data U and estimated parameters Λ and Π. For Λ and Π, the EM algorithm uses the current estimates Λ’ and Π’.
$E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \sum_{z_k^n = 0}^{1} z_k^n P(z_k^n \mid \vec{u}_n, \Lambda', \Pi') = P(z_k^n = 1 \mid \vec{u}_n, \Lambda', \Pi') = \frac{P(\vec{u}_n \mid z_k^n = 1, \Lambda', \Pi') P(z_k^n = 1 \mid \Lambda', \Pi')}{P(\vec{u}_n \mid \Lambda', \Pi')} = \frac{1}{c_n} \pi_k' \prod_{m}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$ (9)
where the normalization constant can be computed as
$c_n = P(\vec{u}_n \mid \Lambda', \Pi') = \sum_{k=1}^{K} P(\vec{u}_n \mid z_k^n = 1, \Lambda', \Pi') P(z_k^n = 1 \mid \Lambda', \Pi') = \sum_{k=1}^{K} \pi_k' \prod_{m=1}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$ (10)
The EM algorithm works by maximizing Eq.(8), given the expectation computed from Eq.(9,10). Eq.(9,10) can be computed using Λ’ and Π’ estimated in the previous iteration of the EM algorithm. This is the E-step of the EM algorithm. Then, in the M-step, we need to maximize Eq.(8) over the parameters Λ and Π. First, we can maximize Eq.(8) with respect to Λ by setting the derivative to 0.
$\frac{\partial E\{\log P(U, Z \mid \Lambda, \Pi)\}}{\partial \lambda_{k,m}} = 0 = \sum_{n} E\{z_k^n\} \left( -\frac{1}{\lambda_{k,m}} + \frac{|u_{n,m}|}{(\lambda_{k,m})^2} \right) \Rightarrow \lambda_{k,m} = \frac{\sum_{n} E\{z_k^n\} \cdot |u_{n,m}|}{\sum_{n} E\{z_k^n\}}$ (11)
Second, for maximization of Eq.(8) with respect to Π, we can rewrite Eq.(8) as below.
$E\{\log P(U, Z \mid \Lambda, \Pi)\} = C + \sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'})$ (12)
As we see, the derivative of Eq.(12) with respect to Π cannot be 0, so the constraint $\sum_{k'} \pi_{k'} = 1$ must be enforced explicitly. We therefore use the Lagrange multiplier method for maximization. A Lagrange function can be defined as Eq.(13), where ρ is a Lagrange multiplier.
$L(\Pi, \rho) = -\sum_{n,k'} E\{z_{k'}^n\} \log(\pi_{k'}) + \rho \left( \sum_{k'} \pi_{k'} - 1 \right)$ (13)
By setting the derivative of Eq.(13) to 0 with respect to ρ and Π, we obtain the maximizing solution for Π. We just show the solution in Eq.(14).
$\frac{\partial L(\Pi, \rho)}{\partial \Pi} = 0, \quad \frac{\partial L(\Pi, \rho)}{\partial \rho} = 0 \Rightarrow \pi_k = \frac{\sum_{n} E\{z_k^n\}}{\sum_{k} \sum_{n} E\{z_k^n\}}$ (14)
Then the EM algorithm can be summarized as in figure 2. For the convergence criterion, we can use the expectation of the log likelihood, which can be calculated from Eq. (8).
1. Initialize $\pi_k = \frac{1}{K}$, $\lambda_{k,m} = E\{|u_m|\} + e$ (e is small random noise)
2. Calculate the expectation by $E\{z_k^n\} \equiv E\{z_k^n \mid U, \Lambda', \Pi'\} = \frac{1}{c_n} \pi_k' \prod_{m}^{M} \frac{1}{2\lambda_{k,m}'} \exp\left(-\frac{|u_{n,m}|}{\lambda_{k,m}'}\right)$
3. Maximize the log likelihood given the expectation: $\lambda_{k,m} \leftarrow \sum_{n} E\{z_k^n\} \cdot |u_{n,m}| / \sum_{n} E\{z_k^n\}$, $\pi_k \leftarrow \sum_{n} E\{z_k^n\} / \sum_{k} \sum_{n} E\{z_k^n\}$
4. If (converged) stop, otherwise repeat from step 2.
Figure 2: Outline of EM algorithm for Learning the Mixture Model
3 Experimental Results
Here we provide examples of image data and show how the learning procedure is performed for the mixture model. We also provide visualization of learned variances that reveal the structure of variance correlation and an application to image denoising.
3.1 Learning Nonlinear Dependencies in Natural images
As shown in figure 1, the 1st stage of the proposed model is simply the linear ICA. The ICA matrices A and W ($= A^{-1}$) are learned by the FastICA algorithm [9]. We sampled $10^5$ (=N) data from 16x16 patches (256 dim.) of natural images and use them for both first and second stage learning. The ICA input dimension is 256, and the source dimension is set to 160 (=M). The learned ICA basis is partially shown in figure 1. The 2nd stage mixture model is learned given the ICA source signals. In the 2nd stage the number of mixtures is set to 16, 64, or 256 (=K). Training by the EM algorithm is fast and several hundred iterations are sufficient for convergence (0.5 hour on a 1.7GHz Pentium PC).
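The four steps of figure 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_laplace`, `e_step`, `m_step`, `fit`), the multiplicative form of the initialization noise, the log-sum-exp stabilization, and the relative-improvement stopping rule are all choices made here for the example.

```python
import math
import random


def sample_laplace(scales, rng):
    """Draw one zero-mean Laplacian vector (exponential magnitude, random sign)."""
    return [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / s) for s in scales]


def log_laplace(u, lam):
    """Log of prod_m 1/(2 lam_m) exp(-|u_m| / lam_m), the inner term of Eq. (5)."""
    return sum(-math.log(2.0 * l) - abs(x) / l for x, l in zip(u, lam))


def e_step(U, pis, lams):
    """Return responsibilities E{z_k^n} (Eq. 9) and sum_n log c_n (Eq. 10)."""
    resp, loglik = [], 0.0
    for u in U:
        logs = [math.log(max(p, 1e-300)) + log_laplace(u, lam)
                for p, lam in zip(pis, lams)]
        top = max(logs)  # log-sum-exp shift for numerical stability (assumption)
        w = [math.exp(v - top) for v in logs]
        c = sum(w)
        resp.append([x / c for x in w])
        loglik += top + math.log(c)
    return resp, loglik


def m_step(U, resp, K, M, floor=1e-8):
    """Closed-form updates: Eq. (11) for lambda and Eq. (14) for pi."""
    N = len(U)
    pis, lams = [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)
        pis.append(nk / N)
        lams.append([
            max(sum(r[k] * abs(u[m]) for r, u in zip(resp, U)) / max(nk, 1e-300), floor)
            for m in range(M)
        ])
    return pis, lams


def fit(U, K, iters=100, tol=1e-6, seed=0):
    """Run EM; returns (pis, lams, history of log likelihoods before each M-step)."""
    rng = random.Random(seed)
    N, M = len(U), len(U[0])
    mean_abs = [sum(abs(u[m]) for u in U) / N for m in range(M)]
    pis = [1.0 / K] * K
    # Step 1: lambda = E{|u_m|} plus small noise; multiplicative jitter is an
    # assumption made here so that every scale stays positive.
    lams = [[a * (1.0 + 0.5 * rng.random()) + 1e-8 for a in mean_abs] for _ in range(K)]
    history, prev = [], -math.inf
    for _ in range(iters):
        resp, ll = e_step(U, pis, lams)      # step 2
        history.append(ll)
        pis, lams = m_step(U, resp, K, M)    # step 3
        if ll - prev < tol * abs(ll):        # step 4 (stopping rule is an assumption)
            break
        prev = ll
    return pis, lams, history
```

Calling `fit(U, K=2)` on a list of equal-length source vectors returns the mixing probabilities, the per-context scale vectors, and the log likelihood trace, which should be non-decreasing from one iteration to the next.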
For the visualization of learned variance, we adapted the visualization method from [8]. Each dimension of the ICA source signal corresponds to an ICA basis (column of A), and each ICA basis is localized in both image and frequency space. Then, for each Laplacian distribution, we can display its variance vector as a set of points in image and frequency space. Each point can be color coded by variance value as in figure 3.
Figure 3: Visualization of learned variances (a1 and a2 visualize the variance of Laplacian #4, and b1 and b2 show that of Laplacian #5. High variance value is mapped to red color and low variance is mapped to blue. In Laplacian #4, variances for diagonally oriented edges are high. But in Laplacian #5, variances for edges at spatially right position are high. Variance structures are related to “contexts” in the image. For example, Laplacian #4 explains image patches that have oriented textures or edges. Laplacian #5 captures patches where the left side of the patch is clean but the right side is filled with randomly oriented edges.)
A key idea of our model is that we can mix independent distributions to get a nonlinearly dependent distribution. This modeling power is shown in figure 4.
Figure 4: Joint distribution of nonlinearly dependent sources. ((a) is a joint histogram of 2 ICA sources, (b) is computed from the learned mixture model, and (c) is from the learned Laplacian model. In (a), the variance of u2 is smaller than u1 at the center area (arrow A), but almost equal to u1 at the outside (arrow B). So the variance of u2 is dependent on u1. This nonlinear dependency is closely approximated by the mixture model in (b), but not in (c).)
3.2 Unsupervised Image Segmentation
The idea behind our model is that the image can be modeled as a mixture of different variance-correlated “contexts”. We show how the learned model can be used to classify different contexts in an unsupervised image segmentation task.
Given the learned model and data, we can compute the expectation of the hidden variable Z from Eq. (9). Then, for an image patch, we can select the Laplacian distribution with the highest probability, which is the most explaining Laplacian or “context”. For segmentation, we use the model with 16 Laplacians. This enables abstract partitioning of images, and we can visualize the organization of images more clearly (figure 5).
Figure 5: Unsupervised image segmentation (left is original image, middle is color labeled image, right image shows color coded Laplacians with variance structure. Each color corresponds to a Laplacian distribution, which represents surface or textural organization of underlying contexts. Laplacian #14 captures smooth surface and Laplacian #9 captures contrast between clear sky and textured ground scenes.)
3.3 Application to Image Restoration
The proposed mixture model provides a better parametric model of the ICA source distribution and hence an improved model of the image structure. An advantage is in the MAP (maximum a posteriori) estimation of a noisy image. If we assume Gaussian noise n, the image generation model can be written as Eq.(15). Then, we can compute the MAP estimate of the ICA source signal u by Eq.(16) and reconstruct the original image.
$X = Au + n$ (15)
$\hat{u} = \arg\max_u \log P(u \mid X, A) = \arg\max_u \left( \log P(X \mid u, A) + \log P(u) \right)$ (16)
Since we assumed Gaussian noise, P(X|u,A) in Eq. (16) is Gaussian. P(u) in Eq. (16) can be modeled as a Laplacian or a mixture of Laplacian distributions. The mixture distribution can be approximated by a maximum explaining Laplacian. We evaluated 3 different methods for image restoration: ICA MAP estimation with a simple Laplacian prior, the same with the Laplacian mixture prior, and the Wiener filter. Figure 6 shows an example and figure 7 summarizes the results obtained with different noise levels.
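To make Eq. (16) concrete, the sketch below works out one special case that is an assumption of this example rather than something stated in the text: if A is orthonormal and the noise is i.i.d. Gaussian with variance `noise_var`, then the noisy coefficients y = Wx are corrupted independently, and maximizing $-(y_m - u_m)^2 / (2\,\text{noise\_var}) - |u_m|/\lambda_m$ per coefficient gives soft thresholding with threshold `noise_var / lam`. The "maximum explaining Laplacian" is chosen here by scoring the noisy coefficients directly, which is also an assumption; all function names are hypothetical.

```python
import math


def soft_threshold(y, t):
    """Shrink y toward zero by t, clipping at zero."""
    return math.copysign(max(abs(y) - t, 0.0), y)


def map_laplacian(y, lam, noise_var):
    """Per-coefficient MAP estimate under a Laplacian prior with scales lam.

    Assumes an orthonormal basis and i.i.d. Gaussian noise (not stated in the text).
    """
    return [soft_threshold(yi, noise_var / li) for yi, li in zip(y, lam)]


def best_context(y, pis, lams):
    """Index k maximizing pi_k * prod_m Laplace(y_m; lam_{k,m}), cf. Eq. (9)."""
    def score(k):
        return math.log(pis[k]) + sum(
            -math.log(2.0 * l) - abs(v) / l for v, l in zip(y, lams[k])
        )
    return max(range(len(pis)), key=score)


def map_mixture(y, pis, lams, noise_var):
    """Approximate the mixture prior by its most explaining Laplacian, then shrink."""
    k = best_context(y, pis, lams)
    return k, map_laplacian(y, lams[k], noise_var)
```

With a single Laplacian every coefficient in a given dimension is shrunk by the same amount; with the mixture, the selected context lowers the threshold for dimensions it expects to be active and raises it for dimensions it expects to be quiet.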
As shown, MAP estimation with the mixture prior performs better than the others in terms of SNR and SSIM (Structural Similarity Measure) [10].
Figure 6: Image restoration results (signal variance 1.0, noise variance 0.81)
[Figure 7 plots: SNR and SSIM index versus noise variance; legend entries: ICA MAP (Mixture prior), ICA MAP (Laplacian prior), Wiener, Noisy Image]
Figure 7: SNR and SSIM for 3 different algorithms (signal variance = 1.0)
4 Discussion
We proposed a mixture model to learn nonlinear dependencies of ICA source signals for natural images. The proposed mixture of Laplacian distribution model is a generalization of the conventional independent source priors and can model variance dependency given natural image signals. Experiments show that the proposed model can learn the variance-correlated signals grouped as different mixtures and learn high-level structures, which are highly correlated with the underlying physical properties captured in the image. Our model provides an analytic prior of nearly independent and variance-correlated signals, which was not viable in previous models [4,5,6,7,8]. The learned variances of the mixture model show structured localization in image and frequency space, which is similar to the result in [8]. Since the model is given no information about the spatial location or frequency of the source signals, we can assume that the dependency captured by the mixture model reveals regularity in the natural images. As shown in the image labeling experiments, such regularities correspond to specific surface types (textures) or boundaries between surfaces. The learned mixture model can be used to discover hidden contexts that generated such regularity or correlated signal groups. Experiments also show that the labeling of image patches is highly correlated with the object surface types shown in the image.
The segmentation results show regularity across image space and strong correlation with high-level concepts. Finally, we showed applications of the model to image restoration. We compared the performance with conventional ICA MAP estimation and the Wiener filter. Our results suggest that the proposed model outperforms these traditional methods. This is due to the estimation of the correlated variance structure, which provides an improved prior that has not been considered in other methods. In our future work, we plan to exploit the regularity of the image segmentation result to learn more high-level structures by building additional hierarchies on the current model. Furthermore, the application to image coding seems promising.
References
[1] A. J. Bell and T. J. Sejnowski, The ‘Independent Components’ of Natural Scenes are Edge Filters, Vision Research, 37(23):3327–3338, 1997.
[2] A. Hyvarinen, Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation, Neural Computation, 11(7):1739-1768, 1999.
[3] T. Lee, M. Lewicki, and T. Sejnowski, ICA Mixture Models for unsupervised Classification of non-gaussian classes and automatic context switching in blind separation, PAMI, 22(10), October 2000.
[4] J. Portilla, V. Strela, M. J. Wainwright and E. P. Simoncelli, Image Denoising using Scale Mixtures of Gaussians in the Wavelet Domain, IEEE Trans. On Image Processing, Vol. 12, No. 11, 1338-1351, 2003.
[5] A. Hyvarinen, P. O. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neurocomputing, 1999.
[6] A. Hyvarinen, P. O. Hoyer, Topographic Independent component analysis as a model of V1 Receptive Fields, Neurocomputing, Vol. 38-40, June 2001.
[7] M. Welling, G. E. Hinton and S. Osindero, Learning Sparse Topographic Representations with Products of Student-t Distributions, NIPS, 2002.
[8] M. S. Lewicki and Y. Karklin, Learning higher-order structures in natural images, Network: Comput. Neural Syst. 14 (August 2003) 483-499.
[9] A. Hyvarinen, P. O. Hoyer, Fast ICA matlab code, http://www.cis.hut.fi/projects/compneuro/extensions.html/
[10] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, The SSIM Index for Image Quality Assessment, IEEE Transactions on Image Processing, vol. 13, no. 4, Apr. 2004.
3 0.73022729 100 nips-2004-Learning Preferences for Multiclass Problems
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Many interesting multiclass problems can be cast in the general framework of label ranking defined on a given set of classes. The evaluation for such a ranking is generally given in terms of the number of violated order constraints between classes. In this paper, we propose the Preference Learning Model as a unifying framework to model and solve a large class of multiclass problems in a large margin perspective. In addition, an original kernel-based method is proposed and evaluated on a ranking dataset with state-of-the-art results.
4 0.68602908 54 nips-2004-Distributed Information Regularization on Graphs
Author: Adrian Corduneanu, Tommi S. Jaakkola
Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks.
5 0.59264702 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan
Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data.
Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a top-down perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM) [1–3]. GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination).
GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyvärinen [4]. The mixer variables provide the top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency [5, 6], and the fact that the marginal distributions of receptive field-like filters have high kurtosis [7, 8]. In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework [9] and statistical models for divisive gain control [6]. Coordinated structure has also been addressed in other image work [10–14], and in other domains such as speech [15] and finance [16]. Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure [17–21]. The idea is to learn linear filters (eg modeling simple cells as in [22, 23]), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables.
∗ Mixer variables are also called multipliers, but are unrelated to the scales of a wavelet.
Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data.
1 GSM inference of Gaussian and mixer variables
In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn}, by a common mixer variable v.
$l = vg$ (1)
We assume g are uncorrelated ($\sigma^2$ along the diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution:
$p[v] \propto [v \exp(-v^2/2)]^a$ (2)
where $0 < a \le 1$ parameterizes the strength of the prior. For ease, we develop the theory for a = 1. As is well known [2], and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn, v from the n filter responses l1 . . . ln. It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model (rv; distribution; posterior mean):
$p[v \mid l_1]$; distribution $\frac{\sqrt{\sigma/|l_1|}}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \exp\left(-\frac{v^2}{2} - \frac{l_1^2}{2 v^2 \sigma^2}\right)$; posterior mean $\sqrt{\frac{|l_1|}{\sigma}} \, \frac{B\left(1, \frac{|l_1|}{\sigma}\right)}{B\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[v \mid l]$; distribution $\frac{(l/\sigma)^{\frac{1}{2}(n-2)}}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)} \, v^{-(n-1)} \exp\left(-\frac{v^2}{2} - \frac{l^2}{2 v^2 \sigma^2}\right)$; posterior mean $\sqrt{\frac{l}{\sigma}} \, \frac{B\left(\frac{3-n}{2}, \frac{l}{\sigma}\right)}{B\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}$
$p[|g_1| \mid l_1]$; distribution $\frac{\sqrt{\sigma |l_1|}}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \, \frac{1}{g_1^2} \exp\left(-\frac{g_1^2}{2\sigma^2} - \frac{l_1^2}{2 g_1^2}\right)$; posterior mean $\sqrt{\sigma |l_1|} \, \frac{B\left(0, \frac{|l_1|}{\sigma}\right)}{B\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}$
$p[|g_1| \mid l]$; distribution $\frac{(l/\sigma)^{\frac{1}{2}(n-2)} \, |l_1|^{(2-n)}}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)} \, |g_1|^{(n-3)} \exp\left(-\frac{g_1^2}{2\sigma^2} \frac{l^2}{l_1^2} - \frac{l_1^2}{2 g_1^2}\right)$; posterior mean $|l_1| \sqrt{\frac{\sigma}{l}} \, \frac{B\left(\frac{n}{2} - \frac{1}{2}, \frac{l}{\sigma}\right)}{B\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}$
where B(n, x) is the modified Bessel function of the second kind (see also [24]), $l = \sqrt{\sum_i l_i^2}$, and gi is forced to have the same sign as li, since the mixer variables are always positive.
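The posterior over the mixer can be checked numerically without Bessel functions. The sketch below is an illustration only (the function names and the midpoint-rule quadrature are assumptions of this example): for the Rayleigh prior with a = 1, combining $p[v] \propto v \exp(-v^2/2)$ with the Gaussian likelihood of n responses gives an unnormalized posterior $v^{-(n-1)} \exp(-v^2/2 - l^2/(2 v^2 \sigma^2))$, which is integrated on a grid. For n = 1 the normalizer has the elementary closed form $\sqrt{\pi/2}\, e^{-|l_1|/\sigma}$, which follows from $B(\frac{1}{2}, x) = \sqrt{\pi/(2x)}\, e^{-x}$.

```python
import math


def posterior_unnorm(v, l_norm, n, sigma):
    """Unnormalized p[v | l] for a = 1: v^{-(n-1)} exp(-v^2/2 - l^2 / (2 v^2 sigma^2))."""
    return v ** (-(n - 1)) * math.exp(
        -0.5 * v * v - l_norm ** 2 / (2.0 * v * v * sigma ** 2)
    )


def posterior_mean_v(l, sigma=1.0, v_max=12.0, steps=20000):
    """Return (E[v | l], normalizer) by midpoint quadrature on (0, v_max).

    l is a list of filter responses assumed to share one mixer variable;
    its norm must be nonzero when len(l) > 1.
    """
    n = len(l)
    l_norm = math.sqrt(sum(x * x for x in l))
    h = v_max / steps
    num = den = 0.0
    for i in range(steps):
        v = (i + 0.5) * h  # midpoints avoid evaluating at v = 0
        w = posterior_unnorm(v, l_norm, n, sigma)
        num += v * w
        den += w
    return num / den, den * h
```

Passing a single response gives the local estimate of row 1; passing all responses that share a mixer gives the pooled estimate of row 2.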
Note that p[v|l1] and p[g1|l1] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln}. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al [25]. The full GSM specifies a hierarchy of mixer variables. Wainwright [2] considered a prespecified tree-based hierarchical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li, it is advantageous to learn the hierarchical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki [19] suggested generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values.
† We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images.
Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα, or mixer variable vβ. B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses (1 filter, too local; 20 filters; 40 filters, too global). C Distribution of estimate of the mixer variable vα. Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1. E Joint conditional statistics of the estimates of Gaussian variables g1 and g2.
Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα|{l}) and one of the Gaussian variables E(g1|{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not match the actual distributions well. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each estimating their respective g1 and g2. Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images.
Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, vα, vβ, or vγ, is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα, vβ, and vγ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα, vβ, and vγ. C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling.
Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model [26].
2 Neighborhood inference: solving the assignment problem
The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm. We assume that gi has association probability pij (satisfying $\sum_j p_{ij} = 1, \forall i$) of being assigned to mixer variable vj. The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments:
$l_i = g_i v_{s_i}$ (3)
Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li, we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al. [27] suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij.
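The generative side of equation 3 and the M phase can be illustrated with a short sketch. Several things here are assumptions made for the example and are not taken from the text: the mixers are drawn from a Rayleigh distribution with a = 1 (so that inverse-CDF sampling applies), the M phase is taken to be a relative-frequency update of pij, and the Gibbs E phase is not implemented (the update is demonstrated on the true assignments). All function names are hypothetical, and mixers are indexed from 0.

```python
import math
import random


def sample_rayleigh(rng):
    """Inverse-CDF draw from p[v] = v exp(-v^2 / 2) (equation 2 with a = 1)."""
    return math.sqrt(-2.0 * math.log(1.0 - rng.random()))


def sample_patch(P, rng, sigma=1.0):
    """Draw one patch of filter responses from the mixture GSM.

    P[i][j] is the association probability p_ij of filter i with mixer j.
    Returns the responses l and the assignments s, with l_i = g_i * v_{s_i}.
    """
    m = len(P[0])
    v = [sample_rayleigh(rng) for _ in range(m)]              # one value per mixer
    s = [rng.choices(range(m), weights=row)[0] for row in P]  # assignment per filter
    l = [rng.gauss(0.0, sigma) * v[si] for si in s]
    return l, s


def update_associations(assignments, n_filters, m):
    """Assumed M phase: p_ij = fraction of patches in which filter i used mixer j."""
    counts = [[0] * m for _ in range(n_filters)]
    for s in assignments:
        for i, si in enumerate(s):
            counts[i][si] += 1
    total = len(assignments)
    return [[c / total for c in row] for row in counts]
```

In a full implementation the list of assignments passed to `update_associations` would come from Gibbs samples of the posterior over s given the observed responses, rather than from the generative draw.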
Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged close to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers [2], allowing for a range of image coefficient behaviors.
3 Image data
Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid [28], with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel.
The image ensemble comprises 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = 0.1 (as in figure 2). Figure 3 depicts the association weights $p_{ij}$ of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P(v|patch). The first neighborhood (in B) prefers vertical patterns across most of its "receptive field", while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions [17, 18], as do the maximal patches [19]. Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters [25] and to simulations with a larger number of neighborhoods.

Figure 3: A: Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid (X and Y position from -19 to 19). Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B: First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown are four 38×38 pixel patches that had the maximum log likelihood of P(v|patch), and the average of the first 200 maximal patches. C: Other image ensemble neighborhoods. D: Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables (Gibbs fit versus assumed distributions).

From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent.
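Estimating the mixer and Gaussian variables from a set of coefficients assigned to one mixer can be sketched numerically. The paper refers to a table in its section 1 for these estimates; the code below does not reproduce that table. It instead evaluates the posterior means $E(v \mid l)$ and $E(g_i \mid l) = l_i \, E(1/v \mid l)$ on a grid, assuming a unit-scale Rayleigh prior on the mixer. The prior, the grid range, and the function name are assumptions.

```python
import numpy as np

def posterior_moments(l_group, v_grid=None):
    """Posterior means E[v | l] and E[g_i | l] for coefficients sharing one mixer.

    Model: l_i = g_i * v, g_i ~ N(0, 1), v ~ Rayleigh(1) (assumed prior).
    """
    if v_grid is None:
        v_grid = np.linspace(0.01, 12.0, 2400)    # uniform grid over mixer values
    l_group = np.asarray(l_group, dtype=float)
    n = l_group.size
    q = float(np.sum(l_group**2))
    # log of prior v*exp(-v^2/2) times n Gaussian likelihoods with std v
    log_w = (1 - n) * np.log(v_grid) - 0.5 * v_grid**2 - 0.5 * q / v_grid**2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                  # grid spacing cancels on a uniform grid
    e_v = float(np.sum(w * v_grid))               # E[v | l]
    e_inv_v = float(np.sum(w / v_grid))           # E[1/v | l]
    return e_v, l_group * e_inv_v                 # E[g_i | l] = l_i * E[1/v | l]

e_v_small, _ = posterior_moments([0.1, 0.1])
e_v_large, e_g = posterior_moments([5.0, -5.0])
print(e_v_small < e_v_large)   # larger responses imply a larger inferred mixer
```

The design choice here is to trade accuracy for transparency: a plain weighted sum over a grid makes the dependence of the mixer estimate on the squared norm of the assigned coefficients easy to inspect, at the cost of precision when the posterior mass lies outside the grid.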
We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches that yielded the maximum log likelihood of P(v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B).

Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B: Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P(v|patch). Right: Image and, marked on top of it, example patches (top 25) that yielded the maximum likelihood of P(v|patch).

4 Discussion

Work on the study of natural image statistics has recently evolved from issues about scale-space hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components.
This includes bottom-up (e.g., bowties, hierarchical representations such as complex cells) and top-down (e.g., GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. A most critical aspect of a hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components.

An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells [6]; the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest.

Acknowledgements

This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion.

References

[1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.
[2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications.
[4] A Hyvärinen, J Hurri, and J Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003.
[5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.
[6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001.
[7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.
[8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997.
[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.
[10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.
[11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999.
[12] J Romberg, H Choi, and R Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] L Wiskott and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.
6 0.55921847 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
7 0.55190176 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.55036384 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
9 0.54916978 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
10 0.54804832 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
11 0.54801863 178 nips-2004-Support Vector Classification with Input Data Uncertainty
12 0.5464735 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
13 0.546175 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
14 0.54469508 28 nips-2004-Bayesian inference in spiking neurons
15 0.54399776 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
16 0.54279393 145 nips-2004-Parametric Embedding for Class Visualization
17 0.54215837 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
18 0.5418238 165 nips-2004-Semi-supervised Learning on Directed Graphs
19 0.54155117 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
20 0.54039431 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill