nips nips2012 nips2012-309 knowledge-graph by maker-knowledge-mining

309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning


Source: pdf

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 At root, the reason is that eigenvectors are inherently global quantities. [sent-10, score-0.509]

2 In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. [sent-11, score-0.982]

3 These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. [sent-12, score-0.682]

4 We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. [sent-13, score-0.422]

5 , given in a semi-supervised manner, that certain “regions” of the data graph are of particular interest. [sent-20, score-0.092]

6 In social and information network analysis, one might have a small “seed set” of nodes that belong to a cluster or community of interest [2, 13]; in this case, one might want to perform link or edge prediction, or one might want to “refine” the seed set in order to find other nearby members. [sent-27, score-0.352]

7 These examples present considerable challenges for spectral techniques and traditional eigenvectorbased methods. [sent-32, score-0.048]

8 At root, the reason is that eigenvectors are inherently global quantities, thus limiting their applicability in situations where one is interested in very local properties of the data. [sent-33, score-0.529]

9 To achieve this, we will formulate an optimization ansatz that is a variant of the usual global spectral graph partitioning optimization problem that includes a natural locality constraint as well as an orthogonality constraint, and we will iteratively solve this problem. [sent-35, score-0.436]

10 In more detail, assume that we are given as input a (possibly weighted) data graph G = (V, E), an indicator vector s of a small “seed set” of nodes, a correlation parameter κ ∈ [0, 1], and a positive integer k. [sent-36, score-0.175]

11 (We emphasize that the seed set s of nodes, the integer k, and the correlation parameter κ are part of the input; and thus they should be thought of as being available in a semi-supervised manner. [sent-38, score-0.279]

12 From this perspective, our extension incorporates a natural orthogonality constraint that successive vectors need to be orthogonal to previous vectors. [sent-45, score-0.101]

13 The objectives in both [11] and [14] are considered constrained eigenvalue problems, that can be solved by finding the smallest eigenvalue of an asymmetric generalized eigenvalue problem, but in practice this procedure can be highly unstable [8]. [sent-51, score-0.129]

14 The SGT reduces the instabilities by performing all calculations in a subspace spanned by the d smallest eigenvectors of the graph Laplacian, whereas [14] perform a binary search, exploiting the monotonic relationship between a control parameter and the corresponding Lagrange multiplier. [sent-52, score-0.566]

15 In parallel, [3] and a large body of subsequent work including [6] used eigenvectors of the graph Laplacian to perform dimensionality reduction and data representation, in unsupervised and semisupervised settings. [sent-53, score-0.56]

16 Many of these diffusion-based spectral methods also have a natural interpretation in terms of spectral ranking [21]. [sent-55, score-0.128]

17 “Topic sensitive” and “personalized” versions of these spectral ranking methods have also been studied [9, 10]; and these were the motivation for diffusion-based methods to find locally-biased clusters in large graphs [19, 1, 14]. [sent-56, score-0.099]

18 Our optimization ansatz is a generalization of the linear equation formulation of the PageRank procedure [17, 14, 21], and the solution involves Laplacian-based linear equation solving, which has been suggested as a primitive 2 of more general interest in large-scale data analysis [20]. [sent-57, score-0.093]

19 2 Background and Notation Let G = (V, E, w) be a connected undirected graph with n = |V | vertices and m = |E| edges, in which edge {i, j} has non-negative weight wij . [sent-61, score-0.142]

20 In the following, AG ∈ RV ×V will denote the adjacency matrix of G, while DG ∈ RV ×V will denote the diagonal degree matrix of G, i. [sent-62, score-0.056]

21 , DG (i, i) = di = {i,j}∈E wij , the weighted degree of vertex i. [sent-64, score-0.051]

22 Moreover, for a set of vertices def S ⊆ V in a graph, the volume of S is vol(S) = i∈S di . [sent-65, score-0.055]

23 This implies that LG is positive semidefinite and that the all-one vector 1 ∈ RV is the eigenvector corresponding to the smallest eigenvalue 0. [sent-69, score-0.223]

24 , the eigenvector corresponding to λ2 ; v3 to denote the next eigenvector; and so on. [sent-73, score-0.143]

25 For two vectors x, y ∈ Rn , and the degree matrix DG for a graph G, we define the degree-weighted inner product def n as xT DG y = i=1 xi yi di . [sent-75, score-0.183]

26 Given a subset of vertices S ⊆ V , we denote by 1S the indicator vector of S in RV and by 1 the vector in RV having all entries set equal to 1. [sent-77, score-0.066]

27 1 Optimization Approach to Semi-supervised Eigenvectors Motivation for the Program Recall the optimization perspective on how one computes the leading nontrivial global eigenvectors of the normalized Laplacian LG . [sent-79, score-0.608]

28 The first nontrivial eigenvector v2 is the solution to the problem G LOBAL S PECTRAL that is presented on the left of Figure 1. [sent-80, score-0.226]

29 Equivalently, although G LOBAL S PEC TRAL is a non-convex optimization problem, strong duality holds for it and it’s solution may be computed as v2 , the leading nontrivial generalized eigenvector of LG . [sent-81, score-0.283]

30 The next eigenvector v3 is the solution to G LOBAL S PECTRAL, augmented with the constraint that xT DG v2 = 0; and in general the tth generalized eigenvector of LG is the solution to G LOBAL S PECTRAL, augmented with the constraints that xT DG vi = 0, for i ∈ {2, . [sent-82, score-0.375]

31 Clearly, this set of constraints and the constraint xT DG 1 = 0 can be written as xT DG Q = 0, where 0 is a (t − 1)-dimensional all-zeros vector, and where Q is an n × (t − 1) orthogonal matrix whose ith column equals vi (where v1 = 1, the all-ones vector, is the first column of Q). [sent-86, score-0.108]

32 Also presented in Figure 1 is L OCAL S PECTRAL, which includes a constraint requiring the solution to be well-correlated with an input seed set. [sent-87, score-0.277]

33 This L OCAL S PECTRAL optimization problem was introduced in [14], where it was shown that the solution to L OCAL S PECTRAL may be interpreted as a locally-biased version of the second eigenvector of the Laplacian. [sent-88, score-0.186]

34 2 Our Main Algorithm We will formulate the problem of computing semi-supervised vectors in terms of a primitive optimization problem of independent interest. [sent-91, score-0.06]

35 For this problem, we are given a graph G = (V, E), with associated Laplacian matrix LG and diagonal degree matrix DG ; an indicator vector s of a small 3 G LOBAL S PECTRAL L OCAL S PECTRAL T G ENERALIZED L OCAL S PECTRAL minimize xT LG x T minimize x LG x minimize x LG x s. [sent-93, score-0.178]

36 t xT DG x = 1 xT DG 1 = 0 xT DG Q = 0 √ xT DG s ≥ κ xT D G 1 = 0 √ xT D G s ≥ κ Figure 1: Left: The usual G LOBAL S PECTRAL partitioning optimization problem; the vector achieving the optimal solution is v2 , the leading nontrivial generalized eigenvector of LG with respect to DG . [sent-96, score-0.344]

37 Middle: The L OCAL S PECTRAL optimization problem, which was originally introduced in [14]; for κ = 0, this coincides with the usual global spectral objective, while for κ > 0, this produces solutions that are biased toward the seed vector s. [sent-97, score-0.421]

38 Right: The G ENERALIZED L OCAL S PECTRAL optimization problem we introduce that includes both the locality constraint and a more general orthogonality constraint. [sent-98, score-0.114]

39 Our main algorithm for computing semi-supervised eigenvectors will iteratively compute the solution to G ENERALIZED L OCAL S PECTRAL for a sequence of Q matrices. [sent-99, score-0.462]

40 “seed set” of nodes; a correlation parameter κ ∈ [0, 1]; and an n×ν constraint matrix Q that may be assumed to be an orthogonal matrix. [sent-101, score-0.131]

41 While s can be a general unit vector orthogonal to 1, it may be helpful to think of s as the indicator vector of one or more vertices in V , corresponding to the target region of the graph. [sent-103, score-0.116]

42 In words, the problem G ENERALIZED L OCAL S PECTRAL asks us to find a vector x ∈ Rn that minimizes the variance xT LG x subject to several constraints: that x is unit length; that x is orthogonal √ to the span of Q; and that x is κ-well-correlated with the input seed set vector s. [sent-104, score-0.305]

43 , the n-dimensional all-ones vector, which is the trivial eigenvector of LG , in which case Q is an n × 1 matrix; and to compute each subsequent semi-supervised eigenvector, we let the columns of Q consist of 1 and the other semi-supervised eigenvectors found in each of the previous iterations. [sent-108, score-0.588]

44 That is, F is a span for the null space of QT ; and we will take F to be an orthogonal matrix. [sent-111, score-0.049]

45 To compute the first semi-supervised eigenvector, we will let Q = 1, the all-ones vector, in which case the first nontrivial semi-supervised eigenvector is + x∗ = c (LG − γ1 DG ) DG s, 1 (3) where γ1 is chosen to saturate the part of the correlation constraint along the first direction. [sent-123, score-0.286]

46 ) That is, to find the correct setting of γ1 , it suffices to perform a binary search over the possible values of γ1 in the interval (−vol(G), λ2 (G)) until the correlation constraint is satisfied, that is, until (sT DG x)2 is sufficiently close to κ2 , see [8, 14]. [sent-127, score-0.082]

47 , k if one ultimately wants a total of k semi-supervised eigenvectors, then one lets Q be the n × (t − 1) matrix with first column equal to 1 and with j th column, for i = 2, . [sent-133, score-0.051]

48 (2), the tth semi-supervised eigenvector takes the form x∗ t = c F F T (LG − γt DG )F F T + DG s. [sent-142, score-0.159]

49 (4) Algorithm 1 Semi-supervised eigenvectors Input: LG , DG , s, κ = [κ1 , . [sent-143, score-0.422]

50 (2) does ı not immediately lead to an efficient solution, since DG s will not be in the span of (F F T (LG − γDG )F F T ), thus leading to a large residual. [sent-151, score-0.053]

51 Fourth, we use that λ2 can never decrease (here we refer to λ2 as the smallest non-zero eigenvalue of the modified matrix), so we only recalculate the upper bound for the binary search when an iteration saturates without satisfying (xT DG s)2 − κt ≤ . [sent-156, score-0.116]

52 5 4 Illustrative Empirical Results In this section, we will provide a detailed empirical evaluation of our method of semi-supervised eigenvectors and how they can be used for locally-biased machine learning. [sent-158, score-0.422]

53 Our goal will be twofold: first, to illustrate how the “knobs” of our method work; and second, to illustrate the usefulness of the method in a real application. [sent-159, score-0.062]

54 To demonstrate how semi-supervised eigenvectors can focus on specific target regions of a data graph to capture slowest modes of local variation, we plot semi-supervised eigenvectors around illustrations of (non-rewired and rewired) realizations of the small-world graph; see Figure 2. [sent-170, score-1.032]

55 (d) Semi-supervised eigenvectors Figure 2: In each case, (a-d) the data consist of 3600 nodes, each connected to it’s 8 nearestneighbors. [sent-196, score-0.422]

56 In the center of each subfigure, we show the nodes (blue) and edges (black and light gray are the local edges, and blue are the randomly-rewired edges). [sent-197, score-0.079]

57 In each subfigure, we wrap a plot (black x-axis and gray background) visualizing the 4 smallest semi-supervised eigenvectors, allowing us to see the effect of random edges (different values of rewiring probability p) and degree of localization (different values of κ). [sent-198, score-0.101]

58 a, we show a graph with no randomly-rewired edges (p = 0) and a locality parameter κ such that the global eigenvectors are obtained. [sent-202, score-0.649]

59 This yields a symmetric graph with eigenvectors corresponding to orthogonal sinusoids, i. [sent-203, score-0.546]

60 , for all eigenvectors, except the all-ones with eigenvalue 0, the algebraic multiplicity is 2, i. [sent-205, score-0.05]

61 , the first two capture the slowest mode of variation and correspond to a sine and cosine with equal random phase-shift (rotational ambiguity). [sent-207, score-0.054]

62 01 and the locality parameter κ is still chosen such that the global eigenvectors of the rewired graph are obtained. [sent-210, score-0.653]

63 In particular, note small kinks in the eigenvectors at the location of the randomly added edges. [sent-211, score-0.422]

64 Since the graph is no longer symmetric, all of the visualized eigenvectors have algebraic multiplicity 1. [sent-212, score-0.532]

65 Moreover, note that the slow mode of variation in the interval on the top left; a normalized-cut based on the leading global eigenvector would extract this region since the remainder of the ring is more well-connected due to the degree of rewiring. [sent-213, score-0.32]

66 c, we see the same graph realization as in Figure 2. [sent-215, score-0.117]

67 b, except that the semi-supervised eigenvectors have a seed node at the top of the circle and the correlation 6 parameter κt = 0. [sent-216, score-0.724]

68 Note that, like the global eigenvectors, the local approach produces modes of increasing variation. [sent-218, score-0.113]

69 b; the reason for this is that this region is well-connected with the seed via a randomly added edge. [sent-220, score-0.244]

70 , γt is the Lagrange multiplier that defines the effective correlation κt . [sent-223, score-0.053]

71 Not shown is that if we kept reducing κ, then γt would tend towards λt+1 , and the respective semi-supervised eigenvector would tend towards the global eigenvector. [sent-224, score-0.211]

72 It should be clear that, in addition to being determined by the locality parameter, we can think of γ as a regularizer biasing the global eigenvectors towards the region near the seed set. [sent-228, score-0.776]

73 2 MNIST Digit Data We now demonstrate the semi-supervised eigenvectors as a feature extraction preprocessing step in a machine learning setting. [sent-230, score-0.422]

74 We construct the complete 70000 × 70000 k-NN 4 2 graph with k = 10 and with edge weights given by wij = exp(− σ2 xi − xj 2 ), where σi being i the Euclidean distance to it’s nearest neighbor, and we define the graph Laplacian in the usual way. [sent-232, score-0.232]

75 We evaluate the semi-supervised eigenvectors in a transductive learning setting by disregarding the majority of labels in the entire training data. [sent-233, score-0.459]

76 We then use a few samples from each class to seed our semi-supervised eigenvectors, and a few others to train a downstream classification algorithm. [sent-234, score-0.226]

77 Second, using the SGT based on global eigenvectors is a good point of comparison, because we are only interested in the effect of our subspace representation. [sent-237, score-0.509]

78 ) As in [11], we normalize the spectrum of both global and semi-supervised eigenvectors i2 by replacing the eigenvalues with some monotonically increasing function. [sent-239, score-0.49]

79 Clearly, other correlation distributions and values of γ may yield subspaces with even better discriminative properties1 . [sent-247, score-0.053]

80 Labeled points 1:1 1 : 10 5 : 50 10 : 100 50 : 500 #Semi-supervised eigenvectors for SGT 1 2 4 6 8 10 0. [sent-248, score-0.422]

81 04 Table 1: Classification error for the SGT based on respectively semi-supervised and global eigenvectors. [sent-308, score-0.068]

82 , 1:10 interprets as 1 seed and 10 training samples from each class (total of 22 samples - for the global approach these are all used for training). [sent-311, score-0.294]

83 When the seed is well determined and the number of training samples moderate (50:500) a single semi-supervised eigenvector is sufficient, where for less data we benefit from using multiple semi-supervised eigenvectors. [sent-312, score-0.369]

84 ) Hence, we expect localization on low order global eigenvectors, meaning that class separation will not be evident in the leading global eigenvector, but instead will be “buried” further down the spectrum. [sent-316, score-0.193]

85 Thus, this will illustrate how semi-supervised eigenvectors can represent relevant heterogeneities in a local subspace of low dimensionality. [sent-317, score-0.492]

86 Table 1 summarizes our classification results based on respectively semi-supervised and global eigenvectors. [sent-318, score-0.068]

87 7 the seed nodes, to demonstrate the influence of the seed. [sent-320, score-0.226]

88 03 5 6 7 8 9 10 11 12 #Semi-supervised eigenvectors 13 14 15 Figure 3: Left: Shows a subset of the classification results for the SGT based on 5 semi-supervised eigenvectors seeded in s+ and s− , and trained using samples l+ and l− . [sent-353, score-0.844]

89 As the seed nodes are good representatives, we note that the eigenvectors provide a good class separation. [sent-357, score-0.682]

90 We also plot the error as a function of local dimensionality, as well as the unexplained correlation, i. [sent-358, score-0.055]

91 , initial components explain the majority of the correlation with the seed (effect of γ = 0). [sent-360, score-0.279]

92 The particular realization based on the leading 5 semi-supervised eigenvectors yields an error of ≈ 0. [sent-361, score-0.483]

93 04 1 2 3 4 5 6 7 8 9 10 11 12 #Semi-supervised eigenvectors 13 14 15 Figure 4: See the general description in Figure 3. [sent-394, score-0.422]

94 In this constellation we first discover localization on low order semi-supervised eigenvectors (≈ 12 eigenvectors), which is comparable to the error based on global eigenvectors (see Table 1), i. [sent-400, score-0.933]

95 , further down the spectrum we recover from the bad seed and pickup the relevant mode of variation. [sent-402, score-0.243]

96 In summary: We introduced the concept of semi-supervised eigenvectors that are biased towards local regions of interest in a large data graph. [sent-403, score-0.481]

97 Moreover, the algorithm is scalable as the eigenvectors are computed by the solution to a sparse system of linear equations, preserving the low O(m) space complexity. [sent-405, score-0.444]

98 Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. [sent-470, score-0.058]

99 A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. [sent-500, score-0.179]

100 Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. [sent-540, score-0.184]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dg', 0.548), ('eigenvectors', 0.422), ('lg', 0.337), ('pectral', 0.328), ('ocal', 0.227), ('seed', 0.226), ('eneralized', 0.171), ('eigenvector', 0.143), ('sgt', 0.128), ('laplacian', 0.106), ('graph', 0.092), ('lobal', 0.086), ('xt', 0.077), ('global', 0.068), ('nontrivial', 0.061), ('rv', 0.055), ('correlation', 0.053), ('qt', 0.05), ('spectral', 0.048), ('pagerank', 0.048), ('mahoney', 0.044), ('locality', 0.042), ('transductive', 0.037), ('leading', 0.036), ('unexplained', 0.035), ('def', 0.034), ('nodes', 0.034), ('smallest', 0.033), ('vol', 0.033), ('ranking', 0.032), ('eigenvalue', 0.032), ('orthogonal', 0.032), ('illustrate', 0.031), ('wij', 0.029), ('constraint', 0.029), ('ansatz', 0.029), ('knobs', 0.029), ('recalculate', 0.029), ('rewired', 0.029), ('mnist', 0.028), ('partitioning', 0.027), ('st', 0.027), ('web', 0.026), ('personalized', 0.026), ('walks', 0.025), ('modes', 0.025), ('edges', 0.025), ('www', 0.025), ('realization', 0.025), ('biased', 0.024), ('andersen', 0.023), ('semisupervised', 0.023), ('subsequent', 0.023), ('circle', 0.023), ('cuts', 0.022), ('nearby', 0.022), ('solution', 0.022), ('degree', 0.022), ('orthogonality', 0.022), ('saturates', 0.022), ('heterogeneity', 0.022), ('optimization', 0.021), ('localization', 0.021), ('social', 0.021), ('primitive', 0.021), ('slowest', 0.021), ('vertices', 0.021), ('local', 0.02), ('inherit', 0.02), ('wants', 0.019), ('usual', 0.019), ('inherently', 0.019), ('graphs', 0.019), ('subspace', 0.019), ('iteratively', 0.018), ('multiplicity', 0.018), ('stanford', 0.018), ('vectors', 0.018), ('region', 0.018), ('wide', 0.017), ('matrix', 0.017), ('mode', 0.017), ('world', 0.017), ('span', 0.017), ('want', 0.017), ('variation', 0.016), ('communities', 0.016), ('tth', 0.016), ('ag', 0.016), ('vector', 0.015), ('classi', 0.015), ('community', 0.015), ('indicator', 0.015), ('regions', 0.015), ('column', 0.015), ('sub', 0.015), ('realizations', 0.015), ('methodology', 0.015), ('digit', 0.015), ('vision', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

2 0.10937089 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

3 0.090533324 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

Author: Pedro Ortega, Jordi Grau-moya, Tim Genewein, David Balduzzi, Daniel Braun

Abstract: We propose a novel Bayesian approach to solve stochastic optimization problems that involve finding extrema of noisy, nonlinear functions. Previous work has focused on representing possible functions explicitly, which leads to a two-step procedure of first, doing inference over the function space and second, finding the extrema of these functions. Here we skip the representation step and directly model the distribution over extrema. To this end, we devise a non-parametric conjugate prior based on a kernel regressor. The resulting posterior distribution directly captures the uncertainty over the maximum of the unknown function. Given t observations of the function, the posterior can be evaluated efficiently in time O(t2 ) up to a multiplicative constant. Finally, we show how to apply our model to optimize a noisy, non-convex, high-dimensional objective function.

4 0.076380171 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells

Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as a unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) deďŹ ning a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, that enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b)computationally efďŹ cient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting “massâ€? over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efďŹ ciency of the approximation, especially for large problems. 1

5 0.072506994 351 nips-2012-Transelliptical Component Analysis

Author: Fang Han, Han Liu

Abstract: We propose a high dimensional semiparametric scale-invariant principle component analysis, named TCA, by utilize the natural connection between the elliptical distribution family and the principal component analysis. Elliptical distribution family includes many well-known multivariate distributions like multivariate Gaussian, t and logistic and it is extended to the meta-elliptical by Fang et.al (2002) using the copula techniques. In this paper we extend the meta-elliptical distribution family to a even larger family, called transelliptical. We prove that TCA can obtain a near-optimal s log d/n estimation consistency rate in recovering the leading eigenvector of the latent generalized correlation matrix under the transelliptical distribution family, even if the distributions are very heavy-tailed, have infinite second moments, do not have densities and possess arbitrarily continuous marginal distributions. A feature selection result with explicit rate is also provided. TCA is further implemented in both numerical simulations and largescale stock data to illustrate its empirical usefulness. Both theories and experiments confirm that TCA can achieve model flexibility, estimation accuracy and robustness at almost no cost. 1

6 0.071138829 324 nips-2012-Stochastic Gradient Descent with Only One Projection

7 0.068651535 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

8 0.057567433 69 nips-2012-Clustering Sparse Graphs

9 0.053646445 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

10 0.052296024 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

11 0.051176224 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

12 0.050888307 346 nips-2012-Topology Constraints in Graphical Models

13 0.050613586 46 nips-2012-Assessing Blinding in Clinical Trials

14 0.049232624 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

15 0.049157273 305 nips-2012-Selective Labeling via Error Bound Minimization

16 0.048929036 254 nips-2012-On the Sample Complexity of Robust PCA

17 0.048763614 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

18 0.048523769 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

19 0.048319843 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

20 0.048214957 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.033), (2, 0.029), (3, 0.004), (4, 0.025), (5, -0.023), (6, -0.01), (7, -0.031), (8, -0.079), (9, 0.025), (10, 0.021), (11, -0.063), (12, -0.013), (13, 0.021), (14, 0.007), (15, 0.032), (16, 0.059), (17, 0.028), (18, -0.041), (19, 0.034), (20, -0.043), (21, -0.019), (22, -0.001), (23, -0.043), (24, 0.006), (25, -0.044), (26, -0.066), (27, 0.059), (28, 0.01), (29, -0.02), (30, -0.032), (31, -0.074), (32, -0.004), (33, -0.033), (34, -0.022), (35, -0.064), (36, -0.044), (37, -0.038), (38, 0.084), (39, 0.021), (40, -0.014), (41, -0.056), (42, 0.017), (43, -0.046), (44, -0.006), (45, -0.025), (46, 0.053), (47, -0.049), (48, -0.105), (49, -0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91281939 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

2 0.58244252 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

3 0.53145742 346 nips-2012-Topology Constraints in Graphical Models

Author: Marcelo Fiori, Pablo Musé, Guillermo Sapiro

Abstract: Graphical models are a very useful tool to describe and understand natural phenomena, from gene expression to climate change and social interactions. The topological structure of these graphs/networks is a fundamental part of the analysis, and in many cases the main goal of the study. However, little work has been done on incorporating prior topological knowledge onto the estimation of the underlying graphical models from sample data. In this work we propose extensions to the basic joint regression model for network estimation, which explicitly incorporate graph-topological constraints into the corresponding optimization approach. The first proposed extension includes an eigenvector centrality constraint, thereby promoting this important prior topological property. The second developed extension promotes the formation of certain motifs, triangle-shaped ones in particular, which are known to exist for example in genetic regulatory networks. The presentation of the underlying formulations, which serve as examples of the introduction of topological constraints in network estimation, is complemented with examples in diverse datasets demonstrating the importance of incorporating such critical prior knowledge. 1

4 0.52099502 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks

Author: Qirong Ho, Junming Yin, Eric P. Xing

Abstract: In this paper, we argue for representing networks as a bag of triangular motifs, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require Ω(N 2 ) time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is 2 Θ( i Di ) (where Di is the degree of vertex i), which is much smaller than N 2 for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a node-centric fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on an N ≈ 280, 000-node network, which is infeasible for network models with Ω(N 2 ) inference cost. 1

5 0.51193178 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V |3/2 ) by querying O(|V |3/2 ) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V | + (|V |/k)3/2 edge labels. The running time of this algorithm is at most of order |E| + |V | log |V |. 1

6 0.50050104 196 nips-2012-Learning with Partially Absorbing Random Walks

7 0.49936685 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

8 0.48720497 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

9 0.47609326 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

10 0.46547616 286 nips-2012-Random Utility Theory for Social Choice

11 0.46310905 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

12 0.46004045 324 nips-2012-Stochastic Gradient Descent with Only One Projection

13 0.45531863 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

14 0.45396864 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

15 0.44542283 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

16 0.44499409 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

17 0.44187963 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

18 0.43396369 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

19 0.43326783 298 nips-2012-Scalable Inference of Overlapping Communities

20 0.43237546 352 nips-2012-Transelliptical Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.039), (17, 0.011), (21, 0.015), (32, 0.306), (38, 0.12), (39, 0.03), (42, 0.023), (54, 0.025), (55, 0.028), (74, 0.065), (76, 0.113), (80, 0.07), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.75974536 23 nips-2012-A lattice filter model of the visual pathway

Author: Karol Gregor, Dmitri B. Chklovskii

Abstract: Early stages of visual processing are thought to decorrelate, or whiten, the incoming temporally varying signals. Motivated by the cascade structure of the visual pathway (retina → lateral geniculate nucelus (LGN) → primary visual cortex, V1) we propose to model its function using lattice filters - signal processing devices for stage-wise decorrelation of temporal signals. Lattice filter models predict neuronal responses consistent with physiological recordings in cats and primates. In particular, they predict temporal receptive fields of two different types resembling so-called lagged and non-lagged cells in the LGN. Moreover, connection weights in the lattice filter can be learned using Hebbian rules in a stage-wise sequential manner reminiscent of the neuro-developmental sequence in mammals. In addition, lattice filters can model visual processing in insects. Therefore, lattice filter is a useful abstraction that captures temporal aspects of visual processing. Our sensory organs face an ongoing barrage of stimuli from the world and must transmit as much information about them as possible to the rest of the brain [1]. This is a formidable task because, in sensory modalities such as vision, the dynamic range of natural stimuli (more than three orders of magnitude) greatly exceeds the dynamic range of relay neurons (less than two orders of magnitude) [2]. The reason why high fidelity transmission is possible at all is that the continuity of objects in the physical world leads to correlations in natural stimuli, which imply redundancy. In turn, such redundancy can be eliminated by compression performed by the front end of the visual system leading to the reduction of the dynamic range [3, 4]. A compression strategy appropriate for redundant natural stimuli is called predictive coding [5, 6, 7]. In predictive coding, a prediction of the incoming signal value is computed from past values delayed in the circuit. This prediction is subtracted from the actual signal value and only the prediction error is transmitted. In the absence of transmission noise such compression is lossless as the original signal could be decoded on the receiving end by inverting the encoder. If predictions are accurate, the dynamic range of the error is much smaller than that of the natural stimuli. Therefore, minimizing dynamic range using predictive coding reduces to optimizing prediction. Experimental support for viewing the front end of the visual system as a predictive encoder comes from the measurements of receptive fields [6, 7]. In particular, predictive coding suggests that, for natural stimuli, the temporal receptive fields should be biphasic and the spatial receptive fields center-surround. These predictions are born out by experimental measurements in retinal ganglion cells, [8], lateral geniculate nucleus (LGN) neurons [9] and fly second order visual neurons called large monopolar cells (LMCs) [2]. In addition, the experimentally measured receptive fields vary with signal-to-noise ratio as would be expected from optimal prediction theory [6]. Furthermore, experimentally observed whitening of the transmitted signal [10] is consistent with removing correlated components from the incoming signals [11]. As natural stimuli contain correlations on time scales greater than hundred milliseconds, experimentally measured receptive fields of LGN neurons are equally long [12]. Decorrelation over such long time scales requires equally long delays. How can such extended receptive field be produced by 1 biological neurons and synapses whose time constants are typically less than hundred milliseconds [13]? The field of signal processing offers a solution to this problem in the form of a device called a lattice filter, which decorrelates signals in stages, sequentially adding longer and longer delays [14, 15, 16, 17]. Motivated by the cascade structure of visual systems [18], we propose to model decorrelation in them by lattice filters. Naturally, visual systems are more complex than lattice filters and perform many other operations. However, we show that the lattice filter model explains several existing observations in vertebrate and invertebrate visual systems and makes testable predictions. Therefore, we believe that lattice filters provide a convenient abstraction for modeling temporal aspects of visual processing. This paper is organized as follows. First, we briefly summarize relevant results from linear prediction theory. Second, we explain the operation of the lattice filter in discrete and continuous time. Third, we compare lattice filter predictions with physiological measurements. 1 Linear prediction theory Despite the non-linear nature of neurons and synapses, the operation of some neural circuits in vertebrates [19] and invertebrates [20] can be described by a linear systems theory. The advantage of linear systems is that optimal circuit parameters may be obtained analytically and the results are often intuitively clear. Perhaps not surprisingly, the field of signal processing relies heavily on the linear prediction theory, offering a convenient framework [15, 16, 17]. Below, we summarize the results from linear prediction that will be used to explain the operation of the lattice filter. Consider a scalar sequence y = {yt } where time t = 1, . . . , n. Suppose that yt at each time point depends on side information provided by vector zt . Our goal is to generate a series of linear predictions, yt from the vector zt , yt = w · zt . We define a prediction error as: ˆ ˆ et = yt − yt = yt − w · zt ˆ (1) and look for values of w that minimize mean squared error: e2 = 1 nt e2 = t t 1 nt (yt − w · zt )2 . (2) t The weight vector, w is optimal for prediction of sequence y from sequence z if and only if the prediction error sequence e = y − w · z is orthogonal to each component of vector z: ez = 0. (3) When the whole series y is given in advance, i.e. in the offline setting, these so-called normal equations can be solved for w, for example, by Gaussian elimination [21]. However, in signal processing and neuroscience applications, another setting called online is more relevant: At every time step t, prediction yt must be made using only current values of zt and w. Furthermore, after a ˆ prediction is made, w is updated based on the prediction yt and observed yt , zt . ˆ In the online setting, an algorithm called stochastic gradient descent is often used, where, at each time step, w is updated in the direction of negative gradient of e2 : t w →w−η w (yt − w · zt ) 2 . (4) This leads to the following weight update, known as least mean square (LMS) [15], for predicting sequence y from sequence z: w → w + ηet zt , (5) where η is the learning rate. The value of η represents the relative influence of more recent observations compared to more distant ones. The larger the learning rate the faster the system adapts to recent observations and less past it remembers. In this paper, we are interested in predicting a current value xt of sequence x from its past values xt−1 , . . . , xt−k restricted by the prediction order k > 0: xt = wk · (xt−1 , . . . , xt−k )T . ˆ 2 (6) This problem is a special case of the online linear prediction framework above, where yt = xt , zt = (xt−1 , . . . , xt−k )T . Then the gradient update is given by: w → wk + ηet (xt−1 , . . . , xt−k )T . (7) While the LMS algorithm can find the weights that optimize linear prediction (6), the filter wk has a long temporal extent making it difficult to implement with neurons and synapses. 2 Lattice filters One way to generate long receptive fields in circuits of biological neurons is to use a cascade architecture, known as the lattice filter, which calculates optimal linear predictions for temporal sequences and transmits prediction errors [14, 15, 16, 17]. In this section, we explain the operation of a discrete-time lattice filter, then adapt it to continuous-time operation. 2.1 Discrete-time implementation The first stage of the lattice filter, Figure 1, calculates the error of the first order optimal prediction (i.e. only using the preceding element of the sequence), the second stage uses the output of the first stage and calculates the error of the second order optimal prediction (i.e. using only two previous values) etc. To make such stage-wise error computations possible the lattice filter calculates at every stage not only the error of optimal prediction of xt from past values xt−1 , . . . , xt−k , called forward error, ftk = xt − wk · (xt−1 , . . . , xt−k )T , (8) but, perhaps non-intuitively, also the error of optimal prediction of a past value xt−k from the more recent values xt−k+1 , . . . , xt , called backward error: bk = xt−k − w k · (xt−k+1 , . . . , xt )T , t k where w and w k (9) are the weights of the optimal prediction. For example, the first stage of the filter calculates the forward error ft1 of optimal prediction of xt from xt−1 : ft1 = xt − u1 xt−1 as well as the backward error b1 of optimal prediction of xt−1 from t xt : b1 = xt−1 − v 1 xt , Figure 1. Here, we assume that coefficients u1 and v 1 that give optimal linear t prediction are known and return to learning them below. Each following stage of the lattice filter performs a stereotypic operation on its inputs, Figure 1. The k-th stage (k > 1) receives forward, ftk−1 , and backward, bk−1 , errors from the previous stage, t delays backward error by one time step and computes a forward error: ftk = ftk−1 − uk bk−1 t−1 (10) of the optimal linear prediction of ftk−1 from bk−1 . In addition, each stage computes a backward t−1 error k−1 k bt = bt−1 − v k ftk−1 (11) of the optimal linear prediction of bk−1 from ftk−1 . t−1 As can be seen in Figure 1, the lattice filter contains forward prediction error (top) and backward prediction error (bottom) branches, which interact at every stage via cross-links. Operation of the lattice filter can be characterized by the linear filters acting on the input, x, to compute forward or backward errors of consecutive order, so called prediction-error filters (blue bars in Figure 1). Because of delays in the backward error branch the temporal extent of the filters grows from stage to stage. In the next section, we will argue that prediction-error filters correspond to the measurements of temporal receptive fields in neurons. For detailed comparison with physiological measurements we will use the result that, for bi-phasic prediction-error filters, such as the ones in Figure 1, the first bar of the forward prediction-error filter has larger weight, by absolute value, than the combined weights of the remaining coefficients of the corresponding filter. Similarly, in backward predictionerror filters, the last bar has greater weight than the rest of them combined. This fact arises from the observation that forward prediction-error filters are minimum phase, while backward predictionerror filters are maximum phase [16, 17]. 3 Figure 1: Discrete-time lattice filter performs stage-wise computation of forward and backward prediction errors. In the first stage, the optimal prediction of xt from xt−1 is computed by delaying the input by one time step and multiplying it by u1 . The upper summation unit subtracts the predicted xt from the actual value and outputs prediction error ft1 . Similarly, the optimal prediction of xt−1 from xt is computed by multiplying the input by v 1 . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error b1 . In each following stage t k, the optimal prediction of ftk−1 from bk−1 is computed by delaying bk−1 by one time step and t t multiplying it by uk . The upper summation unit subtracts the prediction from the actual ftk−1 and outputs prediction error ftk . Similarly, the optimal prediction of bk−1 from ftk−1 is computed by t−1 multiplying it by uk . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error bk . Black connections have unitary weights and red connections t have learnable negative weights. One can view forward and backward error calculations as applications of so-called prediction-error filters (blue) to the input sequence. Note that the temporal extent of the filters gets longer from stage to stage. Next, we derive a learning rule for finding optimal coefficients u and v in the online setting. The uk is used for predicting ftk−1 from bk−1 to obtain error ftk . By substituting yt = ftk−1 , zt = bk−1 and t−1 t−1 et = ftk into (5) the update of uk becomes uk → uk + ηftk bk−1 . t−1 (12) Similarly, v k is updated by v k → v k + ηbk ftk−1 . (13) t Interestingly, the updates of the weights are given by the product of the activities of outgoing and incoming nodes of the corresponding cross-links. Such updates are known as Hebbian learning rules thought to be used by biological neurons [22, 23]. Finally, we give a simple proof that, in the offline setting when the entire sequence x is known, f k and bk , given by equations (10, 11), are indeed errors of optimal k-th order linear prediction. Let D be one step time delay operator (Dx)t = xt−1 . The induction statement at k is that f k and bk are k-th order forward and backward errors of optimal linear prediction which is equivalent to f k and bk k k being of the form f k = x−w1 Dx−. . .−wk Dk x and bk = Dk x−w1k Dk−1 x−. . .−wkk x and, from k i normal equations (3), satisfying f D x = 0 and Dbk Di x = bk Di−1 x = 0 for i = 1, . . . , k. That this is true for k = 1 directly follows from the definition of f 1 and b1 . Now we assume that this is true for k − 1 ≥ 1 and show it is true for k. It is easy to see from the forms of f k−1 and bk−1 k k and from f k = f k−1 − uk Dbk−1 that f k has the correct form f k = x − w1 Dx − . . . − wk Dk x. k i k−1 k k−1 Regarding orthogonality for i = 1, . . . , k − 1 we have f D x = (f − u Db )Di x = f k−1 Di x − uk (Dbk−1 )Di x = 0 using the induction assumptions of orhogonality at k − 1. For the remaining i = k we note that f k is the error of the optimal linear prediction of f k−1 from Dbk−1 k−1 and therefore 0 = f k Dbk−1 = f k (Dk x − w1k−1 Dk−1 x − . . . + wk−1 Dx) = f k Dk x as desired. The bk case can be proven similarly. 2.2 Continuous-time implementation The last hurdle remaining for modeling neuronal circuits which operate in continuous time with a lattice filter is its discrete-time operation. To obtain a continuous-time implementation of the lattice 4 filter we cannot simply take the time step size to zero as prediction-error filters would become infinitesimally short. Here, we adapt the discrete-time lattice filter to continous-time operation in two steps. First, we introduce a discrete-time Laguerre lattice filter [24, 17] which uses Laguerre polynomials rather than the shift operator to generate its basis functions, Figure 2. The input signal passes through a leaky integrator whose leakage constant α defines a time-scale distinct from the time step (14). A delay, D, at every stage is replaced by an all-pass filter, L, (15) with the same constant α, which preserves the magnitude of every Fourier component of the input but shifts its phase in a frequency dependent manner. Such all-pass filter reduces to a single time-step delay when α = 0. The optimality of a general discrete-time Laguerre lattice filter can be proven similarly to that for the discrete-time filter, simply by replacing operator D with L in the proof of section 2.1. Figure 2: Continuous-time lattice filter using Laguerre polynomials. Compared to the discretetime version, it contains a leaky integrator, L0 ,(16) and replaces delays with all-pass filters, L, (17). Second, we obtain a continuous-time formulation of the lattice filter by replacing t − 1 → t − δt, defining the inverse time scale γ = (1 − α)/δt and taking the limit δt → 0 while keeping γ fixed. As a result L0 and L are given by: Discrete time L0 (x)t L(x)t Continuous time = αL0 (x)t−1 + xt (14) = α(L(x)t−1 − xt ) + xt−1 (15) dL0 (x)/dt = −γL0 (x) + x L(x) = x − 2γL0 (x) (16) (17) Representative impulse responses of the continuous Laguerre filter are shown in Figure 2. Note that, similarly to the discrete-time case, the area under the first (peak) phase is greater than the area under the second (rebound) phase in the forward branch and the opposite is true in the backward branch. Moreover, the temporal extent of the rebound is greater than that of the peak not just in the forward branch like in the basic discrete-time implementation but also in the backward branch. As will be seen in the next section, these predictions are confirmed by physiological recordings. 3 Experimental evidence for the lattice filter in visual pathways In this section we demonstrate that physiological measurements from visual pathways in vertebrates and invertebrates are consistent with the predictions of the lattice filter model. For the purpose of modeling visual pathways, we identify summation units of the lattice filter with neurons and propose that neural activity represents forward and backward errors. In the fly visual pathway neuronal activity is represented by continuously varying graded potentials. In the vertebrate visual system, all neurons starting with ganglion cells are spiking and we identify their firing rate with the activity in the lattice filter. 3.1 Mammalian visual pathway In mammals, visual processing is performed in stages. In the retina, photoreceptors synapse onto bipolar cells, which in turn synapse onto retinal ganglion cells (RGCs). RGCs send axons to the LGN, where they synapse onto LGN relay neurons projecting to the primary visual cortex, V1. In addition to this feedforward pathway, at each stage there are local circuits involving (usually inhibitory) inter-neurons such as horizontal and amacrine cells in the retina. Neurons of each class 5 come in many types, which differ in their connectivity, morphology and physiological response. The bewildering complexity of these circuits has posed a major challenge to visual neuroscience. Alonso et al. • Connections between LGN and Cortex J. Neurosci., June 1, 2001, 21(11):4002–4015 4009 Temporal Filter 1 0.5 0 -0.5 -1 RGC LGN 0 100 Time (ms) 200 Figure 7. Distribution of geniculate cells and simple cells with respect to the timing of their responses. The distribution of three parameters derived from impulse responses of geniculate and cortical neurons is shown. A, Peak time. B, Zero-crossing time. C, Rebound index. Peak time is the time with the strongest response in the first phase of the impulse response. Zero-crossing time is the time between the first and second phases. Rebound index is the area of the impulse response after the zero crossing divided by the area before the zero crossing. Only impulse responses with good signal to noise were included (Ͼ5 SD above baseline; n ϭ 169). Figure 3: Electrophysiologically measured temporal receptive fields get progressively longer along the cat visual pathway. Left: A cat LGN cell (red) has a longer receptive field than a corresponding RGC cell (blue) (adapted from [12] which also reports population data). Right (A,B): Extent of the temporal receptive fields of simple cells in cat V1 is greater than that of corresponding LGN cells as quantified by the peak (A) and zero-crossing (B) times. Right (C): In the temporal receptive fields of cat LGN and V1 cells the peak can be stronger or weaker than the rebound (adapted from [25]). simple cells and geniculate cells differed for all temporal parameters measured, there was considerable overlap between the distributions (Fig. 7). This overlap raises the following question: does connectivity depend on how well geniculate and cortical responses are matched with respect to time? For instance, do simple cells with fast subregions (early times to peak and early zero crossings) receive input mostly from geniculate cells with fast centers? Figure 8 illustrates the visual responses from a geniculate cell and a simple cell that were monosynaptically connected. A strong positive peak was observed in the correlogram (shown with a 10 msec time window to emphasize its short latency and fast rise time). In this case, an ON central subregion was well overlapped with an ON geniculate center (precisely at the peak of the subregion). Moreover, the timings of the visual responses from the overlapped subregion and the geniculate center were very similar (same onset, ϳ0 –25 msec; same peak, ϳ25–50 msec). It is worth noting that the two central subregions of the simple cell were faster and stronger than the two lateral subregions. The responses of the central subregions matched the timing of the geniculate center. In contrast, the timing of the lateral subregions resembled more closely the timing of the geniculate surround (both peaked at 25–50 msec). Unlike the example shown in Figure 8, a considerable number of geniculocortical pairs produced responses with different timing. For example, Figure 9 illustrates a case in which a geniculate center fully overlapped a strong simple-cell subregion of the same sign, but with slower timing (LGN onset, ϳ0 –25 msec; peak, ϳ25–50 msec; simple-cell onset, ϳ25–50 msec; peak, ϳ50 –75 msec). The cross-correlogram between this pair of neurons was flat, which indicates the absence of a monosynaptic connection (Fig. 9, top right). To examine the role of timing in geniculocortical connectivity, we measured the response time course from all cell pairs that met two criteria. First, the geniculate center overlapped a simple-cell subregion of the same sign (n ϭ 104). Second, the geniculate center overlapped the cortical subregion in a near-optimal position (relative overlap Ͼ 50%, n ϭ 47; see Materials and Methods; Fig. 5A). All these cell pairs had a high probability of being monosynaptically connected because of the precise match in receptive-field position and sign (31 of 47 were connected). The distributions of peak time, zero-crossing time, and rebound index from these cell pairs were very similar to the distributions from the entire sample (Fig. 7; see also Fig. 10 legend). The selected cell pairs included both presumed directional (predicted DI Ͼ 0.3, see Materials and Methods; 12/20 connected) and nondirectional (19/27 connected) simple cells. Most geniculate cells had small receptive fields (less than two simple-cell subregion widths; see Receptive-field sign), although five cells with larger receptive fields were also included (three connected). From the 47 cell pairs used in this analysis, those with similar response time courses had a higher probability of being connected (Fig. 10). In particular, cell pairs that had both similar peak time and zero-crossing time were all connected (n ϭ 12; Fig. 10 A). Directionally selective simple cells were included in all timing groups. For example, in Figure 10 A there were four, five, two, and one directionally selective cells in the time groups Ͻ20, 40, 60, and Ͼ60 msec, respectively. Similar results were obtained if we restricted our sample to geniculate centers overlapped with the dominant subregion of the simple cell (n ϭ 31). Interestingly, the efficacy and contributions of the connections seemed to depend little on the relative timing of the visual responses (Fig. 10, right). Although our sample of them was quite small, lagged cells are of considerable interest and therefore deserve comment. We recorded from 13 potentially lagged LGN cells whose centers were superimposed with a simple-cell subregion (eight with rebound indices between 1.2 and 1.5; five with rebound indices Ͼ1.9). Only seven of these pairs could be used for timing comparisons (in one pair the baseline of the correlogram had insufficient spikes; in three pairs the geniculate receptive fields were Here, we point out several experimental observations related to temporal processing in the visual system consistent with the lattice filter model. First, measurements of temporal receptive fields demonstrate that they get progressively longer at each consecutive stage: i) LGN neurons have longer receptive fields than corresponding pre-synaptic ganglion cells [12], Figure 3left; ii) simple cells in V1 have longer receptive fields than corresponding pre-synaptic LGN neurons [25], Figure 3rightA,B. These observation are consistent with the progressively greater temporal extent of the prediction-error filters (blue plots in Figure 2). Second, the weight of the peak (integrated area under the curve) may be either greater or less than that of the rebound both in LGN relay cells [26] and simple cells of V1 [25], Figure 3right(C). Neurons with peak weight exceeding that of rebound are often referred to as non-lagged while the others are known as lagged found both in cat [27, 28, 29] and monkey [30]. The reason for this becomes clear from the response to a step stimulus, Figure 4(top). By comparing experimentally measured receptive fields with those of the continuous lattice filter, Figure 4, we identify non-lagged neurons with the forward branch and lagged neurons with the backward branch. Another way to characterize step-stimulus response is whether the sign of the transient is the same (non-lagged) or different (lagged) relative to sustained response. Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [30]. This is consistent with backward pathway circuit of the Laguerre lattice filter, Figure 2, being more complex then that of the forward path (which results in different transfer function). ” (or replacing ”more complex” with ”different”) Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [31]. This is consistent with the backward branch circuit of the Laguerre lattice filter, Figure 2, being different then that of the forward branch (which results in different transfer function). In particular, a combination of different glutamate receptors such as AMPA and NMDA, as well as GABA receptors are thought to be responsible for observed responses in lagged cells [32]. However, further investigation of the corresponding circuitry, perhaps using connectomics technology, is desirable. Fourth, the cross-link weights of the lattice filter can be learned using Hebbian rules, (12,13) which are biologically plausible [22, 23]. Interestingly, if these weights are learned sequentially, starting from the first stage, they do not need to be re-learned when additional stages are added or learned. This property maps naturally on the fact that in the course of mammalian development the visual pathway matures in a stage-wise fashion - starting with the retina, then LGN, then V1 - and implying that the more peripheral structures do not need to adapt to the maturation of the downstream ones. 6 Figure 4: Comparison of electrophysiologically measured responses of cat LGN cells with the continuous-time lattice filter model. Top: Experimentally measured temporal receptive fields and step-stimulus responses of LGN cells (adapted from [26]). Bottom: Typical examples of responses in the continuous-time lattice filter model. Lattice filter coefficients were u1 = v 1 = 0.4, u2 = v 2 = 0.2 and 1/γ = 50ms to model the non-lagged cell and u1 = v 1 = u2 = v 2 = 0.2 and 1/γ = 60ms to model the lagged cell. To model photoreceptor contribution to the responses, an additional leaky integrator L0 was added to the circuit of Figure 2. While Hebbian rules are biologically plausible, one may get an impression from Figure 2 that they must apply to inhibitory cross-links. We point out that this circuit is meant to represent only the computation performed rather than the specific implementation in terms of neurons. As the same linear computation can be performed by circuits with a different arrangement of the same components, there are multiple implementations of the lattice filter. For example, activity of non-lagged OFF cells may be seen as representing minus forward error. Then the cross-links between the non-lagged OFF pathway and the lagged ON pathway would be excitatory. In general, classification of cells into lagged and non-lagged seems independent of their ON/OFF and X/Y classification [31, 28, 29], but see[33]. 3.2 Insect visual pathway In insects, two cell types, L1 and L2, both post-synaptic to photoreceptors play an important role in visual processing. Physiological responses of L1 and L2 indicate that they decorrelate visual signals by subtracting their predictable parts. In fact, receptive fields of these neurons were used as the first examples of predictive coding in neuroscience [6]. Yet, as the numbers of synapses from photoreceptors to L1 and L2 are the same [34] and their physiological properties are similar, it has been a mystery why insects, have not just one but a pair of such seemingly redundant neurons per facet. Previously, it was suggested that L1 and L2 provide inputs to the two pathways that map onto ON and OFF pathways in the vertebrate retina [35, 36]. Here, we put forward a hypothesis that the role of L1 and L2 in visual processing is similar to that of the two branches of the lattice filter. We do not incorporate the ON/OFF distinction in the effectively linear lattice filter model but anticipate that such combined description will materialize in the future. As was argued in Section 2, in forward prediction-error filters, the peak has greater weight than the rebound, while in backward prediction-error filters the opposite is true. Such difference implies that in response to a step-stimulus the signs of sustained responses compared to initial transients are different between the branches. Indeed, Ca2+ imaging shows that responses of L1 and L2 to step-stimulus are different as predicted by the lattice filter model [35], Figure 5b. Interestingly, the activity of L1 seems to represent minus forward error and L2 - plus backward error, suggesting that the lattice filter cross-links are excitatory. To summarize, the predictions of the lattice filter model seem to be consistent with the physiological measurements in the fly visual system and may help understand its operation. 7 Stimulus 1 0.5 0 0 5 10 15 20 5 10 15 20 5 10 time 15 20 − Forward Error 1 0 −1 0 Backward Error 1 0 −1 0 Figure 5: Response of the lattice filter and fruit fly LMCs to a step-stimulus. Left: Responses of the first order discrete-time lattice filter to a step stimulus. Right: Responses of fly L1 and L2 cells to a moving step stimulus (adapted from [35]). Predicted and the experimentally measured responses have qualitatively the same shape: a transient followed by sustained response, which has the same sign for the forward error and L1 and the opposite sign for the backward error and L2. 4 Discussion Motivated by the cascade structure of the visual pathway, we propose to model its operation with the lattice filter. We demonstrate that the predictions of the continuous-time lattice filter model are consistent with the course of neural development and the physiological measurement in the LGN, V1 of cat and monkey, as well as fly LMC neurons. Therefore, lattice filters may offer a useful abstraction for understanding aspects of temporal processing in visual systems of vertebrates and invertebrates. Previously, [11] proposed that lagged and non-lagged cells could be a result of rectification by spiking neurons. Although we agree with [11] that LGN performs temporal decorrelation, our explanation does not rely on non-linear processing but rather on the cascade architecture and, hence, is fundamentally different. Our model generates the following predictions that are not obvious in [11]: i) Not only are LGN receptive fields longer than RGC but also V1 receptive fields are longer than LGN; ii) Even a linear model can generate a difference in the peak/rebound ratio; iii) The circuit from RGC to LGN should be different for lagged and non-lagged cells consistent with [31]; iv) The lattice filter circuit can self-organize using Hebbian rules, which gives a mechanistic explanation of receptive fields beyond the normative framework of [11]. In light of the redundancy reduction arguments given in the introduction, we note that, if the only goal of the system were to compress incoming signals using a given number of lattice filter stages, then after the compression is peformed only one kind of prediction errors, forward or backward needs to be transmitted. Therefore, having two channels, in the absence of noise, may seem redundant. However, transmitting both forward and backward errors gives one the flexibility to continue decorrelation further by adding stages performing relatively simple operations. We are grateful to D.A. Butts, E. Callaway, M. Carandini, D.A. Clark, J.A. Hirsch, T. Hu, S.B. Laughlin, D.N. Mastronarde, R.C. Reid, H. Rouault, A. Saul, L. Scheffer, F.T. Sommer, X. Wang for helpful discussions. References [1] F. Rieke, D. Warland, R.R. van Steveninck, and W. Bialek. Spikes: exploring the neural code. MIT press, 1999. [2] S.B. Laughlin. Matching coding, circuits, cells, and molecules to signals: general principles of retinal design in the fly’s eye. Progress in retinal and eye research, 13(1):165–196, 1994. [3] F. Attneave. Some informational aspects of visual perception. Psychological review, 61(3):183, 1954. [4] H. Barlow. Redundancy reduction revisited. Network: Comp in Neural Systems, 12(3):241–253, 2001. [5] R.M. Gray. Linear Predictive Coding and the Internet Protocol. Now Publishers, 2010. [6] MV Srinivasan, SB Laughlin, and A. Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982. [7] T. Hosoya, S.A. Baccus, and M. Meister. Dynamic predictive coding by the retina. Nature, 436:71, 2005. 8 [8] HK Hartline, H.G. Wagner, and EF MacNichol Jr. The peripheral origin of nervous activity in the visual system. Studies on excitation and inhibition in the retina: a collection of papers from the laboratories of H. Keffer Hartline, page 99, 1974. [9] N.A. Lesica, J. Jin, C. Weng, C.I. Yeh, D.A. Butts, G.B. Stanley, and J.M. Alonso. Adaptation to stimulus contrast and correlations during natural visual stimulation. Neuron, 55(3):479–491, 2007. [10] Y. Dan, J.J. Atick, and R.C. Reid. Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory. The Journal of Neuroscience, 16(10):3351–3362, 1996. [11] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6(3):345–358, 1995. [12] X. Wang, J.A. Hirsch, and F.T. Sommer. Recoding of sensory information across the retinothalamic synapse. The Journal of Neuroscience, 30(41):13567–13577, 2010. [13] C. Koch. Biophysics of computation: information processing in single neurons. Oxford Univ Press, 2005. [14] F. Itakura and S. Saito. On the optimum quantization of feature parameters in the parcor speech synthesizer. In Conference Record, 1972 International Conference on Speech Communication and Processing, Boston, MA, pages 434–437, 1972. [15] B. Widrow and S.D. Stearns. Adaptive signal processing. Prentice-Hall, Inc. Englewood Cliffs, NJ, 1985. [16] S. Haykin. Adaptive filter theory. Prentice-Hall, Englewood-Cliffs, NJ, 2003. [17] A.H. Sayed. Fundamentals of adaptive filtering. Wiley-IEEE Press, 2003. [18] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1):1–47, 1991. [19] X. Wang, F.T. Sommer, and J.A. Hirsch. Inhibitory circuits for visual processing in thalamus. Current Opinion in Neurobiology, 2011. [20] SB Laughlin, J. Howard, and B. Blakeslee. Synaptic limitations to contrast coding in the retina of the blowfly calliphora. Proceedings of the Royal society of London. Series B. Biological sciences, 231(1265):437–467, 1987. [21] D.C. Lay. Linear Algebra and Its Applications. Addison-Wesley/Longman, New York/London, 2000. [22] D.O. Hebb. The organization of behavior: A neuropsychological theory. Lawrence Erlbaum, 2002. [23] O. Paulsen and T.J. Sejnowski. Natural patterns of activity and long-term synaptic plasticity. Current opinion in neurobiology, 10(2):172–180, 2000. [24] Z. Fejzo and H. Lev-Ari. Adaptive laguerre-lattice filters. Signal Processing, IEEE Transactions on, 45(12):3006–3016, 1997. [25] J.M. Alonso, W.M. Usrey, and R.C. Reid. Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. The Journal of Neuroscience, 21(11):4002–4015, 2001. [26] D. Cai, G.C. Deangelis, and R.D. Freeman. Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78(2):1045–1061, 1997. [27] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. i. receptive-field properties and classification of cells. Journal of Neurophysiology, 57(2):357–380, 1987. [28] J. Wolfe and L.A. Palmer. Temporal diversity in the lateral geniculate nucleus of cat. Visual neuroscience, 15(04):653–675, 1998. [29] AB Saul and AL Humphrey. Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. Journal of Neurophysiology, 64(1):206–224, 1990. [30] A.B. Saul. Lagged cells in alert monkey lateral geniculate nucleus. Visual neurosci, 25:647–659, 2008. [31] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. ii. retinal inputs and the generation of receptive-field properties. Journal of Neurophysiology, 57(2):381–413, 1987. [32] P. Heggelund and E. Hartveit. Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. i. lagged cells. Journal of neurophysiology, 63(6):1347–1360, 1990. [33] J. Jin, Y. Wang, R. Lashgari, H.A. Swadlow, and J.M. Alonso. Faster thalamocortical processing for dark than light visual targets. The Journal of Neuroscience, 31(48):17471–17479, 2011. [34] M. Rivera-Alba, S.N. Vitaladevuni, Y. Mischenko, Z. Lu, S. Takemura, L. Scheffer, I.A. Meinertzhagen, D.B. Chklovskii, and G.G. de Polavieja. Wiring economy and volume exclusion determine neuronal placement in the drosophila brain. Current Biology, 21(23):2000–5, 2011. [35] D.A. Clark, L. Bursztyn, M.A. Horowitz, M.J. Schnitzer, and T.R. Clandinin. Defining the computational structure of the motion detector in drosophila. Neuron, 70(6):1165–1177, 2011. [36] M. Joesch, B. Schnell, S.V. Raghu, D.F. Reiff, and A. Borst. On and off pathways in drosophila motion vision. Nature, 468(7321):300–304, 2010. 9

same-paper 2 0.74628699 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

3 0.71391404 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

4 0.6802848 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng

Abstract: We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of regression function, based on the recent development of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or type of variables, and uses computationally simple eigendecomposition, thus applicable to large data sets. Second, in combination of a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models. 1

5 0.62688875 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

6 0.60519499 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.54426032 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

8 0.54189056 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.54130602 260 nips-2012-Online Sum-Product Computation Over Trees

10 0.53961736 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

11 0.53923428 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

12 0.53870529 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

13 0.53830218 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

14 0.53819835 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

15 0.53698272 38 nips-2012-Algorithms for Learning Markov Field Policies

16 0.53646064 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

17 0.53640491 148 nips-2012-Hamming Distance Metric Learning

18 0.53636026 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

19 0.53620082 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

20 0.53601193 162 nips-2012-Inverse Reinforcement Learning through Structured Classification