nips nips2009 nips2009-261 knowledge-graph by maker-knowledge-mining

261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Source: pdf

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Ramadge1 1 Department of Electrical Engineering, 2 Neuroscience Institute, Princeton University 3 Department of Psychology, Dartmouth College Abstract The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. [sent-5, score-0.368]

2 In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. [sent-6, score-0.653]

3 major sulci and gyri, across subjects, either in the volume or on extracted cortical surfaces. [sent-15, score-0.22]

4 More advanced techniques match a denser set of anatomical features, such as cortical curvature [2], and derive nonlinear transformations between a reference space and each subject’s cortical surface. [sent-17, score-0.559]

5 It is known, however, that an accurate inter-subject functional correspondence cannot be derived using only anatomical features, since the size, shape and anatomical location of functional loci vary across subjects [3], [4]. [sent-18, score-0.777]

6 Because of this deﬁciency in current alignment methods, it is common practice to spatially smooth each subject’s functional data prior to a population based analysis. [sent-19, score-0.368]

7 However, this incurs the penalty of blurring the functional data within and across distinct cortical regions. [sent-20, score-0.385]

8 Thus, the functional alignment of multi-subject fMRI data remains an important problem. [sent-21, score-0.368]

9 We propose to register functional loci directly by using anatomical and functional data to learn an inter-subject cortical correspondence. [sent-22, score-0.708]

10 This approach was ﬁrst explored in [5], where subject cortices were registered by maximizing the inter-subject correlation of the functional response elicited by a common stimulus (a movie viewing). [sent-23, score-0.385]

11 The technique of [5] is hence not expected to improve alignment in these intrinsic regions. [sent-28, score-0.236]

12 In contrast to [5], we propose to achieve inter-subject alignment by aligning intra-subject patterns of cortical functional connectivity. [sent-29, score-0.597]

13 By functional connectivity, we mean within-subject similarity of ∗ This work was funded by a grant from the National Institute of Mental Health (5R01MH075706-02) 1 the temporal response of remote regions of cortex [9]. [sent-30, score-0.23]

14 This can be estimated from fMRI data, for example, by correlating the functional time series between pairs of cortical nodes within a subject. [sent-31, score-0.419]

15 This yields a dense set of functional features for each subject from which we learn an inter-subject correspondence. [sent-32, score-0.234]

16 [10]), we deﬁne connectivity between pairs of cortical nodes rather than with respect to anatomical regions of interest. [sent-35, score-0.502]

17 Our approach is inspired by studies showing that the patterns of functional connectivity in the intrinsic network are consistent across subjects [7], [11]. [sent-36, score-0.45]

18 This suggests that our method has the potential to learn an inter-subject functional correspondence within both extrinsic and intrinsic cortical networks. [sent-37, score-0.469]

19 In summary, we formulate a multi-subject cortical alignment algorithm that minimizes the difference between functional connectivity vectors of corresponding cortical nodes across subjects. [sent-38, score-0.954]

20 We do so by learning a dense-deformation ﬁeld on the cortex of each subject, suitably regularized to preserve cortical topology [2]. [sent-39, score-0.245]

21 Our key contributions are: a) the novel alignment objective, b) a principled algorithm for accomplishing the alignment, and c) experimental veriﬁcation on fMRI data. [sent-40, score-0.203]

22 In §2 we formulate the multi-subject alignment problem, followed by a detailed exposition of the algorithm in §3 and §4. [sent-42, score-0.203]

23 2 Formulation of the Multi-Subject Alignment Problem For each subject we are given volumetric anatomical MRI data and fMRI data. [sent-44, score-0.231]

24 Cortex is segmented, then each cortical hemisphere is inﬂated to obtain a smooth surface, which is projected to the sphere, S 2 , represented by a discrete spherical mesh Ms = {pk ∈ S 2 ; 1 ≤ k ≤ Nv /2}. [sent-47, score-0.412]

25 The two cortical hemispheres are hence modeled by the disjoint union S = S 2 S 2 , represented by the corresponding disjoint union of mesh points M = Ms Ms . [sent-48, score-0.351]

26 Anatomical cortical features, such as cortical curvature, are functions Da : S → RNa sampled on M . [sent-49, score-0.36]

27 This assigns each mesh node pk ∈ M a “volumetric cortical voxel” vk ∈ R3 , with associated functional time series fk ∈ RNt . [sent-52, score-0.667]

28 The functional time series data is then a function Df : S → RNt sampled on M . [sent-53, score-0.209]

29 As indicated in the introduction, we do not directly register the fMRI time series but instead register the functional connectivity derived from the time series. [sent-54, score-0.443]

30 Deﬁne the functional connectivity of the fMRI data under σ as the map C(pi , pj ) = σ(Df (pi ), Df (pj )), i. [sent-57, score-0.668]

31 , the similarity of the functional times series at the pairs of cortical nodes. [sent-59, score-0.389]

32 Functional connections both within and across cortical hemispheres are considered. [sent-60, score-0.22]

33 Functional connectivity can be conceptualized as the adjacency matrix of an edge-weighted graph on all cortical nodes. [sent-61, score-0.336]

34 The edge between nodes pi , pj is weighted by the pairwise similarity measure σ(fi , fj ) codifying the functional similarity of pi and pj . [sent-62, score-1.499]

35 For typical values of Nv (≈ 72, 000), the functional connectivity data structure is huge. [sent-64, score-0.321]

36 Subject k’s training data is speciﬁed by samples of the functions Da,k : Sk → RNa , Df,k : Sj → RNt , and the derived functional connectivity Ck , all sampled on the mesh Mk , k = 1, . [sent-67, score-0.492]

37 To do so, we could select a node from M1 for subject 1 and learn the corresponding points on the cortices of the remaining Ns − 1 subjects through smooth and invertible mappings gk : S1 → Sk , k = 2, . [sent-72, score-0.405]

38 Instead, we introduce a reference model Sref = S 2 S 2 with mesh Mref . [sent-77, score-0.212]

39 , gNs (p)), parameterized by gk : Sref → Sk , k = 1, . [sent-81, score-0.246]

40 In general terms, we can now summarize our task as follows: use the functional connectivity data Ck , in conjunction with the anatomical data Da,k , k = 1, . [sent-85, score-0.457]

41 , Ns }, subject to speciﬁed regularity conditions, that bring some speciﬁed balance of anatomy and functional connectivity into alignment across subjects. [sent-91, score-0.633]

42 That said, for the remainder 2 of the paper we restrict attention to aligning only functional connectivity across subjects. [sent-92, score-0.41]

43 Restricting attention to the alignment of functional connectivity will allow us to concentrate on the most novel and important aspects of our approach. [sent-94, score-0.524]

44 To proceed, assume a reference connectivity Cref , such that for each subject k = 1, . [sent-95, score-0.266]

45 , Ns , Ck (gk (pi ), gk (pj )) = Cref (pi , pj ) + pi , pj ∈ Mref k (pi , pj ), (1) where Ck (gk (pi ), gk (pj )) = σ(Df,k (gk (pi )), Df,k (gk (pj ))), and k is zero-mean random noise. [sent-98, score-1.822]

46 Since gk (p) may not be a mesh point, computation of Df,k (gk (p)) requires interpolation of the time series using mesh nodes in a neighborhood of gk (p). [sent-99, score-0.978]

47 , CNs |g) − λ k Reg(gk ) (2) where Reg(gk ) constrains each warping function gk to be smooth and invertible. [sent-104, score-0.288]

48 Because (4b) decouples, we can optimize over each subject’s warp separately, i. [sent-112, score-0.274]

49 , these optimizations can be done in parallel: gk = arg min gk pi ,pj ∈Sref (C ref (pi , pj ) − Ck (gk (pi ), gk (pj )))2 (5) However, an interesting alternative is to perform these sequentially with an E-step after each that updates the reference estimate C ref . [sent-114, score-1.597]

50 We note: (C ref (pi , pj ) − Ck (gk (pi ), gk (pj )))2 ∝ (C k (pi , pj ) − Ck (gk (pi ), gk (pj )))2 (6a) where C k (pi , pj ) = 1 (Ns −1) n=k Cn (gn (pi ), gn (pj )), pi , pj ∈ Mref , (7) is the leave-one-out template for subject k, which is indepedendent of gk . [sent-116, score-2.637]

51 3 Pairwise Cortical Alignment We now develop an algorithm for aligning one subject, with connectivity CF , to a reference, with connectivity CR , with CF , CR ∈ RNv ×Nv . [sent-123, score-0.361]

52 3 A function g : MR → SF maps a reference mesh point pi ∈ MR to g(pi ) ∈ SF . [sent-125, score-0.501]

53 By interpolating the ﬂoating subject’s times series at the points g(pi ) ∈ SF we obtain the associated warped functional F F ˜ ˜ connectivity: CF = [σ(fg(pi ) , fg(pj ) )]. [sent-126, score-0.277]

54 Here, we employ linear interpolation with a spherical i Nv F kernel Φ: f F (p) = i=1 fi Φ(p, pi ), p ∈ SF . [sent-138, score-0.448]

55 With these considerations in mind, we select Φ(p, pi ) to be a spherical radial basis function Φi : S 2 → R centered at pi ∈ S 2 and taking the form: Φi (p) = ϕ(d(p, pi )), p ∈ S 2 , where ϕ : [0, π] → R and d(p, pi ) is the spherical geodesic distance between p and pi [16]. [sent-145, score-1.567]

56 Then Φi (p) is monomodal with a maximum at pi , it depends only on the distance between p and pi and is radially symmetric. [sent-146, score-0.601]

57 In detail, we employ the particular spherical radial basis function: Φi (p) = ϕ(d(p, pi )) = (1 − (2/r) sin(d(p, pi )/2))4 ((8/r) sin(d(p, pi )/2) + 1) + (10) where r is a ﬁxed parameter, and (a)+ = a1{a ≥ 0}. [sent-147, score-0.928]

58 Φi (p) has two continuous derivatives and its support is {p ∈ S 2 : d(p, pi ) < 2 sin−1 (r/2)}. [sent-148, score-0.289]

59 At a spatial resolution of 2 mm, the spherical model of human cortex can yield Nv ≈ 72, 000 total mesh points. [sent-159, score-0.374]

60 Then given a warp g, T ˜ ˜ we compute: the interpolation matrix A, B1 = VF AVR , and ﬁnally B2 via QR factorization of ˜T VF − VR B T . [sent-183, score-0.344]

61 The use of such nonlinear warp models for inter-subject cortical alignment has been validated over, for example, rigid-body transformations [17]. [sent-188, score-0.657]

62 Consider pi ∈ S 2 parameterized by x(φ, θ) such that pi = x(φi , θi ). [sent-196, score-0.578]

63 Then the warp ﬁeld at pi is: ˜ ˜ g(pi ) = x(φi + ∆φi , θi + ∆θi ) = x(φi , θi ) (17) ˜ ˜ for displacements ∆φi and ∆θi . [sent-197, score-0.563]

64 The warp g is thus parameterized by: {φi , θi , i = 1, . [sent-198, score-0.274]

65 The warp g must be regularized to avoid undesired topological distortions (e. [sent-202, score-0.274]

66 The metric distortion term penalizes warps that disrupt local distances between neighboring mesh nodes. [sent-208, score-0.236]

67 , Ns 3: for t = 1 to T do 4: for k = 1 to Ns do 5: Construct C k as explained in §4 6: Align Ck to C k by Algorithm 1 with (t−1) 7: 8: 9: 10: 11: initial condition gk (t) Set gk to the output of the alignment (t) Use gk to update Σk , Vk end for end for (T ) (T ) Output result: g = {g1 , . [sent-219, score-0.941]

68 Φi (x(φj , θj )) TF a(x(φj , θj )) −1 depends only on the warp parameters of the j th mesh node, φj and θj . [sent-223, score-0.445]

69 The result at resolution rm is used to initialize the alignment at resolution rm+1 . [sent-229, score-0.307]

70 The alignment for rm is performed by matching the kernel parameter r in (10) to rm . [sent-230, score-0.311]

71 4 Multi-Subject Alignment: Computing Leave-one-out Templates We now return to the multi-subject alignment problem, which is summarized as Algorithm 2 in Figure 1. [sent-233, score-0.203]

72 Assume that Cn , the connectivity matrix of subject n after T warp gn (see (11)), has an efﬁcient d Nv dimensional SVD representation Cn = Vn Σn Vn . [sent-236, score-0.523]

73 To compute the SVD for C k , we exploit the sequential nature of the multi-subject alignment algoT rithm by reﬁning the SVD of the leave-one-out template for subject k−1, C k−1 = V k−1 Σk−1 V k−1 , computed in the previous iteration. [sent-237, score-0.31]

74 Additionally, it works directly on the singular values Σk and vectors Vk of each warped connectivity matrix Ck , alleviating the need to store large Nv × Nv matrices. [sent-243, score-0.224]

75 5 Experimental Results We tested the algorithm using fMRI data collected from 10 subjects viewing a movie split into 2 sessions separated by a short break. [sent-244, score-0.196]

76 For each subject, a structural scan was acquired before each session, from which the cortical surface model was derived (§2) and then anatomically aligned to a template using FreeSurfer (Fischl, http://surfer. [sent-246, score-0.276]

77 Similar to [5], we ﬁnd that anatomical alignment based on cortical curvature serves as a superior starting point for functional alignment over Talairach alignment. [sent-251, score-0.909]

78 First, functional connectivity was found for each subject and session: Ck,i , k = 1, . [sent-252, score-0.39]

79 Since the data starts in anatomical correspondence, we expect small warp displacements within subject and larger ones across subjects. [sent-257, score-0.519]

80 48), with 77% of the mesh nodes warped less than 1 mm and fewer than 1. [sent-260, score-0.328]

81 In contrast, the mean inter-subject warp displacement was 1. [sent-262, score-0.274]

82 In a separate analysis, each subject was aligned to its leave-one-out template on each session using Algorithm 1, yielding a set of warps gk,i (pj ), k = 1, . [sent-266, score-0.249]

83 To evaluate the consistency of the correspondence derived from different sessions, we compared the warps gk,1 to gk,2 for each subject k. [sent-273, score-0.217]

84 At node pj , we compute the angle 0 ≤ θ ≤ π between the warp tangent vectors of gk,1 (pj ) and gk,2 (pj ). [sent-276, score-0.653]

85 This measures the consistency of the direction of the warp across sessions: smaller values of θ suggest a greater warp coherence across sessions. [sent-277, score-0.655]

86 Figure 2(c) shows a histogram of θ averaged across the cortical nodes of all 10 subjects. [sent-278, score-0.25]

87 The tight distribution centered near θ = 0 suggests signiﬁcant consistency in the warp direction across sessions. [sent-279, score-0.341]

88 As a secondary comparison, we compute a normalized consistency measure WNC(pj ) = d(gk,1 (pj ), gk,2 (pj ))/(d(gk,1 (pj ), pj ) + d(gk,2 (pj ), pj )), where d(·, ·) is spherical geodesic distance. [sent-281, score-0.782]

89 The measure takes variability in both warp angle and magnitude into account; it is bounded between 0 and 1, and WNC(pj ) = 0 only if gk,1 (pj ) = gk,2 (pj ). [sent-282, score-0.306]

90 The alignment required approximately 10 hours on a Intel 3. [sent-291, score-0.203]

91 (a) Intra-subject warp distances; (b) Inter-subject warp distances; (c) Angle between warp vectors across sessions; (d) Across-session normalized warp consistency measure WNC. [sent-330, score-1.163]

92 (a) Lateral View (b) Medial View (c) Ventral View (d) Lateral View (e) Medial View (f) Ventral View Figure 3: Map of ISC on right cortical hemisphere, alignment: anatomical (top), functional (bottom). [sent-331, score-0.481]

93 correlation of each subject’s functional time series with the mean time series of the other subjects: ISC(pi ) = (1/Ns ) Ns k=1 k corr(fgk (pi ) , n=k n fgn (pi ) ), pi ∈ Mref We also compute the mean inter-subject correlation, ISC = (1/Nv ) Nv i=1 (24) ISC(pi ). [sent-332, score-0.58]

94 We compare the cross-validated ISC map with the ISC map of the second session movie viewing computed under anatomical correspondence. [sent-333, score-0.28]

95 Figure 3 shows the ISC maps computed under anatomical alignment and functional alignment on the inﬂated right cortical hemisphere. [sent-341, score-0.887]

96 6 Conclusion We have proposed a novel cortical registration algorithm that produces a functional correspondence across a set of subjects. [sent-343, score-0.529]

97 The algorithm uses the fMRI data directly to align the spatial patterns of functional response elicited by a movie viewing. [sent-344, score-0.243]

98 On the test data, the algorithm produces a consistent increase in intersubject correlation of fMRI time series, suggesting that functional alignment of extrinsic regions of cortex that are directly driven by the movie viewing experiment, such as visual and auditory areas, is improved considerably. [sent-348, score-0.635]

99 High-resolution intersubject averaging and a coordinate system for the cortical surface. [sent-362, score-0.211]

100 Impact of inter-subject image registration on group analysis of fmri data. [sent-526, score-0.252]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pj', 0.347), ('pi', 0.289), ('warp', 0.274), ('nv', 0.256), ('gk', 0.246), ('alignment', 0.203), ('ns', 0.196), ('cf', 0.196), ('vr', 0.186), ('cortical', 0.18), ('mesh', 0.171), ('functional', 0.165), ('fmri', 0.164), ('connectivity', 0.156), ('cr', 0.144), ('vf', 0.141), ('anatomical', 0.136), ('isc', 0.13), ('ck', 0.123), ('tf', 0.106), ('ref', 0.091), ('sref', 0.091), ('registration', 0.088), ('interpolation', 0.07), ('subject', 0.069), ('warped', 0.068), ('cref', 0.065), ('mref', 0.065), ('warps', 0.065), ('cortex', 0.065), ('spherical', 0.061), ('mm', 0.059), ('vk', 0.058), ('rnt', 0.057), ('subjects', 0.056), ('movie', 0.056), ('correspondence', 0.056), ('fischl', 0.052), ('wnc', 0.052), ('reg', 0.051), ('pk', 0.049), ('aligning', 0.049), ('session', 0.046), ('series', 0.044), ('sf', 0.044), ('qk', 0.042), ('sessions', 0.042), ('viewing', 0.042), ('svd', 0.042), ('warping', 0.042), ('reference', 0.041), ('across', 0.04), ('rm', 0.04), ('cerebral', 0.039), ('sin', 0.039), ('avr', 0.039), ('gns', 0.039), ('register', 0.039), ('registrations', 0.039), ('correlation', 0.038), ('template', 0.038), ('aij', 0.036), ('extrinsic', 0.035), ('rk', 0.035), ('talairach', 0.034), ('cortices', 0.034), ('vg', 0.034), ('intrinsic', 0.033), ('resolution', 0.032), ('angle', 0.032), ('brain', 0.032), ('pairwise', 0.032), ('intersubject', 0.031), ('aligned', 0.031), ('nodes', 0.03), ('tr', 0.028), ('kernel', 0.028), ('surface', 0.027), ('consistency', 0.027), ('areal', 0.026), ('fgk', 0.026), ('pnv', 0.026), ('volumetric', 0.026), ('vtf', 0.026), ('vtr', 0.026), ('corr', 0.024), ('gn', 0.024), ('mri', 0.024), ('human', 0.023), ('loci', 0.023), ('raichle', 0.023), ('registered', 0.023), ('svds', 0.023), ('radially', 0.023), ('spatial', 0.022), ('curvature', 0.022), ('df', 0.022), ('columns', 0.021), ('ated', 0.021), ('hasson', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

2 0.1477565 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classiﬁcation. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classiﬁcation, this method is only suitable for two-class EEG classiﬁcation problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we ﬁrst develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral ﬁlters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method signiﬁcantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classiﬁcation.

3 0.14691274 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

Author: Andreas Bartels, Matthew Blaschko, Jacquelyn A. Shelton

Abstract: Resting state activity is brain activation that arises in the absence of any task, and is usually measured in awake subjects during prolonged fMRI scanning sessions where the only instruction given is to close the eyes and do nothing. It has been recognized in recent years that resting state activity is implicated in a wide variety of brain function. While certain networks of brain areas have different levels of activation at rest and during a task, there is nevertheless signiﬁcant similarity between activations in the two cases. This suggests that recordings of resting state activity can be used as a source of unlabeled data to augment discriminative regression techniques in a semi-supervised setting. We evaluate this setting empirically yielding three main results: (i) regression tends to be improved by the use of Laplacian regularization even when no additional unlabeled data are available, (ii) resting state data seem to have a similar marginal distribution to that recorded during the execution of a visual processing task implying largely similar types of activation, and (iii) this source of information can be broadly exploited to improve the robustness of empirical inference in fMRI studies, an inherently data poor domain. 1

4 0.1465556 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

Author: Barry Chai, Dirk Walther, Diane Beck, Li Fei-fei

Abstract: In this study, we present a new method for establishing fMRI pattern-based functional connectivity between brain regions by estimating their multivariate mutual information. Recent advances in the numerical approximation of highdimensional probability distributions allow us to successfully estimate mutual information from scarce fMRI data. We also show that selecting voxels based on the multivariate mutual information of local activity patterns with respect to ground truth labels leads to higher decoding accuracy than established voxel selection methods. We validate our approach with a 6-way scene categorization fMRI experiment. Multivariate information analysis is able to ﬁnd strong information sharing between PPA and RSC, consistent with existing neuroscience studies on scenes. Furthermore, an exploratory whole-brain analysis uncovered other brain regions that share information with the PPA-RSC scene network.

5 0.14297727 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1

6 0.14078298 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

7 0.13016579 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

8 0.085320503 50 nips-2009-Canonical Time Warping for Alignment of Human Behavior

9 0.084679171 70 nips-2009-Discriminative Network Models of Schizophrenia

10 0.084254213 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

11 0.078262113 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

12 0.065901935 83 nips-2009-Estimating image bases for visual image reconstruction from human brain activity

13 0.065104693 230 nips-2009-Statistical Consistency of Top-k Ranking

14 0.063138403 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

15 0.06290894 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

16 0.062233452 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

17 0.060630821 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

18 0.059366316 47 nips-2009-Boosting with Spatial Regularization

19 0.059091911 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

20 0.056193493 43 nips-2009-Bayesian estimation of orientation preference maps

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.16), (1, -0.051), (2, -0.004), (3, 0.066), (4, 0.003), (5, 0.106), (6, -0.023), (7, -0.172), (8, -0.202), (9, -0.039), (10, -0.003), (11, 0.02), (12, 0.008), (13, 0.082), (14, -0.049), (15, -0.081), (16, 0.126), (17, 0.024), (18, -0.122), (19, -0.029), (20, 0.222), (21, 0.013), (22, -0.043), (23, -0.091), (24, 0.012), (25, -0.036), (26, -0.03), (27, -0.081), (28, -0.097), (29, 0.2), (30, 0.146), (31, 0.162), (32, -0.065), (33, -0.114), (34, -0.022), (35, 0.001), (36, -0.038), (37, -0.008), (38, 0.071), (39, 0.025), (40, 0.078), (41, -0.032), (42, 0.166), (43, -0.001), (44, 0.016), (45, 0.075), (46, 0.109), (47, 0.001), (48, 0.023), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97575039 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

2 0.57271433 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

3 0.54948401 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

4 0.54338938 125 nips-2009-Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data

Author: Shuai Huang, Jing Li, Liang Sun, Jun Liu, Teresa Wu, Kewei Chen, Adam Fleisher, Eric Reiman, Jieping Ye

Abstract: Recent advances in neuroimaging techniques provide great potentials for effective diagnosis of Alzheimer’s disease (AD), the most common form of dementia. Previous studies have shown that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions. In this paper, we consider the problem of learning functional brain connectivity from neuroimaging, which holds great promise for identifying image-based markers used to distinguish Normal Controls (NC), patients with Mild Cognitive Impairment (MCI), and patients with AD. More specifically, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. In particular, we apply SICE to learn and analyze functional brain connectivity patterns from different subject groups, based on a key property of SICE, called the “monotone property” we established in this paper. Our experimental results on neuroimaging PET data of 42 AD, 116 MCI, and 67 NC subjects reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 1 In trod u cti on Alzheimer’s disease (AD) is a fatal, neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. It is the most common form of dementia and currently affects over five million Americans; this number will grow to as many as 14 million by year 2050. The current knowledge about the cause of AD is very limited; clinical diagnosis is imprecise with definite diagnosis only possible by autopsy; also, there is currently no cure for AD, while most drugs only alleviate the symptoms. To tackle these challenging issues, the rapidly advancing neuroimaging techniques provide great potentials. These techniques, such as MRI, PET, and fMRI, produce data (images) of brain structure and function, making it possible to identify the difference between AD and normal brains. Recent studies have demonstrated that neuroimaging data provide more sensitive and consistent measures of AD onset and progression than conventional clinical assessment and neuropsychological tests [1]. Recent studies have found that AD is closely related to the alternation in the functional brain network, i.e., the functional connectivity among different brain regions [ 2]-[3]. Specifically, it has been shown that functional connectivity substantially decreases between the hippocampus and other regions of AD brains [3]-[4]. Also, some studies have found increased connectivity between the regions in the frontal lobe [ 6]-[7]. Learning functional brain connectivity from neuroimaging data holds great promise for identifying image-based markers used to distinguish among AD, MCI (Mild Cognitive Impairment), and normal aging. Note that MCI is a transition stage from normal aging to AD. Understanding and precise diagnosis of MCI have significant clinical value since it can serve as an early warning sign of AD. Despite all these, existing research in functional brain connectivity modeling suffers from limitations. A large body of functional connectivity modeling has been based on correlation analysis [2]-[3], [5]. However, correlation only captures pairwise information and fails to provide a complete account for the interaction of many (more than two) brain regions. Other multivariate statistical methods have also been used, such as Principle Component Analysis (PCA) [8], PCA-based Scaled Subprofile Model [9], Independent Component Analysis [10]-[11], and Partial Least Squares [12]-[13], which group brain regions into latent components. The brain regions within each component are believed to have strong connectivity, while the connectivity between components is weak. One major drawback of these methods is that the latent components may not correspond to any biological entities, causing difficulty in interpretation. In addition, graphical models have been used to study brain connectivity, such as structural equation models [14]-[15], dynamic causal models [16], and Granger causality. However, most of these approaches are confirmative, rather than exploratory, in the sense that they require a prior model of brain connectivity to begin with. This makes them inadequate for studying AD brain connectivity, because there is little prior knowledge about which regions should be involved and how they are connected. This makes exploratory models highly desirable. In this paper, we study sparse inverse covariance estimation (SICE), also known as exploratory Gaussian graphical models, for brain connectivity modeling. Inverse covariance matrix has a clear interpretation that the off-diagonal elements correspond to partial correlations, i.e., the correlation between each pair of brain regions given all other regions. This provides a much better model for brain connectivity than simple correlation analysis which models each pair of regions without considering other regions. Also, imposing sparsity on the inverse covariance estimation ensures a reliable brain connectivity to be modeled with limited sample size, which is usually the case in AD studies since clinical samples are difficult to obtain. From a domain perspective, imposing sparsity is also valid because neurological findings have demonstrated that a brain region usually only directly interacts with a few other brain regions in neurological processes [ 2]-[3]. Various algorithms for achieving SICE have been developed in recent year [ 17]-[22]. In addition, SICE has been used in various applications [17], [21], [23]-[26]. In this paper, we apply SICE to learn functional brain connectivity from neuroimaging and analyze the difference among AD, MCI, and NC based on a key property of SICE, called the “monotone property” we established in this paper. Unlike the previous study which is based on a specific level of sparsity [26], the monotone property allows us to study the connectivity pattern using different levels of sparsity and obtain an order for the strength of connection between pairs of brain regions. In addition, we apply bootstrap hypothesis testing to assess the significance of the connection. Our experimental results on PET data of 42 AD, 116 MCI, and 67 NC subjects enrolled in the Alzheimer’s Disease Neuroimaging Initiative project reveal several interesting connectivity patterns consistent with literature findings, and also some new patterns that can help the knowledge discovery of AD. 2 S ICE : B ack grou n d an d th e Mon oton e P rop erty An inverse covariance matrix can be represented graphically. If used to represent brain connectivity, the nodes are activated brain regions; existence of an arc between two nodes means that the two brain regions are closely related in the brain's functiona l process. Let be all the brain regions under study. We assume that follows a multivariate Gaussian distribution with mean and covariance matrix . Let be the inverse covariance matrix. Suppose we have samples (e.g., subjects with AD) for these brain regions. Note that we will only illustrate here the SICE for AD, whereas the SICE for MCI and NC can be achieved in a similar way. We can formulate the SICE into an optimization problem, i.e., (1) where is the sample covariance matrix; , , and denote the determinant, trace, and sum of the absolute values of all elements of a matrix, respectively. The part “ ” in (1) is the log-likelihood, whereas the part “ ” represents the “sparsity” of the inverse covariance matrix . (1) aims to achieve a tradeoff between the likelihood fit of the inverse covariance estimate and the sparsity. The tradeoff is controlled by , called the regularization parameter; larger will result in more sparse estimate for . The formulation in (1) follows the same line of the -norm regularization, which has been introduced into the least squares formulation to achieve model sparsity and the resulting model is called Lasso [27]. We employ the algorithm in [19] in this paper. Next, we show that with going from small to large, the resulting brain connectivity models have a monotone property. Before introducing the monotone property, the following definitions are needed. Definition: In the graphical representation of the inverse covariance, if node to by an arc, then is called a “neighbor” of . If is connected to chain of arcs, then is called a “connectivity component” of . is connected though some Intuitively, being neighbors means that two nodes (i.e., brain regions) are directly connected, whereas being connectivity components means that two brain regions are indirectly connected, i.e., the connection is mediated through other regions. In other words, not being connectivity components (i.e., two nodes completely separated in the graph) means that the two corresponding brain regions are completely independent of each other. Connectivity components have the following monotone property: Monotone property of SICE: Let components of with and and be the sets of all the connectivity , respectively. If , then . Intuitively, if two regions are connected (either directly or indirectly) at one level of sparseness ( ), they will be connected at all lower levels of sparseness ( ). Proof of the monotone property can be found in the supplementary file [29]. This monotone property can be used to identify how strongly connected each node (brain region) to its connectivity components. For example, assuming that and , this means that is more strongly connected to than . Thus, by changing from small to large, we can obtain an order for the strength of connection between pairs of brain regions. As will be shown in Section 3, this order is different among AD, MCI, and NC. 3 3.1 Ap p l i cati on i n B rai n Con n ecti vi ty M od el i n g of AD D a t a a c q u i s i t i o n a n d p re p ro c e s s i n g We apply SICE on FDG-PET images for 49 AD, 116 MCI, and 67 NC subjects downloaded from the ADNI website. We apply Automated Anatomical Labeling (AAL) [28] to extract data from each of the 116 anatomical volumes of interest (AVOI), and derived average of each AVOI for every subject. The AVOIs represent different regions of the whole brain. 3.2 B r a i n c o n n e c t i v i t y mo d e l i n g b y S I C E 42 AVOIs are selected for brain connectivity modeling, as they are considered to be potentially related to AD. These regions distribute in the frontal, parietal, occipital, and temporal lobes. Table 1 list of the names of the AVOIs with their corresponding lobes. The number before each AVOI is used to index the node in the connectivity models. We apply the SICE algorithm to learn one connectivity model for AD, one for MCI, and one for NC, for a given . With different ’s, the resulting connectivity models hold a monotone property, which can help obtain an order for the strength of connection between brain regions. To show the order clearly, we develop a tree-like plot in Fig. 1, which is for the AD group. To generate this plot, we start at a very small value (i.e., the right-most of the horizontal axis), which results in a fully-connected connectivity model. A fully-connected connectivity model is one that contains no region disconnected with the rest of the brain. Then, we decrease by small steps and record the order of the regions disconnected with the rest of the brain regions. Table 1: Names of the AVOIs for connectivity modeling (“L” means that the brain region is located at the left hemisphere; “R” means right hemisphere.) Frontal lobe Parietal lobe Occipital lobe Temporal lobe 1 Frontal_Sup_L 13 Parietal_Sup_L 21 Occipital_Sup_L 27 T emporal_Sup_L 2 Frontal_Sup_R 14 Parietal_Sup_R 22 Occipital_Sup_R 28 T emporal_Sup_R 3 Frontal_Mid_L 15 Parietal_Inf_L 23 Occipital_Mid_L 29 T emporal_Pole_Sup_L 4 Frontal_Mid_R 16 Parietal_Inf_R 24 Occipital_Mid_R 30 T emporal_Pole_Sup_R 5 Frontal_Sup_Medial_L 17 Precuneus_L 25 Occipital_Inf_L 31 T emporal_Mid_L 6 Frontal_Sup_Medial_R 18 Precuneus_R 26 Occipital_Inf_R 32 T emporal_Mid_R 7 Frontal_Mid_Orb_L 19 Cingulum_Post_L 33 T emporal_Pole_Mid_L 8 Frontal_Mid_Orb_R 20 Cingulum_Post_R 34 T emporal_Pole_Mid_R 9 Rectus_L 35 T emporal_Inf_L 8301 10 Rectus_R 36 T emporal_Inf_R 8302 11 Cingulum_Ant_L 37 Fusiform_L 12 Cingulum_Ant_R 38 Fusiform_R 39 Hippocampus_L 40 Hippocampus_R 41 ParaHippocampal_L 42 ParaHippocampal_R For example, in Fig. 1, as decreases below (but still above ), region “Tempora_Sup_L” is the first one becoming disconnected from the rest of the brain. As decreases below (but still above ), the rest of the brain further divides into three disconnected clusters, including the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, the cluster of “Fusiform_R” up to “Hippocampus_L”, and the cluster of the other regions. As continuously decreases, each current cluster will split into smaller clusters; eventually, when reaches a very large value, there will be no arc in the IC model, i.e., each region is now a cluster of itself and the split will stop. The sequence of the splitting gives an order for the strength of connection between brain regions. Specifically, the earlier (i.e., smaller ) a region or a cluster of regions becomes disconnected from the rest of the brain, the weaker it is connected with the rest of the brain. For example, in Fig. 1, it can be known that “Tempora_Sup_L” may be the weakest region in the brain network of AD; the second weakest ones are the cluster of “Cingulum_Post_R” and “Cingulum_Post_L”, and the cluster of “Fusiform_R” up to “Hippocampus_L”. It is very interesting to see that the weakest and second weakest brain regions in the brain network include “Cingulum_Post_R” and “Cingulum_Post_L” as well as regions all in the temporal lobe, all of which have been found to be affected by AD early and severely [3]-[5]. Next, to facilitate the comparison between AD and NC, a tree-like plot is also constructed for NC, as shown in Fig. 2. By comparing the plots for AD and NC, we can observe the following two distinct phenomena: First, in AD, between-lobe connectivity tends to be weaker than within-lobe connectivity. This can be seen from Fig. 1 which shows a clear pattern that the lobes become disconnected with each other before the regions within each lobe become disconnected with each other, as goes from small to large. This pattern does not show in Fig. 2 for NC. Second, the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. This can be seen from Fig. 2 for NC, in which the same brain regions in the left and right hemisphere are still connected even at a very large for NC. However, this pattern does not show in Fig. 1 for AD. Furthermore, a tree-like plot is also constructed for MCI (Fig. 3), and compared with the plots for AD and NC. In terms of the two phenomena discussed previously, MCI shows similar patterns to AD, but these patterns are not as distinct from NC as AD. Specifically, in terms of the first phenomenon, MCI also shows weaker between-lobe connectivity than within-lobe connectivity, which is similar to AD. However, the degree of weakerness is not as distinctive as AD. For example, a few regions in the temporal lobe of MCI, including “Temporal_Mid_R” and “Temporal_Sup_R”, appear to be more strongly connected with the occipital lobe than with other regions in the temporal lobe. In terms of the second phenomenon, MCI also shows weaker between-hemisphere connectivity in the same brain region than NC. However, the degree of weakerness is not as distinctive as AD. For example, several left-right pairs of the same brain regions are still connected even at a very large , such as “Rectus_R” and “Rectus_L”, “Frontal_Mid_Orb_R” and “Frontal_Mid_Orb _L”, “Parietal_Sup_R” and “Parietal_Sup_L”, as well as “Precuneus_R” and “Precuneus_L”. All above findings are consistent with the knowledge that MCI is a transition stage between normal aging and AD. Large λ λ3 λ2 λ1 Small λ Fig 1: Order for the strength of connection between brain regions of AD Large λ Small λ Fig 2: Order for the strength of connection between brain regions of NC Fig 3: Order for the strength of connection between brain regions of MCI Furthermore, we would like to compare how within-lobe and between-lobe connectivity is different across AD, MCI, and NC. To achieve this, we first learn one connectivity model for AD, one for MCI, and one for NC. We adjust the in the learning of each model such that the three models, corresponding to AD, MCI, and NC, respectively, will have the same total number of arcs. This is to “normalize” the models, so that the comparison will be more focused on how the arcs distribute differently across different models. By selecting different values for the total number of arcs, we can obtain models representing the brain connectivity at different levels of strength. Specifically, given a small value for the total number of arcs, only strong arcs will show up in the resulting connectivity model, so the model is a model of strong brain connectivity; when increasing the total number of arcs, mild arcs will also show up in the resulting connectivity model, so the model is a model of mild and strong brain connectivity. For example, Fig. 4 shows the connectivity models for AD, MCI, and NC with the total number of arcs equal to 50 (Fig. 4(a)), 120 (Fig. 4(b)), and 180 (Fig. 4(c)). In this paper, we use a “matrix” representation for the SICE of a connectivity model. In the matrix, each row represents one node and each column also represents one node. Please see Table 1 for the correspondence between the numbering of the nodes and the brain region each number represents. The matrix contains black and white cells: a black cell at the -th row, -th column of the matrix represents existence of an arc between nodes and in the SICE-based connectivity model, whereas a white cell represents absence of an arc. According to this definition, the total number of black cells in the matrix is equal to twice the total number of arcs in the SICE-based connectivity model. Moreover, on each matrix, four red cubes are used to highlight the brain regions in each of the four lobes; that is, from top-left to bottom-right, the red cubes highlight the frontal, parietal, occipital, and temporal lobes, respectively. The black cells inside each red cube reflect within-lobe connectivity, whereas the black cells outside the cubes reflect between-lobe connectivity. While the connectivity models in Fig. 4 clearly show some connectivity difference between AD, MCI, and NC, it is highly desirable to test if the observed difference is statistically significant. Therefore, we further perform a hypothesis testing and the results are summarized in Table 2. Specifically, a P-value is recorded in the sub-table if it is smaller than 0.1, such a P-value is further highlighted if it is even smaller than 0.05; a “---” indicates that the corresponding test is not significant (P-value>0.1). We can observe from Fig. 4 and Table 2: Within-lobe connectivity: The temporal lobe of AD has significantly less connectivity than NC. This is true across different strength levels (e.g., strong, mild, and weak) of the connectivity; in other words, even the connectivity between some strongly-connected brain regions in the temporal lobe may be disrupted by AD. In particular, it is clearly from Fig. 4(b) that the regions “Hippocampus” and “ParaHippocampal” (numbered by 39-42, located at the right-bottom corner of Fig. 4(b)) are much more separated from other regions in AD than in NC. The decrease in connectivity in the temporal lobe of AD, especially between the Hippocampus and other regions, has been extensively reported in the literature [3]-[5]. Furthermore, the temporal lobe of MCI does not show a significant decrease in connectivity, compared with NC. This may be because MCI does not disrupt the temporal lobe as badly as AD. AD MCI NC Fig 4(a): SICE-based brain connectivity models (total number of arcs equal to 50) AD MCI NC Fig 4(b): SICE-based brain connectivity models (total number of arcs equal to 120) AD MCI NC Fig 4(c): SICE-based brain connectivity models (total number of arcs equal to 180) The frontal lobe of AD has significantly more connectivity than NC, which is true across different strength levels of the connectivity. This has been interpreted as compensatory reallocation or recruitment of cognitive resources [6]-[7]. Because the regions in the frontal lobe are typically affected later in the course of AD (our data are early AD), the increased connectivity in the frontal lobe may help preserve some cognitive functions in AD patients. Furthermore, the frontal lobe of MCI does not show a significant increase in connectivity, compared with NC. This indicates that the compensatory effect in MCI brain may not be as strong as that in AD brains. Table 2: P-values from the statistical significance test of connectivity difference among AD, MCI, and NC (a) Total number of arcs = 50 (b) Total number of arcs = 120 (c) Total number of arcs = 180 There is no significant difference among AD, MCI, and NC in terms of the connectivity within the parietal lobe and within the occipital lobe. Another interesting finding is that all the P-values in the third sub-table of Table 2(a) are insignificant. This implies that distribution of the strong connectivity within and between lobes for MCI is very similar to NC; in other words, MCI has not been able to disrupt the strong connectivity among brain regions (it disrupts some mild and weak connectivity though). Between-lobe connectivity: In general, human brains tend to have less between-lobe connectivity than within-lobe connectivity. A majority of the strong connectivity occurs within lobes, but rarely between lobes. These can be clearly seen from Fig. 4 (especially Fig. 4(a)) in which there are much more black cells along the diagonal direction than the off-diagonal direction, regardless of AD, MCI, and NC. The connectivity between the parietal and occipital lobes of AD is significantly more than NC which is true especially for mild and weak connectivity. The increased connectivity between the parietal and occipital lobes of AD has been previously reported in [3]. It is also interpreted as a compensatory effect in [6]-[7]. Furthermore, MCI also shows increased connectivity between the parietal and occipital lobes, compared with NC, but the increase is not as significant as AD. While the connectivity between the frontal and occipital lobes shows little difference between AD and NC, such connectivity for MCI shows a significant decrease especially for mild and weak connectivity. Also, AD may have less temporal-occipital connectivity, less frontal-parietal connectivity, but more parietal-temporal connectivity than NC. Between-hemisphere connectivity: Recall that we have observed from the tree-like plots in Figs. 3 and 4 that the same brain regions in the left and right hemisphere are connected much weaker in AD than in NC. It is desirable to test if this observed difference is statistically significant. To achieve this, we test the statistical significance of the difference among AD, MCI, and NC, in term of the number of connected same-region left-right pairs. Results show that when the total number of arcs in the connectivity models is equal to 120 or 90, none of the tests is significant. However, when the total number of arcs is equal to 50, the P-values of the tests for “AD vs. NC”, “AD vs. MCI”, and “MCI vs. NC” are 0.009, 0.004, and 0.315, respectively. We further perform tests for the total number of arcs equal to 30 and find the P-values to be 0. 0055, 0.053, and 0.158, respectively. These results indicate that AD disrupts the strong connectivity between the same regions of the left and right hemispheres, whereas this disruption is not significant in MCI. 4 Con cl u si on In the paper, we applied SICE to model functional brain connectivity of AD, MCI, and NC based on PET neuroimaging data, and analyze the patterns based on the monotone property of SICE. Our findings were consistent with the previous literature and also showed some new aspects that may suggest further investigation in brain connectivity research in the future. R e f e re n c e s [1] S. Molchan. (2005) The Alzheimer's disease neuroimaging initiative. Business Briefing: US Neurology Review, pp.30-32, 2005. [2] C.J. Stam, B.F. Jones, G. Nolte, M. Breakspear, and P. Scheltens. (2007) Small-world networks and functional connectivity in Alzheimer’s disease. Cerebral Corter 17:92-99. [3] K. Supekar, V. Menon, D. Rubin, M. Musen, M.D. Greicius. (2008) Network Analysis of Intrinsic Functional Brain Connectivity in Alzheimer's Disease. PLoS Comput Biol 4(6) 1-11. [4] K. Wang, M. Liang, L. Wang, L. Tian, X. Zhang, K. Li and T. Jiang. (2007) Altered Functional Connectivity in Early Alzheimer’s Disease: A Resting-State fMRI Study, Human Brain Mapping 28, 967978. [5] N.P. Azari, S.I. Rapoport, C.L. Grady, M.B. Schapiro, J.A. Salerno, A. Gonzales-Aviles. (1992) Patterns of interregional correlations of cerebral glucose metabolic rates in patients with dementia of the Alzheimer type. Neurodegeneration 1: 101–111. [6] R.L. Gould, B.Arroyo, R,G. Brown, A.M. Owen, E.T. Bullmore and R.J. Howard. (2006) Brain Mechanisms of Successful Compensation during Learning in Alzheimer Disease, Neurology 67, 1011-1017. [7] Y. Stern. (2006) Cognitive Reserve and Alzheimer Disease, Alzheimer Disease Associated Disorder 20, 69-74. [8] K.J. Friston. (1994) Functional and effective connectivity: A synthesis. Human Brain Mapping 2, 56-78. [9] G. Alexander, J. Moeller. (1994) Application of the Scaled Subprofile model: a statistical approach to the analysis of functional patterns in neuropsychiatric disorders: A principal component approach to modeling regional patterns of brain function in disease. Human Brain Mapping, 79-94. [10] V.D. Calhoun, T. Adali, G.D. Pearlson, J.J. Pekar. (2001) Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Hum.Brain Mapp. 13, 43-53. [11] V.D. Calhoun, T. Adali, J.J. Pekar, G.D. Pearlson. (2003) Latency (in)sensitive ICA. Group independent component analysis of fMRI data in the temporal frequency domain. Neuroimage. 20, 1661-1669. [12] A.R. McIntosh, F.L. Bookstein, J.V. Haxby, C.L. Grady. (1996) Spatial pattern analysis of functional brain images using partial least squares. Neuroimage. 3, 143-157. [13] K.J. Worsley, J.B. Poline, K.J. Friston, A.C. Evans. (1997) Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage. 6, 305-319. [14] E. Bullmore, B. Horwitz, G. Honey, M. Brammer, S. Williams, T. Sharma. (2000) How good is good enough in path analysis of fMRI data? NeuroImage 11, 289–301. [15] A.R. McIntosh, C.L. Grady, L.G. Ungerieider, J.V. Haxby, S.I. Rapoport, B. Horwitz. (1994) Network analysis of cortical visual pathways mapped with PET. J. Neurosci. 14 (2), 655–666. [16] K.J. Friston, L. Harrison, W. Penny. (2003) Dynamic causal modelling. Neuroimage 19, 1273-1302. [17] O. Banerjee, L. El Ghaoui, and A. d’Aspremont. (2008) Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research 9:485516. [18] J. Dahl, L. Vandenberghe, and V. Roycowdhury. (2008) Covariance selection for nonchordal graphs via chordal embedding. Optimization Methods Software 23(4):501-520. [19] J. Friedman, T.astie, and R. Tibsirani. (2007) Spares inverse covariance estimation with the graphical lasso, Biostatistics 8(1):1-10. [20] J.Z. Huang, N. Liu, M. Pourahmadi, and L. Liu. (2006) Covariance matrix selection and estimation via penalized normal likelihood. Biometrika, 93(1):85-98. [21] H. Li and J. Gui. (2005) Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2):302-317. [22] Y. Lin. (2007) Model selection and estimation in the gaussian graphical model. Biometrika 94(1)19-35, 2007. [23] A. Dobra, C. Hans, B. Jones, J.R. Nevins, G. Yao, and M. West. (2004) Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90(1):196-212. [24] A. Berge, A.C. Jensen, and A.H.S. Solberg. (2007) Sparse inverse covariance estimates for hyperspectral image classification, Geoscience and Remote Sensing, IEEE Transactions on, 45(5):1399-1407. [25] J.A. Bilmes. (2000) Factored sparse inverse covariance matrices. In ICASSP:1009-1012. [26] L. Sun and et al. (2009) Mining Brain Region Connectivity for Alzheimer's Disease Study via Sparse Inverse Covariance Estimation. In KDD: 1335-1344. [27] R. Tibshirani. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58(1):267-288. [28] N. Tzourio-Mazoyer and et al. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single subject brain. Neuroimage 15:273-289. [29] Supplemental information for “Learning Brain Connectivity of Alzheimer's Disease from Neuroimaging Data”. http://www.public.asu.edu/~jye02/Publications/AD-supplemental-NIPS09.pdf

5 0.5028047 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu

Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efﬁcient inference algorithms for a conditional random ﬁeld using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efﬁcient learning algorithms for these conditional random ﬁelds. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1

6 0.48220617 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

7 0.4547475 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

8 0.40255591 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

9 0.36325952 70 nips-2009-Discriminative Network Models of Schizophrenia

10 0.3401151 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

11 0.33194527 7 nips-2009-A Data-Driven Approach to Modeling Choice

12 0.32104877 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

13 0.29743484 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

14 0.28760955 69 nips-2009-Discrete MDL Predicts in Total Variation

15 0.28187236 50 nips-2009-Canonical Time Warping for Alignment of Human Behavior

16 0.2549473 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

17 0.23688404 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

18 0.22530183 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

19 0.2235053 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

20 0.22203529 237 nips-2009-Subject independent EEG-based BCI decoding

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.01), (24, 0.021), (25, 0.042), (28, 0.021), (31, 0.394), (35, 0.031), (36, 0.114), (39, 0.046), (58, 0.106), (61, 0.017), (71, 0.032), (81, 0.015), (86, 0.047), (91, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82941997 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

2 0.82048774 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons

Author: Romain Brasselet, Roland Johansson, Angelo Arleo

Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve ﬁbers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efﬁciency of the haptic discrimination process is quantiﬁed by a novel deﬁnition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the ﬁrst afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1

3 0.67167878 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random ﬁelds) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random ﬁeld develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random ﬁeld µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is deﬁned as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change signiﬁcantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for speciﬁc classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results conﬁrm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be deﬁnite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to ﬁnd W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after ﬁxing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is deﬁned as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More speciﬁcally, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This ﬁrst result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More speciﬁcally, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the ﬂuctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a speciﬁc low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is deﬁned from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our ﬁrst result shows that Rlr(λ) indeed reconstructs G if θ is sufﬁciently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires signiﬁcant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difﬁcult to learn. We will instead show how to evaluate it on speciﬁc graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufﬁcient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and deﬁne S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Deﬁne Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our ﬁrst auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufﬁcient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufﬁciently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisﬁed since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by deﬁnition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and deﬁne the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external ﬁeld’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any ﬁxed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) ﬁnite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simpliﬁcations arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random ﬁelds”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ ﬂing and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9

4 0.63138801 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

Author: Chong Wang, David M. Blei

Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efﬁciently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1

5 0.54255891 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Author: Rob Fergus, Yair Weiss, Antonio Torralba

Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efﬁcient approximations for semi-supervised learning that are linear in the number of images. Speciﬁcally, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1

6 0.42104134 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.4054918 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

8 0.40293041 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

9 0.40232164 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

10 0.4014855 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

11 0.40087897 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

12 0.40080184 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

13 0.39969218 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

14 0.39892992 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

15 0.39730179 76 nips-2009-Efficient Learning using Forward-Backward Splitting

16 0.39669752 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

17 0.39560869 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

18 0.3948279 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

19 0.39469701 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

20 0.39445162 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing