nips nips2012 nips2012-157 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) defining a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, which enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b) computationally efficient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting "mass" over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efficiency of the approximation, especially for large problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Wells (III) Harvard Medical School Boston, MA 02115 Abstract Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. [sent-3, score-0.654]
2 This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting "mass" [sent-6, score-0.484]
3 1 Introduction In addition to functional localization and integration, mapping the neural correlates of "mental states" [sent-9, score-0.121]
4 (i.e. the distinct cognitive, affective or perceptive states of the human mind) is an important research topic for understanding the connection between mind and brain [2]. [sent-13, score-0.265]
5 In functional neuroimaging, this problem is equivalent to identifying recurrent spatial patterns from the recorded activation of neural circuits and relating them with the mental state of the subject. [sent-14, score-0.484]
6 In contrast to clustering voxels based on the similarity of their functional activity (i.e. [sent-16, score-0.302]
7 along the spatial dimension) [15], the problem of clustering fMRI data along the temporal dimension has not been widely explored in the literature, primarily because of the following challenges: a) Lack of [sent-18, score-0.096]
8 a physiologically meaningful metric to compare the difference between the spatial distribution of recorded brain activity (i.e. [sent-21, score-0.404]
9 brain states) at two different time-points; b) Problems that arise because the number of voxels (i.e. [sent-23, score-0.255]
10 The dimensionality problem in fMRI has been typically addressed through PCA [16], ICA[3] or by selection of a subset of voxels either manually or via regression against the stimulus [18, 11]. [sent-29, score-0.114]
11 On the other hand, supervised feature-spaces are inherently biased towards the experimental variables against which they were selected or by the investigator’s expectations, and may not capture unexpected patterns in the data. [sent-31, score-0.112]
12 Intuitively, this network-aware metric assesses the distance between two states zt1 , zt2 that differ mainly on proximally connected nodes to be less than the distance between states zt1 , zt2 that differ on unconnected nodes. [sent-35, score-0.485]
13 In the context of neuroimaging, where the network measures the functional connectivity [4] between brain regions, this implies that two brain activation patterns that differ mainly on functionally similar regions are functionally closer than two that differ on functionally unrelated regions. [sent-38, score-0.923]
14 For example, zt1 and zt2 that activated mainly in the cingulo-opercular network would be functionally more similar to each other than to zt3, which exhibited activity mainly in the fronto-parietal network. [sent-39, score-0.287]
15 Such network awareness is provided by the Kantorovich metric [20], also called the transportation distance (TD), which measures the minimum flow of "mass" [sent-40, score-0.39]
16 over the network to make zt1 at time t1 match zt2 at t2. (Figure 1: shown are zt1, zt2 and zt3, three states; zt1 and zt2 activate on more proximal regions of the graph.) [sent-41, score-0.127]
17 The cost of this flow is encoded by the weights of the edges of the graph. [sent-43, score-0.118]
18 The Earth Mover's Distance (EMD), closely related to the transportation distance, is widely used for clustering and retrieval in computer vision, medical imaging, bio-informatics and data-mining [21, 22, 7]. [sent-46, score-0.279]
19 The TD, however, has the following limitations: Firstly, it is computationally expensive, with worst-case complexity of O(NV^3 log NV) where NV is the number of nodes in the graph [17]. [sent-48, score-0.134]
20 The second contribution of this paper is to address these issues through the development of a linear feature-space that provides a good approximation of the transportation distance. [sent-56, score-0.28]
21 This feature–space is motivated by spherical relaxation [14] of the dual polytope of the transportation problem, as described in Section 2. [sent-57, score-0.46]
22 The network function zG is then embedded into a Euclidean space via a similarity transformation such that the transportation distance is well-approximated by the ℓ2 distance in this space, as elucidated in Section 3. [sent-58, score-0.519]
23 In contrast to existing linear approximations, the feature-space developed here has a very simple form closely related to the graph Laplacian [6]. [sent-59, score-0.086]
24 Theoretical bounds on the error of the approximation are developed and the accuracy of the method is validated empirically in Section 4. [sent-60, score-0.095]
25 2 Transportation Distance and Spherical Relaxation Let zt1 and zt2 denote the states of zG at time-points t1, t2 on the graph G = (V, E), with nodes V = {1, . . . , NV}. [sent-66, score-0.203]
26 The symmetric distance matrix WG [i, j] ∈ R+ encodes the cost of transport between nodes i and j . [sent-70, score-0.176]
27 Also, define the difference between two states as dz = zt1 − zt2, and assume Σ_{i∈V} dz[i] = 0 without loss of generality. [sent-71, score-0.309]
28 over the network to convert zt1 into zt2, is posed as the following linear program (LP): TD(zt1, zt2) = min_f Σ_{i,j∈V} f[i,j] WG[i,j] subject to Σ_{j∈V} f[i,j] − Σ_{j∈V} f[j,i] = dz[i] for all i ∈ V, with f[i,j] ≥ 0. (1) [sent-73, score-0.093]
29 The corresponding TP dual, formulated in the unrestricted dual variables g : V → R, is: TD(zt1, zt2) = max_g ⟨g, dz⟩ subject to Ag ≤ wG, i.e. g[i] − g[j] ≤ WG[i,j] for all i, j ∈ V, where each row of the constraint matrix A contains a single +1 and a single −1 entry. (2) [sent-74, score-0.274]
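To make the primal LP (1) concrete, here is a minimal sketch of how it can be solved with an off-the-shelf LP solver. This is illustrative only: the paper uses a network-simplex solver (CPLEX), and the function name, dense cost matrix W and example graph below are our own choices, not the authors' implementation.

import numpy as np
from scipy.optimize import linprog

def transport_distance(dz, W):
    """Transportation distance TD for a state difference dz (entries sum to 0)
    on a graph with pairwise transport costs W[i, j] >= 0."""
    n = len(dz)
    c = np.asarray(W, dtype=float).reshape(-1)      # cost of each flow f[i, j], row-major
    A_eq = np.zeros((n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] += 1.0           # outgoing flow  sum_j f[i, j]
        A_eq[i, i::n] -= 1.0                        # incoming flow  sum_j f[j, i]
    res = linprog(c, A_eq=A_eq, b_eq=np.asarray(dz, float),
                  bounds=(0, None), method="highs")
    return res.fun

# Example with a 3-node path graph (costs = hop counts): moving one unit of
# "mass" from node 0 to node 2 costs two hops, so TD = 2.
W = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(transport_distance(np.array([1.0, 0.0, -1.0]), W))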
30 The feasible set of the dual is a convex polytope formed by the intersection of the half-spaces specified by the constraints {ai,j , i = 1 : NV , j = 1 . [sent-129, score-0.146]
31 These constraints, which form normals to the hyper-planes bounding this polytope, are symmetrically distributed in the +i × −j quadrant of RNV for each combination of i and j. [sent-133, score-0.106]
32 Moreover, A is totally unimodular [5] and has rank NV − 1, with the LP polytope lying in an (NV − 1)-dimensional space orthogonal to 1NV, the all-ones vector in RNV. [sent-134, score-0.112]
33 Here, the dual takes the form: TD(zt1, zt2) = max_g ⟨g, dz⟩ subject to Ag ≤ 1NV×(NV−1). (3) [sent-140, score-0.275]
34 Each hyper-plane of the LP polytope is at distance 1/√2 from the origin, and the maximum inscribed hyper-sphere, with center at the origin and radius 1/√2, touches all the polytope's hyper-planes. [sent-141, score-0.328]
35 Relaxing the feasible set of the TP dual from the convex polytope to this hyper-sphere, eqn. [sent-144, score-0.146]
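For intuition, the value of the relaxed problem in this regular case can be written out directly (a sketch of the standard Cauchy–Schwarz argument, not the paper's exact statement): replacing the dual feasible set by the inscribed ball of radius 1/√2 gives

    max over ||g||_2 ≤ 1/√2 of ⟨g, dz⟩ = (1/√2) ||dz||_2,   attained at g = dz / (√2 ||dz||_2),

so on the relaxed feasible set the transportation distance reduces, up to the constant 1/√2, to the Euclidean norm of the state difference dz.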
36 3 Linear Feature Space Embedding In the case of an arbitrary distance matrix WG , however, the polytope loses its regular structure, and has a variable number of extreme points. [sent-146, score-0.213]
37 Also, in general, the maximal inscribed hyper-sphere does not touch all the bounding hyper-planes, resulting in a very poor approximation [14]. [sent-147, score-0.11]
38 Therefore, to use the spherical relaxation for the general problem, we apply a similarity transformation M, such that A · M = diag{wG}−1 A and M is positive semi-definite. [sent-148, score-0.111]
39 (2) in terms of a new variable ξ = Mg, we see that the general problem: TD(zt1, zt2) = max_g ⟨g, dz⟩ such that Ag ≤ wG (6) is equivalent to the special case given by eqn. [sent-150, score-0.24]
40 (3), in a transformed space, as per: TD(zt1, zt2) = max_ξ ⟨M−ξ, dz⟩ such that Aξ ≤ 1NV×(NV−1), (7) where M− is the (pseudo-)inverse of M. [sent-151, score-0.24]
41 Consequently, the transportation distance can be approximated by an ℓ2 metric through a similarity transformation of the original space. [sent-157, score-0.429]
42 In this case the error of the approximation is O(λ_min^{-1} ||dz||_2) (§ Theorem 1 of the Supplemental), which implies that the approximation improves as the smallest non-zero eigenvalue of the graph Laplacian increases. [sent-158, score-0.23]
43 Also, notice that the eigenvector vNV of LG corresponding to the smallest eigenvalue λNV = 0 is a constant vector, and therefore ⟨vNV, dz⟩ = 0 by the requirement that Σ_{i∈V} dz[i] = 0, thereby automatically reducing the dimension of the projected space to NV − 1. [sent-159, score-0.24]
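The following is a minimal sketch of a Laplacian-based Euclidean feature map in the spirit of this section. The specific choice made here (projecting onto the non-null eigenvectors of LG and rescaling by 1/√λk, i.e. an LG^{+1/2}-style map) is an illustrative assumption; the transformation M actually used in the paper is derived from the condition A · M = diag{wG}−1 A and need not coincide with it.

import numpy as np

def laplacian_feature_map(W_adj):
    """Build a feature map phi such that ||phi(z1) - phi(z2)||_2 serves as a
    network-aware surrogate for TD(z1, z2).
    W_adj: symmetric non-negative edge-weight (affinity) matrix of the graph."""
    L = np.diag(W_adj.sum(axis=1)) - W_adj            # graph Laplacian L_G
    lam, V = np.linalg.eigh(L)                        # eigenvalues ascending; lam[0] ~ 0
    keep = lam > 1e-10                                # drop the constant eigenvector -> NV - 1 dims
    scale = 1.0 / np.sqrt(lam[keep])                  # pseudo-inverse square-root scaling (assumption)
    basis = V[:, keep]
    return lambda z: scale * (basis.T @ np.asarray(z, float))

# Usage: phi = laplacian_feature_map(W_adj); d = np.linalg.norm(phi(z1) - phi(z2))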
44 4 Results First, we start by providing an empirical validation of the approximation to the transportation distance in Section 4.1. [sent-165, score-0.381]
45 And then the feature-space is used to find representative patterns (i.e. [sent-166, score-0.112]
46 brain states) in the dynamically changing activations of the brain during a visuo-motor task in Section 4.2. [sent-168, score-0.334]
47 4.1 Validation To validate the linear approximation to the transportation distance on networks, like the brain, that exhibit a scale-free property [1], we simulated random graphs of NV vertices using the following procedure: a) Create an edge between nodes i and j with probability [sent-171, score-0.429]
48 For each instance G(n) of the graph, a set of T = 100 states zt : V → R, t = 1 . [sent-173, score-0.119]
49 The transportation problem was solved using network simplex [17] in the IBM CPLEX optimization package, while the linear approximation was implemented in Matlab . [sent-181, score-0.338]
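A sketch of this validation protocol is given below, under stated assumptions: scale-free graphs are generated with the Barabási–Albert model (the paper's exact edge-probability rule is truncated above and is not reproduced), pairwise shortest-path lengths are used as transport costs, and the transport_distance and laplacian_feature_map helpers from the earlier sketches stand in for the CPLEX/Matlab implementations.

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
NV, T = 64, 100
G = nx.barabasi_albert_graph(NV, m=2, seed=0)            # scale-free-style random graph (assumption)
W_adj = nx.to_numpy_array(G)
spl = dict(nx.all_pairs_shortest_path_length(G))
W = np.array([[spl[i][j] for j in range(NV)] for i in range(NV)], dtype=float)

phi = laplacian_feature_map(W_adj)                       # from the earlier sketch
Z = rng.normal(size=(T, NV))
Z -= Z.mean(axis=1, keepdims=True)                       # each state sums to zero, so dz does too

rel_err = []
for t in range(1, T):
    dz = Z[0] - Z[t]
    td = transport_distance(dz, W)                       # from the earlier sketch
    approx = np.linalg.norm(phi(Z[0]) - phi(Z[t]))
    rel_err.append(abs(td - approx) / td)
print("median relative error:", np.median(rel_err))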
50 This is because the approximation error for an arbitrary graph is O(λ_min^{-1} ||dz||_2), while for random graphs satisfying basic regularity conditions the eigenvalues of the graph Laplacian increase as O(NV) [8]. [sent-189, score-0.295]
51 In comparison, the Euclidean metric ||zt1 − zt2 ||2 starts with a much higher relative error with respect to the transportation distance, and although its error also reduces with graph size, the trend is slower. [sent-190, score-0.478]
52 Secondly, the variance of the error is much higher than that of the linear embedding proposed here. [sent-191, score-0.093]
53 From Fig. 2(c), we observe that for data-points that are relatively close to each other, the ordering relationships are preserved with very high accuracy, and that this accuracy reduces as the relative distance between the points increases. [sent-200, score-0.147]
54 The formulation has the effect of normalizing the distance between zt1 , zt2 with respect to the local neighborhood of zt1 . [sent-206, score-0.101]
55 We see in Fig. 2(d) that the approximation error according to this measure is extremely low and almost constant with respect to NV for points that are close to each other. [sent-208, score-0.095]
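A compact sketch of how such a neighbourhood-normalised error can be computed; the choice of averaging over the 10 nearest neighbours follows the description of Fig. 2(d) later in the text, and the exact normalisation used by the authors may differ.

import numpy as np

def neighborhood_normalized_error(td_exact, td_approx, k=10):
    """td_exact, td_approx: distances from a reference state zt1 to all other states."""
    td_exact = np.asarray(td_exact, float)
    nn_scale = np.mean(np.sort(td_exact)[:k])    # average distance to the k closest points
    return np.abs(td_exact - np.asarray(td_approx, float)) / nn_scale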
56 4.2 Neuroimaging Data Clustering using the feature-space described in this paper was applied to a data-set of fifteen subjects performing a visuo-motor task during functional MR imaging to discover salient patterns of recurrent brain activation. [sent-211, score-0.42]
57 The subjects were visually exposed to oriented wedges filled with high-contrast random noise patterns and displayed randomly in one of four quadrants. [sent-212, score-0.152]
58 They were asked to focus on a center dot and to perform a finger-tapping motion with the right or left hand when the visual wedge was active in the upper right or lower left quadrants, respectively. [sent-213, score-0.243]
59 Block length of each visual wedge stimulation varied from 5 to 15s and noise patterns changed at a frequency of 5Hz. [sent-214, score-0.389]
60 [Figure 2 axis residue: x-axis "Number of nodes" (128 to 16384), error percentages, and quartile legend.] [sent-220, score-0.125]
61 (a) shows the (amortized) per-comparison running time in seconds for the transportation distance TD and its approximation, with respect to graph size NV. [sent-223, score-0.467]
62 (b) the relative approximation error (TD − T̂D)/TD (±1 std. [sent-225, score-0.095]
63 The error for a Euclidean approximation ||zt1 − zt2||_2 is also shown for comparison. [sent-228, score-0.095]
64 The set {zt2} is divided into quartiles according to their distance TD(zt1, zt2) from zt1, where the 25 percentile is the set of the first 25% closest points to zt1 (similarly for the 50 and 75%-iles). [sent-236, score-0.101]
65 Also shown is the ordering error of the Euclidean metric with respect to TD. [sent-237, score-0.161]
66 Fig (d) shows the quartile-wise approximation error normalized by the average distance of its 10 nearest neighbors. [sent-239, score-0.196]
67 The dashed line shows the un-normalized approximation error (§ Fig. [sent-240, score-0.095]
68 The fMRI scans were motion-corrected using linear registration and co-registered with the structural scans using SPM8 [16]. [sent-245, score-0.149]
69 The design matrix included a regressor for the presentation of the wedge in each quadrant, convolved with a canonical hemodynamic response function. [sent-249, score-0.247]
70 The functional networks for a subject were computed by estimating the correlations between voxels using the method described in Supplemental Section C, which is sparse, consistent and computationally efficient. [sent-254, score-0.205]
71 The distance matrix of the functional connectivity graph was constructed as WG[i, j] = −log(|ρ[i, j]|/τ), where ρ[i, j] is the correlation between voxels i and j and τ is a user-defined scale parameter (typically set to 10). [sent-255, score-0.404]
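A small sketch of this connectivity-to-cost conversion; the correlation estimator itself (Supplemental Section C) is not reproduced, and the zero self-cost convention and the eps guard are assumptions of this sketch.

import numpy as np

def connectivity_cost_matrix(rho, tau=10.0, eps=1e-12):
    """WG[i, j] = -log(|rho[i, j]| / tau) for a correlation matrix rho."""
    rho = np.asarray(rho, dtype=float)
    W = -np.log((np.abs(rho) + eps) / tau)       # eps avoids log(0) for uncorrelated pairs
    np.fill_diagonal(W, 0.0)                     # zero cost from a node to itself (assumption)
    return W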
72 A multinomial logistic classifier (MLC) was then trained to predict the wedge position at time t from πt. [sent-270, score-0.214]
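A minimal sketch of this decoding step using scikit-learn's logistic regression; the paper does not specify an implementation, so the library, solver and regularisation below are our own choices.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_mlc(pi, wedge_labels):
    """pi: (n_timepoints, n_states) state-probability vectors pi_t;
    wedge_labels: (n_timepoints,) wedge-quadrant labels at each time-point."""
    # The lbfgs solver fits a multinomial model for multi-class targets.
    mlc = LogisticRegression(max_iter=1000)
    return mlc.fit(pi, wedge_labels)

# Prediction error: np.mean(fit_mlc(pi_train, y_train).predict(pi_test) != y_test)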
73 It should be noted here that identification of patterns of recorded brain activity was performed in a purely unsupervised manner. [sent-272, score-0.363]
74 Only model selection and model interpretation were done, post hoc, using observable correlates of the unobservable mental state of the subject. [sent-273, score-0.245]
75 Spatial maps for each wedge orientation were computed as an average of cluster centroids weighted by the MLC weights for that orientation. [sent-274, score-0.299]
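The sentence above maps directly onto a weighted average; a small sketch is given below, where taking absolute values and normalising the MLC weights per orientation are assumptions of this sketch rather than details stated in the text.

import numpy as np

def orientation_maps(centroids, mlc_coef):
    """centroids: (n_states, n_voxels) cluster centroids;
    mlc_coef: (n_orientations, n_states) MLC weights linking states to wedge orientations."""
    w = np.abs(mlc_coef)
    w = w / w.sum(axis=1, keepdims=True)         # normalise weights per orientation (assumption)
    return w @ centroids                         # (n_orientations, n_voxels) spatial maps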
76 Fig. 3(c) shows the distribution of state probabilities for one subject corresponding to a sequence of wedges oriented in each quadrant for 4 TRs each. [sent-278, score-0.202]
77 Here, we see that the probability of a particular state is highly structured with respect to the orientation of the wedge. [sent-279, score-0.096]
78 For example, at the start of the presentation with the wedge in the lower-right quadrant, state 1 is most probable. [sent-280, score-0.264]
79 However, as this orientation is maintained, the probability distribution peaks about state 4 and remains stable. [sent-283, score-0.096]
80 The spatial maps reconstructed from these two feature-spaces (not shown here) exhibited task-specific activation patterns, although the foci were much weaker and much more diffuse as compared to those of the TD feature-space. [sent-287, score-0.092]
81 The error of the MLC in predicting the stimulus at time t from the state probability vector πt, which reflects the model’s ability to capture patterns in the data related to the mental state of the subject, for these three feature spaces is listed in Table 1. [sent-288, score-0.404]
82 Due to the random presentation of wedge orientations, the chance-level prediction error varied between 68%–81% for each subject. [sent-320, score-0.294]
83 (a): Group-level maximum intensity projections of significantly activated voxels (p < 0.05, [sent-326, score-0.115]
84 FWE corrected) at the four orientations of the wedge and the hand motor actions, computed using SPM8. [sent-327, score-0.289]
85 Fig. (b): Group-level z-maps showing the activity for each orientation of the wedge, computed as an average of cluster centroids weighted by the MLC weights. [sent-328, score-0.383]
86 8 during the display of the wedge in lower right, lower left, upper left and upper right quadrants for 4 TRs each. [sent-336, score-0.25]
87 5 Conclusion In this paper, we have presented an approach to compare and identify patterns of brain activation during a mental process using a distance metric that is aware of the connectivity structure of the underlying brain networks. [sent-338, score-0.827]
88 This distance metric is obtained by a Euclidean approximation of the transportation distance between patterns via a spherical relaxation of the linear-programming dual polytope. [sent-339, score-0.78]
89 The embedding is achieved by a transformation of the original space of the function with the graph Laplacian of the network. [sent-340, score-0.161]
90 We provided theoretical bounds on the quality of the approximation and through empirical validation demonstrated low error that, importantly, decreases as the size of the problem increases. [sent-342, score-0.095]
91 We also showed the superior ability of this distance metric to identify salient patterns of brain activity, related to the internal mental state of the subject, from an fMRI study of visuo-motor tasks. [sent-343, score-0.619]
92 The framework presented here is applicable to the more general problem of identifying patterns in time-varying measurements distributed over a network that has an intrinsic notion of distance and proximity, such as social, sensor, communication, transportation, energy and other similar networks. [sent-344, score-0.304]
93 Future work would include assessing the quality of the approximation for sparse, restricted topology, small-world and scale-free networks that arise in many real world cases, and applying the method for detecting patterns and outliers in these types of networks. [sent-345, score-0.161]
94 : A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. [sent-351, score-0.307]
95 : State–space models of mental processes from fMRI (2011) 2, 7 [14] Khachiyan, L. [sent-411, score-0.12]
96 : Categories and functional units: An infinite hierarchical model for brain activations. [sent-419, score-0.249]
97 : Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. [sent-440, score-0.167]
98 : Segmentation of brain electrical activity into microstates: model estimation and validation. [sent-446, score-0.251]
99 : Mass transportation problems: Volume I: Theory (probability and its applications) (March 1998) 2 [21] Rubner, Y. [sent-450, score-0.231]
100 : The Earth Mover's distance as a metric for image retrieval (1998) 2 [22] Shirdhonkar, S. [sent-455, score-0.17]
wordName wordTfidf (topN-words)
[('nv', 0.428), ('td', 0.368), ('wg', 0.324), ('dz', 0.24), ('transportation', 0.231), ('wedge', 0.214), ('brain', 0.167), ('fmri', 0.163), ('mlc', 0.121), ('mental', 0.12), ('patterns', 0.112), ('polytope', 0.112), ('distance', 0.101), ('voxels', 0.088), ('laplacian', 0.088), ('graph', 0.086), ('neuroimaging', 0.085), ('activity', 0.084), ('rnv', 0.082), ('lg', 0.082), ('functional', 0.082), ('quadrant', 0.077), ('states', 0.069), ('metric', 0.069), ('euclidean', 0.067), ('ssm', 0.066), ('supplemental', 0.063), ('zg', 0.062), ('functionally', 0.062), ('correig', 0.061), ('inscribed', 0.061), ('janoos', 0.061), ('network', 0.058), ('scans', 0.056), ('spherical', 0.053), ('state', 0.05), ('zt', 0.05), ('approximation', 0.049), ('clustering', 0.048), ('spatial', 0.048), ('nodes', 0.048), ('connectivity', 0.047), ('embedding', 0.047), ('volumes', 0.046), ('orientation', 0.046), ('error', 0.046), ('ordering', 0.046), ('ag', 0.045), ('jul', 0.044), ('activation', 0.044), ('pca', 0.041), ('earth', 0.041), ('rees', 0.04), ('rocz', 0.04), ('scanner', 0.04), ('vnv', 0.04), ('wedges', 0.04), ('ztn', 0.04), ('mri', 0.039), ('correlates', 0.039), ('cluster', 0.039), ('motor', 0.039), ('ow', 0.037), ('corrected', 0.037), ('orientations', 0.036), ('physiologically', 0.036), ('quadrants', 0.036), ('unobservable', 0.036), ('wells', 0.036), ('subject', 0.035), ('neurosci', 0.034), ('varied', 0.034), ('dual', 0.034), ('amortized', 0.033), ('deteriorate', 0.033), ('hemodynamic', 0.033), ('kantorovich', 0.033), ('mover', 0.033), ('intrinsic', 0.033), ('regions', 0.032), ('imaging', 0.031), ('masked', 0.031), ('emd', 0.031), ('relaxation', 0.03), ('symmetrically', 0.029), ('approx', 0.029), ('visual', 0.029), ('mind', 0.029), ('recurrent', 0.028), ('mainly', 0.028), ('glm', 0.028), ('artifacts', 0.028), ('transformation', 0.028), ('eigenvalues', 0.028), ('preserve', 0.027), ('lp', 0.027), ('activated', 0.027), ('transport', 0.027), ('origin', 0.027), ('stimulus', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) defining a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, which enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b) computationally efficient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting "mass" over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efficiency of the approximation, especially for large problems. 1
2 0.14584194 292 nips-2012-Regularized Off-Policy TD-Learning
Author: Bo Liu, Sridhar Mahadevan, Ji Liu
Abstract: We present a novel l1 regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying ROTD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm. 1
3 0.12103223 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
Author: Francisco Pereira, Matthew Botvinick
Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. 1
4 0.0851807 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain
Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
5 0.079552315 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
6 0.076380171 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
7 0.074553862 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination
8 0.073517196 69 nips-2012-Clustering Sparse Graphs
9 0.069596745 9 nips-2012-A Geometric take on Metric Learning
10 0.066061355 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
11 0.065849215 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
12 0.06461636 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
13 0.062837772 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
14 0.061824802 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
15 0.06135264 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities
16 0.059899766 242 nips-2012-Non-linear Metric Learning
17 0.059518151 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks
18 0.058519043 167 nips-2012-Kernel Hyperalignment
19 0.057743173 195 nips-2012-Learning visual motion in recurrent neural networks
20 0.057312842 94 nips-2012-Delay Compensation with Dynamical Synapses
topicId topicWeight
[(0, 0.175), (1, 0.027), (2, -0.044), (3, -0.028), (4, 0.02), (5, 0.036), (6, -0.01), (7, -0.013), (8, -0.048), (9, 0.051), (10, 0.032), (11, -0.143), (12, 0.029), (13, 0.011), (14, 0.042), (15, 0.01), (16, -0.023), (17, 0.072), (18, 0.035), (19, 0.007), (20, 0.007), (21, -0.027), (22, -0.069), (23, -0.045), (24, 0.086), (25, -0.05), (26, -0.073), (27, 0.139), (28, 0.062), (29, 0.07), (30, 0.01), (31, 0.004), (32, -0.03), (33, 0.012), (34, 0.059), (35, -0.046), (36, 0.081), (37, -0.034), (38, -0.053), (39, 0.024), (40, -0.05), (41, 0.014), (42, 0.064), (43, -0.085), (44, 0.04), (45, 0.036), (46, 0.059), (47, 0.093), (48, -0.021), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.92595422 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) defining a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, which enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b) computationally efficient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting "mass" over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efficiency of the approximation, especially for large problems. 1
2 0.62163115 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination
Author: Won H. Kim, Deepti Pachauri, Charles Hatt, Moo. K. Chung, Sterling Johnson, Vikas Singh
Abstract: Hypothesis testing on signals defined on surfaces (such as the cortical surface) is a fundamental component of a variety of studies in Neuroscience. The goal here is to identify regions that exhibit changes as a function of the clinical condition under study. As the clinical questions of interest move towards identifying very early signs of diseases, the corresponding statistical differences at the group level invariably become weaker and increasingly hard to identify. Indeed, after a multiple comparisons correction is adopted (to account for correlated statistical tests over all surface points), very few regions may survive. In contrast to hypothesis tests on point-wise measurements, in this paper, we make the case for performing statistical analysis on multi-scale shape descriptors that characterize the local topological context of the signal around each surface vertex. Our descriptors are based on recent results from harmonic analysis, that show how wavelet theory extends to non-Euclidean settings (i.e., irregular weighted graphs). We provide strong evidence that these descriptors successfully pick up group-wise differences, where traditional methods either fail or yield unsatisfactory results. Other than this primary application, we show how the framework allows performing cortical surface smoothing in the native space without mapping to a unit sphere. 1
3 0.55627567 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
Author: Jinfeng Yi, Rong Jin, Shaili Jain, Tianbao Yang, Anil K. Jain
Abstract: One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called semi-crowdsourced clustering that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects and from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency. 1
4 0.54436243 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
Author: Jun Wang, Alexandros Kalousis, Adam Woznica
Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1
5 0.53215057 46 nips-2012-Assessing Blinding in Clinical Trials
Author: Ognjen Arandjelovic
Abstract: The interaction between the patient’s expected outcome of an intervention and the inherent effects of that intervention can have extraordinary effects. Thus in clinical trials an effort is made to conceal the nature of the administered intervention from the participants in the trial i.e. to blind it. Yet, in practice perfect blinding is impossible to ensure or even verify. The current standard is follow up the trial with an auxiliary questionnaire, which allows trial participants to express their belief concerning the assigned intervention and which is used to compute a measure of the extent of blinding in the trial. If the estimated extent of blinding exceeds a threshold the trial is deemed sufficiently blinded; otherwise, the trial is deemed to have failed. In this paper we make several important contributions. Firstly, we identify a series of fundamental problems of the aforesaid practice and discuss them in context of the most commonly used blinding measures. Secondly, motivated by the highlighted problems, we formulate a novel method for handling imperfectly blinded trials. We too adopt a post-trial feedback questionnaire but interpret the collected data using an original approach, fundamentally different from those previously proposed. Unlike previous approaches, ours is void of any ad hoc free parameters, is robust to small changes in auxiliary data and is not predicated on any strong assumptions used to interpret participants’ feedback. 1
6 0.50784713 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling
7 0.50657618 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects
8 0.49724329 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
10 0.49180797 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning
11 0.49098566 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
12 0.48589313 2 nips-2012-3D Social Saliency from Head-mounted Cameras
13 0.47984272 338 nips-2012-The Perturbed Variation
14 0.47917062 242 nips-2012-Non-linear Metric Learning
15 0.47326663 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
16 0.47210416 196 nips-2012-Learning with Partially Absorbing Random Walks
17 0.46261835 94 nips-2012-Delay Compensation with Dynamical Synapses
18 0.45631018 273 nips-2012-Predicting Action Content On-Line and in Real Time before Action Onset – an Intracranial Human Study
19 0.45201206 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
20 0.45154157 253 nips-2012-On Triangular versus Edge Representations --- Towards Scalable Modeling of Networks
topicId topicWeight
[(0, 0.037), (11, 0.021), (21, 0.025), (38, 0.092), (39, 0.018), (42, 0.049), (54, 0.02), (55, 0.04), (61, 0.011), (74, 0.065), (76, 0.151), (80, 0.068), (92, 0.031), (96, 0.273)]
simIndex simValue paperId paperTitle
1 0.80139548 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
Author: Zhitang Chen, Kun Zhang, Laiwan Chan
Abstract: In conventional causal discovery, structural equation models (SEM) are directly applied to the observed variables, meaning that the causal effect can be represented as a function of the direct causes themselves. However, in many real world problems, there are significant dependencies in the variances or energies, which indicates that causality may possibly take place at the level of variances or energies. In this paper, we propose a probabilistic causal scale-mixture model with spatiotemporal variance dependencies to represent a specific type of generating mechanism of the observations. In particular, the causal mechanism including contemporaneous and temporal causal relations in variances or energies is represented by a Structural Vector AutoRegressive model (SVAR). We prove the identifiability of this model under the non-Gaussian assumption on the innovation processes. We also propose algorithms to estimate the involved parameters and discover the contemporaneous causal structure. Experiments on synthetic and real world data are conducted to show the applicability of the proposed model and algorithms.
same-paper 2 0.7953611 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks
Author: Firdaus Janoos, Weichang Li, Niranjan Subrahmanya, Istvan Morocz, William Wells
Abstract: Identifying patterns from the neuroimaging recordings of brain activity related to the unobservable psychological or mental state of an individual can be treated as an unsupervised pattern recognition problem. The main challenges, however, for such an analysis of fMRI data are: a) defining a physiologically meaningful feature-space for representing the spatial patterns across time; b) dealing with the high-dimensionality of the data; and c) robustness to the various artifacts and confounds in the fMRI time-series. In this paper, we present a network-aware feature-space to represent the states of a general network, which enables comparing and clustering such states in a manner that is a) meaningful in terms of the network connectivity structure; b) computationally efficient; c) low-dimensional; and d) relatively robust to structured and random noise artifacts. This feature-space is obtained from a spherical relaxation of the transportation distance metric which measures the cost of transporting "mass" over the network to transform one function into another. Through theoretical and empirical assessments, we demonstrate the accuracy and efficiency of the approximation, especially for large problems. 1
3 0.66528058 206 nips-2012-Majorization for CRFs and Latent Likelihoods
Author: Tony Jebara, Anna Choromanska
Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods. This supplement presents additional details in support of the full article. These include the application of the majorization method to maximum entropy problems. It also contains proofs of the various theorems, in particular, a guarantee that the bound majorizes the partition function. In addition, a proof is provided guaranteeing convergence on (non-latent) maximum conditional likelihood problems. The supplement also contains supporting lemmas that show the bound remains applicable in constrained optimization problems. The supplement then proves the soundness of the junction tree implementation of the bound for graphical models with large n. It also proves the soundness of the low-rank implementation of the bound for problems with large d. Finally, the supplement contains additional experiments and figures to provide further empirical support for the majorization methodology. Supplement for Section 2 Proof of Theorem 1 Rewrite the partition function as a sum over the integer index j = 1, . . . , n under the random ordering π : Ω → {1, . . . , n}. This defines j∑ π(y) and associates h and f with = n hj = h(π −1 (j)) and fj = f (π −1 (j)). Next, write Z(θ) = j=1 αj exp(λ⊤ fj ) by introducing ˜ ˜ λ = θ − θ and αj = hj exp(θ ⊤ fj ). Define the partition function over only the first i components ∑i as Zi (θ) = j=1 αj exp(λ⊤ fj ). When i = 0, a trivial quadratic upper bound holds ( ) Z0 (θ) ≤ z0 exp 1 λ⊤ Σ0 λ + λ⊤ µ0 2 with the parameters z0 → 0+ , µ0 = 0, and Σ0 = z0 I. Next, add one term to the current partition function Z1 (θ) = Z0 (θ) + α1 exp(λ⊤ f1 ). Apply the current bound Z0 (θ) to obtain 1 Z1 (θ) ≤ z0 exp( 2 λ⊤ Σ0 λ + λ⊤ µ0 ) + α1 exp(λ⊤ f1 ). Consider the following change of variables u γ 1/2 −1/2 = Σ0 λ − Σ0 = α1 z0 exp( 1 (f1 2 (f1 − µ0 )) − µ0 )⊤ Σ−1 (f1 − µ0 )) 0 and rewrite the logarithm of the bound as log Z1 (θ) ( ) 1 ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp( 2 ∥u∥2 ) + γ . 0 2 Apply Lemma 1 (cf. [32] p. 100) to the last term to get ( (1 ) ) log Z1 (θ) ≤ log z0 − 1 (f1 − µ0 )⊤ Σ−1 (f1 − µ0 ) + λ⊤ f1 + log exp 2 ∥v∥2 +γ 0 2 ( ) v⊤ (u − v) 1 + + (u − v)⊤ I + Γvv⊤ (u − v) 1+γ exp(− 1 ∥v∥2 ) 2 2 10 where Γ = 1 1 tanh( 2 log(γ exp(− 2 ∥v∥2 ))) . 1 2 log(γ exp(− 2 ∥v∥2 )) The bound in [32] is tight when u = v. To achieve tightness −1/2 ˜ when θ = θ or, equivalently, λ = 0, we choose v = Σ0 (µ0 − f1 ) yielding ( ) Z1 (θ) ≤ z1 exp 1 λ⊤ Σ1 λ + λ⊤ µ1 2 where we have z1 = z0 + α1 z0 α1 = µ0 + f1 z0 + α1 z0 + α1 1 tanh( 2 log(α1 /z0 )) = Σ0 + (µ0 − f1 )(µ0 − f1 )⊤ . 2 log(α1 /z0 ) µ1 Σ1 This rule updates the bound parameters z0 , µ0 , Σ0 to incorporate an extra term in the sum over i in Z(θ). 
The process is iterated n times (replacing 1 with i and 0 with i − 1) to produce an overall bound on all terms. Lemma 1 (See [32] p. 100) ( ( ) ) For all u ∈ Rd , any v ∈ Rd and any γ ≥ 0, the bound log exp 1 ∥u∥2 + γ ≤ 2 ( ( ) ) log exp 1 ∥v∥2 + γ + 2 ( ) v⊤ (u − v) 1 + (u − v)⊤ I + Γvv⊤ (u − v) 1 1 + γ exp(− 2 ∥v∥2 ) 2 holds when the scalar term Γ = 1 tanh( 2 log(γ exp(−∥v∥2 /2))) . 2 log(γ exp(−∥v∥2 /2)) Equality is achieved when u = v. Proof of Lemma 1 The proof is provided in [32]. Supplement for Section 3 Maximum entropy problem We show here that partition functions arise naturally in maximum ∑ p(y) entropy estimation or minimum relative entropy RE(p∥h) = y p(y) log h(y) estimation. Consider the following problem: ∑ ∑ min RE(p∥h) s.t. p(y)f (y) = 0, p(y)g(y) ≥ 0. p y y d′ Here, assume that f : Ω → R and g : Ω → R are arbitrary (non-constant) vector-valued functions ( ) over the sample space. The solution distribution p(y) = h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) /Z(θ, ϑ) is recovered by the dual optimization ∑ ( ) arg θ, ϑ = max − log h(y) exp θ ⊤ f (y) + ϑ⊤ g(y) d ϑ≥0,θ y ′ where θ ∈ Rd and ϑ ∈ Rd . These are obtained by minimizing Z(θ, ϑ) or equivalently by maximizing its negative logarithm. Algorithm 1 permits variational maximization of the dual via the quadratic program min 1 (β ϑ≥0,θ 2 ˜ ˜ − β)⊤ Σ(β − β) + β ⊤ µ ′ where β ⊤ = [θ ⊤ ϑ⊤ ]. Note that any general convex hull of constraints β ∈ Λ ⊆ Rd+d could be imposed without loss of generality. Proof of Theorem 2 We begin by proving a lemma that will be useful later. Lemma 2 If κΨ ≽ Φ ≻ 0 for Φ, Ψ ∈ Rd×d , then 1 ˜ ˜ ˜ L(θ) = − 2 (θ − θ)⊤ Φ(θ − θ) − (θ − θ)⊤ µ ˜ ˜ ˜ U (θ) = − 1 (θ − θ)⊤ Ψ(θ − θ) − (θ − θ)⊤ µ 2 satisfy supθ∈Λ L(θ) ≥ 1 κ ˜ supθ∈Λ U (θ) for any convex Λ ⊆ Rd , θ ∈ Λ, µ ∈ Rd and κ ∈ R+ . 11 Proof of Lemma 2 Define the primal problems of interest as PL = supθ∈Λ L(θ) and PU = supθ∈Λ U (θ). The constraints θ ∈ Λ can be summarized by a set of linear inequalities Aθ ≤ b where A ∈ Rk×d and b ∈ Rk for some (possibly infinite) k ∈ Z. Apply the change of variables ˜ ˜ ˜ ˜ ˜ ˜ z = θ− θ. The constraint A(z+ θ) ≤ b simplifies into Az ≤ b where b = b−Aθ. Since θ ∈ Λ, it 1 ⊤ ˜ ≥ 0. We obtain the equivalent primal problems PL = sup is easy to show that b ˜ − z Φz − Az≤b z⊤ µ and PU = supAz≤b − 1 z⊤ Ψz − z⊤ µ. The corresponding dual problems are ˜ 2 2 ⊤ −1 y⊤AΦ−1A⊤y ˜ µ Φ µ +y⊤AΦ−1µ+y⊤b+ y≥0 2 2 y⊤AΨ−1 A⊤y µ⊤Ψ−1µ ˜ DU = inf +y⊤AΨ−1µ+y⊤b+ . y≥0 2 2 DL = inf ˜ Due to strong duality, PL = DL and PU = DU . Apply the inequalities Φ ≼ κΨ and y⊤ b > 0 as ⊤ −1 y⊤AΨ−1 A⊤y y⊤AΨ−1 µ κ ˜ µΨ µ + + y⊤b + PL ≥ sup − z⊤ Ψz − z⊤ µ = inf y≥0 2 2κ κ 2κ ˜ Az≤b 1 1 ≥ DU = PU . κ κ 1 This proves that PL ≥ κ PU . We will use the above to prove Theorem 2. First, we will upper-bound (in the Loewner ordering sense) the matrices Σj in Algorithm 2. Since ∥fxj (y)∥2 ≤ r for all y ∈ Ωj and since µj in Algorithm 1 is a convex combination of fxj (y), the outer-product terms in the update for Σj satisfy (fxj (y) − µ)(fxj (y) − µ)⊤ ≼ 4r2 I. Thus, Σj ≼ F(α1 , . . . , αn )4r2 I holds where α 1 F(α1 , . . . , αn ) = i n ∑ tanh( 2 log( ∑i−1 αk )) k=1 αi 2 log( ∑i−1 α ) i=2 k k=1 using the definition of α1 , . . . , αn in the proof of Theorem 1. The formula for F starts at i = 2 since z0 → 0+ . Assume permutation π is sampled uniformly at random. The expected value of F is then α π(i) 1 n 1 ∑ ∑ tanh( 2 log( ∑i−1 απ(k) )) k=1 . Eπ [F(α1 , . . . , αn )] = απ(i) n! π i=2 ) 2 log( ∑i−1 k=1 απ(k) We claim that the expectation is maximized when all αi = 1 or any positive constant. 
Also, F is invariant under uniform scaling of its arguments. Write the expected value of F as E for short. ∂E ∂E ∂E Next, consider ∂αl at the setting αi = 1, ∀i. Due to the expectation over π, we have ∂αl = ∂αo for any l, o. Therefore, the gradient vector is constant when all αi = 1. Since F(α1 , . . . , αn ) is invariant to scaling, the gradient vector must therefore be the all zeros vector. Thus, the point ∂ ∂E when all αi = 1 is an extremum or a saddle. Next, consider ∂αo ∂αl for any l, o. At the setting 2 ∂ ∂E αi = 1, ∂ E = −c(n) and, ∂αo ∂αl = c(n)/(n − 1) for some non-negative constant function ∂α2 l c(n). Thus, the αi = 1 extremum is locally concave and is a maximum. This establishes that Eπ [F(α1 , . . . , αn )] ≤ Eπ [F(1, . . . , 1)] and yields the Loewner bound ) ( n−1 ∑ tanh(log(i)/2) 2 I = ωI. Σj ≼ 2r log(i) i=1 Apply this bound to each Σj in the lower bound on J(θ) and also note a corresponding upper bound ∑ ˜ ˜ ˜ J(θ) ≥ J(θ)−tω+tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j ∑ ˜ ˜ ˜ J(θ) ≤ J(θ)−tλ ∥θ − θ∥2− (θ − θ)⊤(µj −fxj (yj )) 2 j 12 ˜ which follows from Jensen’s inequality. Define the current θ at time τ as θτ and denote by Lτ (θ) the above lower bound and by Uτ (θ) the above upper bound at time τ . Clearly, Lτ (θ) ≤ J(θ) ≤ Uτ (θ) with equality when θ = θτ . Algorithm 2 maximizes J(θ) after initializing at θ0 and performing an update by maximizing a lower bound based on Σj . Since Lτ (θ) replaces the definition of Σj with ωI ≽ Σj , Lτ (θ) is a looser bound than the one used by Algorithm 2. Thus, performing θτ +1 = arg maxθ∈Λ Lτ (θ) makes less progress than a step of Algorithm 1. Consider computing the slower update at each iteration τ and returning θτ +1 = arg maxθ∈Λ Lτ (θ). Setting Φ = (tω +tλ)I, Ψ = tλI and κ = ω+λ allows us to apply Lemma 2 as follows λ sup Lτ (θ) − Lτ (θτ ) = θ∈Λ 1 sup Uτ (θ) − Uτ (θτ ). κ θ∈Λ Since Lτ (θτ ) = J(θτ ) = Uτ (θτ ), J(θτ +1 ) ≥ supθ∈Λ Lτ (θ) and supθ∈Λ Uτ (θ) ≥ J(θ ∗ ), we obtain ( ) 1 J(θτ +1 ) − J(θ ∗ ) ≥ 1− (J(θτ ) − J(θ ∗ )) . κ Iterate the above inequality starting at t = 0 to obtain ( )τ 1 ∗ J(θτ ) − J(θ ) ≥ 1− (J(θ0 ) − J(θ ∗ )) . κ ( ) 1 τ κ A solution within a multiplicative factor of ϵ implies that ϵ = 1 − κ or log(1/ϵ) = τ log κ−1 . ⌉ ⌈ Inserting the definition for κ shows that the number of iterations τ is at most log(1/ϵ) or κ log κ−1 ⌉ ⌈ log(1/ϵ) log(1+λ/ω) . Inserting the definition for ω gives the bound. Y12,0 Y11,1 Y21,1 Y31,1 ··· 1,1 Ym1,1 Figure 3: Junction tree of depth 2. Algorithm 5 SmallJunctionTree ˜ Input Parameters θ and h(u), f (u) ∀u ∈ Y12,0 and zi , Σi , µi ∀i = 1, . . . , m1,1 + Initialize z → 0 , µ = 0, Σ = zI For each configuration u ∈ Y12,0 { ∏m1,1 ∑m1,1 ∏m1,1 ˜ ˜ ˜ α = h(u)( ∑ zi exp(−θ ⊤ µi )) exp(θ ⊤ (f (u) + i=1 µi )) = h(u) exp(θ ⊤ f (u)) i=1 zi i=1 m1,1 l = f (u) + i=1 µi − µ 1 ∑m1,1 tanh( 2 log(α/z)) ⊤ Σ + = i=1 Σi + ll 2 log(α/z) α µ + = z+α l z += α } Output z, µ, Σ Supplement for Section 5 Proof of correctness for Algorithm 3 Consider a simple junction tree of depth 2 shown on Figure 3. The notation Yca,b refers to the cth tree node located at tree level a (first level is considered as the one with∑ leaves) whose parent is the bth from the higher tree level (the root has no parent so b = 0). tree ∑ Let Y a1 ,b1 refer to the sum over all configurations of variables in Yca1 ,b1 and Y a1 ,b1 \Y a2 ,b2 1 c1 c1 c2 refers to the sum over all configurations of variables that are in Yca1 ,b1 but not in Yca2 ,b2 . Let ma,b 1 2 denote the number of children of the bth node located at tree level a + 1. 
For short-hand, use ψ(Y ) = h(Y ) exp(θ ⊤ f (Y )). The partition function can be expressed as: 13 Y13,0 Y12,1 ··· Y11,1 Y21,1 ··· Y22,1 1,1 Ym1,1 ··· Y11,2 Y21,2 1,2 Ym1,2 2,1 Ym2,1 1,m2,1 Y1 ··· 1,m2,1 Y2 1,m 2,1 Ym1,m2,1 Figure 4: Junction tree of depth 3. Z(θ) = ∑ 2,0 u∈Y1 ≤ ∑ 2,0 u∈Y1 = ∑ ψ(u) ∏ m1,1 i=1 [ ∏ m1,1 ψ(u) i=1 [ ∑ ψ(v) 2,0 v∈Yi1,1 \Y1 ) 1( ˜ ˜ ˜ zi exp( θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 ∏ ( m1,1 ⊤ h(u) exp(θ f (u)) 2,0 u∈Y1 zi exp i=1 ] 1 ˜ ˜ ˜ (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi 2 )] ∑ where the upper-bound is obtained by applying Theorem 1 to each of the terms v∈Y 1,1 \Y 2,0 ψ(v). 1 i By simply rearranging terms we get: ) ( ( [ (m1,1 )) m1,1 ∏ ∑ ∑ ˜ zi exp(−θ ⊤ µi ) exp θ ⊤ f (u) + µi h(u) Z(θ) ≤ 2,0 u∈Y1 i=1 ( exp 1 ˜ (θ − θ)⊤ 2 (m1,1 ∑ ) Σi i=1 ˜ (θ − θ) )] . i=1 One ( can prove that this ) expression can be upper-bounded by 1 ˆ ⊤ Σ(θ − θ) + (θ − θ)⊤ µ where z, Σ and µ can be computed using Algoˆ ˆ z exp 2 (θ − θ) rithm 5 (a simplification of Algorithm 3). We will call this result Lemma A. The proof is similar to the proof of Theorem 1 so is not repeated here. Consider enlarging the tree to a depth 3 as shown on Figure 4. The partition function is now m2,1 m1,i ∑ ∏ ∑ ∏ ∑ Z(θ) = ψ(w) . ψ(u) ψ(v) 3,0 u∈Y1 i=1 3,0 v∈Yi2,1 \Y1 j=1 w∈Yj1,i \Yi2,1 ( )) ∏m1,i (∑ ∑ term By Lemma A we can upper bound each v∈Y 2,1 \Y 3,0 ψ(v) j=1 w∈Yj1,i \Yi2,1 ψ(w) 1 i ( ) ˆ ˆ ˆ by the expression zi exp 1 (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . This yields 2 [ )] ( m2,1 ∑ ∏ 1 ˜ ˜ ˜ Z(θ) ≤ ψ(u) (θ − θ)⊤ Σi (θ − θ) + (θ − θ)⊤ µi . zi exp 2 3,0 i=1 u∈Y1 2,1 2,1 2,1 This process can be viewed as collapsing the sub-trees S1 , S2 , . . ., Sm2,1 to super-nodes that are represented by bound parameters, zi , Σi and µi , i = {1, 2, · · · , m2,1 }, where the sub-trees are 14 defined as: 2,1 S1 = 1,1 {Y12,1 , Y11,1 , Y21,1 , Y31,1 , . . . , Ym1,1 } 2,1 S2 = 1,2 {Y22,1 , Y11,2 , Y21,2 , Y31,2 , . . . , Ym1,2 } = 2,1 {Ym2,1 , Y1 . . . 2,1 Sm2,1 1,m2,1 1,m2,1 , Y2 1,m2,1 , Y3 1,m2,1 , . . . , Ym1,m2,1 }. Notice that the obtained expression can be further upper bounded using again Lemma A (induction) ( ) ˆ ˆ ˆ yielding a bound of the form: z exp 1 (θ − θ)⊤ Σ(θ − θ) + (θ − θ)⊤ µ . 2 Finally, for a general tree, follow the same steps described above, starting from leaves and collapsing nodes to super-nodes, each represented by bound parameters. This procedure effectively yields Algorithm 3 for the junction tree under consideration. Supplement for Section 6 Proof of correctness for Algorithm 4 We begin by proving a lemma that will be useful later. Lemma 3 For all x ∈ Rd and for all l ∈ Rd , 2 d d 2 ∑ ∑ l(i) . x(i)2 l(i)2 ≥ x(i) √∑ d l(j)2 i=1 i=1 j=1 Proof of Lemma 3 By Jensen’s inequality, 2 ( d )2 d d ∑ x(i)l(i)2 ∑ ∑ x(i)l(i)2 . √∑ x(i)2 ∑d ⇐⇒ x(i)2 l(i)2 ≥ ≥ ∑d d l(j)2 l(j)2 l(j)2 j=1 j=1 i=1 i=1 i=1 i=1 d ∑ l(i)2 j=1 Now we prove the correctness of Algorithm 4. At the ith iteration, the algorithm stores Σi using ⊤ a low-rank representation Vi Si Vi + Di where Vi ∈ Rk×d is orthonormal, Si ∈ Rk×k positive d×d semi-definite and Di ∈ R is non-negative diagonal. The diagonal terms D are initialized to tλI where λ is the regularization term. To mimic Algorithm 1 we must increment the Σ matrix by a rank one update of the form Σi = Σi−1 + ri r⊤ . By projecting ri onto each eigenvector in V, we i ∑k ⊤ can decompose it as ri = j=1 Vi−1 (j, ·)ri Vi−1 (j, ·)⊤ + g = Vi−1 Vi−1 ri + g where g is the remaining residue. 
Thus the update rule can be rewritten as: Σi ⊤ ⊤ ⊤ = Σi−1 + ri r⊤ = Vi−1 Si−1 Vi−1 + Di−1 + (Vi−1 Vi−1 ri + g)(Vi−1 Vi−1 ri + g)⊤ i ′ ′ ′ ⊤ ⊤ ⊤ = Vi−1 (Si−1 + Vi−1 ri r⊤ Vi−1 )Vi−1 + Di−1 + gg⊤ = Vi−1 Si−1 Vi−1 + gg⊤ + Di−1 i ′ where we define Vi−1 = Qi−1 Vi−1 and defined Qi−1 in terms of the singular value decomposition, ′ ′ ⊤ Q⊤ Si−1 Qi−1 = svd(Si−1 + Vi−1 ri r⊤ Vi−1 ). Note that Si−1 is diagonal and nonnegative by i−1 i construction. The current formula for Σi shows that we have a rank (k + 1) system (plus diagonal term) which needs to be converted back to a rank k system (plus diagonal term) which we denote by ′ Σi . We have two options as follows. Case 1) Remove g from Σi to obtain ′ ′ ′ ′ ⊤ Σi = Vi−1 Si−1 Vi−1 + Di−1 = Σi − gg⊤ = Σi − cvv⊤ 1 where c = ∥g∥2 and v = ∥g∥ g. th Case 2) Remove the m (smallest) eigenvalue in S′ and its corresponding eigenvector: i−1 ′ Σi ′ ′ ′ ′ ′ ′ ⊤ = Vi−1 Si−1 Vi−1 + Di−1 + gg⊤ − S (m, m)V (m, ·)⊤ V (m, ·) = Σi − cvv⊤ ′ ′ where c = S (m, m) and v = V(m, ·) . 15 ′ Clearly, both cases can be written as an update of the form Σi = Σi + cvv⊤ where c ≥ 0 and v⊤ v = 1. We choose the case with smaller c value to minimize the change as we drop from a system of order (k + 1) to order k. Discarding the smallest singular value and its corresponding eigenvector would violate the bound. Instead, consider absorbing this term into the diagonal component to ′′ ′ preserve the bound. Formally, we look for a diagonal matrix F such that Σi = Σi + F which also ′′ maintains x⊤ Σi x ≥ x⊤ Σi x for all x ∈ Rd . Thus, we want to satisfy: ( d )2 d ∑ ∑ ⊤ ′′ ⊤ ⊤ ⊤ ⊤ x Σi x ≥ x Σi x ⇐⇒ x cvv x ≤ x Fx ⇐⇒ c x(i)v(i) ≤ x(i)2 F(i) i=1 i=1 where, for ease of notation, we take F(i) = F(i, i). ′ where w = v⊤ 1. Consider the case where v ≥ 0 though we will soon get rid of (∑ )2 ∑d d this assumption. We need an F such that i=1 x(i)2 F(i) ≥ c i=1 x(i)v(i) . Equivalently, we (∑ ) ∑d ′ 2 ′ d need i=1 x(i)2 F(i) ≥ . Define F(i) = F(i) for all i = 1, . . . , d. So, we need 2 i=1 x(i)v(i) cw cw2 (∑ ) ∑d ′ ′ ′ 2 d an F such that i=1 x(i)2 F(i) ≥ . Using Lemma 3 it is easy to show that we i=1 x(i)v(i) Define v = 1 wv ′ ′ ′ may choose F (i) = v(i) . Thus, we obtain F(i) = cw2 F(i) = cwv(i). Therefore, for all x ∈ Rd , ∑d all v ≥ 0, and for F(i) = cv(i) j=1 v(j) we have d ∑ ( x(i) F(i) ≥ c 2 i=1 d ∑ )2 x(i)v(i) . (3) i=1 To generalize the inequality to hold for all vectors v ∈ Rd with potentially negative entries, it is ∑d sufficient to set F(i) = c|v(i)| j=1 |v(j)|. To verify this, consider flipping the sign of any v(i). The left side of the Inequality 3 does not change. For the right side of this inequality, flipping the sign of v(i) is equivalent to flipping the sign of x(i) and not changing the sign of v(i). However, in this case the inequality holds as shown before (it holds for any x ∈ Rd ). Thus for all x, v ∈ Rd and ∑d for F(i) = c|v(i)| j=1 |v(j)|, Inequality 3 holds. Supplement for Section 7 Small scale experiments In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user’s uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. 
Supplement for Section 7

Small-scale experiments

In additional small-scale experiments, we compared Algorithm 2 with steepest descent (SD), conjugate gradient (CG), BFGS and Newton-Raphson. Small-scale problems may be interesting in real-time learning settings, for example, when a website has to learn from a user's uploaded labeled data in a split second to perform real-time retrieval. We considered logistic regression on five UCI data sets where missing values were handled via mean-imputation. A range of regularization settings λ ∈ {10^0, 10^2, 10^4} was explored, and all algorithms were initialized from the same ten random start-points. Table 3 shows the average number of seconds each algorithm needed to achieve the same solution that BFGS converged to (all algorithms achieve the same solution due to concavity). The bound is the fastest algorithm in every setting.

data | λ      BFGS    SD      CG      Newton  Bound
a | 10^0      1.90    1.74    0.78    0.31    0.01
a | 10^2      0.89    0.92    0.83    0.25    0.01
a | 10^4      2.45    1.60    0.85    0.22    0.01
b | 10^0      3.14    2.18    0.70    0.43    0.07
b | 10^2      2.00    6.17    0.67    0.37    0.04
b | 10^4      1.60    5.83    0.83    0.35    0.04
c | 10^0      4.09    1.92    0.65    0.39    0.07
c | 10^2      1.03    0.64    0.64    0.34    0.02
c | 10^4      1.90    0.56    0.72    0.32    0.02
d | 10^0      5.62    12.04   1.36    0.92    0.16
d | 10^2      2.88    1.27    1.21    0.63    0.09
d | 10^4      3.28    1.94    1.23    0.60    0.07
e | 10^0      2.63    2.68    0.48    0.35    0.03
e | 10^2      2.01    2.49    0.55    0.26    0.03
e | 10^4      1.49    1.54    0.43    0.20    0.03

Table 3: Convergence time in seconds under various regularization levels for a) Bupa (t = 345, dim = 7), b) Wine (t = 178, dim = 14), c) Heart (t = 187, dim = 23), d) Ion (t = 351, dim = 34), and e) Hepatitis (t = 155, dim = 20) data sets.

Influence of rank k on bound performance in large-scale experiments

We also examined the influence of k on bound performance and compared it with LBFGS, SD and CG. Several choices of k were explored. Table 4 shows results for the SRBCT data-set. In general, the bound performs best but slows down for superfluously large values of k. Steepest descent and conjugate gradient are slow yet obviously do not vary with k. Note that each iteration takes less time with smaller k for the bound; however, we are reporting overall runtime, which is also a function of the number of iterations. Therefore, total runtime (a function of both) may not always decrease/increase with k.

k     LBFGS   SD      CG      Bound
1     1.37    8.80    4.39    0.56
2     1.32    8.80    4.39    0.56
4     1.39    8.80    4.39    0.67
8     1.35    8.80    4.39    0.96
16    1.46    8.80    4.39    1.34
32    1.40    8.80    4.39    2.11
64    1.54    8.80    4.39    4.57

Table 4: Convergence time in seconds as a function of k.

Additional latent-likelihood results

For completeness, Figure 5 depicts two additional data-sets to complement Figure 2. Similarly, Table 5 shows all experimental settings explored in order to provide the summary Table 2 in the main article.

[Figure 5: Convergence of test latent log-likelihood (−log J(θ) versus log(time) in seconds) for the bupa and wine data-sets, comparing Bound, Newton, BFGS, conjugate gradient and steepest descent.]

ion           m=1      m=2      m=3      m=4
BFGS         -4.96    -5.55    -5.88    -5.79
SD          -11.80    -9.92    -5.56    -8.59
CG           -5.47    -5.81    -5.57    -5.22
Newton       -5.95    -5.95    -5.95    -5.95
Bound        -6.08    -4.84    -4.18    -5.17

bupa          m=1      m=2      m=3      m=4
BFGS        -22.07   -21.78   -21.92   -21.87
SD          -21.76   -21.74   -21.73   -21.83
CG          -21.81   -21.81   -21.81   -21.81
Newton      -21.85   -21.85   -21.85   -21.85
Bound       -21.85   -19.95   -20.01   -19.97

wine          m=1      m=2      m=3      m=4
BFGS         -0.90    -0.91    -1.79    -1.35
SD           -1.61    -1.60    -1.37    -1.63
CG           -0.51    -0.78    -0.95    -0.51
Newton       -0.71    -0.71    -0.71    -0.71
Bound        -0.51    -0.51    -0.48    -0.51

hepatitis     m=1      m=2      m=3      m=4
BFGS         -4.42    -5.28    -4.95    -4.93
SD           -4.93    -5.14    -5.01    -5.20
CG           -4.84    -4.84    -4.84    -4.84
Newton       -5.50    -5.50    -5.50    -4.50
Bound        -5.47    -4.40    -4.75    -4.92

SRBCT         m=1      m=2      m=3      m=4
BFGS         -5.99    -6.17    -6.09    -6.06
SD           -5.61    -5.62    -5.62    -5.61
CG           -5.62    -5.49    -5.36    -5.76
Newton       -5.54    -5.54    -5.54    -5.54
Bound        -5.31    -5.31    -4.90    -0.11

Table 5: Test latent log-likelihood at convergence for different values of m ∈ {1, 2, 3, 4} on the ion, bupa, wine, hepatitis and SRBCT data-sets.
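The baseline columns of Table 3 can be reproduced in spirit with off-the-shelf optimizers. The sketch below times BFGS, CG and Newton-CG from scipy.optimize on an L2-regularized logistic regression; the synthetic stand-in data set and the single start-point are assumptions for illustration, and the paper's bound optimizer itself is not included.

```python
import time
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_posterior(theta, X, y, lam):
    """Negative regularized log-likelihood of logistic regression, with y in {0, 1}."""
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * theta @ theta

def gradient(theta, X, y, lam):
    p = expit(X @ theta)                      # predicted probabilities
    return X.T @ (p - y) + lam * theta

def time_solver(method, X, y, lam, theta0):
    start = time.perf_counter()
    res = minimize(neg_log_posterior, theta0, args=(X, y, lam),
                   jac=gradient, method=method)
    return time.perf_counter() - start, res.fun

# Stand-in for one UCI data set (e.g. Bupa: t = 345, dim = 7); real data would be
# mean-imputed first, and timings averaged over ten random start-points.
rng = np.random.default_rng(0)
X = rng.standard_normal((345, 7))
y = (rng.random(345) < 0.5).astype(float)
for lam in (1e0, 1e2, 1e4):
    for method in ("BFGS", "CG", "Newton-CG"):
        secs, obj = time_solver(method, X, y, lam, np.zeros(X.shape[1]))
        print(f"lambda={lam:g}  {method:9s}  {secs:.3f}s  objective={obj:.3f}")
```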
4 0.64761007 197 nips-2012-Learning with Recursive Perceptual Representations
Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell
Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1
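A minimal sketch of the stacking idea described in this abstract: each layer fits a linear SVM, and the next layer's input is the original features shifted by a random projection of the previous layer's class scores. The recursion, the scaling constant beta, and the use of scikit-learn's LinearSVC are our own assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_recursive_svm(X, y, n_layers=3, beta=0.1, seed=0):
    """Stack linear SVMs; each layer sees the original data shifted by a random
    projection of the previous layer's decision scores (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    layers, X_cur = [], X
    for _ in range(n_layers):
        clf = LinearSVC().fit(X_cur, y)
        scores = np.atleast_2d(clf.decision_function(X_cur).T).T   # (n, 1) or (n, c)
        W = rng.standard_normal((scores.shape[1], X.shape[1]))     # random projection
        layers.append((clf, W))
        X_cur = X + beta * scores @ W          # recursively transformed data manifold
    return layers

def predict_recursive_svm(layers, X, beta=0.1):
    X_cur = X
    for clf, W in layers[:-1]:
        scores = np.atleast_2d(clf.decision_function(X_cur).T).T
        X_cur = X + beta * scores @ W
    return layers[-1][0].predict(X_cur)
```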
5 0.60555249 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provide several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
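As an illustration of the mean-embedding idea in this abstract, the sketch below builds a kernel between empirical distributions by averaging a base kernel over all sample pairs, then feeds the resulting Gram matrix to a standard SVM with a precomputed kernel. Treating each training example as a bag of samples and the choice of an RBF base kernel are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def mean_embedding_gram(bags_a, bags_b, gamma=1.0):
    """K[i, j] = mean_{x in bags_a[i], x' in bags_b[j]} k(x, x'), i.e. the inner
    product of empirical kernel mean embeddings under an RBF base kernel."""
    K = np.zeros((len(bags_a), len(bags_b)))
    for i, A in enumerate(bags_a):
        for j, B in enumerate(bags_b):
            K[i, j] = rbf_kernel(A, B, gamma=gamma).mean()
    return K

# toy example: each training "example" is a small sample drawn from one of two Gaussians
rng = np.random.default_rng(0)
bags = [rng.normal(loc=c, size=(20, 2)) for c in (0.0, 2.0) for _ in range(15)]
labels = np.array([0] * 15 + [1] * 15)

gram = mean_embedding_gram(bags, bags)
clf = SVC(kernel="precomputed").fit(gram, labels)
test_bags = [rng.normal(loc=c, size=(20, 2)) for c in (0.0, 2.0)]
print(clf.predict(mean_embedding_gram(test_bags, bags)))   # expected: [0 1]
```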
6 0.60519391 210 nips-2012-Memorability of Image Regions
7 0.60258287 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video
8 0.60108906 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
9 0.5955773 294 nips-2012-Repulsive Mixtures
10 0.59531164 298 nips-2012-Scalable Inference of Overlapping Communities
11 0.59483421 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
12 0.59464329 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
13 0.59446186 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
14 0.5943675 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
15 0.59436238 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
16 0.59410191 215 nips-2012-Minimizing Uncertainty in Pipelines
17 0.59317291 168 nips-2012-Kernel Latent SVM for Visual Recognition
18 0.59300983 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation
19 0.59204328 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
20 0.59198958 74 nips-2012-Collaborative Gaussian Processes for Preference Learning