nips nips2012 nips2012-199 knowledge-graph by maker-knowledge-mining

199 nips-2012-Link Prediction in Graphs with Autoregressive Features


Source: pdf

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. [sent-2, score-0.502]

2 Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. [sent-3, score-0.804]

3 Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. [sent-4, score-0.553]

4 Statistical modeling techniques have been widely investigated in the context of multivariate time series either in the multiple linear regression setup [4] or with autoregressive models [23]. [sent-7, score-0.408]

5 These approaches share the use of the correlation structure among input variables to enrich the prediction on every single output. [sent-9, score-0.122]

6 A discrete encoding of correlations between variables can be modeled as a graph so that learning the dependence structure amounts to performing graph inference through the discovery of uncovered edges on the graph. [sent-11, score-0.373]

7 The latter problem is interesting per se and it is known as the problem of link prediction where it is assumed that only a part of the graph is actually observed [15, 9]. [sent-12, score-0.391]

8 This situation occurs in various applications such as recommender systems, social networks, or proteomics, and the appropriate tools can be found among matrix completion techniques [21, 5, 1]. [sent-13, score-0.221]

9 In the realistic setup of a time-evolving graph, matrix completion was also used and adapted to take into account the dynamics of the features of the graph [18]. [sent-14, score-0.355]

10 In this paper, we study the prediction problem where the observation is a sequence of graph adjacency matrices (At )0≤t≤T and the goal is to predict AT +1 . [sent-15, score-0.594]

11 This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. [sent-16, score-0.139]

12 In this context, users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are modeled as edges. [sent-17, score-0.157]

13 In functional genomics and systems biology, estimating regulatory networks in gene expression can be performed by modeling the data as graphs and fitting predictive models is a natural way for estimating evolving networks in these contexts. [sent-18, score-0.159]

14 A large variety of methods for link prediction only consider predicting from a single static snapshot of the graph - this includes heuristics [15, 20], matrix factorization [13], diffusion [16], or probabilistic methods [22]. [sent-19, score-0.676]

15 More recently, some works have investigated using sequences of observations of the graph to improve the prediction, such as using regression on features extracted from the graphs [18], using matrix factorization [14], continuous-time regression [25]. [sent-20, score-0.503]

16 Our main assumption is that the network effect is a cause and a symptom at the same time, and therefore, the edges and the graph features should be estimated simultaneously. [sent-21, score-0.311]

17 We propose a regularized approach to predict the uncovered links and the evolution of the graph features simultaneously. [sent-22, score-0.365]

18 We provide oracle bounds under the assumption that the noise sequence has subgaussian tails and we prove that our procedure achieves a trade-off in the calibration of smoothing parameters which adjust with the sparsity and the rank of the unknown adjacency matrix. [sent-23, score-1.001]

19 In Section 3, we provide technical results with oracle inequalities and other theoretical guarantees on the joint estimation-prediction. [sent-26, score-0.341]

20 2 Estimation of low-rank graphs with autoregressive features: Our approach is based on the assumption that features can explain most of the information contained in the graph, and that these features are evolving with time. [sent-30, score-0.747]

21 We make the following assumptions about the sequence (At )t≥0 of adjacency matrices of the graphs sequence. [sent-31, score-0.472]

22 This reflects the presence of highly connected groups of nodes such as communities in social networks, or product categories and groups of loyal/fan users in marketplace data, and is sometimes motivated by the small number of factors that explain node interactions. [sent-34, score-0.217]

23 These matrices can be either deterministic or random in our theoretical analysis, but we take them deterministic for the sake of simplicity. [sent-37, score-0.116]

24 The vector time series (ω(At ))t≥0 has autoregressive dynamics, given by a VAR (Vector Auto-Regressive) model: ω(At+1 ) = W0 ω(At ) + Nt+1 , (2) where W0 ∈ Rd×d is an unknown sparse matrix and (Nt )t≥0 is a sequence of noise vectors in Rd . [sent-38, score-0.647]
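As a concrete illustration of the dynamics in (2), here is a minimal simulation sketch; the dimension d, horizon T, sparsity level and noise scale are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the feature dynamics omega(A_{t+1}) = W0 @ omega(A_t) + N_{t+1}.
# All dimensions and scales below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 20, 50, 0.1

W0 = rng.standard_normal((d, d))
W0[rng.random((d, d)) < 0.8] = 0.0                    # sparse VAR matrix
W0 /= max(1.0, np.abs(np.linalg.eigvals(W0)).max())   # rescale to keep the dynamics stable

omega = np.zeros((T + 1, d))
omega[0] = rng.standard_normal(d)
for t in range(T):
    omega[t + 1] = W0 @ omega[t] + sigma * rng.standard_normal(d)  # sub-Gaussian (here Gaussian) increment
```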

25 An example of such a feature is the node degree (the number of edges connected to each node, or the sum of their weights if the edges are weighted), which is a measure of popularity in social and commerce networks. [sent-41, score-0.18]

26 This assumes that the noise is driven by time-series dynamics (a martingale increment), where the coordinates are independent (meaning that features are independently corrupted by noise), with a sub-gaussian tail and variance uniformly bounded by a constant σ² . [sent-52, score-0.269]

27 The notations ‖·‖F , ‖·‖p , ‖·‖∞ , ‖·‖∗ and ‖·‖op stand, respectively, for the Frobenius norm, entry-wise ℓp norm, entry-wise ℓ∞ norm, trace-norm (or nuclear norm, given by the sum of the singular values) and operator norm (the largest singular value). [sent-55, score-0.461]

28 The product A ◦ B between two matrices with matching dimensions stands for the Hadamard or entry-wise product between A and B. [sent-59, score-0.116]

29 The matrix (M )+ is the componentwise positive part of the matrix M , and sign(M ) is the sign matrix associated to M with the convention sign(0) = 0. [sent-61, score-0.539]

30 If A is an n × n matrix with rank r, we write its SVD as A = U ΣV ⊤ = ∑_{j=1}^{r} σj uj vj⊤ , where Σ = diag(σ1 , . . . , σr ) is an r × r diagonal matrix containing the non-zero singular values of A in decreasing order. [sent-64, score-0.156]

31 U = [u1 , . . . , ur ] and V = [v1 , . . . , vr ] are n × r matrices with columns given by the left and right singular vectors of A. [sent-70, score-0.197]

32 The operator PA : Rn×n → Rn×n given by PA (B) = PU B + BPV − PU BPV is the projector onto the linear space spanned by the matrices uk x⊤ and y vk⊤ for 1 ≤ k ≤ r and x, y ∈ Rn . [sent-74, score-0.277]
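A short numpy sketch of this projector (the helper name and the explicit rank argument are hypothetical):

```python
# Sketch of P_A(B) = P_U B + B P_V - P_U B P_V, where P_U and P_V project onto the
# subspaces spanned by the top-r left and right singular vectors of A.
import numpy as np

def projector_PA(A, B, r):
    U, _, Vt = np.linalg.svd(A)
    U_r, V_r = U[:, :r], Vt[:r, :].T
    PU, PV = U_r @ U_r.T, V_r @ V_r.T
    return PU @ B + B @ PV - PU @ B @ PV
```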

33 2.1 Joint prediction-estimation through penalized optimization: In order to reflect the autoregressive dynamics of the features, we use a least-squares goodness-of-fit criterion that encourages the similarity between two feature vectors at successive time steps. [sent-78, score-0.51]

34 In order to induce sparsity in the estimator of W0 , we penalize this criterion using the ℓ1 norm. [sent-79, score-0.13]

35 This leads to the following penalized objective function: J1 (W ) = (1/T ) ‖XT − XT −1 W ‖F² + κ ‖W ‖1 , where κ > 0 is a smoothing parameter. [sent-80, score-0.128]

36 Now, for the prediction of AT +1 , we propose to minimize a least-squares criterion penalized by the combination of an ℓ1 norm and a trace-norm. [sent-81, score-0.24]

37 This mixture of norms induces sparsity and low rank in the adjacency matrix. [sent-82, score-0.289]

38 Such a combination of ℓ1 norm and trace-norm was already studied in [8] for the matrix regression model, and in [19] for the prediction of an adjacency matrix. [sent-83, score-0.392]

39 The objective function defined below exploits the fact that if W is close to W0 , then the features of the next graph ω(AT +1 ) should be close to W ω(AT ). [sent-84, score-0.202]

40 The estimator (Â, Ŵ ) of procedure (5) is obtained by minimizing this criterion jointly over (A, W ) ∈ A × W . It is natural to take W = Rd×d and A = (R+ )n×n since there is no a priori constraint on the values of the feature matrix W0 , while the entries of the matrix AT +1 must be positive. [sent-88, score-0.189]
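As a rough, hedged sketch, the joint criterion behind procedure (5) can be written as follows; the normalization constants and the convention of stacking the feature vectors as columns are assumptions made for illustration, not the paper's exact choices.

```python
# Hedged sketch of the joint penalized objective: VAR least-squares fit + l1 penalty on W,
# plus a prediction fit for A with l1 + trace-norm penalties. Feature vectors omega(A_t)
# are stacked as columns of X_past / X_now (an assumed convention).
import numpy as np

def joint_objective(A, W, X_past, X_now, omega_AT, omega_fn, kappa, tau, gamma):
    var_fit = np.linalg.norm(X_now - W @ X_past, "fro") ** 2 / X_past.shape[1]
    pred_fit = np.linalg.norm(omega_fn(A) - W @ omega_AT) ** 2
    penalties = (kappa * np.abs(W).sum()
                 + tau * np.linalg.norm(A, "nuc")      # trace norm, induces low rank
                 + gamma * np.abs(A).sum())            # l1 norm, induces sparsity
    return var_fit + pred_fit + penalties
```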

41 In the next section we propose oracle inequalities which prove that this procedure can estimate W0 and predict AT +1 at the same time. [sent-89, score-0.392]

42 Main result: The central contribution of our work is to bound the prediction error with high probability under the following natural hypothesis on the noise process. [sent-91, score-0.23]

43 The prediction error and the estimation error can be simultaneously bounded by the sum of three terms that involve homogeneously (a) the sparsity, (b) the rank of the adjacency matrix AT +1 , and (c) the sparsity of the VAR model matrix W0 . [sent-102, score-0.801]

44 The tight bounds we obtain are similar to the bounds of the Lasso and are upper bounded by a sum of three terms of the form C1 (log d) ‖W0 ‖0 + C2 (log n) ‖AT +1 ‖0 + C3 (log n) rank AT +1 , up to normalization factors. [sent-103, score-0.3]

45 The interplay between the rank and sparsity constraints on AT +1 are reflected in the observation that the values of C2 and C3 can be changed as long as their sum remains constant. [sent-105, score-0.388]

46 3 Oracle inequalities: In this section we give oracle inequalities for the mixed prediction-estimation error E(A, W ), defined for any A ∈ Rn×n and W ∈ Rd×d . [sent-106, score-0.424]

47 It entails in particular an upper bound on the feature estimation error ‖XT −1 (W − W0 )‖F that makes ‖(W − W0 ) ω(AT )‖2 smaller and consequently controls the prediction error over the graph edges through ‖ω(A − AT +1 )‖2 . [sent-109, score-0.428]

48 The upper bounds on E given below exhibit the dependence of the accuracy of estimation and prediction on the number of features d, the number of edges n and the number T of observed graphs in the sequence. [sent-110, score-0.412]

49 The source of randomness comes from the noise sequence (Nt )t≥0 , see Assumption 1. [sent-115, score-0.153]

50 If these noise processes are controlled correctly, we can prove the following oracle inequalities for procedure (5). [sent-116, score-0.465]

51 The next result is an oracle inequality of slow type (see for instance [3]), that holds in full generality. [sent-117, score-0.186]

52 Then, we have E(Â, Ŵ )2 ≤ inf_{(A,W )∈A×W} { E(A, W )2 + 2τ ‖A‖∗ + 2γ ‖A‖1 + 2κ ‖W ‖1 }. [sent-120, score-0.115]

53 For the proof of oracle inequalities of fast type, the restricted eigenvalue (RE) condition introduced in [3] and [10, 11] is of importance. [sent-121, score-0.341]

54 The first cone is related to the ‖W ‖1 term used in procedure (5). [sent-127, score-0.154]

55 If W ∈ Rd×d , we denote by ΘW = sign(W ) ∈ {0, ±1}d×d the signed sparsity pattern of W and by Θ⊥W ∈ {0, 1}d×d the orthogonal sparsity pattern. [sent-128, score-0.26]

56 For a fixed matrix W ∈ Rd×d and c > 0, we introduce the cone C1 (W, c). [sent-129, score-0.177]

57 This cone contains the matrices W ′ that have their largest entries in the sparsity pattern of W . [sent-131, score-0.348]

58 The second cone is related to the mixture of the terms ‖A‖∗ and ‖A‖1 ; before defining it, we need further notations and definitions. [sent-132, score-0.138]

59 This cone consists of the matrices A′ with large entries close to those of A and that are “almost aligned” with the row and column spaces of A. [sent-136, score-0.218]

60 For W ∈ W and c > 0, we have µ1 (W, c) = inf { µ > 0 : ‖ΘW ◦ W ′ ‖F ≤ (µ/√T ) ‖XT −1 W ′ ‖F , ∀W ′ ∈ C1 (W, c) }. [sent-139, score-0.115]

61 Now we can state the following Theorem that gives a fast oracle inequality for our procedure using RE. [sent-142, score-0.238]

62 Note that the residual term from this oracle inequality mixes the notions of sparsity of A and W via the terms rank(A), ‖A‖0 and ‖W ‖0 . [sent-147, score-0.316]

63 It says that our mixed penalization procedure provides an optimal trade-off between fitting the data and complexity, measured by both sparsity and low-rank. [sent-148, score-0.226]

64 In the next Theorem 3, we obtain convergence rates for the procedure (5) by combining Theorem 2 with controls on the noise processes. [sent-150, score-0.196]

65 Here σω,j ² = (1/T ) ∑_{t=1}^{T} ωj (At−1 )2 + ωj (AT )2 , vΩ,op ² = ‖(1/d) ∑_{j=1}^{d} Ωj Ωj⊤ ‖op and vΩ,∞ ² = ‖(1/d) ∑_{j=1}^{d} Ωj ◦ Ωj ‖∞ , which are the (observable) variance terms that naturally appear in the controls of the noise processes. [sent-154, score-0.332]

66 This term is a small price to pay for the fact that no independence assumption is required on the noise sequence (Nt )t≥0 , but only a martingale increment structure with sub-gaussian tails. [sent-159, score-0.299]

67 Consider the procedure (Â, Ŵ ) given by (5) with smoothing parameters given by τ = 3ασ vΩ,op √(2(x + log(2n))/d), γ = 3(1 − α)σ vΩ,∞ √(2(x + 2 log n)/d) and κ = 6σσω √(2e(x + 2 log d + T )/(dT )). [sent-161, score-0.23]

68 In the next Theorem, we propose more explicit upper bounds for both the individual estimation of W0 and the prediction of AT +1 . [sent-165, score-0.167]

69 1 ∗ ≤ 5κµ1 (W0 )2 W0 1 ω(A) − ω(AT +1 ) d 0 +µ2 (AT +1 , W0 )(6 2 2 + rank AT +1 +5 γ τ AT +1 0 ) 25 µ2 (A, W0 )2 τ 2 rank(A) + γ 2 A 0 ) . [sent-168, score-0.195]

70 The proximal operator for the trace norm is given by the shrinkage operation: if Z = U diag(σ1 , · · · , σn )V ⊤ is the singular value decomposition of Z, then proxτ ‖·‖∗ (Z) = U diag((σ1 − τ )+ , · · · , (σn − τ )+ )V ⊤ . [sent-172, score-0.265]

71 Similarly, the proximal operator for the ℓ1 -norm is the soft thresholding operator, defined using the entry-wise product of matrices denoted by ◦: proxγ‖·‖1 (Z) = sign(Z) ◦ (|Z| − γ)+ . [sent-174, score-0.307]

72 The algorithm converges under very mild conditions when the step size θ is smaller than 2/L, where L is the operator norm of the joint quadratic loss Φ : (A, W ) → ‖XT − XT −1 W ‖F² + ‖ω(A) − W ω(AT )‖2² (up to normalizing constants). [sent-176, score-0.169]
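The sketch below spells out the two proximal operators and a simplified generalized forward-backward loop over A alone (W would be handled analogously with the ℓ1 prox); the splitting weights, step-size rule and iteration count are illustrative assumptions rather than the authors' exact algorithm.

```python
# Proximal operators described above and a toy generalized forward-backward iteration.
import numpy as np

def prox_trace(Z, t):
    """Singular-value shrinkage: prox of t * trace norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def prox_l1(Z, t):
    """Entry-wise soft thresholding: prox of t * l1 norm (used for W)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def prox_l1_pos(Z, t):
    """Soft thresholding combined with projection onto nonnegative entries (for A)."""
    return np.maximum(Z - t, 0.0)

def generalized_forward_backward_A(grad_A, A0, tau, gamma, theta, n_iter=200):
    """Minimize smooth(A) + tau*||A||_* + gamma*||A||_1 over A >= 0; theta should be < 2/L."""
    A = A0.copy()
    Z1, Z2 = A0.copy(), A0.copy()         # one auxiliary variable per non-smooth term
    for _ in range(n_iter):
        G = grad_A(A)                     # gradient of the smooth quadratic part
        Z1 += prox_trace(2.0 * A - Z1 - theta * G, 2.0 * theta * tau) - A
        Z2 += prox_l1_pos(2.0 * A - Z2 - theta * G, 2.0 * theta * gamma) - A
        A = 0.5 * (Z1 + Z2)               # equal weights on the two splitting blocks
    return A
```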

73 A generative model for graphs having linearly autoregressive features: Let V0 ∈ Rn×r be a sparse matrix and V0† its pseudo-inverse, such that V0† V0 = V0⊤ V0†⊤ = Ir . [sent-182, score-0.608]

74 Fix two sparse matrices W0 ∈ Rr×r and U0 ∈ Rn×r . [sent-183, score-0.163]

75 Now define the sequence of matrices (At )t≥0 for t = 1, 2, · · · by Ut = Ut−1 W0 + Nt and At = Ut V0⊤ + Mt . [sent-184, score-0.161]

76 Here Nt and Mt are i.i.d. sparse noise matrices, which means that for any pair of indices (i, j), with high probability (Nt )i,j = 0 and (Mt )i,j = 0. [sent-186, score-0.271]

77 The sequence ω(At ) = Ut + Mt (V0† )⊤ follows the linear autoregressive relation ω(At ) = ω(At−1 ) W0 + Nt + Mt (V0† )⊤ . [sent-188, score-0.417]

78 For any time index t, the matrix At is close to Ut V0⊤ , which has rank at most r. [sent-190, score-0.27]

79 The matrices At and Ut are both sparse by construction. [sent-191, score-0.163]

80 In our experiments the noise matrices Mt and Nt were built by soft-thresholding i.i.d. noise. [sent-195, score-0.224]

81 We took as input T = 10 successive snapshots of graphs on n = 50 nodes with rank r = 5. [sent-199, score-0.579]
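A minimal generator following this description (noise distributions, thresholds and scales are illustrative assumptions; the transpose on V0 is added so that dimensions match):

```python
# Toy generator for the synthetic setup: Ut = U_{t-1} W0 + Nt, At = Ut V0^T + Mt,
# with sparse (soft-thresholded) noise. Scales and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, r, T = 50, 5, 10
soft = lambda Z, t: np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

V0 = soft(rng.standard_normal((n, r)), 1.0)                   # sparse factor matrix
W0 = soft(rng.standard_normal((r, r)), 1.0)
W0 /= max(1.0, np.abs(np.linalg.eigvals(W0)).max())           # keep the VAR stable
U = soft(rng.standard_normal((n, r)), 1.0)                    # U0

A_seq = []
for _ in range(T):
    U = U @ W0 + soft(0.1 * rng.standard_normal((n, r)), 0.05)              # Ut
    A_seq.append(U @ V0.T + soft(0.1 * rng.standard_normal((n, n)), 0.05))  # At: low rank + sparse noise
```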

82 We compare our methods to standard baselines in link prediction. [sent-202, score-0.139]
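For reference, a hedged sketch of how link-prediction performance could be scored with AUC, treating the entries of the predicted matrix as scores for the binarized edges of AT +1 ; the masking of observed pairs is an assumption about the protocol, not a detail taken from the paper.

```python
# Hedged sketch of AUC scoring for link prediction (uses scikit-learn's roc_auc_score).
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(A_pred, A_true, observed_mask=None):
    scores = A_pred.ravel()
    labels = (A_true.ravel() > 0).astype(int)
    if observed_mask is not None:              # optionally score only unobserved pairs
        keep = ~observed_mask.ravel()
        scores, labels = scores[keep], labels[keep]
    return roc_auc_score(labels, scores)
```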

83 The competitor methods are the nearest neighbors (NN) and static sparse and low-rank estimation, that is the link prediction algorithm suggested in [19]. [sent-205, score-0.44]

84 The two methods autoregressive low-rank and static low-rank are regularized using only the trace-norm (corresponding to forcing γ = 0) and are slightly inferior to their sparse and low-rank rivals. [sent-207, score-0.588]

85 Since the matrix V0 defining the linear map ω is unknown, we consider the feature map ω(A) = AV where AT = U ΣV ⊤ is the SVD of AT . [sent-208, score-0.192]
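A small sketch of this surrogate feature map (whether the cumulative or the latest adjacency matrix is decomposed, and the value of r, are assumptions for illustration):

```python
# omega(A) = A V, with V the top-r right singular vectors of the cumulative adjacency matrix.
import numpy as np

def make_feature_map(A_seq, r):
    A_cum = np.sum(A_seq, axis=0)                       # cumulative adjacency matrix
    _, _, Vt = np.linalg.svd(A_cum, full_matrices=False)
    V = Vt[:r, :].T                                     # top-r right singular vectors
    return lambda A: A @ V                              # linear map omega
```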

86 The left-hand plot of Figure 1 shows that if few snapshots are available (T ≤ 4 in these experiments), then static approaches are to be preferred. [sent-214, score-0.339]

87 75 2 3 4 5 6 7 8 9 0 10 T 10 20 30 rank A 40 50 60 70 T+1 Figure 1: Left: performance of algorithms in terms of Area Under the ROC Curve, average and confidence intervals over 50 runs. [sent-229, score-0.195]

88 Feature-autoregressive approaches, on the other hand, outperform as soon as a sufficient number T of graph snapshots is available (see the phase transition). [sent-231, score-0.673]

89 The decreasing performance of static algorithms can be explained by the fact that they use as input a mixture of graphs observed at different time steps. [sent-232, score-0.249]

90 Since at each time step the nodes have specific latent factors, and despite the slow evolution of these factors, summing graphs observed at different times ends up confusing those factors. [sent-233, score-0.203]

91 The right-hand figure is a phase transition diagram showing in which part of the rank and time domain the estimation is accurate, and it illustrates the interplay between these two domain parameters. [sent-236, score-0.35]

92 In the current work we used the projection onto the vector space of the top-r singular vectors of the cumulative adjacency matrix as the linear map ω, and this choice has shown empirical superiority to other choices. [sent-239, score-0.393]

93 The question of choosing the best measurement to summarize graph information, as in compressed sensing, seems to have both theoretical and practical potential. [sent-240, score-0.13]

94 In this paper we consider only an autoregressive process of order 1. [sent-245, score-0.372]

95 For better prediction accuracy, one could consider more general models, such as vector ARMA models, and use model-selection techniques for the choice of the orders of the model. [sent-246, score-0.122]

96 We presented a procedure for predicting graphs having linear autoregressive features. [sent-248, score-0.582]

97 Our approach can easily be generalized to non-linear prediction through kernel-based methods. [sent-249, score-0.122]

98 A new approach to collaborative filtering: operator estimation with spectral regularization. [sent-256, score-0.176]

99 Nuclear norm penalization and optimal rates for noisy matrix completion. [sent-324, score-0.182]

100 On the conditions used to prove oracle results for the Lasso. [sent-401, score-0.186]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nt', 0.393), ('autoregressive', 0.372), ('rank', 0.195), ('oracle', 0.186), ('adjacency', 0.159), ('link', 0.139), ('static', 0.132), ('sparsity', 0.13), ('graph', 0.13), ('prediction', 0.122), ('inequalities', 0.119), ('graphs', 0.117), ('matrices', 0.116), ('inf', 0.115), ('pu', 0.113), ('noise', 0.108), ('xt', 0.108), ('mt', 0.106), ('cone', 0.102), ('var', 0.096), ('op', 0.094), ('ut', 0.092), ('ga', 0.088), ('snapshots', 0.085), ('pa', 0.081), ('singular', 0.081), ('matrix', 0.075), ('rn', 0.075), ('prox', 0.074), ('bpv', 0.074), ('cachan', 0.074), ('cmla', 0.074), ('smoothing', 0.073), ('features', 0.072), ('operator', 0.07), ('social', 0.068), ('umr', 0.065), ('norm', 0.063), ('rd', 0.063), ('interplay', 0.063), ('evgeniou', 0.063), ('collaborative', 0.061), ('ens', 0.06), ('purchases', 0.06), ('uncovered', 0.057), ('pv', 0.057), ('edges', 0.056), ('penalized', 0.055), ('cnrs', 0.054), ('gw', 0.054), ('richard', 0.054), ('assumption', 0.053), ('nodes', 0.052), ('procedure', 0.052), ('projector', 0.052), ('proximal', 0.051), ('dantzig', 0.05), ('increment', 0.048), ('phase', 0.047), ('sparse', 0.047), ('ltering', 0.047), ('theorem', 0.046), ('estimation', 0.045), ('martingale', 0.045), ('micchelli', 0.045), ('users', 0.045), ('sequence', 0.045), ('dynamics', 0.044), ('sign', 0.044), ('recommender', 0.044), ('penalization', 0.044), ('responses', 0.042), ('auc', 0.042), ('evolving', 0.042), ('predicting', 0.041), ('re', 0.04), ('roc', 0.04), ('map', 0.039), ('onto', 0.039), ('feature', 0.039), ('lasso', 0.039), ('proceeding', 0.038), ('nn', 0.038), ('diag', 0.037), ('factorization', 0.037), ('regularized', 0.037), ('notations', 0.036), ('nuclear', 0.036), ('controls', 0.036), ('eigenvalue', 0.036), ('joint', 0.036), ('ft', 0.036), ('cand', 0.036), ('regression', 0.036), ('log', 0.035), ('minimizing', 0.035), ('predict', 0.035), ('assumptions', 0.035), ('evolution', 0.034), ('completion', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm. 1

2 0.19624518 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

Author: Odalric-ambrym Maillard

Abstract: This paper aims to take a step forwards making the term “intrinsic motivation” from reinforcement learning theoretically well founded, focusing on curiositydriven learning. To that end, we consider the setting where, a fixed partition P of a continuous space X being given, and a process ν defined on X being unknown, we are asked to sequentially decide which cell of the partition to select as well as where to sample ν in that cell, in order to minimize a loss function that is inspired from previous work on curiosity-driven learning. The loss on each cell consists of one term measuring a simple worst case quadratic sampling error, and a penalty term proportional to the range of the variance in that cell. The corresponding problem formulation extends the setting known as active learning for multi-armed bandits to the case when each arm is a continuous region, and we show how an adaptation of recent algorithms for that problem and of hierarchical optimistic sampling algorithms for optimization can be used in order to solve this problem. The resulting procedure, called Hierarchical Optimistic Region SElection driven by Curiosity (HORSE.C) is provided together with a finite-time regret analysis. 1

3 0.16030319 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

Author: Arnak Dalalyan, Yin Chen

Abstract: In this paper, we develop a novel approach to the problem of learning sparse representations in the context of fused sparsity and unknown noise level. We propose an algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes the aforementioned learning task by means of a second-order cone program. A special emphasize is put on the particular instance of fused sparsity corresponding to the learning in presence of outliers. We establish finite sample risk bounds and carry out an experimental evaluation on both synthetic and real data. 1

4 0.1373938 247 nips-2012-Nonparametric Reduced Rank Regression

Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty

Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1

5 0.13046628 324 nips-2012-Stochastic Gradient Descent with Only One Projection

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi

Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, the proposed algorithms achieve an O(1/√T) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1

6 0.12684888 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

7 0.11519808 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

8 0.11203337 69 nips-2012-Clustering Sparse Graphs

9 0.10955525 285 nips-2012-Query Complexity of Derivative-Free Optimization

10 0.10795928 180 nips-2012-Learning Mixtures of Tree Graphical Models

11 0.10673597 208 nips-2012-Matrix reconstruction with the local max norm

12 0.10431816 319 nips-2012-Sparse Prediction with the $k$-Support Norm

13 0.098969363 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

14 0.097385451 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

15 0.097276017 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

16 0.096419148 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

17 0.095100082 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

18 0.094003759 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

19 0.092484906 293 nips-2012-Relax and Randomize : From Value to Algorithms

20 0.090775505 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.285), (1, 0.043), (2, 0.135), (3, -0.026), (4, 0.071), (5, 0.038), (6, -0.021), (7, -0.03), (8, -0.121), (9, 0.05), (10, 0.049), (11, 0.037), (12, -0.14), (13, -0.013), (14, 0.024), (15, 0.064), (16, 0.112), (17, 0.079), (18, 0.003), (19, -0.038), (20, -0.001), (21, 0.04), (22, -0.018), (23, -0.085), (24, -0.06), (25, -0.007), (26, -0.046), (27, 0.11), (28, 0.006), (29, -0.128), (30, -0.05), (31, 0.092), (32, -0.005), (33, 0.021), (34, 0.025), (35, -0.02), (36, 0.072), (37, -0.058), (38, 0.113), (39, -0.022), (40, -0.051), (41, -0.028), (42, -0.04), (43, 0.059), (44, 0.06), (45, 0.022), (46, 0.061), (47, 0.104), (48, -0.024), (49, 0.09)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9590258 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm. 1

2 0.6937803 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

3 0.65489101 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

Abstract: The Lovász ϑ function of a graph, a fundamental tool in combinatorial optimization and approximation algorithms, is computed by solving a SDP. In this paper we establish that the Lovász ϑ function is equivalent to a kernel learning problem related to one class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call SVM − ϑ graphs, on which the Lovász ϑ function can be approximated well by a one-class SVM. This leads to novel use of SVM techniques for solving algorithmic problems in large graphs e.g. identifying a planted clique of size Θ(√n) in a random graph G(n, 1/2). A classic approach for this problem involves computing the ϑ function, however it is not scalable due to SDP computation. We show that the random graph with a planted clique is an example of SVM − ϑ graph. As a consequence a SVM based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. We introduce the notion of common orthogonal labelling and show that it can be computed by solving a Multiple Kernel learning problem. It is further shown that such a labelling is extremely useful in identifying a large common dense subgraph in multiple graphs, which is known to be a computationally difficult problem. The proposed algorithm achieves an order of magnitude scalability compared to state of the art methods. 1

4 0.64098006 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

Author: Antonino Freno, Mikaela Keller, Marc Tommasi

Abstract: Statistical models for networks have been typically committed to strong prior assumptions concerning the form of the modeled distributions. Moreover, the vast majority of currently available models are explicitly designed for capturing some specific graph properties (such as power-law degree distributions), which makes them unsuitable for application to domains where the behavior of the target quantities is not known a priori. The key contribution of this paper is twofold. First, we introduce the Fiedler delta statistic, based on the Laplacian spectrum of graphs, which allows to dispense with any parametric assumption concerning the modeled network properties. Second, we use the defined statistic to develop the Fiedler random field model, which allows for efficient estimation of edge distributions over large-scale random networks. After analyzing the dependence structure involved in Fiedler random fields, we estimate them over several real-world networks, showing that they achieve a much higher modeling accuracy than other well-known statistical approaches.

5 0.63142824 221 nips-2012-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1

6 0.6298663 346 nips-2012-Topology Constraints in Graphical Models

7 0.61875075 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

8 0.61176598 327 nips-2012-Structured Learning of Gaussian Graphical Models

9 0.61030614 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

10 0.59769869 285 nips-2012-Query Complexity of Derivative-Free Optimization

11 0.58933765 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

12 0.57217473 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

13 0.57186985 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

14 0.57165134 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation

15 0.56933087 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

16 0.56884199 34 nips-2012-Active Learning of Multi-Index Function Models

17 0.56816936 208 nips-2012-Matrix reconstruction with the local max norm

18 0.56395793 319 nips-2012-Sparse Prediction with the $k$-Support Norm

19 0.56112367 86 nips-2012-Convex Multi-view Subspace Learning

20 0.56092006 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.056), (1, 0.013), (21, 0.029), (36, 0.01), (38, 0.168), (39, 0.013), (42, 0.034), (44, 0.15), (54, 0.035), (55, 0.023), (74, 0.077), (76, 0.142), (80, 0.126), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95098627 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods. 1

2 0.94711584 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1

3 0.92696577 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof

Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-ofthe-art methods. 1 Introduction and Related Work In the last years, the random forest framework [1, 6] has become a very popular and powerful tool for classification and regression problems by exhibiting many appealing properties like inherent multi-class capability, robustness to label noise and reduced tendencies to overfitting [7]. They are considered to be close to an ideal learner [13], making them attractive in many areas of computer vision like image classification [5, 17], clustering [19], regression [8] or semantic segmentation [24, 15, 18]. In this work we show how the decision forest algorithm can be extended to include contextual information during learning and inference for classification and regression problems. We focus on applying random forests to object detection, i.e. the problem of localizing multiple instances of a given object class in a test image. This task has been previously addressed in random forests [9], where the trees were modified to learn a mapping between the appearance of an image patch and its relative position to the object category centroid (i.e. center voting information). During inference, the resulting Hough Forest not only performs classification on test samples but also casts probabilistic votes in a generalized Hough-voting space [3] that is subsequently used to obtain object center hypotheses. Ever since, a series of applications such as tracking and action recognition [10], body-joint position estimation [12] and multi-class object detection [22] have been presented. However, Hough Forests typically produce non-distinctive object hypotheses in the Hough space and hence there is the need to perform non-maximum suppression (NMS) for obtaining the final results. While this has been addressed in [4, 26], another shortcoming is that standard (Hough) forests treat samples in a completely independent way, i.e. there is no mechanism that encourages the classifier to perform consistent predictions. Within this work we are proposing that context information can be used to overcome the aforementioned problems. For example, training data for visual learning is often represented by images in form of a (regular) pixel grid topology, i.e. objects appearing in natural images can often be found in a specific context. The importance of contextual information was already highlighted in the 80’s with 1 Figure 1: Top row: Training image, label image, visualization of priority-based growing of tree (the lower, the earlier the consideration during training.). 
Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. 
We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. 
This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t:  if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1)  P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). 
Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. 
We evaluated our method on the TUD pedestrian data base [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than state-of-the-art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. The trees were stopped growing when < 7 samples were available. As image features, we used the the first 16 feature channels provided in the publicly available Hough Forest code of [9]. In order to obtain the object detection hypotheses from the Hough space, we use the same Non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASAL-VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7. 4.1 Evaluation using standard protocol training set The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by including also the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first authors homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. 
The ground truth we use has been released with [4] and contains a total of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green), and our Context-Sensitive Forest without and with the priority-queue-based tree construction (red/magenta). When the priority queue is not used, the trees are grown in a breadth-first manner. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both context information and the priority-based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground truth annotations. The data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale compared to the Campus scene, but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and we observe a similar development of the curves. However, this is somewhat expected, as the training data does not properly reflect the occlusions we actually want to model.

[Figure 3 panels (precision vs. recall): TUD Campus (3 scales) and TUD-Crossing (3 scales) in the top row, TUD Campus (3 scales) and TUD Campus (5 scales) in the bottom row; legend entries include the baseline Hough Forest, Barinova et al. CVPR'10, Riemenschneider et al. ECCV'12, Leibe et al. IJCV'08, and the proposed Context-Sensitive forest with and without priority queue.] Figure 3: Precision-recall curves for detections. Top row: standard training (400 images), evaluation on Campus and Crossing (3 scales). Bottom row: training on Crossing annotations of [23], evaluation on Campus, 3 and 5 scales. Right images: qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes; (green) correctly found by our method, (blue) ground truth, (red) wrong association, (cyan) missed detection.

4.2 Evaluation on the Campus scene using the Crossing scene as training set

In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precision-recall curve in the second row of Figure 3 we can see that the margin between the baseline and our proposed method is clearly improved (a gain of ≈ 9% in recall at a precision of 90%) when evaluating on the same 3 scales.
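For completeness, the sketch below outlines how precision-recall curves such as those in Figure 3 are typically computed from scored detection hypotheses, using the iou helper from the previous sketch and greedy matching in descending score order. This is a generic evaluation recipe under the assumptions stated in its comments, not the authors' evaluation scripts.

def precision_recall_curve(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (score, image_id, box); ground_truth: dict mapping
    image_id to a list of boxes. Each ground-truth box may be matched at most once."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    tp = fp = 0
    curve = []
    for score, img, box in detections:
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou and not matched[img][j]:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[img][best_j] = True
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_gt if n_gt else 0.0
        curve.append((precision, recall))
    return curve

Sweeping the detection score threshold in this way yields one (precision, recall) point per hypothesis, which is what the curves in Figure 3 plot.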
With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in recall, however at the cost of losing 2–3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation already drops at a recall of ≈ 20%.

5 Conclusions

In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection, which typically treat training samples in an independent manner. Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) grow trees in a priority-driven manner and (b) employ novel context-based test functions to check for mutual object centroid agreements. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can exploit training data containing multiple training objects significantly better.

Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) from project 'Fibermorph' with number P22261-N22.

References
[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In (CVPR), 2008.
[3] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981.
[4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. In (CVPR), 2010.
[5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In (ICCV), 2007.
[6] L. Breiman. Random forests. Machine Learning, 2001.
[7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, volume 7, pages 81–227, 2012.
[8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010.
[9] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In (CVPR), 2009.
[10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. (PAMI), 2011.
[11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006.
[12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In (ICCV), 2011.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. (PAMI), 5(3):267–287, 1983.
[15] P. Kontschieder, S. Rota Bulò, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In (ICCV), 2011.
[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. (IJCV), 2008.
[17] R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In (CVPR), 2005.
[18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In (IPMI), 2011.
[19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In (NIPS), 2006.
[20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. (PAMI), 16(9):933–945, 1994.
[21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In (ICCV), 2007.
[22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In (CVPR), 2011.
[23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In (ECCV), 2012.
[24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In (CVPR), 2008.
[25] Z. Tu. Auto-context and its application to high-level vision tasks. In (CVPR), 2008.
[26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the Hough transform for 3D shape recognition and registration. In (BMVC), 2011.

4 0.91307938 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

same-paper 5 0.90642262 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward agorithm. 1

6 0.8681792 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.86316264 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

8 0.86251855 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

9 0.85665232 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

10 0.85519373 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

11 0.85261452 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.85201871 65 nips-2012-Cardinality Restricted Boltzmann Machines

13 0.85195196 292 nips-2012-Regularized Off-Policy TD-Learning

14 0.85142559 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

15 0.85064965 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

16 0.8504616 160 nips-2012-Imitation Learning by Coaching

17 0.85043496 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

18 0.85038567 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

19 0.85002065 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

20 0.84963548 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs