nips nips2004 nips2004-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
1 Introduction
Fitting a target matrix Y with a low-rank matrix X by minimizing the sum-squared error is a common approach to modeling tabulated data, and can be done explicitly in terms of the singular value decomposition of Y. It is often desirable, though, to minimize a different loss function: a loss corresponding to a specific probabilistic model (where X are the mean parameters, as in pLSA [1], or the natural parameters [2]), or a loss function such as the hinge loss, appropriate for binary or discrete ordinal data. Even with a squared-error loss, when only some of the entries in Y are observed, as is the case for collaborative filtering, local minima arise and SVD techniques are no longer applicable [3].
In this paper we suggest regularizing the factorization by constraining the norms of U and V, constraints that arise naturally when matrix factorizations are viewed as feature learning for large-margin linear prediction (Section 2). Unlike low-rank factorizations, such constraints lead to convex optimization problems that can be formulated as semi-definite programs (Section 4). Throughout the paper, we focus on using low-norm factorizations for "collaborative prediction": predicting unobserved entries of a target matrix Y, based on a subset S of observed entries Y_S. In Section 5, we present generalization error bounds for collaborative prediction using low-norm factorizations.
2 Matrix Factorization as Feature Learning
Using a low-rank model for collaborative prediction [5, 6, 3] is straightforward: a low-rank matrix X is sought that minimizes a loss versus the observed entries Y_S. Unobserved entries in Y are predicted according to X.
Matrices of rank at most k are those that can be factored into X = UV′, U ∈ R^{n×k}, V ∈ R^{m×k}, and so seeking a low-rank matrix is equivalent to seeking a low-dimensional factorization.
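As a quick illustration (ours, not the paper's code), the rank/factorization equivalence is easy to check numerically: any product of an n×k and a k×m matrix has rank at most k.

```python
import numpy as np

# Illustration: X = U V' with U in R^{n x k}, V in R^{m x k} has rank <= k,
# so searching over rank-k matrices is the same as searching over
# k-dimensional factorizations.
rng = np.random.default_rng(0)
n, m, k = 6, 5, 2
U = rng.standard_normal((n, k))
V = rng.standard_normal((m, k))
X = U @ V.T                                  # n x m matrix of rank at most k
rank = int(np.linalg.matrix_rank(X))         # 2 for this random example
```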
If one of the matrices, say U, is fixed, and only the other matrix V needs to be learned, then fitting each column of the target matrix Y is a separate linear prediction problem. Each row of U functions as a "feature vector", and each column of V is a linear predictor, predicting the entries in the corresponding column of Y based on the "features" in U.
In collaborative prediction, both U and V are unknown and need to be estimated. This can be thought of as learning feature vectors (rows in U) for each of the rows of Y, enabling good linear prediction across all of the prediction problems (columns of Y) concurrently, each with a different linear predictor (columns of V). The features are learned without any external information or constraints, which is impossible for a single prediction task (we would use the labels as features). The underlying assumption that enables us to do this in a collaborative filtering situation is that the prediction tasks (columns of Y) are related, in that the same features can be used for all of them, though possibly in different ways. Low-rank collaborative prediction corresponds to regularizing by limiting the dimensionality of the feature space: each column is a linear prediction problem in a low-dimensional space.
Consider adding to the loss a penalty term which is the sum of squares of the entries in U and V, i.e. ‖U‖²_Fro + ‖V‖²_Fro (the squared Frobenius norms). With an appropriate loss function, or constraints on the observed entries, these correspond to large-margin linear discrimination problems.
For example, if we learn a binary observation matrix by minimizing a hinge loss plus such a regularization term, each conditional problem decomposes into a collection of SVMs.
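A minimal sketch of this decomposition, under our own toy setup (the `fit_column` routine, its parameters, and the data are illustrative, not the paper's): with U fixed, each column of V is fit by an independent regularized hinge-loss problem, i.e. a linear SVM whose feature vectors are the rows of U.

```python
import numpy as np

def fit_column(U, y, lam=0.1, steps=1000, lr=0.05):
    """Subgradient descent on lam/2 ||v||^2 + hinge loss, for one column of Y."""
    v = np.zeros(U.shape[1])
    for _ in range(steps):
        margins = y * (U @ v)
        active = margins < 1                       # entries violating the margin
        grad = lam * v - (U[active] * y[active, None]).sum(axis=0)
        v -= lr * grad / len(y)
    return v

rng = np.random.default_rng(0)
U = rng.standard_normal((40, 3))                   # fixed "feature vectors"
y = np.sign(U @ np.array([1.0, -2.0, 0.5]))        # separable labels for one column
v = fit_column(U, y)
acc = float(np.mean(np.sign(U @ v) == y))          # training accuracy of this column's SVM
```

Fitting all of V just repeats this loop over the columns of Y.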
3 Maximum-Margin Matrix Factorizations
Matrices with a factorization X = UV′, where U and V have low Frobenius norm (recall that the dimensionality of U and V is no longer bounded!), can be characterized in several equivalent ways, and are known as low trace norm matrices:

Definition 1. The trace norm¹ ‖X‖_Σ of a matrix is the sum of its singular values, and satisfies
‖X‖_Σ = min_{X=UV′} ‖U‖_Fro ‖V‖_Fro = min_{X=UV′} ½(‖U‖²_Fro + ‖V‖²_Fro).

The characterization in terms of the singular value decomposition allows us to characterize low trace norm matrices as the convex hull of bounded-norm rank-one matrices:

Lemma 2. {X | ‖X‖_Σ ≤ B} = conv{ uv′ | u ∈ R^n, v ∈ R^m, |u| = |v| = √B }.

In particular, the trace norm is a convex function, and the set of bounded trace norm matrices is a convex set.
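Definition 1 can be sanity-checked numerically: splitting the singular values evenly between the two factors gives a factorization that attains both minima. A small NumPy sketch (our illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

s = np.linalg.svd(X, compute_uv=False)
trace_norm = s.sum()                         # sum of singular values

# Balanced factorization from the SVD: U = Us*sqrt(s), V = Vs*sqrt(s), X = U V'.
Us, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)
V = Vt.T * np.sqrt(s)

prod    = np.linalg.norm(U, 'fro') * np.linalg.norm(V, 'fro')
half_sq = 0.5 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
# For this factorization both expressions equal the trace norm,
# so the minima in Definition 1 are attained here.
```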
For convex loss functions, seeking a bounded trace norm matrix minimizing the loss versus some target matrix is a convex optimization problem. This contrasts sharply with minimizing loss over low-rank matrices, which is a non-convex problem. Although the sum-squared error versus a fully observed target matrix can be minimized efficiently using the SVD (despite the optimization problem being non-convex!), minimizing other loss functions, or even minimizing a squared loss versus a partially observed matrix, is a difficult optimization problem with multiple local minima [3].
¹ The trace norm is also known as the nuclear norm and the Ky-Fan n-norm.
In fact, the trace norm has been suggested as a convex surrogate for the rank in various rank-minimization problems [7]. Here, we justify the trace norm directly, both as a natural extension of large-margin methods and by providing generalization error bounds.
We consider hard-margin matrix factorization, where we seek a minimum trace norm matrix X that matches the observed labels with a margin of one: Y_ia X_ia ≥ 1 for all ia ∈ S. We also consider soft-margin learning, where we minimize a trade-off between the trace norm of X and its hinge loss relative to Y_S:

minimize ‖X‖_Σ + c Σ_{ia∈S} max(0, 1 − Y_ia X_ia).    (1)

As in maximum-margin linear discrimination, there is an inverse dependence between the norm and the margin.
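Objective (1) is easy to transcribe directly. The following sketch (our code, with the observed set S represented as a boolean mask) evaluates it for a candidate X:

```python
import numpy as np

def mmmf_objective(X, Y, S, c):
    """Trace norm of X plus c times the hinge loss on observed entries (S: bool mask)."""
    trace_norm = np.linalg.svd(X, compute_uv=False).sum()
    hinge = np.maximum(0.0, 1.0 - Y[S] * X[S]).sum()
    return trace_norm + c * hinge

Y = np.array([[1.0, -1.0], [-1.0, 1.0]])
S = np.array([[True, True], [True, False]])
X = 1.5 * Y                            # every observed margin is 1.5, so no hinge loss
obj = mmmf_objective(X, Y, S, c=1.0)   # equals the trace norm of X, here 3.0
```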
Fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin. As in large-margin discrimination with (e.g. radial) kernels, the data is always separable with sufficiently high trace norm (a trace norm of √(n|S|) is sufficient to attain a margin of one).
The max-norm variant. Instead of constraining the norms of the rows in U and V on average, we can constrain all rows of U and V to have small L2 norm, replacing the trace norm with the max-norm ‖X‖_max = min_{X=UV′} (max_i |U_i|)(max_a |V_a|), where U_i, V_a are the rows of U, V.
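Any explicit factorization gives an upper bound on the max-norm, namely the quantity inside the minimum. A small sketch (our illustration) also checks the Cauchy-Schwarz consequence that every entry of X is bounded by it in magnitude:

```python
import numpy as np

def max_norm_bound(U, V):
    """(max_i |U_i|)(max_a |V_a|): the quantity minimized in the max-norm definition."""
    return np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max()

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 2))
V = rng.standard_normal((3, 2))
X = U @ V.T
bound = max_norm_bound(U, V)
# Each entry X_ia = <U_i, V_a> is at most the bound in magnitude (Cauchy-Schwarz).
assert np.abs(X).max() <= bound + 1e-12
```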
First, note that predicting the target matrix with the signs of a rank-k matrix corresponds to mapping the "items" (columns) to points in R^k, and the "users" (rows) to homogeneous hyperplanes, such that each user's hyperplane separates his positive items from his negative items. Hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hyperplanes in a high-dimensional unit sphere, such that each user's hyperplane separates his positive and negative items with a large margin (the margin being the inverse of the max-norm).
4 Learning Maximum-Margin Matrix Factorizations
In this section we consider the problem of learning a low norm factorization UV′, given a binary target matrix.
Bounding the trace norm of UV′ by ½(‖U‖²_Fro + ‖V‖²_Fro), we can characterize the trace norm in terms of the trace of a positive semi-definite matrix:

Lemma 3 ([7, Lemma 1]). For any X ∈ R^{n×m} and t ∈ R: ‖X‖_Σ ≤ t iff there exist A ∈ R^{n×n} and B ∈ R^{m×m} such that [A X; X′ B] ⪰ 0 and tr A + tr B ≤ 2t.

Conversely, if ‖X‖_Σ ≤ t we can write X = UV′ with tr(UU′) + tr(VV′) ≤ 2t and consider the positive semi-definite matrix [UU′ X; X′ VV′].
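This "conversely" direction can be verified numerically: with the balanced SVD factorization, the block matrix is positive semi-definite and its trace is exactly twice the trace norm (our sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Us, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Us * np.sqrt(s)                 # balanced factorization: X = U V'
V = Vt.T * np.sqrt(s)

A, B = U @ U.T, V @ V.T
M = np.block([[A, X], [X.T, B]])    # the witness matrix from Lemma 3
min_eig = np.linalg.eigvalsh(M).min()     # >= 0 up to round-off: M is p.s.d.
trace_sum = np.trace(A) + np.trace(B)     # equals 2 * ||X||_Sigma
```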
Lemma 3 can be used to formulate minimizing the trace norm as a semi-definite optimization problem (SDP). Soft-margin matrix factorization (1) can be written as:

min ½(tr A + tr B) + c Σ_{ia∈S} ξ_ia   s.t.   [A X; X′ B] ⪰ 0,   Y_ia X_ia ≥ 1 − ξ_ia,   ξ_ia ≥ 0   ∀ia ∈ S,    (2)

where M ⪰ 0 denotes that M is positive semi-definite. Associating a dual variable Q_ia with each constraint on X_ia, the dual of (2) is [8, Section 5]:

max Σ_{ia∈S} Q_ia   s.t.   [I  −(Q ⊗ Y); −(Q ⊗ Y)′  I] ⪰ 0,   0 ≤ Q_ia ≤ c,    (3)

where Q ⊗ Y denotes the sparse matrix (Q ⊗ Y)_ia = Q_ia Y_ia for ia ∈ S and zeros elsewhere.
The positive semi-definiteness constraint in the dual (3) is equivalent to bounding the spectral norm of Q ⊗ Y, and the dual can also be written as an optimization problem subject to a bound on the spectral norm, i.e. a bound on the singular values of Q ⊗ Y:

max Σ_{ia∈S} Q_ia   s.t.   ‖Q ⊗ Y‖₂ ≤ 1,   0 ≤ Q_ia ≤ c   ∀ia ∈ S.    (4)

In typical collaborative prediction problems, we observe only a small fraction of the entries in a large target matrix. Such a situation translates to a sparse dual semi-definite program, with the number of variables equal to the number of observed entries.
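Checking feasibility of a candidate dual solution for (4) is straightforward; a sketch (our code, with S a boolean mask):

```python
import numpy as np

def dual_feasible(Q, Y, S, c):
    """Feasibility check for (4): spectral norm of Q (x) Y and box constraints on Q."""
    QY = np.where(S, Q * Y, 0.0)                    # zeros off the observed set
    spec = np.linalg.svd(QY, compute_uv=False)[0]   # largest singular value
    return bool(spec <= 1.0 and (Q[S] >= 0).all() and (Q[S] <= c).all())

Y = np.array([[1.0, -1.0], [-1.0, 1.0]])
S = np.ones_like(Y, dtype=bool)
ok = dual_feasible(np.full_like(Y, 0.3), Y, S, c=0.5)   # spectral norm 0.6 <= 1: feasible
```

For realistic problem sizes one would use a sparse matrix representation and a sparse largest-singular-value routine rather than a dense SVD.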
The prediction matrix X* minimizing (1) is part of the primal optimal solution of (2), and can be extracted from it directly. Nevertheless, it is interesting to study how the optimal prediction matrix X* can be directly recovered from a dual optimal solution Q* alone.

Recovering X* from Q*. As for linear programming, recovering a primal optimal solution directly from a dual optimal solution is not always possible for SDPs.
However, at least for the hard-margin problem (no slack) this is possible, and we describe below how an optimal prediction matrix X* can be recovered from a dual optimal solution Q* by calculating a singular value decomposition and solving linear equations. Given a dual optimal Q*, consider its singular value decomposition Q* ⊗ Y = UΛV′. Recall that all singular values of Q* ⊗ Y are bounded by one, and consider only the columns Ũ ∈ R^{n×p} of U and Ṽ ∈ R^{m×p} of V with singular value one. It can be shown [8, Section 5.3], using complementary slackness, that for some matrix R ∈ R^{p×p}, X* = ŨRR′Ṽ′ is an optimal solution to the maximum-margin matrix factorization problem (1).
When Q*_ia > 0, and assuming hard-margin constraints, i.e. no box constraints in the dual, complementary slackness dictates that X*_ia = Ũ_i RR′ Ṽ′_a = Y_ia, providing us with a linear equation on the p(p+1)/2 entries of the symmetric matrix RR′. For hard-margin matrix factorization, we can therefore recover the entries of RR′ by solving a system of linear equations, with a number of variables bounded by the number of observed entries.
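A sketch of that linear system (our construction; `solve_for_P` and the planted example are illustrative): each observed entry contributes one equation in the p(p+1)/2 free entries of the symmetric matrix P = RR′.

```python
import numpy as np

def solve_for_P(Ut, Vt, obs):
    """Least-squares fit of symmetric P with Ut[i] @ P @ Vt[a] = y for each (i, a, y)."""
    p = Ut.shape[1]
    idx = [(j, k) for j in range(p) for k in range(j, p)]   # p(p+1)/2 unknowns
    rows, rhs = [], []
    for i, a, y in obs:
        coeffs = np.outer(Ut[i], Vt[a])                     # coefficient of each P_jk
        row = [coeffs[j, k] + (coeffs[k, j] if j != k else 0.0)
               for j, k in idx]                             # fold symmetric pairs
        rows.append(row); rhs.append(y)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = np.zeros((p, p))
    for (j, k), v in zip(idx, sol):
        P[j, k] = P[k, j] = v
    return P

# Tiny consistency check: plant a symmetric P and recover it from "observations".
rng = np.random.default_rng(0)
Ut, Vt = rng.standard_normal((5, 2)), rng.standard_normal((4, 2))
P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
obs = [(i, a, Ut[i] @ P_true @ Vt[a]) for i in range(5) for a in range(4)]
P = solve_for_P(Ut, Vt, obs)
```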
Recovering specific entries. The approach described above requires solving a large system of linear equations (with as many variables as observations). Furthermore, especially when the observations are very sparse (only a small fraction of the entries in the target matrix are observed), the dual solution is much more compact than the prediction matrix: the dual involves a single number for each observed entry. It might be desirable to avoid storing the prediction matrix X* explicitly, and to calculate a desired entry X*_{i0a0}, or at least its sign, directly from the dual optimal solution Q*.
To do so, consider the modified SDP obtained by adding the constraint X_{i0a0} > 0 to the primal SDP (2). If there exists an optimal solution X* to the original SDP with X*_{i0a0} > 0, then this is also an optimal solution to the modified SDP, with the same objective value. Otherwise, the optimal solution of the modified SDP is not optimal for the original SDP, and the optimal value of the modified SDP is higher (worse) than the optimal value of the original SDP. Introducing the constraint X_{i0a0} > 0 to the primal SDP (2) corresponds to introducing a new variable Q_{i0a0} into the dual SDP (3), appearing in Q ⊗ Y (with Y_{i0a0} = 1) but not in the objective. In this modified dual, the optimal solution Q* of the original dual would always be feasible. But if X*_{i0a0} < 0 in all primal optimal solutions, then the modified primal SDP has a higher value, and so does the dual, and Q* is no longer optimal for the new dual. If Y_{i0a0} X_{i0a0} < 0 (in all optimal solutions), then the dual solution can be improved by introducing Q_{i0a0} with the sign of Y_{i0a0}.
Predictions for new users. So far, we have assumed that learning is done on the known entries in all rows. It is commonly desirable to predict entries in a new, partially observed row of Y (a new user), not included in the original training set. This essentially requires solving a "conditional" problem, where V is already known, and a new row of U is learned (the predictor for the new user) based on a new partially observed row of X.
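A minimal sketch of this conditional problem (our illustration; for simplicity it uses a ridge-regularized squared loss rather than the paper's hinge loss): with V fixed, the new user's row u is fit from that user's observed entries only.

```python
import numpy as np

def fit_new_user(V, observed_cols, observed_vals, lam=0.1):
    """Fit a new row u of U with V fixed: ridge regression on the observed entries."""
    Vo = V[observed_cols]                        # feature vectors of the rated items
    A = Vo.T @ Vo + lam * np.eye(V.shape[1])
    return np.linalg.solve(A, Vo.T @ observed_vals)

rng = np.random.default_rng(0)
V = rng.standard_normal((8, 2))                  # item factors, fixed from training
u_true = np.array([1.0, -0.5])
cols = [0, 2, 3, 5, 7]                           # items the new user rated
vals = V[cols] @ u_true                          # noiseless "ratings" for the sketch
u = fit_new_user(V, cols, vals, lam=1e-6)
preds = V @ u                                    # predictions for all items
```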
Max-norm MMMF as an SDP. The max-norm variant can also be written as an SDP; the primal takes the form:

min t + c Σ_{ia∈S} ξ_ia   s.t.   [A X; X′ B] ⪰ 0,   A_ii ≤ t, B_aa ≤ t ∀i, a,   Y_ia X_ia ≥ 1 − ξ_ia,  ξ_ia ≥ 0  ∀ia ∈ S.
5 Generalization Error Bounds
We present here generalization error bounds that hold for any target matrix Y and for a random subset of observations S, and that bound the average error across all entries in terms of the observed margin error³. The bounds hold for all target matrices Y ∈ {±1}^{n×m} and sample sizes |S| > n log n, with probability at least 1 − δ over a uniformly selected sample S of |S| entries in Y.

³ The bounds presented here are special cases of bounds for general loss functions that we present and prove elsewhere [8, Section 6.2].

To prove the bounds, we bound the Rademacher complexity of bounded trace norm and bounded max-norm matrices (i.e. the corresponding classes of matrices). The unit trace norm ball is the convex hull of outer products of unit norm vectors.
It is therefore enough to bound the Rademacher complexity of such outer products, which boils down to analyzing the spectral norm of random matrices. As a consequence of Grothendieck's inequality, the unit max-norm ball is within a factor of two of the convex hull of outer products of sign vectors. The Rademacher complexity of such outer products can be bounded by considering their cardinality.
To understand the scaling of these bounds, consider n × m matrices X = UV′ where the norms of the rows of U and V are bounded by r. The trace norm of such matrices is at most r²√(nm), and so the two bounds agree up to log-factors: the cost of allowing the norm to be low on average but not uniformly.
6 Implementation and Experiments
Ratings. In many collaborative prediction tasks, the labels are not binary, but rather are discrete "ratings" in several ordered levels (e.g. one through five "stars"). Ordinal labels can be handled by requiring each predicted value to fall between thresholds associated with its rating level. A soft-margin version of these constraints, with slack variables for the two constraints on each observed rating, corresponds to a generalization of the hinge loss which is a convex bound on the zero/one level-agreement error (ZOE) [10].
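One reading of the two-sided constraints, sketched in code (our illustration; the threshold values and the exact margin formulation are assumptions, not the paper's implementation): a value predicted for rating level r should sit above the threshold below it and beneath the threshold above it, each by a margin of one, with hinge penalties for violations.

```python
def ordinal_hinge(x, r, theta):
    """Two-sided hinge for ordinal ratings.

    theta: increasing thresholds; rating level r means theta[r-1] < x < theta[r],
    with the outermost levels unbounded on one side.
    """
    lo = 0.0 if r == 0 else max(0.0, 1.0 - (x - theta[r - 1]))
    hi = 0.0 if r == len(theta) else max(0.0, 1.0 - (theta[r] - x))
    return lo + hi

theta = [-2.0, 0.0, 2.0]                   # 4 rating levels: 0..3 (assumed values)
loss_ok  = ordinal_hinge(1.0, 2, theta)    # well inside level 2: no loss
loss_bad = ordinal_hinge(-1.0, 2, theta)   # below level 2: hinged penalty
```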
Furthermore, it is straightforward to also learn the thresholds (they appear as variables in the primal, and correspond to constraints in the dual), either as a single set of thresholds for the entire matrix, or as a separate threshold vector for each row of the matrix (each "user"). The ratings are on a discrete scale of one through five, and we experimented with both generalizations of the hinge loss above, allowing per-user thresholds. For each of the four splits, we selected the two MMMF learners with lowest CV ZOE and MAE and the two Baseline learners with lowest CV ZOE and MAE, and measured their error on the held-out test data.
Table 1: Baseline (top) and MMMF (bottom) methods and parameters that achieved the lowest cross validation error (on the training data) for each train/test split, and the error for this predictor on the test data.
7 Discussion
Learning maximum-margin matrix factorizations requires solving a sparse semi-definite program. We propose that, just as generic QP solvers do not perform well on SVM problems, special-purpose techniques taking advantage of the very simple structure of the dual (3) are necessary in order to solve large-scale MMMF problems.
Work on learning the kernel matrix restricts the kernel to a linear combination of a few "base feature spaces" (or base kernels), which represent the external information necessary to solve a single prediction problem. It is possible to combine the two approaches, seeking constrained features for multiple related prediction problems, as a way of combining external information (e.g. details of users and of items) and collaborative information. An alternate method for introducing external information into our formulation is to add to U and/or V additional fixed (non-learned) columns representing the external features.
An important limitation of the approach we have described is that the observed entries are assumed to be uniformly sampled. However, obtaining generalization error bounds in this case is much harder.
simIndex simValue paperId paperTitle
2 0.34346199 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
Author: Nathan Srebro, Noga Alon, Tommi S. Jaakkola
Abstract: We prove generalization error bounds for predicting entries in a partially observed matrix by fitting the observed entries with a low-rank matrix. In justifying the analysis approach we take to obtain the bounds, we present an example of a class of functions of finite pseudodimension such that the sums of functions from this class have unbounded pseudodimension. 1
3 0.18231337 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
4 0.13832247 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
5 0.12363402 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. 
Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. 
Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. 
We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. 
We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar.

3 Training with the Expectation-Maximization (EM) Algorithm

As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]) is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over states for each time point of every observed time series, $\gamma^k_s(i) \equiv p(\pi_i = s \mid x^k)$, and also the pairwise state posteriors, $\xi^k_{s,t}(i) \equiv p(\pi_{i-1} = s, \pi_i = t \mid x^k)$. The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters, as detailed below.

Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of $K$ observed time series, $x^k$, is given by $\mathcal{L}_p \equiv \mathcal{L} + \mathcal{P}$. $\mathcal{L}$ is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm; it is composed of the emission and state transition terms. $\mathcal{P}$ is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are:

$$\mathcal{L} \equiv \sum_{k=1}^{K}\left[\sum_{i=1}^{N} \log A_{\pi_i^k}(x_i^k \mid z) \;+\; \log p(\pi_1^k) \;+\; \sum_{i=2}^{N} \log T^k_{\pi_{i-1},\pi_i}\right] \qquad (1)$$

$$\mathcal{P} \equiv -\lambda \sum_{j=1}^{M-1} (z_{j+1} - z_j)^2 \;+\; \sum_{k=1}^{K} \log \mathcal{D}(d^k \mid \{\eta^k_v\}) \;+\; \log \mathcal{D}(s \mid \{\eta_v\}) \qquad (2)$$

where $p(\pi_1^k)$ are priors over the initial states. The first term in Equation 2 is a smoothing penalty on the latent trace, with $\lambda$ controlling the amount of smoothing. $\eta^k_v$ and $\eta_v$ are Dirichlet hyperprior parameters for the time and scale state transition probability distributions, respectively. These ensure that all non-zero transition probabilities remain non-zero. For the time state transitions, $v \in \{1, \ldots, J_\tau\}$ and $\eta^k_v$ corresponds to the pseudo-count data for the parameters $d^k_1, d^k_2, \ldots, d^k_{J_\tau}$.
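The likelihood term $\mathcal{L}$ for one series comes out of the forward pass of the Forward-Backward algorithm. A generic log-space forward pass, stated over an abstract state index $s$ (in this model, $s$ enumerates the (time state, scale state) pairs), can be sketched as follows; all function and argument names are illustrative:

```python
import numpy as np

def forward_log_likelihood(log_pi0, log_T, log_emit):
    """Forward pass of Forward-Backward: returns log p(x^k) for one series.
    log_pi0:  (S,)   log prior over initial states
    log_T:    (S, S) log transition matrix, log_T[s, t] = log p(t | s)
    log_emit: (N, S) log emission probability of each observation under each state
    """
    def lse(a):  # numerically stable log-sum-exp
        m = a.max()
        return m + np.log(np.exp(a - m).sum())

    alpha = log_pi0 + log_emit[0]
    for i in range(1, len(log_emit)):
        alpha = np.array([lse(alpha + log_T[:, t]) for t in range(len(alpha))])
        alpha = alpha + log_emit[i]
    return lse(alpha)
```

The complete penalized objective $\mathcal{L}_p$ then adds the smoothing penalty $-\lambda \sum_j (z_{j+1} - z_j)^2$ (e.g. `-lam * np.sum(np.diff(z)**2)`) and the Dirichlet log-prior terms to the sum of per-series log likelihoods.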
For the scale state transitions, $v \in \{0, 1\}$ and $\eta_v$ corresponds to the pseudo-count data for the parameters $s_0$ and $s_1$. Letting $S$ be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is:

$$\langle \mathcal{L}_p \rangle = \sum_{k=1}^{K} \sum_{s=1}^{S} \gamma^k_s(1) \log T_{0,s} \;+\; \cdots$$
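The paper's exact analytical M-step updates are not shown in this excerpt, but the role of the Dirichlet pseudo-counts $\eta_v$ can be illustrated with a standard Dirichlet-MAP multinomial re-estimate from expected transition counts (a hypothetical sketch, not the authors' stated update):

```python
import numpy as np

def map_multinomial_update(expected_counts, eta):
    """Dirichlet-MAP re-estimate of a multinomial (e.g. the d^k or s parameters)
    from expected transition counts accumulated in the E-step.
    Pseudo-counts eta > 1 keep every probability strictly positive, which is
    what the hyperpriors in the text are there to ensure."""
    w = np.asarray(expected_counts, dtype=float) + np.asarray(eta, dtype=float) - 1.0
    return w / w.sum()
```

With uniform $\eta_v > 1$, a transition that received zero expected counts in the E-step still keeps a small positive probability, so it can be revisited in later EM iterations.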
6 0.12294733 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
7 0.11438186 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
8 0.11281808 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
9 0.1009865 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
10 0.098419935 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
11 0.08577875 115 nips-2004-Maximum Margin Clustering
12 0.083532207 42 nips-2004-Computing regularization paths for learning multiple kernels
13 0.082980946 92 nips-2004-Kernel Methods for Implicit Surface Modeling
14 0.077344164 68 nips-2004-Face Detection --- Efficient and Rank Deficient
15 0.07473509 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
16 0.070944026 70 nips-2004-Following Curved Regularized Optimization Solution Paths
17 0.068424702 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
18 0.067241259 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
19 0.062290654 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
20 0.06058494 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
topicId topicWeight
[(0, -0.209), (1, 0.095), (2, 0.023), (3, 0.104), (4, -0.019), (5, -0.065), (6, -0.058), (7, 0.012), (8, 0.208), (9, 0.037), (10, -0.014), (11, -0.122), (12, 0.108), (13, 0.239), (14, 0.206), (15, 0.032), (16, -0.016), (17, 0.208), (18, -0.292), (19, -0.065), (20, 0.211), (21, 0.206), (22, -0.033), (23, 0.009), (24, 0.034), (25, 0.022), (26, 0.03), (27, 0.09), (28, 0.071), (29, -0.019), (30, -0.019), (31, 0.018), (32, 0.204), (33, 0.045), (34, 0.203), (35, -0.021), (36, -0.04), (37, 0.018), (38, -0.006), (39, -0.051), (40, -0.0), (41, 0.054), (42, -0.0), (43, 0.012), (44, -0.008), (45, 0.065), (46, -0.039), (47, 0.03), (48, -0.051), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.96794027 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
2 0.86170155 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
Author: Nathan Srebro, Noga Alon, Tommi S. Jaakkola
Abstract: We prove generalization error bounds for predicting entries in a partially observed matrix by fitting the observed entries with a low-rank matrix. In justifying the analysis approach we take to obtain the bounds, we present an example of a class of functions of finite pseudodimension such that the sums of functions from this class have unbounded pseudodimension. 1
3 0.51537001 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
4 0.43445671 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
5 0.3841868 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
Author: Dori Peleg, Ron Meir
Abstract: A novel linear feature selection algorithm is presented based on the global minimization of a data-dependent generalization error bound. Feature selection and scaling algorithms often lead to non-convex optimization problems, which in many previous approaches were addressed through gradient descent procedures that can only guarantee convergence to a local minimum. We propose an alternative approach, whereby the global solution of the non-convex optimization problem is derived via an equivalent optimization problem. Moreover, the convex optimization task is reduced to a conic quadratic programming problem for which efficient solvers are available. Highly competitive numerical results on both artificial and real-world data sets are reported. 1
6 0.38077533 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
7 0.35861284 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
8 0.33660793 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
9 0.30325392 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
10 0.29523909 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
11 0.29343212 68 nips-2004-Face Detection --- Efficient and Rank Deficient
12 0.2933909 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
13 0.28833106 138 nips-2004-Online Bounds for Bayesian Algorithms
14 0.28316113 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
15 0.28195789 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
16 0.27813452 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
17 0.27559832 94 nips-2004-Kernels for Multi--task Learning
18 0.26251069 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity
19 0.25711399 70 nips-2004-Following Curved Regularized Optimization Solution Paths
20 0.25316885 42 nips-2004-Computing regularization paths for learning multiple kernels
topicId topicWeight
[(13, 0.069), (15, 0.121), (25, 0.01), (26, 0.036), (31, 0.058), (33, 0.152), (35, 0.02), (39, 0.395), (50, 0.024), (76, 0.01), (80, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.85548794 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
2 0.84723842 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
3 0.79743856 42 nips-2004-Computing regularization paths for learning multiple kernels
Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan
Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1
4 0.65030533 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
5 0.61045134 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
6 0.60481387 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
7 0.60367745 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
8 0.59178829 124 nips-2004-Multiple Alignment of Continuous Time Series
9 0.58141977 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.58055997 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
11 0.57635564 68 nips-2004-Face Detection --- Efficient and Rank Deficient
12 0.56987554 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
13 0.56836158 125 nips-2004-Multiple Relational Embedding
14 0.56735915 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
15 0.56713086 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
16 0.55832446 82 nips-2004-Incremental Algorithms for Hierarchical Classification
17 0.5555262 178 nips-2004-Support Vector Classification with Input Data Uncertainty
18 0.55349427 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
19 0.55019373 131 nips-2004-Non-Local Manifold Tangent Learning
20 0.54544592 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection