nips nips2013 nips2013-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. [sent-6, score-0.431]
2 Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. [sent-7, score-0.422]
3 Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. [sent-8, score-0.465]
4 Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. [sent-9, score-0.649]
5 The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. [sent-10, score-0.116]
6 1 Introduction Deep learning has recently been enjoying a resurgence [1, 2] due to the discovery that stage-wise pre-training can significantly improve the results of classical training methods [3–5]. [sent-11, score-0.116]
7 The advantage of latent variable models is that they allow abstract “semantic” features of observed data to be represented, which can enhance the ability to capture predictive relationships between observed variables. [sent-12, score-0.471]
8 In this way, latent variable models can greatly simplify the description of otherwise complex relationships between observed variates. [sent-13, score-0.431]
9 In unsupervised (i.e., “generative”) settings, latent variable models have been used to express feature discovery problems such as dimensionality reduction [6], clustering [7], sparse coding [8], and independent components analysis [9]. [sent-16, score-0.533]
10 More recently, such latent variable models have been used to discover abstract features of visual data invariant to low level transformations [1, 2, 4]. [sent-17, score-0.48]
11 In the supervised (i.e., “conditional”) setting, latent variable models are used to discover intervening feature representations that allow more accurate reconstruction of outputs from inputs. [sent-22, score-0.561]
12 However, latent variables also cause difficulty in this case because they impose nested nonlinearities between the input and output variables. [sent-24, score-0.485]
13 Some important examples of conditional latent learning approaches include those that seek an intervening lower dimensional representation [10], latent clustering [11], sparse feature representation [8], or invariant latent representation [1, 3, 4, 12] between inputs and outputs. [sent-25, score-1.48]
14 Despite their growing success, the difficulty of training a latent variable model remains clear: since the model parameters have to be trained concurrently with inference over latent variables, the convexity of the training problem is usually destroyed. [sent-26, score-1.206]
15 Only highly restricted models can be trained to optimality, and current deep learning strategies provide no guarantees about solution quality. [sent-27, score-0.148]
16 Meanwhile, a growing body of research has investigated reformulations of latent variable learning that are able to yield tractable global training methods in special cases. [sent-29, score-0.547]
17 Unfortunately, none of these approaches has yet been able to accommodate a non-trivial hidden layer between an input and output layer while retaining the representational capacity of an auto-encoder or RBM (e.g. [sent-33, score-0.661]
18 boosting strategies embed an intractable subproblem in these cases [15–17]). [sent-35, score-0.133]
19 Some recent work has been able to capture restricted forms of latent structure in a conditional model—namely, a single latent cluster variable [18–20]—but this remains a rather limited approach. [sent-36, score-1.0]
20 In this paper we demonstrate that more general latent variable structures can be accommodated within a tractable convex framework. [sent-37, score-0.594]
21 In particular, we show how two-layer latent conditional models with a single latent layer can be expressed equivalently in terms of a latent feature kernel. [sent-38, score-1.558]
22 This reformulation allows a rich set of latent feature representations to be captured, while allowing useful convex relaxations in terms of a semidefinite optimization. [sent-39, score-0.762]
23 Unlike [26], the latent kernel in this model is explicitly learned (nonparametrically). [sent-40, score-0.447]
24 2 Two-Layer Conditional Modeling We address the problem of training a two-layer latent conditional model in the form of Figure 1; i.e. [sent-43, score-0.569]
25 where there is a single layer of h latent variables, φ, between a layer of n input variables, x, and m output variables, y. [sent-45, score-1.04]
26 To learn the model parameters, we assume we are given t training pairs {(xj, yj)}_{j=1}^t, stacked in two matrices X = (x1, ..., xt) ∈ R^{n×t} and [sent-49, score-0.183]
27 Y = (y1, ..., yt) ∈ R^{m×t}, but the corresponding set of latent variable values Φ = (φ1, ..., φt) is not observed. [sent-55, score-0.431]
28 Figure 1: Latent conditional model xj → f1(W xj) ⇒ φj → f2(V φj) ⇒ ŷj, where φj is a latent variable, xj is an observed input vector, yj is an observed output vector, W are first layer parameters, and V are second layer parameters. [sent-59, score-1.248]
29 To formulate the training problem, we will consider two losses, L1 and L2 , that relate the input to the latent layer, and the latent to the output layer respectively. [sent-60, score-1.225]
30 For example, one can think of the losses as negative log-likelihoods in a conditional model that generates each successive layer given its predecessor; i.e. [sent-61, score-0.499]
31 (However, a loss based formulation is more flexible, since every negative log-likelihood is a loss but not vice versa.) [sent-64, score-0.162]
32 Given such a set-up, many training principles become possible. [sent-66, score-0.116]
33 For simplicity, we consider a Viterbi-based training principle where the parameters W and V are optimized with respect to an optimal imputation of the latent values Φ. [sent-67, score-0.495]
34 To do so, define the first and second layer training objectives as F1(W, Φ) = L1(WX, Φ) + (α/2)‖W‖²_F and F2(Φ, V) = L2(VΦ, Y) + (β/2)‖V‖²_F, (1) where we assume the losses are convex in their first arguments. [sent-68, score-0.704]
35 Here it is typical to assume that the losses decompose columnwise; that is, L1(Φ̂, Φ) = Σ_{j=1}^t L1(φ̂j, φj) and L2(Ẑ, Y) = Σ_{j=1}^t L2(ẑj, yj), where φ̂j is the jth column of Φ̂ and ẑj is the jth column of Ẑ respectively. [sent-69, score-0.203]
36 This follows, for example, if the training pairs (xj, yj) are assumed i.i.d. [sent-70, score-0.25]
37 These two objectives can be combined to obtain the following joint training problem: min_{W,V} min_Φ F1(W, Φ) + γ F2(Φ, V), (2) where γ > 0 is a trade-off parameter that balances the first versus second layer discrepancy. [sent-77, score-0.554]
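As a concrete illustration, the following minimal sketch evaluates the joint objective (2) for fixed W, V and a candidate boolean Φ, with the column-wise losses passed in as callables. The function name, the NumPy representation, and the exact placement of the trade-off parameters α, β, γ are assumptions made for exposition, not the authors' implementation.

```python
import numpy as np

def joint_objective(W, V, Phi, X, Y, L1, L2, alpha, beta, gamma):
    """Sketch of (2): F1(W, Phi) + gamma * F2(Phi, V), with F1, F2 as in (1).

    L1 and L2 are callables implementing the first- and second-layer losses;
    Phi is a candidate boolean latent matrix of shape (h, t).
    """
    F1 = L1(W @ X, Phi) + 0.5 * alpha * np.sum(W ** 2)   # L1(WX, Phi) + (alpha/2)||W||_F^2
    F2 = L2(V @ Phi, Y) + 0.5 * beta * np.sum(V ** 2)    # L2(V Phi, Y) + (beta/2)||V||_F^2
    return F1 + gamma * F2
```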
38 Unfortunately, (2) is not jointly convex in the unknowns W, V and Φ. [sent-78, score-0.301]
39 A key modeling question concerns the structure of the latent representation Φ. [sent-79, score-0.461]
40 As noted, the extensive literature on latent variable modeling has proposed a variety of forms for latent structure. [sent-80, score-0.891]
41 Here, we follow work on deep learning and sparse coding and assume that the latent variables are boolean, φ ∈ {0, 1}^{h×1}; an assumption that is also often made in auto-encoders [13], PFNs [27], and RBMs [5]. [sent-81, score-0.479]
42 A boolean representation can capture structures that range from a single latent clustering [11, 19, 20], by imposing the assumption that φ'1 = 1, to a general sparse code, by imposing the assumption that φ'1 = k for some small k [1, 4, 13]. [sent-82, score-0.618]
43 Observe that, in the latter case, one can control the complexity of the latent representation by imposing a constraint on the number of “active” variables k rather than directly controlling the latent dimensionality h. [sent-83, score-0.845]
44 2.1 Multi-Layer Perceptrons and Large-Margin Losses To complete a specification of the two-layer model in Figure 1 and the associated training problem (2), we need to commit to specific forms for the transfer functions f1 and f2 and the losses in (1). [sent-85, score-0.307]
45 Although it has been traditional in deep learning research to focus on exponential family conditional models (e.g. [sent-87, score-0.211]
46 Although it is common to adopt a softmax transfer for f2 in such a case, it is also useful to consider a perceptron model defined by f2(ẑ) = indmax(ẑ), where indmax(ẑ) = 1i (the vector of all 0s except a 1 in the ith position) such that ẑi ≥ ẑl for all l. [sent-93, score-0.173]
47 Therefore, for multi-class classification, we will simply adopt the standard large-margin multi-class loss [29]: L2(ẑ, y) = max(1 − y + ẑ − 1 y'ẑ). (3) [sent-94, score-0.165]
48 Intuitively, if yc = 1 is the correct label, this loss encourages the response ẑc = y'ẑ on the correct label to be a margin greater than the response ẑi on any other label i ≠ c. [sent-95, score-0.344]
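A minimal sketch of the loss (3), summed over columns per the column-wise decomposition assumed earlier; the one-hot layout of Y and the vectorized NumPy form are illustrative assumptions.

```python
import numpy as np

def multiclass_margin_loss(Z_hat, Y):
    """Large-margin multi-class loss (3), summed over the t columns.

    Z_hat: (m, t) real responses; Y: (m, t) with one-hot columns.
    """
    correct = np.sum(Y * Z_hat, axis=0)        # y'z_hat per column: response on the true label
    margins = 1.0 - Y + Z_hat - correct        # 1 - y + z_hat - 1*y'z_hat (broadcast over rows)
    return np.sum(np.max(margins, axis=0))     # max over labels, summed over examples
```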
49 Although the loss (3) has proved to be highly successful for multi-class classification problems, it is not suitable for the first layer because it assumes there is only a single target component active in any latent vector φ; i.e. [sent-97, score-0.813]
50 Therefore, we instead adopt a multi-label perceptron model for the first layer, defined by the transfer function f1(φ̂) = step(φ̂) applied componentwise to the response vector φ̂; i.e. [sent-101, score-0.177]
51 Here again, instead of using a traditional negative log-likelihood loss, we will adopt a simple large-margin loss for multi-label classification that naturally accommodates multiple binary latent classifications in parallel. [sent-104, score-0.581]
52 Although several loss formulations exist for multi-label classification [30, 31], we adopt the following: L1(φ̂, φ) = max(1 − φ + φ̂ φ'1 − 1 φ'φ̂) ≡ max((1 − φ)/(φ'1) + φ̂ − 1 φ'φ̂/(φ'1)). (4) [sent-105, score-0.165]
53 Intuitively, this loss encourages the average response on the active labels, φ'φ̂/(φ'1), to exceed the response φ̂i on any inactive label i, φi = 0, by some margin, while also encouraging the response on any active label to match the average of the active responses. [sent-106, score-0.466]
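The multi-label loss (4) admits a similar sketch; the averaging over active labels follows the intuition above, and the (h × t) boolean layout of Φ is an assumption made for illustration.

```python
import numpy as np

def multilabel_margin_loss(Phi_hat, Phi):
    """Large-margin multi-label loss (4), summed over the t columns.

    Phi_hat: (h, t) real responses; Phi: (h, t) boolean with at least one active entry per column.
    """
    k = np.sum(Phi, axis=0)                          # phi'1 per column: number of active labels
    avg_active = np.sum(Phi * Phi_hat, axis=0) / k   # phi'phi_hat / (phi'1): average active response
    margins = (1.0 - Phi) / k + Phi_hat - avg_active
    return np.sum(np.max(margins, axis=0))
```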
54 Therefore, the overall architecture we investigate embeds two nonlinear conditionals around a non-trivial latent layer. [sent-108, score-0.529]
55 3 Equivalent Reformulation The main contribution of this paper is to show that the training problem (2) has a convex relaxation that preserves sufficient structure to transcend one-layer models. [sent-110, score-0.391]
56 To demonstrate this relaxation, we first need to establish the key observation that problem (2) can be re-expressed in terms of a kernel matrix between latent representation vectors. [sent-111, score-0.538]
57 Importantly, this reformulation allows the problem to be re-expressed in terms of an optimization objective that is jointly convex in all participating variables. [sent-112, score-0.438]
58 Next, re-express the second layer objective F2 in (1) by the following. [sent-117, score-0.374]
59 For any fixed Φ, letting N = Φ'Φ, it follows that min_V F2(Φ, V) = min_{B∈Im(N)} L2(B, Y) + (β/2) tr(B N† B'). [sent-119, score-0.128]
60 However, we require L2 to be convex in its first argument to ensure a convex problem below. [sent-124, score-0.326]
61 Moreover, the term tr(B N† B') is jointly convex in N and B since it is a perspective function [32], hence the objective in (5) is jointly convex. [sent-126, score-0.399]
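A small sketch of evaluating the kernelized second-layer objective in (5) for a given B and N; forming the pseudo-inverse explicitly is for illustration only (the algorithm described later avoids it), and L2 is assumed to be a callable implementing a convex loss such as (3).

```python
import numpy as np

def second_layer_objective(B, N, Y, L2, beta):
    """Sketch of (5): L2(B, Y) + (beta/2) * tr(B N_dagger B'), for B in Im(N)."""
    N_dagger = np.linalg.pinv(N)   # Moore-Penrose pseudo-inverse of the latent kernel
    return L2(B, Y) + 0.5 * beta * np.trace(B @ N_dagger @ B.T)
```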
62 Next, we reformulate the first layer objective F1 in (1). [sent-127, score-0.374]
63 For any L1, if there exists a function L̃1 such that L1(Φ̂, Φ) = L̃1(Φ̃'Φ̂, Φ̃'Φ̃) for all Φ̂ ∈ R^{h×t} and Φ ∈ {0,1}^{h×t} such that Φ'1 = 1k, it then follows that min_W F1(W, Φ) = min_{D∈Im(Ñ)} L̃1(DK, Ñ) + (α/2) tr(D'Ñ†DK). [sent-131, score-0.128]
64 The second equality (10) follows from the representer theorem applied to ‖W‖²_F, which implies that the optimal W must be of the form W = Φ̃C̃X' for some C̃ ∈ R^{t×t} (using the fact that Φ̃ has full rank h) [28]. [sent-134, score-0.116]
65 Observe that the term tr(D'Ñ†DK) is again jointly convex in Ñ and D (also a perspective function), while it is easy to verify that L̃1(DK, Ñ) as defined in Lemma 3 below is also jointly convex in Ñ and D [32]; therefore the objective in (8) is jointly convex. [sent-136, score-0.648]
66 Consider the first layer loss; that is, assume L1 is given by the large-margin multi-label loss (4), which can be written as L1(Φ̂, Φ) = τ(Θ), such that τ(Θ) := Σ_j max(θj), (12) where we use φ̂j, φj and θj to denote the jth columns of Φ̂, Φ and Θ respectively, and θj is the argument of the max in (4) for column j. [sent-138, score-0.125]
67 For the multi-label loss L1 defined in (4), and for any fixed Φ ∈ {0,1}^{h×t} where Φ'1 = 1k, the definition of L̃1 above (in terms of τ and the augmentation Φ̃) satisfies the property that L1(Φ̂, Φ) = L̃1(Φ̃'Φ̂, Φ̃'Φ̃) for any Φ̂ ∈ R^{h×t}. [sent-140, score-0.143]
68 To do so, consider the sequence of equalities culminating in (14) and (15), where the equalities in (14) and (15) follow from the definition of τ and the fact that linear maximizations over the simplex obtain their solutions at the vertices. [sent-144, score-0.176]
69 To establish the equality between (14) and (15), note that since Φ̃ embeds the submatrix kI, for any Λ ∈ R^{h×t}_+ there must exist an Ω ∈ R^{t×t}_+ satisfying Λ = Φ̃Ω/k. [sent-145, score-0.145]
70 Therefore, the result (8) holds for the first layer loss (4), using L̃1 defined in Lemma 3. [sent-147, score-0.116]
71 For any second layer loss and any first layer loss that satisfies the assumption of Lemma 2 (for example the large-margin multi-label loss (4)), the following equivalence holds:
72 (2) = min_{Ñ : ∃Φ ∈ {0,1}^{h×t} s.t. Φ'1 = 1k, Ñ = Φ̃'Φ̃} min_{B∈Im(Ñ)} min_{D∈Im(Ñ)} L̃1(DK, Ñ) + (α/2) tr(D'Ñ†DK) + γ (L2(B, Y) + (β/2) tr(B Ñ† B')). (16) [sent-153, score-0.128]
73 Note that no relaxation has occurred thus far: the objective value of (16) matches that of (2). [sent-155, score-0.176]
74 Not only has this reformulation resulted in (2) being entirely expressed in terms of the latent kernel matrix Ñ; the objective in (16) is also jointly convex in all participating unknowns Ñ, B and D. [sent-156, score-0.885]
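Putting the two layers together, the sketch below evaluates the joint kernelized objective of (16) for fixed Ñ, B and D; the explicit pseudo-inverse, the argument conventions, and the choice of input kernel K (e.g. X'X) are assumptions for exposition, since the actual solver relaxes the constraint set and avoids forming Ñ†.

```python
import numpy as np

def kernelized_objective(N_tilde, B, D, K, Y, L1_tilde, L2, alpha, beta, gamma):
    """Sketch of the jointly convex objective in (16) for fixed N_tilde, B, D.

    L1_tilde and L2 are callables for the transformed first-layer loss and the
    second-layer loss; K is an input kernel matrix (assumed here to be X'X).
    """
    N_dagger = np.linalg.pinv(N_tilde)   # for illustration only; the real algorithm avoids this
    first = L1_tilde(D @ K, N_tilde) + 0.5 * alpha * np.trace(D.T @ N_dagger @ D @ K)
    second = L2(B, Y) + 0.5 * beta * np.trace(B @ N_dagger @ B.T)
    return first + gamma * second
```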
75 To then achieve a convex form we further relax the constraints in (16). [sent-163, score-0.163]
76 do: N_T ← arg min_{N⪰0} L(N, M_{T−1}, Λ_{T−1}), by using the boosting Algorithm 2. [sent-175, score-0.133]
77 Therefore, we need to adopt the further relaxed set N2 , which is convex. [sent-193, score-0.131]
78 Therefore, we develop an effective training algorithm that exploits problem structure to bypass the main computational bottlenecks. [sent-197, score-0.116]
79 Note that F is still convex in N by the joint convexity of (20). [sent-200, score-0.199]
80 To solve the problem in Step 3 we develop an efficient boosting procedure based on [36] that retains low rank iterates N_T while avoiding the need to determine N† when computing G(N) and ∇G(N); see Algorithm 2. [sent-205, score-0.133]
81 For example, consider the first layer objective and let G1(N) = min_D L̃1(DK, N) + (α/2) tr(D'N†DK). [sent-207, score-0.374]
82 By defining D = NC, we obtain G1(N) = min_C L̃1(NCK, N) + (α/2) tr(C'NCK), which no longer involves N† but remains convex in C; this problem can be solved efficiently after a slight smoothing of the objective [37] (e.g.
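The change of variables D = NC removes the pseudo-inverse because N N† N = N for the Moore-Penrose pseudo-inverse of any matrix, so tr(D'N†DK) = tr(C'NCK) whenever D = NC. The short check below verifies this identity numerically; all matrices here are synthetic placeholders.

```python
import numpy as np

# Numerical check of the substitution D = N C used to avoid N_dagger:
# tr(D' N_dagger D K) = tr(C' N C K) whenever D = N C, since N N_dagger N = N.
rng = np.random.default_rng(0)
t = 8
A = rng.standard_normal((t, 3))
N = A @ A.T                      # low-rank PSD latent kernel
K = rng.standard_normal((t, t))
K = K @ K.T                      # any symmetric K works for the identity
C = rng.standard_normal((t, t))
D = N @ C
lhs = np.trace(D.T @ np.linalg.pinv(N) @ D @ K)
rhs = np.trace(C.T @ N @ C @ K)
assert np.allclose(lhs, rhs)
```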
83 Applying the same substitution to the second layer yields an efficient procedure for evaluating G(N) and ∇G(N). [sent-268, score-0.133]
84 Finally note that many of the matrix-vector multiplications in this procedure can be further accelerated by exploiting the low rank factorization of N maintained by the boosting algorithm; see the Appendix for details. [sent-269, score-0.133]
85 Since max_i N_ii is convex in N, it is well known that there must exist a constant c1 > 0 such that the optimal N is also an optimal solution to min_{N⪰0} F(N) + c1 (max_i N_ii)². [sent-273, score-0.207]
86 6 Experimental Evaluation To investigate the effectiveness of the proposed relaxation scheme for training a two-layer conditional model, we conducted a number of experiments to compare learning quality against baseline methods. [sent-276, score-0.302]
87 In such a set-up, training data is divided into a labeled and unlabeled portion, where the method receives X = [Xℓ, Xu] and Yℓ, and at test time the resulting predictions Ŷu are evaluated against the held-out labels Yu. [sent-279, score-0.116]
88 Our experiments were conducted with the following common protocol: First, the data was split into a separate training and test set. [sent-283, score-0.116]
89 Then the parameters of each procedure were optimized by a three-fold cross validation on the training set. [sent-284, score-0.116]
90 For transductive procedures, the same three training sets from the first phase were used, but then combined with ten new test sets drawn from the disjoint test data (hence 30 overall) for the final evaluation. [sent-286, score-0.253]
91 We initially ran a proof of concept experiment on three binary labeled artificial data sets depicted in Figure 2 (showing data set sizes n × t) with 100/100 labeled/unlabeled training points. [sent-290, score-0.116]
92 Here the goal was simply to determine whether the relaxed two-layer training method could preserve sufficient structure to overcome the limits of a one-layer architecture. [sent-291, score-0.163]
93 In these data sets, CVX2 is easily able to capture latent nonlinearities while outperforming the locally trained LOC2. [sent-443, score-0.532]
94 Note that this advantage must be due to two-layer versus one-layer modeling, since the transductive SVM methods TSS1 and TSJ1 demonstrate no advantage over SVM1. [sent-454, score-0.137]
95 For the second group, the effectiveness of SVM1 demonstrates that only minor gains are possible via transductive or two-layer extensions, although some gains are realized. [sent-455, score-0.179]
96 Unfortunately, the convex latent clustering method TJB2 was also not competitive on any of these data sets. [sent-457, score-0.607]
97 7 Conclusion We have introduced a new convex approach to two-layer conditional modeling by reformulating the problem in terms of a latent kernel over intermediate feature representations. [sent-459, score-0.837]
98 The proposed model can accommodate latent feature representations that go well beyond a latent clustering, extending current convex approaches. [sent-460, score-1.004]
99 A semidefinite relaxation of the latent kernel allows a reasonable implementation that is able to demonstrate advantages over single-layer models and local training methods. [sent-461, score-0.675]
100 From a deep learning perspective, this work demonstrates that trainable latent layers can be expressed in terms of reproducing kernel Hilbert spaces, while large margin methods can be usefully applied to multi-layer prediction architectures. [sent-462, score-0.652]
wordName wordTfidf (topN-words)
[('latent', 0.379), ('layer', 0.31), ('nii', 0.225), ('dk', 0.177), ('tr', 0.176), ('convex', 0.163), ('transductive', 0.137), ('boosting', 0.133), ('training', 0.116), ('losses', 0.115), ('relaxation', 0.112), ('nt', 0.106), ('coil', 0.105), ('minn', 0.105), ('deep', 0.1), ('indmax', 0.096), ('pfns', 0.096), ('semide', 0.096), ('kw', 0.093), ('jointly', 0.086), ('adopt', 0.084), ('rh', 0.083), ('loss', 0.081), ('usps', 0.081), ('reformulation', 0.078), ('rbm', 0.074), ('conditional', 0.074), ('reformulating', 0.074), ('representer', 0.074), ('bn', 0.071), ('kv', 0.07), ('lbfgs', 0.07), ('diag', 0.07), ('letter', 0.069), ('kernel', 0.068), ('cifar', 0.067), ('yj', 0.067), ('mt', 0.066), ('clustering', 0.065), ('rg', 0.065), ('nonlinearities', 0.065), ('bht', 0.064), ('min', 0.064), ('objective', 0.064), ('margin', 0.063), ('augmentation', 0.062), ('rbms', 0.06), ('relaxations', 0.059), ('architecture', 0.059), ('joulin', 0.057), ('nij', 0.057), ('response', 0.056), ('ht', 0.055), ('classi', 0.054), ('mnist', 0.054), ('stdev', 0.052), ('embeds', 0.052), ('multilabel', 0.052), ('unknowns', 0.052), ('equivalence', 0.052), ('admm', 0.052), ('softmax', 0.052), ('variable', 0.052), ('establish', 0.051), ('lemma', 0.051), ('xor', 0.049), ('transformations', 0.049), ('im', 0.049), ('unfortunately', 0.048), ('trained', 0.048), ('imposing', 0.047), ('dropping', 0.047), ('relaxed', 0.047), ('participating', 0.047), ('intervening', 0.047), ('representations', 0.046), ('jth', 0.044), ('label', 0.044), ('maxi', 0.044), ('lemmas', 0.043), ('concurrently', 0.043), ('active', 0.043), ('demonstrates', 0.042), ('equality', 0.042), ('modeling', 0.042), ('rounding', 0.042), ('output', 0.041), ('preserving', 0.041), ('capture', 0.04), ('transfers', 0.04), ('representation', 0.04), ('boxes', 0.039), ('nonlinear', 0.039), ('forms', 0.039), ('feature', 0.037), ('transfer', 0.037), ('proportions', 0.037), ('traditional', 0.037), ('remains', 0.037), ('convexity', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
2 0.25302657 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.21123239 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
4 0.20171933 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
5 0.1544304 148 nips-2013-Latent Maximum Margin Clustering
Author: Guang-Tong Zhou, Tian Lan, Arash Vahdat, Greg Mori
Abstract: We present a maximum margin framework that clusters data using latent variables. Using latent representations enables our framework to model unobserved information embedded in the data. We implement our idea by large margin learning, and develop an alternating descent algorithm to effectively solve the resultant non-convex optimization problem. We instantiate our latent maximum margin clustering framework with tag-based video clustering tasks, where each video is represented by a latent tag model describing the presence or absence of video tags. Experimental results obtained on three standard datasets show that the proposed method outperforms non-latent maximum margin clustering as well as conventional clustering approaches. 1
6 0.15018533 5 nips-2013-A Deep Architecture for Matching Short Texts
7 0.1323366 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
8 0.13161258 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
9 0.13023511 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
10 0.12464756 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
11 0.12300334 149 nips-2013-Latent Structured Active Learning
12 0.12145459 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
13 0.12118938 211 nips-2013-Non-Linear Domain Adaptation with Boosting
14 0.11633131 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
15 0.11440139 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
16 0.11406983 171 nips-2013-Learning with Noisy Labels
17 0.11381154 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
18 0.11378286 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
19 0.11346996 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
20 0.10859823 158 nips-2013-Learning Multiple Models via Regularized Weighting
topicId topicWeight
[(0, 0.314), (1, 0.136), (2, -0.055), (3, -0.058), (4, 0.136), (5, -0.132), (6, -0.06), (7, 0.089), (8, 0.073), (9, -0.121), (10, 0.099), (11, 0.0), (12, -0.022), (13, -0.013), (14, 0.077), (15, 0.0), (16, -0.005), (17, 0.023), (18, 0.082), (19, -0.139), (20, -0.038), (21, -0.021), (22, 0.245), (23, 0.031), (24, 0.096), (25, -0.129), (26, -0.138), (27, -0.059), (28, 0.049), (29, 0.037), (30, -0.065), (31, -0.037), (32, 0.039), (33, -0.046), (34, 0.063), (35, 0.078), (36, -0.063), (37, -0.065), (38, -0.035), (39, 0.127), (40, 0.043), (41, 0.001), (42, 0.039), (43, -0.083), (44, -0.04), (45, -0.009), (46, 0.013), (47, 0.048), (48, -0.052), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.97037369 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
2 0.7690227 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
3 0.73602474 85 nips-2013-Deep content-based music recommendation
Author: Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen
Abstract: Automatic music recommendation has become an increasingly relevant problem in recent years, since a lot of music is now sold and consumed digitally. Most recommender systems rely on collaborative filtering. However, this approach suffers from the cold start problem: it fails when no usage data is available, so it is not effective for recommending new and unpopular songs. In this paper, we propose to use a latent factor model for recommendation, and predict the latent factors from music audio when they cannot be obtained from usage data. We compare a traditional approach using a bag-of-words representation of the audio signals with deep convolutional neural networks, and evaluate the predictions quantitatively and qualitatively on the Million Song Dataset. We show that using predicted latent factors produces sensible recommendations, despite the fact that there is a large semantic gap between the characteristics of a song that affect user preference and the corresponding audio signal. We also show that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach. 1
4 0.70213646 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests
Author: Yacine Jernite, Yonatan Halpern, David Sontag
Abstract: We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM. 1
5 0.67908233 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
6 0.66678143 148 nips-2013-Latent Maximum Margin Clustering
7 0.66314995 251 nips-2013-Predicting Parameters in Deep Learning
8 0.65939736 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification
9 0.65785372 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
10 0.63562429 5 nips-2013-A Deep Architecture for Matching Short Texts
11 0.63020295 294 nips-2013-Similarity Component Analysis
12 0.60800242 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
13 0.60338908 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
14 0.59923267 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
15 0.59449518 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
16 0.57464796 244 nips-2013-Parametric Task Learning
17 0.56904596 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
18 0.55581629 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
19 0.53528202 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
20 0.53451025 160 nips-2013-Learning Stochastic Feedforward Neural Networks
topicId topicWeight
[(2, 0.014), (16, 0.03), (33, 0.189), (34, 0.159), (41, 0.03), (49, 0.058), (53, 0.129), (56, 0.119), (70, 0.04), (85, 0.039), (89, 0.032), (93, 0.077), (95, 0.03)]
simIndex simValue paperId paperTitle
1 0.97436011 205 nips-2013-Multisensory Encoding, Decoding, and Identification
Author: Aurel A. Lazar, Yevgeniy Slutskiy
Abstract: We investigate a spiking neuron model of multisensory integration. Multiple stimuli from different sensory modalities are encoded by a single neural circuit comprised of a multisensory bank of receptive fields in cascade with a population of biophysical spike generators. We demonstrate that stimuli of different dimensions can be faithfully multiplexed and encoded in the spike domain and derive tractable algorithms for decoding each stimulus from the common pool of spikes. We also show that the identification of multisensory processing in a single neuron is dual to the recovery of stimuli encoded with a population of multisensory neurons, and prove that only a projection of the circuit onto input stimuli can be identified. We provide an example of multisensory integration using natural audio and video and discuss the performance of the proposed decoding and identification algorithms. 1
same-paper 2 0.91824055 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
3 0.9008553 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
4 0.89972633 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.89637065 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.89628518 64 nips-2013-Compete to Compute
7 0.89550984 294 nips-2013-Similarity Component Analysis
8 0.89546168 251 nips-2013-Predicting Parameters in Deep Learning
9 0.89511591 99 nips-2013-Dropout Training as Adaptive Regularization
10 0.89483136 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
11 0.89455813 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
12 0.89438176 137 nips-2013-High-Dimensional Gaussian Process Bandits
13 0.89410484 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
14 0.89137679 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
15 0.89075303 183 nips-2013-Mapping paradigm ontologies to and from the brain
16 0.89024037 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
17 0.89005816 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.88936502 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
19 0.88884807 301 nips-2013-Sparse Additive Text Models with Low Rank Background
20 0.88882327 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data