nips nips2010 nips2010-287 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. [sent-3, score-0.262]
2 In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. [sent-4, score-0.798]
3 Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. [sent-5, score-0.874]
4 This new model adopts the worst-case view which arguably is more suitable for applications such as classification. [sent-6, score-0.093]
5 When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. [sent-7, score-0.175]
6 Otherwise, we take a greedy approach by finding one direction of the transformation at a time. [sent-8, score-0.073]
7 Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. [sent-9, score-0.186]
8 Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. [sent-10, score-0.321]
9 To alleviate these problems, one common approach is to perform dimensionality reduction on the data. [sent-14, score-0.262]
10 An assumption underlying many dimensionality reduction techniques is that the most useful information in many high-dimensional datasets resides in a low-dimensional latent space. [sent-15, score-0.296]
11 Principal component analysis (PCA) [8] and linear discriminant analysis (LDA) [7] are two classical dimensionality reduction methods that are still widely used in many applications. [sent-16, score-0.412]
12 PCA, as an unsupervised linear dimensionality reduction method, finds a low-dimensional subspace that preserves as much of the data variance as possible. [sent-17, score-0.286]
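As a concrete illustration of the PCA step just described, here is a minimal numpy sketch; the function and variable names are ours, not taken from the paper.

```python
import numpy as np

def pca(X, r):
    """Project an n x d data matrix X onto its top-r principal directions."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # d x d sample covariance
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :r]                # top-r eigenvectors as columns
    return Xc @ W, W                         # projected data and the basis
```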
13 On the other hand, LDA is a supervised linear dimensionality reduction method which seeks to find a low-dimensional subspace that keeps data points from different classes far apart and those from the same class as close as possible. [sent-18, score-0.406]
14 The focus of this paper is on the supervised dimensionality reduction setting like that for LDA. [sent-19, score-0.262]
15 To set the stage, we first analyze the between-class and within-class scatter measures used in conventional LDA. [sent-20, score-0.671]
16 We then establish that conventional LDA seeks to maximize the average pairwise distance between class means and minimize the average within-class pairwise distance over all classes. [sent-21, score-0.491]
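To make this average-case reading concrete, the sketch below builds the usual scatter matrices and the trace-based scatter measures for a given projection W; the helper names and the (ratio-irrelevant) normalization are our choices, not code from the paper.

```python
import numpy as np

def lda_scatter(X, y):
    """Between-class and within-class scatter matrices for data X (n x d), labels y (n,)."""
    m = X.mean(axis=0)
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - m, mc - m)   # class mean vs. global mean
        Sw += (Xc - mc).T @ (Xc - mc)              # spread around the class mean
    return Sb, Sw

def average_case_measures(W, Sb, Sw):
    """LDA's average-case between/within scatter measures for a projection W (d x r)."""
    return np.trace(W.T @ Sb @ W), np.trace(W.T @ Sw @ W)
```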
17 To put this thinking into practice, we incorporate a worst-case view to define a new between-class scatter measure as the minimum of the pairwise distances between class means, and a new within-class scatter measure as the maximum of the within-class pairwise distances over all classes. [sent-23, score-1.322]
18 Based on the new scatter measures, we propose a novel dimensionality reduction method called worst-case linear discriminant analysis (WLDA). [sent-24, score-0.874]
19 WLDA solves an optimization problem which simultaneously maximizes the worst-case between-class scatter measure and minimizes the worst-case within-class scatter measure. [sent-25, score-1.055]
20 When the number of training data points or the number of features is not very large, e.g., below 100, we propose to relax the optimization problem and formulate it as a metric learning problem. [sent-28, score-0.15]
21 In case both the number of training data points and the number of features are large, we propose a greedy approach based on the constrained concave-convex procedure (CCCP) [24, 18] to find one direction of the transformation at a time with the other directions fixed. [sent-29, score-0.098]
22 Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. [sent-30, score-0.186]
23 We perform linear dimensionality reduction by finding a transformation matrix W ∈ ℝ^{d×r}, where d is the original dimensionality and r < d is the reduced dimensionality. [sent-55, score-0.342]
24 1 Objective Function We first briefly review conventional LDA. [sent-60, score-0.164]
25 The between-class scatter matrix and within-class scatter matrix are defined as S_b = \sum_j n_j (m̄_j − m̄)(m̄_j − m̄)^T and S_w = \sum_j \sum_{x ∈ class j} (x − m̄_j)(x − m̄_j)^T, where m̄_j and n_j are the mean and size of class j and m̄ is the global mean. [sent-61, score-1.04]
26 Based on the scatter matrices, the between-class scatter measure and within-class scatter measure are defined as tr(W^T S_b W) and tr(W^T S_w W). [sent-95, score-1.521]
27 LDA seeks to find the optimal solution of W that maximizes the ratio of the between-class scatter measure to the within-class scatter measure. [sent-108, score-0.067]
28 According to this and the definition of the within-class scatter measure, we can see that LDA tries to maximize the average pairwise distance between class means {m̄_i} [sent-142, score-0.666]
29 and minimize the average within-class pairwise distance over all classes. [sent-143, score-0.104]
30 Instead of taking this average-case view, our WLDA model adopts a worst-case view which arguably is more suitable for classification applications. [sent-144, score-0.093]
31 Unlike LDA, which uses the average of the distances between each class mean and the sample mean as the between-class scatter measure, here we use the minimum of the pairwise distances between class means as the between-class scatter measure: min_{i≠j} { tr( W^T (m̄_i − m̄_j)(m̄_i − m̄_j)^T W ) }. (2) [sent-158, score-1.205]
32 Also, we define the new within-class scatter measure as max_i { tr( W^T S_i W ) }, (3) [sent-170, score-0.518]
33 which is the maximum of the average within-class pairwise distances, where S_i denotes the within-class scatter matrix of class i. [sent-175, score-0.071]
34 Similar to LDA, we define the optimality criterion of WLDA as the ratio of the between-class scatter measure to the within-class scatter measure, maximized over W subject to the orthonormality constraint W^T W = I_r. (4) [sent-176, score-1.157]
35 The orthonormality constraint in problem (4) is widely used by many existing dimensionality reduction methods. [sent-191, score-0.401]
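For a fixed orthonormal W, the worst-case measures in (2) and (3) can be evaluated directly. The sketch below is a minimal illustration under our reading of the definitions; the helper names are ours.

```python
import numpy as np

def per_class_stats(X, y):
    """Class means and within-class scatter matrices for data X (n x d), labels y (n,)."""
    means, scatters = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        means.append(mc)
        scatters.append((Xc - mc).T @ (Xc - mc))
    return means, scatters

def worst_case_measures(W, means, scatters):
    """Worst-case between/within scatter measures for a projection W (d x r)."""
    # between-class: minimum pairwise distance between projected class means, eq. (2)
    between = min(
        np.trace(W.T @ np.outer(mi - mj, mi - mj) @ W)
        for k, mi in enumerate(means) for mj in means[k + 1:]
    )
    # within-class: maximum projected within-class scatter over all classes, eq. (3)
    within = max(np.trace(W.T @ Si @ W) for Si in scatters)
    return between, within

def wlda_criterion(W, means, scatters):
    """Ratio in problem (4), to be maximized subject to W^T W = I."""
    b, w = worst_case_measures(W, means, scatters)
    return b / w
```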
36 2 Optimization Procedure Since problem (4) is not easy to optimize with respect to W, we resort to formulating this dimensionality reduction problem as a metric learning problem [22, 21, 4]. [sent-194, score-0.399]
37 Each trace term tr(W^T A W) can be rewritten as tr(A W W^T) = tr(A Σ) with Σ = W W^T, due to a property of the matrix trace that tr(AB) = tr(BA) for any matrices A and B with proper sizes. [sent-216, score-0.071]
38 The orthonormality constraint in problem (4) is non-convex with respect to W and cannot be expressed in terms of Σ. [sent-217, score-0.116]
39 Instead, we use the relaxed constraint 0 ⪯ Σ ⪯ I_d, where 0 denotes the zero vector or matrix of appropriate size and A ⪯ B means that the matrix B − A is positive semidefinite. [sent-239, score-0.092]
40 Table 1: Algorithm for solving optimization problem (5). Input: the class means {m̄_i} and the within-class scatter matrices {S_i}. [sent-274, score-0.052]
41 2: Solve the optimization problem that yields the next iterate Σ^(t) (the subproblem discussed below). [sent-292, score-0.052]
42 Break when the change falls below ε = 10^−4; Output: Σ. We now present the solution of the optimization problem in step 2. [sent-308, score-0.052]
43 It is equivalent to the following problem: minimize over Σ the worst-case within-class term max_i { tr(Σ S_i) } while keeping the worst-case between-class term min_{i≠j} { tr(Σ (m̄_i − m̄_j)(m̄_i − m̄_j)^T) } large. [sent-310, score-0.068]
44 Here max_i { tr(Σ S_i) } is a convex function of Σ because it is the maximum of several convex functions, and min_{i≠j} { tr(Σ (m̄_i − m̄_j)(m̄_i − m̄_j)^T) } is concave because it is the minimum of several linear functions. [sent-321, score-0.102]
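The sketch below is not the exact procedure of Table 1; it is a plausible surrogate step, written with CVXPY, that for a fixed trade-off λ maximizes the worst-case between-class term minus λ times the worst-case within-class term over the standard relaxation {0 ⪯ Σ ⪯ I, tr(Σ) = r} of {Σ = WW^T : W^T W = I_r}. The constraint set, the λ-based surrogate, and all names are our assumptions.

```python
import numpy as np
import cvxpy as cp

def relaxed_wlda_step(A_pairs, S_classes, r, lam):
    """One surrogate SDP step over the relaxed metric Sigma ~ W W^T.
    A_pairs: list of (m_i - m_j)(m_i - m_j)^T matrices; S_classes: per-class scatters."""
    d = S_classes[0].shape[0]
    Sigma = cp.Variable((d, d), symmetric=True)
    t = cp.Variable()   # lower bound on the between-class trace terms
    s = cp.Variable()   # upper bound on the within-class trace terms
    cons = [Sigma >> 0, Sigma << np.eye(d), cp.trace(Sigma) == r]
    cons += [cp.trace(Sigma @ A) >= t for A in A_pairs]
    cons += [cp.trace(Sigma @ S) <= s for S in S_classes]
    prob = cp.Problem(cp.Maximize(t - lam * s), cons)
    prob.solve()        # needs an SDP-capable solver, e.g. SCS
    return Sigma.value
```

Re-solving with λ set to the current ratio of the two worst-case terms gives a Dinkelbach-style iteration for the trace-ratio objective; the update actually used in Table 1 may differ.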
45 3 Optimization in Dual Form In the previous subsection, we need to solve the SDP in problem (7). [sent-511, score-0.069]
46 However, SDP is not scalable to high dimensionality d. [sent-512, score-0.158]
47 In many real-world applications to which dimensionality reduction is applied, the number of data points n is much smaller than the dimensionality d. [sent-514, score-0.287]
48 Under such circumstances, speedup can be obtained by solving the dual form of problem (4) instead. [sent-517, score-0.068]
49 Note that problem (11) is almost the same as problem (4), so we can use the same relaxation technique as above to solve it. [sent-560, score-0.092]
50 The matrix used to define the metric in the dual form is of size n × n. [sent-562, score-0.091]
51 So solving the problem in the dual form is more efficient. [sent-569, score-0.068]
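One standard way to realize this speedup, and our reading of the dual form rather than a verbatim reconstruction, is to restrict the transformation to the span of the data, W = X̃ A with X̃ the d × n matrix of centered training points, so that every trace term depends only on n × n matrices:

```python
import numpy as np

def to_dual_statistics(Xc, means, scatters):
    """Re-express the trace terms through n x n matrices when W = Xc @ A.
    Xc: d x n matrix of centered data points (columns). With Sigma_tilde = A A^T,
    tr(W^T M W) = tr(Sigma_tilde @ (Xc^T M Xc)) for any d x d matrix M."""
    A_pairs_dual = [
        Xc.T @ np.outer(mi - mj, mi - mj) @ Xc
        for k, mi in enumerate(means) for mj in means[k + 1:]
    ]
    S_dual = [Xc.T @ Si @ Xc for Si in scatters]
    return A_pairs_dual, S_dual   # feed these to the same relaxed solver, now n x n
```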
52 Here we introduce yet another optimization procedure based on a greedy approach to solve problem (4) when both n and d are large. [sent-575, score-0.103]
53 We find the first column of W by solving problem (4) where W is a vector, then find the second column of W by assuming the first column is fixed, and so on. [sent-578, score-0.098]
54 Suppose the first k − 1 columns of W are already known; then the constraint in problem (4) becomes W_{k−1}^T w_k = 0 and w_k^T w_k = 1. [sent-587, score-0.074]
55 When k = 1, W_{k−1} can be viewed as an empty matrix and the constraint W_{k−1}^T w_k = 0 vanishes. [sent-598, score-0.086]
56 At the kth step, we need to solve the following minimization problem (12) over w_k. [sent-602, score-0.091]
57 To deal with the constraint of problem (12), we relax the constraint on w_k. [sent-660, score-0.155]
58 So the objective function of problem (12) is non-convex. [sent-670, score-0.051]
59 Moreover, the second constraint in problem (12), which is the difference of two convex functions, is also non-convex. [sent-671, score-0.125]
60 In the (l + 1)th iteration of CCCP, we replace the non-convex parts of the objective function and the second constraint with their first-order approximations around the solution of the [sent-691, score-0.079]
61 lth iteration, and solve the following problem over w_k. [sent-698, score-0.091]
62 Introducing an auxiliary variable t, where ∥·∥_2 denotes the 2-norm of a vector, we can reformulate problem (13) into a second-order cone programming (SOCP) problem [12], which is more efficient than SDP. [sent-783, score-0.119]
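The CCCP/SOCP updates are not reproduced here. The sketch below is only a crude stand-in for the greedy step: it finds one new direction by projected (sub)gradient ascent on the worst-case ratio while keeping it orthogonal to the columns already found; the step size, iteration count, and all names are our assumptions.

```python
import numpy as np

def greedy_direction(A_pairs, S_classes, W_prev, iters=200, lr=0.1, seed=0):
    """Heuristic substitute for the paper's CCCP/SOCP step: one new direction w,
    orthogonal to the previously found columns W_prev (d x (k-1))."""
    d = S_classes[0].shape[0]
    rng = np.random.default_rng(seed)
    P = np.eye(d) - W_prev @ W_prev.T          # projector onto the orthogonal complement
    w = P @ rng.standard_normal(d)
    w /= np.linalg.norm(w)
    # contributions of the already-fixed columns to each trace term
    base_b = [np.trace(W_prev.T @ A @ W_prev) for A in A_pairs]
    base_w = [np.trace(W_prev.T @ S @ W_prev) for S in S_classes]
    for _ in range(iters):
        nums = [b + w @ A @ w for b, A in zip(base_b, A_pairs)]
        dens = [c + w @ S @ w for c, S in zip(base_w, S_classes)]
        p, q = int(np.argmin(nums)), int(np.argmax(dens))     # active worst cases
        num, den = nums[p], dens[q]
        grad = (2 * (A_pairs[p] @ w) * den - num * 2 * (S_classes[q] @ w)) / den ** 2
        w = P @ (w + lr * grad)                # ascent step, then restore orthogonality
        w /= np.linalg.norm(w)
    return w
```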
63 Analysis It is well known that in binary classification problems, when both classes are normally distributed with the same covariance matrix, the solution given by conventional LDA is the Bayes optimal solution. [sent-839, score-0.186]
64 Here, similar to conventional LDA, the reduced dimensionality is 1 in this binary case, so the transformation reduces to a single vector w; this gives problem (15). [sent-851, score-0.372]
65 When the two within-class scatter matrices are equal, i.e., S1 = S2, the problem degenerates to the optimization problem of conventional LDA since w^T S1 w = [sent-855, score-0.239]
66 w^T S2 w for any w, and w is the solution of conventional LDA. [sent-857, score-0.164]
67 So WLDA also gives the same Bayes optimal solution as conventional LDA. [sent-858, score-0.164]
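In this degenerate binary case the shared solution is the classical Fisher direction; a minimal numpy check, assuming S1 = S2 = S_w and S_w nonsingular:

```python
import numpy as np

def fisher_direction(m1, m2, Sw):
    """Bayes-optimal direction for two Gaussian classes with a shared covariance:
    w* is proportional to Sw^{-1} (m1 - m2); WLDA and LDA coincide here."""
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```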
68 Since the scale of w does not affect the final solution in problem (15), we simplify problem (15) to obtain problem (16). [sent-859, score-0.096]
69 Since problem (16) is to maximize a convex function, it is not a convex problem. [sent-866, score-0.149]
70 In the (l + 1)th iteration of CCCP, we need to solve the following maximization problem over w. [sent-869, score-0.096]
71 This is similar to the following property of the optimal solution in conventional LDA: w★ ∝ S_w^{-1}(m̄_1 − m̄_2). [sent-900, score-0.164]
72 The coefficients α★ and β★ are not fixed but are learned from the following dual problem. [sent-910, score-0.113]
73 Note that the first term in the objective function of problem (18) is just the scaled optimality criterion of conventional LDA when the within-class scatter matrix is taken to be S_w = S1 = S2. [sent-925, score-0.814]
74 One line of work proposed a maximum margin criterion for dimensionality reduction by changing the optimization problem of conventional LDA to max_W tr(W^T (S_b − S_w) W) under norm constraints on the columns of W. [sent-932, score-0.914]
75 The objective function has a physical meaning similar to that of LDA which favors a large between-class scatter measure and a small within-class scatter measure. [sent-937, score-1.031]
76 However, similar to LDA, the maximum margin criterion also uses the average distances to describe the between-class and within-class scatter measures. [sent-938, score-0.598]
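For reference, the maximum margin criterion sketched above reduces to an ordinary eigenvalue problem; the sketch assumes the tr(W^T (S_b − S_w) W) form with orthonormal W, which may differ in details from the cited variants.

```python
import numpy as np

def mmc_projection(Sb, Sw, r):
    """Maximum margin criterion: top-r eigenvectors of Sb - Sw maximize
    tr(W^T (Sb - Sw) W) over orthonormal W (d x r)."""
    evals, evecs = np.linalg.eigh(Sb - Sw)   # eigenvalues in ascending order
    return evecs[:, ::-1][:, :r]
```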
77 [10] proposed another maximum margin criterion for dimensionality reduction. [sent-940, score-0.231]
78 The objective function in [10] is identical to that of support vector machine (SVM) and it treats the decision function in SVM as one direction in the transformation matrix W. [sent-941, score-0.108]
79 Another work proposed a robust LDA algorithm to deal with data uncertainty in classification applications by formulating the problem as a convex problem. [sent-943, score-0.074]
80 The orthogonality constraint on the transformation matrix W has been widely used by dimensionality reduction methods, such as Foley-Sammon LDA (FSLDA) [6, 5] and orthogonal LDA [23]. [sent-946, score-0.446]
81 The orthogonality constraint can help to eliminate the redundant information in W. [sent-947, score-0.081]
82 This has been shown to be effective for dimensionality reduction. [sent-948, score-0.158]
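One common way to return from a relaxed metric Σ to an orthonormal transformation, which we assume here rather than take from the paper, is to keep the top-r eigenvectors of the learned Σ; the resulting columns satisfy W^T W = I by construction.

```python
import numpy as np

def recover_W(Sigma, r):
    """Extract an orthonormal d x r transformation from a learned metric Sigma."""
    evals, evecs = np.linalg.eigh((Sigma + Sigma.T) / 2)   # symmetrize for safety
    return evecs[:, ::-1][:, :r]                           # top-r eigenvectors
```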
83 4 Experimental Validation In this section, we evaluate WLDA empirically on some benchmark datasets and compare WLDA with several related methods, including conventional LDA, trace-ratio LDA [20], FSLDA [6, 5], and MarginLDA [11]. [sent-949, score-0.223]
84 For fair comparison with conventional LDA, we set the reduced dimensionality of each compared method to C − 1, where C is the number of classes. [sent-950, score-0.322]
85 After dimensionality reduction, we use a simple nearest-neighbor classifier to perform classification. [sent-955, score-0.158]
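The evaluation protocol as we read it: project with the learned W, then apply a 1-nearest-neighbor classifier. The reduction step is a placeholder for any of the compared methods, and scikit-learn's classifier stands in for the "simple nearest-neighbor classifier".

```python
from sklearn.neighbors import KNeighborsClassifier

def nn_error(W, X_train, y_train, X_test, y_test):
    """Average classification error of 1-NN in the reduced space defined by W (d x r)."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train @ W, y_train)
    return 1.0 - clf.score(X_test @ W, y_test)
```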
86 2 Experiments on Face and Object Datasets Dimensionality reduction methods have been widely used for face and object recognition applications. [sent-973, score-0.254]
87 Previous research found that face and object images usually lie in a low-dimensional subspace. (Table 2: Average classification errors.) [sent-974, score-0.183]
88 Fisherface (based on LDA) [2] is one representative dimensionality reduction method. [sent-1027, score-0.262]
89 We use three face databases, ORL [2], PIE [17] and AR [13], and one object database, COIL [15], in our experiments. [sent-1028, score-0.127]
90 In the AR face database, 2,600 images of 100 persons (50 men and 50 women) are used. [sent-1029, score-0.141]
91 The ORL face database contains 400 face images of 40 persons, each having 10 images. [sent-1031, score-0.229]
92 In our experiment, we choose the frontal pose from the PIE database with varying lighting and illumination conditions. [sent-1033, score-0.074]
93 The COIL database contains 1,440 grayscale images with black background for 20 objects with each object having 72 different images. [sent-1036, score-0.128]
94 In face and object recognition applications, the size of the training set is usually not very large since labeling data is very laborious and costly. [sent-1037, score-0.148]
95 To simulate this realistic situation, we randomly choose 4 images of a person or object in the database to form the training set and the remaining images to form the test set. [sent-1038, score-0.16]
96 Table 3: Average classification errors on the face and object datasets. [sent-1041, score-0.127]
97 5 Conclusion In this paper, we have presented a new supervised dimensionality reduction method by exploiting the worst-case view instead of the average-case view in the formulation. [sent-1063, score-0.328]
98 An optimal transformation for discriminant and principal component analysis. [sent-1100, score-0.195]
99 Distance metric learning for large margin nearest neighbor classification. [sent-1219, score-0.08]
100 Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. [sent-1242, score-0.127]
wordName wordTfidf (topN-words)
[('scatter', 0.485), ('wlda', 0.481), ('tr', 0.363), ('lda', 0.349), ('conventional', 0.164), ('dimensionality', 0.158), ('discriminant', 0.127), ('reduction', 0.104), ('fslda', 0.096), ('th', 0.083), ('cccp', 0.076), ('face', 0.076), ('marginlda', 0.072), ('pairwise', 0.071), ('columbia', 0.069), ('sdp', 0.068), ('pie', 0.063), ('orl', 0.063), ('vancouver', 0.061), ('british', 0.057), ('coil', 0.055), ('convex', 0.051), ('constraint', 0.051), ('object', 0.051), ('max', 0.05), ('kocsor', 0.048), ('metric', 0.046), ('min', 0.045), ('transformation', 0.045), ('dual', 0.045), ('database', 0.045), ('seeks', 0.042), ('ww', 0.042), ('orthonormality', 0.042), ('distances', 0.04), ('canada', 0.04), ('optimality', 0.04), ('criterion', 0.039), ('uci', 0.039), ('xa', 0.039), ('ar', 0.037), ('adopts', 0.036), ('vandenberghe', 0.036), ('trace', 0.036), ('lkopf', 0.035), ('classi', 0.035), ('matrix', 0.035), ('margin', 0.034), ('datasets', 0.034), ('persons', 0.033), ('distance', 0.033), ('moreover', 0.033), ('view', 0.033), ('measure', 0.033), ('images', 0.032), ('class', 0.031), ('orthogonality', 0.03), ('relax', 0.03), ('optimization', 0.029), ('illumination', 0.029), ('hong', 0.029), ('thrun', 0.029), ('objective', 0.028), ('sch', 0.028), ('greedy', 0.028), ('cone', 0.028), ('semide', 0.027), ('column', 0.025), ('editors', 0.025), ('benchmark', 0.025), ('ratio', 0.025), ('points', 0.025), ('subspace', 0.024), ('arguably', 0.024), ('maximize', 0.024), ('boyd', 0.024), ('solve', 0.023), ('principal', 0.023), ('problem', 0.023), ('widely', 0.023), ('platt', 0.023), ('table', 0.023), ('classes', 0.022), ('formulate', 0.022), ('means', 0.022), ('analyze', 0.022), ('rewrite', 0.022), ('kim', 0.021), ('laborious', 0.021), ('resize', 0.021), ('mart', 0.021), ('nez', 0.021), ('rayleigh', 0.021), ('spambase', 0.021), ('foley', 0.021), ('overton', 0.021), ('nayar', 0.021), ('nene', 0.021), ('convincingly', 0.021), ('hespanha', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 287 nips-2010-Worst-Case Linear Discriminant Analysis
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
2 0.15415436 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
3 0.11755527 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
4 0.11026749 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
Author: Issei Sato, Kenichi Kurihara, Hiroshi Nakagawa
Abstract: We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.
5 0.10698269 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
6 0.10269021 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
7 0.10212248 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
8 0.078605488 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
9 0.078048602 235 nips-2010-Self-Paced Learning for Latent Variable Models
10 0.073725037 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
11 0.07077764 280 nips-2010-Unsupervised Kernel Dimension Reduction
12 0.069467023 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
13 0.068224587 286 nips-2010-Word Features for Latent Dirichlet Allocation
14 0.061372355 138 nips-2010-Large Margin Multi-Task Metric Learning
15 0.060237348 217 nips-2010-Probabilistic Multi-Task Feature Selection
16 0.059653603 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
17 0.055645276 151 nips-2010-Learning from Candidate Labeling Sets
18 0.052742962 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
19 0.049750842 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
20 0.047280118 221 nips-2010-Random Projections for $k$-means Clustering
topicId topicWeight
[(0, 0.159), (1, 0.059), (2, 0.042), (3, -0.062), (4, -0.032), (5, 0.026), (6, 0.12), (7, -0.011), (8, -0.042), (9, -0.012), (10, 0.114), (11, 0.066), (12, 0.121), (13, 0.023), (14, 0.025), (15, 0.029), (16, -0.009), (17, -0.02), (18, -0.019), (19, 0.047), (20, 0.031), (21, -0.103), (22, -0.066), (23, 0.003), (24, -0.001), (25, -0.042), (26, 0.066), (27, 0.028), (28, -0.023), (29, -0.131), (30, -0.095), (31, -0.034), (32, 0.024), (33, -0.041), (34, 0.072), (35, 0.099), (36, -0.129), (37, -0.085), (38, 0.015), (39, -0.12), (40, -0.025), (41, -0.021), (42, 0.042), (43, -0.007), (44, -0.127), (45, -0.006), (46, 0.08), (47, 0.045), (48, -0.13), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.91918725 287 nips-2010-Worst-Case Linear Discriminant Analysis
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
2 0.56372321 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee
Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1
3 0.52542967 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
Author: Issei Sato, Kenichi Kurihara, Hiroshi Nakagawa
Abstract: We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.
4 0.52245975 194 nips-2010-Online Learning for Latent Dirichlet Allocation
Author: Matthew Hoffman, Francis R. Bach, David M. Blei
Abstract: We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
5 0.50575906 124 nips-2010-Inductive Regularized Learning of Kernel Functions
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon
Abstract: In this paper we consider the problem of semi-supervised kernel function learning. We first propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process. We also consider the problem of kernelized inductive dimensionality reduction in the semi-supervised setting. To this end, we introduce a novel method for this problem by considering a special case of our general kernel learning framework where we select the trace norm function as the regularizer. We empirically demonstrate that our framework learns useful kernel functions, improving the k-NN classification accuracy significantly in a variety of domains. Furthermore, our kernelized dimensionality reduction technique significantly reduces the dimensionality of the feature space while achieving competitive classification accuracies. 1
6 0.46461272 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
7 0.45470724 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
8 0.44487196 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
9 0.43983749 235 nips-2010-Self-Paced Learning for Latent Variable Models
10 0.43556571 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
11 0.43319947 280 nips-2010-Unsupervised Kernel Dimension Reduction
12 0.43047428 138 nips-2010-Large Margin Multi-Task Metric Learning
13 0.40330029 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
14 0.3922967 231 nips-2010-Robust PCA via Outlier Pursuit
15 0.38573301 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
16 0.363774 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
18 0.36179647 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
19 0.35954323 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
20 0.35700786 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
topicId topicWeight
[(5, 0.18), (13, 0.067), (17, 0.015), (27, 0.086), (30, 0.061), (35, 0.023), (45, 0.192), (50, 0.094), (52, 0.042), (60, 0.075), (77, 0.044), (90, 0.027)]
simIndex simValue paperId paperTitle
1 0.83451706 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
Author: Lauren Hannah, Warren Powell, David M. Blei
Abstract: In this paper we study convex stochastic optimization problems where a noisy objective function value is observed after a decision is made. There are many stochastic optimization problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state s. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel based weights and Dirichlet process based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel based weights and more generally that nonparametric estimation methods provide good solutions to otherwise intractable problems. 1
same-paper 2 0.82814604 287 nips-2010-Worst-Case Linear Discriminant Analysis
Author: Yu Zhang, Dit-Yan Yeung
Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods. 1
3 0.79598701 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
4 0.79228377 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
5 0.79035026 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
6 0.78263253 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
7 0.78174073 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
8 0.78167969 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
9 0.78048754 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
10 0.77855563 63 nips-2010-Distributed Dual Averaging In Networks
11 0.77826709 265 nips-2010-The LASSO risk: asymptotic results and real world examples
12 0.77746415 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.7771312 194 nips-2010-Online Learning for Latent Dirichlet Allocation
14 0.77665305 148 nips-2010-Learning Networks of Stochastic Differential Equations
15 0.77651262 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
16 0.77637476 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
17 0.7762717 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
18 0.77536088 155 nips-2010-Learning the context of a category
19 0.77488303 280 nips-2010-Unsupervised Kernel Dimension Reduction
20 0.77405518 158 nips-2010-Learning via Gaussian Herding