nips nips2010 nips2010-275 knowledge-graph by maker-knowledge-mining

275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone


Source: pdf

Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu

Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We pose transductive classification as a matrix completion problem. [sent-6, score-0.535]

2 We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. [sent-8, score-0.176]

3 Our method allows for different loss functions to apply on the feature and label entries of the matrix. [sent-9, score-0.369]

4 The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. [sent-10, score-0.196]

5 In this work, we present two transductive learning methods under the novel assumption that the feature-by-item and label-by-item matrices are jointly low rank. [sent-12, score-0.245]

6 This assumption effectively couples different label prediction tasks, allowing us to implicitly use observed labels in one task to recover unobserved labels in others. [sent-13, score-0.559]

7 In fact, our methods learn in the difficult regime of multi-label transductive learning with missing data that one sometimes encounters in practice. [sent-15, score-0.41]

8 That is, each item is associated with many class labels, many of the items’ labels may be unobserved (some items may be completely unlabeled across all labels), and many features may also be unobserved. [sent-16, score-0.426]

9 Our methods build upon recent advances in matrix completion, with efficient algorithms to handle matrices with mixed real-valued features and discrete labels. [sent-17, score-0.159]

10 xn ] be the d × n feature matrix whose columns are the items. [sent-26, score-0.181]

11 Let ΩX be the index set of observed features in X, such that (i, j) ∈ ΩX if and only if xij is observed. [sent-35, score-0.257]

12 Similarly, let ΩY be the index set of observed labels in Y. [sent-36, score-0.195]

13 Our main goal is to predict the missing labels yij for (i, j) ∉ ΩY. [sent-37, score-0.543]

14 Of course, this reduces to standard transductive learning when t = 1, |ΩX| = nd (no missing features), and 1 < |ΩY| < n (some missing labels). [sent-38, score-0.602]

15 In our more general setting, as a by-product we are also interested in imputing the missing features, and de-noising the observed features, in X. [sent-39, score-0.306]

16 In a nutshell, we assume that X and Y are jointly produced by an underlying low rank matrix. [sent-43, score-0.176]

17 We then take advantage of the sparsity to fill in the missing labels and features using a modified method of matrix completion. [sent-44, score-0.461]

18 It starts from a d × n low rank “pre”-feature matrix X0, with rank(X0) ≪ min(d, n). [sent-46, score-0.278]

19 The actual feature matrix X is obtained by adding iid Gaussian noise to the entries of X0: X = X0 + ε, where εij ∼ N(0, σ²). [sent-47, score-0.398]

20 The soft labels y0_1j, . . . , y0_tj ≡ y0_j ∈ Rt of item j are produced by y0_j = Wx0_j + b, where W is a t × d weight matrix, and b ∈ Rt is a bias vector. [sent-51, score-0.245]

21 Note the combined (t + d) × n matrix [Y0; X0] is low rank too: rank([Y0; X0]) ≤ rank(X0) + 1. [sent-56, score-0.278]

22 The actual label yij ∈ {−1, 1} is generated randomly via a sigmoid function: P(yij | y0_ij) = 1/(1 + exp(−yij y0_ij)). [sent-57, score-0.562]

23 Finally, two random masks ΩX , ΩY are applied to expose only some of the entries in X and Y, and we use ω to denote the percentage of observed entries. [sent-58, score-0.281]

24 Given the partially observed features and labels as specified by X, Y, ΩX, ΩY, we would like to recover the intermediate low rank matrix [Y0; X0]. [sent-62, score-0.546]

25 The key assumption is that the (t + d) × n stacked matrix [Y0; X0] is of low rank. [sent-64, score-0.182]

26 sign(zij) = yij, ∀(i, j) ∈ ΩY; z(i+t)j = xij, ∀(i, j) ∈ ΩX. Here, Z is meant to recover [Y0; X0] by directly minimizing the rank while obeying the observed features and labels. [sent-68, score-0.63]

27 To index the corresponding element in the larger stacked matrix Z, we need to shift the row index by t to skip the part for Y0 , and hence the constraints z(i+t)j = xij . [sent-73, score-0.369]

28 Following recent work in matrix completion [3, 2], we relax rank(·) with the convex nuclear norm ‖Z‖∗ = Σ_{k=1}^{min(t+d,n)} σk(Z), where the σk's are the singular values of Z. [sent-76, score-0.465]
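As a small numerical illustration of this relaxation (our own sketch in Python/NumPy, not code from the paper), the nuclear norm is just the sum of singular values and can be evaluated directly:

```python
import numpy as np

def nuclear_norm(Z):
    """||Z||_* = sum of the singular values of Z, the convex surrogate for rank(Z)."""
    return np.linalg.svd(Z, compute_uv=False).sum()

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((22, 5)) @ rng.standard_normal((5, 400))  # rank <= 5
print(np.linalg.matrix_rank(low_rank), nuclear_norm(low_rank))
```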

29 Instead of the equality constraints in (1), we minimize a loss function cx (z(i+t)j , xij ). [sent-79, score-0.205]

30 The observed labels are of a different type than the observed features. [sent-82, score-0.22]

31 We therefore introduce another loss function cy (zij , yij ) to account for the heterogeneous data. [sent-83, score-0.349]

32 In addition to these changes, we will model the bias b either explicitly or implicitly, leading to two alternative matrix completion formulations below. [sent-85, score-0.391]

33 Here, Z corresponds to the stacked matrix [WX0; X0] instead of [Y0; X0], making it potentially lower rank. [sent-88, score-0.155]

34 The optimization problem is argmin_{Z,b} µ‖Z‖∗ + (λ/|ΩY|) Σ_{(i,j)∈ΩY} cy(zij + bi, yij) + (1/|ΩX|) Σ_{(i,j)∈ΩX} cx(z(i+t)j, xij)  (2), where µ, λ are positive trade-off weights. [sent-89, score-0.543]
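For concreteness, a minimal sketch (ours) of evaluating objective (2) for a candidate Z and b; the specific choices of log loss for cy and squared loss for cx, as well as all variable names, are our assumptions for illustration:

```python
import numpy as np

def mcb_objective(Z, b, X, Y, mask_X, mask_Y, mu, lam):
    """mu*||Z||_* + (lam/|Omega_Y|) sum cy(z_ij + b_i, y_ij) + (1/|Omega_X|) sum cx(z_(i+t)j, x_ij)."""
    t = Y.shape[0]
    nuclear = np.linalg.svd(Z, compute_uv=False).sum()
    # assumed label loss cy: logistic loss on z_ij + b_i against y_ij in {-1, +1}
    label_loss = np.logaddexp(0.0, -Y * (Z[:t, :] + b[:, None]))[mask_Y].mean()
    # assumed feature loss cx: squared loss on the feature block against the observed x_ij
    feature_loss = 0.5 * ((Z[t:, :] - X) ** 2)[mask_X].mean()
    return mu * nuclear + lam * label_loss + feature_loss
```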

35 Once the optimal Z, b are found, we recover the task-i label of item j by sign(zij + bi ), and feature k of item j by z(k+t)j . [sent-92, score-0.549]

36 Similar to how bias is commonly handled in linear classifiers, we append an additional feature with constant value one to each item. [sent-95, score-0.153]

37 Under the same label assumption y0_j = Wx0_j + b, the rows of the soft label matrix Y0 are linear combinations of the rows in [X0; 1]. [sent-97, score-0.45]

38 We then let Z correspond to the (t + d + 1) × n stacked matrix [Y0; X0; 1], by forcing its last row to be 1 (hence the name): argmin_{Z ∈ R^{(t+d+1)×n}} µ‖Z‖∗ + (λ/|ΩY|) Σ_{(i,j)∈ΩY} cy(zij, yij) + (1/|ΩX|) Σ_{(i,j)∈ΩX} cx(z(i+t)j, xij)  (3), s.t. z(t+d+1)j = 1 for all j. [sent-100, score-0.697]
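A brief sketch (ours) of the corresponding MC-1 bookkeeping: the variable is the (t + d + 1) × n stack whose last row is held at 1 rather than optimized; function and variable names are our own:

```python
import numpy as np

def init_mc1_variable(t, d, n):
    """Create the (t+d+1) x n MC-1 variable with the fixed all-ones last row."""
    Z = np.zeros((t + d + 1, n))
    Z[-1, :] = 1.0          # enforces z_{(t+d+1)j} = 1 for all j
    return Z

def project_last_row(Z):
    """Re-impose the all-ones constraint after each gradient / shrinkage update."""
    Z[-1, :] = 1.0
    return Z
```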

39 Once the optimal Z is found, we recover the task-i label of item j by sign(zij ), and feature k of item j by z(k+t)j . [sent-104, score-0.509]
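The recovery step for both formulations is then a one-liner each; a sketch (ours), where `b` is passed only for MC-b and the trailing all-ones row (if any) is excluded from the feature block:

```python
import numpy as np

def decode(Z, t, d, b=None):
    """Hard labels sign(z_ij [+ b_i]) and imputed features z_(k+t)j from the optimal Z."""
    scores = Z[:t, :] + (b[:, None] if b is not None else 0.0)
    labels = np.where(scores >= 0, 1, -1)
    features = Z[t:t + d, :]   # skips the all-ones row in the MC-1 case
    return labels, features
```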

40 One way is to let Z correspond to [Y0; X0] directly, without introducing bias b or the all-1 row, and hope nuclear norm minimization will prevail. [sent-108, score-0.169]

41 Input: initial matrix Z0 (and bias b0 for MC-b), parameters µ, λ, and step sizes τZ (and τb for MC-b). Determine µ1 > µ2 > · · · > µL = µ > 0. [sent-124, score-0.334]

42 In the shrinkage step, SτZ µ (·) is a matrix shrinkage operator. [sent-136, score-0.222]
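To our understanding, the shrinkage operator here is singular value soft-thresholding, the proximal operator of the nuclear norm; a minimal sketch (ours), with the threshold τZ µk passed in as a single number:

```python
import numpy as np

def shrink(Z, threshold):
    """S_threshold(Z) = U * diag(max(sigma - threshold, 0)) * V^T."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - threshold, 0.0)) @ Vt
```

Because singular values below the threshold are zeroed, each shrinkage step reduces the rank of the iterate, which is how the continuation over µ1 > · · · > µL gradually tightens the fit.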

43 4 Experiments We now empirically study the ability of matrix completion to perform multi-class transductive classification when there is missing data. [sent-162, score-0.727]

44 We first present a family of 24 experiments on a synthetic task by systematically varying different aspects of the task, including the rank of the problem, noise level, number of items, and observed label and feature percentage. [sent-163, score-0.555]

45 We then present experiments on two real-world datasets: music emotions and yeast microarray. [sent-164, score-0.251]

46 We then run our matrix completion algorithms using 4/5 of the observed entries, measure their performance on the remaining 1/5, and average over the five folds. [sent-169, score-0.372]

47 Since our main goal is to predict unobserved labels, we use label error as the CV performance criterion to select parameters. [sent-170, score-0.289]

48 As in [10], we consider µ values starting at σ1 ηµ, where σ1 is the largest singular value of the matrix of observed entries in [Y; X] (with the unobserved entries set to 0), and decrease µ until 10−5. [sent-173, score-0.513]

49 We initialized b0 to be all zero and Z0 to be the rank-1 approximation of the matrix of observed entries in [Y; X] (with unobserved entries set to 0) obtained by performing an SVD and reconstructing the matrix using only the largest singular value and corresponding left and right singular vectors. [sent-175, score-0.668]
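A sketch (ours) of that initialization: zero-fill the unobserved entries of [Y; X], take the leading singular triplet, and reconstruct a rank-1 matrix.

```python
import numpy as np

def rank1_init(Y, X, mask_Y, mask_X):
    """Rank-1 SVD approximation of the zero-filled observed matrix [Y; X]."""
    M = np.vstack([np.where(mask_Y, Y, 0.0), np.where(mask_X, X, 0.0)])
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0, :])
```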

50 Baselines: We compare to the following baselines, each consisting of some missing feature imputation step on X first, then using a standard SVM to predict the labels: [FPC+SVM] Matrix completion on X alone using FPC [10]. [sent-180, score-0.898]

51 [EM(k)+SVM] Expectation Maximization algorithm to impute missing X entries using a mixture of k Gaussian components. [sent-181, score-0.365]

52 As in [9], missing features, mixing component parameters, and the assignments of items to components are treated as hidden variables, which are estimated in an iterative manner to maximize the likelihood of the data. [sent-182, score-0.265]

53 [Mean+SVM] Impute each missing feature by the mean of the observed entries for that feature. [sent-183, score-0.444]
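A sketch (ours) of this baseline; scikit-learn's LinearSVC is used here as a stand-in for "a standard SVM", trained separately for each label row, and all names are our own:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mean_impute(X, mask_X):
    """Replace each unobserved entry with the mean of that feature's observed entries."""
    feat_means = np.nanmean(np.where(mask_X, X, np.nan), axis=1, keepdims=True)
    return np.where(mask_X, X, feat_means)

def mean_svm_predict(X, mask_X, Y, mask_Y):
    """Train one linear SVM per label on items with observed labels; predict the rest."""
    X_imp = mean_impute(X, mask_X)
    Y_pred = Y.copy()
    for i in range(Y.shape[0]):
        obs = mask_Y[i]
        clf = LinearSVC().fit(X_imp[:, obs].T, Y[i, obs])
        Y_pred[i, ~obs] = clf.predict(X_imp[:, ~obs].T)
    return Y_pred
```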

54 Evaluation Method: To evaluate performance, we consider two measures: transductive label error, i.e. [sent-191, score-0.364]

55 , the percentage of unobserved labels predicted incorrectly; and relative feature imputation error Σ_{ij ∉ ΩX} (x̂ij − xij)² / Σ_{ij ∉ ΩX} xij², where x̂ij is the predicted feature value. [sent-193, score-0.847]

56 In the tables below, for each parameter setting, we report the mean performance (and standard deviation in parentheses) of different algorithms over 10 random trials. [sent-194, score-0.185]
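A sketch (ours) of computing both evaluation measures on the held-out entries, matching the definitions above:

```python
import numpy as np

def transductive_label_error(Y_pred, Y_true, mask_Y):
    """Fraction of unobserved labels predicted incorrectly."""
    unobs = ~mask_Y
    return np.mean(Y_pred[unobs] != Y_true[unobs])

def relative_imputation_error(X_pred, X_true, mask_X):
    """Sum over unobserved (i,j) of (xhat_ij - x_ij)^2 divided by the sum of x_ij^2."""
    unobs = ~mask_X
    return np.sum((X_pred[unobs] - X_true[unobs]) ** 2) / np.sum(X_true[unobs] ** 2)
```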

57 We first create a rank-r matrix X0 = LR⊤, where L ∈ Rd×r and R ∈ Rn×r with entries drawn iid from N(0, 1). [sent-199, score-0.255]

58 Next, we create a weight matrix W ∈ Rt×d and bias vector b ∈ Rt , with all entries drawn iid from N (0, 10). [sent-201, score-0.329]
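Putting the pieces of this synthetic setup together, a sketch (ours) of the data generation, with labels sampled through the sigmoid model described earlier; the seed and function name are our own:

```python
import numpy as np

def make_synthetic(d, n, t, r, sigma2, omega, seed=0):
    """Low-rank X0 = L R^T, noisy X, sigmoid-sampled labels Y in {-1,+1}, random masks."""
    rng = np.random.default_rng(seed)
    X0 = rng.standard_normal((d, r)) @ rng.standard_normal((n, r)).T
    X = X0 + np.sqrt(sigma2) * rng.standard_normal((d, n))
    W = np.sqrt(10.0) * rng.standard_normal((t, d))     # entries ~ N(0, 10)
    b = np.sqrt(10.0) * rng.standard_normal((t, 1))
    Y0 = W @ X0 + b
    Y = np.where(rng.random((t, n)) < 1.0 / (1.0 + np.exp(-Y0)), 1, -1)
    mask_X = rng.random((d, n)) < omega                  # expose a fraction omega of entries
    mask_Y = rng.random((t, n)) < omega
    return X, Y, mask_X, mask_Y, X0, Y0
```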

59 Synthetic experiment results: Table 1 shows the transductive label errors, and Table 2 shows the relative feature imputation errors, on the synthetic datasets. [sent-209, score-0.891]

60 However, the imputations are not perfect, because in these particular parameter settings the ratio of the number of observed entries to the degrees of freedom needed to describe the feature matrix (i.e., r(d + n − r)) [sent-212, score-0.354]

61 is below the necessary condition for perfect matrix completion [2], and because there is some feature noise. [sent-214, score-0.425]

62 Furthermore, our CV tuning procedure selects parameters µ, λ to optimize label error, which often leads to suboptimal imputation performance. [sent-215, score-0.494]

63 Table 1: Transductive label error of six algorithms on the 24 synthetic datasets. [sent-219, score-0.261]

64 The varying parameters are feature noise σ 2 , rank(X0 ) = r, number of items n, and observed label and feature percentage ω. [sent-220, score-0.519]

65 Each cell shows the mean(standard deviation) of transductive label error (in percentage) over 10 random trials. [sent-222, score-0.407]

66 Where the tables report an imputation error of 0, both MC-b and MC-1 did achieve perfect feature imputation. [sent-520, score-0.456]

67 We believe the fact that MC-b and MC-1 can use information in Y to enhance feature imputation in X made them better than FPC+SVM. [sent-523, score-0.427]

68 Observation 2: MC-1 is the best for multi-label transductive classification, as suggested by Table 1. [sent-524, score-0.218]

69 Surprisingly, the feature imputation advantage of MC-b did not translate into classification, and FPC+SVM took second place. [sent-525, score-0.427]

70 Observation 3: The same factors that affect standard matrix completion also affect classification performance of MC-b and MC-1. [sent-526, score-0.317]

71 As the tables show, everything else being equal, less feature noise (smaller σ²), lower rank r, more items, or more observed features and labels all reduce label error. [sent-527, score-0.515]

72 Table 3 reveals that both MC methods achieve statistically significantly better label prediction and imputation performance with t = 10 than with only t = 2 (as determined by two-sample t-tests at a fixed significance level). [sent-532, score-0.581]
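For reference, a two-sample t-test of this kind can be run with SciPy as sketched below (ours); the error arrays are hypothetical numbers, and the 0.05 threshold is only an illustrative choice:

```python
import numpy as np
from scipy import stats

# hypothetical per-trial transductive label errors for t = 10 and t = 2 (10 random trials each)
errors_t10 = np.array([0.21, 0.19, 0.23, 0.20, 0.22, 0.18, 0.21, 0.20, 0.19, 0.22])
errors_t2 = np.array([0.27, 0.25, 0.28, 0.26, 0.27, 0.29, 0.26, 0.25, 0.28, 0.27])

t_stat, p_value = stats.ttest_ind(errors_t10, errors_t2, equal_var=False)
print(p_value < 0.05)  # illustrative significance threshold
```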

73 Table 2: Relative feature imputation error on the synthetic datasets. [sent-544, score-0.542]

74 The algorithm Zero+SVM is not shown because it by definition has relative feature imputation error 1. [sent-545, score-0.498]

75 Table 3: More tasks help matrix completion (ω = 10%, n = 400, r = 2, d = 20). [sent-793, score-0.345]

76 Relative feature imputation error (Table 3, continued). Table 4: Performance on the music emotions data. [sent-819, score-0.666]

77 Table 4 reports transductive label error for MC-b, MC-1, FPC+SVM, EM1+SVM, EM4+SVM, Mean+SVM, and Zero+SVM at observed percentages ω = 40%, 60%, 80%. [sent-862, score-0.407]

78 Relative feature imputation error (Table 4, continued). We vary the percentage of observed entries ω = 40%, 60%, 80%. [sent-904, score-0.729]

79 Most importantly, these results show that MC-1 is useful for this realworld multi-label classification problem, leading to the best (or statistically indistinguishable from the best) transductive error performance with 60% and 80% of the data available, and close to the best with only 40%. [sent-908, score-0.369]

80 As an oracle baseline, we give the complete feature matrix X (i.e., no indices are missing from ΩX) and the training labels in ΩY to a standard SVM, and let it predict the unspecified labels. [sent-912, score-0.335]

81 On the same random trials, for observed percentage ω = 40%, 60%, 80%, the oracle baseline achieved label error rate 22. [sent-913, score-0.396]

82 For this larger dataset, we omitted the computationally expensive EM4+SVM methods, and tuned only µ for matrix completion while fixing λ = 1. [sent-926, score-0.359]

83 Table 5 reveals that MC-b leads to statistically significantly lower transductive label error for this biological dataset. [sent-927, score-0.494]

84 Although not highlighted in the table, MC-1 is also statistically better than the SVM methods in label error. [sent-928, score-0.206]

85 In terms of feature imputation performance, the MC methods are weaker than FPC+SVM. [sent-929, score-0.427]

86 However, simultaneously predicting the missing labels and features appears to provide a large advantage to the MC methods. [sent-930, score-0.359]

87 It should be pointed out that all algorithms except Zero+SVM in fact have small but non-zero standard deviation on imputation error, despite what the fixed-point formatting in the table suggests. [sent-931, score-0.391]

88 Again, we compared these algorithms to an oracle SVM baseline with 100% observed entries in ΩX . [sent-936, score-0.267]

89 The oracle SVM approach achieves label error of 20. [sent-937, score-0.25]

90 We attribute this advantage to a combination of multi-label learning and transduction that is intrinsic to our matrix completion methods. [sent-946, score-0.376]

91 Table 5 reports transductive label error for MC-b, MC-1, FPC+SVM, EM1+SVM, Mean+SVM, and Zero+SVM at observed percentages ω = 40%, 60%, 80%. [sent-984, score-0.407]

92 Relative feature imputation error (Table 5, continued). Discussions and Future Work: We have introduced two matrix completion methods for multi-label transductive learning with missing features, which outperformed several baselines. [sent-1020, score-1.225]

93 In terms of problem formulation, our methods differ considerably from sparse multi-task learning [11, 1, 13] in that we regularize the feature and label matrix directly, without ever learning explicit weight vectors. [sent-1021, score-0.327]

94 Our methods also differ from multi-label prediction via reduction to binary classification or ranking [15], and via compressed sensing [7], which assumes sparsity in that each item has a small number of positive labels, rather than the low-rank nature of feature matrices. [sent-1022, score-0.198]

95 These methods do not naturally allow for missing features. [sent-1023, score-0.192]

96 Learning in the presence of missing data typically involves imputation followed by learning with completed data [9]. [sent-1026, score-0.579]

97 Our methods perform imputation plus learning in one step, similar to EM on missing labels and features [6], but the underlying model assumption is quite different. [sent-1027, score-0.707]

98 One future extension is to explicitly map the partial feature matrix to a partially observed polynomial (or other) kernel Gram matrix, and apply our methods there. [sent-1029, score-0.236]

99 Though such mapping proliferates the missing entries, we hope that the low-rank structure in the kernel matrix will allow us to recover labels that are nonlinear functions of the original features. [sent-1030, score-0.45]

100 Fixed point and Bregman iterative methods for matrix rank minimization. [sent-1083, score-0.251]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fpc', 0.484), ('imputation', 0.348), ('transductive', 0.218), ('completion', 0.215), ('svm', 0.208), ('yij', 0.208), ('missing', 0.192), ('zij', 0.173), ('rank', 0.149), ('label', 0.146), ('item', 0.119), ('entries', 0.118), ('xij', 0.115), ('labels', 0.11), ('matrix', 0.102), ('continuation', 0.101), ('nuclear', 0.095), ('emotions', 0.085), ('cy', 0.085), ('yeast', 0.083), ('music', 0.083), ('svd', 0.08), ('feature', 0.079), ('bias', 0.074), ('items', 0.073), ('synthetic', 0.072), ('unobserved', 0.067), ('zk', 0.066), ('cx', 0.064), ('cv', 0.064), ('oracle', 0.061), ('statistically', 0.06), ('shrinkage', 0.06), ('fixed', 0.059), ('imputing', 0.059), ('transduction', 0.059), ('percentage', 0.058), ('features', 0.057), ('ak', 0.055), ('impute', 0.055), ('observed', 0.055), ('singular', 0.053), ('stacked', 0.053), ('masks', 0.05), ('trohidis', 0.048), ('formulation', 0.048), ('rt', 0.048), ('indistinguishable', 0.048), ('mc', 0.046), ('bk', 0.046), ('recover', 0.046), ('error', 0.043), ('table', 0.043), ('tsoumakas', 0.043), ('tuned', 0.042), ('sdp', 0.041), ('classi', 0.041), ('bi', 0.04), ('row', 0.039), ('completed', 0.039), ('unspeci', 0.037), ('emmanuel', 0.037), ('xiaojin', 0.037), ('ij', 0.035), ('iid', 0.035), ('foreach', 0.035), ('elisseeff', 0.035), ('cance', 0.033), ('baseline', 0.033), ('predict', 0.033), ('subspace', 0.033), ('donald', 0.032), ('goldfarb', 0.032), ('story', 0.032), ('step', 0.031), ('argmin', 0.031), ('sign', 0.03), ('index', 0.03), ('soft', 0.03), ('heterogeneous', 0.03), ('baselines', 0.029), ('noise', 0.029), ('perfect', 0.029), ('modi', 0.029), ('tasks', 0.028), ('sizes', 0.028), ('cand', 0.028), ('relative', 0.028), ('reveals', 0.027), ('generation', 0.027), ('low', 0.027), ('afosr', 0.027), ('zoubin', 0.027), ('yj', 0.026), ('microarray', 0.026), ('loss', 0.026), ('em', 0.025), ('systematically', 0.025), ('implicitly', 0.025), ('trials', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu

Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1

2 0.24102053 94 nips-2010-Feature Set Embedding for Incomplete Data

Author: David Grangier, Iain Melvin

Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1

3 0.1828198 162 nips-2010-Link Discovery using Graph Feature Tracking

Author: Emile Richard, Nicolas Baskiotis, Theodoros Evgeniou, Nicolas Vayatis

Abstract: We consider the problem of discovering links of an evolving undirected graph given a series of past snapshots of that graph. The graph is observed through the time sequence of its adjacency matrix and only the presence of edges is observed. The absence of an edge on a certain snapshot cannot be distinguished from a missing entry in the adjacency matrix. Additional information can be provided by examining the dynamics of the graph through a set of topological features, such as the degrees of the vertices. We develop a novel methodology by building on both static matrix completion methods and the estimation of the future state of relevant graph features. Our procedure relies on the formulation of an optimization problem which can be approximately solved by a fast alternating linearized algorithm whose properties are examined. We show experiments with both simulated and real data which reveal the interest of our methodology. 1

4 0.13031918 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

Author: Umar Syed, Ben Taskar

Abstract: We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines. 1

5 0.12554112 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence

Author: Yang Wang, Greg Mori

Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth regionto-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.

6 0.11605684 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

7 0.10831907 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

8 0.10697507 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm

9 0.10552412 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

10 0.10503173 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

11 0.10448495 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

12 0.093387671 228 nips-2010-Reverse Multi-Label Learning

13 0.089798227 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

14 0.085142501 284 nips-2010-Variational bounds for mixed-data factor analysis

15 0.084425777 231 nips-2010-Robust PCA via Outlier Pursuit

16 0.079418622 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

17 0.078227103 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

18 0.075235583 124 nips-2010-Inductive Regularized Learning of Kernel Functions

19 0.074965008 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

20 0.07244315 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.216), (1, 0.067), (2, 0.059), (3, -0.074), (4, 0.057), (5, -0.029), (6, 0.014), (7, -0.027), (8, -0.104), (9, -0.104), (10, -0.007), (11, 0.061), (12, 0.207), (13, 0.109), (14, 0.131), (15, -0.034), (16, -0.066), (17, -0.048), (18, 0.077), (19, 0.083), (20, 0.006), (21, 0.083), (22, 0.001), (23, -0.237), (24, 0.173), (25, 0.088), (26, 0.009), (27, 0.022), (28, -0.107), (29, -0.017), (30, -0.009), (31, 0.021), (32, -0.067), (33, 0.052), (34, 0.014), (35, -0.007), (36, 0.108), (37, 0.051), (38, 0.055), (39, 0.14), (40, 0.022), (41, -0.1), (42, -0.078), (43, 0.083), (44, 0.068), (45, 0.009), (46, 0.032), (47, -0.058), (48, 0.03), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9507485 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu

Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1

2 0.76548368 94 nips-2010-Feature Set Embedding for Incomplete Data

Author: David Grangier, Iain Melvin

Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1

3 0.76459807 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

Author: Kaushik Mitra, Sameer Sheorey, Rama Chellappa

Abstract: Matrix factorization in the presence of missing data is at the core of many computer vision problems such as structure from motion (SfM), non-rigid SfM and photometric stereo. We formulate the problem of matrix factorization with missing data as a low-rank semidefinite program (LRSDP) with the advantage that: 1) an efficient quasi-Newton implementation of the LRSDP enables us to solve large-scale factorization problems, and 2) additional constraints such as orthonormality, required in orthographic SfM, can be directly incorporated in the new formulation. Our empirical evaluations suggest that, under the conditions of matrix completion theory, the proposed algorithm finds the optimal solution, and also requires fewer observations compared to the current state-of-the-art algorithms. We further demonstrate the effectiveness of the proposed algorithm in solving the affine SfM problem, non-rigid SfM and photometric stereo problems.

4 0.6848098 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

Author: Prateek Jain, Raghu Meka, Inderjit S. Dhillon

Abstract: Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of lowrank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem. 1

5 0.63846838 162 nips-2010-Link Discovery using Graph Feature Tracking

Author: Emile Richard, Nicolas Baskiotis, Theodoros Evgeniou, Nicolas Vayatis

Abstract: We consider the problem of discovering links of an evolving undirected graph given a series of past snapshots of that graph. The graph is observed through the time sequence of its adjacency matrix and only the presence of edges is observed. The absence of an edge on a certain snapshot cannot be distinguished from a missing entry in the adjacency matrix. Additional information can be provided by examining the dynamics of the graph through a set of topological features, such as the degrees of the vertices. We develop a novel methodology by building on both static matrix completion methods and the estimation of the future state of relevant graph features. Our procedure relies on the formulation of an optimization problem which can be approximately solved by a fast alternating linearized algorithm whose properties are examined. We show experiments with both simulated and real data which reveal the interest of our methodology. 1

6 0.60406548 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

7 0.60223281 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm

8 0.57314003 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

9 0.47789761 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

10 0.45714727 228 nips-2010-Reverse Multi-Label Learning

11 0.449839 267 nips-2010-The Multidimensional Wisdom of Crowds

12 0.44502121 151 nips-2010-Learning from Candidate Labeling Sets

13 0.44472697 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

14 0.44149444 284 nips-2010-Variational bounds for mixed-data factor analysis

15 0.43940857 212 nips-2010-Predictive State Temporal Difference Learning

16 0.42289671 231 nips-2010-Robust PCA via Outlier Pursuit

17 0.41950741 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression

18 0.40764067 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting

19 0.40104631 45 nips-2010-CUR from a Sparse Optimization Viewpoint

20 0.3969909 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.083), (17, 0.037), (26, 0.181), (27, 0.088), (30, 0.051), (35, 0.022), (45, 0.216), (50, 0.05), (51, 0.016), (52, 0.042), (60, 0.036), (77, 0.038), (78, 0.034), (90, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88835526 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

Author: Daniel Golovin, Andreas Krause, Debajyoti Ray

Abstract: We tackle the fundamental problem of Bayesian active learning with noise, where we need to adaptively select from a number of expensive tests in order to identify an unknown hypothesis sampled from a known prior distribution. In the case of noise–free observations, a greedy algorithm called generalized binary search (GBS) is known to perform near–optimally. We show that if the observations are noisy, perhaps surprisingly, GBS can perform very poorly. We develop EC2 , a novel, greedy active learning algorithm and prove that it is competitive with the optimal policy, thus obtaining the first competitiveness guarantees for Bayesian active learning with noisy observations. Our bounds rely on a recently discovered diminishing returns property called adaptive submodularity, generalizing the classical notion of submodular set functions to adaptive policies. Our results hold even if the tests have non–uniform cost and their noise is correlated. We also propose E FF ECXTIVE , a particularly fast approximation of EC 2 , and evaluate it on a Bayesian experimental design problem involving human subjects, intended to tease apart competing economic theories of how people make decisions under uncertainty. 1

same-paper 2 0.87414968 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu

Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1

3 0.85353601 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

Author: Armand Joulin, Jean Ponce, Francis R. Bach

Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1

4 0.80741102 258 nips-2010-Structured sparsity-inducing norms through submodular functions

Author: Francis R. Bach

Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.

5 0.8022967 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

Author: Surya Ganguli, Haim Sompolinsky

Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1

6 0.80031586 117 nips-2010-Identifying graph-structured activation patterns in networks

7 0.80028039 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

8 0.79834992 89 nips-2010-Factorized Latent Spaces with Structured Sparsity

9 0.7973271 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior

10 0.79677719 265 nips-2010-The LASSO risk: asymptotic results and real world examples

11 0.79564607 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication

12 0.79525065 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

13 0.79335135 63 nips-2010-Distributed Dual Averaging In Networks

14 0.79314762 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups

15 0.79303104 194 nips-2010-Online Learning for Latent Dirichlet Allocation

16 0.79285568 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior

17 0.79276448 98 nips-2010-Functional form of motion priors in human motion perception

18 0.79267627 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA

19 0.79245126 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

20 0.79188555 158 nips-2010-Learning via Gaussian Herding