nips nips2004 nips2004-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). [sent-9, score-0.466]
2 We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. [sent-10, score-0.393]
3 This is potentially useful for two distinct purposes: a. [sent-11, score-0.07]
4 As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. [sent-12, score-0.203]
5 We present several examples to illustrate the practical usefulness of our method. [sent-14, score-0.165]
6 1 Introduction In semi-supervised learning, we assume we have a sample (xi, yi, si), i = 1, . . . , n, of i.i.d. draws from a joint distribution on (X, Y, S), where: [sent-15, score-0.298]
7 • xi ∈ Rp are p-vectors of features. [sent-18, score-0.078]
8 • yi is a label, or response (yi ∈ R for regression, yi ∈ {0, 1} for 2-class classification). [sent-19, score-0.38]
9 • si ∈ {0, 1} is a “labeling indicator”, that is, yi is observed if and only if si = 1, while xi is observed for all i. [sent-20, score-0.545]
10 Our goal is to model this unknown dependence: l(x, y) = Pr(S = 1|x, y) (1) Note that the dependence on y (which is unobserved when S = 0) prevents us from using standard supervised modeling approaches to learn l. [sent-23, score-0.306]
11 We show here that we can use the whole data-set (labeled+unlabeled data) to obtain estimates of this probability distribution within a parametric family of distributions, without needing to “impute” the unobserved responses. [sent-24, score-0.227]
12 2 We believe this setup is of significant practical interest. [sent-25, score-0.045]
13 Here are a couple of examples of realistic situations: 1. [sent-26, score-0.07]
14 The problem of learning from positive examples and unlabeled data is of significant interest in document topic learning [4, 6, 8]. [sent-27, score-0.285]
15 These labeling probabilities may not be uniform within each class, and may depend on the features as well. [sent-29, score-0.102]
16 Our methods allow us to infer these labeling probabilities by utilizing the unlabeled data. [sent-30, score-0.446]
17 2. Consider a satisfaction survey, where clients of a company are requested to report their level of satisfaction, but they can choose whether or not they do so. [sent-32, score-0.272]
18 It is reasonable to assume that their willingness to report their satisfaction depends on their actual satisfaction level. [sent-33, score-0.725]
19 Using our methods, we can infer the dependence of the reporting probability on the actual satisfaction by utilizing the unlabeled data, i.e., the customers who did not respond. [sent-34, score-0.736]
20 Being able to infer the labeling mechanism is important for two distinct reasons. [sent-37, score-0.33]
21 First, it may be useful for “de-biasing” the results of supervised learning, which uses only the labeled examples. [sent-38, score-0.203]
22 The use of this for maximum likelihood estimation is well established in the literature as a method for correcting sampling bias (of which semi-supervised learning is an example) [10]. [sent-42, score-0.379]
23 We can also use the learned mechanism to post-adjust the probabilities from a probability estimation method such as logistic regression to attain “unbiasedness” and consistency [11]. [sent-43, score-0.442]
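As an illustration of such a post-adjustment, here is a minimal sketch for the two-class, feature-independent case. It is not the exact procedure of [11]; the function name is ours, and p1 = P(S=1|Y=1), p0 = P(S=1|Y=0) are assumed known or separately estimated:

```python
import numpy as np

def debias_probabilities(q, p1, p0):
    """Recover P(Y=1|x) from probabilities fit on labeled data only.

    q  : array of P(Y=1 | x, S=1), e.g. from logistic regression on labeled data
    p1 : P(S=1 | Y=1);  p0 : P(S=1 | Y=0)
    """
    # Bayes: P(Y=1|x, S=1) is proportional to p1 * P(Y=1|x), so dividing out
    # the sampling probabilities and renormalizing inverts the label bias.
    q = np.asarray(q)
    w1 = q / p1
    w0 = (1.0 - q) / p0
    return w1 / (w1 + w0)
```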
24 Second, understanding the labeling mechanism may be an interesting and useful learning task in itself. [sent-44, score-0.313]
25 Consider, for example, the “satisfaction survey” scenario described above. [sent-45, score-0.064]
26 Understanding the way in which satisfaction affects the customers’ willingness to respond to the survey can be used to get a better picture of overall satisfaction and to design better future surveys, regardless of any supervised learning task which models the actual satisfaction. [sent-46, score-0.927]
27 Our approach is described in section 2, and is based on a method of moments. [sent-47, score-0.038]
28 Observe that for every function of the features g(x), we can get an unbiased estimate of its mean Eg(x) as (1/n) Σ_{i=1}^n g(xi). [sent-48, score-0.231]
29 We show that if we know the underlying label sampling mechanism l(x, y) we can get a different unbiased estimate of Eg(x), which uses only the labeled examples, weighted by 1/l(x, y). [sent-49, score-0.837]
30 We suggest inferring the unknown function l(x, y) by requiring that we get identical estimates of Eg(x) using both approaches. [sent-50, score-0.318]
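To make the identity concrete, here is a minimal numpy sketch of the two estimates; the names are illustrative, and l_hat stands in for a candidate sampling model:

```python
import numpy as np

def moment_gap(g, x, y, s, l_hat):
    """Gap between the two unbiased estimates of E g(x).

    g     : scalar function of a feature vector
    x     : (n, p) features;  y : responses (used only where s == 1)
    s     : 0/1 labeling indicators
    l_hat : candidate sampling model, l_hat(xi, yi) ~ P(S=1 | xi, yi)
    """
    n = len(s)
    # estimate 1: plain average over all n points
    all_points = sum(g(xi) for xi in x) / n
    # estimate 2: labeled points only, each reweighted by 1 / l_hat(xi, yi)
    labeled = sum(g(xi) / l_hat(xi, yi)
                  for xi, yi, si in zip(x, y, s) if si == 1) / n
    return all_points - labeled   # ~0 in expectation at the true l
```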
31 We illustrate our method’s implementation on the California Housing data-set in section 3. [sent-51, score-0.05]
32 2 The method Let g(x) be any function of our features. [sent-54, score-0.038]
33 We construct two different unbiased estimates of Eg(x), one based on all n data points and one based on labeled examples only, assuming P (S = 1|x, y) is known. [sent-55, score-0.439]
34 Then, our method uses the equality in expectation of the two estimates to infer P (S = 1|x, y). [sent-56, score-0.236]
35 We choose k representative functions g1 (x), . . . , gk (x), and solve the resulting k equations to get an estimate of θ. [sent-71, score-0.437]
36 Many practical and theoretical considerations arise when we consider what “good” choices of the representative functions g1 (x), . . . , gk (x) are. [sent-72, score-0.089]
37 Qualitatively, we would like to achieve the standard desirable properties of inverse problems: uniqueness, stability, and robustness. [sent-76, score-0.147]
38 We want the resulting equations to have a unique “correct” solution. [sent-77, score-0.131]
39 We want our functions to have low variance so the inaccuracy in (3) is minimal, and we want them to be “different” from each other to get a stable solution in the k-dimensional space. [sent-78, score-0.126]
40 It is of course much more difficult to give concrete quantitative criteria for selecting the functions in practical situations. [sent-79, score-0.086]
41 What we can do in practice is evaluate how stable the results we get are. [sent-80, score-0.126]
42 A second set of considerations in selecting these functions is the computational one: can we even solve the resulting inverse problems with a reasonable computational effort? [sent-82, score-0.286]
43 These questions are sometimes inter-related with the choice of gk (x). [sent-85, score-0.163]
44 Suppose we wish to solve a set of non-linear equations for θ: hk (θ) = Σ_{si=1} gk (xi ) / pθ (xi , yi ) − Σi gk (xi ) = 0, k = 1, . . . , K (4). [sent-86, score-0.517]
45 The solution of (4) is similar to arg minθ h(θ), where h(θ) = Σk hk (θ)² (5). Notice that every solution to (4) minimizes (5), but there may be local minima of (5) that are not solutions to (4). [sent-89, score-0.231]
46 Hence simply applying a Newton-Raphson method to (5) is not a good idea: if we have a sufficiently good initial guess about the solution, the Newton-Raphson method converges quadratically fast; however, it can also fail to converge if no root exists nearby. [sent-90, score-0.076]
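One robust alternative, sketched below under assumed interfaces for gs (a list of representative functions) and p_theta (the parametric sampling model, returning pθ(xi, yi)), is to hand the residuals hk(θ) of (4) to a bounded trust-region least-squares solver rather than iterating Newton-Raphson; this is an illustrative option, not the paper's prescribed solver:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, gs, x, y, s, p_theta):
    """h_k(theta) of equation (4), one residual per representative g_k."""
    res = []
    for g in gs:
        reweighted = sum(g(xi) / p_theta(theta, xi, yi)
                         for xi, yi, si in zip(x, y, s) if si == 1)
        res.append(reweighted - sum(g(xi) for xi in x))
    return np.array(res)

# A bounded least-squares solve of (5), keeping p_theta away from 0:
# theta_hat = least_squares(residuals, theta0,
#                           args=(gs, x, y, s, p_theta),
#                           bounds=(lower, upper)).x
```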
47 In some cases we can employ simpler methods, since the equations we get can be manipulated algebraically to give more “friendly” formulations. [sent-96, score-0.205]
48 2.1 Examples of simplified calculations We consider two situations where we can use algebra to simplify the solution of the equations our method gives. [sent-99, score-0.164]
49 The first is the obvious application to two-class classification, where the label sampling mechanism depends on the class label only. [sent-100, score-0.602]
50 Our method then reduces to the one suggested by [11]. [sent-101, score-0.083]
51 The second is a more involved regression scenario, with a logistic dependence between the sampling probability and the actual label. [sent-102, score-0.513]
52 First, consider a two-class classification scenario, where the sampling mechanism is independent of x: P (S = 1|x, y) = p1 if y = 1, and p0 if y = 0. Then we need two functions of x to “de-bias” our classes. [sent-103, score-0.44]
53 One natural choice is g(x) = 1, which implies we are simply trying to invert the sampling probabilities. [sent-104, score-0.252]
54 Assume we use one of the features g(x) = xj as our second function. [sent-105, score-0.057]
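With g(x) = 1 and g(x) = xj, the two moment equations are linear in u1 = 1/p1 and u0 = 1/p0, so they can be solved directly. A sketch under those assumptions (only labeled yi are touched):

```python
import numpy as np

def estimate_p1_p0(x_j, y, s):
    """Solve for p1 = P(S=1|Y=1), p0 = P(S=1|Y=0) using g(x) in {1, x_j}.

    With u1 = 1/p1 and u0 = 1/p0, each moment equation reads
      u1 * sum_{s=1, y=1} g(x_i) + u0 * sum_{s=1, y=0} g(x_i) = sum_i g(x_i),
    a 2x2 linear system.  y is only inspected where s == 1.
    """
    lab1 = (s == 1) & (y == 1)
    lab0 = (s == 1) & (y == 0)
    A = np.array([[lab1.sum(),      lab0.sum()],
                  [x_j[lab1].sum(), x_j[lab0].sum()]])
    b = np.array([len(s), x_j.sum()])
    u1, u0 = np.linalg.solve(A, b)
    return 1.0 / u1, 1.0 / u0
```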
55 However, the decomposition that equations (7) and (8) offer allows us to solve them by searching first over b to solve (7), then plugging the result into (8) to get an estimate of a. [sent-109, score-0.343]
56 It contains 20640 observations of log(median house price) throughout California regions. [sent-113, score-0.112]
57 The eight features are: median income, housing median age, total rooms, total bedrooms, population, households, latitude and longitude. [sent-114, score-0.53]
58 Of the training data, we hide some of the labels stochastically, based on the “label sampling” model: P (S = 1|y) = logit−1 (1.5 (y − ȳ)) (9). [sent-116, score-0.132]
59 This scheme results in having 6027 labeled training examples, 9372 training examples with the labels removed, and 5241 test examples. [sent-118, score-0.262]
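A sketch of this label-hiding scheme; since the constants in (9) were partly garbled in extraction, the form 1.5 · (y − ȳ) is assumed here, consistent with the correct value b = 1.5 discussed below:

```python
import numpy as np

def hide_labels(y, b=1.5, seed=0):
    """Keep each label with probability logit^{-1}(b * (y - mean(y)))."""
    rng = np.random.default_rng(seed)
    eta = b * (y - y.mean())
    p_keep = 1.0 / (1.0 + np.exp(-eta))               # inverse logit
    return (rng.random(len(y)) < p_keep).astype(int)  # s_i indicators
```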
60 We use equations (7,8) to estimate â, b̂ based on each one of the 8 features. [sent-119, score-0.08]
61 In Figure 1 we display the value of Σ_{si=1} exp(−b yi )(x̄0j − xij ) for a range of possible values of b. [sent-121, score-0.169]
62 We observe that all features give us a zero crossing around the correct value of 1.5. [sent-122, score-0.196]
63 Since the true slope is b = 1.5, we expect to observe “zero crossings” around that value, which we indeed observe on all 8 graphs. [sent-126, score-0.116]
64 • Find b̂ by minimizing | Σ_{si=1} exp(−b yi )(x̄0j − xij ) | over the range b ∈ [0, 3]. [sent-127, score-0.355]
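A sketch of this two-step recipe, assuming the logistic sampling model P(S = 1|y) = logit⁻¹(a + b·y), so that 1/l(y) = 1 + exp(−a − b·y); dividing the g(x) = xj moment equation by the g(x) = 1 equation eliminates a and leaves the zero-crossing criterion above, with x̄0j taken to be the mean of feature j over the unlabeled points:

```python
import numpy as np

def estimate_a_b(x_j, y, s, b_grid=None):
    """Estimate (a, b) in P(S=1|y) = logit^{-1}(a + b*y) from feature j."""
    if b_grid is None:
        b_grid = np.linspace(0.0, 3.0, 601)
    lab = s == 1
    xbar0 = x_j[~lab].mean()       # mean of feature j over unlabeled points
    crit = [abs(np.sum(np.exp(-b * y[lab]) * (xbar0 - x_j[lab])))
            for b in b_grid]
    b_hat = b_grid[int(np.argmin(crit))]
    # g(x) = 1 equation: sum_{s=1} (1 + exp(-a - b*y_i)) = n, so
    # exp(-a) = (#unlabeled) / sum_{s=1} exp(-b_hat * y_i)
    a_hat = -np.log((~lab).sum() / np.sum(np.exp(-b_hat * y[lab])))
    return a_hat, b_hat
```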
65 The table also shows the results of using these estimates to “de-bias” the prediction model, i.e. [sent-129, score-0.178]
66 once we have â, b̂ we use them to calculate P̂ (S = 1|y) and use the inverse of these estimated probabilities as weights in a least squares analysis of the labeled data. [sent-131, score-0.275]
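The de-biasing step itself is then an inverse-probability-weighted least squares fit. A minimal sketch follows; an equivalent route would be scikit-learn's LinearRegression fit with sample_weight:

```python
import numpy as np

def debiased_least_squares(X, y, s, a_hat, b_hat):
    """Weighted LS on the labeled data, weights = 1 / P_hat(S=1|y)."""
    lab = s == 1
    p_hat = 1.0 / (1.0 + np.exp(-(a_hat + b_hat * y[lab])))
    w = 1.0 / p_hat
    Xl = np.column_stack([np.ones(lab.sum()), X[lab]])  # add an intercept
    Xw = Xl * w[:, None]
    # normal equations of weighted least squares: (X'WX) beta = X'Wy
    return np.linalg.solve(Xw.T @ Xl, Xw.T @ y[lab])
```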
67 Most of the 8 features give reasonable estimates, and the prediction models built using the resulting weighting schemes perform comparably to the one built using the “correct” weights. [sent-133, score-0.451]
68 They generally attain MSE about 20% smaller than that of the non-weighted model built without regard to the label sampling mechanism. [sent-134, score-0.461]
69 The stability of the resulting estimates is related to the “reasonableness” of the selected g(x) functions. [sent-135, score-0.247]
70 To illustrate that, we also tried the function g(x) = x3 · x4 · x5 /(x1 · x2 ) (still in combination with the identity function, so we can use (7,8)). [sent-136, score-0.05]
71 Clearly these numbers are way outside the reasonable range of the b̂ results in Table 1. [sent-140, score-0.044]
72 Thus, a few “outliers” dominate the two estimates of E(g(x)) in (3) which we use to estimate a, b. [sent-142, score-0.137]
73 4 Related work The surge of interest in semi-supervised learning in recent years has been mainly in the context of text classification ([4, 6, 8] are several examples of many). [sent-143, score-0.111]
74 [Table 1: Summary of estimates of the sampling mechanism using each of the 8 features (median income, housing median age, total rooms, total bedrooms, population, households, latitude, longitude), with the estimated b for each feature and the performance of the corresponding de-biased prediction model, plus (no weighting) and (true sampling model) reference rows; the numeric entries were garbled in extraction.] [sent-144, score-1.744]
75 There is also a wealth of statistical literature on missing data and biased sampling (e.g., [7]). [sent-172, score-0.347]
76 Here we briefly survey some of the interesting and popular approaches and relate them to our method. [sent-175, score-0.145]
77 The EM approach to text classification is advocated by [8]. [sent-176, score-0.041]
78 These approaches consist of iterating between completing class labels and estimating the classification model. [sent-178, score-0.05]
79 The main caveat of all these methods is that they ignore the sampling mechanism, and thus implicitly assume it cancels out in the likelihood function. [sent-179, score-0.252]
80 That is, they assume the sampling is random and that l(x, y) is fixed. [sent-181, score-0.252]
81 It is possible, in principle, to remove this assumption, but that would significantly increase the complexity of the algorithms, as it would require specifying a likelihood model for the sampling mechanism and adding its parameters to the estimation procedure. [sent-182, score-0.478]
82 The use of re-weighted loss to account for unknown sampling mechanisms is suggested by [4, 11]. [sent-184, score-0.391]
83 Although they differ significantly in the details, both of these can account for label-dependent sampling in two-class classification. [sent-185, score-0.252]
84 They do not offer solutions for other modeling tasks or for feature-dependent sampling, which our approach covers. [sent-186, score-0.147]
85 In the missing data literature, [7] (chapter 15) and references therein offer several methods for handling “nonignorable nonresponse”. [sent-187, score-0.104]
86 These are all based on assuming complete probability models for (X, Y, S) and designing EM algorithms for the resulting problem. [sent-188, score-0.051]
87 An interesting example is the bivariate normal stochastic censoring model, originally suggested by [3]. [sent-189, score-0.191]
88 In our notation, they assume that there is an additional fully unobserved “response” zi , and that yi is observed if and only if zi > 0. [sent-190, score-0.349]
89 They also assume that yi , zi are bivariate normal, depending on the features xi , that is: (yi , zi ) ∼ N( (xi β1 , xi β2 ), Σ ) with Σ = [σ², ρσ²; ρσ², 1]. This leads to a complex, but manageable, EM algorithm for inferring the sampling mechanism and fitting the actual model at once. [sent-191, score-1.307]
90 The main shortcoming of this approach, as we see it, is in the need to specify a complete and realistic joint probability model engulfing both the sampling mechanism and the response function. [sent-192, score-0.562]
91 This completely precludes the use of non-probabilistic methods for the response model part (like trees or kernel methods), and seems to involve significant computational complications if we stray from normal distributions. [sent-193, score-0.224]
92 5 Discussion The method we suggest in this paper allows for the separate and unbiased estimation of label-sampling mechanisms, even when they stochastically depend on the partially unobserved labels. [sent-194, score-0.342]
93 Our method of moments suffers from the same problems all such methods (and inverse problems in general) share, namely the uncertainty about the stability and validity of the results. [sent-196, score-0.247]
94 It is difficult to develop general theory for stable solutions to inverse problems. [sent-197, score-0.18]
95 What we can do in practice is attempt to validate the estimates we get. [sent-198, score-0.137]
96 We have already seen one approach for doing this in section 3, where we used multiple choices for g(x) and compared the resulting estimates of the parameters determining l(x, y). [sent-199, score-0.188]
97 Even if we had not known the true values of a and b, the fact that we got similar estimates using different features would reassure us that these estimates were reliable, and give us an idea of their uncertainty. [sent-200, score-0.372]
98 A second approach for evaluating the resulting estimates could be to use bootstrap sampling to calculate bootstrap confidence intervals for the parameter estimates. [sent-201, score-0.284]
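A sketch of such a percentile bootstrap, with estimator standing for any of the moment-based routines sketched above; the function name and defaults are ours:

```python
import numpy as np

def bootstrap_ci(x_j, y, s, estimator, n_boot=500, level=0.9, seed=0):
    """Percentile bootstrap interval for the sampling-model parameters."""
    rng = np.random.default_rng(seed)
    n = len(s)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample (x, y, s) triples
        stats.append(estimator(x_j[idx], y[idx], s[idx]))
    stats = np.asarray(stats)
    alpha = (1.0 - level) / 2.0
    return np.quantile(stats, [alpha, 1.0 - alpha], axis=0)
```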
99 The computational issues also need to be tackled if our method is to be applicable for large scale problems with complex sampling mechanisms. [sent-202, score-0.29]
100 We have suggested a general methodology in section 2, and some ad-hoc solutions for special cases in section 2.1. [sent-203, score-0.095]
wordName wordTfidf (topN-words)
[('satisfaction', 0.272), ('sampling', 0.252), ('unlabeled', 0.215), ('mechanism', 0.188), ('xij', 0.186), ('housing', 0.178), ('si', 0.169), ('logit', 0.164), ('gk', 0.163), ('labeled', 0.142), ('estimates', 0.137), ('eg', 0.135), ('yi', 0.129), ('response', 0.122), ('median', 0.112), ('weighting', 0.11), ('survey', 0.101), ('unbiased', 0.09), ('unobserved', 0.09), ('inverse', 0.088), ('stochastically', 0.086), ('get', 0.084), ('bedrooms', 0.082), ('byi', 0.082), ('hide', 0.082), ('households', 0.082), ('income', 0.082), ('rooms', 0.082), ('plugging', 0.081), ('label', 0.081), ('labeling', 0.081), ('equations', 0.08), ('dp', 0.079), ('dependence', 0.078), ('xi', 0.078), ('built', 0.074), ('zou', 0.071), ('friendly', 0.071), ('latitude', 0.071), ('willingness', 0.071), ('potentially', 0.07), ('stanford', 0.07), ('examples', 0.07), ('logistic', 0.067), ('actual', 0.066), ('complications', 0.065), ('bivariate', 0.065), ('zi', 0.065), ('hastie', 0.064), ('scenario', 0.064), ('moments', 0.062), ('infer', 0.061), ('supervised', 0.061), ('customers', 0.061), ('offer', 0.06), ('solve', 0.059), ('stability', 0.059), ('observe', 0.058), ('inferring', 0.057), ('mse', 0.057), ('age', 0.057), ('features', 0.057), ('california', 0.054), ('attain', 0.054), ('mechanisms', 0.054), ('classi', 0.053), ('hk', 0.052), ('tool', 0.052), ('literature', 0.051), ('resulting', 0.051), ('exp', 0.051), ('lee', 0.051), ('regression', 0.05), ('labels', 0.05), ('solutions', 0.05), ('illustrate', 0.05), ('bootstrap', 0.048), ('liu', 0.048), ('algebra', 0.046), ('suggested', 0.045), ('find', 0.045), ('probabilities', 0.045), ('practical', 0.045), ('missing', 0.044), ('interesting', 0.044), ('rk', 0.044), ('considerations', 0.044), ('utilizing', 0.044), ('reasonable', 0.044), ('stable', 0.042), ('table', 0.041), ('give', 0.041), ('text', 0.041), ('correct', 0.04), ('unknown', 0.04), ('method', 0.038), ('estimation', 0.038), ('modeling', 0.037), ('normal', 0.037), ('em', 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
2 0.27144703 164 nips-2004-Semi-supervised Learning by Entropy Minimization
Author: Yves Grandvalet, Yoshua Bengio
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1
3 0.14836226 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
Author: Chiranjib Bhattacharyya, Pannagadatta K. Shivaswamy, Alex J. Smola
Abstract: We propose a convex optimization based strategy to deal with uncertainty in the observations of a classification problem. We assume that instead of a sample (xi , yi ) a distribution over (xi , yi ) is specified. In particular, we derive a robust formulation when the distribution is given by a normal distribution. It leads to Second Order Cone Programming formulation. Our method is applied to the problem of missing data, where it outperforms direct imputation. 1
4 0.14068212 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
5 0.13664766 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
6 0.12733054 54 nips-2004-Distributed Information Regularization on Graphs
7 0.12540922 70 nips-2004-Following Curved Regularized Optimization Solution Paths
8 0.12525909 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
9 0.10968185 23 nips-2004-Analysis of a greedy active learning strategy
10 0.10326746 115 nips-2004-Maximum Margin Clustering
11 0.094623215 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
12 0.093218304 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
13 0.09013395 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
14 0.090083145 44 nips-2004-Conditional Random Fields for Object Recognition
15 0.086296521 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
16 0.085653678 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
17 0.084137216 161 nips-2004-Self-Tuning Spectral Clustering
18 0.082441494 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
19 0.07880532 136 nips-2004-On Semi-Supervised Classification
20 0.0786881 166 nips-2004-Semi-supervised Learning via Gaussian Processes
topicId topicWeight
[(0, -0.283), (1, 0.094), (2, -0.025), (3, 0.106), (4, 0.016), (5, 0.096), (6, 0.005), (7, 0.116), (8, -0.086), (9, 0.041), (10, 0.185), (11, 0.151), (12, -0.007), (13, -0.063), (14, -0.025), (15, 0.094), (16, 0.019), (17, -0.088), (18, 0.048), (19, -0.101), (20, 0.062), (21, 0.095), (22, 0.143), (23, 0.042), (24, 0.042), (25, 0.124), (26, -0.023), (27, 0.051), (28, -0.068), (29, 0.041), (30, 0.107), (31, -0.007), (32, 0.036), (33, 0.063), (34, 0.05), (35, 0.035), (36, 0.109), (37, 0.06), (38, -0.123), (39, 0.025), (40, 0.068), (41, 0.033), (42, -0.078), (43, -0.004), (44, 0.095), (45, -0.003), (46, -0.019), (47, -0.043), (48, 0.056), (49, 0.07)]
simIndex simValue paperId paperTitle
same-paper 1 0.96496952 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
2 0.85855287 164 nips-2004-Semi-supervised Learning by Entropy Minimization
Author: Yves Grandvalet, Yoshua Bengio
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1
3 0.69678688 54 nips-2004-Distributed Information Regularization on Graphs
Author: Adrian Corduneanu, Tommi S. Jaakkola
Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1
4 0.58175218 23 nips-2004-Analysis of a greedy active learning strategy
Author: Sanjoy Dasgupta
Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1
5 0.5772565 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
Author: Omid Madani, David M. Pennock, Gary W. Flake
Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1
6 0.56440687 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
7 0.54263955 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
8 0.51565313 178 nips-2004-Support Vector Classification with Input Data Uncertainty
9 0.50134319 166 nips-2004-Semi-supervised Learning via Gaussian Processes
10 0.49143681 136 nips-2004-On Semi-Supervised Classification
11 0.47147959 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
12 0.4693065 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems
13 0.45845422 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
14 0.44194844 158 nips-2004-Sampling Methods for Unsupervised Learning
15 0.43922591 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
16 0.43583488 44 nips-2004-Conditional Random Fields for Object Recognition
17 0.43456239 17 nips-2004-Adaptive Manifold Learning
18 0.43377668 130 nips-2004-Newscast EM
19 0.4260762 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
20 0.42523026 27 nips-2004-Bayesian Regularization and Nonnegative Deconvolution for Time Delay Estimation
topicId topicWeight
[(13, 0.088), (15, 0.319), (24, 0.115), (26, 0.064), (31, 0.069), (32, 0.018), (33, 0.152), (35, 0.019), (39, 0.022), (50, 0.027), (59, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.94429779 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie
Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
2 0.93831903 183 nips-2004-Temporal-Difference Networks
Author: Richard S. Sutton, Brian Tanner
Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.
3 0.93822259 197 nips-2004-Two-Dimensional Linear Discriminant Analysis
Author: Jieping Ye, Ravi Janardan, Qi Li
Abstract: Linear Discriminant Analysis (LDA) is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications involving high-dimensional data, such as face recognition and image retrieval. An intrinsic limitation of classical LDA is the so-called singularity problem, that is, it fails when all scatter matrices are singular. A well-known approach to deal with the singularity problem is to apply an intermediate dimension reduction stage using Principal Component Analysis (PCA) before LDA. The algorithm, called PCA+LDA, is used widely in face recognition. However, PCA+LDA has high costs in time and space, due to the need for an eigen-decomposition involving the scatter matrices. In this paper, we propose a novel LDA algorithm, namely 2DLDA, which stands for 2-Dimensional Linear Discriminant Analysis. 2DLDA overcomes the singularity problem implicitly, while achieving efficiency. The key difference between 2DLDA and classical LDA lies in the model for data representation. Classical LDA works with vectorized representations of data, while the 2DLDA algorithm works with data in matrix representation. To further reduce the dimension by 2DLDA, the combination of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied, where LDA is preceded by 2DLDA. The proposed algorithms are applied on face recognition and compared with PCA+LDA. Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient. 1
4 0.93435013 92 nips-2004-Kernel Methods for Implicit Surface Modeling
Author: Joachim Giesen, Simon Spalinger, Bernhard Schölkopf
Abstract: We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. 1
5 0.93050969 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
Author: Le Lu, Gregory D. Hager, Laurent Younes
Abstract: Visual action recognition is an important problem in computer vision. In this paper, we propose a new method to probabilistically model and recognize actions of articulated objects, such as hand or body gestures, in image sequences. Our method consists of three levels of representation. At the low level, we first extract a feature vector invariant to scale and in-plane rotation by using the Fourier transform of a circular spatial histogram. Then, spectral partitioning [20] is utilized to obtain an initial clustering; this clustering is then refined using a temporal smoothness constraint. Gaussian mixture model (GMM) based clustering and density estimation in the subspace of linear discriminant analysis (LDA) are then applied to thousands of image feature vectors to obtain an intermediate level representation. Finally, at the high level we build a temporal multiresolution histogram model for each action by aggregating the clustering weights of sampled images belonging to that action. We discuss how this high level representation can be extended to achieve temporal scaling invariance and to include Bi-gram or Multi-gram transition information. Both image clustering and action recognition/segmentation results are given to show the validity of our three tiered representation.
6 0.92428434 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition
7 0.91567355 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
8 0.91532999 50 nips-2004-Dependent Gaussian Processes
9 0.89517903 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks
10 0.88773805 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
11 0.88360119 40 nips-2004-Common-Frame Model for Object Recognition
12 0.88083595 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
13 0.88052183 148 nips-2004-Probabilistic Computation in Spiking Populations
14 0.88030672 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
15 0.879161 168 nips-2004-Semigroup Kernels on Finite Sets
16 0.87213612 178 nips-2004-Support Vector Classification with Input Data Uncertainty
17 0.86926007 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
18 0.85647035 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
19 0.8558895 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
20 0.85464156 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning