jmlr jmlr2005 jmlr2005-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information. [sent-22, score-0.129]
2 in which equivalence constraints between a few of the data points are provided. [sent-35, score-0.129]
3 A key observation is that in contrast to explicit labels that are usually provided by a human instructor, in many unsupervised learning tasks equivalence constraints may be extracted with minimal effort or even automatically. [sent-38, score-0.127]
4 A different scenario, in which equivalence constraints are the natural source of training data, occurs when we wish to learn from several teachers who do not know each other and who are not able to coordinate among themselves the use of common labels. [sent-45, score-0.151]
5 1 For example, assume that you are given a large database of facial images of many people, which cannot be labelled by a small number of teachers due to its vast size. [sent-47, score-0.179]
6 However, equivalence constraints can be easily extracted, since points which were given the same tag by a certain teacher are known to originate from the same class. [sent-51, score-0.143]
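A hedged illustration of how such constraints could be harvested in this scenario: points that received the same tag from the same teacher are placed in one chunklet. The (teacher, tag, point_index) triple format is an assumption made for the example, not something specified by the authors.

```python
from collections import defaultdict

def chunklets_from_teacher_tags(tagged_points):
    """Group points into chunklets: points that were given the same tag by the
    same teacher are known to come from the same (unknown) class.
    tagged_points: iterable of (teacher_id, tag, point_index) triples
    (an illustrative format, not one defined in the paper)."""
    groups = defaultdict(list)
    for teacher, tag, idx in tagged_points:
        groups[(teacher, tag)].append(idx)
    # singleton groups carry no equivalence information, so drop them
    return [g for g in groups.values() if len(g) > 1]
```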
7 In this paper we study how to use equivalence constraints in order to learn an optimal Mahalanobis metric between data points. [sent-52, score-0.143]
8 , 1997); this example shows that RCA with partial equivalence constraints typically yields comparable results to supervised algorithms which use fully labelled training data. [sent-70, score-0.177]
9 (b) A large data set of images collected by a real-time surveillance application, where the equivalence constraints are gathered automatically. [sent-71, score-0.181]
10 Algorithm 1 The RCA algorithm. Given a data set $X = \{x_i\}_{i=1}^{N}$ and $n$ chunklets $C_j = \{x_{ji}\}_{i=1}^{n_j}$, $j = 1, \dots, n$. [sent-115, score-0.301]
11 Compute the within chunklet covariance matrix (Figure 1d) $\hat{C} = \frac{1}{N}\sum_{j=1}^{n}\sum_{i=1}^{n_j}(x_{ji} - m_j)(x_{ji} - m_j)^t$, (1) where $m_j$ denotes the mean of the j'th chunklet. [sent-119, score-0.491]
12 Compute the whitening transformation associated with $\hat{C}$: $W = \hat{C}^{-\frac{1}{2}}$ (Figure 1e), and apply it to the data points: $X_{new} = WX$ (Figure 1f), where $X$ refers to the data points after dimensionality reduction when applicable. [sent-123, score-0.12]
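A minimal NumPy sketch of the two steps above (the chunklet-centered covariance of Equation (1) followed by whitening). Function and variable names are illustrative; this is a sketch under stated assumptions, not the authors' implementation.

```python
import numpy as np

def rca_whitening(chunklets):
    """RCA sketch: estimate the within chunklet covariance (Equation (1)) and
    return the whitening transformation W = C^{-1/2}.
    chunklets: list of (n_j, D) arrays of points known to share a class.
    Assumes the estimated covariance is full rank (otherwise reduce
    dimensionality first, e.g. with PCA or cFLD)."""
    N = sum(len(c) for c in chunklets)
    D = chunklets[0].shape[1]
    C = np.zeros((D, D))
    for c in chunklets:
        centered = c - c.mean(axis=0)          # subtract the chunklet mean m_j
        C += centered.T @ centered             # accumulate (x_ji - m_j)(x_ji - m_j)^t
    C /= N
    eigvals, eigvecs = np.linalg.eigh(C)       # symmetric inverse square root
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return W

# Usage sketch: X_new = X @ W.T applies the transformation to data stored as rows.
```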
13 If points x1 and x2 are related by a positive constraint, and x2 and x3 are also related by a positive constraint, then a chunklet {x1 , x2 , x3 } is formed. [sent-126, score-0.245]
14 Generally, chunklets are formed by applying transitive closure over the whole set of positive equivalence constraints. [sent-127, score-0.277]
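A small sketch of chunklet formation by transitive closure over positive (must-link) constraints, using a standard union-find; helper names are illustrative.

```python
def form_chunklets(n_points, positive_pairs):
    """Form chunklets as the transitive closure of positive constraints,
    using union-find with path compression."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in positive_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                    # merge the two groups

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    # unconstrained points end up as singletons and are not chunklets
    return [g for g in groups.values() if len(g) > 1]
```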
15 (c) The set of chunklets that are provided to the RCA algorithm (points that share the same color and marker type form a chunklet). [sent-137, score-0.223]
16 ) Secondly, empirical evidence from clustering algorithms which use both types of constraints shows that in most cases positive constraints give a much higher performance gain (Shental et al. [sent-158, score-0.147]
17 Finally, in most cases in which equivalence constraints are gathered automatically, only positive constraints can be gathered. [sent-161, score-0.17]
18 In high dimensional spaces dimensionality reduction is almost always essential for the success of the algorithm, because the whitening transformation essentially re-scales the variability in all directions so as to equalize them. [sent-163, score-0.187]
19 The dimensions eliminated by PCA have small variability in the original data space (corresponding to Cov(X)), while the dimensions emphasized by RCA have low variability in a space where each point is translated according to the centroid of its own chunklet (corresponding to Cov(X|Z)). [sent-168, score-0.35]
20 Information Maximization with Chunklet Constraints How can we use chunklets to find a transformation of the data which improves its representation? [sent-171, score-0.25]
21 We show that for normally distributed data RCA is the optimal transformation when its dimensionality reduction is obtained with a constraints based Fisher’s Linear Discriminant (FLD). [sent-179, score-0.159]
22 In order to avoid the trivial solution λ → ∞, we can limit the distances between points contained in a single chunklet. [sent-196, score-0.245]
23 This can be done by constraining the average distance between a point in a chunklet and the chunklet’s mean. [sent-197, score-0.249]
24 $\frac{1}{N}\sum_{j=1}^{n}\sum_{i=1}^{n_j}\|y_{ji} - m_j^y\|^2 \le \kappa$, (2) where $\{y_{ji}\}_{j=1,i=1}^{n,\,n_j}$ denote the set of points in $n$ chunklets after the transformation, $m_j^y$ denotes the mean of chunklet $j$ after the transformation, and $\kappa$ is a constant. [sent-200, score-0.667]
25 Multiplying a solution matrix A by λ > 1 increases both the log|A| argument and the constrained sum of within chunklet distances. [sent-211, score-0.267]
26 The solution is $B = \frac{1}{D}\hat{C}^{-1}$, where $\hat{C}$ is the average chunklet covariance matrix (1) and D is the dimensionality of the data space. [sent-223, score-0.351]
27 For a given target dimensionality K, the solution of the problem is Fisher's linear discriminant (FLD), followed by the whitening of the within chunklet covariance in the reduced space. [sent-233, score-0.391]
28 Since the FLD transformation is computed based on the estimated within chunklet covariance matrix, it is essentially a semi-supervised technique, as described in Section 6. [sent-236, score-0.314]
29 We interpret problem (6) as follows: a Mahalanobis distance B is sought, which minimizes the sum of all within chunklet squared distances, while |B| ≥ 1 prevents the solution from being achieved by “shrinking” the entire space. [sent-244, score-0.249]
30 Differentiating this Lagrangian (7) shows that the minimum is given by $B = |\hat{C}|^{\frac{1}{D}}\,\hat{C}^{-1}$, where $\hat{C}$ is the average chunklet covariance matrix. [sent-248, score-0.287]
31 It is assumed that in addition to the set of positive constraints $Q_P$, one is also given access to a set of negative constraints $Q_N$ – a set of pairs of points known to be dissimilar. [sent-253, score-0.133]
32 In order to allow a clear comparison of RCA with (8), we reformulate the argument of (6) using only within chunklet pairwise distances. [sent-258, score-0.241]
33 For each point $x_{ji}$ in chunklet $j$ we have $x_{ji} - m_j = x_{ji} - \frac{1}{n_j}\sum_{k=1}^{n_j} x_{jk} = \frac{1}{n_j}\sum_{k=1}^{n_j}(x_{ji} - x_{jk})$. [sent-259, score-0.54]
34 When only chunklets of size 2 are given, as in the case studied by Xing et al. [sent-263, score-0.223]
35 Under the assumption that the chunklets are sampled i.i.d. [sent-274, score-0.223]
36 and that points within each chunklet are also sampled i.i.d., [sent-277, score-0.245]
37 the likelihood of the chunklets' distribution can be written as $\prod_{j=1}^{n}\prod_{i=1}^{n_j}\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x_{ji} - m_j)^t \Sigma^{-1}(x_{ji} - m_j)\right)$. [sent-280, score-0.184]
38 Writing the log-likelihood while neglecting constant terms and denoting $B = \Sigma^{-1}$, we obtain $\sum_{j=1}^{n}\sum_{i=1}^{n_j}\|x_{ji} - m_j\|_B^2 - N\log|B|$, (11) where N is the total number of points in chunklets. [sent-281, score-0.123]
39 Assume for simplicity that there are N constrained data points divided into n chunklets of size k each. [sent-286, score-0.259]
40 The unbiased RCA estimator can be written as $\hat{C}(n, k) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k-1}\sum_{j=1}^{k}(x_{ji} - m_i)(x_{ji} - m_i)^t$, where $\hat{C}(n, k)$ denotes the empirical mean of the covariance estimators produced by each chunklet. [sent-287, score-0.244]
41 This bound shows that the variance of the RCA estimator rapidly converges to the variance of the best estimator, even for chunklets of small size. [sent-289, score-0.282]
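A short sketch of the unbiased per-chunklet estimator discussed above, assuming each chunklet is a NumPy array with at least two points; names are illustrative.

```python
import numpy as np

def unbiased_rca_estimator(chunklets):
    """Average of per-chunklet unbiased covariance estimates (each normalized
    by k - 1), as in the estimator above. chunklets: list of (k, D) arrays,
    each with k >= 2 points from one class."""
    per_chunklet = []
    for c in chunklets:
        centered = c - c.mean(axis=0)
        per_chunklet.append(centered.T @ centered / (len(c) - 1))
    return np.mean(per_chunklet, axis=0)   # empirical mean over chunklets
```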
42 In our case the optimization problem is of the same form as in (14), with the within chunklet covariance matrix from (1) playing the role of Sw . [sent-312, score-0.307]
43 Intuitively, better dimensionality reduction can be obtained by comparing the total covariance matrix (used by PCA) to the within class covariance matrix (used by RCA), and this is exactly what the partially supervised cFLD is trying to accomplish in (14). [sent-316, score-0.252]
44 The cFLD dimensionality reduction can only be used if the rank of the within chunklet covariance matrix is higher than the dimensionality of the initial data space. [sent-317, score-0.425]
45 Compute the rank of the estimated within chunklet covariance matrix, $R = \sum_{j=1}^{n}(|C_j| - 1)$, where $|C_j|$ denotes the size of the j'th chunklet. [sent-322, score-0.307]
46 Compute the total covariance matrix estimate $S_t$, and estimate the within class covariance matrix using $S_w = \hat{C}$ from (1). [sent-326, score-0.158]
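A hedged sketch of the cFLD step as described above, posed as the generalized eigenproblem $S_t v = \lambda S_w v$ with $S_w$ estimated from chunklets. It assumes $S_w$ is non-singular (see the rank check above) and uses SciPy's symmetric generalized eigensolver; names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def cfld_projection(X, chunklets, k):
    """Constraints-based FLD sketch: keep the k directions that maximize the
    total scatter S_t relative to the within chunklet scatter S_w = C-hat
    from Equation (1). X: (N, D) data matrix; chunklets: list of (n_j, D) arrays."""
    St = np.cov(X.T, bias=True)                  # total covariance estimate
    D = X.shape[1]
    Sw = np.zeros((D, D))
    N = sum(len(c) for c in chunklets)
    for c in chunklets:
        centered = c - c.mean(axis=0)
        Sw += centered.T @ centered
    Sw /= N
    evals, evecs = eigh(St, Sw)                  # generalized problem St v = lambda Sw v
    A = evecs[:, np.argsort(evals)[::-1][:k]].T  # k x D projection matrix
    return A

# After projecting with A, RCA whitens the within chunklet covariance in the reduced space.
```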
47 In Algorithm 3 we present the procedure for the simple case of chunklets of size 2. [sent-336, score-0.223]
48 The extension of this algorithm to general chunklets is briefly described in Appendix C. [sent-337, score-0.223]
49 The correlation matrix of h is therefore equivalent to the within chunklet covariance matrix. [sent-346, score-0.307]
50 We also used this data set to test the effect of dimensionality reduction using cFLD, and the sensitivity of RCA to average chunklet size and the total amount of points in chunklets. [sent-362, score-0.319]
51 Section 8.2 presents a more realistic surveillance application in which equivalence constraints are gathered automatically from a Markovian process. [sent-364, score-0.132]
52 In Section 8.3 we conclude our experimental validation by comparing RCA with other methods which make use of equivalence constraints in a clustering task, using a few benchmark data sets from the UCI repository (Blake and Merz, 1998). [sent-366, score-0.143]
53 The evaluation of different metrics below is presented using cumulative neighbor purity graphs, which display the average (over all data points) percentage of correct neighbors among the first k neighbors, as a function of k. [sent-367, score-0.143]
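A possible implementation of the cumulative neighbor purity measure described above, written for clarity rather than efficiency; array names are illustrative assumptions.

```python
import numpy as np

def cumulative_neighbor_purity(X, labels, max_k):
    """For each k = 1..max_k, the average over all points of the fraction of
    same-class points among the k nearest neighbors (Euclidean metric in the
    given representation). X: (N, D) array; labels: length-N NumPy array."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # a point is not its own neighbor
    order = np.argsort(d2, axis=1)[:, :max_k]             # nearest-neighbor indices
    same = labels[order] == labels[:, None]               # neighbor shares the query's class?
    return same.cumsum(axis=1).mean(axis=0) / np.arange(1, max_k + 1)

# Usage sketch: purity = cumulative_neighbor_purity(X @ W.T, y, max_k=60) after whitening.
```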
54 1 Applying RCA to Facial Recognition The task here is to classify facial images with respect to the person photographed. [sent-369, score-0.118]
55 In these experiments we consider a retrieval paradigm reminiscent of nearest neighbor classification, in which a query image leads to the retrieval of its nearest neighbor or its K-nearest neighbors in the data set. [sent-370, score-0.144]
56 In Figure 5 the retrieval performance of RCA is compared with the ... Figure 2: A subset of the YaleB database which contains 1920 frontal face images of 30 individuals taken under different lighting conditions. [sent-380, score-0.129]
57 In Figure 6 we evaluate the effect of chunklet sizes on retrieval performance, and compare it to the predicted effect of chunklet size on the variance of the RCA estimator. [sent-385, score-0.482]
58 , 1997), which contains facial images of 30 subjects under varying lighting conditions. [sent-389, score-0.132]
59 The variability between images of the same person is mainly due to different lighting conditions. [sent-391, score-0.143]
60 These factors caused the variability among images belonging to the same subject to be greater than the variability among images of different subjects (Adini et al. [sent-392, score-0.178]
61 The total number of points in chunklets grows linearly with T·L, the number of data points seen by all teachers. [sent-404, score-0.257]
62 5 The parameter L controls the distribution of chunklet sizes. [sent-407, score-0.228]
63 Figure 3 shows a histogram of typical chunklet sizes, as obtained in our experiments. [sent-415, score-0.241]
64 Figure 3: Sample chunklet size distribution (histogram over chunklet sizes 2-10, with 30% of the points in chunklets) obtained using the distributed learning scenario on a subset of the YaleB data set with 1920 images from M = 30 classes. [sent-416, score-0.306]
65 To this end we compared the following methods: • Eigenfaces (Turk and Pentland, 1991): this unsupervised method reduces the dimensionality of the data using PCA, and compares the images using the Euclidean metric in the reduced space. [sent-422, score-0.139]
66 As can be seen, even with low amounts of equivalence constraints the performance of RCA is much closer to the performance of the supervised FLD than to the performance of the unsupervised PCA. [sent-431, score-0.147]
67 With moderate and high amounts of equivalence constraints RCA achieves neighbor purity rates which are 6. [sent-432, score-0.197]
68 We used a different sampling scheme in the experiments which address the effect of chunklet size, see Section 8. [sent-433, score-0.228]
69 Right: Cumulative purity graphs for the fully supervised FLD, with and without fully labelled RCA. [sent-456, score-0.141]
70 higher than those achieved by the fully supervised Fisherfaces method, while relying only on fragmentary chunklets with unknown class labels. [sent-458, score-0.261]
71 In this case, chunklets correspond uniquely and fully to classes, and the cFLD algorithm for dimensionality reduction is equivalent to the standard FLD. [sent-461, score-0.315]
72 of the inner chunklet covariance matrix (Step 3). [sent-477, score-0.307]
73 varied the size of the chunklets S in the range {2 − 10}. [sent-532, score-0.223]
74 The chunklets were obtained by randomly selecting 30% of the data (total of P = 1920 points) and dividing it into chunklets of size S. [sent-533, score-0.446]
75 As expected the performance of RCA improves as the size of the chunklets increases. [sent-535, score-0.223]
76 Qualitatively, this improvement agrees with the predicted improvement in the RCA estimator’s variance, as most of the gain in performance is already obtained with chunklets of size S = 3. [sent-536, score-0.223]
77 In this experiment we varied the chunklet sizes while fixing the total amount of points in chunklets. [sent-577, score-0.245]
78 We used the clips as chunklets in order to compute the RCA transformation. [sent-599, score-0.257]
79 While there may be several reasons for this, an important factor is the difference between the way chunklets were obtained in the two data sets. [sent-605, score-0.223]
80 The automatic gathering of chunklets from a Markovian process tends to provide chunklets with dependent data points, which supply less information regarding the within class covariance matrix. [sent-606, score-0.505]
81 For each data set we randomly selected a set QP of pairwise positive equivalence constraints (or chunklets of size 2). [sent-610, score-0.348]
82 (2002), we used exactly the same experimental setup as it affects the gathering of equivalence constraints and the evaluation score used. [sent-631, score-0.127]
83 Generally in the context of K-means, we observe that using equivalence constraints to find a better metric improves results much more than using this information to constrain the algorithm. [sent-699, score-0.143]
84 In this example we used all of the points from each class as a single chunklet and therefore the chunklet covariance matrix is the average within-class covariance matrix. [sent-719, score-0.611]
85 The second assumption justifies RCA's main computational step, which uses the empirical average of all the chunklets' covariance matrices in order to estimate the global within class covariance matrix. [sent-754, score-0.341]
86 The third assumption may break down in many practical applications, when chunklets are automatically collected and the points within a chunklet are no longer independent of one another. [sent-760, score-0.468]
87 As a result chunklets may be composed of points which are rather close to each other, and whose distribution does not reflect all the typical variance of the true distribution. [sent-761, score-0.255]
88 We can rewrite the constrained expression from Equation 5 as $\frac{1}{N}\sum_{j=1}^{n}\sum_{i=1}^{n_j}(x_{ji} - m_j)^t A^t A (x_{ji} - m_j) = \mathrm{tr}(A^t A \hat{C}) = \mathrm{tr}(A \hat{C} A^t)$. [sent-772, score-0.203]
89 Hence as in RCA, A must whiten the data with respect to the chunklet covariance $\hat{C}$ in a yet to be determined subspace. [sent-776, score-0.287]
90 Assume we have N = nk data points $X = \{x_{ji}\}_{j=1,i=1}^{k,n}$ in n chunklets of size k each. [sent-789, score-0.342]
91 We assume that all chunklets are drawn independently from Gaussian sources with the same covariance matrix. [sent-790, score-0.282]
92 Denoting by $m_i$ the mean of chunklet $i$, the unbiased RCA estimator of this covariance matrix is $\hat{C}(n, k) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k-1}\sum_{j=1}^{k}(x_{ji} - m_i)(x_{ji} - m_i)^T$. [sent-791, score-0.492]
93 Let U denote the diagonalization transformation of the covariance matrix C of the Gaussian sources, that is, $UCU^t = \Lambda$ where $\Lambda$ is a diagonal matrix with $\{\lambda_i\}_{i=1}^{D}$ on the diagonal. [sent-794, score-0.126]
94 Denoting the chunklet means in the transformed space by $m_i^u$, we can analyze the variance of $\hat{C}_u$ as follows: $\mathrm{var}(\hat{C}_u(n, k)) = \mathrm{var}\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{k-1}\sum_{j=1}^{k}(z_{ji} - m_i^u)(z_{ji} - m_i^u)^T\right] = \frac{1}{n}\,\mathrm{var}\left[\frac{1}{k-1}\sum_{j=1}^{k}(z_{ji} - m_i^u)(z_{ji} - m_i^u)^T\right]$. (15) [sent-797, score-0.619]
95 The last equality holds since the summands of the external sum are sample covariance matrices of independent chunklets drawn from sources with the same covariance matrix. [sent-798, score-0.341]
96 Online RCA with Chunklets of General Size The online RCA algorithm can be extended to handle a stream of chunklets of varying size. [sent-806, score-0.223]
97 Algorithm 4 Online RCA for chunklets of variable size. Input: a stream of chunklets where the points in a chunklet are known to belong to the same class. [sent-810, score-0.691]
98 At time step T do: • receive a chunklet $\{x_1^T, \dots$ [sent-812, score-0.223]
99 The Expected Chunklet Size in the Distributed Learning Paradigm We estimate the expected chunklet size obtained when using the distributed learning paradigm introduced in Section 8. [sent-818, score-0.228]
100 Using these approximations, we can derive an approximation for the expected chunklet size as a function of the ratio $r = \frac{L}{M}$: $E(x \mid x \neq 0, x \neq 1) = \frac{\frac{L}{M} - p(x = 1)}{1 - p(x = 0) - p(x = 1)} \approx \frac{r(1 - e^{-r})}{1 - e^{-r} - re^{-r}}$. [sent-829, score-0.228]
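A small numeric sketch of this approximation, assuming the Poisson-style approximations $p(x=0) \approx e^{-r}$ and $p(x=1) \approx re^{-r}$ implied by the preceding derivation; this is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def expected_chunklet_size(r):
    """Approximate expected chunklet size as a function of r = L/M, where x is
    the number of points of a given class seen by one teacher and we condition
    on x not being 0 or 1 (a hedged reconstruction of the formula above)."""
    p0, p1 = np.exp(-r), r * np.exp(-r)
    return (r - p1) / (1.0 - p0 - p1)    # = r(1 - e^-r) / (1 - e^-r - r e^-r)

for r in (0.5, 1.0, 2.0, 5.0):
    print(f"r = {r}: expected chunklet size approx. {expected_chunklet_size(r):.2f}")
```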
wordName wordTfidf (topN-words)
[('rca', 0.854), ('chunklet', 0.228), ('chunklets', 0.223), ('fld', 0.141), ('cfld', 0.097), ('ji', 0.078), ('quivalence', 0.073), ('ahalanobis', 0.068), ('einshall', 0.068), ('ertz', 0.068), ('etric', 0.068), ('hental', 0.068), ('illel', 0.068), ('kc', 0.068), ('bar', 0.061), ('onstraints', 0.061), ('mahalanobis', 0.061), ('covariance', 0.059), ('purity', 0.058), ('yaleb', 0.058), ('constraints', 0.058), ('neighbors', 0.058), ('equivalence', 0.054), ('xing', 0.049), ('images', 0.049), ('facial', 0.049), ('whitening', 0.046), ('var', 0.045), ('dimensionality', 0.044), ('shental', 0.041), ('variability', 0.04), ('teachers', 0.039), ('pca', 0.038), ('clips', 0.034), ('lighting', 0.034), ('rand', 0.033), ('metric', 0.031), ('clustering', 0.031), ('reduction', 0.03), ('estimator', 0.029), ('nj', 0.028), ('transformation', 0.027), ('neighbor', 0.027), ('labelled', 0.027), ('qp', 0.026), ('belhumeur', 0.024), ('acat', 0.024), ('ciuj', 0.024), ('qruiqu', 0.024), ('theoretic', 0.024), ('nk', 0.024), ('fukunaga', 0.022), ('dimensions', 0.021), ('cu', 0.021), ('distance', 0.021), ('hertz', 0.02), ('jr', 0.02), ('surveillance', 0.02), ('matrix', 0.02), ('supervised', 0.02), ('person', 0.02), ('ckl', 0.019), ('ci', 0.019), ('constrained', 0.019), ('fully', 0.018), ('points', 0.017), ('em', 0.017), ('sw', 0.016), ('wagstaff', 0.016), ('eigenfaces', 0.016), ('mu', 0.016), ('illumination', 0.016), ('cii', 0.016), ('retrieval', 0.016), ('cov', 0.016), ('irrelevant', 0.015), ('unsupervised', 0.015), ('uci', 0.015), ('database', 0.015), ('variance', 0.015), ('score', 0.015), ('clip', 0.015), ('euclid', 0.015), ('frontal', 0.015), ('myj', 0.015), ('teacher', 0.014), ('discriminant', 0.014), ('video', 0.014), ('bars', 0.014), ('pairwise', 0.013), ('transformations', 0.013), ('histogram', 0.013), ('xit', 0.013), ('faces', 0.013), ('daphna', 0.012), ('camera', 0.012), ('turk', 0.012), ('moderate', 0.012), ('yyt', 0.012), ('scenario', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
2 0.039664015 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
3 0.035514604 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
Author: Neil Lawrence
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.030280657 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
Author: Marianthi Markatou, Hong Tian, Shameek Biswas, George Hripcsak
Abstract: This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random T T variables Y = Card(S j S j ) and Y ∗ = Card(Sc Sc ), where S j , S j are two training sets, and Sc , j j j c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric Sj and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation. Keywords: cross-validation, generalization error, moment approximation, prediction, variance estimation
6 0.027245741 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
7 0.026949733 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
8 0.022687245 39 jmlr-2005-Information Bottleneck for Gaussian Variables
9 0.022415278 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
10 0.022258189 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
11 0.021541011 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
12 0.020773241 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
13 0.019016434 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
14 0.018321058 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
15 0.018281443 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
16 0.017668208 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.017186781 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.016662117 41 jmlr-2005-Kernel Methods for Measuring Independence
19 0.016112916 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
20 0.015990388 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
topicId topicWeight
[(0, 0.103), (1, 0.062), (2, -0.002), (3, -0.059), (4, 0.045), (5, -0.016), (6, -0.082), (7, 0.046), (8, 0.012), (9, -0.001), (10, -0.029), (11, 0.053), (12, 0.002), (13, 0.017), (14, -0.063), (15, 0.021), (16, 0.192), (17, -0.007), (18, -0.007), (19, 0.039), (20, -0.092), (21, -0.37), (22, 0.038), (23, 0.048), (24, -0.138), (25, -0.269), (26, 0.009), (27, -0.145), (28, -0.356), (29, -0.465), (30, 0.287), (31, 0.127), (32, -0.063), (33, -0.139), (34, 0.011), (35, 0.07), (36, -0.084), (37, 0.124), (38, -0.208), (39, -0.211), (40, -0.091), (41, -0.16), (42, 0.093), (43, -0.096), (44, 0.091), (45, 0.042), (46, 0.003), (47, 0.036), (48, -0.035), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.9651286 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
2 0.10828757 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
Author: Hal Daumé III, Daniel Marcu
Abstract: We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics. Keywords: supervised clustering, record linkage, citation matching, coreference, Dirichlet process, non-parametric Bayesian
3 0.085987121 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
4 0.082073078 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
Author: Neil Lawrence
Abstract: Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets. Keywords: Gaussian processes, latent variable models, principal component analysis, spectral methods, unsupervised learning, visualisation
5 0.080371812 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
Author: Luís B. Almeida
Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation
6 0.079478867 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
7 0.067454509 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
8 0.067247234 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
9 0.065393701 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
10 0.061812446 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
11 0.060623124 39 jmlr-2005-Information Bottleneck for Gaussian Variables
12 0.059693716 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
13 0.057715029 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.057458736 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
15 0.057057694 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
16 0.056708544 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
17 0.055114564 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
18 0.054423604 41 jmlr-2005-Kernel Methods for Measuring Independence
19 0.052087061 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
20 0.05172183 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
topicId topicWeight
[(13, 0.025), (17, 0.028), (19, 0.025), (31, 0.391), (36, 0.036), (37, 0.039), (42, 0.025), (43, 0.037), (47, 0.013), (52, 0.083), (59, 0.045), (70, 0.035), (88, 0.072), (90, 0.019), (94, 0.017)]
simIndex simValue paperId paperTitle
1 0.85308349 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
Author: Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin
Abstract: Working set selection is an important step in decomposition methods for training support vector machines (SVMs). This paper develops a new technique for working set selection in SMO-type decomposition methods. It uses second order information to achieve fast convergence. Theoretical properties such as linear convergence are established. Experiments demonstrate that the proposed method is faster than existing selection methods using first order information. Keywords: support vector machines, decomposition methods, sequential minimal optimization, working set selection
same-paper 2 0.67342997 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
Author: Aharon Bar-Hillel, Tomer Hertz, Noam Shental, Daphna Weinshall
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher’s linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods. Keywords: clustering, metric learning, dimensionality reduction, equivalence constraints, side information.
3 0.32147479 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
Author: Antoine Bordes, Seyda Ertekin, Jason Weston, Léon Bottou
Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least take a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.
4 0.31944039 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.31789213 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
6 0.31700692 36 jmlr-2005-Gaussian Processes for Ordinal Regression
7 0.31312275 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
8 0.3125917 64 jmlr-2005-Semigroup Kernels on Measures
9 0.3112666 3 jmlr-2005-A Classification Framework for Anomaly Detection
10 0.31078845 39 jmlr-2005-Information Bottleneck for Gaussian Variables
11 0.3083702 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
12 0.3065317 44 jmlr-2005-Learning Module Networks
13 0.3053782 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
14 0.30490154 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
15 0.30048445 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
16 0.30017886 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.29892465 54 jmlr-2005-Managing Diversity in Regression Ensembles
18 0.29815653 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
19 0.2976214 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
20 0.29754198 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets