jmlr jmlr2008 jmlr2008-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
Reference: text
sentIndex sentText sentNum sentScore
1 55 D-80799 München, Germany. Editor: Peter Bartlett. Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. [sent-3, score-0.275]
2 Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. [sent-4, score-0.531]
3 The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. [sent-6, score-0.297]
4 It is the very universality of the concept which explains the lack of a universal definition, beyond the structural feature that similarity is a property possessed by pairs of objects and the vague feeling that it is somehow related to geometric proximity. [sent-11, score-0.232]
5 S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) ∈ (X^2 × {−1, 1})^m, generated in m independent, identical trials of ρ, that is, S ∼ ρ^m. [sent-33, score-0.374]
6 The pair hypothesis attempts to predict the similarity value of a pair (x, x') and its risk is its error probability as measured by the pair oracle. [sent-39, score-0.322]
7 2 Risk Bounds and Regularized Objectives. So far we have only used the structural condition that similarity is a property possessed by pairs of objects. [sent-43, score-0.232]
8 A transformation T ∈ L_0(H) thus defines a hypothesis f_T which regards a pair of inputs as similar if the distance between their respective images under T is smaller than one, and as dissimilar if this distance is larger than one. [sent-46, score-0.252]
9 The risk of the hypothesis f_T induces a risk functional on L_0(H): R(T) = R(f_T) = Pr_{(x,x',r)∼ρ} {r(1 − ‖Tx − Tx'‖²) ≤ 0}. [sent-51, score-0.333]
10 Given S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) we define the empirical risk estimate R̂_ψ(T, S) = (1/m) ∑_{i=1}^m ψ(r_i(1 − ‖Tx_i − Tx_i'‖²)). [sent-61, score-0.621]
11 The theorem gives a high-probability bound on the true risk, valid for all transformations, in terms of the empirical risk estimate and a complexity term. [sent-65, score-0.396]
12 This leads to the regularized objective function Λ_{h_γ,λ}(T) := (1/m) ∑_{i=1}^m h_γ(r_i(1 − ‖Tx_i − Tx_i'‖²)) + λ‖T*T‖₂/√m, which is convex in T*T. [sent-71, score-0.195]
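A minimal sketch of how this regularized objective could be evaluated for a finite-dimensional representation, assuming T is stored as a matrix, the sample as numpy arrays of paired inputs X, X2 and labels r, and h_γ is the plain hinge max(0, 1 − t/γ); the loss form, names and data layout are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def hinge(t, gamma):
    # Assumed margin loss h_gamma(t) = max(0, 1 - t / gamma): it dominates the indicator
    # of (-inf, 0] and is Lipschitz with constant 1 / gamma.
    return np.maximum(0.0, 1.0 - t / gamma)

def regularized_objective(T, X, X2, r, lam=1.0, gamma=0.1):
    """Empirical margin loss plus the Hilbert-Schmidt regularizer lam * ||T*T||_2 / sqrt(m)."""
    m = len(r)
    sq_dist = np.sum(((X - X2) @ T.T) ** 2, axis=1)          # ||T x_i - T x_i'||^2 per pair
    emp = hinge(np.asarray(r) * (1.0 - sq_dist), gamma).mean()
    reg = lam * np.linalg.norm(T.T @ T, "fro") / np.sqrt(m)  # ||T*T||_2 = Frobenius norm of T^T T
    return emp + reg
```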
13 It follows from the nature of the objective function that the minimizing operator will have rank ≤ m. [sent-74, score-0.205]
14 Human learning appears to cope with these empirically deficient problems: in addition to the recognition of the already known categories, human learners develop meta-techniques to improve the learning and recognition of future categories. [sent-78, score-0.192]
15 By a very crude operational definition, similarity is a property of two phenomena which makes them belong to the same category of some domain, and dissimilarity is the negation of similarity. [sent-83, score-0.224]
16 Following this idea we can define a pair oracle which regards pairs of inputs as similar if and only if they come with the same label, use this oracle to generate a large number of examples and train our algorithm to obtain a similarity rule, together with a representing operator T . [sent-84, score-0.648]
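Such a pair oracle could be simulated from any labeled data set along the following lines; the sketch samples similar and dissimilar pairs with equal frequency, as in the experiments reported later, and all names are illustrative.

```python
import numpy as np

def sample_pairs(X, y, m, seed=0):
    """Draw m triples (x, x', r): r = +1 if the two examples carry the same label, -1 otherwise.
    Similar and dissimilar pairs are drawn equally often."""
    rng = np.random.default_rng(seed)
    pairs = []
    while len(pairs) < m:
        want_similar = (len(pairs) % 2 == 0)      # alternate the two cases to balance them
        i, j = rng.integers(0, len(y), size=2)
        if i == j:
            continue
        r = 1 if y[i] == y[j] else -1
        if (r == 1) == want_similar:
            pairs.append((X[i], X[j], r))
    return pairs
```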
17 Corresponding experiments were carried out on a number of problems in image recognition, such as the recognition of handwritten characters, rotation and scale invariant character recognition and the recognition of human faces. [sent-90, score-0.473]
18 A sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) corresponds to m independent realizations (Q_{x_1 − x_1'}, r_1), ... [sent-227, score-0.249]
19 2 The Objective Functional. Because of the complicated estimates involved, risk bounds such as Theorem 13 have a tendency to be somewhat loose, so that a learning algorithm relying on naive minimization of the bounds may end up with suboptimal hypotheses. [sent-266, score-0.276]
20 Departing from the simpler of the two conclusions of Theorem 13, a naive approach would look for some T ∈ L_0(H) to minimize the objective functional R̂_ψ(T, S) + 4L‖T*T‖₂/√m + √(ln(2‖T*T‖₂²/δ)/(2m)). [sent-268, score-0.216]
21 A sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) is generated in m independent, identical trials of the pair oracle ρ. [sent-286, score-0.387]
22 It is inherent to the problem of similarity learning that one is led to consider an asymmetric response to similar and dissimilar examples (see, for example, the approaches of Bar-Hillel et al. [sent-289, score-0.342]
23 We thus consider two functions ψ_1, ψ_{−1}: R → R, ψ_i ≥ 1_{(−∞,0]}, and the objective functional Λ_{ψ_1,ψ_{−1},λ}(T, S) = (1/m) ∑_{i=1}^m ψ_{r_i}(r_i(1 − ‖Tx_i − Tx_i'‖²)) + λ‖T*T‖₂/√m. [sent-293, score-0.206]
24 Note that our generalization bounds remain valid for the empirical risk estimate using ψ_1 and ψ_{−1} as long as we use for L the Lipschitz constant of min{ψ_1, ψ_{−1}}. [sent-294, score-0.242]
25 Using ψ_i = h_{γ_i} we are effectively considering an inside margin γ_1, which applies to similar pairs, and an outside margin γ_{−1}, which applies to dissimilar pairs. [sent-295, score-0.254]
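For concreteness, the asymmetric loss could look as follows when ψ_i = h_{γ_i} is taken to be the hinge assumed above; the margin values and names are purely illustrative.

```python
def asymmetric_loss(r, sq_dist, gamma_in=0.1, gamma_out=0.5):
    """psi_r(r * (1 - ||Tx - Tx'||^2)): inside margin gamma_in for similar pairs (r = +1),
    outside margin gamma_out for dissimilar pairs (r = -1)."""
    t = r * (1.0 - sq_dist)
    gamma = gamma_in if r == 1 else gamma_out
    return max(0.0, 1.0 - t / gamma)
```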
26 2) such that an objective functional can be written as −⟨V, A⟩₂ + λ‖V‖_p, to be minimized over positive operators V. [sent-314, score-0.215]
27 For p = 2 (the Hilbert-Schmidt case) one finds that the minimizer is a multiple of the positive part A_+ of the empirical operator A (the source of the sparsity observed in our experiments), and is stable under small perturbations of the eigenvalues of A. [sent-316, score-0.227]
28 Let S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) and assume that ψ: R → R_+ is convex, satisfying ψ ≥ 1_{(−∞,0]}. [sent-324, score-0.249]
29 = Ω(A_n) < Ω(A_v) = Φ(v). Observe that this result justifies the gradient descent method in the presence of positivity constraints also for other convex loss functions besides the hinge loss. [sent-367, score-0.255]
30 Briefly, one extends the functional Ω to all of L_2(H), so that the problem becomes equivalent to an SVM, and then alternates between gradient descent in Ω, which is convex, and projections onto L_2^+(H). [sent-373, score-0.201]
31 The projection of an operator A ∈ L_2(H) to L_2^+(H) is effected by an eigen-decomposition and reconstruction with all the negative eigenvalues set to zero, so that only the positive part of A is retained. [sent-374, score-0.233]
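A sketch of this projection step for a symmetric matrix representation of A, using an eigen-decomposition in numpy; this is illustrative code, not the implementation used in the experiments.

```python
import numpy as np

def positive_part(A):
    """Project a symmetric matrix onto the positive semidefinite cone: eigen-decompose and
    reconstruct with all negative eigenvalues set to zero, so that only A_+ is retained."""
    A = 0.5 * (A + A.T)                    # symmetrize against numerical round-off
    w, V = np.linalg.eigh(A)               # eigenvalues w, orthonormal eigenvectors in the columns of V
    return (V * np.maximum(w, 0.0)) @ V.T  # V diag(max(w, 0)) V^T
```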
32 Straightforward differentiation gives for the k-th component of the gradient of Φ at v ∈ M^m the expression (∇Φ)_k(v) = −(2/m) ∑_{i=1}^m ψ'_+(r_i(1 − ∑_{j=1}^m a_ij²)) r_i a_ik (x_i − x_i') + (2λ/(√m ‖A_v‖₂)) ∑_{j=1}^m ⟨v_k, v_j⟩ v_j, where a_ik = ⟨x_i − x_i', v_k⟩ and ‖A_v‖₂ = (∑_{i,j} ⟨v_i, v_j⟩²)^{1/2}. [sent-395, score-0.502]
33 In a simplified view, which disregards the regularization term (or if λ = 0), the algorithm will modify T in an attempt to bring Tx_i and Tx_i' closer if x_i and x_i' are similar (i.e. [sent-397, score-0.293]
34 , r_i = 1) and their distance exceeds 1 − γ; it will attempt to move Tx_i and Tx_i' further apart if x_i and x_i' are dissimilar (i.e., r_i = −1). [sent-399, score-0.476]
35 If ‖Tx − Tx'‖ ≤ 1 − γ for similar or ‖Tx − Tx'‖ ≥ 1 + γ for dissimilar pairs there is no effect. [sent-408, score-0.202]
36 If ‖Tx − Tx'‖ > 1 − γ for dissimilar pairs, the ellipsoid is dilated. [sent-410, score-0.212]
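A sketch of a single stochastic update in this simplified view (λ = 0), treating T as a matrix and using the hinge assumed earlier; the step size and margin are illustrative and this is not claimed to be the exact update rule of the paper.

```python
import numpy as np

def pair_update(T, x, x2, r, gamma=0.1, eta=0.01):
    """One gradient step on h_gamma(r * (1 - ||T (x - x')||^2)) for a single pair.
    Similar pairs (r = +1) with ||T (x - x')||^2 > 1 - gamma are pulled together;
    dissimilar pairs (r = -1) with ||T (x - x')||^2 < 1 + gamma are pushed apart."""
    delta = x - x2
    Td = T @ delta
    t = r * (1.0 - Td @ Td)                          # margin of this pair
    if t < gamma:                                    # hinge max(0, 1 - t/gamma) is active
        grad = (2.0 * r / gamma) * np.outer(Td, delta)
        T = T - eta * grad
    return T
```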
37 The first is rather obvious and extends the method described above to continuous similarity values. [sent-432, score-0.223]
38 The other method is derived from the risk functional R and has already been described in Maurer (2006a). [sent-433, score-0.197]
39 1 have straightforward extensions to the case when the oracle measure is not supported on X^2 × {−1, 1} but on X^2 × [0, 1], corresponding to a continuum of similarity values, and ℓ: [0, 1] × R → R is a loss function such that ℓ(y, ·) [sent-436, score-0.324]
40 The corresponding risk functional to be minimized would be R(T) = E_{(x,x',r)∼ρ} ℓ(r, ‖Tx − Tx'‖²). [sent-438, score-0.197]
41 2 Risk Bounds with Affine Loss Functions and Hyperbolic PCA. Consider again the case of a binary oracle (similar/dissimilar) and the task of selecting an operator from some set V ⊂ L_2(H). [sent-445, score-0.338]
42 Of course we can use the risk bounds of the previous section for an empirical estimator constructed from the Lipschitz functions ψ_1, ψ_{−1} used in the proof above, corresponding to an inner margin of 1 and an outer margin of c² − 1 (see Section 3. [sent-456, score-0.34]
43 Minimizing the bounds in Proposition 15 is equivalent to maximizing ⟨T*T, A⟩₂, which is the only term depending on the operator T. [sent-461, score-0.21]
44 Given a sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)), the obvious way to try this is by maximizing the empirical counterpart ⟨T*T, Â⟩₂, where Â is the empirical operator Â(S) = (1/m) ∑_{i=1}^m η_{r_i} Q_{x_i − x_i'}. [sent-465, score-0.498]
45 This gives an alternative algorithm to the one described in Section 3: fix a quantity c > 2 and construct a matrix representation of the empirical operator Â given in (9). [sent-485, score-0.223]
46 There is an intuitive interpretation to this algorithm, which could be called hyperbolic PCA: the empirical objective functional is proportional to m⟨P, Â⟩₂ = ∑_{i=1}^m η_{r_i} ‖Px_i − Px_i'‖² = (1/(c² − 1)) ∑_{i: x_i, x_i' dissimilar} ‖Px_i − Px_i'‖² − ∑_{i: x_i, x_i' similar} ‖Px_i − Px_i'‖². [sent-489, score-0.406]
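A sketch of this hyperbolic PCA variant: form the weighted empirical operator Â from the pair differences and project onto its dominant eigenspace. The weights η_{+1} = −1 and η_{−1} = 1/(c² − 1) are read off the displayed identity and, like the fixed target dimension, should be treated as assumptions.

```python
import numpy as np

def hyperbolic_pca(X, X2, r, c=10.0, dim=5):
    """A_hat = (1/m) sum_i eta_{r_i} (x_i - x_i')(x_i - x_i')^T with eta_{+1} = -1 and
    eta_{-1} = 1 / (c**2 - 1); return the projection onto its dim leading eigenvectors."""
    r = np.asarray(r)
    deltas = X - X2                                      # rows are the differences x_i - x_i'
    eta = np.where(r == 1, -1.0, 1.0 / (c ** 2 - 1.0))
    A_hat = (deltas * eta[:, None]).T @ deltas / len(r)
    w, V = np.linalg.eigh(A_hat)                         # eigenvalues in ascending order
    U = V[:, -dim:]                                      # eigenvectors of the dim largest eigenvalues
    return U @ U.T                                       # orthogonal projection P maximizing <P, A_hat>
```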
47 Typically we have c ≫ 1 so that 1/(c² − 1) ≪ 1, so that dissimilar pairs ('negative equivalence constraints') receive a much smaller weight than similar pairs, corresponding to the intuitive counting argument given by Bar-Hillel et al. [sent-492, score-0.202]
48 In contrast to PCA, where the operator in question is the empirical covariance operator, which is always nonnegative, we will project to an eigenspace of an empirical operator which is a linear combination of empirical covariances and generally not positive. [sent-495, score-0.458]
49 While the quadratic form associated with the covariance has elliptic level sets, the empirical operator induces hyperbolic level sets. [sent-496, score-0.264]
50 Applications to Multi-category Problems and Learning to Learn. In this section we apply similarity learning to problems where the nature of the application task is partially or completely unknown, so that the available data is used for a preparatory learning process, to facilitate future learning. [sent-498, score-0.246]
51 There is a canonical pair oracle derived from the task τ. [sent-508, score-0.198]
52 Then E_{(x,y)∼µ}[err_τ(ε_T(x, y))] = R_{ρ_τ}(T) (11), so we can use the risk bounds in Sections 3. [sent-511, score-0.206]
53 If there are many labels appearing approximately equally likely, then dissimilar pairs will be sampled much more frequently than similar ones, resulting in a negative bias of elementary classifiers. [sent-514, score-0.287]
54 If we draw a sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) from which we generate the operator T according to the algorithm derived in Section 3, then we are essentially minimizing a bound on the expected error of elementary verifiers trained on future examples. [sent-527, score-0.474]
55 Given a sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)) we train an operator T and then use the elementary verifiers ε_T for face-verification on the entire population. [sent-534, score-0.474]
56 The operator T generated from the sample drawn from µ may nonetheless work for the original task τ corresponding to the entire population. [sent-547, score-0.2]
57 This points to a different method to apply the proposed algorithm: use one task τ = (Y, µ), the training task, to draw the training sample and train the representation T, and apply T to the data of the application task τ' = (Y', µ'). [sent-548, score-0.291]
58 For a sample S' of size m drawn from µ' we have R_{ρ_{τ'}}(T) ≤ R̂_{1_{(−∞,0]}}(T, S') + √(ln(1/δ)/(2m)), which is of course much better than the bounds in Theorem 11 (here R̂_{1_{(−∞,0]}}(T, S') is the empirical risk of T on S'). [sent-554, score-0.332]
59 Of course this type of transfer can be attempted between any two binary classification tasks, but its success normally requires a similarity of the pattern classes themselves. [sent-555, score-0.269]
60 A classifier trained to distinguish images of the characters "a" and "b" will be successfully applied to two classes of images if these pattern classes bear some resemblance to "a" and "b". [sent-556, score-0.289]
61 A representation of similarity, however, can be successfully transferred if there is a similar notion of similarity applicable to both problems. [sent-557, score-0.419]
62 It finds the operator T_{k*} which optimally represents the data of τ for the purpose of recognition on the basis of single training examples. [sent-573, score-0.28]
63 If R̂_{1_{(−∞,0]}}(T_k, S') is small on the other hand, one could merge the data S' of the new task with the data S_{k*} of the most closely matching task and retrain on S' ∪ S_{k*} to obtain an operator T to replace T_{k*} for better generalization due to the larger sample size, and keep the number K constant. [sent-586, score-0.26]
64 The proposed method has been derived from the objective of minimizing the risk functional R which is connected to classification through the identities (11) and (12). [sent-593, score-0.262]
65 (2006) appears to have such properties, but corresponding risk bounds are lacking and the relationship to the current work remains to be explored. [sent-598, score-0.206]
66 The experiments involved the recognition of randomly rotated, randomly scaled, randomly rotated and scaled, and handwritten characters, as well as spatially rotated objects and face recognition. [sent-601, score-0.39]
67 1 Parametrization and Experimental Setup. All the experiments with character recognition used gray-scale images of 28 × 28 pixels, corresponding to 784-dimensional vectors. [sent-607, score-0.227]
68 The images of the COIL100 database had 64 × 64 pixels, the images of the ATT face database had 92 × 112 pixels. [sent-608, score-0.331]
69 These two values were determined by cross validation for the problem of handwritten character recognition below and reused in all the other experiments. [sent-621, score-0.211]
70 For the training task we report the final values of the objective function Λ and the empirical risk R̂(T) = R̂_{1_{(−∞,0]}}(T). [sent-625, score-0.34]
71 The empirical risk R̂(T) = R̂_{1_{(−∞,0]}}(T) as an estimator for the true risk R(T). [sent-631, score-0.308]
72 2 Training and Application Tasks. In some of the transfer experiments the training and application tasks shared a definitive class of invariants, so that similarity of two patterns corresponds to the existence of a transformation in the class of invariants which (roughly) maps one pattern to the other. [sent-644, score-0.312]
73 Randomly rotated images of printed alpha characters were used for the training set and randomly rotated images of printed digits were used for the test set, the digit 9 being omitted for obvious reasons. [sent-646, score-0.843]
74 Randomly scaled images of printed alpha characters were used for the training set, randomly scaled images of printed digits for the test set. [sent-648, score-0.549]
75 Randomly rotated and scaled characters were used for training, randomly rotated and scaled images of printed digits for the test set, the digit 9 being omitted for obvious reasons. [sent-651, score-0.576]
76 The COIL100 database contains images of objects rotated about an axis at an angle of 60° to the optical axis. [sent-654, score-0.254]
77 The images of handwritten alpha characters from the NIST database were used for training, the handwritten digits from the MNIST database for testing. [sent-659, score-0.575]
78 Results table: for each type of experiment, the training set, the number of categories, the number of examples, R̂(T), Λ(T) and the sparsity of T; for the test set, the number of categories, the number of examples, R̂(T), the ROC area (input and T) and the error (input and T); the first column reports the rotation invariance experiment (training set alpha, 20 categories, 2000 examples). [sent-676, score-0.713]
79 Table 2: for each type of experiment (here spatial rotation), the training set, the number of categories, the number of examples, R̂(T), Λ(T) and the sparsity of T; for the test set, the number of categories, the number of examples, R̂(T), the ROC area (input and T) and the error (input and T). [sent-704, score-0.463]
80 The oracle presents the learner with pairs of these images together with a similarity value r ∈ {−1, 1}. [sent-733, score-0.466]
81 (Figure 2: Randomly rotated and scaled alpha-characters. Figure 3: The 15 largest eigenvalues of T*T in proportion. Figure 4: Randomly rotated and scaled digits.) Dissimilar ones are taken from different categories (different columns). [sent-735, score-0.401]
82 The pairs are chosen at random under the constraint that similar pairs appear with the same frequency as dissimilar ones. [sent-736, score-0.248]
83 The spectrum (Figure 3) of the resulting operator T shows a marked decrease of singular values, allowing the conclusion that the data characterizing rotation and scale invariant character categories in the chosen Gaussian kernel representation is essentially only 5-dimensional. [sent-740, score-0.477]
84 (Figure 5: ROC curve for the metric as a feature for similarity; axes: false accept of similarity vs. correct accept of similarity.) [sent-745, score-0.372]
85 To measure the empirical risk R̂(T) = R̂_{1_{(−∞,0]}}(T) we generate a random sequence of pairs as we did with the training task above and average the relative rates of dissimilar pairs with ‖Tx − Tx'‖_H ≤ 1 and the rate of similar pairs with ‖Tx − Tx'‖_H ≥ 1. [sent-747, score-0.569]
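A sketch of this evaluation, with the pairs stored as numpy arrays and T as a matrix; names are illustrative.

```python
import numpy as np

def empirical_risk(T, X, X2, r):
    """Average of the rate of similar pairs with ||Tx - Tx'|| >= 1 and the rate of
    dissimilar pairs with ||Tx - Tx'|| <= 1."""
    r = np.asarray(r)
    dist = np.linalg.norm((X - X2) @ T.T, axis=1)
    err_similar = np.mean(dist[r == 1] >= 1.0)
    err_dissimilar = np.mean(dist[r == -1] <= 1.0)
    return 0.5 * (err_similar + err_dissimilar)
```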
86 The result is 0.128 in the row labeled 'Risk T', so the representation T organizes the data of rotation and scale invariant character categories into balls of diameter 1, up to an error of about 13%. [sent-750, score-0.337]
87 5 Classification of Tasks. We report some simple results concerning the recognition of task families on the basis of the similarity of the underlying similarity concepts, as proposed in Section 5. [sent-764, score-0.469]
88 These families share the properties of rotation invariance, scale invariance, combined rotation and scale invariance, and the property of being handwritten, respectively. [sent-767, score-0.378]
89 Each entry is the empirical risk R̂(T) of the operator T trained from the alpha-task heading the column, measured on the digit-task heading the row. [sent-769, score-0.312]
90 (Table 4) The minimum in each row is attained on the diagonal, which shows that the underlying similarity property or invariance of a data-set is reliably recognized. [sent-788, score-0.272]
91 The margin of this minimum is weakened only for separate rotation and scale invariances, because the representation for combined rotation and scale invariance also performs reasonably well on these data-sets, although not as well as the specialized representations. [sent-789, score-0.394]
92 Scale invariant representations perform better on handwritten data than rotation invariant ones, probably because scale invariance is more related to the latent invariance properties of handwritten characters than full rotation invariance. [sent-790, score-0.717]
93 In contrast to LDA, hyperbolic PCA projects onto a dominant eigenspace of a weighted difference, and not the quotient, of the inter- and intra-class covariances. [sent-801, score-0.194]
94 The problem of similarity learning from pair oracles, similar to the one studied in this paper, has been considered by several authors. [sent-809, score-0.233]
95 Given a sample S = ((x_1, x_1', r_1), ..., (x_m, x_m', r_m)), one solves min_T ∑_{i: r_i = 1} ‖Tx_i − Tx_i'‖² such that ∑_{i: r_i = −1} ‖Tx_i − Tx_i'‖ ≥ 1. [sent-819, score-0.429]
96 (2002) the existence of a pair oracle to generate a sample of labeled pairs (or equivalence constraints) is implicitly assumed, without attempting to directly predict this oracle's behavior. [sent-823, score-0.231]
97 Conclusion. This work presented a technique to represent pattern similarity on the basis of data generated by a domain-dependent oracle. [sent-831, score-0.224]
98 Notation: ρ, a probability measure on X^2 × {−1, 1}; (x, x', r) ∈ X^2 × {−1, 1}, a generic triplet; S ∈ (X^2 × {−1, 1})^m with S ∼ ρ^m, a training sample; R(T) = Pr{r(1 − ‖Tx − Tx'‖²) ≤ 0}, the risk of the operator T. [sent-839, score-0.319]
99 Further notation: the d-dimensional orthogonal projections in H; the operator Q_x(z) = ⟨z, x⟩x; the maximal norm in E, that is, sup_{x∈E} ‖x‖; the empirical Rademacher complexity of F on the sample S; Rademacher variables, uniform on {−1, 1}. [sent-849, score-0.241]
100 Learning a similarity metric discriminatively, with application to face verification. [sent-927, score-0.223]
wordName wordTfidf (topN-words)
[('qx', 0.332), ('av', 0.273), ('aurer', 0.252), ('imilarity', 0.19), ('similarity', 0.186), ('tv', 0.156), ('dissimilar', 0.156), ('operator', 0.14), ('oracle', 0.138), ('risk', 0.136), ('xm', 0.125), ('rm', 0.124), ('rotated', 0.107), ('rotation', 0.106), ('rademacher', 0.097), ('characters', 0.097), ('tk', 0.097), ('images', 0.096), ('lipschitz', 0.094), ('ln', 0.09), ('operators', 0.089), ('hyperbolic', 0.088), ('invariance', 0.086), ('hilbert', 0.085), ('elementary', 0.085), ('pr', 0.085), ('pap', 0.084), ('qxi', 0.084), ('thrun', 0.083), ('maurer', 0.083), ('transfer', 0.083), ('pd', 0.082), ('handwritten', 0.08), ('ri', 0.08), ('ei', 0.08), ('err', 0.078), ('earning', 0.075), ('categories', 0.074), ('character', 0.072), ('veri', 0.071), ('pxi', 0.071), ('att', 0.071), ('bounds', 0.07), ('eigenspace', 0.07), ('printed', 0.07), ('tei', 0.07), ('nr', 0.068), ('roc', 0.066), ('norm', 0.065), ('objective', 0.065), ('positivity', 0.064), ('digits', 0.062), ('functional', 0.061), ('task', 0.06), ('xi', 0.06), ('recognition', 0.059), ('alpha', 0.058), ('descent', 0.056), ('avn', 0.056), ('chopra', 0.056), ('ellipsoid', 0.056), ('nca', 0.056), ('pixel', 0.055), ('vi', 0.054), ('lda', 0.053), ('regularization', 0.053), ('theorem', 0.052), ('database', 0.051), ('eigenvalues', 0.051), ('convex', 0.05), ('margin', 0.049), ('gradient', 0.048), ('diam', 0.047), ('oracles', 0.047), ('representation', 0.047), ('pairs', 0.046), ('tn', 0.046), ('bartlett', 0.045), ('orthonormal', 0.045), ('pca', 0.043), ('training', 0.043), ('pratt', 0.043), ('fleuret', 0.043), ('effected', 0.042), ('ker', 0.042), ('pan', 0.042), ('vd', 0.042), ('valued', 0.04), ('category', 0.038), ('draw', 0.038), ('invariant', 0.038), ('basis', 0.038), ('er', 0.038), ('face', 0.037), ('lemma', 0.037), ('obvious', 0.037), ('hinge', 0.037), ('transformations', 0.036), ('onto', 0.036), ('empirical', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
2 0.14619491 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.14393555 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
4 0.1070251 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
Author: Mikio L. Braun, Joachim M. Buhmann, Klaus-Robert Müller
Abstract: We show that the relevant information of a supervised learning problem is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem in the sense that it can asymptotically represent the function to be learned and is sufficiently smooth. Thus, kernels do not only transform data sets such that good generalization can be achieved using only linear discriminant functions, but this transformation is also performed in a manner which makes economical use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data for supervised learning problems. Practically, we propose an algorithm which enables us to recover the number of leading kernel PCA components relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to aid in model selection, and (3) to denoise in feature space in order to yield better classification results. Keywords: kernel methods, feature space, dimension reduction, effective dimensionality
5 0.1003158 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
6 0.095999487 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
7 0.083812334 92 jmlr-2008-Universal Multi-Task Kernels
8 0.081161402 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.07389532 58 jmlr-2008-Max-margin Classification of Data with Absent Features
10 0.070695385 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
11 0.07064908 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
12 0.068164855 54 jmlr-2008-Learning to Select Features using their Properties
13 0.06798742 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
14 0.066515081 96 jmlr-2008-Visualizing Data using t-SNE
15 0.065339372 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
16 0.064408757 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
17 0.062491592 56 jmlr-2008-Magic Moments for Structured Output Prediction
18 0.062198464 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
19 0.061697669 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
20 0.060643386 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
topicId topicWeight
[(0, 0.326), (1, -0.114), (2, -0.081), (3, -0.099), (4, 0.032), (5, -0.132), (6, 0.088), (7, -0.089), (8, 0.031), (9, 0.018), (10, 0.052), (11, 0.075), (12, 0.114), (13, 0.077), (14, -0.04), (15, 0.124), (16, -0.24), (17, -0.04), (18, -0.241), (19, -0.102), (20, -0.01), (21, 0.041), (22, -0.002), (23, -0.052), (24, -0.069), (25, 0.021), (26, -0.063), (27, -0.024), (28, 0.025), (29, -0.073), (30, 0.021), (31, 0.01), (32, -0.035), (33, -0.039), (34, 0.03), (35, -0.117), (36, -0.065), (37, 0.073), (38, 0.061), (39, 0.011), (40, 0.06), (41, -0.004), (42, 0.06), (43, 0.112), (44, -0.077), (45, -0.05), (46, -0.085), (47, -0.149), (48, 0.039), (49, 0.143)]
simIndex simValue paperId paperTitle
same-paper 1 0.92962992 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
2 0.64945447 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.56607056 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
4 0.45655513 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms, enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
5 0.40794653 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
6 0.36709553 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
7 0.36479867 96 jmlr-2008-Visualizing Data using t-SNE
8 0.36432174 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
9 0.35962769 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
10 0.35135648 56 jmlr-2008-Magic Moments for Structured Output Prediction
11 0.34213978 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
12 0.33896104 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
13 0.33575654 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
14 0.32832 58 jmlr-2008-Max-margin Classification of Data with Absent Features
15 0.32729691 92 jmlr-2008-Universal Multi-Task Kernels
16 0.31659609 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.3088994 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
18 0.30734757 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
19 0.30549318 87 jmlr-2008-Stationary Features and Cat Detection
topicId topicWeight
[(0, 0.029), (1, 0.36), (5, 0.031), (31, 0.015), (40, 0.05), (54, 0.049), (58, 0.035), (66, 0.065), (76, 0.022), (78, 0.027), (88, 0.088), (92, 0.065), (94, 0.053), (99, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.80428839 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
2 0.39091644 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.3877928 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.38326666 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
5 0.38264298 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
6 0.3823719 57 jmlr-2008-Manifold Learning: The Price of Normalization
7 0.38170707 56 jmlr-2008-Magic Moments for Structured Output Prediction
8 0.38037109 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.37929726 86 jmlr-2008-SimpleMKL
10 0.37923992 9 jmlr-2008-Active Learning by Spherical Subdivision
11 0.37888509 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
12 0.37672237 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.37296692 83 jmlr-2008-Robust Submodular Observation Selection
14 0.37238181 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
17 0.36739078 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
18 0.36718968 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
19 0.36619481 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
20 0.36496076 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks