jmlr jmlr2009 jmlr2009-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
Reference: text
sentIndex sentText sentNum sentScore
1 This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. [sent-12, score-0.247]
2 We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. [sent-13, score-0.149]
3 Recently, it has been realized that unlabeled instances, when used in conjunction with a small amount of labeled instances, can deliver considerable improvements in learning performance compared to using labeled instances alone. [sent-19, score-0.234]
4 Each group is equipped with information on class label proportions. [sent-32, score-0.115]
5 This type of learning problem appears in areas like e-commerce, politics, spam filtering and improper content detection, as we illustrate below. [sent-34, score-0.247]
6 Obviously sending out discount coupons will increase sales, but sending coupons to customers who would have purchased the goods anyway decreases the margins. [sent-36, score-0.191]
7 Alternatively, failing to send coupons to customers who would only buy in case of a discount reduces overall sales. [sent-37, score-0.135]
8 The mixing proportions can be reliably estimated using random assignment to control and treatment groups. [sent-41, score-0.38]
9 They can rely on a set of always-favorable voters who will favor them regardless, plus a set of swing voters who will make their decision dependent on what the candidates offer. [sent-44, score-0.141]
10 Data sets of spam are likely to contain almost pure spam (this is achieved e.g. [sent-50, score-0.366]
11 by listing e-mails as spam bait), while users' inboxes typically contain a mix of spam and non-spam. [sent-52, score-0.393]
12 In many cases it is possible to estimate the proportions of spam and non-spam in a user’s inbox much more cheaply than the actual labels. [sent-54, score-0.485]
13 We would like to use this information to categorize e-mails into spam and non-spam. [sent-55, score-0.183]
14 However, the rest of the images on the web (those not labeled) form a far larger data set, albeit without labels (after all, this is what we would like to estimate the labels for). [sent-58, score-0.118]
15 That said, it is considerably cheaper to obtain a good estimate of the proportions of proper and improper content, in addition to having one data set of images that is likely improper content. [sent-59, score-0.383]
16 Problem Definition: In this paper, we present a method that makes use of the knowledge of label proportions directly. [sent-62, score-0.335]
17 As motivated by the above examples, our method would be practically useful in many domains such as identifying potential customers, potential voters, spam e-mails and improper images. [sent-63, score-0.247]
18 More specifically, we may not require any labels to be known, only their proportions within each of the involved data sets. [sent-66, score-0.335]
19 Finally, it is possible to apply our method to problems where the test label proportions are unknown, too. [sent-68, score-0.383]
20 This simple modification allows us to use this technique whenever covariate shift via label bias is present. [sent-69, score-0.114]
21 Formally, in a learning from proportions setting, we are given n sets of observations Xi = {x1(i), . . . , xmi(i)} [sent-70, score-0.299]
22 of respective sample sizes mi (calibration sets), i = 1, . . . , n. [sent-73, score-0.147]
23 Moreover, we are given the fractions πiy of labels y ∈ Y (|Y | ≤ n) contained in each set Xi . [sent-80, score-0.101]
24 These fractions form a full (column) rank mixing matrix, π ∈ Rn×|Y | with the constraint that each row sums up to 1 and all entries are nonnegative. [sent-81, score-0.244]
25 We have X1 = “mail in spam box” (only spam) and X2 = “mail in inbox” (spam mixed with non-spam). [sent-86, score-0.183]
26 Also suppose that we may know the proportion of spam vs non-spam in our inbox is 1 : 9. [sent-87, score-0.258]
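To make this setup concrete, a small illustrative sketch of such a mixing matrix (using the 1 : 9 inbox proportion above as a stand-in; this is added illustration, not code from the paper) is:

```python
import numpy as np

# Rows index the observation sets (bags), columns index the labels {spam, non-spam}.
pi = np.array([[1.0, 0.0],    # X1 = spam box: contains only spam
               [0.1, 0.9]])   # X2 = inbox: assumed 10% spam, 90% non-spam
assert np.allclose(pi.sum(axis=1), 1.0)           # each row sums to 1 (row-stochastic)
assert np.linalg.matrix_rank(pi) == pi.shape[1]   # full column rank, so the system is solvable
```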
27 In the spam filtering example, we have no pure non-spam instances. [sent-99, score-0.183]
28 In other words, we assume that the conditional distribution of x is independent of the index i, as long as we know the label y. [sent-101, score-0.109]
29 In doing so, we are able to design estimators with the same performance guarantees in terms of uniform convergence as those with full access to the label information. [sent-106, score-0.176]
30 the test set X is one of the calibration data sets Xi ) or whenever φ(x, y) factorizes into ψ(x) ⊗ ϕ(y). [sent-139, score-0.16]
31 Table 1 (notations used in the paper) lists, among others, the test set labels {y1, . . . , ym}, the number of observations in the test set X, the proportion of label y in set i, and the map from (x, y) to a Hilbert space. [sent-152, score-0.244]
32 In the above equation (3), we form a mixing matrix π with elements πiy = p(y|i). [sent-161, score-0.164]
33 However, the same convergence results governing the convergence of µXY to µxy also hold for the convergence of the empirical µset[i, y′] to its population counterpart. [sent-171, score-0.105]
34 Big Picture: Overall, our strategy is as follows: use empirical means on the bags Xi to approximate expectations with respect to the bag distribution. [sent-175, score-0.178]
35 Use the latter to compute expectations with respect to a given label and, finally, use the means conditional on the label distribution to obtain µxy, which is a good proxy for µXY (see Algorithm 1), i.e. [sent-176, score-0.189]
36 As we shall see, whenever there are considerably more bags than classes we can exploit the overdetermined system to our advantage to reduce the overall estimation error and use a rescaled version of (4). [sent-181, score-0.243]
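In other words, the estimation step reduces to a linear system in the class means: stacking the empirical bag means gives µset ≈ π µclass, which can be solved by (pseudo-)inverting the mixing matrix. The following NumPy sketch of that step uses a placeholder feature map and toy data; it illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def class_means_from_bags(bags, pi, phi):
    """Recover per-class mean embeddings from per-bag empirical means.

    bags: list of arrays, bags[i] has shape (m_i, d)  -- the observations X_i
    pi:   array of shape (n, k), row-stochastic       -- label proportions per bag
    phi:  feature map applied row-wise, R^d -> R^p
    """
    mu_set = np.stack([phi(X).mean(axis=0) for X in bags])  # empirical bag means, (n, p)
    # Solve pi @ mu_class = mu_set; with n = k and full rank this is plain inversion,
    # otherwise the least-squares solution plays the role of the pseudo-inverse.
    mu_class, *_ = np.linalg.lstsq(pi, mu_set, rcond=None)  # (k, p)
    return mu_class

# Toy usage: one pure bag and one 10/90 mixture, linear feature map.
rng = np.random.default_rng(0)
bag_pure = rng.normal(loc=+1.0, size=(200, 5))
bag_mixed = np.vstack([rng.normal(loc=+1.0, size=(20, 5)),
                       rng.normal(loc=-1.0, size=(180, 5))])
pi = np.array([[1.0, 0.0], [0.1, 0.9]])
mu_class = class_means_from_bags([bag_pure, bag_mixed], pi, phi=lambda X: X)
```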
37 They arise whenever the matrix π has special structure or whenever the test set and one of the training sets coincide. [sent-184, score-0.155]
38 Moreover, we may encounter situations where the fractions of observations in the test set are unknown and we would like, nonetheless, to find a good proxy for µXY . [sent-185, score-0.173]
39 In general, the kernel function on inputs and labels can be different. [sent-206, score-0.093]
40 Specifically, for a label diagonal kernel k(y, y′ ) = δ(y, y′ ), the standard winner-takes-all multiclass classification is recovered (Tsochantaridis et al. [sent-207, score-0.114]
41 If we moreover assume that X1 only contains class 1 and X2 = X contains a mixture of classes with labels 1 and −1 with proportions p(1) =: ρ and p(−1) = 1 − ρ respectively, we obtain the mixing matrix π = [1, 0; ρ, 1−ρ], whose inverse is π−1 = [1, 0; −ρ/(1−ρ), 1/(1−ρ)]. [sent-214, score-0.513]
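As a quick numerical sanity check of this closed form (an illustration added here, not taken from the paper), one can verify that the stated inverse indeed undoes the mixing for an assumed spam fraction ρ:

```python
import numpy as np

rho = 0.3                                    # assumed spam fraction in the mixed bag
pi = np.array([[1.0, 0.0],
               [rho, 1.0 - rho]])
pi_inv = np.array([[1.0, 0.0],
                   [-rho / (1.0 - rho), 1.0 / (1.0 - rho)]])
assert np.allclose(pi_inv @ pi, np.eye(2))   # the closed-form inverse checks out
# mu_class = pi_inv @ mu_set then recovers the two class means from the two bag means.
```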
42 one set containing spam whereas the other one containing an unlabeled mix of spam and non-spam, allows one to obtain the sufficient statistics needed for estimation. [sent-218, score-0.434]
43 Overdetermined Systems: Assume that we have significantly more bags n than class labels |Y|, possibly with varying numbers of observations mi per bag. [sent-220, score-0.29]
44 In this case it would make sense to find a weighting of the bags such that those which are largest and most relevant for the test set are given the highest degree of importance. [sent-221, score-0.145]
45 Remark 1 (Underdetermined Systems) Similarly, when we have fewer bags n than class labels |Y|, we can state the problem as a regularized least squares problem: minimize over µclass the objective ∑i=1..n ‖µset[i] − ∑y∈Y πiy µclass[y]‖² + λ Ω(µclass[y] ∀ y ∈ Y). [sent-229, score-0.191]
46 This makes sense whenever different labels have related means µclass[y]. [sent-231, score-0.093]
47 Altun and Smola (2006) state the following result: Theorem 3 (Convergence of Empirical Means) Denote by φ : X → B a map into a Banach space B, denote by B∗ its dual space, and let F be the class of linear functions on B with B∗ norm bounded by 1. [sent-250, score-0.106]
48 In the more general case of overdetermined systems we have ε̂ = ∑y p(y) [(π⊤W π)−1 π⊤W ε]y. [sent-260, score-0.112]
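A sketch of the corresponding weighted least-squares step is given below; the diagonal weight matrix W here is a simple heuristic (bag sizes), not the error-minimizing choice derived later, and the code is illustrative rather than the paper's implementation.

```python
import numpy as np

def weighted_class_means(mu_set, pi, w):
    """Solve pi @ mu_class ~= mu_set in the W-weighted least-squares sense.

    mu_set: (n, p) empirical bag means
    pi:     (n, k) mixing matrix
    w:      (n,) nonnegative bag weights, e.g. proportional to the bag sizes m_i
    """
    W = np.diag(w)
    A = pi.T @ W @ pi                 # (k, k)
    b = pi.T @ W @ mu_set             # (k, p)
    return np.linalg.solve(A, b)      # mu_class = (pi^T W pi)^{-1} pi^T W mu_set
```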
49 Special Cases: A closed form solution in the general case is not particularly useful since it depends heavily on the kernel k, the mixing proportions π and the class probabilities on the test set. [sent-268, score-0.497]
50 Combining several bounds we have the following theorem: Theorem 5 Assume that we have n sets of observations Xi of size mi, each of which is drawn from distributions with probabilities πiy of observing data with label y. [sent-279, score-0.205]
51 It is easy to see that whenever n = |Y| and π has full rank there is only one possible solution regardless of the choice of W. [sent-305, score-0.111]
52 For overdetermined systems the choice of W may greatly affect the quality of the solution and it is therefore desirable to choose a weighting which minimizes the error in estimating µXY. [sent-306, score-0.112]
53 This follows from the fact that all bags are drawn independently of each other. [sent-308, score-0.097]
54 Note that the optimal value of W depends both on the mixtures of the bags πi and on the propensity of each class p(y). [sent-313, score-0.132]
55 Corollary 8 The deviation between θ∗, as defined in (1), and θ̂∗, the minimizer of the approximate log-posterior using µ̂XY rather than µXY, is bounded by O(m−1/2 + ∑i mi−1/2). [sent-326, score-0.101]
56 Using the bound L(θ̂∗, µ) − L(θ∗, µ) ≤ ‖θ̂∗ − θ∗‖ ‖µ − µ̂‖ yields the following bound for the log-posterior: Corollary 9 The minimizer θ̂∗ of the approximate log-posterior using µ̂XY rather than µXY incurs a penalty bounded in terms of ‖µ̂XY − µXY‖². [sent-328, score-0.098]
57 Let ∆ be the perturbation matrix such that the perturbed mixing matrix π̃ is related to the original mixing matrix π by π̃ = π + ∆. [sent-331, score-0.51]
58 Note that the perturbed mixing matrix π̃ still needs to have non-negative entries and each row sums up to 1, π̃1 = 1. [sent-332, score-0.212]
59 The stochasticity constraint on the perturbed mixing matrix imposes special structure on the perturbation matrix, i.e. [sent-333, score-0.307]
60 each row of the perturbation matrix must sum up to 0, ∆1 = 0. [sent-335, score-0.134]
61 Let θ̂∗ be the minimizer of (2) with the mean µXY approximated via the mixing matrix π. [sent-336, score-0.21]
62 Similarly, define θ̃∗ for µ̃XY with mixing matrix π̃. [sent-337, score-0.164]
63 Our perturbation bound relies on Lemma 7 and on the fact that we can bound the errors made in computing a (pseudo-)inverse of a matrix: Lemma 10 (Stability of Inverses) For any matrix norm ‖·‖ [sent-339, score-0.213]
64 and full column rank matrices π and π + ∆, the error between the pseudo-inverses of π and π + ∆ is bounded by ‖π† − (π + ∆)†‖ ≤ µ ‖π†‖ ‖(π + ∆)†‖ ‖∆‖, where µ denotes a scalar constant depending on the matrix norm. [sent-345, score-0.116]
65 Remark 12 For full rank matrices, the constant term µ in Theorem 11 is equal to unity regardless of the matrix norm considered (Wedin, 1973). [sent-349, score-0.143]
66 For a full rank mixing matrix π, the following bound holds: ‖µ̂XY − µ̃XY‖σ∞ ≤ ‖Vy,p‖σ∞ ‖π−1‖σ∞ ‖∆‖σ∞ ‖(π + ∆)−1‖σ∞ (∑i,j ⟨µset[i], µset[j]⟩X)^(1/2). [sent-356, score-0.267]
67 The last inequality holds for a full column rank mixing matrix, again with the factor (∑i,j ⟨µset[i], µset[j]⟩X)^(1/2). [sent-365, score-0.202]
68 It is clear from (13) and (14) that the stability of our algorithm under perturbation will depend on the size of the perturbation and on the behavior of the (pseudo-)inverse of the perturbed mixing matrix. [sent-368, score-0.403]
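This dependence can be probed empirically with a small experiment (an added illustration, not one of the paper's experiments): draw a perturbation whose rows sum to zero, and compare the pseudo-inverses of the original and perturbed mixing matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.5, 0.5]])                        # 3 bags, 2 classes
delta = rng.normal(scale=0.02, size=pi.shape)
delta -= delta.mean(axis=1, keepdims=True)         # enforce Delta @ 1 = 0 (rows sum to zero)
pi_tilde = pi + delta                              # perturbed mixing matrix (still ~nonnegative here)
gap = np.linalg.norm(np.linalg.pinv(pi) - np.linalg.pinv(pi_tilde), 2)
print(f"perturbation size {np.linalg.norm(delta):.4f} -> pseudo-inverse gap {gap:.4f}")
```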
69 Note that by the triangle inequality, the distance in (12) can be decomposed as ‖θ∗ − θ̃∗‖ ≤ ‖θ∗ − θ̂∗‖ + ‖θ̂∗ − θ̃∗‖, and the second term on the RHS vanishes whenever the size of the perturbation ∆ is zero. [sent-369, score-0.129]
70 Extensions: We describe two types of extensions of our proposed estimator: function spaces and unknown label proportions on the test sets. [sent-371, score-0.383]
71 In the conditional case, p denotes the collection of probabilities p(y|xi) and the operator Ez∼p[φ(z)] = (1/m) ∑i=1..m Ey∼p(y|xi)[φ(xi, y)] is the conditional expectation operator on the set of observations. [sent-382, score-0.124]
72 The key point is that our reasoning of estimating µXY based on an aggregate of samples with unknown labels but known label proportions is still applicable. [sent-398, score-0.442]
73 Unknown Test Label Proportions: In many practical applications we may not actually know the label proportions on the test set. [sent-400, score-0.383]
74 For instance, when deploying the algorithm to assess the spam in a user's mailbox we will not know what the fraction would be. [sent-401, score-0.209]
75 This means that we need to estimate those proportions in addition to the class means µclass . [sent-403, score-0.29]
76 This means that as long as the conditional distributions p(x|y) are different for different choices of y we will be able to recover the test label proportions by the simple procedure of minimizing the distance between µ[p] and ∑y αy µ[p(x|y)]. [sent-409, score-0.412]
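A minimal sketch of that procedure (assuming the class means µclass[y] are already available; a nonnegative least-squares fit followed by renormalization is used here as a simple stand-in for the exact simplex-constrained problem in (15)):

```python
import numpy as np
from scipy.optimize import nnls

def estimate_test_proportions(X_test, mu_class, phi):
    """Estimate label proportions on the test set by matching mean embeddings.

    X_test:   (m, d) unlabeled test observations
    mu_class: (k, p) class mean embeddings mu_class[y]
    phi:      feature map applied row-wise, R^d -> R^p
    """
    mu_test = phi(X_test).mean(axis=0)         # empirical mean map of the test set, (p,)
    # min_alpha || mu_test - mu_class^T alpha ||^2  s.t. alpha >= 0, then renormalize to the simplex
    alpha, _ = nnls(mu_class.T, mu_test)
    return alpha / alpha.sum()
```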
77 Note that obviously (15) may be used separately from the previous discussion, that is, when the training proportions are known but the test proportions are not. [sent-415, score-0.586]
78 Transduction: In transduction one attempts to solve a related problem: the patterns xi on the test set are known, usually some label proportions on the test set are also known, but the actual labels on the test set are not known. [sent-424, score-0.673]
79 One way of tackling this problem is to perform transduction by enforcing a proportionality constraint on the unlabeled data, e.g. [sent-425, score-0.111]
80 Self Consistent Proportions: Kück and de Freitas (2005) introduced a more informative variant of binary multiple-instance learning, in which groups of instances are given along with estimates of the fraction of positively-labeled instances per group. [sent-434, score-0.185]
81 Such a procedure is likely to perform well when a large number of bags is present. [sent-436, score-0.097]
82 Instead, we have the property that i ⊥⊥ x | y, that is, the distribution over x for a given class label does not depend on the bag. [sent-450, score-0.115]
83 The intuition is that since the bags X1 and X2 do contain some information about how the two classes differ, we should be able to use this information to distinguish between different class labels. [sent-466, score-0.132]
84 While, in principle, this approach is flawed because of the incorrect conditional independence assumptions, it can still lead to acceptable results whenever each of the bags contains one majority class. [sent-472, score-0.16]
85 In scenario A we use class 1 exclusively in X1 , class 2 exclusively in X2 , and a mix of all three classes weighted by (0. [sent-485, score-0.097]
86 2, the quality of our minimizer of the negative log-posterior depends on the mixing matrix and this is noticeable in the reduction of performance for the dense mixing matrix (scenario B) in comparison to the better conditioned sparse mixing matrix (scenario A). [sent-537, score-0.538]
87 Unknown test label proportions: In this experiment, we use binary and three-class classification data sets with the same split procedure as in the previous experiment but we select testing examples by a biased procedure to introduce unknown test label proportions. [sent-539, score-0.286]
88 Since we are particularly interested in assessing the effectiveness of our test proportion estimation method, in solving (15) we assume that we can compute µclass[y] directly, i.e. [sent-550, score-0.102]
89 Overdetermined systems: Here we are interested in assessing the performance of our estimator with optimized weights when we have more data sets n than class labels |Y|, with a varying number of observations mi per data set. [sent-821, score-0.219]
90 Table: square errors (MSE) of the test proportion estimates on the binary data sets australian, breastcancer, adult, derm, gisette and wdbc. [sent-861, score-0.262]
91 Square errors of estimating the test proportions on UCI/LibSVM data sets. [sent-880, score-0.303]
92 Stability of Mixing Matrices: Lastly, we are interested in assessing the performance of our proposed method when the given mixing matrix π is perturbed so that it does not exactly match how the data is generated. [sent-882, score-0.238]
93 We used binary classification data sets and defined the perturbed mixing matrices accordingly; the accompanying table lists the unweighted and weighted estimator errors on the binary data sets wdbc, australian, svmguide3, gisette and splice. [sent-883, score-0.325]
94 Errors of weighted/unweighted estimators for overdetermined systems on UCI/LibSVM data sets. [sent-916, score-0.138]
95 Note that the unperturbed mixing matrix refers to the case of {ε1, ε2} = {0, 0}. [sent-936, score-0.164]
96 Conclusion: In this paper we obtained a rather surprising result, namely that it is possible to consistently reconstruct the labels of a data set if we can only obtain information about the proportions of occurrence of each class (in at least as many data aggregates as there are classes). [sent-940, score-0.378]
97 In particular, we proved that up to constants, our algorithm enjoys the same rates of convergence afforded to methods which have full access to all label information. [sent-941, score-0.15]
98 While this only provides aggregate information about the proportions of drug users, such data, in combination with detailed demographic information, might be used to perform more detailed inference with regard to the propensity of individuals to use controlled substances. [sent-949, score-0.33]
99 Figure 2 (panels include (c) Breastcancer): performance accuracy of binary classification data sets (n = |Y| = 2) as a function of the amount of perturbation applied to the mixing matrix, measured by ‖∆‖2 with tr(∆⊤∆) and ∆ = π̃ − π. [sent-954, score-0.25]
100 5}, for example, the red colored plot refers to the performance when only the label proportions of the first set are perturbed. [sent-972, score-0.335]
wordName wordTfidf (topN-words)
[('xy', 0.629), ('proportions', 0.255), ('spam', 0.183), ('aetano', 0.181), ('uadrianto', 0.181), ('abels', 0.167), ('roportions', 0.167), ('stimating', 0.167), ('mola', 0.154), ('abel', 0.142), ('iy', 0.13), ('mixing', 0.125), ('overdetermined', 0.112), ('bags', 0.097), ('perturbation', 0.095), ('kde', 0.083), ('label', 0.08), ('altun', 0.078), ('smola', 0.073), ('transduction', 0.07), ('improper', 0.064), ('labels', 0.059), ('rademacher', 0.058), ('mcmc', 0.056), ('adult', 0.056), ('breastcancer', 0.056), ('caetano', 0.056), ('coupons', 0.056), ('freitas', 0.056), ('quadrianto', 0.056), ('senseit', 0.056), ('voters', 0.056), ('mi', 0.055), ('australia', 0.049), ('aggregate', 0.048), ('australian', 0.048), ('perturbed', 0.048), ('test', 0.048), ('inbox', 0.047), ('dud', 0.047), ('instances', 0.047), ('minimizer', 0.046), ('customers', 0.045), ('map', 0.044), ('observations', 0.044), ('rank', 0.042), ('fractions', 0.042), ('gisette', 0.042), ('musicant', 0.042), ('novi', 0.042), ('wedin', 0.042), ('expectations', 0.041), ('unlabeled', 0.041), ('banach', 0.04), ('bag', 0.04), ('stability', 0.04), ('ex', 0.04), ('factorizes', 0.039), ('calibration', 0.039), ('proxy', 0.039), ('matrix', 0.039), ('xi', 0.037), ('class', 0.035), ('convergence', 0.035), ('full', 0.035), ('ck', 0.034), ('discount', 0.034), ('whenever', 0.034), ('kernel', 0.034), ('operator', 0.033), ('wdbc', 0.032), ('binary', 0.03), ('swing', 0.029), ('teo', 0.029), ('reconstruct', 0.029), ('conditional', 0.029), ('obviously', 0.028), ('proportion', 0.028), ('chiaia', 0.028), ('derm', 0.028), ('ledoux', 0.028), ('politics', 0.028), ('recalibrating', 0.028), ('sriperumbudur', 0.028), ('tib', 0.028), ('averages', 0.028), ('chen', 0.028), ('norm', 0.027), ('demographic', 0.027), ('mix', 0.027), ('ltering', 0.027), ('dna', 0.027), ('hilbert', 0.027), ('estimates', 0.027), ('bound', 0.026), ('labeled', 0.026), ('bounds', 0.026), ('estimators', 0.026), ('assess', 0.026), ('subsequently', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
2 0.093986996 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
3 0.0747381 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
Author: Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, Nicholas Roy
Abstract: To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation enables us to capture real-world dynamics in a sufficient manner to produce good performance. Keywords: reinforcement learning, provably efficient learning
4 0.069748871 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
5 0.06845706 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang
Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semiinner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-innerproducts, representer theorems
6 0.063077308 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
7 0.062492728 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
8 0.061814684 38 jmlr-2009-Hash Kernels for Structured Data
9 0.058630645 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
10 0.054585651 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
11 0.042550225 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
12 0.04207487 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
14 0.041449547 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
15 0.040561743 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
17 0.036176652 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
18 0.035627875 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
19 0.034766242 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
20 0.033348389 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
topicId topicWeight
[(0, 0.206), (1, -0.061), (2, 0.08), (3, -0.055), (4, -0.045), (5, 0.006), (6, 0.014), (7, -0.027), (8, 0.001), (9, -0.024), (10, 0.115), (11, -0.007), (12, 0.047), (13, -0.184), (14, -0.029), (15, -0.084), (16, 0.22), (17, 0.002), (18, 0.086), (19, -0.019), (20, -0.117), (21, 0.026), (22, -0.076), (23, -0.126), (24, 0.04), (25, -0.057), (26, 0.135), (27, 0.108), (28, 0.035), (29, 0.111), (30, 0.2), (31, 0.111), (32, -0.004), (33, -0.315), (34, 0.003), (35, 0.078), (36, 0.137), (37, -0.219), (38, -0.071), (39, -0.15), (40, -0.278), (41, 0.003), (42, -0.075), (43, -0.077), (44, 0.008), (45, -0.101), (46, 0.263), (47, 0.104), (48, -0.117), (49, -0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.94676769 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
2 0.38842496 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér
Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics
4 0.30207914 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
Author: Junhui Wang, Xiaotong Shen, Wei Pan
Abstract: In classification, semisupervised learning usually involves a large amount of unlabeled data with only a small number of labeled data. This imposes a great challenge in that it is difficult to achieve good classification performance through labeled data alone. To leverage unlabeled data for enhancing classification, this article introduces a large margin semisupervised learning method within the framework of regularization, based on an efficient margin loss for unlabeled data, which seeks efficient extraction of the information from unlabeled data for estimating the Bayes decision boundary for classification. For implementation, an iterative scheme is derived through conditional expectations. Finally, theoretical and numerical analyses are conducted, in addition to an application to gene function prediction. They suggest that the proposed method enables to recover the performance of its supervised counterpart based on complete data in rates of convergence, when possible. Keywords: difference convex programming, classification, nonconvex minimization, regularization, support vectors
5 0.29268208 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon
Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness
6 0.28780431 38 jmlr-2009-Hash Kernels for Structured Data
7 0.28330138 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
8 0.28088126 23 jmlr-2009-Discriminative Learning Under Covariate Shift
9 0.26392168 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning
10 0.25613022 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.24709803 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
12 0.24529986 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
14 0.21703336 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
15 0.20944469 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
16 0.20318753 15 jmlr-2009-Cautious Collective Classification
17 0.20247312 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
18 0.17872296 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
19 0.17695042 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
20 0.17653586 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
topicId topicWeight
[(8, 0.025), (11, 0.017), (21, 0.019), (26, 0.014), (38, 0.503), (47, 0.01), (52, 0.069), (55, 0.049), (58, 0.033), (66, 0.097), (68, 0.016), (90, 0.047), (96, 0.028)]
simIndex simValue paperId paperTitle
Author: Brian Tanner, Adam White
Abstract: RL-Glue is a standard, language-independent software package for reinforcement-learning experiments. The standardization provided by RL-Glue facilitates code sharing and collaboration. Code sharing reduces the need to re-engineer tasks and experimental apparatus, both common barriers to comparatively evaluating new ideas in the context of the literature. Our software features a minimalist interface and works with several languages and computing platforms. RL-Glue compatibility can be extended to any programming language that supports network socket communication. RL-Glue has been used to teach classes, to run international competitions, and is currently used by several other open-source software and hardware projects. Keywords: reinforcement learning, empirical evaluation, standardization, open source
2 0.96072143 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
Author: Francis Maes
Abstract: N IEME,1 In this paper we introduce a machine learning library for large-scale classification, regression and ranking. N IEME relies on the framework of energy-based models (LeCun et al., 2006) which unifies several learning algorithms ranging from simple perceptrons to recent models such as the pegasos support vector machine or l1-regularized maximum entropy models. This framework also unifies batch and stochastic learning which are both seen as energy minimization problems. N IEME can hence be used in a wide range of situations, but is particularly interesting for large-scale learning tasks where both the examples and the features are processed incrementally. Being able to deal with new incoming features at any time within the learning process is another original feature of the N IEME toolbox. N IEME is released under the GPL license. It is efficiently implemented in C++, it works on Linux, Mac OS X and Windows and provides interfaces for C++, Java and Python. Keywords: large-scale machine learning, classification, ranking, regression, energy-based models, machine learning software
same-paper 3 0.8635534 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
Author: Abhik Shah, Peter Woolf
Abstract: In this paper, we introduce PEBL, a Python library and application for learning Bayesian network structure from data and prior knowledge that provides features unmatched by alternative software packages: the ability to use interventional data, flexible specification of structural priors, modeling with hidden variables and exploitation of parallel processing. PEBL is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com. Keywords: Bayesian networks, python, open source software
5 0.54413372 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
Author: Davis E. King
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools. Keywords: kernel-methods, svm, rvm, kernel clustering, C++, Bayesian networks
6 0.45196936 38 jmlr-2009-Hash Kernels for Structured Data
7 0.4198015 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
8 0.41931358 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
9 0.41259873 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
10 0.40186122 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.39578238 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
12 0.39339027 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
13 0.39268503 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
14 0.39217642 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
15 0.39096031 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
16 0.38923427 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
17 0.38718382 15 jmlr-2009-Cautious Collective Classification
18 0.38546276 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
19 0.38233864 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
20 0.38207689 9 jmlr-2009-Analysis of Perceptron-Based Active Learning