jmlr jmlr2009 jmlr2009-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
Reference: text
sentIndex sentText sentNum sentScore
1 As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. [sent-13, score-0.315]
2 Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). [sent-15, score-0.537]
3 Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization. [sent-16, score-0.369]
4 Figure 1: The problem of quick classification in the presence of myriad classes: How can a system quickly classify a given instance, specified by a feature vector x ∈ R^n, into a small subset of classes from among possibly millions of candidate classes (shown by small circles)? [sent-25, score-0.345]
5 During classification, given an instance containing certain features, the index is used (“looked up”) much like a typical inverted index for document retrieval would be. [sent-55, score-0.431]
6 In our indexing algorithms it is the features that update and normalize their connections to the classes. [sent-124, score-0.376]
7 In a multiclass calendar scheduling task (Blum, 1997), Blum investigates an algorithm in which each feature votes for (connects to) the majority class in the past 5 classes seen for that feature (the classes of the most recent 5 instances in which the feature was active). [sent-148, score-0.798]
8 However, in our case, the classes (to be indexed), unlike the documents, are implicit, indirectly specified by the training instances (the instances are not the “documents” to be indexed), and the index construction becomes 1. [sent-185, score-0.524]
9 As one simple consequence, the presence of a feature in a training instance that belongs to class c does not imply that the feature will point to class c in the index learned. [sent-188, score-0.558]
10 Indexing could be used to index already trained (say linear) classifiers, but the issues of space and time efficient learning remain, and accuracy can suffer when using binary classifiers for class ranking (see Section 6. [sent-195, score-0.335]
11 Each training instance is specified by a vector of feature values, vx, as well as a class (or assigned label), cx, that the instance belongs to. [sent-243, score-0.572]
12 vx [ f ] denotes the value of feature f in the vector of features of instance x, where vx [ f ] ≥ 0. [sent-247, score-0.605]
13 If vx [ f ] > 0, we say feature f is active (in instance x), and denote this aspect by f ∈ x. [sent-248, score-0.396]
14 Thus the vector vx corresponding to a document lives in an |F | dimensional space, where vx [i] = k iff the word with id i (corresponding to dimension i) appears k times in the document, where k ≥ 0 (other possibilities for feature values include Boolean and TFIDF weighting). [sent-263, score-0.477]
15 Therefore, in typical text categorization tasks, the number of active features in an instance is the number of unique words that appear in the corresponding document. [sent-264, score-0.317]
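To ground the representation described above, here is a minimal sketch (not from the paper; the tokenizer and example text are illustrative assumptions) of turning a document into a sparse vector vx of word counts, where a feature is active exactly when its value is positive:

```python
from collections import Counter

def to_sparse_vector(document):
    """Map a document to a sparse vector v_x: word (feature id) -> count.

    Only active features (positive values) are stored, so the number of
    stored entries equals the number of unique words in the document.
    """
    tokens = document.lower().split()  # deliberately simplistic tokenizer
    return dict(Counter(tokens))

v_x = to_sparse_vector("the index maps features to classes and the index is sparse")
print(v_x)        # e.g. {'the': 2, 'index': 2, 'maps': 1, ...}
print(len(v_x))   # number of active features in this instance
```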
16 The number of classes is so large that indexing them, not unlike the inverted index used for retrieval of documents and other object types, is a plausible approach. [sent-319, score-0.47]
17 Figure 3 presents the basic cycle of categorization via index look up and learning via index updating (adjustments to connection weights). [sent-329, score-0.471]
18 (For each active feature f of x, i.e., vx[f] > 0): For the first dmax classes with highest connection weight to f: 2. [sent-348, score-0.463]
19 Retrieve, score, and rank classes via connection weight to f, for the active features of x: 1. [sent-349, score-0.409]
20 The score that a class c receives, s_c, can be written as s_c = ∑_{f ∈ x} r_f × w_{f,c} × v_x[f]  (1), where r_f is a measure of the predictive power or the rating of feature f, and we describe a method for computing it in Section 4. [sent-373, score-0.441]
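The scoring in Equation 1 can be sketched as follows; this is an illustrative reconstruction, not the authors' implementation, and it assumes a plain dictionary index, a uniform feature rating r_f = 1 by default, and the hypothetical feature and class names shown:

```python
from collections import defaultdict

# index[f] is a short list of (class, weight) pairs, kept sorted by weight.
index = {
    "goal":   [("sports", 0.7), ("politics", 0.1)],
    "vote":   [("politics", 0.8)],
    "league": [("sports", 0.9)],
}

def ranked_retrieval(v_x, index, r=None, d_max=25):
    """Score classes as s_c = sum over active f of r_f * w_{f,c} * v_x[f]."""
    scores = defaultdict(float)
    for f, value in v_x.items():               # only active features are touched
        r_f = 1.0 if r is None else r.get(f, 1.0)
        for c, w in index.get(f, [])[:d_max]:  # at most d_max classes per feature
            scores[c] += r_f * w * value
    # Rank the (at most |v_x| * d_max) scored classes by decreasing score.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(ranked_retrieval({"goal": 1.0, "league": 2.0}, index))
```

With l active features and d connections per feature, only O(dl) class scores are touched, and sorting them gives the O(dl log(dl)) cost quoted in the abstract.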
21 In a sense, each active feature casts votes for a subset of the classes, and those classes receive and tally their incoming votes (scores). [sent-379, score-0.364]
22 The positive scoring classes can then be ranked by their score, or, if it suffices, only the maximum scoring class can be kept track of and reported. [sent-381, score-0.32]
23 On instance x, and with d connections per feature in the index, there can be at most |vx |d unique classes scored. [sent-386, score-0.407]
24 In our implementation, for each active feature, only at most the dmax classes (25 in our experiments) with highest connection weights to the feature participate in scoring. [sent-388, score-0.569]
25 An edge connecting feature f to class c has a positive weight denoted by w f ,c , or wi, j for feature i and class j, and corresponds to a list entry in the list for feature f . [sent-395, score-0.487]
26 Let kx be the rank of the highest ranked true class after presenting instance x to the system. [sent-422, score-0.345]
27 The highest ranked class is assigned to the instance, and R1 measures the proportion of instances to which the true class was assigned. [sent-433, score-0.452]
28 MRR gives a reward of 1/kx: 1 if a correct class is ranked highest, dropping to 1/2 at rank 2, and slowly going down as kx grows (the lower the rank). [sent-435, score-0.326]
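The two measures can be computed as in the following sketch (the variable names and the toy example are illustrative): R1 is the fraction of instances whose top-ranked class is a true class, and MRR averages the reward 1/kx.

```python
def r1_and_mrr(ranked_lists, true_class_sets):
    """R1: fraction of instances whose top-ranked class is a true class.
    MRR: mean of 1/k_x, where k_x is the rank of the highest ranked true
    class (contributing 0 if no true class is retrieved at all)."""
    r1 = mrr = 0.0
    n = len(ranked_lists)
    for ranking, truths in zip(ranked_lists, true_class_sets):
        k_x = next((i + 1 for i, c in enumerate(ranking) if c in truths), None)
        if k_x == 1:
            r1 += 1.0
        if k_x is not None:
            mrr += 1.0 / k_x
    return r1 / n, mrr / n

print(r1_and_mrr([["a", "b"], ["b", "a"]], [{"a"}, {"a"}]))  # (0.5, 0.75)
```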
29 Our indexing techniques are more appropriate for the problem of obtaining good rankings per instance, similar to some other multiclass ranking algorithms (e. [sent-460, score-0.327]
30 On a given instance, after the use of the index for scoring and ranking (an invocation of RankedRetrieval), if a measure of margin (to be described shortly) is not large enough, an update to the index is made. [sent-489, score-0.53]
31 The margin is the score obtained by the true class, minus the highest scoring incorrect (negative) class (either of the two scores can be zero). [sent-490, score-0.312]
32 /* The FF Algorithm */ Algorithm FeatureFocus(x, wmin, dmax, δm): 1. [sent-506, score-0.407]
33 (For each active feature f of x, i.e., vx[f] > 0): For the first dmax classes with highest connection weight to f: 1. [sent-518, score-0.576]
34 (b) /* Feature Streaming Update (allowing “leaks”) */ Algorithm FSU(x, f , wmin ) /* Single feature updating */ 1. [sent-522, score-0.472]
35 If w_f,c < wmin, then /* drop tiny weights */ w_f,c ← 0, w′_f,c ← 0. (c) Algorithm GenericWeightUpdate: for each active feature: 1. [sent-527, score-0.441]
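The FSU pseudocode fragments above can be read as the following single-feature streaming update, sketched here under stated assumptions: the feature keeps a running total and raw per-class weights, renormalizes them into approximate proportions, and zeroes ("leaks") any weight that falls below wmin. How the raw weights w′ and the normalizer are stored is an assumption of this sketch, not the authors' exact code.

```python
def fsu_update(feature_state, c, value=1.0, w_min=0.01, d_max=25):
    """FSU-style streaming update for one feature (illustrative sketch).

    feature_state = {"total": float, "raw": {class: raw weight}}.
    The normalized weight w_{f,c} = raw / total approximates P(c | f);
    weights below w_min are dropped, so a feature keeps at most about
    1/w_min classes, and at most d_max of them are retained explicitly.
    """
    feature_state["total"] += value
    raw = feature_state["raw"]
    raw[c] = raw.get(c, 0.0) + value
    total = feature_state["total"]
    # Drop tiny normalized weights ("leak"), then keep the d_max largest.
    kept = {cls: r for cls, r in raw.items() if r / total >= w_min}
    top = sorted(kept.items(), key=lambda kv: -kv[1])[:d_max]
    feature_state["raw"] = dict(top)
    return {cls: r / total for cls, r in feature_state["raw"].items()}

state = {"total": 0.0, "raw": {}}
for observed_class in ["sports", "sports", "politics", "sports"]:
    weights = fsu_update(state, observed_class)
print(weights)  # approximate proportions of the classes this feature has seen
```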
36 Ignoring other features for now, and considering efficiency constraints, to which classes should this feature connect, and with what weights? [sent-545, score-0.358]
37 In this single feature case, classes are ranked by the weight assigned to them by the feature. [sent-547, score-0.352]
38 It is not hard to verify that the best classes are the dmax classes with the highest proportions in the stream, or the highest P(c) if the distribution is fixed and known (more precisely, P(c|f), but f is fixed here), and the ranking should also be by P(c). [sent-551, score-0.694]
39 To maximize HR, when the feature can connect to at most k different classes, a k-highest-frequency set of classes should be picked, that is, choose S such that |S| = k and nc ≥ nc′ for all c ∈ S and c′ ∉ S, where nc denotes the number of times c occurs in the sequence. [sent-554, score-0.389]
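As a small worked example of this claim (the stream below is made up): with an outdegree budget of k, the feature should simply keep the k classes it has seen most often.

```python
from collections import Counter

stream = ["c1", "c2", "c1", "c3", "c1", "c2"]  # classes seen while the feature was active
k = 2
best = [c for c, _ in Counter(stream).most_common(k)]
print(best)  # ['c1', 'c2']: the k highest-frequency classes maximize the hit rate
```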
40 Since the weights are between 0 and 1 and approximate probabilities, this eases the assessment of a connection's importance: weights below wmin are dropped at the expense of some potential loss in accuracy. [sent-570, score-0.469]
41 Note that wmin effectively bounds the maximum outdegree during learning to be 1/wmin. [sent-572, score-0.777]
42 Given that FSU zeros some weights during its computation, it is instructive to look at how well it does in approximating proportions for the (sub)stream of classes that it processes for a single feature. [sent-576, score-0.313]
43 This gives us an idea of how to set the wmin parameter and what to expect. [sent-577, score-0.317]
44 To summarize, when the true probability (weight) w of interest is several multiples of wmin, with sufficient sample size, the chance of dropping it is very low (the probability quickly goes down to 0 with increasing w/wmin), and moreover, the computed weight is also close to the true conditional. [sent-579, score-0.375]
45 In case of non-Boolean feature values, similar to perceptron and Winnow updates (Rosenblatt, 1958; Littlestone, 1988), the degree of activity of the feature, vx [ f ], affects how much the connection between the feature and the true class is strengthened. [sent-600, score-0.593]
46 In particular, after w_f > 1/wmin, a new class will be immediately dropped. [sent-604, score-0.36]
47 2 Always Updating (the IND Algorithm) One method of index construction is to simply assign each edge the class conditional probabilities, P(c| f ) (the conditional probability that instance x ∈ c given that f ∈ x). [sent-617, score-0.314]
48 At this point, updates can only affect classes already connected, and updates may improve the accuracy of their assigned weights, though there is a small chance that even classes with significant weights may be eventually dropped (this has probability 1 over an infinite sequence!). [sent-628, score-0.462]
49 Mistake-Driven Updating Using a Margin (the FF Algorithm): FF adds and drops edges and modifies edge weights during learning by processing one instance at a time, and by invoking a feature updating algorithm, such as FSU. [sent-667, score-0.39]
50 Unlike IND, FF addresses feature dependencies by not updating the index on every training instance. [sent-668, score-0.352]
51 Equivalently, a feature updates its connection on only a fraction of the training instances in which it is active. [sent-669, score-0.347]
52 It can, for example, avoid over-counting the influence of features that are basically duplicates by learning relatively low connection weights for each such feature (similar to a rationale for mistake-driven updates in other learning algorithms such as the perceptron). [sent-684, score-0.357]
53 The margin on the current instance is the score of the positive class minus the score of the highest scoring negative class: δ = s_cx − s′, where s_cx ≥ 0, s′ ≥ 0, and s′ = max_{c ≠ cx} s_c. [sent-729, score-0.415]
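Tying the pieces together, here is a hedged sketch of one FF-style training step built around the margin test just described: rank with the current index, compute δ = s_cx − s′, and only if δ falls below the margin threshold δm let every active feature run its streaming update. It reuses the illustrative ranked_retrieval and fsu_update helpers sketched earlier and is not the authors' exact algorithm.

```python
def ff_train_step(v_x, c_x, index_state, delta_m=0.5, w_min=0.01, d_max=25):
    """One margin/mistake-driven FF-style step (illustrative sketch).

    index_state maps feature -> {"total": ..., "raw": {...}}; the scoring
    index is derived from each feature's normalized connection weights.
    """
    index = {f: sorted(((c, r / s["total"]) for c, r in s["raw"].items()),
                       key=lambda kv: -kv[1])
             for f, s in index_state.items() if s["total"] > 0}
    ranking = ranked_retrieval(v_x, index, d_max=d_max)
    scores = dict(ranking)
    s_true = scores.get(c_x, 0.0)
    s_neg = max((s for c, s in ranking if c != c_x), default=0.0)
    if s_true - s_neg < delta_m:            # margin too small: update the index
        for f, value in v_x.items():        # every active feature updates
            state = index_state.setdefault(f, {"total": 0.0, "raw": {}})
            fsu_update(state, c_x, value, w_min=w_min, d_max=d_max)
    return ranking
```

Because features only update when the margin test fails, near-duplicate features end up sharing credit rather than each accumulating full weight, which matches the over-counting argument made earlier for mistake-driven updates.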
54 Individual edge weights are in the [0, 1] range, and when the instances are l2 normalized, we have observed that on average top classes obtain scores in the [0, 1] range as well, irrespective of data sets or choice of margin threshold (Madani and Connor, 2007). [sent-736, score-0.51]
55 During learning, FF may be viewed as directing a stream of classes to each feature, so that each feature can compute weights for a subset of the classes that it may connect to. [sent-753, score-0.526]
56 Every active feature is updated for each true class for which its margin is below the margin threshold. [sent-760, score-0.316]
57 In our case, it is the classes whose connections to a feature may be weakened due to one or more classes being strengthened. [sent-770, score-0.453]
58 In the FSU update given in this paper, this weakening happens irrespective of whether a class was ranked high (this aspect is similar to Winnow, but again, for class weights instead of feature weights). [sent-771, score-0.375]
59 It appeared harder to us to bound the number of features a class needs, and different classes may require widely varying numbers of features for adequate performance. [sent-778, score-0.336]
60 Finally, we seek rapid categorization per instance, and constraining the indegree of classes may not guarantee that the outdegree of commonly occurring features would be small. [sent-791, score-0.48]
61 False positive classes may obtain negative connections to features they weren’t connected to (when ranked higher than the true positive). [sent-796, score-0.388]
62 Note that the use of the index for classification is one-versus-rest (or “flat” classification), but the index was not obtained by training binary classifiers. [sent-822, score-0.345]
63 For the web data set, to obtain a sufficient number of instances per class, we cut the taxonomy at depth 4, that is, we only considered the true classes up to depth 4. [sent-904, score-0.329]
64 Note that dmax of 25 means a class is retrieved as long as it is within the first 25 highest weight connections of some active feature, even if its weight is not much higher than wmin. [sent-933, score-0.855]
65 A margin threshold of 0.5 basically means updating on most instances, as index edge weights are less than 1 and thus class score differences tend not to be much higher than 1. [sent-943, score-0.49]
66 The classes with the highest proportion in training are not the classes with the highest proportion on the test set. [sent-963, score-0.451]
67 The scores of such a classifier are more suitable for ranking the instances for the corresponding class than ranking classes for each instance. [sent-1156, score-0.498]
68 Ttr (single pass): 0s vs 0s, 3s vs 2s, 9s vs 4s, [40s − 50s] vs 6s, 45m vs 27s, 2h vs 64s, 1h vs 41s. Figure 11: No constraints on dmax (maximum outdegree) nor wmin (wmin set to 0), compared to the default settings. [sent-1215, score-0.407]
69 In these experiments, we set wmin to 0 and dmax to a large number (1000). [sent-1224, score-0.407]
70 At an outdegree constraint of 3 for RCV1, the number of edges in the learned index is around 80k instead of 180k (for the default dmax = 25), and the number of classes (connections) touched per feature is under 3, instead of 15 (Figure 10). [sent-1240, score-0.688]
71 In general, it may be a better policy to use a weight threshold, greater than wmin , instead of a max outdegree constraint, for more efficient retrieval, as well as more reduction in index size, without loss in ranking accuracy. [sent-1241, score-0.754]
72 Figure 12: Accuracy (R1) after one pass against the outdegree constraint (x-axis: max outdegree allowed). [sent-1253, score-0.342]
73 However, since the outdegree is constrained for memory reasons, if we imposed a constraint that the connection weights of a feature should sum to 1, then “the” may give significant but inaccurate weights to the classes that it happens to get connected with. [sent-1335, score-0.572]
74 Thus IND is similar to Boolean FF with high margin and wmin = 0, but IND also has a post-improvement step of adjusting pind (using the training set), which we have observed can improve the test accuracy of IND significantly (in addition to reducing index size). [sent-1389, score-0.705]
75 Note that the scores that the classes receive during retrieval can increase significantly with Boolean features (compared to using the feature values in l2 normalized vector representation). [sent-1413, score-0.419]
76 (2007) uses a threshold ttol during training and updates the index only when more than ttol many false positive classes are retrieved on a training instance or when a false negative occurs. [sent-1442, score-0.675]
77 Figure: train and test accuracy versus pass on the newsgroup data set and on the web data set, for several margin thresholds (legend: train/test curves at each margin setting). [sent-1456, score-0.395]
78 Let the indegree of a class, that is, the length of the prototype vector, be the number of features that have a significant edge to the class (within the highest dmax edges for each feature). [sent-1498, score-0.519]
79 Each feature computes the choice of classes it may connect to and the connection weights. [sent-1520, score-0.328]
80 Theorem 4 The index learning problem with the objective of either maximizing accuracy (R1 ) or minimizing HR on a given set of instances, with the constraint of a constant upper bound, such as 1, on the outdegree of each feature is NP-Hard. [sent-1542, score-0.446]
81 , those features whose corresponding sets are in the cover) to c1, each with a weight of |S|, and we connect all the other features to c2 with a relatively small weight of, say, 1. [sent-1564, score-0.337]
82 This is the setting wherein a feature wants to compute the proportions of the (sufficiently frequent) classes in the stream it observes. [sent-1583, score-0.39]
83 Figure 21: The performance of FSU under different allowances wmin. [sent-1621, score-0.317]
84 In the plot of part (b), the deviation is also compared to the case of 100 classes and wmin = 0. [sent-1632, score-0.44]
85 • Setting small weights (below wmin ) to 0 (dropping edges) to save memory. [sent-1639, score-0.393]
86 Intuitively, FSU should work well as long as the proportions we are interested in sufficiently exceed the wmin threshold. [sent-1642, score-0.431]
87 The probability that a class with, say, probability p is not seen in 1/wmin trials is (1 − p)^(1/wmin), and as long as the ratio p/wmin is high (several multiples), for example p > 4wmin, this probability is relatively small. [sent-1643, score-0.711]
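As a quick numeric check of this bound (the specific numbers are only illustrative): with wmin = 0.01 and p = 4·wmin = 0.04, the chance of never seeing the class in 1/wmin = 100 trials is (1 − 0.04)^100 ≈ 0.017.

```python
w_min, p = 0.01, 0.04            # p = 4 * w_min, as in the example above
trials = int(round(1 / w_min))   # 100 trials
print((1 - p) ** trials)         # ~0.0169: a small chance of never seeing the class
```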
88 More generally, the chance of being set to 0 (dropped) for a class with occurrence probability p quickly diminishes as we increase the ratio p/wmin, and therefore the cause of inaccuracies due to finite memory (the outdegree constraint on features) is mitigated. [sent-1648, score-0.503]
89 We conducted experiments to see how much the proportion estimation by FSU deviates from true proportions and in particular compared that deviation to the deviations when FSU is not memory constrained (when wmin is set to 0). [sent-1649, score-0.431]
90 Figure 22: The performance of FSU, under different allowances wmin. [sent-1669, score-0.317]
91 Setting wmin = 0.01 yields distances comparable to FSU with wmin = 0. [sent-1675, score-0.634]
92 We then generated a stream of 1000 class observations (1000 i.i.d. draws) from such a distribution, and gave it to FSU with different values of wmin. [sent-1678, score-0.414]
93 We plot the results for FSU under different wmin constraints. [sent-1693, score-0.317]
94 We compared a number of other statistics, such as the maximum deviation from the true probability and the probability that the deviation is larger than a threshold. [sent-1699, score-0.317]
95 FSU with wmin = 0.01 performed similarly to wmin = 0 on the distributions tested. [sent-1700, score-0.317]
96 The reason, as alluded to earlier, is that those classes with proportions significantly greater than wmin have a high chance of being seen early and frequently enough in the stream and not being dropped. [sent-1701, score-0.608]
97 Thus, as long as we expect that the useful proportions are a few multiples away from the wmin we choose, FSU is expected to compute proportions that are close to those computed by FSU with wmin set to 0 (no space constraints). [sent-1702, score-0.862]
98 Further, we expected that most often the important feature connection weights that determine the true classes during ranking have fairly high weight. [sent-1703, score-0.441]
99 Finally, vector length is a factor: if there tend to exist strong feature-class connections, the influence of the weaker connections on changing the ranking will be limited, in particular when the number of active features is adequately small. [sent-1705, score-0.329]
100 In general however, some experimentation may be required to set the wmin parameter. [sent-1710, score-0.317]
wordName wordTfidf (topN-words)
[('wmin', 0.317), ('fsu', 0.294), ('ff', 0.268), ('madani', 0.265), ('ind', 0.194), ('vx', 0.172), ('adani', 0.151), ('oncepts', 0.151), ('onnor', 0.151), ('reiner', 0.151), ('index', 0.148), ('outdegree', 0.143), ('indexing', 0.141), ('classes', 0.123), ('marg', 0.115), ('proportions', 0.114), ('hr', 0.11), ('connections', 0.108), ('instances', 0.102), ('feature', 0.099), ('hen', 0.091), ('dmax', 0.09), ('ranking', 0.088), ('features', 0.085), ('perceptron', 0.083), ('highest', 0.078), ('instance', 0.077), ('ads', 0.076), ('prototype', 0.076), ('weights', 0.076), ('newsgroup', 0.073), ('reuters', 0.073), ('ranked', 0.072), ('mrr', 0.072), ('pind', 0.072), ('reward', 0.068), ('classi', 0.066), ('dumais', 0.065), ('indegree', 0.065), ('categorization', 0.064), ('web', 0.064), ('margin', 0.063), ('multiclass', 0.063), ('committee', 0.06), ('weight', 0.058), ('retrieval', 0.058), ('austen', 0.057), ('pass', 0.056), ('accuracy', 0.056), ('updating', 0.056), ('retrieved', 0.055), ('cx', 0.055), ('connection', 0.055), ('scores', 0.054), ('stream', 0.054), ('connect', 0.051), ('online', 0.051), ('connor', 0.05), ('freqbaseline', 0.05), ('ciency', 0.05), ('news', 0.05), ('training', 0.049), ('touched', 0.049), ('active', 0.048), ('ers', 0.047), ('sc', 0.047), ('votes', 0.047), ('threshold', 0.046), ('passes', 0.046), ('edge', 0.046), ('industry', 0.046), ('crammer', 0.046), ('boolean', 0.044), ('winnow', 0.044), ('jane', 0.044), ('pci', 0.043), ('rankedretrieval', 0.043), ('ttol', 0.043), ('ttr', 0.043), ('class', 0.043), ('text', 0.043), ('update', 0.042), ('perceptrons', 0.042), ('updates', 0.042), ('earning', 0.042), ('scoring', 0.041), ('rank', 0.04), ('taxonomy', 0.04), ('nc', 0.038), ('unweighted', 0.037), ('mesterharm', 0.036), ('edges', 0.036), ('kx', 0.035), ('rankings', 0.035), ('streaming', 0.035), ('inferior', 0.035), ('word', 0.034), ('trials', 0.034), ('score', 0.033), ('invoked', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000019 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
2 0.078079008 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
Author: Shivani Agarwal, Partha Niyogi
Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability
3 0.069748871 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
4 0.063121043 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
5 0.061389901 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti
Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels
6 0.061059795 23 jmlr-2009-Discriminative Learning Under Covariate Shift
7 0.05796656 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
8 0.057611194 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
9 0.056112591 38 jmlr-2009-Hash Kernels for Structured Data
10 0.055381808 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination (Special Topic on Model Selection)
11 0.052518725 67 jmlr-2009-Online Learning with Sample Path Constraints
12 0.050312657 48 jmlr-2009-Learning Nondeterministic Classifiers
13 0.049495783 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
14 0.048467975 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
15 0.047057275 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models (Machine Learning Open Source Software Paper)
16 0.046863705 13 jmlr-2009-Bounded Kernel-Based Online Learning
17 0.046843253 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
18 0.046726447 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
19 0.045690626 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
20 0.044426031 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
topicId topicWeight
[(0, 0.23), (1, -0.075), (2, 0.053), (3, -0.036), (4, 0.028), (5, -0.019), (6, 0.022), (7, 0.207), (8, 0.017), (9, -0.024), (10, -0.03), (11, -0.074), (12, 0.072), (13, -0.111), (14, -0.109), (15, -0.128), (16, 0.106), (17, -0.1), (18, -0.06), (19, -0.05), (20, -0.057), (21, 0.168), (22, -0.013), (23, -0.097), (24, -0.044), (25, -0.07), (26, 0.179), (27, -0.058), (28, -0.168), (29, 0.004), (30, -0.009), (31, 0.144), (32, -0.074), (33, 0.007), (34, -0.017), (35, -0.169), (36, 0.122), (37, -0.173), (38, 0.145), (39, -0.041), (40, 0.012), (41, 0.068), (42, 0.036), (43, -0.015), (44, 0.032), (45, -0.023), (46, 0.122), (47, -0.006), (48, 0.042), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.93709022 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
2 0.47219467 29 jmlr-2009-Estimating Labels from Label Proportions
Author: Novi Quadrianto, Alex J. Smola, Tibério S. Caetano, Quoc V. Le
Abstract: Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Keywords: unsupervised learning, Gaussian processes, classification and prediction, probabilistic models, missing variables
3 0.46791214 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
Author: Shivani Agarwal, Partha Niyogi
Abstract: The problem of ranking, in which the goal is to learn a real-valued ranking function that induces a ranking or ordering over an instance space, has recently gained much attention in machine learning. We study generalization properties of ranking algorithms using the notion of algorithmic stability; in particular, we derive generalization bounds for ranking algorithms that have good stability properties. We show that kernel-based ranking algorithms that perform regularization in a reproducing kernel Hilbert space have such stability properties, and therefore our bounds can be applied to these algorithms; this is in contrast with generalization bounds based on uniform convergence, which in many cases cannot be applied to these algorithms. Our results generalize earlier results that were derived in the special setting of bipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of the ranking problem that arises frequently in applications. Keywords: ranking, generalization bounds, algorithmic stability
4 0.43687898 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
5 0.41773003 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kinds of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
6 0.40483567 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
8 0.38174164 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
9 0.35957685 38 jmlr-2009-Hash Kernels for Structured Data
10 0.35006875 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
11 0.3309871 23 jmlr-2009-Discriminative Learning Under Covariate Shift
13 0.32768834 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
14 0.31359509 49 jmlr-2009-Learning Permutations with Exponential Weights
15 0.30361509 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
16 0.30049071 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
17 0.29647577 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
18 0.2906644 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
20 0.28252479 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
topicId topicWeight
[(8, 0.024), (11, 0.031), (19, 0.019), (21, 0.011), (26, 0.018), (38, 0.036), (47, 0.013), (52, 0.553), (55, 0.031), (58, 0.023), (66, 0.086), (68, 0.017), (90, 0.051), (96, 0.028)]
simIndex simValue paperId paperTitle
1 0.98099738 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
same-paper 2 0.95796609 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
3 0.6904335 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
Author: Babak Shahbaba, Radford Neal
Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification
4 0.67828554 48 jmlr-2009-Learning Nondeterministic Classifiers
Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde
Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kinds of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities
5 0.65643466 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
Author: Pradip Ghanty, Samrat Paul, Nikhil R. Pal
Abstract: In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture has two cascaded modules: feature extraction module and classification module. In the feature extraction module we use the multilayered perceptron (MLP) neural networks, although other tools such as radial basis function (RBF) networks can be used. In the classification module we use support vector machines (SVMs)—here also other tools such as MLP or RBF can be used. The feature extraction module has several sub-modules each of which is expected to extract features capturing the discriminating characteristics of different areas of the input space. The classification module classifies the data based on the extracted features. The resultant architecture with MLP in feature extraction module and SVM in classification module is called NEUROSVM. The NEUROSVM is tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better than both MLP and SVM. We also compare the performance of the proposed architecture with that of two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to perform better than these two ensemble methods. Further we explore the use of MLP and RBF in the classification module of the proposed architecture. The most attractive feature of NEUROSVM is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has been verified with respect to both linear and non-linear kernels. We have also demonstrated that for the feature extraction module, the full training of MLPs is not needed. Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid system, majority voting, averaging
6 0.64206147 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
7 0.62717676 38 jmlr-2009-Hash Kernels for Structured Data
8 0.61697119 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
9 0.61029518 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.58695656 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
11 0.56723046 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
12 0.5584392 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
13 0.55455804 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
14 0.54758108 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
15 0.53346241 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
16 0.52400392 25 jmlr-2009-Distributed Algorithms for Topic Models
17 0.51795876 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
18 0.50046706 15 jmlr-2009-Cautious Collective Classification
19 0.4901439 29 jmlr-2009-Estimating Labels from Label Proportions
20 0.48421335 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability