jmlr jmlr2009 jmlr2009-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Asela Gunawardana, Guy Shani
Abstract: Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm. In this paper we review the proper construction of offline experiments for deciding on the most appropriate algorithm. We discuss three important tasks of recommender systems, and classify a set of appropriate well known evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest. We also discuss other important considerations when designing offline experiments. Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
Reference: text
sentIndex sentText sentNum sentScore
1 Such systems typically provide the user with a list of recommended items they might prefer, or supply guesses of how much the user might prefer each item. [sent-16, score-1.011]
2 These systems help users to decide on appropriate items, and ease the task of finding preferred items in the collection. [sent-17, score-0.709]
3 For example, the DVD rental provider Netflix displays predicted ratings for every displayed movie in order to help the user decide which movie to rent. [sent-18, score-0.688]
4 The online book retailer Amazon provides average user ratings for displayed books, and a list of other books that are bought by users who buy a specific book. [sent-19, score-0.984]
5 Algorithmic Approaches: There are two dominant approaches for computing recommendations for the active user—the user who is currently interacting with the application and the recommender system. [sent-67, score-0.789]
6 Collaborative filtering (Breese et al., 1998) assumes that users who agreed on preferred items in the past will tend to agree in the future too. [sent-69, score-0.66]
7 Many such methods rely on a matrix of user-item ratings to predict unknown matrix entries, and thus to decide which items to recommend. [sent-70, score-0.726]
8 Then, items that were preferred by users in the neighborhood are recommended to the active user. [sent-74, score-0.722]
9 Another approach (Linden et al., 2003), known as item-based collaborative filtering, recommends items also preferred by users who prefer a particular active item to other users who also prefer that active item. [sent-76, score-1.232]
10 In collaborative filtering approaches, the system only has access to the item and user identifiers, and no additional information about items or users is used. [sent-77, score-1.147]
11 The system then learns the user preferences over features, and uses these computed preferences to recommend new items with similar features. [sent-81, score-0.811]
12 Just as users should not need to take into account the details of the underlying algorithm when using the resulting recommendations, it is inappropriate to select different evaluation metrics for different recommendation approaches. [sent-88, score-0.729]
13 Below, we categorize recommender systems into three classes, based on the recommendation task that they are designed for (McNee et al. [sent-92, score-0.737]
14 Recommending Good Items: Perhaps the most common task of recommendation engines is to recommend good items to users (Herlocker et al. [sent-102, score-1.152]
15 Such systems typically present a list of items that the user is predicted to prefer. [sent-105, score-0.774]
16 In the Amazon website, for instance, when the user is looking at an item, the system presents below a list of other items that the user may be interested in. [sent-111, score-0.988]
17 Another example can be found in Netflix—when a user adds a movie to her queue, the system displays a list of other recommended movies that the user may want to add to the queue too. [sent-112, score-0.765]
18 The simplest way is through cross-selling; by suggesting additional items to the users, we increase the probability that the user will buy more than he originally intended. [sent-134, score-0.699]
19 In a subscription service, where revenue comes from users paying a regular subscription, the goal may be to allow users to easily reach items of interest. [sent-137, score-0.858]
20 In this case, the system should suggest items such that the user reaches items of interest with minimal effort. [sent-138, score-1.16]
21 For example, in the e-commerce scenario, given two items that the system perceives as equally relevant, suggesting the item with the higher profit can further increase revenue. [sent-147, score-0.683]
22 In the subscription service, recommending items that are harder for the user to reach without the recommender system may be beneficial. [sent-149, score-1.092]
23 Another common practice of recommendation systems is to suggest recommendations that provide the most “value” to the user. [sent-150, score-0.658]
24 For example, recommending popular items can be redundant, as the user is probably already familiar with them. [sent-151, score-0.704]
25 Defining the correct utility function for a given application can be difficult (Braziunas and Boutilier, 2005), and typically system designers make simplifying assumptions about the user utility function. [sent-154, score-0.678]
26 In the e-commerce case the utility function is typically the profit resulting from recommending an item, and in the news scenario the utility can be the expected time for reading a news item, but these choices ignore the effect of the resulting recommendations on long-term profits. [sent-155, score-0.86]
27 In fact, it is possible to view many recommendation tasks, such as providing novel or serendipitous recommendations, as maximizing some utility function. [sent-158, score-0.845]
28 Furthermore, one can view a predicted high rating as a recommendation to use the item, and a predicted low rating as a recommendation to avoid the item. [sent-174, score-1.032]
29 Online Evaluation: In the recommendation and utility optimization tasks, the designer of the system wishes to influence the behavior of users. [sent-187, score-0.684]
30 Care must be exercised to ensure that there is no bias in the distributions of users, items and ratings selected. [sent-217, score-0.754]
31 For example, in cases where data from an existing system (perhaps a system without a recommender) is available, the experimenter may be tempted to pre-filter the data by excluding items or users with low counts, in order to reduce the costs of experimentation. [sent-218, score-0.772]
32 This is usually done by recording historical user data, and then hiding some of these interactions in order to simulate the knowledge of how a user will rate an item, or which recommendations a user will act upon. [sent-224, score-0.922]
33 We discuss these concerns explicitly for the case of selecting used items for hiding in the evaluation of recommendation tasks, and note that the same considerations apply when selecting ratings to hide for evaluation of ratings prediction tasks. [sent-227, score-1.531]
34 Ideally, if we have access to time-stamps for user selections, we can randomly sample test users, randomly sample a time just prior to a user action, hide all selections (of all users) after that instant, and then attempt to recommend items to that user. [sent-229, score-1.026]
35 This simulates a situation where the recommender system is “trained” as of the test time, and then makes recommendations without taking into account any new data that arrives after the test time. [sent-232, score-0.709]
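A minimal sketch of this timestamp-based hiding protocol (the `(user, item, timestamp)` tuple layout and the function name are assumptions for illustration, not the authors' code):

```python
import random

def timestamp_split(interactions, test_user, seed=0):
    """Simulate training the recommender 'as of' a sampled test time for one test user.

    `interactions` is assumed to be a list of (user, item, timestamp) tuples.
    A cut-off time is sampled just prior to one of the test user's actions;
    every interaction (by any user) from the cut-off onward is hidden from
    training, and the test user's hidden items are what must be recommended.
    """
    rng = random.Random(seed)
    user_times = sorted(t for (u, _, t) in interactions if u == test_user)
    cutoff = rng.choice(user_times)
    train = [(u, i, t) for (u, i, t) in interactions if t < cutoff]
    hidden = {i for (u, i, t) in interactions if u == test_user and t >= cutoff}
    return train, hidden
```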
36 A final alternative is to ignore time: we sample a set of test users, then sample the number n_a of items to hide for each user a, and then sample the n_a items to hide. [sent-235, score-1.214]
37 A common protocol used in many research papers is to use a fixed number of known items or a fixed number of hidden items per test user (so-called “given n” or “all but n” protocols). [sent-239, score-1.158]
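As an illustration, a sketch of the "all but n" variant (the data layout is assumed; a "given n" variant would instead keep exactly n items known per user):

```python
import random

def all_but_n_split(user_items, n, seed=0):
    """'All but n' protocol: for each test user, hide n randomly chosen items
    and keep the remaining items as the known (training) profile.

    `user_items` is assumed to map a user id to the list of items that user
    rated or selected."""
    rng = random.Random(seed)
    known, hidden = {}, {}
    for user, items in user_items.items():
        if len(items) <= n:
            continue  # skip users with too few items to hide n of them
        hid = set(rng.sample(items, n))
        known[user] = [i for i in items if i not in hid]
        hidden[user] = hid
    return known, hidden
```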
38 However, when we wish to make decisions on the algorithm that we will use in our application, we must ask ourselves whether we are truly interested in presenting recommendations for users who have rated exactly n items, or are expected to rate exactly n items more. [sent-241, score-0.944]
39 However, when recommendations or predictions of multiple items are made to the same user, it is unlikely that the resulting per-item performance metrics are independent. [sent-266, score-0.814]
40 Evaluating Tasks: An application designer that wishes to employ a recommendation system typically knows the purpose of the system, and can map it into one of the tasks defined above—recommendation, utility optimization, and ratings prediction. [sent-307, score-1.046]
41 If p_{i,j} is the predicted rating for user i over item j, v_{i,j} is the true rating, and K = {(i, j)} is the set of hidden user-item ratings, then the RMSE is defined as: RMSE = √( (1/|K|) ∑_{(i,j)∈K} (p_{i,j} − v_{i,j})² ). [sent-318, score-0.812]
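For concreteness, a small sketch of this computation (the dict-based representation of the hidden set K is an assumption):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over the set K of hidden user-item ratings.

    `predicted` and `actual` are assumed to be dicts keyed by (user, item)
    pairs, `actual` holding the held-out ratings v_ij and `predicted` the
    system's estimates p_ij for the same pairs."""
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```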
42 Compared to ratings data sets, where users typically rate only a very small number of items, making the data set extremely sparse, binary selection data sets are dense, as each item was either selected or not by the user. [sent-330, score-0.675]
43 The task is to provide, given an existing list of items that were viewed, a list of additional items that the user may want to visit. [sent-333, score-1.248]
44 In some applications, where the number of recommendations that are presented to the user is not preordained, it is therefore preferable to evaluate algorithms over a range of recommendation list lengths, rather than using a fixed length. [sent-339, score-0.926]
45 While both curves measure the proportion of preferred items that are actually recommended, precision-recall curves emphasize the proportion of recommended items that are preferred, while ROC curves emphasize the proportion of non-preferred items that end up being recommended. [sent-342, score-1.622]
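A sketch of how both curves can be traced from a single ranked list (the argument layout is assumed; precision and recall are computed at every list length, and the false-positive rate gives the ROC abscissa):

```python
def curve_points(ranked_items, preferred, catalog_size):
    """Precision, recall (TPR), and fall-out (FPR) at each recommendation list length.

    `ranked_items` is one user's recommendation list (best first), `preferred`
    is the set of hidden items the user actually preferred, and `catalog_size`
    is the number of candidate items, so non-preferred items number
    catalog_size - len(preferred)."""
    points = []
    hits = 0
    negatives = catalog_size - len(preferred)
    for k, item in enumerate(ranked_items, start=1):
        hits += item in preferred
        precision = hits / k                 # recommended items that are preferred
        recall = hits / len(preferred)       # preferred items that are recommended
        fallout = (k - hits) / negatives     # non-preferred items that are recommended
        points.append((k, precision, recall, fallout))
    return points
```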
46 The simplest is to aggregate the hidden ratings from the test set into a set of user-item pairs, generate a ranked list of user-item pairs by combining the recommendation lists for the test users, and then compute the precision-recall or ROC curve on this aggregated data. [sent-361, score-0.864]
47 This aggregation process assumes that we have a means of comparing recommendations made to different users in order to combine the recommendation lists into a single ranked list. [sent-362, score-0.886]
48 Computing ROC curves in this manner treats the recommendations of different items to each user as being independent detection or classification tasks, and the resulting curve is termed a global ROC (GROC) curve (Schein et al. [sent-363, score-1.019]
49 This method is more relevant in the “recommend all good items” sub-task, if the system provides the user with all available recommendations and the user then scans the list linearly, marking each scanned item as relevant or not. [sent-373, score-0.999]
50 The system can then compute the precision of the items scanned so far, and use the precision-recall curve to give the user an estimate of what proportion of the good items have yet to be found. [sent-374, score-1.255]
51 Optimize Utility: Estimating the utility of a list of recommendations requires a model of the way users interact with the recommendations. [sent-376, score-0.694]
52 This approach evaluates an unbounded recommendation list that potentially contains all the items in the catalog. [sent-383, score-0.827]
53 Given such a list, we assume that the user looks at items starting from the top. [sent-384, score-0.708]
54 More generally, we can plug any utility function u(a, j) that assigns a value to a user-item pair into the half-life utility score, obtaining the following formula: R_a = ∑_j u(a, j) / 2^{(idx(j)−1)/(α−1)}. [sent-388, score-0.771]
55 Now, R_a^max is the score of the recommendation list in which all the observed items are ordered by decreasing utility. [sent-389, score-0.907]
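A sketch of the score as reconstructed above (the per-user utility dict and the default half-life of 5 are assumptions; unobserved items are given zero utility):

```python
def half_life_utility(ranked_items, utility, alpha=5):
    """Half-life utility R_a of one user's ranked recommendation list.

    `ranked_items` is the list produced by the recommender (best first),
    `utility` maps an item j to u(a, j) for this user (0.0 if unobserved),
    and `alpha` is the half-life: the rank at which the position weight
    drops to one half."""
    return sum(utility.get(item, 0.0) / 2 ** ((idx - 1) / (alpha - 1))
               for idx, item in enumerate(ranked_items, start=1))

def max_half_life_utility(utility, alpha=5):
    """R_a^max: the same score when the observed items are ordered by decreasing utility."""
    best_order = sorted(utility, key=utility.get, reverse=True)
    return half_life_utility(best_order, utility, alpha)
```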
56 For example, a system may get a relatively high half-life utility score only due to items that fall outside the fixed list, while another system that selects all the items in the list correctly, and uninteresting items elsewhere, might get a lower score. [sent-400, score-1.684]
57 Another important difference is that for a small list, the order of items in the list is less important, as we can assume that the user looks at all the items in the list. [sent-402, score-1.148]
58 In that case, the half-life utility score prefers a recommender system that places interesting items closer to the head of the list, but provides an evaluation for the entire list in a single score. [sent-411, score-1.125]
59 In our experiments, as we are working with simple algorithms, we have reduced the data set to users who rated more than 100 movies, leaving us with 21,179 users, 17,415 movies, and 117 ratings per user on average. [sent-434, score-0.712]
60 This data set is even more sparse than the Netflix data set that we used, as there are more items and fewer ratings per user. [sent-442, score-0.726]
61 We can therefore expect that the BookCrossing ratings are less representative of the general population of book readers than the Netflix ratings are of the general population of DVD renters. [sent-451, score-0.789]
62 2 R ECOMMENDATION TASK One instance of the “recommend good items” task is the case where, given a set of items that the user has used (bought, viewed), we wish to recommend a set of items that are likely to be used. [sent-454, score-1.237]
63 Therefore the task is to recommend more items that the user may want to add to the basket. [sent-461, score-0.797]
64 The task is, given the news items that a user has read so far, to recommend more news items that the user will likely read. [sent-464, score-1.598]
65 Such companies may provide recommendations for items, hoping that customers following these recommendations will produce higher revenue. [sent-472, score-0.658]
66 There are 32,266 users and 23,812 items, where the average number of items bought by a user is 23. [sent-476, score-0.895]
67 Recommendation Algorithms: As the focus of this survey is on the correct evaluation of recommender systems, and not on sophisticated algorithms for computing recommendation lists, we limit ourselves to a set of very simple collaborative filtering algorithms. [sent-479, score-0.804]
68 We do this because collaborative filtering is by far the most popular recommendation approach, and because we do not believe that it is appropriate to select the evaluation metric based on the recommendation approach (e. [sent-480, score-0.925]
69 Pearson Correlation: Typically, the input for a prediction task is a data set consisting of the ratings provided by n users for m items, where v_{i,j} is the rating of user i for item j. [sent-490, score-1.004]
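A minimal sketch of Pearson-correlation-based rating prediction under these definitions (computing the means over co-rated items is one common variant; the dict-of-dicts layout is an assumption):

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users' rating dicts (item -> rating),
    computed over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[j] for j in common) / len(common)
    mean_b = sum(ratings_b[j] for j in common) / len(common)
    num = sum((ratings_a[j] - mean_a) * (ratings_b[j] - mean_b) for j in common)
    den = math.sqrt(sum((ratings_a[j] - mean_a) ** 2 for j in common) *
                    sum((ratings_b[j] - mean_b) ** 2 for j in common))
    return num / den if den else 0.0

def predict_rating(active, item, ratings):
    """Neighborhood-weighted prediction of the active user's rating for `item`.
    `ratings` maps each user to an {item: rating} dict."""
    mean_active = sum(ratings[active].values()) / len(ratings[active])
    num = den = 0.0
    for other, r in ratings.items():
        if other == active or item not in r:
            continue
        w = pearson(ratings[active], r)
        num += w * (r[item] - sum(r.values()) / len(r))
        den += abs(w)
    return mean_active + (num / den if den else 0.0)
```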
70 If our assumption is true, a system that recommends items to add to the rental queue by order of decreasing predicted rating may not do as well as a system that predicts the probability of adding a movie to the queue directly. [sent-506, score-0.739]
71 Thus, I_i is the set of items that user i has rated positively, and I_{a,i} is the set of items that both users rated positively. [sent-512, score-1.33]
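Under these definitions the cosine similarity over binary usage data reduces to simple set operations; a short sketch (set-valued user profiles are an assumption):

```python
import math

def cosine_similarity(items_a, items_i):
    """Cosine similarity between the active user a and user i over binary data.

    `items_a` and `items_i` are the sets I_a and I_i of items each user rated
    positively (or selected); their intersection is I_{a,i}."""
    if not items_a or not items_i:
        return 0.0
    return len(items_a & items_i) / math.sqrt(len(items_a) * len(items_i))
```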
72 Intuitively, if the task requires lists that optimize a utility function u(a, j), an obvious method is to order the recommendations by decreasing expected utility: E[j|a] = p̃r(j|a) · ũ(a, j), where p̃r is a conditional probability estimate and ũ is a utility estimate. [sent-533, score-0.853]
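A sketch of this ordering (the two estimator callables stand in for whatever probability and utility models the system uses; they are not specified by the text):

```python
def rank_by_expected_utility(user, candidates, prob_estimate, utility_estimate):
    """Order candidate items for `user` by decreasing expected utility
    E[j|a] = pr~(j|a) * u~(a, j).

    `prob_estimate(user, j)` and `utility_estimate(user, j)` are assumed
    callables, e.g., a selection-probability model and the item's profit."""
    return sorted(candidates,
                  key=lambda j: prob_estimate(user, j) * utility_estimate(user, j),
                  reverse=True)
```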
73 First we evaluated the algorithms in predicting ratings on the Netflix and BookCrossing data sets, where we sampled 2000 test users and a randomly chosen number of test items per test user on each data set. [sent-548, score-1.239]
74 We then evaluated the two algorithms on the recommendation task on the Belgian retailer and news click stream data sets, where we again sampled 2000 test users and a randomly chosen number of test items per test user on each data set. [sent-551, score-1.609]
75 As Figure 2 shows, in both cases the recommendation lists generated by the Cosine similarity dominate the recommendation lists generated by the Pearson correlation algorithm in terms of precision. [sent-555, score-0.882]
76 This experiment shows that an algorithm that is uniformly better at predicting ratings on ratings data sets is not necessarily better at making recommendations on usage data sets. [sent-561, score-0.874]
77 An interesting experiment would be, given both ratings and usage data over the same users and items, to see whether algorithms that generate recommendation lists by decreasing order of predicted ratings do as well in the recommendation task over the usage data. [sent-563, score-1.727]
78 Nevertheless, companies such as Amazon or Netflix collect data both on user purchases and on user ratings of items. [sent-565, score-0.744]
79 Deciding on the best recommendation engine based solely on RMSE in the prediction task may lead to worse recommendation lists in the two other cases. [sent-575, score-0.866]
80 For example, even though a retailer website wishes to maximize its profit, it may use a binary data set of items that were bought by users to generate recommendations of the type “people who bought this item also bought . [sent-580, score-1.351]
81 To evaluate the performance of such an approach, we used an item-item recommendation system to generate recommendations for the Ta-Feng data set. [sent-584, score-0.721]
82 Multiplying the probability that the user would buy an item by the profit that would result if the user bought the item yielded the required estimate of expected utility. [sent-592, score-0.931]
83 Figure 3: Comparing recommendations generated by the item-item recommender and the expected profit recommender on the Ta-Feng data set. [sent-597, score-0.873]
84 We then measured a half-life utility score where the utility of a correct recommendation was the profit from selling the correctly recommended item to the user. [sent-598, score-1.032]
85 The utility of a correct recommendation was the profit from selling that item to that user, while the half-life was 5. [sent-605, score-0.754]
86 However, the success of a recommendation system does not depend solely on the quality of the recommendation algorithm. [sent-617, score-0.837]
87 The success of the deployed system in influencing users can be measured through the change in user behavior, such as the number of recommendations that are followed, or the change in revenue. [sent-619, score-0.736]
88 When the application is centered around the recommendation system, it is important to select the user interface together with the recommendation algorithm. [sent-623, score-1.016]
89 When the recommender system is only a supporting system, the designer of the application will probably make the decision about the user interface without focusing on positioning recommendations where they have the most influence. [sent-628, score-0.924]
90 In such cases, which we believe to be very common, the developer of the recommender system is constrained by the pre-designed interface, and in many cases can therefore only decide on the best recommendation algorithm, and in some cases perhaps the length of the recommendation list. [sent-629, score-1.138]
91 This utility function can be the user utility for items (see, e. [sent-636, score-1.031]
92 Therefore, the utility of a movie for a user (from the website perspective) is not whether the user enjoys the movie, but rather whether the user will maintain her subscription if the movie is suggested to her. [sent-649, score-1.01]
93 Implicit vs. Explicit Ratings: Our classification of recommendation tasks sheds some light on another commonly discussed subject in recommender systems, namely implicit ratings (Claypool et al. [sent-655, score-1.312]
94 If the task is to recommend more items that the user may buy, given the items that the user has already bought, purchase data becomes an explicit indication. [sent-659, score-1.454]
95 In this case, using ratings that users provide over items is an implicit indication to whether the user will buy the item. [sent-660, score-1.17]
96 Another interesting contribution of that paper is a classification of recommendation engines from the user task perspective, namely, the reasons and motivations that a user has when interacting with a recommender system. [sent-673, score-1.171]
97 Finally, while their survey attempted to cover as many evaluation metrics and user task variations as possible, we focus here on the appropriate metrics for the most popular recommendation tasks only. [sent-675, score-0.965]
98 McLaughlin and Herlocker (2004) argue, as we do, that MAE is not appropriate for evaluating recommendation tasks, and that ratings are not necessarily indicative of whether a user is likely to watch a movie. [sent-691, score-0.942]
99 Some researchers have suggested taking a more holistic approach, and considering the recommendation algorithm within the complete recommendation system. [sent-693, score-0.774]
100 We review three core tasks of recommendation systems—the prediction task, the recommendation task, and the utility maximization task. [sent-700, score-1.013]
wordName wordTfidf (topN-words)
[('items', 0.44), ('recommendation', 0.387), ('recommender', 0.301), ('ratings', 0.286), ('recommendations', 0.271), ('user', 0.217), ('utility', 0.187), ('users', 0.185), ('item', 0.18), ('ecommendation', 0.111), ('metrics', 0.103), ('unawardana', 0.097), ('recommend', 0.091), ('etrics', 0.09), ('rating', 0.087), ('hani', 0.082), ('movies', 0.08), ('ix', 0.08), ('urvey', 0.076), ('retailer', 0.076), ('news', 0.072), ('valuation', 0.069), ('system', 0.063), ('collaborative', 0.062), ('recommended', 0.062), ('net', 0.062), ('pro', 0.056), ('ine', 0.055), ('bookcrossing', 0.055), ('mcnee', 0.055), ('evaluation', 0.054), ('movie', 0.054), ('bought', 0.053), ('tasks', 0.052), ('list', 0.051), ('task', 0.049), ('click', 0.048), ('rmse', 0.047), ('designer', 0.047), ('recommending', 0.047), ('cosine', 0.045), ('curves', 0.045), ('lists', 0.043), ('ltering', 0.043), ('buy', 0.042), ('laptop', 0.042), ('predicted', 0.042), ('belgian', 0.042), ('konstan', 0.041), ('books', 0.04), ('website', 0.04), ('pearson', 0.04), ('test', 0.037), ('precision', 0.036), ('herlocker', 0.035), ('rental', 0.035), ('metric', 0.035), ('preferred', 0.035), ('people', 0.034), ('online', 0.034), ('roc', 0.034), ('usage', 0.031), ('score', 0.029), ('na', 0.028), ('browsing', 0.028), ('exercised', 0.028), ('gunawardana', 0.028), ('laptops', 0.028), ('shani', 0.028), ('watch', 0.028), ('nb', 0.025), ('interface', 0.025), ('protocol', 0.024), ('rated', 0.024), ('dem', 0.024), ('hide', 0.024), ('stream', 0.024), ('retrieval', 0.024), ('decisions', 0.024), ('typically', 0.024), ('evaluating', 0.024), ('purchases', 0.024), ('revenue', 0.024), ('stars', 0.024), ('subscription', 0.024), ('websites', 0.024), ('ziegler', 0.024), ('curve', 0.023), ('va', 0.023), ('similarity', 0.022), ('queue', 0.021), ('breese', 0.021), ('experimenter', 0.021), ('protocols', 0.021), ('unsuitable', 0.021), ('advertising', 0.021), ('braziunas', 0.021), ('grandvalet', 0.021), ('idx', 0.021), ('linden', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
Author: Asela Gunawardana, Guy Shani
Abstract: Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm. In this paper we review the proper construction of offline experiments for deciding on the most appropriate algorithm. We discuss three important tasks of recommender systems, and classify a set of appropriate well known evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest. We also discuss other important considerations when designing offline experiments. Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
Author: Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk
Abstract: The collaborative filtering (CF) using known user ratings of items has proved to be effective for predicting user preferences in item selection. This thriving subfield of machine learning became popular in the late 1990s with the spread of online services that use recommender systems, such as Amazon, Yahoo! Music, and Netflix. CF approaches are usually designed to work on very large data sets. Therefore the scalability of the methods is crucial. In this work, we propose various scalable solutions that are validated against the Netflix Prize data set, currently the largest publicly available collection. First, we propose various matrix factorization (MF) based techniques. Second, a neighbor correction method for MF is outlined, which alloys the global perspective of MF and the localized property of neighbor based approaches efficiently. In the experimentation section, we first report on some implementation issues, and we suggest on how parameter optimization can be performed efficiently for MFs. We then show that the proposed scalable approaches compare favorably with existing ones in terms of prediction accuracy and/or required training time. Finally, we report on some experiments performed on MovieLens and Jester data sets. Keywords: collaborative filtering, recommender systems, matrix factorization, neighbor based correction, Netflix prize ∗. All authors also affiliated with Gravity R&D; Ltd., 1092 Budapest, Kinizsi u. 11., Hungary; info@gravityrd.com. c 2009 G´ bor Tak´ cs, Istv´ n Pil´ szy, Botty´ n N´ meth and Domonkos Tikk. a a a a a e ´ ´ ´ TAK ACS , P IL ASZY, N E METH AND T IKK
3 0.12037514 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert
Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization
Author: Troy Raeder, Nitesh V. Chawla
Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis
5 0.045690626 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
6 0.043372251 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
7 0.040296882 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
8 0.031179195 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
9 0.030819606 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
10 0.029389806 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.027355907 48 jmlr-2009-Learning Nondeterministic Classifiers
13 0.026346173 81 jmlr-2009-Robust Process Discovery with Artificial Negative Events (Special Topic on Mining and Learning with Graphs and Relations)
14 0.025712196 23 jmlr-2009-Discriminative Learning Under Covariate Shift
15 0.025187168 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
16 0.024463784 29 jmlr-2009-Estimating Labels from Label Proportions
17 0.024257492 20 jmlr-2009-DL-Learner: Learning Concepts in Description Logics
18 0.024210511 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
19 0.023998346 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
20 0.023395728 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
topicId topicWeight
[(0, 0.131), (1, -0.105), (2, 0.083), (3, -0.094), (4, -0.007), (5, -0.066), (6, 0.224), (7, 0.127), (8, 0.248), (9, -0.517), (10, -0.108), (11, -0.17), (12, -0.24), (13, 0.05), (14, 0.003), (15, -0.028), (16, -0.048), (17, 0.018), (18, 0.009), (19, 0.033), (20, -0.078), (21, -0.088), (22, 0.083), (23, -0.004), (24, -0.057), (25, 0.01), (26, -0.046), (27, 0.01), (28, -0.05), (29, -0.039), (30, 0.015), (31, -0.014), (32, 0.02), (33, -0.065), (34, 0.038), (35, -0.035), (36, -0.032), (37, 0.003), (38, -0.048), (39, 0.029), (40, 0.033), (41, 0.013), (42, -0.023), (43, 0.009), (44, 0.01), (45, -0.004), (46, 0.016), (47, 0.008), (48, -0.034), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.97702503 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
Author: Asela Gunawardana, Guy Shani
Abstract: Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm. In this paper we review the proper construction of offline experiments for deciding on the most appropriate algorithm. We discuss three important tasks of recommender systems, and classify a set of appropriate well known evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest. We also discuss other important considerations when designing offline experiments. Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
Author: Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk
Abstract: The collaborative filtering (CF) using known user ratings of items has proved to be effective for predicting user preferences in item selection. This thriving subfield of machine learning became popular in the late 1990s with the spread of online services that use recommender systems, such as Amazon, Yahoo! Music, and Netflix. CF approaches are usually designed to work on very large data sets. Therefore the scalability of the methods is crucial. In this work, we propose various scalable solutions that are validated against the Netflix Prize data set, currently the largest publicly available collection. First, we propose various matrix factorization (MF) based techniques. Second, a neighbor correction method for MF is outlined, which alloys the global perspective of MF and the localized property of neighbor based approaches efficiently. In the experimentation section, we first report on some implementation issues, and we suggest on how parameter optimization can be performed efficiently for MFs. We then show that the proposed scalable approaches compare favorably with existing ones in terms of prediction accuracy and/or required training time. Finally, we report on some experiments performed on MovieLens and Jester data sets. Keywords: collaborative filtering, recommender systems, matrix factorization, neighbor based correction, Netflix prize ∗. All authors also affiliated with Gravity R&D; Ltd., 1092 Budapest, Kinizsi u. 11., Hungary; info@gravityrd.com. c 2009 G´ bor Tak´ cs, Istv´ n Pil´ szy, Botty´ n N´ meth and Domonkos Tikk. a a a a a e ´ ´ ´ TAK ACS , P IL ASZY, N E METH AND T IKK
3 0.38108522 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert
Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization
Author: Troy Raeder, Nitesh V. Chawla
Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis
5 0.17369822 50 jmlr-2009-Learning When Concepts Abound
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
6 0.15404545 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit (Machine Learning Open Source Software Paper)
7 0.15095568 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
8 0.14535329 96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey
9 0.12981704 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
10 0.12846446 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.1155691 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
13 0.11279391 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
15 0.10915101 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
16 0.10754321 43 jmlr-2009-Java-ML: A Machine Learning Library (Machine Learning Open Source Software Paper)
17 0.1049896 23 jmlr-2009-Discriminative Learning Under Covariate Shift
18 0.10202559 15 jmlr-2009-Cautious Collective Classification
19 0.099728413 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
20 0.09940955 92 jmlr-2009-Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining
topicId topicWeight
[(8, 0.033), (11, 0.021), (19, 0.013), (21, 0.023), (26, 0.021), (38, 0.045), (47, 0.039), (52, 0.049), (55, 0.034), (58, 0.025), (66, 0.062), (68, 0.022), (73, 0.432), (90, 0.06), (96, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.77919292 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
Author: Asela Gunawardana, Guy Shani
Abstract: Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm. In this paper we review the proper construction of offline experiments for deciding on the most appropriate algorithm. We discuss three important tasks of recommender systems, and classify a set of appropriate well known evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest. We also discuss other important considerations when designing offline experiments. Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
Author: Gábor Takács, István Pilászy, Bottyán Németh, Domonkos Tikk
Abstract: The collaborative filtering (CF) using known user ratings of items has proved to be effective for predicting user preferences in item selection. This thriving subfield of machine learning became popular in the late 1990s with the spread of online services that use recommender systems, such as Amazon, Yahoo! Music, and Netflix. CF approaches are usually designed to work on very large data sets. Therefore the scalability of the methods is crucial. In this work, we propose various scalable solutions that are validated against the Netflix Prize data set, currently the largest publicly available collection. First, we propose various matrix factorization (MF) based techniques. Second, a neighbor correction method for MF is outlined, which alloys the global perspective of MF and the localized property of neighbor based approaches efficiently. In the experimentation section, we first report on some implementation issues, and we suggest on how parameter optimization can be performed efficiently for MFs. We then show that the proposed scalable approaches compare favorably with existing ones in terms of prediction accuracy and/or required training time. Finally, we report on some experiments performed on MovieLens and Jester data sets. Keywords: collaborative filtering, recommender systems, matrix factorization, neighbor based correction, Netflix prize ∗. All authors also affiliated with Gravity R&D; Ltd., 1092 Budapest, Kinizsi u. 11., Hungary; info@gravityrd.com. c 2009 G´ bor Tak´ cs, Istv´ n Pil´ szy, Botty´ n N´ meth and Domonkos Tikk. a a a a a e ´ ´ ´ TAK ACS , P IL ASZY, N E METH AND T IKK
3 0.26491785 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning
Author: Halbert White, Karim Chalak
Abstract: Judea Pearl’s Causal Model is a rich framework that provides deep insight into the nature of causal relations. As yet, however, the Pearl Causal Model (PCM) has had a lesser impact on economics or econometrics than on other disciplines. This may be due in part to the fact that the PCM is not as well suited to analyzing structures that exhibit features of central interest to economists and econometricians: optimization, equilibrium, and learning. We offer the settable systems framework as an extension of the PCM that permits causal discourse in systems embodying optimization, equilibrium, and learning. Because these are common features of physical, natural, or social systems, our framework may prove generally useful for machine learning. Important features distinguishing the settable system framework from the PCM are its countable dimensionality and the use of partitioning and partition-specific response functions to accommodate the behavior of optimizing and interacting agents and to eliminate the requirement of a unique fixed point for the system. Refinements of the PCM include the settable systems treatment of attributes, the causal role of exogenous variables, and the dual role of variables as causes and responses. A series of closely related machine learning examples and examples from game theory and machine learning with feedback demonstrates some limitations of the PCM and motivates the distinguishing features of settable systems. Keywords: equations causal models, game theory, machine learning, recursive estimation, simultaneous
4 0.26219115 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
Author: Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, Nicholas Roy
Abstract: To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation enables us to capture real-world dynamics in a sufficient manner to produce good performance. Keywords: reinforcement learning, provably efficient learning
5 0.26008308 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
6 0.25807503 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
7 0.25755244 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
8 0.25732055 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
9 0.25721881 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
10 0.25672948 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
11 0.25653332 38 jmlr-2009-Hash Kernels for Structured Data
12 0.25644073 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
13 0.25598899 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers
14 0.25525147 48 jmlr-2009-Learning Nondeterministic Classifiers
15 0.25382951 29 jmlr-2009-Estimating Labels from Label Proportions
16 0.25247994 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
17 0.25211382 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
18 0.25157338 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
19 0.25143301 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification